INVESTIGATING THE PERFORMANCE OF RANDOM FOREST AND SUPPORT VECTOR REGRESSION FOR ESTIMATION OF CLOUD-FREE NDVI USING SENTINEL-1 SAR DATA

The current study focuses on the estimation of cloud-free Normalized Difference Vegetation Index (NDVI) using Synthetic Aperture Radar (SAR) observations obtained from the Sentinel-1 (A and B) sensors. The South-West Summer Monsoon over the Indian sub-continent lasts for four months (mid-June to mid-October). During this time, optical remote sensing observations are affected by dense cloud cover. Therefore, there is a need for a methodology to estimate the state of vegetation under cloud cover. The crops considered in this study are Paddy (Rice) from Punjab and Haryana, and Cotton, Turmeric, and Banana from Andhra Pradesh, India. For each crop, we considered Sentinel-1 and Sentinel-2 observations with the same overpass day and non-cloudy pixels. We used Google Earth Engine to extract surface reflectance for Sentinel-2 and Ground Range Detected (GRD) backscatter for Sentinel-1. The Red and NIR bands of Sentinel-2 were used to estimate NDVI. Sentinel-1 based VV and VH backscatter was used to estimate the Normalized Ratio Procedure between Bands (NRPB). Regression analysis was performed using NDVI as the dependent variable, and VV, VH, NRPB, and radar incidence angle as independent variables. We compared the performance of Linear Regression with tuned Support Vector Regression (SVR) and tuned Random Forest Regression (RFR) on independent validation data. Results showed that the RFR produced the lowest RMSE for all the crops in the study. The average RMSE using the RFR was 0.08, 0.09, 0.11, and 0.10 for Rice, Cotton, Banana, and Turmeric, respectively. Similarly, we obtained R values of 0.79, 0.76, 0.69, and 0.71 for the same crops using the RFR. A model with 80 trees produced the best results for Rice and Cotton, whereas the model with 90 trees produced the best results for Banana and Turmeric. Analysis with an NDVI threshold of 0.25 showed improved R and RMSE. We found that for a grown crop canopy, SAR based NDVI estimates match the optical NDVI reasonably well. A good agreement was observed between the actual and estimated NDVI using the tuned RFR model.


INTRODUCTION AND STATE OF THE ART
Continuous regional crop mapping and monitoring is essential, especially in countries like India, to keep track of the spatio-temporal coverage of various crops. This information can be used by various stakeholders: the government for planning import-export activities, agri-input companies for supplying fertilizers and chemicals, and farmers for obtaining the status of their crops in real time (Mohite et al. (2018)). Satellite based remote sensing sensors have been used effectively over the years for continuous crop mapping and monitoring. Such methods are preferred over manual surveys because of their efficiency in terms of time, accuracy, spatial coverage, etc. Space agencies such as the Indian Space Research Organization, the National Aeronautics and Space Administration (NASA), and the European Space Agency (ESA) have launched multiple Optical (IRS, Landsat 5, 7, 8, MODIS Terra, Aqua, Sentinel 2) as well as Synthetic Aperture Radar (RISAT-1, Sentinel 1) satellites. These satellites are used extensively for crop mapping and monitoring.
Optical satellites provide rich spectral information in multiple wavelength bands, which offers advantages for various agricultural applications such as crop type identification (Mohite et al. (2018)), crop monitoring, crop loss assessment (Sawant et al. (2019)), yield estimation, etc. Various methods based on vegetation indices have been proposed in the past for agricultural applications. The Normalized Difference Vegetation Index (NDVI) is one of the most widely used vegetation indices (Rouse et al. (1974)). NDVI is derived using the Red and Near Infrared (NIR) bands of optical satellites such as Sentinel-2, Landsat-8, MODIS Terra and Aqua, etc. However, loss of information due to the presence of clouds in optical data prevents it from being used to its full extent. In India, Kharif is the main cropping season; it starts in mid-June with the onset of the Indian Summer Monsoon (ISM) and extends up to November. During this season the Indian sub-continent is mostly covered with dense clouds.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020
Numerous attempts have been made at cloud removal and cloud induced gap filling in optical data using time-series information and information available in neighboring pixels (Roerink et al. (2000); Padhee and Dutta (2019); Adam et al. (2018)). Cloud removal can be performed effectively in the presence of thin clouds, but it cannot be considered successful in the case of thick clouds. These methods are therefore of limited use in India during the Kharif season (June-October), when there is thick cloud cover over most of the season. Alternatively, a Synthetic Aperture Radar (SAR) sensor can collect continuous data in cloudy conditions as well as during day and night. Hence, synergistic use of optical and SAR sensor observations can generate a continuous NDVI time-series for vegetation monitoring. Studies have attempted to estimate NDVI using SAR observations (Capodici et al. (2013); Davidse (2015); Filgueiras et al. (2019); Mazza et al. (2018); Navarro et al. (2016); Vreugdenhil et al. (2018)). Capodici et al. (2013) showed that temporal changes of HV backscatter acquired with an off-nadir angle greater than 40 degrees correlate best with variations in the vegetation index from optical data. That study depends on historical optical and SAR observations. Frison et al. (2018) showed a strong relationship between Sentinel-1 backscatter and vegetation phenology derived from Landsat-8. Mazza et al. (2018) developed a CNN based model to derive NDVI from SAR data. Filgueiras et al. (2019) established a regression-based relationship between Sentinel-1 SAR and NDVI from Sentinel-2 to derive continuous cloudless NDVI for Soybean and Maize (Corn). That study focused on adjacent fields from a small area.
The limitations of these studies are a) the dependency on data from optical sensors for model development, b) methods limited to certain incidence angles, c) heterogeneity in the spatial and temporal resolution of the SAR and optical observations, and d) the limited geographical coverage used for model development.
The current study focuses on the estimation of cloud-free NDVI using SAR observations obtained from the Sentinel-1 sensor. The proposed method explores Linear Regression (LR), Support Vector Regression (SVR) and Random Forest Regression (RFR) for estimation of NDVI from SAR observations. The study was conducted during the Kharif season of 2019 for two regions of India.

Study Area
The analysis was performed over two Indian regions, namely Andhra Pradesh and Punjab-Haryana. These regions are situated in India's southern and northern parts, respectively. Paddy was considered from Punjab and Haryana, which together form one of the major paddy producing belts in India. Cotton, Turmeric and Banana were considered from Andhra Pradesh. Figure 1 shows the two locations where the geotagged field data was collected.

Datasets Used
In this study we used Sentinel-1 and Sentinel-2 satellite imagery and ground truth data collected during field visits.

Sentinel-2 Data and Preprocessing
ESA launched the Sentinel-2 constellation of two optical satellites (A and B), which provides Earth observations at 10, 20 and 60 meter spatial resolution with a five-day repeat period (ESA (2020b)). Sentinel-2 observations are available in 13 spectral bands: visible and NIR at 10 meters, red edge and SWIR at 20 meters, and atmospheric bands at 60 meters spatial resolution. For research purposes, the Google Earth Engine (GEE) cloud platform (Gorelick et al. (2017)) provides a time-series collection of Sentinel-2 Level-2A orthorectified, atmospherically corrected surface reflectance data. In the present study, the Red and NIR bands were accessed from GEE to estimate NDVI. Table 1 shows the location specific availability of Sentinel-2 data overlapping (or within 1 day of) the Sentinel-1 overpass date. In each pair in Table 1, the first number gives the Sentinel-1 overpass date and the second gives the Sentinel-2 overpass date. Only pixels with no cloud cover were considered for model development. An NDVI threshold was used to obtain the cloud-free pixels.
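As an illustration of the index computed from the Red and NIR bands, the following minimal sketch evaluates the standard NDVI ratio for a single pixel. The function name and the reflectance values are assumptions made for the example; they are not taken from the study's code or data.

```python
def ndvi(red, nir):
    """Return NDVI = (NIR - Red) / (NIR + Red) for one pixel.

    Returns None when both reflectances are zero, where the index is undefined.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


# Illustrative surface reflectance values (not from the study).
example = ndvi(red=0.05, nir=0.45)
print(round(example, 2))
```

Dense green vegetation reflects strongly in NIR and weakly in Red, so the ratio approaches 1, while bare soil gives values close to 0.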
Sentinel-1 Data and Pre-processing
The Sentinel-1 satellite mission launched by ESA also has a constellation of two satellites, 1-A and 1-B (ESA (2020a)). Data is captured in dual polarization by a C-band Synthetic Aperture Radar. The satellites provide observations at 5 meters in range and 20 meters in azimuth direction with a 6-day repeat period. GEE (Gorelick et al. (2017)) has a collection of Sentinel-1 Ground Range Detected (GRD) scenes, processed using the Sentinel-1 Toolbox to generate a calibrated, ortho-corrected product. The GRD product is generated by pre-processing the scenes for thermal noise removal, radiometric calibration and terrain correction (Filipponi (2019)). Sentinel-1 C-band SAR has all-weather, day-night capability; hence all the observations available during the growing season are useful for the analysis. We accessed backscatter information in VV and VH polarization along with the local incidence angle. The Normalized Ratio Procedure between Bands (NRPB) was estimated from the VV and VH backscatter using equation 1 and used in the analysis as one of the variables.
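Equation 1 is not reproduced in this text. As an assumption, following the NRPB definition given by Filgueiras et al. (2019), which this study cites, the index would take the form below, where the symbols for VV and VH backscatter are notation introduced here. Whether the backscatter enters in linear power or in decibel units is not stated in this text.

```latex
\begin{equation}
\mathrm{NRPB} = \frac{\sigma^{0}_{VH} - \sigma^{0}_{VV}}{\sigma^{0}_{VH} + \sigma^{0}_{VV}}
\end{equation}
```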
Ground truth data from field visits
We have developed an Android mobile application, RuPS (Mohite et al. (2015)), for collecting field geo-coordinates and reporting various agricultural activities and events. For the current research, the geo-tagged locations of the fields, the crop cultivated on each field, its sowing or planting date and its estimated harvest date were collected using RuPS. Table 2 shows the number of plot boundaries collected for each crop and the total number of pixels associated with those crops.

Overall Approach
Each crop has a different season length; therefore, NDVI and SAR data were selected based on the sowing and estimated harvest dates of each crop in its region. For each crop and plot, we identified the dates on which Sentinel-1 and Sentinel-2 overpasses coincided or differed by 1 day, and only that data was considered in the analysis. Data from all other dates was ignored to avoid noise and to keep the same reference. Plots were scattered all over the region to account for regional variations in crop growth. The problem was framed as a regression analysis to establish the relationship between NDVI as the dependent variable and VV, VH, NRPB and incidence angle as independent variables.
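The date matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `max_gap_days` parameter and the overpass dates are hypothetical and do not come from the study.

```python
from datetime import date


def pair_overpasses(s1_dates, s2_dates, max_gap_days=1):
    """Pair each Sentinel-1 date with the closest Sentinel-2 date within the gap."""
    pairs = []
    for s1 in sorted(s1_dates):
        candidates = [s2 for s2 in s2_dates if abs((s2 - s1).days) <= max_gap_days]
        if candidates:
            closest = min(candidates, key=lambda s2: abs((s2 - s1).days))
            pairs.append((s1, closest))
    return pairs


# Hypothetical overpass dates, for illustration only.
s1 = [date(2019, 7, 1), date(2019, 7, 13), date(2019, 7, 25)]
s2 = [date(2019, 7, 2), date(2019, 7, 17), date(2019, 7, 25)]
pairs = pair_overpasses(s1, s2)
for s1_day, s2_day in pairs:
    print(s1_day.isoformat(), s2_day.isoformat())
```

Sentinel-1 dates with no Sentinel-2 acquisition within one day (13 July in this example) are dropped, which mirrors the choice to ignore all other dates.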

RESULTS AND DISCUSSION
To carry out the regression analysis, we extracted NDVI, VV, VH, incidence angle and NRPB for all the pixels associated with individual crops. Crop-wise models were developed for NDVI estimation. For each crop, the data was divided into 80% for model training and 20% for independent validation of the developed model. We evaluated the performance of Linear Regression (LR), Support Vector Regression (SVR) and Random Forest Regression (RFR). Models such as SVR and RFR have hyperparameters that can be tuned to obtain optimum performance. Hence we carried out 3-fold cross-validation on the training data to obtain the best parameters for SVR and RFR.
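A minimal sketch of the 80/20 split and the 3-fold cross-validation described above, written in plain Python over sample indices. The function names, the fixed seed and the interleaved fold assignment are assumptions for illustration; the text does not state how the folds were drawn.

```python
import random


def train_validation_split(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def k_fold_indices(train_indices, k=3):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out


train, validation = train_validation_split(100)
folds = list(k_fold_indices(train, k=3))
print(len(train), len(validation), [len(held) for _, held in folds])
```

The validation indices are never passed to the fold generator, so hyperparameters are chosen without touching the 20% held out for the final evaluation.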
SVR was tuned for C at 0.1, 1, 10, 100 and Sigma at 1, 0.1, 0.01, 0.001, and the kernels tried were Linear and Radial Basis Function. The model with the best parameters (out of 32 models) was determined using 3-fold cross-validation. The performance of the best model was evaluated using the 20% validation dataset. RMSE was used as the performance measure, and the model with the lowest RMSE was considered the best model. The same strategy was applied for RFR by tuning the number of trees, which was varied from 10 to 100 with an interval of 10. A total of 10 models were evaluated to find the model with the optimum number of trees. In the case of LR, we simply trained the model on a random 80% of the dataset and tested it on the remaining 20%. To avoid bias from the random selection of the dataset and from noise, we ran the LR model 10 times and averaged the RMSE.
The overall modeling was repeated considering only NDVI values greater than 0.25, to verify whether the soil background has any influence on the overall model performance. Table 5 shows the performance of the various models for the data with NDVI greater than 0.25.
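The tuning grids described in this section can be enumerated as in the sketch below, which also defines the RMSE measure used to rank candidates. The dictionary keys and helper names are assumptions for illustration. The full Cartesian product is taken because the text counts 32 SVR models, even though Sigma has no effect on a linear kernel.

```python
import itertools
import math

# Hyperparameter values as listed in the text.
svr_grid = {
    "C": [0.1, 1, 10, 100],
    "sigma": [1, 0.1, 0.01, 0.001],
    "kernel": ["linear", "rbf"],
}
svr_candidates = [
    dict(zip(svr_grid, values)) for values in itertools.product(*svr_grid.values())
]
rfr_tree_counts = list(range(10, 101, 10))


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def select_best(candidates, score):
    """Return the candidate with the lowest score (here, cross-validated RMSE)."""
    return min(candidates, key=score)


print(len(svr_candidates), len(rfr_tree_counts))
```

In practice `score` would train a model for each candidate on the cross-validation folds and return its mean RMSE; that step is left out here because it depends on the regression library used.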
We can clearly see improvements across all the models (both linear and non-linear) when considering NDVI greater than 0.25. We observed a decrease in RMSE and an improvement in R² values for all the crops using the RFR models. These results show that the soil background present during the initial crop growth period was responsible for the poor relationship between NDVI and SAR data.
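The NDVI threshold used for this second analysis amounts to a simple filter on the samples before training. A minimal sketch follows, assuming each sample is a dictionary with an `ndvi` key; that layout and the values shown are hypothetical.

```python
NDVI_THRESHOLD = 0.25  # threshold value used in the study


def filter_by_ndvi(samples, threshold=NDVI_THRESHOLD):
    """Keep samples whose optical NDVI is strictly greater than the threshold."""
    return [s for s in samples if s["ndvi"] > threshold]


# Illustrative samples (not from the study).
samples = [
    {"ndvi": 0.10, "vv": -12.0, "vh": -19.5},
    {"ndvi": 0.25, "vv": -11.2, "vh": -18.1},
    {"ndvi": 0.62, "vv": -9.8, "vh": -15.3},
]
kept = filter_by_ndvi(samples)
print(len(kept))
```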

Temporal analysis of few pixels
For continuous monitoring of vegetation, it is important to obtain a temporally continuous NDVI record. To check the temporal feasibility of the developed models, we applied the best models (Linear, SVR, RFR) to unseen fields for each crop. These fields were not used for model development or for validation. For each crop, we chose one field and plotted the time-series of NDVI estimated using the best model along with the actual time-series of median NDVI for that field. Figure 3 shows the time-series pattern for Cotton, where the NDVI predicted by the RFR model closely matches the actual NDVI for almost all the dates. There was cloud cover during July, August and September, which caused a drop in the actual NDVI, but the NDVI predicted by the RFR model closely follows the actual temporal NDVI pattern. In the case of the Banana time-series (Figure 4), although the crop is present throughout the year, we plotted the time-series from July to December 2019. The Banana field was mostly affected by clouds during July-September. The NDVI predicted by the RFR model closely follows the pattern of actual NDVI wherever cloud-free NDVI values are available. Figure 5 shows the time-series pattern for Turmeric. The field was covered by clouds towards the end of August and in September. However, the RFR predicted NDVI was in good agreement with the actual NDVI, and the values predicted for cloudy dates followed the pattern of the actual NDVI. Figure 6 shows the time-series of actual and predicted NDVI for Rice. All the models followed the actual NDVI well; however, RFR followed the actual NDVI pattern most accurately.
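The per-field reference series used in these comparisons is the median NDVI of the field's pixels on each date. A minimal sketch of that aggregation is given below; the input layout (ISO date strings mapped to lists of pixel NDVI values) and the values themselves are assumptions for illustration.

```python
from statistics import median


def field_median_series(pixel_ndvi_by_date):
    """Map each date to the median NDVI of the field's pixels on that date."""
    return {
        day: median(values)
        for day, values in sorted(pixel_ndvi_by_date.items())
        if values  # skip dates with no usable pixels
    }


# Illustrative pixel values for one field (not from the study).
series = field_median_series(
    {
        "2019-08-01": [0.6, 0.7, 0.8],
        "2019-07-01": [0.2, 0.4],
        "2019-09-01": [],
    }
)
for day, value in series.items():
    print(day, round(value, 2))
```

Dates with no usable pixels are skipped rather than filled, so gaps in the optical series remain visible when it is plotted against the SAR based predictions.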

SUMMARY AND CONCLUSIONS
We have attempted to establish a relationship between NDVI derived from Sentinel-2 and Sentinel-1 based VV and VH backscatter. It was observed that the model with 90 trees produced the best results for Banana and Turmeric. Further, we considered data with NDVI greater than 0.25 and carried out a similar analysis. We observed a decrease in RMSE and an improvement in R² values for all the crops using the RFR models. RMSE decreased to 0.05, 0.06, 0.10 and 0.09 for Rice, Cotton, Banana and Turmeric, respectively. Moreover, R² increased to 0.83, 0.78, 0.71 and 0.77, respectively, for these crops. We found that the estimation of NDVI was better for high canopy density than for crops in the early stages with a visible soil background. Further, we also plotted the time-series of actual vs estimated NDVI using all the models for the various crops. NDVI predictions made by the RFR model closely matched the actual NDVI for almost all temporal instances, followed by SVR and LR.

FUTURE WORK
As a part of future work, we plan to apply the method to every cloudy pixel of the respective crop and generate cloudless NDVI images. This will allow us to compare the actual and generated NDVI images at a spatial level. In addition, we plan to collect more data on other crops cultivated during the Kharif season and develop models for those crops.