ON THE JOINT EXPLOITATION OF OPTICAL AND SAR SATELLITE IMAGERY FOR GRASSLAND MONITORING

Abstract. Time series of optical and Synthetic Aperture RADAR (SAR) images provide complementary knowledge about the cover and use of the Earth surface since they exhibit information of distinct physical nature. They have proved to be particularly relevant for monitoring large areas with high temporal dynamics and related to significant ecosystem services. Grasslands are such crucial surfaces, both in terms of economic and environmental issues and the automatic and frequent monitoring of their agricultural practices is required for many purposes. To address this problem, the deep-based SenDVI framework is presented. SenDVI proposes an object-based methodology to estimate NDVI values from Sentinel-1 SAR observations and contextual knowledge (weather, terrain). Values are regressed every 6 days for compliance with monitoring purposes. Very satisfactory results are obtained with this low-level multimodal fusion strategy (R2 = 0.84 on a Sentinel-2 tile). Finer analysis is however required to fully assess the relevance of each modality (Sentinel-1, Sentinel-2, weather, terrain) and feature sets and to propose the simplest conceivable framework. Results show that not all features are necessary and can be discarded while others have a mandatory contribution to the regression task. Moreover, experiments prove that accuracy can be improved by not saturating the network with non-essential information (among contextual knowledge in particular). This allows to move towards more operational solution.



Context
Accurate and regular monitoring of terrestrial biomass is today an essential need in environmental, economic, and social terms. Among vegetation, grasslands are crucial ecosystems: they are the second most represented land cover surfaces throughout the world after forests. They support many so-called ecosystem services, such as carbon storage, prevention of soil erosion, production of quality food, and preservation of biodiversity (Bengtsson et al., 2019, Buchmann et al., 2019. Remote sensing has proved to be a suitable tool to monitor in such a direction (Atzberger, 2013, Weiss et al., 2020. Initial efforts have explored the needs and challenges of regular monitoring of agricultural practices such as mowing, grazing or ploughing on grasslands (Garioud et al., 2019). Quantifying and determining the nature of these technical acts is important since they affect the quality of grassland ecosystem services. The key identified factor is the periodicity of monitoring. Many grasslands have a prominent part in production systems: they are intensively farmed. Grass regrowth can be favoured by nitrogen inputs or favourable climatic conditions. As a result, biomass values can return to their initial state after only about ten days following a technical act. Thus, reliable weekly measurements are required. Satellite constellations allowing a temporal revisit of the order of a few days, coupled with a free-access policy have emerged in the recent years. Possibilities of Earth Observation applications in monitoring vegetation status have steadily increased (Frampton et al., 2013, Mercier et al., 2020. The Sentinel satellites of the European Copernicus program allow the joint exploitation of optical and SAR data with similar high spatial * Corresponding author resolution, with global coverage, and with a temporal revisit of 5/6 days at mid-latitudes.

Problem Statement
For vegetation monitoring, e.g., grasslands, optical sensors are mostly preferred as recently enlightened in (Griffiths et al., 2020) and (Liu et al., 2020). These sensors capture the interaction of light with the chemical properties of the target, efficiently describing the vivacity and photosynthetic activity of the plants. However, optical data suffer from the recurring presence of clouds, resulting in numerous missing data, which can last from a few days to a few months. It prevents meeting the requirements of dense surveillance of grasslands (Kolecka et al., 2018). Most monitoring approaches integrating optical satellite data are based on the exploitation of derived vegetation indices (Ali et al., 2017, Pasqualotto et al., 2019, Tiscornia et al., 2019. Among these indices, NDVI is still widely used for its simplicity and ease of interpretation (Stendardi et al., 2019, Solano-Correa et al., 2020, Belgiu, Csillik, 2018. To deal with cloud coverage, most of the studies apply a gap filling step to reconstruct missing values in time-series with available cloudless images. Among traditional gap-filling methods, common practice relies on a linear (or splines) interpolation method between the cloudless date preceding and following the missing one. This approach has shown its effectiveness especially in classification tasks (Vuolo et al., 2017, Fauvel et al., 2020. Nevertheless, for a precise follow-up of the targets evolution and especially in the case of abrupt and rapid phenomena such as management practices on grasslands, these techniques do not allow for a proper handling of missing data. The integration of various optical sensors from different satellites has been proposed in (Claverie et al., 2018, Li, Guo, 2018, Griffiths et al., 2019. This strategy however leads to problems of spatial and radiometric standardization between data sources. Despite an increase in the number of images, relying on optical imagery alone only diminishes the missing data issue. Synthetic Aperture Radar (SAR) data has therefore been increasingly exploited. As an active remote sensing source, it is capable of acquiring data day and night. The radio frequency transmission power of the SAR sensors also allows them to be cloud insensitive. On the ground, SAR sensors interact with the dielectric and geometric properties of the target. As a result, these data are affected by climatic events (rain, frost, humidity), as well as by the effects of ground relief. Although monitoring of grasslands with SAR data only shows satisfactory results (Buckley, Smith, 2010, Siegmund et al., 2016, Tamm et al., 2016, Chiboub et al., 2019, most of studies are either relying on full polarimetry (giving access to the Radar Vegetation Index) or on Very High Resolution (VHR) data. Both are difficult to access and not compliant with dense monitoring requirements.

Motivation and contribution
The joint exploitation of optical and SAR data is necessary to monitor grasslands with high regular temporal sampling. These techniques mainly focuses on independent processing of optical and SAR data with a late fusion strategy. High-level fusion schemes are based on complex ad-hoc decision rules defined on restricted areas. They often rely on empirically tuned parameters. Few works propose a low-level fusion of optical and SAR data, mainly for image-to-image translations tasks (Merkle et al., 2017, He, Yokoya, 2018, Hughes et al., 2019. In a similar direction, supervised regression of optical and SAR data for vegetation parameter retrieval has been proposed in (Baghdadi et al., 2016) using Neural Networks or with Gaussian Processes in (Pipia et al., 2019), both for LAI estimation. In (Scarpa et al., 2018) Convolutional Neural Networks are adopted for NDVI S1/S2 reconstruction but focused on short time series, discarding the temporal information. In parallel, a first low-level fusion approach has been suggested in (Garioud et al., 2019) enabling the regression of S1 SAR observations to S2 NDVI values. This Reccurent Neural Network framework integrates contextual data (weather/terrain) provided from external sources. It allows the regression task to refine the satellite measurements. Nevertheless, the model relies on a number of important and sometimes difficult to access sources of information. In this paper, we target to: • Explore the relationship between optical, SAR and contextual data by proposing a deep-based network for a regression task. The analysis is two-fold: on the basis of physical understanding and computational constraints.
• Quantify, for various feature sets, for scalability and automation purposes, gains and losses in accuracy. We aim to select the required input data for satisfactory regression results.

Data
The study area is located on the South-East of France covering a Sentinel-2 tile (T31TFM).The French Land-Parcel Identification System (LPIS) is used to retrieve 23, 850 parcels of grasslands from the studied area. Agricultural season from October 2016 to October 2017 is considered. For each grassland parcel, the two types of features described in the following have been considered : Sentinel derived features consist of 53 optical Level-2A S2 images from which NDVI is calculated. The pre-processing is carried out by the MAJA chain which integrates cloud and shadow masks. 60 ascending and 60 descending orbit images from S1 are available during the agricultural season. σ0 values (VV, VH, and VV/VH polarization), and 6-day InSAR coherence (VV and VH polarization) are derived. In order to take into account incidence angles and acquisition times, the ascending and descending orbits of Sentinel-1 are independently processed. For each parcel, mean, median, and standard deviation of the optical and SAR satellite bands are computed. Additionally, for each SAR feature, slope between each date and the previous one is given for individual parcels as well as the mean slope for parcels in a spatial neighborhood (e.g., 2km) around this parcel.
Contextual information are derived from ancillary data and from acquisition metadata. 25 climatic variables (e.g. rain, temperature, frost, etc.) are extracted from the Météo France SAFRAN-ISBA dataset since weather has a strong effect on S1 signals and grasslands (Moreira et al., 2019). These variables are daily provided and computed for each parcel from the nearest weather station. 5 topographic descriptors (e.g. height, exposition, slope, etc.) are given with a 5 m resolution Digital Surface Model provided by IGN-France. Temporal context (e.g., week number, month, season) is given to each input date, as well as time shift separating Sentinel-1 and Sentinel-2 acquisitions. The LPIS agronomic type of the grassland is also provided (22 grassland classes on the area).
Because of the prerequisites of the method described below, Sentinel-1 and contextual data are all standardized by channel. With less Sentinel-2 data available, a common temporal grid between S2 and S1 data is required for the regression task. S2 data is interpolated by the nearest neighbor method on the S1 descending acquisition grid.

Method
The Sentinels-NDVI (SenDVI) network is based on Recurrent Neural Networks (RNN) and follows the preliminary works described in (Garioud et al., 2019). Refined with a two-branch approach in order to integrate contextual information more accurately, the network is assessed by its capacity to reconstruct NDVI time-series on a high temporal sampling. A deep learning approach was chosen for the task since it generally offers better results than traditional machine learning approaches such as Random Forest or Support Vector Machine (Ienco et al., 2017, Reichstein et al., 2019, Kamilaris, Prenafeta-Boldú, 2018. This method allows to alleviate the issue of selecting input data, permits to explore voluminous data and solve complex regression configurations. SenDVI aims to learn the statistical relationships among multisensor multi-variate time-series. Sentinel-1 features are considered as inputs and Sentinel-2 cloud-free NDVI values as labels. Cloud mask associated to each S2 date is used to select the cloudy dates among the time-series for each parcel. These dates are not considered for computing the loss (Mean Squared Error loss function) during the training phase since they may not be reliable labels.
The network is divided into three blocks as shown in Figure 2. The first block consists of two separate branches, one for ascending and one for descending SAR acquisitions. This block is The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition) used to encode both SAR data and contextual data with Multi-Layer Perceptrons (MLPs) composed of successive linear regression layers, normalization, non-linear activation (i.e. Rectified Linear Units layer), and dropout. The outputs of each branch are merged by element-wise multiplication (Hadamard product) in order to apply contextual measurements to SAR observations. Finally, ascending and descending branches are concatenated in a single weighted feature vector for each date corresponding to the input data of the second block. The second block performs the recurrent learning of the SARto-optical regression parameters for each date. The Recurrent Networks have the advantage of retaining a memory of past observations through gating mechanisms. This capability is particularly effective for time series and is suitable for vegetation measurements with periodicity. A bi-directional Gated Recurrent Unit (GRU) network was chosen. It exhibits similar results to other types of RNNs for fewer parameters (Ndikumana et al., 2018, Zhao et al., 2019. Lastly, the third block decodes the outputs of the second block with an in-depth funnel-shaped MLP. It results in a single value of NDVI for each date.

Validation of SenDVI framework
The input data used to train and validate SenDVI is composed of 25 Qualitative assessment can be performed through visual inspection of reconstructed time-series by looking at Figure 1. Results show very satisfactory reconstructed NDVI time-series. Our method is able to (i) reconstruct the signal despite significant temporal gaps and (ii) preserve main breakpoints, which are crucial for detecting human activity. In average for the whole dataset, only 26 images are exploitable among the 53 available.
More than half of the image set over a year is obscured by clouds and therefore do not provide information on the parcel vegetation evolution. With the help of the low-level fusion from SenDVI, even when significant gaps exists (e.g., >1 month, see left part of the signals in Figure 1), 60 reliable values are ensured every year. Our results are therefore compliant with the requirements of weekly measurements needed in an accurate monitoring of grasslands.

OPTICAL, SAR AND ANCILLARY DATA : BEHAVIOR AMONG REGRESSION TASK
The reconstructed NDVI time-series show satisfactory statistical as well as visual results. However, the acquisition and pre-processing steps for the various features can be tedious, cost-intensive, and therefore not in line with operational constraints. A first question arises about the need to process the large amount of input data. The second question is the assessment of which input and contextual information are mostly essential for this regression task. Therefore, the qualitative and quantitative analysis of the contribution of each feature is studied by an ablation study. Considering the two different types of inputs described in 2.1, each feature within the two categories will be removed one by one for training a new model. The same metrics as in 2.3 are computed allowing to compare the regression accuracy of the model without the feature of interest against the full model. The rank of the model among the configurations is also given according to the R 2 score of the configuration (the highest being rank 1). The so-called full model consists of 5 runs of 5-fold cross-validated models with the full set of data as presented in 2.3 and is our baseline. The splitting of the dataset (train, validation, test) as well as the networks hyper-parameters and architecture (e.g. encoder depth and sizes, learning rate and hidden layer size of RNN and decoder depth and sizes) are equally fixed for all the experiments.

Ablation study
We call "ablation" the relevance assessment of the various features. We do not here evaluate the potential simplification or complexity of the deep-based architecture. We consider 16 distinct configurations of feature sets. They are decomposed into 3 families (see the respective subsections): 6 context features, 5 features derived from Sentinel-1, and 5 more generic sets (called global features) will be separately removed.
Main results are shown in Figure 3. For the 16 configurations, the median R 2 value is 0.8227 and the standard deviation is 0.064. Five configurations exhibit very close scores, with a R 2 value around 0.84: the full model, the model without the features related to topography, temporal context, agronomic type, and without the VV/VH ratio band. Results shown that the best performances are not obtained by the full model, which is ranked in 4 th position. We can however notice a limited difference.
The best results are obtained by removing the information related to the temporal context. This information is in fact induced by the stable sequencing (e.g. every 6 days) of the timeseries. Somehow duplicating this information is therefore not necessary and does not provide relevant additional information. The use of a GRU rather than an Long short-term memory (LSTM) network, which has a dedicated long-term memory vector, may also explain the low contribution of this feature. Nevertheless, tests done by replacing the GRU with an LSTM do not seem to affect the importance of this feature. The model without the topographic information obtains also equivalent results, even if this information seems important at first glance. Indeed, the topographic feature provides information on the topography related to the area but also on the exposition of each parcels that could be related to the aiming specifications of the sensor. However without this information the standard deviation between the five runs is the lowest among the tested configurations. Agronomic type of the grassland parcel also does not seem to influence the results. Although 22 agronomic classes are present in the area, 61.78% of the parcels are represented by a single class and 95.9% of the total are spread over 3 classes (long rotation grassland of 6 years or more, permanent grassland of 5 years minimum, temporary grassland of 5 years or less). Moreover, these 3 classes are mainly discriminated by the inter-annual duration of the grassland cover and do not necessarily distinguish between different agronomic types. VV/VH ratio band often appears to be the most correlated with NDVI (Fieuzal et al., 2013, Veloso et al., 2017, El Hajj et al., 2019, albeit hiding this information to the network input has little impact on the regression capability. Although, the standard deviation between runs is more significant. The MLPs used for encoding in the first block of the network (Section 2.2) are certainly efficient to compute similar simple information such as a ratio of two values. Their preliminary calculation therefore appears to be negligible.
3.1.1 Contextual features. Among the contextual features, the time shift (delta) between the acquisition date of Sentinel-2 and the interpolated Sentinel-1, the climate information as well as the information from neighboring parcels are the most important ones. For these three features, their removal results in a larger standard deviation of the accuracy scores. Time-delta information is visibly more expressive for the network than temporal context (e.g. moment within the year, etc.). Climatic conditions have a direct impact on the state of the vegetation. This information therefore seems adequate to regress NDVI values accurately. Nevertheless, it is derived from the nearest meteorological station and is therefore similar for a large number of parcels. It seems certain that if such information with a finer spatial resolution was integrated with the contextual information, the results would be improved. The spatial component of contextual information (e.g. neighborhood slope) is the most significant. Indeed, knowing the average evolution of neighbouring parcels for the network makes it possible to frame the evolution of the parcel of interest within a range. The suppression of this information thus leads to a drop in the regression score to 0.8227. Since the information given is only a simple mean, more elaborate metrics on the evolution of neighboring parcels such as spatial entropy (Gao et al., 2014) or Moran's Index (Das, Ghosh, 2017) could improve the results. Figure 3. Ablation study of the SenDVI Sentinel-1 to Sentinel-2 NDVI regression task. Mean MSE, RMSE and R 2 as well as standard deviation of R 2 (red bar) for five runs of each configuration are provided. Ranks are given in terms of R 2 scores.

Satellite features.
Regarding the five Sentinel-1 features, their removal leads to a greater drop in scores as for the contextual features. This is not true for the ratio band (see above). Comparing σ0 features (which measure the proportion of signal backscattered by the target), and coherence features (which calculates the interferometric stability of the target between two dates), we assess that the sigma bands have less relevance than that the coherence information (R 2 =0.8044 and 0.7890, respectively). Since two acquisition dates are embedded with coherence, the later is particularly less affected by climatic effects. Moreover, as a proxy for change between two dates, coherence tends to introduce less variance between measurements. As a result, coherence information can be considered important in time-series analysis. Regarding both VV and VH polarization channels, the removal of VV has more impact than VH (0.7909 against 0.8138). Nevertheless, physically, the VH polarization should be more sensitive to vegetation growth since vertical structures cause depolarization of the wave and thus increased its response sensitivity.

Global features.
Finally, full families of attributes are considered for ablation. Removing one by one context features does not result in a significant loss of regression quality. Nevertheless, when removing all of the context features, the regression score drops to 0.7982 (-0.05). This result demonstrates the relevance of multimodality and ancillary data for such a regression task. Then, Sentinel-1 attributes are removed from the network (and thus also contextual data related to the neighborhood). This configuration gives the lowest regression score (0.5662). The network, fed only with the climate, topography, temporal context, and agronomic type features of the parcel, does not correctly estimate NDVI values over grasslands. An example of differences in regression on a parcel is given in Figure 4. It is apparent that without the Sentinel-1 data the timeseries shows very little dynamics, based mainly on the evolution of contextual features. The model without the contextual features seems very close to the full model, especially for the dates when the Sentinel-2 values were available. For the cloudy dates mainly at the beginning of the season, however, a distinction is present which by visual analysis reflects less the evolution of the plot. Another global family considered are the ascending and descending acquisitions of Sentinel-1. Indeed, the integration of the two orbits requires twice as much storage and processing. The ablation of all the descending acquisitions allows the network to obtain a score of 0.8166 compared to 0.7929 for the descending acquisitions. On the study area, significant relief is present on the eastern part. Many of the grasslands parcel are thus not visible by SAR acquisitions in descending orbit, explaining the difference in these results. This distinction could also be caused by the difference in time-deltas between Sentinel-2 measurements and ascending and descending acquisitions. Figure 5 illustrates the inconsistency in R 2 results among global features removed. Overall, parcels decrease in regression accuracy as more information is removed. For few parcels however the full model shows lower results. Finally, the cloud mask wasn't used for loss calculation. This suppression results in a considerable decrease in accuracy with a R 2 of 0.75. Despite intrinsic errors of the mask (omission and commission of clouds and shadows), his usefulness remains significant.

Computational and storage efficiency
Operational mapping requires network optimization and simplicity while keeping high performances (Mallet, Le Bris, 2020). The first ablation study is informative in terms of regression quality. Nevertheless, in order to reduce storage and preprocessing requirements, other configurations have been tested. All the features were first iteratively removed: the interdependence of these features must be analyzed for the purpose of proposing a satisfactory network with a minimal number of inputs.
Among the contextual information, the topography and climatic features often require the acquisition of proprietary data with The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition) Figure 4. Results of different reconstructions of NDVI time series with various feature sets for one grassland parcel. Only non-cloudy S2 values are shown. It shows that SenDVI can restore the signal despite clouds (S1 features) while keeping only significant high frequencies (context features). spatial interpolation. Topographic information do not seem to provide crucial information to the network. Digital Terrain Models with higher spatial resolution may favorably impact the accuracy of SenDVI. However, they are bound not to be available at large scales.
Topographic features were thus first removed from the network entries and the remaining contextual features were suppressed one by one. The configuration with the topographic information and the additional contextual information removed having the minimal impact on the score is retained and the operation is repeated from this configuration. We consider satisfactory a regression score close to 0.84. Results are provided in Table 2. At a certain point, regression scores are increasing. One possible explanation is that essential data are kept but with approximately same number of parameters, the network has further opportunity to explore this inputs and therefore achieves better high-level representation. Indeed, by removing the topography, time shift between acquisitions (deltas), and temporal context, the scores of R 2 reaches 0.8438. Similarly these three features had little impact on the scores obtained in the ablation study. In the same way, meteorological and agronomic information about grasslands can be neglected. This conclusion is in agreement with the ablation study above on the agronomic type but the results following the ablation of the climatic information are more surprising. Indeed, removing only the climate information scores lower than when it is removed with a set of features. Removing all of the features (Table 2) provides the best results (i.e. above R 2 =0.84) of all configurations while requiring less data. Removing σ0 ratio band VV/VH in this case makes the score drop at 0.8355. For the other features (Sentinel-1 or global ones), in all cases, the scores were significantly worse.

DISCUSSION
The network integrates numerous multi-modal data: both satellite data and data aimed at contextualizing the measurements (i.e. topographic, climatic data). It was appropriate to assess which features were most contributing to the regression. The results presented above allow to draw preliminary conclusions but are certainly not generally applicable: a single Sentinel-2 tile is not sufficient.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition) Contextual features are derived from external sources, which may be acquired or exploited differently. Specifically, topographic and climatic data can be handled in different ways. Topographic data have not been standardized by the orientation of the Sentinel-1 acquisition plan. Fluctuations found in the data are likely not to reflect actual variations from one parcel to another. Climate measurements are (i) daily averaged, and (ii) provided with a coarse spatial resolution (no dense sampling), with significant disparities depending on the location. For ascending and descending orbits, the acquisition times are not the same (beginning of the morning and end of the afternoon, respectively). This difference is therefore not taken into account in the contextualisation effort. In addition, there are a limited number of weather stations in the area, whose values are interpolated. This results in (i) a highly variable accuracy of the measurements depending on the location of the plots and (ii) in similar values for neighboring parcels.
The agronomic type is also subject to uncertainty. In addition to the fact that such classes are highly imbalanced, it includes a declarative bias due to the financial subsidies granted by the French Paying Agency and the constraints imposed on certain classes.
This analysis determines that contextual data provided a gain in regression accuracy. Counter-intuitively, however, the ablation study scores show that the most important features are all derived from Sentinel-1 data (i.e., neighborhood as most important context feature). In order to improve these results, it would therefore be advisable to extract from external sources, features more specifically tailored in relation to Sentinel-1 acquisitions and the parcels of interest. The improvement of the neighborhood feature, through a graph-based approach for example, would support a complete data-driven approach that could also surely lead to at least similar results. Several points are nevertheless conclusive: (i) a simple approach integrating only contextual data types does not provide satisfactory regression scores; (ii) InSAR coherence although requiring costly calculations is a significant information, and (iii) "the more data, the better" approach often associated with deep learning does not necessarily apply despite a dedicated solution (two branches, attention mechanisms).
Areas for improvement in the network architecture are related to the strategy proposed to integrate the contextual data into the SenDVI architecture. The common Hadamard product used here could be improved. The use of more dedicated networks, such as a Transformer (Vaswani et al., 2017, Sainte Fare Garnot et al., 2020, could improve the results. The hypothesis of significantly increasing the size of the network could also potentially provide better results. However, for most uses, and with a perspective of automation and deployment over large areas, this option is not feasible.

CONCLUSION
The SenDVI neural network based regression model proposes a multi-sensor approach allowing to obtain time series of high temporal sampling of vegetation indices that do not suffer from the traditional missing data due to clouds. Resulting highly sampled time series showed satisfactory profiles and regression scores. The ablation study made it possible to select the features needed for this purpose and to propose a strategy integrating less data and therefore less pre-processing and storage requirements.
These time series allow weekly monitoring of the evolution of the vegetation, objective identified for the detection of management practices on grasslands. The development of a change point detection framework will be based on these reconstructed time series. Since the proposed regression method is based solely on Sentinel-1 data, both an on-line and off-line method can be implemented. A multi-task learning approach based on SenDVI architecture seems to be a promising direction.