BENCHMARKING OF CONVOLUTIONAL NEURAL NETWORK APPROACHES FOR VEGETATION LAND COVER MAPPING

Satellite Image Time Series (SITS) are becoming available at high spatial, spectral and temporal resolutions across the globe thanks to the latest remote sensing sensors. These series of images can be highly valuable when exploited by classification systems to produce frequently updated and accurate land cover maps. The richness of spectral, spatial and temporal features in SITS is a promising source of data for developing better classification algorithms. However, machine learning methods such as Random Forests (RF), despite their fruitful application to SITS for producing land cover maps, are structurally unable to properly handle intertwined spatial, spectral and temporal dynamics without breaking the structure of the data. Therefore, the present work proposes a comparative study of various deep learning algorithms from the Convolutional Neural Network (CNN) family and evaluates their performance on SITS classification. They are compared to the processing chain coined iota 2, developed by the CESBIO and based on an RF model. Experiments are carried out in an operational context with sparse annotations from 290 labeled polygons. Less than 80 000 pixel time series belonging to 8 land cover classes from a year of Sentinel-2 monthly syntheses are used. Results on a test set of 131 polygons show that CNNs using 3D convolutions in space and time are more accurate than 1D temporal, stacked 2D and RF approaches. The best-performing models are the CNNs using spatio-temporal features, namely 3D-CNN, 2D-CNN and SpatioTempCNN, a two-stream model using both 1D and 3D convolutions.


Context
Land cover maps provide spatial information on the variety of different types, or classes, covering the Earth's surface. Such maps were originally produced using only the spectral features available in satellite images sensed by Earth observation systems. However, some land cover classes, despite their characteristic spectral signatures, remain difficult to classify in the absence of spatial and temporal information. In order to make these maps timely, accurate, robust and reliable, automatic methods need to better handle multidimensional data spanning the spectral, spatial and temporal domains.
The present paper will benchmark various CNN models to produce land cover maps from SITS, defined as a sequence of satellite images of the same scene taken at subsequent times. As the Earth's surface is rapidly changing due to natural and socioeconomic factors, land cover maps are an essential tool for mapping and monitoring its biophysical cover. They are highly valuable in many applications such as urbanization, natural resources management, and during extreme events accentuated by climate change such as drought, flooding, wildfires or biomass changes. Indeed, SITS can provide detailed information on the status and evolution dynamics of different land cover classes and hence make it possible to leverage class-specific spectro-temporal profiles to improve the classification. However, the majority of land cover maps still rely only on spectral information or, as in recent studies, on spectral and spatial information. Consequently, the use of temporal dependencies has been poorly investigated, as explained in (Gómez et al., 2016) and (Gbodjo et al., 2020).

Related work
Due to the recent availability of SITS and their increasing spatial and temporal resolutions, an array of new methods to better handle multidimensional data is taking shape in the literature. Among them, CNNs are the most widely used and frequently beat state-of-the-art machine learning approaches such as RF.
The present work will restrict its investigation to the benchmark of CNNs on SITS land cover classification, even though other deep learning architectures have recently proven successful, such as recurrent-based models in (Rußwurm and Körner, 2018) with convolutional LSTM cells, or attention-based models as in (Rußwurm and Körner, 2019). We focused on CNNs for their (i) relative ease of training compared to recurrent models, (ii) ease of deployment in an operational production framework, and (iii) ability to efficiently blend spatial and temporal information in convolution kernels. Moreover, recent experiments in (Garnot et al., 2019) on the respective performance of recurrent, convolutional and hybrid models show that the best results are reached when up to 90 % of the models' parameters are allocated to modelling the temporal dimension of the data, suggesting that simple convolutional architectures are well-suited and probably sufficient for SITS classification.
One important aspect of this work is that the reference data is only sparsely labeled, meaning that only a small subset of pixels is labeled in each training instance, leading to multiple issues with patch-based approaches such as spatial CNNs. Since only a fraction of the pixels of the training images is labeled, much of the geometric information is simply not present in the data in the first place. Pixels at class borders and spatial arrangements between classes are rarely annotated and, in our case, where labeled pixels are localised using small polygons within larger class objects, they are totally absent from the data. Geometric degradation, such as smoothing of corners and erosion or dilation of small elements in the classified map, is a well-known drawback of CNNs. Unfortunately, this phenomenon is accentuated when using sparse annotations. Indeed, since the training data drive both the feature extraction and the classification steps, the learning of rich patterns is impossible if the data is not rich enough.
Most conventional methods that try to incorporate the temporal dynamics of the data heavily rely either on simple temporal stacking in the channel dimension or on handcrafted feature descriptors. While a straightforward stacking of time acquisitions is oblivious to the temporal structure and causality present in the first place, feature engineering is based on domain knowledge and may fail to capture the relevant part of the raw data. In the meantime, there is a strong need to leverage spatial and temporal features simultaneously to perform land cover classification, preferably jointly learnt to make the most of the feature interplay that guides the dynamics of SITS.
Time series could help handle intra-class variability across time, one of the major aspects of land cover mapping that undermines its performance. Deep learning has clearly made more advances in recent years than traditional methods. In particular, CNNs are promising candidates to address the task of SITS land cover classification.

Goal
The present paper proposes a benchmark of different CNN-based approaches and an RF classifier on SITS land cover classification from sparsely annotated data. A special focus is put on the handling of spectral, spatial and temporal information. The benefit of the proposed methods will be assessed by comparing evaluation metrics on a separate test set. This goal is divided into two sub-goals.

Exploiting spatial and temporal dependencies
Most land cover classification systems rely on spectral features and lack spatial or temporal information. Indeed, while a temporal stack contains temporal statistics, it does not model the sequential nature of the data: shuffling the temporal order has no consequence on the model and its results. We aim to fill this gap, especially for land cover classes that vary over space and time and are hence prone to misclassification. We will study the ability of different CNNs to extract relevant temporal and spatial features from SITS to better classify them. Such features may help discriminate between land cover classes, especially vegetation ones, which may have similar spectral signatures at some point in time while being radically different at a later time.

Operational solution for real-world applications
Since data is scarce and costly in operational work, we aim to propose a scalable solution for real-world applications, characterized by large areas and a small amount of sparsely annotated ground truth data. Such data is indeed expensive since it is manually labeled by photo-interpreters. Additionally, the solution is expected to be computationally light and practical to deploy.

Random Forest (RF)
State-of-the-art methods extensively use machine learning to perform land cover classification. Methods such as RF classifiers are commonly found in the literature, and an RF is used herein as a baseline model. In particular, we mention the iota 2 solution developed at the CESBIO. Briefly, an RF is an ensemble of decision trees, i.e. acyclic graphs that can be used to make decisions. In each node of a tree, a given feature of the input feature vector is submitted to a binary question. In this way, one can construct trees of various depths that are used to classify a given input feature vector.
To account for the temporal domain, temporal stacking in the channel dimension is performed. Therefore, inputs to the model are time series vectors of length c · T, where c is the number of spectral bands and T the number of time acquisitions. As for any pixel-based approach without the prior addition of spatial features, this algorithm sees each pixel irrespective of its spatial neighborhood. Moreover, since the temporal ordering has no effect on the classification process, this type of classifier cannot properly leverage the temporal dynamics and causality present in the data.
Two versions of this model are proposed: a basic one denoted by iota 2 base, which only relies on spectral input features, and a spatially-aware one denoted by iota 2 ctx, developed by (Derksen et al., 2020) at the CESBIO, which is enhanced by a prior object spatial segmentation on which spatial features are computed and stacked with the spectral ones. This additional feature engineering step, which we decided to avoid in our proposed deep learning methods, introduces more complexity to the model but provides it with valuable spatial information. Both RFs are implemented on CPUs using the same set of hyperparameters: a minimum of 20 samples in each node, a maximum depth of 50, and a maximum of 100 trees. Other parameters are set to the default values described in the Orfeo ToolBox documentation (Grizonnet et al., 2017).
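To make the temporal-stacking baseline concrete, the following is a minimal sketch (not the iota 2 chain itself, which runs on Orfeo ToolBox) using scikit-learn with the hyperparameters reported above; the sample counts and random features are purely illustrative, and `min_samples_split` is only the closest scikit-learn analog of the "minimum number of samples in each node" parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

c, T = 13, 11                  # spectral features x monthly acquisitions (as in the paper)
n_pixels = 200                 # toy sample size for illustration
rng = np.random.default_rng(0)

# Each pixel time series (c, T) is flattened into a vector of length c * T.
# This discards the temporal ordering: shuffling the T axis before flattening
# would give the forest a statistically equivalent problem.
X = rng.random((n_pixels, c, T)).reshape(n_pixels, c * T)
y = rng.integers(0, 8, size=n_pixels)   # 8 land cover classes

rf = RandomForestClassifier(
    n_estimators=100,          # maximum number of trees
    max_depth=50,              # maximum depth
    min_samples_split=20,      # approximates "minimum samples in each node"
    n_jobs=-1,
)
rf.fit(X, y)
pred = rf.predict(X)
```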

Deep learning
Deep learning models are increasingly used to perform land cover classification. In particular, given the filtering nature of convolution kernels, CNNs can be applied to extract relevant spectral, spatial and temporal features from data. This ability to handle multidimensional data makes them promising candidates to produce more accurate land cover maps from SITS.

1D CNN
1D CNNs have been used along the spectral (Hu et al., 2015) and temporal (Pelletier et al., 2019a) dimensions. Given the filtering properties of convolution kernels, they are very appropriate for handling temporal information. For instance, 1D convolutions are typically applied to sequential data such as sentences in natural language processing, audio signals, and more generally to any structured one-dimensional signal like time series. Figure 1 shows an example of a 1D convolution filter sliding through a pixel time series of a particular spectral band. CNNs using 1D convolutions have been used for land cover classification of SITS, as in (Pelletier et al., 2019b) with a model coined TempCNN. This model performs convolutions through pixel time series. Hence, no spatial information is taken into account, but it shows competitive results due to its ability to handle temporal information.

2D CNN
In 2D CNNs, spatial convolutions are applied in both the x and y directions to extract relevant spatial features from images. More sophisticated architectures try to leverage multi-scale spatial information by downsizing input feature maps at subsequent stages in the network, as in the Fully Convolutional Network of (Maggiori et al., 2017) and the well-known U-net architecture in (Stoian et al., 2019). Yet, since our input training images have a size of 32 × 32, this approach is not conceivable, as most of the information would be lost in the early layers. To account for the temporal dimension, the use of standard 2D convolutions is not straightforward, and workarounds such as temporal stacking in the channel dimension are needed (Kussul et al., 2017). Consequently, the input data becomes of shape c · T × h × w, where c denotes the number of spectral channels and h and w denote the image height and width. Figure 3 shows an example of such a 2D convolution with temporal stacking. This solution does not fully model the temporal dynamics that exist in the data, since retraining the model with a different temporal order would statistically provide similar results. The architecture of the model is similar to that of Figure 5 except that 2D convolution kernels are used and inputs at subsequent time steps are stacked along the channel dimension. The forward pass comprises a series of convolutional layers followed by batch normalization (BatchNorm), Rectified Linear Unit (ReLU) activations and Dropout layers.
The last two layers are fully convolutional layers instead of fully connected ones, since they permit filters to remain spatially invariant and allow the input image shape to be kept. The model outputs a tensor of shape K × h × w, where K is the number of classes. Each of the K channels gives, for every pixel of the h × w image, the probability of belonging to the corresponding class.
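The stacked-2D idea can be sketched in PyTorch as follows; the layer widths and dropout rate are placeholders, not the paper's exact architecture, and only the input/output shapes follow the text (c · T input channels, K × h × w output).

```python
import torch
import torch.nn as nn

c, T, h, w, K = 13, 11, 32, 32, 8   # bands, dates, patch size, classes

# Illustrative stacked-2D CNN: time acquisitions are stacked along the
# channel axis and the head is fully convolutional (1x1 convolutions),
# so the K-channel output keeps the input spatial shape.
model = nn.Sequential(
    nn.Conv2d(c * T, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Dropout2d(0.2),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    # Fully convolutional "classifier" layers instead of fully connected ones.
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, K, kernel_size=1),
)

x = torch.randn(2, c, T, h, w)        # (batch, c, T, h, w)
x = x.reshape(2, c * T, h, w)         # temporal stacking in the channel dim
out = model(x)                        # (batch, K, h, w)
```

Note that permuting the T axis before the reshape only permutes input channels, which a retrained network can absorb: this is the sense in which the stacking is oblivious to temporal order.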

3D CNN
Since 3D convolutions make it possible to convolve along more dimensions, 3D CNNs have been tried across the spatial and spectral domains (Ben Hamida et al., 2018) and across the spatial and temporal ones for crop classification (Ji et al., 2018). Figure 4 shows an example of a 3D convolution filter applied in both the spatial and temporal dimensions. CNNs using 3D convolutions are rare in the literature. By using convolution cubes of shape i × j × k, they are able to operate on the temporal and spatial dimensions simultaneously. A 3-dimensional convolution filter uses all input channels and slides along three dimensions. Inputs to the model are 4D tensors of shape c × T × h × w. We chose convolution filters of shape 3 × 3 × 3, which means a temporal extent of 3 and a spatial window of 3 × 3 pixels. Figure 5 details the architecture of the model.
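A single spatio-temporal 3D convolution of this kind can be sketched as follows (one layer for illustration; the full architecture is in Figure 5, and the output channel count here is arbitrary).

```python
import torch
import torch.nn as nn

c, T, h, w = 13, 11, 32, 32

# nn.Conv3d expects (batch, channels, depth, height, width); here the depth
# axis is time, so a 3x3x3 kernel spans 3 dates and a 3x3 pixel window,
# using all c input channels at once.
conv = nn.Conv3d(in_channels=c, out_channels=32, kernel_size=(3, 3, 3), padding=1)

x = torch.randn(2, c, T, h, w)   # 4D tensor per sample: c x T x h x w
out = conv(x)                    # padding=1 preserves the T, h, w extents
```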

Two-stream models
Two-stream models are often used to extract two different types of features (e.g. spatial and temporal ones) by running two models in parallel and combining their respective feature vectors into a single one, as in (Benedetti et al., 2018) and (Interdonato et al., 2019). In this work, we propose a two-stream model to benefit from both patch-based and pixel-based approaches.
Patch-based methods such as spatial CNNs inevitably lose geometric precision in the classified map, since any pixel prediction takes into account its direct neighborhood, for example in a 3 × 3 window. Therefore, small spatial features can be totally erased and large ones dilated in the classification output.
To tackle this issue, and inspired by ensembling methods, a hybrid model is proposed: we combine our two best-performing models into a single one, a scheme often seen in the literature to merge temporal and spatial features, in order to balance the respective disadvantages of pixel-based and patch-based models. Figure 6 depicts this model ensembling. The output prediction is controlled by a weighting trade-off parameter set manually.
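Such a rule-based combination can be sketched as a convex blend of the two streams' class probabilities; the alpha value and the softmax placement are assumptions for illustration, not the paper's exact recipe.

```python
import torch

K, h, w = 8, 32, 32

# Per-pixel class probabilities from each stream (random stand-ins here):
# a pixel-based stream (e.g. a TempCNN-like model) and a patch-based stream
# (e.g. a 3D-CNN-like model).
probs_pixel = torch.softmax(torch.randn(1, K, h, w), dim=1)
probs_patch = torch.softmax(torch.randn(1, K, h, w), dim=1)

alpha = 0.5                                   # weighting trade-off, set manually
probs = alpha * probs_pixel + (1 - alpha) * probs_patch
pred = probs.argmax(dim=1)                    # (1, h, w) class map
```

Since each stream's probabilities sum to one per pixel, any alpha in [0, 1] keeps the blend a valid distribution; learning alpha (or a small fusion head) instead of fixing it is the "learning-based ensembling" direction mentioned in the conclusions.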
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021 XXIV ISPRS Congress (2021 edition)

BENCHMARK
This section first describes the data used in this paper and then presents the aforementioned benchmark.

Training data
Our dataset is composed of 11 dates of Sentinel-2 images of the tile 31TCJ processed at level L3A, which are monthly syntheses produced from their L2A counterparts acquired every 5 days. The time series span monthly from February to December 2018, as depicted in Table 1. The tile 31TCJ covers an area of 110 km × 110 km and is located near Toulouse, France. Each training instance consists of a pair of an image time series of shape (c × T × h × w) and a binary label mask indicating the presence or absence of labeled pixels. The label masks are created by extracting patches around each labeled polygon. Information about the ground truth data collection is provided in the next section.

Ground truth data
The labeled polygons have been collected by trained photo-interpreters on satellite imagery of the tile 31TCJ and reviewed by experts. The polygon sampling strategy, crucial for accuracy assessment, is left to the expert's knowledge of the area. Since all classes are not uniformly distributed over large regions, the strategy must account for the specificity of the terrain and the distribution of classes in order to avoid accentuating class imbalance. Table 2 describes the class nomenclature, which comprises 8 land cover classes. The best way to ensure a fair and correlation-free strategy consists in splitting the training and validation sets at the polygon level rather than the pixel level. Indeed, pixels extracted from the same polygon and found in both sets are more likely to be similar and have higher auto-correlation than pixels from separate polygons. Therefore, the dataset is split between a training and a validation set while ensuring a balanced class distribution. The training data accounts for 70 % of the polygons and the remaining 30 % are kept separate for the validation, during which the performance metrics are computed.
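A polygon-level split of this kind can be sketched with scikit-learn's GroupShuffleSplit, treating the polygon id of each pixel as its group; the polygon ids below are synthetic stand-ins for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_pixels = 1000
polygon_id = rng.integers(0, 50, size=n_pixels)   # which polygon each pixel belongs to

# 70 % of the *polygons* (not pixels) go to training; the rest to validation.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, val_idx = next(splitter.split(np.zeros((n_pixels, 1)), groups=polygon_id))

# No polygon contributes pixels to both sets, avoiding auto-correlation leakage.
assert set(polygon_id[train_idx]).isdisjoint(polygon_id[val_idx])
```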
In terms of pixel counts inside the polygons, the data is unbalanced with two minority classes: Shrubland and Non vegetated. A straightforward oversampling strategy during training is chosen to alleviate this class imbalance issue.
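One straightforward way to implement such oversampling (an illustrative sketch, not necessarily the paper's exact mechanism) is PyTorch's WeightedRandomSampler with inverse-frequency weights, so minority-class samples are drawn more often during training.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 90 + [1] * 10)    # toy imbalanced labels (9:1)
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]          # inverse-frequency weight per sample

# Sampling with replacement: in expectation, both classes are drawn
# roughly equally often per epoch despite the 9:1 imbalance.
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
drawn = torch.tensor(list(sampler))           # indices a DataLoader would use
```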

Spectral features
Pixel time series are composed of 13 spectral values (10 bands and 3 spectral indices). We used the Sentinel-2 10 m bands (B2, B3, B4, B8) and the 20 m bands (B5, B6, B7, B8A, B11, B12) resampled to 10 m. The three following spectral indices are added. The NDVI is designed for vegetation detection and is defined by NDVI = (B8 − B4)/(B8 + B4). Likewise, the NDWI, as defined in (McFeeters, 1996), is used to perceive changes in water bodies and is defined by NDWI = (B3 − B8)/(B3 + B8). Lastly, the BI is sensitive to the brightness of soils, where high soil brightness is linked with soil humidity and the presence of salts. It is defined by BI = sqrt((B4² + B3²)/2). While an increasing amount of research shows that adding handcrafted spectral indices may be useless when training deep learning models, as in (Pelletier et al., 2019b), we do not assess their usefulness in the present work.
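For reference, the three indices can be written as numpy functions over Sentinel-2 band reflectances. These follow the standard NDVI and McFeeters NDWI formulas; note that the BI convention shown (red/green quadratic mean, as used e.g. in Orfeo ToolBox) is an assumption, since brightness index definitions vary across toolboxes.

```python
import numpy as np

def ndvi(b8, b4):
    # Normalized Difference Vegetation Index: NIR (B8) vs red (B4).
    return (b8 - b4) / (b8 + b4)

def ndwi(b3, b8):
    # McFeeters (1996) water index: green (B3) vs NIR (B8).
    return (b3 - b8) / (b3 + b8)

def bi(b4, b3):
    # Soil brightness index (assumed red/green convention).
    return np.sqrt((b4 ** 2 + b3 ** 2) / 2.0)

# Toy reflectances for a vegetated pixel: high NIR, low red.
b3, b4, b8 = np.float64(0.1), np.float64(0.05), np.float64(0.4)
v = ndvi(b8, b4)    # high for vegetation
w = ndwi(b3, b8)    # negative here, as expected for non-water
```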

Spatial context and temporal dynamics
Our hypothesis states that the extraction of combined spectral, temporal and spatial features is a key factor when analyzing SITS. While state-of-the-art approaches focus on one or two of these dimensions, few have investigated exploiting all of them at once in an end-to-end fashion. Indeed, spectral information alone is sometimes insufficient to identify certain land cover classes that look similar at a particular time.
Spatial context
Pixel-based approaches suffer from this lack of information since they consider each pixel irrespective of its spatial context. Figure 7 shows examples of time series for the three spectral indices. One can quickly notice general trends that correspond to what we should expect. For instance, the NDWI curve, which is used to detect water bodies, clearly separates the Water class from the others (Figure 7, middle).

Temporal dynamics
On the contrary, the BI shows important variations for some classes depending on the month of the year (Figure 7, right).
Additionally, some classes can look similar in terms of spectral signatures at a given moment in time while being totally different at a later time. Figure 8 shows that particular spectral features patterns are very characteristic for certain classes at different times.
As expected, vegetation classes such as Woodland needleleaved trees, Permanent herbaceous land and Shrubland show a high normalized NDVI (green starry curve). In particular, the infrared bands (B6, B7, B8, B8A) clearly show the expected variations along the year for the Periodically herbaceous land class: a steady increase during winter and mid-spring followed by a rapid decrease in summer when the harvest season comes. This observation is in accordance with vegetation phenology. We can notice that most of the Built-up spectral features are close and follow similar trends, except for the NDWI. This particular pattern, which could make this class more difficult to discriminate, is related to its inherent heterogeneity. Indeed, this class often contains different elements such as buildings, trees, grassland and roads.
In order to enrich classification systems, there is a strong need to incorporate both the spatial and temporal information that may help discriminate between classes. Consequently, our approaches will focus on adding these valuable insights to the classification process. We believe that intertwined feature modeling has a high potential for leveraging relevant information to improve land cover classification systems.

Methodology
A comparative research methodology using a benchmark is adopted in this paper. It is facilitated by a generic training and validation of deep learning models that use sparsely annotated data from labeled polygons. This framework and all proposed models are coded in Python with the deep learning library PyTorch (Paszke et al., 2017).

Training
As with all deep learning algorithms, training occurs over multiple repetitions, or epochs, of some optimization procedure. The training session stops either when the maximum number of epochs is reached or when the optimization has converged according to some stopping criterion. The latter option is chosen in this project, using an early stopping regularization mechanism that monitors the F1-score macro-averaged over all classes on the validation set, with a patience parameter of 5 epochs.
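The early stopping mechanism described above can be sketched as follows (a minimal illustration; variable names and the toy score trace are not from the paper).

```python
class EarlyStopping:
    """Stop training when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, macro_f1):
        """Record one epoch's validation macro F1; return True when training should stop."""
        if macro_f1 > self.best:
            self.best = macro_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=5)
# Toy validation macro-F1 trace: improves for 3 epochs, then plateaus.
scores = [0.60, 0.65, 0.66, 0.64, 0.64, 0.63, 0.65, 0.66, 0.66]
stopped_at = next(i for i, s in enumerate(scores) if stopper.step(s))
```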

Validation
After each training epoch, the model's performance is measured on the validation set using an array of evaluation metrics such as the OA, per-class F1-scores, the F1-score macro-averaged over all classes, confusion matrices and the model loss. We used the macro-averaged F1-score across all classes as the early stopping criterion. The macro-average takes class imbalance into account by assigning the same weight to each class, irrespective of its population size. The F1-score is widely used to assess multi-class classification results and is defined as the harmonic mean of precision and recall. Since we perform 5 training sessions for each model to ensure statistical reliability, the best model among these trials is again chosen according to the evaluation metrics.
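As a small worked example of the metric, the macro-averaged F1 computes the per-class harmonic mean of precision and recall and then an unweighted mean over classes (the toy labels below are illustrative).

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Class 0: precision 2/2, recall 2/4 -> F1 = 2 * (1.0 * 0.5) / 1.5 = 2/3
# Class 1: precision 2/4, recall 2/2 -> F1 = 2 * (0.5 * 1.0) / 1.5 = 2/3
# Macro average weights both classes equally: (2/3 + 2/3) / 2 = 2/3
macro = f1_score(y_true, y_pred, average="macro")
```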
Testing
All proposed models are eventually assessed using a separate test set of 131 polygons whose repartition is shown in Table 3. The test set is labeled by an external photo-interpreter to minimize any bias in the labelling procedure between the training and test sets.

Results and Analysis
Per-class F1-scores on the validation and test sets for the benchmark models are shown in Table 4. Test results show that 3D-CNN is the best-performing model with a mean F1-score of 0.804. It is followed by 2D-CNN, SpatioTempCNN, iota 2 ctx, TempCNN and iota 2 base with F1-scores of 0.799, 0.798, 0.753, 0.750 and 0.723 respectively. Statistical reliability and efficiency results are also provided in Table 4. Training and inference times are measured using two NVIDIA Tesla V100 GPUs. This experiment shows that the performance of the models is reliable and robust, as indicated by the standard deviation values. Yet, the small size of the dataset and the class imbalance in both the training and validation sets may limit this reliability assessment.
Besides evaluation metrics, a visual inspection of the classification maps reveals interesting properties of each model. Pixel-based models such as iota 2 base and the 1D temporal TempCNN tend to produce more speckle-like noise in classification maps, as expected, since they are oblivious to spatial context. On the contrary, patch-based models like 3D-CNN produce more spatially coherent maps with homogeneously classified areas. Figure 9 shows an example of an area classified by the different models. This example illustrates one of the recurring errors we observed: RF-based models (top row) tend to wrongly classify Periodically herbaceous land (crop fields) areas as Built-up (in red) or as Non-vegetated (in grey). This is mostly due to the fact that crop fields vary greatly over time and can be spectrally similar to other classes over the year. We also observe that CNN models produce less speckle-like labelling. Roads, paths and class borders are often labeled as Built-up. As mentioned earlier, small features are often erased by the patch-based 3D-CNN, as individual pixel features are smoothed out in the kernel window. On the other hand, certain features can be dilated. To help combat these issues, one would need to enrich the dataset with denser polygons where mixed pixels at class borders are available during training.

CONCLUSIONS AND PERSPECTIVES
This paper introduced a benchmark of CNNs and compared their performance on multi-class land cover classification of SITS using sparse annotations. The proposed models are compared to an RF-based approach with and without prior spatial segmentation. A sparsely annotated ground truth dataset consisting of 290 polygons (less than 80 000 labeled time series) belonging to 8 land cover classes is used, sampled from the Sentinel-2 tile 31TCJ near Toulouse. All models are eventually evaluated on a separate test set of 131 polygons.
Results show that 3D-CNN, a spatio-temporal CNN using 3D convolutions, is the best-performing model of the benchmark, with a mean F1-score of 0.804 on the test set. On this set, all CNNs exploiting spatio-temporal features outperform the RF-based approaches. Therefore, a proper exploitation of the spatial context and temporal dynamics in satellite images appears as a powerful lever to improve land cover maps, especially for classes usually prone to misclassification, such as vegetation ones. Moreover, given the honorable performance of the 1D temporal TempCNN, which uses no spatial information, it seems that temporal dynamics alone can be decisive.
Our initial hypothesis, stating that combining the spectral, spatial and temporal features contained in SITS would improve land cover classification systems, is verified by the experiments conducted on the dataset at hand. In addition, the proposed models are light and meet the efficiency requirements of operational contexts for real-world applications. Besides, a generic training and validation framework for deep learning models has been developed for further research and development.
However, the small amount of labeled data is a serious impediment for deep learning models. In this regard, the proposed benchmark is currently being trained and evaluated on a richer and denser ground truth dataset over another area, and first results clearly show the superiority of CNN-based approaches when massive amounts of data are available. Given the overall good but heterogeneous performance of CNNs, hybrid approaches such as the two-stream SpatioTempCNN will be studied using learning-based ensembling methods instead of rule-based ones.