CLASSIFICATION OF TIME SERIES OF SENTINEL-2 IMAGES FOR LARGE SCALE MAPPING IN CAMEROON

Abstract. Sentinel-2 satellites provide dense image time series exhibiting high spectral, spatial and temporal resolution. These images are in particular of utter interest to map Land-Cover (LC) at large scale. LC maps can now be computed on a yearly basis at the scale of a country with efficient supervised classifiers, assuming suitable training data are available. However, the efficient exploitation of large amount of Sentinel-2 imagery still remain challenging on unexplored areas where state-of-the-art classifiers are prone to fail. This paper focuses on Land-Cover mapping over Cameroon for the purpose of updating the national topographic geodatabase. The ι2 framework is adopted and tested for the specificity of the country. Here, experiments focus on generic classes (five) which enables providing robust focusing masks for higher resolution classifications. Two strategies are compared: (i) a LC map is calculated out of a year long time series and (ii) monthly LC maps are generated and merged into a single yearly map. Satisfactory accuracy scores are obtained, allowing to provide a first step towards finer-grained map retrieval.



The need for Land-Cover (LC) mapping
Land cover (LC) is defined as the description of the biophysical material over the surface of the Earth and immediate subsurfaces and man-made structures. Understanding and monitoring LC is important especially in environmental studies. Thus, updating LC maps is essential for countries in development and implementation of environmental policies . Human activities related to serving specific societal and individual needs are the main drivers of contemporary LC dynamics .

Sentinel-2 for LC mapping
The launch of the Sentinel-2 satellites in June 2015 enabled the free access to an unprecedented amount of optical images. The large swath, the short revisit time, the high spatial resolution of about 10 m, and the spectral bands from visible to short wave infra-red have already become essential to monitor large territories for a great variety of environmental applications. Sentinel-2 (S2) satellites provide a global cover of continental surfaces every five days in 13 spectral bands. This now enables the implementation of novel LC map production systems for the delivery of up to date and accurate information with the appropriate timeliness (Whitcraft et al., 2015a, Whitcraft et al., 2015b. The enhanced temporal resolution ensures a better monitoring of land use and cover with better opportunities to obtain cloudless mosaics and to discriminate natural classes (vegetation types, crops, etc.) (Immitzer et al., 2016). The wide spectral resolution facilitates the thematic identification of LC, while the high spatial resolution allows for the identification of quite small objects or landscapes structures (built-up areas, large roads, specific textures). * Corresponding author

Needs of the National Cameroonian Mapping Agency
The national Cameroonian mapping agency aims at implementing a fully operational automatic LC map production system using both time series of Sentinel-2 images and available aerial images. So far, the topographic database is established on the basis of the visual interpretation of the latter images and field surveys. The nomenclature of the national Cameroonian reference topographic and land cover database is highly detailed (9 themes, 47 classes, and 136 sub-classes) and does not fully match the current requirements of a yearly basis with detected changes, as it now appears to be a crucial feature (Wulder et al., 2018). Therefore, it is here targeted at investigating how Sentinel-2 and aerial images can be jointly beneficial for this task since they exhibit complementary strengths.

Scope of this work
This study focuses on the first step of this exploratory workflow: the analysis of Sentinel-2 images for yearly LC map generation. It aims at assessing state-of-the-art classifiers to this specific context. Indeed, a first difficulty states in the lack of reference data at the spatial resolution of Sentinel images for the training of a supervised classifier. The proposed solution consists in a direct use of existing Very High Resolution topographic database with the existing ι 2 LC classification framework. Other challenges consist in working on areas with numerous cloudy days and significant intra-annual variability (Mertens, Lambin, 2000).

DATA AND STUDY SITE
In order to design a reliable fully operational automatic Cameroon LC map system out of Sentinel-2 time series, experiments are conducted over a study area in the Extreme-North (EN) of Cameroon.

Study site
Current experiments and analyses are limited to one specific site, located near Petté in the East-North of Cameroon, (see Figure 1). Covering 189.072 km 2 , it was selected both for its challenging conditions and the availability of reference and Copernicus data ( Figure 1). It has a short rainy season from June to September and a long dry season from October to May. The slope in the whole scene greatly varies with some hilly areas and rocks. The soil is sandy in the dunes with sandy clay at the median part levels, and essentially clayey at the shallow levels. The drainage is almost nonexistent. The essential waterways, Mayel Tchilé and Mayel Petté, are loosing their activities in September. The principal LC categories are vegetation (principally accacias, balanites, and goods of thorns), water areas, buildings and roads.

Data
2.2.1 Remote sensing data Sentinel-2 is the only image data source used in this work: 10 spectral bands from 10 to 60 m, (see Figure 2). All Levels 2A and 2B data available through the Copernicus Open Access Hub 1 over the study area for the year 2018 were used. For each month, 4-7 images were available ( Figure 2). More specifically, a focus was made on a sub-area of 10 km × 10 km.

Reference data
The proposed approach relies on existing geodatabases available over the area to build the training and reference datasets, needed for the supervised classification and the subsequent validation of the LC maps. Two data sources have been used in order to build a unique reference dataset : • The topographic LC database of the Cameroonian National Cartographic Institute (BDINC, simplified for our purpose); • OpenStreetMap data, made available through the QUICK-OSM 2 plugin of QGIS software (Herbreteau et al., 2018).
In further experiments, five generic classes are considered for the classification: Roads, Vegetation, Water, Buildings and Other. The Other class corresponds to the rest of Land-Cover objects in the study area except the four others ones. Remaining unlabelled areas correspond to LC types not included in the specifications of the reference database.

ι 2 framework
Land-cover map generation is cast as a supervised classification task as routinely adopted in the literature. The Iota2 or ι 2 supervised classification framework (Inglada et al., 2017) is adopted. Indeed, it is specifically dedicated to process Landsat or Sentinel time series and it has proven to be an efficient and highly parametrizable open-source solution. ι 2 has already demonstrated its high efficiency in performing country wide LC classification out of multi-temporal high resolution Landsat and Sentinel imagery at pixel level. It can run either on standard desktop computers or High Performance Computing clusters.
A per-pixel machine learning classification scheme is retained, on which simple post-processing steps can be plugged. Contextual classifiers or object-based approaches (Derksen et al., 2018) would have been interesting but have been discarded yet.

Random Forests
A Random Forests (RF) classifier is used (Breiman, 2001). This classifier is an aggregation of a set of decision CART trees. Indeed, to improve the performance of single CART classifier, RF classifier is based on two main principles: bagging and randomsubspace.
A fusion of the different basic classifiers is carried out by majority vote: each decision tree predicts a label for a new sample to be classified and the label finally assigned to it is the one which receives the greatest number of votes. This study uses the RF algorithm as base classifier as it has proven to be efficient for land-cover discrimination at large scales under noisy labels, both in terms of computation times and classification quality (Rodriguez-Galiano et al., 2012, Pelletier et al., 2016. Several parameters come into play during the construction of a RF classifier: the number K of trees (arbitrarily set to a large value), the number m of attributes drawn during each cutting of the node of one of the trees (usually set to the value of the square root of the number of attributes), the maximal depth max depth of each tree and the minimal number of samples min samples per node.

Process workflow
The proposed workflow can be decomposed into the next steps:   3.3.1 Data preparation This step ( Figure 3) consists in adapting the Copernicus Sentinel-2 data to fit the input format requirements of ι 2 . First, images bands are converted from .jp2 to .tif format using GDAL translate functionalities. Then, for each Sentinel-2 image, three different masks are computed in a manual way using the r.mask.rast command of the Raster(r. * ) library of GRASS in QGIS. They correspond to unusable image areas because of clouds (CLM R1*), saturation (SAT R1*) or diverse reasons (EDG R1*).
3.3.2 Feature computation After data preparation, the feature vector that will be used for the classification of each pixel has to be built. For each Sentinel-2 image, four standard features are computed out of the original bands: • Normalized Difference Vegetation Index (NDVI) (Tucker, 1979) to highlight vegetation cover; • Normalized Difference Water Index (NDWI) (Gao, 1996) to enhance water and wet areas; Selecting batchProcessing parameter in ι 2 's configuration file improves feature computation times. At the end, each pixel is characterised by the original spectral bands (limited to the 10 bands exhibiting a native 10 or 20 m GSD) and the spectral indices corresponding to the different dates of the time series. Here, for a year long time series, it amounts to 760 features: 10 spectral bands plus 4 indices on 70 dates (see Figure 2). For monthly and yearly classifications, all features are stacked into a single set.

3.3.3
Training the classifier As the proposed approach relies on a supervised RF classifier, training this classifier is an important step of the workflow. It consists in (i) sampling data to obtain a training set and (ii) building a classification model per image that will be applied in the next step (Section 3.3.4).
Training set design Five generic classes Roads, Vegetation, Water, Buildings and Other are considered in these experiments. (The Other class corresponds to the rest of Land-Cover objects in the study area except the four others ones.) Training samples are extracted for these classes from the reference dataset described in Section 2.2.2, and especially the existing Very High Resolution national topographic database. Using existing databases as training data has proved to be a suitable solution (Gressin et al., 2014, Postadjian et al., 2017, even if noisy labels may appear due to misregistration, changes and geodatabase specifications. Pixels are randomly taken as a 20% subset of each class, keeping the same ratio between initial reference and training sets. In practice, a maximum sampling of 1,051,919 pixels is performed over the area covered by the (month or year long) Sentinel-2 time series. The number of training samples selected for each class is proportional to the area covered by this class over the useful parts of the image (Figure 1). These areas are different (see Figure 4) depending on classes and so the class samples are unbalanced.
Build RF model In supervised classification the balance between class samples is important. There are many ways to manage class balancing in iota 2 , using either data augmentation to enrich the training sample set or class weighting when training the classifier. Data augmentation is used here, generating synthetic samples with jitter (strategy jitter standard factor of 10) and smote (strategy smote neighbors of 5) methods by using the minNumber samples strategy to set the minimum number of samples by class required.
RF is not very sensitive to the choice of parameters as demonstrated in (Pelletier et al., 2016). Thus, it is here applied with the standard ι 2 configuration (see fig. 2): a number K of 1,000 trees, a number m of features randomly selected at each node equals to the square root of the total number of features, a maximum depth max depth for each tree of 25 levels and a minimum number min samples of 5 samples per node have been used.

Prediction This step simply consists in classifying
Sentinel-2 data according to model obtained at previous step. Here a LC map was generated for year 2018 at a spatial resolution of 10m over the areas of interest for the five generic classes listed above. In this step, for each per epoch classification, each pixel contains a value corresponding to its class. To synthesize this information to a unique map, one can rely on the Fusion by Voting. It consists in merging classifications by letting each of them vote for its label and choosing the final label as the winner of this majority vote. Such vote can be applied to hard label classification results, but it can also exploit confidence information associated to these maps. This strategy is here applied to the twelve monthly hard label classifications (supposed to be all well coregistered).

Experiment set up
This work aims at demonstrating the ability of state-of-the-art RF classifier to classify time series of Sentinel-2 for large scale land cover mapping in the specific Cameroonian context. Several strategies are considered and compared. On one hand, a LC map is calculated out of the complete year long time series for 2018. On the other hand, the year is split into epochs (months), and one map is computed for each epoch. The quality of these twelve monthly classifications is then assessed to identify whether some periods are more prone to deliver good results. At the end, these monthly classifications can be merged into a unique map. One interest of such per month strategy compared to year-long time series analysis states in its ability to handle smaller amount of data for each classification.

Quality assessment
Results (from month or year long time series, and fusion of monthly results) are all assessed quantitatively and qualitatively.
3.5.1 Qualitative assessment A qualitative and visual assessment of obtained results is done. Indeed, it may sometimes be useful to assess the quality of the classifications locally (here in a given yellow squared area of the study area ( Figure 9). For this aim, a classification indicating the confidence of the RF classifier on the decision for each pixel of the classification can be provided. Finally, a visual evaluation in order to highlight anomalous characteristics will also be presented.

Quantitative evaluation
Classification results are compared to the reference data (section 2.2.2). Then, several classic metrics are derived from the confusion matrix : Overall Accuracy (OA), Kappa coefficient (K) and F-Score averaged (Cohen, 1960), (Fleiss, Cohen, 1973) for global assessment, as well as F-Score, producer's and user's accuracies for per class analysis.

RESULTS AND DISCUSSION
The different classifications described in Section 3 were calculated over the study area for the year 2018 and evaluated as explained in Section 3.5. A visual assessment of a yellow squared particular area displayed was also proposed to highlight some phenomena which can not be detected with the metrics. Figure 5 presents the mean F-Score and Kappa coefficients as global metrics as well as the per-class F-Score, recall and precision quality metrics for the two classification modes (monthly and yearly) and for the fusion. These metrics were calculated over one run using the same training and validation sets (of 1,051,919 pixels each). Kappa coefficients are generally very good with values above 0.8 (0.88-0.93 for monthly classifications, 0.94 for the yearly one and 0.87 the fusion). Table 4 shows the confusion matrix for the fusion of monthly classifications. It can be highlighted that overall accuracy and Kappa coefficient values are 0.91 and 0.87 respectively. One can observe that the yearly classification yields better results than monthly classifications and their fusion for these two global metrics. Most monthly classifications results are also better than the fusion one. This is slightly different when considering per class metrics. The classes Vegetation and Other obtained the best results for the all classification modes. Compared to monthly results, the fusion can slightly improve results (especially considering underdetection), mostly for the other classes, but the best results are still obtained for yearly classifications.The results are even more improved for the classes with few training samples (Water and Building). The class Vegetation is much better recognised using the two classification modes and the fusion of monthly classifications, mainly because of its higher training samples (see Figure 4). It can also be shown from Figure 5 the improvement for these minority classes (Water and Buildings) mostly states in the recall: underdetection is reduced. Tables 1, 2 and 3 show the confusion matrices for the best month (May), the baddest month (December) and the year 2018 respectively. They allow a detailed analysis of the errors for both modes. Several aspects of the improvement yielded by the yearly classification can be highlighted:

Quantitative evaluation
• No confusion between Other and Vegetation in both directions; • The confusion of Water with Vegetation is reduced in one direction; • The confusion of Water with Other is reduced in one direction from 15.61% in June 2018 to 0.67% in the year 2018. Table 4 shows the confusion matrix for the result from the fusion of the twelve monthly classifications. Several aspects of his improvement can also be highlighted: • No confusion between Other and Vegetation in both directions; • The confusion of Water with Vegetation is reduced in one direction; Figure 5. Global (F-Score averaged on classes and Kappa) and per class (F-Score, recall and precision) quality metrics for the study area for months and year 2018 and fusion.
• The confusion of Water with Other increases to 2.57% compare to the yearly classification.
It can be seen from the confusion matrices most misclassifications occur between Roads and Vegetation classes, as well as the Roads classified as Other class. This is partly caused by the fact that bare ground areas (that can be included in these classes in the reference databases) are often classified as Roads class.

Classification confidence
The RF classifier associates a confidence (unsupervised margin) calculated out of the distribution of the labels predicted by the trees (Frenay, Verleysen, 2014) of the forest. Figure 12 shows the confidence map obtained for the LC map in the yellow squared area of Figure 9 for the year 2018. One can observe that confidence is influenced by the proportion occupied by the classes samples in the reference data (see Figure 4). For example, the Vegetation class in that specific yellow zone is easily recognised in the yearly confidence classification. Besides, areas corresponding to LC not included in the reference databases generally exhibit high confidence, which is encouraging for a use of such information. It is useful to study how confidence values are related to correct and incorrect classifications. Figure 13 shows the distribution of confidence values calculated for only the learning samples categorie for December, May and yearly classifications. One can observe that all the learning samples were correctly classified (in yellow) so there was not incorrectly classified learning samples (in blue) for each of these specific classifications. However, these correctly classified learning samples have different behaviours for confidence values lower than 40% and higher than 60% and the same behaviour between 50% and 60%. We have more pixels (<200) which have confidence values lower than 40% in the December classification compare to May and the yearly classifications where we have less and less pixels having these confidence values respectively. On the opposite, we have more and more pixels having confidence values which evolve from 60% to 100% in May and yearly classifications respectively compare to the December classification where the number of pixels having these confidence values is decreasing from 1000 to 0. These results are consistent with the quantitative evaluation arguing that the yearly classification exhibits better results than monthly classifications. December also exhibits the worst accuracy among the per month results. This difficulty to classify the December month could be due to the presence of clouds and cloud shadows associated with bad landscape in our study area. Thus, introducing confidence to weight the majority vote would probably improve fusion results.

Visual analysis
Quantitative assessment is not able to reveal alone several kinds of errors present in the classifications, while a visual analysis makes it possible to identify other phenomena which are analysed in this section. Indeed, important parts of the study area are not labeled in the reference databases, and thus not included in the quantitative evaluation. For the considered legend, the study area is globally not difficult to classify using 10 m S2 images even if the small size of Buildings class makes it more difficult to classify. Figures 6, 7 and 8 illustrate the classification of the yellow squared area in the South-West of the study area for year 2018, December 2018 and May 2018 respectively. One can observe an important over-detection of the Roads class, not necessarily revealed by the metrics, as roads are mostly over-detected over unlabelled      areas corresponding to LC types not included in the reference database. December gives the baddest result per month, even though the over-detection of roads is less important. This result is consistent with the previous quantitative assessment.

CONCLUSION
This paper presented some experiments of fully automatic production of LC maps at regional scale out of Sentinel-2 images time series using the ι 2 supervised classification chain. This approach uses all available Sentinel-2 image data (no scene selection in terms of appropriate dates or cloud cover) and uses existing databases as reference training and validation data for a supervised classification process robust to errors in the reference data (e.g., out of date databases). The process is efficient (enabling a fast delivery of the classifications after the acquisition of the satellites image data) and not expensive (requiring neither field surveys for model calibration, nor human operators for decision making, and using open and freely available imagery). This work was conducted in the context of defining a joint use of airborne and free satellite images to update and enrich existing LC data. The advantage of the proposed approach resides in the regular availability of new images provided by Sentinel-2 to reduce the uncertainty in the detection of study areas where airborne acquisitions and advanced analysis are mandatory. The considered legend was quite simple, including Roads, Vegetation, Water and Buildings classes, as well as an additional Other class corresponding to all other database objects of the study area. Despite obtained accuracy is higher than most comparable (in terms of areas covered and number of classes) state-of-the-art approaches, Roads and Water classes (corresponding to thin objects) suffered from poor recognition. On the opposite, better results (90%) were obtained with the Vegetation class. It will now be beneficial to develop the same study only on this particular changing class for also fulfilling the user requirements in terms of crop type mapping and easily compare crop type maps of successive years. S2 time series is supposed to be especially relevant for this task.