SEMANTIC SEGMENTATION USING A UNET ARCHITECTURE ON SENTINEL-2 DATA

This paper presents a methodological framework, based on deep learning, for the efficient mapping of the main land cover classes (built-up, vegetation, barren land, water body) in different urban and suburban landscapes. In particular, the proposed framework integrates superpixel segmentation (an essential procedure) with deep learning. A combination of spectral bands and indices is introduced to produce optimal results, ensuring adequate discrimination between the built-up and barren land classes. A UNET architecture is implemented that learns the characteristics of the main land cover classes from the input data and can be deployed from a Colab notebook without excessive computational needs. The resulting classifications show promising accuracy values (above 90%).


INTRODUCTION
The global population has increased rapidly over the last century, triggering transformations of the Earth's surface and increasing the rate of land cover (LC) change, particularly in urban areas, where more than half of the global population lives (Addae and Oppelt 2019; Talukdar et al. 2020). Land is a limited resource and cities are continuously expanding. Urban growth can be described as the expansion of built-up areas, which implies alterations in the land cover of the natural landscape. Urban land cover and land use mapping plays an important role in urban planning and management. Quantifying urban growth is essential for evaluating its environmental, economic, and social impacts (Bhat et al. 2017; Zhang et al. 2018; Sapena and Ruiz 2019; Addae and Oppelt 2019). This can be achieved, among other ways, through thematic information extraction techniques applied to satellite data.
Several methods have been developed for thematic information extraction from satellite observations, which offer a cost-effective, spatially extensive, multi-temporal, and time-saving alternative to traditional field surveys (Talukdar et al. 2020). Pixel-based methods categorize individual pixels mainly on the basis of spectral information, which may cause "salt-and-pepper" effects because the spectral responses of individual pixels do not represent the characteristics of the surface object. Over the last two decades, a framework that employs objects as the basis for analysis has emerged, called Object-Based Image Analysis (OBIA). Image segmentation is the preliminary and critical step that produces the fundamental elements of OBIA by partitioning an image into spatially adjoining and relatively homogeneous regions (segments). These elements form the foundation for further analysis as classification units (Blaschke et al. 2004; Baatz et al. 2008; Nussbaum and Menz 2008; Thenkabail 2015; Cheng and Han 2016; Kotaridis and Lazaridou 2021). However, in a complex urban environment, the selected features cannot be representative of all land cover types. Thus, instead of using manually selected features, automatic feature learning from remote sensing data is valuable (Zhang et al. 2018). In recent years, more advanced pattern recognition methods, i.e., Deep Learning (DL) architectures, have contributed to a breakthrough in semantic segmentation of remote sensing imagery (Kotaridis and Lazaridou 2021), assigning every pixel the class label of its corresponding image object (Mi and Chen 2020). The rise of the Convolutional Neural Network (CNN) has played a major role in this direction, emphasizing automatic feature learning. Satellite image semantic segmentation, including the extraction of roads and buildings and the identification of land cover types, is essential for sustainable development, urban planning, and climate change research.
on are based on semantic segmentation task (Wu et al. 2019).
A few researchers have employed satellite data and implemented CNNs and specifically UNETs to carry out image land cover classification tasks (Zhang et al. 2018;McGlinchy et al. 2019;Yi et al. 2019;Soni et al. 2020;Han et al. 2020).
The major objectives of this paper are:
1. The combination of spectral bands and indices to produce optimal results, ensuring adequate discrimination between the built-up and barren land classes.
2. The development of a methodological framework, based on deep learning, for the efficient mapping of the main land cover classes (built-up, vegetation, barren land, water body) in different urban and suburban landscapes. In particular, the proposed framework integrates superpixel segmentation (an essential procedure) with deep learning.
3. The implementation of a UNET architecture that learns the characteristics of the main land cover classes from the input data and can be deployed from a Colab notebook without excessive computational needs.
4. The effective integration of the presented methodology in relevant thematic information extraction tasks.

Study area and satellite data
In this paper, three Sentinel-2 Level-2A (Bottom-Of-Atmosphere) corrected reflectance images were obtained. The first image covers the training area (the city of Thessaloniki). The second and third images cover the test areas (two Italian cities, Bari and Genoa). The scenes were selected for their high data quality and limited cloud coverage. A subset was extracted from each scene for analysis so as to include urban areas of various densities. The subsets comprise several diverse land cover types, including concrete, asphalt, water, vegetation and soil, as presented in the natural color composites in Figure 1. The detailed features of these images are provided in Table 1.

Tools:
For the purpose of this study, Python code was developed and executed on Google Colab (Colaboratory). Specifically, the following libraries were used:
• Numpy, a core library for scientific computing, for basic array operations.
• Pyrsgis, to read and export GeoTIFFs.
• patchify, to split images into small overlapping patches by a given patch and step size, and to merge patches into the original image during the prediction step.
• Scikit-learn, a machine learning package that supports supervised and unsupervised learning, for data pre-processing and accuracy checks.
• Keras with a TensorFlow backend, a machine learning and artificial intelligence framework, for building and deploying the CNN model.
• matplotlib, for creating visualizations (plots).

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France

In addition, QGIS and Orfeo Toolbox (OTB) were used to preprocess the data. QGIS is a free and open-source Geographic Information System that supports the creation, editing, visualization, and publication of geospatial data. Specifically, it was used for digital processing of the Sentinel-2 images. OTB, an open-source software library that supports the processing of remote sensing data, was employed for input data normalization. Finally, the validation of the results and the visualization (maps) were also carried out in QGIS.

Training and testing dataset preparation:
Dataset preparation produces ready-to-use samples for the UNET model. It can be divided into the following discrete phases: initial processing of Sentinel-2 imagery, superpixel segmentation, and data sampling (Figure 3).
Initial processing includes spectral indices calculation, normalization of spectral bands (R, G, B, NIR), stacking of the normalized bands and spectral indices to produce a single image product and clipping this product to the boundaries of the area of interest (AOI).
In this study, the combined use of four common spectral bands (R, G, B, NIR) and three suitable spectral indices (MNDWI, NDVIre, NDTI) is proposed to extract the main land cover classes in a complex and heterogeneous environment, that is, a city and its surrounding areas. Distinguishing barren land from the built-up environment is often a difficult task (Osgouei et al. 2019). We deduced from several experiments that the aforementioned input produces the optimal results. Table 2 summarizes the spectral indices and their corresponding equations. Figure 2 illustrates the contribution of the spectral indices in terms of spectral separability between the different land cover classes. Another concern during the train/test dataset preparation was to normalize the input features into similar value ranges. Data normalization is critical to ensure that all the features are treated equally, considering that neural networks are sensitive to the distribution of data (Chollet 2018). We implemented a data normalization procedure (range from 0 to 1) to obtain a land cover probability output. To achieve that, each spectral band was divided by its own maximum value (spectral indices are already normalized). For this purpose, the BandMathX application was accessed through the otbApplication Python module. The normalized spectral bands and indices were then stacked to form a single raster image. Finally, this raster image was clipped to the boundaries of the AOI.
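The initial processing described above can be sketched with NumPy. The authors performed the normalization with OTB's BandMathX; the code below is only an illustrative equivalent, not their implementation. The helper names, the toy reflectance values, and the band pair passed to `normalized_difference` are assumptions; the actual index equations are those of Table 2.

```python
import numpy as np

def scale_by_band_max(bands):
    """Divide each band (first axis) by its own maximum.

    For non-negative reflectance values the result lies in [0, 1].
    """
    bands = np.asarray(bands, dtype=np.float32)
    band_max = bands.max(axis=(1, 2), keepdims=True)
    # Assumption: an all-zero band is left unchanged instead of dividing by 0.
    band_max = np.where(band_max == 0, 1.0, band_max)
    return bands / band_max

def normalized_difference(a, b):
    """Generic (a - b) / (a + b) index; 0 is returned where a + b == 0."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

# Toy raster (hypothetical values), shape (bands, rows, cols): R, G, B, NIR.
rng = np.random.default_rng(0)
spectral = rng.integers(1, 10000, size=(4, 8, 8)).astype(np.float32)

scaled = scale_by_band_max(spectral)
# One illustrative index layer; this band pair is an assumption, not Table 2.
index_layer = normalized_difference(spectral[3], spectral[0])
# Stack the scaled bands and the index layer into a single multiband array.
stacked = np.concatenate([scaled, index_layer[None, ...]], axis=0)
print(stacked.shape)
```

In a real workflow the three indices of Table 2 would each contribute one layer, giving seven layers in total before clipping to the AOI.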
Image segmentation was carried out in the Terminus QGIS plugin. Terminus was developed to provide an easily accessible tool that allows users to perform image segmentation tasks. It is a fast and straightforward plugin that includes four popular image segmentation algorithms: Felzenszwalb's, quickshift, SLIC and watershed. Each algorithm produces two outputs, a vector file and a raster file containing the produced segments. The plugin offers the user the option to compute various statistics over each segment. In that case, the statistics are stored in the fields of the output vector file and in the bands of the multiband raster file that is created, so that each output band holds one statistic of the pixels within each segment. The raster file can therefore be displayed as a color composite of the user's choice.
For the purpose of this study, following several trial-and-error attempts, Felzenszwalb's superpixel segmentation algorithm was employed. Felzenszwalb's method is a graph-based image segmentation algorithm based on pairwise region comparison. It produces a segmentation of a multichannel image using a fast, minimum spanning tree-based clustering on the image grid. An important aspect of this algorithm is its ability to maintain detail in low-variability image areas while ignoring detail in high-variability areas (Felzenszwalb and Huttenlocher 2004). In addition, the mean value statistic was stored in the bands of the output raster file. Segmentation produced a controlled oversegmentation, which is preferable to undersegmentation, since splitting segments a posteriori is a more complicated task than merging them.
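The mean value statistic can be illustrated as follows: given a label raster in which every pixel holds the ID of its segment, each pixel is replaced by the mean of its segment, band by band. This is a sketch under stated assumptions: the label raster is a hand-made toy array rather than the output of Felzenszwalb's algorithm, segment IDs are assumed to be contiguous integers starting at 0, and `segment_mean_image` is a hypothetical helper, not part of the Terminus plugin.

```python
import numpy as np

def segment_mean_image(image, labels):
    """Replace each pixel by the per-band mean of the segment it belongs to.

    image  : float array, shape (rows, cols, bands)
    labels : int array, shape (rows, cols), segment IDs 0..n-1 (all present)
    """
    flat_labels = labels.ravel()
    counts = np.bincount(flat_labels)          # pixels per segment
    out = np.empty_like(image, dtype=np.float64)
    for b in range(image.shape[2]):
        # Sum of band values per segment, then divide by the pixel count.
        sums = np.bincount(flat_labels, weights=image[..., b].ravel())
        means = sums / counts
        out[..., b] = means[labels]            # broadcast means back to pixels
    return out

# Toy example (hypothetical): three segments on a 3 x 4 single-band image.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2]])
image = np.arange(12, dtype=np.float64).reshape(3, 4, 1)
smoothed = segment_mean_image(image, labels)
print(smoothed[..., 0])
```

Every pixel inside a segment ends up with the same value, which is the property that lets the segmented raster suppress pixel-level noise before patches are sampled.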

Figure 3. Schematic diagram of data preparation.
A CNN model determines the relationship between the characteristics (features) of an entity and a property (label). For this reason, several samples (features with their corresponding labels) are fed into the model and undergo a learning procedure so that labels can be predicted for new (unlabelled) data. In this paper, the samples from the segmented stacked image are referred to as input features (X), and the classified land cover data (a reference land cover map was generated) as input labels (Y).
This paper proposes a standard UNET architecture that operates on image patches to perform semantic segmentation. In such an approach, patches are generated from the input features during the training phase. Once the UNET model is trained with these patches, it is validated on the test patches. We used a 64x64 window with a stride (window slide) of 64, which resulted in 375 non-overlapping patches that were fed to the UNET model. In order to evaluate the performance of the applied model, 70% of them (262) were used as training samples, while the remaining 30% (113) were used as test samples.
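The patch generation and the 70/30 split can be sketched in plain NumPy. The study used the patchify library; the reshape-based helper below is a stand-in for the non-overlapping case only. The raster dimensions (chosen so that 15 x 25 = 375 patches result), the random seed, and rounding the test share upwards are assumptions made for illustration.

```python
import math
import numpy as np

def split_into_patches(image, size):
    """Cut a (rows, cols, bands) array into non-overlapping size x size patches.

    Rows and columns that do not fill a whole patch are discarded.
    Patches are returned row by row, shape (n, size, size, bands).
    """
    rows, cols, bands = image.shape
    n_r, n_c = rows // size, cols // size
    cropped = image[:n_r * size, :n_c * size]
    patches = cropped.reshape(n_r, size, n_c, size, bands).swapaxes(1, 2)
    return patches.reshape(n_r * n_c, size, size, bands)

# Hypothetical raster: 7 layers (4 bands + 3 indices), sized to give 375 patches.
image = np.zeros((15 * 64, 25 * 64, 7), dtype=np.float32)
patches = split_into_patches(image, 64)

# Shuffle patch indices and hold out 30% for testing (seed is an assumption).
rng = np.random.default_rng(42)
order = rng.permutation(len(patches))
n_test = math.ceil(0.3 * len(patches))
test_idx, train_idx = order[:n_test], order[n_test:]
print(len(patches), len(train_idx), len(test_idx))
```

Under these assumptions the split yields 262 training and 113 test patches, matching the counts reported above.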

UNET architecture:
A fully convolutional neural network (FCN) replaces the fully connected layers of a CNN with up-convolutional layers and concatenates their output with a shallower, finer layer to produce end-to-end labels (Long et al. 2015). A standard CNN operates in an "image-label" manner, while the "end-to-end" labelling mode of an FCN is more suitable for pixel-based image classification, i.e., assigning each pixel the label of its corresponding class (Zhang et al. 2018). UNET is an architecture for semantic segmentation. It is an improved FCN model defined by its symmetrical U-shaped architecture, which consists of a contracting path (following the typical architecture of a convolutional network) and an expansive path. It combines low-level features carrying detailed spatial information with high-level features carrying semantic information to improve segmentation accuracy (Ronneberger et al. 2015; Zhang et al. 2018).
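The interplay between the contracting path, the expansive path, and the concatenation of fine and coarse features can be traced with a small NumPy sketch. It only follows array shapes through one down-sampling and one up-sampling step; it contains no learned convolutions and is not the network used in this paper. The 2x2 max pooling, the nearest-neighbour up-sampling, and the 7-layer input patch are assumptions for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on a (rows, cols, channels) array with even rows/cols."""
    r, c, ch = x.shape
    return x.reshape(r // 2, 2, c // 2, 2, ch).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour up-sampling by a factor of 2 along rows and cols."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
patch = rng.random((64, 64, 7))        # one input patch (hypothetical layer count)

encoded = max_pool_2x2(patch)          # contracting path: coarser spatial grid
decoded = upsample_2x(encoded)         # expansive path: original grid restored
# Skip connection: fine-resolution features are concatenated with the
# up-sampled coarse features along the channel axis.
merged = np.concatenate([patch, decoded], axis=-1)
print(encoded.shape, decoded.shape, merged.shape)
```

In an actual UNET, convolutions with learned weights sit before and after each of these steps, and the pattern repeats over several resolution levels.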

Computational resources
All source code for the procedures described previously was implemented in Python using the Keras framework (https://keras.io), with TensorFlow serving as the backend engine, on a GPU. Keras is a deep-learning framework for Python that provides a convenient way to define and train almost any kind of deep-learning model (Chollet 2018). The experimental results were produced on Google Colab, which allows users to write and execute Python scripts in the browser together with explanatory text in a single document (notebook). An important feature is the capability to import data from Google Drive as well as from GitHub.

Model performance evaluation
The methodological framework introduced in Section 2 was applied to the area of Thessaloniki for training, while experiments were conducted in two other areas to investigate its potential in land cover classification for urban growth monitoring. During the validation phase, the performance of the trained UNET was examined on the corresponding test dataset. The model was trained for 50 epochs. A few common error metrics regarding the validation of the model are presented in Table 3, and the learning curves for both the training and validation phases are presented in Figures 5 and 6.

Table 3. Evaluation of the UNET model through a few common metrics, with land cover types indicated as follows: 1 = built-up, 2 = vegetation, 3 = barren land, 4 = water body.
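Per-class metrics of the kind commonly reported in such an evaluation can be derived from a confusion matrix, as sketched below. Which metrics Table 3 reports cannot be recovered from the text, so precision, recall, F1 and overall accuracy are assumed here. The labels are toy values, the helper names are hypothetical, and classes are coded 0 to 3 rather than 1 to 4.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are reference classes, columns are predicted classes."""
    idx = y_true * n_classes + y_pred
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def per_class_scores(cm):
    """Return precision, recall and F1 per class from a confusion matrix."""
    tp = np.diag(cm).astype(np.float64)
    predicted = cm.sum(axis=0)      # pixels assigned to each class
    actual = cm.sum(axis=1)         # reference pixels of each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted != 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual != 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom != 0)
    return precision, recall, f1

# Toy labels (hypothetical): 0 = built-up, 1 = vegetation, 2 = barren land, 3 = water.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 2, 1, 1, 2, 0, 3, 3])
cm = confusion_matrix(y_true, y_pred, 4)
precision, recall, f1 = per_class_scores(cm)
overall_accuracy = np.trace(cm) / cm.sum()
print(cm)
print(overall_accuracy)
```

In this toy case the only errors are confusions between built-up and barren land, the class pair that the text identifies as the hardest to separate.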

Results
Following the successful training of the proposed model, it was applied to the test areas to produce a land cover map for each of them (Figures 7 and 8). It should be emphasized that the test material has to undergo the same preprocessing steps as the input training data (i.e., steps 1 and 2 in Figure 3).
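Producing a map from patch-wise predictions involves taking the most probable class per pixel and reassembling the patches in their original order. The sketch below assumes non-overlapping patches stored row by row; the random probabilities stand in for the output of a trained model, and `merge_patches` is a hypothetical helper rather than the patchify routine used in the study.

```python
import numpy as np

def merge_patches(patches, n_rows, n_cols):
    """Reassemble non-overlapping (size, size) patches, stored row by row."""
    n, size, _ = patches.shape
    if n != n_rows * n_cols:
        raise ValueError("number of patches does not match the grid")
    grid = patches.reshape(n_rows, n_cols, size, size).swapaxes(1, 2)
    return grid.reshape(n_rows * size, n_cols * size)

# Stand-in for model output (hypothetical): per-pixel class probabilities,
# shape (patches, size, size, classes). A trained model would supply these.
rng = np.random.default_rng(1)
probabilities = rng.random((6, 64, 64, 4))

class_patches = probabilities.argmax(axis=-1)   # most probable class per pixel
land_cover_map = merge_patches(class_patches, n_rows=2, n_cols=3)
print(land_cover_map.shape)
```

The merged array has the same grid as the cropped input raster, so it can be written back to a GeoTIFF with the georeferencing of the source image.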

DISCUSSION AND CONCLUSION
In this research, a promising land cover classification method based on deep learning is proposed for Sentinel-2 data. The method combines superpixel segmentation with deep learning on input data that include spectral indices, in order to distinguish the built-up environment from barren land and overcome the limitations of traditional pixel-based approaches. The imagery is first segmented into superpixels using Felzenszwalb's algorithm, and then a UNET architecture is employed to extract land cover features. The proposed framework was validated in two coastal cities on the Mediterranean Sea and performed well, reaching high accuracy values for the classification of the main land cover classes.
In the proposed land cover classification framework, superpixel segmentation was carried out to cluster the raster file produced by the initial processing of Sentinel-2 imagery into small homogeneous regions. We maintain that this step is necessary, since it transforms the input data in such a way that, when the UNET model is applied, the final classification map does not suffer from the common issues of a pixel-based approach.
In our view, a critical aspect is the feasibility of implementing the proposed method. Since it has relatively minor computational needs, it can be deployed for similar purposes from Google Colab, executing the code on Google's cloud servers. Setting up the prerequisite ML libraries is quite an easy task, since most of them are already installed. Nevertheless, a comprehensive understanding of machine learning fundamentals is essential for a smooth implementation of the aforementioned procedures.
The proposed methodology can be implemented in urban growth monitoring, ensuring an automated procedure on multitemporal imagery. This will be addressed in future research. In due time, the Google Colab notebooks will be available in the authors' GitHub repository.