Very High Resolution Land Cover Mapping of Urban Areas at Global Scale with Convolutional Neural Networks

This paper describes a methodology to produce a 7-class land cover map of urban areas from very high resolution images and limited, noisy labelled data. The objective is to produce a segmentation map of a large area (a French department) with the following classes: asphalt, bare soil, building, grassland, mineral material (permeable artificialized areas), forest and water, from 20 cm aerial images and a Digital Height Model. We created a training dataset on a few areas of interest by aggregating databases, semi-automatic classification and manual annotation, in order to obtain a complete ground truth for each class. A comparative study of different encoder-decoder architectures (U-Net, U-Net with ResNet encoders, DeepLabV3+) with different loss functions is presented. The final product is a highly valuable land cover map computed from model predictions stitched together, binarized, and refined before vectorization.


INTRODUCTION
Land cover maps are cartographic products widely demanded by land managers. As they provide a quantification of land resources into thematic categories (e.g. forest, water or asphalt surface), land use/land cover maps are used to measure current conditions and how they are changing.
The different classes composing the final land cover map are derived from the OCSGE product (Description of OCSGE). They have been selected to best fit the description of urban areas: asphalt, bare soil, building, grassland, mineral material, forest, and water. The final product (Figure 1) can be used for deriving environmental indicators and measuring land consumption.
In this paper, we propose a full methodology to produce a very high resolution 7-class land cover map at large scale from aerial images. Specific attention has been given to the following tasks:
• creating a consistent and representative training dataset for each class
• benchmarking different models and configurations
• producing a relevant land cover map from model predictions
As databases (BDTOPO® or OpenStreetMap®) do not describe all the objects of each class, manual annotation is needed. We chose to create a full ground truth on specific areas of interest (AOI) within the production zone to facilitate the labelling process.
In the end, only 10% of the targeted zone is used for training. Semantic segmentation of aerial images with convolutional neural networks is then performed from this training dataset.
Since the publication of an encoder-decoder deep convolutional network architecture (Ronneberger et al., 2015), scientists have tried to improve semantic segmentation on medical images (Ibtehaz, Rahman, 2019) or terrestrial images (Kaiming He et al., 2015; Liang-Chieh Chen et al., 2018), and others have transposed those solutions to earth observation with high resolution images taken from satellites (Nagesh Kumar Uba, 2019), airplanes or drones. Multiple competitions have recently been carried out to improve semantic segmentation results on earth observation images with a provided dataset (SpaceNet challenges, ISPRS semantic labelling contest). Much work on road or building extraction from satellite images (Zhengxin Zhang et al., 2017) has been published.
We propose to reuse top-class semantic segmentation architectures (U-Net with ResNet encoders, DeeplabV3+) on very high resolution aerial images. This article also describes some post-processing techniques to convert model predictions to a relevant land cover map.

Input data
We chose a sample size of 512 pixels for convenience, as we are using a DeepLabV3+ model with a receptive field of 470 pixels and a GPU with limited memory (8 GB).

2.1.1
Images: As input data, we use a mosaic of aerial images with a resolution of 20 cm. The camera has 4 bands (Red, Green, Blue and Near-Infrared) and the original pixel depth is reduced to 8 bits before the mosaicking process. As a complement, a Digital Surface Model (derived from the aerial images by photogrammetric techniques) is used in combination with a Digital Terrain Model to produce a Digital Height Model (also called normalized DSM) that we concatenate as a 5th channel to the input images.
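The input preparation described above can be sketched in a few lines of numpy. This is a minimal illustration with randomly generated arrays standing in for the real rasters; the 8-bit rescaling of the DHM is an assumption, as the paper does not specify how heights are encoded.

```python
import numpy as np

# Hypothetical co-registered rasters: 4-band aerial mosaic (R, G, B, NIR)
# and the two elevation models, all at 20 cm resolution.
rgbn = np.random.randint(0, 256, (4, 512, 512), dtype=np.uint8)  # 8-bit mosaic
dsm = np.random.rand(512, 512).astype(np.float32) * 60.0          # surface elevation (m)
dtm = dsm - np.random.rand(512, 512).astype(np.float32) * 20.0    # terrain elevation (m)

# Digital Height Model (normalized DSM): height of objects above ground.
dhm = np.clip(dsm - dtm, 0.0, None)

# Concatenate the DHM as a 5th channel (rescaled here to the 8-bit range
# of the other bands; the actual scaling is not given in the paper).
dhm_8bit = np.clip(dhm / dhm.max() * 255.0, 0, 255).astype(np.uint8)
sample = np.concatenate([rgbn, dhm_8bit[None, ...]], axis=0)
print(sample.shape)  # (5, 512, 512)
```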

2.1.2
Masks: Associated masks are created separately for each class. Some classes are labelled with different techniques detailed in paragraph 3.1. A multi-channel binary image is produced to store the label information with a band for each class.

2.1.3
Samples: In the end, we feed the model with a tuple of (image, mask), with shape (5, 512, 512) for the image (4 spectral bands plus the DHM) and (7, 512, 512) for the mask (one binary band per class).

Network architectures
Three encoder-decoder network configurations are assessed in this article. The original U-Net is the baseline configuration, against which we precisely measure the improvements brought by more sophisticated architectures: U-Net with a ResNet50 encoder on the one hand, and DeepLabV3+ on the other. Those models have very different characteristics and specificities.
• U-Net is a simple encoder-decoder convolutional network. The encoder part processes the input image through several convolutional and down-sampling layers, giving semantically rich feature maps. The decoder part consists of convolutional and upsampling layers combined with skip connections with the encoder feature maps to access further spatial information. Those connections furthermore alleviate the risk of vanishing or exploding gradients.
• ResNet, used as an encoder, differs mainly by the use of skip connections between consecutive layers in the encoder path, allowing the network to learn residual feature maps.
• DeepLabV3+ (Figure 5).
Figure 5. DeepLabV3+ architecture
Those models differ not only in their architecture but also in their number of parameters, impacting both the amount of training data and the computation time required. The smallest models are based on U-Net and the largest on ResNet18 encoders.

Loss functions
Several loss functions are evaluated in this work.
• Cross-Entropy (CE) loss takes into account the deviation between prediction and target. But it is evaluated on individual pixels, so large objects contribute more to it than small ones, which can be an issue when dealing with land cover objects.
• Weight-balanced Cross-Entropy (WCE) loss is used to tackle the problem of an unbalanced training dataset. Under-represented classes get higher weights to ensure that all classes contribute equally to the loss value.
• Binary Cross-Entropy (BCE) is used to evaluate the benefits of allowing the network to perform a multi-label classification task (each pixel could belong to multiple classes).
• A combo loss with BCE + Jaccard. The Jaccard loss is computed from the Jaccard index, defined for a predicted segmentation y_pred and a ground truth labelling y_true by:

J(y_pred, y_true) = |y_pred ∩ y_true| / |y_pred ∪ y_true|    (1)

The Jaccard index is transformed to obtain the Jaccard loss:

L_Jaccard = 1 − J(y_pred, y_true)    (2)

The combo loss is a weighted combination of BCE and Jaccard loss:

L_combo = α · L_BCE + (1 − α) · L_Jaccard    (3)

Using the Jaccard loss mixed with BCE is a quite common technique to handle imbalanced input.
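The combo loss of equations (1)-(3) can be written compactly in numpy. This is a minimal sketch on soft (probability) predictions; the weighting factor alpha is an assumption, as the paper does not give its value.

```python
import numpy as np

def bce(y_pred, y_true, eps=1e-7):
    """Binary cross-entropy averaged over pixels."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def jaccard_loss(y_pred, y_true, eps=1e-7):
    """Soft Jaccard loss: 1 - intersection / union (equations 1 and 2)."""
    inter = np.sum(y_pred * y_true)
    union = np.sum(y_pred) + np.sum(y_true) - inter
    return float(1.0 - (inter + eps) / (union + eps))

def combo_loss(y_pred, y_true, alpha=0.5):
    """Weighted combination of BCE and Jaccard loss (equation 3)."""
    return alpha * bce(y_pred, y_true) + (1 - alpha) * jaccard_loss(y_pred, y_true)

y_true = np.array([[1.0, 1.0], [0.0, 0.0]])
y_pred = np.array([[0.9, 0.8], [0.2, 0.1]])
print(round(combo_loss(y_pred, y_true), 3))
```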

Post-processing
The raw result is a probability map (or heat map) with a band per class. Each pixel is represented by a vector of class-membership probabilities, with values rescaled between 0 and 255. We perform a 2-level thresholding:
• binarization of the heat map with a low threshold
• vectorization
• averaging of the probability over each object
• filtering of objects whose average probability is below a high threshold
At last, a refining process removes objects according to their size, then geometries are simplified.
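The 2-level thresholding above can be illustrated on a toy single-class heat map, with connected components standing in for the vectorized objects. The two threshold values are illustrative, not the ones used in the paper.

```python
import numpy as np
from scipy import ndimage

LOW, HIGH = 80, 160   # illustrative thresholds on the 0-255 heat map

heat = np.zeros((10, 10), dtype=np.uint8)
heat[1:4, 1:4] = 220          # strong object -> kept
heat[6:9, 6:9] = 100          # weak object  -> filtered out

# 1) binarization with the low threshold
binary = heat >= LOW

# 2) group pixels into objects (stand-in for vectorization)
labels, n = ndimage.label(binary)

# 3) average probability per object, 4) keep objects above the high threshold
kept = np.zeros_like(binary)
for obj_id in range(1, n + 1):
    if heat[labels == obj_id].mean() >= HIGH:
        kept |= labels == obj_id

print(kept.sum())  # 9 pixels (only the strong 3x3 object survives)
```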

EXPERIMENTS
We focus on producing a land cover map of the urban areas of the Gironde department (Figure 7). Aerial images taken by the French Mapping Agency in 2018 with a resolution of 20 cm, the associated Digital Surface Model and BDALTI® are used to build the image samples.

Training dataset elaboration
The dataset specifications stem from practical choices made during its production. Most of the time, when training datasets are not provided for experiments, they are generated by picking a large number of random samples from the targeted area. We decided instead to build our dataset around 5 consistent areas of interest, carefully selected to best represent the overall characteristics of the territory (Figure 6). This choice was driven by two main reasons:
• It is easier to automate or semi-automate the training label production, with standard machine learning tools, on large image tiles than on many small patches.
• It brings more flexibility in the choice of a relevant size and resolution for the training samples.

AOIs characteristics:
This dataset is mainly dedicated to urban area segmentation, so areas whose land cover is mainly forest, natural or cropland have not been taken into consideration. But in order to capture the diversity of urban areas, territories with different urban morphologies have been manually selected: coastal and touristic areas (dark green in Figure 6), dense downtown (red), residential suburban zones (orange) and small towns in agricultural territories (light green and yellow).
Figure 6. Training zones

3.1.2
Labelling: Each class is labelled separately. Some classes are labelled with the different techniques detailed below. The French BDTOPO® database provides consistent information for buildings, roads and natural water areas. But it does not contain parking lots or sidewalks, useful for the asphalt class, nor urban grassland, forests or mineral materials. Those objects are extracted either by manual annotation or with semi-automatic techniques from classical machine learning. Some of them are detailed in the following sections.
Classical machine learning for vegetation: To build the training dataset for the vegetation classes (forest and grassland), classical machine learning is used. We combine a pixel-based classification through random forests with a large-scale mean shift segmentation. The vegetation indices NDVI and NDWI, as well as the RGB and Near-Infrared bands, are used as input of a supervised random forest classifier. We then combine the pixel classification with the regions of the large-scale mean shift segmentation: thanks to a majority vote on the pixel classes within each region, we get larger homogeneous regions. Taking advantage of the DHM and NDVI to detect vegetation and trees, the results are acceptable as is for ground truth. All these processings are done using the command-line tools of orfeo-toolbox (Grizonnet et al., 2017).
Semi-automatic labelling for mineral materials: The strict definition of the mineral material class is: every area with a permeable but man-made coverage, including paths, railways, yards, … It can easily be mistaken for asphalt or concrete. The common approach to this problem relies on massive manual annotation for each training zone. We rather chose to use a basic U-Net trained on a tiny subset of the training data manually labelled (5 km² out of 100 km²). The network predictions over the 5 training areas are then post-processed (vectorization, simplification, merging with available vector data such as the road network) and manually corrected to constitute a ground truth.
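The two spectral indices fed to the random forest classifier have standard definitions, which a short numpy sketch makes explicit (the band values below are illustrative reflectance-like numbers, not data from the paper):

```python
import numpy as np

def ndvi(nir, red, eps=1e-7):
    """Normalized Difference Vegetation Index: high over vegetation."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-7):
    """Normalized Difference Water Index (McFeeters): high over water."""
    return (green - nir) / (green + nir + eps)

red = np.array([0.10, 0.30])
green = np.array([0.15, 0.25])
nir = np.array([0.60, 0.20])   # first pixel vegetated, second not

print(np.round(ndvi(nir, red), 2))   # high NDVI on the vegetated pixel
```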

3.1.3
Training dataset consistency: From each AOI, we extract square samples of 512 pixels width. The constituted training dataset is evaluated to ensure that each class is adequately represented.
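Extracting the square samples from an AOI can be sketched with a simple sliding window; the non-overlapping stride and the border handling here are assumptions, as the paper does not detail them.

```python
import numpy as np

def extract_samples(raster, size=512, stride=512):
    """Cut a channels-first AOI raster into square training samples."""
    c, h, w = raster.shape
    samples = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            samples.append(raster[:, top:top + size, left:left + size])
    return samples

aoi = np.zeros((5, 1536, 1024), dtype=np.uint8)  # toy 5-channel AOI
tiles = extract_samples(aoi)
print(len(tiles), tiles[0].shape)  # 6 tiles of shape (5, 512, 512)
```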
Here are a few statistics on the dataset showing that it is quite balanced, with only 2 under-represented classes: mineral materials and bare soil (see Table 2).

Benchmarking configurations
The dataset is split into train/validation/test parts:
• the training set represents 64% of the total
• the validation set (16% of the total) is used between epochs to prevent overfitting and decide when convergence is reached
• the test set (20%) is used to provide the metrics reported in the following paragraphs
The experiment is run on a desktop computer with the following characteristics: an Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz with 16 GB of RAM, and an NVIDIA GeForce RTX 2070 GPU with 8 GB of RAM. The source code of the processing pipeline is based on pytorch and will be made available soon.
The training hyper-parameters are:
• training on up to 300 epochs with a patience of 20 (if the loss does not decrease on the validation set for 20 epochs, the training is stopped)
• a learning rate starting at 0.001, halved when the loss on the validation set does not decrease
• the Adam optimizer
Every network is trained from scratch.
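The schedule above (patience-based early stopping plus learning-rate halving) can be sketched in a framework-agnostic helper. The exact trigger rules (e.g. whether the rate is halved on every non-improving epoch) are assumptions; a pytorch pipeline would typically use ReduceLROnPlateau instead.

```python
# Minimal sketch of early stopping with patience and LR halving on plateau.
class TrainingSchedule:
    def __init__(self, lr=0.001, patience=20):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return False to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            self.lr /= 2          # halve LR on a non-improving epoch (assumption)
        return self.bad_epochs < self.patience

sched = TrainingSchedule(lr=0.001, patience=3)
for loss in [1.0, 0.8, 0.9, 0.9, 0.9]:   # validation losses per epoch
    keep_going = sched.step(loss)
print(keep_going, sched.lr)  # stops after 3 stagnant epochs, LR halved 3 times
```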
We use basic data augmentation with randomized 90-degree rotations.
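A minimal sketch of this augmentation, where the same random multiple of 90 degrees is applied to a channels-first image and its mask so that they stay aligned:

```python
import numpy as np

def random_rot90(image, mask, rng):
    """Rotate an (image, mask) pair by the same random multiple of 90 degrees."""
    k = rng.integers(0, 4)                      # 0, 90, 180 or 270 degrees
    return (np.rot90(image, k, axes=(1, 2)).copy(),
            np.rot90(mask, k, axes=(1, 2)).copy())

rng = np.random.default_rng(0)
img = np.arange(2 * 4 * 4).reshape(2, 4, 4)
msk = img[:1] % 2
img_a, msk_a = random_rot90(img, msk, rng)
print(img_a.shape, msk_a.shape)
```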

3.2.1
Configuration performances: The measure used to compare configurations is the Intersection over Union (IoU) index, also known as the Jaccard index described by formula 1. This index measures the overlap between prediction and ground truth. It is preferred to pixel accuracy as it is not affected by class imbalance.
Table 3. IoU in percentage for each class and architecture configuration
The overall result is quite good, even if we note a severe defect for the mineral material class, which is known to be problematic. The training dataset is not very reliable for this class, and it is a real challenge even for a human eye to distinguish between concrete and mineral material in the proposed images.
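The per-class IoU of Table 3 can be computed from flat label maps as follows (toy data, not the paper's results):

```python
import numpy as np

def iou_per_class(pred, truth, n_classes):
    """Intersection over Union for each class of a labelled segmentation."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (truth == c))
        union = np.sum((pred == c) | (truth == c))
        ious.append(inter / union if union else float("nan"))
    return ious

truth = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
print([round(v, 2) for v in iou_per_class(pred, truth, 3)])  # [0.33, 0.67, 0.5]
```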

Analysis:
Although they have more trainable parameters, ResNet18-based models perform on par with the simple U-Net, or even worse on difficult classes (mineral materials and bare soil). It seems that our dataset is too small to benefit from the complexity of ResNet18. Loss performances are quite equivalent in this experiment. We note that the use of Binary Cross-Entropy degrades the measured performances, while the combo loss (BCE + Jaccard loss) improves the IoU for many classes, but to a relatively limited extent.

Visual comparison:
Results are generally equivalent, though we can point out some minor differences. Deeplab models delineate objects more precisely than U-Net and give better performance on small objects. Whereas the visual differences between the CE, BCE or even combo losses are not significant, the WCE loss performs better on subsidiary classes: the segmentation of mineral materials and bare soils is improved.

Global land cover map from model predictions
Having a trained model that performs well on test images is something many studies would settle for. But the final purpose here is to produce a global, vectorized land cover map describing elementary objects.

From model predictions to stitched map:
A trained model predicts class-membership probabilities for tiled images, also called patches. Stitching those predicted tiles together is challenging: vignetting effects can appear on predictions due to a poor consideration of the models' receptive fields. As (Bohao Huang et al., 2018) point out, the modern Convolutional Neural Networks used in this article do not guarantee translational equivariance, due to the underlying zero-padding and stride operations. We present the results of 2 methods to perform a smooth combination of predictions:
• clipping the edges of patches
• blending predictions with a windowing function
Clipping the edges of patches: The following predictions are made on larger tiles (2048 × 2048 pixels) to optimize the available GPU memory. A fishnet is calculated with an overlap size chosen with respect to the network receptive field.
Blending predictions with a windowing function: Work from (Chevalier, 2017) has been reused to improve the final result. The algorithm makes several predictions of a single tile with transformations applied (rotations, mirrors) and averages them. Then it uses a 2D spline function to blend overlapping predictions together.
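The window-based blending can be sketched on two horizontally overlapping tiles. A Hann window is used here as the 2D taper; the referenced implementation uses a squared spline, which behaves similarly (zero weight at tile edges, full weight at the center), so this is an approximation, not the exact algorithm.

```python
import numpy as np

def window2d(size):
    """Separable 2D taper: outer product of a 1D Hann window."""
    w = np.hanning(size)
    return np.outer(w, w)

tile_size, overlap = 8, 4
win = window2d(tile_size)

left = np.ones((tile_size, tile_size)) * 0.2    # prediction of tile 1
right = np.ones((tile_size, tile_size)) * 0.8   # prediction of tile 2

# Accumulate weighted predictions and weights, then normalize.
width = 2 * tile_size - overlap
num = np.zeros((tile_size, width))
den = np.zeros((tile_size, width))
for tile, x0 in [(left, 0), (right, tile_size - overlap)]:
    num[:, x0:x0 + tile_size] += tile * win
    den[:, x0:x0 + tile_size] += win
blended = num / np.maximum(den, 1e-7)

print(blended.shape)  # (8, 12): seams fade smoothly from 0.2 to 0.8
```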
The result is of very high quality: not only is each pixel prediction much more precise, but the seams between prediction tiles have totally disappeared. The second image in Figure 14 shows how we can adjust the thresholds to retrieve more objects from the predictions. The final result is refined by removing elements of small size and simplifying the geometries.

CONCLUSIONS
This article gives hints on how to build, from scratch, a training dataset for the classes asphalt, bare soil, building, grassland, mineral material (permeable artificialized areas), forest and water, when full manual annotation is not an option. The DeepLabV3+ architecture shows very good results for semantic segmentation of aerial images. Dataset quality is the most important criterion to obtain good predictions. Prediction stitching is a real challenge when a seamless heat map is expected. This experiment has been conducted within a project whose objective is to modernize the production lines for land cover products on the French territory. The land cover maps produced on the Gironde department are then derived into a generalized product and will be used by public authorities to monitor land consumption.
In the future, the dataset will be completed with new images captured at other dates and in other departments (with different lighting conditions). We will study how the model can be generalized with more data augmentation techniques dealing with radiometry. The current experiment focuses on urban areas, but work is in progress to generate land cover maps of rural areas, with slightly different classes though.