THE LAND COVER CLASSIFICATION USING A FEATURE PYRAMID NETWORKS ARCHITECTURE FROM SATELLITE IMAGERY

Extracting land cover information from satellite imagery is of great importance for automated monitoring in various remote sensing applications. Deep convolutional neural networks make this task more feasible, but they are limited by the small size of the available annotated datasets. In this paper, we present a fully convolutional network architecture, FPN-VGG, that combines Feature Pyramid Networks and VGG. To accomplish the task of land cover classification, we create a land cover dataset of pixel-wise annotated images, and employ a transfer learning step and a variant dice loss function to improve the performance of FPN-VGG. The results indicate that FPN-VGG is more competent for land cover classification than other state-of-the-art fully convolutional networks. Transfer learning and the dice loss function improve performance on the small and unbalanced dataset. Our best model on the dataset gets an overall accuracy of 82.9%, an average F1 score of 66.0% and an average IoU of 52.7%.


INTRODUCTION
Many global and regional applications require land cover information about Earth's surface. Extracting land cover from satellite imagery is considered a low-cost approach and has been applied in many fields, such as land resource management and environmental protection. With the development of remote sensing technology, the spatial resolution of satellite images keeps increasing, which provides more information for land cover classification but also brings great challenges (Tong et al., 2018). It is therefore very difficult to find a universal method for land cover classification from images covering different geographical areas.
The prevalent remote sensing classification methods are mainly based on spectral and spatial features. These methods consist of two stages: feature extraction and feature classification. First, the features are extracted by manually designed operators such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG) (Yang, Newsam, 2013). Then the features are classified by classifiers such as the support vector machine (SVM) and the conditional random field (CRF) (Melgani, Bruzzone, 2004, Li et al., 2015). However, these methods struggle to classify images acquired in complex conditions. In recent years, deep learning methods have surpassed traditional methods in various computer vision tasks, such as object detection and classification. Convolutional Neural Networks (CNNs) are the most representative deep learning models; they are constructed as deep hierarchical architectures and are capable of extracting the intrinsic features of data. In 2012, Krizhevsky et al. (2012) won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) by employing CNNs. After that, deep learning has been widely used in remote sensing (Zhu et al., 2017) and other fields. At first, remote sensing scientists applied deep learning to scene classification, which is a coarser classification than the pixel level (Nogueira et al., 2016, Zhong et al., 2016, Tong et al., 2018). The Fully Convolutional Network (FCN) (Long et al., 2015), which replaces the fully connected layers with convolution layers, can directly produce pixel-wise classification results (Wu et al., 2018, Zhang et al., 2018). There are several FCN networks, such as FCN-8s (Long et al., 2015), Segnet (Badrinarayanan et al., 2015) and U-net (Ronneberger et al., 2015).
Although FCNs are the most popular approach to pixel-wise classification, they require huge computing resources as well as a large dataset of pixel-wise annotated images, which impedes their application in remote sensing. There are very few pixel-wise annotated land cover datasets, such as the ISPRS Benchmark dataset. To meet our need for classification from satellite imagery, we create a land cover dataset consisting of images and manually annotated pixel-wise labels. We design an FCN architecture which combines the Feature Pyramid Networks (Lin et al., 2017) and VGG (Simonyan, Zisserman, 2015), and overcome the limitation of the small and unbalanced dataset by using a transfer learning step and the variant dice loss function.

Network architecture
Feature Pyramid Networks (FPN) were proposed for object detection (Lin et al., 2017). FPN builds feature pyramids inside convolutional networks, which is critical for addressing multiscale problems. We present a new network architecture named FPN-VGG (Figure 1), which is modified from FPN and combined with the VGG network. The VGG network, proposed by the Visual Geometry Group at the University of Oxford (Simonyan, Zisserman, 2015), is used as the feature extractor for FPN. When training a deep learning network, the following problems often prevent us from obtaining the best model: (a) overfitting caused by the small training dataset; (b) slow convergence because of random initialization. To overcome these two problems, we employ a transfer learning strategy by initializing the feature extractor of FPN-VGG with a VGG16 pretrained model. Transfer learning has been shown to be a good way to train deep neural networks on small datasets (Huh et al., 2016). It has also been shown that ImageNet pretrained networks can improve classification performance on remote sensing data (Marmanis et al., 2016). Thus, we transfer the parameters of the VGG16 model (excluding the top fully connected layers) trained on the 2012 ImageNet dataset to initialize the encoder part of FPN-VGG (Figure 2).
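The feature pyramid idea can be illustrated with the generic top-down merge step described by Lin et al. (2017): a coarser map is upsampled by a factor of 2 and added to a finer map that has passed through a 1x1 lateral convolution. The following is only a minimal pure-Python sketch of that generic step, not the exact FPN-VGG layer configuration; the function names, the nested-list layout `[channel][row][col]` and the example weights are assumptions made for illustration.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][col]


def conv1x1(x: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: mixes channels independently at every pixel.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w * x[c][i][j] for c, w in enumerate(w_out)) for j in range(width)]
            for i in range(height)
        ]
        for w_out in weights
    ]


def upsample2x(x: FeatureMap) -> FeatureMap:
    """Nearest-neighbour upsampling by a factor of 2 in both spatial dimensions."""
    out = []
    for channel in x:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))  # repeat each row twice
        out.append(rows)
    return out


def top_down_merge(
    coarse: FeatureMap, fine: FeatureMap, lateral_w: List[List[float]]
) -> FeatureMap:
    """Add the upsampled coarse map to the 1x1-projected fine map."""
    up = upsample2x(coarse)
    lat = conv1x1(fine, lateral_w)
    return [
        [[u + l for u, l in zip(u_row, l_row)] for u_row, l_row in zip(u_ch, l_ch)]
        for u_ch, l_ch in zip(up, lat)
    ]


if __name__ == "__main__":
    coarse = [[[1.0]]]                       # 1 channel, 1x1 (deeper, coarser level)
    fine = [[[1.0, 2.0], [3.0, 4.0]],        # 2 channels, 2x2 (shallower, finer level)
            [[2.0, 4.0], [6.0, 8.0]]]
    print(top_down_merge(coarse, fine, [[1.0, 0.5]]))
```

In a real network the lateral weights are learned and the merged map is usually smoothed by a further 3x3 convolution; both are omitted here to keep the sketch short.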

Loss function and accuracy assessment
In multiclass classification tasks, the categorical cross entropy loss (L_categorical_crossentropy) is the most commonly used loss function. It is calculated as the categorical cross entropy between the ground truth (gt) and the prediction (pr).
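We have also employed the dice loss (L_dice) to train the FPN-VGG network. The dice loss is defined as follows:

$$L_{dice} = 1 - \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}$$

where precision and recall are computed between gt and pr, and $\beta$ is a coefficient for precision and recall balance, which is set to 1 in our work.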
We also use the sum of L_categorical_crossentropy and L_dice as a variant dice loss function, named L_cce_dice.
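The three loss functions can be sketched in a few lines. This is a minimal pure-Python illustration, assuming one-hot ground truth and softmax-like predictions stored as `[pixel][class]` lists, a dice loss in the common soft F-beta form, and statistics pooled over all pixels and classes (per-class averaging is another possible choice). The function names and the small constant `eps` are assumptions for illustration, not the training code used in this work.

```python
import math
from typing import Sequence

Probs = Sequence[Sequence[float]]  # assumed layout: [pixel][class]


def categorical_crossentropy(gt: Probs, pr: Probs, eps: float = 1e-7) -> float:
    """Mean over pixels of -sum_c gt_c * log(pr_c)."""
    total = 0.0
    for g_px, p_px in zip(gt, pr):
        total -= sum(g * math.log(max(p, eps)) for g, p in zip(g_px, p_px))
    return total / len(gt)


def dice_loss(gt: Probs, pr: Probs, beta: float = 1.0, eps: float = 1e-7) -> float:
    """Soft F-beta (dice) loss pooled over all pixels and classes.

    Uses the identity
    (1 + b^2) * P * R / (b^2 * P + R) = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP).
    """
    tp = sum(g * p for g_px, p_px in zip(gt, pr) for g, p in zip(g_px, p_px))
    fp = sum(p for p_px in pr for p in p_px) - tp
    fn = sum(g for g_px in gt for g in g_px) - tp
    b2 = beta ** 2
    score = ((1 + b2) * tp + eps) / ((1 + b2) * tp + b2 * fn + fp + eps)
    return 1.0 - score


def cce_dice_loss(gt: Probs, pr: Probs, beta: float = 1.0) -> float:
    """Sum of the categorical cross entropy and the dice loss."""
    return categorical_crossentropy(gt, pr) + dice_loss(gt, pr, beta)


if __name__ == "__main__":
    gt = [[1.0, 0.0], [0.0, 1.0]]
    pr = [[0.8, 0.2], [0.3, 0.7]]
    print(categorical_crossentropy(gt, pr), dice_loss(gt, pr), cce_dice_loss(gt, pr))
```

With beta equal to 1, precision and recall are weighted equally, so the dice term reduces to one minus a soft F1 score.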
We employ the F1 score and IoU (intersection over union), which are the most common indexes for assessing the accuracy of semantic segmentation (Maggiori et al., 2017), to assess the classification accuracy of remote sensing images. For F1, the relative contributions of precision and recall are equal:

$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}, \quad IoU = \frac{TP}{TP + FP + FN}$$

where

$$precision = \frac{TP}{TP + FP}, \quad recall = \frac{TP}{TP + FN}$$

and TP denotes true positives, FP denotes false positives and FN denotes false negatives. F1 and IoU reach their best value at 1 and their worst value at 0.
For multiclass classification tasks, we employ mF1 and mIoU, the means of the per-class F1 and IoU, to assess the accuracy of remote sensing image classification.
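These scores can all be derived from a confusion matrix. The sketch below is a minimal pure-Python illustration, assuming `confusion[i][j]` counts the pixels of reference class i predicted as class j; the function name and the returned dictionary keys are hypothetical. It also computes the overall accuracy (OA) reported in the results, taken here in its usual sense as the fraction of correctly classified pixels.

```python
from typing import Dict, List, Union


def segmentation_scores(
    confusion: List[List[int]],
) -> Dict[str, Union[float, List[float]]]:
    """Per-class F1 and IoU plus mF1, mIoU and OA from a confusion matrix.

    confusion[i][j] = number of pixels of reference class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    f1s, ious = [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # class k pixels missed
        fp = sum(confusion[i][k] for i in range(n)) - tp  # other pixels labelled k
        f1_den = 2 * tp + fp + fn   # F1 = 2TP / (2TP + FP + FN)
        iou_den = tp + fp + fn      # IoU = TP / (TP + FP + FN)
        f1s.append(2 * tp / f1_den if f1_den else 0.0)
        ious.append(tp / iou_den if iou_den else 0.0)
    return {
        "f1": f1s,
        "iou": ious,
        "mF1": sum(f1s) / n,
        "mIoU": sum(ious) / n,
        "OA": sum(confusion[k][k] for k in range(n)) / total,
    }


if __name__ == "__main__":
    # Toy two-class example with 20 pixels.
    print(segmentation_scores([[8, 2], [1, 9]]))
```

Because mF1 and mIoU weight every class equally while OA weights every pixel equally, a rare class such as "road" affects mF1 and mIoU far more than it affects OA.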

RESULTS AND DISCUSSION
In this section, we present a land cover dataset and design experiments to analyse the performance of FPN-VGG.

Dataset
We prepare a dataset for land cover classification from high resolution satellite images. The dataset consists of a set of Digital Orthophoto Maps (DOM) and the corresponding annotated labels. The DOM, acquired from the ZY-3 satellite, consists of four spectral bands: three in the visible (VIS: red (R), green (G), blue (B)) and one in the near infrared (N).
We have extracted 3 images from a large DOM as training data and manually annotated them with 6 classes (background, low vegetation (lowVeg), tree, building, road, water). These classes are commonly used in land cover applications. The images cover 3400 km² in East Asia at 5.8 m ground resolution. The images and labels are shown in Figure 3. The number of pixels in each class is shown in Figure 4. It is known that working with large image patches can maximize the advantage of CNNs (Wu et al., 2018). However, the maximum patch size is limited by the memory of the GPU hardware. Thus we have created the training data by extracting image patches of size 480×480. After slicing the original images and labels, we split the 480×480 patches into a training set and a validation set with a ratio of 0.25. To test the performance of the models, a test image has been cropped from a large DOM and manually annotated as a reference map (Figure 5). The test image is also from the ZY-3 satellite and covers 40 km² in East Asia.
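The slicing and splitting step can be sketched as follows. This is a minimal pure-Python illustration under several assumptions that the text does not specify: patches are non-overlapping, incomplete patches at the image border are discarded, the ratio of 0.25 is read as the fraction of patches held out for validation, and the split is a seeded random shuffle. The function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def patch_origins(height: int, width: int, patch: int = 480) -> List[Tuple[int, int]]:
    """Top-left (row, col) corners of non-overlapping patches fully inside the image."""
    return [
        (r, c)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]


def split_patches(
    items: Sequence[T], val_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the patches reproducibly and hold out val_fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    origins = patch_origins(960, 1920)          # hypothetical image size in pixels
    train, val = split_patches(origins)
    print(len(origins), len(train), len(val))
```

The same origins would be applied to the image and to its label map so that each patch keeps its matching annotation.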

Comparison with other networks
The performance and classified maps of FPN-VGG and other FCN networks are shown in Table 1 and Figure 6. These networks are all trained and tested with input data of RGB bands.
As Table 1 shows, the worst result for all models is obtained for the class "road", and Segnet cannot extract the roads at all. This problem is mainly due to the fact that the proportion of the road class in the training dataset is much smaller than that of the other classes (see Figure 4). Roads being overshadowed by contiguous trees and buildings might also contribute to this problem (see Figure 5(a)).
In terms of overall performance, FPN-VGG surpasses the other three networks. FPN-VGG achieves the highest mF1 (66.0%), mIoU (52.7%) and OA (82.9%), followed by FCN-8s. Segnet and U-net perform worse than FCN-8s. As Figure 6 shows, the map classified by U-net is more fragmented than the maps classified by the other networks.

Figure 6. The predictions by different networks with input data of RGB images: (a) FCN-8s, (b) Segnet, (c) U-net, (d) FPN-VGG.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition)

Performance of FPN-VGG models with different input bands
Unlike natural photos, satellite images usually have more than 3 spectral bands, containing the visible (RGB) and the near infrared (N). Thus, we also train the FPN-VGG with the N band as supplementary input.
The accuracy of FPN-VGG models with different input spectral bands is listed in Table 2. The performance of the model with NRG input is significantly better than that of the model with RGB input. Although the OA of the model with NRG input is only 1.0% higher than that of the model with RGB input, the mF1 and mIoU are 4.7% and 5.4% higher respectively. This gain comes mainly from the improvement on "road" and "water", which indicates that the NRG bands are advantageous for extracting road and water.
Although the model with 4-band input has one more band than the model with NRG input, its performance is slightly lower. This means that more input bands do not necessarily yield better performance, so the band combination should be chosen for the specific classification task.
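The band combinations compared in this section can be formed by reordering the channels of a four-band image. The sketch below is a minimal pure-Python illustration; the storage order (B, G, R, N), the nested-list layout `[row][col][band]` and the function name are assumptions for illustration and may differ from the actual ZY-3 product layout.

```python
from typing import List, Sequence

Image = List[List[List[float]]]  # assumed layout: [row][col][band]

# Assumed storage order of the four bands in the source image.
BAND_ORDER = ("B", "G", "R", "N")


def select_bands(image: Image, combo: str, order: Sequence[str] = BAND_ORDER) -> Image:
    """Return a copy of the image holding only the bands in `combo`, in that order.

    For example, combo="NRG" yields near infrared, red and green as channels 0, 1, 2.
    """
    indices = [order.index(band) for band in combo]
    return [[[pixel[i] for i in indices] for pixel in row] for row in image]


if __name__ == "__main__":
    tiny = [[[1, 2, 3, 4], [5, 6, 7, 8]]]   # one row, two pixels, bands B, G, R, N
    print(select_bands(tiny, "RGB"))
    print(select_bands(tiny, "NRG"))
```

Both RGB and NRG give three-channel inputs, so either can be fed to a network whose first layer expects three channels.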

Transfer learning
In this section, we initialize the FPN-VGG with the ImageNet pretrained model and use NRG input bands. Table 3 provides the comparative results of ImageNet initialization and random initialization. Compared with random initialization, the FPN-VGG initialized with the ImageNet pretrained model improves on all classes except background. The mF1, mIoU and OA are improved by 1.6%, 1.9% and 1.0% respectively. This indicates that the ImageNet model benefits the model with NRG input bands even though it was trained on the RGB ImageNet dataset.
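The initialization compared here amounts to copying pretrained convolutional weights into the encoder while leaving the remaining layers at their random values. The sketch below illustrates this with plain dictionaries mapping layer names to weights. It is a framework-free, hypothetical illustration: the layer names follow the Keras VGG16 naming pattern ("block1_conv1", "fc1", "predictions"), which is an assumption about tooling that the text does not state, and `fpn_lateral_1` is an invented decoder layer name.

```python
from typing import Any, Dict, List, Tuple


def init_encoder_from_pretrained(
    model_weights: Dict[str, Any],
    pretrained_weights: Dict[str, Any],
    skip_prefixes: Tuple[str, ...] = ("fc", "predictions"),
) -> Tuple[Dict[str, Any], List[str]]:
    """Copy pretrained weights into matching layers, skipping the top dense layers.

    Returns the initialized weight dictionary and the names of the transferred
    layers. Layers absent from the pretrained model keep their current values.
    """
    initialized = dict(model_weights)  # shallow copy, the input is not modified
    transferred = []
    for name, weights in pretrained_weights.items():
        if name.startswith(skip_prefixes):
            continue  # top fully connected layers are excluded from the transfer
        if name in initialized:
            initialized[name] = weights
            transferred.append(name)
    return initialized, transferred


if __name__ == "__main__":
    pretrained = {"block1_conv1": "imagenet_w1", "block5_conv3": "imagenet_w2",
                  "fc1": "imagenet_fc1", "predictions": "imagenet_out"}
    model = {"block1_conv1": "random", "block5_conv3": "random",
             "fpn_lateral_1": "random"}
    print(init_encoder_from_pretrained(model, pretrained))
```

Since an NRG image has three channels like an RGB image, the pretrained first convolution can be reused without changing its shape; only the meaning of the channels differs.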

Training with dice loss function
To examine the classification ability of FPN-VGG with dice loss on the unbalanced dataset, we present the results of models trained with different loss functions in Table 4.
Compared with the model trained with L_categorical_crossentropy, the OA, mF1 and mIoU of the model trained with L_dice are worse, but its F1 scores for "background" and "road" are 2.7% and 2.9% higher respectively. This shows that the dice loss can improve the classification of minority classes. Furthermore, the model trained with L_cce_dice has the highest mF1 (75.9%) and mIoU (63.2%). The best prediction map is shown in Figure 7. Compared with the model trained with L_categorical_crossentropy, the mF1 and mIoU of the model trained with L_cce_dice are improved by 3.6% and 3.2% respectively. In particular, the F1 scores of "background", "building", "road" and "water" obtained with L_cce_dice are higher than those of the other two models. It can be seen that L_cce_dice is more competent for unbalanced multiclass classification than L_categorical_crossentropy and L_dice.

CONCLUSION
In this paper we present a new deep learning modelling framework for land cover classification of high spatial resolution satellite imagery. The framework, named FPN-VGG, is based on feature pyramid networks combined with VGG16. The performance of our framework is evaluated on a manually annotated dataset. The dataset consists of images with four spectral bands (blue, green, red and near infrared) and corresponding labels of 6 classes (background, low vegetation, tree, building, road, water). The training data are extracted from ZY-3 satellite images covering East Asia.
We found that FPN-VGG can extract more accurate land cover maps from satellite images than other state-of-the-art fully convolutional networks (FCN-8s, Segnet, U-net). The experimental results show that using NRG spectral bands as input and initializing with the ImageNet model improve the performance of FPN-VGG on the small dataset. In addition, the model trained with the sum of the categorical cross entropy loss and the dice loss is more competent for classification on the unbalanced dataset.
