A SHORT-CUT CONNECTIONS-BASED NEURAL NETWORK FOR BUILDING EXTRACTION FROM HIGH RESOLUTION ORTHOIMAGERY

Extracting building footprints from high-resolution remote sensing images with deep learning-based (DL-based) methods is an active research area. However, the extraction results generally suffer from blurred edges, rounded corners and loss of detail. Hence, this article presents a detail-oriented deep learning network named eU-Net (enhanced U-Net). In the proposed method, the imagery is first sent into a pre-module, which consists of the Canny edge detector, Principal Component Analysis (PCA) and inter-band ratio operations, before being fed into the network. Then, skip connections are used in the network to reduce the loss of detail during edge and corner detection. The encoding and decoding modules of the network are redesigned with shortcut connections and stacked layers to expand the receptive field. Finally, a Dropout module is added in the bottom layer of the network to avoid over-fitting. The experimental results indicate that the proposed method outperforms other commonly used and state-of-the-art methods, namely FCN-8s, U-Net, DeepLabv3 and Fast SCNN.


INTRODUCTION
Building mapping is an important task for urban planning, disaster monitoring, traffic management and scientific planning of the ecological environment (Wei and Liu, 2020). While traditional in-situ mapping and surveying can generate highly accurate maps, they tend to be inefficient. With the development of sensor technologies, remote sensing-based object extraction provides an efficient way to map buildings. Among remote sensing-based methods for building footprint extraction, deep learning-based methods (Xin et al., 2012) show higher performance than visual interpretation and traditional machine learning-based methods. For instance, He et al. (2020) utilized Mask R-CNN with an attention mechanism for building footprint extraction. The extraction results revealed that deep learning-based methods were better than traditional machine learning methods, such as Support Vector Machines (SVM) (Cortes and Vapnik, 1995). Therefore, the focus of research on building footprint extraction has turned to deep learning-based methods.
With the revival of deep learning techniques, various networks have been widely used in remote sensing applications and building footprint extraction. Ronneberger (2015) proposed U-Net, which uses an encoder-decoder structure. This structure is effective, and many researchers have adopted it as a module in combination with their existing work. Wang et al. (2020) proposed the Efficient Non-local Residual U-shape Network (ENRU-Net), which combined an improved non-local block with the U-Net structure to capture contextual information. This network achieved better results than Fully Convolutional Networks (FCN)-8s (Wu, 2015), U-Net, SegNet (Badrinarayanan et al., 2017) and DeepLab v3 (Chen et al., 2017). Xu et al. (2021) improved U-Net by combining it with an attention mechanism and developed a loss function; they achieved an F1-score more than 10.78% higher than that of U-Net. Liu et al. (2021) demonstrated through large-scale experiments that pyramid scene parsing and a residual connection enable U-Net to capture global context information. The DeepLab v3 architecture (Chen et al., 2017) combined Atrous Spatial Pyramid Pooling (ASPP) with an encoder-decoder and achieved higher performance than ResNet-38 (He et al., 2016) and PSPNet (Zhao, 2016). In addition, Yang et al. (2020) used the DeepLabv3 architecture together with dilated convolution to extract building footprints and showed that the combination is more effective than DeepLabv3 alone. Zhang et al. (2020) used depthwise separable convolution to improve U-Net and demonstrated the efficiency of the resulting XU-Net. Zhou et al. (2020) introduced encoder and decoder sub-networks connected via a series of nested, dense skip connections, which showed high performance in their experiments. This network addressed the disadvantage of the plain skip connection and used deep supervision to help the network choose a path by itself. Other modifications developed to improve U-Net include combining U-Net and Mask R-CNN as an ensemble model (Vuola et al., 2019) and using a pre-trained VGG encoder in U-Net (Shvets, 2018).
* Corresponding author: Zhimeng He, 20191235004@nuist.edu.cn
However, none of the methods mentioned above can fully overcome the loss of detailed information during edge and corner detection, which is important for the accuracy of building footprint extraction. To address this issue, salient object detection, which focuses on edge preservation, has been proposed. At present, binary labels are more popular in the literature than class activation maps and are more beneficial for segmentation applications. Mainstream research directions include center-surround difference (L. et al., 1998; T. et al., 2007), uniqueness prior (K et al., 2013; P et al., 2013) and background prior (Y. et al., 2012; W. et al., 2014). However, these approaches do not account for both context information and clear edges, so their accuracy can be improved further. Wang et al. (2016) proposed a method based on the Fast R-CNN framework (Girshick, 2015) to generate a saliency map with both clear edges and context information. The imagery is segmented into super-pixel regions and edge regions to achieve state-of-the-art performance.
This paper presents a new method called enhanced U-Net (eU-Net), which is based on U-Net. In this method, dilated convolution (Yu and Koltun, 2015), a dropout module (Hinton et al., 2013), jump connections (Ronneberger, 2015), short-cut connections (He et al., 2016) and a pre-module were applied to preserve the detailed information of building footprints. The rest of the paper is organized as follows. Section 2 describes the pre-module and the eU-Net architecture. Section 3 presents the results of two experiments and demonstrates the performance of eU-Net. Section 4 concludes the paper.

The pre-modules
To preserve more detailed information, a pre-module was inserted before the images were passed to the deep learning network. The pre-module converted the original images to 6-band tensors via Principal Component Analysis (PCA) (Ke and Sukthankar, 2004), of which only the first component was kept, the Canny edge detection operator (Canny and John, 1986) and the Red Green Index. Canny edge detection is an operator that uses a multi-stage algorithm to detect a wide range of edges in images. The Red Green Index is similar to the Normalized Difference Vegetation Index (NDVI), which is an index used to detect live green plant canopies in multispectral remote sensing data.
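The PCA step can be illustrated with a small, dependency-free Python sketch. It assumes that each pixel is an (R, G, B) tuple and that only the projection onto the first principal component is kept, as stated above. The function name `first_principal_component`, the power-iteration solver and its starting vector are illustrative choices rather than the implementation used in this study.

```python
import math


def first_principal_component(pixels, iters=200):
    """Return (direction, scores) of the first principal component.

    pixels: list of equal-length band tuples, e.g. (r, g, b); needs >= 2 pixels.
    direction: unit vector of the dominant band combination.
    scores: projection of every mean-centered pixel onto that direction.
    """
    n = len(pixels)
    dims = len(pixels[0])
    mean = [sum(p[d] for p in pixels) / n for d in range(dims)]
    centered = [[p[d] - mean[d] for d in range(dims)] for p in pixels]

    # Sample covariance matrix between bands (dims x dims).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]

    # Power iteration: repeated multiplication by cov converges to the
    # eigenvector with the largest eigenvalue (the first component).
    vec = [1.0 / (d + 1) for d in range(dims)]  # arbitrary, asymmetric start
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance in the data; keep the current vector
        vec = [x / norm for x in nxt]

    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores


# Usage: three pixels that vary only along the grey axis (1, 1, 1).
direction, scores = first_principal_component([(0, 0, 0), (1, 1, 1), (2, 2, 2)])
```

In this sketch the resulting `scores` would form one band of the 6-band tensor; the Canny edge map and the Red Green Index would be computed separately and stacked with it.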
The pre-modules were used for (1) emphasizing edges and (2) enhancing inter-band links. Deep neural networks can be seen as multi-layer matrix multiplications, which can only capture multiplicative relationships between bands; a ratio between bands is therefore provided explicitly. In this study, the red and green bands were chosen to calculate the Red Green Index because the buildings were most distinct in the resulting image (Figure 1).

Figure 1. Ratio results of different bands
The inter-band ratio, the Red Green Index, is calculated by Equation (1),
where $(i, j)$ denotes the coordinates of a pixel in the image, $RGI(i, j)$ is the resulting pixel value at point $(i, j)$, $G(i, j)$ is the pixel value at point $(i, j)$ in the green band, and $R(i, j)$ is the pixel value at point $(i, j)$ in the red band.
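As a minimal sketch of an inter-band ratio of this kind, the Python snippet below assumes that the Red Green Index takes the same normalized-difference form as NDVI, that is (R - G) / (R + G). This form, the function name `red_green_index` and the small constant `eps` that guards against division by zero are assumptions made for illustration, not the definition given by Equation (1).

```python
def red_green_index(red, green, eps=1e-6):
    """NDVI-style normalized difference of the red and green bands.

    ASSUMED form: (R - G) / (R + G + eps), evaluated pixel by pixel.
    red, green: 2-D lists (rows of pixel values) with identical shapes.
    """
    return [
        [(r - g) / (r + g + eps) for r, g in zip(red_row, green_row)]
        for red_row, green_row in zip(red, green)
    ]


# Usage: a 1 x 2 image; the first pixel is redder than it is green,
# the second has equal red and green values.
rgi = red_green_index([[200.0, 50.0]], [[100.0, 50.0]])
```

Under this assumed form the index lies between -1 and 1, is positive where red dominates and is zero where the two bands are equal.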

eU-Net network architecture
U-Net is an effective network because of its jump connections, symmetric structure and multi-scale feature extraction capability. The jump connections help to prevent exploding and vanishing gradients; the symmetric structure of the encoder and decoder not only eases the jump connections but also facilitates end-to-end classification (Hao et al., 2020); and the multi-scale feature extraction acquires both detailed information and high-level semantic information, which allows the network to obtain more information from the images. However, the results of U-Net are commonly inaccurate with respect to building boundaries. Figure 3 shows the architecture of eU-Net. A Dropout module was employed between the encoder and decoder to avoid overfitting (Hinton et al., 2013). Dropout not only improved the robustness of the network, but also reduced the training time to 1-p of the original time, where p is the probability of dropping a unit. Dropout was disabled during the testing phase, which ensured a stable performance.
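The behaviour of Dropout in the training and testing phases can be sketched in plain Python. The snippet assumes the common "inverted dropout" formulation, in which kept activations are rescaled by 1 / (1 - p) during training so that no rescaling is needed at test time; the function name `dropout` and the list-based interface are illustrative and do not reproduce the Keras layer used in this study.

```python
import random


def dropout(values, p, training, rng=None):
    """Inverted dropout on a flat list of activations.

    p: probability of dropping (zeroing) a unit, with 0 <= p < 1.
    training: when False, the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        return list(values)  # dropout is inactive at test time
    rng = rng or random.Random()
    keep = 1.0 - p
    # Keep each unit with probability (1 - p) and rescale it so that the
    # expected activation matches the value seen at test time.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Usage: identical activations, half of them dropped on average.
train_out = dropout([2.0] * 1000, p=0.5, training=True, rng=random.Random(0))
test_out = dropout([2.0] * 1000, p=0.5, training=False)
```

With this formulation the test-time output is deterministic, which is consistent with the stable performance noted above.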
Dilated convolution was used to enlarge the receptive field of the module. The enlargement of the kernel is determined by the dilation rate. In this study, dilated convolution was applied after the pre-module. The feature maps produced by dilated convolutions with dilation rates of 2, 3 and 5 were connected to three encoding modules (all except the first) via jump connections. The encoding and decoding units were built from the Y-residual unit (Figure 2) with short-cut connections to increase the robustness of the model. Receptive fields of different sizes are helpful for analyzing targets of different sizes, but enlarging the receptive field by increasing the kernel size increases the number of parameters. Therefore, in eU-Net, two 3x3 kernels were applied instead of one 5x5 kernel, which results in the same receptive field with fewer parameters (Szegedy et al., 2015). A concatenation operation was applied to merge the outputs of the 1x1 and 3x3 convolution layers.
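The arithmetic behind these choices can be checked with a short Python sketch. It assumes stride-1 convolutions, a 3x3 base kernel for the dilated convolutions and, for the parameter count, 64 input and 64 output channels without bias terms; the kernel size and channel numbers are illustrative assumptions, not values reported for eU-Net.

```python
def effective_kernel_size(kernel, dilation):
    """Side length covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def stacked_receptive_field(kernels):
    """Receptive field of consecutive stride-1 convolutions."""
    field = 1
    for k in kernels:
        field += k - 1  # each layer widens the field by (k - 1) pixels
    return field


def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias excluded)."""
    return kernel * kernel * c_in * c_out


# Dilation rates 2, 3 and 5 on an assumed 3x3 kernel cover 5, 7 and 11 pixels.
dilated_sizes = [effective_kernel_size(3, d) for d in (2, 3, 5)]

# Two stacked 3x3 layers see the same 5x5 window as a single 5x5 layer ...
same_field = stacked_receptive_field([3, 3]) == stacked_receptive_field([5])

# ... but need fewer weights (assumed 64 channels in and out).
two_3x3 = 2 * conv_weights(3, 64, 64)  # 73,728
one_5x5 = conv_weights(5, 64, 64)      # 102,400
```

For equal channel counts the ratio is 18 to 25, so the stacked 3x3 layers use 28% fewer weights than the single 5x5 layer.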

Evaluation metrics
In this study, Overall Accuracy (OA), Recall, Precision, F1-score and Intersection over Union (IoU) were used to evaluate the performance of the building footprint extraction models. All metrics were calculated as follows:

$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Recall = \frac{TP}{TP + FN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$IoU = \frac{TP}{TP + FP + FN}$$

where True Positive (TP) is the number of positive (building) pixels correctly predicted as positive; False Positive (FP) is the number of negative pixels wrongly predicted as positive; False Negative (FN) is the number of positive pixels wrongly predicted as negative; and True Negative (TN) is the number of negative pixels correctly predicted as negative.
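These metrics can be computed from the four pixel counts with a few lines of Python. The sketch below assumes flattened binary masks in which 1 marks a building pixel; the function names and the convention of returning 0.0 when a denominator is zero are illustrative choices.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, FN and TN for two flat binary masks (1 = building)."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    # Assumed convention: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, fp, fn, tn):
    """Return OA, Recall, Precision, F1-score and IoU as a dict."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "OA": _ratio(tp + tn, tp + tn + fp + fn),
        "Recall": recall,
        "Precision": precision,
        "F1": _ratio(2 * precision * recall, precision + recall),
        "IoU": _ratio(tp, tp + fp + fn),
    }


# Usage: 8 pixels, 3 of which are building in the reference mask.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = segmentation_metrics(*confusion_counts(truth, pred))
```

In this example TP = 2, FP = 1, FN = 1 and TN = 4, which gives OA = 0.75 and IoU = 0.5.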

Implementation Details
In this experiment, the Rectified Linear Unit (ReLU) was selected as the activation function for all layers except the output layer; ReLU suppresses negative values and helps to prevent exploding and vanishing gradients. The batch size, loss function, learning rate and number of epochs were set to 4, binary cross-entropy loss, 1e-4 and 100, respectively, which are the same as the experimental settings of He et al. (2021). All algorithms were trained and tested on a GeForce RTX 2060 with CUDA 10.2. The experiments were performed using the Keras library with the TensorFlow framework.
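For reference, the binary cross-entropy loss named above can be written out in plain Python. This is a sketch of the standard definition averaged over pixels, not the Keras implementation; the clipping constant `eps`, which keeps the logarithm finite, is an assumed value.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities.

    y_true: flat list of 0/1 labels (1 = building).
    y_pred: flat list of predicted probabilities in [0, 1].
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Usage: an uninformative prediction of 0.5 for every pixel costs ln(2).
loss_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_confident = binary_cross_entropy([1, 0], [0.99, 0.01])
```

The loss falls towards zero as the predicted probabilities approach the labels, which is the quantity minimized during the 100 training epochs.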

Experiment Data
In this study, the eU-Net model and the other models were tested on the Waterloo Building Dataset, which was released by He et al. (2021).

Ablation study
In this section, the contribution of the pre-modules to the performance of the method is evaluated using the Waterloo Building Dataset (He et al., 2021); the mosaic results of eU-Net are shown in Figure 6. The following five experiments were conducted: (1) an experiment without the pre-module (baseline), (2) an experiment without the PCA, but with the edge detection and inter-band ratio operations, (3) an experiment without the custom ratios, but with the PCA and edge detection, (4) an experiment without edge detection, but with the inter-band ratio operations and PCA, and (5) an experiment with the full pre-module. Table 2 presents the results and Figure 5 shows the training accuracy and loss. As shown by the ablation study results in Table 2 and Figure 7, the pre-module yielded a significant improvement in all the metrics compared with the baseline. The F1-score increased from 89.1% to 93.5%, which indicates that the pre-module can effectively help deep learning networks to improve their accuracy. Apart from the baseline, the lowest F1-score of 90.2% occurred in Experiment (4), which removed the inter-band ratio compared with Experiment (5). This removal resulted in a decrease in F1-score of 3.4%, which indicates that the inter-band ratio module is the most effective of the three pre-processing modules. Experiment (3) showed that the PCA component makes only a minor contribution to the performance improvement in terms of F1-score, which decreased by 1% when the PCA component was removed from the pre-modules.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B1-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France

Performance of eU-Net and Comparative Study
As shown in Table 3, a comparative study was conducted between eU-Net and the methods used in He et al. (2021) on the Waterloo Building Dataset. For a fair comparison, we used the same experimental settings as those in He et al. (2021). Specifically, the batch size, loss function, learning rate and number of epochs were set to 4, binary cross-entropy loss, 1e-4 and 100, respectively. eU-Net achieved higher values of IoU, mIoU, Recall and F1-score, and otherwise produced results similar to those of the other methods. These results demonstrate that eU-Net performs well compared with state-of-the-art DL-based methods for building footprint extraction. The numbers of parameters of eU-Net (ours) and Mask R-CNN (He et al., 2021) were also compared.

CRITICAL ASSESSMENT
The manual search for inter-band ratio relationships and edge enhancement generally has limited effectiveness. When the target changes, the pre-processing needs to be adapted specifically. Edge enhancement with manually selected features significantly improves extraction accuracy, but in forest extraction it tends to cause pepper noise. In future research, the pre-processing module of this paper will be replaced by a deep learning network that performs feature enhancement. Two deep learning networks will be combined so that each carries a specific function: one for feature enhancement as pre-processing and the other for extraction. Deep learning networks should be modularized so that they become more controllable and the function of each module becomes recognizable.

CONCLUSION
The primary goal of this research was to extract building footprints with precise edges. For this purpose, a new deep learning network named eU-Net was developed, which consists of two components: the pre-modules and the deep learning neural network. The pre-modules were constructed using PCA, edge detection and the inter-band ratio results. The processed 6-band image set was passed to the designed FCN, which employs dilated convolutions, jump connections, Y-residual units and a U-type architecture. The comparative study demonstrated that the method used in this study exhibits high performance in building footprint extraction compared with other commonly used and state-of-the-art methods.