The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume XLIII-B3-2022
https://doi.org/10.5194/isprs-archives-XLIII-B3-2022-55-2022
30 May 2022

BUILDING EXTRACTION FROM HIGH-RESOLUTION REMOTE SENSING IMAGERY BASED ON MULTI-SCALE FEATURE FUSION AND ENHANCEMENT

Y. Chen, H. Cheng, S. Yao, and Z. Hu

Keywords: High-Resolution Remote Sensing, Building Extraction, Encoder and Decoder, Multi-Scale Features, Dual Attention

Abstract. The accurate detection and mapping of buildings from high-resolution remote sensing (HRRS) images have attracted extensive attention. However, as artificial targets, buildings not only come in diverse types but also exhibit multi-scale characteristics and complex context, which poses great challenges to their accurate identification. To deal with this problem, a semantic segmentation model based on multi-scale feature fusion and enhancement (MSFFE) is proposed for building extraction from HRRS images. Specifically, the proposed model adopts an encoder-decoder structure. In the encoding stage, a densely connected convolutional neural network is used as the encoder to extract multi-level spatial and semantic features. To effectively exploit the multi-scale features of buildings, a multi-scale feature fusion (MSFF) module between the encoder and the decoder is designed to distinguish buildings of different scales in complex scenes. In the decoding stage, an attention weighted semantic enhancement (AWSE) module is introduced into the decoder to assist the up-sampling process. It not only makes full use of the multi-level features output by the encoder, but also highlights the key local semantic information of buildings. To verify the effectiveness of the proposed model, experiments were conducted on two building segmentation datasets, WHU and INRIA. The preliminary results show that the proposed model can effectively identify buildings of different scales in complex scenes, and outperforms representative networks including FCN, U-Net, DeepLabV3+ and MA-FCN.
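The two core ideas of the abstract, fusing multi-level encoder features at a common resolution (MSFF) and reweighting the fused channels with attention (AWSE-style enhancement), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the upsampling method (nearest-neighbor), the softmax channel attention, and all shapes are illustrative assumptions standing in for the learned convolutional modules of the paper.

```python
import numpy as np

def upsample_nearest(x, factor):
    # x: (C, H, W) -> (C, H*factor, W*factor) by nearest-neighbor repetition
    # (stand-in for the learned up-sampling used in the actual decoder)
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def multi_scale_fusion(features):
    # features: list of (C_i, H_i, W_i) maps from fine to coarse encoder levels;
    # bring every level to the finest resolution, then concatenate channels
    target_h = features[0].shape[1]
    fused = [upsample_nearest(f, target_h // f.shape[1]) for f in features]
    return np.concatenate(fused, axis=0)

def channel_attention(x):
    # squeeze: global average pool each channel map -> (C,)
    w = x.mean(axis=(1, 2))
    # excite: softmax over channels, then reweight each channel map
    # (a simplified stand-in for the paper's attention weighting)
    w = np.exp(w - w.max())
    w /= w.sum()
    return x * w[:, None, None]

# toy multi-level features from a hypothetical dense encoder
f1 = np.random.rand(8, 32, 32)   # fine level
f2 = np.random.rand(16, 16, 16)  # mid level
f3 = np.random.rand(32, 8, 8)    # coarse level

fused = multi_scale_fusion([f1, f2, f3])
enhanced = channel_attention(fused)
print(fused.shape)     # (56, 32, 32)
print(enhanced.shape)  # (56, 32, 32)
```

Concatenating the upsampled levels preserves both fine spatial detail and coarse semantic context, which is why buildings of very different footprint sizes can be separated by the same decoder head.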