The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B2-2020, 605–610, 2020
https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-605-2020

  12 Aug 2020


DUAL PYRAMIDS ENCODER-DECODER NETWORK FOR SEMANTIC SEGMENTATION IN GROUND AND AERIAL VIEW IMAGES

S. L. Jiang1,3, G. Li3, W. Yao1,2, Z. H. Hong4, and T. Y. Kuc3
  • 1Department of Land Surveying and Geo-informatics, The Hong Kong Polytechnic University, Hong Kong
  • 2Research Institute for Sustainable Urban Development, The Hong Kong Polytechnic University, Hong Kong
  • 3College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
  • 4College of Information Technology, Shanghai Ocean University, Shanghai, China

Keywords: Semantic segmentation, Encoder-decoder network, Convolutional neural network, Aerial and ground view images

Abstract. Semantic segmentation is a fundamental task in computer vision that assigns a category label to every pixel. Most existing methods decode only the deepest feature map, while high-level features are inevitably lost during down-sampling. In the decoder, transposed convolution or bilinear interpolation is widely used to restore the size of the encoded feature map; however, the up-sampling process itself is rarely optimized, which is detrimental to grouping and classification performance. In this work, we propose a dual pyramids encoder-decoder deep neural network (DPEDNet) to tackle these issues. The first pyramid integrates and encodes multi-resolution features through sequentially stacked merging, and the second pyramid decodes the features through dense atrous convolution with chained up-sampling. Without post-processing or multi-scale testing, the proposed network achieves state-of-the-art performance on two challenging benchmark image datasets covering both ground and aerial view scenes.
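
For readers unfamiliar with the atrous (dilated) convolution mentioned in the abstract, the following is a minimal one-dimensional sketch in plain Python. It is an illustration only and not the authors' DPEDNet implementation: the function names, the toy signal, and the kernel are assumptions made for this example. It shows how spacing the kernel taps `rate` samples apart widens the span each output sees without adding weights, using the standard relation k_eff = k + (k - 1)(rate - 1).

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous convolution (cross-correlation form, as used in CNNs).

    Hypothetical helper for illustration; real networks apply the same idea
    in 2-D over feature maps with learned weights.
    """
    span = effective_kernel_size(len(kernel), rate)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each kernel weight reads the input `rate` samples after the previous one.
        outputs.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return outputs


signal = [1, 2, 3, 4, 5, 6, 7]   # toy input, assumed for the example
kernel = [1, 0, -1]              # toy 3-tap kernel, assumed for the example

dense = atrous_conv1d(signal, kernel, rate=1)   # ordinary convolution, span 3
atrous = atrous_conv1d(signal, kernel, rate=2)  # same 3 weights, span 5
```

With `rate=2` the same three weights compare samples four positions apart instead of two, which is why atrous convolution is commonly used to enlarge the receptive field without further down-sampling.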