IMPACT OF DEEP LEARNING-BASED SUPER-RESOLUTION ON BUILDING FOOTPRINT EXTRACTION

Automated building footprint extraction from High Spatial Resolution (HSR) remote sensing images plays an important role in urban planning and management, as well as in hazard and disease control. However, HSR images are not always available in practice. In these cases, super-resolution, especially with deep learning (DL)-based methods, can provide higher spatial resolution images from lower resolution inputs. DL-based super-resolution methods are widely used in a variety of remote sensing applications, yet few studies have focused on their impact on building footprint extraction. As such, we present an exploration of this topic. Specifically, we first super-resolve the Massachusetts Building Dataset using bicubic interpolation, a pre-trained Super-Resolution CNN (SRCNN), a pre-trained Residual Channel Attention Network (RCAN), and a pre-trained Residual Feature Aggregation Network (RFANet). Then, using the dataset at its original resolution as well as the four super-resolved versions, we employ the High-Resolution Network (HRNet) v2 to extract building footprints. Our experiments show that super-resolving either the training or the test dataset using the latest high-performance DL-based super-resolution methods can improve the accuracy of building footprint extraction. Although SRCNN-based building footprint extraction gives the highest Overall Accuracy, Intersection over Union and F1 score, we suggest using the latest super-resolution methods to process images before building footprint extraction, because pre-trained SRCNN has a fixed scale ratio and converges slowly in training.


Motivation
With rapid urbanization, automated mapping of buildings has become important for urban planning and management (Rastogi et al., 2020). It also plays a key role in natural hazard and disease control (Thomas et al., 2013). With the development of imaging techniques, remote sensing-based methods provide promising ways for building footprint extraction and building mapping. High spatial resolution (HSR) images can be used to extract detailed information and map buildings accurately. However, HSR images are sometimes unavailable in practice. In such circumstances, super-resolving existing data is both economical and practical compared to acquiring new data or waiting for a new sensor (Shermeyer and Van Etten, 2019).
When considering existing super-resolution methods, deep learning (DL)-based super-resolution shows higher performance than traditional methods (Dong et al., 2015). There are two classes of super-resolution techniques: single-image super-resolution (SISR) and joint-image super-resolution. The former super-resolves images by taking only the low spatial resolution target images as input, while the latter takes auxiliary data to aid the super-resolution process. In most cases, the former is preferred given limited availability of data. In remote sensing, DL-based SISR methods are widely used to super-resolve remote sensing images. However, from the literature, we are only aware of a few works exploring the impact of single-image super-resolution on downstream tasks. As such, we would like to explore the impact of using cutting-edge DL-based SISR on building footprint extraction.

Objectives of the study
This paper aims to explore the impact of super-resolution on building footprint extraction. We applied bicubic interpolation and a pre-trained Super-Resolution CNN (SRCNN) (Dong et al., 2015), Residual Channel Attention Network (RCAN), and Residual Feature Aggregation Network (RFANet) (Liu et al., 2020) to super-resolve the Massachusetts Building Dataset (Mnih, 2013). A comparative study was then conducted using the High-Resolution Network (HRNet) v2 for building footprint extraction from the original, the bicubically interpolated and the super-resolved Massachusetts building datasets. The contribution of this paper is the exploration of the impact of DL-based SISR on building footprint extraction.

Related work
Super-resolution methods: Super-resolution refers to recovering high spatial resolution images from low spatial resolution images (Dong et al., 2015). According to the data used as model input, there are single-image super-resolution and joint-image super-resolution methods. The former does not require multiple modalities (in terms of resolutions and spectral characteristics) in the data, which makes it easier to implement. Therefore, it is widely used in different disciplines.
Starting from SRCNN, CNNs were introduced into single-image super-resolution and outperformed conventional methods (Dong et al., 2015). To further improve performance, methods with more complex architectures and more parameters were proposed (He et al., 2021). Nowadays, the state-of-the-art single-image super-resolution methods, including RCAN and RFANet, can directly super-resolve low spatial resolution images to higher spatial resolution without any pre-processing steps (such as the bicubic interpolation in SRCNN).

The application of CNN-based single-image super-resolution in Geoscience
In Geoscience, single-image super-resolution has also shown high performance in multiple applications. Ducournau and Fablet (2016) applied SRCNN to downscale satellite-based sea surface temperature data. Lei et al. (2017) proposed a Local-Global Combined Network (LGCNet) for super-resolving remote sensing images. Ma et al. (2018) introduced transfer learning into remote sensing image super-resolution and proposed a network named Transferred Generative Adversarial Network (TGAN), which achieved the highest performance with respect to the benchmark methods used in their experiment. In Tuna et al. (2018), SRCNN and Very Deep Super-Resolution (VDSR) were adapted to super-resolve remote sensing images. Shermeyer and Van Etten (2019) explored the effect of super-resolution on object detection from satellite images. Ma et al. (2019) combined the wavelet transform with the recursive Res-Net (Tai et al., 2017) for super-resolving remote sensing images, obtaining the highest performance in their comparative study. In He et al. (2021), SRCNN, VDSR and RCAN were used and compared for downscaling Gravity Recovery and Climate Experiment (GRACE) Terrestrial Water Storage data. In a recent work, Lei and Shi (2021) proposed a Hybrid-scale Self-similarity Exploitation Network (HSENet), which applied a Single-scale Self-similarity Exploitation Module (SSEM) and a Cross-scale Connection Structure (CCS); their experiments demonstrated the high performance of HSENet compared to state-of-the-art methods.
Apart from these supervised super-resolution networks, unsupervised super-resolution networks were also proposed and used in remote sensing. Haut et al. (2018) developed an hourglass generative network for super-resolution which did not require high-resolution images in the training process. In 2019, the Cycle-CNN, which combined two generative CNNs, was proposed for super-resolving remote sensing images in an unsupervised manner; the experiments showed its robustness and competitive performance compared to other methods.
As we mentioned above, single-image super-resolution is widely used to super-resolve remote sensing images. However, we are only aware of a few works exploring the impact of super-resolution on downstream tasks, especially semantic segmentation and instance segmentation. Among the research summarized above, only Shermeyer and Van Etten (2019) and Lei and Shi (2021) explored the effect of super-resolution, on object detection and remote sensing scene classification, respectively. Further exploration of the impact of single-image super-resolution on semantic segmentation is therefore necessary. In this work, we selected building footprint extraction as the downstream task and studied the effects of single-image super-resolution on it.

CNN-based building footprint extraction
With the rapid development of Fully Convolutional Networks (FCN), they have been widely used in different remote sensing applications (Zhu et al., 2017). For building footprint extraction, methods were proposed for extracting building masks and building boundaries using CNNs. For mask extraction, from ConvNet (Yuan et al., 2017), Res-U-Net (Xu et al., 2018), Deep Encoding Network (DE-Net), Capsule Feature Pyramid Network (CapsFPN) (Yu et al., 2020) and Multipath Hybrid Attention Network (MHA-Net) (Cai et al., 2021) to the Coarse-to-fine Boundary Refinement Network (CBR-Net) (Guo et al., 2022), newly developed deep learning techniques were gradually introduced to this task, including skip connections, capsule networks, attention schemes, and multi-scale feature extraction. Notably, with increasingly sophisticated architectures, the performance of building footprint extraction improved gradually.
For boundary delineation, two approaches were developed. The first introduces a regularization process after mask extraction. The second embeds an active contour model or a recurrent neural network in a CNN-based network for end-to-end building delineation. For the first type, the most representative method was proposed by Zhao et al. (2018), who used a regularization process to generate building footprint boundaries from Mask R-CNN mask extraction results. For end-to-end building delineation, Marcos et al. (2018) developed a Deep Structured Active Contours (DSAC) framework to extract building boundaries directly from input images. In recent work, Zhao et al. (2021) proposed a new network based on PolyMapper (Li et al., 2019), combining a recurrent neural network and a convolutional neural network to extract building boundaries directly from images; it showed the highest performance compared to other building delineation methods in their experiments.
Although CNN-based building footprint extraction methods have been quite successful in both mask extraction and building delineation, the spatial resolution of input images limits their applications. As we mentioned in the previous section, single-image super-resolution provides a promising way to overcome spatial resolution related issues in practice. However, the impact of super-resolution on building footprint extraction remains underexplored. Therefore, in this work, we aim to mitigate this gap.

METHOD

Data
Among all publicly available building datasets, we selected the Massachusetts Building Dataset for this work, as its 1 m spatial resolution limits detailed information extraction. The dataset consists of 137 training, 4 validation and 10 test aerial images of the Boston area, each of size 1500×1500 pixels. Due to computational requirements, we cropped and padded the images and masks into small patches of size 512×512.
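The cropping and padding step can be sketched as follows. `crop_to_patches` is a hypothetical helper (the paper does not publish its preprocessing code), assuming zero padding and non-overlapping tiles:

```python
import numpy as np

def crop_to_patches(image, patch=512):
    """Pad an image so each side is a multiple of `patch`, then tile it
    into non-overlapping patch x patch pieces (illustrative sketch)."""
    h, w = image.shape[:2]
    ph = (-h) % patch   # rows of zero padding needed at the bottom
    pw = (-w) % patch   # columns of zero padding needed at the right
    padded = np.pad(image, ((0, ph), (0, pw)) + ((0, 0),) * (image.ndim - 2))
    patches = []
    for i in range(0, padded.shape[0], patch):
        for j in range(0, padded.shape[1], patch):
            patches.append(padded[i:i + patch, j:j + patch])
    return np.stack(patches)

tile = np.zeros((1500, 1500, 3), dtype=np.uint8)   # one Massachusetts tile
patches = crop_to_patches(tile)
print(patches.shape)   # (9, 512, 512, 3)
```

Under this scheme each 1500×1500 tile is padded to 1536×1536 and split into a 3×3 grid of 512×512 patches.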
To explore the impact of DL-based SISR, we super-resolved the processed images to 30 cm per pixel using bicubic interpolation and the pre-trained SRCNN, RCAN and RFANet. We used bicubically interpolated masks as ground truth for all super-resolved images, as we found the interpolation sufficient to upsample the simple geometric shapes in the masks. To explore the impact of super-resolution on building footprint extraction, we applied High-Resolution Network (HRNet) v2, a high-performance semantic segmentation method, to extract building footprints from the original and super-resolved images. As we only care about the impact of super-resolution on building footprint extraction, we did not apply the more innovative mask extraction or boundary delineation methods in this work, although, in theory, they are affected by super-resolution in the same way.

Super-resolution methods
RFANet is composed of a head part, a trunk part, and a reconstruction part. The first and second parts perform shallow and deep feature extraction, respectively. The reconstruction part is an Efficient Sub-Pixel Convolutional Neural Network (ESPCNN) (Shi et al., 2016), which recovers high spatial resolution images from the feature maps generated by the trunk part. The trunk part consists of 30 Residual Feature Aggregation modules, each composed of 4 Enhanced Spatial Attention (ESA) blocks (Liu et al., 2020).
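The sub-pixel (pixel-shuffle) rearrangement at the heart of an ESPCNN-style reconstruction head can be illustrated in isolation. The following is a numpy sketch of the operation itself, not RFANet's trained layer:

```python
import numpy as np

def pixel_shuffle(feature_maps, r):
    """Rearrange an (H, W, C*r*r) feature map into an (H*r, W*r, C) output,
    i.e. the sub-pixel convolution step that trades channel depth for
    spatial resolution (illustrative numpy version)."""
    h, w, crr = feature_maps.shape
    c = crr // (r * r)
    x = feature_maps.reshape(h, w, r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)        # interleave the r x r sub-blocks
    return x.reshape(h * r, w * r, c)

x = np.arange(4).reshape(1, 1, 4)         # one pixel holding a 2x2 sub-block
print(pixel_shuffle(x, 2)[:, :, 0])       # [[0 1] [2 3]]
```

This is why the network itself can output a higher-resolution image without any bicubic pre-upsampling: the final convolution simply produces r² times more channels, which are then rearranged spatially.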
The architecture of RCAN is similar to that of RFANet. It is also composed of a head part, a trunk part, and a reconstruction part (including an upscale module and a convolutional layer). Each part serves a similar purpose to its counterpart in RFANet, and the reconstruction part also uses an ESPCNN. The trunk part of RCAN includes 10 residual groups, each of which has 20 residual channel attention blocks.
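The channel attention inside RCAN's residual channel attention blocks follows the squeeze-and-excitation pattern: global average pooling, a channel-downscaling then channel-upscaling transform, and a sigmoid gate. A minimal numpy sketch, with caller-supplied weight matrices standing in for the block's 1×1 convolutions, is:

```python
import numpy as np

def channel_attention(features, w_down, w_up):
    """Rescale each channel of an (H, W, C) feature map by a learned gate.
    w_down (C/r x C) and w_up (C x C/r) stand in for 1x1 convolutions with
    reduction ratio r; an illustrative sketch, not the trained RCAN block."""
    gap = features.mean(axis=(0, 1))            # squeeze: (C,) global average pool
    z = np.maximum(w_down @ gap, 0.0)           # channel downscaling + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w_up @ z)))    # channel upscaling + sigmoid
    return features * gate                      # excitation: per-channel rescale
```

In RCAN the gated output is then added back to the block input through a residual connection.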
The SRCNN was the first CNN-based single-image super-resolution method; it is composed of three convolutional layers. These layers handle patch extraction and representation, non-linear mapping, and reconstruction, with kernel sizes of 9×9, 1×1 and 5×5, respectively (Dong et al., 2015).
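Because the original SRCNN uses valid (unpadded) convolutions, each layer trims the spatial extent of its input; the border loss implied by the 9-1-5 kernel configuration can be worked out directly (a small illustrative calculation, not part of the paper's pipeline):

```python
def srcnn_output_size(n, kernels=(9, 1, 5)):
    """Spatial size after a stack of valid (no-padding) convolutions:
    each k x k layer removes k - 1 pixels per spatial dimension."""
    for k in kernels:
        n -= k - 1
    return n

print(srcnn_output_size(512))  # 500: a 512-pixel side loses 12 border pixels
```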
For detailed architecture information, such as the number of filters and the kernel sizes, we direct the readers to the original works.

Building footprint extraction method
Given the high performance of HRNet v2, we adopted it in this work for building footprint extraction. In HRNet v2, a combination of multi-resolution group convolutions and multi-resolution convolutions is embedded in a four-stage network, which results in strong feature representations and high performance in semantic segmentation. For detailed architecture information, we direct the readers to the original work.

Evaluation metrics
For single-image super-resolution, we applied the commonly used Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) (Liu et al., 2020) to evaluate the performance of the pre-trained models on the SWOOP dataset. To quantitatively evaluate building footprint extraction results, we applied Overall Accuracy (OA, or average pixel accuracy), Intersection over Union (IoU), mean IoU (mIoU), Precision, Recall and F1 score, which are widely used to evaluate CNN-based building footprint extraction methods (He et al., 2022). Details about these metrics are omitted here to conserve space.
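The metric definitions are standard; for concreteness, a numpy sketch of PSNR and of the binary-mask metrics as commonly defined is given below (variable names are ours, and degenerate empty-mask cases are not handled):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two images, in dB."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mask_metrics(pred, truth):
    """IoU, precision, recall and F1 for binary building masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)        # building pixels correctly predicted
    fp = np.sum(pred & ~truth)       # background predicted as building
    fn = np.sum(~pred & truth)       # building predicted as background
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1
```

mIoU is then the average of the per-class IoU values (here: building and background).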

Implementation details
For CNN-based super-resolution, we set the number of epochs to 20, the batch size to 16, the loss function to mean absolute error, and the optimizer to Adam. The learning rate was initialized to 1e-4 and halved every 1e5 iterations. For HRNet v2, we used the Adam optimizer, a learning rate of 1e-4, the Jaccard loss and 100 epochs. We conducted the experiments using the TensorFlow 2.4.0 framework with CUDA 11.2 on a single GeForce RTX 3090 GPU.
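The step-decay rule used for the super-resolution training (initial learning rate 1e-4, halved every 1e5 iterations) amounts to, framework-agnostically:

```python
def learning_rate(iteration, base_lr=1e-4, halve_every=100_000):
    """Step decay: halve the base learning rate once per `halve_every`
    iterations (plain-Python sketch of the schedule described above)."""
    return base_lr * 0.5 ** (iteration // halve_every)

print(learning_rate(0))        # 0.0001
print(learning_rate(250_000))  # 2.5e-05 after two halvings
```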

Qualitative evaluation
To qualitatively evaluate the impact of super-resolution on building footprint extraction, we selected two image patches. We compared the super-resolved images and the building footprints extracted by HRNet v2 models trained on the correspondingly super-resolved images. As shown in Fig. 1 to Fig. 5, the first row contains the original images and ground truth labels of the two example scenes. From the second to the fifth row, the super-resolved images and extracted building footprints are shown. In Fig. 1, the extraction model HRNet v2 was trained on the original images. In Fig. 2 to Fig. 5, the HRNet v2 models were trained on bicubically interpolated, SRCNN super-resolved, RCAN super-resolved and RFANet super-resolved images, respectively. For simplicity, in the rest of the paper, we denote the extraction models trained on the bicubically interpolated, SRCNN super-resolved, RCAN super-resolved and RFANet super-resolved Massachusetts training dataset as "BI-HRNet v2", "SRCNN-HRNet v2", "RCAN-HRNet v2" and "RFANet-HRNet v2", respectively. The model trained on the original Massachusetts Building Dataset is denoted as "HRNet v2".
As shown in Fig. 2 to Fig. 5, we found that models trained on images super-resolved by newer methods achieve higher performance on all test datasets, with the model trained on RFANet super-resolved images obtaining the best results. Any particular trained model performs best when the test set is processed or super-resolved in the same way as the training set; SRCNN and bicubic interpolation can be regarded as approximately the same (bicubic-like) type of super-resolution. When the test set was super-resolved with the same method as the training set, the newer method generated higher quality images and more accurate extraction results. For example, with the model RCAN-HRNet v2, the extraction results on RFANet super-resolved images were more accurate than those on RCAN super-resolved images. Consequently, we can summarize our findings as: 1) newer super-resolution methods facilitated higher-performance extraction models, allowing them to achieve more accurate extraction results; 2) for a given extraction model, when the training and test datasets shared a similar super-resolving process, more sophisticated super-resolution processing generated more accurate extraction results. These effects are quantified in the next section.

Quantitative evaluation
Before analyzing the impact of super-resolution on building footprint extraction, we first explored the performance of the super-resolution methods used in our experiment. We used them to super-resolve the SWOOP dataset from 1 m to 0.25 m and evaluated the quality of the super-resolved images using PSNR and SSIM. As shown in Table 1, RFANet surpassed the other methods in both PSNR and SSIM on the SWOOP dataset.
As shown in Table 2, all metric values followed the trends found in the previous section. As we were exploring the impact of super-resolution on building footprint extraction, IoU and precision were good indicators. In Table 2, each model achieved its highest test scores when the test set was super-resolved using the same or similar methods as those used for the training set. For example, the extraction results on SRCNN super-resolved images had the highest IoU of 47.4% and 50.65% with the extraction models BI-HRNet v2 and SRCNN-HRNet v2, respectively.
The extraction results on RFANet super-resolved images had the highest IoU of 49.89% and 49.49% with the extraction models RCAN-HRNet v2 and RFANet-HRNet v2, respectively. These results confirm the qualitative findings from the previous section. Therefore, we recommend using the latest and most powerful CNN-based super-resolution methods to super-resolve building datasets and then extracting building footprints from the super-resolved images. This approach works when the test set is super-resolved with the same method. Although SRCNN showed high performance in our experiment, we do not recommend using SRCNN in practice because it requires retraining when super-resolving images at different scales. In addition, without the skip connections included in more modern methods, training converges slowly.
As shown in Table 2, some results were hard to explain. The highest OA, mIoU and F1 score came from the extraction results of the SRCNN super-resolved test dataset using SRCNN-HRNet v2, which was beyond our expectation. We plan to investigate the causes and implications of these results in future work.

CONCLUSION AND RECOMMENDATION
Despite the wide use of super-resolution methods on remote sensing images, few works have explored the impact of super-resolution on downstream remote sensing applications, especially semantic segmentation. Therefore, in this work, we explored the impact of single-image super-resolution on building footprint extraction. We selected the pre-trained SRCNN, RCAN and RFANet, as well as bicubic interpolation, to super-resolve the Massachusetts Building Dataset. Then, we trained and tested HRNet v2 on the super-resolved images. Our experiments showed that the extraction model performs best when the training set and test set are super-resolved using the same method. Therefore, we suggest using the latest super-resolution method to process the building dataset during both training and testing/deployment to improve the accuracy of building footprint extraction.