PERFORMANCE EVALUATION OF FUSION TECHNIQUES FOR CROSS-DOMAIN BUILDING ROOFTOP SEGMENTATION

Convolutional Neural Networks have been widely applied to building rooftop segmentation from satellite and aerial imagery. Preparing adequate training data remains one of the critical issues in this field, so making use of available annotated cross-domain, multisource datasets is needed. This paper evaluates the performance of fusing state-of-the-art deep neural network architectures for cross-domain building rooftop segmentation. We select three semantic image segmentation networks: Swin Transformer, OCRNet and HRNet. The predictions from these three networks are combined with majority voting, max value and union fusion techniques to deliver a refined building rooftop segmentation mask. Experiments on two benchmark datasets show that the proposed fusion techniques outperform single models and other state-of-the-art cross-domain segmentation approaches.


INTRODUCTION
Building rooftop segmentation is one of the fundamental tasks in photogrammetry and remote sensing. In particular, an up-to-date building rooftop map is required for many applications, including urban mapping, city planning, and land use analysis. Many edge-driven and region-driven approaches have been proposed in the last two decades (Cui et al., 2011, Tian and Reinartz, 2013, Qin et al., 2016, Hossain et al., 2019).
The development of machine learning and deep learning (DL) algorithms has further boosted region-driven building extraction approaches. In particular, semantic segmentation methods based on Convolutional Neural Networks (CNNs) have achieved great success in extracting building rooftop segments (Ji et al., 2018, Yuan et al., 2021b). However, training an effective semantic segmentation model requires large amounts of manually annotated pixel-level building masks, which is expensive and time-consuming (Farahani et al., 2021). Complex and diverse scenes further increase the difficulty of data labelling. Fortunately, more research institutes and universities are now willing to publish their annotated building rooftop segmentations as benchmark datasets (Chen et al., 2020). However, applying the available benchmark datasets to other building roof segmentation applications is not an easy task. Due to differences in building types and distributions, known as the domain gap in computer vision, training data annotated for one city cannot be easily adapted to a different test region. Therefore, cross-domain learning is a critical research topic for building roof segmentation, in particular for applications in which diverse test regions are involved (Peng et al., 2021).
In this paper, we evaluate the performance of fusing three advanced deep neural network models for cross-domain building rooftop segmentation: Swin Transformer, OCRNet (Yuan et al., 2020), and HRNet. Three fusion techniques are tested: majority voting, max value and union fusion. Moreover, to minimize the appearance discrepancy between the source domain and target domain images, we adopt a LAB-based image translation method in the pre-processing step.
We assess the proposed fusion methods on two building extraction benchmark datasets, the WHU Building dataset (Liu and Ji, 2020) and the Potsdam Building dataset (Rottensteiner et al., 2012). Compared with other cross-domain semantic segmentation approaches, the evaluation results show that fused predictions from three state-of-the-art semantic segmentation models give a more robust performance.

Building rooftop segmentation with deep neural networks
Building rooftop segmentation is a binary classification task, which aims to label the pixels of the original images as one of two classes: building or non-building. Following the trend in semantic segmentation, most recent building rooftop segmentation works use deep neural networks and achieve state-of-the-art results on benchmark datasets. Among those works, well-known fully convolutional networks (FCNs) (Long et al., 2015) are widely employed for image semantic segmentation, such as U-Net (Ronneberger et al., 2015), SegNet, DenseNet (Li et al., 2018), and HRNet. In detail, Ji et al. (2018) and Kang et al. (2019) adopt U-Net for building extraction from optical images. Xu et al. (2018) and Yi et al. (2019) introduce residual blocks into U-Net to facilitate training. Yang et al. (2018) combine signed-distance labels with SegNet to achieve instance-level building extraction. To better utilize high-order structural features for accurate building extraction, Li et al. (2018) adopt DenseNet with an adversarial module. Inspired by HRNetv2 (Sun et al., 2019), Zhu et al. (2020) propose MAP-Net, which introduces a channel-wise attention module to adaptively squeeze multiscale features extracted from the multipath network.
Recently, transformer networks such as the Swin Transformer have attracted researchers' attention owing to their high efficiency and effectiveness, which stem from a shifted-window-based self-attention module. Yuan et al. (2021a) and Chen et al. (2022) directly apply the Swin Transformer with multiscale features to the building roof segmentation task.

Cross-domain learning for building roof segmentation
Due to the domain gap between images captured over distinct cities or under different shooting conditions, the performance of FCNs drops significantly on unseen datasets, which means they generalize poorly (Peng et al., 2021). To process large-scale data efficiently at relatively low cost, effective cross-domain strategies are desired. For building rooftop segmentation, only a few pioneering studies are available. For instance, Peng et al. (2021) introduce full-level domain adaptation methods including the mean-teacher model (Tarvainen et al., 2017), adversarial learning, and self-training. Although classic domain adaptation methods can achieve notable progress in reducing domain shifts between different datasets, they are not data-friendly for complex scenarios, as the target (test) data are required in the training phase. This means that the network has to be retrained whenever a new test dataset is added.
Benefiting from advanced neural network structures and learning abilities, a single DL segmentation model can already cope with some domain shift problems. Each neural network structure learns its own features from the image and provides its own predictions. In this paper, we therefore propose a data-friendly framework that combines the predictions from three advanced semantic segmentation networks to further improve the robustness of cross-domain building rooftop segmentation.

METHODS
In this section, three fusion approaches are described and used to combine the prediction results from the Swin Transformer, OCRNet (Yuan et al., 2020) and HRNet models. In the pre-processing step we adopt the LAB color translation to reduce the appearance discrepancy between the source domain and target domain images.

Image translation
Two categories of approaches are available for image translation: color transforms and generative adversarial networks (GANs). It has been reported that training an effective GAN model for image translation is difficult with current techniques (Peng et al., 2021). Therefore, we select the CIELAB (LAB) based color translation (He et al., 2021) to reduce the domain discrepancy.
Instead of randomly selecting one image from the target dataset, we take 10 images and translate them to the LAB color space (Jain, 1989). In the LAB (l*a*b) color space, l represents the perceptual lightness, a relates to the green-red opponent colors, and b represents the blue-yellow opponent colors. We then calculate the per-channel mean and standard deviation of these ten images. We project all source domain images to LAB space, and then shift the distribution of the pixel values of each channel to that of the target domain as in Equation (1).
In the end the translated LAB images are converted back to the RGB color space and used as input for the building rooftop segmentation task.
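Equation (1) is not reproduced in this text. The sketch below assumes the commonly used mean and standard deviation matching form, in which each channel is normalized with the source statistics and rescaled with the target statistics: (x - mean_s) / std_s * std_t + mean_t. The function names, the use of per-image source statistics, and the synthetic arrays are illustrative assumptions. The arrays are taken to be in LAB space already; the RGB to LAB conversion and its inverse are left to an image library.

```python
import numpy as np

def channel_stats(lab_images):
    """Per-channel mean and std pooled over a list of H x W x 3 LAB arrays."""
    pixels = np.concatenate([img.reshape(-1, 3) for img in lab_images], axis=0)
    return pixels.mean(axis=0), pixels.std(axis=0)

def shift_to_target(source_lab, target_mean, target_std, eps=1e-6):
    """Match the per-channel statistics of one source image to the target.

    Assumption: source statistics are computed per image; the paper does
    not state whether they are per image or per dataset.
    """
    flat = source_lab.reshape(-1, 3)
    src_mean, src_std = flat.mean(axis=0), flat.std(axis=0)
    normalized = (source_lab - src_mean) / (src_std + eps)
    return normalized * target_std + target_mean

# Synthetic stand-ins for ten target-domain images and one source image.
rng = np.random.default_rng(0)
target_samples = [rng.normal([60, 5, -10], [12, 4, 6], size=(32, 32, 3))
                  for _ in range(10)]
source_image = rng.normal([40, -3, 8], [20, 7, 3], size=(32, 32, 3))

t_mean, t_std = channel_stats(target_samples)
translated = shift_to_target(source_image, t_mean, t_std)
```

After the shift, the per-channel mean and standard deviation of the translated image agree with the target statistics, which is the property the pre-processing step relies on.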

CNN based building rooftop segmentation
CNN based segmentation approaches have received increasing interest, as they deliver more accurate results and are robust to noise contained in the training datasets (Alzubaidi et al., 2021). In this paper we select three state-of-the-art semantic image segmentation deep neural network architectures for building rooftop extraction: Swin Transformer, OCRNet (Yuan et al., 2020), and HRNet.
Swin Transformer is a representative vision transformer proposed by Microsoft. Its main highlights are a hierarchical feature representation and a computational complexity that is linear with respect to the input image size. Computing self-attention with the proposed shifted window approach significantly enhances the modelling power and thus improves the efficiency and effectiveness for vision tasks. Up to now, Swin Transformer has achieved state-of-the-art performance on many semantic segmentation tasks, including building extraction (Xu et al., 2021, Chen et al., 2022).
OCRNet: As its name states, Object-Contextual Representations (OCR) addresses the semantic segmentation problem with a focus on the context aggregation strategy (Yuan et al., 2021). It presents a simple yet effective approach for object-contextual representations, which characterizes each pixel by its corresponding object representation and thus improves the learning ability and reduces the influence of unnecessary details in images. Object region learning and object region representation computation are presented as parallel modules and are integrated as the cross-attention module in the decoder. It has been tested on various object extraction and segmentation applications (Jin et al., 2021, Huang et al., 2021).
HRNet is an earlier semantic image segmentation network structure from Microsoft Research. It maintains high-resolution representations through the interaction of parallel high-to-low resolution convolution streams. In particular, it repeatedly exchanges information across high- and low-level representations. The benefit is that the resulting representation is semantically richer and spatially more precise. It has been used in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, and it also performs well in building extraction (Seong et al., 2021).

Fusion methods
The output of each neural network model can be represented as a softmax probability map, which approximately indicates the certainty that each pixel belongs to the building rooftop class. We explore three fusion approaches to generate a final segmentation mask.

Majority Voting:
Majority voting is widely used in image processing and classification tasks (Jimenez et al., 1999, Hajdu et al., 2013). Under this fusion scheme, each segmentation model provides a separate decision once a predefined threshold value (T) is applied to its probability map. Thus, three labelling results are available for each pixel. If at least two segmentation models classify a pixel as building rooftop, majority voting assigns it to the building rooftop class. The scheme is defined in Equations (2) and (3).
In these equations, y denotes the category label, and the remaining symbols denote the softmax probability map value, the number of models and the per-model segmentation result; a label of 1 means building rooftop and 0 means background.
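Equations (2) and (3) are not reproduced in this text, so the following is a minimal sketch of one reading of the description: threshold each probability map with T, then keep the pixels on which more than half of the models agree. The function name, the value T = 0.5, the use of `>=` rather than `>`, and the toy 2 x 2 probability maps are assumptions made for illustration.

```python
import numpy as np

def majority_vote(prob_maps, threshold=0.5):
    """Fuse per-model building probabilities by per-pixel majority vote."""
    probs = np.stack(prob_maps, axis=0)                 # shape (N, H, W)
    votes = (probs >= threshold).astype(np.uint8)       # per-model decision
    needed = probs.shape[0] // 2 + 1                    # 2 of 3 models
    return (votes.sum(axis=0) >= needed).astype(np.uint8)

# Toy probability maps standing in for the three model outputs.
swin = np.array([[0.9, 0.4], [0.2, 0.6]])
ocrnet = np.array([[0.8, 0.7], [0.1, 0.3]])
hrnet = np.array([[0.3, 0.6], [0.4, 0.3]])

mask = majority_vote([swin, ocrnet, hrnet], threshold=0.5)
```

In this toy case the bottom-right pixel is rejected even though one model gives it 0.6, because a single confident model cannot win the vote on its own.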

Max Value:
In max value fusion, we first generate a fused softmax probability map by taking, for each pixel, the maximum of the three probability predictions generated by the Swin Transformer, OCRNet and HRNet models. Then, we generate a building mask by applying a threshold to this fused softmax probability map. The method is defined in Equation (4).
Here the threshold value is specific to the max value method.
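Equation (4) is not reproduced in this text. A minimal sketch of the procedure as described, with a hypothetical function name, an assumed threshold of 0.5 and toy 2 x 2 probability maps:

```python
import numpy as np

def max_value_fusion(prob_maps, threshold):
    """Per-pixel maximum over the model probabilities, then a threshold."""
    fused = np.stack(prob_maps, axis=0).max(axis=0)
    return fused, (fused >= threshold).astype(np.uint8)

# Toy probability maps standing in for the three model outputs.
swin = np.array([[0.9, 0.4], [0.2, 0.6]])
ocrnet = np.array([[0.8, 0.7], [0.1, 0.3]])
hrnet = np.array([[0.3, 0.6], [0.4, 0.3]])

fused, mask = max_value_fusion([swin, ocrnet, hrnet], threshold=0.5)
```

Because the maximum exceeds the threshold whenever any single model does, this scheme accepts a pixel if at least one model is confident. It therefore tends to favor recall, and the threshold is the main control on false positives.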

Union:
In union fusion, we sum the probability maps generated by the Swin Transformer, OCRNet and HRNet models. The category label is predicted by comparing the summed probability value to a given threshold. The method is defined in Equation (5).
Here the threshold value is specific to the union fusion method.
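Equation (5) is not reproduced in this text. The sketch below follows the verbal description; the function name and toy maps are illustrative. With three models the summed value lies between 0 and 3. The thresholds 1.5 and 1.1 used here assume that the numeric suffix in method names such as Union-1.5 in the results denotes this threshold.

```python
import numpy as np

def union_fusion(prob_maps, threshold):
    """Sum the model probabilities per pixel, then apply a threshold."""
    summed = np.stack(prob_maps, axis=0).sum(axis=0)
    return summed, (summed >= threshold).astype(np.uint8)

# Toy probability maps standing in for the three model outputs.
swin = np.array([[0.9, 0.4], [0.2, 0.6]])
ocrnet = np.array([[0.8, 0.7], [0.1, 0.3]])
hrnet = np.array([[0.3, 0.6], [0.4, 0.3]])

summed, mask_15 = union_fusion([swin, ocrnet, hrnet], threshold=1.5)
_, mask_11 = union_fusion([swin, ocrnet, hrnet], threshold=1.1)
```

Dividing the threshold by the number of models shows that this is equivalent to thresholding the mean probability, so a threshold of 1.5 corresponds to a mean of 0.5. Lowering the threshold admits pixels with weaker joint support, as the bottom-right pixel shows.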

Descriptions of Datasets
To verify the effectiveness and efficiency of the proposed method, the WHU Building dataset (Ji et al., 2019) and the Potsdam Building dataset (Rottensteiner et al., 2012) are employed in the experiments. We use the WHU and Potsdam Building datasets alternately as source and target domain datasets.
WHU Building Dataset. The dataset consists of both aerial and satellite imagery over Christchurch, New Zealand (Ji et al., 2019). In our experiments we use only the aerial dataset, which covers an area of 450 km² with an original resolution of 0.075 m. Over 220 000 independent buildings of various types and locations were manually digitized and corrected from the building footprint vectors published by the New Zealand government (https://data.linz.govt.nz/).

Potsdam Building Dataset.
This dataset is generated from the dataset of the International Society for Photogrammetry and Remote Sensing (ISPRS) 2-D Semantic Labeling Contest, from which binary building masks were extracted based on the semantic label maps. It covers an area of 3.42 km² and consists of 38 very-high-resolution (VHR) aerial image tiles with a size of 6000×6000 pixels and a GSD of 0.05 m. The dataset was collected over the city of Potsdam, Germany, which is a typical historical European city with large building blocks and dense settlement structures (Rottensteiner et al., 2012).

Figure 2. Example images of the WHU and Potsdam benchmark datasets.

Experiment setup and training details
To reduce the influence of image resolution differences and the domain gap between the Potsdam and WHU datasets, both datasets are downsampled to a GSD of 0.3 m, which is also a generally recommended resolution for building segmentation.
After that, all images are cropped into 512×512 pixel patches, which results in a total of 8189 tiles for the WHU Building dataset and 152 tiles for the Potsdam dataset. We use the officially recommended split to divide each dataset into training, testing and validation sets, and we calculate the accuracy on the officially provided testing datasets. The proposed method is implemented under the MMSegmentation framework (Chen et al., 2019a), and all experiments were conducted on 4 GeForce RTX 2080Ti GPUs.
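The Potsdam patch count can be checked with simple arithmetic. A 6000×6000 tile at 0.05 m becomes 1000×1000 pixels at 0.3 m, and covering 1000 pixels with 512-pixel patches needs two patches per axis. The sketch below assumes overlapping patches placed flush with the tile borders; the paper does not state how the cropping was done, and the function name is hypothetical.

```python
import math

def patch_starts(length, patch=512):
    """Evenly spaced start offsets so that patches cover `length` pixels."""
    if length <= patch:
        return [0]
    count = math.ceil(length / patch)
    step = (length - patch) / (count - 1)
    return [round(i * step) for i in range(count)]

side = 6000 * 5 // 30             # 0.05 m -> 0.30 m: 1000 pixels per side
starts = patch_starts(side)       # offsets along one axis
patches_per_tile = len(starts) ** 2
total_patches = 38 * patches_per_tile
```

Under this assumption each tile yields 2 × 2 patches, and 38 tiles give 152 patches, which agrees with the count reported for the Potsdam dataset.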

Evaluation Method
Two metrics, the F1-score (F1) and the Intersection-over-Union (IoU) of the building rooftop class, are calculated to evaluate the accuracy of the extracted building rooftop segments. They are defined in Equations (6) and (7).
where TP, FP, and FN denote the pixel numbers of True Positives, False Positives, and False Negatives, respectively. Note that higher F1-score and IoU denote better overall performance.
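Equations (6) and (7) are not reproduced in this text. The sketch below uses the standard pixel-wise definitions, F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN), which are presumably what the equations state. The function name, the toy masks, and returning NaN when neither mask contains a building pixel are assumptions.

```python
import numpy as np

def f1_and_iou(pred, truth):
    """Pixel-wise F1-score and IoU of the building class for binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))      # building predicted and present
    fp = int(np.sum(pred & ~truth))     # building predicted but absent
    fn = int(np.sum(~pred & truth))     # building present but missed
    if tp + fp + fn == 0:
        # Both masks are empty: the metrics are undefined (assumed NaN).
        return float("nan"), float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou

prediction = [[1, 1], [0, 1]]
ground_truth = [[1, 1], [0, 0]]
f1, iou = f1_and_iou(prediction, ground_truth)   # TP = 2, FP = 1, FN = 0
```

For this toy pair F1 is 0.8 and IoU is 2/3. The two metrics are monotonically related (F1 = 2·IoU / (1 + IoU)), so they always rank methods in the same order.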

Experimental Results
The aim of this section is to evaluate the fused cross-domain building segmentation approach by comparing it to single segmentation models and other state-of-the-art cross-domain segmentation approaches.
Case 1: Potsdam→WHU: Compared to OCRNet and HRNet, the Swin Transformer generally performs better on cross-domain datasets. However, fusing in the predictions of OCRNet and HRNet can still further improve the accuracy. As Table I shows, the proposed Union-1.6 approach achieves increases in IoU and F1 of 1.11% and 0.74%, respectively. It also outperforms the Majority Voting method, with gains of 3.47% and 2.34%. The Majority Voting and Max Value fusion approaches do not surpass the results of the Swin Transformer.

Comparison with single models:
For the visual comparison, we have selected five image patches, which are presented in Figure 3. The buildings derived by Swin, OCRNet and HRNet and the best results from each fusion approach are presented together with the ground truth. As shown in the first row, the building rooftop segments obtained by the Union-1.5 method are almost identical to the ground truth and have more precise edges than the segments predicted by the HRNet, Swin and OCRNet models. In the third row of Figure 3, the Swin Transformer shows more missed detections than the other two models. After the fusion step, more building rooftops are correctly detected. The last two rows of Figure 3 clearly demonstrate that the Union-1.5 method is capable of identifying small buildings and can help to correct some recognition errors. The visual comparison in Figure 4 shows a similar trend to Figure 3. Building rooftop segments from the Swin Transformer have much sharper edges than the results from the other single models, especially in the first and third examples. The Union-1.1 method is better at identifying tiny buildings than the other methods. However, in the fourth row, OCRNet and HRNet detect an additional part of the large building that is missing from the Swin Transformer prediction.
Figure 3. Examples of building extraction maps obtained by different methods for Case 1: Potsdam→WHU.

DISCUSSION AND CONCLUSION
Adapting annotated multisource benchmark datasets to the building rooftop segmentation task is a crucial issue. With traditional machine learning approaches, a classifier trained on one dataset can hardly be used on another dataset due to the variety of building types and distributions, which is now known as domain shift or domain gap in computer vision. Benefiting from the development of deep learning techniques and neural network architectures, the learning ability of DL based classification and segmentation approaches has greatly improved, and some of them can be applied directly to cross-domain datasets. In this paper, we have compared three state-of-the-art DL based segmentation models and various fusion techniques for cross-domain building rooftop segmentation. Our experiments on two cross-country benchmark datasets have shown that combining the predictions from several segmentation models can bring a considerable improvement in accuracy and robustness. Benefiting partly from the advanced segmentation network architectures, our fusion approach has also outperformed other cross-domain segmentation approaches. The union fusion approach achieves the highest accuracy among the tested approaches when a proper threshold value is provided. The lightweight LAB based image translation helps to reduce the appearance discrepancy between the source domain and target domain images and thus improves the performance of the segmentation models. It has to be noted that the selection of threshold values has a direct influence on the accuracy of the Union and Max Value fusion approaches. In the next step, we plan to use adaptive methods to solve this problem and to introduce advanced fusion techniques.