ADAPTIVE AND NON-ADAPTIVE FUSION ALGORITHMS ANALYSIS FOR DIGITAL SURFACE MODEL GENERATED USING CENSUS AND CONVOLUTIONAL NEURAL NETWORKS

Fusion of digital surface models (DSMs) remains a challenging problem in improving the quality of 3D models, especially for complex regions with variable radiometric and geometric distortions, as is typical of satellite datasets. DSM generation using multi-view stereo (MVS) analysis is the most common cost-efficient approach to recovering elevations. Algorithms such as Census with semi-global matching (SGM) and matching-cost convolutional neural networks (MC-CNN) have been successfully used to generate disparity maps and recover DSMs; however, their performance is limited when matching stereo pairs that contain ill-posed regions (low-texture, densely textured, occluded, or noisy areas), which can yield missing or incorrect elevation values in addition to fuzzy boundaries. DSM fusion algorithms have proven able to tackle such problems, but their performance may vary with the quality of the input and the type of fusion, which can be classified as adaptive or non-adaptive. In this paper, we evaluate the performance of adaptive and non-adaptive fusion methods, namely the median filter, adaptive median filter, K-median clustering fusion, weighted average fusion, and adaptive spatiotemporal fusion, for DSMs generated using Census and MC-CNN. We perform our evaluation on 9 testing regions using stereo pair images from the Worldview-3 satellite. Our results show that adaptive fusion algorithms are more accurate than non-adaptive algorithms in predicting elevations because of their ability to learn from temporal and contextual information. Our results also show that MC-CNN produces better fusion results, with a lower overall average RMSE, than Census.


INTRODUCTION
The quality of the digital surface model (DSM) generated from satellite images has always been crucial to most remote sensing and photogrammetry applications. Generating DSMs with multi-view stereo (MVS) algorithms is very common because of its high efficiency and low cost, but its performance is limited by its sensitivity to radiometric and geometric distortions in the stereo images, which lead to noise, occlusions, missing elevation values, and similar defects in the DSM. One of the most promising techniques for improving DSM quality, and one that has attracted significant attention, is fusion (Albanwan and Qin, 2020; Cigla et al., 2017; Papasaika et al., 2008). Fusion is the process of combining multi-temporal DSMs into a single high-quality DSM; it takes advantage of redundant temporal information to compensate for incorrect representations or missing elevation points (Albanwan and Qin, 2020; Cigla et al., 2017; Papasaika et al., 2008). DSM fusion algorithms can be categorized as 1) adaptive or 2) non-adaptive; the former learn from the context, shape, and type of objects in the scene in addition to the temporal information between DSMs, whereas the latter learn and predict elevation from the temporal information alone (Cigla et al., 2017; Wang and Gong, 2019). One of the oldest non-adaptive fusion algorithms is median filtering, which is known to be robust to outliers and to preserve object boundaries (Kuschk and D'Angelo, 2013; Ozcanli et al., 2015). Many fusion algorithms have upgraded the simple median filter to a more robust approach by making it adaptive to scene objects. For instance, Qin (2017) proposed adaptive median filtering, which uses a flexible window shaped to the object and applies median filtering to each object instead of using a fixed-size window.
Such adaptive methods are able to retain the boundaries and shapes of objects in the scene (e.g., buildings and roads). Other studies have shown that the uncertainty of the DSM can be correlated with the land-cover class (Albanwan and Qin, 2020). For instance, trees and grass change with the acquisition date and season, which can adversely influence the performance of MVS algorithms and increase the DSM uncertainty, whereas structures like buildings and roads change less over time and usually have lower uncertainty. This led to the development of class-oriented fusion algorithms; as an example, Albanwan and Qin (2020) developed adaptive spatiotemporal fusion, which imposes different bandwidths based on the class of objects. Other methods use k-median clustering to locate the cluster with the most consistent elevation points and so reduce the number of outliers (Facciolo et al., 2017), while others use a pair ranking scheme that scores and sorts stereo pairs by their quality and merges only the pairs with the best scores (Facciolo et al., 2017; Qin, 2019; Qin et al., 2020). Many factors influence the fusion outcome, including the type of input and the fusion algorithm. Traditional MVS algorithms generate DSMs using local window-based approaches, which extract and match feature correspondences to compute the disparity and then transform it into a DSM. Census with semi-global matching (SGM), proposed by Hirschmuller (2005), is one of the simplest and most cost-efficient algorithms for generating DSMs and has been widely used from 2005 until today. Although it is considered invariant to radiometric changes, it is still sensitive to illumination changes and window size; therefore, many authors have proposed adaptive windows that capture the size and shape of objects instead of a rigid window (Han et al., 2020; Heiko Hirschmuller and Scharstein, 2007; Loghman and Kim, 2013).
Deep learning algorithms have recently attracted great interest in dense image matching and elevation generation, where they are intended to improve matching by better understanding and learning from the scene components (Chang and Chen, 2018; Hamid et al., 2020; Zbontar and LeCun, 2015). Zbontar and LeCun (2015) were the first to introduce matching-cost convolutional neural networks (MC-CNN) for stereo analysis and for disparity and DSM generation; their aim was to improve the matching cost and produce faster and better similarity matching results. Although MC-CNN is able to capture the shapes of objects well, it still requires a lot of post-processing and filtering. Many deep learning MVS algorithms are inspired by MC-CNN and were developed to further improve predictability, cost matching, and time efficiency. One drawback of deep learning algorithms is that their performance is limited by the training process: they require a great amount of time and data, along with ground truth, for training. Additionally, generalization is critical: the trained model should perform well on the testing dataset or any other dataset regardless of the domain difference; otherwise, the network must be retrained.
In this work, we aim to evaluate adaptive and non-adaptive fusion algorithms for DSMs generated using Census and MC-CNN; we use these two algorithms because they have been widely used in MVS. This evaluation can help close the gaps in current fusion algorithms and provide insights into improving their performance and output, and it can help the user understand how each type of DSM behaves under different fusion algorithms.
The paper is organized as follows: Section 2 includes the data description, the pre-processing steps such as DSM generation, and the fusion algorithms used; Section 3 includes the discussion and analysis of the results; and finally, we present our conclusion in the last section.

Dataset description
In our work, we use three different datasets from Omaha (OMA), Jacksonville (JAX), and Argentina (ARG), acquired by the very high-resolution Worldview-3 satellite. Each dataset includes hundreds of multispectral image pairs with a spatial resolution of 0.3 meters. Every dataset includes three testing regions with varied spatial complexity and density; the locations may be urban or suburban areas with dense small houses, sparse large buildings, or a mix of both, as can be seen in Table 1. The total number of DSMs generated ranges between 91 and 482. The images were captured within roughly a one-year time span: September 2014 to November 2015 for OMA, October 2014 to February 2016 for JAX, and January 2015 to January 2016 for ARG. For evaluation, we use a reference ground truth DSM generated from the LiDAR dataset.
Table 1. The dataset information and testing regions.

Data pre-processing
Our pre-processing steps can be summarized as follows: 1) image pair selection based on specific criteria, 2) geo-registration to ensure alignment, and 3) disparity and DSM generation using the Census and MC-CNN cost metrics followed by semi-global matching for optimization (Hirschmuller, 2005).

Image pair selection
Since we have about 20 images for every dataset, we can obtain hundreds of DSMs from the stereo pairs; in practice, however, only a few DSMs may be available. Therefore, we choose the best 20 pairs to generate elevations and perform the fusion. The stereo pairs are selected using a scoring (ranking) scheme that sorts the pairs based on the metadata of the two images, including geometric information such as the intersection angle, the sun angle difference, and the number of days between the acquisitions. We rank the stereo pairs by scores that are computed from the optimal values of the sun angle difference and intersection angle, as described in (Qin, 2019).

2.2.2 Ortho-ready image, geo-registration, and image rectification
Each image is transformed into an ortho-ready image and registered against a reference image to ensure alignment, using the RPC Stereo Processor (RSP) software (Qin, 2016). We also rectify the images with RSP to produce epipolar images, which we use in the following steps for disparity and DSM generation.

2.2.3 Disparity and digital surface models (DSM) generation
Initially, we generate the disparity from the rectified epipolar images, where Census (Hirschmuller, 2005) and MC-CNN (Zbontar and LeCun, 2015) are used as the matching cost metrics to determine the horizontal displacement between feature correspondences, which yields the disparity images. Census requires a predefined window to perform a binary (bit string) transformation, where any pixel larger than the central pixel takes a value of 1, and 0 otherwise. This transformation is followed by a Hamming distance computation to measure the matching score, and the disparity is chosen based on the minimum score. For MC-CNN, we follow the CNN architecture in (Zbontar and LeCun, 2015): we first train the MC-CNN using a satellite dataset, then extract image patches with a size of 9x9. The image patches are fed into a single convolutional layer with 32 kernels of size 5x5, followed by a series of fully connected layers with 200 neurons each, whose outputs are concatenated into a single layer of 400 neurons and passed to several layers with 300 neurons each, until the last layer, which classifies each pixel as a match or no match. The DSM is finally generated using semi-global matching as proposed by (Hirschmuller, 2005) and implemented in the RSP software.
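The Census cost described above can be illustrated with a minimal NumPy sketch. It is not the RSP implementation: the function names, the 5x5 default window, the edge padding, and the simple winner-take-all step are assumptions made for illustration only (the paper uses semi-global matching for the optimization).

```python
import numpy as np

def census_transform(img, window=5):
    """Binary census descriptor per pixel for a square window (assumed odd size)."""
    r = window // 2
    padded = np.pad(img, r, mode="edge")  # assumption: replicate border pixels
    h, w = img.shape
    bits = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # the central pixel is not compared with itself
            neighbour = padded[r + dy : r + dy + h, r + dx : r + dx + w]
            bits.append(neighbour > img)  # 1 where the neighbour is larger than the centre
    return np.stack(bits, axis=-1)  # shape (h, w, window * window - 1)

def hamming_cost(census_left, census_right, disparity):
    """Hamming distance between left pixel x and right pixel x - disparity."""
    shifted = np.roll(census_right, disparity, axis=1)  # wraps around at the image border
    return np.count_nonzero(census_left != shifted, axis=-1)

def winner_take_all(left, right, max_disp=4, window=5):
    """Pick, per pixel, the disparity with the minimum Hamming cost (no SGM smoothing)."""
    cl = census_transform(left, window)
    cr = census_transform(right, window)
    costs = np.stack([hamming_cost(cl, cr, d) for d in range(max_disp + 1)], axis=0)
    return np.argmin(costs, axis=0)
```

Because the descriptor only records whether each neighbour is brighter than the centre, it is unchanged by any monotonic change in brightness, which is why Census is usually regarded as tolerant to radiometric differences between the two images.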

DSM fusion and evaluation
For DSM integration, we use different fusion algorithms spanning adaptive and non-adaptive approaches:
1) Non-adaptive fusion algorithms:
• Median filter
• K-median clustering fusion (Facciolo et al., 2017)
• Weighted average fusion (Papasaika et al., 2008)
2) Adaptive fusion algorithms:
• Adaptive median fusion (Qin, 2017)
• Adaptive spatiotemporal fusion (Albanwan and Qin, 2020)
The median filter simply takes the median of the temporal DSMs at each pixel position and produces a median DSM. The adaptive median filter suggested by (Qin, 2017) applies the same concept but uses an adaptive window that captures the shape of the object and applies median filtering to individual objects. K-median clustering fusion was proposed by (Facciolo et al., 2017) to merge several pre-ranked stereo pairs into a single DSM by clustering the input temporal elevations and picking the cluster with the minimum cost to fuse its elevation points. Weighted average fusion, on the other hand, is a more general algorithm (Papasaika et al., 2008): it first computes the residual map between two DSMs, then uses these residuals to derive weights that multiply the elevations as follows:
$$F(x, y) = \frac{\sum_{i=1}^{n} w_i(x, y)\, DSM_i(x, y)}{\sum_{i=1}^{n} w_i(x, y)} \qquad (1)$$
where
$F$ = the fused DSM
$w_i$ = the weight computed from the residual maps
$n$ = the number of temporal DSMs
Finally, we perform adaptive spatiotemporal fusion by applying different bandwidths to different classes, where highly complex and seasonally variable classes like trees, grass, and water take larger bandwidths, while static rigid objects like buildings and roads take smaller bandwidths.
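The two simplest non-adaptive methods above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the DSMs are stacked in a (T, H, W) array with NaN marking missing elevations, and the inverse-residual weight with a `softening` constant is a hypothetical choice, since the exact weighting function of (Papasaika et al., 2008) is not reproduced here.

```python
import numpy as np

def median_fusion(stack):
    """Per-pixel median over a (T, H, W) stack of DSMs; NaN marks missing values."""
    return np.nanmedian(stack, axis=0)

def weighted_average_fusion(stack, reference=None, softening=1.0):
    """Weighted average of temporal DSMs with weights derived from residuals.

    Assumption: weights are 1 / (softening + |DSM_i - reference|), so elevations
    far from the reference contribute less. The reference defaults to the median DSM.
    """
    if reference is None:
        reference = median_fusion(stack)
    residual = np.abs(stack - reference)          # residual map of each DSM
    weights = 1.0 / (softening + residual)
    weights = np.where(np.isnan(residual), 0.0, weights)  # ignore missing values
    values = np.nan_to_num(stack, nan=0.0)
    total = weights.sum(axis=0)
    fused = np.full(total.shape, np.nan)
    # Normalise by the sum of weights; pixels with no valid input stay NaN.
    np.divide((weights * values).sum(axis=0), total, out=fused, where=total > 0)
    return fused
```

With three elevations of 10, 10, and 50 m at one pixel, the median returns 10 m, while the weighted average is pulled slightly towards the outlier but far less than a plain mean would be.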
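In adaptive spatiotemporal fusion, we first compute the median DSM and generate class masks as suggested by (Albanwan and Qin, 2020). We then take the differences between the spatial and temporal DSM values and the median to calculate the weights, imposing a different bandwidth for each class. The algorithm is as follows:
$$F(x, y) = \frac{1}{W_T} \sum_{(u, v) \in \Omega(x, y)} \sum_{t=1}^{n} W_r \, W_s \, W_h \, h(u, v, t) \qquad (2)$$
where
$F$ = the fused DSM
$x, y$ = the position of the pixel in the DSM
$h(u, v, t)$ = the height at position $(u, v)$ of the window $\Omega(x, y)$ in the $t$-th temporal DSM
$h_{med}$ = the median height from the temporal DSMs
$W_r$ = the spectral weight
$W_s$ = the spatial weight
$W_h$ = the temporal height weight
$W_T$ = the total weight, i.e., the sum of $W_r W_s W_h$ over the window and the temporal DSMs
$W_r$ and $W_s$ are the range (spectral) and spatial weights computed from the orthophoto $I$ as in the bilateral filter, and $W_h$ is computed from the heights, as follows:
$$W_r = \exp\left(-\frac{\| I(x, y) - I(u, v) \|^2}{2\sigma_r^2}\right)$$
$$W_s = \exp\left(-\frac{(x - u)^2 + (y - v)^2}{2\sigma_s^2}\right)$$
$$W_h = \exp\left(-\frac{\left(h_{med}(x, y) - h(u, v, t)\right)^2}{2\sigma_h^2}\right)$$
where $\sigma_r$ and $\sigma_s$ are the spectral and spatial bandwidths and $\sigma_h$ = the height bandwidth relative to each class.
We perform the fusion on the 20 best pairs and evaluate the fusion output using the root mean squared error (RMSE), with the LiDAR data as the ground truth (GT), as follows:
$$RMSE = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left(F_k - GT_k\right)^2}$$
where $F$ is the fused DSM and $N$ is the number of valid pixels.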
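A minimal single-pixel sketch of this kind of class-adaptive weighting, together with the RMSE evaluation, is given below. It assumes Gaussian weights in the style of a bilateral filter; the function names, the (T, H, W) stack layout, the single-band orthophoto, and the NaN convention for invalid pixels are all assumptions for illustration, not the implementation of (Albanwan and Qin, 2020). The caller is expected to pass the height bandwidth `sigma_h` of the class that the pixel belongs to.

```python
import numpy as np

def spatiotemporal_fuse_pixel(stack, ortho, y, x, sigma_h,
                              sigma_r=30.0, sigma_s=5.0, window=5):
    """Fuse the heights around pixel (y, x) of a (T, H, W) DSM stack.

    sigma_h is the class-dependent height bandwidth; the sigma_r, sigma_s and
    window defaults follow the values reported in the parameter selection section.
    """
    centre = stack[:, y, x]
    if np.all(np.isnan(centre)):
        return np.nan  # no temporal observation at this pixel
    h_med = np.nanmedian(centre)  # median height over time

    r = window // 2
    _, height, width = stack.shape
    y0, y1 = max(y - r, 0), min(y + r + 1, height)
    x0, x1 = max(x - r, 0), min(x + r + 1, width)
    patch = stack[:, y0:y1, x0:x1]
    ortho = np.asarray(ortho, dtype=float)  # avoid unsigned integer wrap-around
    yy, xx = np.mgrid[y0:y1, x0:x1]

    w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * sigma_s ** 2))
    w_r = np.exp(-((ortho[y0:y1, x0:x1] - ortho[y, x]) ** 2) / (2.0 * sigma_r ** 2))
    w_h = np.exp(-((patch - h_med) ** 2) / (2.0 * sigma_h ** 2))

    valid = ~np.isnan(patch)
    weights = np.where(valid, w_s * w_r * w_h, 0.0)  # (h, w) broadcasts over time
    total = weights.sum()
    if total == 0:
        return np.nan
    return float((weights * np.where(valid, patch, 0.0)).sum() / total)

def rmse(fused, ground_truth):
    """Root mean squared error over pixels that are valid in both rasters."""
    valid = ~np.isnan(fused) & ~np.isnan(ground_truth)
    diff = fused[valid] - ground_truth[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```

A small `sigma_h` makes the height weight fall off quickly, so elevations far from the temporal median are effectively excluded; a larger value tolerates more variation, which matches the intent of giving seasonally variable classes larger bandwidths.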

Parameter selection for fusion methods
For adaptive median filtering, we use the default settings of (Qin, 2017). For k-median clustering, we use elevation points from the temporal and spatial domains, where the window size is set to 5 and the threshold that determines the cost and when to stop clustering is set to less than 10. For the weighted average filter, we use the median-fused DSM as the reference to compute the residual maps and the corresponding weights. Finally, for adaptive spatiotemporal fusion we use a window size of 5 and obtain 4 to 5 classes for each dataset, including buildings, roads and ground, trees, grass, and water. The spectral and spatial bandwidths are chosen empirically and set to $\sigma_r$ = 30 and $\sigma_s$ = 5; the elevation bandwidths $\sigma_h$ are set to 3 for buildings, 3 for roads and ground, 7 for trees, 7 for grass, and 7 for water.

DSM analysis
The first step in the analysis is DSM quality inspection, which is necessary for evaluating and comparing the results before and after fusion. Figure 2 shows a sample of the initial DSMs generated by Census and MC-CNN for two OMA testing regions (OMA I and OMA III). Regardless of the method, the generated DSM always exhibits problems such as noise, outliers, missing elevation points, and fuzzy representation of object shapes, which can arise from mismatches in the dense image matching process. These issues vary between DSMs and may be addressed by some of the generation algorithms; for instance, Census produces DSMs that are smoother, fuller, and well distributed, as can be seen in Figure 2 (a) and as highlighted by the blue boxes. MC-CNN, on the other hand, yields more missing elevation points, which are visible as the black spots in the right image of Figure 2. Nevertheless, MC-CNN captures the edges and boundaries of buildings better than Census, as indicated by the red and yellow boxes around the buildings.

Statistical analysis
We provide a comprehensive analysis and comparison of the overall performance of the adaptive and non-adaptive fusion algorithms, as well as of the fusion algorithms applied to DSMs from the non-deep-learning and deep-learning generation algorithms (i.e., Census and MC-CNN). The visual and statistical evaluation of the results is presented in Figure 3 and Table 2. From the bars in Figure 3, we can see that the adaptive methods, namely adaptive median fusion and adaptive spatiotemporal fusion, always produce lower uncertainty than the other fusion algorithms, as indicated by the orange and purple bars. They also have the lowest average RMSE, 5 meters, as can be seen in Table 2. Their performance is consistent across datasets regardless of the DSM generation algorithm, as shown by their less variable RMSE values (see Figure 3 and Table 2). On the other hand, weighted average fusion and k-median clustering fusion have the widest ranges of RMSE, as indicated by the green and yellow bars in Figure 3; their RMSE ranges from 5 to 35 meters. We also observe that MC-CNN performs better than Census in terms of robustness to outliers: for the median filter, k-median clustering fusion, adaptive spatiotemporal fusion, and weighted average fusion on dataset ARG I, fusion of the MC-CNN DSMs suppressed invalid elevation errors that were not resolved with Census. This suggests that fusing Census DSMs is less robust to outliers than fusing MC-CNN DSMs. Table 2 also shows that the error range for MC-CNN is about 4 to 12, whereas for Census it is between 6 and 31.
Table 2. The uncertainty of the fused DSMs.
Note: The abbreviations stand for median filter (MF), adaptive median filter (AMF), adaptive spatiotemporal fusion (ASPF), K-median clustering fusion (KMF), and weighted average fusion (WAF). Bold numbers indicate the lowest RMSE among all fusion methods for each testing region and each DSM generation algorithm.

Visual analysis
In general, fusion resolves many problems related to missing or incorrect elevation points (see Figure 4). Moreover, Figure 4 confirms our findings in Section 3.3 that adaptive methods produce the best fused results not only statistically but also visually. This can be seen in the second and last rows of Figure 4, where they produce smooth and sharp DSMs; the buildings are well captured and most of the noise in the original DSMs has been reduced. Other methods, like k-median clustering fusion and weighted average fusion, are less robust to noise and outliers in the DSM, which is evident from the third and fourth images in Figure 4.
Additionally, fusion using Census DSMs generates smoother fused DSMs for all fusion methods (see Figure 3), especially for k-median clustering fusion and weighted average fusion, whereas fusion of MC-CNN DSMs can produce noisy results with these methods. Nevertheless, unlike Census, the MC-CNN fused DSMs preserve edges and boundaries better in all fusion methods, as can be seen in Figure 3, where buildings are sharp and better outlined.

CONCLUSION
To conclude, our work has shown that, in general, the median filter produces fairly good results at low computational cost. However, adaptive fusion methods such as adaptive median fusion and adaptive spatiotemporal fusion always produce the best results because of their robustness to outliers and their flexibility to learn and predict elevation from homogeneous objects and a consistent set of neighboring pixels. The generalized non-adaptive fusion methods did not perform as well because they lack contextual and temporal information and correlation in the elevation prediction and fusion process. We also evaluated the use of DSMs generated by different dense image matching algorithms in the fusion process. We found that, in general, MC-CNN performs better than Census for most fusion algorithms, owing to its architecture and mechanism: it can learn similarity from highly complex features and produces more detailed DSMs. Therefore, combining its results can help achieve better fusion outcomes. Census, on the other hand, may generate less accurate DSMs than MC-CNN, but it offers a balance between quality and computational cost. Overall, Census can generate satisfactory results and is appropriate when resources and expertise are limited and the training required by deep learning algorithms is to be avoided. Although MC-CNN has shown its ability to preserve the shape of objects, it still requires post-processing and refinement to reduce the noise and outliers in the resulting fused DSMs. This work can be extended to improve fusion methods by combining DSMs from both MC-CNN and Census to take advantage of both methods; however, the distribution of the data must be taken into consideration, since different methods generate different DSMs.