A NEW STEREO DENSE MATCHING BENCHMARK DATASET FOR DEEP LEARNING

ABSTRACT: Stereo dense matching is a fundamental task for 3D scene reconstruction. Recently, deep learning based methods have proven effective on some benchmark datasets, for example Middlebury and KITTI stereo. However, it is not easy to find a training dataset for aerial photogrammetry. Generating ground truth data for real scenes is a challenging task. In the photogrammetry community, many evaluation methods use digital surface models (DSM) to generate the ground truth disparity for the stereo pairs, but in this case interpolation may introduce errors in the estimated disparity. In this paper, we publish a stereo dense matching dataset based on the ISPRS Vaihingen dataset, and use it to evaluate some traditional and deep learning based methods. The evaluation shows that learning-based methods outperform traditional methods significantly when fine-tuning is performed on a similar landscape. The benchmark also investigates the impact of the base to height ratio on the performance of the evaluated methods. The dataset can be found at https://github.com/whuwuteng/benchmark_ISPRS2021.


INTRODUCTION
Dense matching is a traditional topic in 3D reconstruction, which can be performed in stereo (with only two views) (Scharstein, Szeliski, 2002) or multi-view stereo (MVS) (Jensen et al., 2014). In this paper, we focus on stereo dense matching in the specific case of epipolar stereo pairs (where corresponding points lie on the same line in both images), as most of the recent deep learning approaches are limited to this simple configuration. The recent successes of deep learning based dense matching methods in the computer vision community (Laga et al., 2020) raise the question of their applicability in the geospatial context. This paper investigates this question by comparing traditional and machine learning (especially deep learning) dense matching techniques on geospatial data.

Traditional methods
Traditional dense matching methods (Hirschmuller, 2005) are usually decomposed into four steps: hand-crafted feature computation, feature matching across images (i.e., the cost volume), cost aggregation and disparity refinement. They can be divided into local and global methods. Local methods mainly rely on local features (hand-crafted features) (Hirschmuller, Scharstein, 2007) or a local region (Tombari et al., 2008). Global methods add an optimization-based cost aggregation step, based on dynamic programming (Van Meerbergen et al., 2002), belief propagation (Sun et al., 2003) or Graphcut optimization (Boykov, Kolmogorov, 2004). Semi-global matching (SGM) is a reference method combining mutual information and dynamic programming optimization along several directions (Hirschmuller, 2005). A GPU variant of SGM (Hernandez-Juarez et al., 2016) operating at the full image scale is also evaluated in this paper, as well as a Graphcut based method using plane constraints (Taniai et al., 2017). Graphcut based methods are slower than SGM, which is often considered to offer the best balance between efficiency and accuracy (Bethmann, Luhmann, 2015).
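To illustrate the cost aggregation step, the SGM recurrence along a single direction adds a small penalty P1 for disparity changes of one pixel and a larger penalty P2 for all larger changes. Below is a minimal sketch for one left-to-right scanline; the function name and penalty values are illustrative, not those of any evaluated implementation, and full SGM sums this aggregation over 8 or 16 directions.

```python
import numpy as np

def sgm_aggregate_row(cost, P1=1.0, P2=8.0):
    """Aggregate a (W, D) matching-cost slice along one scanline direction.

    Implements L(x, d) = C(x, d)
                        + min(L(x-1, d), L(x-1, d+-1) + P1, min_k L(x-1, k) + P2)
                        - min_k L(x-1, k)
    """
    W, D = cost.shape
    L = np.zeros_like(cost, dtype=float)
    L[0] = cost[0]
    for x in range(1, W):
        prev = L[x - 1]
        m = prev.min()                       # best cost of the previous pixel
        up = np.roll(prev, 1); up[0] = np.inf     # neighbour at d - 1
        down = np.roll(prev, -1); down[-1] = np.inf  # neighbour at d + 1
        L[x] = cost[x] + np.minimum.reduce(
            [prev, up + P1, down + P1, np.full(D, m + P2)]) - m
    return L
```

In this toy recurrence, an isolated high cost is "repaired" by cheap neighbouring disparities at the price of the P1 penalty, which is exactly the smoothing behaviour SGM relies on.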

Learning based methods
Traditional machine learning based methods have been proposed to address the problem of dense matching. A support vector machine can be used to learn a linear discriminant function (Li, Huttenlocher, 2008). Because individual features have their pros and cons, a random forest (RF) can be used to fuse several feature types, e.g. census, normalized cross-correlation (NCC), the zero-mean sum of absolute differences (SAD), and the SAD of Sobel features (Batsos et al., 2018). After feature fusion, a traditional optimization is used to obtain the final result.
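As an example of one such hand-crafted feature, the census transform encodes each pixel by comparing it with its neighbours, and the matching cost between two pixels is then the Hamming distance between their census codes. A minimal 3×3 sketch follows; borders are handled by wrap-around for brevity, and this is an illustration, not the CBMV implementation.

```python
import numpy as np

def census_transform(img):
    """3x3 census: one bit per neighbour, set when the neighbour is darker
    than the centre pixel (wrap-around borders for simplicity)."""
    out = np.zeros(img.shape, dtype=np.uint16)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            out = (out << 1) | (shifted < img)
    return out

def census_cost(code_a, code_b):
    """Matching cost = Hamming distance between two census codes."""
    x = np.bitwise_xor(code_a, code_b)
    # count set bits per pixel
    return np.unpackbits(x.view(np.uint8), axis=-1).reshape(*x.shape, -1).sum(-1)
```

Census is popular for dense matching because it is invariant to monotonic illumination changes, which is why feature-fusion approaches such as CBMV combine it with intensity-based scores like NCC.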
With the development of deep learning, some steps of the traditional methods can be replaced by deep learning counterparts (Laga et al., 2020). For instance, 2D convolutional neural networks (CNN) have proven effective for feature extraction (Žbontar, LeCun, 2016). To make the CNN more efficient, the feature similarity calculation can be treated as a multi-class classification (Luo et al., 2016). After the feature similarity calculation, a traditional optimization is used to obtain the final result. SGM-Net uses a CNN to provide learned penalties for SGM (Seki, Pollefeys, 2017). GC-Net uses a 3D CNN based network for cost aggregation (Kendall et al., 2017). The pyramid stereo matching network uses spatial pyramid pooling and a 3D CNN (Chang, Chen, 2018). The high resolution stereo network uses upscaling in the 2D CNN, so that the 3D cost volume does not need to be down-scaled (Yang et al., 2019a). To address the high memory consumption of high-resolution image matching, a recent method (Duggal et al., 2019) proposes to prune the 3D cost volume with a differentiable PatchMatch method. CNN and conditional random fields (CRF) can also be combined into a hybrid CNN-CRF model which is more effective than fully-connected CRFs (Knöbelreiter et al., 2018). While these methods were developed by the computer vision community on indoor or outdoor datasets, we tested them on the ISPRS Vaihingen dataset, and we used LiDAR data to generate ground truth disparity maps for training and evaluation. This paper evaluates the state-of-the-art deep learning based dense matching approaches for which a public implementation is available. Table 1 lists some benchmarks for stereo dense matching on real scenes proposed by the robotics and computer vision communities.
Some are popular and widely used: the indoor Middlebury dataset (Scharstein et al., 2014); the KITTI datasets in two versions, KITTI 2012 (Geiger et al., 2012) and KITTI 2015 (Menze, Geiger, 2015); and the ETH3D dataset containing stereo pairs (Schops et al., 2017). Using advanced computer graphics, virtual datasets have also been generated (Mayer et al., 2016; Yang et al., 2019a). The strong ongoing research activity on autonomous driving has also resulted in several dedicated datasets such as DrivingStereo (Yang et al., 2019b) and ApolloScape (Huang et al., 2019).

Benchmark data
In aerial photogrammetry, benchmark datasets focus mainly on MVS dense matching (Cavegn et al., 2014). For satellite imagery, the IARPA dataset is widely used for dense matching evaluation (Bosch et al., 2016). The traditional evaluation method relies on a DSM (Bosch et al., 2019), but the pipeline then contains other steps, such as point cloud fusion and DSM generation (Cournet et al., 2020), which can influence the accuracy. A recent satellite image stereo benchmark derived from a LiDAR DSM (Patil et al., 2019) also uses the IARPA dataset. Moreover, even with high density LiDAR, points on walls are discarded when the DSM is generated.
Generating ground truth data from real scenes is challenging. To avoid errors introduced by grid interpolation or point cloud fusion and keep the points on walls, we propose a method to generate sparse disparity ground truth for training deep learning architectures. Then, we evaluate both traditional and learning based methods on the proposed benchmark.

Overview
The paper is structured as follows: benchmark data generation is described in Section 2, and evaluations are provided in Section 3. Finally, conclusions are drawn and perspectives proposed in Section 4.

GROUND TRUTH DATA GENERATION
In this paper, we use the Vaihingen dataset from the ISPRS 3D reconstruction benchmark, which provides a good registration of oriented images and LiDAR point clouds. The dataset is composed of 20 images with a depth of 11 bits and a ground sample distance (GSD) of 8 cm. The median LiDAR point density is 6.7 points/m² (Rottensteiner et al., 2012). From this data, epipolar stereo image pairs and the corresponding disparity maps are generated in four steps: data preprocessing, epipolar image generation, point cloud projection, and benchmark data production.

Data preprocessing
The Vaihingen dataset contains the images and their orientation parameters, plus a LiDAR point cloud. First, we convert the orientation files into the open source MicMac format (Pierrot-Deseilligny et al., 2014), on which our workflow is based. As the images only capture the apparent surface, we keep only the first LiDAR echo. Moreover, we remove isolated points based on the point cloud density using a grid filtering approach (Cho et al., 2004).
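A simple way to remove isolated points in the spirit of this grid filtering is to rasterize the points into 2D cells and discard points falling in sparsely populated cells. The sketch below is illustrative only; the cell size and minimum count are hypothetical parameters, not the values used for the benchmark.

```python
import numpy as np

def grid_density_filter(points, cell=2.0, min_pts=3):
    """Keep only points whose 2D grid cell contains at least min_pts points.

    points: (N, 3) array of (X, Y, Z) coordinates.
    """
    # integer cell index of each point in the XY plane
    ij = np.floor(points[:, :2] / cell).astype(np.int64)
    # count points per occupied cell and map the count back to each point
    _, inv, counts = np.unique(ij, axis=0, return_inverse=True, return_counts=True)
    return points[counts[inv] >= min_pts]
```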

Epipolar image generation
The first step of our process is to create epipolar image pairs from input image pairs with sufficient footprint overlap. Coarse image footprints are obtained using an approximate height of the scene, and the Computational Geometry Algorithms Library (CGAL) (Flötotto, 2020) is used to compute their intersections. Only image pairs with an intersection area above half the image footprint are considered for epipolar rectification, which is also performed with the MicMac library. The corresponding orientation parameter files are generated in order to allow the direct projection of 3D points into the rectified images.
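The selection criterion can be illustrated with axis-aligned bounding rectangles of the footprints. The actual workflow intersects the true polygonal footprints with CGAL; the functions and threshold below are only a sketch of the overlap test.

```python
def footprint_overlap_ratio(a, b):
    """a, b: (xmin, ymin, xmax, ymax) ground footprints.
    Returns the intersection area divided by the area of footprint a."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    return inter / ((a[2] - a[0]) * (a[3] - a[1]))

def is_candidate_pair(a, b, min_ratio=0.5):
    # keep the pair only if the intersection covers more than half a footprint
    return footprint_overlap_ratio(a, b) > min_ratio
```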

Point cloud projection
The ground truth disparity maps used for both training and evaluation are computed from the LiDAR point cloud. The perspective projection model is used to project the 3D point cloud into the image plane. The disparity d is defined in Equation (1):

d = x_l - x_r,  (1)

where x_l and x_r are the x coordinates in the image plane obtained by projecting the 3D point into the left and right images.
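For an ideal rectified pair with a pinhole model, the projection and Equation (1) can be sketched as follows. The focal length f, principal point cx and baseline are illustrative parameters; the actual projection uses the MicMac orientation files generated above.

```python
import numpy as np

def lidar_to_disparity(points, f, cx, baseline):
    """Project 3D points (X, Y, Z) into an ideal rectified pair and
    return d = x_l - x_r for each point (Equation (1)).

    For this simplified geometry the result also equals f * baseline / Z,
    i.e. the disparity is inversely proportional to the depth (Equation (2)).
    """
    X, _, Z = points.T
    x_left = f * X / Z + cx
    x_right = f * (X - baseline) / Z + cx
    return x_left - x_right
```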
Because the point cloud is sparse, it is difficult to predict the occlusions, which can be significant (Bevilacqua et al., 2017). An example of occlusion (in the left image) is shown in Figure 1: some points from the ground area should be removed as they are occluded by the roof.
According to the epipolar geometry, the disparity is related to the depth of the object (Jain et al., 1995), as expressed in Equation (2):

z = (b * f) / d,  (2)

where z is the depth, b is the baseline length, f is the focal length and d is the disparity. The disparity is inversely proportional to the depth.
For aerial images, the disparity is related to the elevation, such that ground points have a larger disparity than points on the roof. Similarly to (Biasutti et al., 2019), we use a density-based filter to remove occluded points: if the difference between a point's disparity and the local median disparity is larger than a threshold, the point is removed. This filter removes occluded points quite effectively, as shown in Figure 2.
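A minimal version of this density-based occlusion filter is sketched below. The neighbourhood radius and threshold are illustrative values, and the quadratic neighbour search is kept deliberately simple for clarity.

```python
import numpy as np

def remove_occluded(points, radius=2.0, thresh=3.0):
    """points: (N, 3) array of (x, y, disparity) after projection into the image.
    Drop points whose disparity deviates from the local median by more than thresh."""
    keep = np.ones(len(points), dtype=bool)
    for i, (x, y, d) in enumerate(points):
        # neighbours within a square window around (x, y)
        near = (np.abs(points[:, 0] - x) <= radius) & \
               (np.abs(points[:, 1] - y) <= radius)
        if abs(d - np.median(points[near, 2])) > thresh:
            keep[i] = False
    return points[keep]
```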

Benchmark data production
After the disparity map generation, the benchmark dataset is split into training and testing sets. In our experiment, we split the dataset according to the LiDAR point cloud area, as shown in Figure 3. Similarly to the KITTI dataset, the final data are 1024×1024 cropped images with an 8-bit color depth, and the disparity images are stored on 16 bits with the disparity value scaled by 256. After cropping the left image, the horizontal offset of the right crop is set to the average disparity in this area in order to avoid too large disparity values. If the footprint of a cropped image intersects the training area, it is attributed to the training dataset, otherwise to the testing dataset. An example is shown in Figure 4. Using this procedure, our training set contains 585 image pairs while the testing set has 507 image pairs. In the training dataset, 337 stereo pairs are along the flight strip (along-strip) and 248 stereo pairs are across the flight strip (across-strip).
In the testing dataset there are 323 along-strip and 184 across-strip stereo pairs; they will be used to evaluate the impact of the base to height ratio (B/H), as the B/H differs between these two configurations. More detailed information can be found at https://github.com/whuwuteng/benchmark_ISPRS2021.
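The KITTI-style 16-bit disparity encoding described above can be sketched as a simple round trip (PNG I/O is omitted; function names are illustrative):

```python
import numpy as np

def encode_disparity(disp, valid):
    """Store disparity * 256 as uint16; 0 marks pixels without ground truth."""
    out = np.zeros(disp.shape, dtype=np.uint16)
    out[valid] = np.round(disp[valid] * 256.0).astype(np.uint16)
    return out

def decode_disparity(raw):
    """Recover float disparity and the validity mask from the 16-bit image."""
    valid = raw > 0
    return raw.astype(np.float32) / 256.0, valid
```

Under this scheme, the horizontal offset applied to the right crop would simply be the rounded average of the valid disparities in the tile, which keeps the stored values small.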

EVALUATION
After generating the epipolar image pairs and the corresponding ground truth disparity images from LiDAR, we evaluate several traditional and learning based methods on this dataset:

1. MICMAC: the dense matching method of the open source MicMac photogrammetric suite (Pierrot-Deseilligny et al., 2014).

2. SGM(GPU): a GPU implementation of semi-global matching operating at the full image scale (Hernandez-Juarez et al., 2016).

3. GraphCuts: a Graphcut based method (Taniai et al., 2017) using intensity and gradient consistencies with a plane constraint. The code is written in C++ (Taniai, 2020).

4. CBMV: a coalesced bidirectional matching volume based method (Batsos et al., 2018) using a random forest (RF) to fuse the features, followed by a dynamic programming optimization (CBMV(SGM)) or a Graphcut optimization (CBMV(GraphCuts)) sharing the implementation of the GraphCuts method above. The code is based on C++, CUDA (SGM implementation) and Python (Batsos, 2020).

5. DeepFeature: a deep learning network is used to compute the features, which are then optimized with dynamic programming (Luo et al., 2016). The code is based on C++, CUDA (SGM implementation) and Lua Torch (Luo, 2020).
6. PSM net: a pyramid stereo matching network using spatial pyramid pooling and a 3D CNN to train an end-to-end model (Chang, Chen, 2018). The code is based on PyTorch (Chang, 2020).
7. HRS net: a high resolution stereo network structure improving the accuracy without downscaling the cost volume (Yang et al., 2019a). The code is based on PyTorch (Yang, 2020).

(The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021, XXIV ISPRS Congress, 2021 edition)

Figure 3. Training and testing area on Vaihingen.
8. DeepPruner: the 3D cost volume is pruned with a differentiable PatchMatch method to avoid computing and evaluating a full cost volume (Duggal et al., 2019). The code is based on PyTorch (UberResearch, 2020).

Experimental setup
The disparity search range is an important parameter for stereo dense matching. Some methods do not need this parameter, namely MICMAC and DeepPruner. In SGM(GPU), the range is fixed to 128 by the implementation. For the other methods, it is set to 192.
For machine learning based methods, the training data and hyper-parameters significantly impact the results. For the random forest based method CBMV, 54 epipolar pairs are used for training. For the deep learning based methods, all the training data are used. For the evaluation, all the testing data are used for all methods.
In our experiments, we compare deep learning methods pretrained on the KITTI dataset and fine-tuned on our Vaihingen ground truth disparity. We found that the batch size has a strong impact on the memory requirements and on the accuracy, as shown in Table 2 for PSM net. We therefore use the default batch size proposed in each implementation: 12 for PSM net, 28 for HRS net and 16 for DeepPruner. For the fine-tuning experiments on the Vaihingen dataset, we likewise use the default number of epochs: 20 for DeepFeature, 500 for PSM net, 700 for HRS net and 900 for DeepPruner.

Pixel error
In stereo dense matching evaluation, the pixel error (the error between the estimated disparity and the ground truth) is an important metric to analyze the performance. In our experiments, we compute the 2, 3 and 5-pixel errors, where the n-pixel error is the percentage of valid ground truth pixels whose estimated disparity differs from the ground truth by more than n pixels. As shown in Table 3, machine learning based methods perform significantly better than traditional methods. While slower, the Graphcut optimization achieves better results than dynamic programming. PSM net obtains the best result among all tested deep learning methods.
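The n-pixel error reported in the tables can be computed as follows (a minimal sketch; variable names are illustrative):

```python
import numpy as np

def n_pixel_error(est, gt, valid, n):
    """Percentage of valid ground-truth pixels with |est - gt| > n pixels."""
    diff = np.abs(est[valid] - gt[valid])
    return 100.0 * float(np.mean(diff > n))
```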
As for the deep learning methods, the data used for learning highly influence the results. A comparison between the results obtained with a pretrained model and with a model fine-tuned on our training set is provided in Table 4. This experiment shows that the dependency of deep learning methods on the training data varies with the method: compared to DeepFeature, which is a hybrid method, end-to-end methods depend more on the training data. In our experiments, we found that the base to height ratio (B/H) of the epipolar stereo pairs also influences the result, as shown in Table 5. Along-strip images are usually used for dense matching because they have a large overlap and a small B/H. Across-strip images have a larger B/H, which can increase the intersection accuracy but leads to more errors because of larger perspective distortions and more occlusions (Tola et al., 2008). In order to investigate the effect of the B/H, we experimented with two different trainings of the models: using all the training dataset, and using only the along-strip images. As shown in Table 6, the results on across-strip pairs are not as good as on along-strip pairs. For the along-strip training, the dataset is smaller than the full training set, but there is no big difference in the results on along-strip testing data, which means that along-strip images are sufficient for fine-tuning on along-strip pairs. However, using across-strip images for fine-tuning improves the performance on across-strip pairs; in particular, the 1-pixel error is improved significantly for all methods.

Error analysis
The pixel error is a global statistical metric. When the scene is complex, errors can show different patterns on buildings (man-made structures), ground, vegetation and so on. To visualize this, Figure 5 provides error maps on different scenes. Because the ground truth is sparse, the error maps are interpolated with the nearest point in order to facilitate interpretation. The disparity maps are visualized using an ambient occlusion shading implemented in MicMac, as shown in Figure 6. Figures 6(p) and 6(r) show that the Graphcut based methods produce smooth results on building roofs. Figures 6(k) to 6(m) show that the deep learning based methods perform well on building boundaries (disparity discontinuities).
DeepPruner achieves the second best result among these methods, but its pretrained result is poor, as shown in Figure 7(i), which means that DeepPruner is highly dependent on the training dataset. Because of the plane constraint in the GraphCuts method, its results are not satisfactory on trees, as shown in Figure 7(c).

CONCLUSIONS AND FUTURE WORK
In this paper, we propose a new stereo benchmark dataset adapted to deep learning and evaluate some existing traditional and learning based methods on this dataset. Experiments show that:

• Learning based methods perform better than traditional methods.
• Fine-tuning deep learning architectures by transfer learning on a specific dataset improves the results significantly.
• Both along-strip and across-strip images should be considered in training.
Our future work will focus on extending the benchmark to satellite images, and on studying how training on different scenes and sensor types affects the performance of learning-based methods.