SEMI-GLOBAL MATCHING WITH SELF-ADJUSTING PENALTIES

The demand for 3D models of various scales and precisions is strong for a wide range of applications, among which cultural heritage recording is particularly important and challenging. In this context, dense image matching is a fundamental task for processes which involve image-based reconstruction of 3D models. Despite the existence of commercial software, the need for complete and accurate results under different conditions, as well as for computational efficiency on a variety of hardware, has kept image-matching algorithms among the most active research topics. Semi-global matching (SGM) is among the most popular optimization algorithms due to its accuracy, computational efficiency, and simplicity. A challenging aspect of SGM implementation is the determination of smoothness constraints, i.e. penalties P1, P2 for disparity changes and discontinuities. In fact, penalty adjustment is needed for every particular stereo-pair and cost computation. In this work, a novel formulation of self-adjusting penalties is proposed: SGM penalties can be estimated solely from the statistical properties of the initial disparity space image. The proposed method of self-adjusting penalties (SGM-SAP) is evaluated using typical cost functions on stereo-pairs from the recent Middlebury dataset of interior scenes, as well as from the EPFL Herz-Jesu architectural scenes. Results are competitive against the original SGM estimates. The significant aspects of self-adjusting penalties are: (i) the time-consuming tuning process is avoided; (ii) SGM can be used in image collections with a limited number of stereo-pairs; and (iii) no heuristic user intervention is needed.


INTRODUCTION
The extraction of dense 3D information and the accurate visual recording from a set of images is a core part of various Cultural Heritage applications. Typically, accurate visual and geometric recording supports documentation, restoration and preservation activities ranging from large scale monuments to small artifacts. Lately, cultural heritage has benefited from new emerging technologies, based on 3D information, and the impressive increase in available smart mobile devices. Gamification of guided tours and story-telling approaches for the public presentation of cultural heritage are based on augmented and virtual reality tools, which all share the extraction of 3D information as a key enabling technology. As an active research topic, extraction of dense 3D information is bundled with many intermediate products and application fields in the areas of photogrammetry, computer vision and image processing. 3D reconstruction, ortho-projection, pose estimation, simultaneous localization and mapping, image stitching, recognition and novel view synthesis are but a few of the topics of interest. In this context, dense image matching is a fundamental task for every application undertaking automated 3D reconstruction from images. 3D model generation concerns an ever-growing list of diverse applications, which includes cultural heritage recording, precision agriculture and farming, automation in construction, large-scale city modeling, 3D GIS, automotive industry, industrial robotics, infrastructure inspection, and security.
While several related software packages have been commercially introduced, the varying application conditions, the demand for complete and accurate measuring products, as well as for computational efficiency on a variety of hardware, keep image-matching algorithms among the most active research topics. Stereo-matching, or multiple view stereo-matching, is indeed a challenging task when compared to multi-view matching, as it addresses the question with a limited number of observations. This said, it represents an indispensable tool for scenarios where multiple views are limited, such as historical images, aerial images, robot vision, autonomous vehicles and mobile devices.
Scene reconstruction usually falls under two distinct processes: sparse matching for retrieving correspondences among images for camera extrinsic and intrinsic calibration, and dense matching for full 3D surface reconstruction. Dense stereo-matching, i.e. the estimation of a homologue in the matching (right) image for each pixel in the base (left) image, is typically performed on rectified (epipolar) stereo pairs, and it is an essential element in both multi-view stereo and stereo-view reconstruction processes. A well-established approach in analysing and classifying stereo-matching algorithms is to decompose them into four basic components: matching cost computation, support aggregation, disparity optimization and disparity refinement (Szeliski, 2011). An evaluation of stereo-matching methods based on their actual results and usefulness in real life applications is quite difficult and depends on several diverging criteria. This is particularly true if one considers the variety of both applications and arising issues, e.g. depth variability, lighting conditions, reflecting surfaces, scene occlusions, image acquisition geometry, and illumination changes, just to name a few.
In the matching cost computation step a dissimilarity measure is given to each pixel for every value in the disparity range. The matching measures may be simple (for instance, absolute pixel differences) but they could also involve image transformations such as the non-parametric Census transformation and its variations to produce robust results based on binary relationships of pixels with their vicinity. One of the most recent reviews evaluates an extensive collection of matching cost functions (Hu & Mordohai, 2012). Computed cost volumes need to be smoothed against noise, usually by exploiting the 'fronto-parallel' assumption; thus the pixel-wise cost is aggregated within a support neighbourhood. A thorough review is presented in Tombari et al. (2008). A common distinction is between local and global methods; disparity selection in local methods is typically carried out in the winner-takes-all (WTA) mode, while global methods rely on energy minimization to optimize disparity over all image pixels, balancing the need to keep surfaces continuous against the pixel-wise matching criteria. Between local and global methods, a class of algorithms for semi-global matching (SGM) has been presented by Hirschmüller (2005). In addition, the class of non-local methods attempts to extend the kernel of the local ones onto the whole image (Huang et al., 2016; Yang, 2012). Zhang et al. (2015) combine the information from different scale spaces to efficiently exploit the image pyramid in addressing issues in texture-less regions and restricting the disparity search space. Li et al. (2016) reduce the 'fronto-parallel' effect in disparity estimation over support aggregation neighbourhoods by proposing the formation of slanted support windows which greatly improve the results for non-frontal surfaces. Following the most recent trend in computer vision research and state-of-the-art applications, deep learning approaches, i.e. convolutional neural network (CNN) schemes, are constructed for the purposes of stereo-matching in the matching cost computation step. Some of the top-ranking algorithms in the evaluation platforms are based on such formulations. Thus, Zbontar & LeCun (2015) train a convolutional neural network on small image patches of known disparity, and the result is used as an initial cost volume. On the other hand, Luo et al. (2016) estimate a product layer from the inner product of the two representations of the typical Siamese network in order to simplify the process and speed it up considerably, towards real-time applications. The promising idea of exploiting the strengths and avoiding the weaknesses of different matching functions is proposed in Spyropoulos & Mordohai (2015), where an ensemble classifier is trained to decide the appropriate cost function for each pixel. Lately, it has been discussed that cost aggregation is the key process for most local methods and an important component for many global ones (Yang et al., 2009; Wang & Zheng, 2008). In Georgousis et al. (2016) such a hybrid method, refining the global estimations by local support windows, has been presented.
One of the most cited, publicly available databases of stereo images, which at the same time serves as an online evaluation platform, is that of Middlebury College*. The images used here have been taken from the Middlebury 2006 stereo-pairs and the newest Middlebury 3 high resolution dataset, which has separate training and testing stereo-pairs. Furthermore, stereo-pairs from the EPFL multi-view datasets of external architectural scenes† were chosen for evaluating the proposed approach. Finally, it is noted that the KITTI datasets‡ provide a series of images of urban driving scenes. On the evaluation sites new stereo-matching algorithms are being constantly reported.
* http://vision.middlebury.edu/stereo/data/
† http://cvlabwww.epfl.ch/data/multiview/denseMVS.html
‡ http://www.cvlibs.net/datasets/kitti/eval_stereo.php

In this paper, an improved approach of the Semi-Global Matching (SGM) algorithm is presented, which eliminates the need for scenario-specific tuning of the SGM penalty parameters. Thus, its main contribution is that it introduces a method for automatically estimating penalties P1 and P2 of SGM and methods derived from it. This is achieved after computing certain statistical properties of the Disparity Space Image (DSI), which is estimated during the matching cost computation. The presented method of self-adjusting penalties (SGM-SAP) was evaluated using stereo-images of interior scenes from the Middlebury online evaluation platform datasets, as well as images of external architectural scenes selected from the EPFL multi-view datasets.
Next, Section 2 reviews the specifics of SGM and penalty definitions; Section 3 analyses the self-adjustment of the penalty values; Section 4 evaluates the results of our tests; the paper is concluded with final remarks and possible future tasks.

SEMI-GLOBAL MATCHING AND PENALTIES
Semi-global matching (Hirschmüller, 2005, 2008) is among the top-ranking dense matching algorithms. Its main advantages are accuracy, computational efficiency and simplicity in implementation when compared to high performance global and local methods. Consequently, it is used in stereo as well as in multi-view stereo scenarios, from real-time to large-scale satellite applications. In this Section, the SGM algorithm is briefly reviewed for the purposes of completeness, and some variations relevant to this work are presented.
SGM is employed in the optimization step, as it defines a global 2D energy function E that depends on the disparity map D:

$$E(D) = \sum_{p} \Big( C(p, D(p)) + \sum_{q \in N_p} P_1 \, T\big[\,|D(p) - D(q)| = 1\,\big] + \sum_{q \in N_p} P_2 \, T\big[\,|D(p) - D(q)| > 1\,\big] \Big) \qquad (1)$$

The global function contains a data term C(p, D(p)) as well as a smoothness term for each pixel p; T[·] equals 1 if its argument is true and 0 otherwise. The smoothness term adds a penalty P1 or P2 for each pixel q in the neighbourhood Np of p, if the disparity of q differs from the disparity of p by exactly 1 pixel or by more than 1 pixel, respectively. SGM suggests approximating the global function by following 1D paths L in several directions r through the image:

$$L_r(p,d) = C(p,d) + \min\Big( L_r(p-r,d),\; L_r(p-r,d-1) + P_1,\; L_r(p-r,d+1) + P_1,\; \min_i L_r(p-r,i) + P_2 \Big) - \min_k L_r(p-r,k) \qquad (2)$$

In each of the r = 8 paths, the optimized cost Lr(p,d) for every pixel p(x,y) and every x-disparity d is estimated from the sum of three terms. The first two are the matching cost C(p,d) and the minimum path cost of the preceding pixel (p-r); the latter is computed after comparison of the path costs of the previous pixel at the same disparity (d), the lower (d-1), the higher (d+1) or over the whole disparity range (i), while taking into consideration penalties P1 and P2. Finally, the minimum path cost of the preceding pixel is subtracted. P1 penalizes slightly slanted surfaces, P2 penalizes discontinuities. The costs from all paths Lr(p,d) are summed up for each pixel and for all possible disparities, resulting in the aggregated cost S(p,d):

$$S(p,d) = \sum_{r} L_r(p,d) \qquad (3)$$

The optimal disparity for each pixel is chosen by the WTA strategy on S, thus creating the final disparity map DL(p):

$$D_L(p) = \arg\min_d S(p,d) \qquad (4)$$

Since the introduction of SGM several variations or extensions have emerged, aiming at improving its performance, computational efficiency, or both. SGM is also implemented in real-time on a variety of platforms, e.g. FPGA or GPU. Moreover, thanks to its implementation in OpenCV, many algorithms use SGM as part of their stereo matching procedure. Recently, non-local methods (Huang et al., 2016) have also introduced cost-aggregation approaches similar to that of SGM; two iterations are needed for the image-guided non-local matching cost computation, and afterwards the estimated cost is optimized via SGM.
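The path-cost recursion described above can be illustrated for a single direction. The following is a minimal sketch, assuming one left-to-right path over one image row and a cost array indexed as `cost_row[x][d]`; the function names `aggregate_path` and `winner_takes_all` are hypothetical and are not taken from the original implementation, which sums such paths over 8 directions.

```python
from typing import List


def aggregate_path(cost_row: List[List[float]], p1: float, p2: float) -> List[List[float]]:
    """Aggregate matching costs along one left-to-right path of a single row.

    cost_row[x][d] is the matching cost C(p, d) of pixel x at disparity d.
    Returns the path costs L_r(p, d) with the same layout.
    """
    n_disp = len(cost_row[0])
    path = [list(cost_row[0])]  # the first pixel has no predecessor
    for x in range(1, len(cost_row)):
        prev = path[-1]
        prev_min = min(prev)  # minimum path cost of the preceding pixel
        current = []
        for d in range(n_disp):
            best = prev[d]                           # same disparity, no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)   # disparity change of 1
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)   # disparity change of 1
            best = min(best, prev_min + p2)          # larger disparity jump
            # Subtracting prev_min keeps the path cost from growing unboundedly.
            current.append(cost_row[x][d] + best - prev_min)
        path.append(current)
    return path


def winner_takes_all(cost_row: List[List[float]]) -> List[int]:
    """Pick, for every pixel, the disparity index with the lowest cost."""
    return [min(range(len(c)), key=c.__getitem__) for c in cost_row]


raw = [[0, 5, 5], [3, 2, 5], [0, 5, 5]]
smoothed = aggregate_path(raw, p1=2, p2=4)
print(winner_takes_all(raw), winner_takes_all(smoothed))
```

In this toy row the middle pixel prefers disparity 1 on raw costs, while the penalties pull it to the disparity of its neighbours after aggregation.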
Regarding the definition of cost penalties, a class of SGM variations is dedicated to the development of functions for the adjustment of penalty P2, which is imposed on disparity changes between neighbouring pixels larger than 1 pixel; they have been reviewed in detail by Stentoumis et al. (2015). These penalty functions are based on the fact that, if the intensity change between pixel p and the preceding one in path L is high and the disparity change between them is larger than 1, the existence of actual edges or object boundaries is highly probable. Hirschmüller (2005) first introduced an adaptive penalty function. The function was created by dividing P2 by the intensity gradient of neighbouring pixels in the reference image for each path, while checking that P2 ≥ P1. Besides, Banz et al. (2012) evaluated the performance of three more penalty functions for P2 and the case of a constant penalty, which is fixed to an empirically defined value. The proposed penalty functions were: negatively (P2n) and inversely (P2i) proportional to the absolute intensity gradient of the currently processed pixels along the path; and negatively proportional (P2v) to the variance of intensity in a local window. A lower bound P2min was introduced to guarantee that P2 ≥ P1.
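The gradient-adaptive penalty described above (P2 divided by the intensity gradient, bounded below by P1) may be sketched as follows. This is only an illustration: the function name `adaptive_p2` is hypothetical, and clamping the gradient to at least 1 to avoid division by zero is an assumption of this sketch rather than a detail given in the text.

```python
def adaptive_p2(p1: float, p2_base: float, intensity_p: float, intensity_prev: float) -> float:
    """Scale P2 inversely with the intensity gradient along the path.

    A strong intensity change suggests an object boundary, so a disparity
    jump there is penalized less. The result never drops below P1.
    """
    gradient = abs(intensity_p - intensity_prev)
    # Assumption of this sketch: gradients below 1 leave P2 unchanged.
    scaled = p2_base / max(gradient, 1.0)
    return max(scaled, p1)


print(adaptive_p2(10, 100, 104, 100))  # moderate edge: P2 reduced to 25.0
print(adaptive_p2(10, 100, 120, 100))  # strong edge: bounded below by P1
```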
A challenging aspect of SGM implementation is obviously the selection of values for the penalties. If parameters have not been properly tuned, the performance of the algorithm may not be as efficient as expected. In fact, penalty adjustment is needed for every different pair of images or, if a different matching cost method is used, even for the same stereo-pair. In this paper, we introduce a method for automatically estimating penalties P1 and P2. This follows the computation of certain simple statistical properties from the DSI volume which is created in the previous step of cost calculation. Therefore, penalties are considered as being self-adjusted to the particular stereo-pair, in relation to the cost function used.
To our knowledge, no method for the automatic estimation of the penalties of SGM has been proposed up to now, with the exception of Chuang et al. (2016). In that work, however, a specific cost function was used, the penalties were extracted after the creation of an initial disparity map from only two costs of each pixel (the lowest and the second lowest), and the evaluation was based on only four image pairs.

SELF-ADJUSTING PENALTY VALUES
The idea behind extracting the values for the SGM penalties from the DSI itself originates from the fact that penalties P1, P2 are actually costs that influence the pixel-wise matching cost C. In the equations that define the penalties, W and H are the width and height of the base image; Nd is the number of disparity labels; and N is the number of image pixels. The minimum matching cost Smin(x,y) of a pixel over all labels l is subtracted from all potential costs S(x,y,l) in order to normalize the DSI values per pixel. Finally, the mean value of the normalized cost over all pixels corresponds to penalty P1, while the maximum value over all pixels corresponds to penalty P2. In this way it is ensured that P2 > P1. The importance of such a definition for the penalties is that they are estimated, without user intervention, from the DSI itself; thus, the self-adjusting penalties remove the need for the conventional, time-consuming tuning step. Appropriate penalty values are derived automatically, regardless of the stereo-pair or the matching cost used. Furthermore, no training dataset of stereo-pairs is needed, i.e. there is no need to estimate penalties on training pairs and then apply them to test pairs under the assumption that many similar images are available. Moreover, these self-adjusting penalties are not computationally expensive. In conclusion, for every stereo-pair the penalties for SGM, or every SGM-like method, can be estimated solely from the DSI, regardless of the matching cost employed and the existence of a ground truth disparity map or of multiple data for training.
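The penalty estimation described above can be sketched in a few lines. The sketch rests on an assumed reading of the text: the DSI is normalized per pixel by subtracting Smin(x,y), P1 is taken as the mean of all normalized values over all pixels and labels, and P2 as their overall maximum. The exact averaging used by the authors may differ, and the function name `self_adjusting_penalties` and the `dsi[y][x][l]` layout are hypothetical.

```python
from typing import List, Tuple


def self_adjusting_penalties(dsi: List[List[List[float]]]) -> Tuple[float, float]:
    """Estimate (P1, P2) from a disparity space image indexed as dsi[y][x][l]."""
    total = 0.0
    count = 0
    p2 = 0.0
    for row in dsi:
        for costs in row:
            cost_min = min(costs)  # S_min(x, y): best cost of this pixel
            for cost in costs:
                normalized = cost - cost_min  # per-pixel normalization
                total += normalized
                count += 1
                p2 = max(p2, normalized)      # assumed: global maximum gives P2
    p1 = total / count                        # assumed: global mean gives P1
    return p1, p2


# Toy DSI: one row, two pixels, three disparity labels each.
print(self_adjusting_penalties([[[2, 4, 6], [1, 1, 7]]]))
```

Under this reading P2 can never fall below P1, since a maximum is never smaller than a mean of the same values.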
In the cost calculation step, common cost functions such as Absolute Difference (AD) of intensities, or Census transform with a 7×7 window, were used. Next, SGM was used for cost optimization. The WTA strategy is adopted during the disparity optimization step for acquiring the initial disparity map. Finally, disparity refinement is possible, e.g. with the use of photo-consistency, sub-pixel disparity interpolation, or median filtering.
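To make the Census-based cost concrete, here is a minimal sketch of a Census transform followed by a Hamming distance. The bit ordering, the "neighbour darker than centre" comparison and the zero-filled border are assumptions of this sketch; the names `census_transform` and `hamming_cost` are hypothetical. A radius of 3 corresponds to the 7×7 window mentioned above.

```python
from typing import List


def census_transform(image: List[List[int]], radius: int = 3) -> List[List[int]]:
    """Encode each pixel by comparing it with its (2*radius+1)^2 neighbourhood.

    Each neighbour contributes one bit: 1 if it is darker than the centre.
    Border pixels where the window does not fit are left as 0 (simplification).
    """
    height, width = len(image), len(image[0])
    codes = [[0] * width for _ in range(height)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            centre = image[y][x]
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue  # the centre is not compared with itself
                    darker = 1 if image[y + dy][x + dx] < centre else 0
                    bits = (bits << 1) | darker
            codes[y][x] = bits
    return codes


def hamming_cost(code_a: int, code_b: int) -> int:
    """Matching cost: number of differing bits between two Census codes."""
    return bin(code_a ^ code_b).count("1")


codes = census_transform([[1, 2, 3], [4, 5, 6], [7, 8, 9]], radius=1)
print(bin(codes[1][1]), hamming_cost(0b1010, 0b0110))
```

With a 7×7 window each code has 48 bits, so the Hamming cost lies between 0 and 48.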

RESULTS
The presented algorithm has been evaluated on the 15 training stereo image pairs of quarter-size resolution from Middlebury Stereo Evaluation Version 3 (Scharstein et al., 2014), and also on the 21 quarter-size stereo image pairs of the 2006 datasets from Middlebury College. The algorithm has also been tested using an EPFL multi-view dataset with external architectural scenes (Strecha et al., 2008). All processes have been implemented in the Matlab programming environment.

Middlebury 2014 datasets
The Hamming distance on Census transformed images was used as the initial matching cost. Next, penalties P1 and P2 for SGM were estimated via the suggested method and were applied in the SGM algorithm. In Table 1 the computed penalties for each stereo-pair are shown. The initial disparity map was derived by the WTA strategy. Finally, sub-pixel disparities were estimated by a sequential disparity interpolation and 7×7 median filtering for smoothing with outlier tolerance. The error percentage is computed by comparing each resulting disparity value of non-occluded pixels with the corresponding ground truth value, while applying an error threshold of 0.5 pixel. This threshold value was chosen because the default value used in the Middlebury online evaluation platform is 2.0 pixels for full image resolution, which corresponds to a threshold of 0.5 pixel for quarter-size images.
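The error measure described above can be sketched as a bad-pixel percentage over non-occluded pixels. The function names `scaled_threshold` and `bad_pixel_rate` and the boolean mask layout are hypothetical; the sketch only illustrates the computation and the scaling of the 2.0 pixel full-resolution threshold to quarter-size images.

```python
from typing import List


def scaled_threshold(full_res_threshold: float, downscale: int) -> float:
    """Disparities shrink with the image, so the threshold shrinks too."""
    return full_res_threshold / downscale


def bad_pixel_rate(disparity: List[List[float]],
                   ground_truth: List[List[float]],
                   non_occluded: List[List[bool]],
                   threshold: float) -> float:
    """Percentage of non-occluded pixels whose disparity error exceeds threshold."""
    bad = 0
    total = 0
    for d_row, g_row, m_row in zip(disparity, ground_truth, non_occluded):
        for d, g, valid in zip(d_row, g_row, m_row):
            if not valid:
                continue  # occluded pixels are excluded from the statistic
            total += 1
            if abs(d - g) > threshold:
                bad += 1
    return 100.0 * bad / total


threshold = scaled_threshold(2.0, 4)  # quarter-size images
rate = bad_pixel_rate(
    [[10.0, 12.0, 5.0], [7.4, 3.0, 8.0]],
    [[10.2, 13.0, 5.0], [7.0, 9.0, 6.0]],
    [[True, True, True], [True, False, True]],
    threshold,
)
print(threshold, rate)
```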
In Fig. 1 results of the method are shown for three stereo pairs which are representative with respect to the magnitude of the matching error. The strong impact of sub-pixel interpolation on the disparity map can be noted, which is mainly due to the fact that the raw algorithm estimates integer disparities, whereas a 0.5 pixel error threshold is used. A considerable effect is also achieved by denoising via a median filter with a large kernel.
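The effect of sub-pixel interpolation on integer disparities can be illustrated with a parabola fit through the aggregated costs around the winning disparity. This is a commonly used technique and is shown here only as an assumed example; the text does not specify which interpolation scheme was actually applied, and the function name `subpixel_offset` is hypothetical.

```python
def subpixel_offset(cost_prev: float, cost_min: float, cost_next: float) -> float:
    """Offset in (-0.5, 0.5) of the parabola minimum relative to the integer winner.

    cost_prev, cost_min and cost_next are the aggregated costs at
    disparities d-1, d and d+1, where d is the WTA disparity.
    """
    curvature = cost_prev - 2.0 * cost_min + cost_next
    if curvature <= 0:
        return 0.0  # flat or degenerate cost profile: keep the integer disparity
    return 0.5 * (cost_prev - cost_next) / curvature


print(subpixel_offset(4, 1, 4))  # symmetric costs: no shift
print(subpixel_offset(2, 1, 4))  # lower cost on the left: shift towards d-1
```

The refined disparity is the integer winner plus the returned offset.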
The estimated disparity maps of training images were submitted to the Middlebury benchmark evaluation page (Fig. 2), resulting in an error of 22.8% and the 34th position for non-occluded pixels and a 2.0 pixel error threshold (date of evaluation: January 22, 2017). The image pairs displaying the best performance were Playtable (28th position) and Vintage (32nd position), whereas those of poorest performance were ArtL (43rd position), Pipes and PlaytableP (42nd position).
Compared to the original SGM algorithm (Hirschmüller, 2008) and its results submitted in the Middlebury platform, our method presents an error higher only by 1.8%, and it is only 3 positions lower in the evaluation list. Playtable and Vintage show lower errors (35% against 38.8% and 40.6% against 41.1%, respectively), whereas Jadeplant and ArtL show the highest errors compared to those of Hirschmüller (31.9% against 26.4% and 18.8% against 15%, respectively). The errors of the stereo-pairs for both methods as well as their ranking are presented in Fig. 3.

Figure 3. Errors for all stereo-pairs (left) and ranking in Middlebury benchmark (right) for SGM and our method (SGM-SAP).
Finally, it is noted that, compared to our method, in the original SGM algorithm additional refinements are being used, such as left-right consistency or removal of disparity segments smaller than 100 pixels, whereas median filtering is not applied.
In Fig. 4 disparity maps are shown which highlight the differences in errors against ground truth between the original SGM and our method. Pixels whose disparity difference against ground truth is larger than 0.5 pixel if original SGM is applied but less than 0.5 if our method is used are highlighted in blue. Pixels whose disparity difference compared with ground truth is above 0.5 pixel if our method is applied but below 0.5 if the original SGM is used are highlighted in red. It is observed that, ignoring small artifacts produced by either method, our method outperforms SGM in slightly slanted surfaces (e.g. floor of Playtable or Motorcycle), but performs less well in areas of texture (e.g. in the background of the Jadeplant stereo-pair).
Finally, additional experiments regarding the suggested method were conducted and evaluated on the Middlebury 2014 datasets.
In particular, the median instead of the mean value was used for the estimation of penalty P1. The difference in the total error of the 15 pairs with respect to the initial method was negligible (0.01%). Furthermore, when a 9×7 window is used for the Census transform (as in the original SGM algorithm) before the penalty adjustment, the error is the same for the initial disparity maps and higher by 0.4% after refinements. Besides, after the automatic estimation of both penalties by our method, a penalty function proposed by Hirschmüller (2005) was tested for the adjustment of penalty P2 to the intensity gradient. The estimated disparity map appeared noisier and initially showed an error higher by 2%, which after disparity refinement was reduced to 0.8%.

Middlebury 2006 datasets
For this case, Absolute Differences of intensities and Hamming distance on Census-transformed images were used as cost metrics. Subsequently, penalties P1 and P2 for SGM were computed from the proposed method and were applied to the SGM algorithm. The initial disparity map has been created in the WTA mode.
The overall error for the 21 pairs was compared against the errors obtained from the method without automatic penalty estimation, namely by using the optimal parameters of a tuning process (Stentoumis et al., 2015). The error percentage is calculated after the comparison of each resulting disparity value in non-occluded areas with the corresponding ground truth value, while an error threshold of 1 pixel is applied. The error of our method was higher by only 0.87% (11.89% against 11.02%) when the Census metric served as cost function and by 2.27% (25.72% against 23.45%) when Absolute Differences were applied. These results suggest that the proposed method can work well with different matching cost functions.
In Table 2 the estimated (SGM-SAP) penalty values for two matching costs are seen against the values derived by the tuning process for each individual stereo-pair. Optimal penalties [P1, P2] which lead to the minimum of the mean errors over all stereo-pairs are [10, 100] from the tuning of the AD-SGM method, while for the Census-SGM method these are [25, 100].

Figure 5. Disparity maps derived from a tuning process (top row) and the suggested method (bottom row). From left to right: stereo-pairs Monopoly, Flowerpots, Lampshade1, Lampshade2. In the first two our method (with the use of AD and Census as matching costs) has the best performance; in the other two the optimal parameters of tuning perform best concerning the estimated errors.
In Fig. 5 representative results are seen. In particular, it shows the disparity maps of pairs for which the penalties of our method yield lower errors, next to the corresponding maps obtained with the optimal parameters of tuning; conversely, it shows the disparity maps of pairs for which the tuned parameters yield lower errors, next to the corresponding maps obtained with the penalties of the proposed method. Both methods employ Absolute Differences and Census as matching costs. As may be observed, our method performs better in slanted surfaces with adequate texture (e.g. the Monopoly board or the surface of a flowerpot in the corresponding pair). However, its performance lags behind SGM when matching surfaces are of low texture (e.g. the magazine box and the foreground object in the Lampshade1 and Lampshade2 pairs).

Herz-Jesu-K7 dataset
fied via the left-right consistency check. Fig. 6 shows the epipolar images of Herz-Jesu-K7 and the estimated disparity map.
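A left-right consistency check of the kind mentioned above can be sketched as follows. The sketch assumes integer disparity maps, the convention that a pixel at column x in the left image corresponds to column x - d in the right image, and a tolerance parameter `max_diff`; the function name `left_right_check` and these conventions are assumptions, not details given in the text.

```python
from typing import List


def left_right_check(disp_left: List[List[int]],
                     disp_right: List[List[int]],
                     max_diff: int = 1) -> List[List[bool]]:
    """Mark pixels whose left and right disparities agree within max_diff."""
    height, width = len(disp_left), len(disp_left[0])
    consistent = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disp_left[y][x]
            x_right = x - d  # assumed convention: match lies d columns to the left
            if 0 <= x_right < width and abs(d - disp_right[y][x_right]) <= max_diff:
                consistent[y][x] = True
    return consistent


print(left_right_check([[0, 1, 3, 1]], [[1, 1, 1, 0]], max_diff=0))
```

Pixels that fail the check are typically occlusions or mismatches and can be invalidated or filled in a later refinement step.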
Stereo-pair      P1      P2
Left-to-right    12.6    47
Right-to-left    13.1    47

Table 3. Penalty values for the stereo-pair of Herz-Jesu-K7 estimated by the suggested method.

The accuracy of reconstruction can be estimated after registration of the generated point cloud onto the ground truth data (obtained by laser scanning) via the ICP algorithm. It is noted that first some minor pre-processing of the point cloud was conducted (only the object of interest was kept). The overall mismatch is represented by an average distance of 25 mm and a standard deviation of 20 mm. If reduced to mean image scale, these values correspond to ~1.1 and ~1.1 pixel, which are considered as satisfactory. In Fig. 6 an image of the result of the registration is shown, while a detail of the reconstruction is also illustrated.

CONCLUSIONS
This work has presented a novel approach (SGM-SAP) aiming at the self-adjustment of penalty values of Semi-Global Matching for any image pair and any matching cost method. This is achieved by the automatic estimation of the penalties through a simple process with low computational requirements, relying on the Disparity Space Image (DSI) volume, which has already been computed in the previous step of the matching process. Therefore, no tuning of penalties is needed and no dataset of similar images with corresponding ground truth disparity maps has to be available. The proposed method has been evaluated on the challenging Middlebury Version 3 stereo-pairs, as well as on the Middlebury 2006 datasets. Results show that the error percentages of the disparity maps estimated by SGM-SAP are competitive with the results from the typical SGM approach (in essence they differ by only ~2%). The significance of the proposed method of self-adjusting penalties is that in existing applications of SGM the values of these penalties are generally estimated after a time-consuming tuning process.
Future work includes attempts for further improvements of the method and testing it with the use of other matching cost methods or SGM-like approaches. Furthermore, evaluation of the suggested method on more complex or outdoor scenes, e.g. on the KITTI dataset, will be conducted in the near future.
Cost penalties are added to each pixel's initial cost C(p,d) depending on the disparity d, so their values should be related to this initial cost. In the proposed method penalties are derived from the DSI S(x,y,l) representation of the initial cost C(p,d); (x,y) are the image coordinates of a pixel p and l is the label that maps a disparity d to the DSI, l = l(d).

Figure 1. Estimated disparity maps using SGM-SAP. Left: disparity maps without any refinement; centre: sub-pixel interpolation; right: median filtering. Differences above 0.5 pixel from the ground truth are highlighted in green. Top to bottom: Motorcycle, PlaytableP and Jadeplant stereo-pairs.

Figure 4. Disparity maps of the suggested method (from left to right: Playtable, Motorcycle and Jadeplant stereo pairs). Pixels in blue indicate errors of SGM algorithm which do not exist in our method; pixels in red depict errors of our method not present in SGM.

Figure 6. Results of SGM-SAP on the Herz-Jesu stereo-pair. Top: epipolar images of the stereo-pair; 2nd row: estimated disparity map of the base image; 3rd row: registration of the reconstructed point cloud onto the ground truth data; bottom: detail of the registration between the laser scanner point cloud and the image-based reconstructed model.

Table 1. Penalties estimated by the SGM-SAP method for each stereo-pair of the Middlebury 2014 dataset.
Table 2. Penalty values extracted from the tuning process and the values estimated by the suggested SGM-SAP method for each stereo-pair of Middlebury 2006 (using two cost metrics).