PHOTOMATCH: AN OPEN-SOURCE MULTI-VIEW AND MULTI-MODAL FEATURE MATCHING TOOL FOR PHOTOGRAMMETRIC APPLICATIONS

Automatic feature matching is a crucial step in Structure-from-Motion (SfM) applications for 3D reconstruction. From a historical perspective, SIFT was the enabling technology that made SfM a successful and fully automated pipeline, and it was the ancestor of the wealth of detector/descriptor methods now available. Various research activities have tried to benchmark detector/descriptor operators, but a clear conclusion is difficult to draw. This paper presents an ISPRS Scientific Initiative aimed at providing the community with an educational open-source tool (called PhotoMatch) for tie point extraction and image matching. Several enhancement and decolorization methods can first be applied to an image dataset in order to improve the subsequent feature extraction steps. Different detector/descriptor combinations are then possible, coupled with different matching strategies and quality control metrics. Examples and results show the implemented functionality of PhotoMatch, which also includes a tutorial that briefly explains the implemented methods.


INTRODUCTION
The photogrammetric problem of 3D reconstruction from multiple images has received a lot of attention in the last decade, especially focused on its two main pillars: (i) image orientation and self-calibration and (ii) dense matching reconstruction. However, the overall performance of both steps strongly depends on the quality of the initial feature (keypoints) extraction and matching stage. Therefore, determining which feature detectors and descriptors offer the most discriminative power and the best matching performance is of significant interest to a large part of the photogrammetry and computer vision communities. Methods for performing these tasks are usually based on representing an image using some global or local image properties and comparing them using a similarity measure or some machine/deep learning approaches. Nevertheless, most of the existing methods are designed for matching images within the same modality and under similar geometric conditions.

Aims of the work
The contributions are multifold:
- Develop an open-source educational tool, named PhotoMatch, that encloses different state-of-the-art algorithms for tie point extraction, including different detectors and descriptors as well as matching strategies;
- Reduce the computational cost by exploiting GPU and parallel computing, including CUDA programming capabilities;
- Assess the results of tie point extraction from a quantitative point of view using statistical and robust parameters;
- Release the tool on GitHub, written in C++ with Qt, to allow further contributions from the community;
- Prove the applicability of the developed tool with various datasets (aerial oblique, terrestrial, drone, multi-modal);
- Provide a tutorial and manual to describe the implemented methods.

The PhotoMatch project
With the aim of contributing to the field of tie point extraction, an open-source feature extraction and matching tool, called PhotoMatch, has been developed. PhotoMatch encloses and combines different state-of-the-art detectors and descriptors, together with different matching strategies. It addresses the feature extraction and matching steps with special focus on precision, reliability and flexibility. PhotoMatch is also an educational tool that allows the user to test and combine different detectors and descriptors, as well as to assess the precision and reliability of the results obtained. The project, supported as an ISPRS Scientific Initiative, was led and managed by USAL in collaboration with UCLM, UNILEON, FBK, TWENTE and UDINE universities. Its aim was to develop an open-source tool for image pre-processing, feature extraction and matching, and system evaluation, together with an educational tutorial. The project was built upon a multidisciplinary and international team with experience in image analysis, photogrammetry and computer vision.

PHOTOMATCH
The developed open-source tool encloses a pipeline divided into 6 main steps applied sequentially to the set of loaded images (Figure 1):
1. Project/session definition: the same dataset can be processed with different algorithms and/or parameters, and the achieved results compared.
2. Pre-processing: different enhancement and decolorization methods are available to improve the subsequent feature extraction steps.
3. Feature extraction and description: many detector and descriptor algorithms are included (e.g. SIFT, SURF, etc.) and users can run tests modifying all necessary parameters and combinations to extract and describe keypoints in the images.
4. Feature matching: once keypoints are identified in two or more images, they are matched using different matching strategies (brute force, FLANN, etc.).
5. Quality control: feature matching results are evaluated using several options and metrics.
6. Export: tie points and matching results can be exported in formats compatible with most of the common photogrammetric software in order to run a bundle adjustment and derive the orientation parameters.
In the same project, each of these steps can be repeated several times, allowing the assessment and comparison of different algorithms and parameters. More details on the different options are given in the following sections.

Pre-processing
PhotoMatch can pre-process the input images in order to improve their radiometric content and support the subsequent feature extraction. Image pre-processing has been reported in many papers as a fundamental step, in particular in those cases where the texture quality is unfavourable (Aicardi et al., 2016; Gaiani et al., 2017; Jende et al., 2018). Different pre-processing algorithms are available in PhotoMatch, including among others ACEBSF (Lal and Chandra, 2014), POHE (Liu et al., 2013), RSWHE (Kim and Chung, 2008) and Wallis (Wallis, 1974). This step is optional but strongly recommended in order to achieve better results in the subsequent feature extraction step. Note, however, that feature detectors are typically invariant to certain radiometric transformations, so not every pre-processing algorithm that improves visual perception has an impact on the extraction stage.
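To make the idea of radiometric enhancement concrete, the following minimal sketch implements plain global histogram equalization on a small grayscale image. It is not the RSWHE, POHE or Wallis implementation used in PhotoMatch; it only illustrates the basic operation that the histogram-based methods refine. The function name `equalize_histogram` and the list-of-rows image representation are assumptions made for this illustration.

```python
def equalize_histogram(image, levels=256):
    """Globally equalize a grayscale image given as a list of rows of ints.

    Illustrative only: a plain histogram equalization, not RSWHE or POHE.
    """
    pixels = [p for row in image for p in row]

    # Histogram of grey levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of grey levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]

    # Look-up table mapping the occupied grey levels onto the full range.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


if __name__ == "__main__":
    low_contrast = [[50, 50], [51, 52]]
    print(equalize_histogram(low_contrast))
```

The darkest occupied grey level is mapped to 0 and the brightest to `levels - 1`, which stretches a low-contrast image over the full range. Whether such a stretch actually changes the detected keypoints depends on the radiometric invariance of the detector, as noted above.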

Feature extraction
Many photogrammetric and computer vision tasks rely on feature extraction as primary input for further processing and analysis, including point matching, image registration, object detection, etc. Features suitable for matching should have the following characteristics (Haralick and Shapiro, 1992): distinctness (clearly distinguished from the background), invariance (independent of radiometric and geometric distortions), stability (robustness against image noise), interpretability (the associated interest values should have a meaning and possibly be usable for further operations) and uniqueness (distinguishable from other points). Feature extraction consists of identifying meaningful features in the images, each depicting a salient and distinctive part of the scene. Good features differ from other pixels in that they have specific radiometric properties that make them distinctive and therefore re-detectable in different images with automated procedures. Image features can be categorized into corners, blobs and edges, and their extraction consists of two consecutive steps: feature detection and description.
Detectors are operators which search for 2D locations in the images (i.e. a keypoint or a region) that are geometrically stable under different transformations and carry high information content. Descriptors, on the other hand, analyse the surroundings of the detected feature (e.g. a keypoint) and produce a vector of information. This information can be used to quickly classify the extracted points or in a matching process. Descriptors can generally be divided into floating-point and binary, according to the type of information stored in the vector. Several detection and description algorithms have been proposed in the last decades in order to reliably detect features among images with geometric and radiometric transformations. However, many extreme operating conditions (e.g. multi-modal or multi-temporal images, wide baseline, etc.) still represent a challenge for most of the existing algorithms. PhotoMatch implements diverse sets of detectors (e.g. SIFT (Lowe, 2004), SURF (Bay et al., 2006), MSER (Matas et al., 2004), MSD (Tombari and Di Stefano, 2014), ORB (Rublee et al., 2011), AKAZE (Alcantarilla et al., 2013), BRISK (Leutenegger et al., 2011), etc.) and descriptors (e.g. BOOST (Trzcinski et al., 2013), BRIEF (Calonder et al., 2011), DAISY (Tola et al., 2010), FREAK (Alahi et al., 2012), etc.) to let the user run and test different combinations and assess the results in different conditions. Any combination is allowed in the software. For each algorithm, several advanced parameters can be defined by the user.
The implemented algorithms belong to the more traditional (hand-crafted) features category, as they are defined in advance and not learnt from the images. Modern feature descriptors based on deep learning (Žbontar and Le Cun, 2015; Ono et al., 2018; Christiansen et al., 2019) are not yet implemented in the current version of PhotoMatch. As reported in the literature, these methods need different training according to the type of data processed; they are therefore still difficult to generalize to arbitrary scenarios and to handle in an educational software.
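The distinction between floating-point and binary descriptors can be illustrated with a simplified BRIEF-like descriptor: each bit records the outcome of one intensity comparison between two pixels of the patch around a keypoint. The sketch below is an illustration only. Real BRIEF smooths the patch and samples the test pairs from a specific spatial distribution, and the helper names (`sample_pairs`, `binary_descriptor`, `hamming`) are hypothetical and not taken from PhotoMatch.

```python
import random


def sample_pairs(patch_size, n_bits, seed=0):
    """Draw n_bits random pixel pairs inside a square patch (fixed once, reused for all keypoints)."""
    rng = random.Random(seed)
    return [
        ((rng.randrange(patch_size), rng.randrange(patch_size)),
         (rng.randrange(patch_size), rng.randrange(patch_size)))
        for _ in range(n_bits)
    ]


def binary_descriptor(patch, pairs):
    """Pack one bit per intensity test: bit i is set if patch[p] < patch[q] for pair i."""
    bits = 0
    for i, ((r1, c1), (r2, c2)) in enumerate(pairs):
        if patch[r1][c1] < patch[r2][c2]:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    pairs = sample_pairs(patch_size=8, n_bits=32)
    patch = [[r * 8 + c for c in range(8)] for r in range(8)]
    brighter = [[v + 10 for v in row] for row in patch]
    d1 = binary_descriptor(patch, pairs)
    d2 = binary_descriptor(brighter, pairs)
    print(hamming(d1, d2))  # a uniform brightness offset leaves every comparison unchanged
```

Because only the ordering of intensities matters, the descriptor is unaffected by a uniform brightness offset, and two descriptors are compared with the Hamming distance rather than with an L1 or L2 norm.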

Feature matching
Once keypoints are identified in two or more images, they need to be matched among the images in order to find a set of correspondences or tie points. PhotoMatch contains different matching methods (Brute-Force and FLANN) and strategies (Robust Matching, RM, and Grid-based Motion Statistics, GMS), and different types of descriptor distance (e.g. L1 or L2 norm and Hamming norm). For each strategy, a different set of parameters can be defined by the user to test the results. In order to validate the matches and remove outliers, the computation of a homography (H) or Fundamental (F) matrix can be used as the relative orientation backbone. The robust filtering can be performed using different statistical methods, and threshold values can also be set to assess the sensitivity of the achieved results.
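As a rough illustration of brute-force matching with different descriptor distances, the sketch below pairs each descriptor with its nearest neighbour, rejects ambiguous candidates with a nearest/second-nearest ratio test, and keeps only mutually consistent matches (cross-check). These two filters are common ingredients of robust matching schemes, but they are assumptions here and are not claimed to reproduce the RM strategy of PhotoMatch; all function names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two floating-point descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def ratio_matches(query, train, dist, ratio=0.8):
    """Brute-force nearest neighbours that pass the ratio test."""
    matches = {}
    for qi, q in enumerate(query):
        ranked = sorted((dist(q, t), ti) for ti, t in enumerate(train))
        # Accept only if the best candidate is clearly better than the second best.
        if len(ranked) >= 2 and ranked[0][0] < ratio * ranked[1][0]:
            matches[qi] = ranked[0][1]
    return matches


def robust_matches(desc_a, desc_b, dist, ratio=0.8):
    """Keep matches that pass the ratio test in both directions (cross-check)."""
    ab = ratio_matches(desc_a, desc_b, dist, ratio)
    ba = ratio_matches(desc_b, desc_a, dist, ratio)
    return sorted((i, j) for i, j in ab.items() if ba.get(j) == i)


if __name__ == "__main__":
    floats_a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
    floats_b = [(10.1, 10.0), (0.0, 0.1), (5.0, 5.2), (5.0, 4.8)]
    print(robust_matches(floats_a, floats_b, l2))

    binary_a = [0b1111, 0b0000]
    binary_b = [0b0000, 0b1110]
    print(robust_matches(binary_a, binary_b, hamming))
```

In the floating-point example, the third descriptor of the first set has two equally close candidates in the second set, so the ratio test discards it. The surviving matches would then typically be passed to a geometric check based on H or F.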

Quality control and export
PhotoMatch includes several options for validating and analysing the feature matching results. Quality can be assessed with different approaches:
- manually defining a ground truth within the tool or importing an external one from an input file;
- computing the H and F matrix transformations;
- analysing different quality metrics such as repeatability and ROC/DET curves (Receiver Operating Characteristic/Detection Error Tradeoff) to measure precision and recall of the retrieved correspondences.
Thanks to the developed GUI, tie points can be directly checked and edited on the images to gain a better understanding of the algorithm performance. Finally, PhotoMatch can export the extracted tie points and matching results in formats compatible with most of the common photogrammetric and SfM software in order to run a bundle adjustment and derive camera parameters.
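A ROC curve for a set of matches can be sketched as follows: each match has a descriptor distance and a ground-truth label (correct or wrong), and the curve is traced by sweeping the acceptance threshold over the sorted distances. This is a generic illustration under those assumptions, not the PhotoMatch implementation; the function names `roc_points` and `area_under_curve` are hypothetical.

```python
def roc_points(distances, is_correct):
    """Return (false positive rate, true positive rate) pairs for increasing thresholds.

    distances  : descriptor distance of each candidate match
    is_correct : ground-truth flag of each candidate match
    """
    positives = sum(1 for ok in is_correct if ok)
    negatives = len(is_correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one wrong match")

    points = [(0.0, 0.0)]
    tp = fp = 0
    # Accept matches one by one, from the smallest distance to the largest.
    for _, ok in sorted(zip(distances, is_correct)):
        if ok:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    distances = [0.1, 0.2, 0.3, 0.4]
    labels = [True, True, False, False]
    curve = roc_points(distances, labels)
    print(curve, area_under_curve(curve))
```

An area close to 1 means that correct matches consistently have smaller descriptor distances than wrong ones, so a single threshold separates them well; an area near 0.5 means the distance carries little information about correctness.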

Educational tutorial
PhotoMatch also includes short descriptions of the algorithms implemented in every step, i.e. pre-processing, feature extraction, feature matching and quality control. The tutorial has a dual purpose:
- to give users an overview of the algorithm functionalities and facilitate optimal combinations and parameter selection based on the specific needs of each project;
- to serve as an educational tool for non-expert users, in contrast to black-box solutions.
Each algorithm has a short description in the Help page of PhotoMatch, and relevant references are provided to allow a deeper understanding of the methods used.

EXPERIMENTAL RESULTS
The implemented algorithms were applied to various multi-view (Section 3.1) and multi-modal (Section 3.2) datasets to demonstrate the potential of the tool, with special focus on flexibility and rigour. Although various detectors/descriptors and matching functions were combined and tested, only a few combinations are reported here. Note that PhotoMatch allows different sessions to be created in order to compare and analyse different combinations of detector/descriptor and matching strategy. Figure 2 shows the main GUI of PhotoMatch, with the available combinations of detectors/descriptors and results on a UAV dataset. Figures 3 and 4a report results on a set of 9 nadir and oblique images provided by the ISPRS/EuroSDR Benchmark on High Density Image Matching for DSM Computation (Nex et al., 2015). In order to improve the feature extraction, all input images were pre-processed with the same algorithm, Recursively Separated and Weighted Histogram Equalization (RSWHE; Kim and Chung, 2008), since RSWHE preserves the image brightness more accurately and produces images with better contrast enhancement. All employed detectors were limited to a maximum of 5,000 keypoints, and the matching strategy used was robust matching supported by RANSAC with the F matrix as geometric test (Gonzalez-Aguilera et al., 2018). All computations were performed exploiting the parallel and GPU capabilities of the hardware. Table 1 shows the evaluation results for the different detector/descriptor combinations considered in the processing. The combinations were chosen according to the following criteria: (i) prioritize detectors with affine-invariant performance such as SIFT, MSER, BRISK and MSD; (ii) use detectors that incorporate their own descriptor and those which are invariant to rotation and scale, such as BRIEF, BRISK and SIFT. Figure 4b shows some quality analyses with the ROC curves for the extracted correspondences in the Dortmund dataset.
According to the results achieved on the aerial oblique Dortmund dataset, the following aspects can be highlighted:
- SIFT+SIFT (detector+descriptor) provides the best results in terms of number of matches, as well as true positive matching rate. However, its efficiency decreases considerably when the SIFT detector is combined with the BRIEF and BRISK descriptors.
- The BRISK detector performs well with its own descriptor, and even better when combined with the SIFT descriptor.
- The MSD detector is less efficient in extracting correspondences than the SIFT and BRISK detectors; its best results are obtained when combining it with the BRISK descriptor.
- The MSER detector underperforms with respect to the SIFT and BRISK detectors; its best results are achieved by combining it with the SIFT and BRISK descriptors.
- The BRIEF descriptor, in all our datasets and case studies, delivers the worst results in terms of extracted correspondences.
- In some cases, BRISK+SIFT considerably improves on the results obtained by SIFT+SIFT.

Figure 4: Multi-view visualization of the matching results (a) and ROC curves for the results presented in Table 1 (b).

Further tests were performed on a multi-view dataset of 5 convergent and rotated terrestrial images of the main façade of the Modena Cathedral (Italy). Different combinations of feature detectors and descriptors were tested to assess the algorithm performance when rotations are present. Table 2 presents the total number of matches extracted with a robust matching strategy for 9 detector and descriptor combinations. In this case, the SURF-SURF combination returned the highest number of extracted correspondences, outperforming the other combinations in identifying matches among differently tilted images (Figure 5).

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B5-2020, 2020 XXIV ISPRS Congress (2020 edition)

Multi-modal datasets
A set of 10 thermographic and visible images (Figure 6) over an urban area was captured from a manned ultralight aircraft (Lopez et al., 2015). In order to co-register the multi-modal dataset by finding homologous points, a specific detector/descriptor combination was used together with a differential adaptation of the detector parameters. More specifically, the MSD detector was combined with the SIFT descriptor, considering different salience thresholds (S) and numbers of selected points (KNN) exceeding the salience threshold for the visible and thermographic images (Table 3). The tests carried out show that these two parameters have a marked effect on performance with multi-modal images and are therefore the most important ones to tune. The remaining parameters of the MSD detector and the SIFT descriptor were set as suggested in the original implementations. In addition, a robust matching (RM) function supported by a RANSAC estimator was applied with different distance thresholds, D, and filtering coefficients, k, since both parameters were considered important in multi-modal matching.
In order to check the results, the F matrix defined through the precise and reliable identification of a set of 12 well-distributed homologous points was used.
The results of the application of MSD+SIFT+RM are illustrated in Figure 7 and Table 3.
Figure 6: An example of visible (a) and thermal (b) images acquired from a manned ultralight aircraft, which need to be automatically co-registered by finding homologous points.

Table 3. Different parameters analysed for the triplet detector/descriptor/matcher (MSD+SIFT+RM) in the multi-modal dataset. mc and mt refer to correct matches and total matches, respectively. The combination which provides the best efficiency is shown in bold.
Regarding the results presented in Table 3, it is worth noting that the salience threshold S is related to the level of dissimilarity between neighbouring pixels, i.e. how much a keypoint differs from its surroundings. This threshold must therefore be set differently for visible (S=650) and thermographic (S=65) images: higher in visible images (more demanding in dissimilarity) and lower in thermographic images (less demanding in dissimilarity). KNN indicates the minimum number of salient points considered: with KNN = 3, the three points with the highest salience, i.e. the most distinctive ones, are kept. The parameter D indicates the orthogonal distance to the epipolar line in pixels. k is a weight factor based on the L2-norm distance: if k = 1, all points are kept, whereas if k = 0.8 only the 80% of points with the lowest L2 distance are kept.
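The roles of D and k can be sketched as follows, under stated assumptions. The sketch assumes that matches are first ranked by descriptor L2 distance and the best fraction k is kept, and that the survivors are then tested against the epipolar line defined by a known F matrix with threshold D. The order of the two filters, the function names and the data layout are assumptions for illustration and are not taken from the PhotoMatch source.

```python
import math


def epipolar_distance(F, p_left, p_right):
    """Orthogonal distance in pixels from p_right to the epipolar line F * p_left."""
    x, y = p_left
    # Epipolar line a*u + b*v + c = 0 in the right image.
    a, b, c = (row[0] * x + row[1] * y + row[2] for row in F)
    u, v = p_right
    return abs(a * u + b * v + c) / math.hypot(a, b)


def filter_matches(matches, F, D, k):
    """Keep the best fraction k by descriptor distance, then apply the epipolar threshold D.

    matches: list of (p_left, p_right, l2_distance) tuples.
    """
    ranked = sorted(matches, key=lambda m: m[2])
    kept = ranked[:math.ceil(k * len(ranked))]
    return [m for m in kept if epipolar_distance(F, m[0], m[1]) <= D]


if __name__ == "__main__":
    # Fundamental matrix of an ideal rectified pair: epipolar lines are image rows.
    F = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
    matches = [
        ((0.0, 0.0), (5.0, 0.5), 0.1),
        ((0.0, 10.0), (5.0, 10.0), 0.2),
        ((0.0, 20.0), (5.0, 30.0), 0.3),  # far from its epipolar line
        ((0.0, 30.0), (5.0, 30.0), 0.4),
        ((0.0, 40.0), (5.0, 40.0), 0.9),  # worst descriptor distance
    ]
    for m in filter_matches(matches, F, D=1.0, k=0.8):
        print(m)
```

With k = 0.8 the match with the largest descriptor distance is discarded before the geometric test, and with D = 1 pixel the match lying 10 pixels from its epipolar line is rejected afterwards.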

CONCLUSIONS
The paper documents the results of an ISPRS Scientific Initiative, led and managed by USAL in collaboration with UCLM, UNILEON, FBK, TWENTE and UDINE universities, aimed at developing an open-source tool for image pre-processing, feature extraction and matching, and system evaluation, together with an educational tutorial. The output is PhotoMatch, an open-source (https://github.com/TIDOP-USAL/photomatch/releases) educational tool that encloses different state-of-the-art algorithms for tie point extraction, including different detectors and descriptors as well as matching strategies. Extracted correspondences can be exported in various formats in order to launch a bundle adjustment with other tools. PhotoMatch features GPU and parallel computing, including CUDA programming capabilities. It offers various means of evaluating the matching results, including manually defined ground truth and ROC/DET curves. An educational tutorial and manual are also available to explain the implemented methods. Preliminary tests have been performed on available online benchmarks, and tests on airborne, terrestrial and multi-modal (RGB-thermal) datasets have shown the performance of different combinations of algorithms and parameters. These tests have shown that combining different detectors and descriptors can deliver better results in specific situations. PhotoMatch could be further extended and improved, e.g. by adding other operators based on machine and deep learning approaches, with a special focus on multi-modal datasets.
PhotoMatch was developed with an educational approach in mind; nevertheless, its GPU and parallel computing capabilities allow datasets to be processed quickly. A key factor in the project's success was bringing together a multidisciplinary and international team with experience in image analysis, photogrammetry and computer vision to design and develop this feature matching tool.