ROBUST FEATURE MATCHING IN TERRESTRIAL IMAGE SEQUENCES

Over the last decade, feature detection, description and matching techniques have been widely exploited in photogrammetric and computer vision applications, including 3D scene reconstruction, image stitching for panorama creation, image classification and object recognition. However, terrestrial imagery of urban scenes raises particular issues: duplicate and near-identical structures (e.g. repeated windows and doors) cause problems in the feature matching phase and can ultimately lead to failure, especially in camera pose and scene structure estimation. In this paper, we address the ambiguous feature matching that repeating patterns cause in urban environments.


INTRODUCTION
Many photogrammetric and computer vision applications rely on more than one image of the same scene or object. To relate the images to one another, corresponding points of the same scene (3D features) need to be matched across those images. In recent years, image feature detectors and descriptors have become the most widely used techniques for such applications, which include 3D scene reconstruction, panoramic mosaicking/stitching, image classification, object recognition and robot localization. All of them depend on the presence of stable and representative features in image space. Image feature detection and extraction are therefore important steps for these applications (Hassaballah et al., 2016).
A number of algorithms are now available for feature detection and description, which provide regions of interest, edges or corners (Remondino, n.d.). The most common are Speeded Up Robust Features (SURF) (Bay et al., 2006), Scale Invariant Feature Transform (SIFT) (Lowe, 2004), Features from Accelerated Segment Test (FAST) (Rosten and Drummond, 2005) and Binary Robust Invariant Scalable Keypoints (BRISK) (Leutenegger et al., 2011). The ideal characteristics of features for matching, as reported by Haralick and Shapiro (1992), are invariance (independence from geometric and radiometric distortions), stability (robustness against image noise), distinctness (clearly distinguishable from the background) and uniqueness (distinguishable from other points).
Feature detection and matching can be split into three steps. 1) Detection: find the keypoints in each image. 2) Description: a distinctive descriptor is computed from the neighborhood region around every keypoint. Ideally, this description of the local appearance should be invariant to scale, rotation, noise, changes in illumination and affine transformations. Normally we end up with one descriptor vector per keypoint. 3) Matching: to identify similar features, descriptors are compared across the images. Successfully matched features give pairs (x_i, y_i) ↔ (x'_i, y'_i), where (x_i, y_i) is a feature in the first image and (x'_i, y'_i) is the matched feature in the other image.
However, terrestrial imagery of urban scenes contains many repeated feature patterns and nearly identical or duplicate structures with similar texture patterns. These cause problems in feature matching and subsequently lead to failure of the application (e.g. sparse 3D scene reconstruction). Removing the resulting incorrect matches is a necessary step, especially for urban scenes, where accurate recovery of camera pose and scene structure is required. Typical feature matching strategies produce a high number of outliers. Moreover, because of the inherent scene geometry and camera motion, the ambiguous matches tend to lie along the epipolar lines, so robust estimators such as RANSAC (used to reject incorrect matches) sometimes converge to a wrong set of correspondences and wrong camera poses.
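The reason such matches survive geometric verification can be written down with the standard epipolar constraint. The notation below is assumed for illustration and is not taken from the paper: x and x' are the homogeneous image coordinates of a candidate match, F is the 3x3 fundamental matrix, and l' is the epipolar line of x in the second image.

```latex
% Epipolar constraint for a candidate match x <-> x' (notation assumed here).
\mathbf{x}'^{\top} F \, \mathbf{x} = 0,
\qquad
\mathbf{l}' = F \mathbf{x}
\quad\Rightarrow\quad
\mathbf{x}'^{\top} \mathbf{l}' = 0 .
```

The constraint only requires x' to lie on the line l'. Any other point on l', for instance the same corner of a neighbouring window when the facade repetition runs along the epipolar direction, satisfies it equally well, so a test based on F alone cannot separate it from the true match.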
In this paper, we investigate and discuss ambiguous feature matching caused by repeating patterns in urban environments, using the SIFT (Vedaldi and Fulkerson, 2008) and SURF (MATLAB-based implementation) algorithms. Such ambiguity ultimately leads to false camera pose estimates in scene reconstruction. We also provide advice and suggestions on how to remove these known issues. SIFT and SURF descriptors were chosen because they perform well and are widely used in many applications.

RELATED WORK
Symmetry and repetition are very common in urban architecture. Building frontages contain a hierarchy of symmetries and repetitions, for example windows and doors, which recur frequently along the horizontal direction. Wu et al. (2010) presented a technique for finding repeated features on architectural frontal planes, with precise recovery of the boundaries of the repetition. Their method works well for repetition along the horizontal direction and for low repetition counts.
Wilson and Snavely (2013) presented a new approach for urban scenes containing repeated features, based on the local visibility graph. Their model leads to a highly scalable, fast and simple technique for disambiguating repeated elements without relying solely on geometric reasoning. They demonstrated the method on large datasets drawn from internet photo collections and compared it with other geometry-based disambiguation techniques.
Roberts et al. (2011) examined the geometric ambiguities that arise when different instances of duplicate and repeated structures are matched on the basis of visual similarity. They proposed an algorithm that recovers the true data association (the problem of determining correspondence, either for whole images or for feature points) even when a large number of false pairwise matches exist.
Similarly, Jiang et al. (2012) worked on repetitive scene structure, which corrupts the epipolar geometry (EG) through wrong feature correspondences between image pairs. They proposed an optimization technique based on missing correspondences, in which the correct solution is defined as the global minimum of an objective function. However, their algorithm has certain limitations. First, scenes with complicated occlusion cause incorrect estimation of visibility. Second, it fails for duplicate structures with few background features. Finally, because of its greedy search, the method may get stuck at a local minimum; convergence is guaranteed, but reaching the global minimum is not.

OVERVIEW OF SIFT AND SURF METHODS
A brief description of the SIFT and SURF operators is given in Fig. 1 (Wu et al., 2013). Many evaluations of SIFT and SURF are available in the literature, covering their performance, time consumption and behavior under conditions such as changes in scale or rotation, but the choice between them depends on the application.
The SIFT descriptor is typically a 128-element vector (Lowe, 2004), whereas the standard SURF descriptor has 64 elements, with an extended 128-element variant (Bay et al., 2006). A shorter descriptor leads to faster matching at the cost of lower accuracy. SIFT is invariant to scale change, rotation, affine transformation and rescaling of images, but performs less well under illumination change. SURF, in turn, is not fully affine invariant and is unstable under extreme rotation and illumination changes (Juan and Gwun, 2009; Hassaballah et al., 2016; Vedaldi and Fulkerson, 2008).

EXPERIMENTAL RESULTS
This section presents the performance evaluation of the SIFT and SURF operators. First, the speed of keypoint extraction and the number of keypoints extracted are compared and discussed. Second, the accuracy and speed of keypoint matching between two views are examined. Finally, multi-view keypoint matching is used to estimate the camera pose and scene structure (image geometry), which requires a number of corresponding points between input images of the same scene. Traditional procedures for establishing point correspondences are based on local descriptors. When the images contain repeated structures, and occlusions appear as the camera viewpoint changes, these procedures introduce inaccuracy into the scene structure and camera pose estimates.
All keypoints were obtained using the default parameters of the respective implementations. Two datasets were used for testing, both containing duplicate and repeated structures, in particular windows that recur frequently along the horizontal axis of building facades. The images were captured with a Nikon D5300 digital camera, a 24.2-megapixel camera with a 23.5 x 15.6 mm CMOS sensor and a maximum image resolution of 6000 x 4000 pixels.
The SIFT implementation of the VLFeat library (Vedaldi and Fulkerson, 2008) was used to detect and extract SIFT keypoints, whereas the SURF implementation in the MATLAB environment was used to extract SURF features (Bay et al., 2006).

Keypoints extraction comparison
This section compares the extraction time and the number of keypoints detected by SIFT and SURF. Table 1 shows the average results obtained after applying both operators to the datasets, each of which contains a sequence of images. SIFT detects the largest number of keypoints in both datasets, whereas SURF detects considerably fewer. Some variation in the number of keypoints is expected because the implementations differ, and the parameter settings can be changed to detect more or fewer features. In the SURF implementation, for example, the threshold on the determinant of the Hessian, which selects strong features, can be lowered to detect more features. Similarly, in the SIFT implementation, the threshold for detecting peaks in the Difference of Gaussian scale space and the threshold for deciding whether a point belongs to an edge can both be varied. SURF, however, detects keypoints faster than SIFT. Figs. 2 and 3 show the SIFT and SURF keypoints plotted for both datasets.
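To illustrate how the edge threshold changes the number of accepted SIFT keypoints, the following minimal sketch implements the principal-curvature test described by Lowe (2004), who suggests a curvature ratio of r = 10. The function name and the sample Hessian values are hypothetical and chosen only for illustration; this is not the VLFeat code used in the experiments.

```python
def passes_edge_test(dxx, dyy, dxy, edge_threshold=10.0):
    """Return True if a DoG extremum is kept by the edge test of Lowe (2004).

    dxx, dyy, dxy are the entries of the 2x2 Hessian of the DoG image at the
    candidate keypoint; edge_threshold is the largest allowed ratio r between
    the two principal curvatures.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign or a flat direction: reject the point.
        return False
    r = edge_threshold
    # Keep the point only if Tr(H)^2 / Det(H) < (r + 1)^2 / r.
    return trace * trace / det < (r + 1.0) ** 2 / r


# Hypothetical Hessian entries: one blob-like point and one edge-like point.
candidates = {"blob": (1.0, 1.0, 0.0), "edge": (10.0, 0.1, 0.0)}

kept_default = [name for name, h in candidates.items()
                if passes_edge_test(*h)]
kept_relaxed = [name for name, h in candidates.items()
                if passes_edge_test(*h, edge_threshold=200.0)]

print(kept_default)  # only the blob-like point is kept
print(kept_relaxed)  # the edge-like point is kept as well
```

Raising the threshold admits points with increasingly elongated local structure, which increases the keypoint count but also admits features that are poorly localized along the edge.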

Efficient keypoints matching
This section investigates the matching speed and the quality of matches between consecutive image pairs. Keypoints are matched between two images at a time using a KD-tree data structure (Friedman et al., 1977). This method is effective and efficient in low dimensions, but its efficiency decreases for high-dimensional data (Silpa-Anan and Hartley, 2008). Here, a bidirectional search (Jianxiong, n.d.) is used to match keypoints effectively. First, a KD-tree is built for the detected keypoints of each image in the pair. Then, using KD-tree based nearest neighbor search, the two nearest neighbors in the second image are extracted for each feature point in the first image. All matches that fail the ratio threshold (on the ratio of the distances to the first and second nearest neighbors) are removed. In the bidirectional step, a nearest neighbor query is then performed in the opposite direction: for the nearest neighbor found in the second image, the closest point among all points of the first image is sought. If this is the same feature point from which the query started, the two matched points are stored. Figs. 4 and 5 show the matched keypoints in both datasets. Some wrong correspondences are still visible among the matched keypoints in both datasets. To get rid of them, the Random Sample Consensus (RANSAC) method of Fischler and Bolles (1987) is most commonly used.
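The matching strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: a brute-force nearest neighbor search stands in for the KD-tree, the ratio threshold of 0.8 is the value suggested by Lowe (2004) rather than one reported here, and the function names and toy 2-D descriptors are hypothetical.

```python
import math


def two_nearest(query, candidates):
    """Return (index of nearest, distance to nearest, distance to second nearest).

    Brute-force search used here in place of a KD-tree; `candidates` must
    contain at least two descriptors.
    """
    dists = sorted((math.dist(query, c), j) for j, c in enumerate(candidates))
    (d1, j1), (d2, _) = dists[0], dists[1]
    return j1, d1, d2


def bidirectional_ratio_match(desc_a, desc_b, ratio=0.8):
    """Match descriptors of image A to image B with a ratio test and a back-check."""
    matches = []
    for i, d in enumerate(desc_a):
        j, d1, d2 = two_nearest(d, desc_b)
        # Ratio test: the best match must be clearly better than the runner-up.
        if d1 >= ratio * d2:
            continue
        # Bidirectional check: the match must also be the nearest neighbor
        # when querying from image B back into image A.
        i_back, _, _ = two_nearest(desc_b[j], desc_a)
        if i_back == i:
            matches.append((i, j))
    return matches


# Toy descriptors: the third point of image A has two equally close
# candidates in image B, mimicking a repeated window.
desc_a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (10.0, 10.1), (5.0, 5.2), (5.2, 5.0)]

print(bidirectional_ratio_match(desc_a, desc_b))  # [(0, 0), (1, 1)]
```

The ambiguous third descriptor is discarded by the ratio test because its two best candidates are equally distant, which is the typical signature of a repeated structure. Matches that pass both checks can still be wrong, so geometric verification remains necessary.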

Multi-view keypoints matching
In general, two-view geometry is the epipolar geometry between two images. It relates points and lines in the two images (the point corresponding to a point in one image lies on its epipolar line in the other image) and can be estimated using the fundamental or essential matrix (Peng et al., 2018). Conventionally, the error in corresponding points is divided into outlier error and localization error. The localization error is normally [...]. To estimate the camera pose and scene geometry accurately, it is necessary to remove all of them. The robust RANSAC algorithm is used to remove the outliers (incorrect matches) while finding the inliers (correct matches). However, the accuracy of RANSAC decreases when the outlier ratio is high (Peng et al., 2018). The bidirectional matching step is therefore essential for reducing the number of outliers in the feature matches. Our evaluation has shown that RANSAC typically finds the correct inlier set when it follows the matching strategy described above. Occasionally some outliers remained among the matched points; these were detected during bundle adjustment. Three-view geometry using trifocal tensors can also be used to remove such outliers (Remondino and Ressl, 2006).
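The sensitivity of RANSAC to the outlier ratio can be made concrete with the standard trial-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the minimal sample size and p the desired probability of drawing at least one outlier-free sample. The sketch below is illustrative only: the function name is hypothetical, and the sample size of 8 (as in the eight-point algorithm for the fundamental matrix) and the confidence of 0.99 are assumptions, not settings reported in this paper.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=8, confidence=0.99):
    """Number of RANSAC trials needed to draw one all-inlier sample
    with probability `confidence`."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    # Probability that one random minimal sample contains only inliers.
    p_good = inlier_ratio ** sample_size
    if p_good >= 1.0:
        return 1  # no outliers: a single sample suffices
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))


for outlier_ratio in (0.1, 0.3, 0.5):
    print(outlier_ratio, ransac_iterations(1.0 - outlier_ratio))
```

With these assumed settings, 10% outliers require 9 trials while 50% outliers require 1177. Reducing the outlier ratio before RANSAC, as the bidirectional matching step does, therefore makes it much more likely that a given budget of trials contains an outlier-free sample.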
The multi-view matching (in three images) is shown in Fig. 6. It is clear from Fig. 6 that, even after applying the RANSAC algorithm, some mismatched keypoints remain; they lie on the epipolar line because of the repeated pattern. This issue is resolved in the next step, the 3D scene reconstruction using bundle adjustment. The red points clearly depict the walls, corners and window patterns of the building facades, whereas the blue points are [...]

Figure 6. SURF-based multi-view matched keypoints

CONCLUSION
In this paper we illustrated the problems that arise in urban scenes, where duplicate and repetitive structures hinder accurate camera pose estimation and precise recovery of scene structure. The experimental results show that the SURF operator is faster than the SIFT operator in keypoint detection and matching. However, SIFT yields more keypoints, both at detection and in the matching phase. Bidirectional feature matching using a KD-tree yields the fewest outliers, which are then removed using RANSAC. Finally, the sparse and dense scene structures were created.

Figure 1. Comparison of SIFT and SURF operators

Figure 12. Dense point cloud

Table 1. Comparison of SIFT and SURF keypoint detection and runtime