FOR AUTOMATED NETWORK ORIENTATION

New tools and algorithms for automated image processing and 3D reconstruction become available every day, making it possible to process large networks of unoriented and markerless images and to deliver sparse 3D point clouds in a reasonable processing time. In this paper we evaluate some feature-based methods used to automatically extract the tie points necessary for calibration and orientation procedures, in order to better understand their performance for 3D reconstruction purposes. The tests, based on the SIFT algorithm and its most used variants, were run on several datasets and analysed various parameters and outcomes (e.g. number of oriented cameras, average rays per 3D point, average intersection angle per 3D point, theoretical precision of the computed 3D object coordinates, etc.).


INTRODUCTION
The field of automated image processing for 3D reconstruction is flooded every day with new tools and algorithms. Camera calibration, image orientation and dense matching methods are increasingly hidden behind one-click software and are thus accessible to non-expert users. In particular, automated image orientation approaches, built on feature-based methods for tie point extraction and pushed by the great developments in the Computer Vision community, are nowadays able to process large networks of unoriented and markerless images, delivering sparse 3D reconstructions in a reasonable processing time (Snavely et al., 2008; Agarwal et al., 2009; Frahm et al., 2010; Wu, 2013). This has led to the well-known Structure from Motion (SfM) concept (first introduced by Ullman, 1979), i.e. the automated and simultaneous determination of the camera parameters together with the scene geometry. SfM has also been adopted in the photogrammetric community (Barazzetti et al., 2010; Del Pizzo & Troisi, 2011; Deseilligny & Clery, 2011; Roncella et al., 2011), although camera calibration and image orientation are normally kept separate unless the image network is suitable for self-calibration (Barazzetti et al., 2011).
In all automated approaches, image correspondences are normally extracted using feature-based methods and then the unknown camera parameters and 3D object coordinates are determined using a bundle adjustment method. Commercial and open-source solutions exist, with performances that are sometimes unclear and often with low reliability and repeatability (Remondino et al., 2012). Moreover, a deep and metric evaluation of the different (hidden) steps is still missing. In this work we evaluate some feature-based methods used to automatically extract the tie points necessary for calibration / orientation procedures. An automated calibration / orientation procedure is normally based on the following steps: feature detection, feature description, detector comparison, outlier removal, tie point transfer throughout the images, bundle adjustment and determination of the unknown parameters.
The detection and description steps are key stages for the performance of an automated procedure. Recent investigations and comparisons of detectors and descriptors were presented in (Burghouts & Geusebroek, 2009; Juan & Gwon, 2009; Aanaes et al., 2012; Heinly et al., 2012; Oyallon & Rabin, 2013; Wu et al., 2013), mainly on indoor datasets, planar surfaces and low-resolution images, and without geometric analyses with respect to the 3D object coordinates. Therefore an in-depth analysis and comparison in terms of photogrammetric parameters is needed. In this contribution the Scale-invariant Feature Transform (SIFT) algorithm (Lowe, 2004) and its most interesting variants are considered, paying great attention to the description phase of each method. The considered feature-based methods are (Section 2): SIFT (in the VLFeat implementation), SIFT-GPU, ASIFT, ColSIFT, DAISY, LDAHash, SGLOH and SURF. The employed feature-based methods are implementations adapted from the open-source domain. For the presented evaluation, different image networks are used (Section 3 and Fig. 1). Evaluation results and critical comments are reported.

FEATURE-BASED METHODS: DETECTORS & DESCRIPTORS
Feature identification and matching are at the base of many automated photogrammetric and computer vision problems and applications, such as 3D reconstruction, dense point cloud generation, object recognition or tracking, etc. A feature detector (or extractor) is an algorithm that takes an image as input and delivers a set of local features (or regions), while a descriptor computes on each extracted region a specific representation of the extraction. Good image features should be independent of any geometric transformation applied to the image, they should be robust to illumination changes and they should have a low feature dimension in order to allow quick matching.
Once points and regions (invariant to a class of transformations) are detected, (invariant) descriptors are computed to characterize the feature. The descriptors consist of a variable number of elements (from 64 to 512) computed with histograms of gradient location and orientation (Lowe, 2004), moment invariants (Van Gool et al., 1996), linear filters (Schaffalitzky & Zissermann, 2002), PCA (Mikolajczyk & Schmid, 2005), intensity comparison and binary encoding (Calonder et al., 2010; Leutenegger et al., 2011), etc. Descriptors have proved to successfully allow (or simplify) complex operations like wide baseline matching, robot localization, object recognition, etc.
In the detection phase, in order to produce translation and scale invariant descriptors, structures must be unambiguously located, both in scale and position. This excludes image edges and corners since they are translation-, view- and scale-variant features. Therefore image blobs located on flat areas are the most suitable structures, although they are not as precisely located as interest points and corners (Remondino, 2006).
Nowadays the most popular and widely used operator is the SIFT method. SIFT has good stability and invariance and it detects local keypoints with a large amount of information using the Difference of Gaussians (DoG) method. As reported in the literature (Morel & Yu, 2009; Remondino et al., 2012; Zhao et al., 2012; Apollonio et al., 2013), the typical failure cases of the SIFT algorithm are changes in the illumination conditions, reflecting surfaces (e.g. cars or windows), objects / scenes with a strong 3D aspect, highly repeated structures in the scene and very different viewing angles between the images. In order to overcome these failures, but also to quickly derive compact descriptor representations, many variants of and alternatives to the SIFT algorithm have been developed in recent years (Ke & Sukthankar, 2004; Brown et al., 2005; Bay et al., 2008; Morel & Yu, 2009; Bellavia et al., 2010; Tola et al., 2010; Vedaldi & Fulkerson, 2010; Rublee et al., 2011; Yeo et al., 2011; Strecha et al., 2012; Wu, 2014) and are nowadays used in many open-source and commercial solutions which offer automated calibration / orientation procedures (VisualSFM, Apero, Eos Photomodeler, Microsoft Photosynth, Agisoft Photoscan, Photometrix Iwitness, 3DF Zephyr, etc.).
Among the available feature-based detector and descriptor algorithms, the evaluated methods are described in the following sections.

Scale Invariant Feature Transform (SIFT)
SIFT (Lowe, 2004) derives a large set of compact descriptors starting from a multi-scale representation of the image (i.e. a stack of images with increasing blur simulating the family of all possible zooms). In this multi-scale framework, the Gaussian kernel acts as an approximation of the optical blur introduced by a camera. The detection and localization of keypoints is done by extracting the 3D extrema with a DoG operator. SIFT detects a series of keypoints, mostly in the form of small patch structures, locating their centre (x, y) and characteristic scale (σ), and then it computes the dominant orientation (θ) from the gradient orientation over a region surrounding each patch. Given 8 bins for quantizing the gradient directions, the dominant orientation (responsible for the rotation invariance of the keypoint) is given by the bin with the maximum value. The knowledge of (x, y, σ, θ) allows the computation of a local descriptor of each keypoint's neighbourhood that encodes the spatial gradient distribution in a 128-dimensional vector. This compact feature vector is used to match the keypoints extracted from different images. Since there are many phenomena that can lead to the detection of unstable keypoints, SIFT incorporates a cascade of tests to discard the less reliable points. Only those that are precisely located and sufficiently contrasted are retained. The main parameters that control the detection of points are:
- local extrema threshold (contrast threshold): points with a lower local extrema value are rejected but, since this threshold is closely related to the level of noise in the input image, no universal value can be set. Additionally, the contrast of the input image plays the inverse role of the noise level, therefore the contrast threshold should be set depending on the signal-to-noise ratio of the input image.
- local extrema localization threshold (edge threshold): it is used to discard unstable points, i.e. local extrema lying on a valley. Extrema are associated with a score proportional to their sharpness and rejected if the score is below this threshold. The number of remaining features increases as the parameter is increased. The original value in Lowe (2004) is 10.
Calibration of these parameters is fundamental for the efficiency of the detection mechanism. Following the literature (May et al., 2010) and our practical experience with different datasets (from 1x1x1 to 10x20x50 meters), values of 6 for the contrast threshold and of 10 for the edge threshold appear to be very suitable choices.
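To make the role of the two thresholds more concrete, the following minimal Python sketch applies a contrast check and the principal-curvature ratio test described in Lowe (2004) to a single candidate extremum. The function name, the argument names and the way the DoG value and the Hessian entries are passed in are assumptions made for this illustration only; the meaningful range of the contrast threshold depends on how the image intensities are scaled, so the default of 6 simply mirrors the value quoted above.

```python
def keep_keypoint(dog_value, dxx, dyy, dxy,
                  contrast_threshold=6.0, edge_threshold=10.0):
    """Return True if a DoG extremum passes both rejection tests.

    dog_value      -- DoG response at the extremum (same units as the threshold)
    dxx, dyy, dxy  -- second derivatives of the DoG image at the extremum
    """
    # Contrast test: weak extrema are likely to be caused by noise.
    if abs(dog_value) < contrast_threshold:
        return False

    # Edge test: compare the two principal curvatures through the trace
    # and determinant of the 2x2 Hessian (Lowe, 2004).
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        # Curvatures of opposite sign: a saddle, not a blob.
        return False
    r = edge_threshold
    return trace * trace / det < (r + 1.0) ** 2 / r
```

With this formulation a larger edge threshold accepts more elongated structures, which is consistent with the number of retained features growing as the parameter is increased.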

SIFT-GPU
SIFT-GPU (Sinha et al., 2006) is a SIFT implementation on the GPU based on the following steps:
1. convert colour to intensity and up-sample or down-sample the input image;
2. build Gaussian image pyramids (Intensity, Gradient, DoG);
3. detect keypoints with sub-pixel and sub-scale localization;
4. generate a compact list of features with GPU histogram reduction;
5. compute feature orientations and descriptors.
SIFT descriptors cannot be efficiently and completely computed on the GPU, as histogram bins must be blended to remove quantization noise. Hence this step is normally partitioned between the CPU and the GPU. SIFT-GPU uses a mixed GPU/CPU method to build compact keypoint lists and to process keypoints, obtaining their orientations and descriptors. SIFT-GPU, particularly on large images, may give slightly different results on different GPUs due to the different floating point precision.
In the presented tests, the SIFT-GPU implementation available at http://cs.unc.edu/~ccwu/siftgpu was used. To speed up the computation, it presents some changes in the parameter values compared with the original implementation:
- in the orientation computation, a factor σ=2.0 for the sample window size is used (the typical value is σ=3.0) to increase the speed by 40%;
- the keypoint location is refined only once and without adjusting it with respect to the Gaussian pyramids;
- the image up-sampling is not performed;
- the number of detected features (max 8000) and the image size (max 3200 pixels) are limited;
- the local extrema threshold (contrast threshold) is set to 5.16 instead of 3.4.
In the presented tests, an optimized implementation of SIFT-GPU is also tested, with these specifications:
- the orientation computation uses σ=3.0 as in the original paper (Lowe, 2004);
- image up-sampling is performed as in Lowe (2004);
- the number of detected features and the image size are not limited;
- the local extrema threshold is set to 6;
- the detection is performed using GLSL, with an adaptation of the DoG threshold to detect more features in dark regions;
- matching is done on the CPU, without limiting the number of matches.

Affine-SIFT (ASIFT)
ASIFT (Morel & Yu, 2009) aims to correct the weakness of SIFT in the case of very different viewing angles, i.e. it aims to be more affine invariant than SIFT by simulating the rotation of the camera axes. ASIFT first applies a rotation to the image. Then, it obtains a series of affine images by applying a tilt transformation u(x, y) → u(tx, y) to the image in the x direction. From a technical point of view, unlike SIFT, which normalizes all six affine parameters, ASIFT simulates three parameters (the scale and the two rotations along the camera vertical and horizontal axes) and normalizes the other parameters (rotation along the axis orthogonal to the image plane and the two horizontal and vertical translations). ASIFT detects many feature points (as the detection is repeated several times), but the detection time rises significantly and the matching time rises even more (Mikolajczyk et al., 2010). Since it compares many pairs of putative homologous points, ASIFT can accumulate many wrong matches. Furthermore, it produces many wrong points when used on repeated patterns.
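The rotation-then-tilt simulation can be illustrated with a short Python sketch. It builds the 2x2 matrix that maps original pixel coordinates into one simulated view and lists the in-plane rotations sampled for a given tilt. The function names are hypothetical, and the 72°/t rotation step is taken, to the best of our understanding, from the sampling suggested by Morel & Yu (2009); it should be treated as an assumption rather than as the setting used in the presented tests.

```python
import math


def tilt_matrix(tilt, phi_deg):
    """2x2 matrix mapping original pixel coordinates to a simulated view.

    The image is first rotated by phi and then compressed along x by the
    tilt factor t, which corresponds to resampling u(x, y) -> u(t*x, y).
    """
    phi = math.radians(phi_deg)
    c, s = math.cos(phi), math.sin(phi)
    return [[c / tilt, -s / tilt],
            [s, c]]


def rotations_for_tilt(tilt, phi_step_deg=72.0):
    """Rotation angles (degrees, in [0, 180)) simulated for one tilt value."""
    if tilt == 1.0:
        # No tilt: SIFT itself already handles in-plane rotation.
        return [0.0]
    step = phi_step_deg / tilt   # finer angular sampling for stronger tilts
    angles = []
    k = 0
    while k * step < 180.0:
        angles.append(k * step)
        k += 1
    return angles
```

Since the detector is run once per (tilt, rotation) pair, the number of simulated views gives a rough idea of why the detection and matching times grow with respect to plain SIFT.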

Colour SIFT
Colour SIFT covers different ways of extending the SIFT descriptor from grey-level to colour images using colour moments and moment invariants (Mindru et al., 2004). The main goals are (i) to obtain invariance from a colour description instead of a grey-value description, in particular for photometric events (such as shadows and highlights), and (ii) to exploit the colour information to solve possible problems arising from the colour-to-grey conversion. Bosch et al. (2006) compute SIFT descriptors over all three channels in the HSV colour space, resulting in a 3×128-dimensional HSV-SIFT image descriptor. Van de Weijer and Schmid (2006) concatenate the SIFT descriptor with a weighted hue histogram. However, this revealed some instabilities of the hue around the grey axis and showed that the hue histogram component of the descriptor is not invariant to illumination colour changes or shifts. Burghouts and Geusebroek (2009) defined a set of descriptors with 3 vectors of 128 values (following the opponent colour theory of Ewald Hering): the first vector is exactly the original intensity-based SIFT descriptor (representing the intensity, shadow and shading information), whereas the second and third vectors contain pure chromatic information as opponent colour channels (yellow-blue and red-green).
Other approaches are presented in Geusebroek et al. (2001) and Van de Sande et al. (2010).
In the presented tests, the implementation available at http://staff.science.uva.nl/~mark/downloads.html was used.
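As an illustration of the opponent colour idea, the sketch below converts an RGB triplet into one intensity channel and two chromatic channels using a commonly used opponent transform (of the kind discussed, for example, in Van de Sande et al., 2010). The exact channel definitions and weights adopted by the cited colour descriptors, and by the implementation used in our tests, may differ; the function name is hypothetical.

```python
import math


def rgb_to_opponent(r, g, b):
    """Convert one RGB pixel to (red-green, yellow-blue, intensity) channels."""
    o1 = (r - g) / math.sqrt(2.0)              # red-green opponent channel
    o2 = (r + g - 2.0 * b) / math.sqrt(6.0)    # yellow-blue opponent channel
    o3 = (r + g + b) / math.sqrt(3.0)          # intensity channel
    return o1, o2, o3
```

A grey pixel gives zero in both chromatic channels, so a SIFT-like descriptor computed on the third channel carries the intensity information while the first two carry only colour differences.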

Shifting Gradient Location and Orientation Histogram (SGLOH)
SGLOH (Bellavia et al., 2010) is a modification of the GLOH descriptor (Mikolajczyk and Schmid, 2005) based on n circular grids centred on the feature point. SGLOH checks the similarity between two features not only in the dominant gradient orientation but also according to a set of discrete rotations. This is achieved by shifting the descriptor vector and by using an improved feature distance. This improves the stability of the descriptor to rotation at a reasonable computational cost. The SGLOH descriptor is normally coupled with the HarrisZ detector (Bellavia et al., 2008) for the extraction of the keypoints.
In the presented tests, the SGLOH implementation of Bellavia et al. (2010) is used with 3 circular rings centred on the feature point and 8 radial sectors per ring. Images needed to be downsampled to 800x600 pixels.
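The idea of comparing two descriptors over a set of discrete rotations by shifting the descriptor vector can be sketched as follows. This is a simplified illustration and not the implementation of Bellavia et al. (2010): the descriptor layout (rings, then sectors, then as many orientation bins as sectors) is an assumption, and a plain L1 distance stands in for the improved feature distance mentioned above.

```python
def rotate_descriptor(desc, k):
    """Rotate a ring/sector/bin descriptor by k sectors (cyclic shift).

    desc[ring][sector][bin]; each ring is assumed to have as many
    orientation bins as sectors, so one step shifts both indices.
    """
    rotated = []
    for ring in desc:
        n = len(ring)
        rotated.append([[ring[(i + k) % n][(j + k) % n] for j in range(n)]
                        for i in range(n)])
    return rotated


def l1_distance(a, b):
    """Sum of absolute differences over all descriptor elements."""
    return sum(abs(x - y)
               for ring_a, ring_b in zip(a, b)
               for hist_a, hist_b in zip(ring_a, ring_b)
               for x, y in zip(hist_a, hist_b))


def shifted_distance(desc_a, desc_b):
    """Smallest distance over all discrete rotations of desc_b.

    Returns (distance, shift) for the best cyclic shift.
    """
    n = len(desc_a[0])
    return min((l1_distance(desc_a, rotate_descriptor(desc_b, k)), k)
               for k in range(n))
```

With 8 sectors per ring, only 8 shifted comparisons per descriptor pair are needed, which is in line with the reasonable computational cost mentioned above.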

DAISY
DAISY (Tola et al., 2010) is a local descriptor inspired by SIFT and GLOH but faster and more robust. In SIFT each bin contains a weighted sum of the norms of the image gradients around its centre, where the weights roughly depend on the distance to the bin centre. In DAISY these descriptors are reformulated so that they can be efficiently computed at every pixel location. This means that the histograms are computed only once per region and reused for all neighbouring pixels. To this end, the weighted sum of the norms is replaced with convolutions of the gradients in specific directions (normally 8) with several Gaussian filters. DAISY provides a 264-dimensional vector, and this formulation gives the descriptor the appearance of a flower, hence its name. DAISY gives the same kind of invariance as SIFT and GLOH but is much faster for dense-matching purposes and allows the computation of the descriptors in all directions with little overhead (Winder et al., 2009).
In the presented tests, the DAISY implementation available at http://cvlab.epfl.ch/software/daisy was used.
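The first step of this reformulation, i.e. splitting the image gradient into non-negative layers along a few fixed directions, can be sketched in Python as below. The subsequent Gaussian smoothing of each layer at several scales is left out for brevity, and the function names are hypothetical. The second function gives the usual descriptor length formula; the combination of 4 rings, 8 histograms per ring and 8 bins is only one parameter choice consistent with the 264 dimensions reported above, not a confirmed setting of the tested implementation.

```python
import math


def orientation_maps(gx, gy, n_dirs=8):
    """Split a gradient field into n_dirs non-negative directional layers.

    gx, gy -- 2D lists holding the horizontal and vertical image derivatives.
    Layer o keeps the gradient component along direction o, clipped at zero.
    """
    maps = []
    for o in range(n_dirs):
        theta = 2.0 * math.pi * o / n_dirs
        c, s = math.cos(theta), math.sin(theta)
        maps.append([[max(0.0, c * x + s * y) for x, y in zip(row_x, row_y)]
                     for row_x, row_y in zip(gx, gy)])
    return maps


def descriptor_length(n_rings, n_per_ring, n_bins):
    """One central histogram plus n_per_ring histograms on each ring."""
    return (n_rings * n_per_ring + 1) * n_bins
```

Because each layer is computed once for the whole image, the histograms of neighbouring pixels can be read from the same smoothed layers instead of being recomputed.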

Linear Discriminant Analysis (LDAHash)
LDAHash (Strecha et al., 2012) is a SIFT-like local binary feature descriptor that maps the descriptor vectors into the Hamming space, where the Hamming metric is used to compare the resulting representations. LDAHash introduces a global optimization scheme to better take advantage of training data composed of interest point descriptors corresponding to multiple 3D points seen under different views. LDAHash performs a Linear Discriminant Analysis (LDA) on the descriptors before the binarization. Binarization techniques take advantage of training data to learn short binary codes whose distances are small for positive training pairs and large for the others. This is useful to reduce the descriptor size and to increase the performance of the descriptor. However, LDAHash uses an exhaustive linear search to find the matching points, which significantly reduces its efficiency. Moreover, LDAHash is a supervised and data-dependent approach that needs additional human labelling in the required training stage. The approach is therefore fast and usable only when similar training data are available. In the presented tests the implementation available at http://cvlab.epfl.ch/research/detect/ldahash was used.
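The binarization and the exhaustive Hamming search can be sketched as follows. The projection matrix and the thresholds would normally be learned from training data; here they are simply passed in as arguments, and all function names are hypothetical.

```python
def binarize(descriptor, projection, thresholds):
    """Project a real-valued descriptor and keep one sign bit per row.

    Implements b = sign(P x + t), packing the bits into a Python integer.
    """
    code = 0
    for row, t in zip(projection, thresholds):
        value = sum(p * d for p, d in zip(row, descriptor))
        code = (code << 1) | (1 if value + t > 0.0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def linear_search(query, codes):
    """Exhaustive search: index of the code closest to the query."""
    return min(range(len(codes)), key=lambda i: hamming(query, codes[i]))
```

Each comparison is a cheap XOR and bit count, but the search still visits every candidate code, which illustrates why the exhaustive linear search limits the overall efficiency on large image networks.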

Speeded Up Robust Features (SURF)
The SURF descriptor (Bay et al., 2008) implements an algorithm similar to SIFT but reduces the processing time by simplifying and approximating the steps. All layers of the pyramid are generated from the original image by up-scaling the filter size rather than taking the output from a previously filtered layer. The final descriptor vector has 64 dimensions. SURF can be computed efficiently at every pixel, but it introduces artefacts that can degrade the matching performance when used densely.
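Up-scaling the filter instead of re-filtering the image is practical because SURF approximates its filters with box filters evaluated on an integral image, a detail of Bay et al. (2008) that is not discussed above. The following Python sketch (with hypothetical function names) shows why the cost of a box sum does not depend on the filter size.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels inside a box, using four table look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

Whatever the box size, four look-ups are enough, so larger filters can be applied directly to the original image at constant cost per pixel.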

DATASETS DESCRIPTION
The evaluation of the performance and potential of the feature-based methods is performed with four datasets (Fig. 1).

EXPERIMENTAL SETUP AND EVALUATION RESULTS
In order to have a common evaluation procedure, once the feature points are extracted and described with the aforementioned algorithm implementations, the descriptor matching procedure, the outlier detection phase and the final bundle adjustment are run inside the same software environment. In particular, the image correspondences are generated following Agarwal et al. (2009) and Frahm et al. (2010), and RANSAC, or one of its variants (Chum et al., 2004; Chum et al., 2005; Chum & Matas, 2005, 2008), is then used to eliminate possible mismatches. Other similar approaches are presented in (Nister & Stewenius, 2006; Farenzena et al., 2009). The performed tests compare the results achieved at the end of the descriptor matching phase and after the bundle solution, to better understand the performance of the feature-based methods for 3D reconstruction purposes.
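The outlier removal step can be illustrated with a deliberately simplified RANSAC sketch. In an orientation pipeline the model fitted to the putative matches is typically the epipolar geometry between two images; here a pure 2D translation is used as a toy stand-in so that the example remains short and self-contained. The function name, the tolerance and the number of iterations are assumptions made for this illustration.

```python
import random


def ransac_translation(matches, tol=2.0, iterations=100, seed=0):
    """Toy RANSAC: keep the matches consistent with one 2D translation.

    matches -- list of ((x1, y1), (x2, y2)) putative correspondences.
    Returns the indices of the inlier matches of the best hypothesis.
    """
    rng = random.Random(seed)   # fixed seed for repeatable results
    best = []
    for _ in range(iterations):
        # Minimal sample: one match fully defines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        inliers = [i for i, ((a, b), (c, d)) in enumerate(matches)
                   if abs(c - a - dx) <= tol and abs(d - b - dy) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best
```

The same hypothesize-and-verify loop applies to more complex models; only the size of the minimal sample and the residual used to count the inliers change.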
In particular, the following outcomes were analysed:
- pairwise matching efficiency: using a set of images (Fig. 2) featuring illumination differences, textureless surfaces, possible loss of information in the colour-to-grey conversion and elements with strong 3D features, we tested the pairwise matching efficiency of the operators with respect to three camera movements: (i) parallel with limited baseline (00-01); (ii) rotation of 90° (00-03); (iii) tilt of more than 30° (01-02). The number of correct inlier matches (after the RANSAC phase) is then normalized with all putative correspondences, i.e. efficiency = (number of correct inliers) / (number of putative correspondences). The optimized SiftGPU obtained the highest number of correct inliers in each situation (Table 2). This is probably due to the variation of the DoG threshold in the dark areas, which allows a higher number of matches. Conversely, ASIFT appears to be the most efficient solution (Table 3).
- number of oriented cameras: Fig. 3 shows that ColSIFT and FAST+SURF achieve poor results in the case of b/w objects (Testfield) and scenes with uniform colour (Portici). The best performances are obtained with ASIFT, LDAHash, SiftGPU and VLFeat.
- root mean square error of the bundle adjustment (Fig. 4): it expresses the re-projection error of all computed 3D points. ASIFT and SIFT-GPU have limited results in all the datasets (the latter probably due to the changes in the parameter values with respect to the original implementation). VLFeat shows rather good results, as does FAST+SURF, although the latter is not able to orient all the images in the datasets (just 20% for the Testfield and Porticoes datasets).
- visibility of 3D points in more than 3 images (Fig. 5): the results show that more than 50% of the triangulated points are visible in at least 3 images, although for highly overlapping images (Testfield and Jaguar) this might not be so significant.
- average rays per 3D point (i.e. the redundancy of the computed 3D object coordinates): for the Jaguar and Testfield datasets the point multiplicity is shown in Fig. 6a and Fig. 7a. In comparison with the ground truth measurements and despite the high overlap of the images (almost 100% for the entire dataset), the feature-based methods show a low average multiplicity.
- average intersection angle per 3D point: as 3D coordinates are determined from images by triangulation, a higher intersection angle of homologous rays provides more accurate 3D information. The Testfield and Jaguar datasets consist of highly overlapping images (almost 100%) acquired with very convergent images, therefore high intersection angles should be expected. Nevertheless, the average angles (Fig. 6b and 7b) are always quite low compared to the ground truth measurements.
- points per intersection angle: the analysis of the Jaguar dataset (Fig. 6c) shows that less than 10% of the correspondences provide a 3D point under an angle larger than 60 degrees. Instead, more than 30% of the 3D points are determined with an intersection angle smaller than 20 degrees (LDAHash up to 50%, VLFeat3.4 almost 40%). The results of the Testfield dataset are very similar, and all the feature-based methods normally provide 3D points with a small intersection angle (Fig. 7c).
- theoretical precision of the computed 3D object coordinates (for the Testfield dataset): in comparison with the ground truth values (x = 0.01 mm, y = 0.009 mm, z = 0.017 mm, with Z the depth axis), all the feature-based methods deliver much higher values, i.e. a worse theoretical precision (Fig. 8a). SGLOH shows such high values because the implementation used processes only low-resolution images. Considering Fig. 6c and Fig. 7c, i.e. the fact that a large number of 3D points are determined with an intersection angle smaller than 10 degrees, we removed all those points and obtained much better theoretical precisions of the object coordinates (Fig. 8b).
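The intersection angle analysis and the removal of low-angle points can be sketched in Python as follows. The sketch assumes that each 3D point is stored together with the projection centres of the cameras observing it, and that a point is kept when its largest pairwise ray angle reaches the threshold; the text above does not specify which angle statistic was used for the filtering, so this choice, like the function names, is an assumption.

```python
import itertools
import math


def intersection_angle_deg(point, cam_a, cam_b):
    """Angle (degrees) between the rays joining a 3D point to two cameras."""
    ra = [c - p for c, p in zip(cam_a, point)]
    rb = [c - p for c, p in zip(cam_b, point)]
    dot = sum(a * b for a, b in zip(ra, rb))
    norm = math.sqrt(sum(a * a for a in ra)) * math.sqrt(sum(b * b for b in rb))
    cos_angle = max(-1.0, min(1.0, dot / norm))   # guard against rounding
    return math.degrees(math.acos(cos_angle))


def max_intersection_angle(point, cameras):
    """Largest pairwise ray angle among all cameras observing the point."""
    return max((intersection_angle_deg(point, a, b)
                for a, b in itertools.combinations(cameras, 2)),
               default=0.0)


def filter_low_angle(points, min_angle_deg=10.0):
    """Keep the (xyz, cameras) entries whose rays intersect well enough."""
    return [(xyz, cams) for xyz, cams in points
            if max_intersection_angle(xyz, cams) >= min_angle_deg]
```

Points seen only from nearly parallel rays are poorly determined in depth, which is why discarding them improves the theoretical precision of the remaining object coordinates.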

CONCLUSIONS
The paper reports some experiments and evaluations carried out to test the performance and efficiency of common feature-based methods used for automated homologous point extraction. Different datasets were used, featuring scale variations, camera rotations, illumination changes, affine transformations, variable image overlap and resolution, flat and textureless surfaces, and repeated patterns. Image correspondences were extracted following the typical detection, description, matching and blunder detection phases. Then a bundle adjustment was used to derive the camera poses and sparse 3D reconstructions. The achieved results were compared and the following considerations can be summarized:
- real and complex scenarios show that automated image orientation is still an open issue and that unsuccessful results can still be obtained;
- each method has a set of parameters which needs to be correctly set, otherwise its performance can be very poor (no unique set of parameters is valid in all situations);
- repeated patterns and 3D scenarios, very common in architectural scenes, show the need for more invariant descriptor vectors;
- none of the methods can detect correspondences over long tracks of images, and all of them deliver 3D points with small intersection angles;
- the small intersection angles negatively affect the quality of the 3D reconstruction but, given the large number of extracted correspondences, the low-angle intersections can be removed;
- the processing time can be considerably high in the case of high-resolution images and of descriptors involving multiple detection routines or multiple channels;
- comparing all the graphs, some operators show limited discrepancies, particularly SIFT-GPU and VLFeat, which seem to be the most stable.
Besides these considerations, we cannot declare any winner. Fully automated feature-based methods combined with accurate and reliable results are certainly still a hot research topic.

Figure 1: The employed datasets with their images and different camera networks: Jaguar (A), Testfield (B), Albergati (C), Porticoes (D). They feature high-resolution images, convergent acquisitions, variable image overlap, camera rolls, flat and textureless surfaces as well as repeated patterns and illumination changes.
The four datasets are:
A) Jaguar bas-relief, a heritage monument located in Copan (Honduras) with uniform texture and highly overlapping convergent images. Ground truth measurements are available.
B) Calibration testfield, with coded targets and scale bars, imaged with highly overlapping and convergent images. Ground truth measurements are available for this dataset.
C) Albergati building, a three-floor historical palace (54 x 19 m) characterized by repeated brick walls, stone cornices and a flat facade. The camera was moving along the facade of the building, with some closer shots of the entrances.
D) Building with porticoes, a three-floor historical building (19 x 10 m) characterized by arches, pillars/columns, cross vaults and plastered walls. The camera was moving along the porticoes, with some closer shots of the columns.
Other characteristics of the employed photogrammetric datasets are summarized in Tab. 1. The datasets are characterized by different image scales (ranging from 1/800 for the Albergati case to 1/30 for the calibration dataset), image resolution, number of images, camera network, object texture and size. The employed datasets are meant to verify the efficiency of the different techniques in different situations (scale variation, camera rotation, affine transformations, etc.). In particular, datasets C and D represent urban test frameworks summarizing scenarios typical of historical urban environments. The datasets contain, besides convergent imaging configurations and some orthogonal camera rolls, a variety of situations typical of failure cases, i.e. 3D (non-coplanar) scenes with homogeneous regions, distinctive edge boundaries (e.g. buildings, windows/doors, cornices, arcades), repeated patterns (recurrent architectural elements), textureless surfaces and illumination changes. With respect to other evaluations where synthetic datasets, indoor scenarios, low-resolution images, flat objects or simple 2-view matching procedures are used and tested, our datasets are more varied and our aim is the final 3D reconstruction of the scene. The datasets are available to the scientific community for research purposes.

Figure 2: Images used to test pairwise matching efficiency.

Figure 3: Percentage of oriented cameras for each dataset and feature-based method.

Figure 4: Results of the bundle adjustment for each dataset in terms of reprojection error.

Figure 5: The visibility of the derived 3D points in more than 3 images (normalized with respect to all the extracted points).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, VolumeXL-5, 2014   ISPRS Technical Commission V Symposium, 23 -25 June 2014, Riva del Garda, Italy

Table 1: Main characteristics of the employed datasets for the evaluation of feature-based methods for tie point extraction.

Table 2: Total number of correct inliers for each operator.

Table 3: Efficiency of each operator.