A CRITICAL REVIEW OF AUTOMATED PHOTOGRAMMETRIC PROCESSING OF LARGE DATASETS

The paper reports some comparisons between commercial software able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of image of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a diverse number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to provide rigorous terms of reference for comparisons and critical analyses.


INTRODUCTION
The availability of fully automated photogrammetric software allows just about anyone with a camera, even low-quality mobile phones (Tanskanen et al., 2013;Nocerino et al., 2017), to generate 3D models for various purposes.Researchers and practitioners employ nowadays photogrammetry as a valuable, powerful and cheap alternative to active sensors for textured 3D reconstruction of heritage scenarios, museum artefacts, cities, landscapes, consumer objects, etc.However, the majority of image-based users is often unaware of strengths and weaknesses of the used methodology and software, employing it much like a black-box where they can drop photographs in one end and retrieve a (hopefully) completed 3D models on the other end.Previous works (Kersten and Lindstaedt, 2012;Remondino et al., 2012;Gonizzi-Barsanti et al., 2014) demonstrated that automation in image-based methods is very efficient in most heritage projects, with great potentials, although some open research issues still exist (James and Robson, 2014;Nocerino et al., 2014;Menna et al., 2016;Cipriani and Fantini, 2017).The quality of automatically derived 3D point clouds or surface models is normally satisfactory although no standard quality analysis tools are generally implemented and used to evaluate the value of the achieved (3D) products.Moreover, not all software solutions allow a rigorous scaling & geo-referencing procedure and there is generally a lack of standard terms when reporting the results.

State-of-the-art in automated image-based 3D reconstruction
The image-based processing pipeline, based on the integration of photogrammetric and computer vision algorithms, has become in the last years a powerful and valuable approach for 3D reconstruction purposes.If in the beginning of 2000's many researchers and users moved their attention and interest to laser scanning technologies, since few years an opposite trend is happening and the image-based approach is once again very commonly used.Indeed, it generally ensures sufficient automation, low cost, efficient results and ease of use, even for non-expert users.The recent progresses were achieved in all core components of the image-based processing pipeline: image preprocessing (Verhoeven et al., 2015), keypoints extraction (Hartmann et al., 2015), bundle adjustment (Schoenberger and Frahm, 2016) and dense points clouds generation (Remondino et al., 2014).These progresses have led to fully automated methodologies (normally called Structure-from-Motion -SfM and Multi-View Stereo -MVS) able to process large image datasets and deliver 3D (both sparse and dense) results with a level of detail and precision variable according to the applications (Frahm et al., 2010;Crandall et al., 2013).Particularly in terrestrial and UAV applications, the level of automation is reaching very high standards and it is increasing the impression that few randomly acquired images -even found on the Internet (Heinly et al., 2015) and a black-box tool are sufficient to produce a professional 3D point cloud or textured 3D model.However, when it comes to applications different from web visualization or quick 3D reconstructions, end-users are still missing a valuable solution for metric applications where results can be deeply analysed in terms of accuracy, precision and reliability.As a consequence, algorithms and methods could be understated or overrated and weakness in dataset could be missed.

The trend and risk
The ease of use of many commercial photogrammetric software allows any user to take some photographs, blindly load them into the package, push a button and enjoy the obtained 3D model.This is compelling, but dangerous.Without sufficient knowledge of the processes and the software being used, non-expert users can potentially invest greater confidence in the results of their work than may be warranted.Nowadays many conferences are filled with screenshots of photogrammetric models and cameras floating over a dense point cloud.Nonetheless object distortions and deformations, scaling problems and non-metric products are very commonly presented but not understood or investigated.Therefore it is imperative that users move beyond black-box approaches of photogrammetric (or SfM/MVS) tools and begin to understand the importance of acquisition principles, data processing algorithms and standard metrics to describe the quality of results and truly quantify the value of a 3D documentation.A proper understanding of the theoretical background of algorithms running in software applications is thus advisable in order to obtain reliable results and metric products.Leaving the black-box approach behind will ensure a better usability of the results, long-lasting quality data, transferability of the methodology and a better diffusion of 3D technologies in the heritage field.

Paper objectives
This paper wants to critically evaluate the performances of three commercial packages (Agisoft PhotoScan, Pix4D Pix4Dmapper Pro and Capturing Reality RealityCapture) commonly used in the heritage community for automated 3D reconstruction of scenes.Different large datasets are employed (Table 1), each one featuring a diverse number of images, varying GSDs and some ground truth information to perform statistical analyses of the results.The null hypothesis is assuming that given the same processing parameters (number of extracted keypoints, maximum reprojection error, same GCPs, etc.), each software would produce a very similar result without any significant variation from the others.However, since each software offers a slightly different set of parameters, different terminology as well as different approaches for the image orientation and dense matching procedures, there will be some variability between the different processing results.In the paper, we are not taking into account the generation of a mesh or texturing, as the work assumes that the best measure of performances is the result of image orientation and dense matching procedures.Due to a lack of output standards, it is generally difficult to present comparisons.However, in order to understand differences, strengths and weaknesses, we will focus on:  orientation results, in terms of number of oriented cameras, theoretical precision of object points, RMS on check points (CPs), redundancy/multiplicity of 3D points;  dense point clouds: as we are familiar with each of the datasets presented here, challenging areas known to be particularly problematic for photogrammetry are analysed.Although recent benchmarks and software evaluations exist (Nex et al., 2015;Aanaes et al., 2016;Knapitsch et al., 2017;Schoeps et al., 2017), the paper focuses on more complex environments, surveyed with different platforms / cameras and comparison metrics are given following a standard terminology (Luhmann et al., 2014;Granshaw, 2016).

ADOPTED TERMINOLOGY
The fusion of photogrammetric, computer vision and robotics methods has led from one side to open and commercial solutions able to automatically process large sets of unordered images but, from the other side, to a misused terminology and a lack of clear meanings and measures.Although standard terms and metrics do exist, they are not always properly employed by all software packages and researchers, making the comparison of processing methodology and the understanding of delivered results a nottrivial task.In the following we report the most common terms and metrics which should be used when processing image datasets and delivering 3D sparse or dense point cloud results.
Bundle Adjustment (BA): "bundle" refers to the set of optical rays that, according to the collinearity condition (or central perspective camera model), connect each camera projection centre, the measured image point and corresponding 3D point in object space.Therefore, BA means to 'arrange' the bundle of optical rays departing from the images and pointing to the 3D object points to iteratively, jointly and optimally reconstruct both the 3D scene and camera (interior and exterior) parameters.If interior (principal distance and principal point coordinates) and additional parameters (radial and tangential lens distortion, affinity and shear) are also estimated, it takes the name of selfcalibrating bundle adjustment (Gruen and Beyer, 2001).
Classically, the BA is formulated as a non-linear least squares problem (Triggs et al., 1999) with all unknowns simultaneously estimated.A least squares method minimizes an objective function, being the sum of the squared residuals of the available observations (i.e.reprojection error of the image measurements).
For the collinearity model, the objective function is not independent from the model parameters and it is practical to use linear equations.Linearization implies that approximate values for all parameters are known and the most optimal values are computed in an iterative framework so that with each iteration the estimates are updated and hopefully closer to the real solution.Initial approximations of unknown parameters are normally computed with a subsequent concatenation of triangulation and resection (or DLT) procedures.The existing algorithms for finding a minimum of the objective function differ in how the structure and derivatives of this function are exploited (Nocedal and Wright, 2006).Within the photogrammetric community the most common BA solution is the iterative Netwon's method (i.e.Gauss-Markov method) whereas in the computer vision community, Gauss-Newton or Levenberg-Marquadt are used (Triggs et al., 1999;Boerlin and Grussenmeyer, 2013).Popularity of the Newton-like methods lies in their fast convergence near the (absolute) minimum.The disadvantage is that the worse are the initial approximations of the unknowns, the more costly are the iterations and the less is the guarantee that a global minimum is reached.
Structure from motion (SfM): it is a procedure to simultaneously estimate both 3D scene's geometry (structure) and camera pose (motion) (Ullman, 1979).If the camera is not pre-calibrated, calibration parameters can be simultaneously estimated as well (Szeliski, 2010).SfM entails two steps: a preliminary phase where 2D features are automatically detected and matched among images and then a bundle adjustment (BA) procedure to iteratively estimate all camera parameters and 3D coordinates of 2D features.The democratization of SfM started with the early self-calibrating metric reconstruction systems (Fitzgibbon and Zisserman, 1998;Pollefeys, 1999) which served as basis for the first systems on large and unordered Internet photo collections (Snavely et al., 2008) and urban scenes (Pollefeys et al., 2008).Inspired by these achievements, increasingly large scale reconstruction solutions were developed for thousands, millions and hundreds of millions images (Frahm et al., 2010;Agarwal et al., 2012;Heinly et al., 2015).A variety of SfM strategies were proposed, including incremental (Snavely et al., 2008;Agarwal et al., 2009;Wu, 2013;Schoenberger andFrahm, 2016), hierarchical (Gerardi et al., 2010;Cefalu et al., 2017) and global approaches (Crandall et al., 2013;Sweeney et al., 2015).Nowadays the incremental SfM is the most popular, starting with a small seed reconstruction, then growing by adding additional images/cameras and 3D points.Nevertheless, they have various drawbacks, such as repeatability, scalability, drifting, various non-estimated cameras and high computational costs (Remondino et al., 2012;Schoenberger and Frahm, 2016).
Functions describing imaging errors: deviations from the ideal central perspective camera model, due to imaging errors, are normally expressed using correction functions for the measured image coordinates.The most common functions to model systematic errors in photogrammetry were presented in Brown (1976) and Beyer (1992), considering additional parameters to model the effects of radial and tangential distortion as well as affine errors in the image coordinate system.When an individual set of additional parameters is considered (and estimated within the self-calibrating bundle adjustment), the process is defined as 'block-invariant' self-calibration.If a set of parameters is assigned to each image, the bundle is called 'photo-variant' selfcalibration (Moniwa, 1981).All available processing software applications include various variants of additional parameters but the values of these parameters are generally not directly comparable (Drap and Lefevre, 2016).Indeed, they may be normalized to the focal length value and in some cases are provided as correction values, in others as proper distortion parameters.
Residuals of image coordinates: also called reprojection error, it indicates the difference between the image observation values (i.e.measured coordinates of the matched 2D points in the images) and their computed values within the adjustment process.The reprojection error is thus the Euclidean distance between a manually or automatically measured image point and the back-projected position of the corresponding 3D point in the same image.A 3D point generated only from 2 images, in an ideal case, has a reprojection error of zero.But in real processes it differs from zero due to noise in image measurements, inaccurate camera poses and unmodelled lens distortions.
Nevertheless the reprojection error in image space is not an appropriate metric to evaluate the outcome of a BA, particularly when most of the 3D points are generated only from 2 images.
Standard deviation, variance, mean and median: in statistics, the standard deviation  is the square root of the variance, being the variance the mean of the squared deviations of a random variable x from its mean value µ.So, the variance measures the spread, or variability, of a set of (random) numbers from their mean value µ: (1) The median is the 'middle' value of a sample or population of numbers, separating it in two halves, one containing the higher values and one the lower.
Root Mean Square (RMS) and RMS Error (RMSE): while the RMS is the square root of the mean of the squared differences between the variable and its most probable value, the RMSE is computed with respect to a reference measurement, provided by an independent method.In particular, in this paper the following definitions are adopted:  RMS of the residuals in image space, i.e. the reprojection error: (3) (4) where (  ,   ) represent the image coordinates, i.e. the position of the matched 2D points, and (̅  ,  ̅  ) are the reprojected values of the computed 3D coordinated within the adjustment procedure.While  indicates the variability of a variable around its mean value, the RMS provides a measure of how much the differences, i.e. the residuals, are in average far from zero.Theoretically,  and RMS should coincide when the bias has been removed (Deakin and Kildea, 1999). RMSE computed on check points (CPs): where the subscript Comp indicates the coordinates estimated from the bundle adjustment whereas Ref indicates the reference values, i.e. the coordinates of check point measured with a reference surveying technique (e.g.GNSS).
Accuracy: it is the closeness of the result of a measurement, calculation or process to an independent, higher order reference value.It coincides with precision when measurements or samples have been filtered from gross errors, and only random errors are present.Usually, accuracy is widely used as a general term for quality (Luhmann et al., 2014;Granshaw, 2016).Typical procedures for determining accuracy include comparison with independent reference coordinates or reference lengths.The relative accuracy represents the achieved object measurement accuracy in relation to the maximum extent of the surveyed object.
Precision: it provides a quantitative measure of variability of results and is indicative of random errors, following a Gaussian or normal distribution (Granshaw, 2016).It is related to concepts like reproducibility and repeatability, i.e. the ability to reproduce to a certain extent the same result under unchanged conditions.In an adjustment process, it is calculated as a standard deviation and its estimate should always be provided with a coverage factor, e.g. 1 sigma (Luhmann et al., 2014).
Theoretical precision of object coordinates: it is the expected variability of estimated 3D object coordinates, resulting from the BA process and depending on the camera network (i.e.spatial distribution of the acquired images) and precision of image observations (i.e.quality of the image measurements).The precision is computed according to error propagation theory and it can be obtained from the BA covariance matrix.The theoretical precision would coincide with the accuracy of object coordinates if all the systematic errors are properly modelled.
Reliability: it provides a measure of how outliers (gross or systematic errors) can be detected and filtered out from a set of observations in an adjustment process.It depends on redundancy and network (images) configuration (Luhmann et al., 2014).
Redundancy and multiplicity: from a formal point of view, redundancy, also known as degree of freedom, is the excess of observations (e.g.image points) with respect to the number of unknowns (e.g.3D object coordinates) to be computed in an adjustment process (e.g.BA).For a given 3D point, the redundancy is related to the number of images where this point is visible / measured, commonly defined as multiplicity or number of intersecting optical rays.Normally, the higher the redundancy, and, consequently, the multiplicity, the better is the quality of the computed 3D point (assuming a good intersection angle).A 3D point generated only with 2 collinearity rays (multiplicity of 2 and redundancy of 1) is not contributing too much in the stability of the network and provided statistics.

Spatial resolution and ground sample distance (GSD):
the spatial resolution is the smallest detail which can be seen in an image or measured by a system, i.e. it's the smallest change in the quantity to be measured.The GSD is the projection of the camera pixel in the object space and is expressed in object space units.It can be seen as the smallest element that we can see and, ideally, reconstruct in 3D.

THE IMAGE PROCESSING PIPELINE
The tests performed in this research follow the typical photogrammetric workflow, consisting of the following steps.

Identification of image correspondences
Image correspondences (or tie points) are extracted relying on the most outperforming detector and (float or binary) descriptor algorithms (Miksik andMikolajczyk, 2012: Apollonio et al., 2014): SIFT (Lowe, 2001) and all its variants (ASIFT, ColSIFT, PCA-SIFT, SIFT-GPU, DAISY, etc.), SURF (Bay et al., 2008), FAST (Rosten et al., 2010), BRIEF (Calonder et al., 2010), ORB (Rublee et al., 2011), LDAHash (Strecha et al. 2012), MSD (Tombari and Di Stefano, 2015), etc.Those (separated or combined) methods provide a set of keypoints coupled with a vector of information useful for the successive matching and tie point detection.The keypoint matching is normally performed with the brute force method based on the Hamming distance, a conventional L2-Norm matching strategy (Kullback and Leibler, 1951) or the efficient FLANN -Fast Library for Approximate Nearest Neighbours strategy (Muja and Lowe, 2009) which is independent from the image acquisition protocol and implements a fast search structure (e.g. based on kd-trees).

Unknowns estimation through BA
The extracted image correspondences (tie points) are used to estimate all unknown parameters (camera positions and angles, camera interior parameters, and 3D coordinates of image points) in a BA process.The Levenberg-Marquardt method has proven to be one of the most successful BA solution due to its ease of implementation and its use of an effective damping strategy that gives it the ability to converge quickly from a wide range of initial guesses (Lourakis and Argyros, 2009).

Dense image matching (DIM)
Once the camera poses and the sparse point cloud consisting in the 3D coordinates of triangulated tie points are recovered, a pixel-based matching algorithm (Rothermel et al., 2012;Furukawa and Ponce, 2010;Hirschmüller, 2008;Pierrot-Deseilligny and Paparoditis, 2006) is applied to obtain dense and colorized 3D point clouds.Stereo-or multi-view approach exist, relying on precise exterior and interior orientation parameters as well as epipolar images to constraint the search for matches (Remondino et al., 2014).Most of the approaches are based on the minimization of an energy function whose components are a cost function which considers the degree of the similarity among pixels and includes constraints to consider possible errors in the matching process as well as geometric discontinuity changes.

TESTS AND ANALYSES
For the sake of consistency, all datasets (Table 1) are processed using the same computer.In the datasets with available GCPs, in order to avoid multiple collimation errors, the image coordinates of the points are measured just once and then imported and used in the other packages.The tie point extraction phase is performed forcing the same number of extracted keypoints.In the selfcalibration process, the same additional parameters are computed.In each test, the same image resolution is adopted for all the software applications in both the image correspondences extraction and DIM steps.Comments:  the BA is carried out in free-network, i.e. without any prior knowledge or constraints.The RMS error on CPs is computed after a seven-parameter Helmert transformation to obtain the photogrammetric model in the coordinate system defined by the GCPs;  the significant processing speed of ReCap is clearly noticeable;  despite a high value of max multiplicity, its value drops immediately after image pairs and this may cause instability effect in the network orientation;  very similar accuracy in object space is achieved.Comments:  although all images are oriented by the three software applications, Pix4D does not provide a correct solution for the circular network.An incorrect orientation is achieved even if the images are imported in different orders (Fig. 1);  most of the 3D points are triangulated from only 2 views.Comments:  despite the large overlap, most of the 3D points are determined only under two views, leading to noisy point clouds close to the statue edges/borders (Fig. 2).DATASET 5 -Neptune temple (857 images) Comments:  the combination of terrestrial and UAV images is not easily handled and the two sub-blocks are hardly completely oriented together. the image shuffling is not facilitating the orientation of the entire datasets.Comments:  The BA is constrained, i.e. the camera centres available from GNSS data and GCPs 3D coordinates are included in the processing.A user cannot assign the a-priori accuracy value to the GCPs coordinates in ReCap, where a worse accuracy in object space (higher RMSE on CPs) is observed. The profiles extracted from the DIM outputs (Fig. 3) are smoother in PS and ReCap, and nosier in Pix4D.However, only Pix4D could partly reconstruct the small fountain in the square.
Figure 3: DIM results (PS, Pix4D, ReCap, respectively) on an AOI of the block (above).Three profiles of the dense point clouds (below).

CONCLUSIONS
The paper presented a critical evaluation of commercial packages tested on complex (aerial and terrestrial) datasets acquired with various cameras/platforms.The analysed packages are getting very commonly used in the heritage (but not only) community where expertise and critical considerations are often overtaken by blind processing.The attained results (oriented images, computational times, accuracies, etc.) should encourage improvements in terms of reliability, repeatability and computational efficiency, notwithstanding the use of standard terminology to report the results.Although the paper has considered only commercial tools, an evaluation of available open source solutions (Table 2) will be performed soon.The paper's aim is not to declare a winner, but the presented results and comments might provide useful suggestions and valuable insights to interested readers and users.
All datasets employed in these tests are available to the community for further research purposes.The versions of the employed software are the following:  Agisoft Photoscan (PS): 1.3.1.4030 Pix4D Pix4D Mapper (Pix4D): 3.1.23 Capturing Reality RealityCapture (ReCap): 1.0.2.2600It is worth mentioning that the tested version of ReCap does not provide access to the result of the DIM, being it fused with the meshing step.Therefore, the obtained 3D output corresponds to the vertices of the generated mesh model.In the next tables report the results of the image orientations and, in two cases, for the DIM.DATASET 1 -Duomo square (359 images) -1 is characterised by small geometric details (bricks);  the three DIM outputs do not show grooves or indentations along the edges of the bricks, although showing a comparable plane fitting RMS;  the AOI-2 features a homogenous texture, which causes noisy DIM results and gaps in dense cloud;  the geometric details of AOI-3 are better provided by the all software solutions, even if ReCap seems to generate sharper results.

Figure 1 :
Figure 1: Camera poses retrieved by Pix4D in various tests keeping the same processing settings and just changing the order of the images.

Table 1 :
Summary of employed datasets.