HIERARCHICAL STRUCTURE FROM MOTION COMBINING GLOBAL IMAGE ORIENTATION AND STRUCTURELESS BUNDLE ADJUSTMENT

ABSTRACT: Global image orientation techniques aim at estimating camera rotations and positions for a whole set of images simultaneously. One of the main arguments for these procedures is an improved robustness against drifting of camera stations in comparison to more classical sequential approaches. Usually, the process consists of the computation of absolute rotations and, in a second step, absolute positions for the cameras. Either the first or both steps rely on the network of transformations arising from relative orientations between cameras. Therefore, the quality of the obtained absolute results is influenced by tensions in the network. These may be induced, e.g., by insufficient knowledge of the intrinsic camera parameters. Another reason can be found in local weaknesses of image connectivity. We apply a hierarchical approach with intermediate bundle adjustment to reduce these effects. We adopt efficient global techniques which register image triplets based on fixed absolute camera rotations and scaled relative camera translations but do not involve scene structure elements in the fusion step. Our variant employs submodels of arbitrary size, orientation and scale, by computing relative rotations and scales between, and subsequently absolute rotations and scales for, submodels, and is applied hierarchically. Furthermore, we substitute classical bundle adjustment with a structureless approach based on epipolar geometry and augmented with a scale consistency constraint.


INTRODUCTION
Research on Structure from Motion (SfM), i.e. automatically solving the task of image orientation and scene structure computation for sets of visually connected images, has made large progress over the past decades. Generally speaking, the process consists of three steps: (1) visual connections are found, usually using distinct image features which are matched between images based on local texture descriptors and subsequently validated geometrically; (2) approximate solutions for scene structure and/or camera orientation are computed; and (3) the approximate solution is optimized. The first step is crucial for success and is one of the main bottlenecks in terms of time consumption. However, in this paper we will focus on the latter two steps, which are of no less importance for the overall efficiency.
The majority of approaches are of incremental nature, i.e. starting from a pair or triplet of images, a number of images is added to the current model (2) and the result is optimized (3). The process repeats until the full model is solved. This approach benefits from careful progression and can be successful in complicated and large scenarios (Agarwal et al., 2011; Snavely et al., 2006; Snavely et al., 2008; Jeong et al., 2012). However, the optimization is carried out by bundle adjustment. The resulting system is highly nonlinear and solving it is computationally intensive. Its repetitive use on a growing number of images is one of the drawbacks of incremental approaches. The problem of loop closure poses another challenge, especially for larger image sequences. The orientation error built up during reconstruction may become large enough to prevent the geometric establishment of a connection between images. This also directly induces a dependency on the order in which images are added to the model. Hierarchical approaches separate the data into smaller subsets, which are solved separately and fused in subsequent steps following the tree-like structure of connected image clusters. These approaches work on problems of moderate size during intermediate reconstructions and thereby reduce the overall workload. Furthermore, the orientation error build-up is reduced, leading to better loop closing behaviour.
Global approaches to SfM, on the other hand, solve step (2) for all cameras simultaneously by exploiting relative transformation information from step (1), aiming at a single and final execution of the optimization (3). The majority of approaches solve absolute rotations in a first step and separately estimate scene structure and/or camera positions in a subsequent step. Apart from efficiency, an advantage is the even distribution of orientation errors over the whole data, which effectively eliminates the loop closing problem. On the other hand, the main difficulty lies in finding robust solutions, as the underlying set of relative relations may be corrupted by inaccuracies and erroneously established connections.
We present our SfM approach, which is based on the use of structureless bundle adjustment in combination with global image orientation techniques that are applied hierarchically to submodels instead of cameras.

RELATED WORK
An overview of classical bundle adjustment can be found in (Triggs et al., 2000), which covers many efficiency related topics. A very efficient structureless approach has been presented by (Rodriguez et al., 2011a, 2011b). Only two-view epipolar constraints, encapsulated in a measurement matrix (Hartley, 1998), are used. Factorization reduces the system to 9×9 per image pair, resulting in a remarkable speed-up. In (Steffen et al., 2010) epipolar constraints are combined with trifocal constraints to overcome the problem of distance ambiguity for collinear camera stations, which is induced by the exclusive use of epipolar constraints. (Indelman, 2012a; Indelman et al., 2012b) derive a scale consistency constraint from the addition of scaled observation vectors and camera baselines in a three-view scenario by reformulating rank conditions of the equation system. (Rodriguez et al., 2011a, 2011b) and (Indelman et al., 2012b) demonstrate that accurate solutions can be achieved without the introduction of additional unknowns, which is also the case for our implementations (Cefalu et al., 2016). Apart from a reduced number of unknowns, a further advantage is found in the absence of numerical problems due to points at far distance.
Early work on global image orientation has been conducted by (Govindu, 2001). In the majority of existing approaches, absolute camera rotations are estimated and held fixed for a subsequent separate estimation of camera poses. A thorough discussion of rotation averaging and distance measures in SO(3) is given in (Hartley et al., 2013). Based on quaternions, (Govindu, 2001) suggests a linear least squares solution which in practice suffers from missing orthonormality constraints. In (Martinec & Pajdla, 2007) the constraints are enforced in a subsequent step. (Arie-Nachimson et al., 2012) include these constraints while using semi-definite programming. The approach is adopted and enhanced in (Reich & Heipke, 2016). The Weiszfeld algorithm under the L1 norm is applied in (Hartley et al., 2011) to increase robustness. (Chatterjee & Govindu, 2013) refine their L1 solution in an iteratively reweighted least squares process in which they incorporate the Huber estimator (Huber, 1964), i.e. an M-estimator. Other approaches aim at detecting and eliminating incorrect relative rotations based on the concept of cycle consistency. The Bayesian inference approach of (Zach et al., 2010) is picked up in (Moulon et al., 2013) and combined with the cycle length weighting of (Enqvist et al., 2011). The latter can be categorized as a spanning tree approach in which paths are drawn from the camera graph following some heuristic and used to identify inconsistent graph edges (Govindu, 2006; Olsson & Enqvist, 2011; Cefalu & Fritsch, 2014b). A survey of various methods is presented in (Tron et al., 2016). (Wilson et al., 2016) investigate problem sources in rotation averaging and suggest reducing to smaller subproblems, very much in the sense of our solution to model fusion using absolute rotations for models.
Given known absolute rotations for cameras, various approaches exist to estimate absolute positions. Some approaches simultaneously estimate scene structure. A reformulation of the collinearity constraint is used in (Kahl, 2005) to simultaneously estimate camera poses and scene structure under the L∞ norm. The concept is picked up by e.g. (Olsson & Enqvist, 2011) and (Martinec & Pajdla, 2007). The latter suggest reducing the number of reconstructed points by selecting four representative points per image pair to lower the computational workload. (Arie-Nachimson et al., 2012) make use of point observations, but reformulate the epipolar constraint by replacing relative orientations with absolute ones. With fixed rotations, the resulting system is linear and can be solved efficiently. Essentially, this approach equals structureless bundle adjustment without additional constraints. The number of unknowns is drastically reduced, but collinear camera distributions cannot be handled satisfactorily. The same holds for early methods using only baseline estimates (Govindu, 2001; Govindu, 2006). Implicit use of point triangulation is made in (Cui et al., 2015), which overcomes this problem. (Reich & Heipke, 2016) further robustify the approach. The problem can also be avoided by the use of trifocal tensors (oriented image triplets), as relative scales are encapsulated (Moulon et al., 2013; Jiang et al., 2013; Özyesil & Singer, 2015). In (Jiang et al., 2013) a linear model is used which is closely related to the scale consistency constraint used in our structureless bundle adjustment.
Work on hierarchical reconstruction is given in (Nister, 2000; Farenzena et al., 2009; Gherardi et al., 2010; Ni & Dellaert, 2012; Havlena et al., 2009; Chen et al., 2016), to name a few. Approaches differ in how the tree structure is defined and in strategies for balancing the tree, i.e. maintaining a homogeneous size of partial reconstructions and avoiding unnecessary workload. However, usually 3D points are used for the registration of models. We are not aware of approaches within this category which rely on exterior camera orientation.
Our approach builds on trifocal tensor based methods and makes a step towards combining some of the previously mentioned strategies. Scene structure is only taken into account when solving single image triplets. Our variant fuses models of arbitrary image number, scale and orientation and is applied repeatedly to locally neighbouring models. The resulting process creates a hierarchy of overlapping models and allows intermediate optimization including camera calibration. Global rotation computation is applied to models instead of cameras, and camera rotation computation is broken down to a single rotation averaging problem. To refine fused models we apply structureless bundle adjustment. Section 3.5 explains the scale consistency constraint used in our current implementation, which to our knowledge has not been used in this context. The development of this strategy was mainly driven by the desire for intermediate adjustment. Insufficient knowledge of calibration parameters may result in drifts which in some cases are hard to compensate with a single final adjustment after a fully global orientation of images. Examples demonstrating such cases are given in section 4, which also contains a numerical evaluation on public benchmark datasets.

Process Chain
We start by detecting image features using SIFT (Lowe, 2004) and match over image pairs based on the descriptors. Geometric validation is carried out using parallax based pre-filtering (a histogram based variant of Cefalu et al., 2014a), followed by a standard RANSAC (Fischler & Bolles, 1981) process. Here, the 5-point algorithm (Nister, 2004) is applied, using approximate information on the camera, e.g. from EXIF headers. In order to stabilize relative orientation estimates for pairs with strong perspectives, we establish point tracks throughout the complete data, thereby creating additional point matches which have not been revealed during descriptor matching. The tracks are furthermore used to select well distributed stable points in every image to reduce the total amount of points. The results are revalidated in a second run of the two-view RANSAC to make use of the newly established matches. Weak pairs are rejected based on a minimal point count (20) and the overlap of the convex hulls of matching features (5%). The connectivity graph is thinned out further by reducing to the 10 strongest connections per image in terms of point support. A test on rotational consistency follows. In contrast to (Zach et al., 2010), we consider only rotational errors of image triplets, which makes the process fast since no longer loops have to be found. Every triplet is evaluated by a simple 3σ outlier test against the error budget of its local triplet neighbourhood. The edges of the remaining triplets define the new image connectivity graph. The test is repeated until no further edges are filtered out. The remaining image triplets are ranked based on the number of threefold point observations and, in contrast to (Moulon et al., 2013), the intersection angles between observed points. The latter measure biases the ranking towards a preference for triplets with longer base lines. Following the rank order (best first), we reconstruct triplet models (see section 3.2) only if they contain at least one edge which has not been part of an already successfully solved triplet. Suspicious triplets are rejected. If rejections occurred, the process is repeated on the remainder of untested triplets. Edges of the connectivity graph which are not part of a successfully reconstructed triplet are removed. Depending on the overall connectivity, this subselection reduces the number of triplets to roughly 30%-60%. This set of triplets represents the lowest level of our submodel hierarchy. The following hierarchical fusion process consists of three repeatedly applied steps. First, target models are defined by local growth of all source models (the models of the current highest level, section 3.3). Second, the target models are fused from the corresponding source models as described in section 3.4. Third, the fused models are refined using our structureless bundle adjustment implementation (section 3.5).
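As an illustration, the rotational consistency test on triplets can be sketched as follows. The helper `rot_x`, the synthetic rotations and the direct inspection of the loop error are illustrative; the paper compares the error against the 3σ budget of the local triplet neighbourhood instead of a fixed threshold:

```python
import numpy as np

def rot_x(theta):
    """Rotation about the x-axis (illustrative helper)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def triplet_loop_error(R_ij, R_jk, R_ki):
    """Angular error (radians) of the rotation cycle i -> j -> k -> i.
    For consistent relative rotations the chained product is the
    identity; the deviation angle measures the triplet's consistency."""
    loop = R_ki @ R_jk @ R_ij
    cos_angle = np.clip((np.trace(loop) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_angle)

# A consistent triplet closes its loop exactly ...
R_ij, R_jk = rot_x(0.3), rot_x(0.2)
R_ki = (R_jk @ R_ij).T
err_clean = triplet_loop_error(R_ij, R_jk, R_ki)

# ... while an edge corrupted by 0.1 rad is clearly visible.
err_bad = triplet_loop_error(rot_x(0.4), R_jk, R_ki)
```

Because only triplet loops are inspected, each test is a single 3×3 matrix product, which is what keeps this filter fast compared to searching longer cycles.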

Triplet Reconstruction
Every triplet selected for reconstruction is solved by first computing absolute rotations for the three cameras and their local neighbourhood (similarly to our approach for absolute model rotations in section 3.4.2). The inclusion of the neighbourhood is not strictly necessary but can smooth possible inaccuracies inherent in a single triplet. Similar to (Moulon et al., 2013), we solve the positions of the cameras within a RANSAC framework, while keeping the rotations (of the triplet) fixed and enforcing correct chirality, i.e. points must triangulate in the viewing direction of the cameras. However, we use a different model. Let r = R′K^-1 x denote the ray cast in viewing direction from a projection centre C to an observed point, where x is the point's projection on the image plane in homogeneous coordinates, R is the camera rotation and K the camera matrix (see also section 3.5). Further, let λ be a scale factor which scales the ray r to end in the observed point in object space (Figure 1). For three cameras i, j and k, vector addition yields:

C_i + λ_i r_i = C_j + λ_j r_j = C_k + λ_k r_k (1)

In every RANSAC iteration a solution is generated from two sampled threefold observations, solving the above system using an interior point method and forcing positivity of λ. The position estimates are used to triangulate all points of the triplet and carry out the consensus test on chirality and reprojection errors. The final RANSAC result is further refined using structureless bundle adjustment. Triplets resulting in low inlier rates or a high error budget are rejected. Taking scene structure into account in this step significantly improves the robustness of the reconstruction process, as it eases the identification and removal of outlier correspondences and therefore significantly improves the quality of the orientation estimates.
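A minimal sketch of the position solve for one RANSAC sample is given below, following the vector addition of Figure 1. For simplicity it uses plain least squares instead of the interior point method of the paper (so the positivity of the λ is not enforced), fixes camera i at the origin, and fixes the first λ to remove the scale gauge; all names and the synthetic geometry are illustrative:

```python
import numpy as np

def solve_triplet_positions(rays, fix_scale):
    """Solve C_j and C_k (with C_i fixed at the origin) plus the ray
    scales lambda from two threefold observations.

    rays: (2, 3, 3) array; rays[p][c] is the unit ray of point p in
    camera c (0, 1, 2 for i, j, k), already rotated into the common
    frame.  The scale of the first ray of camera i is fixed to
    fix_scale to remove the gauge freedom."""
    r_i0, r_j0, r_k0 = rays[0]
    r_i1, r_j1, r_k1 = rays[1]
    # Unknowns: [C_j (3), C_k (3), l_j0, l_k0, l_i1, l_j1, l_k1]
    A = np.zeros((12, 11))
    b = np.zeros(12)
    for d in range(3):
        # point 0:  C_j + l_j0 r_j0 = fix_scale r_i0,  same for camera k
        A[d, d] = 1.0;         A[d, 6] = r_j0[d];      b[d] = fix_scale * r_i0[d]
        A[3 + d, 3 + d] = 1.0; A[3 + d, 7] = r_k0[d];  b[3 + d] = fix_scale * r_i0[d]
        # point 1:  C_j + l_j1 r_j1 - l_i1 r_i1 = 0,   same for camera k
        A[6 + d, d] = 1.0;     A[6 + d, 9] = r_j1[d];  A[6 + d, 8] = -r_i1[d]
        A[9 + d, 3 + d] = 1.0; A[9 + d, 10] = r_k1[d]; A[9 + d, 8] = -r_i1[d]
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u[0:3], u[3:6]

# Synthetic check: known camera centres and two object points.
C = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
X = np.array([[0.0, 0, 5], [1, 1, 4]])
rays = np.array([[(x - c) / np.linalg.norm(x - c) for c in C] for x in X])
C_j, C_k = solve_triplet_positions(rays, fix_scale=np.linalg.norm(X[0] - C[0]))
```

With noise-free rays the system is consistent and the least squares solution recovers the camera centres exactly; in practice the positivity constraint on the scales additionally guards against chirality violations.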

Model Growth
The set of triplets solved in the previous step forms the starting level of our iterative fusion of models. We refer to models of the current level as source models and to models of the next level as target models. For every source model we perform a growth step, which defines a corresponding target model. All source models having at least one edge in common with the currently growing source model define the set of images of the resulting target model and will be used for fusion. Further source models are added if they consist solely of images from this set. The latter step ensures that all possible triplets are used in the following fusion step, but is of minor importance on higher hierarchy levels. We remove duplicate target models and those which are fully contained in others (e.g. at the borders of the reconstruction). The resulting set of target models is a set of locally grown models, overlapping in the sense of sharing identical cameras. As every edge of the camera graph remains represented, loop closures are preserved.
The defined target models are created by fusion of the corresponding source models and eventually optimized via bundle adjustment in a subsequent step. The results serve as source models for the next stage of local growth, while models of 15 or more images are not grown further. Instead, they are passed through to the next level unchanged. If a single model is created, a final adjustment terminates the process. Otherwise, when all source models have reached the maximum size, the local growth and fusion is replaced by a global fusion of all source models and a final adjustment is carried out.

Model Fusion
Given a set of source models which are to be fused into a target model, we apply a three step procedure. First, relative rotations and scales between the models are computed from groups of camera rotations and base length estimates, respectively. In a second step, the relative estimates are used to derive absolute rotations and scales for the models. Finally, the third step separately solves absolute camera rotations and positions.

Relative Model Rotations and Scales:
In the same way as images may be represented by a graph of relative orientations, we can build a graph of relative orientations between the source models being fused (Figure 2). We first compute relative rotations between source models using the rotations of all cameras shared between models. Let R_c^m denote the rotation of a camera c in a model m. With Q being the set of n cameras shared by the models a and b, the corresponding rotations are stacked to R̄_a = stack({R_c^a}), c ∈ Q, and R̄_b accordingly. The two resulting matrices are of size 3n × 3. We apply Kabsch's algorithm (Kabsch, 1976), which is also used in Procrustes analysis. With

H = R̄_a′ R̄_b (2)

and H = U S V′ (singular value decomposition), we compute the relative rotation as:

R_a,b = V diag(1, 1, det(V U′)) U′ (3)

Further, we compute a relative scale between source models. For every possible image pair in Q, an estimate can be derived from the ratio of the corresponding base lengths in the two models. We use the mode of a kernel density estimation (KDE) as the final relative scale s_a,b.
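The relative rotation estimation of eqs. (2)-(3) can be sketched as follows. The convention assumed here (camera rotations map world to camera coordinates, so that R_c^b = R_c^a R_a,b′) and the `rodrigues` helper are illustrative:

```python
import numpy as np

def relative_model_rotation(Rs_a, Rs_b):
    """Kabsch estimate of the relative rotation between two models
    from the stacked rotations of their shared cameras (eqs. (2)-(3))."""
    P = np.concatenate(Rs_a, axis=0)            # stacked 3n x 3 matrix
    Q = np.concatenate(Rs_b, axis=0)
    H = P.T @ Q                                 # eq. (2)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T                       # eq. (3)

def rodrigues(axis, angle):
    """Axis-angle to rotation matrix (axis must have unit length)."""
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * K @ K

# Three shared cameras; model b differs from model a by R_true.
R_true = rodrigues(np.array([0.0, 0.0, 1.0]), 0.5)
Rs_a = [rodrigues(np.array([1.0, 0.0, 0.0]), t) for t in (0.1, 0.7, 1.3)]
Rs_b = [R @ R_true.T for R in Rs_a]
R_est = relative_model_rotation(Rs_a, Rs_b)
```

The determinant correction in `D` guards against a reflection in degenerate or noisy configurations; with noise-free input the estimate is exact.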

Absolute Model Rotations and Scales:
The desired absolute rotations and scales of the source models are derived from multiple paths (kinematic chains) drawn at random from the source model graph. We start at random nodes (models) and guide the selection of edges (relative relations between models) to be roughly evenly distributed. We establish a common rotational frame for the paths by setting the mean rotation (see section 3.4.3) of every path to identity, i.e. applying the inverse mean rotation. In a similar way, the mean scale of a path is set to 1. As a result, we obtain multiple estimates for the absolute rotation R_m and scale s_m of every source model m. Single solutions can be computed independently. We use KDE again for the scales and search for the mode. For the rotation R_m, we use L2 chordal mean rotation computation with iterative reweighting.
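The KDE mode search over the scale estimates can be sketched with a simple fixed-bandwidth Gaussian kernel. The paper does not state the kernel or bandwidth used; the values below are illustrative:

```python
import numpy as np

def kde_mode(samples, bandwidth=0.05, grid_size=512):
    """Mode of a 1-D Gaussian kernel density estimate, used as a
    robust vote over a set of noisy scale estimates."""
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    dens = np.exp(-0.5 * ((grid[:, None] - samples[None, :])
                          / bandwidth) ** 2).sum(axis=1)
    return grid[np.argmax(dens)]

# Scale estimates clustered around 2.0 plus two gross outliers: the
# KDE mode ignores the outliers, unlike e.g. the mean (about 2.46 here).
ratios = np.array([1.95, 1.98, 2.0, 2.02, 2.05, 2.01, 1.99, 8.0, 0.1])
scale = kde_mode(ratios)
```

Taking the density mode rather than a mean or median makes the estimate insensitive to a moderate fraction of grossly wrong ratios, which matches its use as a voting scheme here.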

Mean Shift Single Rotation Averaging:
An L2 mean rotation R under the chordal metric (Hartley et al., 2013) for a set of n rotations R_i can be computed as in (3), except that here

H = Σ_i R_i′ (4)

In order to robustify the resulting estimate against outliers, we compute an initial guess of R and use the chordal distance d_i between R and R_i to derive weights for subsequent iterations.
As weighting function we use a Gaussian kernel with a bandwidth of the current weighted rms σ.
Accordingly, we rewrite

H = Σ_i w_i R_i′ (5)

and apply (3). We iterate the process until convergence or at most ten times. The process essentially resembles a mean shift algorithm. A three sigma cut-off is added to fully suppress strong outliers.
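A numpy sketch of this robust single rotation averaging (projection of the weighted sum onto SO(3), Gaussian reweighting with bandwidth σ, and the 3σ cut-off) could look as follows; the helper `rot_z` and the test rotations are illustrative:

```python
import numpy as np

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def chordal_mean(Rs, w):
    """Weighted L2 chordal mean: project the weighted sum of the
    rotations onto SO(3) via SVD."""
    H = sum(wi * Ri for wi, Ri in zip(w, Rs))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def mean_shift_rotation(Rs, iters=10):
    """Iteratively reweighted chordal mean with a Gaussian kernel
    (bandwidth = current weighted rms of the chordal distances) and a
    three sigma cut-off against strong outliers."""
    w = np.ones(len(Rs))
    for _ in range(iters):
        R = chordal_mean(Rs, w)
        d = np.array([np.linalg.norm(R - Ri, 'fro') for Ri in Rs])
        sigma = np.sqrt((w * d ** 2).sum() / w.sum()) + 1e-12
        w = np.exp(-0.5 * (d / sigma) ** 2)
        w[d > 3.0 * sigma] = 0.0
    return R

# Five rotations near 0.5 rad about z plus one gross outlier at 2.5 rad.
Rs = [rot_z(t) for t in (0.45, 0.48, 0.5, 0.52, 0.55, 2.5)]
R_mean = mean_shift_rotation(Rs)
angle = np.arctan2(R_mean[1, 0], R_mean[0, 0])
```

In this example the outlier is first down-weighted by the Gaussian kernel and then removed entirely once the shrinking bandwidth pushes it past the 3σ cut-off, so the mean settles near the inlier cluster.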

Absolute Camera Positions:
Given that an absolute rotation R_m and an absolute scale s_m are known for every source model m, the absolute positions C_i and C_j of two cameras i and j can be expressed as

C_j - C_i = s_m R_m (C_j^m - C_i^m) (6)

where C_i^m and C_j^m denote the camera positions within the source model m. The full equation system is formed by considering all visually connected image pairs in all source models participating in the fusion of the target model.
Figure 3. Camera positions in a target model are derived from corresponding base lines in the rotated and scaled source models a, b and c. For simplification, only one camera pair is included in the sketch.
In order to robustify against incorrect bases, we apply iterative reweighting using three weighting functions. A first weight, derived from the number of matches between the images, stays fixed during the iterations. A second weight evaluates the a posteriori Euclidean position residual using the Hampel estimator (Hampel et al., 1986) with thresholds at 1, 2.5 and 4. A third weight for directional discrepancy is computed as the scalar product between the (normalized) left- and right-hand sides of (6), with negative values set to zero. Multiplication yields the final weights. Standard weighted least squares is used to solve the system.
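A simplified sketch of this step is given below. For brevity only the Hampel position weight is modelled (the match-count and direction weights are omitted), camera 0 is fixed at the origin to remove the translational gauge, and all baseline observations are assumed to be already rotated and scaled into the target frame as in (6); all names are illustrative:

```python
import numpy as np

def hampel_weight(r, a=1.0, b=2.5, c=4.0):
    """Hampel three-part redescending weight for a normalized residual."""
    r = abs(r)
    if r <= a:
        return 1.0
    if r <= b:
        return a / r
    if r <= c:
        return (a / r) * (c - r) / (c - b)
    return 0.0

def solve_positions(n_cams, baselines, iters=5):
    """Iteratively reweighted least squares for absolute camera
    positions from baseline observations (i, j, b) with b ~ C_j - C_i."""
    w = np.ones(len(baselines))
    for _ in range(iters):
        A = np.zeros((3 * len(baselines), 3 * (n_cams - 1)))
        rhs = np.zeros(3 * len(baselines))
        for e, (i, j, bl) in enumerate(baselines):
            for d in range(3):
                row = 3 * e + d
                if j > 0:
                    A[row, 3 * (j - 1) + d] = w[e]
                if i > 0:
                    A[row, 3 * (i - 1) + d] = -w[e]
                rhs[row] = w[e] * bl[d]
        x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
        C = np.vstack([np.zeros(3), x.reshape(-1, 3)])
        res = np.array([np.linalg.norm(C[j] - C[i] - bl) for i, j, bl in baselines])
        s = np.sqrt((res ** 2).mean()) + 1e-12
        w = np.array([hampel_weight(r / s) for r in res])
    return C

# Three cameras with consistent baseline observations.
baselines = [(0, 1, np.array([1.0, 0, 0])),
             (1, 2, np.array([0.0, 1, 0])),
             (0, 2, np.array([1.0, 1, 0]))]
C = solve_positions(3, baselines)
```

With fixed rotations the system is linear in the positions, which is why a few reweighted least squares passes suffice.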

Absolute Camera Rotations and Intrinsic Camera Parameters:
By applying the absolute source model rotations to the single cameras, we obtain one or more solutions for every camera rotation. In the latter case, we again apply the single rotation averaging described in section 3.4.3. As initial values for the subsequent bundle adjustment, the intrinsic camera parameters are 'fused' by taking the median of every parameter over the set of source models used for fusion.

Structureless Bundle Adjustment
The results of the source model fusion step are utilized to initialize a structureless bundle adjustment. Here, the adjustment takes place without the use of 3D structure elements by founding the system of equations on epipolar constraints instead of collinearity equations. We have improved on our approach presented in (Cefalu et al., 2016), the major difference being the additional use of a scale consistency constraint. Though satisfying results could be achieved with our previous implementation for many datasets, this augmentation significantly improved the robustness of our approach, for two reasons. First, collinear camera positions form a degenerate case for epipolar geometry, as the distances between cameras become ambiguous. As our approach to image orientation relies on the quality of the base line estimation, the exclusive use of epipolar geometry may induce disadvantageous instabilities, though this is rarely the case. Second, and in our experience of much greater importance, man-made objects often exhibit regularly distributed repetitive structures. As a result, mismatches of feature descriptors occur more frequently and often cannot be identified by their epipolar distance. The scale consistency constraint is violated if rays do not intersect in a single point and thereby helps not only in disambiguating the distances between cameras, but also in identifying outliers.
As in section 3.2, let x be the projection of a point onto an image plane in homogeneous coordinates. We define its pendant corrected for lens distortion using the Brown model (Brown, 1966) as x̄ = x + Δ_rad + Δ_tan. We compute the epipolar distance in image k for the same point observed in the images i and k as:

d_i,k = x̄_k′ l_i,k (7)

where l_i,k = N((K_k^-1)′ R_k [C_i - C_k]× R_i′ K_i^-1 x̄_i) is the formulation of the epipolar line in image k using absolute camera rotations R and positions C, and N(·) normalizes the line vector over its first two components. The camera matrix K encapsulates the camera constant and the principal point. The normalization is necessary to avoid scale dependency, which would cause the camera stations to collapse to a single point. Moreover, the variant used above expresses the residual in pixel metric. However, it results in d_i,k ≠ d_k,i, and we observe increased quality of the results when both projection directions are used.
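The epipolar residual can be sketched as follows; the fundamental matrix is formed from the absolute orientations as in the line formulation above, and the synthetic cameras and point are illustrative (distortion corrections are omitted):

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_distance(x_i, x_k, K_i, K_k, R_i, R_k, C_i, C_k):
    """Signed point-to-epipolar-line distance in image k for a point
    observed in images i and k, built from absolute rotations and
    positions; the line normalization yields a pixel-metric residual."""
    F = np.linalg.inv(K_k).T @ R_k @ skew(C_i - C_k) @ R_i.T @ np.linalg.inv(K_i)
    l = F @ x_i                          # epipolar line in image k
    l = l / np.linalg.norm(l[:2])        # normalize first two components
    return float(l @ x_k)

# Synthetic two-camera setup observing one point.
K = np.array([[800.0, 0, 400], [0, 800, 400], [0, 0, 1]])
R = np.eye(3)
C_i, C_k = np.zeros(3), np.array([1.0, 0, 0])
X = np.array([0.2, 0.1, 5.0])
x_i = K @ (X - C_i); x_i /= x_i[2]
x_k = K @ (X - C_k); x_k /= x_k[2]
d_exact = epipolar_distance(x_i, x_k, K, K, R, R, C_i, C_k)
# Shifting the observation 2 px off the line yields a 2 px residual.
d_shift = epipolar_distance(x_i, x_k + np.array([0.0, 2.0, 0.0]),
                            K, K, R, R, C_i, C_k)
```

Note the asymmetry mentioned in the text: evaluating the distance in image i instead of image k generally gives a different value, which is why both directions are used.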
Figure 4. Three cameras i, j and k casting rays r towards an observed object point as in Figure 1. Here, the angles α between base lines and image rays and the angles β for intersection angles between observations are added.
We further redefine r = R′K^-1 x̄ to include the distortion corrections. For readability, we write b for the base between two cameras in the further context. Moreover, we introduce angles α between base lines and image rays and β for intersection angles between rays (Figure 4). For a triangle formed by two images i and j and a point observed in both images, we may express the distance from the projection centre of image i to the point using the law of sines:

λ_i = |b_i,j| sin α_j / sin β_i,j (8)

The relation between the sine and the cross product allows the substitutions |b_i,j| sin α_j = |b_i,j × r_j| and sin β_i,j = |r_i × r_j| (for unit length rays):

λ_i = |b_i,j × r_j| / |r_i × r_j| (9)

The distance must be equal when computed using an observation of a third image k, which leads to the constraint:

|b_i,j × r_j| / |r_i × r_j| = |b_i,k × r_k| / |r_i × r_k| (10)

Setting the ratio of the two right hand side expressions to one formulates our functional model, which is free of the unknown λ_i. Subtraction is also possible, but suffers from a dependency on the overall scale. In any case, the errors are not in pixel metric and are therefore handled in the sense of a variance component analysis. Moreover, the number of equations which could be used may be very high for long point tracks. Therefore, we use only one equation for every ray r, derived from two other cameras in the order of appearance. As (Rodriguez et al., 2011a, 2011b) and (Indelman et al., 2012b), we solve the system without additional unknowns. Our implementation uses weighted (Hampel estimator) nonlinear least squares in a Levenberg-Marquardt framework.
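The constraint can be sketched directly from the cross product form of the law-of-sines distance; with exact geometry the ratio of the two distance estimates is one, while a perturbed ray violates it. The synthetic cameras, point and perturbation below are illustrative:

```python
import numpy as np

def lam(b, r_i, r_other):
    """Law-of-sines distance from camera i to the point:
    |b x r_other| / |r_i x r_other| for unit length rays."""
    return (np.linalg.norm(np.cross(b, r_other))
            / np.linalg.norm(np.cross(r_i, r_other)))

def scale_consistency_ratio(b_ij, b_ik, r_i, r_j, r_k):
    """Ratio of the two distance estimates for the same ray r_i; it
    equals 1 exactly when the scaled rays meet in a single point."""
    return lam(b_ij, r_i, r_j) / lam(b_ik, r_i, r_k)

# Exact synthetic geometry: three cameras observing one point.
C_i, C_j, C_k = np.zeros(3), np.array([1.0, 0, 0]), np.array([0.0, 1, 0])
X = np.array([0.3, 0.4, 2.0])
r_i, r_j, r_k = [(X - C) / np.linalg.norm(X - C) for C in (C_i, C_j, C_k)]
ratio_exact = scale_consistency_ratio(C_j - C_i, C_k - C_i, r_i, r_j, r_k)

# A slightly deflected ray r_k (e.g. a mismatch) violates the constraint.
r_bad = r_k + np.array([0.0, 0.05, 0.0]); r_bad /= np.linalg.norm(r_bad)
ratio_bad = scale_consistency_ratio(C_j - C_i, C_k - C_i, r_i, r_j, r_bad)
```

This also illustrates why the constraint flags mismatches on repetitive structures: a wrong correspondence may still lie close to the epipolar line, yet its ray will generally not intersect the other two rays in a single point.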

EXPERIMENTAL RESULTS
We numerically evaluate the accuracy of our overall approach on the six benchmark datasets published in (Strecha et al., 2008) and compare our results to those reported by other authors. For these datasets, ground truth is available for the image orientations as well as for the two camera constants and the principal point. The images have been corrected for radial distortion. However, we do not use the given intrinsic parameters but initialize our parameters from the EXIF headers of the dataset FountainR25 (for which unfortunately no ground truth orientation is available) and solve for all parameters implemented in our camera model. Furthermore, we have deactivated the hierarchical procedure for this test in order to assess the quality of our fusion approach, i.e. image triplets are fused directly into a final model and refined by a single final adjustment.
For the evaluation, we transform our resulting camera stations onto the ground truth camera stations using a seven parameter transformation. Table 1 and Table 2 summarize the results before adjustment, while Table 3 and Table 4 present the final results. Entries are left empty when no result is reported on a dataset. For authors reporting different variants of their approaches, a representative one is selected and indicated in the table. The tables also state whether EXIF information or ground truth (GT) was used for camera initialization. The number of images is given in the names of the datasets (full names are given in Table 4). The examples in Figure 5 to Figure 8 have been computed with the hierarchical approach as described in the paper. The sparse point clouds have been triangulated after the final adjustment for visualization purposes. Camera stations are depicted as coordinate frames (red, green and blue as X, Y and Z, the latter pointing in viewing direction). An example of a relatively large dataset with many narrow camera stations is given in Figure 8 (Farenzena et al., 2009). Figure 5 exemplarily demonstrates the ability of our approach to handle collinear and forward looking cameras at the example of an image based mobile mapping sequence (Cavegn et al., 2016). Figure 6 and Figure 7 show two cases in which the effect of intermediate camera calibration is evident.
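The paper does not specify how the seven parameter (similarity) transformation is estimated; a common closed-form choice is Umeyama's method, sketched here with illustrative synthetic stations:

```python
import numpy as np

def similarity_align(P, Q):
    """Closed-form least squares similarity transform (scale s,
    rotation R, translation t) with q ~ s R p + t, mapping estimated
    camera stations P onto ground truth stations Q (both n x 3)."""
    mP, mQ = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - mP, Q - mQ
    H = P0.T @ Q0 / len(P)
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    s = (S[0] + S[1] + d * S[2]) / (P0 ** 2).sum(axis=1).mean()
    t = mQ - s * R @ mP
    return s, R, t

# Recover a known similarity transform from five stations.
P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
c, w = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -w, 0.0], [w, c, 0.0], [0.0, 0.0, 1.0]])
Q = 2.0 * P @ R_true.T + np.array([1.0, 2.0, 3.0])
s, R, t = similarity_align(P, Q)
```

After alignment, the remaining per-camera position differences are what the accuracy tables report.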
Figure 8. Result for the Piazza Bra dataset (Farenzena et al., 2009). The largest of five resulting models is displayed (287 of 329 images). Image connectivity cut-offs and a drop in reconstruction quality occur in areas with weak observation redundancy, as indicated by the colors in the lower image (blue, green, yellow and red for 2, 3, 4 and ≥5 observations). However, narrow baselines are handled successfully.

CONCLUSION
We have presented our approach to Structure from Motion which combines hierarchical reconstruction, global image orientation techniques and structureless bundle adjustment.
Though initialized with EXIF information only (and hierarchical processing turned off), the final position results achieved on the benchmark datasets are comparable to the results of other authors who initialized with the given camera parameters. Our rotation results are reasonably close to the ground truth, considering the different underlying camera models. The process successfully handles narrow baselines as well as collinear and forward looking scenarios. Since scene structure is not utilized in the majority of our approach, it is insensitive to noisy surface reconstructions or numerical problems induced by far distant points. The hierarchical progression and intermediate camera calibration help bridge areas of weaker connectivity. However, the overall robustness will remain subject to future work. Currently, our work focuses on reducing the computational effort by a model selection strategy, as a balancing of the hierarchy is not tackled at present. An inclusion of the absolute source model orientations in the adjustment, as well as local optimization techniques, could be beneficial to the convergence behaviour and may become topics of near future work.

Figure 1 .
Figure 1. Three cameras i, j and k casting rays r towards an observed object point (red). When scaled by factors λ, the rays should intersect in this point.

Figure 2 .
Figure 2. Sketch of a group of three models a, b and c. Dots indicate cameras. Bases shared by different models are marked by colour. Relative rotations and scales between the models are indicated by arrows.

Figure 5 .
Figure 5. Partial reconstruction of an image based mobile mapping dataset (first 156 images, Cavegn et al., 2016). The approach is able to handle collinear and forward looking camera constellations.

Figure 6 .
Figure 6. Top: Result for a UAV flight over a farm (185 images) computed with intermediate camera calibration. Bottom: Result with camera calibration activated only during the final adjustment. An area with weak point connectivity causes a part of the reconstruction to drift off.

Figure 7 .
Figure 7. Partial reconstruction results for 20 images of a granite relief captured with an industrial camera. The lens used causes a radial distortion of ~250 pixels in the image corners at a sensor size of 1600×1200 pixels. Top: Result with intermediate camera calibration. Bottom: Result with camera calibration activated only during the final adjustment. The adjustment converged before the lens distortion effects could be fully compensated.