A NEW APPROACH FOR AN INCREMENTAL ORIENTATION OF MICRO-UAV IMAGE SEQUENCES

Civil applications for small size unmanned aerial vehicles (UAV) have become quite important in recent years and so have accurate orientation and navigation of these devices in unknown terrain. In this work we focus on on-line compatible positioning in facade observation based on monocular low resolution still images acquired by a camera mounted on a UAV. Also, a 3D point cloud of the facade is generated. This allows further processing steps, e. g. navigation assistance, collision avoidance or the evaluation of the point cloud density, verifying completeness of the data. To be able to deal with the increasing amount of observations and unknown parameters we implement an incremental bundle adjustment based on automatically determined tie points and sliding image triplets. The tripletwise orientation allows for an efficient double cross-check of the detected feature points and hence guarantees reliable initial values for the nonlinear bundle adjustment. The initial values are estimated within a convex formulation delivering a sound basis for the incremental adjustment. Our algorithm is evaluated by means of imagery we took of the facade of the Welfenschloss in Hannover, captured from a manually flown Microdrones md4-200 micro-UAV. We compare the orientation results of our approach with an approach in which initial values for the unknown object coordinates are computed algebraically.


INTRODUCTION
Comprehensive information about facades is important e. g. for different fields like architecture, cultural heritage or the construction of realistic 3D city models.UAVs allow for an efficient observation of facades because of their high flexibility and thus fill the gap between terrestrial and aerial photogrammetry which are often constrained by an inappropriate observation angle.Nowadays high-performance stabilising devices as well as light weight imaging sensors allow a manual navigation of a small UAV through narrow streets or close to buildings.Since the complexity of a building is not always known in advance, the completeness of the collected data cannot be guaranteed and a potential lack of data is not revealed until postprocessing.In order to avoid missing relevant information, the user must be able to adapt the flight path during the capture, which requires among other results, an on-line orientation of the UAV and a sparse point cloud of the environment.
In this work we present a new incremental orientation approach based on image triplets, which is able to deliver the position and orientation of the UAV and a sparse point cloud of the observed object in near real-time.We use image triplets to increase the robustness of image matching because triplets allow a double cross-check on the consistency of the keypoint matches between all image pairs that can be formed of an image triplet.We use a combination of projective and perspective geometry.The former is used to obtain initial values for the first image triplet.The latter forms the foundation of an incremental bundle adjustment.In this way we avoid effects of over-parametrisation and furthermore are able to integrate results of a precalibration of the camera in a straightforward way.A previous version of this approach is also described in (Reich et al., 2013), this paper contains two important extensions: Firstly, the estimation of initial values for the unknown object coordinates of the homologous feature points is formulated as a convex optimisation problem.Hence the risk of getting stuck into a local minimum leading to a distortion of the adjustment can be eliminated.Secondly, in the new version we adaptively remove already oriented images from incremental adjustment based on the number of supporting object points measured in the related image to keep computation time constant.
The remainder of this paper is structured as follows.Section 2 gives an overview of work related to this topic.In section 3 the used hardware, namely the UAV and the camera are presented.In section 4 we describe the methodology of this approach with a focus on the convex formulation of three-image spatial intersection.Experiments and results are shown in section 5. Finally, section 6 summarises this work and gives an outlook into future prospects.

RELATED WORK
An extensive overview of UAV applications and experiments can be found in (Remondino et al., 2011).For our approach we need an on-line compatible implementation.The idea of on-line user assistance in UAV mapping is given in (Hoppe et al., 2012).Simultaneous georeferencing of images and 3D point cloud generation in real-or near real-time based on monocular imagery, which is also called simultaneous localisation and mapping (SLAM), has been in the focus of several publications: On-line orientation and dense-matching implementations based on projective geometry alone can be found in (Klein and Murray, 2009) and (Wendel et al., 2012), respectively.In these approaches a local and a global bundle adjustment optimise all unknown parameters in parallel to avoid drifting effects.In (Cesetti et al., 2011) a so called "visual global positioning system" is presented in which the position is estimated based on the matching of the acquired images with freely available georeferenced satellite imagery.A vision-aided navigation for UAVs based on a fusion of relatively oriented image pairs and data of an inertial measurement unit (IMU) is presented in (Wang et al., 2013).Steffen and Förstner (2008) describe the real-time orientation of a UAV for mapping purposes formulated in an unscented Kalman filter.The approach also deals with the drift problem in performing a loop closure based on points structured in a 3 dimensional octree.
The incremental bundle adjustment is comprehensively derived in (Beder and Steffen, 2008).Using an incremental bundle adjustment with a functional model different from ours, Meidow (2012) presents a loop closure approach for the purpose of stitching subsequent images.The idea of using triplets of images for the orientation of image sequences can be found in (Nistér, 2000).A hierarchy of trifocal tensors is used for the whole sequence without incremental estimation.The nature of the trifocal tensor and its estimation is described in a general context in (Hartley and Zisserman, 2000).The formulation of optimisation problems via a convex objective function is of increasing interest in research in many different disciplines.The entire derivation of its principles can be found in (Boyd and Vandenberghe, 2004).One of the first works integrating this technique into geometric reconstruction problems is (Hartley and Schaffalitzky, 2004).In (Ke and Kanade, 2007), the geometric interpretation of reconstruction problems is pointed out.Furthermore they propose an L1norm based reprojection error integrated into the objective function.The works of Olsson and Kahl (2010) and Kahl and Hartley (2008) describe the spatial intersection problem from three images and show that there is only one minimum in optimisation with the L∞-norm.
In our approach we combine projective and perspective geometry and implement an on-line compatible orientation within an incremental bundle adjustment.The initial values for the bundle adjustment are estimated using convex optimisation.

PRELIMINARIES
There are many types of UAVs.We use a Microdrones md4-200 micro-UAV1 , a vertical take-off and landing quadrocopter.It has a maximum payload of 300 g, which is enough for a high quality compact camera.We use a Canon PowerShot S1102 in combination with the Canon Hack Development Kit3 .This software allows to manipulate many relevant parameters including exposure time, sensor sensitivity and focal length.Furthermore it enables automatic capturing based on a fixed time interval.The camera is calibrated before the flight based on a five-parameter-distortion model, three parameters for radial and two parameters for tangential distortion, respectively (Laganière, 2011).
During the flight images are acquired with a time interval of two seconds.In the following we tackle the transmission of the imagery as a problem that is solvable but whose solution is not in the focus of this work.We perform the estimation in postprocessing, although our implementation allows for a near real-time orientation and point cloud computation.

METHODOLOGY
The orientation of the UAV is based on imagery captured by a precalibrated camera, though our implementation is able to estimate the parameters of the interior orientation and lens distortion during bundle adjustment for refinement.For each incoming image SIFT features are extracted and matched (Lowe, 2004), which are the basis for the subsequent orientation work-flow.This workflow can be divided into two steps.The first one deals with the initial image triplet and is explained in section 4.1.The second part covers the spatial resection of subsequent images and convex spatial intersection of feature points for the derivation of initial values (section 4.2) as well as the incremental bundle adjustment, in which the unknown parameters are refined (section 4.3).

Initial Image Triplet
The computation starts as soon as features of the three first images are available.Matching of feature points in image triplets is more robust than in image pairs.Based on a pair of matching feature points [x I , x II ] in two images [I I , I II ] one can perform a double cross-check, because for a triple [x I , x II , x III ] also [x I , x III ] and [x II , x III ] have to be a match.Nevertheless, erroneous matches generally still remain after double cross-check.Therefore, the estimation of the trifocal tensor, based on seven three-point correspondences, is carried out based on RANSAC (Fischler and Bolles, 1981).Inliers that support an estimation are found by a point transfer into the third image using the fundamental matrix of the first two images derived from the tensor.More details can be found in (Reich et al., 2013) and (Hartley and Zisserman, 2000, Chapter 15).
The first image triplet defines the local coordinate system of the whole block.A certain number of supporting matches has to be found to consider the estimation of the trifocal tensor as valid and to proceed with the following steps.Otherwise the first image is rejected and computation starts again with the three subsequent images.
When a valid initial image triplet is found, we derive the fundamental matrices of the first two images from the trifocal tensor and estimate the relative orientation of the two images based on the calibration data of the camera.The local coordinate system (XY Z) is defined as follows: The origin is located in the projection centre of the first image, the X and Y axes are parallel to the x and y axes in the image, respectively.The Z axis coincides with the negative viewing direction.The scale is given by the distance between the projection centres of the first and the second image.That base length is set to one.Initial values for the unknown object points are estimated by convex spatial intersection (see 4.2) and the relative orientation of the third image is computed by a RANSAC-based spatial resection (Kraus, 1997).Having initial values for all unknowns a robust bundle adjustment based on non-linear collinearity equations can be performed.Observations are iteratively re-weighted based on their residuals so that their contribution on the adjustment result can be controlled.In this way outliers in the observations can be detected and excluded from further estimation.

Convex Spatial Intersection
Subsequent images are added to the existing block based on spatial resection.First we perform a matching between the feature points of the new image and the ones of the previous images for which object coordinates have already been estimated.This allows us to compute the relative orientation of the new image analogously to the way described for the third image of the initial triplet in section 4.1.
In order to extend the existing 3D point cloud, we are interested in finding additional homologous points between the three images.The related 3D coordinates are to be estimated via incremental bundle adjustment, which needs appropriate initial values.These initial values are computed using convex spatial intersection.
Generally, the quality of a triangulated point is derived by its reprojection error in the images it is measured in: fres,i(X) is the residual function for each image i based on the three-dimensional parameter vector X, the object coordinates of the point to be intersected.It is the difference between the projected 3D point x proj and the observation xi in some norm m. c and (x0, y0) are the focal length and the principal point of the image, respectively, determined in camera calibration and considered to be constant.X0,i is the position of the projection centre and R T i the transposed rotation matrix of image i with r T i,j being its jth row.
In the case of two images taking the midpoint of the shortest line that connects the two image rays which, in fact, is an algebraic minimisation with no obvious geometrical or statistical meaning (Hartley and Schaffalitzky, 2004), may not lead to suitable results.It is shown in (Kahl and Hartley, 2008) that algebraic minimisation in special cases may result in very high reprojection errors.Olsson and Kahl (2010) show that the reprojection error as a function of X (equation ( 1)) is quasiconvex.Since quasiconvexity is not preserved under summation (Boyd and Vandenberghe, 2004), L2-minimisation (minimising the sum of squares) of the reprojection error does not represent a convex (or quasiconvex) optimisation.L2-minimisation of the spatial intersection problem with three images mathematically means finding the minimum of a 47th order polynomial (Stewenius et al., 2005).Kahl and Hartley (2008) and Olsson and Kahl (2010) illustrate the chance to get stuck in a local minimum in three-image spatial intersection based on L2-minimisation.
Therefore, we carry out an L∞-minimisation as proposed in (Hartley and Schaffalitzky, 2004) and (Olsson and Kahl, 2010).This allows a quasiconvex formulation of the three-image spatial intersection problem: Taking the pointwise maximum of the three quasiconvex residual functions preserves quasiconvexity (Olsson and Kahl, 2010).
We reformulate (3) to derive a quasiconvex formulation for which an effective solver can be generated.By substituting equation (2) into equation (1) we obtain (note that while equation 3 represents the L∞-norm, each reprojection error itself can be expressed in any norm m): This can be extended to: Now we can write equation ( 5) as with Equation ( 6) is a quasiconvex function on the set c T i X + di ≥ 0, implying that the point X is in front of the camera (Hartley and Schaffalitzky, 2004), (Kahl and Hartley, 2008).Equation ( 6) can geometrically be interpreted as a convex cone of norm m (Boyd and Vandenberghe, 2004) in front of the camera i with its apex in the projection centre.This cone intersects with the image plane defining some uncertainty measure (e. g. circle in case of the L2-norm and square in case of the L∞-norm).
Taking the pointwise maximum of equation ( 6) generally implies that the resulting function at some points is not differentiable anymore.Hence the minimum cannot be found by gradient based methods.Kahl and Hartley (2008) propose to write the minimisation of (6) as a convex feasibility problem find X with a fixed µ > 0. Note that a L2-norm is used for representing the reprojection error.The minimisation problem (7) can be geometrically thought of as finding the intersection of the three convex cones located at each of the projection centres of the three images.The size of the cones is controlled by the factor µ, which, in fact, is the radius of the intersecting circle in image plane.The problem ( 7) is feasible if µ is large enough so that the cones intersect (Ke and Kanade, 2007).Such a feasibility problem can be solved by a bisection algorithm (Olsson and Kahl, 2010), (Kahl and Hartley, 2008).
In our work we modify the feasibility problem (7) according to (Ke and Kanade, 2007) and reformulate it as a minimisation problem.µ is kept fixed and we try to minimise the distance to the cones defined by µ.

minimise subject to
AiX In this way the intersection of the three cones defined by µ may be empty.If this is the case the minimal distance * will be positive.If * ≤ 0 an intersection of the cones exists and the optimal point X * lies in that intersection.The implementation of the optimisation procedure is based on CVXGEN 4 (Mattingley and Boyd, 2012), which produces a C-based solver for convex optimisation problems.Note that in contrast to (Kahl and Hartley, 2008) we use a L∞-based reprojection error, because CVXGEN is for quadratic programs only and using the L2-norm would imply solving a second order cone program (Kahl and Hartley, 2008).
We are aware of the fact, that the L∞-norm is sensitive to outliers.Hence, after the optimisation the reprojection error of each point (in each image) is required to lie below a certain maximum value.

Incremental Bundle Adjustment
As we want to provide an on-line compatible orientation procedure, unknown parameters are estimated incrementally.For each incoming image there are six parameters for the exterior orientation, namely the position and the orientation of the camera, and three object coordinates for each point measured in the current image triplet for the first time, enlarging the set of parameters of the whole block.Besides, already estimated parameters that are related to the current image triplet, can be improved by incremental bundle adjustment.For each triplet, two different types of unknowns occur: The parameter vector p is split into two components, p1 and p2 for the first and second type of parameters, respectively.Similarly, we split the observation vector l into two components, l1 and l2.l1 contains all observations that were used in previous triplets, whereas l2 contains observations derived in the current image triplet.The incremental bundle adjustment can now be formulated as: As one can see in equation ( 9) the new observations l2 are related to both types of parameters p1 and p2 via the design matrices A21 and A22, respectively.Therefore, not only the new parameters can be estimated (p2) but also the previous parameters are improved (p1,+): with N −1 + being the inverse normal equation matrix of the incremental adjustment and P11 and P22 the weight matrices derived from the inverse covariance matrices of the observations l1 and l2, respectively.The dimensions of the matrices involved in the estimation equations, thus the normal equation matrix of the already estimated parameters and A21, do increase over time.At some point older images (and their orientation parameters plus points that have been measured in that image for the first time) have to be removed from the orientation process to keep the size of the matrices small enough for near real-time processing.For each incoming image a minimum number of points found in the oldest image involved in the estimation process and re-measured in the current image has to be present to keep that image in the estimation process.Otherwise this old image is eliminated.For a complete derivation of the incremental bundle adjustment the reader is referred to (Beder and Steffen, 2008) or (Reich et al., 2013).

EXPERIMENTS
The evaluation of the presented approach is based on imagery showing the facade of the Welfenschloss in Hannover.The images were taken by the camera and micro-UAV presented in section 3 with a fixed focal length and with a time interval of two seconds between neighbouring images.The UAV was manually controlled in front of the facade in several distances and heights.We ended up with three separate flights each consisting about 250 images.
The analysis of these image sequences is focused on two specific aims.First, we want to show the general performance of our algorithm.Hence, we show if the computation time per image is kept constant and the effects of older images removed from the estimation if they are not relevant any more.Second, we demonstrate the improvement of the convex formulation of the estimation of the initial values with respect to an algebraic estimation that was presented in (Reich et al., 2013).The following analysis is based on one image sequence consisting of 233 images.
As mentioned in section 4.3 the criterion for keeping an image in the estimation process is a minimal number of points found in that image which are re-measured in the current image.In our experiments we set that number to 20 points.Figure 1 shows the distribution of points with respect to the images they are measured in for the first part (43 images) of the image sequence.The number of images involved in one iteration of the incremental bundle adjustment varies between five and seven.This interval is somehow restricted since we limit the number of new observations connected to already estimated points to 50 due to reasons of computation time.Furthermore, we select all observations depending on their location in image space to achieve a well distributed set of observations.A maximum number of 200 new observations is involved into each incremental bundle adjustment.The computation time per image is kept constant and amounts to about ten seconds on a standard desktop computer in a nonoptimised implementation.This timing includes the computation of the whole covariance matrix of the parameters to be able to extract its precision.We estimate that an improvement by a factor of five is achievable, which delivers real-time results based on the time interval of two seconds between neighbouring images.
In figure 2 the point cloud and the orientation results of the whole image sequence are illustrated.One can see the uppermost image strip is captured in a rather oblique orientation so that also surface elements and trees are detected.Although the point cloud appears noisy it has to be said that the facade is quite rough with many decorative elements, window sills and balconies.Nevertheless, offset effects are present.The regions that are obviously affected either in scale or in translation are highlighted in red.The facade at that part of the building consists of a planar wall.
In a second experiment we compared the results of our approach with results for which the initial values of the object coordinates have been estimated using the shortest distance between the related rays (called 'algebraic approach' below).After each incremental bundle adjustment we extracted the standard deviations from the covariance matrix of the newly estimated parameters Σ p2 p2 .We computed the trace of that part of the matrix concerning the position of the projection centre and its three rotation angles, respectively, representing the sum of the standard deviations of the projection centre and that of the three rotation angles.Furthermore, we extracted the mean precision of all new object points after each incremental bundle adjustment iteration by averaging the trace of Σ p2 p2 concerning the object points.Figure 3 illustrates the mean standard deviations of the estimated posi- after each incremental bundle adjustment iteration for our approach (green/solid) and the algebraic approach (blue/dotted).tions, rotations and object point coordinates of the first part of the sequence after each incremental bundle adjustment iteration.The color and type of the curve decode the type of initial value computation.The solid green curve shows the standard deviations for our approach, whereas the dotted blue curve shows the standard deviations for the algebraic approach.It can be seen that the precision of the orientation parameters in the case of convexly estimated initial values is nearly always better than for the algebraic approach, in particular for the images around image 13.The precision of the object points is rather equal for both approaches except for a few images near image 26.The fact, that differences in precision between the two approaches are visible for different images in the case of figure 3(a) and 3(b) compared to figure 3(c), may be an indication that the initial values itself play a secondary role only for the solution of the incremental bundle adjustment.Nevertheless, the way of initial value computation causes another factor influencing the precision of the bundle adjustment results as there is the number and distribution in space of points used.In fact, the distribution of object points observed in image 13 is better in our approach (std.deviation with respect to its centre of gravity: 3.2 [baselengths]) than for the algebraic approach (2.8 [baselengths]).
Differences can also be seen in the point cloud.Figure 4 shows the resulting point clouds of the image sequence used above.The white point cloud results from our approach.A second point cloud estimated using the algebraic approach is depicted in colour where different colours depict different distances between the two point clouds.One can see that the observations and hence the estimated object points differ between the two approaches.The wall on the left side only exists in the result of our approach.As can be seen, the differences between the point clouds increase with each iteration (UAV flew from right to left).This can be explained by a rotation between the two solutions.

CONCLUSION
In this work we have presented a new incremental orientation approach of a micro-UAV based on triplets of images acquired at a certain time interval.We use convex estimation of the object coordinates as initial values for the incremental bundle adjustment.Furthermore, older images that become irrelevant for the orientation process are automatically removed from computation.We were able to show that our implementation estimates the orientation and a sparse point cloud of a whole flight consisting of hundreds of images with a constant computation time per image.In addition, we investigated the influence of different initial values for the unknown object point coordinates stemming from an algebraic computation and from our new approach.
However, there are several problems which need further investigation.Firstly, the comparison of the two approaches computing initial values for the bundle adjustment revealed differences in the estimation results.Whether the use of convexly optimised initial values itself or a deduced effect like the distribution of the object points in space is a significant reason for the enhanced quality of the results of our approach could not finally be verified.This will be part of our future work.Secondly, to be able to analyse the reliability and geometrical quality of our results we need a reference model with predefined and precisely measured points.Finally, the implementation is far from being applicable in realtime.But, irrespective of computation time, our approach is realtime compatible in terms of its incremental implementation.
1. unknowns which have already been estimated in previous triplets.These are (a) orientation parameters of the N − 1 already oriented images remaining in the incremental bundle adjustment (b) object coordinates of points that have been estimated in previous triplets and are used to tie the new image to the existing block (see beginning of section 4.2) 2. unknowns that are estimated for the first time, i. e.(a) orientation parameters of the new image (b) object coordinates of points that have been measured and triangulated in the current image triplet for the first time (see section 4.2).

Figure 1 :
Figure 1: Indices of the images in the sequence in which each object point was visible.

Figure 2 :
Figure 2: Point cloud and oriented images of an image sequence consisting of 233 images seen from above (first row) and from the front (second row).Regions that are highly affected by an offset are highlighted in red.

Figure 3 :
Figure 3: Mean standard deviations of the position X0 (a) and the rotation parameters [ω, φ, κ] of the projection centre (b) and the object points (c) in [baselengths] and [gon]after each incremental bundle adjustment iteration for our approach (green/solid) and the algebraic approach (blue/dotted).

Figure 4 :
Figure 4: Comparison of the two resulting point clouds using our approach (white) and algebraically estimated initial values (coloured, depending on the distance to the other point cloud).