CROSS-COVARIANCE ESTIMATION FOR EKF-BASED INERTIAL AIDED MONOCULAR SLAM

Repeated observation of several characteristically textured surface elements allows the reconstruction of the camera trajectory and a sparse point cloud which is often referred to as "map". The extended Kalman filter (EKF) is a popular method to address this problem, especially if real-time constraints have to be met. Inertial measurements as well as a parameterization of the state vector that conforms better to the linearity assumptions made by the EKF may be employed to reduce the impact of linearization errors. Therefore, we adopt an inertial-aided monocular SLAM approach where landmarks are parameterized in inverse depth w.r.t. the coordinate system in which they were observed for the first time. In this work we present a method to estimate the cross-covariances between landmarks which are introduced in the EKF state vector for the first time and the old filter state that can be applied in the special case at hand where each landmark is parameterized w.r.t. an individual coordinate system.


Motivation
Navigation in unknown environments is often hindered by the absence of external positioning information.In the context of pedestrian navigation for instance, the GPS signal may be temporarily lost or severely disturbed due to multipath effects in urban canyons.The need to cope with such situations has motivated the research in systems which are capable of building and maintaining a map of the environment while at the same time localizing themselves w.r.t. that map.This problem is commonly referred to as simultaneous localization and mapping (SLAM).For the particular application of pedestrian navigation, a promising approach is to combine a low-cost inertial measurement unit (IMU) and a camera to an inertial aided monocular SLAM system and perform sensor data fusion in an extended Kalman filter (EKF), cf.(Veth and Raquet, 2006).Here, characteristically textured surface elements serve as landmarks which can be observed by the camera to build up a map while the IMU's acceleration and angular rate measurements are integrated in order to obtain a short-time accurate prediction of the camera's pose and thereby help to reduce linearization error.

Related Work
An important aspect of such monocular SLAM systems is the representation of landmarks in the filter state vector.Montiel et al. have proposed an inverse depth parameterization of landmarks that conforms well to the linearity assumptions made by the EKF and offers the possibility of instantly introducing new landmarks in the filter state with only one observation (Montiel et al., 2006).In the original inverse depth parameterization six additional parameters are included in the filter state for each freshly introduced landmark.To alleviate the computational burden imposed by this over-parameterization, Civera et al. introduce a method to transform landmarks to Cartesian coordinates once their associated covariance has sufficiently collapsed (Civera et al., 2007).Alternatively, Pietzsch proposes to initialize bundles of landmarks and to estimate only the inverse distance to the origin of the coordinate system of the camera that observed the landmarks for the first time for each landmark individually and the position and orientation of the camera coordinate system for the whole bundle (Pietzsch, 2008).
The importance of the cross-covariance terms in SLAM is stressed in a seminal work by Dissanayake et al. (Dissanayake et al., 2001).Julier and Uhlmann present a detailed investigation of the consistency of EKF-SLAM implementations (Julier and Uhlmann, 2001).Therein it is shown that errors in the estimated cross-covariance terms due to linearization errors lead to inconsistent estimates.A comparison of several landmark parameterizations for monocular SLAM regarding their effects on filter consistency is provided by Solà (Solà, 2010).This work also gives a detailed description of landmark initialization in monocular SLAM.

Contribution
Our approach is to parameterize each landmark in inverse depth polar coordinates w.r.t. the coordinate system of the camera at the time of its first observation.Therefore, the camera's orientation and position as well as the parameters that describe the landmark's position in the camera coordinate frame have to be stored for each landmark.However, we regard the camera's position and orientation as fix model parameters and thus only include the three parameters which describe the landmark's uncertain position in the filter state, thereby avoiding overparameterizing the landmark's position.Since the camera's position and orientation are regarded as fix model parameters, the corresponding uncertainty estimate has to be conveyed to the landmark's uncertainty estimate.In addition, the cross-covariances between the new landmark and the landmarks already present in the filter state have to be computed.This is aggravated by the fact, that the landmark coordinates in the filter state are given with respect to distinct coordinate frames, which precludes the adaption of standard error propagation methods in this case.The main contribution of our work is a method to convey the uncertainty estimate from the camera to the landmark parameters and to determine the cross-covariances between the new landmark and the parameters already present in the filter state.
This section describes the EKF-SLAM approach employed in this work with an emphasis on the parameterization of landmarks.

Coordinate systems
The following coordinate systems are of particular interest for the derivation of the cross-covariances in sec.2.5.2.An overview is also given in fig. 1.
The body or IMU-coordinate system {b} that is aligned to the IMU's axes and therefore describes position and orientation of the whole sensor system.Its position and orientation are included in the filter state.
The camera coordinate system {c}.The camera's position and orientation are not part of the filter state.They can be calculated from the IMU's position and orientation by means of the camera-IMU transformation that only depends on the mechanical setup and is assumed to be fix and known in this work.
The navigation frame {n}.This is the reference frame whose x-and y-axis point north-and eastwards while its z-axis points in the direction of local gravity.We assume that the distance to the local reference frame is small compared to the curvature of the earth and therefore the direction of gravity can be considered constant during operation of the system.In this case the position of the navigation frame can be chosen arbitrarily.
Figure 1: Overview of the coordinate systems used in this work

State parameterization
The goal is to determine the position and orientation of the body frame w.r.t. the navigation frame and a sparse map of point landmarks.Hence, the EKF state vector comprises parameters which describe the IMU's motion and biases as well as the coordinates of observed landmarks: Where n p b and n v b are the IMU's position and velocity w.r.t. the navigation frame, the unit quaternion q n b represents the orientation and b a , b g are the sensor biases, which systematically disturb acceleration and angular rate measurements.For convenience, the landmark coordinates Y i are subsumed in the map vector m.Similarly, the part of the state vector that describes the motion of the IMU is denoted by s .In the following, estimated values are denoted with by a hat (•) and a tilde (•) is used to indicate the error, i.e. the deviation between a true value (•) and its estimate:

Error state formulation.
Since the EKF relies on a truncation of the Taylor series expansion of the measurement equation as well as the time update step after the first derivative, it can be regarded as an estimator for the state error s.This is the basis for the error state formulation of the EKF which is commonly used for GPS-INS integration, cf.(Farrell and Barth, 1999, pp. 199-222).Therefore, the covariance matrix associated with the filter state describes the distribution of s under the assumption that the errors follow a normal distribution.It is given by The error of the estimated orientation can be written in terms of the incremental orientation that aligns the estimated coordinate system with the unknown true coordinate system: Where * denotes quaternion multiplication.Using the camera coordinate frame {c k } as an anchor for the landmark therefore avoids over-parameterization in the filter state and should thus increase computational efficiency and stability during Kalman filter updates.In order to determine the position of a landmark in the reference coordinate system, the transformation from the anchor coordinate system to the reference frame is needed: In the above equation, the direction cosine matrix C n c k describes the orientation of of the anchor system and n p c k is its position.Y is a unit vector that points in the direction of the projection ray.
For every landmark C n c k and n p c k are therefore stored in a separate data structure because they are not part of the filter state although they vary for different landmarks.

Time update
2.3.1 Integration of inertial measurements.During the time update, the estimated state is propagated by integrating the inertial measurements.To integrate angular rate measurements ω, a quaternion that describes the incremental rotation in the body frame is formed from the angular rate measurements and subsequently used to update the orientation estimate: Where τ is the time interval between two consecutive inertial measurements.Acceleration measurements have to be transformed to the reference coordinate system before the gravitational acceleration can be subtracted.Then, the resulting estimate of acceleration is used to integrate velocity and position: 2.3.2Covariance propagation.The physical model described by equations 5-6 corresponds to the first order differential equation that describes the propagation of the estimation error s for the time dependent part of the state vector: Here, F is determined by the physical model and n summarizes the white noise terms.From F the discrete time transition matrix Φ is computed and used thereafter to calculate the propagated error state covariance matrix P t+τ as stated below: In the expression above, Q is the power spectral density matrix which characterizes the noise vector n.

Measurement update
New images are continuously triggered by the IMU as it moves along its trajectory.Therein the image coordinates of point features are extracted and tracked.The coordinates of all features extracted in one image are stacked together to form the measurement vector that is subsequently used to update the state vector.
2.4.1 Landmark observation model.The observation model describes the relationship between the observed image coordinates and the state vector entries.For this purpose, the estimate of each observed landmark in inverse depth polar coordinates w.r.t.its anchor system is first transformed to Cartesian coordinates w.r.t. the navigation frame as shown in eq. 4. Subsequently, the coordinates are transformed to the current camera coordinate system and projected on the image plane: Where z are the measured image coordinates, v is the zero mean measurement noise, and π(•) is the projection function.Eq. 11 describes the projection of one landmark.The Jacobian of h(s) w.r.t. the entries of the state vector for landmark no.i is: With the derivatives J Ψ , J p , and J Y of c X w.r.t. the orientation q n b , the position n p b , and the landmark Y. Similarly to the measurement vector, the Jacobian H for the whole set of measurements is obtained by stacking the measurement Jacobians for individual landmarks.
Given the measurement model derived in this section and the prediction from sec.2.3, an EKF update step can be performed as described in (Bar-Shalom et al., 2001, pp. 200-217).

Introduction of new landmarks
Landmarks are deleted from the filter state if they could not be observed in a predefined number of consecutive images.Whenever the number of landmarks in the state drops below a predetermined threshold, new landmarks have to be introduced.Since the standard formulation of the Kalman filter does not allow for a variable state size, the new filter state entries have to be estimated based on previous observations.The inverse depth polar coordinates for each new feature can be calculated based on the image coordinates of its last observation and the camera calibration by means of trigonometric functions: Where p x , p y are the measured image coordinates, k contains the intrinsic camera calibration parameters, ρ init is an arbitrarily chosen inverse depth measurement and f (•) is the back projection function, which projects to a point on the projection ray through the observed image coordinates.In the following, J f denotes the Jacobian of f w.r.t.p x , p y ,and ρ init .
A new landmark is introduced in the Kalman filter state by augmenting the state vector with the initial estimate of the landmark's position Y new : In addition the covariance matrix has to be augmented with the new cross-covariance terms and the covariance of the new landmark: that depends on one or more measurements z, the sensor system's position and orientation in s as well as on predefined parameters like ρ.Let J s , J z , and J ρ be the derivatives of g(•) w.r.t.s , z, and ρ.The sought-after covariance entries can then be approximated as follows (Solà, 2010): Where σ 2 ρ is the variance of the initial inverse depth estimate and R is the measurement noise covariance matrix.
2.5.2Proposed method.The scheme presented in the previous section is not directly applicable to landmark parameterizations as described in sec.2.2.2.In this case function f (•) from eq. 13 takes the role of g(•) in the above equations.The problem is, that f (•) does not depend on the system's position and orientation.Thus, the Jacobian J s and with it the cross-covariances P Ỹnew ,s , P Ỹnew , m become zero.As a result, the uncertainty of the position and orientation estimate of the sensor system will be neglected when eq. 17-19 are applied.Fig. 3 depicts the situation when a landmark's position estimate in Cartesian coordinates w.r.t. the reference frame {n} is computed based on an erroneous estimate { ĉk } of the camera's position and orientation.The blue arrows indicate the landmark position estimates in the true camera coordinate system {c k } and in the estimated camera coordinate system { ĉk } that are consistent with the observed feature locations.Note, that the vectors X, X expressed in camera coordinates are identical in case of a perfect measurement.The red arrow marks the correct landmark estimate in the erroneously estimated camera coordinate system {c k }.The key idea behind our approach is to express the landmark position error X = X − X (the dashed arrow) in the estimated camera coordinate frame in terms of the unknown transformation between the true and the estimated camera coordinate systems and to use this transformation to propagate the error from the camera coordinate system estimate to the landmark position estimate and to calculate the cross-covariance entries.
In the ensuing derivation, the orientation error model described in sec.2.2.1 will be used.The transformation between the true coordinate system {c k } and the estimated coordinate system { ĉk } can be written in terms of a rotation matrix C ĉk c k and the position ĉk p c k .The rotation matrix C ĉk c k depends on the Rodrigues vector that is defined in eq.3: Where Ĉc k n = C ĉk n = C c k n is the rotation matrix calculated from the estimated orientation of the sensor system and the IMU-camera calibration and Ψ ĉk c k is the orientation error expressed in the estimated camera coordinate system.With this an expression for the landmark error in Cartesian coordinates can be derived where the index k, which is used to mark the anchor coordinate system for a landmark, is omitted for brevity: In eq.22 the approximation c X ≈ ĉ X is used, thereby assuming that the main error is caused by the erroneous estimate of the coordinate system, cf.fig. 3. Using the small angle approximation from eq. 20, ĉ X can be written as a linear function of the errors of the estimated state: This is the sought relationship between the errors of the current orientation and position estimates for the sensor system and the error of the newly introduced landmark in Cartesian coordinates w.r.t. the current camera coordinate system.It only depends on entities, which are either estimated or known a-priori, like the the position of the camera in the body coordinate system.The partial derivatives of the landmark coordinates w.r.t. the IMU's position and orientation follow directly from eq. 23: Given the partial derivatives 24-25, a new landmark can be introduced to the filter state by following the subsequent steps: 1. Calculate the inverse depth polar coordinates Y and the Cartesian coordinates c k X for the new landmark given the observation and an arbitrarily chosen inverse depth estimate ρ.
2. Calculate the partial derivative ∂ Y/∂ c k X, which describes how the inverse depth polar coordinates vary with c k X.
3. Determine J s : 4. Use eqs.17-19 to calculate the missing covariance matrix entries and augment the filter state according to eqs. 14 and 15.

Experimental setup
The cross covariance estimation algorithm derived in sec.2.5.2 was compared against a naive landmark introduction method that simply discards the cross-correlations in a number of simulation runs.For the simulation a reference trajectory was defined by two C 2 splines.One spline determines the viewing direction while the other spline describes the position of the IMU.The second derivative of this spline provides acceleration measurements whereas angular rate measurements are derived from the differential rotation between two sampling points.In addition, image measurements were generated by projecting landmark positions onto the image plane.
Artificial white Gaussian noise was added to all measurements.Its variance was chosen to resemble the characteristics of a good tactical grade IMU with 0.47 • / √ h angular random walk and 0.0375m/(s √ h) velocity random walk parameters.The artificial noise added to the image coordinates had a standard deviation of 0.1 Pixel.Though the IMU measurement biases are modeled as random walk processes, their values stayed fix for the duration of the simulation.However, the biases were also estimated during the simulation runs, i.e. their initial covariance and their process noise power spectral density were initialized with realistic values.The state was initialized with the true values from the reference trajectory after a standstill period of 3.2 seconds.Deviating from the physical model described in sec.2.3.2 a small amount of pseudo-noise was added to the diagonal elements of the covariance matrix for the landmark position estimates.
The simulation provides ground truth for the position and orientation of the IMU but not for the estimated uncertainty (the covariance matrix).Therefore, the normalized estimation error squared (NEES) is used as a measure of filter consistency for the comparison of the two methods, cf.(Bar-Shalom et al., 2001, p. 165): It is also interesting to investigate the covariance bounds for position and orientation errors.Since no measurements are available that relate the sensor system to the navigation frame, aside from the implicitly measured gravity vector, the uncertainty of the position and yaw angle estimates should not fall below their initial estimates.

Results
Fig. 4 presents the results for a simulated trajectory that imitates a walk along a long hallway.During the simulation run, landmarks go out of sight and are replaced by newly initialized landmarks.A comparison of the NEES plots shows that estimating the crosscovariances with the proposed method indeed yields more consistent filter estimates.However, the initialization of new landmarks after 8,12, and 16 seconds goes along with a considerable drop in the uncertainty estimate and an increasing NEES.This is probably because the linearization points used to calculate the derivatives for cross-covariance estimation deviate increasingly from the ground truth during the simulation run.
By contrast, fig. 5 shows the evaluation of a trajectory around a cube.Here, the camera's principal axis always points in the direction of the cube so that all landmarks are visible during the whole run, i.e. the cube is completely transparent.Thus, landmarks are initialized only once at the beginning of the run when the filter is initialized with the true parameters from the ground truth.Apparently this results in considerable more consistent estimates.In particular, the uncertainty estimate never falls below its initial value when the proposed cross-covariance estimation method is applied.

CONCLUSIONS
In this work we study the effects of inter-landmark crosscovariance estimation for an EKF-based inertial aided monocular SLAM system.In particular, we describe how the crosscovariances between new features and existing filter state entries may be computed for the special case when the landmarks are parameterized w.r.t.coordinate systems whose position and orientation is also uncertain.This situation naturally arises when parameterizing features with inverse depth polar coordinates w.r.t. the camera in which they were observed for the first time.Using simulation runs, we show that neglecting the cross-covariances for freshly inserted features results in a systematic underestimation of the filter state uncertainty and that this effect may be mitigated with the proposed algorithm.
2.2.2 Landmark parameterization.The coordinate vector of the i-th landmark in the filter state Y i describes the position of the landmark in inverse depth polar coordinates w.r.t. the coordinate frame {c k } of the camera at the time when the landmark was observed for the first time as illustrated in fig.2.

Figure 2 :
Figure 2: Landmark parameterization in inverse depth polar coordinates.X is the Cartesian coordinate vector associated with the inverse depth parameterization Y = [α β ρ ], with elevation angle α, azimuth β and inverse depth ρ = 1/d, all w.r.t. the anchor system {c k }.

Figure 3 :
Figure 3: Position of a feature expressed in the true camera coordinate system c k and the estimated camera coordinate system ĉk .
Yaw angle and position covariance bounds

Figure 4 :
Figure 4: Hallway sequence.Results for one simulation run are shown in Fig. 4(a).Blue: Reference trajectory and reference landmark positions, Pink: Estimated landmark positions, Green: Estimated trajectory.Fig. 4(b) shows the NEES averaged over 25 Monte-Carlo simulation runs.Orange: With cross-covariance estimation for new landmarks.Green: without cross-covariance estimation.Notice the log scale in the NEES-plots.Fig. 4(c) compares the estimated covariance bounds for yaw angle and position estimates.Orange: With cross-covariance estimation for new landmarks.Green: without cross-covariance estimation.