INTEGRATING DEPTH AND IMAGE SEQUENCES FOR PLANETARY ROVER MAPPING USING RGB-D SENSOR

RGB-D camera allows the capture of depth and color information at high data rates, and this makes it possible and beneficial integrate depth and image sequences for planetary rover mapping. The proposed mapping method consists of three steps. First, the strict projection relationship among 3D space, depth data and visual texture data is established based on the imaging principle of RGB-D camera, then, an extended bundle adjustment (BA) based SLAM method with integrated 2D and 3D measurements is applied to the image network for high-precision pose estimation. Next, as the interior and exterior elements of RGB images sequence are available, dense matching is completed with the CMPMVS tool. Finally, according to the registration parameters after ICP, the 3D scene from RGB images can be registered to the 3D scene from depth images well, and the fused point cloud can be obtained. Experiment was performed in an outdoor field to simulate the lunar surface. The experimental results demonstrated the feasibility of the proposed method * Corresponding author.


INTRODUCTION
In planetary exploration rover missions, accurate localization of the rover and mapping of the surrounding terrain are essential for safe and efficient execution of exploration tasks, such as traverse planning, hazard avoidance and target approaching.Simultaneous localization and mapping (SLAM) is the key technology to create a reliable 3D map by estimating the camera pose accurately.Stereo vision based rover localization method, which is called visual odometry (VO), has been successfully applied in the 2003 Mars Exploration Rover (MER) mission to reduce position errors accumulated by dead-reckoning localization.In Chang'e-3 mission, cross-site visual localization is also applied for rover localization.However, rational SLAM always depend on images of stereo cameras, the method is almost invalid when rover located at poorly textured areas.
RGB-D camera, consisting of a RGB camera and a depth camera, is a new type of sensor that can directly and simultaneously obtain both visual texture information and perpixel depth information.It allows the capture of depth and color information at high data rates.Therefore, RGB-D camera has a natural advantage for visual-based localization and mapping.Recently, many researches are focused on dense mapping and simultaneous localization and mapping (SLAM) with RGB-D camera.(Dryanovski et al., 2013;Hu et al., 2012;Whelan et al., 2013Whelan et al., , 2015)).The Kinect-Fusion algorithm of Newcombe et al. (2011) is one of the most notable RGB-D-based 3D reconstruction systems of recent times.Huang et al. (2011) developed a RGB-D SLAM method in which sparse bundle adjustment (SBA) is used for global consistency by minimizing the matching errors of the visual FAST feature correspondences between frames.Henry et al. (2012) introduced a RGB-D Mapping framework that can generate dense 3D maps of indoor environments despite the limited depth precision and field of view provided by RGB-D cameras, the system utilizes a novel joint optimization algorithm combining visual features and shape-based alignment.Hu et al. (2012) proposed a switching based algorithm to heuristically choose between RGB-BA and RGBD-BA based local maps building, and a low cost and consistent optimization approach is used to join these maps.based on a novel GPU implementation of an existing RGB-D visual odometry algorithm.Dryanovski et al. (2013) presented a real-time visual odometry and mapping system for RGB-D cameras, which used a probabilistic framework to optimize the camera pose estimation.Whelan et al. (2013) presented a system for improved RGB-D camera pose tracking that yields high quality color surface models with low visual artifacts.Further, Whelan et al. (2015) presented a real-time dense SLAM system which makes use of a dense every-frame volumetric fusion frontend for camera pose estimation and surface reconstruction in combination with a non-rigid map deformation backend to correct the mapped dense surface upon loop closure.Moreover, a new method of RGB-D camera SLAM is proposed based on extended bundle adjustment with integrated 2D and 3D information on the basis of a new projection model (Di et al.,2016).
The remarkable advantages of these systems lie in the high mobility and low cost.However, RGB-D sensors have some significant drawbacks with respect to dense 3D mapping.These sensors only allow measurement ranges of a limited distance and a limited field of view, making the point cloud noisy and loss of details.This may cause details of far range loss due to lack of the depth information needed to constrain ICP (iterative closest point) alignments.To determine how the overlap of several infrared beams affects the tracked position of the user, Oliver et al. (2016) studied the precision of collected data using multiple RGBD sensors.Tang et al. (2016) presented a novel approach to integrate the depth scene and RGB scene geometrically to enlarge the measurement distance of RGB-D sensors and enrich the details of model generated from depth images.To enhance the quality of depth efficiently, Lee et al. (2017) proposed a method which consists of an image segmentation algorithm to extract object regions and a weighted linear combination of spatial filtering algorithms.
In planetary exploration, the complex conditions of illumination and surface texture may make the traditional methods which only depend on visual feature tracking infeasible.In this paper, we proposed a method based on RGB-D camera by integrating depth and image sequences for planetary rover mapping.Results of field experiment is given to verify the accuracy and effectiveness of this new method.
The rest of this paper is structured as follows: Section 2 presents and specifies the proposed method; Experimental result is presented in Section 3. Finally, conclusions and suggestions for future work are given in Section 4.

METHOD
The flowchart of our method is shown in Figure 1.First, the strict projection relationship among 3D space, depth data and RGB data is established based on the imaging principle of RGB-D camera.As there exist holes in depth images, depth images are enhanced.And based on a precise calibration for both of IR and RGB cameras, the relationship between the depth and RGB camera are obtained.Then, an extended bundle adjustment (BA) based SLAM method with integrated 2D and 3D measurements is applied to the image network for highprecision pose estimation, the method takes depth measurements as independent observations and integrates them with image measurements through the projection model and error model.Next, as the interior and exterior elements of RGB images sequence are available, dense matching is completed with CMPMVS tool, and the dense model can be generated.Finally, according to the registration parameters after ICP, the 3D scene from RGB images can be registered to the 3D scene from depth images well, and the fused point cloud can be obtained.

Camera calibration
In order to obtain mapping results accurately, the RGB image and the simultaneously acquired depth image should be registered.As there exist errors of several pixels of the register function given in the Microsoft Kinect SDK, precise camera calibration is performed by the camera calibration model of Di et al.(2016), the lens distortion model can be represented by Equation(1) based on the pinhole camera model where (k1, k2, k3, p1, p2)= lens distortion coefficients (δ x , δ y ) =the camera distortion along x direction and y direction, (x d , y d ) = the original image coordinates of an image point (x,y) = the image coordinates after distortion correction.Coordinate systems o-xyz and o r -x r y r z r of the depth camera and the RGB camera are defined such that the z(z r ) axis points along the optical axis, the x(x r ) axis is horizontal to the right (perpendicular to the plane z = 0), and the y(y r ) axis forms the right-hand coordinate system, a 3×3 rotation matrix R and a translation vector T(X S ,Y S , Z S ) are used to represent the relative relationship between the RGB camera and the depth camera.

Depth enhancement
Supposing that there is an object point (X, Y, Z), the projected point on the depth image is (x D , y D ) the depth value obtained by depth camera is Taking the depth camera coordinate system as the reference coordinate system, the projected points on the RGB image is (x R , y R ), so that the imaging geometric models of RGB camera is: where (f Dx , f Dy ) = the focal lengths of the depth camera ( f Rx , f Ry ) = the focal lengths of the RGB camera (x0D, y0D) = principal points of the depth camera (x0R, y0R) = principal points of the RGB camera R ij (i, j, =1, 2, 3) = the elements of the rotation matrix R of the RGB camera with respect to the depth camera Based on equations (1), ( 2) and (3), the rotation matrix R and the translation vector T can be determined through camera calibration.
The Kinect cameras are calibrated together using a planer checkerboard.Based on images taken from different distances and orientations, Camera Calibration Toolbox for Matlab(http://www.vision.caltech.edu/bouguetj/calib_doc/) is used to complete the calibration with the images.The calibration results include external and internal parameters, lens distortion coefficients as well as the relative pose between RGB camera and IR camera.

Depth data enhancement
The depth values returned by the Kinect suffer from noise, and sometimes are absent for some points depending on the reflectance properties of the scene.Therefore the depth map will be enhanced.
Assuming that at regions around the depth holes, there are valid depth values to fill the holes.A depth value histogram of neighboring pixels of depth holes is computed and the dominant peaks are detected in the histogram.The proposed method assumes that a valid depth value to fill a depth hole corresponds to a dominant peak that is the most distant depth value among multiple peaks.The detail of depth data enhancement are shown as follows: Step 1: Depth holes are detected to make a binary map, B that indicates pixels with unknown depth information as 0 (black).
Step 2: 8-connectivity is used to label depth holes.To decide a neighboring region of depth holes, validation region V is generated of region label map using morphological operation.
Step 3 : The histogram of valid regions of each labeled region is computed, dominant peaks are experimentally chosen as peaks that have pixels more than 10% of the total number of pixels in the region.
Step4 : An average value of dominant depth values is used as threshold value, a median of depth values which are greater than the threshold is used to fill depth holes.
The depth hole filling is a block-wise process.A uniform depth value on invalid region may cause natural artifact.Therefore, we filter the initial depth map by a depth color cross-bilateral filter to remove the artifact and refine the initial depth map.The final depth map is filtered as where f(p, q) = domain term that measures the closeness of the pixels, p and q g(.) = an depth range term that measures the pixel similarity of the modeled depth map h(.) = an intensity term that measures the intensity similarity.
Ωp =denotes the spatial neighborhood of position p.

Extended bundle adjustment
Bundle adjustment of the image network, which is the technique of refining a visual reconstruction to produce jointly optimal 3D structure and orientation parameters estimated by using accurate projection model, statistical error models and well-developed quality control methodology [35,36], is used to optimize the initial exterior orientation result.Constructing the projection model and error model of RGB-D camera is the key to achieve optimal estimation.
In order to calculate the initial exterior orientation calculation, firstly 2D and 3D feature points are automatically extracted and matched between consecutive frames to build a continuous image network.2D features are detected in the registered images and 3D features are re-projected to the registered images, so that all the 2D features and 3D features have the image plane coordinates values and the depth measurement values, and matching errors are detected using RANSAC.These matched feature points are used as tie points to link consecutive frames to build a continuous image network.Then based on the 2D and 3D features, the rotation matrix R and translation vector T between the first frame and next frame are recovered through by SVD.
The projection model of a RGB-D camera represents the relationship of an object point in the real world and its measurements in the RGB-D images.There are two types of measurements: image coordinates from the RGB-image and depth values from the depth image.The camera pose is R and T= [Xs, Ys, Zs] which express the relationship between the world coordinate system and the local camera coordinate system For an object point p=[X,Y,Z], its image plane coordinates is (x,y) and the depth value is d.
According to the collinearity equation model, the following equation is obtained (5) where (x 0 ,y 0 ) = the coordinate of principal point (f x , f y ) = the focal length of the depth camera.
In the projection model, the relationship between the measurement data and the unknowns is nonlinear.The linearized equation can be represented as: Rewrite equation ( 6) into a matrix form: where P = the weight matrix of the image plane coordinates and the depth measurement values.
The weight of the measurement data is inversely proportional to the variance of its measurement accuracy.The measurement accuracy of the image plane coordinates depends on the matching accuracy of the SIFT keypoints which is up to subpixel level (0.3 pixel is taken in this research).The depth measurement accuracy can be computed by the equation given by Di et al. (2016).
Through the above steps, the error model of RGB-D camera is built and the camera pose of each frame can be refined by least squares solution of Equation ( 7).

Dense mapping with sequence images
After bundle adjustment, the point cloud from depth data can be merged.However, the depth sensors capture depth information based on the concept of structured light pattern and time-offlight, and the measurement is highly related to the material and structure of objects.There would be some holes when modeling objects with smooth surfaces or low reflection certain materials or scene structures which do not reflect infra-red (IR) light.Moreover, the point cloud at far range are of low accuracy.
The sequence images can be used to for dense matching with CMPMVS tool.Camera positions derived from Bundler are used as input, CMVS then decomposes overlapping input images into subsets or clusters of manageable size, while PMVS is used to reconstruct 3-D data from these individual clusters (Furukawa and Ponce, 2007).The core idea of PMVS can be understood as the following steps: Firstly, the feature points in each image are extracted using Harris + DoG operator, and the feature point pairs are matched.Under the condition that the epipolar geometry constraint is satisfied, the triangulation method is adopted.A rectangular patch centered on a threedimensional space point is generated.The patch is called a seed patch with a normal vector as a specific direction; then two similar characteristics of the adjacent point cloud patch are used as conditions (normal vector and space point location ), step by step to diffuse and reconstruct the similarity point clouds of the seed point cloud piece; finally, after the diffusion is completed, a filtering operation is required to remove the error patch that does not meet the consistency.The result of this processing is a significant increase in point density;

Point cloud matching
Two points pertaining to a point cloud taken by Kinect can have a minimal distance of about a few millimeters.It is acceptable to lose some precision in order to reduce all point clouds' size, reducing considerably both computational time and memory usage.Firstly, the number of points is down sampled by using Voxel Grid.It divides a point cloud in boxes ("voxels") with the desired width.Then, all points within a box are reduced to an unique point corresponding to their centroid.
Point clouds taken from Kinect can have measurement errors that create sparse outliers.Such points can lead to errors while estimating local point clouds features such as surface normals.
The outliers are removed as follows: first, mean distance from each point to all its neighbors is calculated.Next, assuming a Gaussian distribution, with a mean and a standard deviation, all points having their mean distances outside a certain interval are considered as outliers and are removed from the dataset.In our work, 30 neighbours are used for each point to analyse its status, removing all points having a distance larger than 1 standard deviation of the mean distance to the analysed point.
We use the iterative closest point (ICP) algorithm (Besl and McKay, 1992) for registration of point cloud by PMVS to depth point cloud.As represented in Equation ( 8), the goal of the ICP algorithm is to minimize the sum of square errors with respect to the PMVS points and the corresponding closest depth points.In each iteration step, the algorithm selects the closest points as the correspondences and re-calculates the rotation and translation parameters (R, t) of the rigid transformation to minimize the equation Where m i = coordinates of the RGB dense points d i = coordinates of the depth points t = a 3*1 vector that describes the translations between the two datasets, R = a 3*3 matrix that describes the rotations between the two datasets.

EXPERIMENT
To verify the actual performance of the proposed method, a field experiment has been performed.Figure 2   Figure 3 shows some typical images acquired by the RGB camera and the depth camera.

CONCLUSION
In this paper ,we proposed a method based on RGB-D camera by integrating depth and image sequences for planetary rover mapping.We focused on verifying mapping ability of RGB-D camera onboard a rover that could be used in a GPS-denied environment such as lunar and Martian surface.In outdoor environment with sunlight, the depth sensor may have weak performances, hole filling strategy and cross-bilateral filter were used for depth enhancement.Based on the extended bundle adjustment accurate pose estimation results are refined.The 3D scene from RGB images with PMVS tools were registered to the 3D scene from depth images well after ICP.Our method provides an effective way to enlarge mapping distance of a RGB-D camera.Further research will be performed to improve the computational efficiency of the method.And more data will be used for validation of the new method.

Figure 1 .
Figure 1.Flowchart of the integrating mapping method shows the moving platform (model rover) used in these experiments.A Microsoft Kinect V2 camera which has a RGB image resolution of 1920*1080 pixels and a horizontal field of view of 70 degrees was attached on the top of the camera mast.The camera is about 100cm above the ground.

Figure 2 .
Figure 2. The RGB-D camera mounted on the moving platformExperiment was performed in an outdoor field to simulate the lunar surface.The rover travelled along a loop path with the origin set at [0,0].The same image was used for the first and last positions to ensure that the true last camera pose was exactly the same as where the first image was recorded.Given that the loop is closed, we can use it to evaluate the accuracy.Figure3shows some typical images acquired by the RGB camera and the depth camera.

Figure 3 .
Figure 3.Typical RGB images Figure 4(b) shows depth images after enhancement, compared to the original depth images in Figure 4(a), the proposed method uses depth distributions of neighboring regions of depth holes.Due to some invalid areas, there still exist holes in enhanced images.