KINECT V2 AND RGB STEREO CAMERAS INTEGRATION FOR DEPTH MAP ENHANCEMENT

Today range cameras are widespread low-cost sensors based on two different principles of operation: Structured Light (SL) range cameras (Kinect v1, Structure Sensor, ...) and Time of Flight (ToF) range cameras (Kinect v2, ...). Both types are easy-to-use 3D scanners, able to reconstruct dense point clouds at a high frame rate. However, the depth maps obtained are often noisy and not accurate enough, so it is generally essential to improve their quality. Standard RGB cameras can be a valuable solution to this issue. The aim of this paper is therefore to evaluate the feasibility of integrating these two different 3D modelling techniques, which are characterized by complementary features and based on standard low-cost sensors. For this purpose, a 3D model of a DUPLO bricks construction was reconstructed both with the Kinect v2 range camera and by processing one stereo pair acquired with a Canon EOS 1200D DSLR camera. The scale of the photogrammetric model was retrieved from the coordinates measured by the Kinect v2. The preliminary results are encouraging and show that the foreseen integration could lead to a higher metric accuracy and a greater level of completeness than those obtained by using the two techniques separately.


INTRODUCTION
Nowadays 3D modelling is a subject of great interest in very different fields, from industry to robotics, from cultural heritage to medicine. Thus, it is essential to find instruments providing accurate, dense and low-cost 3D data, at high frame rates and in a user-friendly way.
To satisfy these requirements, a single method is not sufficient. For this reason, in this work we present the results of a preliminary test, performed to evaluate the integration feasibility of two different 3D modelling techniques: range imaging and traditional stereo vision. In fact, the integration between these two methods can offer many advantages since they are characterized by complementary features.
On the one side, range cameras are low-cost and easy-to-use imaging sensors, able to measure distances of the scanned scene at a high frame rate. They can be used as 3D scanners to easily reconstruct dense point clouds practically in real time. However, these sensors show issues with transparent and very reflective surfaces. Furthermore, the depth maps obtained are generally noisy and the finer details of the resulting 3D models are often smoothed.
On the other side, stereo vision is an established technique, but the resulting 3D models are mostly incomplete in low-texture regions. In addition, the processing needs an external scale that the user must provide, and it is often computationally onerous and time consuming. Therefore the leading idea, which will be developed in future works, is that a preliminary depth map of the investigated object can be obtained in real time through a low-cost range camera. This depth map will be employed as a coarse 3D model for classical stereo processing, which will add the details coming from the stereo images acquired through standard cameras.
In detail, the coarse depth map acquired by the range camera will be the geometrical constraint for the subsequent Semi Global Matching (SGM) algorithm that will compute the stereo disparity map. In this way the efficiency of the dense matching algorithm will be increased.
This paper is organized as follows. Section 2 gives a short review of the state of the art; Section 3 describes the preliminary test, whose results are discussed in Section 4. Finally, some conclusions are outlined.
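The idea of using the coarse depth map as a geometrical constraint for dense matching can be illustrated with a minimal sketch. It is not the implementation planned for future works: it assumes a rectified stereo pair with known focal length and baseline, and a coarse depth map already resampled into the geometry of the left stereo image. The function name and the `rel_tol` parameter (a relative depth uncertainty) are hypothetical. Each coarse depth Z is converted into a disparity interval through d = f · B / Z, which a matcher could then use to narrow its per-pixel search.

```python
import numpy as np

def disparity_search_range(depth_m, focal_px, baseline_m, rel_tol=0.1):
    """Per-pixel disparity interval [d_min, d_max] derived from a coarse depth map.

    depth_m    : coarse depth in metres (0 or NaN where no range data is available)
    focal_px   : focal length of the rectified stereo pair in pixels (assumed known)
    baseline_m : stereo baseline in metres (assumed known)
    rel_tol    : hypothetical relative depth uncertainty, must be in (0, 1)
    """
    depth = np.asarray(depth_m, dtype=float)
    valid = np.isfinite(depth) & (depth > 0)
    d_min = np.full(depth.shape, np.nan)
    d_max = np.full(depth.shape, np.nan)
    # Rectified stereo: disparity = f * B / Z, so a larger depth gives a smaller disparity.
    d_min[valid] = focal_px * baseline_m / (depth[valid] * (1.0 + rel_tol))
    d_max[valid] = focal_px * baseline_m / (depth[valid] * (1.0 - rel_tol))
    return d_min, d_max, valid
```

Pixels without a valid coarse depth keep an undefined (NaN) interval, so a matcher would fall back to its full disparity range there.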

RELATED WORKS
The integration of products from range and RGB cameras has been investigated for several years and a substantial literature is available (Nair et al., 2013), mainly considering mid- to high-cost professional sensors. Our goal here is to reconsider the methodological results already obtained in the light of the newly available low-cost sensors, which can be integrated in a flexible solution. Hereafter we summarize the most relevant achievements of the previous works:
• (Zhu et al., 2008) combines a professional ToF sensor with two CCD cameras, introducing a method for improving the range camera manufacturer calibration
• (Yilmaz and Karakus, 2013) proposes a system formed by a professional three stereo camera and a low-cost SL range camera and suggests an accuracy improvement of the resulting 3D model through a stereo visual odometry integration
• (Evangelidis et al., 2015) presents a high-resolution stereo matching algorithm guided by low-resolution depth data, which helps the algorithm to compensate for its difficulty in estimating disparities over weakly textured areas

PRESENTATION AND DISCUSSION OF THE PRELIMINARY TEST
For the purposes of this study, a 3D model of a DUPLO™ bricks construction was reconstructed both with the Kinect v2 range camera and by processing one stereo pair acquired with a Canon EOS 1200D DSLR camera. The two 3D models were then fused, obtaining the integrated model.
In particular, the Kinect v2 is a ToF range camera: it evaluates the distance from the scene by measuring the time of flight that an infrared wave takes to travel from the sensor to the scanned scene and back. We developed our own software tool to download the 3D data with the Microsoft SDK, and the model point cloud was reconstructed from the depth map (see Figure 1) acquired in a single frame, since the final aim is near real-time integration.
[...] to eight points collimated on both images (see Figure 2).
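The reconstruction of a point cloud from a single depth map can be sketched with the standard pinhole back-projection. This is only an illustration and not the software tool developed for this test: the function name is hypothetical, the intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are assumed to come from a calibration of the depth camera, and depth values are assumed to be in metres and measured along the optical axis.

```python
import numpy as np

def depth_to_point_cloud(depth_m, fx, fy, cx, cy):
    """Back-project a depth map into an (N, 3) point cloud in the camera frame.

    depth_m        : depth along the optical axis in metres (0 or NaN = no measurement)
    fx, fy, cx, cy : pinhole intrinsics of the depth camera in pixels (assumed known)
    """
    depth = np.asarray(depth_m, dtype=float)
    rows, cols = depth.shape
    # u is the column index and v the row index of each pixel.
    u, v = np.meshgrid(np.arange(cols), np.arange(rows))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack((x, y, z))
```

Pixels without a measurement are simply skipped, so the resulting cloud contains one point per valid depth pixel, listed in row-major order.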
The fusion of the Kinect v2 and the photogrammetric models was performed through the CloudCompare (Girardeau-Montaut, 2016) 3D point cloud and mesh processing software. Since the two point clouds were already in the same reference system, the co-registration was only refined using the Iterative Closest Point (ICP) algorithm (Besl and McKay, 1992), which estimated the parameters of the residual roto-translation (with scale) transformation.
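The roto-translation with scale estimated during the refinement can be illustrated with a short sketch. It does not reproduce CloudCompare's ICP: it shows only the well-known SVD-based closed-form estimation of a similarity transformation from point pairs that are already matched, which is the kind of step an ICP scheme repeats after each closest-point search. The function name is hypothetical.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares scale s, rotation R and translation t with dst ≈ s * R @ src + t.

    src, dst : (N, 3) arrays of corresponding points (correspondences assumed known)
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    # Cross-covariance between the centred target and source points.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Guard against a reflection so that R is a proper rotation.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    scale = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t
```

With noise-free correspondences the function recovers the transformation exactly; in a full ICP loop the pairs would be re-estimated by nearest-neighbour search at every iteration.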

RESULTS
To assess the metric quality of the results obtained, both the integrated model and the models reconstructed with each single 3D modelling technique were compared with the reference model of the DUPLO bricks construction.
The dimensions of the bricks were measured with a vernier caliper and the reference mesh model (see Figure 3) was reconstructed with standard CAD software.
In particular, precision and accuracy were evaluated in terms of signed distances (positive inside and negative outside the reference mesh surface) of the 3D model points from the reference mesh. The model reconstructed by the Kinect v2 is less accurate and less precise than the photogrammetric one: the mean and the standard deviation of the distances from the reference mesh model reach 0.004 m and 0.015 m respectively. The details of the bricks are generally less recognizable (see Figure 4(c)) and the model shows some inaccuracies (flying pixels) on the edges of the DUPLO construction, where there are high depth variations.
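The distance statistics described above can be sketched as follows. Since a general point-to-mesh distance requires a mesh library, this illustration replaces the reference mesh with a single axis-aligned box, which is a simplifying assumption and not the CAD model used in the test. The function names are hypothetical; the sign convention follows the text (positive inside, negative outside the reference surface).

```python
import numpy as np

def signed_distance_to_box(points, box_min, box_max):
    """Signed distance of (N, 3) points from an axis-aligned box surface.

    Positive inside and negative outside, as in the convention used in the text.
    The box is a stand-in for the reference mesh (assumption for illustration).
    """
    p = np.asarray(points, dtype=float)
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)
    center = (box_min + box_max) / 2.0
    half = (box_max - box_min) / 2.0
    q = np.abs(p - center) - half
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)  # > 0 only outside the box
    inside = np.minimum(q.max(axis=1), 0.0)               # < 0 only inside the box
    return -(outside + inside)

def distance_statistics(distances):
    """Mean (accuracy indicator) and standard deviation (precision indicator)."""
    d = np.asarray(distances, dtype=float)
    return d.mean(), d.std()
```

The mean of the signed distances indicates a systematic offset of the model from the reference, while the standard deviation describes the dispersion of the points around it.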
Finally, the integrated model (see Figure 4(e)) preserves the accuracy and the precision of the photogrammetric model, but it also presents the greatest level of completeness, provided by the contribution of the Kinect v2 sensor.

CONCLUSIONS AND FURTHER DEVELOPMENTS
Our preliminary results are encouraging and show that this integrated approach leads to a higher metric accuracy of the final 3D model than that obtained by using only a range camera, and to a higher level of completeness than that obtained by processing only a stereo image pair.
Future works will investigate in depth the effects of the distance from the scanned object and will automate the processing procedure by implementing the method previously described.

Figure 1: Depth map acquired by the Kinect v2 range camera

Table 1: Distance statistics

The photogrammetric model is the most accurate and precise, as the distance statistics show (see Table 1); it is reported in Figure 4(a). The borders between the bricks are reconstructed very well thanks to the high texture variation, but the model is incomplete in the areas with uniform texture (single bricks).