INTEGRATION OF DEPTH MAPS FROM ARCORE TO PROCESS POINT CLOUDS IN REAL TIME ON A SMARTPHONE

ABSTRACT: Real-world three-dimensional reconstruction is a subject of long-standing interest in computer vision. Many tools have emerged in recent years to accurately perceive the surrounding world, either through active sensors or through passive algorithmic methods. With the advent and popularization of augmented reality on smartphones, new visualization issues have emerged concerning the virtual experience. In particular, a 3D model seems essential to provide more realistic AR effects, including occlusion consistency, shadow mapping or even collisions between virtual objects and the real environment. However, due to the heavy computational load of most current approaches, these algorithms typically run on desktop computers or high-end smartphones. Indeed, the reconstruction scale is rapidly limited by both computational and memory complexity. Therefore, our study aims to find a relevant method to reconstruct close-range outdoor scenes, such as cultural heritage sites or underground infrastructures, in real time and locally on a smartphone.


INTRODUCTION
Real-world three-dimensional reconstruction is a subject of long-standing interest in computer vision. It has become even more prominent with the advent and popularization of autonomous cars and augmented reality. In particular, a 3D model seems essential to provide more realistic AR effects, including occlusion consistency, shadow mapping or even collisions between virtual objects and the real environment. This opened up the need for real-time scanning at scale with continuous integration of the accumulated 3D data.
Initially, reconstructions were created by fusing depth measurements from specific active sensors, such as structured light, time of flight (Izadi et al., 2011) or LiDAR (Light Detection and Ranging), into 3D models. Although these sensors provide accurate results, their high cost and the need for adequate equipment make them less attractive. Therefore, multiple approaches have emerged to reconstruct a scene based on monocular (Yang et al., 2011), binocular or multi-view stereo methods that predict depth maps from RGB images only. A depth map is an image whose pixels contain not colour information but a distance from a given viewpoint to the objects in the scene (Figure 1). This data describes a pseudo-reconstruction of space that does not directly represent a surface, but rather a sample of discrete values that allows understanding the geometry of the environment captured from a specific camera position.
Most of the existing solutions rely on massive data exchange and cloud computing because the smartphone alone cannot provide sufficient computational performance. Therefore, the challenge of our research is to devise a new approach for producing outdoor 3D models from a smartphone-based acquisition, with the full processing performed on the device, for the SYSLOR Company. We aim to reach real-time productivity in order to provide a continuous overview of the 3D reconstruction during the recording. The interest of this visualisation is to continuously assist the user during the survey by providing confidence regarding both the quality of the acquisition and the exhaustiveness of the data produced.
The algorithmic procedure has thus induced a succession of constraints to be taken into account during the developments:
- Ensuring a computational reliability of the process efficiently supported by the performance offered by current smartphones,
- Enabling optimization of the computational process to achieve real-time productivity,
- Ensuring accessibility of the acquisition for everyone.

STATE OF THE ART OF EXISTING SOLUTIONS
Before presenting how reconstruction can be processed, we need to take a closer look at scene representation. As depicted by Seitz et al. (2006), it is possible to differentiate the reconstruction approaches in terms of scene representation, since the geometry of an object can easily be described according to:
- A voxel grid (the 3D equivalent of a pixel), used for its simplicity and its ability to approximate any surface,
- A polygonal mesh (usually triangular), very popular for its simplicity of storage and convenience of data visualization,
- A depth map, which avoids the need to resample the 3D geometry.
More generally, the reconstruction approaches can be subdivided into four families according to how they are undertaken:
1. Extraction and mapping of feature points, used to fit a surface to the generated features,
2. Calculation of a cost function associated with a 3D volume, then extraction of a surface from it,
3. Iterative evolution of a surface to minimize a cost function,
4. Computation of a set of depth maps.

2.5D reconstruction: depth maps
In the absence of specific depth sensors, depth data can come from the study of a colorimetric similarity, also called a photo-consistency criterion, between the pixels of a stereoscopic pair. A first well-known approach is the planesweep algorithm (Collins, 1996). The idea of this algorithm is to back-project the whole image onto successive virtual planes, perpendicular to the axis of projection and spaced according to a certain sampling step in 3D space. For each virtual plane, each pixel is assigned a depth value derived from a similarity computed between a reference view and the adjacent viewing position (Figure 2). In particular, Muratov et al. (2016) reused and improved this method to generate a high-resolution mesh within minutes on a smartphone. The nature of the photo-consistency indicator determining the depth value has a significant impact on the relevance of the results in the case of changes in illumination or edge effects, as indicated by Mari Molas (2017).
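To make the procedure concrete, here is a minimal plane-sweep sketch in Python with OpenCV. It is an illustration under simplifying assumptions (fronto-parallel planes, known intrinsics K and relative pose R, t between the two greyscale views); the function names and the SAD window size are ours, not those of any published implementation.

import numpy as np
import cv2

def plane_sweep(ref_gray, src_gray, K, R, t, depths, win=5):
    """Per-pixel depth by minimising a windowed SAD cost over depth planes."""
    h, w = ref_gray.shape
    best_cost = np.full((h, w), np.inf, dtype=np.float32)
    best_depth = np.zeros((h, w), dtype=np.float32)
    n = np.array([[0.0, 0.0, 1.0]])              # fronto-parallel plane normal
    K_inv = np.linalg.inv(K)
    box = np.ones((win, win), np.float32) / win ** 2
    ref = ref_gray.astype(np.float32)
    for d in depths:
        # Homography induced by the plane n.X = d in the reference camera
        # frame, mapping reference pixels to source pixels (R, t: ref -> src).
        H = K @ (R + t.reshape(3, 1) @ n / d) @ K_inv
        # WARP_INVERSE_MAP: H is read as the destination -> source mapping.
        warped = cv2.warpPerspective(src_gray, H, (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        # SAD photo-consistency aggregated over a small window.
        cost = cv2.filter2D(np.abs(ref - warped.astype(np.float32)), -1, box)
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_depth[better] = d
    return best_depth

# Typical call: depth planes sampled between the near and far planes, e.g.
# depth_map = plane_sweep(img0, img1, K, R, t, np.linspace(0.5, 5.0, 64))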

Figure 2. Planesweep method. Source: (Graber, 2012)

More recently, depth maps have found a relevant use in augmented reality applications, especially to handle occlusion effects between the real world and virtual contents (Valentin et al., 2019). In the Depth API provided by ARCore, depth maps are generated from the hybrid optimization of two algorithms: PatchMatch (Bleyer et al., 2011) and HashMatch (Fanello et al., 2017). On the basis that an image is made up of regions of constant depth, the idea is to alternate between generating an isolated depth value and propagating it to neighbouring pixels (Figure 3).
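The following toy sketch illustrates this alternation of hypothesis generation and propagation. It is a deliberately simplified, hypothetical rendition: the cost function is a stand-in for a real photo-consistency measure, and the actual PatchMatch/HashMatch optimization used by ARCore is far more elaborate.

import numpy as np

def patchmatch_pass(depth, cost, cost_fn):
    """One top-left to bottom-right sweep of propagation + random search."""
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            # Propagation: reuse the left/top neighbours' depth hypotheses,
            # exploiting the assumption of piecewise-constant depth regions.
            for ny, nx in ((y, x - 1), (y - 1, x)):
                if ny >= 0 and nx >= 0:
                    c = cost_fn(depth[ny, nx], y, x)
                    if c < cost[y, x]:
                        depth[y, x], cost[y, x] = depth[ny, nx], c
            # Random search: locally perturb the current hypothesis.
            cand = depth[y, x] + np.random.uniform(-0.05, 0.05)
            c = cost_fn(cand, y, x)
            if c < cost[y, x]:
                depth[y, x], cost[y, x] = cand, c
    return depth, cost

# Toy usage: the "true" scene is a plane at 2 m and the cost is a stand-in
# for a real photo-consistency measure between the stereo views.
true_depth = np.full((60, 80), 2.0)
cost_fn = lambda d, y, x: abs(d - true_depth[y, x])
depth = np.random.uniform(0.5, 5.0, true_depth.shape)   # random initialisation
cost = np.abs(depth - true_depth)
patchmatch_pass(depth, cost, cost_fn)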
Finally, the popularization of machine learning has led to the emergence of new solutions for creating depth maps, notably by integrating convolutional neural networks (CNN) such as MVSNet (Yao et al., 2018) or DORN.

Depth maps fusion
Usually, to fully reconstruct a 3D model, the accumulated depth maps need to be gathered and combined. The integration of all these new data makes it possible to continuously update the model and can be approached from different perspectives.
A first one is to represent the model as a surfel (surface element) cloud. Adopted by Kolev et al. (2014), this representation seems more adequate for interactive applications running in real time than triangular meshes (Piazza et al., 2018). Indeed, the unstructured set of surfels can easily be kept consistent. A very interesting alternative is based on Truncated Signed Distance Functions (TSDF). Inspired by Zach (2008), the idea is to convert the depth maps into a Signed Distance Field (SDF) where pixels are projected as voxels with a value in [-1;1]. This value describes the distance of the voxel to the true surface of the object. This volumetric fusion of maps, taken up by Izadi et al. (2011) in KinectFusion and by Graber (2012), is interesting since it offers a relatively robust management of outliers while guaranteeing an efficient update of the model.
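As an illustration of the volumetric fusion principle, here is a condensed TSDF integration sketch. The grid layout, the camera-to-world pose convention and the truncation distance are illustrative assumptions, not the exact choices of KinectFusion.

import numpy as np

def integrate_depth(tsdf, weight, depth, K, cam_to_world, voxel_size, trunc=0.05):
    """Update a TSDF voxel grid (values in [-1, 1]) with one depth map."""
    nx, ny, nz = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    # Voxel centres in world coordinates (grid anchored at the origin).
    pts = np.stack([ii, jj, kk], axis=-1).reshape(-1, 3) * voxel_size
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    cam = (pts - t) @ R                        # world -> camera (R orthonormal)
    z = cam[:, 2]
    zs = np.maximum(z, 1e-6)                   # avoid division by zero
    u = np.round(K[0, 0] * cam[:, 0] / zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / zs + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    # Truncated signed distance: positive in front of the observed surface.
    sdf = np.clip((d - z) / trunc, -1.0, 1.0)
    upd = valid & (d > 0) & (d - z > -trunc)   # skip voxels far behind the surface
    tsdf_f, w_f = tsdf.reshape(-1), weight.reshape(-1)
    # Weighted running average: robust to outliers, cheap to update.
    tsdf_f[upd] = (tsdf_f[upd] * w_f[upd] + sdf[upd]) / (w_f[upd] + 1.0)
    w_f[upd] += 1.0
    return tsdf, weight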
Then, the geometry of the model can be extracted using known methods such as raycasting (Graber, 2012; Ondruska et al., 2015) or marching cubes (Lorensen and Cline, 1987). The raycasting approach is generally preferred over the marching cubes method because the latter relies on a binary decision to determine the surface position; as a result, the geometry of the 3D model may contain false positives, i.e. non-existent artefacts, as well as false negatives, i.e. poorly extracted features. On the other hand, the marching cubes method has the advantage of directly providing a mesh model, whereas raycasting yields a point cloud.
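A minimal raycasting sketch could look as follows: each viewing ray is marched through the TSDF until the sign changes, and the zero-crossing is interpolated linearly, producing one surface point per pixel (hence a point cloud rather than a mesh). The sampling function and step size are placeholders; origin and direction are numpy 3-vectors.

def raycast_pixel(tsdf_at, origin, direction, near, far, step):
    """March one ray; tsdf_at(p) returns the TSDF value at world point p."""
    prev_t = near
    prev_v = tsdf_at(origin + near * direction)
    t = near + step
    while t < far:
        v = tsdf_at(origin + t * direction)
        if prev_v > 0.0 >= v:                  # sign change: surface crossed
            # Linear interpolation of the zero-crossing between the samples.
            t_hit = prev_t + step * prev_v / (prev_v - v)
            return origin + t_hit * direction
        prev_t, prev_v = t, v
        t += step
    return None                                # no surface along this ray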

3D volume creation
Rather than going through an intermediate representation of the data, it is possible to build a volumetric geometry directly. From CNNs constructed as graphs, Pixel2Mesh and GEOMetrics (Smith et al., 2019) extract spatial deformations that shape a predefined mesh model to converge to the captured geometry. Pixel2Mesh uses graph unpooling layers to refine the facets of the mesh by uniform oversampling. In GEOMetrics, the mesh is refined only in target areas, which allows a better adaptation to the local complexity of an object. Nevertheless, these two methods start from an initial mesh model, either a sphere or an ellipsoid, which directly induces topological biases.
To bypass this specific constraint, Gkioxari et al. (2019) developed Mesh R-CNN. The 3D mesh structure is obtained indirectly by passing through an intermediate description, namely a voxel occupancy grid. This coarse description of the object is transformed into a triangular mesh using an operation called cubify. Based on the voxel occupancy probabilities within the grid, each voxel is substituted by a triangulated cubic mesh with 8 vertices, 18 edges and 12 faces. Then, the vertices and edges common to adjacent entities are merged, while shared interior faces are eliminated.
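To illustrate the principle, here is a hypothetical re-implementation of the cubify idea written from the description above (the real operation lives in the Mesh R-CNN codebase): thresholded voxels become cubes, interior faces shared by two occupied voxels are dropped, and vertices are merged through a lookup table.

import numpy as np

# The six axis-aligned neighbour offsets and, for each, the four corner
# offsets of the cube face oriented towards that neighbour.
FACES = {
    ( 1, 0, 0): [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)],
    (-1, 0, 0): [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)],
    ( 0, 1, 0): [(0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)],
    ( 0,-1, 0): [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)],
    ( 0, 0, 1): [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)],
    ( 0, 0,-1): [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)],
}

def cubify(prob, thresh=0.5):
    """Voxel occupancy probabilities -> (vertices, triangles)."""
    occ = prob > thresh
    verts, faces, index = [], [], {}

    def vid(v):
        # Vertices shared between adjacent cubes are merged via this lookup.
        if v not in index:
            index[v] = len(verts)
            verts.append(v)
        return index[v]

    for x, y, z in zip(*np.nonzero(occ)):
        for (dx, dy, dz), corners in FACES.items():
            nb = (x + dx, y + dy, z + dz)
            in_grid = all(0 <= nb[i] < occ.shape[i] for i in range(3))
            if in_grid and occ[nb]:
                continue                       # shared interior face: dropped
            quad = [vid((x + ox, y + oy, z + oz)) for ox, oy, oz in corners]
            faces.append((quad[0], quad[1], quad[2]))  # two triangles per face
            faces.append((quad[0], quad[2], quad[3]))
    return np.array(verts, float), np.array(faces, int)

# Sanity check: an isolated voxel yields 8 vertices, 18 edges, 12 triangles.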

Comparison of reconstruction methods
The specifications defined within our problem statement enabled us to classify the various methods. Some of them have already been implemented on older-generation smartphones (Ondruska et al., 2015; Muratov et al., 2016) or latest-generation smartphones (Valentin et al., 2019). This observation confirms that 3D processing can potentially be embedded on this type of device. Although most of the presented methods based on deep learning represent the concrete future of real-time reconstruction in computer vision, they are still too computationally demanding for direct implementation on a smartphone.
We have thus confronted the identified methods with our specifications (Table 1). The results provided are the outcome of tests we undertook and, for many of them, of an investigation based on the remarks reported and cross-checked between the various authors cited. During the study, we noticed in particular that depth maps are relevant data to bypass the triangulation steps related to collinearity equations while being an interesting trade-off between computational complexity and effective results. In particular, we highlighted the relevance of the planesweep method implemented on smartphones by Muratov et al. (2016) and of the ARCore Depth API (Valentin et al., 2019) for generating accurate depth maps. Therefore, these methods seem very interesting for the development of our solution. Concerning data fusion, we prefer the method using surfels (Schöps et al., 2017), as it seems quite relevant for an interactive application while being within reach of our objectives from an algorithmic point of view.

Planesweep algorithm
The planesweep method presented by Collins (1996) operates locally to estimate the photo-consistency between two adjacent views. Its advantage is that it produces a result relatively quickly while remaining robust to incongruous changes in brightness and to large baselines. The indicator used to measure photo-consistency directly influences the calculation of colorimetric homogeneity and therefore affects the overall quality of the depth maps.
We tested several indicators, very well described by Mari Molas (2017), in order to determine the relevance of the method as a whole. Among those studied (SAD: Sum of Absolute Differences, SSD: Sum of Squared Differences, NCC: Normalized Cross Correlation, Census transform, Rank transform), the one that seems the most relevant, owing to a favourable ratio between accuracy of results and computation time, is the SAD indicator (Figure 4). Indeed, it describes the photographed environment most efficiently, with a minimum error of 13.1 cm. This result was obtained by comparing the computed depth map to a ground truth map obtained from a well-known benchmark: fountain-P11 (Strecha et al., 2008).
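For reference, three of these indicators can be sketched as follows for two greyscale patches p and q of equal size; these are our own minimal formulations, not Mari Molas' exact protocol. Note that NCC is a similarity, to be maximized, whereas SAD and SSD are costs, to be minimized.

import numpy as np

def sad(p, q):
    """Sum of Absolute Differences: the cheapest indicator, hence its
    favourable accuracy / computation-time ratio."""
    return np.abs(p - q).sum()

def ssd(p, q):
    """Sum of Squared Differences: penalises large deviations more strongly."""
    return ((p - q) ** 2).sum()

def ncc(p, q):
    """Normalized Cross Correlation: invariant to affine illumination
    changes, but noticeably costlier to evaluate."""
    p0, q0 = p - p.mean(), q - q.mean()
    return (p0 * q0).sum() / (np.sqrt((p0 ** 2).sum() * (q0 ** 2).sum()) + 1e-12)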
After testing the influence of the parameters composing the planesweep, it appears that this method is quite legitimate for generalized multi-view reconstruction applications. However, given that it is based on a purely sequential computational procedure, it suffers from computational latency. Therefore, we discarded this solution and turned to a more localized stereo-correspondence approach that is simpler to implement, namely the API integrated in Google's ARCore application.

ARCore Depth API
The development of mobile technologies has allowed the emergence and democratization of augmented reality applications. Google has notably developed an API integrating measurements of depth maps generated from a monocular camera (Valentin et al., 2019) at a rate of 30 Hz. In order to understand how these depth maps are generated, it is necessary to look at the overall working principle of ARCore. The platform is based on three fundamental concepts: motion tracking, environment understanding and light estimation. Using an approach similar to conventional SLAM, feature points are detected along the movement to locate the smartphone in space. Using a methodology similar to RANSAC, these points are used to determine average planes, ensuring an accurate understanding of the surrounding environment.
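The plane-detection step can be illustrated with a compact RANSAC sketch over the tracked feature points; the iteration count and inlier threshold are illustrative values, not ARCore's internal parameters.

import numpy as np

def ransac_plane(points, iters=200, thresh=0.02, seed=0):
    """points: (N, 3) feature points; returns (normal, d, inlier mask)."""
    rng = np.random.default_rng(seed)
    best_n, best_d = None, None
    best_in = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                           # degenerate (collinear) sample
        n /= norm
        d = n @ p0                             # plane equation: n . x = d
        inliers = np.abs(points @ n - d) < thresh
        if inliers.sum() > best_in.sum():
            best_n, best_d, best_in = n, d, inliers
    return best_n, best_d, best_in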
To use the ARCore Depth API wisely, it is important to take into account the following factors that can influence the quality of a recording:
- Illumination conditions,
- Potential reflection or scattering phenomena specific to the material,
- Poorly textured scenes.

Figure 5 shows the analysis of the data produced by ARCore. The first case was performed in a relatively poorly textured environment. The differences in depth observed are of the order of ten centimeters. This is interesting, but it highlights the limitations of a solution that relies solely on visual colorimetric information to calculate the different distances. The second case represents a more realistic scene with varying depths. We can see right away that the results are better than in the first case; in particular, the difference in depth is estimated at about 8 cm. It is important to understand that these results depend on the quality of the movement performed, so a redundancy of views will provide more accurate results than a single view, as shown in Figure 5.

ARCore applies a smoothing process to its depth maps in order to express a value for all the pixels of an image. This means that the depth gradient is quite small, implying that discontinuities will appear around the edges of objects, causing a slight difference in accuracy within the comparison. To get rid of these distorted values, we set up a post-processing of the data consisting in the elaboration of a discontinuity mask based on a Canny detection (Canny, 1986), followed by mathematical morphology operators to remove a maximum of uninteresting pixels. The size of the kernel is chosen so that only the dominant contours are kept in the image, as sketched below.
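This sketch of the post-processing assumes OpenCV; the Canny thresholds and kernel size are illustrative values, not our exact production settings.

import numpy as np
import cv2

def discontinuity_mask(depth, kernel_size=5):
    """Return a boolean mask, True where the depth value is kept."""
    # Normalise the metric depth to 8 bits, as cv2.Canny expects uint8 input.
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(d8, 50, 150)             # dominant depth contours only
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size,) * 2)
    grown = cv2.dilate(edges, kernel)          # widen the discarded band
    return grown == 0

depth = np.random.rand(120, 160).astype(np.float32)     # stand-in depth map
filtered = np.where(discontinuity_mask(depth), depth, 0.0)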
Finally, the API provided by ARCore meets our expectations more favourably, in terms of both the quality of the results and the rate of data production. This method has therefore been selected for our developments.

3D RECONSTRUCTION METHODOLOGY
ARCore can be used in the reconstruction process. Nevertheless, Manni et al. (2021) decided to bypass the classical reconstruction process by using an object classification method to determine the most similar synthetic object from a database.
The contribution of depth maps alongside image data, resulting in so-called RGB-D data, opens the way to the creation of 3D models. Thanks to all the research carried out, we were able to design an optimal processing chain that converts RGB-D data to point clouds based on the intrinsic parameters of the camera. Figure 6 illustrates some point clouds reconstructed with our method.
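The heart of this chain is the classical pinhole back-projection: a pixel (u, v) with depth z maps to x = (u - cx) z / fx and y = (v - cy) z / fy. A minimal sketch follows, with assumed intrinsic values for illustration; the RGB values of the corresponding image pixels can be attached to the points in the same pass.

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """(H, W) metric depth map -> (N, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                 # drop pixels without depth

# Usage with a 160x120 ARCore-style map and assumed intrinsics:
depth = np.random.uniform(0.5, 4.0, (120, 160)).astype(np.float32)
cloud = depth_to_points(depth, fx=130.0, fy=130.0, cx=80.0, cy=60.0)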

Characteristics of point clouds
The characteristics of the produced point clouds are directly influenced by the resolution of the RGB-D images. Initially, the resolution of the depth maps produced by ARCore is 160x120 pixels to ensure high productivity. Given that each pixel in the image produces a single 3D point, this leads to a sparse point cloud of 19200 points with a spacing of about 2 cm (Figure 7a). Therefore, we increased the size of the depth maps to 320x240 and 640x480 pixels using linear interpolation to oversample the data. This operation increases the amount of data from 19200 to 76800 and 307200 points respectively, while providing a better density of points, with a spacing reduced to 1 cm and 0.5 cm respectively. Even if more data implies more calculations, a simple parallelization of the projection is enough to guarantee real-time productivity. Moreover, visually, a higher resolution makes sense when we wish to capture finer details of the recorded scene (Figure 7b-7c).
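A sketch of this oversampling with bilinear interpolation; the resolutions match the figures above, the rest is illustrative.

import numpy as np
import cv2

depth = np.random.uniform(0.5, 4.0, (120, 160)).astype(np.float32)   # 19200 px
up2 = cv2.resize(depth, (320, 240), interpolation=cv2.INTER_LINEAR)  # 76800 px
up4 = cv2.resize(depth, (640, 480), interpolation=cv2.INTER_LINEAR)  # 307200 px
# Before projecting (see depth_to_points above), fx, fy, cx, cy must be
# scaled by the same factor so the oversampled pixels land on correct rays.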

Analysis of results obtained
We built a proof of concept that can be generalized to several environments. Our solution has been designed to digitize all elements within a maximum range of 4 m. Beyond a certain range, the depth data provided by ARCore become less accurate: as one moves away from the scene or the object, the depth values become more and more inaccurate. In particular, the ARCore documentation states that the depth estimation error increases quadratically with distance.
Therefore, to quantify the quality of the models resulting from our method, we conducted comparisons with a ground truth built with Agisoft Metashape. The results are quite interesting and encouraging, since we obtain an average deviation of 6-7 cm (Figure 8). The accuracy of our point clouds is directly linked to the factors influencing depth estimation, without forgetting the acquisition procedure. Indeed, to provide a good estimation, solid motion tracking at an adapted speed is required so that the algorithms can correctly understand the surrounding environment. Moreover, since the depth data result from colorimetric similarity between the pixels of a stereoscopic pair, deformations can be visible on the facade or the ground. As a first sketch of the result, the latter is quite promising, especially since the generation of the models is done almost in real time.
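For context, a cloud-to-cloud deviation of this kind is typically computed as the mean nearest-neighbour distance to the ground truth; here is a sketch assuming SciPy, given as an illustration of the principle rather than our exact comparison protocol.

import numpy as np
from scipy.spatial import cKDTree

def mean_deviation(ours, ground_truth):
    """ours, ground_truth: (N, 3) and (M, 3) point arrays in metres."""
    dists, _ = cKDTree(ground_truth).query(ours)  # nearest neighbour per point
    return dists.mean()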

Computational bottleneck
Producing 3D models through cloud processing induces data exchanges that can be time-consuming. Therefore, after conducting the acquisition, users often have to wait a certain amount of time to get feedback. By relying on a simple but effective methodology, we designed a proof of concept that ensures a permanent monitoring of the reconstruction with reasonable computation time.
At the time we present these results, our solution has not yet been fully integrated on the smartphone due to code conversion issues. Therefore, computation times are given for a computer using a minimal amount of CPU (Central Processing Unit) and without the GPU (Graphics Processing Unit). In Table 2, we provide an overview of the average computation time for creating a 3D model from a single view according to different input data resolutions. The results are very promising because we are able to produce very efficiently complex data consisting of several hundred thousand points, all in a fraction of a second. In comparison, a basic photogrammetry process can take several minutes to provide the corresponding model. Even if progress in automation has reduced production times, the results of our method are still produced more quickly and require less computing power.
During our tests, we also compared computation times while including the post-processing of the depth maps provided by ARCore (Table 2). As already mentioned, these depth maps are smoothed to provide data over the whole scene, but this induces flying points when we compute the corresponding 3D model (Figure 9). These particular points are outliers that damage the visual representation of the scene. The process described to remove these points leads to an increase in calculation time of about 20%. Therefore, further research has to be conducted to find alternatives that suppress these outliers without increasing the computation time as much. However, in the case of more accurate depth data, we can easily imagine doing without this process.

RGB-D data resolution            Low      Medium   High
w/o depth map post-processing    0.004s   0.02s    0.08s
w/ depth map post-processing     0.02s    0.09s    0.4s

Table 2. Computation time of our method for a single view

CONCLUSION
Designing a proof of concept is not an easy thing to do, especially when significant material constraints are imposed. It implies finding a computational approach simple enough while being as effective as possible compared to a reference method. For our study, we had three major constraints on which our developments had to rely:
- Ensuring a computational reliability of the process efficiently supported by the performance offered by current smartphones,
- Enabling optimization of the computational process to achieve real-time productivity,
- Ensuring accessibility of the acquisition for everyone.
Thanks to all the research and tests carried out, we ended up with a first processing chain that works quite well. We are able to create dense point clouds with an accuracy of 7 cm for a single view, which can be further refined by the addition of more surrounding views. Even if the depth maps generated by ARCore can be noisy in the presence of scattering surfaces or around edges, we still managed to develop a very satisfactory method that provides quite encouraging results of about 8 cm. The initial objectives of the project led by the SYSLOR Company have been well met, since we are able to provide results very quickly with limited computing power compared to usual methods such as photogrammetry, and to provide relevant information about the completeness and conformity of the acquisition. Although real time has not yet been concretely reached, since the implementation of the algorithm has not been completely finished on the smartphone, the initial results are very promising.

FUTURE WORKS
Our ambitions go beyond the simple ephemeral visualization found in most AR applications integrating reconstruction methodologies. We want to establish a 3D model used to concretely map the environment around us, and more specifically underground infrastructures. For this, we wish to integrate extrinsic elements into our methodology, in particular GNSS data from a low-cost antenna developed by Quentin et al. (2019) within the SYSLOR company, for permanent geolocalized restitution.