Moving objects aware sensor mesh fusion for indoor reconstruction from a couple of 2D lidar scans

Abstract: Indoor mapping has attracted increasing attention with the development of 2D and 3D cameras and Lidar sensors. Lidar systems can provide very high resolution and accurate point clouds. When aiming to reconstruct the static part of the scene, moving objects should be detected and removed, which can prove challenging. This paper proposes a generic method to merge meshes produced from Lidar data that tackles the issues of moving object removal and static scene reconstruction at once. The method is adapted to a platform collecting point clouds from two Lidar sensors with different scan directions, which results in different qualities. Firstly, a mesh is efficiently produced from each sensor by exploiting its natural topology. Secondly, a visibility analysis is performed to handle occlusions (due to varying viewpoints) and remove moving objects. Then, a boolean optimization selects which triangles should be removed from each mesh. Finally, a stitching method is used to connect the selected mesh pieces. Our method is demonstrated on a Navvis M3 (2D laser ranger system) dataset and compared with Poisson and Delaunay based reconstruction methods.


INTRODUCTION
3D reconstruction from images (Schops et al., 2017) and Lidar (Berger et al., 2017) is a widely researched topic in the photogrammetry and computer vision communities. With the development of sensor devices, 3D reconstruction is applied to various scenes, both outdoors (Musialski et al., 2013) and indoors (Huitl et al., 2012). During data collection, moving objects may be present in the scene (people indoors, pedestrians and vehicles outdoors). For robot applications, these moving objects should be detected so that the robot can adapt its behaviour to its surroundings during scene mapping. For 3D reconstruction purposes, these objects should be removed so that the 3D model represents only the static part of the scene (Jiang et al., 2017a). In both cases, motion analysis is a mandatory preprocessing step.

Previous Works
For indoor static scene reconstruction, the related work can be categorized into two research areas: moving object analysis and 3D reconstruction. Moving object analysis can be based on images (Tron, Vidal, 2007) or Lidar point clouds (Schauer, Nüchter, 2018). As this paper addresses Lidar-based indoor reconstruction, we focus our state of the art on Lidar-based methods.

Moving Objects Analysis
The methods can be divided into two groups:
(1) Motion flow relies on ICP (Iterative Closest Point) using point correspondences to analyze the velocity of moving objects (Pomerleau et al., 2014). Simultaneous localization and mapping with moving object tracking is mainly used in robot scanning (Wang, Thorpe, 2002). A 3D flow field can be computed to analyze motions during ICP (Jiang et al., 2017b), and dynamic objects are then detected by flow clustering.
(2) Visibility analysis (Underwood et al., 2013) uses the sensor information to remove the objects that are volumetrically inconsistent between scans (objects from one scan traversed by rays from another scan). The inputs are two point clouds acquired at different times. The Dempster-Shafer theory can be used to improve the combination of volumetric information from the scans (Xiao et al., 2015). Several point-based data structures such as voxel grids (Andreasson et al., 2007; Schauer, Nüchter, 2018) and OctoMap (Gehrung et al., 2019) have been proposed to improve the performance of the ray tracing method.
Our method falls in the second category as it leverages visibility information to detect 3D volumetric changes. However, while the methods mentioned above are based on point clouds, which contain samples of the continuous surface of the scene, our method handles meshes, providing a continuous representation.

3D Reconstruction
There are many 3D reconstruction methods for static scenes (Berger et al., 2017). They can be categorised as:
(1) Implicit methods: the reconstructed surface is represented as the zero set of a function defined in space, whose values are positive outside and negative inside the solid scene. This guarantees the watertightness of the resulting surface. A triangle mesh discretizing this zero set can then be generated, usually using the Marching Cubes method (Lorensen, Cline, 1987). The most widely used is Poisson surface reconstruction (Kazhdan et al., 2006), which aligns the gradient of the function with normals computed from the point cloud. The truncated signed distance function (TSDF) is another successful implicit method that processes RGB-D datasets (Newcombe et al., 2011). An elastic registration has been proposed to improve the performance of TSDF (Zhou et al., 2013). An extensive evaluation of TSDF on multi-line Lidar point clouds is presented in (Roldão et al., 2019).

Figure 1. Overview of the pipeline. Step 1 (top left): generate a sensor mesh from each sensor (Sections 3.1 and 3.2). Step 2 (bottom left): detect and remove inconsistent objects based on a combination of distance and visibility (Section 3.3). Step 3 (bottom right): select triangles with a boolean optimization framework (Section 3.4.1). Step 4 (top right): stitch the selected mesh pieces (Section 3.4.2).
(2) Explicit methods usually use local information to estimate the surface and produce a watertight or non-watertight surface (it might have boundaries). In (Ryde et al., 2013), the surface is approximated locally using voxel-based plane detection. (Labatut et al., 2009) uses a Graphcut optimization framework to find the surface in a 3D Delaunay triangulation, which is robust to noise. (Marton et al., 2009) proposed a greedy surface triangulation algorithm relying on normal information that also focuses on robustness to noise. Finally, sensor meshing (Boussaha et al., 2018) is a simple and fast method that uses the sensor topology to build triangles connecting consecutive points in each scanline, and corresponding points in consecutive scanlines.

Overview
In this paper, we propose a static mesh reconstruction method that accounts for moving objects by merging two scans collected over the same time period from two 2D laser rangers. Sensor meshes (Boussaha et al., 2018) are generated from the two scans, with the difference that we store the optical center position for each mesh vertex, which is mandatory for the subsequent visibility analysis. This visibility analysis detects moving objects as volumetric inconsistencies by ray tracing both within the same scan and between the two scans. A boolean optimization produces a mosaic of the two meshes by maximizing surface coverage and minimizing seam line length while forbidding overlaps. Finally, the resulting mesh pieces are stitched. The paper is structured as follows: Section 2 presents the dataset and scanned area. Details of the method are presented in Section 3. Results and analysis are provided in Section 4. Finally, conclusions are drawn and perspectives proposed in Section 5.

DATASET
The dataset used in this paper is acquired with a Navvis M3 trolley (Marcus, Georg, 2017), which is mainly designed for indoor mobile mapping. This platform integrates an IMU, three HOKUYO UST-10LX 2D Lidar rangers and six cameras allowing the creation of panoramic images of the surrounding scene, as illustrated in Figure 2. One of the laser rangers is mounted horizontally to allow for 2D Lidar based simultaneous localization and mapping (SLAM). The other two are mounted vertically, one on the left and the other on the right of the trolley, so we will identify them as the "Left" and "Right" Lidars respectively. For the scene shown in Figure 2, the point cloud is visualized in Figure 3.
The intrinsic parameters of the Lidar sensors are given in (Hokuyo, 2019). The intrinsic and relative extrinsic parameters are obtained from calibration, and pose parameters can be obtained from the Navvis system. As shown in Figure 4, the trajectory of the system is not a straight line and includes rotations in the horizontal plane during scanning; consequently, some areas are scanned twice by the same Lidar. In the experiment, considering computation speed and device memory, the dataset is divided into the two blocks shown in Figure 4.

METHOD
There are four steps in our pipeline, shown in Figure 1: sensor mesh generation, including pre-processing (cf. Section 3.1) and mesh generation (cf. Section 3.2); moving object detection and removal (cf. Section 3.3); triangle selection using a boolean optimization (cf. Section 3.4.1); and mesh piece stitching (cf. Section 3.4.2).

Pre-Processing
The HOKUYO UST-10LX Lidar scans an angular sector of 270°, such that it has a 90° blind angle, usually oriented towards the ground as illustrated in Figure 5. In order to fill this blind angle, virtual points are added by intersecting lines sampling the blind angle with the ground plane. As the platform is mainly designed for indoor mobile mapping using 2D based SLAM (Marcus, Georg, 2017), the ground plane is the z = 0 plane for the platform. Because the distance and angle of the Lidar sensors relative to this plane are fixed by the design of the trolley, the ground plane is at a constant position in the Lidar scans. The blind angle is divided symmetrically into two halves, and virtual points are added only in the half where the last measured scan length r is close to the hypothetical length to the ground plane, as shown by the red points added in Figure 3.
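As an illustration, here is a minimal sketch of this virtual point generation, assuming poses are expressed in a frame where the ground is the z = 0 plane; the function names (`hits_ground`, `virtual_ground_points`) and the tolerance `tol` are hypothetical:

```python
import numpy as np

def hits_ground(origin, direction, max_range):
    """Range at which a ray from `origin` along unit vector `direction`
    crosses the ground plane z = 0, or None if it does not."""
    if direction[2] >= 0.0:
        return None                  # ray does not point towards the ground
    t = -origin[2] / direction[2]
    return t if 0.0 < t < max_range else None

def virtual_ground_points(origin, blind_dirs, last_dir, last_range,
                          max_range=10.0, tol=0.1):
    """Fill one half of the blind sector with virtual ground points, but
    only if the last real echo (direction `last_dir`, range `last_range`)
    agrees with the hypothetical ground range along its direction."""
    expected = hits_ground(origin, np.asarray(last_dir), max_range)
    if expected is None or abs(expected - last_range) > tol:
        return np.empty((0, 3))      # last echo is not on the ground: skip
    points = []
    for d in blind_dirs:             # unit directions sampling the half-sector
        t = hits_ground(origin, np.asarray(d), max_range)
        if t is not None:
            points.append(origin + t * np.asarray(d))
    return np.asarray(points) if points else np.empty((0, 3))
```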
Then, in order to reduce the noise level, a smoothing filter is applied. A whole-mesh smoothing filter might perform better than a per-scanline method, but scanline-based smoothing is fast. We rely on the prior that indoor environments often present piecewise flat surfaces by using a Douglas-Peucker algorithm (Wu, Marquez, 2003) to smooth each scanline. Douglas-Peucker creates a polygonal approximation of the points along the scanline, onto which the input points are projected along their rays, as shown in Figure 6.
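A minimal sketch of the classic Douglas-Peucker simplification of an ordered scanline is given below; the subsequent projection of the input points onto the polyline along their rays is omitted:

```python
import numpy as np

def douglas_peucker(points, eps):
    """Classic Douglas-Peucker polygonal approximation of an ordered
    scanline. `points` is an (n, 3) array of consecutive scanline
    points, `eps` the distance tolerance (meters)."""
    if len(points) < 3:
        return points
    a, b = points[0], points[-1]
    rel = points - a
    chord = b - a
    norm = np.linalg.norm(chord)
    if norm > 0.0:                        # distance of each point to chord (a, b)
        u = chord / norm
        dist = np.linalg.norm(rel - np.outer(rel @ u, u), axis=1)
    else:                                 # degenerate chord (closed scanline)
        dist = np.linalg.norm(rel, axis=1)
    i = int(np.argmax(dist))
    if dist[i] <= eps:
        return np.vstack([a, b])          # the chord approximates well enough
    left = douglas_peucker(points[:i + 1], eps)
    right = douglas_peucker(points[i:], eps)
    return np.vstack([left[:-1], right])  # drop the duplicated split point
```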

Sensor Mesh
In planar Lidars such as the HOKUYO UST-10LX, the laser beam is directed at a rotating mirror such that the reflected beam rotates in a plane. Points are acquired in order at a constant sampling rate, such that each point has two natural neighbors: the points acquired just before and just after. Moreover, the beam angle can be accessed for each point, such that we can define four additional neighbors for each point: the points with the beam angle just over and just under its own beam angle in the preceding and following scanlines. These 6 neighbors allow the creation of 6 triangles defining the sensor mesh, as explained in more detail in (Boussaha et al., 2018). Elongated triangles appearing on object borders (depth discontinuities) can be filtered out based on simple geometric rules such as maximum edge length or circumradius, such that the sensor mesh can have holes. The sensor mesh of block 2 (cf. Figure 4) is shown in Figure 7.
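The following simplified sketch illustrates the construction, assuming the echoes are organized in a regular (scanline × beam) grid; the 6-neighbor definition by beam angle matching of (Boussaha et al., 2018) is reduced here to grid adjacency, and only the maximum edge length filter is shown:

```python
import numpy as np

def sensor_mesh(grid, max_edge=0.3):
    """Build a sensor-topology mesh from points organized as a
    (n_lines, n_beams, 3) grid (NaN entries where no echo returned).
    Each grid cell yields up to two triangles; elongated triangles
    spanning depth discontinuities are filtered by edge length."""
    def keep(tri):
        if np.isnan(tri).any():
            return False
        edges = [np.linalg.norm(tri[k] - tri[(k + 1) % 3]) for k in range(3)]
        return max(edges) < max_edge
    n_lines, n_beams, _ = grid.shape
    triangles = []
    for i in range(n_lines - 1):
        for j in range(n_beams - 1):
            p00, p01 = grid[i, j], grid[i, j + 1]
            p10, p11 = grid[i + 1, j], grid[i + 1, j + 1]
            for tri in (np.stack([p00, p01, p10]),   # two triangles per cell
                        np.stack([p01, p11, p10])):
                if keep(tri):
                    triangles.append(tri)
    return np.array(triangles)
```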

Moving Objects Analysis
Moving objects are present in the Lidar scans, and thus in the sensor meshes. We propose to detect them through a visibility analysis, as in (Xiao et al., 2015) for instance. Because a ray indicates free space, if it intersects a sensor mesh triangle, either from the other mesh or from the same mesh but acquired at another time, then the intersected triangle belongs to a part of the scene that has changed, and is thus considered moving.
To make the inconsistent situations clearer, a visibility analysis is illustrated virtually in 2D in Figure 8. Figure 8(a) shows the two sensor meshes with visibility information. Note that mesh1 has two different sight vectors, meaning the scanner has scanned the same part of the scene twice from two separate viewpoints; this special case, called self-intersection, is handled by our method.
Visibility analysis consists in tagging as inconsistent the triangles of each mesh that are traversed by a ray either from the other mesh (cross-intersections) or from the same mesh (self-intersections). Figure 8(b) illustrates the two types of inconsistent areas:

1. Areas 1 and 2 are cross-intersections (triangles from mesh1 intersected by rays from mesh2).

2. Area 3 is a self-intersection of mesh1.

3. Areas 4, 5 and 6 are not inconsistent: they are considered to belong to the static part of the scene and will be the inputs of the following mesh fusion. Note that area 6 may also belong to a moving object, as it is in the continuity of an inconsistent part, but we do not have enough information to confirm this (no rays traverse it).
In our experiment, visibility analysis is performed by ray/triangle intersection on the 3D meshes. As the number of intersections to compute is the product of the number of rays and the number of triangles, which can get huge, it is accelerated by building an AABB tree structure in CGAL (The CGAL Project, 2018) over the triangles. In practice, when the same part of the scene is scanned multiple times, many intersections can happen due to noise and uncertainty on the trajectory, which would generate many false alarms. Thus we combine the visibility analysis with a distance analysis: if two triangles are close enough (based on a distance threshold, which can be derived from the accuracy of the platform) and they overlap, they are considered consistent. In this case, the intersections of the corresponding rays are ignored, and the consistency information is stored for the next step.
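The core test is a segment/triangle intersection between each optical center and its echo, sketched below with the standard Möller-Trumbore algorithm; this brute-force version ignores the AABB tree acceleration and the distance-based consistency check described above:

```python
import numpy as np

def ray_hits_triangle(orig, echo, tri, eps=1e-9):
    """Möller-Trumbore test: does the free-space segment between the
    optical center `orig` and the echo `echo` cross triangle `tri`?"""
    d = echo - orig
    e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
    p = np.cross(d, e2)
    det = e1 @ p
    if abs(det) < eps:              # segment parallel to the triangle plane
        return False
    inv = 1.0 / det
    s = orig - tri[0]
    u = (s @ p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = np.cross(s, e1)
    v = (d @ q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    t = (e2 @ q) * inv
    return eps < t < 1.0 - eps      # strictly between sensor and echo

def tag_inconsistent(rays, triangles):
    """Tag every triangle traversed by at least one free-space ray."""
    return [any(ray_hits_triangle(o, e, tri) for o, e in rays)
            for tri in triangles]
```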

3D Mesh Fusion
The objective of mesh fusion is to create a single mesh from all the remaining triangles. After removing inconsistent triangles and defining consistent ones, each remaining triangle falls into one of two cases:

1. Single: the triangle has no corresponding triangle, which means this part of the scene was seen only once, and by a single sensor.

2. Redundant: the triangle has corresponding triangles, either in the other mesh or in the same mesh (this part of the scene has been scanned more than once by the sensors).

While single triangles should trivially be kept in the merged mesh, some redundant triangles should be eliminated in order to keep a single mesh layer in all parts of the scene. This is the aim of triangle selection.
3.4.1 Triangle Selection generalizes to 3D meshes the notion of 2D mosaicking for images. Once the images to mosaic are resampled in the same geometry, the remaining problem is to decide which pixels to keep in the overlaps. This is usually done by a labelling optimization (one label per input image, for each pixel) that minimizes relief displacement and radiometric differences across seam lines (Lin et al., 2015). In 3D, the overlaps are defined by redundant triangles, and we also aim at minimizing seam length, but it is not a grid labelling problem anymore as the sampling differs between the meshes.
Thus we pose the problem as a boolean optimization with the constraint that two overlapping triangles should not be kept together.
Considering two meshes $M_1$ and $M_2$, we define the following:

• $x^i_j \in \{0, 1\}$ is a boolean on each triangle $T^i_j$ of mesh $M_i$, indicating if the triangle is selected (1) or removed (0) in the result. $x$ is the vector concatenating all the $x^i_j$.

• $Q(T^i_j)$ is the quality of triangle $T^i_j$, whose choice will be discussed later.

• $C$ is the set of all consistent pairs of triangles (from the same mesh or not), i.e. pairs that are below the distance threshold and that overlap, as computed in the previous step.

• $B_i$ is the set of triangles of mesh $M_i$ with at least one boundary edge. For a triangle $T^i_j \in B_i$, we call $L(T^i_j)$ the length of its boundary edge(s).

• For two triangles $T^i_{j_1}$, $T^i_{j_2}$ of the same mesh $M_i$ sharing an edge, $L(T^i_{j_1}, T^i_{j_2})$ is the length of their common edge.

• $\mathrm{XOR}(x_1, x_2) = x_1 + x_2 - 2 x_1 x_2$ is the exclusive OR logical operator.
Then the triangle selection problem is defined as finding the minimum of:

$E(x) = -\sum_{i,j} Q(T^i_j)\, x^i_j + P \sum_{(T^i_{j_1}, T^{i'}_{j_2}) \in C} x^i_{j_1}\, x^{i'}_{j_2} + \sum_{T^i_j \in B_i} L(T^i_j)\, x^i_j + \sum_{T^i_{j_1}, T^i_{j_2} \text{ sharing an edge}} L(T^i_{j_1}, T^i_{j_2})\, \mathrm{XOR}(x^i_{j_1}, x^i_{j_2})$

1. The first term ensures that the result contains as many input triangles as possible and encourages the selection of higher quality triangles where triangles overlap. The quality definition can combine resolution and noise/uncertainty metrics. In our case, with similar noise levels, we focus on resolution with the very simple choice $Q(T^i_j) = \sqrt{\text{area}(T^i_j)}$: higher resolution means more, smaller triangles for the same area of the scene, and the sum of the square roots of the areas is larger for a finer triangulation covering the same surface.
2. The second term forbids overlapping triangles from being selected together ($P$ is a very large constant). This is easier to implement and optimize than a strict constraint, with the same result.

3. The last two terms measure the boundary length, as there are two types of boundaries in the output mesh: boundaries of the input meshes whose triangle is kept, and edges between a kept triangle and a removed one.

Note that while quality should be maximized, the other terms (overlap and boundary length) should be minimized, which explains the signs. The result of this optimization consists of a small number of quite compact, non-overlapping mesh pieces (connected components), as the boundary penalty discourages creating many pieces or pieces with complex boundaries, and overlaps are very heavily penalized. These pieces are not connected, so the resulting mesh has gaps that need to be filled, which we describe in the next section.
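For concreteness, the sketch below evaluates this energy for a candidate selection vector; the actual minimization uses QPBO (cf. Section 4), and names such as `consistent_pairs` and `shared_edges` are illustrative:

```python
import numpy as np

def xor(x1, x2):
    """Exclusive OR written as a quadratic pseudo-boolean term."""
    return x1 + x2 - 2 * x1 * x2

def selection_energy(x, quality, consistent_pairs, boundary_len,
                     shared_edges, P=1e6):
    """Energy of Section 3.4.1 for a 0/1 selection vector `x`.
    quality[j]       : Q(T_j), here the sqrt of the triangle area
    consistent_pairs : index pairs of overlapping triangles (set C)
    boundary_len[j]  : input boundary edge length L(T_j), 0 if none
    shared_edges     : (j1, j2, length) for same-mesh adjacent triangles"""
    e = -np.dot(quality, x)                                    # coverage/quality
    e += P * sum(x[a] * x[b] for a, b in consistent_pairs)     # forbid overlaps
    e += np.dot(boundary_len, x)                               # kept boundaries
    e += sum(l * xor(x[a], x[b]) for a, b, l in shared_edges)  # seam length
    return e
```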

3.4.2 Stitching creates bands of triangles to connect the mesh pieces together and recover the continuous nature of the scene. It is performed in two steps:

(1) Match boundaries between pieces: thanks to the halfedge data structure in CGAL (The CGAL Project, 2018), the boundary edges of each mesh piece and their orientation can be accessed efficiently. The boundary matching is done in a greedy manner: we look for the closest pair of edges from separate pieces, then grow two boundaries starting at these edges for as long as the two boundaries stay close enough. This creates a first pair of matched piece boundaries. The process is then iterated on the remaining edges while the pair of closest edges is close enough. A k-d tree structure in CGAL is used to accelerate the nearest edge search.
(2) Link the pairs of matched boundaries: to connect the pairs of matched piece boundaries, we start by snapping the vertices that are close enough to the opposite boundary onto its closest vertex. For the remaining unsnapped boundary, we create stitching triangles by filling the hole formed by the two matched boundaries connected at their endpoints, adapting a standard hole filling algorithm (Liepa, 2003). This dynamic programming algorithm needs the definition of a comparison operator "<" giving the best of two candidate triangles. We found that the proposition of (Liepa, 2003), which minimizes the sum of the triangle area and largest dihedral angle, produces elongated triangles, so we propose the following comparison operator integrating the perimeter of the triangle:

$T_1 < T_2 \iff \mu_1 < \mu_2$ if $\mu_1 < \varepsilon$ and $\mu_2 < \varepsilon$, and $\Omega_1 + C(T_1) < \Omega_2 + C(T_2)$ otherwise,

where for triangle $T_i$, $\mu_i$ is the maximum dihedral angle between $T_i$ and its neighboring triangles, $\Omega_i$ is the area of $T_i$, and $C(T_i)$ is the perimeter of $T_i$ (sum of its edge lengths). This choice favors the smoothness of the reconstructed hole surface for small dihedral angles, but acknowledges that noise may imply some large dihedral angles, in which case optimizing the shape of the filling triangles is preferred (the second criterion). The parameter $\varepsilon$ is a threshold on the dihedral angle used to switch between these two behaviors; $\varepsilon$ = 90° is used in our experiment.
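A sketch of this comparison operator follows; the exact combination of area and perimeter in the second criterion is our reading of the operator above and should be treated as an assumption:

```python
def better(cand1, cand2, eps_deg=90.0):
    """Compare two hole-filling candidate triangles (Section 3.4.2).
    Each candidate is (mu, area, perimeter), with mu the maximum
    dihedral angle (degrees) to its neighboring triangles.
    Returns True if cand1 is the better filling triangle."""
    mu1, area1, perim1 = cand1
    mu2, area2, perim2 = cand2
    if mu1 < eps_deg and mu2 < eps_deg:
        return mu1 < mu2                    # both smooth: favor smoothness
    return area1 + perim1 < area2 + perim2  # else: favor compact shape
```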

RESULTS AND DISCUSSION
In order to illustrate the performance of the proposed method, we evaluate it on the Navvis dataset described in Section 2.

Moving Object Detection
In the dataset, people may move quickly or stay still for a short time. Moving object detection is based on the inconsistency analysis through ray tracing and distance computation.
A focus on an inconsistent area is shown in Figure 9; there are two types of inconsistent objects: moving objects and briefly static objects. As shown in Figures 9(b), 9(c) and 9(d), only the combination of distance computation, inter ray tracing and self ray tracing can recover all the moving objects without false alarms on the static part of the scene. In Figures 9(g) and 9(h), the blue rectangle shows that self ray tracing can remove self-inconsistent objects in areas covered by only one of the two meshes.

3D Mesh Fusion
There are two steps in the mesh fusion: triangle selection and stitching. In the triangle selection, we emphasize triangle quality and self overlaps. After triangle selection, stitching is applied to obtain a continuous mesh.
(1) Triangle quality: because of the different scan directions of the two laser sensors, some mesh triangles can be very large. The label optimization selects the highest quality (i.e. smallest) triangles: as shown in Figure 10, high quality triangles are selected rather than the elongated triangles (in the red rectangles) caused by depth discontinuities.
(2) Self overlapping triangles: some areas are scanned several times. In the experiment shown in Figure 11, most self overlaps occur on ground areas: in Figure 11(a), the red area is covered several times, while in Figure 11(c), only one layer is selected.
(3) Mesh stitching: mesh stitching is the last step to produce a continuous mesh. The gaps between mesh pieces resulting from triangle selection are filled with triangles. As shown in Figure 12, the triangles in magenta are added to fill the gaps. Boundaries that are close enough are merged to avoid creating very small triangles.
In our experiment, if there are no self overlaps in the mesh, all the nodes can be labeled using the QPBO method. If there are self overlaps, not all the nodes are labeled, so the QPBO-I method is used to improve the labeling ratio, which is slow if there are too many unlabeled triangle nodes. After mesh stitching, a few small holes may remain in the mesh, so the hole filling method of (Liepa, 2003) can be used to improve visualization and completeness.

Comparison
In order to show the effectiveness of the proposed method, results are compared with the Poisson method of (Kazhdan et al., 2006) and the greedy triangulation reconstruction method of (Marton et al., 2009) available in the PCL library (Rusu, Cousins, 2011). To ease comparison, the virtual points (cf. Section 3.1) are also added to the point cloud used as input of the Poisson and triangulation methods.
After Poisson reconstruction, the mesh can be trimmed according to the point density. For the greedy triangulation reconstruction, the input point cloud is smoothed using bilateral smoothing in CGAL (The CGAL Project, 2018). The experiment shows that the Poisson method is robust to noisy points, but points on moving objects behave differently: as shown in Figure 13(a), the result varies with their density. With density trimming, some triangles are removed, but where many points remain, the triangles are kept, as shown in Figure 13(b). Moreover, the input points are coarse, so even Poisson gives a rough reconstruction. The greedy triangulation reconstruction is sensitive to noisy points: as shown in Figure 13(c), even after smoothing, the result is rough. Our method produces a clean mesh.
Another example: if a person stays static for a short time, many points are acquired on their back, so the person is well reconstructed, as shown in Figure 14(a). With our method, most of these points are removed.

Block Stitching
For large scale indoor reconstruction, the QPBO-I method may become too slow when there are too many triangles. We can divide the scene into blocks as shown in Figure 4, run the triangle selection optimization framework block by block, then stitch all the resulting pieces, as shown in Figure 15.

CONCLUSIONS AND FUTURE WORK
Emphasizing moving object removal and mesh fusion, this paper proposes a static indoor scene reconstruction method that uses visibility analysis to remove moving objects, then a boolean optimization to select triangles and a stitching method to fill the gaps. An important aspect of the method is that it relies on preserving sensor information: both the optical center position for ray tracing and the sensor topology for sensor mesh generation. Sensor information is important for 3D geometry analysis, as advocated in (Xiao et al., 2015), yet most 3D mesh reconstruction methods lose it despite its physical significance (space is empty along the rays). Sensor topology mesh generation is an extremely fast way to produce a high quality mesh, but it does not cope with the self overlaps that occur frequently as soon as the acquisition platform moves freely, and it does not allow the integration of multiple data sources, which is why our proposed approach is a mandatory post-processing for such meshes. However, any reconstruction method adapted to keep the sensor information (in fact, only the optical center position for each mesh vertex), such as the Graphcut method of (Labatut et al., 2009), can be used in our pipeline. Our approach should also be well adapted to fusing depth maps from dense matching, as the optical center is the same for all the points from the same depth map, and this information is often available.
Many objects in indoor scenes do not remain static over time. For example, in our experiment, only people are moving, but on a longer time scale, objects and furniture could also move. Thus an interesting future work would be to perform a time series analysis of several scans acquired at different times of day, or even at different dates, to analyse the dynamic behavior of the scene at various time scales.
In the optimization step, all the triangles are treated as graph nodes, which does not scale up very well: if the mesh is very large, memory and time consumption grow quadratically. A possible solution to this problem would be a divide and conquer approach where the data is divided until the block sizes become small enough to be processed at once, after which the results can be iteratively stitched.

Figure 15. A block stitching experiment. Block 1, block 2 and the added triangles are shown in distinct colors.