GENERATION OF GROUND TRUTH DATASETS FOR THE ANALYSIS OF 3D POINT CLOUDS IN URBAN SCENES ACQUIRED VIA DIFFERENT SENSORS

In this work, we report a novel way of generating ground truth datasets for analyzing point clouds from different sensors and for validating algorithms. Instead of directly labeling a large number of 3D points, which requires time-consuming manual work, a multi-resolution 3D voxel grid of the testing site is generated. Then, with the help of a set of basic labeled points from the reference dataset, we can generate a 3D labeled space of the entire testing site at different resolutions. Specifically, an octree-based voxel structure is applied to voxelize the annotated reference point cloud, by which all the points are organized in 3D grids of multiple resolutions. When automatically annotating new testing point clouds, a voting-based approach is applied to the labeled points within the voxels of multiple resolutions, in order to assign a semantic label to the 3D space represented by each voxel. Lastly, robust line- and plane-based fast registration methods are developed for aligning point clouds obtained via various sensors. Benefiting from the labeled 3D spatial information, we can easily create new annotated 3D point clouds of the same scene from different sensors, directly by considering the labels of the 3D space in which the points are located, which is convenient for the validation and evaluation of algorithms related to point cloud interpretation and semantic segmentation.


INTRODUCTION
In the past decade, automatic 3D scene analysis using point clouds has attracted increasing attention in the research fields of photogrammetry (Vosselman and Maas, 2010), remote sensing (Lefsky et al., 1999), computer vision (Buch et al., 2011), and robotics (Rusu et al., 2009). For dense and accurate 3D scene analysis and interpretation, especially in the context of urban areas, plenty of algorithms and approaches have been developed for a wide variety of applications such as semantic interpretation (Weinmann et al., 2015, Landrieu and Simonovsky, 2017, Vosselman et al., 2017), segmentation (Rabbani et al., 2006, Vo et al., 2015), registration (Aiger et al., 2008, Yang and Zang, 2014, Ge and Wunderlich, 2016, Theiler et al., 2014), and object recognition (Schnabel et al., 2007, Yao et al., 2011, Niemeyer et al., 2014, Aldoma et al., 2012, Yu et al., 2016). For any proposed algorithm or method, thorough experiments and convincing evaluations are non-trivial but crucial steps in validating its feasibility and performance. To conduct such experiments and evaluations, a benchmark dataset or ground truth is normally required. In fact, considerable effort has been devoted to the generation of benchmark datasets in the point cloud processing community, for example, the ISPRS Benchmark Test on Urban Object Detection and Reconstruction, containing challenging aerial laser scanning point clouds for 3D object reconstruction (Rottensteiner et al., 2014), and the semantic 3D benchmark for classification, with large-scale terrestrial point clouds of various urban, suburban, and rural scenes (Hackel et al., 2017).
However, the majority of published benchmarks focus only on data obtained from one type of sensor, for example photogrammetric or LiDAR point clouds. To test generality, in many cases a proposed method needs to be evaluated on datasets of the same scene acquired by more than one kind of sensor, for example LiDAR points and TomoSAR points. Thus, if the ground truth of a testing scene can be used for automatically annotating new datasets acquired by other sensors, it will contribute considerably to validating the generality of proposed methods, and provide a possibility of fusing the datasets of multiple sensors. To this end, we present a novel strategy for generating annotated datasets for new point clouds from different sensors based on a labeled 3D grid structure. Instead of directly labeling the 3D points with time-consuming manual work, multi-resolution 3D voxel grids of the testing site are generated. On the basis of a set of basic labeled reference points (i.e., point clouds of low density that are easier to store and to annotate manually), we can generate a 3D labeled space of the entire testing site at different resolutions. To be specific, the voxel structure is applied to voxelize the annotated reference point cloud (see Fig. 1b), by which all the points are organized in a 3D grid. Then, a voting-based strategy is applied to the voxels of multiple resolution levels, in order to assign a semantic label to the 3D space represented by each voxel. Lastly, robust line- and plane-based fast registration methods are developed for aligning point clouds obtained via various sensors. Benefiting from the labeled 3D spatial information, we can easily create new annotated 3D point clouds of the same scene from different sensors, directly by considering the labels of the 3D space in which the points are located, which is convenient for the validation and evaluation of algorithms related to point cloud interpretation and semantic segmentation. In Fig. 1c, we illustrate an example of the reference MLS point cloud we used.

METHODOLOGY
Conceptually, the implementation of our proposed method consists of three core steps: the voxelization of the reference dataset together with the extraction of primitives, the primitive-based registration, and the voxel-based 3D labeling. In the first step, the reference dataset is manually labeled and voxelized into a 3D grid structure of cubic voxels. Simultaneously, geometric primitives (e.g., lines or planes) are extracted from the reference dataset and from the testing point cloud. By combining the labeled reference dataset and the voxel structure, the 3D space covering the testing dataset can be labeled. In the second step, the primitive-based registration is conducted between the reference dataset and the testing point cloud, in order to align them in the same coordinate frame. In the last step, the multi-scale 3D labeled voxel space is applied to the aligned testing point cloud, and labels are assigned to its points via the multi-scale voting strategy. The processing workflow is sketched in Fig. 2, with the key steps of the involved methods and sample results illustrated. Each step is explained in detail in the following sections.

Annotation of the reference dataset
The annotation of the reference dataset is a crucial step in our work. Normally, the acquired measurements (e.g., point clouds) contain a large number of points, which is a challenge for manual annotation. However, in our method, it is not necessary to label all the points of the reference dataset. Instead, only those points representing or covering the 3D space need to be labeled. Thus, there is in fact no limitation on the type of reference dataset (e.g., point cloud, mesh, or BIM model). In our case, the original reference dataset (i.e., the MLS point cloud) is considerably downsampled, in order to facilitate the manual labeling.

Octree based voxelization
For labeling the 3D space covering the reference dataset, the reference dataset is voxelized at multiple resolutions. Here, we utilize octree-based voxelization to decompose the entire point cloud into 3D cubic grids.
As discussed in our former studies (Boerner et al., 2017, Xu et al., 2017c), the main reason for using the octree structure is that its nodes have explicit linking relations, which facilitates the traversal when searching for adjacent nodes (Vo et al., 2015, Xu et al., 2017b). Besides, benefiting from the structure of the octree, it is also much easier to generate the multi-resolution voxel spaces. In Fig. 3, we provide an illustration of the generated 3D labeled voxel spaces of different resolutions.
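The following is a minimal sketch of how an octree-style, multi-resolution voxel indexing could look, not the implementation used in this work. It assumes a cubic root cell with a known origin and edge length; the names `voxel_key`, `parent_key`, `voxelize`, `origin`, `root_size` and `depth` are hypothetical. Each depth halves the voxel edge, and the parent of a voxel is obtained by integer division of its index, which is the explicit linking relation between levels mentioned above.

```python
import math
from collections import defaultdict


def voxel_key(point, origin, root_size, depth):
    """Integer (i, j, k) index of the voxel containing `point` at a given depth.

    At depth d the root cube of edge `root_size` is divided into 2**d cells
    per axis, so the voxel edge is root_size / 2**d.
    """
    size = root_size / (2 ** depth)
    return tuple(math.floor((c - o) / size) for c, o in zip(point, origin))


def parent_key(key):
    """Index of the enclosing voxel one level coarser (depth - 1)."""
    return tuple(k // 2 for k in key)


def voxelize(points, origin, root_size, depth):
    """Map each occupied voxel index to the list of point indices it contains."""
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        cells[voxel_key(point, origin, root_size, depth)].append(idx)
    return dict(cells)


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (3.5, 0.2, 0.2)]
    for depth in (1, 2):
        print(depth, voxelize(cloud, (0.0, 0.0, 0.0), 4.0, depth))
```

Storing only occupied voxels in a dictionary keeps the memory footprint proportional to the occupied space; a pointer-based octree would serve the same purpose.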

Primitive extraction
To label the testing point cloud with the given labeled 3D space, the coordinate frame of the testing point cloud needs to be aligned to that of the labeled 3D space. However, since the two are normally different types of datasets obtained via different sensors, traditional point-based registration algorithms may not work well. Considering that, in the same scene, the basic geometric structures of the buildings are always consistent (Boerner et al., 2018), we developed a strategy of using geometric primitives to align the two coordinate frames. Here, the primitives we use include lines (Koch et al., 2016) and planes (Xu et al., 2017a). Ways of extracting 3D lines have been widely reported. For example, 3D lines can be reconstructed from a set of images using the method of (Hofer et al., 2017) or (Jain et al., 2010). For extracting 3D lines from LiDAR point clouds, the methods reported in (Lin et al., 2017) or (Hackel et al., 2016) aim at extracting the boundaries and contours of the point cloud. As for the extraction of planes, there is also plenty of work, such as (Xu et al., 2017a, Nguyen et al., 2017, Dong et al., 2018). One benefit of using 3D lines and planes is a lower computational cost than with points: such primitives represent important structures of the environment (especially in urban scenes, which are rich in straight edges), so the number of candidates for finding correspondences is reduced. Besides, primitives constrain more degrees of freedom, which makes the estimation of the transformation more robust. Especially in urban scenes, where buildings have flat facades, the registration is easier in 2D if corresponding planar structures can be found. In Fig. 4, an illustration of the lines and planes extracted from the corresponding photogrammetric point clouds is given.
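As an illustration of what extracting a planar primitive involves, the sketch below fits a single dominant plane with a plain RANSAC loop. This is a generic stand-in, not the plane extraction of (Xu et al., 2017a) or any other cited method; the function names, the inlier `threshold`, the iteration count and the fixed `seed` are all assumptions made for the example.

```python
import math
import random


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def fit_plane(p, q, r):
    """Plane through three points as (unit normal, offset) with normal . x = offset.

    Returns None when the three points are (nearly) collinear.
    """
    normal = _cross(_sub(q, p), _sub(r, p))
    length = math.sqrt(_dot(normal, normal))
    if length < 1e-12:
        return None
    normal = tuple(c / length for c in normal)
    return normal, _dot(normal, p)


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Return the plane hypothesis with the most inliers and the inlier indices."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate sample, draw again
        normal, offset = plane
        inliers = [i for i, pt in enumerate(points)
                   if abs(_dot(normal, pt) - offset) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

In practice the loop would be repeated on the remaining points to collect several planes, for instance one per facade, before looking for correspondences.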

Primitive-based registration
The registration of the reference and test point clouds can be expressed by a simple 3D similarity transformation T = (Tr, Tt, s), where Tr is a 3 × 3 rotation matrix, Tt is a 3D translation vector, and s is a scale. In the following, we briefly introduce the line- and plane-based registration methods.
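For concreteness, the sketch below applies such a similarity transformation to a point cloud. The order of operations (rotate, then scale, then translate, i.e. p' = s · Tr · p + Tt) is an assumed convention, since the text does not fix it; the function names are hypothetical.

```python
def apply_similarity(point, rotation, translation, scale=1.0):
    """Map one 3D point with p' = scale * rotation @ p + translation.

    `rotation` is a 3x3 matrix given as three rows; the composition order
    is an assumption of this sketch.
    """
    rotated = tuple(sum(rotation[r][c] * point[c] for c in range(3))
                    for r in range(3))
    return tuple(scale * rotated[r] + translation[r] for r in range(3))


def transform_cloud(points, rotation, translation, scale=1.0):
    """Apply the same similarity transformation to every point of a cloud."""
    return [apply_similarity(p, rotation, translation, scale) for p in points]


if __name__ == "__main__":
    # 90 degree rotation about the z axis, scale 2, shift of 1 along x.
    rot_z90 = ((0.0, -1.0, 0.0),
               (1.0, 0.0, 0.0),
               (0.0, 0.0, 1.0))
    print(transform_cloud([(1.0, 0.0, 0.0)], rot_z90, (1.0, 0.0, 0.0), 2.0))
```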

Line-based registration
The 3D line-based registration follows a modification of a methodology for aligning individual indoor and outdoor 3D building models (Koch et al., 2016). Unlike the alignment of non-overlapping image-based 3D reconstructions from different views, the 3D models to be aligned here are captured by different sensors but represent identical structures from similar viewpoints. After extracting 3D line segments from the LiDAR points, $L_L = \{l_L^1, \ldots, l_L^n\}$, and from the images, $L_I = \{l_I^1, \ldots, l_I^m\}$, as explained in Section 2.2.2, the transformation $T$ aligning $L_L$ to $L_I$ can be estimated by minimizing the perpendicular distances between $k$ corresponding 3D line segments of both line sets:

$$\hat{T} = \arg\min_{T} \sum_{i=1}^{k} d\left(l_I^i, \pi(l_L^i, T)\right) \quad (1)$$

where $\pi(l, T)$ projects a line segment $l$ with $T$, and $d(l_I, l_L)$ computes the length of the common perpendicular of two 3D line segments extended to infinity.
In order to find line correspondences, the planar characteristics of urban environments are exploited by first extracting a set of 3D planes in both line sets in a RANSAC-based scheme. These 3D planes mainly contain 3D lines representing façade boundaries and openings such as window and door frames, which are suitable geometrical structures for matching. For each detected 3D plane, the corresponding inlier 3D lines are projected onto the plane hypothesis to generate a 2D binary image. The matching step is finally conducted in 2D, by exhaustively comparing all 3D plane hypotheses using chamfer matching on all 2D binary images. Corresponding 3D planes share identical geometrical structures and therefore result in a low chamfer score. The identification of corresponding 3D plane hypotheses solves the 3D rotation, while the 3D translation is given by the result of the binary matching. This prior transformation can be applied to LL, and the corresponding 3D lines in LI can then be detected in 3D space by a simple nearest neighbor search. These 3D line matches are finally used to refine the transformation T using Equation 1.
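The distance d between two line segments extended to infinity, and the cost accumulated over matched pairs, can be sketched as follows. The sketch assumes that the LiDAR segments have already been transformed by the current estimate of T, that each segment is given by its two distinct end points, and that the cost is a plain sum of distances; the names `line_distance` and `alignment_cost` are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def line_distance(seg_a, seg_b, eps=1e-12):
    """Length of the common perpendicular of two segments extended to infinite lines.

    Each segment is a pair of distinct 3D end points.
    """
    p1, q1 = seg_a
    p2, q2 = seg_b
    d1, d2 = _sub(q1, p1), _sub(q2, p2)
    offset = _sub(p2, p1)
    normal = _cross(d1, d2)
    normal_len = _norm(normal)
    if normal_len > eps * _norm(d1) * _norm(d2):
        # Skew or intersecting lines: project the offset onto the common normal.
        return abs(_dot(offset, normal)) / normal_len
    # Parallel lines: distance from a point of one line to the other line.
    return _norm(_cross(offset, d1)) / _norm(d1)


def alignment_cost(pairs):
    """Sum of line distances over matched (image segment, LiDAR segment) pairs."""
    return sum(line_distance(a, b) for a, b in pairs)
```

A refinement step would minimize `alignment_cost` over the parameters of T with a nonlinear least-squares solver; that part is omitted here.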

Plane-based registration
The 3D plane-based registration is a modification of the methodology for aligning 3D building scenes in urban areas (Xu et al., 2017a). The matching between corresponding plane sets is conducted by finding a triple of planes forming a 3D corner in the urban scene, which is used as a constraint to define a coordinate frame determining six degrees of freedom. The normal vectors of the three planes forming the corner are calculated first. Afterwards, the coordinate frame defined by this corner is estimated by cross products of the normal vectors.
Based on a pair of corners from different scans, we can define the transformation that aligns one scan to the other, which consists of the estimation of the rotation and the determination of the translation.
As stated in (Xu et al., 2017a), in theory, if we can find the corresponding corners standing for the corresponding plane sets, the correct transformation can be calculated. For this purpose, we assume that the transformation parameters are estimated optimally if the majority of the planes in one scan can be matched to the other scan. Similar to the line-based registration, a RANSAC-based strategy is adopted, sampling a triple of planes from each of the two scans in every random selection round. The planes of a triple must be non-parallel, and their corresponding coordinate frame is estimated. In the RANSAC process, we apply the estimated transformation parameters to all the selected planes in the target scan and count the number of matched planes. The triples of planes yielding the largest number of matched planes are identified as the optimal corresponding plane sets, and are used for estimating the final transformation parameters. For the rotation, the angle differences α, β, and γ of the rotation matrix Tr are calculated by comparing the axis vectors of the corresponding coordinate frames Fs and Ft, and Tr is then obtained from these angles. For the translation, the intersection points P of the triples of planar patches are used as the reference, and the translation vector Tt is calculated from them. By applying Tr and Tt, the coordinate frame of the testing dataset can be aligned to that of the reference dataset.
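One way to turn a pair of matched corners into a rigid transformation is sketched below. It is an illustrative construction under stated assumptions, not the formulation of (Xu et al., 2017a): each plane is given as a (normal, offset) pair with normal · x = offset, the corner frame is built from the first two normals by cross products, the rotation is composed directly from the two frames rather than from the angles α, β and γ, and the translation is taken as Pt − Tr · Ps. All function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(_dot(a, a))
    return tuple(c / length for c in a)


def corner_point(planes):
    """Intersection point of three non-parallel planes (normal, offset)."""
    (n1, d1), (n2, d2), (n3, d3) = planes
    det = _dot(n1, _cross(n2, n3))
    if abs(det) < 1e-9:
        raise ValueError("planes do not meet in a single corner point")
    terms = (_cross(n2, n3), _cross(n3, n1), _cross(n1, n2))
    return tuple((d1 * terms[0][i] + d2 * terms[1][i] + d3 * terms[2][i]) / det
                 for i in range(3))


def corner_frame(planes):
    """Right-handed orthonormal frame (x, y, z axes) derived from the plane normals."""
    n1, n2 = planes[0][0], planes[1][0]
    x_axis = _unit(n1)
    z_axis = _unit(_cross(n1, n2))
    y_axis = _cross(z_axis, x_axis)
    return x_axis, y_axis, z_axis


def corner_transform(source_planes, target_planes):
    """Rotation (3 rows) and translation mapping the source corner onto the target corner."""
    fs, ft = corner_frame(source_planes), corner_frame(target_planes)
    # R maps every source axis onto the matching target axis: R = sum_i ft_i fs_i^T.
    rotation = tuple(tuple(sum(ft[i][r] * fs[i][c] for i in range(3))
                           for c in range(3)) for r in range(3))
    ps, pt = corner_point(source_planes), corner_point(target_planes)
    rotated = tuple(_dot(rotation[r], ps) for r in range(3))
    translation = tuple(pt[r] - rotated[r] for r in range(3))
    return rotation, translation
```

Inside a RANSAC loop, `corner_transform` would be evaluated for each sampled pair of plane triples, and the candidate that brings the largest number of planes into agreement would be kept.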

Voxel-based 3D labeling
To avoid directly labeling the 3D points of testing point clouds, the annotated 3D space with labels is used. More specifically, 3D voxel grids of different resolutions are generated from merely a subset of the point cloud in order to label the testing dataset afterwards. Here, a simple voting approach is applied to label the voxels in the grid using only the basic labeled points. Multiple resolutions are obtained by changing the division level used in the octree-based voxelization.
For assigning labels to the points of the aligned testing point cloud with the labeled 3D space, the position of each testing point is considered. To be specific, if a point is located in a voxel with a given label k, this point is tagged with label k. For a labeled 3D space with n resolutions, each testing point finally receives at most n tags. These n tags then take part in a voting process, and the tag with the highest voting score is assigned to the point. Here, the use of a multiple-resolution voxel space is designed to deal with the ambiguity caused by the boundaries and connections between different structural components. In theory, the smaller the voxel, the more accurate the assignment of testing points will be. However, this requires more detailed manual labels, which inevitably increases the workload. The multi-resolution strategy strikes a good balance between the preservation of details and the accuracy of the tags given to testing points.
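The two voting stages described above can be sketched as follows: a majority vote of the labeled reference points inside each voxel, and a vote over the tags that a testing point collects from the different resolutions. The sketch is illustrative only. It assumes voxel grids that share a common origin, and it breaks ties in favor of the finest resolution, which is a choice of this example and not stated in the text; points that fall outside every labeled voxel get `None`. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def voxel_key(point, origin, size):
    """Integer index of the cubic voxel of edge `size` that contains `point`."""
    return tuple(math.floor((c - o) / size) for c, o in zip(point, origin))


def build_labeled_space(ref_points, ref_labels, origin, sizes):
    """For each voxel size, map every occupied voxel to its majority label."""
    space = {}
    for size in sizes:
        votes = defaultdict(Counter)
        for point, label in zip(ref_points, ref_labels):
            votes[voxel_key(point, origin, size)][label] += 1
        space[size] = {key: counter.most_common(1)[0][0]
                       for key, counter in votes.items()}
    return space


def label_point(point, space, origin):
    """Vote over the tags collected from all resolutions, finest voxel first."""
    tags = []
    for size in sorted(space):
        tag = space[size].get(voxel_key(point, origin, size))
        if tag is not None:
            tags.append(tag)
    if not tags:
        return None  # the point lies outside the labeled 3D space
    counts = Counter(tags)
    best = max(counts.values())
    # Ties go to the tag from the finest resolution (assumption of this sketch).
    return next(tag for tag in tags if counts[tag] == best)


def label_cloud(points, space, origin):
    """Label every point of an already aligned testing point cloud."""
    return [label_point(p, space, origin) for p in points]
```

Because only voxel lookups are involved, the cost of labeling grows linearly with the number of testing points and resolutions, independent of the size of the reference cloud.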

EXPERIMENTAL RESULTS
The testing site is the Arcisstrasse along the main entrance of the Technische Universität München (TUM) city campus, covering an area of approximately 29000 m². In Fig. 1a, we provide an overview of the entire scene. To some extent, this scene is a representative scenario of an urban area, including rich information on buildings, vehicles, vegetation, ground surfaces, etc. Here, the basic point cloud was originally acquired by the Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB) (Gehrung et al., 2017). The utilized point clouds were acquired by two Velodyne HDL-64E scanners mounted at an angle of 35° on the front roof of a van. Fig. 5 provides a sketch of how the two scanners are mounted (Gehrung et al., 2017). The original raw point clouds are also preprocessed by a statistical outlier removal for down-sampling and noise suppression. The raw point cloud is downsampled from 0.5 billion points to 20 million points, in order to facilitate the manual labeling work. For testing the performance of automatically labeling new point clouds acquired by other sensors, we also generate a photogrammetric point cloud from 500 images taken with a hand-held camera, via stereo matching based on the method described in (Xu et al., 2018). This photogrammetric point cloud is annotated by the use of our 3D labeled space. In Fig. 6, we provide the labeling result of the photogrammetric point cloud. Compared with the RGB-textured original point cloud, it is clear that the majority of this point cloud has been correctly labeled. However, some parts are still wrongly labeled or missed. For example, the area marked by the black dashed box is missing in the labeled result. The original photogrammetric point cloud includes around 33 million points, but after the labeling only 30 million points are correctly labeled. This error is caused by the registration process, which introduces a bias in the position of the 3D voxels, so that testing points do not fall into the correct voxels or are located outside the voxel space. In addition, our generated 3D labeled space can also be used to generate annotated MLS point clouds of ultra-high density. In Fig. 7, we display an illustration of the annotated original MLS point clouds. In this annotated dataset, the points of each scan have been labeled individually. This work was carried out in collaboration with Fraunhofer IOSB, and the dataset has already been published. In this case, the voxelized 3D space has a fixed resolution, with a voxel size of 0.2 m. Therefore, at the connections between the ground surface and facades, there are still some obvious "zig-zag" edges, which can be further improved by the use of the multi-resolution strategy.
Figure 7. Individually labeled scans of the MLS dataset.

CONCLUSION
In this work, we proposed a novel strategy for automatically generating ground truth datasets for analyzing point clouds from different sensors and for validating algorithms. Without directly labeling the 3D points, a multi-resolution 3D voxel grid of the testing site is generated, with the entire 3D space labeled based on a simple annotated basic point cloud. Benefiting from the labeled 3D spatial information, we can easily create new annotated 3D point clouds of the same scene from different sensors, directly by considering the labels of the 3D space in which the points are located, which can provide benchmark datasets of various sensors for the validation and evaluation of algorithms related to point cloud interpretation. Moreover, our proposed voting-based approach for assigning labels to the 3D space can also be used for fusing datasets of multiple sensors, which is also a promising research topic in the remote sensing community (Schmitt and Zhu, 2016).

Figure 2. Workflow of generating new labeled point cloud by using labeled 3D space.

Figure 3. (a) and (b) Labeled 3D space with resolutions of 0.5 m and 1.0 m.

Figure 5. Two obliquely mounted laser scanners of the MLS system. Figure courtesy of (Gehrung et al., 2017).

Figure 6. (a) Original testing point cloud and (b) labeled testing point cloud.