INDOOR SCENE REGISTRATION BASED ON KEY POINTS SAMPLING AND HIERARCHICAL FEATURE LEARNING

PointNet is widely used as a representation for unstructured point clouds in classification and segmentation. Recent studies, however, report that PointNet is limited in pose estimation and alignment of real environments because it learns patterns in complex scenes poorly. This paper presents an end-to-end deep learning method for point cloud registration in indoor environments. The proposed method involves three steps. First, feature pre-processing extracts the key points with an adaptive Harris 3D algorithm and generates local groups by point grouping. Second, a hierarchical feature learning network is trained to describe each local group as a feature descriptor. Finally, the network is trained with a loss function defined between feature descriptors. The key contribution is that we use the key points to generate a multi-layer feature vector, which provides contextual local features of the indoor environment. The results show that our method achieves registration accuracy comparable to that of state-of-the-art geometric methods in indoor environments. We validate the accuracy of our approach on the S3DIS dataset. The accuracy obtained indicates that the method is suitable for point cloud registration.


INTRODUCTION
In recent years, three-dimensional (3D) mapping of indoor environments based on point clouds has received considerable attention. Point clouds have become an important data source for indoor 3D models and information visualization, which are an integral part of services such as geo-hazard monitoring and urban asset management. Managing indoor structures for timely maintenance, smooth operation, and safety can be quite challenging without up-to-date spatial information on structural conditions and space use. An individual point cloud from a single scanning position cannot cover the whole environment because of limitations such as moving objects and long distances. It is therefore necessary to register point clouds scanned from multiple positions to obtain a detailed 3D map.
Point cloud registration is a basic and important technology in spatial information. Because the feature descriptor is one of the main factors in the registration process, many researchers have proposed different kinds of feature spaces for point cloud registration. Traditional methods obtain complete and detailed point clouds mainly by matching geometric pairs of points and calculating the transformation. Geometric methods such as ICP (Iterative Closest Point) and its variants establish point correspondences and perform a least-squares optimization. However, these algorithms are sensitive to the quality of the initialization and are known to be susceptible to local minima, since it is difficult to explicitly establish closest-point correspondences in the presence of data noise. In addition, many of these descriptors do not work well in real environments because of the noise and low density of the point clouds (Yew and Lee, 2018). Thus, it is necessary to develop a registration algorithm with a higher-level feature space that can be used in 3D mapping.
Inspired by the progress of deep learning in image-based tasks, such as moving object recognition and image understanding, several feature learning algorithms have been proposed based on the properties of point clouds, such as PointNet, PointCNN and PointNet++ (Li et al., 2018, Qi et al., 2017a, Qi et al., 2017b). However, the unique aspects of point clouds, including the lack of local feature description and of efficient feature analysis, limit the performance of registration and increase its complexity. Thus, it remains challenging to apply a trained network to registration. In this paper, we design a hierarchical learned feature to register point clouds. First, to improve the accuracy of registration, the key points are extracted by an adaptive Harris 3D algorithm. To generate the groups for the training network, neighbourhood points of various sizes are selected around the key points. Mini-PointNet is used as the training network to extract the feature vector. Multiple layers of feature learning compose the feature descriptor. The transformation is trained with the LK algorithm to achieve the registration (Aoki et al., 2019). Our contributions are twofold. First, key point sampling is proposed as the feature pre-processing for the input of the mini-PointNet training network, and it can be applied to indoor point clouds containing multiple objects. Second, a feature descriptor with a hierarchical structure is constructed for registration, which improves the performance of feature learning.

Classical Hand-crafted Feature
The survey by Pomerleau et al. (2015) provides a comprehensive review of traditional registration algorithms. Classical hand-crafted features were proposed before deep learning with the aim of finding correspondences between the target and source point clouds. The design of these features is mainly based on geometric knowledge of the 3D point clouds (Besl and McKay, 1992, Magnusson et al., 2007, Yang et al., 2013). Some algorithms describe the geometry around each point locally. For example, in (Zhong, 2009), feature points are selected by the principal direction or by unique curvatures and matched through descriptors. 3D-SIFT focuses on reducing the influence of scale by using the Difference-of-Gaussian (DoG) and representing the difference of the intensity values (Rusu and Cousins, 2011a). PFH and FPFH consider feature histograms and use surface normals to describe the patch around each key point (Rusu et al., 2008, Rusu et al., 2009, Tombari et al., 2010). However, around the key points, these descriptors fall into spatial bins. Other hand-crafted features describe the surface model. For example, Rotational Projection Statistics (RoPS) calculates the scatter matrix on the surface and obtains the distribution of the projected points on 2D planes (Guo et al., 2013). However, it requires surface data and cannot be applied to raw point clouds. A further evaluation of hand-crafted features can be found in the review by Hänsch et al. (2014). According to that evaluation, hand-crafted features work well for high-quality surfaces and for point clouds with low noise and high density. However, the features are unstable and sensitive to the number of scans. As a result, they do not work well on real-world point clouds that are noisy and contain few points (Yew and Lee, 2018).

Learned Feature
With the development of deep learning, learned 3D feature descriptors are widely used to describe unstructured point clouds. Some of them operate on depth images, while others operate on the point clouds directly (Zeng et al., 2017, Kehl et al., 2016). For example, PointNet takes point clouds as input to perform classification and segmentation. The structure of its training network follows that of neural networks and provides theoretical insight into raw point clouds (Qi et al., 2017a). As an extension of PointNet, PointNet++ improves the performance in complex environments by hierarchically extracting local features (Qi et al., 2017b). However, the inherent lack of structure makes it difficult to use point clouds directly for registration in deep learning architectures.
Some studies have focused on developing deep learning features for point cloud registration. For example, a deep learning feature descriptor extracted by a Siamese network has been proposed to register mobile point clouds in indoor environments. However, the descriptor relies on RANSAC to reduce wrong matching points, and the method also requires a refinement step using ICP to improve the accuracy (Zhang et al., 2019). PPFNet combines point pair features with global context to improve the descriptor (Deng et al., 2018). PointNetLK treats PointNet as an "imaging function" and uses a modified Lucas-Kanade (LK) algorithm as the loss function to minimize the distance between the candidate point clouds (Aoki et al., 2019). This work gives a good intuition that a classical image matching algorithm can be used as the loss function for point cloud training. However, these approaches do not consider different levels of feature description (Groß et al., 2019). Despite their good performance, none of these works considers both key-point-related features and a comprehensive learning structure for point cloud registration. Thus, applying learned features to matching and registration remains challenging and has great potential.

METHOD
Our work can be considered as the design of a deep learned feature with feature analysis and its application to indoor environments. In section 3.1, we introduce the idea and the mathematics of the proposed algorithm. In section 3.2, the feature pre-processing is introduced. In section 3.3, we describe the hierarchical feature structure and the training model used for point cloud alignment.

Overview
Let $\phi$ denote the feature function, $\phi: \mathbb{R}^{3 \times N} \rightarrow \mathbb{R}^{K}$. For an input point cloud $P$, $\phi(P)$ yields a $K$-dimensional feature vector descriptor. A Multi-Layer Perceptron (MLP) is applied to each point, and the output is a $K$-dimensional feature vector. Following PointNet++, $\phi$ is designed with multiple layers to extract both local and global features (Qi et al., 2017b).
The registration process is formulated as follows. Let $P_S$ and $P_T$ be the two groups of input data, the source and target point clouds, respectively. The rigid transform $T \in SE(3)$ is the transform that best aligns the source $P_S$ to the target $P_T$. The alignment process can be described as finding $T$ such that $\phi(P_T) = \phi(T \cdot P_S)$.
In order to compute $\Delta T$ at each iteration, an iterative optimization scheme is designed.
Here the Jacobian is defined as $J = \frac{\partial}{\partial \xi}\left[\phi\left(G^{-1} \cdot P_T\right)\right]$, with $J \in \mathbb{R}^{K \times 6}$, since $\phi$ is $K$-dimensional and $\xi$ holds six twist parameters. Each column of $J$ can be approximated by a finite-difference gradient, in which $t_i$ is an infinitesimal perturbation of the twist parameters $\xi$, $R$ is the generator of the exponential map with twist parameters, and $J^{+}$ is the Moore-Penrose inverse of $J$.
The transformation matrix is recomputed in each loop using equation 3, and the source data are then updated with the new transformation matrix. The final pose estimate $T$ is the composition of the transformations from all iterations, as in equation 4. The iteration continues until $\Delta T$ falls below a threshold.
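The iterative update described above can be sketched in NumPy as follows. This is a minimal illustration in the spirit of the PointNetLK formulation (Aoki et al., 2019), not the authors' implementation. `toy_phi` is a hypothetical moment-based stand-in for the learned feature function, and the step size `t`, the tolerance `tol`, the twist ordering (rotation first, then translation) and all function names are assumptions made for this example.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ v equals np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (w, v) in R^6 to a 4x4 rigid transform."""
    w = np.asarray(xi[:3], dtype=float)
    v = np.asarray(xi[3:], dtype=float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-9:
        # First-order approximation for very small rotations
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * (W @ W)
        V = np.eye(3) + b * W + c * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def apply(T, P):
    """Apply a 4x4 rigid transform to an (N, 3) point array."""
    return P @ T[:3, :3].T + T[:3, 3]

def toy_phi(P):
    """Hypothetical permutation-invariant feature: mean and second moments (K = 12)."""
    return np.concatenate([P.mean(axis=0), (P.T @ P / len(P)).ravel()])

def jacobian(phi, P_T, t=1e-2):
    """Finite-difference Jacobian of phi with respect to the six twist parameters."""
    f0 = phi(P_T)
    cols = []
    for i in range(6):
        e = np.zeros(6)
        e[i] = t
        cols.append((phi(apply(se3_exp(-e), P_T)) - f0) / t)
    return np.stack(cols, axis=1)  # shape (K, 6)

def lk_register(phi, P_S, P_T, max_iter=50, tol=1e-7):
    """Iteratively estimate T so that phi(T . P_S) approaches phi(P_T)."""
    J_pinv = np.linalg.pinv(jacobian(phi, P_T))  # Moore-Penrose inverse, computed once
    f_T = phi(P_T)
    T_est = np.eye(4)
    for _ in range(max_iter):
        xi = J_pinv @ (phi(P_S) - f_T)   # twist update from the feature residual
        dT = se3_exp(xi)
        P_S = apply(dT, P_S)             # move the source with the incremental transform
        T_est = dT @ T_est               # compose the increments into the final estimate
        if np.linalg.norm(xi) < tol:
            break
    return T_est
```

Because the Jacobian depends only on the target cloud, it is computed once and reused in every iteration, which is the main attraction of this inverse-compositional form.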

Feature Pre-processing
In order to reduce the noise in the point cloud data, a pre-processing step is designed that includes statistical filtering and voxel filtering (Zhang et al., 2019, Rusu and Cousins, 2011a). Statistical filtering is used to remove noise points, and the voxel filter is used to reduce the resolution. The point cloud registration process then calculates the transformation for the coordinate alignment.
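The two filters can be illustrated with a short NumPy sketch. It is a brute-force stand-in written for clarity, not the PCL or Open3D routines used in the experiments. The function names and the parameters `k`, `std_ratio` and `voxel` are assumptions chosen for this example.

```python
import numpy as np

def statistical_filter(P, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is unusually large."""
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # (N, N) pairwise distances
    knn = np.sort(d, axis=1)[:, 1:k + 1]                       # skip the zero self-distance
    mean_d = knn.mean(axis=1)
    keep = mean_d <= mean_d.mean() + std_ratio * mean_d.std()
    return P[keep]

def voxel_filter(P, voxel=0.05):
    """Replace all points that fall into the same voxel by their centroid."""
    idx = np.floor(P / voxel).astype(np.int64)
    _, inverse, counts = np.unique(idx, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(counts), P.shape[1]))
    np.add.at(sums, inverse, P)                                # accumulate points per voxel
    return sums / counts[:, None]
```

The pairwise distance matrix costs O(N²) memory, so a spatial index such as a k-d tree would be needed for full indoor scans.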
Since the transformation matrix can be calculated from several matching pairs of points between the source and target point clouds, it is more efficient to use the most informative points than to use all the points. Compared with farthest point sampling (FPS) or random sampling, key points give better feature extraction performance for the same number of sampled points. Key point sampling is regarded as the feature analysis in this paper.
In the review of key point sampling, several studies show that the Harris 3D method is robust to several transformations, including noise, local scaling and the presence of holes (Guo et al., 2014, Hänsch et al., 2014). In this paper, the adaptive Harris 3D method is used to extract the key points from the source and target point clouds (Sipiran and Bustos, 2011). If a point is selected as a key point, the cluster of its neighbourhood points is taken as the local patch that represents it. The selection of the neighbourhood points is shown in Algorithm 1.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)
Figure 1. The input point clouds, source $P_S$ and target $P_T$, are passed through feature learning and an MLP to compute the feature vectors $\phi(P_S)$ and $\phi(P_T)$. The Jacobian matrix $J$ is computed using $\phi(P_T)$. The pose information $\Delta T$ is updated incrementally while its value is higher than the threshold. During training, one loss function is used, which is based on the difference between the estimated transform and the ground truth transform.
Based on the neighbourhood set of each point, given a point $p$ with $p \in P_S$ or $p \in P_T$, the neighbouring points are translated and a quadratic surface $f(x, y)$ is fitted to them according to equation 5. The derivatives of $f(x, y)$ are then calculated at the point. A symmetric matrix $E$ is defined using these derivatives, as in equation 6. The points with the highest Harris responses are selected as the key points, and their neighbourhood points are selected as the local patches.
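The surface fit and response computation can be sketched as follows. Equations 5 and 6 are not reproduced in this text, so the sketch assumes the usual Harris form: a least-squares quadratic fit, Gaussian-weighted averages of the derivative products (window variance 2, an assumption), and the response det(E) - k tr(E)² with a hypothetical k = 0.04. It also assumes that the neighbours are already expressed in a local frame with the query point at the origin and the surface normal along z. The adaptive neighbourhood selection of Algorithm 1 is not covered.

```python
import numpy as np

def harris_response(local_pts, k=0.04):
    """Harris-style response for an (n, 3) neighbourhood given in a local frame.

    Assumes the query point is the origin and the surface normal is the z axis.
    """
    x, y, z = local_pts[:, 0], local_pts[:, 1], local_pts[:, 2]
    # Fit f(x, y) = p1/2 x^2 + p2 xy + p3/2 y^2 + p4 x + p5 y + p6 by least squares
    design = np.column_stack([0.5 * x**2, x * y, 0.5 * y**2, x, y, np.ones_like(x)])
    p1, p2, p3, p4, p5, _ = np.linalg.lstsq(design, z, rcond=None)[0]
    # Gaussian-weighted averages of fx^2, fy^2 and fx*fy, with
    # fx = p1 x + p2 y + p4 and fy = p2 x + p3 y + p5
    A = p4**2 + 2 * p1**2 + 2 * p2**2
    B = p5**2 + 2 * p2**2 + 2 * p3**2
    C = p4 * p5 + 2 * p1 * p2 + 2 * p2 * p3
    E = np.array([[A, C], [C, B]])
    return np.linalg.det(E) - k * np.trace(E) ** 2

def top_keypoints(responses, n):
    """Indices of the n points with the highest responses."""
    return np.argsort(responses)[::-1][:n]
```

A flat patch gives a response near zero, while a patch that curves in two directions gives a clearly positive one, which is why corners and bumps are preferred as key points.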

Hierarchical Feature Learning
The feature descriptor is composed of a number of feature layers to achieve the hierarchical structure (Qi et al., 2017b). Each feature layer takes as input an $N \times (d + C)$ matrix that represents $N$ points, each composed of $d$-dimensional coordinates and a $C$-dimensional point feature. It outputs an $N' \times (d + C')$ matrix, where $N'$ is the number of sub-sampled points, each with $d$-dimensional coordinates and an extracted $C'$-dimensional point feature. The layers of the descriptor are introduced in the following paragraphs.
In the sampling layer, given the input points $\{x_1, x_2, \ldots, x_n\}$, a subset of points $\{x_{i_1}, x_{i_2}, \ldots, x_{i_m}\}$ is extracted using the method of section 3.2, so that $x_{i_j}$ is the most informative point for each layer. Compared with random sampling or farthest point sampling (FPS), this method provides more useful information for point cloud registration. The output of the sampling layer is the set of $N'$ selected points.
In the grouping layer, the input is a point set of size $N \times (d + C)$, with $d$-dimensional coordinates and $C$-dimensional features. It outputs several point sets with a total size of $N' \times K \times (d + C)$. Each point set corresponds to a local region that PointNet converts into one fixed-length feature vector. $K$ varies to adapt to each point set. The neighbourhood points are selected in the feature pre-processing step. Compared with other methods, such as K nearest neighbour search or threshold-based methods, this selection performs better at acquiring the local pattern (Jiang et al., 2018).
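A grouping step of this kind can be sketched as below. The sketch uses a fixed search radius as a simple stand-in for the adaptive neighbourhood selection, so the radius value, the function name and the choice to return a list of variable-size groups are all assumptions for illustration.

```python
import numpy as np

def group_around_keypoints(points, key_idx, radius=0.1, d=3):
    """Collect, for each key point, the neighbours within `radius`.

    points  : (N, d + C) array of coordinates followed by per-point features
    key_idx : indices of the N' sampled key points
    Returns a list of N' arrays of shape (K_i, d + C); K_i varies per group.
    """
    groups = []
    for i in key_idx:
        dist = np.linalg.norm(points[:, :d] - points[i, :d], axis=1)
        group = points[dist <= radius].copy()
        group[:, :d] -= points[i, :d]   # express coordinates relative to the key point
        groups.append(group)
    return groups
```

Expressing each group relative to its key point keeps the learned local pattern independent of where the patch lies in the scene.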
In the PointNet layer, the input is a point set of size $N' \times K \times (d + C)$. The output size is $N' \times (d + C')$, which is the basic building block for local pattern learning from PointNet (Qi et al., 2017a). The function can be summarized as follows:
$$f(x_1, x_2, \ldots, x_n) = \gamma\left(\max_{i=1,\ldots,n} \{h(x_i)\}\right)$$
where $\{x_1, x_2, \ldots, x_n\}$ represents the unordered point cloud, with $x_i \in \mathbb{R}^{d}$. The set function is designed to transform the point set into a feature vector. $\gamma$ and $h$ are multi-layer perceptron (MLP) networks. $f$ is the set function and is invariant to the order of the input points in the source and target point clouds.
Figure 2. Different overlapping with the initial transformation. (a) The ratio of overlapping is 95%, (b) the ratio of overlapping is 75%, (c) the ratio of overlapping is 50%, (d) the ratio of overlapping is less than 30%. The point clouds in gray scale are the target point clouds, while the point clouds in RGB are the source point clouds.
In PointNet, the maximum pooling, average pooling and weighted sum pooling functions are used as symmetric pooling functions following the MLP operation (Qi et al., 2017a), to achieve permutation invariance for unordered point clouds. In this paper, maximum pooling is used as the pooling function.
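The set function with maximum pooling can be illustrated with a small NumPy example. The layer sizes and random weights below are hypothetical and untrained; the example only shows that the output does not depend on the order of the input points.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a hypothetical MLP with the given layer sizes."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """Apply dense layers with ReLU activations along the last axis of x."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def set_function(points, h_weights, gamma_weights):
    """f(x_1, ..., x_n) = gamma(max_i h(x_i)) for an (n, d) point array."""
    per_point = mlp(points, h_weights)   # h applied to every point independently
    pooled = per_point.max(axis=0)       # symmetric maximum pooling over the points
    return mlp(pooled, gamma_weights)    # gamma applied to the pooled vector

rng = np.random.default_rng(0)
h_w = init_mlp(rng, [3, 16, 32])
g_w = init_mlp(rng, [32, 8])
pts = rng.normal(size=(64, 3))
feat = set_function(pts, h_w, g_w)
feat_shuffled = set_function(pts[rng.permutation(64)], h_w, g_w)
```

Here `feat` and `feat_shuffled` agree up to floating-point rounding, because the maximum over the points is unaffected by their order.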

Experiment Design
We experiment with various types of objects in different real indoor scenarios. The Stanford S3DIS indoor dataset is used to generate the source and target point clouds (Armeni et al., 2017). To evaluate the performance in indoor environments, we use the proposed method to estimate the transformation in Area 1 of the S3DIS dataset. The dataset contains the corresponding semantic annotations and global XYZ images as well as surface normals.
To evaluate our method, we discuss the effect of different overlap ratios. To prepare the target and source point clouds, we apply a rotation and a translation to the test data with different overlap ratios. The initial translation and rotation for testing are in the ranges of [0, 5] metres and [0, 90] degrees, respectively. Figure 2 shows examples of the different overlaps and initial transformations. For evaluation purposes, three algorithms, ICP, NDT and Go-ICP, are used as baselines (Besl and McKay, 1992, Yang et al., 2015). The compared algorithms are run on the same point clouds without any additional processing.
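One way to draw such initial transformations is sketched below. The sampling scheme is an assumption, since the text does not state how rotation axes or translation components are chosen: the axis is uniform on the sphere, the angle is uniform in [0, 90] degrees, and each translation component is uniform in [0, 5] m.

```python
import numpy as np

def random_rigid_transform(rng, max_angle_deg=90.0, max_trans=5.0):
    """Draw a 4x4 rigid transform with bounded rotation angle and translation."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                       # random unit rotation axis
    angle = np.deg2rad(rng.uniform(0.0, max_angle_deg))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    # Rodrigues' rotation formula
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = rng.uniform(0.0, max_trans, size=3)
    return T

def make_source(target, T):
    """Apply T to an (N, 3) target cloud to obtain a transformed source cloud."""
    return target @ T[:3, :3].T + T[:3, 3]
```

Controlling the overlap ratio would additionally require cropping one of the clouds, which this sketch leaves out.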
For each training dataset, we prepare separate h5 files for the point clouds and for their properties. For each input, the size of the training and testing data is $N \times 4096$. Since we only consider geometric features, only the coordinate values are used, normalized from $XYZ$ into $X'Y'Z'$. The epochs, batch size, learning rate and momentum are set to 200, 16, 0.01 and 0.9, respectively. The experiments are run on a single GPU with TensorFlow 1.70.

Experiment Result
The point clouds are pre-processed by detecting key points and grouping. This step is implemented with PCL and Open3D (Rusu and Cousins, 2011b, Zhou et al., 2018). Figure 3 shows the point sampling (partial data from Conference Room 1 of S3DIS) with a radius of 0.1. As shown in Figure 4, compared with other neighbourhood selection algorithms, the adaptive method determines the size of the neighbourhood without a constant value. Two standard indoor environments are selected to evaluate our method. The overlap ratio between the target and source point clouds is 100%. Figures 5 and 6 show the registration results for the conference room and the hallway, compared with ICP, NDT and Go-ICP. As shown in Figures 5 and 6 and Table 1, our method performs well in the two standard indoor environments.
To evaluate the effect of different overlap ratios, the root mean square error (RMSE) is calculated, as shown in Table 2. For ICP, the RMSE is calculated from the corresponding points after registration. For our method, the RMSE is calculated using the true transformation between the target and source point clouds.
In the RMSE equation, $N$ is the number of corresponding points, $T_{st}$ is the known transformation matrix between the source and target point clouds, and $S$ represents the source point clouds.
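A plausible reading of this error measure is sketched below. The RMSE equation itself is not reproduced in this text, so the form used here is an assumption: each source point is mapped once with the known transformation and once with the estimated one, and the root mean square of the resulting distances is reported. The names `T_true` and `T_est` are hypothetical.

```python
import numpy as np

def apply(T, P):
    """Apply a 4x4 rigid transform to an (N, 3) point array."""
    return P @ T[:3, :3].T + T[:3, 3]

def registration_rmse(S, T_true, T_est):
    """RMSE between source points mapped by the known and the estimated transform."""
    diff = apply(T_true, S) - apply(T_est, S)
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))
```

For example, an estimate that differs from the known transformation only by a translation offset of (0.3, 0.4, 0) m gives an RMSE of about 0.5 m.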

CONCLUSION
This paper proposes a learned-feature registration algorithm with point pre-processing and a hierarchical training network. The adaptive Harris 3D algorithm is used to detect the key points, and the hierarchical feature descriptor captures the features of the extracted points. The results show that our approach achieves good accuracy and computational efficiency at different overlap ratios in indoor environments. In future work, we will evaluate the method on more datasets and scenes to improve the accuracy of the algorithm.