Robust indoor point cloud classification by fusing LSTM neural networks with supervoxel clustering

To address the lack of training data and the inaccurate classification of existing 3D point cloud segmentation and classification methods, this paper proposes a high-precision classification algorithm for indoor point clouds that fuses an LSTM neural network with super voxels. The algorithm first performs super voxel segmentation on the original point cloud and uses the super voxels as the basic units for machine learning classification; it then introduces a Long Short-Term Memory (LSTM) neural network to model the neighbourhood relationships between super voxels and optimize the classification results. Finally, the accuracy of the proposed method is evaluated on an open dataset, and the experimental results show that a classification accuracy of 83.2% can be achieved.


INTRODUCTION
With the increasing number of indoor spatial applications, semantic segmentation of indoor 3D data has become a hot research topic (Kang et al., 2020). It is key to supporting various intelligent applications, such as indoor navigation (Choi et al., 2014), indoor robotics (Taira et al., 2018) and augmented reality. Semantic information extraction from point cloud data is the process of identifying and extracting elements from a cluttered and unorganised point cloud. The core of the process is to use segmentation algorithms to divide the disorganised point cloud data of a scene into a series of point sets, so that each set contains data with the same semantic and perceptual information and corresponds to a certain type of entity within the scene, giving the point cloud object-level semantic information (Hu et al., 2020). A lot of work has been done to improve the segmentation accuracy and processing speed of indoor point cloud data, but two important challenges remain. First, the original point cloud data is cluttered, sparse and unstructured, and suffers from problems such as incomplete data collection, uneven density and noise (Tran et al., 2019). This makes it difficult to generalise point cloud segmentation algorithms to different scenarios. Second, current point cloud segmentation algorithms mainly classify point cloud data based on colour and geometric features and rely on a large amount of training data for model learning, while the complex and diverse structure of objects in indoor space makes these algorithms prone to low applicability and poor stability (Qi et al., 2017a).
Current research on semantic segmentation of indoor 3D point clouds falls into three main types: multi-view based point cloud classification, voxel grid based point cloud classification, and classification algorithms based on the original 3D point cloud.
Multi-view based point cloud classification algorithms project the 3D point cloud into 2D images from different angles according to the 3D imaging principle, and then perform semantic segmentation of the scene with mature 2D image segmentation algorithms (Su et al., 2015). Such algorithms can initialise the multi-view model parameters from a mature, highly accurate pre-trained 2D convolutional neural network, which significantly reduces the difficulty of training while avoiding the effects of 3D geometric problems such as hollow objects and non-manifold geometry.
MVCNN (Su et al., 2015) is a deep neural network for multi-view 3D object recognition. It uses a convolutional neural network shared across views to extract the 2D image features of each view, and then fuses the multi-view features into a global 3D object descriptor through a max pooling layer over the views to classify the object. The GVCNN (Feng et al., 2018) and Dominant (Wang et al., 2019) frameworks improve on MVCNN by fusing multi-view features in groups, further exploiting the similarity between views to improve recognition accuracy. The RotationNet method (Kanezaki et al., 2018) combines two objective functions, object recognition and view estimation, to build a semantic recognition neural network, and treats the viewpoint of each view as a latent variable during training. However, these methods require powerful GPUs for training and cannot take into account all features in 3D space.
Some researchers have therefore investigated voxel representations of 3D point clouds to learn features from the scene directly in 3D space. Unlike point clouds and polygon meshes, each voxel has a regular index in the 3D grid, so 2D convolutional neural networks can be extended to 3D convolutional neural networks and applied directly to the voxels. 3D-ShapeNet (Wu et al., 2015), the first neural network model to adopt this idea, expresses a 3D shape as the spatial distribution of binary variables (the presence or absence of the object in each voxel) on a 3D voxel grid. VoxNet (Maturana and Scherer, 2015) uses a shallow 3D convolutional neural network to process voxelised 3D point cloud data. The ORION method (Sedaghat et al., 2016) extends VoxNet with a sub-objective for estimating the rotational orientation of the object, which improves the accuracy of semantic recognition. However, the processing time and storage footprint of voxels grow cubically with resolution, and most early methods could only learn at low resolution with shallow neural networks.
The OctNet method (Riegler et al., 2017) therefore proposes dividing the 3D grid with unbalanced octrees to address the sparsity of occupied voxels, so that the algorithm can be used for higher-resolution and deeper neural network training.
The above methods still suffer from feature loss during feature computation, and in recent years many researchers have investigated how to learn features directly from raw point clouds for semantic classification. PointNet (Qi et al., 2017a) is the first neural network model to operate directly on 3D point clouds; it first learns the features of each point with a multilayer perceptron (MLP) and then uses a symmetric function to obtain a global object descriptor. PointNet++ (Qi et al., 2017b) adds a hierarchical feature extraction structure to PointNet. It partitions the point cloud into local groups processed by several set abstraction layers, which act similarly to the convolutional layers in a convolutional neural network, and the receptive field of the output features is obtained by stacking several set abstraction layers. In contrast to PointNet++, KCNet (Shen et al., 2018) uses graph pooling layers and kernel correlation to mine local feature information in the point cloud.
Similar to the aim of KCNet, kd-Net (Klokov and Lempitsky, 2017) builds a kd-tree on the input point cloud and then extracts feature information hierarchically from the leaf nodes to the root node in a bottom-up manner. However, because indoor structures are highly complex, the data is prone to occlusion, and training datasets are difficult to obtain, current indoor 3D point cloud semantic segmentation methods take a long time to train and still struggle to achieve the desired classification accuracy.
To address the internal inconsistency of classification targets in existing 3D point cloud segmentation and classification methods, we propose a high-precision classification algorithm for indoor point clouds jointly optimized by a super voxel random forest (Ramiya et al., 2016; Oshiro et al., 2012) and a Long Short-Term Memory (LSTM) neural network (Sherstinsky, 2020). Exploiting the internal feature consistency of super voxels, the algorithm divides the original point cloud into super voxels, uses them as the basic unit for multivariate feature calculation, and builds a super voxel random forest classification model for indoor point clouds to obtain a coarse classification of the point cloud data. On this basis, an LSTM neural network is trained on the neighbourhood connectivity of the coarsely classified super voxels and used to refine the coarse classification results. Finally, the validity and accuracy of the proposed classification method are verified on an open dataset, and the results show that the method achieves a classification accuracy of 83.2%.

COARSE CLASSIFICATION OF INDOOR POINT CLOUDS IN SUPER VOXEL RANDOM FORESTS
As shown in Figure 1, the proposed indoor point cloud segmentation method is jointly optimized by a super voxel random forest and an LSTM network. In the coarse classification stage, the original point cloud is clustered into super voxels to obtain the super voxel centroids, and the multi-dimensional features of the super voxels are calculated and used to train the random forest model. The coarse classification process mainly consists of randomization, decision tree generation and voting classification steps. The super voxel features used in this paper are of six main types: local density features, Point Feature Histogram (PFH) features (Rusu et al., 2009), normal vector features, colour information, relative elevation features and shape features. After the super voxel RF classification, a super voxel LSTM neural network built with the Keras deep learning framework is used to optimize the point cloud classification.

Super voxel characterisation
The super voxel features used in this paper are of six main types, namely local density features, Point Feature Histogram (PFH) features, normal vector features, colour information, relative elevation features and shape features. Because the proposed method uses super voxels as the basic classification unit, the features extracted below describe the centroid of each super voxel.

Local density feature: The local density feature is the average distance from a point to its k nearest neighbours. For each super voxel centroid, fast retrieval of neighbourhood points is achieved by constructing a KdTree with the Fast Library for Approximate Nearest Neighbors (FLANN), and the local density feature is then obtained as the average Euclidean distance between the centroid and its neighbouring points.
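As an illustration, the local density feature can be sketched as follows. This is a minimal sketch, not the implementation used in this paper: the function name `local_density` and the value of `k` are hypothetical, and a brute-force distance computation stands in for the KdTree/FLANN search.

```python
import numpy as np

def local_density(centroids, cloud, k=10):
    """Mean Euclidean distance from each centroid to its k nearest cloud points."""
    centroids = np.asarray(centroids, dtype=float)
    cloud = np.asarray(cloud, dtype=float)
    # Pairwise distances, shape (n_centroids, n_points). Brute force is an
    # assumption made for brevity; it replaces the KdTree/FLANN search.
    dists = np.linalg.norm(centroids[:, None, :] - cloud[None, :, :], axis=2)
    # Keep the k smallest distances for each centroid and average them.
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Toy cloud: points at distances 1, 2, 3 and 10 from the origin.
cloud = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3], [0, 0, 10]], dtype=float)
density = local_density([[0.0, 0.0, 0.0]], cloud, k=3)
```

A smaller value indicates a denser neighbourhood around the centroid.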
PFH feature: The PFH feature describes the geometric properties of a point's k-neighbourhood by parameterising the spatial differences between the query point and its neighbours and accumulating them in a multi-dimensional histogram. Specifically, it describes the geometry of the sample through the relationships between the points in the k-neighbourhood and their normal vectors. In this paper, the PFH features of each centroid are obtained by building a KdTree of the original point cloud and computing the feature over a k-neighbourhood search.
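The quantities that PFH accumulates can be sketched for a single point pair. The sketch below follows the angular parameterisation of Rusu et al. (2009) and assumes unit-length normals; the function name `pfh_pair_features` is hypothetical, and the histogram binning over all pairs in the neighbourhood is omitted.

```python
import numpy as np

def pfh_pair_features(p_s, n_s, p_t, n_t):
    """Angular features (alpha, phi, theta) and distance d for one point pair.

    p_s, p_t are 3D points; n_s, n_t are their normals, assumed unit length.
    """
    p_s, n_s, p_t, n_t = (np.asarray(a, dtype=float) for a in (p_s, n_s, p_t, n_t))
    diff = p_t - p_s
    d = np.linalg.norm(diff)
    if d == 0.0:
        raise ValueError("the two points coincide")
    diff /= d                      # unit vector from source to target
    # Darboux frame (u, v, w) anchored at the source point.
    u = n_s
    v = np.cross(u, diff)
    v_norm = np.linalg.norm(v)
    if v_norm == 0.0:
        raise ValueError("source normal is parallel to the connecting line")
    v /= v_norm
    w = np.cross(u, v)
    alpha = float(np.dot(v, n_t))
    phi = float(np.dot(u, diff))
    theta = float(np.arctan2(np.dot(w, n_t), np.dot(u, n_t)))
    return alpha, phi, theta, float(d)

# Two points on the same horizontal plane with identical upward normals.
alpha, phi, theta, d = pfh_pair_features([0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1])
```

For two points on a common plane with parallel normals, all three angular features vanish, so such pairs fall in the same histogram bin.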
Normal vector features: The normal vector of each point in the point cloud represents the orientation of the surface on which the point lies, and can describe both planar and curved surfaces. In this paper, the normal vector of each super voxel is calculated by plane fitting, and its angle to the vertical direction is used as a random forest feature.
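A minimal sketch of this feature is given below, assuming that the z axis is the vertical direction and that the plane is fitted by least squares through a singular value decomposition; the function name `normal_angle_to_vertical` is hypothetical.

```python
import numpy as np

def normal_angle_to_vertical(points):
    """Fit a plane to the points and return (normal, angle to the z axis in degrees)."""
    pts = np.asarray(points, dtype=float)
    centred = pts - pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # normal of the least-squares plane through the centred points.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    # The sign of the normal is ambiguous, so the absolute cosine is used and
    # the angle lies in [0, 90] degrees.
    cos_angle = np.clip(abs(normal[2]), 0.0, 1.0)
    return normal, float(np.degrees(np.arccos(cos_angle)))

floor_patch = [[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]]   # horizontal surface
wall_patch = [[0, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1]]    # vertical surface
_, floor_angle = normal_angle_to_vertical(floor_patch)
_, wall_angle = normal_angle_to_vertical(wall_patch)
```

Under these assumptions, horizontal surfaces such as floors give angles near 0 degrees and vertical surfaces such as walls give angles near 90 degrees.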
Colour features: Most classification targets in indoor environments are consistent in colour, so RGB colour plays an important role in indoor point cloud segmentation. Because super voxels are the basic classification unit in this paper, the colour of each super voxel is taken as the average RGB value of the points within it.
Relative elevation features: The relative elevation features of the super voxels are obtained from the difference between the height of the centre point of the super voxels and the elevation of the ground plane.
Shape features: The shape feature parameters are calculated from combinations of the eigenvalues obtained by local PCA decomposition of the point cloud. Traditionally, the eigenvalues are calculated on the local point cloud returned by a k-neighbourhood search. To obtain a more accurate neighbourhood, this paper takes the super voxel itself as the neighbourhood of its centroid and uses the points inside the super voxel to calculate the eigenvalues. The decomposition yields three eigenvalues, $\lambda_1$, $\lambda_2$ and $\lambda_3$, ordered from largest to smallest, i.e. $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq 0$. From these, the curvature, linearity, planarity, scattering and anisotropy of the super voxel are calculated; the calculations are shown in Table 1.
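The eigenvalue-based shape features can be sketched as follows. The formulas in this sketch are the definitions commonly used for eigenvalue features and are an assumption here; they may differ from those of Table 1. The function name `shape_features` is hypothetical.

```python
import numpy as np

def shape_features(points):
    """Eigenvalue-based shape features of the points inside one super voxel."""
    pts = np.asarray(points, dtype=float)       # at least two points are needed
    cov = np.cov(pts, rowvar=False)             # 3x3 covariance matrix
    # eigvalsh returns ascending eigenvalues; reverse so that l1 >= l2 >= l3.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    l1, l2, l3 = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    if l1 == 0.0:
        raise ValueError("all points coincide")
    return {
        "linearity": (l1 - l2) / l1,            # assumed definition
        "planarity": (l2 - l3) / l1,            # assumed definition
        "scattering": l3 / l1,                  # assumed definition
        "anisotropy": (l1 - l3) / l1,           # assumed definition
        "curvature": l3 / (l1 + l2 + l3),       # assumed definition
    }

# Points on a straight line: a purely linear structure.
line_features = shape_features([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
```

With these definitions, linearity, planarity and scattering sum to one, so they can be read as the degree to which the super voxel resembles a line, a plane or a volume.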

Super voxel random forest model construction
The random forest model in this paper uses super voxels as the basic unit for training and prediction. As in the traditional random forest construction, the super voxel random forest consists of N decision trees $\{h(X, \theta_n),\ n = 1, 2, 3, \dots, N\}$ as base classifiers, and the final combined classifier is obtained by ensemble learning: the random forest collects the classification result of each decision tree and votes on the output class. Here $\{\theta_n,\ n = 1, 2, 3, \dots, N\}$ is a sequence of random variables determined by the Bagging strategy and the feature subspace strategy of the random forest. Specifically: 1) The Bagging strategy randomly draws, with replacement, N training sample sets $\{T_n,\ n = 1, 2, 3, \dots, N\}$ of the same size as the original dataset (each containing about 63% of the distinct original samples), and a decision tree is trained on each training sample set $T_n$. 2) The feature subspace strategy splits each node of a decision tree by selecting a subset of the data features and choosing the best feature among them to split the node.
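The figure of about 63% follows from sampling with replacement: the probability that a given sample is never drawn in n draws is $(1 - 1/n)^n$, which tends to $1/e \approx 0.368$ as n grows, so each bootstrap set contains roughly $1 - 1/e \approx 0.632$ of the distinct samples. The short sketch below checks this numerically; the function names and the sample size are illustrative only.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n sample indices from range(n) with replacement (one Bagging set)."""
    return [rng.randrange(n) for _ in range(n)]

def expected_unique_fraction(n):
    """Expected fraction of distinct original samples in one bootstrap set."""
    return 1.0 - (1.0 - 1.0 / n) ** n

rng = random.Random(0)             # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)
observed = len(set(sample)) / n    # fraction of distinct samples actually drawn
expected = expected_unique_fraction(n)
```

The samples left out of a bootstrap set (about 37%) are the out-of-bag samples of the corresponding tree.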
Finally, the random forest is a combined classifier that integrates multiple decision tree classifiers and decides the classification result by voting. The basic classification process is as follows:
1) A bootstrap sampling method is used to randomly select K training sample sets from the original sample set.
2) A decision tree model is constructed for each of the K training sample sets to obtain K classification results. Specifically, each decision tree selects N features from the M features of the input variables, where the value of N is generally determined from M by a formula. Information entropy and the Gini index are used as node splitting criteria, as shown in Eqs. 1 and 2, where $n$ denotes the number of categories contained in the training data set $D$ and $p_i$ denotes the probability that a training sample belongs to category $i$.
3) The final classification is determined by voting over the K classification results.
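The splitting criteria and the voting step can be sketched as follows. The sketch assumes the standard definitions $H(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$ for information entropy and $\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^2$ for the Gini index, with $p_i$ estimated as the fraction of samples in category $i$; the function names are hypothetical.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of samples in each category present in the label list."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def entropy(labels):
    """Information entropy of a label set (standard definition, assumed)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini index of a label set (standard definition, assumed)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def majority_vote(tree_predictions):
    """Final class: the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

mixed = ["wall", "wall", "floor", "floor"]    # evenly mixed node
pure = ["chair", "chair", "chair"]            # pure node
vote = majority_vote(["wall", "chair", "wall"])
```

Both criteria are zero for a pure node and largest for an evenly mixed one, so a split is chosen to reduce them as much as possible.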

LSTM neural network optimization for fine classification of indoor point clouds
Unlike the original point cloud, the clustered point cloud provides the connection relationships between super voxels, and these relationships capture the association characteristics between different types of elements. For example, the super voxels of a desktop are correlated with the super voxels of the clutter on it, and exploiting this correlation can prevent the clutter from being incorrectly segmented as objects such as chairs.
Therefore, this paper proposes a Long Short-Term Memory (LSTM) neural network optimisation method that models super voxel association sequences on top of the random forest coarse classification of the super voxels, and uses it to optimise the classification results of indoor 3D point clouds. The core reason for modelling the spatial connectivity of super voxels with an LSTM network is the way the LSTM handles sequence data: the cell state in its internal structure runs horizontally through the whole sequence and is passed along the chain with only linear interactions, so the information it carries can remain approximately constant, thus preserving long-term information. This gives the LSTM a significant advantage in extracting both long-term and short-term information, which is why it has been used for data with clear sequence and correlation, such as long text classification and time series prediction, where it often obtains better results than traditional time series prediction methods.
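The role of the linear interactions can be made concrete with a single step of a standard LSTM cell. This is a generic sketch, not the network of this paper; the weight layout (gates stacked in the order input, forget, candidate, output) and all names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    W has shape (4H, D), U has shape (4H, H) and b has shape (4H,), with the
    four gates stacked in the assumed order: input, forget, candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate values
    o = sigmoid(z[3 * H:4 * H])    # output gate
    # The cell state is only scaled element-wise and added to: these are the
    # linear interactions that let information pass along the chain.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

D, H = 3, 2
W = np.zeros((4 * H, D))
U = np.zeros((4 * H, H))
b = np.zeros(4 * H)
b[0:H] = -20.0        # input gate almost closed
b[H:2 * H] = 20.0     # forget gate almost fully open
c_prev = np.array([0.5, -1.0])
h_next, c_next = lstm_step(np.ones(D), np.zeros(H), c_prev, W, U, b)
```

With the forget gate open and the input gate closed, the cell state passes through the step essentially unchanged, which is the mechanism that preserves long-term information.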
This paper uses the Keras deep learning framework to build the super voxel LSTM neural network as follows.
1) After the super voxels of the indoor scene are clustered, the scene is partitioned into blocks and the super voxels are linked to each other.
2) The neighbourhood of each super voxel in the scene is searched iteratively: the super voxels surrounding the current super voxel are found with a KdTree, combined with their feature information, and arranged into a spatial sequence ordered by distance.
3) A model with three LSTM layers plus one fully connected layer is designed for training the super voxel LSTM network. The LSTM layers all use the same activation function, and their output enters the fully connected layer, whose activation function yields the multi-class classification of the scene. The neural network parameters are initialized randomly and rmsprop is used as the optimizer. During training, the batch size (batch_size) is set to 128 and the number of epochs to 80. To account for the imbalance between the different types of super voxels in the training data, category weights are calculated from the number of samples per category in the training set and added to the training process.
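Steps 2) and 3) can be illustrated with a minimal sketch of the sequence construction and the category weights. Several choices here are assumptions rather than details given above: the sequence starts with the current super voxel and lists its neighbours from nearest to farthest, a brute-force search replaces the KdTree, and the weights are inversely proportional to category frequency. All function names are hypothetical.

```python
import numpy as np

def neighbour_sequences(centroids, features, seq_len):
    """Build one distance-ordered feature sequence per super voxel.

    Returns an array of shape (n_supervoxels, seq_len, n_features) whose first
    element is the super voxel itself (distance 0), followed by its neighbours
    from nearest to farthest. Brute-force search is an assumption for brevity.
    """
    centroids = np.asarray(centroids, dtype=float)
    features = np.asarray(features, dtype=float)
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(dists, axis=1, kind="stable")[:, :seq_len]
    return features[order]

def category_weights(labels):
    """Weights inversely proportional to category frequency (assumed scheme)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

centroids = [[0, 0, 0], [1, 0, 0], [3, 0, 0]]
features = [[10.0], [20.0], [30.0]]           # one toy feature per super voxel
seqs = neighbour_sequences(centroids, features, seq_len=2)
weights = category_weights([0, 0, 0, 1])      # category 1 is under-represented
```

The resulting array has the layout (samples, time steps, features) expected by recurrent layers, and a weight dictionary of this form can be supplied during training so that rare categories contribute more to the loss.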

Datasets
The point cloud dataset used in this experiment is the publicly available S3DIS dataset from Stanford University, shown in Figure 2. S3DIS is a semantic dataset with point-level semantic annotations; it is divided into 6 regions containing 272 scenes of 11 scene types, and its semantic categories include door, wall, floor, beam, window, chair, clutter and column, among others. In this paper, regions 1-5 are selected as training data and region 6 is used as the test region for accuracy evaluation.
In this paper, four commonly used point cloud classification frameworks, including RF classification based on the original point cloud, PointCNN, PVCNN++ and PointNet++, are selected and their classification results are compared and analysed. The classification accuracies of different classification methods are listed in Table 2.
It can be seen that the random forest classification applied directly to the original point cloud has the lowest accuracy of all the classification algorithms, with an mIoU of only 8.7% and an mAcc of only 24.3%. In contrast, the super voxel random forest coarse classification achieves an mAcc of 72.4%, which indicates that super voxel pre-processing effectively improves the classification accuracy of point cloud data. On the basis of the coarse classification, LSTM optimization raises the mIoU to 46% and the mAcc to 83.2%, which is similar to the accuracy obtained by the deep learning frameworks PointCNN and PVCNN++. It is worth mentioning that the LSTM optimised network proposed in this paper was trained using only the label information of region 1, while the other deep learning frameworks used regions 1-5; from the perspective of training data requirements, the proposed classification framework can therefore achieve relatively good predictions with a small training data set.

CONCLUSION
In this paper, we propose a high-precision indoor point cloud classification method jointly optimised by a super voxel random forest and an LSTM network. The method makes full use of the internal feature consistency of super voxels: it divides the original point cloud into super voxels, calculates the geometric, colour and shape features with super voxels as the basic unit, and builds a super voxel random forest classification model for indoor point clouds to achieve coarse classification. Following the idea of coarse-to-fine classification, it then introduces an LSTM that is trained on and predicts from the neighbourhood connections of the coarsely classified super voxels, taking into account the correlation between different types of elements contained in the connection relations between super voxels, and thereby optimizes the coarse classification results. Finally, the validity and accuracy of the proposed classification method were verified on an open dataset, and the results showed that the method achieves a classification accuracy of 83.2%.
Current LSTM network structures are biased towards sequential inputs with a logical order, whereas the spatial sequence data extracted from super voxels is connected pairwise. In future work, how to design a sequential neural network structure suited to the non-sequential arrangement of super voxels will be an important research direction.

Figure 1 The framework of the proposed method

Figure 2 S3DIS dataset

Experimental results and analysis
Analysis of super voxel segmentation results: Reasonable super voxel parameter settings can avoid incorrect segmentation. As shown in Figure 3, the overall segmentation accuracy of the super voxels is about 94.5%, and the segmentation accuracy for chair, floor, bookcase and sofa exceeds 97%, as shown in Figure 4. This is closely related to the distribution of indoor point clouds. The point cloud data close to walls, ceilings and tables is sparse and structurally incomplete, while the super voxel segmentation algorithm only considers normal vector, colour and shape information, which makes mis-segmentation of such points difficult to avoid. Based on the super voxel segmentation results, the proposed high-precision indoor point cloud classification method, jointly optimized by the super voxel random forest and the LSTM network, is used to perform semantic classification experiments on the segmented point clouds.

Figure 3 Super voxel segmentation accuracy

Figure 5 Loss vs epoch relationship

Analysis of point cloud classification results: The mean intersection over union (mIoU) and mean accuracy (mAcc) are used to evaluate the accuracy of point cloud classification. mIoU is the ratio of the intersection to the union of the sets of true and predicted labels, averaged over the categories, and mAcc is the ratio of the intersection to the true labels, averaged over the categories. The specific calculation method is shown in Equation (1). Suppose there are $k + 1$ categories (including the background category). Let $p_{ij}$ denote the number of points whose true category is $i$ and whose predicted category is $j$; $p_{ii}$ then denotes the number of points whose true and predicted categories are both $i$, and $p_{ji}$ the number of points whose true category is $j$ and whose predicted category is $i$.
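The two metrics can be computed from a confusion matrix whose entry $(i, j)$ is $p_{ij}$. The sketch below assumes the standard forms $\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ and $\mathrm{mAcc} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$, which are consistent with the definitions above; the function name `miou_macc` is hypothetical.

```python
import numpy as np

def miou_macc(confusion):
    """Mean IoU and mean per-category accuracy from a confusion matrix.

    confusion[i, j] is the number of points with true category i and predicted
    category j. Every category is assumed to occur at least once in the ground
    truth, otherwise the divisions below are undefined.
    """
    conf = np.asarray(confusion, dtype=float)
    tp = np.diag(conf)                  # p_ii: correctly classified points
    true_totals = conf.sum(axis=1)      # sum_j p_ij: points truly in category i
    pred_totals = conf.sum(axis=0)      # sum_j p_ji: points predicted as category i
    iou = tp / (true_totals + pred_totals - tp)
    acc = tp / true_totals
    return float(iou.mean()), float(acc.mean())

# Toy two-category example.
confusion = [[3, 1],
             [1, 5]]
miou, macc = miou_macc(confusion)
```

Because both metrics average over categories rather than over points, small categories weigh as much as large ones, which explains why mIoU can be much lower than mAcc when rare categories are poorly predicted.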

Table 1 Calculation methods for shape features

Table 2 Comparison of the classification accuracy of different methods