SEMANTIC SEGMENTATION OF MOBILE LASER SCANNING POINT CLOUDS WITH LONG SHORT-TERM MEMORY NETWORKS: PRELIMINARY RESULTS

Although point clouds are characterized as a type of unstructured data, the timestamp attribute can structure point clouds into scanlines and shape them into a time signal. The present work studies the transformation of the street point cloud into a time signal based on the Z component for semantic segmentation using Long Short-Term Memory (LSTM) networks. The experiment was conducted on the point cloud of a real case study. Several training sessions were performed, varying the level of detail of the classification (coarse level with 3 classes and fine level with 11 classes), the network depth (two levels), and the use of weighting to improve classes with a low number of points. The results showed high accuracy, reaching at best 97.3% in the classification with 3 classes (ground, buildings, and objects) and 95.7% with 11 classes. The distribution of the success rates was not the same for all classes. The classes with the highest number of points obtained better results than the others. The application of weighting improved the classes with few points at the expense of the classes with more points. Increasing the number of hidden layers was shown to be a preferable alternative to weighting. Given the high success rates and a behaviour of the LSTM consistent with other Neural Networks in point cloud processing, it is concluded that the LSTM is a feasible alternative for the semantic segmentation of point clouds transformed into time signals.


INTRODUCTION
Point clouds are known to be a typical example of unstructured data. The points are distributed over surfaces in an irregular and unordered way, not fitting into a grid. However, in a detailed analysis of the distribution of points on surfaces in raw point clouds, certain patterns related to geometry and scanning time can be appreciated (Figure 1). These patterns match the scanlines and are easily identified in terrestrial, mobile, or airborne laser scanning.
In the case of streets acquired with a Mobile Laser Scanning (MLS) system, the scanlines are distributed sequentially along the street according to the speed of the MLS during acquisition (Cahalane et al., 2010). Given the proximity between scanlines and the quasi-constant geometry of a street, the Z component of the points shows a repetitive behaviour that is common across scanlines. In turn, each scanline also has a common temporal direction, where the X and Y components are closely related to the timestamp of each point. Therefore, urban street point clouds can be considered as 3D data but also as a continuous signal in time, defined by the Z component of the scanline points and with a frequency obtained from the intervals between buildings and street.
The aim of this work is to evaluate the semantic segmentation of street point clouds considering them as time signals and applying Recurrent Neural Networks (RNN). Specifically, this work evaluates the use of Long Short-Term Memory (LSTM), one of the networks in the state of the art in text and voice processing (Sherstinsky, 2020). In the segmentation of the street, two levels of detail are considered: one coarse with three classes (ground, buildings, and objects), and another more detailed with 11 classes (road, curbs, sidewalks, buildings, vegetation, cars, motorbikes, furniture, poles, pedestrian, and others).
The rest of this paper is organized as follows. Section 2 contains works on point cloud semantic segmentation with artificial intelligence. Section 3 details the characteristics of the LSTM and the scanlines generation. Section 4 presents the case study, the results, and the analysis. Section 5 concludes the work.

RELATED WORK
Semantic segmentation in point clouds is a research line extensively developed since its origin at the beginning of the 21st century (Zhang et al., 2019). Semantic segmentation, also known as point-based classification, consists of assigning a label to each point in the cloud. Given that each point is represented only by 3D geometry, it is crucial to establish relationships with neighbouring points in order to obtain local shape descriptors for subsequent classification.
In Machine Learning, feature extraction is a manual process, and the accuracy of the classification depends on well-designed feature descriptors (Ku et al., 2020). The most common geometric features are those based on surface normals, curvature, eigenvalues, and eigenvectors. These features are dependent on the neighbouring points. A change in the number or position of the neighbours will influence the value of the feature (Weinmann et al., 2015). The most common methods for estimating point relationships and structuring the point cloud are the voxelization and the k-nearest neighbours (knn) algorithm.
The adjustment of a point cloud to a grid, either in 2D (Hernandez and Marcotegui, 2009) or in 3D (voxels), transforms the unstructured point cloud into a regular structure in which points are distributed in cells. In each cell, features can be extracted from the points that belong to that cell (Meida et al., 2020; Poux and Billen, 2019). In addition, cells with similar features can be grouped together, forming super-pixels or super-voxels (Ramiya et al., 2016). By contrast, features can be extracted for each point with respect to its k nearest neighbouring points directly by applying knn (Li et al., 2017; Xiang et al., 2016; Zhang et al., 2018), so it is not necessary to group points in voxels, with the adjustment and truncation problems that voxelization entails.
Feature extraction in Deep Learning is an internal neural network process and does not require manual feature extraction. However, neural network architectures influence feature extraction because they follow the same principles of neighbourhood calculation as in manual feature extraction. Thus, 3D-CNN structures the point cloud in voxels (Huang and You, 2016), while PointNet works directly on the points based on their proximity relationships (Charles et al., 2017). In (Wang et al., 2019), the semantic segmentation is based on the representation of the point cloud as a graph. In (Ye et al., 2018), the point cloud is structured in 3D blocks to obtain contextual features with an RNN.
With respect to other works, our proposal is based on structuring the point cloud in scanlines as a continuous time signal, without a structuring based on voxels or knn. Moreover, the use of Deep Learning, specifically the LSTM, implies that the feature extraction process is executed internally by the neural network. To the best of our knowledge, no other studies process point clouds as time signals with an LSTM.

Long Short-Term Memory
Recurrent Neural Networks (RNNs) are a specific type of network for signal processing. As a peculiarity, information in these networks can persist through the layers of neurons, which allows previous states to be remembered. The limitation of the RNN is that information can persist only for a very short number of stages. To solve this, Long Short-Term Memory (LSTM) networks improve the architecture of the neurons by adding gates that store relevant information for longer, update it, or discard it.
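For reference, the gating mechanism can be written in the commonly used textbook form below. The notation is an assumption and is not taken from this work: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned parameters. The exact variant implemented by the software used in the experiments may differ.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate: what to discard)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate: what to update)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(stored information)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state passed on)}
\end{aligned}
```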
The input used by LSTM networks for segmentation is a series of arrays (samples) of variable length stored sequentially. When several features are available, the input contains one parallel array per feature.
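As a minimal sketch of this input layout, the function below arranges one scanline as a list of parallel feature arrays, one per feature and one value per time step. The `(x, y, z, t)` tuple layout and the name `scanline_to_sequence` are hypothetical choices made for illustration, not part of the original implementation.

```python
def scanline_to_sequence(scanline, feature_indices=(2,)):
    """Arrange one scanline as parallel feature arrays.

    scanline: list of (x, y, z, t) tuples, already ordered by time
              (this tuple layout is an assumption for the sketch).
    feature_indices: tuple positions used as features; (2,) selects
                     only Z, the single feature used in this work.
    Returns one list per feature, each with one value per time step.
    """
    return [[point[i] for point in scanline] for i in feature_indices]


# Three points of a toy scanline (values are illustrative only).
toy_scanline = [(0.0, 0.0, 1.0, 0.1), (0.0, 0.1, 1.5, 0.2), (0.0, 0.2, 2.0, 0.3)]
z_only = scanline_to_sequence(toy_scanline)            # one feature array
z_and_t = scanline_to_sequence(toy_scanline, (2, 3))   # two parallel arrays
```

Since scanlines contain different numbers of points, each call yields arrays of a different length, which matches the variable-length samples described above.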

Scanlines generation
Given an input consisting of a street point cloud $P = [X, Y, Z, T]$, the scanlines are generated by ordering the cloud by the timestamp attribute $T$. The time-ordered point cloud fulfils $t_i < t_{i+1} \ \forall i \in [1, n-1]$, where $n$ is the number of points. In this way, each point cloud attribute is ordered according to time and can be used as an array for the LSTM input.
The point cloud is divided into consecutive scanlines $S = [X, Y, Z, T]$, for subsequent distribution into the training and testing sets, based on the time difference between consecutive points $\Delta t_i = t_{i+1} - t_i$ (Figure 2). The time difference $\Delta t$ between consecutive points is not always constant. The angular resolution $\alpha$ of the MLS allows one point to be acquired at every $\alpha$ increment, but there is not always an element of the built environment to be acquired. Between points of consecutive scanlines there is a greater time difference, corresponding to the area of the sky not acquired, than between consecutive points of the same scanline. Therefore, once the average of all $\Delta t$ is calculated, the points are assigned sequentially to different scanlines, delimited by the $\Delta t$ values that are greater than 100 times the average $\Delta t$.
Most vehicle-mounted LiDAR sensors generate scan patterns perpendicular to the vehicle's trajectory and coincident with the normal direction of the façades of a street (Balado et al., 2017). Therefore, the X and Y components are related to the trajectory and the distance to the façade, and they are not used as features for the LSTM in this work. The point position with respect to the MLS along the trajectory is not relevant to identify an object. The point position with respect to the façade is intrinsically stored in the signal, since the point cloud is ordered according to the timestamp.
In order to take advantage of the point cloud distribution in scanlines, features based on the 3D proximity (nearest neighbours or voxels) between points are not calculated. Furthermore, since RNNs can extract features related to the distribution and position of the samples within signals, features based on consecutive points are also not calculated.
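The ordering and splitting steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the implementation used in the experiments: points are hypothetical `(x, y, z, t)` tuples, the function name `split_into_scanlines` is invented for the example, and only the factor of 100 times the average time difference comes from the text.

```python
from statistics import mean


def split_into_scanlines(points, factor=100.0):
    """Order points by timestamp and start a new scanline at each large gap.

    points: list of (x, y, z, t) tuples (assumed layout).
    factor: a time difference larger than factor * mean difference
            marks the boundary between two scanlines (100 in the text).
    """
    ordered = sorted(points, key=lambda p: p[3])
    if len(ordered) < 2:
        return [ordered] if ordered else []
    # Time difference between each point and the next one.
    gaps = [b[3] - a[3] for a, b in zip(ordered, ordered[1:])]
    threshold = factor * mean(gaps)
    scanlines = [[ordered[0]]]
    for point, gap in zip(ordered[1:], gaps):
        if gap > threshold:
            scanlines.append([])  # large gap: the previous scanline ended
        scanlines[-1].append(point)
    return scanlines
```

The Z values of each returned scanline, read in order, form the time signal that is passed to the network.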

Sample Weighting
The number of points in the urban environment is clearly unbalanced between the different classes. This is due to the acquisition distance of the MLS from the element to be scanned as well as the scannable surface of each element. Given this problem, some training sessions are performed by adding to the LSTM a module that weights the training accuracy according to the inverse of the class percentages (Equation 1):
$$ w_c = \frac{1}{p_c} \qquad \text{(Eq. 1)} $$
where $w_c$ is the weight of class $c$ and $p_c$ is the percentage of points that belong to class $c$.
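A minimal sketch of inverse-frequency weighting is given below. The function name and the toy labels are hypothetical, and the weights are left unnormalised, which is an assumption: the text only states that the weights follow the inverse of the class percentages.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, equal to 1 / (fraction of points).

    labels: iterable with one class label per point.
    A class holding 10% of the points receives weight 10.
    """
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / count for cls, count in counts.items()}


# Toy label distribution (illustrative numbers, not the case-study data).
toy_labels = ["ground"] * 6 + ["building"] * 3 + ["object"] * 1
weights = inverse_frequency_weights(toy_labels)
```

With this scheme, errors on rare classes such as objects count more during training than errors on dominant classes such as ground.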

Data
An MLS-acquired point cloud covering 153 m of Camelias street in Vigo (Spain) was selected as the case study. The area was segmented by half-street, containing one line of façades, one direction of the road, sidewalks, and urban furniture. The point cloud had 12 million points and was acquired with the Optech LYNX Mobile Mapper (Balado et al., 2017). The point cloud was structured in 6942 scanlines, with an average distance between scanlines of 2.7 cm. The average time difference between consecutive points on the same scanline was $2 \times 10^{-6}$ seconds. The selection of the Z feature in consecutive points generates a time signal suitable for the LSTM input (Figure 3).
The point cloud was segmented and labelled manually. Classes were defined at two levels of detail (LoD). The coarse level of detail contains 3 classes (ground, building, and objects), and the fine level contains 11 classes (road, curb, sidewalk, building, vegetation, car, motorbike, pedestrian, pole, furniture, and others). Figure 4 shows the point cloud coloured by the 11 labelled classes. Figure 5 contains the number of points percentages of each class over the total number of points.
Considering the coarse level of detail, the ground class occupies more than half of the labelled samples. Within ground, the road class has the highest weight and alone occupies half of the samples in the case study. Similarly, within the object class, vegetation occupies the largest percentage, leaving the rest of the object classes with very small percentages.
The scanlines of the point cloud were split randomly, with 50% assigned to the training set and 50% to the testing set.
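The random split by scanline can be sketched as follows. The function name, the fixed seed, and the restoring of the original scanline order inside each set are assumptions made for the example; the text only specifies a random 50/50 distribution of scanlines.

```python
import random


def split_scanlines(scanlines, train_fraction=0.5, seed=0):
    """Randomly assign whole scanlines to the training and testing sets.

    scanlines: list of scanlines (each one a sequence of points).
    seed: fixed only to make the sketch reproducible (an assumption).
    """
    indices = list(range(len(scanlines)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(indices) * train_fraction))
    # Sorting the chosen indices keeps scanlines in acquisition order.
    train = [scanlines[i] for i in sorted(indices[:cut])]
    test = [scanlines[i] for i in sorted(indices[cut:])]
    return train, test


# Using integers as stand-ins for the 6942 scanlines of the case study.
train_set, test_set = split_scanlines(list(range(6942)))
```

Splitting by scanline rather than by point keeps each time signal intact in exactly one of the two sets.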

Results
The training sessions were performed with a maximum of 30 epochs, a mini-batch of 64 observations, the Adam optimizer, a learning rate of 0.001, two levels of hidden layers (200 and 500 hidden layers), and two weighting options (weighted and unweighted). All experiments were executed on an Intel® Core™ i5-8400 CPU at 2.8 GHz, with 8 GB RAM and an NVIDIA 1050ti 4096 GDDR5, using MATLAB. Each training session lasted between 15 and 40 minutes, depending on the number of classes, hidden layers, and the weighting. The overall accuracy of all tests is shown in Table 1. Confusion matrices are shown in Tables 2 to 9 in the appendix. In addition, Figures 6 to 13 (also in the appendix) show an enlarged section of the street where successes and errors can be seen. The best results were obtained in the training sessions with 500 hidden layers, weighted for 3 classes (97.3%) and unweighted for 11 classes (95.7%). The errors are concentrated in the transition zones between elements and in those classes with a low number of points, mainly objects.

Analysis and discussion
The results of semantic segmentation depend largely on the number of classes and the LoD. In Table 1, overall accuracy for the semantic segmentation of 3 classes is higher than for 11 classes. In the confusion matrices (Tables 2 to 9), it can also be observed that the classes with the highest number of points are the ones that reached a better classification, in both LoD.
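For readers less familiar with these metrics, the sketch below shows how overall accuracy and per-class recall are derived from a confusion matrix. It assumes that rows hold the reference classes and columns the predicted classes; the matrix values are invented for illustration and are not the results of this work.

```python
def overall_accuracy(cm):
    """Fraction of all points whose predicted class matches the reference."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_recall(cm):
    """Recall per class: main-diagonal value divided by its row total."""
    recalls = []
    for i, row in enumerate(cm):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else float("nan"))
    return recalls


# Toy 3-class matrix (rows: reference; columns: predicted). Invented values.
toy_cm = [
    [90, 5, 5],    # ground
    [10, 80, 10],  # building
    [5, 5, 10],    # object
]
accuracy = overall_accuracy(toy_cm)
recall = per_class_recall(toy_cm)
```

In the toy matrix the smallest class has the lowest recall while barely affecting the overall accuracy, which is the pattern discussed for the object class.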
The ground was very well segmented at the coarse LoD in all tests (Tables 2 to 5 and Figures 6 to 9). When the ground is divided into three classes (road, curb, and sidewalk), the three elements behave differently (Tables 6 to 9 and Figures 10 to 13). The curbs were mainly confused with the road class, as the road has the largest number of points in the scene, while road and sidewalk remained classes with high success rates. The weighted tests are discussed in the following paragraphs.
The building class showed a very stable behaviour in all tests, although without success rates as high as those of the ground class. A small area of confusion was observed at the intersection of ground and buildings (example in Figures 6 or 10). An important area of confusion was also observed in the lower part of the buildings, coinciding with the height of the objects (example in Figures 8 or 11).
The objects, being the class with the lowest number of samples (points), showed great confusion with the sidewalk and building classes. This is due to the coincidence in the location of the objects with both classes. The treetops were well identified in all tests, probably due to the irregularity of the acquired heights in the time signal.
The aim of the weighting was to increase the accuracy of those classes with few samples. Also, with more hidden layers, the network can extract more complex characteristics, patterns, and thresholds to improve the classification. This behaviour was corroborated in the semantic segmentation with 3 classes, where adding the weighting (Table 3) or increasing the number of hidden layers (Table 4) increased the recall of the object class by more than 10% (although at the cost of 6% in the building class).
Increasing the number of hidden layers and weighting simultaneously obtained the best success rate for the semantic segmentation in 3 classes (Table 5).
However, the behaviour with 11 classes was markedly different. The increase of hidden layers produced an improvement of the recall in those classes with a lower number of points and a very slight worsening of the classes with a higher number of points (values of the main diagonal in Tables 6 and 8). But on adding the weighting, the same behaviour was produced in a much more aggressive way, reaching high values in the identification of object classes and curbs, but with a notable decrease in the road, sidewalk and building classes. As a result, the overall accuracy fell by up to 75%. Figures 11 and 13 show large areas misclassified.
The direct comparison of the proposed method on public datasets is a pending task, since most public datasets of point clouds do not include a timestamp attribute, let alone with an adequate resolution. However, in view of the results of this work compared to others (

CONCLUSIONS
This work presented the first experiments on semantically segmenting an MLS street point cloud by converting it into a time signal and applying an LSTM. Tests were performed to classify two LoDs, corresponding to 3 classes and 11 classes. Networks of different depth (200 and 500 hidden layers) and the use of weighting were also tested.
The LSTM obtained an overall accuracy of 97.3% for 3 classes and 95.7% with 11 classes, which places it in the state of the art. However, the object class and its sub-classes present the greatest confusion, given their low number of points in the street scene, although this is also indicative of the relevance of the third dimension lost when the point cloud was transformed into a time signal of the Z attribute. The weighting improved the overall accuracy for the classification with 3 classes, but not with 11 classes, although it did improve the success rate of the object classes. In view of the results, the LSTM is a tool as useful as others for the segmentation and classification of point clouds with regular patterns.
Future work will focus on extending the test to more case studies, evaluating other features (intensity, number of returns, timestamp), testing other networks such as the Gated Recurrent Unit (GRU), and combining the approach with a method to reduce the number of points and obtain a more balanced number of samples.