A COMPARATIVE STUDY OF POINT CLOUDS SEMANTIC SEGMENTATION USING THREE DIFFERENT NEURAL NETWORKS ON THE RAILWAY STATION DATASET

Point cloud data have rich semantic representations and can benefit various applications towards a digital twin. However, they are unordered and anisotropically distributed, making them unsuitable for a typical Convolutional Neural Network (CNN) to handle. With the advance of deep learning, several neural networks claim to have solved the point cloud semantic segmentation problem. This paper evaluates three such networks for semantic segmentation of point clouds, namely PointNet++, PointCNN and DGCNN. A public indoor scene of the Amersfoort railway station is used as the study area. Unlike the typical indoor scenes, and even more so the ubiquitous outdoor ones in currently available datasets, the station consists of objects such as entrance gates, ticket machines, couches, and garbage cans. For the experiment, we use subsets of the data, remove the noise, and evaluate the performance of the selected neural networks. The results indicate an overall accuracy of more than 90% for all the networks, but they vary in terms of mean class accuracy and mean Intersection over Union (IoU). The misclassification mainly occurs in the couch and garbage can classes. Several factors that may contribute to the errors are analyzed, such as the quality of the data and the proportion of points per class. The adaptability of the networks is also heavily dependent on the training location: the overall characteristics of the train station make a network trained for one location less suitable for another.


INTRODUCTION
Over the last few years, technology has been constantly evolving, and with the steady growth of computational power, ideas from decades ago have resurfaced to finally show their worth. With this growth, both the amount of data collected and the detail with which it is captured have increased. This paves the way for emerging deep learning methods that learn from the data. Recently, deep learning has been applied to multiple geospatial problems and has proven competent (Zhu et al., 2017, Ma et al., 2019, Ardabili et al., 2019).
The training data of deep learning are defined by the application. For indoor and outdoor applications, this naturally implies the use of different data. This paper strives to test state-of-the-art deep learning models in an environment that is still somewhat less explored, namely the indoor scene. Compared with the outdoor scene, the indoor scene is more complex to parse, since it is more customized and the variety of indoor features surpasses that of the outdoors (Meijers et al., 2005, Pang et al., 2018). Nevertheless, this does not imply that the indoor scene is inferior to the outdoor scene. On the contrary, the indoor scene is closer to the human habitat and is therefore equally worth exploring, if not more so. We focus on point cloud data, which are currently less explored than traditional image-based machine learning.
Processing unstructured point clouds is non-trivial, and it is only recently that deep learning approaches have been proposed for tackling this task (Qi et al., 2017a, Qi et al., 2017b, Li et al., 2018, Thomas et al., 2019). These point clouds are usually obtained from LiDAR sensors mounted on a vehicle or from visual SLAM approaches; few are collected for the indoor environment. There is a lack of attention to public spaces, where unexploited patterns may exist in the indoor scenes.
In this paper, we investigate how deep neural networks perform within the context of a public indoor environment. Specifically, we evaluate their performance on a point cloud acquired in a railway station. Compared with existing indoor scenes (Khoshelham et al., 2017, Dai et al., 2017, Armeni et al., 2016), our scene contains more significant noise because moving objects appear in it. Besides, the point clouds captured by the terrestrial laser scanner exhibit varying density depending on the distance between the objects and the scanner. We extensively evaluate the performance of several deep neural network architectures for semantic segmentation on these data. An advantage of applying deep neural networks to applications such as asset management is that the data do not need possibly hundreds of man-hours to be labelled, which saves considerable time and expense each time a scan is made.

RELATED WORKS
Recent advances in deep learning have boosted diverse computer vision applications. In the geospatial sector, deep learning-powered solutions contribute to the creation of the digital twin, where automatic object detection and semantic segmentation from point clouds play an important role (Zhu et al., 2017). These applications include urban planning (Urech et al., 2020), asset management (Fang et al., 2016), public safety (Wang et al., 2015), etc.
Deep Learning on Point Clouds. Point clouds are unordered and anisotropically distributed in space. Therefore, unlike grid data such as images or voxels, point clouds are more difficult to process efficiently in deep neural networks due to this irregularity. Volumetric CNNs (Maturana and Scherer, 2015, Wu et al., 2015, Qi et al., 2016) project the point sets into uniform grids. However, this type of method often involves non-trivial projection and is often constrained in resolution due to the sparsity of the voxel representations. Recently, neural networks have been proposed that directly consume raw point clouds. PointNet (Qi et al., 2017a) and Deep Sets (Zaheer et al., 2017) both address order invariance of the input points by applying a symmetric function over the inputs. PointNet++ (Qi et al., 2017b) further improves local feature aggregation by applying PointNet hierarchically over the point set. PointCNN (Li et al., 2018) applies an χ-transformation to learn the weighting of the input features and the point set permutation. Moreover, with graph structures proven successful in geometric learning (Battaglia et al., 2018, Zhou et al., 2020), deep neural networks utilizing graph structures have been proposed (Landrieu and Simonovsky, 2018, Wang et al., 2019). Within this paper, we evaluate several state-of-the-art solutions for semantic segmentation.
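The order-invariance idea behind PointNet and Deep Sets can be illustrated with a toy sketch: a shared per-point layer followed by a symmetric max-pool over the points, so that the aggregated feature is identical for any permutation of the input. The layer sizes and random weights below are arbitrary, not those of any of the evaluated networks.

```python
import numpy as np

def symmetric_features(points, weight, bias):
    """Toy PointNet-style aggregation: a shared per-point ReLU layer
    followed by a symmetric max-pool over the point dimension."""
    # points: (N, 3); weight: (3, F); bias: (F,)
    per_point = np.maximum(points @ weight + bias, 0.0)  # shared layer
    return per_point.max(axis=0)                         # order-invariant pool

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))          # an unordered point set
w, b = rng.normal(size=(3, 16)), np.zeros(16)

feat = symmetric_features(pts, w, b)
shuffled = symmetric_features(rng.permutation(pts), w, b)
assert np.allclose(feat, shuffled)       # same feature for any point order
```

Because the max-pool is symmetric in its inputs, reordering the 128 points cannot change the 16-dimensional output, which is exactly the property the raw-point-cloud networks rely on.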
Indoor Point Cloud Application. Indoor scene semantics based on point clouds is essential for many applications, such as planning, localization and navigation services (Flikweert et al., 2019, Quintana et al., 2016). However, indoor environments pose specific challenges for point cloud semantic segmentation due to their complex layout, the variety of object types, and occlusions (Ochmann et al., 2016, Pang et al., 2018). Indoor point cloud datasets are available that target different scenes (Khoshelham et al., 2017, Dai et al., 2017, Armeni et al., 2016). However, none of the existing datasets cover the scene of a railway station. This paper strives to study this less exploited indoor scene. Specifically, the lack of a benchmark for railway station point cloud semantic segmentation motivates this study.

Data
The study uses a LiDAR dataset of the Amersfoort Central Station. It consists of standard information, such as position (XYZ) and intensity, with additional Red-Green-Blue (RGB) colour from the camera. An overview of the point cloud dataset from outside and inside the station is shown in Figure 1. The data acquisition was conducted in October 2019 with 19 different scan locations inside the station.
The raw point clouds were unlabelled and still noisy, i.e. moving objects were present. We first screen the whole dataset to distinguish the assets typically found inside the station. Based on this screening, we define five classes: clutter, entrance gate, couch, garbage can, and floor. Furthermore, we subset the data into several partitions based on the planar position X and Y. For each partition, we manually labelled the points with the specified classes and cleaned the data to remove noise, for example, people sitting on the couch. However, it is impossible to remove the noise completely, so we keep clutter as a class in our classification scheme and end up with a number of points that is not proportional between classes. Based on our initial implementation with different partition sets, we found that training did not perform well on the unbalanced data, whereas a correctly labelled local scene was able to produce plausible results. Thus, the large partitions are not suitable for training because each scene contains many undesirable objects classified as clutter. This paper uses small data subsets generated from the larger partitions containing the desired objects. The comparison between large and small partitions is illustrated in Figure 2. The final data subsets are shown in Figure 3. We consistently use three scenes to train the networks and the other two for testing.
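The planar X/Y partitioning described above can be sketched as a simple grid binning. This is an illustrative stand-in for the actual preparation pipeline, not the exact procedure used for the dataset; the block size is an assumed parameter.

```python
import numpy as np

def partition_xy(points, block_size=1.0):
    """Group points into square XY blocks of side `block_size` metres.
    Illustrative sketch of planar partitioning; parameters are assumptions."""
    origin = points[:, :2].min(axis=0)                 # shift grid to the data
    keys = np.floor((points[:, :2] - origin) / block_size).astype(int)
    blocks = {}
    for idx, key in enumerate(map(tuple, keys)):
        blocks.setdefault(key, []).append(idx)
    # return the points of each non-empty block, keyed by grid cell
    return {k: points[v] for k, v in blocks.items()}

pts = np.array([[0.2, 0.3, 1.0],
                [0.8, 0.1, 0.5],
                [1.4, 0.2, 0.0]])
blocks = partition_xy(pts, block_size=1.0)
# the first two points share cell (0, 0); the third falls in cell (1, 0)
```

Small labelled scenes like the ones used for training can then be drawn from individual cells rather than from the full, clutter-heavy partitions.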

Method
This paper evaluates three different neural network architectures on our dataset, namely PointNet++ (Qi et al., 2017a), PointCNN (Li et al., 2018), and DGCNN (Wang et al., 2019). Specifically, we used the PyTorch implementations of PointNet++ (Wijmans, 2018) and DGCNN (Tao, 2020), and the ArcGIS API for PointCNN (Esri, 2021). These networks have been used for semantic segmentation tasks on private indoor point clouds (Dai et al., 2017, Armeni et al., 2016). As shown in Figure 4, we train the three networks on the same dataset. All common hyperparameters of the networks are structured as for the S3DIS dataset (Armeni et al., 2016). We adapt the data preparation process to fit our data. We use the default settings, except for the block size in PointCNN, which is changed from 1.5 m to 1 m, as we consider the objects in our scenes to have different dimensions from those of S3DIS.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2021, XXIV ISPRS Congress (2021 edition)
The training is monitored via the loss and accuracy. We stop the training when there is no significant improvement in these metrics. Then, we evaluate each trained network model in the testing stage. Here, we use standard measures of segmentation quality, comparing the predicted and ground-truth values with respect to the overall accuracy and Intersection over Union (IoU).
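The stopping criterion (no significant improvement of the monitored metrics) can be expressed as a small early-stopping helper. The patience and threshold values here are illustrative assumptions; in the experiments, training was stopped manually based on the same signal.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved by at
    least `min_delta` for `patience` consecutive epochs (values are
    illustrative, not the settings used in the paper)."""
    def __init__(self, patience=10, min_delta=1e-3):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best, self.bad_epochs = loss, 0   # significant improvement
        else:
            self.bad_epochs += 1                   # stagnant epoch
        return self.bad_epochs >= self.patience    # True -> stop training

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.79, 0.79, 0.79, 0.79]        # hypothetical loss curve
stops = [stopper.step(l) for l in losses]
# the last epoch triggers the stop: three epochs without improvement
```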
The overall accuracy describes the ratio between the number of correctly predicted points and the total number of points. It is given by

OA = (TP + TN) / (TP + TN + FP + FN),

where OA is the overall accuracy, TP is the total number of true positives (e.g. a point labelled couch predicted as couch), TN is the total number of true negatives (e.g. a point labelled non-couch predicted as non-couch), FP is the total number of false positives (a point labelled non-couch predicted as couch), and FN is the total number of false negatives (a point labelled couch predicted as non-couch). In the overall accuracy, the sum in the numerator equals the total number of points classified correctly for each class, while the sum in the denominator equals the total number of ground-truth points.
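Both evaluation metrics can be computed from a per-class confusion matrix. The following minimal NumPy sketch, with hypothetical labels, shows the overall accuracy and the per-class Intersection over Union (IoU):

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    """Rows: ground-truth class; columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for g, p in zip(gt, pred):
        cm[g, p] += 1
    return cm

def overall_accuracy(cm):
    # correctly classified points over all points
    return cm.trace() / cm.sum()

def per_class_iou(cm):
    # IoU_c = TP_c / (TP_c + FP_c + FN_c)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / (tp + fp + fn)

# hypothetical labels for 6 points and 3 classes
gt   = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(gt, pred, 3)
oa = overall_accuracy(cm)        # 4 of 6 points correct
ious = per_class_iou(cm)         # one IoU value per class
```

Averaging `ious` over the classes gives the mean IoU reported alongside the overall accuracy in the results.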
IoU expresses the ratio of the overlapping area to the union area between the prediction and the ground truth:

IoU = TP / (TP + FP + FN).

RESULTS AND DISCUSSIONS

Table 1 presents the evaluation results of each network. The overall accuracy reaches more than 90% for all networks. However, the mean class accuracy varies between 50% and 80%, and a similar observation holds for the mean IoU. The reason is that some classes have significantly lower accuracy than others, as indicated in Table 2.

To further determine the effect of the block size in PointCNN, we clip out one particular type of object, the surveillance cameras inside the station, and evaluate on them using different block sizes. With five cameras as input, the first test with a block size of 1.5 m achieved an accuracy of only 30%. The surveillance cameras are small, with dimensions of approximately 30 cm x 20 cm. After we reduced the block size to 0.5 m, the accuracy increased significantly to 97%. Figure 5 illustrates the prediction results. This experiment shows the high sensitivity of PointCNN to the block size.

Figure 6 presents the prediction results of each network. We observe that some objects, including clutter, have a similar shape. For example, the board above the entrance gate is similar to the advertisement board on the floor, which is labelled as clutter in the ground truth data. We argue that this affects how the networks learn from the training data and may cause misclassification in the prediction.
Another observation is that the scanner has difficulty capturing the objects completely. For example, the couch's points have holes because it was partly occupied, and we removed the objects above it during the data cleaning process. Moreover, some objects may be obscured by others, so the LiDAR scanner cannot measure them fully, e.g. the missing garbage can facades.

CONCLUSIONS AND FURTHER WORK
In this paper, a comparison of semantic segmentation neural networks is presented for a public indoor point cloud captured in a railway station. Our study scene differs from the existing indoor datasets in terms of the layout, the shape and size of the objects, and the presence of moving objects. The results obtained by PointNet++, PointCNN and DGCNN are compared, and some factors that may influence the semantic segmentation performance are analyzed. First, the objects in the station were not completely recorded by the LiDAR scanner, given the difficulty of measuring public space. Second, noise still exists in the data even after the manual cleaning process. Finally, similar shapes of different objects, including the unclassified points, may cause misclassification. A caveat to this study is that only a minimal data subset and a small number of classes are used. Despite the limitations in the data acquisition, these data represent a point cloud of a real-world indoor public space, where several restrictions such as time, budget and administrative effort have to be taken into consideration.
Our data partitioning into small input scenes neglects an enormous number of points from the raw data. Despite an overall accuracy of more than 90%, the pre-trained model may therefore not be suitable for use in a larger scene. Moreover, the quality of the data and the proportion of points per class may affect the segmentation performance. The adaptability is also heavily dependent on the training location: the railway station's overall characteristics make a network trained for one location less suitable for prediction on another. Further study with more classes and attributes is required to comprehensively analyze semantic segmentation with public indoor point cloud data.