APPLICATION OF A SHELLNET BASED APPROACH TO SEMANTIC SEGMENTATION IN URBAN POINT CLOUD

In recent years, the popularity of airborne, vehicle-borne, and terrestrial 3D laser scanners has driven the rapid development of 3D point cloud processing methods. 3D laser scanning is non-contact, high-density, high-accuracy, and digital, which enables comprehensive and fast 3D scanning of urban scenes. Because it remains difficult to accurately segment urban point clouds of complex scenes acquired by 3D laser scanning, a technical process for accurate and fast semantic segmentation of urban point clouds is proposed. In this study, the point clouds are first denoised; the samples are then annotated and sample sets are created in CloudCompare based on the point cloud features of the target categories; an end-to-end trainable network, ShellNet, is then trained on the urban point cloud samples; and finally, the model is evaluated on a test set. The method achieved IoU values of 89.83% and 73.74% for the semantic segmentation of buildings and rod-like objects, respectively. The visualization results on the test set indicate that the algorithm is feasible and robust, providing a new idea and method for the semantic segmentation of large-scale urban scenes.


INTRODUCTION
3D laser scanning is fast, accurate, and non-contact. It directly acquires dense 3D point clouds of object surfaces and plays a very important role in point cloud extraction for large urban road scenes (Pierdicca et al., 2020). A 3D laser scanner first scans the target to obtain 3D point cloud data that match the real-world dimensions of the site, and data processing software then builds a true 3D model of the physical scene. The resulting contactless, high-precision, and efficiently acquired scene data provide strong support for the recently proposed smart city brain infrastructure (Han et al., 2016). Because urban scenes are large and complex, the acquired laser point clouds often suffer from large data volume, discreteness, heavy noise, and holes. How to process urban road point cloud data quickly and to a high standard is therefore a current challenge (Duan et al., 2019).
MLS (Mobile Laser Scanning) systems are used to scan urban areas, and the resulting high-density point clouds contain various types of objects such as buildings, street lights, and trees (Lari et al., 2011). Existing studies mostly use the geometric information of the point cloud to identify the target features in a scene (Huang et al., 2019). Few studies have attempted to use color point clouds for urban scene analysis.
Point cloud segmentation divides point cloud data according to certain rules, usually by assigning points with the same characteristics to the same class. 3D point cloud segmentation has a long history, and a large number of classical segmentation algorithms have emerged. They can be broadly classified as edge-based, region-based, model-based, graph-based, and attribute-based methods. Edge-based segmentation algorithms (Himmelsbachc et al., 2009) filter boundary points using geometric features of the point cloud, connect the filtered boundary points into boundary lines, and finally segment the point cloud surface into independent point sets along those lines. Region-based methods (Dong et al., 2018) select seed points and group points with similar geometric properties into a plane, continuously updating the parameters of the surface fitted to the seed region until no remaining point satisfies the threshold condition. Model-based approaches (Schnabel et al., 2007) use the mathematical parametric models of simple geometric primitives as prior information to assign points to the corresponding primitive category. Graph-based segmentation approaches (Yang et al., 2014) treat the points as vertices, construct edges from the spatial neighborhood relationships of the points, and weight the edges by the similarity of neighboring points to build a graph. Attribute-based approaches (Filin, 2002) cluster the point cloud using its geometric structure features or spatial distribution features to achieve segmentation.
However, because of noise, object occlusion, and uneven acquisition density in the acquired point cloud data, these methods have difficulty fitting the objects (Nguyen et al., 2013), which greatly reduces accuracy.
According to how the 3D point cloud data are processed, deep learning based semantic segmentation methods for 3D point clouds fall into two categories: direct and indirect methods. Direct methods extract feature information directly from the point cloud data; the architecture retains the intrinsic information of the original points and predicts point-level semantics without conversion to voxels or multiple views (Su et al., 2015). Indirect methods convert the original point cloud into a regular 3D voxel grid or a set of multi-view images, extract features from the transformed data, and then complete the segmentation. Although point cloud processing algorithms have advanced significantly and achieve good accuracy in scene segmentation tasks, training is slow and the network structures are complex. For example, PointCNN weights and permutes the input features simultaneously and then applies a typical convolution, but it converges slowly; Pointwise uses point-wise convolution to obtain the local features of the points, but its voxel-based weight placement makes it inflexible. The ShellNet algorithm used in this paper applies the efficient ShellConv convolution operator to process large-scale data sets directly. Because the network has few parameters, it trains very quickly, and the experimental results confirm its effectiveness.
In summary, this study proposes a technical process for segmenting target point clouds in urban scenes based on the elevation, intensity, and geometry of the point clouds, taking into account the characteristics of the various targets in complex urban scenes. Experiments verify that the process segments well and improves the automation of urban scene segmentation.

Point Cloud Denoising
In recent years, point cloud data have become increasingly available. When point cloud data are obtained directly from an MLS system, inaccuracies in depth acquisition make the point cloud noisy, and it may contain many outliers (Javaheri et al., 2017). As the first step of data preprocessing, point cloud denoising has a relatively large impact on the subsequent steps and is therefore required in this study.
Because outliers lie far from their neighbors, this study uses radius outlier removal, in which each point is connected to its neighbors within a given radius by a small graph (Schoenenberger et al., 2015). A threshold on the minimum number of neighbors within that radius is set, and points with fewer neighbors are identified as outliers.
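The radius-based filter described above can be illustrated with a minimal sketch. The function name `radius_outlier_removal`, the parameter values, and the toy point cloud are assumptions made for illustration; they are not the settings used in this study. The brute-force neighbor search is only suitable for small examples, and a real MLS cloud would need a spatial index.

```python
import math

def radius_outlier_removal(points, radius, min_neighbors):
    """Keep points that have at least `min_neighbors` other points within `radius`."""
    kept = []
    for i, p in enumerate(points):
        # Count neighbors of p inside the search radius (excluding p itself).
        count = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if count >= min_neighbors:
            kept.append(p)
    return kept

# Toy cloud (hypothetical): four clustered points and one isolated point.
cloud = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0), (0.1, 0.1, 0.0),
    (5.0, 5.0, 5.0),
]
filtered = radius_outlier_removal(cloud, radius=0.5, min_neighbors=2)
print(len(filtered))
```

In this toy case the isolated point has no neighbors within the radius and is dropped, while the four clustered points are kept.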

ShellNet Network Structure
In recent years, how to compute features efficiently for unstructured data such as point clouds has been a research hotspot (Chen et al., 2020). This study uses ShellNet, an algorithm for segmenting urban scenes. An efficient point cloud neural network requires a convolution that operates directly on point clouds. ShellConv is the core component of ShellNet and extracts the features of local point sets. Its main idea is to output a sparser point set with deeper features by merging point sampling into the convolution (Joshi et al., 2021). ShellConv computes the features of each sampled point. The input point cloud is randomly sampled to obtain representative points; the neighbors of each representative point are distributed over concentric spherical shells centered on it, and the local features of each shell are derived by max-pooling. The features of the sampled point are then obtained from the local features of the multiple shells, as shown in Figure 1. In this way, a larger receptive field can be obtained even though the number of points does not increase. A set of representative points is randomly selected from the input point set. For a particular representative point p, its neighbors q are obtained by the nearest neighbor method, and the convolution at point p is

$$F^{(n)}(p) = \sum_{q \in \Omega_p} W^{(n)}(q)\, F^{(n-1)}(q)$$

where $\Omega_p$ is the set of neighbors of p, F represents the input features of the point set for a particular channel, and W is the weight of the convolution. The superscript (n) indicates the parameters of the n-th layer. F(p) and F(q) denote the features of point p and point q.
ShellConv is used in ShellNet instead of the traditional 2D convolution. The segmentation network follows U-Net, a classical fully convolutional network that combines local and global information (Zhang et al., 2018). The deconvolution part starts from the N2 representative points in Figure 2. Through three layers of ShellConv operators, the number of output points of the deconvolution layers gradually increases while the number of feature channels gradually decreases, until the number of upsampled points equals the number of input points N.
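To make the shell-wise pooling concrete, the following is a simplified sketch of a ShellConv-style computation for one representative point. It rests on several assumptions that are not taken from this paper: each point carries a single scalar feature, the k nearest neighbors are split into shells holding equal numbers of points, and the per-shell weights are fixed numbers rather than learned parameters. The function name `shell_conv_point` is hypothetical.

```python
import math

def shell_conv_point(center, points, feats, k, num_shells, shell_weights):
    """Shell-wise max-pooling followed by a weighted sum, for one representative point."""
    if k % num_shells != 0 or k > len(points):
        raise ValueError("k must be divisible by num_shells and at most len(points)")
    # k nearest neighbors of the representative point, nearest first.
    order = sorted(range(len(points)), key=lambda i: math.dist(center, points[i]))
    neighbors = order[:k]
    per_shell = k // num_shells
    out = 0.0
    for s in range(num_shells):
        # Shell s holds the s-th group of neighbors by increasing distance.
        shell = neighbors[s * per_shell:(s + 1) * per_shell]
        pooled = max(feats[i] for i in shell)   # max-pooling inside the shell
        out += shell_weights[s] * pooled        # convolution over shells
    return out

# Toy example (hypothetical values): four neighbors along the x-axis, two shells.
points = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
feats = [1.0, 3.0, 2.0, 5.0]
value = shell_conv_point((0.0, 0.0, 0.0), points, feats,
                         k=4, num_shells=2, shell_weights=[0.5, 2.0])
print(value)  # 0.5 * max(1, 3) + 2.0 * max(2, 5) = 11.5
```

Because pooling happens inside each shell, the result does not depend on the order of points within a shell, while the ordering of shells from inner to outer gives the convolution weights a well-defined position.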

Figure 2. Technology roadmap. For the input point cloud, preprocessing is performed first, including point cloud denoising, sample set labeling, and sample set generation; N is the number of raw points, and the XYZ coordinates and intensity are the four input features of each point. In the ShellNet network, three layers of ShellConv produce a matrix of size N2 × C2, where N2 is the number of representative points finally extracted from the input point cloud and each point carries a high-dimensional feature vector of size C2. This matrix is fed into an MLP module of size (256, 128) to generate a probability map for object classification.

PointNet++ Network Structure
PointNet is a pioneering effort that directly processes point sets. The main idea of PointNet++ is to add a multi-level feature extraction structure to PointNet: the input point cloud is divided into several local point sets, the global features of each point set are extracted, and the features are abstracted repeatedly to obtain higher-level features. Each such level is called a set abstraction. Each set abstraction consists of three parts: the sampling layer, the grouping layer, and the PointNet layer (Yao et al., 2019). In the sampling layer, FPS (farthest point sampling) is used to select the centroids; in the grouping layer, KNN is used to find the k nearest points around each centroid to form a local region; finally, PointNet is used to extract the local features from each local region given by the grouping layer.
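The sampling and grouping layers can be sketched as follows. This is an illustrative pure-Python version, not the PointNet++ implementation used in the experiments; the function names `farthest_point_sampling` and `knn_group`, the fixed start index, and the toy points are assumptions.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Return indices of m centroids, each chosen farthest from those already selected."""
    selected = [start]
    # Distance from every point to its nearest selected centroid so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

def knn_group(points, centroid_idx, k):
    """Return indices of the k points nearest to the given centroid (including itself)."""
    c = points[centroid_idx]
    return sorted(range(len(points)), key=lambda i: math.dist(points[i], c))[:k]

# Toy example (hypothetical): two clusters roughly 10 units apart.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0),
       (0.0, 1.0, 0.0), (10.0, 1.0, 0.0)]
centroids = farthest_point_sampling(pts, m=2)
groups = [knn_group(pts, c, k=2) for c in centroids]
print(centroids, groups)
```

Starting from index 0, the second centroid falls in the distant cluster, so the two local regions cover both parts of the toy cloud.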
For the segmentation task, each point must be given a class label, so the point set has to be restored to the original resolution, which is done mainly by interpolation and skip connections. The interpolation is a weighted average over the k nearest neighbors, with weights given by the inverse of their distances. The skip connection concatenates the output features of the corresponding earlier set abstraction layer with the features of the interpolated points (Ma et al., 2022). Because the resulting feature dimension is too high, which would affect training speed and quality, the features pass through a unit PointNet that reduces the feature dimension and improves the robustness of the model. This process is repeated until the features are propagated to the original set of points.
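The feature propagation step can be sketched in a few lines. The weights below are the plain inverse of the distance, following the description above; the exponent applied to the distance, the value of k, the small constant `eps`, and the function name `interpolate_features` are assumptions for illustration, and the unit PointNet that would follow the concatenation is omitted.

```python
import math

def interpolate_features(query, known_points, known_feats, k=3, eps=1e-8):
    """Inverse-distance weighted average of the features of the k nearest known points."""
    nearest = sorted(range(len(known_points)),
                     key=lambda i: math.dist(query, known_points[i]))[:k]
    # eps avoids division by zero when the query coincides with a known point.
    weights = [1.0 / (math.dist(query, known_points[i]) + eps) for i in nearest]
    total = sum(weights)
    dim = len(known_feats[0])
    return [sum(w * known_feats[i][c] for w, i in zip(weights, nearest)) / total
            for c in range(dim)]

# Toy example (hypothetical): two sparse points with one feature channel each.
sparse_pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sparse_feats = [[0.0], [4.0]]
interp = interpolate_features((1.0, 0.0, 0.0), sparse_pts, sparse_feats, k=2)

# Skip connection: concatenate with the feature saved from the earlier layer.
skip_feat = [7.0]
combined = interp + skip_feat
print(combined)
```

The query point sits midway between the two known points, so both receive the same weight and the interpolated value is their mean.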

EXPERIMENT AND DISCUSSION
In this section, the efficiency and effectiveness of our solution for segmenting urban targets from MLS point clouds are investigated and discussed. All experiments are performed on the same workstation with an Intel Gold 6130 @2.7GHz CPU and an NVIDIA RTX3090 GPU. During training, the initial learning rate is set to 0.001, and at each iteration it is reduced to 0.7 times its previous value.

Dataset
To fully verify the feasibility and robustness of the algorithm, the Nanjing Olympic Sports Center was used as the experimental site in this study, as shown in Figure 4. In the WGS84/UTM coordinate system, the x-coordinates of the dataset range from 661650.06 m to 666158.03 m and the y-coordinates from 3541576.99 m to 3545957.31 m. The training data required for the experiments were acquired with a Lynx SG1 vehicle-mounted scanner from Optech of Canada and contain relatively fine details covering a wide variety of urban scenes: apartments, gymnasiums, offices, buildings under construction, street lights, utility poles, billboards, etc. Figure 5 shows the data set colored by height.
In total, the dataset consists of more than 60 million 3D points and contains 32 labeled urban scenes. Each scene has up to 10^8 points with XYZ coordinates and intensity information. This research proposes a set of technical processes for semantic segmentation that is practically applicable to urban scenes. Whether for smart cities or modern industrial applications, the acquired point cloud data are massive. We therefore chose this dataset so that the results realistically reflect the accuracy of the method. In addition, a small amount of data can lead to overfitting, which in turn affects the training results.
The data are manually labeled into three semantic categories: buildings, rod-like objects, and others.
(3) Others: objects that do not belong to the previously mentioned classes.
To verify the segmentation performance, 26 of these 32 scenes are randomly selected as the training set and the remaining 6 as the validation set in this paper.

Evaluation
To evaluate the segmentation results, the accuracy is assessed after the point clouds of each category have been extracted. Overall accuracy is a commonly used metric in multi-category segmentation problems but can be affected by an uneven sample distribution. To assess the effectiveness of the method for each category more rigorously, precision, recall, F-score, and IoU are used as evaluation metrics for comparison. In this study, a target class such as buildings is designated as the positive class. A building point is counted as a true positive (TP) if it is segmented correctly, and as a false negative (FN) if it is segmented as a rod-like object or as another object. A point of another class, such as a rod-like object, is counted as a false positive (FP) if it is segmented as a building, and as a true negative (TN) otherwise. Based on these counts, the evaluation values are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{IoU} = \frac{TP}{TP + FP + FN}$$

Precision and recall are not necessarily correlated. In practice, however, the two metrics tend to constrain each other on very large data sets. Both need to be weighed together in this study, so the F-score is included as an evaluation metric. IoU is generally calculated per category: the IoU of each category is calculated, and the values are then averaged to obtain a global evaluation, which has become a standard metric in semantic segmentation.
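The per-class metrics can be computed from point-wise labels as in the sketch below. The function name `class_metrics`, the label strings, and the six-point toy example are assumptions for illustration and are unrelated to the reported results; the formulas are the standard definitions of precision, recall, F-score, and IoU.

```python
def class_metrics(truth, pred, positive):
    """Precision, recall, F-score, and IoU for one class treated as positive."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score, "iou": iou}

# Toy labels (hypothetical): one building point predicted as rod, one rod as building.
truth = ["building", "building", "building", "rod", "other", "rod"]
pred = ["building", "building", "rod", "building", "other", "rod"]

building = class_metrics(truth, pred, "building")
classes = ["building", "rod", "other"]
mean_iou = sum(class_metrics(truth, pred, c)["iou"] for c in classes) / len(classes)
print(building, mean_iou)
```

For the building class in this toy case, TP = 2, FP = 1, and FN = 1, which gives precision and recall of 2/3 and an IoU of 0.5; averaging the IoU over the three classes gives the global value.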

Results and Discussion
In this study, semantic segmentation experiments were conducted on the constructed outdoor point cloud data using the ShellNet network. The urban scene of the Nanjing Olympic Sports Center was selected, and the segmentation results for this scene were visualized. To assess the performance of the segmentation algorithm on each category, a PointNet++ network was used for comparison experiments; the segmentation results are shown in Figure 6. PointNet++ achieves 99.39%, 99.48%, and 99.42% in recall, precision, and F-score, respectively, while ShellNet is slightly higher with 99.53%, 99.58%, and 99.54%. In terms of IoU, ShellNet and PointNet++ achieve 73.74% and 68.39% for rod-like objects and 89.83% and 83.87% for buildings, respectively. In summary, Table 2 shows that ShellNet generally outperforms PointNet++ and segments urban scenes more accurately.
To better illustrate the effectiveness of ShellNet on large-scale data sets, scenes with good results were selected from the six validation scenes, and the segmentation results of ShellNet and PointNet++ were compared with the ground truth. As shown in Figure 6 for building 1, building 2, and building 3, PointNet++ makes large errors when segmenting buildings. ShellNet's results are essentially the same as the ground truth, except for building 2, where there is a significant missegmentation. As shown in Figure 6 for rods-like 1, rods-like 2, and rods-like 3, PointNet++ also produces many misclassifications on rod-like objects, labeling part of the points on a rod-like object as building. ShellNet misclassifies the upper part of the rod-like object as building in rods-like 2 and assigns the rod-like object to the other class in rods-like 3. The specific accuracy evaluation values are shown in Table 1. In the fifth and sixth scenes, the accuracy values of ShellNet for rod-like objects are lower than those of PointNet++, which we suspect is due to the small number of rod-like samples in these two scenes; this problem does not occur in the other scenes. Overall, PointNet++ scores well numerically, but its visualization results show many errors, which is particularly evident in comparison with ShellNet.

Table (header only): Method | Validation set | Evaluation index

CONCLUSION
To minimize human intervention, given that automatic semantic segmentation of complex urban scenes remains difficult, this paper employs the ShellNet deep learning network for automatic semantic segmentation. The network is an end-to-end deep neural network for point-wise classification of large-scale outdoor point clouds, and it effectively segments the entire urban scene into three categories on our own dataset. Semantic segmentation experiments on buildings, rod-like objects, and other objects show that the method is feasible and robust and that its accuracy on the test data meets production requirements. Compared with traditional methods, the two deep learning methods, ShellNet and PointNet++, handle the unordered nature of point clouds appropriately and exploit the spatial relationships between points to aggregate local and global features. Compared with PointNet++, the ShellNet model is somewhat more accurate, produces fewer misclassifications, and also classifies the edge areas of buildings correctly. The ShellNet model has fewer parameters and is also faster than PointNet++, which enables fast classification of point clouds with large data volumes and makes it more suitable for point cloud classification in large-scale outdoor scenes.
In recent years, smart cities have become a strategic choice for promoting global urbanization, improving urban governance, and developing the digital economy. Buildings are among the most important carriers of the smart city and are themselves becoming more and more intelligent. This study therefore focuses on the segmentation of buildings and rod-like objects. Because the point cloud distribution and intensity characteristics differ among the various objects in a city, the next step is to evaluate more algorithms on more complex urban scenes.