A TWO-STAGE APPROACH FOR RARE CLASS SEGMENTATION IN LARGE-SCALE URBAN POINT CLOUDS

Although deep learning has greatly improved the semantic segmentation accuracy of point clouds, the segmentation of rare classes in large-scale urban scenes has not been specifically addressed by available methods. This paper proposes a two-stage segmentation framework with automated workflows for imbalanced rare classes, built on general semantic segmentation. The proposed approach comprises two stages: general semantic segmentation and object-based refined semantic segmentation. Firstly, general segmentation networks are utilized to segment common large objects. Secondly, refined semantic segmentation is conducted by an automated workflow: 3D clustering and bounding box (BBox) generation are applied to the point cloud of rare fine-grained objects during training, followed by object detection to extract the fine-grained objects. Afterwards, the extracted BBoxes serve as constraints to further refine the segmentation results. Our approach is evaluated on the Hessigheim High-Resolution 3D Point Cloud (H3D) Benchmark and obtains a state-of-the-art 89.35% overall accuracy and an outstanding 75.70% mean F1-score. Furthermore, the rare classes Vehicle and Chimney achieve breakthroughs from zero to 63.63% and 52.00% in F1-score, respectively.


INTRODUCTION
Automated semantic segmentation of point clouds is fundamental for various fields of application, including autonomous driving, building information modeling and robotics (Chen et al., 2020). Recent advances in remote sensing sensors and platforms, especially lightweight LiDAR devices and unmanned aerial vehicles (UAVs), facilitate the availability of fine-grained 3D data. Such data, while revealing the spatial distribution of target objects in high detail, also bring about the problem of significant class imbalance. Efficient methods are needed to fully harness this unprecedented source of information for 3D semantic segmentation.
Established methods like VoxNet (Maturana et al., 2015) voxelize point clouds to make the data structure suitable for 3D CNNs, but the sparsity of point clouds causes low efficiency of the voxel grid arrangement. SSCNs (Graham et al., 2018) take advantage of this sparsity and consider only occupied voxels to improve efficiency; Schmohl and Soergel (2019) apply them to large-scale ALS point clouds. However, such methods depend only on the voxel boundaries and ignore the geometric structure of local regions. PointNet++ effectively solves the problem of extracting local features by combining a sampling-grouping layer and a PointNet layer. Nevertheless, features aggregated by the pooling operator receive the same weight in each individual dimension. The self-attention operator in Point Transformer (Zhao et al., 2021) weights each element adaptively. However, when dealing with imbalanced rare classes in large-scale urban scenes, the above general semantic segmentation methods often fail to extract sufficiently effective semantic features of these classes and perform poorly. To alleviate this problem, surface features based on the local 3D neighborhood (Weinmann et al., 2013) can be utilized to strengthen the local perception of the networks.
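As a minimal illustration of such neighborhood-based surface features, the sketch below derives three common eigenvalue descriptors (linearity, planarity, sphericity) from the covariance of each point's k nearest neighbors; the choice k=16 and the use of SciPy/NumPy are illustrative assumptions, not the implementation of the cited work.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_surface_features(points, k=16):
    """Eigenvalue-based surface descriptors in the spirit of
    Weinmann et al. (2013), computed per point from the covariance
    of its k-nearest-neighbour 3D neighbourhood."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for nbrs in points[idx]:                         # (k, 3) neighbourhood
        cov = np.cov(nbrs.T)                         # 3x3 covariance
        ev = np.sort(np.linalg.eigvalsh(cov))[::-1]  # l1 >= l2 >= l3 >= 0
        l1, l2, l3 = np.maximum(ev, 1e-12)           # avoid division by zero
        linearity  = (l1 - l2) / l1
        planarity  = (l2 - l3) / l1
        sphericity = l3 / l1
        feats.append([linearity, planarity, sphericity])
    return np.asarray(feats)
```

Descriptors like these can simply be concatenated to the per-point input features before they enter the network.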
The goal of 3D object detection is to detect class-imbalanced high-value objects and indicate object location and size attributes in the form of 3D BBoxes. If objects have strong shape cues, detectors can easily locate them and thus provide valuable information for semantic segmentation (Dong et al., 2014). In general, existing 3D detection methods can be broadly grouped into two categories, i.e., single-stage detection and two-stage detection. Single-stage detection methods regress 3D bounding boxes directly from the extracted features, such as PointPillars (Lang et al., 2019) and 3DSSD (Yang et al., 2020). Two-stage detection methods like PointRCNN (Shi et al., 2019) and PV-RCNN (Shi et al., 2020) generate region-proposal-aligned features in the first stage and refine the predictions in the second stage. Single-stage detection methods usually run faster due to their simpler network structures, whereas two-stage detection methods often attain higher precision, benefiting from the second, refined stage.
So far, those 3D segmentation methods developed in the computer vision community have mostly been used for general large classes in ground scans with limited space, or indoor scenes. To our knowledge, specialized segmentation methods for imbalanced rare classes in large-scale urban point clouds have not yet been investigated. In this paper, we unify object detection models into the framework of general semantic segmentation, and present a two-stage segmentation framework for imbalanced rare classes.
The rest of the paper is organized as follows. Section 2 introduces the overall structure of the proposed two-stage segmentation framework in detail. Section 3 describes the experimental details on the H3D dataset and analyzes the results of the general semantic segmentation and our two-stage segmentation. Section 4 concludes the paper and gives an outlook.

METHODOLOGY
In this section, we present our proposed two-stage segmentation framework for imbalanced rare classes. The overall structure is illustrated in Fig. 1; it consists of a general semantic segmentation stage and a refined semantic segmentation stage. General semantic segmentation is used to extract common large classes, based on which the refinement of imbalanced rare classes is performed in the second stage.

General Semantic Segmentation
Since Point Transformer is invariant to permutations of the input elements due to the inherent set-level operation of its self-attention structure, which is consistent with the unordered distribution of point clouds, it is a natural choice as the main component of the general semantic segmentation stage. Unlike the original Point Transformer, however, not only the original point features but also local surface features of the points are fed into the network. In this way, the local perception of the network is enhanced to a certain extent.
Local surface features provide the attributes of the local approximate surface of each point (Weinmann et al., 2013), which can be calculated based on the local 3D neighborhood. Only descriptors with strong semantic interpretation are selected to construct the local features of each point p, which are described by one 6-tuple.

As part and parcel of the general semantic segmentation stage, the point transformer layer is formed by two linear mappings and a self-attention calculation. The linear mappings convert the input-output dimensions, and the self-attention estimates the internal relationship among the input points. The self-attention calculation of each point x_i is

y_i = Σ_{x_j ∈ X(i)} ρ( γ( φ(x_i) − ψ(x_j) + δ ) ) ⊙ ( α(x_j) + δ ),

where X(i) is the local neighborhood of x_i, which is obtained by the KNN algorithm, φ, ψ and α are pointwise feature transformations, δ is the positional encoding, ρ is the softmax activation function, and γ is the attention mapping function, which is implemented by a multilayer perceptron (MLP), i.e., 2 linear layers and a ReLU. The network consists of 4 encoder layers and 4 decoder layers. Transition down is implemented by farthest point sampling and KNN searching. Transition up is realized by trilinear interpolation. For the semantic segmentation task, an MLP maps the point feature to the label space y_k at the last layer. All the learnable parameters of the network can be updated by optimizing the cross-entropy loss function.
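A minimal PyTorch sketch of such a vector self-attention layer is given below; the mapping names phi, psi, alpha, gamma and delta mirror the notation of Zhao et al. (2021), while the dimensions and the assumption of precomputed KNN indices are illustrative choices of this sketch.

```python
import torch
import torch.nn as nn

class VectorSelfAttention(nn.Module):
    """Sketch of a point transformer layer: vector self-attention over
    a precomputed KNN neighbourhood of each point."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.phi   = nn.Linear(dim, dim)    # query mapping
        self.psi   = nn.Linear(dim, dim)    # key mapping
        self.alpha = nn.Linear(dim, dim)    # value mapping
        self.delta = nn.Sequential(         # positional encoding of xyz offsets
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gamma = nn.Sequential(         # attention mapping: 2 linear + ReLU
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, xyz, knn_idx):
        # x: (N, C) features, xyz: (N, 3) coordinates, knn_idx: (N, k) indices
        q  = self.phi(x)                                  # (N, C)
        kf = self.psi(x)[knn_idx]                         # (N, k, C)
        vf = self.alpha(x)[knn_idx]                       # (N, k, C)
        pe = self.delta(xyz[knn_idx] - xyz[:, None, :])   # (N, k, C)
        attn = torch.softmax(self.gamma(q[:, None, :] - kf + pe), dim=1)
        return (attn * (vf + pe)).sum(dim=1)              # (N, C)
```

The softmax over the k neighbors makes each attention weight adaptive per feature channel, which is the property that distinguishes this layer from uniform pooling.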

Refined Semantic Segmentation
In the stage of refined semantic segmentation, the training process and the inference process are separated. During training, automated label generation in the form of 3D BBoxes is essential to unify object detection models into the framework of semantic segmentation. Firstly, fine-grained rare classes are selected individually to avoid confusion with general classes. Then, considering their discrete distribution, the density-based spatial clustering of applications with noise (DBSCAN) method (Ester et al., 1996) is utilized to divide the point cloud of the rare classes into separate reliable clusters. Afterwards, the vertices of the convex hull of each cluster are calculated and adjusted to the vertices of a 3D BBox. After this automated process, the generated BBox labels of the rare classes and the original point cloud can be fed into the object detection block.
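The automated label-generation workflow can be sketched as follows; for brevity this version wraps each DBSCAN cluster in an axis-aligned box taken from the coordinate extrema rather than adjusting convex-hull vertices, and eps/min_samples are illustrative hyperparameters rather than the values used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_bbox_labels(points, eps=0.5, min_samples=30):
    """Cluster the rare-class points with DBSCAN and wrap every
    cluster in an axis-aligned 3D bounding box (simplified sketch)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    boxes = []
    for cid in set(labels) - {-1}:              # label -1 marks DBSCAN noise
        cluster = points[labels == cid]
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        boxes.append({"center": (lo + hi) / 2,  # box centroid
                      "size": hi - lo})         # extents (dx, dy, dz)
    return boxes
```

The resulting boxes, together with the raw point cloud, form the training samples for the detection block.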
Thanks to its satisfactory detection precision in large-scale complex scenes, PointRCNN is chosen as the detection block. The main components of the network are 3D proposal generation and 3D BBox refinement. 3D proposal generation performs a rough segmentation of the foreground points, based on which 3D BBox proposals are constructed. PointNet++ with multi-scale grouping is utilized as the backbone network to learn discriminative point-wise features of the raw point clouds. In order to alleviate the class imbalance problem between foreground points and background points, the focal loss (Lin et al., 2017) is chosen to update the network:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),

where p_t is the predicted probability of the ground-truth class, and α_t and γ are the balancing and focusing parameters. In the stage of 3D BBox refinement, when the 3D intersection over union (IoU) between a ground-truth BBox and a BBox proposal is greater than 0.6, the point-wise features and associated features of each positive 3D proposal are fed to PointNet++ for refining the 3D BBox locations as well as the foreground object confidence. All the learnable parameters can be updated by optimizing a combined classification and BBox regression loss. Finally, the 3D BBoxes of rare fine-grained objects are predicted in the inference stage, and these high-precision BBoxes are utilized as constraints for the rare class segmentation.
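A minimal NumPy implementation of the focal loss above might look as follows; alpha=0.25 and gamma=2.0 are the defaults suggested by Lin et al. (2017), and the probability clipping is a numerical-stability detail added in this sketch.

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p: predicted foreground probabilities, target: 1 foreground / 0 background."""
    p = np.clip(p, 1e-8, 1 - 1e-8)               # guard against log(0)
    p_t = np.where(target == 1, p, 1 - p)        # probability of the true class
    a_t = np.where(target == 1, alpha, 1 - alpha)
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))
```

The (1 − p_t)^γ factor down-weights the many easy background points, so the rare foreground points dominate the gradient.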

Data Description
The experiments are based on the public H3D dataset (Kölle et al., 2021), captured by sensors integrated on a RIEGL UAV platform. The mean point density is 800 points/m², the points are enriched by RGB colors, and the ground sampling distance (GSD) of the images is 2-3 cm. In addition, the points have been manually labelled with the following 11 classes: Low vegetation, Impervious surface, Vehicle, Urban furniture, Roof, Facade, Shrub, Tree, Soil/Gravel, Vertical surface and Chimney. However, this fine-grained class catalog leads to data imbalance.
Detailed statistics of the class occurrences in the H3D dataset are shown in Table 1. The most underrepresented classes are Vehicle and Chimney, which only occupy 0.43% and 0.04% of the training set, respectively. This significant data imbalance makes the semantic segmentation of rare classes a challenging task.

Implementation Details
Our implementation of the two-stage segmentation approach is realized on an NVIDIA RTX 2080Ti GPU with PyTorch 1.0. According to the analysis in Section 3.1 and in order to reduce the computational burden, the training data and the test data are cropped into 49 and 22 splits, respectively.
In the stage of general semantic segmentation, the configuration of the feature encoder is (32, 2048), (64, 1024), (128, 512), (256, 256), where (32, 64, 128, 256) are the feature dimensions of the corresponding layers and (2048, 1024, 512, 256) are the output point numbers. In the point transformer block, the decoder has a configuration symmetrical to the encoder. The Adam optimizer is employed. We train the network for 20 epochs with batch size 4 and an initial learning rate of 0.0005.
In the stage of refined semantic segmentation, Vehicle and Chimney are treated as the imbalanced rare classes according to Section 3.1. For the backbone network PointNet++ in the 3D proposal generation process, we subsample 65536 points from each split as training inputs. Then 4 set-abstraction layers with multi-scale grouping are used to subsample the points into groups of sizes 4096, 1024, 256 and 64. For the 3D BBox refinement network, 512 points are randomly selected from each 3D proposal as the input, and 3 set-abstraction layers with group sizes 128, 32 and 1 are used to generate a single feature vector for the BBox refinement. The proposal generation network is trained for 300 epochs with batch size 8 and learning rate 0.002, while the BBox refinement network is trained for 200 epochs with batch size 4 and learning rate 0.002.
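The farthest point sampling underlying the set-abstraction subsampling can be sketched naively as follows; the O(n·m) loop is for illustration only, as practical implementations run batched on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick m points, each time choosing the point farthest
    from the set already selected (naive CPU sketch)."""
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)          # distance to the selected set
    chosen[0] = 0                      # arbitrary seed point
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)     # update nearest-selected distance
        chosen[i] = int(dist.argmax()) # farthest remaining point
    return points[chosen]
```

Compared with uniform random sampling, this keeps the subsampled cloud well spread over the whole split, which matters for sparse rare-class objects.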

Segmentation Results
Our proposed approach is evaluated on the H3D Benchmark dataset. The segmentation results are evaluated by overall accuracy (OA) and F1-score.
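For reference, both metrics can be computed from a class confusion matrix as sketched below; treating rows as ground truth and columns as predictions is an assumed convention of this sketch.

```python
import numpy as np

def oa_and_mean_f1(confusion):
    """Overall accuracy and mean F1-score from a (C, C) confusion matrix
    with rows = ground truth and columns = predictions."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                                  # per-class true positives
    oa = tp.sum() / confusion.sum()
    precision = tp / np.maximum(confusion.sum(axis=0), 1e-12)
    recall    = tp / np.maximum(confusion.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return oa, f1.mean()
```

Because the mean F1-score averages over classes regardless of their frequency, it is far more sensitive to rare classes such as Vehicle and Chimney than the overall accuracy.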
The semantic segmentation results and confusion matrix are shown in Figure 3(b), where the overall accuracy achieves a state-of-the-art 89.35% and the mean F1-score achieves an outstanding 75.70%. The visualization of the corresponding result on the test set is shown in Figure 4(c), and the ground truth is shown in Figure 4(a). It can be observed that the predictions are very close to the ground truth. The confusion mainly exists between Vehicle and Urban furniture, and Soil/Gravel is often inferred as Low vegetation. These ambiguities are caused by their limited inter-class distances and scarce appearances.

Table 2. Performance comparison between the proposed two-stage approach and the single-stage segmentation.
In order to verify the effectiveness of the proposed segmentation approach for imbalanced fine-grained objects, we also compare it with the single-stage segmentation result (without refined semantic segmentation). The performance comparison is shown in Table 2. Benefiting from the specialized segmentation for imbalanced rare classes, the two-stage approach performs better than the single-stage method in all evaluation metrics. Due to the low percentage of the fine-grained rare classes, there is only a limited improvement (0.16%) in overall accuracy. However, Vehicle and Chimney achieve breakthroughs from zero to 63.63% and 52.00% in F1-score, respectively, which also enables our two-stage approach to outperform the single-stage segmentation by a large margin of 10.34 percentage points in mean F1-score.

CONCLUSION AND OUTLOOK
In this work we have presented a two-stage segmentation approach for imbalanced rare classes, which unifies object detection models into the semantic segmentation framework. Comprehensive experiments on large-scale urban data demonstrated that the proposed approach obtains state-of-the-art overall accuracy and a satisfactory mean F1-score, and achieves outstanding F1-scores for imbalanced rare classes. However, the proposed solution also has limitations. The method is only suitable for fine-grained objects with strongly discrete distributions, and it requires a considerable amount of computational resources due to the additional training of the refinement network. In future work, we will focus on the feature-level unification of detection networks into the segmentation framework, and construct an end-to-end lightweight segmentation network for imbalanced rare classes.
Figure 4. Visualization of the test results: (a) ground truth, (b) the single-stage segmentation, (c) our proposed two-stage approach.