Label-efficient Deep Learning-based Semantic Segmentation of Building Point Clouds at LoD3 Level

In recent years, Deep Learning (DL) techniques trained on large amounts of pointwise labels have been employed to segment point clouds of the built environment. However, annotating pointwise labels is a time-consuming task. To address this issue, we propose a label-efficient DL network that obtains per-point semantic labels of LoD3 (Level-of-Detail 3) building point clouds with limited supervision. Experimentally, we compared our approach with fully supervised DL methods and found that it achieves comparable results on the ArCH Data Set while using only 10% of the labelled training data used by the fully supervised methods.


Introduction
In recent years, point cloud representations of 3D buildings have enabled and promoted new applications in many fields, such as Cultural Heritage preservation [1][2], Construction Engineering [3][4], Emergency Decision-making [5], and Smart Cities [6]. Extracting semantic information from buildings' point clouds to support modelling at a high Level-of-Detail (LoD) is an essential task [7].
LiDAR data sets have become available at ever-growing resolution and accuracy. Inspired by the success of Deep Neural Networks (DNNs) in Computer Vision (CV) tasks (i.e., classification, detection and semantic segmentation), recent research has employed fully supervised Deep Learning (DL) techniques and large amounts of pointwise labels to train segmentation networks for buildings' point clouds. However, fine-labelled point clouds of the built environment are hard to find, and manually annotating pointwise labels is a time-consuming and expensive task. As a result, the application of fully supervised learning to semantic segmentation of buildings' point clouds at LoD3 level is severely limited.
In CV, the hunger for fine-labelled training data is often tackled by using unsupervised methods. However, these approaches are mostly designed for 2D images, which are fundamentally different from unordered 3D point clouds. Furthermore, the application of label-efficient unsupervised learning to downstream tasks in the 3D field is still limited to classification and segmentation of small-scale point clouds. From a scientific viewpoint, unsupervised DL-based semantic segmentation of buildings' point clouds is still an open issue, and current knowledge about it remains unsatisfactory.
To address this issue, we propose a novel label-efficient DL network that obtains per-point semantic labels of LoD3 buildings' point clouds with limited supervision. It consists of two main steps. The first step, the Autoencoder (AE), is composed of a Dynamic Graph Convolutional Neural Network-based [8] encoder and a folding-based decoder. It is designed to extract discriminative global and local features from input point clouds by reconstructing them without any labels. The second step is the semantic segmentation network. Given a small amount of task-specific supervision, this network semantically segments the encoded features acquired from the pre-trained Autoencoder.

Related Work
Unsupervised learning refers to learning methods that do not use any human-annotated labels. Owing to the scarcity of fine-labeled point cloud datasets, unsupervised learning methods have become popular alternatives to fully supervised learning: they exploit the inherent and underlying information in large unlabeled datasets, which may dramatically decrease the need for labeled training data. Following the impressive results achieved with unsupervised learning in the 2D image field, previous efforts to perform unsupervised learning on point clouds have been derived by tailoring these methods. Several unsupervised methods (e.g., Generative Adversarial Networks and Autoencoders) applied to 3D point clouds are reported in the literature, partly in response to the common criticism that a huge amount of labeled data is required to train a DNN. We provide a brief overview of both types of methods.

Generative Adversarial Networks
Typically, Generative Adversarial Networks (GANs) consist of a generator, which learns to map from a latent space to a data distribution of interest, and a discriminator, which distinguishes point clouds produced by the generator from the true data distribution. For example, Achlioptas [9] investigated and compared GAN-based methods for generating point clouds in the raw data space and in the latent space of a pre-trained autoencoder. Li [10] proposed a "sandwiching" reconstruction method that combines a modified Wasserstein GAN [11] loss with the Earth Mover's Distance (EMD). AtlasNet [12] introduces a shape generation framework that represents a 3D shape as a collection of parametric surface elements by locally mapping a set of squares to the target surface of the shape. Although impressive results were achieved, GAN-based methods focus mainly on generative models of point clouds, which aim to generate point clouds or complete their shapes.

Autoencoders (AEs)
An Autoencoder (AE) is trained to learn a compressed representation by faithfully reconstructing the original input image or point cloud [13]. In FoldingNet [14], the authors adopted a folding-based decoder to deform a canonical 2D grid onto the underlying 3D object surface of a point cloud; the learned representation achieves high linear SVM classification accuracy on the ModelNet40 dataset. Building on the fully supervised PPFNet [15] and on FoldingNet, the authors of PPF-FoldNet [16] improve their earlier solution by involving more features in their network in an unsupervised fashion. PPF-FoldNet achieves better reconstruction performance under rotations and varying point densities, but the research focuses on reconstruction rather than downstream tasks. BAE-NET [17] proposed a branched AE network that is trained on a collection of objects from the ShapeNetPart dataset for a shape co-segmentation task.
Existing methods achieve state-of-the-art performance in their downstream tasks (i.e., classification, part segmentation and co-segmentation). However, most existing unsupervised AE methods for 3D point clouds are 1) trained and tested on simple 3D objects, and 2) designed for low-level tasks such as reconstruction, denoising and completion. As a result, they have not yet been applied to high-level downstream tasks such as semantic segmentation.

Method
In FoldingNet, an Autoencoder (AE) is utilized to reconstruct input point clouds, while discriminative representations are learned without any labelled data. Inspired by this, our label-efficient method aims to (1) construct an AE network that extracts features without any labelled data, and (2) train, with only a small amount of labelled data, a segmentation network for semantic segmentation of high-resolution LoD3 buildings' point clouds. Specifically, we propose an AE network that learns representations without any labels through a dynamically updated graph-based encoder and a folding-based decoder, which reduces the need for large amounts of labels. Instead of the encoder in FoldingNet, we employ the EdgeConv layers of the Dynamic Graph Convolutional Neural Network (DGCNN) to exploit local geometric structures and generate discriminative representations. We then use the learned representations as input to our downstream task. Overall, the proposed network architecture (see Fig. 1) consists of two components: an AE and a segmentation network.

Autoencoder
The input of the AE is given by the N coordinates (x, y, z) of the buildings' points, and its intermediate outputs are discriminative features, which serve as input to both the decoder of the AE and the segmentation network. The final outcome is a matrix of size (m, 3) representing the reconstructed point cloud. We use graph-based layers to extract the local geometric information around points and a max-pooling layer to aggregate information. The edge features are computed as follows:

$$e_{ij} = h_{\Theta}(x_i,\; x_j - x_i)$$

In this edge function, $x_i$ is the central point belonging to the point set $X = \{x_1, \dots, x_N\} \subseteq \mathbb{R}^{3}$, $x_j$ is a local neighbor of the central point, and $h_{\Theta}$ is implemented by a fully connected multi-layer perceptron with learnable parameters $\Theta$. EdgeConv captures the global shape by encoding the coordinates of $x_i$, and obtains the local information by encoding $x_j - x_i$. The learned local information is then aggregated by a local max-pooling operation on the constructed graph $G = (V, E)$, where $V = \{1, \dots, N\}$ and $E \subseteq V \times V$ are the vertices and the edges, respectively, and N is the number of vertices.
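To make the edge function concrete, the following is a minimal pure-Python sketch of one EdgeConv step on raw coordinates. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`knn_indices`, `edge_inputs`, `toy_h`, `edge_conv`) are hypothetical, the neighbourhood excludes the central point itself, and `toy_h` is a single linear layer with ReLU standing in for the learned perceptron.

```python
import math


def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point (self excluded; an assumption)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def edge_inputs(points, k):
    """Build the input (x_i, x_j - x_i) of the edge function for every edge of the k-NN graph."""
    nbrs = knn_indices(points, k)
    feats = []
    for i, p in enumerate(points):
        row = []
        for j in nbrs[i]:
            diff = [qc - pc for qc, pc in zip(points[j], p)]
            row.append(list(p) + diff)  # global part x_i, then local part x_j - x_i
        feats.append(row)
    return feats  # nested list of shape (N, k, 6)


def toy_h(edge_input, weights):
    """Hypothetical stand-in for the learned edge function: linear layer + ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(w_row, edge_input))) for w_row in weights]


def edge_conv(points, k, weights):
    """Apply the edge function to each edge, then max-pool over each point's neighbours."""
    pooled = []
    for row in edge_inputs(points, k):
        edge_feats = [toy_h(e, weights) for e in row]
        pooled.append([max(channel) for channel in zip(*edge_feats)])
    return pooled  # shape (N, number of output channels)
```

In a dynamic graph network the neighbour search is typically repeated in feature space at every layer; the sketch above only shows the first layer, where the features are the coordinates themselves.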
We use the "codeword" output by the DGCNN-based encoder, together with a 2D grid, as input to our decoder. The folding-based decoder, adopted from FoldingNet, reconstructs a 3D point cloud from the codeword and the 2D grid through two successive folding operations. The first folding operation folds the 2D manifold into 3D space, and the second one operates inside the 3D space.
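The data flow of the two folding operations can be sketched as follows. This is a simplified illustration that assumes the FoldingNet layout, in which the codeword is concatenated with every grid point before the first fold and with every intermediate 3D point before the second fold. The names `make_grid`, `toy_mlp`, `fold` and `decode` are hypothetical, and `toy_mlp` is a plain linear map standing in for the learned perceptrons.

```python
def make_grid(side, lo=-1.0, hi=1.0):
    """Regular 2D grid with side*side points in [lo, hi]^2 (requires side >= 2)."""
    step = (hi - lo) / (side - 1)
    return [[lo + i * step, lo + j * step] for i in range(side) for j in range(side)]


def toy_mlp(row, weights):
    """Hypothetical stand-in for a learned perceptron: one linear map to 3 outputs."""
    return [sum(w * v for w, v in zip(w_row, row)) for w_row in weights]


def fold(codeword, points, weights):
    """One folding operation: concatenate the codeword with each point, then map to 3D."""
    return [toy_mlp(list(codeword) + list(p), weights) for p in points]


def decode(codeword, side, w_first, w_second):
    """Two successive folds: 2D grid -> 3D, then a refinement inside 3D space."""
    grid = make_grid(side)                       # (m, 2) with m = side * side
    intermediate = fold(codeword, grid, w_first)  # input width Cout + 2, output (m, 3)
    return fold(codeword, intermediate, w_second)  # input width Cout + 3, output (m, 3)
```

With a codeword of length Cout, `w_first` needs three rows of length Cout + 2 and `w_second` three rows of length Cout + 3; the result has the (m, 3) shape of the reconstructed point cloud.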

Semantic Segmentation Network
To semantically segment buildings' point clouds, we created a segmentation network. The goal is to assign a semantic label to each point of an input point cloud; hence, we treat semantic segmentation as a per-point classification task. The outputs of the pre-trained AE are a Cout-dimensional representation ("codeword") and three stacked edge features, all learned from non-labelled buildings' point clouds. We replicate the codeword N times and concatenate it with the outputs of the three EdgeConv layers of the pre-trained AE. A standard 3-layer shared Multi-Layer Perceptron (MLP) with a cross-entropy loss is then applied to the concatenated features as our semantic segmentation classifier. Since the features obtained by the proposed AE are already distinctive, we chose this simple MLP for the segmentation of the point cloud. The semantic segmentation network is trained independently from the proposed AE. Its final output is a matrix of per-point classification scores of size (m, n_classes).
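The feature assembly and the per-point loss described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the features are plain Python lists, and the 64-channel width of each EdgeConv output is taken from the layer sizes reported in the Implementation Details.

```python
import math


def assemble_point_features(codeword, edge1, edge2, edge3):
    """Replicate the codeword for every point and concatenate the three EdgeConv outputs."""
    if not (len(edge1) == len(edge2) == len(edge3)):
        raise ValueError("all EdgeConv outputs must cover the same N points")
    return [
        list(codeword) + list(f1) + list(f2) + list(f3)
        for f1, f2, f3 in zip(edge1, edge2, edge3)
    ]  # shape (N, Cout + 64 + 64 + 64) when each EdgeConv output has 64 channels


def cross_entropy(scores, label):
    """Softmax cross-entropy of one point's class scores against its integer label."""
    top = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[label]
```

Only the shared MLP that consumes these per-point features is trained with labels; the codeword and EdgeConv features come from the pre-trained AE.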

Implementation Details
Experimentally, we evaluate our approach on the ArCH Data Set [18], which was acquired by both terrestrial laser scanners (i.e., a FARO Focus 3D X 130 and 120, and a Riegl VZ-400) and Structure-from-Motion Photogrammetry based on images collected by a DJI Phantom UAV platform equipped with a SONY Ilce 5100L camera.
Our primary motivation for studying unsupervised learning is that the amount of training data is limited. To test the performance when both unlabelled and labelled data are scarce, we select three small scenes (SMV_1, SMV_24, SMV_28) from the 15 labelled scenes as the training data for both the unsupervised AE training stage and the supervised segmentation training stage. The training data in our experiment amounts to only 10% of that used by the state-of-the-art [2], where 10 scenes are used as training data. Following the settings of the state-of-the-art [2], we remove the "others" category and select two unseen scenes, "A_SMG_portico" (Scene_A) and "B_SMV_chapel_27to35" (Scene_B), as our test data.
We choose a 1m×1m area as the block size for splitting each building scene into training blocks. Prior to training, the input point clouds are aligned to a common reference frame. In addition, for training convenience, the points in each block are sampled to a uniform number of 8,192 points. At training time, we randomly sample n (2,048 or 4,096) points in each block on-the-fly.
To train our AE network, we employ ADAM as the optimizer with an initial learning rate of 0.001, a batch size of 16, and a weight decay of 1e−6, for 250 epochs. The setting of hidden layers in our encoder is the same as in DGCNN, but we remove the layers after the max-pooling layer. Similarly, in the semantic segmentation network, we also use ADAM as the optimizer (learning rate 0.01, batch size 16, 250 training epochs). Depending on the dimension Cout, our shared MLP has layer sizes (Cout+64+64+64, 512, 256, 128, n_classes), i.e., an input of width Cout+64+64+64 followed by layer output sizes (512, 256, 128, n_classes) on each point. The evaluation metrics, overall accuracy (OA) and mean Intersection-over-Union (mIoU), are calculated on the ArCH Data Set. The method is implemented using PyTorch. All experiments are conducted on an NVIDIA Tesla T4 GPU.
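The block partitioning and on-the-fly sampling described above can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: blocks are taken in the horizontal (x, y) plane, block indices are obtained by flooring the coordinates, and blocks with fewer than n points are padded by sampling with replacement. The function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_into_blocks(points, block_size=1.0):
    """Group points into block_size x block_size cells of the (x, y) plane (assumed layout)."""
    blocks = defaultdict(list)
    for p in points:
        key = (math.floor(p[0] / block_size), math.floor(p[1] / block_size))
        blocks[key].append(p)
    return dict(blocks)


def sample_fixed(points, n, rng):
    """Return exactly n points: a random subset, or the block padded with repeated points."""
    if len(points) >= n:
        return rng.sample(points, n)
    return list(points) + rng.choices(points, k=n - len(points))


def training_sample(block_points, n, rng, stored=8192):
    """Mimic the two-stage sampling: a fixed-size stored block, then n points per iteration."""
    fixed = sample_fixed(block_points, stored, rng)
    return sample_fixed(fixed, n, rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, for example `training_sample(block, 2048, random.Random(0))`.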

Results
If the features obtained by the proposed AE are already distinctive, the amount of labelled data required to train the semantic segmentation network should be small. In this section, to support this intuition, we report our experimental results on the ArCH Data Set. We evaluate our model on an unseen scene (Scene_B). In Table 1, the overall performance is reported and compared with state-of-the-art methods, whose results are taken from Pierdicca [2]: PointNet [19], PointNet++ [20], DGCNN [8] with 10 scenes, and DGCNN with 15 scenes [2] as training data.
Overall, with only about 10% of the training data of state-of-the-art (SOTA) methods in both the AE and the segmentation network training stages, our model achieves the best results on the ArCH dataset under the same training strategy (only x, y, z coordinates as input), as shown in Table 1. The mIoU on Scene_B is 0.408, which also outperforms the 0.353 of the SOTA. The qualitative semantic segmentation results on Scene_B are shown in Fig. 2. Our network is able to output smooth predictions.
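For reference, the two metrics reported above can be computed from a confusion matrix as in the minimal sketch below. It uses the standard definitions (OA is the fraction of correctly labelled points; the IoU of a class is TP / (TP + FP + FN)). Skipping classes that appear in neither the ground truth nor the predictions is an assumption of this sketch, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts points with ground-truth class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of points whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean of per-class IoU = TP / (TP + FP + FN), skipping classes with no points."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

Because mIoU averages over classes rather than points, it penalises errors on rare building elements more strongly than OA does.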

Comparison with Different Training Data Size
To evaluate the impact of training data size (both labeled and unlabeled), we conducted further experiments on another unseen scene, Scene_A, from four aspects:
• Add one scene ("4_CA_church") as unlabeled training data in the AE training stage;
• Add one scene ("4_CA_church") as labeled training data in the segmentation network training stage;
• Add one scene ("4_CA_church") in both the AE and the segmentation network training stages; and
• Decrease the labeled training data size by keeping just one scene ("7_SMV_chapel_24") in the segmentation network training stage.
The segmentation results on Scene_A are shown in Table 2. They suggest that adding labeled training data in the segmentation network training stage further improves our performance. For instance, when we add one labeled scene in the segmentation network training stage, our performance increases by 1% and 3% with the AE pre-trained on three scenes and four scenes, respectively. Furthermore, no increase was detected when we added unlabeled training data, which suggests that a good representation had already been learned by training the AE on three scenes. More importantly, these results provide further evidence that our network is label efficient: even when the labeled data was decreased to just one scene (4% of the overall labeled data in the supervised method), our overall accuracy still remains at 0.695.

Conclusions
In this study, we have presented an effective label-efficient unsupervised network for LoD3 buildings' point cloud semantic segmentation. The results of our experiments support the claim that the proposed Autoencoder architecture can learn powerful representations from unlabeled data, and that these representations can be further used in downstream tasks. Furthermore, the segmentation of building point clouds obtains equal or better results with respect to the state of the art on the basis of only 10% of the training data from the ArCH Dataset.
In future work, it might be possible to improve the performance by overcoming the limitation on input block size and by incorporating more features (if available, see [21]) of the input point cloud of buildings, while still using a very limited amount of labeled training data.