EXPLORING CROSS-CITY SEMANTIC SEGMENTATION OF ALS POINT CLOUDS

Deep learning models achieve excellent semantic segmentation results for airborne laser scanning (ALS) point clouds, if sufficient training data are provided. Increasing amounts of annotated data are becoming publicly available thanks to contributors from all over the world. However, models trained on a specific dataset typically exhibit poor performance on other datasets. That is, there are significant domain shifts, as data captured in different environments or by distinct sensors have different distributions. In this work, we study this domain shift and potential strategies to mitigate it, using two popular ALS datasets: the ISPRS Vaihingen benchmark from Germany and the LASDU benchmark from China. We compare different training strategies for cross-city ALS point cloud semantic segmentation. In our experiments, we analyse three factors that may lead to domain shift and affect the learning: point cloud density, LiDAR intensity, and the role of data augmentation. Moreover, we evaluate a well-known standard method of domain adaptation, deep CORAL (Sun and Saenko, 2016). In our experiments, adapting the point cloud density and appropriate data augmentation both help to reduce the domain gap and improve segmentation accuracy. In contrast, intensity features can bring an improvement within a dataset, but deteriorate the generalisation across datasets. Deep CORAL does not further improve the accuracy over the simple adaptation of density and data augmentation, although it can mitigate the impact of improperly chosen point density, intensity features, and further dataset biases like lack of diversity.


INTRODUCTION
Unordered point clouds in 3D space have become a standard representation of spatial data, used across a wide range of applications like digital mapping, building information modelling and transportation planning. An important task for many such applications is semantic segmentation, i.e., assigning a semantic class label to every point. As manual labelling is time-consuming and expensive, researchers have long sought to automate that task. Thanks to deep neural networks, the accuracy of supervised semantic segmentation has improved significantly in recent years. But deep learning relies on large quantities of annotated reference data. Labelling a sufficiently large and diverse training set for every location and/or every sensor still presents a significant workload and is not scalable. For example, labelling 2 km² of ALS data from Dublin (Ireland) into 13 hierarchical multi-level classes took >2,500 person-hours (Zolanvari et al., 2019). More and more annotated ALS data is available in public datasets and benchmarks, labelled according to various nomenclatures. If models trained on such public data (source scenes) could be transferred to other target scenes, per-project annotation would become obsolete. However, in practice almost every project (including the public datasets) is different in terms of source and target environment. Machine learning models, in particular deep learning models, tend to overfit to the source data and therefore deliver poor results when naively applied to new, previously unseen target data.
Many studies have explored strategies to mitigate domain shift and overfitting (from here on simply termed "training strategies"), so as to employ machine learning when the source and target data follow different distributions. One natural approach, often used for point clouds, is data augmentation to artificially increase the diversity of the training data. Besides, there are also more formal methods for so-called unsupervised domain adaptation, meaning statistically inspired strategies to adapt to a new target distribution for which only data, but no ground truth annotations, are available. Unsupervised domain adaptation has recently shown promise in 2D image processing (Wilson and Cook, 2020, Wang and Deng, 2018). Recently, some authors have also started to adopt it for 3D point cloud interpretation (Wu et al., 2019, Luo et al., 2020, Jaritz et al., 2020). Here, we investigate a number of elementary training strategies for semantic segmentation of ALS point clouds across different cities. To that end, we work with two public ALS datasets from Germany and China, and transfer models between them. As semantic segmentation model, we construct a residual U-net style convolution architecture and employ KPConv (Thomas et al., 2019) as the backbone, due to its proven performance on ALS point clouds (Varney et al., 2020). In our experiments, we analyse three factors that may affect generalisation across cities: (i) the point density that is fed into the network; (ii) the augmentation method employed to synthetically increase data diversity; and (iii) the influence of intensity features (on top of pure point coordinates). Furthermore, inspired by the success of unsupervised, statistical domain adaptation in image processing, we also evaluate the effectiveness of a widely known method, deep CORAL (Sun and Saenko, 2016).
We find that elementary measures, like setting a suitable point density and augmentation, significantly benefit cross-city generalisation, whereas deep CORAL does not further improve over them. On the contrary, intensity complicates generalisation and might best be discarded when the generality of the model is desirable.

RELATED WORK
This section reviews recent developments of point cloud semantic segmentation, and associated training strategies aimed at improving generalisation.

Semantic Segmentation Techniques for Point Clouds
Point cloud semantic segmentation is a supervised classification task. Shallow machine learning classifiers with manually designed features have been the traditional way to address the problem, including for example support vector machines (Zhang et al., 2013), random forests (Weinmann et al., 2015, Hackel et al., 2016), and AdaBoost (Wang et al., 2014). The crucial step in this setting is feature extraction. For point clouds, the most common features are basic geometric properties, height-based features if the gravity direction is known, and features based on eigenvalues of the local point distribution (Weinmann et al., 2015, Hackel et al., 2016, Xu et al., 2019). Besides, graph-based neighbourhood models such as conditional random fields have been utilised as a post-processing step to smooth the per-point labels (Landrieu et al., 2017).
In recent years, deep learning has become the dominant approach for point cloud analysis. It requires no feature engineering and achieves better performance for many tasks including semantic segmentation. Deep learning-based methods can be sorted into three main categories: image-based, voxel-based, and point-based. Image-based methods project point clouds to image-like 2D representations, then apply 2D convolutions to them (Boulch et al., 2018, Yang et al., 2017). Their main shortcoming is that they do not fully exploit the 3D geometry. Another solution is to discretise the point cloud to a regular, ordered voxel grid and then use regular 3D convolutions (Tchapmi et al., 2017). Voxel-based deep learning is time-consuming and memory-hungry, so most methods now exploit the sparsity of the voxel space and employ sparse convolutions (Choy et al., 2019, Graham et al., 2018) that only operate on non-empty voxels. Point-based methods include different techniques that make it possible to operate directly on the point cloud. They mainly differ in the way they define the kernels. The pioneering PointNet (Qi et al., 2017a) simply replaces convolution with a more general multi-layer perceptron (MLP). However, PointNet only learns global features, but not local ones. To overcome this limitation, PointNet++ was proposed, which captures local features via an image pyramid-like hierarchical aggregation (Qi et al., 2017b). Several recent works instead design explicit convolution kernels for point clouds. Among them, KPConv (Thomas et al., 2019) has demonstrated high efficiency and good performance for point cloud semantic segmentation, notably for large, mobile-mapping type outdoor scenarios.

Training Strategies for Point Clouds
Data augmentation is an elementary training strategy for deep learning tasks (Shorten and Khoshgoftaar, 2019). Rotation, scaling, symmetry, random noise, and randomly removing points are common augmentation operations for point clouds (Chaton et al., 2020, Thomas et al., 2019). By synthetically increasing the diversity of patterns in the data, they can help to prevent overfitting when training data is limited. Recently, it has also been proposed to learn the data augmentation.
While, at first glance, deep learning continues to set the state of the art on many public benchmarks, the situation in reality is more complex. The excellent performance is achieved only when training and testing on data from the same dataset, i.e., recorded in the same (or a very similar) environment with the same sensor setup. Effectively, the semantic segmentation easily overfits to the unique, specific conditions, so that domain shifts exist even between seemingly similar datasets. This extreme specialisation, due to the high capacity of deep networks, creates a need for domain adaptation. The need was first observed for 2D images (Wilson and Cook, 2020, Wang and Deng, 2018), but has more recently also been explored for various 3D point cloud analysis tasks. In the setting of self-driving scenarios, Langer et al. (2020) and Wu et al. (2019) first project LiDAR point clouds to images and then apply image-based domain adaptation to them to aid semantic segmentation. For the important, point cloud-specific domain shift of density differences, Yi et al. (2020) formulate domain adaptation as a complete-and-label problem. A voxel completion network is proposed to fill in gaps between the source and target data, so that they have similar density. xMUDA (Jaritz et al., 2020) utilises cross-modal learning with images to address the domain shift between point clouds in road scenes. Mutual information from cross-modal features is shown to improve semantic segmentation.
These works are aimed at point cloud semantic segmentation, but the domain adaptation strategies do not directly operate on uni-modal point clouds. Towards direct point cloud domain adaptation, Luo et al. (2020) propose a framework that jointly aligns data and feature distributions of MLS point clouds, with a small network to refine the elevation of target data and an adversarial network to align the features. Peng et al. (2020) also address domain adaptation with adversarial learning and demonstrate their method for two similar ALS datasets (captured in the same region) and for an ALS dataset and an MLS dataset.

METHODOLOGY
We are not aware of any systematic comparison of different domain adaptation strategies for point clouds. In this work, we set up a state-of-the-art semantic segmentation pipeline, with KPConv as the backbone, and compare several basic and practical training strategies. We run experiments under different conditions in terms of input point cloud density, data augmentation, and the use of intensity features. Beyond these "hand-designed" manipulations of the input data, we also test a classical, well-established domain adaptation algorithm, deep CORAL (Sun and Saenko, 2016).

Semantic Segmentation by KPConv
KPConv (Thomas et al., 2019) is a direct point cloud convolution operator, based on the idea of approximating the continuous convolution operator in a local, spherical 3D neighbourhood. Let $p_i$ and $f_i$ be the points of a point cloud $P \in \mathbb{R}^{N \times 3}$ and their corresponding features from $F \in \mathbb{R}^{N \times D}$. The point convolution at a point $p \in \mathbb{R}^{3}$ is denoted as follows:
$$(F * g)(p) = \sum_{p_i \in N_p} g(p_i - p)\, f_i \qquad (1)$$
where $g$ is the kernel function of KPConv and $N_p = \{p_i \in P \mid \|p_i - p\| \le r\}$ is the set of neighbour points of $p$ within a fixed radius $r \in \mathbb{R}$. In KPConv, $g$ takes those neighbours, centred on $p$, as input to the convolution. The domain of $g$ is defined as a 3D ball:
$$B_r^3 = \{q_i \in \mathbb{R}^{3} \mid \|q_i\| \le r\} \qquad (2)$$
where $q_i = p_i - p$.
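To make the point convolution concrete, the following is a minimal NumPy sketch of evaluating the sum over the radius neighbourhood for a single query point. It is an illustration rather than the authors' implementation: the function names, the argument `sigma` and the weight layout are assumptions, and the kernel function uses the linear correlation between neighbours and kernel points described by Thomas et al. (2019).

```python
import numpy as np

def radius_neighbours(points, p, r):
    """Indices of points p_i with ||p_i - p|| <= r (the set N_p)."""
    dist = np.linalg.norm(points - p, axis=1)
    return np.nonzero(dist <= r)[0]

def kpconv_at_point(points, feats, p, r, kernel_pts, weights, sigma):
    """Evaluate sum_i g(p_i - p) f_i for one query point p.

    kernel_pts: (K, 3) kernel point positions inside the ball of radius r
    weights:    (K, D_in, D_out) one weight matrix per kernel point
    sigma:      influence distance of a kernel point (assumed hyper-parameter)
    """
    idx = radius_neighbours(points, p, r)
    q = points[idx] - p  # (n, 3) centred neighbours q_i
    # Linear correlation h = max(0, 1 - ||q_i - x_k|| / sigma) per kernel point.
    d = np.linalg.norm(q[:, None, :] - kernel_pts[None, :, :], axis=2)  # (n, K)
    h = np.maximum(0.0, 1.0 - d / sigma)
    # g(q_i) = sum_k h_ik W_k; the output feature is sum_i g(q_i)^T f_i.
    return np.einsum("nk,kio,ni->o", h, weights, feats[idx])

# Toy example: three points on a line, one kernel point at the origin.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
feats = np.array([[1.0], [2.0], [4.0]])
out = kpconv_at_point(points, feats, np.zeros(3), r=2.0,
                      kernel_pts=np.zeros((1, 3)),
                      weights=np.ones((1, 1, 1)), sigma=2.0)
```

In the toy example only the first two points fall inside the radius; the third is ignored, and the contribution of each neighbour decreases linearly with its distance to the kernel point.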
KPConv provides two kernel versions, a rigid and a deformable one. In the former, the kernel points are distributed in a fixed layout within the sphere, whereas the deformable one allows for learned shifts of their positions. In practice, deformable KPConv does not outperform the rigid version on scenes lacking diversity such as ALS point clouds (Thomas et al., 2019, Lin et al., 2021), but requires more GPU memory and run time. Hence, we use rigid KPConv in this work. In our experiments we use the authors' original PyTorch-based implementation (https://github.com/HuguesTHOMAS/KPConv-PyTorch).
Our semantic segmentation embeds KPConv in a U-net architecture (Ronneberger et al., 2015), following the ResNet block design (He et al., 2016) in the encoder. Each convolution layer in this network is followed by batch normalization (BN) (Ioffe and Szegedy, 2015) and a Leaky ReLU activation (Maas et al., 2013). Grid sampling is employed as the sub-sampling strategy to reduce the density and increase the context along the layers. Hence, the data in each layer are the center points of regularly spaced grid cells. The convolution sphere radius $r_i$ for the $i$-th layer is adjusted by a corresponding factor $\alpha$, i.e.,
$$r_i = \alpha \times l_i \qquad (3)$$
where $l_i$ and $r_i$ denote the grid size and convolution radius in the $i$-th layer. Due to limited GPU RAM, the size of the input sphere, and thus the size of the receptive field in the network, depend on the grid spacing of the first sub-sampling: wider spacing causes stronger down-sampling (with potential loss of information), but on the other hand allows for a larger receptive field (with more context).
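The grid sub-sampling and the layer-wise radius can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the pipeline's code: each occupied cell is represented by the barycentre of its points, the grid size is assumed to double from one layer to the next, and the number of layers (`n_layers`) is a hypothetical parameter.

```python
import numpy as np

def grid_subsample(points, cell):
    """Keep one point per occupied grid cell (here: the barycentre of its points)."""
    keys = np.floor(points / cell).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)  # some NumPy versions return a 2D inverse
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)  # accumulate coordinates per cell
    return sums / counts[:, None]

def layer_radii(first_cell, alpha=2.5, n_layers=5):
    """Pairs (l_i, r_i) with r_i = alpha * l_i, assuming l_i doubles per layer."""
    cells = [first_cell * 2 ** i for i in range(n_layers)]
    return [(l, alpha * l) for l in cells]

pts = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [0.9, 0.9, 0.9]])
sub = grid_subsample(pts, cell=0.5)   # two occupied cells remain
radii = layer_radii(first_cell=0.5)   # first layer: l = 0.5 m, r = 1.25 m
```

A coarser first grid reduces the number of points per input sphere, which is what allows a larger sphere to fit into the same GPU memory budget.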

Domain Adaptation by Deep CORAL
Correlation alignment is a popular, representative statistical algorithm for unsupervised domain adaptation. It tries to minimise the domain shift by aligning the second-order statistics of source and target feature distributions, which can be done without any labels for the target domain. We adopt a deep version of correlation alignment, named deep CORAL (Sun and Saenko, 2016), which can be directly integrated into any neural network architecture.
Deep CORAL imposes the correlation alignment as a soft constraint, via the loss function. The CORAL loss is defined as the distance between the second-order statistics in the source and target feature matrices:
$$L_{CORAL} = \frac{1}{4d^2} \, \| C_S - C_T \|_F^2 \qquad (4)$$
with $C_S$ and $C_T$ the source and target feature covariance matrices, $d$ the feature dimension and $\|\cdot\|_F$ the Frobenius norm (Sun and Saenko, 2016). During training, $L_{CORAL}$ is minimised with minibatches from the training set of the source domain and the target domain. The intuition behind deep CORAL is to "deform" the source and target feature distributions such that they match up to second-order statistics, assuming that the class-conditional distributions will then match better, too.
Multiple CORAL loss functions over different activation layers within the network can be combined, and added to the semantic segmentation loss $L_{seg}$, to obtain a joint loss function:
$$L = L_{seg} + \sum_{i=1}^{t} \lambda^{(i)} L_{CORAL}^{(i)} \qquad (5)$$
where $t$ is the number of layer-wise CORAL losses and $\lambda^{(i)}$ is the weight coefficient of the $i$-th CORAL loss.
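As an illustration of the two losses, the sketch below computes a CORAL term and the joint loss with NumPy. It assumes the definition of Sun and Saenko (2016), i.e., the squared Frobenius distance between the feature covariances scaled by 1/(4d²); the function names are hypothetical, and in an actual network the same computation would be written with differentiable tensor operations.

```python
import numpy as np

def covariance(feats):
    """Unbiased covariance matrix of an (n, d) batch of feature vectors."""
    centred = feats - feats.mean(axis=0, keepdims=True)
    return centred.T @ centred / (feats.shape[0] - 1)

def coral_loss(source_feats, target_feats):
    """Squared Frobenius distance between the covariances, scaled by 1 / (4 d^2)."""
    d = source_feats.shape[1]
    diff = covariance(source_feats) - covariance(target_feats)
    return np.sum(diff ** 2) / (4.0 * d * d)

def joint_loss(seg_loss, coral_losses, weights):
    """Segmentation loss plus the weighted sum of layer-wise CORAL losses."""
    return seg_loss + sum(w * c for w, c in zip(weights, coral_losses))

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))   # stand-in for source features of one layer
tgt = 2.0 * src                  # target features with four times the covariance
loss = joint_loss(1.0, [coral_loss(src, tgt)], [1.0])
```

Note that the CORAL term needs no labels: it only compares batch statistics, so target batches without annotations can contribute to it.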
We train our KPConv-based residual U-net with standard cross-entropy loss for $L_{seg}$. Empirically, CORAL terms for lower layers did not have much influence, so we only align the feature maps of the last activation layer with a single CORAL loss $L_{CORAL}$. The network architecture is depicted in Figure 1. $x_{S_i}$ and $x_{T_i}$ represent input source and target samples, respectively. $y_{S_i}$ is the set of input labels (given only for the input source data).

Datasets
Two ALS point cloud benchmark datasets are adopted for the evaluation: the ISPRS Vaihingen benchmark (Cramer, 2010, Rottensteiner et al., 2012) and LASDU (Ye et al., 2020). ISPRS Vaihingen was captured with a Leica ALS50 system from an average flying height of ≈500 m in Vaihingen, Germany, while LASDU was captured with a Leica ALS70 system at an average flying height of ≈1200 m in a town in northwest China, as part of the HiWATER (Heihe Watershed Allied Telemetry Experimental Research) project (Li et al., 2013). The original labels of LASDU (6 classes, including a rejection class "unclassified") and ISPRS Vaihingen (9 classes) are different. However, for evaluation purposes the label sets of the two domains should match. Hence, we map the 9 classes of ISPRS Vaihingen to the 6 classes of LASDU, following the classification rule of LASDU. The "powerline" class of ISPRS Vaihingen is mapped to "others", as no powerlines are labelled in LASDU. Points with label "others" are used for training, but are ignored in the quantitative evaluation. Table 1 shows the mapping between the two label sets. The ISPRS Vaihingen benchmark contains defined training and test portions. LASDU consists of four portions. It is recommended to use files 2 and 3 as the training data, and 1 and 4 as the test set for semantic labelling (Ye et al., 2020). Table 2 shows the number of points in each class for both datasets.

Experiment Setup and Evaluation Metrics
Three experiments have been run. Each experiment includes six cases, obtained by using ISPRS Vaihingen (VH) and LASDU (LS) as source and target data in different combinations, without and with deep CORAL. In all experiments, the batch size is set to 8 and α in equation 3 is set to 2.5. Training with stochastic gradient descent (SGD) is run for 60,000 iterations, at which point the loss function has always converged. The initial learning rate is set to 0.01 and decays at a rate of 0.1 every 7,500 iterations when training on VH, and at the same rate every 12,500 iterations when training on the larger LS. Data augmentation, including random rotation around the z-axis, random scaling, random symmetry about the x-axis, and Gaussian noise, is always applied except for the dedicated experiments without data augmentation in Section 4.4. The scaling factor is randomized within [0.8, 1.2]. The standard deviation σ of the Gaussian noise is set to 5 cm. When using correlation alignment, the weight coefficient is λ = 1.0. Since KPConv can only operate on limited (spherical) subsets of a large point cloud, we adopt the authors' voting strategy during testing and average the estimated class probabilities of each point, obtained from at least 20 different sphere samples. Training and testing were performed on a GeForce RTX 2080 Ti GPU with 11 GB RAM.
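The augmentation operations listed above can be sketched as follows. This is a hedged illustration, not the code used in the experiments: it assumes coordinates in metres, a single isotropic scaling factor, a mirroring probability of 0.5, and that "symmetry about the x-axis" means flipping the sign of the x coordinate. The function name `augment` and its parameters are hypothetical.

```python
import numpy as np

def augment(points, rng, scale_range=(0.8, 1.2), noise_sigma=0.05):
    """Randomly rotate about z, scale, mirror and jitter an (n, 3) point cloud."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points @ rot_z.T                  # rotation around the vertical axis
    out = out * rng.uniform(*scale_range)   # one isotropic factor (assumption)
    if rng.random() < 0.5:                  # mirror the x coordinate (assumption)
        out[:, 0] = -out[:, 0]
    # Per-coordinate Gaussian jitter with standard deviation noise_sigma.
    return out + rng.normal(0.0, noise_sigma, size=out.shape)

rng = np.random.default_rng(42)
cloud = rng.uniform(-10.0, 10.0, size=(100, 3))
augmented = augment(cloud, rng)
```

Rotation is restricted to the z-axis because the gravity direction is meaningful in ALS data; heights above ground change only through the scaling factor and the small jitter.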
Following the ISPRS Vaihingen benchmark, all results are evaluated in terms of overall accuracy (OA) and per-class F1 score. Both are computed from the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), where i denotes the class index.
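For reference, the sketch below computes both metrics from a confusion matrix. It assumes the standard multi-class definitions, with OA as the fraction of correctly labelled points and the F1 score of class i as 2TP_i / (2TP_i + FP_i + FN_i); the function names are hypothetical and the exact formulation used in the evaluation may be written differently.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are reference classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of points whose predicted label equals the reference label."""
    return np.trace(cm) / cm.sum()

def f1_per_class(cm):
    """F1_i = 2 TP_i / (2 TP_i + FP_i + FN_i); 0 for classes that never occur."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
oa = overall_accuracy(cm)
f1 = f1_per_class(cm)
```

Points that are excluded from the evaluation, such as those labelled "others", would be masked out of `y_true` and `y_pred` before building the matrix.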

Experiment I: Evaluation of Input Point Density
As explained in Section 3.1, an important hyper-parameter of KPConv is the input point cloud density, which in our setup is defined via the grid size. A trade-off has to be found between input density and receptive field size. We test three settings for the grid spacing l: 0.25 m, 0.5 m and 0.8 m. These correspond to input context spheres of radius 13 m, 25 m and 40 m, respectively. When l is larger, the network can see a larger region, but with sparser sampling and thus less geometric detail and less information on small objects. Data augmentation as described in Section 4.2 is used in all runs. LiDAR intensities are not used. Table 3 shows that the best generalisation results, for both VH → LS and LS → VH, are achieved when l = 0.5 m. Still, a clear domain shift exists. The OA of LS → VH (cross-city) is 9 percentage points (pp) lower than that of VH → VH (within-city). Similarly, the OA of VH → LS is 10.5 pp lower than that of LS → LS. The F1 scores are also lower across all classes. At this spacing, deep CORAL has a negative impact on the overall accuracy, due to more mistakes on the large ground and vegetation classes. We note that in the case of non-optimal sample density l, deep CORAL has a mild positive effect. In VH → LS (l = 0.25 m), LS → VH (l = 0.25 m) and LS → VH (l = 0.8 m), the OA increases by between 0.7 and 2.2 pp. It seems that aligning the feature distributions partly compensates the domain difference for features derived from points sampled at sub-optimal density. An example for LS → VH is shown in Figure 2. Note that at overly coarse sampling (0.8 m) large areas of ground points along roads are misclassified as low vegetation, and deep CORAL corrects these errors. However, at the proper sampling density of l = 0.5 m, the mistakes do not happen in the first place, so there is no room for improvement.

Experiment II: Evaluation of Data Augmentation
Deep neural networks require large training sets to avoid overfitting. Data augmentation is a technique to synthetically increase the sample size by manipulating existing samples in plausible ways. Table 4 presents results for the same settings as above with grid spacing l = 0.5 m, but without data augmentation, for comparison with Table 3. Again, LiDAR intensity values are not used. Comparing Tables 4 and 3, we see that cross-city generalisation suffers if data augmentation is disabled. The OA decreases by 8 pp and 6 pp for LS → VH and VH → LS, respectively. Deep CORAL manages to mitigate that performance drop, improving OA by 2.8 pp and 4.2 pp, respectively. However, its impact is again class-specific and the F1 scores of several classes decrease significantly. In particular, in the VH → LS test the F1 scores for trees and buildings are lower than before, and the score for artifacts even drops to 0.
Table 3. Results with different input grid spacing. Data augmentation was carried out but intensity features were not used.

Experiment III: Evaluation of Intensity
This experiment additionally assesses the role of intensity features, which one might expect to also influence cross-city generalisation. In the original ISPRS Vaihingen and LASDU datasets, the LiDAR return intensities have already been scaled to [0, 255], so we directly concatenate them with the 3D point coordinates and feed the resulting 4D points to the network. Table 5 shows segmentation results with added intensities, at density l = 0.5 m and with data augmentation, for comparison with Table 3. Intensities do improve the within-city results, slightly for VH → VH and more significantly for LS → LS, especially the separation of ground and low vegetation. This makes sense, as the two classes are difficult to distinguish based only on geometric features: they both share low height and a mostly horizontal, planar layout. However, the performance for both LS → VH and VH → LS drops significantly when intensity is also used. OA decreases by 5 pp for LS → VH, and even by 27 pp for VH → LS. As can be seen in Figure 3, the classifier trained on VH misclassifies large regions of ground in LS as low vegetation. One can see that the intensity distributions of the two datasets differ significantly (Figure 3a and 3b).

Further Analysis
Confusion matrix. To analyse the class-wise effect of the domain shift, we inspect confusion matrices of VH → LS, LS → LS, LS → VH, and VH → VH, for the best-performing spacing l = 0.5 m. In Figure 4 it can be seen that an obvious issue in both the VH → LS and LS → VH results is indeed the confusion between low vegetation and ground, due to their similar geometric characteristics. Within one dataset, intensity can to some degree compensate for this mistake, but across datasets it even widens the gap between the two point clouds. Arguably, the two large classes associated with the ground are the main challenge for domain adaptation between VH and LS.
Minority class. Minority classes, in our case especially artifacts, appear to be negatively affected by domain adaptation with deep CORAL. In extreme cases, e.g., VH → LS (w DC) in Tables 4 and 5, the F1 score even drops to 0. There are multiple potential reasons for this behaviour. On the one hand, the deep CORAL loss is calculated without taking into account the classes (which are unknown for the target distribution). Rare classes will therefore have almost no influence on the adaptation, and the resulting warping of the feature space, optimised to accommodate the dominant classes, can be counter-productive. On the other hand, an aggravating factor might be that the artifact class contains too large a variety of objects in LASDU, including walls, fences, light poles, vehicles, etc. The corresponding wide and diffuse set of features, each valid only for a few examples, might lead to a complicated feature distribution that is not sufficiently characterised by second-order statistics.

CONCLUSION
We have empirically investigated cross-city learning of semantic segmentation for ALS point clouds, using example datasets from Germany and China. Three factors were considered that all affect the results, and a representative, generic domain adaptation strategy was evaluated. Our experiments indicate that data augmentation and a proper choice of the input density play an important role and can significantly boost generalisation performance. In contrast, LiDAR intensities exhibited stronger differences between datasets and might better be avoided, as they negatively impact performance across datasets. As for unsupervised, statistical domain adaptation with deep CORAL, we found that it brings a mild improvement when training conditions are not optimal (e.g., intensities present or point density not well chosen); however, it did not resolve the important problems and affected different classes rather unevenly. Elementary design choices, like choosing the right input density and using data augmentation, were much more important to achieve acceptable generalisation. Surprisingly, when those were set to support generalisation in the best possible way, correlation alignment even deteriorated the result by reinforcing class-dependent biases.
In future work we would like to develop domain adaptation methods that are better suited for semantic segmentation of ALS point clouds. Another important step for future research will be to introduce datasets with larger class nomenclatures, to analyse and tackle domain adaptation for application scenarios with more fine-grained semantics.