Weakly supervised segmentation-aided classification of urban scenes from 3D LiDAR point clouds

Abstract: We consider the problem of the semantic classification of 3D LiDAR point clouds obtained from urban scenes when the training set is limited. We propose a non-parametric segmentation model for urban scenes composed of anthropic objects of simple shapes, partitioning the scene into geometrically homogeneous segments whose size is determined by the local complexity. This segmentation can be integrated into a conditional random field (CRF) classifier in order to capture the high-level structure of the scene. For each cluster, this allows us to aggregate the noisy predictions of a weakly supervised classifier to produce a higher-confidence data term. We demonstrate the improvement provided by our method on two publicly available large-scale data sets.


INTRODUCTION
Automatic interpretation of large 3D point clouds acquired from terrestrial and mobile LiDAR scanning systems has become an important topic in the remote sensing community (Munoz et al., 2009; Weinmann et al., 2015), yet it presents numerous technical challenges. Indeed, the high volume of data and the irregular structure of LiDAR point clouds make assigning a semantic label to each point a difficult endeavor. Furthermore, the production of a precise ground truth is particularly difficult and time-consuming. However, LiDAR scans of urban scenes display some form of regularity, and this specific structure can be exploited to improve the accuracy of a noisy semantic labeling.
Foremost, the high precision of LiDAR acquisition techniques implies that the number of points far exceeds the number of objects in a scene. Consequently, the sought semantic labeling can be expected to display high spatial regularity. Although the method presented in (Weinmann et al., 2015) relies on the computation of local neighborhoods, the resulting classification is not regular in general, as observed in Figure 1b. This regularity prior has been incorporated into context-based graphical models (Anguelov et al., 2005; Shapovalov et al., 2010; Niemeyer et al., 2014) and a structured regularization framework (Landrieu et al., 2017a), significantly increasing the accuracy of input pointwise classifications.
Pre-segmentations of the point cloud have been used to model long-range interactions and to decrease the computational burden of the regularization. The segments obtained can then be incorporated into multi-scale graphical models to ensure a spatially regular classification. However, the existing models require setting the parameters of the segments in advance, such as their maximum radius (Niemeyer et al., 2016; Golovinskiy et al., 2009), the maximum number of points in each segment (Lim and Suter, 2009), or the total number of segments (Shapovalov et al., 2010).
The aim of our work is to leverage the underlying structure of the point cloud to improve a weak classification obtained from very few annotated points, with a segmentation that requires no preset size parameters. We observe that the structure of urban scenes is mostly shaped by man-made objects (roads, façades, cars...), which are geometrically simple in general. Consequently, well-chosen geometric features associated with their respective points can be expected to be spatially regular. However, the extent and number of points of the segments can vary greatly depending on the nature of the corresponding objects. We propose a formulation of the segmentation as a structured optimization problem in order to retrieve geometrically simple super-voxels. Unlike other pre-segmentation approaches, our method allows the segments' size to adapt to the complexity of the local geometry, as illustrated in Figure 1c.
Following the machine-learning principle that an ensemble of weak classifiers can perform better than a strong one (Opitz and Maclin, 1999), a consensus prediction is obtained from the segmentation by aggregating over each segment the noisy predictions of its points obtained from a weakly supervised classifier. The structure induced by the segmentation and the consensus prediction can be combined into a conditional random field formulation to directly classify the segments, and reach state-of-the-art performance from a very small number of hand-annotated points.

Related Work
Point-wise classification: Weinmann et al. (2015) propose a classification framework based on 3D geometric features derived from local neighborhoods of optimal size.
Context-based graphical models: the spatial regularity of a semantic labeling can be enforced by graphical models such as Markov random fields (Anguelov et al., 2005; Shapovalov et al., 2010) and their discriminative counterpart, the conditional random field (Niemeyer et al., 2014; Landrieu et al., 2017b). The unary terms are computed by a point-wise classification with a random forest classifier, while the pairwise terms encode the probability of transition between the semantic classes.
Pre-segmentation approaches: A pre-segmentation of the point cloud can be leveraged to improve the classification. Lim and Suter (2009) propose defining each segment as a node in a multi-scale CRF. The super-voxels are defined by a region-growing method based on a predefined number of points in each voxel and a color homogeneity prior. In Niemeyer et al. (2016), the segments are determined using a prior pointwise classification.
A multi-tier CRF is then constructed containing both point and voxel nodes. An iterative scheme then alternates between inference in the multi-tier CRF and the computation of semantically homogeneous segments under a maximum radius constraint. In Shapovalov et al. (2010), the pre-segmentation is obtained through the k-means algorithm, which requires defining the number of clusters in the scene in advance. Furthermore, k-means produces isotropic clusters whose size does not adapt to the geometric complexity of the scene. In Dohan et al. (2015), a hierarchical segmentation is computed using the foreground/background segmentation of Golovinskiy et al. (2009), which uses preset horizontal and vertical radii as parameters. The segments are then hierarchically merged and classified.

Problem formulation
We consider a 3D point cloud V corresponding to a LiDAR acquisition in an urban scene. Our objective is to obtain a classification of the points of V into a finite set of semantic classes K.
We consider that we only have a small number of hand-annotated points as ground truth, taken from a similar urban scene. This number must be small enough that it can be produced by an operator in a reasonable time, i.e. no more than a few dozen points per class.
We present the constituent elements of our approach in this section, in the order in which they are called.
Feature and graph computation: For each point, we compute a vector of geometrical features, described in Section 2.1. In Section 2.3, we present how the adjacency relationship between points is encoded into a weighted graph.

Segmentation into geometrically homogeneous segments:
The segmentation problem is formulated as a structured optimization problem presented in Section 3.1, whose solution can be approximated by a greedy algorithm. In Section 3.2, we describe how the higher-level structure of the scene can be captured by a graph obtained from the segmentation.

Contextual classification of the segments:
In Section 4, we present a CRF which derives its structure from the segmentation, and its unary parameters from the aggregation of the noisy predictions of a weakly supervised classifier. Finally, we assign to each point the label of its corresponding segment.

FEATURES AND GRAPH COMPUTATION
In this section, we present the descriptors chosen to represent the local geometry of the points, and the adjacency graph capturing the spatial structure of the point cloud.
Since the training set is small, and to keep the computational burden of the segmentation to a minimum, we voluntarily limit the number of descriptors used in our pointwise classification. We insist on the fact that the segmentation and the classification do not necessarily use the same descriptors.

Local descriptors
In order to describe the local geometry of each point we define four descriptors: linearity, planarity, scattering and verticality, which we represent in Figure 2.
The features are defined from the local neighborhood of each point of the cloud. For each neighborhood, we compute the eigenvalues λ1 ≥ λ2 ≥ λ3 of the covariance matrix of the positions of the neighbors. The neighborhood size is chosen such that it minimizes the eigenentropy E of the vector (λ1/Λ, λ2/Λ, λ3/Λ), with Λ = λ1 + λ2 + λ3, in accordance with the optimal neighborhood principle advocated in Weinmann et al. (2015):

E = − Σ_{i=1..3} (λi/Λ) log(λi/Λ).

As presented in Demantké et al. (2011), these eigenvalues allow us to qualify the shape of the local neighborhood by deriving the following features:

linearity = (λ1 − λ2)/λ1,   planarity = (λ2 − λ3)/λ1,   scattering = λ3/λ1.

The linearity describes how elongated the neighborhood is, while the planarity assesses how well it is fitted by a plane. Finally, high scattering values correspond to an isotropic, spherical neighborhood. The combination of these three features is called dimensionality.
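As a minimal numpy sketch (the helper names are ours, not the paper's), the dimensionality features and the eigenentropy criterion can be computed from the covariance eigenvalues of a neighborhood:

```python
import numpy as np

def dimensionality_features(neighbors):
    """Linearity, planarity, scattering from the covariance
    eigenvalues of an (n, 3) array of neighbor positions."""
    cov = np.cov(neighbors.T)
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]  # lambda1 >= lambda2 >= lambda3
    lam = np.maximum(lam, 1e-12)                  # guard against numerical negatives
    linearity = (lam[0] - lam[1]) / lam[0]
    planarity = (lam[1] - lam[2]) / lam[0]
    scattering = lam[2] / lam[0]
    return linearity, planarity, scattering

def eigenentropy(neighbors):
    """Eigenentropy E, minimized to select the optimal neighborhood size."""
    cov = np.cov(neighbors.T)
    lam = np.maximum(np.linalg.eigvalsh(cov), 1e-12)
    p = lam / lam.sum()
    return -np.sum(p * np.log(p))
```

In an actual pipeline, `eigenentropy` would be evaluated for several candidate neighborhood sizes and the minimizer retained, as in Weinmann et al. (2015).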
In our experiments, the vertical extent of the optimal neighborhood proved crucial for distinguishing between roads and façades, and between poles and electric wires, as they share similar dimensionality. To discriminate these classes, we introduce a novel descriptor called verticality, also obtained from the eigenvectors and eigenvalues defined above. Let u1, u2, u3 be the three eigenvectors associated with λ1, λ2, λ3 respectively. We define the unit vector of principal direction in R^3_+ as the normalized sum of the absolute values of the coordinates of the eigenvectors weighted by their eigenvalues:

û ∝ Σ_{i=1..3} λi |ui|.

We argue that the vertical component of this vector characterizes the verticality of the neighborhood of a point. Indeed, it reaches its minimum (equal to zero) for a horizontal neighborhood, and its maximum (equal to 1) for a linear vertical neighborhood. A vertical planar neighborhood, such as a façade, will have an intermediate value (around 0.7). This behavior is illustrated in Figure 2. To illustrate the small number of features selected, we represent their respective values and ranges in Figure 3.
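Under our reading of this definition, the verticality descriptor can be sketched as follows (a numpy illustration with our own helper name, not the authors' implementation):

```python
import numpy as np

def verticality(neighbors):
    """Vertical (z) component of the normalized principal-direction
    vector: the sum of the eigenvectors' absolute coordinates,
    weighted by their eigenvalues."""
    cov = np.cov(neighbors.T)
    lam, vec = np.linalg.eigh(cov)        # ascending eigenvalues, columns = eigenvectors
    lam = np.maximum(lam, 0.0)            # clamp tiny numerical negatives
    d = np.abs(vec) @ lam                 # sum_k lambda_k * |u_k|, componentwise
    d /= max(np.linalg.norm(d), 1e-12)
    return d[2]
```

A vertical line of points yields a value near 1 and a horizontal plane a value near 0, matching the behavior described in the text.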

Non-local descriptors
Although the shape of a 3D point's neighborhood determines its local geometry, and allows us to compute a geometrically homogeneous segmentation, it is not sufficient for classification. Consequently, we use two descriptors of the global position of points: elevation and position with respect to the road.
Computing those descriptors first requires determining the extent of the road with high precision. A binary road/non-road classification is performed using only the local geometry descriptors and a random forest classifier, which achieves very high accuracy (an F-score over 99.5%). From this classification, a simple elevation model is computed, allowing us to associate with each 3D point a normalized height with respect to the road. To estimate the position with respect to the road, we compute the two-dimensional α-shape (Akkiraju et al., 1995) of the points of the road projected at the zero elevation level, as represented in Figure 4. This allows us to compute the position with respect to the road descriptor, equal to 1 if a point is outside the extent of the road, 0.5 if the point is within a 1 m tolerance of the edge of the α-shape, and 0 otherwise.

Figure 4. α-shape of the road on our Semantic3D example. In red, the horizontal extent of the road; in yellow, the extent of the non-road class.
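The three-valued encoding of the position descriptor can be sketched as below. This is a hypothetical helper of ours: it assumes the inside/outside test and the distance to the α-shape boundary have already been computed from the road extraction, which we do not reproduce here.

```python
def road_position_descriptor(inside_road, dist_to_edge, tol=1.0):
    """Position with respect to the road: 0.5 within `tol` meters of
    the alpha-shape boundary, else 0 inside the road and 1 outside.
    `inside_road` (bool) and `dist_to_edge` (meters) are assumed to
    come from a precomputed 2D alpha-shape of the road points."""
    if dist_to_edge <= tol:
        return 0.5
    return 0.0 if inside_road else 1.0
```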

Adjacency graph
The spatial structure of a point cloud can be represented by an undirected graph G = (V, E), in which the nodes represent the points of the cloud and the edges encode their adjacency relationship. We compute the 10-nearest-neighbors graph, as advocated in (Niemeyer et al., 2011). Note that this graph defines a symmetric graph-adjacency relationship, which is different from the optimal neighborhoods used in Section 2.1.
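A minimal sketch of the symmetric k-nearest-neighbor graph construction (brute-force distances for clarity; a k-d tree such as scipy's cKDTree would be used at scale):

```python
import numpy as np

def knn_graph(points, k=10):
    """Symmetric k-nearest-neighbor adjacency of an (n, 3) point
    array, returned as a set of undirected edges (i, j) with i < j."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per point
    edges = set()
    for i, nbrs in enumerate(idx):
        for j in nbrs:
            edges.add((min(i, int(j)), max(i, int(j))))  # symmetrize
    return edges
```

Storing each edge once with i < j makes the graph symmetric by construction, matching the undirected adjacency used in the text.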

Potts energy segmentation
To each point i, we associate its local geometric feature vector f_i ∈ R^4 (dimensionality and verticality), and compute a piecewise constant approximation g of the signal f ∈ R^{V×4} structured by the graph G. g is defined as the vector of R^{V×4} minimizing the following Potts segmentation energy:

g* = arg min_{g ∈ R^{V×4}} Σ_{i ∈ V} ||g_i − f_i||² + ρ Σ_{(i,j) ∈ E} δ(g_i − g_j ≠ 0),

with δ(· ≠ 0) the function of R^4 → {0, 1} equal to 0 at 0 and 1 everywhere else. The first part of this energy is the fidelity term, ensuring that the constant components of g correspond to homogeneous values of f. The second part is the regularizer, which adds a penalty for each edge linking two components with different values; this penalty enforces the simplicity of the shape of the segments. Finally, ρ is the regularization strength, determining the trade-off between fidelity and simplicity, and implicitly determining the number of clusters. This structured optimization problem can be efficiently approximated with the greedy graph-cut based ℓ0-cut pursuit algorithm presented in Landrieu and Obozinski (2016). The segments are defined as the constant connected components of the resulting piecewise constant signal.
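To make the objective concrete, the following sketch (our own helper, not the cut pursuit solver) evaluates the Potts energy of a candidate piecewise constant signal g: a squared fidelity term plus ρ times the number of edges crossing between components.

```python
import numpy as np

def potts_energy(f, g, edges, rho):
    """Potts segmentation energy: fidelity ||g - f||^2 plus rho times
    the number of graph edges whose endpoints take different values."""
    fidelity = np.sum((g - f) ** 2)
    boundary = sum(1 for (i, j) in edges if not np.allclose(g[i], g[j]))
    return fidelity + rho * boundary
```

The actual minimization over all piecewise constant g is performed by the ℓ0-cut pursuit algorithm; this function only scores a given solution, which is useful for checking a solver or tuning ρ.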
The benefit of this formulation is that it does not require defining a maximum size for the segments, either in spatial extent or in number of points. Indeed, large segments of similar points, such as roads or façades, can be retrieved. On the other hand, the granularity of the segments increases where the geometry becomes more complex, as illustrated in Figure 1c.
For the remainder of the article, we denote by S = (S1, · · · , Sk) the non-overlapping segmentation of V obtained by approximately solving this optimization problem.

Segment-graph
We argue that since the segments capture the objects in the scene, the segmentation represents its underlying high-level structure.
To obtain the relationship between objects, we construct the segment-graph G = (S, E, w), in which the segments of S are the nodes of G, E represents the adjacency relationship between segments, and w encodes the weight of their boundary, as represented in Figure 5. We define two segments as adjacent if there is an edge in E linking them, and w as the total weight of the edges linking those segments:

w_{s,t} = |{(i, j) ∈ E | i ∈ s, j ∈ t}|.

Figure 5. Adjacency structure of the segment-graph. The edges between points are represented in black; the segmentation and the adjacency of its components in blue.
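A sketch of the segment-graph construction (assuming unit-weight point edges, so the boundary weight reduces to a count of interface edges; names are ours):

```python
from collections import defaultdict

def segment_graph(edges, seg_of):
    """Boundary weights of the segment-graph: w[(s, t)] is the number
    of point-level edges linking segment s to segment t (s != t).
    `seg_of[i]` gives the segment index of point i."""
    w = defaultdict(int)
    for i, j in edges:
        s, t = seg_of[i], seg_of[j]
        if s != t:                         # intra-segment edges carry no boundary weight
            w[(min(s, t), max(s, t))] += 1
    return dict(w)
```

Two segments are adjacent exactly when they share at least one such edge, and the weight measures the size of their interface.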

CONTEXTUAL CLASSIFICATION OF THE SEGMENTS
To enforce spatial regularity, Niemeyer et al. (2014) define the optimal labeling of a point cloud as the one maximizing the posterior distribution p(l | f) in a conditional random field model structured by an adjacency graph G, with f the vector of local and global features. We denote a labeling of V by a vector of Δ(V, K) = {l ∈ {0, 1}^{V×K} | Σ_{k∈K} l_{i,k} = 1, ∀i ∈ V} (the corners of the simplex) such that l_{i,k} is equal to one if the point i of V is labeled as k ∈ K, and zero otherwise. For a point i of V, l_i is considered as a vector of R^K. This allows us to define l as the maximizing argument of the following energy:

l* = arg max_{l ∈ Δ(V,K)} Σ_{i∈V} Σ_{k∈K} p_{i,k} l_{i,k} + Σ_{(i,j)∈E} Σ_{k,l∈K} M_{(i,j),(k,l)} l_{i,k} l_{j,l},   (1)

with p_{i,k} = log(p(l_i = k | f_i)) the entrywise logarithm of the probability of node i being in state k, and M_{(i,j),(k,l)} = log(p(l_i = k, l_j = l | f_i, f_j)) the entrywise logarithm of the probability of observing the transition (k, l) at (i, j).
As advocated in Niemeyer et al. (2014), we can estimate p(l_i = k | f_i) with a random forest probabilistic classifier p_RF. To avoid infinite values, the probability p_RF is smoothed by taking a linear interpolation with the uniform distribution:

p(l_i = k | f_i) = (1 − α) p_RF(l_i = k | f_i) + α / |K|,

with α = 0.01 and |K| the cardinality of the class set. The authors also advocate learning the transition probabilities from the difference of the feature vectors. However, our weak-supervision hypothesis prevents us from learning the transitions, as it would require annotations covering the |K|² possible combinations extensively. Furthermore, the annotations would have to be very precise along the transitions, which are often hard to distinguish in point clouds. We make the simplifying hypothesis that M is of the following form:

M_{(i,j),(k,l)} = −σ if k ≠ l, and 0 otherwise,   (2)

with σ a non-negative value, which can be determined by cross-validation.
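The smoothing step is a one-liner; as a sketch (helper name ours), interpolating the random forest output with the uniform distribution keeps every class probability strictly positive so its logarithm stays finite:

```python
import numpy as np

def smoothed_proba(p_rf, alpha=0.01):
    """Linear interpolation of class probabilities with the uniform
    distribution over |K| classes, avoiding log(0) in the unary term."""
    k = p_rf.shape[-1]
    return (1.0 - alpha) * p_rf + alpha / k
```

The result is still a valid probability vector (non-negative entries summing to one), so it can be plugged directly into the log-unaries of the CRF.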
Leveraging the hypothesis that the segments obtained in Section 3.1 correspond to semantically homogeneous objects, we can assume that the optimal labeling will be constant over each segment of S. In that regard, we propose a formulation of a CRF structured by the segment-graph G to capture the organization of the segments. We denote by L the labeling of S defined as:

L* = arg max_{L ∈ Δ(S,K)} Σ_{s∈S} Σ_{k∈K} |s| P_{s,k} L_{s,k} + Σ_{(s,t)∈E} w_{s,t} Σ_{k,l∈K} M_{(s,t),(k,l)} L_{s,k} L_{t,l},

with P_{s,k} = log(p(L_s = k)) the logarithm of the probability of segment s being in state k, multiplied in the unary term by the cardinality of s. We define this probability as the average of the probabilities of the points contained in the segment:

p(L_s = k) = (1/|s|) Σ_{i∈s} p(l_i = k | f_i).

Note that the influence of the data term of a segment is determined by its cardinality, since the classification of the points remains the final objective. Likewise, the cost of a transition between two segments is weighted by the total weight w_{s,t} of the edges at their interface, and represents the magnitude of the interaction between those two segments.
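The consensus data term can be sketched as a simple per-segment average of the pointwise probabilities (a minimal numpy illustration with our own names):

```python
import numpy as np

def segment_proba(point_proba, segments):
    """Consensus prediction: per-segment class probabilities as the
    average of the probabilities of the segment's points.
    `point_proba` is an (n, |K|) array; `segments` maps each segment
    id to the list of its point indices."""
    return {s: point_proba[list(idx)].mean(axis=0)
            for s, idx in segments.items()}
```

Averaging many noisy pointwise predictions over a geometrically homogeneous segment is what produces the higher-confidence unary term used in the segment-level CRF.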
Following the conclusions of Landrieu et al. (2017b), we approximate the labeling maximizing the log-likelihood with the maximum a posteriori principle, using the α-expansion algorithm of Boykov et al. (2001) in the implementation of Schmidt (2007).
It is important to remark that the segment-based CRF only involves the segment-graph G, which can be expected to be much smaller than G, making inference potentially much faster.

NUMERICAL EXPERIMENTS
We now demonstrate the advantages of our approach through numerical experiments on two public data sets. First, we introduce the data and our evaluation metric, then present the classification results compared with state-of-the-art methods.

Data
To validate our approach, we consider two publicly available data sets.
We first consider the urban part of the Oakland benchmark introduced in Munoz et al. (2009), comprising 655,297 points acquired by mobile LiDAR. Some classes have been removed from the acquisition (e.g. cars or pedestrians), so that only 5 remain: electric wires, poles/trunks, façades, roads and vegetation. We choose to exclude the tree-rich half of the set, as the segmentation results are not yet satisfactory at the trunk-tree interface.
We also consider one of the urban scenes of the Semantic3D benchmark, downsampled to 3.5 million points for memory reasons. This scene, acquired with a fixed LiDAR, contains 6 classes: road, façade, vegetation, car, acquisition artifacts and hardscape.
For each class, we hand-pick a small number of representative points such that the discriminative nature of our features, illustrated in Figure 3, is represented. We select 15 points per class for Oakland and 25 to 35 points per class for Semantic3D, for respective totals of 75 and 180 points.

Metric
To take into account the imbalanced class distribution (roads and façades comprise up to 80% of the points), we use the unweighted average of the per-class F-scores to evaluate the classification results. Consequently, a classification with decent accuracy over all classes will score higher than one with high accuracy on some classes but poor results on others.
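This metric is the macro-averaged F-score; a minimal sketch (helper name ours) makes the computation explicit:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted average of per-class F-scores, so every class
    contributes equally regardless of its frequency."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Unlike global accuracy, a class covering 80% of the points contributes no more to this score than a rare class such as poles.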

Competing methods
To compare the efficiency of our implementation with the state of the art, we have implemented the following methods:
• Pointwise: the pointwise classification with optimal neighborhoods of Weinmann et al. (2015), with a random forest classifier (Breiman, 2001), restricting ourselves to the geometric features described in Section 2.1.
• CRF regularization: the CRF defined in (1), without aid from the segmentation.

Results
In Tables 1 and 2, we report the classification results of our method and the competing methods for both data sets. We observe that both the CRF and the pre-segmentation approach significantly improve the results compared with the pointwise classification. Although the improvement in terms of global accuracy of our method over the CRF regularization is limited (a few percentage points at best), the quality of the classification improves significantly for some hard-to-retrieve classes such as poles, wires, and cars. Furthermore, our method provides an object-level segmentation as well.

CONCLUSION
In this article, we presented a classification process aided by a geometric pre-segmentation capturing the high-level organization of an urban scene. We showed that this segmentation allows us to formulate a CRF to directly classify the segments, improving the results over the CRF regularization. Further developments should focus on improving the quality of the segmentation near loose and scattered acquisitions such as foliage. Another possible improvement would be to better exploit the context of the transitions. Indeed, the form of the transition matrix in (2) is too restrictive, as it does not take into account rules such as "the road is below the façade" or "a trunk-foliage transition is more likely than a foliage-road one". Although the weakly supervised context precludes learning the transitions, it would nonetheless be beneficial to incorporate the expertise of the operator.

Figure 1. Illustration of the different steps of our method: the pointwise, irregular classification (1b) is combined with the geometrically homogeneous segmentation (1c) to obtain a smooth, object-aware classification (1d). In Figures 1a, 1b and 1d, the semantic classes are represented with the following color code: vegetation, façades, hardscape, acquisition artifacts, cars, roads. In Figure 1c, each segment is represented by a random color.

Figure 3. Means and standard deviations of the local descriptors in the Oakland dataset for the following classes: wires, poles, façade, road, vegetation.
Figure 2. Representation of the four local geometric descriptors as well as the two global descriptors. In (a), the dimensionality vector [linearity, planarity, scattering] is color-coded by a proportional [red, green, blue] vector. In (b), the value of the verticality is represented with a color map going from blue (low verticality: roads) to green/yellow (average verticality: roofs and façades) to red (high verticality: poles). In (c), the elevation with respect to the road is represented. In (d), the position with respect to the road is represented with the following color code: inside the road α-shape in blue, bordering in green, and outside in red.

Table 1. Precision, recall and F-score in % for the Oakland benchmark. The global accuracies are respectively 85.2%, 94.8% and 97.3%. In bold, the best value in each category.

Table 2. Precision, recall and F-score in % for the Semantic3D benchmark. The global accuracies are respectively 88.4%, 96.9% and 97.2%. In bold, the best value in each category.