Relation Network for Full-Waveforms Lidar Classification

: LiDAR data are widely used in various domains related to geosciences (flow, erosion, rock deformations, etc.), computer graphics (3D reconstruction) or earth observation (detection of trees, roads, buildings, etc.). Because of the unstructured nature of remaining 3D points and because of the cost of acquisition, the LiDAR data processing is still challenging (few learning data, difficult spatial neighboring relationships, etc.). In practice, one can directly analyze the 3D points using feature extraction and then classify the points via machine learning techniques (Brodu, Lague, 2012, Niemeyer et al., 2014, Mallet et al., 2011). In addition, recent neural network developments have allowed precise point cloud segmentation, especially using the seminal pointnet network and its extensions (Qi et al., 2017a, Riegler et al., 2017). Other authors rather prefer to rasterize / voxelize the point cloud and use more conventional computers vision strategies to analyze structures (Lodha et al., 2006). In a recent work, we demonstrated that Digital Elevation Models (DEM) is reductive of the vertical component complexity describing objects in urban environments (Guiotte et al., 2020). These results highlighted the necessity to preserve the 3D structure of the point cloud as long as possible in the processing. In this paper, we therefore rely on ortho-waveforms to compute a land cover map. Ortho-waveforms are directly computed from the waveforms in a regular 3D grid. This method provides volumes somehow “similar” to hyperspectral data where each pixel is here associated with one ortho-waveform. Then, we exploit efficient neural networks adapted to the classification of hyperspectral data when few samples are available. Our results, obtained on the 2018 Data Fusion Contest dataset (DFC), demonstrate the efficiency of the approach.


INTRODUCTION
Because of their ability to capture complex structures, many domains related to geosciences and earth observation are making increasing use of LiDAR data.Such systems provide indeed accurate 3D point clouds of the scanned scene which has a large number of applications ranging from urban scene analysis (Chehata et al., 2009, Guiotte et al., 2020, Shan, Aparajithan, 2005), geology and erosion (Brodu, Lague, 2012), archaeology (Witharana et al., 2018) or even ecology (Eitel et al., 2016).
However, the processing of such data is not obvious since unlike N-dimensional images, the spatial irregular distribution of the point clouds makes tricky (both from a theoretical and computational point of view) the computation and use of spatial features.Moreover, though efficient recent neural network have been designed for LiDAR and unstructured point clouds (Landrieu, Simonovsky, 2018, Qi et al., 2017a, Qi et al., 2017b), at the moment the lack of labeled data limits the use of advanced learning techniques.
Many strategies exist to deal with this issue.While some of them directly exploit the 3D point cloud structure (Brodu, Lague, 2012, Niemeyer et al., 2014, Mallet et al., 2011), in many applications the point cloud is first binned into a 2D regular grid (so-called "rasterization process") on which computer vision approaches can be applied (see e.g.(Lodha et al., 2006)).While first works have been focused on the characterization of single points (often through height and intensity) without including information related to their neighbours (Lodha et al., 2006), * Corresponding author more advanced approaches have included spatial relationships using a set of spheres or cylinders (of variable radius) around each point to extract consistent geometric features (Mallet et al., 2011, Weinmann et al., 2015, Niemeyer et al., 2014).Among others, we have demonstrated in (Guiotte et al., 2019b, Guiotte et al., 2020) that the various rasterization strategies may have an important impact on the final result.
Complementary to rasterization, it is also possible to bin the point cloud into a 3D regular grid (a.k.a."voxelization process") where all points are processed via voxels (Gorte, Pfeifer, 2004, Aijazi et al., 2013, Guiotte et al., 2019a, Serna, Marcotegui, 2014) using point-to-voxels and voxels-to-point projections.This approach enables to keep the 3D structure of the data while using more conventional 3D-processing tools.
As an intermediate structuration strategy, we propose in this paper to map the point cloud into ortho-waveform maps.This has the advantage to provide 2D-(multi/hyper)spectral data where in each pixel, a signal corresponding to a reconstructed waveform observed in the orthogonal direction is given.Therefore, the 3D structure is kept while one can still process 2D data, similar to hyperspectral ones.To deal with the fact that only few labeled data are in general available, we suggest to process such ortho-waveforms using neural networks adapted both to hyperspectral data and to few learning samples.To this end, the recombinination (or pairing) of samples is an efficient approach to increase the amount of input training data.The resulting architectures are known as relation networks where multiple inputs are taken into account: one labelled sample per class (called support sample) and one query sample to be classified.The network outputs similarities between the query sample and the support sample per class.This relation network is combined with a submodule, which is designed to extract common features (similar to the prototype of each class) of multiple samples per class, for the extraction of spatial-spectral features (Rao et al., 2019) and here the classification of ortho-waveforms.
The organisation of the paper is as follows: in the next section, we present the generation of ortho-waveforms from the 3D point clouds.Then in Sec. 3, we present the spatial-spectral relation network used for classification.Finally, we illustrate the benefits of this approach in the experimental part in Sec. 4, before concluding our paper in Sec. 5.

GENERATION OF ORTHO-WAVEFORMS
To exploit the 3D structure of LiDAR data while using 2D processing tools, we create ortho-waveforms from initial full waveforms signals.More formally, let us define: • X the LiDAR acquisitions in R 3 × R, where each data x = {x, y, z, I} ∈ X is such that the intensity taken in location (x, y, z) is I(x, y, z); • E h ⊂ N 2 a 2D grid with spatial resolution h (for the sake of simplicity, we consider here isotropic resolutions but the method can be applied with anistropic ones as well); • gσ a 1D Gaussian filter of standard deviation σ.
For each pixel (i, j) ∈ E h , the associated spectrum p(i, j) is computed as where δz(i, j) is a vector of diracs containing all vertical positions included in the spatial pixel (i, j) weighted by their corresponding intensities I : the N spatial coordinates in the point cloud X included in pixel (i, j), z k their corresponding vertical values and δ ↑ the dirac function.This provides, in each pixel, ortho-waveform data as illustrated in Fig. 1 where the original dataset and some waveforms are illustrated.The next section introduces the relation-network that we used to process such data.

SPATIAL-SPECTRAL RELATION NETWORK
The spatial-spectral relation network (SS-RN) (Rao et al., 2019) was designed to classify hyperspectral images.Not only it learns the relation between 3D features (spectral features and spatial features) of the samples, but also it iteratively learns the similarities between a query sample and several samples per class.The overview of SS-RN is presented in Fig. 2. The SS-RN method consists of the following main parts: input construction, embedding module and relation module.In the following, we successively introduce these three parts in detail.

Multi-Support Sample Recombination
The proposed SS-RN exploits the training set by episode-based training.In each training iteration, an input instance is formed by randomly selecting one query sample and several randomly selected labelled samples (called support samples) per class.
Here the query sample is the sample to be classified, and the selected support samples per class represent its class.
Consider a dataset X = {xi} N i=1 in sample space R d×w×w which contains N labeled samples.Here d is the number of spectral bands, w × w is the spatial neighbouring window size.Let yi ∈ {1, 2, . . ., C} is the class label of xi and C is the number of classes.The organization of labelled samples under the framework of SS-RN is presented in Figure 3. Firstly, we split X into the training set Xtrain and the testing set Xtest with no intersection between these two parts.Then we construct a query set for training Qtrain, a query set for testing Qtest, and a support set Strain defined as follows: Here Sj contains all labeled samples of the j-th class in Xtrain.
Concretely, to construct an input instance M q n , we randomly select a query sample xq from a query set (Qtrain or Qtest) and n support samples per class denoted as sj = {xj1, ..., xjn} from Sj.The formula of M q n shows in Equation 4and its class label is same as the selected query sample Label(M q n ) = yq.

3D Embedding Module for Feature Extraction
After constructing an input instance M q n , we feed M q n into a three-dimensional convolutional neural network (3D-CNN) as an embedding module to extract spatial-spectral features.As shown in Figure 4, the architecture of the embedding module consists of two sub-modules: the first sub-module fϕ1 is designed to extract features of a single sample xq or x jk (the k-th support sample of the j-th class).The second sub-module fϕ2 is dedicated to the extraction of common features (similar to a prototype of each class) of the input support samples per class.The whole embedding module fϕ can be defined by: As shown in Figure 4, the input instance contains five support samples each class and a query sample, and the embedding module consists of five 3D-CNN blocks.Taking the j-th class as an example, we feed five support samples sj with size 5×1×233×13×13 into the first sub-module fϕ1, then generate five features for each support sample, thus the size of fϕ1(sj) is 5 × 64 × 15 × 5 × 5.The feature size of the query sample sq extracted by fϕ1 is 1 × 64 × 15 × 5 × 5. To obtain a common feature of sj, we feed the fϕ1(sj) into the second sub-module fϕ2, then generate a feature with size 1 × 64 × 15 × 5 × 5.The common feature map of each class and the feature map of the query sample will be compared by the relation module.) representing the chance that xq belongs to the j-th class, which is called the relation score.In this setting, by feeding an input instance M q n into SS-RN, we obtain C relation scores rj,q, j = 1, 2, . . ., C and the query sample xq will be classified into the class with the highest relation score.
The loss function of SS-RN is the Mean Square Error (MSE) in eq. ( 7), where M is the total number of query samples, {yi} C i=1 is the label of support samples and yq is the label of the query sample.Adam optimizer (Kinga, Adam, 2015) is applied to minimize the MSE error over the training set.Note that SS-RN contains two modules (embedding module and relation mod- (rj,q − 1(Label(sj) == yq)) 2 (7)

EXPERIMENTS
Preliminary experiments have been performed on the dataset from the IEEE Data Fusion Contest (DFC) 2018 (Le Saux et al., 2018).To this end, we sampled the point cloud and the ground truth to 1 m 2 resolution (vs 0.5 in the initial DFC dataset) in order to sample enough points in the vertical columns and obtain interesting ortho-waveforms.The main characteristics of the data are: • Raw data: one LiDAR tile from DFC 2018 (mono-spectral) • Spatial grid resolution: -Horizontal (x, y): 1 m -Vertical (z): 0.15 m • Labels: 20 classes, some under-represented because of tiling and sub-sampling are removed.• Train, test: 20% of the points randomly selected to train the model.Validation is performed on the rest of the dataset Some classes non-present or under-represented in the chosen tile have been removed during the training process (cf.missing scores in Table 1).
Qualitative results are presented in Figure 6 and quantitative evaluations are depicted in Tables 1 and 2. As can be shown, numerical experiments show very high performances despite the low number of training data, which is a good behavior of our network.However on this map, one can observe that we still have some difficulties with thin structures.Nevertheless, the overall map is consistent.
While the reported results show a very high accuracy, they have to be considered with a specific caution.Indeed, even if there is no overlap between pixels in the training and testing sets, the spatial behaviour of the CNN (through the successive increase of the field) makes possible that training and testing pixels share some learnt features.The interested reader is referred to (Audebert et al., 2019) for an in-depth discussion of this issue that is encountered in many experiments of deep networks in remote sensing.So further experiments would be needed here to draw some final conclusions, including a larger data set and a more reliable split between training and testing sets.

CONCLUSION
In this work, we proposed an alternative to rasterization or voxelization strategies for LiDAR data.We suggest to keep the 3D structure of the point cloud and to create ortho-waveforms, resulting in rasterized data where each pixel is associated with a wavelength in the vertical direction.This has the advantage to keep both the data structure and the spatial organization in a grid.This is somehow similar to hyperspectral data and we demonstrated the efficiency of this procedure on the DFC 2018 dataset using a deep neural network initially tailored for hyperspectral data.

Figure 1 .
Figure 1.Illustration of ortho-waveforms (blue curves) computed from raw data (top-left and star points) for 5 spatial points.

Figure 2 .
Figure 2.An example of SS-RN architecture for hyperspectral image classification.

Figure 3 .
Figure 3.The organization of labelled samples under the framework of SS-RN.3D Relation Module for Similarity Measurement After the embedding module, we obtained the common features fϕ(sj) = fϕ2(fϕ1(sj)), j ∈ {1, 2, . . ., C} per class and a feature fϕ(xq) = fϕ1(xq) of the query sample.To determine the label of the query sample, we concatenate the fϕ(xq) with the common feature fϕ(sj) per class, respectively.In a second step, we feed the concatenate feature C(fϕ(sj), fϕ(xq)) into a relation module, which learns to compare the query feature and a common feature per class, respectively.We then define the

Figure 4 .
Figure 4. Architecture of the spatial-spectral embedding model, which is composed of five 3D-CNN blocks.The input data here consists of five support samples per class and a query sample from the DFC2018 dataset (where d × w × w = 233 × 13 × 13)

Figure 6 .
Figure 6.Classification results on the DFC 2018 dataset.(a) ground truth and (b) : our results.

Table 1 .
Classes of the DFC 2018 along with the F1 scores.