SEMANTIC SEGMENTATION OF TERRESTRIAL LIDAR DATA USING CO-REGISTERED RGB DATA

ABSTRACT: This paper proposes a semantic segmentation pipeline for terrestrial laser scanning data. We achieve this by combining co-registered RGB and 3D point cloud information. Semantic segmentation is performed by applying a pre-trained off-the-shelf 2D convolutional neural network over a set of projected images extracted from a panoramic photograph. This allows the network to exploit the visual image features learnt by state-of-the-art segmentation models trained on very large datasets. The study focuses on the adoption of the spherical information from the laser capture and on assessing the results using image classification metrics. The obtained results demonstrate that the approach is a promising alternative for asset identification in laser scanning data. We demonstrate performance comparable with spherical machine learning frameworks while avoiding both the labelling and training efforts required by such approaches.


INTRODUCTION
Over the past decade, the construction and real estate sectors have increasingly used Terrestrial Laser Scanners (TLS) to capture and document building interiors. This process usually delivers a dense, high-quality point cloud, which can serve as the basis for remodelling and asset management. Furthermore, modern instruments not only capture the 3D positions of interior surfaces, but also colour information from panoramic photographs, making it possible to reason about a point cloud from both its spatial and photometric qualities. A key task in point cloud scene understanding is assigning an object label to every point, often referred to as either per-point classification or semantic segmentation. In this work we adopt the latter term.
In recent years, a surge of deep learning approaches for point cloud semantic segmentation has been proposed. Nevertheless, the problem is still considered hard. This can be attributed to a number of reasons. Firstly, point clouds are typically unordered and sparse. This prevents standard convolution kernels, which assume discrete structured data, from being effective. As a result, deep learning based 2D approaches remain more mature. Despite great progress in addressing this problem (Qi et al., 2017b; Hermosilla et al., 2018; Thomas et al., 2019), another issue looms. Modern deep learning based methods require very large labelled datasets; however, such datasets for 3D data are typically not available at the same scale as their 2D counterparts.
In light of such limitations, we instead ask the question: can 3D point cloud semantic segmentation be achieved using only 2D models? Doing so would allow us to exploit existing 2D CNN architectures and massive manually labelled 2D datasets.
In answering this, we propose a methodology which projects 3D data with co-registered RGB data into 2D images which can be consumed by standard 2D Convolutional Neural Networks (CNNs). Our multi-stage pipeline starts with the extraction of a panoramic image from a TLS-acquired point cloud. Next, we compute tangential images in a perspective projection which can be fed into a CNN to map RGB values to per-pixel labels. Finally, we project the label map back to the point cloud to obtain per-point labels. Through a hyperparameter grid search we find that our method can be used to obtain a competitive semantic segmentation of point clouds leveraging only a pre-trained off-the-shelf 2D CNN, without any additional labelling or domain adaptation.
Empirically we show that, despite the raw image data being in an equirectangular projection, CNNs trained using the more common rectilinear projection produce respectable labels using our approach. Our pipeline therefore makes data captured by polar devices, such as a TLS, compatible with any standard CNN-based image segmentation architecture.

RELATED WORKS
The process of assigning per-point classification labels to point clouds has a rich history. Traditionally, success has been owed to supervised machine learning based techniques. As a single point does not contain enough information to determine its label, researchers have explored methods to encompass local neighbourhood context. Demantké et al. (2011), Weinmann et al. (2015) and others demonstrated the effectiveness of explicitly encoding features computed from a point's local neighbourhood. Features such as linearity, planarity and Eigenentropy are calculated for each point and passed into a Random Forest classifier. This can be performed at scale (Liu and Boehm, 2015). Other feature sets such as Fast Point Feature Histograms (FPFH) (Rusu et al., 2009) and Color Signature of Histogram of Orientations (SHOT) (Salti et al., 2014) have also shown promising results.
More recently, there has been a surge of deep learning based approaches (Griffiths and Boehm, 2019). The seminal work of PointNet (Qi et al., 2017a) demonstrated the compatibility of deep learning with such problems. However, PointNet did not exploit local neighbourhood features like those explicitly encoded in earlier works. PointNet++ (Qi et al., 2017b) showed that by combining a PointNet with a local neighbourhood grouping and sampling module, results could be significantly improved. More recent research looks at developing convolution kernels (which experienced unprecedented success in the 2D domain) that are capable of working in the unordered, sparse and continuous domain where the point cloud exists. Examples such as Monte Carlo Convolutions (Hermosilla et al., 2018), Kernel Point Convolutions (Thomas et al., 2019) and PointConv (Wu et al., 2019) address this.
In the 2D domain researchers have developed methods for processing spherical images. Examples include the spherical cross-correlation and generalised Fourier transform algorithms in Cohen et al. (2018), the adaptation of different convolution layers in Yu and Ji (2019), and the transformation of encoders and decoders to account for the geometric variance of the input equirectangular panoramic image in Zhao et al. (2018b). Zhao et al. (2018a) improved spherical analysis for equirectangular images by creating networks that can iterate between image sectors and classify panoramas with performance and speed comparable to classic two-dimensional networks.
As it is possible for 3D point clouds to be projected into a 2D spherical domain, approaches have naturally been proposed to exploit spherical 2D CNNs for 3D semantic segmentation. Jiang et al. (2019) parse spherical grids approximated to a given underlying polyhedral mesh, using what the authors call "Parameterised Differential Operators", which are linear combinations of differential operators that avoid geodetic computations and interpolations over the spherical projection. Similarly, Zhang et al. (2019) propose an orientation-aware semantic segmentation on icosahedral spheres. Concurrent research has also been present in the autonomous driving domain. Wu et al. (2018) and Wang et al. (2018) transform 3D scanner data into a 2D spherical image which is fed into a 2D CNN, before unprojecting labels back to the original point cloud. These methods are typically much faster than purely 3D approaches, as projection and 2D convolutions are much faster than the 3D neighbourhood searches required by geometric-based approaches. Similar to our work, Tabkha et al. (2019) perform semantic segmentation using a Convolutional Neural Network (CNN) on RGB images derived by projecting coloured 3D point clouds. However, our work differs from these approaches as we do not use an unordered point cloud as the representation for the LiDAR data. Instead, we use the ordered panoramic representation that is generated by polar measuring devices such as TLS. On the downside, this restricts our approach to single scans captured with a static TLS and excludes, e.g., mobile scanners.
Also similar to our work, Eder et al. (2020) divide a spherical panoramic image into tangential icosahedral planes and then project individual perspective images. This allows each image to be fed into a pre-trained 2D semantic segmentation CNN. Furthermore, Eder et al. (2020) obtained results using standard CNNs that are comparable to those of more specialised spherical CNNs.

METHODOLOGY
Given a point cloud $P \in \mathbb{R}^{n \times k}$ captured using a polar-based TLS, we aim to assign a per-point object class label, i.e. $\mathbb{R}^{n \times k} \to \mathbb{R}^{n \times 1}$, where $n$ is the number of points in $P$ and $k$ covers the per-point features $(x, y, z, r, g, b)$ (although $k$ can include other sensor features such as intensity). Whilst in remote sensing and photogrammetry this problem is typically referred to as (per-point) classification, we use the term semantic segmentation common in image processing, as these are the networks we use for creating the label mapping function $f : \mathbb{R}^{n \times k} \to \mathbb{R}^{n \times 1}$.
Our methodology can be split into the following primary processes. First, a point cloud P with corresponding RGB image data I ∈ R h×w×3 is captured using a survey-grade TLS. Such scanners are two-axis polar measurement instruments and acquire quasi regular samples on the two axes, effectively creating a regular grid in the polar space. This representation is also commonly used in panoramic imaging and is referred to as equirectangular. The scanner hardware or associated software warps the image data captured alongside the point cloud into this projection. The resulting panoramic colour images can be extracted using open standard file formats.
Next, we convert the panoramic image $I$ into tangential images $I_T$ to simulate a rectilinear lens. This projection is not a valid transformation for the complete panoramic image, and therefore we create a sequence of overlapping partial images. The position of the tangential images is determined using spherical grid sequence intervals, creating an almost equal distribution over the spherical space. We then obtain per-pixel labels $I^s \in \mathbb{R}^{h \times w \times 1}$ by utilising a semantic segmentation CNN $S$ such that $I^s_i = S(I^t_i)$. All partial rectilinear label images $I^s_i$ are then projected back to the original panoramic projection, allowing the final label map $I_C$ to be created using the confidence scores obtained by the semantic segmentation process. In our experiments $S$ is a pre-trained UPerNet model (Xiao et al., 2018) which was trained on the rectilinear ADE20K dataset (Zhou et al., 2017). Finally, we map the class labels $I_C \to P$ using the co-registration matrix, assigning per-point labels. Figure 1 gives a graphical overview of the process. In the following sections we will discuss each stage in detail.
Figure 1. This diagram shows the proposed semantic segmentation process using panoramic images from TLS data.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021 XXIV ISPRS Congress (2021 edition)

Data acquisition
Our scanner data ($P$ and $I$) in this project was collected using a Leica RTC360 TLS. This system (along with many other commercially available systems) captures 3D measurements in a structured sequence. As mentioned above, it acquires the points over a quasi-regular grid in the polar space. This polar grid is directly represented as a two-dimensional matrix. This enables the projection of 3D point cloud data from a polar to an equirectangular projection, effectively transforming the captured data into a panoramic image (Figure 2 Row 1). This representation of TLS data is long established for image processing and object extraction (Boehm and Becker, 2007; Eysn et al., 2013).
We processed all data with the manufacturer software, exporting the point cloud to a grid-type separator file format that preserves the orientation header of the scan position and each corresponding scanned point on the ordered grid. We utilise this raster grid to extract the panoramic image directly. The final resolution of our panoramic image is 20,334 × 8,333 pixels. This is generated from a maximum of 169,443,222 points (as limited by the TLS); in practice, however, far fewer points are actually captured due to a lack of returns from angular surfaces, windows, etc.
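Extracting the panorama from the ordered export can be sketched as follows. This is a minimal illustration rather than the manufacturer's file format: it assumes each exported record already carries its integer grid position (`row`, `col`) alongside the coordinates and colour values, and the function name `grid_to_panorama` is hypothetical. Grid cells without a return keep a fill colour.

```python
def grid_to_panorama(records, height, width, fill=(0, 0, 0)):
    """Build an equirectangular RGB raster from ordered scan records.

    Each record is assumed to be (row, col, x, y, z, r, g, b); this layout
    is an illustrative assumption, not the actual export format.
    """
    image = [[fill for _ in range(width)] for _ in range(height)]
    pixel_index = []  # pixel_index[k] = (row, col) of the k-th point
    for row, col, _x, _y, _z, r, g, b in records:
        image[row][col] = (r, g, b)
        pixel_index.append((row, col))
    return image, pixel_index


# Toy example: two returns on a 2 x 3 grid, the other cells have no return.
records = [
    (0, 0, 1.0, 0.0, 0.5, 255, 0, 0),
    (1, 2, 0.0, 1.0, 0.5, 0, 255, 0),
]
image, pixel_index = grid_to_panorama(records, height=2, width=3)
```

Keeping `pixel_index` alongside the image means that labels computed on the panorama can later be transferred to the points by a simple lookup.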

Rectilinear projection
As the TLS capture described above has a spherical equidistant subdivision, creating an equirectangular projection is trivial: the data is simply interpreted as a raster. As this projection is neither equal-area nor conformal, there are distortions in the resulting panoramic image. To address the spherical distortion, we need to define a rectilinear projection for the tangential images and a subdivision method that defines the tangent point of each individual projection.
The mathematical foundations used in this reprojection process are taken from Weisstein (2018). Given a point $p_i \in P$ with longitude and latitude $(\lambda, \phi)$, the transformation equations for a tangent plane with a projection centre at central longitude $\lambda_0$ and central latitude $\phi_1$ are given by:
$$x = \frac{\cos\phi \, \sin(\lambda - \lambda_0)}{\cos c}$$
$$y = \frac{\cos\phi_1 \sin\phi - \sin\phi_1 \cos\phi \cos(\lambda - \lambda_0)}{\cos c}$$
where $c$ is the angular distance of the point $(x, y)$ from the projection centre, given by:
$$\cos c = \sin\phi_1 \sin\phi + \cos\phi_1 \cos\phi \cos(\lambda - \lambda_0)$$
Knowing the image size and the corresponding field of view (FOV) angle for the respective $c$, we can generate individual images $I_T$ from the full-dome panorama $I$.

Figure 2 (caption, continued). Row 3: The semantic segmentation output re-projected from tangential to equirectangular. The full map is given in Figure 6. Row 4: Point cloud rendering with labels from the merged equirectangular segmentation map.

The longitude and latitude positions $(\lambda, \phi)$ of the spherical intervals are defined by the golden ratio angle separation, where the generative spirals of a Fibonacci lattice turn between consecutive points along a symmetrical spiral on the sphere (Gonzalez, 2009). To create the lattice, the number of symmetrical points $n$ is given by $n = 2N + 1$, where $N$ is any natural number defining the desired interval subdivision and the integer $i$ ranges from $-N$ to $+N$. The spherical coordinates of the $i$-th point follow the lattice definition of Gonzalez (2009), in which $\Phi = 1 + \Phi^{-1} = (1 + \sqrt{5})/2 \approx 1.618$ is the golden ratio.
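The placement of tangent points on a Fibonacci lattice can be sketched as below. The lattice equations themselves are not reproduced in the text, so this sketch assumes the commonly used form $\phi_i = \arcsin(2i / (2N + 1))$ and $\lambda_i = 2\pi i \Phi^{-1}$ for $i = -N, \dots, N$; the function name `fibonacci_lattice` and the wrapping of longitudes into $[-\pi, \pi)$ are illustrative choices, not necessarily those of the original implementation.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2  # Phi, approximately 1.618


def fibonacci_lattice(N):
    """Return 2N + 1 (longitude, latitude) pairs in radians.

    Assumed form: lat_i = asin(2i / (2N + 1)), lon_i = 2*pi*i / Phi,
    with the longitude wrapped into [-pi, pi).
    """
    points = []
    for i in range(-N, N + 1):
        lat = math.asin(2 * i / (2 * N + 1))
        raw_lon = 2 * math.pi * i / GOLDEN_RATIO
        lon = (raw_lon + math.pi) % (2 * math.pi) - math.pi
        points.append((lon, lat))
    return points


centres = fibonacci_lattice(16)  # 2 * 16 + 1 = 33 candidate tangent points
```

Each returned pair can serve as a projection centre $(\lambda_0, \phi_1)$ for one tangential image.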
The result of this projection, which is also referred to as the gnomonic projection, is a quasi-perspective image $I^t_i$ (Figure 2 Row 2), which is equivalent to an image captured by a camera with a rectilinear lens. Typical cameras available today try to achieve such a projection. As a result, the projected images share the projection of most large-scale benchmark datasets used to train 2D ML models.
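A minimal sketch of the forward gnomonic projection, following the standard formulation in Weisstein (2018), is given below. Angles are in radians; the function names and the choice to return `None` for points on the far hemisphere are assumptions made for illustration. The helper `plane_half_extent` expresses the usual relation between a field of view and the half-width of the tangent plane it covers.

```python
import math


def gnomonic_forward(lam, phi, lam0, phi1):
    """Project (longitude, latitude) in radians onto the tangent plane.

    Returns (x, y) on the plane, or None when the point lies on the
    hemisphere facing away from the tangent point (cos c <= 0).
    """
    cos_c = (math.sin(phi1) * math.sin(phi)
             + math.cos(phi1) * math.cos(phi) * math.cos(lam - lam0))
    if cos_c <= 0.0:
        return None
    x = math.cos(phi) * math.sin(lam - lam0) / cos_c
    y = (math.cos(phi1) * math.sin(phi)
         - math.sin(phi1) * math.cos(phi) * math.cos(lam - lam0)) / cos_c
    return x, y


def plane_half_extent(fov_degrees):
    """Half-width of the tangent plane covered by a given field of view."""
    return math.tan(math.radians(fov_degrees) / 2)


centre = gnomonic_forward(0.3, 0.2, 0.3, 0.2)      # the tangent point itself
edge = gnomonic_forward(math.pi / 4, 0.0, 0.0, 0.0)  # 45 degrees off-centre
```

The tangent point maps to the plane origin, and a point 45 degrees from the centre lands at unit distance, which is the edge of a 90-degree field of view.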

Semantic segmentation
At the centre of our pipeline is a deep learning based semantic segmentation network $S$ which maps a single tangential image to a probability class map, $S : I^t_i \to I^s_i$ (Figure 3). A key benefit of our pipeline is that it is compatible with any choice of $S$. In such a fast-moving field this allows the user to drop in the current best-performing network implementation. In this work we opt for the widely used UPerNet network (Xiao et al., 2018) as an example.
We choose this network for several reasons. Firstly, the network performs competitively on computer vision benchmarks. Next, the authors offer an easy-to-use publicly available implementation. Lastly, the authors release pre-trained weights on the ADE20K indoor scene parsing dataset, which contains all of the objects present in our datasets.
We note that, whilst any 2D CNN semantic segmentation network can be used in our pipeline, it is important that the user also has access to the prediction confidence scores $C^s_i \in C_C$ (as output from the final prediction probability distribution). These values are used to handle redundant labels when recomposing the partial label images $I^s_i$ into the panoramic label map $I_C$. This is discussed in detail in Section 3.4.
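How a per-pixel label and its confidence score might be derived from a network's output can be sketched as follows. This assumes the network exposes raw per-class scores (logits) for each pixel and that the confidence is taken as the softmax probability of the winning class; the function name `label_and_confidence` is hypothetical and the sketch is independent of any particular network implementation.

```python
import math


def label_and_confidence(logits):
    """Return (class index, softmax probability) for one pixel's logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


label, confidence = label_and_confidence([1.0, 3.0, 2.0])
```

Applying this to every pixel of a tangential image yields both the label image and the matching confidence map needed for the later merging step.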

Reprojection
After obtaining the semantically segmented images $I^s_i$ and the confidence matrix associated with each tangential position $(\lambda, \phi)$, it is necessary to warp the images back to the equirectangular projection, in order to obtain a new set of panoramic images for the subsequent unification process. The inverse transformation equations, for a pixel coordinate $(x, y)$, are given by:
$$\phi = \sin^{-1}\left(\cos c \, \sin\phi_1 + \frac{y \sin c \, \cos\phi_1}{\rho}\right) \qquad (7)$$
$$\lambda = \lambda_0 + \tan^{-1}\left(\frac{x \sin c}{\rho \cos\phi_1 \cos c - y \sin\phi_1 \sin c}\right) \qquad (8)$$
where $\lambda_0$ is the central longitude, $\phi_1$ is the central latitude, and $\phi$ and $\lambda$ are the resulting latitude and longitude for each reprojected pixel $(x, y)$, respectively. $\rho$ and $c$ are defined as:
$$\rho = \sqrt{x^2 + y^2}, \qquad c = \tan^{-1}\rho$$
The resulting image has the corresponding order of latitude and longitude of the spherical subdivision (Figure 2 Row 3).
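The inverse transformation can be sketched in a few lines. The function names are hypothetical, angles are in radians, and the pixel convention in `to_equirect_pixel` (longitude $-\pi$ at the left edge, latitude $+\pi/2$ at the top row) is an assumption, since the raster orientation is not specified in the text. The two-argument arctangent is used so that the longitude falls in the correct quadrant.

```python
import math


def gnomonic_inverse(x, y, lam0, phi1):
    """Map tangent-plane coordinates (x, y) back to (longitude, latitude)."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        return lam0, phi1  # the plane origin is the tangent point itself
    c = math.atan(rho)
    sin_phi = (math.cos(c) * math.sin(phi1)
               + y * math.sin(c) * math.cos(phi1) / rho)
    phi = math.asin(max(-1.0, min(1.0, sin_phi)))  # clamp rounding errors
    lam = lam0 + math.atan2(
        x * math.sin(c),
        rho * math.cos(phi1) * math.cos(c) - y * math.sin(phi1) * math.sin(c),
    )
    return lam, phi


def to_equirect_pixel(lam, phi, width, height):
    """Convert (longitude, latitude) to a (row, col) panorama pixel."""
    col = int((lam + math.pi) / (2 * math.pi) * width) % width
    row = int((math.pi / 2 - phi) / math.pi * height)
    return min(height - 1, max(0, row)), col


lam, phi = gnomonic_inverse(1.0, 0.0, 0.0, 0.0)
row, col = to_equirect_pixel(0.0, 0.0, width=360, height=180)
```

Iterating over the pixels of a tangential label image and writing each label to the panorama pixel returned by `to_equirect_pixel` gives a simple nearest-neighbour reprojection.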

Panoramic Label Map
Following the processing methodology, it is necessary to recreate a full-resolution panoramic label image $I_C$ from the overlapping reprojected semantic segmentation maps $I^s_i$. To achieve this, we adopt a winner-take-all approach based on the corresponding pixel confidence scores $C^s \in C_C$: for any pixel covered by more than one tangential image, the final output label is the one with the highest confidence score.
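The winner-take-all merge can be sketched as follows. This is an illustrative implementation under stated assumptions: every reprojected label map has already been resampled to the full panorama size, pixels not covered by a tangential image carry a sentinel value (`unlabelled`, here -1), and the function name is hypothetical.

```python
def merge_winner_take_all(label_maps, confidence_maps, unlabelled=-1):
    """Merge overlapping label maps, keeping the most confident label."""
    height = len(label_maps[0])
    width = len(label_maps[0][0])
    merged = [[unlabelled] * width for _ in range(height)]
    best = [[0.0] * width for _ in range(height)]
    for labels, confidences in zip(label_maps, confidence_maps):
        for r in range(height):
            for c in range(width):
                if labels[r][c] == unlabelled:
                    continue  # this tangential image does not cover the pixel
                if confidences[r][c] > best[r][c]:
                    merged[r][c] = labels[r][c]
                    best[r][c] = confidences[r][c]
    return merged


# Two overlapping 2 x 2 maps; -1 marks pixels outside a tangential image.
labels_a = [[1, 1], [-1, -1]]
conf_a = [[0.9, 0.4], [0.0, 0.0]]
labels_b = [[2, 2], [2, -1]]
conf_b = [[0.5, 0.8], [0.7, 0.0]]
merged = merge_winner_take_all([labels_a, labels_b], [conf_a, conf_b])
```

In the toy example the first pixel keeps label 1 (confidence 0.9 against 0.5), the second switches to label 2 (0.8 against 0.4), and the pixel covered by neither map stays unlabelled.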

Point cloud semantic segmentation
As a final step we map the equirectangular label map onto the original point cloud (i.e. $I_C \to P$). This is easily achieved by storing the original mapping $P \to I$ (Section 3.1). Using the reverse of this mapping, we simply assign each point $p_i \in P$ its corresponding value from $I_C$. A rendering of the point cloud with label colours is shown in Figure 2 Row 4.
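The label transfer amounts to a lookup, sketched below. It assumes the stored mapping is available as a list holding the (row, column) panorama pixel of each point, in point order; the names `labels_to_points` and `pixel_index` are illustrative.

```python
def labels_to_points(label_map, pixel_index):
    """Return one class label per point by looking up its panorama pixel."""
    return [label_map[row][col] for row, col in pixel_index]


label_map = [[3, 3, 7],
             [3, 5, 7]]
pixel_index = [(0, 0), (1, 1), (1, 2)]  # pixel of point 0, 1 and 2
point_labels = labels_to_points(label_map, pixel_index)
```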

RESULTS
We test our methodology outlined in Section 3 for a range of configurations. Furthermore, we evaluate our approach on both an internal dataset and a sample from the common 2D3DS benchmark dataset (Armeni et al., 2017).

Performance metrics
It is important to define the metrics used to evaluate our proposed pipeline's performance. Whilst the 2D3DS dataset contains labels, our internal dataset does not. It is therefore necessary to label the ground truth data. As we are not using the dataset for training a CNN, all data is test data, and as such, we do not require a large dataset. All data was therefore manually annotated using standard image processing software with a graphical user interface.
To evaluate each scenario's performance, we opt for the widely used Intersection over Union metric (Everingham et al., 2010), averaged over all classes (mIoU). In practice we compute the IoU over the $N \times N$ confusion matrix $C$, where $N$ is the number of classes (21 in our case). Let $c_{ij}$ be a single entry in $C$, where $c_{ij}$ is the number of samples from ground truth class $i$ predicted as class $j$; then the per-class IoU can be computed as:
$$\mathrm{IoU}_i = \frac{c_{ii}}{c_{ii} + \sum_{j \neq i} c_{ij} + \sum_{k \neq i} c_{ki}}$$
mIoU is then:
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i$$
In addition to mIoU we also compute an average of the overall accuracy (mAcc); however, as accuracy can be non-robust when strong class imbalance is present, we treat mIoU as our primary metric.
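Computing the per-class IoU and mIoU from a confusion matrix can be sketched as below, with rows as ground truth classes and columns as predicted classes, as defined above. Skipping classes that appear in neither the ground truth nor the predictions when averaging is an assumption of this sketch, not a detail taken from the paper.

```python
def per_class_iou(confusion):
    """IoU per class; rows are ground truth, columns are predictions."""
    n = len(confusion)
    ious = []
    for i in range(n):
        true_pos = confusion[i][i]
        false_neg = sum(confusion[i]) - true_pos
        false_pos = sum(confusion[j][i] for j in range(n)) - true_pos
        denominator = true_pos + false_neg + false_pos
        ious.append(true_pos / denominator if denominator else None)
    return ious


def mean_iou(confusion):
    """Average IoU over the classes that occur (assumption: skip absent)."""
    valid = [v for v in per_class_iou(confusion) if v is not None]
    return sum(valid) / len(valid)


confusion = [[3, 1],
             [2, 4]]
ious = per_class_iou(confusion)
miou = mean_iou(confusion)
```

For the toy matrix, class 0 has 3 true positives, 1 false negative and 2 false positives, giving an IoU of 3/6; class 1 gives 4/7.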

Hyperparameter search
It is evident that the configuration used to perform $I \to I_T$ can affect model performance. We therefore perform a hyperparameter grid search to find the optimum configuration for generating the tangential images $I_T$ with respect to our performance metrics. We select the following hyperparameters for optimisation: the location of the spherical tangent points, the field of view (FOV), the image size and the image aspect ratio. Results of the search are visualised in Figure 4.
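The structure of such a grid search can be sketched as follows. The candidate values, parameter names and the scoring function are placeholders: in practice `evaluate` would run the full pipeline for one configuration and return its mIoU, whereas here a dummy score stands in so that the sketch is runnable. None of the numbers produced by this example are results from the paper.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in `grid` and return the best one."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Placeholder candidate values, chosen only for illustration.
grid = {
    "fov_degrees": [60, 70, 80],
    "n_tangent_points": [16, 32, 64],
}


def dummy_score(config):
    """Stand-in for running the pipeline and computing mIoU."""
    return (-abs(config["fov_degrees"] - 70)
            - abs(config["n_tangent_points"] - 32) / 100)


best_config, best_score = grid_search(dummy_score, grid)
```

Because every combination is evaluated independently, the loop can be parallelised or extended with further hyperparameters, such as image size and aspect ratio, by adding entries to `grid`.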

Internal dataset
The mIoU evaluation shows that a 70-degree field of view, a 3:4 aspect ratio, an image size of 840 × 1120 and a spherical subdivision with 32 tangential points is the optimal pipeline configuration for this dataset, as shown in Figure 4. It is also notable that an increase in the resolution of the tangential images does not improve the final performance. Additionally, greater redundancy in the spherical positions also results in decreased performance.
The final semantic segmentation image $I_C$ with the optimum hyperparameters is shown in Figure 6 (top). Comparing the areas captured in the original TLS panorama in Figure 5 (top) with the final segmented image, we note that high precision is achieved at the object boundaries, especially on the furniture and walls. In addition, the mIoU performance achieved is superior to that obtained by analysing the raw panoramic image. The normalised confusion matrix (Figure 7) demonstrates that our pipeline is able to identify the majority of the required classes present in the panoramic scene. However, we note that classes H (door) and I (desk) are poorly detected.

2D3DS dataset
We processed the selected 2D3DS panorama shown in Figure 5 (bottom) using the same methodology and obtained the result shown in Figure 6 (bottom). The output segmentation map $I_C$ is compared to the provided ground truth data. The selected image has a resolution of 4096 × 2048. The resulting image is generated using the best configuration obtained in the grid search presented above, but with the image size and FOV adjusted: an 80-degree FOV, an aspect ratio of 3:4, an image resolution of 600 × 1200 and a spherical interval division of 32 tangential points.
Qualitatively analysing Figure 6 (bottom vs. top row), it is evident that the proposed method does not achieve similar performance on the lower-resolution 2D3DS dataset in comparison with the internal high-resolution TLS dataset. This is particularly evident for the ceiling. Quantitatively, however, the confusion matrix (Figure 7 bottom) shows that most areas of the dataset were nevertheless correctly classified.

CONCLUSION
We presented a pipeline for semantic segmentation of TLS point clouds for indoor scenes. We show that by exploiting co-registered RGB image data, we can perform semantic segmentation using standard 2D CNNs. These labels can then be mapped back onto the original 3D point cloud data. We demonstrate satisfactory results using a pre-trained off-the-shelf 2D CNN, eliminating the need for manually labelled training data or specialised 3D point cloud networks. This allows us to exploit large 2D labelled datasets for 3D point cloud semantic segmentation. Furthermore, our results show that despite our original data being in an equirectangular projection, we still achieve reasonable class labels from a network trained on more commonly available rectilinear images. Whilst we expect results to improve if a network is trained directly on equirectangular images, we show that this is not strictly necessary. This significantly reduces workload and accelerates the adoption of new DL frameworks for TLS data.