GRAPH CNN WITH RADIUS DISTANCE FOR SEMANTIC SEGMENTATION OF HISTORICAL BUILDINGS TLS POINT CLOUDS

ABSTRACT: Point clouds obtained via Terrestrial Laser Scanning (TLS) surveys of historical buildings are generally transformed into semantically structured 3D models with manual and time-consuming workflows. The importance of automating this process is widely recognized within the research community. Recently, deep neural architectures have been applied to the semantic segmentation of point clouds, but few studies have evaluated them in the Cultural Heritage domain, where complex shapes and mouldings make this task challenging. In this paper, we describe our experiments with the DGCNN architecture to semantically segment historical building point clouds acquired with TLS. We propose a variation of the original approach where a radius-distance-based technique is used instead of K-Nearest Neighbors (KNN) to represent the neighborhood of points. We show that our approach provides better results by evaluating it on two real TLS point clouds, representing two Italian historical buildings: the Ducal Palace in Urbino and Palazzo Ferretti in Ancona.


INTRODUCTION
Built Cultural Heritage management and preservation requires the creation of accurate and rich digital representations of historical buildings. Laser scanning has become a widely used technique to obtain accurate digital representations of 3D scenes and is commonly adopted by architects, archaeologists and scholars. The obtained digital representations come in the form of point clouds, which, however, lack semantic information and are often insufficient to conduct further analysis and studies. More informative and structured representations are needed, and are usually derived relying on knowledge representation standards like the Building Information Model (BIM), as done, for example, in (Quattrini et al., 2017a) and (Quattrini et al., 2017b), where annotated BIM models drive a semantic-aware user interface to meaningfully explore architectural heritage. The process of transforming a point cloud into a BIM model is referred to as Scan-to-BIM and is usually carried out manually, through careful and time-consuming work.
For this reason, finding viable ways to partially automate the Scan-to-BIM process is gaining growing interest within the research community. A central problem in this context is that of separating a point cloud into components that represent single classes of objects of interest. In the case of historical buildings, it is desirable to automatically segment the 3D point cloud into specific architectural elements, referring to robust and consolidated thesauri. The interest in this kind of task is witnessed by a number of recent research efforts, which attempt to address it with algorithmic workflows (Murtiyoso, Grussenmeyer, 2020a) or by leveraging machine learning methods based on hand-crafted features (Grilli et al., 2019b).
The task of classifying the points of a point cloud according to some types of 3D objects is known as semantic segmentation and is of interest to a variety of research areas, such as autonomous driving and robot scene interpretation. The recent advances in deep learning (DL) have driven the research in this area, and in the last years several deep neural architectures have been proposed that attempt to semantically segment 3D point clouds, directly operating on the coordinate representation of 3D scenes. Examples are PointNet (Qi et al., 2017a), the pioneering study in this area, its subsequent improvement PointNet++ (Qi et al., 2017b), the Dynamic Graph Convolutional Neural Network (DGCNN) architecture proposed in (Wang et al., 2019), which attempts to generalize the previous approaches into a common formal framework, and PointCNN (Yu et al., 2019).
While such neural architectures have been extensively tested on standard benchmark datasets of indoor scenes (such as the S3DIS dataset (Armeni et al., 2017)), few attempts have been made to investigate their use on historical building TLS point clouds (Malinverni et al., 2019), (Pierdicca et al., 2020a). While classic ML approaches often require feature engineering on a case-by-case basis (Grilli et al., 2019b), DL approaches attempt to automatically extract (hidden) features, thus possibly providing a uniform method that can be applied to a variety of case studies.
In this paper we describe our experiments with the DGCNN architecture to semantically segment historical buildings TLS point clouds. We propose a variation of the original approach (Wang et al., 2019) where a radius distance based technique is used instead of K-Nearest Neighbors (KNN) to represent the neighborhood of points. We show that our approach provides better results by evaluating it on two real TLS point clouds, representing two Italian historical buildings: the Ducal Palace in Urbino and the Palazzo Ferretti in Ancona.

RELATED WORKS
In this section we provide an overview of the state of the art in deep learning approaches for the semantic classification of dense point clouds. Then we restrict our attention to the Cultural Heritage field and review some of the Machine Learning (ML) efforts towards the semantic segmentation of architectural elements. DL algorithms have been used in several application domains to semantically segment 3D point clouds (Xie et al., 2020, Zhang et al., 2019). However, the Cultural Heritage domain still lacks studies that apply this methodology to the processing of 3D point clouds. A significant advantage is that the large amount of raw data represented by 3D point clouds can be handled directly by DL algorithms, without intermediate processing to obtain a more regular structure before classification. In this regard, the first algorithm was presented in (Qi et al., 2017a), in which the points of the cloud are considered individually. Subsequently, an advanced version of this approach was presented (Qi et al., 2017b), which also exploits local information by considering nearby points, obtaining better classification results. A first attempt to use the PointNet++ architecture to semantically segment 3D point clouds of a CH dataset is presented by Malinverni et al. (Malinverni et al., 2019). The work aims at demonstrating the effectiveness of the method in an area that had not yet been explored, considering a set of CH data purposely created and annotated manually by domain experts. The Point Clouds Convolutional Neural Network (PCNN) implemented in the work of (Atzmon et al., 2018) is a novel architecture that uses a Convolutional Neural Network (CNN) to process 3D point clouds. It is based on two operators, extension and restriction, one the inverse of the other, to classify the 3D point cloud.
Novel point-based approaches to semantic segmentation have been recently proposed, often extending the previous architectures proposed in this field. According to (Guo et al., 2020), they can be divided into point-wise MLP architectures, e.g. (Jiang et al., 2018), named PointSIFT and inspired by the 2D shape descriptor SIFT, convolution-based architectures, such as PointConv (Boulch, 2020), and graph-based architectures.
The Dynamic Graph Convolutional Neural Network (DGCNN) is one of the first approaches in the latter category. DGCNN (Wang et al., 2019) is based on the EdgeConv operation. EdgeConv generates edge features that describe the relationships between a point and its neighbours, instead of generating point features directly from their embeddings. This module is designed to be invariant to the permutation and ordering of neighbours. The DGCNN architecture was also used in the CH context in the work proposed by (Pierdicca et al., 2020b). The aim is to perform a semantic segmentation of 3D point clouds using an augmented DGCNN model, adding features such as normals and colour. The advantage is a better handling of CH elements with complex geometries, extremely variable structures and a high level of detail. They also compare their model with other DL methods.
In (Grilli et al., 2019c), the authors, taking into account the benefits of using ML and DL technologies, compared their classification performance on two different CH datasets. They highlight that ML approaches (Random Forest and One-versus-One) achieve excellent classification performance, even if there is no correlation between the features. Although ML techniques are less recent than DL techniques, there is a limited number of applications in the literature that use ML-based methods to semantically segment 3D point clouds in the CH domain. However, according to the study proposed by (Grilli, Remondino, 2019), these methods have made great progress in this direction. After exploring the applicability of supervised ML approaches to cultural heritage, the authors propose a standardized pipeline with reference to the different case studies. In this context, the work proposed by (Oses et al., 2014) has two main objectives: the first is to provide a framework that extracts geometric primitives from a masonry image, the second is to extract and select statistical features for the automatic clustering of masonry. They combine image processing and ML methodologies for the classification of masonry walls, and then compare the performance of five different ML algorithms on the classification task. The main problem of this approach is that each block of the wall is not separately characterized. To overcome this limitation, the work of (Riveiro et al., 2016) presents a new automatic segmentation algorithm for masonry blocks, starting from a 3D point cloud acquired by LiDAR technology. The image processing is based on an optimization of the watershed algorithm, which is used to improve segmentation algorithms in other works (Barsanti et al., 2017, Poux et al., 2017).
In their research, Grilli et al. (Grilli et al., 2018) use the UV maps of 3D models of DCH assets as input to supervised ML classification algorithms. In order to verify the efficiency of this method, the authors compare several classifiers on three different case studies. The research of (Grilli et al., 2019a) is another application of a supervised ML classifier (RF) used to classify a 3D point cloud in the CH field. The authors evaluate the relationship between covariance features and architectural elements, in particular determining a relation between the feature search radii and the size of the element. Subsequently, the work (Grilli, Remondino, 2020) analyses the previous approach, demonstrating its ability to generalise to different, never seen architectural scenes. The research conducted by (Murtiyoso, Grussenmeyer, 2020b) aims to support the manual labelling of the large point cloud training datasets required by ML algorithms. Moreover, the authors introduce a series of functions that allow the automatic handling of some segmentation and classification problems for CH point clouds. Due to the complexity of the problem, the project considers only some important classes, but it is suitable for different types of heritage.

Dataset
In the framework of the ongoing project CIVITAS (ChaIn for the excellence of reflectiVe socIeties to exploiT digital culturAl heritage and museumS), the three-dimensional digitization of the Ducal Palace at Urbino is being carried out, mainly based on TLS and photographic data capture.
At the present survey phase, the total number of acquired points is 1.790 mln (Nespeca, 2018, Clini et al., 2020). The comprehensive model of this large and complex building serves to develop innovative research and tests facing the challenges of the CH field. Although the whole numerical model refers to the complete Palace, in the current experiment only the part of the point cloud related to the Courtyard of Honour was exploited (Table 1).
The point cloud of the Ducal Palace is a main reference dataset, acquired with different sensors and technologies; the current work refers to the point cloud produced by two laser scanners (a Leica ScanStation C10 and a Leica ScanStation P40) mounted on a tripod. This acquisition was performed by setting different levels of resolution according to the complexity of the rooms and their decorative elements (see Figure 1), optimizing scanning times. Considering that one of the challenges of the project was to validate workflows from scans to segmented, informed and semantically reality-based models, significant parts of the point cloud were selected and compared.

DGCNN with radius-based EdgeConv
The DGCNN model, introduced in (Wang et al., 2019), is based on the EdgeConv operation, which is implemented with a Multilayer Perceptron (MLP) fed with the edge features of a point: the distance vectors between the point and its neighbouring points. In this way, the learned features of a point take its surrounding points into account, making it possible to learn different shapes and map them to different types of objects.
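As a concrete illustration, the edge-feature construction can be sketched in NumPy as follows. This is a minimal sketch with a hypothetical function name, not the paper's implementation; it follows the original DGCNN formulation, where the feature of edge (i, j) is the concatenation of the point x_i and the distance vector x_j - x_i.

```python
import numpy as np

def edge_features(points, neighbor_idx):
    """EdgeConv-style edge features: for each point x_i and each neighbour x_j,
    the edge feature is the pair (x_i, x_j - x_i), later fed to a shared MLP.

    points:       (N, F) array of point features (e.g. XYZ coordinates)
    neighbor_idx: (N, K) array of neighbour indices per point
    returns:      (N, K, 2*F) array of edge features
    """
    N, K = neighbor_idx.shape
    neighbors = points[neighbor_idx]                     # (N, K, F)
    centers = np.repeat(points[:, None, :], K, axis=1)   # (N, K, F)
    return np.concatenate([centers, neighbors - centers], axis=-1)
```

Because each edge feature contains the center point together with a relative offset, the subsequent MLP can combine global position with local shape information.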
In the original design (Wang et al., 2019), the edge features are calculated from the K-Nearest Neighbours (KNN) of a point.
As suggested by recent studies (Hermosilla et al., 2018), this method might not be optimal when the point cloud has non-uniform density, as generally happens with TLS. In Figure 3 we show a fragment of the point cloud of the Ducal Palace of Urbino, where two windows were captured with different densities. Using KNN, the point neighbourhood covers a small area when the local density is high and a larger one when points are more scattered. This means that different edge features can be derived from two points that belong to the same class (window). Aiming at overcoming this issue, we propose a variation of the DGCNN architecture (that we call RadDGCNN in this paper), where point neighbourhoods are based on the radius distance. As the neighbour points cover the same area for all points, independently of the local density, more representative features could possibly be learned. In our solution, for each point, all the points at distance D < R are selected, then K points are randomly sampled and used as input to the EdgeConv operation.
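The radius-based selection just described (take all points with D < R, then randomly sample K of them) could be sketched as below. The handling of spheres containing fewer than K points, and the fallback when the sphere is empty, are our assumptions; the paper does not specify these edge cases.

```python
import numpy as np

def radius_neighbors(points, query, radius, k, rng=None):
    """Radius-based neighbourhood: select all points within distance R of the
    query point, then randomly sample K of them. If fewer than K points fall
    inside the sphere, sample with replacement (an assumption) so that a
    fixed-size neighbourhood is always returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.linalg.norm(points - query, axis=1)
    inside = np.flatnonzero(d < radius)
    if inside.size == 0:                    # degenerate case: nearest point only
        inside = np.array([np.argmin(d)])
    replace = inside.size < k
    return rng.choice(inside, size=k, replace=replace)
```

Unlike KNN, the spatial extent of the neighbourhood is fixed by R, so the features derived for two points of the same class are comparable even when the local densities differ.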
The deep learning architecture used in our experiments is depicted in Figure 2. The input point cloud is processed by three EdgeConv operations, extracting hierarchical local features of the points. The outputs of the EdgeConv operations are then concatenated for each point, and global features are learned by an MLP layer, to finally produce a category prediction for each input point. The architecture differs from the baseline DGCNN architecture in the EdgeConv block, based on the radius distance, and in the adoption of a pseudo-random rotation.
As the DGCNN architecture is sensitive to the orientation of the shapes to be learned, in our experiments we use rotation to augment the dataset. In particular, we rotate each block around the up direction only, performing a rotation by each multiple of 90 degrees plus a random one. This is motivated by simple considerations on the domain, where these constraints clearly reflect domain rules (e.g. walls are often positioned at 90 degrees from each other, and architectural elements always have the same orientation with respect to the up direction). In this way we aim at optimizing the training phase, letting the system learn from meaningful variations of the input data.
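A minimal sketch of this augmentation, under the assumption that "up" is the Z axis (the function name and the exact number of generated copies are our illustration, not the authors' code): each block is rotated around Z by every multiple of 90 degrees plus one random angle.

```python
import numpy as np

def augment_rotations(block, rng=None):
    """Rotate a block of points around the up (Z) axis: one copy for each
    multiple of 90 degrees plus one copy at a random angle.
    block: (N, 3) XYZ coordinates; returns a list of rotated copies.
    """
    rng = np.random.default_rng() if rng is None else rng
    angles = [0.0, np.pi / 2, np.pi, 3 * np.pi / 2, rng.uniform(0, 2 * np.pi)]
    out = []
    for a in angles:
        c, s = np.cos(a), np.sin(a)
        # Rotation about Z leaves the vertical coordinate unchanged,
        # matching the domain rule that elements keep their up orientation.
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        out.append(block @ R.T)
    return out
```

Restricting the rotations to the up axis keeps walls and mouldings in physically plausible orientations, so the network never sees configurations that cannot occur in real buildings.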

EXPERIMENTS AND RESULTS
In our experiments we evaluated the DGCNN and RadDGCNN approaches on two point clouds, the Ducal Palace of Urbino (PDU) and Palazzo Ferretti (PF), using half of each scene to train the network and the remaining half for testing. In both cases we segmented the whole scene into blocks of 1x1x1 meters, and for each block we sampled 4096 points. Such a scene segmentation step, usually applied in the literature to experiment with DGCNN and other point-based deep networks, is required because the MLP network needs a fixed number of points for each data example to be processed and, furthermore, processing the whole scene as a single data example would be too computationally expensive. The number of points for each class obtained after segmentation and sampling, for the two test sets used in our experiments, is shown in Figure 4. The number of points representing neighbourhoods (K) was set to 20, as done in (Wang et al., 2019). In the case of RadDGCNN, the radius R was experimentally set to 0.1. Each point was represented with a 9-dimensional vector encoding the original XYZ coordinates, the point color (expressed in HSV format) and the normalized XYZ coordinates within each block. Regarding the DGCNN network hyperparameters, we adopted the same settings used in (Wang et al., 2019) and trained the networks for 200 epochs.
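The blocking and feature-encoding step can be sketched as follows. This is an illustrative NumPy version with a hypothetical function name; the exact sampling strategy (with replacement when a block is small) and the per-block normalisation are our assumptions.

```python
import numpy as np

def make_blocks(xyz, color, block_size=1.0, n_points=4096, rng=None):
    """Split a scene into block_size^3 blocks and sample n_points per block.
    Each point becomes a 9-dim vector: original XYZ, colour (e.g. HSV),
    and XYZ normalised to [0, 1] within its block.
    """
    rng = np.random.default_rng() if rng is None else rng
    keys = np.floor(xyz / block_size).astype(int)   # integer block coordinates
    blocks = []
    for key in np.unique(keys, axis=0):
        idx = np.flatnonzero((keys == key).all(axis=1))
        # Sample a fixed number of points, with replacement if the block is small.
        idx = rng.choice(idx, size=n_points, replace=idx.size < n_points)
        local = (xyz[idx] - key * block_size) / block_size
        blocks.append(np.concatenate([xyz[idx], color[idx], local], axis=1))
    return blocks  # list of (n_points, 9) arrays
```

Each block is then a fixed-size data example that the network can process independently of the overall scene extent.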
Results are reported in Tables 3 and 4, where the proposed architecture is compared with the DGCNN architecture adopted in (Wang et al., 2019). As shown, the use of radius-based EdgeConv provides better segmentation accuracy in almost all considered classes, leading to an overall accuracy increase of 1.8% in the case of PDU, and 1% for PF.
As an additional experiment, we tried to combine the two approaches using a simple multi-classifier architecture, where each point is processed by the two trained networks and the best output probability is selected (Combined column in the tables). In the PDU experiment, the combined model slightly improves mean IoU and accuracy for some classes (e.g. pillar and floor), while it is outperformed by RadDGCNN on other classes, e.g. column, thus leaving the overall accuracy almost unchanged. In the PF experiment, however, the improvement provided by the combined approach is larger in overall accuracy (+1.7%) as well as in mean IoU (+1.9%). This indicates that a simple combination of pre-trained models can be, in some cases, effective in boosting results.
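The combination rule, selecting for each point the output of whichever network is more confident, can be sketched as follows (a minimal illustration with a hypothetical function name, assuming both networks output per-class probabilities):

```python
import numpy as np

def combine_predictions(probs_a, probs_b):
    """Simple multi-classifier fusion: for each point, keep the prediction of
    the model with the higher maximum class probability.

    probs_a, probs_b: (N, C) per-class probabilities from the two networks
    returns:          (N,) predicted class labels
    """
    pick_b = probs_b.max(axis=1) > probs_a.max(axis=1)
    return np.where(pick_b, probs_b.argmax(axis=1), probs_a.argmax(axis=1))
```

No retraining is needed: the rule operates purely on the outputs of the two already-trained networks, which is what makes this kind of late fusion cheap to evaluate.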
In general, the results obtained are noticeably better in the PDU experiment (as also intuitively shown in Table 2). This is possibly due to the PDU scene being more symmetric than the PF scene, thus resulting in more regular shapes to be learned. We can see that, as expected, more regular, repeatable or recognizable shapes, such as walls and columns, are better recognized. However, we think the results are promising even for highly diverse shapes such as mouldings, doors and windows.
In Figures 5 and 6 we report the confusion matrices of the classification results using the combined method on the PDU and PF scenes, respectively. As one can see, in both scenes a number of classes (e.g. windows/doors, moulding and other) are often confused with walls. A possible reason for this is that the dataset is highly unbalanced and the wall class has far more points than the other classes. In the PDU test dataset, there are around 1.4 million points belonging to the class wall, while only 460k points for mouldings and 74k for the class other. In the case of the PF scene, we can observe that the classifier fails to detect ashlar and often classifies it as wall. This is somewhat expected, as ashlars are in fact similar to walls, so more examples of the class would probably be needed to properly train the network. We also note that the other class, which simply collects elements that do not fit any other class, is, as expected, hard to recognize (especially in the PF scene), as it lacks distinct characteristics and a sufficient number of data examples.
Finally, we would like to point out that, besides the regular tuning of network hyper-parameters, the choice of the neighbourhood considered in the EdgeConv operation might lead to different results. In the case of DGCNN, as the learning is driven by KNN, the value of K can be changed to spatially enlarge or restrict the neighbourhood used to extract hidden features from points. In the case of the radius distance, one can leverage a different parameter, R, which allows choosing a specific neighbourhood radius. While investigating the optimal choices for the considered point clouds is out of the scope of this study, we show as an example, in Table 5, the results obtained with a different choice of parameters, where the point neighbourhoods have been extended to consider more surrounding points. Specifically, we set K to 40 for both DGCNN and RadDGCNN, and R to 0.2 for RadDGCNN. While in the case of the PDU, where results were already better, there is no significant difference, on the PF scene the accuracy of our method increases (approximately by 2%). An interesting direction to investigate is that of understanding how to optimally tune such parameters, e.g. driven by the peculiarities of the specific data at hand.

CONCLUSIONS AND FUTURE WORKS
In this paper we focused on the evaluation of DGCNN-based techniques in the historical building domain. We therefore evaluated our method under a simple and directly measurable condition: we learn from a portion of a scene and predict the remaining part. While this setting allows us to draw preliminary conclusions regarding the ability to semantically segment architectural objects, it is surely a simplified situation.
Even if this setting can be of practical interest, as automatically processing a partially annotated scene can by itself considerably reduce annotation time, more generalization has to be achieved by predicting totally unknown scenes. However, this requires enough annotated point cloud data to properly train the network, which can be non-trivial to obtain. An interesting research direction is that of using synthetic data to form a critical mass of annotated point clouds to learn from, an idea recently explored by our research group.
In conclusion, we think that the results provided by our experiments are promising and motivate further studies in this direction. Future work includes a deeper investigation of more solid, domain-grounded constraints and data processing techniques, as well as the possibility of building such knowledge into the deep learning models. Another aspect to be addressed is that of understanding the importance of single features. For example, in the present experiments we used HSV-encoded color. Further experiments are planned to quantify its effective contribution to the classification results, and to measure performance in situations where color is not available.