APPLICATION OF MACHINE AND DEEP LEARNING STRATEGIES FOR THE CLASSIFICATION OF HERITAGE POINT CLOUDS

The use of heritage point clouds for documentation and dissemination purposes is increasing. Associating semantic information with 3D data by means of automated classification methods can help to characterize, describe and better interpret the object under study. In recent decades, machine learning methods have brought significant progress to classification procedures. However, the topic of cultural heritage has not been fully explored yet. This paper presents a study on the classification of heritage point clouds using different supervised learning approaches (both Machine and Deep Learning). The classification aims to automatically recognize architectural components such as columns, facades or windows in large datasets. For each case study and classification method, different accuracy metrics are calculated and compared.


INTRODUCTION
The 3D documentation of Cultural Heritage monuments and sites with point clouds or meshes, derived from photogrammetry and laser scanning surveys, is now widespread. Given the recent evolution of technologies and digital tools, automated and reliable methods to classify point clouds or meshes are becoming fundamental. Possible applications of the classification of heritage 3D data include: identification and distinction of structural and decorative architectural elements, mapping of different states of conservation and materials, automatic recognition of similar architectural elements as a preparatory phase for Building Information Modelling (BIM), etc.
In the literature, different data classification methods have been proposed (Grilli et al., 2017), such as edge- and region-based approaches (initially applied to image segmentation) (Wang and Shang, 2009; Vo et al., 2015) or model-fitting approaches, which fit geometric primitives to the 3D shapes (Chen et al., 2014). The advent of Artificial Intelligence (AI) solutions brought further progress in automation. In particular, Machine and Deep Learning (ML/DL) methods allowed the development of algorithms that let machines make decisions based on empirical training data. Deep Learning can be considered an evolution of Machine Learning: its algorithms are structured in layers to create an artificial neural network that can learn and make decisions on its own. The use of Machine Learning techniques for point cloud classification has been successfully investigated in the last decade in the geospatial domain (Guo et al., 2014; Niemeyer et al., 2014; Weinmann et al., 2015; Qi et al., 2017; Özdemir and Remondino, 2019a), while in the Cultural Heritage (CH) field it has only recently started to be explored (Poux et al., 2017; Grilli and Remondino, 2019).
The paper aims to explore the potential offered by Machine and Deep Learning approaches for the supervised classification of 3D heritage case studies (Figure 1). First, a literature review is presented. Second, different ML/DL point cloud classification approaches are described and then tested on two case studies: the temple of Neptune in Paestum and some Renaissance buildings with porticoes in Bologna. Finally, the classification results are presented and discussed on the basis of confusion matrix scores.

RELATED WORKS
In recent years, automatic procedures for the classification of point clouds or meshes have made significant progress thanks to the advent of Machine Learning approaches (Hackel et al., 2016; Weinmann et al., 2017; Wang et al., 2018). Several benchmarks have been proposed in the Geomatics community, providing labelled terrestrial and airborne data on which users can test and validate their algorithms. Most of the available datasets provide classified natural, urban, and street scenes (e.g., www.semantic3d.net, www.cityscapes-dataset.com, etc.). While in those scenarios the object classes and labels are largely established (mainly ground, roads, trees, and buildings), the identification of precise categories in the heritage field is much more complicated, as:
• for the same case study, several classes can be identified depending on the purpose;
• a semantic architectural class is not always linked to a precise shape/colour.
Probably for these reasons, the only annotated heritage databases available so far consist of 2D images and refer only to building facades, e.g. eTRIMS (Korc and Forstner, 2009), the Ecole Centrale Paris (ECP) Facades dataset (Teboul et al., 2010) and the CMP Facade Database (Tyleček and Šára, 2013).
Despite this data shortage, different Machine Learning approaches have been proposed in the architectural and heritage context (e.g. Oses et al., 2014; Llamas et al., 2016; Llamas et al., 2017). Convolutional Neural Networks (CNNs) are used by Yasser et al. (2017) for visual categorization and to create a digital heritage search platform (ICARE) that allows users to archive digital heritage content and perform semantic queries over multimodal cultural heritage data archives. In some cases, the classification is performed for annotation and restoration purposes, and the information is transferred from 2D to 3D (Campanaro et al., 2016; Grilli et al., 2018). The web platform Aioli (www.aioli.cloud) allows a semi-automatic annotation of 3D heritage, where 2D mapping data are displayed in real time onto a 3D model (Roussel et al., 2019). To the authors' knowledge, there are no works applying Deep Learning methods to the classification of 3D architectural heritage.

Features extraction and selection
For training and classification, different sets of features are used (Figure 3), depending on the case study and the approach (ML / DL) (Özdemir and Remondino, 2019b). For heritage and architectural 3D data, we combined the use of:
• Decentralised coordinates: they represent the local geometry around a point as a patch of its k nearest points. To decentralise the coordinates, the minimum x, y, z values are subtracted within each sequence, and the sequences are sorted with respect to the decentralised coordinate values (Figure 3a).
• Radiometric values: the input data is in a 3-band RGB colour space with an 8-bit radiometric resolution per band. The RGB values are re-scaled to the range 0-1 (Figure 3b).
• Geometric features (Figure 3c): covariance features and others, described in Section 3.1.1.
Table 1. Considered geometric features.
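The decentralised coordinates and radiometric rescaling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_point_sequence`, the brute-force neighbour search, the value k = 8 and the row ordering (by decentralised x, then y, then z) are hypothetical choices, not details given in the paper.

```python
import numpy as np

def build_point_sequence(xyz, rgb, index, k=8):
    """Return a (k, 6) sequence for one point: decentralised x, y, z and scaled R, G, B."""
    # Brute-force search for the k nearest points (the query point itself is included).
    dist = np.linalg.norm(xyz - xyz[index], axis=1)
    nearest = np.argsort(dist)[:k]
    patch_xyz = xyz[nearest].astype(float)
    # Decentralise: subtract the per-axis minimum within the sequence.
    local_xyz = patch_xyz - patch_xyz.min(axis=0)
    # 8-bit RGB values rescaled to the range 0-1.
    patch_rgb = rgb[nearest].astype(float) / 255.0
    # Sort rows by decentralised x, then y, then z (assumed ordering).
    order = np.lexsort((local_xyz[:, 2], local_xyz[:, 1], local_xyz[:, 0]))
    return np.hstack([local_xyz[order], patch_rgb[order]])

# Synthetic stand-in for a coloured point cloud.
rng = np.random.default_rng(0)
xyz = rng.uniform(0.0, 10.0, size=(100, 3))
rgb = rng.integers(0, 256, size=(100, 3))
sequence = build_point_sequence(xyz, rgb, index=0, k=8)
print(sequence.shape)  # (8, 6)
```

Each row of `sequence` corresponds to one neighbouring point, in the spirit of the rows of Figure 3; geometric features would be appended as further columns.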

Geometric features
The geometric features employed include (i) covariance features, (ii) normal-based features (Verticality V), and (iii) height-based features (Z coordinates). The covariance features (also called eigenfeatures) are based on the covariance matrix (Chehata et al., 2009) computed within a local neighbourhood of a 3D point. Combinations of the three eigenvalues λi (λ1 > λ2 > λ3) extracted from the covariance matrix can be used to calculate local features that describe the shape of the neighbourhood (Blomley et al., 2014). The measures of linearity L, planarity P and sphericity S provide information about the presence of a linear 1D structure, a planar 2D structure or a volumetric 3D structure. Further measures are provided by omnivariance O, anisotropy A and local surface variation Cλ (Table 1).
Different strategies can be applied to identify local neighbourhoods for points belonging to a 3D point cloud (Weinmann et al., 2013). In our method (Grilli et al., 2019), features are first calculated on spherical neighbourhoods at various radii (multi-scale approach) using CloudCompare (Hackel et al., 2016). Then, the features are examined to investigate whether some classes are particularly well described by features at specific scales. Finally, the optimal subset of features is selected in order to emphasize the differences between the classes of interest.
Unlike conventional DL approaches, we provide handcrafted features (the same features used for the ML methods) as input to the employed DL algorithms (Özdemir and Remondino, 2019b).
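As an illustration of the eigenfeatures named in this section, the sketch below computes them for a single neighbourhood. Since Table 1 is not reproduced here, the formulas are the definitions commonly used in the point cloud literature and are assumed, not confirmed, to match the paper's; verticality is assumed to be 1 - |nz|, with the normal taken as the eigenvector of the smallest eigenvalue. The function name and the synthetic wall patch are hypothetical.

```python
import numpy as np

def covariance_features(neighbourhood):
    """Eigenvalue-based features of an (m, 3) array of neighbouring points."""
    cov = np.cov(neighbourhood, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)    # eigenvalues in ascending order
    # Clip to avoid division by zero and tiny negative values from round-off.
    l3, l2, l1 = np.clip(eigenvalues, 1e-12, None)     # so that l1 >= l2 >= l3 > 0
    normal = eigenvectors[:, 0]                        # direction of smallest variance
    return {
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "anisotropy": (l1 - l3) / l1,
        "surface_variation": l3 / (l1 + l2 + l3),
        "verticality": 1.0 - abs(normal[2]),
    }

# Synthetic vertical wall patch lying in the x-z plane (y = 0).
wall = np.array([[x, 0.0, z] for x in range(20) for z in range(10)], dtype=float)
features = covariance_features(wall)
print(features)
```

For this flat vertical patch, sphericity and surface variation are close to zero and verticality is close to one, which is the behaviour that lets such features separate facades from floors or columns. In a multi-scale setting the same function would be evaluated on neighbourhoods of several radii.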

Machine Learning approach - Random Forest
Random Forest (RF) is a supervised classifier (Breiman, 2001) that uses an ensemble of classification trees, gets a prediction from each tree, and selects the final class by majority voting. Two parameters need to be set to build the forest: the number of decision trees to be generated (Ntree) and the number of variables to be selected and tested for the best split when growing the trees (Mtry) (Belgiu et al., 2016). We rely on the RF implementation available in the Scikit-learn Python library (version 0.21.1). During the training process, Ntree and Mtry are tuned by selecting the values that give the best F1-score on the test set.
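The tuning of Ntree and Mtry can be sketched with Scikit-learn as follows, where `n_estimators` and `max_features` are the library parameters corresponding to Ntree and Mtry. The synthetic data, the candidate values and the macro-averaged F1-score are assumptions made for illustration only; the paper does not report its search grid or averaging scheme.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-point feature vectors and manual labels.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

best = {"f1": -1.0, "n_tree": None, "m_try": None}
for n_tree in (50, 100):       # candidate Ntree values (hypothetical)
    for m_try in (2, 4):       # candidate Mtry values (hypothetical)
        model = RandomForestClassifier(n_estimators=n_tree,
                                       max_features=m_try,
                                       random_state=0)
        model.fit(X_train, y_train)
        # Macro-averaged F1 over all classes (assumed averaging scheme).
        score = f1_score(y_test, model.predict(X_test), average="macro")
        if score > best["f1"]:
            best = {"f1": score, "n_tree": n_tree, "m_try": m_try}

print(best)
```

The sketch follows the text in scoring candidates on the test set; a separate validation split is often used for this step so that the reported test score is not optimistically biased.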

Machine Learning approach - OvO classifier
The One-versus-One (OvO) classifier combines a group of binary classifiers into a multiclass classifier. It works by training one binary classifier for each pair of classes. In the case of N possible classes, it trains N*(N-1)/2 binary classifiers, which are then employed to identify the classes of the test samples. In our tests we used the OvO classifier available in the dlib C++ library (King, 2009).
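The One-versus-One scheme can be illustrated with a toy example in plain Python. This is not the dlib implementation used in the paper: the binary classifier here is a hypothetical nearest-class-mean rule on a single feature, the class names and values are invented, and the final decision by majority vote is the usual OvO combination rule, assumed here.

```python
from collections import Counter
from itertools import combinations

def train_pair(samples, labels, class_a, class_b):
    """Toy binary classifier for one class pair: nearest class mean on a 1D feature."""
    mean_a = sum(s for s, l in zip(samples, labels) if l == class_a) / labels.count(class_a)
    mean_b = sum(s for s, l in zip(samples, labels) if l == class_b) / labels.count(class_b)
    return lambda x: class_a if abs(x - mean_a) <= abs(x - mean_b) else class_b

def train_ovo(samples, labels):
    """Train one binary classifier per unordered pair of classes."""
    classes = sorted(set(labels))
    return [train_pair(samples, labels, a, b) for a, b in combinations(classes, 2)]

def predict_ovo(classifiers, x):
    """Each binary classifier casts one vote; the most voted class wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

samples = [0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 3.0, 3.2]
labels = ["column", "column", "floor", "floor", "facade", "facade", "window", "window"]
classifiers = train_ovo(samples, labels)
print(len(classifiers))                 # 6, i.e. 4 * 3 / 2
print(predict_ovo(classifiers, 1.05))   # floor
```

With the class counts used in the case studies, the same formula gives 45 binary classifiers for 10 classes and 91 for 14 classes.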

Deep Learning approach - 1D and 2D CNN
Two Convolutional Neural Network (CNN) (Fukushima et al., 1980) methods are also proposed. A CNN is a specific type of artificial neural network specialized in processing data that has a grid-like topology, such as an image. A CNN consists of an input layer, an output layer and hidden layers, which include multiple convolutional layers, pooling layers, fully connected layers and normalization layers. The tested CNNs are:
• 1D CNN: it consists of 1 input layer, 2 convolutional layers, 3 dense layers, 1 maximum pooling layer, 1 global average pooling layer and 1 dropout layer.
• 2D CNN: it is composed of 1 input layer, 4 2D convolutional layers, 2 2D max pooling layers, 3 dropout layers, 1 flatten layer and 2 dense layers.

Deep Learning approach - Bi-LSTM
A Recurrent Neural Network (RNN) (Rumelhart et al., 1988) is commonly used for modelling sequential data. Data is sequential if the building blocks in a dataset are not independent of each other. The most common applications of RNNs are handwriting or speech recognition, translation, etc. Our RNN consists of five layers: a sequence input layer, a Bidirectional Long Short-Term Memory (Bi-LSTM) layer with 200 hidden units, a fully connected layer, a softmax layer and a classification layer. We describe each point with a sequence generated from its surrounding points (i.e. each row in Figure 3 is a part of the sequence). These sequences are expected to represent the geometry around each point better than a single feature vector.

CASE STUDIES AND RESULTS
The aforementioned ML/DL classification approaches were applied to two different heritage datasets:
a) The Greek temple of Neptune in Paestum (Italy): it was built in the Doric order around 460-450 BC. It measures ca. 24.5 x 60 m (Fig. 4) and the available point cloud is the result of a combined UAV and terrestrial photogrammetric survey (Fiorillo et al., 2013). With ML and DL approaches, the aim is to semantically segment the 3D data of the temple considering its Greek architectural elements.
For both case studies, the five classification approaches described in Section 3 are run, using different sets of features as input (Table 2). For the ML approaches and the 1D CNN, every point of the cloud is described by a feature vector that contains different geometric features chosen ad hoc. For the 2D CNN and Bi-LSTM, the classification was run with these geometric features both with and without the decentralised coordinates.
Table 2. Considered geometric features for each classification approach. As described in Figure 3: a = decentralised coordinates, b = radiometric features, c = geometric features.

Temple of Neptune in Paestum
After subsampling to speed up the computation, the data consist of some 2.2 million points (Figure 6). Ten different classes corresponding to the architectural elements of the temple were identified. Then a small portion of the entire dataset was accurately annotated by hand (Figure 7). Covariance features (Table 1, Fig. 8) were then extracted at different neighbourhood sizes, related to the dimensions of the column orders. Depending on the selected geometric feature and the radius size (r), it is possible to highlight different architectural elements. The Surface Variation feature, for example, can emphasise the columns if extracted at a neighbourhood size r equal to the radius of the columns. The Planarity extracted with the same radius distinguishes planar elements (e.g. facades, floors) from cylindrical ones (e.g. columns).
Using a combination of these features, both Machine and Deep Learning methods were trained to predict the labels on the entire dataset (Figure 9). Table 3 reports the results obtained with the RF classifier, including the confusion matrix and accuracy metrics. Each row of the matrix represents the instances in an actual class (ground truth), while each column represents the instances in a predicted class. In general, most of the classification errors occur between classes with quite similar geometries, such as "Abacus" and "Architrave", or "Frieze", "Cornice" and "Tympanum". Table 4 shows side by side the per-class F1-scores for each method applied. The average F1-scores range from 86.69% with Bi-LSTM to 92% with RF. Table 5 summarizes all the accuracy metrics reached with the different approaches. As can be observed, higher levels of accuracy were achieved using the machine learning approaches (RF and OvO).
Table 4. A summary of all tested ML/DL classification methods reporting per-class F1-score for the temple dataset.
Table 5. Summary of the classification results for the temple dataset achieved with the different ML/DL methods.

Renaissance buildings in Bologna
The photogrammetric point cloud of the porticos consists of ca. 1 million points (Figure 10). A small portion of the entire dataset was manually annotated with 14 classes (Figure 11) and some significant features were extracted (Figure 12). Then, the different classifiers were trained to classify the entire point cloud (Figure 13). Confusion matrices and classification results obtained using the machine and deep learning approaches are shown below (Tables 6-8). As for the previous case study, the best results were achieved with the machine learning approaches.
Figure 13. RF classification results on the Bologna dataset.
Among the 14 identified classes, the ones with the most classification errors were those with similar geometric properties. For example, the "road" was misclassified as "pavement", and the "moulding" as "facade". This is probably due to the limited number of points in the annotations.

CONCLUSIONS
The paper presented an evaluation of different ML/DL classification approaches to semantically segment point clouds of architectural and archaeological scenarios. From the summarized results in Tables 5 and 6, we can state that:
• ML approaches outperformed DL methods;
• for the Temple dataset, the F1-scores are between 86% and 95%, while for the Bologna dataset they range from 34% to 82%. In our opinion, even though the features were handcrafted specifically for the case studies, the complexity of the class structure of the Bologna dataset caused lower accuracy metrics;
• based on the accuracy metrics achieved on the Bologna dataset, we can suggest that the DL approaches used are not suitable for this kind of dataset;
• in the case of heritage datasets, the conventional use of decentralised coordinates for DL approaches reduced the overall accuracy;
• as regards training times, the DL approaches took about 10 minutes on a GPU, while the ML ones completed the training in less than a minute on a CPU.
From a detailed analysis of Table 4 and Table 8 we can observe that the accuracy decreases when the classes share the same geometry. This represents one of the most challenging points for heritage classification, as there is not always a correspondence between shape/colours and semantics for the architectural classes.
As further developments, we plan to explore new features and class structures to improve our classification results. Moreover, starting from the achieved classification results (Fig. 9 and 14), it would be interesting to develop a tool to assist the conversion of semantic point clouds to parametric 3D models (HBIM/BIM).
Oses et al. (2014) used different Machine Learning classifiers to perform an image-based delineation of masonry walls. Amato et al. (2015) used k-nearest neighbour (kNN) classification and landmark recognition techniques to address the problem of monument recognition in images. Convolutional Neural Networks (CNN) were applied for the first time to heritage scenarios in Llamas et al. (2016) and Llamas et al. (2017).

Figure 2. Classification workflow.
The different ML/DL methods presented in this paper (Sections 3.2-3.5) work directly on 3D point clouds and they are all supervised, as the input data contain associated label (i.e. class) information. The classification process (Figure 2) consists of four steps: feature extraction, feature selection, model training and prediction. To evaluate the performance of the classification methods, a test set is considered for each case study. For every point of the dataset, the label predicted by the classifier is compared with the manually annotated one. Confusion matrices are then generated, and the following accuracy metrics are calculated for each class:
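Per-class metrics can be derived from a confusion matrix as in the sketch below. The list of metrics is not reproduced in this text, so the choice of precision, recall and F1-score is an assumption (only the F1-score is named explicitly in the results); the function name and the example matrix are hypothetical. Rows are taken as ground-truth classes and columns as predicted classes, the convention stated for Table 3.

```python
def per_class_metrics(confusion):
    """Precision, recall and F1 per class from a square confusion matrix.

    Rows are ground-truth classes, columns are predicted classes.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]                                 # correctly labelled points
        fn = sum(confusion[c]) - tp                          # class c points labelled otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp     # other points labelled as c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results

# Invented 3-class confusion matrix, for illustration only.
confusion = [[8, 2, 0],
             [1, 7, 2],
             [0, 1, 9]]
metrics = per_class_metrics(confusion)
for class_index, values in enumerate(metrics):
    print(class_index, values)
```

For the first class of this invented matrix, 8 of the 10 ground-truth points are recovered (recall 0.8) and 8 of the 9 points predicted as that class are correct (precision 8/9).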

Figure 3. Data sequence used for the classification purpose with m number of points and n number of features: decentralised coordinates (a), radiometric values (b), geometric features (c).

Figure 4. Temple of Neptune in Paestum, Italy.
b) Building with porticos in Bologna (Italy): the historical porticos of Bologna (Fig. 5) were built between the 11th and 20th centuries. We consider a portion of ca. 85 x 6 m, surveyed with photogrammetric techniques (Remondino et al., 2016). The ML/DL classification is aimed at a semantic annotation of the different architectural and decorative elements.

Figure 6. Photogrammetric point cloud of the temple of Neptune in Paestum (2.2 million points).

Figure 8. Some relevant features computed on the point cloud of the temple of Neptune. Clockwise from left: verticality, surface variation, sphericity and planarity.

Figure 9. RF classification results and exploded view.

Figure 10. Photogrammetric point cloud of a Renaissance building in Bologna (ca. 1.1 million points).

Figure 11. A portion of the Bologna dataset manually labelled with 14 classes.

Figure 12. Example of the Surface Variation feature extracted on the Bologna dataset.

Figure 14. Exploded view of the Bologna dataset after the automated classification (vertical drainpipe and road classes are not visualized due to their low accuracy score).

Table 3. RF classification results: Confusion Matrix and per-class accuracy for the temple dataset.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-4/W18, 2019. GeoSpatial Conference 2019 - Joint Conferences of SMPR and GI Research, 12-14 October 2019, Karaj, Iran.

Table 6. Summary of the classification results for the porticos dataset achieved with the different ML/DL methods.

Table 7. RF classification results: Confusion Matrix and per-class accuracy for the porticos dataset.

Table 8. A summary of all tested ML/DL classification methods reporting the per-class F1-score for the porticos dataset.