A BENCHMARK FOR LARGE-SCALE HERITAGE POINT CLOUD SEMANTIC SEGMENTATION

The lack of benchmarking data for the semantic segmentation of digital heritage scenarios is hampering the development of automatic classification solutions in this field. Heritage 3D data feature complex structures and uncommon classes that prevent the simple deployment of available methods developed in other fields and for other types of data. The semantic classification of heritage 3D data would support the community in better understanding and analysing digital twins, facilitate restoration and conservation work, etc. In this paper, we present the first benchmark with millions of manually labelled 3D points belonging to heritage scenarios, realised to facilitate the development, training, testing and evaluation of machine and deep learning methods and algorithms in the heritage field. The proposed benchmark, available at http://archdataset.polito.it/, comprises datasets and classification results for better comparisons and insights into the strengths and weaknesses of different machine and deep learning approaches for heritage point cloud semantic segmentation, in addition to promoting a form of crowdsourcing to enrich the already annotated database.


INTRODUCTION
The growing ease of point cloud acquisition, especially due to the development of automated image-based solutions, SLAM methods and laser scanning systems, has increased the interest of the scientific community in the use, interpretation and direct exploitation of point clouds for many different purposes. Consequently, in the Cultural Heritage (CH) field, HBIM (Historical Building Information Modeling) has gained particular attention from experts, since it allows architectural heritage data to be managed in both geometrical and informative ways (Bruno and Roncella, 2018). Although point clouds provide a necessary starting point, the process of developing HBIM models still relies on manual operations; experts are required to handle large and complex datasets without the aid of any automatic or semi-automatic method to recognise and reshape 3D elements (Bitelli et al., 2017). This process is very time-consuming and leads to a loss of information, given the unavoidable simplification involved. In this scenario, the comeback of Deep Learning (DL) in several research fields has been overwhelming (Griffiths and Boehm, 2019). Deep Neural Networks (DNNs) have established themselves as the most efficient technology for learning-based tasks (Bello et al., 2020). However, although DNNs have proved very promising for handling and recognising 3D data, in the CH field manual operations are still regarded as more trustworthy, at least for capturing the actual state of an asset from point clouds. There are many reasons for such scepticism; first of all, CH goods have complex geometries, which can be described only with a high level of detail. Moreover, the irregular shapes, combined with the uniqueness of the objects, make supervised learning techniques arduous for 3D data. Besides the intrinsic complexity of 3D data, especially if compared with 2D ones (e.g. images or trajectories), there are other limitations that are hampering the exploitation of DNNs for CH: on the one hand, the lack of training data; on the other, the computational effort. While the latter is being overcome by continuous technological advancements, enabling a system to learn from a labelled dataset and generalise to unseen scenes is still a distant goal. Manual annotation is expensive and time-consuming (even if more reliable), and there is a certain reluctance to share 3D data with the research community. With the main purpose of investing more effort in these research lines, the authors provide a large dataset of CH architectures, which aspires to become the reference benchmark in the field. To the best of our knowledge, it is the first point cloud dataset specifically released for the CH domain that comprises data collected with both TLS and photogrammetric surveys and also provides the semantic ground truth annotation. This paper aims to present a new 3D point cloud classification benchmark dataset (named ArCH dataset, Architectural Cultural Heritage, http://archdataset.polito.it/) with millions of manually labelled points belonging to heritage scenarios. The realised benchmark originates from the collaboration of different universities and research institutes (Politecnico di Torino, Università Politecnica delle Marche, FBK Trento, Italy, and INSA Strasbourg, France). It is unique as it offers, for the first time to the research community, annotated point clouds describing heritage scenes. These point clouds, labelled with 10 classes, are meant to facilitate the development, training, testing and evaluation of machine learning algorithms, including deep learning methods, in the heritage field. For a more profitable use of this benchmark, aside from the free download of all data, we provide public results of the submitted approaches, with rankings of the best-performing ones.

PREVIOUS WORKS
Several benchmarks have been proposed in the Geomatics community, and their value is considerable. In fact, labelled 3D data enable users to test and validate their algorithms, besides improving the training phase for both machine and deep learning approaches. Among the existing benchmarks, it is worth citing ModelNet 40 (Wu et al., 2015), with more than 100k CAD models of objects, mainly furniture, from 40 different categories; KITTI (Geiger et al., 2013), which includes camera images and laser scans for autonomous navigation; the Sydney Urban Objects dataset (De Deuge et al., 2013), acquired in urban environments with 26 classes and 631 individual scans; Semantic3D (Hackel et al., 2017), with urban scenes such as churches, streets, railroad tracks and squares; S3DIS (Armeni et al., 2016), which includes mainly office areas; and the Oakland 3-D Point Cloud dataset (Munoz et al., 2009), consisting of labelled laser scanner 3D point clouds collected from a moving platform in an urban environment. It is also worth mentioning other specific datasets, such as iQmulus (Vallet et al., 2015), the Cityscapes Dataset (Cordts et al., 2016), Paris-rue-Madame (Serna et al., 2014), Paris-Lille-3D (Roynard et al., 2018), 3DOMcity (Özdemir et al., 2019) and MiMAP for BIM feature extraction. Most of these datasets collect data from urban environments with point clouds composed of around 100k points. In these scenarios, the object classes and labels are fairly general and almost standard (e.g. ground, roads, vehicles, vegetation, buildings, etc.). In the heritage field, on the other hand, the identification of precise categories is much more complicated. Several peculiar classes can be identified in the same dataset. Shape and colour are not always linked to a specific semantic class, and objects belonging to the same class can have completely different shapes, in addition to complex geometries.
Moreover, to date, there are still no published datasets focusing on immovable cultural assets with an adequate level of detail. Up to now, most of the available datasets of annotated architectural heritage consist of 2D images, such as the Ecole Centrale Paris (ECP) Facades dataset (Teboul et al., 2010), eTRIMS (Korc and Forstner, 2009) and the CMP Facade Database (Tyleček and Šára, 2013), all of which provide manually annotated facade images from different cities around the world and diverse architectural styles. Still in 2D, there is the work conducted by Llamas et al. (2017), where for the first time Convolutional Neural Networks (CNN) were applied to heritage scenarios. The authors also released a dataset with more than 10k images including categories like Altar, Apse, Belltower, Column, Dome (inner and outer), Flying buttress, Gargoyle, Stained glass, and Vault. In this context, several researchers have started to approach the topic of semantic segmentation of CH point clouds within the machine and deep learning framework (Grilli et al., 2019a; Kharroubi et al., 2019; Murtiyoso and Grussenmeyer, 2020; Pierdicca et al., 2020). However, the lack of an appropriate 3D heritage dataset does not allow an effective comparison between methods and results. Precisely for this reason, we propose the ArCH dataset, which can stimulate the scientific community on these challenging issues.

DATASET
The dataset is composed of 17 annotated and another 10 non-annotated point clouds, the latter of which could be labelled by users and added to the main dataset. Many of the scenes included in the ArCH benchmark are part (or a candidate) of the UNESCO World Heritage List (WHL): the porticoes of Bologna presented as a candidate in 2020.
Other scenes are nevertheless part of the historical built heritage and represent various historical periods and architectural styles. This difference could constitute a drawback in the definition of the dataset classes, as it introduces elements of inhomogeneity within the same classes. However, providing the neural network with differing elements improves its ability to generalise among various CH case studies. Among the labelled scenes of the benchmark, 15 scenes are available for training and 2 for testing. They all include churches, chapels, porticoes, loggias, pavilions and cloisters. The 2 test scenes (named A and B) have different characteristics:
- the first (A_SMG_portico) represents a simple, almost symmetrical building on one level, with more standard and repetitive geometric elements;
- the second (B_SMV_chapel_27to35) represents a complex, non-symmetrical building, structured on two levels, surveyed both indoor and outdoor, with different types of vaults, stairways and windows (Figure 2).
These two test scenes were chosen to (i) simplify the comparison of the results, (ii) assess the effectiveness of the proposed algorithms and (iii) highlight the generalisation and learning capability of the networks not only on a relatively simple scene but also on a complex one.

Data acquisition
The 3D data composing the benchmark (Table 1) are challenging, not only due to their size (up to ≈ 4 · 10^8 points per scan) but also because of their high measurement resolution and the high density of the final point cloud. Most of the scenes are obtained through the integration of different point clouds, acquired with different sensors (cameras, scanners) and platforms (UAVs, etc.), following an appropriate accuracy evaluation. The employed terrestrial laser scanners include a FARO Focus 3D X 130 and 120 and a Riegl VZ-400. The photogrammetric surveys of the Sacro Monte of Varallo were performed with a Nikon D880E, whereas for Bologna and Trento a Nikon D3100 and D3X were employed, respectively. A UAV platform was equipped with a SONY Ilce 5100L, whereas the DJI Phantom 4 Pro used its integrated camera.

Data pre-processing
The collected point clouds were initially pre-processed to make the cloud structures more homogeneous (Table 1). The pre-processing was performed in CloudCompare and followed 3 steps:
- spatial translation;
- subsampling;
- choice of features.
The spatial translation of the point clouds was necessary because of the georeferencing of the scenes. The coordinate values had too many digits to be processed by the neural networks, so the coordinates were truncated and every single scene was spatially moved close to the system origin (0,0,0). The subsampling operation became necessary due to the high number of points (mostly redundant) in each scene (> 20M points). The option of random subsampling was discarded because it would have limited the test repeatability; therefore two other methods were tested: octree- and space-based subsampling.
From the comparison of the results obtained with the octree- and space-based subsampling, we opted for the second option. The test results of the two methods differed by only 1%, so the space-based method was preferred for its uniformity and simplicity of setting.
As far as the space-based method is concerned, a minimum distance of 0.01 m between points was set; in this way, a high level of detail is ensured, but at the same time it is possible to considerably reduce the number of points and the size of the file, in addition to regularising the geometric structure of the point cloud.
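The translation and space-based subsampling steps described above can be illustrated with a minimal sketch. It is not the CloudCompare implementation used for the benchmark: the function names, the choice of the bounding-box minimum as translation offset and the sample coordinates are assumptions made only for illustration. The subsampling is greedy and deterministic (a point is kept only if no previously kept point lies closer than the minimum distance), which preserves the repeatability that motivated discarding random subsampling.

```python
import itertools
import math

# Offsets of the 27 grid cells surrounding (and including) a given cell.
NEIGHBOUR_OFFSETS = list(itertools.product((-1, 0, 1), repeat=3))


def translate_to_origin(points):
    """Shift a scene so that its bounding-box minimum sits at (0, 0, 0).

    Assumption: the text only states that scenes were moved close to the
    origin; using the bounding-box minimum is one possible choice.
    """
    offset = tuple(min(p[i] for p in points) for i in range(3))
    shifted = [tuple(p[i] - offset[i] for i in range(3)) for p in points]
    return shifted, offset


def spatial_subsample(points, min_dist=0.01):
    """Keep a point only if every already kept point is at least
    min_dist (in metres) away from it. Input order decides ties."""
    grid = {}   # integer cell index -> kept points falling in that cell
    kept = []
    for p in points:
        key = tuple(math.floor(c / min_dist) for c in p)
        # With cell size == min_dist, any conflicting point must lie in
        # one of the 27 neighbouring cells.
        neighbours = (
            q
            for off in NEIGHBOUR_OFFSETS
            for q in grid.get(
                (key[0] + off[0], key[1] + off[1], key[2] + off[2]), ()
            )
        )
        if all(math.dist(p, q) >= min_dist for q in neighbours):
            grid.setdefault(key, []).append(p)
            kept.append(p)
    return kept


# Hypothetical georeferenced coordinates (metres), for illustration only.
scene = [
    (394812.120, 5071230.450, 251.300),
    (394812.124, 5071230.450, 251.300),  # about 4 mm from the first point
    (394812.620, 5071230.450, 251.300),
    (394813.120, 5071231.450, 252.300),
]
local, offset = translate_to_origin(scene)
thinned = spatial_subsample(local, min_dist=0.01)
```

In this sketch the second point is discarded because it lies closer than 0.01 m to the first one, while the stored offset allows the original georeferenced coordinates to be restored afterwards.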
In the DL framework, feature selection follows two different approaches. The first one consists in selecting as few features as possible and letting the neural network learn from them alone. The second, mainly used for smaller datasets, foresees the selection of specific handcrafted features, thus facilitating the learning task and improving the overall performance, though increasing computational times. In this case, most of the features are usually handcrafted for specific tasks (Zhang et al., 2019) and can be subdivided and classified into intrinsic and extrinsic, or also used for local and global descriptors (Han et al., 2018; Weinmann et al., 2015). The local features define the statistical properties of the local neighbourhood geometric information, while the global features describe the whole geometry of the point cloud. The most used properties are the local ones, such as eigenvalue-based descriptors, 3D shape context, etc. Nevertheless, we provide only common intrinsic features, in order to allow users to find the most appropriate combinations. The only features calculated are the normals. The point normals are computed in CloudCompare, most of the time with a plane local surface model, and oriented with a minimum spanning tree with Knn = 10. The orientation of the normals was then checked in MATLAB®. Hence, the point cloud structure is x, y, z, r, g, b, label, Nx, Ny, Nz.
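As an illustration of the resulting point structure, the following sketch parses one record with the ten fields listed above. The whitespace-separated text layout, the `ArchPoint` type, the `parse_point` helper and the sample values (including the label id) are assumptions; the actual file format distributed with the benchmark should be checked on the benchmark page.

```python
from typing import NamedTuple


class ArchPoint(NamedTuple):
    """One point with the fields x, y, z, r, g, b, label, Nx, Ny, Nz."""
    x: float
    y: float
    z: float
    r: int
    g: int
    b: int
    label: int
    nx: float
    ny: float
    nz: float


def parse_point(line):
    """Parse one whitespace-separated record (assumed layout)."""
    values = line.split()
    if len(values) != 10:
        raise ValueError(f"expected 10 fields, got {len(values)}")
    x, y, z = (float(v) for v in values[0:3])
    # Colours and labels may be written as floats by some exporters.
    r, g, b = (int(float(v)) for v in values[3:6])
    label = int(float(values[6]))
    nx, ny, nz = (float(v) for v in values[7:10])
    return ArchPoint(x, y, z, r, g, b, label, nx, ny, nz)


# Hypothetical record, for illustration only.
point = parse_point("1.250 0.480 3.020 182 176 160 5 0.000 -1.000 0.000")
```

Keeping the label between the colour and the normal components mirrors the field order stated in the text, so that networks using only a subset of the features can select columns by position.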

CLASS DEFINITION
Through the automatic recognition of architectural elements, the authors would like to support and speed up the process of reconstructing 3D geometries for HBIM models. In this context, it is essential to choose classes for our benchmark that are already available in object-oriented software or the underlying standards. In this way, the output labels of the neural network correspond exactly to the BIM categories and, once the geometry has been reconstructed, it will be possible to associate its information directly with the specific classes. In the current state of the art, some works have already associated semantics, based on taxonomies and ontologies, with heritage elements (Mallik and Chaudhury, 2012). By semantically organising the data, they can be managed with a common vocabulary; the subdivision into classes is therefore not arbitrary but objective and standardised, equal for all users and referring to an already codified lexicon. Thus, a unified method for the classification of the architectural elements was developed (Malinverni et al., 2019). The concept of Level of Detail (LOD) derives from the CityGML data model and allows an object to be described at different scales of representation, in which both the geometries represented and the information included range from the general to the particular. We have therefore applied this concept to the semantic segmentation of our point clouds: first, we tried to understand at which level of detail the point clouds are segmented and, subsequently, the corresponding classes were identified in the aforementioned standards. In CityGML, LOD 0 describes a regional and landscape scale, LOD 1 the region or city, LOD 2 the city districts, and LOD 3 and 4 the architectural models with outdoor and indoor elements, respectively.
If we consider some literature examples of point cloud classification in the geospatial field using NNs (Landrieu and Simonovsky, 2018; Hackel et al., 2017), we can assert that the level of detail reached so far is between LOD 1 and 2. Among the almost standard classes identified (i.e. vegetation, roads, buildings, etc.), the individual architectural elements are still missing.
A semantic annotation of the point clouds according to CityGML LOD 3/4 has therefore been defined. In particular, in CityGML, LOD 3 foresees the realisation of a detailed architectural model, and its scheme includes objects such as doors and windows. The classes identified are within "Feature"_Boundary Surface 'Floor', 'Roof' and 'Wall' and within "Feature"_Openings 'Window' and 'Door'. Regarding the IFC standard, the category that contains the architectural elements is IfcBuildingElement, a subclass of IfcElement. In this category, several architectural elements can describe a building, but only some of these are common in the DCH domain, while others are too specific to new construction or to a particular construction technique. The classes identified are, therefore, 'Column', 'Door', 'Roof', 'Stair', 'Wall' and 'Window', some of which are shared with the CityGML data model. Moreover, as the classes included in these two standards are not enough to properly describe a CH asset, the AAT (Art & Architecture Thesaurus) was consulted and, within the Architectural elements class and Structural elements category, the 'Vaults' and 'Arches' classes have been taken into account, whereas 'Moldings' have been selected from Surface elements. Following some studies and results of classification with 3D features (Grilli et al., 2019b), it was decided to change the classification proposed in (Malinverni et al., 2019; Pierdicca et al., 2020), separating the class of columns from half-pilasters and inserting the latter in the new class 'Moldings', which also contains cornices and eaves. To this end, 9 classes have been selected (Figure 3), plus another one defined as 'Other', containing all the points not belonging to the previous classes (e.g. paintings, altars, benches, statues, waterspouts...). These classes have been used for the point cloud labelling (Figure 4).
Nevertheless, a further extension of this scheme to a higher Level of Detail (LOD 4/5), to be exploited for Instance Segmentation, is planned. Interested readers can explore this topic further in (Mo et al., 2019).

Architectural Cultural Heritage point clouds for classification and semantic segmentation
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)

Guidelines for annotation
Once the classes were chosen, given the heterogeneity of the architectural elements, some guidelines were defined for the annotation of the point clouds. These guidelines for the dataset annotation allow other researchers to contribute to expanding the datasets (Figure 5). The dataset has been labelled with common point cloud processing software such as CloudCompare. However, on the benchmark page, an in-house web annotation tool built upon the Semantic-segmentation-editor web application is also available for users. Considering each class, excluding the standard ones of walls, floors, roofs and stairs, the guidelines followed for the annotation have been:
- Columns. In this class, only stand-alone columns or pillars have been inserted, both with circular and square sections. As mentioned above, the half-pilasters or half-columns leaning on the walls have been included in the Moldings class.
- Moldings. Stuccos and all other types of moldings, such as window, door or decorative moldings, have been included in this class, in addition to the previously cited half-pilasters and half-columns (Figure 5). More generally, everything that protrudes from the masonry falls into this class.
- Vaults. Every type of vault (barrel, cross, dome ...) has been included in this class. If individual vaults were separated by arches protruding from the vault surface, the vault annotation was interrupted at the arches; otherwise a single annotation was kept.
- Arches. This class includes both the arches on the facade and those that divide one vault from another, but only if they are jutting (Figure 6).
- Other. Everything that does not fall within the previous classes has been included here. This class has the sole purpose of retaining some architectural or furnishing elements (downpipes, benches, balustrades ...) which could be useful in the future and which, at the same time, help in the general understanding of the point cloud. For the training and test phases, it is recommended to exclude this class, as it could adversely affect the loss function and the general performance of the neural networks or of any other algorithms used.
Figure 6. Examples of arches (blue) at a different height from the vaults (orange).
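The recommendation to exclude the 'Other' class can be applied as a simple filtering step before computing the loss or the evaluation metrics. The sketch below is only one possible way to do so; the numeric id assigned to 'Other' (`OTHER_LABEL`) and the `drop_other` helper are hypothetical and must be adapted to the label ids actually used in the dataset.

```python
OTHER_LABEL = 9  # hypothetical id of the 'Other' class


def drop_other(labels, predictions, other_label=OTHER_LABEL):
    """Discard every point whose ground-truth label is 'Other'.

    Returns the filtered ground-truth labels and the predictions for the
    same points, so that both lists stay aligned.
    """
    kept = [
        (truth, pred)
        for truth, pred in zip(labels, predictions)
        if truth != other_label
    ]
    return [truth for truth, _ in kept], [pred for _, pred in kept]


# Hypothetical labels, for illustration only.
ground_truth = [0, 9, 5, 9, 7]
predicted = [0, 5, 5, 9, 2]
gt_kept, pred_kept = drop_other(ground_truth, predicted)
```

Filtering on the ground truth rather than on the predictions keeps the set of evaluated points independent of the method under test.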

AIMS OF THE BENCHMARK AND EVALUATION
The benchmark is available at http://archdataset.polito.it/ and is divided into two sections:
- the point clouds already labelled for the training phases;
- the point clouds for the testing/evaluation.
In this way, the proposed benchmark can be used to train and evaluate state-of-the-art and new classification/segmentation methods. Furthermore, users are free to choose the scenes useful for their purposes.
The benchmark activity will also offer an evaluation of the performance of the segmentation methods. If authors submit the predicted results for a given point cloud (ideally for all), we will automatically compare the achieved results with the ground truth and provide results in terms of Overall Accuracy, F1 Score, Precision, Recall and Intersection over Union (IoU). Currently, the performances of state-of-the-art point cloud semantic segmentation networks are reported for PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), PCNN (Atzmon et al., 2018), DGCNN and modified DGCNN (Pierdicca et al., 2020). These DNNs are evaluated on 11 scenes out of the 17 available.
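For reference, the listed metrics can all be derived from a confusion matrix between ground-truth and predicted labels. The following is a minimal sketch using the standard definitions; it is not the evaluation code of the benchmark, the function names are illustrative, and how the benchmark averages the per-class values is not specified here.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts points of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(y_true, y_pred):
        cm[truth][pred] += 1
    return cm


def _ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred, num_classes):
    """Overall Accuracy plus per-class Precision, Recall, F1 and IoU."""
    cm = confusion_matrix(y_true, y_pred, num_classes)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(num_classes))
    per_class = []
    for c in range(num_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(num_classes)) - tp
        fn = sum(cm[c]) - tp
        precision = _ratio(tp, tp + fp)
        recall = _ratio(tp, tp + fn)
        per_class.append({
            "precision": precision,
            "recall": recall,
            "f1": _ratio(2 * precision * recall, precision + recall),
            "iou": _ratio(tp, tp + fp + fn),
        })
    return {"overall_accuracy": _ratio(correct, total),
            "per_class": per_class}


# Toy labels with three classes, for illustration only.
truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
report = evaluate(truth, pred, num_classes=3)
```

In the toy example one point of class 0 is predicted as class 1, so class 0 loses recall while class 1 loses precision, and both obtain the same IoU.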
A critical issue to be mentioned is the balancing of the classes (Figure 7). In fact, some classes, in both the training and test sets, have a higher number of points than others, and this can negatively affect the performance of the network and the various metrics.
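The degree of imbalance of a labelled scene can be quantified before training with a short sketch such as the following. The helper names and the toy labels are illustrative assumptions and do not reproduce the class distribution of the benchmark.

```python
from collections import Counter


def class_distribution(labels):
    """Fraction of points belonging to each class, keyed by class id."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


def imbalance_ratio(labels):
    """Point count of the largest class divided by that of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels, for illustration only: class 0 dominates the scene.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
distribution = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
```

Inspecting these values for both the training and the test scenes helps to interpret per-class metrics, since rarely represented classes contribute few points to the Overall Accuracy.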

CONCLUSIONS
This paper describes the ArCH benchmark, conceived for 3D point cloud semantic segmentation. The platform provides researchers with millions of points, labelled according to a defined standard, together with a generalised evaluation framework. The dataset comprises both annotated and non-annotated point clouds, and we invite the research community to contribute to this difficult but essential task. Hopefully, in the upcoming months, the benchmark will become the reference source for testing and sharing new results and frameworks, with the goal of automating object recognition for complex architectures. Some previous studies have demonstrated that CNN methods offer reliable strategies for 3D CH data classification. However, it is fair to state that, unlike other research domains, CH still presents several bottlenecks, and no clearly superior method has emerged so far. By providing an open dataset and open source code, we expect to establish a baseline for future implementations as new algorithms are developed. The class balancing, the heterogeneity of the architectural elements and the complexity of the scenes are currently the main drawbacks and open issues. We are confident that the benchmark meets the needs of the research activities in the heritage field and will become a central resource for the development of new, efficient and accurate methods for the classification of 3D heritage. The benchmark will contribute to the body of knowledge on the semantic segmentation of CH goods through automatic, supervised learning-based methods.