AN EXTRACTION METHOD FOR ROOF POINT CLOUD OF ANCIENT BUILDING USING DEEP LEARNING FRAMEWORK

Chinese ancient architecture is a valuable heritage, and the roof in particular reflects a building's construction age, structural features and cultural connotation. Point cloud data, a flexible representation that can be acquired quickly, precisely and without contact, plays a crucial role in many applications for ancient architectural heritage, such as fine 3D reconstruction, HBIM and disaster monitoring. However, many data editing tasks still have to be carried out manually, which is time-consuming, labor-intensive and error-prone. In recent years, theoretical advances in deep learning have stimulated the development of many domains, and digital heritage is no exception. Deep learning algorithms, however, need a huge amount of labeled data for segmentation, so high labor costs are still incurred. In this paper, inspired by the similarity in architectural style between mimetic models and real buildings, we propose a deep learning method for automatically extracting the roof structure from point clouds. Firstly, to generate the real point cloud of Baoguang Temple, an Unmanned Aerial Vehicle (UAV) is used to obtain image collections that are subsequently processed by reconstruction technology. Secondly, a modified Dynamic Graph Convolutional Neural Network (DGCNN), which learns local features by means of an edge attention convolution, is trained using simulated data with additional geometric attributes. The mimetic data is sampled from the surfaces of 3DMAX models. Finally, we extract the roof structure of the ancient building from real point cloud scenes using the trained model. The experimental results show that the proposed method can extract the rooftop structure from the real scene of Baoguang Temple, which illustrates not only the effectiveness of the approach but also the potential value of simulated sources when real point cloud datasets are scarce.


1.INTRODUCTION
Ancient architecture, a crucial component of Chinese cultural heritage, has a unique architectural system, artistic value and significance for cultural research (Li, 2003; Li, 2017). In recent years, documentation using digital technologies such as scanning and photogrammetry has become a hot topic in the cultural heritage domain (Vecco, 2010; Yin, 2013), attracting a large number of researchers. 3D reconstruction provides an important means for the protection, restoration and management of ancient architectural heritage and has been widely applied in preservation engineering and research. This is because point clouds have advantages in representing geometric objects. By describing continuous real objects through discrete points in Euclidean space, point clouds can be regarded as a surface sampling of three-dimensional geometry with high fidelity and fast acquisition. In general, however, the points are disordered and unstructured, without semantic information suitable for human understanding. For this reason, researchers in HBIM (Historical Building Information Modeling) have turned to the problem of transforming 3D models from a geometric representation into an informative one. Conventional scan-to-BIM processes, which generate a parametric 3D model from a point cloud (Capone, 2019), are carried out manually by domain experts and therefore cost a great deal of time and manpower. Although a few semi-automatic modelling methods have been developed to improve efficiency and user-friendliness, converting point cloud data to BIM still involves tedious editing tasks. Segmenting point clouds of complex scenes can establish semantic information for ancient roofs. It is a basic step for further analysis of component parameters (Liu, 2010), restoration of internal relations between components (Ren, 2018), and automatic 3D reconstruction of ancient buildings (Hu, 2021).

* Corresponding author. E-mail: dongyouqiang@bucea.edu.cn (Youqiang Dong)
With the development of computer science and geomatics, algorithms for point cloud segmentation have evolved through a process that can be divided into two stages and three types according to their principles. The two stages are point cloud region segmentation and semantic segmentation. The former groups unordered points into mutually exclusive subsets suitable for customized tasks. It depends on predefined thresholds, rules and conditions, and has produced classic and reliable algorithms such as region growing, clustering and RANSAC. However, these subsets usually describe points, lines, planes and other basic geometric primitives of the scene rather than semantic objects as understood by humans. The latter relies on learning algorithms, namely machine learning (ML) and deep learning (DL). In the literature, ML has been introduced to process LiDAR point clouds collected by sensors mounted on aircraft or vehicles, a task also called point labeling. Point labeling assigns each point, represented by a set of feature attributes, a specific label by means of trained ML classifiers. The performance of this method therefore depends heavily on feature design and classifier selection. In the 2012 ImageNet visual recognition challenge, AlexNet brought artificial intelligence back into the spotlight with its outstanding results. Soon afterwards, deep neural network theory, open source frameworks and related technology developed steadily. Deep neural networks are now able to handle unstructured 3D data for tasks including point cloud classification, part segmentation and semantic (point-wise) segmentation. These methods can be divided into two categories: indirect methods, such as multi-view and voxel-based methods, and direct methods, including point ordering and graph-based methods.
Direct methods consume points directly without transforming the data into another form, which avoids the loss of useful information and has attracted widespread research interest. The pioneering works are PointNet and PointNet++, and one of the state-of-the-art methods is DGCNN (Wang, 2019); together they have motivated many frameworks built from various perspectives. These works provide a solution for point cloud semantic segmentation, and researchers in the cultural heritage (CH) domain have begun to use ML and DL to decompose heritage elements automatically. Recently, a benchmark for large-scale heritage point cloud semantic segmentation was released, promoting studies on point cloud semantic segmentation for cultural heritage such as churches. Establishing open datasets requires an immense amount of manual labelling, but their significance for the CH domain should not be underestimated. Nevertheless, the complex structure of Chinese ancient buildings not only makes their digitization difficult but also challenges the subsequent labeling work, so relatively little data is available to feed into deep neural networks. In this paper, we propose a method for extracting ancient roofs from real point cloud data using a deep learning model trained on simulated data. First, the ancient building Baoguang Temple (Fig. 1) is documented by UAV and automatic reconstruction to obtain a real point cloud. Then, an improved DGCNN model with edge attention convolution is trained on a mimetic dataset to extract the roof structure. The experiment gives good results, which not only demonstrate the effectiveness of the proposed method but also reveal the potential of simulated data.

2.RELATED WORK
In the past several years, applications of and research on DL technology in the CH field have developed rapidly. Driven by advances in DL, studies on semantic segmentation of CH site elements have emerged with encouraging performance. The first point cloud dataset for the CH domain, the Architectural Cultural Heritage (ArCH) dataset, has been released. Many of its point cloud scenes are part of the UNESCO World Heritage List, including the chapel of the Strasbourg Cathedral inside the Grande Île, the courtroom of the Valentino's Castle (VAL), the Sacro Monte of Varallo (SMV) and Ghiffa (SMG), part of the wider site of "Sacri Monti of Piedmont and Lombardy", St. Pierre church located inside the Neustadt, and the porticoes of Bologna. Its authors introduce data pre-processing consisting of spatial translation, subsampling and normal calculation so that the data are suitable for feeding into deep neural networks. This fundamental contribution considers many categories of elements, such as wall, floor, roof and column. The dataset is available at http://archdataset.polito.it/. Another work proposes a DL framework for point cloud segmentation on the ArCH dataset. It selects meaningful features, consisting of HSV color, normals and Euclidean coordinates, to generate a 12-dimensional feature vector representing each point. Inspired by DGCNN, the authors modify the first EdgeConv layer by enriching the attributes used to calculate pair-wise distances, so that the local graph of each center point is constructed not in Euclidean space but in a feature space that also includes color and normals. This improvement achieved state-of-the-art results. Malinverni (2019) tests another approach for CH segmentation, in which PointNet++ is trained and reaches significant results for classifying and segmenting 3D point clouds. In addition to color and normals, the authors introduce reflectance to enhance the separability of different materials. The annotated scene is described by 10 classes. Following ScanNet (Dai, 2017) as closely as possible, the original dataset is pre-processed by translation, scaling and subsampling. PointNet++ then partitions the input points into overlapping regions in order to extract local features from neighborhoods. Similar to convolution in 2D image processing, such local features are further grouped into larger units to produce higher-level features. However, the results reveal a challenge: performance is not as good as on indoor scenes because of the complexity of training scenes in the CH domain. Because geometric features are designed from mathematical principles, the calculated attributes can, to some degree, characterize points, which makes combining these features with ML classifiers a feasible approach.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVI-M-1-2021 28th CIPA Symposium "Great Learning & Digital Emotion", 28 August-1 September 2021, Beijing, China

3.MATERIALS AND METHOD
The study area and data come from a digital protection project for Chinese ancient buildings. A complete workflow from point cloud acquisition to segmentation results was carried out, as shown in Fig. 2. It consists of four parts: (1) automatic real-scene reconstruction of Baoguang Temple relying on UAV equipment and multi-view geometry; (2) preparation of training datasets derived from an open 3DMAX model library; (3) data pre-processing; (4) roof extraction supported by the modified DGCNN framework.

Point Cloud of Baoguang Temple
Baoguang Temple, located in Qinghai Province, was built during the Yongle reign of the Ming Dynasty and has been listed as a key cultural relic protection unit in China. As the most typical official-style architecture of the early Ming Dynasty, it is large in scale and well preserved. To support its planning, protection and management, we established long-term documentation. Many surveys were carried out to document the internal and external structures, colored drawings and pattern information. For extracting the roof structure, however, only the external point cloud is needed. For this reason, a DJI UAV was used to capture digital images of Baoguang Temple. We set three flight paths: two circled the building at different tilt angles, and one captured vertical images along a plane. In total, 226 oblique images were recorded around the building in accordance with the criteria and technical standards of photogrammetric operation. All images were then imported into the commercial software Agisoft Photoscan for processing. At this stage, Structure from Motion (SfM) is used to analyze the image collection and recover the positions of the UAV camera (Fig. 3), and a dense multi-view stereo reconstruction is then carried out to create a high-resolution point cloud of the real scene. The generated point cloud covers Baoguang Temple and its surrounding buildings, so we tested on the area of the building group. In addition, point cloud blocks floating above the scene were removed manually.
Fig. 3 Data acquisition and processing for Baoguang Temple: (a) UAV imagery; (b) camera attitude recovery; (c) dense point cloud generation.

Simulated Data of Ancient Architecture
Training a deep neural network for semantic-level segmentation of CH elements requires a large amount of manually annotated data, and the annotation normally requires professional knowledge. Fortunately, plentiful 3D models have been produced over the past decades of CH digitization, and their components are conveniently grouped by the modelling software. If there is a definite relationship between these existing 3D models and real scanned scenes, a large number of such models will help neural networks learn the features of architectural structures in real point clouds. These abundant model resources would not only greatly reduce the cost of data annotation but also bring opportunities for the semantic segmentation of ancient buildings. This conjecture is inspired by the similarity between existing models and real architecture. Assuming that researchers construct ancient building models according to a certain architectural style, the local components of a model should be similar in geometric shape to the actual ones. From this perspective, when similar geometries in Euclidean space are projected into a high-dimensional feature space, their projections will be close to each other, so a neural network that learns from simulated point clouds should be able to perform inference on scanned points. For this reason, the simulated point cloud we adopt is derived from an open source library that includes more than fifty 3DMAX models of the Forbidden City. These Chinese royal buildings are similar to Baoguang Temple in architectural style. We select three of them and sample their model surfaces after grouping by category in the 3DMAX software. As shown in Fig. 4 (a), (b) and (c), the three chosen 3DMAX models are the rooms of Lijing, Taiji and Tongdao.
Each model is divided into two parts, roof and other, using the grouping function of the software. The models are then exported in OBJ format, still separated into the two categories. We convert them from OBJ objects to point clouds using the sampling function provided in the CloudCompare (CC) software. All parameters are kept at their defaults in this step, and the point cloud in Fig. 4 (d) is thinned to make the visualization clear.
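The conversion from a labelled mesh to a labelled point cloud can be illustrated with a minimal sketch. It is not the CloudCompare implementation: it assumes plain area-weighted random sampling of triangles, and the function name `sample_mesh_surface`, the toy square mesh and the label values are hypothetical.

```python
import numpy as np

def sample_mesh_surface(vertices, faces, n_points, rng=None):
    """Draw n_points uniformly by area from a triangle mesh surface."""
    rng = np.random.default_rng(rng)
    tri = vertices[faces]                                  # (F, 3, 3)
    # Triangle areas from the cross product of two edge vectors.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick triangles with probability proportional to their area.
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    r1 = np.sqrt(rng.random(n_points))[:, None]
    r2 = rng.random(n_points)[:, None]
    a, b, c = tri[idx, 0], tri[idx, 1], tri[idx, 2]
    points = (1 - r1) * a + r1 * (1 - r2) * b + r1 * r2 * c
    return points, idx

# Toy mesh: a unit square made of two triangles, one per category.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
face_labels = np.array([1, 0])          # assumed coding: 1 = roof, 0 = other
points, face_idx = sample_mesh_surface(vertices, faces, 1000, rng=0)
point_labels = face_labels[face_idx]    # each point inherits its face label
```

Because every sampled point inherits the category of the face it was drawn from, the roof/other grouping made in the modelling software carries over to the point cloud without manual annotation.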

Data pre-processing
Because the three 3DMAX models are located in an independent coordinate system that lacks scale information, we compare the column diameter with the real value recorded in historical documents to determine the scaling factor, which is calculated as 0.048314. We multiply all coordinates of the three rooms by this scaling factor. The mimetic scenes of the three rooms are strictly aligned with the coordinate axes, but the real Baoguang Temple, which is measured in a geodetic coordinate system, needs to be manually aligned in CC for the convenience of subsequent steps. In addition, the whole point cloud must be shifted to the origin, which is done automatically by our processing algorithm. We extract the 1st, 2nd and 3rd eigenvalues of the covariance matrix as geometric features of the point cloud, using a neighborhood radius of three times the default parameter in CC. The neighborhood radii for the simulated data and the real point cloud are set to 0.042 and 0.120, respectively. Polar coordinates are computed from the rectangular coordinates by the coordinate conversion formula, and the polar radius is normalized. Finally, we use a 12-dimensional feature vector to represent each point, consisting of centralized coordinates, normalized coordinates, polar coordinates and geometric features. Unlike some works in the literature, we neither subsample the point cloud of the real scene nor use color information. This means that we expect the neural network to learn purely geometric information.
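The pre-processing steps can be sketched as follows. This is a minimal illustration, not the processing code used in the paper. Several details are assumptions: min-max scaling for the normalized coordinates, a spherical convention (azimuth θ, polar angle φ) for the polar coordinates, division by the maximum radius for normalizing ρ, and a brute-force radius search that is only practical for small clouds. The function name `build_features` and the toy radius are hypothetical.

```python
import numpy as np

def build_features(xyz, radius, scale=1.0):
    """Return an (N, 12) array: centered XYZ, normalized XYZ,
    (rho, theta, phi), and covariance eigenvalues in descending order."""
    xyz = np.asarray(xyz, dtype=float) * scale
    centered = xyz - xyz.mean(axis=0)                 # shift cloud to the origin
    span = xyz.max(axis=0) - xyz.min(axis=0)
    normalized = (xyz - xyz.min(axis=0)) / np.where(span > 0, span, 1.0)

    x, y, z = centered.T
    rho = np.linalg.norm(centered, axis=1)
    theta = np.arctan2(y, x)                          # azimuth (assumed convention)
    phi = np.arccos(np.clip(z / np.where(rho > 0, rho, 1.0), -1.0, 1.0))
    rho_n = rho / max(rho.max(), 1e-12)               # normalized polar radius

    # Brute-force radius neighborhoods: O(N^2) memory, small clouds only.
    d2 = ((centered[:, None, :] - centered[None, :, :]) ** 2).sum(-1)
    eig = np.zeros((len(xyz), 3))
    for i in range(len(xyz)):
        nbrs = centered[d2[i] <= radius ** 2]
        if len(nbrs) >= 3:                            # too few points: keep zeros
            cov = np.cov(nbrs.T)                      # 3 x 3 covariance matrix
            eig[i] = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)

    polar = np.stack([rho_n, theta, phi], axis=1)
    return np.hstack([centered, normalized, polar, eig])

# Toy cloud in model units, scaled with the factor reported in the text.
rng = np.random.default_rng(0)
cloud = rng.random((200, 3)) * 10.0
features = build_features(cloud, radius=0.15, scale=0.048314)  # toy radius
```

The radius of 0.15 is chosen only so that the sparse toy cloud has enough neighbors per point; the radii reported above (0.042 and 0.120) apply to the much denser clouds used in this work.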

Improved DGCNN for Roof Extraction
DGCNN, an outstanding work in the computer vision field, can directly process unstructured point clouds, learning deep features from 3D points for tasks such as point cloud classification, part segmentation and semantic segmentation with good performance. The idea of DGCNN has something in common with traditional point representation: both use a group of local points to describe a set of attributes of a single point. The difference is that the traditional approach calculates features for the group of nearest points by a predetermined mathematical operation, whereas DGCNN lets a deep neural network learn features from them adaptively. More precisely, DGCNN uses k-nearest neighbors (KNN) to find a group of nearest points for each point, and the difference between each neighbor and the center forms a connecting edge for the pair of points, so that the center and its neighbors constitute a directed graph. Each edge thus carries attributes, which are concatenated with the attributes of its center point to form the edge features. A multilayer perceptron (MLP) extracts deep features from every edge, and max pooling then selects the dominant features across all edges, dimension by dimension, mapping each point into a high-dimensional space. This operation is called edge convolution (EdgeConv). The input of each layer is the output of the previous layer. The graph of the first layer is constructed from points in three-dimensional space, but the outputs of later layers become the input of the next layer, so the graph structure is rebuilt dynamically in the high-dimensional feature space.
Although DGCNN can learn features from graph structures, the number of neurons in the MLP is chosen empirically. In other words, the generated high-dimensional attributes should carry different weights in different dimensions instead of sharing a common weight.
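The EdgeConv operation described above can be sketched in a few lines of NumPy. This is a forward-pass illustration under stated assumptions, not the DGCNN reference implementation: the shared MLP is reduced to a single linear layer with ReLU, the weights are random rather than trained, the neighbor search is brute force and excludes the point itself, and the names `knn_indices` and `edge_conv` are hypothetical.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of every point (self excluded)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]              # (N, k)

def edge_conv(feats, k, weight, bias):
    """One EdgeConv layer: kNN graph, edge features, shared MLP, max pooling."""
    idx = knn_indices(feats, k)
    neighbors = feats[idx]                            # (N, k, C)
    center = np.repeat(feats[:, None, :], k, axis=1)  # (N, k, C)
    # Edge feature: center attributes concatenated with (neighbor - center).
    edge = np.concatenate([center, neighbors - center], axis=-1)  # (N, k, 2C)
    hidden = np.maximum(edge @ weight + bias, 0.0)    # shared linear layer + ReLU
    return hidden.max(axis=1)                         # max over the k edges

rng = np.random.default_rng(0)
points = rng.random((128, 3))
w1, b1 = rng.normal(size=(6, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=(32, 32)) * 0.1, np.zeros(32)

h1 = edge_conv(points, k=8, weight=w1, bias=b1)   # graph built in 3D space
h2 = edge_conv(h1, k=8, weight=w2, bias=b2)       # graph rebuilt in feature space
```

The second call builds its neighbor graph from `h1` rather than from the 3D coordinates, which is the sense in which the graph is dynamic.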
In our work, we propose a modified DGCNN based on edge attention convolution (EdgeAttentionConv), which adaptively readjusts the attributes of high-dimensional points by assigning weights generated from a global descriptor. As shown for EdgeAttentionConv in Fig. 5, we apply two global pooling functions, max and average, to obtain descriptors of the feature distribution for each attribute channel. The two descriptors are then integrated by addition. A fully connected network (FCN) is trained to learn attention weights from these descriptors, followed by a sigmoid activation function that maps them to values between 0 and 1. The attention weights are then assigned back to the channels of the high-dimensional points, redistributing the internal importance of the attributes for every point. The structure of the neural network is shown in Fig. 6, and we feed it the 12-dimensional features: (1) centralized coordinates X, Y, Z; (2) normalized coordinates X', Y', Z'; (3) normalized polar coordinates ρ, θ, φ; (4) eigenvalues λ1, λ2, λ3 of the point cloud covariance matrix.
From the features fed into the input layer, a local graph is constructed based on the normalized coordinates and processed by three EdgeAttentionConv layers, in which the dynamic graph is updated. We fuse the local features of the input points extracted by the three EdgeAttentionConv layers to generate global information, which is replicated and concatenated to each point. Finally, we adopt an MLP and a dropout layer to output segmentation scores, and each point is classified as a roof point or a non-roof point.
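The channel reweighting step of EdgeAttentionConv can be illustrated with a small NumPy sketch. It follows the order given in the text (global max and average pooling, addition, fully connected network, sigmoid, reweighting), but the two-layer bottleneck shape of the fully connected network, the random weights and the function name `channel_attention` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, w1, b1, w2, b2):
    """Reweight the channels of per-point features of shape (N, C)."""
    # Global descriptors per channel: max pooling plus average pooling.
    descriptor = feats.max(axis=0) + feats.mean(axis=0)   # (C,)
    # Fully connected network (assumed: one hidden ReLU layer).
    hidden = np.maximum(descriptor @ w1 + b1, 0.0)
    weights = sigmoid(hidden @ w2 + b2)                   # (C,), values in (0, 1)
    # Assign the attention weights back to every point's channels.
    return feats * weights, weights

rng = np.random.default_rng(0)
feats = rng.random((1024, 64))            # e.g. output of an edge convolution
w1, b1 = rng.normal(size=(64, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=(16, 64)) * 0.1, np.zeros(64)
reweighted, attn = channel_attention(feats, w1, b1, w2, b2)
```

All points share the same weight vector, so the operation changes the relative importance of channels without altering which points are neighbors in a given channel.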

4.EXPERIMENT RESULT
In the experiments, we compare four networks: PointNet, DGCNN, LDGCNN and our modified DGCNN. The ADAM optimizer is used to train the networks from scratch with an initial learning rate of 0.0001, a learning rate decay rate of 0.35, and a batch size of 6 with 4096 sampled points. As mentioned above, we use only the three simulated rooms, Lijing, Taiji and Tongdao, as the training dataset, while the real scene of Baoguang Temple is used for testing the Overall Accuracy (OA) and the Intersection over Union (IOU). Fig.5 shows the distribution of roof and non-roof points in the different scenes. Training lasted for 25 epochs, and the model with the highest OA was kept for testing and for calculating the accuracy indices.
To verify the accuracy of the algorithm, we annotated the real point cloud, as shown in Figure 7(a). The extraction result and accuracy for the Baoguang Temple roof are shown in Figure 7(b) and Tab 1, respectively. Our method extracts not only the roof of the main building but also the roof structures of the surrounding building group. The top of the wall in front of Baoguang Temple is also extracted because its local structure is similar to that of a roof. Some ground and vegetation points behind the building were misidentified as roof points because their height values are approximately equal to those of roof points. Moreover, this preliminary work does not process the elevation values in the pre-processing phase, so the elevation of a point contributes greatly to the predicted segmentation score. Such unprocessed elevation values are well suited to the semantic segmentation of single buildings over a small area, but for large groups of ancient buildings, especially in areas where the terrain changes dramatically, the negative effect of absolute elevation cannot be ignored. Finally, the segmentation accuracy of our method reaches 87.14%, which is 2.08 percentage points higher than the 85.06% of the original DGCNN, and its mIOU is the best, at 77.05%. Therefore, the proposed method shows good robustness and promising application prospects for extracting point clouds of ancient building roofs, and it confirms that a neural network trained with simulated data can be applied to the segmentation of real point clouds of similar style.
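For reference, the two accuracy indices can be computed from per-point labels as in the following sketch. The label arrays are toy values, not the results of this experiment, and the function names and the coding 1 = roof, 0 = non-roof are assumptions.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of points whose predicted label equals the annotation."""
    return float(np.mean(y_true == y_pred))

def iou_per_class(y_true, y_pred, n_classes=2):
    """Intersection over Union for each class; NaN if a class never occurs."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((y_true == c) & (y_pred == c))
        union = np.sum((y_true == c) | (y_pred == c))
        ious.append(inter / union if union else float("nan"))
    return np.array(ious)

# Toy labels: 1 = roof, 0 = non-roof.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])

oa = overall_accuracy(y_true, y_pred)      # 6 of 8 points correct -> 0.75
ious = iou_per_class(y_true, y_pred)       # 3/5 for each class -> [0.6, 0.6]
miou = float(np.nanmean(ious))             # mean over classes -> 0.6
```

OA can be dominated by the larger class when roof and non-roof points are unbalanced, which is why the class-averaged mIOU is reported alongside it.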

5.CONCLUSIONS
This work shows that a deep neural network trained only on simulated point clouds, selected on the basis of style similarity, can be successfully applied to a real point cloud scene. We extract the roof structure from a group of buildings using an improved DGCNN model that adaptively readjusts the weights of the local graph constructed from neighboring points and is trained only with point clouds sampled from three 3DMAX models; this approach is effective without any transfer learning technique. The large number of existing simulation models is therefore of great significance for CH semantic segmentation. It should also be noted, however, that if we regard a point cloud as a sample of shapes in space, a key factor in the difference between simulated points and real scene points is point density, which determines the local attributes of a point. For this reason, from the perspective of local features, robustness to uneven density is particularly important, because density differs not only between the two sources but also within a single real scene. From the perspective of global features, a block of points can capture the characteristics of the local distribution, but the lack of information about the distribution of the whole structure leads to segmentation confusion. Building on the contributions of other researchers, these challenges will be addressed in our future work.