Deep Learning for 3D Building Reconstruction: A Review

3D building reconstruction using Earth Observation (EO) data (aerial and satellite imagery, point clouds, etc.) is an important and active research topic in different fields, such as photogrammetry, remote sensing, computer vision and Geographic Information Systems (GIS). Nowadays, 3D city models have become an essential part of 3D GIS environments and they can be used in many applications and analyses in urban areas. The conventional 3D building reconstruction methods depend heavily on the data quality and source, and manual effort is still needed for generating the object models. Several tasks in photogrammetry and remote sensing, such as image segmentation, classification, and 3D reconstruction, have been revolutionized by deep learning (DL) methods. In this study, we provide a review of the state-of-the-art machine learning methods, and in particular the DL methods, for 3D building reconstruction for the purpose of city modelling using EO data. This is the first review with a focus on object model generation based on the DL methods and EO data. A brief overview of the recent building reconstruction studies with DL is also given. We have investigated different DL architectures, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), and the combinations of conventional approaches with DL, and reported their advantages and disadvantages. An outlook on the future developments of 3D building modelling based on DL is also presented.


INTRODUCTION
Although 3D city models were initially used for visualization purposes, they have been increasingly utilized in a variety of domains and tasks, such as collaborative urban planning, population density analysis, mobile telecommunication applications, solar potential assessments, disaster management, 3D navigation, and environmental simulations (Biljecki et al., 2015). Researchers have been investigating methods for the automatic 3D reconstruction of buildings from Earth Observation (EO) data and their modelling for three decades (Haala and Kada, 2010). Traditional methods involve several manual steps in 3D reconstruction, including image pre-processing, 3D point cloud extraction, data fusion, and texture mapping. Thus, cumulative errors occur in the process and cause inaccurate semantic features in the reconstruction of 3D shapes that seriously affect their quality (Liu et al., 2021). Thanks to the availability of benchmark datasets, such as the airborne images and laser scanner data by ISPRS WGIII/4 (Rottensteiner et al., 2012; Rottensteiner et al., 2014), it has been possible to comparatively evaluate various methods for the segmentation of urban objects and also the 3D reconstruction of buildings.
In the last decades, city models were produced either manually by photogrammetry operators from aerial imagery or by using conventional methods (non-machine learning methods). The conventional 3D building reconstruction methods can be categorized as model-driven and data-driven methods. The model-driven methods aim to match the geometry of the roof generated from digital surface models (DSMs), e.g. point clouds, with the roof types in a library (Henn et al., 2013). Using this approach, it can be ensured that the reconstructed roof model is topologically correct, but problems may occur if there is no candidate for the roof shape in the library. Moreover, model-driven methods utilize a limited number of pre-defined shapes given in the model libraries, which reduces the production accuracy. In addition, complex roof structures may not be modelled. In data-driven methods, a DSM (often in the form of a point cloud) is utilized as the primary data source and the models are generated from this data as a whole without focusing on any particular parameter. In the data-driven approach, the main problem is that the extracted segments may not be intersected successfully, leading to topological or geometrical errors. The data-driven methods are usually not robust and are highly sensitive to noise in the data. Because of this noise sensitivity, data pre-processing is an essential step to avoid incorrect results.
The way the geospatial domain operates is changing fundamentally as a result of Artificial Intelligence (AI) (Döllner, 2020). Deep learning (DL) methods, in particular Convolutional Neural Networks (CNNs), have been game changers for several tasks related to photogrammetry and remote sensing in recent years. The recently developed DL methods have the potential to overcome the limitations of conventional 3D city modelling and building reconstruction methods.
3D building reconstruction using DL is a relatively new research area that has been studied only in recent years, with few publications on this topic. The DL approaches achieved state-of-the-art results in classification, segmentation, and change detection using EO data when compared to the conventional methods, and there are already published review articles on these topics (Ma et al., 2019; Hoeser and Kuenzer, 2020; Heipke and Rottensteiner, 2020). As DL gains popularity in different fields, new areas of application will emerge in the future. DL-based 3D reconstruction has become increasingly feasible with the rapid development of 3D building models and the availability of many different 3D shapes in recent years. DL models can be trained to learn 3D shapes and their features, characteristics and details. Wichmann et al. (2018) presented RoofN3D, a new 3D point cloud training dataset that can be used to train machine learning (ML) and DL models for a variety of tasks in the context of 3D building reconstruction. An overview of the timeline of the development of commonly used ML algorithms and the DL methods is given in Figure 1.
Figure 1. Timeline of the development of commonly used ML algorithms and DL methods (Cao et al., 2018). NN: neural network; BP: backpropagation; DBN: deep belief network; SVM: support vector machines; AE: auto-encoder; VAE: variational AE; GAN: generative adversarial network; XGBoost: Extreme Gradient Boost; WGAN: Wasserstein GAN.
In this paper, based on the recent advancements in the field and considering the increasing interest in the community, we aim at providing an overview of methods and state-of-the-art applications for DL-based 3D building reconstruction with a focus on object generation. The remainder of this paper is structured as follows: Section 2 gives a brief overview of conventional (non-DL) 3D building reconstruction methods. Section 3 presents a summary of ML methods used in 3D building reconstruction. The state-of-the-art DL studies are presented by method type in Section 4. Discussions and conclusions are provided in the final section together with future directions.

CONVENTIONAL METHODS
3D building reconstruction is still largely based on conventional methods and algorithms (i.e., non-DL-based). Two comprehensive reviews on conventional urban reconstruction methods were presented by Musialski et al. (2013) and Haala and Kada (2010). Sub-surface growing is an example of a conventional method for 3D building reconstruction (Kada and Wichmann, 2012). Nan and Wonka (2017) proposed a data-driven method, Polyfit, for reconstructing lightweight polygonal surfaces from point clouds. By combining the Random Sample Consensus (RANSAC) method (Fischler and Bolles, 1981) with contextual knowledge, Malihi et al. (2018) developed a novel two-level segmentation scheme for generating 3D building models from point clouds derived from unmanned aerial vehicle (UAV) photogrammetry. LoD1 building models can be automatically constructed with the combination of 2D building footprints and digital surface models (DSMs) (Buyukdemircioglu and Kocaman, 2018). Model-driven approaches can also be used for semi-automatic reconstruction of 3D city models in LoD2 using stereo aerial imagery (Buyukdemircioglu et al., 2018), as depicted in Figure 2.
3D building models can be generated from different EO data types such as aerial imagery, UAV imagery, satellite imagery, or point clouds using conventional approaches such as rule-based methods (Xie et al., 2021), model-driven methods, or data-driven methods. One of the fastest and most widely used reconstruction methods for 3D city models is to extrude building footprints. Another automatic 3D building reconstruction method in LoD2, based on half-spaces, was proposed by Bizjak et al. (2021). Their algorithm performed reconstruction on the ISPRS benchmark dataset with an RMSE of 1.31 m and a completeness of 98.9%. Drešček et al. (2020) presented an approach for 3D building reconstruction using a UAV photogrammetric point cloud based on an extract, transform, load (ETL) solution. A data-driven and algorithmic solution to the automatic reconstruction of 3D buildings at LoD2 from UAV point clouds was presented by Murtiyoso et al. (2020).
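The footprint extrusion used for LoD1 models can be illustrated with a short sketch. The Python code below is only a minimal illustration and not the implementation of any cited study: the function names `extrude_footprint` and `lod1_from_dsm`, the choice of the median DSM height inside the footprint as the roof height, and the single flat ground height are assumptions made for this example.

```python
from statistics import median


def extrude_footprint(footprint, roof_height, ground_height=0.0):
    """Extrude a 2D footprint into a flat-roofed block (LoD1-style solid).

    footprint: list of (x, y) vertices in counter-clockwise order, not closed.
    Returns (vertices, faces), where each face is a list of vertex indices.
    """
    n = len(footprint)
    if n < 3:
        raise ValueError("a footprint needs at least three vertices")
    bottom = [(x, y, ground_height) for x, y in footprint]
    top = [(x, y, roof_height) for x, y in footprint]
    vertices = bottom + top
    faces = [list(range(n - 1, -1, -1))]   # floor, reversed so its normal points down
    faces.append(list(range(n, 2 * n)))    # roof, normal points up
    for i in range(n):                      # one outward-facing quad per wall
        j = (i + 1) % n
        faces.append([i, j, n + j, n + i])
    return vertices, faces


def lod1_from_dsm(footprint, dsm_heights_inside, ground_height):
    """Hypothetical helper: take the median of the DSM samples that fall inside
    the footprint as the roof height (robust to a few outliers), then extrude."""
    roof_height = median(dsm_heights_inside)
    return extrude_footprint(footprint, roof_height, ground_height)


# Example: a 4 m x 4 m footprint, five DSM samples (one outlier), ground at 2 m.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
vertices, faces = lod1_from_dsm(square, [10.2, 10.0, 9.8, 25.0, 10.1], ground_height=2.0)
```

For a footprint with n vertices, the sketch yields 2n vertices and n + 2 faces; real pipelines would additionally handle holes, terrain variation and georeferencing.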
RANSAC has been a popular method for processing 3D point clouds. Li and Wu (2020) generated 3D models of complex buildings automatically using incomplete point clouds with RANSAC and topological-relation constraints. A RANSAC-based multi-primitive reconstruction (MPR) method was proposed by Li and Shan (2022) to segment a compound boundary into predefined primitives and determine their parameter values from the point clouds.
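To make the role of RANSAC in point cloud processing concrete, the following sketch fits a single plane (for example one roof face) to a noisy set of 3D points. It is a generic, simplified RANSAC loop written for illustration and is not the method of any of the studies cited above; the function names, the distance threshold, and the iteration count are assumptions.

```python
import random


def plane_from_points(p, q, r):
    """Return (a, b, c, d) of the plane a*x + b*y + c*z + d = 0 through p, q, r
    with a unit normal, or None if the three points are (nearly) collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p[0] + b * p[1] + c * p[2])


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Repeatedly fit a plane to three random points and keep the plane
    supported by the most points within `threshold` (same unit as the data)."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate sample, draw again
        a, b, c, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

In a building reconstruction setting, the loop would typically be run repeatedly, removing the inliers of each detected plane before searching for the next roof face; a final least-squares refit on the inliers is a common refinement that is omitted here.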
Automatic generation of highly detailed LoD3 models is still a challenging topic for researchers. These models can be generated manually and combined with automatically generated 3D city models in different LoDs (Buyukdemircioglu and Kocaman, 2020). An automatic workflow for reconstructing 3D building models in LoD1 and LoD2 based on 2D building footprints and LiDAR (Light Detection and Ranging) point clouds was developed by Peters et al. (2021). Their approach was used to reconstruct 10 million buildings in the Netherlands.

MACHINE LEARNING METHODS
In this section, an overview of the ML methods other than the DL-based methods (e.g., random forests, SVM, etc.) used in the literature for reconstructing 3D building models is given. Dehbi et al. (2016) developed weighted attribute context-free grammar rules for 3D building reconstruction. The weighted context-free grammar was inferred using SVMs from input-output pairs as structured data, and Markov Logic Networks (MLNs) were used as a statistical relational learning method to reconstruct the 3D buildings.
Biljecki et al. (2017) have shown that building models can also be automatically reconstructed without elevation data by using the random forest method. The proposed method predicts the height of buildings based on the footprints and building attributes and then extrudes the footprints to generate 3D models. As a result, they reached a mean absolute error of 0.8 m in the inferred heights. Biljecki and Dehbi (2019) demonstrated that it is possible to predict the roof types from lower LoD (i.e., LoD0 and LoD1) datasets and to generate LoD2 models without roof measurements. They achieved an accuracy of 85% in predicting the roof type from sparse data using multiclass classification and an accuracy of 92% in predicting whether a roof is flat or not. Park and Guldmann (2019) reconstructed LoD1+ building models with an ML-based point cloud classification methodology that assigns LiDAR points to different classes, extracts the points reflecting a rooftop surface, and uses those points to estimate building heights. The ML methods can also be used to reconstruct multi-temporal (4D) city models. Farella et al. (2021) presented a methodology for reconstructing buildings in 4D using ML algorithms and historical information. Using digitized historical city maps and information about actual city conditions, they were able to reconstruct multi-temporal 3D representations of two urban city centres using different regression algorithms for inferring missing building heights.

DEEP LEARNING METHODS
Through DL, computational models consisting of multiple processing layers can learn representations of data at multiple abstraction levels (LeCun et al., 2015). According to Liu et al. (2021), there are two main problems with conventional 3D reconstruction methods. First, they involve multiple manual design steps that can lead to the accumulation of errors, and they can hardly learn semantic features of 3D shapes automatically. Secondly, they are highly dependent on the quality and content of images, as well as on a precisely calibrated camera. The DL-based 3D reconstruction methods overcome these bottlenecks by automatically learning 3D shape semantics from images or point clouds using deep networks.
Different DL architectures have been used in the literature for 3D building reconstruction from EO data (aerial imagery, UAV imagery, satellite imagery, point clouds, etc.). In this section, we have investigated convolutional neural networks (CNNs), generative adversarial networks (GANs), and the combinations of DL methods with conventional methods in detail for realizing this task. Here, the DL-based methods are explained in a separate section due to their increasing popularity and the availability of reference datasets.

Convolutional Neural Networks
CNNs (Simonyan and Zisserman, 2014) learn the characteristics of images at various levels using convolution and pooling operations, which makes them extremely useful DL models for image classification and reconstruction. They can also be used for 3D building reconstruction from EO data. A DL approach was proposed by Wang and Frahm (2017) for performing a single-view parametric reconstruction of buildings based on satellite imagery by parametrizing buildings as 3D cuboids. CNNs can also be used for the procedural reconstruction of buildings. Using Neural Procedural Reconstruction (Zeng et al., 2018), 3D points were mapped into CAD (computer-aided design)-quality models with procedural structures inferred by sequences of shape grammar rules. An interactive tool was developed by Nishida et al. (2018), with which users can generate a grammar automatically from a single image of a building with the help of CNNs for procedural modeling of buildings. Alidoost et al. (2019) developed a DL architecture for detecting buildings from a single aerial image. Using the proposed method, the 3D reconstruction of buildings with a variety of shapes and complexity was achieved with root mean square error (RMSE) values of 3.43 m and 1.13 m for the predicted normalized DSM (nDSM), respectively. To generate block-like city models using depth maps, Agoub et al. (2019) developed a pipeline based on multiple CNNs with an encoder-decoder architecture. A view of the reconstructed buildings of the Manhattan area in their study is given in Figure 3. Based on a Y-shaped CNN (Y-Net), a modern DL-based framework was proposed by Alidoost et al. (2020) for automatically detecting, localizing, and estimating building heights simultaneously from a single aerial image for LoD2 building reconstruction. A multi-task, multi-feature learning framework was presented by Mahmud et al. (2020) for modeling a building in 3D from a single overhead image. Using this approach, the authors generate 2D building outline proposals, a pixel-by-pixel heightmap, a modified signed distance function (BPSH), and pixel-by-pixel semantic labels, and then produce 3D models of each building. Another example of a CNN applied to a specific grid structure was presented by Knyaz et al. (2020), where a CNN was used for automatic semantic segmentation of wire structures to overcome the limitations of photogrammetric processing applied to the 3D reconstruction of complex grid structures. The roof structure lines can be used for reconstructing 3D building models. Muftah et al. (2021) used a CNN-based method for classifying and segmenting roofs based on aerial imagery for 3D building reconstruction in LoD2. The Deep Roof Definer network proposed by Qian et al. (2022) uses satellite imagery to generate roof structure lines using a detail-oriented DL network.
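The convolution and pooling operations on which CNNs are built can be shown with a few lines of dependency-free Python. This is a didactic sketch only, not a trainable network and not the architecture of any cited study: the function names, the fixed edge-detection kernel (in a real CNN the kernel weights are learned), and the toy height raster are assumptions chosen for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the
    element-wise products, i.e. the cross-correlation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(w)]
            for r in range(h)]


# Toy 6 x 6 height raster: ground (0 m) on the left, a 5 m high roof on the right.
heights = [[0, 0, 0, 5, 5, 5] for _ in range(6)]
# Hand-picked horizontal-gradient kernel that responds to the building edge.
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = relu(conv2d_valid(heights, edge_kernel))  # 4 x 4 feature map
pooled = max_pool2d(features, size=2)                # 2 x 2 summary
```

The feature map is non-zero only in the columns whose 3 x 3 window straddles the height step, and pooling halves the resolution while preserving that response; stacking many such layers with learned kernels is what lets a CNN capture image characteristics at several levels.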

Generative Adversarial Networks
GANs were proposed by Goodfellow et al. (2014). A generator and a discriminator are the main parts of a GAN. The generator learns the distribution of real images in order to produce more realistic-looking images and fool the discriminator. The discriminator judges each image it receives as either real or fake. As part of the generative modeling process, GAN-based methods introduce the adversarial discriminator, which implicitly learns the similarities and differences between 3D shapes and can therefore identify occluded or missing portions (Liu et al., 2021).
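For reference, the adversarial game between the two networks is commonly written as the minimax objective introduced by Goodfellow et al. (2014), where G is the generator, D is the discriminator, p_data is the distribution of real samples, and p_z is the prior over the generator input z:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes the second term by producing samples that the discriminator accepts as real.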
By applying a conditional GAN (cGAN), Bittner et al. (2018a) presented an automatic method for producing better-quality, LiDAR-like DSMs with refined 3D building shapes from noisy DSMs. They used stereo half-meter resolution satellite imagery to create 3D surface models and refine building shapes in LoD2 (Bittner et al., 2018b). In their next study, Bittner et al. (2019) produced good-quality DSMs with a complete and accurate level of detail similar to LoD2 building forms, and additionally assigned an object class label to every pixel. GANs can also be used for automatically reconstructing building models in LoD1 (Beer, 2019). FrankenGAN (Kelly et al., 2018) is a network for modeling realistic geometric and texture details on large-scale coarse building mass models; with examples as guides, users can add realistic details to large-scale models. Qian et al. (2021) presented Roof-GAN, a network that generates structured geometry of residential roof structures as a combination of roof primitives. A sample view of roof models generated by Roof-GAN is given in Figure 4.

Combination of DL-based and Conventional Methods
A rectified linear unit neural network (ReLU-NN) is another DL network that was used for the classification and reconstruction of buildings from airborne laser scanning point cloud data (Zhang et al., 2018). Using 3D CNNs, a deep Q-network and a residual recurrent neural network (RNN), Zhang and Zhang (2018) developed a deep reinforcement learning framework for parsing the semantics of large-scale 3D point clouds and reconstructing 3D building models. In another study, an end-to-end system for reconstructing urban 3D buildings from WorldView-3 multi-view satellite imagery using DL was demonstrated by Leotta et al. (2019) by segmenting buildings and bridges and reconstructing low-polygon 3D textured mesh models. Satellite imagery-derived point clouds were used by Xu et al. (2020) to build an automated DL-guided 3D reconstruction framework that distinguishes the shape of building roofs in complex and noisy scenes.
The study by Yu et al. (2021) introduced a new fully automatic 3D building reconstruction pipeline based on DL that can automatically construct building models at LoD1 from multi-view aerial images without any assistance from other data sources. In a recent study, Gui and Qin (2021) presented a DL-based approach for reconstructing LoD2 models using data derived from very-high-resolution multi-view satellite stereo images. Several steps were involved in their proposed method, including instance-level building segment detection, initial building polygon extraction, building polygon decomposition and refinement, basic model fitting, and merging. Kapoor et al. (2019) proposed a four-step approach for generating 3D city models from historical images using DL. Another automatic 3D building model reconstruction workflow was proposed by Partovi et al. (2019). The workflow was composed of several steps, including DL-based building boundary extraction, decomposition, classification of roof types based on images, and computation of initial roof parameters for 3D model fitting. Teo (2019) proposed a DL approach (i.e., a Fully Convolutional Network, FCN) to detect initial building regions from LiDAR data and automatically reconstruct 3D prismatic building models from 3D LiDAR data. With the integration of 3D BAG CityJSON and floor plan images, Kippers et al. (2021) proposed a new automatic DL-based method for constructing building models. An automatic 3D building reconstruction method in LoD1 that consists of three main parts, namely DSM generation, DL-based 2D building footprint generation, and 3D building reconstruction, was proposed by Yu et al. (2020). Another study by Li et al. (2021) demonstrated a novel method for reconstructing 3D building models with accurate roofs, facades, footprints, and height from monocular remote sensing images. Zhao et al. (2021) developed a novel 3D reconstruction framework based on an off-nadir satellite image. Their approach consists of three parts: Scale-ONet for model reconstruction, Optim-Net for model scale optimization, and Model-Image match for restoring reconstructed scenes. The holistic primitive fitting method (Zhang et al., 2021) was also used along with PointNet++ (Qi et al., 2017) for 3D building reconstruction from point clouds. Another three-step 3D building reconstruction approach using deep implicit fields and point clouds was proposed by Chen et al. (2021). The DL methods can also be combined with GIS for reconstructing 3D city models from high-resolution satellite imagery (Pepe et al., 2021).

CONCLUSIONS
3D city models and digital twins are increasingly being produced and used all over the world. Producing these models requires a great deal of data, processing, and expertise. Conventional methods for 3D building reconstruction have certain limitations, for example in the efficient reconstruction of large numbers of buildings at city scale. Typically, conventional methods cannot produce models fully automatically, and several steps must be performed manually by users. In addition, data pre-processing is essential. Conventional methods may generate incorrect model geometry from noisy data since they are often not robust and are sensitive to noise.
The DL methods have been more successful than conventional methods in many fields. Additionally, more and more city models and EO data are becoming publicly available. With the advancements in computer hardware, especially GPU technology, DL is becoming more popular in various fields. With the growing popularity of 3D DL libraries such as PyTorch3D (Ravi et al., 2020), Tensorflow3D (Google Research, 2021) and Nvidia Kaolin (Nvidia, 2021), DL studies using 3D data are expected to become more common, following the success of DL in 2D image studies. Networks built with 3D DL libraries can learn 3D object models directly in order to reconstruct them. As a result, 3D city models and buildings can be automatically generated without having to rely on any specific roof libraries.
Using the DL methods, different tasks such as building detection, building segmentation, footprint extraction and 3D reconstruction can be performed consecutively on different data sets, and full automation may be possible. Many countries and municipalities are offering 3D city models in different LoDs through open data portals. The data of different 3D city models or DSMs can be used to train the DL models and to automate the production. Consequently, globally applicable models can be produced with a higher level of accuracy. Furthermore, DL can also be used to automatically extract roof segment lines from aerial or satellite imagery, which can then be used to generate a 3D model of the building.
With the availability of EO data, city models and point clouds, city-scale automatic 3D reconstruction with DL methods, especially of semantic 3D city models, will be an actively studied research topic in the coming years. It may even be possible to produce global models that are not specific to a particular area if data on millions of buildings are used to train a DL model. In the next few years, it is also expected that open-source 3D DL libraries such as PyTorch3D, Tensorflow3D, and Nvidia Kaolin will facilitate 3D reconstruction and enable more research.

Figure 3. Automatically reconstructed building models of Manhattan area using CNN (Agoub et al., 2019).