SEMI-AUTOMATIC CITYSCAPE 3D MODEL RESTORATION USING GENERATIVE ADVERSARIAL NETWORK

ABSTRACT: The paper addresses the problem of city heightmap restoration using a satellite view image and a manually created area with 3D data. We propose an approach based on generative adversarial networks. Our algorithm consists of three steps: low-quality 3D restoration, building segmentation using the restored model, and high-quality 3D restoration. A CNN architecture based on original ResDilation blocks and ResNet is used for steps one and three. Training and test datasets were retrieved from the National Lidar Dataset (United States), on which the algorithm achieved approximately MSE = 3.84 m. In addition, we tested our model on the completely different ISPRS Potsdam dataset and obtained MSE = 5.1 m.


INTRODUCTION
Cityscape 3D models are widely used in practical applications, e.g. VR, computer games, data augmentation, etc. Since the creation of highly detailed 3D models requires a large amount of time-consuming manual work, solutions that can automate this process are in high demand.
Many modern techniques can automatically create such models using LIDAR data or an image flow that is sufficient for structure-from-motion approaches. When such data is not fully available, methods of 3D reconstruction from a single image are applied. In recent years, as in almost all other computer vision tasks, convolutional neural networks (CNNs) have gained well-deserved popularity as the core of these methods.
In this work, we address the problem of automatic 3D cityscape reconstruction using a satellite image and some manually reconstructed parts of the 3D scene (e.g. several buildings). This approach is justified for fast semi-supervised 3D modeling of a real environment when no high precision is needed (for example, for synthetic dataset generation).
The heightmap representation of 3D models is considered, i.e. a 2D matrix that contains surface elevation data. This allows us to build the algorithm on classical CNNs instead of voxel or graph CNNs. In recent years, generative adversarial networks (GANs) have brought great performance gains for such problems and are employed in this work as well.
Our GAN generator CNN receives a satellite image and additional data (a building mask or a part of the heightmap) as input and produces the full heightmap as output. For 3D reconstruction, two types of CNNs are used: MapNet and MaskNet. MapNet produces the heightmap from the satellite image and the heightmap part or the building mask. MaskNet produces an intermediate building mask from the heightmap. Our CNNs have an original architecture.
The training data includes LIDAR data from the National Lidar Dataset (USA) for New York City and corresponding satellite images retrieved from the Google, Yandex, Nokia and Bing online services using QGIS software. On the test dataset, MSE (mean squared error) = 3.8 m is achieved for the restored heightmaps. The proposed MaskNet outperforms popular CNN architectures such as ResNet36, DeepLabv3 and U-Net.
In addition, we have tested our algorithm on the ISPRS Potsdam dataset and obtained MSE = 5.1 m. It should be noted that the Potsdam dataset is significantly different from our training dataset in terms of the building types presented (the algorithm has seen many skyscrapers and tall buildings in the New York City training dataset).

RELATED WORKS
Monocular 3D reconstruction. In our work, for the 3D reconstruction of cityscapes we use two data sources: aerial (or satellite) images and a manually created area with 3D data. The very similar task of 3D reconstruction from a single image is well known in computer vision (El-Hakim, 2001), (Remondino, 2003), (Remondino, 2006). As in other computer vision problems, methods based on deep learning (Choy, 2016), (Richter, 2018), (Shin, 2018), (Long, 2015), (Isola, 2015), (Wu, 2017), (Huang, 2015), (Zheng, 2013) are successfully used in this area. Some methods were developed for voxel 3D model restoration from a single depth map (Zheng, 2013), (Firman, 2016). A CNN for image-to-voxel 3D model translation was also introduced; its architecture is an auto-encoder for direct voxel model prediction. Unfortunately, this approach can work only with small 3D models (up to 20×20×20 voxels). An approach that combines single-view and multi-view reconstruction modes was described in (Choy, 2016). In (Knyaz, 2018a) a more accurate CNN that can generate voxel models of complex scenes with multiple objects was proposed. Using heightmaps, the landscape reconstruction problem can easily be transformed into a classical image-to-image translation problem.
Image-to-image translation. Well-known grayscale colorization and style imitation methods (Zhang, 2016), (Gatys, 2015) are examples of early CNN-based image-to-image translation methods. The next level of quality, and the ability to solve this problem in general, is promised by generative adversarial networks. The first was the Pix2Pix model, which can learn any type of high-quality image-to-image translation using training datasets of corresponding image pairs. Later, a new generative adversarial network called CycleGAN was proposed that has the ability to learn on unpaired datasets.
High quality 3D reconstruction. There are also some popular approaches to high-quality 3D terrain reconstruction based on stereo matching (Knyaz, 2018b) and structure from motion (Knyaz, 2017). These approaches provide high-quality 3D terrain models but require more input data: stereo pairs or image sequences.

HEIGHTMAP
Normally, 3D cityscape models are represented as a set of points or triangles with texture. This type of representation is common for 3D modeling software and graphics cards. Unfortunately, it cannot be used with regular convolutional neural networks, which take only fixed 2D grids as input. Of course, today there are some good implementations of graph-based neural networks that can work directly on graph-like data structures. However, the theory of graph-based networks is not as mature as that of regular CNNs. On the other hand, for terrains there is a fixed 2D grid data representation called a heightmap. A heightmap (or heightfield) is a 2D matrix used mainly as a Discrete Global Grid in secondary elevation modeling. Each element of this matrix corresponds to a point in the 3D model, and the value of the element represents the elevation at this point. The elevation values are set relative to some "zero" level. Such a heightmap can easily be converted by triangulation into a 3D mesh. On Figure 1 an example of a heightmap and the corresponding 3D model of a landscape are shown. The heightmap is similar to an image in terms of data structuring; therefore, classical CNNs can be used for 3D landscape processing. For example, in (Vizilter, 2019) heightmaps are used for 3D landscape restoration using a CNN.
In our work we also use heightmaps for cityscape representation. On Figure 2 an example of a heightmap and the corresponding 3D model of a cityscape are shown.
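The triangulation mentioned above is straightforward: each heightmap cell is split into two triangles whose vertex elevations come from the matrix values. A minimal sketch of this conversion (an illustration only, not the authors' implementation; `cell_size` is an assumed grid spacing parameter):

```python
import numpy as np

def heightmap_to_mesh(heightmap, cell_size=1.0):
    """Convert an H x W heightmap into a triangle mesh.

    Each grid cell is split into two triangles; the z coordinate of each
    vertex is the elevation value relative to the "zero" level.
    """
    h, w = heightmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # One 3D vertex per heightmap element: (x, y, elevation).
    vertices = np.stack(
        [xs * cell_size, ys * cell_size, heightmap], axis=-1).reshape(-1, 3)

    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    triangles = np.concatenate([
        np.stack([tl, bl, tr], axis=-1),  # upper-left triangle of each cell
        np.stack([tr, bl, br], axis=-1),  # lower-right triangle of each cell
    ])
    return vertices, triangles
```

The resulting vertex and triangle arrays can be written directly to common mesh formats such as OBJ.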

METHOD
Simple generative adversarial networks generate a signal B̂ from a random noise vector z: G : z → B̂. A conditional GAN transforms an input image A and a noise vector z into an output B̂: G : {A, z} → B̂. The input A is the image that is transformed by the generator network G. The discriminator network D is trained to distinguish "real" signals from the target domain B from the "fakes" B̂ produced by the generator. The generator and discriminator are trained simultaneously. The discriminator provides the adversarial loss that forces the generator to produce "fakes" B̂ that cannot be distinguished from the "real" signal B.
In our case, we have a classical conditional GAN problem, i.e. we have two inputs, an aerial image and a low-quality heightmap (an interpolation from a set of points), and get a dense landscape model as the output (see Figure 3). Data fusion is performed by a concatenation procedure. The reconstruction process can be divided into three stages:
1. Low-quality 3D restoration from the input image and the heightmap part using the MapNet CNN;
2. Building mask estimation based on the low-quality 3D model and the input image using the MaskNet CNN;
3. High-quality 3D restoration using the MapNet CNN, which depends on the data from the previous stages.
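The three stages and the concatenation-based data fusion can be sketched as follows. The `TinyNet` module is a hypothetical stand-in for the paper's MapNet and MaskNet (whose real architectures use ResDilation blocks); only the data flow between stages is illustrated here:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Placeholder CNN standing in for MapNet/MaskNet (an assumption)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_ch, 1))

    def forward(self, x):
        return self.net(x)

def reconstruct(image, sparse_heightmap, map_net1, mask_net, map_net2):
    """Three-stage pipeline; data fusion is plain channel concatenation."""
    # Stage 1: low-quality heightmap from image + sparse heightmap part.
    low_q = map_net1(torch.cat([image, sparse_heightmap], dim=1))
    # Stage 2: building mask from image + low-quality heightmap.
    mask = torch.sigmoid(mask_net(torch.cat([image, low_q], dim=1)))
    # Stage 3: high-quality heightmap conditioned on all previous outputs.
    high_q = map_net2(torch.cat([image, low_q, mask], dim=1))
    return high_q, mask

img = torch.randn(1, 3, 64, 64)        # RGB aerial image
sparse = torch.randn(1, 1, 64, 64)     # interpolated heightmap part
hq, m = reconstruct(img, sparse,
                    TinyNet(4, 1), TinyNet(4, 1), TinyNet(5, 1))
```

Note that the input channel counts (4, 4, 5) simply follow from the concatenations at each stage.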

IMPLEMENTATION DETAILS
To determine the height of a point, it is necessary to know what type of object it belongs to, how tall the given object is, and how uniform it is. Moreover, most of the points whose height differs from the ground level belong to buildings, whose scale can be diverse (small: private-sector houses; medium and large: city buildings, hangars, etc.). The height of buildings is indirectly reflected in the size of their shadows and in the deviation of the building roof position relative to the foundation. Depending on the height of the object and the angle of the sun at which the satellite image is acquired, its shadow in the image can take from just a few to hundreds of pixels. Also, the height of an object depends on the area in which it is located (private sector, residential quarter, skyscrapers, etc.). Thus, to determine the height of a point, it is necessary to take into account both closely situated and distant features. Since the construction of the 3D model is carried out using a satellite image (whose resolution is a couple of meters or tens of centimeters), it is necessary to minimize the loss of spatial resolution, which could lead to a decrease in the accuracy of heightmap restoration.

ResDilation block
For good 3D reconstruction we need multi-scale features. Such features can be created using convolutions with different dilations (dilated convolutions (Fisher, 2016)). Following (Zhou, 2018), multi-scale features can be combined in one layer. In this paper, we propose a new layer called the ResDilation block that combines the idea of the residual block from ResNet with multi-scale features. The ResDilation block (shown on Figure 5) contains a sequence of convolutional layers with different dilations. For a ResDilation block with convolution dilations 1 → 2 → 4 → 8 → 16 → 32, the receptive field is 127×127. The block output is based on local information from the first convolutional layers and on global information from the last layers. Concatenation is used to prevent any changes in the global and local information. The ResDilation block is aimed at combining features at different distances and, depending on the position of the block in the network, determining which feature scale is important at that level.
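A sketch of such a block is shown below. The exact wiring (the 1×1 fusion convolution and the residual addition) is an assumption, since only Figure 5 defines it precisely; the chain of 3×3 convolutions with dilations 1, 2, 4, 8, 16, 32 does give the stated 127×127 receptive field (1 + 2·(1+2+4+8+16+32) = 127):

```python
import torch
import torch.nn as nn

class ResDilation(nn.Module):
    """Sketch of a ResDilation block: a chain of 3x3 convolutions with
    growing dilations whose intermediate outputs are concatenated, so
    local features from early layers and global features from late
    layers are both preserved, plus a residual connection."""
    def __init__(self, channels, dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        # padding == dilation keeps the spatial size for 3x3 kernels.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        # Fuse the concatenated multi-scale features back to `channels`.
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        feats, h = [], x
        for conv in self.convs:
            h = self.relu(conv(h))
            feats.append(h)
        # Concatenation of all scales + residual connection to the input.
        return x + self.fuse(torch.cat(feats, dim=1))
```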

MapNet architecture
The original network architecture based on ResDilation block is shown in

Training process
During training we use the generative adversarial network ideology with the MapNet network as the generator CNN and the original network described in Table 3 as the discriminator CNN.

Layers:
Conv2d (num_filter=64, kernel_size=3, stride=1), BatchNorm, ReLU
Conv2d (num_filter=128, kernel_size=3, stride=1), BatchNorm, ReLU
ResDilation(128)
Conv2d (num_filter=1, kernel_size=1, stride=1)
Table 3. MapNet architecture

The training process and basic loss functions are similar to Pix2Pix. To prevent the model from smoothing, a special border loss is added: an L1 loss between "height difference maps" produced by applying the Laplace operator to the ground-truth heightmap and the generator CNN output. So the final loss is:

G* = arg min_G max_D L_cGAN(G, D) + λ_L1 L_L1(G) + λ_border L_border(G)

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)

HRNet was selected for the MaskNet architecture. A cross-entropy loss for semantic segmentation with two classes (building and background), with the learning parameters from the original paper, is used in the training process, which contains several stages (see Figure 4):
Stage 1: MapNet#1 CNN pre-training on low-quality 3D model restoration.
Stage 2: MaskNet CNN pre-training on semantic segmentation using the low-quality 3D model and the aerial photo as input.
Stage 4: All three CNNs are trained simultaneously using the full pipeline (Figure 1). At this stage we use the Adam optimizer with β1 = 0.5, β2 = 0.999. The initial learning rate is 0.0001 and the learning rate decay is 0.1.
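The border loss described above can be sketched as follows, assuming the standard 4-neighbour discrete Laplace kernel (the paper does not specify which discretization is used):

```python
import torch
import torch.nn.functional as F

# Discrete 4-neighbour Laplace operator as a fixed convolution kernel.
_LAPLACE = torch.tensor([[0., 1., 0.],
                         [1., -4., 1.],
                         [0., 1., 0.]]).view(1, 1, 3, 3)

def border_loss(pred, target):
    """L1 distance between "height difference maps": the Laplacians of
    the predicted and ground-truth heightmaps (N x 1 x H x W tensors).
    Penalizes smoothed-out building edges in the generator output."""
    lap_pred = F.conv2d(pred, _LAPLACE.to(pred.dtype))
    lap_target = F.conv2d(target, _LAPLACE.to(target.dtype))
    return F.l1_loss(lap_pred, lap_target)
```

Because the kernel sums to zero, a constant elevation offset does not contribute to this loss; only changes in surface slope, i.e. edges, are penalized.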

EXPERIMENTS
In our experiments we use the PyTorch framework and four Nvidia Tesla P100 GPUs for training and testing.

Database
We use the public LIDAR database from the National Lidar Dataset (United States) for New York City (http://gis.ny.gov/elevation/lidar-coverage.htm, 2017) and corresponding satellite images downloaded from the Google, Yandex, Nokia and Bing map engines (using QGIS software) with a resolution of 1 meter per pixel. Training and test datasets were created from this data. The training dataset contains 18000 samples and 576000 unique pairs (a 3D model and a 256×256 RGB image). The test dataset contains 2000 pairs.

Training results
The heightmap reconstruction quality is estimated by the mean squared error (MSE) metric between the ground-truth and reconstructed heightmaps, and the building mask quality is evaluated by the mean Intersection over Union (mIoU) metric.
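For reference, both metrics can be computed as follows (a straightforward sketch; the two-class mIoU averages the IoU of the building and background classes):

```python
import numpy as np

def mse(pred, gt):
    """Mean squared error between reconstructed and ground-truth heightmaps."""
    return float(np.mean((pred - gt) ** 2))

def miou(pred_mask, gt_mask):
    """Mean Intersection over Union over the two classes
    (0 = background, 1 = building)."""
    ious = []
    for cls in (0, 1):
        inter = np.logical_and(pred_mask == cls, gt_mask == cls).sum()
        union = np.logical_or(pred_mask == cls, gt_mask == cls).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```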
The segmentation quality has been tested with different input data. The results are given in Table 4 and show that using low-quality heightmaps leads to a quality improvement. The proposed architecture achieves significantly better quality in comparison with its competitors. Our approach was also tested on the completely different ISPRS Potsdam dataset from http://www2.isprs.org/commissions/comm3/wg4/tests.html and obtained MSE = 5.1 m without any pretraining. It should be noted that the Potsdam dataset is completely different from our training dataset in terms of the building types presented (in New York there are many skyscrapers and tall buildings). On Figures 6 and 7 a qualitative example of 3D reconstruction on the Potsdam dataset is shown.

CONCLUSIONS
The paper addresses the problem of city heightmap restoration using a satellite view image and a manually created area with 3D data. This is a kind of monocular 3D reconstruction problem. To solve it, we propose an approach that uses a set of convolutional neural networks with proxy tasks. We use the heightmap representation of 3D models, i.e. a 2D matrix that contains surface elevation data. This allows us to use classical CNNs instead of voxel or graph CNNs. Following the very popular Pix2Pix technique, we use a generative adversarial approach to improve the 3D restoration quality. Our generator CNN receives a satellite image and additional data (a building mask or a part of the heightmap) as input and produces the full heightmap as output. L1 and adversarial losses are used as the loss function. To prevent 3D model smoothing, a special border loss is added: an L1 loss between "height difference maps" produced by applying the Laplace operator to the ground-truth heightmap and the generator CNN output. The proposed algorithm is not intended for photogrammetric measurements due to the accuracy it provides, but it can be effectively used for the automatic generation of surrounding 3D models.