MONOCULAR DEPTH PREDICTION IN PHOTOGRAMMETRIC APPLICATIONS

Despite the recent success of learning-based monocular depth estimation algorithms and the release of large-scale datasets for training, these methods are limited to depth map prediction and still struggle to yield reliable results in 3D space without additional scene cues. Indeed, although state-of-the-art approaches produce high-quality depth maps, they generally fail to recover the 3D structure of the scene robustly. This work explores supervised CNN architectures for monocular depth estimation and evaluates their potential in 3D reconstruction. Since most available training datasets are not designed toward this goal and are limited to specific indoor scenarios, a new metric, large-scale synthetic benchmark (ArchDepth) is introduced that renders near real-world outdoor scenes. An encoder-decoder architecture is used for training, and the generalization of the approach is evaluated via depth inference on unseen views in synthetic and real-world scenarios. The depth map predictions are also projected into 3D space using a separate module. Results are qualitatively and quantitatively evaluated and compared with state-of-the-art algorithms for single-image 3D scene recovery.


INTRODUCTION
Depth estimation from 2D images is a fundamental research topic in photogrammetry and computer vision toward 3D reconstruction and scene understanding, with a vast field of applications including mapping, navigation, and augmented reality. Most scenarios require dense and accurate depth estimates for, if possible, every scene pixel in order to recover the 3D structure reliably. The recent success of deep learning in several image recognition tasks, such as image classification (Krizhevsky et al., 2012; He et al., 2016), object detection (Girshick et al., 2014; He et al., 2017), and semantic segmentation (Long et al., 2015; Chen et al., 2017; Badrinarayanan et al., 2017), has motivated its application in depth estimation and 3D reconstruction as well, especially for tackling matching ambiguities and occlusions. Depth estimation using deep learning can be applied in stereo, multi-view, or monocular scenarios, and various supervised or unsupervised methods have been suggested in the literature in recent years (Zbontar and Lecun, 2015; Yao et al., 2018; Huang et al., 2021).
In particular, monocular depth estimation methods aim to recover distances between scene objects and camera parameters from a single image. It is, by definition, an ill-posed problem, since many different 3D scenes can project to the same 2D image. Indeed, efficient depth map recovery from a single image would require rich scene prior cues, as commonly used in conventional methods (Saxena et al., 2008). In the deep learning era, monocular depth estimation refers to the task of single-image inference at test time, first introduced by Eigen et al. (2014) using a coarse-to-fine approach. Since then, the problem has been broadly studied in the literature as a supervised or unsupervised task. As with all supervised learning methods, supervised monocular depth estimation relies on corresponding ground truth (GT) depth maps for every RGB image. In contrast, unsupervised methods learn from stereo cues or video sequences during training and predict a depth map for single images during testing.
Despite the tremendous underlying potential, supervised depth estimation generally requires an enormous amount of training data to generalize properly to diverse scenarios, i.e., indoor, outdoor, and aerial applications; this is particularly true in monocular depth estimation. Most state-of-the-art methods achieve their results by training and testing on each benchmark separately; few focus on generalization, commonly assuming ordinal depth relations and only recently investigating affine-invariant depth (Yin et al., 2020).
However, we believe that the greatest challenge of monocular depth estimation is the quality of the 3D reconstruction derived from the predicted estimates. The deficiency commonly stems from the lack of 3D supervision cues and the difficulty in determining the camera intrinsics. Indeed, it is not trivial to enforce geometric constraints from monocular images without additional scene cues. In fact, most methods are limited to depth prediction, and while they achieve low depth error values, the actual 3D scene reconstruction mostly fails; 3D structure recovery remains largely unexplored for state-of-the-art methods. Few recent works discuss this issue, integrating geometric supervision (Yin et al., 2019) or relying on extra modules trained at the point cloud level separately from depth estimation (Yin et al., 2021). The transferability of deep learning depth estimation to real-world photogrammetric scenarios is a challenging problem that has only recently been acknowledged in the community (Madhuanand et al., 2021; Steenbeek and Nex, 2022).

Aim of the work
This work investigates the potential of integrating learning-based monocular depth estimation in photogrammetric applications. Our contributions can be summarized as follows (Figure 1): (1) we introduce a novel, large-scale dataset (ArchDepth) of photorealistic outdoor scenes of historic buildings, including high-quality, complete, metric depth maps for every image; (2) we present a straightforward training pipeline following an encoder-decoder network for metric monocular depth estimation to demonstrate the potential of this dataset; (3) we employ a 3D reconstruction module based on our predictions for single-view 3D scene recovery; (4) we evaluate the generalization performance of our trained model and investigate its applicability in real-world photogrammetric scenarios.
Figure 1. The pipeline of our method is based on an encoder-decoder architecture with skip connections for monocular depth prediction. An additional module for 3D reconstruction is employed afterward.

Monocular Depth Prediction.
Early methods for monocular depth estimation relied on handcrafted features and used complementary cues to recover the depth, since limited information about the scene geometry can be directly extracted from a single image (Saxena et al., 2008). In the deep learning era, the seminal work of Eigen et al. (2014) proposed a scale-invariant loss function in a coarse-to-fine context using a VGG network. The approach was further extended by adding more layers while also predicting surface normals and semantic maps (Eigen and Fergus, 2015). Since then, the problem has been studied in the literature as a supervised (Laina et al., 2016; Xu et al., 2018; Fu et al., 2018; Hu et al., 2019) or unsupervised problem (Garg et al., 2016; Godard et al., 2017; Tosi et al., 2019). An architecture often adopted in such methods is the encoder-decoder (e.g., Fu et al., 2018), with RGB images as input and direct regression of pixel-wise depth maps as output. Indeed, most methods perform pixel-wise supervision, yet Conditional Random Fields (CRFs) have also been used to exploit neighbor relations and include a more global context (Liu et al., 2015). The task can be formulated as either a regression or a classification problem. Skip connections in a ResNet fashion are used to preserve the fine-grained features of the first layers (Laina et al., 2016). Cues such as texture, shading, and structural information are used, while high-quality, pixel-aligned GT depth maps are needed. Depending on the available training data, the scene depth can be estimated as ordinal, i.e., relative (Fu et al., 2018), or Euclidean (Eigen et al., 2014; Yin et al., 2019). Local planar priors have also been incorporated as guidance (Lee et al., 2019). Apart from standard CNN models, adversarial training (Chen et al., 2018), attention mechanisms (Chen et al., 2020), and transformer architectures (Ranftl et al., 2021; Yang et al., 2021) have also been recently proposed.
3D scene recovery. Although excellent results are achieved in depth map prediction (e.g., Hu et al., 2019), the respective reconstructions in 3D space suffer from significant distortions and artifacts. Only recently have a few works tried to incorporate 3D awareness into the methods. Since most man-made scenes can be decomposed into planar structures, plane detection can be used as a prior for monocular depth estimation. However, the 3D structure was not explicitly considered until recently; Yin et al. (2019) formulated a joint loss function using virtual normals to enforce high-order geometric consistency between surface patches in a large range. The work was further extended by considering affine-invariant depth (Yin et al., 2020) and adding an extra training module for 3D scene reconstruction (Yin et al., 2021). These state-of-the-art methods, although promising, still suffer from generalization limitations in diverse scenarios.
KITTI Vision (Geiger et al., 2012) and NYU Depth v2 (Silberman et al., 2012) are the pioneering efforts and widely used large-scale datasets in terms of the number of images. KITTI Vision contains real-world street scenes captured with a LiDAR sensor, while NYU Depth v2 contains indoor scenes acquired with the Kinect sensor. Indeed, most existing benchmark datasets for depth estimation are video sequences of indoor scenes acquired with such commodity RGB-D sensors. Since then, the increasing demand for training data has led to the release of similar datasets, such as SUN RGB-D (Song et al., 2015), and larger-scale ones in terms of scene diversity and number of images, such as Stanford 2D-3D Semantics (Armeni et al., 2017), ScanNet (Dai et al., 2017), and the synthetic SceneNet (McCormac et al., 2017). These benchmark datasets have established a common baseline for developing and evaluating new algorithms. They have contributed significantly to the development of new methods and have driven research in novel directions during the last decade. However, although the datasets contain a vast number of images, their scenarios are mostly similar; that is, constrained by the use of depth sensors, they are limited to indoor environments. Moreover, depth sensors inevitably introduce errors during acquisition, resulting in noisy training data. To improve the generalization of such methods to arbitrary scenes, datasets with crowdsourced images from the internet have also been introduced (Li and Snavely, 2018). Yet, they solve depth estimation only at the ordinal level, which precludes distortion-free, metric 3D reconstructions. To overcome such limitations, in this paper we propose a novel, metric, large-scale dataset containing outdoor scenes of historic buildings of varying architectural styles (Figure 2). We hope that this dataset will enable further research in the field.

METHODOLOGY
Most state-of-the-art networks for monocular depth estimation focus on indoor datasets and typically fail to generalize to outdoor, real-world scenarios. Therefore, we introduce a new metric dataset of photorealistic environments. To demonstrate its effectiveness, we employ an encoder-decoder architecture for training and an additional single-view 3D reconstruction module.

The ArchDepth dataset
We introduce a novel dataset, named ArchDepth, consisting of seven photorealistic outdoor scenes of historic buildings of diverse architectural styles (Figure 2). The first six scenes are 3D models of northern European medieval churches retrieved from the web, namely Kuusisto, Liedon, Mietoinen, Nousiainen, Piikkio, and Saint Jacobs. The last scene includes various similarly harvested historic facades rendered in a virtual Piazza.

[Figure 2 panels: Liedon, Mietoinen, Nousiainen, Piikkio, Saint Jacobs, Piazza]
Figure 2. Our synthetic dataset ArchDepth. The first six scenes depict 3D models of churches, while the last scene includes several historic facades. The diverse camera paths for each scene are indicated with a black line.
We have built upon the open-source software Blender for image rendering. For the first six scenes, we designed four camera paths around each of our models and five paths for the Liedon model, rendering a total of 24,000 images at 640x480 resolution. The virtual Piazza consists of eight camera paths along the facades. Images are generated based on the pinhole camera model, so no distortions are present. Moreover, we also generated a hybrid dataset, Modena Cathedral; it contains 88 real-world images acquired for photogrammetric 3D reconstruction. A point cloud collected with a commercial laser scanner was used as the ground truth model. However, since the images also contained areas that were not acquired with the scanner (due to occlusions, sensor range limits, etc.) or were covered only by sparse points, we generated a 3D model from the acquired point cloud in order to render complete depth maps. Generating a new dataset of outdoor scenes for training purposes was undoubtedly a laborious and expensive task, yet we believe it can be a starting point for further research.

Network architecture and training
We employ a straightforward encoder-decoder architecture with skip connections based on the network of Alhashim and Wonka (2018). The network has ca. 58M parameters and has been proven to work efficiently and produce high-quality depth maps with clear boundaries. The original method exploits transfer learning, starting from pre-trained weights on other visual recognition tasks, i.e., image classification. Following this idea, we use a pre-

Loss function.
The choice of the loss function is crucial and should be appropriate for the particular problem. For depth regression, a standard approach considers the pixel-wise depth difference between the GT depth value $y$ and the prediction $\hat{y}$ (Eigen et al., 2014). Following the approach of Alhashim and Wonka (2018), apart from the pixel-wise loss $L_{depth}$, we use the loss over the image gradient $L_{grad}$ and the loss based on structural similarity $L_{SSIM}$, as defined by Wang et al. (2004). The final total loss is therefore defined as a weighted sum of the three terms, which in the formulation of Alhashim and Wonka (2018) reads:
$L(y, \hat{y}) = \lambda L_{depth}(y, \hat{y}) + L_{grad}(y, \hat{y}) + L_{SSIM}(y, \hat{y})$
For more details on the loss function, we refer the reader to (Alhashim and Wonka, 2018).
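As an illustration of how the three loss terms named above could be combined, the following minimal sketch evaluates them on small depth maps stored as nested Python lists. It is not the training code used in this work: the function names are hypothetical, the SSIM term is simplified to a single global window instead of the sliding window of Wang et al. (2004), and the default weight `lam=0.1` and the `dynamic_range` argument are assumptions, since the weights used here are not stated.

```python
def _flatten(grid):
    """Turn a 2D list (rows of pixels) into a flat list of values."""
    return [value for row in grid for value in row]


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def depth_loss(gt, pred):
    """Pixel-wise L1 difference between GT and predicted depth."""
    return _mean(abs(g - p) for g, p in zip(_flatten(gt), _flatten(pred)))


def gradient_loss(gt, pred):
    """L1 norm of the horizontal and vertical gradients of the error map."""
    height, width = len(gt), len(gt[0])
    err = [[gt[r][c] - pred[r][c] for c in range(width)] for r in range(height)]
    dx = [abs(err[r][c + 1] - err[r][c])
          for r in range(height) for c in range(width - 1)]
    dy = [abs(err[r + 1][c] - err[r][c])
          for r in range(height - 1) for c in range(width)]
    return _mean(dx) + _mean(dy)


def ssim_loss(gt, pred, dynamic_range):
    """(1 - SSIM) / 2, with SSIM over one global window (a simplification)."""
    a, b = _flatten(gt), _flatten(pred)
    mu_a, mu_b = _mean(a), _mean(b)
    var_a = _mean((x - mu_a) ** 2 for x in a)
    var_b = _mean((y - mu_b) ** 2 for y in b)
    cov = _mean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    # Stabilizing constants as in Wang et al. (2004).
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )
    return (1.0 - ssim) / 2.0


def total_loss(gt, pred, dynamic_range, lam=0.1):
    """Weighted sum of the three terms; lam is an assumed placeholder weight."""
    return (lam * depth_loss(gt, pred)
            + gradient_loss(gt, pred)
            + ssim_loss(gt, pred, dynamic_range))
```

With a prediction that is offset from the GT by a constant, the gradient term vanishes while the pixel-wise and SSIM terms do not, which is one reason the terms are combined rather than used in isolation.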

Data augmentation.
Data augmentation has proven beneficial for reducing overfitting, especially when limited data are available (Krizhevsky et al., 2012). In the particular scenario of depth estimation, certain geometric transformations may not be appropriate or meaningful. In this case, we only consider mirroring, while for radiometric transformations, we consider color channel permutations following (Alhashim and Wonka, 2018).
[Figure caption] The first two rows refer to depth map inference using our trained model, while the last two refer to depth maps predicted using the trained model of Yin et al. (2021).
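The two augmentations mentioned above can be sketched in a few lines. This is only an illustrative example on images stored as nested lists of RGB tuples; the function names and the probabilities `p_mirror` and `p_swap` are hypothetical, as the values used in this work are not stated. Note that mirroring must be applied to the image and its depth map together, whereas the channel permutation affects the image only.

```python
import random


def mirror(image, depth):
    """Flip the RGB image and its depth map horizontally, keeping them aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in depth]


def permute_channels(image, order):
    """Reorder the color channels of every pixel, e.g. order=(2, 1, 0)."""
    return [[tuple(pixel[i] for i in order) for pixel in row] for row in image]


def augment(image, depth, rng, p_mirror=0.5, p_swap=0.25):
    """Randomly apply mirroring and a channel permutation (assumed probabilities)."""
    if rng.random() < p_mirror:
        image, depth = mirror(image, depth)
    if rng.random() < p_swap:
        order = list(range(3))
        rng.shuffle(order)
        image = permute_channels(image, order)
    return image, depth
```

A call such as `augment(image, depth, random.Random(42))` returns one augmented pair; passing a seeded generator keeps the augmentation reproducible.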

Experimental setup.
For the training of the original network, subsets of the NYU Depth v2 (Silberman et al., 2012) and the KITTI Vision (Geiger et al., 2012) datasets were used. In our case, we consider the synthetic dataset and the hybrid Modena Cathedral dataset in two separate experiments.

Experiment 1 -Synthetic dataset.
We split the synthetic dataset by randomly shuffling the images of all seven scenes, keeping 88% for training, 6% for validation, and the remaining 6% for testing.
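A random split of this kind can be sketched as follows. The function name, the fixed seed, and the use of image indices instead of file names are assumptions made for illustration only.

```python
import random


def split_indices(n_images, train_frac=0.88, val_frac=0.06, seed=0):
    """Shuffle image indices and cut them into train/validation/test subsets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = round(n_images * train_frac)
    n_val = round(n_images * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains is kept for testing
    return train, val, test
```

For 24,000 images, these fractions give 21,120 training, 1,440 validation, and 1,440 test images. Because the shuffle mixes all scenes, views of the same building can appear in both the training and the test subsets.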

Experiment 2 -Model fine-tuned on Modena Cathedral.
We test on the Modena Cathedral dataset to demonstrate how well such an architecture, trained on our synthetic dataset, generalizes to other scenes. For the fine-tuning, we use 68 images for training, 42 for validation, and 19 for testing.

Implementation details.
In our implementation, we use the open-source TensorFlow library (version 2.3.1) and train on an NVIDIA GeForce RTX 2070 with 8 GB of memory. Regarding the network hyperparameters, we use the Adam optimizer with a learning rate of 0.0001 and a decay factor of 0.7. Training is performed for 74 epochs in Experiment 1, using early stopping, and for 48 epochs in Experiment 2.

3D reconstruction
In photogrammetric applications, which commonly have high-quality requirements, 3D reconstruction is typically achieved using stereo and multi-view methods. However, monocular estimation can potentially be helpful in cases of low overlap between the images. Given the recent advancements in the state of the art, in this study we investigate the potential of reliable 3D scene recovery from a single image. We reconstruct the 3D scene explicitly by projecting each depth value into 3D space using the camera projection matrix. For this module, we use the open-source library Open3D (Zhou et al., 2018). The input is the RGB image along with its respective 16-bit depth map; a pinhole camera model is adopted.
Standard 2D metrics are used to evaluate the quality of the depth maps, namely absolute relative error, root mean square error, logged root mean square error, and accuracy under threshold. These metrics are calculated by comparing all pixel predictions $\hat{y}_i$ with their GT equivalents $y_i$. Predicted depth maps were upscaled to the original resolution (640x480) using bilinear sampling. In their standard form, the metrics read:
$\mathrm{AbsRel} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}$
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
$\mathrm{RMSE}_{log} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log \hat{y}_i)^2}$
$\delta_i = \max\left(\frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i}\right) < thr$
where $n$ refers to the total number of image pixels. The accuracy under threshold is calculated as the percentage of pixels whose ratio $\delta_i$ is below a threshold $thr$, with $thr \in \{1.25, 1.25^2, 1.25^3\}$.
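The four 2D metrics listed above can be computed as in the following sketch, which assumes the standard definitions and flat lists of depth values. The function name is hypothetical, and skipping non-positive depths is an assumption about how invalid pixels would be masked; 2D depth maps would first be flattened.

```python
import math


def depth_metrics(gt, pred, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return AbsRel, RMSE, RMSE(log), and accuracy under each threshold."""
    # Keep only pixels where both depths are valid (positive); assumption.
    pairs = [(g, p) for g, p in zip(gt, pred) if g > 0 and p > 0]
    n = len(pairs)
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n
    )
    # Ratio max(y / y_hat, y_hat / y) per pixel, compared against each threshold.
    ratios = [max(g / p, p / g) for g, p in pairs]
    accuracy = [sum(r < t for r in ratios) / n for t in thresholds]
    return {
        "abs_rel": abs_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "accuracy": accuracy,
    }
```

For example, `depth_metrics([2.0, 4.0], [1.0, 4.0])` yields an absolute relative error of 0.25 and an accuracy of 0.5 at all three thresholds, since one of the two pixels is off by a factor of two.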
The average results for all testing images of Experiments 1 and 2 are shown in Table 1. Naturally, the results are best in Experiment 1, where both the training and test sets come from the same dataset; however, the fine-tuned model generalizes quite well to the challenging real-world scenario. Our intuition is that the high RMSE values are due to the presence of some outliers. Figures 3 and 4 show various examples of the predicted depth maps. In Experiment 1, the depth maps predicted with our model are close to the ground truth; depth transitions are smooth and ordinal values seem to be consistent. In Experiment 2, the results behave similarly, although outliers are more likely to be present.

3D metrics.
The 2D metrics tend to disregard the overall predicted 3D structure of the scene and thus cannot reliably indicate the quality of the resulting 3D model. Although they achieve high scores, most state-of-the-art methods fail to reliably reconstruct the 3D structure of the scene (Yin et al., 2019). To investigate this deficiency, in this study, apart from the standard 2D scores, we also consider the commonly used metrics for 3D point cloud quality: completeness (recall), accuracy (precision), and their harmonic mean, the F1-score (Knapitsch et al., 2017). Figures 5 and 7 show examples of the resulting 3D reconstructions. The low scores of Figure 5c are due to a shift of the reconstruction, probably caused by an incorrect estimation of the absolute minimum and maximum depth values. In Experiment 1 such failure cases are rare; they occur more often in Experiment 2.
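To make the 3D evaluation concrete, the sketch below back-projects a depth map into a point cloud with a pinhole model and scores it against a reference cloud using precision, recall, and F1-score at a distance threshold. It is a simplified illustration rather than the Open3D-based module used in this work: the function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the threshold `tau` are hypothetical, and the brute-force nearest-neighbor search is only practical for small clouds.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Lift every valid pixel (u, v) with depth z to a 3D point (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels without a valid depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def _nearest_distance(point, cloud):
    """Distance from one point to its closest neighbor in the other cloud."""
    return min(math.dist(point, other) for other in cloud)


def f_score(pred_cloud, gt_cloud, tau):
    """Precision (accuracy), recall (completeness), and their harmonic mean."""
    precision = sum(
        _nearest_distance(p, gt_cloud) < tau for p in pred_cloud
    ) / len(pred_cloud)
    recall = sum(
        _nearest_distance(g, pred_cloud) < tau for g in gt_cloud
    ) / len(gt_cloud)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Back-projecting a depth map that is offset by a constant from the GT moves every point along its viewing ray, so both precision and recall can drop to zero for a tight `tau` even though the ordinal depth relations are preserved. This is consistent with the shift effect discussed above.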

Inference of LeReS
We predict on our images using the released pre-trained model of LeReS (Yin et al., 2021), a current state-of-the-art method for recovering the 3D structure of the scene from a single image.
Although the method has demonstrated satisfying results in depth estimation and 3D reconstruction on various benchmarks in the original work, we observe that it does not generalize particularly well to our data. Depth maps seem to have kept the ordinal relations; however, the absolute and minimum depth values are not consistent with the GT (Figures 3 and 4), a fact also reflected in the low 2D scores in Table 1. The 3D reconstruction behaves similarly, with evident distortions and scale inconsistencies (Figures 5 and 7).

CONCLUSIONS
This paper proposes a new dataset for monocular depth prediction, composed of 24K images of outdoor scenes with rich architectural details. Our dataset aims to provide high-quality metric depth benchmark data for training. We show its potential by training a straightforward encoder-decoder network and testing its robustness in predicting unseen views. The trained model was also fine-tuned using real-world images, typical for photogrammetric applications. Moreover, we employ a 3D reconstruction module to recover the shape of the scene using our predictions. Given that monocular depth estimation is by definition an ill-posed problem, such a reconstruction is not trivial without additional cues. Thus, despite the satisfying results on depth map prediction, improving the accuracy of the predictions in 3D space is an open challenge. There is indeed a need to shift the attention to 3D structure recovery and investigate further in this direction.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022, XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France