AERIAL TRIANGULATION WITH LEARNING-BASED TIE POINTS

Aerial triangulation (AT) has made outstanding progress in the last decades, and fully automated solutions for nadir and oblique images are now available. Image correspondences (tie points) are usually found with hand-crafted methods, such as SIFT or its variants, but in recent years many investigations and developments have promoted the use of machine and deep learning solutions within the photogrammetric processing pipeline. This paper explores learning-based methods for the extraction of tie points in aerial image blocks. The extracted correspondences are used to perform aerial triangulation (AT) and subsequently to generate dense point clouds. Two different datasets are used to compare conventional hand-crafted detector/descriptor methods with learning-based methods. Accuracy analyses are performed using GCPs as well as ground-truth LiDAR point clouds. Results confirm the potential of learning-based methods to find reliable image correspondences in aerial blocks, while still leaving room for improvement in handling camera rotations.


INTRODUCTION
Photogrammetry is one of the most widely used techniques for the determination of 3D metric information at various scales and from diverse imaging platforms (satellite, aerial, drone, terrestrial and underwater). The typical aerial photogrammetric workflow consists of the identification of image correspondences via sparse image matching, the estimation of the unknown camera parameters and 3D object coordinates (image triangulation) with a bundle adjustment (BA) method, the generation of dense point clouds via dense image matching (or Multi-View Stereo, MVS) and the production of by-products such as mesh models or orthophotos. Photogrammetric methods have always aimed to provide practical, reliable routines and solutions for everyday geospatial data generation, geometric processing, and semantic interpretation, even with manual intervention to keep accuracy as high as possible. For two decades the community has provided many automated algorithms, also based on Artificial Intelligence (AI), to speed up geospatial data generation and interpretation and to increase efficiency as well as robustness (Hartmann et al., 2015; Zhu et al., 2017; Becker et al., 2018; Gong and Ji, 2018; Yao et al., 2018; Liu et al., 2019; Griffiths and Boehm, 2019; Stathopoulou et al., 2019; Heipke and Rottensteiner, 2020; Huang et al., 2018; Shan et al., 2020; Chen et al., 2020a; Oezdemir et al., 2021; Qin and Gruen, 2021; Remondino et al., 2021). There is certainly still a hype around deep learning in research activities and in the media, but these methods are a genuine treasure trove for innovation in the geospatial field. Following this momentum, this work investigates the use of learning-based algorithms for the extraction of tie points in aerial image blocks. The work applies learning-based methods to full-size aerial images and highlights their performance in aerial triangulation (AT) and, subsequently, in the generation of dense point clouds. For comparison, a state-of-the-art hand-crafted method is evaluated on the same datasets.

RELATED WORK
AT has achieved remarkable improvement in the last decades, and fully automated solutions for both nadir and oblique images are now available (Rupnik et al., 2013; Rupnik et al., 2015; Maset et al., 2021). Automated AT based on a bundle block adjustment moved from point-based to feature-based methods, including also linear features (Habib et al., 2002; Schenk, 2004; Triggs et al., 2000). The identification of image correspondences (tie points, Figure 1) is traditionally performed using hand-crafted keypoint detectors and descriptors (Lowe, 2004; Bay et al., 2006; Alcantarilla et al., 2013; Bellavia et al., 2021). These hand-crafted approaches are based on a priori knowledge inspired by professional expertise and intuitive experience (Yao et al., 2021). Despite their good performance, open issues remain in the case of large perspective or temporal differences as well as scale and illumination changes between the images. In recent years, driven by rapid developments in deep learning networks, researchers have proposed various innovative learning-based solutions aiming to overcome the limitations of hand-crafted methods (Verdie et al., 2015; Jin et al., 2021). Such solutions include detect-then-describe approaches where the detector (Verdie et al., 2015; Savinov et al., 2017; Barroso et al., 2019; Truong et al., 2020) and the descriptor (Mishchuk et al., 2017; Tian et al., 2017; Mishkin et al., 2018; Ebel et al., 2019; Pautrat et al., 2020; Pultar, 2020; Parihar et al., 2021) can both be learned, or a combination of hand-crafted and learning-based components (Bellavia and Mishkin, 2021; Bellavia et al., 2022). Other approaches, called end-to-end, jointly optimize the entire pipeline to extract sparse image correspondences, e.g., LIFT (Yi et al., 2016), LF-Net (Ono et al., 2018), SuperPoint (DeTone et al., 2018), R2D2 (Revaud, 2019), D2-Net (Dusmanu et al., 2019), ASLFeat (Luo et al., 2020), SuperGlue (Sarlin et al., 2020) and DISK (Tyszkiewicz et al., 2020). End-to-end methods were demonstrated to increase both keypoint repeatability and reliability and, consequently, the image matching success rate and the final pose estimation accuracy. More recently, various researchers (Choy et al., 2016; Rocco et al., 2018; Li et al., 2020) proposed end-to-end detector-free local feature matching methods that remove the feature detection phase and directly produce dense descriptors or dense feature matches. Among these, Sun et al. (2021) created the LoFTR approach based on the Transformer architecture (Vaswani et al., 2017): instead of performing image feature detection, description, and matching sequentially, it establishes pixel-wise dense matches at a coarse level and later refines the good matches at a fine level. The use of learning-based methods to automatically orient image blocks has so far been applied primarily to terrestrial datasets (Schonberger et al., 2017; Bojanić et al., 2019; Jin et al., 2021), with very few experiments on UAV datasets (Bellavia et al., 2022) and on modern (Chen et al., 2020b) and historical (Ressl et al., 2020; Zhang et al., 2021) aerial blocks. This is mainly because most of the existing deep architectures for tie point extraction are not suitable for general-purpose photogrammetric applications, particularly aerial blocks, due to their limitations in handling large image sizes, small scales and camera rotations among strips.

Considered methods
Initial analyses on state-of-the-art hand-crafted and deep learning methods were performed to understand rotation and scale invariance issues in the case of aerial views (Figure 2). RootSIFT (Arandjelović and Zisserman, 2012) was chosen to represent the hand-crafted family as it proved to be the most reliable and versatile solution (Schonberger et al., 2017). On the other hand, among the available learning-based solutions, we considered two rotation-invariant frameworks: LF-Net (Ono et al., 2018) as an end-to-end architecture and KeyNet (Barroso et al., 2019) coupled with AffNet (Mishkin et al., 2018) and HardNet (Mishchuk et al., 2017), available in the Kornia library (Riba et al., 2020), as a detect-then-describe approach. Both frameworks showed good performances in previous evaluations (Bellavia et al., 2022), accommodating various scenarios and contexts. They also seem to be suitable for retraining processes to include photogrammetric scenarios. Moreover, to the best of the authors' knowledge, they are among the very few methods which are partially invariant to camera rotations.
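As a pointer for the reader, the detect-then-describe combination named above can be assembled with a few lines of Kornia code. The following is a minimal sketch, not the exact configuration used in our experiments: the KeyNetAffNetHardNet convenience class and the match_smnn matcher are available in recent Kornia releases (names may differ across versions), and the image paths are placeholders.

```python
import cv2
import torch
import kornia.feature as KF

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Pretrained KeyNet detector + AffNet shape estimator + HardNet descriptor
feature = KF.KeyNetAffNetHardNet(num_features=10000, device=device).eval()

def extract(path: str):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # (1, 1, H, W) float tensor in [0, 1], as expected by Kornia local features
    img = torch.from_numpy(gray)[None, None].float().to(device) / 255.0
    with torch.inference_mode():
        lafs, _, descs = feature(img)  # local affine frames, responses, 128-D descriptors
    return lafs, descs

lafs1, descs1 = extract("img_a.tif")  # placeholder file names
lafs2, descs2 = extract("img_b.tif")

# Symmetric mutual-nearest-neighbour matching with a ratio threshold
dists, idxs = KF.match_smnn(descs1[0], descs2[0], th=0.95)
```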

Image tiling approach
As learning-based methods demand considerable computational resources and can generally handle only small image sizes, a tiling approach is proposed in order to extract tie points in the full-resolution images. Keypoints are normally not detected along the perimeter of the images due to the padding used during convolutions. Therefore, to avoid keypoint-free bands between adjacent tiles, which would yield a non-uniform keypoint distribution over the entire image, tiles (2500×2500 pixels) are overlapped vertically and horizontally by some 30 pixels (see the sketch below). Features are detected and described on these tiles; the tiles are then reassembled for the matching and verification steps.
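The following Python sketch illustrates the tiling scheme just described. It is a simplified illustration, not the authors' implementation; the tile size and overlap follow the values given in the text, and images are assumed to be at least one tile large.

```python
# Generate top-left corners of overlapping tiles covering a full-size image.
# Keypoints found in a tile are shifted back to full-image coordinates as
# (u_full, v_full) = (u_tile + x, v_tile + y).
def tile_origins(width: int, height: int, tile: int = 2500, overlap: int = 30):
    step = tile - overlap  # advance so that adjacent tiles share 'overlap' pixels
    xs = list(range(0, max(width - tile, 0) + 1, step))
    ys = list(range(0, max(height - tile, 0) + 1, step))
    # Ensure the right and bottom image borders are always covered
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    for y in ys:
        for x in xs:
            yield x, y

# Example: enumerate tiles for a 10328x7760 pixel aerial frame
for x, y in tile_origins(10328, 7760):
    pass  # crop image[y:y+2500, x:x+2500] and run detection/description
```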

Datasets
Two different sets of aerial images (Table 1) are employed to test the capabilities of learning-based methods within AT processes and their influence on the generation of dense point clouds: the ISPRS/EuroSDR Dortmund benchmark (Nex et al., 2015) and the Dublin benchmark (Ruano and Smolic, 2021). These urban datasets were chosen due to their complementarity in terms of acquisitions, resolution, and ground truth (GT). Both feature nadir and oblique images with varying GSD (and image scale), depicting complex urban scenarios.

Processing pipeline
The AT process consists of feature detection and description (Section 3.1), feature matching, geometric verification, and a final bundle adjustment (BA). The number of detected keypoints was set to around 10,000 per image, while descriptors consist of 128 (RootSIFT and HardNet) or 256 (LF-Net) elements. The OpenCV brute-force matcher with L2 distance is used, albeit slow, to handle descriptors of variable sizes and ensure a fair comparison between methods (see the sketch below). Matches are then imported into COLMAP for geometric verification and the final BA.
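A minimal sketch of this matching step is given below. It uses the OpenCV brute-force matcher with L2 distance as stated above; the Lowe-style ratio test applied to the k-nearest-neighbour matches is an illustrative assumption, not necessarily the filtering used in the experiments.

```python
import cv2
import numpy as np

def match_l2(desc1: np.ndarray, desc2: np.ndarray, ratio: float = 0.8):
    """Brute-force L2 matching of two float32 descriptor sets (any dimension)."""
    bf = cv2.BFMatcher(cv2.NORM_L2)
    knn = bf.knnMatch(desc1.astype(np.float32), desc2.astype(np.float32), k=2)
    # Ratio test (assumed here): keep matches whose best distance is clearly
    # smaller than the second-best distance
    return [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
```

Because the matcher works on raw descriptor arrays, the same code serves the 128-D RootSIFT/HardNet and the 256-D LF-Net descriptors without modification.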

Evaluation protocol
In geomatic applications, it is essential to test algorithm performance with metrics specifically tailored for the 3D object space. In our evaluations, the accuracy of tie point extraction methods is assessed based on:
- RMSEs on GCPs/CPs;
- multiplicity/redundancy (Mean Track Length, MTL);
- cloud-to-cloud comparison with respect to LiDAR ground truth;
- point cloud completeness/accuracy.
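For clarity, the first metric amounts to the per-axis root mean square error between estimated and surveyed coordinates. A minimal sketch (illustrative, with hypothetical array inputs):

```python
import numpy as np

def rmse_per_axis(est: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """est, gt: (N, 3) arrays of estimated and surveyed GCP/CP coordinates.

    Returns the (RMSE_X, RMSE_Y, RMSE_Z) triple, e.g. in metres.
    """
    return np.sqrt(np.mean((est - gt) ** 2, axis=0))
```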

Dortmund dataset
Different sets of interior orientation parameters are used, and the available 12 GCPs (6 targets and 6 natural points) lead to the RMSEs shown in Table 2. Results show that learning-based methods are still slightly worse than RootSIFT. To support the BA metrics provided by COLMAP, Agisoft Metashape is also used, confirming the obtained values. Using the AT results with the smallest RMSEs (Figure 3a), dense point clouds are derived and compared to the available LiDAR GT (average surface density of ca 10 pts/sqm, Figure 3b). The cloud-to-cloud analyses are reported in Figure 4 and Table 3.
Table 3: Mean and standard deviation of the cloud-to-cloud differences; the second value is the average over the 4 sub-areas including only buildings (Figure 3b).
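In principle, the figures in Table 3 correspond to simple nearest-neighbour distance statistics between the photogrammetric cloud and the LiDAR reference. The sketch below illustrates this idea only; dedicated point cloud tools typically use more refined local-model distances, and scipy is an assumed dependency.

```python
import numpy as np
from scipy.spatial import cKDTree

def c2c_stats(cloud: np.ndarray, reference: np.ndarray):
    """cloud, reference: (N, 3) and (M, 3) point arrays in metres.

    Returns mean and standard deviation of the distance from each point of
    'cloud' to its nearest neighbour in 'reference'.
    """
    d, _ = cKDTree(reference).query(cloud)
    return d.mean(), d.std()
```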

Dublin dataset
AT results for the Dublin dataset are given in Table 4. As no GCPs are provided in the benchmark, metrics are reported in image space only. Interestingly, LF-Net provides many more points with multiplicity 2, albeit the average MTL is similar for both methods. Subsequently, the images are further processed to generate dense point clouds. Figure 5 reports color-coded views of the cloud-to-cloud assessments, while Table 5 and Table 6 report the corresponding accuracy values (a sketch of the Table 6 metrics is given below). These analyses do not reveal significant differences originating from the different AT input data.
Table 6: Precision (accuracy), recall (completeness) and F1 scores for tolerance τ = 0.5 m.
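The precision/recall/F1 metrics of Table 6 can be computed along the following lines: a reconstructed point counts as correct if it lies within the tolerance τ of the ground-truth cloud (precision/accuracy), and a ground-truth point counts as covered if it lies within τ of the reconstruction (recall/completeness). This is an illustrative sketch under those standard definitions; scipy is an assumed dependency.

```python
import numpy as np
from scipy.spatial import cKDTree

def cloud_f1(rec: np.ndarray, gt: np.ndarray, tau: float = 0.5):
    """rec, gt: (N, 3) and (M, 3) point clouds in metres; tau in metres."""
    d_rec, _ = cKDTree(gt).query(rec)   # reconstructed point -> nearest GT point
    d_gt, _ = cKDTree(rec).query(gt)    # GT point -> nearest reconstructed point
    precision = np.mean(d_rec < tau)    # accuracy
    recall = np.mean(d_gt < tau)        # completeness
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```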

CONCLUSIONS
The paper presented an investigation of learning-based methods to extract tie points in aerial image blocks. AT and MVS results revealed that deep learning can also be a valuable way to find reliable and accurate image correspondences in aerial datasets. The accuracy values deliver a clear message: AT can be performed with both hand-crafted and learning-based methods under common aerial survey conditions, even if the real potential of learning-based methods lies in managing aerial datasets whose images are difficult to co-register correctly due to strong variations in appearance, in particular multi-temporal datasets (Bellavia et al., 2022b; Farella et al., 2022). Moreover, most of these deep architectures still suffer when high camera rotations are present in the datasets. Researchers have so far primarily solved this problem by manually rotating the images to share the same orientation (Jin et al., 2020), although new methods have been developed to match images under large camera rotations (Parihar et al., 2021; Bellavia et al., 2022a). We believe that deep learning will offer more valuable solutions for photogrammetry in the near future, inspiring and impacting research in our field through collaboration with colleagues in neighbouring disciplines.