AN ACCURATE REAL-TIME UAV MAPPING SOLUTION FOR THE GENERATION OF ORTHOMOSAICS AND SURFACE MODELS

UAVs have become an indispensable tool for a variety of mapping applications, not only in surveying, infrastructure planning and environmental monitoring but also in time-critical applications such as emergency and disaster response. Although UAVs enable rapid data acquisition per se, data processing usually relies on offline workflows. This contribution presents an accurate real-time data processing solution for UAV mapping applications as well as an extensive experimental study that compares the absolute accuracy of its orthomosaics and digital surface models with the commercial offline solution Pix4D. We show that our procedure achieves an absolute horizontal and vertical accuracy of about 1 m without the use of ground control. The code will be made publicly available.


INTRODUCTION
In recent years, unmanned aerial vehicles (UAV) have become an important asset for rapid information retrieval for a wide array of applications, such as infrastructure inspection, environmental monitoring or disaster response (Ejaz et al., 2019; Erdelj et al., 2017; Kerle et al., 2019).
UAVs provide high agility, flexibility and fast data capture, whereas the respective data processing chains, i.e. image orientation, 3D reconstruction and ortho-generation, are usually performed in post-processing. This traditional pipeline is suited for scenarios which require high accuracy and excellent reconstruction quality. On the other hand, quick overviews of a scene, disaster or emergency response and other time-critical applications benefit greatly from real-time capable processing workflows (Bu et al., 2016; Hein et al., 2019; Hinzmann et al., 2018). This paper extends our previous work on this topic (Fanta-Jende et al., 2020; Kern et al., 2020) by integrating and comparing various state-of-the-art SLAM implementations (ORB-SLAM3 (Campos et al., 2020), OV²SLAM (Ferrera et al., 2021) and OpenVSLAM (Sumikura et al., 2019)) and by assessing their impact on the quality and accuracy of the generated data products, i.e. orthomosaics and surface models.

RELATED WORK
Generating orthomosaics from a sequence of aerial images is a well-studied subject. Typically, a very distinct photogrammetric workflow is utilised that involves feature extraction and matching, camera pose estimation as well as sparse and dense reconstruction of the scene before the final orthomosaic is created. Open source and commercial implementations of such workflows are diverse (e.g. Pix4D, Agisoft Metashape, Colmap) and achieve high accuracies with the proper hardware and acquisition techniques involved. However, these implementations are designed for post-processing applied to the images after a flight.
When time is crucial and maps must be obtained while the UAV is still in the air, the available solutions are limited. The work of Botterill et al. (2010) represents the traditional real-time approach, which avoids costly computations of 3D camera motion by aligning images into a mosaic with 2D projective homographies. This approach, however, assumes the ground to be planar. The assumption is violated by surface elevation, e.g. at low altitudes, in mountainous areas or in urban canyons, which results in distorted maps. To overcome this limitation, Bu et al. (2016) used a modern visual SLAM to track the full camera movement and consequently project images into a common reference plane to create a global image mosaic. While this strategy is more robust when observing an elevated surface, the projection into a plane remains a simplification. In contrast, Hein et al. (2019) utilise an a-priori DEM to account for the elevation and rectify the images with a pinhole model. This reduces computational requirements significantly while still allowing for the generation of accurate maps. However, pre-existing DEMs are limited in availability and resolution (e.g. SRTM), and rapid changes due to e.g. natural disasters (earthquakes, floods, …) or construction works can alter the observed scene significantly. By tracking the camera using sensor fusion and reconstructing the surface during flight, Hinzmann et al. (2018) propose a solution that is able to provide the same 2D and 3D information as traditional offline photogrammetry while still achieving real-time performance. Our previous work OpenREALM (Kern et al., 2020) followed a similar approach but modularised important steps of the processing pipeline. Consequently, not only are the camera trajectory, a digital surface model (DSM) and an orthomosaic extracted in real-time, but the framework can also be utilised as a testbed for modern visual SLAM and rapid 3D reconstruction applied to aerial mapping.
Even though the initial results were promising, an extensive study on the accuracy of the final data products was still open and is now covered in this contribution.

Figure 1
Structure of OpenREALM. It follows a pipeline design, separating different tasks into individual stages that are processed in parallel. An incoming frame moves from left (Pose Estimation) to right (Mosaicing).

METHODOLOGY
In the remainder of this work, we will analyse a variety of quality metrics for orthomosaics generated in real-time with OpenREALM. The workflow of the framework and its structure is shown in Figure 1 and explained briefly in the following. For further details please see Kern et al. (2020).

Processing modules
The input data consists of the GNSS observations, a heading estimate and the (aerial) images. This sensor data is fed into the system using a ROS¹ interface. Subsequently, the data is wrapped into a data frame which is passed into a series of individual stages. Each stage is encapsulated from the rest and processed in parallel allowing for fast overall performance of the pipeline.

Pose estimation
The first stage is the pose estimation, which tracks the camera motion by feeding the frames through an interface into a visual SLAM implementation of choice. Due to the monocular nature of the problem, scale is arbitrary. Consequently, the resulting pose is estimated in a local coordinate system. For every frame a local pose and an unsynchronised GNSS measurement are buffered.
Once a transformation between the local and geographic coordinate system can be computed using the Umeyama algorithm (Umeyama, 1991), this transformation is fixed and saved as a georeference. Depending on the user preference, this georeference can be refined with every new frame resulting in a higher absolute accuracy of the map, or it is fixed to prioritise the relative integrity of the map.
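The georeferencing step can be illustrated with a minimal sketch of the closed-form similarity estimation of Umeyama (1991), assuming that the buffered local SLAM positions and the GNSS positions (already converted to a metric Cartesian frame) are available as matching rows of two arrays. The function name `umeyama` and its interface are illustrative and not taken from OpenREALM.

```python
import numpy as np


def umeyama(src, dst):
    """Estimate scale c, rotation R and translation t with dst ~ c * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (N >= 3, not collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if the plain solution would be a reflection.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    c = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - c * R @ mu_src
    return c, R, t
```

Because the scale of a monocular SLAM is arbitrary, the estimated factor `c` is what brings the local trajectory to metric units. Re-running the estimation with every new frame corresponds to the refinement option described above, while keeping the first result corresponds to the fixed georeference.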

Densification
In the next stage, frames are passed to the densification module. Here, the Plane Sweep Library (Häne et al., 2015) is integrated as an external framework to reconstruct depth from multiple views. As a result, dense depth maps are computed. The depth maps are filtered by checking the consistency of each depth value across multiple frames. A pixel depth hypothesis is considered valid only if at least two other frames support it.

¹ ROS, https://www.ros.org/
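The consistency filter of the densification stage can be sketched as follows. This is a simplified illustration that assumes the depth maps of the other frames have already been warped into the view of the reference frame; the relative tolerance `rel_tol` is a hypothetical parameter, and only the minimum support of two frames comes from the text.

```python
import numpy as np


def filter_depth(ref_depth, other_depths, rel_tol=0.01, min_support=2):
    """Keep reference depths confirmed by at least `min_support` other frames.

    ref_depth:    (H, W) depth map of the reference frame.
    other_depths: list of (H, W) depth maps from other frames, assumed to be
                  already warped into the reference view; NaN marks no data.
    Returns a copy of ref_depth with unsupported pixels set to NaN.
    """
    ref = np.asarray(ref_depth, dtype=float)
    support = np.zeros(ref.shape, dtype=int)
    for other in other_depths:
        other = np.asarray(other, dtype=float)
        # A frame supports a pixel if its depth agrees within the tolerance.
        # Comparisons against NaN evaluate to False, so missing data never counts.
        agrees = np.abs(other - ref) <= rel_tol * ref
        support += agrees

    filtered = ref.copy()
    filtered[support < min_support] = np.nan
    return filtered
```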

Surface generation
Depending on the user settings and the computations of previous stages, frames that enter the surface generation are scanned for a surface assumption. If dense depth maps are available, a grid structure is created in which each cell contains an elevation value. The resolution of the grid depends on the average point-to-point distance of the dense cloud, while the corresponding elevation values are computed with inverse distance weighting. This grid structure represents the digital surface model that is further used for ortho-generation.
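Inverse distance weighting on a regular grid can be sketched as below. This is an illustrative brute-force version rather than the implementation used in the framework: the cell size is passed in directly instead of being derived from the point spacing, and the search `radius` and the exponent `power` are hypothetical parameters.

```python
import numpy as np


def idw_grid(points, cell_size, power=2.0, radius=None):
    """Rasterise (x, y, z) points into an elevation grid using inverse distance weighting.

    Returns a (rows, cols) array; cells without any point in `radius` stay NaN.
    """
    pts = np.asarray(points, dtype=float)
    x_min, y_min = pts[:, 0].min(), pts[:, 1].min()
    cols = int(np.floor((pts[:, 0].max() - x_min) / cell_size)) + 1
    rows = int(np.floor((pts[:, 1].max() - y_min) / cell_size)) + 1
    if radius is None:
        radius = 2.0 * cell_size

    grid = np.full((rows, cols), np.nan)
    for r in range(rows):
        for c in range(cols):
            # Cell centre in world coordinates.
            cx = x_min + (c + 0.5) * cell_size
            cy = y_min + (r + 0.5) * cell_size
            d = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)
            near = d <= radius
            if not near.any():
                continue
            # Closer points receive larger weights; clamp to avoid division by zero.
            w = 1.0 / np.maximum(d[near], 1e-9) ** power
            grid[r, c] = np.sum(w * pts[near, 2]) / np.sum(w)
    return grid
```

The nested loop keeps the weighting explicit; a real-time implementation would typically restrict the neighbour search with a spatial index.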

Orthorectification
Once an incremental DSM for the current input frame is generated, it is passed to the orthorectification module. Similar to the approach proposed by Hinzmann et al. (2018), a 3D point is created for every cell of the DSM and projected back into the camera using a pinhole model. Consequently, visual distortions induced by the elevated surface are removed and an incremental orthoimage is computed.
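The back-projection of DSM cells into the camera can be illustrated with a plain pinhole model. The sketch assumes an undistorted image and a world-to-camera pose convention `X_cam = R @ X_world + t`; the intrinsics and the nadir pose in the example are hypothetical values chosen for readability.

```python
import numpy as np


def project_pinhole(points_world, K, R, t):
    """Project (N, 3) world points into pixel coordinates.

    R, t map world to camera coordinates: X_cam = R @ X_world + t.
    Returns (N, 2) pixel coordinates (NaN behind the camera) and a visibility mask.
    """
    pts = np.asarray(points_world, dtype=float)
    cam = pts @ R.T + t
    in_front = cam[:, 2] > 0
    uv = np.full((len(pts), 2), np.nan)
    proj = cam[in_front] @ K.T
    uv[in_front] = proj[:, :2] / proj[:, 2:3]
    return uv, in_front


# Hypothetical nadir camera 100 m above the origin, looking straight down.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 600.0], [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])
t = -R @ np.array([0.0, 0.0, 100.0])

# Three DSM cells: below the camera, 10 m east at ground level, 10 m east at 20 m height.
cells = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [10.0, 0.0, 20.0]])
uv, visible = project_pinhole(cells, K, R, t)
```

In this example the two cells at the same horizontal position land on different image columns because of their height difference. Sampling the image colour at the projected position of each cell is what removes this relief displacement from the orthoimage.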

Mosaicing
In the final stage, incoming frames containing the DSM and orthoimage are fused. To this end, a region of interest that is affected by the latest frame update is extracted from the existing mosaic. Areas with no prior data are directly written. Those areas with existing data are blended by prioritising points with an incidence angle of 90° to achieve a higher degree of orthogonality. Textural blending, e.g. using seam cuts, is currently not implemented. Results are published and can be visualised in real-time as well.
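The update rule for a region of interest can be sketched as a per-cell selection. This is a deliberately simplified stand-in for the blending described above: it keeps, for every cell, the observation whose incidence angle is closest to 90° and writes cells without prior data directly. The function name and the array layout are assumptions made for illustration.

```python
import numpy as np


def merge_into_mosaic(mosaic, mosaic_angle, frame, frame_angle):
    """Merge a frame into a mosaic region of interest, cell by cell.

    All arrays share the same shape; NaN marks cells without data.
    Angles are incidence angles in degrees (90 = viewed straight from above).
    Returns updated copies of the mosaic values and their incidence angles.
    """
    mosaic = np.array(mosaic, dtype=float)
    mosaic_angle = np.array(mosaic_angle, dtype=float)
    frame = np.asarray(frame, dtype=float)
    frame_angle = np.asarray(frame_angle, dtype=float)

    has_new = ~np.isnan(frame)
    empty = np.isnan(mosaic)
    # The new observation is closer to a 90 degree incidence angle than the stored one.
    better = np.abs(90.0 - frame_angle) < np.abs(90.0 - mosaic_angle)
    take = has_new & (empty | better)

    mosaic[take] = frame[take]
    mosaic_angle[take] = frame_angle[take]
    return mosaic, mosaic_angle
```

The same rule applies to elevation cells and to each colour channel of the orthoimage, provided the incidence angle is stored alongside the mosaic.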

Data products
All data products of traditional photogrammetry pipelines can be computed using OpenREALM. An overview with the processing stage and the corresponding data product is shown in Figure 2. The pose estimation module runs visual SLAM, so that the 3D camera trajectory and a sparse point cloud can be exported. The densification module implements external stereo reconstruction in order to save depth maps for all incoming frames. The surface generation stage creates a 2.5D digital surface model, which is further used for orthorectification in the subsequent stage. Both DSM and orthomosaic are typically available in the GeoTIFF format. The mosaicing module wraps up all incremental data and is therefore designed to provide exports of all prior data products but at a global scale.

Figure 2
OpenREALM provides exports for most traditional photogrammetry data products. It allows for the extraction of the camera trajectory in 3D, sparse and dense point clouds, depth maps for all keyframes, 2.5D digital elevation models and orthomosaics, all of which are fully georeferenced.

Visual SLAM Varieties
OpenREALM is highly modular and can be easily extended. For this paper, we implemented interfaces for three state-of-the-art visual SLAM frameworks and analysed their feasibility for the aerial mapping scenario at hand.
The first is ORB-SLAM3 (Campos et al., 2020), whose predecessor set new standards in terms of accuracy and robustness in the computer vision community. It is a feature-based approach that closely reproduces the traditional photogrammetry workflow. However, bundle adjustment is only applied to local subsets of keyframes to reduce computational load. Global integrity is achieved with a loop closing module that constantly checks for revisited areas using a bag-of-words model.
The second is OpenVSLAM (Sumikura et al., 2019), which is based on the same algorithms as ORB-SLAM but implements them in a more efficient way to further reduce computational load while tracking is improved overall.
Lastly, we integrated OV²SLAM (Ferrera et al., 2021) into the pipeline. It combines indirect, feature-based techniques for keyframes with a direct, photometric approach for frames in between. This enables robust tracking even when observing homogeneous surfaces while allowing for the usage of traditional optimisation and loop closing based on 2D features and a bag-of-words model.

Platform and hardware design
Our platform design aims at providing data acquisition in a quick and robust manner, especially for disaster response scenarios using a fixed-wing platform. The data for this paper was acquired using an experimental UAV platform based on a Skywalker EVE-2000 equipped with a FLIR BFLY-U3-23S6C-C camera featuring 2.3 MP. The camera is mounted on a gimbal compensating for deviations in the roll angle, which is especially useful for fixed-wing UAVs.
With a framerate of 8 images per second, high image overlap can be achieved at the standard velocity of 16 m/s of the platform. The system is equipped with a GNSS-RTK solution consisting of a Here3 receiver and a Here+ RTK base.
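The relation between frame rate, flight speed and image overlap can be made explicit with a short calculation. The frame rate and velocity are taken from the text, whereas the ground sampling distances and the number of pixels along the flight direction are hypothetical example values.

```python
def forward_overlap(speed_mps, frame_rate_hz, gsd_m, pixels_along_track):
    """Return the fraction of ground shared by two consecutive frames."""
    baseline = speed_mps / frame_rate_hz      # distance flown between two frames
    footprint = gsd_m * pixels_along_track    # ground length covered by one frame
    return max(0.0, 1.0 - baseline / footprint)


# 16 m/s at 8 frames per second gives a baseline of 2 m between frames.
baseline_m = 16.0 / 8.0

# Hypothetical sensor and GSD values: a smaller GSD (lower altitude) shrinks the
# footprint and therefore the overlap at an unchanged baseline.
overlap_high = forward_overlap(16.0, 8.0, gsd_m=0.05, pixels_along_track=1200)
overlap_low = forward_overlap(16.0, 8.0, gsd_m=0.03, pixels_along_track=1200)
```

Since the baseline is fixed by speed and frame rate, the overlap depends only on the footprint, which scales with the flying altitude.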
In its final stage, not only will the platform be changed to a more durable UAV, but the processing pipeline will also be split into an air and a ground segment. For instance, pose estimation could be performed on the UAV using the onboard processing capabilities of an NVIDIA Jetson TX2, while keyframes are sent to a ground station equipped with desktop-grade hardware for further and rather expensive processing stages, such as surface reconstruction.

Data acquisition
For the experimental setup presented in this paper, the data has been acquired in four separate flights at different altitudes: 80, 100, 120 and 150 metres above ground. Although these altitudes result in different ground sampling distances (GSD, see Table 1), all data products were processed with a grid size of 15 cm. The acquisition area is a rather flat and rural plot spanning roughly 400 by 300 metres south of Vienna, Austria. To ascertain the absolute accuracy of our mapping solution, 27 ground control points (GCP) have been surveyed using a GNSS rover.

Data preprocessing
To demonstrate the real-time capability of the presented processing pipeline, all data sets have been recorded in rosbags comprising the image frames, their extrinsic orientation parameters as well as precise timestamps. Since the entire pipeline currently runs on desktop hardware and will be split into an air and a ground segment at a later stage in the project, rosbags are an ideal tool to simulate this real-time behaviour.
For generating reference data, the standard photogrammetric workflow in Pix4D was used. The settings were the following for all data sets: camera orientation: ½ image size; point cloud generation: ½ image size (standard), optimal point density and a minimum of observations per point set to three; DSM and ortho-generation: all settings set to standard except GSD of 15 cm. ROS timestamps were used to associate image frames and the respective exterior orientation parameters to export the images with EXIF headers. The interior parameters, which are required for the underlying SLAM pipeline of OpenREALM, were determined with Pix4D as well.

Data and methods for accuracy assessment
The aim of this study is to ascertain the absolute accuracy of the data products, i.e. orthomosaics in the horizontal dimension and digital surface models in the vertical dimension, generated by the proposed OpenREALM framework using different SLAM pipelines.
To this end, orthomosaics and DSMs were computed with three underlying SLAM pipelines (OpenVSLAM, ORB-SLAM3 and OV²SLAM) using data from different altitudes (80 m, 100 m, 120 m and 150 m). The SLAM pipeline has a tremendous impact on the feasibility of the DSM and orthomosaic generation. For instance, lower flying altitudes lead to smaller frame overlaps, which have an impact on the tracking capacity of the SLAM pipeline. Consequently, ORB-SLAM3 and OV²SLAM could not track visual features at 80 m and 100 m altitude, respectively, rendering a DSM and orthomosaic generation infeasible. Although ORB-SLAM3 offers the integration of inertial readings, it was used in vision-only mode to allow for a fair comparison.
Before evaluation, every visual SLAM was tuned inside the OpenREALM pipeline to maximise the overall performance and minimise the ground sampling distance while still maintaining a (near) real-time output. Resolution was prioritised over speed in order to be able to detect the GCPs in the final orthomosaic. All maps were fully processed within a minute after the respective mission was finished. Once a set of parameters was found for a SLAM, these were locked and used for all altitudes of the dataset in the evaluation. This ensures high quality results but prevents overfitting of the underlying models.
Pix4D was used to generate DSMs and orthomosaics for reference. Data from all altitudes has been processed without the use of GCPs in Pix4D. These data products allow the most direct comparison to the real-time framework, since resorting to ground control is unrealistic in the time-critical mapping scenarios this procedure was designed for. To establish a baseline of the highest potential accuracy of the acquired data, an orthomosaic and a DSM from the 120 m flight were computed in Pix4D using 10 GCPs.

Table 3 Generated data products in Pix4D

Real-time processing / Test data set
The horizontal accuracy of the orthomosaics was assessed using all 27 GCPs as check points in the area. In the case of the Pix4D orthomosaic and DSM processed with ground control, the remaining 17 GCPs were used as check points. The vertical accuracy of the real-time data products was determined by sampling equally distributed points every 5 metres along the vertical and horizontal coordinate axes to establish a measuring grid. As a reference, a DSM from the 120 m data set computed by Pix4D with ground control is used.
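The vertical assessment can be sketched as a regular 5 m sampling of two surface models followed by summary statistics of the height differences. The code is an illustration only: both surface models are represented by hypothetical lookup functions, and the extent in the usage comment merely mirrors the approximate size of the test area.

```python
import math


def sample_grid(x_min, y_min, x_max, y_max, spacing=5.0):
    """Return (x, y) sample positions on a regular grid with the given spacing."""
    nx = int((x_max - x_min) // spacing) + 1
    ny = int((y_max - y_min) // spacing) + 1
    return [(x_min + i * spacing, y_min + j * spacing)
            for j in range(ny) for i in range(nx)]


def height_error_stats(dsm, reference, positions):
    """Summarise height differences (dsm - reference) at the sample positions.

    dsm, reference: callables (x, y) -> height in metres, or None where no data.
    Assumes that at least one position is covered by both models.
    """
    errors = []
    for x, y in positions:
        h, h_ref = dsm(x, y), reference(x, y)
        if h is None or h_ref is None:
            continue  # skip positions not covered by both models
        errors.append(h - h_ref)
    n = len(errors)
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return {"n": n, "mean": mean, "std": std, "rmse": rmse,
            "min": min(errors), "max": max(errors)}


# Usage with a hypothetical 400 m x 300 m extent:
# stats = height_error_stats(my_dsm, my_reference, sample_grid(0, 0, 400, 300))
```

With this definition of the standard deviation, the squared RMSE equals the squared mean error plus the variance. A DSM that differs from the reference by a constant offset therefore shows a standard deviation of zero and an RMSE equal to the magnitude of the mean error.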

RESULTS
This section presents quantitative accuracy results of the procedure as well as exemplary figures of the data products for visual comparison.

Real-time orthomosaic
The results are shown in the five tables below separated according to the underlying SLAM implementation and the reference data sets for comparison respectively. The reference data set processed without the use of ground control indicates an absolute accuracy of around 0.5 m with the exception of the data set acquired at an altitude of 100 m (Table 5). Using GCPs improves the absolute accuracy of the data set significantly as shown in Table 4.

Pix4D reference (with GCPs), altitude 120 m, accuracy checked with 19 check points

Table 5 Absolute accuracies of the reference orthomosaic processed in Pix4D without ground control

It becomes evident that the absolute accuracy varies significantly depending on the SLAM pipeline. ORB-SLAM3 performs worst in our experiments (Table 6). Further studies will show whether this still holds when inertial readings are integrated into the procedure. The vision-only mode in conjunction with our pipeline reaches metre-grade accuracy at best. For the case of rapid mapping, however, these accuracies may be sufficient.

ORB-SLAM 3
Table 6 Absolute accuracies of the orthomosaic processed with ORB-SLAM3 as the underlying framework. Entries in bold are the lowest RMSE among the OpenREALM results in that dimension.

OpenVSLAM is the only SLAM framework which is able to track features reliably at 80 m altitude, although the absolute accuracy deteriorates in comparison to higher altitudes. OpenVSLAM reaches accuracy levels on par with the Pix4D data set at 100 m altitude (Table 7).

OpenVSLAM
Comparing the orthomosaics on a visual basis paints a different picture. Figure 4 depicts an image detail at the south-west edge of the scene comprising a straight road leading to a junction surrounded by greenery. The underlying SLAM framework has an impact not only on the accuracy but also on the consistency of the orthomosaic: the higher the relative accuracy, the better the representation of the scene.

Figure 4
Comparison of an image detail between orthomosaics from the data set at 120 m. From the top left in clockwise order: OpenVSLAM, ORB-SLAM3, Pix4D reference data set processed with ground control and OV²SLAM.

The processing pipeline based on OpenVSLAM (top left in Figure 4) produces a jittery appearance of the straight road, while the ORB-SLAM3-based procedure (top right) does not result in a perfectly straight road either. OV²SLAM (bottom left), however, preserves the straight geometry of the road similar to the reference orthomosaic at the bottom right. Moreover, the colouring differs between the real-time approaches and Pix4D. This is due to the different blending procedures used to stitch the orthorectified image tiles together.

Real-time Digital Surface Model
A quantitative comparison of accuracy in height was performed based on equally sampled points with a distance of 5 metres across the entire scene (see Table 8). As a reference, the digital surface model generated by Pix4D using the 120 m data set with introduced ground control has been used.
The results show that our real-time approaches achieve similar vertical accuracies compared to the offline Pix4D processing pipeline without ground control. The procedure relying on ORB-SLAM3 fluctuates strongly as minimum and maximum errors are comparatively high. This indicates that the surface reconstruction is rather inconsistent.
On the other hand, OpenVSLAM reaches the lowest height error on average and the lowest standard deviation of the SLAM-based approaches. Again, OV²SLAM returns the most promising result among the three SLAM frameworks with an RMSE lower than the offline result.
The offline result has a very low standard deviation with a consistent mean and RMSE. This may indicate that there is a general shift in height compared to the reference surface model. In the future, aerial laser scanning data could be used to gain a more insightful understanding of the height accuracy.

Table 9 Comparison of height accuracies between the reference surface model generated by Pix4D with ground control, the three real-time data products as well as the Pix4D reference data set without ground control

Figure 5 depicts the surface models generated by OpenREALM as well as the reference model by Pix4D. The figure shows that the surface reconstruction quality varies heavily. The OpenVSLAM-based procedure returns an uneven and rough surface, which results in issues with preserving straight geometries (cf. Figure 4). The DSM based on the ORB-SLAM3 implementation tilts towards lower heights in the eastern part, which indicates a drift in scale. The reconstruction using OV²SLAM as a basis achieves the best representation of the scene.

Comparison of height accuracy
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B1-2021 XXIV ISPRS Congress (2021 edition)

Figure 5
A visual comparison of digital surface models between the three SLAM-based real-time procedures and the reference data set processed with Pix4D

CONCLUSION
This paper presents a real-time mapping framework designed for UAVs. Real-time mapping is a pivotal tool for time-critical applications, such as disaster response and other emergency mapping tasks. We show that our framework enables generating overviews of a scene in a quick as well as accurate manner. Based on efficient SLAM implementations, an entire dense reconstruction pipeline and orthorectification procedure is able to perform high quality data processing in real-time.
In our experiments we compare three state-of-the-art SLAM implementations (ORB-SLAM3, OpenVSLAM and OV²SLAM) in conjunction with our real-time mapping pipeline OpenREALM. Our quantitative experiments indicate that, depending on the underlying SLAM pipeline, absolute accuracies of about 1 m horizontally and about 1 m vertically are well achievable, which is a promising result when compared to off-the-shelf offline reconstruction pipelines such as Pix4D.
In our future work, we will integrate inertial readings into the pipeline to stabilise the trajectory estimation, which in turn may improve the robustness of the procedure at lower altitudes. Moreover, a thorough assessment of the height accuracy will be conducted with a LiDAR-based surface model. The code for our framework will be made publicly available.