ACCURACY ANALYSIS OF ESTIMATED CAMERA ORIENTATION PARAMETERS OF HISTORICAL IMAGES - CASE STUDY OF GEORG-SCHUMANN-BAU IN DRESDEN (GERMANY)

Abstract: Digitalization in archives provides access to an ever-increasing amount of historical photographs. As most of these were not taken by photogrammetric experts, it is of interest whether these images can be integrated into photogrammetric workflows in order to use them for advanced visualizations. While the pose estimation of historical images can be solved using feature matches retrieved by neural networks, it is still unclear how accurate the determined poses are. This contribution tries to provide a reference for unordered historical photo collections with additional contemporary images, using the example of the Georg-Schumann-Bau in Dresden, Germany. Therefore, a calibrated camera is used to take images of the building with the idea to enhance the estimated camera parameters of historical images in a combined model. This reference can then be used for comparison with a model created using exclusively historical images. Only half of the estimated poses deviate by less than 10 meters from the reference, which falls short of expectations. Nonetheless, it is possible to use the estimated poses for evaluating aggregations of historical images.


INTRODUCTION
With the ongoing digital transformation in libraries and archives, an increasing number of historical photographs becomes accessible to the public (Edward M. Corrado, 2017). It has already been investigated that a representation in three-dimensional (3D) space can increase the understanding and visibility of the image data by using Web3D (Figure 1), Virtual Reality (VR), and Augmented Reality (AR) applications (Bruschke et al., 2018; Dewitz et al., 2019). In practice, this requires the precise determination of the camera's pose in 3D for each photograph with estimated interior and exterior orientation parameters. While these can be accurately determined for contemporary images using conventional Structure-from-Motion (SfM) solutions, these software tools often fail for historical images with large radiometric and geometric differences (Maiwald and Maas, 2021). This happens mainly due to errors in the step of finding correspondences between historical image pairs. Recent neural network approaches for feature detection, description, and matching are, however, able to find many correct feature matches. Still, the accuracy of the final pose is only calculable up to scale using pre-defined relative orientations, and it is impossible to make statements about absolute accuracy without further measurements.
Using the previously described workflow on exclusively historical images, this contribution intends to give an estimate of the absolute accuracy of the estimated interior and exterior camera parameters. It should be clarified that it is very challenging to provide and determine correct camera parameters for historical images, as these are only available in a digitized format. The relationship to the original camera is lost, and auxiliary measurements are definitely needed. Therefore, the presented approach uses supporting contemporary images in order to provide an ideally accurate reference for the camera parameters of the historical images. Using a calibrated camera requires that the building be close to the calibration test field so that the interior camera parameters do not vary. Furthermore, the test setup requires a building which did not show vast changes over time and for which historical images are available.
Therefore, the entrance portal of the former district court of the city of Dresden (nowadays the Georg-Schumann-Bau, TU Dresden) is chosen (Figure 2).
However, accuracy analysis is not always performed or covers only specific parts of the final result. Mostly, research focuses on analyzing the accuracy of the final point cloud in object space, supported by additional measurements of superior accuracy. Reasonable results for object point accuracy using exclusively historical images with unknown camera parameters lie in the range of centimeters (Rodríguez Miranda and Valle Melón, 2017; Bitelli et al., 2017; Khalil and Grussenmeyer, 2019; Kalinowski et al., 2021) or few decimeters (Grün et al., 2004; Grussenmeyer and Al Khalil, 2017; Condorelli and Rinaudo, 2018). For aerial images covering significantly larger areas, the object point accuracy is one order of magnitude larger (Feurer and Vinatier, 2018; Zhang et al., 2021). When the camera parameters are known and the digitization process can be controlled, the point accuracy improves to a few centimeters for terrestrial images (Dlesk et al., 2020).
These values are quite consistent through all mentioned references. However, it seems that the final estimated pose of the cameras is often of lower interest, even though it is valuable for realizing Web3D or VR applications. Kalinowski et al. (2021) report deviations of 1.2 m between camera poses calculated via the Direct Linear Transformation (DLT) and camera poses calculated using an SfM approach, which seems realistic considering the results of the presented research.
More statements about the accuracy of camera poses can be found in research of the Computer Vision community. However, these are not made about historical image archives but mainly about unordered photo collections, which are comparable in their radiometric properties. The quality criteria for evaluating the accuracy of methods are explained in qualitative benchmarks, and commonly the angle error and/or the pose error between reference pose and estimated pose is given (Schönberger et al., 2017; Sattler et al., 2018; Jin et al., 2020). The pose error is defined as the Euclidean distance between reference and estimated pose (camera center). Sattler et al. (2018) define three thresholds for the pose error of difficult datasets: high-precision (0.5 m), medium-precision (1.0 m), and coarse-precision (5.0 m). These categories will also be used in this paper for evaluating the accuracy of the poses estimated using exclusively historical images.
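These evaluation criteria are simple to state in code. The following minimal sketch computes the pose error and sorts it into the precision bins; the function and constant names are chosen here for illustration, while the threshold values follow Sattler et al. (2018) plus the additional 10 m bin used in this paper:

```python
import numpy as np

# Precision categories for the pose error (Euclidean distance between
# reference and estimated camera center) following Sattler et al. (2018),
# plus the additional < 10 m bin used in this paper.
THRESHOLDS = [(0.5, "high-precision"),
              (1.0, "medium-precision"),
              (5.0, "coarse-precision"),
              (10.0, "< 10 m")]

def pose_error(c_ref, c_est):
    """Euclidean distance between two camera centers in meters."""
    return float(np.linalg.norm(np.asarray(c_ref, float) - np.asarray(c_est, float)))

def categorize(error_m):
    """Sort a pose error into the first precision bin it satisfies."""
    for limit, label in THRESHOLDS:
        if error_m <= limit:
            return label
    return "> 10 m"
```

For example, a camera center displaced by 0.3 m falls into the high-precision bin, while a displacement above 10 m falls outside all categories.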

METHODS
This section gives a short overview of the history of the investigated building and how the historical images were acquired. Further, the process of generating three different models used for estimating the accuracy of the camera orientation parameters is described. All models are computed using an SfM workflow up to the generation of a sparse point cloud and show the Georg-Schumann-Bau of the TU Dresden. The first model is referred to as historical model and consists of exclusively historical images. The second model is referred to as contemporary model and consists of images taken by a calibrated camera in a sequential image configuration. The third model is the mixed reference model, which consists of the historical images that could be included in the contemporary model. It is assumed that this allows the comparison of the estimated camera parameters in the historical model with the camera parameters in this reference model of superior accuracy.

Historical information and image acquisition
The former district court of the city of Dresden (nowadays the Georg-Schumann-Bau, TU Dresden) was built between 1902 and 1907 as Royal Saxon District Court at Münchner Platz (Munich Square). Like many other buildings in Dresden, the District Court was damaged in 1945, but not completely destroyed.
Since 1959 the building has housed the Münchner Platz Dresden Memorial, commemorating the victims of National Socialism.
Since 1964 the building has been used by the TU Dresden and is named after Georg Schumann, who was executed at the Münchner Platz due to his resistance to National Socialism.
The Münchner Platz Dresden Memorial kindly provided access to all photographs and images linked to the building. All digitized photographs were reviewed, and 37 images were manually selected. All show the exterior of the building including the entrance portal as seen from the Münchner Platz. These images originate from a period of approximately 1908 to 1996 and show large differences in perspective as well as radiometric differences.
Further images were collected from the Deutsche Fotothek (http://www.deutschefotothek.de/). As the metadata search for Landgericht Dresden (= district court Dresden) yields 1790 hits, a search via content-based image retrieval (CBIR) is performed. For this purpose, the previously published layer extraction approach (LEA) is used. Knowing that the database usually holds only a few views per building, three different query images were chosen in order to retrieve as many further exterior views as possible. The best 50 hits for every query image were reviewed. The CBIR resulted in only 7 additional photographs, which complement the historical dataset to 44 images. The small number of photographs from the Deutsche Fotothek is due to many interior views, other building parts, events, or very detailed views not relevant for this research.

Historical model
The historical images are directly processed using the previously described workflow. For all images, n = 4096 SuperPoint features are detected and described. Then, the images are exhaustively matched using SuperGlue. In order to obtain the interior and exterior orientation of the cameras, all feature matches are imported into the open-source SfM solution COLMAP (Schönberger and Frahm, 2016) and geometrically verified using locally optimized RANSAC (LO-RANSAC) (Chum et al., 2003). Finally, the orientation of all cameras is estimated via bundle adjustment if a minimum number of 15 inliers is found between a single image pair. The final model consists of 42 of the 44 images (Figure 3). The two images which could not be matched successfully (one aerial view, one blurry photograph) are shown in Figure 4.
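The pair-selection rule above can be sketched as follows. This is an illustrative stand-in for COLMAP's internal match database, not its actual API; `verified_matches` and both helper names are hypothetical:

```python
# Illustrative filter for the pairwise matching step: an image pair only
# contributes to the reconstruction if geometric verification (LO-RANSAC)
# leaves at least MIN_INLIERS matches (15 in this paper).
# `verified_matches` is a hypothetical dict {(img_a, img_b): n_inliers}.
MIN_INLIERS = 15

def usable_pairs(verified_matches, min_inliers=MIN_INLIERS):
    """Return the image pairs that pass the inlier threshold."""
    return {pair: n for pair, n in verified_matches.items() if n >= min_inliers}

def orientable_images(verified_matches, min_inliers=MIN_INLIERS):
    """Images that appear in at least one surviving pair; images without
    any surviving pair (here: one aerial view, one blurry photograph)
    cannot be oriented at all."""
    images = set()
    for a, b in usable_pairs(verified_matches, min_inliers):
        images.update((a, b))
    return images
```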
While this workflow has been tested regarding the number of successfully matched and oriented images and evaluated in a local coordinate system, the absolute accuracy of the camera poses could not be determined yet. This requires the transformation from the local coordinate system to a metric coordinate system, which is realized using contemporary images and further measurements of superior accuracy.

Contemporary model
To ensure the highest accuracy for the contemporary reference model, the idea is to use a pre-calibrated camera and fix the camera parameters during bundle adjustment. This requires the accurate calibration of the camera, which is realized at the test field of the Institute of Photogrammetry and Remote Sensing, TU Dresden (Figure 5). A Nikon Z7 with a 35 mm lens is selected to completely cover the test field. The camera is calibrated in AICON 3D Studio using a convergent image setup including scale bars of superior accuracy. The interior camera parameters are used for undistorting all 63 images. The images and camera parameters are imported in Agisoft Metashape, the images are aligned, and a sparse point cloud is created using the highest accuracy setting.
The sparse point cloud and the camera poses are still in a local coordinate system. In order to compare the location of the cameras at a metric scale, further measurements need to be carried out. Therefore, 10 points near the entrance portal are measured using a total station TCA2003 with an angle accuracy of 0.15 mgon and a distance accuracy of 1 mm + 1 ppm. The coordinate system is defined with its origin at the center of the total station, with the x-axis pointing to the right, the y-axis towards the building, and the z-axis to the top, as depicted in Figure 6. Measuring these points in multiple images, with a standard deviation of 1.3 mm in object space, allows a final refinement of the contemporary camera poses. The idea is to use these camera poses to generate an exact reference for the historical images in a mixed model.
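The conversion of a total-station measurement into the local Cartesian system defined above can be sketched as follows. This is a hedged illustration: the gon-based angle convention and the function name are assumptions made here, not the TCA2003 interface:

```python
import math

GON_TO_RAD = math.pi / 200.0  # a full circle is 400 gon

def polar_to_xyz(hz_gon, zenith_gon, slope_dist):
    """Convert a total-station measurement (horizontal angle and zenith
    angle in gon, slope distance in meters) into the local system of
    Figure 6: origin at the instrument, y-axis (at hz = 0) pointing
    towards the building, x-axis to the right, z-axis up. The angle
    convention here is an assumption for illustration."""
    hz = hz_gon * GON_TO_RAD
    z = zenith_gon * GON_TO_RAD
    horiz = slope_dist * math.sin(z)   # horizontal distance
    return (horiz * math.sin(hz),      # x: to the right
            horiz * math.cos(hz),      # y: towards the building
            slope_dist * math.cos(z))  # z: up
```

A point measured straight ahead at a zenith angle of 100 gon (horizontal sighting) and 10 m distance thus lands at roughly (0, 10, 0) in this system.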

Mixed reference model
The mixed model serves two purposes. First, it is interesting whether and, if so, how many historical images can be matched to contemporary images. Second, the model is meant to provide the reference poses for the historical images. It is assumed that the camera parameters of the historical images are closer to those of a real camera if the model is already constrained by the existing contemporary poses and interior camera parameters.
The approach is realized with hloc, the hierarchical localization toolbox (Sarlin et al., 2019). Therefore, the contemporary model holding 63 camera poses is exported from Metashape and converted to the COLMAP model format. This allows using the hierarchical localization toolbox for pose estimation of the historical images, while SuperGlue can be used for feature matching. It has already been shown that SuperGlue outperforms other feature matching methods when processing historical images. The undertaken experiments also show that SuperGlue is able to find similar features in image pairs from different points in time (Figure 7). The feature matches found by SuperGlue are then triangulated using the fixed camera parameters of the contemporary images. Then, the camera parameters of the historical images are determined via space resection using pycolmap (https://github.com/colmap/pycolmap). While this seems like a reasonable approach, the later evaluation showed that the poses of many cameras are determined inaccurately. This happens especially when the contemporary images cover only a small part of the historical images. The good features found, e.g., in the center of the historical image are not sufficient to robustly estimate the camera parameters of the historical images.
Thus, the poses initially estimated by hloc and pycolmap are reimported into Metashape, and additional features in the border areas are selected in an interactive approach. This required the time-consuming picking of 36 additional points in all 42 historical images; using more points only slightly changed the final result. All interior and exterior camera parameters of the historical images are estimated once again, minimizing the reprojection error of all measured and selected points. This workflow leads to more reasonable results and allows the comparison of the interior and exterior camera parameters estimated in the mixed model versus the historical model.
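The quantity minimized in this refinement, the reprojection error, can be sketched for a simple pinhole camera. This is an illustrative stand-in, not Metashape's internal cost function; distortion is omitted since the contemporary images were undistorted beforehand, and all names are chosen here for illustration:

```python
import numpy as np

def reprojection_rmse(points3d, points2d, R, t, c, principal_point):
    """RMS reprojection error for a pinhole camera with principal
    distance c (in pixels) and no distortion. R and t map world points
    into the camera frame; points2d are the measured image coordinates
    of the corresponding 3D points."""
    X_cam = np.asarray(points3d, float) @ np.asarray(R).T + np.asarray(t, float)
    # Perspective division and shift by the principal point.
    proj = c * X_cam[:, :2] / X_cam[:, 2:3] + np.asarray(principal_point, float)
    residuals = proj - np.asarray(points2d, float)
    return float(np.sqrt(np.mean(np.sum(residuals**2, axis=1))))
```

Bundle adjustment then varies the camera parameters (and, for the historical images, also the interior orientation) to drive this value down over all measured and selected points.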

RESULTS
In a first instance, the quality of the historical model is evaluated. Therefore, the total station measurements with known coordinates are used. 9 of the 10 points can be found in the historical model and are transformed with a Helmert transformation using least-squares estimation. The resulting σ0 = 0.11 m, and the residuals for every point are depicted in Table 2.

Table 2. Residuals in meters for 9 points transformed from the historical model into a metric coordinate system.

These values are comparable to other published results shown in Section 2.
However, the main part of the evaluation deals with the accuracy analysis of the interior and exterior camera parameters. Therefore, this contribution tries to provide a good reference for the camera parameters of historical images using additional contemporary images with known camera parameters. This mixed model is now compared with an SfM reconstruction that uses exclusively historical images (= historical model).
For the interior camera parameters, the principal distance c (focal length), and for the exterior camera parameters, the position (X, Y, Z) is compared. As there might be gross errors in the historical model, the mean as well as the median is calculated. It can already be assumed that the principal distance and the Y coordinate (in camera viewing direction) will vary between both solutions due to their correlation. The historical camera could theoretically be very close to the building with a wide field of view (FOV) or, in contrast, very far away from the building with a narrow FOV. This ambiguity cannot be solved for historical images unless additional information is available (as provided in the mixed model). This can be directly seen in the comparison of the principal distance of the reference and estimated model depicted in Table 3, as the mean absolute difference is 568.7 pixels and the median absolute difference is 260.9 pixels.
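This ambiguity follows directly from the pinhole projection: a facade of height H at distance Z appears with image height h = cH/Z, so scaling c and Z by the same factor leaves the image of the plane unchanged. A minimal illustration with hypothetical numbers:

```python
# Depth/principal-distance ambiguity under the pinhole model: any pair
# (c, Z) with the same ratio c / Z projects a planar facade to the same
# image height, which is why c and the Y coordinate (viewing direction)
# are strongly correlated for historical images.
def projected_height(c_px, height_m, depth_m):
    """Image height in pixels of an object of given height at given depth."""
    return c_px * height_m / depth_m
```

A 15 m facade seen with c = 2000 px at 30 m produces the same 1000 px image height as c = 4000 px at 60 m, so without external constraints the bundle adjustment cannot distinguish the two configurations.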
In the raw values, a trend can be identified (in 30 of 42 cameras): the historical model often uses a larger principal distance than the mixed model. That means that the camera pose is often estimated too close to the building.
Unfortunately, this is directly reflected in the values for the estimated poses, which fall short of expectations. For all 42 cameras, the deviations in the X coordinate (∆X), Y coordinate (∆Y), and Z coordinate (∆Z) are calculated. Additionally, the Euclidean distance between the camera centers (∆XYZ) is given.
For all values, the mean and median are depicted (Table 3).
Furthermore, all poses are sorted into the categories defined by Sattler et al. (2018) plus one additional category (< 10 m), as depicted in Table 4.

Table 3. Differences of the principal distance in pixels and coordinate differences in meters between the 42 reference and estimated camera poses in all coordinate directions. Additionally, the mean and median over all values are given.
These results show that even if the historical model is controlled visually, via the reprojection error, and via additional measurements on the building, this does not mean that all camera poses are calculated correctly. In fact, the above-mentioned results show that almost half of the cameras are more than 10 meters away from their reference position. These errors occur mainly because of the ambiguity of the principal distance, incorrect feature matches in the historical model, and possibly too soft constraints in bundle adjustment. Additionally, while the mixed reference model has been created with different software tools and different evaluation strategies have been compared, it is not completely clear whether it provides a good reference for the original historical camera.
Nonetheless, most of the camera poses can be used for the purpose of visualizing aggregations of cameras or viewing directions in Web3D applications. For a direct use in AR the achieved accuracy is not yet high enough.

CONCLUSION
This contribution allows drawing several conclusions. Generally, it can be seen once again that SuperPoint+SuperGlue is able to deal with very diverse image collections covering a large time span and showing radiometric differences. The workflow is able to estimate the pose of 42 of 44 historical images. Additionally, it can be useful to integrate contemporary images, as feature matches can also be found reliably. However, for a successful pose estimation of the historical cameras it is necessary that the contemporary images cover the whole scene depicted in the historical image. Otherwise, an interactive selection of further feature points is required. It remains an interesting problem how to provide a reasonable reference for the cameras of historical images when no further information is given. The time-consuming manual selection of more tie points between the historical images could still slightly converge the mixed model towards the historical model.
Considering the accuracy, it is interesting that the historical sparse point cloud is consistent and fits the total station measurements with an accuracy of around 10 centimeters. However, the principal distance and the pose estimation in the historical model come with significantly larger errors: in the range of hundreds of pixels for the principal distance and of several meters for the pose. The historical model also includes a few gross errors, i.e., camera poses that have been reconstructed but are not in the correct position due to a large number of incorrect feature matches.
Nonetheless, even if the camera pose is slightly inaccurate, the texture can be projected onto 3D models in a VR or Web3D environment. For AR applications the accuracy has to be further improved. This could be done by tuning bundle adjustment parameters, e.g., by only using points for the reconstruction of camera poses that have been observed in a high number of images. This should enhance the robustness of the model while, as a negative effect, decreasing the number of oriented images.