THE EUROSDR TIME BENCHMARK FOR HISTORICAL AERIAL IMAGES

Automatic photogrammetric processing of historical (or archival) aerial photos is still a challenging task, particularly in cases of missing ancillary information, low radiometric and image quality, limited stereo coverage or large temporal span. However, with recent advances in photogrammetry and Artificial Intelligence (AI) algorithms for image processing and interpretation, an increasing number of applications are now feasible. The article presents the TIME (hisTorical aerIal iMagEs) benchmark (https://time.fbk.eu/), promoted by EuroSDR to explore the potential of historical aerial images. Realized in collaboration with various European NMCAs, the benchmark has garnered aerial image blocks and time series imagery captured since the 1950s. To support the photogrammetric processing of the digitized photos, ancillary data are supplied with available information about flight missions, taking cameras, and ground control points (GCPs). Several diverse investigations have been undertaken with the benchmark datasets, all captured over historical urban areas or landscapes. The paper describes the benchmark datasets and some potential research topics, presenting several tests and analyses realized with the collated and shared


INTRODUCTION
Historical images, defined here as photos dating from the origin of photography to the beginning of the digital era (c. 2005), are becoming an increasingly valuable information source, yet many automated processing issues remain. Whether aerial or terrestrial, nadir or oblique, many photo collections are under digitization worldwide, providing a significant amount of data for multiple potential applications (Schulz et al., 2021). Archives often offer long-term time series, relatively dense temporal resolution, relatively high geometric resolution and, quite frequently, stereo coverage. Historical aerial images often feature different image formats (e.g. 23x23 cm, 30x30 cm, microfilms, etc.) and were collected for various purposes (topographic map generation and updating, visual interpretation, reconnaissance, etc.).
They offer mapping benefits but also numerous investigation opportunities and challenges: missing or incomplete metadata, low radiometric and geometric resolution, poor image quality arising from incorrect scanning or inappropriate conservation measures, large time intervals in multi-temporal datasets, reduced stereo coverage, etc. (Redecker, 2008; Nocerino et al., 2012; Cowley et al., 2013; Giordano and Mallet, 2019). In the case of very old datasets, camera certificates are rarely available (rigorous calibration standards were only introduced in the 1960s), which translates into missing Interior Orientation (IO) parameters and fiducial mark coordinates, in addition to the lack of flight plan information (for coarse image localization). Furthermore, poor image quality can limit the performance of automatic algorithms for salient feature extraction, complicating further 2D/3D image processing. Finally, the rare availability and visibility of multi-temporal GCPs in historical and modern urban scenarios limit automated georeferencing processes. This paper presents the EuroSDR TIME benchmark, established to collect and harmonize a series of historical aerial images and share them with the research community to investigate open issues in various photogrammetric processing steps. The shared aerial photos (and associated ancillary data) aim to stimulate research into new procedures that can fully exploit the invaluable content of such historical data. TIME is believed to be the first photogrammetric benchmark dedicated to the processing of historical aerial photographs. Previous studies on archival aerial image processing have highlighted the complexity of handling these data, which are increasingly being digitized by European NMCAs (Giordano and Mallet, 2019).

RELATED WORK
As a tool for scientists, a benchmark is generally defined as a reference against which entities may be compared. A review of benchmarking initiatives for evaluating and comparing the performance of sensors, algorithms and methods for geospatial investigations was presented by Bakula et al. (2019). Many benchmark initiatives have been supported by ISPRS and/or EuroSDR, in line with their ambitions to support the geospatial community with research activities and datasets. Although different in terms of test data and investigation aims, all benchmarks share the common focus on developing and testing automatic processing procedures (Haala, 2014), while simultaneously boosting robust data quality assessments (Nex et al., 2015). Several benchmarks on aerial imagery have been released, primarily focusing on image triangulation and DSM generation (Haala, 2014; Nex et al., 2015), object detection (Ding et al., 2021), semantic segmentation of urban 3D point clouds (Kölle et al., 2021; Hu et al., 2021) and image interpretation (Long et al., 2021). Benchmarks related to historical aerial imagery have not yet been made available, although Maiwald (2019) proposed terrestrial scenario datasets to evaluate the performance of several feature matching methods.
The literature shows that the most common investigations into historical aerial images examine the standard pipeline for image triangulation, DSM and orthoimagery production, including the most recent SfM and MVS methods, especially for landscape change detection, 3D reconstruction of heritage structures and monitoring analyses (Wiedemann et al., 2000; Sauerbier et al., 2004; Walstra et al., 2004; Sonnemann et al., 2006; Verhoeven et al., 2013; Nebiker et al., 2014; Risbøl et al., 2014; Jao et al., 2014; Micheletti et al., 2015; Popelková and Mulková, 2016; Fieber et al., 2018; Lydersen et al., 2018; Sevara et al., 2018; Peppa et al., 2018; Kupidura et al., 2019; Osińska-Skotak et al., 2019; Pinto et al., 2019; Grottoli et al., 2020; Poli et al., 2020). Missing or incomplete camera and flight mission details are the primary reported problem for the photogrammetric processing and image triangulation of analogue photos. When camera certificates and fiducial mark coordinates are unavailable, an initial bottleneck relates to the derivation of camera interior orientation (IO) parameters and the establishment of a mathematical relationship between the digital image and the camera's coordinate system. Procedures for approximately deriving interior orientation information and for fiducial transformation employing affine models were described in Nocerino et al. (2012), Nurminen et al. (2015), Salach (2017) and Poli et al. (2020). Different studies have focused on the recovery of camera exterior orientation (EO) parameters and on automatic geo-referencing procedures for historical imagery, using photogrammetric methods (Kim et al., 2010; Redweik et al., 2010; Verhoeven et al., 2012; Karel et al., 2013; Nurminen et al., 2015; Giordano et al., 2018) or empirical transformations (Zambanini and Sablatnig, 2017). Multi-date aerial datasets can be co-registered using handcrafted operators (e.g. SIFT), contours or lines (Clery et al., 2014).
These approaches normally fail to extract reliable correspondences when large time differences exist and/or a change of sensor occurs between epochs. For these reasons, recent works have relied on learning-based approaches to extract reliable correspondences among multi-temporal images and improve image matching (Ressl et al., 2020; Zhang et al., 2021). Historical aerial images have frequently been used to interpret the territory and discover lost information (Sevara, 2013), applying monoplotting (Bozzini et al., 2012; Bayr, 2021) but also using simulation processes for supporting data interpretation (Siok and Ewiak, 2020). Further research activities have focussed on exploiting historical aerial photographs to identify war-related bomb craters (Meixner and Eckstein, 2016; Valjavec et al., 2018; Clermont et al., 2019; Dolejš et al., 2020). Other studies have explored the benefits of employing archival sources for producing historical land-use and land-cover (LULC) maps (Ratajczak et al., 2019; Mboga et al., 2020), also exploiting Artificial Intelligence (AI) algorithms. AI-based techniques have also been adopted for grayscale image colourization. Colourizing archival aerial data can be crucial for improving scene understanding and supporting other processing tasks, such as data classification. Several methodologies have been developed in this field for terrestrial scenarios, whereas few works have focussed on aerial-scale historical images (Seo et al., 2018; Dias et al., 2020; Poterek et al., 2020).

DATASETS DESCRIPTION
The image blocks and ancillary information collected in the TIME benchmark span from the 1940s to the 2000s. The contributing countries (Austria, Norway, Finland, Cyprus, Slovenia, Estonia, Poland, Italy) generally provided multitemporal image blocks with somewhat diverse ancillary data consisting of camera calibration, approximate exterior orientation (EO), ground control points (GCPs), ground truth, etc. In particular, the shared data (Table 1)

Image pre-processing
Image enhancement includes techniques employed in image preprocessing to highlight salient image features by transforming pixel intensity values. Poor contrast and brightness levels, as in over- or underexposed images, are normally the result of erroneous acquisition or scanning (digitization) processes (Baltsavias, 1999). Several solutions have been proposed for image correction and for enhancing brightness and contrast (Maurya et al., 2021). Some of the shared aerial blocks from the TIME benchmark present evident radiometric problems, which need to be resolved to enable and facilitate photogrammetric processing tasks. This is especially true for the different Cyprus series (Figure 2). Standard brightness and contrast adjustment algorithms, such as histogram equalisation, were employed before running further processing. Besides investigating more robust and automatic procedures for this processing task, further analyses should examine the radiometric distortion and noise level introduced by this pixel manipulation, especially in 3D reconstruction processes.
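As an illustration of the kind of standard adjustment mentioned above, the following is a minimal sketch of global histogram equalisation for an 8-bit grayscale scan, written with NumPy. The function name and the synthetic low-contrast test image are assumptions made for this example; the sketch does not reproduce the exact tool or settings used on the benchmark data.

```python
import numpy as np


def equalize_histogram(image: np.ndarray) -> np.ndarray:
    """Global histogram equalisation of a 2D uint8 grayscale image."""
    if image.dtype != np.uint8 or image.ndim != 2:
        raise ValueError("expected a 2D uint8 grayscale image")
    # Histogram and cumulative distribution over the 256 grey levels.
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest occupied grey level
    total = image.size
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return image.copy()
    # Classic mapping: rescale the CDF so occupied levels cover 0..255.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[image]


# Hypothetical low-contrast scan: grey values squeezed into 100..139.
rng = np.random.default_rng(0)
scan = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
enhanced = equalize_histogram(scan)
print("input range:", scan.min(), scan.max())
print("output range:", enhanced.min(), enhanced.max())
```

Because the mapping is a monotonic lookup table, the ordering of grey values is preserved, but the transformation is non-linear and can amplify scanning noise, which is the kind of side effect the further analyses suggested above would need to quantify.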

Aerial triangulation (AT) and geo-referencing
In order to derive metric spatial information from digitised aerial photographs, an initial transformation between the pixel and the image coordinate system needs to be computed. This is defined by the fiducial marks (Figure 3) visible on the analogue photos, which vary in shape and location (in the middle of each side, in the corners, or in all eight locations). The transformation between the pixel and image coordinate systems can be rigorously resolved only when (i) the marks are clearly visible and accurately detected in the digitised images and (ii) the image coordinate system for the transformation is known. Since in many datasets, and especially the oldest ones, no camera calibration certificates or fiducial mark coordinates are available, some initial approximations for the photogrammetric processing are needed (Nocerino et al., 2012). A coarse transformation between the pixel and image coordinate systems can be performed by adopting a virtual reference system and assuming that the principal point is at the centre of the fiducial mark intersection. In our tests, pixel coordinates of fiducial marks with respect to the principal point were manually measured in some images, and their average values were used for computing an affine transformation for the whole block. Nowadays, some photogrammetric software offers semi-automatic solutions for handling this process (Giordano and Mallet, 2018), mainly employing pattern matching methods for the automatic detection and measurement of fiducial mark coordinates, starting from a single archival image. In our tests, the principal point position was used for cropping the single frames and ensuring the same sensor format for all the blocks. Camera parameters can then be estimated in a bundle block adjustment, possibly also including distortion parameters (Poli et al., 2020).
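The coarse pixel-to-image transformation described above can be sketched as a six-parameter affine model fitted by least squares. This is a minimal illustration using NumPy: the function names, the fiducial pixel measurements and the nominal fiducial positions of the virtual reference system (±106 mm) are hypothetical values chosen for the example, not data from the TIME blocks or the software used in the tests.

```python
import numpy as np


def fit_affine(pixel_xy: np.ndarray, image_xy: np.ndarray) -> np.ndarray:
    """Least-squares 2D affine transform from pixel to image coordinates.

    Returns a 2x3 matrix A such that image_xy ~= A @ [col, row, 1].
    """
    pixel_xy = np.asarray(pixel_xy, dtype=float)
    image_xy = np.asarray(image_xy, dtype=float)
    if pixel_xy.shape != image_xy.shape or pixel_xy.shape[0] < 3:
        raise ValueError("need at least 3 matching 2D points")
    # Design matrix [col, row, 1]; both image axes are solved in one call.
    design = np.hstack([pixel_xy, np.ones((len(pixel_xy), 1))])
    params, *_ = np.linalg.lstsq(design, image_xy, rcond=None)
    return params.T


def apply_affine(affine: np.ndarray, pixel_xy: np.ndarray) -> np.ndarray:
    """Map pixel coordinates to image coordinates with a 2x3 affine matrix."""
    pixel_xy = np.asarray(pixel_xy, dtype=float)
    return pixel_xy @ affine[:, :2].T + affine[:, 2]


# Hypothetical pixel measurements of four corner fiducials in one scan.
fiducials_px = np.array([[450.0, 430.0], [11050.0, 445.0],
                         [11035.0, 11045.0], [435.0, 11030.0]])
# Assumed virtual reference system: nominal fiducial positions in mm,
# origin at the intersection of the fiducial diagonals, y axis pointing up.
fiducials_mm = np.array([[-106.0, 106.0], [106.0, 106.0],
                         [106.0, -106.0], [-106.0, -106.0]])

affine = fit_affine(fiducials_px, fiducials_mm)
residuals_mm = apply_affine(affine, fiducials_px) - fiducials_mm
# Pixel position mapped to the origin, i.e. the assumed principal point,
# which could serve as the centre for cropping the frame.
principal_point_px = np.linalg.solve(affine[:, :2], -affine[:, 2])
print("max residual [mm]:", np.abs(residuals_mm).max())
print("assumed principal point [px]:", principal_point_px)
```

With only four fiducials the fit has two redundant observations, so the residuals give a limited indication of measurement quality; with eight marks the check becomes more meaningful.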
Several blocks or subsets of the TIME benchmark were also processed with different photogrammetric software to assess the capability of automatic SfM-based processes with analogue grayscale images. Tie points were extracted using hand-crafted (RootSIFT), hybrid (HP) and learning-based (LoFTR, KeyNet, D2Net, RoRD) approaches. Results highlight the potential of newly developed solutions in finding tie points among multitemporal datasets (Figure 4). Reference GCPs are necessary for scaling and geo-referencing purposes. This stage is typically difficult and time-consuming, considering that GCPs that are stable over time are rarely available or clearly visible in historical imagery (Figure 5). When some current GCPs are available through maps or orthophotos (e.g. for the Vienna dataset), the manual identification and measurement of these points in the historical images, followed by a Helmert transformation, is the most common procedure for geo-referencing image datasets. However, this process can be quite complex and laborious, and significant errors can be introduced in this phase. As an example, two temporal series of Vienna, from 1958 and 1976 (15 and 21 images, respectively), were geo-referenced using this procedure, applying a standard rigid transformation.
Using 10 manually identified GCPs, the RMSE is about 4 meters for the 1958 images (GSD≃0.5 m) and 3.5 meters for the 1976 images (GSD≃0.2 m). This example and further tests conducted on several TIME benchmark datasets show that the automatic AT of archival aerial images remains a difficult and error-prone processing task.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France

Increasingly robust procedures need to be developed, especially when dealing with large temporal spans and missing ancillary and ground reference information.
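To make the geo-referencing step more concrete, the following sketch estimates a 3D Helmert (seven-parameter similarity) transformation from matched GCPs and reports the RMSE of the residuals. It uses a closed-form SVD solution in the style of Umeyama, which is an assumption of this example and not necessarily the estimator used for the Vienna tests. All coordinates, noise levels and function names are synthetic and hypothetical.

```python
import numpy as np


def fit_helmert_3d(src, dst):
    """Estimate scale s, rotation R, translation t with dst ~= s * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 3 or len(src) < 3:
        raise ValueError("need at least 3 matching 3D points")
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    u, sing, vt = np.linalg.svd(cov)
    # Guard against a reflection so that R is a proper rotation.
    d = np.eye(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        d[2, 2] = -1.0
    rot = u @ d @ vt
    var_src = (src_c ** 2).sum() / len(src)
    scale = np.trace(np.diag(sing) @ d) / var_src
    trans = mu_dst - scale * rot @ mu_src
    return scale, rot, trans


def apply_helmert_3d(scale, rot, trans, pts):
    """Apply the similarity transformation to an (N, 3) array of points."""
    return scale * np.asarray(pts, dtype=float) @ rot.T + trans


def rmse_3d(predicted, reference):
    """Root mean square of the 3D point-to-point distances."""
    diff = np.asarray(predicted) - np.asarray(reference)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


# Synthetic example: 10 hypothetical GCPs in an arbitrary SfM model frame.
rng = np.random.default_rng(42)
model_xyz = rng.uniform(-50.0, 50.0, size=(10, 3))
angle = np.deg2rad(25.0)
true_rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
world_xyz = 12.0 * model_xyz @ true_rot.T + np.array([2000.0, 340000.0, 180.0])
# Simulated identification error of the GCPs (assumed 0.5 m per axis).
world_xyz += rng.normal(scale=0.5, size=world_xyz.shape)

scale, rot, trans = fit_helmert_3d(model_xyz, world_xyz)
fitted = apply_helmert_3d(scale, rot, trans, model_xyz)
print("estimated scale:", scale)
print("RMSE on GCPs [m]:", rmse_3d(fitted, world_xyz))
```

For reference, the RMSE values reported above correspond to roughly 8 times the GSD for the 1958 block (4 m at about 0.5 m) and roughly 17.5 times the GSD for the 1976 block (3.5 m at about 0.2 m). An RMSE computed on the same points used for the fit is optimistic, so independent check points are preferable where enough GCPs exist.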

DSM and orthoimages production
DSMs and orthophotos are the most common products of the photogrammetric processing of archival aerial images. Once the EO parameters are recovered, DSMs can be produced by means of dense image matching (DIM) algorithms (Figure 6), and then orthorectification and mosaicking can be performed. The correctness of the AT directly affects the quality of these products and the extraction of reliable and precise information from historical data. In order to assess the capability of automatic algorithms for DSM production, tests were conducted on the same Austrian temporal series (1958 and 1976) presented in the previous section, and the final orthomosaics were visually evaluated and compared with current data (Figures 7 and 8).
A different example of AT and DSM production from a subset (31 images) of the oldest Norway block (110 images) is shown in Figure 9.
Norwegian blocks are among the most complete in the TIME benchmark, in terms of both number of images and available ancillary information. The provided data were used to re-process the image block and generate a DSM of the area of interest. The numerous lakes present in this dataset (black areas in the historical images) produced various errors in the DSM. Improving DSM quality is a further open research task, considering that matching results are affected by the radiometric saturation, scanning errors, dark regions and shadows typically present in historical imagery.

Automatic colorization of historical images
Automatic grayscale image colorization is an emerging topic following the latest results of CNN-based (Liu et al., 2018) or GAN-based (Poterek et al., 2020) deep learning methods. Colorized historical photos can effectively aid further image processing tasks (data classification, object recognition, etc.). Several methods have already been developed for this task in terrestrial scenarios, but they perform poorly on aerial-scale historical data. A few methods have recently been presented for this specific application; however, their source code is unavailable for tests and comparisons. Therefore, a new deep learning method was implemented by the authors for the automatic colorization of several TIME benchmark datasets. The Italian imagery, including miscellaneous data captured between 1942 and 1945, was selected for testing the colorization performance of the proposed technique (Figure 10). The implemented network architecture, called "Hyper-U-NET", combines U-NET (Ronneberger et al., 2015) with the hypercolumn technique (Hariharan et al., 2015). Some examples of the colorization outputs obtained with the proposed technique are presented in Figure 10. Further tests and quantitative analyses are needed.

Figure 10. Example of colorization results.
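The hypercolumn idea referred to above can be illustrated with a small, framework-free sketch: feature maps from several network depths are upsampled to a common resolution and concatenated channel-wise, so that every pixel carries a descriptor mixing fine and coarse information. This is only an assumed, simplified illustration in NumPy with random arrays standing in for CNN activations and nearest-neighbour upsampling chosen for brevity. It is not the authors' Hyper-U-NET implementation, whose details are not given here.

```python
import numpy as np


def upsample_nearest(fmap: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)


def build_hypercolumns(feature_maps, out_hw):
    """Stack multi-scale (C_i, H_i, W_i) maps into one (sum C_i, H, W) array."""
    out_h, out_w = out_hw
    stacked = []
    for fmap in feature_maps:
        _, h, w = fmap.shape
        if out_h % h or out_w % w or out_h // h != out_w // w:
            raise ValueError("each map must divide the output size by one integer factor")
        stacked.append(upsample_nearest(fmap, out_h // h))
    # Channel-wise concatenation: one long descriptor per pixel.
    return np.concatenate(stacked, axis=0)


# Random stand-ins for activations at three depths (hypothetical sizes).
rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 32, 32))   # fine detail, full resolution
middle = rng.normal(size=(16, 16, 16))   # half resolution
deep = rng.normal(size=(32, 8, 8))       # coarse, more semantic content

hypercolumns = build_hypercolumns([shallow, middle, deep], out_hw=(32, 32))
print("hypercolumn tensor shape:", hypercolumns.shape)
```

In a colorization network, a per-pixel descriptor of this kind would typically feed the layers that predict the colour channels; how this is wired into the U-NET decoder in the proposed method is not specified in this section.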

CONCLUSIONS
The TIME benchmark is an ongoing EuroSDR research activity, offering access to many historical aerial image blocks to test and validate 2D and 3D image processing algorithms. The benchmark follows an increasing number of scientific publications related to historical airborne images as well as an increasing effort of local and national mapping authorities (NMCAs) to create digital archives and valorize their contents. The aerial blocks shared by several NMCAs are heterogeneous in terms of number of images, available ancillary information, GSD, image quality, overlap, etc. This paper has presented the main characteristics of the collected data, some possible investigation topics, and processing tests with several automatic solutions. The issues and bottlenecks encountered in the presented experiments show that archival aerial image processing is a wide and challenging research field. Many gaps in the currently available solutions can preclude the full exploitation of this valuable source of information on past urban scenarios. EuroSDR and the authors believe in the enormous potential of the TIME benchmark and in the value of historical aerial imagery. It is anticipated that the research community will benefit from the various datasets and that innovative methods will be developed to boost the use of these sources.