GENERATION OF A BENCHMARK DATASET USING HISTORICAL PHOTOGRAPHS FOR AN AUTOMATED EVALUATION OF DIFFERENT FEATURE MATCHING METHODS

This contribution shows the generation of a benchmark dataset using historical images. The difficulties when working with historical images are pointed out and structured in three categories. Especially large viewpoint differences, image artifacts and radiometric differences lead to weak matching results with classical feature matching approaches. The necessity of publishing a dedicated benchmark dataset is emphasized by comparison with existing datasets, which partly use synthetic data, precisely known orientations or strictly categorized image differences. The presented image dataset currently consists of 24 images which are oriented in image triples using the properties of the Trifocal Tensor as a more stable image geometry. In the following, three different feature detectors and descriptors that have already proven themselves on historical images (MSER, ORB, RIFT) are evaluated using the new benchmark dataset. Subsequently, several outlier removal methods are applied to the detected features. The tests show that over the entirety of image pairs RIFT performs slightly better than the other two methods. Nonetheless, for some image pairs MSER significantly improves the matching score, but even so, historical image pairs remain difficult to match with the presented methods because outlier removal is challenging. Still, the estimated projective relative orientation could be used in an autocalibration approach to place the images in a metric scene.


INTRODUCTION
This contribution presents the generation of a benchmark dataset for the evaluation of different feature matching methods on historical images. The work is placed in the context of a 4D web application (3D models and related historical images and data) of the city of Dresden as an alternative media repository for e.g. art historians. Oriented images and methods to match historical images provide the basis for the placement of the images in such a 3D space. The presented images originate from the photo library of the Saxon State and University Library Dresden (SLUB), which contains about 1.8 million images of 80 institutions at this point in time. The majority of images in this archive was taken between 1940 and 1990 (deutschefotothek.de). The images for the benchmark dataset were redigitized for this purpose and show various buildings. While the absolute orientation of these historical photographs is neither given nor easy to define, this approach focuses on the determination of the relative orientation between different historical images. This leads to diverse issues considering that extrinsic and especially intrinsic camera parameters are mostly unknown. Additionally, the images were taken by different camera types and vary in exposure and acquisition time. Consequently, the presented dataset is relatively oriented in a projective frame using a more stable triple image geometry (Hartley, 1997). The matches between the three images of one building view and additionally the relating Trifocal Tensor T are determined and given. This orientation data can then be used to evaluate different feature detectors, descriptors and feature matching methods on historical images. In the following, it may be possible that an oriented image mosaic can be metrically spatialized in a three-dimensional environment with the appropriate scale using autocalibration (Faugeras et al., 1992). The dataset consists of 24 images (2 image triples respectively for 4 buildings) and could be extended in the future. The images have different properties ranging from small viewpoint and radiometric changes to large differences. These properties can be summarized in the following three categories.

Image differences based on digitization and image medium
Even if an image had been taken twice at the same moment in time, some differences concerning the digitized copy could occur during and even before digitization. This is because historical images are mainly archived on photographic plates or photographic film, and any change on this original medium is preserved during digitization. In particular, the photographic emulsion on the glass plates can deteriorate, and additionally a glass plate is fragile and any crack will be pictured in the digitized image (Gillet et al., 1986). Scratches, dust and fingerprints may also be visible in the digital copy.
Similarly, photographic film is vulnerable to damage e.g. by mold, photo-oxidation, air pollutants and improper handling (Slate, 2001). All of these image artifacts are transferred by the digitization process and will interfere with feature detection. One further image difference that may appear and is relevant for photogrammetry is the change of the principal point in the digital copy. It is not necessarily the center of the digital copy but can shift if only a part of the original image is digitized or if the original data has been cropped. It may even be possible that the principal point is not pictured on the digital copy at all. Additionally, when the digitization information (sensor, resolution, dynamic range, working area, accuracy, filters) is not available, all metric information is lost in the process.

Image differences based on different cameras and acquisition technique
When comparing various historical images, the main difference between them is the strongly changing representation of the depicted object. Photographs of the same object are taken in summer and in winter, in daylight and at nighttime, and thus the radiometric properties change. The historical images may be blurred, noisy, under- and overexposed, and different light spots, reflections and shadows can appear in the same photographic scene and interfere with the feature detection. Sometimes, people, cars or other objects are in front of the depicted building and influence the feature matching.
Additionally, on the one hand it is possible that there are extreme viewpoint changes between the images, and on the other hand sometimes one building is solely photographed from similar perspectives, which makes a 3D reconstruction difficult. Since the camera types are mostly unknown and undocumented, the inner orientation important for the reconstruction is not available and has to be estimated.

Object differences based on different dates of acquisition
A difficult topic is dealing with object differences shown in the photographs. Building differences can vary between very small changes, like on claddings, window frames or small statues, and large ones considering destroyed or reconstructed buildings. It is not possible to assume that a historical building that is represented on various images did not change over time. Nonetheless, some valuable orientation information can be determined even using these destroyed or changed buildings. It will be difficult to decide whether an object changed so much that any metric information generated with photogrammetric methods is invalid. Furthermore, it is still discussed how to represent this error-prone data (Apollonio, 2016), (Kensek et al., 2004). A first step could be to categorize historical images very accurately using content-based image retrieval and only use feature matching methods on image pairs showing clearly the same building in the same state.

RELATED WORK
A large variety of image datasets already exists in computer vision for different purposes like (people) detection, classification, recognition, tracking, segmentation, multiview and many more. Famous datasets are e.g. the Caltech 256 dataset for classification purposes (Griffin et al., 2007) or the KITTI dataset used in autonomous driving and SLAM research (Geiger et al., 2013). The presented dataset could be integrated in the multiview category and closes a gap between different existing datasets. In contrast to datasets with a large number of images and their inner orientations (Moreels and Perona, 2007), it is hardly possible to provide that many historical images including the proper inner orientation, since the camera types are mostly unknown.
Similar to the Affine Covariant Regions dataset (Mikolajczyk et al., 2005), the presented benchmark dataset consists of real data (i.e. not synthetic data) with changes in illumination, viewpoint, blur and rotation. Some of the historical images even have large viewpoint or illumination changes like in the Extreme View Dataset or the Ultra Wide Baseline Dataset (Mishkin et al., 2015). These existing datasets use the fact that "the images are either of planar scenes or the camera position is fixed during acquisition, so that in all cases the images are related by homographies [..] and this mapping is used to determine ground truth matches [..]" (Mikolajczyk et al., 2005). This is not (always) possible when using historical data, so the presented benchmark dataset is described by the predefined corresponding points and the Trifocal Tensor determining the relative orientation between image triples. At the time of this research, no other freely available benchmark dataset with oriented images older than 40 years used for feature detection and matching could be found.
However, many people are working with historical images and further data to reconstruct mostly buildings and sights. This includes e.g. the reconstruction of the great Buddha of Bamiyan (Grün et al., 2004), dinosaur tracks (Falkingham et al., 2014) or the orientation of historical images of Atlanta, GA (Schindler and Dellaert, 2012). Recent research is also done with historical data, e.g. in combination with terrestrial laser scanning (Bitelli et al., 2017), using old film negatives (Rodríguez Miranda and Valle Melón, 2017) or aerial images (Giordano et al., 2018). Though those projects show an increasing degree of automation in image processing, a lot of work in this field of research is still done manually (Henze et al., 2009), (Gouveia et al., 2015). An oriented historical image dataset could help to improve automated approaches in image classification, image matching and image orientation.

THE IMAGE DATASET
Examples for the historical image dataset are shown below (fig. 1). The whole published dataset consists of 24 images with a maximum side length of 3543 pixels. It is mostly unclear whether the original data originates from photographic plates or film negatives. The images are grouped in two triples respectively for 4 buildings (2 × 3 × 4 = 24). Images were chosen with respect to their possible matching quality. The images show combined differences in illumination, field of view, viewpoint, blurring and slight rotation. Some of the images show building reflections in water or extreme shadowing. Thus, the dataset is very challenging for any single feature matching method.
Since the relative orientation of the image pairs cannot be easily described through a homography as explained before, the first step is the description of the image pairs using a Fundamental Matrix F calculated from at least 7 point correspondences, where F is defined by

$x'^{T} F x = 0$    (1)

with $x'$ and $x$ denoting at least 7 image correspondences in homogeneous coordinates.
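For illustration, the incidence relation of equation 1 can be checked directly for candidate correspondences. The following is a minimal NumPy sketch (function names are illustrative, not from the paper); it also exposes the point-to-line mapping discussed next.

```python
import numpy as np

def epipolar_residual(F, x1, x2):
    """Incidence relation x'^T F x = 0 (eq. 1) for one candidate match;
    x1, x2 are homogeneous points, e.g. np.array([u, v, 1.0])."""
    return float(x2 @ F @ x1)

def epipolar_line(F, x1):
    """Map a point in image 1 to its epipolar line l' = F x in image 2
    (line coefficients (a, b, c) of a*u + b*v + c = 0)."""
    return F @ x1
```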
It must be noted that this equation can hardly be used to test correspondences determined with feature matching methods, because an estimated (e.g. using RANSAC) Fundamental Matrix F is only a projective map taking a point to a line. That means a point $x = (x, y, 1)^{T}$ in the first image defines a line (the corresponding epipolar line $l' = Fx$) in the second image (Hartley and Zisserman, 2003). Additionally, the point transfer from image 1 to image 2 using the epipolar line can lead to false positives, considering matches that lie randomly on the epipolar line but are no true matches. This leads to a more stable image configuration when using three images, because e.g. the epipolar lines from image 1 to image 3 and from image 2 to image 3 of the same feature point intersect in image 3 in the homologue feature point (Maas, 1997). The matching can be simplified using the 3 × 3 × 3 Trifocal Tensor T and its properties for a point-point-point correspondence (eq. 2) (Hartley and Zisserman, 2003):

$[x']_{\times} \left( \sum_i x^i T_i \right) [x'']_{\times} = 0_{3 \times 3}$    (2)
where
$x, x', x''$ = image coordinates in the three images,
$[\cdot]_{\times}$ = 3 × 3 skew-symmetric matrix of a 3-vector,
$i$ = number of the 3 × 3 tensor slice,
$T_i$ = slice of the Trifocal Tensor T of the three images,
$0_{3 \times 3}$ = 3 × 3 null matrix.

A point transfer from e.g. the first view to the third view can then be realized using equation 3 and the corrected Fundamental Matrices $F_{12}$, $F_{13}$ and $F_{23}$ extracted from the Trifocal Tensor (Hartley and Zisserman, 2003):

$x''^{k} = x^{i} l'_{j} T_i^{jk}$    (3)

where $i, j, k$ = indices that correspond to the entities in the first, second and third view respectively, and $l'$ = a line through the corresponding point in the second view.
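A NumPy sketch of both relations follows (assuming the tensor T is given as a (3, 3, 3) array of slices; helper names are illustrative): the residual of equation 2 vanishes for a true triple correspondence, and equation 3 transfers a point into the third view.

```python
import numpy as np

def skew(v):
    """3x3 skew-symmetric matrix [v]_x of a 3-vector v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def trilinear_residual(T, x1, x2, x3):
    """Point-point-point incidence (eq. 2): the result should be the
    3x3 null matrix for a true correspondence (homogeneous points)."""
    M = sum(x1[i] * T[i] for i in range(3))    # sum_i x^i T_i
    return skew(x2) @ M @ skew(x3)

def transfer_point(T, x1, l2):
    """Point transfer to the third view (eq. 3): x''^k = x^i l'_j T[i,j,k],
    with l2 a line through the corresponding point in the second view."""
    x3 = np.einsum('i,j,ijk->k', x1, l2, T)
    return x3 / x3[2]                          # normalize to pixel coordinates
```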
Since the Trifocal Tensor is not as easy to determine as a homography or the Fundamental Matrix, it is provided for every benchmark image triple. Additionally, the calculation of T is explained in the following. There are various methods for the computation of the Trifocal Tensor, e.g. the minimal parameterizations by Faugeras and Papadopoulo (Faugeras and Papadopoulo, 1998) and by Nordberg (Nordberg, 2009) or the constrained solutions by Ponce and Hebert (Ponce and Hebert, 2014) as well as Ressl (Ressl, 2002). Most approaches have already been tested, and the constrained solution by Ressl has shown the most robust results, leading to the smallest reprojection errors (Julià and Monasse, 2017). Using this computation method requires approximation values for the Trifocal Tensor and the Projection Matrices of the three images. These can be found by solving the linear system $At = 0$, where $t$ holds the 27 tensor elements. The matrix A is the Jacobian of the trilinearities and consists of $c$ row-wise ordered 4 × 27 sub-matrices $A_c$, where c is the number of point correspondences (eq. 4) (Ressl, 2003):

$A_c = x^{T} \otimes [x']_{\times(2)} \otimes [x'']_{\times(2)}$    (4)
where
$[\cdot]_{\times(2)}$ = reduced axiator of the point coordinates,
$\otimes$ = Kronecker product,
yielding 4 linearly independent equations per correspondence.

Afterwards, the approximation values can be calculated by minimizing the algebraic error using a singular value decomposition (SVD). It is recommended to use at least 10 normalized point correspondences in all three images with a pixel noise of 1 to minimize the reprojection error with the subsequent constrained solution (Ressl, 2003). For the benchmark dataset, at least 15 manual point correspondences were used in the image triples and a pixel noise < 1 was targeted. The verified results for the different matching strategies show that this goal could be accomplished. The detailed description, the images, the matched points and the corresponding Trifocal Tensor are available online (https://dx.doi.org/10.25532/OPARA-24).
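Returning to the linear initialization step above, a minimal sketch of the algebraic minimization follows, assuming the design matrix A of stacked 4 × 27 sub-matrices has already been built from normalized correspondences:

```python
import numpy as np

def approximate_tensor(A):
    """Minimize ||A t|| subject to ||t|| = 1 via SVD.

    A : (4c, 27) Jacobian of the trilinearities for c correspondences.
    Returns the algebraic least-squares solution, reshaped to an
    approximate 3x3x3 trifocal tensor that is refined afterwards,
    e.g. by a constrained solution as in Ressl (2003).
    """
    _, _, Vt = np.linalg.svd(A)
    t = Vt[-1]          # right singular vector of the smallest singular value
    return t.reshape(3, 3, 3)
```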
Since the Trifocal Tensor provides geometric relations between three views only in a projective frame independent of scene structure (Hartley and Zisserman, 2003), the resulting camera matrices $P$, $P'$, $P''$ retrieved by equation 5,

$P = [I \mid 0]$, $P' = [[T_1, T_2, T_3] e'' \mid e']$, $P'' = [(e'' e''^{T} - I)[T_1^{T}, T_2^{T}, T_3^{T}] e' \mid e'']$    (5)

with $e'$, $e''$ denoting the epipoles in the second and third view, could be introduced as a prior relative orientation into an autocalibration algorithm (Heinrich et al., 2011), allowing the estimation of inner and exterior orientation and, in the following, the generation of simple structures in Euclidean metric 3D space.
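A NumPy sketch of this camera extraction, following the formulas in Hartley and Zisserman (2003), is shown below; the epipoles are recovered from the null-vectors of the tensor slices, and the function name is illustrative.

```python
import numpy as np

def cameras_from_trifocal(T):
    """Projective cameras P, P', P'' from a (3, 3, 3) trifocal tensor
    (Hartley and Zisserman, 2003); a sketch, valid up to a projective frame."""
    # Left and right null-vectors of each slice T_i
    U = np.stack([np.linalg.svd(T[i])[0][:, -1] for i in range(3)])
    V = np.stack([np.linalg.svd(T[i])[2][-1] for i in range(3)])
    # Epipoles e' (view 2) and e'' (view 3) as null-vectors of the stacks
    e2 = np.linalg.svd(U)[2][-1]
    e3 = np.linalg.svd(V)[2][-1]
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    M2 = np.stack([T[i] @ e3 for i in range(3)], axis=1)     # columns T_i e''
    M3 = np.stack([T[i].T @ e2 for i in range(3)], axis=1)   # columns T_i^T e'
    P2 = np.hstack([M2, e2[:, None]])
    P3 = np.hstack([(np.outer(e3, e3) - np.eye(3)) @ M3, e3[:, None]])
    return P1, P2, P3
```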

COMPARISON OF DIFFERENT FEATURE DETECTION AND DESCRIPTION METHODS
In the following, the different feature detection methods used on the benchmark image dataset are briefly explained. Three distinct algorithms were chosen to process the images in full resolution and find point features. The comparison is done between image pairs but can be evaluated using the Trifocal Tensor. Thus, the number of correct matches in relation to the sum of all matches (= matching score) can be determined. Some of the common methods have already been tested on historical image data, and a combination of the ORB (Oriented FAST and Rotated BRIEF) feature detector and the SURF (Speeded-Up Robust Features) feature descriptor produced decent results (Ali and Whitehead, 2014). Another approach that generated a good matching ratio was the MSER (Maximally Stable Extremal Regions) feature detector and descriptor (Wolfe, 2013). Additionally, those results are compared with a newer method called RIFT (radiation-invariant feature transform), which neglects radiometric differences in images and thus can be a good addition to existing approaches.
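The matching score itself is straightforward to compute once a geometric correctness test against the ground-truth orientation is available. A minimal sketch (the correctness callback and the tolerance in the usage comment are illustrative assumptions, not values prescribed by the paper):

```python
def matching_score(matches, is_correct):
    """Matching score = correct matches / all found matches.

    matches    : list of match records (e.g. point tuples)
    is_correct : callable deciding geometric correctness of one match,
                 e.g. a thresholded epipolar or trilinear residual
                 against the ground-truth orientation.
    """
    if not matches:
        return 0.0
    return sum(bool(is_correct(m)) for m in matches) / len(matches)

# Illustrative use with the epipolar residual from the earlier sketch and
# a fundamental matrix F_gt extracted from the ground-truth Trifocal Tensor:
# score = matching_score(pairs,
#                        lambda m: abs(epipolar_residual(F_gt, m[0], m[1])) < 1e-2)
```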
For the first and second test, the standard implementations of ORB, SURF and MSER in OpenCV were used. The third test used the implementation of RIFT in Matlab (Li et al., 2018). The results are presented without outlier removal using brute force matching, with outlier removal using a symmetry test, and as a third approach with outlier removal using a Fundamental Matrix calculation with the random sample consensus (RANSAC) (Fischler and Bolles, 1981). Additionally, for RIFT the native calculation using the fast sample consensus (FSC) (Wu et al., 2015) is shown.

Oriented FAST and Rotated BRIEF (ORB)
ORB is a common alternative to SIFT and uses an intensity-oriented FAST (Rosten and Drummond, 2006) for feature detection and an in-plane rotation invariant version of BRIEF (Calonder et al., 2010) for feature description (Rublee et al., 2011). Since a hybrid version using the ORB detector and the SURF descriptor achieved better results on historical images (Ali and Whitehead, 2014), the presented approach chooses this as a first method for feature detection and description. The oriented FAST detects keypoints using the intensity threshold between the center pixel and a circular ring around that center. The orientation of the keypoints is determined using an intensity centroid (Rosin, 1999).
The maximum number of retained features was raised from the default value of 500 to 2500 to allow a better comparison with the other methods. In the following, SURF is used for the description of the features, since it outperforms other descriptors in repeatability, distinctiveness and robustness (Bay et al., 2006).
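A minimal OpenCV sketch of this hybrid detector/descriptor combination (the image path is a placeholder; SURF requires the opencv-contrib build and may be patent-restricted):

```python
import cv2

# Load a historical photograph in grayscale (placeholder path).
img = cv2.imread('historical_view.jpg', cv2.IMREAD_GRAYSCALE)

# Oriented FAST keypoints; the retained maximum is raised from the
# default 500 to 2500 as described above.
orb = cv2.ORB_create(nfeatures=2500)
keypoints = orb.detect(img, None)

# Describe the ORB keypoints with SURF, the hybrid combination of
# Ali and Whitehead (2014); SURF lives in the contrib module.
surf = cv2.xfeatures2d.SURF_create()
keypoints, descriptors = surf.compute(img, keypoints)
```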

Maximally Stable Extremal Region Detector (MSER)
As a second method, the presented approach uses MSER (Matas et al., 2004). This algorithm is usually applied on image pairs with a wide baseline. Classical feature points are replaced by regions which are closed under projective transformation of image coordinates and monotonic transformation of image intensities (Matas et al., 2004). Those properties can be especially useful for historical images because of the image differences explained above. Regions described by a connected set of pixels are chosen by the property that all pixels inside one extremal region have either a higher or a lower intensity than all the pixels on its outer boundary (Mikolajczyk et al., 2005). Again, SURF is used for the description of the regions consisting of feature point sets.
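Analogously, a sketch of MSER detection followed by SURF description (same assumptions as in the ORB sketch above):

```python
import cv2

img = cv2.imread('historical_view.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path

# Detect maximally stable extremal regions; OpenCV exposes them
# as keypoints via the Feature2D interface.
mser = cv2.MSER_create()
keypoints = mser.detect(img)

# Describe the region keypoints with SURF as before.
surf = cv2.xfeatures2d.SURF_create()
keypoints, descriptors = surf.compute(img, keypoints)
```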

Radiation-invariant Feature Transform (RIFT)
The third method used is called RIFT. The radiation-invariant feature transform is chosen because of its invariance to nonlinear radiation distortions (NRD) (Li et al., 2018) and its use of edge features in addition to corner features. Both effects can support the feature detection in historical images. The approach uses the Fourier transform to generate phase congruency maps. Independent maps for each orientation of a 2D log-Gabor filter are created and used for the detection of corner features as well as edge features. In the following, those features are described by a 216-dimensional feature vector calculated through a maximum index map based on a log-Gabor convolution sequence (Li et al., 2018). RIFT is currently not scale invariant, so it should perform poorly on large scale changes. However, it has been observed that feature points in image pairs with small scale changes can still be matched correctly.

Feature matching and outlier removal
For the comparison of all methods, the presented approach uses brute force matching (_bf) for all detected feature points, i.e. all feature points with their particular descriptors are matched (so every descriptor in image 1 is compared with every descriptor in image 2). In the following, two different outlier removal methods are evaluated. The first approach uses a symmetry test (_sym): matches from image 1 to image 2 are only kept if they are also matches from image 2 to image 1. In the second approach, the calculation of a Fundamental Matrix between both images, based on the brute force matching result, is used to eliminate outliers. For this, the RANSAC algorithm (_RANSAC) was chosen (Fischler and Bolles, 1981). Additionally, for RIFT the already implemented outlier removal (_native) using the fast sample consensus (FSC) (Wu et al., 2015) is shown.
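A sketch of these matching variants in OpenCV (assuming keypoints kp1, kp2 and SURF descriptors desc1, desc2 from the detection sketches above; the threshold and confidence values are illustrative):

```python
import cv2
import numpy as np

# Brute force matching (_bf): every descriptor in image 1 is compared
# with every descriptor in image 2.
bf = cv2.BFMatcher(cv2.NORM_L2)            # L2 norm suits SURF descriptors
matches_bf = bf.match(desc1, desc2)

# Symmetry test (_sym): crossCheck keeps a match only if it is mutual
# between both matching directions.
bf_sym = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches_sym = bf_sym.match(desc1, desc2)

# Fundamental Matrix + RANSAC (_RANSAC) on the brute force matches.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches_bf])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches_bf])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
matches_ransac = [m for m, ok in zip(matches_bf, mask.ravel()) if ok]
```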
100 % correct matches could not be reached with the presented approaches. Consequently, it can be said that the crucial point when working with historical images is the outlier removal step. Since there is always a small number of feature points in the image pairs that can be matched, the objective is to filter those correctly. The symmetry test only slightly improved the results of the brute force matching. RANSAC performs better, but most of the time the exact Fundamental Matrix could not be found. It seems that a refined RANSAC algorithm like the FSC used in the RIFT approach could improve the matching scores.
A combination of all methods could result in higher scores and will be tested in the future.Multiple iterations when calculating the Fundamental Matrix or improved RANSAC algorithms like FSC, PROSAC (Chum and Matas, 2005) or MSAC (Torr and Zisserman, 2000) could improve the matching scores for all approaches.
Summarizing, for all image triples of the benchmark dataset it is possible to find and match homologous points, but almost only using RIFT. MSER generally finds the most feature points, but most of the time RIFT in combination with FSC shows the highest matching scores. For some special image constellations the other approaches could be more appropriate, and a combination of methods could lead to better results (Mishkin et al., 2015). Historical images are still a challenge for classical feature detection and matching algorithms, thus a cautious outlier removal is inevitable.

CONCLUSIONS AND FUTURE WORK
The contribution shows the generation and evaluation of a dataset consisting of 24 historical images. Difficulties in determining the relative orientation of the data arise due to large image differences and unknown camera parameters. Thus, a more stable image configuration using three images described by the Trifocal Tensor T has been established, and T is given for every image triple in the dataset. The Trifocal Tensor can be used to evaluate different feature detectors and matching methods on historical images, and the dataset can be used as a benchmark set. In this research, MSER, ORB and RIFT were used, since these algorithms have already shown good results in other publications.
For the presented dataset, RIFT produced better results than the other two methods. FSC performed better in outlier removal than the symmetry test or RANSAC.
It is planned to establish a more reliable workflow for historical image matching using multiple methods consecutively. Other already developed approaches will also be tested on the dataset in the future (Maiwald et al., 2018). Different outlier removal methods could still improve the matching scores. Additional oriented historical images will be added to the dataset to provide a challenging base for other researchers.
Since the images are oriented with the Trifocal Tensor only in a projective space, it is planned to use this estimated relative orientation as a base for a metric solution and to calculate the inner and exterior orientation of the historical images. In the following, these images could be placed in the 3D/4D web application. Furthermore, simple features like single lines or planes could be generated in 3D space to create generalized historical 3D models.

Figure 1. All current images of the benchmark dataset showing the variety of historical images

Table 1. Results for different feature matching methods for 8 different image triples (= 24 image pairs). Matching results are shown for every dataset for the image pairs 1_2, 1_3 and 2_3 as the ratio in % of correct matches to all found matches (matching score). Good results are highlighted in green whereas bad results are shown in red.

Table 2. Results for different feature matching methods for 8 different image triples (= 24 image pairs). The total number of correct matches is shown for every dataset for the image pairs 1_2, 1_3 and 2_3. Good results are highlighted in green whereas bad results are shown in red.