AUTOMATIC ALGORITHM FOR GEOREFERENCING HISTORICAL-TO-NOWADAYS AERIAL IMAGES ACQUIRED IN NATURAL ENVIRONMENTS

: Automatic georeferencing for historical-to-nowadays aerial images represents the main ingredient for supplying territory evolution analysis and environmental monitoring. Existing georeferencing methods based on feature extraction and matching reported successful results for multi-epoch aerial images acquired in structured and man-made environments. While improving the state-of-the-art of the multi-epoch georeferencing problem, such frameworks present several limitations when applied to unstructured scenes, such as natural feature-less environments, characterized by homogenous or texture-less areas. This is mainly due to the lack of structured areas which often results in sparse and ambiguous feature matches, introducing inconsistencies during the pose estimation process. This paper addresses the automatic georeferencing problem for historical aerial images acquired in unstructured natural environments. The research work presented in this paper introduces a feature-less algorithm designed to perform historical-to-nowadays image matching for pose estimation in a fully automatic fashion. The proposed algorithm operates within two stages: (i) 2D patch extraction and matching and (ii) 3D patch-based local alignment. The ﬁnal output is a set of 3D patch matches and the 3D rigid transformation relating each homologous patches. The obtained 3D point matches are designed to be injected into traditional multi-views pose optimisation engines. Experimental results on real datasets acquired over Fabas area situated in France demonstrate the effectiveness of the proposed method. Our ﬁndings illustrate that the proposed georeferencing technique provides accurate results in presence of large periods of time separating historical from nowadays aerial images (up to 48 years time span).


INTRODUCTION AND MOTIVATION
Generating historical 3D maps from aerial photogrammetric surveys acquired at different epochs represents a valuable information for a wide range of civilian and military applications. Detecting environmental changes across several epochs (Lucas and Gayer, 2021), (Feurer and Vinatier, 2018), supplying infrastructure for ecological transformation (Pinto et al., 2019), but also natural disaster prevention and management are several applications requiring automatic frameworks for producing accurately georeferenced historical Geographical Information Systems (GIS) encoded as orthoimages, Digital Surface Models (DSMs) or land cover maps.
Traditional georeferencing methods rely on operator's intervention for generating ground control points (GCP) (Pinto et al., 2019). While providing accurate solutions, such techniques are laborious at small scale and unfeasible when applied on massive datasets acquired over wide areas and coming from different epochs. Recently reported automatic frameworks exploit feature-based approaches (Feurer and Vinatier, 2018), (Zhang et al., 2021), being suitable for scenes representing structured environments. Therefore, introducing fully automatic techniques capable to provide accurate georeferencing of multi-epoch aerial imagery acquired in natural feature-less environments remains an open issue.
This paper addresses the automatic georeferencing problem for multi-views partially overlapped historical aerial images, acquired in unstructured natural and rural environments. Generally, the key issue standing behind the automation of the multiepoch georeferencing process is represented by the capability * Corresponding author to extract homologous "features" or "landmarks" in presence of consistent environment changing which characterizes multiepochs acquisition scenarios. This problem becomes more severe for images acquired in natural and rural environments which are generally less structured than built urban areas and where the absence of reliable detectable and trackable features over time can not be guaranteed. Natural environments are opposite to urban, man-made areas for which it is easier to extract homologous inter-date features (points or lines) in the neighbourhood of anthropic objects. In addition, as stated in (Giordano et al., 2018), the detected trackable features must be homogenously dispersed over the entire 2D image space in order to ensure a stable multi-view pose optimization process via the bundle adjustment algorithm (Hartley and Zisserman, 2003), (Rupnik et al., 2017).
In order to address the aforementioned key issues, this paper presents a fully automatic patch-based method designed for accurately georeferencing historical aerial images acquired in natural environments. The present paper is organized as following: the next section summarizes existing georeferencing methods for historical datasets, emphasizing our main contributions. Section 3 presents the global overview of the proposed georeferencing algorithm, while highlighting the composing processing stages, each of which being detailed in the following two sections. Experimental results and their performance evaluation are presented in Section 6, while Section 7 draws the conclusions of our contribution and synthesizes future research directions.

RELATED AND PROPOSED WORK
Designing methods for supplying automatic georeferencing of aerial images acquired across different epochs requires to com-pute the optimal 3D rigid transformation lying between historical and recent images. Multi-epoch algorithms often proceed by first establishing homologous feature points between the historical and the recent images. In a second stage, image matches are injected into the pose computation process (Rupnik et al., 2017), (Hartley and Zisserman, 2003), (Moulon et al., 2016) and multi-view optimization via the bundle adjustment algorithm.
The main weakness of the existing multi-epoch georeferencing frameworks is represented by the difficulty to establish longterm feature matches based on which the pose computation can be performed. This paper focuses on the inter-epoch homologous image points computation stage which must be ensured in order to allow automatic georeferencing of multi-epochs aerial images acquired in natural environments.
Following the image matching strategy, it is possible to distinguish two main categories of methods. The first category relies on Ground Control Points (GCP) which are usually provided by an operator (Pinto et al., 2019) or supplied automatically (Giordano et al., 2018). While generating accurate results, manual GCP-based frameworks (Pinto et al., 2019) remain fastidious and difficult to be applied on large scale scenes. The second category of georeferencing methods is represented by featurebased approaches. Reported frameworks exploit SIFT descriptors (Feurer and Vinatier, 2018), (Zhang et al., 2021), line segments (Cléry et al., 2014) and deep learning approaches (Zhang et al., 2021). In (Feurer and Vinatier, 2018), authors combine SIFT extraction and matching (Lowe, 2004) techniques, with Structure from Motion (SFM) and multi-view bundle adjustment frameworks (Rupnik et al., 2017). In (Zhang et al., 2021), authors introduce an additional processing stage for filtering image matches provided by deep learning approaches, such as SuperGlue (Sarlin et al., 2020). This illustrates that deep learning techniques present several drawbacks when applied to the multi-epoch image matching problem. In addition, such algorithms require exhaustive learning which is not feasible in the context of multi-epoch images acquired over largescale sceneries. While improving the state-of-the-art, featurebased method remain limited to structured and man-made environments.
Beside the need of finding long-term feature matches between multi-epoch images, an additional challenge is introduced by natural environments for which feature-based methods provide sparse and ambiguous image matches which are not uniformly dispersed over the image space (Giordano et al., 2018), therefore introducing inconsistencies into the multi-view optimisation procedure. This paper introduces a two-step strategy for establishing reliable multi-epoch point matches for aerial images acquired in natural environments. The first stage focuses on finding 2D patch matches in the image space. This step overcomes the main issue introduced by instable features accumulated over time, such as forests or high-vegetation areas. The second stage is performed in the 3D space and starts with the ground extraction procedure, followed by the 3D patch-based local alignment process. This allows to generate a set of 3D homologous points and the associated 3D rigid transformation lying between each 3D patch match.

OVERVIEW OF THE PROPOSED METHOD
The present research work introduces a fully automatic technique for georeferencing historical images onto nowadays im-ages. The algorithm first generates their DSMs and orthoimages, out of images (Giordano et al., 2018). The obtained products are coarsely georeferenced, out of the available image metadata.
The proposed georeferencing technique presented in this paper takes as input the coarsely georeferenced 2D orthoimages, corresponding to both epochs, together their associated DSMs and outputs the 3D poses together with homologous 3D points. The proposed georeferencing pipeline is composed of two stages. In the first stage, the algorithm performs patch extraction and matching in the 2D image space. The second stage associates the DSM corresponding to each 2D homologous patch and performs patch-based alignment in the 3D space. Figure 1 illustrates the workflow of the proposed DSM georeferencing technique which is summarized through the following description.
2D Patch extraction and matching. The first stage proceeds by extracting a set of patches in the historical image. The patch matching procedure computes homologous patches between the recent and the historical orthoimages through the use of the Histogram of Gradients (HOG) algorithm (Dalal and Triggs, 2005). For each patch extracted in the historical image, the associated HOG is computed. Its homologous patch is searched in the recent image by maximizing the Zero Normalized Cross Correlation (ZNCC) score between the HOGs associated to the historical patch and the recent patch, extracted in a searching area centered around the recent patch. The procedure outputs a set of 2D homologous patches which allows to estimate a 2D global translational motion model for the considered set of patches.
3D Patch-based DSMs alignment. The historical and the recent DSMs are associated to the 2D homologous patches in order to generate the 3D point cloud corresponding to each patch. The global 3D translation motion lying between 3D homologous patches is computed via the barycenter approach. The estimated translation is applied to the historical 3D point cloud, allowing to obtain a coarse 3D global alignment of the historical DSM with respect to the recent DSM.
In order to eliminate 3D points coming from unstable features accumulated over time (such as forest), a ground filtering procedure is performed based on curvature computation (Rusu and Cousins, 2011). This operation allows to discriminate points in two classes: ground and above-ground 3D points. The resulted 3D point clouds belonging to the ground are further injected into the local alignment process which is applied to each pair of 3D patch matches. The fine alignment procedure includes pose estimation based on the Iteratively Closest Point (ICP) algorithm (Besl and McKay, 1992), barycenter compensation and outlier rejection. The final output of the algorithm is a set of homologous 3D patches and their associated 3D rigid transformations. The resulted 3D homologous points are designed to be injected into traditional bundle adjustment engines (Hartley and Zisserman, 2003), (Rupnik et al., 2017) for multiview pose estimation and optimization, which may include the use of other homologous features.

2D PATCH MATCHING VIA HISTOGRAM OF GRADIENTS (HOG)
Let I old and Inew be the historical orthoimage and its associated recent orthoimage. The local patch matching procedure  starts by extracting a list of N square patches in the I old image, P old = {P old,i ∈ I old , i = 1, ..., N }, each patch being defined by a square neighborhood W(u old,i ) centered around the image point u old,i . The size of each patch is given by the patch ray size, noted r, which provides (2r + 1) × (2r + 1) pixels per patch. For each patch P(u old,i ) ∈ P old extracted in the image I old , we search for its optimal patch match in the image Inew by exploring a windowed searching area, centered around ui, noted W SA (ui) ∈ Inew. The patch matching process is performed by maximizing the Zero Normalized Cross Correlation (ZNCC) score (noted Z ) between the HOGs (Dalal and Triggs, 2005) associated to each patch.
.., M } be the number of M patches found by exploring the searching area W SA (ui) ∈ Inew with 1-pixel steps. Please note that the input orthoimages have the same spatial resolution. The best match Pnew,i is obtained by computing the similarity score, Z, for each patch and by maximizing the score over the entire searching area space selected in the image Inew. This yields a list of 2D homologous patches P(u old,i ) ←→ P(unew,i), i = 1, .., N and the possibility to compute a 2D local translational motion for each match. Figure 2 illustrates the output of the patch matching procedure, displaying the corresponding index for each patch match.

3D PATCH-BASED DIGITAL SURFACE MODELS ALIGNMENT
The historical and the recent DSMs are associated to the previously computed homologous 2D patches in order to generate the 3D point clouds corresponding to each patch. Let P3D,new = {pnew,i(x, y, z), i = 1, .., Nnew} and P 3D,old = {p old,i (x, y, z), i = 1, .., N old } be the 3D point clouds obtained by computing the terrain projection. Figure 3 illustrates the 3D point cloud obtained by associating orthoimages and DSMs for recent and historical datasets. When analyzing Figures 3 (a) and 3 (b), it is possible to observe the initial positioning of the historical dataset P 3D,old with respect to the reference 3D point cloud, P3D,new. It can be seen that the 2D coregistration is, at this stage already quite good, but the translation in the Z direction remains important. Figures 4 (a) and 4 (b) show the 3D homo- logous patches obtained by associating the 3D coordinates to the 2D patch matches illustrated in Figure 2, generated by the 2D patch matching procedure described in Section 4.

3D Patch-based Global Coarse Alignment
In order to provide a coarse global alignment, the global 3D translation motion lying between the 3D homologous patches is computed via the barycenter approach by applying the following equation: where n denotes the number of homologous 3D points belonging to all N patch matches, expressed as n = i=N i=1 card(P 3D old,i ) = i=N i=1 card(P 3D new,i ). The estimated translation,t3D, is applied to the historical 3D point cloud p old,k , k = 1, .., n, allowing to obtain a coarse 3D global alignment of the historical 3D point cloud with respect to the recent 3D point cloud, noted p new,k , k = 1, .., n.p new,k = p old,k +t3D (2) In order to refine the obtained global alignment, a local 3D patch-based alignment procedure is designed to be further applied to each 3D patch match.

Curvature-based Ground Filtering
In order to eliminate 3D points coming from unstable features accumulated over time (such as forest) or from noisy DSMs, this paper proposes a ground filtering procedure based on curvature values computed for a given 3D point, p k , k = 1, .., n, belonging to a 3D patch. This operation is performed on each 3D point cloud separately (historical and recent 3D patches) and allows to discriminate 3D points in two classes: ground and aboveground 3D points.
In the present research work, the curvature is estimated following the approach described in (Rusu and Cousins, 2011). For each 3D point p(x, y, z), a neighbourhood is selected around the point. The method proceeds by estimating the plane normal and the curvature for each 3D point. The curvature, noted c(p), is estimated through the use of the eigen values, λ1, λ2 and λ3, associated to the covariance matrix, being given by the following expression: Let c(p k ) be the curvature associated to a 3D point p k (x, y, z). Being given the previously obtained 3D homologous patches, for each 3D point cloud P 3D old,i and P 3D new,i , the mean curvature is computed, notedc old andcnew, respectively. Since 3D points belonging to the ground are described by low curvature values, the ground extraction is performed by rejecting 3D points with higher curvature values than the mean curvature,c, computed for each 3D point cloud. The ground extraction is performed by associating a weight, w k , to each 3D point p k (x, y, z) through the use of the following weighting procedure: where, the parameter δ denotes the threshold factor which allows to discriminate 3D points situated above the ground, corresponding to unstructured areas, such as forests and other high vegetation areas. Figure 4 (c) illustrates the result of the ground extraction procedure applied on the recent dataset Fabas, P 3D new,i , i = 1, .., N . It is possible to observe that the 3D points representing the ground were correctly extracted. Moreover, 3D points situated above the ground (forest, high vegetation) were eliminated efficiently. The ground filtering procedure eliminates unstable features, therefore allowing to perform a consistent 3D local alignment for each patch in order to refine the global alignment.
Let PG = {p k (x, y, z)|w k = 1, k = 1, .., nG} be a 3D point cloud representing the ground, where nG denotes the total number of 3D points. When analyzing the obtained result, it can be observed that 42.8% of the total amount of the 3D points represent the ground. The homologous 3D points assigned to the ground surface are further employed for registering locally each The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France 3D patch match.

3D Patch-based Local Fine Alignment
Let us note with Pnew,G and P old,G the 3D point clouds representing the ground for the recent and the historical dataset, respectively. In order to refine the 3D global patch alignment, each 3D patch match is injected into the local alignment procedure which includes rigid pose estimation via the ICP (Besl and McKay, 1992) algorithm, barycenter compensation and outlier rejection.
Let Pnew,G = {P 3D new,i , i = 1, .., N } and P old,G = {P 3D old,i , i = 1, .., N } be the list of N homologous 3D patches representing the ground. For each 3D patch match, (P 3D new,i ←→ P 3D old,i ), the local 3D patch alignment procedure estimates iteratively the optimal transformationTi = [R,t] which best aligns the 3D points belonging to each patch by minimizing the Residual Mean Square Error (RMSE), noted Q [R,t] , expressed in the following equation: The accuracy of the obtained 3D matches is evaluated in terms of RMSE computed between the reference and the optimally aligned homologous 3D points. The local alignment procedure is performed individually for each patch i = 1, .., N . For simplicity reasons, let us now drop the subscript i through the following description. LetT = [R,t] be the optimal rigid transformation obtained by applying the ICP algorithm to each 3D patch match.
The second stage of the proposed 3D patch-based pose estimation procedure exploits the estimated rigid transformationT = [R,t] to perform barycenter compensation between 3D points belonging to the overlapping area. The barycenter compensation is performed by registering the translation computed via the barycenter approach.
The third stage of the local alignment process is the outlier rejection step based on the residual errors expressed as: 3D points with a residual error r k >r (wherer denotes the mean residual error) are rejected by updating the weights w k , k = 1, .., nG, associated to homologous 3D points for each patch, as following:ŵ This allows to obtain a set of inliers matches nin,i for each 3D patch i = 1, .., N . The resulted matches are further injected in the ICP procedure for pose refinement. The goal is to minimize the criterion where r k are the weighted residual errors defined by: The final output of the 3D patch-based local alignment is a set of 3D point matches (p new,k ←→ p old,k ), k = 1, .., nin, belonging to each patch and their associated rigid transformations lying between each patch,Ti = [R,t]i, i = 1, ..., N , where N denotes the number of patch matches. Figure 5 illustrates the result of the iterative pose estimation process generated by the The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France ICP algorithm, for a single patch match, P 3D old,i ←→ P 3D new,i , with i = 4. The accuracy of the algorithm is measured in terms of fitness score (Rusu and Cousins, 2011), noted f [R,t] , defined as the mean of squared distances between each 3D point in the reference patch (recent DSM) to its closest point belonging to the target (historical DSM). The current ICP implementation (Rusu and Cousins, 2011) employs a 3D point cloud subsampling stage which allows to decrease the runtime of the estimation procedure, while ensuring pose correctness. The proposed patch-based pose estimation algorithm alternates an ICP estimation cycle (100 iterations) with barycenter correction and outliers rejection. By analyzing Figure 5, it can be observed that the algorithm starts the estimation process with the initial fitness score of f [R,t] = 2.41(m) and converges to the minimum fitness score f [R,t] = 0.03(m) in the 3rd cycle, during the last 100 iterations.  Figure 6 illustrates the output of the local 3D patch-based alignment generated for the dataset Fabas. By visually inspecting Figures 6 (a) and 6 (b), it is possible to observe that the obtained patch matches are coherent in terms of radiometry values. In order to evaluate the accuracy of the generated matches in terms of geometry, the altimetric residual values are computed, noted DoD (Difference of DEMs), which are expressed in the following equation: where,p new,k [R,t] =R −1 p old,k −t, k = 1, .., nin denote the historical 3D points registered with respect to the reference (recent) dataset. Figure 6 (c) displays the altimetric residual values obtained for the dataset Fabas. A more detailed analysis of the obtained result in presented in the following section. In order to illustrate the final alignment result in terms of geometry, Figure 4(d) displays the initial reference 3D patches (recent DSM) superposed with the optimally aligned patches (historical DSM) generated via the 3D local alignment procedure. By visually inspecting the obtained result, it is possible to observe that the 3D local alignment procedure provides a geometrically correct alignment for which the qualitative evaluation is presented in the following section.

EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION
This section presents the experimental results together with the performance evaluation in terms of accuracy and computation time. The first part is dedicated to a description of the dataset, while the second part focuses on experimental results and their evaluation.
Historical and Recent Datasets. Experimental results are performed on the historical dataset acquired in 1962 over the Fabas area, situated in France, which is essentially composed of natural and rural areas. Land-cover changes are mainly due to forest evolution and to agricultural parcels clustering. The reference dataset is represented by recent orthoimages and DSMs, acquired in 2010 and extracted from the French national image databases. A detailed description of the dataset can be found in (Giordano et al., 2018). Experimental results performed on several datasets extracted from the Fabas study area allowed us to conclude on the effectiveness of the proposed approach. In order to provide a consistent analysis of the output generated by each composing stage of the algorithm, the experimental results illustrated throughout this paper exploit the same dataset extracted from the Fabas survey.
Accuracy evaluation. The first measure for qualifying the accuracy of the proposed approach is the Difference of Digital Elevation Models (DEMs), noted DoD, expressed in Equation 8. Figure 7 illustrates the DoDs (elevation residual values) obtained for each homologous patch P 3D new,i ←→ [P 3D old,i , Ti], i = 1, .., N , indicating that the proposed method generates accurate registration results in terms of elevation values. As illustrated previously in Figure 6 (c), from Figure 7 (b) it can be observed that the maximum altimetric residual values are defined over the interval [−1, 1]m, being obtained for the patch match 2. Moreover, when analyzing the corresponding 3D output for each patch match displayed in Figure 7, one can observe that high altimetric residual values correspond to the patch matches 1 and 2, from which a high amount of 3D points were eliminated through the curvature-based ground filtering procedure. For these patches, the remaining points describe a non-uniform area. In addition, it is also possible to observe that for the patch matches 3 and 4, which describe a dominant flat area, lower altimetric residual values were obtained.
The second evaluation measure employed for analyzing the accuracy of the local alignment procedure is the RMSE measure computed between the reference patch and the locally aligned patch. Table 1 illustrates the residual mean error obtained for each patch match, expressed in Equation 6. When analyzing the results, one can observe that the mean residual errors,rin, are consistent with their corresponding DoD values illustrated in Figure 7. More precisely, when comparing the accuracy obtained for matches 1 and 2 with the accuracy corresponding to matches 3 and 4, it can be observed that more accurate results are obtained for matches 3 and 4 which describe flat sceneries. Moreover, by analyzing the altimetric residual values illustrated in Figure 7, it is possible to conclude that the obtained results confirm the residual values,rin, obtained for matches 1 and 2, which represent non-uniform areas and for which higher values were obtained. Computation time. The proposed algorithm is implemented in C/C++ and exploits OpenCV (Bradski, 2000) and PCL (Rusu and Cousins, 2011) libraries for 2D images and 3D point cloud processing, respectively. The algorithm runs on a desktop computer equipped with 32 Gb of RAM and an Intel processing  Table 1. Accuracies obtained by running the patch-based local alignment procedure on the dataset Fabas, with N = 4 homologous patches; nG: the number of 3D points representing the ground (input), nin: the number of inliers output by the pose estimation procedure. unit running at 3.5 GHz. Table 2 summarizes the runtime of each processing stage composing the algorithm. When analyzing different processing stages, it can be observed that the most time consuming stage of the algorithm is the 3D fine alignment phase. Several optimization schemes are possible in order to improve the runtime. In terms of algorithm, 2D ICP techniques allow to cut down the combinatory when computing a first pose estimation which will decrease the searching area space for the final 3D pose refinement. When applied to the multi-view georeferencing context, parallelization schemes allow to decrease the computation time considerably.
Processing stage runtime 2D Patch extraction+matching 30 sec Coarse 3D-Global alignment 0.5 sec Curvature-based filtering 10 sec Local 3D-patch fine alignment 11.69 min

CONCLUSIONS AND FUTURE RESEARCH WORK
The presented workflow addresses the archival images georeferencing problem for natural (or rural) feature-less environments through the use of a two-step patch matching and alignment procedure. The first stage outputs a coarse 3D global alignment which initializes the pose refinement stage performed locally for each patch. Experimental results and a qualitative evaluation performed on several epochs allowed us to conclude on the effectiveness of the presented method. Our findings demonstrate that the proposed georeferencing technique provides accurate results in presence of large periods of time separating historical from nowadays aerial images (up to 48 years time span). One of the main research perspective is directed toward the multi-view georeferencing stage via the bundle adjustment procedure, including the design of runtime optimization schemes. Beside the multi-epoch image georeferencing application, the proposed method is able to supply automatic marker-less pose estimation for several applications undertaken in rural and/or natural environments, overcoming therefore the difficulty of establishing homologous features.