An Advanced Benchmarking for Image Compositing Evaluation

This paper introduces a novel benchmarking tool for measuring the robustness of existing mosaicing algorithms in the presence of a given set of disturbances. Mosaicing combines a set of partially overlapping images into a wide-view result used to represent UAV image series and orthophotography. Geometrical misalignments caused by perspective errors may lead to unpredictable artifacts and phantom effects in the mosaics. Very few solutions measure their immunity to known distortions, and most focus on registration accuracy measurement. Only limited attention has been given to characterizing the response of the actual image fusion algorithms and their capacity to properly preserve content geometries. In this paper, we also introduce a new fidelity metric assessing the mosaicing response to a disturbance of a given extent used as prior information. The metric helps to better define the use cases fulfilling aerial imaging requirements.


INTRODUCTION
Mosaicing refers to the complete process of seamlessly merging a set of adjacent narrow-view images into a wide-view result, providing a significantly larger field of view (FOV). The mosaicing pipeline requires geometrical registration and warping prior to any image merging. When the merged sequences are shot at different times, a strategy must be adopted to ensure object motion integrity and to avoid visual double contours or phantom artifacts. Satellite and high-altitude images appear as static and are mostly immune to movement. Conversely, UAV low-altitude flights can be severely impacted by motion, leading to structural inconsistencies. Existing benchmarks measure the likelihood of images cropped through a Virtual Camera (VC) and are mostly incapable of assessing quality loss arising from photometric, geometrical, camera noise or vibration disturbances.
The main contribution of this paper lies in the definition of a novel objective assessment of (Disturbance, Mosaicing) couples to compare the resilience of existing algorithms. The fidelity measures the quality loss after application of the disturbance function to a nominal (ground truth) image through the structural PSNR and the perceptual SSIM likelihood index, according to their mosaicing response.
Section 2 introduces the main mosaicing principles and methods, before we review existing benchmarking solutions in Sec. 3. Section 4 describes our contribution along with image distortions and assessment metrics. It is used to conduct an experimental analysis in Sec. 5, before we conclude the paper.

IMAGE MOSAICING: PRINCIPLES AND METHODS
A mosaicing pipeline is mainly composed of two steps: registration and composition (see Fig. 1). The first step aligns a set of images with a reference one, correcting perspective disparities that may cause parallax errors, as described by (Adel, 2014). SIFT (Lowe, 2004) and SURF (Bay et al., 2008) are the feature point extraction algorithms most robust to scaling and affine distortions. The FAST (Rosten, 2018) corner detection is a rapid but not scale-invariant solution used in real-time applications. (Sheikh Faridul et al., 2014) propose a novel method for illumination and device invariant stitching capable of withstanding and correcting large changes in illumination. (Ghosh, Kaabouch, 2016) and (Xu et al., 2016) propose methods for correcting perspective distortions in UAV imaging.
* Also with LGO UMR 6538 CNRS UBO UBS
Among existing methods, we can mention:
• Image composition: Blending and best-seam stitching are the most common techniques used to merge different contents, as proposed by (Botterill et al., 2010).
• Best-seam search: The seam is placed between two images where intensity differences are minimal in the area of overlap.
• Gradient-based approach: This class of algorithms searches for the seam line featuring the lowest photometric difference between overlapping regions of adjacent images. (Levin et al., 2004) propose to perform seamless image stitching in the gradient domain (GIST) to minimize a cost function based on the dissimilarity to each of the input images and the visibility of the resulting seam.
• Graph-cut optimization: In this approach, a label associates each pixel to an image. The basic technique is a labeling problem solved through the minimization of an energy function on a graph, e.g. with a maximum flow or MinCut algorithm. (Dijkstra, 1959) and (Bellman, 2010) are two historically known algorithms for graph shortest path search. (Uyttendaele et al., 2001) propose a block-based algorithm relying on Regions of Difference (ROD), representing areas of overlap between input images. (Agarwala et al., 2004) propose a supervised multi-label graph-cut optimization to compute the weight of a region in pixel-wise mode; an iterative alpha-expansion graph-cut algorithm is used. (Brown, Lowe, 2007) formulate stitching as a multi-image matching problem using invariant local features. (Gracias et al., 2009) divide the mosaic space into large disjoint Regions of Image Intersection (ROII) with a camera-to-center penalization function.
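To make the best-seam idea concrete, the following minimal sketch searches a vertical seam of lowest photometric difference with dynamic programming. It is an illustration only, not the method of any of the cited papers: the function names, the squared-difference cost and the use of plain Python lists of grey levels (rather than an array library) are all assumptions made to keep the example self-contained.

```python
def best_seam(overlap_a, overlap_b):
    """Return one column index per row describing a minimum-cost vertical seam.

    overlap_a, overlap_b: equally sized 2-D lists of grey levels covering the
    overlap area of two registered images (hypothetical inputs).
    """
    rows, cols = len(overlap_a), len(overlap_a[0])
    # Cost map: squared photometric difference between the two overlaps.
    cost = [[(overlap_a[r][c] - overlap_b[r][c]) ** 2 for c in range(cols)]
            for r in range(rows)]
    # Accumulate costs top to bottom; each pixel extends the cheapest of its
    # (up to) three upper neighbours.
    acc = [cost[0][:]]
    for r in range(1, rows):
        prev = acc[-1]
        acc.append([cost[r][c] + min(prev[max(c - 1, 0):min(c + 2, cols)])
                    for c in range(cols)])
    # Backtrack from the cheapest pixel of the last row.
    seam = [0] * rows
    seam[-1] = min(range(cols), key=lambda c: acc[-1][c])
    for r in range(rows - 2, -1, -1):
        c = seam[r + 1]
        candidates = range(max(c - 1, 0), min(c + 2, cols))
        seam[r] = min(candidates, key=lambda j: acc[r][j])
    return seam


def stitch_along_seam(overlap_a, overlap_b, seam):
    """Take pixels left of the seam from A and the remaining ones from B."""
    return [row_a[:c] + row_b[c:]
            for row_a, row_b, c in zip(overlap_a, overlap_b, seam)]
```

Graph-cut formulations generalize this idea to arbitrary seam shapes and more than two images, at a higher computational cost.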

RELATED WORK
(Bevilacqua, Azzari, 2006), (Boutellier et al., 2008) and (Weibo et al., 2013) propose Virtual Camera (VC) and Object Indicator based plans for registration accuracy assessment in image sequences. (Li-hui Zou, 2011) propose a pair-wise blind mosaicing framework with use case recommendation matrices. (Achanta et al., 2012) is an example of a subjective metric based on Human Visual System (HVS) response analysis. (Khan et al., 2012) propose a new quantitative metric for assessing the quality of a mosaic with respect to the input narrow images. The process computes the Structural Similarity Index (SSIM) of the high-frequency information (HFI) for structural assessment and the Spectral Angle Mapper (SAM) on the low-frequency information for photometric assessment. The existing solutions measure registration accuracy by relying on the well-established PSNR and perceptual SSIM metrics and may not be suitable for structural or geometrical consistency assessment. (Bevilacqua, Azzari, 2006) and (Boutellier et al., 2008) focus on cumulative error measurement in a mosaicing data set. The benchmark uses real-world remote sensing scenes. Satellite, high- and low-altitude UAV images taken under different weather and lighting conditions have been chosen. The workflow of the proposed VC framework is given in Fig. 2.

PROPOSED BENCHMARKING
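We propose a novel method for the direct evaluation of a distortion / mosaicing algorithm couple with fully aligned images (planar constraint). It measures the response of a set of algorithms under probe to a set of typical UAV operational disturbances, producing an a priori index of robustness. The image composition is computed through direct application of blending and optimal-seam stitching methods to a pair of input narrow-view VC crops. The assessment first requires the definition of a formal metric to measure the robustness of the algorithms in the presence of disturbances.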

Mosaicing comparative assessment
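We now introduce a mathematical formalization of our benchmark and detail the resulting pipeline. We consider a set of test cases $T = \{t_0, \ldots, t_{n_t-1}\}$, where $n_t$ is the number of test cases. For each test case, we assume a set $S_t$ of 24-bit RGB input images of size $n \times m$ pixels, with $S = \{x_0, \ldots, x_{n_s-1}\}$ and $n_s$ the size of the imageset. We consider a pair of consecutive images, with $x_{k-1}$ and $x_k$ the first and second images respectively, and $k \in [0, n-1]$. $\Omega = \{\omega_0, \ldots, \omega_{n_m-1}\}$ is the set of mosaic images and the set of mosaic operators is written $M = \{m_0, \ldots, m_{n_m-1}\}$, where $\omega_i$ is the mosaic image issued from the application of the operator $m_i$, and $n_m$ denotes the number of defined mosaicing functions. We also define a reference mosaicing operator. A set of distortions is finally defined. The distortion $d_j$ is applied to the image $x_k$, with $j \in [0, n_d-1]$ and where $n_d$ is the number of distortions defined in the current workbench.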
We compute the reference mosaic $\omega_{ref}$ by image content replacement and the probe mosaic with the algorithm under probe using $\omega_{probe} = m_i(x_{k-1}, x_k)$. We measure the final quality of a stitch between two different contents, with no prior information available, as the likelihood between $\omega_{ref}$ and $\omega_{probe}$, considering their region of overlap $S_\omega = x_{k-1} \cap x_k$. Supervised mosaicing may take advantage of prior information, such as the relevance or age of an image in a sequence, to give it a higher rank and thus a larger surface in the final mosaic. When two distinct images with the same relevance are merged, the best result would ideally be composed of 50% of each image to minimize the loss of contents.
At first glance, we may think that if we generate an overlapping region from the undistorted $x_{k-1}$ image, $S_{\omega_{ref}}$ will be totally error-free, and the assessment will penalize any distortion introduced by $S_{\omega_{probe}}$. As such, the best solution would ideally contain no potentially distorted contents from the $x_k$ image in the resulting mosaic, leading to image losses. We therefore add the $S_{\omega_{ref}}$ case computed with the fully distorted image $x_k$ and average the results to improve reliability. Our approach computes a PSNR and an SSIM measuring the difference of contents in the shared regions $S_{\omega_{ref}}$ and $S_{\omega_{probe}}$. The principle is summarized in Fig. 3.
The benchmark applies increasing levels of disturbance to the image $x_k$. $S_{\omega_{probe}}$ may change according to the robustness of the algorithm under probe to the selected disturbance. The metrics are computed for every couple $(d_j, m_i)$ and summed up over all the images of the imageset. The single-mosaicing, single-distortion PSNR is computed on the nominal input $x_k$ and the distorted $x_k$, giving $PSNR_{i,j,k}$ and $SSIM_{i,j,k}$. The values of $PSNR_{i,j,k}$ and $SSIM_{i,j,k}$ are then summed up image by image over the entire imageset.
Figure 3. Process for mosaicing assessment
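As an illustration of the averaging over the two reference cases, the sketch below computes the standard PSNR, $10 \log_{10}(peak^2 / MSE)$, on an overlap region and averages the score obtained against the undistorted reference with the one obtained against the fully distorted reference. The function names, the grey-level list representation and the 8-bit peak value are assumptions made for the example; the exact equations of the benchmark are not reproduced here.

```python
import math


def psnr(ref, probe, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of grey levels."""
    n = sum(len(row) for row in ref)
    mse = sum((a - b) ** 2
              for row_r, row_p in zip(ref, probe)
              for a, b in zip(row_r, row_p)) / n
    if mse == 0:
        return math.inf  # identical contents
    return 10.0 * math.log10(peak ** 2 / mse)


def averaged_overlap_psnr(overlap_probe, overlap_ref_clean, overlap_ref_distorted):
    """Average the PSNR of the probe overlap against both reference overlaps.

    overlap_ref_clean: overlap filled with undistorted contents (assumed name).
    overlap_ref_distorted: overlap filled with fully distorted contents.
    """
    return 0.5 * (psnr(overlap_ref_clean, overlap_probe)
                  + psnr(overlap_ref_distorted, overlap_probe))
```

With this averaging, an operator that simply discards the distorted image no longer obtains a perfect score, since it is also compared with the fully distorted reference.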

Benchmarking Pipeline
The first step is to apply the VC. To do so, we generate Regions of Interest (ROIs) and crop $x_{k-1}$ and $x_k$ with a fixed shift $(x_0, y_0)$.
We then apply a disturbance function to the current image $x_k$ to obtain its distorted version, before computing a nominal mosaicing. We then compute the $PSNR_{i,j}$ and $SSIM_{i,j}$ quality assessment metrics for each imageset. Finally, we aggregate these PSNR and SSIM measures to build assessment reports and compute $MSE_{PSNR,j}$ and $MSE_{SSIM,j}$.
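The pipeline steps can be sketched as a loop over images, distortions and mosaicing operators. This is a simplified illustration under several assumptions: the shift is horizontal only, images are 2-D lists of grey levels, only the PSNR is accumulated, and `vc_crops`, `average_overlap`, `brighten` and `run_benchmark` are hypothetical names that do not come from the benchmark itself.

```python
import math


def psnr(ref, probe, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of grey levels."""
    n = sum(len(row) for row in ref)
    mse = sum((a - b) ** 2
              for rr, rp in zip(ref, probe) for a, b in zip(rr, rp)) / n
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def vc_crops(image, width, shift):
    """Virtual-camera step: two crops of `width` columns, offset by `shift`."""
    first = [row[:width] for row in image]
    second = [row[shift:shift + width] for row in image]
    return first, second


def average_overlap(x_prev, x_cur, shift):
    """Toy mosaicing operator: mean of both crops over their shared columns."""
    return [[(a + b) / 2.0 for a, b in zip(row_p[shift:], row_c)]
            for row_p, row_c in zip(x_prev, x_cur)]


def brighten(image, amount=10):
    """Toy photometric distortion: add a constant, clipped to 8 bits."""
    return [[min(v + amount, 255) for v in row] for row in image]


def run_benchmark(images, distortions, mosaicers, width, shift):
    """Sum PSNR scores per (mosaicing i, distortion j) couple over an imageset."""
    scores = {}
    for image in images:
        x_prev, x_cur = vc_crops(image, width, shift)
        for j, distort in enumerate(distortions):
            x_dist = distort(x_cur)
            for i, mosaic in enumerate(mosaicers):
                nominal = mosaic(x_prev, x_cur, shift)   # without disturbance
                probe = mosaic(x_prev, x_dist, shift)    # with disturbance
                scores[(i, j)] = scores.get((i, j), 0.0) + psnr(nominal, probe)
    return scores
```

For instance, `run_benchmark(images, [brighten], [average_overlap], width=6, shift=2)` returns a dictionary keyed by `(i, j)` from which per-distortion reports can be aggregated.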
The mosaicing fidelity metric assesses the likelihood variance at distortion time, used as prior information, against the mosaicing likelihood, measured through the PSNR and SSIM on the entire set of results. (Khan et al., 2009) demonstrate that the pixel-wise PSNR is not meaningful for measuring photometric and geometrical distortions occurring at the same time. Conversely, our index is composed of a cost function penalizing strong discordance through the PSNR and SSIM likelihoods measured after application of a single distortion.
A low distortion variance generating a high variance at mosaicing is highly penalized, as it may lead to unpredictable results. A high distortion variance with a low mosaicing variance corresponds to a reduced algorithm sensitivity. The fidelity workbench pipeline is illustrated in Fig. 4. We define $P_d = (PSNR_d)^{\gamma}$, the normalized value of the linearized distortion PSNR, with $\gamma = 0.2$ an empirically defined linearization coefficient. We also note $P_m = PSNR_m$ the normalized value of the mosaicing PSNR. We then compute the PSNR fidelity $F_{psnr} = (P_d - P_m^{\lambda})^2$. We assign a higher weight to the mosaicing $P_m$ than to $P_d$ and set the weight coefficient assigned to the mosaicing operator to $\lambda = 0.5$. The SSIM fidelity can be computed following a similar reasoning.
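A possible reading of the PSNR fidelity is sketched below. Two points are assumptions rather than statements of the paper: normalization is taken as a division by a maximum PSNR value `psnr_max`, and $\lambda$ is read as an exponent applied to $P_m$, mirroring the role of $\gamma$ for $P_d$. The function name `psnr_fidelity` is hypothetical.

```python
def psnr_fidelity(psnr_d, psnr_m, psnr_max, gamma=0.2, lam=0.5):
    """Squared discordance between distortion and mosaicing PSNR responses.

    psnr_d: PSNR measured after applying the distortion alone.
    psnr_m: PSNR measured after mosaicing.
    psnr_max: value used to normalize both PSNRs to [0, 1] (assumption).
    gamma, lam: coefficients, 0.2 and 0.5 in the text; lam is read here as an
    exponent on the mosaicing term (assumption).
    """
    if psnr_max <= 0:
        raise ValueError("psnr_max must be positive")
    p_d = (psnr_d / psnr_max) ** gamma   # linearized, normalized distortion PSNR
    p_m = psnr_m / psnr_max              # normalized mosaicing PSNR
    return (p_d - p_m ** lam) ** 2
```

Under these assumptions the value is 0 when both responses agree and grows towards 1 as a small disturbance produces a large change at mosaicing time, or the reverse.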

Scenarios and parameter settings
We run our benchmark with the following mosaicing algorithms: feathering, multiband (blending); Watershed, Graph-cut, Voronoi (stitching).
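For readers unfamiliar with the blending family, feathering is commonly understood as a weighted average whose weights ramp across the overlap. The sketch below shows one simple variant with a linear ramp on two horizontally adjacent grey-level images; the function name, the list representation and the exact ramp are assumptions, and this is not the implementation used in the benchmark.

```python
def feather_blend(left, right, overlap):
    """Merge two registered images sharing `overlap` columns with a linear ramp.

    left, right: 2-D lists of grey levels with the same number of rows; the
    last `overlap` columns of `left` cover the same scene as the first
    `overlap` columns of `right`.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least one column")
    merged = []
    for row_l, row_r in zip(left, right):
        start = len(row_l) - overlap          # first shared column in row_l
        blend = []
        for c in range(overlap):
            alpha = (c + 1) / (overlap + 1)   # weight of the right image
            blend.append((1 - alpha) * row_l[start + c] + alpha * row_r[c])
        merged.append(row_l[:start] + blend + row_r[overlap:])
    return merged
```

Because every overlap pixel mixes both inputs, this kind of operator hides seams well but can produce ghosting when the two images disagree geometrically, which is the behavior the benchmark is designed to measure.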
As far as the dataset is concerned, we consider various remote sensing and UAV data sources. More precisely, each imageset is made of 20 images coming from: SPOT 5, 6, 7 and Pleiades satellite imagery (satellite), coastal UAV surveys (coastal), black and white historical photographs (BW), various images from the geomatics lab of Rennes 2 University (Costel), and the ISPRS 2D semantic labeling contest (ISPRS).
We then define 6 test cases to conduct experimental comparisons:
1. Change in image overlap: with the coastline imageset, we measure the robustness when the overlap ratio of the input images changes.
2. Change in number of distortion steps: with the coastline imageset, we measure the robustness when the number of distortion steps changes.
3. Change in number of input images: with the coastline imageset, we measure the robustness when the imageset size changes.
4. Change in imageset: we consider BW, Coastline, and Satellite imagesets and measure the robustness when different imagesets of the same size are considered.
5. Reference ISPRS imageset: we measure the fidelity and robustness to a set of ISPRS images that are well-established for benchmarking activities.
6. Change in ISPRS number of images: we measure the robustness to growth in the imageset size.
Table 1 lists the parameters applied to each testing scenario.

Results
Figure 5 groups the different responses by distortion. The photometric distortions induce the highest levels of PSNR variability. Hue and vignetting give the worst photometric response. All test cases are robust to linear and affine geometrical distortions and to noise (with the exception of thermal noise and Gaussian blur). Test 4 (black and white imageset) yields the lowest variability.
Figure 6 shows acceptable perceptual variabilities for geometrical distortions and noise. Figure 9 shows the good performance of watershed. Graph-Cut and Voronoi are observed to lack robustness to photometric changes (see Fig. 10). Multiband gives far better structural results than feathering, as expected.
Figure 11 shows that blending has a high perceptual invariance to noise and a low invariance to geometrical transforms. Figure 7 confirms how severely the photometric distortions harm the structural variability of Graph-Cut and Voronoi. Figure 8 confirms how the geometrical transforms harm the perceptual variability of Graph-Cut and Voronoi, with acceptable results for blending. Figures 12 and 13 show how watershed produces the highest structural and lowest perceptual robustness on the same benchmarking scenario and is robust to photometric changes. Feathering has the best computational speed (597.337 s) and is widely used in real-time applications, but may not preserve structural integrity. The optimal-seam methods (watershed, Graph-Cut and Voronoi) may sacrifice visual pleasantness to preserve the structural integrity required by geomatics applications.
Watershed is the best trade-off between pleasantness and structural preservation, with an acceptable computational time (893.181 s). Graph-Cut shows the worst computational performance (18285.4 s).
We have introduced the fidelity analysis metric to measure the dependency between image distortion and mosaicing, as well as to penalize high MSE discordances on the entire imageset.
It has been tested on imageset 4, featuring the following subsets: Satellite, Coastline and BW. The Satellite subset features the highest homogeneity, with images of similar subjects. Figure 14

Figure 1. Mosaicing pipeline.

Figure 2. Virtual camera benchmarking

Figure 4. Fidelity computation workflow

Figure 5. Distortion - Test Normalized PSNR MSE Response
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-2/W13, 2019
ISPRS Geospatial Week 2019, 10-14 June 2019, Enschede, The Netherlands

CONCLUSION
In this paper, we have introduced a novel benchmarking tool aimed at classifying state-of-the-art UAV and remote sensing mosaicing algorithms according to their structural and perceptual robustness in the presence of a set of real-life operational disturbances of different extents. The results of a set of exhaustive tests show that watershed-based mosaicing features far better structural invariance to the entire set of disturbances. Feathering is perceptually more robust when subject to the same disturbances. Geometrical distortions give a high structural but low perceptual fidelity. None of the algorithms appears to be robust to noise, large photometric distortions and vignetting. Monochrome leads to the highest fidelity and should preferably be used to get the best mosaics. Despite the features introduced in this paper, we believe the proposed work needs to be pursued to fulfill the needs of aerial imaging professionals. A finer benchmark should then be created to measure the variance and fidelity of a set of simultaneous disturbances occurring on a single image. Ideally, the tool should be capable of assessing an entire sequence of images including aircraft altitude and position metadata vectors, featuring finer analysis and providing far more reliable results for each different flight. Nevertheless, the distortion ranges have been chosen to fit real-world operational conditions.

Table 2. Table of mosaicing use cases