CRITICAL REFLECTION ON QUANTITATIVE ASSESSMENT OF IMAGE FUSION QUALITY

Image fusion has developed from multi-sensor and multi-modal fusion to multi-focus fusion, and advanced techniques such as deep learning are increasingly integrated into fusion algorithms. Fusion quality assessment, although an important aspect, has received less attention. This paper reflects on the commonly used indices for quantitative assessment and investigates how well they represent fusion quality with regard to spectral preservation and spatial improvement. We found that image dissimilarities are unavoidable because different image sensors cover different parts of the spectrum. Image fusion should integrate these dissimilarities when they represent spatial improvement, and such integration naturally changes the pixel values. However, because the quality indices for spectral preservation measure image dissimilarity, integrating spatial information leads to a low fusion quality score. For the evaluation of spatial improvement, the quality indices only work if spatial details have been lost; when spatial details are gained, the indices do not register them as spatial improvement. Moreover, this paper draws attention to the image processing procedures involved in image fusion, including image geo-registration, image clipping and image resampling, which change image statistics and thereby influence the quality assessment when statistical indices are used.


INTRODUCTION
Image fusion, as a means of enhancing image quality and extracting useful information by combining images from different sources, has been widely applied in many fields over the past decades. Along with these applications, different terms have been used for the different techniques, from multisensor image fusion for remote sensing (Abdikan et al., 2014) and multi-modal image fusion for medical diagnosis (Hermessi et al., 2021) to multi-focus image fusion for optical microscopy (Liu et al., 2020). Meanwhile, more and more advanced techniques have been integrated into fusion algorithms, such as sparse representation (Ma et al., 2021, Zhang et al., 2021) and deep learning (Mustafa et al., 2020, Li and Wu, 2018). While image fusion algorithms have developed quickly, fusion quality assessment has received less attention. This paper makes a preliminary reflection on the commonly used methods for quality assessment, with a focus on remote sensing image fusion.
The general goal of remote sensing image fusion is to improve the spatial resolution while avoiding spectral distortion. Thus, the evaluation of image fusion quality emphasizes two aspects: spectral preservation and spatial improvement. The most direct assessment method is visual inspection. Its disadvantage, however, is that human observation is subjective and therefore problematic as a means of measurement. It is thus widely accepted to use a quantitative approach in addition. Li et al. (2010) conducted a survey and found a total of 27 quantitative measurements for quality assessment. Judging from the formulas of the listed measurements, most of them rely on the mean, standard deviation, correlation coefficient (CC), and variance, which are basic statistical parameters. The objective of this paper is to investigate how these statistical parameters are used as measurements for quality assessment, and to what extent they can reflect fusion quality with regard to spectral preservation and spatial improvement.

IMAGE DISSIMILARITIES AND IMAGE FUSION
As image fusion is conducted between two different images, it is necessary, before examining fusion quality, to explain why two images differ even when they are focused on the same target and taken at the same time. From a physical point of view, this is because each sensor system works within a specific range of the electromagnetic spectrum. In remote sensing terminology, such a range is referred to as a spectral band. The fusion of a panchromatic image with an RGB image, in other words, uses the panchromatic band to sharpen the visible bands and is therefore also called pan-sharpening.
Depending on its spectral coverage, each band depicts the ground cover differently. As an example, figures 1a and 1b show how Sentinel band 2 and band 5 present an airport area as different images. The runway appears brighter than its surroundings in band 2 and darker in band 5. For further investigation, the same red line is drawn across the runway on both bands. Plotting the reflectance values along this red line, figure 1c shows that the reflectance increases at the runway in band 2, while the reverse occurs in band 5. In figure 1, the two bands show slight dissimilarities but the overall spatial patterns are similar. Figure 2 shows another case, where the difference between two bands is much more distinct. Figures 2a and 2b are the panchromatic and thermal bands of a Landsat 8 image, with spatial resolutions of 15 m and 60 m, respectively. In figure 2a, two farm fields are separated by a road. In figure 2b, the road does not appear, because both farm fields have the same reflectance and the spatial resolution is lower. Along a red line drawn across this road, the radiance transect of the thermal band is a straight line, whereas on the pan image (figure 2a) the transect curve drops where the road surface is located.
This raises a question: how should image fusion deal with these image dissimilarities? After image fusion, the resulting image should show sharper and more numerous spatial details. In the case of figure 2, if the panchromatic band is used to sharpen the thermal band, the ideal result is that the border of the farm fields and the road between them appear on the sharpened thermal image. If so, the spectral curve of the thermal band along the same transect would no longer be a straight line, and would either drop or rise. An object is visible only through variation in reflectance; no variation means the features are indistinguishable.
It can therefore be concluded that spatial improvement entails a modification of the spectral information. This finding sheds light on a phenomenon that has been noticed by the image fusion community: there is a trade-off between spectral and spatial quality in pansharpening algorithms. For example, some researchers found that a trade-off occurs between spectral information and spatial information in fused images (Choi, 2006). Some studies reported that the higher the amount of signal integrated into the fused image during fusion, the higher the amount of aliasing that occurs (Aiazzi et al., 2012). Many researchers also consider this trade-off to be inherent to image fusion (Vijayaraj et al., 2004, Wang et al., 2005, Tu et al., 2007, Chen et al., 2008). This paper further investigates how this trade-off influences quality assessment, especially when statistical measurements are used.

FUSION QUALITY ASSESSMENT INDICES
In this section, commonly used indices for fusion quality assessment are discussed in two groups according to their function: the evaluation of spectral preservation and the assessment of spatial improvement.

Spectral preservation
For the evaluation of spectral preservation, the most commonly used quality indices include CC; the root mean square error (RMSE), which is a standard measure of the differences between values; the relative average spectral error (RASE), which is the average RMSE of all bands expressed as a percentage; and the relative dimensionless global error of synthesis (ERGAS), which is a further development of RASE that additionally considers the resolution ratio between the two image sets. These indices have been widely reported in the literature for performance comparisons of data fusion algorithms (Aiazzi et al., 2002, Aiazzi et al., 2006, Aanaes et al., 2008). Basically, these statistical methods compare the pixel values of the original and fused images; the ideal fusion result should show similar averages and a low standard deviation. For the original and fused images, identical pixel values imply good spectral preservation. This may hold for images with large homogeneous patches; for example, a green vegetation area should remain green after fusion. It may not hold if the image was taken in an urban area.
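As a rough illustration of how these four indices reduce to basic statistics, the following sketch computes them with NumPy. The formulas follow the definitions commonly given in the literature, not code from this paper; the function names, the (bands, rows, cols) array layout, and the toy data are assumptions made for the example.

```python
import numpy as np

def cc(ref, fused):
    """Correlation coefficient between a reference band and a fused band."""
    return float(np.corrcoef(ref.ravel(), fused.ravel())[0, 1])

def rmse(ref, fused):
    """Root mean square error between two bands."""
    return float(np.sqrt(np.mean((ref - fused) ** 2)))

def rase(ref, fused):
    """RASE in percent for stacks shaped (bands, rows, cols)."""
    band_mse = np.mean((ref - fused) ** 2, axis=(1, 2))
    return float(100.0 / ref.mean() * np.sqrt(band_mse.mean()))

def ergas(ref, fused, high_res, low_res):
    """ERGAS for stacks shaped (bands, rows, cols); resolutions in metres."""
    band_mse = np.mean((ref - fused) ** 2, axis=(1, 2))
    band_mean = ref.mean(axis=(1, 2))
    ratio = high_res / low_res
    return float(100.0 * ratio * np.sqrt(np.mean(band_mse / band_mean ** 2)))

# Hypothetical two-band reference stack and a "fused" stack offset by one grey level.
reference = np.arange(32, dtype=float).reshape(2, 4, 4) + 10.0
fused = reference + 1.0

print("CC   :", cc(reference[0], fused[0]))
print("RMSE :", rmse(reference[0], fused[0]))
print("RASE :", rase(reference, fused))
print("ERGAS:", ergas(reference, fused, high_res=10, low_res=60))
```

In this toy case CC stays at one because a constant offset does not change correlation, while RMSE, RASE and ERGAS all register the offset as an error. All four are zero-error only when the fused pixel values equal the reference values.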
As urban space is heterogeneous, the high-resolution image shows far more spatial detail than the low-resolution image. As discussed in the previous section, increased spatial detail naturally changes the pixel values of the original image. The greater the increase in spatial detail, the more dissimilarities appear in the fused image. These changes are necessary and desired, but the aforementioned statistical methods register them as poor fusion quality. Zero deviation and a CC of one hundred percent amount to absolute color preservation but also mean no spatial improvement.
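The effect can be reproduced on a toy example. The sketch below injects an increasing share of a spatial detail into a smooth band and reports RMSE and CC against the unsharpened band. The 6 × 6 arrays, the pixel values and the injection weights are hypothetical and serve only to illustrate the argument.

```python
import numpy as np

# Hypothetical low-resolution band: a smooth gradient from row to row.
reference = 50.0 + 2.0 * np.arange(6, dtype=float)[:, None] * np.ones((1, 6))

# Hypothetical detail visible only in the high-resolution band: one bright column.
detail = np.zeros((6, 6))
detail[:, 3] = 10.0

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def cc(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

scores = []
for weight in (0.0, 0.5, 1.0):
    fused = reference + weight * detail  # inject more or less of the detail
    scores.append((weight, rmse(reference, fused), cc(reference, fused)))
    print(scores[-1])
```

Only the run with weight 0, which adds no detail at all, reaches zero RMSE and a CC of one. Each increase in injected detail raises RMSE and lowers CC.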

Spatial improvement
Statistical measurements are not ideal for spectral quality assessment, and they are even more problematic for assessing spatial quality improvement. Some fusion quality indices are claimed to measure the structural similarity between reference and fused images, such as the structural similarity index metric (Wang et al., 2004), the average value of local variance (Beauchemin et al., 2002), and the quality index based on local variance, usually referred to as QILV (Aja-Fernandez et al., 2006). Other indices are claimed to evaluate the overall fusion performance, which includes spatial improvement. Among them, the most popular are the universal image quality index (UIQI) (Wang and Bovik, 2002) and the quaternion theory-based quality index, also known as Q4 (Alparone et al., 2004).
Following Wang and Bovik (2002), UIQI is the product of three components:

\[ \text{luminance distortion} = \frac{2\mu_r\mu_f}{\mu_r^2 + \mu_f^2} \tag{1} \]
\[ \text{contrast distortion} = \frac{2\delta_r\delta_f}{\delta_r^2 + \delta_f^2} \tag{2} \]
\[ \text{correlation loss} = \frac{\delta_{rf}}{\delta_r\delta_f} \tag{3} \]
\[ \text{UIQI} = \frac{\delta_{rf}}{\delta_r\delta_f} \cdot \frac{2\mu_r\mu_f}{\mu_r^2 + \mu_f^2} \cdot \frac{2\delta_r\delta_f}{\delta_r^2 + \delta_f^2} \tag{4} \]

where r and f are the reference image and the fused image
$\mu_r$, $\mu_f$ = local means of r and f
$\delta_r$, $\delta_f$ = local standard deviations of r and f
$\delta_{rf}$ = local covariance between r and f, so that equation 3 is the local correlation coefficient between r and f

A close comparison of the formulas of these indices shows small differences in expression, but they are essentially similar. Therefore, UIQI is taken as an example to present how these indices work for fusion quality assessment. In general, spatial details are built up by groups of pixels, and spatial improvement cannot be measured by comparing images pixel by pixel. Therefore, when these indices are implemented, a sliding window approach is often applied. The measurements are computed locally from the pixels within a sliding window. This window then moves pixel by pixel, horizontally and vertically, through all the rows and columns of the image. In the end, the overall quality index is the average of all the local measurements.
In the following examples, we use a 3 × 3 pixel window to present how UIQI works. The function of this index (equation 4) includes three parts: luminance distortion (equation 1), contrast distortion (equation 2), and correlation loss (equation 3). Assume an original image (figure 3a) containing a geometrical cross feature, part of which is lost (figures 3b and 3c) after processing by two fusion methods. Figure 3b retains more structural information than figure 3c. If the pixel values of the cross feature and of the surrounding pixels are set to 100 and 50 respectively, the UIQI values of the fused images in figures 3b and 3c can be calculated with equations 1 to 4; they are listed in table 1. The ideal UIQI value is one; the closer to one, the better the quality. The UIQI values in table 1 show that figure 3b has a better fusion quality than figure 3c. In this case, the index reflects the fusion quality reasonably well.
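The sliding window procedure can be sketched as follows. This is a minimal illustration, not the implementation used for table 1. It uses the product form of UIQI, 4·cov·µr·µf / ((var_r + var_f)(µr² + µf²)). The 7 × 7 test images are hypothetical stand-ins for figure 3, and the handling of flat windows (falling back to the luminance term) is a convention assumed here to avoid division by zero.

```python
import numpy as np

def uiqi_window(r, f):
    """UIQI of one window (product of luminance, contrast and correlation terms)."""
    mu_r, mu_f = r.mean(), f.mean()
    var_sum = r.var() + f.var()
    mean_sq_sum = mu_r ** 2 + mu_f ** 2
    if mean_sq_sum == 0:
        return 1.0  # assumed convention: both windows are entirely zero
    if var_sum == 0:
        return 2 * mu_r * mu_f / mean_sq_sum  # assumed convention: both windows flat
    cov = np.mean((r - mu_r) * (f - mu_f))
    return 4 * cov * mu_r * mu_f / (var_sum * mean_sq_sum)

def uiqi(ref, fused, size=3):
    """Average of the local UIQI values over all size x size windows."""
    ref = np.asarray(ref, dtype=float)
    fused = np.asarray(fused, dtype=float)
    rows, cols = ref.shape
    scores = [uiqi_window(ref[i:i + size, j:j + size], fused[i:i + size, j:j + size])
              for i in range(rows - size + 1)
              for j in range(cols - size + 1)]
    return float(np.mean(scores))

# Hypothetical original: a cross (value 100) on a background of 50.
original = np.full((7, 7), 50.0)
original[3, :] = 100.0
original[:, 3] = 100.0

# Hypothetical fusion result B: only the tip of the upper arm is lost.
fused_b = original.copy()
fused_b[0:2, 3] = 50.0

# Hypothetical fusion result C: the whole vertical bar is lost.
fused_c = np.full((7, 7), 50.0)
fused_c[3, :] = 100.0

q_b = uiqi(original, fused_b)
q_c = uiqi(original, fused_c)
print("UIQI, little structure lost:", q_b)
print("UIQI, more structure lost  :", q_c)
```

With these stand-in images the result that keeps more of the cross scores closer to one, which matches the behaviour described for the loss case.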
However, this only holds when the fused image has lost structural information that exists in the original image. In the image sharpening context, the fused image gains extra structural information, and the index then does not work as intended. Figure 4 shows such a case: the original image (figure 4a) has no visible internal structure, and after image fusion a cross structure (figure 4b) or a linear structure (figure 4c) is generated. Here the UIQI measurement results in a poor evaluation of fusion quality (see table 1). It appears that, among the three parts of the index, correlation loss is the one that reflects spatial loss, but a part that can reflect spatial gain is missing. Take the fusion of an optical and a thermal image as an example: spatial gain happens when ground objects have the same temperature, so that their borders appear only on the optical image, and image fusion integrates these borders into the thermal image. Such a spatial improvement is not reflected by the quality index.
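A single 3 × 3 window is enough to illustrate why gain and loss are indistinguishable for the index. The sketch below again uses the product form of UIQI; the window values are hypothetical. Because the formula is symmetric in its two inputs, adding a line to a flat window gives the same score as removing it.

```python
import numpy as np

def uiqi_window(r, f):
    """Product form of UIQI for one window (assumes a non-zero denominator)."""
    r = np.asarray(r, dtype=float)
    f = np.asarray(f, dtype=float)
    cov = np.mean((r - r.mean()) * (f - f.mean()))
    denominator = (r.var() + f.var()) * (r.mean() ** 2 + f.mean() ** 2)
    return float(4 * cov * r.mean() * f.mean() / denominator)

flat = np.full((3, 3), 50.0)                 # no visible structure
line = np.array([[50.0, 100.0, 50.0]] * 3)   # a vertical line appears

gain = uiqi_window(flat, line)  # original flat, fused image gains a line
loss = uiqi_window(line, flat)  # original has a line, fused image loses it

print("spatial gain:", gain)
print("spatial loss:", loss)
```

Both scores are zero, because the covariance with a flat window vanishes. The index therefore rates the added line as badly as the removed one.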

FUSION EXPERIMENT AND QUALITY ASSESSMENT
To examine the performance of the quality indices in practice, an image fusion experiment was carried out with Sentinel 2 multispectral bands. A Sentinel 2 image contains 13 spectral bands from VNIR to SWIR, and the spatial resolution depends on the particular band. We took band 4 as the high-resolution image to enhance band 9 and band 1. The spatial resolution and spectral location of the input images are listed in Table 2. These three bands were chosen because they do not overlap in the spectrum, which fits the situation illustrated in section 2.
We selected an agricultural area as the study area. As figure 5 shows, the farm fields in this area are laid out in grids. Such spatial features are convenient for investigating spatial improvement. We used a high pass filter to conduct the image fusion, because other classic image sharpening algorithms, such as IHS, Brovey, and PC, need at least three bands as low-resolution input. The experiment started by using Band 4 to sharpen Band 9 and Band 1 separately; at the same time, Band 9 and Band 1 were resampled to 10 m and used as the reference images. After fusion, the fused Band 9 and Band 1 were compared with their respective reference images. With the software ImAnalysis (Vaiopoulos, 2022), the quality indices were calculated for each pixel, as well as the average value for the entire image.

Table 2. Spatial resolution and wavelength of input images
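For readers unfamiliar with high pass filter fusion, the idea can be sketched in a few lines. This is a generic illustration and not the implementation used in the experiment: the box filter, the kernel size of 2 × ratio + 1, the injection weight, the nearest neighbor upsampling and the toy arrays are all assumptions. The resolution ratio of 6 corresponds to sharpening a 60 m band with a 10 m band.

```python
import numpy as np

def box_blur(img, size):
    """Mean filter with edge replication (a simple low pass filter)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / size ** 2

def hpf_fuse(low_res, high_res, ratio, weight=1.0):
    """Add the high pass component of high_res to the upsampled low_res band."""
    upsampled = np.repeat(np.repeat(low_res, ratio, axis=0), ratio, axis=1)
    detail = high_res - box_blur(high_res, size=2 * ratio + 1)
    return upsampled.astype(float) + weight * detail

# Hypothetical inputs: a flat 2 x 2 low-resolution band and a 12 x 12
# high-resolution band containing one bright linear feature.
low_res = np.full((2, 2), 40.0)
high_res = np.full((12, 12), 50.0)
high_res[:, 6] = 100.0

fused = hpf_fuse(low_res, high_res, ratio=6)
print(fused[0, 0], fused[0, 6])
```

The linear feature from the high-resolution band now stands out in the fused band, which necessarily moves its pixel values away from those of the flat low-resolution input.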
We took two steps to investigate the indices for fusion quality assessment. The first step was to check an index for the evaluation of spectral preservation, taking ERGAS as an example. We set sample points at farm fields, streets and residential houses and collected their ERGAS values after fusion. As figure 5 shows, these sample points present high ERGAS values, which indicate poor spectral preservation. However, a comparison of the original and the fused image makes clear that the spatial details have been improved at these points. More precisely, the borders of the farm fields, the edges of the streets and the shapes of the residential houses, which were blurry before, become clearly visible after fusion. Similarly, other statistical indices used for assessing spectral preservation, such as CC and RASE, also indicate low performance at these places. This shows that if these statistical measurements are used for assessment, the increased spatial features lead to a low rating of spectral preservation.

Table 3. Fusion quality assessment for the entire image
In the second step, we took UIQI as an example to investigate how this index reflects spatial improvement. For this purpose, the sample points were set at linear structures on the ground. To examine large-scale features, samples were set on the main street and on the borders of farm fields. To investigate small-scale features, samples were set where terraced houses are located. It turns out that even though there were distinct spatial improvements at these places, the UIQI values are very low, which indicates low spatial improvement. At some points, UIQI even shows negative values. These are exactly the cases described in figure 1, where the reflectance values of the two bands run in opposite directions.
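The negative values can be traced to the correlation term. The sketch below evaluates the three UIQI components for one hypothetical 3 × 3 window in which a feature is brighter than its surroundings in the reference and darker in the fused image, as for the runway in figure 1. The pixel values are assumptions chosen for illustration.

```python
import numpy as np

r = np.array([[50.0, 100.0, 50.0]] * 3)    # reference: feature brighter than surroundings
f = np.array([[100.0, 50.0, 100.0]] * 3)   # fused: same feature darker than surroundings

mu_r, mu_f = r.mean(), f.mean()
sd_r, sd_f = r.std(), f.std()
cov_rf = np.mean((r - mu_r) * (f - mu_f))

luminance = 2 * mu_r * mu_f / (mu_r ** 2 + mu_f ** 2)
contrast = 2 * sd_r * sd_f / (sd_r ** 2 + sd_f ** 2)
correlation = cov_rf / (sd_r * sd_f)
uiqi = luminance * contrast * correlation

print("luminance  :", luminance)
print("contrast   :", contrast)
print("correlation:", correlation)
print("UIQI       :", uiqi)
```

The feature is equally sharp in both windows, so the contrast term is one, but the reversed brightness gives a correlation of minus one and drives the overall index to roughly -0.98.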
The quality assessment result for the entire image is presented in Table 3. All the indices indicate low fusion quality, even though visual inspection reveals recognizable improvements. Overall, the fused Band 1 has a higher fusion quality than the fused Band 9 according to all the indices. This is not because a better fusion algorithm was used but because the spectral characteristics of the input bands differ.

OTHER FACTORS INFLUENCING INEFFECTIVE EVALUATION
The functions behind all the quality evaluation indices are computed from pixel values. Therefore, any factor that changes pixel values also influences the result of the quality evaluation. Yet during image fusion and quality evaluation, nearly every image processing procedure modifies pixel values. Before image sharpening starts, the input and target images need to be co-registered. Then, an area of interest may be clipped out. Lastly, the low-resolution image needs to be resampled to the same pixel size as the high-resolution image. These processing procedures further limit how effectively the statistical indices can evaluate quality.

Image geo-registration
Image geo-registration is the first step in image fusion. Images can be retrieved from various sources that use different coordinate systems or none at all. Input images need to be registered to one map coordinate system in which each pixel is in its geometrically correct position and has coordinates. For image sharpening, it is crucial that the input and target images are geo-registered correctly so that the data are comparable on a pixel-to-pixel basis. Pohl and Van Genderen (2016) stress the importance of image registration for image fusion. Inaccurate image registration leads to a mismatch of ground objects between the original and the fused image, resulting in low fusion quality. This paper further emphasizes that even when two images are correctly geo-registered, the pixel values of the input image have already been modified before image fusion. This is because, during geo-registration, the pixel arrays are shifted and warped to match the new coordinate system. Consequently, the pixel sizes are slightly adjusted to match the new map grid, and this change of pixel size modifies the pixel values. Such modifications may seem minor; they are nearly invisible and thus often ignored. However, as the differences between assessment results are often equally small, these modifications can have an impact on fusion quality assessment. Therefore, this paper draws attention to the change in image statistics during processing procedures, particularly when statistical indices are used to measure fusion quality.
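How a small geometric adjustment alters pixel values can be seen on a single row of pixels. The sketch below shifts a row by half a pixel using linear interpolation. Linear interpolation stands in for whatever resampling a registration tool applies, and the pixel values are hypothetical.

```python
import numpy as np

row = np.array([50.0, 50.0, 100.0, 50.0, 50.0])  # hypothetical row with one bright pixel
x = np.arange(row.size)

# Re-grid the row half a pixel to the right (values beyond the edge are clamped).
shifted = np.interp(x + 0.5, x, row)

print(shifted)                      # [50. 75. 75. 50. 50.]
print(row.std(), shifted.std())
```

The mean of the row is unchanged, but the bright pixel is spread over two pixels and the standard deviation drops. Any index built on local means, variances or correlations is therefore affected before fusion has even started.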

Image clipping
For image sharpening purposes, subsets of the high-resolution and the low-resolution image are cut out along the border of a study area. If the border runs through the middle of a pixel of the low-resolution image, the pixel cannot be cut in half, as a pixel is the minimum unit of a raster image. In this case, the row above or below the border is taken as the starting line of the cut. Owing to its smaller pixel size, the high-resolution image may have a row of pixels whose edge lies exactly on the border. Figure 7 illustrates this situation. It serves as an example to show that, when clipping is applied, the generated images may not be completely aligned. Assuming the low-resolution image has a resolution of 30 m, a shift of 15 m is generated. In an urban context, a shift of 15 m could confuse one building with another or cause a street to deviate from its original direction. To avoid this problem, this paper suggests resizing the low-resolution image to a higher resolution before cutting the subsets. Consequently, the image goes through geo-registration, resizing, and then clipping. As resampling already occurs once during geo-registration, the image has been resampled twice before image sharpening starts, and pixel values and image statistics have been modified twice.
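The 15 m shift follows from simple grid arithmetic. The sketch below assumes a clip border 45 m from the grid origin, a 30 m low-resolution grid and a 15 m high-resolution grid, with the cut moved to the pixel edge at or below the border. All of these numbers and the helper name are hypothetical.

```python
import math

def snap_down(border, pixel_size):
    """Move a clip border to the pixel edge at or below it (grid origin at 0)."""
    return math.floor(border / pixel_size) * pixel_size

border = 45.0                           # hypothetical clip border in metres

low_edge = snap_down(border, 30.0)      # border cuts a 30 m pixel in half
high_edge = snap_down(border, 15.0)     # border coincides with a 15 m pixel edge
offset = high_edge - low_edge

print(low_edge, high_edge, offset)      # 30.0 45.0 15.0
```

The two clipped subsets start 15 m apart although they were cut with the same border, which is half a pixel of the low-resolution image.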

Resampling
Resampling is used not only before image fusion but also afterwards. Since it is used so often, it is worth examining how it changes the image values. The common resampling algorithms include nearest neighbor, bilinear interpolation, and cubic convolution. Different resampling techniques change the pixel values differently. Figure 8 shows an image resampled with nearest neighbor and with cubic convolution. Cubic convolution produces smooth color transitions, whereas nearest neighbor retains the original pixel values but can produce blocky artifacts. When the two resampled images are compared pixel by pixel, very few pixels have the same value. This means that if different resampling algorithms are applied before and after fusion, the statistical quality indices will indicate a low fusion quality.
To avoid misjudging the fusion quality, this paper suggests that the resampling methods used before and after fusion be kept consistent.
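The pixel-by-pixel comparison can be reproduced with a short sketch. Bilinear interpolation is used here as a simple stand-in for cubic convolution. The random 8 × 8 test image, the upsampling factor of 6 and the function names are assumptions for illustration.

```python
import numpy as np

def upsample_nearest(img, factor):
    """Nearest neighbor: every pixel is repeated factor x factor times."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1).astype(float)

def upsample_bilinear(img, factor):
    """Bilinear interpolation evaluated at the centres of the output pixels."""
    rows, cols = img.shape
    r = np.clip((np.arange(rows * factor) + 0.5) / factor - 0.5, 0, rows - 1)
    c = np.clip((np.arange(cols * factor) + 0.5) / factor - 0.5, 0, cols - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, rows - 1), np.minimum(c0 + 1, cols - 1)
    wr, wc = (r - r0)[:, None], (c - c0)[None, :]
    top = img[r0][:, c0] * (1 - wc) + img[r0][:, c1] * wc
    bottom = img[r1][:, c0] * (1 - wc) + img[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(float)  # hypothetical low-resolution band

nearest = upsample_nearest(image, 6)
bilinear = upsample_bilinear(image, 6)

same_fraction = float(np.mean(np.isclose(nearest, bilinear)))
print("share of identical pixels:", same_fraction)
print("standard deviations      :", nearest.std(), bilinear.std())
```

Only a small share of pixels, mainly near the image border, keep the same value under both methods. Comparing a fused image resampled one way against a reference resampled the other way would thus register differences that have nothing to do with the fusion itself.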

CONCLUSION
This paper first explained image dissimilarities from a physical point of view. As each band or image has its specific spectrum coverage, it is natural that images taken at the same time do not present the ground cover in the same way. The goal of image fusion is to integrate these image dissimilarities when they can enhance the image quality. However, the more dissimilarities are added to the fused image, the more pixel values are changed. Consequently, the image statistics change and influence the assessment of fusion quality.

[Figure caption fragment: ... (c) aim to display differences in resample effects but not precise interpolation results.]

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022, XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France
We then examined the frequently used fusion quality indices with regard to spectral preservation and spatial improvement. We found that when statistical indices are used to evaluate spectral preservation, they actually measure dissimilarity. This means that the more spatial details are added to the fused image, the worse the fusion quality is rated. This is particularly the case for urban areas, where the ground cover is not homogeneous but varies spatially at a small scale. This explains the trade-off between spectral preservation and spatial improvement. For the evaluation of spatial improvement, we used UIQI as an example to illustrate that the statistical indices only work if spatial details have been lost. When spatial details are gained, these indices do not reflect them as spatial improvement. We then conducted a fusion experiment with Sentinel 2 multispectral bands. It confirmed that the places where spatial improvements are visible are rated by ERGAS and UIQI as showing poor spectral preservation and poor spatial improvement. In this experiment, we took extreme examples, in which the input spectral bands do not overlap in the spectrum. These extreme examples partly reflect ongoing research on multi-focus and multi-modal image fusion, where the input images often do not overlap in the spectrum, as in the fusion of thermal and visible images. Overall, this paper argues that statistical indices do not provide a complete picture of fusion quality.
Moreover, this paper lists the factors which will change the pixel values and thereby change image statistics. Attention should be paid to these factors when statistical indices are used for the quantitative measurement of fusion quality. As some research has pointed out, one direction for future research is a standardized visual quality assessment to objectively compare fusion quality (Pohl et al., 2017). It might be useful to extend such a quality assessment and develop more robust quantitative measurements.