STANDARDIZING QUALITY ASSESSMENT OF FUSED REMOTELY SENSED IMAGES

The multitude of available operational remote sensing satellites has led to the development of many image fusion techniques to provide high spatial, spectral and temporal resolution images. The comparison of different techniques is necessary to obtain an optimized image for the different applications of remote sensing. There are two approaches to assessing image quality: 1. qualitatively by visual interpretation and 2. quantitatively using image quality indices. However, an objective comparison is difficult because a visual assessment is always subjective and a quantitative assessment is done using different criteria. Depending on the criteria and indices the result varies. Therefore it is necessary to standardize both processes (qualitative and quantitative assessment) in order to allow an objective image fusion quality evaluation. Various studies have been conducted at the University of Osnabrueck (UOS) to establish a standardized process to objectively compare fused image quality. First, established image fusion quality assessment protocols, i.e. Quality with No Reference (QNR) and Khan's protocol, were compared on various fusion experiments. Second, the process of visual quality assessment was structured and standardized with the aim of providing an evaluation protocol. This manuscript reports on the results of the comparison and provides recommendations for future research.


INTRODUCTION
Background
Image fusion is an established tool to create images of high spatial, spectral and temporal resolution. In the past decades a large number of very diverse sensors have been launched on satellite platforms, and data from a few of them are even available free of charge (e.g. Landsat and the Copernicus Programme). With the provision of multisensor imagery, fusion has gained importance because it enables the production of high quality images that lead to better interpretation capabilities. The selection of a proper fusion algorithm, however, remains difficult (Pohl and Zeng 2015; Yilmaz and Gungor 2016). Users have the option to choose from many different algorithms, fusing images based on component substitution (CS), numerical methods, statistical algorithms, multi-resolution approaches (MRA), hybrid, and other techniques (Pohl and van Genderen 2015).
The motivation to compare the performance of different solutions is manifold. First of all, users need to identify an optimised fused image in order to extract as much accurate information as possible for their application. Secondly, fusion algorithm developers intend to improve existing techniques and eliminate drawbacks, which requires quality assessment. Thirdly, each fused result needs to be accompanied by quality criteria to provide the necessary metadata for further processing. Quality assurance is an important aspect in the operational use of remote sensing data, especially if interpretation results are used in a legal context. Due to the popularity of pansharpening techniques, the research presented in this paper focussed on pansharpening quality assessment first.

Quality assessment
Many researchers have tried to identify the 'best' fusion algorithm by comparing different promising methods on a given data set (Aiazzi et al. 2017; Witharana, LaRue, and Lynch 2016; Sanli et al. 2016; Licciardi et al. 2016; Kalaivani and Phamila 2016). If done properly, they were able to test the fusion results for their individual experiment. Nevertheless, an intercomparison with other research efforts is not possible because different pre-conditions were chosen. Traditionally image quality is assessed qualitatively by visual inspection and quantitatively using different parameters, indices or protocols (Palubinskas 2015; Vivone et al. 2015; Liu et al. 2015; Jagalingam and Hegde 2015).
According to the literature visual inspection is a must (Toet and Franken 2003; Pohl and van Genderen 1998; Chavez and Bowell 1988). Due to its subjectivity, however, it is a problematic quality evaluation approach. Published remote sensing image fusion research contains qualitative evaluation, but mostly very superficially. Hardly any effort has been spent on establishing criteria and workflows for visual quality assessment of fused images. On the quantitative side, quite a number of scientists have published their findings. The idea to standardize quality assessment for fused images originates from Wald and his colleagues (Wald, Ranchin, and Mangolini 1997). Based on further research the remote sensing image fusion community has come up with image fusion quality assessment protocols (Alparone et al. 2015). These protocols are based on accepted conditions for fused image quality definitions. Yet the performance of these protocols has not been investigated thoroughly.
In order to standardise the assessment procedure for fused images two studies were conducted. The first evaluated the differences between the two most recommended protocols for quality assessment, namely Quality with No Reference (QNR) (Aiazzi et al. 2006) and Khan's protocol. The second concentrated on the development of a standardized visual quality assessment protocol. The main findings of these studies are presented in the following sections. Section 2 summarizes the protocol evaluation study. Section 3 concentrates on the visual quality assessment. Finally, Section 4 summarizes the findings and gives recommendations for further research.

Commission III, WG III/6

QUALITY ASSESSMENT BY PROTOCOLS
A quantitative quality assessment for pansharpening was first established in 1997 (Wald, Ranchin, and Mangolini 1997). The authors stated that the fused image should be as similar as possible to the original multispectral image, which means that the fused image needs to be compared to the original image at its lower spatial resolution. Secondly, the ideal fused image resembles the image that an ideal sensor (high spatial and high spectral resolution) would theoretically acquire. Since such a sensor is physically impossible, the protocol 'Quality with No Reference' (QNR) has been developed; it is explained in section 2.2. Later the research team produced Khan's protocol (Khan, Alparone, and Chanussot 2009) comprising two indices (spectral and spatial), further explained in section 2.3.
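The consistency idea (compare the fused image with the original multispectral image at the lower resolution) can be illustrated with a minimal Python sketch. It assumes an integer resolution ratio, co-registered arrays and plain block averaging as the degradation step; operational implementations typically use sensor-matched low-pass filters instead. The function names `degrade` and `consistency_rmse` and the choice of RMSE as the difference measure are illustrative assumptions, not part of Wald's protocol as published.

```python
import numpy as np


def degrade(band, ratio):
    """Block-average a band by an integer resolution ratio (assumed degradation)."""
    img = np.asarray(band, dtype=float)
    rows, cols = img.shape
    if rows % ratio or cols % ratio:
        raise ValueError("image size must be a multiple of the ratio")
    # Group pixels into ratio x ratio blocks and average each block.
    blocks = img.reshape(rows // ratio, ratio, cols // ratio, ratio)
    return blocks.mean(axis=(1, 3))


def consistency_rmse(fused_band, ms_band, ratio):
    """RMSE between the degraded fused band and the original MS band."""
    low = degrade(fused_band, ratio)
    ms = np.asarray(ms_band, dtype=float)
    return float(np.sqrt(np.mean((low - ms) ** 2)))
```

A fused band that is merely a pixel-replicated copy of the multispectral band yields an RMSE of zero; any spectral change introduced by the fusion increases the value.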

Zhou's protocol
Zhou, Civco, and Silander (1998) created a fusion quality assessment protocol in which spectral and spatial qualities are evaluated separately. The spectral assessment is done band by band and results in an average value of the image differences between the original and the fused images. The spatial assessment is done on high pass filtered (HPF) images, namely the panchromatic and the fused image. The researchers use a Laplace filter to extract the high frequency details of the imagery. The actual quality evaluation is done by calculating the correlation coefficient (CC) between the filtered images. However, researchers have found limitations in this protocol, which is why other protocols were designed, as discussed in the following sections (Alparone et al. 2008).
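The two steps of Zhou's protocol can be sketched in Python as follows. This is an illustration under assumptions: the original multispectral band is taken to be already resampled to the grid of the fused band, the spectral measure is taken as the mean absolute difference, and the 3x3 Laplacian kernel shown is one common choice, since the text above only specifies "a Laplace filter". All function names are hypothetical.

```python
import numpy as np

# One common 3x3 Laplacian kernel (an assumption; its coefficients sum to zero).
LAPLACIAN = np.array([[-1.0, -1.0, -1.0],
                      [-1.0,  8.0, -1.0],
                      [-1.0, -1.0, -1.0]])


def spectral_difference(ms_band, fused_band):
    """Mean absolute difference between an original and a fused band."""
    ms = np.asarray(ms_band, dtype=float)
    fused = np.asarray(fused_band, dtype=float)
    return float(np.mean(np.abs(fused - ms)))


def high_pass(image, kernel=LAPLACIAN):
    """Apply a 3x3 kernel; only the interior ('valid') pixels are returned."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((rows, cols))
    # Accumulate the nine shifted, weighted copies of the image.
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * img[dr:dr + rows, dc:dc + cols]
    return out


def spatial_correlation(pan, fused_band):
    """Correlation coefficient between the high-pass filtered images."""
    hp_pan = high_pass(pan).ravel()
    hp_fused = high_pass(fused_band).ravel()
    return float(np.corrcoef(hp_pan, hp_fused)[0, 1])
```

A low spectral difference together with a spatial correlation close to 1 would indicate a fused band that preserves the multispectral values while carrying the panchromatic detail.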

Quality with No Reference
In most cases users do not have access to an appropriate reference image to evaluate the quality of the fused image. Therefore a protocol for quality assessment without a reference image was developed (Alparone et al. 2008). This protocol is particularly interesting for pansharpening. It contains two separate indices describing the spatial and spectral distortions in the fused image, respectively. The calculation is based on the Universal Image Quality Index (UIQI) (Wang and Bovik 2002). Its advantage is the good performance on low and high spatial resolution imagery (Khan et al. 2008). For the practical implementation the panchromatic image is resampled to the resolution of the multispectral image. Then spatial and spectral differences are calculated. The final QNR Index is derived from a multiplication of the spatial and spectral distortion indices produced. Details on the mathematical background can be found in Alparone et al. (2008). The QNR Index can be modified to emphasize either spatial or spectral improvement, depending on the user's preference. Often users investigate the spectral and spatial parameters independently (Alparone et al. 2015).
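The following Python sketch illustrates how the QNR components fit together. It is a simplification under stated assumptions: the UIQI is computed once over the whole image rather than in sliding windows, the band lists are co-registered arrays, and the forms of the two distortion indices follow the usual presentation of Alparone et al. (2008), which should be consulted for the authoritative definitions. The function names and the default exponents (all set to 1) are illustrative choices.

```python
import numpy as np


def uiqi(x, y):
    """Universal Image Quality Index, computed globally (no sliding window)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    denom = (x.var() + y.var()) * (mx ** 2 + my ** 2)
    if denom == 0:
        raise ValueError("UIQI is undefined for these inputs")
    return float(4.0 * cov * mx * my / denom)


def spectral_distortion(ms, fused, p=1):
    """D_lambda: change of inter-band UIQI between MS and fused bands."""
    n = len(ms)
    total = 0.0
    for l in range(n):
        for r in range(n):
            if l != r:
                total += abs(uiqi(fused[l], fused[r]) - uiqi(ms[l], ms[r])) ** p
    return (total / (n * (n - 1))) ** (1.0 / p)


def spatial_distortion(ms, fused, pan, pan_low, q=1):
    """D_s: change of band-to-Pan UIQI between the high and low resolution."""
    n = len(ms)
    total = sum(abs(uiqi(fused[l], pan) - uiqi(ms[l], pan_low)) ** q
                for l in range(n))
    return (total / n) ** (1.0 / q)


def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine both distortions; alpha and beta shift the emphasis."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

With both distortions equal to zero the index is 1 (best case). Raising `alpha` penalizes spectral distortion more strongly, raising `beta` does the same for spatial distortion.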

Khan's Protocol
As a further development, a new protocol was proposed by Khan, Alparone, and Chanussot (2009), integrating Wald's, Zhou's and the QNR protocols. Combining the consistency criteria of Wald, the introduction of high frequency components from the panchromatic into the fused image using an HPF as in Zhou's protocol, and the definition of the spectral distortion from the QNR protocol, Khan's protocol assesses the overall quality of fused images, considering the modulation transfer function (MTF). The MTF is a measure of quality, representing the sharpness and the quality of edges and lines, and is determined in the frequency domain (Thomas and Wald 2006). Khan, Alparone, and Chanussot (2009) integrate the Q4 Index (Alparone et al. 2004) to obtain a comparable quantitative value. The result of Khan's protocol is a spatial ($D_s$) and a spectral distortion index ($D_\lambda$).
According to the literature quantitative fused image quality assessment should be done using the QNR and Khan's protocol (Alparone et al. 2015; Pohl and van Genderen 2016). However, it has not been investigated so far how reliably these protocols perform. For this purpose a study was performed at the University of Osnabrueck. For comparison purposes the two resulting values from Khan are transformed into one value ($Khan_{total}$) following the calculation of the QNR protocol (compare Equation 1):

$$Khan_{total} = (1 - D_\lambda)^{\alpha} \cdot (1 - D_s)^{\beta} \qquad (1)$$

where the exponents $\alpha$ and $\beta$ weight the spectral and the spatial term, as in the QNR index.

Fusion algorithms
Since the investigation intended to judge the performance of quantitative fused image quality assessment protocols, the actual fusion algorithms are of secondary importance. Therefore typical and commonly used pansharpening techniques were selected, including one commercial algorithm. All algorithms, namely FuzeGo (commercial UNB Pansharp algorithm) (Zhang and Mishra 2014), Ehlers Fusion (Ehlers 2004) and Wavelet Fusion (Ranchin and Wald 1993; Vivone et al. 2015), are state-of-the-art. For details on the algorithms we refer to the mentioned literature.

Data selection and pre-processing
The experiments were carried out on RapidEye and Sentinel-2 imagery. Both sensors are very popular and provide images of high quality in the visible and infrared (optical) domain. In addition, the channels of Sentinel-2 acquire data at different spatial resolutions so that high spatial resolution multispectral images can be created from a single acquisition. This avoids the problem of changes between image acquisitions that arises if multitemporal data are used. The sensor details are summarized in Table 1 (modified from Pohl and van Genderen (2016)).
The test site is located in Israel, north-east of the town of Haifa. The acquisition dates are 8 December 2015 for Sentinel-2 and 4 December for RapidEye, so that both acquisitions are as close as possible in time.
Since neither RapidEye nor Sentinel-2 provides a separate panchromatic channel, an artificial Pan channel was created using equation (2):

$$Pan = \frac{1}{n} \sum_{i=1}^{n} B_i \qquad (2)$$

where $Pan$ represents the synthetic panchromatic RapidEye/Sentinel-2 channel, $n$ the number of bands and $B_i$ the $i$th band of RapidEye/Sentinel-2.
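As a minimal sketch, and assuming that the synthetic Pan channel is the plain average of the $n$ co-registered bands (which is how the variable definitions above read), the computation could look as follows in Python. The function name `synthetic_pan` and the array layout (bands, rows, columns) are illustrative assumptions.

```python
import numpy as np


def synthetic_pan(bands):
    """Average n co-registered bands into one synthetic Pan channel.

    `bands` is expected as an array-like of shape (n_bands, rows, cols).
    """
    stack = np.asarray(bands, dtype=float)
    if stack.ndim != 3:
        raise ValueError("expected an array of shape (n_bands, rows, cols)")
    # Pixel-wise mean over the band axis.
    return stack.mean(axis=0)
```

Which bands enter the average matters in practice: a Pan channel built from bands outside the spectral range of the multispectral bands to be sharpened is likely to degrade the fusion result.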
In addition, subsets were created to provide imagery with different typical land cover (urban, agricultural and mixed land cover). The comparison of the protocols was therefore carried out separately for each land cover type.

Results
In general it can be stated that both protocols support the user in identifying the best algorithm for different land cover types.
Another finding is that fusion results deteriorate if the wavelength range of the panchromatic image is located outside the range of the multispectral image bands. The protocols are not suitable for the detection of artefacts. This has to be done qualitatively, i.e. visually. However, it also has to be stated that the protocols lead to contradictory results, especially in the agricultural areas. The FuzeGo algorithm achieved the best spatial quality while Ehlers fusion outperformed the other two algorithms spectrally. As an example Figure 2

VISUAL QUALITY ASSESSMENT
For the purpose of standardizing qualitative fused image quality assessment a study was designed. With the support of the international scientific and professional community, criteria and relevant information were collected using a questionnaire. From the collected information a protocol for evaluating fused image quality visually was created. In order to provide a quantitative value of the visual quality, weighting factors and values were introduced. The following sections give a brief overview of the study conducted and discuss the derived protocol.

Visual assessment of fused imagery
The background of the study is the subjectivity of visual quality assessment. In addition, the evaluation is not standardized and hardly structured in the existing remote sensing image fusion literature. The unstructured way of performing visual assessment, as well as its inherent subjectivity, makes the results incompatible: quality statements from different research efforts cannot be compared. This is a major drawback in image fusion research and applications because the visual assessment is crucial in the process. The quantitative quality assessment alone is not sufficient. Apart from the fact that artefacts cannot be quantified, human interpretation adds valuable information to the process and cannot be replaced.

Standardizing and quantifying visual quality assessment
The goal of the study was to streamline the approach to visual quality assessment and to establish criteria with grading and weighting to produce a final quantitative value for the visually assessed quality. The only attempt to standardize visual quality assessment of fused images in the literature was published by Shi et al. (2005). Unfortunately, this attempt did not find its way into operation. Our new initiative tapped into scientific literature and a network of remote sensing image fusion and image interpretation experts. Based on scientific literature, major criteria for image interpretation and quality were derived. The visual assessment in the literature is always presented in the form of text. First of all it is difficult to extract the actual quality aspects from such text; secondly a comparison is not possible because each researcher uses a different approach. For standardization purposes the image qualities and aspects extracted from the literature were transformed into a questionnaire in order to tap into expert knowledge around the world.
In total 46 experts participated in the study. They are predominantly professionals in the field of remote sensing and image analysis, including people from institutions operationally interpreting images for security and disaster mitigation applications. This ensured that the protocol is highly suitable for operational use and can be implemented operationally. The majority of participants have experience with image fusion and its applications. They are familiar with quantitative quality assessment.
For the design of the different questions to collect the necessary information, test data were selected to provide examples of high spatial resolution fused images. The current version covers pansharpening only. In the future the protocol will also be tested on other types of image fusion. The type of data and the fusion techniques were not revealed in order to allow an objective observation of the different achievements. Figure 3 provides an example of such a test image.
Figure 3. Image example for questionnaire; Images A-C are fused images using three different pansharpening algorithms (Fries 2017)
In order to cover different land cover types a set of images was provided. The images were provided at different scales in order to allow global, regional and local assessment. Based on the received information a protocol was designed. It contains a workflow and a value sheet to produce the necessary quantitatively measurable numbers.

The visual quality assessment protocol (VQAP)
Using the protocol the user is guided through the process of fused image assessment from global via regional to local features. Criteria such as sharpness, colour preservation and object recognition are used to judge the quality of the images. This leads to a repeatable process with unified values and thus to an objectified visual quality assessment result that can be used for comparison.
The first step in the procedure development required a compilation of important criteria to be used in visual quality assessment. The participants named many different aspects, as indicated in Figures 4 (spatial) and 5 (spectral). Then a prioritization of the main criteria was deemed necessary. These were spatial improvement (sharpness, spatial quality), colour preservation (spectral quality) and object recognition. Based on the importance of the individual aspects, weighting factors were introduced. As a result the weighted criteria, in the order of their importance, received a value based on the answers to the questions posed. In total 23 aspects are listed in the protocol, subdivided into spatial, spectral and object-level criteria on a global, regional and local scale (see table 2).
Table 2. List of final aspects and categories in the protocol
An illustration of the different image areas to be considered is given by the example presented in figure 6. The images show global, regional and local aspects of the image.
Figure 6. Global, regional and local image area (Fries 2017)
The final protocol contains six parts, namely an introduction explaining the context, a description of possible applications of the protocol, a glossary with explanations of the used terminology, an overview of the structure, the questions to be followed (evaluation protocol) and an external, digital answer sheet with an automated calculation of the values and final score.
In the external answer sheet the answers that define the quality are selected from drop-down menus to simplify the usage of the sheet (compare with figure 7). This standardizes and facilitates the evaluation. In addition it speeds up the process.

Results and discussion
As a result of the standardized visual quality assessment using the VQAP the user obtains a score. This score results from three major criteria: 1. spatial quality, 2. spectral quality, 3. quality of object recognition. The total score that can be achieved in the individual criteria is presented in Figure 8. It is made up of the individual aspects previously listed in table 2. The detailed calculation of maximum possible scores and weights is explained in the published thesis and will be described in detail in an upcoming journal publication.
Figure 8. Aspect categories and their total achievable scores
From figure 8 it becomes obvious that professional users are mostly concerned with colour preservation (spectral fidelity) in the fused image, followed by the increased spatial detail and, last but not least, the need for object recognition.
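The principle of turning graded answers into category scores can be illustrated with a short Python sketch. Everything specific in it is hypothetical: the aspect names, weights and grade ranges are placeholders and not the values of the VQAP answer sheet, and the rule that each aspect contributes its weight multiplied by its grade is an assumption made for illustration only. The actual weights and maximum scores are documented in the thesis referred to above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aspect:
    name: str
    category: str   # "spatial", "spectral" or "object"
    weight: int     # hypothetical importance weight
    max_grade: int  # hypothetical best grade selectable for this aspect


def category_scores(aspects, grades):
    """Sum weight * grade per category and return the achievable maxima too."""
    scores, maxima = {}, {}
    for a in aspects:
        g = grades[a.name]
        if not 0 <= g <= a.max_grade:
            raise ValueError(f"grade for {a.name!r} out of range")
        scores[a.category] = scores.get(a.category, 0) + a.weight * g
        maxima[a.category] = maxima.get(a.category, 0) + a.weight * a.max_grade
    return scores, maxima


# Placeholder aspects and grades, for illustration only.
ASPECTS = [
    Aspect("edge sharpness", "spatial", 3, 4),
    Aspect("colour preservation", "spectral", 5, 4),
    Aspect("building recognition", "object", 2, 4),
]
GRADES = {"edge sharpness": 3, "colour preservation": 4,
          "building recognition": 2}

SCORES, MAXIMA = category_scores(ASPECTS, GRADES)
```

Reporting each score together with its achievable maximum keeps results comparable between categories that contain different numbers of aspects.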

Implementation
The protocol was tested on a use case. For this purpose six different, commonly used fusion algorithms, i.e. Brovey Transform, FuzeGo, HPF, modified IHS, Principal Component Analysis (PCA), and Wavelet, were applied to an urban test data set of high spatial resolution optical remote sensing data. Figure 9 displays the subsets used for visual quality assessment using the VQAP. Investigating artefacts in fused imagery, the visual assessment resulted in table 4. It is obvious from both tables that the selected algorithms perform quite differently and that this can be quantified. For the final decision on which fusion approach performed best in the three aspect categories the entire sheet has to be completed, leading to three final scores for general, spatial and spectral quality.

Discussion
From table 5 it can be seen that FuzeGo performed well in terms of overall quality, spatial and spectral aspects. The wavelet transform produced the worst general result. Brovey led to the worst spatial and spectral performance, which was expected. The table shows the usefulness of the VQAP in practice. It should be noted that this forms only one part of the quality assessment and should always be accompanied by a quantitative evaluation.

CONCLUSION AND OUTLOOK
The results show that a quantification of qualitative assessment strategies is possible. The standardization of the procedure reduces the influence of a subjective evaluation of image quality. It is a way forward to produce comparable quality scores so that, in combination with a quantitative quality assessment, new possibilities arise for future quality statements in the field of remote sensing image fusion. It now remains to conduct a usability study providing the VQAP to image fusion users. This will enable a validation of the presented approach. A further step is the adaptation of the protocol to different applications and other fusion constellations, e.g. the combination of optical and radar data.

Figure 1. Experiments and test data - adapted from (Moellmann 2016)

Figure 2 displays the results for a combination of RapidEye with Sentinel-2 10 m data for the urban area. It is clearly visible that the QNR protocol leads to completely different conclusions than Khan's protocol. The results of the other experiments were similarly inconsistent.

Figure 4. Spatial image quality criteria

Figure 9. Example images to test the feasibility of the VQAP; left to right and top down: Original MS, original PAN, Brovey Transform, FuzeGo, HPF, Modified IHS, PCA, and Wavelet

Extracted enlargements for the local image quality assessment are presented in figure 10.

Figure 10. Enlarged regions for local quality assessment

Following the protocol the result sheet was filled in. As an example to illustrate the scores for individual criteria, table 3 presents the result for colour preservation at local level.

Table 1. Sensor parameters for images used in experiments

All images were corrected for atmospheric effects and coregistered to provide a compatible, corrected data set.

Table 3. Visual assessment of the criterion colour preservation at local level. The maximum score reachable is 116 (Fries 2017).

Table 5 provides the result of the test case.

Table 4. Visual assessment of the criterion artefacts at local level. The maximum score reachable is 57 (Fries 2017).