USING LEGACY SOIL DATA FOR STANDARDIZING PREDICTIONS OF TOPSOIL CLAY CONTENT OBTAINED FROM VNIR/SWIR HYPERSPECTRAL AIRBORNE IMAGES

Mapping of topsoil properties using Visible, Near-Infrared and Short Wave Infrared (VNIR/SWIR) hyperspectral imagery requires large sets of ground measurements for calibrating the models that estimate soil properties. To avoid collecting such expensive data, we proposed a procedure including two steps that involves only legacy soil data that were collected over and/or around the study site: 1) estimation of a soil property using a spectral index of the literature and 2) standardisation of the estimated soil property using legacy soil data. This approach was tested for mapping clay contents in a Mediterranean region in which VNIR/SWIR AISA-DUAL hyperspectral airborne data were acquired. The spectral index was the one proposed by Levin et al (2007) using the spectral bands at 2209, 2133 and 2225 nm. Two legacy soil databases were tested as inputs of the procedure: the Focused-Legacy database composed of 67 soil samples collected in 2000 over the study area, and the No-Focused-Legacy database composed of 64 soil samples collected between 1973 and 1979 around but outside of the study area. The results were compared with those obtained from 120 soil samples collected over the study area during the hyperspectral airborne data acquisition, which were considered as a reference. Our results showed that: 1) the spectral index with no further standardisation offered predictions with high accuracy in term of coefficient of correlation r (0.71), but also high bias (-414 g/kg) and SEP (439 g/kg), 2) the standardisation using both legacy soil databases allowed an increase of accuracy (r = 0.76) and a reduction of bias and SEP and 3) a better standardisation was obtained by using the Focused-Legacy database rather than the No-Focused-Legacy database. Finally, the clay predicted map obtained with standardisation using the Focused-Legacy database showed pedologically-significant soil spatial structures with clear short-scale variations of topsoil clay contents in specific areas. This study, associated with the coming availability of a next generation of hyperspectral VNIR/SWIR satellite data for the entire globe, paves the way for inexpensive methods for delivering high resolution soil properties maps.


INTRODUCTION
Visible, Near-Infrared and Short Wave Infrared (VNIR/SWIR, 0.4-2.5 µm) spectroscopy is a physical nondestructive, rapid, reproducible, and low-cost method that allows characterization of materials including soils. VNIR/SWIR spectroscopy has been proven to be accurate for the discrimination of soil types and estimation of a large range of soil properties (e.g., Viscarra Rossel et al., 2006;Ben-Dor et al., 2009), such as clay and CaCO3. More recently, VNIR/SWIR hyperspectral airborne imaging has been demonstrated to be a potential tool for topsoil property mapping over large areas (e.g., Selige et al., 2006;Gomez et al., 2008;Stevens et al., 2010).
Various methods have been developed and used to relate soil spectra to soil properties, including spectral indexes (e.g., Gomez et al., 2008;Lagacherie et al., 2008) and partial leastsquares regression (PLSR) (e.g., Selige et al., 2006;Stevens et al., 2010). Spectral indexes are based on physical analyses of spectral reflectance, such as slope or absorption band depth value which allow estimating mineral, rock, and soil properties. For example, the absorption band depth values centered at 2206 and 2341 nm can be used to estimate clay (Chabrillat et al., 2002) and CaCO3 (Gaffey, 1986) content, respectively. In the PLSR approach, the full spectrum is used to establish a linear regression model where the significant information contained in the VNIR/SWIR spectra is concentrated in a few latent variables that are optimized to produce the best correlation with the desired soil property.
The application of these techniques to airborne hyperspectral data included a calibration step that used a set of sites with VNIR/SWIR spectra associated to the studied soil property measured by physico-chemical lab analysis. These sets were sized large enough (> 70) to accurately represent the distributions of soil property over the study area. This induces significant costs that may limit the applicability of such approach. This will become all the more true with the coming availability of the next generation of orbiting hyperspectral sensors and routinely delivered high spectral resolution images for the entire globe.
The aim of this study was to test the use of legacy soil data collected over and/or around the study site, to calibrate a clay spectral index in order to reduce the soil sample collection cost and time. This study employed VNIR/SWIR AISA-DUAL hyperspectral airborne data acquired over a large area (300 km²) in a Mediterranean region.

Study Area and AISA-DUAL Airborne Data
The study area is located in the Cap Bon region in northern Tunisia (36°24'N to 36°53'N; 10°20'E to 10°58'E), 60 km east of Tunis (Figure 1a). This 300 km² area includes the Lebna catchment, which is primarily rural (>90 %) and devoted to cereals, legumes, olive trees, natural vegetation for breeding and vineyards.
VNIR/SWIR AISA-DUAL hyperspectral data were acquired in 2010, over the study area (12 × 25 km) with a spatial resolution of 5 m (Figure 1b). The AISA-DUAL airborne imaging spectrometer measures the reflected radiance in 359 noncontiguous bands covering the 400-2450 nm spectral domain. Noisy and atmospheric absorption bands were removed and 249 spectral bands 0.489 µm and 2.451 µm were retained. To isolate the bare soil areas, pixels with Normalized Difference Vegetation Index (NDVI) values over an expert-calibrated threshold were masked. As well water areas were masked using an expert-calibrated threshold and 13 urban areas were identified by visual inspection, and masked. So bare soils represent 46.3% of our study area. The AISA-DUAL airborne data and the study area have been described in details in Gomez et al. (2012).

Soil Field Sampling and Laboratory Analysis
The three following soil datasets differed in the standards of soil sampling and descriptions, the georeferencing methods, the laboratory analysis protocols and the data storage methods. They are most located on arable land.

Reference Soil Database
One hundred and twenty soil samples were collected over the study area covered by the AISA-DUAL data. Among this sample set, 50 samples were collected in June 2008, 30 in October 2009, and 40 in November 2010). All of these soil samples were collected in fields that were bare during the hyperspectral airborne data acquisition in November 2010. They were composed of five sub-samples collected to a depth of 5 cm at random locations within a 10×10 m 2 square centered on the geographical position of the sampling plot, as recorded by a Garmin GPS instrument.
The clay content (granulometric fraction < 2 µm) was determined using a pipette method (method NF X 31-107, particle size distribution by sedimentation) (Baize and Jabiol, 1995). The clay content of the 120 soil samples varied between 108 and 772 g/kg and followed a normal distribution (Table 1). This dataset will be denoted as REF.

Legacy Soil Databases
Eighty-six soil samples were collected in 2000 over the CapBon region ( Figure 1b). Among these eighty-six soil samples, only sixty-seven samples located over the study area covered by the AISA-DUAL data were used and denoted as the Focused-Legacy database. All of these soil samples correspond to the first soil horizon of profiles described by the IAO (Instituto Agronomico per l'Oltremare) 20th course professional master "remote sensing and natural resources evaluation" field survey staff from 2 to 28 April 2000 according to the IAO framework (IAO, 2002) ( Figure 1b). The soil profiles were located initially on topographic maps georeferenced in CGE (Geodetic system Carthage Tunisia) with Lambert coordinates and the Greenwich Meridian as the origin.
Seventy-two soil samples were collected between 1973 and 1979 over the CapBon region ( Figure 1b). Among these seventy-two soil samples, only sixty-four samples located outside of the study area covered by the AISA-DUAL data were used and denoted as the No-Focused-Legacy database. All of these soil samples correspond to the first soil horizon of profiles described following a set of specifications that were defined by the Tunisian ministry of Agriculture by adapting existing specifications of the French overseas research institute (ORSTOM -Maignien, 1980). The soil profiles were located initially on topographic maps georeferenced in CGE (Geodetic system Carthage Tunisia) with latitude and longitude coordinates in grade and the Paris Meridian as the origin.
Finally, the Focused-Legacy and No-Focused-Legacy databases were grouped to form the Entire-Legacy database containing 158 soil samples.
The clay content of the Focused-Legacy and No-Focused-Legacy databases was determined by hydrometer method (Baize and Jabiol, 1995). The clay content of Focused-Legacy database varied between 39 and 762 g/kg (Table 1). The clay content of No-Focused-Legacy database varied between 10 and 425 g/kg ( Table 1). The clay content of Entire-Legacy database varied between 10 and 762 g/kg (Table 1). All legacy dataset had positive skew distribution.

Spectral Index
The clay contents were predicted from the Clay Index (called index) proposed by Levin et al (2007) using the spectral bands , and , and following: (1) The HYperspectral SOil Mapper (HYSOMA) software interface was used to derive semi-quantitative soil properties maps from the hyperspectral data. HYSOMA was developed at the GZF German Research Center for Geosciences in the Remote Sensing section and is a software package written in IDL language (www.gfz-potsdam.de/hysoma, Chabrillat et al, 2011).

Prediction Standardisation
The clay content predictions obtained from the index were then standardized using a process in three steps. At the first step, a boxcox transformation was applied to transform the distribution of the clay content predictions into normal distribution (Legendre and Legendre, 1984) defined as: ( 2) where is the clay content predictions obtained from the index, is the transformed prediction and α is the transformation parameter. The assumption is that among all transformations with α values between -5 and +5, transformed data has the highest likelihoodbut not a guaranteeto be normally distributed when standard deviation is the smallest. This step is independent to a calibration database (legacy or reference).
At the second step the normalized predicted values were scaled from a calibration database as follow: ( Where And finally at the third and last step the normalized predicted values were centred from a calibration database as follow: (4) and where is the mean of the clay contents from the calibration database and is the mean of the .

Performances Predictions Indicators
The prediction performances of the spectral index were evaluated using the coefficient of correlation r of the predicted values ( or ) against the measured values in the REF database. The coefficient of correlation r is defined as the covariance of the two variables ( or ) and , divided by the product of their standard deviations as follow: As recommended by Davies and Fearn (2006), prediction performances analyses have to be also guided by the Standard Error of Prediction (SEP) or Root Mean Square Error of Prediction (RMSEP). The SEP is the parameter commonly used in the NIR spectroscopy literature to describe the prediction ability of a model. SEP appears as an averaged error recorded on the validation-sample set, and was calculated as follow: Where is the clay content of the sample i from the REF database, is the estimated clay content of the sample i from the REF database by the spectral index, and N is the number of soil samples in the REF database.
Finally, the SEP value can be decomposed as follows (Davies and Fearn, 2006): (7) Where (8) (9) As explained in detailed by Bellon-Maurel et al. (2010), the bias and the (for ''SEP corrected for bias'') appear as the error of means and the residual variance respectively. and the bias are independent.

Raw Predictions
Clay contents were obtained from the spectral index (equation 1) with a coefficient of correlation r of 0.71 and a high bias and SEP (Table 2). These clay predicted values were ranged from 10 to 305 g/kg (Table 3) which were small compared to the one of the REF database (Table 1). Moreover, the distribution of these clay predicted values was not normal (high skewness) (

Standardized Predictions
Clay contents obtained from the spectral index were transformed using the boxcox transformation, with α = 0.7 in equation (2).This step allows the increasing of r from 0.71 to 0.76, without any changes of bias and SEP.

Range and Standard Deviation Transformation Using REF Database
Estimated clay contents after boxcox transformation ( ) were transformed using the REF database following equations (3) and (4). This process allowed to increase the performance of predictions with a significant decrease of bias (8 g/kg) and SEP (around 116 g/kg) ( Table 2).

Range and Standard Deviation Transformation Using Legacy Databases
Estimated clay contents after boxcox transformation ( were transformed using the Focused-Legacy and No-Focused-Legacy databases following equations (3) and (4). This process using the Focused-Legacy and No-Focused-Legacy databases allowed increasing the performance of predictions with a decrease in bias (respectively -187 g/kg and -288 g/kg) and SEP (respectively 219 g/kg and 311 g/kg) (Table 2).
Finally, estimated clay contents after boxcox transformation ( ) were transformed using the Entire-Legacy database following equations (3) and (4). This process allowed obtaining predictions with a bias of -245 g/kg and a SEP of 269 g/kg (Table 2).

Clay Predicted Maps
The analysis of the clay predicted maps was focused on a test area. This test area is a 6.67 km² area that is centered on the Kamech catchment, which contained a high percentage of bare soils during image acquisition and exhibited contrasting soil patterns. The Kamech experimental catchment belongs to a long-term environmental research observatory called OMERE (Mediterranean observatory of water and rural environment) which aims to study the anthropogenic impacts on water and sediment budgets at catchment scale (e.g., Mekki et al., 2006;Raclot and Albergel, 2006). The Kamech catchment is characterized by strong variations in soil patterns on a small scale, with a close succession of clay-rich areas and clay-poor areas, oriented northwest/southeast, corresponding to marl and sandstone outcrops, respectively.
The raw clay predicted map was obtained from the spectral index following equation (1) (Table 2). Nevertheless, these maps restricted to the Kamech catchment, showed the close successions of clay-rich and poor-clay areas whatever the soil dataset used for standardisation.

DISCUSSION
The clay spectral index proposed by Levin et al (2007) and used without calibration data to estimate relative soil property values, offered accurate estimations (r = 0.71) but highly biased (bias = -414 g/kg). The standardisation of the estimated clay contents using the REF dataset allows a significant increasing of model performances (r = 0.76 and SEP = 116 g/kg), compared to the model without standardisation. These performances are comparable to those obtained by Lagacherie et al. (2008) also using a clay spectral index and Hymap airborne data (R 2 = 0.6 and RMSE = 130 g/kg). Nevertheless, the model performances remain lower than those obtained by Gomez et al. (2015) (R 2 val = 0.75 and SEP = 86 g/kg) using PLSR technique to estimate clay content from the same datasets than in this study (the VNIR/SWIR AISA-DUAL images and the REF soil data).
The standardisation of the estimated clay contents using legacy soil data allows a slightly increasing of model performances (accuracy increasing and bias decreasing), compared to the model without standardisation. Nevertheless, the model performances remain lower than using the REF dataset, whatever the legacy soil dataset used for the standardisation. The location of the legacy samples is more important than the number of these legacy samples for standardisation. Indeed, the use of a small legacy soil database (~80 samples, Focused-Legacy database) with all samples located over the study area, provides better results than bigger legacy soil database (~150 samples, Entire-Legacy database) including samples located out of the study area. Much of the performances limitations are due to bias of clay measurements that affect the legacy soil databases. This means that these legacy soil databases need to be standardized themselves before using them for such approach, as some attempts has been already done in that objective (Baume et al, 2011, Ciampalini, 2013 The standardisation applied in our study is composed of a boxcox transformation for normalization of the predicted data, and a scaling and centering of the normalized data. This standardisation process is comparable to one of the most widely used transfer methods for correcting predicted values which is the univariate slope and bias correction (SBC) (Bouveresse et al., 1996). Transfer methods have been developed to enable a calibration model to be effectively transferred between two "systems" (e.g., two spectroscopic instruments), thus eliminating the need for a full recalibration (Fearn 2001;Feudale et al., 2002). In our case, we do not have a calibration model which has to be adjusted to different "systems", but we have a model which has to be calibrated with an approximate system (legacy soil dataset).
In future, this work should be extended to other clay spectral indexes, such as those proposed by Chabrillat et al, 2002 andLagacherie et al., 2008. As well, this work would be extended to other topsoil properties, such as iron or CaCO3 using iron and carbonate indexes (e.g., Madeira et al., 1997;Lagacherie et al., 2008). This may confirm the independence of our actual observations from the selected spectral index and the topsoil property.

CONCLUSION
This paper shows a procedure to estimate relative soil property values from a spectral index applied to VNIR/SWIR AISA-DUAL hyperspectral airborne data, and standardize it with legacy soil data. The maps accuracy depends on the legacy soil data. The more the legacy soil data are focused on the study area, the more the accuracy is high and the soil patterns are mapped.
With the coming availability of a next generation of hyperspectral VNIR/SWIR satellite data for the entire globe (such as the French HYPerspectral X Imagery -HYPXIM -, the Spaceborne Hyperspectral Applicative Land and Ocean Mission -SHALOM -, the PRecursore IperSpettrale della Missione Applicativa -PRISMA -, the Environmental Mapping and Analysis Program -EnMAPand the Hyperspectral Infrared Imager -HyspIRI -), this study may open new way toward accessible and cheap methods for the delivery of soil properties maps to the geoscience community.