MODELLING BIOPHYSICAL PARAMETERS OF MAIZE USING LANDSAT 8 TIME SERIES

Open and free access to multi-frequent high-resolution data (e.g. Sentinel-2) will fortify agricultural applications based on satellite data. The temporal and spatial resolution of these remote sensing datasets directly affects the applicability of remote sensing methods, for instance the robust retrieval of biophysical parameters over the entire growing season at very high geometric resolution. In this study we use machine learning methods to predict biophysical parameters, namely the fraction of absorbed photosynthetically active radiation (FPAR), the leaf area index (LAI) and the chlorophyll content, from high-resolution remote sensing data. 30 Landsat 8 OLI scenes were available for our study region in Mecklenburg-Western Pomerania, Germany. In-situ data were collected weekly to bi-weekly on 18 maize plots throughout the summer season of 2015. The study aims at an optimized prediction of biophysical parameters and the identification of the spectral bands and vegetation indices that explain them best. For this purpose, we used the entire in-situ dataset from 24.03.2015 to 15.10.2015. Random forests and conditional inference forests were used because of their strong exploratory and predictive character. Variable importance measures allowed us to analyse the relation between the biophysical parameters and the spectral response, and the performance of the two approaches over the development of the crop stand. Classical random forest regression outperformed conditional inference forests, in particular when modelling the biophysical parameters over the entire growing period. For example, modelling biophysical parameters of maize for the entire vegetation period using random forests yielded: FPAR: R² = 0.85, RMSE = 0.11; LAI: R² = 0.64, RMSE = 0.9; and chlorophyll content (SPAD): R² = 0.80, RMSE = 4.9.
Our results demonstrate the great potential in using machine-learning methods for the interpretation of long-term multi-frequent remote sensing datasets to model biophysical parameters.


INTRODUCTION
Agricultural applications using remote sensing data and methods will be fortified by the increasing availability of high-resolution multi-frequent satellite data. New satellite constellations like the Sentinels will, in combination with other systems, increasingly allow for high-frequency, high-resolution observations of the crop lifecycle. Applications of remote sensing mainly focus on the biophysical reality of the crop, expressed by crop-specific biophysical parameters such as the leaf area index (LAI), the fraction of absorbed photosynthetically active radiation (FPAR), or the chlorophyll content. One appropriate method that demands reduced amounts of auxiliary data (such as climate or soil data) is to directly derive statistical relationships between the respective biophysical parameter observed in the field and the reflectance signal measured by the satellite sensor (Bronge, 2004). Often, univariate statistics are applied to one spectral index, e.g. the NDVI, to create maps of crop biophysical parameters (Fritsch et al. 2012). Such linking of one spectral index with one biophysical parameter implies a direct and permanent relationship between the biophysical reality and the reflectance values (Myneni and Williams, 1994). However, plants change their height, mass and shape, which also alters how they appear in remotely sensed data. For instance, Lex et al. (2015) or Vina et al. (2011) demonstrated that such simple crop-specific statistical relations may vary during a cropping season (Koppe et al. 2012). Comparisons of univariate statistical models revealed that some vegetation indices outperform others. Such thoughts also guided the study by Tillack et al. (2014), who modelled biophysical parameters for different phenological stages using multivariate statistics that included several vegetation indices.
Verrelst et al. (2012) applied machine learning methods to the entire spectra of remote sensing information and underlined their great potential in combination with high-resolution remote sensing data. However, despite this promising property of machine learning algorithms such as random forest, systematic applications and comparisons of machine learning methods for deriving biophysical parameters in agriculture from high-resolution remote sensing data are still rare. The aim of this study is to compare two machine learning methods, namely the traditional random forest (rforest) and the conditional inference forest (cforest), for modelling biophysical parameters of maize in terms of prediction accuracy and variable importance. This comparison is made for the complete phenological lifecycle of maize using Landsat 8 OLI data. The analysis is carried out entirely in R and is based on field observations gathered on maize fields within the test and calibration site DEMMIN in Northeast-Germany. During an extensive field survey in 2015, FPAR, LAI, and chlorophyll content were measured on a weekly basis throughout the growing season of maize.

DEMMIN
The study area was located near the city of Demmin in Western-Pomerania (Mecklenburg), Northeast-Germany (Figure 1). Glaciers and meltwater formed the landscape during the Weichsel glaciation (approximately 10,000 years ago). The climate can be described as moderate, with an average annual temperature of 8-8.5 °C and an average annual rainfall of 550-600 mm (Borg et al. 2009). The investigated fields were within the test site DEMMIN (Durable Environmental Multidisciplinary Monitoring Information Network), one of four test areas of the TERENO NE Lowland observatory (http://demminweb.dlr.de/). The test site is an intensively used agricultural ecosystem dominated by extensive fields (80 ha) where mainly wheat and maize are cultivated. The northern part of the study area is characterized by low topographical variations between 5 and 84.5 m a.s.l. The south can be described as hilly to undulating. Due to significant differences in parent substrate material and topography, soils are primarily loamy sands and sandy loams alternating with pure sand patches or clayey areas (Gerighausen et al. 2007) (http://teodoor.icg.kfa-juelich.de/observatoriesde/norddeutsches-tiefland-observatorium/german-lowlandobservatory-de).

In-situ-data description
Field observations of three biophysical parameters, LAI, FPAR, and chlorophyll content (expressed by SPAD measurements), were taken in the study region weekly to bi-weekly. FPAR and LAI were recorded using a SunScan instrument (Delta-T Devices Ltd., Cambridge, England) and the SPAD values were measured with a handheld chlorophyll meter (SPAD-502, Minolta Osaka Company, Ltd., Osaka, Japan).
The data were collected on 18 elementary sampling units (ESUs) (Baret et al. 2005) distributed over five maize fields. The ESUs had an extent of 20 m x 20 m. Within each ESU, twelve measurement points were arranged in a rectangular cross. FPAR and LAI were measured once at every point inside the ESU. The SPAD measurements were taken ten times at every point and averaged. The twelve point values of each ESU were then averaged for further processing.
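The two-stage averaging described above can be sketched as follows. This is a minimal illustration and not the authors' processing code; the function names and the example values are hypothetical.

```python
from statistics import mean


def aggregate_points(point_values):
    """Average one value per measurement point (e.g. LAI or FPAR) into one ESU value."""
    return mean(point_values)


def aggregate_spad(point_readings):
    """Average the repeated SPAD readings of each point, then average the points."""
    per_point = [mean(readings) for readings in point_readings]
    return mean(per_point)


# Hypothetical example: 12 points with one LAI value each ...
lai_points = [2.0, 2.5, 3.0, 2.5, 2.0, 3.0, 2.5, 2.5, 3.0, 2.0, 2.5, 2.5]
esu_lai = aggregate_points(lai_points)

# ... and 12 points with 10 SPAD readings each (values invented for illustration).
spad_points = [[40.0 + p + 0.1 * r for r in range(10)] for p in range(12)]
esu_spad = aggregate_spad(spad_points)
```

Because every point contributes the same number of readings, averaging per point first and then per ESU gives the same result as pooling all readings; the two-stage form simply mirrors the field protocol.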

Linking in situ data with remote sensing data
The maximum temporal offset between the field observations and the acquisition date of the remote sensing data amounted to four days. The spectral data were averaged inside a 20 m buffer around the ESU centre. The averaged spectral information was used to calculate the vegetation indices.
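One way to implement such a temporal matching is sketched below. The text only states the maximum offset of four days; choosing the nearest scene within that window is an assumption of this sketch, and the function name and dates are hypothetical.

```python
from datetime import date

MAX_OFFSET_DAYS = 4  # maximum offset stated in the text


def match_scene(field_date, scene_dates, max_offset_days=MAX_OFFSET_DAYS):
    """Return the acquisition date closest to field_date, or None if no scene
    lies within max_offset_days (nearest-scene rule is an assumption)."""
    if not scene_dates:
        return None
    nearest = min(scene_dates, key=lambda d: abs((d - field_date).days))
    if abs((nearest - field_date).days) <= max_offset_days:
        return nearest
    return None


# Hypothetical acquisition dates and one field visit.
scenes = [date(2015, 6, 20), date(2015, 7, 4), date(2015, 7, 22)]
matched = match_scene(date(2015, 7, 1), scenes)
```

Field observations for which the function returns None would have no spectral counterpart and would be left out of the model dataset.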
Several vegetation indices were calculated from the Landsat data, comprising the simple ratio (SR), NDVI, SAVI, RDVI and EVI. The tasselled cap transformation indices, which allow for monitoring greenness, brightness, and wetness and which are another important source of information for remote sensing applications in agriculture (Liu et al. 2014), were also included in the analysis.
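For orientation, the five ratio-based indices can be computed from the averaged reflectances as in the sketch below. It uses the commonly cited formulations (SAVI with a soil adjustment factor of 0.5, EVI with the coefficients 2.5, 6, 7.5 and 1); the paper does not state which exact formulations were used, so these are assumptions. For Landsat 8 OLI, blue, red and near infrared correspond to bands 2, 4 and 5. The reflectance values in the example are invented.

```python
import math


def vegetation_indices(blue, red, nir, soil_factor=0.5):
    """Compute SR, NDVI, SAVI, RDVI and EVI from surface reflectances (0..1).

    Standard textbook formulations are assumed; soil_factor is the SAVI
    soil adjustment factor L.
    """
    return {
        "SR": nir / red,
        "NDVI": (nir - red) / (nir + red),
        "SAVI": (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor),
        "RDVI": (nir - red) / math.sqrt(nir + red),
        "EVI": 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0),
    }


# Hypothetical ESU-averaged reflectances of a dense maize canopy.
indices = vegetation_indices(blue=0.04, red=0.05, nir=0.45)
```

The tasselled cap components are linear combinations of all bands with sensor-specific coefficients and are therefore not included in this sketch.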

Random Forest
Machine learning applications have received a lot of attention in recent decades. The ensemble of trees, e.g. the selection of variables for tree construction, can subsequently be analysed using so-called variable importance algorithms. Irrespective of the type of random forest, the main approaches investigate the reduction in accuracy of the random forest when each variable is randomly modified (Ishwaran et al. 2007, Strobl et al. 2007, 2008).
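The idea behind such accuracy-reduction measures can be illustrated with a generic permutation scheme. The sketch below is not the implementation used in the R packages: it works on any prediction function and a fixed dataset, whereas random forest implementations typically evaluate the permutation on the out-of-bag samples of each tree. All names and the toy data are hypothetical.

```python
import random


def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when one predictor column is randomly shuffled."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between variable j and the target
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            error = mse(targets, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy model: the prediction depends on the first predictor only.
rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [2.0 * r[0] for r in rows]
importances = permutation_importance(lambda r: 2.0 * r[0], rows, targets)
```

In this toy case the first predictor receives a positive importance and the second, which the model ignores, an importance of zero. Repeating the procedure yields a distribution of importance values per variable rather than a single number.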
In this study, the R software (Liaw & Wiener, 2002) was used for implementing both approaches. The rforest of the R package 'h performance of one model and on the variable importance (Díaz-Uriarte & Alvarez de Andrés 2006). With mtry = 1, the splitting variable would be chosen completely at random, whereas mtry = p would remove the random variable selection from the random forest model.
Using the caret package, 10 different mtry values were tested (2, 3, 5, 7, 9, 10, 12, 14, 16, 18 with p = 18) for the Landsat 8 OLI band index ensemble. The metric for comparing and assessing the performance of cforest and rforest was the coefficient of determination (R²). This ensured that the best tuning parameter was found for each model and hence for the prediction of the biophysical parameters.
The variable importance assessments were applied only to the best-performing model as determined by caret. This procedure was repeated 100 times for all datasets. The distribution of the variable importance shows, on the one hand, the importance of an index or band for predicting a biophysical parameter and, on the other hand, the stability of its selection over 100 model runs.

Prediction accuracy
Due to the relatively small dataset and the problem of spatial and temporal correlation effects, this study used a 10 times repeated five-fold cross validation to determine the performance of the respective random forests. Each model was tuned to yield the highest coefficient of determination (R²) by altering the mtry parameter. The performance results were averaged over 100 runs. The root mean square error (RMSE) was derived as a second quality measure. The rforest slightly outperforms the cforest models in terms of prediction accuracy when modelling the LAI and SPAD values (see Tables 2 and 3). The FPAR parameter was modelled at equal quality levels (R² = 0.85, RMSE = 0.11). Schlemmer et al. (2013) showed a strong relation between EVI and chlorophyll content (R² = 0.67) and a stronger relation with the NDVI (R² = 0.75).
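The resampling scheme and the two quality measures can be sketched as follows. The study itself used the caret package in R; the Python functions below only illustrate the principle. The R² shown is the 1 - SS_res/SS_tot form, which is an assumption, since the text does not state the definition used and other definitions (e.g. the squared correlation) exist.

```python
import math
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train, test) index lists for a repeated k-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repetition
        for k in range(n_folds):
            test = indices[k::n_folds]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# 10 repetitions x 5 folds give 50 train/test splits (sample size is hypothetical).
splits = list(repeated_kfold(n_samples=23))
```

For each split, a model would be fitted on the training indices and evaluated with rmse and r_squared on the held-out indices; the fold results are then averaged. Note that random partitioning does not by itself remove spatial or temporal dependence between samples from the same ESU.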

Variable importance
The variable importance of the respective random forest models is shown as boxplots (Figures 2 to 4). These boxplots contain the distribution of the unscaled variable importance over 100 runs. The boxplots are sorted according to the mean importance over those runs. Accordingly, the average variable importance decreases from top to bottom. It is apparent that the combination of cforest with subsequent variable importance assessment, which has a highly explorative character (Strobl et al., 2008), separates the important variables more distinctly than rforest followed by variable importance assessment. For FPAR estimations, the two methods (cforest and rforest) exhibited completely different distributions of variable importance. The variable importance distribution for FPAR indicates EVI to be the most important variable for the cforest, and band 6 (SWIR1) to be of highest importance when modelling with rforest.
The LAI models show a narrow distribution for each single variable over the 100 runs, for both rforest and cforest. Again, EVI is the most important variable for the cforest, while the boxplot of the rforest model shows the second band (blue) to be most important.
For modelling the chlorophyll content (SPAD value), both variable importance assessments (cforest and rforest) agree on the important role of greenness and RDVI. Again, the variable importance distribution of the cforest is more distinct. These statements on variable importance relate only to the band and vegetation index set described in Table 1.
Adding more vegetation indices could change the picture of the variable importance completely. The same is very likely to hold for the model accuracy and the tuning parameter. The comparison between the variable importance results of this study and the results of Beckschäfer et al. (2014) showed that only a few variables are necessary to explain biophysical parameters. The selection of these important variables depends on the individual band-index input ensemble.

CONCLUSION
The comparison between the two machine learning methods cforest and rforest showed that the rforest outperforms the cforest in terms of prediction accuracy, whereas the cforest often resulted in a clearer picture of the variable importance distribution. The cforest variable importance boxplots often show a group of indices and bands set off against the majority of the band index ensemble. The distribution of variables relevant for the generation of the rforests was found to be more homogeneous.
In terms of tuning, the major difference between the two models is the choice of the best tuning parameter mtry.
Overall, machine learning methods seem to perform very well in modelling biophysical parameters of maize. Other studies, like Wiegand et al. (1990) and Gitelson et al. (2014), showed even better relationships between biophysical parameters and remote sensing data, but not at Landsat resolution and not for the entire vegetation period. The machine learning models can use the entire ensemble of multispectral information. The presented results relate to the entire vegetation period and include effects like the change of fractional cover and the browning of the plant. It is very likely that further optimization can be achieved by focusing on different growing stages.

Figure 1: Study site and location of ESUs on maize fields