VALIDATION OF THE GLOBAL HIGH-RESOLUTION GLOBELAND 30 LAND COVER MAP IN EUROPE USING LAND COVER FIELD SURVEY DATABASE-LUCAS

High-resolution land cover maps are one of the technological innovations driving improvements in many fields influenced by Geographic Information Systems (GIS) and Remote Sensing. In particular, the GlobeLand30 (GL30), global LC map with spatial resolution of 30m, is thought to be one of the highest quality high-resolution products. However, these LC maps require validation to determine their suitability for a particular purpose. One of the best ways to provide useful validation reference data is to do a high-level accuracy field survey, but this is time consuming and expensive. Another option is to exploit already available datasets. This study assesses thematic accuracy of GL30 in Europe using LUCAS as a validation reference, because it is a free and open field survey database. The results were generally not good, and very bad for some classes. Analysis was then restricted to a small region (Lombardy, Italy) where LC data of higher resolution than those of GL30 were available. LUCAS was also found to be incoherent with this product. Further comparisons of LUCAS with other independent sources confirmed that the LC attributes of LUCAS are inconsistent with expectations. Although these findings may not be generalized to other regions, the results warn against the suitability of LUCAS as ground truth for LC validation. The paper discusses the process of thematic accuracy assessment of the GL30 and the applicability of LUCAS for high-resolution global LC validation.


INTRODUCTION
Land cover (LC) maps are categorical-type models of the Earth's surface that represent materials covering it, like water, forest, urban areas, etc.They are the products of classification procedures, done on satellite/aerial imagery by some classification algorithm.The maps can be employed for a variety of applications, including understanding climate change, natural resource management and conserving biodiversity.LC often provides valuable information for remote, hardly accessible areas and in developing countries where official data are frequently missing.Some existing LC maps (Table 1) depict only one specific theme and they are characterized only by two classes like Global Human Settlement Layer (Pesaresi et al., 2016), Global Urban Footprint (Esch et al., 2013), Global Surface Water (Pekel et al., 2016), Forest/Non-forest map (Shimada et al., 2014), Tree Canopy Cover (Hansen et al., 2013), Global Forest Cover Gain (Hansen et al., 2013), and Global Forest Cover Loss (Hansen et al., 2013)); others are multi-class like GlobeLand30 (Chen et al., 2015), FROM-GLC (Gong et al., 2013)).Some have multitemporal coverage.Among all global high resolution LC maps, the study focuses on GlobeLand30 (GL30) because it is multiclass and have more temporal repetition than FROM-GLC, and therefore it can potentially meet the needs of many users.
The aim of this study was to compute the accuracy indexes of GL30 for Europe, providing useful indicators that can help users to decide if the quality of the map is suitable for a specific target objective (e.g.monitoring climate change, capturing perma- nent snow and ice melting rates, deforestation rates, urban planning, etc.).Accuracy indexes are computed from an error matrix, which is derived by the comparison of GL30 against a reference dataset.In this study, the LUCAS (Land Use/Cover Area frame Statistical survey) dataset for the year 2009 was chosen as the reference dataset for validation of GL30 2010, because it contains authoritative in-situ observations available free of charge.
For its characteristics, LUCAS in theory could be an optimal dataset for assessing land cover maps.Despite that, as far as the authors have been able to find, very few studies are available in literature about land cover assessment using LUCAS.
Namely, in one study (Gallego, 2011), CORINE Land Cover 2000 (CLC2000) was validated.The author recognized that the different minimum mapping units (MMU) of LUCAS (7m2) and CLC2000 (25ha) was partially affecting validation results due to the inability of CLC2000 to capture features in as much detail as LUCAS.Another contribution to error was attributed to photointerpretation mistakes in LUCAS.Moreover, certain classes did not contain a sufficient number of sample points, so per-class accuracy indexes (Omission and Commission error) could not be confidently estimated.In another study, Karydas et al. (2015) used LUCAS 2009 to validate the 2007 WWF Hellas LC map of Greece, with 30m resolution.A two-tier methodology was employed: first, a so called, "automated" process based validation directly on exploiting information about LC of LUCAS data; and second, a "supervised process" reinterpreted photos collected in the LUCAS campaign.The accuracy of the "supervised" procedure was lower than of the "automated" procedure, which the authors ascribed to misclassification, probably due to mistakes the surveyors made in the implementation of the LUCAS protocol in Greece.Both of the studies used non-squared matrices due to having different classes of reference and classified datasets.
This paper started with the idea of assessing GL30 2010 for all of Europe using LUCAS 2009 (the two datasets are more or less contemporary), considering that the difference between their MMUs is definitely smaller than the one between LUCAS and CLC 2000.Also, it is possible to directly link the classes of GL30 and LUCAS using a standard squared error matrix, which is an advantage since non comparable classes could give misleading results.What was found is that GL30 is not coherent with LUCAS, especially in some classes.Taking into account both, these results and the outcomes of the previously mentioned papers, the study included a comparison of LUCAS to other independent sources of information (authoritative and not) to determine if there is a systematic error in the data that can be removed so that the potential of the LUCAS database can be exploited.
In the following sections, the steps of the work and the results obtained are reported.Namely, section 2 describes the data used, section 3 presents the processing of the data, and sections 4 and 5 are related to the assessment results in Europe and in a small region where the LUCAS dataset was more deeply analyzed.At the end, in section 6, some conclusions are drawn.

DATASETS
In this section the different datasets used in the work are described.GlobeLand30 is the dataset that was chosen to be assessed and LUCAS was chosen as a ground truth reference.The last dataset, DUSAF, was considered when some doubtful and unexpected results were found.The LUCAS sampling strategy (Martino et al., 2009) consists of two phases.First, in the stratification phase, points found on the intersection of 2km 2 x 2km 2 systematic grid are photo-interpreted and assigned to one of the predetermined classes.Based on the first-phase (master) sample, the second-phase (field) sample is chosen using a simple random sampling method.The overall sampling rate (rate between number of points in field samples and in the master sample) is about 25%.Samples are extracted independently in each NUTS2 (a basic region for the application of regional policies) and in every stratum, and the sampling rate per stratum is modified separately for each NUTS2 region.

GLOBELAND30
Inaccessible points (above 1000 meters and far from the road network) are discarded from the field samples, and often photointerpreted, to reduce the cost of the survey (Martino et al., 2009) .
LUCAS land cover nomenclature has 8 main classes: Artificial land, Cropland, Woodland, Shrubland, Grassland, Bareland, Water and Wetlands.Every main class has several subclasses.

DUSAF
DUSAF is a regional land cover database for the Lombardy region, in Northern Italy, produced since 2000 (Bonomi et al., 2011).In the Lombardy Region geoportal (http://www.geoportale.regione.lombardia.it/en)DUSAF land cover vector maps for different time periods are available at 1:10,000 scale, in Shapefile format and in WGS84 reference system with UTM 32 N projection (EPSG:32632).The adopted legend (i.e. the existing classes) is designed in five hierarchical levels.The first level consists of five classes: Artificial surfaces, Agricultural areas, Forest and Semi Natural Areas, Wetlands, and Water Bodies; every next level contains subclasses of the previous level.The map selected for this study is DUSAF 4.0 from 2012.DUSAF has a higher spatial resolution than GL30 (its accuracy is around 2m ) and better detail in the definition of the classes.Therefore, it can be considered as a ground truth reference for assessing the quality of GL30.

PROCESSING
The study performs preprocessing (i.e.preparation of the two datasets), validation, and analysis of the accuracy indexes derived from validation.Two open-source software applications, QGIS (QGIS Development Team, 2018) and GRASS GIS (GRASS Development Team, 2017) were used for the preprocessing, while for validation two tools were developed through an ad hoc Python script for the QGIS desktop GIS software.The tools samples the pixel values of the map under assessment (in this case GL30) corresponding to points or pixels where ground truth observations (in this case LUCAS or the rasterized DUSAF) are present.Based on these sampled pixels and the ground truth observations, an error matrix is created.This is the basic element for the computation of several accuracy indexes proposed in literature.Some indexes indicate the global accuracy of the map (e.g. the Overall Accuracy, Kappa index, etc.), while others are used for accuracy assessment of the individual LC classes available in the classified map (e.g. the Producer's accuracy and the User's accuracy).
The validation can be performed by using two types of reference datasets: vector points or raster data.Moreover, the methodology requires reference and classified datasets with the same classification system, the same coordinate/reference system, and in case of raster data, the same spatial resolution as well.Therefore, some preprocessing steps of datasets were required to successfully perform the validation.LUCAS preprocessing includes: i) removal of the classified points with not valid (null) coordinates due to interrupted GPS signal during survey (denoted as 'X' in GPS column of LUCAS attribute table, ii) reprojection to ETRS89/ETRS-LAEA (EPSG:3035) projection to have a unique coordinate/reference system for all of Europe, and iii) addition of a column to the attribute table with current LUCAS notation translated to the notation of GL30 (by rules shown in Table 2) so that they can be directly compared during validation.
To prepare the raster data, GL30 tiles were firstly reprojected from UTM projection to ETRS89/ETRS-LAEA (EPSG:3035) and then merged to obtain a unique dataset for the 23 countries in Europe.The tiles contain value 0 due to the reprojection in UTM, since a regular tile in geographic coordinates does not correspond to a rectangle in projected coordinates.The 0 values were removed since they do not contain information about classification.Finally, GL30 was reclassified according to the rules in the Table 2.
Validation was made by an ad-hoc PyQGIS script named pts_lcval (https://github.com/GoricaB/Land-cover-validation).pts_lcval tool samples the classified map (GL30) in the position of the points of the reference map (LUCAS) and adds the sampled value in the new column of the reference map.In this way, values of the classified map are as vector points, not raster.Next, the confusion matrix, and all of the indexes of accuracy (global and per-class) were computed.For validation of GL30 in the specific country, the points of LUCAS covering the country are selected prior to running the tool.Validation based on the vector dataset as a reference required different preprocessing.Being a vector layer, DUSAF 4.0 needed to be rasterized.Although, the scale of DUSAF is 1:10,000 (about 4m resolution), it was rasterized to 30m resolution to correspond to GL30.The change of resolution does not significantly affect the outcome of the analysis (Brovelli et al., 2015).It was also reclassified to match the classes of GL30 using rules from Table 3. Tundra class does not exist in Lombardy, so it was left out.Two tiles of GL30 are covering the whole area of Lombardy, and therefore GL30 preprocessing for this type of validation merged the two tiles.Moreover, since the two maps are originally in the same coordinate/reference system, reprojection was not needed.At this point, both datasets were prepared to be input for the validation tool raster_lcval, which compares datasets pixel by pixel, computing the confusion matrix and many accuracy indexes.

VALIDATION OF GL30 WITH LUCAS IN EUROPE
GL30 was validated using the LUCAS database in two cases; in the former, the validation was performed for all of Europe, while in the latter the indexes were computed for each country separately.

Europe
Several indexes are drawn from the error matrix, which, according to Labatut and Cherifi (2011), Congalton and Green (2009), Koukoulas and Blackburn (2001), Pontius and Millones (2011), can be useful for map accuracy assessment.Global indexes are shown in Table 4, while per-class indexes are shown in Table 5.
The values of the indexes range from 0 to 1.For indexes of agreement (Overall Accuracy -P0, Margfit Overall Accuracy -P0marg, Classification Success Index -CSI, Kappa -K, User's Accuracy -UA and Producer's Accuracy -PA), value 1 denotes perfect agreement and value 0 denotes absolute disagreement; for indexes of disagreement, Allocation Disagreement -AD and Quantity Disagreement -QD, the opposite is true (i.e. 1 is perfect disagreement).
The most common index is P0, whose value should be at least 0.85 according to Anderson et al. (1976) and Thomlinson et al. (1999), or 0.7 according to Pringle et al (2009) in order to say that the map is accurate enough.The value of P0 is 0.55.P0marg, which is similar to P0 but takes into account the differences in class sample size, shows the same, low accuracy (0.57).
The values obtained for CSI (0.37) and K (0.38) confirm the low accuracy since they are lower than the expected minimum value (0.8) proposed by Koukolas and Blackburn (2001) and Landis and Koch (1977).
QD value (0.3) is double the value of AD (0.15).This implies that the total error is predominantly due to the difference in the number of sample units in a class of the reference and classified maps.However, allocation error is not negligible.From Table 5 it can be seen that from the user's perspective, Forest (0.8) and Water Bodies (0.77) are the only classes with good agreement, since their accuracy values are higher than the threshold of 0.7 suggested by Thomlinson et al. (1999)

Country by country
Since the LUCAS data were surveyed by independent teams for each of 23 countries, the validation was also performed country by country to understand variations at this level.In the Table 6 it can be seen that P0 is the lowest in Ireland and highest in the Czech Republic and Hungary, but even in these cases it is below 0.7.In most of the cases, P0marg is lower than P0, which means that difference in sample size is contributing to the Overall Accuracy.Being always below 0.5, K is indicating poor agreement in all of the countries.As in the case of Europe, the trend of AD and QD behavior remains the same.For all of the countries except France, QD is larger than AD, and may be up to 5 times larger.
Per-class indexes are shown for each class and each country in the Figure 1 for PA and Figure 2 for UA.The darker the color on the map, the higher the value of the accuracy. According

VALIDATION OF GL30 WITH LUCAS AND DUSAF IN LOMBARDY
From previous experience (Brovelli et al., 2015) it was expected that the results of validation would definitely be more satisfactory, at least in Italy.Considering, the Lombardy region (Northern Italy) Bratic et al. (2018) obtained significantly higher results in GL30 accuracy.The reference map used for the validation was, in this case, DUSAF 4.0, a regional authoritative map with 2m accuracy.This section compares the results of the two GL30 validations using the LUCAS and DUSAF datasets, respectively.Figure 3 presents a comparison between the global indexes obtained against LUCAS and DUSAF, respectively.It can be seen   The comparison of results, presented in the previous section, points out irregularities in the LUCAS data.Hence, a visual check of a certain number of LUCAS points was conducted.It was done with an attempt to understand if LUCAS has systematic errors, since this type of error can sometimes be removed.If this is the case for LUCAS, LUCAS might become more valuable for GL30 validation.The points were photo interpreted on the basis of three imagery sets: 1. Orthophoto of Lombardy region for 2007 at scale 1:10,000 (hereafter Ortho2007) 2. Orthophoto of Lombardy region for 2012 at scale 1:10,000 (hereafter Ortho2012) 3. Bing Aerial imagery (hereafter Bing).Since it is composed of tiles, the dates of imagery acquisition might vary from April 2015 to September 2017.Since these images are temporally more distant from the time of production of LUCAS data then the previous ones, it is expected that greater differences will be found.Nevertheless these images are considered due to their high resolution (around 2m) and therefore greater details.
Only the points of the classes that had very low PA and UA were verified.The classes with a large number of points, Grassland and Water Bodies, were randomly sampled with a rate of about 20% to reduce number of points to be verified.Hence, the following classes were checked independently by three operators: 1. Grassland -40 points 2. Shrubland -15 points 3. Wetland -1 point 4. Water Bodies -14 points 5. Bareland -14 points After the photo-interpretation of the points on the different imagery sets, the percentage of correctly classified LUCAS points according to each imagery set and class was calculated.Table 7 displays the percentage of correct LUCAS points found in every analyzed class according to photo-interpretation of Bing, Or-tho2007 and Ortho2012.Unambiguously, all of the classes contain certain errors, but it can be said that Grassland and Shrubland are classified much better then the rest of the classes.The Wetland class is completely incorrect, and this is an effect of the small sample size in this category.Namely, there is only one point that is classified as Wetland in Lombardy, but it is wrongly classified according to all of the photo-interpreted maps.Observation type attribute is, to some extent, similar to observation distance.There are 5 types of observations in the LUCAS survey, but in Lombardy, only 3 are present: • Type I: field survey of points visible on distance smaller than 100m • Type II: field survey of points visible on distance larger than 100m • Type III: field photo-interpretation in circumstances of unexpectedly inaccessible points Percentage of the correctly classified LUCAS points in a class with respect to the three types of observation are displayed in the Table 9.It was expected that type I observations will have high percentages for each class.However, the results of Bareland showed high deviation with respect to other classes.The highest number of correctly classifed Bareland points were obtained by type III observations.Furthermore, none of the observation types had homogeneous results for all of the classes.Hence the observation type seems not to to influence correctness of the LUCAS points.
GPS precision is an average location error as given by GPS receivers (in m).It is expected that the points for which the value of GPS precision was smaller would be more accurate, but Table 10 does not support this expectation.In the table, the column "<10m" is cumulative of the column "<5m".The percentage of correctly classified LUCAS points corresponding to different GPS precision thresholds, as presented in the table, shows that for all classes except Shrubland, error are fewer for points taken with a GPS precision of less than 10m.Since there is an exception in the case of Shrubland, GPS precision cannot be considered as an attribute that affected the classification accuracy of LUCAS points.96m and with a GPS precision of 5m.Despite that water is not expected to be misclassified, it is wrongly classified looking at Bing tile from May 2015.As can be seen, water is nowhere even close to the point, as the point belongs to the Shrubland class, and is surrounded by Cropland.
The same is evident from the Ortho2007 and Ortho2012, therefore this error is not due to landscape change over the years.

CONCLUSION
Validation of global land cover products is a challenging task considering the area for which ground truth should be provided at a reasonable cost.Initially, LUCAS 2009 data seemed to be a good dataset to be exploited as a ground truth reference since it was mostly collected during in-situ field surveys.Other advantages of LUCAS are that it covers an enormous area (almost a continent) and that it is free of charge.Some former studies have shown that LUCAS might not be a good choice of reference, but since characteristics of GL30 are more similar to LUCAS characteristics with respect to datasets validated in these studies (smaller differences in MMU and the same class definition as for reference data), LUCAS was used as a ground truth reference for the validation of GlobeLand30.According to case studies found in the literature, GL30's accuracy is expected to be higher than the validation outcome that our study is showing.This brings into question whehter LUCAS is an adequate ground truth reference, even in case of similar features with comparison maps (in this case GL30).Further examination was done to determine if it is justified to use LUCAS as a reference by comparing validation results where LUCAS served as a reference to other results obtained with authoritative, high-accuracy data.The analysis for comparison were restricted to the Lombardy region, Italy because here highly accurate data were easily accessible (DUSAF).In general, accuracy indexes for validation with LUCAS are lower than the one with DUSAF, which points towards errors in LUCAS since DUSAF is known to be very accurate.Furthermore, a visual check of LUCAS points was done to understand if some specific characteristic (observation distance, observation type, and GPS precision) may be the cause for errors in the dataset in such a way that these specific points could be removed from the dataset.The analysis was not able to reveal any obvious systematic error.Since the analysis of LUCAS errors were limited to one small region, it cannot be said that the results can be extended to the whole area covered by LUCAS.Certain caution is recommended when utilizing LUCAS as ground truth.

Figure 1 .
Figure 1.Producer's accuracy for each country

Figure 2 .
Figure 2. User's accuracy for each country

Figure 3 .
Figure 3. Global accuracy indexes for Lombardy

Figure 4 .
Figure 4. User's accuracy for each LC class in Lombardy

Figure 6 .
Figure 6.Example of an error in the LUCAS data

Table 1 .
Existing high-resolution global land cover maps

Table 3 .
Reclassification rules for DUSAF Considering the six global indexes, it can be claimed that either classification went wrong or the reference data is not accurate.

Table 5 .
Per-class accuracy indexes for Europe

Table 6 .
to PA results, Cropland is the most accurate class in all the considered countries, while Forest and Artificial Surface are moderately or poorly accurate depending on the country.The rest of the classes like Grassland, Shrubland, Wetland, Water Bodies, Artificial Surface and Bareland have low accuracy in all of the countries.Global accuracy indexes for each country

Table 7 .
Percentage of correctly classified LUCAS pointsSeveral attributes collected during the LUCAS survey are recognized as potentially useful for detecting the cause of errors in LUCAS data:Observation distance is the distance of the surveyor from the observation point at the moment of point acquisition.It is expected that observations from the smaller distance are more accurate.TheTable8is showing which percentage of the correctly classified LUCAS points in a class are surveyed from <5m, <10m, <100m and >100m observation distances.The columns "<10m" and "<100m" are cumulative of the previous columns.The results shown in the table are not consistent for all classes.In fact, the Grassland and Shrubland points have more probability to be correct if they were surveyed from a distance of less than 100m.The increase of proximity to the observation (<5m) seems not to be important in this case.Results for Water are different depending on the reference imagery, and results for Bareland are not showing any dependence on the observation distance.Taking into consideration all of the above mentioned results, it may be reasonable to think that observation distance does not have a significant impact on the classification of the points.

Table 8 .
Percentage of correctly classified points with respect to distance of observation point

Table 10 .
Percentage of correctly classified points with respect to GPS precision