REFINEMENT OF CROPLAND DATA LAYER USING MACHINE LEARNING

As the most widely used crop-specific land use data product, the Cropland Data Layer (CDL) covers the entire Contiguous United States (CONUS) at 30-meter spatial resolution, with accuracy of up to 95% for major crop types (e.g., corn and soybean) in major crop-growing areas. However, the quality of the early-year CDL products was not as good as that of recent ones. The early-year CDL products contain many erroneous pixels caused by cloud cover in the original Landsat images, which affects many follow-on research efforts and applications. To address this issue, this study explores the feasibility of using machine learning to refine and correct misclassified pixels in the historical CDLs. An end-to-end deep learning-based framework for restoring misclassified pixels in CDL images is developed and tested. By feeding the CDL time series into an artificial neural network, a crop sequence model is trained and the misclassified pixels in an original CDL map can be restored. In an experiment with the 2005 CDL data of the State of Illinois, the misclassified pixels over Agricultural Statistics District (ASD) #1760 were corrected with reasonable accuracy (>85%). The findings suggest that the proposed method provides a low-cost and reliable way to refine the historical CDL data and could potentially be scaled up to the entire CONUS.


INTRODUCTION
Since the first release of a full statewide data product in 1997, the Cropland Data Layer (CDL) product of the U.S. Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) has been widely used by growers, the agricultural industry, governments, educators, students, and researchers worldwide for crop production, agricultural production planning and management, government policy formulation and decision making, teaching, and various research activities (Liknes et al., 2009; Thompson, Prokopy; Hao et al., 2015; Lark et al., 2015; Di et al., 2017). Currently, the CDL data covers the entire conterminous United States (CONUS) at 30-meter spatial resolution, with accuracy of up to 95% for major crop types (e.g., corn, soybean, and wheat). However, the quality of the early-year CDL products was not as good as that of recent years. The early-year CDL products contain many misclassified pixels because of cloud cover and a lack of satellite images. Moreover, CDL data were produced for only a few states before 2008. For example, the year 2000 CDL covers only Illinois, Indiana, Mississippi, North Dakota, and parts of Arkansas and Iowa. The limited availability and lower quality of the early-year CDLs affect many follow-on Land Use and Land Cover (LULC) research efforts and applications. Therefore, an effective method for refining and correcting the historical CDL data is needed to improve its quality and accuracy.
It is well known that monocropping results in soil degradation, a build-up of diseases and pests, and a decline in productivity. Crop rotation has therefore become a common farming practice in the U.S. Corn Belt. Crop rotation can significantly improve soil condition, such as fertility and physical and chemical properties (Pikul et al., 2001; Karlen et al., 2006; Govaerts et al., 2007; Karlen et al., 2013; Van Eerd et al., 2014). Meanwhile, crop sequence and cropping decisions also have a significant impact on crop yields and profitability (Temperly, Borges; Parajuli et al., 2013; Farmaha et al., 2016). Based on this common cropping practice, many crop mapping and yield estimation models and approaches have been developed. Secchi et al. (2011) constructed a prediction model of future land use scenarios in the state of Iowa based on the corn-soybean rotation and production costs. Schönhart et al. (2011) developed a crop sequence model to generate crop rotations based on agronomic criteria and observed data. Sahajpal et al. (2014) detected pronounced shifts from grassland to cultivated area by modelling crop rotation in the U.S. Western Corn Belt. Hao et al. (2016) explored crop classification based on previous-year crop knowledge. Zhang et al. (2019a) produced a crop cover map of the State of Nebraska based on the common crop rotation patterns of corn, soybeans, winter wheat, and alfalfa. They further implemented a crop sequence-based machine learning framework for the prediction of crop cover maps (Zhang et al., 2019b).
In this paper, we present a machine learning-based crop sequence model to refine the historical CDL data. The proposed model uses an artificial neural network (ANN) to automatically learn crop sequence information from the CDL time series. The trained model can then automatically identify and correct the misclassified pixels in a historical CDL crop cover map.
The rest of the paper is organized as follows. Section 2 introduces the CDL data, the study area, and an end-to-end machine learning framework for refining the historical CDL data. Section 3 presents the experiment results and assesses the refinement performance. Section 4 discusses the limitations of the current implementation and concludes the paper.

Cropland Data Layer
CDL is a raster-formatted, geo-referenced, crop-specific land cover map produced by USDA NASS. It is an annual product that covers the entire CONUS at 30-meter spatial resolution from 2008 to the present, and some states from 1997 to 2007. The production of CDL is mainly based on moderate-resolution satellite imagery and extensive agricultural ground truth (Boryan et al., 2011). In this paper, misclassified pixels are the pixels labelled as "clouds" or "no data". These pixels mainly occur in the CDL products before 2006 because of the lack of high-quality satellite data and the limitations of the classification algorithm at that time. Examples of misclassified pixels in the early-year CDL are shown in Figure 1.
The CDL data products can be downloaded freely from CropScape (https://nassgeodata.gmu.edu/CropScape/), which is developed and maintained in cooperation with the Center for Spatial Information Science and Systems of George Mason University (Han et al., 2012; Zhang et al., 2019c). It provides an easy-to-use Web GIS application to visualize, analyse, and download CDL data. All data hosted on CropScape are disseminated via OGC standards-compliant geospatial Web services, such as Web Map Service (WMS), Web Coverage Service (WCS), Web Feature Service (WFS), and Web Processing Service (WPS).

Study Area
Agricultural Statistics District (ASD) #1760 in the State of Illinois is selected as the study area. The study area lies in the Central Corn Belt Plains Ecoregion, which is mainly covered by corn, soybeans, grassland, and forest, as shown in Figure 2. The 2005 CDL contains a considerable number of pixels that are labelled as "clouds or no data" over the study area. The purpose of this study is to restore those misclassified pixels in the 2005 CDL of the study area using the machine-learned crop sequence model.

Machine Learning Framework
To automatically correct the misclassified pixels in CDL, an end-to-end machine learning framework is proposed in this paper. The proposed framework is composed of four major components: data preparation, model training, classification, and evaluation.

Data Preparation:
In data preparation, the CDLs from 2006 to 2018 are stacked sequentially to form a CDL time series. All pixels of the CDL time series are arranged into a 2-D array of samples. Each row of the array represents one pixel and consists of the sequence of its crop type values in the different years. Training and validation data sets are randomly sampled from the "good pixels" in the study area and labelled with their 2005 CDL values. The experiment data set consists of the time series rows for the misclassified pixels in the study area, which carry no labels.
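The data preparation step can be illustrated with a minimal sketch. It assumes small in-memory rasters represented as nested lists, and it assumes that class code 81 marks "clouds or no data" and that codes 1 and 5 stand for corn and soybeans; these codes should be checked against the CDL legend for the year in question. The function names `build_samples` and `split_train_validation` and the 25% validation fraction are hypothetical choices for illustration, not the authors' implementation.

```python
import random

# Assumed class code for "clouds or no data"; verify against the CDL legend.
CLOUD_OR_NODATA = 81

def build_samples(target_map, later_maps):
    """Flatten a stack of yearly rasters into one row per pixel.

    target_map: 2-D list of class codes for the year to refine (e.g. 2005).
    later_maps: list of 2-D lists, one per later year (e.g. 2006 to 2018).
    Returns (labelled, experiment):
      labelled   - (sequence, label) pairs for the "good pixels"
      experiment - ((row, col), sequence) pairs for the misclassified pixels
    """
    labelled, experiment = [], []
    for r, row in enumerate(target_map):
        for c, label in enumerate(row):
            # One sample row: the crop type of this pixel in each later year.
            sequence = [year_map[r][c] for year_map in later_maps]
            if label == CLOUD_OR_NODATA:
                experiment.append(((r, c), sequence))
            else:
                labelled.append((sequence, label))
    return labelled, experiment

def split_train_validation(labelled, validation_fraction=0.2, seed=0):
    """Randomly split the labelled samples into training and validation sets."""
    shuffled = labelled[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Toy 2x3 rasters (assumed codes: 1 = corn, 5 = soybeans).
cdl_2005 = [[1, 81, 5], [5, 1, 81]]
cdl_2006 = [[5, 5, 1], [1, 5, 1]]
cdl_2007 = [[1, 1, 5], [5, 1, 5]]

labelled, experiment = build_samples(cdl_2005, [cdl_2006, cdl_2007])
train, validation = split_train_validation(labelled, validation_fraction=0.25)
```

Keeping the pixel coordinates with each experiment row makes it possible to write the predicted labels back into the map afterwards.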

Model Training:
The crop sequence model is trained by feeding the training set into an artificial neural network that contains one input layer, multiple hidden layers, and one output layer. The input layer contains a group of neurons corresponding to the same pixel across the CDL time series, and each input neuron takes the crop type value of that pixel in one year. The hidden layers sit between the input layer and the output layer. The output layer uses softmax to estimate the probability of each crop type.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-3/W11, 2020 PECORA 21/ISRSE 38 Joint Meeting, 6-11 October 2019, Baltimore, Maryland, USA
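As a rough illustration of the network structure described above, the following sketch runs one pixel's crop sequence through two hidden layers and a softmax output layer. Several details are assumptions rather than the paper's configuration: the one-hot encoding of the class codes, the ReLU activation, the layer sizes, and the class list. The weights are random and untrained, so the output only demonstrates the shape of the computation, not a meaningful prediction.

```python
import math
import random

def one_hot_sequence(sequence, classes):
    """Encode each year's class code as a one-hot vector and concatenate them."""
    vector = []
    for code in sequence:
        vector.extend(1.0 if code == cls else 0.0 for cls in classes)
    return vector

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(sequence, classes, layers):
    """Input layer -> hidden layers (ReLU) -> output layer (softmax)."""
    activations = one_hot_sequence(sequence, classes)
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))

classes = [1, 5]  # hypothetical class list (assumed codes: corn, soybeans)
n_years = 3       # toy sequence length; the paper stacks 2006 to 2018
rng = random.Random(0)
layers = [
    random_layer(n_years * len(classes), 8, rng),  # hidden layer 1
    random_layer(8, 8, rng),                       # hidden layer 2
    random_layer(8, len(classes), rng),            # output layer
]

probabilities = forward([5, 1, 5], classes, layers)
predicted = classes[probabilities.index(max(probabilities))]
```

In practice the weights would be fitted on the training set with a deep learning library; the class with the highest softmax probability is then taken as the restored label.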

Classification and Validation:
By feeding the experiment data set to the trained crop sequence model, the misclassified pixels in the original CDL can be refined. To validate the refinement performance, we applied the same crop sequence model to the validation set and measured the agreement between the predicted labels and the original labels of the validation set.
The applications of the proposed machine learning-based crop sequence model are illustrated in Figure 3. In this study, the crop sequence model is used to restore the historical crop cover map. The model can also be applied to predict future crop cover maps, which provides high-confidence training samples for early-season and in-season crop mapping.
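The agreement measure used for validation can be sketched as the fraction of validation pixels whose predicted label equals the original 2005 label. The per-class breakdown below is an addition for illustration and is not reported in the paper; the function names and the toy labels (assumed codes 1 = corn, 5 = soybeans) are hypothetical.

```python
def overall_agreement(predicted_labels, original_labels):
    """Fraction of validation pixels whose predicted label matches the original."""
    if len(predicted_labels) != len(original_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("validation set is empty")
    matches = sum(p == o for p, o in zip(predicted_labels, original_labels))
    return matches / len(original_labels)

def per_class_agreement(predicted_labels, original_labels):
    """Fraction of pixels of each original class that the model reproduced."""
    totals, hits = {}, {}
    for p, o in zip(predicted_labels, original_labels):
        totals[o] = totals.get(o, 0) + 1
        hits[o] = hits.get(o, 0) + (p == o)
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Toy validation labels.
original = [1, 1, 5, 5, 5, 1, 5, 1]
predicted = [1, 5, 5, 5, 5, 1, 1, 1]

overall = overall_agreement(predicted, original)     # 6 of 8 labels match
by_class = per_class_agreement(predicted, original)
```

Because the validation pixels are drawn from the "good pixels", this number measures how well the model reproduces the original CDL labels rather than ground truth.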

RESULTS
The refined 2005 CDL data of ASD #1760 is illustrated in Figure 4. Comparing the refined result with the original 2005 CDL data, we observed that the misclassified pixels had been corrected with the crop sequence information learned from the CDL time series. The overall accuracy of the refined pixels cannot be assessed directly because ground reference data are lacking. Instead, we used the validation data set, derived from the "good pixels" of the 2005 CDL in the study area, to measure the performance of the model indirectly. The overall accuracy on the validation set is over 85%. Because this figure is measured on good pixels rather than on the refined pixels themselves, the actual overall accuracy of the refined pixels may differ. Ground reference data are required to further validate the refinement performance.

CONCLUSION
This study investigated the feasibility of using machine learning to refine CDL data. An end-to-end ANN-based framework was proposed and tested to correct the misclassified pixels in the historical CDL data. The preliminary experiment result indicates that the misclassified pixels over ASD #1760 could be corrected with reasonable accuracy (>85%). The findings suggest that the proposed machine learning approach is an effective and low-cost way to correct the misclassified pixels and has great potential for refining the historical CDL over large geographic areas. More experiments and validation will be conducted in the future.