MULTI-TEMPORAL LAND COVER CLASSIFICATION WITH LONG SHORT-TERM MEMORY NEURAL NETWORKS

Land cover classification (LCC) is a central and wide field of research in earth observation and has already put forth a variety of classification techniques. Many approaches are based on classification techniques considering observation at certain points in time. However, some land cover classes, such as crops, change their spectral characteristics due to environmental influences and can thus not be monitored effectively with classical mono-temporal approaches. Nevertheless, these temporal observations should be utilized to benefit the classification process. After extensive research has been conducted on modeling temporal dynamics by spectro-temporal profiles using vegetation indices, we propose a deep learning approach to utilize these temporal characteristics for classification tasks. In this work, we show how long short-term memory (LSTM) neural networks can be employed for crop identification purposes with SENTINEL 2A observations from large study areas and label information provided by local authorities. We compare these temporal neural network models, i.e., LSTM and recurrent neural network (RNN), with a classical non-temporal convolutional neural network (CNN) model and an additional support vector machine (SVM) baseline. With our rather straightforward LSTM variant, we exceeded state-of-the-art classification performance, thus opening promising potential for further research.


INTRODUCTION
In earth observation, the problem domain of land cover classification (LCC) has educed a variety of techniques until today.Many approaches rely on mono-temporal observations and concentrate on spectral or textural features describing observations acquired at one specific point in time.However, some land cover classessuch as, e.g., vegetation and especially crops-are difficult to classify by mono-temporal approaches (Foerster et al., 2012), as vegetation changes its spectral and textural appearance within its species-dependent growth cycle.Especially crops develop these temporal dynamics in a systematic and thus predictable manner, dependent on phenology and the applied crop calendar (Valero et al., 2016;Whitcraft et al., 2014).These temporal features can be utilized for classification by suitable techniques.
In the recent past, the deep learning community has developed a variety of architectures producing impressive results for a wide range of applications.Among these applications, long short-term memory (LSTM) neural networks are commonly utilized to handle sequential information in various problem domains, such as natural language processing and text or voice generation.In contrast to mono-temporal models, these LSTM networks can store a theoretically unlimited amount of evidence and make decisions in that actual temporal context.In text generation, for instance, the subsequent word is chosen from the vocabulary body wrt. to a sequence of preceding words.These generated texts imitate the language, grammar, and word choice of the training data.
In this work, we propose to use LSTM networks for the purpose of crop classification from earth observation data.In experiments carried out on a series of SENTINEL 2A images collected over the entire growth season of the year 2016, the effect of multi-temporal features has been evaluated by comparing the performance of multi-temporal models, namely LSTM networks and RNNs, with mono-temporal convolutional neural network (CNN) models and a support vector machine (SVM) baseline.

Remote Sensing of Phenology
Vegetation follows specific periodic growth cycles determined by the plant's biology.The study of these cycles is known as phenology and describes characteristic events such as germination, flowering, or senescence.Along with these phenological events, plants change their reflective spectral characteristics which can be observed via remote sensing technologies.Phenological characteristics are assumed to change in a predictive manner and can thus be utilized for identification, as long as farming practices and environmental conditions remain unchanged or are considered in the model (Odenweller and Johnson, 1984;Foerster et al., 2012).
Figure 1 illustrates reflective RGB responses of different crops in SENTINEL 2A observations along the growth season.Fields containing the same types of cultivated crops change their spectral appearance uniformly within the series of observation.This is due to a combination of the crops phenological cycles and farming practices, such as the date of seeding and or harvesting.
Commonly, vegetation remote sensing uses vegetation indicese.g., the normalized difference vegetation index (NDVI) or enhanced vegetation index (EVI)-to observe these changes over a temporal series of observations (Reed et al., 1994).However, these indices usually consider only a limited number of bands which are predominantly sensitive to chlorophyll and water content, i.e., red and near infrared wavelengths.Further spectral bands are often discarded, even though that information is perceived by the satellite and may also contribute to the classification procedure.
Additionally, these approaches need to filter high-frequent coverages, such as clouds, as preprocessing routines or by applying upper envelope filters (Bradley et al., 2007) to remove negative outliers from the temporal profiles.Overall, these manually-crafted functional models might not be able to represent the complex effects of various influencing factors-such as, for instance, weather conditions, sunlight exposure, or farming practices-which are encoded in the reflectance signal.For these reasons, very recent research has started to employ deep learning techniques to overcome these limitations for crop yield prediction (You et al., 2017) and classification of phenological events (Ikasari et al., 2016).

RELATED WORK
Even though vegetation analysis with continuous monitoring over the growth season dates back many decades (Reed et al., 1994), only recently space-born sensors-namely the LANDSAT and ESA SENTINEL series-provide sufficient ground sampling distance (GSD) and temporal resolution for single-plot field classification.Thus, classical land-cover classification approaches have concentrated on multi-or hyperspectral sensors at one single observation time.utilize RNNs and LSTM architectures to multispectral LANDSAT-7 and hyperspectral EO-1 HYPERION images, but-in contrast to our approach-for the purpose of change detection.

Neural Network Architectures
Traditional classification systems are assembled from sequential building blocks, e.g., feature extraction, classification, and post processing, as summarized by Ünsalan and Boyer (2011).Features, which are expected to be significant for classification, are extracted from available observations, e.g., via estimating the parameters of functional models (Bradley et al., 2007).These features are further passed as inputs to classifiers like, for instance, maximum likelihood classifiers, SVMs, or RDFs.The optimal choice of feature extraction methodology and classifiers depends on the actual classification task and available data.
In contrast to these approaches, artificial neural networks (NNs) are trained in an end-to-end manner, solely based on raw input data x ∈ R n and output labels ŷ ∈ R c .NNs are usually used for supervised learning to approximate non-linear response functions, e.g., class probabilities, by a sequence of affine transformations Wdata important for classification based on available data.This scheme ensures that neural network architectures can be applied to a variety of tasks and scenarios, as long as a sufficient quantity of training data is available.Additionally, information which is not important for classification can be gradually ignored by the network.Hence, all available information can safely be provided to the deep learning model.Nevertheless, various neural network architectures have been developed for application to certain fields which-by design-excel at some types of features.
Mono-temporal models Feed-forward neural networks process input data in a one-directional pipeline.Image processing and segmentation feed forward neural networks incorporate additional convolutional layers to account for local neighborhoods and thus are well suited for recognition of shapes and textural patterns.Due to these properties, convolutional neural networks (CNNs) are already applied in earth observation for high resolution satellite imagery (Hu et al., 2015) or semantic segmentation (Castelluccio et al., 2015).
Multi-temporal models RNNs (Werbos, 1990) are potentially well suited for processing sequential data, such as temporal sequences of observations, as the network has access to information of the previous observation for the classification of current observation.At each network layer l and observation t, a hidden output vector h l t is derived from the output of the previous observation h l t−1 and the input of the current observation h l−1 t .Hence, decisions can be based on the context of previous observations, thus making RNNs a useful architecture for language processing, text generation, or voice recognition.
Inducing a further level of complexity, long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) introduce an additional cell state vector c l t ∈ R m providing long-term memory capabilities, as at each observation t information can be stored or retrieved to varying extents.Figure 2 illustrates the flow of vectorized information within one LSTM cell layer l.At each observation t, the previous cell state vector c l t−1 is manipulated by a set of gates, i.e., the forget gate f l t for discarding information and the input gate i l t combined with the modulation gate g l t for writing additional information to c l t .The output gate o l t is derived based on h l t−1 and h l−1 t and provides the same functionality as a RNN layer.The new output vector h l t is then obtained by an element-wise multiplication of the output gate o l t and the cell state c l t .While our approach is based on these LSTM networks (Zaremba et al., 2014), a variety of variations have been presented in the past (Gers et al., 2002;Graves and Schmidhuber, 2005;Graves et al., 2013;Kalchbrenner et al., 2015).

Approach
We employ LSTM neural networks for the purpose of crop classification on a per-plot scale from medium resolution satellite imagery.Figure 3 illustrates the classification and training scheme of our approach for at multiple consecutive observations.At each observation t, ns spectral bottom-of-atmosphere reflectance measurements ρi ∈ R (k×k)•ns , along with the day of observation ti are fed to the network as input vector x.A series of l LSTM layers process the data with additional information of the previous observation t − 1.A softmax layer produces probabilities for each class ŷt.At each training step, the cross-entropy loss wrt.predicted and actual class probabilities is calculated.Gradients are calculated by Adam optimizer (Kingma and Ba, 2014) and back-propagated in order to adjust the network weights.
Crops have been chosen as subject of classification since these land cover classes are expected to change in a characteristic manner, as explained in Section 1.1, thus making these classes ideal subjects to demonstrate the capabilities of temporal modeling by LSTM networks.

EXPERIMENTS
In order to evaluate the classification performance of the LSTM architecture and to investigate the effects of temporal features on the classification results, we trained multiple multi-temporal models, i.e., based on LSTM networks and RNNs, and mono-temporal alternatives, i.e., a CNN model and a baseline SVM, on the dataset described in Section 4.1.In the following, we describe the body of

Data Material
To train the large number of neural network weights, a feasible body of raster and label data is necessary.For this reason a large area of interest (AOI) of 102 km × 42 km in the north of Munich, Germany, has been selected (cf. Figure 4) due to its homogeneous geography, climate conditions, and farming practices which suggest similar environmental conditions.A raster dataset of 26 SENTINEL 2A images, acquired between 31 st December, 2015 and 30 th September, 2016, has been retrieved from ESA SCI-HUB and atmospherically corrected by SEN2COR software.For consistency reasons with the LANDSAT series, blue, green, red, near-infrared and shortwave infrared 1 and 2 bands were selected for this evaluation.Field geometries and cultivated crop labels of 137 k fields in the AOI have been provided by the Bavarian Ministry of Agriculture (Bayrische Staatsministerium für Ernährung, Landwirtschaft und Forsten).
The distribution of fields per crop class is not uniform with common cultivated crops, e.g., corn or wheat, dominating the dataset, while other crops, e.g., sugar beet or asparagus, are less represented (cf. Figure 5).Nevertheless, from 172 unique field crops, 19 field classes have been selected with at least 400 field-plots in the AOI.

Data Extraction
The field geometries of the field and reflection values of the raster dataset have been discretized to a 100 m × 100 m grid of points of interest (POIs).Each POI contains information of network input x and classification ground truth labels y in a 30 m × 30 m neighborhood.The network input vector x incorporated bottom-of-atmosphere reflection values directly derived from the raster dataset combined with the day of observation  If POIs were located at the border of multiple classes, class labels have been weighted with respect to a local 30 m neighborhood.

Dataset Partitioning
Commonly, two sets of parameters need to be determined when selecting and training the neural network architecture.Weights W ∈ R n×m are adjusted during the training process by backpropagation of residuals and hyper-parameters θ are chosen following a grid search regime in order to find the optimal network structure for the classification task.To ensure that these parameters are chosen independently, training of network weights and evaluation of hyper-parameters was performed on training and validation datasets, respectively.A third evaluation dataset is used for to calculate accuracy measures of neural network independently from network weights and parameters.While the evaluation dataset remained unchanged, training and validation were redistributed in multiple folds.This practice maximizes the number of training samples and avoids misrepresentations of classes containing small numbers of features in the respective dataset.Hence, the body of POIs was divided in the three respective datasets.
As discussed in Section 1.1, the dates of phenological events are influenced by environmental conditions which vary at large spatial distances.To average these environmental conditions, the body of data is divided randomly with respect to the extent of the AOI.A pure random assignment of individual POIs would ensure an even distribution of POIs but may assign POIs of the same field to the different datasets and thus cause dependencies between datasets.
For these reasons, the POIs were not assigned completely randomly to the respective datasets, but have been first partitioned in blocks.These 476 blocks of 3 km × 3 km were in turn randomly assigned to training, validation, and evaluation in a 4:1:1 ratio (cf. Figure 6).A circumferential margin of 200 m ensures that POIs located on the same field were not assigned to different datasets.At each fold, training and evaluation blocks got reassigned randomly while the validation dataset remained unchanged.

Experimental Setup
In total, 135 networks of each architecture have been trained the body of training data over 30 epochs with varying hyperparameters l ∈ {2, 3, 4} and r ∈ {110, 220, 330, 440, 880}.This process has been repeated in 9 folds of randomly reassigned training and validation datasets.Dropout with probability pdropout = 0.5 was used for regularization.Additional to the investigated neural network architectures, a baseline SVM with radial basis function (RBF) kernel was trained on a balanced dataset of 3,000 samples per class extracted from the training dataset.
The optimal hyper-parameters, i.e., slack penalty C and RBF scaling factor γ, have been chosen based on a grid search over C ∈ 10 −2 , ..., 10 6 and γ ∈ 10 −2 , ..., 10 3 .All networks have been trained within 8 hours on a NVIDIA DGX-1 server equipped with eight NVIDIA TESLA P100 GPUs and 16 GB VRAM each.Five networks have been able to be trained on each GPU in parallel, making the large grid search of parameters possible.While neural networks were implemented in TENSORFLOW, the SVM baseline based on the SCIKIT-LEARN framework.

Results
In this section, we evaluate the performance of the trained networks at multiple scales from general performance of each neural network architecture to the performance of best networks on individual classes.chosen hyper-parameter sets and are indicated by their standard deviation intervals.These standard deviations turned out to be relatively small compared to the performances of LSTM, RNN, and CNN models and suggest that, on this dataset, the choice of the actual architecture has larger influence on classification performance than the choice of the involved hyper-parameters.Overall, LSTM networks and RNNs achieved significantly better accuracies over the entire training process with LSTM models performing slightly better than their RNN competitors.The networks which achieved best accuracies are plotted separately as solid lines, as these will be evaluated in detail in the following.

Accuracy measures per best network
The best performing networks, opposed to the SVM baseline, have been tested in detail on the evaluation dataset.Results of these experiments are compiled in Table 1.Additionally, covered and field classes have been separated from each other.Similarly to the previous figure, multi-temporal models achieved better accuracies compared to mono-temporal competitors.This performance gain is supposed to be largely caused by the field classes which-in contrast to covered classes-contain temporal phenological characteristics likely to be utilized by LSTMs and RNNs.Covered classes achieved similar accuracies in all of the models with also the baseline SVM achieving good classification accuracies.Apparently, the characteristics of these classes are more distinctive and can be utilized by all models.4.5.3Class confusions Similar results can be observed from the confusion matrices shown in Figure 9 based on the best performing networks and the SVM baseline shown in Figure 8.The class frequencies are normalized by the sum of ground truth classes to obtain the precision measure.While some classes represent distinct cultivated crop, other classes-such as meadow, fallow, or other-can not be defined precisely.Consequently, these classes performed worse during our experiments and were more likely confused with a variety of other classes, as becoming apparent in in the confusion matrix.Further chance for confusion was observed in the case of classes with are botanically related to each other and thus share similar spectral and temporal features.For instance, the classes triticale, wheat, and rye have been commonly confused, as triticale is a hybrid of the latter two classes.The CNN model, in general, performed worse compared to LSTM and RNN.Some classes (e.g., sugar beets, wheat) achieved good accuracies, Table 1.Performance evaluation of our proposed LSTM-based method in comparison to standard RNNs and mono-temporal baselines based on CNNs and SVMs.As cover classes (i.e.,, cloud, cloud shadow, water, and snow) are usually comparatively easy to recognize, we restrict our evaluation to unbiased performance measures with respect to the remaining field classes, Our LSTM model achieved good classification accuracies compared state-of-the art, while considering a notably larger number of crop classes (Foerster et al., 2012;Siachalou et al., 2015).While the hidden markov model approach of Siachalou et al. (2015) is methodically closest to our deep learning strategy, their relatively small study area together with their small number of classes impede direct comparison.Our deep learning approach achieved better accuracy performance than the approach of Foerster et al.
(2012) using spectro-temporal NDVI profiles and adjusting these by additional agro-meteorological information (cf. Figure 11).Their considerably large test area is located in north-east Germany and fields are comparable with ours in terms of cultivated crops and farming practices.Hence, their work is most comparable in terms of data.
In this work, LSTM and RNN architectures have been shown to perform similar, with the LSTM model achieving slightly better accuracies consistently over all evaluation schemes.With increasing observation length, LSTM models may be able to exhaust their full long-term memory capabilities which may be advantageous for monitoring multiple years or data with a higher temporal frequency.As a side effect, the LSTM model have learned cloud and cloud shadow detection along with field classifications with good overall accuracy of 98.5% by providing additional covered classes to the network.Hence, no preprocessing or cloud filtering is necessary, as the network learns to distinguish these coverages in the training process based on provided class labels.This argues for the flexibility of deep learning approaches allowing large amounts of data to be processed without the need of manual data preprocessing and feature selection.networks with increasing number of observations.The more observations have been registered, the more context information was available to LSTM and RNN networks to assist the classification decision.CNN models can not utilize temporal information, thus performance remained at a nearly constant level with increasing observation length.

CONCLUSIONS
Many applications and architectures of the deep learning community can be applied to the domain of earth observation for efficient, large scale data processing.This work has demonstrated the applicability of long short-term memory (LSTM) networks, originating from speech and text generation, for earth observation.Earth observation, in particular, has to face increasing amounts of data from a variety of multi-modal sensors, such as the SENTINEL, LANDSAT, or MODIS satellite series.The acquired information needs to be processed on a large scale and in an efficient manner exploiting all available information.Considering these requirements, neural networks provide flexibility in terms of preprocessing and provided data, as, e.g., no cloud filtering is necessary if the network has been trained on additional cloud classes.Additionally, data which is not significant for the given task will be ignored.We believe that a holistic data approach-comprising temporal, spectral, and textural information-has the potential to yield superior results in future applications.Our presented approach limits textural and spatial features by only observing a small neighborhood of 30 m around each POI to concentrate on available temporal infor- mation.In future work, a CNN encoder prepended to the LSTM network could additionally benefit the classification accuracy, as richer textural features would be extracted in a perceptive field optimally chosen by the network.

Figure 1 .
Figure 1.Sequence of observations during the growth season of the year 2016.SENTINEL 2A RGB bands (top row) illustrate the systematic and characteristic change of crops over the season.Class labels (bottom row) show the ground truth labels, as used for the training process.Coverage of the ground is considered by additional covered classes, as shown as clouds at July 2nd.The systematic temporal changes of spectral reflectances can benefit identification of crops, as exploited by our proposed multi-temporal land cover classification network.

Figure 3 .
Figure 3. Unrolled flow of data for classification and training.Class probabilities ŷt are calculated by multiple stacked LSTM layers and one softmax layer.LSTM layers have additional access to cell state vectors ct−1 and ht−1 originating from previous observations, indicated by an arrow between processing columns, which benefits extraction of temporal features.At each training step, the cross entropy loss Ht( ŷt, yt) between predicted and reference class probabilites is calculated.To minimize the loss function, gradients (dotted) are calculated by Adam optimizer and adjust the network weights in involved in LSTM and softmax layers.

Figure 4 .
Figure 4.In our experiments, we used an area of interest of 102 km × 42 km (black rectangle) in the north of Munich, Germany, as study area containing 137 k field plots.obtained data in Section 4.1 and the performed data aggregation to obtain input and label vectors in Section 4.2.Section 4.3 explains the intuitions behind the different dataset partitioning regimes to obtain training, validation, and evaluation subsets in the context of phenological temporal features and regional environmental influences.The training and evaluation process is explained in Section 4.4 and results are shown in Section 4.5.

Figure 5 .
Figure 5. Distribution of classes by fields in the AOI.Common field crops, such as corn or wheat, dominate the dataset with 28kand 22k fields, respectively, while, e.g., asparagus or peas are cultivated in less than 600 fields.Only crops with at least 400 occurances have been included in the dataset.

Figure 6 .
Figure 6.Illustration of the field dataset overlayed with 3 km × 3 km blocks dedicated for training (blue), evaluation (lightblue), and validation (orange).A circumferential margin of 200 m ensures that field plots are not located in two distinct datasets.Training and evaluation datasets are randomly reassigned at each cross-validation fold.

Figure 7 .
Figure 7. Evolution of overall validation accuracy performance during the training process of 135 networks of each CNN, RNN, and LSTM architecture with different hyperparameter settings.Means (dashed lines) and standard deviation intervals (shaded areas) indicate the general performance of each architecture, the most performand network is superimposed separately (solid line).

Figure 9 .
Figure 9. Confusion matrices reporting class-wise average accuracy values of LSTM, RNN, and CNN architectures.The orange-framed submatrix comprises covered classes.While LSTM and RNN showed similarily good performances, as these networks utilize temporal features, the CNN network performed generally worse, as temporal characteristic features are not accessible to the CNN architecture.

Figure 10 .
Figure 10.Evolution of performance of LSTM, RNN, and CNNnetworks with increasing number of observations.The more observations have been registered, the more context information was available to LSTM and RNN networks to assist the classification decision.CNN models can not utilize temporal information, thus performance remained at a nearly constant level with increasing observation length.
Common field crops, such as corn or wheat, dominate the dataset with 28k and 22k fields, respectively, while, e.g., asparagus or peas are cultivated in less than 600 fields.Only crops with at least 400 occurances have been included in the dataset.normalized as fraction of year.The network labels y have been formed by two types of classes.
1. Field classes were derived from the field dataset, namely corn, meadow, asparagus, rape, hop, summer oats, winter spelt, fallow, winter wheat, winter barley, winter rye, beans, winter triticale, summer barley, peas, potatoe, sugar beets, soybeans, and the default class other.2. Covered classes cloud, water, snow, and cloud shadow, account for high frequent coverages and are provided by the scene classification mask of SEN2COR extracted from the raster dataset.
In this work, we have shown how to employ LSTM and RNN models for land cover classification.Large-scale experiments have been conducted on a real-world dataset acquired from openaccess satellite data together with in-situ annotations provided by local authorities.These experiments have shown that LSTM and RNN networks are able to directly utilize temporal information, namely phenological characteristics of crops, for classification and achieve superior results compared to models which-by designcan not benefit from these features but solely rely only on spectral and textural characteristics.All models performed well on classes which do not incorporate characteristic temporal information, such as clouds, while crops can be reliably better classified by LSTM networks and RNNs utilizing distinct phenological events.