TOWARDS LIFELONG CROP RECOGNITION USING FULLY CONVOLUTIONAL RECURRENT NETWORKS AND SAR IMAGE SEQUENCES

Recent works have studied crop recognition in regions with highly complex spatio-temporal dynamics typical of a tropical climate. However, most proposals have only been evaluated in a single agricultural year, and their capability to generalize to dates outside the temporal sequence has not been properly addressed thus far. This work assesses the generalization capabilities of a recent convolutional recurrent architecture by testing it on a temporal sequence two years ahead of the sequence with which it was trained. Furthermore, an N-to-1 variant of this network is proposed, which is able to produce classification outcomes for every month in the agricultural year. It is compared with two baselines designed following a more traditional approach, in which a separate network is trained for each month of the year. The approaches are evaluated on two public datasets from a tropical region. The first dataset covers the period from June 2017 to May 2018, while the second goes from October 2019 to September 2020. Results show a decrease of up to 24.6% in per-date average F1 score when training the network with data of an agricultural year different from the one it is tested on, which indicates a domain shift that demands further research. Additionally, the proposed approach presented only a slight decrease in performance compared to its baseline when trained on the same dataset, with a 2.7% drop in average F1 score. This performance drop is a small cost in exchange for its operational advantages, such as reduced training time and a more straightforward pipeline.


INTRODUCTION
Crop monitoring involves several factors that may vary from one region to another. In tropical areas, the favorable climate allows more flexibility in seeding and harvest times, which makes modeling crop dynamics more complex (Sanches et al., 2018b). Crop rotation practices add to this spatio-temporal complexity. In Brazil, for example, the climate allows up to three harvests per year, and the planting and harvesting dates for the same crop type may vary greatly. Thus, crops in tropical regions need to be mapped several times throughout the year, unlike in temperate areas, where a single classification result may be sufficient for the entire agricultural year.
Several works have addressed the crop recognition problem using classical machine learning techniques. In (Tardy et al., 2017), the authors used a Random Forest algorithm for multitemporal crop mapping and evaluated fusion techniques to leverage information from multiple past agricultural years to classify unlabeled data from the current period. The authors in (Ajadi et al., 2021) proposed a large-scale crop mapping scheme leveraging auxiliary data from the United States, using an XGBoost model. Similarly, the authors in (Santos et al., 2021) used self-organizing maps (SOMs) to classify crop types using MODIS time series. However, their training features only considered the temporal dimension, leaving the spatial context unexplored.
In recent years, deep learning techniques have been successfully applied to crop recognition applications (Audebert et al., 2017). Such methods can be roughly grouped into Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). A recent work compared a CNN and a more spatially efficient Fully Convolutional Network (FCN) for crop recognition in a tropical climate in the Municipality of Campo Verde, Brazil (La Rosa et al., 2018). Despite the reported high accuracy values, these approaches require specific training for each date in which a result is desired, which represents an operational disadvantage.
The second group of techniques involves recurrent neural networks (RNNs). A type of RNN called ConvLSTM was proposed by (Shi et al., 2015), which replaces the LSTM internal operations with convolutional layers, allowing it to consider the spatio-temporal context in the input sequence. ConvLSTM networks have recently been used for crop recognition applications (Rußwurm and Körner, 2018, Teimouri et al., 2019, Martinez et al., 2021). (Rußwurm and Körner, 2018) reports an evaluation of a ConvLSTM network for crop recognition in a temperate region. (Martinez et al., 2021) proposed a network that combines a ConvLSTM with a Unet-like encoder-decoder network (UnetConvLSTM). The authors tested the proposed network on public datasets from a tropical climate. This network was trained as a unique, end-to-end network and can produce classification results for every date represented in the temporal sequence.
To our knowledge, all previous works evaluate the network performance in the same temporal sequence used for training. Such an analysis is insufficient to assess how well the networks generalize on dates outside the period covered by the training's temporal sequence.
The first contribution of this work is an evaluation of a UnetConvLSTM network at dates not used for training. A second contribution is a novel N-to-1 variant of the UnetConvLSTM network from (Martinez et al., 2021), which is trained in a sliding window manner to produce outcomes for any date along the temporal sequence using a single, end-to-end network called VUnetConvLSTM.
This work also evaluates the performance of the proposed network at dates outside the temporal sequence used for training.
We used two open-access datasets for this purpose. Both refer to the municipality of Luis Eduardo Magalhães, Bahia state, Brazil. The datasets cover two temporal sequences separated by about one year from each other. The first dataset, called hereafter LEM 17/18, comprises SAR images acquired between 2017 and 2018. We used this dataset for training. The second dataset, called henceforth LEM 19/20, covers a temporal sequence between 2019 and 2020, and it was used for testing.
The remainder of this work is organized in the following sections. Section 2 briefly explains some fundamental concepts required to understand the evaluated networks, such as LSTM, ConvLSTM and UnetConvLSTM. Section 3 presents the proposed VUnetConvLSTM architecture. Section 4 details the datasets used in this work, the baseline architectures, and the experimental protocol. In Section 5, the results are presented and discussed. Finally, Section 7 describes the final conclusions and discusses possible future directions.

LSTM and ConvLSTM
Recurrent Neural Networks are a class of neural networks designed to handle data that exhibits a time or sequence dependency. The RNN's output at a specific time step t depends on the current observation x_t and on the information stored in the network's hidden state h_t, which is updated at each time step. A modern type of RNN called Long Short-Term Memory (LSTM) introduces trainable gates in its architecture to control the information flow of the network's memory c_t (Hochreiter and Schmidhuber, 1997). In particular, the gates are shallow neural networks with a sigmoid output activation, parameterized by W_f, W_i, and W_o, which respectively regulate the information cleared from, added to, and accessed in c_t (see Figure 1). This architecture allows LSTMs to model both long and short-term time dependencies. The major drawback of LSTMs for image sequence processing is that their internal operations are fully connected layers, which means that their input, hidden states, and output correspond to sequences of vectors. To overcome this issue, (Shi et al., 2015) proposed the ConvLSTM, which replaces all the LSTM internal fully connected layers with convolutional layers. As a result, its input, hidden states, and output correspond to sequences of tensors with spatial dimensions.
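For reference, the gate mechanism described above is commonly written as shown below. This is the standard LSTM formulation and not a transcription of the implementation evaluated in this work: the candidate-memory weights W_c, the bias terms b, and the concatenation [h_{t-1}, x_t] are notational assumptions that do not appear in the text. In a ConvLSTM, the matrix products become convolutions, and x_t, h_t and c_t become tensors with spatial dimensions; the formulation in (Shi et al., 2015) additionally includes peephole terms on c_{t-1}, omitted here for brevity.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```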

UnetConvLSTM
This architecture was introduced in (Martinez et al., 2021). It combines the ConvLSTM with the Unet network (Ronneberger et al., 2015), which considers the spatial context at multiple spatial scales. The block diagram for this network is presented in Figure 2. The input to this network is a sequence of images. First, an encoder stage is applied to each of the images in the sequence. This encoder consists of successive convolutional layers and downsampling operations, which extract more coarse feature representations. The resulting sequence of feature maps produced by the encoder is then passed to a unidirectional ConvLSTM network, which effectively considers the spatio-temporal context. The ConvLSTM is configured as N-to-1, and thus only the last element in the sequence produced at the RNN output is considered. This feature map is then passed to a decoder stage. In the decoder stage, successive convolutional layers and upsampling operations are applied to recover the original spatial resolution. Skip connections are used to preserve fine details. Afterward, a convolutional layer with a 1x1 filter and a softmax activation function produces the final class probabilities. At inference, the patch outcomes in the test area are stitched together in a mosaic to form the final result.
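The patch-wise inference and mosaicking step mentioned above can be illustrated with a minimal sketch. It assumes non-overlapping square patches and image dimensions that are multiples of the patch size; the text does not state the stride or overlap actually used, and the function names `split_into_patches` and `stitch_patches` are hypothetical.

```python
def split_into_patches(image, size):
    """Split a 2-D list (H x W) into non-overlapping size x size patches.

    Returns a dict mapping the (row, col) of each patch's top-left corner
    to the patch. Assumes H and W are multiples of `size`.
    """
    height, width = len(image), len(image[0])
    patches = {}
    for r in range(0, height, size):
        for c in range(0, width, size):
            patches[(r, c)] = [row[c:c + size] for row in image[r:r + size]]
    return patches


def stitch_patches(patches, height, width):
    """Reassemble patch-level outcomes into a full H x W mosaic."""
    mosaic = [[None] * width for _ in range(height)]
    for (r, c), patch in patches.items():
        for dr, patch_row in enumerate(patch):
            for dc, value in enumerate(patch_row):
                mosaic[r + dr][c + dc] = value
    return mosaic


# Toy 64 x 96 label map standing in for per-pixel class predictions.
label_map = [[(r // 32) * 3 + (c // 32) for c in range(96)] for r in range(64)]
patches = split_into_patches(label_map, size=32)
mosaic = stitch_patches(patches, height=64, width=96)
```

In a real pipeline, each patch would be replaced by the network's prediction for that patch before stitching.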

PROPOSED VARIABLE SEQUENCE NETWORK
The majority of works on multi-date crop recognition for tropical regions are designed to produce classification outcomes for a single date in the dataset (Castro et al., 2017, La Rosa et al., 2018, Martinez et al., 2019). In those works, separate network architectures need to be trained for each date to be predicted. However, this results in operational disadvantages such as longer training times and a more complex pipeline.
The proposed variable sequence network, called VUnetConvLSTM, attempts to solve this problem by training a unique, end-to-end architecture capable of producing classification outcomes for every month of the year. The UnetConvLSTM network is configured with an input sequence of length 12 with one image per month, corresponding to the year leading up to the desired output month. Therefore, the network's output corresponds to the class probabilities for the last image in the sequence. During training, each mini-batch uses a different target month drawn from the available labeled dates. Since our work focuses on obtaining the classification outcomes for the last date in the sequence, we did not use the bidirectional variant, because our model does not employ future information.
At test time, the network is capable of producing outcomes for every month of the year. Thus, this architecture addresses the issues presented in the aforementioned previous works, associated with the need to train separate networks for each month of the year. Nonetheless, this approach may result in a performance decrease, given that the network needs to learn the specific classification patterns for the entire year.
For each image in the input sequence, the day of the year was added as metadata for the network to better understand the sequence's temporal structure. First, the sine and cosine operations were computed on the day of the year to preserve its cyclical nature. The resulting values are of shape T × 2, where T is the sequence length. Spatial dimensions were added to this representation, resulting in a sequence of shape T × 1 × 1 × 2.
The result was spatially upsampled, concatenated with the network encoder's outcomes, and then used as an input to the ConvLSTM block. The modified UnetConvLSTM is presented in Figure 3.
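The day-of-year encoding described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the period of 365 days, the example acquisition days, the 4 x 4 grid size, and the use of plain repetition as the spatial upsampling are all assumptions, and the function names are hypothetical.

```python
import math


def encode_day_of_year(days, period=365.0):
    """Map each day of the year to a (sin, cos) pair, giving a T x 2 list."""
    return [
        (math.sin(2.0 * math.pi * d / period), math.cos(2.0 * math.pi * d / period))
        for d in days
    ]


def broadcast_to_grid(encoding, height, width):
    """Repeat each (sin, cos) pair over a height x width grid: T x H x W x 2."""
    return [
        [[list(pair) for _ in range(width)] for _ in range(height)]
        for pair in encoding
    ]


# Hypothetical acquisition days, one per month (T = 12).
days = [15, 46, 74, 105, 135, 166, 196, 227, 258, 288, 319, 349]
encoding = encode_day_of_year(days)
grid = broadcast_to_grid(encoding, height=4, width=4)
```

The resulting T x H x W x 2 tensor can be concatenated channel-wise with the encoder feature maps of matching spatial size. The sine/cosine pair keeps late December and early January close to each other, which a raw day index would not.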

Study Area
The proposed approaches were evaluated in two publicly available multi-date crop recognition datasets with complex spatiotemporal dynamics due to their tropical climate. Both datasets are located in Luis Eduardo Magalhães municipality, Bahia state, Brazil (See Figure 4). The first dataset, called LEM 17/18 (Sanches et al., 2018a), comprises a monthly sequence of 12 pre-processed Sentinel-1 images and their ground truth labels, corresponding to the period between June 2017 and May 2018. In this work, the dataset was extended with additional monthly images, for a total of 21 monthly images from September 2016 to May 2018 (See Table 1). Reference information is not available for these additional images.
The second dataset, called LEM 19/20, consists of 12 pre-processed Sentinel-1 images from October 2019 to September 2020, together with their corresponding ground truth (Oldoni et al., 2020). As in the previous case, this dataset was extended by using additional monthly images, without references, from November 2018 until the beginning of the dataset (See Table 2). Although it covers a larger area than the LEM 17/18 dataset, we cropped its input and ground truth images to match the extent of LEM 17/18. The class distribution for each month in both cases is presented in Figure 5.
Experiments were carried out considering the most representative classes. We grouped classes with an overall percentage of samples lower than 5.3% into a single class called Other classes. In all cases, the Sentinel-1 images were resampled to a 10 m resolution, and the VV and VH bands were used. No specific filtering was applied to reduce speckle noise.
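The class grouping described above can be illustrated with a short sketch. The crop names and counts below are purely illustrative and do not come from the datasets, and the function name `group_rare_classes` is hypothetical. The sketch also assumes that the 5.3% threshold is applied to the share of each class over all labeled samples, which the text does not detail.

```python
from collections import Counter


def group_rare_classes(labels, min_share=0.053, other_name="Other classes"):
    """Relabel classes whose overall share of samples is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    rare = {cls for cls, n in counts.items() if n / total < min_share}
    return [other_name if cls in rare else cls for cls in labels]


# Hypothetical label list; class names and counts are illustrative only.
labels = (
    ["soybean"] * 50 + ["maize"] * 30 + ["cotton"] * 16
    + ["sorghum"] * 3 + ["beans"] * 1
)
grouped = group_rare_classes(labels)
```

Here the two classes with 3% and 1% of the samples are merged, while the remaining classes are kept unchanged.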

Baseline
We compared the proposed approach with two baseline methodologies. In the first one, called Baseline 1, we trained separate UnetConvLSTM networks specifically for each month of the year. The input sequence length is the same in each of these networks, corresponding to an input sequence of one year in the past with respect to the desired output month. At inference, outcomes for each month are produced using each of the separately trained networks. The networks were trained and tested on the same LEM 19/20 dataset. Because this approach does not need to generalize to unseen dates at test time, it is expected to outperform the remaining approaches, and we used it as an upper bound for comparison purposes.
As in the previous approach, Baseline 2 consists of multiple identical UnetConvLSTM networks, each one trained for a specific month of the year. Input continues to be a sequence of images corresponding to one year in the past with respect to the corresponding output month. In this case, the networks are trained on the LEM 17/18 dataset, while at inference the networks are tested on the LEM 19/20 dataset. Thus, this model evaluates the baseline's generalization capabilities in the temporal domain. One should expect Baseline 2 to perform better than our proposed network because it trains a specialized classifier for each specific date. Notice that the advantage of the proposed model compared to this baseline is operational because it requires a unique network to produce classification outcomes for all dates.

Experimental protocol
For the UnetConvLSTM network, we used average pooling as the downsampling operator and transposed convolution as the upsampling operator. Table 3 presents the number of filters we used at every stage.
In each dataset, agricultural crop parcels were represented as polygons. We used 100% of the LEM 17/18 dataset polygons for training Baseline 2 and the proposed approach. In LEM 19/20, the polygons intersecting with the train area of LEM 17/18 were used for training, and the remaining polygons were used for testing. The train and test areas for both datasets are presented in Figure 6 and Figure 7.
Patches of spatial dimensions 32-by-32 were used, with an input sequence length of 12. We used zero padding for the input images that were not available and were required during the training of the LEM 17/18 early dates. Such unavailable images correspond to the months of July and August 2016.
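The following sketch shows how a 12-month input window and its zero-padded slots could be derived for a given target month. It only uses the acquisition period stated for the extended LEM 17/18 sequence (September 2016 to May 2018); the function names `month_window` and `build_sequence` are hypothetical and the code is not taken from the authors' repository.

```python
def month_window(target, length=12):
    """Return the `length` (year, month) pairs ending at `target`, oldest first."""
    year, month = target
    window = []
    for back in range(length - 1, -1, -1):
        index = year * 12 + (month - 1) - back
        window.append((index // 12, index % 12 + 1))
    return window


def build_sequence(target, available, length=12):
    """Pair each month in the window with a flag: True if an image exists,
    False if the slot must be filled with a zero image."""
    return [(ym, ym in available) for ym in month_window(target, length)]


# Months covered by the extended LEM 17/18 sequence: September 2016 to May 2018.
available = {(2016 + (8 + k) // 12, (8 + k) % 12 + 1) for k in range(21)}

# The earliest labeled date, June 2017, needs images back to July 2016.
sequence = build_sequence((2017, 6), available)
padded = [ym for ym, present in sequence if not present]
```

For the June 2017 target, `padded` contains July and August 2016, in agreement with the unavailable months reported above. Sliding the target month forward yields the windows used for the remaining labeled dates.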
We used the focal loss cost function and the Adam optimizer with a learning rate of 0.001. Mini-batches of 16 samples were used during training. The networks were trained on an NVIDIA GTX 1080Ti GPU. Code is available at https://github.com/DiMorten/FCN_ConvLSTM_Crop_Recognition_Generalized

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021 XXIV ISPRS Congress (2021 edition)
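As a reminder of how the focal loss used for training behaves, the sketch below evaluates it for a single sample. It is a minimal illustration, not the authors' implementation: the focusing parameter gamma = 2 is an assumed value (the text does not report it), no class-balancing weight is applied, and the function name `focal_loss` is hypothetical.

```python
import math


def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """Focal loss for one sample.

    probs  -- softmax class probabilities (list of floats summing to 1)
    target -- index of the true class
    gamma  -- focusing parameter (assumed value; gamma = 0 gives cross-entropy)
    """
    p_t = min(max(probs[target], eps), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([0.9, 0.05, 0.05], target=0)  # confident and correct
hard = focal_loss([0.2, 0.5, 0.3], target=0)    # true class gets low probability
cross_entropy = focal_loss([0.2, 0.5, 0.3], target=0, gamma=0.0)
```

The factor (1 - p_t)^gamma down-weights well-classified samples, so the loss concentrates on hard examples, which is useful when the class distribution is imbalanced.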

Table 3. Parameter configuration used in all experiments (columns: Layer, Processing). T is the temporal sequence length, H and W indicate height and width, D is the input dimensionality, and C is the class number (H=32, W=32, D=2, C=6). In VUnetConvLSTM, the day of the year was used as an additional input, which was concatenated to the input of the ConvLSTM layer.

Figure 8 and Figure 9 summarize the performance of the assessed methods in terms of per-date Overall Accuracy (OA) and F1 score, respectively. The corresponding rates were obtained by evaluating the models on the LEM 19/20 dataset on the sequence from October 2019 to September 2020. As expected, Baseline 1 produced the highest OA metrics with values up to 93.3%. Even so, it produced relatively low OA values for some months such as March, with an OA value of 58.2%.
The other two methods, Baseline 2 and VUnetConvLSTM, performed similarly for most dates, but produced lower OA values than Baseline 1. However, the drop in performance regarding Baseline 1 is not constant over the dates for both methods.
The differences are more remarkable in January, February, and between May and July, where the decrease can be up to 39.9%. These results are understandable considering the time-shift of two years between the training and test sets, which is also accentuated by the high dynamics that characterize the tropical regions. In this sense, domain adaptation techniques would help in reducing the gap between both datasets.
Recall that Baseline 2 involves specific classifiers for each date, while VUnetConvLSTM consists of a single classifier for all dates. Note that the performance gap between these two approaches was relatively small, with an average reduction of 2.7% in the F1 score.
On some dates, the proposed VUnetConvLSTM outperformed Baseline 2 in terms of F1 score and OA, e.g., in May.
Results regarding the F1 score metric are consistent with the ones reported in terms of OA. For most of the dates, the highest F1 scores were obtained with Baseline 1, followed by Baseline 2 and VUnetConvLSTM approaches, which performed similarly on most dates. However, it can be noticed that the decrease in performance for both methods, regarding Baseline 1, is significant for the entire sequence. An exception occurred in April, in which Baseline 2 outperformed Baseline 1 by a margin of 7.5%. For Baseline 2, May was the most affected month, with a drop of 31% in F1 score, while for VUnetConvLSTM, the major reduction occurred in July and August, with a decrease of 24.1% and 24.6% respectively.

Figure 9. Per-date average F1 score for the proposed approaches (x-axis: date classified, Oct-19 to Sep-20; y-axis: F1 score; series: Baseline 1, Baseline 2, VUnetConvLSTM).

Table 4 presents the results in terms of F1 score and OA averaged over the entire temporal sequence. Notice that the metrics are consistent with the results reported in Figure 8 and Figure 9, where the highest values of F1 and OA were obtained by Baseline 1.
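For clarity, the per-date metrics and their average over the sequence can be computed as in the sketch below. It assumes that the average F1 score is the unweighted (macro) mean of the per-class F1 scores, which the text does not state explicitly; the toy labels and the function names are hypothetical.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the reference."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy per-date reference and prediction lists (class indices are illustrative).
per_date = {
    "Oct-19": ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),
    "Nov-19": ([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]),
}
oa = {d: overall_accuracy(t, p) for d, (t, p) in per_date.items()}
f1 = {d: average_f1(t, p) for d, (t, p) in per_date.items()}
mean_oa = sum(oa.values()) / len(oa)
mean_f1 = sum(f1.values()) / len(f1)
```

The per-date dictionaries correspond to the kind of curves shown in Figure 8 and Figure 9, while `mean_oa` and `mean_f1` correspond to sequence-level averages such as those in Table 4.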

CONCLUSIONS
This work evaluated the capabilities of a convolutional recurrent crop mapping architecture adapted to tropical regions, to generalize to future agricultural years in dates unseen during training. Results suggest that merely training a network in a specific agricultural year might not be enough for the architecture to correctly generalize to future unseen dates. The Baseline 2 and the proposed VUnetConvLSTM model, which were trained in LEM 17/18 and tested two years ahead in time using the LEM 19/20 dataset, presented a significant performance drop compared to Baseline 1, which was trained and tested on the same temporal sequence. This indicated a domain gap between the datasets, which was not addressed in this work.
Furthermore, this work compared the generalization capabilities of a network trained to produce an outcome on a specific date (Baseline 2) and the proposed VUnetConvLSTM, trained to produce outcomes for all the months of the agricultural year in an end-to-end fashion. Results showed that the proposed VUnetConvLSTM presented only a slight performance decrease compared to Baseline 2, which might be a small cost in exchange for the operational advantages brought by the end-to-end approach.
Future work will focus on evaluating domain adaptation techniques such as ColorMapGAN (Tasar et al., 2020) to address the domain gap between the assessed datasets. Data fusion between optical and SAR images will also be considered for improving the classification metrics. Likewise, the inclusion of other deep learning networks such as Unet3D and Transformers will be evaluated.