A 3D CNN APPROACH FOR CHANGE DETECTION IN HR SATELLITE IMAGE TIME SERIES BASED ON A PRETRAINED 2D CNN

Over recent decades, Change Detection (CD) has been intensively investigated due to the availability of High Resolution (HR) multispectral multi-temporal remote sensing images. Deep Learning (DL) based methods such as Convolutional Neural Network (CNN) have recently received increasing attention in CD problems demonstrating high potential. However, most of the CNN-based CD methods are designed for bi-temporal image analysis. Here, we propose a Three-Dimensional (3D) CNN-based CD approach that can effectively deal with HR image time series and process spatial-spectral-temporal features. The method is unsupervised and thus does not require the complex task of collecting labelled multi-temporal data. Since there are only a few pretrained 3D CNNs available that are not suitable for remote sensing CD analysis, the proposed approach starts with a pretrained 2D CNN architecture trained on remote sensing images for semantic segmentation and develops a 3D CNN architecture using a transfer learning technique to jointly deal with spatial, spectral and temporal information. A layerwise feature reduction strategy is performed to select the most informative features and a pixelwise year-based Change Vector Analysis (CVA) is employed to identify changed pixels. Experimental results on a long time series of Landsat 8 images for an area located in Saudi Arabia confirm the effectiveness of the proposed approach.


INTRODUCTION
Recently, the availability of High Resolution (HR) satellite images with detailed spatial, spectral and temporal information have increased the range of possible applications of the Change Detection (CD). CD has been regularly used to observe phenomena such as urbanization (Lu et al., 2011), disaster management (Stramondo et al., 2006), natural industrial disasters (Hulley et al., 2014) and Land Cover Changes (LCC) (Zanetti and Bruzzone, 2018), (Solano-Correa et al., 2020). In order to effectively exploit HR Satellite Image Time Series (SITS) in LCC detection, new challenges should be addressed in terms of data processing and algorithm development. In this context the main challenge refers to the complexity of the temporally dense SITS that requires computationally heavy processing algorithms. The first introduced methodologies in the CD context mostly focused on bi-temporal change detection like image differencing and Change Vector Analysis (CVA) (Malila, 1980) (Bovolo et al., 2012). Such approaches benefited of the use of specific features like Principal Component Analysis (PCA) (Celik, 2009), advanced statistical parameters estimation (Bruzzone and Pietro, 2000), Parcel-based (Bovolo and Bruzzone, 2005) and Markov random field (Kasetkasem and Varshney, 2002) paradigms for multi-level and multi-temporal spatial context information modelling. In recent years, Deep Learning (DL) has become mainstream in image understanding tasks (Krizhevsky et al., 2016) (Ren et al., 2015), including remote sensing image understanding (Zhang et al., 2016). Deep learning has also been introduced for CD and it is considered as a methodology of choice for CD over the past few years (Khan et al., 2017). Deep learning-based change detection methods that have been applied to satellite image analysis are both supervised (Mou et al., 2018) and unsupervised (Louis de Jong and Sergeevna Bosman, 2019). A supervised deep learning method for CD is chosen when the labelled multitemporal training data are available. An example of supervised change detection method is proposed in (Mou et al., 2018), where a recurrent convolutional neural network (ReCNN) architecture for extracting joint spectral-spatial-temporal features is developed. The main idea is to combine convolutional neural network (CNN) and well-established RNN for remote sensing applications. In this work, a CNN is employed to model the spectral-spatial features for a pair of multi-spectral data patches and a RNN is used to represent the temporal information in the bi-temporal satellite images. Considering the state of the art, the main issue of the supervised deep learning models in remote sensing analysis is the need of collecting and constructing ground reference data for the system-training phase which are difficult to obtain. This is even more true when we deal with long time series (more than two images). It has been shown that a deep network trained with images of a certain domain can become useful to treat images of other domains (Volpi and Tuia, 2016). As a result, some unsupervised CD methods have been designed in this context. In (Hou et al., 2017) a CNN already pretrained on a large-scale natural image data set is used in a remote sensing context. To get better results they fine-tune the CNN-based architecture to adapt it to their optical remote sensing image. They show that deep learningbased feature extraction has better generalization capability than traditional hand-crafted features. The feature maps are produced by means of convolutional layers which result from applying multiple kernels to the original image. Despite differences, CD methods emphasize the importance of using spatial context information, object-level information, and complex nonlinear features (Desclée et al, 2006) (Francisco et al, 2021. Moreover, a majority of existing CNN algorithms exploited 2D CNNs (Zhan et al., 2017) (El Amin et al., 2016. But a 2D CNN is unable to properly model the temporal features since it averages and collapses the temporal information to a scalar in the convolutional layers. These methods have limited capability in capturing temporal context information, complex visual features and most of them have focused on bi-temporal CD instead of SITS (more than two images). There is still limited work on CD in image time series with more than two images as input data. The nature of a 3D CNN with 3D kernels suits to spatio-temporal representation of the satellite images and can provide dynamic information extracted as temporal features. Recently a 3D CNN has been applied to several studies such as human action recognition (Funke et al, 2019), spatio-temporal feature learning (Li et al., 2017), (Sexton, et. al, 2013) and spatial-spectral image classification (Lifan et al., 2021). Considering the feasibility and efficiency of spatio-temporal feature representation of 3D CNN, a 3D CNN architecture is promising for SITS CD. There are rare studies that have applied 3D CNN to extract temporal information from remote sensing SITS, and in most of them temporal features are partially ignored or represented by simplistic models. In (Meshkini et al., 2021) authors developed an unsupervised deep learning-based 3D CNN method by using HR remotely sensed images in the CD context. The 3D CNN architecture was trained on a large-scale supervised video dataset for the purpose of image classification. The features were extracted and stacked from all the convolutional layers in order to generate a hyper feature map for representing the spatiotemporal information of the images. Finally, a pixel wise distance was computed to produce the change map. These 3D CNN architectures have some limitations: 1) they are usually trained targeting scene classification by using the back propagation method and the error is computed by considering the entire image, not at pixelwise level; 2) the architecture is restricted based on the fully connected layer and they can only accept a fixed size input; 3) most of the pretrained 3D CNNs are trained on RGB spectral channels, while the Near Infrared Red (NIR) channel is important for CD, especially for vegetation analysis; 4) since the architecture is not trained on remotely sensed data, the performance of the method is accurate only on the detection of very sudden abrupt changes (changes that happen in a short time with a great magnitude) and it is suitable for CD in small areas only. Thus, a shift is required in the paradigm of CNN from scene classification to pixelwise image segmentation. In (Volpi and Tuia, 2016), the authors proposed a new kind of architecture where all the learnable layers are convolutional with a series of convolutional, pooling, and activation layers followed by a series of deconvolutional and activation layers. It can accept input of any spatial dimension, produce pixelwise output for the entire image and effectively encode the spatial context information of each pixel. By exploiting the CNN-FPS network developed in (Volpi and Tuia, 2016) that is available for downloading, we design our proposed pretrained deep learning-based CD architecture that: 1) is automatic and fully unsupervised; 2) considers a multi-layer 2D CNN architecture designed for semantic segmentation which is trained on remote sensing images and can accept NIR spectral band; 3) is able to analyse the spatio-temporal features, since a transfer learning technique is developed to adapt 2D CNN to 3D by weights transformation; 4) performs a pixelwise time series feature extraction that implicitly models the spatio-temporal context information of each pixel; 5) locates the position and the time of the changes by means of CVA. The method is performed on each two adjacent years covering SITS, effectively extracts change information by representing spatio-temporal features and detects changes in space and time. Some qualitative and quantitative analysis is provided on a region in Saudi Arabia for the period 2013 to 2019 using ten images per year. The CD maps have been compared with the 3D CNN CD methodology developed in (Meshkini et al., 2021). The rest of this paper is organized as follows. A structure for the 3D CNN CD for SITS with details is presented in section 2. Section 3 and 4 provide some information on the study area and the experimental analysis together with a discussion on the results. In section 5, the conclusion together with the discussion on scope of future research are presented. Figure 1 shows the block scheme of the proposed 3D CNN-based approach to CD in SITS. Let = { 1 , … , , … , } be a pre-processed time series covering years including images acquired over the same geographical area. Let = { 1 , … , , … , } and +1 = { 1 , … , , … , } be two time series with non-uniform time sampling for the th year and ( + 1)th year, respectively. Let correspond to the total number of images for each year ( > 2). The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France Let = { 1 , 2 , … , } be the set of bands that compose images ∈ and ∈ +1 ( is the total number of bands). Given an image or , the size of the input image is × × where and correspond to the number of rows and columns of and , respectively. The proposed method aims at detecting changes between the preprocessed and +1 thus involving 2N (>>2) images. It consists of four main steps: i) 2D to 3D transition where the weights and convolutional layers of the base 2D CNN are transformed to three dimensions to generate a new 3D CNN architecture; ii) 3D CNN feature extraction where features are extracted from SITS to obtain a hyper change vector; iii) feature reduction based on a variance measure; iv) land cover change detection by a pixelwise distance calculation among and +1 . The proposed 3D CNN architecture automatically processes 1 , 2 , 3 , 4 bands at a time and the output is a change map representing the LC changes and the year of change in .

2D to 3D transition
The proposed approach starts from a 2D CNN trained on remote sensing images to develop a new 3D CNN architecture for feature extraction in SITS. Figure 2 represents the overview of the proposed strategy to exploit 2D weights for training a 3D CNN architecture. First, the weights of a pretrained 2D CNN are transformed to 3D to be used as the weights of the 3D CNN. Second, the 2D CNN is adapted to the 3D CNN by using 3D convolutional layers instead of 2D convolutional layers. Finally, the 3D CNN extracts the 3D features that will be passed to the feature reduction module. In order to transform the convolutional layers and weights of 2D CNN to three dimensions, two techniques are considered as described in (Merino et al., 2021). Since the weights can be represented as 2D matrices, a 2D matrix can be transformed into a 3D tensor to generate the 3D weights. Let ( , ) = (R, G, B, NIR) be the matrix of weights in the position and of the 2D CNN, so the transformed matrix of weights can be represented as ( , , ) = ( , , , ) where , , ∈ ℕ and , , , ∈ ℝ. Two strategies Extrusion and Rotation have been developed as the transformation functions from 2D to 3D. Extrusion transforms the 2D matrix to 3D tensor by copying the R, G, B, NIR values along one axis. Given a matrix W of size × , the transformed tensor has in size of × × where is the total number of images per year and = . The Extrusion can be done along the three main axes and is defined as: Rotation transforms 2D to 3D weights by performing a 0 to 90 degrees rotation with respect to the fixed axis . Equation 2 represent Rotation transformation from 2D matrix to 3D tensor: ( , , ) = ( , min (⌊√ 2 + 2 ⌋ , )) (2)

3D CNN feature extraction
Our challenge in the development of an unsupervised method for CD in time series is obtaining a suitable pretrained CNN architecture that can properly model spatio-temporal information of SITS. Several pretrained architectures exist in literature that accept only RGB input (Nogueira et al., 2017), thus losing a large amount of information embedded in the NIR band. Among the pretrained CNN architectures, we use the best of 2D CNN architectures (CNN-FPS network) that is developed by (Volpi and Tuia, 2016) and is trained on remote sensing images for semantic segmentation however other architectures can be considered. The CNN-FPS network processes a five channel input composed of Blue, Green, Red, NIR and digital surface model (DSM). We exclude the DSM input since it is seldom available in the context of change detection and it becomes even more rare as temporal series becomes longer and denser.
Removing the kind of input from the pre-trained network does not significantly impact the feature extraction performance as it is presented in (Saha et al., 2019) . The 2D convolutional network is transformed to a 3D convolutional network by considering the two transformation strategies explained in section 2.1. Extrusion and Rotation transformations have been applied separately to generate a 3D tensor from the 2D weight matrices. Moreover, the 2D convolutional layers have been transformed to 3D convolutional layers. The resulting 3D CNN architecture has a more complex network structure with the convolution and pooling operations that perform spatio-temporally to preserve the temporal information and extract relevant features from multitemporal data. A convolutional layer in the 3D CNN structure extracts features from the local neighborhood of feature maps in the previous layer − 1. The output at position ( , , ) denoted as , is given by equation (3) (for simplification, we omit the layer notation l): Where ℎ is the 3D tensor for the ( , , )th value of the kernel connected to the ℎth feature map in the previous layer, and and are the height and width of the kernel, respectively. represents the input activation at location ( , , ), is the temporal indicator with length (the number of images per year) and ℎ refers to the set of SITS feature maps of the previous layer. Choosing the right layers to extract features is also important. There is a significant difference in characteristics of features depending on the layer from which they are extracted. The initial layers of the CNN capture low-level visual concepts such as edges, curves, and color patches. As we go deeper, filters capture more complex concepts by combining lower level features of the previous layers (Zeiler and Fergus, 2014). Since both high and low level features are useful to analyse HR images, a combination of them should be considered in the feature extraction step to catch information (Hariharan et al., 2015). To extract features from the 3D CNN architecture, we choose more convolutional layers than deconvolutional ones to form the hypervector since the convolutional layers learn the semantics of the image at a degraded resolution and the deconvolutional layers mainly learn to reconstruct the spatial arrangements. The first convolutional layer is excluded as it learns very primitive features that are significantly noisy. The convolutional and deconvolutional layers that have been used in this study for the 3D CNN feature extraction are shown in Table I. Considering equation (3) and Table 1, a feature map for each layer is obtained as = { 1 , … , , … , } for each SITS after applying 3D convolutions in the spatio-temporal domain of images and accumulating the outputs of the spectral bands.

Feature reduction
The 3D CNN model that is used in the feature extraction provides a large number of features for each layer (up to 512). Not all of them carry relevant information for CD. Thus, a feature selection technique is developed in order to maintain only informative and reliable features. The applied feature selection strategy is the one proposed in (Bovolo and Bruzzone, 2017), which is based on the variance measurement. The feature selection is performed for each layer of the 3D CNN architecture and the hyper feature vector for every ∈ is obtained by concatenating the selected features from each layer , = 1, … , ( is total number of the layers). For a given layer , a layerwise difference vector is computed by subtracting from +1 . Then, a subset ′ of is considered that contains features more sensitive to change information. The variance measurement is used as an index of sensitivity to change information. The assumption is that features containing potentially relevant change information have higher variance than those less affected by changes. Inspired by (Bovolo and Bruzzone, 2017)

Land cover change detection
The hyper feature vector with dimension created in the feature selection step represents effectively the behaviour of changed and unchanged pixels between and +1 . In order to provide a comprehensive comparison between changed and unchanged pixels the magnitude of is calculated by considering the CVA technique proposed in (Bovolo and Bruzzone, 2007) that is extended in the context of time series CD for means of the proposed 3D CNN. The magnitude of the hyper feature for each pixel ( , ) is given by: Where , is the feature value on the ℎ dimension of the positions and of the hyper feature space. Although by calculating the magnitude a strong compression of the hyper feature map is performed, the main information about the changes are preserved.
( , ) is assumed to have relatively large values for the changed pixels and small values for the unchanged ones. Therefore, a thresholding strategy is implemented in order to separate them. In the literature, several thresholding techniques have been proposed such as local adaptive threshold (Kieri, 2012), (Wellner, 1993) and Otsu thresholding (Otsu, 1979). Local adaptive thresholding selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood and changes dynamically over the image. Otsu thresholding processes the image histogram and segments the objects by minimization of the variance on each of the classes. In this research, we examine both thresholding strategies to check the performance of each methods for our proposed CD approach.

STUDY AREA AND EXPERIMENTS
To evaluate the effectiveness of the proposed 3D CNN CD approach, a study area is selected in North-West of Saudi Arabia where several central pivot crop fields have been built. Desert to central pivot cropland is the most relevant change class and the pivot crops can be easily identified by checking the images in Figure 3. The dataset is downloaded directly from USGS Landsat 8 Surface Reflectance Tier1 database and with cloud coverage higher than 70% are ignored. A cloud/cloud shadow mask is imposed to filter cloudy pixels in each image and the data are atmospherically corrected. For each year in the period January 1 st , 2013 and December 31 st , 2019, ten images of 600 × 600 pixels (18km × 18km) (see Figure 3) are selected being distributed over the seasons to guarantee variability in temporal behaviours. To proceed with the experiments, two 3D CNN architectures have been developed based on the two transformation strategies, Extrusion and Rotation. The developed 3D CNN CD algorithm has been applied six times considering each couple of adjacent years in the acquisition period (2013 to 2019) with four spectral bands (Blue, Green, Red and NIR). The proposed method extracts features from layers = {2, 5, 8, 10} and generates the feature vector ( ) for upcoming feature reduction. To select a subset of features being sensitive to change about 10 percent of the features are selected after arranging the feature variances in a descending order for each split. The dimension of after feature subset selection is d = 106 and features have been selected with a split size of 200 × 200 pixels. Larger split size reduces the sensitivity of variance to change class thus increasing possible missed detection. Smaller split size allows to capture features that increase false alarm. Experiments were conducted to see the effect of using different thresholding strategies for discrimination of changed and unchanged pixels, the results are reported for each of the local adaptive thresholding and Otsu segmentation method separately (see Figure 4). The algorithm automatically takes as an input all the images for each couple of years, detects the changes and shows the changes for the entire processing period.

DISCUSSION
As it is shown in Figure 4, changes for each year have been visualized by different colors in order to make a better comparison between the different transformation and thresholding strategies. By analysing Figure 4 (Meshkini et al., 2021) and prior knowledge on the evolution of the geographic area. For this area, 13 crop fields have been constructed during the processing period (2013 to 2019) resulting in 4583 changed pixels. A quantitative and qualitative performance analysis is conducted in terms of false and missed alarms in space and time. Figure 6 locates false/missed alarms showing that they mostly exist along the borders of crop fields. A detailed quantitative performance analysis is provided in Table 2. It clearly shows that the portion of false and missed alarms is considerably low in terms of pixels (less than 5% and 2%, respectively). In addition, an object-based analysis is conducted considering the 13 crop fields being changed in reference map. Three fields have been detected correctly in space, but they represent false alarms in time since the year of change is imprecise (i.e., the change is detected later or earlier than expected) ( Figure 5 (a), red squares). One crop filed is a pure false alarm both in space and time ( Figure 5 (a), red square bottom left corner). In order to further prove the robustness of the proposed approach, the performance is compared with the 3D CD methodology that is presented in (Meshkini et al., 2021). In (Meshkini et al., 2021), a pretrained 3D CNN CD technique trained on a largescale video dataset is used that accepts a specific size of images with only three spectral bands (for more details refer to (Meshkini et al., 2021)). As it is clear, the proposed approach has outperformed the method in the literature in detecting the changes and defining the edges of the crop fields. Furthermore, by looking at the sample images provided in 2017 and 2018, there are two crop fields (one in the top left and another one almost right centre) that have started to be built in 2017 and continued in 2018. The reference method failed to recognize the crop field located in the right centre of area and it was unable to specify the year of the change correctly for the crop field in the top left (purple squares in Figure 5 (a)).  Table 2. Validation result of the proposed CD method.

CONCLUSION
An unsupervised CD approach to the analysis of HR SITS based on a 3D CNN has been proposed. The approach uses a pretrained 2D CNN architecture for semantic segmentation to design a 3D CNN model that can deal with multi-temporal information. A 2D to 3D transformation module is implemented to transform 2D weights to 3D weights by using the Extrusion and Rotation strategies. The 2D convolutional layers have been replaced by 3D convolutional layers, the relevant features are extracted from some of the convolutional layers and are reduced by considering a variance measurement as an index of sensitivity to change information. The CD map is produced by extending the CVA strategy to SITS analysis that separates changed from unchanged pixel calculating the magnitude of the hyper feature map for each pixel. The reliability of the proposed approach is demonstrated by comparing with a 3D CNN-based CD from the literature. Quantitative analysis shows a decrease of false/missed alarms, a better capability to recognize the year of change and locate the object borders. False alarms Missed alarms Figure 6. False/missed alarms of CD maps derived from the proposed method.
As a future work, we aim to improve the performance of the proposed method by employing a fine-tuning technique for the 3D CNN feature extraction, adding more comparisons using features extracted via 2D CNNs and performing the proposed method on the other areas with different change classes.