SPECTRAL-SPATIAL MULTISCALE RESIDUAL NETWORK FOR HYPERSPECTRAL IMAGE CLASSIFICATION

In recent years, deep neural networks (DNNs) have been commonly adopted for hyperspectral image (HSI) classification. As the most representative supervised DNN model, convolutional neural networks (CNNs) have outperformed most algorithms. However, the main problem of CNN-based methods lies in the over-smoothing phenomenon. Meanwhile, mainstream methods usually require a large number of samples and a large amount of computation. A multi-task learning spectral-spatial multiscale residual network (SSMRN) is proposed to learn features of objects effectively. In the implementation of the SSMRN, a multiscale residual convolutional neural network (MRCNN) is proposed as the spatial feature extractor and a band grouping-based bi-directional gated recurrent unit (Bi-GRU) is utilized as the spectral feature extractor. To evaluate the effectiveness of the SSMRN, extensive experiments are conducted on public benchmark data sets. The proposed method retains the detailed boundaries of different objects better and yields competitive performance compared with two state-of-the-art methods, especially when the training samples are inadequate.


INTRODUCTION
With the rapid development of remote sensing imaging spectroscopy technology, hyperspectral images (HSIs) have become increasingly important in Earth observation due to their rich spectral and spatial information. HSI classification is the task of assigning each pixel a proper land-cover label (Sun et al., 2019), which is challenging because of the large dimensionality, spectral heterogeneity, and complex spatial distribution of the objects. To alleviate these problems, traditional HSI classification methods follow a two-step approach: 1) feature selection and extraction (Lefei Zhang et al., 2016), such as subspace projection (Harsanyi & Chang, 1994), random feature selection (Waske et al., 2010), and principal component analysis (PCA) (Rodarmel & Shan, 2002); 2) classifier training, such as the k-nearest-neighbour (KNN) classifier (Ma et al., 2010), the Gaussian mixture model classifier (GMMC) (Li et al., 2013), the support vector machine (SVM) (Melgani & Bruzzone, 2004), and the random forest (RF) (Ham et al., 2005). However, traditional HSI classification methods treat these steps as separate processes and mainly rely on hand-crafted features, which are not robust across different input data.
Deep neural networks (DNNs) can learn very complicated relationships between their inputs and outputs with multiple nonlinear hidden layers (Liu et al., 2017). A large number of DNN-based methods have been proposed for end-to-end modelling, which can integrate a spectral module and a spatial module (Liangpei Zhang et al., 2016). For example, Yang et al. designed a two-convolutional neural network (CNN) model to learn the spectral features and spatial features jointly (Yang et al., 2017). The spatial-spectral unified network (SSUN) combined a spectral dimensional band grouping-based long short-term memory (LSTM) model with a 2D CNN for spatial features and integrated the spectral feature extraction (FE), spatial FE, and classifier training into a unified neural network. Zhong et al. proposed an end-to-end 3D residual CNN architecture for spectral-spatial feature learning and classification (Zhong et al., 2017). Motivated by the attention mechanism of the human visual system, a spectral-spatial attention network (SSAN) (Mei et al., 2019) and a residual spectral-spatial attention network (RSSAN) (Zhu et al., 2020) were proposed for hyperspectral image classification. To reduce computation, fully convolutional networks were proposed for HSI classification (Xu et al., 2019). To correctly discover the contextual relations among pixels, the graph convolutional network (GCN), which was originally designed for arbitrarily structured non-Euclidean data, was adopted for HSI classification (Wan et al., 2020).
DNNs have demonstrated excellent performance in image classification (Druzhkov & Kustikova, 2016). However, most existing deep learning methods are time-consuming in the training period. A lot of parameters need to be determined, and a large number of training samples are usually required for deep learning methods (Cheng et al., 2021). DNNs are also prone to overfitting and are sensitive to perturbations (Xu et al., 2020), especially when the training samples are inadequate. Exploring the proper depth of a DNN model for a given data set is still an open research topic (Liangpei Zhang et al., 2016).

CNN
A CNN (Shin et al., 2016) is a class of deep neural networks, most commonly applied to analysing visual imagery. Compared with other neural networks, CNNs are easier to train because parameter sharing and local connectivity give them fewer parameters and connections. However, a CNN extracts the spatial structure of the region neighbouring the current pixel rather than features of the current pixel itself. So, the main problem of CNN-based methods lies in the over-smoothing phenomenon. One approach to solve this problem is to utilize superpixel segmentation, but the segmentation algorithm affects the classification results. Another approach is to use an attention mechanism. Attention mechanisms can counteract the effects of parameter sharing, but they increase the amount of computation.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France

RNN
To utilize the features of the current hyperspectral pixel itself, an RNN structure can be used. An RNN extends conventional feedforward neural networks with loops in their connections, so that it can use its internal memory to exhibit temporal dynamic behaviour. This makes RNNs applicable to challenging tasks involving sequential data, such as speech recognition and language modelling (Cho et al., 2014). A natural idea is to consider each band as a time step, but hyperspectral data usually have hundreds of bands, which consumes substantial computing and storage resources. Meanwhile, the large number of spectral channels and the limited training samples restrict the performance of hyperspectral image classification (Cheng et al., 2021).

PROPOSED FRAMEWORK
Three components play crucial roles in our methodology: a multiscale residual CNN-based spatial feature learner, a bi-directional GRU-based spectral feature learner, and a multi-task learning model.

Multiscale Residual CNN for Spatial Classification
Considering the complex environment of the HSI, we propose to extract robust and multiscale spatial features. The proposed multiscale residual CNN (MRCNN) architecture is shown in Figure 1. Let $X \in \mathbb{R}^{H \times W \times B}$ be the original HSI data, where $H$, $W$, and $B$ are the row number, column number, and band number, respectively.
First of all, to suppress noise and reduce the computational cost, PCA is applied to the original HSI data and only the first $p$ principal components are retained. Denote the dimension-reduced data by $X_p \in \mathbb{R}^{H \times W \times p}$. Around each pixel, a neighbouring region of size $k \times k \times p$ is extracted as the input of the spatial branch. Because different objects in the HSI tend to have different scales, we propose to extract both shallow and deep features by applying a convolution layer with ReLU activation followed by two residual blocks. A max pooling layer is adopted in each residual block. We add a flatten layer with dropout and a fully-connected (FC) layer after the convolution layer and after each of the two residual blocks. Then, these three fully-connected layers are merged into a new fully-connected layer. Let $h_i = \phi(W_i f_i + b_i)$, $i = 1, 2, 3$, denote the $i$th fully-connected layer, where $f_i$ is the flattened feature vector of the $i$th flatten layer, $W_i$ and $b_i$ are the corresponding weight matrix and bias term, respectively, and $\phi(\cdot)$ is the activation function. The fourth fully-connected layer $h_4$ is calculated as $h_4 = \mathrm{concat}[h_1, h_2, h_3]$. In this way, features in different layers are taken into consideration during the classification stage, and the network possesses the multiscale property. The cross-entropy loss function of the MRCNN can be expressed as
$$\mathcal{L}_{\mathrm{MRCNN}} = -\sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log \hat{y}_{n,c},$$
where $y$ and $\hat{y}$ denote the true and predicted labels, respectively, $N$ is the number of training samples, and $C$ is the number of classes. All convolutional layers have 32 filters. The kernel size of the first left convolutional layer is 1 × 1, and the other kernel sizes are 3 × 3. The size of the max pooling layers is 2 × 2. The first three fully-connected layers each have 128 units, and the fourth fully-connected layer has 384 units.
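As a rough check on the multiscale design, the following sketch traces the flattened feature sizes at the three scales of the spatial branch for a 24 × 24 patch with 32 filters. It assumes 'same' padding in every convolution and one non-overlapping 2 × 2 max pooling step per residual block; neither detail is stated in the text, so the flattened sizes are illustrative only. The function name `spatial_branch_shapes` is hypothetical.

```python
def spatial_branch_shapes(patch=24, filters=32, pool=2, fc_units=128):
    """Trace flattened feature sizes at the three scales of the spatial branch.

    Assumes 'same' padding (convolutions keep height and width) and one
    pool x pool max pooling step per residual block; both are assumptions.
    """
    side = patch
    flat_sizes = []
    # Scale 1: output of the first convolution layer (no pooling yet).
    flat_sizes.append(side * side * filters)
    # Scales 2 and 3: outputs of the two residual blocks, each with pooling.
    for _ in range(2):
        side //= pool
        flat_sizes.append(side * side * filters)
    # Each flattened vector feeds its own FC layer; the three are concatenated.
    merged_units = fc_units * len(flat_sizes)
    return flat_sizes, merged_units


flat_sizes, merged_units = spatial_branch_shapes()
print(flat_sizes)    # [18432, 4608, 1152]
print(merged_units)  # 384
```

Whatever the exact flattened sizes are, each scale is projected to 128 units before concatenation, so the merged layer has 3 × 128 = 384 units and no single scale dominates it by sheer dimensionality.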

Bi-GRU for Spectral Classification
The GRU has fewer parameters than the LSTM for modelling sequential problems, and a bi-directional RNN can make full use of both preceding and succeeding information. Therefore, we utilize a Bi-GRU for spectral classification. The complete spectral classification framework is illustrated in Figure 2. A natural idea is to consider each band as a time step in the Bi-GRU, but hyperspectral data usually have hundreds of bands, which would consume substantial computing and storage resources. Thus, a suitable band grouping strategy is used in this paper. For each pixel in the HSI, let $x = (x_1, x_2, \ldots, x_i, \ldots, x_B)$ be the spectral vector, where $x_i$ is the reflectance of the $i$th band and $B$ is the number of bands. Let $\tau$ ($\ll B$) be the number of time steps (i.e., the number of groups). The transformed sequences can be denoted by $z = [z_1, z_2, \ldots, z_t, \ldots, z_\tau]$, where $z_t$ is the sequence at the $t$th time step. Under the grouping strategy, each time step has sequence length $m = \mathrm{floor}(B/\tau)$, where the $\mathrm{floor}(\cdot)$ function rounds numbers down. After grouping, the spectral vector $x$ is transformed into the sequences $z$. The input to our model is the sequential vector $z$, and the bi-directional hidden vector is calculated as follows.
Forward hidden state:
$$\overrightarrow{h}_t = f(\overrightarrow{W} z_t + \overrightarrow{V}\, \overrightarrow{h}_{t-1})$$
Backward hidden state:
$$\overleftarrow{h}_t = f(\overleftarrow{W} z_t + \overleftarrow{V}\, \overleftarrow{h}_{t+1})$$
where the coefficient matrices $\overrightarrow{W}$ and $\overleftarrow{W}$ are from the input at the present step, $\overrightarrow{V}$ is from the hidden state $\overrightarrow{h}_{t-1}$ at the previous step, $\overleftarrow{V}$ is from $\overleftarrow{h}_{t+1}$ at the succeeding step, and $f$ is the nonlinear activation of the hidden layer. The memory of the input, as the output of this encoder, is
$$g_t = \mathrm{concat}(\overrightarrow{h}_t, \overleftarrow{h}_t),$$
where $\mathrm{concat}(\cdot)$ is a function of concatenation between the forward hidden state and the backward hidden state. The Bi-GRU allows the sequential vector to be fed into the architecture one step at a time to learn continuous features in both the forward and backward directions. The predicted label $\hat{y}$ of pixel $x$ is then computed from $F(g)$, where $g = [g_1, g_2, \ldots, g_\tau]$ and $F(\cdot)$ is a flatten function with dropout. A fully-connected layer and a softmax activation function are added to accomplish the classification.
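To make the band grouping concrete, here is a minimal sketch that splits one pixel's spectral vector into a small number of time steps, each of length floor(B / number of time steps). The exact assignment of bands to groups is not recoverable from the text, so the sketch offers two plausible variants as assumptions: adjacent bands per group, or interleaved bands. Dropping the leftover bands is also an assumption, and the function name `group_bands` is hypothetical.

```python
def group_bands(spectrum, tau, interleaved=False):
    """Split a spectral vector into tau sequences of length m = floor(B / tau).

    interleaved=False: each time step takes m adjacent bands.
    interleaved=True:  time step t takes bands t, t + tau, t + 2*tau, ...
    Bands beyond m * tau are dropped in both cases (an assumption).
    """
    if tau <= 0 or tau > len(spectrum):
        raise ValueError("tau must be in 1..len(spectrum)")
    m = len(spectrum) // tau  # sequence length per time step
    if interleaved:
        return [spectrum[t:m * tau:tau] for t in range(tau)]
    return [spectrum[t * m:(t + 1) * m] for t in range(tau)]


# With 103 bands and 3 time steps, each group holds floor(103 / 3) = 34 bands.
spectrum = list(range(103))          # stand-in for one pixel's reflectances
groups = group_bands(spectrum, tau=3)
print([len(g) for g in groups])      # [34, 34, 34]
print(groups[1][:3])                 # [34, 35, 36]
print(group_bands(spectrum, 3, interleaved=True)[1][:3])  # [1, 4, 7]
```

Either way, the recurrent unit only unrolls over 3 steps instead of 103, which is where the saving in computation and memory comes from.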

SSMRN
The proposed SSMRN framework is shown in Figure 3, which integrates the spectral FE, spatial FE, and classifier training into a unified neural network. The network adopts the proposed MRCNN to extract spatial features and the band grouping-based Bi-GRU algorithm to extract spectral features. The last fully-connected layer in the Bi-GRU and the last one in the MRCNN are concatenated to form a new fully-connected layer for the spectral-spatial classification. All parameters in the framework are trained at the same time.
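To better train the whole network, two auxiliary tasks are added to the framework. The proposed SSMRN is therefore a triple-task framework, including one main task (classification based on spectral-spatial information) and two auxiliary tasks (classification based on spectral information and classification based on spatial information). The complete cross-entropy loss function of the SSMRN is defined as
$$\mathcal{L} = \mathcal{L}_{\mathrm{joint}} + \mathcal{L}_{\mathrm{spectral}} + \mathcal{L}_{\mathrm{spatial}}, \qquad \mathcal{L}_{k} = -\sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log \hat{y}^{k}_{n,c}, \quad k \in \{\mathrm{joint}, \mathrm{spectral}, \mathrm{spatial}\},$$
where $\mathcal{L}_{\mathrm{joint}}$ is the main loss function, $\mathcal{L}_{\mathrm{spectral}}$ and $\mathcal{L}_{\mathrm{spatial}}$ are the two auxiliary loss functions, $\hat{y}^{\mathrm{joint}}$, $\hat{y}^{\mathrm{spectral}}$, and $\hat{y}^{\mathrm{spatial}}$ are the corresponding predicted labels, $y$ is the true label, $N$ is the number of training samples, and $C$ is the number of classes. The whole network is trained in an end-to-end manner, where all the parameters are optimized by the mini-batch or batch stochastic gradient descent algorithm at the same time. In this way, the complete loss function balances the convergence of both the whole network and the subnetworks.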
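A minimal sketch of the triple-task objective described above, in plain Python. It assumes the main and auxiliary cross-entropy terms are simply added with equal weight and summed (not averaged) over samples; the text does not state the weighting or normalisation, so both are assumptions, and the function names are hypothetical.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Summed categorical cross-entropy over N samples and C classes.

    y_true: list of one-hot rows; y_pred: list of probability rows.
    """
    total = 0.0
    for truth_row, pred_row in zip(y_true, y_pred):
        for truth, prob in zip(truth_row, pred_row):
            # eps guards against log(0) for hard zero predictions.
            total -= truth * math.log(max(prob, eps))
    return total


def multitask_loss(y_true, y_joint, y_spectral, y_spatial):
    """Main loss plus two auxiliary losses, equally weighted (assumption)."""
    return (cross_entropy(y_true, y_joint)
            + cross_entropy(y_true, y_spectral)
            + cross_entropy(y_true, y_spatial))


# Toy example with N = 2 samples and C = 2 classes.
y_true = [[1, 0], [0, 1]]
confident = [[0.9, 0.1], [0.2, 0.8]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
loss = multitask_loss(y_true, confident, uncertain, uncertain)
print(round(loss, 4))
```

Because each branch contributes its own term, a branch whose predictions are poor keeps receiving a gradient signal even when the joint prediction is already accurate.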

EXPERIMENTAL RESULTS
In this section, we introduce the public data set used in our experiments and the configuration of the proposed SSMRN. In addition, the classification performance of the proposed method and of the comparative methods is presented. All the experiments are implemented on an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz with 64 GB of RAM and an NVIDIA RTX 2080 graphics card, using TensorFlow 2.3.1 and Keras 2.4.3 with Python 3.7.6.

Experimental Data
The Pavia University hyperspectral data set is utilized to evaluate the performance of the proposed method in the experiment. The data set was acquired by the Reflective Optics Systems Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia, northern Italy. The image consists of 103 spectral bands with 610 × 340 pixels and has a spectral coverage from 0.43 to 0.86 μm. The spatial resolution is 1.3 m/pixel. The ground truth differentiates 9 classes.
All the experiments in this paper are repeated 30 times with randomly selected training and test data. In each repetition, we first randomly generate the training set from the whole data set with the same number of labelled samples for each class. Then, the remaining samples make up the test set. Details are listed in Table 1.
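The sampling protocol above can be sketched as follows: draw a fixed number of labelled pixels per class at random for training and keep the rest for testing. The function name `split_per_class`, the seed handling, and the value of 10 samples per class are hypothetical placeholders and are not taken from Table 1.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_per_class, seed=0):
    """Return (train_idx, test_idx) with n_per_class random samples per class.

    labels: list of class ids for the labelled pixels.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx = []
    for label, indices in by_class.items():
        if len(indices) < n_per_class:
            raise ValueError(f"class {label} has fewer than {n_per_class} samples")
        train_idx.extend(rng.sample(indices, n_per_class))

    # Every labelled pixel not drawn for training goes to the test set.
    train_set = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx


labels = [0] * 50 + [1] * 30 + [2] * 20   # toy label vector, not real data
train_idx, test_idx = split_per_class(labels, n_per_class=10, seed=1)
print(len(train_idx), len(test_idx))       # 30 70
```

Repeating the experiment then amounts to calling the function with a different seed for each of the 30 runs.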

Parameter Setting
The number of time steps in the spectral branch is set to 3. The input of the spatial branch is a 24 × 24 × 4 patch, where 4 is the number of retained principal components. The dropout rate is set to 60%.
The number of neurons in the fully-connected layer of the spectral branch and in each of the fully-connected layers corresponding to different scales in the spatial branch is set to 128, so the joint fully-connected layer has 128 + 3 × 128 = 512 neurons. We use the Adam optimizer to train the networks with a learning rate of 0.001. The number of training epochs is set to 1000 with a batch size of 1024.

Classification Results
To demonstrate the superiority and effectiveness of the proposed SSMRN model, it is compared with two advanced machine-learning methods, SSAN (Mei et al., 2019) and RSSAN (Zhu et al., 2020). Limited by our computer configuration, we cannot run RSSAN properly with the original input, so the input of RSSAN is a 24 × 24 × 8 patch, where 8 is the number of retained principal components instead of the number of spectral bands. All algorithms are executed 30 times. The mean and standard deviation over the 30 runs are reported to reduce the effects of random selection. Class-specific accuracy, overall accuracy (OA), and the kappa coefficient are used as the evaluation measurements for the compared methods.
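For reference, OA and the kappa coefficient can be computed from a confusion matrix as in the sketch below, using the standard definitions (kappa is the observed agreement corrected for chance agreement). The confusion matrix and the per-run OA values are made-up toy numbers for illustration and are not results from the paper; the function names are hypothetical.

```python
from statistics import mean, stdev


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def kappa(confusion):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_observed = overall_accuracy(confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Toy 2-class confusion matrix (rows: true class, columns: predicted class).
confusion = [[45, 5],
             [10, 40]]
print(overall_accuracy(confusion))   # 0.85
print(round(kappa(confusion), 4))    # 0.7

# Mean and standard deviation over repeated runs (made-up OA values).
runs = [0.85, 0.87, 0.86]
print(round(mean(runs), 4), round(stdev(runs), 4))
```

Class-specific accuracy follows the same pattern: divide each diagonal entry by its row sum.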

Qualitative Evaluation:
The classification maps of the different methods are shown in Figure 4. These methods make full use of the continuity of ground objects and yield cleaner classification maps.

CONCLUSIONS
To effectively learn features of objects and significantly reduce network complexity, a multi-task learning spectral-spatial multiscale residual network (SSMRN) has been proposed to extract spectral-spatial features. The end-to-end network based on the MRCNN and the Bi-GRU can learn higher-level spectral-spatial joint features. The experimental results demonstrate that the method not only performs better than the other methods in terms of class-specific accuracy, OA, and kappa, but also mitigates the over-smoothing phenomenon. Our method remains more robust even when the training samples are inadequate.
Although we utilize the proposed MRCNN and the band grouping-based Bi-GRU as the spatial and spectral feature extractors in the implementation of the proposed SSMRN, other deep networks can also be introduced into our model, especially as spectral extractors. This deserves investigation in future work.