SPECTRAL-SPATIAL CLASSIFICATION OF HYPERSPECTRAL REMOTE SENSING IMAGES USING VARIATIONAL AUTOENCODER AND CONVOLUTION NEURAL NETWORK

In this paper, we propose a spectral-spatial feature extraction framework based on deep learning (DL) for hyperspectral image (HSI) classification. In this framework, the variational autoencoder (VAE) is used for extraction of spectral features from two widely used hyperspectral datasetsKennedy Space Centre, Florida and University of Pavia, Italy. Additionally, a convolutional neural network (CNN) is utilized to obtain spatial features. The spatial and spectral feature vectors are then stacked together to form a joint feature vector. Finally, the joint feature vector is trained using multinomial logistic regression (softmax regression) for prediction of class labels. The classification performance analysis is done through generation of the confusion matrix. The confusion matrix is then used to calculate Cohen’s Kappa (K) to get a quantitative measure of classification performance. The results show that the K value is higher than 0.99 for both HSI datasets.


INTRODUCTION
Hyperspectral imaging sensors have narrow continuous spectral channels from visible to IR wavelengths for the same spatial location.The information obtained from these sensors is in the form of grids where each element represents a pixel vector.The size of each vector is equivalent to the number of spectral channels or bands.Accurate spectroscopic information obtained from hyperspectral sensors increases the ability to distinguish between different land-cover classes with enhanced accuracy.Various hyperspectral imaging systems are currently available, providing an enormous amount of image data that finds application in diverse fields such as forestry, ecology, geology, hydrology, precision farming and military applications.These applications often require class recognition of each pixel with a small number of pixels as training samples.
There are several important issues when it comes to HSI classification.Supervised classification methods face problems associated with an imbalance between the high dimensionality (Donoho et al., 2000) of data and limited access to the available training samples, the presence of highly correlated bands, and the presence of mixed pixels (Ghamisi et al., 2017).When dimensionality (number of bands) grows, keeping the number of training samples constant, the accuracy of statistical evaluation is reduced which leads to a reduction in the classification accuracy beyond some bands (Landgrebe, 2003).For classification, these problems relate to the so called curse of dimensionality (Hughes, 1968).Also, with the increased spatial resolution of new hyperspectral imaging sensors, the intra-class variation increases and the inter-class-variation decreases both in the spatial and spectral domain, leading to lower interpretation accuracies.Due to these issues with HSI classification, it is necessary to utilize spatial information and to reduce data dimensionality without losing desirable information.

* Corresponding Author
In HSI, dimensionality reduction has two main categories, i.e., band selection and feature extraction.In-band selection, we obtain a subset of the original bands by minimizing spectral redundancy (Yang et al., 2011;Du et al., 2008).The process selects an appropriate band subset from the original bands, while feature extraction methods preserve essential features through mathematical transformations.Some of the most common feature extraction method for HSI include principal component analysis (PCA) (Rodarmel and Shan, 2002), Fisher's Linear discriminant analysis (FLDA) (Bandos et al., 2009) and Independent component analysis (IDA) (Villa et al., 2011).All these methods project the original data into a low-dimensional subspace.However, hyperspectral images are considered inherently nonlinear (Bachmann et al., 2005).These linear methods are not suitable for the analysis of such data (Han and Goodenough, 2008).HSI classification is one of the active areas in the remote sensing community.Till the advent of deep learning (DL) in remote sensing, researchers have used pertinent machine learning algorithms prevalent in computer vision tasks for classifying HSI.Gaulteri (1998) carried out classification of HSI using SVM.Melagni and Bruzzone (2004) compared the classification performance of SVM with two non-parametric classifiers, k-NN and RBF kernel based ANN.Chen et al. (2014) first used DL in HSI classification where they extracted joint spectral-spatial features using multi-layered stacked autoencoders.For classification, a logistic regression based classifier was used which achieved 97-98% accuracy on the Pavia dataset.Hu et al. (2015) then proposed a simple CNN structure with five layers: one input, one convolutional, one max pooling, one fully connected and one output layer for spectral classification of HSI which achieved 90-92% overall accuracy for India Pines, Salinas and Pavia University datasets.Later, Chen et al. (2016) proposed a 3D CNN which could extract robust feature vectors by exploiting both spectral and spatial features of HSI.Mou et al. (2017) used RNN with parametric rectified tanh activation function for the first time in HSI data classification which tried to employ the sequential property of HSI data for classification.
The rest of this paper is organized as follows: Section 2 presents the description of VAE that is used for extraction of spectral features.Section 3 presents the description of CNN and multinomial logistic regression that are used for spatial feature extraction and fine tuning of pre-trained model, respectively.The spectral-spatial feature extraction framework for HSI classification is introduced in section 4. The experimental results are reported in Section 5. We conclude this paper in Section 6 with some discussions.

Vanilla autoencoder
A Vanilla autoencoder (figure 1) is an unsupervised neural network that attempts to encode input samples into some latentspace representation so that the inputs can be rebuilt from these representations with minimal reconstruction error.It consists of three components-encoder, the latent feature vector, and decoder.The encoder transforms the input into a latent feature vector having lower dimensions than the original input (a form of compression) through an encoding function.The decoder then tries to reconstruct the initial input by using the latent feature vector through a decoding function.Autoencoder is a feedforward network which consists of two layers-a hidden layer and an output layer where the input and output layer have the same number of neurons.The hidden layer has less number of neurons than the input layer which forces the network to learn good representations of the input.The objective of an autoencoder is to regenerate the inputs with minimum reconstruction error. .Each training sample (i)  x is encoded to a latent space representation (i) h using an encoding function.The latent space representation is converted back to original training sample (i)   x by using a decoding function.In neural network terminology, In the above equations 1 2 , WW denote input to hidden and hidden to output weights, respectively.b1.b2 denotes the bias of hidden and output units, and f(.),g(.)denotes the activation function for the encoder and decoder, respectively.As the goal of an autoencoder is to have (i)   x approximate (i)  x , the objective function is set up as the sum of squared difference between (i)   x and (i)   x which is also called reconstruction loss: ) which can be minimized using the stochastic gradient descent algorithm.(Amari, 1993)

Variational autoencoder
The Variational autoencoder (Kingma and Welling, 2013) is a generative model (Ng and Jordan, 2001) that allows us to sample from a distribution similar to the training data.This can be accomplished by specifying a joint distribution over all the dimensions of the input.The VAE approach is an extension of autoencoder where instead of forcing encoders to create a unique encoding, we force the encoder to generate latent feature representations that roughly follow a standard normal distribution over encodings.The decoder will then sample a latent feature vector from this probability distribution and try to reconstruct the original input.In practice, there is a trade-off between the accuracy of the network and the proximity of its latent variables to the standard Normal distribution.For evaluation of neural network performance, we sum up two distinct losses-a reconstruction loss, which is a mean squared error that measures how accurately the network reconstructs the input and a latent loss, which is KL divergence (Raiber and Kurland, 2017) that measures to what extent the latent variables diverge from a standard normal distribution.The neural network aims to minimize these losses with each iteration.(1) A value (i)  z is generated from some prior distribution * () pz  (2) A value (i)  x is generated from some conditional distribution * ( | ) p x z  The true parameters θ * of the generative model have to be estimated in order to generate new data.So, to represent this model, select   which is represented as a neural network with non-linear hidden layer.To solve these intractibilities problems, an additional encoder network ( | )  q z x  (also called variational posterior) is defined that approximates the true intractable posterior ( | ) p z x  .To know how well the variational posterior approximates the true posterior, the KL divergence is used which measures the loss in information when variational posterior is used to approximate true posterior.
Our goal is to find the variational parameters ϕ that minimize this divergence.This allows us to derive a lower bound on the marginal likelihood that is tractable and can be optimized.The optimal variational posterior is therefore However it is intractable due to the presence of () px  in KL divergence.As KL divergence is always ≥ 0, we get where L (θ, ϕ) is the variational lower bound on the marginal likelihood which can also be written as:

Convolution neural network
Convolution neural network (CNN) (Lawrence et al., 1997) is composed of alternate convolutional and max-pooling layers optionally followed by one or more fully connected layers as in standard multilayer neural network.First, the convolution layer extract features through filtering process resulting in a feature map.The feature map generated is sub-sampled by the pooling layer to generate more general and abstract features.In the final stages, various fully connected layers are utilized to generate deep features.
The convolution layer takes an image X of dimension m x n x c as input where m represents height, n represents width and c represents number of spectral channels in an image.For an RGB image, the value of c is 3 whereas for HSI, it is in order of hundred.The convolution layer consists of k filters or kernels denoted as W of size p x p x q such that p is smaller than image dimension (both m and n) and q is equal to the number of channels.Each kernel is convolved with the input to produce k features maps of dimension (m-p+1) x (n-p+1).An additive bias denoted as b and a non-linear activation function denoted as f(.) are then applied to every feature map.The output of convolution layer can be calculated as: 1 ( ) where wj and bj denotes the j th component of W and b respectively, the operator  denotes convolution operation, xi denotes the i th feature map of X and aj denotes the j th output of convolution layer (Song et al., 2018).
Each feature map is then sub-sampled with either max or mean pooling over r x r contiguous regions where r typically ranges from two to five.Finally, after a series of successive convolution and pooling operations, the resultant features are combined into one dimensional feature vector which is passed to the fully connected layers and then to the softmax layer (multinomial logistic regression) for classification.CNN architecture for image classification is shown in figure 3.

Multinomial logistic regression
Multinomial logistic regression also called softmax regression is a generalization of logistic regression to classification problems where the output classes are more than two.It takes the numeric features of the dataset as input.For categorical features, suitable technique has to be implemented for its conversion to numerical features.
Consider a training set   (which sums up to one) as output.The hypothesis hθ(x (i) ) takes the form (Vryniotis, 2013) (1) . The Multinomial logistic regression model is formally given by (Böhning, 1992) And the cost function is defined as: where 1{.} is the indicator function such that 1{a true statement} = 1 and 1{a false statement} = 0 (Vryniotis, 2013)

METHODOLOGY
The proposed spectral-spatial framework can be divided into 3 steps as indicated through flowchart in figure 4. 1) Extraction of spectral features using variational autoencoder described in section 2. 2) Extraction of spatial features using Convolution neural network described in section 4. 3) Stacking of spectral and spatial features to form a joint feature followed by softmax regression for classification.

Spatial feature extraction using CNN
The steps involved in extraction and processing of hyperspectral data are as follows: 1) PCA is performed on whole image data and PCs are obtained by setting 99.9 percent variance preserving criteria.
2) The image obtained after PCA and ground reference image is then padded with size 15 at the top, bottom, left and right such that the last element along a direction is replicated along the whole padding area in that direction.3) A fixed window is selected centred on the non-zero ground truth pixels to generate training samples (patches).The patches will be sub-images having ground reference information corresponding to the centred ground truth pixel.(Zhao and Du, 2016) The CNN model consist of two alternating convolution and pooling layers followed by three fully connected layers.The first convolution layer has 30 kernels of size 5 x 5 with a stride of 2 and the second convolution layer has 30 kernels of size 3 x 3 with a stride of 2. The number of neurons in fully connected layers are 1000, 400 and 60 respectively.
where yi = predicted class label for training patch Ti.
In order to avoid overfitting in the model, drop out regularisation (Srivastava et al., 2014) is used with 70% retain probability.After training, the extracted spatial features are obtained as an output of the last fully connected layer.

Stacking of spectral and spatial feature vectors
The spectral and spatial feature vectors obtained are stacked together to generate the joint feature vector which is passed as an input to the softmax layer for calculation of conditional probabilities for each class.For a joint feature vector z, the probability distribution is calculated (Song et al., 2018).

Dataset description
For various investigations, two hyperspectral datasets with different themes are utilized to validate the proposed spectralspatial feature extraction based classification method.The first dataset is a mixed vegetation site acquired by the AVIRIS sensor over Kennedy Space Centre (KSC), Florida containing 13 land cover classes in 224 spectral bands of 10 nm width with a spatial resolution of 18 m.After removal of water absorption and low signal to noise ratio (SNR) bands, 176 bands are used for classification.The dataset is a 512 × 614 pixels image, in which ground reference information of 5211 pixels is available.The second dataset is an urban site imaged by the ROSIS sensor over the University of Pavia, Italy having nine land cover classes in 103 spectral bands of 4 nm width with a spatial resolution of 1.3 m.It is a 610 × 340 pixels image, in which ground reference information of 42776 pixels is available.

Classification accuracy analysis
For both hyperspectral datasets, we split the image into two sets, i.e., training, and testing data, with a split ratio 1:9 i.e., we randomly select 10% of the labelled samples as the training set, and remaining 90% for the testing sets, respectively.During training, we use the training set to learn weights and biases of each neuron and the test set is used to produce final classification results.In order to quantitatively compare and estimate the capabilities of the proposed models, overall accuracy (OA) and Kappa coefficient and z-scores are used as performance measurement (Congalton and Green, 2009).To perform statistical evaluation, we conduct five independent replications of the whole process and use the average Kappa coefficient to compare the performance among various DL based methods like SAE-LR (Chen et al., 2014) and CNN.The accuracy analysis is shown in Tables 1 and 2 for Kennedy Space Centre and Pavia University images respectively and figures 5 and 6 show the classification maps obtained by different methods.The classification performance comparison of the proposed (VAE+CNN)-LR method with the two other methods is done based on the one tailed hypothesis testing, the results of which are shown in Table 3. From one tailed test with 95% confidence interval, it can be concluded that the proposed method performs significantly better than the two other DL methods.

Figure 1 .
Figure 1.Vanilla autoencoder Consider a training dataset X which consist of 'm' samples of continuous or discrete variables (that the training data is generated by a random process involving unobserved (latent) continuous random variable z.The data generation process consists of two steps: () pz  to be simple, e.g.Gaussian.However, ( | ) p x z  is complex as it has to be used for image generation.A neural network is the best choice for representing this complex function which is called a decoder network.For training this neural network, we try to learn these model parameters in order to maximize the marginal likelihood of training data.

(
8)So, minimizing the intractable KL divergence is equivalent to maximizing the variational lower bound on the marginal likelihood that is computationally tractable and is our objective function.The first part of the objective function (denoted by equation 9) represents the quality of reconstruction and the second term regularizes the latent space towards the true posterior.To optimize the variational lower bound, a Stochastic Gradient Variation Bayes (SVGB) estimator is formulated that includes a reparameterization trick to allow the SVGB estimator to be differentiable(Kingma and Welling, 2013).A basic variational encoder network architecture is shown in figure2.

Figure 2 .
Figure 2. Variational autoencoder architecture ,...... } i yc where c denotes number of classes.For a given test input x, the hypothesis estimates   () | i P y c x  for each c value i.e. the probability of the class label taking each of the c possible class outcomes.The hypothesis thus gives a c dimensional vector containing the c estimated probabilities

Figure 4 .
Figure 4. Flowchart for proposed framework 4.1 Spectral feature extraction using VAE The first step is an unsupervised pre-training step using VAE for extraction of spectral features.The VAE model consist of 3 hidden encoder layers and 3 hidden decoder layers.The hidden layer neurons are 150, 100, 60 respectively.The framework for spectral feature extraction is shown in fig-7.The network is trained using exponential linear unit (ELU)(Clevert et 2016) Suppose there are M training patches selected randomly from the input image and Ti denotes a training sample where [1, 2,... ] i M  whereas li denotes the class label corresponding to training patch Ti.The training objective of CNN is to learn the filters W and the bias parameter b by minimizing the cross entropy loss (Nasr et al., 2002) function using back-propagation algorithm.

Table 2 .
Class-wise accuracies of the University of Pavia (PU) image

Table 3 .
Classification performance comparison using kappa coefficient and Z-score6.CONCLUSIONIn this paper, a novel DL-based spectral-spatial framework is presented for HSI classification.The proposed method can extract more in-depth features.On the basis of z-score, the results on two hyperspectral datasets have demonstrated the superiority of the proposed method over other widely used DL methods like SAE-LR and CNN.For the future works, we propose to investigate more DL-based methods for spectral feature extraction methods and classification performance analysis both in terms of accuracy and time for hyperspectral images of different themes.