EVALUATING THE INITIALIZATION METHODS OF WAVELET NETWORKS FOR HYPERSPECTRAL IMAGE CLASSIFICATION

Artificial neural networks have proven useful for hyperspectral image classification. However, the high dimensionality of hyperspectral images often prevents the construction of an effective neural network classifier. To improve the performance of neural network classifiers, wavelet-based feature extraction algorithms can be applied to extract useful features for hyperspectral image classification. However, features extracted with fixed position and dilation parameters of the wavelets characterize the spectrum insufficiently. In this study, wavelet networks, which integrate the advantages of wavelet-based feature extraction and neural network classification, are proposed for hyperspectral image classification. A wavelet network is a kind of feedforward neural network that uses wavelets as activation functions. Both the position and the dilation parameters of the wavelets are optimized, together with the weights of the network, during the training phase. The value of wavelet networks lies in their capability to optimize network weights and extract essential features simultaneously for hyperspectral image classification. In this study, the influence of the learning rate and momentum term during the network training phase is presented, and several initialization modes of wavelet networks are used to test the performance of wavelet networks.


INTRODUCTION
The imaging spectrometer, a remote sensing technology developed in the 1980s, can obtain images with hundreds of spectral bands simultaneously (Goetz et al., 1985). The images acquired with imaging spectrometers are called hyperspectral images. These images not only reveal two-dimensional spatial information but also contain rich and fine spectral information. With these characteristics, they can be used to identify surface objects and improve land use/cover classification accuracies. In the past three decades, hyperspectral images have been widely used in different fields such as mineral identification, forest vegetation mapping, and disaster investigation based on spectral analysis (Lillesand and Kiefer, 2000).
Since hyperspectral images contain rich and fine spectral information, an improvement in land use/cover classification accuracy is expected from the utilization of such images. However, the statistics-based classification methods that have been successfully applied to multispectral images are not as effective when applied to hyperspectral images. In fact, problems arise if too many spectral bands are used with a finite number of training samples. If the training samples are insufficient, which is very common for hyperspectral images, the estimation of statistical parameters becomes inaccurate and unreliable. As the dimensionality increases with the number of bands, the number of training samples needed to train a specific classifier increases exponentially as well. The rapid increase in the training sample size required for density estimation has been termed the "curse of dimensionality" by Bellman (1961), which leads to the "peaking phenomenon" or "Hughes phenomenon" in classifier design (Hughes, 1968). The consequence is that the classification accuracy first grows and then declines as the number of spectral bands increases while the number of training samples is kept the same. A simple but sometimes very effective way of dealing with high-dimensional data is to reduce the dimensionality deliberately (Lee and Landgrebe, 1993; Benediktsson et al., 1995; Hsu, 2007a). In the case of hyperspectral images, this can be done by feature extraction, in which a small number of salient features are extracted from the hyperspectral images before classification. Since the dimensionality is reduced before classification, the curse of dimensionality can be avoided. Thus most of the traditional classification methods, such as the maximum likelihood classifier (MLC), can be directly applied to the extracted features.
In order to avoid the problems caused by limited training samples, several feature extraction methods based on the wavelet transform (WT) have been proposed for hyperspectral image classification (Hsu, 2003; Hsu, 2007a). In the past decades, the WT has been developed as a powerful analysis tool for signal processing and has been successfully applied in areas such as image processing, data compression and pattern recognition (Mallat, 1999). Due to their time-frequency localization properties, discrete wavelet and wavelet packet transforms have proven to be an appropriate starting point for the classification of measured signals (Pittner and Kamarthi, 1999). The WT decomposes a signal into a series of translated and scaled versions of the mother wavelet function. When the WT is applied to hyperspectral images, the local energy variations of a spectral signature in different spectral bands at each scale (or frequency) can be detected automatically and provide useful information for hyperspectral image classification. Although the proposed wavelet-based methods perform well for feature extraction and classification, the relationship between the extracted features and the identified classes is not apparent.
In addition to dimensionality reduction for statistics-based classifiers, nonparametric classifiers such as artificial neural networks (ANNs) have also been proposed to deal with the problem of high dimensionality and have been applied to hyperspectral image classification. The use of ANNs is motivated by their power in pattern recognition and classification due to the ultimately fine distribution and non-linearity of the process. However, most neural processing algorithms are computationally intensive and involve many iterative calculations, especially for hyperspectral images. A characteristic of neural networks is that they need a long training time but are relatively fast data classifiers. For very high-dimensional data, the training time of a neural network can be very long and the resulting neural network can be very complex. This is a serious drawback, especially when the dimensionality and the sample size of the training data are large (Benediktsson et al., 1995).
To combine the advantages of ANNs with wavelet-based feature extraction methods, wavelet networks (WNs) have been proposed, with some success in data approximation, identification and classification (Dickhaus and Heinrich, 1996). The value of wavelet networks lies in their capability to extract essential features in the time-frequency plane. Furthermore, both the position and the dilation of the wavelets are optimized, in addition to the weights of the network, during the training phase. This hierarchical, multiresolution training can result in a more meaningful interpretation of the resulting mapping and in networks that adapt more efficiently than conventional methods. In addition, wavelet theory provides useful guidelines for the construction and initialization of networks and, consequently, the training times are significantly reduced (Iyengar, 2002).
The performance of WNs for hyperspectral image classification was tested in Hsu (2007b). The experimental results showed that WNs are indeed an effective tool for the classification of hyperspectral images and give better results than traditional feed-forward multi-layer neural networks. The structure of the WNs used in this study is a kind of feed-forward neural network, and ordinary back-propagation (BP) is used for training. Therefore, the drawbacks of BP may also exist in WNs. The first problem is the local minima of the loss function caused by the gradient descent algorithm (Postalcioglu & Becerikli, 2007). In BP, the learning rate controls the size of the weight changes during the learning phase. Finding a reasonable learning rate for wavelet networks is important not only to curtail processing cost but also to classify accurately. With a small learning rate, a simple gradient descent procedure steps toward the minimum very slowly, whereas an oscillatory descent occurs with a higher learning rate. In the general back-propagation learning process, a simple but powerful improvement is to add a momentum term (Plaut et al., 1986) to the gradient descent formula. The use of momentum adds inertia to the motion through weight space and smooths out the oscillations (Bishop, 1995). In this study, the influence of the momentum term is tested.
The second problem is the initialization of the network parameters. Efficient initialization results in fewer iterations in the training phase of the network and also helps to avoid local minima of the loss function (Alexandridis and Zapranis, 2013). When wavelet networks are initialized, the weights are typically started with random numbers. For the wavelon nodes, several initialization modes have been proposed for regression problems (Zhang, 1997), but they are not convenient for classification. An easier approach is to take prior information into account. For example, if differences between two classes are expected at higher frequencies, the wavelet nodes should be initialized in this region of the time-frequency plane.
In this study, the theory of WNs is first introduced for hyperspectral image classification, and then an AVIRIS image is used to test the feasibility and performance of classification using WNs. The influence of the learning rate and momentum term is presented, and several initialization modes of WNs are used to test the performance of wavelet networks.

Wavelet Transform
Due to its time-frequency localization properties, the wavelet transform (WT) has proven to be an appropriate starting point for the classification of measured signals (Pittner and Kamarthi, 1999). The wavelet transform decomposes a signal into a series of shifted and scaled versions of the mother wavelet function. In the past two decades, the WT has been developed as a powerful analysis tool for signal processing and has been successfully applied in areas such as image processing, data compression and pattern recognition (Mallat, 1999).
Mathematically, a wavelet is defined as a function $\psi \in L^2(\mathbb{R})$ that effectively has a limited extent and an average value of zero. The family of wavelet bases can be produced by scaling the mother wavelet by $s$ and translating it by $u$ (Mallat, 1999):
$$\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-u}{s}\right)$$
The continuous wavelet transform (CWT) of $f \in L^2(\mathbb{R})$ at time $u$ and scale $s$ can be obtained by taking the inner product of $f(t)$ with the scaled and translated versions of the basis function $\psi$:
$$Wf(u,s) = \langle f, \psi_{u,s} \rangle = \int_{-\infty}^{+\infty} f(t)\,\frac{1}{\sqrt{s}}\,\psi^{*}\!\left(\frac{t-u}{s}\right)dt$$
where $*$ denotes complex conjugation. By definition, the CWT is a convolution of the input data with a set of functions generated by the mother wavelet. The convolution can be computed by using the Fast Fourier Transform (FFT).
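The FFT route mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the complex Morlet mother wavelet, its centre frequency `omega0 = 5.0`, the test signal, and all function names are assumptions made for the example. The transform is evaluated once from the defining inner product and once as a zero-padded FFT convolution, so that the two results can be compared.

```python
import numpy as np


def morlet(t, omega0=5.0):
    """Complex Morlet mother wavelet; omega0 is an assumed centre frequency."""
    return np.exp(1j * omega0 * t) * np.exp(-0.5 * t**2)


def cwt_direct(f, scales, omega0=5.0):
    """CWT from the defining inner product, one (scale, position) pair at a time."""
    n = len(f)
    t = np.arange(n)
    out = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        for u in range(n):
            psi = morlet((t - u) / s, omega0) / np.sqrt(s)
            out[i, u] = np.sum(f * np.conj(psi))
    return out


def cwt_fft(f, scales, omega0=5.0):
    """Same transform, computed as a zero-padded FFT convolution per scale."""
    n = len(f)
    tau = np.arange(-(n - 1), n)        # every lag t - u that can occur
    size = 3 * n - 2                    # length of the full linear convolution
    f_hat = np.fft.fft(f, size)
    out = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        # Kernel g[tau] = conj(psi_s(-tau)), so (f * g)[u] = sum_t f[t] conj(psi_s(t - u)).
        g = np.conj(morlet(-tau / s, omega0)) / np.sqrt(s)
        full = np.fft.ifft(f_hat * np.fft.fft(g, size))
        out[i] = full[n - 1:2 * n - 1]  # keep positions u = 0 .. n-1
    return out


# Hypothetical test signal: a sinusoid sampled at 64 points.
signal = np.sin(2 * np.pi * np.arange(64) / 8.0)
scales = [2.0, 4.0, 8.0]
coeffs_direct = cwt_direct(signal, scales)
coeffs_fft = cwt_fft(signal, scales)
```

Both functions return one row of complex coefficients per scale. The direct version costs on the order of N² operations per scale, while the FFT version costs on the order of N log N, which is the practical reason for preferring it on spectra with hundreds of bands.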
The analysis of a signal using the CWT yields a wealth of information, but there is clearly a lot of redundancy in the CWT. The discrete wavelet transform (DWT) is an implementation of the wavelet transform that uses a discrete set of scales and translations, chosen according to defined rules, to reduce this redundancy. The orthogonal wavelet in terms of multi-resolution analysis (MRA) is commonly used in various applications. The DWT can decompose a signal into low-frequency components that represent the optimal approximation and high-frequency components that represent the detailed information of the original signal (Mallat, 1989). The decomposition coefficients in a wavelet orthogonal basis can be computed with a fast algorithm that cascades discrete convolutions with conjugate mirror filters h and g, and subsamples the outputs. The decomposition formulas are as follows:
$$a_{j+1}[p] = \sum_{n} h[n-2p]\,a_j[n], \qquad d_{j+1}[p] = \sum_{n} g[n-2p]\,a_j[n]$$
where $a_j$ and $d_j$ denote the approximation and detail coefficients at level $j$. Either the CWT or the DWT can be used in WNs. In this study, the CWT is used for hyperspectral image classification.
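The filter-and-subsample cascade described above can be sketched in a few lines. The Haar filter pair, the periodic border handling, the 8-sample input, and the function names are assumptions chosen to keep the example short; the source does not state which orthogonal wavelet it has in mind.

```python
import numpy as np

# Haar conjugate mirror filters (an assumed choice; the sign convention for g varies).
H = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low-pass filter h
G = np.array([1.0, -1.0]) / np.sqrt(2.0)   # high-pass filter g


def dwt_step(a, h=H, g=G):
    """One decomposition level: filter with h and g, keep every second output."""
    a = np.asarray(a, dtype=float)
    n = len(a)                              # assumed to be even
    approx = np.zeros(n // 2)
    detail = np.zeros(n // 2)
    for p in range(n // 2):
        for k in range(len(h)):
            idx = (2 * p + k) % n           # periodic extension at the border
            approx[p] += h[k] * a[idx]
            detail[p] += g[k] * a[idx]
    return approx, detail


def dwt(a, levels):
    """Cascade dwt_step on the approximation; return the last approximation and all details."""
    details = []
    for _ in range(levels):
        a, d = dwt_step(a)
        details.append(d)
    return a, details


# Hypothetical 8-band "spectrum" used only to exercise the functions.
spectrum = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx1, detail1 = dwt_step(spectrum)
approx2, details = dwt(spectrum, levels=2)
```

Because the Haar pair is orthonormal, the energy of the input equals the summed energy of the final approximation and all detail coefficients, which is a convenient sanity check for any such implementation.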

Wavelet-Based Feature Extraction
It has been shown that the wavelet transform provides good capabilities for time-frequency analysis. Hence, several wavelet-based feature extraction (WFE) methods have been proposed to extract essential features of hyperspectral images for classification (Hsu, 2007a). Such features are computed with the wavelet transform and are indexed by translation and scale.
The values of the WT at a specific time and scale index can be regarded as meaningful features for distinguishing different classes of land objects in image classification. However, the extracted features are highly dependent on the values of the translating and scaling parameters that characterize the preprocessing of WFE. These values are selected before the feature extraction procedure on the basis of the user's prior knowledge, which is rarely available or hard to predict (Angrisani et al., 2001). To overcome this limitation, translating and scaling parameters that adapt to the characteristics of the data are desirable. In this paper, wavelet networks based on back-propagation networks and wavelet theory are introduced. Wavelet networks can adjust the translating and scaling parameters during the learning stage and give an optimized image classification result. This not only avoids the limited-sample problem but also improves the performance of neural networks. The wavelet network-based classifier for hyperspectral images is more flexible than neural networks because the weights and the extracted features are both optimized during training. Figure 1 illustrates the difference between wavelet networks and neural networks with features extracted by WFE.

Structure of Wavelet Networks
Based on the theory of the wavelet transform, the concept of wavelet networks was first proposed by Zhang and Benveniste (1992). Wavelet networks for classification combine the wavelet transformation, for purposes of feature extraction and selection, with the characteristic decision capabilities of neural-network approaches (Dickhaus and Heinrich, 1996). Figure 2 shows the structure of the wavelet networks used in this study, which consists of one wavelet layer, one hidden layer, and one output layer. A wavelet node (called a wavelon) for feature extraction is parameterized by a translation parameter, $b_k$, and a scale parameter, $a_k$. The outputs of the wavelons, $\varphi_k$, which can be interpreted as the correlation between the signal $s[n]$ and the wavelet $h_k(t)$, serve as input to the hidden layer of the neural network classifier. The classifier in the right part of the network can be any single-layer or multi-layer perceptron.
During the learning process, the wavelet node parameters are also updated to minimize the error, $E$.

Implementation of Wavelet Networks
A typical wavelet function used in wavelet networks is the complex Morlet wavelet (Dickhaus and Heinrich, 1996):
$$h(t) = e^{j\omega_0 t}\,e^{-t^2/2}$$
The wavelet nodes $h_k(t)$ in Figure 2 are shifted and dilated versions of this mother wavelet function:
$$h_k(t) = h\!\left(\frac{t-b_k}{a_k}\right)$$
The variable $a_k$ is the scale parameter, and $b_k$ is the translation parameter of the wavelet function. If the scale $a_k$ is large, the wavelet is a dilated low-frequency function, whereas for small values of $a_k$ the wavelet is compact, corresponding to a high-frequency function. Formally, the node's output $\varphi_k$ is the result of the wavelet transform, defined as the inner product of the node $h_k$ and the signal $s^{(i)}[n]$, which is the input of the wavelet network (the index $i = 1, \cdots, N$ denotes the signal number):
$$\varphi_k = \langle s^{(i)}, h_k \rangle = \sum_{n} s^{(i)}[n]\,h_k^{*}[n]$$
For the Morlet wavelet, $\varphi_k$ is calculated for each wavelet node by substituting $h_k$ into this inner product. The neuron's output $o$ is calculated from the weighted sum of the outputs of the previous layer, $x_k$, the neuron's threshold, $\theta$, and its activation function $f$, for which the sigmoidal function is used in this paper:
$$o = f\!\left(\sum_{k=1}^{K} w_k^{[2]} \cdot x_k + \theta^{[2]}\right) \qquad (11)$$
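The forward pass described in this section (wavelet layer, hidden layer, output layer) can be sketched as below. Several details are assumptions rather than statements of the source: the centre frequency `omega0 = 5.0`, the use of the modulus of the complex inner product as the wavelon output, the logistic form of the sigmoid, the layer sizes, and every variable name.

```python
import numpy as np


def wavelon_outputs(signal, a, b, omega0=5.0):
    """Modulus of the inner product between the signal and each wavelet node."""
    n = np.arange(len(signal))
    z = (n[None, :] - b[:, None]) / a[:, None]             # shape (K, N)
    nodes = np.exp(1j * omega0 * z) * np.exp(-0.5 * z**2)  # shifted, dilated Morlet
    return np.abs(np.conj(nodes) @ signal)                 # shape (K,)


def sigmoid(x):
    """Logistic sigmoid (assumed form of the sigmoidal activation)."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(signal, p):
    """Wavelet layer, then hidden layer [1], then output layer [2]."""
    features = wavelon_outputs(signal, p["a"], p["b"])
    hidden = sigmoid(p["W1"] @ features + p["theta1"])
    return sigmoid(p["W2"] @ hidden + p["theta2"])


rng = np.random.default_rng(0)
n_bands, n_wavelons, n_hidden, n_classes = 128, 2, 4, 5
params = {
    "a": np.array([4.0, 4.0]),      # scale parameters
    "b": np.array([30.0, 90.0]),    # translation parameters
    "W1": rng.normal(size=(n_hidden, n_wavelons)),
    "theta1": np.zeros(n_hidden),
    "W2": rng.normal(size=(n_classes, n_hidden)),
    "theta2": np.zeros(n_classes),
}

# Hypothetical spectrum: an oscillating feature centred on band 30.
bands = np.arange(n_bands)
z0 = (bands - 30.0) / 4.0
spectrum = np.cos(5.0 * z0) * np.exp(-0.5 * z0**2)

features = wavelon_outputs(spectrum, params["a"], params["b"])
scores = forward(spectrum, params)
```

In this toy input the wavelon placed at band 30 responds strongly and the one at band 90 hardly at all, which illustrates why the position and scale of each wavelon decide which spectral features reach the classifier.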

Training of Wavelet Networks
During the training phase, the ANN weights $w$ are adjusted to minimize the total least-square error $E$ between the net's desired output vector $d_i$ and its actual output $o_i$ for all input vectors $s^{(i)}$.
The minimization problem can be solved by an iterative gradient technique. The partial derivatives with respect to the weights, $w$, are calculated according to the generalized delta rule, where $w$ collects all adjustable parameters:
$$w = \{a_k,\ b_k,\ w^{[1]},\ \theta^{[1]},\ w^{[2]},\ \theta^{[2]}\} \qquad (14)$$
The partial derivatives with respect to the weights of the neurons in the output and hidden layers follow from this rule and hold for neurons with a sigmoidal activation function.
In the wavelet network, not only the weights but also the parameters of the wavelet nodes are adjusted. The partial derivatives for a wavelet node's scale parameter, $a_k$, and its shift parameter, $b_k$, depend on the wavelet basis chosen and are determined using the back-propagated error $\delta_k$. Thus, in each iteration of the training cycle, the weights and the wavelet parameters are varied to reduce the error, $E$. This procedure is repeated until the net has settled down to a minimum.
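To make the chain rule behind these updates concrete, the sketch below works through the smallest possible case: one wavelon feeding one sigmoid neuron. It is an illustration under stated assumptions, not the network of this study: the real-valued Morlet form cos(ω₀z)·exp(−z²/2) is swapped in for the complex wavelet so that the derivatives stay short, the error is taken as ½(d − o)², and the names, learning rate, and input are hypothetical.

```python
import numpy as np

OMEGA0 = 5.0  # assumed centre frequency of the wavelet


def wavelon(signal, a, b):
    """Output c of one real-valued Morlet wavelon and its derivatives w.r.t. a and b."""
    n = np.arange(len(signal))
    z = (n - b) / a
    env = np.exp(-0.5 * z**2)
    c = np.sum(signal * np.cos(OMEGA0 * z) * env)
    # Derivative of cos(OMEGA0 z) exp(-z^2 / 2) with respect to z.
    dh_dz = (-OMEGA0 * np.sin(OMEGA0 * z) - z * np.cos(OMEGA0 * z)) * env
    dc_da = np.sum(signal * dh_dz * (-z / a))     # uses dz/da = -z / a
    dc_db = np.sum(signal * dh_dz * (-1.0 / a))   # uses dz/db = -1 / a
    return c, dc_da, dc_db


def loss_and_grads(signal, d, p):
    """Squared error of a single sigmoid neuron fed by one wavelon, plus dE/dparameter."""
    c, dc_da, dc_db = wavelon(signal, p["a"], p["b"])
    o = 1.0 / (1.0 + np.exp(-(p["w"] * c + p["theta"])))
    error = 0.5 * (d - o) ** 2
    delta = -(d - o) * o * (1.0 - o)              # back-propagated error dE/dnet
    grads = {
        "w": delta * c,
        "theta": delta,
        "a": delta * p["w"] * dc_da,
        "b": delta * p["w"] * dc_db,
    }
    return error, grads


def gradient_step(signal, d, p, eta=0.1):
    """One plain gradient-descent update of every parameter."""
    _, grads = loss_and_grads(signal, d, p)
    return {k: p[k] - eta * grads[k] for k in p}


bands = np.arange(64)
z0 = (bands - 30.0) / 4.0
spectrum = np.cos(OMEGA0 * z0) * np.exp(-0.5 * z0**2)   # hypothetical input
params = {"a": 5.0, "b": 28.0, "w": 0.3, "theta": -0.1}
error0, grads0 = loss_and_grads(spectrum, 1.0, params)
params1 = gradient_step(spectrum, 1.0, params)
```

The same back-propagated quantity `delta` drives all four updates; the wavelet parameters differ from the weights only in the extra factor contributed by the derivative of the wavelet itself. Comparing `grads0` with finite differences of the error is a quick way to validate such derivations before scaling up to several layers.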

EXPERIMENTS
In this study, a set of hyperspectral data was used to test the classification performance of the two methods mentioned above. The diagram of the experiment is illustrated in Figure 1.
The study image is a small segment of an AVIRIS image. The image covers the Indian Pines test site in NW Indiana and was taken in 1991. The image size is 145 pixels × 145 pixels. The original data set has 224 spectral bands from 400 nm to 2450 nm with 10 nm spectral resolution. The number of bands is 220 after removing 4 noisy bands. The radiance spectra are used directly to test the feature extraction method without performing any kind of atmospheric correction. The ground truth data include five different classes: "grass/trees", "soybeans-min", "soybeans-notill", "hay-windrowed" and "woods". Figure 3 shows the information related to the test image. For each known class, 50 training samples were taken from the ground truth data, and additional labeled test samples were used to assess the accuracy of the classification.
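A split with a fixed number of training samples per class can be drawn as sketched below. The random selection, the seed, the class sizes, and the function name are assumptions; the source only states that 50 training samples per class were taken and that further labeled samples served for accuracy assessment.

```python
import numpy as np


def split_per_class(labels, n_train, seed=0):
    """Return index arrays (train, test) with n_train randomly chosen samples per class."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        train.append(idx[:n_train])
        test.append(idx[n_train:])
    return np.concatenate(train), np.concatenate(test)


# Hypothetical ground truth: 5 classes with made-up numbers of labeled pixels.
counts = [300, 400, 250, 200, 350]
labels = np.repeat(np.arange(5), counts)
train_idx, test_idx = split_per_class(labels, n_train=50)
```

With 220 bands and only 50 samples per class, the number of training samples per class is smaller than the dimensionality, which is the limited-sample situation the wavelet network is meant to cope with.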
In order to develop a wavelet network image classifier suitable for hyperspectral images, three experiments were designed to look for a proper wavelet network classifier. The performance of the wavelet networks was evaluated by the mean square error (MSE).
Figure 3. Test image data: (a) test image; (b) ground truth data

Experiment 1: Learning Rate and Momentum Term
An experiment was set up to test three back-propagation algorithms: the general gradient descent algorithm, gradient descent with a momentum term, and the quickprop method. The momentum term and the quickprop method are improvements that accelerate the learning process. The performance (MSE) shown in Figure 4 was used to evaluate the efficiency of learning. An oscillatory curve was obtained with the simple gradient descent method, which indicates that a large initial learning rate was chosen in this case. In comparison, with the added momentum term the oscillations were smoothed out because of the adaptive learning strategy. In addition, the quickprop method led to the most effective learning process in this experiment, with the quickest and smoothest convergence in Figure 4.
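The effect of the momentum term can be reproduced on a toy problem. The sketch below is not the experiment of this study: it minimizes a hypothetical two-parameter quadratic error surface with very different curvatures, and the learning rate 0.075 and momentum 0.5 are arbitrary choices for illustration. Quickprop is not shown.

```python
import numpy as np

CURVATURE = np.array([1.0, 25.0])   # hypothetical ill-conditioned quadratic error surface


def error(w):
    return 0.5 * np.sum(CURVATURE * w**2)


def gradient(w):
    return CURVATURE * w


def train(eta, mu, steps=50):
    """Gradient descent with momentum mu (mu = 0 gives plain gradient descent)."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    history = [error(w)]
    for _ in range(steps):
        v = mu * v - eta * gradient(w)   # the momentum term reuses the previous step
        w = w + v
        history.append(error(w))
    return np.array(history)


err_plain = train(eta=0.075, mu=0.0)
err_momentum = train(eta=0.075, mu=0.5)
```

With plain gradient descent the steep direction overshoots and changes sign at every step, so its amplitude decays slowly; with momentum the same learning rate damps that oscillation faster and also speeds up progress along the shallow direction, so the final error after the same number of steps is lower.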
Table 1 shows the classification results obtained by networks trained until the MSE reached 0.5. As discussed above, quickprop required the fewest iterations. Further, good generalization ability was observed, because the minimum MSE corresponded to the best classification result in Table 1. These results suggest that quickprop is a practical improved gradient algorithm for wavelet networks. Moreover, a more extensive study of the applicability of other artificial intelligence techniques, such as support vector machines, fuzzy logic and genetic algorithms, is another interesting topic.

Figure 1. Comparison between neural networks with WFE and wavelet networks

Figure 4. The performance (MSE) of three back-propagation algorithms

Figure 5 shows the locations of the wavelons for all modes on the time-frequency plane after training. Several common regions were selected as features by different modes, which suggests that wavelet networks can extract useful features during training. Moreover, these features were distributed mostly in the high-frequency domain. Features extracted by wavelet networks initialized with different modes converged to the same regions of the time-frequency plane, especially at higher frequencies (the detail space). In multi-resolution analysis, useful features for distinguishing objects can be found in the detail space.

Figure 5. Locations of learned wavelons on time-frequency plane

Figure 6. Modulus of coefficients and learned wavelons of mode 5

Figure 7. Classification results with increasing numbers of wavelons (features)

Table 1. Classification results and the number of learning iterations at MSE = 0.5

Experiment 2: Different Initialization Modes
Several initialization modes were used to test the performance of wavelet networks. As listed in Table 2, better classification results were given by mode 1 and mode 5. The initial values set by these two modes were well distributed over the whole band-scale space. The results indicate that useful features can be found in both the high-frequency field (detail space) and the low-frequency field (approximation space). Further, using wavelet networks with randomly chosen initial scaling and translating values is a feasible and easy procedure.
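Two simple ways of spreading the initial wavelons over the band-scale space are sketched below: a regular grid and a uniform random draw. These are hypothetical strategies written for illustration and are not claimed to correspond to any of the numbered modes of this study; the function names, the scale range, and the number of wavelons are assumptions.

```python
import numpy as np


def init_grid(n_bands, n_positions, scales):
    """Place wavelons on a regular grid over the band-scale plane (hypothetical mode)."""
    positions = np.linspace(0, n_bands - 1, n_positions)
    b, a = np.meshgrid(positions, scales)
    return b.ravel(), a.ravel()


def init_random(n_bands, n_wavelons, a_min, a_max, seed=0):
    """Draw translations and scales uniformly at random (hypothetical mode)."""
    rng = np.random.default_rng(seed)
    b = rng.uniform(0, n_bands - 1, n_wavelons)
    a = rng.uniform(a_min, a_max, n_wavelons)
    return b, a


# 220 bands as in the test image; 30 wavelons is an arbitrary choice.
b_grid, a_grid = init_grid(220, 10, [2.0, 8.0, 32.0])
b_rand, a_rand = init_random(220, 30, 2.0, 32.0)
```

Both functions return one translation and one scale per wavelon, which can then be refined by training. A grid guarantees coverage of every band-scale region, whereas a random draw needs no design decisions, in line with the remark above that random initial values are a feasible and easy procedure.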