SCENE CLASSIFICATION BASED ON THE SEMANTIC-FEATURE FUSION FULLY SPARSE TOPIC MODEL FOR HIGH SPATIAL RESOLUTION REMOTE SENSING IMAGERY

Topic modeling has become an increasingly mature method for bridging the semantic gap between low-level features and high-level semantic information. However, as the volume of high spatial resolution (HSR) images grows, the conventional probabilistic topic model (PTM) usually represents the images with a dense semantic representation, which consumes more time and requires more storage space. In addition, because of the complex spectral and spatial information, combining multiple complementary features has proved to be an effective strategy for improving the performance of HSR image scene classification. How the distinct features are fused to fully describe the challenging HSR images is, however, a critical factor for scene classification. In this paper, a semantic-feature fusion fully sparse topic model (SFF-FSTM) is proposed for HSR imagery scene classification. In SFF-FSTM, three heterogeneous features (the mean and standard deviation based spectral feature, the wavelet-based texture feature, and the dense scale-invariant feature transform (SIFT) based structural feature) are fused at the latent semantic level. The combination of the multiple semantic-feature fusion strategy and the sparse FSTM is able to provide adequate feature representations, and can achieve comparable performance with limited training samples. Experimental results on the UC Merced dataset and the Google dataset of SIRI-WHU demonstrate that the proposed method can improve the performance of scene classification compared with other scene classification methods for HSR imagery.


INTRODUCTION
The rapid development of earth observation and remote sensing techniques has led to a large amount of high spatial resolution (HSR) images with abundant spatial and structural information. Some of the most popular approaches are the object-based and contextual-based methods, which can achieve precise object recognition (Bellens et al., 2008; Rizvi and Mohan, 2011; Tilton et al., 2012). Nevertheless, HSR scenes often contain diverse land-cover objects, such as roads, lawns, and buildings. Objects of the same type may vary in their spectral or structural low-level features. Different distributions of the same land-cover objects may yield different types of semantic scenes, and scenes of the same type may consist of different types of simple objects. Methods based on low-level features are therefore unable to capture the complex semantic concepts of different scene images. This leads to a divergence between the low-level data and the high-level semantic information, namely the "semantic gap" (Bratasanu et al., 2011). Bridging the semantic gap for HSR imagery is a major challenge. Scene classification, which automatically labels an image from a set of semantic categories (Bosch et al., 2007), is an effective way to address it and has been receiving more and more attention (Yang and Newsam, 2010; Cheriyadat, 2014; Zhao et al., 2013; Zhao et al., 2016b; Zhao et al., 2016c). Among the various scene classification methods, the bag-of-visual-words (BOVW) model has been successfully applied to capture the high-level semantics of HSR scenes without the object recognition required by object-based scene classification methods (Zhao et al., 2014).
Based on the BOVW model, the probabilistic topic model (PTM) represents the scenes as a random mixture of visual words. Commonly used PTMs, such as probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) and latent Dirichlet allocation (LDA) (Blei et al., 2003), mine the latent topics from the scenes and have been employed to solve the challenges of HSR image scene classification (Bosch et al., 2008; Liénou et al., 2010; Văduva et al., 2013).
To acquire latent semantics, the feature descriptors captured from HSR images are critical for PTM. In general, a single feature is employed, which is inadequate (Zhong et al., 2015). Multi-feature based scene classification methods have also been proposed (Shao et al., 2013; Zheng et al., 2013; Tokarczyk et al., 2015). Considering the distinct characteristics of HSR images, the features should be carefully designed to capture the abundant spectral and complex structural information. In addition, the different features are usually fused before k-means clustering, so that a single dictionary and a single topic space are acquired for all the features. This leads to mutual interference between the different features (Zhong et al., 2015), and cannot circumvent the inadequate clustering capacity of hard-assignment based k-means clustering, which is not efficient in a high-dimensional feature space.
With the development of PTM for HSR scene classification, two issues should be considered. The first is how to infer sparser latent representations of the HSR images. The second is how to design more efficient inference algorithms for PTM. To achieve good performance for a huge volume of HSR image scenes, we may have to increase the number of topics to obtain more semantic information. However, in the LDA model, for instance, the topic proportions are drawn from a Dirichlet distribution with the parameter $\alpha$, and every component is greater than 0 no matter how $\alpha$ varies (Blei et al., 2003). This leads to a dense topic representation of the HSR images, which requires more storage and is time consuming. Another approach is to impose sparsity constraints on the topics by changing the objective function of the model (Shashanka et al., 2007; Zhu and Xing, 2011). However, model selection then has to be performed over the auxiliary parameters of the regularization terms, which is problematic when dealing with a large HSR image dataset. Fivefold cross validation is often performed on the experimental dataset to guarantee enough training samples for classification accuracy (Yang and Newsam, 2010; Cheriyadat, 2014). Reducing the number of training samples would be more practical.
Inspired by the aforementioned work, we present a semantic-feature fusion fully sparse topic model (SFF-FSTM) for HSR image scene classification. The fully sparse topic model (FSTM), proposed by Than and Ho (2012) for modeling large collections of documents, is utilized to model HSR imagery for the following reason. Based on the similarity between documents and images, FSTM is able to remove redundant information and infer sparse semantic representations with a shorter inference time. Having acquired sparse latent topics in this way, we intend to use a limited number of images as training samples, which is more in line with practical applications. To the best of our knowledge, no such PTM based scene classification method with limited training samples has been developed to date. However, FSTM alone is unable to fully exploit the information provided by the limited training samples with sparse representations. Hence, in SFF-FSTM, three complementary features are selected to describe HSR images: the dense scale-invariant feature transform (SIFT) feature as the structural feature, the mean and standard deviation as the spectral feature, and the wavelet feature as the texture feature. Based on this feature description, a semantic-feature fusion strategy is designed to fuse the three features after semantic mining in three distinct topic spaces. This provides fully mined semantic information of the HSR imagery from three complementary perspectives, without mutual interference between the features and without the impact of joint clustering. A support vector machine (SVM) with a histogram intersection kernel (HIK) is incorporated to increase the discrimination between different scenes. The combination of the multiple semantic-feature fusion strategy and the sparse representation based FSTM is able to trade off sparsity against the quality of the inferred semantic information and the inference time, and presents a performance comparable to that of existing relevant methods.
The rest of the paper is organized as follows. The next section details the procedure of the proposed SFF-FSTM for HSR image scene classification. A description of the experimental datasets and an analysis of the experimental results are presented in Section 3. Conclusions are discussed in the last section.

Probabilistic Topic model
Based on the "bag-of-words" assumption, the generative probabilistic models of PTM, including PLSA, LDA, and FSTM, are applied to HSR images by utilizing a visual analog of a word, acquired by vector quantizing spectral, texture, and structural features such as region descriptors (Bosch et al., 2008).
Each image can then be represented as a set of visual words from the visual dictionary. By introducing latent topics, each characterized by a distribution over words, PTM models the images as random mixtures over the latent variable space.
Among the various PTMs, PLSA, the classical PTM, was proposed by Hofmann (2001). PLSA models the probability of a visual word $w_j$ in an image $d_i$ as a mixture over $K$ latent topics $z_k$:

$$p(w_j \mid d_i) = \sum_{k=1}^{K} p(w_j \mid z_k)\, p(z_k \mid d_i) \qquad (1)$$

The mixing weight $p(z_k \mid d_i)$ is the semantic information which PTM mines from the visual words of HSR images. It can be seen that PLSA lacks a probability function to describe the images. As a result, PLSA is unable to assign a probability to images outside the training samples, and the number of model parameters grows linearly with the size of the image dataset.
Hence, Blei et al. (2003) proposed LDA, which introduces a Dirichlet distribution over the topic mixture $\theta$ on the basis of the PLSA model. The $k$-dimensional random variable $\theta$ follows the Dirichlet distribution with the parameter $\alpha$, where $k$ is assumed to be known and fixed. The LDA model provides a probability function for the discrete latent topics in PLSA, making it a complete PTM. However, every component of the Dirichlet variable is greater than 0 however $\alpha$ varies. The latent representation of HSR imagery produced by LDA is therefore dense, and requires a large amount of memory for storage when many images are modeled. In addition, the inference algorithm of the LDA model is complex and time consuming.
In 2012, Than and Ho proposed FSTM for modeling large collections of documents, with an application to supervised dimension reduction. FSTM uses the Frank-Wolfe algorithm, a sparse approximation algorithm, for inference. It follows a greedy approach and has been proven to converge at a linear rate to the optimal solution. In FSTM, the latent topic proportion $\theta$ is a convex combination of at most $l+1$ vertices of the topic simplex after $l$ iterations, which gives the implicit constraint $\|\theta\|_0 \le L+1$ after $L$ iterations. Hence, we choose FSTM, with its sparse solutions, to model the HSR imagery in this paper.

Complementary feature description
As can be seen from Fig. 1(a), it is difficult to distinguish the parking lot from the harbor by either the structural or the textural characteristics. However, because of the spectral difference between ocean and road, the spectral characteristics play an important role. In Fig. 1(b), the storage tanks and dense residential scenes differ mainly in their structural characteristics. In addition, it can be seen from Fig. 1(c) that the forest and agriculture scenes are similar in their spectral and structural characteristics, but differ greatly in their textural information from the global perspective. Considering the abundant spectral characteristics and the complex spatial arrangement of HSR imagery, three complementary features are designed for the HSR imagery scene classification task. Before feature descriptor extraction, the images are split into image patches using a uniform grid sampling method.

Spectral feature:
The spectral feature reflects the attributes that constitute the ground components and structures.
The first-order statistic (the mean) and the second-order statistic (the standard deviation) of the image patches are calculated in each spectral channel as the spectral feature:

$$\mathrm{mean}_j = \frac{1}{n}\sum_{i=1}^{n} v_{ij} \qquad (2)$$

$$\mathrm{std}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(v_{ij} - \mathrm{mean}_j\right)^2} \qquad (3)$$

In (2) and (3), $n$ is the total number of pixels in the sampled patch, and $v_{ij}$ denotes the $j$-th band value of the $i$-th pixel in the patch. In this way, the mean ($\mathrm{mean}_j$) and standard deviation ($\mathrm{std}_j$) of the spectral vector of the patch are acquired.
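As a minimal sketch of the spectral descriptor described above, the following computes the per-band mean and standard deviation of one patch. The function name `spectral_feature`, the pixel-list layout, and the choice of dividing by $n$ (rather than $n-1$) in the standard deviation are assumptions made for illustration.

```python
import math

def spectral_feature(patch):
    """Per-band mean and standard deviation of one image patch.

    patch: list of pixels, each pixel a list of band values v_ij.
    Returns [mean_1, ..., mean_B, std_1, ..., std_B].
    Assumption: the standard deviation divides by n (population form).
    """
    n = len(patch)
    bands = len(patch[0])
    means = [sum(pixel[j] for pixel in patch) / n for j in range(bands)]
    stds = [
        math.sqrt(sum((pixel[j] - means[j]) ** 2 for pixel in patch) / n)
        for j in range(bands)
    ]
    return means + stds

# Hypothetical 4-pixel, 3-band patch used only to exercise the function.
patch = [[10, 20, 30], [12, 20, 34], [14, 20, 32], [12, 20, 32]]
feature = spectral_feature(patch)
```

For a three-band image this gives a 6-dimensional descriptor per patch, with the three means first and the three standard deviations after them.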

Texture feature:
The texture feature contains information about the spatial distribution of tonal variations within a band (Haralick et al., 1973), and can account for both the macroscopic properties and the fine structure. Wavelet transforms enable the decomposition of the image into different frequency sub-bands, similar to the way the human visual system operates (Huang and Aviyente, 2008). This makes them especially suitable for image classification, and a multilevel 2-D wavelet decomposition is utilized to capture the texture feature from the HSR images. The number of decomposition levels is optimally set to 3.
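The three-level 2-D decomposition can be illustrated with a small sketch. The text does not name the wavelet family or the statistic taken from each sub-band, so the Haar wavelet (in an averaging, non-orthonormal form) and the mean absolute coefficient used here are assumptions, as are all function names.

```python
def haar_step(img):
    """One level of a 2-D Haar-style decomposition (averaging variant).

    img: list of rows with even height and width.
    Returns (approx, (d_rows, d_cols, d_diag)), each half the input size.
    """
    approx, d_rows, d_cols, d_diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_r, row_c, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_a.append((a + b + d + e) / 4.0)  # local average
            row_r.append((a + b - d - e) / 4.0)  # top row minus bottom row
            row_c.append((a - b + d - e) / 4.0)  # left column minus right column
            row_d.append((a - b - d + e) / 4.0)  # diagonal difference
        approx.append(row_a)
        d_rows.append(row_r)
        d_cols.append(row_c)
        d_diag.append(row_d)
    return approx, (d_rows, d_cols, d_diag)

def mean_abs(band):
    """Mean absolute coefficient of one sub-band (assumed statistic)."""
    values = [abs(v) for row in band for v in row]
    return sum(values) / len(values)

def wavelet_texture_feature(img, levels=3):
    """Three detail statistics per level plus one for the final approximation."""
    feature = []
    approx = img
    for _ in range(levels):
        approx, details = haar_step(approx)
        feature.extend(mean_abs(band) for band in details)
    feature.append(mean_abs(approx))
    return feature

# Hypothetical 8x8 patch with vertical stripes (columns alternate 0 and 8).
patch = [[(c % 2) * 8.0 for c in range(8)] for _ in range(8)]
feature = wavelet_texture_feature(patch)
```

With three levels the sketch yields 10 values per patch: nine detail statistics and one approximation statistic. An 8 × 8 patch is the smallest size that supports three levels of halving.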

Structural feature:
The SIFT feature (Lowe, 2004) has been widely applied in image analysis since it is robust to added noise, affine transformation, and changes in illumination, and it compensates for the deficiency of the spectral feature for HSR imagery. Each image patch is split into $4 \times 4$ neighbourhood regions, and a gradient orientation histogram with eight directions is counted in each region. Hence, the gray dense SIFT descriptor with 128 dimensions is extracted as the structural feature. This was inspired by previous work in which dense features performed better for scene classification (Li and Perona, 2006), and by Lowe (2004), who suggests that a $4 \times 4 \times 8 = 128$ dimensional vector is optimal for the keypoint descriptor.

Multiple Semantic-feature Fusion Fully Sparse Topic Model for HSR Imagery with Limited Training Samples
Previous studies have shown that a uniform grid sampling method can be more effective than other sampling methods such as random sampling (Li and Perona, 2006). The image patches acquired by uniformly sampling the HSR images are therefore described by the spectral, texture, and SIFT features, and three types of feature descriptors, $D_1$, $D_2$, and $D_3$, are obtained. However, under the influence of illumination, rotation, and scale variation, the same visual word in different images may be endowed with various feature values. k-means clustering is applied to quantize the feature descriptors and generate a 1-D frequency histogram, so that image patches with similar feature values correspond to the same visual word. By statistical analysis of the frequency of each visual word, we can obtain the corresponding visual dictionary.
Conventional methods usually concatenate the three types of feature descriptors directly to make up a long feature vector $F_1 = [D_1, D_2, D_3]$. The long vector is then quantized by k-means clustering to generate a single 1-D histogram for all the features. As the features interfere with each other during clustering, this 1-D histogram is unable to fully describe the HSR imagery. In SFF-FSTM, the spectral, texture, and SIFT features are quantized separately by the k-means clustering algorithm to acquire three distinct 1-D histograms, $H_1$, $H_2$, and $H_3$. By introducing probability theory, each element of the 1-D histograms is transformed into a word occurrence probability. To mine the most discriminative semantic features, which is also the core idea of PTM, the three histograms are mined separately by SFF-FSTM to generate three distinct latent topic spaces. This differs from the conventional strategies, which fuse the three histograms before topic modeling and obtain only one latent topic space, which is inadequate.
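The separate quantization step can be sketched as follows: each feature type has its own visual dictionary, each descriptor is assigned to its nearest visual word, and the resulting counts are normalized into word occurrence probabilities. The dictionaries are assumed to have been learned beforehand by k-means; the toy dictionary values, descriptor values, and function names are hypothetical.

```python
def nearest_word(descriptor, dictionary):
    """Index of the closest visual word under squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(dictionary)), key=lambda k: sq_dist(descriptor, dictionary[k]))

def word_histogram(descriptors, dictionary):
    """1-D frequency histogram of visual words for one image and one feature type."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, dictionary)] += 1
    return counts

def to_probabilities(counts):
    """Turn word counts into word occurrence probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical dictionaries, one per feature type (not fused before clustering).
spectral_dictionary = [[0.0, 0.0], [10.0, 10.0]]
texture_dictionary = [[0.0], [5.0], [10.0]]

# Hypothetical patch descriptors of a single image.
spectral_descriptors = [[1.0, 1.0], [9.0, 9.0], [11.0, 10.0], [0.0, 2.0]]
texture_descriptors = [[0.5], [4.0], [6.0], [9.0]]

h_spectral = word_histogram(spectral_descriptors, spectral_dictionary)
h_texture = word_histogram(texture_descriptors, texture_dictionary)
p_spectral = to_probabilities(h_spectral)
p_texture = to_probabilities(h_texture)
```

Keeping one dictionary per feature type means the dictionary sizes can differ, and a descriptor from one feature never influences the cluster assignment of another.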
Specifically, SFF-FSTM chooses a $k$-dimensional latent variable $\theta$. Given an image $M$ and $K$ topics, the latent topic proportion $\theta$ of $M$ is inferred by maximizing the likelihood of $M$. Hence, suppose there are $N$ images; for $H_1$, $H_2$, and $H_3$, $K_1$, $K_2$, and $K_3$ topics are assumed to compose the images, respectively. The latent semantics of $H_1$, $H_2$, and $H_3$, denoted as $\theta_1$, $\theta_2$, and $\theta_3$, respectively, are inferred with the Frank-Wolfe algorithm. The semantic features $\theta_1$, $\theta_2$, and $\theta_3$ of all the HSR images are then fused at the semantic level, thus obtaining the final multiple semantic-feature $F_2 = [\theta_1, \theta_2, \theta_3]$, which is sparse. Finally, $F_2$, with its optimal discriminative characteristics, is classified by an SVM classifier with a HIK to predict the scene label. The HIK measures the degree of similarity between two histograms, can deal with scale changes, and has been applied to image classification using color histogram features (Barla et al., 2003). Given the SFF-FSTM representation vectors of the images, the HIK is calculated according to (7). In this way, SFF-FSTM provides a complementary feature description, an effective image representation strategy, and an adequate topic modeling procedure for HSR image scene classification, even with limited training samples, which will be tested in the experimental section. The proposed scene classification based on SFF-FSTM is shown in Fig. 2.
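The fusion at the semantic level and the kernel computation can be sketched as below. The sketch assumes that fusion is a plain concatenation of the three topic-proportion vectors and uses the standard histogram intersection, the sum of element-wise minima; the function names and the toy topic proportions are hypothetical.

```python
def fuse(theta_spectral, theta_texture, theta_sift):
    """Concatenate the three topic-proportion vectors of one image (assumed fusion rule)."""
    return list(theta_spectral) + list(theta_texture) + list(theta_sift)

def hik(u, v):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(u, v))

def gram_matrix(features):
    """Pairwise HIK values between all fused image representations."""
    return [[hik(u, v) for v in features] for u in features]

# Hypothetical sparse topic proportions for two images (K1=3, K2=2, K3=3).
image_a = fuse([0.5, 0.5, 0.0], [1.0, 0.0], [0.0, 0.25, 0.75])
image_b = fuse([0.0, 1.0, 0.0], [0.5, 0.5], [0.0, 0.0, 1.0])

gram = gram_matrix([image_a, image_b])
```

A Gram matrix of this form is what an SVM implementation that accepts a precomputed kernel would take as input. Because each of the three blocks sums to 1, the self-similarity of any fused vector is 3 under this fusion rule.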

Experimental Design
The commonly used 21-class UC Merced dataset and a 12-class Google dataset of SIRI-WHU were used to test the performance of SFF-FSTM. In the experiments, the images were uniformly sampled with a patch size of 8 pixels and a spacing of 4 pixels. To test the stability of the proposed SFF-FSTM, the different methods were executed 100 times with a random selection of training samples, to obtain convincing results for the two datasets. k-means clustering with the Euclidean distance measure was applied to the image patches from the training set to construct the visual dictionary, which is the set of V visual words. K topics were selected for FSTM. The visual word number V and the topic number K were the two free parameters in our method. Taking the computational complexity and the classification accuracy into consideration, V and K were optimally set as in Table 1 and Table 3 for the different feature strategies with the two datasets. In Tables 1, 2, 3, and 4, SPECTRAL, TEXTURE, and SIFT denote scene classification utilizing the mean and standard deviation based spectral feature, the wavelet-based texture feature, and the SIFT-based structural feature, respectively. The proposed method, which fuses the multiple semantic features at the latent topic level, is referred to as the SFF strategy.
To further evaluate the performance of SFF-FSTM, the experimental results obtained with SPM (Lazebnik et al., 2006), PLSA (Bosch et al., 2008), and LDA (Liénou et al., 2010), together with the results on the UC Merced dataset published by Yang and Newsam (2010), Cheriyadat (2014), Chen and Tian (2015), Mekhalfi et al. (2015), and Zhao et al. (2016a), are shown for comparison. SPM employed dense gray SIFT, and the number of spatial pyramid layers was optimally selected as one. For the Google dataset of SIRI-WHU, the results obtained with SPM (Lazebnik et al., 2006), PLSA (Bosch et al., 2008), and LDA (Liénou et al., 2010), together with the results published by Zhao et al. (2016a), are also shown for comparison.
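The uniform grid sampling with a patch size of 8 pixels and a spacing of 4 pixels can be sketched as follows. The sketch assumes that only patches lying fully inside the image are kept (no border padding); the function name `grid_origins` is hypothetical.

```python
def grid_origins(height, width, patch_size=8, spacing=4):
    """Top-left corners of all patches on a uniform grid.

    Assumption: patches that would cross the image border are dropped.
    """
    return [
        (r, c)
        for r in range(0, height - patch_size + 1, spacing)
        for c in range(0, width - patch_size + 1, spacing)
    ]

# Image sizes of the two datasets used in the experiments.
uc_merced_patches = grid_origins(256, 256)
siri_whu_patches = grid_origins(200, 200)
```

Under these assumptions a 256 × 256 image yields 63 × 63 = 3969 overlapping patches, and a 200 × 200 image yields 49 × 49 = 2401.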

Experiment 1: The UC Merced Image Dataset
The UC Merced dataset was downloaded from the USGS National Map Urban Area Imagery collection (Yang and Newsam, 2010). This dataset consists of 21 land-use scenes (Fig. 3), namely agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class contains 100 images, measuring $256 \times 256$ pixels, with a 1-ft spatial resolution. Following the experimental setup of Yang and Newsam (2010), 80 samples were randomly selected per class from the UC Merced dataset for training, and the rest were kept for testing. The classification performance of the different strategies based on FSTM and the comparison with the experimental results of previous methods for the UC Merced dataset are reported in Table 2. As can be seen from Table 2, the classification results of the single-feature based FSTM are unsatisfactory. The classification result of 94.55% ± 1.02% for the proposed SFF-FSTM is the best among the different methods, and is a marked improvement over the single-feature strategies. This indicates that the combination of the multiple semantic-feature fusion strategy and the sparse representation based FSTM is able to trade off sparsity against the quality of the inferred semantic information and the inference time. In addition, SFF-FSTM outperforms SPM (Lazebnik et al., 2006), PLSA (Bosch et al., 2008), LDA (Liénou et al., 2010), the Yang and Newsam method (2010), the Cheriyadat method (2014), the Chen and Tian method (2015), the Mekhalfi et al. method (2015), and the Zhao et al. method (2016a).

Experiment 2: The Google Dataset of SIRI-WHU
The Google dataset of SIRI-WHU consists of 12 land-use classes, namely meadow, pond, harbor, industrial, park, river, residential, overpass, agriculture, water, commercial, and idle land, as shown in Fig. 4. Each class contains 200 images, which were cropped to 200 × 200 pixels, with a spatial resolution of 2 m. In this experiment, 100 training samples were randomly selected per class from the Google dataset, and the remaining samples were retained for testing. The classification results are reported in Table 4. As can be seen from Table 4, the classification result of 97.83% ± 0.93% for the proposed SFF-FSTM is much better than those of the spectral, texture, and SIFT based FSTM methods, which confirms that the framework incorporating multiple semantic-feature fusion and FSTM is a competitive approach for HSR image scene classification.
The number of training samples per class was varied over [80, 60, 40, 20, 10, 5] for the UC Merced dataset, and over [100, 80, 60, 40, 20, 10] for the Google dataset of SIRI-WHU. The classification accuracies with different numbers of training samples for the UC Merced dataset and the Google dataset of SIRI-WHU are reported in Table 5 and Table 6, respectively. The corresponding curves are shown in Fig. 5.
As can be seen from Table 5, Table 6, and Fig. 5, the proposed SFF-FSTM performs better than SAL-LDA and is relatively stable as the number of training samples per class decreases for the two datasets. When the proportion of training samples falls below 20%, or even to 10% or 5%, SFF-FSTM displays a smaller fluctuation than SAL-LDA, and maintains a comparatively satisfactory and robust performance with limited training samples.
We also tested and compared the inference efficiency of the proposed SFF-FSTM and SAL-LDA with the spectral feature for the two datasets.
Nevertheless, image patches obtained by the uniform grid method might be unable to preserve the semantic information of a complete scene. It would therefore be desirable to combine image segmentation with scene classification. The clustering strategy, as one of the most important techniques in remote sensing image processing, is another point that should be considered. In our future work, we plan to consider topic models which can take the correlation between image pairs into consideration.
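The random selection of training samples per class used in this experiment can be sketched as follows, using the UC Merced training sizes and 100 images per class. The function name `split_class` and the fixed seed are assumptions made for reproducibility of the illustration, not part of the reported protocol.

```python
import random

def split_class(n_images, n_train, rng):
    """Randomly split the image indices of one class into training and test sets."""
    indices = list(range(n_images))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

rng = random.Random(0)  # hypothetical seed, for a repeatable illustration only
training_sizes = [80, 60, 40, 20, 10, 5]

# One random split per training size for a single class of 100 images.
splits = {n: split_class(100, n, rng) for n in training_sizes}
```

In a full run this split would be redrawn for every class and for each of the repeated executions, with the mean and standard deviation of the accuracy taken over the repetitions.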
Fig. 1. HSR images of the parking lot, harbor, storage tanks, dense residential, forest, and agriculture scene classes: (a) shows the importance of the spectral characteristics for HSR images; (b) shows the importance of the structural characteristics for HSR images; and (c) shows the importance of the textural characteristics for HSR images.

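Given an image $M$ and $K$ topics $\beta_1, \ldots, \beta_K$, the log-likelihood of $M$ is

$$\log P(M) = \sum_{j \in I_M} M_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$$

where $I_M$ is the set of term indices of $M$ in the visual dictionary of $V$ terms, and $M_j$ is the frequency of term $j$ in $M$. Hence, the inference task is to search for $\theta$ to maximize the likelihood of $M$. Unlike other topic models, SFF-FSTM does not infer $\theta$ directly, but reformulates the optimization over $\theta$ as a concave maximization problem over the simplex of topics:

$$x^{*} = \arg\max_{x \in \mathrm{conv}(\beta_1, \ldots, \beta_K)} \sum_{j \in I_M} M_j \log x_j \qquad (5)$$

$$x = \sum_{k=1}^{K} \theta_k \beta_k, \qquad \sum_{k=1}^{K} \theta_k = 1, \qquad \theta_k \ge 0 \qquad (6)$$

It can be seen from (6) that $x$ is a convex combination of the $K$ topics, and by finding the $x$ that maximizes the objective function (5), we can infer the latent topic proportion of the image $M$.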
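The inference task described above can be illustrated with a small Frank-Wolfe sketch over the topic simplex. It assumes that every topic assigns a strictly positive probability to every term (so the logarithm is defined) and uses a ternary search for the step size, which is a choice made here for simplicity. All names and the toy topics are hypothetical, and the sketch is not the authors' implementation.

```python
import math

def log_likelihood(x, counts):
    """Objective: sum of M_j * log(x_j) over the terms that occur in the image."""
    return sum(m * math.log(x[j]) for j, m in enumerate(counts) if m > 0)

def mix(step, vertex, x):
    """Point at fraction `step` along the segment from x to the topic vertex."""
    return [step * b + (1.0 - step) * xi for b, xi in zip(vertex, x)]

def infer_topic_proportions(counts, topics, iterations=10):
    """Frank-Wolfe inference of sparse topic proportions for one image."""
    n_topics = len(topics)
    # Start from the single topic (vertex) with the highest likelihood.
    start = max(range(n_topics), key=lambda k: log_likelihood(topics[k], counts))
    theta = [0.0] * n_topics
    theta[start] = 1.0
    x = list(topics[start])
    for _ in range(iterations):
        # Gradient of the objective with respect to x.
        grad = [m / x[j] if m > 0 else 0.0 for j, m in enumerate(counts)]
        # Greedy step: the vertex best aligned with the gradient.
        best = max(range(n_topics),
                   key=lambda k: sum(g * b for g, b in zip(grad, topics[k])))
        # Ternary search for the step size; the objective is concave along the segment.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            m1 = lo + (hi - lo) / 3.0
            m2 = hi - (hi - lo) / 3.0
            if log_likelihood(mix(m1, topics[best], x), counts) < \
               log_likelihood(mix(m2, topics[best], x), counts):
                lo = m1
            else:
                hi = m2
        step = (lo + hi) / 2.0
        x = mix(step, topics[best], x)
        theta = [(1.0 - step) * t for t in theta]
        theta[best] += step
    return theta, x

# Hypothetical toy model: 3 topics over a dictionary of 4 visual words.
topics = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
counts = [5, 5, 0, 0]  # word frequencies M_j of one image
theta, x = infer_topic_proportions(counts, topics, iterations=2)
```

Each iteration adds at most one new topic to the mixture, so after $L$ iterations at most $L+1$ entries of `theta` are non-zero. In the toy example the third topic is never selected and its proportion stays exactly zero.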

Figure 2. The proposed HSR scene classification based on the SFF-FSTM.
Figure 5. Classification accuracies with different numbers of training samples per class. (a) UC Merced dataset. (b) Google dataset of SIRI-WHU.

4. CONCLUSION
In this paper, we have designed an effective and efficient approach, the semantic-feature fusion fully sparse topic model (SFF-FSTM), for HSR imagery scene classification. The fully sparse topic model (FSTM) was first used for unsupervised dimension reduction of large collections of documents. By combining the novel use of FSTM with the semantic fusion of three distinctive features for HSR image scene classification, SFF-FSTM is able to present a robust feature description for HSR imagery, and achieves competitive performance with limited training samples. The proposed SFF-FSTM can improve the performance of scene classification compared with other scene classification methods on the challenging UC Merced dataset and the Google dataset of SIRI-WHU.

Compared with the other methods in Table 4, namely SPM, the LDA method proposed by Liénou et al. (2010), the PLSA method proposed by Bosch et al. (2008), and the experimental results published by Zhao et al. (2016a), the highest accuracy is achieved by the proposed SFF-FSTM, which presents a performance comparable to that of existing relevant methods.

Experiment 3: Multiple Semantic-feature Fusion Fully Sparse Topic Model for HSR Imagery with Limited Training Samples
By modeling the large collection of images with only a few non-zero latent topic proportions, we intend to deal with HSR imagery with limited training samples, employing SFF-FSTM and SAL-LDA (Zhong et al., 2015).

The inference time of SFF-FSTM is about 3 minutes, whereas SAL-LDA takes almost 40 minutes to infer the spectral-based latent semantics. This indicates that SFF-FSTM is an efficient PTM compared with classical non-sparse PTMs such as SAL-LDA.

Table 5. Performance of SFF-FSTM and SAL-LDA for the UC Merced dataset with limited training samples

Table 6. Performance of SFF-FSTM and SAL-LDA for the Google dataset of SIRI-WHU with limited training samples