A NOVEL SELF-TAUGHT LEARNING FRAMEWORK USING SPATIAL PYRAMID MATCHING FOR SCENE CLASSIFICATION

ABSTRACT: Remote sensing earth observation images have a wide range of applications in areas such as urban planning, agriculture, and environmental monitoring. While industry has benefited from the availability of high resolution earth observation images in recent years, interpreting such images has become more challenging than ever. Among the many machine learning based methods that have been applied successfully to remote sensing scene classification, spatial pyramid matching using sparse coding (ScSPM) is a classical model that has achieved promising classification accuracy on many benchmark data sets. ScSPM is a three-stage algorithm composed of dictionary learning, sparse representation and classification. It is generally believed that the dictionary learning stage, although unsupervised, should use the same data set as the classification stage to get good results. However, recent studies in transfer learning suggest that it might be a better strategy to train the dictionary on a larger data set different from the one to be classified. In this work, we propose an algorithm that combines ScSPM with self-taught learning, a transfer learning framework that trains a dictionary on an unlabeled data set and uses it for multiple classification tasks. In the experiments, we learn the dictionary on the Caltech-101 data set and classify two remote sensing scene image data sets: the UC Merced LandUse data set and the Changping data set. Experimental results show that the classification accuracy of the proposed method is comparable to that of ScSPM. Our work thus provides a new way to reduce the resource cost of learning a remote sensing scene image classifier.


INTRODUCTION
Remote sensing plays an important role in earth observation, and in this area remote sensing image scene classification is one fundamental problem (Cheng et al., 2017). With the increasing development of remote sensing imaging techniques, huge amounts of high spatial resolution images have been acquired. Detailed contents in these images, however, make automatic classification a challenging task.
In the past decade, machine learning based methods have achieved notable success in computer vision, and several of them, for example the support vector machine (SVM) (Mountrakis et al., 2011) and the stacked auto-encoder (Yao et al., 2016), have been applied to high resolution remote sensing image processing. Recent studies show that the bag-of-visual-words (BOVW) model is an effective and robust feature encoding approach (Yang, Newsam, 2010) (Zhu et al., 2016); the higher level image representations it generates can improve the performance of machine learning classifiers such as SVM. Spatial pyramid matching using sparse coding (ScSPM) (Yang et al., 2009) is a BOVW based model that has achieved state-of-the-art performance on several open remote sensing image data sets (Yang et al., 2015) (Wu et al., 2016). In essence, ScSPM uses dictionary learning with sparse coding to train a dictionary that captures salient features; low-level features are then encoded with the dictionary and pooled over a spatial pyramid to form higher level features, which are used as input to the classifier. One bottleneck of ScSPM is the dictionary learning process. The learning objective function is generally difficult to optimize and, moreover, if ScSPM is applied to classify some data set B, then the dictionary should be trained on B. This way of dictionary learning is sometimes called "task-specific". Hence, if we want to use ScSPM on multiple tasks, say to classify data sets B, C, and D, training three dictionaries on their corresponding data sets can be very costly in time and computation.
In the computer vision community, dictionary learning (DL) is also a topic that attracts much attention, and many works focus on non-task-specific dictionary learning (Maurer et al., 2013) (Zhu, Shao, 2014). Self-taught learning (Raina et al., 2007) was the first to propose that the dictionary can be efficiently trained in an unsupervised way. One important conclusion of self-taught learning is that a dictionary learned on a large, unlabeled data set A can be used for feature encoding of another labeled data set B, and that a classifier using these encodings on B can achieve promising results.
Inspired by this, we propose a self-taught learning framework using spatial pyramid matching (S-ScSPM) for remote sensing scene classification from high resolution imagery. We show in our experiments that, with S-ScSPM, a dictionary trained only on data from an unlabeled data set A is sufficient to classify labeled data sets B and C. The overall classification accuracy of S-ScSPM is comparable to, and sometimes higher than, that of the original ScSPM on the labeled data sets. Because the S-ScSPM dictionary is learned on a single unlabeled data set only, the resource cost of learning is significantly reduced.

Background
Formally, we use $D_l$ to denote a labeled data set and $D_u$ to denote an unlabeled data set. Self-taught learning consists of three main stages: dictionary learning, sparse representation and classification. ScSPM consists of the same three stages, but the two methods differ in the following ways. First, in self-taught learning, one trains a dictionary using $D_u$, then computes sparse representations for the images in $D_l$ and classifies $D_l$. In ScSPM, one uses $D_l$ for the dictionary learning stage and the same $D_l$ for the latter two stages. Second, in self-taught learning, one generally uses raw pixels as image descriptors to learn a dictionary and calculate sparse codes for images, while in ScSPM the dictionary is learned on dense SIFT features of images, and the sparse codes computed using the dictionary are further processed through spatial pyramid matching (SPM) and max pooling before being fed to the classifier.
Our proposed S-ScSPM is therefore a combination of the self-taught learning framework and ScSPM. S-ScSPM consists of the same three stages as discussed above, with the details shown in Figure 1.

Dictionary learning
We use dense SIFT features extracted from unlabeled images to train a dictionary via dictionary learning using sparse coding. For simplicity we let $D_u$ denote the set of extracted dense SIFT features. The problem of dictionary learning is defined as the following optimization problem:

$$\min_{B,\, a_u} \sum_{i} \left\| x_u^{(i)} - a_u^{(i)} B \right\|_2^2 + \lambda \left\| a_u^{(i)} \right\|_1 \quad \text{s.t.} \quad \| B_k \|_2 \le 1, \ k = 1, \dots, K \tag{1}$$

where $x_u^{(i)} \in \mathbb{R}^d$ is the $i$-th feature in $D_u$, $B \in \mathbb{R}^{K \times d}$ is the dictionary whose rows $B_k$ are the codewords, $a_u^{(i)} \in \mathbb{R}^K$ is the sparse code of $x_u^{(i)}$, and $\lambda$ is the regularization term. We follow the convention that $B$ is an overcomplete basis set, i.e., $K > d$, and that an $l_2$-norm constraint is applied on each codeword $B_k$ to avoid trivial solutions (Yang et al., 2009). The objective function of (1) balances two terms: (i) the first, quadratic term encourages $x_u^{(i)}$ to be well reconstructed by a linear combination of all the codewords in $B$, with $a_u^{(i)}$ being the combination weights; and (ii) the $l_1$-norm penalty on $a_u$ forces $a_u$ to be sparse, so that the codewords capture salient patterns in $x_u$. This problem is optimized by alternately updating $a_u$ and $B$. $B$ is obtained once the optimization of (1) converges. We follow the details described in (Yang et al., 2009) in our implementation.
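To make the two terms of the objective concrete, the following is a minimal sketch of the sparse coding half of the alternating scheme: the dictionary is held fixed and one code is computed for one descriptor. It uses ISTA (iterative soft-thresholding), which is an assumption made here for brevity and is not the solver described in (Yang et al., 2009). The function names, the toy dictionary and the step size are all hypothetical.

```python
# Sketch only: sparse-code one descriptor x against a FIXED dictionary B
# by minimizing ||x - a B||^2 + lam * ||a||_1 with ISTA.
# ISTA is an assumption for illustration, not the solver used in the paper.

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def reconstruct(a, B):
    """Return a B, the linear combination of the codewords (rows of B)."""
    return [sum(a[k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]

def objective(x, a, B, lam):
    """Quadratic reconstruction error plus the l1 sparsity penalty."""
    r = reconstruct(a, B)
    fit = sum((xj - rj) ** 2 for xj, rj in zip(x, r))
    return fit + lam * sum(abs(v) for v in a)

def ista_sparse_code(x, B, lam, step, n_iter=500):
    """step must not exceed 1 / (2 * largest eigenvalue of B B^T)."""
    a = [0.0] * len(B)
    for _ in range(n_iter):
        resid = [rj - xj for rj, xj in zip(reconstruct(a, B), x)]
        # Gradient of the quadratic term with respect to a_k is 2 * <resid, B_k>.
        grad = [2.0 * sum(rj * bkj for rj, bkj in zip(resid, B[k]))
                for k in range(len(B))]
        a = [soft_threshold(ak - step * gk, step * lam) for ak, gk in zip(a, grad)]
    return a

# Toy overcomplete dictionary (K = 3 > d = 2) with unit-norm codewords.
B = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [1.2, 1.6]  # a multiple of the third codeword
code = ista_sparse_code(x, B, lam=0.15, step=0.25)
```

In this toy case the descriptor is aligned with the third codeword, so the l1 penalty drives the other two weights to zero. In the full dictionary learning problem this coding step would alternate with an update of the dictionary under the norm constraint.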

Sparse representation
We extract dense SIFT features from labeled scene images to form $D_l$, and the features are encoded with the dictionary and further processed using the standard ScSPM algorithm. Specifically, with the dictionary $B$ trained and fixed, we compute the sparse codes $a_l$ by

$$a_l^{(i)} = \arg\min_{a} \left\| x_l^{(i)} - a B \right\|_2^2 + \lambda \left\| a \right\|_1 \tag{2}$$

Then, following ScSPM, a spatial pyramid is built over the image, which partitions the image into $2^l \times 2^l$ regions at each level $l = 0, \dots, L$, where $L$ is the pyramid level. A max-pooling function is applied to all $M$ sparse codes in one region, and a higher level representation $z$ of this region is formed by $z_j = \max\{|a_l^{(1j)}|, |a_l^{(2j)}|, \dots, |a_l^{(Mj)}|\}$, where $z_j$ denotes the $j$-th component of the representation vector $z$ and $a_l^{(mj)}$ denotes the $j$-th component of the $m$-th sparse code in the region. Finally, we concatenate the vectors $z$ obtained from all the regions in image $n$ as $z^{(n)}$.
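The pooling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the usual SPM layout of 2^l x 2^l cells at each level, assigns each sparse code to a cell by the (x, y) position of its patch, and leaves empty cells at zero. The function name and argument layout are hypothetical.

```python
# Sketch only: max pooling of absolute sparse codes over a spatial pyramid.
# Assumes 2^l x 2^l cells per level; names and argument layout are hypothetical.

def spatial_pyramid_max_pool(codes, positions, width, height, levels=2):
    """codes: list of length-K sparse codes, one per patch.
    positions: list of (x, y) patch positions, 0 <= x < width, 0 <= y < height.
    Returns the concatenation of the pooled vectors of all pyramid cells."""
    K = len(codes[0])
    pooled = []
    for level in range(levels + 1):
        n = 2 ** level  # cells per axis at this level
        cells = [[0.0] * K for _ in range(n * n)]
        for a, (x, y) in zip(codes, positions):
            cx = min(int(x * n / width), n - 1)
            cy = min(int(y * n / height), n - 1)
            cell = cells[cy * n + cx]
            for j in range(K):
                # z_j = max over the codes in the cell of |a_j|
                cell[j] = max(cell[j], abs(a[j]))
        for cell in cells:
            pooled.extend(cell)
    return pooled

# Two patches with K = 2, pooled over levels 0 and 1 of an 8 x 8 image.
codes = [[0.5, -2.0], [-1.0, 0.0]]
positions = [(1, 1), (7, 7)]
z = spatial_pyramid_max_pool(codes, positions, width=8, height=8, levels=1)
```

Under the same 2^l x 2^l assumption, a pyramid with levels 0 to 2 has 1 + 4 + 16 = 21 cells, so a dictionary of size K = 1024 would give a representation of 21 x 1024 = 21504 components per image.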

Classification
Once the representations $z^{(n)}$ are obtained, we combine them with the image labels to form a training data set $\{(z^{(n)}, y^{(n)})\}_{n=1}^{N}$, which we use to train a multiclass linear SVM classifier. Here the label $y \in Y = \{1, 2, \dots, C\}$ and $C$ denotes the number of classes.
Following ScSPM, we take the one-vs-all strategy to train $C$ linear SVM classifiers for multiclass classification, each optimizing an objective function of the form

$$\min_{w_c} J(w_c) = \| w_c \|^2 + \gamma \sum_{n=1}^{N} l(w_c; y_c^{(n)}, z^{(n)}) \tag{3}$$

where
$w_c$ = SVM parameters (for class $c$)
$y_c^{(n)}$ = binary label, equal to 1 if $y^{(n)} = c$ and $-1$ otherwise
$\gamma$ = constant balancing the regularization and loss terms
$l(w_c; y_c^{(n)}, z^{(n)})$ = the hinge loss term

Instead of the standard hinge loss, we adopt the squared hinge loss $l(w_c; y_c^{(n)}, z^{(n)}) = [\max(0, 1 - w_c^\top z^{(n)} \cdot y_c^{(n)})]^2$, such that the objective function (3) is differentiable and thus can be trained using gradient based methods.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)
Finally, for some test data $z^{(n)}$, the class label $\hat{y}$ is given by

$$\hat{y} = \arg\max_{c \in Y} w_c^\top z^{(n)} \tag{4}$$
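The loss and the decision rule of the one-vs-all scheme can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes the standard form of the squared hinge loss, [max(0, 1 - y <w, z>)]^2 with binary labels y in {-1, +1}, and class indices starting at 1. All function names are hypothetical.

```python
# Sketch only: squared hinge loss, its gradient, and one-vs-all prediction.
# Assumes the standard squared hinge form with binary labels y in {-1, +1}.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def squared_hinge(w, z, y):
    """[max(0, 1 - y * <w, z>)]^2 for one sample."""
    return max(0.0, 1.0 - y * dot(w, z)) ** 2

def squared_hinge_grad(w, z, y):
    """Gradient with respect to w; it is continuous, unlike the plain hinge."""
    slack = max(0.0, 1.0 - y * dot(w, z))
    return [-2.0 * slack * y * zi for zi in z]

def predict(W, z):
    """W holds one weight vector per class; return the 1-based class index
    with the largest linear score."""
    scores = [dot(w, z) for w in W]
    return 1 + max(range(len(W)), key=scores.__getitem__)

loss_inside_margin = squared_hinge([1.0, 0.0], [0.5, 0.0], 1)
loss_outside_margin = squared_hinge([1.0, 0.0], [2.0, 0.0], 1)
grad = squared_hinge_grad([1.0, 0.0], [0.5, 0.0], 1)
label = predict([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.2, 0.9])
```

A sample that is correctly classified with margin at least 1 contributes zero loss and zero gradient, so only samples inside the margin or misclassified drive the gradient based update.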

EXPERIMENTS AND DISCUSSION
Data sets
We test our proposed method using the following three data sets. We choose the Caltech-101 data set (Fei-Fei et al., 2004) as the unlabeled data set $D_u$, and we evaluate the performance of S-ScSPM on two labeled data sets: the UC Merced LandUse (UCM) data set (Yang, Newsam, 2010) and the Changping data set. The Caltech-101 data set consists of natural scene and object images, while the latter two data sets contain only remote sensing scene images. Some typical images in these data sets are shown in Figure 2.
The Caltech-101 data set contains 101 classes, including images of animals, faces, vehicles, etc. The number of images per class varies from 31 to 800, with most categories having about 50 images. The size of each image is roughly 300 × 200 pixels. Although every image in this data set has a label, the dictionary learning stage of our algorithm does not require any information from the labels.
The UC Merced LandUse data set is one of the most popular remote sensing scene image data sets. It contains 21 scene categories, including agricultural, airplane, buildings, etc., with each category having 100 images of 256 × 256 pixels. The spatial resolution of these images is 1 foot.
The Changping data set was acquired by the Gaofen-2 sensor and covers an area in Changping District, Beijing, China. The original image size is 4736 × 3200 pixels, with a spatial resolution of 0.8 m. In total, 6 categories of scenes are obtained from the original image by a non-overlapping grid with a cell size of 128 × 128 pixels. The categories are idle area, freeway, sparse buildings, dense buildings, industrial area and vegetation area, containing 117, 101, 154, 56, 103 and 85 images, respectively. Scenes that cannot be categorized as any of the above 6 classes, for example scenes that are a mixture of several classes, are labeled as "undefined" and are not used in either dictionary learning or classification. The image and annotations are shown in Figure 3.
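The tiling figures above can be checked with a short calculation. The grid size and the per-class counts are taken from the text; the number of "undefined" cells is only an inference, assuming the grid covers the whole image and every cell outside the 6 classes was marked "undefined".

```python
# Worked arithmetic for the Changping tiling; variable names are hypothetical.
width, height, cell = 4736, 3200, 128

cols, rows = width // cell, height // cell
total_cells = cols * rows

labeled = {
    "idle area": 117,
    "freeway": 101,
    "sparse buildings": 154,
    "dense buildings": 56,
    "industrial area": 103,
    "vegetation area": 85,
}
n_labeled = sum(labeled.values())

# Inference, not a figure reported in the text: cells left over after the
# 6 labeled classes, assuming the grid covers the whole image.
n_remaining = total_cells - n_labeled

print(cols, rows, total_cells, n_labeled, n_remaining)  # 37 25 925 616 309
```

Both image dimensions are exact multiples of 128, so the non-overlapping grid leaves no partial cells at the borders.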

S-ScSPM vs. ScSPM
In this section we compare our method with the original ScSPM algorithm. We evaluate and compare the performance of both algorithms on the UCM and Changping data sets. In the experimental group, we evaluate S-ScSPM using a dictionary trained only on the Caltech-101 data set, and the SVM classifier is trained on either the UCM or the Changping data set. In the control group, ScSPM is performed on the UCM data set and the Changping data set, with the dictionary trained task-specifically.
In the S-ScSPM dictionary learning stage, the dense SIFT patch size is set to 16 × 16 pixels and the step size to 8 pixels. The SIFT descriptors are extracted from gray scale images, and 200,000 descriptors are randomly selected and used to train the dictionary. The dictionary size K and the regularization term λ are set to 1024 and 0.15, respectively.
In the ScSPM dictionary learning stage, we set the dense SIFT patch size to 16 × 16 pixels and the step size to 6 pixels on the UCM data set, and to 8 × 8 pixels and 3 pixels on the Changping data set. On both data sets, K = 1024, λ = 0.15 and the number of training patches is 200,000, the same as the settings in S-ScSPM.
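To give a sense of scale for these settings, the number of dense SIFT patches per image can be estimated from the patch size and step size. The sketch below assumes patches are placed on a regular grid that starts at the image corner and stays fully inside the image; actual dense SIFT implementations may handle borders differently, so the counts are indicative only. The function names are hypothetical.

```python
# Sketch only: patch counts on a regular grid, assuming patches lie fully
# inside the image. Real dense SIFT implementations may differ at the borders.

def patches_per_axis(size, patch, step):
    """Number of patch positions along one axis."""
    if size < patch:
        return 0
    return (size - patch) // step + 1

def patches_per_image(width, height, patch, step):
    return patches_per_axis(width, patch, step) * patches_per_axis(height, patch, step)

ucm = patches_per_image(256, 256, patch=16, step=6)        # ScSPM setting on UCM
changping = patches_per_image(128, 128, patch=8, step=3)   # ScSPM setting on Changping
caltech = patches_per_image(300, 200, patch=16, step=8)    # S-ScSPM, nominal 300 x 200 image

print(ucm, changping, caltech)  # 1681 1681 864
```

Under this assumption the two task-specific settings yield the same number of patches per image on UCM and Changping, and the 200,000 training descriptors are a random subset of all extracted patches in every case.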
In the sparse representation stage, for both algorithms we set the encoding regularization term λ to 0.15 and spatial pyramid level L to 2.
Following common benchmarking procedures, we repeat all the classification experiments 10 times with different random initializations and report the average classification rate with its standard deviation. Furthermore, for both the experimental and the control group, the SVM is trained on 20%, 50%, and 80% of the data and tested on the rest. The overall classification accuracy on the UCM data set and the Changping data set is shown in Table 1 and Table 2, respectively. On the UCM data set, S-ScSPM outperforms ScSPM by 1% on average, and its standard deviations are also smaller. On the Changping data set, S-ScSPM achieves a 1% lower classification rate than ScSPM when 20% and 50% of the data are used for training, while it slightly outperforms ScSPM when 80% of the data are used for training. On this data set S-ScSPM generally has a larger standard deviation.
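The evaluation protocol can be sketched as a repeated random hold-out loop. This is a hedged illustration, not the authors' script: it assumes that each repetition draws a fresh random train/test split (the text does not say whether splits are stratified by class), and `train_and_score` is a hypothetical callable that trains a classifier on the training indices and returns its accuracy on the test indices. A majority-class baseline stands in for the SVM so that the sketch runs on its own.

```python
# Sketch only: repeated random hold-out with mean and standard deviation.
# train_and_score is a hypothetical callable supplied by the user.
import random
import statistics
from collections import Counter

def repeated_holdout(n_samples, train_fraction, train_and_score, repeats=10, seed=0):
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    scores = []
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # fresh random split for each repetition
        scores.append(train_and_score(idx[:n_train], idx[n_train:]))
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in classifier: always predict the majority class of the training split.
labels = [0] * 6 + [1] * 4

def majority_baseline(train_idx, test_idx):
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_acc, std_acc = repeated_holdout(len(labels), 0.8, majority_baseline)
```

Running the same loop with train fractions of 0.2, 0.5 and 0.8 would mirror the three settings reported for each data set.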

The above results indicate that S-ScSPM can perform about as well as ScSPM, and thus it is possible to learn a dictionary on a single unlabeled data set $D_u$ and adopt it for classifying multiple labeled data sets $D_l$, even if the distributions of the images in $D_u$ and $D_l$ are very different. This tolerance to distribution differences also suggests that it is possible to train a dictionary using large open source computer vision data sets like ImageNet, which is much cheaper than obtaining and labeling a large remote sensing image data set, and to use that dictionary for feature encoding in subsequent scene classification.

S-ScSPM vs. DL based methods
In this section we compare the performance of S-ScSPM with that of several other DL based methods on the UCM data set. These methods include the original BOVW, unsupervised feature learning (UFL) (Cheriyadat, 2013), multipath unsupervised feature learning (MP-UFL) (Fan et al., 2017) and bi-layer dictionary learning (BL-DL) (Yang et al., 2016). The average classification rates of these algorithms on the UCM data set, all using 80% of the data for training and repeated 10 times, are shown in Table 3. In Table 3, reading from left to right, the model complexity increases and the classification rate improves. UFL basically uses SPM without sparse coding. In MP-UFL, the dictionary objective function is slightly different from that of S-ScSPM, and image descriptors go through a procedure called multipath, consisting of multiple dictionary learning and sparse coding operations, to form a higher level representation. BL-DL builds several dictionaries to separate commonality and class-particularity dictionary atoms for better classification performance.
All algorithms listed above, except S-ScSPM, are trained task-specifically. We suspect that the self-taught learning framework may not work well with BOVW and UFL, because they do not require the encodings to be sparse, while sparse coding is a key feature that makes self-taught learning successful. On the other hand, it can be difficult to fully combine methods like MP-UFL and BL-DL with self-taught learning, because they learn multiple dictionaries and are more complicated than ScSPM. It is possible, however, to learn some of the low-level dictionaries in these complex methods on unlabeled data and apply them to multiple tasks.

CONCLUSION
In this paper we integrate ScSPM with the self-taught learning framework and propose the S-ScSPM framework for remote sensing scene classification. The experiments show that the pretrained dictionary applies surprisingly well to both the UC Merced LandUse data set and the Changping data set, with a classification rate comparable to, and in several settings slightly higher than, that of traditional ScSPM. Using S-ScSPM, the time and resources required for training can be significantly reduced.