FEATURE FUSION FOR CROSS-MODAL SCENE CLASSIFICATION OF REMOTE SENSING IMAGE

Scene classification plays an important role in the remote sensing field. Traditional approaches use high-resolution remote sensing images as the data source from which to extract powerful features. Although this kind of method is common, model performance is strongly affected by the image quality of the dataset, and relying on a single modality (source) of images tends to omit some scene semantic information, which eventually degrades classification accuracy. With the development of remote sensing technology, multi-modal remote sensing data have become easy to obtain, and how to carry out scene classification on cross-modal data has become an interesting topic in the field. To address these problems, this paper proposes feature fusion for cross-modal scene classification of remote sensing images, i.e., aerial and ground street view images, so that the advantages of aerial images and ground street view data complement each other. Our cross-modal model is based on the Siamese network. Specifically, we first train the cross-modal model on pairs formed from the two data sources, aerial images and ground data. The trained model is then used to extract deep features from each aerial/ground image pair, and the features of the two perspectives are fused to train an SVM classifier for scene classification. We evaluate our approach on two public benchmark datasets, AiRound and CV-BrCT. The preliminary results show that the proposed method achieves state-of-the-art performance compared with the traditional methods, indicating that information from ground data can contribute to aerial image classification.


INTRODUCTION
Scene classification is a hot topic in the remote sensing field. It aims to assign a semantic category to an image according to its content, and it is also the most intuitive way of understanding a remote sensing image. Unlike traditional land use classification, scene classification does not determine the category of each pixel or object. It focuses only on the semantic features of the whole image and the overall cognition of the image scene: it attends to global, macro-level information and generally classifies a region as a whole according to its scene semantics. Global cognition and semantic information are therefore the two most important parts of scene classification. At present, scene classification of high-resolution remote sensing images is widely used, for example in urban functional zoning planning (Huang, 2018) and in vehicle (Schilling, 2018) and ship (Wang, 2019) object detection.
Traditional scene classification of high-resolution remote sensing images is based on a single network and a single perspective, that is, on training a model on satellite remote sensing images for classification and prediction (Liu, 2018; Xu, 2020). Although this kind of method is common, model training is affected by the image quality of the dataset, and the single perspective omits some scene semantic information, which eventually affects classification accuracy (Cheng, 2017). With the development of remote sensing technology, multi-source and multi-view remote sensing data have become easy to obtain (Xiong, 2020), and the traditional approach of "one data source, one model" is becoming outdated. How to perform scene classification on cross-source datasets has become a major research topic.
To address these problems, this paper proposes a method based on fusing features from a cross-modal model. It combines the aerial and ground perspectives so that the advantages of aerial images and ground street view data complement each other. We extract features of similar scenes from the different perspectives and fuse them, with the aim of improving scene classification accuracy.
* Weixun Zhou, zhouwx@nuist.edu.cn

Siamese Network
The cross-modal model is based on the Siamese network, which consists of two neural networks that together form the Siamese structure. The "Siamese" property is realized by the two networks sharing weights (Liu, 2019). A Siamese network therefore receives two inputs and passes each of them to one of the two weight-sharing networks. Finally, the feature representations output by the two networks are evaluated by the same loss function. The distance between them represents the correlation between the two inputs and thus measures their similarity. Fig. 1 illustrates the framework of Siamese networks, in which a CNN is the basic unit of the model. The CNN is composed of several kinds of layers, including convolutional layers, pooling layers, and fully connected layers, and each plays a vital role in the whole architecture. A convolutional layer extracts features by convolving the input image with convolution kernels and produces feature maps that serve as the input of the next layer. A pooling layer compresses the feature maps produced by the convolutional layer, reducing their dimension while retaining important features and helping to avoid overfitting. A fully connected layer flattens the features obtained from the convolutional or pooling layers into a one-dimensional feature vector, which is fed to a classifier for classification.
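
The weight sharing and the pairwise loss described above can be illustrated with a minimal NumPy sketch. It is not the network used in this paper: the single linear layer with ReLU stands in for a full CNN branch, and the contrastive loss with a margin of 1.0 is an assumed choice, since the text does not name the loss function. The names `embed`, `contrastive_loss`, `W`, `x_a` and `x_g` are hypothetical, and the label convention (1 for the same scene, 0 for different scenes) follows the pairing scheme used later in the paper.

```python
import numpy as np


def embed(x, W):
    """Shared branch: one linear layer plus ReLU (a stand-in for a CNN)."""
    return np.maximum(x @ W, 0.0)


def contrastive_loss(f1, f2, label, margin=1.0):
    """Assumed contrastive loss; label 1 = same scene, 0 = different scenes.

    Same-scene pairs are penalised by their squared distance, while
    different-scene pairs are penalised only if closer than `margin`.
    """
    d = np.linalg.norm(f1 - f2, axis=-1)
    same = label * d ** 2
    diff = (1 - label) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(same + diff))


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))    # a single weight matrix used by BOTH branches
x_a = rng.normal(size=(3, 8))  # toy stand-ins for three aerial inputs
x_g = rng.normal(size=(3, 8))  # toy stand-ins for three ground inputs
labels = np.array([1, 0, 1])   # made-up pair labels

f_a, f_g = embed(x_a, W), embed(x_g, W)  # same W gives the "Siamese" sharing
loss = contrastive_loss(f_a, f_g, labels)
```

Because both branches call `embed` with the same `W`, any update to `W` changes both branches at once, which is what weight sharing means here.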

Cross-modal model
After this brief introduction to the Siamese network, we turn to the proposed method. Figure 2 shows the process of cross-modal feature fusion. We train the cross-modal model, which is based on the Siamese network, on pairs drawn from the two data sources: an aerial remote sensing image is fed to one branch and a ground street view image to the other, and each pair is labeled 0 or 1 (1 for the same scene, 0 for different scenes). The trained model is then used as a deep feature extractor to obtain the deep features of each aerial/ground image pair, named feature_a and feature_g. The two view features, feature_a and feature_g, are fused in such a way that the feature dimension remains unchanged, and the fused feature is named feature_fusion. The same procedure is applied to both the training set and the test set. Finally, an SVM classifier is trained on the fused features for scene classification.
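
The pair labeling and the dimension-preserving fusion can be sketched as follows. This is an illustration under assumptions rather than the exact procedure of the paper: the text does not state how pairs are sampled or which fusion operator is used, so exhaustive pairing and the element-wise `sum`, `mean` and `max` modes are assumed options, and the helper names `make_pairs` and `fuse` are hypothetical. Only `feature_a`, `feature_g` and `feature_fusion` come from the text; their values here are toy numbers.

```python
import numpy as np


def make_pairs(aerial_labels, ground_labels):
    """Pair every aerial sample with every ground sample (assumed scheme).

    Returns (i, j, y) tuples with y = 1 when both images belong to the
    same scene class and y = 0 otherwise.
    """
    return [(i, j, int(a == g))
            for i, a in enumerate(aerial_labels)
            for j, g in enumerate(ground_labels)]


def fuse(feature_a, feature_g, mode="sum"):
    """Element-wise fusion, so the output has the same shape as each input."""
    if feature_a.shape != feature_g.shape:
        raise ValueError("features must have the same shape")
    if mode == "sum":
        return feature_a + feature_g
    if mode == "mean":
        return (feature_a + feature_g) / 2.0
    if mode == "max":
        return np.maximum(feature_a, feature_g)
    raise ValueError(f"unknown fusion mode: {mode}")


pairs = make_pairs(["airport", "bridge"], ["bridge", "airport"])
feature_a = np.array([[1.0, 2.0, 3.0]])  # toy aerial feature
feature_g = np.array([[3.0, 0.0, 1.0]])  # toy ground feature
feature_fusion = fuse(feature_a, feature_g, mode="max")
```

Element-wise operators are one way to satisfy the unchanged-dimension constraint; concatenation would not, because it doubles the feature length. The fused vectors would then be passed to an SVM classifier, which is omitted here.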

Dataset
The AiRound dataset (Machado, 2020) consists of 1165 pairs of images distributed over 11 categories: airport, bridge, church, forest, lake, river, skyscraper, stadium, statue, tower and city park. Each sample is a pair of images taken from different perspectives, i.e., a ground street view image and a high-resolution RGB aerial image. All images are paired and manually checked to ensure their correctness. Figure 3 shows the class distribution of AiRound and Figure 4 shows some examples.
The CV-BrCT dataset (Machado, 2020), which stands for Cross-View Brazilian Construction Type, comprises approximately 24k pairs of images split into 9 urban classes. Each pair is composed of images from two different views of a location: an aerial view and a frontal view. Figure 5 shows the class distribution of CV-BrCT and Figure 6 shows some examples.

Experimental Details
To demonstrate the performance of the proposed method, we conducted four comparative experiments. The training and test sets are the same in all experiments to ensure a fair comparison. Table 1 gives a detailed description of the four groups of experiments. Because of the fusion, experiments Ⅲ and Ⅳ each have only one overall classification accuracy. All experiments were run on a PC with a 3.7-GHz 7-core CPU, 16 GB of memory and an NVIDIA GTX 1660s GPU.
Figure 1 shows the results on AiRound. The difference between the two views in the AiRound dataset is small, and both achieve nearly 80% accuracy with a single CNN. Accuracy with the Siamese network is slightly lower, and data fusion performs basically the same. Our method improves the accuracy by about 3-4%.
Figure 1 shows the results on CV-BrCT, which differ somewhat from those on AiRound. In the CV-BrCT dataset, the difference in classification accuracy between the two views with a single CNN is large: aerial data reach about 80%, while ground street view data reach only 65%. Because of this large gap between the two perspectives, the Siamese network is about 5% lower for both perspectives, and dataset fusion does not work well. With our method, however, the accuracy reaches 80.64%. Compared with aerial images the improvement is not obvious, but compared with street view data it is very significant.
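
The overall classification accuracy reported for each experiment is the fraction of test samples whose predicted class matches the true class. The sketch below makes that arithmetic explicit. The function name `overall_accuracy` is hypothetical, and the labels are made-up toy values, not results from AiRound or CV-BrCT.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only.
y_true = ["airport", "bridge", "church", "forest"]
y_pred = ["airport", "bridge", "church", "lake"]
accuracy = overall_accuracy(y_true, y_pred)  # 3 of 4 correct
```

For the single-view experiments this would be computed once per view, whereas a fused classifier produces a single prediction per image pair and hence a single accuracy value.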

CONCLUSION
In this paper, we propose a feature fusion method for cross-modal scene classification. Our method uses cross-modal training on aerial images and ground street view data to learn features from the different perspectives, fuses these features, and thereby achieves cross-source scene classification. The experiments also indicate that the information from ground data and from aerial images can contribute to each other in scene classification. Comparison experiments on two datasets demonstrate performance improvements over both the aerial-only and the ground-view-only results.