THE JOINT SPATIAL AND RADIOMETRIC TRANSFORMER FOR REMOTE SENSING IMAGE RETRIEVAL

Content-based remote sensing image retrieval (CBRSIR) searches a remote sensing image dataset for images that are similar to a query image by extracting features (contents) from the images and comparing their similarity. In this work, we propose a lightweight network structure, which we call the joint spatial and radiometric transformer, composed of three modules: a parameter generation network (PGN), spatial conversion and radiometric conversion. The PGN module learns image-specific transformation parameters from the input images to guide the subsequent spatial and radiometric conversion processes. With these parameters, the spatial conversion and radiometric conversion modules transform the input images from the spatial and spectral perspectives respectively, to increase the intra-class similarity and inter-class difference, both of which are of great importance to CBRSIR. In comparative experiments on multiple remote sensing image retrieval datasets, our proposed joint spatial and radiometric transformer combined with the ResNet34 backbone achieved the best performance among the compared methods.


INTRODUCTION
Content-based image retrieval (CBIR) is an active research topic in both computer vision and remote sensing (Du et al., 2016). A CBIR query consists of three steps: computing the features of the query image and of the images in the chosen database; comparing the similarity of the features; and ranking the images in the database according to the similarity score. For remote sensing images, geometric deformation caused by the varying camera angles of overhead platforms and complex radiometric distortion caused by a dynamic atmosphere both place higher demands on retrieval technology.
In recent years, convolutional neural networks (CNNs), which are widely used in many fields, have shown excellent performance in remote sensing image retrieval. In classic CNN architectures (e.g. VGG, ResNet) (Krizhevsky et al., 2018; Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016), the plain convolution and max-pooling operations can achieve translation invariance to a certain extent. However, with a fixed convolution window and pooling unit size, geometric translation invariance may not be fully achieved when processing remote sensing images. In addition, the commonly used color-transformation data augmentation can hardly handle radiometric distortions completely.
In this paper, we propose a lightweight network structure that performs spatial and radiometric conversion simultaneously on remote sensing images and is robust, without extra supervision, to the diversity of viewing angles and radiometric conditions of the input images. The main part of the structure, called the transformer, learns dedicated spatial and radiometric transformations for each individual image. Driven by the classification loss at the training stage, the transformer tends to learn conversion parameters that perform spatial and radiometric correction on the input image and generate a corrected image that is more conducive to the subsequent feature extraction. On the one hand, spatial correction mainly aims at making the foreground more prominent, which can be regarded as an attention mechanism, and also achieves affine and some non-rigid deformation through the defined transformation model. On the other hand, the radiometric transformation reduces the intra-class variability through spectral correction, which is particularly important for retrieval. Our method, combined with the ResNet34 backbone, demonstrates excellent performance on multiple popular remote sensing image retrieval datasets.

METHOD DESCRIPTION
The model of spatial and radiometric transformation consists of three modules: parameter generation network (PGN), spatial conversion and radiometric conversion.
The parameter generation network is composed of two convolutional layers, each followed by a max-pooling layer, and a multi-layer perceptron with one hidden layer. Its last regression layer outputs a fixed number of parameters, determined by the selected spatial and radiometric transformation models. Ten parameters are regressed in our experiments: six for the spatial transformation and four for the radiometric transformation.
The spatial conversion module, which performs grid generation and sampling in turn, borrows ideas from the spatial transformer of Jaderberg et al. (2016). The grid generator transforms the spatial coordinates of the input image using the parameters obtained by the PGN and the defined spatial transformation model. The sampling module then resamples the input image at the transformed coordinates with a specific interpolation method. In this paper, we choose the affine transformation as the spatial transformation model, which requires 6 transformation parameters. An affine transformation maps one 2D coordinate system to another through a linear map plus a translation, and can be decomposed into a series of elementary transformations, including translation, scaling, flipping, rotation and shearing. The interpolation method we choose is bilinear interpolation.
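As an illustration of the grid generation and sampling steps, the following is a minimal pure-Python sketch for a single-channel image. It is not our training implementation: the function names, the normalized [-1, 1] coordinate convention (borrowed from the spatial transformer formulation) and the zero padding outside the image are assumptions made for this example.

```python
import math


def affine_grid(theta, height, width):
    """Map every output pixel to a source coordinate in normalized [-1, 1] space.

    theta is a 2x3 matrix [[a, b, tx], [c, d, ty]] holding the 6 affine parameters.
    """
    (a, b, tx), (c, d, ty) = theta
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            row.append((a * x + b * y + tx, c * x + d * y + ty))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Resample a single-channel image (list of rows) at the grid coordinates.

    Assumption: neighbours that fall outside the image contribute zero.
    """
    height, width = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0.0

    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates back to pixel coordinates.
            px = (xs + 1.0) * (width - 1) / 2.0
            py = (ys + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            wx, wy = px - x0, py - y0
            # Weighted sum of the four neighbouring pixels.
            out_row.append(
                (1 - wx) * (1 - wy) * pixel(y0, x0)
                + wx * (1 - wy) * pixel(y0, x0 + 1)
                + (1 - wx) * wy * pixel(y0 + 1, x0)
                + wx * wy * pixel(y0 + 1, x0 + 1)
            )
        out.append(out_row)
    return out


def spatial_transform(image, theta):
    """Grid generation followed by bilinear sampling."""
    return bilinear_sample(image, affine_grid(theta, len(image), len(image[0])))
```

With the identity parameters [[1, 0, 0], [0, 1, 0]] the image is returned unchanged, and with [[-1, 0, 0], [0, 1, 0]] it is flipped horizontally. In the network, the six values of theta would be regressed per image by the PGN instead of being set by hand.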
Compared with the spatial transformer, similar versions of which have appeared in previous works (for example as attention mechanisms), radiometric conversion has received little attention in image retrieval tasks. For remote sensing image retrieval, radiometric correction of the image gives the network the ability to actively learn to increase intra-class similarity and inter-class difference at the spectral level. In this study, we apply four transformation parameters obtained by the PGN to the input image for radiometric correction. We observed that the variations among different satellite images covering the same area, including illumination change, under- or overexposure and color cast, can largely be modeled and repaired by adjusting the individual spectral channels. Therefore, we define the four parameters as one linear stretching coefficient for each of the R, G and B channels plus a translation bias shared by all three channels. The input image is then transformed pixel by pixel according to these parameters.
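The pixel-wise radiometric conversion can be sketched as follows. This is a minimal illustration, assuming an image stored as rows of (R, G, B) tuples; the function and argument names are hypothetical, and no clipping of the output range is applied because none is specified above.

```python
def radiometric_transform(image, gains, bias):
    """Apply a per-channel linear stretch with a shared bias to every pixel.

    image: rows of (r, g, b) pixel tuples
    gains: (gain_r, gain_g, gain_b), one stretching coefficient per channel
    bias:  single translation bias shared by the three channels
    """
    return [
        [tuple(gain * value + bias for gain, value in zip(gains, pixel)) for pixel in row]
        for row in image
    ]
```

For example, with gains (1.0, 0.5, 2.0) and bias 0.1, the pixel (0.2, 0.4, 0.6) becomes approximately (0.3, 0.3, 1.3). In the network, the three gains and the bias would be the four radiometric parameters regressed by the PGN for each image.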
After the original image is adjusted by the transformer, the corrected image is fed into the shortcut (residual) structures of the ResNet to extract features. In the training phase, which is the same as in a common classification task, the features are processed by two fully connected layers to output predictions. The outputs are compared with the ground truth to optimize the whole network, consisting of the transformer and the ResNet. During the retrieval phase, the last fully connected layer is replaced with principal component analysis (PCA), which outputs a feature vector of fixed length. Then, the normalized correlation coefficient (NCC) is calculated between the feature vector extracted from the query image and that of each image in the database to be retrieved. The images in the database are ranked by their NCC scores.
The whole process of our proposed method is shown in Figure 1.
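The ranking step of the retrieval phase can be sketched as below. The sketch assumes the zero-mean form of the normalized correlation coefficient (equivalent to the Pearson correlation of the two feature vectors); the function names are hypothetical and the feature vectors are plain Python lists.

```python
import math


def ncc(a, b):
    """Zero-mean normalized correlation coefficient of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0.0:
        return 0.0  # a constant vector carries no correlation information
    return sum(x * y for x, y in zip(da, db)) / denom


def rank_database(query_feature, database_features):
    """Return database indices sorted from most to least similar to the query."""
    scored = [(ncc(query_feature, feat), idx) for idx, feat in enumerate(database_features)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored]
```

Calling rank_database with the feature vector of the query and the list of database feature vectors gives the order in which the database images are returned.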

Data Used
We use PatternNet (Zhou et al., 2018) as the fine-tuning dataset to adapt a model pretrained on the close-range ImageNet dataset to overhead images, and RS19 (Xia et al., 2010), UCM (Yang and Newsam, 2010) and RSSCN (Zou et al., 2015) as the test datasets for retrieval.
PatternNet is a large-scale high-resolution remote sensing dataset specifically designed for remote sensing image retrieval (RSIR). The dataset has a total of 30400 images, each of size 256 × 256 pixels.
WHU-RS19 (RS19) contains 19 categories with a total of 1005 remote sensing images, and can be used for scene classification and retrieval. The dataset has around 50 images per category, and each image is 600 × 600 pixels in size.
The UC Merced Land-Use Dataset contains 21 types of scenes, each of which is composed of 100 images. The size of each image is 256 × 256 pixels.
RSSCN7 (RSSCN) consists of 7 typical scene categories and 2800 images. Each category contains 400 images of size 400 × 400 pixels, sampled evenly from 4 different scales.

Setting
The proposed network was pre-trained on the ImageNet dataset for weight initialization. The input images for fine-tuning and retrieval were all resized to 224 × 224 pixels. Fine-tuning ran for 40 epochs, with a learning rate of 10^-3 for epochs 1 to 15, 10^-4 for epochs 16 to 30, and 10^-5 for epochs 31 to 40. The batch size was set to 64 and the optimizer was SGD (Adam for the compared compact bilinear pooling (CBP) method (Wang et al., 2020)). The experiments were run on a Linux PC with an NVIDIA GeForce GTX 1060 6G GPU in the PyTorch deep learning environment.
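The step-wise learning rate schedule can be written as a small lookup. This is only a sketch with a hypothetical function name, assuming 1-based epoch numbering; in practice the returned value would be assigned to the optimizer at the start of each epoch.

```python
def learning_rate(epoch):
    """Learning rate for a 1-based fine-tuning epoch in the 40-epoch schedule."""
    if not 1 <= epoch <= 40:
        raise ValueError("epoch must be between 1 and 40")
    if epoch <= 15:
        return 1e-3
    if epoch <= 30:
        return 1e-4
    return 1e-5
```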

We use mean Average Precision (mAP) and Precision at k (P@k), where k indicates the top k retrieval results of a query, to evaluate the retrieval performance of the different methods. mAP is the mean over all queries of the Average Precision (AP), where the AP of a query is the average of the precision values at the different recall levels.
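The evaluation metrics can be sketched as follows, assuming that each query yields a ranked list of booleans marking whether the retrieved image is relevant. The function names are hypothetical, and AP is taken here as the mean of the precision values at the ranks where a relevant image appears, which is one common reading of the definition above.

```python
def precision_at_k(relevance, k):
    """Fraction of relevant images among the top k results of one query."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values at each rank where a relevant image occurs."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevance_per_query):
    """Mean of the AP values over all queries."""
    return sum(average_precision(r) for r in relevance_per_query) / len(relevance_per_query)
```

For a ranked result [relevant, irrelevant, relevant, irrelevant], P@2 is 0.5 and AP is (1/1 + 2/3) / 2, approximately 0.833.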

Experimental Results
The retrieval results on the RS19, UCM and RSSCN datasets are shown in Tables 1, 2 and 3 respectively. The content in parentheses denotes the modules (decoder) that follow the shortcut structures (encoder) in the retrieval network, where FC is a fully connected layer and PCA stands for principal component analysis. ST and RT are abbreviations of spatial transformation and radiometric transformation respectively.
On all three remote sensing image retrieval datasets, our transformer model achieved the highest mAP, surpassing the recent classification networks NTS-Net, SENet and SKNet as well as attention-boosted bilinear pooling (ATT + CBP). The groups that replace the last FC layer with PCA obtain better results in all the controlled experiments. To verify the effectiveness of the proposed joint transformer, we also tested adding only a spatial transformer or only a radiometric transformer. In both cases the retrieval results are better than those of the plain ResNet34 network but inferior to those of the network with the joint spatial and radiometric transformer. This indicates that a single spatial or radiometric transformer can indeed learn transformation parameters that are conducive to retrieval, and that the joint transformer effectively integrates the advantages of both.
The retrieval results on the different datasets are shown in Figure 2. In each figure, the query image is shown in the first row, and the results of ResNet34(PCA) and ST+RT+ResNet34(FC1+PCA) are shown in the second and third rows respectively. A red box indicates a retrieved image that is irrelevant to the query image (an incorrect result), and a green box indicates a relevant image (a correct result).

Figures 2, 3 and 4 demonstrate that introducing the ST and RT combination clearly improves the retrieval performance of the baseline.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2020, 2020 XXIV ISPRS Congress (2020 edition)

CONCLUSION
In this paper, we propose a joint spatial and radiometric transformer that transforms the input image for image retrieval. Specifically, the spatial conversion can be regarded as an attention mechanism that makes the foreground of the image more prominent, while the radiometric conversion uses actively learned parameters to increase the intra-class similarity and inter-class difference at the spectral level. Experiments on multiple challenging remote sensing image retrieval datasets show that our joint transformer surpasses popular recent networks such as SENet and effectively improves retrieval accuracy.
Compared with the FC layer, which learns its parameters from the fine-tuning dataset and outputs features that are discriminative for that dataset, PCA is not tied to any specific dataset and therefore yields more universal feature vectors. Consequently, replacing the last FC layer with PCA in the retrieval process achieves better results.