MOTIF: MULTI-ORIENTATION TENSOR INDEX FEATURE DESCRIPTOR FOR SAR-OPTICAL IMAGE REGISTRATION

The inherent speckle noise in synthetic aperture radar (SAR) images and the significant nonlinear radiation differences between SAR and optical images make it difficult to compute similarity between image features, to detect corresponding points accurately, and to match images efficiently. The registration of SAR and optical images has therefore long been a challenging task. To address these issues, this paper proposes a new SAR-optical image registration method, the Multi-orientation Tensor Index Feature (MoTIF), which is characterized by a lightweight feature descriptor. Specifically, we first establish a diffusion tensor model based on image gradient orientation information. The model is then parameterized using polar coordinates to obtain the multi-orientation tensor features and the array of indices of their maximum values, from which we draw a multi-orientation index map and construct the feature descriptor vectors. To evaluate the proposed method, seven representative SAR-optical image pairs were tested, and the results were compared with those of four state-of-the-art methods. The results show that MoTIF outperforms the other methods: it substantially suppresses speckle in SAR images, overcomes the nonlinear radiation distortions caused by the differences between SAR and optical images, and achieves high precision and efficiency in image registration. MoTIF obtained an average number of correct matches (NCM) of 151.0 and a root-mean-square error (RMSE) of 1.66 pixels. Its time consumption is lower than that of the other four methods, and its calculation speed is 4 times faster than that of the LGHD method. Executable code and test data are published at https://skyearth.org/publication/project/MoTIF/


INTRODUCTION
The rapid development of computer science, remote sensing, artificial intelligence and other technologies has produced a great variety of multi-source images acquired by different kinds of sensors. Among these, optical and SAR images and their joint use have received substantial scholarly attention, because their matching and registration provide technical support for fields such as 3D reconstruction, rescue and relief work, and urban planning (Ye et al., 2017). By exploiting the time lag and spatial differences between the two sensors in acquiring information, we can enhance the temporal and spatial resolution of the data sources and improve the reliability of data application and processing, thus giving full play to their respective advantages. However, SAR is a side-looking imaging system, so its images are strongly affected by the ground range and terrain relief. SAR-optical registration remains formidable because of the multiplicative noise caused by speckle in SAR images and the geometric and nonlinear radiation distortions in the imagery.
In recent years, a lot of research has been carried out on SAR-optical image registration, and the methods can generally be classified into three categories: area-based matching, deep learning-based matching, and feature-based matching. Area-based matching relies on metrics such as the correlation coefficient and mutual information (Öfverstedt et al., 2019; Viola et al., 1997). The correlation coefficient is commonly used for its accuracy and efficiency in identifying grayscale changes, but its high sensitivity to between-image discrepancies means that it cannot handle the nonlinear grayscale differences between SAR and optical images. In contrast, mutual information copes with nonlinear grayscale differences and produces consistently sharper peaks in the similarity surface, but it is not always favored because it often gets stuck at a locally optimal solution rather than reaching a global one. Deep learning-based methods have been widely used in multi-modal image matching, as witnessed by several robust deep learning algorithms, such as convolutional neural network matching (K. Yi et al., 2016), the D2-Net network for multi-source image feature extraction and description (Dusmanu et al., 2019), a deep matching network based on co-attention (Wiles et al., 2021), VGG network feature extraction matching (Efe et al., 2021) and LoFTR, a transformer-based method for weak-texture matching (J. Sun et al., 2021). Some deep learning matching methods designed for SAR-optical images have also achieved good results, such as conditional generative adversarial networks (Merkle et al., 2018), a pseudo-siamese convolutional neural network (Hughes et al., 2018), a convolutional neural network suited to SAR-optical matching (Bürgmann et al., 2019), and the CorrASL network (Hughes et al., 2020).
Nonetheless, deep learning-based methods also suffer from poor matching stability owing to the complexity and unpredictability of the scenarios in which SAR-optical images are acquired.
Feature-based methods, such as SAR scale-invariant feature transform (SAR-SIFT) (Dellinger et al., 2015), optical-to-SAR SIFT (OS-SIFT) (Xiang et al., 2018), a new matching framework (Bas et al., 2021), and the rotation-invariant self-similarity descriptor (Mohammadi et al., 2022), have their own limitations because they depend strongly on image gradients, which are highly sensitive to the differences between SAR and optical imaging. To address this, methods based on phase features have been put forward, including LGHD (Aguilera et al., 2015), channel features of orientated gradients (Ye et al., 2019), radiation-variation insensitive feature transform (RIFT) (Li et al., 2020), and histogram of absolute phase consistency gradients (HAPCG) (Yao et al., 2021). These methods make good use of the frequency-domain features of the image to make multi-source matching feasible. They have yielded many positive results, yet their sensitivity to multiplicative noise in the registration of SAR and optical images, especially in non-urban areas, cannot be ignored. In summary, area-based methods resist nonlinear radiation distortions well but perform poorly in matching efficiency and in suppressing the coherent speckle of SAR images. Deep learning-based methods have shown great potential in image matching, but they are limited by the sample size and by the complex scenarios in which SAR-optical images are matched. Feature-based methods make feature matching faster but are of limited use in tackling problems such as multiplicative noise and nonlinear radiation distortions. Clearly, to advance research on SAR-optical image registration, we must first achieve accurate positioning of corresponding points and greater matching efficiency at the same time.
This paper presents a lightweight SAR-optical registration method based on a multi-orientation diffusion tensor index feature (MoTIF) description. It couples diffusion with a parametric polar-coordinate representation to construct rich multi-orientation tensor features, from which the index of the maximum value is calculated and a corresponding descriptor is generated. In this way, SAR-optical images can be registered more accurately and efficiently.

IMAGE REGISTRATION BASED ON MOTIF
The proposed MoTIF method is composed of four steps: (i) feature point extraction, (ii) MoTIF descriptor construction, (iii) bilateral matching and outlier removal, and (iv) image fusion. The second step is the major focus of this paper and is elaborated in the following sections.

Feature point extraction
Feature point extraction is an important part of image matching. It is also demanding because of the significant nonlinear radiation differences between SAR and optical remote sensing images and the multiplicative noise caused by coherent speckle in SAR images. Therefore, the anisotropic weighted moment image space is used to extract image features (Yao et al., 2021); it is defined in equation (1), where $W$ represents the final anisotropic weighted moment result, $M_{\max}$ represents the maximum moment of phase consistency of the image, $M_{\min}$ represents the minimum moment of phase consistency of the image, and the image weight coefficient takes values in the range [-1…5]. After the anisotropic weighted moments of the image are established, the FAST operator (Rosten et al., 2006) is used to extract key points.

MoTIF descriptor construction
Although previously established feature descriptors have achieved better results than traditional feature descriptors, they are still limited under various conditions, so the resulting matching of SAR-optical images does not satisfy actual production requirements. The construction of a robust descriptor therefore remains the major unsolved problem in SAR-optical image registration. This part focuses on the construction of the MoTIF descriptor, mainly to address geometric deformations, nonlinear radiation differences and the speckle noise inherent in SAR images. The construction consists of the following three steps, as shown in Figure 1: (i) multi-orientation tensor feature construction, (ii) multi-orientation tensor index mapping, and (iii) descriptor vector calculation.

Multi-orientation tensor features construction.
Step 1: Tensor feature calculation. Image gradients cannot be relied on directly in image registration owing to their high sensitivity to image distortions, especially to speckle noise in SAR images. For this reason, some scholars have employed Sobel and Laplacian operators (Ma et al., 2016) for gradient optimization and filtering, which makes gradients more promising for cross-modal image matching. In this section, the second-order gradient is calculated first, and the second-order gradient amplitudes in the horizontal and vertical directions are then computed using the Sobel template [-1,0,1; -2,0,2; -1,0,1] (see equation (2)).
In equation (2), $L_{xx}(x, y)$ and $L_{yy}(x, y)$ represent the sum of squares of the second-order gradients in the x and y directions respectively, $L_{xy}(x, y)$ denotes the trace of the second-order gradient, and the operator symbol denotes convolution.
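As a rough illustration of Step 1, the sketch below applies the Sobel template twice to obtain second-order gradient images. It is only a sketch under stated assumptions: equation (2) is not reproduced here, so the use of the transposed template for the vertical direction, the edge-replicating border handling, and the names `filter3x3` and `second_order_gradients` are illustrative choices rather than the authors' implementation.

```python
import numpy as np

# Sobel template quoted in the text; the vertical template is assumed
# to be its transpose.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Correlate a 2-D image with a 3x3 kernel, replicating the border.

    Correlation differs from convolution by a sign for this antisymmetric
    template; the sign cancels when the template is applied twice.
    """
    padded = np.pad(image, 1, mode="edge")
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def second_order_gradients(image):
    """Return (Lxx, Lyy, Lxy) obtained by applying the Sobel template twice."""
    image = np.asarray(image, dtype=float)
    lx = filter3x3(image, SOBEL_X)   # first-order, horizontal
    ly = filter3x3(image, SOBEL_Y)   # first-order, vertical
    lxx = filter3x3(lx, SOBEL_X)     # second-order, horizontal
    lyy = filter3x3(ly, SOBEL_Y)     # second-order, vertical
    lxy = filter3x3(lx, SOBEL_Y)     # mixed second-order term
    return lxx, lyy, lxy


# Usage: a quadratic ramp along x has a constant second derivative in x.
ramp = np.tile(np.arange(8, dtype=float) ** 2, (8, 1))
lxx, lyy, lxy = second_order_gradients(ramp)
```

For the quadratic ramp, the interior of `lxx` is constant while `lyy` and `lxy` vanish, which is the behaviour expected of second-order derivative estimates.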
The tensor provides edge information in terms of shape and direction. Although the edge shape changes with contrast and illumination, the edge direction always remains unaltered. Consequently, the tensor model is frequently used to extract the structural features of an image (Köthe et al., 2003). The structure tensor expression is given as equation (4), where $G_\sigma$ is the Gaussian kernel function with standard deviation $\sigma$, and $L_{yx}(x, y)$ is identical to $L_{xy}(x, y)$. We also calculated the image tensor $T(x, y)$ and its parallel and orthogonal eigenvectors.
Step 2: The coherent speckle noise of the SAR image still exists after the image tensor features are calculated, and it undermines the robustness of SAR-optical registration. We therefore introduced the coherence-enhancing diffusion function (Weickert et al., 1999) to retain image edge features while reducing the multiplicative speckle noise in SAR images, particularly in their uniform areas. This function is defined in equations (5) and (6).
In equations (5) and (6), $D$ is the tensor matrix after coherence-enhancing diffusion; $\alpha$ permits a small diffusivity (usually $\alpha = 0.05$) even when no preferential direction exists; $k$ acts as a threshold; $V_o(x, y)$ is the orthogonal eigenvector of the image tensor; and the superscript $T$ is the matrix transpose operator. Moreover, a positive constant is introduced to correct the bias in the original Perona-Malik diffusivity function (Weickert et al., 1999).
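To make Step 2 more concrete, the following sketch builds a coherence-enhancing diffusion tensor for a single pixel from the entries of a smoothed 2x2 structure tensor. It uses the commonly cited eigenvalue choice from Weickert's coherence-enhancing diffusion, which may differ in detail from equations (5) and (6) of this paper; the function name `coherence_enhancing_tensor` and the default contrast parameter `c` are assumptions made for illustration.

```python
import math


def coherence_enhancing_tensor(jxx, jxy, jyy, alpha=0.05, c=1.0):
    """Return (dxx, dxy, dyy) of a 2x2 diffusion tensor for one pixel.

    jxx, jxy, jyy: entries of the smoothed structure tensor at that pixel.
    alpha: small baseline diffusivity (0.05 as quoted in the text).
    c: positive contrast parameter (assumed value, not from the paper).
    """
    # Eigenvalue gap of the symmetric 2x2 structure tensor.
    root = math.sqrt(((jxx - jyy) / 2.0) ** 2 + jxy ** 2)
    kappa = (2.0 * root) ** 2  # (mu1 - mu2)^2, a coherence measure

    # Orientation of the dominant eigenvector (direction of largest contrast).
    theta = 0.5 * math.atan2(2.0 * jxy, jxx - jyy)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # Diffuse weakly across edges and strongly along coherent structures.
    lam1 = alpha
    lam2 = alpha + (1.0 - alpha) * math.exp(-c / kappa) if kappa > 0 else alpha

    # D = lam1 * v1 v1^T + lam2 * v2 v2^T, with v2 orthogonal to v1.
    dxx = lam1 * cos_t ** 2 + lam2 * sin_t ** 2
    dxy = (lam1 - lam2) * cos_t * sin_t
    dyy = lam1 * sin_t ** 2 + lam2 * cos_t ** 2
    return dxx, dxy, dyy


# Usage: a strong edge with its gradient along x, and an isotropic region.
edge = coherence_enhancing_tensor(4.0, 0.0, 0.0)
flat = coherence_enhancing_tensor(1.0, 0.0, 1.0)
```

In the edge case the diffusivity across the edge stays at `alpha` while the diffusivity along it approaches one; in the isotropic case the tensor reduces to `alpha` times the identity, which is what smooths speckle in uniform areas only gently.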
Step 3: The enhanced diffusion tensor features of SAR-optical images are obtained in Step 2. This operation weakens the effect of multiplicative noise but leaves problems such as nonlinear radiometric distortion and geometric distortion between SAR and optical images unsolved. To address such problems, Ye et al. (2019) enriched image structure features by generating multi-directional gradient features. Based on their findings, this paper adopts a parametric polar-coordinate representation to generate the set of multi-orientation tensor feature maps on which MoTIF is built. First, the $D$ matrices obtained in Step 2 are decomposed along the x and y directions, denoted $CoT_x$ and $CoT_y$. Then, the feature images are rotated over the orientation range $[0, \pi]$ with a rotation interval of $\pi/o$. After the rotation, the fast Fourier transform is applied to further filter fine noise; the final expression is given in equation (7).
In equation (7), $F_o(x, y)$ is the eigenvalue of the diffusion tensor of the $o$-th layer; $o$ is the number of layers of the multi-orientation tensor feature (taken as 6 in this paper); $\cos(\cdot)$ and $\sin(\cdot)$ are the trigonometric functions; FFT is the fast Fourier transform; and $|\cdot|$ denotes the absolute value.
This completes the construction of the multi-orientation tensor feature set, as shown in Figure 2.
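The polar-coordinate parameterization in Step 3 can be sketched as steering the two tensor components over a set of orientations. The snippet below is a simplified illustration, not equation (7) itself: it assumes the orientations are k·π/o for k = 0, …, o−1, combines the components as `cot_x * cos(theta) + cot_y * sin(theta)`, takes the absolute value, and omits the FFT-based filtering mentioned in the text. The function name `multi_orientation_features` is hypothetical.

```python
import numpy as np


def multi_orientation_features(cot_x, cot_y, num_layers=6):
    """Stack |cot_x*cos(theta) + cot_y*sin(theta)| over num_layers angles.

    cot_x, cot_y: 2-D arrays holding the x and y components of the
    enhanced diffusion tensor features (assumed inputs).
    Returns an array of shape (num_layers, H, W).
    """
    cot_x = np.asarray(cot_x, dtype=float)
    cot_y = np.asarray(cot_y, dtype=float)
    layers = []
    for k in range(num_layers):
        theta = k * np.pi / num_layers  # rotation interval pi / o
        steered = cot_x * np.cos(theta) + cot_y * np.sin(theta)
        layers.append(np.abs(steered))
    return np.stack(layers, axis=0)


# Usage: a purely horizontal component responds most strongly at theta = 0
# and not at all at theta = pi / 2.
layers = multi_orientation_features(np.ones((4, 4)), np.zeros((4, 4)))
```

With six layers the orientations are 0°, 30°, …, 150°, so each pixel receives six responses whose relative ordering, rather than absolute magnitude, is what the index map uses afterwards.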

Multi-orientation tensor index map.
Although the coherence-enhancing diffusion tensor features of the different layers have been obtained, it is still difficult to describe the between-layer feature similarity of SAR-optical images directly, because the features are easily affected by multiplicative noise and nonlinear radiation distortions. To enhance the robustness of the descriptor, it is necessary and effective to use index features across the coherence-enhancing diffusion tensor feature maps. For each pixel p, the coherence-enhancing diffusion tensor feature values of all layers are calculated, and the maximum value at each pixel is used to construct a feature map. In this paper, the multi-orientation tensor features are divided into 6 layers. First, a multi-channel feature map is formed from the features $F_o(x, y)$ obtained after the coherence-enhancing diffusion tensor features were calculated in Step 2, as defined in equation (8). Then, the maximum value of each pixel across the different layers is found and its location is marked. Finally, the values at each pixel location are collected into an o-dimensional ordered array, and the channel index value $MoTF(x, y)$, which is the index of the maximum value of the o-dimensional array, is calculated, as defined in equation (9).

Descriptor vector calculation.
The Gaussian function is used to assign a weight to each pixel so that the feature description vectors remain consistent under changes of the window position. Since the values of the multi-orientation tensor index map range from 1 to o, the local image block is divided into o×o sub-regions. A histogram vector with n = 6 bins is then computed in each sub-region, and all histogram vectors are concatenated in sequence to form an o×o×n feature vector, which is then normalized. Finally, a 216-dimensional (6×6×6) descriptor vector is generated, as shown in Figure 1.
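The index mapping and descriptor layout described above can be sketched as follows. This is a minimal illustration under assumptions: equations (8) and (9) are not reproduced, the Gaussian weight uses a standard deviation of half the window (an assumed value), and the names `index_map` and `motif_like_descriptor` are hypothetical. Only the 1-to-o index range, the o×o grid, the n = 6 bins and the 72-pixel window come from the text.

```python
import numpy as np


def index_map(features):
    """features: (o, H, W) stack -> 1-based index of the strongest layer."""
    return np.argmax(features, axis=0) + 1


def motif_like_descriptor(idx_map, center, window=72, num_layers=6, n_bins=6):
    """Concatenate Gaussian-weighted index histograms over an o x o grid."""
    half = window // 2
    r, c = center
    patch = idx_map[r - half:r + half, c - half:c + half]
    if patch.shape != (window, window):
        raise ValueError("window must lie fully inside the index map")

    # Gaussian weights centred on the patch (sigma = window / 2 is assumed).
    coords = np.arange(window) - (window - 1) / 2.0
    g = np.exp(-(coords ** 2) / (2.0 * (window / 2.0) ** 2))
    weights = np.outer(g, g)

    cell = window // num_layers
    histograms = []
    for i in range(num_layers):
        for j in range(num_layers):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            # Index values 1..o are shifted to bins 0..o-1.
            hist = np.bincount(patch[rows, cols].ravel() - 1,
                               weights=weights[rows, cols].ravel(),
                               minlength=n_bins)[:n_bins]
            histograms.append(hist)

    descriptor = np.concatenate(histograms)
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor


# Usage with random stand-in features (6 layers on an 80 x 80 image).
rng = np.random.default_rng(0)
idx = index_map(rng.random((6, 80, 80)))
desc = motif_like_descriptor(idx, center=(40, 40))
```

Because the histograms count layer indices rather than feature magnitudes, the descriptor depends only on which orientation dominates at each pixel, which is the property the text relies on for robustness to nonlinear radiation differences.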

Bilateral Matching
After the MoTIF descriptors are calculated, feature matching follows. In this paper, we used the Euclidean distance as the similarity measure and bilateral matching as the matching strategy, which ensures a one-to-one correspondence between the matched points. Outliers are nevertheless unavoidable after bilateral matching. To remove them, we used the fast sample consensus (FSC) algorithm, which can steadily extract the correct matching point pairs from the mismatches with fewer iterations.
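Bilateral matching is commonly implemented as a mutual nearest-neighbour check, and the sketch below illustrates that idea with the Euclidean distance. It is a brute-force illustration with hypothetical function names, not the authors' code, and it does not include the FSC outlier removal step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, candidates):
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)),
               key=lambda j: euclidean(query, candidates[j]))


def bilateral_match(desc_a, desc_b):
    """Keep pairs (i, j) that are each other's nearest neighbour."""
    if not desc_a or not desc_b:
        return []
    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Usage: the third descriptor in desc_a has no mutual partner and is dropped.
desc_a = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
desc_b = [[10.1, 10.0], [0.0, 0.2]]
pairs = bilateral_match(desc_a, desc_b)
```

The mutual check is what enforces the one-to-one correspondence: a point in one image can appear in at most one retained pair.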

Image fusion
Once the correct correspondence points are obtained, the transformation matrix between the images needs to be calculated to achieve image fusion. Here an affine model and the chessboard grid fusion method (Li et al., 2017) were employed: the former was used to calculate the homography matrix between the images, and the latter to fuse the registered SAR-optical images.
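One simple way to obtain an affine transformation from matched points is a linear least-squares fit, sketched below. The paper does not state how its matrix is estimated, so this is an assumed approach; `estimate_affine` and `apply_affine` are hypothetical helper names, and the result is returned as a 3x3 matrix whose last row is [0, 0, 1].

```python
import numpy as np


def estimate_affine(src, dst):
    """Least-squares affine transform mapping src points onto dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if len(src) < 3 or len(src) != len(dst):
        raise ValueError("need at least three point pairs of equal count")
    design = np.hstack([src, np.ones((len(src), 1))])       # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)   # shape (3, 2)
    matrix = np.eye(3)
    matrix[:2, :] = params.T
    return matrix


def apply_affine(matrix, points):
    """Apply a 3x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ matrix.T)[:, :2]


# Usage: recover a known transform from four exact correspondences.
true_matrix = np.array([[1.5, -0.2, 5.0],
                        [0.2, 1.5, -3.0],
                        [0.0, 0.0, 1.0]])
src_pts = [(0, 0), (10, 0), (0, 10), (7, 3)]
dst_pts = apply_affine(true_matrix, src_pts)
estimated = estimate_affine(src_pts, dst_pts)
```

Three non-collinear pairs determine the six affine parameters; using more pairs, as produced by the matching stage, averages out localization noise.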

EXPERIMENTAL RESULT
Four state-of-the-art methods, i.e., OS-SIFT (Xiang et al., 2018), LGHD (Aguilera et al., 2015), RIFT (Li et al., 2020) and HAPCG (Yao et al., 2021), were used for comparison. During the tests, the feature point extraction threshold was set to 0.4, the image scale difference was set to 1.6, six multi-orientation tensor feature maps were used, and the neighborhood window was 72 pixels. The parameters of the compared methods were tuned to their optimal settings. The proposed MoTIF method, OS-SIFT, LGHD, RIFT and HAPCG were implemented in Matlab R2018a. For all of these methods, the number of matched key points was kept under 3500. The experiments were performed on a Dell G3 laptop with an Intel(R) Core(TM) i7-9750H CPU, 16 GB of RAM, and the Windows 10 x64 operating system. An image-space affine transformation was used to model the geometric relationship of each image pair. For each pair, over 15 well-distributed corresponding points were manually collected to calculate the affine transformation used as the ground truth, against which the location accuracy of the automatically matched points was measured. Three indices, i.e., the number of correct matches (NCM), the root-mean-square error (RMSE) of the correct matches, and the matching time (MT), were used to evaluate the performance of the methods quantitatively.
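The NCM and RMSE indices can be computed from the ground-truth affine transformation roughly as sketched below. The residual threshold that decides whether a match counts as correct is not stated in this section, so the 3-pixel default is purely an assumption, as are the function names.

```python
import math


def transform_point(matrix, point):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to a point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return a * x + b * y + tx, c * x + d * y + ty


def evaluate_matches(matches, ground_truth, threshold=3.0):
    """Return (NCM, RMSE) for matches given as (src_point, dst_point) pairs.

    threshold: residual in pixels below which a match counts as correct
    (assumed value, not taken from the paper).
    """
    residuals = []
    for src, dst in matches:
        px, py = transform_point(ground_truth, src)
        residuals.append(math.hypot(px - dst[0], py - dst[1]))
    correct = [r for r in residuals if r <= threshold]
    ncm = len(correct)
    rmse = math.sqrt(sum(r * r for r in correct) / ncm) if ncm else float("nan")
    return ncm, rmse


# Usage: ground truth is a 10-pixel shift in x; the third match is an outlier.
gt = ((1.0, 0.0, 10.0), (0.0, 1.0, 0.0))
matches = [((0, 0), (10, 0)), ((1, 1), (12, 1)), ((2, 2), (30, 2))]
ncm, rmse = evaluate_matches(matches, gt)
```

RMSE is computed over the correct matches only, consistent with the definition given in the text.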

Image Datasets
The test data consist of seven SAR-optical image pairs. The SAR images come from the German TerraSAR-X satellite and the Chinese Gaofen-3 satellite, while the optical images come from Google Earth and the Chinese Gaofen-2 satellite. These images span low, medium and high resolutions, have different spectral characteristics, and cover different scenes such as urban, suburban and mountainous areas. Notably, the differences in imaging mechanisms between SAR and optical sensors cause significant nonlinear radiation differences within each pair. We therefore divided the data into two groups according to their geometric differences. The first group contains four pairs, no. 1, no. 2, no. 6 and no. 7, whose sizes range from 520 to 1000 pixels. These images, though substantially similar in rotation and scale, cannot be registered directly because the SAR images are strongly affected by coherent speckle. The registration of pairs no. 6 and no. 7 is more challenging because their structural features are unclear: they show only mountains, with no buildings. The second group has three pairs, no. 3, no. 4 and no. 5, whose sizes vary from 450 to 923 pixels. As shown in Figure 3, the images of pair no. 3 have a temporal difference of more than one year, those of pair no. 4 have a scale difference, and those of pair no. 5 have a rotational difference.

Quantitative Results
Figure 4 shows the results for the three indices used to compare the proposed MoTIF method with the four other state-of-the-art methods. NCM (Figure 4 (a)) is given as a number of points, RMSE (Figure 4 (b)) in pixels, and MT (Figure 4 (c)) in seconds.
Figure 4. Quantitative results of several methods: (a) NCM results, (b) RMSE results, (c) MT results.
As shown in Figure 4, OS-SIFT is superior to LGHD, RIFT and HAPCG on MT, while the latter three perform better on NCM and RMSE. MoTIF, however, outperforms all four methods on all three indices and is remarkably less time-consuming. In addition, the average scores over the seven image pairs were calculated for the three indices in order to compare the methods more comprehensively.
Methods: OS-SIFT, LGHD, RIFT, HAPCG, MoTIF.
Note that "" denotes a failed pair matching. NCM is positively associated with performance: the higher the number, the better the performance. RMSE results higher than 5 pixels are set to 5 pixels.
Drawing on the correct correspondence points obtained, we then calculated the homography matrix between the images, which was used for image fusion (see Figure 5). Figure 5 (a) shows the distribution of the ground truth points, and Figure 5 (b) shows the matching result of the SAR-optical image pair. Figure 5 demonstrates that MoTIF achieves robust matching and fusion of images with scale and rotation differences. It yields an accurate fusion of SAR-optical images with no mismatches or artifacts, as shown in Figure 5 (c) and (d).

PARAMETRIC ANALYSIS
To fully evaluate the performance of the proposed method, we analyzed its two core parameters, the size of the neighborhood window (Nw) and the number of multi-orientation tensor feature layers computed by MoTIF (NL), using two indicators, RMSE and NCM. The parameter settings are shown in Table 2.
Table 2. Parameter settings: NL = [… 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], Nw = 72.
An image pair with speckle noise is used as an example. Figure 6 shows that as the window Nw increases from 32 to 64, NCM increases steadily. When Nw reaches 64 and beyond, NCM exceeds 70. When Nw equals 88, NCM reaches its maximum, and when Nw exceeds 130, NCM begins to decrease. As for RMSE, Nw = 72 is the most favorable setting, since RMSE reaches its optimal value of 1.41 pixels at that point. Therefore, Nw = 72 is recommended.
Figure 7 demonstrates that RMSE does not fall below 2 pixels when NL is set to its minimum or maximum value. When NL ranges from 6 to 8, RMSE fluctuates between 1.3 and 1.6 pixels, but NCM goes higher than 75, which is surely unsatisfactory. Only when NL equals 6 do we obtain good values of RMSE and NCM at the same time. As NL increases, more corresponding points are obtained, but the growth rate of matches declines, and the increase in the number of matching points can no longer compensate for the deterioration of the matching accuracy itself. Therefore, NL = 6 is recommended.

CONCLUSION
This paper proposed a novel lightweight SAR-optical image registration method, mainly to resolve the problems of significant nonlinear radiation distortions and the multiplicative noise caused by speckle inherent in SAR images. First, the tensor features of the image were extracted to collect information on its salient structure. Then, the coherence-enhancing diffusion model was used to cope with the strong effect of coherent speckle. Finally, enriched image features were obtained by performing MoTIF extraction using polar coordinates, and the index of the maximum value was calculated so that the descriptor vectors could be established. Experimental results show that MoTIF achieves better accuracy and efficiency in SAR-optical image registration than OS-SIFT, LGHD, RIFT and HAPCG. With this method, NCM increased by approximately 1 to 2.3 times, the RMSE accuracy improved by 0.25 to 0.9 times, and the average MT improved significantly, by 0.64 to 4 times.
However, the MoTIF method also has its limitations. It is less useful for registering images whose scale and rotation distortions cannot be corrected or eliminated merely on the basis of their locations or attitudes. Future research could address scale and rotation invariance to further develop MoTIF and its application to the registration of SAR-optical and other multi-modal images.