MINDFLOW BASED DENSE MATCHING BETWEEN TIR AND RGB IMAGES

Image registration is a fundamental issue in photogrammetry and remote sensing, which aims to find the alignment between different images. Recently, the registration of images from different sensors has become a hot topic. Registered images from different sensors offer additional information, which helps with tasks such as segmentation, classification, and even emergency analysis. In this paper, we propose a registration strategy that calculates the dominant orientation difference and then achieves a dense alignment of thermal infrared (TIR) and RGB images with MINDflow. Firstly, the orientation difference between TIR images and RGB images is calculated by finding the dominant image orientations based on phase congruency. Then, the modality independent neighborhood descriptor (MIND) together with a global optical flow algorithm is adopted as MINDflow for dense matching. Our method is tested on image sets containing TIR images and RGB images captured separately but over the same construction site areas. The results show that it is able to achieve optimal results for images with significant features, even under dramatic radiometric differences between TIR images and RGB images. Compared with the results of other descriptors, our method is more robust and preserves the features of objects in the images.


INTRODUCTION
Image registration is a fundamental issue in photogrammetry and remote sensing, which aims to find the alignment between different images. Since images acquired by different sensors or platforms provide information on various properties of objects in the scene, combining information from different platforms and sensors provides a new perspective for better visualization and analysis of scenes and objects (Tong et al., 2019, Ye et al., 2020). Thermal infrared (TIR) images, acquired by thermographic sensors, depict temperature and emission properties of objects. Different from cameras in the visible spectrum, thermographic sensors detect radiation in the long-infrared range of the electromagnetic spectrum. Compared to near infrared images, thermal infrared images have lower resolutions and different features due to the long wavelength. Since all objects with a temperature above absolute zero emit infrared radiation to the environment, thermal infrared images enable us to observe the geometry of objects, their motion, inside structures, and their thermal properties without sufficient visible illumination (Zin et al., 2007, Weinmann et al., 2014, Christiansen et al., 2014). This property helps with thermal inspection of objects and recognition of hidden structures. Though thermal images provide thermal indicators for analysis, it is hard to achieve results of high accuracy for tasks like classification, recognition and abnormal phenomenon analysis due to the low spatial resolution and strong distortion compared to RGB cameras. Besides, prior knowledge of the study area and the thermal images is required for analysis. As a supplement, RGB images are typical data sources in the field of photogrammetry, which provide accurate and detailed geometry and texture information of the scene.
Generally, the challenges in the registration of thermal images (Fig. 1a) and RGB images (Fig. 1b) from different datasets lie in several aspects, including different geometric properties caused by various image resolutions, different intensity information resulting from various radiometric characteristics, and differences of sensor pose. Firstly, thermal infrared images usually have a lower spatial resolution than RGB images. The difference in resolution causes different features in RGB and TIR images, which makes it hard to find correspondences. Besides, due to the different wavelengths, thermal infrared images and RGB images have different intensity information. The radiation in RGB images depicts the texture information of the objects, which is similar to the visible information observed by human eyes. The thermal images reflect the temperature or emission properties of the objects. Though a similar shape can be observed, the radiation information varies dramatically. Besides, the RGB images and TIR images may be acquired separately, or by sensors mounted on the same platform with different viewing angles. The different exterior parameters lead to difficulties in finding corresponding points.
Section 1 describes the motivation and problems in the registration of thermal infrared images and RGB images. The literature review in Section 2 summarises the state-of-the-art methods for image matching. Then, the proposed strategy is described in detail in Section 3, with each step of the method elaborated. After that, the data and results of the proposed method are presented in Section 4, where the results are compared with several current methods. Finally, we draw a conclusion regarding the results in Section 5.

LITERATURE REVIEW
The typical automatic image matching methods can be roughly divided into two categories: feature-based methods and area-based methods. The feature-based methods use a symbolic feature descriptor to describe the characteristics of elements in the image. By finding a similar descriptor, the corresponding elements in other images can be found. The feature-based image matching process includes the following steps: feature detection, feature description, feature matching, transformation model estimation, and image resampling (Zitová, Flusser, 2003, Chen et al., 2019a). Here, reliable and effective feature detection and feature description are normally deemed to be the preconditions for image matching. Feature detection finds distinctive features, such as points, lines, and blobs, in the images. Typical algorithms include the Harris detector (Harris, Stephens, 1988), Laplacian of Gaussian (Lindeberg, 1994), Harris-Laplace (Mikolajczyk, Schmid, 2004), Hessian-Laplace, Difference-of-Gaussian (Lowe, 2004) and so on.
After the feature detection, feature descriptors are constructed to depict the characteristics of the detected features. The scale-invariant feature transform (SIFT) (Lowe, 2004) is widely used due to its robustness to image rotation and scale changes. After SIFT, a series of descriptors have been proposed to improve on its performance, such as speeded up robust features (SURF) (Bay et al., 2008), the fast local descriptor for dense matching (DAISY) (Tola et al., 2009), the gradient location and orientation histogram (GLOH), Binary Robust Independent Elementary Features (BRIEF) (Calonder et al., 2010) and Fast Retina Keypoint (FREAK) (Alahi et al., 2012). The feature-based methods are usually more robust to geometric and radiometric differences, but ignore the positional relation between neighboring pixels in the image. Compared to the feature-based methods, the area-based methods make use of the intensity information of the images. Feature detection and feature matching are combined by using a similarity measure, such as mutual information (MI), normalized cross-correlation (NCC), sum of squared differences (SSD) or matching by tone mapping. Though these methods are sensitive to geometric distortion, they are well suited for dense matching and subpixel matching.
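To make the area-based similarity measures concrete, the following minimal sketch computes SSD and zero-mean NCC for two equally sized patches. The function names `ssd` and `ncc` and the toy patch values are illustrative assumptions and are not taken from any of the cited methods.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sum((a - b) ** 2))

def ncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation, in the range [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a - a.mean()
    b = b - b.mean()
    # eps (an assumed guard value) avoids division by zero for flat patches.
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy patches (hypothetical values): the second is a linear intensity
# change of the first (gain 2, offset 10).
patch = np.array([[1.0, 2.0], [3.0, 5.0]])
brighter = 2.0 * patch + 10.0
print(ssd(patch, brighter), ncc(patch, brighter))
```

NCC is unaffected by the linear gain and offset, while SSD is not. Neither measure handles the non-linear intensity relation between TIR and RGB images, which motivates the structural descriptors discussed in the remainder of this section.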
Due to the increasing focus on the registration of images from different sensor platforms, researchers have done a lot of work on multi-modal image matching, especially in the medical imaging and remote sensing domains. The challenge mainly lies in the geometric deformation and radiometric variation, which lead to mis-correlated features. Wang et al. (2012) proposed the bilateral filter SIFT to find feature matches between optical images and SAR images. Xiong et al. (2019) put forward the rank-based local self-similarity descriptor, which describes local shape properties, for SAR-to-optical image registration. Ye et al. (2017) came up with the shape descriptor DLSS to form a similarity metric (named DLSC). By using the normalized cross correlation (NCC) of the DLSS descriptors, a template matching strategy is used to register optical images and SAR images. Xiang et al. (2020) incorporate dense feature representations into the 3-D phase congruency scheme to estimate the translation between optical images and SAR images at sub-pixel level. Due to its outstanding performance in feature detection under intensity variations, phase congruency is widely used in illumination-invariant image matching. Xiang et al. (2019) present global and local frames for the matching of optical images and SAR images with the gradient location and orientation histogram (GLOH) descriptor. Ye et al. (2018b) adopt the magnitude and orientation of the phase congruency feature as a structural image representation for subpixel image correlation. Ye and Shen (2016) propose the histogram of oriented phase congruency (HOPC) and integrate a similarity matrix to represent the geometric structure features of images, which achieves great performance in the registration of SAR, optical and map images. Based on the HOPC descriptor combined with the normalized correlation coefficient, a similarity measure called HOPCncc is used for image registration. Ye et al. (2018a) propose the minimum moment of phase congruency with LoG to detect feature points and local histograms of phase congruency as feature descriptors.
In addition to phase congruency, other methods have been applied for infrared and RGB image registration. For example, Chen et al. (2019b) utilize a multi-scale and multi-orientation Gabor filter to encode the edge information as a descriptor to match infrared and visible images. Yu et al. (2019) propose a grayscale weight with window algorithm together with normalized mutual information to register infrared images and RGB images. Rahaghi et al. (2019) utilize mutual information and particle swarm optimization. Istenic et al. (2007) detect line segments in visible and infrared images and find the corresponding lines in Hough parameter space to recover the translation and rotation. The scale problem is not properly solved there. Hrkać et al. (2007) proposed a method to co-register IR images and RGB images based on the assumption that the images are taken at close viewpoints, which simplifies the transformation model and reduces computation but limits the application of the method. In Shen et al. (2014), a robust selective normalized cross correlation (RSNCC) is used as the matching cost in a coarse-to-fine registration between RGB/NIR and RGB/range images. Turner et al. (2014) demonstrate a workflow to co-register visible, multi-spectral and thermal images acquired with a micro-UAV. The registration utilizes ground control points and is conducted in Photoscan. Sanchez et al. (2015) register thermal and visible light images based on a novel multi-scale method that employs the stationary wavelet transform. The extracted silhouettes of diseased plants can be used to register thermal and visible light images with high accuracy. However, how such methods can be used to register IR images and RGB images captured in urban areas without positioning information still remains to be solved.

PROPOSED METHOD
Considering the radiometric differences between thermal infrared images and RGB images, this contribution proposes a strategy to achieve a registration which is orientation-invariant and illumination-invariant. The scale is estimated by forming a pyramid space of both images. The best scale is the one at which the peak of the normalized correlation between the two phase congruency edge strength images is highest. The orientation of phase congruency and the modality independent neighborhood descriptor (MIND) are adopted as a similarity descriptor. After that, MINDflow is adopted for the dense matching of thermal infrared images and RGB images. Finally, the corresponding points from the dense matching are used for the transformation of the image. The workflow of the proposed method is given in Fig. 2.
Figure 2. Workflow of the proposed method.
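The scale selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: the edge strength images are taken as given 2-D arrays, the pyramid level is emulated with a nearest-neighbour resize, and the correlation is circular (computed via FFT). The helper names `resize_nearest`, `ncc_peak` and `best_scale` and the candidate scale list are hypothetical.

```python
import numpy as np

def resize_nearest(img, scale):
    """Nearest-neighbour resize; a stand-in for a proper pyramid level."""
    h, w = img.shape
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[np.ix_(rows, cols)]

def ncc_peak(a, b, eps=1e-12):
    """Peak of the circular normalized cross-correlation over all shifts."""
    a = a - a.mean()
    b = b - b.mean()
    shape = (max(a.shape[0], b.shape[0]), max(a.shape[1], b.shape[1]))
    fa = np.fft.fft2(a, s=shape)  # zero-pads to the common shape
    fb = np.fft.fft2(b, s=shape)
    corr = np.fft.ifft2(fa * np.conj(fb)).real
    return float(corr.max() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def best_scale(edge_rgb, edge_tir, scales):
    """Return the candidate scale with the highest correlation peak."""
    scores = [ncc_peak(resize_nearest(edge_rgb, s), edge_tir) for s in scales]
    return scales[int(np.argmax(scores))], scores

# Synthetic check (assumed data): the "RGB" edge image is the "TIR" edge
# image upsampled by a factor of 2, so the expected scale is 0.5.
rng = np.random.default_rng(0)
edge_tir = rng.random((32, 32))
edge_rgb = np.kron(edge_tir, np.ones((2, 2)))
scale, scores = best_scale(edge_rgb, edge_tir, [0.25, 0.5, 1.0])
print(scale)
```

On real data a smoothed, anti-aliased pyramid and a finer grid of candidate scales would be preferable to the nearest-neighbour shortcut used here.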

Orientation estimation
Phase congruency is calculated via local frequency analysis, which originated from the local energy model. It assumes that perceived features are located at points where the Fourier components are maximally in phase (Morrone, Owens, 1987). The phase congruency function is defined as the ratio of the local energy at position x, E(x), to the sum of the amplitudes of the local Fourier components.

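$$PC_1(x) = \frac{E(x)}{\sum_n A_n} = \frac{\sum_n A_n \cos(\phi_n(x) - \bar{\phi}(x))}{\sum_n A_n} \tag{1}$$
where $\sum_n A_n \cos(\phi_n(x))$ is the Fourier series expansion of the function, $A_n$ represents the amplitude and $\phi_n(x)$ the local phase of the $n$-th component, and $\bar{\phi}(x)$ is the amplitude-weighted mean local phase. Since this measure is not robust to noise and does not consider the frequency spread, a refined calculation model using log Gabor wavelets is adopted from Kovesi (Kovesi et al., 1999). The transfer function of a log Gabor filter is shown in Eq. 2:
$$G(\omega) = \exp\left(-\frac{(\log(\omega/\omega_0))^2}{2(\log(\sigma_\omega/\omega_0))^2}\right) \tag{2}$$
where $\omega_0$ is the center frequency of the filter and $\sigma_\omega/\omega_0$ is the constant parameter controlling the bandwidth. If $M_n^e$ and $M_n^o$ represent the even-symmetric and odd-symmetric components of the log Gabor filter at scale $n$, the response vector of a quadrature pair applied to the image signal $I(x)$ can be expressed as:
$$[e_n(x), o_n(x)] = [I(x) * M_n^e, I(x) * M_n^o]$$
Then, the amplitude $A_n(x)$ and phase $\phi_n(x)$ at scale $n$ can be expressed as:
$$A_n(x) = \sqrt{e_n(x)^2 + o_n(x)^2}, \qquad \phi_n(x) = \operatorname{atan2}(o_n(x), e_n(x))$$
Based on the magnitude of phase congruency, the orientation of phase congruency $\Theta$ can be obtained from the log Gabor odd-symmetric wavelets of multiple directions.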
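A common way to write this step in the phase congruency literature, assumed here rather than taken from the source, projects the odd-symmetric responses $o(x, \theta)$ of all filter directions onto the horizontal and vertical axes and takes the angle of the resulting vector. The symbols $a$, $b$ and $o(x, \theta)$ are introduced only for this illustration:

$$a(x) = \sum_{\theta} o(x, \theta)\cos\theta, \qquad b(x) = \sum_{\theta} o(x, \theta)\sin\theta, \qquad \Theta(x) = \operatorname{atan2}(b(x), a(x))$$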
Here, θ is the orientation of the odd-symmetric wavelet. The dominant orientation of the image can therefore be calculated from the accumulated orientations.
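The log Gabor filtering described in this section can be illustrated in one dimension. The sketch below is an assumption-laden simplification: the method operates on 2-D images with filters at several scales and directions, whereas this example uses a single 1-D scale. The function names and the bandwidth value `sigma_ratio=0.55` are illustrative choices, not values reported here.

```python
import numpy as np

def log_gabor_1d(n, wavelength, sigma_ratio=0.55):
    """One-sided log-Gabor transfer function sampled on the FFT grid."""
    freq = np.fft.fftfreq(n)   # frequencies in cycles per sample
    g = np.zeros(n)
    pos = freq > 0             # zero response at DC and negative frequencies
    f0 = 1.0 / wavelength      # centre frequency (omega_0 in Eq. 2)
    g[pos] = np.exp(-(np.log(freq[pos] / f0) ** 2)
                    / (2 * np.log(sigma_ratio) ** 2))
    return g

def quadrature_response(signal, wavelength, sigma_ratio=0.55):
    """Even and odd responses plus local amplitude and phase at one scale."""
    g = log_gabor_1d(len(signal), wavelength, sigma_ratio)
    resp = np.fft.ifft(np.fft.fft(signal) * g)
    even, odd = resp.real, resp.imag   # quadrature pair outputs
    amplitude = np.hypot(even, odd)    # A_n(x)
    phase = np.arctan2(odd, even)      # phi_n(x)
    return even, odd, amplitude, phase

# A cosine whose wavelength matches the filter centre (assumed test signal).
n = 128
signal = np.cos(2 * np.pi * np.arange(n) / 16)
even, odd, amplitude, phase = quadrature_response(signal, wavelength=16)
print(amplitude[:3], phase[0])
```

Because the filter is one-sided, only half of the cosine's spectrum is kept, so the local amplitude is 0.5 everywhere and the local phase is zero at the cosine peak.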

Modality independent neighborhood descriptor
The MIND descriptor (Heinrich et al., 2012) uses the principle of local self-similarity. The patch distance $D_p(I, x_1, x_2)$ is first calculated as the sum of squared differences between the patches $P$ of size $(2p + 1)^d$ centered at two voxels $x_1$ and $x_2$:
$$D_p(I, x_1, x_2) = \sum_{p \in P} (I(x_1 + p) - I(x_2 + p))^2$$
The exact patch distance can be efficiently calculated by applying a convolution filter $C$ of size $(2p + 1)^d$ to the point-wise squared difference between the image $I$ and the translated image $I'(r)$, where $r$ is the translation:
$$D_p(I, x, x + r) = C \star (I - I'(r))^2$$
Moreover, in order to highlight the response of MIND for similar patches, a Gaussian function is used. The variance of the image can be estimated via the mean of the patch distances themselves within a six-neighborhood $n \in \chi$:
$$V(I, x) = \frac{1}{k} \sum_{n \in \chi} D_p(I, x, x + n)$$
where $k$ is the number of neighbors in the defined neighborhood; in our case it equals six.
Then, the output of MIND can be defined by a distance $D_p$, a variance estimate $V$ and a spatial search region $R$:
$$\mathrm{MIND}(I, x, r) = \frac{1}{n} \exp\left(-\frac{D_p(I, x, x + r)}{V(I, x)}\right), \quad r \in R$$
with $n$ here denoting a normalisation constant, so that the maximum value is 1.
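A compact 2-D version of the descriptor can be sketched as follows. This is a minimal illustration under several assumptions: a four-neighbourhood is used as the 2-D analogue of the six-neighbourhood, the same four offsets serve as the search region R, the box filter wraps around at the image border, and the function names `box_sum` and `mind_2d` are hypothetical.

```python
import numpy as np

def box_sum(a, p):
    """Sum over a (2p+1) x (2p+1) window, with wrap-around borders."""
    out = np.zeros_like(a)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            out += np.roll(a, (dy, dx), axis=(0, 1))
    return out

def mind_2d(img, p=1, eps=1e-12):
    """MIND-like descriptor with one channel per offset in the search region."""
    img = np.asarray(img, dtype=np.float64)
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # assumed 2-D neighbourhood
    dists = []
    for dy, dx in offsets:
        # shifted[y, x] == img[y + dy, x + dx] (circular shift)
        shifted = np.roll(img, (-dy, -dx), axis=(0, 1))
        # Patch distance D_p: box filter over the squared difference image.
        dists.append(box_sum((img - shifted) ** 2, p))
    d = np.stack(dists, axis=-1)
    # Variance estimate V: mean patch distance over the neighbourhood.
    v = d.mean(axis=-1, keepdims=True) + eps
    desc = np.exp(-d / v)
    # Normalise so that the maximum response at each pixel equals 1.
    desc /= desc.max(axis=-1, keepdims=True)
    return desc

rng = np.random.default_rng(1)
img = rng.random((16, 16))
desc = mind_2d(img)
print(desc.shape)
```

Since both the patch distance and the variance estimate scale with the square of any intensity gain, the descriptor is unchanged under a linear intensity change, including a contrast inversion. This is the property that makes it attractive for comparing TIR and RGB structures.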

Dense matching of corresponding points by MINDflow
MINDflow is adapted from the renowned SIFTflow (Liu et al., 2010) and is adopted here for dense matching in feature space. Since the SIFT descriptor can hardly describe the similarity between images from different spectral bands, especially between thermal infrared images and RGB images, the MIND descriptor is applied instead.
Specifically, let s1 and s2 be two images represented by the MIND descriptor, and let ε contain all the spatial neighborhoods. w(p) is the flow vector at grid coordinate p = (x, y). The energy function (Eq. 9) contains a data term (i.e., the first term of Eq. 9), a small displacement term (i.e., the second term of Eq. 9) and a smoothness term (i.e., the third term of Eq. 9). The data term contains the MIND descriptors to be matched along the flow vector w(p). The small displacement term constrains the flow vector to be as small as possible when no information is available. The smoothness term constrains the flow vectors of adjacent pixels to be similar. L1 norms are used in the data term and the smoothness term to account for matching outliers and flow discontinuities. This energy function can then be minimized using sequential belief propagation (BP-S) (Szeliski et al., 2008).
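For reference, the SIFT flow energy of Liu et al. (2010) has the form below. It is assumed here, not confirmed by the text above, that Eq. 9 follows this form with the MIND representations $s_1$ and $s_2$ in place of SIFT descriptors. The flow components $w(p) = (u(p), v(p))$, the truncation thresholds $t$ and $d$, and the weights $\eta$ and $\alpha$ are symbols of that formulation:

$$E(w) = \sum_{p} \min\left(\left\| s_1(p) - s_2(p + w(p)) \right\|_1, t\right) + \sum_{p} \eta\left(|u(p)| + |v(p)|\right) + \sum_{(p,q) \in \varepsilon} \left[\min\left(\alpha |u(p) - u(q)|, d\right) + \min\left(\alpha |v(p) - v(q)|, d\right)\right]$$

The three sums correspond, in order, to the data term, the small displacement term and the smoothness term described above.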

Image transformation using corresponding points
Once the corresponding points are obtained, the dense alignment between TIR and RGB images can be achieved. Based on the dense matching points, a non-parametric transformation in the form of a flow map is obtained, and each pixel in the transformed image can thus be resampled.
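The resampling step can be sketched as a backward warp driven by the flow map. The following minimal example assumes nearest-neighbour sampling and clamping at the image border; the function name `warp_with_flow` and the flow convention (the output pixel at (x, y) is read from (x + u, y + v) in the input) are illustrative assumptions.

```python
import numpy as np

def warp_with_flow(image, flow_u, flow_v):
    """Resample so that out[y, x] = image[y + v(y, x), x + u(y, x)]."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Round to the nearest source pixel and clamp to the image bounds.
    src_x = np.clip(np.rint(xs + flow_u).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow_v).astype(int), 0, h - 1)
    return image[src_y, src_x]

# Toy example (assumed data): a constant flow of one pixel in x.
image = np.arange(12).reshape(3, 4)
shifted = warp_with_flow(image, np.ones((3, 4)), np.zeros((3, 4)))
print(shifted)
```

Bilinear or bicubic interpolation would give smoother results for sub-pixel flow vectors; nearest-neighbour sampling is used here only to keep the sketch short.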

EXPERIMENTAL DATA AND RESULT
The experimental data consist of two datasets: TIR images (512 × 640 pixels) and RGB images (2680 × 4019 pixels). Two UAV flights with the same predefined flight path were performed: one with the RGB camera and one with the TIR camera. Three image sets are utilized for validation. However, due to limited positioning accuracy, the exterior orientations at the same waypoints differ slightly between the two flights. In order to evaluate the co-registration result, we select two images (Fig. 3) and manually select matched keypoints for evaluation. The RGB images are on the left, while the TIR images are on the right. In Image 1, 46 point pairs are selected and in Image 2, 40 point pairs are selected. Fig. 3 shows that the images used here have two characteristics. Firstly, scale differences exist between the two datasets: the RGB images are much larger than the corresponding TIR images. Secondly, the overlap between the corresponding images is not always the same. For the second image pair, the selected points in the RGB image are distributed at the bottom, while the corresponding points in the TIR image lie close to the top. Compared to the first image pair, the overlap of the second image pair is smaller.
Fig. 4 shows the results of the proposed method. The first column shows the original RGB images, the second column the TIR images, the third column the flow map of each pixel, the fourth column how the phase congruency edge strength image of the RGB image is warped to that of the TIR image, and the last column the checkerboard image of the TIR image and the RGB image. In the flow map, the color encodes the flow vector of each pixel with respect to the RGB image. Generally speaking, although the overlap between the two images is not exactly the same, the proposed method is able to find the corresponding areas with significant features. Compared to the other data, Image 3 did not achieve an optimal result due to the repeated texture and the similar boundary structures inside and outside the building roof.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)
Fig. 5 presents the comparison between the proposed MINDflow and SIFTflow. The first two rows are the results for Image 1, and the last two rows are the results for Image 2. Note that the images passed to SIFTflow have already been rescaled with the scale parameters generated by the proposed method. Besides, the table lists the estimated orientation differences between image pairs based on phase congruency and the errors in x and y direction. The average error of MINDflow is lower than that of SIFTflow. According to the table, MINDflow gets a better result in y-direction while SIFTflow gets a better result in x-direction for Image 1. Besides, the errors for Image 2 are always lower than those for Image 1. However, when we analyse the results in Fig. 3, especially the warped images in the third column, the results of SIFTflow show larger distortions than the results of MINDflow. Checking the boundary of the checkerboard image of Image 1 shows that MINDflow approximately aligns the road along the upper boundary. The white area at the bottom of the warped image indicates a large offset in y-direction. Comparing the original RGB image and the TIR image, a large offset in y-direction is reasonable because the overlapping area occupies different positions in the two images. This phenomenon can be observed more clearly in Image 2, where the overlap between the two images is about half of each image. In the checkerboard image of MINDflow, the shapes of the boxes are preserved despite the large offsets and the radiation differences between the RGB and the TIR images.
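One plausible way to obtain per-axis errors from manually selected point pairs is sketched below. How the errors in the table were actually computed is not stated, so the mean absolute error, the function name `axis_errors` and the (x, y) point convention are assumptions made for illustration only.

```python
import numpy as np

def axis_errors(points_rgb, points_tir, flow_u, flow_v):
    """Mean absolute error in x and y between predicted and manual points."""
    pts_rgb = np.asarray(points_rgb, dtype=int)    # (N, 2) array of (x, y)
    pts_tir = np.asarray(points_tir, dtype=float)  # manual matches, (x, y)
    x, y = pts_rgb[:, 0], pts_rgb[:, 1]
    # Predicted TIR position: RGB keypoint displaced by the flow at that pixel.
    pred_x = x + flow_u[y, x]
    pred_y = y + flow_v[y, x]
    err_x = np.mean(np.abs(pred_x - pts_tir[:, 0]))
    err_y = np.mean(np.abs(pred_y - pts_tir[:, 1]))
    return float(err_x), float(err_y)

# Toy data (hypothetical): constant flow of (+2, -1) pixels on a 10 x 10 grid.
flow_u = np.full((10, 10), 2.0)
flow_v = np.full((10, 10), -1.0)
err_x, err_y = axis_errors([(1, 5), (3, 6)], [(3, 4), (6, 5)], flow_u, flow_v)
print(err_x, err_y)
```

Reporting the two axes separately, as in the comparison above, makes direction-dependent failures visible, such as a large residual offset in y-direction caused by limited overlap.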

CONCLUSION AND DISCUSSION
In this paper, we proposed a strategy to register thermal infrared images and RGB images based on MINDflow. Firstly, the orientation difference between TIR images and RGB images is solved by finding the dominant image orientations with phase congruency. Then, the MIND descriptor and flow matching are combined as MINDflow for dense matching. Our method is tested on 9 image sets in construction areas. The results show that it is able to achieve an optimal result for images with significant features, but can hardly deal with areas with repetitive patterns or insufficient structural differences. Compared with SIFTflow, MINDflow combined with phase congruency can better extract the features for dense matching.
Though the proposed method is able to accomplish a preliminary registration of thermal infrared images and RGB images, some challenges remain. First, we need to find a better way to realize an optimal orientation and scale estimation for multi-modal images. For both area-based and feature-based methods, the orientation and scale strongly affect the feature descriptor or similarity descriptor. Though we are able to provide a coarse scale and orientation, a more accurate estimate would improve the result. Besides, the method requires a certain overlap between the images, which is a typical problem when the data are acquired separately. The overlap areas influence the estimation of rotation and scale. Last but not least, we need to find an optimal solution for areas with repetitive texture.
The next challenge is the detection of overlapping areas in RGB and TIR images in order to optimize the image orientation and scale. How to match images with little texture or repetitive patterns is also our concern.