OBJECT RE-IDENTIFICATION USING MULTIMODAL AERIAL IMAGERY AND CONDITIONAL ADVERSARIAL NETWORKS

Object Re-Identification (ReID) is the task of matching a given object in a new environment with its image captured in a different environment. The input for a ReID method includes two sets of images. The probe set includes one or more images of the object that must be identified in the new environment. The gallery set includes images that may contain the object from the probe image. The complexity of the ReID task arises from differences in object appearance between the probe and gallery sets. Such differences may originate from changes in illumination or in the viewpoints of the multiple cameras that capture the images. This paper focuses on developing a deep learning ThermalReID framework for cross-modality object ReID in thermal images. Our framework aims to provide continuous object detection and re-identification while monitoring a region from a UAV. Given an input probe image captured in the visible range, our ThermalReID framework detects objects in a thermal image and performs the ReID. We evaluate our ThermalReID framework and two modern baselines on the task of object ReID for four object classes, using real and synthetic data. We use the IoU and mAP metrics for the object detection task, and cumulative matching characteristic (CMC) curves and normalized area-under-curve (nAUC) for the ReID task. The evaluation shows that our ThermalReID framework outperforms the baselines in ReID accuracy and successfully performs object ReID in the thermal gallery image from the color probe image. Furthermore, fusing the semantic data with the input thermal gallery image increases the object detection and localization scores.


INTRODUCTION
Object Re-Identification (ReID) is the task of matching a given object in a new environment with its image captured in a different environment. The input for a ReID method includes two sets of images. The probe set includes one or more images of the object that must be identified in the new environment. The gallery set includes images that may contain the object from the probe image. The complexity of the ReID task arises from differences in object appearance between the probe and gallery sets. Such differences may originate from changes in illumination or in the viewpoints of the multiple cameras that capture the images in the probe and gallery sets.
Many ReID methods have been developed to date for the task of person ReID (Nguyen et al., 2017c, Nguyen, Park, 2016a, Ye et al., 2018b, Ye et al., 2018a). Such methods can be broadly divided into three groups: deep learning, transform learning, and metric learning. Deep learning methods leverage neural networks to learn end-to-end models for matching objects in the probe and gallery sets. Transform learning methods learn to translate images in the probe set to match the camera viewpoint and illumination conditions of the gallery set. Metric learning methods aim to develop a function that returns a distance for a given pair of samples from the probe and gallery sets. The distance is required to be small if the pair is correct and large otherwise.
While many solutions have been proposed for the ReID task for images captured in the visible range, cross-modality ReID remains challenging. Recently, a new generation of neural networks focusing on generative learning has been developed. Such networks are commonly called Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs are capable of learning complex image-to-image translations such as a season change or an object transfiguration. Modern research demonstrates that GANs can learn to translate probe images to different viewpoints or different illumination conditions. To the best of our knowledge, there are no results to date in the literature regarding cross-modality object ReID from airborne images.
This paper focuses on developing a deep learning ThermalReID framework for cross-modality object ReID in thermal images. Our framework aims to provide continuous object detection and re-identification while monitoring a region from a UAV. Given an input probe image captured in the visible range, our ThermalReID framework detects objects in a thermal image and performs the ReID. Our pipeline includes four main steps.
Firstly, we translate the input probe color image to the infrared range using a GAN model. After that, we perform geo-localization of the gallery image captured by an onboard infrared camera. Specifically, we generate a semantic segmentation of the gallery image and match it with a semantic map of the landscape. Next, we perform object detection in the gallery image. We use the semantic map as an additional input modality for the object detection model to improve the object detection score. Finally, we perform the ReID using the Bhattacharyya distance between the synthetic thermal probe image and real thermal images from the gallery set.
We developed a 3D environment to train and validate our framework. Our virtual environment includes a city scene that can be rendered in thermal and visible ranges. We include four object classes in our environment: car, person, bicycle, and dog. Using our 3D environment, we prepared a dataset consisting of 10k images divided into training and test splits. We trained our framework using the training split of the dataset and validated it using the test split and samples from the LAERT dataset (Knyaz, 2019).

Contributions
We present three key technical contributions:
• A unified ThermalReID framework for cross-modality object re-identification in thermal images.
• A new geo-localization algorithm leveraging a tiled representation of the semantic map and a deep model with inverted residual blocks.
• A YOLO-Semantic model for object detection and localization leveraging an additional semantic labelling of the input thermal image.

Object Re-identification
The problem of object re-identification is important for various computer vision applications, such as multi-modal image segmentation and object detection, autonomous driving, and security. It therefore attracts the attention of many researchers (Farenzena et al., 2010, Gong et al., 2014, Wu et al., 2017, Wang et al., 2019, Bhuiyan et al., 2018, Prosser et al., 2008, Bhuiyan et al., 2015). New re-identification methods significantly improve matching performance. Nevertheless, modern ReID systems still face challenges in areas such as video surveillance. Modern approaches for object re-identification can be separated into three groups (Bhuiyan et al., 2018): direct re-identification methods, metric learning methods, and transform learning methods.
A recent transform learning-based method (Bhuiyan et al., 2018) predicts a person's appearance in a new camera based on a cumulative weighted brightness transfer function. The method uses a robust segmentation technique to segment the human image into meaningful parts, and then matches the features extracted only from the body area. This approach improves person re-identification performance. Multiple pedestrian detections are used to improve the matching rate.
A dedicated algorithm for vehicle re-identification uses the rich annotation available in a large-scale vehicle re-identification dataset (Wang et al., 2019). The dataset contains 137k images of 13k vehicle instances captured by cameras mounted on board unmanned aerial vehicles (UAVs). To increase intra-class variation, each vehicle in the dataset is captured by at least two UAVs at different locations, with diverse view angles and flight altitudes. The dataset contains a variety of manually labelled vehicle attributes, such as vehicle type, color, skylight, bumper, spare tire, and luggage rack. In addition, the discriminative parts of each vehicle are annotated, which helps distinguish one particular vehicle from others. The developed algorithm explicitly detects discriminative parts for each specific vehicle and thus provides high re-identification performance compared with the evaluated baselines and state-of-the-art vehicle ReID approaches.
The Multi-Scale and Occlusion Aware Network for UAV-based imagery extracts information about vehicles under challenging conditions: arbitrary orientations, large scale variations, and partial occlusion. It consists of two parts: the Multi-Scale Feature Adaptive Fusion Network (MSFAF-Net) and the Regional Attention based Triple Head Network (RATH-Net). MSFAF-Net includes a self-adaptive feature fusion module that adaptively aggregates multi-level hierarchical feature maps, helping the Feature Pyramid Network (FPN) deal with vehicle scale changes in images. The second part, RATH-Net, is used to enhance the vehicle of interest and suppress background noise caused by occlusions. Along with the network model, a large comprehensive vehicle dataset of UAV-based imagery was collected.
The AI City Challenge aims to accelerate intelligent video analysis that helps make cities smarter and safer. It is based on large city-scale real traffic data and high-quality synthetic data for evaluating the developed methods. Track 2 of the AI City Challenge is aimed at vehicle re-identification with real and synthetic training data. The solution (He et al., 2020) is based on a strong baseline with a bag of tricks (BoT-BS) proposed in the person ReID domain. It proposes a multi-domain learning method that uses real-world and synthetic data to train the model. The proposed Identity Mining method automatically generates pseudo labels for a part of the testing data and performs better than k-means clustering. Results are post-processed by a tracklet-level re-ranking strategy with weighted features. The method achieves an mAP score of 0.7322 on the AI City Challenge data.
To overcome the difficulties of night-time re-identification, additional modalities such as infrared or long-wave infrared imagery are introduced into re-identification techniques. Additional modalities improve the robustness of matching in low-light conditions.
The use of thermal cameras in computer vision attracts the attention of many researchers in the re-identification field (Yilmaz et al., 2002, Davis, Keck, 2005, Knyaz, Moshkantsev, 2019). While thermal cameras significantly boost pedestrian detection (San-Biagio et al., 2012, Xu et al., 2017) and ReID with paired color and thermal images (Nguyen et al., 2017a), cross-modality object re-identification is still a challenging task (Nguyen et al., 2017c, Nguyen, Park, 2016a, Nguyen, Park, 2016b, Nguyen et al., 2017a, Nguyen et al., 2017b). Most problems arise from severe changes in a person's appearance between color and thermal images. To study the problem of multi-modal re-identification, a set of multispectral datasets has been collected in recent years (Nguyen et al., 2017a, Nguyen et al., 2017c, Nguyen, Park, 2016a, Wu et al., 2017, Ye et al., 2018a). The SYSU-MM01 dataset (Wu et al., 2017) includes unpaired color and near-infrared images. The RegDB dataset (Ye et al., 2018a) provides color and infrared images for the evaluation of cross-modality ReID methods. Comprehensive studies of modern re-identification methods on these datasets have exposed the challenges of color-infrared matching. At the same time, they demonstrated improved ReID robustness at night-time.
The Hierarchical Cross-Modality Disentanglement (Hi-CMD) method (Choi et al., 2020) automatically disentangles ID-discriminative factors and ID-excluded factors from visible-thermal images, thus reducing both intra- and cross-modality discrepancies. It uses ID-discriminative factors for robust cross-modality matching without ID-excluded factors such as pose or illumination. An ID-preserving person image generation network and a hierarchical feature learning module are designed to implement this approach.
Recently proposed generative adversarial networks (GANs) (Goodfellow et al., 2014) provide a basis for the impressive progress in arbitrary image-to-image translation. We hypothesize that using a dedicated GAN framework for color-to-thermal image translation can increase color-thermal ReID performance.

Generative Adversarial Networks
Generative adversarial networks (GANs) (Goodfellow et al., 2014) exploit an antagonistic game approach that significantly increases the quality of image-to-image translation (Isola et al., 2017, Zhang et al., 2017a, Zhang et al., 2017b). The pix2pix GAN framework (Isola et al., 2017) carries out arbitrary image transformations using geometrically aligned image pairs from the source and target domains. The framework successfully performs arbitrary image-to-image translations such as season change and object transfiguration. A pix2pix network model (Zhang et al., 2017a, Zhang et al., 2017b) trained to transform a thermal image of a human face into a color image improves face recognition performance in a cross-modality thermal-to-visible setting.
While the human face has a relatively stable temperature, color-thermal image translation for more temperature-variable objects, such as the whole human body or vehicles with an arbitrary background, is more challenging.

Framework Overview
Our goal is twofold. Firstly, we would like to perform visual-based UAV geo-localization using onboard cameras. Secondly, we search for the probe object in thermal gallery images. Our framework runs five deep models. An overview of the proposed ThermalReID framework is presented in Figure 1.
Our semantic geo-localization approach is inspired by tiled map representations. Our algorithm estimates the geographic coordinates (φ, λ) of the UAV given an input color or thermal image of the scene and a rough approximation of the current geo-location. The algorithm leverages a distance learning technique. Firstly, we perform a semantic segmentation S of an input image A. Secondly, we use a deep model to estimate a distance between the generated semantic labelling S and semantic tiles from the onboard geographic dataset. We use a MobileNetV2 (Sandler et al., 2018) model for the distance estimation task.
The main object re-identification algorithm leverages the precise coordinates of the UAV estimated by the localization algorithm. It performs object re-identification in three steps. Firstly, given the estimated geo-coordinates, we generate a semantic labelling $S_G$ of the thermal input gallery image $B_G$. We perform precise alignment of the semantic labelling using a differential optical flow estimation approach (Kniaz, 2018b). After that, we generate a synthetic thermal probe image $B_P$ from the input color probe image using a ThermalGAN conditional adversarial network. We use the thermal input gallery image $B_G$ and its semantic labelling $S_G$ as the input for our object detection YOLO-Semantic model. Finally, we measure the distance between each candidate object detected by our YOLO-Semantic model and the synthetic thermal probe image $B_P$. We perform ReID by selecting the candidate object with the smallest distance (Figure 2).

Figure 2. Cross-modality object ReID using a conditional GAN model. Panel labels: Color-to-Thermal Translation, Synthetic Color Image, Color Gallery Image Set, Feature Matching, ThermalReID distance $d(\hat{I}_i, I_j)$.

Semantic Geo-Localization
We perform semantic geo-localization using two deep models and a semantic map of the search area. To optimize the search performance, we use a tiled representation of the semantic map (Figure 3). Our aim is to train an algorithm that estimates the geographic coordinates (φ, λ) of the UAV given an input color or thermal image of the scene and a rough approximation of the current geo-location. We use a MobileNetV2 (Sandler et al., 2018) model as a starting point for our research. Our approach is twofold. First, we perform a semantic segmentation S of an input image A using a GeoGAN (Kniaz, 2018a) model. After that, we use a MobileNetV2 model to estimate a distance between the generated semantic labelling S and semantic tiles from the onboard geographic dataset. The closest matching tile gives the current coordinates of the UAV.
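The tile lookup described above can be sketched as a nearest-tile search. This is only an illustration under stated assumptions: the names `localize`, `label_mismatch`, and `radius_deg` are hypothetical, tiles are keyed by their centre coordinates, and a simple pixel-mismatch ratio stands in for the distance that the paper estimates with a MobileNetV2 model.

```python
from typing import Callable, Dict, List, Tuple

Labels = List[List[int]]       # semantic labelling: one class id per pixel
LatLon = Tuple[float, float]   # (phi, lambda) in degrees


def label_mismatch(a: Labels, b: Labels) -> float:
    """Hypothetical stand-in for the learned distance: fraction of differing pixels."""
    total = sum(len(row) for row in a)
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / total


def localize(segmentation: Labels,
             tiles: Dict[LatLon, Labels],
             rough: LatLon,
             radius_deg: float,
             distance: Callable[[Labels, Labels], float] = label_mismatch) -> LatLon:
    """Return the centre of the tile closest to the segmentation, searching near `rough`."""
    # Restrict the search to tiles around the rough position estimate.
    candidates = {
        c: t for c, t in tiles.items()
        if abs(c[0] - rough[0]) <= radius_deg and abs(c[1] - rough[1]) <= radius_deg
    }
    if not candidates:
        raise ValueError("no tiles within the search radius")
    # The closest matching tile gives the estimated coordinates.
    return min(candidates, key=lambda c: distance(segmentation, candidates[c]))


# Toy 2x2 tiles; the coordinates and labels are made up for illustration.
tiles = {
    (55.00, 37.00): [[0, 0], [1, 1]],
    (55.00, 37.01): [[0, 1], [0, 1]],
    (56.00, 38.00): [[2, 2], [2, 2]],  # outside the search radius
}
observed = [[0, 1], [0, 0]]
estimate = localize(observed, tiles, rough=(55.0, 37.0), radius_deg=0.05)
```

Passing the distance as a callable keeps the search logic separate from the model, so a learned distance could replace the toy one without changing `localize`.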

YOLO-Semantic
Our YOLO-Semantic model is inspired by the YOLOv3 model. We consider three domains: the thermal image domain $\mathcal{B} \in \mathbb{R}^{W \times H}$, the semantic labelling domain $\mathcal{S} \in \mathbb{R}^{K \times W \times H}$, where $K$ is the number of semantic classes predicted by the GeoGAN model, and the bounding box predictions domain $\mathcal{T} \in \mathbb{R}^{(5+K) \times U \times V}$, where $U$ and $V$ are the numbers of cells in the output of our YOLO-Semantic model. We aim to train a mapping $Y : (B_P, S_P) \rightarrow T$ from a pair of input tensors $B_P \in \mathcal{B}$ and $S_P \in \mathcal{S}$ to the bounding boxes tensor $T \in \mathcal{T}$. Details of the proposed architecture are presented in Table 1.
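The tensor shapes in this mapping can be checked with a small bookkeeping sketch. Several points are assumptions rather than details from the paper: the helper name `yolo_semantic_shapes` is hypothetical, the thermal image and semantic labelling are assumed to be stacked along the channel axis, the 5 in (5+K) is read in the usual YOLO way as four box parameters plus one objectness score, and the 416×416 input with a 13×13 grid is just a common YOLOv3 setting.

```python
from typing import Tuple

Shape = Tuple[int, int, int]


def yolo_semantic_shapes(width: int, height: int, num_classes: int,
                         cells_u: int, cells_v: int) -> Tuple[Shape, Shape]:
    """Return (fused_input_shape, output_shape) for the mapping Y: (B, S) -> T."""
    thermal = (1, width, height)             # B with an explicit channel axis
    semantic = (num_classes, width, height)  # S, one channel per semantic class
    # Assumption: B and S are concatenated along the channel axis.
    fused = (thermal[0] + semantic[0], width, height)
    # Assumption: each cell predicts 4 box parameters, 1 objectness score
    # and K class scores, which gives the (5 + K) channels of T.
    output = (5 + num_classes, cells_u, cells_v)
    return fused, output


# Illustrative values only: K = 4 classes, 416x416 input, 13x13 output grid.
fused_shape, output_shape = yolo_semantic_shapes(416, 416, 4, 13, 13)
```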

ThermalReID
We follow the general approach for thermal ReID proposed in . We use the Bhattacharyya distance to compute a distance between two signatures using temperature histograms and the MSER distance (Matas et al., 2002, Cheng et al., 2011)
$$d(\hat{I}_i, I_j) = \beta_H \cdot d_H\left(H_t(B_i), H_t(B_j)\right)$$
where $d_H$ is the Bhattacharyya distance, $d_{MSER}$ is the MSER distance (Matas et al., 2002), and $\beta_H$ is a calibration weight parameter. Overview of the proposed ThermalReID framework is presented in Figure 1.
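The histogram term of this distance can be illustrated with a minimal sketch. It covers only the temperature-histogram part; the MSER term is not sketched. The function names, the temperature range, and the bin count are assumptions, and the Bhattacharyya distance is taken in the form sqrt(1 − BC), where BC is the Bhattacharyya coefficient, which is one of several common definitions.

```python
import math
from typing import List, Sequence


def temperature_histogram(temps: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalised histogram H_t of pixel temperatures over the range [lo, hi]."""
    if not temps:
        raise ValueError("empty temperature sample")
    counts = [0] * bins
    for t in temps:
        idx = min(bins - 1, max(0, int((t - lo) / (hi - lo) * bins)))
        counts[idx] += 1
    return [c / len(temps) for c in counts]


def bhattacharyya_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """d_H = sqrt(1 - BC) with BC = sum_k sqrt(p_k * q_k) for normalised histograms."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def rank_candidates(probe: Sequence[float], gallery: Sequence[Sequence[float]],
                    beta_h: float = 1.0, lo: float = 0.0, hi: float = 50.0,
                    bins: int = 10) -> List[int]:
    """Gallery indices sorted by increasing distance to the synthetic thermal probe."""
    hp = temperature_histogram(probe, lo, hi, bins)
    scores = [beta_h * bhattacharyya_distance(hp, temperature_histogram(g, lo, hi, bins))
              for g in gallery]
    return sorted(range(len(gallery)), key=scores.__getitem__)


# Toy pixel temperatures in degrees Celsius, made up for illustration.
probe = [30.0, 31.0, 36.0, 36.0]
gallery = [[10.0, 11.0, 12.0, 12.0], [30.0, 30.0, 36.0, 37.0], [45.0, 46.0, 47.0, 48.0]]
ranking = rank_candidates(probe, gallery)
```

The first index in `ranking` is the candidate with the smallest distance, which corresponds to the selection step of the ReID stage.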

EXPERIMENTS
We evaluate our ThermalReID framework and modern baselines using various metrics. We use the IoU and mAP metrics for the object detection task. We use the cumulative matching characteristic (CMC) curves and normalized area-under-curve (nAUC) for the ReID task. The evaluation demonstrated encouraging results and proved that our ThermalReID framework outperforms existing baselines in the ReID accuracy. Furthermore, we demonstrated that the fusion of the semantic data with the input thermal gallery image increases the object detection and localization scores.
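For reference, the metrics named above can be computed as in the following sketch. The function names are hypothetical, boxes are assumed to be given as (x1, y1, x2, y2) corners, and nAUC is taken here as the mean of the CMC values over all ranks, which is one common normalisation and may differ from the exact protocol used in the evaluation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cmc_curve(ranks: Sequence[int], gallery_size: int) -> List[float]:
    """CMC(k): share of probes whose correct match is ranked k or better (1-based ranks)."""
    n = len(ranks)
    return [sum(r <= k for r in ranks) / n for k in range(1, gallery_size + 1)]


def nauc(cmc: Sequence[float]) -> float:
    """Normalised area under the CMC curve, taken as the mean CMC value."""
    return sum(cmc) / len(cmc)


# Toy example: four probes whose correct matches were ranked 1, 2, 1 and 4.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
cmc = cmc_curve([1, 2, 1, 4], gallery_size=4)
score = nauc(cmc)
```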

Network Training
We trained our models and baselines using the training split of the LAERT dataset. Training of the ThermalGAN model took 68 hours. We optimize the networks using minibatch SGD with an Adam solver. We use a learning rate of 0.0002 and momentum parameters β1 = 0.5, β2 = 0.999, similar to (Isola et al., 2017).
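To make the role of these hyperparameters concrete, the following sketch applies one standard Adam update with the reported learning rate and momentum parameters. It is a plain-Python illustration of the generic Adam rule, not the training code used for the framework; the function name `adam_step` and the value of `eps` are assumptions.

```python
import math
from typing import List, Tuple


def adam_step(params: List[float], grads: List[float],
              m: List[float], v: List[float], t: int,
              lr: float = 0.0002, beta1: float = 0.5, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; `t` is the 1-based step index. Returns (params, m, v)."""
    new_p, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1.0 - beta1 ** t)            # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v


# Toy example: a single weight of 1.0 with gradient 2.0 at the first step.
weights, m_state, v_state = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

At the first step the bias-corrected update has a magnitude close to the learning rate regardless of the gradient scale, so the weight moves by roughly 0.0002.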

Quantitative Evaluation
We evaluate our model using the SemanticVoxels (Kniaz et al., 2020) and LAERT (Knyaz, 2019) datasets. We evaluate our model and the baselines quantitatively in terms of the cumulative matching characteristic (CMC) curve and normalized area-under-curve (nAUC). We compare our model to two baselines. The VRAI ( Table 2.

CONCLUSION
We developed the ThermalReID framework for cross-modality object re-identification. We evaluated our framework and two modern baselines on the task of object ReID for four object classes. Our framework successfully performs object ReID in the thermal gallery image from the color probe image. The evaluation using real and synthetic data demonstrated that our ThermalReID framework increases the ReID accuracy compared to modern ReID baselines.