UNSUPERVISED HARMONIOUS IMAGE COMPOSITION FOR DISASTER VICTIM DETECTION

ABSTRACT: Deep detection networks trained with large amounts of annotated data achieve high accuracy in detecting various objects, such as pedestrians, cars, and lanes. These models have been deployed and used in many scenarios. A disaster victim detector is very useful when searching for victims who are partially buried by debris caused by an earthquake or a building collapse. However, since large quantities of real images with buried victims are difficult to obtain for training, a deep detector cannot realize its full potential. In this paper we generate realistic images for training a victim detector. We first randomly cut out human body parts from an open-source human data set and paste them into ruins background images. Then, we propose an unsupervised generative adversarial network (GAN) to harmonize the body parts to fit the style (illumination, texture and color characteristics) of the background. These generated images are finally used to fine-tune the detection network YOLOv5. We evaluate the AP (average precision) both for IoU (Intersection over Union) 0.5 and for IoU ∈ [0.5:0.05:0.95], denoted as AP@0.5 and AP@[.5:.95], respectively. The experimental results show that YOLOv5l pre-trained on the COCO data set performs poorly in detecting victims, with an AP@[.5:.95] of only 19.5%. The model fine-tuned on our composite images can effectively detect victims, increasing the AP@[.5:.95] to 33.6%; the AP@0.5 increases from 32.4% to 53.4%. Our unsupervised harmonization network thus makes composite images an effective substitute for scarce real training data.


INTRODUCTION
Object detection is an important topic that has been investigated for nearly 20 years. With the rise of deep learning (DL) and the availability of massive training data in recent years, DL-based object detection methods have made outstanding achievements and have become dominant (Zhao et al., 2019). Large data sets such as VOC (Everingham et al., 2015), ImageNet (Deng et al., 2009), and COCO (Lin et al., 2014) enable researchers to train a common object detector. There are also many annotated data sets that make it possible to detect specific objects. For example, Wider Face (Yang et al., 2016) is a data set designed for face detection. Mapillary (Neuhold et al., 2017) provides 65 classes for object detection in autonomous-driving scenes. Fruit and crop detection data sets are also available (David et al., 2020; Bargoti and Underwood, 2017). A high-accuracy detector relies heavily on a large number of training images, but in some special scenes training images are difficult to obtain. For instance, it is useful to train a victim detector that can be used by unmanned aerial vehicles (UAVs) in a rescue mission, but it is difficult to acquire such real images for training. Existing victim detection networks rely on common object data sets, which do not contain real victim images (Hoshino et al., 2021). Hartawan et al. (2019) trained a detector using the INRIA person data set (Dalal and Triggs, 2005). The performance of these models on real victim images still needs more verification. Sulistijono and Risnumawan (2016) used only 19 real victim images to test their detector, which is not sufficient for a convincing evaluation.
As the number of images is a crucial factor in training good deep learning models, many researchers have studied how to generate synthetic data that can be used in training. Dwibedi et al. (2017) proposed a simple but effective way to augment data for instance detection of indoor objects. They collected object instances and pasted them on random backgrounds to generate more training images. Wang et al. (2019) used a similar method that replaces an object instance with another instance of the same class.
Besides, it is convenient to use advanced computer graphics to generate large numbers of realistically rendered images, which makes up for the lack of real training data. In addition to many rendered data sets in the field of semantic segmentation (McCormac et al., 2017; Ros et al., 2016; Kirsanov et al., 2019; Zhang et al., 2021, 2022), researchers have also rendered synthetic data sets to train better object detection models. Han et al. (2021) proposed a rendered 3D face data set to study the relationships between object features and face detection performance. Peng et al. (2015) created 3D synthetic models to augment training and outperformed previous methods. Rozantsev et al. (2015) proposed a method to synthesize unlimited unmanned aerial vehicle (UAV) images in arbitrary 3D poses, improving their UAV detector.
To facilitate first responders' rescue missions and save more lives, we aim at training a victim detector to search for victims who are partially buried under ruins after an earthquake or building collapse. In general, when a person is crushed under ruins, only part of the body is exposed, and the color of the body or clothes is similar to the background color due to dust or soil. A person detector trained on a normal object detection data set or a specific pedestrian data set might work, but the performance will likely be poor because these data sets usually contain fully visible, standing people in normal scenes. Therefore, in this paper we propose a composite data set for victim body-part detection. We first randomly cut out human body parts from the open-source human parsing data set LIP (Gong et al., 2017) and paste them onto random backgrounds with collapsed structures. Then we use a novel unsupervised Generative Adversarial Network (GAN) to harmonize the body parts to fit the style of the background. Our contribution can be summarized as follows:
• We propose a novel framework to generate a data set that contains harmonious composite images of human body parts in ruins.
• We use the generated composite images to train a victim detector, and the experimental results show that our composite data is effective when training a victim detector. Our source code can be found on our project website https://github.com/noahzn/VictimDet.
Our approach pipeline is shown in Figure 1 and consists of three steps: image composition, image harmonization, and fine-tuning a victim detector. The rest of this paper is organized as follows. We present the image composition part in Section 2. Section 3 introduces the details of our deep harmonization network. Our experiments are elaborated in Section 4. Section 5 concludes the paper.

IMAGE COMPOSITION
The first step generates a composite image with simple cut and paste. Since we focus on victim detection in ruins, we need to collect both human body parts images as the foreground, and images with ruins as the background.

Collect background images
We use search keywords such as earthquake, ruins and collapse to collect background images I b from Google images. We check to make sure there are no human beings in these images.

Collect foreground images
To obtain foreground images I_f we have two options. The first, as used by Dwibedi et al. (2017) and Ghiasi et al. (2021), is to cut out complete human instances from existing data sets that contain the human class. However, in real scenarios most of a person's body may be buried, with only one arm or one leg exposed, so this option is not flexible enough to make composite images with only limbs exposed. The second option is to cut out a specific body part of a person as the foreground. This option is better, because even if the detector only detects one arm or one leg, it can classify it as a potential victim. In this paper we use the second option, starting from the open-source human parsing data set LIP (Gong et al., 2017). LIP provides more than 50K human images annotated with 19 semantic classes such as face, left arm, right arm, upper clothes, etc. We can select and cut out specific semantic body parts from the foreground. The binary mask of the body parts is denoted as M_f.
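As a sketch of this cut-out step, body-part masks can be extracted from a LIP-style parsing map by selecting class IDs. The class IDs below are illustrative placeholders, not the official LIP label mapping:

```python
import numpy as np

# Hypothetical class IDs for demonstration only -- look up the real
# LIP label mapping before using this on the actual data set.
ARM_IDS = {14, 15}   # e.g. left/right arm (assumed IDs)
LEG_IDS = {16, 17}   # e.g. left/right leg (assumed IDs)

def part_mask(parsing_map: np.ndarray, class_ids: set) -> np.ndarray:
    """Return a binary mask M_f (H, W) covering the requested semantic classes."""
    return np.isin(parsing_map, list(class_ids)).astype(np.uint8)

# Usage on a toy 4x4 parsing map.
toy = np.array([[0, 14, 14, 0],
                [0, 14, 16, 0],
                [0, 16, 16, 0],
                [0,  0,  0, 0]])
m = part_mask(toy, ARM_IDS | LEG_IDS)
```

Masks extracted this way can then be combined to form the five body-part combinations described below.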
As shown in the first row of Figure 2, not all the images in the LIP data set are suitable for compositing victim images. Because there is no automatic method to accurately filter out all black-and-white images, blurred images, low-resolution images, severely occluded images and images with no exposed body parts, we manually deleted these images, and some good image samples are shown in the second row of Figure 2. These images have higher resolution, and the postures of the characters in the images are suitable to generate composite images in rescue scenes.
Figure 2. Some images in the LIP data set are not suitable to be used as the foreground, such as (a) a black-and-white image; (b) a blurred image; (c) a low-resolution image; (d) a severely occluded image; and (e) an image with no body parts exposed. (f)-(i) are good image samples we keep to generate composite images. We blur faces for privacy reasons.

Paste body parts into disaster scenes
For privacy reasons we discard faces in the images. At the same time we merge the 19 semantic classes into five body-part combinations: upper limbs, upper limbs + torso, lower limbs, lower limbs + torso, and full body. For each image we randomly cut out body parts according to these five combinations. We also apply data augmentation such as resizing, cropping, and horizontal flipping to increase the diversity of the foreground images. We paste the body parts at random positions onto the background images. The composite image I_c can be represented by the background I_b, the binary mask M_f, and the foreground I_f as:

$$I_c = M_f \odot I_f + (1 - M_f) \odot I_b,$$

where $\odot$ denotes element-wise multiplication.
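This cut-and-paste composition is a per-pixel blend of foreground and background under the binary mask, I_c = M_f · I_f + (1 − M_f) · I_b. A minimal sketch:

```python
import numpy as np

def composite(bg: np.ndarray, fg: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """I_c = M_f * I_f + (1 - M_f) * I_b with a binary (H, W) mask.

    bg, fg: (H, W, 3) float images; mask: (H, W) with 1 on the body part.
    """
    m = mask[..., None].astype(fg.dtype)  # broadcast the mask over RGB channels
    return m * fg + (1.0 - m) * bg

# Toy usage: paste a white foreground pixel onto a black background.
bg = np.zeros((2, 2, 3))
fg = np.ones((2, 2, 3))
mask = np.array([[1, 0], [0, 0]])
ic = composite(bg, fg, mask)
```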

IMAGE HARMONIZATION
Different from the tasks of Dwibedi et al. (2017) and Ghiasi et al. (2021), in our task there are large visual differences between the foreground and the background because of their inconsistent styles (color, illumination, texture). Dwibedi et al. (2017) used Gaussian blending to smooth the edges of the foreground, but this method cannot change the foreground's color, illumination, or texture. To make the composite images look more realistic, we propose an unsupervised image harmonization network to adjust the style of the body part. Our framework for image harmonization is based on adversarial training. As shown in Figure 3, it consists of a generator G and two discriminators D_global and D_local. The generator generates a harmonious image, and the two discriminators distinguish real images from generated harmonious images globally and locally, respectively. Using only a global discriminator would ignore the relationship between small body parts and their surrounding background pixels, so we introduce a local discriminator to harmonize local illumination, texture and color characteristics.

Generator
The structure of our generator G is a U-Net with 3 attention layers. DoveNet (Cong et al., 2020) also uses the same generator structure, but we only use a three-channel composite image as input instead of using an extra mask channel, because we find that the combination of using an extra mask and our loss functions yields black artifacts on the output.
Given a composite image I_c we want the network to output a harmonious image I_h. The network should be trained to keep the background unchanged and make the foreground adopt the style of the background, while the content does not change. For each pixel i, whose value is in the range [0, 1], we calculate the smooth L1 term:

$$\ell_i = \begin{cases} 0.5\,(I_h^i - I_c^i)^2 & \text{if } |I_h^i - I_c^i| < 1, \\ |I_h^i - I_c^i| - 0.5 & \text{otherwise.} \end{cases}$$

Then, the masked loss over the whole image is:

$$\mathcal{L}_1 = \frac{1}{N} \sum_{i} \left(1 - M_f^i\right) \ell_i,$$

where N is the number of pixels, so that only the background is constrained to stay identical to the input. Compared with the L1 loss, the smooth L1 loss avoids gradient explosion in some cases (Girshick, 2015).
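One plausible reading of this masked loss (assuming the mask leaves the foreground free to change and pins the background to the input) can be sketched as:

```python
import numpy as np

def masked_smooth_l1(pred: np.ndarray, target: np.ndarray,
                     fg_mask: np.ndarray) -> float:
    """Smooth L1 averaged over background pixels only.

    pred, target: (H, W) images with values in [0, 1];
    fg_mask: (H, W) binary mask, 1 on the pasted body part.
    """
    d = np.abs(pred - target)
    per_pixel = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    bg = 1.0 - fg_mask                      # 1 where the background is
    return float((per_pixel * bg).sum() / max(bg.sum(), 1.0))
```

Since pixel values lie in [0, 1], the quadratic branch is always active here; the linear branch matters only for unnormalized inputs.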

Global discriminator
We use a global discriminator D_global to discriminate whether an input image is real or composite. Because we do not have the real counterpart of a composite image, we take the background image I_b used to generate the composite as the real image. The background image can be seen as a harmonious image, so the global discriminator helps the generator to produce harmonious images at a global level. D_global uses a PatchGAN (Isola et al., 2017) structure, and the adversarial loss to train the global discriminator can be defined as:

$$\mathcal{L}_{D_{global}} = -\mathbb{E}\left[\log D_{global}(I_b)\right] - \mathbb{E}\left[\log\left(1 - D_{global}(I_h)\right)\right].$$
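A minimal sketch of this standard discriminator objective, assuming the PatchGAN outputs a map of per-patch real/fake probabilities:

```python
import numpy as np

def d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Standard GAN discriminator loss over PatchGAN score maps:
    -E[log D(I_b)] - E[log(1 - D(I_h))], scores in (0, 1)."""
    eps = 1e-7  # numerical guard against log(0)
    return float(-(np.log(d_real + eps).mean()
                   + np.log(1.0 - d_fake + eps).mean()))
```

In practice the score maps would come from the discriminator network; here plain arrays stand in for them.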

Local discriminator
We use another local discriminator D_local to focus on the foreground and its surrounding background, and to constrain local style consistency. The input is P(I_h), a patch centered on the foreground body part, obtained by expanding the foreground bounding box by 60 pixels in each direction, so that the patch contains both the foreground body part and its surrounding background pixels. The local discriminator is trained with

$$\mathcal{L}_{D_{local}} = -\mathbb{E}\left[\log D_{local}(P(I_b))\right] - \mathbb{E}\left[\log\left(1 - D_{local}(P(I_h))\right)\right],$$

where P(I_b) is the corresponding patch on the background image, and the adversarial loss for the generator is $\mathcal{L}_{G_{local}} = \mathbb{E}[\log(1 - D_{local}(P(I_h)))]$. The local discriminator focuses on the cropped patches and is helpful for discriminating small foreground regions.

Figure 3. Our framework consists of a generator G and two five-layer discriminators D_global, D_local. The generator takes a composite image as input and generates a harmonious image. The two discriminators distinguish real images from generated harmonious images globally and locally, respectively.
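The patch extraction described above can be sketched as a clipped crop around the foreground bounding box:

```python
import numpy as np

def local_patch(img: np.ndarray, bbox, margin: int = 60) -> np.ndarray:
    """Crop P(I): the foreground bounding box (x0, y0, x1, y1) expanded by
    `margin` pixels on each side, clipped to the image border."""
    h, w = img.shape[:2]
    x0, y0, x1, y1 = bbox
    x0 = max(x0 - margin, 0)
    y0 = max(y0 - margin, 0)
    x1 = min(x1 + margin, w)
    y1 = min(y1 + margin, h)
    return img[y0:y1, x0:x1]
```

Clipping matters because pasted body parts may sit near the image border, where the expanded box would otherwise leave the image.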

Locally constrained perceptual loss
The body parts in the output image should carry the same semantic information as those in the input image. The perceptual loss proposed by Johnson et al. (2016) enforces similarity between images at the feature level, and it has been used in many tasks (Rad et al., 2019; Yang et al., 2018; Ledig et al., 2017). The perceptual loss comprises a content loss and a style loss. The content loss constrains the high-level semantic information, while the style loss makes two images consistent in style, such as color, illumination and texture. Different from those methods, which compute the perceptual loss between the output image and the corresponding ground truth, we compute the perceptual loss between the input image and the output image. This follows from our purpose that the harmonization network should change only the style of the input image, not its semantic information. Besides, we compute the content loss on the same cropped patch P(I_h) as used in the local discriminator. The proposed locally constrained content loss $\mathcal{L}_{LCC}$ is defined as:

$$\mathcal{L}_{LCC} = \sum_{j} \frac{1}{C_j M_j N_j} \left\| \phi_j(P(I_h)) - \phi_j(P(I_c)) \right\|_2^2,$$

where $\phi_j$ denotes the feature map of the j-th convolutional layer of a pre-trained VGG16 model, $C_j \times M_j \times N_j$ is the size of the feature map, and $\|\cdot\|_2$ computes the $l_2$-norm. The shallow layers of a CNN represent low-level style features such as colors and edges, while the deeper layers represent high-level semantic and content information (Lee and Tseng, 2019). We choose j = 8, 11 to compute the content loss.
Similarly, our style loss is also locally constrained, and it only changes the style within the patch. It is defined as:

$$\mathcal{L}_{LCS} = \sum_{j} \frac{1}{C_j M_j N_j} \left\| \mathcal{G}(\phi_j(P(I_h))) - \mathcal{G}(\phi_j(P(I_b))) \right\|_2^2, \quad (7)$$

where $\mathcal{G}$ is the Gram matrix proposed in the perceptual loss (Johnson et al., 2016). We choose j = 3, 5, two shallow layers, to compute the style loss. We also use the total variation loss $\mathcal{L}_{TV}$ to smooth the local patch (Rudin et al., 1992), expressed as:

$$\mathcal{L}_{TV} = \sum_{m,n} \left( \left(P(I_h)_{m+1,n} - P(I_h)_{m,n}\right)^2 + \left(P(I_h)_{m,n+1} - P(I_h)_{m,n}\right)^2 \right),$$

where $P(I_h)_{m,n}$ denotes the pixel value at coordinates (m, n) on the patch P(I_h). The combined loss for training the generator G can be expressed as:

$$\mathcal{L}_G = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_{LCC} + \lambda_3 \mathcal{L}_{LCS} + \lambda_4 \mathcal{L}_{G_{global}} + \lambda_5 \mathcal{L}_{G_{local}} + \lambda_6 \mathcal{L}_{TV},$$

where $\mathcal{L}_{G_{global}} = \mathbb{E}[\log(1 - D_{global}(I_h))]$ and $\mathcal{L}_{G_{local}} = \mathbb{E}[\log(1 - D_{local}(P(I_h)))]$ are the adversarial losses from the global and local discriminators. We set λ1 = 80, λ2 = 2, λ3 = 0.2, λ4 = 1, λ5 = 1 so that each loss term has a similar scale; λ6 is set to 10^-5 for slight regularization.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B3-2022, XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France
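The Gram-matrix style comparison of Eq. (7) can be sketched as follows; plain arrays stand in for the VGG16 feature maps φ_j, and the 1/(C_j M_j N_j) normalization (one common choice) is folded into the Gram computation:

```python
import numpy as np

def gram(feat: np.ndarray) -> np.ndarray:
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_h: np.ndarray, feat_b: np.ndarray) -> float:
    """Squared Frobenius distance between Gram matrices of two feature maps."""
    return float(((gram(feat_h) - gram(feat_b)) ** 2).sum())
```

Because the Gram matrix discards spatial arrangement and keeps only channel correlations, matching Gram matrices matches color and texture statistics without forcing the patches to be pixel-wise identical.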

EXPERIMENTS
In this section we evaluate if the images generated by the proposed approach are helpful to train a victim detector. Our goal is not to improve the accuracy by directly enhancing the detector, but to fine-tune a pre-trained detector with additional composite images.

Data set and implementation details
We generate 1936 composite images and harmonize them using the proposed framework. Each image has a size of 512×512 pixels, and we generate ground-truth bounding boxes of the body parts. The purpose of generating these composite images is to use them to train a body-part detector. We also collect 197 real images to test whether our generated images help to improve the accuracy of body-part detection. Some images are collected from the internet, with earthquake rescue and collapse rescue as the main search keywords. Most of the victims in these images are buried under debris, and only part of their bodies is visible. We also take some pictures ourselves. All the test images are annotated. For fine-tuning and inference we use the official implementation of YOLOv5 on an Ubuntu 18.04 system with an Nvidia Titan XP graphics card. Figure 4 shows some samples of composite images (first row) and the corresponding harmonious versions (second row) generated by our method. The proposed unsupervised harmonization framework successfully transfers the illumination and colors of the background images to the foreground body parts. Moreover, the network automatically adds grayish tones to the arms and legs, making them look realistic, because in real images the body parts of victims are usually dirty due to dust or soil. However, we have no ground truth to evaluate the quality of these harmonized images; we can only evaluate whether the fine-tuning of victim detectors benefits from them. The quantitative evaluation is in Section 4.4.

Details of fine-tuning a body part detector
We fine-tune the detection model YOLOv5 (Jocher et al., 2021), a PyTorch implementation following YOLOv4 (Bochkovskiy et al., 2020), pre-trained on the COCO data set (Lin et al., 2014). The YOLOv5 model consists of a backbone to extract features, a neck to concatenate features, and a head to predict the class and the bounding box. Depending on network depth and width, we use three sizes of YOLOv5 models (Jocher et al., 2021), namely YOLOv5s, YOLOv5m, and YOLOv5l. Because the pre-trained YOLOv5 model has advantages in feature extraction, we fine-tune the pre-trained detector by freezing the backbone and updating only the weights of the neck and the head. We want the model to apply the feature-extraction ability learned from the COCO data set to our composite data set. Two data set settings are used: (i) the composite images without harmonization, and (ii) the harmonized composite images.
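The backbone-freezing idea can be sketched by partitioning parameters by name. The naming below is illustrative only: real YOLOv5 modules are indexed (model.0, model.1, …) rather than labeled "backbone", and recent releases of the official train.py expose a `--freeze` option that serves the same purpose:

```python
from typing import Iterable, Tuple

def split_finetune_params(named_params: Iterable[Tuple[str, object]],
                          frozen_prefixes=("backbone.",)):
    """Partition named parameters into (trainable, frozen) name lists.

    Parameters whose names start with a frozen prefix (here a hypothetical
    'backbone.' namespace) are excluded from the optimizer; the rest
    (neck and head) are updated during fine-tuning.
    """
    trainable, frozen = [], []
    for name, _param in named_params:
        (frozen if name.startswith(frozen_prefixes) else trainable).append(name)
    return trainable, frozen

# Toy usage with made-up parameter names.
params = [("backbone.0.conv.weight", None), ("head.24.m.0.weight", None)]
trainable, frozen = split_finetune_params(params)
```

In a PyTorch model one would additionally set `param.requires_grad = False` for the frozen group before building the optimizer.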

Evaluation of victim detection
Following the evaluation metric used in the Pascal VOC challenge (Everingham et al., 2010), we measure the AP (average precision) and count a detection as successful if the predicted bounding box has an IoU (Intersection over Union) greater than a threshold of 0.5 with the ground truth; we denote this as AP@0.5. We also evaluate AP@[0.5:0.95], a COCO metric calculated by averaging AP over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (Lin et al., 2014). Table 1 shows the results of the COCO pre-trained YOLOv5 models and our models on the test set. The first row of each model shows that YOLOv5s is fast but performs poorly in detecting victims; its shallow structure cannot learn a good representation from the training data. Although YOLOv5l is about 2.5 times slower than YOLOv5s, the detection accuracy improves substantially when the models are fine-tuned with our composite images (the second row of each model). Our harmonious images further improve the results (the third row of each model), which verifies the effectiveness of the proposed approach. Figure 5 visualizes some detection results. The default COCO pre-trained model (the second row) cannot detect victims as expected.
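The two metrics can be sketched as follows, assuming per-threshold AP values have already been computed by a standard evaluation pipeline:

```python
def iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def ap_coco(ap_at_iou: dict) -> float:
    """AP@[.5:.95]: average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]
    return sum(ap_at_iou[t] for t in thresholds) / len(thresholds)
```

AP@0.5 is then simply the entry at threshold 0.5, while the COCO metric rewards boxes that remain accurate at stricter overlap requirements.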
Because of the uncertainty of image copyright, we do not show detection results on real victims in this paper. It is worth noting that our models can detect some body parts but fail to detect some complete bodies. For example, our test set contains a picture of a victim lying on the ground covered with mud, in which the COCO pre-trained model cannot detect any body parts. Although our method detects only one foot of the victim rather than the whole body, this is still useful in a real rescue operation.

CONCLUSION
To enable first responders to find victims partially buried in ruins more efficiently in real rescue operations, we propose a novel framework to generate composite victims-in-ruins images and apply them to fine-tune a victim detector. The experimental results show that normal COCO pre-trained models achieve low AP in detecting victims in ruins. Since it is difficult to obtain more real training images, our method uses composite images to train victim detectors, and its effectiveness is verified. We evaluate three variants of YOLOv5, which are fast detectors that can be deployed on UAVs for real-time victim detection. We hope that this work is useful in real disaster search and rescue and can help save more lives.
There is still much room for improvement in victim detection models, and our future work will focus on the use of UAVs with illumination for victim detection at night, because many rescues are carried out at night.

Figure 4. Qualitative results. The first row shows composite images, and the second row shows the corresponding harmonious images generated by our network.

Figure 5. Visualization of victim detection. The first row is the ground truth, and the second row is the default COCO pre-trained YOLOv5l model. The third and fourth rows are the models fine-tuned on our composite images and harmonious images, respectively.