ORIENTED VEHICLE DETECTION IN HIGH-RESOLUTION REMOTE SENSING IMAGES BASED ON FEATURE AMPLIFICATION AND CATEGORY BALANCE BY OVERSAMPLING DATA AUGMENTATION

Vehicles in high-resolution remote sensing images are small, so they usually lack detailed information and are difficult for a detector to learn. In addition, vehicles comprise multiple fine-grained categories that differ only slightly and are randomly located and oriented. Therefore, it is difficult to locate and identify these fine-grained vehicle categories. To address these problems in high-resolution remote sensing images, this paper proposes an oriented vehicle detection approach. First, we propose an oversampling and stitching method that augments the training dataset by increasing the frequency of objects with fewer training samples, in order to balance the number of objects in each fine-grained vehicle category. Second, considering the effect of pooling operations on the representation of small objects, we propose to increase the resolution of feature maps so that the detailed information hidden in them is enriched and they can better distinguish the fine-grained vehicle categories. Finally, we design a joint training loss function for horizontal and oriented bounding boxes with center loss, to decrease the impact of small between-class diversity on vehicle detection. The proposed framework is evaluated on the VEDAI dataset, which consists of 9 fine-grained vehicle categories. The experimental results show that the proposed framework outperforms most competitive approaches, with a mean average precision of 60.7% and 60.4% for horizontal and oriented bounding boxes, respectively.


INTRODUCTION
Small object detection is attracting increasing attention due to the advancement of high-resolution remote sensing images. Real-time vehicle detection is important for tasks such as autonomous driving and traffic monitoring (Sun et al., 2006). In some more specific tasks, the types and orientations of vehicles also need to be determined. Traffic conditions can be better scheduled and analysed when more accurate information about vehicles is available. Therefore, oriented fine-grained vehicle detection is of great research significance. However, it is more challenging than other multiclass object detection problems because the objects are small.
Traditional vehicle detection methods for remote sensing images usually include the following four steps: 1) Data preprocessing. It includes operations such as improving the image quality, improving the contrast between vehicles and backgrounds, and clustering. 2) Determination of the potential position of vehicles. For example, the contrast between different parts of an image can be calculated to determine the potential position of vehicles. 3) Segmentation. It is performed to accurately extract potential vehicles from the background. 4) Recognition. Features are extracted from potential objects, and the category of each vehicle is determined from the extracted features.
Recent vehicle detection methods for remote sensing images differ substantially from traditional methods, since they use machine learning to decrease the influence of intermediate decisions on the detection results. Machine learning methods can be divided into deep learning based approaches and shallow feature based approaches, according to the type of features extracted (Cheng et al., 2016). Before 2012, shallow feature based approaches were the mainstream algorithms for object detection. However, shallow feature based methods, including Viola Jones Detectors (Viola et al., 2001), the Deformable Parts Model (DPM) (Felzenszwalb et al., 2008) and the Histogram of Oriented Gradients (HOG) (Dalal et al., 2005), usually perform poorly in representing vehicles because they lack semantic information, which is important for recognizing objects.
Convolutional Neural Networks (CNN) can automatically learn semantic features and perform well in representing objects. The development of vehicle detection approaches has accelerated since deep learning architectures emerged in 2012. According to their detection process, deep learning based vehicle detection approaches are divided into one-stage approaches, such as Single Shot Multi-Box Detector (SSD) (Liu et al., 2016), You Only Look Once (YOLO) (Redmon et al., 2016), YOLOv2 (Redmon et al., 2017) and YOLOv3 (Redmon et al., 2018), and two-stage approaches, such as Region CNN (RCNN) (Van et al., 2011), SPPNet (He et al., 2015), Fast RCNN (Girshick et al., 2015), and Faster RCNN (Ren et al., 2015). Compared with two-stage approaches, one-stage approaches have lower precision but faster detection speed. Two-stage methods can achieve high precision at a speed that can meet real-time requirements. Therefore, this paper mainly investigates two-stage vehicle detection approaches.
Existing two-stage vehicle detection methods may suffer from three problems: class imbalance, the reduced discriminative ability caused by pooling operations, and the arbitrary orientation of objects. The research status of these three problems is as follows.
In remote sensing images, vehicles are moving objects and are usually distributed in geographic space at different frequencies. Therefore, the numbers of samples of the different fine-grained vehicle categories in a dataset are usually uneven. The imbalanced class distribution biases network training toward the vehicle categories with more samples, which makes it difficult to obtain ideal detection results for fine-grained vehicles. Ouyang et al. (2016) propose to fine-tune the distribution of under-represented categories by clustering similar categories to address the class imbalance problem. Oksuz et al. (2020) propose an online foreground balanced sampling method, which decreases the imbalance between the distributions of different objects within each batch by assigning a probability to each true bounding box. Wang et al. (2019) propose a sample exchange strategy that generates new samples and decreases the imbalance by exchanging objects of the same type between different images. However, the above methods are mainly aimed at the object imbalance problem in natural imagery. Unlike natural images, remote sensing images cover a larger area and have more complex backgrounds. Therefore, the above class balance methods may deliver unsatisfactory performance when applied to remote sensing.
Several studies have aimed to improve the performance of small object detection. These methods mainly follow two directions: one is to improve the resolution of the training images that contain small objects, and the other is to enrich the detailed information of the feature maps that describe the small objects. In terms of improving image resolution, Ji et al. (2019) fuse the object detection network with an image super-resolution reconstruction network to increase the image size. Singh et al. (2018) propose to establish multiscale pyramids for training images by resizing them. In theory, increasing the image resolution improves the discriminative ability of the features of different vehicles and the performance of locating and detecting vehicles. In terms of improving the resolution of the feature map, Tayara et al. (2017) propose to perform deconvolution on the feature maps continuously to improve the discrimination of shallow features and retain the detailed information of feature maps. AVDNet (Mandal et al., 2019) keeps the detailed information by increasing the spatial resolution of feature maps for vehicles and introducing ConvRes modules at layers of different scales. Lin et al. (2017) propose layer-by-layer prediction using feature pyramids (FPN) to detect multi-scale objects. The advantage of this method is that the multi-scale feature map is an inherent transition module in the CNN: it makes a prediction on the feature map of each layer of the CNN and finally selects the optimal detection results. The above methods based on feature pyramids or image pyramids increase the amount of training data and the computational cost, and place high demands on computers and graphics cards. Since available computers may not meet these hardware requirements, these methods are not commonly used in practical applications. In addition, the above methods may lose the deep semantic information hidden in features while improving the feature resolution, so their ability to distinguish different vehicles is still limited.
Remote sensing images are acquired by sensors looking down from overhead. Therefore, the direction of a vehicle in the image is arbitrary. Traditional horizontal bounding boxes can only roughly describe the position of vehicles, whereas the direction of a vehicle helps to locate it accurately. Therefore, it is necessary to study oriented vehicle detection algorithms. In text detection, Ma et al. (2018) propose an oriented text detection algorithm to detect inclined text. Yang et al. (2018) propose a multi-oriented ship detection algorithm for remote sensing images. Ding et al. (2018) propose an oriented multi-class object detection method for aerial images. Oriented object detection methods are increasingly important in the field of object detection, especially for fine-grained vehicle detection, where the direction of each vehicle needs to be determined accurately.
To address the problems mentioned above, we propose an oriented vehicle detection framework for high-resolution remote sensing images based on feature amplification and oversampling data augmentation. This paper takes the two-stage object detection algorithm Faster RCNN as the research basis and makes improvements on this framework. Considering the distribution characteristics of vehicles in remote sensing images, we design an oversampling and stitching data augmentation method so that the numbers of different vehicles in the images are balanced for model training and the negative influence of category imbalance in the training dataset is reduced to some extent. Considering both the discriminative ability of features and computer hardware requirements, we explore the semantic information hidden in deep feature maps and perform a magnification operation on the deep feature maps. By performing bilinear interpolation on feature maps, the detailed information can be enriched while the deep semantic information is maintained, which may improve the discrimination of the features in representing vehicles. We also design a joint training loss function for horizontal and oriented bounding boxes with the center loss (Wen et al., 2016). The proposed method regresses the vehicle position and direction by setting horizontal anchors and jointly training the horizontal bounding boxes and oriented bounding boxes. The method can simultaneously produce horizontal and oriented detection results and thus obtain more accurate vehicle positions.

The overall architecture
Figure 1 shows the overall architecture of the proposed approach, where ResNet101 (He et al., 2016) is used for extracting feature maps. The proposed approach consists of three parts: 1) oversampling based data augmentation, 2) increasing the size of the feature maps and 3) a joint training loss function for horizontal and oriented bounding boxes combined with center loss. The details of the three parts are as follows.
First of all, we perform oversampling data augmentation on the training dataset. The frequency and location of the vehicles in remote sensing images are random, and the number of samples per fine-grained vehicle category in the training dataset is usually uneven. We perform stitching and oversampling augmentation on the training data by increasing the frequency of vehicles with fewer training samples to synthesize a new dataset.
In the region proposal network (RPN) stage, we set up multi-scale and multi-shape horizontal anchors and select positive and negative samples for training the RPN by calculating the overlap between anchors and ground truth. The oversampling augmentation method can improve the diversity and number of positive samples in the RPN stage.
In the classification network stage, we perform feature map magnification, which enhances the ability of feature maps to represent vehicles by increasing the size of the deep feature maps.
Considering the orientation of vehicles, we propose a multi-task loss function that jointly trains oriented and horizontal bounding boxes and introduces the center loss to minimize the within-class difference. The loss function can increase the ability of deep features to distinguish fine-grained vehicles and yield accurate vehicle positions.
Figure 1. The architecture of the proposed framework.

Data augmentation by oversampling and stitching
We propose an oversampling and stitching method that augments the training dataset by increasing the frequency of objects with fewer training samples, in order to balance the number of objects in each fine-grained vehicle category, as shown in Figure 2. We crop the vehicles from the dataset images according to their coordinates. By increasing the frequency of these vehicles in different background images, the number of vehicles in each category reaches a balanced state. Considering the random location of vehicles in geographic space, and to reduce the impact of batch size on foreground category imbalance, we apply the rule that each augmented image may contain all object categories. Meanwhile, augmented objects must not overlap existing objects.
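The stitching step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`samples_needed`, `boxes_overlap`, `paste_without_overlap`), the rejection-sampling placement with `max_tries` attempts, and the use of axis-aligned crops are all assumptions made for the example.

```python
import numpy as np

def samples_needed(counts):
    """Extra instances each category needs to match the largest category."""
    target = max(counts.values())
    return {name: target - n for name, n in counts.items()}

def boxes_overlap(a, b):
    """True if axis-aligned boxes (x1, y1, x2, y2) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def paste_without_overlap(background, existing_boxes, crops, rng, max_tries=50):
    """Paste each (crop, label) pair at a random location that is still free.

    Returns the augmented image and a list of (box, label) for pasted crops.
    A crop that finds no free location within `max_tries` draws is skipped.
    """
    image = background.copy()
    occupied = list(existing_boxes)
    pasted = []
    height, width = image.shape[:2]
    for crop, label in crops:
        ch, cw = crop.shape[:2]
        if ch > height or cw > width:
            continue  # crop does not fit into this background
        for _ in range(max_tries):
            x1 = int(rng.integers(0, width - cw + 1))
            y1 = int(rng.integers(0, height - ch + 1))
            box = (x1, y1, x1 + cw, y1 + ch)
            if not any(boxes_overlap(box, other) for other in occupied):
                image[y1:y1 + ch, x1:x1 + cw] = crop
                occupied.append(box)
                pasted.append((box, label))
                break
    return image, pasted
```

In such a sketch, `samples_needed` would decide how many crops of each under-represented category to draw, and `paste_without_overlap` would be called once per background image with a `numpy.random.default_rng` generator.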

Magnification operation of deep feature maps
Considering the effect of pooling operations on the representation of small objects in convolutional neural networks, we propose to increase the resolution of feature maps so that they can better distinguish the fine-grained vehicle categories and the detailed information hidden in them is enriched. There are two main methods for upsampling a feature map: interpolation and deconvolution. However, enlarging a feature map by deconvolution usually produces checkerboard artifacts, which are not conducive to the detailed description of features. Therefore, we employ interpolation, specifically bilinear interpolation, to increase the size of the deep feature maps. Figure 3 illustrates how bilinear interpolation enlarges the feature map.
Figure 3. Flow chart of bilinear interpolation to enlarge feature map.
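As a concrete illustration of the magnification operation, the following NumPy sketch enlarges a feature map by bilinear interpolation. It is a minimal sketch under stated assumptions: the function name `bilinear_upsample`, the (H, W, C) layout, and the half-pixel sampling convention are choices made for this example and are not taken from the paper.

```python
import numpy as np

def bilinear_upsample(feature_map, scale):
    """Enlarge a (H, W, C) feature map by `scale` with bilinear interpolation."""
    h, w = feature_map.shape[:2]
    out_h, out_w = int(round(h * scale)), int(round(w * scale))
    # Map output pixel centres back to source coordinates (half-pixel convention),
    # clamping to the valid range so border pixels replicate the edge values.
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = feature_map[y0][:, x0] * (1 - wx) + feature_map[y0][:, x1] * wx
    bottom = feature_map[y1][:, x0] * (1 - wx) + feature_map[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

For example, `bilinear_upsample(fmap, 2.0)` doubles the height and width of `fmap` while keeping the number of channels, and every output value is a weighted average of at most four neighbouring input values.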

Multi-task loss function for joint horizontal and oriented bounding boxes
We design a joint training loss function for horizontal and oriented bounding boxes with the center loss, in order to decrease the impact of within-class diversity on vehicle detection. The proposed method can detect horizontal and rotational objects simultaneously by combining the loss function of horizontal bounding boxes with that of rotational bounding boxes. As shown in Eq. (1) and (2) [...] (1) where m is the mini-batch size, [...] categories. We randomly select 50% of the images as training samples and use the other 50% for testing.
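To make the center loss term of this section concrete, the sketch below follows the standard formulation of Wen et al. (2016), in which the loss over a mini-batch of m samples is half the sum of squared distances between each feature vector and the center of its class. Whether the paper uses exactly this form and how it weights the term against the other parts of the joint loss cannot be read from the text, so the function names, the update rate `alpha`, and the absence of a weighting factor are assumptions.

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over a mini-batch (Wen et al., 2016)."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2)

def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its samples in the batch.

    delta_j = sum_i [y_i == j] (c_j - x_i) / (1 + number of samples of class j)
    c_j    <- c_j - alpha * delta_j
    """
    new_centers = centers.copy()
    for j in np.unique(labels):
        mask = labels == j
        delta = (centers[j] - features[mask]).sum(axis=0) / (1 + mask.sum())
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Minimizing this term pulls features of the same vehicle category toward a common center, which is the mechanism by which the within-class difference is reduced.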

Description of Experimental Data
In this paper, the backbone for extracting features is ResNet101 pre-trained on the ImageNet dataset. The proposed method is implemented with the TensorFlow framework on Ubuntu 16.04. A GTX 1080 Ti GPU with 12 GB of memory is used to accelerate computation. The mini-batch sizes in the RPN stage and classification stage are 256 and 512, respectively. The learning rate is set to 0.003 for the first 30,000 iterations and to 0.00003 for the subsequent 70,000 iterations. Training stops when 100,000 iterations are reached. The momentum and weight decay are set to 0.9 and 0.0001, respectively. The anchor scale parameter is set to [8, 16, 32, 64, 128], and the shape parameter is set to [1, 1/2, 2/1, 1/3, 3/1, 1/4, 4/1, 1/5, 5/1, 1/6, 6/1, 1/7, 7/1]. Anchors with an IOU overlap above 0.7 are considered positive samples, and those with an IOU overlap below 0.3 are considered negative samples.
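The anchor and training settings above can be sketched in a few lines. This is an illustration only: the interpretation of a scale as the side length of a square anchor, of a shape value as a height-to-width ratio with preserved area, and the choice to ignore anchors whose IOU lies between the two thresholds (as in the original Faster RCNN) are assumptions, as are all function names.

```python
import math

SCALES = [8, 16, 32, 64, 128]
RATIOS = [1, 1/2, 2, 1/3, 3, 1/4, 4, 1/5, 5, 1/6, 6, 1/7, 7]

def anchor_shapes(scales=SCALES, ratios=RATIOS):
    """(width, height) for every scale/ratio pair; area stays scale**2."""
    return [(s / math.sqrt(r), s * math.sqrt(r)) for s in scales for r in ratios]

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """1 = positive, 0 = negative, -1 = ignored (assumed for the middle band)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1

def learning_rate(iteration):
    """Step schedule: 0.003 for the first 30,000 iterations, then 0.00003."""
    return 0.003 if iteration < 30000 else 0.00003
```

With 5 scales and 13 shape values, this yields 65 anchor shapes per feature map location.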
The images augmented by the proposed method are shown in Figure 5. An original image before data augmentation usually contains only a few vehicle types. After data augmentation, each image contains at least 9 different types of vehicles, and the vehicle positions are randomly generated by the proposed algorithm. Overlap between vehicles is avoided while the frequency of the categories with fewer samples in the dataset is increased. The proposed synthesis method also increases the background diversity of the vehicles to a certain extent.

Evaluation Metric
In this paper, we use the common evaluation metric mean average precision (mAP), which is the mean of the average precision (AP) over all vehicle categories. The higher the mAP, the better the object detection performance. The average precision of each category is calculated as in Eq. (3):

$$\mathrm{AP} = \sum_{n} (R_n - R_{n-1}) P_n \qquad (3)$$

where $R_n$ and $P_n$ represent the recall ratio and precision ratio when the n-th threshold is set. The recall ratio and precision ratio are defined in Eq. (4) and (5):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

where TP and FP are the numbers of correctly and wrongly detected vehicles, and FN is the number of undetected vehicles. If the Intersection over Union (IOU) between a detected bounding box and its ground-truth location is above 0.5, the bounding box is counted as a TP; otherwise, it is counted as an FP.
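The metric can be computed as in the sketch below. It assumes the common reading of Eq. (3), namely that AP sums the precision at each threshold weighted by the increase in recall, with the recall before the first threshold taken as 0; the function names and the dictionary of per-category AP values are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with recalls in non-decreasing order."""
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

def mean_average_precision(ap_per_category):
    """Mean of the per-category AP values."""
    return sum(ap_per_category.values()) / len(ap_per_category)
```

For instance, with recalls [0.5, 1.0] and precisions [1.0, 0.5], the AP is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.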

Analysis of the data augmentation approaches
We evaluate the proposed vehicle detection framework to verify the benefit of the oversampling and stitching data augmentation method. We adopt three different datasets for the experiments. The first, denoted as R, is the dataset after rotation augmentation with angles of 90°, 180° and 270°; the second, denoted as O, is the dataset after the oversampling and stitching augmentation proposed in this paper; and the third, denoted as M, is the combination of the previous two datasets.
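The rotation augmentation used to build dataset R can be illustrated as follows. The sketch rotates an image by a multiple of 90° and maps annotation points, such as the four corners of an oriented bounding box, to the rotated image. The function name, the counter-clockwise direction, and the (x, y) = (column, row) pixel-index convention are assumptions for this example; the paper does not state these details.

```python
import numpy as np

def rotate_image_and_points(image, points, k):
    """Rotate `image` counter-clockwise by 90*k degrees and map pixel coordinates.

    `points` is a sequence of (x, y) = (column, row) pixel indices, for example
    the four corners of an oriented bounding box.
    """
    pts = np.asarray(points, dtype=float).copy()
    h, w = image.shape[:2]
    for _ in range(k % 4):
        x, y = pts[:, 0].copy(), pts[:, 1].copy()
        # One counter-clockwise quarter turn: (x, y) -> (y, W - 1 - x).
        pts[:, 0], pts[:, 1] = y, (w - 1) - x
        h, w = w, h  # width and height swap after each quarter turn
    return np.rot90(image, k), pts
```

Calling the function with k = 1, 2 and 3 gives the 90°, 180° and 270° variants of an annotated image.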
In Tables 1 and 2, bold indicates the higher average precision and recall ratio between the R and O datasets, and underlining indicates the best average precision and recall ratio among the R, O and M datasets. The oversampling and stitching data augmentation method improves the recall ratios of the vehicle categories with few training samples, such as Truck, Tractor, Boat, Vans, and Plane, and effectively improves the corresponding average detection accuracy. Because the proposed data augmentation increases the number of vehicles in the other categories, the network's bias toward categories with more training samples is reduced. The results on the merged dataset are better than those on either of the other two datasets. Training the proposed network on the merged dataset further improves the average vehicle detection accuracy, since the number of effective training samples is further increased.

Analysis of the feature map magnification operation
To verify that the feature map magnification operation improves vehicle detection results, we use the merged dataset to study the magnification of deep feature maps and analyse the impact of different upsampling parameters on vehicle detection. Two interpolation methods, bilinear interpolation and nearest neighbour interpolation, are compared. H and O denote the horizontal and oriented detection results, respectively.
Bold in Table 3 marks the best average precision of horizontal or oriented bounding boxes in each row. The mAPs of the proposed framework without the magnification operation are 58.9% and 58.3% for horizontal and oriented vehicles, respectively. The detection accuracy obtained with bilinear upsampling of the feature map is higher than that obtained without feature amplification, which shows that enlarging the feature map by bilinear interpolation can improve the detection accuracy to a certain extent. Among the upsampling factors of 1.5, 2.0, 2.5, and 3.0, bilinear interpolation with a factor of 2.0 yields the largest improvement in average detection accuracy, about 2%. We also perform an upsampling experiment on the feature maps using nearest neighbour interpolation with a factor of 2.0. The nearest neighbour upsampling method shows lower accuracy than the framework without feature map amplification. Nearest neighbour interpolation causes a jagged effect on the enlarged feature map, which is not beneficial to the feature representations of truck, car, others, and tractor.

CONCLUSION
The experimental results show that the oversampling and stitching data augmentation method can reduce the imbalance of vehicle categories in the dataset to a certain extent. Combining the datasets produced by oversampling and stitching augmentation and by rotation augmentation improves the mAP by about 3%. The feature map magnification operation proposed in this paper increases the network's ability to classify the fine-grained vehicle categories by restoring the detailed information of the feature map without increasing the amount of calculation. Considering the random direction of vehicles, the joint training of horizontal and oriented bounding boxes with center loss proposed in this paper can simultaneously produce horizontal and oriented detection results. The center loss constraint can reduce the intra-class diversity of the fine-grained categories to a certain extent and improve the accuracy of vehicle detection. The proposed method achieves mAPs of 60.7% and 60.4% for horizontal and oriented bounding boxes, respectively, which outperforms other competitive vehicle detection approaches. Compared with the original Faster RCNN algorithm, the proposed method achieves an improvement of about 10% in mAP.

Figure 2. Schematic of oversampling and stitching data augmentation.
), the joint training loss function used in this paper consists of 5 parts, namely the cross-entropy loss of rotational objects ()

Figure 5. Comparison of images before and after synthesis by the oversampling and stitching method. (a)-(e) are original images from the VEDAI dataset, and (f)-(j) are the corresponding images synthesized by the proposed method.

Figure 4 shows the experimental dataset used in this paper, Vehicle Detection in Aerial Imagery (VEDAI) (Razakarivony et al., 2016). Images in the VEDAI dataset have a size of 1024 × 1024 pixels. The ground sampling distance (GSD) of the original images is 12.5 centimetres. Each image consists of four bands: red, green, blue and near infrared. Since it is enough to produce satisfactory performance with only the visible wavelengths for fine-grained vehicle

Table 1. Detection accuracy of the proposed algorithm for horizontal bounding boxes on the three datasets.

Table 2. Detection accuracy of the proposed algorithm for oriented bounding boxes on the three datasets.

Table 3. Detection accuracy of different parameters in the magnification operation.

All the other comparison methods are only performed on the merged dataset. As can be seen in Table 4, the detection results of the proposed method are better than those of the Faster RCNN algorithm and the FPN algorithm for both horizontal and oriented bounding boxes. Although the FPN method selects features suited to detecting a given type of vehicle from the multilayer pyramid feature maps and achieves the highest detection accuracy for tractors and vans, its detection results in other categories are poor, especially for airplanes. Airplane is the category with the smallest number of samples. FPN builds a feature pyramid and increases the number of training parameters, so it requires numerous samples and has difficulty achieving ideal detection results. The method in this paper is an improvement on the Faster RCNN algorithm. The enlarged feature maps may improve the ability to distinguish fine-grained vehicles by restoring the detailed information of the feature maps. Center loss may relatively enlarge the gap between the features of different vehicle types in the feature space by reducing the intra-class diversity of features belonging to the same vehicle type, which may lead to fewer misclassifications of similar vehicle types. Therefore, the mAP of the Faster RCNN algorithm is lower than that of the proposed framework. The results of Faster RCNN on the rotation dataset and the merged dataset show that the merged dataset can further improve the average vehicle detection accuracy. Compared with the original Faster RCNN algorithm, the proposed method with the merged dataset achieves an improvement of about 10% in mAP.

Table 4. Vehicle detection comparison experiments.