Rapid Target Detection in High Resolution Remote Sensing Images Using YOLO Model

Abstract: Object detection in high-resolution remote sensing images is a fundamental and challenging problem in remote sensing image analysis for civil and military applications, because complex neighboring environments can cause recognition algorithms to mistake irrelevant ground objects for targets. The Deep Convolutional Neural Network (DCNN) is a hotspot in object detection for its powerful feature extraction ability and has achieved state-of-the-art results in computer vision. A common DCNN-based object detection pipeline consists of region proposal, CNN feature extraction, region classification and post-processing. The YOLO model instead frames object detection as a regression problem: a single CNN predicts bounding boxes and class probabilities in an end-to-end way, which makes prediction faster. In this paper, a YOLO-based model is used for object detection in high-resolution remote sensing images. Experiments on the NWPU VHR-10 dataset and on our airport/airplane dataset obtained from Google Earth show that, compared with the common pipeline, the proposed model speeds up the detection process while maintaining good accuracy.


INTRODUCTION
Object detection is an important task for understanding high-resolution images and has significant military value. The purpose of target detection is to determine whether a given remote sensing image contains targets of an interest type and to determine the position of each predicted target. "Target" usually refers to man-made objects, such as buildings, vehicles, airplanes and ships, which have boundaries independent of the background environment, while part of their feature information comes from that environment. With the rapid development of Remote Sensing (RS) technology, the imagery produced by high-resolution remote sensing satellites (such as IKONOS, SPOT-5, WorldView and QuickBird) carries more abundant information for feature extraction and ground object detection than low-resolution imagery, and many artificial objects that were difficult to detect in the past can now be detected. Since the 1980s, object detection in remote sensing images has been widely studied, mainly using shallow features hand-engineered by skilled people with field experience, which also often required domain expertise. This means that if conditions change even slightly, a framework which works well on one task may fail on another, so the whole feature extractor might have to be rewritten from scratch, which is very time-consuming and expensive. These disadvantages led researchers in the field to look for a more robust and effective approach. In 1998, Yann LeCun et al. proposed a handwritten digit recognition method using neural networks, achieving more than 99% accuracy on the digit recognition task. That result re-initiated interest in using neural networks for image recognition. However, limited by the lack of computational power and of effective techniques for training neural network models on more complex tasks, related research was largely abandoned for many years.

OBJECT DETECTION BASED ON YOLO FRAMEWORK
Figure 1 shows the YOLO network structure, which has 24 convolutional layers and 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space of the preceding layers, and the convolutional layers are pretrained on the ImageNet classification task. The final output of the network is a 7 × 7 × 30 tensor of predictions.
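The 7 × 7 × 30 output shape follows from the grid formulation: an S × S grid, B boxes per cell (each with x, y, w, h and a confidence score) and C class probabilities per cell give S × S × (B·5 + C) values. A small sanity check, assuming YOLO's PASCAL VOC configuration (S = 7, B = 2, C = 20):

```python
def yolo_output_shape(S, B, C):
    """Output tensor shape: S x S grid, B boxes (x, y, w, h, conf) and C class scores per cell."""
    return (S, S, B * 5 + C)

# YOLO's PASCAL VOC configuration: 7x7 grid, 2 boxes, 20 classes
shape = yolo_output_shape(7, 2, 20)
print(shape)  # (7, 7, 30)
```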

Model Training
(2) Optimization. The model parameters are optimized with the Adam algorithm, whose gradient moment estimates are

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$  (1)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$  (2)

Formula 1 represents the first-moment estimate and formula 2 the second-moment estimate of the gradient $g_t$; $\alpha$ is the step size and $\epsilon$ is a small constant (default value $10^{-8}$) used in the update $\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments.
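The first- and second-moment estimates described above can be sketched as a single Adam-style update step. This is a minimal illustration assuming the standard Adam formulation; the learning rate and the toy quadratic objective are illustrative, not the paper's actual training setup:

```python
def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates (formulas 1 and 2) with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate (formula 1)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate (formula 2)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# toy example: minimize f(x) = x^2 starting from x = 5.0 (gradient is 2x)
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```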
(3) Loss function. In mathematical optimization, statistics, econometrics, decision theory, machine learning and computational neuroscience, a loss (or cost) function maps an event or the values of one or more variables onto a real number intuitively representing some "cost" associated with the event; an optimization problem seeks to minimize it. An objective function is either a loss function or its negative (in specific domains variously called a reward function, profit function, utility function, fitness function, etc.), in which case it is to be maximized. In YOLO, the mean square error between the vector output by the network and the vector corresponding to the ground-truth image is used as the loss function to optimize the model parameters:

loss = coordError + iouError + classError

where coordError is the coordinate error, iouError is the IOU error, and classError is the classification error.
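For reference, the full sum-of-squared-errors loss from the original YOLO formulation, where $\mathbb{1}_{ij}^{obj}$ denotes that bounding box predictor $j$ in grid cell $i$ is responsible for the object, with $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$:

```latex
\begin{aligned}
\text{loss} ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
  \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
{}+{} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
  \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
{}+{} & \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
{}+{} & \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c \in \text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```

The first two lines correspond to coordError, the next two to iouError (confidence error for cells with and without objects), and the last to classError.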

Post processing
The last step of YOLO is to perform non-maximum suppression (NMS) on the S × S × (B·5 + C) output vectors. The purpose of NMS is to eliminate redundant boxes and keep the best detection location for each object. NMS algorithms are also used in well-known target detection frameworks such as R-CNN and SPP-net.
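Greedy NMS can be sketched in a few lines of plain Python; the corner-format boxes and the 0.5 IoU threshold here are illustrative assumptions, not values specified by the paper:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# two heavily overlapping detections of one airplane, plus a distant one:
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```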

Fine-tuning
Transfer learning strategies depend on various factors, but the two most important are the size of the new dataset and its similarity to the original dataset. Fine-tuning allows us to bring the power of state-of-the-art DCNN models to new domains where insufficient data and time/cost constraints might otherwise prevent their use.

Evaluation metric
In order to evaluate the performance of the algorithm, the results of the model are compared with the ground truth. In this paper, Precision and Recall are used to evaluate the similarity and diversity between the detection results and the ground truth of the test dataset:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where TP = true positives, FP = false positives, FN = false negatives.
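These metrics follow directly from the counts; a minimal sketch (the example counts are illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN). Returns 0.0 for empty denominators."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 8 correct detections, 2 false alarms, 4 missed targets
p, r = precision_recall(8, 2, 4)
```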

Advantages and disadvantages
Advantages: 1) Speed. YOLO solves object detection as a regression problem, which simplifies the entire detection pipeline. 2) Lower background false positive rate. During training and prediction, YOLO perceives the information of the entire image, whereas R-CNN-based methods only perceive local information inside each candidate box. 3) Strong versatility. YOLO can also be applied to object detection in unnatural images.
Disadvantages: 1) The model uses multiple downsampling layers, so the object features learned by the network are coarse, which affects detection performance. 2) Poor object localization accuracy. 3) Poor detection of small and densely packed targets. 4) Lower recall rate.

Dataset
(1) Airport and airplane datasets, obtained from Google Earth and manually annotated with the Imglabel software. The airport dataset contains 1893 remote sensing images and the airplane dataset contains 250. Figure 5 shows two examples of image labeling.

Results
Figure 6 shows airports detected using our method. Table 3 lists the testing time of four object detection methods; as shown there, the YOLO model greatly improves detection speed and can meet the requirement of real-time processing. Figure 9 shows results of object detection with small and dense objects.
As shown in Figure 9, YOLO does not perform well for objects that are very close to each other (the center points of multiple objects fall into the same grid cell) or for groups of small objects, because only two boxes belonging to one category are predicted per grid cell. There are also problems of poor training approximation and generalization for test images with unusual aspect ratios. Due to the design of the loss function, localization error is one of the most important factors affecting detection performance, especially for very large or very small objects.
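The same-cell failure mode can be illustrated by mapping object centers to grid cells; the image size and center coordinates below are illustrative:

```python
def grid_cell(cx, cy, img_w, img_h, S=7):
    """Map an object center (in pixels) to its (col, row) index in the S x S YOLO grid."""
    return (int(cx * S / img_w), int(cy * S / img_h))

# Two small airplanes whose centers land in the same cell of a 7x7 grid over a
# 448x448 image: the network must share one class prediction (and at most B=2
# boxes) between them, so dense small objects get missed.
a = grid_cell(100, 100, 448, 448)
b = grid_cell(110, 105, 448, 448)
print(a, b, a == b)
```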

Figure 2. Framework of machine learning based object detection.
The main disadvantages of R-CNN are its computational complexity and the limited accuracy of border positions. Spatial pyramid pooling (SPP-net) was proposed to avoid repeating the CNN computation for overlapping region proposals.

Figure 3. Framework of YOLO.
The YOLO model is a CNN-based framework: convolutional layers extract image features, and fully connected layers predict the output probabilities and coordinates. The network architecture is similar to GoogLeNet; a quick version with fewer convolutional layers, Fast YOLO, was also trained. The final output is a 7 × 7 × 30 tensor. In the loss function, an indicator variable denotes that an object falls into bounding box j of grid cell i.
Software environment. The model is implemented in Keras with a TensorFlow backend. Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. TensorFlow, developed by the Google Brain team, is an open-source software library for dataflow programming across a range of tasks. Several third-party libraries are also required: Tifffile for reading remote sensing imagery, OpenCV for basic image processing, Shapely for handling polygon data, Matplotlib as a visualization tool, and Imglabel for dataset construction. The experiments run on a Sugon W560-G20 server with an E5-2650 v3 CPU, 32 GB memory, and a Quadro K2000 GPU.

Figure 4. Experimental process.
The experimental process can be divided into four steps. Dataset construction: remote sensing images (obtained from open data sources such as Google Earth, USGS and DigitalGlobe) are annotated using Imglabel to obtain a standard PASCAL VOC format dataset, which is divided into training, validation and test sets. Model construction: building a CNN structure and setting its hyperparameters. Model training: training with the training and validation sets. Model prediction: testing with the test set; the results are used to evaluate the model.

Figure 5. Demo of image labeling.
(2) NWPU VHR-10 dataset. A publicly available 10-class geospatial object detection dataset containing 800 very-high-resolution (VHR) remote sensing images that were cropped from Google Earth and the Vaihingen dataset and then manually annotated by experts. The ten object classes are airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle.

Figure 8. Detection results on the NWPU VHR-10 dataset.
Results and performances of airport detection and airplane detection are shown in Figures 6-7 and Tables 1-2. Results of object detection on the NWPU VHR-10 dataset are shown in Figure 8.

Table 3. Testing time of four object detection methods.
Method:        R-CNN   Fast R-CNN   Faster R-CNN   YOLO
Testing time:  64.8    3.3          0.9            0.1

As shown in Table 3, the YOLO model greatly improves detection speed and can meet the requirement of real-time processing.

CONCLUSION
This paper addresses the problem of rapid object detection in high-resolution remote sensing images with CNNs. A YOLO model is used for object detection in high-resolution remote sensing images. Experiments on the NWPU VHR-10 dataset and on our airport and airplane datasets obtained from Google Earth demonstrate that the YOLO model has strong applicability to remote sensing imagery, especially in prediction speed. The main disadvantages of YOLO are its poor localization accuracy and its poor training approximation and generalization for images with unusual aspect ratios and for objects that are very close to each other. It also needs a large number of high-quality Ground Truth labels for model training, which relies on professional interpretation experience and much manual work. Solving these problems is the orientation of future research.

REFERENCES
Spatial Resolution Remote Sensing Imagery, Remote Sensing 9.7 (2017): 666.
Shi, Shaohuai, et al., Benchmarking State-of-the-Art Deep Learning Software Tools, 2016.
Cheng, Gong, et al., Object detection in VHR optical remote sensing images via learning rotation-invariant HOG feature, International Workshop on Earth Observation and Remote Sensing Applications, IEEE, 2016: 433-436.
Cheng, Gong, et al., Multi-class geospatial object detection and geographic image classification based on collection of part detectors, ISPRS Journal of Photogrammetry & Remote Sensing 98.1 (2014): 119-132.
Cheng, Gong, P. Zhou, and J. Han, Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images, IEEE Transactions on Geoscience & Remote Sensing 54.12 (2016): 7405-7415.

Revised October 2017

OBJECT DETECTION FOR REMOTE SENSING IMAGERY

2.1 Machine Learning based object detection
In 2012, Alex Krizhevsky et al. proposed a deep convolutional neural network named AlexNet, and its result opened the floodgates to new research in the field under the name "deep learning". The convolutional neural network (CNN) uses two methods to greatly reduce the number of parameters: local receptive fields and parameter sharing. The local receptive field is a principle learned from human vision: pixels in an image are more relevant to their adjacent pixels. Instead of having each neuron receive connections from all neurons in the previous layer, CNNs use a receptive-field-like layout in which each neuron receives connections only from a subset of neurons in the previous (lower) layer. The use of receptive fields in this fashion is thought to give CNNs an advantage in recognizing visual patterns compared to other types of neural networks. The parameter sharing scheme is used in convolutional layers to control the number of parameters: the statistical features of one part of an image are generally considered the same as those of other parts, so a convolution kernel is used as a position-independent feature extractor, and pooling layers are used to represent the invariance implied by transformations of the image. A preliminary set of experiments fusing CNNs obtained state-of-the-art results on the well-known UC Merced dataset. These studies show that CNN models can be generalized to the field of remote sensing imagery and obtain better results than traditional methods.
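The parameter reduction from local receptive fields and sharing can be made concrete by comparing parameter counts; the image and kernel sizes below are illustrative:

```python
def fc_params(in_pixels, out_units):
    """Fully connected layer: every output unit connects to every input pixel."""
    return in_pixels * out_units

def conv_params(kernel, in_channels, out_channels):
    """Convolutional layer: one shared kernel per (input, output) channel pair,
    independent of image size (bias terms omitted for simplicity)."""
    return kernel * kernel * in_channels * out_channels

# 224x224 single-band image, 64 output feature maps of the same size:
fc = fc_params(224 * 224, 64 * 224 * 224)  # dense layer producing 64 maps
conv = conv_params(3, 1, 64)               # 3x3 convolution producing 64 maps
print(fc, conv)
```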

Table 1. Performance of airport detection.
Figure 7 shows airplanes detected using our method.

Table 2. Performance of airplane detection.