OBJECT LOCALIZATION FOR SUBSEQUENT UAV TRACKING

The paper is devoted to the problem of semi-automatic initialization of a tracking algorithm, i.e. selecting an object of interest for tracking by unmanned aerial vehicles (UAVs) or drones. We propose an algorithm that refines the position and dimensions of the bounding box of the tracked object at the initial time (on the first frame), based on a saliency detection algorithm that simulates the map of human attention. We tested existing algorithms for object tracking by UAVs on UAV123, the largest and most complex dataset. It is shown that the quality of tracking after initialization by the proposed algorithm varies within limits acceptable for successful tracking of the object. The advantage of the proposed approach is that it applies principles used by the human visual system: color, contrast and central focus.


INTRODUCTION
The recent progress in using UAVs for different human needs motivates software development research in the fields of security and video surveillance. In security systems it is crucial to determine the current position of the tracked object and, for the UAV tracking problem, also to plan and form an optimal flight path of the unmanned aircraft in three-dimensional space. Therefore, many computer vision teams work on the task of detecting and tracking moving objects from UAVs in real time.
A critical issue in visual tracking is the initialization, i.e. the detection of an object of interest. The quality of tracking largely depends on it, as a roughly defined position and size of the object of interest in the first frame entails a rapid breakdown of the tracking process. Usually the UAV operator marks an object of interest using a control pad, but due to wind, UAV speed and other disturbing factors, the result is mostly unsatisfactory for subsequent tracking. Bad initialization makes tracking much more difficult, because it leads either to the case when parts of the scene background occupy a significant part of the object's region (Vishnyakov et al., 2015), or to the case when important parts of the object are discarded.
The task of semantic segmentation, also called scene parsing, splits an image into semantically independent regions and is related to the object detection task. Such algorithms can be used to define the position and size of a tracked object. However, they give redundant information in this situation, since we are only interested in the area containing the tracked object. In addition, semantic segmentation algorithms are rather slow.
The first stage of the proposed approach is preliminary processing of the image (noise removal) by a Gaussian filter and conversion of the image into the CIE LAB color space. The next step is segmenting the image into homogeneous areas (superpixels) by the simple linear iterative clustering (SLIC) algorithm (Achanta et al., 2012).

Image pre-processing
The basic pre-processing task is noise reduction, which smoothing filters perform quite well. There are many linear and non-linear smoothing algorithms; they are typically applied for noise reduction, luminance stabilization, and contrast and clarity enhancement. One of the popular smoothing methods, successfully applied in many areas, is Gaussian filtering. The Gaussian kernel coefficients are sampled from the 2D Gaussian function.

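$$F(i, j) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{i^2 + j^2}{2\sigma^2}\right)$$
where $\sigma$ is the standard deviation of the distribution and $i$, $j$ are the pixel coordinates.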
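As an illustration, sampling and applying such a kernel can be sketched as follows. This is a minimal sketch, not the implementation used in this work: the function names, the default `sigma=1.0`, the normalization of the kernel to unit sum and the border replication are all assumptions.

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Sample a size x size kernel from the 2D Gaussian and normalize it to sum to 1.

    sigma=1.0 is an illustrative default, not a value taken from the paper.
    """
    half = size // 2
    norm = 2.0 * math.pi * sigma * sigma
    kernel = [
        [math.exp(-(i * i + j * j) / (2.0 * sigma * sigma)) / norm
         for j in range(-half, half + 1)]
        for i in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def smooth(image, kernel):
    """Convolve a 2D list of floats with the kernel, replicating border pixels (assumption)."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    yy = min(max(y + di, 0), h - 1)
                    xx = min(max(x + dj, 0), w - 1)
                    acc += kernel[di + half][dj + half] * image[yy][xx]
            out[y][x] = acc
    return out
```

Normalizing the sampled coefficients keeps the average brightness of the image unchanged after smoothing.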
We use a 3 × 3 convolution kernel. The smoothed image is converted into the CIE LAB color space. The Lab color space mathematically describes all perceivable colors in three dimensions: $L^*$ for lightness, and $a^*$ and $b^*$ for the green-red and blue-yellow color components respectively. The nonlinear relations for $L^*$, $a^*$ and $b^*$ are intended to simulate the nonlinear response of the human eye. Perceptual differences between any two colors can be approximated by the Euclidean distance between their values in the Lab color space. In our tests the algorithm showed better results using the Lab color space than using RGB.
For computational efficiency, homogeneous areas (superpixels) are used instead of individual pixels.
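The conversion to Lab and the Euclidean color difference can be sketched as follows. The sketch assumes 8-bit sRGB input and the D65 reference white; neither is stated in the text, and the function names are hypothetical.

```python
import math

def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB values to CIE L*a*b* (assumes the D65 reference white)."""
    def to_linear(c):
        # Undo the sRGB gamma encoding.
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        # Nonlinear compression used by the Lab definition.
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))

def lab_distance(lab1, lab2):
    """Euclidean distance in Lab space, an approximation of perceptual difference."""
    return math.dist(lab1, lab2)
```

For example, `lab_distance(srgb_to_lab(255, 255, 255), srgb_to_lab(0, 0, 0))` is close to 100, the full lightness range.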

Segmentation
Segmentation methods such as the k-means method (Mirkes, 2011), the watershed method (Beucher and Meyer, 1993), the graph cut method (Boykov et al., 2001) and simple linear iterative clustering (SLIC) (Achanta et al., 2012) are able to split the source image into distinct but, in some sense, homogeneous areas called "superpixels" in a reasonable amount of time. The SLIC method performs a local clustering of pixels in the 5-D space defined by the $L$, $a$, $b$ values of the CIELAB color space and the $x$, $y$ pixel coordinates, and outputs better-quality superpixels at a very low computational and memory cost.
SLIC segmentation algorithm:
1. Initialize cluster centers $C_k = [l_k, a_k, b_k, x_k, y_k]$ by sampling pixels at regular grid steps $S$.
2. Perturb cluster centers in an $n \times n$ neighborhood, to the lowest gradient position.
3. repeat
4.   for each cluster center $C_k$ do
5.     Assign the best matching pixels from a $2S \times 2S$ square neighborhood around the cluster center according to the distance measure (1).
6.   end for
7.   Compute new cluster centers and residual error $E$ {$L_1$ distance between previous centers and recomputed centers}.
8. until $E \le$ threshold
9. Enforce connectivity.
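The core loop of the listing can be sketched as follows. This is a simplified sketch rather than the implementation used in this work: it skips the gradient-based perturbation (step 2) and the connectivity enforcement (step 9), and it assumes that the distance measure has the combined color and spatial form given by Achanta et al. (2012), with a hypothetical compactness parameter `m`.

```python
import math

def slic(image, step, m=10.0, threshold=1e-3, max_iter=10):
    """Simplified SLIC on a 2D grid of (l, a, b) tuples; returns a label map.

    `m`, `threshold` and `max_iter` are illustrative defaults (assumptions).
    """
    h, w = len(image), len(image[0])
    # Step 1: cluster centers [l, a, b, x, y] on a regular grid with spacing `step`.
    centers = [list(image[y][x]) + [float(x), float(y)]
               for y in range(step // 2, h, step)
               for x in range(step // 2, w, step)]
    labels = [[-1] * w for _ in range(h)]
    for _ in range(max_iter):
        dist = [[math.inf] * w for _ in range(h)]
        # Steps 4-6: each center claims the best matching pixels in its 2S x 2S window.
        for k, (cl, ca, cb, cx, cy) in enumerate(centers):
            for y in range(max(0, int(cy) - step), min(h, int(cy) + step + 1)):
                for x in range(max(0, int(cx) - step), min(w, int(cx) + step + 1)):
                    l, a, b = image[y][x]
                    d_color = math.sqrt((l - cl) ** 2 + (a - ca) ** 2 + (b - cb) ** 2)
                    d_space = math.hypot(x - cx, y - cy)
                    # Assumed distance form: color term plus scaled spatial term.
                    d = math.sqrt(d_color ** 2 + (d_space / step) ** 2 * m ** 2)
                    if d < dist[y][x]:
                        dist[y][x], labels[y][x] = d, k
        # Step 7: recompute each center as the mean of its assigned pixels.
        sums = [[0.0] * 5 for _ in centers]
        counts = [0] * len(centers)
        for y in range(h):
            for x in range(w):
                k = labels[y][x]
                l, a, b = image[y][x]
                for i, v in enumerate((l, a, b, x, y)):
                    sums[k][i] += v
                counts[k] += 1
        error = 0.0
        for k, old in enumerate(centers):
            if counts[k] == 0:
                continue
            new = [s / counts[k] for s in sums[k]]
            error += sum(abs(n - o) for n, o in zip(new, old))  # L1 residual
            centers[k] = new
        # Step 8: stop once the centers have (almost) stopped moving.
        if error <= threshold:
            break
    return labels
```

A larger `m` favors compact, grid-like superpixels, while a smaller `m` lets them follow color boundaries more closely.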
After segmentation, all pixels of the image are allocated to clusters, referred to as superpixels. An undirected graph is used to store information about the image segments.
The vertices of this graph are superpixels. Every vertex stores the average color components and mean coordinates of the corresponding superpixel, and whether it lies on the image boundary.
The weight of an edge is the Euclidean distance between the average colors of its vertices. Next, we need to calculate the object and background measures.
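A possible construction of this graph from a label map is sketched below. The data layout and the function name are hypothetical, and treating two superpixels as adjacent when they share a 4-connected pixel border is an assumption.

```python
import math

def build_superpixel_graph(image, labels):
    """Build (vertices, edges) from a 2D grid of (l, a, b) tuples and a label map.

    vertices[k] = {"color": mean (l, a, b), "center": mean (x, y), "boundary": bool}
    edges[(i, j)] = Euclidean distance between mean colors, with i < j.
    """
    h, w = len(labels), len(labels[0])
    acc = {}  # label -> [sum_l, sum_a, sum_b, sum_x, sum_y, count, on_boundary]
    adjacent = set()
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            entry = acc.setdefault(k, [0.0, 0.0, 0.0, 0.0, 0.0, 0, False])
            for i, v in enumerate(tuple(image[y][x]) + (x, y)):
                entry[i] += v
            entry[5] += 1
            if y in (0, h - 1) or x in (0, w - 1):
                entry[6] = True
            # 4-connectivity (assumption): check the right and bottom neighbors.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w and labels[ny][nx] != k:
                    other = labels[ny][nx]
                    adjacent.add((min(k, other), max(k, other)))
    vertices = {
        k: {"color": tuple(s / e[5] for s in e[:3]),
            "center": (e[3] / e[5], e[4] / e[5]),
            "boundary": e[6]}
        for k, e in acc.items()
    }
    edges = {(i, j): math.dist(vertices[i]["color"], vertices[j]["color"])
             for i, j in adjacent}
    return vertices, edges
```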

Background measure
Recognition of background superpixels is based on the idea that background regions have a large perimeter on the image boundary, while object regions are mostly centrally located (Zhu et al., 2014). We define the geodesic distance $d_{geo}(p, q)$ as the length of the shortest path between two vertices $p$ and $q$ of the superpixel graph and calculate it using Johnson's algorithm.
Based on these distances, we define the boundary length of a superpixel, which depends on a color variation constant.
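The background measure can be illustrated with the sketch below. It assumes the definitions of Zhu et al. (2014): a soft area $\sum_i S(p, p_i)$ with $S(p, p_i) = \exp(-d_{geo}^2(p, p_i) / 2\sigma_{clr}^2)$, a boundary length that restricts the same sum to boundary superpixels, and their ratio (boundary length over the square root of the area) as boundary connectivity. The default `sigma_clr=10.0` and the function names are assumptions. Since all edge weights are non-negative, running Dijkstra's algorithm from every vertex gives the same distances as Johnson's algorithm.

```python
import heapq
import math

def geodesic_distances(n, edges):
    """All-pairs shortest paths on an undirected graph with non-negative weights.

    edges maps (i, j) to a weight; Dijkstra's algorithm is run from every vertex.
    """
    adj = [[] for _ in range(n)]
    for (i, j), wgt in edges.items():
        adj[i].append((j, wgt))
        adj[j].append((i, wgt))
    dist = []
    for src in range(n):
        d = [math.inf] * n
        d[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > d[u]:
                continue  # stale heap entry
            for v, wgt in adj[u]:
                if du + wgt < d[v]:
                    d[v] = du + wgt
                    heapq.heappush(heap, (d[v], v))
        dist.append(d)
    return dist

def boundary_connectivity(dist, on_boundary, sigma_clr=10.0):
    """Boundary length divided by the square root of the soft area, per superpixel.

    Follows the definitions assumed from Zhu et al. (2014); sigma_clr is illustrative.
    """
    n = len(dist)
    result = []
    for p in range(n):
        sim = [math.exp(-dist[p][q] ** 2 / (2.0 * sigma_clr ** 2)) for q in range(n)]
        area = sum(sim)
        length = sum(s for s, b in zip(sim, on_boundary) if b)
        result.append(length / math.sqrt(area))
    return result
```

High values indicate superpixels that are strongly connected to the image boundary and are therefore likely to be background.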

Object measure
In (Zhu et al., 2014) "Background weighted contrast" is used as an object measure.

Saliency measure
The resulting saliency map $s$ assigns a value $s_i$ to each superpixel. It is found by minimizing the objective function $C(s)$ (Zhu et al., 2014), which combines the background measure, the foreground (object) measure and a smoothing component. To find the optimal values $\{s_i\}_{i=1}^{N}$ that minimize $C(s)$, we solve equation (6) using the least squares method. Considering (8-11), the optimal saliency map $s$ can be found from (7), whose matrix is composed of the components described below.
Component $W$ is the matrix of the components $w_{ij}$ that define the weights between superpixels $p_i$ and $p_j$.
Component $D$ is the sum of the adjacent edge weights of each superpixel, $\bar{w}(p_i) = \sum_{j} w(p_i, p_j)$.
A further component defines the background measure. Each region in the generated saliency map is assigned a value between 0 and 1, where the object of interest has values near 0 (marked white) and the background has values near 1 (marked black).

Binarization
We convert the resulting saliency map into a binary image by thresholding with an upper threshold. The threshold can be found using Otsu's method (Otsu, 1979).
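Otsu's method selects the threshold that maximizes the between-class variance of the histogram. A minimal sketch for saliency values in [0, 1] follows; the bin count and function names are assumptions, and marking values below the threshold as the object is an assumption based on the convention above that the object has values near 0.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in [0, 1] that maximizes the between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of samples in the lower class
    sum0 = 0.0  # sum of bin indices in the lower class
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    # Bins 0..best_t form the lower class, so the cut lies at the next bin edge.
    return (best_t + 1) / bins

def binarize(saliency, threshold):
    """Mark pixels whose saliency value is below the threshold as object (1)."""
    return [[1 if v < threshold else 0 for v in row] for row in saliency]
```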

Shadow removal
Then we perform shadow detection on the image and remove shadow regions from the object superpixels. In this paper, shadow detection is based on the method of (Blajovici, 2011), which uses luminance statistics. The approach is based on the following considerations:
- a pixel belongs to a shadow when its brightness is less than 60% of the average brightness of the entire image;
- a pixel belongs to a shadow when its brightness is less than 70% of the average brightness of its superpixel.
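The two considerations above can be sketched as a per-pixel test. The text does not say how the two criteria are combined; requiring both to hold is an assumption of this sketch, as are the function and parameter names.

```python
def shadow_mask(brightness, labels, image_ratio=0.6, superpixel_ratio=0.7):
    """Return a boolean mask of shadow pixels from a 2D brightness grid and a label map.

    A pixel is marked as shadow only if it satisfies BOTH criteria (assumption).
    """
    h, w = len(brightness), len(brightness[0])
    image_mean = sum(sum(row) for row in brightness) / (h * w)
    # Accumulate the mean brightness of every superpixel.
    sums, counts = {}, {}
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            sums[k] = sums.get(k, 0.0) + brightness[y][x]
            counts[k] = counts.get(k, 0) + 1
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            value = brightness[y][x]
            below_image = value < image_ratio * image_mean
            below_superpixel = value < superpixel_ratio * sums[k] / counts[k]
            mask[y][x] = below_image and below_superpixel
    return mask
```

With this combination, a uniformly dark superpixel is not marked as shadow, because none of its pixels is darker than 70% of the superpixel's own mean.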

Binary image processing
Binarization may produce small objects that lie outside the target object, or the target object may be split into parts. Therefore, we need to delete such unnecessary separate elements and join the parts of the foreground together.
For small target objects (width and height of 5-20 pixels) we apply an erosion operation with a circular structuring element (radius of two pixels) to remove small fragments. As a result, all found objects are reduced in size. To restore the shape of the objects, a dilation operation with the same structuring element is then applied. Next, to connect the small parts into one, a dilation operation with a circular structuring element (radius of three pixels) is applied. All the constants mentioned above may vary for target objects of different sizes.
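The sequence of morphological operations described above can be sketched on a binary grid as follows. The function names are hypothetical, and treating pixels outside the image as background is an assumption.

```python
def disk(radius):
    """Offsets (dy, dx) of a circular structuring element with the given radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]

def erode(mask, radius):
    """Keep a pixel only if the whole disk around it is set (outside counts as unset)."""
    h, w = len(mask), len(mask[0])
    offsets = disk(radius)
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                     for dy, dx in offsets))
             for x in range(w)] for y in range(h)]

def dilate(mask, radius):
    """Set a pixel if any pixel of the disk around it is set."""
    h, w = len(mask), len(mask[0])
    offsets = disk(radius)
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                     for dy, dx in offsets))
             for x in range(w)] for y in range(h)]

def clean_binary_mask(mask, open_radius=2, join_radius=3):
    """Erode and dilate with the same disk, then dilate again to join nearby parts."""
    opened = dilate(erode(mask, open_radius), open_radius)
    return dilate(opened, join_radius)
```

Fragments smaller than the first structuring element vanish during erosion and are not restored by the subsequent dilations.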
The scores for these trackers are based on two metrics, precision and success rate (Table 1). Precision is measured as the distance between the centers of a tracker bounding box and the corresponding ground truth bounding box. The precision plot shows the percentage of tracker bounding boxes within a given threshold distance in pixels of the ground truth. To rank the trackers, we use a threshold of 20 pixels (Bolme et al., 2010). For initialization of the tracker we use: 1) a triple-sized ground truth region processed by our saliency algorithm, which predicts the initialization region of the object; 2) the ground truth region.
Table 1. Trackers' average FPS, success rate difference between semi-automatic and ground truth initialization with IoU > 0.5, and precision difference between semi-automatic and ground truth initialization with a 20-pixel precision threshold.
The success plot (Table 2) shows the percentage of tracker bounding boxes whose overlap score is larger than a given threshold.
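Both metrics can be computed as in the sketch below, where the overlap score is taken to be the intersection over union (IoU) of the two boxes. The `(x, y, width, height)` box layout and the function names are assumptions.

```python
import math

def center_distance(box_a, box_b):
    """Distance between the centers of two (x, y, width, height) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(tracked, truth, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    hits = sum(center_distance(t, g) <= threshold for t, g in zip(tracked, truth))
    return hits / len(truth)

def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap score exceeds the threshold."""
    hits = sum(iou(t, g) > threshold for t, g in zip(tracked, truth))
    return hits / len(truth)
```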

CONCLUSIONS
According to the results of the experimental testing, the best tracking quality with initialization by the proposed algorithm is achieved by the tracking algorithms "SRDCF" and "MOSSE_CA". In our experiments "MOSSE_CA" outperforms the other algorithms. Thus, the most appropriate algorithm for tracking objects from UAVs in combination with the proposed initialization algorithm is "MOSSE_CA", because it is the least sensitive to the accuracy of initialization and is the fastest among its competitors.
The proposed algorithm does not require special hardware and can work in real time. It is implemented in C++. The average time required to localize an object occupying 40% of a 256 × 256 pixel image is 60 milliseconds on an Intel® Core™ i5-3470 CPU @ 3.20 GHz.

Figure 2. Original images are shown in the first column, saliency maps in the second, binary saliency maps in the third, object shadows detected in the image in the fourth, binary images without shadows in the fifth, and bounding boxes in the sixth.
The success is measured as the intersection over union (IoU) of the pixels in the tracker bounding box and the corresponding ground truth bounding box.