Real-Time Tracking in Satellite Videos via Joint Discrimination and Pose Estimation

Abstract: Object tracking has gained much attention in computer vision and intelligent traffic analysis. Satellite videos are better suited to long-distance tracking than road traffic videos. However, most state-of-the-art methods produce poor results when applied to satellite videos, because the targets are small and low in resolution and the background often looks similar to them. In this paper, we present an improved Discriminative Correlation Filter (DCF) based approach specifically tailored for tracking small objects in satellite videos, which applies a spatial weight in the filter and estimates the pose with a Kalman filter. First, a spatial mask is introduced so that the final filter weights its contributions differently depending on spatial distance. Furthermore, a Kalman filter is incorporated to predict the position when the target runs into a large background region of similar appearance. Finally, an efficient strategy for combining the improved DCF tracker and pose estimation is proposed. Experimental results on three satellite videos describing the traffic conditions of three cities demonstrate that the proposed approach can effectively track the targets even when they are similar to the background region for a period of time. Compared with other state-of-the-art methods, our method achieves the best accuracy and speed.


INTRODUCTION
Object tracking is an attractive research field that is useful for many computer vision and intelligent traffic analysis tasks. Road traffic videos are the main data source for target tracking in traffic surveillance. However, it is difficult to track a moving target over a long distance with road traffic videos, because each traffic camera observes only a limited local range.
A video satellite is a newly developed type of earth observation satellite that can capture continuous image sequences over a long period (Xiao, Wang, Wang, & Ren, 2018). Satellite video has received attention in many real-world tasks, such as motion analysis (He, Yi, Cheung, You, & Tang, 2017), moving ship surveillance (Li & Man, 2016) and traffic monitoring (Caro-Gutierrez, Bravo-Zanoguera, & González-Navarro, 2017), (Patel & Mishra, 2013). In general, a frame in a satellite video can be as large as 3840 × 2160 pixels. Yet moving objects in satellite video are usually very small because of the long distance from the camera, so the information available about them is limited compared with the background. Meanwhile, the target is very similar to the background, and in some regions the two are completely indistinguishable.
Before 2012, tracking approaches based on generative models were widely applied, such as mean shift (Vojir, Noskova, & Matas, 2013), Kalman filter (Chen, 2012), optical flow (Boroujeni, 2012) and particle filter (Baxter, Leach, & Robertson, 2014). A generative model treats tracking as a search problem: finding the region most similar to the target. These methods mainly use information about the target itself, such as color, position and feature corners, and ignore the background information. They often produce poor results when lighting changes, deformations or occlusions occur.
In recent years, tracking solutions based on correlation filters have achieved great success in object tracking. They treat the tracking problem as a binary classification problem. Many works (Henriques, Caseiro, Martins, & Batista, 2012), (Danelljan, Bhat, Shahbaz Khan, & Felsberg, 2017) based on the Discriminative Correlation Filter (DCF) have continuously improved accuracy and robustness on tracking benchmarks (Wu, Lim, & Yang, 2013), (Wu, Lim, & Yang, 2015). For instance, Color Attributes (Danelljan, Khan, Felsberg, & Van De Weijer, 2014) were introduced to enrich the features, DSST (Danelljan, Häger, Shahbaz Khan, & Felsberg, 2014) was proposed to handle scale variations, and Spatial Regularization (Danelljan, Hager, Khan, & Felsberg, 2015) is used to reduce boundary effects. However, most of these methods fail to track targets effectively in satellite video, because background with target-like color inside the search box hampers accurate localization of the target.
Recently, deep learning has shown its power in many computer vision tasks. CNNs have been introduced to extract features for tracking (Nam & Han, 2016), (Bertinetto, Valmadre, Henriques, Vedaldi, & Torr, 2016), (Danelljan, Hager, Khan, & Felsberg, 2016), (Valmadre, Bertinetto, Henriques, Vedaldi, & Torr, 2017). These trackers pre-train a CNN model to obtain deep features that are useful for distinguishing object categories. Nevertheless, there are two obstacles to integrating deep features into tracking models for satellite videos. First, training data are extremely scarce: labeled satellite videos are very difficult to find. Second, the targets have low resolution: targets such as cars and planes occupy only a few pixels, are blurred in shape and are similar in color to their surroundings. As a result, effective deep features are very hard to obtain with most deep models. Even though object tracking has attracted much attention, the tracking of small objects has been largely neglected. Moving object detection methods have been fused with the Kernelized Correlation Filter (KCF) to address the problem (Du, Sun, Cai, Wu, & Du, 2018). However, the high computational cost of object detection makes such an approach run at low frame rates, and it is therefore infeasible for real-time applications.
In this paper, we present an improved DCF-based method that is constrained by a spatial weight and combined with a Kalman filter (Chen, 2012) to estimate the pose. The traditional DCF filter produces a multi-peak response when applied to satellite video, primarily because the target has a color similar to the background and no obvious shape features. To solve this, we introduce a spatial weight calculated from the distance to the center, which ensures that the target region in the previous frame is given higher weights. Most state-of-the-art tracking solutions lose the target when it is obscured for a period of time. Large buildings with a color similar to the target often make the target invisible. In this situation, we estimate the position of the target from the motion model established by the Kalman filter. An effective strategy is employed to decide whether to discriminate or to estimate. Finally, we conduct experiments on three satellite sequences to demonstrate the effectiveness of our approach. An overview flowchart of the proposed approach is shown in Figure 1.

The Single Channel Discriminative Correlation Filters
We base our method on the DCF tracker, which underlies many of the trackers that performed well in a recent evaluation. A single channel is used in the correlation filters, since targets in satellite video have a low resolution and their colors are similar to the background. Here we give a brief introduction to single-channel discriminative correlation filters.
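The DCF tracker determines the target position by finding a filter that minimizes the squared error between the correlation response and the desired output, which amounts to solving a ridge regression problem,
$$\varepsilon(h) = \| f \star h - y \|^2 + \lambda \| h \|^2, \qquad (1)$$
where $f$ is the grayscale feature of the samples, $h$ is the corresponding target template, $y$ is a 2D Gaussian function centered at the target location, $\star$ denotes circular correlation and $\lambda$ is a regularization term.
The loss function (1) can be expressed with Hadamard products in the Fourier domain to improve tracking efficiency,
$$\varepsilon(h) = \| \bar{\hat{h}} \odot \hat{f} - \hat{y} \|^2 + \lambda \| \hat{h} \|^2,$$
where $\hat{f}$, $\hat{h}$, $\hat{y}$ are the Fourier transforms of $f$, $h$, $y$ and $\bar{\hat{h}}$ is the complex conjugate of $\hat{h}$. The solution to the loss function is
$$\hat{h} = (\hat{f} \odot \bar{\hat{y}}) \odot^{-1} (\hat{f} \odot \bar{\hat{f}} + \lambda),$$
where $\odot^{-1}$ is element-wise division. The new position of the target is taken to be the location of the maximum of the final response.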
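The closed-form filter and the peak search described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function names, the patch size, the Gaussian width `sigma` and the value of `lam` are all hypothetical, and the desired output is placed with its peak at index (0, 0) so that the location of the response maximum directly gives the displacement of the target.

```python
import numpy as np


def gaussian_label(shape, sigma=2.0):
    """Desired output y: a 2D Gaussian whose peak sits at index (0, 0), wrapping around."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]  # signed offsets 0, 1, ..., -2, -1
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2.0 * sigma ** 2))


def train_dcf(f, y, lam=1e-2):
    """Closed-form single-channel filter in the Fourier domain (lam is an assumed value)."""
    f_hat = np.fft.fft2(f)
    y_hat = np.fft.fft2(y)
    return (f_hat * np.conj(y_hat)) / (f_hat * np.conj(f_hat) + lam)


def detect(h_hat, patch):
    """Return the (row, col) displacement at the maximum of the correlation response."""
    response = np.real(np.fft.ifft2(np.conj(h_hat) * np.fft.fft2(patch)))
    peak = np.unravel_index(np.argmax(response), response.shape)
    # Peaks in the upper half of the index range correspond to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, response.shape))


rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))               # stand-in for a grayscale sample
h_hat = train_dcf(patch, gaussian_label(patch.shape))
moved = np.roll(patch, shift=(3, -2), axis=(0, 1))  # content moved 3 rows down, 2 columns left
displacement = detect(h_hat, moved)
```

Because the shift here is circular and the patch is textured everywhere, the response is a clean single peak; on real satellite frames with color-similar background the response is typically multi-peaked, which is the problem the spatial weight mask addresses.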

Spatial Weight Mask
Because the color of targets is similar to the background in satellite video, the filter yields a multi-peak response. To alleviate the problem caused by the similar background, we introduce a spatial weight mask m. Each element of the mask lies between zero and one, depending on its location in the sample.

This gives the constraint $h \equiv m \odot h$, where $\odot$ represents the element-wise product. With the constraint added, the loss function no longer has a closed-form solution. An iterative method for solving a constrained problem of this kind was proposed for Correlation Filters with Limited Boundaries (Italiano, Sim, & Lucey, 2015).
We introduce a variable $h_s$ with the constraint $h_s - m \odot h \equiv 0$ and solve the constrained problem with an Augmented Lagrangian Method (Boyd, Parikh, Chu, Peleato, & Eckstein, 2011). The augmented Lagrangian of the constrained loss function, $\mathcal{L}(\hat{h}_s, h, \hat{l} \mid m)$, where $\hat{l}$ is a complex Lagrange multiplier, is minimized by alternating between
$$\hat{h}_s^{i+1} = \arg\min_{\hat{h}_s} \mathcal{L}(\hat{h}_s, h^{i}, \hat{l}^{i} \mid m), \qquad h^{i+1} = \arg\min_{h} \mathcal{L}(\hat{h}_s^{i+1}, h, \hat{l}^{i} \mid m).$$
Each of the two minimizations can be computed in closed form, and the Lagrange multiplier is updated after every iteration.
The spatial weight mask is centered on the target location. A pixel within the original width and height of the target is set to one; otherwise it is set to zero. We also tried decreasing the weight progressively with the distance from the center. The results indicate that the mask whose weights are one inside the original target box performs best.
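The binary mask and the alternating minimization described in this section can be sketched as follows. The closed-form update rules in the loop are an assumption: they follow the form published for the CSR-DCF tracker (Lukežič, Vojíř, Zajc, Matas, & Kristan, 2017) and may differ in detail from the updates used here. The function names, `lam` and the patch and box sizes are hypothetical; the defaults `mu=5`, `beta=3` and four iterations are the values reported in the Implementation section.

```python
import numpy as np


def box_mask(patch_shape, target_shape):
    """Binary spatial weight mask: one inside the centered target box, zero elsewhere."""
    mask = np.zeros(patch_shape)
    r0 = (patch_shape[0] - target_shape[0]) // 2
    c0 = (patch_shape[1] - target_shape[1]) // 2
    mask[r0:r0 + target_shape[0], c0:c0 + target_shape[1]] = 1.0
    return mask


def constrained_filter(f, y, mask, lam=1e-2, mu=5.0, beta=3.0, iterations=4):
    """Alternating minimization of the augmented Lagrangian (update rules assumed from CSR-DCF)."""
    f_hat, y_hat = np.fft.fft2(f), np.fft.fft2(y)
    size = f.size  # number of elements D; appears because NumPy's forward FFT is unnormalized
    # Start from the unconstrained closed-form filter, masked in the spatial domain.
    h = mask * np.real(np.fft.ifft2(f_hat * np.conj(y_hat) / (f_hat * np.conj(f_hat) + lam)))
    l_hat = np.zeros_like(f_hat)  # complex Lagrange multiplier
    for _ in range(iterations):
        h_m_hat = np.fft.fft2(h)
        # Closed-form update of the auxiliary variable h_s in the Fourier domain.
        h_s_hat = (f_hat * np.conj(y_hat) + mu * h_m_hat - l_hat) / (f_hat * np.conj(f_hat) + mu)
        # Closed-form update of the masked filter in the spatial domain.
        h = mask * np.real(np.fft.ifft2(l_hat + mu * h_s_hat)) / (lam / (2.0 * size) + mu)
        # Multiplier update, then increase the penalty weight.
        l_hat = l_hat + mu * (h_s_hat - np.fft.fft2(h))
        mu *= beta
    return h


rng = np.random.default_rng(0)
f = rng.standard_normal((32, 32))                     # stand-in for a grayscale sample
offsets = np.fft.fftfreq(32) * 32
y = np.exp(-(offsets[:, None] ** 2 + offsets[None, :] ** 2) / 8.0)  # Gaussian peaked at (0, 0)
mask = box_mask(f.shape, (8, 8))
h = constrained_filter(f, y, mask)
```

By construction the returned filter is exactly zero outside the target box, so background pixels inside the search region cannot contribute to the response.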

Pose Prediction
To address the problem that targets are hard to distinguish from a color-similar background region, a Kalman filtering feedback mechanism is adopted to estimate the target position. The Kalman filter predicts the state sequence of a dynamic system with minimum mean square error. It is built on a state equation and an observation equation: the estimated value and the observed value are used to update the state variable and thereby predict the movement state.
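The state equation and the measurement equation are
$$X(k) = A X(k-1) + B U(k) + W(k),$$
$$Z(k) = H X(k) + V(k),$$
where X(k) and X(k-1) are the system states at times k and k-1, U(k) is the control input at time k, Z(k) is the observation at time k, W(k) and V(k) are the process and measurement noises, and A, B and H are the parameters of the system. The Kalman filter is a recursive process that alternates between prediction and update. Writing $\Sigma$ for the state covariance and $Q$ and $R$ for the covariances of W(k) and V(k), the standard prediction equations are
$$X(k|k-1) = A X(k-1|k-1) + B U(k), \qquad \Sigma(k|k-1) = A \Sigma(k-1|k-1) A^{T} + Q,$$
and the standard update equations are
$$K(k) = \Sigma(k|k-1) H^{T} \left( H \Sigma(k|k-1) H^{T} + R \right)^{-1},$$
$$X(k|k) = X(k|k-1) + K(k) \left( Z(k) - H X(k|k-1) \right),$$
$$\Sigma(k|k) = \left( I - K(k) H \right) \Sigma(k|k-1).$$
We use the tracking results as measurements to update the Kalman filter, and the predicted value is used as the target location when the target and its surrounding region are indistinguishable.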
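As an illustration of the predict and update recursion, the sketch below tracks a 2D position with a constant-velocity motion model. The motion model, the class name, the initial covariance and the noise variances are assumptions made for the example; the text does not specify the matrices A, B and H, and no control input is used here.

```python
import numpy as np


class ConstantVelocityKalman:
    """Kalman filter over the state [x, y, vx, vy]; only the position is observed."""

    def __init__(self, position, process_var=1e-2, measurement_var=1e-1):
        self.x = np.array([position[0], position[1], 0.0, 0.0])
        self.cov = np.eye(4) * 10.0                   # initial state covariance (assumed)
        self.A = np.array([[1.0, 0.0, 1.0, 0.0],
                           [0.0, 1.0, 0.0, 1.0],
                           [0.0, 0.0, 1.0, 0.0],
                           [0.0, 0.0, 0.0, 1.0]])     # state transition, one frame per step
        self.H = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]])     # observation matrix
        self.Q = np.eye(4) * process_var              # process noise covariance (assumed)
        self.R = np.eye(2) * measurement_var          # measurement noise covariance (assumed)

    def predict(self):
        """Propagate state and covariance one frame ahead (no control input)."""
        self.x = self.A @ self.x
        self.cov = self.A @ self.cov @ self.A.T + self.Q
        return self.x[:2].copy()

    def update(self, measurement):
        """Correct the prediction with the tracked position used as the observation."""
        innovation = np.asarray(measurement, dtype=float) - self.H @ self.x
        s = self.H @ self.cov @ self.H.T + self.R
        gain = self.cov @ self.H.T @ np.linalg.inv(s)
        self.x = self.x + gain @ innovation
        self.cov = (np.eye(4) - gain @ self.H) @ self.cov
        return self.x[:2].copy()


# Feed 30 tracked positions of a target moving (2, 1) pixels per frame, then predict frame 31.
kf = ConstantVelocityKalman(position=(0.0, 0.0))
for k in range(1, 31):
    kf.predict()
    kf.update((2.0 * k, 1.0 * k))
next_position = kf.predict()
```

After a few dozen updates the velocity estimate has converged, which matches the observation in the next section that the Kalman estimate becomes reliable only after it has been updated over a number of frames.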

Tracking with Spatial Weight and Pose Prediction
In this section we introduce a strategy for combining the improved DCF tracker with pose estimation. When the target enters a region of high similarity, the tracking results of the DCF with spatial weight stay at the old position. The average distance moved between two consecutive frames is therefore used as the criterion for deciding how much weight is given to the position estimated by the Kalman filter.
The accuracy of the Kalman filter improves greatly after it has been updated over a number of frames. Therefore, we perform pose estimation every Ni-th frame after the first Ns frames. The parameter Ni determines how often pose estimation is done. The parameter Rk is used to calculate the new possible position,
$$\tilde{P}_t = R_k P_k + (1.0 - R_k) P_{t-1},$$
where Pk is the position estimated by the Kalman filter and Pt-1 is the position tracked at frame t-1. If the distance between the positions at frames t-1 and t-2 is less than the average distance between two frames, Rk is set to a high value; otherwise Rk is set to a small value. Rk must lie between 0 and 1.
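The weighting rule above can be sketched as a pair of small helper functions. The function names are hypothetical; the defaults 0.95, 0.2, 25 and 3 are the values of Rk, Ns and Ni reported in the Implementation section, and the exact frame schedule in `should_estimate` is one plausible reading of "every Ni-th frame after Ns frames", not a confirmed detail.

```python
import numpy as np


def choose_ratio(p_prev, p_prev2, average_distance, high=0.95, low=0.2):
    """Return Rk: high when the last displacement is below the running average, else low."""
    last_distance = float(np.linalg.norm(np.subtract(p_prev, p_prev2)))
    return high if last_distance < average_distance else low


def fuse_position(p_kalman, p_prev, ratio):
    """New possible position: Rk * Pk + (1.0 - Rk) * Pt-1."""
    p_kalman = np.asarray(p_kalman, dtype=float)
    p_prev = np.asarray(p_prev, dtype=float)
    return ratio * p_kalman + (1.0 - ratio) * p_prev


def should_estimate(frame_index, n_start=25, n_interval=3):
    """Pose estimation on every n_interval-th frame once n_start frames have passed (assumed schedule)."""
    return frame_index > n_start and (frame_index - n_start) % n_interval == 0


# A target that moved only 0.1 px while the average step is 2 px is treated as stuck,
# so the Kalman estimate dominates the fused position.
ratio = choose_ratio(p_prev=(100.0, 50.0), p_prev2=(100.1, 50.0), average_distance=2.0)
fused = fuse_position(p_kalman=(104.0, 52.0), p_prev=(100.0, 50.0), ratio=ratio)
```

With Rk = 0.95 the fused position lies almost on the Kalman estimate, which pulls the search window out of the region where the correlation filter has stalled.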

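Instead of using $\tilde{P}_t$ as the position at frame t, we track again centered on this position. The tracking result is the new position Pt of the target, which is the tracking location output for frame t and is used as the measurement value to update the Kalman filter (step 7 of Algorithm 1).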

EXPERIMENT
We validate the proposed approach with experiments on three sequences. These sequences, which describe the traffic conditions of three cities, are provided by Chang Guang Satellite Technology Co., Ltd. The image size is 3840 × 2160 pixels. The data sets pose challenging problems such as low resolution, shape blur, background clutter and occlusion. Details of the data sets are shown in Figure 2. We select planes and cars as our targets.
Figure 2. Three data sets in the experiment. Traffic scenes in the USA, Mexico and Spain are shown, respectively. Planes are selected as the targets in A and B, while a car is chosen as the target in C.

Implementation
The raw grayscale feature is used in the correlation filter, since moving objects in satellite videos have no obvious shape features and their colors are similar to the background. The image region area of the samples is set to 42 times the target area. We multiply the samples by a Hann window (Bolme, Beveridge, Draper, & Lui, 2010). We set the augmented Lagrangian optimization parameters μ to 5 and β to 3. We run 4 Gauss-Seidel iterations, and the filter fusing rate is set to 0.02. We perform pose estimation every three frames after the 25th frame. Rk is set to 0.95 when the distance between the last two frames is less than the average distance between every two frames; otherwise Rk is set to 0.2. No parameter needed fine-tuning, and all parameters were kept the same across all experiments. The proposed approach is implemented in C++ with the OpenCV library. Our implementation runs at 1500 frames per second on an Intel Core i7-6700HQ 2.60GHz CPU with 8GB RAM.
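The Hann windowing of a grayscale sample mentioned above can be sketched as follows. The helper names are hypothetical, and subtracting the mean before windowing is a common preprocessing choice assumed here for illustration rather than a step stated in the text.

```python
import numpy as np


def hann_window_2d(shape):
    """Separable 2D Hann window that tapers a sample towards zero at its borders."""
    return np.outer(np.hanning(shape[0]), np.hanning(shape[1]))


def preprocess(sample):
    """Remove the mean of a raw grayscale sample (assumed step), then apply the window."""
    sample = np.asarray(sample, dtype=float)
    return (sample - sample.mean()) * hann_window_2d(sample.shape)


window = hann_window_2d((8, 10))
flat = preprocess(np.ones((8, 8)))  # a constant patch carries no information after preprocessing
```

The window suppresses the discontinuities that the circular correlation would otherwise see at the patch borders.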

State-of-the-art Comparison
We compare our method with CSRT (Lukežič, Vojíř, Zajc, Matas, & Kristan, 2017), ECO-HC (Danelljan et al., 2017), STAPLE (Bertinetto, Valmadre, Golodetz, Miksik, & Torr, 2015), LCT (Ma, Yang, Zhang, & Yang, 2015) and KCF (Henriques, Caseiro, Martins, & Batista, 2015), all of which report excellent results in the literature. The implementations of ECO-HC, STAPLE and LCT are those provided with the benchmark evaluation, and CSRT and KCF are those provided with the OpenCV library. Figure 3 compares these methods on the three sequences in terms of overlap rate and distance precision, computed with the method provided with the benchmark. Among the compared trackers, our method achieves the highest success rate, with an AUC score of 0.725, while CSRT, ECO-HC, STAPLE, LCT and KCF obtain AUC scores of 0.621, 0.533, 0.483, 0.254 and 0.054, respectively. Our approach is also the best in the precision plot. One of the most important reasons for the good performance is that we perform pose estimation when targets disappear into the background, especially when they have the same color as large buildings in the background. The USA sequence contains a short clip in which the target has the same color as the surrounding background region, and all methods except ours fail to track the target effectively in this situation. The tracking results on some frames of the experimental data sets are shown in Figure 4. Another important criterion for tracking targets in satellite video is tracking speed. Table I shows the speed comparison among the six approaches. Our method has a remarkable speed advantage.
The frame rate of the proposed method is high for the following reasons. Targets in satellite video are almost scale invariant because of the long distance from the camera, so we do not estimate scale variations. Moreover, only the grayscale feature is employed, since targets occupy only a small number of pixels, have no evident shape features and are similar in color to the background. Contrary to usual experience, small-object tracking in satellite videos does not tend to benefit from HOG and CN features.
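For reference, the overlap rate and distance precision used in the comparison above can be sketched as follows. The box format (x, y, width, height), the 21 evenly spaced overlap thresholds and the 20-pixel distance threshold follow common tracking-benchmark conventions and are assumptions here; the evaluation in this section used the code provided with the benchmark, which may differ in detail.

```python
import numpy as np


def overlap_rate(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_distance(box_a, box_b):
    """Euclidean distance between the box centers, in pixels."""
    ca = (box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0)
    cb = (box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0)
    return float(np.hypot(ca[0] - cb[0], ca[1] - cb[1]))


def success_auc(predicted, ground_truth, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success plot: mean fraction of frames whose overlap exceeds each threshold."""
    overlaps = np.array([overlap_rate(p, g) for p, g in zip(predicted, ground_truth)])
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))


def distance_precision(predicted, ground_truth, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    errors = np.array([center_distance(p, g) for p, g in zip(predicted, ground_truth)])
    return float((errors <= threshold).mean())
```

For targets that occupy only a few pixels, a small center error already removes most of the overlap, so the success plot is a much stricter measure than distance precision at a fixed pixel threshold.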

CONCLUSIONS
In this paper, we propose a novel approach with spatial weights and pose estimation for tracking small objects in satellite videos. It is hard to extract strong features from small targets in satellite videos, and large color-similar background areas are a further major obstacle. By exploiting the spatial weight mask, we obtain a filter with a single-peak response. A Kalman filter for pose estimation is introduced to avoid interference from large color-similar regions. We further propose a strategy that combines discrimination and pose estimation based on the average distance between frames. Experiments on three satellite videos show that the tracker delivers superior tracking performance in terms of both accuracy and speed.

Figure 1. Flowchart of the proposed approach. We introduce the spatial weight to constrain the DCF filter and perform pose estimation with the Kalman filter when the tracking result stays in one place.
t P as the position at frame t, we track again centered on this position.The tracking result is the new position of the target.The new position is used as the measure value to update the Kalman filter.Algorithm 1 The proposed tracking algorithm Input: Image It, object position on previous frame Pt-1 and Pt-2, average distance between two frames, filter ht-1 Method: 1: Set the parameter Rk according to the distance between the position of frame Pt-1 and the position Pt-2.Update the average distance.2: Calculate the new possible position t P using Pt-1 and the position Pk estimated by Kalman filter.3: Extract image patch feature f centered on position Pt-1.4: New target position Pt: position of the maximum in correlation between ht-1 and image patch feature f. 5: Calculate the spatial weight mask m, estimate a new filter h '

Figure 3. Success rate and precision plots. The legend shows the AUC score of each tracker.

Figure 4. Example tracking results of different methods on the USA sequence. The number in the top left corner shows the frame number in the video. Our tracker is the only one that tracks the target correctly from beginning to end.

In Figure 4, ECO-HC and CSRT are shown alongside our method because they perform better than STAPLE, LCT and KCF. In the USA sequence, our method tracks the target accurately until the end of the video. The ECO-HC tracking box is clearly stuck in the similar background region at frame 295, and CSRT has lost the target before then.

Table I. Average speed (in fps)