SIAMESE NETWORK COMBINED WITH ATTENTION MECHANISM FOR OBJECT TRACKING

With the development of deep learning object tracking methods in recent years, the fully convolutional siamese network tracker SiamFC has become a classic deep learning object tracking algorithm. To address the reduced tracking accuracy of SiamFC against complex backgrounds, this paper introduces an attention mechanism into SiamFC that applies channel and spatial weighting to the feature maps obtained by convolving the input image. The CNN backbone of the algorithm is also adjusted, and a siamese network combined with attention mechanism for object tracking is proposed. The method makes the extracted features more effective and enhances the ability of the network model to discriminate targets. The algorithm is tested on the OTB2015, VOT2016 and VOT2017 datasets and compared with multiple object tracking algorithms. Experimental results show that the proposed algorithm handles the complex background problem in object tracking better and has certain advantages over other algorithms.


INTRODUCTION
In recent years, with the development of computer vision technology, visual object tracking has also developed rapidly. Given the position and size of an object in the first frame of a video sequence, object tracking predicts the position and size of that object in the subsequent frames through an algorithm (Granström, Baum, 2017). Visual object tracking has always been an important research topic in computer vision and is one of the current research hotspots. It is widely used in fields such as automated video monitoring, human-computer interaction, traffic monitoring, virtual reality, robot visual navigation and positioning, medical diagnosis and military applications.
However, visual object tracking remains a challenging task. The tracking process still faces a series of difficulties, such as variation in the features of the moving target, changes in target scale, inconsistent light intensity, occlusion and interference from complex backgrounds. These problems still limit the accuracy and speed of object tracking algorithms, so it is necessary to design a robust algorithm for object tracking.
Among object tracking algorithms based on deep learning, SiamFC is a classic tracking algorithm. To address the poor tracking performance of SiamFC against complex backgrounds, this paper proposes an object tracking algorithm combined with attention mechanism.
The main contributions of this paper are: (1) A siamese network object tracking method combining spatial attention and channel attention is proposed, which increases the ability of the siamese network to discriminate the target and mitigates the poor tracking performance of SiamFC against complex backgrounds.
(2) The CNN backbone of the siamese network object tracking algorithm is changed from AlexNet to VGG, which increases the depth of the network and improves the algorithm's ability to express the features of the object.
(3) The algorithm is tested on multiple datasets and compared with various methods. The results show that the proposed method has certain advantages over the compared methods.

ANALYSIS FOR SIAMFC
Object tracking based on the siamese network first appeared in the SINT algorithm (Tao et al, 2016). In the same year, Bertinetto et al. proposed the SiamFC algorithm (Bertinetto et al, 2016a), which, like SINT, is based on the siamese network and converts the tracking problem into a problem of comparing two images. With the development of deep learning object tracking methods in recent years, SiamFC has become a classic deep learning object tracking algorithm.
SiamFC pioneered the application of the siamese network structure in object tracking and significantly improved the tracking speed of deep learning methods, but it still has certain problems. Tests of the SiamFC algorithm show that its tracking accuracy drops against complex backgrounds and that tracking may even fail. Therefore, this paper first tests and analyzes the SiamFC algorithm, and then designs a more effective tracking algorithm according to the conclusions drawn. The tracking results of the SiamFC algorithm are tested in tracking scenes with complex backgrounds. Here, a background is considered complex when the background near the tracking target has a color or texture similar to the target. Figure 1 and Figure 2 show the tracking results of two representative scenes, where the first picture of each figure is the first frame of the video. It can be clearly seen from the figures that a complex background easily interferes with the tracking of the target. In the first scene, the background is more complicated and includes not only the people around the target athlete but also the track and the objects on it. The tracking result first shifts from the tracked athlete to an athlete in the background. In the following frames, the tracking result stays on the background and deviates completely from the tracking target.
In the second scene, the players running back and forth on the basketball court also make the background relatively complex, and the tracking result shifts to another player with similar clothing in the background.
The experimental results show that the accuracy of SiamFC is reduced when the background is complex. The main reason is that SiamFC identifies the tracking target through similarity learning, so a complex background with textures and colors similar to those of the target interferes with the tracking process.

SIAMESE NETWORK OBJECT TRACKING ALGORITHM COMBINED WITH ATTENTION MECHANISM
To solve the problem of the poor tracking performance of the SiamFC algorithm against complex backgrounds, this paper introduces the attention mechanism to improve the algorithm. The attention mechanism enables the algorithm to focus on the target itself. Figure 3 shows the network structure of the proposed siamese network object tracking algorithm combined with attention mechanism. This network structure is an improvement on the SiamFC algorithm in two main aspects: 1) an attention module is embedded, consisting of a channel attention module and a spatial attention module; 2) the CNN backbone of SiamFC is changed from AlexNet to VGG-16, increasing the depth of the network.

Algorithm Framework
The basic principle of the improved algorithm is similar to that of the SiamFC algorithm: a similarity metric function determines how similar the search image X is to the target in the template image Z. The similarity metric function here is a cross-correlation operation, that is, the feature map obtained by convolving the template image Z is used to convolve the feature map obtained by convolving the search image X. The feature map of the template image Z thus acts as the convolution kernel in this operation.
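The cross-correlation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `cross_correlate`, the toy channel count of 4 (instead of 512) and the synthetic feature maps are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def cross_correlate(template_feat, search_feat):
    """Slide the template feature map over the search feature map.

    Both inputs have shape (H, W, C). The output has shape
    (Hs - Hz + 1, Ws - Wz + 1); each entry is the inner product of the
    template with one window of the search features.
    """
    hz, wz, cz = template_feat.shape
    hs, ws, cs = search_feat.shape
    assert cz == cs, "template and search features need the same channel count"
    out_h, out_w = hs - hz + 1, ws - wz + 1
    score = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            window = search_feat[i:i + hz, j:j + wz, :]
            score[i, j] = np.sum(window * template_feat)
    return score

# Toy data (assumption): a strictly positive 6 x 6 pattern placed on an
# all-zero 22 x 22 search feature map, so the best match is unambiguous.
rng = np.random.default_rng(0)
template = rng.uniform(0.5, 1.5, size=(6, 6, 4))
search = np.zeros((22, 22, 4))
search[5:11, 9:15, :] = template

score_map = cross_correlate(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(score_map), score_map.shape))
print(score_map.shape, peak)
```

With a 6 × 6 template and a 22 × 22 search feature map the score map is 17 × 17, and its maximum lies where the template pattern was placed. Deep learning frameworks implement the same operation as a convolution layer whose kernel is the template feature map.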
Since the improved algorithm adds the attention mechanism module to the SiamFC algorithm, the similarity between the template image and the search image is calculated as:
$$f(z, x) = \delta\,\varphi(z) * \varepsilon\,\varphi(x) + b\,\mathbb{1}$$
where z = the value of a certain position on the template image Z
x = the value of a certain position on the search image X
φ = the function after the CNN convolution operation
δ = the weight distribution of the template image Z obtained by the attention mechanism module
ε = the weight distribution of the search image X obtained by the attention mechanism module
∗ = the cross-correlation operation
b𝟙 = a bias term that takes the value b at each position in the score map
The specific algorithm flow can be described as:
(1) Input the template image Z and the search image X. The size of the template image Z is 127 × 127 × 3, and the size of the search image X is 255 × 255 × 3.
(2) Apply the CNN to the template image Z and the search image X separately. This is the feature extraction step. After convolution, the template image Z and the search image X yield feature maps of size 6 × 6 × 512 and 22 × 22 × 512, respectively.
(3) Weight the extracted feature maps first by channel and then by spatial position. The feature maps of the template image Z and the search image X are each fed into an attention mechanism module. The feature maps are first weighted across channels to improve the feature expression ability of the different channels, and then weighted across space to highlight the importance of different locations.
(4) Calculate the response score map. A cross-correlation operation is performed between the channel- and space-weighted features of the template image Z and those of the search image X to generate the response score map.
(5) Track the object. During tracking, the response score map is generally calculated on a search image centered on the target position in the previous frame. Finally, the position with the largest score is multiplied by the step size to determine the current target position.
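Step (5) can be illustrated with a small sketch that turns the peak of the score map into a displacement of the target. The values used here are assumptions rather than details confirmed by the paper: a 17 × 17 score map (22 − 6 + 1), a total network stride of 8 (three 2 × 2 pooling layers), and the SiamFC convention of measuring the peak relative to the centre of the score map. Upsampling of the score map and scale correction are left out, and the function names are hypothetical.

```python
def peak_to_displacement(peak_row, peak_col, score_size=17, total_stride=8):
    """Convert a score-map peak into a pixel displacement in the search image.

    The centre of the score map corresponds to "no motion", so the offset
    from the centre, scaled by the network stride, gives the shift of the
    target relative to its previous position.
    """
    centre = (score_size - 1) / 2
    dy = (peak_row - centre) * total_stride
    dx = (peak_col - centre) * total_stride
    return dx, dy

def update_position(prev_cx, prev_cy, peak_row, peak_col):
    """Move the previous target centre by the displacement of the peak."""
    dx, dy = peak_to_displacement(peak_row, peak_col)
    return prev_cx + dx, prev_cy + dy

# A peak two cells below and two cells left of the centre cell (8, 8).
new_cx, new_cy = update_position(100.0, 80.0, peak_row=10, peak_col=6)
print(new_cx, new_cy)
```

The displacement is expressed in the coordinates of the 255 × 255 search image; mapping it back to the original frame would also require the crop's resize factor, which is omitted here.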

Attention Mechanism Module
The algorithm in this paper adds the attention mechanism to the SiamFC algorithm and weights, by channel and by spatial position, the features obtained by convolving the input image. This makes the extracted features more effective and improves the ability of the neural network model to discriminate targets. The attention mechanism module consists of a channel attention module and a spatial attention module.

Channel Attention Module
The channel attention module is added to the SiamFC algorithm mainly so that the convolutional neural network adapts better to changes in the appearance semantics of the tracked target. The channel attention module increases the weight of feature channels related to the target and reduces the weight of feature channels unrelated to it, which highlights the target to be tracked. The channel attention module can also change the dependence between different channels. Each channel of the feature map produced by a high convolutional layer can be regarded as a response to a specific object category, and the feature responses of different object categories are interrelated. Therefore, exploiting and strengthening the interdependence between the feature maps of different channels can improve the ability to express the target features. The structure of the channel attention module is shown in Figure 4. First, the channel set of the input feature map is defined as:
$$A = [a_1, a_2, \ldots, a_n]$$
where $a_k \in \mathbb{R}^{W \times H}$, $k = 1, 2, 3, \ldots, n$.
Next, global pooling is applied to the input feature map A, and the resulting feature vector is:
$$b = [b_1, b_2, \ldots, b_n]$$
where $b_k \in \mathbb{R}^{1 \times 1}$, $k = 1, 2, 3, \ldots, n$.
The feature vector b is then passed through a fully connected layer FC and activated with the nonlinear ReLU function, so that the result is nonlinear. After a second fully connected layer FC, the sigmoid activation function yields the feature vector:
$$\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$$
where $\alpha_k \in \mathbb{R}^{1 \times 1}$, $k = 1, 2, 3, \ldots, n$.
Finally, the feature vector α is applied to the original feature map A to rescale the feature channels, and the channel set of the resulting channel attention feature map is:
$$\bar{A} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_n], \quad \bar{a}_k = \alpha_k \cdot a_k$$
where $\bar{a}_k \in \mathbb{R}^{W \times H}$, $k = 1, 2, 3, \ldots, n$.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition). This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-1315-2020 | © Authors 2020. CC BY 4.0 License.
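The channel attention steps (global pooling, FC, ReLU, FC, sigmoid, channel rescaling) can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the authors' implementation: the pooling is taken to be global average pooling, the fully connected layers have no bias, the hidden size `r` is arbitrary, and the weights `W1` and `W2` are random stand-ins for learned parameters.

```python
import numpy as np

def channel_attention(A, W1, W2):
    """Squeeze-and-excitation style channel weighting.

    A  : feature map of shape (n, H, W), one slice per channel
    W1 : first fully connected layer, shape (r, n)
    W2 : second fully connected layer, shape (n, r)
    """
    b = A.mean(axis=(1, 2))                        # global average pooling -> (n,)
    hidden = np.maximum(W1 @ b, 0.0)               # first FC layer + ReLU
    alpha = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))   # second FC layer + sigmoid
    A_bar = alpha[:, None, None] * A               # rescale every channel by its weight
    return A_bar, alpha

# Toy sizes (assumption): 8 channels of 6 x 6 instead of 512 channels.
rng = np.random.default_rng(0)
n, r = 8, 2
A = rng.standard_normal((n, 6, 6))
W1 = rng.standard_normal((r, n)) * 0.1
W2 = rng.standard_normal((n, r)) * 0.1

A_bar, alpha = channel_attention(A, W1, W2)
print(A_bar.shape, alpha.round(3))
```

Each weight lies between 0 and 1, so a channel is never amplified beyond its original response; channels the network considers irrelevant to the target are attenuated instead.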

Spatial Attention Module
The spatial attention module is added to the SiamFC algorithm mainly to assign different weights to different spatial positions on the feature map, because different spatial positions have different importance in feature extraction.
In the spatial attention module, the attention mechanism establishes a connection between any two positions in the feature map. The feature at a given position is calculated as a weighted sum of the feature information at all positions on the feature map. Finally, the input feature and the spatial location feature are added element-wise to further enhance the feature expression ability of the network model. The spatial attention mechanism increases the weight of the spatial positions of important features and makes the features more effective, while adding little computational cost to the algorithm.

Figure 5. Spatial attention module
The structure of the spatial attention module is shown in Figure 5. First, the input feature map x is passed through three separate convolutions with kernels of size 1 × 1. The three convolutions correspond to the transformation functions f(x), g(x) and h(x):
$$f(x) = W_1 \cdot x, \quad g(x) = W_2 \cdot x, \quad h(x) = W_3 \cdot x$$
where W1 = the weight of the function f(x)
W2 = the weight of the function g(x)
W3 = the weight of the function h(x)
Then, the output of f(x) is transposed and matrix multiplied by the output of g(x), and the softmax function is applied to the result to obtain the spatial attention map:
$$S_{b,a} = \frac{\exp\left(f(x_a)^{T} g(x_b)\right)}{\sum_{a'=1}^{N} \exp\left(f(x_{a'})^{T} g(x_b)\right)}$$
where a = the a-th position on the input image
b = the b-th position on the input image
N = the number of positions
Then, the spatial attention map is matrix multiplied by the output of h(x), and the result is added to the input feature map x to give the feature map adjusted by the spatial attention module. The final output is:
$$y_b = \beta \sum_{a=1}^{N} S_{b,a}\, h(x_a) + x_b$$
where x = the input feature map
β = the weight parameter
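The spatial attention computation can be sketched in NumPy by treating each 1 × 1 convolution as a matrix multiplication applied at every position. The sketch below is an illustration under assumptions: the reduced channel count used for f and g, the toy feature map size, the random weights and the function name `spatial_attention` are all hypothetical, and the softmax is taken over the positions a for each output position b.

```python
import numpy as np

def spatial_attention(x, W1, W2, W3, beta):
    """Self-attention over the spatial positions of a feature map.

    x      : feature map of shape (C, H, W)
    W1, W2 : 1x1 convolution weights for f and g, shape (Ck, C)
    W3     : 1x1 convolution weight for h, shape (C, C)
    beta   : scalar weight on the attention branch
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                       # one column per position
    f, g, h = W1 @ flat, W2 @ flat, W3 @ flat        # 1x1 convs as matmuls
    logits = f.T @ g                                 # logits[a, b] = f(x_a)^T g(x_b)
    logits -= logits.max(axis=0, keepdims=True)      # for numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)          # softmax over a for every b
    out = h @ attn                                   # out[:, b] = sum_a attn[a, b] * h(x_a)
    y = (beta * out + flat).reshape(C, H, W)         # residual connection to the input
    return y, attn

# Toy sizes (assumption): 4 channels on a 3 x 3 grid, 2 channels for f and g.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((2, 4))
W3 = rng.standard_normal((4, 4))

y, attn = spatial_attention(x, W1, W2, W3, beta=0.5)
y_identity, _ = spatial_attention(x, W1, W2, W3, beta=0.0)
print(y.shape, attn.shape)
```

With β = 0 the module returns its input unchanged, which is why a learnable β that starts small lets the network phase in the attention branch gradually; whether the paper initialises β this way is not stated.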

CNN Network Structure
In the SiamFC algorithm, the CNN backbone is AlexNet (Lecun, Bottou, 1998). In the algorithm of this paper, the backbone is the deeper VGG-16 model (Simonyan, Zisserman, 2014), with some modifications to suit the algorithm.
The VGG-16 convolutional neural network model has 16 layers (excluding the pooling layers), of which 13 are convolutional layers and 3 are fully connected layers. In the algorithm of this paper, the VGG-16 model is modified to meet the needs of the algorithm. The main changes are that the last three convolutional layers and the three fully connected layers are removed, and the maxpooling layer before convolutional layer 4-1 is moved to after convolutional layer 4-1. Table 1 gives the specific CNN network structure parameters of the algorithm in this paper, including the size of the convolution kernel, the stride of the convolution, the number of channels, the template image size and the search image size. The resulting CNN contains 10 convolutional layers and 3 maxpooling layers, and no padding is used in the network. Except for the last layer, Conv4-3, each convolutional layer in the network uses the ReLU function for nonlinear activation. When training the network, batch normalization (BN) is performed after each convolutional layer.
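The feature map sizes quoted for this backbone can be checked with a little arithmetic. The sketch below assumes the standard VGG-16 settings, which are not restated in the text: 3 × 3 convolutions with stride 1 and 2 × 2 max pooling with stride 2, together with the layer order implied by the description (the third pooling layer placed after Conv4-1). The list `LAYERS` and the function `output_size` are illustrative names.

```python
# Layer order assumed from the description: 10 convolutions, 3 poolings.
LAYERS = [
    ("conv", 3), ("conv", 3), ("pool", 2),   # Conv1-1, Conv1-2, pooling
    ("conv", 3), ("conv", 3), ("pool", 2),   # Conv2-1, Conv2-2, pooling
    ("conv", 3), ("conv", 3), ("conv", 3),   # Conv3-1, Conv3-2, Conv3-3
    ("conv", 3), ("pool", 2),                # Conv4-1, then the moved pooling
    ("conv", 3), ("conv", 3),                # Conv4-2, Conv4-3
]

def output_size(size):
    """Spatial size after the backbone, with no padding anywhere."""
    for kind, k in LAYERS:
        if kind == "conv":
            size = size - k + 1          # k x k convolution, stride 1
        else:
            size = (size - k) // 2 + 1   # k x k max pooling, stride 2
    return size

template_size = output_size(127)
search_size = output_size(255)
score_size = search_size - template_size + 1
print(template_size, search_size, score_size)
```

Under these assumptions the 127 × 127 template and the 255 × 255 search image give 6 × 6 and 22 × 22 feature maps, matching the sizes in the algorithm flow, and cross-correlating them gives a 17 × 17 score map.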

Experimental Environment and Dataset
In the experiment, the operating system was Linux (Ubuntu 16.04). CUDA was used for GPU acceleration, and the GPU was an NVIDIA GeForce GTX 1060.
The deep learning framework used in the experiment is PyTorch, and the implementation language is Python.
The datasets used in the experiment include training datasets and test datasets. The training datasets are the GOT-10k (Huang et al, 2018) dataset and the VID (Russakovsky et al, 2014) dataset. The test datasets are OTB2015 (Wu et al, 2015), VOT2016 (Kristan et al, 2016) and VOT2017 (Kristan et al, 2017).

Experiment Details
The training data needs to be preprocessed in the experiment. The siamese network structure requires the training data to be image pairs, so the training data is processed into image pairs (z, x). The template image Z and the search image X are both centered on the target and extracted from two frames of the same video. Any part of the crop that extends beyond the image is filled with the average RGB value, and the aspect ratio of the target is kept unchanged. The specific category of the target is not considered during training, and all input images of the network model have a uniform size.
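The cropping and padding described above can be sketched as follows. This is a simplified illustration under assumptions: the average RGB value is computed over the whole frame, the crop is a square of a given side length, and the subsequent resizing to 127 × 127 or 255 × 255 is omitted. The function name `crop_with_mean_padding` is hypothetical.

```python
import numpy as np

def crop_with_mean_padding(image, cx, cy, side):
    """Cut a side x side square centred on (cx, cy) out of an H x W x 3 image.

    Pixels of the square that fall outside the image are filled with the
    per-channel mean colour of the image.
    """
    h, w, _ = image.shape
    mean_rgb = image.reshape(-1, 3).mean(axis=0)
    patch = np.empty((side, side, 3), dtype=np.float64)
    patch[:] = mean_rgb                      # start from a mean-coloured canvas
    x0 = int(round(cx - side / 2))           # top-left corner of the square
    y0 = int(round(cy - side / 2))
    # Part of the square that actually overlaps the image.
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x0 + side, w), min(y0 + side, h)
    if sx1 > sx0 and sy1 > sy0:
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
    return patch

# Toy 10 x 10 frame; the crop is centred on the top-left corner so that
# three quarters of it lie outside the image.
frame = np.arange(10 * 10 * 3, dtype=np.float64).reshape(10, 10, 3)
patch = crop_with_mean_padding(frame, cx=0, cy=0, side=4)
print(patch[0, 0], patch[2, 2])
```

Applying the same function twice, once to each of two frames of a video with the target centre of the respective frame, yields one training pair (z, x) before resizing.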
The pre-trained model used in the experiment was trained on the ImageNet dataset. The network model is trained with stochastic gradient descent (SGD). The momentum is set to 0.9, the learning rate decays exponentially from 10^-2 to 10^-8, and the weight decay is set to 0.0005. The model was trained for 50 epochs with a mini-batch size of 16.
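The exponential learning rate decay can be made concrete with a short sketch. It assumes that the rate is multiplied by a constant factor once per epoch and reaches the final value in the last of the 50 epochs; the paper does not state whether the decay is applied per epoch or per iteration, and the function name is illustrative.

```python
def exponential_lr_schedule(lr_start=1e-2, lr_end=1e-8, epochs=50):
    """Learning rate for each epoch under a constant per-epoch decay factor."""
    # gamma is chosen so that lr_start * gamma ** (epochs - 1) == lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(epochs)], gamma

learning_rates, gamma = exponential_lr_schedule()
print(f"gamma = {gamma:.4f}")
print(f"first = {learning_rates[0]:.1e}, last = {learning_rates[-1]:.1e}")
```

In PyTorch, a per-epoch multiplication of this kind is what `torch.optim.lr_scheduler.ExponentialLR` applies through its `gamma` argument, so the factor computed here could be passed to it directly.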
To handle scale changes during tracking, the experiment follows the multi-scale test of SiamFC. The target is tested at three scales, with scale factors of 1.025^-1, 1.025^0 and 1.025^1. These factors are applied to the search image before the search is performed.
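The three-scale search can be illustrated as follows, assuming the factors are 1.025 raised to the powers −1, 0 and 1 as in the SiamFC multi-scale scheme. The helper names and the base region size of 200 pixels are hypothetical.

```python
def scale_factors(step=1.025, num_scales=3):
    """Scale factors step**k for k = -half .. +half."""
    half = num_scales // 2
    return [step ** k for k in range(-half, half + 1)]

def search_region_sizes(base_size, factors):
    """Side length of the search region cropped at each scale."""
    return [base_size * factor for factor in factors]

factors = scale_factors()
sizes = search_region_sizes(200.0, factors)
for factor, size in zip(factors, sizes):
    print(f"factor {factor:.4f} -> search region {size:.1f} px")
```

Each scaled region is resized to the network input size and scored; the scale whose score map has the highest peak is usually taken as the new target scale, although the paper does not spell out its selection rule.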

Experimental Results
This paper evaluates the proposed siamese network object tracking algorithm combined with attention mechanism on the test datasets. The training time in the experiment was about 26 hours. During testing, each test sequence required 50 seconds on average. The tracking results in three scenes are described below.
The first scene is a running competition. On the sports ground, multiple runners run on different tracks, and one of them is selected as the tracking target. During the race, the runners pass one another because of their different speeds. Because the tracking target occludes the runner next to it, the tracking result of the SiamFC algorithm drifts to that runner. The method proposed in this paper stays focused on the target.
In the second scene, a box is moved up, down, left and right in front of a complicated background. During the movement, the box is partially occluded by surrounding objects. After the occlusion, the tracking result of the SiamFC algorithm drifts to the surrounding objects, whereas the method proposed in this paper keeps track of the target box.
In the third scene, several types of bottles are placed on a table and swapped left and right. While the bottles move, the target bottle occludes other bottles or is occluded by them. The tracking result of the SiamFC algorithm drifts to non-target bottles, whereas the method proposed in this paper keeps track of the target bottle.
In summary, the tracking results show that when a complex background changes, the SiamFC algorithm may track other surrounding objects with similar semantic information, while the algorithm in this paper stays focused on the target itself and achieves a better tracking result.

Comparative Analysis
To further analyze the performance of the proposed algorithm and verify its effectiveness, its tracking results were compared with those of a variety of methods on multiple test datasets.

Results on OTB2015 Dataset
Each sequence in the OTB2015 dataset has attribute labels such as scale change, occlusion and background clutter, so the 31 sequences labelled with background clutter are selected from the OTB2015 dataset for the experimental test. The experiment uses precision and success rate to evaluate the results, and the evaluation method is One-Pass Evaluation (OPE).
In the success plots, the algorithm in this paper is also ranked second, with an average success rate of 0.613. This is 0.9% lower than the KCF algorithm and higher than the success rates of algorithms such as MEEM and SiamFC.
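For reference, the two OPE measures can be sketched at a single threshold each. The definitions follow the usual OTB convention: a frame counts toward precision when the centre location error is at most 20 pixels, and toward success when the overlap (intersection over union) exceeds a threshold. The boxes, the thresholds chosen here and the function names are illustrative; the success plot itself sweeps the overlap threshold from 0 to 1.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) = top-left."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def centre_error(box_a, box_b):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))

def success_rate(predicted, ground_truth, overlap_threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = sum(iou(p, g) > overlap_threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

def precision(predicted, ground_truth, pixel_threshold=20.0):
    """Fraction of frames whose centre error is within the threshold."""
    hits = sum(centre_error(p, g) <= pixel_threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# Two toy frames: a perfect prediction and one that drifted 30 px to the right.
ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (30, 0, 10, 10)]
print(success_rate(predicted, ground_truth), precision(predicted, ground_truth))
```

In One-Pass Evaluation the tracker is initialised once with the ground truth of the first frame and then run through the whole sequence without re-initialisation, so a drift like the one in the second toy frame lowers both measures for every following frame.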

Results on VOT2016 and VOT2017 Datasets
The method proposed in this paper is also tested on the VOT2016 and VOT2017 datasets. In the experiment, three indexes are used to evaluate the results: Accuracy, Robustness and Expected Average Overlap (EAO). Table 2 shows the comparison between our method and the trackers DSST (Danelljan et al, 2014), MDNet (Nam, Han, 2016), UPDT (Bhat et al, 2018), MEEM (Zhang et al, 2014), Staple (Bertinetto et al, 2016b), SRDCF (Danelljan et al, 2016b), CSRDCF (Lukezic et al, 2017), C-COT (Danelljan et al, 2016a), ECO-HC (Danelljan et al, 2017), ECO (Danelljan et al, 2017), SiamFC (Bertinetto et al, 2016a), DensSiam (Abdelpakey et al, 2018), SiamRPN (Li et al, 2018) and SA-Siam (He et al, 2018).
On the VOT2016 dataset, the Accuracy of the algorithm in this paper is 0.55, ranking second. This is 1% lower than the DensSiam and SiamRPN algorithms, the same as the ECO algorithm and 2% higher than the SiamFC algorithm. The Robustness value is 0.35, which ranks relatively low but is still better than that of the SiamFC algorithm. The EAO is 0.261, ranking in the middle of the 14 algorithms.
On the VOT2017 dataset, the Accuracy of the algorithm in this paper is 0.51, ranking fourth. This is 3% lower than the DensSiam algorithm and 1% higher than the SiamFC algorithm. The Robustness value is 0.51, which ranks relatively low. However, it is 8% lower than that of the SiamFC algorithm, and since a lower Robustness value means fewer tracking failures, the robustness is better. The EAO is 0.221, ranking ninth.

CONCLUSION
Visual object tracking estimates the position of the target in an image sequence. It has been a research hotspot in recent years and is used in many practical applications, such as automated monitoring, intelligent transportation, and robot positioning and navigation. Although research on object tracking has made great progress in recent years, object tracking is still a challenging task. Factors such as occlusion, background interference, target scale changes and ambient light changes may affect the tracking results and may even cause tracking failure. Therefore, to meet different practical application requirements, it is of great practical significance to study visual object tracking algorithms with higher accuracy. With the popularity of deep learning methods, there are more and more object tracking algorithms based on deep learning, but these methods still leave many problems to be solved.
To address these problems, this paper builds on the theory of siamese network object tracking and introduces the attention mechanism to reduce background interference in siamese network object tracking. The proposed siamese network object tracking algorithm combined with attention mechanism improves the performance of the tracking algorithm and copes better with the complex background problem in object tracking. The algorithm was tested on multiple datasets and compared with multiple object tracking algorithms. Experimental results show that the algorithm in this paper handles the complex background problem in object tracking better and has certain advantages over other algorithms.