SPECTRUM-BASED OBJECT DETECTION AND TRACKING TECHNIQUE FOR DIGITAL VIDEO SURVEILLANCE

This paper presents a motion detection and object tracking technique for digital video surveillance applications. Motion analysis algorithms are based on processing of multiple-regression pseudospectrums. A complete object detection and tracking scheme is described. Results of testing on the public PETS and ETISEO test beds are outlined.


INTRODUCTION
Video surveillance is one of the key technologies of modern security systems. Digital video surveillance implies visual monitoring of a territory with one or more video cameras, which allows digital video data to be stored and viewed, the state of the monitored region to be evaluated continuously, and changes in the observed scene to be detected as "security events".
The main drawback of traditional video surveillance systems, which deliver raw video to a human operator, is that the operator's ability to respond degrades seriously as the system grows in size. This problem is especially urgent for city-level surveillance systems. A well-known case is the video surveillance system deployed in London, Great Britain, which includes tens of thousands of cameras in a single network and more than half a million cameras in the whole city. Unfortunately, it did not provide a serious reduction in crime incidents or an increase in the crime detection rate. It is now clear that simply broadcasting camera video to a surveillance center is not enough. Video should be processed and alarms should be generated in real time to attract the operator's attention in critical situations.
Therefore, the design of high-performance intelligent video analytics systems is a pressing practical task. Moreover, such intelligent systems can address both security and counterterrorism objectives and can be useful in some business applications. For example, they can collect statistics on attendance at the observed site, the distribution of visitors over time, main routes of movement, etc. Another possible application is traffic monitoring.
Motion analysis is the basis of all intelligent video surveillance technologies. In particular, it provides the fundamentals for automatic detection and tracking of moving objects and for automatic detection of new or disappeared objects in the observed scene. It is a well-studied area of computer vision that includes many different techniques. A brief overview of these techniques is given in the next section. This paper describes the proposed technique and reports testing results on the PETS (PETS video database, n.d.) and ETISEO (ETISEO video database, n.d.) public video test beds.

RELATED WORKS
The motion detection and tracking problem is widely studied around the world. There are many methods and algorithms that detect motion and track moving objects. Let us dwell on the main approaches to video analysis. The first is the optical flow approach (Horn and Schunck, 1981, Nagel, 1983, Barron et al., 1994), first introduced in (Horn and Schunck, 1981). This approach is based on finding the velocity of each pixel between the previous and current frames. Let $I(k)$ be an input image pixel matrix with width $w$ and height $h$ at frame number $k$. It is assumed that the brightness of a point remains constant during a short period of time, which is expressed by the equation
$$\frac{dI(k)_{x,y}}{dk} = 0.$$
Hence, by the chain rule, we get the equation
$$\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial k} = 0,$$
where $(u, v)^T$ is the vector of pixel movement.
The optical flow velocity $(u, v)^T$ can then be found by the iterative method from (Horn and Schunck, 1981, Barron et al., 1994). The number of required iterations varies across books and papers, but achieving a good result takes over 100 iterations over the full image, which is very time-consuming.
The optical flow approach is useless if the image sequence contains a large amount of pixel noise. The next approach, correlation (Anandan, 1989, Singh, 1992), computes a correlation function for some area and minimizes it over the surrounding region to find the best match and, with it, the velocity vector $(u, v)^T$. Most correlation algorithms are based on minimizing an SSD (sum of squared differences) function, in which a weight function $W_{i,j}$ is defined over the area.
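The correlation idea can be illustrated with a minimal block-matching sketch. It is not the method of (Anandan, 1989) or (Singh, 1992): it assumes an exhaustive search over a small displacement range, and the names `weighted_ssd` and `best_displacement`, the window half-size `half`, the search radius `search` and the uniform default weight are all hypothetical choices made for the example.

```python
def weighted_ssd(prev, curr, x, y, u, v, half, weight):
    """Weighted sum of squared differences between a window of `prev`
    centred at (x, y) and the window of `curr` displaced by (u, v)."""
    total = 0.0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            d = prev[y + j][x + i] - curr[y + j + v][x + i + u]
            total += weight(i, j) * d * d
    return total


def best_displacement(prev, curr, x, y, half=1, search=2,
                      weight=lambda i, j: 1.0):
    """Exhaustive search for the (u, v) that minimises the weighted SSD.
    Assumes (x, y) is at least half + search pixels away from the border."""
    candidates = [(u, v)
                  for v in range(-search, search + 1)
                  for u in range(-search, search + 1)]
    return min(candidates,
               key=lambda uv: weighted_ssd(prev, curr, x, y,
                                           uv[0], uv[1], half, weight))


# Toy frames: a two-pixel pattern moves by u = 1, v = 1 between frames.
size = 9
prev = [[0.0] * size for _ in range(size)]
curr = [[0.0] * size for _ in range(size)]
prev[4][4], prev[4][5] = 10.0, 5.0
curr[5][5], curr[5][6] = 10.0, 5.0

print(best_displacement(prev, curr, x=4, y=4))
```

The exhaustive search makes the cost of the approach visible: every candidate displacement requires a full pass over the window, which is why pyramid or iterative schemes are used in practice.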
International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XXXIX-B3, 2012, XXII ISPRS Congress, 25 August - 01 September 2012, Melbourne, Australia
In (Anandan, 1989) the SSD function is optimized sequentially over a Laplacian pyramid. The minimum is found at every level of the pyramid, starting from the highest level (the smallest image) and descending to the lowest level (the whole image). The velocity vector is estimated more accurately at each level. In (Singh, 1992) the minimum of the SSD function is found through an iterative process.
However, the correlation approach is not robust either, because it depends strongly on the scene brightness remaining constant. In (Heeger, 1988) a frequency approach is proposed. This approach is based on a "power" function evaluated with the Gabor filter (Gabor, 1946) with frequencies $L_x$, $L_y$, $\omega$, where $\sigma_x$, $\sigma_y$, $\sigma_k$ are the standard deviations of the Gabor filter.
The velocity vector $(u, v)^T$ is found by minimizing a function with respect to $u$ and $v$, where $m_i$ is the measured power value, $R_i$ is the predicted power value, and $\bar{m}_i$ and $\bar{R}_i$ are the average power values.

REGRESSION PSEUDOSPECTRUMS
In this section we introduce the notion of multiple-regression pseudospectrums.
Let $I(k)$ again be an input image pixel matrix with width $w$ and height $h$ at frame number $k$, $I(k) \in \mathbb{R}^{w \times h}$. It is assumed that $I(k)$ is a grayscale image, so $0 \le I(k)_{x,y} \le 255$ for all $x = 1 \ldots w$, $y = 1 \ldots h$. Let us call $M_n(k)$ a regression accumulator of $n$ frames with parameter $\alpha$, calculated at frame $k$. It is a matrix $M_n(k) \in \mathbb{R}^{w \times h}$ (Box et al., 1994):
$$M_n(k) = \alpha M_n(k-1) + (1-\alpha)\, I(k). \tag{1}$$
The accumulator value $M_n(k)$ at frame $k$ can also be calculated by summing the older members of series (1):
$$M_n(k) = (1-\alpha) \sum_{i=0}^{\infty} \alpha^{i}\, I(k-i). \tag{2}$$
Let $l(k)$ be an element of the image matrix $I(k)$ and $m_n(k)$ be the element of the accumulator matrix $M_n(k)$ with the same coordinates as $l(k)$. Suppose that the accumulator (2) is initially zero and that, starting from some moment $k_0$ (without loss of generality, let $k_0 = 0$), a signal of constant intensity $l$ is applied for a sufficiently long time, so that $k$ frames of the signal have been accumulated by frame $k$:
$$m_n(k) = (1-\alpha)\, l \sum_{i=0}^{k-1} \alpha^{i} = l\,(1-\alpha^{k}). \tag{3}$$
Now it is simple to find the $\alpha$ for which $m_n(k)$ reaches the share $\beta$ of the signal $l$ after $n$ frames:
$$l\,(1-\alpha_n^{\,n}) = \beta l \;\Rightarrow\; \alpha_n = \sqrt[n]{1-\beta}. \tag{4}$$
Thus $\alpha_n$ is the time averaging parameter at which the accumulator sum equals $m_n(n) = \beta l$ after $n$ frames. Accordingly, $n$ can be called the $\beta$ memory length or, simply, the length of the accumulator memory with the corresponding averaging parameter $\alpha_n = \sqrt[n]{1-\beta}$.
Given that $\alpha_n$ is found from (4), the accumulator sum in one pixel at a variable frame $k$ is
$$m_n(k) = l\left(1 - (1-\beta)^{k/n}\right). \tag{5}$$
The $m_n(k)$ graphs for $\alpha_n$ with $n = 4, 8, 16, 32$ are shown in Figure 1, with $\beta = 0.5$, $l = 100$, $k_0 = 10$. Thus the time averaging parameter $\alpha_n$ defined in (4) is in fact the saturation parameter of the filter response function. It indicates after how many frames $n$ the accumulated sum will equal $\beta l$.
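The role of the memory length can be checked numerically with a short sketch. It assumes that the accumulator is the first-order recursion m(k) = α·m(k-1) + (1-α)·l(k), a form chosen here because it reproduces the stated property m_n(n) = βl for α_n = (1-β)^(1/n); the function names and the step-signal setup are hypothetical, with β = 0.5, l = 100 and k0 = 10 taken from the Figure 1 description.

```python
def alpha_for_memory(n, beta=0.5):
    """Averaging parameter alpha_n = (1 - beta) ** (1 / n)."""
    return (1.0 - beta) ** (1.0 / n)


def step_response(n, level=100.0, frames=64, k0=10, beta=0.5):
    """Run the recursive accumulator on a step signal that starts after frame k0."""
    alpha = alpha_for_memory(n, beta)
    m = 0.0
    history = []
    for k in range(frames):
        x = level if k > k0 else 0.0  # input is zero up to and including k0
        m = alpha * m + (1.0 - alpha) * x
        history.append(m)
    return history


responses = {n: step_response(n) for n in (4, 8, 16, 32)}
for n, hist in responses.items():
    # n frames after the onset the accumulator should hold the share beta of l
    print(n, round(hist[10 + n], 3))
```

For every n the printed value should be close to βl = 50, which is the sense in which n is the length of the accumulator memory.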
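According to (5), $\alpha_n$ possesses the multiplicity property
$$\alpha_{n \cdot s}^{\,s} = \alpha_n. \tag{6}$$
Consider the difference between the responses of accumulators with multiple memory lengths $n$ and $n \cdot s$. By (6), and assuming that a signal with intensity $l$ is applied from time $k_0 = 0$, this difference possesses a very interesting property:
$$D_{n,s}(k) = m_n(k) - m_{n \cdot s}(k) = l\left(\alpha_{n \cdot s}^{\,k} - \alpha_n^{\,k}\right) = l\,\alpha_{n \cdot s}^{\,k}\left(1 - \alpha_{n \cdot s}^{\,(s-1)k}\right). \tag{7}$$
Consider the behaviour of the difference function $D_{n,s}(k)$. Let $s = 2$ and $\beta = 0.5$. Then, according to (7), the difference between the accumulator with a memory of $n$ frames and the accumulator with a memory of $2n$ frames equals
$$D_{n,2}(k) = l\left(2^{-k/(2n)} - 2^{-k/n}\right).$$
Figure 2 shows the differences $D_{n,s}$ between accumulators for various $n$ with $s = 2$, $l = 100$.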
As can be seen, the quadrupled difference of multiple accumulators $4 \cdot D_{n,2}(k)$ has a single maximum on the interval where the signal is present. This maximum equals $l$ and is reached at frame number $2n$ (if the signal lasts long enough for the maximum to be reached at all).
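The position and height of this maximum can be verified with a small sketch. It assumes the closed form D_{n,s}(k) = l·(α_{ns}^k - α_n^k) with α_n = (1-β)^(1/n), which is a reconstruction consistent with the properties stated in the text (peak 0.25·l at frame 2n for s = 2, β = 0.5); the function name `difference` is hypothetical.

```python
def difference(n, k, level=100.0, s=2, beta=0.5):
    """Assumed closed form of D_{n,s}(k) for a constant signal applied from frame 0."""
    alpha_n = (1.0 - beta) ** (1.0 / n)
    alpha_ns = (1.0 - beta) ** (1.0 / (n * s))
    return level * (alpha_ns ** k - alpha_n ** k)


peaks = {}
for n in (1, 2, 4, 8, 16):
    values = [difference(n, k) for k in range(0, 8 * n + 1)]
    k_peak = max(range(len(values)), key=values.__getitem__)
    peaks[n] = (k_peak, values[k_peak])
    # frame of the maximum and the quadrupled peak value
    print(n, k_peak, round(4 * values[k_peak], 6))
```

Under these assumptions each component peaks at frame 2n and the quadrupled peak value is close to l = 100.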
Thus the behaviour of first-order regression differences with multiple memory lengths resembles a spectral decomposition or, rather, a wavelet transform of the signal. Let us call a multiple-regression pseudospectrum the set of differences of first-order regression accumulators (7) whose memory lengths are multiples given by a sequence of powers of two: 1, 2, 4, 8, ... (see Figure 2). This pseudospectrum makes it possible to investigate, both qualitatively and quantitatively, the duration and amplitude of an input time signal of the "meander" (rectangular pulse) type. If the maximum of the differences between the responses has been consistently reached for all accumulators up to memory length $N$, but the predicted signal maximum was not reached for the accumulator with memory length $n = N + 1$, it means that the constant input signal lasted $2N$ frames and then began to decrease or otherwise changed dramatically.
Similarly, we can draw conclusions about the magnitude of the signal. Because $D_{n,2}(2n) = 0.25\,l$ for all $n$ whose maximum was reached, $l = 4 D_{n,2}(2n)$.
The expected maximum value of $D_{n,2}(k)$ can be easily found, for example, for $n = 1$. It should then be compared with the differences between accumulators $D_{n,2}(k)$ for other $n$, until the maximum at frame $k = 2m$ becomes less than all previous maximums for $n < m$.
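The comparison procedure described above can be sketched for a single noiseless pixel. The sketch assumes the recursive accumulator form m(k) = α·m(k-1) + (1-α)·l(k), a rectangular pulse that starts at frame 1, and memory lengths that are powers of two; the names `pseudospectrum` and `estimate_pulse` and the tolerance `tol` are hypothetical.

```python
def pseudospectrum(signal, scales, beta=0.5):
    """Return {n: [D_{n,2}(k)]} for a 1-D pixel signal; signal[0] is frame 1."""
    result = {}
    for n in scales:
        a_fast = (1.0 - beta) ** (1.0 / n)
        a_slow = (1.0 - beta) ** (1.0 / (2 * n))
        fast = slow = 0.0
        diffs = []
        for x in signal:
            fast = a_fast * fast + (1.0 - a_fast) * x
            slow = a_slow * slow + (1.0 - a_slow) * x
            diffs.append(fast - slow)
        result[n] = diffs
    return result


def estimate_pulse(signal, scales=(1, 2, 4, 8, 16, 32), tol=1e-6):
    """Estimate the amplitude and list the scales whose peak was reached."""
    spectrum = pseudospectrum(signal, scales)
    first = scales[0]
    expected_peak = spectrum[first][2 * first - 1]   # D at frame 2n for the first scale
    reached = []
    for n in scales:
        value = spectrum[n][2 * n - 1]               # D_{n,2} at frame 2n
        if abs(value - expected_peak) <= tol * abs(expected_peak):
            reached.append(n)
        else:
            break                                    # signal ended before frame 2n
    return 4.0 * expected_peak, reached


signal = [100.0] * 12 + [0.0] * 60                   # pulse of 12 frames
amplitude, reached = estimate_pulse(signal)
print(amplitude, reached)
```

In this toy case the peaks are reached for n = 1, 2, 4 but not for n = 8, so the pulse lasted at least 8 and fewer than 16 frames, and the amplitude estimate is four times the common peak value. Real video would need a noise-aware tolerance rather than the near-exact comparison used here.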
Now consider the problem of determining the sensitivity threshold of the algorithm that detects brightness changes in images. Figure 3 shows the shape of the multiple-regression pseudospectrum when the signal is present in the image sequence for a shorter time.
Apparently, for a shorter signal, the lower-frequency components of the pseudospectrum start to move in the negative direction from higher initial values (after reacting to the passage of the leading edge of the signal) and thus reach the corresponding extremum (in this case a minimum) at values lower in magnitude than the threshold specified on the basis of the expected drop estimate (8). Figure 3 illustrates this well with the function $D_{16,2}(k)$ (the lowest-frequency component of the presented pseudospectrum). However, this problem can be solved if a pair of consecutive pseudospectrum components is considered jointly.
Consider the pseudospectrum component $D_{8,2}(k)$ that precedes $D_{16,2}(k)$ in Figure 3. Since its response to a change of the input signal is much faster, it crosses the zero line much earlier once the signal disappears.
Analysis of the introduced multiple-regression pseudospectrums is particularly useful for image analysis that studies moving objects or left/missing items. On the one hand, the motion of an object relative to the background generates, through occlusion of image pixels, a temporal "meander" signal in each individual pixel, with clearly defined leading and trailing edges (brightness fluctuations over time). On the other hand, signal analysis based on the difference between accumulators with multiple memory lengths significantly decreases the processing time of machine vision systems. Because estimates of the time signal characteristics must be obtained independently for each image pixel, using statistics more complex than accumulated sums would directly lead to a huge increase in computation time, in program memory use, or in both.

ALGORITHMIC SCHEME
In this section we introduce the algorithmic scheme, which includes image preprocessing, motion detection and object tracking.
Object detection and tracking are implemented as a modular three-stage procedure:
1. Detection of moving pixel groups based on pseudospectrum analysis.
2. Forming of object hypotheses and interframe object tracking.
3. Spatiotemporal filtration of object motion parameters.
Let us consider the first and second stages of this procedure.
Detection of moving pixel groups is performed as follows:
• Calculate the $D_{n,2}(k)$ pseudospectrums for various $n$, for example $n = 2, 4, 8, 16$, in each pixel of the image at frame number $k$.
• If a signal exists in some pixels, then $|D_{n,2}(k)|$ in them will be greater than zero. It can be either a signal from an object or noise in the image sequence. To make the algorithm more robust, the noise should be filtered with a threshold. This threshold can be found adaptively on each frame using the methods described above.
• Divide the whole accumulator image into many square parts using a grid. Treat each small square as moving if its value is greater than the threshold and as not moving (background) otherwise. Let us call these small image squares moving image elements $\omega_1 \ldots \omega_m$.
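The grid step above can be sketched as follows. The text does not specify how the value of a square is computed, so the sketch assumes the mean of |D| over the square; the function name `moving_elements`, the cell size and the fixed threshold are hypothetical, and the adaptive threshold selection is not modelled.

```python
def moving_elements(diff_image, cell, threshold):
    """Return the set of (row, col) grid cells whose mean |D| exceeds the threshold.
    diff_image is a list of rows holding one pseudospectrum component per pixel."""
    height = len(diff_image)
    width = len(diff_image[0])
    moving = set()
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            block = [abs(diff_image[y][x])
                     for y in range(top, min(top + cell, height))
                     for x in range(left, min(left + cell, width))]
            if sum(block) / len(block) > threshold:
                moving.add((top // cell, left // cell))
    return moving


# Toy 4x6 difference image with weak noise and one active 2x2 region.
image = [
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 9, -8, 0, 0],
    [0, 0, -7, 9, 1, 0],
]
print(moving_elements(image, cell=2, threshold=3.0))
```

Only the grid cell covering the active region is reported as a moving element; the isolated noisy pixels stay below the threshold after averaging.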
Moving objects are created from the moving image elements, and hypotheses about them are formed by considering the following cases:
• No object is associated with the moving element. This moving element therefore belongs to a new object.
• No moving element is associated with the object. The object is treated as lost on this frame; it may be found again later.
• Several moving elements are associated with the object. The object is treated as found on this frame, and a new position is calculated for it.
• Several objects are associated with one moving element. This case is called a "collision". It is the most difficult case and must be treated very carefully; additional algorithms are needed to resolve the conflict.
As a result, on each frame we have a number of moving objects with unique IDs and a number of new or disappeared objects, also with unique IDs.
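The four association cases can be sketched as a simple classification. The paper does not state the association criterion, so the sketch assumes that a moving element is associated with an object when the element's grid cell falls inside the object's bounding box from the previous frame; the function `associate`, the box representation and the returned dictionary are hypothetical, and collision resolution itself is not attempted.

```python
def associate(objects, elements):
    """objects: {id: (top, left, bottom, right)} in grid cells, inclusive.
    elements: list of (row, col) moving cells on the current frame.
    Returns the four association cases described in the text."""
    def inside(cell, box):
        r, c = cell
        top, left, bottom, right = box
        return top <= r <= bottom and left <= c <= right

    by_object = {oid: [] for oid in objects}
    new_elements, collisions = [], []
    for cell in elements:
        owners = [oid for oid, box in objects.items() if inside(cell, box)]
        if not owners:
            new_elements.append(cell)             # element of a new object
        elif len(owners) > 1:
            collisions.append((cell, owners))     # several objects, one element
        for oid in owners:
            by_object[oid].append(cell)
    found = {oid: cells for oid, cells in by_object.items() if cells}
    lost = [oid for oid, cells in by_object.items() if not cells]
    return {"new": new_elements, "lost": lost,
            "found": found, "collisions": collisions}


objects = {1: (0, 0, 2, 2), 2: (2, 2, 4, 4), 3: (8, 8, 9, 9)}
elements = [(1, 1), (2, 2), (6, 6)]
result = associate(objects, elements)
print(result)
```

In this toy frame, object 3 is lost, cell (6, 6) starts a new object, and cell (2, 2) is a collision between objects 1 and 2 that a real tracker would have to resolve with additional logic.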

EXPERIMENTAL RESULTS
The described algorithms were tested on private video databases and on public-domain video databases such as PETS (PETS video database, n.d.) and ETISEO (ETISEO video database, n.d.). A typical screenshot of the object tracking visualization is presented in Figure 4.
We created a block for analyzing and testing the algorithms that compares the results of automatic object detection and tracking with the results of manual object marking. Performance is measured in FPS (frames per second processed). Detection probability is estimated in terms of "precision" and "recall".
The "precision" is the percentage ratio of real (human-marked) objects tracked by the algorithm to the total number of objects tracked by the algorithm. Simply put, 100% minus precision is the percentage of false detections produced by the algorithm. The "recall" is the percentage ratio of human-marked objects found by the algorithm to the total number of human-marked objects in a sequence, i.e. 100% minus recall is the percentage of real objects that the algorithm did not find.
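The two measures can be written down directly from these definitions. The sketch below is only an illustration: the function name `precision_recall` and the counts in the usage line are hypothetical and are not results from the paper.

```python
def precision_recall(matched, tracked_total, marked_total):
    """Precision and recall in percent.
    matched:       human-marked objects that the algorithm tracked
    tracked_total: all objects tracked by the algorithm
    marked_total:  all human-marked objects in the sequence"""
    precision = 100.0 * matched / tracked_total if tracked_total else 0.0
    recall = 100.0 * matched / marked_total if marked_total else 0.0
    return precision, recall


# Hypothetical counts: 18 correct tracks out of 20 reported, 24 objects marked by hand.
print(precision_recall(18, 20, 24))
```

With these made-up counts, 10% of the reported tracks are false detections and 25% of the marked objects are missed.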
Table 1 lists some video sequences from the PETS and ETISEO databases and the corresponding processing results. FPS was estimated specifically for a budget PC configuration: an Intel Atom N270 1600 MHz processor and 1 GB of RAM.

CONCLUSION
The problem of automatic video analysis for object detection and tracking is the most significant algorithmic topic in digital video surveillance. A new motion analysis and object tracking technique is presented. The motion analysis algorithms are based on forming and processing multiple-regression pseudospectrums.
The object detection and tracking scheme comprises detection of moving pixel groups based on pseudospectrum analysis, forming of object hypotheses and interframe object tracking, and spatiotemporal filtration of object motion parameters. Results of testing on the public-domain PETS and ETISEO video test beds are outlined.

Figure 3: Dynamic brightness threshold correction based on pseudospectrum.
The preceding component $D_{8,2}(k)$ responds to a change of the input signal much faster than $D_{16,2}(k)$, so it crosses the zero line much earlier after the signal disappears. At this point, the value of the current $D_{16,2}(k)$ component is still significantly greater than zero. It is proposed to memorize this value (the value of the $D_{16,2}(k)$ pseudospectrum component at the moment when the preceding component $D_{8,2}(k)$ crosses the zero line) for each pixel and then to use it as a dynamic correction to the threshold that detects brightness changes. As shown in Figure 3, with the dynamically corrected threshold the trailing edge of the signal is detected successfully even when the input signal is significantly shorter than the characteristic accumulation time of this pseudospectrum component.
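The memorized value can be illustrated on a single noiseless pixel. The sketch assumes the recursive accumulator m(k) = α·m(k-1) + (1-α)·l(k) with α_n = (1-β)^(1/n), β = 0.5, and a 12-frame pulse of intensity 100; the helper names and the pulse length are hypothetical, and how the correction enters the threshold is not modelled.

```python
def run_accumulators(signal, memories, beta=0.5):
    """Return {n: [m_n(k)]} for the recursive accumulators on a 1-D signal."""
    out = {}
    for n in memories:
        alpha = (1.0 - beta) ** (1.0 / n)
        m, hist = 0.0, []
        for x in signal:
            m = alpha * m + (1.0 - alpha) * x
            hist.append(m)
        out[n] = hist
    return out


def first_zero_crossing(values):
    """Index of the first frame where a positive component becomes non-positive."""
    for k in range(1, len(values)):
        if values[k - 1] > 0.0 and values[k] <= 0.0:
            return k
    return None


signal = [100.0] * 12 + [0.0] * 60           # pulse shorter than the 16-frame memory
acc = run_accumulators(signal, (8, 16, 32))
d8 = [a - b for a, b in zip(acc[8], acc[16])]     # D_{8,2}(k)
d16 = [a - b for a, b in zip(acc[16], acc[32])]   # D_{16,2}(k)

k8 = first_zero_crossing(d8)                 # faster component crosses zero first
k16 = first_zero_crossing(d16)
correction = d16[k8]                         # value memorised for this pixel
print(k8, k16, round(correction, 3))
```

In this setup the faster component crosses zero well before the slower one, and the slower component is still clearly positive at that moment, which is the value the text proposes to store per pixel.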
A moving object is created from moving image elements $\omega_1 \ldots \omega_m$. Various moving elements exist for all values of $n$ (or do not exist if there are no moving objects in the current frame of the video sequence). Pseudospectrums with longer memory are clearly more robust to noise, but they take longer to react when a signal starts to arrive in some pixels. Pseudospectrums with shorter memory react to a pixel signal much faster, but they react to noise as readily as to a real signal. So if an element is a moving one, its signal should exist in most of the faster pseudospectrums, and if it belongs to a new or disappeared object, its signal should exist in most of the slower pseudospectrums. Suppose that we have a set of moving objects $\Lambda_1 \ldots \Lambda_{s_1}$ and a set of new or disappeared objects $\Delta_1 \ldots \Delta_{s_2}$ on the previous frame, and a set of moving image elements $\omega_1 \ldots \omega_{m_1}$ and a set of elements that belong to new or disappeared objects $\omega_1 \ldots \omega_{m_2}$ on the current frame. All objects must then be associated with their new regions. The hypotheses for moving objects are formed case by case, as listed in the ALGORITHMIC SCHEME section.