INTEGRATION OF PRIOR KNOWLEDGE INTO DENSE IMAGE MATCHING FOR VIDEO SURVEILLANCE

Three-dimensional information from dense image matching is a valuable input for a broad range of vision applications. While reliable approaches exist for dedicated stereo setups, they do not easily generalize to more challenging camera configurations. In the context of video surveillance, the typically large spatial extent of the region of interest and repetitive structures in the scene render the application of dense image matching a challenging task. In this paper we present an approach that derives strong prior knowledge from a planar approximation of the scene. This information is integrated into a graph-cut based image matching framework that treats the assignment of optimal disparity values as a labelling task. Introducing the planar prior heavily reduces ambiguities together with the search space and increases computational efficiency. The results provide a proof of concept of the proposed approach. It allows the reconstruction of dense point clouds in more general surveillance camera setups with wider stereo baselines.


INTRODUCTION
The automated analysis of surveillance videos is an important tool to support human operators. Given the enormous amount of data collected by omnipresent surveillance cameras, manual online inspection of the images is impossible and even manual reconfiguration of pan-tilt-zoom cameras (PTZ-cameras) is very challenging. Often, human operators cannot prevent camera networks from recording empty scenes while missing crucial events, a fact that obviously limits the efficiency of such systems. Prefiltering of interesting scenes, automatic reconfiguration of cameras to focus on relevant content and the extraction of geometric information about people, actions and scene layout can help to make the task of manual inspection of video surveillance footage more tractable. Spatial information from stereoscopic analysis is very valuable in this context. However, low image resolution and wide baselines render the application of dense image matching for surveillance videos a challenging task. In common surveillance scenarios, repetitive patterns and a significant spatial extent of the scene of interest demonstrate the limits of most general image matching approaches.
In this paper, we show how to derive strong prior knowledge for image matching assuming a realistic video surveillance setup and we describe the integration of this prior knowledge into a graph-cut based image matching framework. The method is proposed as a building-block for more comprehensive surveillance applications. While it tackles typical challenges to dense image matching and aims at reliable depth estimation, it does not address further tasks like people detection and tracking.
A brief overview of related work and the details of our method are given in section 2. Section 3 describes our experimental evaluation of the approach. The results provide a proof of concept and show the improved quality of the derived point clouds.

Related work
Stereo image matching on dedicated short-baseline image pairs has been a major topic of research throughout the last decades. The taxonomy by Scharstein et al. (2002) gives an insight into common dense matching methods. Among the best performing approaches are those enforcing global smoothness assumptions. Graph-cut based optimization strategies (Boykov et al. 2001) are employed for efficient inference of optimal disparities and are widely applied to diverse optimization tasks in photogrammetry and computer vision.
Because in related work on video surveillance stereoscopic image matching is often regarded as the central component of the respective systems, the sensors or sensor networks used are mostly designed to fulfil the specific requirements of stereo approaches. The pairwise installation of PTZ cameras (Zhou et al. 2010) provides image pairs with short baselines. Dedicated stereo devices, as used in Darrel et al. (2000), Haritaoglu et al. (1998) and many other publications, capture synchronised image pairs that are processed on specialised hardware. Although the advantages of high-frequency depth maps for people detection and tracking have been shown (Schindler et al. 2010), a dedicated system design leads to additional costs that, from our point of view, are not necessary when applying stereoscopic analysis to camera networks. In contrast, we propose to integrate strong prior knowledge into the process of dense image matching so that it becomes applicable for more general camera setups with wider stereo baselines.

Setup
The goal of the presented work is to provide a dense image matching approach that can be applied in realistic surveillance camera networks without dedicated stereo sensors. Camera calibration and absolute orientation are assumed to be known so that for camera pairs with sufficiently overlapping fields of view the images can be transformed to the normal case of stereophotogrammetry. In addition, the exterior orientation is assumed to refer to an object coordinate system that depends on a predominantly horizontal ground plane so that strong prior knowledge can be derived for objects moving on this plane.

Derivation of prior knowledge
To extract moving objects we make use of a common pre-processing step and remove static background from the image sequences by subtracting an adaptive background model (Kaewtrakulpong et al. 2001). The resulting foreground blobs are enhanced by morphologic closing. They are used to instantiate planes that give a first approximation of the desired result in object space. To exclude small blobs induced by noise an area-threshold is applied. The handling of merging situations between multiple foreground objects is outside the scope of this paper. Certainly, depth cues from image matching would be valuable for tackling this kind of challenge.
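The area-threshold on foreground blobs can be illustrated with a minimal sketch. It assumes a binary foreground mask is already available from background subtraction; the function name `filter_blobs`, the 4-connected flood fill and the toy mask are illustrative assumptions and not part of the described implementation.

```python
from collections import deque


def filter_blobs(mask, min_area):
    """Return 4-connected foreground blobs with at least min_area pixels.

    mask is a list of rows holding 0 (background) or 1 (foreground).
    Each blob is returned as a list of (row, col) tuples.
    The 4-connectivity is an assumption made for this sketch.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected blob.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Blobs below the area threshold are treated as noise and dropped.
            if len(blob) >= min_area:
                blobs.append(blob)
    return blobs


# Toy mask: one blob of four pixels and one isolated noise pixel.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
kept = filter_blobs(mask, min_area=3)
```

With `min_area=3` only the four-pixel blob survives, which mirrors the person in the background of figure 1 for whom no plane is instantiated.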
In the context of video surveillance applications it is reasonable to assume that objects of interest predominantly extend perpendicular to the ground. Thus, we represent prior knowledge for the image matching approach as upright planes with 3D normals pointing parallel to the ground and perpendicular to the cameras' x axis. Given the exterior and interior camera orientation a first estimation of the position of the object on the ground plane can be derived from monoplotting, i.e. intersection of the viewing ray through the bottom-most point of the foreground blob with a plane in object space. Our straight-forward implementation directly uses the ground plane for this purpose. Shadows and occlusions may cause errors in this simple reconstruction but the resulting planes still give reasonable priors on the 3D positions of the observed objects. To support image matching these planes can be projected to disparity space, which is the discretization of the field of view in image coordinates and disparity. In surveillance applications the cameras are usually mounted above the volume of interest and tilted towards the ground so that upright planes in the object coordinate frame project to slanted planes in disparity space. Figure 1 depicts a typical input image rectified to epipolar geometry (on the left) and colour information from the image projected to the approximate 3D planes. Note that there is no plane instantiated for one person in the background since the size of the corresponding foreground blob is smaller than the respective threshold.
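The monoplotting step can be sketched as a ray-plane intersection. The sketch assumes that the viewing ray through the bottom-most blob pixel has already been expressed in the object coordinate system and that the ground plane is Z = 0; the function name `monoplot_ground` and the example ray are hypothetical, while the camera height of 4.5 m is taken from the experimental setup.

```python
def monoplot_ground(camera_centre, ray_direction, ground_z=0.0):
    """Intersect a viewing ray with the horizontal plane Z = ground_z.

    camera_centre and ray_direction are (X, Y, Z) tuples in object
    coordinates. Returns the 3D foot point of the object on the ground.
    """
    cx, cy, cz = camera_centre
    dx, dy, dz = ray_direction
    if dz == 0:
        raise ValueError("viewing ray is parallel to the ground plane")
    # Ray parameter at which the Z coordinate reaches the ground plane.
    t = (ground_z - cz) / dz
    if t <= 0:
        raise ValueError("ground plane lies behind the camera")
    return (cx + t * dx, cy + t * dy, ground_z)


# Camera 4.5 m above the ground, ray tilted downwards (hypothetical direction).
foot_point = monoplot_ground((0.0, 0.0, 4.5), (0.0, 1.0, -0.5))
```

The upright plane of the prior would then be anchored at `foot_point`; a shadow below the blob shifts the bottom-most pixel and therefore the foot point, which is the error source mentioned in the text.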

Disparity estimation
In order to derive a detailed disparity map of the observed object given the approximate plane, the disparity offset with respect to this plane has to be computed for each pixel of the respective foreground blob. This task can be formulated as a multi-class labelling problem.
A row-wise disparity seed point is directly specified by the plane in disparity space. The set of labels represents discrete disparity offsets in the range of feasible deviations from the planar prior. Admissible offsets are individually computed for each foreground object. Since they are defined in disparity space they depend on the absolute viewing distance. We found that for our application a metric search range of 1.5 m around the planar prior yields good results. It can be computed from the initial plane localization and transformed to disparity space at runtime. Note that this range is much smaller than the admissible range of disparities for the complete scene, which leads to a massive reduction of ambiguities and a decreased computational burden. This also reduces the risk of the optimization getting stuck in local minima. Furthermore, the slope of the plane in disparity space would lead to varying search intervals along the vertical extent of the foreground blob when working with absolute disparity values. The use of disparity offsets with respect to the seed point circumvents this issue.
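How a metric search range translates into a distance-dependent set of offset labels can be sketched with the relation d = f B / Z of the stereo normal case. The focal length in pixels is a hypothetical value, and reading the 1.5 m range as a total width centred on the plane (plus or minus 0.75 m) is an assumption of this sketch; the stereo base of 4 m is taken from the experiments.

```python
import math


def disparity(focal_px, baseline_m, depth_m):
    """Disparity in pixels for the stereo normal case: d = f * B / Z."""
    return focal_px * baseline_m / depth_m


def offset_labels(focal_px, baseline_m, depth_m, range_m=1.5):
    """Integer disparity offsets around the seed disparity of the plane.

    range_m is interpreted as the total metric width of the search
    interval, centred on the plane (an assumption of this sketch).
    """
    half = range_m / 2.0
    if depth_m <= half:
        raise ValueError("plane is too close for the requested range")
    d_seed = disparity(focal_px, baseline_m, depth_m)
    d_near = disparity(focal_px, baseline_m, depth_m - half)
    d_far = disparity(focal_px, baseline_m, depth_m + half)
    lowest = math.floor(d_far - d_seed)   # negative offset, behind the plane
    highest = math.ceil(d_near - d_seed)  # positive offset, in front of it
    return list(range(lowest, highest + 1))


FOCAL_PX = 1000.0   # hypothetical focal length in pixels
BASELINE_M = 4.0    # stereo base from the experimental setup

labels_near = offset_labels(FOCAL_PX, BASELINE_M, depth_m=10.0)
labels_far = offset_labels(FOCAL_PX, BASELINE_M, depth_m=20.0)
```

Under these assumptions the label set for the closer object is considerably larger than for the distant one, which illustrates why admissible offsets are computed individually per foreground object.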
The task of finding optimal disparity offsets now corresponds to finding an optimal labelling of all foreground pixels. The relative quality of a labelling $L$ is commonly evaluated by an appropriate energy functional of the form

$$E(L) = E_D(L) + w_S \, E_S(L) \qquad (1)$$

where $E$ = energy induced by a labelling $L$
$E_D$, $E_S$ = data, smoothness term

Equation 1 represents a Markov Random Field evaluating the labelling $L$ by a weighted sum of a data term $E_D$ and a smoothness term $E_S$ with $w_S$ controlling the influence of the smoothness term. The data term measures the dissimilarity of pixels in the left and right stereo frame that are associated by the currently assigned label or the corresponding absolute disparity. In our implementation the data term (2) is computed as the Hamming distance between the local binary pattern $c$ (Zabih et al. 1994) around pixel $p$ in the left image and the same descriptor from the right image at a horizontal offset induced by the label $l$ of pixel $p$. By truncating the data cost term at $\tau_D$ it becomes robust against outliers. The choice of labels assigned to such pixels is dominated by the smoothness term (3).
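One plausible reading of the truncated data term is sketched below. The 5 x 5 comparison window, the border clamping, the sign convention x_r = x - d and all function names are assumptions of this sketch; only the Hamming distance between local binary descriptors and the truncation threshold of 12 are taken from the text.

```python
def census(image, y, x, radius=2):
    """Binary descriptor: one bit per neighbour that is darker than the centre."""
    rows, cols = len(image), len(image[0])
    centre = image[y][x]
    bits = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # Clamp coordinates so border pixels reuse the nearest valid value.
            ny = min(max(y + dy, 0), rows - 1)
            nx = min(max(x + dx, 0), cols - 1)
            bits = (bits << 1) | (1 if image[ny][nx] < centre else 0)
    return bits


def data_cost(left, right, y, x, disp, tau_d=12, radius=2):
    """Truncated Hamming distance between descriptors at absolute disparity disp."""
    xr = x - disp  # assumed sign convention for the rectified pair
    if not 0 <= xr < len(right[0]):
        return tau_d  # no valid correspondence: assign the truncated maximum
    difference = census(left, y, x, radius) ^ census(right, y, xr, radius)
    return min(bin(difference).count("1"), tau_d)


# Synthetic pair: the right image is the left image shifted by two pixels.
left = [[(7 * x + 13 * y + x * x) % 17 for x in range(12)] for y in range(7)]
right = [[left[y][min(x + 2, 11)] for x in range(12)] for y in range(7)]

cost_true = data_cost(left, right, 3, 6, disp=2)
cost_other = data_cost(left, right, 3, 6, disp=0)
```

For an offset label the absolute disparity passed to `data_cost` would be the row-wise seed disparity plus the offset.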
The smoothness term $E_S$ favours smooth transitions of disparity by penalizing the absolute difference of labels assigned to adjacent pixels. To allow for discontinuities, e.g. around the limbs, we use a robust energy function that evaluates the truncated absolute difference of the discrete labels in a 4-connected neighbourhood.
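The truncated smoothness term and the weighted sum of equation (1) can be evaluated for a given labelling as in the following sketch. Background pixels are marked with `None`; the function names and the toy values are hypothetical, and using the weight of 3 as a multiplier of the smoothness term is an assumption about how the relative weight enters the energy.

```python
def smoothness_energy(labels, tau_s):
    """Sum of truncated absolute label differences over 4-connected pairs.

    labels is a 2D list holding an integer offset label per foreground
    pixel and None for background pixels.
    """
    rows, cols = len(labels), len(labels[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            if labels[y][x] is None:
                continue
            # Right and lower neighbours only, so each pair is counted once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < rows and nx < cols and labels[ny][nx] is not None:
                    total += min(abs(labels[y][x] - labels[ny][nx]), tau_s)
    return total


def total_energy(data_costs, labels, w_s, tau_s):
    """Data term plus weighted smoothness term for one labelling."""
    e_data = sum(
        data_costs[y][x]
        for y in range(len(labels))
        for x in range(len(labels[0]))
        if labels[y][x] is not None
    )
    return e_data + w_s * smoothness_energy(labels, tau_s)


labels = [[0, 0, 5], [0, 1, None]]          # toy labelling
data_costs = [[1, 2, 3], [0, 4, 9]]         # toy per-pixel data costs
e_smooth = smoothness_energy(labels, tau_s=2)
e_total = total_energy(data_costs, labels, w_s=3, tau_s=2)
```

The jump from label 0 to label 5 contributes only the truncated value 2, so a genuine depth discontinuity is not penalized more heavily than a moderate one.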
Like the search range, the smoothness term is adjusted to the viewing distance. Obviously, discretization of the disparity space leads to significantly different ranges of plausible smooth transitions on object surfaces in different distances. In equation (3) this results in distance-dependent values of $\tau_S$. We compute the truncation threshold from a quadratic function of the viewing distance.
Finding a globally optimum solution to such a multi-class labelling problem is in general NP-hard because of the complexity of the solution space. For appropriately defined energy terms graph-cut based approaches (Boykov et al. 2001) can be used to find approximate solutions. Starting from an initial labelling individual labels are iteratively expanded so that the total energy of equation (1) successively decreases.

RESULTS
The experiments are conducted on videos we collected for the joint research project CamInSens (CamInSens, 2013). The cameras are mounted 4.5 m above the ground plane with a stereo base of 4 m. Defining a region of interest of 10 m x 10 m in the scene the total range of plausible disparities for this setup spans approximately 250 pixels. Given the wide baseline and ambiguous image content standard matching algorithms fail to produce reliable dense results on these stereo pairs.
Because there is no densely labelled reference data we present qualitative results in figure 2 and validate the results on sparse, manually annotated control points on people in the scene. To this end, we annotated 10 equally spaced frames of our test sequence. Although a more extensive evaluation would be helpful, this validation ensures that there are no systematic errors in the approach and provides a proof of concept.
For the experiments we set the truncation threshold of the data term $\tau_D$ to 12; the truncation threshold for the smoothness term is individually set at runtime to enforce smooth disparity transitions corresponding to a 3D range of 0.8 m. The relative weight of data and smoothness term is set to 3. The energy given in equation (1) is minimized using graph-cuts and α-expansion (Boykov et al. 2001).
Table 1 gives the results of the quantitative evaluation at sparse control points. Resulting disparities are compared to offsets from manual annotation. To evaluate the results an inlier-threshold of 3 pixels is applied to the absolute difference of the disparity values. While the planar prior alone fails to predict accurate disparities, a standard winner-takes-all (WTA) evaluation of the data term increases performance but still gives wrong results for more than half of the control points. The complete approach gives correct results for more than 75 % of the points, providing a proof of concept and indicating remaining challenges. Figure 3 depicts one of the annotated frames with green crosses marking successfully matched control points and red crosses indicating outliers. The reconstructions on the right hand side of figure 3 show that outliers occur on the endpoints of limbs which are hard to match correctly due to the limited resolution of the input images and significant changes in perspective for the rightmost person.

Figure 2. Planar prior (left), result of WTA (centre) and result of the proposed approach (right).
Figure 2 depicts an exemplary result of the optimization. The disparity maps are projected to point clouds in object space. The left part shows the planar prior, the centre depicts results of a simple winner-takes-all evaluation of the data term illustrating the challenging task. On the right the improved matching results are depicted yielding a consistent surface and a correct reconstruction of the limbs. Such results can directly be used for robust localization and estimation of body height and provide input to automated scene understanding.
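The control-point evaluation behind Table 1 amounts to thresholding absolute disparity differences. A minimal sketch follows; the function name, the inclusive comparison at exactly 3 pixels and the sample values are assumptions made for illustration and are not data from the experiments.

```python
def inlier_percentage(estimated, annotated, threshold=3.0):
    """Percentage of control points whose disparity error is within threshold.

    estimated and annotated are equally long sequences of disparities
    in pixels. A point counts as inlier if the absolute difference does
    not exceed the threshold (inclusive comparison is an assumption).
    """
    if len(estimated) != len(annotated) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    inliers = sum(
        1 for est, ref in zip(estimated, annotated) if abs(est - ref) <= threshold
    )
    return 100.0 * inliers / len(estimated)


# Hypothetical disparities at four control points.
estimated = [100.0, 52.5, 80.0, 10.0]
annotated = [101.0, 50.0, 90.0, 10.0]
inliers = inlier_percentage(estimated, annotated)
outliers = 100.0 - inliers
```

In this toy example three of four points fall within the threshold; the third point, with an error of 10 pixels, would be drawn as a red cross in a visualization like figure 3.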

CONCLUSIONS
We propose an approach addressing typical challenges to dense image matching in surveillance camera networks without dedicated stereo sensors. The integration of planar prior knowledge reduces ambiguities together with the search space and thus increases computational efficiency. While significantly improved results can be shown for isolated foreground blobs, merging situations between multiple objects are not yet resolved. Future work will implement a feedback loop between matching and tracking to address this issue with the help of depth information. In a further step, a more detailed object model will be integrated to couple 3D reconstruction and semantic interpretation.

ACKNOWLEDGEMENTS
This research was partially funded by the German Federal Ministry of Education and Research (BMBF), 13N10809 13N10814. The support is gratefully acknowledged.

              WTA     Graph-cut
Inliers [%]   46.5    76.1
Outliers [%]  53.5    23.9

Table 1. Percentage of disparity inliers and outliers at sparse control points.

Figure 1. Rectified input image (left) and visualization of planar prior (right).