A LOW-COST MARKERLESS TRACKING SYSTEM FOR TRAJECTORY INTERPRETATION

Abstract. The tracking abilities of first-generation Kinect sensors have been tested over common trajectories of folk dances. Trajectory-related errors, including offset, curve shape and noisy points, are investigated and mitigated using well-known signal processing filters. Low-cost depth trackers can contribute towards the remote tutoring of folk dances by providing adequate data to instructors and by showing trainees explicitly which segments of their dance trajectories need more work.


INTRODUCTION
According to UNESCO's 2003 'Convention for the Safeguarding of the Intangible Cultural Heritage', Intangible Cultural Heritage (ICH) is the mainspring of cultural diversity and a guarantee of sustainable development. The Convention proposes five broad 'domains' in which intangible cultural heritage is manifested. One important domain is that of the performing arts, which includes traditional music, theatre and dance. The availability of a digitized technological framework is a critical aspect of preserving intangible cultural heritage content.
Although ICH content, especially traditional folklore performing arts, is commonly deemed worthy of preservation by UNESCO and by the EU Treaty, most current research efforts focus on tangible cultural assets. The primary difficulty stems from the complex structure of ICH, its dynamic nature, the interaction among the objects and the environment, as well as the emotional elements, i.e. the way of expression and the dancers' style. For this reason, the European Union recently approved a research project, namely TERPSICHORE, with the main purpose of researching, implementing and testing an innovative framework for the digitization, 3D modelling and archiving of choreographic performing arts ("Terpsichore: Transforming Intangible Folkloric Performing Arts into Tangible Choreographic Digital Objects," 2017).
Currently, simple audiovisual (AV) recordings are used for digitizing folklore performances. However, such digitization technology offers no possibility of extracting the important symbolic characteristics that represent human creativity and the respective geometry. Therefore, it is difficult to preserve the style of a dance, the way of expression and the human feelings. Recent advances in hardware engineering have boosted stereoscopic digitization technologies that can capture stereo video data in real time. Again, these methods fail to capture the complete structure and geometry of a folklore performance.

RELATED WORK
The National Science Foundation of the USA supports a programme for developing a tele-immersive architecture (Nahrstedt et al., 2007) for capturing the intangible attributes of dances. The purpose of this work is to design a symbiotic creativity framework for choreography based on Laban Movement Analysis (LMA) (Guest, 2014). However, the main research objective was the creation of a collaborative virtual environment instead of modelling, preserving and enriching human creativity in the framework of intangible cultural folklore performing arts. A 3D archiving system for traditional performance arts has been presented in (Hisatomi et al., 2011), focusing on Japanese traditional performing arts. The system generates sequences of 3D actor models of the performances from multi-view video by using a graph-cut algorithm. However, the work mainly focuses on the 3D digitization of folklore performance arts instead of transforming the captured visual signals into a set of symbolic representations.
One of the first approaches for extracting symbolic information from a dance performance, i.e., transforming the dance into Laban movement attributes, is presented in (Smigel et al., 2006). However, this method is based on manual annotation, making the whole process arduous. In the same context, the Labanwriter graphical user interface has been developed in (Wilke et al., 1932). To address the limitations of manual annotation, the work of (Hachimura and Nakamura, 2001) introduces an automatic generation of Laban notation, exploiting motion data properties, while the work of (Chen et al., 2005) proposes a scoring system using a marker-based motion capturing architecture.
Recently, the work of (Chen et al., 2013) generates automatic Labanotation using hierarchical data presentations from the motion attributes of dances. A computer-aided tool for automatically generating Labanotation scores by analyzing body motions has been proposed in (Choensawat et al., 2015). The main limitation of the aforementioned approaches is that they usually rely on a marker-based motion capturing system, which is an expensive hardware sensing interface. Furthermore, such systems require an expert installation workflow, making their operation, calibration and setup a difficult and costly task.

PROPOSED METHODOLOGY
In this paper, low-cost sensors based on the Microsoft Kinect device are considered. The idea is that a sensor that is easy to install and use can provide adequate tracking abilities, allowing its use in remote sessions of folk dance lessons. The apprentice, without leaving home, will record his/her movements, providing adequate information to a distant instructor or to appropriate software for comments, suggestions and advice. However, possible sensor limitations are related to inaccuracies in the coordinates of the captured 3D data, especially at short and long range distances (Yang et al., 2015). To address these difficulties, we introduce a methodology which exploits the spatiotemporal coherency of human movement in order to compensate for the depth inaccuracies of the Kinect sensor. To achieve this, we exploit innovative methodologies from photogrammetry and computer vision.
Initially, the 3D information extracted from the skeletal tracking (Shotton et al., 2011) is projected onto the 2D surface of the dancer's movement. Since these 2D surface points are noisy, due to the depth inaccuracies of Kinect, we compensate their coordinates assuming a smooth movement trajectory of the dancer. First, a low-pass filter is applied to the projected 2D surface points with the aim of minimizing their spatiotemporal variations. In this way, the algorithm compensates the coordinates of a point so that: i) the captured 3D information from the Kinect sensor is trusted as much as possible, while simultaneously ii) the variations among consecutive points are minimized. Error performance scores are obtained by comparing the coordinates estimated by the Kinect, as projected onto the 2D surface, with ground truth data. The steps of the adopted methodology are shown in fig. 2.
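The projection of the tracked joint onto the floor can be illustrated with a minimal Python sketch. It assumes a level sensor whose skeleton coordinates are (x, y, z) in metres, with y vertical and z the depth axis, so that dropping y yields the floor-plane track. The function name `project_to_floor` and the sample values are hypothetical and serve only as an illustration; they are not measured data.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in metres, sensor coordinates
Point2D = Tuple[float, float]         # (x, z) on the floor plane


def project_to_floor(joints: List[Point3D]) -> List[Point2D]:
    """Drop the vertical coordinate (assumed to be y) of every sample."""
    return [(x, z) for (x, _y, z) in joints]


# Hypothetical hip-joint samples, not taken from the experiments.
hip = [(0.10, 0.02, 2.00), (0.12, 0.01, 2.05), (0.15, 0.03, 2.10)]
floor_track = project_to_floor(hip)
print(floor_track)
```

If the sensor were tilted, a rotation into a floor-aligned frame would be needed before dropping the vertical axis; that case is not covered by this sketch.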

Kinect sensor
The Kinect sensor is a markerless motion tracking architecture capable of extracting human motion attributes under real-time constraints. It also provides skeletal tracking information, modelling the human joints as 3D data representations (Zhang, 2012), as illustrated in fig. 1. The device features an "RGB camera, depth sensor and multi-array microphone running proprietary software", which provide full-body 3D motion capture, facial recognition and voice recognition capabilities.
The depth sensor consists of an infrared laser projector combined with a monochrome CMOS sensor, which captures video data in 3D under any ambient light conditions. The sensing range of the depth sensor is adjustable, and Kinect software is capable of automatically calibrating the sensor based on gameplay and the player's physical environment, accommodating the presence of furniture or other obstacles.

Joint's analysis related limitations
We were interested in tracking the trajectory of the hip joint. However, in order to follow the floor trajectories, a person has to change body posture, i.e. bend a little. Minor course deviations were also expected due to variations in movement speed.
The possibility of missteps should also be considered. The dancer could slightly deviate from the course and instantly correct the position. Such cases result in non-smooth areas in the trajectories calculated by the sensor.

Trajectories smoothing
In order to smooth trajectory peaks, well-known techniques from the signal processing field are adopted (Orfanidis, 1995). The finite impulse response (FIR) filter is defined as:

$$y[n] = \sum_{i=0}^{N} b_i \, x[n-i]$$

where $x[n]$ are the initial trajectory values, $y[n]$ is the smooth value for the $n$-th point, $N$ is the filter order and $b_i$ is a weight factor.

Figure 2. Trajectory assessment, using Kinect sensor, adopted steps: define actual trajectories; perform an act over the tracks; map on 2D space; noise removal; find the corresponding trajectory; similarity scores.

Another possible smoothing operator is the infinite impulse response (IIR) filter, defined as:

$$y[n] = \frac{1}{a_0}\left(\sum_{i=0}^{P} b_i \, x[n-i] - \sum_{j=1}^{Q} a_j \, y[n-j]\right)$$

The IIR filter is a combination of feed-forward and feedback filters, where $P$, $Q$ are the corresponding orders, and $b_i$, $a_j$ the corresponding coefficients.
Finally, the Savitzky-Golay (SG) smoothing filter, a polynomial smoothing filter, has been applied. The filter is a generalization of the FIR average filter that better preserves the high-frequency content of the desired signal, at the expense of removing less noise than the average. An illustration of the filter outputs, for trajectory no. 4, is shown in fig. 4.
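The three smoothing operators can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the filter orders, the window length and the weights used in the experiments are not given in the text, so the sketch assumes a causal FIR filter with user-supplied weights, a first-order IIR filter with a hypothetical smoothing factor `alpha`, and a 5-point quadratic SG filter with the standard coefficients (-3, 12, 17, 12, -3)/35. All function names and sample values are hypothetical.

```python
from typing import List, Sequence


def fir_filter(x: Sequence[float], b: Sequence[float]) -> List[float]:
    """Causal FIR: y[n] = sum_i b[i] * x[n - i].

    Samples before the start of the signal are taken equal to x[0]
    (an assumption made here to avoid a start-up transient).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):
            acc += bi * x[max(n - i, 0)]
        y.append(acc)
    return y


def iir_first_order(x: Sequence[float], alpha: float) -> List[float]:
    """First-order IIR: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(alpha * x[n] + (1.0 - alpha) * y[-1])
    return y


def savgol_5pt_quadratic(x: Sequence[float]) -> List[float]:
    """Centred 5-point quadratic Savitzky-Golay smoothing.

    The two samples at each end are left unchanged for simplicity.
    """
    c = (-3.0, 12.0, 17.0, 12.0, -3.0)
    y = list(x)
    for n in range(2, len(x) - 2):
        y[n] = sum(ck * x[n - 2 + k] for k, ck in enumerate(c)) / 35.0
    return y


# Hypothetical noisy coordinate sequence with one misstep-like spike.
noisy = [1.00, 1.02, 1.01, 1.30, 1.03, 1.04, 1.05]
print(fir_filter(noisy, [1 / 3] * 3))  # 3-tap moving average
print(iir_first_order(noisy, 0.4))
print(savgol_5pt_quadratic(noisy))
```

The SG variant is itself an FIR filter whose weights come from a local polynomial fit, which is why it reproduces a quadratic segment exactly while the plain average flattens it.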

EXPERIMENTAL SETUP
A Kinect sensor has been utilized for motion capturing along predefined trajectories. The sensor was positioned at the edge of a flat surface, 0.84 m above the floor. Although the placement position was marked, minor displacements (a few mm) are considered. The entire analysis was done on an ordinary PC, using MATLAB software.
There were no special hardware requirements, apart from the Kinect itself. The actual trajectories are given as inputs during the initialization of the system, using a few points: fewer than five for linear segments (in order to account for measurement noise) and the radius for circular segments.
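A minimal sketch of how the actual trajectories could be defined from such inputs is given below. It assumes that a linear segment is obtained by a least-squares fit through the few marked points and that a circular segment is centred at the origin of the floor plane; both choices, the function names `fit_line` and `arc_y`, and the sample points are assumptions for illustration. Only the 1.79 m radius comes from the setup described in this paper.

```python
import math
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares line y = slope * x + intercept through a few points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        raise ValueError("points share one x value; the line is vertical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def arc_y(x: float, radius: float) -> float:
    """y value on a circle of the given radius centred at the origin."""
    if abs(x) > radius:
        raise ValueError("x lies outside the circle")
    return math.sqrt(radius * radius - x * x)


# Hypothetical marked points of one linear segment (metres).
slope, intercept = fit_line([(0.0, 2.00), (0.5, 2.26), (1.0, 2.49)])
print(slope, intercept)
print(arc_y(0.5, 1.79))  # circular segment with the 1.79 m radius
```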

The monitoring area
The monitoring area was a flat surface, i.e. the floor of a room. All trajectories were designed in accordance with the capture area capabilities of the sensor (fig. 3). The nominal limits of the default range, as given by Microsoft, are 0.8 to 4.0 meters, with a suggested practical range of 1.2 to 3.5 meters. Another important requirement was that the entire body of the person should appear on the generated depth maps while following the designated trajectories.

Trajectories definition
Two types of trajectories were marked on the floor using colored tape: straight lines and curves. The lines were placed so as to form a grid with three almost horizontal and diagonal segments.
Additionally, we established a circular segment at a radius of 1.79 m from the projection of the Kinect mounting point on the floor. A brief description of the designed trajectories is provided in table 1.

Dancer actions
A person wearing tight-fitting clothing, in order to avoid miscalculations of the joints' positions, was asked to follow the trajectories on the floor by stepping on them. A total of six tracks were planned: one along each of the five line segments and one along the circular arc. The person's movement speed varied slightly. An illustration of the test implementation is shown in fig. 5.

Data processing
At first, raw data from one of the 20 recorded body joints (i.e. the hip joint) are extracted. Data are mapped to the floor level in order to generate a 2D trajectory. Prior to any kind of curve analysis or comparison, various issues related to tracking capabilities had to be dealt with.
Trajectory redundant points trimming. We check whether the starting (end) point of the captured trajectory coincides with the actual starting (end) point of the designed trajectory. The Kinect sensor provided a wider range of values, outside the designed trajectories. As such, we had to identify corresponding points among trajectories.
In order to interpolate corresponding points between the Kinect output and the actual trajectories, we had to create an explicit solution for each of the designed tracks. Then, given the $x$-axis values from the sensor, we could calculate the corresponding $y$ values of the actual trajectories, making the tracks comparable. All $(x_i, \hat{y}_i)$ pairs (Kinect output) that were outside the actual trajectory limits $[x_{\min}, x_{\max}]$ were discarded.
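The trimming and matching step can be sketched as follows. The sketch assumes that each designed track is available as an explicit function of x (here a hypothetical line y = 0.5x + 1) and that the Kinect output is a list of (x, y) pairs on the floor plane. The function name `match_points`, the limits and the sample values are illustrative assumptions, not data from the experiments.

```python
from typing import Callable, List, Tuple


def match_points(
    kinect: List[Tuple[float, float]],
    actual_fn: Callable[[float], float],
    x_min: float,
    x_max: float,
) -> List[Tuple[float, float, float]]:
    """Return (x, y_actual, y_kinect) for samples inside [x_min, x_max].

    Samples outside the limits of the designed track are discarded.
    """
    matched = []
    for x, y_hat in kinect:
        if x_min <= x <= x_max:
            matched.append((x, actual_fn(x), y_hat))
    return matched


def track(x: float) -> float:
    """Hypothetical explicit solution of one linear track."""
    return 0.5 * x + 1.0


# Hypothetical Kinect samples; the first and last fall outside the track.
samples = [(-0.2, 0.95), (0.0, 1.02), (1.0, 1.47), (2.4, 2.30)]
pairs = match_points(samples, track, x_min=0.0, x_max=2.0)
print(pairs)
```

After this step every retained Kinect sample has an actual counterpart at the same x value, so point-wise error metrics can be computed directly.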

Performance metrics
Kinect's captured trajectories were assessed in terms of both curve similarity and the distance between corresponding points. Generally, we can describe the shape of a function through its moments. A moment is a specific quantitative measure of the shape of a set of points. The $k$-th moment is calculated as:

$$m_k = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^k$$

where $N$ is the number of trajectory points over the $x$-axis and $\bar{y}$ is the mean of the $y_i$ values.
Central moments, computed in terms of deviations from the mean instead of from zero, are used in preference to ordinary moments because the higher-order central moments relate only to the spread and shape of the distribution, rather than also to its location. In our case, moments up to the sixth order were calculated.
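A short sketch of the central moment computation is given below, assuming the moments are normalized by the number of points so that orders 0 and 1 equal 1 and 0 respectively. The function names and the sample values are hypothetical.

```python
from typing import List, Sequence


def central_moment(values: Sequence[float], k: int) -> float:
    """k-th central moment: mean of (v - mean(values)) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def moment_differences(
    actual: Sequence[float], kinect: Sequence[float], max_order: int = 6
) -> List[float]:
    """Differences of central moments for orders 0 .. max_order."""
    return [
        central_moment(actual, k) - central_moment(kinect, k)
        for k in range(max_order + 1)
    ]


# Hypothetical y values of an actual track and its Kinect estimate.
actual_y = [1.00, 1.50, 2.00, 2.50]
kinect_y = [1.02, 1.47, 2.05, 2.49]
print(moment_differences(actual_y, kinect_y))
```

Orders 0 and 1 always produce a zero difference under this definition, so only orders 2 to 6 carry information about the shape of the trajectory.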
Figure 5. One of the main advantages of the system is its applicability at home, requiring minimal effort. The trajectories are marked using tape (image left). The person's movement is recorded and projected to the ground, in order to perform the trajectory assessment and comparison process.

Prior to the moments calculation, traditional error metrics were employed to illustrate the differences between actual and Kinect curves.
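The mean squared error (MSE) is defined as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ are the actual trajectory values, $\hat{y}_i$ the corresponding Kinect values and $N$ the number of point pairs. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better. The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator and its bias.
The mean absolute error (MAE) is defined as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

MAE is known as a scale-dependent accuracy measure and therefore cannot be used to make comparisons between series using different scales.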
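Both error metrics can be computed over matched point pairs with a few lines of Python. The sketch assumes two equally long sequences, one with the actual trajectory values and one with the Kinect values at the same x positions; the function names and the numbers are hypothetical.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Average of the squared point-wise differences."""
    if len(actual) != len(estimated):
        raise ValueError("sequences must have the same length")
    return sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)


def mean_absolute_error(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Average of the absolute point-wise differences."""
    if len(actual) != len(estimated):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


# Hypothetical matched values (metres), not results from the experiments.
actual_y = [1.00, 1.50, 2.00, 2.50]
kinect_y = [1.02, 1.47, 2.05, 2.49]
print(mean_squared_error(actual_y, kinect_y))
print(mean_absolute_error(actual_y, kinect_y))
```

Because the MSE squares each difference, a single misstep weighs more heavily in it than in the MAE, which is one reason to report both.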
The differences among corresponding points between the projections of the actual and the Kinect trajectories on 2D planes have been evaluated using central moments. Table 4 presents the differences between the moments of the original trajectories and the corresponding Kinect ones. The results table also includes the moments after the application of the described filters (see sec. 3.3).
Central moments of order 0 and 1 offered no additional information, i.e. their values were 1 and 0 respectively. Thus, the difference with the actual moments was 0. The rest of the moments, regardless of the applied filter, show minor differences. Consequently, the application of signal filtering should be considered in more complex trajectories, where missteps are more likely to occur.

CONCLUSIONS
The applicability of low-cost depth sensors for the evaluation of movement patterns at home has been investigated. Movement patterns are projected on a 2D plane and evaluated against predefined trajectories. Analysis of the trajectories provides significant data that can be utilized in many ways. Future work shall involve the transition from trajectories to Laban notation, in order to support remote tutoring and preservation of folk dances.

Figure 1. Kinect vertical Field of View in default range (left) and the corresponding tracked body joints (right).

Table 2. Comparison between actual and Kinect generated trajectories, illustration for four point pairs and the corresponding mean squared and absolute errors.