FOLK DANCE PATTERN RECOGNITION OVER DEPTH IMAGES ACQUIRED VIA KINECT SENSOR

The possibility of accurately recognizing folk dance patterns is investigated in this paper. System inputs are raw skeleton data provided by a low-cost sensor. In particular, data were obtained by monitoring three professional dancers using a Kinect II sensor. A set of six traditional Greek dances (without their variations) constitutes the investigated data. A two-step process was adopted. First, the most descriptive skeleton data were selected using a combination of density-based and sparse modelling algorithms. Then, the representative data served as the training set for a variety of classifiers.


INTRODUCTION
The perpetual preservation of Intangible Cultural Heritage (ICH) is an intriguing domain that attracts both the scientific community and the general public. The main challenges stem from the complex structure of ICH, i.e. its dynamic nature, the interaction among the objects and the environment, and emotional elements such as the dancers' expressions and style.
Folk dances are important to ICH; they are directly connected to local culture and identity (Shay and Sellers-Young, 2016). Hence, the preservation of folk dances is a basic requirement, since the history and style of each folk dance will be readily available to the public through a system that includes descriptive information, videos, movement and 3D modelled data relevant to it.
Different songs of the same music genre are usually represented choreographically by the same set of signature moves and gestures. This holds especially true for traditional folk music and dances, which abide by a stronger sense of structure. This structure can be exploited in order to index and classify the key elements that compose each dance. A properly designed database of characteristic dance instances to compare against will allow a number of useful applications to emerge.
Dance recognition, together with semantic information such as genre, provenance, correlation and symbolism, all the way to difficulty level, tutorials and even pertinent advertising (museums, upcoming performances, dance studios and discography), could then be made available from a single snapshot or short video. The required technology is within reach, since the depth cameras and classification algorithms at the center of these applications are both available and sufficiently capable.
Arguably, the most widespread depth camera is the Microsoft (MS) Kinect (Microsoft, 2016); however, mobile variants of this technology, such as Project Tango by Google (Google, 2016), are already available on the market and can be employed.
The focus of this paper is the application of segmentation and classification algorithms to Kinect-captured depth images and videos of folkloric dances in order to identify key movements and gestures, compare them against database instances and determine the dance genres they represent, as well as to provide helpful metadata. The goal is then to identify the geometric structures of each dance and to classify the dances with respect to the common geometric features they share.

RELATED WORK
There are several examples in the literature of applications that exploit Kinect 3D data of human (dance) movements for recognition and classification purposes. In (Gianaria et al., 2014), gait analysis is performed using skeletal data provided by Microsoft Kinect sensors, and a set of physical and behavioural features is defined in order to identify the most relevant parameters for gait description. The aim of that work is gait characterization and people recognition using SVM classification.
The authors of (Raptis et al., 2011) describe a gesture classification system for skeletal wireframe motion. A classifier was designed and trained to recognize certain gestures, among several dozen, in real time and with high accuracy. In another work (Zanfir et al., 2013), a simple non-parametric Moving Pose framework is proposed for low-latency human action and activity recognition using a modified kNN classifier. Furthermore, a method to recognize individual persons from their walking gait, using 3D skeletal data from an MS Kinect device and the k-means algorithm, is described in (Ball et al., 2012).
Ever since the introduction of the first Kinect, depth cameras have grown in importance and are widely used as low-cost peripherals for several applications. The advantage of a depth camera is that it produces dense and reliable depth measurements, albeit over a limited range, and offers a balance between usability and cost. Kinect is the sensor of choice for such applications and will be employed to capture sets of dance moves and gestures in 3D space and in real time, resulting in a recorded sequence of points in 3D space for each joint at certain moments in time.
In (Kitsikidis et al., 2014b), a methodology is proposed for dance learning and evaluation using multi-sensor and 3D gaming technology. The learners are captured while dancing, and an avatar visualizes their motion using fused input from multiple sensors. Motion analysis and a two-level Fuzzy Inference System (FIS) are applied, using as input low-level skeletal data and high-level motion recognition probabilities, for the evaluation of the dancer's performance. In (Kitsikidis et al., 2015b), a 3D game environment for dance learning is presented, which is based on the fusion of data from multiple depth sensors in order to capture the body movements of the user/learner. In addition, the system automatically assesses the learner's performance by combining Dynamic Time Warping with a FIS, and provides feedback in the form of a score as well as instructions from a virtual tutor in order to promote self-learning.
In (Kitsikidis et al., 2014a), improved robustness of skeletal tracking is achieved by using sensor data fusion to combine skeletal tracking data from multiple sensors. The fused skeletal data are split into different body parts, which are then transformed to allow view-invariant posture recognition. For each part, a posture vocabulary is generated by performing k-means clustering on a large set of unlabelled postures. Finally, body part postures are combined into body posture sequences, and a Hidden Conditional Random Fields classifier is used to recognize motion patterns, e.g. dance figures. In (Kitsikidis et al., 2015a), a skeletal representation of the dancer is again obtained by using data from multiple depth sensors. Using this information, the dance sequence is partitioned first into periods and subsequently into patterns. Partitioning into periods is based on observing the horizontal displacement of the dancer, while each period is subsequently partitioned into patterns by training an exemplar-based Hidden Markov Model (HMM) that classifies frames to exemplars representing HMM states.
In (Dimitropoulos et al., 2016), human action recognition is treated as a special case of the general problem of classifying multidimensional time-evolving data in dynamic scenes. To detect correlations between channels, a generalized form of a stabilized higher-order linear dynamical system (sh-LDS) is used, and the multidimensional signal is represented as a third-order tensor. Each multidimensional signal is represented as a cloud of points on the Grassmann manifold, and a codebook is created by identifying the most representative points, to be used in classification by applying a bag-of-systems approach.

PROPOSED METHODOLOGY
Dance recognition via pattern identification is the scope of this paper. In particular, we try to identify Greek folk dances by matching recorded sequences to a database of characteristic dance instances, i.e. 3D position and rotation values plus time. The problem at hand therefore reduces to a conventional machine learning paradigm: create a set of appropriate features and train the classifiers.
A meaningful training set should contain data that span as much of the feature space as possible, without redundant information or almost identical entries. In our case, the raw data for each body part lie in $\mathbb{R}^{7 \times 1}$: the 3D position (three values) plus the corresponding rotation (four values), recorded for a set of consecutive frames.
First, the structure of the data is captured by creating clusters. Then, for each of the identified clusters, we find the smallest possible set that can accurately describe the remaining cluster data. Such a sparse modelling approach has multiple advantages: no redundant information, reduced storage space, faster preprocessing, etc. All the above steps can be done without supervision. Once the process is complete, the remaining data can serve either as a reference point for future comparison with other motion data, in order to identify the dance patterns, or as a training data set for a variety of classifiers.
The aforementioned method leads to dance clustering through the comparison of dance movement sets captured by a depth sensor against sets already recorded and stored in a relevant motion capture database. The high-level 3D representations of dance movements obtained from the low-cost consumer-level depth sensor are then fed to unsupervised clustering algorithms in order to produce meaningful instances for comparison purposes.

Sensors used
The Microsoft Kinect II is currently one of the most advanced motion sensing input devices available to the public. It is a physical device with depth sensing technology, a built-in color camera, an infrared (IR) emitter and a microphone array, which emits and captures infrared light to estimate depth information. Based on the depth map data, the human skeleton joints are located and tracked via the Microsoft Kinect II for Windows SDK ("Kinect - Windows app development," 2017). More specifically, the Microsoft Kinect II sensor can achieve real-time 3D skeleton tracking, while at the same time it is relatively cheap and easy to set up and use. The tracked skeleton consists of twenty-five joints, each of which includes its 3D position coordinates, its rotation and a tracking state property: "Tracked", "Inferred" or "UnTracked" (Webb and Ashley, 2012). Moreover, the sensor can work in dark and bright environments, and the capture frame rate is 30 fps. On the other hand, there are some limitations that should be taken into account: the sensor is designed to track the front side of the user, so the front and back sides of the user cannot be distinguished, and the movement area is limited (approximately 0.7-6 m).
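The per-joint record described above can be illustrated with a minimal Python sketch. The names `Joint`, `TrackingState` and `feature_vector`, as well as the sample values, are hypothetical and are not part of the Kinect SDK; only the content of the record (3D position, four rotation values, tracking state) follows the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TrackingState(Enum):
    # The three tracking states named in the text.
    TRACKED = "Tracked"
    INFERRED = "Inferred"
    UNTRACKED = "UnTracked"


@dataclass(frozen=True)
class Joint:
    # Hypothetical container for one skeleton joint in one frame.
    name: str
    position: Tuple[float, float, float]         # x, y, z
    rotation: Tuple[float, float, float, float]  # four rotation values
    state: TrackingState

    def feature_vector(self) -> Tuple[float, ...]:
        # Concatenate position and rotation into 7 raw values per joint.
        return self.position + self.rotation


# Illustrative values only, not measured data.
joint = Joint("Head", (0.1, 0.8, 2.3), (0.0, 0.0, 0.0, 1.0), TrackingState.TRACKED)
features = joint.feature_vector()
```

Keeping the tracking state next to the coordinates would make it straightforward to discard or down-weight inferred joints, although the paper does not state whether this was done.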

Sampling algorithms
The main purpose of data sampling is the selection of appropriate representative samples in order to provide a good training set and, thus, improve classification performance. The most important factor in data selection is the definition of the distance function. For any two given data points $x_i$ and $x_j$, with $x_i, x_j \in \mathbb{R}^n$, let $d(x_i, x_j)$ denote the distance between them. In order to compute the distance, let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix and the distance measure be defined as
$$d_A(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} A \, (x_i - x_j)}.$$
All the proposed approaches are Euclidean based (i.e. $A = I$).
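A matrix-weighted distance of this kind can be sketched in a few lines of Python. The sketch assumes the common quadratic form sqrt((x - y)^T A (x - y)); the function names `weighted_distance` and `identity` are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence


def identity(n: int) -> List[List[float]]:
    # n x n identity matrix, i.e. the Euclidean case.
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def weighted_distance(x: Sequence[float], y: Sequence[float],
                      A: Sequence[Sequence[float]]) -> float:
    # Assumed form: sqrt((x - y)^T A (x - y)) for a symmetric matrix A.
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    A_diff = [sum(A[r][c] * diff[c] for c in range(n)) for r in range(n)]
    return math.sqrt(sum(d * v for d, v in zip(diff, A_diff)))


# With A equal to the identity, the measure is the Euclidean distance.
d = weighted_distance([0.0, 0.0, 0.0], [3.0, 4.0, 0.0], identity(3))
```

With the identity matrix the function coincides with `math.dist`, which is the setting the text reports for all approaches.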

OPTICS algorithm
Ordering Points to Identify the Clustering Structure (OPTICS) is an algorithm for finding density-based clusters in spatial data (Ankerst et al., 1999); i.e. detect meaningful clusters in data of varying density. In order to do so, the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering.
OPTICS requires two parameters: the maximum distance (radius) to consider ($\varepsilon$), and the number of points required to form a cluster ($MinPts$). A point $p$ is a core point if at least $MinPts$ points are found within its $\varepsilon$-neighborhood, $N_{\varepsilon}(p)$. Once the initial clustering is formed, we may proceed with any sampling approach (e.g. random selection among clusters).
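The core point condition can be illustrated with a short Python sketch. It is not an implementation of OPTICS itself (no ordering or reachability distances are computed), and it assumes the usual convention that a point counts as a member of its own neighborhood. The function names and the toy points are hypothetical.

```python
import math
from typing import List, Sequence


def eps_neighborhood(points: Sequence[Sequence[float]], index: int,
                     eps: float) -> List[int]:
    # Indices of all points within distance eps of points[index]
    # (the point itself is included, by assumption).
    p = points[index]
    return [j for j, q in enumerate(points) if math.dist(p, q) <= eps]


def is_core_point(points: Sequence[Sequence[float]], index: int,
                  eps: float, min_pts: int) -> bool:
    # A point is a core point if its eps-neighborhood holds at least min_pts points.
    return len(eps_neighborhood(points, index, eps)) >= min_pts


# Toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense = is_core_point(points, 0, eps=0.5, min_pts=3)
isolated = is_core_point(points, 3, eps=0.5, min_pts=3)
```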

Sparse representative selection
In order to extract the most important, i.e. most descriptive, data, the sparse modelling work of (Elhamifar et al., 2012) is employed. Sparse modelling representative selection (SMRS) focuses on the identification of representative objects. Their work is summarized through the following formulation:
$$\min_{C} \; \lambda \lVert C \rVert_{1,q} + \tfrac{1}{2} \lVert Y - YC \rVert_F^2 \quad \text{s.t.} \quad \mathbf{1}^{\top} C = \mathbf{1}^{\top},$$
where $Y$ and $C$ refer to the data points and the coefficient matrix respectively. This optimization problem can also be viewed as a compression scheme, where we want to choose a few representatives that can reconstruct the available data set.
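In the formulation of Elhamifar et al., the representatives are the data points whose rows in the coefficient matrix are non-zero. The Python sketch below shows only this read-out step and the reconstruction error for an already computed coefficient matrix; solving the optimization problem is not shown. The function names, the tolerance `tol` and the toy matrices are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def select_representatives(C: Sequence[Sequence[float]],
                           tol: float = 1e-6) -> List[int]:
    # Indices of the rows of C whose Euclidean norm exceeds tol.
    norms = [math.sqrt(sum(v * v for v in row)) for row in C]
    return [i for i, norm in enumerate(norms) if norm > tol]


def reconstruction_error(Y: Sequence[Sequence[float]],
                         C: Sequence[Sequence[float]]) -> float:
    # Squared Frobenius norm of Y - Y C, with data points stored as columns of Y.
    rows, cols = len(Y), len(Y[0])
    err = 0.0
    for r in range(rows):
        for c in range(cols):
            approx = sum(Y[r][k] * C[k][c] for k in range(cols))
            err += (Y[r][c] - approx) ** 2
    return err


# Toy example: three 2D points (columns); the third is the midpoint of the first two.
Y = [[0.0, 2.0, 1.0],
     [0.0, 0.0, 0.0]]
# Each column of C expresses one point as a combination of the others;
# the third row is zero, so the third point is not a representative.
C = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 0.0]]
representatives = select_representatives(C)
error = reconstruction_error(Y, C)
```

In this toy case two of the three points suffice to reconstruct the whole set exactly, which is the compression view mentioned in the text.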

Utilized classifiers
A set of well-known classifiers was applied in order to evaluate the detection rates for various sets of input data.

k nearest neighbors
In pattern recognition, the k-nearest neighbors (kNN) algorithm is a non-parametric method used for classification (Bhatia and Vandana, 2010). An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors; it is, therefore, a type of instance-based learning, where the function is only approximated locally and all computation is deferred until classification.
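The majority vote described above can be sketched as follows. This is a minimal illustration with Euclidean distance and toy data, not the MATLAB implementation used in the experiments; the function name `knn_predict` and the labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[Hashable],
                query: Sequence[float], k: int = 5) -> Hashable:
    # Sort training indices by distance to the query and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Majority vote among the labels of the k nearest neighbors.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set with two well separated classes.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
near_a = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
near_b = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

No model is fitted in advance: all distance computations happen at prediction time, which is what the text means by deferring computation until classification.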

Classification trees
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. In classification tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes.

Artificial neural networks
Artificial neural networks (ANNs) are non-linear mapping structures, inspired by biological nervous systems, which are capable of machine learning and pattern recognition (Li et al., 2011). ANNs are universal approximators; however, their training may converge to any of multiple local minima (i.e. solutions) because of their structure, which is composed of multiple hierarchical layers of interconnected nodes. This structure consists of weights, biases and activation functions, imitating the neurons and synapses of a real brain.

Support vector machines
Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification analysis (Abe, 2010). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap (margin) that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the margin they fall on. The mappings used by SVM schemes are defined through a kernel function $k(x, y)$ selected to suit the problem. In our case we utilized linear and RBF kernels.
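The two kernel functions mentioned above can be written down directly. The sketch below uses the common parameterization exp(-gamma * ||x - y||^2) for the RBF kernel; the function names and the default value of `gamma` are illustrative assumptions, not the settings used in the experiments.

```python
import math
from typing import Sequence


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    # Inner product of the two vectors.
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    # Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance).
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
k_unit = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
```

The RBF kernel equals 1 for identical inputs and decays towards 0 as the points move apart, with `gamma` controlling how quickly.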

EXPERIMENTAL RESULTS
To capture and record the performers' body motions, we used a motion capture system consisting of one Kinect II depth sensor (Fig. 1) and the ITGD module, developed within the i-Treasures project (Dimitropoulos et al., 2014) by UMONS. The ITGD module enables the user to record and annotate motion capture data received from a Kinect sensor. The recording process took place at the School of Physical Education and Sport Science of the Aristotle University of Thessaloniki. Six Greek traditional dances of varying complexity were recorded. Each dance was performed by three dancers twice: the first time in a straight line and the second in a semi-circular curving line.

Dataset description
The data set consists of six different dances, executed either in a straight line or in a circle (Table 1). Every dance is described by a set of consecutive image frames. Every frame $f_i$, $i = 1, \dots, n$, has a corresponding Extensible Markup Language (XML) file with positions, rotations and confidence scores for 25 joints on the body (Table 2), in addition to timestamps.
The investigated dances were:
1. Enteka (eleven): A dance, performed by both women and men, which is popular mainly in the large urban centers of Western Macedonia (Grevena, Kozani, Florina, Kastoria, etc.). The dance is performed freely as a street carnival dance, but also around the carnival fires. During the dance, the dancers' hands move freely or are placed at the waist.
2. Kalamatianos: A popular Greek folk dance throughout Greece, Cyprus and internationally, often performed at social gatherings worldwide. It is a circle dance performed in a counterclockwise rotation with the dancers holding hands. It is a twelve-step dance with a 7/8 musical beat.
3. Makedonikos: A circle dance, performed by both women and men, with a 7/8 musical beat. The basic pattern of the dance is performed in twelve movements / steps. It therefore resembles the Kalamatianos dance to a great degree, with the difference that it is a more joyous dance. It is popular in the region of Western and Central Macedonia.
4. Syrtos (2 beat): The Syrtos (2 beat) dance is organized in a quick (2 beat) rhythm. It is a circle dance, performed by both women and men, mostly in the region of Pogoni of Epirus. In the past, the dance was performed separately by men and women, in one, two or more lines.
5. Syrtos (3 beat): Syrtos is one of the most popular dances throughout Greece and Cyprus. The Syrtos (3 beat) dance is organized in a slow (3 beat) rhythm. It is a line dance and a circle dance, performed by dancers (both women and men) in a curving line, holding hands and facing right. It is widespread through Epirus, Western Macedonia, Thessaly, Central Greece and Peloponnese.
6. Trehatos (Running): A circle dance, performed by both women and men, which is danced in the village of Neochorouda of Thessaloniki. The kinetic theme of the dance is composed of three different dance patterns. The first one resembles the Syrtos (3 beat) pattern, the second takes place once and connects the first and the third pattern, and the third one is characterized by intense motor activity.
However, a dance is not a static act; the time dimension should also be considered. Therefore, we utilized the information of two consecutive frames, $f_i$ and $f_{i+1}$. In the end, each dance was of size $m \times (2d) \times (n - 1)$, where $m$ is the number of joints, $d$ the number of raw values per joint and $n$ the number of frames. Prior to the representative selection step, data were normalized using min-max normalization.
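The pairing of consecutive frames and the min-max normalization can be sketched as follows. The sketch assumes that each frame has already been flattened into a list of numeric values and that normalization is applied per feature (column); neither detail is stated in the text, and the function names and toy values are hypothetical.

```python
from typing import List, Sequence


def pair_consecutive_frames(frames: Sequence[List[float]]) -> List[List[float]]:
    # Concatenate each frame with its successor: n frames give n - 1 samples.
    return [frames[i] + frames[i + 1] for i in range(len(frames) - 1)]


def minmax_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    # Scale every column to [0, 1]; constant columns are mapped to 0.0 (assumption).
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy sequence of three frames with two values each (illustrative only).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
pairs = pair_consecutive_frames(frames)
normalized = minmax_normalize(pairs)
```

Each resulting sample has twice as many values as a single frame, which matches the doubled feature dimension described above.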

Representative samples selection
A combination of the OPTICS and SMRS algorithms was adopted. In this case, SMRS is applied to the sub-clusters obtained through the OPTICS algorithm. This approach is similar to the work of (Protopapadakis et al., 2014).
It creates a small subset of representative samples from each cluster formed through the OPTICS algorithm.
The number of clusters, $c$, was defined using the rule $c = \lceil \sqrt{N/2} \rceil$, where $N$ denotes the number of available samples. The minimum number of data within a cluster, required by OPTICS, $MinPts$, was defined as $MinPts = \lfloor N / c \rfloor$.
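The two rules can be evaluated with a few lines of Python. The sketch assumes that the number of clusters is the ceiling of the square root of half the sample count and that the minimum cluster size is the floor of the sample count divided by the number of clusters; the function name `cluster_parameters` and the sample counts are hypothetical.

```python
import math
from typing import Tuple


def cluster_parameters(n_samples: int) -> Tuple[int, int]:
    # Assumed rule: number of clusters = ceil(sqrt(N / 2)).
    n_clusters = math.ceil(math.sqrt(n_samples / 2))
    # Minimum number of points per cluster = floor(N / number of clusters).
    min_pts = n_samples // n_clusters
    return n_clusters, min_pts


params_200 = cluster_parameters(200)
params_1000 = cluster_parameters(1000)
```

For example, 200 samples give 10 clusters with at least 20 points each, so the minimum cluster size grows roughly with the square root of the sample count.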

Algorithms setup
All algorithms were implemented in MATLAB, except for the SVMs, for which the library of (Chang and Lin, 2011) was used. In our case, the kNN parameterization process considers the number of nearest points, which was set to k = 5. Classification trees required no further parameterization. A feed-forward network with two hidden layers was utilized. All activation functions were hyperbolic tangent and the training method was back propagation. The SVM parameters $C$ and $\gamma$ were defined according to cross-validation accuracy scores.

Experimental results
It appears that the exploitation of raw data from two consecutive frames does not suffice for accurate folk dance recognition; no body joint or classifier exhibits dominant behaviour in terms of detection rates. All results correspond to frame detection rates; i.e. given two consecutive frames of any dance, we try to identify the dance using the joint positions.
First, the impact of the joint location is investigated (see fig. 4). To facilitate the illustration, the 25 body joints were grouped into body regions, as shown in table 2. Although the torso and leg regions provide better average detection rates, it appears that no body part / region is dominant for all dance types.
Secondly, the impact of the classifier selection has been assessed (see fig. 5); for illustration purposes, results are shown only for the torso joints group. No classifier achieved better performance than the rest in the dance recognition activity.
The combined detection rates, per dance category, were also investigated (fig. 6). Depending on the dance, the importance of each body region for feature extraction varies. It is also intriguing that the positions of the arm and head related joints contain meaningful information for the clustering. For instance, the Syrtos (3 beat) detection accuracy exceeds 65% using head joints and SVMs with a linear kernel. Yet, for the Syrtos (2 beat), the same classifier-joints combination performs poorly, i.e. below 20%.

CONCLUSIONS
In this paper, the ability of well-known classifiers to identify folk dances has been investigated. The impact of the body joint regions was also investigated. The analysis was based on raw data provided by a single Kinect II sensor. In total there were six Greek folk dances; most of them had two variations. The feature space comprised the coordinates and rotations of the body joints, in pairs of consecutive frames, in order to incorporate the time dimension. Future work will focus on the exploitation of more complex feature extraction processes and classifiers.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-2/W3, 2017 3D Virtual Reconstruction and Visualization of Complex Architectures, 1-3 March 2017, Nafplio, Greece