PARAMETRIC REPRESENTATION OF THE SPEAKER'S LIPS FOR MULTIMODAL SIGN LANGUAGE AND SPEECH RECOGNITION

In this article, we propose a new method for the parametric representation of the human lip region. We describe the functional diagram of the method and give implementation details, explaining its key stages and features. The results of automatic detection of the regions of interest are illustrated, and the processing speed of the method on several computers of different performance is reported. This universal method allows the parametric representation of the speaker's lips to be applied to tasks in biometrics, computer vision, and machine learning, and to the automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.


INTRODUCTION
According to World Health Organization (WHO) statistics, in 2015 about 360 million people in the world (5%), including 328 million adults and 32 million children, suffered from hearing loss (WHO, 2015). Currently, governments of many countries, in collaboration with international research centers and companies, are paying great attention to the development of smart technologies and systems based on voice, gesture, and multimodal interfaces that can provide communication between deaf people and those with normal hearing (Karpov and Zelezny, 2014).
Deaf or "hard of hearing" people communicate by using a sign language (SL). All gestures can be divided into static and dynamic ones (Plouffe and Cretu, 2016). Sign language recognition is characterized by many parameters, such as the characteristics of the gestural communication channel, the size of the recognition vocabulary, the variability of gestures, etc. (Zhao et al., 2016). Information is communicated using a variety of visual-kinetic means of natural interpersonal communication: hand gestures, facial expressions, and lip articulation (Karpov et al., 2013).
In addition, future speech interfaces for intelligent control systems will cover such areas as social services, medicine, education, robotics, and the military sphere. However, at the moment, even with the availability of various filtering and noise reduction techniques, automatic speech recognition systems are not able to function with the necessary accuracy and robustness under difficult acoustic conditions (Karpov et al., 2014). In recent years, audio-visual speech recognition has been used to improve the robustness of such systems (Karpov, 2014). This speech recognition method combines the analysis of audio information with the analysis of visual information about the speech obtained by machine vision technology (so-called automatic lip reading).

DESCRIPTION OF THE PARAMETRIC REPRESENTATION METHOD
Parametric video representation methods belong to the low and medium levels of processing, since their output consists of informative features extracted from the input video stream.
A functional diagram of the parametric lip representation method is presented in Figure 1. The input of the method is data from recorded video files or directly from a video camera, and the video frames are processed as a stream. The image of each frame is subjected to linear contrasting (Al-amri et al., 2010). Then the face area is located in the full image using the Viola-Jones algorithm (Viola and Jones, 2004; Castrillón et al., 2011). Finding faces in the frames is complicated by the variety of face positions and facial expressions. Next, the eye region within the found face area is located using a classifier, and the found region is filled with solid black. The nose and mouth regions are found using classifiers trained on images. The remaining steps are identical to those of the eye search, except that the mouth region is not filled but converted from a color image to a grayscale image, after which its histogram is equalized (Yoon et al., 2009).
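The linear contrasting applied to each frame can be illustrated with a minimal sketch. It assumes a grayscale frame stored as a list of rows and stretches its intensities to the full 0..255 range; the function name `linear_contrast` and the target range are assumptions made for this example, not details taken from the method's implementation.

```python
def linear_contrast(image, out_min=0, out_max=255):
    """Linearly stretch the intensities of a grayscale image to [out_min, out_max].

    `image` is a list of rows, each a list of integer intensities.
    """
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in image]
    span = out_max - out_min
    return [[round(out_min + (p - lo) * span / (hi - lo)) for p in row]
            for row in image]


# Hypothetical 2x2 frame whose intensities occupy only part of the range.
frame = [[50, 100], [150, 200]]
stretched = linear_contrast(frame)
```

After stretching, the darkest pixel of the frame maps to 0 and the brightest to 255, which is the effect shown for real frames in Figure 2.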
The final stages of the method perform the following:
- resize the image that contains the mouth region;
- analyze the principal components of the features (Principal Component Analysis, PCA) (Candes et al., 2011);
- save the computed results in a dedicated text file.
Step by step, the method works as follows. The input parameter of the method is the full path to a video file in the Audio Video Interleave (AVI) format. The video frames should have a resolution of at least 640x480 pixels, a color depth of 24 bits, and a frame rate of 30 Hz. Next, the speaker's face is searched for using the Viola-Jones algorithm, proposed by Paul Viola and Michael Jones (Viola and Jones, 2004; Castrillón et al., 2011). This algorithm allows a variety of objects to be found in an image in real time. Its main task is face detection, but it can also be used to recognize other kinds of objects. The algorithm is implemented in the Open Source Computer Vision Library (OpenCV, 2016) and has the following characteristics:
- the image is used in its integral representation (Gonzalez and Woods, 2007), which allows the necessary objects to be found at high speed;
- the necessary object is searched for using 19 Haar features (Haar Cascades, 2008; Padilla et al., 2012);
- boosting is used, that is, the selection of the most suitable features for the desired object in the selected part of the image (Viola and Jones, 2004);
- a classifier previously trained on faces (Bruce et al., 2016), which are scaled to a size of 20x20 pixels, accepts features at the input and outputs the binary result "true" or "false";
- cascades of features, consisting of several classifiers, are used to quickly discard windows in which the object is not found (Felzenszwalb et al., 2010).
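The early-rejection idea behind the cascade can be sketched as follows. This is a simplified illustration rather than OpenCV's implementation: the stage structure (a weighted sum of feature values compared with a threshold), the function name `run_cascade`, and all numeric values are assumptions made for the example.

```python
def run_cascade(window_features, stages):
    """Return True only if every stage of the cascade accepts the window.

    `stages` is a list of (weights, threshold) pairs. A stage accepts the
    window when the weighted sum of its feature values reaches the threshold.
    """
    for weights, threshold in stages:
        score = sum(w * f for w, f in zip(weights, window_features))
        if score < threshold:
            # Rejected early: the remaining stages are never evaluated.
            return False
    return True


# Hypothetical two-stage cascade over two feature values per window.
stages = [([1.0, 0.0], 0.5), ([0.5, 0.5], 0.6)]
accepted = run_cascade([1.0, 0.4], stages)   # passes both stages
rejected = run_cascade([0.2, 0.9], stages)   # discarded by the first stage
```

Because most windows in a frame contain no face, discarding them at the first cheap stage is what keeps the overall search fast.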
The integral image is calculated in a single pass, so this processing takes linear time, proportional to the number of pixels in the image. In addition, the integral matrix makes it possible to calculate the sum of the pixels of a rectangle of arbitrary area in a short time. The scanning-window principle is based on the following steps:
- the scanning window, whose size equals the predetermined values, is moved sequentially along the original image with a step of 1 pixel;
- in each window, the location of the features is calculated by scaling and rotating the features within the scanning window;
- the scanning is performed sequentially at different scales;
- the image itself remains unchanged; only the scanning window is scaled (the size of the window is increased);
- the found features are passed as input to the cascade classifier, which gives the result.
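The scanning-window steps listed above can be sketched as a generator of window positions. The sketch is illustrative only: the function name `scan_windows`, the square window shape, and the scale factor of 1.25 are assumptions, while the 20-pixel base size and the 1-pixel step follow the text.

```python
def scan_windows(img_w, img_h, base=20, scale_step=1.25, stride=1):
    """Yield (x, y, size) for square windows at successively larger scales.

    The image is never resized; only the window grows between passes.
    """
    size = base
    while size <= min(img_w, img_h):
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        # Enlarge the window, always by at least one pixel.
        size = max(size + 1, int(size * scale_step))


# Hypothetical 22x21 image: only the 20x20 window fits, in 3 x 2 positions.
windows = list(scan_windows(22, 21))
```

Each yielded window would then have its features evaluated and passed to the cascade classifier.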
During the search it is not prudent to calculate all the features, because the processing time would increase. Therefore, the classifier is trained to react only to a certain subset of features. The classification problem in the Viola-Jones algorithm is solved by means of machine learning with the Adaptive Boosting method (AdaBoost) (Wu and Nagahashi, 2014). Some examples of detection of the speaker's face are presented in Figure 5. A distinctive feature of the parametric lip representation method presented in Figure 1 is that face area detection is obligatory only in the first frame. Afterwards, if the classifier returns a negative result, the coordinates are taken from the previous successful detection.
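The reuse of coordinates from the previous successful detection can be sketched with a small helper. The class name `RegionTracker` and the `(x, y, w, h)` tuple layout are hypothetical and chosen only for illustration.

```python
class RegionTracker:
    """Remember the last successfully detected region and reuse it on failure."""

    def __init__(self):
        self.last = None  # no region is known before the first successful frame

    def update(self, detection):
        """Return the region to use for the current frame.

        `detection` is an (x, y, w, h) tuple, or None when the classifier
        returned a negative result for this frame.
        """
        if detection is not None:
            self.last = detection
        return self.last


face = RegionTracker()
first = face.update((120, 80, 200, 200))  # detection succeeds in the first frame
second = face.update(None)                # detection fails: previous coordinates are reused
```

The same helper could be instantiated separately for the face, eye, nose, and mouth regions, since the text describes the same fallback for each of them.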
The next operation reduces the frame to the area containing the face (Figure 6). Cutting off the excess area of the image reduces further processing time. The next step of the method searches for the speaker's eyes using the Viola-Jones algorithm. Instead of the face detector, an eye detector is used, which provides a model of the eye region trained on images of 11x45 pixels. This search is performed only if the face area has been found successfully. As with face area detection, the eyes need to be found only in the first frame. Subsequently, any failure leads to the coordinates of the eye region being borrowed from the previous successful detection. Some examples of detection of the speaker's eyes are presented in Figure 7.
As with the previous detections, the nose region has to be found only in the first frame. If an error occurs, the coordinates of the required area are taken from the previous successful detection. Some examples of detection of different speakers' noses are presented in Figure 9.
Before Principal Component Analysis (PCA) of the features of the lip region is calculated, a grayscale transformation must be performed (Figure 12a), which converts the red, green, and blue channel values of each pixel of the color image into the corresponding brightness value. Then the histogram is equalized so that its resulting shape approaches a uniform distribution, that is, so that all brightness levels of the image occur with the same frequency (Figure 12b). In detail, the process looks as follows. The image has a dimension of $M$ pixels down and $N$ pixels across and $J$ brightness quantization levels. The total dimension of the matrix of the original image is $M \times N$. It follows that, on average, one brightness level contains the following number of pixels:
$$n_0 = \frac{MN}{J}.$$
After equalization, the distance $\Delta g$ between discrete brightness levels varies, while the number of pixels at each level is, on average, the same and equal to $n_0$. For this purpose, the pixel counts of the levels $f_i$ are summed starting from level 0, in increments of 1, until the value $n_0$ is reached. The level of the initial $f_i$ is then assigned to all the summed pixels, and processing continues from the level $f_i$ at which the summation stopped. If a single level exceeds the value $n_0$ by a factor of $q$, the next $q$ levels remain unfilled.
Finally, the corresponding rectangular regions are normalized to a size of 32x32 pixels (Figure 12c).
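The three preprocessing steps of the mouth region (grayscale transformation, histogram equalization, size normalization) can be sketched as follows. Several choices are assumptions, not details given in the text: the channel weights 0.299, 0.587, 0.114 (the common ITU-R BT.601 luma weights), the use of the standard cumulative-histogram form of equalization in place of the level-merging procedure described above, and nearest-neighbour resampling for the resize. All function names are hypothetical.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of brightness values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def equalize(gray, levels=256):
    """Equalize the histogram using the cumulative distribution of brightness."""
    hist = [0] * levels
    for row in gray:
        for p in row:
            hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Only one brightness level is present; nothing to equalize.
        return [row[:] for row in gray]
    return [[round((cdf[p] - cdf_min) * (levels - 1) / (total - cdf_min))
             for p in row] for row in gray]


def resize_nearest(gray, out_w=32, out_h=32):
    """Resize to out_w x out_h pixels with nearest-neighbour sampling."""
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


# Hypothetical 2x2 region, equalized and then brought to 32x32 pixels.
mouth = resize_nearest(equalize([[10, 20], [30, 40]]))
```

The resulting 32x32 region yields a 1024-element pixel vector, which is the kind of input the principal component analysis step operates on.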

Principal component analysis splits the vector space into the principal (eigen) subspace, which contains the principal components, and its orthogonal complement. In image analysis tasks, the input vectors are images that have been centered and brought to a common scale. Using precomputed mapping matrices, the input image is decomposed into a set of linear coefficients called principal components. The sum of the principal components multiplied by the corresponding eigenvectors reconstructs the image. Several tens of principal components are calculated for each region of interest; all other components encode only small differences between the reference and noise, and they can be disregarded.
The PCA projection is precomputed from a set of training images containing the mouth region of a person (Kaehler and Bradski, 2017). The following data are used to calculate the PCA projection: a vector $U = (u_1, u_2, \dots, u_{WH})^T$ containing the sequence of pixels of an image of size $W \times H$; a set of $M$ input vectors $\{U_1, U_2, \dots, U_M\}$; the vector of expected values $\mu$; and the covariance matrix $C$, which are defined as follows:
$$\mu = \frac{1}{M}\sum_{i=1}^{M} U_i, \qquad C = \frac{1}{M}\sum_{i=1}^{M} (U_i - \mu)(U_i - \mu)^T.$$
The sum vector and the partial covariance matrix are computed from the same set of input vectors. The $p$ largest eigenvalues (characteristic numbers) and the corresponding eigenvectors are then determined, and the resulting vector is normalized using the eigenvalues. After principal component analysis of the mouth region, this vector has a dimension of 32 components. The main advantage of principal component analysis is that it supports the storage and retrieval of images in large databases and the reconstruction of images. A possible disadvantage of the method is its strict requirements on the conditions under which the images are captured.
The images should be captured under similar illumination conditions and from the same angle. High-quality preprocessing that brings the images to standard conditions (scale, rotation, centering, brightness equalization, and background removal) should be carried out. The stages of the parametric lip representation method are presented in Figure 13. Testing has shown that, on average, video sequences can be processed at 5 frames per second.
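The PCA projection of a pixel vector can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the mean and covariance use the 1/M normalization, the eigenvectors are found by power iteration with deflation (a numerical library would normally be used instead), and "normalization using the eigenvalues" is interpreted as division of each coefficient by the square root of its eigenvalue. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of the training vectors."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]


def covariance_matrix(vectors, mu):
    """Covariance matrix of the training vectors with 1/M normalization."""
    m, n = len(vectors), len(mu)
    centered = [[v[k] - mu[k] for k in range(n)] for v in vectors]
    return [[sum(c[a] * c[b] for c in centered) / m for b in range(n)]
            for a in range(n)]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration for a symmetric positive semi-definite matrix."""
    n = len(matrix)
    start = [float(k + 1) for k in range(n)]
    norm = math.sqrt(sum(x * x for x in start))
    vec = [x / norm for x in start]
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            return 0.0, vec  # no variance left in the matrix
        vec = [x / norm for x in nxt]
        value = norm
    return value, vec


def pca_basis(vectors, p):
    """Return the mean and up to p (eigenvalue, eigenvector) pairs."""
    mu = mean_vector(vectors)
    cov = covariance_matrix(vectors, mu)
    basis = []
    for _ in range(p):
        value, vec = dominant_eigenpair(cov)
        if value < 1e-12:
            break
        basis.append((value, vec))
        # Deflation: remove the found component before searching for the next.
        n = len(vec)
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(n)]
               for a in range(n)]
    return mu, basis


def project(vector, mu, basis):
    """Project a vector onto the basis and normalize by sqrt(eigenvalue)."""
    centered = [x - m for x, m in zip(vector, mu)]
    return [sum(c * e for c, e in zip(centered, vec)) / math.sqrt(value)
            for value, vec in basis]


# Hypothetical training set: 2-D points lying on the line y = 2x.
training = [[t, 2 * t] for t in (-2, -1, 0, 1, 2)]
mu, basis = pca_basis(training, p=1)
features = project([1, 2], mu, basis)
```

In the method described here, the training vectors would be the pixel sequences of the 32x32 mouth images and `p` would be 32, giving the 32-component feature vector.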

CONCLUSIONS AND FUTURE WORK
The proposed method detects the speaker's lip region in the video stream during articulation and represents the region of interest as a feature vector with a dimensionality of 32 components. Owing to its versatility, this parametric representation of video signals can be used for different tasks in biometrics, computer vision, and machine learning, and in automatic systems for the recognition of faces (Lei et al., 2012), speech, and elements of sign languages (Karpov et al., 2016). In our current research, the proposed parametric representation is used for multimodal recognition of elements of the Russian sign language and audio-visual Russian speech.
In further research, we are going to use statistical approaches for the recognition of gestures and audio-visual speech, including lip-reading, e.g. based on some types of Hidden Markov Models, as well as deep learning approaches with deep neural networks (Kipyatkova and Karpov, 2016). We also plan to use a high-speed video camera with more than 100 fps in order to capture the dynamics of gestures and lips (Ivanko and Karpov, 2016).

Figure 1. Functional diagram of the parametric video representation method

Figure 2. Examples of linear contrasting

Training the classifier takes a long time, but the search for objects (faces) is fast. In addition, the algorithm has a quite low probability of false positives, as the search is carried out on the principle of a scanning window, whose general form is as follows. Let $R$ be the entire spatial area occupied by the image, which includes the desired object (face) and is represented in integral form. The integral representation of the image in the Viola-Jones algorithm is intended for quick calculation of the total brightness of an arbitrary rectangular area of the input image. The calculation time does not depend on the area of the chosen region. The integral representation of the image is a matrix $I$ whose dimensions coincide with those of the original image. Each element stores the sum of the intensities of all pixels located to the left of and above this element. The elements of the matrix $I$ are calculated by the following formula:
$$I(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} R(i, j),$$
where $R(i, j)$ is the brightness of the pixel of the original image. Any element $I(x, y)$ of the matrix stores the sum of the pixels in the rectangle from $(0, 0)$ to $(x, y)$; in other words, the value at $(x, y)$ equals the sum of the values of all pixels located to the left of and above the pixel $(x, y)$. The matrix is calculated by the following formula:
$$I(x, y) = R(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1).$$
A rectangular area is described by its origin relative to $I$, its width $w$, its height $h$, and its angle of inclination. Features are obtained using Haar filters, each of which compares the brightness in two rectangular areas of the image. The standard Viola-Jones algorithm uses the Haar primitives presented in Figure 3.
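The integral representation and a two-rectangle Haar feature can be sketched as follows. The sketch assumes an axis-aligned (non-rotated) rectangle and stores the matrix row by row, so the element written $I(x, y)$ in the text is `ii[y][x]` in the code; the function names are hypothetical.

```python
def integral_image(img):
    """Build the integral image in a single pass over the pixels."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ii[y][x] = (img[y][x]
                        + (ii[y - 1][x] if y > 0 else 0)
                        + (ii[y][x - 1] if x > 0 else 0)
                        - (ii[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y).

    Only four lookups are needed, whatever the area of the rectangle.
    """
    def at(yy, xx):
        return ii[yy][xx] if yy >= 0 and xx >= 0 else 0

    x2, y2 = x + w - 1, y + h - 1
    return at(y2, x2) - at(y - 1, x2) - at(y2, x - 1) + at(y - 1, x - 1)


def haar_two_rect(ii, x, y, w, h):
    """Brightness of the left half minus the right half (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical 3x3 image.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = integral_image(image)
total = rect_sum(ii, 1, 1, 2, 2)         # 5 + 6 + 8 + 9
feature = haar_two_rect(ii, 0, 0, 2, 3)  # (1 + 4 + 7) - (2 + 5 + 8)
```

Because `rect_sum` always performs four lookups, the cost of evaluating a Haar feature does not grow with the size of the scanning window.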

Figure 3. Haar's primitives in the standard Viola-Jones algorithm

Figure 5. Examples of determination of the speaker's face

Figure 6. Examples of reduction of the frames to the area with the speaker's face

Figure 7. Examples of determination of the speaker's eye

Figure 8. Examples of the monochrome filling of the eye region

Figure 9. Examples of determination of different speakers' noses

Figure 10. Examples of the monochrome filling of the eye and nose regions

Figure 11. Examples of determination of different speakers' lips

... of the original image is the same throughout the histogram of the original image, $\Delta f$, but each level of the histogram contains a different number of pixels. After the equalization procedure, the distance ...

Figure 12. a) grayscale transformation, b) histogram equalization, c) normalization of the mouth region

Figure 13. Parametric lip representation method: a) face area detection, b) reduction of the frame to the area with the speaker's face, c) monochrome filling of the eye and nose regions, d) lip region detection, e) normalization of the mouth region

Table 1. The average processing speed of video frames using different computer systems

The average processing speed of the proposed method of parametric video signal representation was measured on computers of different performance, whose parameters are presented in Table 1.