CONVOLUTIONAL RECOGNITION OF DYNAMIC TEXTURES WITH PRELIMINARY CATEGORIZATION

A Dynamic Texture (DT) can be considered an extension of the static texture that additionally comprises motion features. DTs form a very wide but weakly studied type of textures employed in many computer vision tasks. The proposed DT recognition method includes a preliminary categorization into four proposed categories: natural particles with periodic movement, natural translucent/transparent non-rigid blobs with randomly changing movement, man-made opaque rigid objects with periodic movement, and man-made opaque rigid objects with stationary or chaotic movement. This formulation makes it possible to construct separate spatial and temporal Convolutional Neural Networks (CNNs) for each category. The input of the CNNs is a pair of successive frames (taken every 1, 2, 3, or 4 frames depending on the category), while the outputs store sets of binary features in the form of histograms. In the test stage, the concatenated histograms are compared with the histograms of the classes using the Kullback-Leibler distance. The experiments, conducted on the DynTex database, demonstrate the efficiency of the designed CNNs and provide recognition rates of up to 97.46–98.32% for sequences with a single type of DT.


INTRODUCTION
It is evident that most wild natural scenes include many motion patterns, such as clouds, trees, grass, water, smoke, flame, haze, and fog, which are called DTs. Even a running crowd of people, vehicular traffic, a swarm of fish, or birds in flight may be modelled as DTs under specific shooting parameters. DTs are caused by a variety of physical processes that lead to different visualizations of such objects: small/large particles, transparent/opaque visibility, rigid/non-rigid structure, and 2D/3D motion. The goal of DT recognition can differ. In reconstruction tasks, recognition of a DT means the creation of its 2D or 3D statistical model. In surveillance systems, the DT motion in a 3D spatiotemporal volume is analyzed. In virtual applications, only qualitative motion recognition is necessary.
One of the main properties of man-made textured objects, regularity, is not so evident for natural DTs. It is reasonable to assume that computing the gradient fields and full displacements with high accuracy is not necessary. The spatial properties of textures are well known and include statistical, fractal, and color estimators (Favorskaya et al., 2016). The temporal properties of textures differ from each other. One can speak of common temporal properties, such as divergence, curl, peakiness (the average flow magnitude divided by its standard deviation), and orientation in the case of the normal or full optical flow, and of special temporal properties, for example, stationary, coherent, incoherent, flickering, and scintillating. The spatiotemporal features are required to be invariant, at least, to affine transforms and illumination variations.
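As an illustration of the peakiness property defined above, the following minimal sketch computes the average flow magnitude divided by its standard deviation. The function name, the sample flow vectors, and the choice of the population standard deviation are assumptions made for this example only.

```python
import math
from statistics import mean, pstdev


def peakiness(flow):
    """Average flow magnitude divided by its standard deviation.

    `flow` is a list of (u, v) optical flow vectors.
    Assumption: the population standard deviation is used.
    """
    magnitudes = [math.hypot(u, v) for u, v in flow]
    spread = pstdev(magnitudes)
    if spread == 0.0:
        return math.inf  # all vectors have the same magnitude
    return mean(magnitudes) / spread


# Hypothetical flow vectors for four pixels (magnitudes 5, 5, 10, 10).
sample_flow = [(3.0, 4.0), (0.0, 5.0), (6.0, 8.0), (0.0, 10.0)]
sample_peakiness = peakiness(sample_flow)
```

A large value indicates flow magnitudes that are concentrated around their mean, while a value close to one or below indicates strongly varying magnitudes.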
Recognition of DTs remains a challenging problem because of the multiple factors that appear in dynamic scenes, including viewpoint changes, camera motion, and illumination changes. In the past decades, a variety of approaches have been proposed for DT recognition, such as the Linear Dynamic System (LDS) methods (Ravichandran et al., 2013), the GIST method (Oliva and Torralba, 2001), the Local Binary Pattern (LBP) methods (Zhao and Pietikainen, 2007a), wavelet methods (Dubois et al., 2009; Dubois et al., 2015), morphological methods (Dubois et al., 2012), and deep multilayer networks (Arashloo et al., 2017), among others.
Our contribution deals with the architectural design of the spatial and temporal CNNs for the categorized types of DTs, such that the filter parameters are tuned optimally for each DT category. Special attention is devoted to motion features, such as periodic movement features and energy-based movement features.
The rest of the paper is organized as follows. Section 2 describes the related work. The preliminary categorization of dynamic textures is proposed in Section 3. Section 4 contains the design details of the CNNs, while Section 5 presents the results of experiments conducted on the DynTex database. Section 6 concludes the paper.

RELATED WORK
All approaches to DT recognition can be roughly categorized as generative or discriminative methods. The generative methods consider DTs as physical dynamic systems based on the spatiotemporal autoregressive model (Szummer and Picard, 1996), the LDS (Doretto et al., 2003), the kernel-based model (Chan and Vasconcelos, 2007), and the phase-based model (Ghanem and Ahuja, 2007). The details of this approach are described in (Haindl and Filip, 2013). Its main drawback is the inflexibility of the models when describing DT sequences with nonlinear motion irregularities.
The discriminative methods employ the distributions of DT patterns. Many methods belong to this group, such as local spatiotemporal filtering using oriented energy (Wildes and Bergen, 2000), normal flow pattern estimation, spacetime texture analysis (Derpanis and Wildes, 2012), global spatiotemporal transforms (Li et al., 2009), model-based methods (Doretto et al., 2004), fractal analysis (Xu et al., 2011), wavelet multifractal analysis (Ji et al., 2013), and the spatiotemporal extension of the LBPs (Liu et al., 2017). The discriminative methods prevail over the generative methods due to their robustness to environmental changes. However, the merits of all approaches become quite limited in the case of complex DT motion.
The spacetime orientation decomposition is an intuitive representation of the DT. Derpanis and Wildes (Derpanis and Wildes, 2010) implemented broadly tuned 3D Gaussian third derivative filters, capturing the 3D direction of the symmetry axis. The filter responses to the image data were pointwise rectified (squared) and integrated (summed) over a spacetime region of the DT. The spacetime oriented energy distributions, maintained as histograms in practice, were evaluated by the Minkowski distance, the Bhattacharyya coefficient, or the Earth mover's distance with respect to the sampling measurements. These authors reported semantic category classification results of 92.3% for seven classes (flames, fountain, smoke, turbulence, waves, waterfall, and vegetation) from the UCLA dynamic texture database (Saisan et al., 2001).
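To make the histogram comparison mentioned above concrete, the following minimal sketch computes the Bhattacharyya coefficient between two histograms. It is a generic illustration and not the implementation of Derpanis and Wildes; the function names and the sample histograms are hypothetical.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    return [value / total for value in hist]


def bhattacharyya_coefficient(hist_a, hist_b):
    """Sum of sqrt(p_i * q_i) over the bins of two normalized histograms.

    The coefficient is 1 for identical distributions and 0 for
    distributions without any overlapping bins.
    """
    p, q = normalize(hist_a), normalize(hist_b)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical oriented-energy histograms of two texture samples.
same = bhattacharyya_coefficient([4, 2, 2], [4, 2, 2])
disjoint = bhattacharyya_coefficient([5, 0, 0], [0, 3, 2])
```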
Temporal repetitiveness is the basis of most DTs. A periodicity analysis of strictly periodic and nearly (quasi) periodic movements was developed by Kanjilal et al. (Kanjilal et al., 1999) and included three basic periodicity attributes: the periodicity (period length), the pattern over successive repetitive segments, and the scaling factor of the repetitive pattern segments. Later, this analysis, based on the Singular Value Decomposition (SVD) of a time series arranged into a matrix, was adapted to DT recognition (Chetverikov and Fazekas, 2006).
The set of spatial features is wide. Usually it is impossible to recognize a DT using a single descriptor; some aggregation of features is required to satisfy the diversity, independence, decentralization, and aggregation criteria. The spatial features are defined by the type of the analyzed DTs. Thus, the LBP and Gabor features can be used to recognize simple and regular textures, while the shape co-occurrence texture patterns (Liu et al., 2014) and deep network-based features (Bruna and Mallat, 2013) describe geometrical and high-order static textures. The GIST descriptor is selected to depict scene-level textural information.
DTs exhibit spatial and temporal regularities simultaneously. Therefore, much research is focused on the simultaneous processing of spatial and temporal patterns in order to construct efficient spatiotemporal descriptors based on the LDT model (Chan and Vasconcelos, 2007). Nevertheless, some authors study the dynamic or spatial patterns of DTs separately using, for example, Markov random fields, chaotic invariants, the GIST descriptor, and the LBPs, among others (Zhao et al., 2012; Crivelli et al., 2013).
Recently, deep structure-based approaches have been actively applied in many computer vision tasks. Deep multilayer architectures achieve excellent performance, exceeding human capabilities in various challenging visual recognition tasks (Goodfellow et al., 2016). However, they require a large volume of labeled data, which makes the learning stage computationally demanding. Due to the large number of parameters involved, these networks are prone to overfitting. A particularly successful group of multilayer networks is the convolutional architectures (Schmidhuber, 2015), or CNNs. In the CNN, the problems of overfitting, an expensive learning stage, and weak robustness against image distortions are handled via constrained parameterization and pooling.
Qi et al. (Qi et al., 2016) proposed a well-trained Convolutional Neural Network (ConvNet) that extracts mid-level features from each frame, followed by classification that concatenates the first- and second-order statistics over the mid-level features. These authors presented and tested a two-level feature extraction scheme consisting of the spatial and the temporal transferred ConvNet features. The ConvNet has five convolutional layers and two fully connected layers, with the final fully connected layer removed.
The PCA Network (PCANet) was designed by Chan et al. (Chan et al., 2015) as a convolutional multilayer architecture with filters that are learned using principal component analysis. Its advantage is that training the network only involves PCA over the data volume. The PCANet is a convolutional structure with a highly restricted parameterization. Despite its simplicity, the PCANet was reported to provide the best performance in static texture categorization and image recognition tasks. Afterwards, the static PCANet was extended to the spatiotemporal domain (PCANet-TOP) for the analysis of dynamic texture sequences (Arashloo et al., 2017).
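The idea of learning convolutional filters with principal component analysis can be sketched as follows. This is a simplified, single-stage illustration under stated assumptions and not the reference PCANet implementation: the function name, the patch size, the number of filters, and the random training images are all hypothetical.

```python
import numpy as np


def learn_pca_filters(images, k=3, num_filters=4):
    """Learn k x k filters as the leading principal directions of patches."""
    patches = []
    for image in images:
        image = np.asarray(image, dtype=float)
        for r in range(image.shape[0] - k + 1):
            for c in range(image.shape[1] - k + 1):
                patch = image[r:r + k, c:c + k].ravel()
                patches.append(patch - patch.mean())  # remove the patch mean
    data = np.array(patches)                    # shape: (num_patches, k*k)
    scatter = data.T @ data                     # (k*k, k*k) scatter matrix
    _, eigvecs = np.linalg.eigh(scatter)        # eigenvalues come in ascending order
    leading = eigvecs[:, ::-1][:, :num_filters]  # keep the largest components
    return leading.T.reshape(num_filters, k, k)


rng = np.random.default_rng(0)
images = [rng.random((16, 16)) for _ in range(5)]  # hypothetical training images
filters = learn_pca_filters(images)
```

No gradient-based training is involved: the filters follow directly from an eigendecomposition, which is what keeps the parameterization restricted.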
Due to the great variety of DTs, it is reasonable to preliminarily categorize the DTs according to their global features and to design special CNNs with simpler architectures for each category. The proposed structure makes it possible to speed up the recognition of huge data volumes, for example, during object recognition and surveillance.

PRELIMINARY CATEGORIZATION OF DYNAMIC TEXTURES
Our DT categorization is based on the following spatiotemporal criteria:
1. Spatial texture layering/layout: uniformly distributed texels / texels in a non-uniform spatial background
2. Type of texels: particles / blobs / objects
3. Shape of texels: rigid / non-rigid
4. Color of texels: changeable / persistent
5. Transparency: opaque / translucent / transparent
6. Type of texels' motion: stationary / periodic / randomly changed / chaotic
A multi-slice mapping of texels (2D spatial texture) forms a volumetric representation, the so-called voxels. It is worth noting that an analysis of voxels is also possible. However, 3D filters are more complicated and provide global estimators with less discriminative information than a multi-slice mapping. Above all, any motion in a scene ought to be analyzed in terms of textural/non-textural regions. For this purpose, well-known techniques, such as background subtraction, the block-matching algorithm, optical flow, or their multiple modifications, may be applied (Favorskaya, 2012).
Periodicity is a very significant feature of DTs for preliminary categorization. A generalized periodicity function f(k) describes the temporal variations of the average optical flow of the DTs, followed by pre-processing according to the recommendations of Chetverikov and Fazekas (Chetverikov and Fazekas, 2006). These pre-processing algorithms reduce the effects of:
1. Noise. The original function $f_o(k)$ is smoothed by a small mean filter, giving the denoised function $f_n(k)$
2. Function trend. The denoised function $f_n(k)$ is detrended by smoothing it with a large mean filter and subtracting the resulting mean level from the denoised function, giving $f_t(k)$
3. Amplitude variations. The detrended function $f_t(k)$ is normalized without shifting
4. Potential non-stationarity. The periodicity is computed using a sliding window whose size should span at least four periods of the function $f_t(k)$
According to the notation of Kanjilal et al. (Kanjilal et al., 1999), the successive intervals of length n of the digital generalized function f(k) are placed into the rows of the m × n matrix $A_n$:

$$A_n = \begin{bmatrix} f(1) & f(2) & \cdots & f(n) \\ f(n+1) & f(n+2) & \cdots & f(2n) \\ \vdots & \vdots & & \vdots \\ f((m-1)n+1) & f((m-1)n+2) & \cdots & f(mn) \end{bmatrix} \qquad (1)$$

If n equals the period length, then the rows of matrix $A_n$ are linearly dependent in spite of different scaling factors of the rows. This proposition permits the use of the SVD to determine the repeating pattern and the scaling factors from matrix $A_n$ as $A_n = U S V^T$, where $s_1 \ge s_2 \ge \dots$ are the singular values on the diagonal of $S$, and $u_1$ and $v_1$ are the first columns of $U$ and $V$. When n equals the period length N, two cases are possible. The first case is the same as the one mentioned above, but with different scaling of the rows. In this case, the vector $v_1$ remains the periodic pattern. The second case covers nearly repeating patterns with different scaling. In this case, the matrix $A_n$ can be full-rank with $s_1 \gg s_2$, which indicates a strong primary periodic component of length n, given by the rows of the matrix $u_1 s_1 v_1^T$. To obtain further components, the iterative procedure is applied to the residual matrix $A_n - u_1 s_1 v_1^T$.
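The pre-processing steps and the SVD-based periodicity test described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited authors: the function names, the filter widths (3 and 31), the reflected edge handling, and the synthetic test signal are all hypothetical choices.

```python
import numpy as np


def mean_filter(x, width):
    """Moving average with an odd window width; edges use reflected samples."""
    pad = width // 2
    padded = np.pad(x, pad, mode="reflect")
    return np.convolve(padded, np.ones(width) / width, mode="valid")


def preprocess(f_o, small=3, large=31):
    """Denoise, detrend, and normalize a periodicity function."""
    f_n = mean_filter(np.asarray(f_o, dtype=float), small)  # 1. noise
    f_t = f_n - mean_filter(f_n, large)                     # 2. trend
    scale = np.max(np.abs(f_t))
    return f_t / scale if scale > 0 else f_t                # 3. amplitude, no shift


def periodicity_ratio(f, n):
    """Ratio s1/s2 of the two largest singular values of the m x n matrix A_n."""
    m = len(f) // n
    if n < 2 or m < 2:
        raise ValueError("need a row length n >= 2 and at least two full rows")
    a_n = np.reshape(np.asarray(f, dtype=float)[:m * n], (m, n))
    s = np.linalg.svd(a_n, compute_uv=False)  # singular values, descending
    return s[0] / max(s[1], np.finfo(float).eps)


# Synthetic signal with a period of 20 samples (hypothetical data).
k = np.arange(200)
clean = np.sin(2 * np.pi * k / 20)
ratio_true = periodicity_ratio(clean, 20)   # row length equals the period
ratio_wrong = periodicity_ratio(clean, 17)  # row length does not match
f_t = preprocess(clean + 0.01 * k)          # signal with an added linear trend
```

Note that integer multiples of the true period also produce a near rank-one matrix, so a search over candidate row lengths would normally prefer the smallest length with a high ratio.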
Besides the evaluation of the ratio $s_1/s_2$, two alternative measures of function periodicity were introduced by Chetverikov and Fazekas (Chetverikov and Fazekas, 2006) (equation 2).
Consider the most interesting measures of moving regions that can easily be implemented in the CNN architecture. Four spatiotemporal measures were proposed by Xu et al. (Xu et al., 2015), which are suitable for the shape, motion, and fractal evaluation of DTs. The pixel intensity measure $\mu_I(p_0, t_0, r_s, r_t)$ is calculated by equation 3, where
I(p, t) = the intensity value of pixel p at time instant t
$r_s$ = a spatial radius
$r_t$ = a temporal radius
The temporal brightness gradient $\mu_B(p_0, t_0, r_s, r_t)$ is a summation of the temporal intensity changes of the DT in a 3D cube. This parameter is defined by a derivative of second order (equation 4). The Laplacian $\mu_L(p_0, t_0, r_s, r_t)$ carries the information about the local co-variance of pixel intensity at point $(p_0, t_0)$ in the spatiotemporal domain (equation 5). The normal flow $\mu_F(p_0, t_0, r_s, r_t)$ is often used in motion estimation of DTs. It measures the motion of pixels along the direction perpendicular to the brightness gradient, e.g., edge motion, and is an appropriate measure for chaotic motion of DTs. This measure can be calculated by equation 6.
The spatial texture layering as well as the type and shape of texels are also important descriptors for preliminary categorization. They can be estimated using the gradient information of successive frames. According to the proposed spatiotemporal features, the following categorized groups were formulated:
1. Category I. Natural particles with periodic movement, like water in a lake, river, waterfall, ocean, pond, canal, or fountain, and leaves and grass under wind at small scales
2. Category II. Natural translucent/transparent non-rigid blobs with randomly changing movement, like smoke, clouds, flame, haze, fog, and other phenomena
3. Category III. Man-made opaque rigid objects with periodic movement, like flags and textile under wind, and leaves and grass under wind at large scales
4. Category IV. Man-made opaque rigid objects with stationary or chaotic movement, like car traffic, birds and fish in swarms, a moving escalator, and a crowd
It is reasonable to design separate CNNs for each category.

DESIGN OF CONVOLUTIONAL NETWORK
Many conventional machine learning techniques have been validated for texture recognition. However, DT recognition poses challenges that can be solved by the use of more advanced techniques, for example, the CNN as a sub-type of discriminative deep neural network. CNNs demonstrate satisfactory performance in processing single 2D images and videos as sets of successive 2D frames. The CNN is a multilayer neural network whose topology in each layer is such that the number of parameters is reduced thanks to the exploitation of spatial relationships, and which is trained with the standard back propagation algorithm. The typical CNN architecture consists of two types of alternating layers: the convolution layers (c-layers), which are used to extract features, and the sub-sampling layers (s-layers), which are suitable for feature mapping. The input image is convolved with trainable filters that produce the feature maps in the first c-layer. Then these pixels, passed through a sigmoid function, are organized into additional feature maps in the first s-layer. This procedure is repeated until the required rasterized output of the network is obtained. The high dimensionality of inputs may cause overfitting. A pooling process, also called sub-sampling, can solve this problem. Usually, sub-sampling is integrated into the 2D filters.
Except for special cases, there is no correlation between the spatial and temporal properties of DTs. Therefore, it is reasonable to introduce parallel spatial and temporal CNNs, called s-CNN and t-CNN, with a final voting over the separate concatenated results. The system's architecture involves three main parts. The first, categorization, part determines the periodic activity in a long-term series of the DT that is at least four times longer than the period length of the oscillations, or establishes that periodicity is absent. The second, convolution, part analyses a short-term series and includes two s-CNNs and one t-CNN for a pair of successive frames. The s-CNNs process the two successive frames, while the t-CNN receives the frame difference as its input. The layers of all CNNs are trained with respect to the category determined in the categorization part. For this purpose, different spatial and temporal filters are tuned. The output features of the two s-CNNs are averaged. The third, voting, part concatenates the spatial and temporal features in order to determine the class of the DT. The detailed block diagram of the proposed architecture is depicted in Figure 1. The specifications of the two s-CNNs and the t-CNN for category I of the DTs are given in Table 1. The implementation of the learning and test processes in the DT recognition system is considered in Sections 4.1 and 4.2, respectively.
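The data flow of the convolution and voting parts described above can be sketched as follows. The two feature extractors are hypothetical stand-ins for the trained s-CNN and t-CNN (simple histograms of gradient magnitudes and of absolute frame differences); only the wiring, i.e. two spatial streams that are averaged, one temporal stream on the frame difference, and the final concatenation, follows the text.

```python
import numpy as np


def spatial_features(frame, bins=8):
    """Stand-in for an s-CNN: normalized histogram of gradient magnitudes."""
    gy, gx = np.gradient(frame.astype(float))
    magnitude = np.clip(np.hypot(gx, gy), 0.0, 255.0)
    hist, _ = np.histogram(magnitude, bins=bins, range=(0.0, 255.0))
    return hist / max(hist.sum(), 1)


def temporal_features(difference, bins=8):
    """Stand-in for the t-CNN: normalized histogram of absolute differences."""
    hist, _ = np.histogram(np.abs(difference), bins=bins, range=(0.0, 255.0))
    return hist / max(hist.sum(), 1)


def describe_pair(frame_a, frame_b):
    """Average the two spatial outputs and concatenate the temporal output."""
    spatial = 0.5 * (spatial_features(frame_a) + spatial_features(frame_b))
    temporal = temporal_features(frame_b.astype(float) - frame_a.astype(float))
    return np.concatenate([spatial, temporal])


# Hypothetical pair of successive 8-bit gray-scale frames.
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(32, 32))
frame_b = rng.integers(0, 256, size=(32, 32))
descriptor = describe_pair(frame_a, frame_b)
```

In the described system, a separate set of such extractors would be selected according to the category found in the categorization part.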

Overview of Learning Process
Suppose that two successive training frames of size m × n pixels are divided into k × k patches, where k is an odd number. In each convolutional layer, one of the filters is applied to each patch. The s-layer provides an estimate of the obtained intermediate results using a sigmoid function. During the learning stage, only high-quality frames are processed in order to tune the parameters of each filter optimally. In the current architecture, the mean filter, the median filter, and the Laplacian filter are applied in the s-CNN, and the four spatiotemporal measures (equations 3-6) are implemented in the t-CNN for the four DT categories. In addition, the optical flow provides information about the local and global motion vectors. After the last layer, the residual patches are binarized and represented as separate histograms. The goal of the pooling layers is to aggregate the separate histograms, improve their representation, and create the output histogram for the voting part. Note that the output histograms from the two s-CNNs are averaged. Each final histogram is associated with the labelled class of the DTs.
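The patch division, binarization, and histogram construction described above can be illustrated as follows. The text does not specify the binarization rule, so this sketch assumes that each pixel of a patch is thresholded against the patch mean and that the resulting bits form one code number per patch; the function names and the non-overlapping patch layout are also assumptions.

```python
import numpy as np


def patch_codes(frame, k=3):
    """Binarize non-overlapping k x k patches into integer code numbers."""
    if k % 2 == 0:
        raise ValueError("k must be odd")
    rows = (frame.shape[0] // k) * k
    cols = (frame.shape[1] // k) * k
    cropped = frame[:rows, :cols].astype(float)
    # Rearrange into one row per patch, pixels in row-major order.
    patches = (cropped.reshape(rows // k, k, cols // k, k)
               .swapaxes(1, 2)
               .reshape(-1, k * k))
    # Assumed rule: a bit is set when the pixel exceeds the patch mean.
    bits = patches > patches.mean(axis=1, keepdims=True)
    weights = 2 ** np.arange(k * k)
    return bits.astype(np.int64) @ weights


def code_histogram(frame, k=3):
    """Normalized histogram with one bin per possible code number."""
    codes = patch_codes(frame, k)
    return np.bincount(codes, minlength=2 ** (k * k)) / codes.size


# A flat frame yields only the all-zero code; one bright corner pixel sets bit 8.
flat_hist = code_histogram(np.full((6, 6), 8.0))
corner = np.zeros((3, 3))
corner[2, 2] = 9.0
corner_codes = patch_codes(corner)
```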

Overview of Test Process
The well-trained CNNs do not need architectural changes during the test process. The input frames are categorized according to the movement and the main spatial features. The histograms are improved and concatenated in the voting part of the system. For recognition, the Kullback-Leibler distance was chosen among other measures, such as the Chi-square distance, the histogram intersection distance, and the G-statistic, as a frequently recommended method for comparing histograms. In addition, the Kullback-Leibler distance, also called a divergence, provided the best results in our previous research on dynamic transparent textures (Favorskaya et al., 2015).
The Kullback-Leibler divergence is adapted for measuring distances between histograms in order to analyze the probability of occurrence of code numbers for the compared textures. First, the probability of occurrence of the code numbers is accumulated into one histogram per image. Each bin in a histogram represents a code number. In the divergence,
h ∈ {1, 2} = the index of the compared histograms
H(·) = a histogram
K = the total number of code numbers
Note that a multi-scale analysis is only required for the DTs from categories III and IV (man-made objects), because natural textures are fractals with a self-organizing structure.
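The histogram comparison by the Kullback-Leibler divergence can be sketched as follows. The code uses the standard, non-symmetric definition, which sums $H_1(k) \log(H_1(k)/H_2(k))$ over the K bins. Whether the system uses exactly this form or a symmetrized variant is an assumption, as are the function names, the smoothing constant `eps`, and the class labels.

```python
import math


def kl_divergence(h1, h2, eps=1e-10):
    """Kullback-Leibler divergence between two histograms with K bins.

    A small constant `eps` is added to every bin (an assumption of this
    sketch) so that empty bins do not cause a division by zero.
    """
    p = [value + eps for value in h1]
    q = [value + eps for value in h2]
    sum_p, sum_q = sum(p), sum(q)
    return sum((a / sum_p) * math.log((a / sum_p) / (b / sum_q))
               for a, b in zip(p, q))


def classify(hist, class_hists):
    """Return the label whose class histogram has the smallest divergence."""
    return min(class_hists,
               key=lambda label: kl_divergence(hist, class_hists[label]))


# Hypothetical class histograms and a test histogram.
class_hists = {"water": [7, 2, 1], "smoke": [1, 2, 7]}
predicted = classify([8, 1, 1], class_hists)
```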

EXPERIMENTAL RESULTS
The DynTex database (Peteri et al., 2010) was used in the experiments. The moving textured objects in the video sequences were detected using the block-matching algorithm and optical flow. One half of the obtained data was used for CNN learning, while the other half was used for CNN testing. Examples of such fragments in color and gray-scale representations are depicted in Figure 2. Experiments show that a generalized CNN designed for all types of DTs has a very complicated architecture in which only separate components of the s-layers and c-layers are used. This leads to the necessity of a preliminary categorization of the DTs, followed by the construction of specific 2D filters for each type of DT. The s-CNN employs the Laplacian, the Gaussian, the Laws energy filter (Laws, 1980), and the Tamura energy features (Tamura, 1978). The t-CNN uses filters that employ the block-matching and optical flow components with a successive decrease of resolution. Then the resulting histograms are built and compared by the Kullback-Leibler divergence as a measure of differences. As an example, the histograms for Category I and Category III of the DTs are depicted in Figure 3. The averaged recognition results for all four Categories are presented in Table 6.

Method                                                Recognition rate, %
Zhao and Pietikainen (Zhao and Pietikainen, 2007b)    92.45
Xu et al. (Xu et al., 2011)                           97.63
Tiwari and Tyagi (Tiwari and Tyagi, 2016a)            85.14
Tiwari and Tyagi (Tiwari and Tyagi, 2016b)            98.57
The proposed method                                   97.45
Table 7. Comparative results

The experiments confirm the efficiency of the proposed method for DT recognition using the designed CNNs.

CONCLUSIONS
The proposed system architecture, including the categorization, convolution, and voting parts, provides very promising results in the DT recognition task. The experiments conducted on the sequences from the DynTex database show the best recognition results for Categories VI and III, with averaged recognition rates of 97.46% and 98.32%, respectively. For the DTs based on man-made opaque rigid objects with stationary or chaotic movement, the errors of the temporal features are high for short-term series, which influences the final result. Also, the samples of these categories usually contain a cluttered background. This means that special attention ought to be paid to the temporal analysis in further investigations.