CLUSTERING-BASED APPROACHES TO THE EXPLORATION OF SPATIO-TEMPORAL DATA

As a spatio-temporal data mining task, clustering helps to explore patterns in the data by grouping similar elements together. However, previous studies on spatial or temporal clustering are incapable of analysing complex patterns in spatio-temporal data, for instance concurrent spatio-temporal patterns in 2D or 3D datasets. In this study we present two clustering algorithms for complex pattern analysis: (1) the Bregman block average co-clustering algorithm with I-divergence (BBAC_I), which enables the concurrent analysis of spatio-temporal patterns in a 2D data matrix, and (2) the Bregman cube average tri-clustering algorithm with I-divergence (BCAT_I), which enables the complete partitional analysis of a 3D data cube. The use of the two algorithms is illustrated with a Dutch daily average temperature dataset from 28 weather stations covering 1992 to 2011. BBAC_I is applied to the yearly averaged dataset to identify station-year co-clusters that contain similar temperatures along stations and years, thus revealing patterns along both the spatial and temporal dimensions. BCAT_I is applied to the temperature dataset organized in a data cube with one spatial dimension (stations) and two nested temporal dimensions (years and days). By partitioning the whole dataset into clusters of stations and years with similar within-year temperature variability, BCAT_I explores the spatio-temporal patterns of intra-annual variability in the daily temperature dataset. As such, both BBAC_I and BCAT_I, combined with suitable geovisualization techniques, allow the exploration of complex spatial and temporal patterns, which contributes to a better understanding of spatio-temporal data.


INTRODUCTION
Thanks to advances in data collection and sharing technology, large volumes of spatio-temporal data are becoming available at an unprecedented rate, with various scopes and coverages (Guo 2003, Miller and Han 2009). Extracting meaningful information from these data has become the primary challenge in spatio-temporal analytics. In this situation, data mining is especially useful because it distils information from data and reveals patterns hidden in large datasets.
Clustering is an important task in spatio-temporal data mining. It assigns similar data elements to the same group and thus allows an overview of the data at a higher level of abstraction (Andrienko, Andrienko et al. 2009). However, previous studies have primarily focused on clustering data elements along either the space or the time dimension (Crane and Hewitson 2003, Hagenauer and Helbich 2013). Spatial clustering, for example, groups the locations in spatio-temporal data by the similarity of the attribute values across all timestamps, so the resulting clusters are groups of locations with similar behaviour. Such methods are therefore incapable of analysing complex patterns in spatio-temporal data, for instance concurrent spatio-temporal patterns in 2D or 3D datasets. In this study, we present two clustering algorithms for complex pattern analysis: (1) the Bregman block average co-clustering algorithm with I-divergence (BBAC_I), which enables the concurrent analysis of spatio-temporal patterns in a 2D data matrix, and (2) the Bregman cube average tri-clustering algorithm with I-divergence (BCAT_I), which enables pattern analysis in a 3D data cube. Geovisualization techniques are used to support the representation and understanding of the clustering results (Kraak 2003).

Co-clustering
Co-clustering maps locations (rows) to location-clusters and timestamps (columns) to timestamp-clusters at the same time and groups data elements into location-timestamp co-clusters with similar values along both dimensions of the data (Dhillon, Mallela et al. 2003).
Co-clustering, first proposed by Hartigan (1972), has become increasingly used for pattern analysis in diverse fields (Madeira and Oliveira 2004, Banerjee, Dhillon et al. 2007). In bioinformatics, for example, Cheng and Church (2000), Cho, Dhillon et al. (2004) and Pensa and Boulicaut (2008) all used co-clustering methods for gene expression analysis to explore small subsets of genes and conditions of interest. In text analysis, co-clustering is applied to word-document co-occurrence tables to categorize both documents and words (Takamura and Matsumoto 2002, Dhillon, Mallela et al. 2003). Images and auditory scenes are also co-clustered to facilitate information retrieval in multimedia content analysis (Qiu 2004, Qiu 2004). In recommender systems (e.g. for movies), co-clustering is used to analyse the patterns in the ratings of various users to build a prediction model (Hofmann 2004). Banerjee, Dhillon et al. (2007) generalized the previous co-clustering methods as the Bregman co-clustering algorithm, which allows several distance measures (e.g. Euclidean distance) to optimize the co-clustering results and also allows various co-clustering schemes to preserve different sets of summary statistics in the co-clustering results. However, co-clustering methods are rarely used for exploring spatio-temporal data.

Tri-clustering
Tri-clustering analyses spatio-temporal data organized in a data cube along all three dimensions of the data at the same time. Although tri-clustering methods are relatively new, several studies have used them (Sim, Gopalkrishnan et al. 2013). Zhao and Zaki (2005) developed the TRICLUSTER algorithm to analyze a 3D gene expression dataset. Ji, Tan et al. (2006) developed the CubeMiner algorithm to identify frequent co-occurrences in a gene-sample-time dataset. Xu, Lu et al. (2009) developed a 3D cluster model named S2D3, also to identify gene-sample-time tri-clusters in a microarray dataset. Moreover, Sim, Aung et al. (2010) proposed the MIC algorithm to analyze a 3D financial dataset. However, CubeMiner is only applicable to binary 3D datasets, while S2D3 identifies clusters that are not axis-parallel, which makes the explored patterns harder to understand. Although TRICLUSTER and MIC are applicable to quantitative datasets and identify axis-parallel tri-clusters, they usually identify only a few significant tri-clusters (Sim, Gopalkrishnan et al. 2013) and are therefore incapable of revealing patterns in the whole dataset. In this context, BCAT_I was developed to enable the complete partitional analysis of a 3D data cube.

DATA
Dutch daily average temperatures collected at 28 weather stations from 1992 to 2011 are used to illustrate this study. To define the area affected by each station, a Thiessen polygon map was generated from the stations' geographic coordinates (Figure 1), where each polygon is labelled with the station ID and name. The temperature dataset and coordinates are freely available from the website of the Royal Netherlands Meteorological Institute (https://data.knmi.nl/portal/KNMI-DataCentre.html).
To ensure the positive temperatures required by BBAC_I and BCAT_I, the absolute value of the minimum temperature was added to all the average temperatures in the data matrix and cube. Without loss of generality, the description of the methods is guided by the case study dataset (Dutch daily temperatures at m stations in n years).
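The shift described above can be sketched in a few lines of Python with NumPy. This is only an illustration, not the authors' code: the function name `shift_by_abs_min` and the toy values are hypothetical.

```python
import numpy as np

def shift_by_abs_min(temps):
    """Add the absolute value of the minimum temperature to every value."""
    temps = np.asarray(temps, dtype=float)
    return temps + abs(temps.min())

# Toy cube (hypothetical values): 2 stations x 2 years x 3 days, in degrees Celsius.
cube = np.array([[[-4.0, 0.5, 12.0], [-1.0, 3.0, 15.0]],
                 [[-2.5, 1.0, 13.5], [0.0, 4.0, 16.0]]])
shifted = shift_by_abs_min(cube)
```

When the original minimum is negative, the smallest shifted value is exactly zero, and differences between values are unchanged because the same constant is added everywhere.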

BBAC_I
As one specific case of the Bregman co-clustering algorithm, BBAC_I employs the I-divergence as the distance measure because its superiority has been empirically demonstrated by Banerjee, Dhillon et al. (2007). BBAC_I uses the second co-clustering scheme, which preserves co-cluster averages, because this scheme considers the variations among attribute values within each co-cluster along both dimensions of the data, which matches the purpose of the co-clustering analysis.
To apply BBAC_I, the case study dataset needs to be averaged to create a yearly temperature dataset, which is organized in a 2D matrix where rows are stations and columns are years. By iteratively mapping stations to station-clusters and years to year-clusters, BBAC_I (which regards the co-clustering problem as an optimization issue) minimizes the loss of mutual information between the original and the co-clustered matrices (Banerjee, Dhillon et al. 2007). As a result, it identifies the station-year co-clusters that contain similar temperatures along both the spatial and the temporal dimensions. In this way, BBAC_I enables the analysis of spatial and temporal patterns in a concurrent fashion. For a detailed explanation, refer to Wu, Zurita-Milla et al. (2015) and Wu, Zurita-Milla et al. (2016).
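The iterative procedure can be illustrated with a simplified alternating scheme. The sketch below is not the authors' implementation: it assumes strictly positive values, a random initialization, a fixed number of iterations, and a loss measured as the I-divergence d(x, y) = x log(x/y) - x + y between the matrix and its co-cluster-average approximation. All function names and the random stand-in data are hypothetical.

```python
import numpy as np

def i_divergence(x, y):
    """I-divergence sum(x*log(x/y) - x + y); assumes x and y are strictly positive."""
    return float(np.sum(x * np.log(x / y) - x + y))

def block_means(Z, rows, cols, k, l):
    """Average of Z inside every (station-cluster, year-cluster) block."""
    R = np.eye(k)[rows]   # one-hot station memberships, shape (m, k)
    C = np.eye(l)[cols]   # one-hot year memberships, shape (n, l)
    sums = R.T @ Z @ C
    counts = np.outer(R.sum(axis=0), C.sum(axis=0))
    # Empty blocks get the overall mean as a placeholder value.
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), Z.mean())

def cocluster(Z, k, l, n_iter=20, seed=0):
    """Alternate station and year reassignments; return labels, block means, losses."""
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    rows = rng.integers(k, size=m)
    cols = rng.integers(l, size=n)
    mu = block_means(Z, rows, cols, k, l)
    losses = [i_divergence(Z, mu[np.ix_(rows, cols)])]
    for _ in range(n_iter):
        # Move each station to the station-cluster whose block means fit it best.
        cost = np.array([[i_divergence(Z[i], mu[g, cols]) for g in range(k)]
                         for i in range(m)])
        rows = cost.argmin(axis=1)
        mu = block_means(Z, rows, cols, k, l)
        # Move each year to the year-cluster whose block means fit it best.
        cost = np.array([[i_divergence(Z[:, j], mu[rows, h]) for h in range(l)]
                         for j in range(n)])
        cols = cost.argmin(axis=1)
        mu = block_means(Z, rows, cols, k, l)
        losses.append(i_divergence(Z, mu[np.ix_(rows, cols)]))
    return rows, cols, mu, losses

# Hypothetical stand-in for the shifted daily data: 28 stations x 20 years x 365 days.
cube = np.random.default_rng(1).uniform(1.0, 30.0, size=(28, 20, 365))
yearly = cube.mean(axis=2)   # station x year matrix of yearly averages
rows, cols, mu, losses = cocluster(yearly, k=4, l=4)
```

Because the block average is the best single representative of a block under the I-divergence, neither the reassignment steps nor the recomputation of the averages can increase the loss, so `losses` is non-increasing. The result still depends on the initialization, so several random starts would normally be compared.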

BCAT_I
Developed by Wu, Zurita-Milla et al. (2017), BCAT_I is an extension of BBAC_I that enables the analysis of 3D data cubes. In this case, the dataset is organized into a cube with stations, years and days as its three dimensions. By iteratively and concurrently grouping stations into station-clusters, years into year-clusters and days into day-clusters, BCAT_I identifies tri-clusters with similar temperatures along all three dimensions.
Like BBAC_I, BCAT_I uses the I-divergence as the distance measure to minimize the loss of mutual information between the original cube and the tri-clustered one. It also preserves the tri-cluster averages to consider the variations among temperature values within each tri-cluster. Because the numbers of clusters need to be predefined, similar temperatures might still end up in different tri-clusters. This is solved by regrouping the tri-clusters into an optimal number of temperature groups. For a detailed explanation, refer to Wu, Zurita-Milla et al. (2017).
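A rough sketch of how the same idea carries over to a cube is given below. It is a simplification under stated assumptions, not the published BCAT_I: labels are updated one dimension at a time, values are assumed strictly positive, initialization is random, and the loss is the I-divergence between the cube and its tri-cluster-average approximation. The function names and the random stand-in cube are hypothetical.

```python
import numpy as np

def tri_means(X, labels, sizes):
    """Average of X inside every (station, year, day) tri-cluster."""
    onehots = [np.eye(k)[lab] for lab, k in zip(labels, sizes)]
    sums = np.einsum("syd,sa,yb,dc->abc", X, *onehots)
    counts = np.einsum("a,b,c->abc", *[o.sum(axis=0) for o in onehots])
    # Empty tri-clusters get the overall mean as a placeholder value.
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), X.mean())

def update_axis(X, mu, labels, axis, k):
    """Reassign every slice along `axis` to the cluster with the lowest I-divergence."""
    other = tuple(ax for ax in range(3) if ax != axis)
    cost = np.empty((X.shape[axis], k))
    for g in range(k):
        trial = list(labels)
        trial[axis] = np.full(X.shape[axis], g)   # pretend every slice is in cluster g
        A = mu[np.ix_(*trial)]
        cost[:, g] = (X * np.log(X / A) - X + A).sum(axis=other)
    return cost.argmin(axis=1)

def tricluster(X, sizes, n_iter=10, seed=0):
    """Alternate over the three dimensions; return labels, tri-cluster means, losses."""
    rng = np.random.default_rng(seed)
    labels = [rng.integers(k, size=n) for k, n in zip(sizes, X.shape)]
    losses = []
    for _ in range(n_iter):
        for axis in range(3):
            mu = tri_means(X, labels, sizes)
            labels[axis] = update_axis(X, mu, labels, axis, sizes[axis])
        mu = tri_means(X, labels, sizes)
        A = mu[np.ix_(*labels)]
        losses.append(float((X * np.log(X / A) - X + A).sum()))
    return labels, mu, losses

# Hypothetical small cube: 6 stations x 5 years x 12 days, strictly positive values.
X = np.random.default_rng(2).uniform(1.0, 30.0, size=(6, 5, 12))
labels, mu, losses = tricluster(X, sizes=(2, 2, 3))
```

Every cell of the cube belongs to exactly one tri-cluster, which is what makes the analysis a complete partition of the data.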

The co-clustering results
BBAC_I was used to concurrently analyze spatio-temporal patterns in Dutch yearly temperatures. The 28 stations were clustered into four station-clusters and the 20 years into four year-clusters, resulting in 4 × 4 station-year co-clusters.
The heatmap in Figure 2 provides a straightforward view of the station-clusters, year-clusters and station-year co-clusters, as well as their elements, in the co-clustering results. Year-clusters 1 to 4, which categorize "cold", "cool", "warm" and "hot" years, are arranged along the x-axis together with the years belonging to each year-cluster. The y-axis shows, from bottom to top, the station IDs arranged as station-clusters 1 to 4 with increasing temperature values. Given this arrangement of the axes, the "coldest" co-cluster lies in the bottom-left corner and the "hottest" one in the top-right corner. The temperature values of the co-clusters also increase from left to right and from bottom to top of the heatmap.
The small multiples in Figure 3 display the spatial distribution of the four station-clusters for each of the four year-clusters. Each map of the small multiples shows the spatial patterns of the station-year co-clusters. In each map, the four regions indicated by thick lines correspond to the four station-clusters, and each region corresponds to one of the station-year co-clusters shown in Figure 2. These co-clusters reveal increasing temperature patterns from the northeast to the southwest of the Netherlands and from "cold" to "hot" years. The timeline below the maps displays the temporal distribution of the four year-clusters, with only 3 years belonging to "cool" or "cold" years.

The tri-clustering results
The Dutch daily temperature dataset was subjected to BCAT_I to yield 128 (4 × 4 × 8) tri-clusters, which were regrouped into 20 irregular tri-clusters.
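The regrouping step can be pictured as clustering the tri-cluster averages themselves. The text here does not state how the regrouping is done or how the number of groups is chosen, so the sketch below simply assumes a one-dimensional k-means (Lloyd's algorithm) with the number of groups passed in by hand; `regroup` and the toy averages are hypothetical.

```python
import numpy as np

def regroup(mu, n_groups, n_iter=100):
    """Group tri-cluster averages with 1D k-means (an assumed choice of method)."""
    values = mu.ravel()
    # Start from evenly spaced quantiles of the averages.
    centers = np.quantile(values, np.linspace(0, 1, n_groups))
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        new = np.array([values[labels == g].mean() if np.any(labels == g) else centers[g]
                        for g in range(n_groups)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels.reshape(mu.shape), centers

# Toy 2 x 2 x 2 array of tri-cluster averages (hypothetical values).
mu = np.array([[[0.9, 1.0], [1.1, 10.0]],
               [[10.2, 19.9], [20.0, 20.1]]])
groups, centers = regroup(mu, n_groups=3)
```

Tri-clusters that receive the same group label are merged, which is why the resulting groups need not be rectangular boxes in the cube and are called irregular.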
These irregular tri-clusters are displayed in Figure 4 using a 3D heatmap (centre) and four 2D heatmaps. These heatmaps show that the data cube is fully partitioned, and the discrete values of the irregular tri-clusters indicate that similar temperature values of the original dataset are completely identified along the spatial, temporal and day dimensions, which enables the full exploration of the complex patterns in the data cube.
Figure 4. 3D heatmap (center) and 2D heatmaps (side subplots) to visualize irregular tri-clusters (Wu, Zurita-Milla et al. 2017).
Then, by analysing groups of stations and years that have similar within-year variability in the irregular tri-clusters, BCAT_I explored the spatio-temporal patterns of intra-annual variability in the Dutch daily temperature dataset. The six unique spatial patterns of intra-annual variability that were found are displayed in the small multiples in Figure 5. The temporal patterns of temperature variability within each of the four year-clusters, extracted from the irregular tri-clusters by combining day-clusters with the same spatial patterns, are displayed in the timeline aligned with the corresponding spatial pattern. Figure 5 shows that even though the Netherlands has a relatively small territory, the country has complex spatio-temporal patterns of intra-annual variability in daily temperatures. For most days from 1992 to 2011, the variability in temperature defines two regions in the country: the northeast & centre and the southwest. In both cold years (i.e. 1996, 2010) and hot years (most recent years), there is intense variability in the spring and winter temperatures in the northeast & centre of the country, while in the southwest such variability exists only in spring temperatures. In contrast, summer temperatures over the whole Netherlands are homogeneous for most of the study period.

CONCLUSION
In this study, we presented the use of two clustering algorithms for complex pattern analysis of spatio-temporal data. The first is BBAC_I, which enables the simultaneous analysis of spatial and temporal patterns in a 2D data matrix. By identifying location-timestamp co-clusters that contain similar attribute values along both the spatial and temporal dimensions, this co-clustering algorithm is able to reveal the concurrent space-time varying behaviour in the data. The second is BCAT_I, which enables the complete partitional analysis of a 3D data cube with one spatial, one temporal and any third (e.g. attribute) dimension. By partitioning the whole dataset into tri-clusters that contain similar values along the three dimensions of the cube, this tri-clustering algorithm allows the full exploration of spatio-temporal patterns in the 3D data cube.
Using the Dutch daily average temperature dataset as the case study, both clustering algorithms revealed interesting patterns in the data. On the one hand, BBAC_I was applied to the averaged Dutch yearly temperatures to explore patterns along the weather stations and years. Results show increasing temperature patterns from the northeast to the southwest of the Netherlands and from "cold" years (i.e. 1996 and 2010) to recent years. On the other hand, by organizing the case study dataset in a data cube with one spatial dimension (stations) and two nested temporal dimensions (years and days), BCAT_I enables the exploration of spatio-temporal patterns of intra-annual variability in the dataset. Results show that from 1996 onwards, intense variability of spring and winter temperatures exists in the northeast & centre of the Netherlands, while in the southwest of the country such variability is visible only in spring temperatures.
In summary, both BBAC_I and BCAT_I allow the analysis of complex patterns in spatio-temporal data. Combined with suitable geovisualization techniques, both clustering algorithms enable the exploration of complex spatial and temporal patterns, which contributes to a better understanding of spatio-temporal data.

Figure 1. Thiessen polygon map of the Dutch weather stations.