EXTRACTING HUMAN BEHAVIORAL PATTERNS BY MINING GEO-SOCIAL NETWORKS

The accessibility of positioning technologies such as GPS offers the opportunity to store one's travel experience and publish it on the web. Using this feature in web-based social networks, and treating the location information shared by users as a bridge between the users' network and the location information layer, leads to the formation of Geo-Social Networks. The large amounts of geographical and social data available on these networks are rich sources of information for studying human behavior through data analysis in a spatial-temporal-social context. This paper investigates the behavior of around 1150 users of the Foursquare network by making use of their check-ins. The authors analyzed the metadata associated with the whereabouts of the users, with an emphasis on the type of places, to uncover patterns of venue category usage across different temporal and geographical scales. The authors found five groups of meaningful patterns that characterize regions and identify a number of major crowd behaviors that recur over time and space.


INTRODUCTION
In general, a social network is defined as a communicative framework among its constituents that reconstructs users' social relations in environments such as websites and gives users opportunities to share their ideas, activities, events and interests on the internet. In recent years, membership has increased dramatically in web-based social networks such as Facebook (www.Facebook.com), Twitter (www.Twitter.com) and LinkedIn (www.Linkedin.com). At the same time, portable digital devices such as smartphones have advanced considerably and been widely adopted. Their embedded positioning technologies (such as GPS) and wireless communication (such as Wi-Fi and 3G) offer great potential for various services such as location-aware services. Once social networks, smartphones and location-aware services had each been widely adopted, these components joined forces to form geo-social networks (Karimi, 2009). Geo-social networks allow users to add the location dimension to existing social networks through various methods. Current geo-social networks are divided into three general groups in terms of the way location data are shared: networks based on geo-tagged data, networks based on check-in data, and trajectory-centric networks (Zheng & Zhou, 2011). Check-in based geo-social networks such as Foursquare (www.Foursquare.com) have attracted millions of users, which indicates their broad acceptance and users' growing interest in sharing their location information with their friends. Although check-in data appear poor, as they basically consist of user identifiers, coordinates in space, and time stamps, networks based on check-in data are venue-oriented. Rather than only providing users' geo-coordinates over time, they also provide metadata such as the name of the visited venue, its type, comments about it and even photos.
Hence, valuable information about the users, as well as about the space and time in which they live, termed human behavioural patterns, can be gained even from such basic data by means of data mining methods. Most previous research projects on human behavioural patterns lack semantics, as their data were mostly collected through mobile phone coordinate tracking (Preoţiuc-Pietro & Cohn, 2013). This research aims to study human behavioural patterns in relation to venue types at different temporal scales (e.g. time of day, day of week) and geographical scales (e.g. state, province). In terms of human behavioural patterns, the authors attempt to understand the way people live, what they have in mind and how they interact in their urban environments. For instance, we can find out that people from one part of the globe visit public art places more often, or detect that a downtown area attracts many people on weekends. Based on behavioural patterns, we can also determine the types of activities that are most common in specific urban areas, or localize urban points of interest as places with high activity in a specific context (Martinez, 2012). In other words, recurrent crowd behaviour in urban areas is a critical factor for understanding geographical space. The strong relationship between the characteristics of real-world spaces and the activities of citizens has already been established (Lee et al., 2013). Therefore, significant rules depicting how urban space is being used can be extracted from human life patterns over geo-social networks. For instance, sometimes only a part of a region contributes to a function. Discovering regions of different functions can enable a variety of valuable applications. It can give people a quick understanding of a complex city, and these functional regions would also benefit location selection for business and advertising (Yuan et al., 2012).
To this end, the authors investigate the possibility of extracting behavioral patterns from location-based social networks. In particular, the authors propose the use of place categories, time context, geographical administrative boundaries and an association rule mining algorithm to create fingerprints of users and areas. The remainder of this paper is organized as follows: Section 2 reviews related work on utilizing geo-social networks to uncover human behavioral patterns. Section 3 explains the concept of association rule mining. Section 4 describes the overall process of preparing data for association rule mining. Section 5 presents the experiment that the authors conducted to extract crowd behavioral patterns and urban characteristics from data gathered from Foursquare for Italy. Finally, the conclusion of the paper is presented in Section 6.

RELATED WORK
In recent years, many studies have been conducted on geo-social networks to discover users' characteristics, including social relations, mobility patterns and preferences. For instance, in one of the large-scale research studies carried out on geo-social networks (Li & Chen, 2009), a large amount of data was collected from networks such as Brightkite, and the results were analyzed in terms of users' features, their travel experience, and their relations. In another study, researchers attempted to characterize the relationship between people's cyber interests and their mobility properties (Trestian et al., 2009). Gao et al. (2012) explored the pattern of user check-ins with respect to social-historical ties. For example, they found that users with friends tend to visit similar locations more than users without. In addition, a number of studies in this field aimed to discover geo-spatial characteristics. For example, in a study on extremely large anonymized mobility data, the areas most visited by tourists during the day and the typical times of the visits were identified (Reades et al., 2009). In another study, a probabilistic topic model was adopted to extract urban patterns and recurrent behaviors from location-based social networks (Ferrari et al., 2011). Lee et al. (2013) also proposed a novel method to discover urban characteristics by exploiting common behavioral patterns.

ASSOCIATION RULE MINING
Association rule mining is one of the most important and well-researched techniques of data mining. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items (Han, 2000). Unlike other data mining functions, association is transaction-based. In transactional data, a collection of items is associated with each case. The collection can theoretically include all possible items. In practice, however, only a tiny subset of all possible items is present in a given transaction. The first step in association analysis is the enumeration of item sets. An item set is any combination of two or more items in a transaction. The maximum number of items in an item set is specified by the user. If the maximum is two, all the item pairs will be counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted. Association rules are calculated from item sets. Frequent item sets are those that occur with a minimum frequency specified by the user. The minimum frequent item set support is a user-specified percentage that limits the number of item sets used for association rules. An item set must appear in at least this percentage of all transactions if it is to be used as a basis for rules.
An association mining problem can be decomposed into two sub-problems:
• Finding all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent item sets.
• Calculating the rules that express the probable co-occurrence of items within frequent item sets.
One of the best-known algorithms for association rule mining is the Apriori algorithm (Han & Kamber, 2006). The Apriori algorithm calculates the rules that express probabilistic relationships between items in frequent item sets. For example, a rule derived from frequent item sets containing A, B, and C might state that if A and B are included in a transaction, then C is also likely to be included. The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent.
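The level-wise search for frequent item sets described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the function name `apriori_frequent_itemsets`, the `max_len` parameter and the toy transactions are hypothetical. Single items are also counted here, because their supports are needed when rules are scored.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support, max_len=3):
    """Return {itemset: support} for every item set meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the item set.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current and k <= max_len:
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions (illustrative only, not data from the paper).
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent_sets = apriori_frequent_itemsets(toy, min_support=0.5)
```

With a minimum support of 0.5, the three single items and the three pairs are kept, while {A, B, C} is discarded because it appears in only two of the five transactions.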

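Metrics for Association Rules
• Support: The support of a rule indicates how frequently the items in the rule occur together. It is the fraction of all transactions in the database D that include every item in both the antecedent A and the consequent B:

$$\mathrm{Support}(A \Rightarrow B) = \frac{|\{t \in D : A \cup B \subseteq t\}|}{|D|}$$

• Confidence: The confidence of a rule is the conditional probability of the consequent given the antecedent. It is a measure of the strength of the association rule:

$$\mathrm{Confidence}(A \Rightarrow B) = \frac{\mathrm{Support}(A \Rightarrow B)}{\mathrm{Support}(A)}$$

where Support(A) is the fraction of transactions that contain the antecedent A.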
The task of association rule mining is to find the association rules in a given database that satisfy a predefined minimum support and confidence. However, there are times when both of these measures are high and the rule is still not useful. A third measure is therefore needed to evaluate the quality of a rule. Lift indicates the strength of a rule relative to the random co-occurrence of the antecedent and the consequent, given their individual supports. It expresses the improvement, that is, the increase in the probability of the consequent given the antecedent. Lift is defined as follows:

$$\mathrm{Lift}(A \Rightarrow B) = \frac{\mathrm{Confidence}(A \Rightarrow B)}{\mathrm{Support}(B)}$$

where Support(B) is the fraction of transactions that contain the consequent B. A rule with a lift of less than 1 is not considered useful, because the antecedent then makes the consequent less likely than it would be by chance.
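The three measures can be computed directly from transaction counts. The sketch below is a hedged illustration: the function name `rule_metrics` and the toy check-in transactions are assumptions made for the example and are not taken from the dataset. The example is chosen so that a rule with reasonable support and confidence still has a lift below 1.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(antecedent <= set(t) for t in transactions)   # contain A
    n_b = sum(consequent <= set(t) for t in transactions)   # contain B
    n_ab = sum((antecedent | consequent) <= set(t) for t in transactions)
    if n == 0 or n_a == 0 or n_b == 0:
        raise ValueError("antecedent and consequent must both occur in the data")
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return support, confidence, lift


# Hypothetical check-in transactions: (region, type of day).
toy_checkins = (
    [("Lombardia", "Workday")] * 3
    + [("Lombardia", "Rest day")] * 2
    + [("Lazio", "Workday")] * 5
)
sup, conf, lift = rule_metrics(toy_checkins, {"Lombardia"}, {"Workday"})
```

Here the rule has a support of 0.3 and a confidence of 0.6, yet its lift is 0.75, since "Workday" already appears in 80% of all transactions regardless of the region.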

DATA PREPARATION
In this section, the source of the utilized data is first presented briefly and then the method used for data collection and preparation for data mining process is presented.

Data Set
A Foursquare dataset was used to study behavioral patterns on geo-social networks. Foursquare is one of the most popular

Data Cleaning and Transforming
To make the analysis more effective, the authors reviewed all collected data, completed it with more detailed information and removed less significant fields. Association rule mining cannot work with all types of information; specific formats and values are required to discover facts, principles or relationships.
The model employed to organize data around facts was composed of three dimensions: location type, time and geographical area. Each of these dimensions, or sets of information, was extended to obtain more in-depth information. As shown in Figure 1, each of these categories is also divided into a number of sub-categories.

Time Dimension
Time-dependent data is especially important for obtaining human behavior patterns. The timestamp field of check-ins was transformed to new attributes with more relevant values. Figure 1 illustrates the new types of time fields.
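A transformation of this kind can be sketched as follows. The attribute names, the hour boundaries for parts of the day, the treatment of Saturday and Sunday as rest days, and the month-based season mapping are all assumptions made for illustration; the paper's actual field definitions are those shown in Figure 1.

```python
from datetime import datetime

# Assumed month-to-season mapping (northern hemisphere, meteorological seasons).
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}


def time_attributes(ts: datetime) -> dict:
    """Derive categorical time attributes from a check-in timestamp."""
    hour = ts.hour
    # Assumed boundaries for the parts of the day.
    if 5 <= hour < 12:
        part_of_day = "Morning"
    elif 12 <= hour < 17:
        part_of_day = "Afternoon"
    elif 17 <= hour < 22:
        part_of_day = "Evening"
    else:
        part_of_day = "Night"
    # Assumption: Saturday (5) and Sunday (6) count as rest days.
    day_type = "Rest day" if ts.weekday() >= 5 else "Workday"
    return {
        "part_of_day": part_of_day,
        "day_of_week": ts.strftime("%A"),
        "day_type": day_type,
        "season": SEASONS[ts.month],
    }


sunday_morning = time_attributes(datetime(2013, 7, 14, 9, 30))
wednesday_evening = time_attributes(datetime(2013, 1, 16, 18, 0))
```

Categorical values such as these can be used as items in a transaction, whereas a raw timestamp is almost always unique and therefore useless for counting co-occurrences.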

Geographical Dimension
As described before, each check-in includes geographical coordinates. Based on these coordinates, venues were assigned to areas at three administrative levels: state, province and commute. This way, human behavioral patterns can be studied and the results can be analyzed at each of these three levels.
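One common way to assign a coordinate to an administrative area is a point-in-polygon test. The sketch below uses the ray-casting method on simple polygons without holes; it is an assumption about how such an assignment could be done, not a description of the authors' procedure. The function names and the rectangular "areas" are hypothetical, and real boundaries would come from an administrative boundary dataset.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat, areas):
    """Return the name of the first area containing the point, else None."""
    for name, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangular areas (not real administrative boundaries).
toy_areas = {
    "Area A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "Area B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}
area_of_first = assign_area(1.0, 1.0, toy_areas)
area_of_second = assign_area(3.0, 1.0, toy_areas)
area_outside = assign_area(5.0, 5.0, toy_areas)
```

Running the same test against polygons for each administrative level would give every check-in one label per level.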

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XL-2/W3, 2014 The 1st ISPRS International Conference on Geospatial Information Research, 15-17 November 2014, Tehran, Iran

EXPERIMENT AND INTERPRETATION
The Apriori algorithm was used to discover association rules from the prepared data. To do this, the KNIME software was utilized. KNIME is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualization and reporting. Figure 4 shows a sample of the data prepared as input for the data mining process.

Experiment setup
First, the features of the input data were transformed to a collection type to make them available for association rule mining. In addition, most implementations of association algorithms expose a number of settings. These settings mostly cover the minimum cut-off points for the major evaluation criteria of the output rules, such as support, confidence and lift. In association rule mining, the degree of dependency between items is described by confidence, and since we are looking for strongly correlated items, the confidence criterion is more important than support in this study. The minimum support and confidence levels were set to 0.1% and 40.0% respectively. The minimum lift value was set to 1.0 to discover meaningful patterns.
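The setup can be illustrated with a small sketch that turns prepared records into item collections and keeps only the rules passing the three thresholds stated above. It is a simplified stand-in for the KNIME workflow, not a reproduction of it: only rules with one item on each side are considered, and the function names, field names and toy records are assumptions.

```python
from collections import Counter
from itertools import permutations

MIN_SUPPORT = 0.001     # 0.1 %, as stated in the text
MIN_CONFIDENCE = 0.40   # 40 %, as stated in the text
MIN_LIFT = 1.0


def to_transaction(record):
    """Turn a record into a collection of 'field=value' items."""
    return frozenset(f"{field}={value}" for field, value in record.items())


def mine_pair_rules(transactions, min_support=MIN_SUPPORT,
                    min_confidence=MIN_CONFIDENCE, min_lift=MIN_LIFT):
    """Return rules a -> b (single items) that pass all three thresholds."""
    n = len(transactions)
    item_counts, pair_counts = Counter(), Counter()
    for t in transactions:
        item_counts.update(t)
        # Ordered pairs, so that a -> b and b -> a are scored separately.
        pair_counts.update(permutations(sorted(t), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[a]
        lift = confidence / (item_counts[b] / n)
        if (support >= min_support and confidence >= min_confidence
                and lift >= min_lift):
            rules.append((a, b, support, confidence, lift))
    return sorted(rules, key=lambda r: r[3], reverse=True)


# Hypothetical prepared records (illustrative only).
toy_records = (
    [{"state": "Valle d'Aosta", "day_type": "Rest day"}] * 3
    + [{"state": "Valle d'Aosta", "day_type": "Workday"}] * 1
    + [{"state": "Lombardia", "day_type": "Rest day"}] * 1
    + [{"state": "Lombardia", "day_type": "Workday"}] * 5
)
toy_rules = mine_pair_rules([to_transaction(r) for r in toy_records])
rule_pairs = {(a, b) for a, b, _, _, _ in toy_rules}
```

Prefixing each value with its field name keeps items from different dimensions distinct even when their raw values coincide.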

Results
In this section, the dataset is investigated by extracting crowd behavioural patterns from the observed spatial-temporal rules, and the geo-space is characterized according to them. The mined rules are explained in detail as follows:

Pattern 1: the relation between human activity level and seasons of the year in different regions: Based on some of the extracted association rules (a number of which are presented in Table 2), the rate of check-ins peaks in a particular season of the year in some provinces. This may be due to the geo-social situation of those regions. The geographical distribution of these patterns is presented in Figure 5.

Pattern 2: the frequency of check-ins on workdays is higher than on rest days in all states except one: This pattern is not surprising, as there are more workdays than rest days and so the number of check-ins should generally be higher. The pattern has one exception, which was discovered as a rule in the form of "[Valle d'Aosta] → [Rest day]" with a confidence of about 62%. Aosta Valley is the smallest Italian region and is located at the "hub" of the Alps. The region looks like an island among the mountains and is well known as a tourist destination because of its peaceful atmosphere in summer and its snow in winter. Considering these explanations, rules of this kind can be a useful indicator of tourist districts.
Pattern 3: the temporal-spatial correlation of human activity level: With this kind of pattern, the effect of spatial-temporal correlations on the activity rate can be investigated. The authors were interested in how the crowd behavior rate (represented by check-in frequency) in different commutes is affected by the time of day (morning, afternoon and evening). Figure 6 presents the extracted rules on a map.

Pattern 5: region functionality: Here the authors aimed to discover regions of different functions in urban areas using human behavioral patterns. The function of each region was extracted from the association rules in the form of "Administrative Area → Venue Category" with a confidence of more than 40%. Such a rule implies that most of the activities in the region have the same context, or that the region is well known for a particular functionality. It does not mean that the region contains venues of only one category. Specifically, we filled the regions (the areas at province and commute scale) having similar major categories with the same color, as illustrated in Figure 7. A number of mined rules indicating the correlation between some commutes and sub-categories are presented in Table 4. Furthermore, some association rules expressed the time slot in which a specific functionality of a region peaked, or stated the dominant activity in a region during a particular time interval. A number of these rules are presented in Table 5. These rules can be useful in practice, as they indicate the crowd behaviour of a region's population or its urban characteristics.

Table 5. A number of association rules indicating socio-temporal characteristics of commutes.

CONCLUSION
Geo-Social networks contain a multi-layer data structure including geographical, temporal and contextual information, providing an unprecedented opportunity for studying crowd behavior. Investigating such information can help design new applications that fit more closely with users' daily lives, and thus improve the experience of urban life. In this paper, the authors studied recurrent behaviors by combining temporal information about the whereabouts of users with information on the types of places they visit in different administrative areas. Investigating various properties of users' behavior using massive geo-tagged content on the Foursquare network revealed interesting associations and daily life patterns covering multiple aspects. In addition, the authors attempted to discover crowd behavioral features for different administrative areas and to extract regional characteristics by exploiting common behavioral patterns. The experimental results demonstrate the ability of association rule mining to capture crowd behavioral patterns.