Exploring potential of crowdsourced geographic information in studies of active travel and health: Strava data and cycling behaviour

In developing sustainable transportation and green cities, policymakers encourage people to commute by cycling and walking instead of by motor vehicle. On the one hand, cycling and walking reduce air pollutant emissions. On the other hand, they offer health benefits by increasing people’s physical activity. Earlier studies of the spatial patterns of active travel (cycling and walking) were limited by a lack of spatially fine-grained data. In recent years, with the development of information and communications technology, GPS-enabled devices have become popular and portable. With smartphones or smartwatches, people can record GPS traces of their cycling or walking trips as they move. A large number of cyclists and pedestrians upload their GPS traces to sport social media to share them with other people. These sport social media have thus become a potential source of spatially fine-grained cycling and walking data. Very recently, Strava Metro began to offer aggregated cycling and walking data with high spatial granularity. Strava Metro aggregates a large number of cycling and walking GPS traces of Strava users to streets or intersections across a city. Accordingly, as a kind of crowdsourced geographic information, the aggregated data is useful for investigating spatial patterns of cycling and walking activities, and thus has high potential for understanding cycling or walking behaviour at a large spatial scale. This study is a first step in demonstrating the usefulness of Strava Metro data for exploring cycling or walking patterns at a large scale.


INTRODUCTION
By enhancing physical activity, active travel (cycling or walking) produces health benefits (Forsyth et al., 2015; Oja et al., 1998; Oja et al., 2001; Pucher et al., 2010; Wen and Rissel, 2008). In earlier studies that used traditional data collection methods, research on the role of cycling for health through physical activity was limited by the lack of information on where bicyclists ride (Griffin and Jiao, 2015). Specifically, travel survey data tends to have a low spatial granularity, as its geography level is usually the census tract, whilst traffic count data has a high spatial granularity but a low spatial coverage, as count points are usually located on major roads rather than minor roads. In recent years, GPS-enabled mobile devices, such as smartphones and smartwatches, have allowed individuals to record their cycling GPS traces with fine spatial granularity (Jestico et al., 2016; Broach et al., 2012; Casello and Usyukov, 2014; Hood et al., 2011). In the era of Big Data, the large volume of cycling traces generated by individuals is becoming a potential data source for studies of travel and health (Prins et al., 2014; Duncan et al., 2009; Dill, 2009; Griffin and Jiao, 2015; Sun and Mobasheri, 2017).
Recently, Strava, a popular platform dedicated to tracking users' cycling, walking, running and hiking activities, has been gaining attention from both researchers and planners after it launched a data service called Strava Metro. Millions of users upload their rides, walks, runs and hikes to Strava each week (Strava Metro, 2016). To protect user privacy, Strava Metro anonymizes users' traces and aggregates them to the streets of each city. Strava Metro data has high potential in a wide range of applications, including mapping cycling activities over cities (Jestico et al., 2016), assessing effects of environmental factors on cycling behaviour (Griffin and Jiao, 2015; Heesch et al., 2016) and assessing air pollution exposure during cycling (Sun and Mobasheri, 2017). Moreover, by comparing cyclist counts from Strava data with manual counts at count stations, some studies have found that Strava Metro data is a good representation of the cycling population (Jestico et al., 2016; Herrero, 2016). Owing to its high spatial granularity and large spatial coverage, Strava Metro therefore provides an opportunity to depict cycling behaviour. This study aims to demonstrate the usefulness of Strava Metro data in depicting cycling behaviour over a city by taking account of cycling activities and daytime population. Moreover, this study could help policymakers to prioritise investment in bicycle infrastructure in the areas where cyclists are likely to go.

MATERIALS AND METHODS
In this section, the research data and methods are presented. Specifically, subsection 2.1 introduces the research data, and subsection 2.2 introduces the approach to investigating spatial patterns of cycling behaviour.

Research Data
The Strava Metro dataset (Urban Big Data Centre, 2016) contains 287,833 cycling activities recorded in 2015 within the Glasgow Clyde Valley Planning area (Glasgow City and seven contiguous council areas). The dataset contains three subsets in three different formats: Streets, Origin-Destination and Nodes (Strava Metro, 2015). This study uses the Nodes set, which was created from a street network extracted from OpenStreetMap. Specifically, the Nodes set contains all nodes of the street network, and each node represents an intersection of streets (see Figure 1). Table 1 lists the attributes of nodes, including the count of cycling activities (regardless of unique riders) at the node (street intersection) at a specific time. Note that the temporal granularity is at the minute level (Strava Metro, 2015).

Table 1. Fields in the Nodes file (Strava Metro, 2015).

Additionally, the dataset contains a file that offers demographics of the cycling trips (see Table 2).

Investigation of spatial patterns
This study explores spatial patterns of cycling behaviour over a city by identifying spatial clusters of cycling activities. To account for the background population, it uses the ratio of cycling activities to daytime population to identify clusters of high-density cycling activities. Specifically, an improved AMOEBA (A Multidirectional Optimum Ecotope-Based Algorithm) developed by Duque et al. (2011) is used to identify clusters with a high ratio of cycling activities to daytime population. The study then associates the clusters with local environmental characteristics such as land use type. As the population data is available at the census area level, the ratio of cycling activities to daytime population is also calculated at the census area level.
First, this study defines the ratio of cycling activities to daytime population (RCADTP) within an area (census output area). For an area i, RCADTP is computed as

$$\mathrm{RCADTP}(i) = \frac{N_{CA}(i)}{P_{DT}(i)} \qquad (1)$$

$$N_{CA}(i) = \sum_{j \in N_i} n_{CA}(j) \qquad (2)$$

where $N_{CA}(i)$ is the number of cycling activities in area i, $P_{DT}(i)$ is the daytime population in area i, $N_i$ is the set of nodes located within area i, and $n_{CA}(j)$ is the number of cycling activities at node j.
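As an illustration of the RCADTP definition above, a minimal Python sketch follows. It assumes that every node has already been assigned to a census output area; the names `node_counts`, `node_area` and `daytime_pop`, the toy values, and the choice to skip areas with zero daytime population are all hypothetical and are not taken from the Strava Metro or census files.

```python
from collections import defaultdict


def rcadtp(node_counts, node_area, daytime_pop):
    """Return {area: ratio of cycling activities to daytime population}.

    node_counts: {node_id: annual count of cycling activities at the node}
    node_area:   {node_id: id of the census output area containing the node}
    daytime_pop: {area_id: daytime population of the area}
    """
    # Sum node counts over the nodes located in each area.
    activities = defaultdict(int)
    for node, count in node_counts.items():
        area = node_area.get(node)
        if area is not None:  # nodes outside every area are ignored
            activities[area] += count

    # Divide by daytime population; areas with no population are skipped
    # (an assumption of this sketch, not a rule stated in the paper).
    ratios = {}
    for area, pop in daytime_pop.items():
        if pop > 0:
            ratios[area] = activities[area] / pop
    return ratios


# Toy example with hypothetical values.
node_counts = {"n1": 120, "n2": 80, "n3": 50}
node_area = {"n1": "A", "n2": "A", "n3": "B"}
daytime_pop = {"A": 400, "B": 1000, "C": 0}
ratios = rcadtp(node_counts, node_area, daytime_pop)
```

In this toy case, area A receives 200 activities for a daytime population of 400, and area C is left out because its population is zero.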
The improved AMOEBA algorithm developed by Duque et al. (2011) is used to identify clusters of high RCADTP. The algorithm suits this task because it can classify a large number of areas and identify irregularly shaped clusters. The following brief description is based on Duque et al. (2011). Essentially, a region, or ecotope, is a spatially contiguous set of areas. The value of the $G_i^*$ statistic measures the level of clustering of an attribute x around an area i. Suppose AMOEBA is run on a study region with N areas and an attribute x with elements $x_i$, where $x_i$ is the value of x at area i. Let M denote this set of areas, let $\bar{x}$ and S be the mean and the standard deviation of the attribute x, and let R be a sub-region of M with n areas. Duque et al. (2011) rewrite the formulation of $G_i^*$ as follows:

$$G_i^* = \frac{\sum_{j \in R} x_j - n\bar{x}}{S\sqrt{\dfrac{Nn - n^2}{N - 1}}} \qquad (3)$$

Thus $G_i^*$ depends on the areas in the region R and on the parameters N, $\bar{x}$ and S, which are obtained from the areas in M. A positive (negative) and statistically significant value of the $G_i^*$ statistic indicates the presence of a cluster of high (low) values of attribute x around area i. AMOEBA therefore identifies high-valued or low-valued ecotopes (regions) by looking for subsets of spatially connected areas with a high absolute value of the $G_i^*$ statistic. Only one parameter, the significance level threshold, is required to run the AMOEBA algorithm. The threshold was set to 0.01, so only clusters with a p-value below 0.01 are treated as statistically significant.
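To make the role of N, n, the mean and S concrete, the short Python sketch below evaluates the statistic for a given region R. It assumes the standard Getis-Ord form with binary weights and a population standard deviation; the function name `g_star` and the toy data are hypothetical. The sketch only evaluates the statistic and does not implement the AMOEBA region search or the significance test.

```python
import math


def g_star(values, region):
    """Getis-Ord G* for a region, assuming binary weights.

    values: attribute values x_i for all N areas of the study region M
    region: indices of the n areas that form the sub-region R
    """
    N = len(values)
    n = len(region)
    if not 0 < n < N:
        raise ValueError("region must be a non-empty proper subset of the areas")
    mean = sum(values) / N
    # Population standard deviation of x over all N areas (an assumption).
    S = math.sqrt(sum((x - mean) ** 2 for x in values) / N)
    if S == 0:
        raise ValueError("attribute has zero variance")
    # (sum of x over R - n * mean) / (S * sqrt((N*n - n^2) / (N - 1)))
    numerator = sum(values[i] for i in region) - n * mean
    denominator = S * math.sqrt((N * n - n * n) / (N - 1))
    return numerator / denominator


# Toy example: areas 0-2 hold high values, areas 3-7 hold low values.
x = [9.0, 8.0, 10.0, 1.0, 2.0, 1.0, 2.0, 1.0]
high = g_star(x, [0, 1, 2])
low = g_star(x, [3, 4, 5, 6, 7])
```

Here `high` is positive and `low` is negative, which matches the interpretation of the sign given above. Because the two regions are complements of each other, the two values have the same magnitude.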

RESULTS AND DISCUSSION
This section presents the empirical results for the study area and discusses them.

Spatial patterns of cycling behaviour
First, the annual total of cycling activities at each node is calculated by aggregating the numbers of cycling activities recorded at different times throughout 2015. Second, the nodes are overlaid on the census output area boundaries. The total number of cycling activities and the RCADTP of each census output area are then calculated according to Equations (1)-(2). Figure 4 maps the number of cycling activities in census output areas. Areas with high-density cycling activities are situated around the city centre. In the AMOEBA algorithm, an observation is the RCADTP of an area (census output area). AMOEBA is run using ClusterPy (RiSE group). The algorithm identifies statistically significant clusters of high values and clusters of low values. Figure 5 maps the clusters of high RCADTP. In the top map, clusters of high and low values represent clusters of high RCADTP and low RCADTP respectively. This study then associates clusters of high RCADTP with local environmental characteristics, such as main land use types, by overlaying the clusters on basemaps such as Google Maps and OpenStreetMap. The overlay shows that clusters of high RCADTP mainly surround green spaces such as parks and gardens, as well as the river crossing the city (see the bottom map in Figure 5). Strava cyclists are thus likely to go to green spaces and the riverside, which implies that a large portion of Strava cycling trips tend to be recreational.
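The first two processing steps (annual aggregation per node and the overlay of nodes on census output areas) can be sketched in plain Python as follows. The record layout `(node_id, timestamp, count)`, the function names, and the toy coordinates are hypothetical and do not reproduce the actual field names of the Nodes file. The overlay is illustrated with a simple ray-casting point-in-polygon test; in practice a spatial join in a GIS package would typically be used instead.

```python
from collections import Counter


def annual_totals(records):
    """Sum minute-level counts per node; records hold (node_id, timestamp, count)."""
    totals = Counter()
    for node_id, _timestamp, count in records:
        totals[node_id] += count
    return totals


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Toggle when a horizontal ray from (x, y) crosses the edge (j, i).
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def assign_nodes(node_xy, areas):
    """Map each node to the first area whose polygon contains it."""
    assignment = {}
    for node_id, (x, y) in node_xy.items():
        for area_id, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                assignment[node_id] = area_id
                break
    return assignment


# Toy example with hypothetical records, node coordinates and area polygons.
records = [
    ("n1", "2015-01-05 08:01", 2),
    ("n1", "2015-06-10 17:30", 3),
    ("n2", "2015-03-02 12:00", 1),
]
node_xy = {"n1": (0.5, 0.5), "n2": (1.5, 0.5)}
areas = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

totals = annual_totals(records)
assignment = assign_nodes(node_xy, areas)

# Total number of cycling activities per census output area.
area_totals = Counter()
for node_id, total in totals.items():
    if node_id in assignment:
        area_totals[assignment[node_id]] += total
```

The resulting per-area totals are the numerators of the RCADTP ratio; dividing them by the daytime population of each area gives the observations passed to the clustering step.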

Discussion
The results suggest a policy implication: improving bicycle infrastructure in clusters of high RCADTP could increase road safety for cyclists and attract more recreational cyclists. Nevertheless, this study has some limitations. First, there is representativeness bias in the cycling trips. The population structure (gender, age and other socio-economic characteristics) of Strava cyclists is likely to differ from that of regular cyclists. As young people are more active on social media, older cyclists and pedestrians are likely to be under-represented among Strava users. In addition, some users upload a large proportion of their cycling or walking trips, whilst others upload only a small proportion, so the actual trips of the latter are under-represented in the Strava data. Second, although Strava holds the original GPS traces of rides and walks, it offers only aggregated data to researchers because of privacy risks. The original GPS trace data has greater potential than the aggregated data. Ideally, this study would select the GPS traces of rides created by a number of Strava users who form a cohort, which would enable a cohort study of cyclists in a city.

CONCLUSIONS
This study demonstrates the usefulness of Strava Metro data in depicting cycling behaviour over a city. The sample of Strava cyclists is potentially biased in age, and probably in income or education. Future work will address several aspects to enhance this study. First, the effect of these potential biases on the fitness for use of Strava Metro data needs to be investigated. Second, as it is expensive and time-consuming to conduct a travel survey every year, Strava Metro data offers a good opportunity to explore annual variations in cycling and walking, which could be used to roughly evaluate the actual effects of policies or interventions on modal shift from inactive travel (motorized vehicles) to active travel (cycling or walking). Third, although Strava offers only aggregated data to researchers because of privacy issues, it may still be possible to release the original GPS traces of some Strava users. As some Strava users are probably willing to make their traces public and available for research, Strava might ask users whether they consent to releasing their original GPS traces. If some original GPS traces were available, Strava data would have greater potential in studies of active travel and health.
In this study, daytime population is used as the background population. The daytime population data was downloaded from Scotland's Census (2016). The geography level of the daytime population data is the census output area. Daytime population is estimated from the 2011 census data. Specifically, the daytime population is an estimate of the population of an area during the working day. It includes everybody who works or studies in the area, wherever they usually live, and all respondents who live in the area but do not work or study. People who work or study mainly at or from home, or who do not have a fixed place of work or study, are included in the area containing their home address. The daytime population includes shift and night workers such as hospital staff and security guards. Figure 3 maps the density of daytime population at the census output area level. Areas with a high density of daytime population are not concentrated around the city centre.

Figure 1. Nodes and edges of Strava Metro data (Basemap: OpenStreetMap, licensed under the Open Database License).

Figure 2. Census output areas in Glasgow.

Figure 3. Density of daytime population in Glasgow (Basemap: OpenStreetMap, licensed under the Open Database License).

Figure 4. Number of cycling activities in census output areas.

The demographics file (Table 2) reports average trip distance, average trip time, and the user base structure by sex and age. There are over 280 thousand cycling trips contributed by over 10 thousand cyclists. Although this dataset has a large user sample, the recorded annual cycling frequency of Strava users appears to be much lower than their real cycling frequency: on average, each cyclist logged 21 cycling trips in 2015. Unsurprisingly, male cyclists outnumber female cyclists; the number of male cyclists is five times the number of female cyclists. The largest age group of male cyclists is 35-44, whilst the largest age group of female cyclists is 25-34. Overall, almost half of the cycling trips were contributed by users aged 25-44 (the 25-34 and 35-44 groups). Additionally, a large portion of trips are recreational trips (Strava Metro, 2015). Therefore, the majority of the Strava users are likely to be young and sporty cyclists.

Table 2. Demographics of cycling trips of Strava users in 2015.