SEMANTIC LOCATION EXTRACTION FROM CROWDSOURCED DATA

Crowdsourced Data (CSD) has recently received increased attention in many application areas, including disaster management. Convenience of production and use, data currency and abundance are some of the key reasons for this high interest. Conversely, quality issues like incompleteness, credibility and relevancy prevent the direct use of such data in important applications like disaster management. Moreover, the availability of location information in CSD is problematic, as it remains very low on many crowdsourced platforms such as Twitter. In addition, the recorded location mostly relates to the mobile device or user location and often does not represent the event location. In CSD, the event location is discussed descriptively in the comments, in addition to the recorded location (which is generated by means of the mobile device's GPS or the mobile communication network). This study attempts to semantically extract the CSD location information with the help of an ontological gazetteer and other available resources. 2011 Queensland flood tweets and Ushahidi Crowd Map data were semantically analysed to extract the location information with the support of the Queensland Gazetteer, which was converted to an ontological gazetteer, and a global gazetteer. Preliminary results show that the use of ontologies and semantics can improve the accuracy of place name identification in CSD and the process of location information extraction.


INTRODUCTION
Crowdsourced Data (CSD) has recently gained increased attention in many fields. Factors like technological development, improvements in mobile communication and the availability of sophisticated software in the form of apps are supporting this growth. Moreover, production convenience, ready access, free and open availability, currency and abundance of CSD are the key reasons for the growing interest. During critical events like disasters, people use social media platforms (Twitter, Facebook etc.) to communicate with others, as this is the fastest and most convenient way to do so in the modern world. Because of this, the availability of CSD is very high in current disaster events. This CSD would be a valuable resource for disaster management as the data is current and rich in information. However, there are quality issues such as incompleteness, credibility and relevancy, lack of location information and vagueness of the location that is available. Additionally, inherent problems in the structure, documentation and validity of CSD limit its direct application in scientific and technical analysis (Flanagin and Metzger, 2008, Longueville et al., 2010). Researchers have now understood the value of available CSD and are concentrating their efforts on improving its quality.
Geospatial Information Retrieval (GIR) is important and widely used in many application areas like emergency response, transport planning, hydrology and land-use (Battle and Kolas, 2011). To this end, the most popular approach is to use gazetteers for retrieving GI from web pages or online content. Researchers argue that this is not very different from a keyword-based search like that of search engines (Buscaldi and Rosso, 2009). In recent GIR research, semantics are mainly used along with gazetteers and other vocabularies. There are a number of issues pertaining to GIR, and those are discussed in detail in the later sections of this paper. The scope of this paper is to extract the target geography from social media communications using a semantic approach.
The objectives of this paper are to semantically recognize, extract and geo-code the content or target location information from the 2011 Queensland Flood CSD (general public tweets and Ushahidi Crowdmap data) using an ontological gazetteer and other semantic geospatial resources. The paper is structured as follows: section 2 discusses the background along with important studies conducted in the fields of CSD, GIR from CSD, gazetteers and ontologies. Section 3 introduces the methodology used throughout the study. Next, section 4 provides the preliminary results and discussion. Finally, section 5 elaborates on the conclusions and future developments of this project.

CSD, Twitter and Crisis Mapping platforms
Current, reliable and high quality spatial data are crucial for successful disaster management. During disaster management, available data sources are often not optimally configured to ensure effective data management. Disaster management staff have the option to use government maintained authoritative data or other forms of data like CSD. CSD provides the opportunity to use other related data that may have higher levels of currency or greater depth. Authoritative data, however, is often problematic due to lack of currency, completeness, access and availability.
Conversely, CSD is freely available and mostly contains current information about the event concerned.
Disaster related CSD are usually accessible through both desktop and mobile social media platforms (e.g. Twitter, Facebook, Flickr, Foursquare, Ushahidi etc.). They provide a readily available source of real-time and dynamic disaster related information that can address the currency, completeness, access and availability issues pertaining to authoritative data in disaster management. CSD often comprises comments about an incident which are posted on top of base data like Google Maps or OpenStreetMap, and spatial qualities like location information can be missing. CSD creators are generally laypersons and hence the end product may not qualify as spatial data. Interestingly, the base maps used in crowdsourcing may also be developed by the crowd. Often the crowdsourced data can be improved to enhance the quality of the final product. To this end, it is argued that disaster management can be improved by optimising the use of CSD along with authoritative data, incorporating ontology and geospatial semantics.
As indicated previously, CSD can originate from a number of diverse sources: social media like Twitter, Facebook, Flickr and Foursquare, or crowd mapping platforms like Ushahidi. Ushahidi (which means 'testimony' in Swahili) is a crowd mapping platform that was developed to easily capture inputs from people by cell phone or email (Bahree, 2008, Longueville et al., 2010). Even though its original development goal was to report election violence in Kenya, over time its usage has shifted towards natural disaster crisis mapping. The user can report an incident in various forms including SMS, email or web. The most notable advantage is that users can report incidents onsite with the help of a mobile device.
Twitter is a very popular social media platform in which messages are limited to 140 characters. Users pass their messages (tweets) very concisely, sometimes using quite different terminology including abbreviations, modified terms or slang. If the user is skilful and experienced in enabling the location on their mobile device, the messages may include locational data as well. However, in general, location sharing is disabled due to privacy concerns or through the device location settings, and therefore care must be taken when considering Twitter as a geospatial data source (Koswatte et al., 2014).

GIR, Gazetteers and Ontologies
Geospatial Information Retrieval (GIR) is critical in many application domains including emergency response, transport planning, hydrology and land-use (Battle and Kolas, 2011). Most studies attempt to retrieve GI from the web with the help of gazetteers. These approaches have mostly produced limited results due to the lack of clear data definitions. To this end, semantics support the clear specification of the spatial query. The objective of GIR is to geotag web pages based on their content, which involves resolving two types of ambiguity, i.e. geo-geo and geo-non-geo (Amitay et al., 2004). A geo-geo ambiguity occurs when two distinct places have the same name (e.g. Rockville in Queensland and Rockville in the United States), and a geo-non-geo ambiguity occurs when a place name also has a non-geographic meaning (e.g. Forbes is a town in New South Wales and Forbes is also a popular magazine in the USA).
Research on GIR from web content has identified two main types of location, i.e. source and target (Amitay et al., 2004), or reporter and reported location (Koswatte et al., 2014). In this process, the source (or reporter) geography deals with the origin of the page or message, or the physical location of the server or mobile device, whilst the target (or reported) geography relates to the content of the page. For web resources, the source (reporter) location can be further divided into the provider location and the serving location (Wang et al., 2005). With regard to a crisis, three types of location are considered in this CSD research, i.e. a) reporter location, b) incident location and c) content location. The scope of this paper is to extract the target geography (in the terms of GIR from web content) or the content location (in the terms of GIR from CSD) using a semantic approach.
Gazetteers are geospatial dictionaries containing place names and related information like spatial references and feature types (Janowicz and Keßler, 2008, Machado et al., 2011). Many countries have developed and maintain their own gazetteers. Digital online gazetteers like the Alexandria Digital Library Gazetteer (ADL), the Getty Thesaurus of Geographic Names (TGN) and GeoNames are available (Machado et al., 2011). Furthermore, integrated semantic geospatial information retrieval systems are also slowly becoming available. A good example is GeoWordNet (a georeferenced version of WordNet), which integrates GeoNames with WordNet plus the Italian section of MultiWordNet (Giunchiglia et al., 2010, Buscaldi and Rosso, 2009). Gazetteers are widely used in GIR research (Borges et al., 2011, Amitay et al., 2004, Hill, 2000, Souza et al., 2005), but it is often argued that they do not fully support this use, as they have structural limitations, lack intra-urban place names, and hold no records of spatial relationships among elements other than those implied by their proximity based footprints (Machado et al., 2011). Automatic recognition of geographic characteristics from web content remains challenging, and numerous approaches like automatic indexing and georeferencing (Larson, 1996), ontology-driven approaches (Jones et al., 2001, Fu et al., 2005a), semantic query expansion (Delboni et al., 2007, Fu et al., 2005b) and natural language positioning (Delboni et al., 2007), along with gazetteers and geocoding techniques, have been proposed (Borges et al., 2011).
Ontologies are explicit specifications of shared conceptualizations and are key to establishing shared formal vocabularies (Du et al., 2013, Gruber, 1993). They are fundamentally important when dealing with heterogeneous systems and are considered a main pillar of the so-called semantic web. For geospatial systems, ontologies should be conceptualized specifically, taking into account special geographic properties like inherited location and spatial integrity. Geospatial ontologies include geospatial entities, geographic classes and topological relations (Giunchiglia et al., 2010), and describe the conceptual hierarchies and terminological interrelations of the geospatial domain, as well as facts about spatial individuals along with location and geometry information (Du et al., 2013).
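The two ambiguity types can be made concrete with a small sketch. The gazetteer entries and the list of non-geographic senses below are hypothetical illustrations built from the examples in the text; they are not data from this study, and a real system would draw both from much larger resources.

```python
from collections import defaultdict

# Hypothetical gazetteer entries (place name, region), for illustration only.
GAZETTEER = [
    ("Rockville", "Queensland, Australia"),
    ("Rockville", "United States"),
    ("Forbes", "New South Wales, Australia"),
    ("Toowoomba", "Queensland, Australia"),
]

# Hypothetical lookup of names that also carry a non-geographic meaning.
NON_GEO_SENSES = {"forbes": "magazine"}


def build_index(entries):
    """Map each lower-cased place name to the list of regions it occurs in."""
    index = defaultdict(list)
    for name, region in entries:
        index[name.lower()].append(region)
    return index


def classify_ambiguity(name, index, non_geo=NON_GEO_SENSES):
    """Return the ambiguity labels that apply to a candidate place name."""
    key = name.lower()
    regions = index.get(key, [])
    labels = []
    if len(regions) > 1:
        # The same name refers to two or more distinct places.
        labels.append("geo-geo")
    if regions and key in non_geo:
        # The name is a place but also has a non-geographic sense.
        labels.append("geo-non-geo")
    return labels


index = build_index(GAZETTEER)
for candidate in ("Rockville", "Forbes", "Toowoomba"):
    print(candidate, classify_ambiguity(candidate, index))
```

Detecting an ambiguity in this way is only the first step; resolving it requires additional context such as neighbouring place names.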

The 2011 floods in Queensland and the study area
In January 2011, the state of Queensland, Australia experienced one of the largest disaster events in its history. Nearly 70 towns and 200,000 people were affected by severe flooding, which claimed 35 lives and cost over $10 billion. This study analyses citizen involvement in this natural disaster through the data that was collected via the #QLDFloods hashtag based Twitter communications and the Ushahidi based crisis mapping platform. The study area (Figure 1) covers approximately 4000 km² where the majority of tweets and Ushahidi posts originated.

2011 QLD Flood CSD
December 2010 and January 2011 were a very critical period for Queenslanders, with a series of flood events due to a La Niña event. Throughout the flooding, social media, including Twitter and the ABC's Ushahidi based QLD Flood Crisis Map, were busy with communications including severe weather alerts. [...] which is based on the Twitter API and developed using PHP and MySQL. More than 35,000 tweets (based on the #qldfloods hashtag) were sent during 10-16 January 2011, with more than 11,600 of them on 12 January alone. Moreover, more than 15,500 Twitter users participated using the #qldfloods hashtag. During this period, leading accounts included the Queensland Police Service Media Unit (@QPSMedia), ABC News (@abcnews), and the Courier-Mail (@couriermail) (Bruns et al., 2012).
Koswatte et al. (2014) found that location information was available for only 1% of the 2011 QLD Flood tweets.
ABC's Ushahidi based QLD Flood Crisis Map: During the early stages of the flood event, the Australian Broadcasting Corporation (ABC) maintained an interactive map based on the Ushahidi crowdmap platform to gather information related to the 2011 Queensland floods. The public's uptake of the site was quite remarkable, with more than 230,000 site visits over a 24 day period. According to the ABC's statistics, approximately 1,500 reports were received on the site; nearly 500 of these were from the public whilst another 1,000 were generated by ABC moderators. Most reports were made through the online interface; however, a small percentage of reports were made via email, Twitter and SMS. The flood map was most commonly browsed using the Internet Explorer browser (77%) on Windows operating systems (81%). Surprisingly, browsing using mobile devices accounted for less than 5% of total visits (Potts et al., 2011). For mobile users, Ushahidi iPhone and Android apps were available.
Within the ABC's Queensland Flood Crisis Map dataset, there were approximately 700 reports during the period of 9-15 January 2011 that included the location from which they originated. These records were filtered and extracted for further analysis.
Selected samples from both the 2011 Queensland Flood tweets and the Crisis Map data which fell inside the North and South Districts (Figure 1) of Brisbane City, Queensland, Australia were used as input CSD in this study. The study area was selected based on the high density of crisis communications which occurred there. The sample contains 89 tweets, 268 Ushahidi posts and 800 Queensland Gazetteer place name entries, all of which are provider location enabled.

The Research Approach
Figure 2 illustrates the overall research approach. The study used Natural Language Processing (NLP) and annotation techniques incorporating additional resources like gazetteers. The GATE (General Architecture for Text Engineering) software, a robust and scalable open-source Java based tool (Cunningham et al., 2002) developed by the University of Sheffield, United Kingdom, was used for text processing including the semantic processing. GATE system components are termed resources. The three main elements are Language Resources (LRs), Processing Resources (PRs) and Visual Resources (VRs). LRs are entities like lexicons, corpora or ontologies. PRs are parsers, generators or modellers, and VRs represent visualisation and editing components that participate in Graphical User Interfaces (Cunningham et al., 2002).
The first step of this research was to design and develop an ontology set for the Queensland Gazetteer. An ontology schema (Figure 3) was designed for the Queensland Gazetteer based on the OMT-G Gazetteer conceptual schema (Souza et al., 2005) and the OnLocus simplified schema (Borges et al., 2011).
Ontology design and development: Ontologies are key in semantic information processing. An ontology set was developed to convert the general Queensland Gazetteer to an ontological gazetteer, in order to enable the semantic processing of the CSD selected in this study. The Ontology Development 101 guide of Noy and McGuinness (2001) was followed for developing the ontology, which includes:
• ontology class definition
• class arrangement in a taxonomic hierarchy
• slot definition and value description
• value feeding for slot instances
The designed ontology was constructed using GATE's ontology tools, which provide ontology viewing and editing facilities.
Processing and analysis using GATE software: The two datasets were analysed separately using GATE. For the non-semantic analysis, the following Processing Resources (PRs) of the GATE software were used along with the Queensland Place Name Gazetteer: the English Tokenizer, Sentence Splitter, POS tagger and Transducer of ANNIE (A Nearly New Information Extraction system), and GATE's morphological analyser and Document reset. In the semantic analysis phase, the OntoRoot Gazetteer and the Flexible Gazetteer were used along with the above processing resources.
The PRs were organized in the order of Document reset, Tokenizer, POS tagger, Morphological analyser, Gazetteer and then the Transducer for more effective processing and better results.
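The four ontology development steps listed above can be illustrated with a minimal sketch. This is not the schema of Figure 3 and does not use GATE's ontology tools; the class names (`Place`, `Suburb`), slot names and instance values are hypothetical and serve only to show how classes, a taxonomic hierarchy, slots and slot values relate to each other.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """An ontology class with an optional parent and its own slot definitions."""
    name: str
    parent: "OntologyClass | None" = None
    slots: dict = field(default_factory=dict)  # slot name -> expected value type

    def ancestors(self):
        """Names of all superclasses, nearest first."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def all_slots(self):
        """Slots defined on this class plus those inherited from superclasses."""
        lineage = []
        node = self
        while node is not None:
            lineage.append(node)
            node = node.parent
        merged = {}
        for cls in reversed(lineage):  # root first, so subclasses may override
            merged.update(cls.slots)
        return merged


@dataclass
class Instance:
    """An individual of a class, holding values for that class's slots."""
    cls: OntologyClass
    values: dict

    def validate(self):
        """True if every value fills a known slot with the expected type."""
        slots = self.cls.all_slots()
        return all(
            key in slots and isinstance(value, slots[key])
            for key, value in self.values.items()
        )


# Steps 1 and 2: define classes and arrange them in a hierarchy (hypothetical).
place = OntologyClass("Place", slots={"name": str, "latitude": float, "longitude": float})
suburb = OntologyClass("Suburb", parent=place, slots={"postcode": str})

# Steps 3 and 4: slots are declared above; here an instance fills them
# with illustrative values.
toowong = Instance(
    suburb,
    {"name": "Toowong", "latitude": -27.48, "longitude": 152.98, "postcode": "4066"},
)
print(suburb.ancestors(), toowong.validate())
```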
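The effect of running processing steps in a fixed order can be sketched in plain Python. This is not GATE or its API: it is a simplified stand-in, under the assumption that a document is a dictionary, that covers only the reset, tokenisation and gazetteer lookup stages and omits POS tagging, morphological analysis and the transducer. All function names and the sample text are hypothetical.

```python
import re


def reset(doc):
    """Clear any annotations left over from a previous run."""
    doc["annotations"] = []
    return doc


def tokenize(doc):
    """Split the text into word tokens with character offsets."""
    doc["tokens"] = [
        (m.group(), m.start(), m.end())
        for m in re.finditer(r"[A-Za-z]+", doc["text"])
    ]
    return doc


def gazetteer_lookup(doc, place_names):
    """Annotate the longest token sequences that match a gazetteer entry."""
    entries = {tuple(name.lower().split()) for name in place_names}
    longest = max((len(entry) for entry in entries), default=0)
    tokens = doc["tokens"]
    i = 0
    while i < len(tokens):
        # Try the longest possible window first so multi-word names win.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            window = tuple(tok[0].lower() for tok in tokens[i:i + n])
            if window in entries:
                doc["annotations"].append(
                    {"type": "Location", "start": tokens[i][1], "end": tokens[i + n - 1][2]}
                )
                i += n
                break
        else:
            i += 1
    return doc


def run_pipeline(text, place_names):
    """Run the stages in order: reset, tokenise, then gazetteer lookup."""
    doc = {"text": text}
    doc = reset(doc)
    doc = tokenize(doc)
    doc = gazetteer_lookup(doc, place_names)
    return [text[a["start"]:a["end"]] for a in doc["annotations"]]


print(run_pipeline(
    "Water rising fast at Kangaroo Point near Brisbane River",
    ["Kangaroo Point", "Brisbane"],
))
```

The lookup depends on the tokens produced by the earlier stage, which is why the gazetteer has to run after tokenisation in any such pipeline.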
Figure 3. Simplified schema used in ontological Queensland place name Gazetteer

PRELIMINARY RESULTS AND DISCUSSION
This research work is still progressing and only some preliminary analysis has been undertaken thus far. Table 1 shows the annotation accuracy matrix. As explained for GATE's Annotation Diff tool, precision (P) measures the number of correctly identified items as a percentage of the number of items identified, recall (R) measures the number of correctly identified items as a percentage of the total number of correct items, and the F-measure (F) is the weighted harmonic mean of the two. Future research is planned to further improve the designed ontologies along with the ontological gazetteer. More stable and improved results are expected from these modifications.
It is recognised that the results indicate a bias towards the Ushahidi annotation accuracy, as the ontology was developed on the same dataset. However, the annotation accuracy results for the Twitter dataset are encouraging, as that dataset is independent of the ontology development.
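The three measures can be computed as in the following sketch. It assumes strict matching, where an identified item counts as correct only if its span exactly equals a reference span, and it represents annotations as hypothetical (start, end) character offsets; the example values are made up and are not results from this study.

```python
def annotation_scores(key, response, beta=1.0):
    """Precision, recall and F-measure for strict (exact span) matching.

    key      -- reference annotations, e.g. manually marked place names
    response -- annotations produced by the system under evaluation
    beta     -- weight of recall relative to precision (1.0 gives F1)
    """
    key, response = set(key), set(response)
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    denominator = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical spans: three reference place names, four system annotations,
# two of which match a reference span exactly.
reference = {(21, 35), (41, 49), (60, 70)}
system = {(21, 35), (41, 49), (80, 85), (90, 95)}
p, r, f = annotation_scores(reference, system)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

With equal weighting (beta of 1.0) the F-measure reduces to 2PR / (P + R).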
Table 1: Comparison of gazetteer success for Twitter and Ushahidi

CONCLUSION AND FUTURE WORK
In this paper we investigated how to extract the missing location of CSD. In this context we extracted location information from two different sources, the Ushahidi and Twitter platforms. An ontology set was designed and developed to turn the general Queensland Place Name Gazetteer into a semantic gazetteer, which was termed QLDGazOnto in this study. The study is still progressing; the initial results are encouraging and open to further improvement.
In future, the ontology development will be more generalized and controlled. Furthermore, it is planned to examine and resolve the ambiguities of the identified locations. The identified place names will be converted to a true location through hierarchical analysis and by considering the adjacent place names within a selected span. The study also plans to apply the identified process to larger datasets.

Figure 1. Study area and 2011 QLD Flood CSD

Figure 2. Semantic CSD location extraction and geo-tagging
The ANNIE Gazetteer is a global gazetteer used in GATE as the default gazetteer. QLDGazetteer is Queensland's official place name gazetteer, while QLDGazOnto is its ontological version developed in this study. QLDGazOnto was developed with a main focus on the Ushahidi dataset, and the ontological gazetteer accordingly performed best in tagging the Ushahidi dataset. In future developments, it is planned to generalize the ontology set by considering a control dataset. The results in Table 1 indicate that it is more advantageous to use local gazetteers in place name extraction. The ANNIE gazetteer, which is a global gazetteer, provides the poorest results of all. The use of QLDGazetteer alone dramatically improves recall (R), but the other measures are still better with the combined use of ANNIE and QLDGazetteer. Even though the combined use of global and local gazetteers shows some improvements, care needs to be taken not to introduce more geo-geo ambiguities. These ambiguities will be analysed in future versions of this study. There are some indications that the use of semantics would improve the place name extraction of CSD. The use of the ontological gazetteer QLDGazOnto clearly improves the detection accuracy for Ushahidi posts. In the case of Twitter posts, the semantic local gazetteer outperforms the global ANNIE Gazetteer. No significant difference was detected for the combined use of local and global gazetteers.