GEOSPATIAL DATA STREAM PROCESSING IN PYTHON USING FOSS4G COMPONENTS

One viewpoint of current and future IT systems holds that there is an increase in the scale and velocity at which data are acquired and analysed from heterogeneous, dynamic sources. In the earth observation and geoinformatics domains, this process is driven by the increase in the number and types of devices that report location and the proliferation of assorted sensors, from satellite constellations to oceanic buoy arrays. Many of these data will be encountered as self-contained messages on data streams: continuous, infinite flows of data. Spatial analytics over data streams concerns the search for spatial and spatio-temporal relationships within and amongst data “on the move”. In spatial databases, queries can assess a store of data to unpack spatial relationships; this is not the case on streams, where spatial relationships need to be established with the incomplete data available. Methods for spatially-based indexing, filtering, joining and transforming of streaming data need to be established and implemented in software components. This article describes the usage patterns and performance metrics of a number of well-known FOSS4G Python software libraries within the data stream processing paradigm. In particular, we consider the RTree library for spatial indexing, the Shapely library for geometric processing and transformation and the PyProj library for projection and geodesic calculations over streams of geospatial data. We introduce a message-oriented Python-based geospatial data streaming framework called Swordfish, which provides data stream processing primitives, functions, transports and a common data model for describing messages, based on the Open Geospatial Consortium Observations and Measurements (O&M) and Unidata Common Data Model (CDM) standards. We illustrate how the geospatial software components are integrated with the Swordfish framework.
Furthermore, we describe the tight temporal constraints under which geospatial functionality can be invoked when processing high velocity, potentially infinite geospatial data streams. The article discusses the performance of these libraries under simulated streaming loads (size, complexity and volume of messages) and how they can be deployed and utilised with Swordfish under real load scenarios, illustrated by a set of Vessel Automatic Identification System (AIS) use cases. We conclude that the described software libraries are able to perform adequately under geospatial data stream processing scenarios; many real application use cases will be handled sufficiently by the software.


INTRODUCTION
This paper concerns the description of a Python-based data streaming framework called Swordfish that is designed to be used in the transport and processing of streams of data that contain a geospatial or locational component.
We offer a brief introduction to the data streaming paradigm and provide some descriptive examples of data streaming software frameworks, before discussing the nature of geospatial data on streams. We then introduce the Swordfish framework (its architecture, approach to processing and implementation specifics), leading to a discussion on geospatial processing functionality and the Free and Open Source Software for Geospatial components that enable this functionality. Early performance insights are discussed. Finally, some usage scenarios are provided.

General Data Streaming Background
The concept of data streaming systems has long been recognised. In the (Babcock et al., 2002) synthesis and in (Lescovec et al., 2014), a class of systems is identified that processes data arriving in "multiple, continuous, rapid, time-varying data streams" rather than data in sets of persistent relations. These data streams may be infinite or ephemeral and are often unpredictable (Kaisler et al., 2013).
The need for these kinds of systems results from the burgeoning of data arising from numerous sources including (Lescovec et al., 2014), (Pokorný, 2006), (Kaisler et al., 2013), (Stonebraker et al., 2005):
• arrays of sensor networks or earth observing satellites continuously and variably transmitting multiple measurements of environmental parameters
• packets of data generated by network traffic
• social media
• science experiments and model outputs
• monitoring systems (cameras, electronic tolling)
• positions of moving objects (vehicles on roads, vessels at sea, parcels or cargo in delivery process)
• market trading systems, which can peak at several million messages per second, as illustrated by (FIF, 2013)
These data sources can produce very large volumes of data at rapid rates, in a variety of forms and complexities. It is difficult or infeasible to store all these data and analyse them post-acquisition (Kaisler et al., 2013). Data streaming systems exist to process and extract value from such data while it is 'in-motion', with low latency.
Significant computational challenges arise as a result of these data stream characteristics, necessitating methods for ETL (extract, transform and load) of data, sampling strategies, aggregation and stream joining techniques, windowing approaches, stream indexing, anomaly detection, clustering, summarising of streams and many others as described by (Lescovec et al., 2014) and (Agarwal, 2007).
In (Stonebraker et al., 2005), it is argued that stream processing systems must exhibit eight properties:
1. Data should be kept moving: there should be no need to store data before processing and data should preferably be pushed rather than pulled to the processing components.
2. Support for a high-level processing language equipped with stream-oriented primitives and operators such as windows.
3. Resilience to imperfections of streams, such as partial data, out-of-sequence data, delayed or missing data and corrupted data.
4. Outputs should be predictable and repeatable (though as described above, techniques exist to sample and summarise streams of data, perhaps leading to a third quality statement around statistical significance).
5. Ability to store, access and modify state information and utilise such state in combination with live data, without compromising on low latency goals.
6. Mechanisms to support high availability and data integrity, for example through failover systems.
7. Ability to scale or distribute processing across threads, processors and machines, preferably automatically and transparently.
8. Near instantaneous processing and response: provision of a highly optimised execution environment that minimises computations and communication overheads.
These properties provide guidance on the architecture and likely the goals of a data streaming system.

Geospatial Data Streaming Background
A significant amount of the data originating from the sources described previously, such as sensor networks, moving objects and social media has an explicit or implicit location or spatial context that can be utilised as data is processed.
This has some implications for data streaming software frameworks. Firstly, frameworks need to be capable of processing the extra volume of data necessary to describe location or spatial relationships. Secondly, it is important that data streaming components recognise geospatial data in the different forms it manifests in, so that the data can be accessed as efficiently as possible in pursuit of low latency. Thirdly, there needs to be a recognition that a significant number of the offline algorithms and processes that characterise geospatial computation (i.e. algorithms that have full knowledge of their input data) are not appropriate for the continuous, possibly infinite and often incomplete online nature of data streams, as noted by (Zhong et al., 2015). Algorithms and processes here need to deal with data as it arrives and may never have sight of the data again, since the complete data stream is unlikely to be captured in local computer memory.
This last issue hints at a need for a deeper discussion of the classification of geospatial computation functions for streaming data. This is not dealt with here; for the purposes of this article it is enough to observe that some geospatial computations will be more adaptable to a streaming paradigm than others. This is driven by the complexity of the calculation and the amount of state or information completeness that is required by the calculation.
In concrete terms, a process that simply filters data by feature name or ID will be well suited to a streaming paradigm since it exhibits low complexity and no state requirement. A process to transform the spatial reference system of features is also easily fitted to a data stream, even though the process is more complex, since there is no state requirement to handle.
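The two stateless cases above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the message layout (`id`, `lon`, `lat`) and the function names are assumptions rather than the Swordfish data model or API, and the hand-written spherical Web Mercator formula stands in for a projection library such as PyProj so that the example has no dependencies.

```python
import math

# WGS 84 semi-major axis in metres, as used by spherical Web Mercator.
EARTH_RADIUS_M = 6378137.0


def keep_vessel(message, wanted_ids):
    """Stateless filter: the decision uses only the message itself."""
    return message["id"] in wanted_ids


def to_web_mercator(message):
    """Stateless map: reproject lon/lat degrees to spherical Web Mercator metres."""
    lon, lat = message["lon"], message["lat"]
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    out = dict(message)  # copy so the incoming message is left untouched
    out["x"], out["y"] = x, y
    return out


def process(stream, wanted_ids):
    """Apply the filter and then the map to each message as it arrives."""
    for message in stream:
        if keep_vessel(message, wanted_ids):
            yield to_web_mercator(message)


# Hypothetical messages; field names are assumptions for illustration.
messages = [
    {"id": "vessel-1", "lon": 0.0, "lat": 0.0},
    {"id": "vessel-2", "lon": 18.4, "lat": -33.9},
    {"id": "vessel-1", "lon": 180.0, "lat": 0.0},
]
results = list(process(iter(messages), {"vessel-1"}))
```

Neither function keeps anything between calls, so each message can be handled and forgotten as soon as it has been emitted downstream.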
A process to join together two datasets based on a spatial relationship such as feature containment is more difficult or even intractable to implement in a streaming system. The state of both streams needs to be known, since each feature on one stream needs to be compared with every feature on the other stream; furthermore, the individual calculations could be expensive, depending on the complexity of the streamed features. This type of geospatial computation exemplifies the notion of an offline algorithm. However, a geospatial data streaming system arguably should offer this kind of functionality. Stream windowing functions like time-based windows (features for the last 10 minutes) or count-based windows (the last 100 features) offer a way to manage a limited amount of state. A spatial join could be performed on the features in small windows of the data streams, such that only features within the windows are compared to each other. This spatial join process also highlights the importance of spatial indexes on streams: in order to reduce latency and keep data moving, as per the eight properties of stream processing, a spatial index on the features in one window may help to reduce the number of containment calculations executed.
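A minimal sketch of such a windowed spatial join follows. It makes several simplifying assumptions that are not taken from Swordfish: zones are axis-aligned bounding boxes rather than polygons, count-based windows are modelled with `collections.deque`, and a uniform grid index (with a hypothetical cell size `CELL`) stands in for an R-tree such as the one provided by the RTree library.

```python
from collections import defaultdict, deque

CELL = 1.0  # grid cell size in degrees (hypothetical tuning value)


def cells_for_box(box):
    """Yield every grid cell touched by a (minx, miny, maxx, maxy) box."""
    minx, miny, maxx, maxy = box
    for cx in range(int(minx // CELL), int(maxx // CELL) + 1):
        for cy in range(int(miny // CELL), int(maxy // CELL) + 1):
            yield cx, cy


def build_index(zones):
    """Index the zones of one window by the grid cells they overlap."""
    index = defaultdict(list)
    for zone_id, box in zones:
        for cell in cells_for_box(box):
            index[cell].append((zone_id, box))
    return index


def contains(box, x, y):
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def join_windows(zone_window, point_window):
    """Join points to containing zones, testing only index candidates."""
    index = build_index(zone_window)
    pairs, tests = [], 0
    for point_id, x, y in point_window:
        cell = (int(x // CELL), int(y // CELL))
        for zone_id, box in index.get(cell, ()):
            tests += 1
            if contains(box, x, y):
                pairs.append((point_id, zone_id))
    return pairs, tests


# Count-based windows: only the most recent features are retained as state.
zone_window = deque(maxlen=3)
point_window = deque(maxlen=4)
for zone in [("harbour", (0.0, 0.0, 0.5, 0.5)),
             ("anchorage", (5.0, 5.0, 5.5, 5.5)),
             ("channel", (9.0, 9.0, 9.9, 9.9))]:
    zone_window.append(zone)
for point in [("p0", 50.0, 50.0), ("p1", 0.2, 0.3), ("p2", 5.1, 5.9),
              ("p3", 9.5, 9.5), ("p4", 0.7, 0.7)]:
    point_window.append(point)  # "p0" is evicted once the window is full

pairs, tests = join_windows(zone_window, point_window)
```

With these inputs the index limits the join to four containment tests, where comparing every point in the window with every zone would need twelve.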
The geospatial stream processing approach may be deployed in answering a wide variety of geocomputation query types. Two classes of geospatial analysis are illustrative. (Xiong et al., 2004) provide some examples of queries that analyse the spatial relationships between features that change location over time:
• moving queries on stationary objects: petrol stations within a given distance of a moving car
• stationary queries on moving objects: counts of vessels inside a harbour, aeroplanes inside an airspace, cars on a road section
• moving queries on moving objects: the position of icebergs in relation to ship positions
(Zhong et al., 2015) demonstrate spatial statistical calculations over streams to generate spatial grids (for use in fire behaviour models) from point location data from sensor networks.
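As an illustration of a stationary query on moving objects, the sketch below keeps a running count of vessels inside a harbour as position messages arrive. The class name, the message fields and the harbour bounding box are all hypothetical, and a bounding box is used in place of a true harbour polygon to keep the example dependency-free.

```python
class HarbourCounter:
    """Continuously answers: how many vessels are inside the harbour now?"""

    def __init__(self, box):
        self.box = box           # (min_lon, min_lat, max_lon, max_lat)
        self.inside_ids = set()  # the only state: IDs currently inside

    def update(self, message):
        """Process one position message and return the current count."""
        min_lon, min_lat, max_lon, max_lat = self.box
        is_inside = (min_lon <= message["lon"] <= max_lon
                     and min_lat <= message["lat"] <= max_lat)
        if is_inside:
            self.inside_ids.add(message["id"])
        else:
            self.inside_ids.discard(message["id"])
        return len(self.inside_ids)


# Hypothetical harbour extent and vessel positions.
counter = HarbourCounter((18.40, -33.92, 18.46, -33.89))
positions = [
    {"id": "v1", "lon": 18.43, "lat": -33.90},  # v1 enters
    {"id": "v2", "lon": 18.30, "lat": -33.90},  # v2 is outside
    {"id": "v2", "lon": 18.41, "lat": -33.91},  # v2 enters
    {"id": "v1", "lon": 18.50, "lat": -33.90},  # v1 leaves
]
counts = [counter.update(message) for message in positions]
```

The state held is bounded by the number of vessels currently inside the query region, not by the length of the stream.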
In broad terms, a geospatial data streaming framework should provide functionality for efficient structuring, filtering, aggregating, joining, transforming and analysing of the spatial component of data 'in motion'.

Data Streaming Implementations
A number of proprietary and open-source data streaming frameworks and query languages have existed in the last fifteen years. This paper does not intend to enumerate them, a task undertaken by (Jain et al., 2008). Instead, we present here some modern, open source examples of data streaming frameworks that have influenced this work or are illustrative of the data stream processing domain. The frameworks briefly considered here are Storm, Samza, Kafka Streams and Spark Streaming.
In this viewpoint, we briefly describe, for each implementation:
1. Implementation origin: developers and driving use case of the implementation;
2. Data form: the atom of streaming, usually a tuple of values or a message consisting of key-value pairs;
3. Streaming transport: the underlying infrastructure used to move streaming data around;
4. Deployment and Execution infrastructure: the layers of software upon which the framework resides and processing logic can be run;
5. Processing approach: batch/mini-batch processing or per-atom processing;
6. Processing API: the kinds of functionality that are provided for processing data streams;
7. Domain-specific data model: the nature of a domain-specific streaming data model, if present;
8. State Model and fault tolerance: many operations on streams require maintenance (and recovery) of state; streaming systems must be resilient to node failure and minimise or prevent message loss.
These tables should be viewed in terms of the eight stream processing system properties identified above. (Zaharia et al., 2012) This short discussion of some of the features of various data streaming systems illustrates that there exist many approaches to constructing and deploying such a system, with varying levels of complexity and processing styles. It should be noted here that these ecosystems and frameworks primarily target the Java Virtual Machine.

Geospatial Data Streaming Implementations
Similarly to stream processing frameworks, there have been a number of implementations of geospatial data streaming frameworks over the last two decades. This section does not enumerate the various efforts; rather, it highlights a few interesting exemplars.
• PLACE (Mokbel et al., 2005)
• IBM InfoSphere is used by (Zhong et al., 2015) as an infrastructure for supporting the deployment of a framework called RISER. RISER utilises stream processing for ETL of spatio-temporal data and as a spatial analysis engine performing spatial functions (such as interpolation) over sensor network data.

Design Goals and Architecture
Swordfish is intended to provide a non-clustered stream processing software framework for the Python programming environment. Stream processing topologies, along which messages are passed, provide the main Swordfish structure. Nodes (processing units, sources and sinks) and edges (streams) can be distributed across machines, but do not have to be. The implication of a non-clustered architecture, e.g. no default reliance on a Hadoop cluster, is that Swordfish stream processing topologies can be executed anywhere that Python can be installed: from a sensor gateway to a desktop, from a single computer to a network of computers running in a cluster or cloud environment.
Python provides rich functionality for geospatial work, ranging from data translation libraries to machine learning and statistical analysis libraries. Furthermore, Python is a dynamically typed, general purpose programming language, providing great flexibility. Thus, Swordfish can utilise a functional style of programming, common to many of the streaming systems described, yet provide utilities from object-oriented software libraries. Swordfish processing topologies are dynamic, meaning that new nodes and edges can be established at runtime, rather than compiled into the topology. The primary goal is to support spatiotemporal access, transformation and analysis against geospatial data streams from the kinds of systems illustrated previously, such as Automatic Identification System (AIS) positional information from vessels, sensor networks monitoring phenomena like radiation levels, monitoring networks for water and electricity usage, near real-time remote sensing data product feeds and social media feeds. Swordfish has been optimised in a number of places (data structures and streaming function primitives) to enhance performance, primarily by producing Cython code.

Transport:
To date, Swordfish is capable of read/write streaming of data over a wide and growing range of message transports, including Advanced Message Queueing Protocol (AMQP), ZeroMQ, MQTT, Redis, websockets and several in-memory structures. Adapters have been developed to harness social media streaming platforms like Twitter.

Execution:
Swordfish does not require a processing cluster to be present; it can run on a desktop computer as part of a normal Python application. As such, it should be considered as a set of software libraries, implemented according to application needs. Swordfish can be executed in a distributed fashion using the Python code remoting platform called RPyC (RpyC, 2013), but this is not managed as transparently as in the clustered systems. As with most message passing systems, a level of distribution is naturally possible in Swordfish through the message broker protocols that form part of several of the transport implementations. By default, Swordfish uses in-memory transports, but in practice, data are usually received from transport mechanisms such as distributed message queues, e.g. MQTT. Software bindings/adapters to such queuing systems need to be present for Swordfish to utilise them.

Processing:
Swordfish is a message-oriented system with per-message processing semantics: data are processed as soon as they are received; no facility exists yet for batching of messages.

Application Programming Interface:
Swordfish supplies stream processing utility via a set of primitives for describing nodes and edges in a stream topology and a set of primitives for adding actual processing functionality. Nodes are abstract StreamProcessors and include Sources (e.g. a subscription to an MQTT topic, a file, a database), Sinks (places outside of the system where data can be passed to, e.g. a database, web service endpoint, websocket or message broker), and concrete StreamProcessors (generic functionality executors). These nodes are connected by different types of Streams, which are components that abstract the underlying message transport protocol and provide a callback mechanism for 1...n StreamProcessors to receive messages off the stream, i.e. each StreamProcessor registers a callback with a Stream. StreamProcessors usually accept a function that will provide application logic.
Swordfish implements optimised StreamProcessors that allow a MapReduce style of application composition: Maps, Folds, Reduces, Joins, Filters. Maps are generally used to transform or analyse each message and return an output (e.g. reproject the spatial data in each message). Folds are a specialised Map that allows a message to be compared to some representation of state that is passed in at the same time as the message, often the output of the previous message (e.g. a check to see if each message is further east in heading than the previous message). Reduce is a component that aggregates or summarises data and outputs a result, continuously or at a certain time interval, count interval or other delta in the data (e.g. union the geometries of the last 100 messages). Joins allow one stream of data to be joined with another, following SQL style semantics of inner joins and left/right outer joins (e.g. merging data from two streams based on feature ID or spatial location). Filters utilise some function to exclude data that does not meet some requirement from being output downstream (e.g. discard features that are not within a specific area-of-interest). A number of these operators will operate in sliding or tumbling fashion over a count or time window of the data on each stream; as described previously, it is nigh on intractable for a process such as a spatial join to maintain the state of all the messages it has ever had sight of. At the time of writing, Swordfish maintains state only via in-memory structures; no serialised state is managed.
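The callback-based wiring and the operator styles described above can be sketched as follows. Swordfish is not publicly available, so every class name, constructor signature and message field here is an assumption made for illustration and should not be read as the framework's actual API; Joins are omitted for brevity.

```python
class Stream:
    """In-memory edge: delivers each published message to every registered callback."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def publish(self, message):
        for callback in self._callbacks:
            callback(message)


class Map:
    """Transforms each message and publishes the result downstream."""

    def __init__(self, func, out):
        self.func, self.out = func, out

    def __call__(self, message):
        self.out.publish(self.func(message))


class Filter:
    """Publishes only the messages for which the predicate holds."""

    def __init__(self, predicate, out):
        self.predicate, self.out = predicate, out

    def __call__(self, message):
        if self.predicate(message):
            self.out.publish(message)


class Fold:
    """Passes state in with each message; func returns (output, new_state)."""

    def __init__(self, func, initial, out):
        self.func, self.state, self.out = func, initial, out

    def __call__(self, message):
        output, self.state = self.func(self.state, message)
        self.out.publish(output)


class Reduce:
    """Tumbling count window: emits func(window) after every `count` messages."""

    def __init__(self, func, count, out):
        self.func, self.count, self.out = func, count, out
        self.window = []

    def __call__(self, message):
        self.window.append(message)
        if len(self.window) == self.count:
            self.out.publish(self.func(self.window))
            self.window = []


def normalise_lon(message):
    """Map 0..360 longitudes onto -180..180."""
    return {**message, "lon": (message["lon"] + 180.0) % 360.0 - 180.0}


def moved_east(previous_lon, message):
    """Fold function: is this message further east than the previous one?"""
    east = previous_lon is not None and message["lon"] > previous_lon
    return {"id": message["id"], "east": east}, message["lon"]


# Topology: source -> Map -> Filter -> Fold -> Reduce -> sink (a list).
source, normalised, filtered, flagged, summary = (Stream() for _ in range(5))
source.register(Map(normalise_lon, normalised))
normalised.register(Filter(lambda m: m["lon"] >= 0.0, filtered))
filtered.register(Fold(moved_east, None, flagged))
flagged.register(Reduce(lambda w: sum(f["east"] for f in w), 2, summary))
results = []
summary.register(results.append)

for lon in [1.0, 355.0, 2.0, 1.5, 3.0]:
    source.publish({"id": "v1", "lon": lon})
```

Each operator is simply a callable registered on its input stream, so a message flows through the whole topology as soon as it is published, which matches the per-message processing semantics described earlier.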

Geospatial Functionality and Components
Geospatial utility is provided to Swordfish via a package of programming functions that can be invoked and passed through to Swordfish primitives like Maps, Folds, Filters etc. as arguments. These programming functions are provided by a set of well-known Free and Open Source geospatial software libraries, wrapped with code to specialise them for use in the Swordfish streaming environment. All these programming functions understand and can transform, query and populate AttributeDictionaries of the common data model.
These are early results, but they show that Swordfish can stream and process high velocity data streams. The payload here is quite large; numbers improve drastically for a small, non-spatial message that does not use the common data model (e.g. a key-value pair of an integer and a short string). Streaming systems often claim throughputs of > 1 000 000 messages per second, but we feel the numbers we illustrate are more likely to be found in practice when dealing with geospatial data streams. It is notable that throughput slows when geospatial functionality is applied; these results nevertheless show Swordfish as capable of a throughput an order of magnitude greater than that of our highest velocity streams (merged Satellite and Terrestrial AIS receivers). The distributed/multiprocessing capability of Swordfish may reduce any throughput bottlenecks when it is necessary to scale the system.
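One simple way to obtain indicative throughput figures of this kind is to time a per-message function over a batch of simulated messages. The harness below is a sketch under stated assumptions, not the benchmark used for the results in this paper: the function names and message layout are hypothetical, and a haversine distance calculation stands in for library-backed geospatial functionality.

```python
import math
import time


def measure_throughput(func, messages):
    """Return messages processed per second for a per-message function."""
    start = time.perf_counter()
    for message in messages:
        func(message)
    elapsed = time.perf_counter() - start
    return len(messages) / elapsed if elapsed > 0 else float("inf")


def passthrough(message):
    """Baseline: no processing applied to the message."""
    return message


def haversine_km(message):
    """Great-circle distance between two lon/lat points, in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, message)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


# Simulated load: each message is (lon1, lat1, lon2, lat2) in degrees.
simulated = [(18.4, -33.9, 18.4 + i * 1e-4, -33.9) for i in range(10000)]
baseline_rate = measure_throughput(passthrough, simulated)
geospatial_rate = measure_throughput(haversine_km, simulated)
```

The absolute rates depend entirely on the machine and interpreter; only the comparison between the baseline and the geospatial function is informative.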

SUMMARY AND FURTHER WORKS
In this article we discuss geospatial data stream processing and introduce the Swordfish stream processing framework, highlighting some of its spatial capabilities. We indicate that Swordfish offers sufficient throughput capability to allow application developers using Python to build online geospatial systems for a number of potential use cases. A significant effort is needed to expand the geospatial functionality (particularly to move beyond computational geometry and indexing functionality) and perhaps optimise it as necessary. Effort needs to be undertaken to ensure that Swordfish is stable for long running applications, though its early deployment in particular use cases suggests it is reasonably stable. Swordfish is currently limited to holding state in memory; further work may be necessary to develop mechanisms to serialise state, especially in use cases where recovery of the streaming topology state may be necessary. A long term view of Swordfish development is the provision of a streaming data management platform (as is provided by ESRI GeoEvent Extension and the Confluent platform). We are investigating the process for open-sourcing Swordfish; organisational policies enforce a technology evaluation process before open-source licenses can be applied to software and code placed under an open-source management model.

Figure 1: Swordfish Common Data Model example

Table 5: Indicative performance results