DEVELOPING APACHE SPARK BASED RIPLEY’S K FUNCTIONS FOR ACCELERATING SPATIOTEMPORAL POINT PATTERN ANALYSIS
- 1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- 2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
Keywords: Point pattern analysis, Visual analytics, Spatial agglomeration, High performance computing, Spatiotemporal index, Caching, Spatiotemporal data partitioning, Spatiotemporal object serialization
Abstract. Ripley’s K functions are powerful tools for studying the spatial arrangement or spatiotemporal distribution of geographic phenomena and events, and have been used in many fields. However, K functions are compute-intensive because of point-wise distance comparisons, edge correction, and the simulations required for significance testing. Although parallel computing technologies have been adopted to accelerate K functions, previous work has not extended the optimization from the spatial to the space-time dimension. This study presents an acceleration method for K functions built on the state-of-the-art distributed computing framework Apache Spark. Four optimization strategies simplify the calculation procedures and accelerate distributed computing: 1) spatiotemporal indexing based on an R-tree built with the Sort-Tile-Recursive (STR) algorithm, which reduces distance comparisons when retrieving potential spatiotemporally neighbouring points; 2) hash-table-based caching, which reuses spatiotemporal edge correction weights and reduces repetitive computation; 3) spatiotemporal partitioning using a KDB-tree together with a cylinder intersection redundancy strategy, which decreases ghost buffer redundancy in partitions and supports near-balanced distributed processing; 4) customized serialization of spatiotemporal objects and indexes, which lowers the overhead of data transmission. Experiments verify the effectiveness and time efficiency of the proposed optimization strategies and evaluate the overall performance and scalability. Based on the proposed methods, a web-based visual analytics framework has been developed and publicly shared through GitHub. It implements four types of distributed K functions (space, space-time, local and cross K functions), demonstrating its value for geographical and socioeconomic studies.
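To make the cost of the point-wise distance comparisons concrete, the following is a minimal single-machine sketch of a space-time K estimate. It is a baseline for illustration only, not the Spark implementation described in this paper. It assumes the commonly used estimator form K(r, t) = (A · T / n²) · Σ_{i≠j} 1[d_ij ≤ r and |t_i − t_j| ≤ t] with all edge correction weights set to 1. The function name `space_time_k` and its parameters are hypothetical.

```python
from math import hypot


def space_time_k(points, r, t, area, duration):
    """Naive O(n^2) space-time K estimate (illustrative sketch).

    points   : list of (x, y, time) tuples
    r, t     : spatial radius and temporal half-height of the search cylinder
    area     : area A of the study region
    duration : length T of the study period
    Assumption: no edge correction, i.e. every pair has weight 1.
    """
    n = len(points)
    if n < 2:
        return 0.0
    pairs = 0
    for i, (xi, yi, ti) in enumerate(points):
        for j, (xj, yj, tj) in enumerate(points):
            if i == j:
                continue
            # Point j is counted when it lies inside the cylinder
            # centred on point i (radius r, temporal half-height t).
            if hypot(xi - xj, yi - yj) <= r and abs(ti - tj) <= t:
                pairs += 1
    return area * duration * pairs / (n * n)


if __name__ == "__main__":
    sample = [(0, 0, 0), (1, 0, 0), (0, 0, 5), (10, 10, 10)]
    print(space_time_k(sample, r=1.5, t=1, area=100, duration=10))
```

Every ordered pair of points is tested, so the work grows quadratically with the number of points and is repeated for each simulation in a significance test. A spatiotemporal index would limit the inner loop to candidate neighbours, and a weight cache would avoid recomputing edge correction terms.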