TOWARDS AN OPEN SOURCE PYTHON LIBRARY FOR AUTOMATED EXPLORATORY SPATIAL DATA ANALYSIS
- ¹ Department of Geography, Geoinformatics and Meteorology, University of Pretoria, Pretoria, South Africa
- ² Department of Statistics, University of Pretoria, Pretoria, South Africa
Keywords: Open Source, Python library, Spatial Statistics, ESDA
Abstract. The exploratory spatial data analysis (ESDA) process refers to the use of various functions to gain an initial understanding of a spatial dataset. These include measures of spatial heterogeneity and spatial autocorrelation. Currently, the ESDA process is repetitive and time-consuming. Additionally, although the results differ between datasets, the way in which they are generated does not change significantly. Results are also generated individually for each variable, which means that they cannot be easily compared or shared.
Automating the ESDA process would therefore have multiple benefits: it would save time, and it would allow the data analyst to keep up with the rapid rate at which data is generated. This paper introduces the first iteration of autoESDA, a Python library that automates the ESDA process and summarises the results in a single report.
In this paper, we present the high-level requirements defined for the implementation of autoESDA. Various dependency libraries are discussed and a high-level overview of the autoESDA workflow is given. The library is then evaluated against the requirements laid out earlier in the study. Semi-structured interviews were carried out, in which participants provided a wealth of feedback and suggestions on how the output report could be improved. Finally, a roadmap of proposed further developments and improvements is discussed.
The first version demonstrates that the automation of ESDA is possible and lays the foundation for further development in this regard. This is an important contribution to understanding spatial data, as it enables the data analyst to keep up with the volume of data that is generated on a daily basis.
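To make the repetition described in the abstract concrete, the following is a minimal sketch of one ESDA measure, global Moran's I for spatial autocorrelation, computed for several variables and collected into a single summary. It is an illustration only and does not reproduce the autoESDA API: the function names `rook_weights`, `morans_i` and `summarise` are hypothetical, the data are synthetic values on a regular grid, and binary rook contiguity is assumed as the spatial weights scheme.

```python
from typing import Dict, List

# Hypothetical helper type: cell id -> ids of neighbouring cells (binary weights).
Neighbours = Dict[int, List[int]]


def rook_weights(n_rows: int, n_cols: int) -> Neighbours:
    """Build rook-contiguity neighbour lists for a regular grid (row-major ids)."""
    neighbours: Neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbours[r * n_cols + c] = adjacent
    return neighbours


def morans_i(values: List[float], neighbours: Neighbours) -> float:
    """Global Moran's I with binary weights: (n / S0) * sum_ij(w_ij z_i z_j) / sum_i(z_i^2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    denom = sum(d * d for d in z)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant variable")
    s0 = sum(len(adj) for adj in neighbours.values())  # sum of all weights
    cross = sum(z[i] * z[j] for i, adj in neighbours.items() for j in adj)
    return (n / s0) * cross / denom


def summarise(variables: Dict[str, List[float]], neighbours: Neighbours) -> Dict[str, float]:
    """Apply the same measure to every variable and gather the results in one place."""
    return {name: morans_i(vals, neighbours) for name, vals in variables.items()}


if __name__ == "__main__":
    rows, cols = 4, 4
    w = rook_weights(rows, cols)
    data = {
        # Values increase from left to right: similar values sit next to each other.
        "gradient": [float(c) for r in range(rows) for c in range(cols)],
        # Alternating 0/1 pattern: every neighbour has the opposite value.
        "checkerboard": [float((r + c) % 2) for r in range(rows) for c in range(cols)],
    }
    for name, value in summarise(data, w).items():
        print(f"{name}: Moran's I = {value:.3f}")
```

In this sketch the gradient variable yields a positive Moran's I and the checkerboard variable a strongly negative one. The point is that the computation is identical for every variable, so looping over variables and gathering the results in one structure is what makes them comparable side by side.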