CLOUD-BASED INTEGRATION AND STANDARDIZATION OF ADDRESS DATA FOR DISASTER MANAGEMENT – A SOUTH AFRICAN CASE STUDY

Addresses are essential for disaster risk management and response because they are used to locate people affected by a disaster or at risk of being affected. South Africa is vulnerable to disasters. Despite a legislative framework for supporting disaster risk management that meets international standards, implementation falls short due to underfunding, poor interdepartmental coordination and lack of political support. The importance of cross-jurisdictional address data was highlighted by the COVID-19 pandemic of 2020, when the geocoding of positive cases was hindered by the lack of such address data in South Africa. In this paper, we present first results of a cloud-based tool for integrating address data from multiple municipalities into a single address dataset that conforms to the South African National Standard, SANS 1883-2:2017, Geographic information – Addresses: Part 2: Address data exchange. We reviewed and evaluated three cloud platforms for the prototype implementation. The integrated dataset is maintained in the cloud and therefore readily accessible by relevant organizations. At the same time, processing in the cloud can handle changing volumes of data with elasticity, i.e. computing power can be increased or decreased at short notice, as necessary during a disaster response. Furthermore, processing can be automated, thereby mitigating the risk of reduced manpower due to a disaster. Overall, a properly maintained cloud-based tool can result in more efficient use of resources, presenting a viable and interesting alternative for underfunded disaster risk management centres in South Africa and other parts of the world.


INTRODUCTION
Over the past century, the risks posed by disasters have increased dramatically due to climate change and the changing patterns of human settlement (Huppert, Sparks, 2006). Disasters are events in which the sudden impacts caused by natural or anthropogenic agents threaten the normal operation of a society (Quarantelli, 2001). Disaster risk management is aimed at mitigating or avoiding the effects of disasters through continuous management and preparation. It takes place at multiple organisational levels, but generally, local and municipal entities are more engaged with the operational aspect of disaster management while national and international entities oversee cooperation and allocation of resources (Boin, 't Hart, 2010). Well-implemented disaster risk management can alleviate risks, saving lives and property (Lettieri et al., 2009); however, it is highly dependent on current data (Alexander, 2005).
South Africa is noted to be vulnerable to disasters and has organised a national disaster risk management system based on national, provincial and municipal disaster management centres and advisory forums (South Africa, 2002). These are playing a pivotal role following the South African government's declaration of a national state of disaster due to the COVID-19 pandemic in 2020 (South Africa, 2020). The legislative framework for supporting disaster risk management in the country is of a high standard and compliant with international initiatives (Ngqwala et al., 2017). However, this framework is often not followed due to underfunding, poor interdepartmental coordination and lack of political support (van Niekerk, 2014), resulting in the use of outdated technology and personnel with little training (Wentink, van Niekerk, 2015). Regional government typically acts in a coordinating role, with most functions performed at the local or municipal government level (Botha, van Niekerk, 2013).
Addresses are essential for disaster risk management and response because they make it possible to locate the people who are affected or at risk of being affected. In the case of an epidemic or pandemic, addresses can be used to detect emerging disaster hotspots so that targeted location-based responses can be implemented. The importance of addresses became especially apparent during the COVID-19 pandemic. The geocoding of addresses of COVID-19 positive test cases was hampered in South Africa by, among other factors, the lack of cross-jurisdictional address data. The mandate to assign addresses and maintain address data lies with municipalities. Each municipality maintains address data for its area of jurisdiction according to its own specific data model that satisfies the organizational objectives of the municipality.
In this study, we designed and developed a cloud-based tool for integrating address data from multiple municipalities into a single address dataset that conforms to the South African National Standard, SANS 1883-2:2017, Geographic information – Addresses: Part 2: Address data exchange. SANS 1883-2 is a profile of ISO 19160-1:2015, Addressing – Part 1: Conceptual model. Cloud computing involves the utilization of offsite computing resources in a computer network distributed over a large geographic extent, which can survive natural disasters that would disable self-contained data centres by redirecting network traffic and making or moving data backups to servers in safer locations (Mukherjee et al., 2014). Because the integrated dataset is stored in the cloud, it is readily accessible by disaster risk management centres and other relevant organizations. At the same time, processing in the cloud can handle large volumes of data with elasticity, i.e. computing power can be increased or decreased at short notice, as necessary during a disaster. Furthermore, processing can be automated, thereby mitigating the risk of reduced manpower due to a disaster. Overall, properly managed cloud computing can result in more efficient use of resources (Evangelidis et al., 2014), presenting a viable and interesting alternative for underfunded disaster risk management centres.
In this paper, we review three cloud platforms and present first results of the cloud-based tool for integrating address data from two South African municipalities into a standardised data model. The paper is structured as follows: Section 2 provides the review of cloud platforms and explains why we chose Amazon Web Services. In Section 3, we present the design and implementation of the tool. Section 5 offers a brief discussion and concluding remarks.
The term cloud computing emerged in 2007. It simply refers to computing resources that are accessible via the internet, which is commonly represented as a 'cloud' in diagrams (Regalado, 2011; Venters & Whitley, 2012). Cloud computing is the delivery of scalable and elastic computing services, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS), over the internet (Peng et al., 2009). This model typically allows users to pay only for what they use (i.e. a pay-per-use business model), keeping the initial infrastructure investment, as well as operational costs, low but allowing them to scale up as and when their needs change. Venters and Whitley (2012) point out that the common definitions downplay the applications or tools that these platforms provide as part of the computing environment, specifically their analytics and intelligence capabilities. In this paper, we are especially interested in Extraction-Transformation-Loading or ETL, a process that acquires, processes and then transfers data from a source to a database (Vassiliadis, 2009), usually as a prelude to analysis for decision support (Ali & Wrembel, 2017).

Background
In 2020, Gartner published their report on cloud infrastructure and platform services (CIPS) based on the magic quadrant method (Bala et al., 2020). They identified seven vendors and investigated the vendor profile, as well as the strengths and weaknesses of the various CIPS services they offer. Refer to Figure 1. Among other findings, Bala et al. (2020) reported that all vendors offered a public cloud IaaS and PaaS; that all claimed to have high security standards, but with differences in the service-level agreements (SLAs); and that the software marketplaces also differed substantially. Based on their evaluation, they identified three leaders in the field: Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP).

Comparison of selected cloud-based platforms
Based on the report from Gartner, we decided to evaluate AWS, Microsoft Azure and GCP in more detail, specifically looking at the services they offer related to ETL and their pricing structures.
In this section, we present the findings of this evaluation.

Brief overview of the cloud platform vendors
AWS began offering IT infrastructure, now known as cloud computing, in 2006, and quickly became the industry leader, providing over 140 cloud-based services globally (Amazon Web Services, 2020). These services are available on-demand and billed based on pay-as-you-go pricing. AWS has data centres on all continents, except for Antarctica, which allows for high-speed data transfer. AWS provides potential users with numerous examples of use cases across 25 industries, including agriculture, education, financial services, media, and retail. Based on our experience and on Bala et al. (2020), it seems that tools for almost all possible use cases are available. For academia, the AWS Educate portal provides educators and students with cloud career pathways, a large portal of educational material and free credits for student projects.
GCP started with App Engine (which enables users to build and host applications) in 2008 and now offers more than 90 products (Google Cloud, 2020). Currently, GCP has data centres in the Americas, Europe, and the Asia Pacific region; however, there are no data centres in Africa. Kubernetes and TensorFlow, two of GCP's open source applications, are described as market-moving innovations by Bala et al. (2020). With TensorFlow and other GCP products, such as BigQuery and Dataproc, GCP is often associated with big data and data science use cases, but it actually provides services for a larger diversity of use cases.
Microsoft Azure (commonly referred to as Azure) was announced in 2010 as Windows Azure and changed to Microsoft Azure in 2014 (Microsoft Azure, 2020). Similar to AWS, Azure is suitable for all use cases and has data centres on all continents, except Antarctica. Unlike GCP, it supports edge computing. Currently, Azure offers more than 200 services to its customers and boasts strong partnerships with Oracle, SAP and VMware to provide a complete end-to-end set of solutions. The pricing method and infrastructure provided are similar to those of AWS and GCP.
According to the ThousandEyes independent cloud performance benchmark (2019), the speed of the services on the above three cloud platforms is comparable for the purposes of our implementation. The benchmark found that there is a latency of a few seconds and less than 1% packet loss affecting connections from each provider's most accessible cloud data centre to Africa (ThousandEyes, 2019). We therefore ruled out performance as a criterion for our evaluation.

ETL tools
All three vendors provide ETL tools that can be used for our implementation. A brief overview of the ETL tools available in each platform follows.
AWS Glue is a serverless ETL service that allows users to automate data preparation and analytics. AWS Glue provides a visual editor that simplifies the process of creating ETL processes on data stored in AWS S3 or any database that is accessible through a Java Database Connectivity (JDBC) connection. Once the data has been discovered, its metadata is loaded into the AWS Glue Data Catalog and it is then searchable and queryable, and available for ETL.
Our implementation is intended for government and local municipalities in South Africa. As a developing country, South Africa has one of the world's most volatile currencies, thus the cost of cloud-based tools, specified in US dollars, is an essential consideration. We compared the cost of data lake storage, data warehousing, requests, outbound internet traffic, and processing.
Each of the cloud vendors provides new users with free credits, and each also has a free tier. The free tier is not intended for operational purposes, but rather gives potential users the opportunity to test the services and get some hands-on experience before making a decision. The free tier is typically limited to a specific time period, for example, 12 months.
To estimate the monthly cost of using each vendor, we calculated the price per gigabyte (GB) processed and stored. For this calculation, we assumed that if we request one row, this would be equal to 1 kilobyte (KB), an overestimation for the current dataset but this keeps the calculation simple. Thus 1000 database requests equate to 0.001 GB. Table 1 shows a breakdown of the estimated cost for each vendor.
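The unit conversion behind this estimate can be sketched in a few lines of Python. Only the assumption of 1 KB per requested row comes from the text; the function names and the price parameters are hypothetical, and any prices passed in are placeholders rather than the figures from Table 1.

```python
# Assumption stated in the text: one requested row is counted as 1 KB.
KB_PER_ROW = 1
# Decimal units, so that 1000 requests of 1 KB each equal 0.001 GB.
KB_PER_GB = 1_000_000


def rows_to_gb(rows: int) -> float:
    """Convert a number of requested rows to gigabytes."""
    return rows * KB_PER_ROW / KB_PER_GB


def monthly_cost(rows_requested: int,
                 stored_gb: float,
                 price_per_gb_processed: float,
                 price_per_gb_stored: float) -> float:
    """Estimate a monthly cost in US dollars.

    Both price arguments are hypothetical inputs; the real values would
    come from each vendor's published price list.
    """
    processed_gb = rows_to_gb(rows_requested)
    return (processed_gb * price_per_gb_processed
            + stored_gb * price_per_gb_stored)
```

Calling `rows_to_gb(1000)` reproduces the 0.001 GB stated above; `monthly_cost` can then be evaluated once per vendor with that vendor's prices to fill in a comparison table.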

Cloud platform selected for our implementation
Based on the above, we decided that AWS would be most suitable for our implementation. It is the leading cloud platform in the Gartner evaluation, with a wide range of tools and a mature ETL application, AWS Glue. Additionally, AWS is the most cost-effective at current prices.

South African standards for addressing
To facilitate standardisation of addressing, the South African Bureau of Standards (SABS) developed the 'Geographic information – Addresses' set of standards, namely SANS 1883 Parts 1 to 3. Due to their importance in the response to the COVID-19 pandemic, SABS made the SANS 1883 standards freely available (SABS, 2020). Integration into a single dataset therefore requires transformation from an organization-specific data model into a standardized model (Aydinoglu et al., 2011).

Municipal address data used in this study
In South Africa, addresses are assigned by municipalities, which also maintain address data. The address assignment responsibility was delegated to them by the South African Geographic Names Council (Coetzee & Cooper, 2007).
Address data from the City of Johannesburg (CoJ) and the City of Tshwane (CoT) were used for the design and implementation of the tool. Both municipalities are located in the Gauteng province of South Africa, the economic hub of the country, and are classified as metropolitan (Category A) municipalities. There are eight such municipalities in South Africa. The population in CoJ and CoT is estimated at 4.4 million and 2.9 million respectively (StatsSA, 2020).
We received the address data from the municipalities as Esri personal geodatabases. The data included 945 633 and 674 061 addresses for CoJ and CoT respectively. Each address was represented as a line feature with address information provided in the attributes. Some attributes represented identifiers for another feature, e.g., in the CoJ data, the unique identifier for the street associated with the address was provided in the attribute (not the street name itself). Similarly, the full place name was provided in a separate dataset for the CoT data.
Before we could start with the design and implementation of the tool, we reviewed the CoJ and CoT address datasets and metadata to establish how they could be related to the standard data model. This was done by mapping the attributes in each dataset to the corresponding attribute in SANS 1883-2. For the CoJ address data, nine attributes could be directly mapped, and eleven other attributes could be derived from the metadata provided. We could also directly map nine attributes for CoT, but only eight attributes could be derived from the metadata. Refer to Figure 2 for an example of the mapping for the CoT address data.
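A mapping of this kind can be written down as a simple lookup structure. The sketch below is illustrative only: the source field names and the target attribute names are hypothetical placeholders and do not reproduce the actual CoT schema or the attribute names defined in SANS 1883-2. It distinguishes the two cases described above, namely attributes that map directly and attributes whose values are derived from the metadata.

```python
# Hypothetical source field -> hypothetical target attribute (direct mapping).
COT_DIRECT_MAPPING = {
    "STREET_NO": "addressNumber",
    "STREET_NAME": "streetName",
    "SUBURB": "placeName",
}

# Values absent from the individual records but known from the dataset
# metadata; the attribute names are again placeholders.
COT_DERIVED_VALUES = {
    "municipalityName": "City of Tshwane",
    "provinceName": "Gauteng",
}


def apply_mapping(record: dict, direct: dict, derived: dict) -> dict:
    """Return a new record that contains only the target attributes."""
    # Rename directly mapped fields; unmapped source fields are dropped.
    target = {new: record.get(old) for old, new in direct.items()}
    # Add the values derived from metadata, identical for every record.
    target.update(derived)
    return target
```

Keeping the mapping as data rather than code means that supporting a further municipality would, under this design, only require a new pair of dictionaries.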

Mapping the municipal data models to the SANS 1883-2 conformant data model
The mapping is a manual and time-consuming process; however, once the mapping is specified, any other transformations of CoT data can follow the same mapping. In future, one could experiment with machine learning to identify possible attribute mappings for other municipalities, based on data already transformed. In Figure 3, we depict the process followed to transform the municipal input address data to the SANS 1883-2 conformant data model. The ETL component of the tool is automated through AWS Glue's workflow system. A trigger initiates the workflow. Next, a crawler connects to the specified S3 file storage bucket and extracts input data. The transformation follows, specified as an AWS Glue ETL job, and executed for each input record. First, the record is prepared for transformation, e.g. by adding the street name or place name from another dataset. Next, the validity of the input record is checked by verifying that it includes the address components required for either a street address or a site address. Invalid records are not processed further. For a valid record, values are then assigned to each SANS 1883-2 conformant attribute, following the mapping specified in Figure 2. Finally, the record is added to the integrated dataset in the Athena database, from where results are available as comma-separated values (CSV) files.
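The per-record steps described above (prepare, validate, assign) can be sketched in plain Python as follows. This is not the Glue job itself: it does not use PySpark, and every field name, the required-component lists and the target attribute names are assumptions made for illustration, since the exact components required for a street address or a site address are not listed here.

```python
from typing import Optional

# Assumed required components; the actual lists are not given in the text.
STREET_ADDRESS_FIELDS = ("address_number", "street_name", "place_name")
SITE_ADDRESS_FIELDS = ("site_number", "place_name")


def prepare(record: dict, street_names: dict) -> dict:
    """Step 1: resolve the street identifier to a street name."""
    prepared = dict(record)
    prepared["street_name"] = street_names.get(record.get("street_id"))
    return prepared


def has_all(record: dict, fields: tuple) -> bool:
    """True if every listed field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in fields)


def is_valid(record: dict) -> bool:
    """Step 2: a record must qualify as a street address or a site address."""
    return (has_all(record, STREET_ADDRESS_FIELDS)
            or has_all(record, SITE_ADDRESS_FIELDS))


def transform(record: dict, street_names: dict) -> Optional[dict]:
    """Steps 1 to 3 for one record; returns None for an invalid record."""
    prepared = prepare(record, street_names)
    if not is_valid(prepared):
        return None  # invalid records are not processed further
    # Step 3: assign values to (hypothetical) target attributes.
    return {
        "addressNumber": prepared.get("address_number"),
        "siteNumber": prepared.get("site_number"),
        "streetName": prepared.get("street_name"),
        "placeName": prepared.get("place_name"),
    }


def integrate(records: list, street_names: dict) -> list:
    """Collect the transformed form of every valid record."""
    integrated = []
    for record in records:
        transformed = transform(record, street_names)
        if transformed is not None:
            integrated.append(transformed)
    return integrated
```

In a Spark-based job the same logic would typically be expressed as a join against the street lookup followed by a filter and a column selection, rather than as a Python loop.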

Design and implementation of the cloud-based ETL process
The transformation was implemented using only open source software and libraries, and coded in Python, specifically using PySpark. We chose PySpark because the code should then, in theory, run on any cloud platform that can execute PySpark jobs.

Results
After performing the transformation and integration, the resulting dataset contained 1 619 694 records, amounting to the sum of addresses in the CoJ and CoT datasets, i.e. the datasets did not contain any invalid records. We found that the processing time was impressive: the transformation took a fraction of the time that this type of processing would typically take on a general-purpose workstation.
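The record count reported above can be reconciled against the input counts with a small check. The sketch below uses the counts given in the text; the function name is hypothetical.

```python
def dropped_records(input_counts: dict, output_count: int) -> int:
    """Number of input records missing from the integrated dataset."""
    return sum(input_counts.values()) - output_count


# Counts taken from the text: CoJ and CoT inputs, and the integrated output.
INPUT_COUNTS = {"CoJ": 945_633, "CoT": 674_061}
INTEGRATED_COUNT = 1_619_694

# A difference of zero means no record was rejected as invalid.
assert dropped_records(INPUT_COUNTS, INTEGRATED_COUNT) == 0
```

A non-zero result would indicate how many records failed the validity check and would need to be inspected in the source data.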
Our cloud-based tool successfully transformed the address data from the two municipalities into SANS 1883-2 conformant data, and allowed us to integrate data from two heterogeneous sources into a single uniform dataset. However, the prototype has only been tested for two municipalities. We plan to add mappings for other sources and municipalities to allow a wider range of input data models.

CONCLUSION
In this paper, we presented the results of an evaluation of three cloud platforms and a prototype tool for transforming and integrating address data from two South African metropolitan municipalities into the data model described in SANS 1883-2:2017.
The results from the evaluation of the cloud platforms showed that all three platforms would be suitable for developing our cloud-based tool. However, we chose to use AWS products and services for our prototype implementation. AWS is the leading vendor and provides a wide range of services at the lowest cost. Additionally, documentation and resources are extensive. To implement the prototype cloud-based tool, we used mainly the AWS S3 and AWS Glue services. Our initial results indicate that cloud-based tools are suitable, as they provide scalable and elastic infrastructure that can grow as the national address dataset grows over time. The processing time using cloud-based tools would also be significantly shorter than using general-purpose workstations, and thus a large initial investment to purchase a processing server would not be needed.
These results can inform guidelines for improving disaster risk management in South Africa and can bring South Africa one step closer to an integrated national address dataset. In future work we plan to add data from other municipalities, and to eventually also integrate data from multiple countries into a single dataset conformant to ISO 19160-1:2015. The additional data sources will also allow us to extend our mapping and investigate possible machine learning tools to automatically perform the attribute mapping between the input address datasets and SANS 1883-2 and/or ISO 19160-1.