A VECTOR ANALYTICAL FRAMEWORK FOR POPULATION MODELING

We propose a vector alternative to the typical raster-based population modeling framework. Compared with rasters, vectors are more precise, can hold more information, and are more conducive to areal constructs such as building and parcel outlines. While rasters have traditionally provided computational efficiency, much of that efficiency erodes at finer resolutions, and computational resources are more plentiful today. Herein we describe the approach and implementation methodology. We also describe the output data stack for the United States and provide examples and applications.


INTRODUCTION
High resolution mapping of human populations is often achieved through the disaggregation of aggregate counts (e.g. census tabulations) from tabulation areas (source zones) to smaller areas (target zones), with the aid of covariate spatial data characterizing the natural or built environment (e.g. land cover/use, building footprints) with some known or presumed functional relationship with population density. Source zones and natural/built environment data, found in a variety of raster and vector formats and resolutions, are often converted to a common raster resolution for analysis (Lloyd et al., 2017; Mennis, 2003; Bhaduri et al., 2007; Freire et al., 2016). This approach is computationally efficient at coarse resolutions, and existing software and methods facilitate modeling for those with an understanding of raster-based spatial analysis (Leyk et al., 2019), but it has potential shortcomings due to limitations of raster data formats. When compared to a vector data model, rasters are less precise, usually hold less information, and are less conducive to smaller area constructs, such as building outlines and parcels. Given these shortcomings, we propose a vector analytical framework for population modeling. The framework is designed to combine all of the lines defining the input layers so that the fields enclosed by those lines (i.e. polygons) are uniformly attributable to each of the input layers. This richer data stack allows for the development of models with more complex logic that are straightforward to implement and explain, and it potentially increases the accessibility of modeled estimates and intermediate layers to a broader audience. Furthermore, by embedding grid cell boundaries into the vector framework from the outset, we maintain the ability to generate raster layers (e.g., gridded population estimates) from this framework.

Approach
Within our framework, capturing all relevant built environment attributes as well as all source zone identifiers at the finest resolution requires calculating the spatial intersections of all input layers as a first step. Unlike approaches that convert all vector data to raster or that simply join attributes on polygon centroids, our approach maintains all attribute information from the input layers with their original spatial precision. We thereby retain the flexibility to aggregate at subsequent steps according to modeling assumptions. This method also allows for the inclusion of a regular grid for aggregation, preserving our ability to aggregate back to the raster format familiar to users of high resolution population estimates. Simply put, the vector analytical framework is designed to combine all of the lines defining the input layers so that fields enclosed by those lines (i.e. polygons) are uniformly attributable to each of the input layers.
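A minimal PostGIS sketch of this combine-and-polygonize idea follows; the table and column names ('blades' for the combined input lines, assembled in the implementation section, plus 'blocks', 'parcels', and 'grid') are hypothetical stand-ins for our production schema. The combined lines are noded, the enclosed fields are polygonized, and each field inherits attributes from every input layer via a point-on-surface join.

    -- Node the combined lines so every intersection becomes a vertex,
    -- then form one polygon per enclosed field.
    CREATE TABLE fields AS
    SELECT row_number() OVER () AS field_id, geom
    FROM (
        SELECT (ST_Dump(ST_Polygonize(geom))).geom AS geom
        FROM (SELECT ST_Node(ST_Collect(geom)) AS geom FROM blades) AS noded
    ) AS dumped;

    -- Each field lies wholly inside one polygon of each input layer,
    -- so a point-on-surface join attributes it uniformly.
    -- Overlapping parcels yield multiple matches; production collapses
    -- these into arrays (see section 2.2).
    CREATE TABLE fields_attributed AS
    SELECT f.field_id, b.censusblock, p.parcelid, g.cell_id, f.geom
    FROM fields f
    LEFT JOIN blocks  b ON ST_Within(ST_PointOnSurface(f.geom), b.geom)
    LEFT JOIN parcels p ON ST_Within(ST_PointOnSurface(f.geom), p.geom)
    LEFT JOIN grid    g ON ST_Within(ST_PointOnSurface(f.geom), g.geom);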

Implementation
Our population models rely on myriad datasets, some with point/polygon geometries native to shapefiles, geodatabases, GeoJSON files, etc., and some in tabular form, such as CSVs. We have found PostgreSQL with PostGIS to be an ideal central storage point for extract, transform, and load (ETL) processes from the many data sources. While using this as storage and service for our previous raster-based modeling efforts, we found the PostGIS processing capabilities to be impressive and thus developed our vector-based framework to run within the PostgreSQL/PostGIS environment.
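The database side of this setup is minimal; a sketch (the schema name is illustrative), after which loaders such as GDAL's ogr2ogr or shp2pgsql can write each source into staging tables:

    -- Enable spatial types, functions, and indexes in the ETL database.
    CREATE EXTENSION IF NOT EXISTS postgis;

    -- Keep raw loads separate from derived modeling tables.
    CREATE SCHEMA IF NOT EXISTS staging;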
While SQL, as a declarative language, is relatively accessible, some understanding of the underlying PostgreSQL query engine and configuration settings is required for performant outcomes. We arrived at the following order of operations (the first two are sketched after this list):

1. Index all geometries and id columns to optimize evaluations and joins.
2. Assemble all linear boundaries from the polygon inputs (census blocks and parcels), as well as the regular grid bounding lines, into a single 'blades' layer.
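A condensed sketch of operations 1 and 2; the table names ('blocks', 'parcels', 'grid_lines') are illustrative rather than our production schema:

    -- 1. Spatial and id indexes keep later evaluations and joins fast.
    CREATE INDEX ON blocks  USING gist (geom);
    CREATE INDEX ON parcels USING gist (geom);
    CREATE INDEX ON blocks  (censusblock);
    CREATE INDEX ON parcels (parcelid);

    -- 2. Collect every polygon boundary, plus the grid cell edges,
    --    into a single 'blades' table.
    CREATE TABLE blades AS
    SELECT ST_Boundary(geom) AS geom FROM blocks
    UNION ALL
    SELECT ST_Boundary(geom) FROM parcels
    UNION ALL
    SELECT geom FROM grid_lines;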

Processing times
The processing times shown in Table 1 are for the polygon-to-line conversions and the creation of the 3 arc-second grid cell bounding lines, described in section 2.1, operation 2. As such, the time is a function of the total rows: larger extents have more grid lines, and more populated states have more blocks and parcels. Table 2 lists processing times for operations 3 to 6 in section 2.1, as they are scripted together. Variations from state to state are not simply dependent on the number of rows in the input tables; they depend on the complexity of the polygons in the building and land use layers and the spatial relationships between those layers. Operation 3, for example, takes longer for a state with more building/blade intersections. However, the relationship between the number of building polygons and processing time is roughly linear (Figure 2). With regard to operation 2, the number of parcel polygons is moderately predictive of processing time; the number of census blocks and the total land area to be diced into grid cells also contribute to the variation (Figure 1).

Discussion
The vector analytical framework has been transformative in our population modeling work for LandScan USA (Moehl et al., 2020). It allows us to move the heaviest computation to the early stages of production, before many decisions are made, removing barriers to iterating on and adjusting implemented decisions. Conversely, raster-based methods incur the heaviest computations at the penultimate step. The US analytical data stack described herein consists of over 270 million rows of building parts, the result of running 123 million building outline polygons through our framework together with over 152 million parcel polygons, over 11 million census blocks, and over 65 million unique embedded grid cells, all stored and calculated within PostGIS. These 270 million building parts, having no loss of information and linking back to the source datasets, become the basis for all subsequent decisions and analysis in the workflow, including handling of overlapping parcel polygons, interpreting other confounding land use information, imputation of null land uses, and summaries of area by land use by source zone for further statistical analyses. Figure 5 highlights the polygon geom of an example building part after being split by parcels and grid lines.

Spatially joining attributes from one polygon layer to another is a common procedure. It is often done using the centroid of the target layer to ensure one record in the resultant table for each feature of the target layer. Using building outlines as the target layer for parcel land uses, for example, would result in a 1:1 ratio of records between the input buildings layer and the output buildings + land use layer. Figures 3 and 4, which show one building polygon intersecting four parcels, illustrate how a spatial join of parcels to buildings using centroids might be insufficient. With the building parts, we can calculate the ratios of building/parcel intersections to buildings by any census geography; a ratio of 1:1 would be equivalent to the centroid-based join, with no building intersecting parcel boundaries. Figure 6 shows that a centroid-based spatial join would be insufficient in many counties: most counties have ratios between 1 and 2, though some have much higher ratios. The highest are found in the New York City counties; Queens County is one of these extreme examples, with a ratio of 3.075:1. Figure 7 shows 38 buildings before being split by parcels; Figure 8 shows them after the split. These figures illustrate that while the county level ratio of about 3:1 is extreme relative to the rest of the US, there are even more extreme ratios at the tract scale; 12.94:1 in this example. The vector analytical framework and the building parts allow us to understand these ratios across our entire study area at any scale, not just for a sample. Otherwise, it would be very difficult to know and explain where and how a centroid method would be insufficient.
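Computing these ratios amounts to a single aggregate over the building parts. A sketch: 'censusblock' and 'parcelids' appear in the query shown later in this paper, while 'building_parts' and 'building_id' are hypothetical names.

    -- Ratio of building/parcel intersections to buildings per county
    -- (county = first five digits of the census block code).
    -- Counting distinct building/parcel combinations collapses the
    -- additional splits introduced by the grid lines.
    SELECT substring(censusblock, 1, 5) AS county,
           count(DISTINCT (building_id, parcelids))::numeric
               / count(DISTINCT building_id) AS ratio
    FROM building_parts
    GROUP BY county
    ORDER BY ratio DESC;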
Figure 8 also illustrates how the vector framework allows the precision of the underlying land use data to be maintained throughout the modeling process. In a raster-based workflow, rasterization would inevitably distort the relative distributions of land use areas at all scales, and, depending on the cell size and alignment, some land uses might be lost entirely.
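Because the building parts carry exact areas, land use distributions can be summarized at any census scale without rasterization. A sketch, with 'building_parts' and 'landuse' as hypothetical names and the tract taken as the first eleven digits of the census block code:

    -- Exact building area by land use for every census tract.
    SELECT substring(censusblock, 1, 11) AS tract,
           landuse,
           sum(area_m2) AS area_m2
    FROM building_parts
    GROUP BY tract, landuse;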

Applications
Overlap Handling
Vector polygon datasets often have overlapping features. This is sometimes the result of misalignment, but it is also often a true representation: a building outline and a land use polygon within OpenStreetMap, for example. Measuring the impact of these overlaps on population models can be difficult. In a raster framework, calculating the impact of overlaps requires one rasterization for each possible handling scenario. The impact can also change with spatial resolution, i.e. cell size, which would require further rasterization for evaluation, and additional complexity arises for evaluation at different scales, such as state, county, or tract. With the vector analytical framework, all decisions are downstream of the heaviest computations, thereby encouraging and facilitating iterative analysis and refinement of methods. With the building parts, we have no information loss, and all area and land use information is in the context of the building outline polygons. This allows us to rapidly calculate overlap in terms of building part area at various scales. As specified in section 2.2, when multiple values occur in the parameter polygons, arrays are used to capture the distinct values present. We can use this in the SQL logic to find the building area with multiple parcels at any level of census geography:
    SELECT sum(area_m2), substring(censusblock, 1, 5) AS county
    FROM building_parts AS a
    WHERE array_length(a.parcelids, 1) > 1
    GROUP BY county;

It takes about a minute to process the 270 million building parts and find that the building area overlap across the states ranges from 0.75% in Indiana to 16.15% in Washington, D.C., as shown in Figure 9. It takes another minute each to process the 70,000 tracts and 3,000 counties in Figures 10 and 11, respectively. These calculations show that the chosen method(s) for overlap handling, such as 'prefer residential' or 'take the smallest parcel', can greatly impact population models at finer scales, as there are some geographies with large amounts of overlapping information. These calculations can also aid in identifying any systematic issues that might be present in the input datasets.

Population Distribution
The campus zone layer has a unique identifier for each campus and a population to be distributed within the zone. To distribute the population within the zone, we first select the building parts that intersect each zone, along with the zone id and population, so that the resultant weights table has a row with the uid, area_m2, and land use attributes from the building parts and the zone id and population from the zone layer. After selection, population is distributed to each of the building parts in the weights table. Any building parts that intersect multiple source zones receive reduced portions of the contributing source zones' populations so that they do not receive extra population. This is calculated by counting the occurrences of each building part, with one occurrence indicating that the building part is within one college zone, and then dividing that building part's area by the occurrence count. The area per school is calculated, and each building part is given its portion of the college population according to its portion of the total school area. Building parts with nonresidential land uses have their areas set to a tiny value (0.0000001), so that a building part with a nonresidential land use receives no population unless all the building parts for that zone have nonresidential land uses. Figure 12 shows the building parts within two campus zone polygons (blue). The darker green building part within the smaller college zone denotes the presence of two building part polygons, one for each college. That building would receive all of the smaller source zone's population and a fraction of the population from the larger college, at half the density of the buildings that intersect only the larger source zone.
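The weighting logic described above can be sketched as follows. Table and column names ('building_parts', 'campus_zones', 'landuse', etc.) are hypothetical, and the production implementation handles additional edge cases:

    -- Build the weights table: parts intersecting each campus zone,
    -- with nonresidential areas reduced to a near-zero value.
    CREATE TABLE campus_weights AS
    WITH hits AS (
        SELECT bp.uid,
               CASE WHEN bp.landuse <> 'residential'
                    THEN 0.0000001 ELSE bp.area_m2 END AS area_m2,
               z.zone_id, z.population
        FROM building_parts bp
        JOIN campus_zones z ON ST_Intersects(bp.geom, z.geom)
    ),
    -- Count how many zones each part falls in, to avoid giving any
    -- part extra population.
    occurrences AS (
        SELECT uid, count(*) AS n FROM hits GROUP BY uid
    )
    SELECT h.uid, h.zone_id,
           h.population * (h.area_m2 / o.n)
               / sum(h.area_m2 / o.n) OVER (PARTITION BY h.zone_id) AS pop
    FROM hits h
    JOIN occurrences o USING (uid);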

CONCLUSIONS
We believe this framework, developed in the context of population modeling, is extensible to many other spatial analysis problems. It is also an excellent example of using PostgreSQL/PostGIS beyond storing and serving data and is thus of interest and relevance to the FOSS4G community. To GIS practitioners accustomed to a geoprocessing workflow in desktop GIS software (input layers → process → output layer), our data flow diagrams are familiar, but the scale and performance achieved are not. Many practitioners of traditional GIS spatial analysis can benefit from making their data more analytically accessible to a wider array of emerging data science techniques.

COPYRIGHT
This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).