AN EFFICIENT ND-POINT DATA STRUCTURE FOR QUERYING FLOOD RISK

Governments use flood maps for city planning and disaster management to protect people and assets. Flood risk mapping projects carried out for these purposes generate a huge amount of modelling results. So far, the data submitted have been highly condensed products such as typical flood inundation maps and tables for loss analysis. The original modelling results, which record critical flood evolution processes, are overlooked because their management and analysis are cumbersome. This has clear drawbacks: the 'static' maps impart few details about the flood, and the data cannot address new requirements, which significantly confines the use of flood maps. The recent development of point cloud databases provides an opportunity to manage the whole set of modelling results, so that all kinds of flood risk queries can be supported efficiently at finer scales. Using a case study from China, this paper demonstrates how a novel nD-PointCloud structure, HistSFC, improves flood risk querying. The results indicate that, compared with conventional database solutions, HistSFC delivers superior performance and better scalability. In addition, specific optimizations made on HistSFC can speed up the process further. All of this points to a promising solution for the next generation of flood maps.


INTRODUCTION
To prepare for potential future flood disasters (Figure 1), many countries have conducted flood mapping projects in past decades (Merz et al., 2007, Cheng, 2005, Tran et al., 2009). For instance, in the USA, the Federal Emergency Management Agency (FEMA) generated digital flood maps for most of the U.S. population (Council et al., 2009). The European Water Directors also established the European Exchange Circle on Flood Mapping (EXCIMAP) to gather all existing European experience on flood mapping and improve related practices. The goals include land-use planning and land management, watershed management, hazard assessment at the local level, emergency planning and management, and insurance (Van Alphen et al., 2009). China implemented national flood risk mapping projects for key areas from 2013 to 2015. The mapping process mainly includes two parts. The first part concerns running a 1D and 2D coupled hydrodynamic model to compute water depth, flow speed and direction at different time steps, given a specific breach case. The result is stored in a 2D grid covering the modelling basin, which is then used for making various maps such as the maximum inundation map and the inundation duration map. Also, combined with socio-economic data, potential loss tables are computed. The water ministry collects these final products and plans to use them for decision making. However, a large part of the original modelling results is omitted because it is cumbersome to manage, analyze and present. This has clear drawbacks:

1. since the products are 'static' (Figure 2), no more details can be derived;
2. the maps fail to address new types of queries, which significantly confines their use;
3. many small regions which can alleviate the risk of larger flood storage and retention areas are not modelled, which limits analysis when a real flood happens;
4. for the regions simulated, the input breach cases are not sufficient, because a real breach may occur somewhere else;
5. due to limited computing power, the spatial and temporal resolution is confined to coarse scales;
6. frequent updates due to land use change (Merz et al., 2007, Cheng, 2005) are impossible.
In fact, any specific flood map can be expressed by an SQL query. For example, the inundation extent is obtained by selecting all the grid cells with water depth larger than 0, while the arrival time map is created by selecting the time steps at which cells first become flooded. In addition, new queries such as the flood situation around certain objects can also be resolved. Hence, to address the issues listed above, developing an efficient database for ad-hoc queries is imperative. Two such queries are sketched below.
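As an illustration only, assuming the modelling results sit in a flat table flood_points(case_id, x, y, z, t, depth, velocity, direction) (hypothetical names, not the project's actual schema), the two maps above could be expressed roughly as follows:

```python
# Hypothetical flat-table schema: flood_points(case_id, x, y, z, t, depth, velocity, direction).
# These SQL strings only illustrate the idea; table and column names are assumptions.

# Inundation extent for one breach case: every cell that gets wet at some time step.
extent_sql = """
    SELECT DISTINCT x, y
    FROM flood_points
    WHERE case_id = 1 AND depth > 0
"""

# Arrival time map for the same case: the first time step at which each cell becomes wet.
arrival_sql = """
    SELECT x, y, MIN(t) AS arrival_time
    FROM flood_points
    WHERE case_id = 1 AND depth > 0
    GROUP BY x, y
"""
```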
The issue then becomes which database technology to use. Current solutions for flood risk analysis are mainly based on relational databases, either standalone flat tables (Brocca et al., 2013) or tables encapsulated in GIS software (Forkuo et al., 2011, Wang et al., 2010). However, according to practical experience and previous studies in geo-applications (Stonebraker et al., 2013, Van Oosterom et al., 2015), indexing flat tables performs inefficiently when faced with large multidimensional data. One major reason is the huge size of the extra indexing structure, which makes it cumbersome to use. GIS software mainly focuses on processing and analyzing small datasets, and lacks mechanisms to handle huge data. Another aspect to consider is that current flood models mainly use irregular grids, e.g., triangular networks, for computation. So, multidimensional array databases such as SciDB and Rasdaman are inapplicable, as they focus on regular grids (e.g., rasters) (Liu et al., 2018). Alternatively, point cloud databases can be an option:

• we can directly use the point cloud model to manage the flood modelling results, where each point (i.e., the centre of a cell) has breach case ID, X, Y, Z, time, depth, velocity and direction dimensions;
• the recent HistSFC framework, which is aimed at efficiently managing and querying massive nD point clouds, has been developed and verified with applications on LiDAR point clouds (Liu et al., 2020). It is implemented in a database, with nD-histograms and irregular querying techniques to solve more complex queries;
• flood risk queries can be performed by using such a database, where different maps can be derived from queries. Besides, the database can address other potential queries, such as evaluating the water depth along a road or the risk to a house considering all breach cases.
This paper investigates the applicability of the nD-PointCloud structure HistSFC for querying flood risk, and presents its advantages in terms of functionality and efficiency. The rest of the paper is organized as follows: Section 2 discusses the state-of-the-art HistSFC, including novel optimizations we made for flood queries. Section 3 tests the performance of HistSFC and demonstrates its superiority over conventional solutions, based on a case study in China. Section 4 concludes with a summary of the main results and future work.

HISTSFC
In point cloud data, each point contains multiple dimensions. In terms of data management, we distinguish two types of dimensions. Organizing dimensions are used to cluster and index the data, e.g., X/Y/Z/T. The other, property dimensions are affiliated, such as color, intensity and classification, and are not frequently used in the SQL WHERE clause. These two types of dimensions are not fixed, and may vary depending on the application.
Another key concept is the Space Filling Curve (SFC). It is an advanced technique to cluster and access data, and has been adapted and improved for multidimensional point data management (Shan, 2005, Zhang et al., 2014). Among all SFCs, the Morton curve is commonly studied and practiced due to the simplicity of its mapping functions (Morton, 1966). It is based on interleaving the bits of the coordinates.
HistSFC (Liu et al., 2020) utilizes the Morton curve to convert all organizing dimensions into a Morton key for storage. For example, given a point with coordinates (3, 2), its binary representation is (11, 10). By interleaving these bits, the Morton key 1101 is derived, i.e., 13 as a decimal number. In this way, all nD-points can be converted to 1D Morton keys. Then, a one-dimensional indexing structure such as the B+-tree can be used to retrieve the keys, to address a query. The remaining work is to map the query window to ranges of Morton keys for selection. Section 2.1 specifies this process. A minimal sketch of the encoding and its inverse is given below.
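The sketch below reproduces the (3, 2) → 13 example; the dimension and bit ordering are assumptions, and the actual HistSFC implementation may order them differently.

```python
def morton_encode(coords, bits=20):
    """Interleave the bits of n integer coordinates into one Morton key.
    E.g. morton_encode([3, 2]) -> 13, matching the example in the text."""
    key = 0
    n = len(coords)
    for b in range(bits):                 # bit position within each coordinate
        for d, c in enumerate(coords):    # dimension index
            bit = (c >> b) & 1
            key |= bit << (b * n + d)
    return key

def morton_decode(key, n, bits=20):
    """Inverse mapping: recover the n original coordinates from a Morton key."""
    coords = [0] * n
    for b in range(bits):
        for d in range(n):
            bit = (key >> (b * n + d)) & 1
            coords[d] |= bit << b
    return coords

assert morton_encode([3, 2]) == 13
assert morton_decode(13, n=2) == [3, 2]
```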

Primary settings
As mentioned, each nD-point is encoded as a full-resolution Morton key, a combination of all organizing dimensions. Property dimensions are attached to the key. Such a full-resolution key can be decoded directly back to the original coordinates. Figure 3 presents the workflow of HistSFC. For storage, after computing the SFC keys, HistSFC adopts Index-Organized Tables (IOT) (Oracle, 2013a) to manage them. Oracle implements the B+-tree structure in an IOT: the key and the other property dimensions are stored in leaf nodes forming a flat table, while the internal B+-tree uses the SFC keys to organize and index the storage. In this way, the indexing structure is integrated with the storage, which leads to a very compact structure.
Before executing queries, we first build the HistTree. The HistTree is devised to represent the data distribution, to avoid excessive range generation in sparse areas in the presence of a skewed point distribution. As Figure 4 shows, the HistTree counts the points in each SFC node at different levels. If the number exceeds the threshold of the tree, the node is partitioned into nodes at a lower level. A height field is used in a HistTree node to distinguish different nodes, because branch nodes at different levels may possess identical keys. A HistTree node actually represents the minimum bounding box (MBB) of a quadrant, but it contains neither points nor pointers to points. Thus, the HistTree is a compact structure which can be stored in a flat table. A sketch of its construction follows.
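The following is a simplified construction sketch, under the assumption that a node is identified by its height and the Morton prefix it covers and that only non-empty children are kept; the stored HistSFC structure may differ in detail.

```python
def build_histtree(keys, n_dims, key_bits, threshold=1000):
    """Sketch of HistTree construction over full-resolution Morton keys. Nodes
    holding more than `threshold` points are split into children at the next
    level; every node only stores a point count, never the points themselves."""
    levels = key_bits // n_dims             # maximum refinement depth (bits per dimension)
    tree = []                               # flat table of (height, prefix, point_count)

    def refine(height, prefix, node_keys):
        tree.append((height, prefix, len(node_keys)))
        if len(node_keys) <= threshold or height == levels:
            return                          # sparse enough, or at full resolution: leaf node
        shift = (levels - height - 1) * n_dims
        buckets = {}                        # child prefix -> keys falling into that child
        for k in node_keys:
            buckets.setdefault(k >> shift, []).append(k)
        for child_prefix, child_keys in buckets.items():
            refine(height + 1, child_prefix, child_keys)

    refine(0, 0, keys)
    return tree
```

In the case study of Section 3, the threshold is set to 1000 and 7 organizing dimensions are encoded into the key.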
When querying (Figure 3), HistSFC employs the HistTree to map the query window to 1D SFC ranges. Starting from the root node, the extent of each HistTree node can be computed using its height and its key. Then, by intersecting branch nodes with the query window iteratively, HistSFC retrieves all intersected leaf nodes and abandons non-overlapping nodes. Some of the retrieved nodes lie completely inside the query window; HistSFC directly exports the ranges they represent. The other resultant leaf nodes fall on the boundary of the query window. These can be further refined based on a recursive fixed decomposition, that is, halving every dimension to build child nodes in each iteration. The process stops when the number of ranges reaches the maximum we set. The IOT strategy then uses the B+-tree to select keys within all these ranges. The returned keys (points) are then passed to the second filter to select the required set of points (Figure 3). A simplified sketch of this first filter is given below.
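The sketch illustrates the idea with a plain breadth-first decomposition over an nD box query, reusing morton_decode from the earlier sketch; the real first filter is driven by the HistTree leaf nodes and their point counts rather than a fixed decomposition.

```python
from collections import deque

def window_to_ranges(query_lo, query_hi, n_dims, levels, max_ranges=1000):
    """Map an nD box query [query_lo, query_hi] (inclusive integer grid
    coordinates) to Morton key ranges by recursive decomposition (sketch)."""
    total_bits = levels * n_dims
    queue = deque([(0, 0)])            # (height, prefix): cell covering one key interval
    ranges = []
    while queue:
        height, prefix = queue.popleft()
        shift = total_bits - height * n_dims
        lo_key, hi_key = prefix << shift, ((prefix + 1) << shift) - 1
        cell_lo = morton_decode(lo_key, n_dims, bits=levels)   # lower corner of the cell
        cell_hi = morton_decode(hi_key, n_dims, bits=levels)   # upper corner of the cell
        if any(cell_hi[d] < query_lo[d] or cell_lo[d] > query_hi[d] for d in range(n_dims)):
            continue                                           # disjoint: abandon the cell
        inside = all(query_lo[d] <= cell_lo[d] and cell_hi[d] <= query_hi[d]
                     for d in range(n_dims))
        if inside or height == levels or len(ranges) + len(queue) >= max_ranges:
            ranges.append((lo_key, hi_key))                    # export the cell as one range
        else:                                                  # boundary cell: halve every dim
            for child in range(2 ** n_dims):
                queue.append((height + 1, (prefix << n_dims) | child))
    return sorted(ranges)
```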
The theoretical querying time of HistSFC is as follows:

$T_{query} = r \cdot t_{pre} + \sum_{i=1}^{r} \lceil k_i / B \rceil \cdot t_{io} + k \cdot t_{post}$  (1)

where

B = page capacity in the number of points
r = number of ranges generated
k_i = number of points inside a specific range
k = number of points returned by the first filter

In Equation 1, t_pre covers the computing time and scanning time of one range for joining with the SFC point records. The middle term indicates the I/O cost, and t_io refers to the time needed to load one page from disk. In fact, the middle term is an upper limit of the I/O operations, as different ranges might cover the same disk page, which is then counted more than once. In the last term, t_post refers to the post-processing after retrieving one SFC record, including decoding, point-in-window computation and exporting. Parallelism can be implemented for the post-processing in a straightforward way (Section 2.2.3). Basically, we divide the SFC ranges among p processors, each of which gets part of k. So, the final decoding costs about (k/p) · t_post time, given a balanced workload.

Optimizations
The first filter leverages the B+-tree for querying, which is efficient. However, due to the constraint on the maximum number of ranges, it may use coarser ranges and thus return additional false positive points, which can only be filtered out by the second filter. In fact, flood risk queries often produce large outputs, e.g., to support whole-basin management and statistical analysis. Thus, we optimize HistSFC as follows.

Range refinement
In the current setting, the selected boundary leaf nodes are recursively decomposed in a breadth-first way to intersect the query window and derive child nodes (Figure 5). In this way, HistSFC can filter out more false positive points. This strategy assumes all leaf nodes are equally important for decomposition, which may not be appropriate.
In Figure 5, some SFC cells (i.e., nodes) intersect the query window by a large proportion. Thus, most points inside these cells are within the query window as well. In contrast, other boundary cells, i.e., P1, P2 and P3, which intersect the query window by only a small portion, mainly contain false positives. These cells should get priority for refinement so that a decomposition becomes more effective. Consequently, we can optimize the nodes' partitioning process to refine the ranges by considering the intersection ratio, which equals the volume of the intersection divided by the volume of the node.

Figure 5. A 2D query sample: leaf nodes selected (red) in the HistTree to match the query window (blue)

Specifically, we compute the intersection ratio in the first filter and rank the intersected leaf nodes by the estimated number of false positive points (Equation 2). The leaf nodes with more expected outliers are decomposed first. After one decomposition, we assume points are allocated evenly to the different child nodes. Then, we add the child nodes into the original pool of leaf nodes and rank them again for further decomposition.

$f_{pp} = (1 - R) \cdot N_c$  (2)

where

f_pp = estimated number of false positive points in a boundary node
R = intersection ratio
N_c = number of points in the boundary node

A sketch of this refinement order is given below.
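The sketch assumes a hypothetical node representation (intersection_ratio, point_count, split_fn), where split_fn() returns the child nodes in the same form; the actual bookkeeping in HistSFC may differ.

```python
import heapq

def refine_by_false_positives(boundary_nodes, max_ranges=1000):
    """Decompose the boundary nodes with the most estimated false positives
    first (Equation 2), until the allowed number of ranges is reached (sketch)."""
    heap, tie = [], 0
    for node in boundary_nodes:
        ratio, count, _ = node
        heap.append((-(1.0 - ratio) * count, tie, node))   # max-heap on f_pp via negation
        tie += 1
    heapq.heapify(heap)
    while heap and len(heap) < max_ranges:
        _, _, (ratio, count, split) = heapq.heappop(heap)  # most expected outliers first
        for child in split():                              # points assumed spread evenly
            c_ratio, c_count, _ = child
            heapq.heappush(heap, (-(1.0 - c_ratio) * c_count, tie, child))
            tie += 1
    return [node for _, _, node in heap]                   # nodes defining the refined ranges
```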

Polytope querying

HistSFC is not restricted to rectangular search regions, but can also handle irregular query geometries, such as an nD sphere or triangle. The recursive partitioning of SFC nodes to match the query geometry still applies. However, the intersection module has to be overridden for different types of geometry. To make the solution more generic, we use the half-spaces composing a convex polytope to approximate the real query geometry, and develop a generic sweep algorithm (Thompson et al., 2020) for the intersection computation.
The sweep algorithm first identifies the enter and exit vertices (each one of the node's corners) of an SFC node with respect to a half-space. For each dimension, if the corresponding coefficient a_i ≥ 0, then the enter takes the lowest value in that dimension; otherwise, it adopts the highest value. The exit is the opposite of the enter (Figure 6). As an illustration, Figure 6 indicates how the algorithm determines whether a node intersects a polytope composed of H1 and H2:

• Case 1: if the enter is inside the half-space, then the node is fully inside the half-space, and therefore is fully inside the polytope;
• Case 2: if the exit is outside the half-space, then the node is totally outside the polytope;
• Case 3: if the enter is outside the half-space, but the exit is within the half-space, then the node is cut by the half-space boundary (e.g., for both H1 and H2), and the node intersects the polytope;
• Case 4: if the enter is outside the half-space but the exit is inside, then the node intersects the half-space (H2); however, if the exit is outside another half-space (H1), then the node is totally outside the polytope.
For each node, HistSFC examines all half-spaces. If an intersection occurs, HistSFC decomposes the node into its children and checks again. During this process, HistSFC records the half-spaces that totally contain the node, so that no further examination against those half-spaces is conducted for its children. When the maximum number of ranges is reached, the algorithm exports all intersected nodes. A sketch of the per-node test is given below.
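The sketch assumes each half-space is stored as a pair (a, b) meaning a·x ≥ b; the sign convention is an assumption, and Thompson et al. (2020) describe the exact algorithm, including the caching of containing half-spaces, which is omitted here.

```python
def classify_node(node_lo, node_hi, halfspaces):
    """Classify the bounding box [node_lo, node_hi] of an SFC node against a
    convex polytope given as half-spaces (a, b) with a·x >= b. Returns
    'inside', 'outside' or 'intersect' (sketch of the sweep test)."""
    cut = False
    for a, b in halfspaces:
        # 'enter' minimises a·x over the box, 'exit' maximises it
        enter = [lo if ai >= 0 else hi for ai, lo, hi in zip(a, node_lo, node_hi)]
        exit_ = [hi if ai >= 0 else lo for ai, lo, hi in zip(a, node_lo, node_hi)]
        if sum(ai * xi for ai, xi in zip(a, exit_)) < b:
            return 'outside'            # Cases 2/4: fully on the wrong side of one half-space
        if sum(ai * xi for ai, xi in zip(a, enter)) < b:
            cut = True                  # Case 3: this half-space boundary cuts the node
        # Case 1: enter inside -> node fully inside this half-space, nothing to record
    return 'intersect' if cut else 'inside'
```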

Parallel decoding
The previous optimizations aim at decreasing the number of false positive keys returned by the first filter. However, the decoding process can still be time consuming if a large number of keys has to be processed anyway. To address this issue, we adopt a parallel technique for decoding. A straightforward way is to evenly distribute the ranges over different processors, so that each processor performs a query and decodes its result. This can be easily implemented, as all processors adopt the same function, as sketched below.
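A minimal sketch with Python multiprocessing; fetch_keys_in_range is a placeholder for the actual range selection against the IOT, and morton_decode is reused from the earlier sketch.

```python
from multiprocessing import Pool

def fetch_keys_in_range(key_range):
    """Placeholder for the range selection on the IOT (B+-tree) storage; in the
    real system this is an SQL range query on the Morton key column."""
    raise NotImplementedError

def decode_range(args):
    key_range, n_dims, bits = args
    keys = fetch_keys_in_range(key_range)
    return [morton_decode(k, n_dims, bits) for k in keys]   # decode back to nD coordinates

def parallel_decode(ranges, n_dims, bits, processors=16):
    """Distribute the SFC ranges evenly over `processors`; each worker selects
    and decodes its own share, and the results are concatenated (sketch)."""
    with Pool(processors) as pool:
        parts = pool.map(decode_range, [(r, n_dims, bits) for r in ranges])
    return [pt for part in parts for pt in part]
```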

USE CASE STUDY
The research area is the Niansi Levee, located in Jiangxi Province, China (Figure 7). To its northwest is Poyang Lake. The total area is about 183 km². The Niansi Levee is one of the key areas modelled in the national flood mapping project.

Data
The hydrodynamic model consists of a 1D channel model and a coupled 2D flood routing model of the basin. The channel model provides the extreme flow for the flood simulation in the 2D grid, which contains 59,680 triangular cells in total. The model computes water depth, velocity and flow direction. We modelled 8 cases: 4 breach locations, combined with extreme rainfall with return periods of 20 and 50 years. Each case simulates 720 steps (corresponding to a 30-min resolution). So, in total, we get 59,680 × 720 × 8 = 343,756,800 points in an 8D space composed of case ID, X, Y, Z, time, depth, velocity and direction.

Queries
Based on the experience acquired during the project, as well as potential needs, we devised 6 queries for testing (Table 1). The query locations are depicted in Figure 7, e.g., Q4 and Q5.

Table 1. Queries for testing

QID | Analysis
Q1  | The area that is flooded with depth greater than 3 m, when dyke I bursts
Q2  | The area that is flooded within 24 hours, when dyke IV bursts
Q3  | The maximum inundation area when dyke III bursts
Q4  | Risk to several houses (depth > 0) when dyke II or III bursts
Q5  | Risk to a county road (velocity ≥ 0.5 m/s) considering all possible bursts
Q6  | The dangerous area evaluated by human instability (depth × velocity ≥ 2 m²/s), when dyke I bursts

The rectangular query area of Q4 is about 1.5 km². Q5 uses an irregular geometry (i.e., the road) for querying. The result of Q5 is 93 8D points, presented as 6 distinct spatial points at different time steps in Figure 8. The points indicate the vulnerable parts of the road which need enhancement. During a flood evacuation, people should also avoid these locations. Q6 uses the product of flow velocity and depth to quantify human instability in flowing water (Jonkman and Penning-Rowsell, 2008). We choose 2 m²/s, a rather safe level, as the threshold.
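For instance, against the hypothetical flood_points flat table used earlier (and assuming case_id = 1 corresponds to the dyke I breach), Q6 could be written directly as the SQL below; in HistSFC the nonlinear region depth × velocity ≥ 2 is instead approximated by half-spaces in the first filter and checked exactly in the second filter.

```python
# Q6 on the hypothetical flat-table schema; names and case-ID mapping are assumptions.
q6_sql = """
    SELECT x, y, t, depth, velocity
    FROM flood_points
    WHERE case_id = 1
      AND depth * velocity >= 2      -- human instability threshold, 2 m^2/s
"""
```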

Implementation
The test is conducted on a 'testbed' server: an HP DL380p Gen8 server with 2 × 8-core Intel Xeon E5-2690 processors at 2.9 GHz, 138 GB of main memory and a RHEL6 operating system. The disk storage is 41 TB of SATA 7200 rpm disks in a RAID6 configuration. Among the 8 dimensions of the data, flow direction is seldom used for ad-hoc analysis compared with the other dimensions. So, we set it as a property dimension when building HistSFC, while encoding the other 7 dimensions into the Morton key stored in the IOT. We use 1000 as the threshold to build the HistTree. As a comparison, we also build a flat table which stores each dimension as an individual column. Although this approach performs inefficiently, it serves as a baseline to compare with. Solutions provided by major spatial databases fail to address nD-point management: for instance, Oracle SDO_PC (Oracle, 2013b) builds point blocks based on a 2D organization, while PostGIS supports at most a 4D point geometry type.
To explore scalability, both approaches divide the whole dataset into 4 stores, containing 1, 2, 4 and 8 cases respectively. The size of the HistSFC store with 8 cases is 12.88 GB, and that of the HistTree is 84 MB on disk. The flat model occupies 15.4 GB of disk space.

Benchmark tests
For all queries, HistSFC utilizes 16 processors to decode Morton keys. Besides, it employs 7 half-spaces to approximate the query shape of Q6 (Figure 9). More details about the implementation of the data storage and queries can be found in the Appendix.

Figure 9. Polytope querying of Q6

Figure 10 shows all testing results. It indicates that for all queries, compared to the flat-table approach, HistSFC has better scalability, following a constant trend. In most cases, the time cost is much lower than that of the flat-table solution. An exception is Q6, where HistSFC performs significantly worse. This is because the current parallel implementation evenly distributes the ranges over the processors. As each range contains a different number of points due to the skewed data distribution, the actual workload can be unbalanced among processors. In this case, we found that one processor undertakes 72% of the whole workload.
More specifically, Q2 takes less time overall, compared with Q1 and Q3. This is because it selects only 925,691 points out of the total 343,756,800 points, while Q1 and Q3 export 26,484,215 and 32,183,314 points, respectively. Q4 is also very selective, returning 170,417 points. It first performs a spatial window selection (i.e., a rectangular area with houses), and then checks the flood information, which is indicated by a positive depth value. Unlike Q4, Q5 utilizes a long and narrow 2D polygon for the spatial selection, consisting of 628 edges. This, however, does not introduce significant overhead compared with Q4. Q6 uses the polytope querying technique (Figure 9), which leads to a result set of 30,937 points, slightly larger than the correct answer of 28,351. The false positive rate is below 1%. For all other queries, the two approaches return the same results.

Discussion
Flood analysis is frequently performed globally, i.e., over the whole modelled area, as in Q1, Q2 and Q3. Therefore, a large output may often be encountered.
To perform such queries efficiently, low I/O cost and parallel post-processing are crucial, as indicated by Equation 1. Both are related to the data distribution: if the data is severely skewed in the nD space, some of the generated ranges will contain many false positives, which significantly increases I/O; besides, the uneven distribution causes an unbalanced workload among processors (Figure 10(f)). It is possible to perform the first filter on a single thread and then distribute the selected keys for decoding. However, the additional memory and time cost will become an issue. A more direct solution is to estimate the number of points in the final ranges, e.g., based on the length of each range. Then, each processor is allocated work according to the estimated point counts, as sketched below.
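A sketch of such an allocation, estimating the workload of a range simply by its key length and assigning ranges greedily to the least-loaded processor; the estimator and the greedy rule are assumptions, not the implemented method.

```python
import heapq

def balance_ranges(ranges, processors=16):
    """Assign (lo, hi) key ranges to processors so that the estimated loads,
    approximated by the range lengths, are roughly balanced (sketch)."""
    ordered = sorted(ranges, key=lambda r: r[1] - r[0], reverse=True)   # longest first
    heap = [(0, pid, []) for pid in range(processors)]   # (estimated load, id, assigned)
    heapq.heapify(heap)
    for lo, hi in ordered:
        load, pid, assigned = heapq.heappop(heap)        # least-loaded processor so far
        assigned.append((lo, hi))
        heapq.heappush(heap, (load + (hi - lo), pid, assigned))
    return {pid: assigned for _, pid, assigned in heap}
```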
Besides, the dimensionality is a variable. The test dataset contains only one levee, but in practice groups of levees are built and used to defend against floods. Consequently, a levee ID may also be involved in an integrated database. Encoding too many dimensions into the SFC key is not sensible, as this will significantly decrease the accuracy of the first filter (Liu et al., 2020), and the consequent I/O cost will become huge. Hence, it is suggested that before using HistSFC, a comprehensive analysis of the applications and data be conducted to determine the proper organizing dimensions. For example, this research does not use flow direction for data organization, as it is not queried often.

CONCLUSIONS AND FUTURE WORK
This paper has investigated the possibility of using an nD-PointCloud structure, HistSFC, for the next generation of flood maps, to fulfill the increasing needs of different stakeholders. This research also developed critical optimizations on HistSFC to resolve flood queries. A benchmark test was then performed to evaluate HistSFC's performance in practice. The results indicate that the optimized HistSFC can process various flood queries very efficiently, and its scalability remains stable as the input size increases. Although the paper only presents 6 queries, HistSFC can address more complex nD queries in an analogous way. The polytope querying technique is also generic: in addition to polygonal spatial queries, query geometries formed in other physical dimensions can be addressed. Besides, with parallel post-processing, HistSFC can be improved further using state-of-the-art hardware, e.g., not only powerful workstations but also cloud computing platforms. We can additionally parallelize the range computation to enhance HistSFC.
Point cloud databases enable dynamic analysis of large-scale flood simulations. However, the SQL interface is not very user friendly and requires a high level of expertise to master. So, in the future, we plan to develop a Graphical User Interface (GUI) as the front-end, to drive various queries and analyses. User requirements must be collected for devising such an integrated flood mapping system. Besides, the GUI should support efficient rendering, as large-scale point visualization may become routine. This in turn calls for new designs in the data structure, such as embedding Level of Detail (LoD). This again shows the advantage of HistSFC, which can simply treat LoD as another organizing dimension. The specific LoD applicable to flood analysis should be researched and developed in the next step.
An nD-point data structure managing flood modelling results is innovative, as the major use of high-dimensional points is still in the surveying domain and at an initial stage of application. However, whether a point is physically measured or simulated, it functions as a carrier of data, and the nD point cloud model is the core. From this perspective, we can extend the scope of applications, e.g., to wind modelling and the mechanical analysis of materials.

Figure 10. Performance of querying, where the X axis represents the number of cases included in the specific data store

REFERENCES

Zhang, R., Qi, J., Stradling, M., Huang, J., 2014. Towards a painless index for spatial objects. ACM Transactions on Database Systems (TODS), 39(3), 1-42.