EXTENDING LKN CLIMATE REGIONALIZATION WITH SPATIAL REGULARIZATION: AN APPLICATION TO EPIDEMIOLOGICAL RESEARCH

Regional climate is a critical factor in public health research, adaptation studies, climate change burden analysis, and decision support frameworks. Existing climate regionalization schemes are not well suited for these tasks as they rarely take population density into account. In this work, we extend our recently developed method for automated climate regionalization (the LKN method) to incorporate the spatial features of the target population. The LKN method consists of a data limiting step (L-step) that reduces dimensionality by applying principal component analysis, a classification step (K-step) that produces hierarchical candidate regions using the k-means unsupervised classification algorithm, and a nomination step (N-step) that determines the number of candidate climate regions using cluster validity indexes. The LKN method uses a comprehensive set of multiple satellite data streams, arranged as time series, and allows us to define homogeneous climate regions. The proposed approach extends the LKN method to include regularization terms reflecting the spatial distribution of the target population. Such tailoring allows us to determine the optimal number and spatial distribution of climate regions and thus to ensure more uniform population coverage across the selected climate categories. We demonstrate how the extended LKN method produces a climate regionalization that is better tailored to epidemiological research in the context of a decision support framework.


Climate and health
Climate, climate change and adaptation are issues of heightening concern in public health research. The effect of climate on human health and wellbeing in both the developed and developing worlds is profound and multifaceted. Climate change affects vector-borne and water-borne diseases, food security and mental health (WHO (World Health Organization) 2009, World Meteorological Organization and World Health Organization 2015, Crimmins 2016). The impact can be direct or indirect, immediate or delayed, localized or widespread, depending on its causal, temporal and spatial aspects. For vector-borne and water-borne diseases, the effects of climate change can be examined by better understanding habitat suitability for causal pathogens and their routes of transmission. For obvious reasons, the deteriorating effects of climate change and extreme weather on human health are likely to be best measured in locations with high population and better healthcare provision and monitoring. Recent studies show that the magnitude of the effect of extreme weather on human health and pathogen habitat depends on the baseline climate conditions, which may mitigate or aggravate the overall changes. Thus, accurate climate regionalization is needed to quantify and forecast such effects.

Climate regionalization
A number of climate regionalization schemes exist. One of the best-known and most often used is the Köppen-Geiger (KG) climate classification system. The KG climate classification system, developed in 1884 by the Russian/German climatologist Wladimir Köppen, is based on the fundamental concept that regional climate can be defined by a prevalent phenology (Geiger and Pohl 1954, Köppen, Volken et al. 2011). However, due to the technological limitations of the pre-satellite era, it was not possible to reliably define phenology over large and remote areas. Consequently, temperature and precipitation were used as available proxies to determine regions with similar climate. While the KG climate classification is still actively and widely used to quantify climate variation (Chen and Chen 2013), the arbitrary nature of the suggested parameters in the KG climate classification system has been criticized (Thornthwaite 1943). Furthermore, this commonly used scheme does not account for population density and thus is not well suited to capturing population-relevant properties.

Application of Satellite Remote Sensing to Climate Regionalization
Emerging data sources, such as vegetation indices, spectral radiation patterns, surface albedo and other measures, available with the advent of remote sensing technology, allow a prevailing phenological pattern to be defined at virtually every place worldwide. It is now feasible to derive local phenology directly from satellite remote sensing data using one of the existing vegetation indices, which are based on the fact that plant canopies reflect sunlight strongly in the Near Infra-Red (NIR) part of the spectrum (wavelengths of 700 to 1000 nanometers), while absorbing sunlight in the visible spectrum (400 to 700 nanometers). Clouds and bare soil, including snow, have the opposite reflectance properties, reflecting strongly in all visible spectral bands and absorbing the NIR part of the spectrum. Several worldwide phenological measures have emerged during the past two decades with the advent of satellite remote sensing technology. For example, the Normalized Difference Vegetation Index (NDVI) (Carroll, DiMiceli et al. 2000) is defined as the ratio between the difference and the sum of the amount of sunlight reflected by the vegetation canopy in the NIR and Red optical bands, respectively: $\mathrm{NDVI} = (\mathrm{NIR} - \mathrm{Red}) / (\mathrm{NIR} + \mathrm{Red})$. The spectral characteristics of the NDVI index allow the differentiation of phenology and states of vegetation.
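As a minimal illustration of the NDVI definition given above, the following sketch computes the index for a single pixel. The function name `ndvi` and the reflectance values are illustrative assumptions and are not taken from the MODIS product.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: reflectance in the NIR and Red bands (hypothetical inputs).
    """
    total = nir + red
    if total == 0:
        # The ratio is undefined when both bands are zero; flag as missing.
        return float("nan")
    return (nir - red) / total


# Illustrative values only: a vegetated pixel reflects strongly in NIR
# and weakly in Red, so its NDVI is higher than that of bare ground.
vegetated = ndvi(nir=0.50, red=0.08)
bare = ndvi(nir=0.30, red=0.25)
```

For non-negative reflectances the result lies between -1 and 1, which is what makes scenes acquired under different lighting conditions comparable.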

LKN regionalization
Our recently proposed automated climate regionalization method, called LKN-regionalization, is based on the k-means clustering algorithm applied over time and space (Liss et al. 2014). This method uses distributed NDVI scenes, which capture both essential climate properties and changes in climate patterns. The LKN method consists of a data limiting step (L-step) that reduces dimensionality by applying principal component analysis, a classification step (K-step) that produces hierarchical candidate regions using the k-means unsupervised classification algorithm, and a nomination step (N-step) that determines the number of candidate climate regions using cluster validity indexes. Using a comprehensive set of multiple satellite data streams arranged as time series, the method is capable of defining climate regions over large spatial extents. This is essential for large-scale epidemiological studies that must account for geographic heterogeneity.

Objectives
In this study, we extend the LKN method to incorporate the spatial features of the target population by including a regularization term reflecting the spatial distribution of that population. We illustrate this extension with an example of climate regionalization in Ghana.

Satellite Remote Sensing Data
MODIS NDVI and pixel quality (QA) data for 15 years were downloaded from the online Data Pool at the NASA Land Processes Distributed Active Archive Center (LP DAAC), USGS/Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota (LPDAAC-NASA 2000-2013). We arranged the NDVI data so that they cover the entire extent of our study region. Each of the two EOS satellites, Aqua and Terra, produces composites on an overlapping 16-day schedule. By combining the data streams from both satellites, it was possible to construct a time series with 8-day temporal resolution. The vegetation index data were aggregated into a layered space-time series. The normalized index allowed us to reduce or eliminate the effects of seasonally changing lighting conditions, thin clouds, and atmospheric and anisotropic distortions. The reflectance pattern of water differs significantly from that of almost any other land surface material because water absorbs most of the incoming radiation. To avoid misclassification due to the water reflectance pattern, water bodies were masked for the analysis.
The population density raster for Africa was downloaded from the WorldPop site (Worldpop 2015). It was clipped in ArcGIS to the extent of the NDVI data set.
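The construction of the 8-day series from two overlapping 16-day streams can be sketched as follows. This is an illustration only: the `Scene` record, the function `interleave`, and the assumed phasing (Terra composites starting on day 1 of the year, Aqua composites 8 days later) are assumptions of this sketch rather than details stated in the text.

```python
from typing import List, Tuple

# Hypothetical record: (composite start day of year, source satellite).
Scene = Tuple[int, str]


def interleave(terra: List[Scene], aqua: List[Scene]) -> List[Scene]:
    """Merge two 16-day composite streams into one series ordered by start day."""
    return sorted(terra + aqua, key=lambda scene: scene[0])


# Assumed phasing: the two 16-day schedules are offset by 8 days.
terra = [(day, "terra") for day in range(1, 366, 16)]
aqua = [(day, "aqua") for day in range(9, 366, 16)]

series = interleave(terra, aqua)
# Spacing between consecutive composites in the merged series.
gaps = [later[0] - earlier[0] for earlier, later in zip(series, series[1:])]
```

Under these assumptions each stream contributes 23 composites per year, so the merged series has 46 scenes per year spaced 8 days apart.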

Reducing correlation and clustering
The time series of 8-day NDVI rasters naturally has a very high degree of spatial and temporal correlation. Following the original LKN methodology, we reduced dimensionality and orthogonalized the data by applying Principal Component (PC) decomposition to the original time series. We retained 12 components, as in the original methodology.
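The dimension-reduction step can be sketched in a few lines. The example below is a dependency-free illustration of principal component decomposition using power iteration with deflation; that numerical technique, the function names, and the toy matrix are assumptions made here for illustration and are not the implementation used in the study.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def center(rows: Matrix) -> Matrix:
    """Subtract the column mean (per time step) from every row (pixel)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix of the columns of already-centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def leading_components(cov: Matrix, n_components: int,
                       iters: int = 500) -> List[Tuple[float, List[float]]]:
    """Approximate the leading (eigenvalue, eigenvector) pairs of cov."""
    dim = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(n_components):
        vec = [1.0 / dim ** 0.5] * dim  # deterministic unit start vector
        for _ in range(iters):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = sum(x * x for x in nxt) ** 0.5
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for vec.
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(dim) for j in range(dim))
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next pass.
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


# Toy stand-in for the NDVI series: 6 pixels (rows) by 4 time steps (columns).
pixels = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [3.0, 6.0, 9.0, 12.0],
    [0.5, 1.2, 1.4, 2.1],
    [4.0, 8.0, 12.2, 16.0],
    [1.5, 3.1, 4.4, 6.0],
]
centered = center(pixels)
cov = covariance(centered)
pairs = leading_components(cov, n_components=2)

# Share of total variance retained by the components that were kept.
total_variance = sum(cov[i][i] for i in range(len(cov)))
retained = sum(value for value, _ in pairs) / total_variance

# Project each pixel onto the retained components (the reduced data set).
scores = [[sum(x * w for x, w in zip(row, vec)) for _, vec in pairs]
          for row in centered]
```

In practice a linear algebra library would replace `leading_components`; the point of the sketch is only that each pixel's long time series is replaced by a short vector of uncorrelated scores before clustering.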
The original methodology employs cluster analysis to define regions with similar climate. It aims to assign a finite set of labels (also known as categories or classes) to a very large number of multidimensional objects (pixels, each representing a defined area on the ground in our case) based on their similarity. Given a set of n data points distributed over time t, $x_i$, $x \in \mathbb{R}^{n \times t}$, a conventional clustering algorithm seeks to minimize the clustering objective function

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, c_j),$$

where $c_1, \dots, c_k$ represent the centers of the respective clusters 1 to k, $C_j$ is the set of points assigned to cluster j, and $d(x_i, c_j)$ is a distance measure between each point and the center of its cluster.
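The clustering objective described above can be illustrated with a small k-means sketch. The squared Euclidean distance, the evenly spaced initial centers, and the toy points are assumptions of this example; the study itself clusters the retained principal component scores.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance, the distance measure assumed here."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], centers: List[List[float]]) -> List[int]:
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def objective(points: List[Point], centers: List[List[float]],
              labels: List[int]) -> float:
    """Sum of distances between each point and its assigned center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def kmeans(points: List[Point], k: int,
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd iterations with a simple deterministic initialization."""
    n = len(points)
    # Assumption: start from points at evenly spaced positions in the list.
    centers = [list(points[i * n // k]) for i in range(k)]
    labels = assign(points, centers)
    for _ in range(max_iter):
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
total = objective(points, centers, labels)
```

Each iteration can only lower or preserve the objective, which is why the result depends on the initial centers and why k must be supplied in advance.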

Determination of the number of regions
The clustering algorithm requires that the number of climate regions be specified a priori. The LKN method employs a cluster validity index criterion to decide the optimal number of regions. We extend this approach by using several cluster validity criteria and adding a regularization term that penalizes the number of regions formed. In general, a cluster validity criterion measures the goodness of a clustering. Commonly used cluster validity indexes compare the compactness of the clusters with the dispersion of the cluster centroids. We use three generally employed cluster validity indexes: Calinski-Harabasz (Caliński and Harabasz 1974), Dunn (Dunn 1974) and Davies-Bouldin (Davies and Bouldin 1979). We have trivially transformed these indexes so that, for each of them, the optimal solution minimizes the validation criterion with respect to the number of clusters. In addition to the cluster validity criterion, we added a regularization term that depends on the number of regions k. The validation criterion therefore becomes the sum of the validity index for the clustering solution of the data set x with k regions and the regularization term, scaled by a regularization constant.
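A hedged sketch of how a validity index and a regularized selection rule fit together is given below. The Calinski-Harabasz formula is the standard one; everything else is an assumption of this example: the function names, the caller-supplied `penalty` function (the specific form of the regularization term is not reproduced here), the example penalty `1 / k`, and the hypothetical index values.

```python
from typing import Callable, Dict, List, Sequence


def calinski_harabasz(points: List[Sequence[float]], labels: List[int]) -> float:
    """Ratio of between-cluster to within-cluster dispersion.

    Requires more than one cluster, fewer clusters than points, and
    non-zero within-cluster dispersion.
    """
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dim)]
        between += len(members) * sum((a - b) ** 2
                                      for a, b in zip(centroid, overall))
        within += sum(sum((a - b) ** 2 for a, b in zip(p, centroid))
                      for p in members)
    return (between / (k - 1)) / (within / (n - k))


def select_k(index_by_k: Dict[int, float],
             penalty: Callable[[int], float], lam: float) -> int:
    """Pick k minimizing (transformed index + lam * penalty(k)).

    index_by_k holds index values already transformed so lower is better.
    """
    return min(index_by_k, key=lambda k: index_by_k[k] + lam * penalty(k))


# Toy check of the index on two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
ch = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])

# Hypothetical transformed index values and a purely illustrative penalty.
index_by_k = {2: 1.0, 3: 0.9, 4: 0.95}
unregularized = select_k(index_by_k, penalty=lambda k: 1.0 / k, lam=0.0)
regularized = select_k(index_by_k, penalty=lambda k: 1.0 / k, lam=1.0)
```

With the regularization constant set to zero the rule reduces to the plain index minimum; a non-zero constant can move the selection to a different number of regions, which is the behaviour the extension relies on.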

Regionalization and population distribution
We evaluate this approach by comparing the distribution of the population with the proposed regional division. In the context of epidemiological and sociological research, it is desirable that the population distribution across climate regions be uniform, or as close to uniform as practical.

Regionalization
For this analysis, we studied the West African country of Ghana. Situated on the coast of West Africa's Gulf of Guinea and on the shores of Lake Volta, one of the largest freshwater bodies in the world, it has significant variability in local climate as well as in population density. It has an ocean shoreline and dense rainforest, a lifeless desert and mountain ranges. The large variability in local climate patterns and the significant variability in population density on a relatively small geographic footprint create a favourable set of conditions for this study.
We downloaded 2760 MODIS tiles (h17v7, h17v8, h18v7 and h18v8) from the MOD13A2 and MYD13A2 collections. After pre-processing the downloaded tiles (mosaicking the tiles, extracting the NDVI and QA SDSs, clipping to the study area, and re-projecting), we performed principal component analysis as described in the methods section. The top 12 principal components retained 96.9% of the total information in the 46 components. The pseudo-colour image of the first three principal components presented in Figure 1 demonstrates a good separation of colours, and distinct spatial features can be clearly seen in that figure. We proceeded by clustering the first 12 principal components using the k-means clustering algorithm with a range of 2 to 28 classes. The validity of each clustering result was assessed by the Calinski-Harabasz cluster validity index. The index reaches its minimum value at k = 3. It also has local minima at k = 15 and k = 25 regions, suggesting that these could also be considered as candidate values (Figure 2).

Figure 2. Calinski-Harabasz index value
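Candidate numbers of regions of the kind read off the index curve can be located programmatically as local minima. The sketch below uses made-up index values, not the values behind Figure 2, and the function name `local_minima` is a hypothetical helper.

```python
from typing import Dict, List


def local_minima(curve: Dict[int, float]) -> List[int]:
    """Return k values whose index is lower than at both neighbouring k.

    The first and last k have only one neighbour and are never reported.
    """
    ks = sorted(curve)
    return [k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
            if curve[k] < curve[prev] and curve[k] < curve[nxt]]


# Hypothetical transformed index values (lower is better).
curve = {2: 0.90, 3: 0.40, 4: 0.70, 5: 0.65, 6: 0.80, 7: 0.75, 8: 0.78}

candidates = local_minima(curve)      # every local minimum is a candidate
best = min(curve, key=curve.get)      # the global minimum of the curve
```

Keeping all local minima, rather than only the global one, leaves room for a second criterion such as population balance to choose among the candidates.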
We added the regularization term as described in the methods section and also included two additional cluster validity measures. The two added indexes do not confirm the Calinski-Harabasz selection of the 3-region solution as the preferred one. Instead, with added regularization, all three indexes concur that the better overall solutions to the climate regionalization of the study area are 8 or 15 clusters (Figure 3).

Distribution of the population
The population in the study region is distributed very unevenly. The two major cities, Accra and Kumasi, account for nearly 20 percent of the total population of 33 million. In Table 1 we aggregated the number of people residing in each of the assigned climate regions. For the first column, K-03, for example, there are three defined regions, with the population distributed as 56.5, 11.5 and 32.0 per cent (18.8, 3.9 and 10.7 million), respectively. It is clear that the population is allocated very unevenly to these regions. Increasing the number of regions to 8 and 15 (K-08 and K-15, respectively), as suggested by the regularized ensemble of cluster validity indexes, indeed creates a much more even distribution of the population and leads to a sharp decline in the population variance between the climate regions. The standard deviation of population counts across regions declines from 7.8 for 3 regions to 2.5 and 1.4 for 8 and 15 regions, respectively.
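The aggregation behind such a comparison can be sketched as follows: sum the population raster over the pixels of each region, convert to shares, and summarize the spread. The pixel values are invented for illustration, the helper name `population_by_region` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import stdev
from typing import Dict, List


def population_by_region(labels: List[int],
                         population: List[float]) -> Dict[int, float]:
    """Sum per-pixel population counts within each climate region label."""
    totals: Dict[int, float] = {}
    for region, people in zip(labels, population):
        totals[region] = totals.get(region, 0.0) + people
    return totals


# Hypothetical flattened rasters: one region label and one count per pixel.
labels = [0, 0, 1, 2, 2, 2]
population = [100.0, 300.0, 50.0, 200.0, 150.0, 200.0]

totals = population_by_region(labels, population)
grand_total = sum(totals.values())
# Percentage of the total population living in each region.
shares = {region: 100.0 * count / grand_total
          for region, count in totals.items()}
# Spread of regional counts; smaller means a more even allocation.
spread = stdev(list(totals.values()))
```

Comparing `spread` across candidate regionalizations gives a single number for how evenly each one divides the population.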

DISCUSSION
Climate is an important factor in environmental and public health and in climate change adaptation, and it affects many facets of human life. The ability to detect differences in climate and in the pattern of climate change is important and is the subject of a growing body of research. Using an automated method to define climate regions based on satellite remote sensing data allows a uniform definition of regional climate patterns. Furthermore, it allows climate regions to be adapted to the effects of climate change. We have demonstrated that, by using regularization, it is possible to adjust climate regionalization to address the needs of epidemiological and public health research. This allows the regions to be tailored to a specific discipline without losing the generality of the LKN methodology.
It is also worth noting that there are several hyper-parameters in this methodology that may require further study and tuning to the specific areas of interest and the requirements of the research protocol. The number of components extracted from the PCA decomposition, the type of cluster validity indexes used, the voting methodology for selecting a suitable number of regions, and the strength of regularization all require additional research to utilize their potential to the fullest.
Future directions. This study is part of our larger effort to assess and evaluate the effect of extreme weather and climate on US elderly residents. We studied Ghana as a pilot site. Ghana has sufficient diversity in its climate and population density. At the same time, it is substantially smaller than the continental United States in both size and population count, which made it an ideal site for the development and testing of the regionalization methodology.

CONCLUSION
We demonstrated the applicability of the LKN methodology to another region with different climate and demographic patterns. The suggested enhancements and regularization term allow a more robust determination of the climate regions. Further study of hyper-parameter determination is required to facilitate the integration of this methodology into the wider context of a decision support framework.

Figure 1. Pseudo-color image of the first three principal components
The greater values of the index indicate healthier vegetation cover with vigorous growth, while lower values indicate declining, stressed or dying vegetation. NASA, using data provided by the Moderate Resolution Imaging Spectroradiometer (MODIS) on board NASA's Terra and Aqua satellites, produces worldwide NDVI composites with 16-day overlapping temporal resolution and various spatial resolutions (LPDAAC-NASA 2000-2013).