QUANTIFYING THE RELATIONSHIP BETWEEN NATURAL AND SOCIOECONOMIC FACTORS AND FINE PARTICULATE MATTER (PM2.5) POLLUTION BY INTEGRATING REMOTE SENSING AND GEOSPATIAL BIG DATA

PM2.5 pollution is an environmental issue that results from various natural and socioeconomic factors and is frequently observed in spring and winter across mainland China. However, the dominant influence of natural and socioeconomic factors on PM2.5 within a city has not yet been extensively studied. In this study, Random Forest Regression (RFR) is used to quantify the relationships between PM2.5 and potential factors within Wuhan city on a typical day during the transition from winter to spring. Specifically, the 24-hour average PM2.5 concentrations in the downtown area on February 17, 2017 are collected at 9 sites. Simultaneous aerosol optical depth (AOD) is retrieved from the Moderate Resolution Imaging Spectroradiometer (MODIS). The ground-measured PM2.5 and AOD are coupled to retrieve the near-surface PM2.5 concentration by Spatial-temporal CoKriging (STCK) with the Normalized Difference Vegetation Index (NDVI), the Modified Normalized Difference Water Index (MNDWI) and the Normalized Difference Built-up Index (NDBI) from Landsat-8, and the DEM from the Shuttle Radar Topography Mission (SRTM). As geospatial big data booms, Internet-collected volunteered geographic information (VGI), which represents urban form and function, is integrated into the regression to obtain the spatial variable importance measures (VIMs) by RFR in both the centre and the sub-urban region of Wuhan. The results reveal that terrain characteristics and the density of industrial enterprises have clear relationships with the accumulation of PM2.5, while road density also contributes to it.


INTRODUCTION
The harm of fine particulate matter (PM2.5) to public health has drawn attention across China in recent years (Ma et al., 2016; Wang et al., 2015). As an environmental issue, PM2.5 is widely recognized as being related to natural and socioeconomic factors (Chen et al., 2016; Tan et al., 2017). China has a vast territory with variable natural conditions across its mainland, and natural and socioeconomic factors interact in various ways across the country (Ma et al., 2016). Industry, fossil fuel consumption and vehicles have been identified as the main causes of PM2.5 accumulation worldwide (Perrone et al., 2012; Zalakeviciute et al., 2018). Socioeconomic factors are directly related to the emergence and spatial-temporal variation of PM2.5, while natural elements influence its diffusion and accumulation (Fanizza et al., 2018; Kioumourtzoglou et al., 2016). Cities are closely tied to human life, and the interaction between man and nature is more complicated within them. However, the spatial-temporal variations and driving factors of PM2.5 at the urban scale have not been thoroughly explored (Yang et al., 2018a).
Many scientific research institutions have made enormous efforts to monitor environmental issues within cities. The China Atmosphere Watch Network (CAWN) and the Aerosol Robotic Network (AERONET) were established against this background. Most existing PM2.5 monitoring data are scattered point measurements provided by monitoring networks such as CAWN and AERONET. To obtain seamless and continuous PM2.5 concentrations at the surface level, many previous studies use the relationship between satellite-observed aerosol optical depth (AOD) and ground-based PM2.5 observations (Guo et al., 2014; Lin et al., 2015). Methods that retrieve surface-level PM2.5 concentration from the AOD-PM2.5 relationship can be classified into two categories: empirical and semi-empirical observation-based methods (Lin et al., 2015). The empirical observation-based methods rely on statistical regressions of the AOD-PM2.5 relationship. These methods usually adopt polynomial linear regression (Li et al., 2011), artificial neural networks (ANN) (Wu et al., 2012), or non-linear regression (Benas et al., 2013). The semi-empirical observation-based methods integrate the AOD-PM2.5 relationship with atmospheric parameters (Boyouk et al., 2010; Yang et al., 2018b).
Since the accumulation of PM2.5 is known to be affected by natural and socioeconomic factors, the mapping and analysis of surface-level PM2.5 and its factors depend on the significance and representativeness of the factors used in the retrieval and regression. Natural factors within the city, such as greenness, bareness and wetness, can be retrieved from remotely sensed images acquired by multiple space-borne sensors with high spatial and temporal resolution and broad scene coverage. However, the economic factors within the city are too varied and volatile to be extracted accurately and in a timely manner (Yao et al., 2017). Conventional statistical and monitoring methods cannot meet the rigorous requirements of data volume and temporal resolution for spatial-temporal pattern analysis of socioeconomic activities within cities. As geospatial big data booms, Internet-collected volunteered geographic information (VGI), such as Points of Interest (POI) and the real-time population thermodynamic map (PTM), can now be used to resolve this dilemma (Elwood, 2008; Zhang et al., 2018). Previous studies show that POI and PTM are related to socioeconomic activities at multiple scales and effectively represent urban form and urban function (Yao et al., 2017; Zhang et al., 2018).
In this study, the fine-scale surface PM2.5 concentration is retrieved using AOD from the Moderate Resolution Imaging Spectroradiometer (MODIS) coupled with the Normalized Difference Vegetation Index (NDVI), the Modified Normalized Difference Water Index (MNDWI) and the Normalized Difference Built-up Index (NDBI) from Landsat-8, and the DEM from the Shuttle Radar Topography Mission (SRTM). The POIs and PTM collected over the same period, together with some other factors, are used to explore the main driving factors in the central and sub-urban areas of Wuhan city by Random Forest Regression (RFR).

Study area and data
Wuhan, China, is selected as the case area. Wuhan is a megacity with the largest population and economic volume in central China. The study area covers 75 × 75 km² and has a heterogeneous land cover composition. The urban area enclosed by the 2nd Ring Road is the centre of Wuhan, which has a high building density and land use efficiency. Land use types in this area are mainly residential, commercial and public service land use. The area outside the 2nd Ring Road and in the neighbourhood of the 3rd Ring Road mainly consists of sub-urban and expanded urban areas, where the land use types are more diversified.

Figure 1. Study area
The MODIS AOD product (MOD04_3K) at 3000 meters resolution (https://ladsweb.modaps.eosdis.nasa.gov/search/) and PM2.5 concentrations at 9 sites within the downtown area of Wuhan are coupled in the retrieval of surface-level PM2.5. Detailed information on these sites is reported in Table 1.

Spatial-temporal CoKriging (STCK)
Technically, to obtain the spatially and temporally continuous latent pattern of the surface-level PM2.5 concentration in Wuhan within one day, Spatial-temporal CoKriging (STCK) is adopted for its proper consideration of the spatial-temporal relationship between the variables and their dependence factors (Liu et al., 2015). STCK relies on the basic assumption of spatial-temporal stationarity. This assumption indicates that the sampling data of many natural phenomena have a certain correlation in the space or time domain. In the STCK prediction, $\hat{Z}_F(x_0, t_k)$ is the predicted pixel at fine resolution F at time $t_k$ for each sliding window, $Z_V(x_i, t_k)$ is the pixel at coarse resolution V at time $t_k$, and $Z_V(x_i, t_0)$ is the pixel at fine resolution V at time $t_0$. For the 3 × 3 sliding window, m = 9. The weights are optimized in the time domain by solving a linear system that applies $C^{-1}$, the inverse of the covariance matrix $C$, to a vector of the form $[C_0(x_0), C_1(x_0), \ldots, C_k(x_0), 0_1, 0_2, \ldots, 0_{k-1}, 1]^T$. Within $C$, $C_{00}$ is the sub-covariance matrix of size $m \times m$ for the fine resolution image pairs $(Z(x_i, t_0), Z(x_j, t_0))$. STCK is scalable: it remains applicable when the number of independent variables increases.
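The weight optimization described above amounts to solving a linear system built from a covariance matrix. The following is only a minimal sketch of that idea, assuming an ordinary-kriging style system with a single unbiasedness constraint rather than the full STCK system; the exponential covariance model, its parameters and the function names are hypothetical and not taken from the study.

```python
import math

def exp_cov(h, sill=1.0, corr_range=3.0):
    """Hypothetical exponential covariance model (an assumption, not from the paper)."""
    return sill * math.exp(-h / corr_range)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def kriging_weights(points, target):
    """Weights for the window pixels; they are constrained to sum to 1."""
    n = len(points)
    # Covariance between all window pixels, bordered by the constraint row/column.
    a = [[exp_cov(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    # Right-hand side: covariance to the target, then the constraint value 1.
    b = [exp_cov(math.dist(p, target)) for p in points] + [1.0]
    return solve(a, b)[:n]  # the last unknown is the Lagrange multiplier

window = [(i, j) for i in range(3) for j in range(3)]  # 3 x 3 window, m = 9
weights = kriging_weights(window, (1.2, 0.8))          # off-centre target
exact = kriging_weights(window, (1.0, 1.0))            # target on the centre pixel
```

When the target coincides with a window pixel, all the weight falls on that pixel, which is the exact-interpolation property of kriging.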

Random Forests Regression (RFR)
RFR is one of the most suitable machine-learning methods for exploring the spatial-temporal variation and driving factors of atmospheric issues such as PM2.5 concentration (Hu et al., 2017). That study also indicates that RFR can handle multicollinearity and high dimensionality of the input data and can properly address the overfitting problem. RFR is a non-linear statistical regression method that builds and averages multiple randomised, de-correlated decision trees (Hutengs and Vohland, 2016). RFR introduces randomness into the construction of each decision tree by bootstrap sampling. For each tree, the following process is repeated:
1. select m of the p variables of the original dataset at random,
2. find the variable that splits the dataset optimally and compute the corresponding split point,
3. split the input dataset at this split point.
Subsequently, RFR grows many such trees and averages them to obtain predictions. Out-of-bag (OOB) error estimation (Hutengs and Vohland, 2016) and the spatial variable importance measure (VIM) (Zhang et al., 2018) are two key components of RFR. The number m of variables tried at each split is selected by minimising the OOB error of predictions. The growth of each decision tree stops when the OOB error has stabilized.
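The tree-building steps and the OOB error can be illustrated with a small pure-Python sketch. It is a simplified stand-in under stated assumptions, not the implementation used in this study: the data are synthetic, the trees are shallow with a fixed depth limit, and all function names and parameter values are hypothetical.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    mu = avg(values)
    return sum((v - mu) ** 2 for v in values)

def best_split(rows, ys, features):
    """Step 2: best (sse, feature, threshold) among the candidate features."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, ys) if r[f] < t]
            right = [y for r, y in zip(rows, ys) if r[f] >= t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow(rows, ys, m, rng, depth=0, max_depth=3):
    if depth == max_depth or len(ys) < 10:
        return avg(ys)                                 # leaf: mean response
    features = rng.sample(range(len(rows[0])), m)      # step 1: m of p variables
    split = best_split(rows, ys, features)
    if split is None:
        return avg(ys)
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] < t]  # step 3: split the data
    right = [i for i, r in enumerate(rows) if r[f] >= t]
    return (f, t,
            grow([rows[i] for i in left], [ys[i] for i in left], m, rng, depth + 1, max_depth),
            grow([rows[i] for i in right], [ys[i] for i in right], m, rng, depth + 1, max_depth))

def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node

def fit_forest(rows, ys, n_trees=20, m=2, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    trees, oob_sets = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample
        trees.append(grow([rows[i] for i in idx], [ys[i] for i in idx], m, rng))
        oob_sets.append(set(range(n)) - set(idx))      # samples this tree never saw
    return trees, oob_sets

def oob_mse(trees, oob_sets, rows, ys):
    errors = []
    for i, (row, y) in enumerate(zip(rows, ys)):
        preds = [predict_tree(t, row) for t, oob in zip(trees, oob_sets) if i in oob]
        if preds:
            errors.append((avg(preds) - y) ** 2)
    return avg(errors)

# Synthetic data (an assumption): the response depends on the first variable only.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(120)]
y = [3.0 * r[0] + data_rng.gauss(0.0, 0.1) for r in X]
trees, oob_sets = fit_forest(X, y, n_trees=20, m=2)
oob_error = oob_mse(trees, oob_sets, X, y)
baseline = sse(y) / len(y)  # error of always predicting the mean
```

Repeating `fit_forest` for several values of `m` and keeping the one with the lowest `oob_error` mirrors the selection of m described above.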

Surface-level PM2.5 concentration retrieval
The hourly ground-observed PM2.5 concentration data are spatially and temporally discrete, so they cannot reflect the typical pattern and spatial disparity of PM2.5 concentration within one day throughout the region. To obtain a continuous and seamless pattern of surface-level PM2.5 concentration, STCK is adopted using AOD data from MODIS and spectral indices from Landsat-8 OLI. MOD04_3K is the MODIS level 2 aerosol product with 3000 meters resolution, representing the ambient aerosol optical properties, aerosol mass concentration and data quality assurance. The MOD04_3K product in Collection 6 retrieves AOD using the Dark Target algorithm (Sayer et al., 2014). In this study, we collected the MOD04_3K data for February 17, 2017 (Figure 2). It can be seen that the distribution of industrial parks in Wuhan has a significant impact on regional PM2.5 accumulation. In addition, in the process of urban development and renewal, open-air construction sites also result in locally higher PM2.5 concentrations. However, it is noteworthy that Qingshan District, home to Wuhan Iron and Steel Works, the largest iron and steel plant in central China, does not have a high concentration of PM2.5. To the best of our knowledge, Qingshan industrial district was selected as one of the second batch of Pilot Demonstrations of Circular Economy in China in 2007. Since then, various measures for technological upgrading and energy recycling have been promoted to weaken the negative effects of iron and steel production on the environment (http://www.qingshan.gov.cn).

VIMs of driving factors for PM2.5 concentration in Wuhan
RFR is able to measure the contribution weights of the driving factors involved in the regression, i.e. the VIMs (Zhang et al., 2018). The VIM of each input variable (Breiman, 2001) is calculated on the basis of the OOB samples as follows:
1. after the construction of each decision tree, the OOB estimation error is calculated,
2. the OOB estimation error calculation is repeated for every variable involved,
3. an additional OOB estimation error is calculated by perturbing the values of each variable in the OOB samples,
4. the difference between the OOB estimation error and the additional OOB estimation error is the VIM of each variable.
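The VIM calculation above can be sketched as follows. This is a hedged illustration, assuming the perturbation is a random permutation of each variable as in Breiman (2001); for brevity a fixed toy predictor stands in for the trained forest and the whole synthetic sample stands in for the OOB samples. The names `permutation_vim` and `toy_predict` are hypothetical.

```python
import random

def mse(predict, rows, ys):
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def permutation_vim(predict, rows, ys, seed=0):
    """VIM of each variable = error after permuting it minus the original error."""
    rng = random.Random(seed)
    base_error = mse(predict, rows, ys)
    vims = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between variable j and the response
        perturbed = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        vims.append(mse(predict, perturbed, ys) - base_error)
    return vims

# Synthetic example (an assumption): only the first variable drives the response.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(100)]
ys = [3.0 * r[0] for r in rows]

def toy_predict(row):
    # Stand-in for a fitted model; it ignores the second variable entirely.
    return 3.0 * row[0]

vims = permutation_vim(toy_predict, rows, ys)
```

A variable the model does not use, such as the second one here, receives a VIM of zero, while permuting an influential variable raises the error and therefore its VIM.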
In our study, the natural factors involved cover bareness (represented by NDBI) and terrain characteristics (represented by DEM). The socioeconomic factors are industrial land use (IL) distribution, commercial land use (CL) distribution, administration and public services land use (APL) distribution, road density (RD) and population distribution (PD). In addition, the density of restaurants (RED) is taken into consideration to evaluate the effects of cooking fumes on PM2.5 accumulation. According to Table 3, DEM has the highest VIM value among the natural factors. Topographic factors, such as elevation and slope direction, have a greater impact on the accumulation and dispersion of PM2.5. It is well known that cities located in basins are more vulnerable to PM2.5 pollution (Hu et al., 2017; Yang et al., 2018a). IL has the highest VIM value among the factors. Industrial land use usually produces more particulate pollutants, nitrogen oxides and sulphides, which are widely recognized as the main chemical sources of PM2.5, than any other land use category. PM2.5 accumulation is also closely related to RD because of the sulfur dioxide and nitrogen dioxide in vehicle exhaust. It is remarkable that RED also contributes substantially to the PM2.5 concentration in Wuhan, because restaurants themselves are sources of particulate matter and lampblack. At the same time, social and economic activities around restaurants are more frequent, resulting in greater population and traffic flows, which also contribute to PM2.5 accumulation. However, RED is highly correlated with RD and CL. Thus, this phenomenon needs to be further verified by more accurate field observations and chemical analysis.
To explore the spatial disparity of factors across regions with different natural and socioeconomic backgrounds within Wuhan, RFR is also implemented on these factors separately in the centre (enclosed by the 2nd Ring Road) and the sub-urban area (outside the 2nd Ring Road) of Wuhan. The VIMs of these factors in the centre and sub-urban area of Wuhan are reported in Table 4 and Table 5.

RESULTS
This study aims at quantifying the effects of natural and socioeconomic factors on the concentration of PM2.5 within Wuhan. The spatially and temporally continuous pattern of surface-level PM2.5 concentration is produced by STCK, integrating 24-hour ground-observed PM2.5 data and daily AOD from MODIS. The spatial disparity of surface-level PM2.5 accumulation is related to urban function and land use.
The contributions of multiple natural and socioeconomic variables in the centre, the sub-urban area and the whole study area are quantified using the VIMs of RFR. The findings of this study can support a better understanding of the driving factors of PM2.5 accumulation within cities. However, due to the intrinsic characteristics of RFR, the non-stationarity of the relationship between PM2.5 concentration and multiple factors cannot yet be extensively explored. This problem can be addressed in future work by using geo-statistical models that consider spatial heterogeneity.

Figure 3. Surface-level PM2.5 concentration of Wuhan at 120 meters resolution
Figure 3 reveals that the PM2.5 concentration in the centre of Wuhan, enclosed by the 2nd Ring Road, is relatively heterogeneous. The accumulation of PM2.5 in Hankou is slightly more pronounced than that in Wuchang and Hanyang. The areas with high PM2.5 concentration are mainly scattered along the 3rd Ring Road in Hankou and Hanyang. They notably include some of the main industrial districts in Wuhan:
1. the Dongfeng Peugeot Citroën Automobile Company (DPCA) industrial park, located in the south-west corner of Figure 3,
2. the Jinyintan industrial park, at the northern boundary of Figure 3,
3. the Baishazhou area, which contains many open-air construction sites and logistics parks,
4. Wuchangnan railway station, one of the most important freight marshalling stations on the Jingguang railway.
All the POIs, the road shapefile and the population density are collected from Baidu Map (http://lbsyun.baidu.com) using geospatial big data techniques. To be consistent with the resolution of the PM2.5 concentration, natural factors are resampled to 120 m resolution, and socioeconomic factors are calculated per 120 × 120 m² pixel. The 24-hour average population density of Wuhan on February 17, 2017 is shown in Figure 4.
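Computing a socioeconomic factor per 120 m pixel can be sketched as a simple point-in-cell count. This is only an illustrative sketch, assuming the POI coordinates are already in a projected system measured in metres; the function name, grid origin and sample points are hypothetical.

```python
from collections import Counter

CELL_SIZE = 120.0  # pixel size in metres, matching the PM2.5 grid

def grid_counts(points, x_min, y_min, cell=CELL_SIZE):
    """Count points per (row, col) cell of a regular grid anchored at (x_min, y_min)."""
    counts = Counter()
    for x, y in points:
        col = int((x - x_min) // cell)
        row = int((y - y_min) // cell)
        counts[(row, col)] += 1
    return counts

# Hypothetical POI coordinates in metres relative to the grid origin.
pois = [(10.0, 10.0), (119.0, 5.0), (120.0, 0.0), (250.0, 130.0)]
counts = grid_counts(pois, x_min=0.0, y_min=0.0)
```

Dividing each count by the cell area (0.0144 km² for a 120 m pixel) would turn the counts into a density per square kilometre.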

Figure 4. 24-hour average population density of Wuhan collected from Baidu Map on February 17, 2017
As reported in Table 5, IL, DEM and RD are the top three factors leading to PM2.5 pollution in the sub-urban area of Wuhan. As mentioned in Section 3.1, the majority of industrial land use is located in the sub-urban area along the 3rd Ring Road. In the areas within the 3rd Ring Road in Wuchang and Hankou districts (specifically the Baishazhou and Wangjiadun regions), there are many open-air logistics yards and building sites, which may produce serious particulate emissions. The ground truths of the Baishazhou and Wangjiadun regions in Google Earth Pro are shown in Figure 5.

Figure 5. The ground truth of the Wangjiadun region (a) and the Baishazhou region (b) in Google Earth Pro

Table 1. Detailed information of nine stations within Wuhan
The natural constitution is represented by NDVI, NDBI, MNDWI and Albedo. These indicators are calculated using the L1T product of the Landsat-8 Operational Land Imager (OLI). The POIs and PTM are collected from Baidu Map (http://lbsyun.baidu.com) and spatialized in ArcGIS 10.2.
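The spectral indices named above are normalized band differences. The sketch below uses their standard definitions with Landsat-8 OLI bands (green = band 3, red = band 4, NIR = band 5, SWIR1 = band 6); the reflectance values are made up for illustration, the function names are hypothetical, and albedo is omitted.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 where the denominator vanishes."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def spectral_indices(green, red, nir, swir1):
    """Standard index definitions from per-pixel surface reflectance."""
    return {
        "NDVI": normalized_difference(nir, red),      # vegetation greenness
        "MNDWI": normalized_difference(green, swir1),  # open water / wetness
        "NDBI": normalized_difference(swir1, nir),     # built-up / bareness
    }

# Illustrative reflectances for a vegetated pixel (assumed values).
indices = spectral_indices(green=0.10, red=0.08, nir=0.40, swir1=0.20)
```

For this vegetated example NDVI is positive while MNDWI and NDBI are negative, which is the expected sign pattern for vegetation.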

Table 2. Detailed information and definition of each variable
Table 2 illustrates the natural and socioeconomic factors involved in this study.

Table 3. VIMs of variables throughout the study area
In Table 3, the VIMs and the cross-validated mean absolute predictive error (MAPE) of each spatial variable throughout the whole study area are reported.

Table 4. VIMs of variables in the centre of Wuhan

Table 4 reveals that the dominant factor of PM2.5 accumulation in the urban central area is RD. An increase in the VIMs of the socioeconomic factors, except for IL, is observed. Specifically, the contributions of CL, RD and RED increase most noticeably, because the major land use categories in the centre of Wuhan are residential land use, traffic facilities and commercial areas. Hence, exhaust gas, cooking fumes and other wastes become the major factors in PM2.5 accumulation under the urban background.

Table 5. VIMs of variables in the sub-urban area of Wuhan
As reported in Table 5, IL, DEM and RD are the