A GEOGRAPHIC WEIGHTED REGRESSION FOR RURAL HIGHWAYS CRASHES MODELLING USING THE GAUSSIAN AND TRICUBE KERNELS: A CASE STUDY OF USA RURAL HIGHWAYS

According to a World Health Organization (WHO) report, road traffic crashes are among the eight leading causes of death worldwide. The purpose of this paper is to develop a regression method for the factors that affect highway crashes. Traditional methods assume that the data are completely independent and that the environment is homogeneous, whereas crashes are spatial events that occur in geographic space and therefore produce spatial data. Spatial data exhibit spatial autocorrelation and spatial non-stationarity, which make them difficult to handle with such methods. The proposed method was applied to a set of records of fatal crashes that occurred on highways connecting eight eastern US states. These data were recorded between 2007 and 2009. In this study, we used the geographic weighted regression (GWR) method with two kernels, Gaussian and Tricube. The number of casualties was taken as the dependent variable, and the number of persons in the crash, road alignment, number of lanes, pavement type, surface condition, road fence, light condition, vehicle type, weather, drunk driver, speed limitation, harmful event, road profile, and junction type were taken as explanatory variables, following previous studies that used the GWR method. We compared the results with those of the ordinary least squares (OLS) method. The R² value is 0.0654 for OLS and 0.9196 for the proposed method, which implies that the proposed GWR is a better regression method for rural highway crashes.


INTRODUCTION
Road traffic crashes and their social and economic impacts have compelled the UN to declare the current decade the decade of safe roads. According to a World Health Organization (WHO) report, road traffic crashes are among the eight leading causes of death worldwide (Park et al., 2010). Among the various components of a country's infrastructure, roads are very important for the transport of goods and passengers. Therefore, road safety authorities around the world are demanding the use of new technologies in vehicles and infrastructure to enhance road safety. Crashes are fundamentally spatial events that occur in geographic space. Traffic crashes can be spatially correlated events, and analysing the distribution of traffic crash frequency requires the evaluation of parameters that reflect spatial properties and correlation (Rhee et al., 2016). Most accidents occur because of human error, technical failure of the vehicle, technical failure of the road, or environmental conditions. Sometimes the accumulation of these factors, acting as hidden factors, leads to an accident, and the accumulation of several factors creates hotspots. Identifying the factors that affect accidents can help prevent future incidents (Black, 1991).

Previous studies have used regression analysis to determine these factors. Crash severity models such as linear least squares, negative binomial regression, and Poisson regression, as well as more complex models such as seemingly unrelated regression, have been the most common methods. Chen et al. (2016) developed a hierarchical Bayesian logistic model to examine the significant factors at the crash and vehicle/driver levels and their heterogeneous impact on driver injury in rural interstate highway crashes. Rhee et al. (2016) employed a geographic weighted regression in an urban traffic analysis of Seoul. The results identified the best areas for safety improvement and, because center lanes had more crashes, indicated a need to improve their design to enhance safety. De Oña et al. (2013) used a combination of Latent Class Clustering (LCC) and Bayesian networks (BN). The results showed that the simultaneous use of these methods is useful for road safety analysis. Xu and Huang (2015) employed the semiparametric geographically weighted Poisson regression model (S-GWPR) and the random parameter negative binomial model (RPNB) to investigate spatial heterogeneity in regional crash modelling. The results showed that S-GWPR is more appropriate for regional crash modelling than non-spatial and global models. Zha et al. (2016) used the Poisson inverse Gaussian (PIG) regression model for modelling motor vehicle crash data and compared it with the negative binomial (NB) model. The results showed that PIG models perform better than NB models in terms of goodness-of-fit statistics and can perform as well as NB models in capturing the variance of crash data. PIG models also demonstrated the same prediction performance as NB models. Hence, the PIG model could be an alternative to the NB model for analysing crash data.

Several studies have been conducted to identify the factors affecting crashes. For instance, Sohn and Shin (2001) and Delen et al. (2006) used Artificial Neural Networks for data mining of road crashes. Clarke et al. (1998) and Chang and Chen (2005) used a decision tree method to study the crash rate and to determine the most important factors in crashes. Pakgohar et al. (2011) used regression trees and Logistic Regression to investigate the impact of various factors on road crashes in Iran. The results of that study indicate that driver failure was a cause in 97.5% of road accidents, the environment in 70.5%, and technical failure of vehicles in 31.5%. Kashani and Mohaymany (2011) used a classification and regression tree (CART) to determine both the severity of crashes and the factors affecting the severity of passenger injury on two-lane two-way roads.

As mentioned above, many studies have sought appropriate methods for detecting the factors that affect crashes, but most of them have not properly considered the features of spatial data. Basic regression models assume that the data are independent, but this assumption does not hold for spatial data. Spatial data have features that make them difficult to work with. Two of these features are a) spatial autocorrelation, based on Tobler's first law of geography, "everything is related to everything else, but near things are more related than distant things" (Tobler, 1970), and b) spatial non-stationarity, which represents change over space and the spatial heterogeneity of the environment. Traditional methods such as ordinary least squares (OLS) cannot accommodate spatial autocorrelation and non-stationarity because they assume that the data are completely independent and that the environment is homogeneous. Hence, OLS produces a single answer for all parts of the region, regardless of spatial dependencies. For this reason, a geographic weighted regression (GWR) method is proposed in this study to account for spatial autocorrelation and spatial non-stationarity in rural highway crashes.

Study area
This study used real-world data on fatal crashes that occurred in several states in the eastern US. These crashes occurred on highways that connect the eight states of Alabama, Georgia, North Carolina, South Carolina, Virginia, West Virginia, Kentucky, and Tennessee. The data consist of spatial and nonspatial records of fatal crashes that occurred from the beginning of 2007 to the end of 2009. During these years, 2432 fatal crashes were recorded. Some of these crashes involved pedestrians and some involved multiple vehicles. This study works on the part of the data related to two-vehicle crashes, which includes 828 crashes (see Figures 1 and 2). Table 1 shows the spatial and nonspatial variables used in this study.

Table 1. Spatial and nonspatial variables

Geographic weighted regression
As mentioned in the introduction, the OLS method cannot accommodate the features of spatial data because it assumes that the data are completely independent and that the environment is homogeneous. Hence, OLS produces a single answer for all points of the region without considering spatial dependency. For this reason, the GWR method was presented by Brunsdon et al. (1998).
In this method, the spatial dependency of the observations is taken into account through a weight matrix and, to account for environmental heterogeneity and non-stationarity, the regression coefficients are derived locally and separately for each point. The general relation for GWR is as follows (Brunsdon et al., 1998):

$$y_i = \beta_0(u_i, v_i) + \sum_{j=1}^{p} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i$$

where $y$ is the dependent variable, $x_j$ is the $j$-th independent variable, $p$ is the number of independent variables, $\varepsilon$ is the residual of the model, and $\beta_j(u, v)$ is the regression coefficient, which is a function of the observation point $(u, v)$. Unlike OLS, GWR is a weighted adjustment, and the regression coefficients can be computed by (Brunsdon et al., 1998):

$$\hat{\beta}(u_i, v_i) = \left(X^{T} W(u_i, v_i)\, X\right)^{-1} X^{T} W(u_i, v_i)\, y$$

where $W$ is the weight matrix of the observations, which is a function of the point's location. This matrix is diagonal, as follows (Brunsdon et al., 1998):

$$W(u_i, v_i) = \mathrm{diag}\left(w_{i1}, w_{i2}, \ldots, w_{in}\right)$$

Determining the geographical weights is very important in the GWR method, and several kernels have been presented for this purpose. In this study, the GWR method was used with two kernels that have demonstrated superior performance. The Gaussian and Tricube kernels are as follows (McMillen and McDonald, 2004):

$$w_{ij} = \phi\!\left(\frac{d_{ij}}{\sigma h}\right)$$

$$w_{ij} = \left[1 - \left(\frac{d_{ij}}{h}\right)^{3}\right]^{3} I\left(d_{ij} < h\right)$$

where $w_{ij}$ is the geographic weight of observation $j$ at point $i$, $\phi$ is the standard normal density function, $d_{ij}$ is the distance between the two points $i$ and $j$, $\sigma$ is the standard deviation of $d_{ij}$ for each point, $h$ is the bandwidth, and $I(\cdot)$ equals 1 when its condition holds and 0 otherwise. $d_{ij}$ is the Euclidean distance in Cartesian coordinates; when geographic coordinates are used, the distance is the great circle distance.

The most important issue in determining the geographic weights is selecting an appropriate bandwidth: if this parameter is too large, GWR tends toward the OLS results, and if it is too small, the variance increases (Charlton and Fotheringham, 2009). There are several methods for optimizing the bandwidth. One of them is the Cross Validation method, which can be computed by (Brunsdon et al., 1998):

$$CV(h) = \sum_{i=1}^{n} \left[y_i - \hat{y}_{\neq i}(h)\right]^{2}$$

where $n$ is the number of observations, $y_i$ is observation $i$, and $\hat{y}_{\neq i}$ is the estimated value for observation $i$ computed from the other observations. $\hat{y}_{\neq i}$ is a function of the bandwidth, and the bandwidth that minimizes this function is taken as the optimal bandwidth. In fact, the choice of bandwidth affects the goodness of fit more than the choice of kernel does. There are two methods for selecting the bandwidth (Charlton and Fotheringham, 2009):

• Fixed bandwidth: a fixed bandwidth is used if the data are regularly distributed.

• Variable (adaptive) bandwidth: this is used when the data are irregularly distributed or clustered. In this case, the bandwidth decreases in high-density areas and increases in low-density areas. One criterion for this change can be the minimum and maximum number of observation points in the search band. Alternatively, the bandwidth can be varied so that a fixed number of observations falls within each band.
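The kernel weighting, local estimation, and cross-validation steps described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the synthetic data, the candidate bandwidths, and the use of the population standard deviation for σ are assumptions made only for illustration, and distances are Euclidean (a great circle distance would replace them for geographic coordinates).

```python
import numpy as np


def gaussian_weights(d, h):
    """Gaussian kernel: standard normal density evaluated at d / (sigma * h)."""
    sigma = d.std()  # assumption: population std of the distances to this point
    z = d / (sigma * h)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)


def tricube_weights(d, h):
    """Tricube kernel: (1 - (d/h)^3)^3 inside the bandwidth, 0 outside."""
    u = np.clip(d / h, 0.0, 1.0)  # d >= h maps to u = 1, i.e. weight 0
    return (1.0 - u ** 3) ** 3


def local_fit(X, y, coords, point, h, kernel, exclude=None):
    """Weighted least squares at one location; returns [intercept, slopes...]."""
    d = np.linalg.norm(coords - point, axis=1)  # Euclidean distances to the point
    w = kernel(d, h)
    if exclude is not None:
        w[exclude] = 0.0  # drop one observation (used by cross validation)
    Xd = np.column_stack([np.ones(len(y)), X])  # design matrix with intercept
    XtW = Xd.T * w  # X^T W for a diagonal W
    # Solve (X^T W X) beta = X^T W y; lstsq tolerates rank-deficient cases.
    return np.linalg.lstsq(XtW @ Xd, XtW @ y, rcond=None)[0]


def cv_score(X, y, coords, h, kernel):
    """Leave-one-out cross-validation score for bandwidth h."""
    Xd = np.column_stack([np.ones(len(y)), X])
    resid = []
    for i in range(len(y)):
        beta = local_fit(X, y, coords, coords[i], h, kernel, exclude=i)
        resid.append(y[i] - Xd[i] @ beta)
    return float(np.sum(np.square(resid)))


# Synthetic example (hypothetical data): the slope drifts from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(200, 2))
x = rng.normal(size=200)
slope = 1.0 + 0.3 * coords[:, 0]
y = 2.0 + slope * x + rng.normal(scale=0.1, size=200)
X = x[:, None]

candidates = [1.0, 2.0, 4.0, 8.0]  # hypothetical fixed bandwidths to compare
scores = {h: cv_score(X, y, coords, h, tricube_weights) for h in candidates}
best_h = min(scores, key=scores.get)
print("CV scores:", scores)
print("bandwidth with the lowest CV score:", best_h)
```

The sketch searches a small grid of fixed bandwidths; an adaptive bandwidth could be obtained instead by setting `h` at each point to the distance of its k-th nearest observation.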

Evaluation Criteria
There are different measures for evaluating the results of a regression. One of them is $R^2$, which indicates the goodness of fit of the result. The value of $R^2$ lies between 0 and 1. $R^2 = 0$ indicates that the explanatory variables (the factors affecting crashes in this study) do not help in estimating the dependent variable (the number of casualties in this study), and $R^2 = 1$ indicates that the dependent variable is completely predictable by the regression model. $R^2$ can be calculated by (Shekhar and Xiong, 2007):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \tag{8}$$

where $n$ is the number of observations, $y_i$ is the value of the dependent variable for observation $i$, $\hat{y}_i$ is the estimated value of the dependent variable, and $\bar{y}$ is the mean of the observations. The residuals are evaluated with RMSE and NRMSE, which are computed by:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

$$NRMSE = \frac{RMSE}{\sigma_{\hat{y}}}$$

where $\sigma_{\hat{y}}$ is the standard deviation of the estimated values of the dependent variable.
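As a minimal sketch of these evaluation criteria, the following Python functions compute R², RMSE, and NRMSE for a vector of observations and a vector of estimates. The function names and the sample values are hypothetical, and the use of the population standard deviation (NumPy's default) for the NRMSE normalization is an assumption.

```python
import numpy as np


def r_squared(y, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean square error of the estimates."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def nrmse(y, y_hat):
    """RMSE divided by the standard deviation of the estimated values."""
    return rmse(y, y_hat) / float(np.std(y_hat))  # assumption: population std


# Hypothetical observations and estimates, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, estimated), rmse(observed, estimated), nrmse(observed, estimated))
```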

IMPLEMENTATION
In this study, the number of casualties was considered as the dependent variable, and the number of persons in the crash, road alignment, number of lanes, pavement type, surface condition, road fence, light condition, vehicle type, weather, drunk driver, speed limitation, harmful event, road profile, and junction type were considered as explanatory variables. These factors were selected on the basis of previous studies and the limits of our access to the data. First, the correlation between the data must be checked by (Dale, 2014):

$$\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_X\, \sigma_Y}$$

where $\mathrm{cov}(X, Y)$ is the covariance of the two data sets $X$ and $Y$, $\bar{x}$ and $\bar{y}$ are the mean values of the two data sets, $n$ is the number of observations in each data set, $r$ is the correlation coefficient between the two data sets, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of the data sets. All of the calculated correlation coefficients lie between -0.1 and 0.6, which indicates that no explanatory variable is strongly correlated with another. For this reason, all of them were used in the implementation. The correlation matrix of the explanatory variables is shown in Figure 3.

Figures 5 and 6 show the results of the proposed GWR using the Gaussian and Tricube kernels, respectively. The blue line depicts the actual data and the red line depicts the result predicted by GWR with each kernel. Both types of bandwidth were used in GWR, and the cross validation method was used to optimize the bandwidth. Moreover, Table 2 shows the results obtained for the evaluation criteria. As shown in Table 2, $R^2$ was calculated for both GWR and OLS. The value of $R^2$ obtained for OLS is 0.0654, which is close to zero. This implies that, under OLS, the explanatory variables (the factors affecting crashes in this study) are of little use in estimating the dependent variable, i.e. the number of casualties. Hence, OLS is not an appropriate method for this problem, whereas the value calculated for GWR is 0.1294 with the Gaussian kernel and 0.9196 with the Tricube kernel, which demonstrates the better performance of the GWR method in comparison with OLS for rural highway crashes. OLS cannot accommodate spatial autocorrelation and spatial non-stationarity because it assumes that the data are completely independent and that the environment is homogeneous. As a result, OLS produces a single answer for the whole region without considering spatial dependency. Furthermore, the RMSE and NRMSE values obtained for GWR with the Tricube kernel differ greatly from those of both GWR with the Gaussian kernel and the OLS model. Therefore, using GWR with the Tricube kernel for rural highway crashes can improve accuracy and the detection of the factors affecting these crashes.
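The correlation check on the explanatory variables described in this section can be sketched in Python as follows. This is only an illustration: the function names, the synthetic data, and the 0.8 cut-off are hypothetical, since the study reports only that all coefficients fell between -0.1 and 0.6.

```python
import numpy as np


def correlation_matrix(X):
    """Pearson r for every pair of columns of X (n observations x p variables)."""
    Xc = X - X.mean(axis=0)          # centre each variable
    cov = Xc.T @ Xc / X.shape[0]     # covariance with 1/n normalization
    sd = X.std(axis=0)               # population standard deviations
    return cov / np.outer(sd, sd)    # r = cov(X, Y) / (sigma_X * sigma_Y)


def highly_correlated_pairs(R, threshold):
    """List (i, j, r) for variable pairs whose |r| exceeds the threshold."""
    p = R.shape[0]
    return [(i, j, float(R[i, j]))
            for i in range(p) for j in range(i + 1, p)
            if abs(R[i, j]) > threshold]


# Hypothetical data: the third variable is almost a copy of the first.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
A[:, 2] = 0.99 * A[:, 0] + 0.01 * rng.normal(size=50)

R = correlation_matrix(A)
pairs = highly_correlated_pairs(R, threshold=0.8)  # 0.8 is an assumed cut-off
print(np.round(R, 3))
print("pairs to review:", pairs)
```

Variables flagged in this way would be candidates for removal before fitting; in the study no pair was excluded.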

CONCLUSIONS
Detecting the factors that affect road accidents is very important today because the number of people injured or killed in road accidents is very high. These casualties cause irreparable social and economic impacts. Hence, identifying hazardous times and places can help prevent future accidents. The goal of this paper is to develop an appropriate regression model for the factors of rural highway crashes. Crashes are spatial events that occur in geographic space. The regression methods formerly used for this purpose are not compatible with features of spatial data such as spatial autocorrelation and spatial non-stationarity. Thus, the GWR method was used, as it is appropriate for studying local patterns and adapts to the features of spatial data. For evaluation, the proposed method was applied to real-world data on US rural highways recorded from 2007 to the end of 2009. To show the impact of spatial data features on regression, we compared the result of the proposed GWR method with that of the OLS method. The goodness of fit (R²) of OLS on highway crashes was 0.0654, while this value was 0.9196 for the proposed GWR method with the Tricube kernel. As a result, using GWR with the Tricube kernel can enhance accuracy and improve the detection of the factors affecting the occurrence of crashes on rural highways.

In future studies, we recommend combining GWR with evolutionary algorithms to identify the most influential factors in accidents. We also propose applying the GWR method with the Tricube kernel to urban crashes in order to detect the factors affecting accidents in urban areas.

Figure 2. Fatal crashes on US rural highways

Figure 3. Correlation matrix of explanatory variables

Figure 4. Results of OLS regression

Figure 5. Results of GWR with Gaussian kernel