STUDY ON ADAPTIVE PARAMETER DETERMINATION OF CLUSTER ANALYSIS IN URBAN MANAGEMENT CASES

Fine-grained city management is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for the fine management of the city. Because the density-based DBSCAN clustering algorithm cannot determine its parameters adaptively, this paper proposes an optimized method of adaptive parameter determination based on spatial analysis. First, Ripley's K function is analyzed for the data set to determine the global parameter Eps adaptively, which means setting the maximum aggregation scale as the range of data clustering. Next, a K-D tree is used to calculate, for every point object, the K value (the number of points) within the range of Eps, and the most frequent K value is set as the clustering density, which realizes the adaptive determination of the global parameter MinPts. The R language was then used to optimize the above process and accomplish precise clustering of typical urban management cases. Experimental results based on typical urban management cases in Xicheng District of Beijing show that the new DBSCAN clustering algorithm presented in this paper takes full account of the spatial and statistical characteristics of data with obvious clustering features, and has better applicability and high quality. The results of the study are helpful not only for the formulation of urban management policies and the allocation of urban management supervisors in Xicheng District of Beijing, but also for other cities and related fields.


INTRODUCTION
With the rapid growth of demand for large-scale data processing and in-depth analysis in all walks of life, data mining has become a hot research area for many scholars (Genlin et al., 2014). Refinement, an important goal of urban operation and development, requires technical support for the fine management of city operation (Jing, 2014). With the progress of society, science and technology, all kinds of issues in urban operation have appeared in succession. According to the city's report on the work of the government, the number of urban management cases (ChengGuan cases for short) also increases year by year, which affects the appearance and steady running of the city. Therefore, it is of great theoretical and practical value to use spatial data mining technology to analyze urban management cases and assist government decision-making.
As a data mining method, clustering analysis has been widely used. The density-based DBSCAN clustering algorithm, which is fast, adapts well to data and is insensitive to noise, has been studied by many researchers (Xinyan and Deren, 2005). However, the DBSCAN algorithm requires the parameters Eps and MinPts to be determined manually, and the values of these two parameters directly affect the quality of data clustering. To address the problem of selecting optimal parameters, a large body of literature has proposed assuming a MinPts value and then determining the Eps value.
Although they avoid fully manual parameter determination, these methods rest on an assumed MinPts value and therefore still lack adaptive parameters, such as Ren Xingping (Xingping et al., 2007)

The Study Area And Data Source
This paper is based on the Xicheng District of Beijing in 2010.
Xicheng District is located in the center of Beijing and is a core development area that combines political, economic, cultural and tourism functions. Because of its special geographical position, it has higher requirements for urban management. The specific distribution is shown in Figure 1.

Ripley's K Function
Ripley's K method is a representative spatial point pattern analysis method that can quantitatively evaluate the spatial distribution characteristics of point patterns (Tang et al., 2015).
In this method, Ripley's K function is used to analyze the clustering degree of point data sets at different spatial scales within a certain confidence interval. The maximum clustering scale, at which the clustering effect is best, is then determined quantitatively from the expected K value and the observed K value. The spatial scale is calculated as follows:

$$L(d) = \sqrt{\frac{A \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} K_{ij}}{\pi n (n-1)}}$$

In the above formula, d represents the distance, n represents the number of features, A represents the total area occupied by the feature set, and K_{ij} represents the spatial weight.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-2/W7, 2017 ISPRS Geospatial Week 2017, 18-22 September 2017, Wuhan, China

At a specific distance, when the observed K value is larger than the predicted K value, the distribution is more clustered than a random distribution at that scale. Therefore, this paper selects the spatial scale with the largest difference between the observed K and the predicted K as the value of the parameter Eps. To obtain a better clustering effect, the confidence level is set to 95% based on the size of the research data. In view of this problem, some scholars have proposed an improved method (Yi et al., 2016). For each point object, this paper counts the number of data points K that fall within the Eps threshold, and then takes the most frequent value of K as the MinPts value.
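The Eps selection rule described above can be illustrated with a minimal sketch. It is written in Python rather than the R used in the paper, and it makes several assumptions that are not in the source: the weight K_ij is 1 when two points are within distance d and 0 otherwise, no edge correction is applied, the common L transform L(d) = sqrt(K(d)/π) is used so that the expected value under complete spatial randomness is simply d, and the 95% confidence envelope is omitted. The function names `observed_l` and `select_eps` and the synthetic data are hypothetical.

```python
import math
import random


def observed_l(points, distances, area):
    """Observed L(d) for each distance d, with weight K_ij = 1 if
    dist(i, j) <= d and 0 otherwise (assumption: no edge correction)."""
    n = len(points)
    # Distances of all unordered pairs, computed once: O(n^2) time and memory.
    pair_dist = [math.dist(points[i], points[j])
                 for i in range(n) for j in range(i + 1, n)]
    values = []
    for d in distances:
        # Each unordered pair contributes twice to the double sum (i,j and j,i).
        weight_sum = 2 * sum(1 for x in pair_dist if x <= d)
        values.append(math.sqrt(area * weight_sum / (math.pi * n * (n - 1))))
    return values


def select_eps(points, distances, area):
    """Return (eps, gap): the distance with the largest observed-minus-expected
    L value. Under complete spatial randomness the expected L(d) equals d."""
    observed = observed_l(points, distances, area)
    gaps = [obs - d for obs, d in zip(observed, distances)]
    best = max(range(len(distances)), key=gaps.__getitem__)
    return distances[best], gaps[best]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: two tight groups inside a 1000 m x 1000 m study area.
    points = [(random.gauss(cx, 20), random.gauss(cy, 20))
              for cx, cy in [(250, 250), (750, 750)] for _ in range(50)]
    eps, gap = select_eps(points, list(range(25, 501, 25)), area=1000 * 1000)
    print("Eps candidate:", eps, "gap:", round(gap, 1))
```

A production analysis would normally add an edge correction and a simulated confidence envelope, and would consider only distances at which the observed value lies outside that envelope.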

Component correlation analysis
Overlapping analysis method has the characteristics of low  5.

According to the quantitative analysis of the above chart, when the aggregation scale of the street order cases is greater than the maximum aggregation scale of 563 meters, the observed K value gradually approaches the predicted K value and the two curves remain almost parallel. Likewise, when the aggregation scale of the urban environment cases is larger than the maximum aggregation scale of 792 meters, the observed K value is close to the predicted K value and almost parallel to it. This shows that, for both types of cases, the degree of aggregation of the data set decreases once the spatial scale exceeds the maximum aggregation scale. Because the aggregation degree is higher when the observed K value exceeds the predicted K value at a given distance, this paper selects the maximum aggregation scale, at which the aggregation effect is best, as the parameter Eps. The clustering results and the parameter values are statistically analyzed, as shown in Table 7.
et al. take MinPts as 4 and choose the value of Eps from the fourth-nearest-neighbor distance graph of the data object set according to the percentage of noise; Zhou Dong (Dong and Peng, 2009) et al. assume that MinPts is 3 and then determine the value of Eps from the K-dist curve. Some scholars have studied the adaptive determination of the global parameters Eps and MinPts. Most of this research is based on statistical analysis of the data set. For example, Xia Luning (Luning and Jiwu, 2009) et al. fitted a statistical model to the k-dist probability curve and took its peak as Eps, then drew the noise curve and took its inflection point as MinPts, thereby determining the parameters adaptively; however, the whole process is cumbersome and computationally expensive, and its practicability is weak. Li Zonglin (Zonglin and Ke, 2016) et al. established a mathematical model that determines the Eps and MinPts values adaptively using kernel density estimation theory, but this method is not suitable for data sets with large density differences, and the computational complexity of the algorithm is high. Other scholars take data partitioning as the premise of their methods; for example, Stefanakis (Stefanakis, 2007), Pandey Abhilash Kumar (Pandey Abhilash Kumar and Dubey Roshni, 2014) and others first partition the data area and then cluster it. Huang Gang (Gang et al., 2015) et al. achieved a highly efficient clustering algorithm by selecting representative seed objects, which reduces the number of region queries. In summary, the existing literature pays little attention to spatial data and its spatial statistical characteristics. The density-based DBSCAN clustering algorithm still needs to study the data set, exploring the statistical characteristics of the data in order to achieve high-quality clustering. This paper uses Ripley's K function and the K-D tree to analyze the statistical characteristics of urban management case data and to determine the parameters of the DBSCAN algorithm adaptively. The optimized DBSCAN algorithm is then used for data mining on typical urban management cases, in order to support urban management policy making, provide quantitative analysis for scheduling urban management supervision staff, and enhance the city's capacity for fine management.
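To make the role of the two parameters concrete, the following is a minimal sketch of the standard DBSCAN procedure in Python. It is not the optimized R implementation used in this paper. The names `region_query` and `dbscan` are hypothetical, the neighborhood query is a brute-force scan, and a point is assumed to count as its own neighbor.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(dbscan(pts, eps=1.0, min_pts=3))
```

With `eps=1.0` and `min_pts=3`, the two squares form two clusters and the isolated point is labeled as noise; changing either parameter changes the result, which is why their selection matters.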

Figure 2. Data statistical histogram

A K-D tree is similar to a binary tree: it is a data structure with left and right subtrees. The biggest difference between a K-D tree and a binary index tree is that the K-D tree stores K-dimensional point data. The K-D tree algorithm consists of two parts, tree building and search. During building, the data are divided into a left subtree and a right subtree along the dimension with the maximum variance. To keep the left and right subtrees the same size, the K-D tree takes the median of the attribute values in that dimension as the partition point. There are two kinds of search in the K-D tree structure: range search and K-nearest-neighbor search. Range search finds, for a given point object, the point data within a given search threshold; K-nearest-neighbor search finds the K points in the data set that are nearest to a specified point object. The classical K-D tree algorithm searches efficiently only in the low-dimensional case; its efficiency for high-dimensional data is very low.
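The tree building and range search described above, together with the MinPts rule of taking the most frequent neighbor count within Eps, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the paper's R implementation. The names `build_kdtree`, `count_in_range` and `adaptive_min_pts` are hypothetical, each point is counted as its own neighbor, and ties between equally frequent counts are resolved in favor of the count encountered first.

```python
import math
from collections import Counter


def _variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_kdtree(points):
    """Build a K-D tree as nested dicts, splitting each node at the median
    of the dimension with the largest variance."""
    if not points:
        return None
    dims = range(len(points[0]))
    axis = max(dims, key=lambda a: _variance([p[a] for p in points]))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid]),
            "right": build_kdtree(pts[mid + 1:])}


def count_in_range(node, center, eps):
    """Range search: count stored points within distance eps of center."""
    if node is None:
        return 0
    count = 1 if math.dist(node["point"], center) <= eps else 0
    delta = center[node["axis"]] - node["point"][node["axis"]]
    # Descend into a subtree only if the search ball can reach its side.
    if delta <= eps:
        count += count_in_range(node["left"], center, eps)
    if delta >= -eps:
        count += count_in_range(node["right"], center, eps)
    return count


def adaptive_min_pts(points, eps):
    """Most frequent number of points found within eps of each point."""
    tree = build_kdtree(points)
    counts = [count_in_range(tree, p, eps) for p in points]
    return Counter(counts).most_common(1)[0][0]


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
    print(adaptive_min_pts(pts, eps=1.0))
```

In the small example, each corner of the unit square has three points within distance 1 (itself and two adjacent corners) and the isolated point has one, so the most frequent count is 3.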
Figure 6. Spatial distribution of cluster results

Figure 6 and Table 7 indicate the following.

(1) For the street order cases, there are 38 clusters under the conditions of 95% confidence, a scanning radius of 563 m and a scanning density of 110. They are mainly distributed in the east and west streets of the city center and in the north of the city. Apart from Yuetan and Xinjiekou streets, which have fewer clusters, the clusters of West Chang'an Street, Shichahai Street, Desheng Street and Exhibition Road are uniformly distributed. The northern part of the city has larger population mobility, and the floating population's demand for convenient and fast shopping provides a market for unlicensed operators and operators trading outside their shops. It is therefore extremely necessary to strengthen the management and deployment of urban managers in the northern part of the city.

(2) For the urban environment cases, there are 44 clusters under the conditions of 95% confidence, a scanning radius of 792 m and a scanning density of 200. They are mainly distributed in Yuetan Street, Financial Street and West Chang'an Street in the center of the city; the northern part of the city also has a small number, for example in Desheng Street, Xinjiekou Street and Shichahai Street. Clusters are especially concentrated in Yuetan Street, on the east side of the city, where the number of clusters is as high as 25. Compared with the other streets in Xicheng District, Yuetan has the largest number of communities (32) and the largest population (150,000). The large number of communities and residents means that more long-term abandoned vehicles and more garbage accumulate at the roadside than in other streets. Yuetan Street has become a high-incidence area for urban environment cases because it lacks sanitation facilities and exposed garbage is not processed in time. City management inspections in Yuetan Street should therefore be strengthened in order to reduce the occurrence of such cases and improve the current situation.

Street order cases are mainly affected by unlicensed business operators, operators trading outside their shops, vagrant begging and other issues; urban environment cases are mainly affected by exposed garbage, unclean pavements, accumulation of waste residue, abandoned vehicles and other issues. A lack of urban public infrastructure is the root cause of a large part of the urban environment cases. Therefore, to verify the rationality of the clustering results, this paper chooses 14 classes of cases concerning urban public infrastructure and the urban environment, such as dustbins, garbage bins, comfort stations and storage frames, and performs a correlation analysis according to "digital city management information system, second part" (2013). If there is a large overlap between the clusters and the areas where component cases are concentrated, cases occur even where the facilities are dense, which means that the urban management cases are unrelated to the component facilities. Conversely, if there is only a small overlap between the clusters and the concentrated areas of the component cases, the urban management cases are related to the configuration of the components. The specific correlation analysis results are shown in Figure 8, in which different colors represent different clusters, and the transition from white to black shows the kernel density of components from low to high.