METHOD OF GRASSLAND INFORMATION EXTRACTION BASED ON MULTI-LEVEL SEGMENTATION AND CART MODEL

Traditional classification methods, such as pixel-based or object-based supervised classification, have difficulty extracting grassland accurately. This paper proposes a new method that combines multi-level segmentation with the CART (classification and regression tree) model. The multi-level segmentation, which combines multi-resolution segmentation and spectral difference segmentation, avoids the over-segmentation and under-segmentation seen when a single segmentation mode is used. The CART model is built from the spectral and texture features mined from training sample data. Xilinhaote City in the Inner Mongolia Autonomous Region was chosen as a typical study area, and the proposed method was verified using visual interpretation results as the approximate ground truth. The method was also compared with nearest neighbor supervised classification. The experimental results show that the overall accuracy and Kappa coefficient of the proposed method were 95% and 0.9, respectively, whereas those of nearest neighbor supervised classification were 80% and 0.56. The proposed method is therefore more accurate than nearest neighbor supervised classification. The experiment confirms that the proposed method is an effective way to extract grassland information: it sharpens the boundaries of the classified grassland and is not restricted by the scale of grassland distribution. The method is also applicable to grassland extraction in other regions with complicated spatial features, where it effectively avoids interference from woodland, arable land and water bodies.


INTRODUCTION
Grassland, one of the most widely distributed vegetation types in the world, is a natural barrier and an important component of the ecosystem. It is mainly distributed in arid and semi-arid regions and plays important roles in regulating climate, conserving soil and water, and other aspects (Qian Y. R. et al., 2013; Liu Y. B. et al., 2011). Although China has a vast grassland area, the sharp reduction of grassland and the imbalance between grass and livestock are prominent problems. A comprehensive and timely grasp of the spatial distribution and coverage area of grassland is therefore of great significance to the management and protection of grassland resources in China (Zhang Y. X. et al., 2012). With the development of satellite remote sensing technology, remote sensing has become an important means of dynamic grassland monitoring. It provides multidimensional information, such as grassland area, distribution and temporal change, to environmental protection departments, agriculture departments and local governments. At the same time, remote sensing provides strong spatial data support for the rational development and utilization of grassland resources.
Because grassland vegetation composition and dominant species differ from place to place, the distribution of grassland is complex and changeable. In addition, grassland is spectrally similar to forest, wetland and farmland in remote sensing images, so that different land cover types share similar spectra. The automatic extraction of grassland information has therefore always been a difficult problem in remote sensing image classification, and there is little related research.
Currently, grassland information extraction methods fall mainly into two groups. The first is pixel-based remote sensing image classification, including multiple filtering, classification models based on principal component analysis and spectral angle mapping, artificial neural networks (ANN), decision tree classification, etc. (Liu R. et al., 2012; Wu J. S. et al., 2012). The second is object-oriented classification (Yang C. M. et al., 2005; Xu H. W., 2012; Liu Y. et al., 2014). In addition, grassland extraction and classification can be accomplished by human-computer interaction based on high-resolution remote sensing images. Among these methods, pixel-based classification suffers from serious class mixing and the salt-and-pepper effect, which produces many fragmented polygons and ultimately leads to low classification accuracy; however, it identifies scattered grassland well. The object-oriented method works well for grassland distributed over large areas but omits more of the scattered grassland. Human-computer interaction offers relatively high interpretation accuracy and clear extraction boundaries, but its efficiency is too low and its workload is excessive.
In view of the above problems, this paper proposes a grassland extraction method developed within the project "ecological environment and source monitoring in seven provinces and cities around Beijing, Tianjin and Hebei". The method combines multi-resolution segmentation with spectral difference segmentation to form a multi-level segmentation system in the eCognition software. Based on the multi-level segmentation objects, classification rules are obtained from training data with the SPM data mining and analysis tool. A decision tree model is then established by combining the spectral features, the texture features of the remote sensing image and the classification rules. Once established, this model can extract grassland information automatically. The proposed method delineates grassland boundaries clearly and has proved suitable for many other regions with complicated spatial features. Because interference from woodland, farmland, wetlands and other confusing objects is avoided, the method extracts grassland information effectively without being restricted by the scale of distribution.

STUDY AREA AND DATA
According to the characteristics of the grassland distribution, Xilinhaote City in the Inner Mongolia Autonomous Region was selected as the study area. Xilinhaote City is located at 43°02′N - 44°52′N, 115°13′E - 117°06′E. It extends 208 km from north to south and 143 km from east to west, with a total area of 15,758 km². The terrain slopes from southeast to northwest. The southeast is higher, with many rocky low hills and few basins, while the northwest is low and flat, with scattered low hills and lava plateaus. Xilinhaote has a mid-temperate semi-arid and arid continental climate with scarce and uneven rainfall, 70% of which falls from June to August. Natural Stipa grass steppe is widely distributed in the study area (Zhang L. Y. et al., 2006). The remote sensing data used in this paper were collected by the GF-1 satellite. Table 1 shows the parameters of the satellite payload. An image acquired in August 2016, when vegetation growth is vigorous, was chosen for grassland extraction. The remote sensing data were first pre-processed, including orthorectification, atmospheric correction and image mosaicking.
Table 1. Parameters of the GF-1 satellite
Meanwhile, in order to evaluate the classification accuracy of the proposed method quantitatively, the 2016 GF-1 remote sensing image of Xilinhaote was interpreted manually with the help of Google Earth data, and the interpretation results were recorded as the approximate real distribution of the grassland.

Multi-level Segmentation
Segmentation is the basis of image feature analysis and classification. The quality of image segmentation directly influences the speed and accuracy of the subsequent extraction (Ming D. P. et al., 2005). A segmented object is a collection of similar pixels, which is more informative than a single pixel. Image objects have obvious homogeneity, dispersion and spatial characteristics, and they reflect spectral characteristics as well as other characteristics such as shape, texture, structure and context (Baatz M., 2000; Huang H. P., 2003). The size of the image objects depends on the target of the image analysis and is controlled by setting a suitable segmentation threshold. At present, the common segmentation method is multi-resolution segmentation, which selects the optimal scale for dividing and classifying the target region according to the characteristics of the study region. Although multi-resolution segmentation can meet the segmentation requirements of objects at different scales to a large extent, over-segmentation and under-segmentation still occur, especially for grassland types. Starting from existing segmentation objects, spectral difference segmentation merges adjacent objects whose spectral difference is below a threshold into one object in order to optimize the segmentation result. This paper combines the advantages of multi-resolution segmentation and spectral difference segmentation to effectively resolve the over-segmentation and under-segmentation caused by a single segmentation method.
Multi-resolution segmentation is the process of merging homogeneous pixels, separating heterogeneous pixels, and dividing the original image into several relatively homogeneous polygons based on the spectral, spatial and texture characteristics of the image. The size and number of the resulting objects can be modified by changing the scale parameter: higher values of the scale parameter result in larger image objects, and smaller values in smaller ones. The segmentation scale required for extracting a ground object depends on its characteristics, such as spectrum and shape. The heterogeneity index of image segmentation can be evaluated from the colour, shape, smoothness and compactness of objects (Franklin S. E. et al., 1991; Haralick R. M. et al., 1985; Benz U. C. et al., 2004).
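For reference, the heterogeneity criterion commonly published for multi-resolution segmentation (Baatz M., 2000; Benz U. C. et al., 2004) can be sketched as below. This is the general form from that literature, not a statement of the configuration used in this paper: the band weights $w_c$, the colour weight $w_{color}$ and the compactness weight $w_{cmpct}$ are user-set parameters whose values are not reported here. The symbols $n$ (object size in pixels), $\sigma_c$ (standard deviation in band $c$), $l$ (object perimeter) and $b$ (perimeter of the bounding box) are introduced only for this illustration.

```latex
% Increase in heterogeneity when objects 1 and 2 are merged into "merge"
f = w_{color}\,\Delta h_{color} + (1 - w_{color})\,\Delta h_{shape}

\Delta h_{color} = \sum_{c} w_c \left( n_{merge}\,\sigma_c^{merge}
    - \left( n_1\,\sigma_c^{1} + n_2\,\sigma_c^{2} \right) \right)

\Delta h_{shape} = w_{cmpct}\,\Delta h_{cmpct} + (1 - w_{cmpct})\,\Delta h_{smooth}

\Delta h_{cmpct} = n_{merge}\,\frac{l_{merge}}{\sqrt{n_{merge}}}
    - \left( n_1\,\frac{l_1}{\sqrt{n_1}} + n_2\,\frac{l_2}{\sqrt{n_2}} \right)

\Delta h_{smooth} = n_{merge}\,\frac{l_{merge}}{b_{merge}}
    - \left( n_1\,\frac{l_1}{b_1} + n_2\,\frac{l_2}{b_2} \right)
```

Two adjacent objects are merged only while the smallest achievable increase $f$ stays below a threshold derived from the scale parameter, which is why a larger scale parameter yields larger objects.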
For a given average size of image objects, multi-resolution segmentation yields good abstraction and shaping in most application areas. However, over-segmentation or under-segmentation still occurs, and its efficiency is significantly lower than that of some other segmentation techniques. Therefore, it may not always be the best option. Spectral difference segmentation aims to refine existing segmentation results by merging adjacent image objects whose spectral difference is below a given threshold. Because it is an optimization method for improving existing segmentation results, the algorithm cannot create new image object levels from the pixel level.
In this paper, both multi-resolution segmentation alone and multi-level segmentation were carried out on the remote sensing images of the target area, and the results are compared in Figure 2. The comparison shows that the combination of multi-resolution segmentation and spectral difference segmentation is better than multi-resolution segmentation alone in terms of the hierarchy, integrity, boundaries and minimum object size of the segmented objects. The multi-level segmentation thus brings together the advantages of both segmentation methods.
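The merging idea behind spectral difference segmentation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eCognition implementation: each object is reduced to a single mean value and a pixel count, adjacency is given as a set of id pairs, and the function name `spectral_difference_merge` and the parameter `max_diff` are hypothetical.

```python
from typing import Dict, Set, Tuple


def spectral_difference_merge(
    means: Dict[int, float],
    sizes: Dict[int, int],
    adjacency: Set[Tuple[int, int]],
    max_diff: float,
) -> Dict[int, int]:
    """Merge adjacent objects whose mean difference is below max_diff.

    Returns a mapping from each original object id to the id of the
    merged object it ends up in.
    """
    parent = {i: i for i in means}
    mean = dict(means)   # running mean of each merged object
    size = dict(sizes)   # running pixel count of each merged object

    def find(i: int) -> int:
        # Follow parent links to the representative of the merged object.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    merged = True
    while merged:  # repeat until a full pass produces no merge
        merged = False
        for a, b in sorted(adjacency):
            ra, rb = find(a), find(b)
            if ra == rb:
                continue
            if abs(mean[ra] - mean[rb]) < max_diff:
                total = size[ra] + size[rb]
                # Size-weighted mean of the merged object.
                mean[ra] = (mean[ra] * size[ra] + mean[rb] * size[rb]) / total
                size[ra] = total
                parent[rb] = ra
                merged = True
    return {i: find(i) for i in means}


# Four objects in a row: two dark neighbours and two bright neighbours.
labels = spectral_difference_merge(
    means={1: 100.0, 2: 104.0, 3: 150.0, 4: 153.0},
    sizes={1: 10, 2: 10, 3: 10, 4: 10},
    adjacency={(1, 2), (2, 3), (3, 4)},
    max_diff=10.0,
)
```

In this toy example objects 1 and 2 merge, objects 3 and 4 merge, and the two merged objects stay separate because their means differ by far more than the threshold. Note that the sketch only ever merges existing objects, which mirrors the point that this step cannot create objects from the pixel level.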

Classification Feature
The construction of image classification features is an important step in classification. Too many or too few features will reduce the accuracy of the classification results. The spectral feature is an important interpretation element for recognizing ground objects. It is a set of values describing the grey levels of image objects and reflects their spectral information. The texture features of the image, that is, its internal structure features, arise from regular changes in tone inside the target objects. Many methods are available for analysing texture features, among which the Grey Level Co-occurrence Matrix (GLCM) is generally accepted and effective. The GLCM is a matrix of second-order joint conditional probability densities between image grey values, which reflects the spatial correlation between pairs of pixels in the image.
In this paper, 25 features are selected to construct the classification feature system, including spectral, texture and custom features. The spectral features mainly include Mean_Layer1-4 and Ratio_Layer1-4. The texture features mainly include GLCM Homogeneity, GLCM Contrast, GLCM Dissimilarity, GLCM Entropy, GLCM Ang. 2nd moment, GLCM Correlation, GLCM StdDev and GLCM Mean. The custom features mainly include the Normalized Difference Vegetation Index (NDVI), the Normalized Difference Water Index (NDWI) and the Ratio Vegetation Index (RVI). The calculation formulas of some features are shown in Table 2.
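To make the GLCM concrete, the following sketch builds a symmetric, normalised co-occurrence matrix for one pixel offset and derives a few texture measures from it. It is a plain-Python illustration using the textbook definitions of these measures; the function names `glcm` and `glcm_features`, the single offset and the tiny test images are assumptions, and the software used in this paper may quantise grey levels and combine directions differently.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def glcm(image: List[List[int]], dx: int = 1, dy: int = 0) -> Dict[Pair, float]:
    """Symmetric, normalised co-occurrence probabilities for offset (dx, dy)."""
    counts: Counter = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[(i, j)] += 1  # count the pair in both orders
                counts[(j, i)] += 1  # so that the matrix is symmetric
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def glcm_features(p: Dict[Pair, float]) -> Dict[str, float]:
    """A few common texture measures computed from GLCM probabilities."""
    return {
        "homogeneity": sum(v / (1 + (i - j) ** 2) for (i, j), v in p.items()),
        "contrast": sum(v * (i - j) ** 2 for (i, j), v in p.items()),
        "dissimilarity": sum(v * abs(i - j) for (i, j), v in p.items()),
        "entropy": -sum(v * math.log(v) for v in p.values() if v > 0),
        "asm": sum(v * v for v in p.values()),  # angular second moment
    }


flat = glcm_features(glcm([[3, 3], [3, 3]]))      # uniform patch
checker = glcm_features(glcm([[0, 1], [1, 0]]))   # alternating grey levels
```

The uniform patch has zero contrast and maximal homogeneity, while the alternating patch has non-zero contrast and lower homogeneity, which is the kind of difference the texture features are meant to capture.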
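The three custom indices can be illustrated with a short Python sketch. The formulas below are the widely used definitions and are an assumption here, since the contents of Table 2 are not reproduced in the text; the mapping of green, red and near-infrared to particular image layers is likewise assumed, and the function name `spectral_indices` is hypothetical.

```python
from typing import Dict


def spectral_indices(green: float, red: float, nir: float) -> Dict[str, float]:
    """Commonly used index definitions computed from object mean band values."""

    def ratio(num: float, den: float) -> float:
        # Guard against division by zero for empty or masked objects.
        return num / den if den != 0 else float("nan")

    return {
        "NDVI": ratio(nir - red, nir + red),      # vegetation vigour
        "NDWI": ratio(green - nir, green + nir),  # open water
        "RVI": ratio(nir, red),                   # simple NIR/red ratio
    }


# Hypothetical mean values of one vegetated image object.
indices = spectral_indices(green=100.0, red=100.0, nir=300.0)
```

For this hypothetical object the NDVI is positive and the NDWI is negative, which is the pattern expected of vegetation rather than water.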

CART Model
The CART model is a decision tree construction algorithm proposed by Breiman et al. in 1984. Its principle is to build a binary tree by recursively analysing a training data set consisting of test variables and target variables, and then to apply the resulting decision tree rules to the classification of new data. As a non-parametric algorithm, the model places no special requirements on the distribution of the training samples and does not need to know the relationships between the dependent and independent variables. By selecting a threshold on an independent variable, the model divides the sample set into two subsets and then recursively divides each subset until the samples in each subset are as homogeneous as possible. The advantages of the CART model are its clear structure, ease of understanding, ease of implementation and high accuracy. It clearly indicates the importance of each variable in the classification process, imposes no statistical distribution requirement on the input sample data, and tolerates missing and erroneous data (Breiman L. et al., 1984; Na X. D. et al., 2008).

Machine Learning
The selection of training samples is a key step in machine learning. Based on the multi-level segmentation result and Google Earth images, samples were selected by random sampling. Because land other than grassland accounts for a fairly small proportion of Xilinhaote, this paper selects 6270 samples of two types, grassland and others, and extracts their spectral and texture features.
This paper used the CART model built into the data mining software SPM (Salford Predictive Modeler), developed by Salford Systems, for machine learning. The Gini coefficient (or Gini index), which originated in economics, is used in the CART model as the criterion for selecting the best test variable and split threshold when the decision tree is built. The Gini coefficient is defined as follows:
$$\mathrm{Gini} = 1 - \sum_{j=1}^{J} p^{2}(j \mid h)$$
$$p(j \mid h) = \frac{n_j(h)}{n(h)}, \qquad \sum_{j=1}^{J} p(j \mid h) = 1$$
where $p(j \mid h)$ is the probability that a random sample in the training set belongs to class $j$ when the value of the test variable is $h$, $n_j(h)$ is the number of training samples belonging to class $j$ when the test variable value is $h$, $n(h)$ is the number of training samples whose test variable value is $h$, and $J$ is the number of classes.
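The way the Gini index drives the choice of a split threshold can be sketched as follows. This is a simplified illustration for a single test variable, assuming that candidate thresholds are the midpoints between consecutive distinct values and that the best split minimises the size-weighted Gini index of the two subsets; the function names and the toy NDVI values are hypothetical and are not the samples used in this paper.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini index 1 - sum_j p_j^2 of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best binary split.

    Samples with value <= threshold go left, the others go right.
    If no split improves on the unsplit set, the threshold is NaN.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), gini(labels))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy example: NDVI values of four objects and their classes.
ndvi = [0.1, 0.2, 0.6, 0.7]
classes = ["other", "other", "grassland", "grassland"]
threshold, impurity = best_split(ndvi, classes)
```

In this toy case the unsplit set has a Gini index of 0.5, and a threshold midway between 0.2 and 0.6 separates the two classes completely, giving a weighted Gini index of zero. A full CART implementation repeats this search over all test variables and then recurses into each subset.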

Model Optimization
Overfitting often occurs in the decision tree established in section 3.3.1. To make the tree generalize beyond the training samples, its structure needs to be pruned further.
The CART tree is pruned with a cross-validation method. Strong branches that maintain a low error rate are retained and the others are pruned. The final result is an optimal binary tree that balances complexity and error rate.
In this paper, there are 3078 grassland samples and 3192 samples of other land use. These two types of samples, each described by 25 features, are the input prediction variables of the SPM; 50% of the sample set is used as the training set and the other 50% as the test set. Through this strategy of growth and pruning, the decision tree model shown in Figure 3 is obtained.
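The trade-off between complexity and error rate is usually formalised in the CART literature (Breiman L. et al., 1984) as cost-complexity pruning. The expression below is that general formulation, given as a sketch; the symbols $R(T)$, $|\tilde{T}|$ and $\alpha$ are introduced only for this illustration, and the specific pruning settings used in SPM are not reported in this paper.

```latex
% Cost-complexity measure of a subtree T
R_{\alpha}(T) = R(T) + \alpha \, |\tilde{T}|
% R(T):        misclassification error of T on the training samples
% |\tilde{T}|: number of terminal nodes (leaves) of T
% \alpha >= 0: penalty per leaf
```

For each value of $\alpha$ there is a smallest subtree that minimises $R_{\alpha}(T)$; increasing $\alpha$ yields a nested sequence of ever smaller trees, and cross-validation is then used to pick the tree in that sequence with the lowest estimated error.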

RESULTS AND ANALYSIS
The nearest neighbor supervised classification was carried out in the eCognition software by constructing and optimizing the feature space, using multi-resolution segmentation at the same scale and the same training samples and features. The classification result is shown in Figure 4 (a).
Based on the map data of Google Earth, the 2016 GF-1 remote sensing data of Xilinhaote was interpreted visually to extract the grassland information. The result can be regarded as the approximate real distribution of the grassland, as shown in Figure 4 (c). The grassland distributions obtained by the proposed method and by the nearest neighbor supervised classification method were compared with the visual interpretation result. The results of the precision evaluation are shown in Figure 5.
Figure 4 compares the three interpretation results. Compared with the visual interpretation result, both automatic methods extract large areas of grassland effectively and distinguish buildings, water and bare land at the same time. Comparing the two automatic results, the proposed method greatly improves the mapped boundaries and reduces object fragmentation, and it effectively reduces the misclassification of grassland as buildings or bare land. Moreover, omission (leaking classification) is reduced significantly.

Figure 5. Results of precision evaluation
Figure 5 shows the results of the precision evaluation. The evaluation parameters include the overall accuracy (total precision), the Kappa coefficient, the producer's accuracy and the user's accuracy. The following conclusions can be drawn from the comparison in Figure 5.
1. All evaluation parameters of the proposed method are superior to those of the nearest neighbor supervised classification method.
2. The overall accuracy and Kappa coefficient of the proposed method are 95% and 0.9, respectively, which are 15 percentage points and 0.34 higher than those of the nearest neighbor supervised classification method.
3. Both the producer's accuracy and the user's accuracy exceed 90%, which means that the proposed method detects grassland well and extracts grassland information accurately.
4. The user's accuracy of the nearest neighbor supervised classification method is only about 80%, and grassland is significantly confused with other land types. The key reason may be that the single multi-resolution segmentation adopted by the nearest neighbor supervised classification method fragments the patches.
5. Because there are large areas of misclassification and omission, the Kappa coefficient of the nearest neighbor supervised classification method is only 0.56.
In conclusion, the proposed method based on multi-level segmentation divides the patch objects better and greatly reduces object fragmentation. The feature and parameter optimization based on data mining adopted by the proposed method reduces patch fragmentation, misclassification and omission.
With the proposed method, the omission (leaking classification) rate can be kept below 5% and the misclassification rate to 10%.
The experimental verification shows that the grassland classification result obtained by the proposed method is very close to the result of visual interpretation. For grassland extraction, the proposed method combining multi-level segmentation with the CART model is more applicable than the nearest neighbor supervised classification method based on single segmentation.
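The evaluation parameters used in this section can all be derived from a confusion matrix, as the following sketch shows. It uses the standard definitions of overall accuracy, Cohen's Kappa, producer's accuracy and user's accuracy; the function name `accuracy_report` is hypothetical, and the matrix in the example is made up for illustration and is not the data behind Figure 5.

```python
from typing import Dict, List


def accuracy_report(matrix: List[List[int]]) -> Dict[str, object]:
    """Accuracy measures from a square confusion matrix.

    matrix[i][j] is the number of samples whose reference class is i
    and whose predicted class is j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]                       # reference
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # predicted

    overall = correct / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (overall - expected) / (1 - expected)

    return {
        "overall_accuracy": overall,
        "kappa": kappa,
        # Producer's accuracy: share of each reference class that was found.
        "producer_accuracy": [matrix[i][i] / row_totals[i] for i in range(k)],
        # User's accuracy: share of each predicted class that is correct.
        "user_accuracy": [matrix[i][i] / col_totals[i] for i in range(k)],
    }


# Hypothetical two-class matrix (grassland, others), for illustration only.
report = accuracy_report([[45, 5], [5, 45]])
```

In this made-up example the overall accuracy is 0.9 but the Kappa coefficient is only 0.8, because half of the agreement would be expected by chance with two equally sized classes. One minus the producer's accuracy corresponds to the omission (leaking classification) rate, and one minus the user's accuracy to the misclassification rate.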

CONCLUSION
The extraction accuracy of the supervised classification method is insufficient, and the thresholds of the object-oriented method based on single segmentation, although it offers better extraction accuracy, are difficult to set. To address these problems, this paper proposed a grassland information extraction method combining multi-level segmentation with the CART model. The image features are first optimized by data mining. The CART model and rule set are then established to extract the grassland information. Xilinhaote City was chosen as the typical study area, and its grassland information was extracted from GF-1 remote sensing image data with the proposed method. The applicability and extraction accuracy of the method were verified and analysed. The results show that the grassland information extracted by the proposed method is greatly improved in spatial extent and accuracy. The following conclusions can be drawn:
(1) The grassland information extraction method combines multi-level segmentation with object-oriented theory. It effectively avoids over-segmentation and under-segmentation. Moreover, different segmentation strategies can be formed according to the distribution characteristics of grassland, which extends the applicability and generalization of the method.
(2) The SPM data mining method is introduced to optimize the features of the remote sensing images. Compared with other data optimization methods, it has the advantages of simple operation, high efficiency, direct results and high precision. In addition, the actual classification performance and accuracy of the classifier are greatly improved, and the errors caused by subjectively chosen thresholds are effectively avoided.
(3) Experimental verification and precision analysis show that the extraction accuracy of the proposed method is higher than 90%. The method has been used in the project "ecological environment and source monitoring in seven provinces and cities around Beijing, Tianjin and Hebei" and has been effectively promoted and applied in Inner Mongolia, Henan and Shandong. It can provide a reference for grassland information extraction at other regional scales and even at the national scale.
(4) The method still has room for improvement in feature space selection, classification accuracy and automation efficiency. Future work should strengthen these aspects to further improve efficiency and precision. Terrain factors should also be taken into consideration, and the extraction of grassland information in complex terrain, such as hilly and mountainous areas, should be studied.

Figure 1. Location of the study region
Figure 2. Contrast of segmentation results: (a) multi-resolution segmentation (scale = 70); (b) multi-resolution segmentation and spectral difference segmentation

Figure 3. CART model
Figure 4. Comparison of grassland classification results: (a) the supervised classification method; (b) CART model method; (c) the visual interpretation result

Table 2. Calculation formulas of some features