ABNORMAL CROWDSOURCED DATA DETECTION USING REMOTE SENSING IMAGE FEATURES

Quality is the key issue in judging the usability of crowdsourced geographic data. However, because volunteers are often non-professional and malicious labeling occurs, crowdsourced data contain many abnormal or poor-quality objects. Based on this observation, this paper proposes an abnormal crowdsourced data detection method based on image features. The approach includes three main steps. 1) The crowdsourced vector data are used to segment the corresponding remote sensing imagery to obtain image objects with a priori information (e.g., shape and category) from the vector data and spectral information from the images. A sampling method is then designed that considers the spatial distribution and topographic properties of the objects, and the initial samples are obtained, although some of them are abnormal or of poor quality. 2) A feature contribution index (FCI) is defined based on information gain to select the optimal features, and a feature space outlier index (FSOI) is presented to automatically identify outlier samples and changed objects. The initial samples are refined by an iterative procedure. After the iteration, the optimal features are determined and the refined samples with categories are obtained; the imagery feature space is established using the optimal features for each category. 3) The abnormal objects are identified with the refined samples by calculating the FSOI values of the image objects. To validate the effectiveness of the method, a prototype for abnormal crowdsourced data detection was developed in C# using Visual Studio 2013. The above algorithms and methods were implemented and verified using the water and vegetation categories as examples, with OSM (OpenStreetMap) data and the corresponding imagery of Changsha city as experimental data.
After optimal image feature selection, the angular second moment (ASM), contrast, inverse difference moment (IDM), mean, variance, difference entropy, and normalized difference green index (NDGI) are used to detect abnormal vegetation data, and the IDM, difference entropy, correlation, and maximum band value are used to detect abnormal water data. Experimental results show that abnormal water and vegetation data in OSM can be effectively detected by this method: the missed detection rates of vegetation and water are both close to zero, and the positive detection rates reach 90.4% and 83.8%, respectively.


INTRODUCTION
Crowdsourced geographic data platforms rely on voluntary contributions and aim to create and provide free geographic data for the world. With the development of volunteered geographic information (VGI), several crowdsourced data platforms have sprung up, e.g., Wikimapia, OpenStreetMap (OSM), the Global Learning and Observations to Benefit the Environment (GLOBE) Program, and Google Earth (O'Reilly, 2007; Goodchild, 2007; Goodchild, 2009). These platforms have collected a wealth of geospatial data through active or passive means (Vatsavai & Chandola, 2016; Linda et al., 2016). However, existing crowdsourced data platforms generally lack effective quality control measures (Heipke & Christian, 2010). Because volunteers are often non-professional, crowdsourced data contain many poor-quality objects, and malicious labelling even occurs. As a result, the quality of crowdsourced data has become a key issue restricting its wide application. For example, on April 24, 2015, a "Google Maps fan" produced a set of spatial objects using Google Map Maker that formed a figure described as "Android robot indecent Apple logo". This has become a typical case of the quality control problem in crowdsourced data. To address quality control for crowdsourced geographic data, many scholars first tried to assess quality directly, using high-precision professional vector data as reference data, in terms of location accuracy, completeness, logical consistency, thematic accuracy, temporal accuracy, up-to-dateness, and availability (Haklay & Mordechai, 2010; Mohammad et al., 2014; Zhou et al., 2014; Zhou, 2018; Lyimo et al., 2020). However, because high-precision professional vector data are difficult and costly to obtain in practice, this approach is suitable only for local-area quality evaluation of crowdsourced data.
Therefore, some researchers explored methods that use the crowdsourced data acquisition process (historical records) and contributors' reputations to evaluate and control the quality of crowdsourced data (Nasiri et al., 2018; Almendros-Jiménez & Becerra-Terón, 2018). Keßler et al. (2011) evaluated the quality of OSM target objects by tracking the history of volunteer operations (confirmation, correction, and rollback) on each target object. Barron et al. (2014) analyzed the factors affecting the quality of crowdsourced geographic data, focusing on historical records, and proposed a comprehensive evaluation framework for OSM data quality that includes 25 methods and indicators. Zhao et al. (2016) proposed a model for calculating the credibility of volunteers based on the similarity of versions. Jacobs et al. (2020) took OSM data (including historical and volunteer data) for the Gatineau area of Ottawa as research data, used unsupervised machine learning methods to identify experienced contributors, and evaluated the quality of OSM data on that basis. However, these methods often consider only an object's shape, size, topology errors, and the credibility of users; they cannot identify errors in an object's category attribute or verify its objectivity. Moreover, for data contributed by new users there are neither historical data nor edits and modifications by others, so the existing methods cannot evaluate the credibility of these users or the quality of their data, and it is impossible to judge whether the data are usable. To examine the "Android robot indecent Apple logo" case, the authors overlaid the objects on the corresponding images. Although the objects have regular shapes and no topological errors, they are obviously and seriously inconsistent with the images.
Since a spatial object should be a true reflection of a real geographic phenomenon, fidelity is its most important characteristic. However, the authenticity of such a set of objects is doubtful. How, then, can authenticity be checked? A remote sensing image is an authentic snapshot of the real world, and remote sensing classification using different image features has always been an important research area (Adesina & Mavomi, 2014; Sofina & Ehlers, 2017). Crowdsourced data are often produced by users who edit independently with remote sensing images as reference. Therefore, introducing remote sensing images is a reasonable way to detect abnormal crowdsourced data. Based on this observation, this paper proposes a crowdsourced vector data anomaly detection method based on remote sensing image features. In this method, the remote sensing image is segmented by crowdsourced vectors to obtain image objects with clear boundaries, locations, and a priori category attributes. By comparing the image features of an object with those of other objects of the same labeled category, abnormal data with an incorrect category can be identified. The remainder of this paper is organized as follows. Section 2 outlines the proposed abnormal crowdsourced data detection method, including the sampling design, sample refinement by iterating remote sensing feature selection and outlier sample elimination, and abnormal object detection. Experimental results and discussion are presented in Section 3 and Section 4, respectively. Finally, the main conclusions are drawn in Section 5.

MODEL AND ALGORITHM
In this paper, we use remote sensing images as the ground truth and assume that most crowdsourced data are true and credible. Based on this assumption, a method for detecting anomalies in crowdsourced vector data using the characteristics of remote sensing images is constructed. The method first uses preprocessed crowdsourced vectors to perform mask segmentation on the remote sensing images, obtaining image objects with clear boundaries, locations, and a priori category attributes. Then, regular grid random sampling is used to select image objects with prior category information as initial samples. The features of the image object samples are ranked and selected according to their contribution to each category in order to construct a feature space. Next, the local reachable density is used to define the abnormality index, and the abnormal objects in the initial samples are found and eliminated by selecting an appropriate threshold, which yields the preferred samples. Finally, abnormal data in the crowdsourced vectors are found by combining the preferred samples and the abnormality index.
The key models and methods in this paper include the automatic extraction of test samples, the construction of the remote sensing image feature space, and development of an anomaly detection method based on local reachable density and its implementation algorithm. The proposed abnormal crowdsourcing data detection framework is shown in Figure 1.

Automatic Extraction of Test Samples
In the process of detecting crowdsourced data anomalies, the quality of the samples directly determines the accuracy. The samples must be random and must cover the whole area. If the sample image objects of one prior category actually contain multiple categories, the optimization of the feature parameters is directly affected, which in turn affects the accuracy of abnormal data detection. In fact, although crowdsourced vector data contain many incorrectly labeled abnormal objects, their proportion is still relatively small compared with that of correctly labeled objects. Sample selection mainly adopts stratified random sampling and cluster sampling methods. The main considerations include a reasonable distribution of samples with respect to spatial heterogeneity, the accuracy of spatial locations, and the accuracy of sample attributes (Scepan, 1999; Zhao et al., 2014; Chen et al., 2016). Since crowdsourced data come from volunteers, their distribution follows the volunteers' preferences and interests. During sampling, the randomness and uncertainty of the data distribution should be fully considered, and because of the complexity of the data quality, the data have to be pre-processed to achieve better sampling results. Therefore, this article mainly uses a regular grid method to select samples randomly. First, the sampling area is divided into regular grids. The size of the grid can be set according to the size of the sampling area. The number of selected samples is set according to the total number of objects of the corresponding category. The number of samples drawn from each grid can be expressed by formula (1):

$$n_i = n \cdot \frac{N_i}{N} \qquad (1)$$

In formula (1), $N$ represents the total number of objects in the sampling layer, $N_i$ represents the number of objects contained in the $i$-th grid, $n$ represents the desired number of samples, and $n_i$ is the number of samples drawn from the $i$-th grid.
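The regular grid sampling described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the prototype's implementation: the function names `allocate_samples` and `grid_sample`, the use of object centroids to assign objects to grid cells, and the rounding of each quota to the nearest integer are all assumptions introduced here.

```python
import random
from collections import defaultdict


def allocate_samples(cell_counts, n_desired):
    """Proportional allocation n_i = n * N_i / N, rounded to the nearest integer.

    Rounding is an assumption of this sketch, so the quotas may not sum
    exactly to n_desired.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return {cell: 0 for cell in cell_counts}
    return {
        cell: min(count, round(n_desired * count / total))
        for cell, count in cell_counts.items()
    }


def grid_sample(objects, cell_size, n_desired, seed=0):
    """Draw random samples per grid cell.

    objects: list of (object_id, x, y) centroids in map units (hypothetical layout).
    Returns the list of sampled object ids.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for obj_id, x, y in objects:
        # Assign each object to the regular grid cell containing its centroid.
        cells[(int(x // cell_size), int(y // cell_size))].append(obj_id)
    quota = allocate_samples({c: len(ids) for c, ids in cells.items()}, n_desired)
    sample = []
    for cell, ids in cells.items():
        sample.extend(rng.sample(ids, quota[cell]))
    return sample
```

Cells that hold more objects of the category contribute proportionally more samples, which keeps the sample spatially representative while remaining random within each cell.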

Feature Space Construction of Feature Category
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B4-2021, XXIV ISPRS Congress (2021 edition)
In optical remote sensing images, the spectral and non-spectral features of objects of the same category are highly similar, while those of different categories differ more strongly. Spectral features express how surface objects reflect electromagnetic waves, and different surface objects can be distinguished by calculating the pixel values of each band (Lv et al., 2018). However, different objects with similar spectra reduce the accuracy of classification based on spectral features. Using texture features for change detection in remote sensing images can significantly improve the accuracy of change detection (Ulaby et al., 1986; Du et al., 2014). Therefore, Wei Dongsheng and Yang Wentao (2020) weighted 14 texture feature parameters based on the gray-level co-occurrence matrix (GLCM), and the texture feature parameters with the greatest contribution to each category were selected for detecting damaged buildings after an earthquake. However, because the texture features of the same category in high-resolution images are also complex and changeable, using only 14 texture features cannot fully reflect the differences between categories. Vegetation indices are the simplest and most effective measure of the vegetation condition on the surface, among which the normalized difference greenness index (NDGI) is commonly used (Meyer & Neto, 2008). It uses the brightness values of the red and green bands, which better distinguish the spectral features of vegetation from those of other categories. In addition, the band values of objects also differ considerably between categories. Therefore, this paper introduces the NDGI, the band average, maximum, and minimum, and 14 texture features to construct the optimal feature space.
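As an illustration of the GLCM texture features mentioned above, the following Python sketch computes three of them (ASM, contrast, and IDM) for a small quantized gray-level patch. It uses the usual textbook definitions of these measures; the function names, the single pixel offset, and the symmetric normalization are assumptions of this sketch and are not taken from the prototype.

```python
def glcm(gray, levels, dx=1, dy=0):
    """Symmetric, normalised grey-level co-occurrence matrix for one offset.

    gray: 2-D list of integer grey levels in range(levels).
    Assumes the patch contains at least one valid pixel pair for the offset.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(gray), len(gray[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = gray[r][c], gray[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """ASM, contrast and IDM of a normalised co-occurrence matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "ASM": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "IDM": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }
```

A perfectly uniform patch gives the maximum ASM and IDM and zero contrast, whereas a checkerboard patch gives a lower ASM and a higher contrast, which is why such measures help separate smooth water surfaces from rougher vegetation.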
The NDGI is calculated by formula (2):

$$NDGI = \frac{G - R}{G + R} \qquad (2)$$

In formula (2), $R$ and $G$ represent the red band and green band values of the image object, respectively.
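A minimal Python sketch of the index is given below. It assumes the common normalized-difference form (G - R) / (G + R) computed from the mean red and green values of an image object; the function name `ndgi` and the handling of a zero denominator are assumptions made for illustration.

```python
def ndgi(red, green):
    """Normalised difference of green and red band values for one image object.

    Assumes the form (G - R) / (G + R); red and green are non-negative
    band values (for example, per-object means).
    """
    denom = green + red
    if denom == 0:
        # Assumption of this sketch: an all-dark object is treated as neutral.
        return 0.0
    return (green - red) / denom
```

Values close to 1 indicate objects that are much brighter in the green band than in the red band, as is typical of vegetation, while values near or below 0 indicate other surface types.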
Following the calculation method of the texture feature contribution index (Wei Dongsheng & Yang Wentao, 2020), all feature parameters are normalized, and a feature contribution index is defined for the i-th feature parameter of image object j with respect to its prior category. The calculation method is shown in formula (3). The feature space of image object j based on information gain is described by formula (4). In formulas (3) and (4), the information gain rate of the i-th feature parameter is computed for image object j with respect to its prior category. Since each feature contributes differently to each category, the larger the contribution index, the better the feature discriminates the category in the optimal feature space. Therefore, according to the size of the contribution index, the feature parameters are divided into high contribution (60%-100%), medium contribution (40%-60%), and low contribution (0%-40%) groups. In general, when the contribution of a feature is between 0 and 40%, its ability to discriminate the category is considered extremely low and the feature should be excluded; when it is between 40% and 60%, the feature is considered insufficiently discriminative for the category. When the contribution of a feature is between 60% and 100%, the feature is considered highly discriminative for the category and should be selected.
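The feature selection idea can be illustrated with the Python sketch below. It computes a standard information gain ratio for a discretized feature and maps a contribution value to the three groups given in the text. How the contribution index is derived from the gain rate is defined by formulas (3) and (4) and is not reproduced here, so `contribution_group` simply assumes a contribution already expressed as a fraction between 0 and 1. The function names and the assignment of the boundary values 40% and 60% to the higher group are also assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature_bins, labels):
    """Information gain ratio of a discretised feature with respect to labels."""
    n = len(labels)
    groups = {}
    for b, y in zip(feature_bins, labels):
        groups.setdefault(b, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_bins)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def contribution_group(contribution):
    """Map a contribution in [0, 1] to the high / medium / low groups."""
    if contribution >= 0.6:
        return "high"
    if contribution >= 0.4:
        return "medium"
    return "low"
```

A feature whose bins separate the prior category perfectly from the other objects obtains a gain ratio of 1, while a feature that is independent of the category obtains 0; only features falling in the high group would be kept for the optimal feature space.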

Detection Method Based on the Local Reachable Density
In general, anomalies are data that deviate from most of the data in a data set (Hawkins, 1980). In this article, abnormal data are defined as crowdsourced data that users have labeled incorrectly, that is, data whose surface category attribute tag is wrong. After the optimal feature space is constructed, the idea of density-based anomaly detection is applied to the crowdsourced data. The local reachable density $lrd_k(p)$ is the reciprocal of the average reachable distance from all objects in the k-th neighborhood of object $p$ to object $p$, as shown in formulas (5) and (6):

$$lrd_k(p) = 1 \Big/ \left( \frac{\sum_{o \in N_k(p)} reach\text{-}dist_k(p, o)}{|N_k(p)|} \right) \qquad (5)$$

$$reach\text{-}dist_k(p, o) = \max\{\, k\text{-}distance(o),\; d(p, o) \,\} \qquad (6)$$

If an object is relatively distant from other objects in the feature space, its local reachable density is small, and the probability that it belongs to the same category as the objects in its k-th neighborhood is small, and vice versa. $N_k(p)$ represents the k-th neighborhood of object $p$, which contains at least k objects.
$reach\text{-}dist_k(p, o)$ represents the k-th reachable distance between objects $p$ and $o$, that is, the larger of $k\text{-}distance(o)$, the distance from $o$ to its k-th closest point, and $d(p, o)$, the distance from object $p$ to $o$. The Euclidean distance between two objects is adopted as the distance measure. Suppose the anomaly index of the image object $p$ is $FSOI(p)$.
$D$ represents all objects in a category. The value of $FSOI(p)$ is between 0 and 100%, and it reflects the probability that object $p$ belongs to the category: the larger the value, the smaller the probability that the sample object belongs to the prior category. If the value is greater than a certain threshold, the object can be considered abnormal data with an incorrect category attribute. Otherwise, the object is considered to be data with the correct attribute.
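The local reachable density can be sketched in plain Python as follows. The sketch follows the description in the text (the reciprocal of the mean k-th reachable distance, with Euclidean distance) using the standard density-based outlier formulation. It computes only the density and does not attempt to reproduce the anomaly index itself. The function names are illustrative, and the sketch assumes that the number of objects exceeds k.

```python
import math


def k_distance_and_neighbors(points, idx, k):
    """Return (k-distance, neighbour indices) of points[idx], ties included."""
    dists = sorted(
        (math.dist(points[idx], q), j) for j, q in enumerate(points) if j != idx
    )
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def local_reachability_density(points, k):
    """Reciprocal of the mean reachable distance over each k-th neighbourhood.

    points: list of equal-length feature vectors (tuples of floats).
    Requires len(points) > k.
    """
    info = [k_distance_and_neighbors(points, i, k) for i in range(len(points))]
    lrd = []
    for i, (_, neighbors) in enumerate(info):
        # Reachable distance to o: the larger of o's k-distance and d(p, o).
        reach = [max(info[o][0], math.dist(points[i], points[o])) for o in neighbors]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)
    return lrd
```

For four feature vectors forming a tight cluster and one distant vector, the distant vector receives a much smaller density than the clustered ones, which is the property the abnormality index builds on.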

Experimental Data
This article takes OSM data of Changsha city, China, as the experimental data (Changsha is the capital of Hunan province and a large city in south-central China; the coordinate range is 28.15934°N-28.24013°N, 112.91696°E-112.99370°E). The Bing image selected for the anomaly detection experiment contains 7424 × 8704 pixels with red, green, and blue bands and has a spatial resolution of 1.2 m. As shown in Figure 2, the OSM platform provides Bing images by default for volunteers to use when contributing. All data are provided by the OSM platform. Because the attribute information of OSM data is represented by keys and values in tags, the type of an object cannot be determined directly. Therefore, this article uses a rule-based OSM data model conversion method to give the crowdsourced vector data a relatively standard initial category (Zhou Xiaoguang et al., 2015).

Results and Analysis
To fully automate the anomaly detection process, a prototype system for crowdsourced data anomaly detection based on image features was developed using C#, ArcGIS Engine, and related tools. The prototype includes functions for data loading, image cropping, grid sampling, TFCI calculation, and crowdsourced data anomaly detection. Its main interface is shown in Figure 3. Since the experimental area is located in an urban fringe area, vegetation and water objects are relatively abundant in the crowdsourced data. The experiment therefore uses vegetation and water as examples to test the above method. The vegetation and water layers are extracted from the 2020 OSM crowdsourced vector data, and the sampling grid size is 100 m × 100 m. Then, the sampling layer is used to segment the 2018 image data to obtain the image objects. According to the distribution characteristics of the vegetation and water image objects, a random sampling method is used to sample each grid. Figure 4 shows the sampling layers and the sample layout. In the anomaly detection, the feature space vector of each sample image object is constructed first, and the optimal feature parameters of the two categories are determined according to the initial category attributes of the image objects and the contribution of each feature, as shown in Figure 5. For vegetation, the angular second moment (ASM), contrast, inverse difference moment (IDM), mean, variance, difference entropy, and NDGI constitute the optimal feature space vector. For water, four feature parameters, namely the IDM, difference entropy, correlation, and maximum band value (BandMax), constitute the optimal feature space vector.

Fig 5. Feature contribution indices
When calculating the local reachable density, the value of k affects the resulting abnormality index. Table 1 shows the accuracy of the sample abnormality inspection under different k values when the abnormality threshold is set to 60% or more. The abnormality threshold (AT) is a threshold on the FSOI values, the positive detection rate (PDR) is the total number of correct detections divided by the total number of abnormalities detected, and the missed detection rate (MDR) is the total number of true abnormalities minus the total number of correct detections, divided by the total number of true abnormalities.

Tab 1. Analysis of the vegetation sample anomaly detection accuracy

Table 1 shows that, for vegetation, when the abnormality threshold is 70% and k is 50, the MDR is 0 and the highest PDR is 65.9%. For water, when the abnormality threshold is 70% and k is 30, the MDR is 0 and the highest PDR is 60%. In the process of detecting abnormal data, the values of k and the abnormality threshold affect the results and the statistical accuracy. When the value of k is large, the number of samples with a high degree of abnormality decreases and the MDR is correspondingly higher, which leads to failure of abnormality rejection. When the value of k is in a certain interval (k is 30-50 in Table 1), the number of samples with a high degree of abnormality is relatively large and the MDR is low. When the abnormality threshold is larger, the PDR increases, but the MDR also increases. Therefore, an MDR of zero can be achieved by setting an appropriate k value and a lower abnormality threshold, so that only samples with correct category attributes, that is, the preferred samples, are retained.
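The two accuracy measures defined above can be written as a small Python helper. The function name `detection_rates` and the use of sets of object identifiers are assumptions made for illustration; returning 0 when a denominator is empty is also a choice of this sketch.

```python
def detection_rates(detected, true_abnormal):
    """Return (PDR, MDR) for sets of detected and truly abnormal object ids.

    PDR = correct detections / abnormalities detected
    MDR = (true abnormalities - correct detections) / true abnormalities
    """
    detected, true_abnormal = set(detected), set(true_abnormal)
    correct = detected & true_abnormal
    pdr = len(correct) / len(detected) if detected else 0.0
    if true_abnormal:
        mdr = (len(true_abnormal) - len(correct)) / len(true_abnormal)
    else:
        mdr = 0.0
    return pdr, mdr
```

For example, if four objects are flagged, two of them are truly abnormal, and one true abnormality is missed, the helper yields a PDR of 0.5 and an MDR of one third.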
The results in the table show that when k is between 1/6 and 1/4 of the total number of samples and the abnormality threshold is 60%, essentially all abnormal samples are eliminated, that is, an MDR of 0 can be achieved. Therefore, in the process of removing abnormal samples, the k value for vegetation is set to 50, the k value for water is set to 30, and the abnormality threshold is set to 70%. The preferred sample library obtained after the abnormal samples are eliminated contains 166 vegetation samples and 95 water samples. Although some normal samples are inevitably removed during abnormality removal, this does not affect the detection of the overall crowdsourced vector data, because the method only requires that the samples remaining after removal are all correct. A new feature space is built from the preferred samples. With k set to 1/4 of the total number of objects and the abnormality threshold set to 70%, the abnormality value of each crowdsourced vector object of each category in the experimental area is calculated to obtain the accuracy of the experimental results (Table 2). DA represents the detected abnormalities, CD represents the correct detections, and TA represents the true abnormalities.

Figure 6. Results of OSM abnormal data detection. (a1) and (b1) represent the normal OSM vector data of vegetation and water; (a2) and (b2) represent the abnormal OSM vector data of vegetation and water, respectively.

Figure 6 shows the abnormal detection results for the water and vegetation categories in the experimental area. For each crowdsourced vector object, the corresponding abnormality value is calculated automatically. Data with an abnormality value greater than 70% are highlighted and superimposed on the image data. As shown in Figure 7(b), some of the data are clearly suspected of being malicious annotations.
When the crowdsourcing user provided this part of the vector data, its category attribute was marked as water. Visual interpretation of the remote sensing image shows that the area where the data are located is actually covered by vegetation. With the proposed detection method, such annotation errors can be detected effectively and automatically, thereby providing users with more accurate crowdsourced vector data.

DISCUSSION
Data quality is the principal basis for determining the usability of data. Because contributors are often non-professional, errors in the category attributes of OSM data pose a major challenge to its wide use. This paper therefore proposes an abnormal OSM data detection method based on the optimal feature space of each category in the reference image and on the local reachable density. Although the method is mainly proposed for OSM data, it can also be used for other crowdsourced vector data that are mainly edited with reference images. The experiments demonstrate the feasibility of the proposed method. However, the method still has some shortcomings, and the following aspects need further study and discussion:
(1) In the process of OSM abnormal data detection, the quality of the image also introduces uncertainty. For example, the image texture characteristics of some buildings are very complicated and are not suitable for feature optimization, and in many remote sensing images buildings and tall trees cast large shadows on the ground because of the sunlight angle. In addition, images are affected to some extent by clouds, shadows, and aerosols, which can make it impossible to calculate the corresponding preferred features of the image objects segmented by the vectors.
(2) In this article, the value of k defines the k-th neighborhood of an object, and it is important for distinguishing abnormal samples in the OSM vector data. In this paper, k is set to 1/5 to 1/4 of the total number of samples. These are empirical values obtained through experimental analysis; although suitable parameters can be determined in this way, a large number of experiments and manual parameter tuning are still required. How to select the value of k adaptively based on the number of samples and the type of features is still an open issue.
(3) The preferred feature space in this article is built from the image objects of the preferred OSM samples, and the reference image plays a significant role in the selection of features. Therefore, whether the preferred samples can be transferred to other images is a question that needs further discussion.

CONCLUSIONS
This paper proposes a crowdsourced vector data anomaly detection method based on remote sensing image features. In this method, the crowdsourced vector data are used to segment the corresponding remote sensing imagery to obtain image objects with a priori information (e.g., shape and category) from the vector data and spectral information from the images. A sampling method is then used to obtain the initial samples, although some of them are abnormal or of poor quality. A feature contribution index (FCI) is defined based on information gain to select the optimal features, and a feature space outlier index (FSOI) is used to automatically distinguish outlier samples and changed objects. The initial samples are refined by an iterative procedure. After the iteration, the optimal features are determined and the refined samples with categories are obtained; the imagery feature space is established using the optimal features for each category. Finally, the abnormal objects are identified with the refined samples by calculating the FSOI values of the image objects. A prototype system for abnormal crowdsourced data detection was implemented. After optimal image feature selection, the ASM, contrast, IDM, mean, variance, difference entropy, and NDGI are used to detect abnormal vegetation data, and the IDM, difference entropy, correlation, and maximum band value are used to detect abnormal water data. Experimental results show that abnormal water and vegetation data in OSM can be effectively detected using this method: the missed detection rates of vegetation and water are both close to zero, and the positive detection rates reach 90.4% and 83.8%, respectively. Although the method in this paper is mainly aimed at OSM data, it may also be used for other crowdsourced vector data that were contributed mainly on the basis of reference images.