CHANGE SEMANTIC CONSTRAINED ONLINE DATA CLEANING METHOD FOR REAL-TIME OBSERVATIONAL DATA STREAM

Recent breakthroughs in sensor networks have made it possible to collect and assemble increasing amounts of real-time observational data by observing dynamic phenomena at previously impossible time and space scales. Real-time observational data streams present potentially profound opportunities for real-time applications in disaster mitigation and emergency response by providing accurate and timely estimates of the environment’s status. However, the data are always subject to inevitable anomalies (including errors and anomalous changes/events) caused by various effects of the environment being monitored. Because of their low data quality, these “big but dirty” real-time observational data streams can rarely achieve their full potential in subsequent real-time models or applications. Therefore, timely and meaningful online data cleaning is a necessary prerequisite step to ensure the quality, reliability, and timeliness of real-time observational data. In general, a straightforward streaming data cleaning approach is to define various types of models/classifiers representing the normal behavior of sensor data streams and then declare any deviation from this model as abnormal or erroneous data. The effectiveness of these models is affected by dynamic changes in the deployed environments. Due to the changing nature of the complicated process being observed, real-time observational data are characterized by diversity and dynamics, which are typical Big (Geo) Data characteristics. Dynamics and diversity are reflected not only in the data values but also in the complicated changing patterns of the data distributions. This means that the pattern of the real-time observational data distribution is not stationary or static but changing and dynamic. After the data pattern has changed, it is necessary to adapt the model over time to cope with the changing patterns of real-time data streams. Otherwise, the model will not fit the subsequent observational data streams, which may lead to large estimation errors. In order to achieve the best generalization error, the data cleaning methodology must be able to characterize the behavior of data stream distributions and adaptively update a model to include new information and remove old information, which is an important challenge. However, the complicated changing property of the data invalidates traditional data cleaning methods, which rely on the assumption of a stationary data distribution, and drives the need for more dynamic and adaptive online data cleaning methods. To overcome these shortcomings, this paper presents a change semantics constrained online filtering method for real-time observational data. Based on the principle that the filter parameter should vary in accordance with the data change patterns, this paper embeds a semantic description that quantitatively depicts the change patterns in the data distribution, so that the filter parameter adapts automatically. Real-time observational water level data streams from different precipitation scenarios are selected for testing. Experimental results show that this method makes more accurate and reliable water level information available, which is a prerequisite for scientific and prompt flood assessment and decision-making.


INTRODUCTION
Recent breakthroughs in sensor networks have made it possible to collect and assemble increasing amounts of real-time observational data by observing dynamic phenomena at previously impossible time and space scales (NSF, 2005). Real-time observational data streams present potentially profound opportunities for real-time applications in disaster mitigation and emergency response by providing accurate and timely estimates of the environment's status (Gama and Gaber, 2007). However, the data are always subject to inevitable anomalies (including errors and anomalous changes/events) caused by various effects of the environment being monitored (Nativi et al., 2015). Because of their low data quality, these "big but dirty" real-time observational data streams can rarely achieve their full potential in subsequent real-time applications. Therefore, timely and meaningful online data cleaning is a necessary prerequisite step to ensure the quality, reliability, and timeliness of real-time observational data streams (Huang, 2015; Goodchild, 2013; PhridviRaj and GuruRao, 2014).

Fig. 1 The fluctuation patterns of the real-time water level data stream (Chandola et al., 2009)

Anomalies in real-time observational data streams can cover a variety of anomalous changes/events and errors, and they have various semantics with different lengths, distributions and change patterns (Chandola et al., 2009). Incorrect sensor measurements are considered one type of anomaly in this study. In theory, errors are observational values that do not conform to the true state of the monitored phenomenon and deviate significantly from the a priori normal behaviour of the sensed data (Zhang et al., 2010). Anomalous changes/events are unusual patterns that reflect the true state of the monitored phenomenon but do not conform to the normal sensor data patterns. Fig. 1 illustrates several major anomalies that can occur in a real-time observational data stream. They can be classified into three types: point anomalies, contextual anomalies, and collective anomalies. Anomalies can have different causes. They are often produced by sensor software or hardware malfunctions and data shifting errors, but a highly abnormal change in the phenomenon itself may also cause anomalies. It is hard to distinguish between errors and anomalous changes/events by analyzing the sensor data alone, as the same data instances might be considered errors or abnormal changes/events depending on the context. Consequently, in order to efficiently identify both erroneous values and unusual but true event values, contextual semantic awareness is essential to a good data cleaning methodology.
Data cleaning for real-time observational data streams refers to the dynamic data quality assurance and control process of finding anomalous patterns, removing errors and extracting useful information. In general, a straightforward streaming data cleaning approach is to define various types of models/classifiers representing the normal behaviour of sensor data streams and then declare any deviation from this model as abnormal or erroneous data. The effectiveness of these models is affected by dynamic changes in the deployed environments. Due to the changing nature of the complicated phenomenon, real-time observational data are characterized by diversity and dynamics, which are typical Big (Geo) Data characteristics. Dynamics and diversity are reflected not only in the data values but also in the complicated changing patterns of the data distributions. This means that the pattern of the real-time observational data distribution is not stationary or static but changing and dynamic. After the data pattern has changed, it is necessary to adapt the model over time to cope with the changing patterns of real-time data streams. Otherwise, the model will not fit the subsequent observational data streams, which may lead to large estimation errors. Therefore, it is an important challenge for the data cleaning methodology to be able to characterize the behaviour of data stream distributions and adaptively update the model.
To meet the need for online cleaning of environmental sensor data streams, several studies have explored methods suitable for separating faulty data in real-time observational data streams (Hill, 2013). Previous approaches to sensor data stream cleaning can be broadly categorized into two classes (Chandola et al., 2009; Patcha and Park, 2007). Rule-based approaches use a priori knowledge to classify the sensor data. Statistical-based approaches carry out detection by comparing the sensed data trajectory with predefined samples. Values that deviate from the predefined patterns are labelled as outliers.
Rule-based approaches exploit domain knowledge to define heuristic rules/constraints that normal sensor values must satisfy (Rahman et al., 2014); values that violate them are regarded as outliers. Expert knowledge and historical experiments in specific domains have been widely used for sensor validation and fault detection in many fields, including engineered systems (Rabatel et al., 2011) and the water environment (Mounce et al., 2011). Rule-based approaches establish a series of rules (e.g. thresholds) from domain experts' knowledge or historical training experiments and then classify newly arriving data into one of the predefined (normal/outlier) classes. They rely on the availability of accurate and varied domain rules (a priori knowledge) and are closely tied to their application domain. However, the a priori knowledge required by rule-based approaches is often not available. Ramanathan et al. (2006) introduced a series of thresholds obtained from domain experts that can be used to construct data quality rules for soil chemical concentration data. Rule-based methods can be highly accurate in detecting and classifying errors. However, the rule-based model/classifier is highly sensitive to the selected rules (Sharma et al., 2010) and adapts poorly to new data characteristics. Therefore, when new data characteristics emerge, the rule-based a priori model/classifier has to be re-optimized to accommodate new instances that belong to the normal class, which makes such approaches unsuitable for real-time streaming data cleaning.
Statistical-based approaches leverage historical distributions of sensor behavior to build a statistical probability distribution model for the streaming data (Bolton and Hand, 2002; Ge et al., 2010; Montgomery, 2007). In most statistical-based approaches, the statistical model is used to predict the underlying data distribution based on the temporal correlations of the historical data of individual sensors (Hill and Minsker, 2010; Yao et al., 2010) or the spatiotemporal correlations of the sensor network. This study concentrates on cleaning the data sets at individual nodes. Once the statistical model has been defined, statistical inference tests are applied to determine whether a newly arriving data point conforms to the model. Under the statistical model, normal data fall in the high-probability regions, and outliers are assumed to lie in the low-probability regions. Approaches of this type can work without supervision: the statistical model can be defined when only a small number of data points are outliers and the majority of the observations fit the model. Unlike rule-based approaches, statistical approaches do not require a priori knowledge. Statistical-based techniques are mathematically grounded and can be justified through computation (Bolton and Hand, 2002). Their major disadvantage is that they often rely on the assumption that the spatial, temporal or spatiotemporal correlation of the observational data obeys an accurate and quantifiable probability distribution (Markou and Singh, 2003). This assumption is not universally applicable to real-life phenomena.
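As a minimal illustration of the statistical idea described above, the following Python sketch fits a Gaussian to historical readings and flags a new reading that falls in the low-probability tails. The Gaussian assumption, the function names and the threshold `z_max` are illustrative choices, not taken from the cited studies.

```python
import statistics


def fit_gaussian(history):
    """Estimate mean and sample standard deviation from historical readings."""
    return statistics.mean(history), statistics.stdev(history)


def is_outlier(value, mean, std, z_max=3.0):
    """Flag a reading whose standardized distance from the mean exceeds z_max.

    z_max is an assumed threshold; a real application would tune it.
    """
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_max


# Hypothetical water level readings in metres.
history = [2.0, 2.1, 1.9, 2.0, 2.05, 1.95]
mean, std = fit_gaussian(history)
print(is_outlier(2.05, mean, std), is_outlier(3.0, mean, std))
```

Such a test is only as good as the stationarity assumption behind the fitted distribution, which is the limitation discussed in the text.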
To conclude, various data cleaning approaches for sensor data streams have been proposed in the literature. These studies have reviewed in detail the use of rule-based classifiers or statistical models to represent the underlying distribution patterns of normal observations and to identify any highly different instances as errors. The effectiveness of these models is affected by dynamic changes in the monitored environments. The adaptability of these models to changing data distributions in dynamic environments is an important challenge for assuring the quality of sensor measurements. However, most of the proposed data cleaning approaches assume a stationary data distribution and a quasi-stationary process, which cannot be assumed for most data processed by anomaly detection systems (Patcha and Park, 2007). In addition, previous studies have rarely considered the adaptability of the data cleaning methodology to dynamic changes in the monitored environment. The complicated changing property of real-time observational data streams is not suited to traditional data cleaning methods and calls for dynamic and adaptive online data cleaning methods.
In this paper, a novel hydrological change semantics constrained online data cleaning method for real-time observational water level data is proposed. The trajectory composed of consecutive observation points on the real-time data stream contains rich semantic information about the underlying changing states of the phenomenon; moreover, the changing pattern of the trajectory distribution implies the change semantics of the evolution process and the evolution trend of the observed dynamic process. In this case, semantically understanding the observed phenomenon and the extreme events, or learning the data distribution patterns, can assist in adaptively updating the model as well as in resolving the ambiguity between errors and true events. Therefore, this work presents an adaptive model optimizing strategy that creates a dynamic semantic mapping between real-time data changing patterns and the rules of spatial-temporal geographic process evolution, and then uses change semantics to constrain the Kalman filter optimizing process.
The efficiency and effectiveness of the proposed method are validated and compared with some existing filters using real-life, real-time observational water level data streams from different precipitation scenarios.
The remainder of this paper is structured as follows. The proposed methodology and the implementation of the proposed approach are described in Section 2. The experimental results, analysis, and evaluation of the proposed models are outlined in Section 3. Finally, concluding remarks and some directions for future research are presented in Section 4.

Overview
In this section, the framework of our proposed approach is described in detail. The brief framework of our approach is illustrated in Fig. 2. Water level observations are recorded once every 5 minutes, so each day has 288 observations. In each day, the observations show a clear regular pattern of "rising - peak - descending - slack"; even the water level data obtained in heavy rainy weather show the same regular pattern. Thus, the real-time water level stream can be divided into segments with a semantic annotation of "rising" or "descending". With this annotation, we can distinguish whether or not a change in water level is abnormal. That is to say, if a "descending" value occurs in a "rising" period, the "descending" value should be considered an error, and vice versa.
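The direction rule described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the labels are derived from an anomaly-free reference series of the same length as the stream, and the names `label_phases`, `flag_contradictions` and the tolerance `tol` are hypothetical.

```python
def label_phases(reference):
    """Label each step of an anomaly-free reference series.

    Step i covers the change from reference[i] to reference[i + 1].
    """
    labels = []
    for prev, cur in zip(reference, reference[1:]):
        if cur > prev:
            labels.append("rising")
        elif cur < prev:
            labels.append("descending")
        else:
            labels.append("slack")
    return labels


def flag_contradictions(stream, labels, tol=0.0):
    """Return indices k where stream[k] - stream[k - 1] opposes the step label.

    Assumes len(stream) <= len(labels) + 1; tol is an assumed noise tolerance.
    """
    flagged = []
    for k in range(1, len(stream)):
        delta = stream[k] - stream[k - 1]
        label = labels[k - 1]
        if label == "rising" and delta < -tol:
            flagged.append(k)
        elif label == "descending" and delta > tol:
            flagged.append(k)
    return flagged


# Hypothetical values in metres: the dip at index 2 occurs in a rising period.
reference = [1.0, 1.1, 1.2, 1.3, 1.2, 1.1]
stream = [1.0, 1.1, 0.9, 1.3, 1.2, 1.1]
print(flag_contradictions(stream, label_phases(reference)))
```

A direction check of this kind only catches changes of the wrong sign; a spike in the same direction as the current phase needs a bound on the magnitude of the change as well.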

Adaptive Online Kalman filtering
After the behavior of the data stream has been characterized in the first step of the proposed method, adaptive online Kalman filtering is executed to detect anomalies and errors in the newly recorded data stream. In this part, the previously obtained knowledge is used to identify the time at which the data distribution has changed significantly enough to justify an update of the Kalman model parameters in the data filtering step. Different model parameters can be optimal for different data distributions.
The Kalman filter provides an efficient recursive computational means of estimating the state of a process in a way that minimises the mean squared error. Owing to its low computational complexity and small memory footprint, the Kalman filter is suitable for dynamic applications and is widely adopted in data stream processing (Sun and Deng, 2004; Li and Peng, 2014; Wang et al., 2011). However, although classical Kalman filters appear to perform well in most cases of dynamic data quality control for environmental sensor data, applying them to water level observations may lead to poor results. This is because, when the data distribution changes, the Kalman filter model parameters need to be re-optimized to reflect the change. In our study, adaptive online Kalman filtering is based on semantically constrained parameter re-optimization. The state variable at time $t_k$ is the true dynamic water level $w_k$, defined as $w_k = w_{k-1} + \Delta w_k$, where $w_{k-1}$ is the true dynamic water level at time $t_{k-1}$ and $\Delta w_k$ is the semantically constrained change between two consecutive water level observations. The observation $z_k$ can then be represented by adding a bias to the true dynamic water level $w_k$. Therefore, the Kalman filter in our study consists of a state equation, Equation (1), and a measurement equation, Equation (2). $\Omega_k$ denotes the process noise and $\delta_k$ denotes the measurement noise. $Q_k$ denotes the process noise covariance and $R_k$ denotes the measurement noise covariance, and we assume $E[\Omega_k] = 0$ and $E[\delta_k] = 0$. That is to say, the estimates of $Q_k$ and $R_k$ contain only the non-systematic part of the errors, and the water level changes are assumed to follow a normal distribution. During the recursive online Kalman filtering process, the contextual criteria on water level change are dynamically retrieved from the knowledge base, and the appropriate range for $\Delta w_k$ is then given. For each $\Delta w_k$, our filter determines whether $z_k - z_{k-1}$ falls within this range.
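To make the recursion concrete, the following Python sketch implements a scalar Kalman filter for an assumed state model w_k = w_{k-1} + Δw_k + Ω_k and measurement model z_k = w_k + δ_k, with a gate that rejects readings whose change falls outside the allowed range. It is a sketch under stated assumptions rather than the authors' implementation: the range midpoint is used as Δw_k, the gate compares z_k with the previous estimate instead of z_{k-1} (so that a rejected reading does not contaminate the next test), and `constrained_kalman`, `change_ranges`, `q` and `r` are hypothetical names and values.

```python
def constrained_kalman(observations, change_ranges, q=1e-4, r=1e-2):
    """Scalar Kalman filter with a change-range gate.

    observations  : water level readings z_k
    change_ranges : (low, high) bounds on the allowed change at step k,
                    a stand-in for the knowledge-base lookup (index 0 unused)
    q, r          : process and measurement noise variances (assumed values)
    Returns the filtered estimates and the indices of rejected readings.
    """
    w_est = observations[0]  # initial state estimate
    p_est = r                # initial estimate variance
    estimates = [w_est]
    rejected = []
    for k in range(1, len(observations)):
        low, high = change_ranges[k]
        dw = 0.5 * (low + high)   # expected change: range midpoint (assumption)
        w_pred = w_est + dw       # predicted state
        p_pred = p_est + q        # predicted variance
        change = observations[k] - w_est
        if low <= change <= high:
            # Reading is consistent with the change semantics: standard update.
            gain = p_pred / (p_pred + r)
            w_est = w_pred + gain * (observations[k] - w_pred)
            p_est = (1.0 - gain) * p_pred
        else:
            # Reading violates the constraint: keep the prediction.
            rejected.append(k)
            w_est, p_est = w_pred, p_pred
        estimates.append(w_est)
    return estimates, rejected


# Hypothetical rising segment in metres with one spike at index 3.
z = [1.00, 1.02, 1.04, 1.50, 1.08, 1.10]
ranges = [(0.0, 0.05)] * len(z)
estimates, rejected = constrained_kalman(z, ranges)
print(rejected)
```

Keeping the prediction when a reading is rejected lets the estimate continue along the expected trend, which is one simple way to avoid the spikes that an ungated filter would follow.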

EXPERIMENTS AND RESULTS
In order to evaluate the efficacy of the online data cleaning method proposed in this study, the method was applied to data from a hydrological station operated by the Pinghu Shi Water Conservancy Bureau, located in Huanggutang. This hydrological station provides water level measurements at 5-min intervals, a frequency selected to support urban flood monitoring. In the abnormal areas (denoted by the dashed rectangles in Fig. 4 and Fig. 5), the proposed method filters the outliers correctly. On the other hand, the classical Kalman filter still produces noticeable spikes in the profiles. In the outlier regions, the slope of the water level exceeds a certain range, which the classical Kalman filter ignores. In the proposed method, however, the range of the slopes is determined by combining the contextual semantics given in Table 1 with the amount of observed precipitation, so abnormal observations can be dynamically filtered using the constrained slope of the water level. Finally, it is worth noting that the projection of the current water levels estimated by our method is reasonable and shows high correlation with the regular water level change patterns and the impacts of weather. Fig. 6 shows the historical water level observations without rain (blue lines), the projected water levels after filtering during a rain event (red lines) and the corresponding precipitation observations (blue bars). Observations are recorded once every 5 minutes, so each day has 288 observations. The regular daily patterns define the contextual semantics in the proposed method. Furthermore, the filtering and projection of the water level observations also take the precipitation data into account. For example, in time period t1, which has descending semantics and a positive amount of precipitation, the water level is constrained to be smoothly rising or descending according to Table 1.
In order to evaluate the results, we calculated three widely used quality measures. (1) The determination coefficient, denoted as DC. This is a commonly used statistic for evaluating the fit between the projections and the ground truth observations. DC ranges from 0 to 1, where 1 denotes a perfect fit between the projection and the ground truth. Equation (3) gives the calculation of DC. (2) Traditional evaluation statistics, namely E_RMSE (root mean square error) and E_NRMSE (normalized root mean square error). Table 2 compares the performance of the proposed method and the classical Kalman filter with regard to computational complexity and the accuracy of the projections against the ground truth observations. The proposed method requires a lower runtime, because some observations are filtered using the slope threshold derived from the constraints of the contextual semantics and the observed precipitation. Furthermore, the proposed method improves both DC and the error statistics. In the first dataset, because the ratio of outliers is higher, the classical method resulted in a DC of only about 0.6, which is about 60% inferior to the proposed method. Similar results appear in the RMSE and NRMSE. In the second dataset, the ratio of outliers is relatively lower, and thus the DC of the classical method is also reasonable; however, the proposed method achieves an almost perfect fit with the ground truth observations. Because the classical filter does not remove the outliers, the difference is also reflected in the RMSE and NRMSE, where the proposed method is about 50% better.
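The three measures can be computed as in the following Python sketch. The formulas are the standard textbook forms and are an assumption here: DC is taken as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the ground truth, and NRMSE is taken as RMSE divided by the range of the ground truth (other normalizations, such as the mean, are also common). Note that this form of DC can become negative for very poor fits.

```python
import math


def determination_coefficient(truth, projection):
    """DC = 1 - sum((w_t - w_hat_t)^2) / sum((w_t - w_bar)^2)."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((w - p) ** 2 for w, p in zip(truth, projection))
    ss_tot = sum((w - mean_truth) ** 2 for w in truth)
    return 1.0 - ss_res / ss_tot


def rmse(truth, projection):
    """Root mean square error between ground truth and projection."""
    n = len(truth)
    return math.sqrt(sum((w - p) ** 2 for w, p in zip(truth, projection)) / n)


def nrmse(truth, projection):
    """RMSE normalized by the range of the ground truth (assumed convention)."""
    return rmse(truth, projection) / (max(truth) - min(truth))


# Hypothetical values in metres.
truth = [1.0, 2.0, 3.0]
projection = [1.0, 2.0, 4.0]
print(determination_coefficient(truth, projection),
      rmse(truth, projection),
      nrmse(truth, projection))
```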

CONCLUSION
Aiming at online data stream cleaning, this paper presents a change semantics constrained online filtering method. In summary, the proposed method has the following two innovative aspects. (1) A new and efficient semantic awareness based function for detecting change patterns in real-time data streams is embedded in the proposed method to explicitly depict the semantics of, and quantitatively describe, the data pattern changes in dynamic environments.
(2) A dynamic semantic mapping between real-time data changing patterns and the rules of spatial-temporal geographic process evolution is implemented in the algorithm for adaptive threshold determination and Kalman filter model parameter optimization. As a result, the misclassification error caused by dynamic change is reduced. Further work can be done on the change detection step to handle more complicated observational datasets.
Fig. 2 General framework of the proposed approach

Change detection
In brief, the change detection process is divided into three steps. First, the most frequent data distribution patterns without anomalies are extracted from historical observations. Then, these patterns are characterized into knowledge classes. The knowledge classes describe normal behaviour with contextual semantics and also provide the data filtering process with essential information about the impact of contextual criteria. For example, Fig. 2 shows the normal distribution of real-time water level observations from the Pinghu Shi hydrological station.

Fig. 3
Fig. 2 Normal change patterns of the real-time water level data stream obtained at Huanggutang station

Furthermore, the slopes between two consecutive water level observations are found to be distributed evenly within a range, and the water level changes between consecutive observations have roughly the same proportion. However, the results show that under different weather conditions the distribution of the water level change pattern stays the same, while the slope between two consecutive water level observations changes. This is reasonable because the water level change in Pinghu City is a cyclical fluctuation phenomenon that is readily affected by rainy weather, as shown in Fig. 3.
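One simple way to turn this observation into a knowledge base is to record, for each context, the range of slopes seen in anomaly-free historical data. The Python sketch below is an illustration only: the context key (semantic label plus a rain flag), the min/max summary and the name `build_slope_ranges` are assumptions, not details taken from the method.

```python
from collections import defaultdict


def build_slope_ranges(levels, labels, raining):
    """Collect the observed range of consecutive changes per context.

    levels  : anomaly-free historical water levels
    labels  : semantic label of each step k -> k + 1 ('rising' or 'descending')
    raining : True/False flag for each step (hypothetical precipitation indicator)
    Returns a dict mapping (label, raining) to (min_change, max_change).
    """
    slopes = defaultdict(list)
    for k in range(1, len(levels)):
        context = (labels[k - 1], raining[k - 1])
        slopes[context].append(levels[k] - levels[k - 1])
    return {ctx: (min(vals), max(vals)) for ctx, vals in slopes.items()}


# Hypothetical values in metres.
levels = [1.0, 1.5, 2.5, 2.0, 1.0]
labels = ["rising", "rising", "descending", "descending"]
raining = [False, True, False, False]
print(build_slope_ranges(levels, labels, raining))
```

The resulting ranges could then serve as the bounds against which each new change in water level is checked during filtering.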
Fig. 4 and Fig. 5 illustrate the online data cleaning results from the classical Kalman filtering method and the proposed method, in which the blue solid lines show the results of the proposed method and the red solid lines denote the results of the Kalman filter.

Fig. 4 Filtering results of the real-time observational water level data stream obtained at Huanggutang station with no precipitation (2014/6/9-2014/6/10)

$w_t$ denotes the ground truth of the water level. Because the water level shows a regular pattern in the historical observations of different days, observations with no outliers are chosen as ground truth. $\hat{w}_t$ denotes the projections of the Kalman filter and $\bar{w}$ denotes the mean of the ground truth.
Tab. 2 Performance comparison of the proposed and traditional methods