

arXiv:2012.14555v2 [cs.DB] 6 Jan 2021

JOURNAL OF LATEX CLASS FILES, VOL. X NO. X, X 20XX

Misplaced Subsequences Repairing with Application to Multivariate Industrial Time Series Data

Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, and Chen Wang.

Abstract—Both the volume and the collection velocity of time series generated by monitoring sensors are increasing in the Internet of Things (IoT). Data management and analysis require high quality and applicability of the IoT data. However, errors are prevalent in original time series data. Inconsistency in time series is a serious data quality problem that exists widely in IoT, and it can hardly be solved by existing techniques. Motivated by this, we define an inconsistent subsequences problem in multivariate time series, and propose an integrity data repair approach to solve inconsistency problems. Our proposed repairing method consists of two parts: (1) we design an effective anomaly detection method to discover latent inconsistent subsequences in the IoT time series; and (2) we develop repair algorithms to precisely locate the start and finish time of inconsistent intervals, and provide reliable repairing strategies. A thorough experiment on two real-life datasets verifies the superiority of our method compared to other practical approaches. Experimental results also show that our method captures and repairs inconsistency problems effectively in industrial time series in complex IIoT scenarios.

Index Terms—IoT data quality management, industrial time series, inconsistency repairing, industrial data cleaning.

1 INTRODUCTION

The widespread use of various monitoring sensors and the rapid performance improvement of sensing devices have given rise to data management and analysis in the Internet of Things (IoT). Time series data collected from sensor devices are one important data form in IoT. In data monitoring systems, data points are always collected simultaneously from multiple dimensions, where each dimension (a.k.a. attribute) corresponds to one sensor [1]. Thus, the multi-dimensional data from multiple sensors together describe the status of a whole piece of equipment. That is, for an $M$-dimensional time series $\mathcal{S}$, the $m$-th sequence of $\mathcal{S}$ corresponds to the monitoring data of the $m$-th dimension.

As high-quality IoT data is acknowledged to be the basic premise for reliable information extraction and valuable knowledge discovery [2], the quality demand for time series data has grown stricter in various data application scenarios [3], [4]. However, time series data are often dirty and contain quality problems, especially in industrial settings. The work in [5] proposed three kinds of industrial time series data problems, namely missing values, inconsistent attribute values, and abnormal values or anomaly events. We have further found that misplaced subsequences in multivariate time series are one serious inconsistency problem in data quality management.

In real time series monitoring systems, e.g., Cyber-Physical Systems (CPS), some values in the $m$-th sequence may not correspond to the monitoring of the $m$-th dimension, due to unexpected troubles and signal interference during working condition transitions of the equipment. For example, clock errors may arise among sensors of different types. Transmission delay between sensors and the monitoring system may also occur because of short-time network faults. In such cases, a subsequence from the $m_1$-th dimension may be recorded in the $m_2$-th dimension starting from some time point, and this lasts for a time interval. This gives rise to an inconsistency problem during a certain working condition. We present a motivating example of an inconsistency instance below.

• XO Ding, HZ Wang, and JX Su are with Harbin Institute of Technology, P.O.Box 750, Harbin, Heilongjiang, 150001, China. C Wang is with National Engineering Laboratory for Big Data Software, EIRI, Tsinghua University, Beijing, China. E-mail: [email protected], [email protected], [email protected], wang [email protected]

Example 1. Figure 1 shows a segment of sequences from five sensors of a piece of equipment in the same time interval. The subsequences in sensors HN110, HN111, HNC10, and HNC02¹ present abnormal sequence patterns, and they are possibly recorded incorrectly in the current sequences in time interval [30000s, 40000s]. Figure 2 partially enlarges the inconsistent subsequences existing in HN110 and HN111 in [30000s, 40000s]. In fact, a subsequence of HN110 is falsely recorded in HN111, while the corresponding subsequence of HN111 is placed in HN110.

The aforesaid misplaced subsequences problem in time series data under industrial scenarios brings at least two kinds of challenges.

• The misplaced subsequences problem belongs to continuous errors [6], rather than errors that happen at a single point. It is necessary to find out when the misplaced subsequences problem arises and how long it lasts. As the time series data is collected continuously and densely, it is not easy to precisely compute the start and end time points of such inconsistency intervals. A method to identify these interval bounds must be sensitive and reliable enough to capture the tendency of unexpected changes in a timely manner.

• The pattern of misplaced subsequence errors is complicated. Since the data monitoring system often suffers from different sensor failures or system errors, it is uncertain how many attributes are involved in one misplaced error. As the number of attributes (sensors) of a piece of equipment is not small, the increasing data amount and attribute number add to the difficulty of both the error detection and the repair process.

1. For privacy reasons, we have desensitized the names and the IDs of the sensors.

Fig. 1. An inconsistency example. [Plot omitted: five stacked panels for sequences HN110, HN111, HNC10, HN102, and HNC02; x-axis: Time points (s), from 20000 to 45000.]

Though data quality demand in time series is increasing, the research on repairing multi-dimensional inconsistent subsequences is not adequate. In time series data cleaning, outlier detection (also called anomaly detection, error identification, etc.) techniques have been developed for various applications [4]. However, few methods are proposed for data repairing tasks. A recent survey [6] reviewed various time series error cleaning methods. Most studies focus on cleaning single point errors, and pay less attention to continuous errors in multivariate time series. As for existing data inconsistency repairing studies, most techniques are mainly designed for relational data, and do not apply to inconsistency problems in time series.

As inconsistency in multivariate time series has only recently been uncovered in time series management systems, especially in the industrial field, effective solutions are still in high demand for abnormal pattern identification and inconsistent subsequence pair repairing in multi-dimensional time series [1]. Motivated by this, we address the problem of repairing inconsistent subsequences in multivariate time series under industrial applications in this paper. We summarize our contributions as follows:

(1) We formalize a serious inconsistency problem in Industrial Internet of Things (IIoT) data management, i.e., inconsistent subsequences repairing in multivariate time series, according to real IIoT scenarios.

Fig. 2. Inconsistency between sequence HN110 and HN111. [Plot omitted: sequences HN110 and HN111; x-axis: Time points (s), from 25000 to 45000.]

(2) We devise an integrated method to detect inconsistent time intervals in data collected by data monitoring systems, and to map the inconsistent subsequences in each interval to the correct dimensions. Considering the real challenges in industrial data management, our method has the following strengths.

• Effectiveness in complex industrial scenarios. We design algorithms to detect inconsistent subsequences accurately from multi-dimensional time series under complex situations. Our method can identify and repair hybrid inconsistent subsequences (see Fig. 4 in Sec. 3).

• Less negative cumulative effect in sequence behavior modelling. When processing abnormal sequence behaviors, our method distinguishes real inconsistent data from normal sequences with a well-designed sequence behavior model (see Algorithm 1). The proposed detection phase guarantees that the anomalous part will not affect the performance of the subsequent detection. Moreover, our method always identifies inconsistent time intervals with both the start and the end points (we call them bounds below). This guarantees the reliability of the solutions under industrial data repairing scenarios, because normal sequences will not be modified by mistake.

• Fault-tolerant repairing approach. We propose a method to obtain repairing solutions from the evaluation of the candidate repair schemas (see Sec. 4.1 and 4.2). In this step, our algorithms carefully reconsider all the possibly falsely processed schemas with necessary modification (e.g., union or replace), and then provide high-quality repair solutions for true inconsistent time intervals.

(3) We conduct a thorough experiment on two real-life datasets from large-scale IIoT monitoring systems over 5 consecutive months. Experimental results on real-life data demonstrate the effectiveness of our method. Comparison experiments show that the proposed repairing strategies significantly improve both the accuracy and the efficiency of inconsistency repairing.

Organization. The rest of the paper is organized as follows: We discuss the basic definitions and the overview of our approach in Sec. 2. Sec. 3 introduces the inconsistent interval detection approach and the computation of candidate repair schemas. Sec. 4 discusses the evaluation of candidate repairing schemas and the determination of repairing results. The experimental study is reported in Sec. 5. We introduce the related work in Sec. 6, and draw our conclusion in Sec. 7.

2 PROBLEM OVERVIEW

We first define the inconsistent subsequences repairing problem in Sec. 2.1, and then introduce our solution framework for the proposed problem in Sec. 2.2.

2.1 Preliminaries

A piece of equipment is normally regarded as the minimum independent unit of monitoring time series in IoT data management systems. Following [1], we outline the basic concepts in our problem with Fig. 3. Each piece of equipment has a sensor set, denoted by $\mathcal{S}_{Eq} = \{S_1, ..., S_M\}$, where each sensor $S_m$ ($m \in [1, M]$) generates one time series along the time axis, and all sensors generate time series simultaneously. These time series are collected from the corresponding equipment and monitored by IoT data management systems. We define the sequence from a sensor in Definition 1, and the sensor set of a piece of equipment is regarded as a multivariate time series, as shown in Definition 2.

Definition 1. (Sequence). $S = \langle s_1, ..., s_N \rangle$ is a sequence on sensor $S$, where $N = |S|$ is the length of $S$, i.e., the total number of elements in $S$. $s_n = \langle x_n, t_n \rangle$ ($n \in [1, N]$), where $x_n$ is a real-valued number with a time point $t_n$, and for $\forall n, k \in [1, N]$, it holds that $(n < k) \Leftrightarrow (t_n < t_k)$.

Definition 2. (Multivariate time series). Let $Eq$ be an equipment sensor group. $\mathcal{S}_{Eq} = \{S_1, ..., S_M\} \in \mathbb{R}^{N \times M}$ is an $M$-dimensional time series, where $M$ is the total number of equipment sensors, i.e., the number of dimensions.

In this paper, we focus on one kind of continuous error, i.e., inconsistent subsequences in time series. According to Definition 3, a subsequence $S_{[l,n]}$ corresponds to a time interval $T_{[l:n]}$ (Definition 5). For an $M$-dimensional time series $\mathcal{S}$, $T_{[l:n]}$ therefore selects $M$ subsequences, one from each sequence. All these subsequences share a common length, i.e., the length of $T_{[l:n]}$.

Definition 3. (Subsequence). A subsequence $S_{[l,n]} = \langle s_l, ..., s_n \rangle$ ($1 \le l \le n < N$) is a contiguous part of sequence $S$, which begins at element $s_l$ and ends at $s_n$.

Definition 4. (Sequence tuple). A sequence tuple in an $M$-dimensional $\mathcal{S}$ is the set of all data points at time $t_i$, denoted by $\mathcal{S}(t_i) = \langle s_{i1}, s_{i2}, ..., s_{iM} \rangle$, i.e., the $i$-th row of $\mathcal{S}$.

Definition 5. (Time interval). Let $T = \{t_1, ..., t_N\}$ be the set of time points of time series $\mathcal{S}_{Eq}$. $T_{[l:n]}$ is a time interval in $T$ which begins at time point $t_l$ and ends at $t_n$.
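Definitions 1 to 5 can be illustrated with plain containers. The following is a minimal sketch, assuming each sequence is stored as a list of (value, time) pairs and a multivariate series as a list of such sequences; the helper names `make_sequence`, `subsequence`, and `sequence_tuple` are hypothetical and do not come from the paper. For convenience the sketch also permits $n = N$.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x_n, t_n): a value and its time point

def make_sequence(values: List[float], times: List[float]) -> List[Point]:
    """Build a sequence as in Definition 1; time points must strictly increase."""
    if len(values) != len(times):
        raise ValueError("values and times must have equal length")
    if any(t1 >= t2 for t1, t2 in zip(times, times[1:])):
        raise ValueError("time points must be strictly increasing")
    return list(zip(values, times))

def subsequence(seq: List[Point], l: int, n: int) -> List[Point]:
    """Return S[l,n] using 1-based, inclusive indices as in Definition 3."""
    if not 1 <= l <= n <= len(seq):
        raise ValueError("require 1 <= l <= n <= N")
    return seq[l - 1:n]

def sequence_tuple(series: List[List[Point]], i: int) -> List[float]:
    """Return the values of all M sequences at the i-th time point (1-based)."""
    return [seq[i - 1][0] for seq in series]
```

With two sequences of equal length, `sequence_tuple(series, i)` returns the $i$-th row of the series, and applying `subsequence` to every sequence with the same `l` and `n` yields the $M$ subsequences selected by $T_{[l:n]}$.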

In industrial data acquisition systems, the $M$-dimensional sequences of equipment $Eq$ have a definite acquisition order, and they are recorded into $S_1, S_2, ..., S_M$ correspondingly. As unexpected problems cause inconsistency among multiple sensors in some time intervals, subsequences may be recorded into wrong dimensions during a period of time. That is, the $M$ sequences are not recorded into $S_1, S_2, ..., S_M$ in order in time interval $T_{[l:n]}$. Based on practical observation and investigation, inconsistency in time intervals presents different patterns. Here, we apply the permutation structure [7], [8] to describe the inconsistency pattern in Definition 6.

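Definition 6. (Permutation pattern). Given a time interval $T_{[l:n]}$ and the set of sequences with inconsistency problems, i.e., $S_{INC} = \{S_1, ..., S_m\}$, $m \in [2, M)$, a one-to-one mapping of $S_{INC}$ to itself is regarded as a permutation of $S_{INC}$, denoted by $\sigma: S_{INC} \to S_{INC}$, having

$$\sigma = \begin{pmatrix} S_i \\ S_i^{\sigma} \end{pmatrix} = \begin{pmatrix} S_1 & S_2 & \cdots & S_m \\ S_1^{\sigma} & S_2^{\sigma} & \cdots & S_m^{\sigma} \end{pmatrix}$$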

Example 2. Let HN110, HN111, HNC10, HN102, and HNC02 in Fig. 1 be sequences $S_1$, $S_2$, $S_3$, $S_4$, and $S_5$, respectively. According to Definition 6, the inconsistency problem in time interval [30000s, 40000s] can be formalized as

$$\sigma = \begin{pmatrix} S_1 & S_2 & S_3 & S_5 \\ S_2 & S_1 & S_5 & S_3 \end{pmatrix}$$

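Further, one permutation $\sigma$ may consist of smaller structures, e.g., the permutation between $S_1$ and $S_2$. We introduce the rotation pattern in Definition 7, which is indivisible and serves as the unambiguous minimum repair unit in our method. Definition 7 states that in an $m$-rotation $\sigma$, each element $\alpha_i$ in $S_{INC}$ is replaced by the next element $\alpha_{i+1}$, and the last element $\alpha_m$ is replaced by $\alpha_1$.

Definition 7. (Rotation pattern). $\sigma = \{\alpha_1, \alpha_2, ..., \alpha_m\}$ is a rotation pattern if

$$\sigma = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \alpha_1^{\sigma} & \alpha_2^{\sigma} & \cdots & \alpha_m^{\sigma} \end{pmatrix} = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \alpha_2 & \alpha_3 & \cdots & \alpha_1 \end{pmatrix}$$

Such an $m$-rotation pattern is denoted by $\sigma(\alpha_1, \alpha_2, ..., \alpha_m)$, where $m$ is the order of $\sigma$, i.e., the number of elements in $\sigma$. □

Let $\sigma_1(\alpha_1, \alpha_2, ..., \alpha_l)$ and $\sigma_2(\beta_1, \beta_2, ..., \beta_k)$ be two rotation patterns on $S_{INC}$ ($l + k \le m$). According to the properties of permutation groups [7], $\sigma_1$ and $\sigma_2$ are disjoint if $\{\alpha_1, \alpha_2, ..., \alpha_l\}$ and $\{\beta_1, \beta_2, ..., \beta_k\}$ have no element in common. Such a disjoint $l$-rotation $\sigma_1$ and $k$-rotation $\sigma_2$ are written as

$$\sigma_1 \cup \sigma_2 = (\alpha_1, \alpha_2, ..., \alpha_l)(\beta_1, \beta_2, ..., \beta_k) \tag{1}$$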
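To make the rotation pattern and the disjoint union in Equation (1) concrete, the sketch below represents a rotation as a tuple of sensor names and expands it into an explicit mapping. This is only an illustration; the function names `rotation_mapping` and `disjoint_union` are hypothetical, and representing a permutation as a Python dictionary is an assumption of this sketch.

```python
def rotation_mapping(elements):
    """Map each alpha_i to alpha_{i+1}, and the last element back to alpha_1."""
    if len(elements) < 2 or len(set(elements)) != len(elements):
        raise ValueError("a rotation needs at least two distinct elements")
    m = len(elements)
    return {elements[i]: elements[(i + 1) % m] for i in range(m)}

def disjoint_union(*rotations):
    """Combine rotations into one permutation; reject rotations that share elements."""
    mapping = {}
    for rot in rotations:
        rot_map = rotation_mapping(rot)
        if mapping.keys() & rot_map.keys():
            raise ValueError("rotation patterns must be disjoint")
        mapping.update(rot_map)
    return mapping

# The two 2-rotations of Example 2 reproduce the permutation given there.
example = disjoint_union(("S1", "S2"), ("S3", "S5"))
```

Here `example` maps S1 to S2, S2 to S1, S3 to S5, and S5 to S3, which is the bottom row of the permutation in Example 2.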

Now we apply rotation patterns to formalize an inconsistency instance in Definition 8.

Definition 8. (Inconsistency instance). Let $S_{INC}$ be the set of all inconsistent subsequences on $\mathcal{S}_{Eq}$. An inconsistency instance in a time interval $T_{[l:n]}$ is regarded as the union of a number of disjoint rotation patterns, describing the inconsistent patterns of $S_{INC}$. It has

$$\varphi(I) = \sigma_1 \cup \sigma_2 \cup \cdots \cup \sigma_k = \bigcup_{i=1}^{k} \sigma_i,$$

where $\sum_{i=1}^{k} o_i \le m$ ($k \ge 1$), and $o_i$ is the order of the $i$-th rotation pattern $\sigma_i$. $I = T_{[l:n]}$ is identified as an inconsistent time interval w.r.t. $\varphi$. □

Example 3. With the structure of rotation patterns, the inconsistency instance in Fig. 1 can be denoted as $\varphi(I) = \sigma_1 \cup \sigma_2 = (S_1, S_2)(S_3, S_5)$. $I = T_{[30000s:40000s]}$ is an inconsistent time interval. □

Fig. 3. Multivariate IIoT time series. [Diagram omitted: an equipment (Eq) is equipped with a sensor set S(Eq); Sensor 1 to Sensor M produce sequences $S_1$ to $S_M$, each with data points $s_1, s_2, s_3, ..., s_N$; a time interval $T_{[l:n]}$ and a sequence tuple $\mathcal{S}(t_i)$ are marked.]

It is worth noting that the properties of permutation groups guarantee the uniqueness of the pattern of each inconsistency instance in interval $I$, as shown in Theorem 1.

Theorem 1. Given an inconsistent time interval $I$, the inconsistency instance $\varphi$ in $I$ has a unique form as the product of disjoint rotation patterns, which covers all inconsistent subsequences. □
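Theorem 1 corresponds to the standard cycle decomposition of a permutation, which is unique up to the order of the rotations and the starting element of each rotation. A minimal sketch of this decomposition follows; the function name `decompose` and the dictionary representation of a permutation are assumptions made for illustration only.

```python
def decompose(perm):
    """Split a permutation (dict: element -> image) into disjoint rotation patterns.

    Elements mapped to themselves are consistent and are left out of the result.
    """
    if set(perm) != set(perm.values()):
        raise ValueError("mapping is not a permutation of its own keys")
    seen, rotations = set(), []
    for start in perm:
        if start in seen or perm[start] == start:
            continue
        cycle, current = [], start
        while current not in seen:
            seen.add(current)
            cycle.append(current)
            current = perm[current]
        rotations.append(tuple(cycle))
    return rotations

# Permutation of Example 2, with the unaffected sensor S4 mapped to itself.
sigma = {"S1": "S2", "S2": "S1", "S3": "S5", "S4": "S4", "S5": "S3"}
rotations = decompose(sigma)
```

For this input the result is the two rotations $(S_1, S_2)$ and $(S_3, S_5)$ of Example 3.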

Faced with inconsistency instances in time series from sensors, we aim to identify all inconsistent time intervals and repair all inconsistency instances correctly. We formalize the inconsistency repairing problem studied in this paper below. It consists of two tasks: the inconsistency detection problem and the inconsistency repair problem.

Problem 1. Given an $N$-length $M$-dimensional time series $\mathcal{S}$, the inconsistency detection problem on $\mathcal{S}$ is to find all $K$ inconsistent time intervals, denoted by $\mathcal{I} = \{I_1, ..., I_K\}$, which satisfy

(1) $\forall I_i \in \mathcal{I}$, $I_i$ is the maximal interval that covers one inconsistency instance $\varphi$; and

(2) $\forall i, j \in [1, K]$ with $i \ne j$, $I_i$ and $I_j$ are two independent time intervals, i.e., $I_i \cap I_j = \emptyset$.
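The two conditions of Problem 1 (maximality and pairwise disjointness) can be illustrated on a simplified input. The sketch below assumes that a per-time-point boolean flag is already available that marks whether the sequence tuple at that time point is inconsistent, and that adjacent flagged points belong to the same inconsistency instance; both are simplifying assumptions of this example, and `maximal_intervals` is a hypothetical name.

```python
def maximal_intervals(flags):
    """Group consecutive True flags into maximal, pairwise disjoint intervals.

    Returns a list of (start, end) index pairs, both inclusive and 0-based.
    """
    intervals, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                      # an interval opens here
        elif not flagged and start is not None:
            intervals.append((start, i - 1))  # the interval closed at i - 1
            start = None
    if start is not None:                  # interval still open at the end
        intervals.append((start, len(flags) - 1))
    return intervals
```

Because every returned interval is extended until the flag changes, no interval can be enlarged, and no two intervals share a time point.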

Problem 2. The inconsistency repair problem on $\mathcal{S}$ is to compute the repair pattern of each inconsistent interval $I_i$ by identifying the inconsistency instance in $I_i$, which satisfies: $\forall I_i \in \mathcal{I}$, $\varphi(I_i)$ covers all inconsistent subsequences denoted by the rotation patterns in $I_i$.

2.2 Method Framework

Figure 4 illustrates the framework of our proposed solution, which consists of three phases: inconsistency detection, matching evaluation, and repair determination.

The inconsistency detection phase (see Sec. 3.1) is the first step in our method, where sequence behavior models are constructed to distinguish abnormal subsequences in each sensor from normal sequences. Parametric models are applied in our method according to prior knowledge or normality learned from historical IIoT data. We detect anomalies in each sequence with a sliding window, and an inconsistency instance is considered to exist in an interval $T_{[l:n]}$ that contains a number of abnormal subsequences. Our inconsistency detection phase is open to most time series anomaly detection techniques, which will be presented in our experimental study.

In the matching evaluation phase (see Sec. 3.2), we compute possible repair schemas for all candidate inconsistent time intervals obtained from the previous step. In order to match inconsistent subsequences to their corresponding dimensions, we first construct a bipartite graph $G = (V_S, V_{\mathcal{M}}, E, W)$. Each abnormal dimension in $I_i$ is represented as a source node in $V_S(G)$, and the sequence models of the involved dimensions are treated as terminal nodes in $V_{\mathcal{M}}(G)$. We obtain repairing patterns with bipartite graph matching algorithms on $G$.

TABLE 1
List of frequent notations

| Symbol | Description |
|---|---|
| $s$ | a data point in time series |
| $S$ | the sequence generated by sensor $S$ |
| $S_{[l,n]}$ | subsequence beginning from $s_l$ and ending in $s_n$ |
| $\mathcal{S}$ | ($M$-dimensional) time series |
| $\mathcal{S}(t_i)$ | the sequence set of all data points at $t_i$ |
| $S_{INC}$ | the set of inconsistent attributes |
| $t_i$ | a time point |
| $T_{[l:n]}$ | a time interval from $t_l$ to $t_n$ |
| $I$ | an inconsistent time interval |
| $\mathcal{I}$ | the set of inconsistent time intervals on $\mathcal{S}$ |
| $\sigma$ | a rotation repair |
| $\varphi(I)$ | the inconsistency instance on $I$ |
| $\Phi(\mathcal{S})$ | the set of candidate repair schemas for $\mathcal{S}$ |
| $\Phi_r(\mathcal{S})$ | the set of final repair schemas for $\mathcal{S}$ |
| $R$ | the repair unit of $\sigma$ |
| $\mathcal{R}$ | the set of all repair units |
| $B(\sigma)$ | the boolean sequence for $\sigma$ |
| $B_1$ | a length of subsequence with all 1 elements |

Repair determination is the most important phase in our proposed method (see Sec. 4). In this phase, we precisely locate each inconsistent time interval with start and end time points, and provide accurate repair solutions. We propose algorithms to identify real inconsistent time intervals and provide final repair patterns. We apply the disjoint-set structure to obtain reliable repairs and effectively reduce false positives and false negatives.

We summarize the notations frequently used in this paper in Table 1.

3 INCONSISTENCY BEHAVIOR DETECTION

In this section, we first outline how we detect anomalies in sequences in Sec. 3.1, and then discuss how to compute candidate repair patterns in Sec. 3.2.


Fig. 4. Method framework overview

3.1 Abnormal subsequences modelling and detection

Abnormal subsequence detection is of crucial importance to high-quality repairing solutions: the accurate identification of abnormal subsequences contributes to the high performance of repairing methods. We first construct a behavior detection model for each sequence in $\mathcal{S}$. Sequence models are considered as prior knowledge provided by the equipment instructions, and can also be learned from historical data or labelled sample data. We use a 2-tuple function $\gamma:(\mathcal{F}, \mathcal{T})$ to describe time series modelling metrics. Here, $\mathcal{F}$ is the set of metric functions, including statistical variables (e.g., mean and variance), subsequence distance metrics, and feature vectors of a sequence $S$ in the duration of a working condition. We formalize the sequence behavior model in Definition 9.

Definition 9. (Sequence Behavior Model). Given an $M$-dimensional $\mathcal{S} = \{S_1, ..., S_M\}$, the normal behavior of the $i$-th dimension sequence $S_i$ is modelled by $S_i \sim \mathcal{M}(S_i, \gamma)$, where $\gamma:(\mathcal{F}, \mathcal{T})$ is a 2-tuple function for $S_i$.

Accordingly, we present the basic assumption in Proposition 1: any segment of a normal subsequence of $S$ should satisfy the model $\mathcal{M}(S, \gamma)$, and the conditional probability $p(s_{n+1} \models \mathcal{M}(S, \gamma) \mid s_n, ..., s_l)$ should be no smaller than a support threshold $\theta_{\mathcal{M}}$.

Proposition 1. Given a model support threshold $\theta_{\mathcal{M}}$, if we have $S \sim \mathcal{M}(S, \gamma)$, then the following conditions are true:

(1) $\forall\, 1 \le l \le n \le N$, $S_{[l,n]} \models \mathcal{M}(S, \gamma)$, and

(2) $p(s_{n+1} \models \mathcal{M}(S, \gamma) \mid s_n, ..., s_l) \ge \theta_{\mathcal{M}}$,

where $S_{[l,n]}$ is a subsequence of $S$, and $s_i$ is a data point in $S_{[l,n]}$, $i \in [l, n+1]$. $\cdot \models \mathcal{M}$ denotes that the subsequence conforms to model $\mathcal{M}$. □

Now we are able to detect unexpected values in a sequence and further discover latent inconsistent intervals according to Proposition 1. We detect abnormal data in each sensor sequence in $\mathcal{S}$ independently, where a subsequence $S_{[l,n]}$ with $n - l + 1$ consecutive data points in $S$ is taken as a sliding window to determine whether there exists an anomaly at the next position, i.e., $s_{n+1}$. Data point $s_{n+1}$ is recognized as abnormal when we detect $p(s_{n+1} \models \mathcal{M}(S, \gamma) \mid s_n, ..., s_l) < \theta_{\mathcal{M}}$. Further, a time interval $T_{[l:n]}$ is possibly inconsistent when it contains some abnormal subsequences. We compute candidate repair results for $T_{[l:n]}$ with Algorithm 1 below.
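The sliding-window test described above can be sketched for a single sequence as follows. The paper leaves the model $\mathcal{M}(S, \gamma)$ open, so this example assumes a simple Gaussian model fitted on the window and uses the two-sided tail probability as a stand-in for $p(s_{n+1} \models \mathcal{M}(S, \gamma) \mid s_n, ..., s_l)$. The names `support` and `detect`, the window size `w`, and this particular probability are all assumptions of the sketch, not the authors' model.

```python
import math
from statistics import mean, pstdev

def support(value, window):
    """Two-sided Gaussian tail probability of `value` under the window's mean and std."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0.0:
        # Degenerate window: only an identical value is considered conforming.
        return 1.0 if value == mu else 0.0
    z = abs(value - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))

def detect(values, w, theta):
    """Return the indices whose value has support below theta given the previous w points."""
    abnormal = []
    for n in range(w, len(values)):
        if support(values[n], values[n - w:n]) < theta:
            abnormal.append(n)
    return abnormal

flagged = detect([10.0, 10.2, 9.8, 10.1, 9.9, 50.0, 10.0], w=5, theta=0.01)
```

Note that this sketch keeps the raw abnormal value in later windows; Algorithm 1 instead updates the model with repaired data before the window moves on.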

3.2 Candidate Repairing Schemas

For the set $S_{INC} = \{S_1, ..., S_m\}$ ($m \in [2, M]$) in interval $I$, $I$ is likely to contain $m$ inconsistent subsequences. Our task is to match each inconsistent subsequence to the correct sensor sequence. In general, we need to find $m$ one-to-one mappings between a subsequence and its correct sequence. We transform this matching problem into perfect matching on a bipartite graph, which we construct according to Definition 10.

Definition 10. (Bipartite graph construction). Given the set $S_{INC}$ at time $t$, let $\mathcal{M}(S_i)$ be the sequence model of the $i$-th element in $S_{INC}$. $G = (V_S, V_{\mathcal{M}}, E, W)$ is a directed bipartite graph of $S_{INC}$, where each element in $S_{INC}$ is treated as a source node, i.e., $V_S = \{u_i \mid u_i \in S_{INC}, i \in [1, m]\}$, and the terminal nodes are the set of these $m$ sequence models, denoted by $V_{\mathcal{M}} = \{\mathcal{M}(S_1), ..., \mathcal{M}(S_m)\}$. An edge $e(u, v)$ describes a matching function from a subsequence to a sequence model, $f: u \to \mathcal{M}(v)$, and the edge weight $w(e) = p(u \models \mathcal{M}(v))$ represents the matching probability of $e(u, v)$. □
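The graph of Definition 10 can be sketched as a weighted edge dictionary. The example below assumes that each sequence model is summarized by a mean and a standard deviation and that the matching probability is a two-sided Gaussian tail probability; both the model form and the names `gaussian_support` and `build_bipartite` are illustrative assumptions, since the paper does not fix them.

```python
import math

def gaussian_support(value, mu, sigma):
    """Assumed matching probability: two-sided Gaussian tail probability (sigma > 0)."""
    return math.erfc(abs(value - mu) / (sigma * math.sqrt(2.0)))

def build_bipartite(values, models):
    """Build G = (V_S, V_M, E, W) as (sources, terminals, weighted edges).

    values: {sensor name: abnormal value recorded in that dimension}
    models: {sensor name: (mean, std) of the sensor's normal behavior}
    """
    sources = list(values)
    terminals = list(models)
    edges = {(u, v): gaussian_support(values[u], *models[v])
             for u in sources for v in terminals}
    return sources, terminals, edges

# Two swapped dimensions: the value recorded in S1 fits the model of S2 and vice versa.
sources, terminals, edges = build_bipartite(
    {"S1": 300.0, "S2": 150.0},
    {"S1": (150.0, 10.0), "S2": (300.0, 10.0)},
)
```

In this toy case the cross edges (S1, S2) and (S2, S1) receive the highest weights, which is the signal the matching step exploits.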

After the construction of $G$, the repairing problem is transformed into discovering an optimal matching on $G$. We consider two matching strategies to obtain candidate repair patterns. We first introduce an exact maximum weight matching solution. We then discuss a simple and fast greedy-based method, considering the balance between matching efficiency and effectiveness. Intuitively, an inconsistent subsequence $S_{i.[l,n]}$ is recognized as belonging to only one sequence $S_j$, with a quite high matching probability $p(S_{i.[l,n]} \models \mathcal{M}(S_j))$.

Maximum weight matching. For our repairing problem, we need to find a maximum matching [9] on $G$, which has $m$ one-to-one mappings between $V_S(G)$ and $V_{\mathcal{M}}(G)$. That is, we compute high-quality matching results with both maximum weights and maximum matching on $G$. We introduce the maximum cost maximum flow (MCMF) algorithm [9] to compute the matching patterns. Accordingly, we add a global source node $s$ and a terminal node $t$ to $G$. $s$ points to all 0-in-degree nodes $V_S(G)$, and all 0-out-degree nodes $V_{\mathcal{M}}(G)$ point to $t$. An edge weight $w(u, v)$ represents the cost of a flow $u \to v$, i.e., the matching from $s_i$ to $\mathcal{M}(S_j)$. We obtain candidate repair patterns $\varphi$ by discovering a maximum matching on $G$ as follows.

$$\begin{aligned} \max\ & \sum_{(u,v)\in E} cost(u, v) \cdot flow(u, v) \\ \text{s.t.}\ & \sum_{(u,v)\in E} flow(u, v) - \sum_{(u,v)\in E} flow(v, u) = f(u), \\ & 0 \le flow(u, v) \le 1, \end{aligned} \tag{2}$$

where a feasible flow satisfies $\sum f(u) = 0$, and $cost(u, v) = p(u \models \mathcal{M}(v, \gamma))$.

Since the MCMF algorithm prioritizes the maximum matching over the maximum sum of weights, we can always achieve a one-to-one mapping from $V_S(G)$ to $V_{\mathcal{M}}(G)$.
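For small $m$, the result of the exact strategy can be illustrated without a flow solver. The sketch below is not the MCMF algorithm; it enumerates all one-to-one assignments and keeps the one with the largest total weight, which yields the same optimum on a complete bipartite graph but costs $O(m!)$ time. The function name `max_weight_matching` and the matrix layout (`weights[i][j]` is the weight of edge $u_i \to v_j$) are assumptions of this example.

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exhaustively find the perfect matching with maximum total weight.

    Returns (assignment, total), where source i is matched to model assignment[i].
    """
    m = len(weights)
    best, best_total = None, float("-inf")
    for assign in permutations(range(m)):
        total = sum(weights[i][assign[i]] for i in range(m))
        if total > best_total:
            best, best_total = assign, total
    return best, best_total

# Source 0 fits model 1 best and source 1 fits model 0 best: a 2-rotation.
assignment, total = max_weight_matching([[0.1, 0.9],
                                         [0.8, 0.2]])
```

A production implementation would replace the enumeration with MCMF or another polynomial-time assignment algorithm; the enumeration is used here only because it is short and easy to verify.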

Greedy-based matching. Intuitively, one inconsistent subsequence $S_{i.[l,n]}$ is recognized as belonging to only one sequence $S_j$ with a quite high matching probability. In this case, we design a heuristic greedy-based matching approach to achieve a fast matching on the graph. When we make a match on $G$, we iteratively select the edge $u_i \to v_j$ which has the maximum edge weight computed by Equation (3), and add this match $\varphi_{current}$ to the result set. We then temporarily delete $u_i$ and $v_j$ from $G$. The matching process terminates when all nodes in $V_S$ have been matched to $V_{\mathcal{M}}$.

$$\varphi_{current} = \arg\max_{i,j \in [1,m]} w(u_i, v_j) = \arg\max_{s_i, s_j \in A} p(s_i \models \mathcal{M}(S_j, \gamma)) \tag{3}$$
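The greedy procedure around Equation (3) can be sketched directly on a weight matrix. As before, `weights[i][j]` stands for $w(u_i, v_j)$; the function name `greedy_matching` and the matrix layout are assumptions of this example.

```python
def greedy_matching(weights):
    """Repeatedly pick the heaviest remaining edge and remove both of its endpoints.

    Returns a dict mapping each source index i to its matched model index j.
    """
    m = len(weights)
    free_sources, free_models = set(range(m)), set(range(m))
    match = {}
    while free_sources:
        i, j = max(((a, b) for a in free_sources for b in free_models),
                   key=lambda edge: weights[edge[0]][edge[1]])
        match[i] = j
        free_sources.remove(i)
        free_models.remove(j)
    return match

# Greedy takes the 0.9 edge first and is then forced to accept the 0.1 edge,
# for a total of 1.0, although the other perfect matching has total 1.65.
greedy = greedy_matching([[0.9, 0.8],
                          [0.85, 0.1]])
```

The example shows the trade-off mentioned above: the greedy strategy is fast, but it is a heuristic and may return a matching with a smaller total weight than the exact strategy.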

Algorithm 1 shows the process of computing candidate repairing schemas, which mainly consists of three phases: detecting abnormal behaviors (Lines 3-5), matching inconsistent subsequences (Lines 6-15), and updating sequence models with repaired data (Lines 18-20).

We first initialize a candidate repair set $cand(\mathcal{S}(t_i))$ for each sequence tuple $\mathcal{S}(t_i)$, and maintain an array $A$ recording inconsistent data values at time point $t_i$. For the anomaly detection phase, the $w$-length set of sequence tuples $\{\mathcal{S}(t_{i-1}), ..., \mathcal{S}(t_{i-w})\}$ ahead of $\mathcal{S}(t_i)$ serves as the sliding window to detect the model behavior of $\mathcal{S}(t_i)$. As each inconsistency instance happens in multiple sequences at the same time, we run the detection simultaneously and independently in all $M$ dimensions within sequence tuple $\mathcal{S}(t_i)$. For each sensor dimension $S_j$, we compute the probability that the current data point conforms to this dimension according to the model $\mathcal{M}(S_j, \gamma)$ in Definition 9. We insert the unexpected data point $s_{ij}$ into the inconsistency list $A$ if the probability $p(s_{ij} \models \mathcal{M}(S_j, \gamma))$ is smaller than a given threshold $\theta_{\mathcal{M}}$. This reveals that $s_{ij}$ is not expected to be recorded in sequence $S_j$.

Algorithm 1: Compute Candidate Repairing Schemas

Input: an N-length M-dimensional time series S, models for sequences in S: M(S, γ), model support threshold θM, a size number threshold ε
Output: a set of candidate repair schemas on S: Φ(S)

 1  foreach ti ∈ T do
 2      initialize cand(S(ti)) ← S(ti) and A ← [];
 3      for j from 1 to M do
 4          if p(sij |= M(Sj, γ)) < θM then
 5              A ← A ∪ {j};
 6      if Size(A) is no smaller than 2 then
 7          initialize a matrix A = |A| × |A|;
 8          for n from 1 to |A| do
 9              for m from 1 to |A| do
10                  Anm ← p(s_{i,Am} |= M(S_{An}));
11          construct G according to Anm;
12          ϕ ← matching result on G;
13          if Size(ϕ) ≤ ε then
14              ϕi ← accept ϕ as a candidate matching schema of S(ti);
15              repair cand(S(ti)) with ϕi;
16          else
17              return S(ti) and ϕ for manual processing;
18      foreach s′j ∈ cand(S(ti)) do
19          merge s′j to M(Sj, γ) and update γ;
20          move the sliding window to [s′j, sj−1, ..., sj−w+1];
21  return Φ(S) ← {ϕi | i ∈ [1, N]};

With the discovered abnormal data points, S(ti) is possibly inconsistent if there exist several abnormal data points in S(ti), i.e., the size of set A = {si1, ..., sin} is larger than 1. In this case, we construct a square matrix A for S(ti), where the number of rows (resp. columns) is equal to the number of elements in the list A (Lines 8-10). For each inconsistent data point si in A, we compute the probability of modelling si to each sequence involved in A and record p(sim |= M(Sn)) in the corresponding element of the matrix.

Since we aim to match inconsistent subsequences to the correct dimensions with maximum likelihood, we construct G according to the matching probability matrix Anm, and obtain a match result ϕ between VS and VM (Lines 11-12). We check the total number of elements in ϕ, i.e., Size(ϕ), and accept ϕ as a candidate repair schema ϕi for sequence tuple S(ti) if Size(ϕ) is no larger than a given threshold ε. When Size(ϕ) > ε, we abandon this matching schema and return S(ti) for manual inspection. This is because a larger Size(ϕ) possibly reveals complex anomalies or faults in the equipment sensor group, rather than inconsistency problems. Such data are returned to the monitoring system engineers, and we expect to obtain reliable decisions and repairing results for these complex cases with knowledge engineering methods from domain experts.

After we have accepted ϕi and matched inconsistent data to the correct sequences, we insert the repaired data values into the sequence model and update γ, in order to improve the accuracy of anomaly detection on the following sequence data with correct parameters. This step ensures that the statistical metric values are not affected by abnormal data values. We then successively move the sliding window and process the next tuple S(ti+1) in S with the above steps (Lines 18-20). Algorithm 1 finishes after we process all time points and obtain the candidate repair schemas Φ(S) for the N-length time series S.
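The repair-then-update step can be sketched with a bounded queue per dimension. This is an illustrative sketch only: the schema encoding `{i: j}` (the value recorded in sequence i belongs to sequence j) and the function names are assumptions, and refitting the model parameters γ is left out.

```python
from collections import deque


def apply_schema(tuple_values, schema):
    """Move each misplaced value back to the sequence it belongs to."""
    repaired = list(tuple_values)
    for src, dst in schema.items():
        repaired[dst] = tuple_values[src]
    return repaired


def slide_windows(windows, repaired):
    """Append the repaired values; deque(maxlen=w) drops the oldest one."""
    for window, value in zip(windows, repaired):
        window.append(value)


# Hypothetical windows of length w = 2 for three sensors.
windows = [
    deque([9.9, 10.1], maxlen=2),
    deque([50.2, 49.8], maxlen=2),
    deque([1.1, 0.9], maxlen=2),
]
repaired = apply_schema([50.0, 10.0, 1.0], {0: 1, 1: 0})
slide_windows(windows, repaired)
```

Because the windows receive the repaired values rather than the raw ones, the swapped readings 50.0 and 10.0 do not distort the statistics used for the next tuple.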

4 REPAIRING SOLUTION

To achieve an accurate and reliable repair of all inconsistency instances, we need to determine inconsistency intervals effectively. Accordingly, we design a step that determines the final repair patterns with two main tasks: i) to locate both the start and end timestamps of an inconsistent interval, and ii) to repair each inconsistent interval with reliable schemas. However, neither task can be completely solved by Algorithm 1, for the reasons discussed below.

For the former task, we need to further evaluate and merge the candidate schemas on sequence tuples to accurately locate inconsistency intervals. Note again that an inconsistency instance always lasts for a duration, rather than happening at several discrete time points. Thus, a reliable repair solution for an inconsistent interval should cover all data points within the interval. Since sequence behavior modelling is performed over a sliding window (sn+1|sn, ..., sl) in Algorithm 1, it cannot always provide uniform and accurate repair schemas for a whole inconsistency interval.

For the latter, continuous high-quality repair schemas are difficult to obtain from matching pattern evaluation in inconsistent industrial data. On the one hand, abnormal behaviors are not easily detected and distinguished from normal data for a sequence tuple in industrial time series. If the algorithms fail to precisely find the set SINC for sequence tuple S(ti) (see Lines 4-5 in Algorithm 1), we will consequently obtain wrong matching results from the incorrect set SINC. On the other hand, bipartite graph matching algorithms may run into partial mismatches in some sequence tuples. Both cases increase the number of false positives or false negatives, and further result in a poor repair of S.

To achieve an accurate and robust inconsistency repairing result, we propose a repairing schemas determination algorithm (DRS) to precisely locate inconsistency intervals and then effectively repair inconsistent subsequences. We first introduce the DRS algorithm in Sec. 4.1, and then discuss how to determine inconsistent intervals both effectively and efficiently in Sec. 4.2.

4.1 Determining Repairing Schemas

As discussed in Sec. 2.1, an inconsistency instance contains at least one disjoint rotation pattern. Each rotation pattern is both indivisible and unambiguous. In order to effectively detect inconsistent intervals and identify inconsistency patterns, we introduce the repair unit in Definition 11, which serves as the minimum processing unit in our method.

Definition 11. A repair unit is a triple associated with a rotation pattern σ, denoted by R: [σ, T, Size(T)]. T = {T[l1:n1], T[l2:n2], ...} is the set of time intervals which are detected to be repaired by σ, and Size(T) is the total number of time points in set T.

Accordingly, a candidate repair schema ϕ can be divided into several disjoint rotation patterns, i.e., σ1, σ2, · · ·. We create and maintain the repair unit R of each rotation σ to evaluate all candidate inconsistent time intervals and determine the final repair schemas on them. The repair schemas determination is outlined in Algorithm 2, which consists of two steps: i) updating the set of repair units according to all divided σs (Lines 2-7), and ii) repairing subsequences in all inconsistent intervals (Lines 10-15).

We first enumerate each candidate repair schema ϕi from Φ(S), and divide ϕi into rotation patterns according to Theorem 1. We create a repair unit for each σ and record the locations of the time intervals that Algorithm 1 computes to be repaired by σ, as well as the total length Size(T) of these intervals (Lines 4-6). After we obtain all repair units R from Φ(S), we sort them in descending order of Size(T) and abandon those units that are used in candidate inconsistent intervals with a low frequency (Lines 8-9).
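The construction of repair units can be sketched as follows. Theorem 1 is not reproduced in this section, so the example assumes that a candidate schema is a permutation encoded as `{i: j}` and that its rotation patterns are the disjoint cycles of that permutation; the function names and the per-time-point input layout are also assumptions.

```python
from collections import defaultdict


def rotation_patterns(schema):
    """Split a permutation {i: j} into disjoint cycles.

    Each cycle is a tuple starting at its smallest index; fixed points
    (i mapped to itself) are skipped.
    """
    seen, cycles = set(), []
    for start in sorted(schema):
        if start in seen or schema[start] == start:
            continue
        cycle, cur = [], start
        while cur not in seen:
            seen.add(cur)
            cycle.append(cur)
            cur = schema[cur]
        cycles.append(tuple(cycle))
    return cycles


def build_repair_units(candidates, len1):
    """Collect the time points per rotation pattern, keep Size(T) >= len1,
    and sort the units by Size(T) in descending order."""
    units = defaultdict(list)
    for t in sorted(candidates):
        for sigma in rotation_patterns(candidates[t]):
            units[sigma].append(t)
    kept = [(sigma, ts) for sigma, ts in units.items() if len(ts) >= len1]
    kept.sort(key=lambda unit: len(unit[1]), reverse=True)
    return kept


# Hypothetical candidate schemas at four time points.
candidates = {
    0: {1: 2, 2: 1},
    1: {1: 2, 2: 1, 4: 5, 5: 6, 6: 4},
    2: {1: 2, 2: 1},
    3: {4: 5, 5: 6, 6: 4},
}
units = build_repair_units(candidates, len1=2)
```

In this example the two-sensor rotation (1, 2) is supported by three time points and the three-sensor rotation (4, 5, 6) by two, so both survive a threshold of 2 and the more frequent one is processed first.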

With the selected repair unit set R′ in Line 9, we further determine the accurate location of the inconsistent intervals that contain rotation pattern σ by running algorithm IIE(Φ(S), σ) (see Algorithm 3 below). After that, we enumerate each independent interval I from I(σ), in which we combine all accepted rotation patterns into a final integrated repair schema ϕr and make the final repair of S. After all repair units are processed, Algorithm 2 finishes and returns the high-quality time series Sr along with all repairing schemas Φr(S).

4.2 Inconsistency Intervals Evaluation

We now introduce how to detect the accurate location of inconsistency intervals. As discussed above, we enumerate and evaluate the repair unit of each rotation pattern σ (Line 10 in Algorithm 2). During the process, we need to label which time intervals contain inconsistency pattern σ and which do not. We propose a boolean sequence B for rotation pattern σ in Definition 12, which helps to identify and extract inconsistent intervals.

Definition 12. (Boolean sequence of σ). Given an N-length M-dimensional S, B(σ) = 〈b1, ..., bN〉 is a boolean sequence w.r.t. rotation pattern σ, where bi (i ∈ [1, N]) is a binary value assigned according to σ as follows,

$$b_i = \begin{cases} 1, & \sigma \text{ exists in } t_i \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where ti is the i-th time point of S. B(σ) has the same length as S, i.e., Size(B(σ)) = Size(S) = N.

Algorithm 2: Determining Repair Schemas

Input: the candidate Φ(S), schema applied interval length threshold: len1
Output: Φr(S)

 1  Initialize R ← [];
 2  foreach ϕi ∈ Φ(S) do
 3      foreach σ ∈ ϕi do
 4          if σ does not exist in R then
 5              create a triple R : [σ, T, Size(T)] for σ;
 6              R ← R ∪ {R};
 7          update Size(T) in R;
 8  Sort all repair triples in R in descending order of Size(T);
 9  R′ ← select repair triples by Size(T) ≥ len1;
10  foreach R ∈ R′ do
11      I(σ) ← IIE(Φ(S), σ); // see Algorithm 3.
12      foreach I ∈ I(σ) do
13          repair I with σ and update Sr with repaired I;
14          ϕr(I) ← ϕr(I) ∪ σ;
15      Φr(S) ← Φr(S) ∪ R;
16  return Sr and Φr(S);

From the above, element bi = 1 in B(σ) represents that rotation σ is adopted at time point ti, while 0 means that σ is not adopted at ti or no inconsistency happens at ti. An intuitive observation is that either 0s or 1s in B(σ) tend to appear continuously and make up a time interval. We denote a subsequence consisting only of 0s (resp. 1s) as a 0-sequence block, a.k.a. B0 (resp. 1-sequence block B1). Accordingly, B(σ) is an alternating sequence of B0 and B1 blocks, denoted by B(σ) = {..., B^0_i, B^1_{i+1}, B^0_{i+2}, ...}.
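The boolean sequence and its decomposition into alternating blocks can be sketched as follows. This is a minimal illustration with 0-based time indices; the function names and the `(value, start, length)` block layout are assumptions, and plain run-length grouping is used here instead of a disjoint-set structure.

```python
from itertools import groupby


def boolean_sequence(n, time_points):
    """b_i = 1 if the rotation pattern is adopted at time point i, else 0."""
    marked = set(time_points)
    return [1 if i in marked else 0 for i in range(n)]


def blocks(bits):
    """Group equal neighbours into (value, start, length) blocks."""
    out, start = [], 0
    for value, run in groupby(bits):
        length = len(list(run))
        out.append((value, start, length))
        start += length
    return out


# Hypothetical time points at which one rotation pattern was adopted.
bits = boolean_sequence(8, [2, 3, 4, 6])
runs = blocks(bits)
```

The isolated 1 at index 6 and the single 0 at index 5 are the kind of short blocks that may be falsely recorded values.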

It is easy to discover a subsequence B1 when the element 1 lasts continuously and without interruption for a number of time points in B(σ). However, things are not so simple when 1s and 0s are intertwined over a period of time. In that case, we can conclude that there exist falsely recorded 1s or 0s, because an inconsistency instance always lasts for a duration, rather than happening within a very short period of time. Such cases include two false patterns: i) bi is a false positive (FP), where a normal pattern is falsely detected as inconsistent and repaired by σ, or ii) bi is a false negative (FN), where an inconsistency is falsely identified as normal.

Faced with both problems, we consider a metric τ to measure whether a B1 should be merged into its neighbor B0 or not. In order to identify all real B1s, i.e., the real inconsistent intervals with rotation σ, we evaluate all B0s and B1s with Equation (5).

$$\tau = \frac{|B_{i+1}|}{|B_i| + |B_{i+2}|} \qquad (5)$$

where Bi, Bi+1, Bi+2 are three consecutive subsequences in B(σ), and |Bi| is the length of Bi.
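As a worked instance of Equation (5), consider hypothetical block lengths |B_i| = 40, |B_{i+1}| = 3 and |B_{i+2}| = 50, and a second case in which the middle block has length 30. The threshold value 0.1 is an assumed setting used only for this illustration.

$$\tau_1 = \frac{3}{40 + 50} \approx 0.033 < 0.1, \qquad \tau_2 = \frac{30}{40 + 50} \approx 0.333 > 0.1$$

In the first case the middle block is too short relative to its neighbours and would be absorbed by them; in the second case it is long enough to keep its own boolean value.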

Algorithm 3: Inconsistent Intervals Evaluation

Input: the candidate Φ(S), σ, θτ, minimum inconsistency interval length threshold: len2
Output: I(σ): the set of inconsistent intervals repaired by σ

 1  Initialize boolean sequence B = 〈b1, ..., bN〉 and a disjoint set D with dk.root ← k;
 2  foreach ϕ ∈ Φ(S) do
 3      if σ(α1, ..., αn) ∈ ϕ and ∀αi ∈ σ has not been repaired then
 4          bk ← 1;
 5      else
 6          bk ← 0;
 7  foreach bj ∈ B (j ≥ 2) do
 8      if bj = bj−1 then
 9          D.UNION(bj, bj−1);
10  B ← {dx | dx is the current independent element in D};
11  foreach B*i ∈ B do
12      for Bool ∈ {1, 0} do
13          if the boolean value of B*i equals Bool and τ < θτ then
14              label each element in B*i+1 with Bool;
15              D.UNION(B*i, B*i+1), D.UNION(B*i+1, B*i+2);
16  foreach dx ∈ D do
17      if bdx = 1 and Size(dx) ≥ len2 then
18          I(σ) ← I(σ) ∪ dx.T;
19  return I(σ);

Now we present the inconsistent intervals evaluation process in Algorithm 3. We evaluate each ϕ with the rotation patterns involved in ϕ, and generate the boolean sequence of each σ according to Definition 12 (Lines 3-6). We then begin to detect inconsistent intervals by determining all real B1s with their start and end time points from B(σ). For efficiency, we use the Disjoint Set structure [9] to gather each subsequence B1 (resp. B0) of elements 1 (resp. 0) in Lines 7-9, and further, we decide whether a B0 should be merged into its neighbour B1 or vice versa (Lines 11-15).

After B(σ) is kept as {..., B^0_i, B^1_{i+1}, B^0_{i+2}, ...} with the disjoint structure D, we copy B(σ) to a set B and make further modifications on B to avoid breaking the original structure in B(σ). In the loop in Lines 11-15, we enumerate each element B*i from set B, and compute τ according to Equation (5). We merge the current B*i with its neighbor blocks using the union function of D if τ is smaller than a given threshold θτ (Lines 14-15). This means that the size of B*i+1 is too small to support its boolean value, and the real value of B*i+1 should be replaced by that of its neighbor blocks, i.e., B*i and B*i+2. After the whole union process, we enumerate the updated 1-sequence blocks. If the length of a block is no smaller than len2, i.e., it is long enough to be identified as an inconsistency instance, this block is inserted into the inconsistent intervals set I(σ) (Lines 17-18). Algorithm 3 finishes when all sequence blocks in B(σ) have been processed.
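The merging and extraction steps can be sketched on a plain list of blocks. This is not the disjoint-set implementation described above: the example swaps in run-length blocks for readability, and it repeats the merge until no block qualifies, which is a design choice of this sketch. The function names and the example thresholds are assumptions.

```python
from itertools import groupby


def runs(bits):
    """Run-length encode a 0/1 list into [value, length] blocks."""
    return [[value, len(list(group))] for value, group in groupby(bits)]


def smooth(bits, theta_tau):
    """Absorb every interior block whose tau is below theta_tau.

    tau = |B_{i+1}| / (|B_i| + |B_{i+2}|) for the middle block B_{i+1}.
    """
    blocks = runs(bits)
    changed = True
    while changed:
        changed = False
        for k in range(1, len(blocks) - 1):
            left, mid, right = blocks[k - 1], blocks[k], blocks[k + 1]
            if mid[1] / (left[1] + right[1]) < theta_tau:
                # The two neighbours share one value, so the three blocks
                # collapse into a single block carrying that value.
                blocks[k - 1:k + 2] = [[left[0], left[1] + mid[1] + right[1]]]
                changed = True
                break
    return blocks


def inconsistent_intervals(bits, theta_tau, len2):
    """Return (start, end) pairs, end inclusive, of 1-blocks of length >= len2."""
    intervals, start = [], 0
    for value, length in smooth(bits, theta_tau):
        if value == 1 and length >= len2:
            intervals.append((start, start + length - 1))
        start += length
    return intervals


# Hypothetical boolean sequence with one spurious 0 and one spurious 1.
bits = [0] * 5 + [1] * 10 + [0] + [1] * 10 + [0] * 5 + [1] + [0] * 5
merged = smooth(bits, theta_tau=0.15)
found = inconsistent_intervals(bits, theta_tau=0.15, len2=5)
```

The single 0 inside the long run of 1s is treated as a false negative and the isolated 1 near the end as a false positive, leaving one inconsistent interval covering indices 5 to 25.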

5 EXPERIMENTAL STUDY

We now present the experimental study of the proposed methods. All experiments run on a computer with a 3.40 GHz Core i7 CPU and 32GB RAM.


TABLE 2
Summary of datasets

| Dataset | #Sensors | #Modelling | #Detection |
|---|---|---|---|
| FPP-sys | 48 | 1050K for 5 months | 50K for 5 days |
| WPP-sys | 150 | 1620K for 5 months | 75K for 7 days |

5.1 Experimental Settings

Data source. We conduct our experiments on real-life industrial equipment monitoring data collected from two large-scale power stations. Details are shown in Table 2.

(1) The FPP-sys dataset describes five main components of one induced draft fan equipment with 48 attributes from a large-scale fossil-fuel power plant. Data on more than 1050K historical time points over 5 consecutive months are used to build our sequence behavior model. We report our experimental results of repairing inconsistency on 50K time points.

(2) The WPP-sys dataset has 150 attributes describing the working condition of fan-machine groups from a wind power plant. It collects data every 8 seconds; 1620K time points are used in the modelling process, and we detect and repair inconsistency on 75K time points in the experiments.

Implementation. We have developed Cleanits, a data cleaning system for industrial time series, in our previous work [5], where three IoT data cleaning and repairing functions are implemented under real industrial scenarios.

We implement all algorithms of the proposed method in this paper, named ISR. Besides, we implement another three algorithms for comparative evaluation:

• G-ISR uses the greedy-based algorithm for bipartite graph matching in Algorithm 1, with the other steps the same as ISR;

• CRS only executes Algorithm 1 with maximum weight matching and outputs Φ(S) and cand(S) as the final repair result;

• λ-Block blocks the N-length S into short intervals, and treats the data of each interval as a whole in the behavior modelling process. The following steps are the same as ISR. The appropriate length of the blocked intervals is λ·√N, λ ∈ [1, 10]. We report the experimental result with λ = 1 as the best performance of the λ-Block method.
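The blocking used by the λ-Block baseline can be sketched as follows. Rounding λ·√N to the nearest integer and letting the last block be shorter are assumptions of this example, as is the function name `block_boundaries`.

```python
import math


def block_boundaries(n, lam=1):
    """Split indices 0..n-1 into half-open blocks of length about lam * sqrt(n)."""
    size = max(1, round(lam * math.sqrt(n)))
    return [(start, min(start + size, n)) for start in range(0, n, size)]


small = block_boundaries(10)       # block length round(sqrt(10)) = 3
regular = block_boundaries(100)    # block length 10
```

For a series of 45K time points and λ = 1, this would give blocks of roughly 212 points each.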

TABLE 3
Algorithms comparison on two datasets

| Method | FPP-sys (45K) Pr | FPP-sys (45K) Rr | FPP-sys (45K) Time (s) | WPP-sys (60K) Pr | WPP-sys (60K) Rr | WPP-sys (60K) Time (s) |
|---|---|---|---|---|---|---|
| ISR | 0.788 | 0.852 | 147.93 | 0.782 | 0.877 | 163.45 |
| G-ISR | 0.542 | 0.592 | 132.67 | 0.612 | 0.650 | 138.56 |
| CRS | 0.595 | 0.737 | 147.1 | 0.715 | 0.720 | 161.25 |
| Block | 0.17 | 0.517 | 112.34 | 0.316 | 0.623 | 119.61 |

Measure. Since the solution of inconsistency repairing problems contains both detection and repairing tasks, we evaluate and report algorithm performance in the detection phase and the repairing phase independently. We apply Precision (P) and Recall (R) metrics to evaluate the performance of all compared algorithms. In the detection phase, we evaluate how well the algorithm identifies inconsistency intervals with Equation (6). Pd measures the ratio between the number of inconsistent intervals correctly detected and the total number of intervals detected by the algorithm. Rd is the ratio between the number of intervals correctly detected and the total number of all inconsistent intervals. In the repairing phase, we report the repairing quality with Equation (7). Pr computes the ratio between the number of inconsistent intervals correctly repaired and the total number of inconsistent intervals correctly detected in the above detection phase. Similarly, Rr is the ratio between the number of correct repairs and the number of all detected inconsistent intervals.

It is worth noting that we pay more attention to the identification quality of inconsistent time intervals in the detection evaluation, while we focus on the repair quality of concrete inconsistency instances in the repairing results.

$$P_d = \frac{\#correctDetection}{\#Detection}, \qquad R_d = \frac{\#correctDetection}{\#InconsistentIntervals} \qquad (6)$$

$$P_r = \frac{\#correctRepair}{\#correctDetection}, \qquad R_r = \frac{\#correctRepair}{\#Detection} \qquad (7)$$
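Equations (6) and (7) translate directly into code. The sketch below is illustrative only; the function names are assumptions and the counts are made-up values, not results from the experiments.

```python
def detection_metrics(correct_detection, detection, inconsistent_intervals):
    """Return (P_d, R_d) as defined in Equation (6)."""
    return (
        correct_detection / detection,
        correct_detection / inconsistent_intervals,
    )


def repair_metrics(correct_repair, correct_detection, detection):
    """Return (P_r, R_r) as defined in Equation (7)."""
    return (
        correct_repair / correct_detection,
        correct_repair / detection,
    )


# Hypothetical counts: 60 true intervals, 50 detected, 45 of them correct,
# and 36 of the correctly detected intervals repaired correctly.
p_d, r_d = detection_metrics(45, 50, 60)
p_r, r_r = repair_metrics(36, 45, 50)
```

With these counts the detection scores are 0.9 and 0.75, and the repair scores are 0.8 and 0.72.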

5.2 Evaluation on Real Errors

General performance. Table 3 shows the repair performance of the algorithms on the two datasets, with #Time points = 45K in FPP-sys and #Time points = 60K in WPP-sys. Experimental results for algorithm λ-Block with varying λ show that the appropriate length of blocked intervals is λ·√N, λ ∈ [1, 10]. We show the experimental result with λ = 1 below as the best performance of the λ-Block method. Table 3 shows that our proposed ISR has the highest Pr and Rr on the two datasets. The repair recall of ISR is slightly higher than its precision. CRS comes second, and its repair performance is a little better than that of G-ISR. This demonstrates that the greedy-based matching applied in G-ISR is not as reliable as the maximum weight matching in ISR. CRS, without further evaluation of the candidate repair schemas, fails to provide repair results of as high quality as ISR does. It is not surprising that Block has low time costs on both datasets. However, the repair quality of Block is poor, with the lowest repair precision of the four algorithms.

We next report the detailed experimental results of all algorithms on the FPP-sys dataset with two important parameters: the total data amount (i.e., #Time points) and the maximum number of inconsistent attributes (i.e., #Inconsistent Attr).

Varying data amount. We compare algorithm performance on data volumes varying from 20K to 50K in the FPP-sys dataset with various inconsistent time intervals. Under the condition that #Inconsistent Attr = 12, Figure 5 and Figure 6 show the performance on inconsistency detection and repairing, respectively.

Figure 5 reveals that with the increasing data amount, the proposed ISR can always detect inconsistent intervals well, and outperforms the other three methods on both Pd and Rd. When #Time points reaches 40K, ISR’s Pd stays around 0.9, while Rd stays at 0.92. This verifies that the proposed inconsistency detection method as well as the fault-tolerance strategy contribute to a high-quality repairing of inconsistent intervals in industrial time series data.

Fig. 5. Inconsistency detection performance comparison vs. data volume. Panels: (a) Pd, FPP-sys; (b) Rd, FPP-sys. x-axis: #Time points (K), from 20 to 50; y-axis: Precision (D) and Recall (D); series: ISR, G-ISR, CRS, Block.

Fig. 6. Inconsistency repairing performance comparison vs. data volume. Panels: (a) Pr, FPP-sys; (b) Rr, FPP-sys. x-axis: #Time points (K), from 20 to 50; y-axis: Precision (R) and Recall (R); series: ISR, G-ISR, CRS, Block.

Methods G-ISR and CRS take second place on both Pd and Rd. CRS outputs the repairing schemas from Algorithm 1 as the final results without a further fault-tolerance strategy, which results in its poor performance compared with ISR. Figure 5(a) shows that the Pd of CRS never reaches 0.9 and decreases with the increasing data volume. For G-ISR, the simple greedy-based matching approach does not identify intervals as well as the maximum weight matching does. This is because G-ISR sometimes computes incorrect candidate repair schemas, and consequently the subsequent inconsistent interval evaluation process cannot always provide reliable results. Method Block performs worst, and both of its metrics fall seriously with the growing data volume. This verifies that inconsistency problems cannot be detected well by such a blocking method. The appropriate blocking length of intervals, a crucial parameter affecting the quality of sequence behavior models, is difficult to determine. In addition, some inconsistency instances with a small number of inconsistent attributes are difficult for Block to discover.

Figure 6 shows the inconsistency repairing performance under the same experimental conditions. ISR has the best performance on both Pr and Rr. Both metrics of ISR remain steady with the increasing data amount, while G-ISR shows a declining trend in Pr and Rr. CRS always outperforms G-ISR, because G-ISR tends to make more false matches on bipartite graphs as the number of inconsistency instances in the data increases. The stable metric values of both ISR and CRS show that the proposed non-aftereffect sequence behavior modelling in Algorithm 1 helps to avoid incorrect anomaly detection results. Further, the performance difference between ISR and CRS highlights the necessity of the fault-tolerance repairing strategy proposed in Sec. 4. ISR improves the repair effectiveness by evaluating all candidate repair schemas.

Fig. 7. Inconsistency detection performance comparison vs. the number of inconsistent attributes. Panels: (a) Pd, FPP-sys; (b) Rd, FPP-sys. x-axis: #Inconsistent Attr, from 3 to 18; y-axis: Precision (D) and Recall (D); series: ISR, G-ISR, CRS, Block.

Fig. 8. Inconsistency repairing performance comparison vs. the number of inconsistent attributes. Panels: (a) Pr, FPP-sys; (b) Rr, FPP-sys. x-axis: #Inconsistent Attr, from 3 to 18; y-axis: Precision (R) and Recall (R); series: ISR, G-ISR, CRS, Block.

Varying inconsistent attributes. Figure 7 and Figure 8 report the performance under the condition that #Time points = 45K in FPP-sys, with #Inconsistent Attr varying from 3 to 18.

Figure 7(a) shows that ISR has the highest Pd, which stays within 0.87-0.91 as the number of inconsistent attributes increases. ISR’s repair recall only presents a slight drop when #Inconsistent Attr is larger than 12. CRS has a serious drop in the two metrics, which reflects that the detection quality of CRS decreases when there exist inconsistency instances with more attributes. In general, Figure 7 confirms that our method is effective in detecting inconsistency issues in industrial monitoring data under complex conditions.

Figure 8 shows the repairing performance comparison among the four methods. All methods suffer a drop in both Pr and Rr with the increasing number of inconsistent attributes. Our ISR can still achieve Pr > 0.78 and Rr > 0.85 with inconsistency instances existing in up to 12 attributes simultaneously. The results verify that ISR is able to repair inconsistent instances effectively from low-quality data with multiple inconsistent attributes. These metric values are acceptable, and falsely repaired instances can be returned to the manual evaluation process mentioned in Sec. 3.2.

Compared with ISR, both G-ISR and CRS suffer a sharp drop with the increasing number of inconsistent attributes, and G-ISR never performs better than ISR. This illustrates that 1) the proposed ISR can detect inconsistent instances effectively from low-quality data with multiple inconsistent attributes, and 2) the greedy-based matching approach is not as effective as the maximum weight approach, and it gets even worse when inconsistency happens in many more attributes.

Efficiency. We report the execution time costs of each method with varying data volume in Fig. 9. Figure 9(a) shows the total running times under the condition that #Inconsistent Attr = 12. Our method ISR has the highest time cost, which is only a little higher than that of CRS. This is because Algorithm 2 and Algorithm 3 in ISR spend additional time to determine the final repair schemas. However, it is easy to see that our repairing schemas determination step does not take much time in each inconsistency solution. It also reflects that the Disjoint Set structure applied in Algorithm 3 improves the efficiency of determining the repair solution. The difference in time cost between ISR and CRS is acceptable, since ISR achieves better inconsistency repairing performance than CRS does. The execution times of ISR and CRS increase more slowly once #Time points reaches 35K, and we are able to finish the detection and repairing of inconsistency issues within 5 days’ monitoring time series in 2.5 minutes.

Fig. 9. Time costs comparison. Panels: (a) Execution time vs. data volume; (b) Execution time vs. inconsistent attribute number. x-axis: #Time points (K) and #Inconsistent Attr; y-axis: Running Time (s); series: ISR, G-ISR, CRS, Block.

G-ISR has a shorter running time than ISR and CRS, because G-ISR spends less time on graph matching and on obtaining candidate repairing schemas with the greedy-based method. Block costs the least time of all methods, as it saves much time in bipartite graph construction and matching computation.

Time costs with a varying total number of inconsistent attributes w.r.t. #Time points = 45K in FPP-sys are shown in Fig. 9(b). With the increasing #Inconsistent Attr, all methods show a growing execution time cost. This is because the algorithms need more computation (especially in graph matching and determining repair schemas) when repairing complex inconsistency instances among more attributes. Our proposed ISR has the highest time cost, but in general it can finish inconsistency detection and repairing in 9 attributes on 45K time points in 145 seconds. This verifies the effectiveness and efficiency of our method in industrial temporal data cleaning under complex data quality problem scenarios.

6 RELATED WORK

We summarize a few works related to the inconsistency issues in time series addressed in this paper.

Temporal data cleaning. Data cleaning and repairing is of great importance in data preprocessing and has been studied extensively. Along with the rise of temporal data mining, temporal data quality issues have become serious. Effective cleaning of temporal data is gaining attention because of its valuable temporal information. Given that timestamps are often unavailable or imprecise in data applications [10], [11], the cleaning involves two main problems: 1) cleaning inconsistent or imprecise timestamps, and 2) repairing anomalous data values and errors. For the former problem, [12] first proposes a temporal constraints processing framework to address time-related relationships between events. Song et al. [13] develop high-quality temporal constraint-based repairing algorithms to solve inconsistent timestamp problems. [10] proposes a temporal framework that assigns a possible time interval to each event considering the occurrence times of patterns. For the latter, both statistical-based [14], [15] and constraint-based [16], [17] cleaning are widely applied in temporal data quality improvement. [16] extends the idea of constraints from dependencies defined on relational databases (e.g., FD, CFD in [18]), and proposes sequential dependencies (SD) to describe the semantics of temporal data. Accordingly, speed constraints are developed for sequential data and applied to time series cleaning solutions [15], [17].

Anomaly detection over time series. As one common form of temporal data, time series are becoming easier to collect and further analyze in data application scenarios. Anomaly detection (see [19] for a survey) is an important step in the time series management process [20], which aims to discover unexpected changes in patterns or data values in time series. Gupta et al. [4] summarize anomaly detection tasks for various kinds of temporal data and provide an overview of detection techniques (e.g., statistical techniques, distance-based approaches, classification-based approaches) in different scenarios. Time series anomaly detection tasks include discovering discrete abnormal data points (outliers) and anomalous (sub)sequences. Autoregression and window moving-average models (e.g., EWMA, ARIMA [21]) are widely used in outlier point detection [22]. On the other hand, anomalous subsequences are more challenging to detect because abnormal behaviors within subsequences are difficult to distinguish from normal behaviors [3]. Sequence pattern discovery in time series has been continuously studied, e.g., [23], [24]. [25] studies anomalous time series intervals and abnormal subsequences. Further, high-dimensional features in time series are taken into account to improve the effectiveness of anomaly detection methods [26], [27].

The inconsistency problems in industrial temporal data have only recently been brought to attention in both research and applications of IoT, especially in IIoT scenarios. Technological breakthroughs are still in demand for developing comprehensive data quality improvement and data cleaning approaches, where inconsistency repairing is a key problem. In our previous work [5], we develop a data cleaning system, Cleanits, in which we implement reliable data cleaning algorithms for missing value imputation, abnormal subsequence detection, and so on.

Our work in this paper develops an integrated data inconsistency repairing method for IoT time series data. The proposed method can also complement existing IoT data cleaning techniques.

7 CONCLUSION

We formalize one serious inconsistency problem on multivariate industrial time series data in this paper. We propose an integrated method to detect inconsistent instances and then repair them with correct schemas. The proposed method achieves the following: (1) it is effective in IIoT data quality management and data cleaning tasks; (2) less-negative-cumulative-effect sequence behavior modelling guarantees the reliability of the proposed inconsistency detection process; and (3) fault-tolerance evaluation on candidate repairing schemas contributes to high-quality repairing of various inconsistency instances. The evaluation results on real-life IIoT data show that the proposed method effectively detects and repairs inconsistency instances in industrial time series within a reasonable time in IoT data monitoring systems.

REFERENCES

[1] J. Ding, Y. Liu, L. Zhang, J. Wang, and Y. Liu, “An anomalydetection approach for multiple monitoring data series based onlatent correlation probabilistic model,” Appl. Intell., vol. 44, no. 2,pp. 340–361, 2016.

[2] Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouz-zani, P. Papotti, M. Stonebraker, and N. Tang, “Detecting dataerrors: Where are we and what needs to be done?” PVLDB, vol. 9,no. 12, pp. 993–1004, 2016.

[3] M. Toledano, I. Cohen, Y. Ben-Simhon, and I. Tadeski, “Real-timeanomaly detection system for time series at scale,” in Proceedingsof the KDD Workshop on Anomaly Detection, 2017, pp. 56–65.

[4] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, Outlier Detectionfor Temporal Data, ser. Synthesis Lectures on Data Mining andKnowledge Discovery. Morgan & Claypool Publishers, 2014.

[5] X. Ding, H. Wang, J. Su, Z. Li, J. Li, and H. Gao, “Cleanits: A datacleaning system for industrial time series,” PVLDB, vol. 12, no. 12,pp. 1786–1789, 2019.

[6] X. Wang and C. Wang, “Time series data cleaning: A survey,” IEEE Access, vol. 8, pp. 1866–1881, 2020. [Online]. Available: https://doi.org/10.1109/ACCESS.2019.2962152

[7] R. C. Lyndon and P. E. Schupp, Combinatorial group theory, 1977.

[8] J. K. S. McKay, “Computing with finite groups,” Ph.D. dissertation, University of Edinburgh, UK, 1970.

[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition. MIT Press, 2009. [Online]. Available: http://mitpress.mit.edu/books/introduction-algorithms

[10] H. Zhang, Y. Diao, and N. Immerman, “Recognizing patterns in streams with imprecise timestamps,” PVLDB, vol. 3, no. 1, pp. 244–255, 2010.

[11] W. Fan, F. Geerts, and J. Wijsen, “Determining the currency of data,” ACM Trans. Database Syst., vol. 37, no. 4, pp. 25:1–25:46, 2012.

[12] P. Torasso, Ed., Advances in Artificial Intelligence, Third Congress of the Italian Association for Artificial Intelligence, AI*IA’93, Torino, Italy, October 26-28, 1993, Proceedings, ser. Lecture Notes in Computer Science, vol. 728. Springer, 1993.

[13] S. Song, Y. Cao, and J. Wang, “Cleaning timestamps with temporal constraints,” PVLDB, vol. 9, no. 10, pp. 708–719, 2016.

[14] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid, “Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pp. 553–564.

[15] A. Zhang, S. Song, and J. Wang, “Sequential data cleaning: A statistical approach,” in Proceedings of the International Conference on Management of Data, SIGMOD Conference, 2016, pp. 909–924.

[16] L. Golab, H. J. Karloff, F. Korn, A. Saha, and D. Srivastava, “Sequential dependencies,” PVLDB, vol. 2, no. 1, pp. 574–585, 2009.

[17] S. Song, A. Zhang, J. Wang, and P. S. Yu, “SCREEN: stream data cleaning under speed constraints,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 827–841.

[18] W. Fan and F. Geerts, Foundations of Data Quality Management, ser. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.

[19] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, 2009.

[20] S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time series management systems: A survey,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2581–2600, 2017.

[21] W. W. S. Wei, Time series analysis - univariate and multivariate methods. Addison-Wesley, 1989.

[22] J. Takeuchi and K. Yamanishi, “A unifying framework for detecting outliers and change points from time series,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 482–492, 2006.

[23] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming pattern discovery in multiple time-series,” in Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pp. 697–708.

[24] F. Morchen, “Algorithms for time series knowledge mining,” in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pp. 668–673.

[25] U. Rebbapragada, P. Protopapas, C. E. Brodley, and C. R. Alcock, “Finding anomalous periodic time series,” Machine Learning, vol. 74, no. 3, pp. 281–313, 2009.

[26] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, “High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning,” Pattern Recognition, vol. 58, pp. 121–134, 2016.

[27] H. Liu, X. Li, J. Li, and S. Zhang, “Efficient outlier detection for high-dimensional data,” IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 48, no. 12, pp. 2451–2461, 2018.