# [ieee 22nd international conference on data engineering (icde'06) - atlanta, ga, usa...

Post on 03-Feb-2017


Mining Dense Periodic Patterns in Time Series Data

Chang Sheng, Wynne Hsu, Mong Li Lee

School of Computing, National University of Singapore

{shengcha, whsu, leeml}@comp.nus.edu.sg

Abstract

Existing techniques to mine periodic patterns in time series data focus on discovering full-cycle periodic patterns from an entire time series. However, many useful partial periodic patterns are hidden in long and complex time series data. In this paper, we aim to discover the partial periodicity in local segments of the time series data. We introduce the notion of character density to partition the time series into variable-length fragments and to determine the lower bound of each character's period. We propose a novel algorithm, called DPMiner, to find the dense periodic patterns in time series data. Experimental results on both synthetic and real-life datasets demonstrate that the proposed algorithm is effective and efficient at revealing interesting dense periodic patterns.

1 Introduction

One area of research in time series databases is periodicity detection. Two kinds of periodicity detection exist: full-cycle periodicity and partial periodicity. In full-cycle periodicity, every point in the time series contributes to part of the cycle (e.g., the cycle of seasons in a year). In partial periodicity, only a portion of the time series data is essential to the mining results. Recent work has focused on partial periodicity detection.

Previous works [3, 4, 1, 2] devise methods to discover potential periods from the entire time series data. These methods are not applicable if the periodic patterns occur only within small segments of the time series. For example, suppose Bob is an employee who drives his car to work every day, but his route changes from month to month. In the first month, he follows the route home → BLOCK A → BLOCK D → company; in the second month, he follows the route home → BLOCK B → BLOCK K → company; and in the third month, he changes to the route home → BLOCK C → BLOCK F → company. Existing periodicity detection algorithms will not discover his traveling habits, since each habit is present in only a third of the entire three-month period.

In this paper, we develop a new periodicity detection algorithm to efficiently discover such short-period patterns that may exist only in a limited range of the time series. We refer to these patterns as dense periodic patterns. Our contributions are summarized as follows: (1) We introduce the notion of dense periodic patterns, where the periodicity is confined to part of the time series. To the best of our knowledge, this is the first work that deals with periodic patterns in localized segments. (2) We design a pruning strategy to limit the search space to just the feasible periods. (3) We develop a dense periodic pattern mining algorithm called DPMiner that has been demonstrated to be both scalable and efficient.

2 Dense Periodicity

Given an alphabet $\Sigma$, a pattern $P = (I_1 I_2 \ldots I_p)$ is an ordered sequence of itemsets, where each itemset $I_i$ is a set of zero or more non-repeated characters denoted as $\{c\}$, $c \in \Sigma$. For example, $(b\{b, c\}a)$ is a pattern over the alphabet $\{a, b, c\}$, and $\{b\}$, $\{b, c\}$, $\{a\}$ are the three itemsets of this pattern.

Here, we define the concept of density of a symbol in the alphabet $\Sigma$. The distance between any two characters, say $c_i$ and $c_j$, in the time series $T = c_1 \ldots c_n$, is defined as $|i - j|$. For any two characters of the same symbol $s$, if their distance is not greater than the given parameter $d_{max}$, we say they are directly density-reachable. $d_{max}$ is a user-assigned parameter that denotes the maximum allowable distance between two directly density-reachable characters.

For example, in Figure 1, if $d_{max} = 10$, the two characters of symbol $a$ in positions 2 and 5 are directly density-reachable because their distance is 3, which is less than $d_{max}$. On the other hand, the two characters of symbol $a$ in positions 13 and 25 are not directly density-reachable because their distance is 12, which exceeds $d_{max}$.

A dense fragment of symbol $s$ in the time series $T$, denoted as $F_{s,(bpos,epos)}$, is a continuous sequence of characters in $T$ such that (1) $s$ occurs at both the beginning position ($bpos$) and the ending position ($epos$), and (2) any two neighboring $s$ characters are directly density-reachable. The dense fragment set of $s$ in the time series $T$ is the set of all the dense fragments of $s$, denoted as $FS_s$.

Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

Figure 1. Example of a Time Series T

In Figure 1, the dense fragment set $FS_a$ for symbol $a$ includes two dense fragments, $F_{a,(2,13)}$ and $F_{a,(25,35)}$. Let $|F_{s,(bpos,epos)}| = epos - bpos$ denote the length of the dense fragment; then $|F_{a,(2,13)}| = 11$ and $|F_{a,(25,35)}| = 10$.
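The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `dense_fragments` is hypothetical, and because Figure 1 is not reproduced here, the positions 9 and 30 are assumed values chosen only to be consistent with the positions (2, 5, 13, 25, 35) and the frequency of $a$ stated in the text.

```python
from typing import List, Tuple


def dense_fragments(positions: List[int], dmax: int) -> List[Tuple[int, int]]:
    """Split the sorted positions of one symbol into dense fragments (bpos, epos).

    Two neighbouring occurrences stay in the same fragment when their
    distance is not greater than dmax (directly density-reachable).
    """
    fragments: List[Tuple[int, int]] = []
    if not positions:
        return fragments
    bpos = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev > dmax:
            # The gap is too wide: close the current fragment, start a new one.
            fragments.append((bpos, prev))
            bpos = pos
        prev = pos
    fragments.append((bpos, prev))
    return fragments


# Assumed positions of symbol 'a'; 9 and 30 are hypothetical fill-ins.
positions_a = [2, 5, 9, 13, 25, 30, 35]
fs_a = dense_fragments(positions_a, dmax=10)
lengths = [epos - bpos for bpos, epos in fs_a]
print(fs_a)     # [(2, 13), (25, 35)]
print(lengths)  # [11, 10]
```

A single pass over the sorted positions is enough, which matches the idea of obtaining all fragments with one scan of the time series.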

For a dense fragment $F_{s,(bpos,epos)}$, it is easy to count the frequency of $s$, denoted by $freq_s$. By assigning two parameters, $d_{max}$ and $\mathit{min\_conf}$ (the minimum confidence), we can deduce a lower bound period for all the possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$.

THEOREM 2.1 (Lower Bound Period) The lower bound period of all possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$ is equal to

$$\frac{|F_{s,(bpos,epos)}| \times \mathit{min\_conf} \times d_{max}}{freq_s \times d_{max} - |F_{s,(bpos,epos)}| \times (1 - \mathit{min\_conf})}.$$

This theorem allows us to prune from the search space all the periods that are less than the computed lower bound. Please refer to [5] for the proof.

For example, in Figure 1, if $\mathit{min\_conf} = 0.8$ and $d_{max} = 10$, we can compute the minimum period of fragment $F_{a,(2,13)}$ to be

$$\frac{12 \times 0.8 \times 10}{4 \times 10 - 12 \times 0.2} \approx 2.55.$$

The minimum period obtained indicates that it is not possible to find frequent patterns containing the symbol $a$ with a period of 2. A closer look reveals that, with a specified period of 2, $F_{a,(2,13)}$ contains a total of 6 segments, of which at least $\lceil 6 \times 0.8 \rceil = 5$ must support a 1-pattern containing $a$ (i.e., $(a\,*)$ or $(*\,a)$) for that pattern to be frequent. However, we only have $freq_a = 4$ in $F_{a,(2,13)}$. In other words, the 1-patterns of $a$ with period 2 cannot be frequent in $F_{a,(2,13)}$.
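The arithmetic of this example can be checked with a short Python sketch. The function name `lower_bound_period` and the guard on the denominator are assumptions made for illustration; the text does not discuss the case where the denominator is not positive. Note that the worked example plugs in a fragment length of 12 (the number of positions from 2 to 13), so the sketch takes the length as an explicit argument rather than deriving it.

```python
import math


def lower_bound_period(length: float, freq: int, dmax: int, min_conf: float) -> float:
    """Evaluate the lower bound period formula of Theorem 2.1."""
    denominator = freq * dmax - length * (1 - min_conf)
    if denominator <= 0:
        # Not covered by the text; treated as an error in this sketch.
        raise ValueError("formula is undefined for a non-positive denominator")
    return (length * min_conf * dmax) / denominator


bound = lower_bound_period(length=12, freq=4, dmax=10, min_conf=0.8)
print(round(bound, 2))  # 2.55

# Cross-check for period 2: 12 positions form 6 segments, and a frequent
# 1-pattern needs ceil(6 * 0.8) = 5 supporting segments, but freq_a is 4.
segments = 12 // 2
needed = math.ceil(segments * 0.8)
print(segments, needed, 4 >= needed)  # 6 5 False
```

Any period below the returned bound, here period 2, can be skipped for this fragment without counting supports.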

To extend the pruning strategy to the k-patterns, we need to identify all the promising high-density regions and perform mining only in these regions.

DEFINITION 2.1 Given a period $p$ and an itemset $I = \{c_1, c_2, \ldots, c_m\}$, the dense region of $I$ with period $p$, $DR(I, p)$, is defined as the union of all the dense fragments of $c_i$, $1 \le i \le m$, where the lower bound period of each dense fragment is less than or equal to $p$.

The dense region of the empty itemset is defined to be the entire time series. Note that all the merged dense fragments must have a lower bound period less than or equal to $p$ so that they satisfy the minimal density requirement.

DEFINITION 2.2 For a pattern $P$ with period $p$, $(I_1, \ldots, I_p)$, the dense interval of $P$, $DI(P, p)$, is defined to be the intersection of all the dense regions of its itemset members, namely $DI(P, p) = DR(I_1, p) \cap \ldots \cap DR(I_p, p)$.
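Definitions 2.1 and 2.2 amount to interval union and interval intersection, which the following Python sketch illustrates. The representation is an assumption: each fragment is a tuple `(bpos, epos, lower_bound)` stored per symbol, positions are 1-based and inclusive, and the fragment data (including the lower bounds 3.0 and 2.0) are hypothetical values used only to exercise the code.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Union of closed intervals, returned sorted and non-overlapping."""
    merged: List[Interval] = []
    for b, e in sorted(intervals):
        if merged and b <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((b, e))
    return merged


def dense_region(itemset: Set[str], p: float,
                 fragments: Dict[str, List[Tuple[int, int, float]]],
                 series_len: int) -> List[Interval]:
    """DR(I, p): union of the fragments of I's symbols with lower bound <= p."""
    if not itemset:
        return [(1, series_len)]  # empty itemset: the entire time series
    keep = [(b, e) for s in itemset
            for (b, e, lb) in fragments.get(s, []) if lb <= p]
    return merge_intervals(keep)


def intersect(xs: List[Interval], ys: List[Interval]) -> List[Interval]:
    """Intersection of two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def dense_interval(pattern: List[Set[str]], p: float,
                   fragments: Dict[str, List[Tuple[int, int, float]]],
                   series_len: int) -> List[Interval]:
    """DI(P, p): intersection of the dense regions of all itemsets of P."""
    result: List[Interval] = [(1, series_len)]
    for itemset in pattern:
        result = intersect(result, dense_region(itemset, p, fragments, series_len))
    return result


# Hypothetical fragments: symbol -> [(bpos, epos, lower bound period)].
fragments = {"a": [(2, 13, 2.55), (25, 35, 3.0)], "b": [(10, 30, 2.0)]}
print(dense_interval([{"a"}, {"b"}, set()], 3, fragments, 40))  # [(10, 13), (25, 30)]
print(dense_interval([{"a"}], 2, fragments, 40))                # []
```

With period 2, both fragments of `a` are filtered out by their lower bounds, so the dense interval is empty and no scan is needed for that pattern.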

Algorithm 1 DPMiner

Input: A time series $T = c_1 c_2 \ldots c_n$; alphabet $\Sigma$; maximal distance $d_{max}$; fragment length coefficient; periodicity threshold $\mathit{min\_conf}$.

Output: Patterns of period in $[2, d_{max}]$ for $T$.

1: Scan the time series once to find the fragment set $S_s$ for each symbol $s \in \Sigma$;

2: for each symbol $s \in \Sigma$ do

3: Delete from $S_s$ the fragments whose length is less than the fragment length coefficient times $|T|$;

4: Compute the lower bound period of every fragment in $S_s$;

5: end for

6: for period $p = 2$ to $d_{max}$ do

7: For each symbol $s$, discover the frequent 1-patterns, $F1_s$, from $S_s$;

8: Merge all $F1_s$, $s \in \Sigma$, to obtain the max-pattern $P_{max}$;

9: Let $P_{max}$ with period $p$ be the root node of the max-subpattern tree $R$, and compute the dense support $DS(R, p)$;

10: Scan $DS(R, p)$ to construct $R$;

11: Traverse $R$ to output the frequent patterns.

12: end for

To mine a k-pattern, we only need to integrate the dense regions of all of its itemsets and inspect the intersection regions, namely the dense intervals of this pattern.

3 DPMiner

The density-based pruning strategy is incorporated in DPMiner (Dense Periodic pattern Miner), whose outline is described in Algorithm 1. The algorithm mines dense periodic patterns in two phases. The first phase (Steps 1 to 5 in Algorithm 1) scans the time series once to obtain the dense fragments for each symbol $s$ in $\Sigma$. The second phase (Steps 6 to 12 in Algorithm 1) utilizes a top-down method that is similar to the one proposed in [3], except that it only scans the union of the dense regions of the corresponding root node's itemsets instead of the entire time series.
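Step 7 of Algorithm 1, the discovery of frequent 1-patterns inside a fragment, might look like the following Python sketch. Several details are assumptions rather than facts from the text: the function name `frequent_offsets` is hypothetical, segments are aligned to the fragment's starting position, only whole segments are counted, and the positions 9, 8 and 11 are illustrative values.

```python
from typing import Dict, List


def frequent_offsets(positions: List[int], bpos: int, epos: int,
                     p: int, min_conf: float) -> Dict[int, float]:
    """Return {offset: confidence} for offsets where the symbol is frequent.

    The fragment [bpos, epos] is cut into whole segments of length p; the
    confidence of an offset is the share of segments holding the symbol there.
    """
    n_segments = (epos - bpos + 1) // p
    if n_segments == 0:
        return {}
    counts = [0] * p
    for pos in positions:
        # Ignore occurrences outside the whole segments of this fragment.
        if bpos <= pos < bpos + n_segments * p:
            counts[(pos - bpos) % p] += 1
    return {offset: count / n_segments
            for offset, count in enumerate(counts)
            if count / n_segments >= min_conf}


# Period 2 in the fragment from 2 to 13 (9 is an assumed position of 'a').
print(frequent_offsets([2, 5, 9, 13], bpos=2, epos=13, p=2, min_conf=0.8))   # {}

# A strictly periodic hypothetical symbol recurring every 3 positions.
print(frequent_offsets([2, 5, 8, 11], bpos=2, epos=13, p=3, min_conf=0.8))   # {0: 1.0}
```

The first call returns no frequent offset for period 2, in line with the lower bound of about 2.55 derived for this fragment.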

DEFINITION 3.1 For a max-subpattern tree $R$ taking the max-pattern $P_{max}$ with period $p$, $(I_1, \ldots, I_p)$, as its root node, the dense support of tree $R$, $DS(R, p)$, is defined to be the union of all the dense regions of its root node's itemsets, namely $DS(R, p) = DR(I_1, p) \cup \ldots \cup DR(I_p, p)$.
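As a rough illustration of Definition 3.1, the dense support can be computed as a union of interval lists. In this Python sketch the function name `dense_support`, the region values, and the series length of 100 are all hypothetical, and the sketch assumes that only the regions of non-empty itemsets are passed in.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def dense_support(regions: List[List[Interval]]) -> List[Interval]:
    """DS(R, p): union of the dense regions DR(I_1, p), ..., DR(I_p, p)."""
    merged: List[Interval] = []
    for b, e in sorted(iv for region in regions for iv in region):
        if merged and b <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((b, e))
    return merged


# Hypothetical dense regions of three itemsets of a max-pattern.
regions = [[(2, 13), (25, 35)], [(10, 30)], [(50, 60)]]
support = dense_support(regions)
covered = sum(e - b + 1 for b, e in support)
print(support)  # [(2, 35), (50, 60)]
print(covered)  # 45 positions to scan instead of, say, a series of length 100
```

Only the positions inside the returned intervals would be scanned when building the max-subpattern tree.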

Figure 2. Time Comparison of DPMiner and MTHS. Panels: (a) DATA-6-10000, (b) DATA-21-75000, (c) PACKET. Each panel plots run time (ms) against periods for DPMiner and the max-subpattern tree hit set method.

The dense support of tree $R$ thoroughly covers all segments supporting the max-pattern or its subpatterns. In other words, only those segments in $DS(R)$ are useful for the construction of the max-subpattern tree. This allows us to prune away many unneces
