
IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA (2006.04.3-2006.04.7)

Mining Dense Periodic Patterns in Time Series Data

Chang Sheng, Wynne Hsu, Mong Li Lee

School of Computing, National University of Singapore

{shengcha, whsu, leeml}@comp.nus.edu.sg

Abstract

Existing techniques for mining periodic patterns in time series data focus on discovering full-cycle periodic patterns from an entire time series. However, many useful partial periodic patterns are hidden in long and complex time series data. In this paper, we aim to discover the partial periodicity in local segments of the time series data. We introduce the notion of character density to partition the time series into variable-length fragments and to determine the lower bound of each character's period. We propose a novel algorithm, called DPMiner, to find the dense periodic patterns in time series data. Experimental results on both synthetic and real-life datasets demonstrate that the proposed algorithm is effective and efficient in revealing interesting dense periodic patterns.

1 Introduction

One area of research in time series databases is periodicity detection. Two kinds of periodicity exist: full-cycle periodicity and partial periodicity. In full-cycle periodicity, every point in the time series contributes to part of the cycle (e.g., the cycle of seasons in a year). In partial periodicity, only a portion of the time series data is essential to the mining results. Recent work has focused on partial periodicity detection.

Previous works [3, 4, 1, 2] devise methods to discover potential periods from the entire time series data. These methods are not applicable if the periodic patterns occur only within small segments of the time series. For example, suppose Bob is an employee who drives his car to work every day. However, his route may change from month to month. In the first month, he follows the route "home→BLOCK A→BLOCK D→company"; in the second month, he follows the route "home→BLOCK B→BLOCK K→company"; and in the third month, he changes to the route "home→BLOCK C→BLOCK F→company". Existing periodicity detection algorithms will not discover his traveling habits, since each habit is present in only a third of the entire three-month period.

In this paper, we develop a new periodicity detection algorithm to efficiently discover such short-period patterns that may exist only in a limited range of the time series. We refer to these patterns as dense periodic patterns. Our contributions are summarized as follows: (1) We introduce the notion of dense periodic patterns, where the periodicity is confined to part of the time series. To the best of our knowledge, this is the first work that deals with periodic patterns in localized segments. (2) We design a pruning strategy to limit the search space to just the feasible periods. (3) We develop a dense periodic pattern mining algorithm called DPMiner that has been demonstrated to be both scalable and efficient.

2 Dense Periodicity

Given an alphabet $\Sigma$, a pattern $P = (I_1 I_2 \ldots I_p)$ is an ordered sequence of itemsets, where each itemset $I_i$ is a set of zero or more non-repeated characters, denoted as $\{c^*\}$, $c \in \Sigma$. The symbol $\ast$ denotes the empty itemset. For example, $(b\{b, c\}a\ast)$ is a pattern over the alphabet $\{a, b, c\}$, and $\{b\}$, $\{b, c\}$, $\{a\}$ are three itemsets of this pattern.

Here, we define the concept of density of a symbol in the alphabet $\Sigma$. The distance between any two characters, say $c_i$ and $c_j$, in the time series $T = c_1 \cdots c_n$ is defined as $|i - j|$. For any two characters of the same symbol $s$, if their distance is not greater than the given parameter $d_{max}$, we say they are directly density-reachable. $d_{max}$ is a parameter assigned by users to denote the maximum allowable distance between two directly density-reachable characters.

For example, in Figure 1, if $d_{max} = 10$, the two characters of symbol 'a' in positions 2 and 5 are directly density-reachable because their distance is 3, which is less than $d_{max}$. On the other hand, the two characters of symbol 'a' in positions 13 and 25 are not directly density-reachable because their distance is 12, which exceeds $d_{max}$.

A dense fragment of symbol $s \in \Sigma$ in the time series $T$, denoted as $F_{s,(bpos,epos)}$, is a continuous sequence of characters in $T$ such that (1) $s$ occurs at both the beginning position ($bpos$) and the ending position ($epos$), and (2) any two neighboring $s$ characters are directly density-reachable. The dense fragment set of $s$ in the time series $T$ is the set of all the dense fragments of $s$, denoted as $FS_s$.
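The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `dense_fragments`, the 1-based positions, and the toy series (built so that 'a' occurs at positions 2, 5, 9, 13, 25, 30 and 35) are assumptions, since the actual series of Figure 1 is not reproduced in the text. A symbol that occurs only once is reported here as a fragment with bpos equal to epos, which is also an assumption.

```python
from collections import defaultdict


def dense_fragments(series, d_max):
    """Return {symbol: [(bpos, epos), ...]} with 1-based positions.

    Neighboring occurrences of a symbol stay in the same fragment
    while their distance is not greater than d_max.
    """
    positions = defaultdict(list)
    for i, ch in enumerate(series, start=1):
        positions[ch].append(i)

    fragments = {}
    for sym, pos in positions.items():
        frags = []
        bpos = epos = pos[0]
        for p in pos[1:]:
            if p - epos <= d_max:      # directly density-reachable
                epos = p
            else:                      # gap too large: close the fragment
                frags.append((bpos, epos))
                bpos = epos = p
        frags.append((bpos, epos))
        fragments[sym] = frags
    return fragments


# Hypothetical toy series of length 35: 'a' at chosen positions, 'b' elsewhere.
chars = ["b"] * 35
for p in (2, 5, 9, 13, 25, 30, 35):
    chars[p - 1] = "a"
toy = "".join(chars)

print(dense_fragments(toy, d_max=10)["a"])
```

With $d_{max} = 10$, the gap between positions 13 and 25 is 12, so the occurrences of 'a' split into two fragments, matching the two fragments named for Figure 1.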

In Figure 1, the dense fragment set $FS_a$ for symbol $a$ includes two dense fragments, $F_{a,(2,13)}$ and $F_{a,(25,35)}$. Let $|F_{s,(bpos,epos)}| = epos - bpos$ denote the length of the dense fragment; then $|F_{a,(2,13)}| = 11$ and $|F_{a,(25,35)}| = 10$.

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Figure 1. Example of a Time Series T

For a dense fragment $F_{s,(bpos,epos)}$, it is easy to count the frequency of $s$, denoted by $freq_s$. Given the two parameters $d_{max}$ and $min\_conf$ (the minimum confidence), we can deduce a lower bound period for all the possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$.

THEOREM 2.1 (Lower Bound Period) The lower bound period of all possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$ is equal to

$$\frac{|F_{s,(bpos,epos)}| \times min\_conf \times d_{max}}{freq_s \times d_{max} - |F_{s,(bpos,epos)}| \times (1 - min\_conf)}.$$

This theorem allows us to prune from the search space all the periods that are less than the computed lower bound. Please refer to [5] for the proof.

For example, in Figure 1, if $min\_conf = 0.8$ and $d_{max} = 10$, we can compute the minimum period of fragment $F_{a,(2,13)}$ to be $\frac{12 \times 0.8 \times 10}{4 \times 10 - 12 \times 0.2} \approx 2.55$. (The computation uses 12, the number of positions from 2 to 13 inclusive, as the fragment length.) The minimum period obtained indicates that it is not possible to find frequent patterns containing the symbol $a$ with a period of 2. A closer look reveals that with a specified period of 2, $F_{a,(2,13)}$ contains a total of 6 segments, of which at least $\lceil 6 \times 0.8 \rceil = 5$ must support the 1-patterns containing 'a' (i.e., $(a\ast)$ and $(\ast a)$). However, we only have $freq_a = 4$ in $F_{a,(2,13)}$. In other words, the 1-patterns of 'a' with period 2 cannot be frequent in $F_{a,(2,13)}$.
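The arithmetic of this example can be checked with a short Python sketch. The function name `lower_bound_period` and its parameter names are hypothetical; the formula is the one stated in Theorem 2.1. Raising an error when the denominator is not positive is an assumption of this sketch, since the text does not say how that case is treated.

```python
import math


def lower_bound_period(frag_len, freq, d_max, min_conf):
    """Lower bound period of Theorem 2.1 for one dense fragment."""
    denom = freq * d_max - frag_len * (1 - min_conf)
    if denom <= 0:
        # Assumption: the source does not discuss this case.
        raise ValueError("non-positive denominator; bound undefined here")
    return frag_len * min_conf * d_max / denom


# Values from the worked example: length 12, freq_a = 4, d_max = 10, min_conf = 0.8.
lb = lower_bound_period(frag_len=12, freq=4, d_max=10, min_conf=0.8)
print(round(lb, 2))

# Cross-check for period 2: 6 segments, of which ceil(6 * 0.8) must support 'a'.
segments = 12 // 2
needed = math.ceil(segments * 0.8)
print(segments, needed, needed > 4)
```

The second check mirrors the argument in the text: 5 supporting segments are required but 'a' occurs only 4 times, so period 2 can be pruned for this fragment.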

To extend the pruning strategy to the k-patterns, we need to identify all the promising high-density regions and perform mining only in these regions.

DEFINITION 2.1 Given a period $p$ and an itemset $I = \{c_1, c_2, \ldots, c_m\}$, the dense region of $I$ with period $p$, $DR(I, p)$, is defined as the union of all the dense fragments of $c_i$, $1 \le i \le m$, where the lower bound period of each dense fragment is less than or equal to $p$.

The dense region of the empty itemset $\ast$ is defined to be the entire time series. Note that all the merged dense fragments must have a lower bound period less than or equal to $p$ so that they satisfy the minimal density requirement.

DEFINITION 2.2 For a pattern $P$ with period $p$, $(I_1, \ldots, I_p)$, the dense interval of $P$, $DI(P, p)$, is defined to be the intersection of all the dense regions of its itemset members, namely $DI(P, p) = DR(I_1, p) \cap \ldots \cap DR(I_p, p)$.
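Definitions 2.1 and 2.2 amount to interval unions and intersections, which the following Python sketch illustrates. It is a sketch under stated assumptions rather than the authors' code: fragments are represented as hypothetical `(bpos, epos, lower_bound)` triples with closed integer intervals, the helper names are invented, and the fragment data in the example (apart from the 2.55 bound computed in the text) is made up for illustration.

```python
from functools import reduce


def union(intervals):
    """Merge overlapping closed intervals into a sorted disjoint list."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def intersect(xs, ys):
    """Intersect two sorted disjoint interval lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def dense_region(itemset, period, fragments, series_len):
    """DR(I, p): union of fragments whose lower bound period is <= p."""
    if not itemset:                       # empty itemset: entire series
        return [(1, series_len)]
    picked = [(b, e) for sym in itemset
              for (b, e, lb) in fragments.get(sym, []) if lb <= period]
    return union(picked)


def dense_interval(pattern, period, fragments, series_len):
    """DI(P, p): intersection of the dense regions of all itemsets of P."""
    regions = [dense_region(i, period, fragments, series_len) for i in pattern]
    return reduce(intersect, regions)


# Hypothetical fragment data for a series of length 40.
frags = {"a": [(2, 13, 2.55), (25, 35, 3.0)], "b": [(10, 30, 2.0)]}
pattern = [{"a"}, {"b"}, set()]           # a pattern of period 3
print(dense_interval(pattern, 3, frags, 40))
```

For period 3 both fragments of 'a' qualify, so the dense interval is the overlap of those fragments with the fragment of 'b'. For period 2 neither fragment of 'a' qualifies and the dense interval is empty.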

Algorithm 1 DPMiner

Input: A time series $T = c_1 c_2 \ldots c_n$; alphabet $\Sigma$; maximal distance $d_{max}$; fragment length coefficient $\mu$; periodicity threshold $min\_conf$.

Output: Patterns of period $[2, d_{max}]$ for $T$.

1: Scan the time series once to find the fragment set $S_s$ for each symbol $s \in \Sigma$;
2: for each symbol $s \in \Sigma$ do
3:   Delete the fragments of length less than $\mu \times |T|$ from $S_s$;
4:   Compute the lower bound period of every fragment in $S_s$;
5: end for
6: for period $p = 2$ to $d_{max}$ do
7:   For each symbol $s$, discover the frequent 1-patterns, $|F1_s|$, from $S_s$;
8:   Merge all $|F1_s|$, $s \in \Sigma$, to obtain the max-pattern $P_{max}$;
9:   Let $P_{max}$ with period $p$ be the root node of the max-subpattern tree $R$, and compute the dense support $DS(R, p)$;
10:   Scan $DS(R, p)$ to construct $R$;
11:   Traverse $R$ to output the frequent patterns.
12: end for

To mine a k-pattern, we only need to integrate the dense regions of all of its itemsets and inspect the intersection regions, namely the dense intervals of this pattern.

3 DPMiner

The density-based pruning strategy is incorporated in DPMiner (Dense Periodic pattern Miner), whose outline is described in Algorithm 1. The algorithm mines dense periodic patterns in two phases. The first phase (Steps 1 to 5 in Algorithm 1) scans the time series once to obtain the dense fragments for each symbol $s$ in $\Sigma$. The second phase (Steps 6 to 12 in Algorithm 1) utilizes a top-down method that is similar to the one proposed in [3], except that it only scans the union of the dense regions of the corresponding root node's itemsets instead of the entire time series.

DEFINITION 3.1 For a max-subpattern tree $R$ taking the max-pattern $P_{max}$ with period $p$, $(I_1, \ldots, I_p)$, as its root node, the dense support of tree $R$, $DS(R, p)$, is defined to be the union of all the dense regions of its root node's itemsets, namely $DS(R, p) = DR(I_1, p) \cup \ldots \cup DR(I_p, p)$.

The dense support of tree $R$ thoroughly covers all segments supporting the max-pattern or its subpatterns. In other words, only those segments in $DS(R)$ are useful for the construction of the max-subpattern tree. This allows us to prune away many unnecessary segments.

Figure 2. Time Comparison of DPMiner and MTHS. Panels: (a) DATA-6-10000, (b) DATA-21-75000, (c) PACKET. Each panel plots run time (ms) against periods for DPMiner and the max-subpattern tree hit set method.

We also modify the structure of the max-subpattern tree to include the patterns' dense intervals. First, the dense support of a max-subpattern tree $R$ is computed according to its max-pattern and Definition 3.1. Then, for every segment in the dense support of $R$, we check whether it supports the max-pattern $P_{max}$ or any subpattern of the max-pattern. If it does, the maximal subpattern $P'$ that it supports is looked up by searching from the node $P_{max}$ down its branches. If the node $P'$ is found, its count is increased by 1. Otherwise, a new node $P'$ is created (together with any non-existent ancestors) and the dense interval for $P'$ is computed accordingly.

Complexity Analysis. The first phase of DPMiner scans the time series once. The total number of characters scanned in one iteration of the second phase is $\alpha \times |T|$, where $\alpha \le 2$ and $\alpha$ is bounded by the time series. For a range of $k$ ($= d_{max}$) period values, we scan $k \times \alpha \times |T|$ characters. Thus, the total number of characters scanned by DPMiner is $(1 + k \times \alpha) \times |T|$ for mining periodic patterns of $k$ periods, whereas the algorithm in [3] needs to scan $2k \times |T|$ characters.
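As a rough illustration of these two counts, the following Python sketch evaluates both expressions. The function name `scanned_characters` and the value $\alpha = 0.5$ are hypothetical; the paper does not report a measured $\alpha$. From the two formulas it follows that DPMiner scans fewer characters whenever $\alpha < 2 - 1/k$.

```python
def scanned_characters(series_len, k, alpha):
    """Return (DPMiner count, MTHS count) of scanned characters."""
    dpminer = (1 + k * alpha) * series_len   # one initial scan + k dense scans
    mths = 2 * k * series_len                # count stated for the method in [3]
    return dpminer, mths


# Hypothetical setting: |T| = 75000, k = d_max = 30, alpha = 0.5 (assumed).
dp, mt = scanned_characters(series_len=75000, k=30, alpha=0.5)
print(dp, mt, dp < mt)
```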

4 Experiment Evaluation

We present a study of DPMiner versus the max-subpattern tree hit set (MTHS) method [3]. Testing was done on a 3 GHz Pentium 4 PC with 1 GB of memory, running Windows XP. We generated two synthetic datasets: (1) DATA-6-10000 is a time series following a normal distribution with 6 symbols and 10000 characters; (2) DATA-21-75000 is a time series following a normal distribution with 21 symbols and 75000 characters. We also use the real-life dataset PACKET¹. We discretize its values into a sequence of 360K characters over an alphabet of 26 symbols.

Figure 2 shows the performance of DPMiner (with $d_{max} = 30$ and $\mu = 0.01$) and the MTHS method. We observe that DPMiner has performance similar to that of the MTHS method for the DATA-6-10000 dataset, and outperforms MTHS for the datasets DATA-21-75000 and PACKET.

¹ http://www.cs.ucr.edu/~eamonn/TSDMA/packet.data

We also compare the patterns discovered by the two algorithms on DATA-21-75000 and PACKET. We observe that DPMiner can discover the dense periodic patterns as well as their density ranges, while MTHS does not detect any frequent patterns on the two datasets.

5 Conclusion

In this paper, we have defined the problem of mining dense periodic patterns. We introduced the concepts of density and fragment, and a strategy for pruning the search space. We have developed a mining algorithm called DPMiner to discover the dense periodic patterns. Experimental results on both synthetic and real-life datasets indicate the effectiveness and efficiency of DPMiner. The results also show that DPMiner outperforms the existing max-subpattern tree hit set method on large-alphabet datasets.

Acknowledgements. We would like to thank Mohamed G. Elfeky for giving us the source code, and Junlian Xiang for her contribution to this research.

References

[1] W. Aref, M. Elfeky, and A. Elmagarmid. Incremental, online, and merge mining of partial periodic patterns in time series databases. IEEE TKDE, 16(3):332–342, 2004.

[2] M. Elfeky, W. Aref, and A. Elmagarmid. Periodicity detection in time series databases. IEEE TKDE, 17(7):875–887, 2005.

[3] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In IEEE ICDE, 1999.

[4] S. Ma and J. Hellerstein. Mining partially periodic event patterns with unknown periods. In IEEE ICDE, 2001.

[5] C. Sheng, W. Hsu, and M. L. Lee. Efficient mining of dense periodic patterns in time series database. Technical Report TR20/05, National University of Singapore, 2005.
