[ieee 22nd international conference on data engineering (icde'06) - atlanta, ga, usa...
Mining Dense Periodic Patterns in Time Series Data
Chang Sheng, Wynne Hsu, Mong Li Lee
School of Computing, National University of Singapore
{shengcha, whsu, leeml}@comp.nus.edu.sg
Abstract
Existing techniques to mine periodic patterns in time series data are focused on discovering full-cycle periodic patterns from an entire time series. However, many useful partial periodic patterns are hidden in long and complex time series data. In this paper, we aim to discover the partial periodicity in local segments of the time series data. We introduce the notion of character density to partition the time series into variable-length fragments and to determine the lower bound of each character's period. We propose a novel algorithm, called DPMiner, to find the dense periodic patterns in time series data. Experimental results on both synthetic and real-life datasets demonstrate that the proposed algorithm is effective and efficient in revealing interesting dense periodic patterns.
1 Introduction
One area of research in time series databases is periodicity detection. Two kinds of periodicity exist: full-cycle periodicity and partial periodicity. In full-cycle periodicity, every point in the time series contributes to part of the cycle (e.g., the seasonal cycle of the year). In partial periodicity, only a portion of the time series data is essential to the mining results. Recent work has focused on partial periodicity detection.
Previous works [3, 4, 1, 2] devise methods to discover potential periods from the entire time series data. These methods are not applicable if the periodic patterns occur only within small segments of the time series. For example, suppose Bob is an employee who drives his car to work every day. However, his route may change from month to month. In the first month, he follows the route "home→BLOCK A→BLOCK D→company"; in the second month, he follows the route "home→BLOCK B→BLOCK K→company"; and in the third month, he changes to the route "home→BLOCK C→BLOCK F→company". Existing periodicity detection algorithms will not discover his traveling habits, since each route is present in only a third of the entire three-month period.
In this paper, we develop a new periodicity detection algorithm to efficiently discover such short-period patterns that may exist only in a limited range of the time series. We refer to these patterns as dense periodic patterns. Our contributions are summarized as follows: (1) We introduce the notion of dense periodic patterns, where the periodicity is confined to part of the time series. To the best of our knowledge, this is the first work that deals with periodic patterns in localized segments. (2) We design a pruning strategy to limit the search space to just the feasible periods. (3) We develop a dense periodic pattern mining algorithm called DPMiner that has been demonstrated to be both scalable and efficient.
2 Dense Periodicity
Given an alphabet $\Sigma$, a pattern $P = (I_1 I_2 \ldots I_p)$ is an ordered sequence of itemsets, where each itemset $I_i$ is a set of zero or more non-repeated characters, denoted as $\{c^*\}$, $c \in \Sigma$. For example, $(b\,\{b, c\}\,a\,*)$ is a pattern over the alphabet $\{a, b, c\}$, and $\{b\}$, $\{b, c\}$, $\{a\}$ are three itemsets of this pattern.
Here, we define the concept of density of a symbol in the alphabet $\Sigma$. The distance between any two characters, say $c_i$ and $c_j$, in the time series $T = c_1 \cdots c_n$ is defined as $|i - j|$. For any two characters of the same symbol $s$, if their distance is not greater than the given parameter $d_{max}$, we say they are directly density-reachable. $d_{max}$ is a user-assigned parameter that denotes the maximum allowable distance between two directly density-reachable characters.
For example, in Figure 1, if $d_{max} = 10$, the two characters of symbol $a$ at positions 2 and 5 are directly density-reachable because their distance is 3, which is less than $d_{max}$. On the other hand, the two characters of symbol $a$ at positions 13 and 25 are not directly density-reachable because their distance is 12, which exceeds $d_{max}$.
A dense fragment of symbol $s \in \Sigma$ in the time series $T$, denoted as $F_{s,(bpos,epos)}$, is a continuous sequence of characters in $T$ such that (1) $s$ occurs at both the beginning position ($bpos$) and the ending position ($epos$), and (2) any two neighboring $s$ characters are directly density-reachable. The dense fragment set of $s$ in the time series $T$ is the set of all the dense fragments of $s$, denoted as $FS_s$.
Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE
Figure 1. Example of a Time Series T
In Figure 1, the dense fragment set $FS_a$ for symbol $a$ includes two dense fragments, $F_{a,(2,13)}$ and $F_{a,(25,35)}$. Let $|F_{s,(bpos,epos)}| = epos - bpos$ denote the length of the dense fragment; we have $|F_{a,(2,13)}| = 11$ and $|F_{a,(25,35)}| = 10$.
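The fragment definition can be illustrated with a minimal sketch. The function name `dense_fragments` and the sample series are hypothetical (the actual contents of Figure 1 are not reproduced here); only the positions 2, 5, 13, 25, and 35 of symbol a are taken from the text, and the remaining positions are assumptions chosen so that the two fragments named above result.

```python
from typing import List, Tuple


def dense_fragments(series: str, symbol: str, dmax: int) -> List[Tuple[int, int]]:
    """Return the (bpos, epos) pairs, 1-indexed, of all dense fragments of `symbol`."""
    # 1-indexed positions at which the symbol occurs.
    positions = [i for i, c in enumerate(series, start=1) if c == symbol]
    fragments: List[Tuple[int, int]] = []
    if not positions:
        return fragments
    bpos = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev > dmax:
            # Neighbors are not directly density-reachable: close the
            # current fragment and start a new one at this occurrence.
            fragments.append((bpos, prev))
            bpos = pos
        prev = pos
    fragments.append((bpos, prev))
    return fragments


# Hypothetical series: 'a' at the positions below, 'b' elsewhere.
a_positions = {2, 5, 9, 13, 25, 30, 35}
series = "".join("a" if i in a_positions else "b" for i in range(1, 37))
print(dense_fragments(series, "a", dmax=10))  # [(2, 13), (25, 35)]
```

An isolated occurrence yields a fragment whose beginning and ending positions coincide; whether such fragments are kept is left to the later length filter.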
For a dense fragment $F_{s,(bpos,epos)}$, it is easy to count the frequency of $s$, denoted by $freq_s$. By assigning two parameters, $d_{max}$ and $min\_conf$ (the minimum confidence), we can deduce a lower bound period for all the possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$.
THEOREM 2.1 (Lower Bound Period) The lower bound period of all possible 1-patterns containing symbol $s$ in $F_{s,(bpos,epos)}$ is equal to

$$\frac{|F_{s,(bpos,epos)}| \times min\_conf \times d_{max}}{freq_s \times d_{max} - |F_{s,(bpos,epos)}| \times (1 - min\_conf)}.$$
This theorem allows us to prune from the search space all the periods that are less than the computed lower bound. Please refer to [5] for the proof.
For example, in Figure 1, if $min\_conf = 0.8$ and $d_{max} = 10$, we can compute the minimum period of fragment $F_{a,(2,13)}$, which spans the 12 positions from 2 to 13, to be $\frac{12 \times 0.8 \times 10}{4 \times 10 - 12 \times 0.2} \approx 2.55$. The minimum period obtained indicates that it is not possible to find frequent patterns containing the symbol $a$ with a period of 2. A closer look reveals that with a specified period of 2, $F_{a,(2,13)}$ contains a total of 6 segments, of which at least $\lceil 6 \times 0.8 \rceil = 5$ segments are required to support the 1-patterns containing $a$ (i.e., $(a\,*)$ and $(*\,a)$). However, we only have $freq_a = 4$ in $F_{a,(2,13)}$. In other words, the 1-patterns of $a$ with period 2 cannot be frequent in $F_{a,(2,13)}$.
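The arithmetic in this example can be checked with a short sketch of the Theorem 2.1 formula. The function names `lower_bound_period` and `segments_needed` are hypothetical, and the guard against a non-positive denominator is an added assumption rather than part of the theorem.

```python
import math


def lower_bound_period(frag_len: int, freq: int, dmax: int, min_conf: float) -> float:
    """Evaluate the Theorem 2.1 lower bound for one dense fragment."""
    numerator = frag_len * min_conf * dmax
    denominator = freq * dmax - frag_len * (1.0 - min_conf)
    if denominator <= 0:
        # Assumption: treat this as invalid input instead of guessing a bound.
        raise ValueError("freq * dmax must exceed frag_len * (1 - min_conf)")
    return numerator / denominator


def segments_needed(frag_len: int, period: int, min_conf: float) -> int:
    """Minimum number of supporting segments for a pattern to be frequent."""
    return math.ceil((frag_len // period) * min_conf)


lb = lower_bound_period(frag_len=12, freq=4, dmax=10, min_conf=0.8)
print(round(lb, 2))                 # 2.55
print(segments_needed(12, 2, 0.8))  # 5, which exceeds freq_a = 4
```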
To extend the pruning strategy to the k-patterns, we need to identify all the promising high-density regions and perform mining only in these regions.
DEFINITION 2.1 Given a period $p$ and an itemset $I = \{c_1, c_2, \ldots, c_m\}$, the dense region of $I$ with period $p$, $DR(I, p)$, is defined as the union of all the dense fragments of $c_i$, $1 \le i \le m$, where the lower bound period of each dense fragment is less than or equal to $p$.
The dense region of the empty itemset $*$ is defined to be the entire time series. Note that all the merged dense fragments must have a lower bound period less than or equal to $p$ so that they satisfy the minimal density requirement.
DEFINITION 2.2 For a pattern $P$ with period $p$, $(I_1, \ldots, I_p)$, the dense interval of $P$, $DI(P, p)$, is defined to be the intersection of all the dense regions of its itemset members, namely $DI(P, p) = DR(I_1, p) \cap \ldots \cap DR(I_p, p)$.
Algorithm 1 DPMiner
Input: A time series $T = c_1 c_2 \ldots c_n$
       Alphabet $\Sigma$
       Maximal distance $d_{max}$
       Fragment length coefficient $\mu$
       Periodicity threshold $min\_conf$
Output: Patterns of period $[2, d_{max}]$ for $T$.
1: Scan the time series once to find the fragment set $S_s$ for each symbol $s \in \Sigma$;
2: for each symbol $s \in \Sigma$ do
3:     Delete the fragments of length less than $\mu \times |T|$ from $S_s$;
4:     Compute the lower bound period of every fragment in $S_s$;
5: end for
6: for period $p = 2$ to $d_{max}$ do
7:     For each symbol $s$, discover the frequent 1-patterns, $F1_s$, from $S_s$;
8:     Merge all $F1_s$, $s \in \Sigma$, to obtain the max-pattern $P_{max}$;
9:     Let $P_{max}$ with period $p$ be the root node of the max-subpattern tree $R$; compute the dense support $DS(R, p)$;
10:    Scan $DS(R, p)$ to construct $R$;
11:    Traverse $R$ to output the frequent patterns.
12: end for
To mine a k-pattern, we only need to integrate the dense regions of all of its itemsets, and inspect the intersection regions, namely the dense intervals of this pattern.
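Definitions 2.1 and 2.2 reduce to interval union and intersection, which the following sketch illustrates. All names (`union`, `intersect`, `dense_region`, `dense_interval`) and the fragment table are hypothetical; fragments are assumed to be given as inclusive position intervals paired with an already computed lower bound period.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions


def union(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, disjoint interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def dense_region(itemset: Set[str], p: int,
                 fragments: Dict[str, List[Tuple[Interval, float]]],
                 n: int) -> List[Interval]:
    """DR(I, p): union of fragments whose lower bound period is at most p."""
    if not itemset:
        return [(1, n)]  # the empty itemset covers the entire series
    kept = [iv for s in itemset for iv, lb in fragments.get(s, []) if lb <= p]
    return union(kept)


def dense_interval(pattern: Sequence[Set[str]],
                   fragments: Dict[str, List[Tuple[Interval, float]]],
                   n: int) -> List[Interval]:
    """DI(P, p): intersection of the dense regions of all itemsets of P."""
    p = len(pattern)
    result: List[Interval] = [(1, n)]
    for itemset in pattern:
        result = intersect(result, dense_region(itemset, p, fragments, n))
    return result


# Hypothetical fragments: symbol -> [(interval, lower bound period)].
frags = {"a": [((2, 13), 2.55), ((25, 35), 2.8)], "b": [((8, 30), 2.1)]}
print(dense_interval([{"a"}, {"b"}, set()], frags, n=40))  # [(8, 13), (25, 30)]
```

With a period-2 pattern such as `[{"a"}, {"b"}]`, both fragments of a are excluded by their lower bounds, so the dense interval is empty and the pattern need not be examined.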
3 DPMiner
The density-based pruning strategy is incorporated in DPMiner (Dense Periodic pattern Miner), whose outline is described in Algorithm 1. The algorithm mines dense periodic patterns in two phases. The first phase (Steps 1 to 5 in Algorithm 1) scans the time series once to obtain the dense fragments for each symbol $s$ in $\Sigma$. The second phase (Steps 6 to 12 in Algorithm 1) utilizes a top-down method that is similar to the one proposed in [3], except that it only scans the union of the dense regions of the corresponding root node's itemsets instead of the entire time series.
DEFINITION 3.1 For a max-subpattern tree $R$ taking the max-pattern $P_{max}$ with period $p$, $(I_1, \ldots, I_p)$, as its root node, the dense support of tree $R$, $DS(R, p)$, is defined to be the union of all the dense regions of its root node's itemsets, namely $DS(R, p) = DR(I_1, p) \cup \ldots \cup DR(I_p, p)$.
The dense support of tree $R$ covers all segments supporting the max-pattern or its subpatterns. In other words, only those segments in $DS(R, p)$ are useful for the construction of the max-subpattern tree. This allows us to prune away many unnecessary segments.
Figure 2. Time Comparison of DPMiner and MTHS. [Three plots of run time (ms) versus periods, each comparing DPMiner with the max-subpattern tree hit set method: (a) DATA-6-10000, (b) DATA-21-75000, (c) PACKET.]
We also modify the structure of the max-subpattern tree to include the patterns' dense intervals. First, the dense support of a max-subpattern tree $R$ is computed according to its max-pattern and Definition 3.1. Then, for every segment in the dense support of $R$, we check whether it supports the max-pattern $P_{max}$ or any subpattern of the max-pattern. If it does, the maximal subpattern $P'$ that it supports is located by searching from the node $P_{max}$ down its branches. If the node $P'$ is found, its count is increased by 1. Otherwise, a new node $P'$ is created (together with any non-existent ancestors) and the dense interval for $P'$ is computed accordingly.
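The segment check described above can be sketched in simplified form. This is not the tree structure itself: a flat counter keyed by the maximal hit subpattern stands in for the max-subpattern tree, and dense intervals are not stored. The names `count_hits` and `support` are hypothetical, segments are assumed to be aligned to position 1 of the series, and a segment is assumed to be relevant when it overlaps the dense support.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions
Hit = Tuple[str, ...]


def count_hits(series: str, pmax: Sequence[Set[str]],
               dense_support: List[Interval]) -> Counter:
    """Count, per maximal hit subpattern of pmax, the segments supporting it."""
    p = len(pmax)
    hits: Counter = Counter()
    # 1-indexed start positions of consecutive period-p segments.
    for seg_start in range(1, len(series) - p + 2, p):
        seg_end = seg_start + p - 1
        # Assumption: skip segments that do not overlap the dense support.
        if not any(seg_start <= hi and lo <= seg_end for lo, hi in dense_support):
            continue
        segment = series[seg_start - 1:seg_end]
        # Keep the characters that match pmax; everything else becomes '*'.
        hit = tuple(c if c in pmax[i] else "*" for i, c in enumerate(segment))
        if any(x != "*" for x in hit):
            hits[hit] += 1
    return hits


def support(hits: Counter, pattern: Hit) -> int:
    """Frequency of a subpattern: total count of all hits that contain it."""
    return sum(cnt for hit, cnt in hits.items()
               if all(q == "*" or q == h for q, h in zip(pattern, hit)))


hits = count_hits("abcabdxbcxxxabc", [{"a"}, {"b"}, {"c"}], [(1, 9)])
print(support(hits, ("a", "b", "*")))  # 2
print(support(hits, ("*", "b", "*")))  # 3
```

The last segment "abc" at positions 13 to 15 is never examined because it lies outside the dense support, which is the saving the pruning strategy aims for.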
Complexity Analysis. The first phase of DPMiner scans the time series once. The total number of characters scanned in one iteration of the second phase is $\alpha \times |T|$, where $\alpha \le 2$ and the value of $\alpha$ is determined by the time series. For a range of $k$ ($= d_{max}$) period values, we scan $k \times \alpha \times |T|$ characters. Thus, the total number of characters scanned by DPMiner is $(1 + k \times \alpha) \times |T|$ for mining periodic patterns of $k$ periods, whereas the algorithm in [3] needs to scan $2k \times |T|$ characters.
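To make the comparison concrete, the two scan counts can be evaluated side by side. The function names and the value $\alpha = 0.5$ are hypothetical; the series length and $k$ are borrowed from the experimental setting only for illustration.

```python
def dpminer_scanned(n: int, k: int, alpha: float) -> float:
    """Characters scanned by DPMiner: one initial scan plus k pruned scans."""
    return (1 + k * alpha) * n


def mths_scanned(n: int, k: int) -> int:
    """Characters scanned by the method of [3]: two full scans per period."""
    return 2 * k * n


n, k = 75000, 30
print(dpminer_scanned(n, k, alpha=0.5))  # 1200000.0
print(mths_scanned(n, k))                # 4500000
```

Comparing the two expressions shows that DPMiner scans fewer characters exactly when $\alpha < 2 - 1/k$; at the upper bound $\alpha = 2$ it scans $|T|$ characters more than the method of [3].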
4 Experimental Evaluation
We present a study of DPMiner versus the max-subpattern tree hit set (MTHS) method [3]. Testing was done on a Pentium 4 3 GHz PC with 1 GB of memory, running Windows XP. We generated two synthetic datasets: (1) DATA-6-10000 is a time series following a normal distribution with 6 symbols and 10000 characters; (2) DATA-21-75000 is a time series following a normal distribution with 21 symbols and 75000 characters. We also use the real-life dataset PACKET (see footnote 1). We discretize its values into a sequence of 360K characters over an alphabet of 26 symbols.
Figure 2 shows the performance of DPMiner (with $d_{max} = 30$ and $\mu = 0.01$) and the MTHS method. We observe that DPMiner has similar performance to the MTHS method for the DATA-6-10000 dataset, and outperforms MTHS for the datasets DATA-21-75000 and PACKET.
Footnote 1: http://www.cs.ucr.edu/~eamonn/TSDMA/packet.data
We also compare the patterns discovered by the two algorithms on DATA-21-75000 and PACKET. We observe that DPMiner can discover the dense periodic patterns as well as their density ranges, while MTHS does not detect any frequent patterns on the two datasets.
5 Conclusion
In this paper, we have defined the problem of mining dense periodic patterns. We introduced the concepts of density and fragment, and a strategy for pruning the search space. We have developed a mining algorithm called DPMiner to discover the dense periodic patterns. Experimental results on both synthetic and real-life datasets indicate the effectiveness and efficiency of DPMiner. The results also show that DPMiner outperforms the existing max-subpattern tree hit set method on large-alphabet datasets.
Acknowledgements. We would like to thank Mohamed G. Elfeky for giving us the source code, and Junlian Xiang for her contribution to this research.
References
[1] W. Aref, M. Elfeky, and A. Elmagarmid. Incremental, online, and merge mining of partial periodic patterns in time series databases. IEEE TKDE, 16(3):332–342, 2004.
[2] M. Elfeky, W. Aref, and A. Elmagarmid. Periodicity detection in time series databases. IEEE TKDE, 17(7):875–887, 2005.
[3] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In IEEE ICDE, 1999.
[4] S. Ma and J. Hellerstein. Mining partially periodic event patterns with unknown periods. In IEEE ICDE, 2001.
[5] C. Sheng, W. Hsu, and M. L. Lee. Efficient mining of dense periodic patterns in time series database. Technical Report TR20/05, National University of Singapore, 2005.