8timeseries.ppt
TRANSCRIPT
7/23/2019 8timeseries.ppt
http://slidepdf.com/reader/full/8timeseriesppt 1/26
Data Mining:
Concepts and Techniques
Mining sequence patterns in transactional databases
Patterns
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences: First buy computer, then CD-ROM, and then digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
Mining?
Given a set of sequences, find the complete set of frequent subsequences
A sequence database
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
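The subsequence and support definitions on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original deck: the helper names `parse`, `is_subsequence` and `support` are illustrative, and sequences are written without the angle brackets.

```python
def parse(text):
    """Turn 'a(abc)d' into a list of item sets, one set per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Greedy left-to-right match: every pattern element must be a subset
    of a distinct sequence element, with the order preserved."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The four-sequence database from the slide.
SDB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), SDB[10]))  # containment check from the slide
print(support(parse("(ab)c"), SDB))               # compared against min_sup = 2
```

Only sequences 10 and 30 have an element that holds both a and b followed by a later c, which is why <(ab)c> reaches the threshold min_sup = 2.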
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm: Agrawal & Srikant. Mining sequential patterns, ICDE'95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei, et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
The Apriori Property of Sequential Patterns
A basic property: Apriori (Agrawal & Srikant'94)
If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Seq. ID Sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
Given support threshold min_sup = 2
GSP: Generalized Sequential Pattern Mining
GSP (Generalized Sequential Pattern) mining algorithm proposed by Agrawal and Srikant, EDBT'96
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each candidate sequence
generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found
Major strength: Candidate pruning by Apriori
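The level-wise loop in the outline can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not GSP itself: candidates are formed by appending one frequent item to a frequent length-k pattern (as a new element, or into the last element), whereas GSP joins pairs of length-k patterns and prunes more aggressively. Candidate counts therefore match GSP only for the first two levels. All function names are hypothetical.

```python
def parse(text):
    """'a(bd)c' -> tuple of item sets, one frozenset per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return tuple(elements)


def contains(sequence, pattern):
    """Greedy in-order matching of pattern elements against the sequence."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def fmt(pattern):
    parts = ["".join(sorted(e)) for e in pattern]
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"


def mine_levelwise(database, min_sup):
    items = sorted({x for seq in database for e in seq for x in e})
    level = [(frozenset(x),) for x in items]          # length-1 candidates
    frequent, scans, freq_items = {}, [], None
    while level:
        # One database scan: count the support of every candidate.
        counts = {p: sum(contains(seq, p) for seq in database) for p in level}
        survivors = [p for p in level if counts[p] >= min_sup]
        scans.append((len(level), len(survivors)))
        frequent.update((fmt(p), counts[p]) for p in survivors)
        if freq_items is None:
            freq_items = [min(p[0]) for p in survivors]
        # Next level: extend each frequent pattern by one frequent item.
        level = []
        for p in survivors:
            for x in freq_items:
                level.append(p + (frozenset(x),))           # new element
                if x > max(p[-1]):
                    level.append(p[:-1] + (p[-1] | {x},))   # same element
    return frequent, scans


DB = [parse(s) for s in
      ("(bd)cb(ac)", "(bf)(ce)b(fg)", "(ah)(bf)abf", "(be)(ce)d", "a(bd)bcb(ade)")]
frequent, scans = mine_levelwise(DB, min_sup=2)
print(scans[:2])               # (candidates, frequent patterns) for scans 1 and 2
print(frequent["<(bd)cba>"])   # support of the length-5 pattern
```

Because every frequent pattern has a frequent prefix obtained by dropping its last item, this extension scheme still finds the complete set; it just tests more candidates than GSP from the third scan onward.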
Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates
Seq. ID Sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
min_sup = 2
Cand Sup
<a> 3
<b> 5
<c> 4
<d> 3
<e> 3
<f> 2
<g> 1
<h> 1
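The first scan amounts to counting, for each item, the number of sequences in which it occurs at least once. A small sketch (the variable names are illustrative, and sequences are stored as plain strings without angle brackets):

```python
from collections import Counter

DB = {
    10: "(bd)cb(ac)",
    20: "(bf)(ce)b(fg)",
    30: "(ah)(bf)abf",
    40: "(be)(ce)d",
    50: "a(bd)bcb(ade)",
}
MIN_SUP = 2


def items_of(sequence_text):
    """Distinct items of one sequence; parentheses are ignored."""
    return {ch for ch in sequence_text if ch.isalpha()}


support = Counter()
for text in DB.values():
    support.update(items_of(text))   # a sequence counts an item at most once

frequent_items = sorted(item for item, n in support.items() if n >= MIN_SUP)
print(dict(sorted(support.items())))
print(frequent_items)
```

The items g and h each occur in a single sequence, so they fall below min_sup and take no part in later candidate generation.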
GSP: Generating Length-2 Candidates
      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

      <a>   <b>     <c>     <d>     <e>     <f>
<a>         <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                 <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                         <(cd)>  <(ce)>  <(cf)>
<d>                                 <(de)>  <(df)>
<e>                                         <(ef)>
<f>
51 length-2 candidates
Without Apriori property, 8*8+8*7/2=92 candidates
Apriori prunes 44.57% candidates
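The candidate counts on this slide follow from simple enumeration: n*n ordered pairs plus n*(n-1)/2 unordered pairs that share one element. A short sketch, with an illustrative helper name, that enumerates both tables and reproduces the arithmetic:

```python
from itertools import combinations, product


def length2_candidates(items):
    """All <xy> (two elements) and <(xy)> (one element) candidates."""
    sequential = ["<%s%s>" % pair for pair in product(items, repeat=2)]
    same_element = ["<(%s%s)>" % pair for pair in combinations(items, 2)]
    return sequential + same_element


with_apriori = length2_candidates("abcdef")        # only the 6 frequent items
without_apriori = length2_candidates("abcdefgh")   # all 8 items
pruned_share = 1 - len(with_apriori) / len(without_apriori)

print(len(with_apriori), len(without_apriori))
print("%.2f%% of candidates pruned" % (100 * pruned_share))
```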
The GSP Mining Process
<a> <b> <c> <d> <e> <f> <g> <h>
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
<abb> <aab> <aba> <baa> <bab> …
<abba> <(bd)bc> …
<(bd)cba>
1st scan: 8 cand. 6 length-1 seq. pat.
2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all
3rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all
4th scan: 8 cand. 6 length-4 seq. pat.
5th scan: 1 cand. 1 length-5 seq. pat.
Cand. cannot pass sup. threshold
Cand. not in DB at all
Seq. ID Sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
min_sup = 2
Candidate Generate-and-test: Drawbacks
A huge set of candidate sequences is generated.
Especially 2-item candidate sequences.
Multiple scans of the database are needed.
The length of each candidate grows by one at each database scan.
Inefficient for mining long sequential patterns.
A long pattern grows up from short patterns
The number of short patterns is exponential in the length of mined patterns.
The SPADE Algorithm
SPADE (Sequential PAttern Discovery using Equivalence classes) developed by Zaki 2001
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of
Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
The SPADE Algorithm
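The vertical format can be illustrated by turning a sequence database into per-item lists of <SID, EID> pairs and growing patterns by joining those lists. The sketch below is an illustration only, not Zaki's full algorithm: it reuses the four-sequence database from the earlier slides, takes the 1-based element position as the EID (an assumption; real data would carry event timestamps), and the function names are hypothetical.

```python
from collections import defaultdict


def parse(text):
    """'a(abc)d' -> list of item sets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements


def to_vertical(database):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, element in enumerate(sequence, start=1):
            for item in sorted(element):
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(first, second):
    """id-list of <x y>: y occurs in a later element of the same sequence."""
    return sorted({(s2, e2) for s1, e1 in first for s2, e2 in second
                   if s1 == s2 and e1 < e2})


def equality_join(first, second):
    """id-list of <(x y)>: both items occur in the same element."""
    return sorted(set(first) & set(second))


def support(idlist):
    """Number of distinct sequences that appear in the id-list."""
    return len({sid for sid, _ in idlist})


SDB = {10: parse("a(abc)(ac)d(cf)"), 20: parse("(ad)c(bc)(ae)"),
       30: parse("(ef)(ab)(df)cb"), 40: parse("eg(af)cbc")}
idlists = to_vertical(SDB)
print(idlists["a"])
print(support(temporal_join(idlists["a"], idlists["b"])))   # support of <ab>
print(support(equality_join(idlists["a"], idlists["b"])))   # support of <(ab)>
```

Once the id-lists exist, support counting needs no further pass over the original sequences; each longer pattern's id-list is derived from the id-lists of shorter ones.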
Bottlenecks of GSP and SPADE
A huge set of candidates could be generated
1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
$1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$
Multiple scans of database in mining
Breadth-first search
Mining long sequential patterns
Needs an exponential number of short candidates
A length-100 sequential pattern needs $10^{30}$ candidate sequences!
$\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
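Both figures on this slide can be verified with exact integer arithmetic. A quick check in Python (the variable names are illustrative):

```python
from math import comb

# Length-2 candidates from 1,000 frequent items:
# ordered pairs <xy> plus unordered same-element pairs <(xy)>.
length2_candidates = 1000 * 1000 + 1000 * 999 // 2

# Non-empty sub-patterns of a length-100 pattern, summed by size.
subpatterns = sum(comb(100, i) for i in range(1, 101))

print(length2_candidates)
print(subpatterns == 2**100 - 1)
print("%.2e" % subpatterns)
```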
Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>
Prefix Suffix (Prefix-Based Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<ab> <(_c)(ac)d(cf)>
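The suffix table can be reproduced with a small projection routine. This is a sketch under stated assumptions: the prefix is non-empty, items are single letters, the prefix is matched at its leftmost occurrence, and the names `parse`, `show` and `project` are illustrative. Sequences are written without angle brackets.

```python
def parse(text):
    """'a(abc)d' -> ['a', 'abc', 'd'], each element a sorted string of items."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            elements.append(text[i])
            i += 1
    return elements


def show(element, partial=False):
    """Render one element; a partial element is written as (_xy)."""
    if partial:
        return "(_" + element + ")"
    return element if len(element) == 1 else "(" + element + ")"


def project(sequence_text, prefix_text):
    """Suffix of the sequence w.r.t. the prefix, or None if it is not contained."""
    sequence, prefix = parse(sequence_text), parse(prefix_text)
    j = -1
    for wanted in prefix:
        j += 1                                    # next match must come later
        while j < len(sequence) and not set(wanted) <= set(sequence[j]):
            j += 1
        if j == len(sequence):
            return None
    last_item = prefix[-1][-1]
    # Items of the matched element that come after the last prefix item.
    rest = "".join(x for x in sequence[j] if x > last_item)
    head = show(rest, partial=True) if rest else ""
    return head + "".join(show(e) for e in sequence[j + 1:])


s = "a(abc)(ac)d(cf)"
for prefix in ("a", "aa", "ab"):
    print(prefix, project(s, prefix))
```

Applying `project` with prefix `a` to every sequence of the example database yields the entries of the <a>-projected database.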
Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
…
The ones having prefix <f>
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets
Having prefix <aa>;
…
Having prefix <af>
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
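The recursive partitioning can be sketched as a compact pattern-growth miner. This is an illustrative simplification rather than the published PrefixSpan code: instead of copying suffixes, each projected database is kept as a list of (sequence, element index) pointers, where the index marks the earliest element at which the last prefix element can be matched. All names are hypothetical.

```python
def parse(text):
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt(pattern):
    parts = ["".join(sorted(e)) for e in pattern]
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"


def prefixspan(database, min_sup):
    patterns = {}

    def advance(pointers, offset, fits):
        """Move each pointer to the first element (from j + offset) that fits."""
        moved = []
        for seq, j in pointers:
            for k in range(j + offset, len(seq)):
                if fits(seq[k]):
                    moved.append((seq, k))
                    break
        return moved

    def grow(prefix, pointers):
        last = prefix[-1] if prefix else None
        s_count, i_count = {}, {}
        for seq, j in pointers:
            s_items = set().union(*seq[j + 1:])     # items that start a new element
            i_items = set()                         # items that join the last element
            if last is not None:
                for element in seq[j:]:
                    if last <= element:
                        i_items |= {x for x in element if x > max(last)}
            for x in s_items:
                s_count[x] = s_count.get(x, 0) + 1
            for x in i_items:
                i_count[x] = i_count.get(x, 0) + 1
        for x in sorted(s_count):
            if s_count[x] >= min_sup:
                longer = prefix + [frozenset(x)]
                patterns[fmt(longer)] = s_count[x]
                grow(longer, advance(pointers, 1, lambda e, x=x: x in e))
        for x in sorted(i_count):
            if i_count[x] >= min_sup:
                merged = last | {x}
                longer = prefix[:-1] + [merged]
                patterns[fmt(longer)] = i_count[x]
                grow(longer, advance(pointers, 0, lambda e, m=merged: m <= e))

    grow([], [(seq, -1) for seq in database])
    return patterns


SDB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
patterns = prefixspan(SDB, min_sup=2)
print(sorted(p for p in patterns if len(p) == 3))            # length-1 patterns
print({p: n for p, n in patterns.items()
       if p in ("<aa>", "<ab>", "<(ab)>", "<ac>", "<ad>", "<af>")})
```

No candidate is generated and then tested here: only items that actually occur in the projected sequences are counted, and each recursive call works on a smaller set of pointers.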
Completeness of PrefixSpan
SDB
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Having prefix <aa>: <aa>-proj. db
…
Having prefix <af>: <af>-proj. db
Having prefix <b>: <b>-projected database …
Having prefix <c>, …, <f>: …
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
Can be improved by pseudo-projections
Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear repeatedly in recursive projected databases
When (projected) database can be held in main memory, use pointers to form projections
Pointer to the sequence
Offset of the postfix
s = <a(abc)(ac)d(cf)>
Prefix <a>: s|<a> = <(abc)(ac)d(cf)>, stored as s|<a>: (pointer to s, 2)
Prefix <ab>: s|<ab> = <(_c)(ac)d(cf)>, stored as s|<ab>: (pointer to s, 4)
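The two offsets on this slide can be recomputed by counting items rather than elements. The sketch below assumes 1-based item offsets over the flattened sequence and leftmost matching of a non-empty prefix; `parse` and `pseudo_project` are illustrative names.

```python
def parse(text):
    """'a(abc)d' -> ['a', 'abc', 'd'], each element a sorted string of items."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            elements.append(text[i])
            i += 1
    return elements


def pseudo_project(sequence, prefix):
    """Return (sequence, offset): the sequence object itself plus the 1-based
    item offset where the postfix starts. Returns None if there is no match."""
    j = -1
    for wanted in prefix:
        j += 1
        while j < len(sequence) and not set(wanted) <= set(sequence[j]):
            j += 1
        if j == len(sequence):
            return None
    items_before = sum(len(e) for e in sequence[:j])
    last_item = prefix[-1][-1]
    offset = items_before + sequence[j].index(last_item) + 2
    return sequence, offset


s = parse("a(abc)(ac)d(cf)")
proj_a = pseudo_project(s, parse("a"))
proj_ab = pseudo_project(s, parse("ab"))
print(proj_a[1], proj_ab[1])
print(proj_a[0] is s and proj_ab[0] is s)   # nothing was copied
```

Both projections share the same underlying list, so growing the prefix costs one integer per sequence instead of a copied postfix.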
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when database can be held in main memory
However, it is not efficient when database cannot fit in main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set fits in memory
Constraint-Based Seq.-Pattern Mining
Constraint-based sequential pattern mining
Constraints: User-specified, for focused mining of desired patterns
How to explore efficient mining with constraints? Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
Inconvertible: E.g., avg(S) - median(S) = 0
From Sequential Patterns to Structured Patterns
Sets, sequences, trees, graphs, and other structures
Transaction DB: Sets of items
{{i1, i2, …, im}, …}
Seq. DB: Sequences of sets:
{<{i1, i2}, …, {im, in, ik}>, …}
Sets of Sequences:
{{<i1, i2>, …, <im, in, ik>}, …}
Sets of trees: {t1, t2, …, tn}
Sets of graphs (mining for frequent subgraphs): {g1, g2, …, gn}
Mining structured patterns in XML documents, bio-chemical structures, etc.
Episodes and Episode Pattern Mining
Other methods for specifying the kinds of patterns
Serial episodes: A → B
Parallel episodes: A & B
Regular expressions: (A | B)C*(D → E)
Methods for episode pattern mining
Variations of Apriori-like algorithms, e.g., GSP
Database projection-based pattern growth
Similar to the frequent pattern growth without candidate generation
Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power consumption, etc.
Full periodicity
Every point in time contributes (precisely or approximately) to the periodicity
Partial periodicity: A more general notion
Only some segments contribute to the periodicity
Jim reads NY Times 7:00-7:30 am every week day
Cyclic association rules
Associations which form cycles
Methods
Full periodicity: FFT, other statistical analysis methods
Partial and cyclic periodicity: Variations of Apriori-like mining methods
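For full periodicity, a frequency-domain check can be sketched with a plain discrete Fourier transform. The code below is a dependency-free illustration that uses a naive O(n²) DFT in place of an FFT, and the readings are made-up data with a weekly cycle; the function name `dominant_period` is hypothetical.

```python
import cmath
import math


def dominant_period(signal):
    """Period (in samples) of the strongest non-constant frequency component,
    or None if the signal is constant."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k if best_k else None


# Hypothetical daily power readings over four weeks with a weekly cycle.
readings = [10 + 3 * math.sin(2 * math.pi * day / 7) for day in range(28)]
print(dominant_period(readings))
```

This approach suits signals where every time point follows the cycle; partial periodicity, where only some segments repeat, calls for the Apriori-like methods named on the slide.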
Ref: Mining Sequential Patterns
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.