8timeseries.ppt

7/23/2019 8timeseries.ppt http://slidepdf.com/reader/full/8timeseriesppt 1/26 Data Mining: Concepts and Techniques  Mining sequence patterns in transactional databases

Upload: 081325296516

Post on 17-Feb-2018

TRANSCRIPT

Page 1: 8timeseries.ppt

7/23/2019 8timeseries.ppt

http://slidepdf.com/reader/full/8timeseriesppt 1/26

Data Mining:

Concepts and Techniques

Mining sequence patterns in transactional databases

Page 2: 8timeseries.ppt

Sequence Databases & Sequential Patterns

- Transaction databases, time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining
  - Customer shopping sequences: first buy computer, then CD-ROM, and then digital camera, within 3 months.
  - Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures

Page 3: 8timeseries.ppt

What Is Sequential Pattern Mining?

- Given a set of sequences, find the complete set of frequent subsequences

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

- A sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
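
The subsequence and support definitions can be checked with a short Python sketch. The helper names `parse`, `is_subsequence`, and `support` are illustrative and not from the slides; the sketch assumes single-character items and matches elements greedily from left to right.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of item sets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(sub, seq):
    """True if each element of sub is a subset of a later and later element of seq."""
    pos = 0
    for element in sub:
        # Advance to the first remaining element of seq that contains it.
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


DB = {sid: parse(s) for sid, s in {
    10: "a(abc)(ac)d(cf)",
    20: "(ad)c(bc)(ae)",
    30: "(ef)(ab)(df)cb",
    40: "eg(af)cbc",
}.items()}

print(is_subsequence(parse("a(bc)dc"), DB[10]))  # True
print(support(parse("(ab)c"), DB))               # 2
```

Only sequences 10 and 30 contain <(ab)c>, so its support is 2 and it meets min_sup = 2.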

Page 4: 8timeseries.ppt

Challenges on Sequential Pattern Mining

- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient, scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints

Page 5: 8timeseries.ppt

Sequential Pattern Mining Algorithms

- Concept introduction and an initial Apriori-like algorithm
  - Agrawal & Srikant. Mining sequential patterns, ICDE'95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei, et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

Page 6: 8timeseries.ppt

The Apriori Property of Sequential Patterns

- A basic property: Apriori (Agrawal & Srikant'94)
  - If a sequence S is not frequent, then none of the super-sequences of S is frequent
  - E.g., <hb> is infrequent, and so are <hab> and <(ah)b>

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2
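
The Apriori property can be spot-checked on this database. The sketch below is illustrative only: sequences are written as tuples of strings (one string per element, single-character items), and `contains` and `support` are hypothetical helper names.

```python
DB = [
    ("bd", "c", "b", "ac"),
    ("bf", "ce", "b", "fg"),
    ("ah", "bf", "a", "b", "f"),
    ("be", "ce", "d"),
    ("a", "bd", "b", "c", "b", "ade"),
]


def contains(seq, pattern):
    """Greedy subsequence test: each pattern element must fit in a later element of seq."""
    it = iter(seq)  # shared iterator, so every match resumes after the previous one
    return all(any(set(p) <= set(e) for e in it) for p in pattern)


def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)


for pattern in [("h", "b"), ("h", "a", "b"), ("ah", "b")]:
    print(pattern, support(pattern))
```

All three patterns occur only in sequence 30, so each has support 1. Because <hb> is already below min_sup = 2, neither super-sequence needs to be counted.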

Page 7: 8timeseries.ppt

GSP: Generalized Sequential Pattern Mining

- GSP (Generalized Sequential Pattern) mining algorithm
  - proposed by Agrawal and Srikant, EDBT'96
- Outline of the method
  - Initially, every item in DB is a candidate of length-1
  - for each level (i.e., sequences of length-k) do
    - scan database to collect support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - repeat until no frequent sequence or no candidate can be found
- Major strength: Candidate pruning by Apriori
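
The level-wise loop in the outline can be sketched in Python as follows. This is a simplified generate-and-test scheme under stated assumptions, not GSP's exact join step: candidates are produced by extending each frequent length-k pattern with one frequent item, then pruned with the Apriori property. All function names are illustrative, and the database is the five-sequence example used with min_sup = 2.

```python
from itertools import chain

DB = [
    ("bd", "c", "b", "ac"),
    ("bf", "ce", "b", "fg"),
    ("ah", "bf", "a", "b", "f"),
    ("be", "ce", "d"),
    ("a", "bd", "b", "c", "b", "ade"),
]
MIN_SUP = 2


def contains(seq, pattern):
    """Greedy test that pattern is a subsequence of seq."""
    it = iter(seq)
    return all(any(set(p) <= set(e) for e in it) for p in pattern)


def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)


def drop_one(pattern):
    """Yield every sub-pattern obtained by deleting a single item."""
    for i, elem in enumerate(pattern):
        for j in range(len(elem)):
            rest = elem[:j] + elem[j + 1:]
            yield pattern[:i] + ((rest,) if rest else ()) + pattern[i + 1:]


def grow(frequent, items):
    """Extend each frequent length-k pattern by one frequent item."""
    for p in frequent:
        for x in items:
            yield p + (x,)                   # x starts a new element
            if x > p[-1][-1]:
                yield p[:-1] + (p[-1] + x,)  # x joins the last element


def gsp():
    items = sorted(set(chain.from_iterable(chain.from_iterable(DB))))
    level = [(x,) for x in items]            # length-1 candidates
    levels, n_candidates = [], []
    while level:
        n_candidates.append(len(level))
        frequent = [c for c in level if support(c) >= MIN_SUP]  # one DB scan
        if not frequent:
            break
        levels.append(frequent)
        known = set(frequent)
        freq_items = [p[0] for p in levels[0]]
        # Apriori pruning: keep a candidate only if every length-k
        # sub-pattern of it is frequent.
        level = [c for c in grow(frequent, freq_items)
                 if all(s in known for s in drop_one(c))]
    return levels, n_candidates


levels, n_candidates = gsp()
for k, (pats, n) in enumerate(zip(levels, n_candidates), start=1):
    print(f"length-{k}: {n} candidates, {len(pats)} sequential patterns")
```

The first two levels give 8 and 51 candidates with 6 and 19 sequential patterns. Candidate counts at deeper levels may differ from a true GSP run because the join step is simplified, but the set of frequent patterns found is the same.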

Page 8: 8timeseries.ppt

Finding Length-1 Sequential Patterns

- Examine GSP using an example
- Initial candidates: all singleton sequences
  - <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
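
The first scan amounts to counting, for every item, how many sequences contain it at least once. A minimal Python sketch, assuming single-character items and one string per element:

```python
from collections import Counter

DB = {
    10: ("bd", "c", "b", "ac"),
    20: ("bf", "ce", "b", "fg"),
    30: ("ah", "bf", "a", "b", "f"),
    40: ("be", "ce", "d"),
    50: ("a", "bd", "b", "c", "b", "ade"),
}
MIN_SUP = 2

support = Counter()
for seq in DB.values():
    support.update(set("".join(seq)))  # count each item once per sequence

frequent = sorted(x for x, n in support.items() if n >= MIN_SUP)
print(dict(sorted(support.items())))
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h each occur in only one sequence, so they are dropped before length-2 candidates are generated.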

Page 9: 8timeseries.ppt

GSP: Generating Length-2 Candidates

Candidates with two elements:

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates with one element:

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8*8 + 8*7/2 = 92 candidates

Apriori prunes 44.57% of the candidates
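
The candidate counts follow from simple counting: n frequent items give n*n two-element candidates plus n*(n-1)/2 one-element candidates. A small Python check (the function name is illustrative):

```python
def length2_candidates(n):
    """n*n ordered pairs <xy> (including <xx>) plus n*(n-1)/2 itemsets <(xy)>."""
    return n * n + n * (n - 1) // 2


with_apriori = length2_candidates(6)     # only the 6 frequent items
without_apriori = length2_candidates(8)  # all 8 items
pruned = 1 - with_apriori / without_apriori

print(with_apriori, without_apriori)  # 51 92
print(f"{pruned:.2%}")                # 44.57%
```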

Page 10: 8timeseries.ppt

The GSP Mining Process

1st scan: 8 cand., 6 length-1 seq. pat.
    <a> <b> <c> <d> <e> <f> <g> <h>

2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
    <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>

3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
    <abb> <aab> <aba> <baa> <bab> …

4th scan: 8 cand., 6 length-4 seq. pat.
    <abba> <(bd)bc> …

5th scan: 1 cand., 1 length-5 seq. pat.
    <(bd)cba>

Figure legend: cand. cannot pass sup. threshold; cand. not in DB at all

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

Page 11: 8timeseries.ppt

Candidate Generate-and-test: Drawbacks

- A huge set of candidate sequences generated.
  - Especially 2-item candidate sequences.
- Multiple scans of database needed.
  - The length of each candidate grows by one at each database scan.
- Inefficient for mining long sequential patterns.
  - A long pattern grows up from short patterns.
  - The number of short patterns is exponential in the length of mined patterns.

Page 12: 8timeseries.ppt

The SPADE Algorithm

- SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001
- A vertical format sequential pattern mining method
- A sequence database is mapped to a large set of
  - Item: <SID, EID>
- Sequential pattern mining is performed by
  - growing the subsequences (patterns) one item at a time by Apriori candidate generation
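
The vertical format can be illustrated with a short Python sketch. It only builds the item id-lists of (SID, EID) pairs and shows the two kinds of join for 2-item patterns; SPADE's equivalence classes and lattice traversal are not reproduced. The five-sequence example database is reused as an assumption, EID is taken to be the 1-based element position, and all function names are illustrative.

```python
from collections import defaultdict

DB = {
    10: ("bd", "c", "b", "ac"),
    20: ("bf", "ce", "b", "fg"),
    30: ("ah", "bf", "a", "b", "f"),
    40: ("be", "ce", "d"),
    50: ("a", "bd", "b", "c", "b", "ade"),
}


def vertical(db):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def sequence_join(left, right):
    """Id-list of 'left, then right': right events after the first left event."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return [(sid, eid) for sid, eid in right if sid in first and first[sid] < eid]


def itemset_join(left, right):
    """Id-list of both items occurring in the same element."""
    return sorted(set(left) & set(right))


def support(idlist):
    return len({sid for sid, _ in idlist})


idlists = vertical(DB)
ba = sequence_join(idlists["b"], idlists["a"])  # pattern <ba>
bd = itemset_join(idlists["b"], idlists["d"])   # pattern <(bd)>
print(ba, support(ba))
print(bd, support(bd))
```

Support is read directly from the joined id-lists (3 for <ba>, 2 for <(bd)>), so no rescan of the original sequences is needed.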

Page 13: 8timeseries.ppt

The SPADE Algorithm

[Figure not preserved in the transcript]

Page 14: 8timeseries.ppt

Bottlenecks of GSP and SPADE

- A huge set of candidates could be generated
  - 1,000 frequent length-1 sequences generate a huge number of length-2 candidates!

\[ 1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500 \]

- Multiple scans of database in mining
- Breadth-first search
- Mining long sequential patterns
  - Needs an exponential number of short candidates
  - A length-100 sequential pattern needs 10^30 candidate sequences!

\[ \sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30} \]
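
Both figures are easy to verify numerically. A minimal Python check (variable names are illustrative):

```python
from math import comb

# Length-2 candidates from 1,000 frequent length-1 sequences.
length2 = 1000 * 1000 + 1000 * 999 // 2

# Non-empty sub-patterns of a length-100 pattern: sum of C(100, i) for i = 1..100.
subpatterns = sum(comb(100, i) for i in range(1, 101))

print(length2)                     # 1499500
print(subpatterns == 2**100 - 1)   # True
print(f"{subpatterns:.3e}")
```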

Page 15: 8timeseries.ppt

Prefix and Suffix (Projection)

- <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
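
The projection table can be reproduced with a small Python sketch. The function `project` is an illustrative name; it assumes single-character items, elements written as alphabetically sorted strings, and the earliest possible match of the prefix.

```python
S = ("a", "abc", "ac", "d", "cf")  # <a(abc)(ac)d(cf)>


def project(seq, prefix):
    """Suffix of seq w.r.t. prefix, or None if seq does not contain prefix."""
    pos = -1
    for element in prefix:
        # Earliest element after pos that contains this prefix element.
        pos = next((k for k in range(pos + 1, len(seq))
                    if set(element) <= set(seq[k])), None)
        if pos is None:
            return None
    # Items left over in the matched element are marked with "_".
    rest = "".join(x for x in seq[pos] if x > prefix[-1][-1])
    return (["_" + rest] if rest else []) + list(seq[pos + 1:])


print(project(S, ("a",)))      # ['abc', 'ac', 'd', 'cf']
print(project(S, ("a", "a")))  # ['_bc', 'ac', 'd', 'cf']
print(project(S, ("a", "b")))  # ['_c', 'ac', 'd', 'cf']
```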

Page 16: 8timeseries.ppt

Mining Sequential Patterns by Prefix Projections

- Step 1: find length-1 sequential patterns
  - <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  - The ones having prefix <a>
  - The ones having prefix <b>
  - …
  - The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Page 17: 8timeseries.ppt

Finding Seq. Patterns with Prefix <a>

- Only need to consider projections w.r.t. <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  - Further partition into 6 subsets
    - Having prefix <aa>
    - …
    - Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
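
The divide-and-conquer recursion can be sketched in Python as below. This is a minimal sketch of prefix-projected pattern growth, not the published PrefixSpan implementation: for brevity each projection is kept as a (sequence, element index) pair instead of a copied suffix, items are single characters, and all function names are illustrative.

```python
from collections import Counter

DB = [
    ("a", "abc", "ac", "d", "cf"),
    ("ad", "c", "bc", "ae"),
    ("ef", "ab", "df", "c", "b"),
    ("e", "g", "af", "c", "b", "c"),
]


def extensions(seq, pos, last):
    """Items that can extend a prefix whose last element matched seq[pos]."""
    # "s": the item opens a new element after the match.
    ext = {("s", x) for element in seq[pos + 1:] for x in element}
    if last:
        # "i": the item joins the last element of the prefix.
        ext |= {("i", x) for element in seq[pos:] if set(last) <= set(element)
                for x in element if x > last[-1]}
    return ext


def advance(seq, pos, last, kind, x):
    """Earliest element index matching the extended prefix, or None."""
    if kind == "s":
        hits = (k for k in range(pos + 1, len(seq)) if x in seq[k])
    else:
        need = set(last) | {x}
        hits = (k for k in range(pos, len(seq)) if need <= set(seq[k]))
    return next(hits, None)


def prefixspan(db, min_sup, prefix=(), projected=None, out=None):
    out = {} if out is None else out
    if projected is None:
        projected = [(seq, -1) for seq in db]  # nothing matched yet
    last = prefix[-1] if prefix else ""
    counts = Counter()
    for seq, pos in projected:
        counts.update(extensions(seq, pos, last))  # once per sequence
    for (kind, x), sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        new_prefix = prefix + (x,) if kind == "s" else prefix[:-1] + (last + x,)
        out[new_prefix] = sup
        new_proj = []
        for seq, pos in projected:
            k = advance(seq, pos, last, kind, x)
            if k is not None:
                new_proj.append((seq, k))
        prefixspan(db, min_sup, new_prefix, new_proj, out)  # recurse on the projection
    return out


patterns = prefixspan(DB, 2)
length2_a = sorted(p for p in patterns
                   if sum(map(len, p)) == 2 and p[0][0] == "a")
print(length2_a)
```

With min_sup = 2 the length-2 patterns with prefix <a> come out as <aa>, <ab>, <(ab)>, <ac>, <ad> and <af>, written here as tuples of element strings.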

Page 18: 8timeseries.ppt

Completeness of PrefixSpan

SDB:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

- Having prefix <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    - Having prefix <aa>: <aa>-proj. db …
    - …
    - Having prefix <af>: <af>-proj. db …
- Having prefix <b>
  - <b>-projected database …
- Having prefix <c>, …, <f>
  - …

Page 19: 8timeseries.ppt

Efficiency of PrefixSpan

- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by pseudo-projections

Page 20: 8timeseries.ppt

Speed-up by Pseudo-projection

- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When (projected) database can be held in main memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix

s = <a(abc)(ac)d(cf)>

Prefix   Physical projection   Pseudo-projection
<a>      <(abc)(ac)d(cf)>      s|<a>: (pointer to s, 2)
<ab>     <(_c)(ac)d(cf)>       s|<ab>: (pointer to s, 4)
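
The (pointer, offset) idea can be sketched in Python. The sketch assumes the offset is the 1-based position, counted in items, at which the postfix starts; this reading reproduces the offsets 2 and 4 in the example. `pseudo_project` and `materialize` are illustrative names.

```python
S = ("a", "abc", "ac", "d", "cf")  # <a(abc)(ac)d(cf)>


def pseudo_project(seq, prefix):
    """1-based item offset where the postfix of seq w.r.t. prefix starts, or None."""
    pos = -1
    for element in prefix:
        pos = next((k for k in range(pos + 1, len(seq))
                    if set(element) <= set(seq[k])), None)
        if pos is None:
            return None
    consumed = seq[pos].index(prefix[-1][-1]) + 1      # items used in seq[pos]
    return sum(len(e) for e in seq[:pos]) + consumed + 1


def materialize(seq, offset):
    """Rebuild the physical postfix from an offset (for display only)."""
    remaining, out = offset - 1, []
    for element in seq:
        if remaining >= len(element):
            remaining -= len(element)                  # element fully skipped
        elif remaining > 0:
            out.append("_" + element[remaining:])      # postfix starts mid-element
            remaining = 0
        else:
            out.append(element)
    return out


for prefix in [("a",), ("a", "b")]:
    offset = pseudo_project(S, prefix)
    print(prefix, offset, materialize(S, offset))
```

Each projection is stored as one integer next to a reference to the unchanged sequence, so recursive projections never copy postfixes.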

Page 21: 8timeseries.ppt

Pseudo-Projection vs. Physical Projection

- Pseudo-projection avoids physically copying postfixes
  - Efficient in running time and space when database can be held in main memory
- However, it is not efficient when database cannot fit in main memory
  - Disk-based random accessing is very costly
- Suggested approach:
  - Integration of physical and pseudo-projection
  - Swapping to pseudo-projection when the data set fits in memory

Page 22: 8timeseries.ppt

Constraint-Based Seq.-Pattern Mining

- Constraint-based sequential pattern mining
  - Constraints: user-specified, for focused mining of desired patterns
  - How to explore efficient mining with constraints? Optimization
- Classification of constraints
  - Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
  - Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
  - Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
  - Convertible: E.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
  - Inconvertible: E.g., avg(S) - median(S) = 0
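
An anti-monotone constraint can be pushed into pattern growth: once a pattern violates it, no extension needs to be explored. The Python sketch below illustrates this for value_sum(S) < 150. The item values in `PRICE` are hypothetical, support counting is left out, and patterns are plain item sequences; it is a sketch of the pruning idea only.

```python
from itertools import product

PRICE = {"a": 40, "b": 60, "c": 70, "d": 120}  # hypothetical, non-negative values


def value_sum_ok(pattern):
    """value_sum(S) < 150; anti-monotone because all values are non-negative."""
    return sum(PRICE[x] for x in pattern) < 150


def grow(prefix, max_len):
    """Pattern growth that never extends a pattern violating the constraint."""
    found = []
    for x in PRICE:
        cand = prefix + (x,)
        if not value_sum_ok(cand):
            continue  # every super-pattern of cand fails as well
        found.append(cand)
        if len(cand) < max_len:
            found.extend(grow(cand, max_len))
    return found


pruned = set(grow((), 3))
brute = {p for n in (1, 2, 3)
         for p in product(PRICE, repeat=n) if value_sum_ok(p)}
print(len(pruned), pruned == brute)
```

The pruned search returns exactly the patterns that a brute-force enumeration keeps, while skipping every extension of a violating pattern.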

Page 23: 8timeseries.ppt

From Sequential Patterns to Structured Patterns

- Sets, sequences, trees, graphs, and other structures
  - Transaction DB: Sets of items
    - {{i1, i2, …, im}, …}
  - Seq. DB: Sequences of sets
    - {<{i1, i2}, …, {im, in, ik}>, …}
  - Sets of sequences
    - {{<i1, i2>, …, <im, in, ik>}, …}
  - Sets of trees: {t1, t2, …, tn}
  - Sets of graphs (mining for frequent subgraphs): {g1, g2, …, gn}
- Mining structured patterns in XML documents, bio-chemical structures, etc.

Page 24: 8timeseries.ppt

Episodes and Episode Pattern Mining

- Other methods for specifying the kinds of patterns
  - Serial episodes: A → B
  - Parallel episodes: A & B
  - Regular expressions: (A | B)C*(D → E)
- Methods for episode pattern mining
  - Variations of Apriori-like algorithms, e.g., GSP
  - Database projection-based pattern growth
    - Similar to the frequent pattern growth without candidate generation

Page 25: 8timeseries.ppt

Periodicity Analysis

- Periodicity is everywhere: tides, seasons, daily power consumption, etc.
- Full periodicity
  - Every point in time contributes (precisely or approximately) to the periodicity
- Partial periodicity: a more general notion
  - Only some segments contribute to the periodicity
    - Jim reads NY Times 7:00-7:30 am every week day
- Cyclic association rules
  - Associations which form cycles
- Methods
  - Full periodicity: FFT, other statistical analysis methods
  - Partial and cyclic periodicity: Variations of Apriori-like mining methods
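
Partial periodicity can be illustrated with a small Python sketch. For a given period it reports the offsets at which one symbol recurs in at least a chosen fraction of the periods. This is only the first, single-offset step of an Apriori-like approach, not any of the cited algorithms; the series, the confidence threshold, and the function name are made up for illustration.

```python
from collections import Counter


def partial_periodic(series, period, min_conf):
    """Map offset -> (symbol, confidence) for offsets with a dominant symbol."""
    n_periods = len(series) // period
    patterns = {}
    for offset in range(period):
        slots = [series[k * period + offset] for k in range(n_periods)]
        symbol, count = Counter(slots).most_common(1)[0]
        if count / n_periods >= min_conf:
            patterns[offset] = (symbol, count / n_periods)
    return patterns


series = "abcabdabcabe"  # hypothetical symbol series
print(partial_periodic(series, period=3, min_conf=0.75))
```

With period 3, offsets 0 and 1 always hold a and b, while offset 2 varies, so only the first two offsets contribute to the periodic pattern.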

Page 26: 8timeseries.ppt

Ref: Mining Sequential Patterns

- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.