Mining Sequential Patterns in Transactional Databases
Sequence Databases & Sequential Patterns
• Transaction databases, time-series databases vs. sequence databases
• Frequent patterns vs. (frequent) sequential patterns
• Applications of sequential pattern mining
  – Customer shopping sequences:
    • First buy computer, then CD-ROM, and then digital camera, within 3 months.
  – Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
  – Telephone calling patterns, Weblog click streams
  – DNA sequences and gene structures
Concepts
• A sequence is an ordered list of itemsets, denoted as <s1, s2, …, sn>, where sj is an itemset.
• An element of a sequence: sj = (x1, x2, …, xm) is an itemset.
• A sequence: <(ef)(ab)(df)cb>
  – An element may contain a set of items.
  – Items within an element are unordered and we list them alphabetically.
Concepts
• A sequence <a1, a2, …, an> is a subsequence of <b1, b2, …, bm> if there exist integers i1 < i2 < … < in s.t. a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.
• Example
  – <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)>
  – <4,5> (i.e., <(4)(5)>) is not a subsequence of <(3)(4,5)(8)>, because 4 and 5 would have to match two different elements
  – <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
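The subsequence definition can be checked mechanically. The following is a minimal Python sketch, assuming each sequence is stored as a list of sets (one set per element); the name `is_subsequence` is illustrative and does not come from the slides. Matching each element at the earliest possible position is enough, because an earlier match never rules out a later one.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Both arguments are lists of sets; each set is one element (itemset).
    """
    j = 0  # next position in b that may still be used
    for elem in a:
        # Advance to the first remaining element of b that contains elem.
        while j < len(b) and not elem <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of a must match strictly later in b
    return True


# The three examples from the slide.
ex1 = is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])
ex2 = is_subsequence([{4}, {5}], [{3}, {4, 5}, {8}])
ex3 = is_subsequence(
    [set("a"), set("bc"), set("d"), set("c")],
    [set("a"), set("abc"), set("ac"), set("d"), set("cf")],
)
```

Here `ex1` and `ex3` evaluate to `True` and `ex2` to `False`, in line with the examples above.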
What Is Sequential Pattern Mining?
• Given a set of sequences
• Find the complete set of frequent subsequences

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a frequent sequential pattern
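Support counting for the example can be sketched in a few lines of Python. This is an illustration only: the string-per-element encoding (for example `"abc"` for the element (abc)) and the names `DB`, `contains`, and `support` are assumptions made for the sketch, not notation from the slides.

```python
# The sequence database from the slide; each string is one element.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (greedy earliest match)."""
    j = 0
    for elem in pattern:
        while j < len(sequence) and not set(elem) <= set(sequence[j]):
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def support(pattern, db):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


MIN_SUP = 2
sup_ab_c = support(["ab", "c"], DB)  # the pattern <(ab)c>
is_frequent = sup_ab_c >= MIN_SUP
```

Only sequences 10 and 30 contain <(ab)c>, so its support is 2 and it meets the threshold.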
Challenges on Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should
  – find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  – be highly efficient, scalable, involving only a small number of database scans
  – be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms
• Concept introduction and an initial Apriori-like algorithm
  – Agrawal & Srikant. Mining sequential patterns, ICDE'95
• Apriori-based method: GSP (Mining Sequential Patterns: Generalizations and Performance Improvements: Srikant & Agrawal @ EDBT'96)
• Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
• Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
• Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
• Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
The Apriori Property of Sequential Patterns
• A basic property: Apriori (Agrawal & Srikant'94)
  – If a sequence S is not frequent, then none of its super-sequences is frequent
  – E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2
GSP: Generalized Sequential Pattern Mining
• GSP (Generalized Sequential Pattern) mining algorithm
  – proposed by Srikant and Agrawal, EDBT'96
• Outline of the method
  – Initially, every item in DB is a candidate of length-1
  – for each level (i.e., sequences of length-k) do
    • scan database to collect support count for each candidate sequence
    • generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  – repeat until no frequent sequence or no candidate can be found
• Major strength: Candidate pruning by Apriori
Finding Length-1 Sequential Patterns
• Examine GSP using an example
• Initial candidates: all singleton sequences
  – <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan database once, count support for candidates

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
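The first scan can be reproduced with a short Python sketch. The encoding (one string per element) and the names `DB` and `length1_support` are assumptions made for illustration. An item counts once per sequence, however often it occurs inside that sequence.

```python
from collections import Counter

# The five-sequence database used in the GSP example.
DB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}


def length1_support(db):
    """Count, for every item, the number of sequences that contain it."""
    counts = Counter()
    for seq in db.values():
        # A set removes repeated occurrences within one sequence.
        counts.update({item for element in seq for item in element})
    return counts


MIN_SUP = 2
support = length1_support(DB)
frequent = sorted(item for item, n in support.items() if n >= MIN_SUP)
```

With `MIN_SUP = 2`, the items g and h drop out, leaving the six length-1 sequential patterns <a> through <f>.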
Contiguous subsequence
• Given s = <s1, s2, …, sn> and c a subsequence of s, c is a contiguous subsequence of s if one of the following holds
  – c is derived from s by dropping an item from s1 or sn
  – c is derived from s by dropping an item from an element si which has at least 2 items
  – c is a contiguous subsequence of c', and c' is a contiguous subsequence of s
• s = <(1,2)(3,4)(5)(6)>
  – <(2)(3,4)(5)(6)>, <(1,2)(3)(5)(6)>, <(3)(5)> are contiguous subsequences of s.
  – <(1,2)(3,4)(6)>, <(1)(5)(6)> are not contiguous subsequences of s.
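Because the definition is recursive, one way to test it is to search over everything reachable from s through single allowed drops. The sketch below does that with a breadth-first search; it is meant for small examples, and the names `seq`, `drops`, and `is_contiguous_subsequence` are hypothetical. Treating s as a contiguous subsequence of itself is an assumption of this sketch.

```python
from collections import deque


def seq(*elements):
    """Build a hashable sequence: a tuple of frozensets."""
    return tuple(frozenset(e) for e in elements)


def drops(s):
    """Yield every sequence obtained from s by one allowed item drop."""
    n = len(s)
    for i, elem in enumerate(s):
        # Allowed: the first element, the last element,
        # or any element with at least two items.
        if i == 0 or i == n - 1 or len(elem) >= 2:
            for item in elem:
                rest = elem - {item}
                # An element that becomes empty disappears from the sequence.
                yield s[:i] + ((rest,) if rest else ()) + s[i + 1:]


def is_contiguous_subsequence(c, s):
    """True if c can be reached from s by repeated allowed drops."""
    seen = {s}
    queue = deque([s])
    while queue:
        current = queue.popleft()
        if current == c:
            return True
        for nxt in drops(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


s = seq([1, 2], [3, 4], [5], [6])
yes = [seq([2], [3, 4], [5], [6]), seq([1, 2], [3], [5], [6]), seq([3], [5])]
no = [seq([1, 2], [3, 4], [6]), seq([1], [5], [6])]
results_yes = [is_contiguous_subsequence(c, s) for c in yes]
results_no = [is_contiguous_subsequence(c, s) for c in no]
```

For the slide's example, all entries of `results_yes` are `True` and all entries of `results_no` are `False`: the single-item middle elements (5) and, once reduced, (3) or (4) can never be removed while their neighbours remain.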
Candidate generation
• Join phase: si = <si1, si2, …, sik> joins with sj = <sj1, sj2, …, sjk>
  – Let si' = drop one item from si1 (i.e., the first itemset)
  – Let sj' = drop one item from sjk (i.e., the last itemset)
  – If si' = sj', generate one candidate for si and sj
• Prune phase

F3:                   <(1,2)(3)>, <(1,2)(4)>, <(1)(3,4)>, <(1,3)(5)>, <(2)(3,4)>, <(2)(3)(5)>
C4 (before pruning):  <(1,2)(3,4)>, <(1,2)(3)(5)>
C4 (after pruning):   <(1,2)(3,4)>
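The join and prune phases can be sketched in Python as follows. This is an illustration under stated assumptions, not the published implementation: sequences are tuples of sorted item tuples, the join drops the first item of the first element and the last item of the last element, the special case for joining length-1 sequences is omitted, and pruning removes a candidate when any subsequence obtained by one contiguous drop is missing from the frequent set. All function names are hypothetical.

```python
def drop_first(s):
    """Remove the first item of the first element."""
    head = s[0][1:]
    return ((head,) if head else ()) + s[1:]


def drop_last(s):
    """Remove the last item of the last element."""
    tail = s[-1][:-1]
    return s[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Return the candidate generated by s1 and s2, or None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was its own element: append a new element.
        return s1 + ((last,),)
    # Otherwise it belongs to the last element of the candidate.
    return s1[:-1] + (tuple(sorted(s1[-1] + (last,))),)


def contiguous_drops(s):
    """Yield the subsequences obtained from s by one contiguous item drop."""
    n = len(s)
    for i, elem in enumerate(s):
        if i == 0 or i == n - 1 or len(elem) >= 2:
            for item in elem:
                rest = tuple(x for x in elem if x != item)
                yield s[:i] + ((rest,) if rest else ()) + s[i + 1:]


def generate_candidates(frequent):
    """Return (candidates before pruning, candidates after pruning)."""
    freq = set(frequent)
    joined = set()
    for a in frequent:
        for b in frequent:
            cand = join(a, b)
            if cand is not None:
                joined.add(cand)
    pruned = {c for c in joined if all(sub in freq for sub in contiguous_drops(c))}
    return joined, pruned


F3 = [
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
]
before, after = generate_candidates(F3)
```

On the F3 above, <(1,2)(3)> joins with <(2)(3,4)> and with <(2)(3)(5)>, giving the two candidates in C4 before pruning. <(1,2)(3)(5)> is then pruned because its contiguous subsequence <(1)(3)(5)> is not in F3.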
The GSP Mining Process
• Candidate lattice, level by level
  – Length 1: <a> <b> <c> <d> <e> <f> <g> <h>
  – Length 2: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
  – Length 3: <abb> <aab> <aba> <baa> <bab> …
  – Length 4: <abba> <(bd)bc> …
  – Length 5: <(bd)cba>
• Database scans
  – 1st scan: 8 cand., 6 length-1 seq. pat.
  – 2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
  – 3rd scan: 47 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
  – 4th scan: 8 cand., 6 length-4 seq. pat.
  – 5th scan: 1 cand., 1 length-5 seq. pat.
• Figure legend: cand. cannot pass sup. threshold; cand. not in DB at all

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

min_sup = 2
Candidate Generate-and-test: Drawbacks
• A huge set of candidate sequences is generated.
  – Especially 2-item candidate sequences.
• Multiple scans of the database are needed.
  – The length of each candidate grows by one at each database scan.
• Inefficient for mining long sequential patterns.
  – A long pattern grows from short patterns
  – The number of short patterns is exponential in the length of mined patterns.
The SPADE Algorithm
• SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001
• A vertical format sequential pattern mining method
• A sequence database is mapped to a large set of
  – Item: <SID, EID>
• Sequential pattern mining is performed by
  – growing the subsequences (patterns) one item at a time by Apriori candidate generation
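The vertical format can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the example database, the use of the 1-based element position as EID, and the names `to_vertical`, `sequence_join`, `itemset_join`, and `support` are all introduced here for illustration. Only joins of two single-item id-lists are shown.

```python
from collections import defaultdict

# Example database (assumed for illustration); each string is one element.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def to_vertical(db):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, sequence in db.items():
        for eid, element in enumerate(sequence, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def sequence_join(left, right):
    """Id-list of <x y>: occurrences of y strictly after some x in the same SID."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return [(sid, eid) for sid, eid in right if sid in first and eid > first[sid]]


def itemset_join(left, right):
    """Id-list of <(x y)>: x and y in the same element of the same sequence."""
    right_set = set(right)
    return [pair for pair in left if pair in right_set]


def support(idlist):
    """Number of distinct sequences in an id-list."""
    return len({sid for sid, _ in idlist})


vertical = to_vertical(DB)
sup_ab = support(sequence_join(vertical["a"], vertical["b"]))          # <ab>
sup_ba = support(sequence_join(vertical["b"], vertical["a"]))          # <ba>
sup_ab_itemset = support(itemset_join(vertical["a"], vertical["b"]))   # <(ab)>
```

Support is read directly off the joined id-list, so no further pass over the original sequences is needed once the vertical format has been built.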
Bottlenecks of GSP and SPADE
• A huge set of candidates could be generated
  – 1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
    $$1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$$
• Multiple scans of database in mining
• Mining long sequential patterns
  – Needs an exponential number of short candidates
  – A length-100 sequential pattern needs $10^{30}$ candidate sequences!
    $$\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$$
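Both counts are easy to verify numerically. In the short Python check below, the first term n × n counts the two-element sequences <xy> (including <xx>) and the second term counts the two-item elements <(xy)>; the variable names are illustrative.

```python
from math import comb

n = 1000  # number of frequent length-1 sequences

# n * n sequences of the form <xy>, plus n(n-1)/2 sequences of the form <(xy)>.
length2_candidates = n * n + n * (n - 1) // 2

# Number of non-empty sub-patterns of a length-100 pattern.
short_candidates = sum(comb(100, i) for i in range(1, 101))
```

The same formula with n = 6 gives 36 + 15 = 51, which is the number of length-2 candidates in the second scan of the GSP example.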
R package
• CRAN document for SPADE algorithm
• Package 'arulesSequences'
  – https://cran.r-project.org/web/packages/arulesSequences/arulesSequences.pdf
FreeSpan
• Get frequent items, i.e., F1
  – f_list = a:4, b:4, c:4, d:3, e:3, f:3
• Sequential patterns
  – Containing only a
  – Containing b, but no items after b
  – …
• Projected databases
  – {a}: <aaa>, <aa>, <a>, <a>
  – {b}: <a(ab)a>, <aba>, <(ab)b>, <ab>
  – …

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
FreeSpan
• Projected databases
  – {a}: <aaa>, <aa>, <a>, <a>
  – {b}: <a(ab)a>, <aba>, <(ab)b>, <ab>
  – …
• Frequent patterns (min_sup = 50%)
  – {a}: <a>, <aa>
  – {b}: <b>, <ab>, <ba>, <(ab)>
  – …
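The projected databases listed for {a} and {b} can be reproduced with a short Python sketch. It follows the slide's examples rather than the published FreeSpan algorithm: the {x}-projected database is assumed to keep the sequences that contain x and, within them, only the items up to and including x in f_list order. The names `DB`, `F_LIST`, and `project` are hypothetical.

```python
# The sequence database from the slide; each string is one element.
DB = [
    ["a", "abc", "ac", "d", "cf"],       # SID 10
    ["ad", "c", "bc", "ae"],             # SID 20
    ["ef", "ab", "df", "c", "b"],        # SID 30
    ["e", "g", "af", "c", "b", "c"],     # SID 40
]
F_LIST = "abcdef"  # frequent items in f_list order


def project(db, f_list, item):
    """Project db onto the items of f_list up to and including item."""
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for sequence in db:
        if not any(item in element for element in sequence):
            continue  # sequences without the item are not part of the projection
        reduced = ["".join(x for x in element if x in keep) for element in sequence]
        projected.append([element for element in reduced if element])
    return projected


a_projected = project(DB, F_LIST, "a")
b_projected = project(DB, F_LIST, "b")
```

`a_projected` corresponds to <aaa>, <aa>, <a>, <a> and `b_projected` to <a(ab)a>, <aba>, <(ab)b>, <ab>, matching the two projected databases on the slide.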
FreeSpan
• A projected database is relatively smaller
• If a pattern appears in each sequence of a database, the projected database does not shrink
  – {f}-projected database
PrefixSpan
• Prefix-projected Sequential pattern mining
• Prefix, postfix
  – <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
• Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
Outline
• Step 1: find length-1 sequential patterns
  – <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide search space.
• The complete set of sequential patterns can be partitioned into 6 subsets:
  – The ones having prefix <a>;
  – The ones having prefix <b>;
  – …
  – The ones having prefix <f>
Finding Sequential Patterns with Prefix <a>
• Only need to consider projections w.r.t. <a>
  – <a>-projected database:
    • <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all the length-2 sequential patterns having prefix <a>
  – <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
• Further partition into 6 subsets
  – Having prefix <aa>
  – …
  – Having prefix <af>
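The step from the <a>-projected database to the length-2 patterns can be sketched in Python. This is a deliberately restricted illustration, assuming a single-item prefix only; it is not a full PrefixSpan implementation. The names `DB`, `project`, and `length2_patterns` are hypothetical, and the partial element written (_x…) on the slide is returned here as a separate string.

```python
from collections import Counter

# The sequence database from the slides; each string is one element.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(sequence, item):
    """Project a sequence on the single-item prefix <item>.

    Returns (partial, rest): the items that follow `item` inside its first
    element, written (_...) on the slides, and the elements after it.
    Returns None if the sequence does not contain the item.
    """
    for i, element in enumerate(sequence):
        if item in element:
            partial = "".join(x for x in element if x > item)
            return partial, sequence[i + 1:]
    return None


def length2_patterns(db, item, min_sup):
    """Frequent length-2 patterns with prefix <item>, found in the projected db."""
    seq_ext, set_ext = Counter(), Counter()
    for sequence in db.values():
        projected = project(sequence, item)
        if projected is None:
            continue
        partial, rest = projected
        # Sequence extension <item x>: x occurs in a later element.
        seq_ext.update({x for element in rest for x in element})
        # Itemset extension <(item x)>: x shares an element with item.
        same_element = set(partial)
        for element in rest:
            if item in element:
                same_element.update(x for x in element if x > item)
        set_ext.update(same_element)
    patterns = [f"<{item}{x}>" for x in sorted(seq_ext) if seq_ext[x] >= min_sup]
    patterns += [f"<({item}{x})>" for x in sorted(set_ext) if set_ext[x] >= min_sup]
    return patterns


patterns_a = length2_patterns(DB, "a", min_sup=2)
```

Only the projected suffixes are scanned, and extensions are counted rather than generated and tested, which is the point of the pattern-growth approach. For prefix <a> the sketch yields the six patterns listed above.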
Completeness of PrefixSpan
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• Having prefix <a>
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  – Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    • Having prefix <aa>: <aa>-proj. db …
    • …
    • Having prefix <af>: <af>-proj. db …
• Having prefix <b>
  – <b>-projected database …
• Having prefix <c>, …, <f>
  – …
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
  – Can be improved by pseudo-projections
Variations of sequential patterns
• Mining structured patterns
  – XML documents, bio-chemical structures, etc.
• Episode discovery
  – Serial episodes: A → B
  – Parallel episodes: A & B
  – Regular expressions: (A | B)C*(D → E)
• Periodic patterns
Ref: Mining Sequential Patterns
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
• Roberto J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. SIGMOD Conference 1998: 85-93.
• M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
• J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
• J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
• X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
• J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
• H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
• J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
• J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.