Approximate Mining of Consensus Sequential Patterns

Hye-Chung (Monica) Kum
University of North Carolina, Chapel Hill
Computer Science Department / School of Social Work
http://www.cs.unc.edu/~kum/approxMAP


TRANSCRIPT

Page 1: Approximate Mining of  Consensus Sequential Patterns

Hye-Chung (Monica) Kum

University of North Carolina, Chapel Hill
Computer Science Department

School of Social Work

http://www.cs.unc.edu/~kum/approxMAP

Approximate Mining of Consensus Sequential Patterns

Page 9: Approximate Mining of  Consensus Sequential Patterns


Knowledge Discovery & Data mining (KDD)

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"

The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner

combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing

Fayyad, Piatetsky-Shapiro, Smyth 1996

Page 10: Approximate Mining of  Consensus Sequential Patterns


What is KDD ?

Purpose
– Extract useful information

Source
– Operational or Administrative Data

Example
– VIC card database for buying patterns
– monthly welfare service patterns

Page 11: Approximate Mining of  Consensus Sequential Patterns


Example

Analyze buying patterns for sales marketing

TID Transaction
1   {Diapers, Hotdogs, Buns, Beer}
2   {Bread, Milk, Diapers, Wipes, Beer}
3   {Milk, Diapers, Beer, Water}
4   {Bread, Milk, Bananas, Cereal}
5   {Bread, Milk, Diapers, Beer}
6   {Steak, Corn, Coke, Beer}
7   {Milk, Orange Juice, Diapers, Baby Food}
8   {Bread, Milk, Diapers, Beer}

Page 12: Approximate Mining of  Consensus Sequential Patterns


Example

VIC card : 4/8 = 50% (support in the transaction table above)

Page 13: Approximate Mining of  Consensus Sequential Patterns


Example

VIC card : 5/8 = 63% (support in the transaction table above)

Page 14: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 16: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

CID TID Transaction
1   1   {Diapers, Hotdogs, Buns, Beer}
1   2   {Bread, Milk, Diapers, Wipes, Beer}
1   3   {Milk, Diapers, Beer, Water}
2   4   {Bread, Milk, Bananas, Cereal}
2   6   {Steak, Corn, Coke, Beer}
3   5   {Bread, Milk, Diapers, Beer}
3   7   {Milk, Orange Juice, Diapers, Baby Food}
3   8   {Bread, Milk, Diapers, Beer}

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

Page 17: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

Detecting patterns in sequences of sets

Page 18: Approximate Mining of  Consensus Sequential Patterns


Welfare Program Participation Patterns

What are the common participation patterns ?What are the variations to them ?How do different policies affect these patterns?

Cid Sequential Transaction
1   {W(elfare) M(edi) F(oodstamp)} {WMF} {WMF} {MF} {MF} {F}
2   {WMF} {WMF} {WMF} {WMF} {WMF} {M} {M}
3   {F} {F} {F} {WMF} {WMF} {WMF} {MF} {MF} {F}

Page 19: Approximate Mining of  Consensus Sequential Patterns


Thesis Statement

In this dissertation, I assert that multiple alignment is an effective model to uncover the underlying trend in sequences of sets.

I will show that approxMAP
– is a novel method to apply multiple alignment techniques to sequences of sets,
– will effectively extract the underlying trend in the data
– by organizing the large database into clusters
– as well as giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment

Furthermore, I will show that approxMAP
– is robust to its input parameters,
– is robust to noise and outliers in the data,
– is scalable with respect to the size of the database,
– and, in comparison to the conventional support model, can better recover the underlying pattern with little confounding information under most circumstances.

In addition, I will demonstrate the usefulness of approxMAP using real world data.

Page 20: Approximate Mining of  Consensus Sequential Patterns


Thesis Statement

Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.

ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.

ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail.

I will demonstrate the usefulness of approxMAP using real world data.

Page 21: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

Detecting patterns in sequences of sets

Sequence  seq1 : < (A,B,D) (B) (C,D) (B,C) >

Itemset   s13  : (C,D)

Items     I    : {A, B, C, D}

• Nseq: Total # of sequences in the Database

• Lseq: Avg # of itemsets in a sequence

• Iseq : Avg # of items in an itemset

• Lseq * Iseq : Avg length of a sequence
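The notation above maps directly onto lists of itemsets. A minimal sketch (the toy database below is hypothetical, not from the slides) of the representation and the Nseq, Lseq, and Iseq statistics:

```python
# Illustrative only: a sequence is a list of itemsets; an itemset is a frozenset of items.
seq1 = [frozenset("ABD"), frozenset("B"), frozenset("CD"), frozenset("BC")]
db = [seq1, [frozenset("A"), frozenset("BC")]]   # toy database (hypothetical)

n_seq = len(db)                                  # Nseq : total # of sequences
l_seq = sum(len(s) for s in db) / n_seq          # Lseq : avg # of itemsets per sequence
i_seq = sum(len(x) for s in db for x in s) / sum(len(s) for s in db)  # Iseq : avg items per itemset

print(n_seq, l_seq, l_seq * i_seq)               # Lseq * Iseq : avg length of a sequence
```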

Page 26: Approximate Mining of  Consensus Sequential Patterns


Conventional Methods : Support Model

Super-sequence : (A,B,D)(B)(C,D)(B,C)
Sub-sequence : (A)(B)(C,D)

Support (P ) : # of super-sequences of P in D

Given D, and user threshold min_sup
– find the complete set of P s.t. Support(P ) ≥ min_sup

Methods
– Breadth first – Apriori principle (GSP)
  • R. Agrawal and R. Srikant : ICDE 95 & EDBT 96
– Depth first – pattern growth (PrefixSpan)
  • J. Han and J. Pei : SIGKDD 2000 & ICDE 2001

Page 27: Approximate Mining of  Consensus Sequential Patterns


Example: Support Model

{Dp, Br} {Mk, Dp} {Mk, Dp, Br} : 2/3 = 67%

2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

– {Dp, Br} {Mk, Dp} {Mk, Br}

– {Dp, Br} {Mk, Dp} {Mk, Dp}

– {Mk, Dp} {Mk, Dp, Br}

– {Dp, Br} {Mk, Dp, Br}

– … etc …

– {Br} {Mk, Dp} {Mk, Dp, Br}

– {Dp} {Mk, Dp} {Mk, Dp, Br}

– {Dp, Br} {Dp} {Mk, Dp, Br}

– {Dp, Br} {Mk} {Mk, Dp, Br}

– {Dp, Br} {Mk, Dp} {Dp, Br}
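The support model above can be sketched in a few lines: a pattern is supported by a sequence when each of its itemsets is contained, in order, in a distinct itemset of that sequence. This is an illustrative sketch, not the slides' implementation; the database is the VIC-card example in abbreviated form:

```python
def is_subsequence(pat, seq):
    """True if every itemset of pat is contained, in order, in some itemset of seq."""
    i = 0
    for p in pat:
        while i < len(seq) and not p <= seq[i]:   # p <= s : subset test
            i += 1
        if i == len(seq):
            return False
        i += 1
    return True

def support(pat, db):
    """# of sequences in db that are super-sequences of pat."""
    return sum(is_subsequence(pat, s) for s in db)

db = [
    [{"Dp", "HD", "Buns", "Br"}, {"Bread", "Mk", "Dp", "Wipes", "Br"}, {"Mk", "Dp", "Br", "Wt"}],
    [{"Bread", "Mk", "Bananas", "Cereal"}, {"Steak", "Corn", "Coke", "Br"}],
    [{"Bread", "Mk", "Dp", "Br"}, {"Mk", "OJ", "Dp", "Baby Food"}, {"Bread", "Mk", "Dp", "Br"}],
]
pat = [{"Dp", "Br"}, {"Mk", "Dp"}, {"Mk", "Dp", "Br"}]
print(support(pat, db))  # 2 of the 3 customers, matching 2/3 = 67% above
```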

Page 28: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : the model

Support
– cannot distinguish between statistically significant patterns and random occurrences

Theoretically
– Short random sequences occur often in long sequential data simply by chance

Empirically
– # of spurious patterns grows exponentially w.r.t. Lseq

Page 29: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : exact match

A pattern gets support only if
– the pattern is exactly contained in the sequence

Often may not find general long patterns

Example
– many customers may share similar buying habits
– few of them follow exactly the same pattern

Page 30: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : Complete set

Mines the complete set
– Too many trivial patterns

Given long sequences with noise
– too expensive and too many patterns
– 2^L − 1 = 2^10 − 1 = 1023

Finding max / closed sequential patterns
– is non-trivial
– In a noisy environment, still too many max/closed patterns

Page 31: Approximate Mining of  Consensus Sequential Patterns


Possible Models

Support model
– Patterns in sets – unordered lists

Multiple alignment model
– Find common patterns among strings
– Simple ordered lists of characters

Page 33: Approximate Mining of  Consensus Sequential Patterns


Multiple Alignment

Line up the sequences to detect the trend
– Find common patterns among strings
– DNA / bio sequences

P A T T T E R N
P A T E R M
P T T R N
O A T T E R B
P S Y Y R T N

Consensus : P A T T E R N

Page 35: Approximate Mining of  Consensus Sequential Patterns


Multiple Alignment Score
– ∑ PS(seqi, seqj) ( 1 ≤ i ≤ N and 1 ≤ j ≤ N )
– Optimal alignment : minimum score

Pairwise Score (edit distance) : dist(seq1, seq2)
– Minimum # of ops required to change seq1 to seq2
– Ops = INDEL(a) and/or REPLACE(a,b)
– Recurrence relation (dynamic programming)

Edit Distance

P A T T T E R N
P A T E R M
INDEL INDEL REPL
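The recurrence relation behind dist(seq1, seq2) is the standard dynamic program. A sketch over plain strings with unit costs follows; for itemsets, the REPLACE cost would be swapped for the normalized set difference (that wiring is an assumption on my part, not shown on this slide):

```python
def edit_distance(s, t, replace_cost=lambda a, b: 0 if a == b else 1, indel_cost=1):
    """Edit distance via the standard DP recurrence; ops are INDEL and REPLACE."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * indel_cost                      # delete everything
    for j in range(1, n + 1):
        D[0][j] = j * indel_cost                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + indel_cost,   # INDEL: delete s[i-1]
                          D[i][j - 1] + indel_cost,   # INDEL: insert t[j-1]
                          D[i - 1][j - 1] + replace_cost(s[i - 1], t[j - 1]))
    return D[m][n]

print(edit_distance("PATTTERN", "PATERM"))  # 3 : two INDELs and one REPLACE, as on the slide
```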

Page 44: Approximate Mining of  Consensus Sequential Patterns


Consensus Sequence

Weighted Sequence :
– compression of aligned sequences into one sequence

strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences)

Consensus itemset (j) : min_strength = 2
– { ia | ia ∈ I and strength(ia, j) ≥ min_strength }

Consensus sequence :
– concatenation of the consensus itemsets

seq1          (A)              (B)               (DE)
seq2          (AE)     (H)     (BC)              (E)
seq3          (A)              (BCG)             (D)
Weighted Seq  (A:3,E:1):3  (H:1):1  (B:3,C:2,G:1):3  (D:2,E:2):3   3

Consensus Seq (A)              (BC)              (DE)
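The compression and cutoff steps above can be sketched as follows. Representing a gap as None is my assumption, not the slides' notation, and min_strength is used here as a count (2 of 3), matching the example:

```python
from collections import Counter

# The aligned cluster from the slide; None marks an alignment gap (a representation choice).
aligned = [
    [{"A"},      None,   {"B"},           {"D", "E"}],
    [{"A", "E"}, {"H"},  {"B", "C"},      {"E"}],
    [{"A"},      None,   {"B", "C", "G"}, {"D"}],
]

def weighted_sequence(aligned):
    """Compress aligned sequences column-wise into (item counts, itemset weight) pairs."""
    ws = []
    for col in zip(*aligned):
        items, weight = Counter(), 0
        for itemset in col:
            if itemset is not None:
                items.update(itemset)   # per-item occurrence counts in this position
                weight += 1             # # of sequences with a non-gap itemset here
        ws.append((items, weight))
    return ws

def consensus(ws, min_strength):
    """Keep items whose count meets min_strength; drop positions that become empty."""
    pat = [frozenset(i for i, c in items.items() if c >= min_strength) for items, _ in ws]
    return [s for s in pat if s]

ws = weighted_sequence(aligned)
print(consensus(ws, min_strength=2))  # the consensus itemsets (A) (BC) (DE)
```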

Page 45: Approximate Mining of  Consensus Sequential Patterns

Multiple Alignment

Sequential Pattern Mining

Given
– N sequences of sets,
– Op costs (INDEL & REPLACE) for itemsets, and
– Strength thresholds for consensus sequences

To
(1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimum,
(2) find the multiple alignment for each partition, and
(3) find the pattern consensus sequence and the variation consensus sequence for each partition

Page 46: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 47: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

(Approximate Multiple Alignment Pattern mining)

Exact solution : Too expensive!

Approximation Method : ApproxMAP
– Organize into K partitions
  • Use clustering
– Compress each partition into
  • weighted sequences
– Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

Page 48: Approximate Mining of  Consensus Sequential Patterns


Tasks

Op costs (INDEL & REPLACE) for itemsets

Organize into K partitions
– Use clustering

Compress each partition into
– weighted sequences

Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence

Page 58: Approximate Mining of  Consensus Sequential Patterns


Op costs for itemsets

Normalized set difference
– R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|)
– 0 ≤ R ≤ 1 , metric
– INDEL(X) = R(X,∅) = 1

Jaccard coefficient
– 1 − |X∩Y| / |X∪Y|
– = 1 − |X∩Y| / (|X−Y| + |Y−X| + |X∩Y|)

Sørensen coefficient : simple index
– Gives greater "weight" to common elements
– 1 − 2|X∩Y| / (|X−Y| + |Y−X| + 2|X∩Y|)
– = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|) = R(X,Y)

REPLACE costs
(a) → (a)    0
(a) → (ab)   1/3
(ab) → (ac)  1/2
(a) → (b)    1
(ab) → (cd)  1
(a) → ()     1
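The normalized set difference and the REPLACE cost table can be checked directly; a small sketch using exact fractions:

```python
from fractions import Fraction

def R(X, Y):
    """Normalized set difference (|X-Y| + |Y-X|) / (|X| + |Y|).
    Equals 1 whenever one side is empty, so INDEL(X) = R(X, {}) = 1."""
    X, Y = set(X), set(Y)
    if not X and not Y:
        return Fraction(0)  # convention: two empty itemsets are identical
    return Fraction(len(X - Y) + len(Y - X), len(X) + len(Y))

# Reproduces the REPLACE cost table from the slide:
print(R("a", "a"), R("a", "ab"), R("ab", "ac"), R("a", "b"), R("ab", "cd"), R("a", ""))
```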

Page 60: Approximate Mining of  Consensus Sequential Patterns


Organize : Partition into K sets

Goal:
– To minimize the sum of the K multiple alignment scores
– Group similar sequences

Approximate
– Calculate the N*N proximity matrix
  • Pairwise score : edit distance
– Any clustering that works best for your data

Page 61: Approximate Mining of  Consensus Sequential Patterns


Organize : Clustering

Desirable Properties
– Form groups of arbitrary shape and size
– Can estimate the number of clusters from the data

Page 62: Approximate Mining of  Consensus Sequential Patterns


Density Based Clustering

k-nearest neighbor : partition at the valleys of the density estimate

Density of sequence = n / (|D| * d) ∝ n / d
– n & d : based on the user-defined k-nearest-neighbor space
– n : # of neighbors
– d : size of the neighbor region

Parameter k : neighbor space
– Can cluster at different resolutions as desired

General : uniform kernel k-NN clustering
– Efficient : O(kN)
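The n/d density estimate can be sketched from a precomputed proximity matrix. This covers only the density step, not the full uniform-kernel k-NN clustering, and the toy distance matrix is hypothetical:

```python
def knn_density(dist, k):
    """Uniform-kernel k-NN density per point: n / d, where d is the distance to the
    k-th nearest neighbor (the neighbor-region size) and n is the # of neighbors
    within d. `dist` is a full N x N distance matrix. Illustrative sketch only."""
    N = len(dist)
    dens = []
    for i in range(N):
        d_sorted = sorted(dist[i][j] for j in range(N) if j != i)
        d = d_sorted[k - 1] or 1e-9              # region size; epsilon guards d == 0
        n = sum(1 for x in d_sorted if x <= d)   # neighbors inside the region
        dens.append(n / d)
    return dens

dist = [[0, 1, 4],
        [1, 0, 5],
        [4, 5, 0]]
print(knn_density(dist, k=1))  # the isolated third point gets the lowest density
```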

Page 64: Approximate Mining of  Consensus Sequential Patterns


Data Compression : Multiple Alignment

Optimal multiple alignment : too expensive!

Greedy approximation
– Incrementally align in density-descending order
– Pairwise alignment
  • Sequence to weighted sequence

Page 65: Approximate Mining of  Consensus Sequential Patterns

ID   Lexically sorted sequence cluster
seq3 (A) (B) (DE)
seq4 (A) (BCG) (D)
seq2 (AE) (H) (B) (D)
seq1 (AG) (F) (BC) (AE) (H)
seq5 (BCI) (DE)

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)

Page 66: Approximate Mining of  Consensus Sequential Patterns

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 74: Approximate Mining of  Consensus Sequential Patterns

Multiple Alignment

seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
WS1  (A:2,E:1):2  (H:1):1  (B:2):2  (D:2,E:1):2   2

seq4 (A) (BCG) (D)
WS2  (A:3,E:1):3  (H:1):1  (B:3,C:1,G:1):3  (D:3,E:1):3   3

seq5 (BCI) (DE)
WS3  (A:3,E:1):3  (H:1):1  (B:4,C:2,G:1,I:1):4  (D:4,E:2):4   4

seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 75: Approximate Mining of  Consensus Sequential Patterns

Op Cost for Itemset to weighted itemset

seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
WS3  (A:3,E:1):3  (H:1):1  (B:4,C:2,G:1,I:1):4  (D:4,E:2):4   4

seq1 (AG) (F) (BC) (AE) (H)

Replace( (A:3,E:1):3 − 4 , (AG) ) = ?

Page 82: Approximate Mining of  Consensus Sequential Patterns

seq3 (A)   R = 1/3
seq2 (AE)  R = 1/2
seq4 (A)   R = 1/3     Avg over aligned itemsets = 7/18 = 35/90
seq5       INDEL = 1   Tot Avg = 65/120
WS3  (A:3,E:1):3 − 4
seq1 (AG)

R’w = (4 + 2*3 − 2*3) / (4 + 2*3) = 2/5 = 36/90
Rw = [ (2/5)*3 + 1 ] / 4 = 11/20 = 66/120

Replace( (A:3,E:1):3 − 4 , (AG) ) = Rw(Xw, Y) = [ R’w(Xw,Y) * wX + n − wX ] / n

Page 83: Approximate Mining of  Consensus Sequential Patterns


Op Cost for Itemset to weighted itemset:Rw

Op cost
– R’w(Xw,Y) = [ weight(X) + |Y|*wX − 2*weight(X∩Y) ] / [ weight(X) + |Y|*wX ]
– Rw(Xw,Y) = [ R’w * wX + n − wX ] / n
– 0 ≤ Rw ≤ 1 , metric
– INDEL(Xw) = Rw(Xw,∅) = INDEL(Y) = Rw(∅,Y) = 1
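The Rw formulas can be checked against the worked example from the preceding slides, Replace((A:3,E:1):3 − 4, (AG)) = 11/20. The dict representation of a weighted itemset below is my assumption:

```python
from fractions import Fraction

def replace_w(item_counts, w_X, n, Y):
    """Rw : REPLACE cost between a weighted itemset Xw and a plain itemset Y.
    item_counts : item -> count dict for Xw (hypothetical representation);
    w_X : itemset weight; n : sequence weight."""
    Y = set(Y)
    weight_X = sum(item_counts.values())                              # weight(X)
    weight_inter = sum(c for i, c in item_counts.items() if i in Y)   # weight(X ∩ Y)
    denom = weight_X + len(Y) * w_X
    r_prime = Fraction(denom - 2 * weight_inter, denom)               # R'w(Xw, Y)
    return (r_prime * w_X + n - w_X) / n                              # Rw = [R'w*wX + n - wX] / n

# Worked example: Replace((A:3,E:1):3 - 4, (AG))
print(replace_w({"A": 3, "E": 1}, w_X=3, n=4, Y={"A", "G"}))  # 11/20
```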

Page 85: Approximate Mining of  Consensus Sequential Patterns

Summarize: Generate and Present results

N sequences → K weighted sequences

Weighted sequence : huge
– compression of all sequences

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 86: Approximate Mining of  Consensus Sequential Patterns

< (E:1, L:1, R:1, T:1, V:1, d:1) (A:1, B:9, C:8, D:8, E:12, F:1, L:4, P:1, S:1, T:8, V:5, X:1, a:1, d:10, e:2, f:1, g:1, p:1) (B:99, C:96, D:91, E:24, F:2, G:1, L:15, P:7, R:2, S:8, T:95, V:15, X:2, Y:1, a:2, d:26, e:3, g:6, l:1, m:1) (A:5, B:16, C:5, D:3, E:13, F:1, H:2, L:7, P:1, R:2, S:7, T:6, V:7, Y:3, d:3, g:1) (A:13, B:126, C:27, D:1, E:32, G:5, H:3, J:1, L:1, R:1, S:32, T:21, V:1, W:3, X:2, Y:8, d:13, e:1, f:8, i:2, p:7, l:3, g:1) (A:12, B:6, C:28, D:1, E:28, G:5, H:2, J:6, L:2, S:137, T:10, V:2, W:6, X:8, Y:124, a:1, d:6, g:2, i:1, l:1, m:2) (A:135, B:2, C:23, E:36, G:12, H:124, K:1, L:4, O:2, R:2, S:27, T:6, V:6, W:10, X:3, Y:8, Z:2, a:1, d:6, g:1, h:2, j:1, k:5, l:3, m:7, n:1) (A:11, B:1, C:5, E:12, G:3, H:10, L:7, O:4, S:5, T:1, V:7, W:3, X:2, Y:3, a:1, m:2) (A:31, C:15, E:10, G:15, H:25, K:1, L:7, M:1, O:1, R:4, S:12, T:10, V:6, W:3, Y:3, Z:3, d:7, h:3, j:2, l:1, n:1, p:1, q:1) (A:3, C:5, E:4, G:7, H:1, K:1, R:1, T:1, W:2, Z:2, a:1, d:1, h:1, n:1) (A:20, C:27, E:13, G:35, H:7, K:7, L:111, N:2, O:1, Q:3, R:11, S:10, T:20, V:111, W:2, X:2, Y:3, Z:8, a:1, b:1, d:13, h:9, j:1, n:1, o:2) (A:17, B:2, C:14, E:17, F:1, G:31, H:8, K:13, L:2, M:2, N:1, R:22, S:2, T:140, U:1, V:2, W:2, X:1, Z:13, a:1, b:8, d:6, h:14, n:6, p:1, q:1) (A:12, B:7, C:5, E:13, G:16, H:5, K:106, L:8, N:2, O:1, R:32, S:3, T:29, V:9, X:2, Z:9, b:16, c:5, d:5, h:7, l:1) (A:7, B:1, C:9, E:5, G:7, H:3, K:7, R:8, S:1, T:10, X:1, Z:3, a:2, b:3, c:1, d:5, h:3) (A:1, B:1, H:1, R:1, T:1, b:2, c:1) (A:3, B:2, C:2, E:6, F:2, G:4, H:2, K:20, M:2, N:3, R:19, S:3, T:11, U:2, X:4, Z:34, a:3, b:11, c:2, d:4) (H:1, Y:1, a:1, d:1) > : 162

Page 87: Approximate Mining of  Consensus Sequential Patterns


Presentation model

Frequent items :
– definite pattern items
– Cutoff : 50%

Common items :
– uncertain
– Cutoff : 20%

Rare items :
– Noise items

[Diagram: item strength scale from W=100% down to W=0%, with Frequent items above the 50% cutoff, Common items between 20% and 50%, and Rare items below 20%]
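The three bands can be read as a simple threshold function on an item's strength (its weight divided by the number of sequences in the cluster). A minimal sketch; the function name and defaults are illustrative, with the 50% and 20% cutoffs taken from the slide.

```python
def classify_item(weight, n_seqs, freq_cutoff=0.50, common_cutoff=0.20):
    """Classify an item at a weighted-sequence position by its strength."""
    strength = weight / n_seqs          # W, fraction of sequences containing the item
    if strength >= freq_cutoff:
        return 'frequent'               # definite pattern item
    if strength >= common_cutoff:
        return 'common'                 # uncertain item
    return 'rare'                       # treated as noise

# e.g. in a cluster of 162 sequences:
band = classify_item(135, 162)          # 135/162 ≈ 83% of the cluster -> 'frequent'
```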

Page 88: Approximate Mining of  Consensus Sequential Patterns


Visualization

Pattern Consensus Sequence :
– Cutoff : Minimum cluster strength (50%)
– Frequent items

Variation Consensus Sequence :
– Cutoff : Minimum cluster strength (20%)
– Frequent items + common items

100%: 85%: 70%: 50%: 35%: 20%

(B:61%, C:59%, D:56%, T:59%)

(B: 78%)

(S:85%, Y:77%)

(A:83%, E:22%, H:77%)

(G:22%, L:69%, V:69%)

(T: 86%)

(K: 65%)

(B: 21%) : 162

Pattern    (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation  (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162
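Both consensus sequences can be derived from a weighted sequence in one filtering pass. The sketch below is illustrative (not the actual ApproxMAP code): it keeps items whose strength meets the cutoff and drops positions left empty, reproducing the cluster-1 consensus pattern from the later worked example.

```python
def consensus(weighted_seq, n_seqs, cutoff):
    """Keep items whose strength (weight / n_seqs) meets the cutoff."""
    result = []
    for item_weights, _support in weighted_seq:
        itemset = sorted(i for i, w in item_weights.items() if w / n_seqs >= cutoff)
        if itemset:                      # drop positions with no surviving item
            result.append(tuple(itemset))
    return result

# Cluster 1 from the worked example: 3 sequences
ws1 = [({'A': 3, 'L': 1}, 3), ({'X': 1}, 1), ({'B': 2, 'C': 2, 'Y': 1}, 2),
       ({'A': 1, 'D': 2, 'E': 2}, 3), ({'Z': 1}, 1)]
pattern = consensus(ws1, 3, 0.50)        # frequent items only
```

With the 50% cutoff this yields (A)(B,C)(D,E), matching the consensus pattern on the cluster-1 slides.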

Page 89: Approximate Mining of  Consensus Sequential Patterns

ID  Full Sequence  Cluster  Lseq  Len

seq1 (A) (B, C, Y) (D)     1 3 5

seq2 (A) (X) (B, C) (A, E) (Z) 1 5 7

seq3 (A, I) (Z) (K) (L, M)   2 4 6

seq4 (A, L) (D, E)       1 2 4

seq5 (I, J) (B) (K) (L)   2 4 5

seq6 (I, J) (L, M)       2 2 5

seq7 (I, J) (K) (J, K) (L) (M) 2 5 7

seq8 (I, M) (K) (K, M) (L, M)   2 4 7

seq9 (J) (K) (L, M)     2 3 4

seq10 (V) (K, W) (Z)     2 3 4

Example: Given 10 seqs lexically sorted

Page 90: Approximate Mining of  Consensus Sequential Patterns

ID  Full Sequence  Cluster  Lseq  Len

seq1 (A) (B, C, Y) (D)     1 3 5

seq2 (A) (X) (B, C) (A, E) (Z) 1 5 7

seq3 (A, I) (Z) (K) (L, M)   2 4 6

seq4 (A, L) (D, E)       1 2 4

seq5 (I, J) (B) (K) (L)   2 4 5

seq6 (I, J) (L, M)       2 2 5

seq7 (I, J) (K) (J, K) (L) (M) 2 5 7

seq8 (I, M) (K) (K, M) (L, M)   2 4 7

seq9 (J) (K) (L, M)     2 3 4

seq10 (V) (K, W) (Z)     2 3 4

Example: Given 10 seqs lexically sorted

Page 91: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

     

Page 92: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

     

Page 93: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Page 94: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

     

   

Page 95: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

Weighted sequence  (A:1,I:5,J:4,M:1):6  (B:1,K:2,V:1,Z:1):5  (J:1,K:6,M:1,W:1):6  (L:6,M:4):6  (M:1,Z:1):2  7

     

   

Page 96: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

Weighted sequence  (A:1,I:5,J:4,M:1):6  (B:1,K:2,V:1,Z:1):5  (J:1,K:6,M:1,W:1):6  (L:6,M:4):6  (M:1,Z:1):2  7

Consensus Pat (w≥3)  (I, J)  (K)  (L, M)
Consensus Var (w≥2)  (I, J)  (K)  (K)  (L, M)

Page 97: Approximate Mining of  Consensus Sequential Patterns

Example: support model (20% = 2 seq)

id  pattern    sup   id  pattern      sup   id  pattern          sup
1   (A)        4     17  (A) (D)      2     33  (I,J) (K)        2
2   (B)        3     18  (A) (E)      2     34  (I,J) (L)        3
3   (C)        2     19  (A) (Z)      2     35  (I,J) (M)        2
4   (D)        2     20  (A) (B,C)    2     36  (I) (K) (K)      2
5   (E)        2     21  (I) (K)      4     37  (I) (K) (L)      2
6   (I)        5     22  (I) (L)      5     38  (I) (K) (M)      2
7   (J)        4     23  (I) (M)      4     39  (I) (K) (L,M)    2
8   (K)        6     24  (I) (L,M)    3     40  (J) (K) (L)      2
9   (L)        7     25  (J) (K)      3     41  (J) (K) (M)      2
10  (M)        5     26  (J) (L)      4     42  (K) (K) (L)      2
11  (Z)        3     27  (J) (M)      3     43  (K) (K) (M)      2
12  (B,C)      2     28  (J) (L,M)    2     44  (I,J) (K) (L)    2
13  (I,J)      2     29  (K) (K)      2     45  (I) (K) (K) (L)  2
14  (L,M)      2     30  (K) (L)      5     46  (I) (K) (K) (M)  2
15  (A) (B)    2     31  (K) (M)      4
16  (A) (C)    2     32  (K) (L,M)    3

(A) (B,C) (D,E)
(I,J) (K) (L,M)
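Each support count in the table can be reproduced with a subsequence test over the 10-sequence database: a sequence supports a pattern when each pattern itemset is contained in some itemset of the sequence, in order. A minimal sketch (greedy matching suffices for this test):

```python
def supports(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (itemset containment, in order)."""
    pos = 0
    for itemset in seq:
        if pos < len(pattern) and pattern[pos] <= itemset:   # <= is subset on sets
            pos += 1
    return pos == len(pattern)

db = [
    [{'A'}, {'B', 'C', 'Y'}, {'D'}],                        # seq1
    [{'A'}, {'X'}, {'B', 'C'}, {'A', 'E'}, {'Z'}],          # seq2
    [{'A', 'I'}, {'Z'}, {'K'}, {'L', 'M'}],                 # seq3
    [{'A', 'L'}, {'D', 'E'}],                               # seq4
    [{'I', 'J'}, {'B'}, {'K'}, {'L'}],                      # seq5
    [{'I', 'J'}, {'L', 'M'}],                               # seq6
    [{'I', 'J'}, {'K'}, {'J', 'K'}, {'L'}, {'M'}],          # seq7
    [{'I', 'M'}, {'K'}, {'K', 'M'}, {'L', 'M'}],            # seq8
    [{'J'}, {'K'}, {'L', 'M'}],                             # seq9
    [{'V'}, {'K', 'W'}, {'Z'}],                             # seq10
]
sup = sum(supports(s, [{'I', 'J'}, {'K'}]) for s in db)     # id 33 in the table
```

This reproduces, for example, sup((I,J)(K)) = 2 and sup((K)(L)) = 5 from the table.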

Page 98: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

(Approximate Multiple Alignment Pattern mining)

Approximation Method : ApproxMAP
– Organize into K partitions = O(Nseq^2 Lseq^2 Iseq)
  • Proximity matrix = O(Nseq^2 Lseq^2 Iseq)
  • Clustering = O(k Nseq)
– Compress each partition = O(n L^2)
  • weighted sequences = O(n L^2)
– Summarize each partition = O(1)
  • Pattern consensus sequence
  • Variation consensus sequence
– Time Complexity : O(Nseq^2 Lseq^2 Iseq)
  • 2 optimizations
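The O(Lseq^2) pairwise cost behind the proximity matrix comes from an edit-distance alignment over itemsets. A minimal sketch, assuming a unit INDEL cost and normalization by the longer sequence (both simplifications for illustration); the replacement cost is the set-difference formula from the Rw backup slide at the end of the deck.

```python
def repl(x, y):
    """Itemset replacement cost: (|X| + |Y| - 2|X ∩ Y|) / (|X| + |Y|)."""
    return (len(x) + len(y) - 2 * len(x & y)) / (len(x) + len(y))

def seq_dist(s, t):
    """Edit-distance DP over itemsets, normalized so 0 <= dist <= 1."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)                      # deletions (unit INDEL cost)
    for j in range(n + 1):
        d[0][j] = float(j)                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1,        # delete itemset s[i-1]
                          d[i][j-1] + 1,        # insert itemset t[j-1]
                          d[i-1][j-1] + repl(s[i-1], t[j-1]))
    return d[m][n] / max(m, n)
```

Computing this for every pair of sequences fills the O(Nseq^2) proximity matrix used by the kNN clustering step.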

Page 99: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 100: Approximate Mining of  Consensus Sequential Patterns


Evaluation

Up to now : only performance / scalability
Quality?
– What kind of patterns will the model generate?
– Evaluate correctness of the model

Why?
– Basis for comparison of different models
– Essential in understanding results of approximate solutions

Page 101: Approximate Mining of  Consensus Sequential Patterns


Evaluation Method

Given
– Set of Base patterns B : E(FB) & E(LB)
– Set of Result patterns P

How?
– Map each Pi to the best Bj
  • based on the Longest Common Subsequence
  • the Bj, over all Bj, with max |B∩P|

Page 102: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +
Actual (Base Pat)  +    Pattern Items

Page 103: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +
Actual (Base Pat)  +    Pattern Items
                   –    Extraneous Items

Page 104: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +                   –
Actual (Base Pat)  +    Pattern Items       Missed Items
                   –    Extraneous Items

Page 105: Approximate Mining of  Consensus Sequential Patterns


Evaluation Criteria : Item level

Recoverability :
– Degree of pattern items recovered from the Base Pat (weighted)
– R = ∑ E(FB) · [ max over result patterns |B∩P| / E(LB) ]
– Cutoff so that 0 ≤ R ≤ 1

Precision :
– Degree of pattern items in the Result Pat
– Pattern Items / (Pattern Items + Extraneous Items)

                      Predicted (Result Patterns)
                        +                   –
Actual (Base Pat)  +    Pattern Items       Missed Items
                   –    Extraneous Items    N/A

Page 106: Approximate Mining of  Consensus Sequential Patterns


Evaluation Criteria : Sequence level

Spurious patterns
– Pattern Items ≤ Extraneous Items

Determine the max pattern for each Bj
– Of all the Pi that map to a particular Bj
– the Pi with the Longest Common Subsequence
– max over result patterns |B∩P|

Redundant patterns
– All other patterns

Ntotal = Nmax + Nspur + Nredun
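The sequence-level bookkeeping can be sketched as below. The names are illustrative, and patterns are flattened to item lists with |B∩P| taken as the item-level longest common subsequence, a simplification of the itemset-level LCS.

```python
def lcs_len(a, b):
    """Item-level longest common subsequence length between two item lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] else max(d[i-1][j], d[i][j-1])
    return d[m][n]

def classify_patterns(results, bases):
    """Count (max, spurious, redundant) result patterns."""
    best = {}                 # base index -> (lcs length, result index) of max pattern
    n_spur = 0
    for i, p in enumerate(results):
        j = max(range(len(bases)), key=lambda b: lcs_len(p, bases[b]))
        shared = lcs_len(p, bases[j])
        if shared <= len(p) - shared:          # pattern items <= extraneous items
            n_spur += 1
        elif (shared, i) > best.get(j, (-1, -1)):
            best[j] = (shared, i)              # current max pattern for this base
    n_max = len(best)
    return n_max, n_spur, len(results) - n_max - n_spur

# Patterns from the evaluation example on the following slides, flattened
bases = [list('ABCDE'), list('IJKLM')]         # (A)(BC)(DE) and (IJ)(K)(LM)
results = [list('ABDE'), list('ABCD'), list('BBCDE'),
           list('IJLM'), list('JKLM'), list('IJKXLM'), list('XYKZ')]
counts = classify_patterns(results, bases)
```

On this data the sketch yields 2 max, 1 spurious ((XY)(K)(Z)), and 4 redundant patterns, consistent with Ntotal = 7.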

Page 107: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 108: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 109: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7 Spurious = 1

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 110: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 111: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%   Redundant = 4

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 112: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%   Redundant = 4   Precision = 1 − 5/31 = 84%

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)
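The two summary figures in this example follow directly from the definitions. A worked sketch using the counts stated on the slide (4 of B1's 5 items recovered, 5 extraneous of 31 emitted items):

```python
# Recoverability weights each base pattern by its expected frequency E(F_B)
# and credits the best-matching result pattern's shared items against the
# base pattern's expected length E(L_B).
base_patterns = [
    dict(freq=0.30, exp_len=5, best_shared=4),  # (A)(BC)(DE): best match shares 4 of 5
    dict(freq=0.70, exp_len=5, best_shared=5),  # (IJ)(K)(LM): fully recovered
]
recoverability = sum(b['freq'] * min(1.0, b['best_shared'] / b['exp_len'])
                     for b in base_patterns)

# Precision is the fraction of emitted items that are pattern items
total_items, extraneous = 31, 5                 # counts over all 7 result patterns
precision = 1 - extraneous / total_items
```

This reproduces Recoverability = 94% and Precision ≈ 84%.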

Page 113: Approximate Mining of  Consensus Sequential Patterns


Synthetic data

Patterned data : IBM synthetic data generator
– Given certain DB parameters, outputs
  • sequence DB
  • the base patterns used to generate it : E(FB) and E(LB)
– R. Agrawal and R. Srikant : ICDE 95 & EDBT 96

Random data
– Independence both between and across itemsets

Patterned data + systematic noise
– Randomly change an item with probability (1 − α)
– Yang, SIGMOD 2002

Patterned data + systematic outliers
– random sequences
Page 114: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 115: Approximate Mining of  Consensus Sequential Patterns


Results

ApproxMAP
– Pattern consensus sequence
– No null or one-itemset patterns

Machine : swan
– 2GHz Intel Xeon processor
– 2GB of memory
– Public machine
  • Difficult to get consistent running time measurements
  • Thanks !

Page 116: Approximate Mining of  Consensus Sequential Patterns


Database Parameter

Notation  Meaning                          Value
|| I ||   # of items                       100
|| ||     # of potentially freq itemsets   500

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 7

Npat # of base pattern sequences 10

Nseq # of data sequences 1000

Lseq Avg. # of itemsets per data seq 10

Iseq Avg. # of items per itemset in DB 2.5

Page 117: Approximate Mining of  Consensus Sequential Patterns


Database Parameter

Notation  Meaning                          Value
|| I ||   # of items                       100
|| ||     # of potentially freq itemsets   500

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 7

Npat # of base pattern sequences 10

Nseq # of data sequences 1000

Lseq Avg. # of itemsets per data seq 10

Iseq Avg. # of items per itemset in DB 2.5

Page 118: Approximate Mining of  Consensus Sequential Patterns

BasePi (E(FB):E(LB)) ||P|| Pattern <100: 85: 70: 50: 35: 20>

BaseP1 (0.21:0.66) 14 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62) (93)>
PatConSeq1 13 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62)>
VarConSeq1 18 <(15 16 17 66) (15 22) (58 99) (2 74) (24 31 76) (24 66) (50 62) (93)>

BaseP2 (0.161:0.83) 22 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58) (63 74 99)>
PatConSeq2 19 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58)>
VarConSeq2 25 <(22 50 66) (16) (29 99) (22 58 94) (2 45 58 67) (12 28 36) (2 50) (24 96) (51) (66) (2 22 58)>
PatConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>
VarConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>

BaseP3 (0.141:0.82) 14 <(22) (22) (58) (2 16 24 63) (24 65 93) (6) (11 15 74)>
PatConSeq4 11 <(22) (22) (58) (2 16 24 63) (24 65 93) (6)>
VarConSeq4 13 <(22) (22) (22) (58) (2 16 24 63) (2 24 65 93) (6 50)>

BaseP4 (0.131:0.90) 15 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (2 16 24 63)>
PatConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66)>
VarConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (16 24)>

BaseP5 (0.123:0.81) 14 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20) (93)>
PatConSeq6 13 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20)>
VarConSeq6 16 <(22 43) (2 28 73) (58 96) (95) (2 74) (5) (2 66) (24 63) (20)>

BaseP6 (0.121:0.77) 9 <(63) (16) (2 22) (24) (22 50 66) (50)>
PatConSeq7 8 <(63) (16) (2 22) (24) (22 50 66)>
VarConSeq7 9 <(63) (16) (2 22) (24) (22 50 66)>

BaseP7 (0.054:0.60) 13 <(70) (58 66) (22) (74) (22 41) (2 74) (31 76) (2 74)>
PatConSeq8 16 <(70) (58) (22 58 66) (22 58) (74) (22 41) (2 74) (31 76) (2 74)>
VarConSeq8 18 <(70) (58 66) (22 58 66) (22 58) (74) (22 41) (2 22 66 74) (31 76) (2 74)>
PatConSeq9 0 (cluster size was only 5 sequences, so no pattern consensus sequence was produced)
VarConSeq9 8 <(70) (58 66) (74) (74) (22 41) (74)>

BaseP8 (0.014:0.91) 17 <(20 22 23 96) (50) (51 63) (58) (16) (2 22) (50) (23 26 36) (10 74)>
BaseP9 (0.038:0.78) 7 <(88) (24 58 78) (22) (58) (96)>
BaseP10 (0.008:0.66) 17 <(16) (2 23 74 88) (24 63) (20 96) (91) (40 62) (15) (40) (29 40 99)>

Page 119: Approximate Mining of  Consensus Sequential Patterns

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

Id:LB E(FB):E(LB) Base Pattern

B1:14 0.21:0.66 [15,16,17,66] [15] [58,99] [2,74] [31,76] [66] [62][93]

B2:22 0.161:0.83 [22,50,66][16][29,99][94][45,67]…[2,22,58][63,74,99]

B3:14 0.141:0.82 [22] [22] [58] [2,16,24,63] [24,65,93] [6] [11,15,74]

Etc

B10:17 0.008:0.66 [16][2,23,74,88][24,63][20,96][91][40,62]...[29,40,99]

Page 120: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

cluster1(162 seqs)

cluster2 cluster9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

(wseq1 : the full weighted sequence for cluster 1, shown earlier) : 162

Page 121: Approximate Mining of  Consensus Sequential Patterns

cluster1(162 seqs)

cluster2 cluster9

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

PatConSeq1

VarConSeq1

PatConSeq9

VarConSeq9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

100%: 85%: 70%: 50%: 35%: 20%

(B:61%, C:59%, D:56%, T:59%)

(B: 78%)

(S:85%, Y:77%)

(A:83%, E:22%, H:77%)

(G:22%, L:69%, V:69%)

(T: 86%)

(K: 65%)

(B: 21%) : 162

Pattern    (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation  (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162

Page 122: Approximate Mining of  Consensus Sequential Patterns

cluster1(162 seqs)

cluster2 cluster9

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

PatConSeq1

VarConSeq1

PatConSeq9

VarConSeq9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

Evaluation

7 max patterns
1 redundant pattern
0 spurious patterns
1 null pattern

Recoverability : 91.16%
Precision : 97.17%
Extraneous Items : 3/106

Page 123: Approximate Mining of  Consensus Sequential Patterns


BaseP1 (0.21:0.66) 14 <(15 16 17 66)(15)(58 99)(2 74)(31 76)(66)(62)(93)>
PatConSeq1 13 <(15 16 17 66)(15)(58 99)(2 74)(31 76)(66)(62)>
VarConSeq1 18 <(15 16 17 66)(15 22)(58 99)(2 74)(24 31 76)(24 66)(50 62)(93)>

Page 124: Approximate Mining of  Consensus Sequential Patterns


8 patterns returned

7 max patterns

1 redundant pattern

0 spurious patterns

Recoverability : 91.16%

Precision: 97.17%

Extraneous Items: 3/106

Page 125: Approximate Mining of  Consensus Sequential Patterns


Comparative Study

Conventional Sequential Pattern Mining– Support Model

Empirical analysis– Totally random data– Patterned data– Patterned data + noise– Patterned data + outliers

Page 126: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Page 127: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Page 128: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Noise         Robust          Not Robust (Recoverability degrades fast)

Page 129: Approximate Mining of  Consensus Sequential Patterns


Robustness w.r.t. noise

[Chart: Recoverability (0–100%) vs. Noise Level (1-alpha, 0–50%), comparing Support (min_sup=5%) with Multiple Alignment (k=6, theta=50%)]

Page 130: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Noise         Robust          Not Robust (Recoverability degrades fast)

Outliers      Robust          Somewhat Robust

Page 131: Approximate Mining of  Consensus Sequential Patterns


Understanding ApproxMAP

5 experiments
– k in kNN clustering
– Strength cutoff
– Order of alignment
– Optimization 1 : reduced precision in the proximity matrix
– Optimization 2 : sample-based iterative clustering

Page 132: Approximate Mining of  Consensus Sequential Patterns


A realistic DB

Notation  Meaning                          Value
|| I ||   # of items                       1,000
|| ||     # of potentially freq itemsets   5,000

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 14=0.7*Lseq

Npat # of base pattern sequences 100

Nseq # of data sequences 10,000

Lseq Avg. # of itemsets per data seq 20

Iseq Avg. # of items per itemset in DB 2.5

Page 133: Approximate Mining of  Consensus Sequential Patterns


Input parameters

[Charts: Recoverability and Precision (%) vs. Theta, the strength threshold (0–100%); and Recoverability and Precision (%) vs. k for kNN clustering (2–10)]

Page 134: Approximate Mining of  Consensus Sequential Patterns

The order in multiple alignment

Page 135: Approximate Mining of  Consensus Sequential Patterns


Understanding ApproxMAP

Optimization 1 : Lseq
– Running time : reduced to 40%

Optimization 2 : Nseq
– Running time : reduced to 10%–40%
– For negligible reduction in recoverability

Page 136: Approximate Mining of  Consensus Sequential Patterns


Effects of the DB param & scalability

4 experiments
– || I || : # of unique items in the database
  • Density of the database
  • 1,000 – 10,000
– Nseq : # of sequences in the data
  • 10,000 – 100,000
– Lseq : Avg. # of itemsets per data seq
  • 10 – 50
– Iseq : Avg. # of items per itemset in DB
  • 2.5 – 10

Page 137: Approximate Mining of  Consensus Sequential Patterns

[Charts: running time (sec) vs. |I| : # of unique items in D (0–10,000); vs. Nseq (0–100,000); vs. Lseq (0–50); and vs. Iseq (0–20)]

Page 138: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 139: Approximate Mining of  Consensus Sequential Patterns


Case Study : Real data

Monthly services to children with an A&N report
992 sequences
15 interpretable and useful patterns

(RPT)(INV,FC)(FC) ..11.. (FC)
– 419 sequences

(RPT)(INV,FC)(FC)(FC)
– 57 sequences

(RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM)
– 39 sequences

Page 140: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 141: Approximate Mining of  Consensus Sequential Patterns


Conclusion : why does it work well?

Robust on random & weak patterned noise
Very good at organizing sequences
Long sequence data that are not random have unique signatures

Page 142: Approximate Mining of  Consensus Sequential Patterns


What have I done?

defines a new model, Multiple Alignment Sequential Pattern Mining,

describes a novel solution, ApproxMAP (APPROXimate Multiple Alignment Pattern mining)
– that introduces a new metric for itemsets
– weighted sequences : a new representation of alignment information
– and the effective use of strength cutoffs to control the level of detail included in the consensus patterns

designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms,

Page 143: Approximate Mining of  Consensus Sequential Patterns


What have I done?

employs the evaluation method to run an extensive set of empirical evaluations of approxMAP on synthetic data,

employs the evaluation method to compare the effectiveness of approxMAP to the conventional methods based on support model,

derives the expected support of random sequences under the null hypothesis of no pattern in the database to better understand the behavior of the support-based methods, and

demonstrates the usefulness of approxMAP using real world data.

Page 144: Approximate Mining of  Consensus Sequential Patterns


Future Work

Sample-based iterative clustering
– Memory management

Distance metric
– Multisets
– Taxonomy tree

Strength cutoff
– Automatic detection of customized cutoffs

Local alignment

Page 145: Approximate Mining of  Consensus Sequential Patterns


Thank You !

Advisor
– Wei Wang (02-04)
– James Coggins (00-02)
– Prasun Dewan (99-00)
– Kye Hedlund (96-99)
– Jan Prins (95-96)

SW advisor
– Dean Duncan (98-04)

Other people
– Janet Jones
– Kim Flair
– Susan Paulsen

Fellow students
– Priyank Porwal, Andrew Leaver-Fay, Leland Smith

Committee
– Stephen Aylward
– Jan Prins
– Andrew Nobel

Other faculty
– Jian Pei
– Jack Snoeyink
– J. S. Marron
– Stephen Pizer
– Stephen Weiss

Colleagues
– Sang-Uok Kum, Jisung Kim
– Alexandra, Michelle, Aron, Chris

Family
– Sohmee, my mom, dad, sister

Page 146: Approximate Mining of  Consensus Sequential Patterns


Page 147: Approximate Mining of  Consensus Sequential Patterns


Rw

Rw(Xw, Y) = [ R'w · wX + (n − wX) ] / n

R'w(Xw, Y) = [ weight(X) + |Y|·wX − 2·weight(X∩Y) ] / [ weight(X) + |Y|·wX ]

(analogous to the plain itemset distance ( |X| + |Y| − 2·|X∩Y| ) / ( |X| + |Y| ))
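The formulas above can be sanity-checked with a direct transcription. This is a sketch with illustrative argument names: `weights` is the item-weight map of the weighted itemset Xw, `wx` its itemset weight, and `n` the number of sequences aligned so far.

```python
def r_prime(weights, wx, y):
    """R'w(Xw, Y) = [weight(X) + |Y|*wX - 2*weight(X ∩ Y)] / [weight(X) + |Y|*wX]."""
    weight_x = sum(weights.values())
    weight_xy = sum(w for item, w in weights.items() if item in y)
    denom = weight_x + len(y) * wx
    return (weight_x + len(y) * wx - 2 * weight_xy) / denom

def r_w(weights, wx, y, n):
    """Rw(Xw, Y) = [R'w * wX + (n - wX)] / n."""
    return (r_prime(weights, wx, y) * wx + (n - wx)) / n
```

With a single-sequence weighted itemset (all weights 1, wX = n = 1) this collapses to the plain itemset distance, e.g. R'w of {A,B} against {A} is (2 + 1 − 2)/3 = 1/3.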

ID    Aligned Sequence
seq3  (A)    ()     (B)     (DE)   ()
seq2  (AE)   (H)    (B)     (D)    ()
seq4  (A)    ()     (BCG)   (D)    ()
seq5  ()     ()     (BCI)   (DE)   ()
seq1  (AG)   (F)    (BC)    (AE)   (H)

Weighted Seq  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1  : 5