Approximate Mining of Consensus Sequential Patterns


Hye-Chung (Monica) Kum

University of North Carolina, Chapel Hill
Computer Science Department

School of Social Work

http://www.cs.unc.edu/~kum/approxMAP

Approximate Mining of Consensus Sequential Patterns


Knowledge Discovery & Data mining (KDD)

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"

The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner,

combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing.

Fayyad, Piatetsky-Shapiro, Smyth 1996

What is KDD?

Purpose: extract useful information
Source: operational or administrative data
Examples: VIC card database for buying patterns; monthly welfare service patterns

Example

Analyze buying patterns for sales marketing (VIC card data):

TID  Transaction
1    {Diapers, Hotdogs, Buns, Beer}
2    {Bread, Milk, Diapers, Wipes, Beer}
3    {Milk, Diapers, Beer, Water}
4    {Bread, Milk, Bananas, Cereal}
5    {Bread, Milk, Diapers, Beer}
6    {Steak, Corn, Coke, Beer}
7    {Milk, Orange Juice, Diapers, Baby Food}
8    {Bread, Milk, Diapers, Beer}

For example, {Milk, Diapers, Beer} occurs in 4/8 = 50% of the transactions, and {Diapers, Beer} in 5/8 = 63%.

Overview

- What is KDD (Knowledge Discovery & Data mining)
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion

Sequential Pattern Mining

Group the transactions by customer (CID) and order them in time:

CID  TID  Transaction
1    1    {Diapers, Hotdogs, Buns, Beer}
1    2    {Bread, Milk, Diapers, Wipes, Beer}
1    3    {Milk, Diapers, Beer, Water}
2    4    {Bread, Milk, Bananas, Cereal}
2    6    {Steak, Corn, Coke, Beer}
3    5    {Bread, Milk, Diapers, Beer}
3    7    {Milk, Orange Juice, Diapers, Baby Food}
3    8    {Bread, Milk, Diapers, Beer}

CID  Sequential Transaction
1    <{Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}>
2    <{Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}>
3    <{Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}>

Sequential pattern mining: detecting patterns in sequences of sets.

Welfare Program Participation Patterns

- What are the common participation patterns?
- What are the variations on them?
- How do different policies affect these patterns?

CID  Sequential Transaction
1    <{W(elfare) M(edi) F(oodstamp)} {WMF} {WMF} {MF} {MF} {F}>
2    <{WMF} {WMF} {WMF} {WMF} {WMF} {M} {M}>
3    <{F} {F} {F} {WMF} {WMF} {WMF} {MF} {MF} {F}>

Thesis Statement

The author of this dissertation asserts that multiple alignment is an effective model for uncovering the underlying trend in sequences of sets.

I will show that ApproxMAP:
- is a novel method that applies multiple alignment techniques to sequences of sets,
- effectively extracts the underlying trend in the data by organizing the large database into clusters, and
- gives reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment.

Furthermore, I will show that ApproxMAP:
- is robust to its input parameters,
- is robust to noise and outliers in the data,
- is scalable with respect to the size of the database, and
- in comparison to the conventional support model, can better recover the underlying patterns with little confounding information under most circumstances.

In addition, I will demonstrate the usefulness of ApproxMAP using real world data.

Thesis Statement

Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.

ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.

ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail.

I will demonstrate the usefulness of ApproxMAP using real-world data.

Sequential Pattern Mining

Detecting patterns in sequences of sets

Sequence:  seq1 = <(A,B,D) (B) (C,D) (B,C)>
Itemset:   s13 = (C,D)  (the 3rd itemset of seq1)
Items:     I = {A, B, C, D}

- Nseq : total # of sequences in the database
- Lseq : avg # of itemsets in a sequence
- Iseq : avg # of items in an itemset
- Lseq * Iseq : avg length of a sequence
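To make the notation concrete, here is a minimal sketch of this data model in Python (the toy database and variable names are illustrative, not from the dissertation):

```python
from statistics import mean

# A sequence is an ordered list of itemsets; an itemset is a set of items.
db = [
    [frozenset("ABD"), frozenset("B"), frozenset("CD"), frozenset("BC")],  # seq1
    [frozenset("A"), frozenset("BC"), frozenset("CD")],
]

n_seq = len(db)                              # Nseq: # of sequences
l_seq = mean(len(s) for s in db)             # Lseq: avg # of itemsets per sequence
i_seq = mean(len(x) for s in db for x in s)  # Iseq: avg # of items per itemset
# Lseq * Iseq approximates the avg # of items in a sequence.
```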

Conventional Methods: Support Model

Super-sequence vs. sub-sequence:
- (A,B,D)(B)(C,D)(B,C) is a super-sequence of (A)(B)(C,D)

Support(P): the # of super-sequences of P in D.

Given D and a user threshold min_sup, find the complete set of patterns P s.t. Support(P) ≥ min_sup.
- R. Agrawal and R. Srikant: ICDE 95 & EDBT 96

Methods:
- breadth first – Apriori principle (GSP): R. Agrawal and R. Srikant, ICDE 95 & EDBT 96
- depth first – pattern growth (PrefixSpan): J. Han and J. Pei, SIGKDD 2000 & ICDE 2001
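A minimal sketch of the support model in Python (the containment test is the standard greedy one; the helper names are my own):

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]

def is_subsequence(pattern: Sequence, seq: Sequence) -> bool:
    """True if every itemset of `pattern` is contained, in order,
    in some itemset of `seq` (super-/sub-sequence containment)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(pattern: Sequence, db: List[Sequence]) -> int:
    """# of super-sequences of `pattern` in the database D."""
    return sum(is_subsequence(pattern, s) for s in db)

# The running example: (A)(B)(C,D) is a sub-sequence of (A,B,D)(B)(C,D)(B,C).
db = [[frozenset("ABD"), frozenset("B"), frozenset("CD"), frozenset("BC")],
      [frozenset("A"), frozenset("BC"), frozenset("CD")]]
assert support([frozenset("A"), frozenset("B"), frozenset("CD")], db) == 2
```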

Example: Support Model

<{Dp, Br} {Mk, Dp} {Mk, Dp, Br}> : support 2/3 = 67%

A single sequence with L = 7 items has 2^L − 1 = 2^7 − 1 = 127 subsequences, e.g.:
- {Dp, Br} {Mk, Dp} {Mk, Br}
- {Dp, Br} {Mk, Dp} {Mk, Dp}
- {Mk, Dp} {Mk, Dp, Br}
- {Dp, Br} {Mk, Dp, Br}
- {Br} {Mk, Dp} {Mk, Dp, Br}
- {Dp} {Mk, Dp} {Mk, Dp, Br}
- {Dp, Br} {Dp} {Mk, Dp, Br}
- {Dp, Br} {Mk} {Mk, Dp, Br}
- {Dp, Br} {Mk, Dp} {Dp, Br}
- … etc …

CID  Sequential Transaction
1    <{Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}>
2    <{Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}>
3    <{Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}>

Inherent Problems: the Model

Support cannot distinguish between statistically significant patterns and random occurrences:
- theoretically, short random sequences occur often in long sequential data simply by chance
- empirically, the # of spurious patterns grows exponentially w.r.t. Lseq

Inherent Problems: Exact Match

A pattern gets support only when it is exactly contained in a sequence, so the model often fails to find general long patterns. For example:
- many customers may share similar buying habits
- but few of them follow exactly the same pattern

Inherent Problems: Complete Set

Mining the complete set yields too many trivial patterns. Given long sequences with noise:
- too expensive, and too many patterns: 2^L − 1 = 2^10 − 1 = 1023
- finding max / closed sequential patterns is non-trivial, and in a noisy environment there are still too many max/closed patterns

Possible Models

Support model:
- patterns in sets – an unordered list of items

Multiple alignment model:
- finds common patterns among strings
- a string is a simple ordered list of characters

Multiple Alignment

Line up the sequences to detect the trend:
- find common patterns among strings
- DNA / bio sequences

P A T T T E R N
P A T – – E R M
P – T T – – R N
O A T T – E R B
P S Y Y – R T N
→ P A T T E R N

Edit Distance

Pairwise score (edit distance), dist(seq1, seq2):
- the minimum # of ops required to change seq1 into seq2
- ops = INDEL(a) and/or REPLACE(a,b)
- computed via the standard dynamic-programming recurrence relation

P A T T T E R N
P A T – – E R M
(the two gaps are INDELs; N → M is a REPLACE)

Multiple alignment score:
- ∑ PS(seqi, seqj) over all pairs 1 ≤ i < j ≤ N
- optimal alignment: the one with minimum score
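A minimal sketch of the recurrence in Python, with unit INDEL cost and a pluggable REPLACE cost, so the same routine works for plain strings here and for sequences of itemsets once the itemset REPLACE cost is defined later:

```python
def edit_distance(s1, s2, replace_cost, indel_cost=1.0):
    """dist(i,j) = min( dist(i-1,j) + INDEL,
                        dist(i,j-1) + INDEL,
                        dist(i-1,j-1) + REPLACE(s1[i-1], s2[j-1]) )"""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + indel_cost,
                          d[i][j - 1] + indel_cost,
                          d[i - 1][j - 1] + replace_cost(s1[i - 1], s2[j - 1]))
    return d[n][m]

# Plain strings: REPLACE(a,b) = 0 if a == b else 1.
# PATTTERN -> PATERM: delete T, delete T, replace N -> M = 3 ops.
assert edit_distance("PATTTERN", "PATERM",
                     lambda a, b: 0.0 if a == b else 1.0) == 3.0
```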

Consensus Sequence

Weighted sequence: a compression of the aligned sequences into one sequence.

seq1          (A)    ()    (B)    (DE)
seq2          (AE)   (H)   (BC)   (E)
seq3          (A)    ()    (BCG)  (D)
Weighted Seq  (A:3, E:1):3  (H:1):1  (B:3, C:2, G:1):3  (D:2, E:2):3   3

strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences)
- A: 3/3 = 100%
- E: 1/3 = 33%
- H: 1/3 = 33%

Consensus itemset (j), with min_strength = 2 (counts as in the weighted sequence):
- { ia | ia ∈ I and strength(ia, j) ≥ min_strength }

Consensus sequence: the concatenation of the consensus itemsets.

Consensus Seq  (A) (BC) (DE)
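A minimal sketch of these two steps in Python (gap markers, helper names, and the tuple layout are my own choices, and the alignment is assumed to be given):

```python
from collections import Counter
from typing import FrozenSet, List, Optional

# An aligned sequence is a list of itemsets, with None marking a gap.
AlignedSeq = List[Optional[FrozenSet[str]]]

def weighted_sequence(aligned: List[AlignedSeq]):
    """Compress aligned sequences into one weighted sequence: per
    position, a Counter of item occurrences plus the # of sequences
    that have a (non-gap) itemset there."""
    positions = []
    for j in range(len(aligned[0])):
        items, n = Counter(), 0
        for seq in aligned:
            if seq[j] is not None:
                n += 1
                items.update(seq[j])
        positions.append((items, n))
    return positions, len(aligned)

def consensus(wseq, min_strength: int):
    """Keep items occurring at least `min_strength` times in a
    position; drop positions whose consensus itemset is empty."""
    positions, _ = wseq
    out = []
    for items, _n in positions:
        itemset = {i for i, c in items.items() if c >= min_strength}
        if itemset:
            out.append(itemset)
    return out

aligned = [
    [frozenset("A"),  None,           frozenset("B"),   frozenset("DE")],
    [frozenset("AE"), frozenset("H"), frozenset("BC"),  frozenset("E")],
    [frozenset("A"),  None,           frozenset("BCG"), frozenset("D")],
]
ws = weighted_sequence(aligned)
print(consensus(ws, min_strength=2))  # [{'A'}, {'B','C'}, {'D','E'}] (set order may vary)
```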

Multiple Alignment Sequential Pattern Mining

Given:
- N sequences of sets,
- op costs (INDEL & REPLACE) for itemsets, and
- strength thresholds for consensus sequences,

(1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimized,
(2) find the multiple alignment for each partition, and
(3) find the pattern consensus sequence and the variation consensus sequence for each partition.

ApproxMAP

(Approximate Multiple Alignment Pattern mining)

The exact solution is too expensive! Approximation method: ApproxMAP.
- Organize into K partitions: use clustering
- Compress each partition into a weighted sequence
- Summarize each partition into a pattern consensus sequence and a variation consensus sequence

Tasks

- Define op costs (INDEL & REPLACE) for itemsets
- Organize into K partitions: use clustering
- Compress each partition into a weighted sequence
- Summarize each partition into a pattern consensus sequence and a variation consensus sequence

Op Costs for Itemsets

Normalized set difference:
- R(X, Y) = (|X − Y| + |Y − X|) / (|X| + |Y|)
- 0 ≤ R ≤ 1, and R is a metric
- INDEL(X) = R(X, ∅) = 1

REPLACE cost examples:
  R((a), (a))   = 0
  R((a), (ab))  = 1/3
  R((ab), (ac)) = 1/2
  R((a), (b))   = 1
  R((ab), (cd)) = 1
  R((a), ())    = 1

Related measures:
- Jaccard coefficient: 1 − |X ∩ Y| / |X ∪ Y| = 1 − |X ∩ Y| / (|X − Y| + |Y − X| + |X ∩ Y|)
- Sørensen coefficient (a simple index that gives greater weight to common elements):
  1 − 2|X ∩ Y| / (|X − Y| + |Y − X| + 2|X ∩ Y|) = (|X| + |Y| − 2|X ∩ Y|) / (|X| + |Y|) = R(X, Y)
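A one-function sketch in Python, checked against the REPLACE examples above:

```python
def replace_cost(x: frozenset, y: frozenset) -> float:
    """Normalized set difference R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|);
    equals the Sørensen distance, and INDEL(X) = R(X, {}) = 1."""
    if not x and not y:
        return 0.0
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

assert replace_cost(frozenset("a"), frozenset("ab")) == 1/3
assert replace_cost(frozenset("ab"), frozenset("ac")) == 1/2
assert replace_cost(frozenset("a"), frozenset()) == 1.0
```

Plugging `replace_cost` into the `edit_distance` sketch above yields dist(seq1, seq2) for sequences of itemsets.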


Organize: Partition into K Sets

Goal:
- minimize the sum of the K multiple alignment scores
- i.e., group similar sequences

Approximation:
- calculate the N×N proximity matrix (pairwise score: edit distance)
- then use any clustering method that works best for your data

Organize: Clustering

Desirable properties:
- forms groups of arbitrary shape and size
- can estimate the number of clusters from the data

Density Based Clustering

k-nearest-neighbor clustering: partition at the valleys of the density estimate.

density(seq) = n / (|D| · d) ∝ n / d
- n and d are based on the user-defined k-nearest-neighbor space
- n: # of neighbors
- d: size of the neighbor region

Parameter k (the neighbor space) allows clustering at different resolutions, as desired.
In general: uniform-kernel k-NN clustering, efficient at O(kN).
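A sketch of the density step only, in Python/NumPy (the valley-based merging into clusters is omitted; the function and variable names are mine):

```python
import numpy as np

def knn_density(dist: np.ndarray, k: int) -> np.ndarray:
    """Uniform-kernel k-NN density from a precomputed N x N distance
    matrix: density_i = n_i / d_i, where d_i is the distance to the
    k-th nearest neighbor of sequence i and n_i is the # of sequences
    within that radius (requires k < N)."""
    n_total = dist.shape[0]
    density = np.empty(n_total)
    for i in range(n_total):
        d_i = np.sort(dist[i])[k]                    # index 0 is self (distance 0)
        n_i = np.count_nonzero(dist[i] <= d_i) - 1   # exclude self
        density[i] = n_i / d_i if d_i > 0 else np.inf
    return density
```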


Data Compression: Multiple Alignment

Optimal multiple alignment is too expensive! Greedy approximation:
- incrementally align, in density-descending order
- each step is a pairwise alignment of one sequence to the running weighted sequence

The cluster, in alignment order (density-descending):
seq3  (A) (B) (DE)
seq2  (AE) (H) (B) (D)
seq4  (A) (BCG) (D)
seq5  (BCI) (DE)
seq1  (AG) (F) (BC) (AE) (H)

Align seq3 and seq2:
WS1  (A:2,E:1):2  (H:1):1  (B:2):2  (D:2,E:1):2   2

Align seq4 to WS1:
WS2  (A:3,E:1):3  (H:1):1  (B:3,C:1,G:1):3  (D:3,E:1):3   3

Align seq5 to WS2:
WS3  (A:3,E:1):3  (H:1):1  (B:4,C:2,G:1,I:1):4  (D:4,E:2):4   4

Align seq1 to WS3:
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Op Cost: Itemset to Weighted Itemset (Rw)

While aligning seq1 = (AG) (F) (BC) (AE) (H) to WS3, what is
REPLACE( (A:3,E:1):3 of 4 , (AG) )?

The exact answer averages the individual costs against the four sequences compressed into WS3:
- seq3 (A):   R((A),(AG))  = 1/3
- seq2 (AE):  R((AE),(AG)) = 1/2
- seq4 (A):   R((A),(AG))  = 1/3
- seq5 (gap): INDEL        = 1
Total average = 65/120

Approximation from the weighted itemset alone (here weight(X) = 3 + 1 = 4, wX = 3, and weight(X ∩ Y) = 3 for Y = (AG)):
- For the wX = 3 sequences present at this position (exact average 7/18 = 35/90):
  R'w(Xw, Y) = [ weight(X) + |Y|·wX − 2·weight(X ∩ Y) ] / [ weight(X) + |Y|·wX ]
             = [ (4 + 2·3) − 2·3 ] / (4 + 2·3) = 2/5 = 36/90
  (with all weights equal to 1, R'w reduces to (|X| + |Y| − 2|X ∩ Y|) / (|X| + |Y|) = R(X, Y))
- The n − wX sequences with a gap at this position each cost 1, an INDEL: 1 · (n − wX)

Rw(Xw, Y) = [ R'w(Xw, Y) · wX + (n − wX) ] / n
          = [ (2/5)·3 + 1 ] / 4 = 11/20 = 66/120  (vs. the exact 65/120)

Properties:
- 0 ≤ Rw ≤ 1, and Rw is a metric
- INDEL(Xw) = Rw(Xw, ∅) = INDEL(Y) = Rw(∅, Y) = 1
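A minimal sketch in Python (the Counter layout and names are mine), reproducing the worked value 11/20:

```python
from collections import Counter

def weighted_replace_cost(items: Counter, w_x: int, n: int, y: frozenset) -> float:
    """Rw: approximate REPLACE cost between a weighted itemset
    (item -> occurrence count, present in w_x of the n compressed
    sequences) and a plain itemset y."""
    weight_x = sum(items.values())
    weight_xy = sum(c for i, c in items.items() if i in y)
    denom = weight_x + len(y) * w_x
    r_prime = (denom - 2 * weight_xy) / denom  # cost vs. the w_x present sequences
    return (r_prime * w_x + (n - w_x)) / n     # gap sequences each cost 1 (INDEL)

# WS3 position (A:3, E:1):3 of 4 sequences, aligned against (AG):
cost = weighted_replace_cost(Counter({"A": 3, "E": 1}), w_x=3, n=4,
                             y=frozenset("AG"))
assert abs(cost - 11/20) < 1e-12
```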


Summarize: Generate and Present results

From N sequences to K weighted sequences. A weighted sequence compresses all sequences in its cluster, so it can be huge: WS4 above compresses 5 sequences, while the weighted sequence below compresses a real 162-sequence cluster.

< (E:1, L:1, R:1, T:1, V:1, d:1) (A:1, B:9, C:8, D:8, E:12, F:1, L:4, P:1, S:1, T:8, V:5, X:1, a:1, d:10, e:2, f:1, g:1, p:1) (B:99, C:96, D:91, E:24, F:2, G:1, L:15, P:7, R:2, S:8, T:95, V:15, X:2, Y:1, a:2, d:26, e:3, g:6, l:1, m:1) (A:5, B:16, C:5, D:3, E:13, F:1, H:2, L:7, P:1, R:2, S:7, T:6, V:7, Y:3, d:3, g:1) (A:13, B:126, C:27, D:1, E:32, G:5, H:3, J:1, L:1, R:1, S:32, T:21, V:1, W:3, X:2, Y:8, d:13, e:1, f:8, i:2, p:7, l:3, g:1) (A:12, B:6, C:28, D:1, E:28, G:5, H:2, J:6, L:2, S:137, T:10, V:2, W:6, X:8, Y:124, a:1, d:6, g:2, i:1, l:1, m:2) (A:135, B:2, C:23, E:36, G:12, H:124, K:1, L:4, O:2, R:2, S:27, T:6, V:6, W:10, X:3, Y:8, Z:2, a:1, d:6, g:1, h:2, j:1, k:5, l:3, m:7, n:1) (A:11, B:1, C:5, E:12, G:3, H:10, L:7, O:4, S:5, T:1, V:7, W:3, X:2, Y:3, a:1, m:2) (A:31, C:15, E:10, G:15, H:25, K:1, L:7, M:1, O:1, R:4, S:12, T:10, V:6, W:3, Y:3, Z:3, d:7, h:3, j:2, l:1, n:1, p:1, q:1) (A:3, C:5, E:4, G:7, H:1, K:1, R:1, T:1, W:2, Z:2, a:1, d:1, h:1, n:1) (A:20, C:27, E:13, G:35, H:7, K:7, L:111, N:2, O:1, Q:3, R:11, S:10, T:20, V:111, W:2, X:2, Y:3, Z:8, a:1, b:1, d:13, h:9, j:1, n:1, o:2) (A:17, B:2, C:14, E:17, F:1, G:31, H:8, K:13, L:2, M:2, N:1, R:22, S:2, T:140, U:1, V:2, W:2, X:1, Z:13, a:1, b:8, d:6, h:14, n:6, p:1, q:1) (A:12, B:7, C:5, E:13, G:16, H:5, K:106, L:8, N:2, O:1, R:32, S:3, T:29, V:9, X:2, Z:9, b:16, c:5, d:5, h:7, l:1) (A:7, B:1, C:9, E:5, G:7, H:3, K:7, R:8, S:1, T:10, X:1, Z:3, a:2, b:3, c:1, d:5, h:3) (A:1, B:1, H:1, R:1, T:1, b:2, c:1) (A:3, B:2, C:2, E:6, F:2, G:4, H:2, K:20, M:2, N:3, R:19, S:3, T:11, U:2, X:4, Z:34, a:3, b:11, c:2, d:4) (H:1, Y:1, a:1, d:1) > : 162

Presentation Model

Item strength w runs from 100% down to 0%; two cutoffs split items into three bands:
- frequent items (definite pattern items): strength ≥ 50%
- common items (uncertain): 20% ≤ strength < 50%
- rare items (noise items): strength < 20%
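The banding is a simple threshold rule; a sketch (the cutoff values are from the slide, the function name is mine):

```python
def band(strength: float) -> str:
    """Classify an item by its strength w (the fraction of sequences
    in the cluster that have the item in this position)."""
    if strength >= 0.50:
        return "frequent"  # definite pattern item
    if strength >= 0.20:
        return "common"    # uncertain item
    return "rare"          # noise item
```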

Visualization

Pattern consensus sequence:
- cutoff: minimum cluster strength (= 50%), i.e. frequent items only

Variation consensus sequence:
- cutoff: minimum cluster strength (= 20%), i.e. frequent + common items

Item strengths are color-coded in bands <100%: 85%: 70%: 50%: 35%: 20%>. For the 162-sequence cluster:

(B:61%, C:59%, D:56%, T:59%) (B:78%) (S:85%, Y:77%) (A:83%, E:22%, H:77%) (G:22%, L:69%, V:69%) (T:86%) (K:65%) (B:21%) : 162

Pattern   (B C D T) (B) (S Y) (A H)   (L V)   (T) (K)     : 162
Variation (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162

Example: Given 10 sequences, lexically sorted

ID     Full Sequence (database, lexically sorted)  Cluster  Lseq  Len
seq1   (A) (B,C,Y) (D)                             1        3     5
seq2   (A) (X) (B,C) (A,E) (Z)                     1        5     7
seq3   (A,I) (Z) (K) (L,M)                         2        4     6
seq4   (A,L) (D,E)                                 1        2     4
seq5   (I,J) (B) (K) (L)                           2        4     5
seq6   (I,J) (L,M)                                 2        2     5
seq7   (I,J) (K) (J,K) (L) (M)                     2        5     7
seq8   (I,M) (K) (K,M) (L,M)                       2        4     7
seq9   (J) (K) (L,M)                               2        3     4
seq10  (V) (K,W) (Z)                               2        3     4

Color scheme <100: 85: 70: 50: 35: 20>

Cluster 1 (cluster strength = 40% = 2 sequences):

seq1          (A)          (B,C,Y)  (D)
seq4          (A,L)                 (D,E)
seq2          (A)    (X)   (B,C)    (A,E)  (Z)
Weighted Seq  (A:3, L:1):3  (X:1):1  (B:2, C:2, Y:1):2  (A:1, D:2, E:2):3  (Z:1):1   3
Consensus Pat (A) (B,C) (D,E)

Cluster 2 (cluster strength = 40% = 3 sequences):

seq9   (J)           (K)    (L,M)
seq5   (I,J)  (B)    (K)    (L)
seq3   (A,I)  (Z)    (K)    (L,M)
seq7   (I,J)  (K)    (J,K)  (L)    (M)
seq8   (I,M)  (K)    (K,M)  (L,M)
seq6   (I,J)                (L,M)
seq10         (V)    (K,W)         (Z)
Weighted Seq  (A:1,I:5,J:4,M:1):6  (B:1,K:2,V:1,Z:1):5  (J:1,K:6,M:1,W:1):6  (L:6,M:4):6  (M:1,Z:1):2   7
Consensus Pat (w ≥ 3)  (I,J) (K) (L,M)
Consensus Var (w ≥ 2)  (I,J) (K) (K) (L,M)
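As a check, the weighted-sequence/consensus sketch from earlier reproduces cluster 2's consensus sequences from this alignment:

```python
aligned_c2 = [
    [frozenset("J"),  None,           frozenset("K"),  frozenset("LM"), None],
    [frozenset("IJ"), frozenset("B"), frozenset("K"),  frozenset("L"),  None],
    [frozenset("AI"), frozenset("Z"), frozenset("K"),  frozenset("LM"), None],
    [frozenset("IJ"), frozenset("K"), frozenset("JK"), frozenset("L"),  frozenset("M")],
    [frozenset("IM"), frozenset("K"), frozenset("KM"), frozenset("LM"), None],
    [frozenset("IJ"), None,           None,            frozenset("LM"), None],
    [None,            frozenset("V"), frozenset("KW"), None,            frozenset("Z")],
]
ws2 = weighted_sequence(aligned_c2)
print(consensus(ws2, min_strength=3))  # pattern:   [{I,J}, {K}, {L,M}]
print(consensus(ws2, min_strength=2))  # variation: [{I,J}, {K}, {K}, {L,M}]
```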

Example: Support Model (min_sup = 20% = 2 seqs)

id  pattern    sup   id  pattern     sup   id  pattern          sup
1   (A)        4     17  (A)(D)      2     33  (I,J)(K)         2
2   (B)        3     18  (A)(E)      2     34  (I,J)(L)         3
3   (C)        2     19  (A)(Z)      2     35  (I,J)(M)         2
4   (D)        2     20  (A)(B,C)    2     36  (I)(K)(K)        2
5   (E)        2     21  (I)(K)      4     37  (I)(K)(L)        2
6   (I)        5     22  (I)(L)      5     38  (I)(K)(M)        2
7   (J)        4     23  (I)(M)      4     39  (I)(K)(L,M)      2
8   (K)        6     24  (I)(L,M)    3     40  (J)(K)(L)        2
9   (L)        7     25  (J)(K)      3     41  (J)(K)(M)        2
10  (M)        5     26  (J)(L)      4     42  (K)(K)(L)        2
11  (Z)        3     27  (J)(M)      3     43  (K)(K)(M)        2
12  (B,C)      2     28  (J)(L,M)    2     44  (I,J)(K)(L)      2
13  (I,J)      2     29  (K)(K)      2     45  (I)(K)(K)(L)     2
14  (L,M)      2     30  (K)(L)      5     46  (I)(K)(K)(M)     2
15  (A)(B)     2     31  (K)(M)      4
16  (A)(C)     2     32  (K)(L,M)    3

The support model returns these 46 patterns; the multiple alignment model summarizes the same database as just two consensus patterns: (A) (B,C) (D,E) and (I,J) (K) (L,M).

ApproxMAP
(Approximate Multiple Alignment Pattern mining)

Time complexity of the approximation:
- Organize into K partitions: O(Nseq² · Lseq² · Iseq)
  - proximity matrix: O(Nseq² · Lseq² · Iseq)
  - clustering: O(k · Nseq)
- Compress each partition into weighted sequences: O(n · L²)
- Summarize each partition into pattern / variation consensus sequences: O(1)
- Total: O(Nseq² · Lseq² · Iseq), with 2 optimizations (below)


Evaluation

Up to now, only performance and scalability have been evaluated. What about quality?
- What kind of patterns will the model generate?
- Evaluate the correctness of the model.

Why?
- a basis for comparing different models
- essential for understanding the results of approximate solutions

Evaluation Method

Given:
- the set of base patterns B used to generate the database D, with expected frequency E(FB) and expected length E(LB)
- the set of result patterns P mined from D

How?
- map each result pattern Pi to the best base pattern Bj,
- based on the longest common subsequence: the Bj maximizing |B ∩ P| (the # of items shared in the longest common subsequence)

Item Level: Confusion Matrix

                          Predicted (result patterns)
                          +                  −
Actual (base pat)   +     pattern items      missed items
                    −     extraneous items   N/A

Evaluation Criteria: Item Level

Recoverability: the degree to which pattern items in the base patterns are found (weighted):
- R = Σ_B E(FB) · [ max over result patterns P of |B ∩ P| / E(LB) ]
- each term is cut off at 1, so that 0 ≤ R ≤ 1

Precision: the degree of pattern items in the result patterns:
- precision = pattern items / (pattern items + extraneous items)
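A sketch of recoverability in Python. The dissertation defines |B ∩ P| via the longest common subsequence and takes E(LB) from the data generator; here I score the item overlap with an order-preserving DP and approximate E(LB) by the base pattern's item count, which matches the worked example below:

```python
def shared_items(b, p):
    """|B ∩ P|: max total itemset-intersection size over an
    order-preserving pairing of itemsets (LCS-style DP)."""
    n, m = len(b), len(p)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + len(b[i - 1] & p[j - 1]))
    return best[n][m]

def recoverability(base, results):
    """base: list of (E_FB, pattern); each pattern a list of frozensets."""
    total = 0.0
    for e_fb, b in base:
        e_lb = sum(len(s) for s in b)  # approximation of E(LB): # of items in B
        cover = max((shared_items(b, p) for p in results), default=0)
        total += e_fb * min(1.0, cover / e_lb)
    return total
```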

Evaluation Criteria: Sequence Level

Spurious patterns:
- result patterns with more extraneous items than pattern items

Max patterns: determine the max pattern for each Bj:
- of all Pi mapped to a particular Bj, the Pi with the longest common subsequence, max over P of |B ∩ P|

Redundant patterns:
- all other patterns

Ntotal = Nmax + Nspur + Nredun

Evaluation Example

Base patterns:
- 30%: (A)(BC)(DE)
- 70%: (IJ)(K)(LM)

Result patterns:
- (A)(B)(DE), (A)(BC)(D), (B)(BC)(DE)
- (IJ)(LM), (J)(K)(LM), (IJ)(KX)(LM)
- (XY)(K)(Z)

Scoring:
- Ntotal = 7
- Spurious = 1
- Recoverability = 30%·(4/5) + 70%·(5/5) = 94%
- Redundant = 4
- Precision = 1 − 5/31 = 84%
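Running the recoverability sketch from above on this example reproduces the 94% figure:

```python
base = [(0.30, [frozenset("A"), frozenset("BC"), frozenset("DE")]),
        (0.70, [frozenset("IJ"), frozenset("K"), frozenset("LM")])]
results = [[frozenset("A"), frozenset("B"), frozenset("DE")],
           [frozenset("A"), frozenset("BC"), frozenset("D")],
           [frozenset("B"), frozenset("BC"), frozenset("DE")],
           [frozenset("IJ"), frozenset("LM")],
           [frozenset("J"), frozenset("K"), frozenset("LM")],
           [frozenset("IJ"), frozenset("KX"), frozenset("LM")],
           [frozenset("XY"), frozenset("K"), frozenset("Z")]]
print(recoverability(base, results))  # ≈ 0.94
```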

Synthetic Data

Patterned data: the IBM synthetic data generator
- given DB parameters, outputs a sequence DB plus the base patterns used to generate it: E(FB) and E(LB)
- R. Agrawal and R. Srikant: ICDE 95 & EDBT 96

Random data:
- independence both between and across itemsets

Patterned data + systematic noise:
- randomly change an item with probability 1 − α (Yang, SIGMOD 2002)

Patterned data + systematic outliers:
- add random sequences


Results

ApproxMAP settings:
- pattern consensus sequences only
- no null or one-itemset consensus sequences

Machine: "swan"
- 2 GHz Intel Xeon processor, 2 GB of memory
- a public machine, so consistent running-time measurements were difficult to get

Database Parameters

Notation  Meaning                                      Value
||I||     # of items                                   100
|| ||     # of potentially frequent itemsets           500
Ipat      avg # of items per itemset in base patterns  2
Lpat      avg # of itemsets per base pattern           7
Npat      # of base pattern sequences                  10
Nseq      # of data sequences                          1000
Lseq      avg # of itemsets per data sequence          10
Iseq      avg # of items per itemset in the DB         2.5


IBM Synthetic Data Generator: 10 base patterns → D = 1000 sequences

Id:LB   E(FB):E(LB)  Base Pattern
B1:14   0.21:0.66    [15,16,17,66] [15] [58,99] [2,74] [31,76] [66] [62] [93]
B2:22   0.161:0.83   [22,50,66] [16] [29,99] [94] [45,67] … [2,22,58] [63,74,99]
B3:14   0.141:0.82   [22] [22] [58] [2,16,24,63] [24,65,93] [6] [11,15,74]
etc.
B10:17  0.008:0.66   [16] [2,23,74,88] [24,63] [20,96] [91] [40,62] … [29,40,99]

ApproxMAP pipeline on this data:

D (1000 sequences) → clusters (cluster1 with 162 seqs, cluster2, …, cluster9)
→ weighted sequences (wseq1 … wseq9)
→ consensus sequences (PatConSeq1 / VarConSeq1 … PatConSeq9 / VarConSeq9)

For cluster1 (162 sequences), the weighted sequence is the huge structure shown earlier; the strength cutoffs reduce it to:

Pattern   (B C D T) (B) (S Y) (A H)   (L V)   (T) (K)     : 162
Variation (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162

Evaluation

- 8 patterns returned: 7 max patterns, 1 redundant pattern, 0 spurious patterns (plus 1 null pattern)
- Recoverability: 91.16%
- Precision: 97.17%
- Extraneous items: 3/106

BasePi (E(FB):E(LB))  ||P||  Pattern <100: 85: 70: 50: 35: 20>

BaseP1 (0.21:0.66)    14  <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62) (93)>
PatConSeq1            13  <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62)>
VarConSeq1            18  <(15 16 17 66) (15 22) (58 99) (2 74) (24 31 76) (24 66) (50 62) (93)>

BaseP2 (0.161:0.83)   22  <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58) (63 74 99)>
PatConSeq2            19  <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58)>
VarConSeq2            25  <(22 50 66) (16) (29 99) (22 58 94) (2 45 58 67) (12 28 36) (2 50) (24 96) (51) (66) (2 22 58)>
PatConSeq3            15  <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>
VarConSeq3            15  <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>

BaseP3 (0.141:0.82)   14  <(22) (22) (58) (2 16 24 63) (24 65 93) (6) (11 15 74)>
PatConSeq4            11  <(22) (22) (58) (2 16 24 63) (24 65 93) (6)>
VarConSeq4            13  <(22) (22) (22) (58) (2 16 24 63) (2 24 65 93) (6 50)>

BaseP4 (0.131:0.90)   15  <(31 76) (58 66) (16 22 30) (16) (50 62 66) (2 16 24 63)>
PatConSeq5            11  <(31 76) (58 66) (16 22 30) (16) (50 62 66)>
VarConSeq5            11  <(31 76) (58 66) (16 22 30) (16) (50 62 66) (16 24)>

BaseP5 (0.123:0.81)   14  <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20) (93)>
PatConSeq6            13  <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20)>
VarConSeq6            16  <(22 43) (2 28 73) (58 96) (95) (2 74) (5) (2 66) (24 63) (20)>

BaseP6 (0.121:0.77)   9   <(63) (16) (2 22) (24) (22 50 66) (50)>
PatConSeq7            8   <(63) (16) (2 22) (24) (22 50 66)>
VarConSeq7            9   <(63) (16) (2 22) (24) (22 50 66)>

BaseP7 (0.054:0.60)   13  <(70) (58 66) (22) (74) (22 41) (2 74) (31 76) (2 74)>
PatConSeq8            16  <(70) (58) (22 58 66) (22 58) (74) (22 41) (2 74) (31 76) (2 74)>
VarConSeq8            18  <(70) (58 66) (22 58 66) (22 58) (74) (22 41) (2 22 66 74) (31 76) (2 74)>
PatConSeq9            0   (cluster size was only 5 sequences, so no pattern consensus sequence was produced)
VarConSeq9            8   <(70) (58 66) (74) (74) (22 41) (74)>

BaseP8 (0.014:0.91)   17  <(20 22 23 96) (50) (51 63) (58) (16) (2 22) (50) (23 26 36) (10 74)>
BaseP9 (0.038:0.78)   7   <(88) (24 58 78) (22) (58) (96)>
BaseP10 (0.008:0.66)  17  <(16) (2 23 74 88) (24 63) (20 96) (91) (40 62) (15) (40) (29 40 99)>


Comparative Study

Conventional sequential pattern mining: the support model.

Empirical analysis on:
- totally random data
- patterned data
- patterned data + noise
- patterned data + outliers

Evaluation: Comparison

Random data:
- ApproxMAP: no patterns
- Support model: numerous spurious patterns

Patterned data (10 patterns embedded into 1000 seqs):
- ApproxMAP (k=6, MinStrgh=30%): recoverability 91.16%, precision 97.17%, extraneous items 3/106; 8 patterns returned, 1 redundant, 0 spurious
- Support model (MinSup=5%): recoverability 91.59%, precision 96.29%, extraneous items 66,058/1,782,583; 253,782 patterns returned, 253,714 redundant, 58 spurious

Noise:
- ApproxMAP: robust
- Support model: not robust; recoverability degrades fast

Outliers:
- ApproxMAP: robust
- Support model: somewhat robust

Robustness w.r.t. Noise

[Figure: recoverability (0%–100%) vs. noise level (1 − alpha, 0%–50%) for the support model (min_sup = 5%) and multiple alignment (k = 6, theta = 50%): the support model's recoverability drops off sharply as noise increases, while multiple alignment stays high.]


Understanding ApproxMAP

5 experiments:
- k in kNN clustering
- strength cutoff
- order of alignment
- optimization 1: reduced precision in the proximity matrix
- optimization 2: sample-based iterative clustering

A Realistic DB

Notation  Meaning                                      Value
||I||     # of items                                   1,000
|| ||     # of potentially frequent itemsets           5,000
Ipat      avg # of items per itemset in base patterns  2
Lpat      avg # of itemsets per base pattern           14 = 0.7·Lseq
Npat      # of base pattern sequences                  100
Nseq      # of data sequences                          10,000
Lseq      avg # of itemsets per data sequence          20
Iseq      avg # of items per itemset in the DB         2.5

Input Parameters

[Figures: recoverability and precision (%) vs. theta, the strength threshold (0–100%), and vs. k (2–10) in kNN clustering; a third experiment varies the order used in the multiple alignment.]

Understanding ApproxMAP: Optimizations

Optimization 1 (reduced precision in the proximity matrix; targets the Lseq factor):
- running time reduced to 40%

Optimization 2 (sample-based iterative clustering; targets the Nseq factor):
- running time reduced to 10%–40%
- for a negligible reduction in recoverability

Effects of the DB Parameters & Scalability

4 experiments:
- ||I||: # of unique items in the database (density of the database), 1,000–10,000
- Nseq: # of sequences in the data, 10,000–100,000
- Lseq: avg # of itemsets per data sequence, 10–50
- Iseq: avg # of items per itemset in the DB, 2.5–10

[Figures: running time (sec) vs. |I| (0–10,000), vs. Nseq (0–100,000), vs. Lseq (0–50), and vs. Iseq (0–20).]


Case Study: Real Data

Monthly services to children with an A&N report: 992 sequences, yielding 15 interpretable and useful patterns, e.g.:
- (RPT) (INV,FC) (FC) ..11.. (FC) : 419 sequences
- (RPT) (INV,FC) (FC) (FC) : 57 sequences
- (RPT) (INV,FC,T) (FC,T) (FC,HM) (FC) (FC,HM) : 39 sequences


Conclusion: Why does it work well?

- robust on random & weakly patterned noise
- very good at organizing sequences
- long sequence data that are not random have unique signatures

What have I done?

This dissertation:
- defines a new model, multiple alignment sequential pattern mining;
- describes a novel solution, ApproxMAP (APPROXimate Multiple Alignment Pattern mining), which introduces a new metric for itemsets, weighted sequences (a new representation of alignment information), and the effective use of strength cutoffs to control the level of detail included in the consensus patterns;
- designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms;
- employs the evaluation method to run an extensive set of empirical evaluations of ApproxMAP on synthetic data;
- employs the evaluation method to compare the effectiveness of ApproxMAP to the conventional methods based on the support model;
- derives the expected support of a random sequence under the null hypothesis of no pattern in the database, to better understand the behavior of the support-based methods; and
- demonstrates the usefulness of ApproxMAP using real world data.

Future Work

- sample-based iterative clustering: memory management
- distance metric: multisets; taxonomy trees
- strength cutoffs: automatic detection of customized cutoffs
- local alignment

Thank You!

Advisors: Wei Wang (02-04), James Coggins (00-02), Prasun Dewan (99-00), Kye Hedlund (96-99), Jan Prins (95-96)
SW advisor: Dean Duncan (98-04)
Committee: Stephen Aylward, Jan Prins, Andrew Nobel
Other faculty: Jian Pei, Jack Snoeyink, J. S. Marron, Stephen Pizer, Stephen Weiss
Other people: Janet Jones, Kim Flair, Susan Paulsen
Fellow students: Priyank Porwal, Andrew Leaver-Fay, Leland Smith
Colleagues: Sang-Uok Kum, Jisung Kim, Alexandra, Michelle, Aron, Chris
Family: Sohmee, my mom, dad, sister
