Approximate Mining of Consensus Sequential Patterns

Hye-Chung (Monica) Kum
University of North Carolina, Chapel Hill
Computer Science Department / School of Social Work
http://www.cs.unc.edu/~kum/approxMAP


TRANSCRIPT

Page 1: Approximate Mining of  Consensus Sequential Patterns

Hye-Chung (Monica) Kum

University of North Carolina, Chapel Hill
Computer Science Department

School of Social Work

http://www.cs.unc.edu/~kum/approxMAP

Approximate Mining of Consensus Sequential Patterns

Page 9: Approximate Mining of  Consensus Sequential Patterns


Knowledge Discovery & Data mining (KDD)

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"

The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner

combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing

Fayyad, Piatetsky-Shapiro, Smyth 1996

Page 10: Approximate Mining of  Consensus Sequential Patterns


What is KDD ?

Purpose
– Extract useful information

Source
– Operational or Administrative Data

Example
– VIC card database for buying patterns
– monthly welfare service patterns

Page 11: Approximate Mining of  Consensus Sequential Patterns


Example

Analyze buying patterns for sales marketing

TID Transaction
1   {Diapers, Hotdogs, Buns, Beer}
2   {Bread, Milk, Diapers, Wipes, Beer}
3   {Milk, Diapers, Beer, Water}
4   {Bread, Milk, Bananas, Cereal}
5   {Bread, Milk, Diapers, Beer}
6   {Steak, Corn, Coke, Beer}
7   {Milk, Orange Juice, Diapers, Baby Food}
8   {Bread, Milk, Diapers, Beer}

Page 12: Approximate Mining of  Consensus Sequential Patterns


Example

VIC card : 4/8 = 50% (support in the transaction table above)

Page 13: Approximate Mining of  Consensus Sequential Patterns


Example

VIC card : 5/8 = 63% (support in the transaction table above)

Page 14: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 16: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

CID TID Transaction
1   1   {Diapers, Hotdogs, Buns, Beer}
1   2   {Bread, Milk, Diapers, Wipes, Beer}
1   3   {Milk, Diapers, Beer, Water}
2   4   {Bread, Milk, Bananas, Cereal}
2   6   {Steak, Corn, Coke, Beer}
3   5   {Bread, Milk, Diapers, Beer}
3   7   {Milk, Orange Juice, Diapers, Baby Food}
3   8   {Bread, Milk, Diapers, Beer}

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

Page 17: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

Detecting patterns in sequences of sets

Page 18: Approximate Mining of  Consensus Sequential Patterns


Welfare Program Participation Patterns

What are the common participation patterns ?What are the variations to them ?How do different policies affect these patterns?

Cid Sequential Transaction
1   {W(elfare) M(edi) F(oodstamp)} {WMF} {WMF} {MF} {MF} {F}
2   {WMF} {WMF} {WMF} {WMF} {WMF} {M} {M}
3   {F} {F} {F} {WMF} {WMF} {WMF} {MF} {MF} {F}

Page 19: Approximate Mining of  Consensus Sequential Patterns


Thesis Statement

In this dissertation, I assert that multiple alignment is an effective model to uncover the underlying trend in sequences of sets.

I will show that approxMAP
– is a novel method to apply multiple alignment techniques to sequences of sets,
– will effectively extract the underlying trend in the data
– by organizing the large database into clusters
– as well as giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment

Furthermore, I will show that approxMAP
– is robust to its input parameters,
– is robust to noise and outliers in the data,
– is scalable with respect to the size of the database,
– and, in comparison to the conventional support model, can better recover the underlying pattern with little confounding information under most circumstances.

In addition, I will demonstrate the usefulness of approxMAP using real world data.

Page 20: Approximate Mining of  Consensus Sequential Patterns


Thesis Statement

Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.

ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.

ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail.

I will demonstrate the usefulness of approxMAP using real world data.

Page 21: Approximate Mining of  Consensus Sequential Patterns


Sequential Pattern Mining

Detecting patterns in sequences of sets

Sequence  seq1 : < (A,B,D) (B) (C,D) (B,C) >

Itemset   s13  : (C,D)

Items     I    : {A, B, C, D}

• Nseq: Total # of sequences in the Database

• Lseq: Avg # of itemsets in a sequence

• Iseq : Avg # of items in an itemset

• Lseq * Iseq : Avg length of a sequence
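The notation above maps directly onto lists of itemsets. A minimal sketch (the toy database below is hypothetical, not from the slides) of the representation and the Nseq, Lseq, and Iseq statistics:

```python
# Illustrative only: a sequence is a list of itemsets; an itemset is a frozenset of items.
seq1 = [frozenset("ABD"), frozenset("B"), frozenset("CD"), frozenset("BC")]
db = [seq1, [frozenset("A"), frozenset("BC")]]   # toy database (hypothetical)

n_seq = len(db)                                  # Nseq : total # of sequences
l_seq = sum(len(s) for s in db) / n_seq          # Lseq : avg # of itemsets per sequence
i_seq = sum(len(x) for s in db for x in s) / sum(len(s) for s in db)  # Iseq : avg items per itemset

print(n_seq, l_seq, l_seq * i_seq)               # Lseq * Iseq : avg length of a sequence
```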

Page 26: Approximate Mining of  Consensus Sequential Patterns


Conventional Methods : Support Model

Super-sequence : (A,B,D)(B)(C,D)(B,C)
Sub-sequence : (A)(B)(C,D)

Support (P ) : # of super-sequences of P in D

Given D, and user threshold min_sup
– find the complete set of P s.t. Support(P ) ≥ min_sup

Methods
– Breadth first – Apriori principle (GSP)
  • R. Agrawal and R. Srikant : ICDE 95 & EDBT 96
– Depth first – pattern growth (PrefixSpan)
  • J. Han and J. Pei : SIGKDD 2000 & ICDE 2001

Page 27: Approximate Mining of  Consensus Sequential Patterns


Example: Support Model

{Dp, Br} {Mk, Dp} {Mk, Dp, Br} : 2/3 = 67%

2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences

C  Sequential Transaction
1  {Dp, HD, Buns, Br} {Bread, Mk, Dp, Wipes, Br} {Mk, Dp, Br, Wt}
2  {Bread, Mk, Bananas, Cereal} {Steak, Corn, Coke, Br}
3  {Bread, Mk, Dp, Br} {Mk, OJ, Dp, Baby Food} {Bread, Mk, Dp, Br}

– {Dp, Br} {Mk, Dp} {Mk, Br}

– {Dp, Br} {Mk, Dp} {Mk, Dp}

– {Mk, Dp} {Mk, Dp, Br}

– {Dp, Br} {Mk, Dp, Br}

– … etc …

– {Br} {Mk, Dp} {Mk, Dp, Br}

– {Dp} {Mk, Dp} {Mk, Dp, Br}

– {Dp, Br} {Dp} {Mk, Dp, Br}

– {Dp, Br} {Mk} {Mk, Dp, Br}

– {Dp, Br} {Mk, Dp} {Dp, Br}
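The support model above can be sketched in a few lines: a pattern is supported by a sequence when each of its itemsets is contained, in order, in a distinct itemset of that sequence. This is an illustrative sketch, not the slides' implementation; the database is the VIC-card example in abbreviated form:

```python
def is_subsequence(pat, seq):
    """True if every itemset of pat is contained, in order, in some itemset of seq."""
    i = 0
    for p in pat:
        while i < len(seq) and not p <= seq[i]:   # p <= s : subset test
            i += 1
        if i == len(seq):
            return False
        i += 1
    return True

def support(pat, db):
    """# of sequences in db that are super-sequences of pat."""
    return sum(is_subsequence(pat, s) for s in db)

db = [
    [{"Dp", "HD", "Buns", "Br"}, {"Bread", "Mk", "Dp", "Wipes", "Br"}, {"Mk", "Dp", "Br", "Wt"}],
    [{"Bread", "Mk", "Bananas", "Cereal"}, {"Steak", "Corn", "Coke", "Br"}],
    [{"Bread", "Mk", "Dp", "Br"}, {"Mk", "OJ", "Dp", "Baby Food"}, {"Bread", "Mk", "Dp", "Br"}],
]
pat = [{"Dp", "Br"}, {"Mk", "Dp"}, {"Mk", "Dp", "Br"}]
print(support(pat, db))  # 2 of the 3 customers, matching 2/3 = 67% above
```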

Page 28: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : the model

Support
– cannot distinguish between statistically significant patterns and random occurrences

Theoretically
– Short random sequences occur often in long sequential data simply by chance

Empirically
– # of spurious patterns grows exponentially w.r.t. Lseq

Page 29: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : exact match

A pattern gets support only if
– the pattern is exactly contained in the sequence

Often may not find general long patterns

Example
– many customers may share similar buying habits
– few of them follow exactly the same pattern

Page 30: Approximate Mining of  Consensus Sequential Patterns


Inherent Problems : Complete set

Mines the complete set
– Too many trivial patterns

Given long sequences with noise
– too expensive and too many patterns
– 2^L − 1 = 2^10 − 1 = 1023

Finding max / closed sequential patterns
– is non-trivial
– In a noisy environment, still too many max/closed patterns

Page 31: Approximate Mining of  Consensus Sequential Patterns


Possible Models

Support model
– Patterns in sets – unordered lists

Multiple alignment model
– Find common patterns among strings
– Simple ordered lists of characters

Page 33: Approximate Mining of  Consensus Sequential Patterns


Multiple Alignment

Line up the sequences to detect the trend
– Find common patterns among strings
– DNA / bio sequences

P A T T T E R N
P A T E R M
P T T R N
O A T T E R B
P S Y Y R T N

Consensus : P A T T E R N

Page 35: Approximate Mining of  Consensus Sequential Patterns


Multiple Alignment Score
– ∑ PS(seqi, seqj) ( 1 ≤ i ≤ N and 1 ≤ j ≤ N )
– Optimal alignment : minimum score

Pairwise Score (edit distance) : dist(seq1, seq2)
– Minimum # of ops required to change seq1 to seq2
– Ops = INDEL(a) and/or REPLACE(a,b)
– Recurrence relation (dynamic programming)

Edit Distance

P A T T T E R N
P A T E R M
INDEL INDEL REPL
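The recurrence relation behind dist(seq1, seq2) is the standard dynamic program. A sketch over plain strings with unit costs follows; for itemsets, the REPLACE cost would be swapped for the normalized set difference (that wiring is an assumption on my part, not shown on this slide):

```python
def edit_distance(s, t, replace_cost=lambda a, b: 0 if a == b else 1, indel_cost=1):
    """Edit distance via the standard DP recurrence; ops are INDEL and REPLACE."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * indel_cost                      # delete everything
    for j in range(1, n + 1):
        D[0][j] = j * indel_cost                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + indel_cost,   # INDEL: delete s[i-1]
                          D[i][j - 1] + indel_cost,   # INDEL: insert t[j-1]
                          D[i - 1][j - 1] + replace_cost(s[i - 1], t[j - 1]))
    return D[m][n]

print(edit_distance("PATTTERN", "PATERM"))  # 3 : two INDELs and one REPLACE, as on the slide
```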

Page 44: Approximate Mining of  Consensus Sequential Patterns


Consensus Sequence

Weighted Sequence :
– compression of aligned sequences into one sequence

strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences)

Consensus itemset (j) : min_strength = 2
– { ia | ia ∈ I and strength(ia, j) ≥ min_strength }

Consensus sequence :
– concatenation of the consensus itemsets

seq1          (A)              (B)               (DE)
seq2          (AE)     (H)     (BC)              (E)
seq3          (A)              (BCG)             (D)
Weighted Seq  (A:3,E:1):3  (H:1):1  (B:3,C:2,G:1):3  (D:2,E:2):3   3

Consensus Seq (A)              (BC)              (DE)
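The compression and cutoff steps above can be sketched as follows. Representing a gap as None is my assumption, not the slides' notation, and min_strength is used here as a count (2 of 3), matching the example:

```python
from collections import Counter

# The aligned cluster from the slide; None marks an alignment gap (a representation choice).
aligned = [
    [{"A"},      None,   {"B"},           {"D", "E"}],
    [{"A", "E"}, {"H"},  {"B", "C"},      {"E"}],
    [{"A"},      None,   {"B", "C", "G"}, {"D"}],
]

def weighted_sequence(aligned):
    """Compress aligned sequences column-wise into (item counts, itemset weight) pairs."""
    ws = []
    for col in zip(*aligned):
        items, weight = Counter(), 0
        for itemset in col:
            if itemset is not None:
                items.update(itemset)   # per-item occurrence counts in this position
                weight += 1             # # of sequences with a non-gap itemset here
        ws.append((items, weight))
    return ws

def consensus(ws, min_strength):
    """Keep items whose count meets min_strength; drop positions that become empty."""
    pat = [frozenset(i for i, c in items.items() if c >= min_strength) for items, _ in ws]
    return [s for s in pat if s]

ws = weighted_sequence(aligned)
print(consensus(ws, min_strength=2))  # the consensus itemsets (A) (BC) (DE)
```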

Page 45: Approximate Mining of  Consensus Sequential Patterns

Multiple Alignment

Sequential Pattern Mining

Given
– N sequences of sets,
– Op costs (INDEL & REPLACE) for itemsets, and
– Strength thresholds for consensus sequences

To
(1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimum,
(2) find the multiple alignment for each partition, and
(3) find the pattern consensus sequence and the variation consensus sequence for each partition

Page 46: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 47: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

(Approximate Multiple Alignment Pattern mining)

Exact solution : Too expensive!

Approximation Method : ApproxMAP
– Organize into K partitions
  • Use clustering
– Compress each partition into
  • weighted sequences
– Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

Page 48: Approximate Mining of  Consensus Sequential Patterns


Tasks

Op costs (INDEL & REPLACE) for itemsets

Organize into K partitions
– Use clustering

Compress each partition into
– weighted sequences

Summarize each partition into
– Pattern consensus sequence
– Variation consensus sequence

Page 58: Approximate Mining of  Consensus Sequential Patterns


Op costs for itemsets

Normalized set difference
– R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|)
– 0 ≤ R ≤ 1 , metric
– INDEL(X) = R(X,∅) = 1

Jaccard coefficient
– 1 − |X∩Y| / |X∪Y|
– = 1 − |X∩Y| / (|X−Y| + |Y−X| + |X∩Y|)

Sørensen coefficient : simple index
– Gives greater "weight" to common elements
– 1 − 2|X∩Y| / (|X−Y| + |Y−X| + 2|X∩Y|)
– = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|) = R(X,Y)

REPLACE costs
(a) → (a)    0
(a) → (ab)   1/3
(ab) → (ac)  1/2
(a) → (b)    1
(ab) → (cd)  1
(a) → ()     1
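The normalized set difference and the REPLACE cost table can be checked directly; a small sketch using exact fractions:

```python
from fractions import Fraction

def R(X, Y):
    """Normalized set difference (|X-Y| + |Y-X|) / (|X| + |Y|).
    Equals 1 whenever one side is empty, so INDEL(X) = R(X, {}) = 1."""
    X, Y = set(X), set(Y)
    if not X and not Y:
        return Fraction(0)  # convention: two empty itemsets are identical
    return Fraction(len(X - Y) + len(Y - X), len(X) + len(Y))

# Reproduces the REPLACE cost table from the slide:
print(R("a", "a"), R("a", "ab"), R("ab", "ac"), R("a", "b"), R("ab", "cd"), R("a", ""))
```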

Page 60: Approximate Mining of  Consensus Sequential Patterns


Organize : Partition into K sets

Goal:
– To minimize the sum of the K multiple alignment scores
– Group similar sequences

Approximate
– Calculate the N*N proximity matrix
  • Pairwise score : edit distance
– Any clustering that works best for your data

Page 61: Approximate Mining of  Consensus Sequential Patterns


Organize : Clustering

Desirable Properties
– Form groups of arbitrary shape and size
– Can estimate the number of clusters from the data

Page 62: Approximate Mining of  Consensus Sequential Patterns


Density Based Clustering

k-nearest neighbor : partition at the valleys of the density estimate

Density of sequence = n / (|D| * d) ∝ n / d
– n & d : based on the user-defined k-nearest-neighbor space
– n : # of neighbors
– d : size of the neighbor region

Parameter k : neighbor space
– Can cluster at different resolutions as desired

General : uniform kernel k-NN clustering
– Efficient : O(kN)
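The n/d density estimate can be sketched from a precomputed proximity matrix. This covers only the density step, not the full uniform-kernel k-NN clustering, and the toy distance matrix is hypothetical:

```python
def knn_density(dist, k):
    """Uniform-kernel k-NN density per point: n / d, where d is the distance to the
    k-th nearest neighbor (the neighbor-region size) and n is the # of neighbors
    within d. `dist` is a full N x N distance matrix. Illustrative sketch only."""
    N = len(dist)
    dens = []
    for i in range(N):
        d_sorted = sorted(dist[i][j] for j in range(N) if j != i)
        d = d_sorted[k - 1] or 1e-9              # region size; epsilon guards d == 0
        n = sum(1 for x in d_sorted if x <= d)   # neighbors inside the region
        dens.append(n / d)
    return dens

dist = [[0, 1, 4],
        [1, 0, 5],
        [4, 5, 0]]
print(knn_density(dist, k=1))  # the isolated third point gets the lowest density
```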

Page 64: Approximate Mining of  Consensus Sequential Patterns


Data Compression : Multiple Alignment

Optimal multiple alignment : too expensive!

Greedy approximation
– Incrementally align in density-descending order
– Pairwise alignment
  • Sequence to weighted sequence

Page 65: Approximate Mining of  Consensus Sequential Patterns

ID   Lexically sorted sequence cluster
seq3 (A) (B) (DE)
seq4 (A) (BCG) (D)
seq2 (AE) (H) (B) (D)
seq1 (AG) (F) (BC) (AE) (H)
seq5 (BCI) (DE)

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)

Page 66: Approximate Mining of  Consensus Sequential Patterns

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 74: Approximate Mining of  Consensus Sequential Patterns

Multiple Alignment

seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
WS1  (A:2,E:1):2  (H:1):1  (B:2):2  (D:2,E:1):2   2

seq4 (A) (BCG) (D)
WS2  (A:3,E:1):3  (H:1):1  (B:3,C:1,G:1):3  (D:3,E:1):3   3

seq5 (BCI) (DE)
WS3  (A:3,E:1):3  (H:1):1  (B:4,C:2,G:1,I:1):4  (D:4,E:2):4   4

seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 75: Approximate Mining of  Consensus Sequential Patterns

Op Cost for Itemset to weighted itemset

seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
WS3  (A:3,E:1):3  (H:1):1  (B:4,C:2,G:1,I:1):4  (D:4,E:2):4   4

seq1 (AG) (F) (BC) (AE) (H)

Replace( (A:3,E:1):3 − 4 , (AG) ) = ?

Page 82: Approximate Mining of  Consensus Sequential Patterns

seq3 (A)   R = 1/3
seq2 (AE)  R = 1/2
seq4 (A)   R = 1/3     Avg over aligned itemsets = 7/18 = 35/90
seq5       INDEL = 1   Tot Avg = 65/120
WS3  (A:3,E:1):3 − 4
seq1 (AG)

R’w = (4 + 2*3 − 2*3) / (4 + 2*3) = 2/5 = 36/90
Rw = [ (2/5)*3 + 1 ] / 4 = 11/20 = 66/120

Replace( (A:3,E:1):3 − 4 , (AG) ) = Rw(Xw, Y) = [ R’w(Xw,Y) * wX + n − wX ] / n

Page 83: Approximate Mining of  Consensus Sequential Patterns


Op Cost for Itemset to weighted itemset:Rw

Op cost
– R’w(Xw,Y) = [ weight(X) + |Y|*wX − 2*weight(X∩Y) ] / [ weight(X) + |Y|*wX ]
– Rw(Xw,Y) = [ R’w * wX + n − wX ] / n
– 0 ≤ Rw ≤ 1 , metric
– INDEL(Xw) = Rw(Xw,∅) = INDEL(Y) = Rw(∅,Y) = 1
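The Rw formulas can be checked against the worked example from the preceding slides, Replace((A:3,E:1):3 − 4, (AG)) = 11/20. The dict representation of a weighted itemset below is my assumption:

```python
from fractions import Fraction

def replace_w(item_counts, w_X, n, Y):
    """Rw : REPLACE cost between a weighted itemset Xw and a plain itemset Y.
    item_counts : item -> count dict for Xw (hypothetical representation);
    w_X : itemset weight; n : sequence weight."""
    Y = set(Y)
    weight_X = sum(item_counts.values())                              # weight(X)
    weight_inter = sum(c for i, c in item_counts.items() if i in Y)   # weight(X ∩ Y)
    denom = weight_X + len(Y) * w_X
    r_prime = Fraction(denom - 2 * weight_inter, denom)               # R'w(Xw, Y)
    return (r_prime * w_X + n - w_X) / n                              # Rw = [R'w*wX + n - wX] / n

# Worked example: Replace((A:3,E:1):3 - 4, (AG))
print(replace_w({"A": 3, "E": 1}, w_X=3, n=4, Y={"A", "G"}))  # 11/20
```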

Page 85: Approximate Mining of  Consensus Sequential Patterns

Summarize: Generate and Present results

N sequences → K weighted sequences

Weighted sequence : huge
– compression of all sequences

ID   Aligned Sequence cluster
seq3 (A) (B) (DE)
seq2 (AE) (H) (B) (D)
seq4 (A) (BCG) (D)
seq5 (BCI) (DE)
seq1 (AG) (F) (BC) (AE) (H)
WS4  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1   5

Page 86: Approximate Mining of  Consensus Sequential Patterns

< (E:1, L:1, R:1, T:1, V:1, d:1) (A:1, B:9, C:8, D:8, E:12, F:1, L:4, P:1, S:1, T:8, V:5, X:1, a:1, d:10, e:2, f:1, g:1, p:1) (B:99, C:96, D:91, E:24, F:2, G:1, L:15, P:7, R:2, S:8, T:95, V:15, X:2, Y:1, a:2, d:26, e:3, g:6, l:1, m:1) (A:5, B:16, C:5, D:3, E:13, F:1, H:2, L:7, P:1, R:2, S:7, T:6, V:7, Y:3, d:3, g:1) (A:13, B:126, C:27, D:1, E:32, G:5, H:3, J:1, L:1, R:1, S:32, T:21, V:1, W:3, X:2, Y:8, d:13, e:1, f:8, i:2, p:7, l:3, g:1) (A:12, B:6, C:28, D:1, E:28, G:5, H:2, J:6, L:2, S:137, T:10, V:2, W:6, X:8, Y:124, a:1, d:6, g:2, i:1, l:1, m:2) (A:135, B:2, C:23, E:36, G:12, H:124, K:1, L:4, O:2, R:2, S:27, T:6, V:6, W:10, X:3, Y:8, Z:2, a:1, d:6, g:1, h:2, j:1, k:5, l:3, m:7, n:1) (A:11, B:1, C:5, E:12, G:3, H:10, L:7, O:4, S:5, T:1, V:7, W:3, X:2, Y:3, a:1, m:2) (A:31, C:15, E:10, G:15, H:25, K:1, L:7, M:1, O:1, R:4, S:12, T:10, V:6, W:3, Y:3, Z:3, d:7, h:3, j:2, l:1, n:1, p:1, q:1) (A:3, C:5, E:4, G:7, H:1, K:1, R:1, T:1, W:2, Z:2, a:1, d:1, h:1, n:1) (A:20, C:27, E:13, G:35, H:7, K:7, L:111, N:2, O:1, Q:3, R:11, S:10, T:20, V:111, W:2, X:2, Y:3, Z:8, a:1, b:1, d:13, h:9, j:1, n:1, o:2) (A:17, B:2, C:14, E:17, F:1, G:31, H:8, K:13, L:2, M:2, N:1, R:22, S:2, T:140, U:1, V:2, W:2, X:1, Z:13, a:1, b:8, d:6, h:14, n:6, p:1, q:1) (A:12, B:7, C:5, E:13, G:16, H:5, K:106, L:8, N:2, O:1, R:32, S:3, T:29, V:9, X:2, Z:9, b:16, c:5, d:5, h:7, l:1) (A:7, B:1, C:9, E:5, G:7, H:3, K:7, R:8, S:1, T:10, X:1, Z:3, a:2, b:3, c:1, d:5, h:3) (A:1, B:1, H:1, R:1, T:1, b:2, c:1) (A:3, B:2, C:2, E:6, F:2, G:4, H:2, K:20, M:2, N:3, R:19, S:3, T:11, U:2, X:4, Z:34, a:3, b:11, c:2, d:4) (H:1, Y:1, a:1, d:1) > : 162

Page 87: Approximate Mining of  Consensus Sequential Patterns


Presentation model

Frequent items :
– definite pattern items
– Cutoff : 50%

Common items :
– uncertain
– Cutoff : 20%

Rare items :
– Noise items

[Diagram: item strength scale from W=100% down to W=0%, with Frequent items above the 50% cutoff, Common items between 20% and 50%, and Rare items below 20%]
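The three bands can be read as a simple threshold function on an item's strength (its weight divided by the number of sequences in the cluster). A minimal sketch; the function name and defaults are illustrative, with the 50% and 20% cutoffs taken from the slide.

```python
def classify_item(weight, n_seqs, freq_cutoff=0.50, common_cutoff=0.20):
    """Classify an item at a weighted-sequence position by its strength."""
    strength = weight / n_seqs          # W, fraction of sequences containing the item
    if strength >= freq_cutoff:
        return 'frequent'               # definite pattern item
    if strength >= common_cutoff:
        return 'common'                 # uncertain item
    return 'rare'                       # treated as noise

# e.g. in a cluster of 162 sequences:
band = classify_item(135, 162)          # 135/162 ≈ 83% of the cluster -> 'frequent'
```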

Page 88: Approximate Mining of  Consensus Sequential Patterns


Visualization

Pattern Consensus Sequence :
– Cutoff : Minimum cluster strength (50%)
– Frequent items

Variation Consensus Sequence :
– Cutoff : Minimum cluster strength (20%)
– Frequent items + common items

100%: 85%: 70%: 50%: 35%: 20%

(B:61%, C:59%, D:56%, T:59%)

(B: 78%)

(S:85%, Y:77%)

(A:83%, E:22%, H:77%)

(G:22%, L:69%, V:69%)

(T: 86%)

(K: 65%)

(B: 21%) : 162

Pattern    (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation  (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162
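Both consensus sequences can be derived from a weighted sequence in one filtering pass. The sketch below is illustrative (not the actual ApproxMAP code): it keeps items whose strength meets the cutoff and drops positions left empty, reproducing the cluster-1 consensus pattern from the later worked example.

```python
def consensus(weighted_seq, n_seqs, cutoff):
    """Keep items whose strength (weight / n_seqs) meets the cutoff."""
    result = []
    for item_weights, _support in weighted_seq:
        itemset = sorted(i for i, w in item_weights.items() if w / n_seqs >= cutoff)
        if itemset:                      # drop positions with no surviving item
            result.append(tuple(itemset))
    return result

# Cluster 1 from the worked example: 3 sequences
ws1 = [({'A': 3, 'L': 1}, 3), ({'X': 1}, 1), ({'B': 2, 'C': 2, 'Y': 1}, 2),
       ({'A': 1, 'D': 2, 'E': 2}, 3), ({'Z': 1}, 1)]
pattern = consensus(ws1, 3, 0.50)        # frequent items only
```

With the 50% cutoff this yields (A)(B,C)(D,E), matching the consensus pattern on the cluster-1 slides.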

Page 89: Approximate Mining of  Consensus Sequential Patterns

ID  Full Sequence  Cluster  Lseq  Len

seq1 (A) (B, C, Y) (D)     1 3 5

seq2 (A) (X) (B, C) (A, E) (Z) 1 5 7

seq3 (A, I) (Z) (K) (L, M)   2 4 6

seq4 (A, L) (D, E)       1 2 4

seq5 (I, J) (B) (K) (L)   2 4 5

seq6 (I, J) (L, M)       2 2 5

seq7 (I, J) (K) (J, K) (L) (M) 2 5 7

seq8 (I, M) (K) (K, M) (L, M)   2 4 7

seq9 (J) (K) (L, M)     2 3 4

seq10 (V) (K, W) (Z)     2 3 4

Example: Given 10 seqs lexically sorted

Page 90: Approximate Mining of  Consensus Sequential Patterns

ID  Full Sequence  Cluster  Lseq  Len

seq1 (A) (B, C, Y) (D)     1 3 5

seq2 (A) (X) (B, C) (A, E) (Z) 1 5 7

seq3 (A, I) (Z) (K) (L, M)   2 4 6

seq4 (A, L) (D, E)       1 2 4

seq5 (I, J) (B) (K) (L)   2 4 5

seq6 (I, J) (L, M)       2 2 5

seq7 (I, J) (K) (J, K) (L) (M) 2 5 7

seq8 (I, M) (K) (K, M) (L, M)   2 4 7

seq9 (J) (K) (L, M)     2 3 4

seq10 (V) (K, W) (Z)     2 3 4

Example: Given 10 seqs lexically sorted

Page 91: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

     

Page 92: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

     

Page 93: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Page 94: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

     

   

Page 95: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

Weighted sequence  (A:1,I:5,J:4,M:1):6  (B:1,K:2,V:1,Z:1):5  (J:1,K:6,M:1,W:1):6  (L:6,M:4):6  (M:1,Z:1):2  7

     

   

Page 96: Approximate Mining of  Consensus Sequential Patterns

Color scheme <100: 85: 70: 50: 35: 20>
Cluster 1 (cluster strength = 40% = 2 sequences)

seq1 (A)   (B, C, Y) (D)    

seq4 (A, L)     (D, E)    

seq2 (A) (X) (B, C) (A, E) (Z)  

Weighted Seq (A:3, L:1):3 (X:1):1 (B:2, C:2, Y:1):2 (A:1, D:2,E:2):3 (Z:1):1 3

Consensus Pat (A)   (B, C) (D, E)    

Cluster 2 (cluster strength = 40% = 3 sequences)
seq9 (J)   (K) (L, M)

seq5 (I, J) (B) (K) (L)    

seq3 (A, I) (Z) (K) (L, M)    

seq7 (I, J) (K) (J, K) (L) (M)  

seq8 (I, M) (K) (K, M) (L, M)    

seq6 (I, J)     (L, M)    

seq10  (V) (K, W)   (Z)  

Weighted sequence  (A:1,I:5,J:4,M:1):6  (B:1,K:2,V:1,Z:1):5  (J:1,K:6,M:1,W:1):6  (L:6,M:4):6  (M:1,Z:1):2  7

Consensus Pat (w≥3)  (I, J)  (K)  (L, M)
Consensus Var (w≥2)  (I, J)  (K)  (K)  (L, M)

Page 97: Approximate Mining of  Consensus Sequential Patterns

Example: support model (20% = 2 seq)

id  pattern    sup   id  pattern      sup   id  pattern          sup
1   (A)        4     17  (A) (D)      2     33  (I,J) (K)        2
2   (B)        3     18  (A) (E)      2     34  (I,J) (L)        3
3   (C)        2     19  (A) (Z)      2     35  (I,J) (M)        2
4   (D)        2     20  (A) (B,C)    2     36  (I) (K) (K)      2
5   (E)        2     21  (I) (K)      4     37  (I) (K) (L)      2
6   (I)        5     22  (I) (L)      5     38  (I) (K) (M)      2
7   (J)        4     23  (I) (M)      4     39  (I) (K) (L,M)    2
8   (K)        6     24  (I) (L,M)    3     40  (J) (K) (L)      2
9   (L)        7     25  (J) (K)      3     41  (J) (K) (M)      2
10  (M)        5     26  (J) (L)      4     42  (K) (K) (L)      2
11  (Z)        3     27  (J) (M)      3     43  (K) (K) (M)      2
12  (B,C)      2     28  (J) (L,M)    2     44  (I,J) (K) (L)    2
13  (I,J)      2     29  (K) (K)      2     45  (I) (K) (K) (L)  2
14  (L,M)      2     30  (K) (L)      5     46  (I) (K) (K) (M)  2
15  (A) (B)    2     31  (K) (M)      4
16  (A) (C)    2     32  (K) (L,M)    3

(A) (B,C) (D,E)
(I,J) (K) (L,M)
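Each support count in the table can be reproduced with a subsequence test over the 10-sequence database: a sequence supports a pattern when each pattern itemset is contained in some itemset of the sequence, in order. A minimal sketch (greedy matching suffices for this test):

```python
def supports(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (itemset containment, in order)."""
    pos = 0
    for itemset in seq:
        if pos < len(pattern) and pattern[pos] <= itemset:   # <= is subset on sets
            pos += 1
    return pos == len(pattern)

db = [
    [{'A'}, {'B', 'C', 'Y'}, {'D'}],                        # seq1
    [{'A'}, {'X'}, {'B', 'C'}, {'A', 'E'}, {'Z'}],          # seq2
    [{'A', 'I'}, {'Z'}, {'K'}, {'L', 'M'}],                 # seq3
    [{'A', 'L'}, {'D', 'E'}],                               # seq4
    [{'I', 'J'}, {'B'}, {'K'}, {'L'}],                      # seq5
    [{'I', 'J'}, {'L', 'M'}],                               # seq6
    [{'I', 'J'}, {'K'}, {'J', 'K'}, {'L'}, {'M'}],          # seq7
    [{'I', 'M'}, {'K'}, {'K', 'M'}, {'L', 'M'}],            # seq8
    [{'J'}, {'K'}, {'L', 'M'}],                             # seq9
    [{'V'}, {'K', 'W'}, {'Z'}],                             # seq10
]
sup = sum(supports(s, [{'I', 'J'}, {'K'}]) for s in db)     # id 33 in the table
```

This reproduces, for example, sup((I,J)(K)) = 2 and sup((K)(L)) = 5 from the table.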

Page 98: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

(Approximate Multiple Alignment Pattern mining)

Approximation Method : ApproxMAP
– Organize into K partitions = O(Nseq^2 Lseq^2 Iseq)
  • Proximity matrix = O(Nseq^2 Lseq^2 Iseq)
  • Clustering = O(k Nseq)
– Compress each partition = O(n L^2)
  • weighted sequences = O(n L^2)
– Summarize each partition = O(1)
  • Pattern consensus sequence
  • Variation consensus sequence
– Time Complexity : O(Nseq^2 Lseq^2 Iseq)
  • 2 optimizations
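The O(Lseq^2) pairwise cost behind the proximity matrix comes from an edit-distance alignment over itemsets. A minimal sketch, assuming a unit INDEL cost and normalization by the longer sequence (both simplifications for illustration); the replacement cost is the set-difference formula from the Rw backup slide at the end of the deck.

```python
def repl(x, y):
    """Itemset replacement cost: (|X| + |Y| - 2|X ∩ Y|) / (|X| + |Y|)."""
    return (len(x) + len(y) - 2 * len(x & y)) / (len(x) + len(y))

def seq_dist(s, t):
    """Edit-distance DP over itemsets, normalized so 0 <= dist <= 1."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)                      # deletions (unit INDEL cost)
    for j in range(n + 1):
        d[0][j] = float(j)                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1,        # delete itemset s[i-1]
                          d[i][j-1] + 1,        # insert itemset t[j-1]
                          d[i-1][j-1] + repl(s[i-1], t[j-1]))
    return d[m][n] / max(m, n)
```

Computing this for every pair of sequences fills the O(Nseq^2) proximity matrix used by the kNN clustering step.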

Page 99: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 100: Approximate Mining of  Consensus Sequential Patterns


Evaluation

Up to now : only performance / scalability
Quality?
– What kind of patterns will the model generate?
– Evaluate correctness of the model

Why?
– Basis for comparison of different models
– Essential in understanding results of approximate solutions

Page 101: Approximate Mining of  Consensus Sequential Patterns


Evaluation Method

Given
– Set of Base patterns B : E(FB) & E(LB)
– Set of Result patterns P

How?
– Map each Pi to the best Bj
  • based on the Longest Common Subsequence
  • the Bj, over all Bj, with max |B∩P|

Page 102: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +
Actual (Base Pat)  +    Pattern Items

Page 103: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +
Actual (Base Pat)  +    Pattern Items
                   –    Extraneous Items

Page 104: Approximate Mining of  Consensus Sequential Patterns


Item level

Confusion Matrix

                      Predicted (Result Patterns)
                        +                   –
Actual (Base Pat)  +    Pattern Items       Missed Items
                   –    Extraneous Items

Page 105: Approximate Mining of  Consensus Sequential Patterns


Evaluation Criteria : Item level

Recoverability :
– Degree of pattern items recovered from the Base Pat (weighted)
– R = ∑ E(FB) · [ max over result patterns |B∩P| / E(LB) ]
– Cutoff so that 0 ≤ R ≤ 1

Precision :
– Degree of pattern items in the Result Pat
– Pattern Items / (Pattern Items + Extraneous Items)

                      Predicted (Result Patterns)
                        +                   –
Actual (Base Pat)  +    Pattern Items       Missed Items
                   –    Extraneous Items    N/A

Page 106: Approximate Mining of  Consensus Sequential Patterns


Evaluation Criteria : Sequence level

Spurious patterns
– Pattern Items ≤ Extraneous Items

Determine the max pattern for each Bj
– Of all the Pi that map to a particular Bj
– the Pi with the Longest Common Subsequence
– max over result patterns |B∩P|

Redundant patterns
– All other patterns

Ntotal = Nmax + Nspur + Nredun
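The sequence-level bookkeeping can be sketched as below. The names are illustrative, and patterns are flattened to item lists with |B∩P| taken as the item-level longest common subsequence, a simplification of the itemset-level LCS.

```python
def lcs_len(a, b):
    """Item-level longest common subsequence length between two item lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] else max(d[i-1][j], d[i][j-1])
    return d[m][n]

def classify_patterns(results, bases):
    """Count (max, spurious, redundant) result patterns."""
    best = {}                 # base index -> (lcs length, result index) of max pattern
    n_spur = 0
    for i, p in enumerate(results):
        j = max(range(len(bases)), key=lambda b: lcs_len(p, bases[b]))
        shared = lcs_len(p, bases[j])
        if shared <= len(p) - shared:          # pattern items <= extraneous items
            n_spur += 1
        elif (shared, i) > best.get(j, (-1, -1)):
            best[j] = (shared, i)              # current max pattern for this base
    n_max = len(best)
    return n_max, n_spur, len(results) - n_max - n_spur

# Patterns from the evaluation example on the following slides, flattened
bases = [list('ABCDE'), list('IJKLM')]         # (A)(BC)(DE) and (IJ)(K)(LM)
results = [list('ABDE'), list('ABCD'), list('BBCDE'),
           list('IJLM'), list('JKLM'), list('IJKXLM'), list('XYKZ')]
counts = classify_patterns(results, bases)
```

On this data the sketch yields 2 max, 1 spurious ((XY)(K)(Z)), and 4 redundant patterns, consistent with Ntotal = 7.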

Page 107: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 108: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 109: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7 Spurious = 1

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 110: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 111: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%   Redundant = 4

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)

Page 112: Approximate Mining of  Consensus Sequential Patterns


Evaluation Example

30% : (A)(BC)(DE)

70% : (IJ)(K)(LM)

Ntotal = 7   Spurious = 1   Recoverability = (30%)*4/5 + (70%)*5/5 = 94%   Redundant = 4   Precision = 1 − 5/31 = 84%

(A)(B)(DE) (A)(BC)(D) (B)(BC)(DE)

(IJ)(LM) (J)(K)(LM) (IJ)(KX)(LM)

(XY)(K)(Z)
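The two summary figures in this example follow directly from the definitions. A worked sketch using the counts stated on the slide (4 of B1's 5 items recovered, 5 extraneous of 31 emitted items):

```python
# Recoverability weights each base pattern by its expected frequency E(F_B)
# and credits the best-matching result pattern's shared items against the
# base pattern's expected length E(L_B).
base_patterns = [
    dict(freq=0.30, exp_len=5, best_shared=4),  # (A)(BC)(DE): best match shares 4 of 5
    dict(freq=0.70, exp_len=5, best_shared=5),  # (IJ)(K)(LM): fully recovered
]
recoverability = sum(b['freq'] * min(1.0, b['best_shared'] / b['exp_len'])
                     for b in base_patterns)

# Precision is the fraction of emitted items that are pattern items
total_items, extraneous = 31, 5                 # counts over all 7 result patterns
precision = 1 - extraneous / total_items
```

This reproduces Recoverability = 94% and Precision ≈ 84%.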

Page 113: Approximate Mining of  Consensus Sequential Patterns


Synthetic data

Patterned data : IBM synthetic data generator
– Given certain DB parameters, outputs
  • sequence DB
  • the base patterns used to generate it : E(FB) and E(LB)
– R. Agrawal and R. Srikant : ICDE 95 & EDBT 96

Random data
– Independence both between and across itemsets

Patterned data + systematic noise
– Randomly change an item with probability (1 − α)
– Yang, SIGMOD 2002

Patterned data + systematic outliers
– random sequences
Page 114: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 115: Approximate Mining of  Consensus Sequential Patterns


Results

ApproxMAP
– Pattern consensus sequence
– No null or one-itemset patterns

Machine : swan
– 2GHz Intel Xeon processor
– 2GB of memory
– Public machine
  • Difficult to get consistent running time measurements
  • Thanks !

Page 116: Approximate Mining of  Consensus Sequential Patterns


Database Parameter

Notation  Meaning                          Value
|| I ||   # of items                       100
|| ||     # of potentially freq itemsets   500

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 7

Npat # of base pattern sequences 10

Nseq # of data sequences 1000

Lseq Avg. # of itemsets per data seq 10

Iseq Avg. # of items per itemset in DB 2.5

Page 117: Approximate Mining of  Consensus Sequential Patterns


Database Parameter

Notation  Meaning                          Value
|| I ||   # of items                       100
|| ||     # of potentially freq itemsets   500

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 7

Npat # of base pattern sequences 10

Nseq # of data sequences 1000

Lseq Avg. # of itemsets per data seq 10

Iseq Avg. # of items per itemset in DB 2.5

Page 118: Approximate Mining of  Consensus Sequential Patterns

BasePi (E(FB):E(LB)) ||P|| Pattern <100: 85: 70: 50: 35: 20>

BaseP1 (0.21:0.66) 14 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62) (93)>
PatConSeq1 13 <(15 16 17 66) (15) (58 99) (2 74) (31 76) (66) (62)>
VarConSeq1 18 <(15 16 17 66) (15 22) (58 99) (2 74) (24 31 76) (24 66) (50 62) (93)>

BaseP2 (0.161:0.83) 22 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58) (63 74 99)>
PatConSeq2 19 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51) (66) (2 22 58)>
VarConSeq2 25 <(22 50 66) (16) (29 99) (22 58 94) (2 45 58 67) (12 28 36) (2 50) (24 96) (51) (66) (2 22 58)>
PatConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>
VarConSeq3 15 <(22 50 66) (16) (29 99) (94) (45 67) (12 28 36) (50) (96) (51)>

BaseP3 (0.141:0.82) 14 <(22) (22) (58) (2 16 24 63) (24 65 93) (6) (11 15 74)>
PatConSeq4 11 <(22) (22) (58) (2 16 24 63) (24 65 93) (6)>
VarConSeq4 13 <(22) (22) (22) (58) (2 16 24 63) (2 24 65 93) (6 50)>

BaseP4 (0.131:0.90) 15 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (2 16 24 63)>
PatConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66)>
VarConSeq5 11 <(31 76) (58 66) (16 22 30) (16) (50 62 66) (16 24)>

BaseP5 (0.123:0.81) 14 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20) (93)>
PatConSeq6 13 <(43) (2 28 73) (96) (95) (2 74) (5) (2) (24 63) (20)>
VarConSeq6 16 <(22 43) (2 28 73) (58 96) (95) (2 74) (5) (2 66) (24 63) (20)>

BaseP6 (0.121:0.77) 9 <(63) (16) (2 22) (24) (22 50 66) (50)>
PatConSeq7 8 <(63) (16) (2 22) (24) (22 50 66)>
VarConSeq7 9 <(63) (16) (2 22) (24) (22 50 66)>

BaseP7 (0.054:0.60) 13 <(70) (58 66) (22) (74) (22 41) (2 74) (31 76) (2 74)>
PatConSeq8 16 <(70) (58) (22 58 66) (22 58) (74) (22 41) (2 74) (31 76) (2 74)>
VarConSeq8 18 <(70) (58 66) (22 58 66) (22 58) (74) (22 41) (2 22 66 74) (31 76) (2 74)>
PatConSeq9 0 (cluster size was only 5 sequences, so no pattern consensus sequence was produced)
VarConSeq9 8 <(70) (58 66) (74) (74) (22 41) (74)>

BaseP8 (0.014:0.91) 17 <(20 22 23 96) (50) (51 63) (58) (16) (2 22) (50) (23 26 36) (10 74)>
BaseP9 (0.038:0.78) 7 <(88) (24 58 78) (22) (58) (96)>
BaseP10 (0.008:0.66) 17 <(16) (2 23 74 88) (24 63) (20 96) (91) (40 62) (15) (40) (29 40 99)>

Page 119: Approximate Mining of  Consensus Sequential Patterns

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

Id:LB E(FB):E(LB) Base Pattern

B1:14 0.21:0.66 [15,16,17,66] [15] [58,99] [2,74] [31,76] [66] [62][93]

B2:22 0.161:0.83 [22,50,66][16][29,99][94][45,67]…[2,22,58][63,74,99]

B3:14 0.141:0.82 [22] [22] [58] [2,16,24,63] [24,65,93] [6] [11,15,74]

Etc

B10:17 0.008:0.66 [16][2,23,74,88][24,63][20,96][91][40,62]...[29,40,99]

Page 120: Approximate Mining of  Consensus Sequential Patterns

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

cluster1(162 seqs)

cluster2 cluster9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

(wseq1 : the full weighted sequence for cluster 1, shown earlier) : 162

Page 121: Approximate Mining of  Consensus Sequential Patterns

cluster1(162 seqs)

cluster2 cluster9

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

PatConSeq1

VarConSeq1

PatConSeq9

VarConSeq9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

100%: 85%: 70%: 50%: 35%: 20%

(B:61%, C:59%, D:56%, T:59%)

(B: 78%)

(S:85%, Y:77%)

(A:83%, E:22%, H:77%)

(G:22%, L:69%, V:69%)

(T: 86%)

(K: 65%)

(B: 21%) : 162

Pattern    (B C D T) (B) (S Y) (A H) (L V) (T) (K) : 162
Variation  (B C D T) (B) (S Y) (A E H) (G L V) (T) (K) (Z) : 162

Page 122: Approximate Mining of  Consensus Sequential Patterns

cluster1(162 seqs)

cluster2 cluster9

ApproxMAP

wseq1(162 seqs)

wseq2 wseq9

PatConSeq1

VarConSeq1

PatConSeq9

VarConSeq9

10 Base Patterns

D = 1000 Sequences

IBM Synthetic Data Generator

Evaluation

7 max patterns
1 redundant pattern
0 spurious patterns
1 null pattern

Recoverability : 91.16%
Precision : 97.17%
Extraneous Items : 3/106

Page 123: Approximate Mining of  Consensus Sequential Patterns


BaseP1 (0.21:0.66) 14 <(15 16 17 66)(15)(58 99)(2 74)(31 76)(66)(62)(93)>
PatConSeq1 13 <(15 16 17 66)(15)(58 99)(2 74)(31 76)(66)(62)>
VarConSeq1 18 <(15 16 17 66)(15 22)(58 99)(2 74)(24 31 76)(24 66)(50 62)(93)>

Page 124: Approximate Mining of  Consensus Sequential Patterns


8 patterns returned

7 max patterns

1 redundant pattern

0 spurious patterns

Recoverability : 91.16%

Precision: 97.17%

Extraneous Items: 3/106

Page 125: Approximate Mining of  Consensus Sequential Patterns


Comparative Study

Conventional Sequential Pattern Mining– Support Model

Empirical analysis– Totally random data– Patterned data– Patterned data + noise– Patterned data + outliers

Page 126: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Page 127: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Page 128: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Noise         Robust          Not Robust (Recoverability degrades fast)

Page 129: Approximate Mining of  Consensus Sequential Patterns


Robustness w.r.t. noise

[Chart: Recoverability (0–100%) vs. Noise Level (1-alpha, 0–50%), comparing Support (min_sup=5%) with Multiple Alignment (k=6, theta=50%)]

Page 130: Approximate Mining of  Consensus Sequential Patterns

Evaluation : Comparison

              ApproxMAP       Support Model
Random Data   No patterns     Numerous spurious patterns

Patterned Data (10 patterns embedded into 1000 seqs):
              k=6 & MinStrgh=30%          MinSup=5%
              Recoverability : 91.16%     Recoverability : 91.59%
              Precision : 97.17%          Precision : 96.29%
              Extraneous Items : 3/106    Extraneous : 66,058/1,782,583
              8 patterns returned         253,782 patterns returned
              1 redundant pattern         253,714 redundant patterns
              0 spurious patterns         58 spurious patterns

Noise         Robust          Not Robust (Recoverability degrades fast)

Outliers      Robust          Somewhat Robust

Page 131: Approximate Mining of  Consensus Sequential Patterns


Understanding ApproxMAP

5 experiments
– k in kNN clustering
– Strength cutoff
– Order of alignment
– Optimization 1 : reduced precision in the proximity matrix
– Optimization 2 : sample-based iterative clustering

Page 132: Approximate Mining of  Consensus Sequential Patterns


A realistic DB

Notation  Meaning                          Value
|| I ||   # of items                       1,000
|| ||     # of potentially freq itemsets   5,000

Ipat Avg. # of items per itemset in BP 2

Lpat Avg. # of itemsets per base pat 14=0.7*Lseq

Npat # of base pattern sequences 100

Nseq # of data sequences 10,000

Lseq Avg. # of itemsets per data seq 20

Iseq Avg. # of items per itemset in DB 2.5

Page 133: Approximate Mining of  Consensus Sequential Patterns


Input parameters

[Charts: Recoverability and Precision (%) vs. Theta, the strength threshold (0–100%); and Recoverability and Precision (%) vs. k for kNN clustering (2–10)]

Page 134: Approximate Mining of  Consensus Sequential Patterns

The order in multiple alignment

Page 135: Approximate Mining of  Consensus Sequential Patterns


Understanding ApproxMAP

Optimization 1 : Lseq
– Running time : reduced to 40%

Optimization 2 : Nseq
– Running time : reduced to 10%–40%
– For negligible reduction in recoverability

Page 136: Approximate Mining of  Consensus Sequential Patterns


Effects of the DB param & scalability

4 experiments
– || I || : # of unique items in the database
  • Density of the database
  • 1,000 – 10,000
– Nseq : # of sequences in the data
  • 10,000 – 100,000
– Lseq : Avg. # of itemsets per data seq
  • 10 – 50
– Iseq : Avg. # of items per itemset in DB
  • 2.5 – 10

Page 137: Approximate Mining of  Consensus Sequential Patterns

[Charts: running time (sec) vs. |I| : # of unique items in D (0–10,000); vs. Nseq (0–100,000); vs. Lseq (0–50); and vs. Iseq (0–20)]

Page 138: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 139: Approximate Mining of  Consensus Sequential Patterns


Case Study : Real data

Monthly services to children with an A&N report
992 sequences
15 interpretable and useful patterns

(RPT)(INV,FC)(FC) ..11.. (FC)
– 419 sequences

(RPT)(INV,FC)(FC)(FC)
– 57 sequences

(RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM)
– 39 sequences

Page 140: Approximate Mining of  Consensus Sequential Patterns


Overview

What is KDD (Knowledge Discovery & Data mining)
Problem : Sequential Pattern Mining
Method : ApproxMAP
Evaluation Method
Results
Case Study
Conclusion

Page 141: Approximate Mining of  Consensus Sequential Patterns


Conclusion : why does it work well?

Robust on random & weak patterned noise
Very good at organizing sequences
Long sequence data that are not random have unique signatures

Page 142: Approximate Mining of  Consensus Sequential Patterns


What have I done?

defines a new model, Multiple Alignment Sequential Pattern Mining,

describes a novel solution, ApproxMAP (APPROXimate Multiple Alignment Pattern mining)
– that introduces a new metric for itemsets
– weighted sequences : a new representation of alignment information
– and the effective use of strength cutoffs to control the level of detail included in the consensus patterns

designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms,

Page 143: Approximate Mining of  Consensus Sequential Patterns


What have I done?

employs the evaluation method to run an extensive set of empirical evaluations of approxMAP on synthetic data,

employs the evaluation method to compare the effectiveness of approxMAP to the conventional methods based on support model,

derives the expected support of random sequences under the null hypothesis of no pattern in the database to better understand the behavior of the support-based methods, and

demonstrates the usefulness of approxMAP using real world data.

Page 144: Approximate Mining of  Consensus Sequential Patterns


Future Work

Sample-based iterative clustering
– Memory management

Distance metric
– Multisets
– Taxonomy tree

Strength cutoff
– Automatic detection of customized cutoffs

Local alignment

Page 145: Approximate Mining of  Consensus Sequential Patterns


Thank You !

Advisor
– Wei Wang (02-04)
– James Coggins (00-02)
– Prasun Dewan (99-00)
– Kye Hedlund (96-99)
– Jan Prins (95-96)

SW advisor
– Dean Duncan (98-04)

Other people
– Janet Jones
– Kim Flair
– Susan Paulsen

Fellow students
– Priyank Porwal, Andrew Leaver-Fay, Leland Smith

Committee
– Stephen Aylward
– Jan Prins
– Andrew Nobel

Other faculty
– Jian Pei
– Jack Snoeyink
– J. S. Marron
– Stephen Pizer
– Stephen Weiss

Colleagues
– Sang-Uok Kum, Jisung Kim
– Alexandra, Michelle, Aron, Chris

Family
– Sohmee, my mom, dad, sister

Page 146: Approximate Mining of  Consensus Sequential Patterns


Page 147: Approximate Mining of  Consensus Sequential Patterns


Rw

Rw(Xw, Y) = [ R'w · wX + (n − wX) ] / n

R'w(Xw, Y) = [ weight(X) + |Y|·wX − 2·weight(X∩Y) ] / [ weight(X) + |Y|·wX ]

(analogous to the plain itemset distance ( |X| + |Y| − 2·|X∩Y| ) / ( |X| + |Y| ))
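The formulas above can be sanity-checked with a direct transcription. This is a sketch with illustrative argument names: `weights` is the item-weight map of the weighted itemset Xw, `wx` its itemset weight, and `n` the number of sequences aligned so far.

```python
def r_prime(weights, wx, y):
    """R'w(Xw, Y) = [weight(X) + |Y|*wX - 2*weight(X ∩ Y)] / [weight(X) + |Y|*wX]."""
    weight_x = sum(weights.values())
    weight_xy = sum(w for item, w in weights.items() if item in y)
    denom = weight_x + len(y) * wx
    return (weight_x + len(y) * wx - 2 * weight_xy) / denom

def r_w(weights, wx, y, n):
    """Rw(Xw, Y) = [R'w * wX + (n - wX)] / n."""
    return (r_prime(weights, wx, y) * wx + (n - wx)) / n
```

With a single-sequence weighted itemset (all weights 1, wX = n = 1) this collapses to the plain itemset distance, e.g. R'w of {A,B} against {A} is (2 + 1 − 2)/3 = 1/3.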

ID    Aligned Sequence
seq3  (A)    ()     (B)     (DE)   ()
seq2  (AE)   (H)    (B)     (D)    ()
seq4  (A)    ()     (BCG)   (D)    ()
seq5  ()     ()     (BCI)   (DE)   ()
seq1  (AG)   (F)    (BC)    (AE)   (H)

Weighted Seq  (A:4,E:1,G:1):4  (H:1,F:1):2  (B:5,C:3,G:1,I:1):5  (A:1,D:4,E:3):5  (H:1):1  : 5