Mining Long Sequential Patterns in a Noisy Environment

Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han

SIGMOD 2002

Outline

• Introduction

• Model

• Algorithm

• Evaluation

• Conclusion

Introduction

• Pattern discovery in long sequences has many applications.

• The common metric used to qualify a significant pattern is support.

• Noise usually exists.
  – A symbol may be misrepresented by another symbol.
  – As a result, an occurrence of a pattern cannot be recognized.
    • E.g.: when the sequence d1d3d4d5 is misrepresented as d1d2d4d5, the pattern ‘d1d3’ cannot be found.

Introduction

  – The observed support of a pattern may be lower than its real support.
  – Some frequent patterns cannot be discovered because of the noise.

• Failing to find a frequent pattern is more critical when the pattern is long.
  – Long patterns are more vulnerable to distortions.

• If the database contains both noise and long patterns, support is not a suitable measure of pattern significance.

Introduction

  – E.g.: gene sequence analysis
    • With amino acids as the granularity, the length of a gene expression is usually a few thousand.
    • Noise is common: some mutations of amino acids occur with non-negligible probability.

• Compatibility matrix
  – A matrix whose elements give the probability that an observed value corresponds to a given true underlying value.
  – Each observed symbol is interpreted as an occurrence of a set of symbols with various probabilities.

Introduction

• An example of the compatibility matrix (rows: true value; columns: observed value):

                   Observed value
  True value    d1     d2     d3     d4     d5
  d1            0.9    0.1    0      0      0
  d2            0.05   0.8    0.05   0.1    0
  d3            0.05   0      0.7    0.15   0.1
  d4            0      0.1    0.1    0.75   0.05
  d5            0      0      0.15   0      0.85

  – Reading the d1 column: Prob(d1|d1)=0.9, Prob(d2|d1)=0.05, Prob(d3|d1)=0.05, Prob(d4|d1)=0, Prob(d5|d1)=0.

• Based on the compatibility matrix, a new measure, called match, is proposed to qualify important patterns.

Model

• I = {d1, d2, …, dm} is the set of distinct symbols.

• A sequence (pattern) of length n is an ordered list of n symbols in I.
  – E.g.: d1d2d1 is a sequence (pattern) of length 3.

• Given a sequence $S = s_1 s_2 \ldots s_{l_s}$, a pattern $P = d_1 d_2 \ldots d_{l_p}$ is a subsequence (subpattern) of S
  – if there exists a list of integers $1 \le i_1 < i_2 < \ldots < i_{l_p} \le l_s$ such that $d_j = s_{i_j}$ for $1 \le j \le l_p$.
  – S is also called a supersequence (superpattern) of P.
  – E.g.: d1d4d5 is a subpattern of d1d3d4d5.

Model

• Given I = {d1, d2, …, dm}, the compatibility matrix is an $m \times m$ matrix.
  – Its element $C(d_i, d_j) = \mathrm{Prob}(\text{true\_value} = d_i \mid \text{observed\_value} = d_j)$, where $1 \le i, j \le m$.
  – The compatibility matrix is assumed to be provided by an expert in the area.

• Given a pattern $P = d_1 d_2 \ldots d_l$ and a sequence $s = d_1' d_2' \ldots d_l'$, the match of P in s, denoted by M(P,s),
  – is defined as the conditional probability that s corresponds to an occurrence of P.

Model

• If each observed symbol is generated independently, then $M(P,s) = \mathrm{Prob}(P \mid s) = \prod_{1 \le i \le l} C(d_i, d_i')$.
  – E.g.: if P=d1d2 and s=d1d3, then M(P,s) = C(d1,d1) × C(d2,d3) = 0.9 × 0.05 = 0.045.
  – Here P is not a subpattern of s, but M(P,s) > 0.

• Given a sequence S of length $l_s$ and a pattern P of length $l_p$, where $l_s \ge l_p$, the match of P in S
  – is defined as the maximal match of P in every distinct subsequence (of length $l_p$) of S,
  – i.e. $M(P,S) = \max_{s \text{ is a subsequence of } S} M(P,s)$.
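The two definitions above can be checked directly on the example matrix. The following is a minimal sketch, not the authors' code: the names `match_aligned` and `match_brute_force` are made up for illustration, and the maximum is taken by plain enumeration of subsequences, which is only practical for very short sequences.

```python
from itertools import combinations
from math import prod

SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
# Rows are true values, columns are observed values (the example matrix).
ROWS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.05, 0.8, 0.05, 0.1, 0.0],
    [0.05, 0.0, 0.7, 0.15, 0.1],
    [0.0, 0.1, 0.1, 0.75, 0.05],
    [0.0, 0.0, 0.15, 0.0, 0.85],
]
# C[true][observed] = Prob(true_value | observed_value)
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def match_aligned(pattern, s):
    """M(P, s) for equal-length P and s: product of C(d_i, d_i')."""
    assert len(pattern) == len(s)
    return prod(C[t][o] for t, o in zip(pattern, s))


def match_brute_force(pattern, seq):
    """M(P, S): maximum of M(P, s) over all length-|P| subsequences s of S."""
    candidates = combinations(seq, len(pattern))  # keeps the original order
    return max((match_aligned(pattern, s) for s in candidates), default=0.0)


print(match_aligned(("d1", "d2"), ("d1", "d3")))
print(match_brute_force(("d1", "d3"), ("d1", "d2", "d4", "d5")))
```

The second call revisits the noisy sequence d1d2d4d5 from the introduction: the pattern d1d3 no longer occurs exactly, yet its match is still positive because d4 and d5 are compatible with d3.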

Model

  – There are as many as $\binom{l_s}{l_p}$ distinct subsequences.
  – Dynamic programming: $O(l_p \times l_s)$ time.
  – Optimization to nearly $O(l_s)$ time.

• Given a pattern P and a database D of N sequences, the match of P in D is defined as $M(P,D) = \sum_{S \in D} M(P,S) / N$.

• A minimum match threshold min_match is specified by the user. All patterns that meet the match threshold are called frequent patterns.
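A sketch of the dynamic-programming computation and of the database-level match follows. It is one straightforward way to reach the stated $O(l_p \times l_s)$ bound, not necessarily the authors' formulation, and it does not attempt the near-linear optimization. The function names, the toy database, and the use of `>=` for "meets the threshold" are assumptions made for this example.

```python
SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
ROWS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.05, 0.8, 0.05, 0.1, 0.0],
    [0.05, 0.0, 0.7, 0.15, 0.1],
    [0.0, 0.1, 0.1, 0.75, 0.05],
    [0.0, 0.0, 0.15, 0.0, 0.85],
]
# C[true][observed], the example compatibility matrix.
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def match_dp(pattern, seq, C):
    """M(P, S) in O(l_p * l_s) time.

    best[j] holds the best match of the first j pattern symbols against
    some subsequence of the part of seq read so far.
    """
    best = [1.0] + [0.0] * len(pattern)
    for obs in seq:
        # Go right to left so each observed symbol is used at most once.
        for j in range(len(pattern), 0, -1):
            cand = best[j - 1] * C[pattern[j - 1]][obs]
            if cand > best[j]:
                best[j] = cand
    return best[-1]


def match_in_database(pattern, database, C):
    """M(P, D): average of M(P, S) over the N sequences of D."""
    return sum(match_dp(pattern, S, C) for S in database) / len(database)


def is_frequent(pattern, database, C, min_match):
    """Assumes 'meets the threshold' means match >= min_match."""
    return match_in_database(pattern, database, C) >= min_match


database = [("d1", "d2", "d4", "d5"), ("d1", "d3")]
print(match_in_database(("d1", "d3"), database, C))
```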

Model

• The match model can accommodate misrepresentation of symbols due to noise.

• The Apriori property also holds for the match metric.
  – The match of a pattern P in a database D ≤ the match of any subpattern of P.

• In a noise-free environment, the match model reduces to the support model.
  – Let the compatibility matrix be an identity matrix (C(di,dj)=1 if i=j and 0 otherwise).
  – The match of a pattern is then equal to its support.
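Both properties are easy to observe on a toy database. The sketch below assumes a three-symbol alphabet and a small hand-made database, and uses a simple dynamic program for the match (an illustrative choice): with an identity matrix each sequence contributes 1 if it contains the pattern and 0 otherwise, so the match equals the support.

```python
def match_dp(pattern, seq, C):
    """M(P, S) via a simple O(l_p * l_s) dynamic program."""
    best = [1.0] + [0.0] * len(pattern)
    for obs in seq:
        for j in range(len(pattern), 0, -1):
            best[j] = max(best[j], best[j - 1] * C[pattern[j - 1]][obs])
    return best[-1]


def is_subsequence(pattern, seq):
    """True if pattern occurs in seq as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(sym in it for sym in pattern)


symbols = ["d1", "d2", "d3"]
# Identity compatibility matrix: every observation is taken at face value.
identity = {t: {o: 1.0 if t == o else 0.0 for o in symbols} for t in symbols}

database = [("d1", "d2", "d3"), ("d1", "d3"), ("d2", "d1")]
pattern = ("d1", "d3")

match = sum(match_dp(pattern, S, identity) for S in database) / len(database)
support = sum(is_subsequence(pattern, S) for S in database) / len(database)
sub_match = sum(match_dp(("d1",), S, identity) for S in database) / len(database)

print(match, support)      # identical under the identity matrix
print(sub_match >= match)  # Apriori: a subpattern's match is never smaller
```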

Algorithm

• Problems to tackle
  – A large number of frequent patterns under the match metric.
  – Long frequent patterns.

• Techniques used: sampling, Chernoff bound estimation, and border collapsing.

• Three phases:
  – Phase 1: finding the match of individual symbols and sampling
  – Phase 2: ambiguous pattern discovery on samples
  – Phase 3: border collapsing

Algorithm

• Phase 1: finding the match of each symbol and sampling
  – For a sequence $D_i$ in the database, the match of a symbol d in $D_i$ is $M(d, D_i) = \max_{d_j \in D_i} C(d, d_j)$.
  – E.g.: if Di = d2d3, then
      M(d1, Di) = max{0.1, 0} = 0.1
      M(d2, Di) = max{0.8, 0.05} = 0.8
      M(d3, Di) = max{0, 0.7} = 0.7
      M(d4, Di) = max{0.1, 0.1} = 0.1
      M(d5, Di) = max{0, 0.15} = 0.15
  – Draw a sample of the whole database and store it in memory.

                   Observed value
  True value    d1     d2     d3     d4     d5
  d1            0.9    0.1    0      0      0
  d2            0.05   0.8    0.05   0.1    0
  d3            0.05   0      0.7    0.15   0.1
  d4            0      0.1    0.1    0.75   0.05
  d5            0      0      0.15   0      0.85
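The Phase 1 example can be reproduced in a few lines. This is a sketch only: `symbol_match` is a hypothetical helper name, and uniform sampling without replacement via `random.Random.sample` is an assumption, since the slides do not say how the sample is drawn.

```python
import random

SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
ROWS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.05, 0.8, 0.05, 0.1, 0.0],
    [0.05, 0.0, 0.7, 0.15, 0.1],
    [0.0, 0.1, 0.1, 0.75, 0.05],
    [0.0, 0.0, 0.15, 0.0, 0.85],
]
# C[true][observed], the example compatibility matrix.
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def symbol_match(symbol, seq, C):
    """M(d, D_i): best compatibility of d with any observed symbol of D_i."""
    return max((C[symbol][obs] for obs in seq), default=0.0)


Di = ("d2", "d3")
per_symbol = {d: symbol_match(d, Di, C) for d in SYMBOLS}
print(per_symbol)

# Assumed sampling scheme: uniform, without replacement, fixed seed.
database = [("d2", "d3"), ("d1", "d4"), ("d5", "d5", "d1"), ("d3",)]
sample = random.Random(0).sample(database, 2)
print(len(sample))
```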

Algorithm

• Phase 2: ambiguous pattern discovery on the sample dataset

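• Chernoff bound estimation
  – If n is the size of the sample and $\mu$ is the match of a pattern $P = d_1 d_2 \ldots d_l$ in the sample, then P is
    • frequent in the whole database with probability $1 - \delta$ if $\mu > \text{min\_match} + \varepsilon$
    • infrequent in the whole database with probability $1 - \delta$ if $\mu < \text{min\_match} - \varepsilon$
    • ambiguous if $\mu \in (\text{min\_match} - \varepsilon, \text{min\_match} + \varepsilon)$
  – where $\varepsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, R is the spread of $\mu$, and $R = \min_{1 \le i \le l} \text{match}[d_i]$.
  – $\delta$ can be selected by the user, e.g. $\delta = 0.001$, $1 - \delta = 99.9\%$.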
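The three-way labelling can be sketched as follows. The code assumes the error bound has the usual Chernoff/Hoeffding form, epsilon = sqrt(R^2 ln(1/delta) / (2n)); that formula, the function names, and the numbers in the usage lines are assumptions for illustration. A sample match exactly on an interval end point is treated as ambiguous here.

```python
from math import log, sqrt


def chernoff_epsilon(spread, delta, n):
    """Error bound, assuming epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return sqrt(spread ** 2 * log(1.0 / delta) / (2.0 * n))


def classify(sample_match, min_match, spread, delta, n):
    """Label a pattern from its match in a sample of n sequences."""
    eps = chernoff_epsilon(spread, delta, n)
    if sample_match > min_match + eps:
        return "frequent"
    if sample_match < min_match - eps:
        return "infrequent"
    return "ambiguous"


# Hypothetical numbers: spread 0.5, confidence 99.9%, 10,000 sampled sequences.
print(chernoff_epsilon(0.5, 0.001, 10_000))
for mu in (0.12, 0.105, 0.08):
    print(mu, classify(mu, min_match=0.1, spread=0.5, delta=0.001, n=10_000))
```

Because the spread R is the smallest single-symbol match in the pattern, a pattern that contains a rare symbol gets a tighter interval and is less likely to end up ambiguous.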

Algorithm

• Phase 2: ambiguous pattern discovery on the sample dataset
  – Use an existing algorithm to mine the sample.
  – Label each pattern discovered in the sample as frequent, ambiguous, or infrequent according to the Chernoff bound estimation.
  – Find the border (denoted by FQT) between the frequent and ambiguous patterns, and the border (denoted by INFQT) between the ambiguous and infrequent patterns.

Algorithm

• Phase 2: ambiguous pattern discovery on the sample dataset

• FQT = {p | p is frequent ∧ the immediate superpatterns of p are either ambiguous or infrequent}

• INFQT = {p | p is ambiguous ∧ the superpatterns of p are all infrequent}
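The two borders can be read off a table of labels. The sketch below rests on several assumptions that are not in the slides: patterns are tuples over a toy alphabet, only patterns present in the `labels` table are considered (anything absent is implicitly infrequent), and INFQT is tested against immediate superpatterns only, which is enough when the labels respect the Apriori property.

```python
def is_subsequence(p, q):
    """True if pattern p is a subsequence of pattern q."""
    it = iter(q)
    return all(sym in it for sym in p)


def immediate_superpatterns(p, labels):
    """Labelled patterns one symbol longer than p that contain p."""
    return [q for q in labels if len(q) == len(p) + 1 and is_subsequence(p, q)]


def borders(labels):
    """Return (FQT, INFQT) for a dict mapping pattern -> label."""
    fqt = {
        p for p, lab in labels.items()
        if lab == "frequent"
        and all(labels[q] != "frequent" for q in immediate_superpatterns(p, labels))
    }
    infqt = {
        p for p, lab in labels.items()
        if lab == "ambiguous"
        and all(labels[q] == "infrequent" for q in immediate_superpatterns(p, labels))
    }
    return fqt, infqt


# Hypothetical labels produced by Phase 2 on a sample.
labels = {
    ("a",): "frequent",
    ("b",): "frequent",
    ("a", "b"): "frequent",
    ("b", "a"): "ambiguous",
    ("a", "b", "a"): "ambiguous",
    ("a", "b", "a", "b"): "infrequent",
}
fqt, infqt = borders(labels)
print(fqt, infqt)
```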

Algorithm

• Phase 3: Border Collapsing
  – Scan the database to count the matches of ambiguous patterns to see whether they are frequent or infrequent.

[Figure: border collapsing. Input: infrequent patterns / INFQT / ambiguous patterns / FQT / frequent patterns. Processing yields the output: infrequent patterns / border / frequent patterns.]

Algorithm

• Phase 3: Border Collapsing
  – If memory can hold the counters for all ambiguous patterns, one database scan suffices.
  – Sometimes there is a huge number of ambiguous patterns, and the database has to be scanned several times.
    • Select a set of ambiguous patterns until the memory is filled up by their counters, scan the database to get their matches, and collapse the border. Repeat the select-scan-collapse procedure until the two borders become one.
    • Try to minimize the number of I/O passes needed.
    • The ambiguous patterns with high border-collapsing power are selected.

Algorithm

• Phase 3: Border Collapsing
  – How to select patterns? As in binary search.

Algorithm

• Phase 3: Border Collapsing

Algorithm

• Phase 3: Border Collapsing
  – If there are x levels of ambiguous patterns,
    • a level-wise method needs to scan the database O(x) times,
    • while the border collapsing method only needs to scan the database O(log x) times.
  – For some previously ambiguous patterns, the labels (frequent or infrequent) become known, but their exact matches remain unknown after this step.
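The O(x) versus O(log x) contrast can be illustrated on a deliberately simplified model: a single chain of x nested ambiguous patterns, one probed level per database scan, and true labels that respect the Apriori property (frequent up to some level, infrequent beyond it). The real algorithm works on a pattern lattice and packs many counters into each scan, so the functions below are a hypothetical sketch of the idea only.

```python
def level_wise_scans(is_frequent_at, x):
    """Probe levels 1, 2, ... in order; one database scan per level."""
    scans = 0
    for level in range(1, x + 1):
        scans += 1
        if not is_frequent_at(level):
            break  # by Apriori, every longer pattern is infrequent too
    return scans


def collapsing_scans(is_frequent_at, x):
    """Probe the middle unresolved level each time, as in binary search.

    Returns (number of scans, last frequent level).
    """
    lo, hi = 1, x  # range of levels whose label is still unknown
    scans = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        scans += 1
        if is_frequent_at(mid):
            lo = mid + 1   # levels up to mid are frequent without counting them
        else:
            hi = mid - 1   # levels from mid on are infrequent without counting them
    return scans, lo - 1


# Hypothetical chain: 1000 ambiguous levels, truly frequent up to level 700.
truth = lambda level: level <= 700
print(level_wise_scans(truth, 1000))
print(collapsing_scans(truth, 1000))
```

The levels skipped by the binary search are exactly the patterns whose labels become known while their matches are never counted.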

Evaluation

• Database
  – Standard database
    • A protein database consisting of 600K sequences of amino acids.
    • The average length of a sequence is around 500.
    • 20 different symbols.
  – Test databases are generated from the standard database with random noise; α controls the degree of noise.
    • A symbol d in the standard database remains the same in the test database with probability 1 − α, and changes to any one of the other 19 symbols with probability α/19.
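The noise model is simple to simulate. In this sketch the noise degree is called `alpha` (a name chosen for the example), the alphabet is the 20 standard one-letter amino acid codes, and `add_noise` is a hypothetical helper; a symbol is replaced with probability alpha, and the replacement is drawn uniformly from the other 19 symbols, so each of them has probability alpha/19.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 symbols


def add_noise(seq, alphabet, alpha, rng):
    """Return a noisy copy of seq under the uniform substitution model."""
    noisy = []
    for sym in seq:
        if rng.random() < alpha:
            # Replace with one of the other symbols, chosen uniformly.
            noisy.append(rng.choice([a for a in alphabet if a != sym]))
        else:
            noisy.append(sym)
    return "".join(noisy)


rng = random.Random(42)
clean = "MKTAYIAKQRQISFVKSHFSRQ"
print(add_noise(clean, AMINO_ACIDS, 0.1, rng))
```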

Evaluation

• Robustness of Match Model
  – Mine the standard database
    • $R_M$ = {frequent patterns found by the match model}
    • $R_S$ = {frequent patterns found by the support model}
    • $R_M = R_S$
  – Mine the test database
    • $R_M'$, $R_S'$
  – Accuracy: $|R_M' \cap R_M| / |R_M'|$, $|R_S' \cap R_S| / |R_S'|$
  – Completeness: $|R_M' \cap R_M| / |R_M|$, $|R_S' \cap R_S| / |R_S|$
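Both measures are set ratios, analogous to precision and recall. A minimal sketch with made-up pattern sets (the strings below are placeholders, not results from the experiments); returning 0.0 for an empty denominator is an assumption of this example.

```python
def accuracy(found_on_test, found_on_standard):
    """|R' ∩ R| / |R'|: share of patterns found on the noisy data that are genuine."""
    if not found_on_test:
        return 0.0  # assumed convention for an empty result set
    return len(found_on_test & found_on_standard) / len(found_on_test)


def completeness(found_on_test, found_on_standard):
    """|R' ∩ R| / |R|: share of genuine patterns that survive the noise."""
    if not found_on_standard:
        return 0.0  # assumed convention for an empty reference set
    return len(found_on_test & found_on_standard) / len(found_on_standard)


R = {"ab", "abc", "b"}             # hypothetical result on the standard database
R_prime = {"ab", "b", "bd", "c"}   # hypothetical result on a test database
print(accuracy(R_prime, R), completeness(R_prime, R))
```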

Evaluation

• Robustness of Match Model: different noise degrees
  – Match model: accuracy and completeness are more than 95%.
  – Support model: vulnerable to the noise.

Evaluation

• Robustness of Match Model: different pattern lengths
  – Match model: unaffected by the pattern length.
  – Support model: degrades as the patterns become longer.

Evaluation

• Robustness of Match Model
  – When there is some error in the compatibility matrix:
    • with an error of 10%, the match model still achieves 88% accuracy and 85% completeness.

Evaluation

• Sample size
  – Patterns whose match falls in the range (min_match − ε, min_match + ε) are ambiguous.
  – Larger sample size → smaller ε → fewer ambiguous patterns.
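The dependence on the sample size can be made explicit. Assuming the bound takes the usual Chernoff/Hoeffding form (an assumption of this sketch), solving it for n gives the sample size needed to reach a target half-width ε:

```latex
\varepsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
\quad\Longleftrightarrow\quad
n = \frac{R^{2}\,\ln(1/\delta)}{2\,\varepsilon^{2}}

% Illustrative numbers (not from the experiments): R = 1, \delta = 10^{-4}, \varepsilon = 0.01
n = \frac{\ln(10^{4})}{2 \times 10^{-4}} \approx \frac{9.2103}{0.0002} \approx 46{,}052
```

Since n grows with 1/ε², halving the width of the ambiguous range costs about four times as many sampled sequences, while tightening the confidence only costs a logarithmic factor.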

Evaluation

• Spread of Match R
  – R(P) = the minimum match among the symbols in P

• Longer length, tighter R

• Higher degree of noise, smaller R

Evaluation

• Effects of Confidence 1 − δ
  – Previous experiments: 1 − δ = 0.9999

Evaluation

• Missing Patterns

Evaluation

• Performance of Border Collapsing Algorithm
  – Compared with
    • Max-Miner, one of the fastest algorithms for mining long frequent patterns;
    • a sampling method, which uses level-wise search to finalize the border.
  – Experimental results
    • CPU time vs. min_match
    • No. of database scans vs. min_match
    • No. of database scans vs. length of the longest patterns

Evaluation

• Performance of Border Collapsing Algorithm

Evaluation

• Scalability w.r.t. the No. of distinct symbols m
  – Synthetic database: 100K sequences, average length of 1000
  – A larger m leads to fewer frequent patterns
  – A larger m leads to a larger ($m \times m$) compatibility matrix

Evaluation

• Scalability w.r.t. the No. of distinct symbols

Conclusion

• In a noisy environment, observed symbols may differ from the real ones.

• A compatibility matrix provides a probabilistic connection from the observation to the underlying true value.

• A new metric, match, is proposed to measure significant patterns.

• Experimental results show that
  – the match model is robust w.r.t. noise;
  – the border collapsing algorithm is very efficient for finding long patterns.

End

?