mining patterns in long sequential data with noise wei wang, jiong yang, philip s. yu acm sigkdd...

20
Mining Patterns in Long Mining Patterns in Long Sequential Data with Sequential Data with Noise Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2 , Issue 2 (December 2000) Special issue on “Scalable data mining algorithms”

Upload: aileen-newton

Post on 06-Jan-2018

216 views

Category:

Documents


3 download

DESCRIPTION

Introduction Pattern discovery in time series data or some inherent physical structure → mining patterns in long sequential data Application : –Bio-Medical Study : chromosomes as sequences of amino acids –Performance Analysis : system-monitoring application –Client Profile : User profiles can be built based on the discovered pattern on trace logs

TRANSCRIPT

Page 1: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Mining Patterns in Long Mining Patterns in Long Sequential Data with Sequential Data with

NoiseNoise

Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter

Volume 2 ,  Issue 2  (December 2000) Special issue on “Scalable data mining algorithms”

Page 2: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

OutlineOutline• Introduction• Injection of noises

– Asynchronous Patterns– Meta Patterns

• Over-population of uninteresting patterns

• Conclusions

Page 3: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

IntroductionIntroduction• Pattern discovery in time series data or

some inherent physical structure → mining patterns in long sequential data

• Application :– Bio-Medical Study : chromosomes as

sequences of amino acids– Performance Analysis : system-monitoring

application– Client Profile : User profiles can be built

based on the discovered pattern on trace logs

Page 4: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Introduction (cont.)Introduction (cont.)• Tolerable noises may in different

formats (depend on the type of application and the user’s interests) :– Injection of noises.– Over-population of uninteresting

patterns.

Page 5: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Injection of noisesInjection of noises• Two models are proposed to

address the issues of accommodating insertion of random noises and characterizing change of behavior.– Asynchronous Patterns– Meta Patterns

Page 6: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Asynchronous PatternsAsynchronous Patterns• Mining periodic pattern assumed that

Disturbance is allowed only in terms of "missing occurrences" but not as general as any "insertion of random noise events". – "Smith reads newspaper every morning" is a

periodic pattern. missing occurrences (synchronization)

– inventory replenishment of cold medicine : the refill time shifts to the 3rd week of a month (not the beginning of the month any longer). insertion of random noise event (asynchronous)

Page 7: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Asynchronous Patterns Asynchronous Patterns (cont.)(cont.)• Valid segments : is required to be of at

least min_rep contiguous repetitions of the pattern and the length of each piece of disturbance is allowed only up to max_dis.

• Valid subsequence : is a set of non-overlapping valid segments.

• Longest valid subsequence : A valid subsequence with the most overall repetitions.

Page 8: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

S2 & S4 can be a valid

subsequence whose overall # of repetitions is

10

S1 & S4 : valid segment (dis=9 > mix_dis=6 )

D1,D2~D19 are 19

matches of (d1,*,*)

If min-rep=5, S1,S2,S4 are

valid segments , S3 is not

Both X And Y are extendible (given position i and ending position j (j<i), if j≧ I- max_dis-1 then j is extendible), X dominates Y at position 20 (iif the number of repetitions of X ≧ Y)

Page 9: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Asynchronous Patterns Asynchronous Patterns (cont.)(cont.)• Distance-based pruning of candidate patterns

– Given a symbol d and a period l, if Cdl ≧ min_rep threshold, then it’s possible that d might participate in some valid pattern of period l.– For example, {A,B,D,A,C,A,A,C,A,A,A,B,A,A,C } and min_rep=3, then A,C may be valid pattern and B,D not be

• Apriori property of complex patterns– a valid segment of a pattern is also a valid segment of any pattern with fewer symbols specified in the pattern. – For example, a valid segment for (dl,d2,*) will also be one for (dl,*,*).

• Extendibility and subsequence dominate

For example, if (d1,*,*,*) and (d2,*,*,*) are valid, then three candidates 2-patterns can be generated: (d1,d2,*,*) , (d1,*,d2,*) , (d1,*,*,d2) .Similarly, (d1, d2, d3,*) can become a candidates 3-pattern only if (d1, d2,*,*), (d1,*, d3,*) and (*, d2, d3,*) are all valid.

Page 10: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Meta patternsMeta patterns• Let S={a,b,c…} be a set of literals.

– Basic pattern: each component in the pattern is restricted to be either a literal or a “*”. • biweekly replenishment P1=(r:[1,1],*:[2,2])• triweekly replenishment P2=(r:[1,1],*:[2,3])

– Meta pattern : may have pattern(s)/meta-pattern(s) as its component(s).• Two-level periodicity (P1:[1,24],*:[25,25],P2:[26,52])• Three components: P1,*,P2

Page 11: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Meta patterns (cont.)Meta patterns (cont.)• ((r:[1,1],*:[2,2]):[1,24],*:[25,25],(r:[1,1],*:[2,3]):[26,52])

– length of a component : (r:[1,1],*:[2,2])=24– Span of meta-pattern : 52– Abbreviation : ((r,*):[1,24],*,(r,*:[2,3]):[26,52])– Level of meta-pattern: max level of its component

+1• Level of basic pattern is 1. For instance, (r,*:

[2,3]) is level 1• P1 = ((r,*):[1,24],*,(r,*:[2,3]):[26,52]) is level 2• the components of a meta-pattern do not have

to be of the same level. For instance, (P1:[1,260],*:[261,300]) is level 3 (P1 is level 2)

Page 12: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Meta patterns (cont.)Meta patterns (cont.)

• Figure 2(a) : min_rep=3, max_dis=4, a meta-pattern ((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],(a,b,*):[31,49],*:[50,51],(b,c):[52,57],*:[58,60])• Figure 2(b) : Many patterns/meta-patterns may collocate or overlap for any given portion of a sequence. For example,both of (a, b, a, *) and (a, *) are valid within the subsequence.

Page 13: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Meta patterns (cont.)Meta patterns (cont.)How to identify the “proper” candidate ?• Component location property : can provide substantial inter-level pruning effect during the generation of high level candidates from valid low level meta-patterns.• Apriori property : can render some pruning power to conduct the mining process of meta-patterns of the same level.

A valid low level meta-pattern may serve as a component of a higher level meta-pattern only if its presence in the symbol sequence exhibits some cyclic behavior and such cyclic behavior has to follow the same periodicity as the higher level meta-pattern by sufficient number of times (i.e., at least min_rep times). • For example: X1 can server as a component of a higher level meta-pattern X2X1=((a,b,*):[1,19],*:[20,21])X2=((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],X1:[31,150]

Page 14: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Meta patterns (cont.)Meta patterns (cont.)

Figure 3, the pruning effects provided by the component location property and the Apriori property are indicated by dashed arrows and solid arrows, respectively.

Page 15: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Over-population of Over-population of uninteresting patterns uninteresting patterns

• In some applications, the number of occurrences may not represent the significance of a pattern. – Computational Biology : gene expressions– Web server load : the high workload on all

servers may occur at a much lower frequency than other states.

– Earthquake : big earthquake is much more valuable even though it occurs at a much lower frequency than smaller ones.

Page 16: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Over-population of Over-population of uninteresting patterns (cont.)uninteresting patterns (cont.)• Information gain : is a measurement of

how likely a pattern will occur or the amount of "surprise" when a pattern actually occurs.

• Information model : For a given minimum information gain threshold, let Ψ be the set of patterns that satisfy this threshold.

• Support model : in order to find all patterns in Ψ, the minimum support threshold has to be set very low. --> too many patterns discovered. next

Page 17: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Information gainInformation gainLet E = {a1 , a2 , . . . an} be a set of distinct events. The event sequence is a sequence of events in E. • information carried by an event ai (ai E) is defined to be I(ai) = -log|E| Prob(ai)

– |E| : is # of events in E– Prob(ai) : the probability that ai occurs = Num(a

i)/N• information gain : a pattern P in an event sequence D, the information gain of P in D is defined as G(P) = I ( P ) x (Support(P) - 1).

Page 18: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Information gain (cont.)Information gain (cont.)I(ai)= -log|E| Prob(ai)

G(P)= I (P) x (Support(P) - 1)

E = {a1 , a2 , . . . a6}

|E| = 40

Support (a2 , a6 , * , *) = Repetition((a2 , a6 , * , * )) = 3

G((a2 , a6 , * ,* )) = I(I(a2)+I(a6)) x (Support ((a2 , a6 , * ,* ))-1 )= (0.90+1.45) x (3-1) = 2.35 x 2 = 4.70

back

Page 19: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

Over-population of Over-population of uninteresting patterns uninteresting patterns

(cont.)(cont.)• Two traces :

– Scour is a web search engine that is specialized for multimedia contents.

– IBM Intranet traces consist of 160 critical nodes, e.g.,file servers, routers, etc., in the IBM T. J. Watson Intranet.

For example, the pattern (nodea_fail,*, nodeb_saturated,*) has the eighth highest information gain. This pattern means that a short time after a router (nodea) fails, the CPU on another node (nodeb) is saturated. Under a thorough investigation, we found that nodeb is a file server and after nodea fails, all requests to some files are sent to nodeb, thus causes the bottleneck.

Page 20: Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000)

ConclusionConclusion• This paper discuss three recent research advances of mining patterns in time series data given the presence of noise.

– J. Yang, W. Wang, and P. Yu. Mining asynchronous periodic patterns in time series data. Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pp. 275-279, 2000.– J. Yang, W. Yang, and P. Yu. Meta-patterns: revealing hierarchy of periodic patterns. IBM Research Report,2001.– J. Yang, W. Yaug, and P. Yu. InfoMiner: mining significant periodic patterns with rare events in time series data. IBM Research Report, 2001.