Copyright © 2005 by Limsoon Wong Convexity in Itemset Spaces Limsoon Wong Institute for Infocomm Research

Post on 21-Dec-2015


Copyright © 2005 by Limsoon Wong

Convexity in Itemset Spaces

Limsoon Wong
Institute for Infocomm Research


Plan

• Frequent itemsets
– Convexity
– Equivalence classes, generators, & closed patterns
– Plateau representation
– Efficient mining of generators & closed patterns
• Emerging patterns
• Odds ratio patterns
• Relative risk patterns


Frequent Itemsets


Association Rules

• Buyer’s behaviour in supermarket

• Mgmt are interested in rules such as


Frequent Itemsets

• List of items: ℐ = {a, b, c, d, e, f}
• List of transactions: T = {T1, T2, T3, T4, T5}
– T1 = {a, c, d}
– T2 = {b, c, e}
– T3 = {a, b, c, e, f}
– T4 = {b, e}
– T5 = {a, b, c, e}
• For each itemset I ⊆ ℐ, sup(I,T) = |{ Ti ∈ T | I ⊆ Ti }|
• Freq itemsets: FT = F(ms,T) = { I ⊆ ℐ | sup(I,T) ≥ ms }
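The two definitions above can be checked by brute force on the five example transactions. The sketch below is only an illustration: the function names are made up here, and the threshold ms = 2 is the one used later in the deck for this example.

```python
from itertools import combinations

# Transactions T1..T5 from the running example.
TRANSACTIONS = [
    frozenset("acd"),
    frozenset("bce"),
    frozenset("abcef"),
    frozenset("be"),
    frozenset("abce"),
]
ITEMS = "abcdef"


def sup(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= t)


def frequent_itemsets(items, transactions, ms):
    """Brute-force F(ms, T): map each frequent itemset to its support."""
    result = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            count = sup(combo, transactions)
            if count >= ms:
                result[frozenset(combo)] = count
    return result


FT = frequent_itemsets(ITEMS, TRANSACTIONS, ms=2)
print(len(FT))  # 16: exactly the subsets of {a, b, c, e}
```

Enumerating all 2^6 itemsets is fine for a toy example but is not how a real miner works; it only serves to make the definitions concrete.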


A Priori Property

• Freq itemsets from our example, with ms = 2: [lattice figure not preserved]
• A priori property: I ∈ FT ⇒ ∀I' ⊆ I, I' ∈ FT


Lattice of Freq Itemsets

• FT can be very large
• Is there a concise rep?
• Observation:
– {a, b, c, e} is maximal
– { } is minimal
– everything else is betw them
⇒ ⟨{ }, {a, b, c, e}⟩ a concise rep for FT?


Convexity

• An itemset space S is convex if, for all X, Y ∈ S s.t. X ⊆ Y, we have Z ∈ S whenever X ⊆ Z ⊆ Y
• An itemset X is most general in S if there is no proper subset of X in S. These itemsets form the left bound L of S
• An itemset X is most specific in S if there is no proper superset of X in S. These itemsets form the right bound R of S
⇒ ⟨L, R⟩ is a concise rep of S
• [L, R] = { Z | ∃X ∈ L, ∃Y ∈ R, X ⊆ Z ⊆ Y } = S
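The border construction can be sketched as follows. This is a minimal illustration with made-up function names: it computes the left and right bounds of a finite itemset space, expands [L, R], and uses the fact that a finite space is convex exactly when [L, R] gives back the space.

```python
from itertools import combinations


def left_bound(space):
    """Most general itemsets: no proper subset of them lies in the space."""
    return {x for x in space if not any(y < x for y in space)}


def right_bound(space):
    """Most specific itemsets: no proper superset of them lies in the space."""
    return {x for x in space if not any(y > x for y in space)}


def interval(left, right):
    """[L, R]: every Z with X <= Z <= Y for some X in L and Y in R."""
    result = set()
    for x in left:
        for y in right:
            if x <= y:
                extra = sorted(y - x)
                for size in range(len(extra) + 1):
                    for added in combinations(extra, size):
                        result.add(x | frozenset(added))
    return result


def is_convex(space):
    """A finite space is convex iff expanding its borders reproduces it."""
    return interval(left_bound(space), right_bound(space)) == set(space)


# All subsets of {a, b, c, e}: convex, with borders {{}} and {{a, b, c, e}}.
full = interval({frozenset()}, {frozenset("abce")})
print(is_convex(full))                             # True
print(is_convex({frozenset(), frozenset("ab")}))   # False: {a} and {b} are missing
```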


Convexity of Freq Itemsets

• Proposition 1: The freq itemset space is convex
⇒ ⟨L, R⟩ is a concise rep for a freq itemset space


Is it good enough?

⇒ ⟨{ }, {a, b, c, e}⟩ can be a concise rep for FT
• But we can't get support values for elems in FT


What is a good concise rep?

• A good concise rep for FT should enable the tasks below efficiently, w/o accessing T again:
– Task 1: Enumerate { I | I ∈ FT }
– Task 2: Enumerate { (I, sup(I,T)) | I ∈ FT }
– Task 3: Given I, decide if I ∈ FT, & if so report sup(I,T)
– Task 4: Enumerate itemsets w/ sup in a given range
– etc.


Closed Itemset Rep

• A pattern is a closed pattern if each of its proper supersets has a smaller support than it
• The closed itemset rep of FT is
CR = { (I, sup(I,T)) | I ∈ FT, I is closed pattern }
• Proposition 2: {(I, sup(I,T)) | I ∈ FT} = {(I, max{sup(I',T) | (I', sup(I',T)) ∈ CR, I ⊆ I'}) | I ∈ FT}
⇒ May be inefficient for Tasks 2, 3, 4
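Proposition 2 can be tried out on the running example. The sketch below assumes ms = 2 and uses illustrative names; it builds CR by brute force and then recovers supports from CR alone. Note that every lookup scans all closed patterns, which hints at why the slide calls this rep potentially inefficient.

```python
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in ("acd", "bce", "abcef", "be", "abce")]
ITEMS = "abcdef"
MS = 2  # assumed threshold, as in the deck's example


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)


all_itemsets = [frozenset(c) for n in range(len(ITEMS) + 1)
                for c in combinations(ITEMS, n)]
FT = {i: sup(i) for i in all_itemsets if sup(i) >= MS}

# Closed: every proper superset has strictly smaller support. For a frequent
# itemset only frequent supersets need checking, because an infrequent
# superset already has support < ms <= sup(I).
CR = {i: s for i, s in FT.items()
      if all(FT[j] < s for j in FT if i < j)}


def support_from_closed(itemset):
    """Proposition 2: max support over closed supersets; None if infrequent."""
    candidates = [s for c, s in CR.items() if itemset <= c]
    return max(candidates) if candidates else None


print(sorted("".join(sorted(c)) for c in CR))
# ['', 'abce', 'ac', 'bce', 'be', 'c']
```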


Generator Rep

• A pattern is a generator if each of its proper subsets has a larger support than it
• The generator rep of FT is
GR = ⟨{ (I, sup(I,T)) | I ∈ FT, I is generator }, GBd⁻⟩
where GBd⁻ are the minimal infrequent itemsets
• Proposition 3: {(I, sup(I,T)) | I ∈ FT} = {(I, min{sup(I',T) | I' ∈ GR, I' ⊆ I}) | I ∈ FT}
⇒ May be inefficient for Tasks 2, 3, 4
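A matching sketch for Proposition 3, again by brute force on the running example with an assumed ms = 2 and illustrative names. The negative border GBd⁻ is what lets the rep reject infrequent itemsets without consulting T.

```python
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in ("acd", "bce", "abcef", "be", "abce")]
ITEMS = "abcdef"
MS = 2  # assumed threshold, as in the deck's example


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def proper_subsets(itemset):
    items = sorted(itemset)
    return (frozenset(c) for n in range(len(items))
            for c in combinations(items, n))


all_itemsets = [frozenset(c) for n in range(len(ITEMS) + 1)
                for c in combinations(ITEMS, n)]

# Frequent generators: every proper subset has strictly larger support.
GENERATORS = {i: sup(i) for i in all_itemsets
              if sup(i) >= MS
              and all(sup(j) > sup(i) for j in proper_subsets(i))}

# GBd-: infrequent itemsets all of whose proper subsets are frequent.
NEG_BORDER = {i for i in all_itemsets
              if sup(i) < MS
              and all(sup(j) >= MS for j in proper_subsets(i))}


def support_from_generators(itemset):
    """Proposition 3: min support over generator subsets; None if infrequent."""
    if any(b <= itemset for b in NEG_BORDER):
        return None
    return min(s for g, s in GENERATORS.items() if g <= itemset)


print(sorted("".join(sorted(g)) for g in GENERATORS))
# ['', 'a', 'ab', 'ae', 'b', 'bc', 'c', 'ce', 'e']
```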


Freq Itemset Plateaus

• Decompose freq itemset lattice into plateaus wrt itemset support, S = ∪i Pi, with Pi = { I ∈ S | sup(I,T) = i }
• Proposition 6: Each Pi is convex
⇒ S = ∪i [Li, Ri], where [Li, Ri] = Pi


From Generators & Closed Patterns To Equivalence Classes

• The equivalence class of an itemset I is
[I]T = { I' | { Ti ∈ T | I' ⊆ Ti } = { Tj ∈ T | I ⊆ Tj } }
• Proposition 4: [I]T is convex. Furthermore, if [L,R] = [I]T, then L = min [I]T, and R = max [I]T is a singleton
• Proposition 5:
– An itemset I is a generator iff I ∈ min [I]T
– An itemset I is a closed pattern iff I ∈ max [I]T
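Equivalence classes are easy to materialise on the running example by grouping itemsets on the set of transactions that contain them. The sketch below (illustrative names, frequent itemsets only, assumed ms = 2) then reads off the generators and the closed pattern of each class as its minimal and maximal elements, in line with Propositions 4 and 5.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in ("acd", "bce", "abcef", "be", "abce")]
ITEMS = "abcdef"
MS = 2  # assumed threshold, as in the deck's example


def tidset(itemset):
    """Indices of the transactions that contain `itemset`."""
    return frozenset(k for k, t in enumerate(TRANSACTIONS) if itemset <= t)


# Two itemsets are equivalent exactly when their tidsets are equal.
classes = defaultdict(set)
for n in range(len(ITEMS) + 1):
    for combo in combinations(ITEMS, n):
        itemset = frozenset(combo)
        tids = tidset(itemset)
        if len(tids) >= MS:
            classes[tids].add(itemset)


def minimal(space):
    return {x for x in space if not any(y < x for y in space)}


def maximal(space):
    return {x for x in space if not any(y > x for y in space)}


# closed pattern -> its generators, one entry per equivalence class
borders = {}
for members in classes.values():
    (closed,) = maximal(members)  # Proposition 4: the right bound is a singleton
    borders[closed] = minimal(members)

for closed, gens in borders.items():
    print("".join(sorted(closed)), [("".join(sorted(g))) for g in gens])
```

The single-element unpacking raises a `ValueError` if a class ever had more than one maximal element, so the loop doubles as a check of Proposition 4 on this data.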


Plateaus = Generators + Closed Patterns

• Theorem 7: Let [Li,Ri] = Pi be a freq itemset plateau of FT. Then
– Pi = [X1]T ∪ … ∪ [Xk]T, where Ri = {X1, …, Xk}
– Ri are the closed patterns in Pi
– Li = ∪j min [Xj]T are the generators in Pi


Freq Itemset Plateau Rep

• The freq itemset plateau rep of FT is
PR = { (Li, Ri, i) | i ≥ ms }
where [Li,Ri] is plateau at support level i in FT
• Proposition 8: {(I, sup(I,T)) | I ∈ FT} = {(I, i) | (Li, Ri, i) ∈ PR, ∃X ∈ Li, ∃Y ∈ Ri, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient
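The plateau rep and two of the tasks can be sketched on the running example as follows. Names and the threshold ms = 2 are assumptions for illustration; PR is built by brute force here, whereas the deck mines it with GC-growth.

```python
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in ("acd", "bce", "abcef", "be", "abce")]
ITEMS = "abcdef"
MS = 2  # assumed threshold, as in the deck's example


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)


FT = {}
for n in range(len(ITEMS) + 1):
    for combo in combinations(ITEMS, n):
        if sup(frozenset(combo)) >= MS:
            FT[frozenset(combo)] = sup(frozenset(combo))


def plateau_rep(ft):
    """PR as a list of (L_i, R_i, i), sorted by support level i."""
    levels = {}
    for itemset, s in ft.items():
        levels.setdefault(s, set()).add(itemset)
    rep = []
    for s, members in sorted(levels.items()):
        left = {x for x in members if not any(y < x for y in members)}
        right = {x for x in members if not any(y > x for y in members)}
        rep.append((left, right, s))
    return rep


PR = plateau_rep(FT)


def lookup(itemset, rep):
    """Task 3: support of `itemset` from PR alone, or None if not frequent."""
    for left, right, s in rep:
        if any(x <= itemset for x in left) and any(itemset <= y for y in right):
            return s
    return None


def in_range(rep, lo, hi):
    """Task 4: borders of the plateaus whose support lies in [lo, hi]."""
    return [(l, r, s) for l, r, s in rep if lo <= s <= hi]


print(lookup(frozenset("bc"), PR))   # 3
print(lookup(frozenset("ad"), PR))   # None
```

Task 4 reduces to picking whole plateaus by their level, which is the flexibility the other two reps lack.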


Remarks

• PR is a good concise rep for freq itemsets
• PR is more flexible compared to other reps
• PR unifies diff notions used in data mining

• Nice ... But can we mine PR fast?


Mining PR Fast

• To mine PR fast, mine its borders fast
• To mine its borders fast, mine equiv classes in the plateau fast
• To mine equiv classes fast, mine generators & closed patterns of equivalence classes fast


From SE-Tree To Trie To FP-Tree

• Example transactions T: T1 = {a,c,d}, T2 = {b,c,d}, T3 = {a,b,c,d}, T4 = {a,d}
• SE-tree of possible itemsets over {a, b, c, d}: { } → a, b, c, d; a → ab, ac, ad; ab → abc, abd; abc → abcd; ac → acd; b → bc, bd; bc → bcd; c → cd
• Trie of transactions [figure not preserved]
• FP-tree with head table a, b, c, d [figure not preserved]
• <1: right-to-left, top-to-bottom traversal of SE-tree
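The SE-tree (set-enumeration tree) on this slide can be generated with a few lines of code. The sketch below uses a made-up function name and a plain left-to-right depth-first order for simplicity, not the right-to-left, top-to-bottom order that the slide attributes to the mining traversal.

```python
def se_tree(items):
    """Yield (node, children) pairs of a set-enumeration tree, depth first.

    A node is a tuple of items; it may only be extended with items that
    come after its last item in the given order.
    """
    items = list(items)

    def visit(prefix, start):
        children = [prefix + (items[k],) for k in range(start, len(items))]
        yield prefix, children
        for offset, child in enumerate(children):
            yield from visit(child, start + offset + 1)

    yield from visit((), 0)


tree = {"".join(node): ["".join(c) for c in children]
        for node, children in se_tree("abcd")}
print(tree["a"])    # ['ab', 'ac', 'ad']
print(tree["bc"])   # ['bcd']
print(len(tree))    # 16 nodes, one per subset of {a, b, c, d}
```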


GC-growth: Fast Simultaneous Mining of Generators & Closed Patterns


Step 1: FP-tree construction


Step 2: Right-to-left, top-to-bottom traversal


Step 5: Confirm Xi is generator


Proposition 9: Generators enjoy the a priori property. That is, every subset of a generator is also a generator


Step 7: Find closed pattern of Xi


Proposition 10: Let X be a generator. Then the closed pattern of X is ∩{X'' | X' ∈ H[last(X)], X ⊆ X', X' prefix of X'', T[X''] = true}.


Correctness of GC-growth

• Theorem 11: GC-growth is sound and complete for mining generators and closed patterns


Performance of GC-growth

• GC-growth mines both generators and closed patterns
• But it is comparable in speed to the fastest algorithms that mine only closed patterns


Emerging Patterns


Differentiation and Contrast

[Figure: EPs have support x% among edible mushrooms vs 0% among poisonous mushrooms]

Example: {odor=none, gill_size=broad, ring_number=1} 64% (edible) vs 0% (poisonous)


Emerging Patterns

• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class P satisfy
– but none or few of the other class N satisfy
⇒ I is emerging pattern if sup(I,P) / sup(I,N) > k, for some fixed threshold k

NB: For this talk, we restrict ourselves to "jumping" emerging patterns
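The ratio test can be sketched as below. Several things here are assumptions for illustration: supports are taken as fractions of each class (matching the percentages in the mushroom example), a zero support in N is mapped to an infinite ratio to capture the "jumping" case, and the tiny data set is made up rather than taken from the mushroom data.

```python
import math


def rel_sup(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= t) / len(transactions)


def growth_rate(itemset, pos, neg):
    """sup(I, P) / sup(I, N), with a zero denominator handled explicitly."""
    p, n = rel_sup(itemset, pos), rel_sup(itemset, neg)
    if n == 0:
        return math.inf if p > 0 else 0.0
    return p / n


def is_emerging(itemset, pos, neg, k):
    return growth_rate(itemset, pos, neg) > k


def is_jumping(itemset, pos, neg):
    """Jumping EP: present in P, completely absent from N."""
    return growth_rate(itemset, pos, neg) == math.inf


# Hypothetical toy classes, not the real mushroom data.
POS = [frozenset({"odor=none", "gill=broad"}),
       frozenset({"odor=none", "gill=narrow"}),
       frozenset({"odor=foul", "gill=broad"})]
NEG = [frozenset({"odor=foul", "gill=broad"}),
       frozenset({"odor=foul", "gill=narrow"})]

print(is_jumping({"odor=none"}, POS, NEG))        # True
print(is_emerging({"gill=broad"}, POS, NEG, 2))   # False
```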


Convexity of Emerging Patterns

• Theorem 12: Let E be an EP space and Pi = { I ∈ E | sup(I) = i }. Then E = ∪i Pi, E is convex, and each Pi is convex. That is, E can be decomposed into convex plateaus


EP Plateau Rep

• A concise rep for E = ∪i Pi is EP plateau rep:
EP_PR = { (Li, Ri, i) | [Li, Ri] = Pi }
• Proposition 13: {(I, sup(I)) | I ∈ E} = {(I, i) | (Li, Ri, i) ∈ EP_PR, ∃X ∈ Li, ∃Y ∈ Ri, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient


Efficient Mining of EP_PR

• Modify GC-growth so that for each equiv class C, it outputs its support in +ve transactions Spos[C] & in -ve transactions Sneg[C]
• Then [R[C], C] are emerging patterns if Spos[C] / Sneg[C] > k

NB. Assume the threshold for EP is k


Odds Ratio Patterns


[Figure: EPs have support x% among edible mushrooms vs 0% among poisonous mushrooms]

Example: {odor=none, gill_size=broad, ring_number=1} 64% (edible) vs 0% (poisonous)

Is an emerging pattern that is absent in most of the positive transactions a "real" pattern? What if this is 4%? 0.4%? 0.04%?


Odds Ratio

• Odds ratio for a (compound) factor P in a case-control study D is
OR(P,D) = (P_{D,ed} / P_{D,-d}) / (P_{D,e-} / P_{D,--})
⇒ P is an odds ratio pattern if OR(P,D) > k, for some threshold k
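A small sketch of the odds ratio computation follows. It rests on an assumed reading of the subscripts, which the transcript does not spell out: `e` means exposed to P (the record contains the pattern), `d` means diseased, and `-` means the respective property is absent. The records and all names are hypothetical.

```python
import math


def contingency(pattern, records):
    """Counts (ed, eh, ud, uh) over (itemset, diseased) records.

    ed: exposed & diseased      eh: exposed & healthy
    ud: unexposed & diseased    uh: unexposed & healthy
    """
    p = frozenset(pattern)
    ed = eh = ud = uh = 0
    for items, diseased in records:
        exposed = p <= items
        if exposed and diseased:
            ed += 1
        elif exposed:
            eh += 1
        elif diseased:
            ud += 1
        else:
            uh += 1
    return ed, eh, ud, uh


def odds_ratio(ed, eh, ud, uh):
    """(ed / ud) / (eh / uh), computed as the cross product ed*uh / (eh*ud)."""
    if eh == 0 or ud == 0:
        return math.inf if ed * uh > 0 else math.nan
    return (ed * uh) / (eh * ud)


# Hypothetical case-control data: 6 exposed cases, 2 exposed controls,
# 3 unexposed cases, 9 unexposed controls.
RECORDS = ([(frozenset({"x"}), True)] * 6 + [(frozenset({"x"}), False)] * 2
           + [(frozenset(), True)] * 3 + [(frozenset(), False)] * 9)

counts = contingency({"x"}, RECORDS)
print(counts, odds_ratio(*counts))   # (6, 2, 3, 9) 9.0
```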


Nonconvexity of Odds Ratio Pattern Space

• Proposition 14: Let S^k_OR(ms,D) = { P ∈ F(ms,D) | OR(P,D) ≥ k }. Then S^k_OR(ms,D) is not convex


Convexity of Odds Ratio Pattern Space Plateaus

• Theorem 15: Let S^{n,k}_OR(ms,D) = { P ∈ F(ms,D) | P_{D,ed} = n, OR(P,D) ≥ k }. Then S^{n,k}_OR(ms,D) is convex
⇒ The space of odds ratio patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of odds ratio patterns can be concisely represented by plateau borders


Efficient Mining of Odds Ratio Pattern Space Plateaus

• How to find these fast is key!
• GC-growth can find these fast :-)


Performance

• FPClose* and CLOSET+
– closed patterns only
• Our method computes
– closed patterns
– generators, and
– odds ratio patterns (OR > 2.5)
⇒ Patterns that are much more statistically sophisticated than frequent patterns can now be mined efficiently


Relative Risk Patterns


Relative Risk

• Relative risk for a (compound) factor P in a prospective study D is

P is a relative risk pattern if RR(P,D) > k, for some threshold k
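The formula on this slide did not survive in the transcript. Assuming the same count notation as on the odds ratio slide (subscript e for exposed to P, d for diseased, - for absent, all of which is an interpretation rather than something the transcript states), the standard epidemiological definition of relative risk would read:

```latex
\[
RR(P,D) \;=\;
\frac{P_{D,ed} \,/\, \left(P_{D,ed} + P_{D,e-}\right)}
     {P_{D,-d} \,/\, \left(P_{D,-d} + P_{D,--}\right)}
\]
```

That is, the disease rate among subjects exposed to P divided by the disease rate among unexposed subjects.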


Nonconvexity of Relative Risk Pattern Space

• Proposition 16: Let S^k_RR(ms,D) = { P ∈ F(ms,D) | RR(P,D) ≥ k }. Then S^k_RR(ms,D) is not convex


Convexity of Relative Risk Pattern Space Plateaus

• Theorem 17: Let S^{n,k}_RR(ms,D) = { P ∈ F(ms,D) | P_{D,ed} = n, RR(P,D) ≥ k }. Then S^{n,k}_RR(ms,D) is convex
⇒ The space of relative risk patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of relative risk patterns can be concisely represented by plateau borders


Efficient Mining of Relative Risk Pattern Space Plateaus

• How to find these fast is key!
• GC-growth can find these fast :-)

[Algorithm figure not preserved; only the line "x := RR(R,D);" survives]


Concluding Remarks

• Equiv classes & plateaus are fundamental in
– Frequent itemsets
– Emerging patterns
– Odds ratio patterns
– Relative risk patterns, ...
• Equiv classes & plateaus of these complex patterns are convex spaces
⇒ Complex pattern spaces are concisely representable by borders
⇒ Complex pattern spaces can be efficiently and completely mined


Future Works


Improve Implementations

• Modular pattern mining by construction of a fast equiv class generator and multiple statistical condition filters

[Figure: "Generate borders of equiv classes & support levels" feeding the filters "Test for odds ratio", "Test for relative risk", "Test for χ²"]

• Impact of item ordering
• Impact of pushing complex statistical filters deeper into equivalence class generators


Apply to Classification

• Develop classifiers based on the mined patterns
– Simple ensemble: f(X) = argmax_{c ∈ C} Σ_{r ∈ Rc, r > 50% accuracy} r(X)
– PCL
• Impact on accuracy of using generators vs closed patterns


Enrich Data Mining Foundations

• Increase statistical sophistication of patterns mined
• Increase dimensions and size of data handled


Acknowledgements

• Haiquan Li
• Jinyan Li
• Mengling Feng
• Yap Peng Tan