Association Rule Mining

Post on 22-Dec-2015

Page 1: Association Rule Mining. Mining Association Rules in Large Databases  Association rule mining  Algorithms Apriori and FP-Growth  Max and closed patterns

Association Rule Mining

Mining Association Rules in Large Databases

Association rule mining

Algorithms Apriori and FP-Growth

Max and closed patterns

Mining various kinds of association/correlation rules

Max-patterns & Closed-patterns

If there are frequent patterns with many items, enumerating all of them is costly. We may be interested in finding the 'boundary' frequent patterns. Two types…

Max-patterns

A frequent pattern {a1, …, a100} has $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!

Max-pattern: a frequent pattern without a proper frequent super-pattern.
  BCDE and ACD are max-patterns.
  BCD is not a max-pattern (its super-pattern BCDE is frequent).

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

Min_sup = 2
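The definition above can be checked by brute force on the three-transaction database. This is only an illustrative sketch, not an algorithm from the slides: it enumerates every itemset, which is exactly the cost that max-pattern miners try to avoid. The names `transactions`, `support` and `max_patterns` are chosen here for illustration.

```python
from itertools import combinations

# Toy database from the slide: transaction id -> set of items.
transactions = {10: set("ABCDE"), 20: set("BCDE"), 30: set("ACDF")}
MIN_SUP = 2  # absolute support count, as on the slide


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


# Enumerate every non-empty itemset and keep the frequent ones (brute force).
all_items = sorted(set().union(*transactions.values()))
frequent = [
    frozenset(combo)
    for size in range(1, len(all_items) + 1)
    for combo in combinations(all_items, size)
    if support(frozenset(combo)) >= MIN_SUP
]

# A max-pattern is a frequent itemset with no proper frequent superset.
max_patterns = [p for p in frequent if not any(p < q for q in frequent)]

# Number of non-empty sub-patterns of a frequent 100-item pattern.
sub_pattern_count = 2**100 - 1

print(sorted("".join(sorted(p)) for p in max_patterns))
print(f"{sub_pattern_count:.3e}")
```

On this database the only max-patterns found are ACD and BCDE; BCD is frequent but is covered by BCDE.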

MaxMiner: Mining Max-patterns

Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable. Each node is written as head (tail); the root has an empty head.

Level 0:  (ABCD)
Level 1:  A (BCD)   B (CD)   C (D)   D ()
Level 2:  AB (CD)   AC (D)   AD ()   BC (D)   BD ()   CD ()
Level 3:  ABC (D)   ABD ()   ACD ()   BCD ()
Level 4:  ABCD ()

Local Pruning Techniques (e.g. at node A)

Check the frequency of ABCD and of AB, AC, AD.
  If ABCD is frequent, prune the whole sub-tree.
  If AC is NOT frequent, remove C from the parentheses before expanding.

Algorithm MaxMiner

Initially, generate one node N (the root, drawn as (ABCD)), where h(N) = ∅ and t(N) = {A, B, C, D}.

Consider expanding N:
  If h(N) ∪ t(N) is frequent, do not expand N.
  If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
  Apply global pruning techniques…

Global Pruning Technique (across sub-trees)

When a max pattern is identified (e.g. ABCD), prune all nodes N (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. of ABCD).

Example

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

Min_sup = 2

Root node: (ABCDEF)

Items     Frequency
ABCDEF    0
A         2
B         2
C         3
D         3
E         2
F         1

F is infrequent, so it is removed from the tail before the root is expanded:

A (BCDE)   B (CDE)   C (DE)   D (E)   E ()

Max patterns: (none yet)

Example

Node A

Items    Frequency
ABCDE    1
AB       1
AC       2
AD       2
AE       1

Min_sup = 2

ABCDE is infrequent, so node A is expanded; AB and AE are infrequent, so B and E are removed from its tail first.

(ABCDEF)
A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
AC (D)   AD ()

Max patterns: (none yet)

Example

Node B

Items    Frequency
BCDE     2
BC       (blank on the slide)
BD       (blank on the slide)
BE       (blank on the slide)

Min_sup = 2

BCDE is frequent, so node B is not expanded.

(ABCDEF)
A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
AC (D)   AD ()

Max patterns: BCDE

Example

Node AC

Items    Frequency
ACD      2

Min_sup = 2

(ABCDEF)
A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
AC (D)   AD ()
ACD ()

Max patterns: BCDE, ACD
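The walk-through above can be reproduced with a short sketch of the three pruning rules. It is a simplified reading of the slides, not the published MaxMiner: items stay in alphabetical order (no reordering by support), supports are recounted by scanning the transactions, and the function name `max_miner` and the (head, tail) tuple layout are assumptions made here. One case the slides leave implicit is handled explicitly: if every tail item is pruned, the head itself is recorded as a candidate.

```python
from collections import deque


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def max_miner(transactions, min_sup):
    """Simplified breadth-first sketch; returns the list of max-patterns."""
    items = sorted(set().union(*transactions))
    max_patterns = []

    def record(pattern):
        # Keep `pattern` unless it is empty or already covered,
        # and drop any recorded pattern that it covers.
        if pattern and not any(pattern <= m for m in max_patterns):
            max_patterns[:] = [m for m in max_patterns if not m < pattern]
            max_patterns.append(pattern)

    queue = deque([(frozenset(), items)])  # each node is (head, tail)
    while queue:
        head, tail = queue.popleft()
        full = head | frozenset(tail)  # h(N) ∪ t(N)
        # Global pruning: h(N) ∪ t(N) lies inside a known max-pattern.
        if any(full <= m for m in max_patterns):
            continue
        # Local pruning 1: h(N) ∪ t(N) is frequent, so do not expand N.
        if support(full, transactions) >= min_sup:
            record(full)
            continue
        # Local pruning 2: drop tail items i where h(N) ∪ {i} is infrequent.
        tail = [i for i in tail if support(head | {i}, transactions) >= min_sup]
        if not tail:
            record(head)  # the head is frequent but cannot be extended
            continue
        # Expand N: one child per remaining tail item, keeping the item order.
        for pos, item in enumerate(tail):
            queue.append((head | {item}, tail[pos + 1:]))
    return max_patterns


db = [frozenset("ABCDE"), frozenset("BCDE"), frozenset("ACDF")]
print(sorted("".join(sorted(p)) for p in max_miner(db, 2)))
```

On the example database this visits the root, then A, B, C, D, E, then AC and AD. BCDE is recorded at node B, which prunes C, D and E; ACD is recorded at node AC, which prunes AD.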

Frequent Closed Patterns

For a frequent itemset X, if there exists no item y outside X such that every transaction containing X also contains y, then X is a frequent closed pattern.
  "ab" is a frequent closed pattern.

A concise representation of frequent patterns: reduces the number of patterns and rules.

N. Pasquier et al. In ICDT'99.

TID   Items
10    a, b, c
20    a, b, c
30    a, b, d
40    a, b, d
50    e, f

Min_sup = 2

Max Pattern vs. Frequent Closed Pattern

A max pattern is always a closed pattern.
  If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y outside X such that every transaction containing X also contains y (otherwise X ∪ {y} would have the same support as X and be frequent).

A closed pattern is not necessarily a max pattern.
  In the five-transaction database of the previous slide (Min_sup = 2), "ab" is a closed pattern, but not max.
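A small brute-force sketch makes the relationship concrete on the five-transaction database with Min_sup = 2. It uses the equivalent formulation that a frequent itemset is closed when no proper superset has the same support; the variable names are illustrative, and enumerating all itemsets is only reasonable for a toy example.

```python
from itertools import combinations

# Five-transaction database from the closed-pattern slides.
transactions = [set("abc"), set("abc"), set("abd"), set("abd"), set("ef")]
MIN_SUP = 2


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Map every frequent itemset to its support (brute-force enumeration).
all_items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(all_items) + 1):
    for combo in combinations(all_items, size):
        count = support(frozenset(combo))
        if count >= MIN_SUP:
            frequent[frozenset(combo)] = count

# Closed: no proper superset has the same support.
closed = {
    p for p, s in frequent.items()
    if not any(p < q and s == frequent[q] for q in frequent)
}
# Max: no proper superset is frequent at all.
maximal = {p for p in frequent if not any(p < q for q in frequent)}

print(sorted("".join(sorted(p)) for p in closed))
print(sorted("".join(sorted(p)) for p in maximal))
```

Here the closed patterns are ab, abc and abd, while the max patterns are only abc and abd, so every max pattern is closed but ab is closed without being max.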

Mining Frequent Closed Patterns: CLOSET

Flist: list of all frequent items in support-ascending order
  Flist: d-a-f-e-c

Divide the search space:
  Patterns having d
  Patterns having a but not d, etc.

Find frequent closed patterns recursively:
  Among the transactions having d, cfa is frequent closed, so cfad is a frequent closed pattern.

J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

Min_sup = 2
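The first two steps on this slide can be sketched as follows. This is not the CLOSET implementation, only the F-list construction and the first projection (transactions having d); all variable names are assumptions. Items c, e and f tie at support 4, so the order among them is arbitrary: the sketch breaks ties alphabetically and therefore prints d-a-c-e-f where the slide shows d-a-f-e-c.

```python
from collections import Counter

# Database from the slide: transaction id -> items.
db = {10: "acdef", 20: "abe", 30: "cef", 40: "acdf", 50: "cef"}
MIN_SUP = 2

# Step 1: F-list = frequent items in support-ascending order.
counts = Counter(item for items in db.values() for item in items)
f_list = sorted(
    (item for item in counts if counts[item] >= MIN_SUP),
    key=lambda item: (counts[item], item),  # ties broken alphabetically
)

# Step 2: project the database on d (transactions having d, with d removed).
d_conditional = [set(items) - {"d"} for items in db.values() if "d" in items]

# Items present in every transaction of the projection always co-occur with d,
# so together with d they form a frequent closed pattern.
common = set.intersection(*d_conditional)
closed_with_d = frozenset(common | {"d"})
support_with_d = len(d_conditional)

print("-".join(f_list))
print("".join(sorted(closed_with_d)), support_with_d)
```

The projection on d holds two transactions that share a, c and f, which gives the closed pattern cfad with support 2, as stated on the slide.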

Multiple-Level Association Rules

Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
A transactional database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.

[Figure: concept hierarchy. Food branches into milk and bread; milk into skim and 2% fat; bread into white and wheat; brand-level leaves include Garelick and Wonder.]

Mining Multi-Level Associations

A top-down, progressive deepening approach:
  First find high-level strong rules:
    milk => bread [20%, 60%].
  Then find their lower-level "weaker" rules:
    2% fat milk => wheat bread [6%, 50%].

Variations at mining multiple-level association rules:
  Level-crossed association rules:
    skim milk => Wonder wheat bread
  Association rules with multiple, alternative hierarchies:
    full fat milk => Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

Uniform Support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  - Lower-level items do not occur as frequently. If the support threshold is
      too high => miss low-level associations
      too low => generate too many high-level associations

Multi-level Association: Uniform Support vs. Reduced Support

Reduced Support: reduced minimum support at lower levels

There are 4 search strategies:
  Level-by-level independent
    Independent search at all levels (no misses)
  Level-cross filtering by k-itemset
    Prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
  Level-cross filtering by single item
    Prune an item if its parent node is infrequent
  Controlled level-cross filtering by single item
    Consider 'subfrequent' items that pass a passage threshold

Uniform Support

Multi-level mining with uniform support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 5%):  full fat Milk [support = 6%]   Skim Milk [support = 4%]  X (below min_sup, pruned)

Reduced Support

Multi-level mining with reduced support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 3%):  full fat Milk [support = 6%]   Skim Milk [support = 4%]
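The two figures can be compared with a minimal sketch that applies one minimum support per level and, in the spirit of level-cross filtering by single item, examines an item only if its parent was kept. The supports are the ones on the slides; the dictionaries `support`, `parent` and `level` and the function `frequent_items` are illustrative names, not part of any library.

```python
# Supports and hierarchy copied from the two figures.
support = {"Milk": 0.10, "full fat Milk": 0.06, "Skim Milk": 0.04}
parent = {"Milk": None, "full fat Milk": "Milk", "Skim Milk": "Milk"}
level = {"Milk": 1, "full fat Milk": 2, "Skim Milk": 2}


def frequent_items(min_sup_by_level):
    """Top-down pass: keep an item if its parent was kept (or it has none)
    and its support reaches the minimum support of its own level."""
    kept = set()
    for item in sorted(support, key=level.get):  # parents before children
        parent_ok = parent[item] is None or parent[item] in kept
        if parent_ok and support[item] >= min_sup_by_level[level[item]]:
            kept.add(item)
    return kept


uniform = frequent_items({1: 0.05, 2: 0.05})
reduced = frequent_items({1: 0.05, 2: 0.03})

print(sorted(uniform))
print(sorted(reduced))
```

With a uniform 5% threshold Skim Milk (4%) is dropped; lowering the level-2 threshold to 3% keeps it.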

Interestingness Measurements

Objective measures
  Two popular measurements: support and confidence

Subjective measures
  A rule (pattern) is interesting if it is
    unexpected (surprising to the user); and/or
    actionable (the user can do something with it)

Criticism to Support and Confidence

Example 1: Among 5000 students
  3000 play basketball
  3750 eat cereal
  2000 both play basketball and eat cereal

play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

              basketball   not basketball   sum(row)
cereal        2000         1750             3750
not cereal    1000         250              1250
sum(col.)     3000         2000             5000
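The bracketed figures can be recomputed from the contingency table. The helper `rule_stats` is a name introduced here for illustration; it returns (support, confidence) for a rule whose antecedent and joint counts are given.

```python
# Counts from the contingency table.
total = 5000
basketball = 3000
cereal = 3750
both = 2000  # play basketball and eat cereal


def rule_stats(n_antecedent, n_both, n_total):
    """Return (support, confidence) of a rule antecedent => consequent."""
    return n_both / n_total, n_both / n_antecedent


# play basketball => eat cereal
sup_cereal, conf_cereal = rule_stats(basketball, both, total)
# play basketball => not eat cereal
sup_not, conf_not = rule_stats(basketball, basketball - both, total)

# Baselines: how common each consequent is among all students.
base_cereal = cereal / total
base_not = 1 - base_cereal

print(f"cereal:     support={sup_cereal:.1%} confidence={conf_cereal:.1%} baseline={base_cereal:.1%}")
print(f"not cereal: support={sup_not:.1%} confidence={conf_not:.1%} baseline={base_not:.1%}")
```

The confidence of the first rule (about 66.7%) is below the 75% baseline, while the confidence of the second (about 33.3%) is above its 25% baseline, which is the point the slide makes.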

Criticism to Support and Confidence (Cont.)

Example 2:
  X and Y: positively correlated
  X and Z: negatively related
  Support and confidence of X=>Z dominates

We need a measure of dependent or correlated events:

$corr_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$

where $P(A \wedge B)$ is the probability that a transaction contains both A and B.

P(B|A)/P(B) is also called the lift of rule A => B.

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X=>Y    25%       50%
X=>Z    37.50%    75%
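As a worked illustration, the measure can be applied to the basketball and cereal counts of Example 1 (5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both), writing $P(A \wedge B)$ for the probability that both A and B hold. The numbers come from that table; only the arithmetic is added here.

```latex
\[
corr_{\text{basketball},\,\text{cereal}}
  = \frac{P(\text{basketball} \wedge \text{cereal})}{P(\text{basketball})\,P(\text{cereal})}
  = \frac{2000/5000}{(3000/5000)(3750/5000)}
  = \frac{0.40}{0.60 \times 0.75} \approx 0.89 < 1
\]
\[
corr_{\text{basketball},\,\text{not cereal}}
  = \frac{1000/5000}{(3000/5000)(1250/5000)}
  = \frac{0.20}{0.60 \times 0.25} \approx 1.33 > 1
\]
```

The value below 1 flags the rule play basketball => eat cereal as a negative association despite its 66.7% confidence, and the value above 1 supports the rule with "not eat cereal".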

Other Interestingness Measures: Interest

Interest (correlation, lift):

$\frac{P(A \wedge B)}{P(A)\,P(B)}$

  takes both P(A) and P(B) into consideration
  $P(A \wedge B) = P(B) \cdot P(A)$ if A and B are independent events, so the value is 1
  A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.9
Y,Z       12.50%    0.57
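The interest column can be recomputed directly from the three binary rows. The helpers `prob` and `interest` are illustrative names; each position in the lists is one transaction.

```python
# Binary occurrence rows from the slide (one position per transaction).
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]


def prob(*columns):
    """Fraction of transactions in which all given items occur together."""
    rows = list(zip(*columns))
    return sum(all(row) for row in rows) / len(rows)


def interest(a, b):
    """Interest (lift): P(A and B) / (P(A) * P(B))."""
    return prob(a, b) / (prob(a) * prob(b))


for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(f"{name}: support={prob(a, b):.2%} interest={interest(a, b):.2f}")
```

The X,Z value works out to about 0.857, which the slide rounds to 0.9; it is below 1 even though X=>Z has the higher support and confidence.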