Association Rule Mining
TRANSCRIPT
8/12/2019 Association Rule Mining
Association Rule Mining
Mining Association Rules in Large Databases
Association rule mining
Algorithms: Apriori and FP-Growth
Max and closed patterns
Mining various kinds of association/correlation rules
Max-patterns & Closed-patterns
If there are frequent patterns with many items, enumerating all of them is costly.
We may be interested in finding the boundary frequent patterns.
Two types:
Max-patterns
A frequent pattern {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27*10^30 frequent sub-patterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern
BCDE, ACD are max-patterns
BCD is not a max-pattern

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2
Maximal Frequent Itemset
[Figure: itemset lattice showing the border between the frequent and infrequent itemsets; the maximal itemsets lie just inside the border]
An itemset is maximal frequent if none of its immediate supersets is frequent
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}
Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
Itemset Support
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
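The definitions above can be checked by brute force on this toy database. A minimal Python sketch (min_sup = 2 is an assumption here, matching the smallest count in the support tables):

```python
from itertools import combinations

# Transaction database from the slide (TID -> items)
db = {1: {'A', 'B'}, 2: {'B', 'C', 'D'}, 3: {'A', 'B', 'C', 'D'},
      4: {'A', 'B', 'D'}, 5: {'A', 'B', 'C', 'D'}}
min_sup = 2  # assumed: the slide's support tables list counts down to 2

items = sorted(set().union(*db.values()))

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db.values() if itemset <= t)

# All frequent itemsets (brute force is fine for 4 items)
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_sup}

# Maximal frequent: no frequent proper superset
maximal = {s for s in frequent if not any(s < t for t in frequent)}

# Closed: no proper superset with the same support
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
```

On this database the computed supports match the tables (e.g. {A,B} has support 4); {A,B,C,D} is the only maximal frequent itemset, while the closed itemsets include {A,B} and {A,B,D} but not {A,C} or {A,D}.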
Maximal vs Closed Itemsets

MaxMiner: Mining Max-patterns
Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()
Local Pruning Techniques (e.g. at node A)
Check the frequency of ABCD and AB, AC, AD.
If ABCD is frequent, prune the whole sub-tree.
If AC is NOT frequent, remove C from the parenthesis before expanding.

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()

Algorithm MaxMiner
Initially, generate one node N = ∅, where h(N) = ∅ and t(N) = {A,B,C,D}.
Consider expanding N:
If h(N) ∪ t(N) is frequent, do not expand N.
If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
Apply global pruning techniques.
Global Pruning Technique (across sub-trees)
When a max pattern is identified (e.g. ABCD), prune all nodes (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. of ABCD).

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()
Example

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2, root node (ABCDEF)

Items  | Frequency
A      | 2
B      | 2
C      | 3
D      | 3
E      | 2
F      | 1
ABCDEF | 0

F is infrequent, so it is removed from the tail; ABCDEF is infrequent, so the root is expanded:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
Max patterns: (none yet)
Example (Node A)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
AB    | 1
AC    | 2
AD    | 2
AE    | 1
ABCDE | 1

ABCDE = h(A) ∪ t(A) is infrequent, so node A is expanded; B and E are removed from its tail:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: (none yet)

Example (Node B)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
BCDE  | 2

BCDE = h(B) ∪ t(B) is frequent, so node B is not expanded (BC, BD, BE need not be counted):
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: BCDE
Example (Node AC)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
ACD   | 2

ACD = h(AC) ∪ t(AC) is frequent, so node AC is not expanded:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: BCDE, ACD
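The walk-through above can be reproduced with a compact, simplified MaxMiner sketch: `expand` applies the two local pruning rules and `record` the global one. Two assumptions differ from the real algorithm: items are kept in the slide's fixed A..F order (MaxMiner orders them by frequency), and the tree is expanded depth-first rather than level by level; the resulting max patterns are the same.

```python
# Transaction database from the slide, min_sup = 2
db = [{'A', 'B', 'C', 'D', 'E'}, {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}]
min_sup = 2

def sup(s):
    return sum(1 for t in db if s <= t)

max_patterns = []  # discovered max patterns

def record(pattern):
    # Global pruning: ignore patterns subsumed by a known max pattern,
    # and drop previously recorded patterns subsumed by this one.
    if not any(pattern <= m for m in max_patterns):
        max_patterns[:] = [m for m in max_patterns if not m <= pattern]
        max_patterns.append(pattern)

def expand(head, tail):
    # Local pruning 1: if h(N) ∪ t(N) is frequent, do not expand N.
    if sup(head | set(tail)) >= min_sup:
        record(frozenset(head | set(tail)))
        return
    # Local pruning 2: if h(N) ∪ {i} is infrequent, remove i from t(N).
    tail = [i for i in tail if sup(head | {i}) >= min_sup]
    for k, i in enumerate(tail):
        expand(head | {i}, tail[k + 1:])

# Root node: h(N) = {}, t(N) = frequent single items (F is pruned here)
expand(set(), [i for i in 'ABCDEF' if sup({i}) >= min_sup])
```

Running this yields exactly the two max patterns found in the trace: BCDE and ACD.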
Frequent Closed Patterns
For a frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern
ab is a frequent closed pattern
Concise representation of frequent patterns
Reduces the number of patterns and rules (N. Pasquier et al., ICDT'99)

TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f

Min_sup = 2
Max Pattern vs. Frequent Closed Pattern
max pattern => closed pattern
If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y s.t. every transaction containing X also contains y.
closed pattern =/=> max pattern
ab is a closed pattern, but not max

TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f

Min_sup = 2
Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support-ascending order
Flist: d-a-f-e-c
Divide the search space:
Patterns having d
Patterns having a but not d, etc.
Find frequent closed patterns recursively
Among the transactions having d, cfa is frequent closed, so
cfad is a frequent closed pattern
(J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00)

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f

Min_sup = 2
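The Flist and the "patterns having d" step can be checked directly. A sketch (note that items c, e, f all have support 4, so the order among them depends on tie-breaking; only d and a have a forced position in the Flist):

```python
from collections import Counter

# Transaction database from the CLOSET slide, min_sup = 2
db = [{'a', 'c', 'd', 'e', 'f'}, {'a', 'b', 'e'}, {'c', 'e', 'f'},
      {'a', 'c', 'd', 'f'}, {'c', 'e', 'f'}]
min_sup = 2

counts = Counter(i for t in db for i in t)
# Flist: frequent items in support-ascending order (b is infrequent)
f_list = sorted((i for i in counts if counts[i] >= min_sup),
                key=lambda i: counts[i])

# "Patterns having d": project onto the transactions that contain d.
# The items common to ALL of them form the closure of {d}: the closed
# pattern cfad, with support = number of projected transactions (2).
d_transactions = [t for t in db if 'd' in t]
closure_of_d = frozenset(set.intersection(*d_transactions))
```

Here `closure_of_d` comes out as {a, c, d, f}, i.e. the frequent closed pattern cfad from the slide.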
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
+ One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
Lower-level items do not occur as frequently. If the support threshold is
too high => miss low-level associations
too low => generate too many high-level associations
Multi-level Association: Uniform Support vs. Reduced Support
Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:
Level-by-level independent: independent search at all levels (no misses)
Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
Level-cross filtering by single item: prune an item if its parent node is infrequent
Controlled level-cross filtering by single item: consider subfrequent items that pass a passage threshold
Uniform Support
Multi-level mining with uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): full fat Milk [support = 6%], Skim Milk [support = 4%, pruned: X]

Reduced Support
Multi-level mining with reduced support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): full fat Milk [support = 6%], Skim Milk [support = 4%]
Pattern Evaluation
Association rule algorithms tend to produce too many rules
many of them are uninteresting or redundant
Redundant if {A,B,C} -> {D} and {A,B} -> {D} have the same support & confidence
Interestingness measures can be used to prune/rank the derived patterns
In the original formulation of association rules, support & confidence are the only measures used
Computing Interestingness Measure
Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X -> Y:

      | Y   | ¬Y  |
X     | f11 | f10 | f1+
¬X    | f01 | f00 | f0+
      | f+1 | f+0 | |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
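A small helper shows how the f-counts are obtained from a transaction list (the four-transaction database below is a made-up illustration, not from the slides):

```python
def contingency(db, X, Y):
    """2x2 contingency counts (f11, f10, f01, f00) for rule X -> Y,
    where db is a list of transactions (sets) and X, Y are itemsets."""
    f11 = sum(1 for t in db if X <= t and Y <= t)       # X and Y
    f10 = sum(1 for t in db if X <= t and not Y <= t)   # X and not-Y
    f01 = sum(1 for t in db if not X <= t and Y <= t)   # not-X and Y
    f00 = len(db) - f11 - f10 - f01                     # neither
    return f11, f10, f01, f00

# Toy illustration: one transaction in each cell of the table
db = [{'a', 'b'}, {'a'}, {'b'}, set()]
print(contingency(db, {'a'}, {'b'}))  # (1, 1, 1, 1)
```

The four counts always sum to |T|, the total number of transactions.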
Drawback of Confidence

      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
Although confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 75/80 = 0.9375
Statistical Independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(S,B) = 420/1000 = 0.42
P(S) * P(B) = 0.6 * 0.7 = 0.42
P(S,B) = P(S) * P(B) => statistical independence
P(S,B) > P(S) * P(B) => positively correlated
P(S,B) < P(S) * P(B) => negatively correlated
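The three cases can be wrapped in a small helper (an illustrative sketch; the `eps` tolerance is an assumption to absorb floating-point noise):

```python
def correlation(p_xy, p_x, p_y, eps=1e-9):
    """Classify X and Y as independent / positively / negatively
    correlated by comparing P(X,Y) against P(X) * P(Y)."""
    d = p_xy - p_x * p_y
    if abs(d) < eps:
        return "independent"
    return "positively correlated" if d > 0 else "negatively correlated"

# Swim/bike example from the slide: 0.42 = 0.6 * 0.7
print(correlation(0.42, 0.6, 0.7))  # independent
```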
Statistical-based Measures
Measures that take into account statistical dependence:

Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) - P(X) P(Y)
φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt(P(X)[1 - P(X)] P(Y)[1 - P(Y)])
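The four formulas can be computed directly from the f-counts of a contingency table (a sketch, using the tea/coffee counts from the earlier slide as input):

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Lift, Interest, PS and phi-coefficient for rule X -> Y."""
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = (p_xy / p_x) / p_y          # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)      # algebraically equal to lift
    ps = p_xy - p_x * p_y
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

# Tea/coffee table: f11=15, f10=5, f01=75, f00=5
lift, interest, ps, phi = measures(15, 5, 75, 5)
```

For tea/coffee this gives lift = 0.75/0.9 ≈ 0.8333, PS = -0.03 and φ = -0.25, all indicating a (mild) negative association.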
Example: Lift/Interest

      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
Drawback of Lift & Interest

      | Y  | ¬Y | Total
X     | 10 | 0  | 10
¬X    | 0  | 90 | 90
Total | 10 | 90 | 100

Lift = 0.1 / ((0.1)(0.1)) = 10

      | Y  | ¬Y | Total
X     | 90 | 0  | 90
¬X    | 0  | 10 | 10
Total | 90 | 10 | 100

Lift = 0.9 / ((0.9)(0.9)) = 1.11

Statistical independence:
If P(X,Y) = P(X) P(Y) => Lift = 1
There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?
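Both tables above can be verified with the lift computation (a sketch):

```python
def lift(f11, f10, f01, f00):
    """Lift = P(X,Y) / (P(X) * P(Y)) from contingency counts."""
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

# Table 1: X and Y always co-occur, but only in 10% of transactions
l1 = lift(10, 0, 0, 90)   # 0.1 / (0.1 * 0.1) = 10
# Table 2: X and Y always co-occur, in 90% of transactions
l2 = lift(90, 0, 0, 10)   # 0.9 / (0.9 * 0.9) = 1.11...
```

The rarer (but equally perfect) association receives the much higher lift score, which is exactly the drawback the slide illustrates.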
Properties of a Good Measure
Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
M(A,B) = 0 if A and B are statistically independent
M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Comparing Different Measures
10 examples of contingency tables, and rankings of the contingency tables using various measures:

Example | f11  | f10  | f01  | f00
E1      | 8123 | 83   | 424  | 1370
E2      | 8330 | 2    | 622  | 1046
E3      | 9481 | 94   | 127  | 298
E4      | 3954 | 3080 | 5    | 2961
E5      | 2886 | 1363 | 1320 | 4431
E6      | 1500 | 2000 | 500  | 6000
E7      | 4000 | 2000 | 1000 | 3000
E8      | 4000 | 2000 | 2000 | 2000
E9      | 1720 | 7121 | 5    | 1154
E10     | 61   | 2483 | 4    | 7452
Property under Variable Permutation

      | B | ¬B
A     | p | q
¬A    | r | s

      | A | ¬A
B     | p | r
¬B    | q | s

Does M(A,B) = M(B,A)?
Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
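The permutation property is easy to test numerically: swapping A and B transposes the table, i.e. exchanges q and r (equivalently f10 and f01). Using the tea/coffee counts from the earlier slide as input (a sketch):

```python
def confidence(f11, f10, f01, f00):
    """Confidence of X -> Y."""
    return f11 / (f11 + f10)

def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

t = (15, 5, 75, 5)                  # tea/coffee contingency counts
swapped = (t[0], t[2], t[1], t[3])  # permute the two variables

sym = abs(lift(*t) - lift(*swapped)) < 1e-12   # True: lift is symmetric
asym = confidence(*t) != confidence(*swapped)  # True: confidence is not
```

Confidence changes from 15/20 = 0.75 to 15/90 ≈ 0.17 under the swap, while lift is unchanged.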