Data Mining: Association Rules Mining or Market Basket Analysis. Prithwis Mukerjee, Ph.D.


TRANSCRIPT

Page 1

Data Mining

Association Rules Mining or

Market Basket Analysis

Prithwis Mukerjee, Ph.D.

Page 2

Prithwis Mukerjee 2

Let us describe the problem ...

A retailer sells the following items: Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper

And we assume that the shopkeeper keeps track of what each customer purchases:

Trans ID   Items
10         Bread, Cheese, Newspaper
20         Bread, Cheese, Juice
30         Bread, Milk
40         Cheese, Juice, Milk, Coffee
50         Sugar, Tea, Coffee, Biscuits, Newspaper
60         Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70         Bread, Cheese
80         Bread, Cheese, Juice, Coffee
90         Bread, Milk
100        Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

He needs to know which items are generally sold together.

Page 3

Associations

Rules expressing relations between items in a "Market Basket":

{ Sugar, Tea } => { Biscuits }

Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits? If so, then:
These items should be ordered together.
But discounts should not be given on these items at the same time!

We can make a guess, but it would be better if we could structure this problem in terms of mathematics.

Page 4

Basic Concepts

Set of n items on sale: I = { i1, i2, i3, ..., in }

Transaction: a subset of I, that is T ⊆ I, the set of items purchased in an individual transaction. A transaction with m items is t = { i1, i2, i3, ..., im }, with m < n.

If we have N transactions, then t1, t2, t3, ..., tN are the unique identifiers of the transactions.

D is our total data about all N transactions: D = { t1, t2, t3, ..., tN }

Page 5

An Association Rule

Whenever X appears, Y also appears: X => Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅

X and Y may be single items or sets of items, in which case the same item does not appear in both.

X is referred to as the antecedent.
Y is referred to as the consequent.

Whether a rule like this exists is the focus of our analysis.

Page 6

Two key concepts

Support (or prevalence): How often do X and Y appear together in the basket? If this number is very low, the rule is not worth examining. Expressed as a fraction of the total number of transactions, say 10% or 0.1.

Confidence (or predictability): Of all the occurrences of X, in what fraction does Y also appear? Expressed as a fraction of all transactions containing X, say 80% or 0.8.

We are interested in rules that have a minimum value of support, say 25%, and a minimum value of confidence, say 75%.

Page 7

Mathematically speaking ...

Support(X) = (number of transactions in which X appears) / N = P(X)

Support(XY) = (number of transactions in which X and Y both appear) / N = P(X ∪ Y)

Confidence(X => Y) = Support(XY) / Support(X) = P(X ∪ Y) / P(X) = conditional probability P(Y | X)

Lift, an optional term, measures the power of the association: P(Y | X) / P(Y)
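The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the function names and the transaction representation (one Python set per basket) are my own choices.

```python
# A sketch (names are mine, not from the slides) of support, confidence
# and lift over a list of transactions, each represented as a set of items.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Support(XY) / Support(X), i.e. P(Y | X)."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """P(Y | X) / P(Y): values above 1 suggest a real association."""
    return confidence(x, y, transactions) / support(y, transactions)

# The small 4-transaction example used later in the slides:
tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]

print(support({"Cheese", "Juice"}, tx))        # 0.5
print(confidence({"Juice"}, {"Cheese"}, tx))   # 1.0
```

Confidence comes out as a ratio of two supports, exactly as in the formula above.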

Page 8

The task at hand ...

Given a large set of transactions, we seek a procedure (or algorithm) that will discover all association rules that have a minimum support of p% and a minimum confidence of q%, and do so in an efficient manner.

Algorithms:
The Naive or Brute Force method
The Improved Naive algorithm
The Apriori algorithm
Improvements to the Apriori algorithm
The FP (Frequent Pattern) algorithm

Page 9

Let us try the Naive Algorithm manually !

This is the set of transactions that we have:

Trans ID   Items
100        Bread, Cheese
200        Bread, Cheese, Juice
300        Bread, Milk
400        Cheese, Juice, Milk

We want to find association rules with minimum 50% support and minimum 75% confidence.

Page 10

Itemsets & Frequencies

Which sets are frequent? Since we are looking for a support of 50%, where support(X) = (number of times X appears) / N, we need a set to appear in at least 2 out of 4 transactions.

Item Sets                      Frequency
{Bread}                        3
{Cheese}                       3
{Juice}                        2
{Milk}                         2
{Bread, Cheese}                2
{Bread, Juice}                 1
{Bread, Milk}                  1
{Cheese, Juice}                2
{Cheese, Milk}                 1
{Juice, Milk}                  1
{Bread, Cheese, Juice}         1
{Bread, Cheese, Milk}          0
{Bread, Juice, Milk}           0
{Cheese, Juice, Milk}          1
{Bread, Cheese, Juice, Milk}   0

6 sets meet this criterion.

Page 11

A closer look at the “Frequent Set”

Look at the itemsets with more than 1 item: {Bread, Cheese} and {Cheese, Juice}. 4 rules are possible.

Look at confidence levels, where Confidence(X => Y) = Support(XY) / Support(X):

Rule              Confidence
Bread => Cheese   2 / 3 = 66.67%
Cheese => Bread   2 / 3 = 66.67%
Cheese => Juice   2 / 3 = 66.67%
Juice => Cheese   2 / 2 = 100.00%

Only Juice => Cheese meets the 75% minimum confidence.
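The naive procedure just walked through by hand can be sketched as a brute-force program over the same 4-transaction example. This is an illustration under my own naming; the slides do not give code.

```python
from itertools import combinations

# Brute-force sketch of the naive algorithm on the 4-transaction example:
# enumerate every itemset, keep the frequent ones, then test every rule.

tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
items = sorted(set().union(*tx))
N = len(tx)
min_support, min_conf = 0.5, 0.75

def freq(s):
    return sum(1 for t in tx if s <= t)

# All 2^n - 1 non-empty itemsets -- this is the exponential part.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        if freq(s) >= min_support * N:
            frequent[s] = freq(s)

# Split every frequent set of 2+ items into antecedent X and consequent Y.
rules = []
for s, f in frequent.items():
    for r in range(1, len(s)):
        for x in combinations(sorted(s), r):
            x = frozenset(x)
            if f / freq(x) >= min_conf:
                rules.append((set(x), set(s - x), f / freq(x)))
```

Running this reproduces the hand calculation: 6 frequent sets, and Juice => Cheese as the only rule at 75% confidence.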


Page 13

The Big Picture

List all itemsets and find the frequency of each.
Identify "frequent sets", based on support.
Search for rules within the "frequent sets", based on confidence.

Page 14

Looking Beyond the Retail Store

Counter terrorism: track the phone calls made or received from a particular number every day.
Is an incoming call from a particular number followed by a call to another number?
Are there any sets of numbers that are always called together?

Expand the item sets to include: electronic fund transfers, travel between two locations, boarding cards, railway reservations.

All this data is available in electronic format.

Page 15

Major Problem

Exponential growth of the number of itemsets: 4 items give 2^4 = 16 subsets; n items give 2^n. As n becomes larger, the problem can no longer be solved in a reasonable amount of time.

All attempts are made to reduce the number of itemsets to be processed.

The "Improved" Naive algorithm: ignore sets with zero frequency. In the itemset table of Page 10, three sets ({Bread, Cheese, Milk}, {Bread, Juice, Milk} and {Bread, Cheese, Juice, Milk}) have zero frequency and can be dropped.

Page 16

The Apriori Algorithm

Consists of two parts:
First, find the frequent itemsets. Most of the cleverness happens here; we will do better than the naive algorithm.
Second, find the rules. This is relatively simpler.

Page 17

Apriori : Part 1 - Frequent Sets

Step 1: Scan all transactions and find all frequent items that have support above p%. This is set L1.

Step 2 (Apriori-Gen): Build potential sets of k items from L(k-1) by using pairs of itemsets in L(k-1) that have the first k-2 items in common and one remaining item from each member of the pair. This is the candidate set Ck.

Step 3: Scan all transactions again and find the frequency of the sets in Ck; those that are frequent form Lk.

If Lk is empty, stop; else go back to Step 2.
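The three-step loop above can be sketched as a short Python function. This is a simplified illustration under my own naming: the candidate-generation step here unions any two (k-1)-sets that differ in one item, which is a superset-equivalent shortcut for the first-(k-2)-items join described in Step 2, and it omits the subset-pruning refinement of full apriori-gen.

```python
def apriori_frequent_sets(transactions, min_support):
    """Part 1 of Apriori (sketch): return every frequent itemset."""
    N = len(transactions)

    def freq(s):
        return sum(1 for t in transactions if s <= t)

    # Step 1: L1, the frequent single items.
    items = set().union(*transactions)
    L = {frozenset([i]) for i in items
         if freq(frozenset([i])) >= min_support * N}
    frequent = set(L)
    k = 2
    while L:
        # Step 2 (simplified apriori-gen): union pairs of (k-1)-sets that
        # differ in exactly one item. Full apriori-gen would also prune
        # candidates having an infrequent (k-1)-subset.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Step 3: scan transactions; candidates meeting min support form Lk.
        L = {c for c in C if freq(c) >= min_support * N}
        frequent |= L
        k += 1                     # loop ends when Lk comes out empty
    return frequent

# The toy example from Page 9: 4 transactions, 50% minimum support.
tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
result = apriori_frequent_sets(tx, 0.5)
```

On the toy data this returns the same six frequent sets found manually on Page 10.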


Page 19

Example

We have 16 items spread over 25 transactions.

Item No   Item Name
1         Biscuits
2         Bread
3         Cereal
4         Cheese
5         Chocolate
6         Coffee
7         Donuts
8         Eggs
9         Juice
10        Milk
11        Newspaper
12        Pastry
13        Rolls
14        Sugar
15        Tea
16        Yogurt

TID   Items
1     Biscuits, Bread, Cheese, Coffee, Yogurt
2     Bread, Cereal, Cheese, Coffee
3     Cheese, Chocolate, Donuts, Juice, Milk
4     Bread, Cheese, Coffee, Cereal, Juice
5     Bread, Cereal, Chocolate, Donuts, Juice
6     Milk, Tea
7     Biscuits, Bread, Cheese, Coffee, Milk
8     Eggs, Milk, Tea
9     Bread, Cereal, Cheese, Chocolate, Coffee
10    Bread, Cereal, Chocolate, Donuts, Juice
11    Bread, Cheese, Juice
12    Bread, Cheese, Coffee, Donuts, Juice
13    Biscuits, Bread, Cereal
14    Cereal, Cheese, Chocolate, Donuts, Juice
15    Chocolate, Coffee
16    Donuts
17    Donuts, Eggs, Juice
18    Biscuits, Bread, Cheese, Coffee
19    Bread, Cereal, Chocolate, Donuts, Juice
20    Cheese, Chocolate, Donuts, Juice
21    Milk, Tea, Yogurt
22    Bread, Cereal, Cheese, Coffee
23    Chocolate, Donuts, Juice, Milk, Newspaper
24    Newspaper, Pastry, Rolls
25    Rolls, Sugar, Tea

Page 20

Apriori : Step 1 – Computing L1

Count the frequency of each item and exclude those that are below the minimum support (25% of 25 transactions, i.e. at least 7):

Item No   Item Name   Frequency
1         Biscuits    4
2         Bread       13
3         Cereal      10
4         Cheese      11
5         Chocolate   9
6         Coffee      9
7         Donuts      10
8         Eggs        2
9         Juice       11
10        Milk        6
11        Newspaper   2
12        Pastry      1
13        Rolls       2
14        Sugar       1
15        Tea         4
16        Yogurt      2

With 25% support, the surviving items form set L1:

Item No   Item Name   Frequency
2         Bread       13
3         Cereal      10
4         Cheese      11
5         Chocolate   9
6         Coffee      9
7         Donuts      10
9         Juice       11


Page 22

Step 2 : Computing C2

Given L1, we now form the candidate pairs of C2. The 7 items in L1 form 21 pairs: d*(d-1)/2, a quadratic function, not an exponential one.

1    {Bread, Cereal}
2    {Bread, Cheese}
3    {Bread, Chocolate}
4    {Bread, Coffee}
5    {Bread, Donuts}
6    {Bread, Juice}
7    {Cereal, Cheese}
8    {Cereal, Coffee}
9    {Cereal, Chocolate}
10   {Cereal, Donuts}
11   {Cereal, Juice}
12   {Cheese, Chocolate}
13   {Cheese, Coffee}
14   {Cheese, Donuts}
15   {Cheese, Juice}
16   {Chocolate, Coffee}
17   {Chocolate, Donuts}
18   {Chocolate, Juice}
19   {Coffee, Donuts}
20   {Coffee, Juice}
21   {Donuts, Juice}


Page 24

From C2 to L2, based on minimum support

Candidate 2-Item Set    Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Chocolate}      4
{Bread, Coffee}         8
{Bread, Donuts}         4
{Bread, Juice}          6
{Cereal, Cheese}        5
{Cereal, Coffee}        4
{Cereal, Chocolate}     5
{Cereal, Donuts}        4
{Cereal, Juice}         6
{Cheese, Chocolate}     4
{Cheese, Coffee}        9
{Cheese, Donuts}        3
{Cheese, Juice}         4
{Chocolate, Coffee}     1
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Coffee, Donuts}        1
{Coffee, Juice}         2
{Donuts, Juice}         9

This is a computationally intensive step.

With 25% support, this gives set L2:

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9

L2 is not empty, so we continue.


Page 26

Step 2 Again : Get C3

We combine the appropriate frequent 2-item sets from L2 (which must have the same first item) and obtain four candidate itemsets, each containing three items.

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9

Candidate 3-item sets (C3):
{Bread, Cheese, Cereal}
{Bread, Cereal, Coffee}
{Bread, Cheese, Coffee}
{Chocolate, Donuts, Juice}

Page 27

Step 3 Again: C3 to L3

Again based on minimum support:

Candidate 3-item set          Frequency
{Bread, Cheese, Cereal}       4
{Bread, Cereal, Coffee}       4
{Bread, Cheese, Coffee}       8
{Chocolate, Donuts, Juice}    7

With 25% support, this gives set L3:

Frequent 3-item set           Frequency
{Bread, Cheese, Coffee}       8
{Chocolate, Donuts, Juice}    7

Since C4 cannot be formed (the two sets in L3 do not share their first two items), L4 cannot be formed, so we stop here.



Page 30

Apriori : Part 2 - Find Rules

Rules will be found by looking at:
3-item sets found in L3
2-item sets in L2 that are not subsets of any set in L3

In each case we calculate Confidence(A => B) = P(B | A) = P(A ∪ B) / P(A)

Some shorthand: {Bread, Cheese, Coffee} is written as {B, C, D}.

Page 31

Rules for Finding Rules !

A 3-item frequent set {BCD} results in 6 rules:
B => CD, C => BD, D => BC
CD => B, BD => C, BC => D

Also note that B => CD can be decomposed into B => C and B => D.

We now look at the two 3-item sets, {Bread, Cheese, Coffee} and {Chocolate, Donuts, Juice}, from the L3 set (the highest L set), and find their confidence levels; note that the support counts for these sets are 8 and 7.
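The rule-generation recipe above (one rule per non-empty proper subset of the frequent set) can be sketched as a small helper. The function name and data layout are my own; the slides only give the hand calculation.

```python
from itertools import combinations

# Sketch of Part 2 for one frequent itemset: try every non-empty proper
# subset X as antecedent and keep rules above the confidence threshold.

def rules_from_itemset(itemset, transactions, min_conf):
    def freq(s):
        return sum(1 for t in transactions if s <= t)
    s = frozenset(itemset)
    found = []
    for r in range(1, len(s)):               # antecedents of size 1..len-1
        for x in combinations(sorted(s), r):
            x = frozenset(x)
            conf = freq(s) / freq(x)         # Support(XY) / Support(X)
            if conf >= min_conf:
                found.append((set(x), set(s - x), conf))
    return found
```

For a 3-item set this enumerates exactly the 6 candidate rules listed above: three with a single-item antecedent and three with a two-item antecedent.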

Page 32

Rules from First of 2 Itemsets in L3

Confidence of the association rules from {Bread B, Cheese C, Coffee D}, where Confidence(X => Y) = P(Y | X) = P(X ∪ Y) / P(X):

Rule       Support of BCD   Freq of LHS   Confidence
B => CD    8                13            0.615
C => BD    8                11            0.727
D => BC    8                9             0.889
CD => B    8                9             0.889
BD => C    8                8             1.000
BC => D    8                8             1.000

One rule (B => CD) drops out because its confidence is below 70%.


Page 34

Rules from Second of 2 Itemsets in L3

Confidence of the association rules from {Chocolate N, Donuts M, Juice P}:

Rule       Support of NMP   Freq of LHS   Confidence
N => MP    7                9             0.778
M => NP    7                10            0.700
P => NM    7                11            0.636
MP => N    7                9             0.778
NP => M    7                7             1.000
NM => P    7                7             1.000

One rule (P => NM) drops out because its confidence is below 70%.


Page 36

Set of 14 Rules obtained from L3

From C => BD:   C => B    1   Cheese => Bread
                C => D    2   Cheese => Coffee
From D => BC:   D => B    3   Coffee => Bread
                D => C    4   Coffee => Cheese
                CD => B   5   Cheese, Coffee => Bread
                BD => C   6   Bread, Coffee => Cheese
                BC => D   7   Bread, Cheese => Coffee
From N => MP:   N => M    8   Chocolate => Donuts
                N => P    9   Chocolate => Juice
From M => NP:   M => P    10  Donuts => Juice
                M => N    11  Donuts => Chocolate
                MP => N   12  Donuts, Juice => Chocolate
                NP => M   13  Chocolate, Juice => Donuts
                NM => P   14  Chocolate, Donuts => Juice

Page 37

What about L2 ?

Look for sets in L2 that are not subsets of any set in L3.

Frequent 3-item sets (L3): {Bread, Cheese, Coffee} 8 and {Chocolate, Donuts, Juice} 7
Frequent 2-item sets (L2): {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9

{Bread, Cereal} is the only such set, which gives us two more rules:

Bread => Cereal
Cereal => Bread

Page 38

Which are now added to get 16 rules

1    Cheese => Bread
2    Cheese => Coffee
3    Coffee => Bread
4    Coffee => Cheese
5    Cheese, Coffee => Bread
6    Bread, Coffee => Cheese
7    Bread, Cheese => Coffee
8    Chocolate => Donuts
9    Chocolate => Juice
10   Donuts => Juice
11   Donuts => Chocolate
12   Donuts, Juice => Chocolate
13   Chocolate, Juice => Donuts
14   Chocolate, Donuts => Juice
15   Bread => Cereal
16   Cereal => Bread

Page 39

So where are we ?

We have just completed the two parts of the Apriori algorithm:
First find the frequent itemsets. Most of the cleverness happens here, and we do better than the naive algorithm.
Then find the rules. This is relatively simpler.

The overall approach to ARM is as follows:
List all itemsets and find the frequency of each.
Identify "frequent sets", based on support.
Search for rules within the "frequent sets", based on confidence.

Naive algorithm: exponential time.
Apriori algorithm: polynomial time.

Page 40

Observations

Actual values of support and confidence: 25% and 75% are very high values; in reality one works with far smaller values.

"Interestingness" of a rule: since X and Y are related events, not independent, P(X ∪ Y) ≠ P(X)P(Y). Interestingness = P(X ∪ Y) - P(X)P(Y).

Triviality of rules: rules involving very frequent items can be trivial. You always buy potatoes when you go to the market, so you can get rules that connect potatoes to many other things.

Inexplicable rules: the toothbrush was the most frequent item on Tuesday??
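The interestingness measure above (the gap between observed co-occurrence and what independence would predict, often called "leverage" elsewhere) is easy to compute. A minimal sketch, with my own function name, on the toy basket data:

```python
# Sketch of the "interestingness" measure above: P(X and Y together)
# minus the value independence would predict, P(X) * P(Y).

def leverage(x, y, transactions):
    n = len(transactions)
    def p(s):
        return sum(1 for t in transactions if s <= t) / n
    return p(x | y) - p(x) * p(y)

tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
```

For Juice and Cheese in this data the measure is 0.5 - 0.5 * 0.75 = 0.125, a positive value indicating the pair co-occurs more often than chance.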

Page 41

Better Algorithms

Enhancements to the Apriori algorithm: AprioriTid, Direct Hashing and Pruning (DHP), Dynamic Itemset Counting (DIC).

Frequent Pattern (FP) Tree: only the frequent items are needed to find association rules, so ignore the others! Move the data of only the frequent items to a more compact and efficient structure: a tree or directed graph is used, and multiple transactions with the same (frequent) items are stored once, with a count.

Page 42

Software Support

KDNuggets.com: excellent collection of software available
Bart Goethals: free software for Apriori and FP-Tree
ARMiner: GNU open-source software from UMass/Boston
DMII: National University of Singapore
DB2 Intelligent Miner: IBM Corporation; equivalent software is available from other vendors as well