Data Mining: Association Rules Mining or Market Basket Analysis. Prithwis Mukerjee, Ph.D.


TRANSCRIPT

Page 1

Data Mining

Association Rules Mining or

Market Basket Analysis

Prithwis Mukerjee, Ph.D.

Page 2

Prithwis Mukerjee 2

Let us describe the problem ...

A retailer sells the following items: Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper

And we assume that the shopkeeper keeps track of what each customer purchases:

Trans ID   Items
10         Bread, Cheese, Newspaper
20         Bread, Cheese, Juice
30         Bread, Milk
40         Cheese, Juice, Milk, Coffee
50         Sugar, Tea, Coffee, Biscuits, Newspaper
60         Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70         Bread, Cheese
80         Bread, Cheese, Juice, Coffee
90         Bread, Milk
100        Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

He needs to know which items are generally sold together.

Page 3

Associations

Rules expressing relations between items in a "Market Basket":

{ Sugar, Tea } => { Biscuits }

Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits? If so, then:
These items should be ordered together.
But discounts should not be given on these items at the same time!

We can make a guess, but it would be better if we could structure this problem in terms of mathematics.

Page 4

Basic Concepts

Set of n items on sale: I = { i1, i2, i3, ..., in }

Transaction: a subset of I, that is T ⊆ I, the set of items purchased in an individual transaction. A transaction with m items is t = { i1, i2, i3, ..., im }, with m < n.

If we have N transactions, then t1, t2, t3, ..., tN are the unique identifiers of the transactions.

D is our total data about all N transactions: D = { t1, t2, t3, ..., tN }

Page 5

An Association Rule

Whenever X appears, Y also appears: X => Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅

X and Y may be single items or sets of items, in which case the same item does not appear in both.

X is referred to as the antecedent.
Y is referred to as the consequent.

Whether a rule like this exists is the focus of our analysis.

Page 6

Two key concepts

Support (or prevalence): How often do X and Y appear together in the basket? If this number is very low, the rule is not worth examining. Expressed as a fraction of the total number of transactions, say 10% or 0.1.

Confidence (or predictability): Of all the occurrences of X, in what fraction does Y also appear? Expressed as a fraction of all transactions containing X, say 80% or 0.8.

We are interested in rules that have a minimum value of support, say 25%, and a minimum value of confidence, say 75%.

Page 7

Mathematically speaking ...

Support(X) = (number of transactions in which X appears) / N = P(X)

Support(XY) = (number of transactions in which X and Y both appear) / N = P(X ∪ Y)

Confidence(X => Y) = Support(XY) / Support(X) = P(X ∪ Y) / P(X) = conditional probability P(Y | X)

Lift, an optional term, measures the power of the association: P(Y | X) / P(Y)
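The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the function names and the transaction representation (one Python set per basket) are my own choices.

```python
# A sketch (names are mine, not from the slides) of support, confidence
# and lift over a list of transactions, each represented as a set of items.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Support(XY) / Support(X), i.e. P(Y | X)."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """P(Y | X) / P(Y): values above 1 suggest a real association."""
    return confidence(x, y, transactions) / support(y, transactions)

# The small 4-transaction example used later in the slides:
tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]

print(support({"Cheese", "Juice"}, tx))        # 0.5
print(confidence({"Juice"}, {"Cheese"}, tx))   # 1.0
```

Confidence comes out as a ratio of two supports, exactly as in the formula above.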

Page 8

The task at hand ...

Given a large set of transactions, we seek a procedure (or algorithm) that will discover all association rules that have a minimum support of p% and a minimum confidence of q%, and do so in an efficient manner.

Algorithms:
The Naive or Brute Force method
The Improved Naive algorithm
The Apriori algorithm
Improvements to the Apriori algorithm
The FP (Frequent Pattern) algorithm

Page 9

Let us try the Naive Algorithm manually !

This is the set of transactions that we have:

Trans ID   Items
100        Bread, Cheese
200        Bread, Cheese, Juice
300        Bread, Milk
400        Cheese, Juice, Milk

We want to find association rules with minimum 50% support and minimum 75% confidence.

Page 10

Itemsets & Frequencies

Which sets are frequent? Since we are looking for a support of 50%, where support(X) = (number of times X appears) / N, we need a set to appear in at least 2 out of 4 transactions.

Item Sets                      Frequency
{Bread}                        3
{Cheese}                       3
{Juice}                        2
{Milk}                         2
{Bread, Cheese}                2
{Bread, Juice}                 1
{Bread, Milk}                  1
{Cheese, Juice}                2
{Cheese, Milk}                 1
{Juice, Milk}                  1
{Bread, Cheese, Juice}         1
{Bread, Cheese, Milk}          0
{Bread, Juice, Milk}           0
{Cheese, Juice, Milk}          1
{Bread, Cheese, Juice, Milk}   0

6 sets meet this criterion.

Page 11

A closer look at the “Frequent Set”

Look at the itemsets with more than 1 item: {Bread, Cheese} and {Cheese, Juice}. 4 rules are possible.

Look at confidence levels, where Confidence(X => Y) = Support(XY) / Support(X):

Rule              Confidence
Bread => Cheese   2 / 3 = 66.67%
Cheese => Bread   2 / 3 = 66.67%
Cheese => Juice   2 / 3 = 66.67%
Juice => Cheese   2 / 2 = 100.00%

Only Juice => Cheese meets the 75% minimum confidence.
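The naive procedure just walked through by hand can be sketched as a brute-force program over the same 4-transaction example. This is an illustration under my own naming; the slides do not give code.

```python
from itertools import combinations

# Brute-force sketch of the naive algorithm on the 4-transaction example:
# enumerate every itemset, keep the frequent ones, then test every rule.

tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
items = sorted(set().union(*tx))
N = len(tx)
min_support, min_conf = 0.5, 0.75

def freq(s):
    return sum(1 for t in tx if s <= t)

# All 2^n - 1 non-empty itemsets -- this is the exponential part.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        if freq(s) >= min_support * N:
            frequent[s] = freq(s)

# Split every frequent set of 2+ items into antecedent X and consequent Y.
rules = []
for s, f in frequent.items():
    for r in range(1, len(s)):
        for x in combinations(sorted(s), r):
            x = frozenset(x)
            if f / freq(x) >= min_conf:
                rules.append((set(x), set(s - x), f / freq(x)))
```

Running this reproduces the hand calculation: 6 frequent sets, and Juice => Cheese as the only rule at 75% confidence.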


Page 13

The Big Picture

List all itemsets and find the frequency of each.
Identify "frequent sets", based on support.
Search for rules within the "frequent sets", based on confidence.

Page 14

Looking Beyond the Retail Store

Counter terrorism: track the phone calls made or received from a particular number every day.
Is an incoming call from a particular number followed by a call to another number?
Are there any sets of numbers that are always called together?

Expand the item sets to include: electronic fund transfers, travel between two locations, boarding cards, railway reservations.

All this data is available in electronic format.

Page 15

Major Problem

Exponential growth of the number of itemsets: 4 items give 2^4 = 16 subsets; n items give 2^n. As n becomes larger, the problem can no longer be solved in a reasonable amount of time.

All attempts are made to reduce the number of itemsets to be processed.

The "Improved" Naive algorithm: ignore sets with zero frequency. In the itemset table of Page 10, three sets ({Bread, Cheese, Milk}, {Bread, Juice, Milk} and {Bread, Cheese, Juice, Milk}) have zero frequency and can be dropped.

Page 16

The Apriori Algorithm

Consists of two parts:
First, find the frequent itemsets. Most of the cleverness happens here; we will do better than the naive algorithm.
Second, find the rules. This is relatively simpler.

Page 17

Apriori : Part 1 - Frequent Sets

Step 1: Scan all transactions and find all frequent items that have support above p%. This is set L1.

Step 2 (Apriori-Gen): Build potential sets of k items from L(k-1) by using pairs of itemsets in L(k-1) that have the first k-2 items in common and one remaining item from each member of the pair. This is the candidate set Ck.

Step 3: Scan all transactions again and find the frequency of the sets in Ck; those that are frequent form Lk.

If Lk is empty, stop; else go back to Step 2.
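The three-step loop above can be sketched as a short Python function. This is a simplified illustration under my own naming: the candidate-generation step here unions any two (k-1)-sets that differ in one item, which is a superset-equivalent shortcut for the first-(k-2)-items join described in Step 2, and it omits the subset-pruning refinement of full apriori-gen.

```python
def apriori_frequent_sets(transactions, min_support):
    """Part 1 of Apriori (sketch): return every frequent itemset."""
    N = len(transactions)

    def freq(s):
        return sum(1 for t in transactions if s <= t)

    # Step 1: L1, the frequent single items.
    items = set().union(*transactions)
    L = {frozenset([i]) for i in items
         if freq(frozenset([i])) >= min_support * N}
    frequent = set(L)
    k = 2
    while L:
        # Step 2 (simplified apriori-gen): union pairs of (k-1)-sets that
        # differ in exactly one item. Full apriori-gen would also prune
        # candidates having an infrequent (k-1)-subset.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Step 3: scan transactions; candidates meeting min support form Lk.
        L = {c for c in C if freq(c) >= min_support * N}
        frequent |= L
        k += 1                     # loop ends when Lk comes out empty
    return frequent

# The toy example from Page 9: 4 transactions, 50% minimum support.
tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
result = apriori_frequent_sets(tx, 0.5)
```

On the toy data this returns the same six frequent sets found manually on Page 10.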


Page 19

Example

We have 16 items spread over 25 transactions.

Item No   Item Name
1         Biscuits
2         Bread
3         Cereal
4         Cheese
5         Chocolate
6         Coffee
7         Donuts
8         Eggs
9         Juice
10        Milk
11        Newspaper
12        Pastry
13        Rolls
14        Sugar
15        Tea
16        Yogurt

TID   Items
1     Biscuits, Bread, Cheese, Coffee, Yogurt
2     Bread, Cereal, Cheese, Coffee
3     Cheese, Chocolate, Donuts, Juice, Milk
4     Bread, Cheese, Coffee, Cereal, Juice
5     Bread, Cereal, Chocolate, Donuts, Juice
6     Milk, Tea
7     Biscuits, Bread, Cheese, Coffee, Milk
8     Eggs, Milk, Tea
9     Bread, Cereal, Cheese, Chocolate, Coffee
10    Bread, Cereal, Chocolate, Donuts, Juice
11    Bread, Cheese, Juice
12    Bread, Cheese, Coffee, Donuts, Juice
13    Biscuits, Bread, Cereal
14    Cereal, Cheese, Chocolate, Donuts, Juice
15    Chocolate, Coffee
16    Donuts
17    Donuts, Eggs, Juice
18    Biscuits, Bread, Cheese, Coffee
19    Bread, Cereal, Chocolate, Donuts, Juice
20    Cheese, Chocolate, Donuts, Juice
21    Milk, Tea, Yogurt
22    Bread, Cereal, Cheese, Coffee
23    Chocolate, Donuts, Juice, Milk, Newspaper
24    Newspaper, Pastry, Rolls
25    Rolls, Sugar, Tea

Page 20

Apriori : Step 1 – Computing L1

Count the frequency of each item and exclude those that are below the minimum support (25% of 25 transactions, i.e. at least 7):

Item No   Item Name   Frequency
1         Biscuits    4
2         Bread       13
3         Cereal      10
4         Cheese      11
5         Chocolate   9
6         Coffee      9
7         Donuts      10
8         Eggs        2
9         Juice       11
10        Milk        6
11        Newspaper   2
12        Pastry      1
13        Rolls       2
14        Sugar       1
15        Tea         4
16        Yogurt      2

With 25% support, the surviving items form set L1:

Item No   Item Name   Frequency
2         Bread       13
3         Cereal      10
4         Cheese      11
5         Chocolate   9
6         Coffee      9
7         Donuts      10
9         Juice       11


Page 22

Step 2 : Computing C2

Given L1, we now form the candidate pairs of C2. The 7 items in L1 form 21 pairs: d*(d-1)/2, a quadratic function, not an exponential one.

1    {Bread, Cereal}
2    {Bread, Cheese}
3    {Bread, Chocolate}
4    {Bread, Coffee}
5    {Bread, Donuts}
6    {Bread, Juice}
7    {Cereal, Cheese}
8    {Cereal, Coffee}
9    {Cereal, Chocolate}
10   {Cereal, Donuts}
11   {Cereal, Juice}
12   {Cheese, Chocolate}
13   {Cheese, Coffee}
14   {Cheese, Donuts}
15   {Cheese, Juice}
16   {Chocolate, Coffee}
17   {Chocolate, Donuts}
18   {Chocolate, Juice}
19   {Coffee, Donuts}
20   {Coffee, Juice}
21   {Donuts, Juice}


Page 24

From C2 to L2, based on minimum support

Candidate 2-Item Set    Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Chocolate}      4
{Bread, Coffee}         8
{Bread, Donuts}         4
{Bread, Juice}          6
{Cereal, Cheese}        5
{Cereal, Coffee}        4
{Cereal, Chocolate}     5
{Cereal, Donuts}        4
{Cereal, Juice}         6
{Cheese, Chocolate}     4
{Cheese, Coffee}        9
{Cheese, Donuts}        3
{Cheese, Juice}         4
{Chocolate, Coffee}     1
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Coffee, Donuts}        1
{Coffee, Juice}         2
{Donuts, Juice}         9

This is a computationally intensive step.

With 25% support, this gives set L2:

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9

L2 is not empty, so we continue.


Page 26

Step 2 Again : Get C3

We combine the appropriate frequent 2-item sets from L2 (which must have the same first item) and obtain four candidate itemsets, each containing three items.

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9

Candidate 3-item sets (C3):
{Bread, Cheese, Cereal}
{Bread, Cereal, Coffee}
{Bread, Cheese, Coffee}
{Chocolate, Donuts, Juice}

Page 27

Step 3 Again: C3 to L3

Again based on minimum support:

Candidate 3-item set          Frequency
{Bread, Cheese, Cereal}       4
{Bread, Cereal, Coffee}       4
{Bread, Cheese, Coffee}       8
{Chocolate, Donuts, Juice}    7

With 25% support, this gives set L3:

Frequent 3-item set           Frequency
{Bread, Cheese, Coffee}       8
{Chocolate, Donuts, Juice}    7

Since C4 cannot be formed (the two sets in L3 do not share their first two items), L4 cannot be formed, so we stop here.



Page 30

Apriori : Part 2 - Find Rules

Rules will be found by looking at:
3-item sets found in L3
2-item sets in L2 that are not subsets of any set in L3

In each case we calculate Confidence(A => B) = P(B | A) = P(A ∪ B) / P(A)

Some shorthand: {Bread, Cheese, Coffee} is written as {B, C, D}.

Page 31

Rules for Finding Rules !

A 3-item frequent set {BCD} results in 6 rules:
B => CD, C => BD, D => BC
CD => B, BD => C, BC => D

Also note that B => CD can be decomposed into B => C and B => D.

We now look at the two 3-item sets, {Bread, Cheese, Coffee} and {Chocolate, Donuts, Juice}, from the L3 set (the highest L set), and find their confidence levels; note that the support counts for these sets are 8 and 7.
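The rule-generation recipe above (one rule per non-empty proper subset of the frequent set) can be sketched as a small helper. The function name and data layout are my own; the slides only give the hand calculation.

```python
from itertools import combinations

# Sketch of Part 2 for one frequent itemset: try every non-empty proper
# subset X as antecedent and keep rules above the confidence threshold.

def rules_from_itemset(itemset, transactions, min_conf):
    def freq(s):
        return sum(1 for t in transactions if s <= t)
    s = frozenset(itemset)
    found = []
    for r in range(1, len(s)):               # antecedents of size 1..len-1
        for x in combinations(sorted(s), r):
            x = frozenset(x)
            conf = freq(s) / freq(x)         # Support(XY) / Support(X)
            if conf >= min_conf:
                found.append((set(x), set(s - x), conf))
    return found
```

For a 3-item set this enumerates exactly the 6 candidate rules listed above: three with a single-item antecedent and three with a two-item antecedent.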

Page 32

Rules from First of 2 Itemsets in L3

Confidence of the association rules from {Bread B, Cheese C, Coffee D}, where Confidence(X => Y) = P(Y | X) = P(X ∪ Y) / P(X):

Rule       Support of BCD   Freq of LHS   Confidence
B => CD    8                13            0.615
C => BD    8                11            0.727
D => BC    8                9             0.889
CD => B    8                9             0.889
BD => C    8                8             1.000
BC => D    8                8             1.000

One rule (B => CD) drops out because its confidence is below 70%.


Page 34

Rules from Second of 2 Itemsets in L3

Confidence of the association rules from {Chocolate N, Donuts M, Juice P}:

Rule       Support of NMP   Freq of LHS   Confidence
N => MP    7                9             0.778
M => NP    7                10            0.700
P => NM    7                11            0.636
MP => N    7                9             0.778
NP => M    7                7             1.000
NM => P    7                7             1.000

One rule (P => NM) drops out because its confidence is below 70%.


Page 36

Set of 14 Rules obtained from L3

From C => BD:   C => B    1   Cheese => Bread
                C => D    2   Cheese => Coffee
From D => BC:   D => B    3   Coffee => Bread
                D => C    4   Coffee => Cheese
                CD => B   5   Cheese, Coffee => Bread
                BD => C   6   Bread, Coffee => Cheese
                BC => D   7   Bread, Cheese => Coffee
From N => MP:   N => M    8   Chocolate => Donuts
                N => P    9   Chocolate => Juice
From M => NP:   M => P    10  Donuts => Juice
                M => N    11  Donuts => Chocolate
                MP => N   12  Donuts, Juice => Chocolate
                NP => M   13  Chocolate, Juice => Donuts
                NM => P   14  Chocolate, Donuts => Juice

Page 37

What about L2 ?

Look for sets in L2 that are not subsets of any set in L3.

Frequent 3-item sets (L3): {Bread, Cheese, Coffee} 8 and {Chocolate, Donuts, Juice} 7
Frequent 2-item sets (L2): {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9

{Bread, Cereal} is the only such set, which gives us two more rules:

Bread => Cereal
Cereal => Bread

Page 38

Which are now added to get 16 rules

1    Cheese => Bread
2    Cheese => Coffee
3    Coffee => Bread
4    Coffee => Cheese
5    Cheese, Coffee => Bread
6    Bread, Coffee => Cheese
7    Bread, Cheese => Coffee
8    Chocolate => Donuts
9    Chocolate => Juice
10   Donuts => Juice
11   Donuts => Chocolate
12   Donuts, Juice => Chocolate
13   Chocolate, Juice => Donuts
14   Chocolate, Donuts => Juice
15   Bread => Cereal
16   Cereal => Bread

Page 39

So where are we ?

We have just completed the two parts of the Apriori algorithm:
First find the frequent itemsets. Most of the cleverness happens here, and we do better than the naive algorithm.
Then find the rules. This is relatively simpler.

The overall approach to ARM is as follows:
List all itemsets and find the frequency of each.
Identify "frequent sets", based on support.
Search for rules within the "frequent sets", based on confidence.

Naive algorithm: exponential time.
Apriori algorithm: polynomial time.

Page 40

Observations

Actual values of support and confidence: 25% and 75% are very high values; in reality one works with far smaller values.

"Interestingness" of a rule: since X and Y are related events, not independent, P(X ∪ Y) ≠ P(X)P(Y). Interestingness = P(X ∪ Y) - P(X)P(Y).

Triviality of rules: rules involving very frequent items can be trivial. You always buy potatoes when you go to the market, so you can get rules that connect potatoes to many other things.

Inexplicable rules: the toothbrush was the most frequent item on Tuesday??
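The interestingness measure above (the gap between observed co-occurrence and what independence would predict, often called "leverage" elsewhere) is easy to compute. A minimal sketch, with my own function name, on the toy basket data:

```python
# Sketch of the "interestingness" measure above: P(X and Y together)
# minus the value independence would predict, P(X) * P(Y).

def leverage(x, y, transactions):
    n = len(transactions)
    def p(s):
        return sum(1 for t in transactions if s <= t) / n
    return p(x | y) - p(x) * p(y)

tx = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
```

For Juice and Cheese in this data the measure is 0.5 - 0.5 * 0.75 = 0.125, a positive value indicating the pair co-occurs more often than chance.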

Page 41

Better Algorithms

Enhancements to the Apriori algorithm: AprioriTid, Direct Hashing and Pruning (DHP), Dynamic Itemset Counting (DIC).

Frequent Pattern (FP) Tree: only the frequent items are needed to find association rules, so ignore the others! Move the data of only the frequent items to a more compact and efficient structure: a tree or directed graph is used, and multiple transactions with the same (frequent) items are stored once, with a count.

Page 42

Software Support

KDNuggets.com: excellent collection of software available
Bart Goethals: free software for Apriori and FP-Tree
ARMiner: GNU open-source software from UMass/Boston
DMII: National University of Singapore
DB2 Intelligent Miner: IBM Corporation; equivalent software is available from other vendors as well