Market Basket Analysis and Association Rules

Upload: asad-khan

Post on 30-Oct-2014

2

What can be inferred?

- I purchase diapers
- I purchase a new car
- I purchase OTC cough medicine
- I purchase a prescription medication
- I don't show up for class

3

Market Basket Analysis

- MBA is a set of techniques, association rules being the most common, that focus on point-of-sale (POS) transaction data
- 3 types of market basket data (POS data):
  - Customers
  - Orders (basic purchase data)
  - Items (merchandise/services purchased)

4

Market Basket Analysis

- Retail: each customer purchases a different set of products, in different quantities, at different times
- MBA uses this information to:
  - Identify who customers are (not by name)
  - Understand why they make certain purchases
  - Gain insight about its merchandise (products):
    - Fast and slow movers
    - Products which are purchased together
    - Products which might benefit from promotion
  - Take action:
    - Store layouts
    - Which products to put on special, promote, coupons...
- Combined with a customer loyalty card, all of this becomes even more valuable

5

Association Rules

- Data mining (DM) technique most closely allied with Market Basket Analysis
- Association rules (AR) can be automatically generated
  - AR represent patterns in the data without a specified target variable
  - Good example of undirected data mining

7

Market Basket Analysis Measures

Consider the association rule Y => Z, where Y and Z are two products. Y represents the antecedent and Z is called the consequent.

- Support of the rule: the percentage of all baskets that contain both product Y and Z.
  support = P(Y ∧ Z)
- Confidence of the rule: the percentage of all the baskets containing Y that also contain Z. Hence confidence is a conditional probability, i.e. P(Z|Y).
  confidence = P(Y ∧ Z) / P(Y)
- Interest of the rule: measures the statistical dependence of the rule by relating the observed frequency of co-occurrence, P(Y ∧ Z), to the expected frequency of co-occurrence under the assumption of independence of Y and Z, P(Y)P(Z).
  interest = P(Y ∧ Z) / (P(Y) P(Z))

Association-rule discovery is the process of finding strong product associations with a minimum support and/or confidence and an interest of at least one.
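A minimal Python sketch of the three measures, computed from basket counts. The function name `rule_measures` and the counts in the usage lines are hypothetical, chosen only to illustrate the formulas; they are not data from the slides.

```python
def rule_measures(n_baskets, n_y, n_z, n_both):
    """Return (support, confidence, interest) for the rule Y => Z.

    n_baskets: total number of baskets
    n_y, n_z:  baskets containing Y, baskets containing Z
    n_both:    baskets containing both Y and Z
    """
    p_y = n_y / n_baskets
    p_z = n_z / n_baskets
    p_both = n_both / n_baskets

    support = p_both                    # P(Y and Z)
    confidence = p_both / p_y           # P(Z | Y)
    interest = p_both / (p_y * p_z)     # observed vs. expected under independence
    return support, confidence, interest


# Hypothetical counts: 1000 baskets, 200 with Y, 250 with Z, 100 with both.
support, confidence, interest = rule_measures(1000, 200, 250, 100)
print(support, confidence, interest)
```

With these made-up counts the interest comes out above one, so Y and Z appear together more often than independence would predict.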

8

Association Rules Apply Elsewhere

- Besides retail (supermarkets, etc.):
  - Purchases made using credit/debit cards
  - Optional telco service purchases
  - Banking services
  - Unusual combinations of insurance claims can be a warning of fraud
  - Medical patient histories

9

A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence.

Given a set of task

10

Typical Data Structure (Relational Database)

- Lots of questions can be answered:
  - Avg # of orders/customer
  - Avg # unique items/order
  - Avg # of items/order
  - For a product:
    - What % of customers have purchased
    - Avg # orders/customer include it
    - Avg quantity of it purchased/order
  - Etc...
- Visualization is extremely helpful... next slide

Transaction Data
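As a rough illustration of the questions listed for this data structure, the sketch below computes the averages from a flat list of order lines. The layout (customer, order, item, quantity), the sample rows, and all names are assumptions, not the schema from the slide. "Avg # orders/customer include it" is read here as the average number of orders containing the product per customer who bought it.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity)
order_lines = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "Coke", 1),
    ("c2", "o3", "milk", 1),
    ("c2", "o3", "Coke", 3),
    ("c3", "o4", "detergent", 1),
]

orders_by_customer = defaultdict(set)  # customer -> order ids
items_by_order = defaultdict(set)      # order -> distinct items
units_by_order = defaultdict(int)      # order -> total quantity

for customer, order, item, qty in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    units_by_order[order] += qty

n_customers = len(orders_by_customer)
n_orders = len(items_by_order)

avg_orders_per_customer = n_orders / n_customers
avg_unique_items_per_order = sum(len(s) for s in items_by_order.values()) / n_orders
avg_units_per_order = sum(units_by_order.values()) / n_orders


def product_stats(product):
    """Per-product answers; returns None if nobody bought the product."""
    lines = [(c, o, q) for c, o, i, q in order_lines if i == product]
    if not lines:
        return None
    buyers = {c for c, _, _ in lines}
    orders_with = {o for _, o, _ in lines}
    total_qty = sum(q for _, _, q in lines)
    return {
        "pct_customers": 100 * len(buyers) / n_customers,
        "avg_orders_per_buyer": len(orders_with) / len(buyers),
        "avg_qty_per_order": total_qty / len(orders_with),
    }


print(avg_orders_per_customer, avg_unique_items_per_order, avg_units_per_order)
print(product_stats("Coke"))
```

In a relational database the same figures would normally come from grouped queries over the order-line table; plain Python is used here only to keep the sketch self-contained.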

12

Sales Order Characteristics

- Did the order use gift wrap?
- Billing address same as shipping address?
- Did purchaser accept/decline a cross-sell?
- What is the most common item found on a one-item order?
- What is the most common item found on a multi-item order?
- What is the most common item for repeat customer purchases?
- How has ordering of an item changed over time?
- How does the ordering of an item vary geographically?

13

Pivoting for Cluster Algorithms

14

Association Rules

- Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
- Customers who purchase maintenance agreements are very likely to purchase large appliances
- When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner

15

Association Rules

- Association rule types:
  - Actionable Rules: contain high-quality, actionable information
  - Trivial Rules: information already well-known by those familiar with the business
  - Inexplicable Rules: no explanation and do not suggest action
- Trivial and Inexplicable Rules occur most often

16

How Good is an Association Rule

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
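The co-occurrence counts can be reproduced from the five POS transactions with a short sketch like the one below. Variable names are illustrative. Each diagonal cell counts the baskets that contain the item, and each off-diagonal cell counts the baskets that contain both items.

```python
# The five POS transactions from the slide, one set of items per customer.
transactions = [
    {"Coke", "Soda"},
    {"Milk", "Coke", "Window cleaner"},
    {"Coke", "Detergent"},
    {"Coke", "Detergent", "Soda"},
    {"Window cleaner", "Soda"},
]
items = ["Coke", "Window cleaner", "Milk", "Soda", "Detergent"]

# cooc[a][b] = number of baskets containing both a and b
cooc = {a: {b: 0 for b in items} for a in items}
for basket in transactions:
    for a in basket:
        for b in basket:
            cooc[a][b] += 1

for a in items:
    print(f"{a:15}", [cooc[a][b] for b in items])
```

The matrix is symmetric, so only the diagonal and one triangle need to be stored when the number of items is large.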

17

How Good is an Association Rule

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2

Simple patterns:
1. Coke and soda are purchased together in 2 of the 5 baskets, more often than any other pair except Coke and detergent (also 2)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent

18

How Good is an Association Rule

- What is the confidence for this rule: "If a customer purchases soda, then customer also purchases Coke"?
  - 2 out of 3 soda purchases also include Coke, so 67%
- What about the confidence of this rule reversed?
  - 2 out of 4 Coke purchases also include soda, so 50%
- Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
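The two confidence values can be checked against the POS transactions with a small sketch. The `confidence` helper is an illustrative name, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

transactions = [
    {"Coke", "Soda"},
    {"Milk", "Coke", "Window cleaner"},
    {"Coke", "Detergent"},
    {"Coke", "Detergent", "Soda"},
    {"Window cleaner", "Soda"},
]


def confidence(if_items, then_items, baskets):
    """Transactions with all the items divided by transactions with the 'if' items."""
    with_if = [b for b in baskets if if_items <= b]
    with_all = [b for b in with_if if then_items <= b]
    return Fraction(len(with_all), len(with_if))


soda_then_coke = confidence({"Soda"}, {"Coke"}, transactions)
coke_then_soda = confidence({"Coke"}, {"Soda"}, transactions)
print(soda_then_coke, coke_then_soda)
```

Both rules share the same numerator (the two baskets with Coke and soda); only the denominator changes, which is why confidence is not symmetric.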

19

How Good is an Association Rule

- How much better than chance is a rule?
  - Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
- Lift is the ratio of the records that support the entire rule to the number that would be expected, assuming there was no relationship between the products
- Calculating lift... p. 310... When lift > 1, the rule is better at predicting the result than guessing
- When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
- Co-occurrence can occur in 3, 4, or more dimensions...
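One way to compute lift as defined on this slide, using the five POS transactions from the earlier slides. The `lift` helper and its `negate` flag are illustrative assumptions. In this small data set the rule "if soda then Coke" has a lift below 1, while the negative rule "if soda then NOT Coke" has a lift above 1.

```python
transactions = [
    {"Coke", "Soda"},
    {"Milk", "Coke", "Window cleaner"},
    {"Coke", "Detergent"},
    {"Coke", "Detergent", "Soda"},
    {"Window cleaner", "Soda"},
]


def lift(if_item, then_item, baskets, negate=False):
    """Lift of 'if_item => then_item', or of 'if_item => NOT then_item' when negate=True."""
    def result_holds(basket):
        return (then_item not in basket) if negate else (then_item in basket)

    n = len(baskets)
    n_if = sum(1 for b in baskets if if_item in b)
    n_then = sum(1 for b in baskets if result_holds(b))
    n_rule = sum(1 for b in baskets if if_item in b and result_holds(b))

    # Number of baskets expected to support the rule if the items were unrelated.
    expected = n_if * n_then / n
    return n_rule / expected


positive = lift("Soda", "Coke", transactions)               # 2 observed vs. 2.4 expected
negative = lift("Soda", "Coke", transactions, negate=True)  # 1 observed vs. 0.6 expected
print(round(positive, 3), round(negative, 3))
```

The sketch does not guard against an expected count of zero, which would occur if either side of the rule never appears in the data.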

20

Creating Association Rules

1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items... "if Coke then soda"
2. Generate co-occurrence matrix for two items... "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items... "if Coke and Milk and Window Cleaner then soda"
4. Etc...
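The level-by-level counting in the steps above can be sketched as follows. This is a simplified Apriori-style approach, a technique the slides do not name. The function name, the `min_count` threshold, and the item-level pruning are assumptions made for illustration; full Apriori prunes candidates by their subsets rather than by single items.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Coke", "Soda"},
    {"Milk", "Coke", "Window cleaner"},
    {"Coke", "Detergent"},
    {"Coke", "Detergent", "Soda"},
    {"Window cleaner", "Soda"},
]


def frequent_itemsets(baskets, min_count, max_size):
    """Count itemsets of size 1, 2, ..., keeping those seen in at least min_count baskets."""
    frequent = {}
    kept_items = None  # items that survived the previous level
    for size in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            candidates = basket if kept_items is None else basket & kept_items
            for combo in combinations(sorted(candidates), size):
                counts[combo] += 1
        level = {combo: c for combo, c in counts.items() if c >= min_count}
        if not level:
            break  # nothing frequent at this size, so larger sizes cannot be frequent
        frequent.update(level)
        kept_items = {item for combo in level for item in combo}
    return frequent


result = frequent_itemsets(transactions, min_count=2, max_size=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

Dropping infrequent items before moving to the next size keeps the number of combinations to count from growing as fast as it would with every item included.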

22

Final Thought on Association Rules: The Problem of Lots of Data

- Fast food restaurant... could have 100 items on its menu
  - How many combinations are there with 3 different menu items? 161,700
- Supermarket... 10,000 or more unique items
  - 50 million 2-item combinations
  - More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
- Use of product hierarchies (groupings) helps address this common issue
- Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
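The combination counts above can be verified with the binomial coefficient. The last lines show, under the purely hypothetical assumption that 10,000 items roll up into 200 product categories, how much a product hierarchy shrinks the problem. `math.comb` requires Python 3.8 or later.

```python
import math

menu_triples = math.comb(100, 3)            # 3-item combinations from a 100-item menu
supermarket_pairs = math.comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = math.comb(10_000, 3)  # 3-item combinations from 10,000 items

# Hypothetical hierarchy: 10,000 items grouped into 200 categories.
N_CATEGORIES = 200
category_pairs = math.comb(N_CATEGORIES, 2)
category_triples = math.comb(N_CATEGORIES, 3)

print(f"{menu_triples:,}")
print(f"{supermarket_pairs:,}")
print(f"{supermarket_triples:,}")
print(f"{category_pairs:,} {category_triples:,}")
```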

23

Business and other cases

33

General Observations

- Banking case seems to provide well-defined and intelligible information of the form:
  - account_1 and account_2, etc., or
  - activity_1 and activity_2, etc., possibly indexed by time
- As such, the rules found provide a guide to action: which product or service to offer (cross-sell)

34

- In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
- The soccer event exemplifies sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

35

Challenges

- A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
- The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

36

Solutions

- Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
- Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
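The slides do not define differential market basket analysis. The sketch below assumes it means comparing the strength of the same rule across segments, such as two stores, and keeping only rules whose confidence differs noticeably. The store data, function names, and the `min_gap` threshold are all hypothetical.

```python
from itertools import permutations

# Hypothetical baskets for two stores.
store_a = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"butter", "milk"},
]
store_b = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]


def pair_confidences(baskets):
    """Confidence of every single-item rule y => z found in the baskets."""
    items = sorted(set().union(*baskets))
    result = {}
    for y, z in permutations(items, 2):
        n_y = sum(1 for b in baskets if y in b)
        n_both = sum(1 for b in baskets if y in b and z in b)
        if n_y:
            result[(y, z)] = n_both / n_y
    return result


def differential_rules(baskets_a, baskets_b, min_gap):
    """Rules whose confidence differs between the two segments by at least min_gap."""
    conf_a = pair_confidences(baskets_a)
    conf_b = pair_confidences(baskets_b)
    gaps = {}
    for rule in conf_a.keys() & conf_b.keys():
        gap = conf_a[rule] - conf_b[rule]
        if abs(gap) >= min_gap:
            gaps[rule] = gap
    return gaps


gaps = differential_rules(store_a, store_b, min_gap=0.3)
for rule, gap in sorted(gaps.items()):
    print(rule, round(gap, 3))
```

Rules that hold equally everywhere, which tend to be the trivial ones, drop out because their gap is near zero.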

37

Thank You

2

What can be inferred

I purchase diapersI purchase diapers

I purchase a new carI purchase a new car

I purchase OTC cough medicineI purchase OTC cough medicine

I purchase a prescription I purchase a prescription

medicationmedication

I donrsquot show up for classI donrsquot show up for class

3

Market Basket Analysis

MBA is a set of techniques MBA is a set of techniques Association Rules being most Association Rules being most common that focus on point-of-sale common that focus on point-of-sale (p-o-s) transaction data(p-o-s) transaction data

3 types of market basket data (p-o-s 3 types of market basket data (p-o-s data)data) CustomersCustomers Orders (basic purchase data)Orders (basic purchase data) Items (merchandiseservices Items (merchandiseservices

purchased)purchased)

4

Market Basket Analysis

Retail ndash each customer purchases different set Retail ndash each customer purchases different set of products different quantities different of products different quantities different timestimes

MBA uses this information toMBA uses this information to Identify who customers are (not by name)Identify who customers are (not by name) Understand why they make certain purchasesUnderstand why they make certain purchases Gain insight about its merchandise (products)Gain insight about its merchandise (products)

Fast and slow moversFast and slow movers Products which are purchased togetherProducts which are purchased together Products which might benefit from promotionProducts which might benefit from promotion

Take actionTake action Store layoutsStore layouts Which products to put on specials promote couponshellipWhich products to put on specials promote couponshellip

Combining all of this with a customer loyalty Combining all of this with a customer loyalty card it becomes even more valuablecard it becomes even more valuable

5

Association Rules

DM technique most closely allied DM technique most closely allied with Market Basket Analysiswith Market Basket Analysis

AR can be automatically AR can be automatically generatedgenerated AR represent patterns in the data AR represent patterns in the data

without a specified target variablewithout a specified target variable Good example of undirected data Good example of undirected data

miningmining

6

7

Market Basket Analysis Measures

Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent

Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)

Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)

Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))

Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one

8

Association Rules Apply Elsewhere

Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit

cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance

claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories

9

A certainty measure for A certainty measure for association rules of the form ldquoA association rules of the form ldquoA =gt Brdquo where A and B are sets of =gt Brdquo where A and B are sets of items is confidenceitems is confidence

Given a set of task Given a set of task

10

Typical Data Structure (Relational Database)

Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product

What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder

EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext

slide slide

Transaction Data

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a

one-item orderone-item order What is the most common item found on a What is the most common item found on a

multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat

customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over

timetime How does the ordering of an item vary How does the ordering of an item vary

geographicallygeographically

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]

Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners

15

Association Rules

Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-

quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already

well-known by those familiar with well-known by those familiar with the businessthe business

Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action

Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often

16

How Good is an Association Rule

CustomerCustomer Items PurchasedItems Purchased

11 Coke sodaCoke soda

22 Milk Coke window cleanerMilk Coke window cleaner

33 Coke detergentCoke detergent

44 Coke detergent sodaCoke detergent soda

55 Window cleaner sodaWindow cleaner soda

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

CokeCoke 44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

POS Transactions

Co-occurrence ofProducts

17

How Good is an Association Rule

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67

What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50

Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items

Customer Items Purchased

1 Coke soda

2 Milk Coke window cleaner

3 Coke detergent

4 Coke detergent soda

5 Window cleaner soda

POS Transactions

19

How Good is an Association Rule

How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at

predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace

Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products

Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing

When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing

Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip

20

Creating Association Rules

11 Choosing the right set Choosing the right set of itemsof items

22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix

33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items

21

Overcoming Practical Limits for Association Rules

11 Generate co-occurrence matrix Generate co-occurrence matrix for single itemshelliprdquofor single itemshelliprdquoif Coke then if Coke then sodardquosodardquo

22 Generate co-occurrence matrix Generate co-occurrence matrix for two itemshelliprdquofor two itemshelliprdquoif Coke and Milk if Coke and Milk then sodardquothen sodardquo

33 Generate co-occurrence matrix Generate co-occurrence matrix for three itemshelliprdquofor three itemshelliprdquoif Coke and Milk if Coke and Milk and Windowand Window Cleanerrdquo then soda Cleanerrdquo then soda

44 EtchellipEtchellip

22

Final Thought on Association RulesThe Problem of Lots of Data

Fast Food Restauranthellipcould have 100 Fast Food Restauranthellipcould have 100 items on its menuitems on its menu How many combinations are there with 3 How many combinations are there with 3

different menu items 161700 different menu items 161700 Supermarkethellip10000 or more unique Supermarkethellip10000 or more unique

itemsitems 50 million 2-item combinations50 million 2-item combinations 100 billion 3-item combinations100 billion 3-item combinations

Use of product hierarchies (groupings) Use of product hierarchies (groupings) helps address this common issuehelps address this common issue

Finally know that the number of Finally know that the number of transactions in a given time-period could transactions in a given time-period could also be huge (hence expensive to analyze)also be huge (hence expensive to analyze)

23

Business and other cases

24

25

26

27

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

5

Association Rules

Data mining (DM) technique most closely allied with Market Basket Analysis
Association rules (AR) can be generated automatically
  AR represent patterns in the data without a specified target variable
  A good example of undirected data mining

6

7

Market Basket Analysis Measures

Consider the association rule Y ⇒ Z, where Y and Z are two products. Y is the antecedent and Z is the consequent.

Support of the rule: the percentage of all baskets that contain both product Y and product Z.
  support = P(Y ∧ Z)

Confidence of the rule: the percentage of all baskets containing Y that also contain Z. Hence confidence is a conditional probability, i.e. P(Z | Y).
  confidence = P(Y ∧ Z) / P(Y)

Interest of the rule: measures the statistical dependence of the rule by relating the observed frequency of co-occurrence, P(Y ∧ Z), to the frequency expected if Y and Z were statistically independent, P(Y) P(Z).
  interest = P(Y ∧ Z) / (P(Y) P(Z))

Association-rule discovery is the process of finding strong product associations with a minimum support and/or confidence and an interest of at least one.
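
The three measures can be sketched in a few lines of Python. This is only an illustration: the baskets and the function names (`prob`, `support`, `confidence`, `interest`) are made up for the example and are not part of the slides.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical baskets, invented purely for illustration.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "jam"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]


def prob(items: Iterable[str], baskets: List[FrozenSet[str]] = BASKETS) -> float:
    """Fraction of baskets that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(wanted <= basket for basket in baskets) / len(baskets)


def support(y: str, z: str) -> float:
    """support = P(Y and Z)"""
    return prob({y, z})


def confidence(y: str, z: str) -> float:
    """confidence = P(Y and Z) / P(Y); undefined if Y never occurs."""
    return prob({y, z}) / prob({y})


def interest(y: str, z: str) -> float:
    """interest = P(Y and Z) / (P(Y) * P(Z)); undefined if Y or Z never occurs."""
    return prob({y, z}) / (prob({y}) * prob({z}))


if __name__ == "__main__":
    for name, measure in (("support", support),
                          ("confidence", confidence),
                          ("interest", interest)):
        print(f"{name}(bread => butter) = {measure('bread', 'butter'):.3f}")
```

With these made-up baskets the rule bread ⇒ butter has support 2/4, confidence 2/3, and interest 4/3, so it would pass an "interest of at least one" filter.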

8

Association Rules Apply Elsewhere

Besides retail (supermarkets, etc.)…
  Purchases made using credit/debit cards
  Optional Telco service purchases
  Banking services
  Unusual combinations of insurance claims can be a warning of fraud
  Medical patient histories

9

A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence

Given a set of task

10

Typical Data Structure (Relational Database)

Lots of questions can be answered
  Avg. number of orders per customer
  Avg. number of unique items per order
  Avg. number of items per order
  For a product
    What percentage of customers have purchased it
    Avg. number of orders per customer that include it
    Avg. quantity of it purchased per order
  Etc.…
Visualization is extremely helpful… next slide

Transaction Data
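
As a rough illustration of how such questions are answered from order-line data, here is a minimal Python sketch. The table layout `(customer_id, order_id, item, quantity)`, the sample rows, and the function names are assumptions made for this example; the slides do not show the actual schema.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
ORDER_LINES = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "Coke", 1),
    ("c2", "o3", "milk", 1),
    ("c2", "o3", "Coke", 3),
    ("c2", "o3", "detergent", 1),
]


def summarize(lines):
    """Overall averages; assumes every order belongs to exactly one customer."""
    orders_by_customer = defaultdict(set)
    items_by_order = defaultdict(set)
    units_by_order = defaultdict(int)
    for customer, order, item, qty in lines:
        orders_by_customer[customer].add(order)
        items_by_order[order].add(item)
        units_by_order[order] += qty
    n_customers = len(orders_by_customer)
    n_orders = len(items_by_order)
    return {
        "avg_orders_per_customer": n_orders / n_customers,
        "avg_unique_items_per_order":
            sum(len(items) for items in items_by_order.values()) / n_orders,
        "avg_units_per_order": sum(units_by_order.values()) / n_orders,
    }


def product_profile(lines, product):
    """Per-product figures; raises ZeroDivisionError if the product never sold."""
    customers = {c for c, _, _, _ in lines}
    buyers = {c for c, _, item, _ in lines if item == product}
    orders_with = {o for _, o, item, _ in lines if item == product}
    units = sum(q for _, _, item, q in lines if item == product)
    return {
        "pct_customers": 100 * len(buyers) / len(customers),
        "avg_quantity_per_order": units / len(orders_with),
    }


if __name__ == "__main__":
    print(summarize(ORDER_LINES))
    print(product_profile(ORDER_LINES, "Coke"))
```

In practice the same aggregates would usually be computed with GROUP BY queries against the relational tables; the pure-Python version only makes the definitions explicit.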

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrap?
Billing address same as shipping address?
Did the purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]

Customers who purchase maintenance agreements are very likely to purchase large appliances

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner

15

Association Rules

Association rule types
  Actionable rules: contain high-quality, actionable information
  Trivial rules: information already well known by those familiar with the business
  Inexplicable rules: have no explanation and do not suggest action
Trivial and inexplicable rules occur most often

16

How Good is an Association Rule

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke               4               1     1     2          2
Window cleaner     1               2     1     1          0
Milk               1               1     1     0          0
Soda               2               1     0     3          1
Detergent          2               0     0     1          2
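
The co-occurrence table on this slide can be reproduced from the five POS transactions with a short Python sketch. The function name `cooccurrence` and the data layout (a list of sets) are choices made for this example only.

```python
# The five POS transactions from the slide.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
ITEMS = ["Coke", "window cleaner", "milk", "soda", "detergent"]


def cooccurrence(transactions, items):
    """counts[a][b] = number of baskets containing both a and b.

    The diagonal counts[a][a] is the number of baskets containing a.
    Every item in every basket must be listed in `items`.
    """
    counts = {a: {b: 0 for b in items} for a in items}
    for basket in transactions:
        for a in basket:
            for b in basket:
                counts[a][b] += 1
    return counts


MATRIX = cooccurrence(TRANSACTIONS, ITEMS)

if __name__ == "__main__":
    print(" " * 16 + "".join(f"{name:>16}" for name in ITEMS))
    for a in ITEMS:
        print(f"{a:<16}" + "".join(f"{MATRIX[a][b]:>16}" for b in ITEMS))
```

The matrix is symmetric, so only the upper or lower triangle needs to be stored when the number of items is large.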

17

How Good is an Association Rule

                Coke  Window cleaner  Milk  Soda  Detergent
Coke               4               1     1     2          2
Window cleaner     1               2     1     1          0
Milk               1               1     1     0          0
Soda               2               1     0     3          1
Detergent          2               0     0     1          2

Simple patterns
  1. Coke and soda, like Coke and detergent, are purchased together more often than any other pair of items (2 baskets each)
  2. Detergent is never purchased with milk or window cleaner
  3. Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this rule?
  If a customer purchases soda, then the customer also purchases Coke
  2 out of 3 soda purchases also include Coke, so 67%

What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%

Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

POS Transactions
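
The two confidence figures can be checked with a small Python sketch that applies the ratio definition directly to the five transactions. The helper name `confidence` and its signature are assumptions for this example.

```python
# The five POS transactions from the slide.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(if_items, then_items, transactions):
    """Transactions with all the items divided by transactions with the "if" items."""
    if_set = set(if_items)
    all_set = if_set | set(then_items)
    n_if = sum(if_set <= basket for basket in transactions)
    n_all = sum(all_set <= basket for basket in transactions)
    return n_all / n_if  # raises ZeroDivisionError if the "if" items never occur


SODA_THEN_COKE = confidence({"soda"}, {"Coke"}, TRANSACTIONS)   # 2 of 3
COKE_THEN_SODA = confidence({"Coke"}, {"soda"}, TRANSACTIONS)   # 2 of 4

if __name__ == "__main__":
    print(f"soda => Coke: {SODA_THEN_COKE:.0%}")
    print(f"Coke => soda: {COKE_THEN_SODA:.0%}")
```

Because the denominator changes with the "if" side, confidence is not symmetric: reversing a rule generally gives a different value.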

19

How Good is an Association Rule

How much better than chance is a rule?
  Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place

Lift is the ratio of the number of records that support the entire rule to the number that would be expected if there were no relationship between the products

Calculating lift… p. 310…
  When lift > 1, the rule is better at predicting the result than guessing
  When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing

Co-occurrence can occur in 3, 4, or more dimensions…
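
As a worked illustration, the sketch below computes lift for the rule "if soda then Coke" on the five POS transactions shown earlier in the deck, and for its negative rule "if soda then NOT Coke". The function name `lift` and the `negate_then` flag are assumptions for this example, not notation from the slides.

```python
# The five POS transactions used throughout the deck.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def lift(if_item, then_item, transactions, negate_then=False):
    """Observed co-occurrence divided by the co-occurrence expected by chance.

    With negate_then=True the "then" part is read as "does NOT contain then_item",
    which gives the lift of the negative rule.
    """
    def then_holds(basket):
        return (then_item not in basket) if negate_then else (then_item in basket)

    n = len(transactions)
    p_if = sum(if_item in basket for basket in transactions) / n
    p_then = sum(then_holds(basket) for basket in transactions) / n
    p_both = sum(if_item in basket and then_holds(basket)
                 for basket in transactions) / n
    return p_both / (p_if * p_then)


LIFT_RULE = lift("soda", "Coke", TRANSACTIONS)                        # 0.4 / (0.6 * 0.8)
LIFT_NEGATIVE = lift("soda", "Coke", TRANSACTIONS, negate_then=True)  # 0.2 / (0.6 * 0.2)

if __name__ == "__main__":
    print(f"lift(soda => Coke)     = {LIFT_RULE:.2f}")
    print(f"lift(soda => not Coke) = {LIFT_NEGATIVE:.2f}")
```

Here the rule has a lift below 1 (about 0.83) even though its confidence is 67%, because Coke already appears in 80% of all baskets; the negative rule has a lift of about 1.67.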

20

Creating Association Rules

1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc.…
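
One common way to keep these successive counts manageable is to work level by level and drop rare combinations before moving to the next size. The sketch below is a simplified version of that idea, not the procedure from the slides: the minimum-count threshold, the function name `frequent_itemsets`, and the pruning rule (only items that survive one level are used to build the next) are assumptions made for this example.

```python
from itertools import combinations

# The five POS transactions used throughout the deck.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(transactions, min_count, max_size):
    """Count itemsets of size 1, 2, ... and keep those seen at least min_count times.

    Candidates of each size are built only from items that still appear in
    some surviving itemset of the previous size.
    """
    levels = {}
    items = sorted({item for basket in transactions for item in basket})
    for size in range(1, max_size + 1):
        counts = {}
        for combo in combinations(items, size):
            n = sum(set(combo) <= basket for basket in transactions)
            if n >= min_count:
                counts[combo] = n
        if not counts:
            break  # nothing frequent at this size, so nothing larger can be
        levels[size] = counts
        items = sorted({item for combo in counts for item in combo})
    return levels


LEVELS = frequent_itemsets(TRANSACTIONS, min_count=2, max_size=3)

if __name__ == "__main__":
    for size, counts in LEVELS.items():
        print(size, counts)
```

With a minimum count of 2, milk is dropped after the first pass, only two pairs survive the second pass, and no three-item combination is frequent, so the search stops early.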

22

Final Thought on Association Rules: The Problem of Lots of Data

A fast food restaurant… could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700
A supermarket… 10,000 or more unique items
  50 million 2-item combinations
  Over 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
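
The combination counts follow from the binomial coefficient "n choose k". A quick check in Python, where the wrapper name `n_combinations` is just a label chosen for this example:

```python
from math import comb  # available in Python 3.8 and later


def n_combinations(n_items: int, k: int) -> int:
    """Number of distinct k-item combinations that can be formed from n_items."""
    return comb(n_items, k)


if __name__ == "__main__":
    print(f"{n_combinations(100, 3):,}")     # fast food menu, 3-item combinations
    print(f"{n_combinations(10_000, 2):,}")  # supermarket, 2-item combinations
    print(f"{n_combinations(10_000, 3):,}")  # supermarket, 3-item combinations
```

The counts grow roughly as n^k / k!, which is why each extra item in a rule multiplies the work by thousands for a supermarket-sized catalogue.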

23

Business and other cases

24

25

26

27

28

29

30

31

32

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form
  account_1 and account_2, etc., or
  activity_1 and activity_2, etc., possibly indexed by time

As such, the rules found provide a guide to action: which product or service to offer (cross-sell)

34

In the retailing case of items purchased together, guidance is not so clear-cut because of the large number of rules

The soccer case exemplifies the sequencing of events towards reaching a goal. Software applied to basketball was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business

The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results

Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
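
As one possible illustration of aggregation, items can be rolled up to product groups before pairs are counted, which shrinks the number of distinct combinations. The category mapping `HIERARCHY` below is entirely hypothetical, as are the function names; the slides do not specify how the grouping is done.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions used throughout the deck.
TRANSACTIONS = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> product group.
HIERARCHY = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "detergent": "cleaning",
    "window cleaner": "cleaning",
}


def roll_up(transactions, hierarchy):
    """Replace each item by its group; items without a group are kept as-is."""
    return [{hierarchy.get(item, item) for item in basket} for basket in transactions]


def count_pairs(transactions):
    """Number of baskets containing each unordered pair (as a sorted tuple)."""
    pairs = Counter()
    for basket in transactions:
        pairs.update(combinations(sorted(basket), 2))
    return pairs


ITEM_PAIRS = count_pairs(TRANSACTIONS)
GROUP_PAIRS = count_pairs(roll_up(TRANSACTIONS, HIERARCHY))

if __name__ == "__main__":
    print(len(ITEM_PAIRS), "distinct item pairs")
    print(len(GROUP_PAIRS), "distinct group pairs:", dict(GROUP_PAIRS))
```

The trade-off is resolution: rules found at group level ("beverages with cleaning products") are cheaper to compute but less specific than item-level rules.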

37

Thank You

5

Association Rules

DM technique most closely allied DM technique most closely allied with Market Basket Analysiswith Market Basket Analysis

AR can be automatically AR can be automatically generatedgenerated AR represent patterns in the data AR represent patterns in the data

without a specified target variablewithout a specified target variable Good example of undirected data Good example of undirected data

miningmining

6

7

Market Basket Analysis Measures

Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent

Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)

Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)

Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))

Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one

8

Association Rules Apply Elsewhere

Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit

cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance

claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories

9

A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence.

Given a set of task

10

Typical Data Structure (Relational Database)

Lots of questions can be answered:
  Avg number of orders per customer
  Avg number of unique items per order
  Avg number of items per order
  For a product:
    What percentage of customers have purchased it
    Avg number of orders per customer that include it
    Avg quantity of it purchased per order
  Etc.…
Visualization is extremely helpful…next slide

Transaction Data
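A minimal sketch of how some of these questions might be answered from order-line data. The tuple layout (customer_id, order_id, item, quantity) and the sample rows are assumptions made for this example; the slides do not specify the actual schema.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
order_lines = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "milk", 1),
    ("c2", "o3", "Coke", 1),
    ("c2", "o3", "detergent", 3),
]

orders_by_customer = defaultdict(set)  # customer -> order ids
items_by_order = defaultdict(set)      # order -> distinct items
quantity_by_order = defaultdict(int)   # order -> total units purchased

for customer, order, item, qty in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    quantity_by_order[order] += qty

n_orders = len(items_by_order)
n_customers = len(orders_by_customer)

avg_orders_per_customer = n_orders / n_customers
avg_unique_items_per_order = sum(len(s) for s in items_by_order.values()) / n_orders
avg_items_per_order = sum(quantity_by_order.values()) / n_orders

# Product-level question: what percentage of customers have purchased Coke?
coke_customers = {c for c, _, item, _ in order_lines if item == "Coke"}
pct_customers_coke = 100 * len(coke_customers) / n_customers

print(avg_orders_per_customer, avg_unique_items_per_order,
      avg_items_per_order, pct_customers_coke)
```

In a relational database the same aggregates would normally be expressed as grouped queries; plain Python is used here only to keep the example self-contained.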

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner

15

Association Rules

Association rule types:
  Actionable Rules: contain high-quality, actionable information
  Trivial Rules: information already well known by those familiar with the business
  Inexplicable Rules: have no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often

16

How Good is an Association Rule

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
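The co-occurrence table can be reproduced from the five POS transactions with a short script. This is only a sketch: the variable names are illustrative, and item names are lower-cased (except Coke) for consistency.

```python
from itertools import product

# The five POS transactions from the slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
items = ["Coke", "window cleaner", "milk", "soda", "detergent"]

# cooc[a][b] = number of baskets containing both a and b.
# The diagonal cooc[a][a] is simply the number of baskets containing a.
cooc = {a: {b: 0 for b in items} for a in items}
for basket in transactions:
    for a, b in product(basket, repeat=2):
        cooc[a][b] += 1

# Print the matrix row by row, in the same item order as the slide.
for a in items:
    print(f"{a:15}", *(cooc[a][b] for b in items))
```

The matrix is symmetric, so in practice only the upper triangle needs to be stored.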

17

How Good is an Association Rule

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2

Simple patterns:
1. Coke and soda are purchased together at least as often as any other two items (2 of 5 baskets, tied with Coke and detergent)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this rule?
  If a customer purchases soda, then the customer also purchases Coke
  2 out of 3 soda purchases also include Coke, so 67%
What about the confidence of this rule reversed?
  2 out of 4 Coke purchases also include soda, so 50%
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda
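The two confidence figures can be checked directly against the transaction list. The helper below is a sketch (its name and signature are assumptions for this example) that follows the ratio definition given above and also accepts multi-item "if" parts.

```python
# The five POS transactions from the slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(if_items, then_items, transactions):
    """Transactions with all items divided by transactions with the "if" items."""
    if_items, then_items = set(if_items), set(then_items)
    n_if = sum(1 for t in transactions if if_items <= t)
    n_all = sum(1 for t in transactions if (if_items | then_items) <= t)
    return n_all / n_if  # assumes the "if" items occur at least once

print(round(confidence({"soda"}, {"Coke"}, transactions), 2))  # 0.67
print(confidence({"Coke"}, {"soda"}, transactions))            # 0.5
```

The asymmetry comes entirely from the denominator: soda appears in 3 baskets, Coke in 4, while the numerator (baskets with both) is 2 in each direction.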

19

How Good is an Association Rule

How much better than chance is a rule?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
Lift is the ratio of the records that support the entire rule to the number that would be expected, assuming there was no relationship between the products
Calculating lift…p. 310…When lift > 1, the rule is better at predicting the result than guessing
When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing
Co-occurrence can occur in 3, 4, or more dimensions…
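A worked example of lift on the five-basket data, offered as a sketch. It reads the "negative rule" as "if soda then NOT Coke", which is an interpretation rather than something the slide spells out; the function name and the negate flag are likewise assumptions.

```python
from fractions import Fraction

# The five POS transactions from the earlier slides.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
n = len(transactions)

def lift(if_item, then_item, negate=False):
    """Lift of "if_item -> then_item", or of "if_item -> NOT then_item" if negate."""
    def result_holds(t):
        return (then_item in t) != negate

    p_if = Fraction(sum(if_item in t for t in transactions), n)
    p_then = Fraction(sum(result_holds(t) for t in transactions), n)
    p_both = Fraction(sum(if_item in t and result_holds(t) for t in transactions), n)
    # Observed co-occurrence relative to what independence would predict.
    return p_both / (p_if * p_then)

print(lift("soda", "Coke"))               # 5/6
print(lift("soda", "Coke", negate=True))  # 5/3
```

Here the rule "if soda then Coke" has lift 5/6, below one, even though its confidence is 67%, because Coke is already in 80% of all baskets. The negated rule has lift 5/3, which matches the slide's remark that the negative rule does better than guessing when lift is below one.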

20

Creating Association Rules

1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items…"if Coke then soda"
2. Generate co-occurrence matrix for two items…"if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items…"if Coke and Milk and Window Cleaner then soda"
4. Etc.…
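The level-by-level procedure above can be sketched as follows. The pruning step is a simplified Apriori-style idea, not something the slides name: items that do not appear in any sufficiently frequent itemset at one level are dropped before the next level is counted. The function name and the min_count and max_size parameters are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# The five POS transactions from the earlier slides.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def frequent_itemsets(transactions, min_count, max_size):
    """Count itemsets of size 1..max_size found in at least min_count baskets."""
    frequent = {}
    kept_items = {i for t in transactions for i in t}
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            # Only combine items that survived the previous level.
            candidates = sorted(t & kept_items)
            counts.update(combinations(candidates, size))
        level = {s: c for s, c in counts.items() if c >= min_count}
        if not level:
            break  # nothing frequent at this size, so nothing larger either
        frequent.update(level)
        kept_items = {i for s in level for i in s}
    return frequent

result = frequent_itemsets(transactions, min_count=2, max_size=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

On the five sample baskets with min_count=2, milk is dropped after the first level, two pairs survive the second level, and no three-item set is frequent, so the loop stops early.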

22

Final Thought on Association Rules: The Problem of Lots of Data

Fast food restaurant…could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700
Supermarket…10,000 or more unique items
  About 50 million 2-item combinations
  Well over 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)
Use of product hierarchies (groupings) helps address this common issue
Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
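The combination counts quoted above follow from the binomial coefficient "n choose k" and can be checked with Python's standard library. The 200-group figure at the end is a hypothetical hierarchy size, used only to show how much grouping shrinks the search space.

```python
from math import comb

# Fast food restaurant: 3-item combinations from a 100-item menu.
print(comb(100, 3))     # 161700

# Supermarket with exactly 10,000 unique items.
print(comb(10_000, 2))  # 49995000 (about 50 million)
print(comb(10_000, 3))  # 166616670000 (about 167 billion)

# Hypothetical: the same items rolled up into 200 product groups.
print(comb(200, 3))     # 1313400
```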

23

Business and other cases

33

General Observations

Banking case seems to provide well-defined and intelligible information of the form
  account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time
As such, the rules found provide a guide to action: to offer a product or service (cross-sell)

34

In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules
The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business
The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results
Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
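One way filtering and aggregation might look in practice is sketched below. The item-to-group mapping and the min_count threshold are hypothetical; the slides do not say which groupings or cut-offs were used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical item -> product group mapping (not from the slides).
group_of = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "window cleaner": "cleaning",
    "detergent": "cleaning",
}

# The five POS transactions from the earlier slides.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Aggregation: replace each item by its product group.
grouped = [{group_of[i] for i in t} for t in transactions]

# Filtering: drop groups that appear in fewer than min_count baskets.
min_count = 2
group_counts = Counter(g for t in grouped for g in t)
kept = {g for g, c in group_counts.items() if c >= min_count}
filtered = [t & kept for t in grouped]

# Pair counting now runs over far fewer distinct symbols.
pair_counts = Counter(
    pair for t in filtered for pair in combinations(sorted(t), 2)
)
print(pair_counts)
```

Aggregation trades detail for tractability: the ten possible item pairs collapse to a single surviving group pair here, at the cost of no longer distinguishing Coke from soda.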

37

Thank You

7

Market Basket Analysis Measures

Consider the association rule Y 1048782 Z where Y and Z are two products Y Consider the association rule Y 1048782 Z where Y and Z are two products Y represents the antecedent en Z is called the consequentrepresents the antecedent en Z is called the consequent

Support Support of the rule the percentage of all baskets that contain both of the rule the percentage of all baskets that contain both product Y and Zproduct Y and Zsupport = P(Y Λ Z)support = P(Y Λ Z)

Confidence Confidence of the rule the percentage of all the baskets containing Y that of the rule the percentage of all the baskets containing Y that also contain Zalso contain ZHence confidence is a conditional probability ie P(Z|Y)Hence confidence is a conditional probability ie P(Z|Y)confidence = P(Y Λ Z)P(Y)confidence = P(Y Λ Z)P(Y)

Interest Interest of the rule measures the statistical dependence of the rule by of the rule measures the statistical dependence of the rule by relating the observed frequency of occurrence (P(Y Λ Z)) to the expected relating the observed frequency of occurrence (P(Y Λ Z)) to the expected frequency of co-occurrence under the assumption of conditional frequency of co-occurrence under the assumption of conditional independence of Y and Z (P(Y)P(Z))independence of Y and Z (P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))interest = P(Y Λ Z)(P(Y)P(Z))

Association-rule discovery is the process of finding strong product Association-rule discovery is the process of finding strong product associations with aassociations with aminimum support andor confidence and an interest of at least oneminimum support andor confidence and an interest of at least one

8

Association Rules Apply Elsewhere

Besides retail ndash supermarkets etchellipBesides retail ndash supermarkets etchellip Purchases made using creditdebit Purchases made using creditdebit

cardscards Optional Telco Service purchasesOptional Telco Service purchases Banking servicesBanking services Unusual combinations of insurance Unusual combinations of insurance

claims can be a warning of fraudclaims can be a warning of fraud Medical patient historiesMedical patient histories

9

A certainty measure for A certainty measure for association rules of the form ldquoA association rules of the form ldquoA =gt Brdquo where A and B are sets of =gt Brdquo where A and B are sets of items is confidenceitems is confidence

Given a set of task Given a set of task

10

Typical Data Structure (Relational Database)

Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product

What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder

EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext

slide slide

Transaction Data

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a

one-item orderone-item order What is the most common item found on a What is the most common item found on a

multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat

customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over

timetime How does the ordering of an item vary How does the ordering of an item vary

geographicallygeographically

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]

Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners

15

Association Rules

Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-

quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already

well-known by those familiar with well-known by those familiar with the businessthe business

Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action

Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often

16

How Good is an Association Rule

CustomerCustomer Items PurchasedItems Purchased

11 Coke sodaCoke soda

22 Milk Coke window cleanerMilk Coke window cleaner

33 Coke detergentCoke detergent

44 Coke detergent sodaCoke detergent soda

55 Window cleaner sodaWindow cleaner soda

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

CokeCoke 44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

POS Transactions

Co-occurrence ofProducts

17

How Good is an Association Rule

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67

What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50

Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items

Customer Items Purchased

1 Coke soda

2 Milk Coke window cleaner

3 Coke detergent

4 Coke detergent soda

5 Window cleaner soda

POS Transactions

19

How Good is an Association Rule

How much better than chance is a rule?
  Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place

Lift is the ratio of the records that support the entire rule to the number that would be expected assuming there was no relationship between the products

Calculating lift (p. 310): when lift > 1, the rule is better at predicting the result than guessing

When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing

Co-occurrence can occur in 3, 4, or more dimensions...
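As a hedged illustration of the lift definition, the sketch below applies it to the five sample transactions used on the earlier slides. The helper names `support` and `lift` are assumptions introduced here, and "expected assuming no relationship" is read as the product of the two individual supports.

```python
# Sample baskets from the earlier slides, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def lift(if_items, then_items, baskets):
    """Observed support of the whole rule over the support expected under independence."""
    observed = support(set(if_items) | set(then_items), baskets)
    expected = support(if_items, baskets) * support(then_items, baskets)
    return observed / expected  # assumes both sides occur at least once


rule_lift = lift({"soda"}, {"Coke"}, transactions)

# Negative rule "if soda then NOT Coke": count soda baskets that lack Coke.
p_soda = support({"soda"}, transactions)
p_not_coke = 1 - support({"Coke"}, transactions)
p_soda_not_coke = sum(
    1 for b in transactions if "soda" in b and "Coke" not in b
) / len(transactions)
negative_lift = p_soda_not_coke / (p_soda * p_not_coke)

print(f"lift(soda -> Coke)     = {rule_lift:.2f}")      # 0.83
print(f"lift(soda -> not Coke) = {negative_lift:.2f}")  # 1.67
```

Here the rule "if soda then Coke" has lift below 1 because Coke is already in 4 of 5 baskets, so the negative rule is the more informative one on this tiny sample.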

20

Creating Association Rules

1. Choosing the right set of items

2. Generating rules by deciphering the counts in the co-occurrence matrix

3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate the co-occurrence matrix for single items, e.g. "if Coke then soda"

2. Generate the co-occurrence matrix for two items, e.g. "if Coke and Milk then soda"

3. Generate the co-occurrence matrix for three items, e.g. "if Coke and Milk and Window Cleaner then soda"

4. Etc.
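The level-by-level counting in the steps above can be sketched as follows. This is a brute-force illustration on the five sample transactions, not a production algorithm: the function `count_itemsets` is a name invented here, and it applies no pruning, so it would not scale to thousands of items.

```python
from collections import Counter
from itertools import combinations

# Sample baskets from the earlier slides, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def count_itemsets(baskets, size):
    """Count how many baskets contain each combination of `size` items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every combination one canonical tuple form.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return counts


pairs = count_itemsets(transactions, 2)    # backs rules like "if Coke then soda"
triples = count_itemsets(transactions, 3)  # backs rules like "if Coke and detergent then soda"

# Confidence of "if Coke and detergent then soda" from the two levels of counts.
rule_conf = triples[("Coke", "detergent", "soda")] / pairs[("Coke", "detergent")]
print(pairs[("Coke", "soda")], rule_conf)
```

A rule with k items on the "if" side needs the counts for (k+1)-item combinations, which is why each step in the list requires a larger table than the one before.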

22

Final Thought on Association Rules: The Problem of Lots of Data

Fast food restaurant: could have 100 items on its menu
  How many combinations are there with 3 different menu items? 161,700

Supermarket: 10,000 or more unique items
  About 50 million 2-item combinations
  More than 100 billion 3-item combinations

Use of product hierarchies (groupings) helps address this common issue

Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
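The combination counts quoted on this slide follow from the binomial coefficient "n choose k". A quick check, using `math.comb` from the Python standard library (Python 3.8 or later) and assuming exactly 100 and 10,000 items:

```python
from math import comb

menu_triples = comb(100, 3)            # 3-item combinations from a 100-item menu
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 products
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 products

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```

The jump from roughly 50 million pairs to well over 100 billion triples is the reason grouping items into a product hierarchy is attractive: halving the number of distinct items cuts the number of triples by about a factor of eight.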

23

Business and other cases

24

25

26

27

28

29

30

31

32

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time

As such, the rules found provide a guide to action: offer a product or service (cross-sell)

34

In the retailing case of items purchased together, guidance is not so clear-cut due to the extensive number of rules

The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business

The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results

Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications

37

Thank You

8

Association Rules Apply Elsewhere

Besides retail (supermarkets, etc.):
  Purchases made using credit/debit cards
  Optional telco service purchases
  Banking services
  Unusual combinations of insurance claims can be a warning of fraud
  Medical patient histories

9

A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence

Given a set of task

10

Typical Data Structure (Relational Database)

Lots of questions can be answered:
  Average number of orders per customer
  Average number of unique items per order
  Average number of items per order
  For a product:
    What percentage of customers have purchased it
    Average number of orders per customer that include it
    Average quantity of it purchased per order
  Etc.

Visualization is extremely helpful... next slide

Transaction Data
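A few of the summary questions listed on this slide can be answered directly from order-line records. The sketch below is illustrative only: the slide's actual schema is not shown in the transcript, so the tuple layout `(customer_id, order_id, item, quantity)`, the sample rows, and the function `product_summary` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
order_lines = [
    ("c1", "o1", "Coke", 2),
    ("c1", "o1", "soda", 1),
    ("c1", "o2", "Coke", 1),
    ("c2", "o3", "milk", 1),
    ("c2", "o3", "Coke", 3),
    ("c3", "o4", "soda", 6),
]

orders_by_customer = defaultdict(set)
items_by_order = defaultdict(set)
units_by_order = defaultdict(int)
for customer, order, item, qty in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    units_by_order[order] += qty

n_customers = len(orders_by_customer)
n_orders = len(items_by_order)
avg_orders_per_customer = sum(len(o) for o in orders_by_customer.values()) / n_customers
avg_unique_items_per_order = sum(len(i) for i in items_by_order.values()) / n_orders
avg_units_per_order = sum(units_by_order.values()) / n_orders


def product_summary(item):
    """Share of customers who bought `item` and its average quantity per order."""
    buyers = {c for c, _, i, _ in order_lines if i == item}
    qty_by_order = defaultdict(int)
    for _, order, i, qty in order_lines:
        if i == item:
            qty_by_order[order] += qty
    return {
        "share_of_customers": len(buyers) / n_customers,
        # Assumes the item appears on at least one order.
        "avg_qty_per_order": sum(qty_by_order.values()) / len(qty_by_order),
    }


coke = product_summary("Coke")
print(avg_orders_per_customer, avg_unique_items_per_order, avg_units_per_order, coke)
```

In a relational database the same figures would normally come from grouped queries over the customer, order, and order-line tables; the dictionaries here stand in for those groupings.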

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrap?
Billing address same as shipping address?
Did the purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]

Customers who purchase maintenance agreements are very likely to purchase large appliances

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

15

Association Rules

Association rule types:
  Actionable Rules: contain high-quality, actionable information
  Trivial Rules: information already well known by those familiar with the business
  Inexplicable Rules: no explanation and do not suggest action

Trivial and Inexplicable Rules occur most often

16

How Good is an Association Rule

POS Transactions

Customer   Items Purchased
1          Coke, soda
2          Milk, Coke, window cleaner
3          Coke, detergent
4          Coke, detergent, soda
5          Window cleaner, soda

Co-occurrence of Products

                 Coke   Window cleaner   Milk   Soda   Detergent
Coke             4      1                1      2      2
Window cleaner   1      2                1      1      0
Milk             1      1                1      0      0
Soda             2      1                0      3      1
Detergent        2      0                0      1      2
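The co-occurrence table can be rebuilt from the five transactions with a short script. This is a minimal sketch: the function name `cooccurrence_matrix` and the nested-list representation are choices made here for illustration, and the item order simply follows the column order of the slide.

```python
# The five point-of-sale transactions from the slide, one set of items per customer.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
items = ["Coke", "window cleaner", "milk", "soda", "detergent"]


def cooccurrence_matrix(baskets, items):
    """matrix[i][j] = number of baskets containing both items[i] and items[j]."""
    return [
        [sum(1 for b in baskets if row in b and col in b) for col in items]
        for row in items
    ]


matrix = cooccurrence_matrix(transactions, items)
for name, row in zip(items, matrix):
    print(f"{name:<15}", *row)
```

The diagonal holds the number of baskets containing each single item (for example, 4 for Coke), and the matrix is symmetric because buying A with B is the same event as buying B with A.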

10

Typical Data Structure (Relational Database)

Lots of questions can be answeredLots of questions can be answered Avg of orderscustomerAvg of orderscustomer Avg unique itemsorderAvg unique itemsorder Avg of itemsorderAvg of itemsorder For a productFor a product

What of customers have purchasedWhat of customers have purchased Avg orderscustomer include itAvg orderscustomer include it Avg quantity of it purchasedorderAvg quantity of it purchasedorder

EtchellipEtchellip Visualization is extremely helpfulhellipnext Visualization is extremely helpfulhellipnext

slide slide

Transaction Data

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrapDid the order use gift wrap Billing address same as Shipping addressBilling address same as Shipping address Did purchaser acceptdecline a cross-sellDid purchaser acceptdecline a cross-sell What is the most common item found on a What is the most common item found on a

one-item orderone-item order What is the most common item found on a What is the most common item found on a

multi-item ordermulti-item order What is the most common item for repeat What is the most common item for repeat

customer purchasescustomer purchases How has ordering of an item changed over How has ordering of an item changed over

timetime How does the ordering of an item vary How does the ordering of an item vary

geographicallygeographically

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]

Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners

15

Association Rules

Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-

quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already

well-known by those familiar with well-known by those familiar with the businessthe business

Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action

Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often

16

How Good is an Association Rule

CustomerCustomer Items PurchasedItems Purchased

11 Coke sodaCoke soda

22 Milk Coke window cleanerMilk Coke window cleaner

33 Coke detergentCoke detergent

44 Coke detergent sodaCoke detergent soda

55 Window cleaner sodaWindow cleaner soda

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

CokeCoke 44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

POS Transactions

Co-occurrence ofProducts

17

How Good is an Association Rule

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this rule?
If a customer purchases soda, then the customer also purchases Coke
2 out of 3 soda purchases also include Coke, so 67%

What about the confidence of this rule reversed?
2 out of 4 Coke purchases also include soda, so 50%

Confidence = ratio of the number of transactions with all the items to the number of transactions with just the "if" items

POS Transactions

Customer   Items Purchased
1          Coke, soda
2          Milk, Coke, window cleaner
3          Coke, detergent
4          Coke, detergent, soda
5          Window cleaner, soda
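The confidence definition on this slide can be checked directly against the five baskets. The following is a small sketch; the function name `confidence` and its argument names are assumptions made for illustration.

```python
# The five point-of-sale baskets from the slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(baskets, if_items, then_items):
    """Baskets with all the items divided by baskets with just the 'if' items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [b for b in baskets if if_items <= b]
    if not with_if:
        raise ValueError("no basket contains the 'if' items")
    with_all = [b for b in with_if if then_items <= b]
    return len(with_all) / len(with_if)


soda_then_coke = confidence(transactions, {"soda"}, {"Coke"})  # 2 of 3 soda baskets
coke_then_soda = confidence(transactions, {"Coke"}, {"soda"})  # 2 of 4 Coke baskets
print(f"{soda_then_coke:.0%} {coke_then_soda:.0%}")
```

The two values differ because confidence is not symmetric: reversing a rule changes the denominator.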

19

How Good is an Association Rule

How much better than chance is a rule?

Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place

Lift is the ratio of the records that support the entire rule to the number that would be expected assuming there was no relationship between the products

Calculating lift… p. 310… When lift > 1, the rule is better at predicting the result than guessing

When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing

Co-occurrence can occur in 3, 4, or more dimensions…
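The slide defers the lift calculation to a textbook page, so the sketch below assumes the commonly used formulation: the observed share of baskets supporting the whole rule, divided by the share expected if the "if" and "then" items were independent. The helper names (`support`, `lift`, `negative_lift`) are illustrative.

```python
# The five point-of-sale baskets used earlier in the deck.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def lift(baskets, if_items, then_items):
    """Observed joint support over the support expected under independence."""
    joint = support(baskets, set(if_items) | set(then_items))
    expected = support(baskets, if_items) * support(baskets, then_items)
    return joint / expected  # raises ZeroDivisionError if either side never occurs


def negative_lift(baskets, if_items, then_items):
    """Lift of the negative rule: 'if' items present, 'then' items not all present."""
    if_items, then_items = set(if_items), set(then_items)
    n = len(baskets)
    p_if = sum(if_items <= b for b in baskets) / n
    p_not_then = sum(not then_items <= b for b in baskets) / n
    p_both = sum(if_items <= b and not then_items <= b for b in baskets) / n
    return p_both / (p_if * p_not_then)


soda_coke = lift(transactions, {"soda"}, {"Coke"})
soda_not_coke = negative_lift(transactions, {"soda"}, {"Coke"})
print(round(soda_coke, 2), round(soda_not_coke, 2))
```

On this toy data the rule "if soda then Coke" has lift below 1 (about 0.83) even though its confidence is 67%, because Coke already appears in 4 of the 5 baskets. The negative rule "if soda then not Coke" has lift above 1 (about 1.67).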

20

Creating Association Rules

1. Choosing the right set of items
2. Generating rules by deciphering the counts in the co-occurrence matrix
3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items… "if Coke then soda"
2. Generate co-occurrence matrix for two items… "if Coke and Milk then soda"
3. Generate co-occurrence matrix for three items… "if Coke and Milk and Window Cleaner then soda"
4. Etc…
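The stepwise procedure above can be sketched by counting item combinations of increasing size and reading rules off the counts. This is an illustrative brute-force version with hypothetical function names (`itemset_counts`, `rules`), not a tuned algorithm.

```python
from collections import Counter
from itertools import combinations

# The five point-of-sale baskets used earlier in the deck.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def itemset_counts(baskets, k):
    """Count how many baskets contain each combination of k items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every combination one canonical tuple form.
        counts.update(combinations(sorted(basket), k))
    return counts


def rules(baskets, n_if, min_confidence=0.0):
    """Rules with `n_if` items on the 'if' side and one item on the 'then' side."""
    if_counts = itemset_counts(baskets, n_if)
    all_counts = itemset_counts(baskets, n_if + 1)
    found = []
    for itemset, n_all in all_counts.items():
        for then_item in itemset:
            if_items = tuple(i for i in itemset if i != then_item)
            conf = n_all / if_counts[if_items]
            if conf >= min_confidence:
                found.append((if_items, then_item, conf))
    return found


one_item_rules = rules(transactions, 1)   # e.g. "if Coke then soda"
two_item_rules = rules(transactions, 2)   # e.g. "if Coke and milk then window cleaner"
print(len(one_item_rules), len(two_item_rules))
```

Each extra item on the "if" side requires counting combinations one size larger, which is why the number of unique items becomes the practical limit.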

22

Final Thought on Association Rules: The Problem of Lots of Data

Fast food restaurant… could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700

Supermarket… 10,000 or more unique items
About 50 million 2-item combinations
More than 100 billion 3-item combinations

Use of product hierarchies (groupings) helps address this common issue

Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
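The combination counts quoted on this slide follow from the binomial coefficient "n choose k". A quick check, assuming Python 3.8 or later for `math.comb` (the variable names are illustrative):

```python
from math import comb

fast_food_triples = comb(100, 3)       # 3-item combinations from a 100-item menu
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 items

print(f"{fast_food_triples:,}")    # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```

The supermarket figures on the slide are therefore round-number approximations: just under 50 million pairs and roughly 167 billion triples.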

23

Business and other cases

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time

As such, the rules found provide a guide to action: offer a product or service (cross-sell)

34

In the retailing case of items purchased together, guidance is not so clear-cut because of the extensive number of rules

The soccer event exemplifies sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business

The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results

Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications
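One way to picture aggregation and filtering is to roll items up into product groups before counting pairs, then keep only pairs that occur often enough. The sketch below makes this concrete. The `hierarchy` mapping, the group names, and the `min_count` threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# The five point-of-sale baskets used earlier in the deck.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> product group.
hierarchy = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "window cleaner": "cleaning",
    "detergent": "cleaning",
}


def aggregate(baskets, hierarchy):
    """Aggregation: replace each item by its group (unmapped items stay as-is)."""
    return [{hierarchy.get(item, item) for item in basket} for basket in baskets]


def frequent_pairs(baskets, min_count):
    """Filtering: keep only pairs that appear in at least `min_count` baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


grouped = aggregate(transactions, hierarchy)
print(frequent_pairs(grouped, min_count=2))
```

With five items rolled up into three groups, the number of possible pairs drops from 10 to 3, and the filter leaves a single frequent pair (beverages with cleaning, in 4 baskets).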

37

Thank You

11

Sales Order Characteristics

12

Sales Order Characteristics

Did the order use gift wrap?
Billing address same as shipping address?
Did purchaser accept/decline a cross-sell?
What is the most common item found on a one-item order?
What is the most common item found on a multi-item order?
What is the most common item for repeat customer purchases?
How has ordering of an item changed over time?
How does the ordering of an item vary geographically?

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]

Customers who purchase maintenance agreements are very likely to purchase large appliances

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

13

Pivoting for Cluster Algorithms

14

Association Rules

Wal-Mart customers who purchase Wal-Mart customers who purchase Barbie dolls have a 60 likelihood of Barbie dolls have a 60 likelihood of also purchasing one of three types of also purchasing one of three types of candy bars [candy bars [ForbesForbes Sept 8 1997] Sept 8 1997]

Customers who purchase maintenance Customers who purchase maintenance agreements are very likely to purchase agreements are very likely to purchase large appliances When a new hardware large appliances When a new hardware store opens one of the most commonly store opens one of the most commonly sold items is toilet bowl cleanerssold items is toilet bowl cleaners

15

Association Rules

Association rule typesAssociation rule types Actionable Rules ndash contain high-Actionable Rules ndash contain high-

quality actionable informationquality actionable information Trivial Rules ndash information already Trivial Rules ndash information already

well-known by those familiar with well-known by those familiar with the businessthe business

Inexplicable Rules ndash no explanation Inexplicable Rules ndash no explanation and do not suggest actionand do not suggest action

Trivial and Inexplicable Rules Trivial and Inexplicable Rules occur most oftenoccur most often

16

How Good is an Association Rule

CustomerCustomer Items PurchasedItems Purchased

11 Coke sodaCoke soda

22 Milk Coke window cleanerMilk Coke window cleaner

33 Coke detergentCoke detergent

44 Coke detergent sodaCoke detergent soda

55 Window cleaner sodaWindow cleaner soda

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

CokeCoke 44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

POS Transactions

Co-occurrence ofProducts

17

How Good is an Association Rule

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67

What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50

Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items

Customer Items Purchased

1 Coke soda

2 Milk Coke window cleaner

3 Coke detergent

4 Coke detergent soda

5 Window cleaner soda

POS Transactions

19

How Good is an Association Rule

How much better than chance is a ruleHow much better than chance is a rule Lift (improvementa) tells us how much better a rule is at Lift (improvementa) tells us how much better a rule is at

predicting the result than just assuming the result in the first predicting the result than just assuming the result in the first placeplace

Lift Lift is the ratio of the records that support the entire rule to is the ratio of the records that support the entire rule to the number that would be expected assuming there was no the number that would be expected assuming there was no relationship between the productsrelationship between the products

Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at Calculating lifthellipp 310hellipWhen lift gt 1 then the rule is better at predicting the result than guessingpredicting the result than guessing

When lift lt 1 the rule is doing worse than informed guessing When lift lt 1 the rule is doing worse than informed guessing and using the and using the Negative RuleNegative Rule produces a better rule than produces a better rule than guessingguessing

Co-occurrence can occur in 3 4 or more dimensionshellipCo-occurrence can occur in 3 4 or more dimensionshellip

20

Creating Association Rules

11 Choosing the right set Choosing the right set of itemsof items

22 Generating rules by Generating rules by deciphering the deciphering the counts in the co-counts in the co-occurrence matrixoccurrence matrix

33 Overcoming the Overcoming the practical limits practical limits imposed by thousands imposed by thousands or tens of thousands or tens of thousands of unique itemsof unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items… “if Coke then soda”

2. Generate co-occurrence matrix for two items… “if Coke and Milk then soda”

3. Generate co-occurrence matrix for three items… “if Coke and Milk and Window Cleaner then soda”

4. Etc…
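The level-by-level procedure above can be sketched as follows. This is a minimal sketch, not the deck's own method: the minimum-count threshold (min_count) and the pruning step are assumptions added here, similar in spirit to Apriori-style level-wise counting, and the function name is hypothetical.

```python
from itertools import combinations

# Transactions from the POS table in these slides (customers 1-5).
transactions = [
    {"Coke", "soda"},
    {"Milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def frequent_itemsets(baskets, min_count=1, max_size=3):
    """Count itemsets level by level: single items, then pairs, then triples."""
    items = sorted(set().union(*baskets))
    levels = {}
    for size in range(1, max_size + 1):
        counts = {}
        for combo in combinations(items, size):
            c = sum(1 for b in baskets if set(combo) <= b)
            if c >= min_count:
                counts[combo] = c
        if not counts:
            break
        # Assumed pruning: only items that survive this level are
        # considered when building larger itemsets.
        items = sorted({i for combo in counts for i in combo})
        levels[size] = counts
    return levels

levels = frequent_itemsets(transactions, min_count=2)
for size, counts in levels.items():
    print(size, counts)
```

With min_count=2, Milk drops out after the first level, only the pairs (Coke, detergent) and (Coke, soda) survive, and no three-item set reaches the threshold, so counting stops early.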

22

Final Thought on Association Rules: The Problem of Lots of Data

Fast Food Restaurant… could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700

Supermarket… 10,000 or more unique items
About 50 million 2-item combinations
More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)

Use of product hierarchies (groupings) helps address this common issue

Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
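The combination counts quoted on this slide can be checked directly with the standard library. The variable names below are illustrative only, and math.comb requires Python 3.8 or later.

```python
from math import comb

# Number of ways to choose k distinct items from n, ignoring order.
fast_food_triples = comb(100, 3)        # 161,700
supermarket_pairs = comb(10_000, 2)     # 49,995,000 (about 50 million)
supermarket_triples = comb(10_000, 3)   # 166,616,670,000 (about 167 billion)

print(fast_food_triples, supermarket_pairs, supermarket_triples)
```

The slide's figures are rounded; the exact three-item count for 10,000 items comes out somewhat higher than 100 billion, which only strengthens the point about combinatorial growth.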

23

Business and other cases

24

25

26

27

28

29

30

31

32

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form
account_1 and account_2, etc., or
activity_1 and activity_2, etc., possibly indexed by time.

As such, the rules found provide a guide to action: offer a product or service (cross-sell).

34

In the retailing case of items purchased together, guidance is not so clear-cut because of the extensive number of rules.

The soccer event exemplifies the sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports.

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business.

The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements.

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results.

Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications.

37

Thank You

14

Association Rules

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997].

Customers who purchase maintenance agreements are very likely to purchase large appliances.

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner.

15

Association Rules

Association rule types
Actionable Rules – contain high-quality actionable information
Trivial Rules – information already well-known by those familiar with the business
Inexplicable Rules – no explanation and do not suggest action

Trivial and Inexplicable Rules occur most often

16

How Good is an Association Rule

POS Transactions

Customer  Items Purchased
1         Coke, soda
2         Milk, Coke, window cleaner
3         Coke, detergent
4         Coke, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2
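The co-occurrence counts for this slide can be reproduced from the five transactions with a few lines of code. This is a minimal sketch; the function name co_occurrence and the nested-dictionary layout are choices made here for illustration.

```python
# Transactions from the POS table in these slides (customers 1-5).
transactions = [
    {"Coke", "soda"},
    {"Milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]
items = ["Coke", "window cleaner", "Milk", "soda", "detergent"]

def co_occurrence(baskets, items):
    """matrix[a][b] = number of baskets containing both a and b.

    The diagonal matrix[a][a] is simply how many baskets contain a.
    """
    return {
        a: {b: sum(1 for t in baskets if a in t and b in t) for b in items}
        for a in items
    }

matrix = co_occurrence(transactions, items)
for a in items:
    print(f"{a:<15}", *(matrix[a][b] for b in items))
```

The matrix is symmetric, so in practice only the upper triangle needs to be computed and stored.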

17

How Good is an Association Rule

                Coke  Window cleaner  Milk  Soda  Detergent
Coke            4     1               1     2     2
Window cleaner  1     2               1     1     0
Milk            1     1               1     0     0
Soda            2     1               0     3     1
Detergent       2     0               0     1     2

Simple patterns:
1. Coke and soda are purchased together more often than most other pairs of items (2 of 5 transactions, tied with Coke and detergent)
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this rule?
If a customer purchases soda, then the customer also purchases Coke
2 out of 3 soda purchases also include Coke, so 67%

What about the confidence of this rule reversed?
2 out of 4 Coke purchases also include soda, so 50%

Confidence = Ratio of the number of transactions with all the items to the number of transactions with just the “if” items

Customer Items Purchased

1 Coke soda

2 Milk Coke window cleaner

3 Coke detergent

4 Coke detergent soda

5 Window cleaner soda

POS Transactions
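The confidence definition on this slide can be checked against the same five transactions. The sketch below uses a hypothetical helper named confidence and assumes the "if" items occur in at least one transaction.

```python
# Transactions from the POS table in these slides (customers 1-5).
transactions = [
    {"Coke", "soda"},
    {"Milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(if_items, then_items, baskets):
    """Share of baskets containing the "if" items that also contain the "then" items."""
    with_if = [b for b in baskets if if_items <= b]
    with_all = [b for b in with_if if then_items <= b]
    return len(with_all) / len(with_if)

soda_then_coke = confidence({"soda"}, {"Coke"}, transactions)  # 2 of 3
coke_then_soda = confidence({"Coke"}, {"soda"}, transactions)  # 2 of 4
print(f"{soda_then_coke:.0%} {coke_then_soda:.0%}")  # 67% 50%
```

Confidence is not symmetric: reversing the rule changes the denominator, which is why the two directions give different values.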




17

How Good is an Association Rule

CokCokee

Window Window cleanercleaner

MilkMilk SodaSoda DetergentDetergent

44 11 11 22 22

Window cleanerWindow cleaner 11 22 11 11 00

MilkMilk 11 11 11 00 00

SodaSoda 22 11 00 33 11

DetergentDetergent 22 00 00 11 22

Simple patterns1 Coke and soda are more likely purchased together thanany other two items2 Detergent is never purchased with milk or window cleaner3 Milk is never purchased with soda or detergent

18

How Good is an Association Rule

What is the confidence for this ruleWhat is the confidence for this rule If a customer purchases soda then customer also purchases CokeIf a customer purchases soda then customer also purchases Coke 2 out of 3 soda purchases also include Coke so 672 out of 3 soda purchases also include Coke so 67

What about the confidence of this rule reversedWhat about the confidence of this rule reversed 2 out of 4 Coke purchases also include soda so 502 out of 4 Coke purchases also include soda so 50

Confidence Confidence = Ratio of the number of transactions with all the = Ratio of the number of transactions with all the items to the number of transactions with just the ldquoifrdquo itemsitems to the number of transactions with just the ldquoifrdquo items

Customer Items Purchased

1 Coke soda

2 Milk Coke window cleaner

3 Coke detergent

4 Coke detergent soda

5 Window cleaner soda

POS Transactions

19

How Good is an Association Rule

How much better than chance is a rule?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place.

Lift is the ratio of the number of records that support the entire rule to the number that would be expected assuming there was no relationship between the products.

Calculating lift... p. 310... When lift > 1, the rule is better at predicting the result than guessing.

When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing.

Co-occurrence can occur in 3, 4 or more dimensions...
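As a rough sketch of the lift definition above, the expected count under "no relationship" can be taken as the product of the two individual purchase rates. The helper names support and lift are assumptions for illustration, and the data are the five POS transactions from the confidence example.

```python
# The five point-of-sale transactions from the confidence example.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def lift(baskets, if_items, then_items):
    """Observed rate of the whole rule divided by the rate expected under independence."""
    observed = support(baskets, set(if_items) | set(then_items))
    expected = support(baskets, if_items) * support(baskets, then_items)
    if expected == 0:
        raise ValueError("one side of the rule never occurs")
    return observed / expected


print(lift(transactions, {"soda"}, {"Coke"}))       # below 1 on this data
print(lift(transactions, {"Coke"}, {"detergent"}))  # above 1 on this data
```

On these five baskets the soda and Coke rule comes out below 1 even though its confidence is 67%, because Coke already appears in 4 of the 5 baskets. With so few transactions this is only an illustration of the arithmetic.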

20

Creating Association Rules

1. Choosing the right set of items

2. Generating rules by deciphering the counts in the co-occurrence matrix

3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

21

Overcoming Practical Limits for Association Rules

1. Generate co-occurrence matrix for single items... "if Coke then soda"

2. Generate co-occurrence matrix for two items... "if Coke and Milk then soda"

3. Generate co-occurrence matrix for three items... "if Coke and Milk and Window Cleaner then soda"

4. Etc...
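The level-by-level steps above can be sketched as repeated counting of item combinations of growing size: a rule with a k-item "if" side needs counts of (k + 1)-item combinations. The function name itemset_counts is hypothetical, and the baskets are the five POS transactions from the earlier slide.

```python
from collections import Counter
from itertools import combinations

# The five point-of-sale transactions from the earlier slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def itemset_counts(baskets, size):
    """Count how many baskets contain each combination of `size` items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every combination one canonical tuple form.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return counts


# Size 2 backs rules such as "if Coke then soda";
# size 3 backs rules such as "if Coke and milk then soda".
for size in (1, 2, 3):
    counts = itemset_counts(transactions, size)
    print(size, "item(s):", len(counts), "distinct combinations seen")
```

Only combinations that actually occur in some basket are stored, which keeps the counts far smaller than the full matrix of all possible combinations.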

22

Final Thought on Association Rules: The Problem of Lots of Data

Fast food restaurant... could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700

Supermarket... 10,000 or more unique items
About 50 million 2-item combinations
More than 100 billion 3-item combinations

Use of product hierarchies (groupings) helps address this common issue

Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
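The combination counts quoted on this slide can be checked with the binomial coefficient. This is a small sketch using Python's standard math.comb (available from Python 3.8); the variable names are chosen here for illustration.

```python
from math import comb

# Number of ways to choose k distinct items out of n, ignoring order.
menu_triples = comb(100, 3)            # 3-item combinations from a 100-item menu
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 items

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```

The exact values show that the slide's "50 million" is a rounding of 49,995,000, and that the 3-item count for 10,000 items is roughly 167 billion.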

23

Business and other cases

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time.

As such, the rules found provide a guide to action: to offer a product or service (cross-sell).

34

In the retailing case of items purchased together, guidance is not so clear cut due to the extensive number of rules.

The soccer event exemplifies sequencing of events towards reaching a goal. Basketball-applied software was developed years ago. Web mining shares the same principles, without the passion usually associated with sports.

35

Challenges

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business.

The computational complexity involved in calculating the results of market basket analysis is at least the square of the number of transaction item-lines (records of every item purchased). With data warehouses storing billions of transaction lines, this yields extremely high computational requirements.

36

Solutions

Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results.

Special techniques involving filtering or aggregation of the transaction database are commonly used in analysis algorithms to increase performance and allow some level of interactivity, such as in business intelligence applications.
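One way to picture filtering and aggregation is the sketch below. It is an illustration under stated assumptions, not a description of any particular tool: the CATEGORY mapping is a made-up product hierarchy, and aggregate and drop_rare_items are hypothetical helper names.

```python
from collections import Counter

# The five point-of-sale transactions from the earlier slide.
transactions = [
    {"Coke", "soda"},
    {"milk", "Coke", "window cleaner"},
    {"Coke", "detergent"},
    {"Coke", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Hypothetical product hierarchy: item -> category (assumed for illustration).
CATEGORY = {
    "Coke": "beverages",
    "soda": "beverages",
    "milk": "dairy",
    "window cleaner": "cleaning",
    "detergent": "cleaning",
}


def aggregate(baskets, hierarchy):
    """Replace each item by its category, so fewer distinct items remain."""
    return [{hierarchy.get(item, item) for item in basket} for basket in baskets]


def drop_rare_items(baskets, min_count):
    """Filter out items that appear in fewer than `min_count` baskets."""
    counts = Counter(item for basket in baskets for item in basket)
    keep = {item for item, n in counts.items() if n >= min_count}
    return [basket & keep for basket in baskets]


print(aggregate(transactions, CATEGORY))
print(drop_rare_items(transactions, min_count=2))
```

Both steps shrink the number of distinct items before any combinations are counted, which is where the cost grows fastest.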

37

Thank You

23

Business and other cases

24

25

26

27

28

29

30

31

32

33

General Observations

The banking case seems to provide well-defined and intelligible information of the form account_1 and account_2, etc., or activity_1 and activity_2, etc., possibly indexed by time.

As such, the rules found provide a guide to action: offer a product or service (cross-sell).
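A rule of the form "account_1 and account_2" can be turned into a cross-sell list along the following lines. This is a minimal sketch: the product names, the customer data, and the function `rule_confidence` are hypothetical, and confidence is used here in its usual sense, the share of customers matching the left-hand side of the rule who also hold the right-hand side.

```python
def rule_confidence(customers, antecedent, consequent):
    """Share of customers holding every antecedent product who also hold the consequent."""
    antecedent = set(antecedent)
    holders = [c for c in customers if antecedent <= c]
    if not holders:
        return 0.0
    return sum(1 for c in holders if consequent in c) / len(holders)


# Hypothetical product holdings, one set per customer
customers = [
    {"checking", "savings", "credit_card"},
    {"checking", "savings"},
    {"checking", "savings", "credit_card"},
    {"checking"},
]

# Rule: checking and savings -> credit_card
conf = rule_confidence(customers, ["checking", "savings"], "credit_card")

# Cross-sell targets: customers who match the rule's left-hand side but lack the product
targets = [
    c for c in customers
    if {"checking", "savings"} <= c and "credit_card" not in c
]
```

Customers in `targets` are the ones to whom the rule suggests offering the product.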

34

In the retailing case of items purchased together, guidance is not so clear-cut because of the extensive number of rules.

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

24

25

26

27

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

25

26

27

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

26

27

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

27

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

28

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

29

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

30

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

31

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

32

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

33

General Observations

Banking case seems to provide Banking case seems to provide well defined and intelligible well defined and intelligible information of the forminformation of the form account_1 and account_2 etc or account_1 and account_2 etc or

activity_1 and activity_2 etc activity_1 and activity_2 etc possibly indexed by timepossibly indexed by time

As such rules found provide guide As such rules found provide guide to action to offer product or service to action to offer product or service (cross-sell)(cross-sell)

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

The computational complexity involved in The computational complexity involved in calculating the results of market basket calculating the results of market basket analysis is at least the square of the number analysis is at least the square of the number of transaction item-lines (records of every of transaction item-lines (records of every item purchased) With data warehouses item purchased) With data warehouses storing billions of transaction lines this storing billions of transaction lines this yields extremely high computational yields extremely high computational requirements requirements

36

Solutions

Differential market basket analysisDifferential market basket analysis can find interesting results and can also can find interesting results and can also eliminate the problem of a potentially eliminate the problem of a potentially high volume of trivial resultshigh volume of trivial results

Special techniques involving Special techniques involving filtering filtering or aggregationor aggregation of the transaction of the transaction database are commonly used to in database are commonly used to in analysis algorithms to increase analysis algorithms to increase performance and allow some level of performance and allow some level of interactivity such as in business interactivity such as in business intelligence applicationsintelligence applications

37

Thank You

34

In retailing case of items In retailing case of items purchased together guidance is purchased together guidance is not so clear cut due to extensive not so clear cut due to extensive number of rulesnumber of rules

Soccer event exemplifies Soccer event exemplifies sequencing of events towards sequencing of events towards reaching goal Basketball-applied reaching goal Basketball-applied software has been developed years software has been developed years ago Web mining shares the same ago Web mining shares the same principles without passion usually principles without passion usually associated with sportsassociated with sports

35

Challenges

A major difficulty is that a large number of A major difficulty is that a large number of the rules found may be trivial for anyone the rules found may be trivial for anyone familiar with the business familiar with the business

