
The Marriage of Market Basket Analysis to Predictive Modeling

Sanford Gayle

How Would You Mine This Transactional Data?

Is Data Mining Simply Market Basket Analysis?

Market Basket Analysis identifies the rule /our_company/bboard/hr/café/ … but

• How do you use this information?
• Can the information be used to develop a predictive model?
• More generally, how do you develop predictive models using transactional tables?

Data Mining Software Objectives

• Predictive Modeling
• Clustering
• Market Basket Analysis
• Feature Discovery; that is, improving the predictive accuracy of existing models

Agenda

• Converting a transactional to a modeling table

• The curse of dimensionality & possible fixes
• A feature discovery process; using market basket analysis output as an input to predictive modeling

• A dimensional reduction scheme using confidence

DM Table Structures

• Transactional tables (Market Basket Analysis)

  Trans-id   page    spend    count
  id-1       page1   $0       1
  id-1       page2   $0       1
  id-1       page3   $0       1
  id-1       page4   $19.99   1
  id-1       page5   $0       1
  id-2       page1   $0       1

• Modeling tables (modeling & clustering tools)

  Trans-id   page    spend    count
  id-1       .       $19.99   5
  id-2       .       $0       1

Converting Transactional Into Modeling Data

• Continuous variable case - easy
• Collapse the spend or count columns via the sum, mean, or frequency statistic for each transaction-id value:

  proc sql;
    create table new as
    select id, sum(amount) as total
    from old
    group by id;
  quit;

• Categorical variable case - challenging
• It seems the detail page information is lost when the rows are rolled up or collapsed
• However, with transposition you collapse the rows onto a single row for each id, with each distinct page becoming a column in the modeling table and taking the count or sum statistic as its value (see the transpose sketch below)
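A minimal sketch of that transpose step in SAS, assuming a transactional table named OLD with columns ID, PAGE, and SPEND (all names here are illustrative, not from the original slides):

  /* PROC TRANSPOSE expects the data grouped by the BY variable */
  proc sort data=old;
    by id;
  run;

  /* One output row per id; each distinct PAGE value becomes a column
     whose cell holds that id's SPEND (duplicate id/page rows would
     need to be summed first, e.g., with PROC SQL) */
  proc transpose data=old out=wide(drop=_name_);
    by id;
    id page;
    var spend;
  run;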

The Input Discovery Process

• Existing modeling table contains:
  id-1, age, income, job-category, married, recency, frequency, zip-code …
• New potential predictors from the transpose contain:
  id-1, spend on page1, spend on page2, spend on page3, spend on page4, spend on page5

• Augment existing modeling table with the new inputs and, hopefully, discover new, significant predictors to improve predictive accuracy

Problem with Transpose Method

• Suppose the server has 1,000 distinct pages; the transpose method now produces 1,000 new columns instead of 5

• Sparsity: the new columns have a preponderance of missing values; e.g., id-2, who visited only page1, will have 999 missing values and only 1 non-missing value

• Regression, Neural, and Cluster tools struggle with this many variables, especially when there is such a preponderance of the same values (e.g., zeros or missing)

The Curse of Dimensionality

• Suppose interest lies in a second classification column too; e.g., both time (hour) and page visited

• The transpose method now produces 1,000 + 24 = 1,024 new variables, assuming no interest in interactions
• If interactions are of interest, then there will be 24,000 (1,000 x 24) new variables generated

General Fix

• Reduce the number of levels of the categorical variable (e.g., using confidence)

• Use the transpose method to convert the transactional to a modeling table

• Add the new inputs to the traditional modeling table in an effort to improve predictive accuracy

Creating Rules-Based Dummy Variables

• Obtain rules using market basket analysis
• Choose the rule of interest
• Identify folks having the rule of interest in their market basket
• Create a dummy variable flagging them
• Augment the traditional modeling table with the dummy variable
• Use the dummy variable as an input or target in a predictive modeling tool

Using SQL to Identify Folks Having a Rule of Interest in Their Market Basket
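The code shown on this slide is not reproduced in the transcript; the following is a minimal PROC SQL sketch of the idea, assuming a transactional table TRANS with columns ID and PAGE and a rule of interest page1 ==> page4 (the table, column, and page names are illustrative):

  /* Keep the ids whose baskets contain both items of the rule */
  proc sql;
    create table rule_ids as
    select distinct a.id
    from trans as a, trans as b
    where a.id = b.id
      and a.page = 'page1'
      and b.page = 'page4';
  quit;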

Creating a Rule-Based Dummy Variable
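Again, the slide's code is not in the transcript; one way to build the flag, assuming the RULE_IDS table from the previous sketch and the same TRANS table (names illustrative):

  /* One row per id, with rule_flag = 1 when the rule is in the basket */
  proc sql;
    create table rule_dummy as
    select distinct t.id,
           case when r.id is not null then 1 else 0 end as rule_flag
    from trans as t
    left join rule_ids as r
      on t.id = r.id;
  quit;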

The All-Info Table
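The slide's table is not reproduced here; presumably it is the traditional modeling table joined to the new flag, along the lines of this sketch (MODELING_TABLE and ALL_INFO are illustrative names):

  /* Augment the traditional modeling table with the rule-based dummy */
  proc sql;
    create table all_info as
    select m.*, coalesce(d.rule_flag, 0) as rule_flag
    from modeling_table as m
    left join rule_dummy as d
      on m.id = d.id;
  quit;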

Feature Discovery: A new potential predictor or input

Possible Sub-setting Criteria

• Any rule of interest
• The confidence - e.g., all rules having confidence >= 100 (optimal level of confidence?)
• The support - e.g., all rules having support >= 10 (optimal level of support?)
• The lift - e.g., all rules having lift >= 5 (optimal level of lift?)

Using Confidence as the Basis for a Reclassification Scheme

• Suppose the rule diapers ==> beer has a confidence of 100%

• Then the two levels “diapers” & “beer” can be mapped into the value “diapersbeer”, it seems

• Actually, both the rule and its reverse must have a confidence of 100%

The Confidence Reclassification Scheme

• If the confidence for the rule and its opposite is > 80%, then combine the two levels into the rule-based level

• e.g., “page1” & “page2” both mapped into “page1page2”

• Using 80 instead of 100 will introduce inaccuracy, but an analyst overwhelmed with too many levels will likely be willing to trade a little accuracy for dimensional reduction
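A minimal sketch of that recoding step, assuming the TRANS table from earlier and a rule page1 <==> page2 that met the 80% confidence threshold in both directions (all names illustrative):

  /* Map the two associated levels into one combined, rule-based level */
  data trans_recode;
    set trans;
    length page_new $ 20;
    if page in ('page1', 'page2') then page_new = 'page1page2';
    else page_new = page;
  run;

The transpose step is then run on PAGE_NEW rather than PAGE, so the two original levels collapse into a single column.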

The Confidence Reclassification Scheme

• Use the transpose method to generate candidate predictors

• Augment the traditional modeling table with the new candidate predictors table

• Develop an enhanced model using some of the candidate predictors in the hope of improving predictive accuracy (see the sketch below)
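One way such an enhanced model might be fit, assuming a binary TARGET column on the ALL_INFO table from the earlier sketch, with the traditional inputs plus the new rule-based and reclassified-page predictors (all variable names are illustrative):

  /* Logistic regression with the traditional and the new candidate inputs */
  proc logistic data=all_info;
    class job_category married;
    model target(event='1') = age income recency frequency
                              rule_flag page1page2;
  run;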

Contact Information

[email protected]