fundamentals of data mining 2010

FUNDAMENTALS OF DATA MINING FOR MARKETERS James R. Stafford

Upload: jim-stafford

Post on 07-Nov-2014


DESCRIPTION

Presented at the Annual NCDM Conference Includes 7 Steps to Successful Data Mining.

TRANSCRIPT

Page 1: Fundamentals Of Data Mining 2010

FUNDAMENTALS OF DATA MINING FOR MARKETERS

James R. Stafford

Page 2: Fundamentals Of Data Mining 2010

Today’s Agenda - 7 Steps to Better Models

1. Identify the business problem

2. Data audit - what data types are most useful and how much do I need?

3. Exploratory data analysis

   1. Data quality - how to deal with missing data and outliers

   2. Identifying your most predictive variables

   3. Transforming your variables

4. Choose the best modeling approach

5. Make sure the model makes sense

6. Model validation - the Melatonin of modeling

7. When to re-build your model

Page 3: Fundamentals Of Data Mining 2010

What is the business problem?

What Should I Predict?

- Response
- Attrition/Lapse/Churn
- Reactivation
- Lifetime Value
- Profitability
- Sales

Page 4: Fundamentals Of Data Mining 2010

Data is the Key

What Data is Available to Predict Outcome?

- Primary data
- Secondary data

Page 5: Fundamentals Of Data Mining 2010

Primary Data

Transaction based
- Recency
- Frequency
- Monetary
- Products Purchased

The most important data for modeling!

Page 6: Fundamentals Of Data Mining 2010

Primary Data

Transaction based
- Recency
- Frequency
- Monetary
- Products Purchased

Demographics
- Age
- Home Ownership
- Dependents

Lifestyle
- Type of Car
- Hobbies
- Travel Preferences

The most important data for customer profiling and building acquisition models!

Page 7: Fundamentals Of Data Mining 2010

Secondary data: consumer & business

Acquired from another source

Specific or inferred
- Actual and reported by individual/household
- Modeled after similar profiles
- Pct of data specific or inferred varies

Costs vary from $2/1,000 to $50/1,000 matches

Demographics/Lifestyle
- Age
- Home Ownership
- Dependents
- Income
- Type of Car
- Travel Preferences

Business
- SIC
- Employees

Page 8: Fundamentals Of Data Mining 2010

How much data do I need?

More is better!!! ... use all of your customers if possible (train & validate)

When to sample:
- too many records
- test campaign to get response
- withhold some for model validation

Goal of sample - to be representative of your target customer population

Page 9: Fundamentals Of Data Mining 2010

How to determine acceptable sample size

$$N = \left[\frac{C}{E}\sqrt{P\,Q}\right]^2 = \frac{C^2\,P\,Q}{E^2}$$

- N = sample size
- C = confidence level (1.96 for 95% confidence)
- E = acceptable error bound (±0.001 = ±0.1% response rate)
- P = response rate from full file (e.g., 0.03 = 3%)
- Q = (1 - P)
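As a minimal sketch, the sample size calculation can be written as a small function. It assumes the standard form N = (C/E)^2 * P * Q, which reproduces the roughly 112,000 records quoted later in the deck for a 2.9% to 3.1% range; the function name and argument names are illustrative.

```python
import math

def required_sample_size(p, e, c=1.96):
    """Records needed so the observed response rate falls within +/- e of p.

    p: expected response rate (0.03 = 3%)
    e: acceptable error bound (0.001 = 0.1 percentage points)
    c: confidence multiplier (1.96 for 95% confidence)
    """
    q = 1.0 - p
    return math.ceil((c / e) ** 2 * p * q)

# 3% response rate, +/- 0.1 percentage points, 95% confidence
print(required_sample_size(p=0.03, e=0.001))
```

Loosening the error bound shrinks the requirement quickly, because N falls with the square of E.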

Page 10: Fundamentals Of Data Mining 2010

How much data do I need?

More is always better, but there are diminishing returns.

[Figure: SD of Response Rate. x-axis: # of Records (5,000 to 145,000); y-axis: SD (0 to 0.003)]
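The diminishing returns can be illustrated with the binomial standard deviation of an observed response rate, sqrt(P*Q/N). This is a sketch under the assumption of a 3% response rate (the rate used elsewhere in the deck); the record counts are chosen only for illustration.

```python
import math

def response_rate_sd(p, n):
    """Standard deviation of an observed response rate from n records."""
    return math.sqrt(p * (1.0 - p) / n)

# Assumed 3% response rate; each row shows how little extra precision
# the later increases in file size buy.
for n in (5_000, 10_000, 50_000, 145_000):
    sd = response_rate_sd(0.03, n)
    print(f"{n:>7,} records: SD = {sd:.4f}, 95% range = +/- {1.96 * sd:.4f}")
```

Because the SD falls with the square root of N, quadrupling the file only halves the error.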

Page 11: Fundamentals Of Data Mining 2010

How much data do I need? -- 10,000 records & 3.0% RR

[Figure: response rate distribution centered on 3.0%, ranging from 2.7% to 3.3% (95 times out of 100!)]

Page 12: Fundamentals Of Data Mining 2010

What’s the minimum sample size I need to get a range of 2.9% to 3.1%?

[Figure: response rate distribution centered on 3.0%, ranging from 2.9% to 3.1% (95 times out of 100!)]

112,000

Page 13: Fundamentals Of Data Mining 2010

How much data do I need?

Minimums

- Response models - minimum of 300 customers that behaved in the “desired” way.
- Lifetime value models - at least 300 customers/records.

Page 14: Fundamentals Of Data Mining 2010

EDA - How to deal with missing data and outliers

Missing data - blanks, “NA”
- Use/recode: -999, may be meaningful, e.g., lots of missing data can be important in fraud detection
- Substitute - mean, median or mode
- Delete records from analysis

Outliers - data outside of reasonable bounds
- customer age = 170
- customer balance = $1.5M ($10,000 = other max value)
- identify with plots: frequency distributions/histograms
- Use, substitute or delete
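The three options for missing data (recode, substitute, delete) can be sketched as follows. The helper names and the sample ages are invented for illustration; the -999 sentinel is the one mentioned on the slide, and the median stands in for "mean, median or mode".

```python
from statistics import median

MISSING_TOKENS = {"", "NA"}  # blanks and "NA", as listed on the slide

def is_missing(value):
    return value is None or str(value).strip() in MISSING_TOKENS

def recode_missing(values, sentinel=-999):
    """Option 1: keep the record and flag the gap with a sentinel value."""
    return [sentinel if is_missing(v) else float(v) for v in values]

def substitute_median(values):
    """Option 2: fill gaps with the median of the known values."""
    known = [float(v) for v in values if not is_missing(v)]
    fill = median(known)
    return [fill if is_missing(v) else float(v) for v in values]

def delete_missing(values):
    """Option 3: drop the incomplete records from the analysis."""
    return [float(v) for v in values if not is_missing(v)]

ages = ["34", "NA", "52", "", "47"]  # hypothetical customer ages
print(recode_missing(ages))
print(substitute_median(ages))
print(delete_missing(ages))
```

Recoding keeps the fact that a value was missing available to the model, which is the point the slide makes about fraud detection.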

Page 15: Fundamentals Of Data Mining 2010

Locate Outliers

[Figure: plot of Customer Age (AVG = 52.1) with an outlier marked]

Page 16: Fundamentals Of Data Mining 2010

Effect of outliers

[Figure: scatter plot of LTV versus Number of purchases, with outliers marked and a label Xo]

Page 17: Fundamentals Of Data Mining 2010

Effect of outliers

[Figure: scatter plot of LTV versus Number of purchases, with labels Xcorrect and Xoutlier]

Page 18: Fundamentals Of Data Mining 2010

Replace outliers

Numeric data (age, # purchases, LTV, ...)
- mean
- median
- mode

ASCII string (sex, lifestyle code, …)
- mode
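A hedged sketch of outlier replacement along these lines: numeric values outside reasonable bounds are replaced with the median of the in-range values, and invalid string codes with the mode. The bounds (18 to 110), the valid code set and the sample data are assumptions made for the example.

```python
from statistics import median, mode

def replace_numeric_outliers(values, low, high):
    """Replace values outside [low, high] with the median of the in-range values."""
    in_range = [v for v in values if low <= v <= high]
    fill = median(in_range)
    return [v if low <= v <= high else fill for v in values]

def replace_invalid_codes(values, valid_codes):
    """Replace codes outside valid_codes with the most common valid code."""
    valid = [v for v in values if v in valid_codes]
    fill = mode(valid)
    return [v if v in valid_codes else fill for v in values]

ages = [45, 52, 170, 38, 61]            # 170 is the slide's outlier example
sexes = ["M", "F", "F", "X", "F"]       # "X" is a hypothetical bad code
print(replace_numeric_outliers(ages, low=18, high=110))
print(replace_invalid_codes(sexes, valid_codes={"M", "F"}))
```

The median is used for the numeric case because, unlike the mean, it is not itself pulled by the outlier when computed before cleaning.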

Page 19: Fundamentals Of Data Mining 2010

EDA - Finding your most predictive variables

Correlation matrix

Cross-tabs

Scatter plots

Stepwise regression

CHAID

Page 20: Fundamentals Of Data Mining 2010

Correlation matrix

           LTV    NoPur  MSLPur  HHInc  PPHH   EducLevel  AAgeMale  AAgeFem
LTV        1.00   0.78   0.43    0.62   0.19   0.23       -0.34     -0.23
NoPur      0.78   1.00   0.55    0.47   0.24   0.15       -0.22     -0.30
MSLPur     0.43   0.55   1.00    0.21   0.43   0.33       -0.13     -0.05
HHInc      0.62   0.47   0.21    1.00   0.18   0.76        0.38      0.24
PPHH       0.19   0.24   0.43    0.18   1.00   0.21        0.34      0.37
EducLevel  0.23   0.15   0.33    0.76   0.21   1.00        0.41      0.46
AAgeMale  -0.34  -0.22  -0.13    0.38   0.34   0.41        1.00      0.82
AAgeFem   -0.23  -0.30  -0.05    0.24   0.37   0.46        0.82      1.00

Fitness Product Line
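A correlation matrix like this one can be computed with a few lines of code. The sketch below uses the Pearson coefficient; the three columns borrow variable names from the slide, but the values are invented and do not reproduce the figures above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented sample data, for illustration only.
columns = {
    "LTV": [120, 340, 560, 410, 90],
    "NoPur": [1, 3, 6, 4, 1],
    "AAgeMale": [60, 45, 30, 38, 65],
}
for name, row in correlation_matrix(columns).items():
    print(name.ljust(9), "  ".join(f"{v:6.2f}" for v in row.values()))
```

Reading down the LTV column is the quickest way to shortlist candidate predictors; high correlations between two predictors flag possible collinearity.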

Page 21: Fundamentals Of Data Mining 2010

Scatter plot

[Figure: two scatter plots of LTV versus AvgHH Income, one showing no relationship and one showing a positive relationship]

Page 22: Fundamentals Of Data Mining 2010

Stepwise regression

Automated regressions to identify most predictive variables

1st regression finds the single most predictive variable

2nd regression keeps the most predictive variable and finds the 2nd most predictive variable that is significant given the presence of the 1st

Addresses collinearity
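The procedure can be sketched as a forward selection loop. Note the simplification: instead of a significance test, this sketch keeps a variable only if it removes at least `min_gain` (an assumed 5%) of the total sum of squares. The least-squares helper, the variable names and the data are all invented for illustration; `purchases_alt` is a near-copy of `purchases` to show how a collinear variable is left out.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def forward_stepwise(data, y, min_gain=0.05):
    """Add the best remaining variable until the gain drops below min_gain."""
    total = rss([], y)  # intercept-only fit, i.e. the total sum of squares
    chosen, remaining, current = [], list(data), total
    while remaining:
        trial = {name: rss([data[c] for c in chosen + [name]], y) for name in remaining}
        best = min(trial, key=trial.get)
        if (current - trial[best]) / total < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = trial[best]
    return chosen

# Invented data: ltv depends on purchases and income.
data = {
    "purchases": [1, 2, 3, 4, 5, 6, 7, 8],
    "purchases_alt": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
    "income": [5, 1, 4, 2, 6, 3, 8, 7],
}
ltv = [13.3, 7.8, 17.1, 15.6, 27.2, 23.9, 37.4, 37.7]
print(forward_stepwise(data, ltv))
```

Once one of the two purchase variables is in the model, the other adds almost nothing, so the loop stops after two variables. This is the sense in which stepwise selection addresses collinearity.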

Page 23: Fundamentals Of Data Mining 2010

CHAID

Disability Insurance: 4.5% RR
    Males: 3% RR
        <25: 1.2% RR
        >=25: 3.9% RR
    Females: 6% RR
        % 1 person HH’r: 2.4% RR
        % >1 person HH’r: 9.3% RR

Identify important variables and variable interactions
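The core of a CHAID-style split can be sketched with a chi-square test of each candidate predictor against the response flag, picking the predictor with the largest statistic. This is a simplification: full CHAID also merges categories and compares adjusted p-values rather than raw statistics, so the comparison below is only fair between predictors with the same number of levels. The records are invented.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def best_split(records, predictors, target):
    """Return the predictor whose levels differ most in response rate."""
    def table_for(name):
        levels = sorted({r[name] for r in records})
        return [[sum(1 for r in records if r[name] == lv and r[target] == t)
                 for t in (0, 1)] for lv in levels]
    return max(predictors, key=lambda name: chi_square(table_for(name)))

# Invented counts: (sex, renter, responded, number of records)
counts = [("M", "Y", 1, 2), ("M", "N", 1, 1), ("M", "Y", 0, 48), ("M", "N", 0, 49),
          ("F", "Y", 1, 3), ("F", "N", 1, 3), ("F", "Y", 0, 47), ("F", "N", 0, 47)]
records = [{"sex": s, "renter": r, "responded": t}
           for s, r, t, n in counts for _ in range(n)]
print(best_split(records, ["sex", "renter"], "responded"))
```

Applying the same selection again inside each resulting group is what produces the tree, and is how variable interactions surface.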

Page 24: Fundamentals Of Data Mining 2010

Variable transformations

Based on relationship with dependent/output variable (logs, x^2, 1/x, etc.)

Based on characteristics of the data
- ZIP - SCF/1st 3 characters
- Phone number - area code
- Birth date - age
- Purchase date - month purchased ==> seasonality
- Purchase date - months since last purchase

Modeling methodology used (CHAID requires binning)
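The data-driven transformations listed above can be sketched as one function that derives new fields from a raw customer record. The field names, the digits-only 10-digit phone format and the sample record are assumptions made for the example.

```python
from datetime import date

def derive_features(record, as_of):
    """Derive model-ready fields from raw ZIP, phone and date fields."""
    birth, purchase = record["birth_date"], record["last_purchase_date"]
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    age = as_of.year - birth.year - ((as_of.month, as_of.day) < (birth.month, birth.day))
    months_since = (as_of.year - purchase.year) * 12 + (as_of.month - purchase.month)
    return {
        "scf": record["zip"][:3],            # first 3 characters of the ZIP
        "area_code": record["phone"][:3],    # assumes digits only, area code first
        "age": age,
        "purchase_month": purchase.month,    # proxy for seasonality
        "months_since_last_purchase": months_since,
    }

customer = {  # hypothetical record
    "zip": "60614",
    "phone": "3125550142",
    "birth_date": date(1970, 6, 15),
    "last_purchase_date": date(2009, 11, 20),
}
print(derive_features(customer, as_of=date(2010, 3, 1)))
```

Passing `as_of` explicitly keeps the derived ages and recencies reproducible when the model is rescored later.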

Page 25: Fundamentals Of Data Mining 2010

Variable transformations

[Figure: two scatter plots of LTV versus AvgAgeMale: not transformed, and transformed with 1/x … or, 1/AvgAgeMale]

Page 26: Fundamentals Of Data Mining 2010

Variable transformations

[Figure: scatter plot of LTV versus Number of purchases, labeled X]

Page 27: Fundamentals Of Data Mining 2010

Variable transformations

[Figure: scatter plot of LTV versus Number of purchases, labeled X and X^2]

Page 28: Fundamentals Of Data Mining 2010

Variable transformations

[Figure: scatter plot of LTV versus Household Income, labeled X]

Page 29: Fundamentals Of Data Mining 2010

Variable transformations

[Figure: scatter plot of LTV versus Household Income, labeled X and ln(x) … or, ln(HHInc)]
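A quick way to check whether a transformation such as ln(HHInc) helps is to compare the linear correlation before and after. In this sketch the data are invented so that LTV rises by a fixed amount each time income doubles, which is the pattern a log transform straightens out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented data: income doubles at each step, LTV rises by 100 each time.
hh_income = [20_000, 40_000, 80_000, 160_000, 320_000]
ltv = [100, 200, 300, 400, 500]

raw_r = pearson(hh_income, ltv)
log_r = pearson([math.log(v) for v in hh_income], ltv)
print(f"correlation with HHInc:     {raw_r:.3f}")
print(f"correlation with ln(HHInc): {log_r:.3f}")
```

The same comparison works for 1/x or x^2; whichever version of the variable correlates best with the output is the one to carry into a linear model.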

Page 30: Fundamentals Of Data Mining 2010

Choose the best modeling approach

- Gut-feel: expert opinion
- RFM: non-statistical segmentation scheme
- CHAID: segmentation scheme
- Regression: statistical analysis
- Neural Nets: statistical analysis

Page 31: Fundamentals Of Data Mining 2010

What modeling approach should be used?

If the business problem has a...

limited number of answers
- Predict yes/no, response/non-response
- Output is binary, e.g., 0/1
- Who is likely to respond?
- Which purchasers of product A will also purchase product B?
- Who is most likely to lapse?

wide range of answers
- Predict amount spent
- Output is numerical, e.g., $2,563
- What is the expected lifetime value of my customers?
- How much will customers invest?

Page 32: Fundamentals Of Data Mining 2010

Which approach should be used?

If the business problem has a...

limited number of answers: RFM, CHAID, Linear regression, Logistic regression, Neural nets

wide range of answers: Linear regression, Logistic regression, CHAID, Neural nets

Page 33: Fundamentals Of Data Mining 2010

Does the model make sense?

Response Rate by Product Type

                            Inferior  Standard  Superior/Luxury
Recency                        +         +            +
Frequency                      +         +            +
Monetary                       +         +            +
Income                         -         +            +
Price                          -         -            +
Age                            ?         ?            ?
Sex                            ?         ?            ?
Race                           ?         ?            ?
City v. suburban dweller       ?         ?            ?

? => Depends on the specific type of product

Relies on your knowledge of your product and the market!

Page 34: Fundamentals Of Data Mining 2010

The business problem

National Basketball Association Team
- Declining attendance
- Expanding to new stadium with more seats

Marketing Objectives
- Up-sell: Mini-plan to Season ticket holders
- Prospecting: identify Season ticket plan prospects

Page 35: Fundamentals Of Data Mining 2010

Use best modeling approach

Page 36: Fundamentals Of Data Mining 2010

Appraise results - gains chart for our best model

Page 37: Fundamentals Of Data Mining 2010

Does the model make sense?

Most Important Variables

Cluster code

Home value

Age-male

Home value>=100K

# of HHs

# of seats

Page 38: Fundamentals Of Data Mining 2010

Does the model make sense -- what do my customers look like?


Page 42: Fundamentals Of Data Mining 2010

PRIZM cluster composition for segments

Modeled
Segment     C1     C2     S1     S2     S3     U1     U3
 1          1.6    2.4   31.2    4.0   12.8   12.0   34.4
 2          2.4   16.3   56.5    1.6    0.8   13.7    4.0

10          5.5   28.4   11.0    5.5   18.1    1.6    4.7

19          2.8    2.8   10.1   32.1    4.6    2.8    0.0
20          5.5    0.9   18.4   22.9    2.8    0.0    0.0

TOTAL       6.0    9.5   24.0   14.9   11.0    6.1    4.9

Page 43: Fundamentals Of Data Mining 2010

Summary profile of “the” best segment

S1 - Elite Suburbs
- Wealthy whites, Asians and Arabic
- High spending levels
- Highest income
- High education
- High investment

U3 - Urban Cores
- Multi-racial
- Multi-lingual
- Dense/urban
- Home & apartment renters
- High % of singles
- High % of single parents
- High unemployment
- Lowest income group

Page 44: Fundamentals Of Data Mining 2010

How Can You Use This Information?

For each major customer segment, you can...
- Develop different messages
- Use different media/marketing approaches to reach them
- Buy prospect lists based on best segment profiles
- Develop retention and prospecting plans with customized offers (e.g., free CDs based on their particular tastes in music)

===>> improved customer up-sell and retention and better prospecting!

Page 45: Fundamentals Of Data Mining 2010

PRIZM cluster composition for segments

Modeled
Segment     C1     C2     S1     S2     S3     U1     U3
 1          1.6    2.4   31.2    4.0   12.8   12.0   34.4
 2          2.4   16.3   56.5    1.6    0.8   13.7    4.0

10          5.5   28.4   11.0    5.5   18.1    1.6    4.7

19          2.8    2.8   10.1   32.1    4.6    2.8    0.0
20          5.5    0.9   18.4   22.9    2.8    0.0    0.0

TOTAL       6.0    9.5   24.0   14.9   11.0    6.1    4.9

S1 = Elite Suburbs; U3 = Urban Cores

Top demi-decile, i.e., those most likely to become season ticket holders

Page 46: Fundamentals Of Data Mining 2010

Potential marketing plans

              S1                           U3
Giveaways     1,000 FF miles               Mini-music system
              CD - Classical/Jazz          CD - Jazz/Rock
              Free WSJ sub                 Free Consumer Report sub
Contests      1 trip to the Master's,      1 trip to Super Bowl,
              or… the NBA finals           or… the NBA finals
              50 Montblanc pens            50 pairs Adidas/Nike
Advertise     Jazz stations                Jazz stations
              Classical stations           Rock stations
              Local Business section       Local Classified section

Page 47: Fundamentals Of Data Mining 2010

Model “validation”

At the PC
- error distribution
- out-of-sample validation
- cross-validation (good for small databases)

In the field
- test marketing campaign
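Cross-validation reuses a small file by rotating which part is held out. A minimal sketch of the k-fold index bookkeeping, assuming 5 folds and a fixed random seed (both arbitrary choices here):

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Yield (training, validation) index lists; each record is validated once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for training, validation in k_fold_indices(10, k=5):
    print("train on", sorted(training), "validate on", sorted(validation))
```

The model is fitted k times, once per training list, and the k validation results are averaged to estimate how it would perform on new customers.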

Page 48: Fundamentals Of Data Mining 2010

Check distribution of errors

Page 49: Fundamentals Of Data Mining 2010

Out-of-sample validation

Full Customer File ==> randomly extracted ==> Training File and Validation File

Model built on the Training File: ProbResponse = .003*HHINC + .1*NumPurch

Compare the Training file lift chart with the Validation file lift chart
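The out-of-sample check can be sketched end to end: split the file at random, score both halves with the same equation, and compare lift by score group. The scoring equation is the illustrative one from the slide; the customer file, the 70/30 split and the five groups are assumptions made for the example.

```python
import random

def score(customer):
    # Illustrative scoring equation quoted on the slide.
    return 0.003 * customer["hhinc"] + 0.1 * customer["num_purch"]

def split(customers, validation_share=0.3, seed=0):
    """Randomly extract a training file and a validation file."""
    shuffled = customers[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]

def lift_by_group(customers, groups=5):
    """Response rate of each score-ranked group divided by the overall rate."""
    ranked = sorted(customers, key=score, reverse=True)
    overall = sum(c["responded"] for c in ranked) / len(ranked)
    size = len(ranked) // groups
    return [
        sum(c["responded"] for c in ranked[g * size:(g + 1) * size]) / size / overall
        for g in range(groups)
    ]

# Invented, deterministic customer file for illustration only.
customers = [
    {"hhinc": 20 + (i * 37) % 130, "num_purch": i % 11,
     "responded": int(i % 11 >= 8 and i % 3 == 0)}
    for i in range(1000)
]
training, validation = split(customers)
print("training lift:  ", [round(x, 2) for x in lift_by_group(training)])
print("validation lift:", [round(x, 2) for x in lift_by_group(validation)])
```

If the validation lifts track the training lifts group by group, the model is generalizing; a validation curve that flattens toward 1.0 suggests the model has fitted noise in the training file.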

Page 50: Fundamentals Of Data Mining 2010

PC validation

[Figure: Training Data versus Validation Data]

Page 51: Fundamentals Of Data Mining 2010

When is it time to lay an old model to rest?

When response rates start to decline

Response Rates
Jan-Feb   4.25
Mar-Apr   4.2
May-Jun   4.6
Jul-Aug   3.5
Sep-Oct   3.1
Nov-Dec   2.8
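A simple monitoring rule can turn this table into a rebuild signal. The sketch below flags a model when the latest response rate has fallen well below an early baseline and the decline has persisted for several periods; the 20% drop, three-period baseline and three-decline thresholds are hypothetical and would need tuning for a real program.

```python
def consecutive_declines(values):
    """Number of period-over-period drops at the end of the series."""
    count = 0
    for prev, cur in zip(reversed(values[:-1]), reversed(values[1:])):
        if cur < prev:
            count += 1
        else:
            break
    return count

def needs_rebuild(values, baseline_periods=3, max_drop=0.20, min_declines=3):
    """True when the latest rate is well below baseline after a sustained decline."""
    baseline = sum(values[:baseline_periods]) / baseline_periods
    drop = (baseline - values[-1]) / baseline
    return drop > max_drop and consecutive_declines(values) >= min_declines

# Response rates from the slide, Jan-Feb through Nov-Dec.
response_rates = [4.25, 4.2, 4.6, 3.5, 3.1, 2.8]
print(consecutive_declines(response_rates))
print(needs_rebuild(response_rates))
```

Requiring both a large drop and a sustained run avoids rebuilding after a single weak campaign.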

Page 52: Fundamentals Of Data Mining 2010

When is it time to lay an old model to rest?

When response rates start to decline

When significant events occur
- new competitive product introductions

Local ISP internet access v. cable access/DSL

Page 53: Fundamentals Of Data Mining 2010

When is it time to lay an old model to rest?

When response rates start to decline

When significant events occur
- new competitive product introductions
- significant price changes

Digital photography

Page 54: Fundamentals Of Data Mining 2010

When is it time to lay an old model to rest?

When response rates start to decline

When significant events occur
- new competitive product introductions
- significant price changes
- changes in the economy

Demand for luxury products (boats) in a recession

Page 55: Fundamentals Of Data Mining 2010

When is it time to lay an old model to rest?

When response rates start to decline

When significant events occur
- new competitive product introductions
- significant price changes
- changes in the economy
- news events which affect attitudes about your product -- for better or worse

Mortgage bankers

Page 56: Fundamentals Of Data Mining 2010

Summary - 7 Steps to Better Models

1. Identify the business problem

2. Data audit - what data types are most useful and how much do I need?

3. Exploratory data analysis (Data visualization!)

   1. Data quality - how to deal with missing data and outliers

   2. Identifying your most predictive variables

   3. Transforming your variables

4. Choose the best modeling approach - a cookbook approach

5. Make sure the model makes sense

6. Model validation - the Melatonin of modeling

7. Model maintenance

Page 57: Fundamentals Of Data Mining 2010

FUNDAMENTALS OF PREDICTIVE CUSTOMER MODELING

James R. Stafford