fundamentals of data mining 2010

FUNDAMENTALS OF DATA MINING FOR MARKETERS

James R. Stafford

Today’s Agenda - 7 Steps to Better Models

1. Identify the business problem

2. Data audit - what data types are most useful and how much do I need?

3. Exploratory data analysis

   1. Data quality - how to deal with missing data and outliers

   2. Identifying your most predictive variables

   3. Transforming your variables

4. Choose the best modeling approach

5. Make sure the model makes sense

6. Model validation - the Melatonin of modeling

7. When to re-build your model

What is the business problem?

Response
Attrition/Lapse/Churn
Reactivation
Lifetime Value
Profitability
Sales

What Should I Predict?

Data is the Key

Primary data
Secondary data

What Data is Available to Predict Outcome?

Primary Data

Transaction based: Recency, Frequency, Monetary, Products Purchased

The most important data for modeling!

Primary Data

Transaction based: Recency, Frequency, Monetary, Products Purchased

Demographics: Age, Home Ownership, Dependents

Lifestyle: Type of Car, Hobbies, Travel Preferences

The most important data for customer profiling and building acquisition models!

Secondary data: consumer & business

Acquired from another source

Specific or inferred
  Specific: actual and reported by individual/household
  Inferred: modeled after similar profiles
  Pct of data that is specific or inferred varies

Costs vary from $2/1,000 to $50/1,000 matches

Consumer (Demographics/Lifestyle): Age, Home Ownership, Dependents, Income, Type of Car, Travel Preferences

Business: SIC, Employees

How much data do I need?

More is better!!! ... use all of your customers if possible (train & validate)

When to sample: too many records; test campaign to get response; withhold some for model validation

Goal of sample - to be representative of your target customer population

How to determine acceptable sample size

N = sample size
C = confidence level (1.96 for 95% confidence)
E = acceptable error bound (±0.001 = ±0.1% response rate)
P = response rate from full file (e.g., 0.03 = 3%)
Q = (1 - P)

N = [ (C/E) * sqrt(P*Q) ]^2 = (C/E)^2 * P * Q
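As a quick check of the sample size formula, here is a minimal Python sketch. The function names `sample_size` and `error_bound` are illustrative and not from the slides; the inputs (P = 0.03, E = 0.001, C = 1.96) are the example values given in the definitions.

```python
import math

def sample_size(p: float, e: float, c: float = 1.96) -> int:
    """Records needed so the observed response rate is within +/- e of p."""
    q = 1.0 - p
    return math.ceil((c / e) ** 2 * p * q)

def error_bound(p: float, n: int, c: float = 1.96) -> float:
    """Half-width of the confidence interval for a file of n records."""
    return c * math.sqrt(p * (1.0 - p) / n)

# 3% response rate, +/- 0.1% error bound, 95% confidence
print(sample_size(p=0.03, e=0.001))      # roughly 112,000 records

# Error bound for a 10,000-record file at a 3% response rate
print(round(error_bound(p=0.03, n=10_000), 4))
```

Solving the same relationship for E instead of N shows how wide the interval is for a file of a given size.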


More is always better, but there are diminishing returns.

[Chart: SD of Response Rate (y-axis, 0 to 0.003) vs. # of Records (x-axis, 5,000 to 145,000)]
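The diminishing returns can be sketched numerically. This assumes the chart plots the standard deviation of the response rate, sqrt(P*Q/N), and assumes P = 0.03; neither assumption is stated on the slide.

```python
import math

def sd_of_response_rate(p: float, n: int) -> float:
    """Standard deviation of an observed response rate for n records."""
    return math.sqrt(p * (1.0 - p) / n)

P = 0.03  # assumed response rate, not stated on the chart
sizes = list(range(5_000, 145_001, 10_000))  # same range as the x-axis
sds = [sd_of_response_rate(P, n) for n in sizes]

for n, sd in zip(sizes, sds):
    print(f"{n:>7,} records  SD = {sd:.5f}")
```

Each additional 10,000 records reduces the SD by less than the previous 10,000 did, which is the flattening curve the chart describes.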

How much data do I need? -- 10,000 records & 3.0% RR

3.0% ± 0.3%, i.e., between 2.7% and 3.3% (95 times out of 100!)

What’s the minimum sample size I need to get 2.9% <=> 3.1%?

3.0% ± 0.1%, i.e., between 2.9% and 3.1% (95 times out of 100!): 112,000 records


Minimums

Response models - minimum of 300 customers that behaved in the “desired” way.

Lifetime value models - at least 300 customers/records.

EDA - How to deal with missing data and outliers

Missing data - blanks, “NA”
  Use/recode: -999; may be meaningful, e.g., lots of missing data can be important in fraud detection
  Substitute - mean, median or mode
  Delete records from analysis

Outliers - data outside of reasonable bounds
  customer age = 170
  customer balance = $1.5M (next-highest value = $10,000)
  identify with plots: frequency distributions/histograms
  Use, substitute or delete
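A minimal sketch of the recode and flag steps in plain Python. The -999 code and the age of 170 come from the slide; the function names and the 18 to 110 age bounds are assumptions chosen for illustration.

```python
MISSING_TOKENS = {"", "NA", None}   # blanks and "NA" count as missing
MISSING_CODE = -999                 # recode value named on the slide

def recode_missing(values):
    """Replace blanks/"NA" with -999 so missingness itself can be analyzed."""
    return [MISSING_CODE if v in MISSING_TOKENS else v for v in values]

def flag_outliers(values, low, high):
    """Return the non-missing values that fall outside [low, high]."""
    return [v for v in values if v != MISSING_CODE and not (low <= v <= high)]

ages = [34, "NA", 52, 170, "", 47, 61]
recoded = recode_missing(ages)

# Assumed "reasonable bounds" for customer age: 18 to 110
outliers = flag_outliers(recoded, low=18, high=110)

print(recoded)
print(outliers)
```

Whether a flagged value is then used, substituted, or deleted remains a judgment call, as the slide notes.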

Locate Outliers

[Histogram: Customer Age, AVG = 52.1, with an outlier marked]

Effect of outliers

[Scatter plot: LTV vs. Number of purchases, with a line labeled Xo]

[Scatter plot: LTV vs. Number of purchases with outliers added, comparing lines labeled Xcorrect and Xoutlier]
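To illustrate how a few outliers can pull a fitted line away from the bulk of the data, here is a small least-squares sketch. The LTV and purchase figures are invented for illustration and are not from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical customers: LTV rises about $100 per purchase
purchases = [1, 2, 3, 4, 5, 6, 7, 8]
ltv = [100, 210, 290, 405, 500, 610, 690, 800]

clean_slope, _ = fit_line(purchases, ltv)

# Two hypothetical outliers: few purchases but very high recorded LTV
outlier_slope, _ = fit_line(purchases + [1, 2], ltv + [900, 950])

print(round(clean_slope, 1), round(outlier_slope, 1))
```

With these made-up numbers, two bad records out of ten cut the estimated slope to about a third of its clean value.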

Replace outliers

Numeric data (age, # purchases, LTV, ...): mean, median or mode

ASCII string (sex, lifestyle code, …): mode
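One way to sketch the substitution step with Python's standard `statistics` module. The bounds, the allowed codes, and the choice of the median for numeric data are assumptions for illustration; the slide lists mean, median, and mode as options.

```python
from statistics import median, mode

def replace_numeric_outliers(values, low, high):
    """Replace values outside [low, high] with the median of the valid values."""
    valid = [v for v in values if low <= v <= high]
    fill = median(valid)
    return [v if low <= v <= high else fill for v in values]

def replace_invalid_codes(values, allowed):
    """Replace codes outside the allowed set with the most common valid code."""
    valid = [v for v in values if v in allowed]
    fill = mode(valid)
    return [v if v in allowed else fill for v in values]

ages = [34, 52, 170, 47, 61]
clean_ages = replace_numeric_outliers(ages, low=18, high=110)

sex = ["F", "M", "F", "?", "F"]
clean_sex = replace_invalid_codes(sex, allowed={"F", "M"})

print(clean_ages)
print(clean_sex)
```

The fill value is computed from the valid records only, so the outlier does not contaminate its own replacement.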

EDA - Finding your most predictive variables

Correlation matrix

Cross-tabs

Scatter plots

Stepwise regression

CHAID

Correlation matrix

            LTV    NoPur  MSLPur  HHInc  PPHH   EducLevel  AAgeMale  AAgeFem
LTV         1.00   0.78   0.43    0.62   0.19   0.23       -0.34     -0.23
NoPur       0.78   1.00   0.55    0.47   0.24   0.15       -0.22     -0.30
MSLPur      0.43   0.55   1.00    0.21   0.43   0.33       -0.13     -0.05
HHInc       0.62   0.47   0.21    1.00   0.18   0.76        0.38      0.24
PPHH        0.19   0.24   0.43    0.18   1.00   0.21        0.34      0.37
EducLevel   0.23   0.15   0.33    0.76   0.21   1.00        0.41      0.46
AAgeMale   -0.34  -0.22  -0.13    0.38   0.34   0.41        1.00      0.82
AAgeFem    -0.23  -0.30  -0.05    0.24   0.37   0.46        0.82      1.00

Fitness Product Line
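A correlation matrix like this one can be computed with a few lines of plain Python. The sketch below uses the Pearson coefficient; the three columns of customer data are invented for illustration and only borrow their names from the matrix.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data for six customers
data = {
    "LTV": [120, 340, 560, 410, 980, 760],
    "NoPur": [1, 3, 5, 4, 9, 6],
    "AAgeMale": [61, 55, 40, 48, 29, 35],
}
matrix = correlation_matrix(data)

for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Reading down the LTV row gives a first ranking of candidate predictors, positive or negative.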

Scatter plot

[Scatter plots: LTV vs. AvgHH Income; one panel shows no relationship, the other a positive relationship]

Stepwise regression

Automated regressions to identify most predictive variables

1st regression finds the single most predictive variable

2nd regression keeps the most predictive variable and finds the 2nd most predictive variable that is significant given the presence of the 1st

Addresses collinearity
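The idea can be sketched as a forward selection loop in plain Python. This is a simplified stand-in, not a full stepwise procedure: it adds the variable that most reduces the residual sum of squares and stops when the gain in R-squared falls below an assumed 0.01 threshold, instead of running a significance test. The data, variable names, and threshold are hypothetical, and the solver assumes no two predictors are perfectly collinear.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on columns plus an intercept."""
    xs = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
    xty = [sum(p * q for p, q in zip(u, y)) for u in xs]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, xs)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_stepwise(data, y, min_gain=0.01):
    """Add variables one at a time while each adds at least min_gain to R-squared."""
    total = rss([], y)  # sum of squares around the mean
    chosen, remaining, current = [], list(data), total
    while remaining:
        trial = {name: rss([data[c] for c in chosen + [name]], y) for name in remaining}
        best = min(trial, key=trial.get)
        if current - trial[best] < min_gain * total:
            break
        chosen.append(best)
        remaining.remove(best)
        current = trial[best]
    return chosen

# Hypothetical data: TotalOrders is nearly a copy of NoPur (collinear)
data = {
    "NoPur": [1, 2, 3, 4, 5, 6, 7, 8],
    "TotalOrders": [1, 2, 4, 4, 5, 7, 7, 8],
    "HHInc": [50, 30, 70, 40, 90, 20, 60, 80],
}
ltv = [203, 258, 441, 476, 682, 639, 824, 957]

selected = forward_stepwise(data, ltv)
print(selected)
```

In this made-up example the near-duplicate variable is left out once the first one is in the model, which is the sense in which the procedure addresses collinearity.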

CHAID

Disability Insurance: 4.5% RR
  Males: 3% RR
    <25: 1.2% RR
    >=25: 3.9% RR
  Females: 6% RR
    % 1 person HH’r: 2.4% RR
    % >1 person HH’r: 9.3% RR

Identify important variables and variable interactions
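CHAID chooses each split with a chi-square test of independence between a candidate variable and the response. The sketch below shows only that selection step for two candidate variables. The counts are invented so that the Sex split reproduces the 3% and 6% response rates in the tree; the HomeOwner variable is hypothetical. A full CHAID implementation also merges categories and adjusts p-values, which is omitted here.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows are categories; columns are [responders, non-responders]
candidates = {
    "Sex": [[300, 9700], [600, 9400]],        # males 3% RR, females 6% RR
    "HomeOwner": [[440, 9560], [460, 9540]],  # hypothetical: 4.4% vs. 4.6%
}

scores = {name: chi_square(table) for name, table in candidates.items()}
best_split = max(scores, key=scores.get)

print({name: round(s, 2) for name, s in scores.items()})
print(best_split)
```

Repeating the same test inside each resulting segment is what surfaces interactions, such as a variable that matters for one sex but not the other.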

Variable transformations

Based on relationship with dependent/output variable (logs, x^2, 1/x, etc.)

Based on characteristics of the data
  ZIP - SCF/1st 3 characters
  Phone number - area code
  Birth date - age
  Purchase date - month purchased ==> seasonality
  Purchase date - months since last purchase

Modeling methodology used (CHAID requires binning)
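The data-driven transformations listed above might look like this in Python. The function names, the 10-digit North American phone format, the reference date, and the bin edges are all assumptions made for this sketch.

```python
from datetime import date

def zip_to_scf(zip_code: str) -> str:
    """SCF is the first 3 characters of the ZIP code."""
    return zip_code[:3]

def area_code(phone: str) -> str:
    """Area code of a phone number, assuming 10 digits after any country code."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[-10:-7]

def age_on(birth: date, as_of: date) -> int:
    """Age in whole years on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)

def months_since(purchase: date, as_of: date) -> int:
    """Calendar months between the purchase date and the as_of date."""
    return (as_of.year - purchase.year) * 12 + (as_of.month - purchase.month)

def bin_index(value, edges):
    """Bin number for a value: how many bin edges it has reached."""
    return sum(value >= edge for edge in edges)

as_of = date(2010, 3, 1)           # assumed scoring date
last_purchase = date(2009, 11, 20)

print(zip_to_scf("60614"))
print(area_code("(312) 555-0142"))
print(age_on(date(1975, 6, 15), as_of))
print(last_purchase.month)          # month purchased, for seasonality
print(months_since(last_purchase, as_of))
print(bin_index(34, edges=[25, 35, 50, 65]))
```

The last helper is the kind of binning a method such as CHAID needs before it can treat a numeric field as categories.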


[Scatter plots: LTV vs. AvgAgeMale; not transformed, and after a 1/x transformation, i.e., 1/AvgAgeMale]


[Scatter plots: LTV vs. Number of purchases; one panel labeled X, the other labeled X and X^2]


[Scatter plots: LTV vs. Household Income; one panel labeled X, the other ln(x), i.e., ln(HHInc)]
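One rough way to choose among such transformations is to compare how strongly each transformed variable correlates linearly with the output. The sketch below does this for x, ln(x), 1/x, and x^2. The income and LTV figures are invented so that LTV grows with the log of income; the selection rule is an illustration, not a method taken from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

TRANSFORMS = {
    "x": lambda x: x,
    "ln(x)": math.log,
    "1/x": lambda x: 1.0 / x,
    "x^2": lambda x: x * x,
}

# Hypothetical data: household income (in $000) and LTV
hh_inc = [10, 20, 40, 80, 160, 320]
ltv = [230, 300, 369, 438, 508, 577]

strength = {
    name: abs(pearson([f(x) for x in hh_inc], ltv))
    for name, f in TRANSFORMS.items()
}
best = max(strength, key=strength.get)

for name, r in strength.items():
    print(f"{name:>6}  |r| = {r:.3f}")
print(best)
```

A scatter plot of the transformed variable is still worth checking, since a high correlation alone does not confirm the relationship is linear.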

Choose the best modeling approach

Gut-feel: expert opinion
RFM: non-statistical segmentation scheme
CHAID: segmentation scheme
Regression: statistical analysis
Neural Nets: statistical analysis

What modeling approach should be used?

If the business problem has a...

limited number of answers
  Who is likely to respond?
  Which purchasers of product A will also purchase product B?
  Who is most likely to lapse?
  Predict yes/no, response/non-response
  Output is binary, e.g., 0/1

wide range of answers
  What is the expected lifetime value of my customers?
  How much will customers invest?
  Predict amount spent
  Output is numerical, e.g., $2,563

Which approach should be used?

If the business problem has a...

limited number of answers: RFM, CHAID, Linear regression, Logistic regression, Neural nets

wide range of answers: Linear regression, Logistic regression, CHAID, Neural nets

Does the model make sense?

Response Rate by Product Type

                            Inferior  Standard  Superior/Luxury
Recency                        +         +            +
Frequency                      +         +            +
Monetary                       +         +            +
Income                         -         +            +
Price                          -         -            +
Age                            ?         ?            ?
Sex                            ?         ?            ?
Race                           ?         ?            ?
City v. suburban dweller       ?         ?            ?

? => Depends on the specific type of product

Relies on your knowledge of your product and the market!
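A simple sanity check along these lines is to compare the sign of each fitted coefficient with the sign you expect. The sketch below uses the expected signs for an inferior product from the table; the coefficient values are invented, and it assumes each variable is coded so that a positive coefficient means a higher response rate.

```python
# Expected direction of each variable's effect for an "Inferior" product
EXPECTED_SIGNS = {
    "Recency": "+",
    "Frequency": "+",
    "Monetary": "+",
    "Income": "-",
    "Price": "-",
}

def unexpected_signs(coefficients, expected):
    """Names of variables whose coefficient sign contradicts the expectation."""
    flagged = []
    for name, value in coefficients.items():
        sign = expected.get(name, "?")  # "?" means no prior expectation
        if (sign == "+" and value < 0) or (sign == "-" and value > 0):
            flagged.append(name)
    return flagged

# Hypothetical coefficients from a fitted response model
model_coefficients = {
    "Recency": 0.8,
    "Frequency": 0.5,
    "Monetary": 0.2,
    "Income": 0.4,
    "Price": -0.3,
    "Age": -0.1,
}

flagged = unexpected_signs(model_coefficients, EXPECTED_SIGNS)
print(flagged)
```

A flagged variable is a prompt to investigate, for example for collinearity or a data problem, not proof that the model is wrong.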

The business problem

National Basketball Association Team
  Declining attendance
  Expanding to new stadium with more seats

Marketing Objectives
  Up-sell: Mini-plan to Season ticket holders
  Prospecting: identify Season ticket plan prospects

Use best modeling approach

Appraise results - gains chart for our best model
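A gains chart ranks customers by model score and shows what share of all responders is captured as you work down the file. A minimal sketch follows; the 20 scores and outcomes are invented, and 5 groups are used instead of deciles only to keep the example small.

```python
def cumulative_gains(scores, responded, groups=10):
    """Share of all responders captured in the top 1..groups score groups."""
    ranked = [r for _, r in sorted(zip(scores, responded),
                                   key=lambda pair: pair[0], reverse=True)]
    total = sum(ranked)
    n = len(ranked)
    gains = []
    for g in range(1, groups + 1):
        cutoff = round(n * g / groups)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains

# Hypothetical model scores (highest first) and actual outcomes (1 = responded)
scores = [i / 20 for i in range(20, 0, -1)]
responded = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

gains = cumulative_gains(scores, responded, groups=5)
for g, share in enumerate(gains, start=1):
    print(f"top {g * 20:>3}% of file captures {share:.0%} of responders")
```

Dividing each captured share by the share of the file mailed gives the lift for that depth.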

Does the model make sense?

Most Important Variables

Cluster code

Home value

Age-male

Home value>=100K

# of HHs

# of seats

Does the model make sense -- what do my customers look like?

PRIZM cluster composition for segments

Modeled Segment   C1    C2    S1    S2    S3    U1    U3
1                 1.6   2.4   31.2  4.0   12.8  12.0  34.4
2                 2.4   16.3  56.5  1.6   0.8   13.7  4.0
...
10                5.5   28.4  11.0  5.5   18.1  1.6   4.7
...
19                2.8   2.8   10.1  32.1  4.6   2.8   0.0
20                5.5   0.9   18.4  22.9  2.8   0.0   0.0
TOTAL             6.0   9.5   24.0  14.9  11.0  6.1   4.9

Summary profile of “the” best segment

S1 - Elite Suburbs
  Wealthy whites, Asians and Arabic
  High spending levels
  Highest income
  High education
  High investment

U3 - Urban Cores
  Multi-racial
  Multi-lingual
  Dense/urban
  Home & apartment renters
  High % of singles
  High % of single parents
  High unemployment
  Lowest income group

How Can You Use This Information?

For each major customer segment, you can...
  Develop different messages
  Use different media/marketing approaches to reach them
  Buy prospect lists based on best segment profiles
  Develop retention and prospecting plans with customized offers (e.g., free CDs based on their particular tastes in music)

===>> improved customer up-sell and retention and better prospecting!

PRIZM cluster composition for segments (highlights)

Segment 1 is the top demi-decile, i.e., those most likely to become season ticket holders.
  S1 = Elite Suburbs (31.2% of Segment 1)
  U3 = Urban Cores (34.4% of Segment 1)

Potential marketing plans

             S1                           U3
Giveaways    1,000 FF miles               Mini-music system
             CD - Classical/Jazz          CD - Jazz/Rock
             Free WSJ sub                 Free Consumer Report sub
Contests     1 trip to the Masters, or…   1 trip to Super Bowl, or…
             the NBA finals               the NBA finals
             50 Montblanc pens            50 pairs Adidas/Nike
Advertise    Jazz stations                Jazz stations
             Classical stations           Rock stations
             Local Business section       Local Classified section

Model “validation”

At the PC
  error distribution
  out-of-sample validation
  cross-validation (good for small databases)

In the field
  test marketing campaign

Check distribution of errors
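For the cross-validation option, which suits small databases because every record is used for both training and validation, the record assignment can be sketched as below. The function name is illustrative, and the sketch assumes the file is already in random order, since records are dealt into folds by position.

```python
def k_fold_indices(n, k):
    """Yield (train, valid) index lists; each record is validated exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, valid

splits = list(k_fold_indices(n=10, k=3))
for train, valid in splits:
    print("train on", sorted(train), "validate on", valid)
```

A model would be fitted on each training set and scored on the matching validation set, and the validation errors from all folds would then be pooled.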

Out-of-sample validation

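[Diagram: the Full Customer File is randomly split into a Training File and a Validation File. The model built on the training file (e.g., ProbResponse = .003*HHINC + .1*NumPurch) is scored on both files, and the training file lift chart is compared with the validation file lift chart.]

PC validation: Training Data vs. Validation Data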
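The split-and-compare procedure might be sketched as follows. The scoring equation is the illustrative one from the diagram; the customer records, the 70/30 split, the fixed seed, and the use of top-20% lift as the comparison are assumptions made for this example.

```python
import random

def score(rec):
    """Illustrative scoring equation from the slide."""
    return 0.003 * rec["HHINC"] + 0.1 * rec["NumPurch"]

def split_file(records, validation_share=0.3, seed=0):
    """Randomly split the full file into training and validation files."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]

def top_group_lift(records, share=0.2):
    """Response rate in the top-scored share of the file vs. the overall rate."""
    ranked = sorted(records, key=score, reverse=True)
    top = ranked[: max(1, round(len(ranked) * share))]
    overall = sum(r["Responded"] for r in records) / len(records)
    top_rate = sum(r["Responded"] for r in top) / len(top)
    return top_rate / overall

# Hypothetical full customer file of 1,000 records
customers = [
    {
        "Id": i,
        "HHINC": 20 + (i * 37) % 80,
        "NumPurch": i % 10,
        "Responded": 1 if i % 10 >= 7 and i % 3 == 0 else 0,
    }
    for i in range(1000)
]

training, validation = split_file(customers)
print("training lift:  ", round(top_group_lift(training), 2))
print("validation lift:", round(top_group_lift(validation), 2))
```

If the validation lift is far below the training lift, the model is probably fitting noise in the training file.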

When is it time to lay an old model to rest?

When response rates start to decline

Response Rates
Jan-Feb   4.25
Mar-Apr   4.2
May-Jun   4.6
Jul-Aug   3.5
Sep-Oct   3.1
Nov-Dec   2.8
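One way to turn "response rates start to decline" into a concrete rule is sketched below, using the response rates from the table. The rule itself is an assumption for illustration: flag the model when the rate stays below 85% of the baseline (the average of the first three periods) for two periods in a row.

```python
def first_rebuild_signal(rates, baseline_periods=3, tolerance=0.85, run_length=2):
    """First period at which the rate has been below tolerance * baseline
    for run_length consecutive periods, or None if that never happens."""
    labels = list(rates)
    values = list(rates.values())
    baseline = sum(values[:baseline_periods]) / baseline_periods
    run = 0
    for label, value in zip(labels[baseline_periods:], values[baseline_periods:]):
        run = run + 1 if value < tolerance * baseline else 0
        if run >= run_length:
            return label
    return None

response_rates = {
    "Jan-Feb": 4.25,
    "Mar-Apr": 4.2,
    "May-Jun": 4.6,
    "Jul-Aug": 3.5,
    "Sep-Oct": 3.1,
    "Nov-Dec": 2.8,
}

signal = first_rebuild_signal(response_rates)
print(signal)
```

Requiring two consecutive low periods keeps a single weak campaign from triggering a rebuild.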

When significant events occur
  new competitive product introductions (e.g., local ISP internet access v. cable access/DSL)
  significant price changes (e.g., digital photography)
  changes in the economy (e.g., demand for luxury products (boats) in a recession)
  news events which affect attitudes about your product -- for better or worse (e.g., mortgage bankers)

Summary - 7 Steps to Better Models

1. Identify the business problem

2. Data audit - what data types are most useful and how much do I need?

3. Exploratory data analysis (Data visualization!)

   1. Data quality - how to deal with missing data and outliers

   2. Identifying your most predictive variables

   3. Transforming your variables

4. Choose the best modeling approach - a cookbook approach

5. Make sure the model makes sense

6. Model validation - the Melatonin of modeling

7. Model maintenance

FUNDAMENTALS OF PREDICTIVE CUSTOMER MODELING

James R. Stafford
