fundamentals of data mining 2010
Presented at the Annual NCDM Conference. Includes 7 Steps to Successful Data Mining.
FUNDAMENTALS OF DATA MINING FOR MARKETERS
James R. Stafford
Today’s Agenda - 7 Steps to Better Models
1. Identify the business problem
2. Data audit - what data types are most useful and how much do I need?
3. Exploratory data analysis
   a. Data quality - how to deal with missing data and outliers
   b. Identifying your most predictive variables
   c. Transforming your variables
4. Choose the best modeling approach
5. Make sure the model makes sense
6. Model validation - the Melatonin of modeling
7. When to re-build your model
What is the business problem?
- Response
- Attrition/Lapse/Churn
- Reactivation
- Lifetime Value
- Profitability
- Sales
What Should I Predict?
Data is the Key
- Primary data
- Secondary data
What Data is Available to Predict Outcome?
Primary Data
- Transaction based: Recency, Frequency, Monetary, Products Purchased
  The most important data for modeling!
- Demographics: Age, Home Ownership, Dependents
- Lifestyle: Type of Car, Hobbies, Travel Preferences
  The most important data for customer profiling and building acquisition models!
Secondary data: consumer & business
- Acquired from another source
- Specific or inferred
  - Actual and reported by individual/household
  - Modeled after similar profiles
  - Pct data specific or inferred varies
- Costs vary from $2/1,000 to $50/1,000 matches
- Consumer (Demographics/Lifestyle): Age, Home Ownership, Dependents, Income, Type of Car, Travel Preferences
- Business: SIC, Employees
How much data do I need?
- More is better!!! ... use all of your customers if possible (train & validate)
- When to sample: too many records; test campaign to get response; withhold some for model validation
- Goal of sample - to be representative of your target customer population
How to determine acceptable sample size
- N = sample size
- C = confidence level (1.96 for 95% confidence)
- E = acceptable error bound, +/- (0.001 = 0.1% response rate)
- P = response rate from full file (e.g., 0.03 = 3%)
- Q = (1-P)
$N = \left(\frac{C}{E}\right)^2 \cdot P \cdot Q$
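The sample size formula can be checked with a few lines of Python. This is a minimal sketch, assuming the formula squares only the C/E term; the function name `required_sample_size` is made up for illustration.

```python
import math

def required_sample_size(p: float, e: float, c: float = 1.96) -> int:
    """N = (C/E)^2 * P * Q, rounded up to a whole record.

    p: response rate from the full file (0.03 = 3%)
    e: acceptable error bound (0.001 = +/- 0.1% response rate)
    c: confidence level multiplier (1.96 for 95% confidence)
    """
    q = 1.0 - p
    return math.ceil((c / e) ** 2 * p * q)

# 3% response rate, +/- 0.1% error bound, 95% confidence
n = required_sample_size(p=0.03, e=0.001)
print(f"Records needed: {n:,}")
```

With P = 3% and E = 0.1%, the result is close to the 112,000 records quoted later in the deck.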
How much data do I need?
More is always better, but there are diminishing returns.
[Figure: SD of Response Rate (y-axis, 0 to 0.003) vs. # of Records (x-axis, 5,000 to 145,000)]
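How much data do I need? -- 10,000 records & 3.0% RR
Expected response rate 3.0%, with observed rates between 2.7% and 3.3% (95 times out of 100!)
What’s the minimum sample size I need to get 2.9% <=> 3.1%?
Expected response rate 3.0%, with observed rates between 2.9% and 3.1% (95 times out of 100!): 112,000 records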
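The 95% ranges above follow from the standard deviation of a response rate, sqrt(P*Q/N). A minimal sketch of that arithmetic, with a made-up function name:

```python
import math

def response_rate_interval(p, n, c=1.96):
    """Return (low, high) bounds for an observed response rate.

    p: expected response rate, n: number of records,
    c: confidence multiplier (1.96 for 95% confidence).
    """
    sd = math.sqrt(p * (1.0 - p) / n)  # SD of the response rate
    margin = c * sd
    return p - margin, p + margin

for records in (10_000, 112_000):
    low, high = response_rate_interval(0.03, records)
    print(f"{records:>7,} records: {low:.2%} to {high:.2%}")
```

Because the SD shrinks with the square root of N, roughly eleven times as many records are needed to cut the range from +/- 0.3% to +/- 0.1%, which is the diminishing return the chart illustrates.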
How much data do I need? Minimums
- Response models - minimum of 300 customers that behaved in the “desired” way.
- Lifetime value models - at least 300 customers/records.
EDA - How to deal with missing data and outliers
Missing data - blanks, “NA”
- Use/recode: -999; may be meaningful, e.g., lots of missing data can be important in fraud detection
- Substitute - mean, median or mode
- Delete records from analysis
Outliers - data outside of reasonable bounds
- customer age = 170
- customer balance = $1.5M ($10,000 = other max value)
- identify with plots: frequency distributions/histograms
- Use, substitute or delete
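A minimal sketch of the recode-and-flag step, in plain Python. The function names and the age bounds of 18 to 110 are assumptions chosen for illustration; reasonable bounds depend on your own file.

```python
MISSING_TOKENS = {"", "NA"}   # blanks and "NA" count as missing
MISSING_CODE = -999           # recode value named on the slide

def recode_missing(raw_values):
    """Turn raw text fields into numbers, recoding missing values to -999."""
    recoded = []
    for value in raw_values:
        text = str(value).strip()
        recoded.append(MISSING_CODE if text in MISSING_TOKENS else float(text))
    return recoded

def flag_out_of_bounds(values, low, high):
    """True where a non-missing value falls outside [low, high]."""
    return [v != MISSING_CODE and not (low <= v <= high) for v in values]

# Hypothetical customer ages as they might arrive in a raw file
ages = recode_missing(["34", "NA", "52", "", "170", "61"])
flags = flag_out_of_bounds(ages, low=18, high=110)  # bounds are assumptions
print(list(zip(ages, flags)))
```

Keeping missing values as an explicit code, rather than deleting them, preserves the option of using "missingness" itself as a predictor.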
Locate Outliers
[Figure: frequency distribution of Customer Age, AVG = 52.1, with an outlier marked]
Effect of outliers
[Figure: two scatter plots of LTV vs. Number of purchases. The first marks Xo; the second adds outliers and marks Xcorrect and Xoutlier.]
Replace outliers
- Numeric data (age, # purchases, LTV, ...): mean, median, or mode
- ASCII string (sex, lifestyle code, …): mode
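The replacement rules above could look like this in Python. This is a sketch only: it uses the median for numeric data (mean or mode are the other options listed), and the function names, bounds, and sample values are hypothetical.

```python
from statistics import median, mode

def replace_numeric_outliers(values, low, high):
    """Replace values outside [low, high] with the median of the valid values."""
    valid = [v for v in values if low <= v <= high]
    fill = median(valid)
    return [v if low <= v <= high else fill for v in values]

def replace_invalid_codes(values, allowed):
    """Replace string codes outside the allowed set with the most common valid code."""
    valid = [v for v in values if v in allowed]
    fill = mode(valid)
    return [v if v in allowed else fill for v in values]

# Hypothetical data: one impossible age and one unreadable sex code
clean_ages = replace_numeric_outliers([34, 52, 170, 61, 47], low=18, high=110)
clean_sexes = replace_invalid_codes(["F", "M", "F", "?", "F"], allowed={"F", "M"})
print(clean_ages, clean_sexes)
```

The fill value is computed from the valid records only, so the outlier does not distort its own replacement.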
EDA - Finding your most predictive variables
Correlation matrix
Cross-tabs
Scatter plots
Stepwise regression
CHAID
Correlation matrix (Fitness Product Line)
            LTV    NoPur  MSLPur  HHInc  PPHH   EducLevel  AAgeMale  AAgeFem
LTV         1.00   0.78   0.43    0.62   0.19   0.23       -0.34     -0.23
NoPur       0.78   1.00   0.55    0.47   0.24   0.15       -0.22     -0.30
MSLPur      0.43   0.55   1.00    0.21   0.43   0.33       -0.13     -0.05
HHInc       0.62   0.47   0.21    1.00   0.18   0.76        0.38      0.24
PPHH        0.19   0.24   0.43    0.18   1.00   0.21        0.34      0.37
EducLevel   0.23   0.15   0.33    0.76   0.21   1.00        0.41      0.46
AAgeMale   -0.34  -0.22  -0.13    0.38   0.34   0.41        1.00      0.82
AAgeFem    -0.23  -0.30  -0.05    0.24   0.37   0.46        0.82      1.00
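A correlation matrix like this one can be built from Pearson correlations between every pair of columns. A minimal sketch in plain Python; the toy data below is made up and only borrows three variable names from the slide.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {row_name: {col_name: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical five-customer file
data = {
    "LTV": [120, 340, 560, 410, 90],
    "NoPur": [1, 4, 6, 5, 1],
    "AAgeMale": [62, 41, 35, 38, 66],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Read the LTV row first: the variables with the largest absolute correlations are the first candidates for the model.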
Scatter plot
[Figure: two scatter plots of LTV vs. AvgHH Income, one showing no relationship and one showing a positive relationship]
Stepwise regression
- Automated regressions to identify most predictive variables
- 1st regression finds the single most predictive variable
- 2nd regression keeps the most predictive variable and finds the 2nd most predictive that is significant given the presence of the 1st
- Addresses collinearity
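The idea can be sketched as forward selection in plain Python. This is a simplified illustration, not a full stepwise procedure: it only adds variables (never removes them), and it uses a minimum R-squared gain (`min_gain`, an assumed threshold) in place of a significance test. All names and data are hypothetical.

```python
def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _remove_component(col, q, qq):
    """Subtract from col the part already explained by q."""
    coef = _dot(q, col) / qq
    return [v - coef * x for v, x in zip(col, q)]

def forward_select(columns, y, min_gain=0.01):
    """Return [(name, R^2 gain)] in the order variables enter the model."""
    resid = _center(y)
    total_ss = _dot(resid, resid)
    remaining = {name: _center(col) for name, col in columns.items()}
    original_ss = {name: _dot(col, col) for name, col in remaining.items()}
    chosen = []
    while remaining:
        best_name, best_gain = None, 0.0
        for name, col in remaining.items():
            ss = _dot(col, col)
            if ss <= 1e-9 * original_ss[name]:
                continue  # nothing new left in this variable (collinear)
            gain = _dot(col, resid) ** 2 / (ss * total_ss)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None or best_gain < min_gain:
            break
        q = remaining.pop(best_name)
        qq = _dot(q, q)
        resid = _remove_component(resid, q, qq)
        remaining = {n: _remove_component(c, q, qq) for n, c in remaining.items()}
        chosen.append((best_name, best_gain))
    return chosen

# Hypothetical data: x2 is an exact multiple of x1, so it adds nothing new
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2 * v for v in x1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [2 * a + 3 * b for a, b in zip(x1, x3)]
selected = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected)
```

Once one of the two collinear variables enters, the other has nothing left to contribute and is skipped, which is the sense in which the procedure addresses collinearity.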
CHAID
Disability Insurance: 4.5% RR
  Males: 3% RR
    <25: 1.2% RR
    >=25: 3.9% RR
  Females: 6% RR
    % 1 person HH'r: 2.4% RR
    % >1 person HH'r: 9.3% RR
Identify important variables and variable interactions
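At each node, CHAID looks for the predictor whose categories differ most in response, judged by a chi-square test. The sketch below shows only that core step for two-category predictors; full CHAID also merges categories and compares adjusted p-values. The counts are hypothetical, chosen so that the sex split reproduces the 3% and 6% response rates in the tree.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of [responders, non-responders] rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts per category: [responders, non-responders]
candidates = {
    "sex": [[30, 970], [60, 940]],        # males 3% RR, females 6% RR
    "homeowner": [[44, 956], [46, 954]],  # nearly identical response rates
}
best_split = max(candidates, key=lambda name: chi_square(candidates[name]))
print(best_split, round(chi_square(candidates[best_split]), 2))
```

Comparing raw chi-square values is only fair when the candidate predictors have the same number of categories, as they do here.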
Variable transformations
- Based on relationship with dependent/output variable (logs, x^2, 1/x, etc.)
- Based on characteristics of the data
  - ZIP - SCF/1st 3 characters
  - Phone number - area code
  - Birth date - age
  - Purchase date - month purchased ==> seasonality
  - Purchase date - months since last purchase
- Based on modeling methodology used (CHAID requires binning)
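The data-driven transformations above are simple field derivations. A minimal sketch using only the Python standard library; the function names and the sample values are made up, and the phone parsing assumes North American numbers.

```python
from datetime import date

def zip_to_scf(zip_code):
    """SCF = first 3 characters of the ZIP code."""
    return zip_code.strip()[:3]

def area_code(phone):
    """First 3 digits of a phone number, ignoring punctuation and a leading 1."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits[:3]

def age_in_years(birth_date, as_of):
    """Whole years between birth_date and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def months_since(purchase_date, as_of):
    """Calendar months elapsed since the last purchase (recency)."""
    return (as_of.year - purchase_date.year) * 12 + (as_of.month - purchase_date.month)

today = date(2010, 2, 5)          # hypothetical scoring date
last_purchase = date(2009, 11, 20)
print(zip_to_scf("60614"), area_code("(312) 555-0100"))
print(age_in_years(date(1970, 6, 15), today), months_since(last_purchase, today))
print("month purchased:", last_purchase.month)  # seasonality indicator
```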
Variable transformations
[Figure: scatter plots of LTV vs. AvgAgeMale, not transformed and after 1/x, i.e., 1/AvgAgeMale]
[Figure: scatter plots of LTV vs. Number of purchases, using X and X^2]
[Figure: scatter plots of LTV vs. Household Income, using X and ln(x), i.e., ln(HHInc)]
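One way to choose among 1/x, x^2, and ln(x) is to see which version of the variable correlates most strongly with the output. A minimal sketch with made-up data in which LTV rises by a fixed amount each time household income doubles:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

TRANSFORMS = {
    "x": lambda v: v,
    "1/x": lambda v: 1.0 / v,
    "x^2": lambda v: v ** 2,
    "ln(x)": math.log,
}

# Hypothetical household incomes and lifetime values
hh_income = [20_000, 40_000, 80_000, 160_000, 320_000]
ltv = [100, 200, 300, 400, 500]

results = {
    name: pearson([fn(v) for v in hh_income], ltv)
    for name, fn in TRANSFORMS.items()
}
for name, r in results.items():
    print(f"{name:>6}: r = {r:+.3f}")
```

In this toy data the log transform straightens the curve completely, so a linear model on ln(HHInc) fits better than one on raw income. Always confirm the choice with a scatter plot.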
Choose the best modeling approach
- Gut-feel: expert opinion
- RFM: non-statistical segmentation scheme
- CHAID: segmentation scheme
- Regression: statistical analysis
- Neural Nets: statistical analysis
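As an illustration of RFM as a non-statistical segmentation scheme, the sketch below ranks customers into quintiles on recency, frequency, and monetary value and joins the three scores into a cell code. The quintile convention (5 = best) and all names and values are assumptions; tied values are split by file order.

```python
def quintile_scores(values, higher_is_better=True):
    """Score each value 1 (worst) to 5 (best) by its rank in the file."""
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=not higher_is_better,
    )
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = 1 + (rank * 5) // len(values)
    return scores

# Hypothetical five-customer file
recency_days = [10, 200, 35, 400, 90]   # days since last purchase (lower is better)
frequency = [12, 2, 6, 1, 3]            # number of purchases
monetary = [900, 40, 300, 25, 150]      # total spend in dollars

r = quintile_scores(recency_days, higher_is_better=False)
f = quintile_scores(frequency)
m = quintile_scores(monetary)
rfm_cells = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(rfm_cells)
```

Cells such as 555 are mailed first; no model fitting is involved, which is why RFM is a useful baseline before trying CHAID or regression.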
What modeling approach should be used?
If the business problem has a...
limited number of answers
- Who is likely to respond?
- Which purchasers of product A will also purchase product B?
- Who is most likely to lapse?
- Predict yes/no, response/non-response
- Output is binary, e.g., 0/1
wide range of answers
- What is the expected lifetime value of my customers?
- How much will customers invest?
- Predict amount spent
- Output is numerical, e.g., $2,563
Which approach should be used?
If the business problem has a...
limited number of answers
- RFM
- CHAID
- Linear regression
- Logistic regression
- Neural nets
wide range of answers
- Linear regression
- Logistic regression
- CHAID
- Neural nets
Does the model make sense?
Response Rate by Product Type
                              Inferior  Standard  Superior/Luxury
Recency                          +         +           +
Frequency                        +         +           +
Monetary                         +         +           +
Income                           -         +           +
Price                            -         -           +
Age                              ?         ?           ?
Sex                              ?         ?           ?
Race                             ?         ?           ?
City v. suburban dweller         ?         ?           ?
? => Depends on the specific type of product
Relies on your knowledge of your product and the market!
The business problem
National Basketball Association Team
- Declining attendance
- Expanding to new stadium with more seats
Marketing Objectives
- Up-sell: Mini-plan to Season ticket holders
- Prospecting: identify Season ticket plan prospects
Use best modeling approach
Appraise results - gains chart for our best model
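A gains chart plots the cumulative share of responders captured as you work down the file from the highest model score to the lowest. A minimal sketch of that calculation; the function name, scores, and outcomes are hypothetical.

```python
def cumulative_gains(scores, responded, bins=10):
    """Cumulative share of all responders captured through each score bin.

    scores: model score per record; responded: 1/0 outcome per record.
    Records are ranked from highest to lowest score and cut into equal bins.
    """
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    total_responders = sum(responded)
    gains, captured = [], 0
    for b in range(bins):
        start = b * len(ranked) // bins
        end = (b + 1) * len(ranked) // bins
        captured += sum(flag for _, flag in ranked[start:end])
        gains.append(captured / total_responders)
    return gains

# Hypothetical ten-record file scored by a model
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
gains = cumulative_gains(scores, responded, bins=5)
print(gains)
```

A model with no predictive power would capture responders in proportion to the file mailed (20%, 40%, ...); the further the curve sits above that line, the better the model.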
Does the model make sense?
Most Important Variables
Cluster code
Home value
Age-male
Home value>=100K
# of HHs
# of seats
Does the model make sense -- what do my customers look like?
PRIZM cluster composition for segments
Modeled Segment   C1    C2    S1    S2    S3    U1    U3
1                 1.6   2.4   31.2  4.0   12.8  12.0  34.4
2                 2.4   16.3  56.5  1.6   0.8   13.7  4.0
10                5.5   28.4  11.0  5.5   18.1  1.6   4.7
19                2.8   2.8   10.1  32.1  4.6   2.8   0.0
20                5.5   0.9   18.4  22.9  2.8   0.0   0.0
TOTAL             6.0   9.5   24.0  14.9  11.0  6.1   4.9
Summary profile of “the” best segment
S1 - Elite Suburbs
- Wealthy whites, Asians and Arabic
- High spending levels
- Highest income
- High education
- High investment
U3 - Urban Cores
- Multi-racial
- Multi-lingual
- Dense/urban
- Home & apartment renters
- High % of singles
- High % of single parents
- High unemployment
- Lowest income group
How Can You Use This Information?
For each major customer segment, you can...
- Develop different messages
- Use different media/marketing approaches to reach them
- Buy prospect lists based on best segment profiles
- Develop retention and prospecting plans with customized offers (e.g., free CDs based on their particular tastes in music)
===>> improved customer up-sell and retention and better prospecting!
PRIZM cluster composition for segments (callouts on the table above)
- S1 = Elite Suburbs
- U3 = Urban Cores
- Top demi-decile, i.e., those most likely to become season ticket holders
Potential marketing plans
            S1                                          U3
Giveaways   1,000 FF miles                              Mini-music system
            CD - Classical/Jazz                         CD - Jazz/Rock
            Free WSJ sub                                Free Consumer Reports sub
Contests    1 trip to the Masters, or… the NBA finals   1 trip to Super Bowl, or… the NBA finals
            50 Montblanc pens                           50 pairs Adidas/Nike
Advertise   Jazz stations                               Jazz stations
            Classical stations                          Rock stations
            Local Business section                      Local Classified section
Model “validation”
At the PC
- error distribution
- out-of-sample validation
- cross-validation (good for small databases)
In the field
- test marketing campaign
Check distribution of errors
Out-of-sample validation
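[Diagram: the Full Customer File is randomly split into a Training File and a Validation File. The model from the training file (ProbResponse = .003*HHINC + .1*NumPurch) produces a training file lift chart and a validation file lift chart, which are then compared.]
PC validation
[Figure: lift charts for Training Data and Validation Data]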
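The out-of-sample check can be sketched as a random split followed by a lift comparison. The scoring formula is the one on the slide; everything else (function names, the 70/30 split, the simulated file, HHINC in thousands of dollars, and dividing the score by 10 to simulate responses) is an assumption for illustration.

```python
import random

def split_file(records, validation_share=0.3, seed=42):
    """Randomly split the full customer file into training and validation files."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * validation_share)
    return shuffled[:cut], shuffled[cut:]

def score(record):
    """Scoring formula from the slide: .003*HHINC + .1*NumPurch."""
    return 0.003 * record["hhinc"] + 0.1 * record["num_purch"]

def top_share_lift(records, share=0.1):
    """Response rate of the top-scored share divided by the overall response rate."""
    ranked = sorted(records, key=score, reverse=True)
    top = ranked[: max(1, round(len(ranked) * share))]
    overall = sum(r["responded"] for r in records) / len(records)
    top_rate = sum(r["responded"] for r in top) / len(top)
    return top_rate / overall if overall else float("nan")

# Simulated (hypothetical) customer file
rng = random.Random(0)
customers = []
for _ in range(1000):
    rec = {"hhinc": rng.uniform(20, 150), "num_purch": rng.randint(0, 5)}
    rec["responded"] = 1 if rng.random() < score(rec) / 10 else 0
    customers.append(rec)

training, validation = split_file(customers)
print("training lift:  ", round(top_share_lift(training), 2))
print("validation lift:", round(top_share_lift(validation), 2))
```

If the validation lift falls well below the training lift, the model has probably fit noise in the training file rather than real customer behavior.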
When is it time to lay an old model to rest?
- When response rates start to decline
  Response Rates: Jan-Feb 4.25, Mar-Apr 4.2, May-Jun 4.6, Jul-Aug 3.5, Sep-Oct 3.1, Nov-Dec 2.8
- When significant events occur
  - new competitive product introductions (Local ISP internet access v. cable access/DSL)
  - significant price changes (Digital photography)
  - changes in the economy (Demand for luxury products (boats) in a recession)
  - news events which affect attitudes about your product -- for better or worse (Mortgage bankers)
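Declining response rates can be monitored with a simple rule. The sketch below uses the response rates from the slide; the rule itself (flag any period more than 15% below the average of the first three periods) is an assumption, not a standard, and should be tuned to your own campaigns.

```python
from statistics import mean

def flag_declines(rates, baseline_periods=3, tolerance=0.85):
    """Return periods whose response rate falls below tolerance * baseline.

    rates: list of (period, response_rate) in time order.
    The baseline is the mean rate of the first baseline_periods periods.
    """
    baseline = mean(rate for _, rate in rates[:baseline_periods])
    return [
        period
        for period, rate in rates[baseline_periods:]
        if rate < tolerance * baseline
    ]

# Response rates (%) by two-month period, from the slide
response_rates = [
    ("Jan-Feb", 4.25), ("Mar-Apr", 4.2), ("May-Jun", 4.6),
    ("Jul-Aug", 3.5), ("Sep-Oct", 3.1), ("Nov-Dec", 2.8),
]
flagged = flag_declines(response_rates)
print("Periods suggesting a rebuild:", flagged)
```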
Summary - 7 Steps to Better Models
1. Identify the business problem
2. Data audit - what data types are most useful and how much do I need?
3. Exploratory data analysis (Data visualization!)
   a. Data quality - how to deal with missing data and outliers
   b. Identifying your most predictive variables
   c. Transforming your variables
4. Choose the best modeling approach - a cookbook approach
5. Make sure the model makes sense
6. Model validation - the Melatonin of modeling
7. Model maintenance
FUNDAMENTALS OF PREDICTIVE CUSTOMER MODELING
James R. Stafford