Data Refinement: The missing link between data collection and decisions
Stephen H. Yu, Data Strategy & Analytics Consultant
What we will cover
• Database Marketing Landscape
• Analytics and Models
• “Model-Ready” Environment
• Data Summarization & Categorization
• Delivery: Scoring & QC
• Closing the Loop
Big Data, Small Data, Neat Data, Messy Data
How is “Big Data” working out for you?
2.5 quintillion bytes collected per day
1 quintillion bytes (1 exabyte) = 1 billion gigabytes
• Did all this data improve your decision-making process?
• Do you have the results to show for it?
• Information overload? You bet! Harness insights, drop the noise
• No guessing game – you MUST know your target
• Vast amounts of online & offline data are collected, but is it being used properly?
• Analytics plays a huge role in prospecting & CRM
• Short marketing cycles are getting shorter
• Huge difference between advanced marketers and those who are falling behind
Winners are the ones who know how to wield the power of all available data faster.
Database Marketing Landscape
• Database marketers must excel in:
o Collection: Size & speed matter
o Refinement: Get to answers, not just ingredients
o Delivery: For consumption by end-users
• Must provide “marketing answers” via advanced analytics, not just bits and pieces of data
• Insight does not come from data; it is derived from data
Insights, not Raw Data
Raw Data
• Demographic / Firmographic
• RFM
• Products & Services Used
• Promotion / Response History
• Lifestyle / Survey Responses
• Delinquency History
• Call / Communication Log
• Movement Data
• Sentiments
Marketing Answers
• Likely to buy a luxury car
• Likely to take a foreign vacation
• Likely to respond to a free shipping offer
• Likely to be a high-value customer
• Likely to qualify for credit
• Likely to upgrade
• Likely to leave
• Likely to come back
Refined Answers
“Analytics” means different things…
• BI (Business Intelligence) Reporting: Display of success metrics, dashboard reporting
• Descriptive Analytics: Profiling, segmentation, clustering
• Predictive Modeling: Response models, cloning models, lifetime value, revenue models, etc.
• Optimization: Channel optimization, marketing spending analysis, econometric models
Predictive Modeling for 1-to-1 Marketing
Different Types of Analytics
• Increase targeting accuracy
• Reduce costs by contacting less, but smarter
• Stay relevant
• Consistent results
• Reveal hidden patterns in data
• Repeatable – key for automation
• Expandable
• “Supposedly” save time and effort
Why model?
Models summarize complex data into simple-to-use “scores”
• Universe is too small
• Predictive data not available
• 1-to-1 marketing channels not in plan
• Tight budget
• Lack of resources
Why NOT model?
Custom Models for every stage of Marketing Lifecycle
• Not easy to find “Best” customers
• Modelers are fixing data all the time
• Rely on a few popular variables
• Always need more variables
• Takes too long to build models and score
• Inconsistencies shown when scored
• Disappointing results
Any pain implementing models?
If you have a database…
• Order Fulfillment
• Contact Management
• Standard Reports
• Ad hoc Reports and Queries
• Name Selections
• Response Analysis
• Trend Analysis
But does it support predictive modeling and scoring?
What does your database support?
“Garbage-in, garbage-out”
• Most data sets are messy & “unstructured”
• Over 70-80% of model development time goes to data prep work
• Most databases are NOT model-ready
• Modeling & Scoring
o Extension of database work
o Consistency is “the” key
For modeling, clean the data first
Determine the level of data accordingly
• Relational or unstructured databases won’t cut it
• Must create “Descriptors” that fit the level that needs to be ranked
Predictive Modeling is all about “Ranking”
Ultimately, models must properly “Rank”
• Households
• Individuals
• Products
• Most modern databases are optimized for massive storage and rapid retrieval, not necessarily for predictive analytics
o Relational databases
o NoSQL databases
• Need “Analytical Sandbox” (or Database/Data-mart)
o Structured & de-normalized
o Variables as descriptors of model targets
o Common analytical language (SAS, R, SPSS)
o Must support “in-database” scoring
Unstructured to Structured
From “all” types of data collection to decisions. Then repeat.
Analytical Sandbox – “Model-Ready” Environment
Data append/match becomes ineffective
Inconsistent data creates a chain reaction that leads to meltdowns
Creative variables enhance models
Inexperienced analysts spend the majority of their time doing DP work, leaving modeling work to the last minute!
Why Is Front-end DP Important?
• Rank & select prospect names
• Cross-sell/Up-sell
• Segment the universe for messaging strategy
• Pinpoint attrition point
• Assign lifetime values
• Optimize media/channel spending
• Create product packages
• Project customer value
• Detect fraud
First, Define Analytical Goals
Transaction Data / Behavioral Data
• Transaction data
• Compiled / Co-op data
• Lifestyle data
• Online behavioral data
Descriptive Data
• Demographic Data
• Firmographic Data
• Geo-demographic Data
Attitudinal Data
• Surveys
• Sentiments
3 Dimensions in Predictive Analytics
3 Major Types of Data for Marketers
Data Inventory
• “Modeling is making the best of what we know”
• Beyond obvious RFM data
• Get Deeper
o Product/Service Level Data
o Historical Data
o Channel Data – Inbound & Outbound
o Online activities, sentiments, unstructured data
• External Data
Create Data Menu
• Base it on a companywide needs analysis
• Ask the analysts first: What types of models are in the plan?
o Affinity/Look-alike Models
o Promotion/Response Models
o Time-series Models
o Attrition Models
• Consider non-analytical departments
• Maintain the ones that fit the objective
o Don’t be afraid to throw out “noise”
Check the ingredients
• What do you have today?
• What can be bought?
• What can be created?
Cost: Can you afford to maintain it?
• Storage/Platform – Consider the scoring part, too
• Programming/Processing Time
• Software
• Update
• External Data
Data Menu (continued)
You may have more than you thought, so let’s start with what you have:
• Name & Address: Key to Geo/Demographic Data
• Order Transaction Data: “RFM”, Payment Methods
• Item/SKU Level Data: Products, Price, Units
• Promotion/Response History: Source, Channel, Offer
• Life-to-Date/Past “x” Months Summary Data
• Customer Level Status Flags: Active, Dormant, Delinquent
• Surveys/Product Registration Forms: Attitudinal/Lifestyle
• Customer Communication History Data: Call-center, Web
• Social Media, Click-through, Page views: Sentiment/Intentions
Need Conversion, Categorization, & Summarization
Check Your Data Inventory
Maximize the Power of Transaction Data
Most databases describe shopping baskets. Start describing your targets.
• RFM Data must be Summarized (or De-normalized)
• Turn RFM data into individual / household level “Descriptors”
• Combine with essential categorical variables (e.g., product, offer, channel, etc.)
Order Table (before summarization)
Cust ID   Order #   Order Date   $ Amount
000123    100011    2009-05-06   $199.99
000123    100128    2010-08-30   $50.49
000123    103082    2011-12-21   $128.60
003859    100036    2010-06-06   $43.99
003859    101658    2011-01-20   $43.99
003859    102189    2011-04-15   $119.45
003859    106458    2012-02-18   $43.99
004593    104535    2012-07-30   $354.72
016899    107296    2011-07-14   $199.99
019872    102982    2010-09-07   $128.60
019872    103826    2011-04-30   $499.99
019872    109056    2012-03-12   $59.99
Order Summary Table (after summarization)
Cust ID   # Orders   $ Total   First Order Date   Last Order Date
000123    3          $379.08   2009-05-06         2011-12-21
003859    4          $251.42   2010-06-06         2012-02-18
004593    1          $354.72   2012-07-30         2012-07-30
016899    1          $199.99   2011-07-14         2011-07-14
019872    3          $688.58   2010-09-07         2012-03-12
Data Summarization – Matching the level of Data
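The before-and-after tables above can be reproduced with a short script. This is only a minimal sketch using the Python standard library and the sample orders from the slide; the function name `summarize_orders` and the dictionary keys are illustrative assumptions, not something the deck prescribes.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Order-level rows from the slide: (customer ID, order number, order date, amount)
ORDERS = [
    ("000123", "100011", date(2009, 5, 6), Decimal("199.99")),
    ("000123", "100128", date(2010, 8, 30), Decimal("50.49")),
    ("000123", "103082", date(2011, 12, 21), Decimal("128.60")),
    ("003859", "100036", date(2010, 6, 6), Decimal("43.99")),
    ("003859", "101658", date(2011, 1, 20), Decimal("43.99")),
    ("003859", "102189", date(2011, 4, 15), Decimal("119.45")),
    ("003859", "106458", date(2012, 2, 18), Decimal("43.99")),
    ("004593", "104535", date(2012, 7, 30), Decimal("354.72")),
    ("016899", "107296", date(2011, 7, 14), Decimal("199.99")),
    ("019872", "102982", date(2010, 9, 7), Decimal("128.60")),
    ("019872", "103826", date(2011, 4, 30), Decimal("499.99")),
    ("019872", "109056", date(2012, 3, 12), Decimal("59.99")),
]


def summarize_orders(orders):
    """Collapse order-level rows into one descriptor row per customer."""
    summary = defaultdict(
        lambda: {"orders": 0, "total": Decimal("0"), "first_date": None, "last_date": None}
    )
    for cust_id, _order_no, order_date, amount in orders:
        row = summary[cust_id]
        row["orders"] += 1
        row["total"] += amount
        if row["first_date"] is None or order_date < row["first_date"]:
            row["first_date"] = order_date
        if row["last_date"] is None or order_date > row["last_date"]:
            row["last_date"] = order_date
    return dict(summary)


if __name__ == "__main__":
    for cust_id, row in sorted(summarize_orders(ORDERS).items()):
        print(cust_id, row["orders"], row["total"], row["first_date"], row["last_date"])
```

The point of the design is that the output has exactly one row per customer, which is the level the model is meant to rank.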
Recency
• Weeks since last online purchase
• Years since member sign up
• Days since last delinquent date
• Months since last response date
Frequency
• Orders by offer type
• Orders by product/service type
• Payments by pay method
• Average days between transactions
Monetary
• Total $ past 24 months
• Life-to-date spending
• Average dollars by channel
• Average dollars by product type
Sample Variables after Summarization
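A few of the sample variables above can be derived from one customer's transactions as follows. This is a hedged sketch: the function name `rfm_descriptors`, the output keys, and the 730-day approximation of “past 24 months” are assumptions made for illustration.

```python
from datetime import date, timedelta
from decimal import Decimal


def rfm_descriptors(orders, as_of, window_days=730):
    """Summarize one customer's (order_date, amount) pairs into descriptors.

    window_days=730 approximates "past 24 months" (an assumption).
    Values that cannot be calculated are returned as None, not 0.
    """
    past = sorted((d, amt) for d, amt in orders if d <= as_of)
    dates = [d for d, _ in past]
    window_start = as_of - timedelta(days=window_days)
    # Days between each pair of consecutive orders
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "order_count": len(dates),  # frequency
        "days_since_last": (as_of - dates[-1]).days if dates else None,  # recency
        "avg_days_between": sum(gaps) / len(gaps) if gaps else None,  # frequency
        "total_in_window": sum(  # monetary
            (amt for d, amt in past if d > window_start), Decimal("0")
        ),
    }
```

A customer with a single order gets `avg_days_between` of `None` rather than 0, because the value is incalculable and not actually zero.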
RFM Data Summary – Timeline
Life-to-date summary provides the historical view
• May create bias towards tenured customers
Put time limits on variables (e.g., 12-month, 24-month, etc.)
• May require a higher number of variables and complicate the process
For Lifetime Value & Time Series Models
• Must create historical arrays (daily, weekly, monthly counts of events)
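As an illustration of a historical array, the sketch below counts events per calendar month, looking back from a reference date. The function name `monthly_event_counts` and the indexing convention (index 0 is the most recent month) are assumptions for this example only.

```python
from datetime import date


def monthly_event_counts(event_dates, as_of, months=12):
    """Return a fixed-length list of event counts per calendar month.

    Index 0 is the month containing `as_of`; index k is k months earlier.
    Events after `as_of` or older than the window are ignored.
    """
    counts = [0] * months
    for d in event_dates:
        if d > as_of:
            continue
        offset = (as_of.year - d.year) * 12 + (as_of.month - d.month)
        if 0 <= offset < months:
            counts[offset] += 1
    return counts
```

Because every customer gets an array of the same length, the array can be stored as a fixed set of columns (month 0, month 1, and so on) in the summary table.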
Who does the summary work?
Answer: Not the statistician!
Key Takeaway
The data variables must be consistent everywhere in the “Analytical Sandbox”
• Main analytical database
• Model development sample
• Pool of records to be scored
Pre-built summary variables in the database
Free-form data comes to life through categorization
Don’t Give Up!
• Hidden data in:
o Product, service, offer, channel, source, status, titles, surveys, etc.
• Have a categorization guideline?
• Who will do it?
o Consider text mining techniques
• What to throw out?
o Keep data that matters in predictive modeling
Data Categorization
Any Non-numeric Data
• Product
• Service
• Offer
• Channel
• Source
• Market
• Region
• Business Title
• Member Status
• Payment Status
• etc…
Offer Code Example:
• Flat Dollar Discount
• % Discount
• Buy 1, Get 1 Free
• Free Shipping
• No Payment Until…
• Free Gift
• etc…
Categorize as much as possible at the data collection stage
Categorical Data
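To make the offer code example concrete, here is a small rule-based sketch that maps free-form offer text onto a fixed code set. The code values (`BOGO`, `FREE_SHIP`, and so on), the keyword patterns, and the function name `categorize_offer` are all hypothetical; a real code structure would follow the organization's own categorization guideline.

```python
import re

# Hypothetical code structure based on the offer types listed above.
# Order matters: the first matching rule wins.
OFFER_RULES = [
    ("BOGO", re.compile(r"buy\s*(1|one).*get\s*(1|one)", re.I)),
    ("FREE_SHIP", re.compile(r"free\s+shipping", re.I)),
    ("FREE_GIFT", re.compile(r"free\s+gift", re.I)),
    ("DEFERRED_PAY", re.compile(r"no\s+payment", re.I)),
    ("PCT_OFF", re.compile(r"\d+\s*%")),
    ("DOLLAR_OFF", re.compile(r"\$\s*\d+")),
]
NOT_AVAILABLE = "NA"    # blank or missing offer text
UNCLASSIFIED = "OTHER"  # text present, but no rule matched; route to manual review


def categorize_offer(text):
    """Map a free-form offer description to exactly one offer code."""
    if text is None or not text.strip():
        return NOT_AVAILABLE
    for code, pattern in OFFER_RULES:
        if pattern.search(text):
            return code
    return UNCLASSIFIED
```

For example, `categorize_offer("Buy 1, Get 1 Free")` returns `"BOGO"`, while text that matches no rule lands in `"OTHER"`. Reviewing that bucket periodically is one way to decide when a new code is warranted.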
Be consistent throughout
• Survey Form
• Data Entry
• Inventory Database
• Data Collection & Compilation
• Summarization
• Modeling and Scoring
Create “Code” structure
• NEVER allow free-form answers
Categorization Guidelines
Categorization Guidelines (continued)
Create rules and DON’T deviate from them
• Training & automation
The more specific, the better
• Combine them later
But don’t allow too many variations (over 20) in one code
• Break into multiple codes if necessary
Don’t forget the end goals and don’t overdo it
• Must be “relevant”
Do you know how many customers you have in your database?
• Data conversion
o Create consistency
o Standardization
o Edit
o Purge
• Cover all bases – PII & RFM Data
• Create rules and be consistent
Data Hygiene & Data Append
What is hidden behind a simple name & address?
• Standardize Name & Address first
o Maintain PII (Personally Identifiable Information)
o Hygiene via periodic NCOA and standardization
• First & Last Name – Ethnicity, Gender
• Name, Address, Email – Demographic Data
• Address – Geo-demographic, Census Data
• Zip – County, Market Region, DMA
PII – Gateway to External Data
• Always consider buying data before collecting and building
• Compiled Demographic / Firmographic Data
• Behavioral / Transaction / Co-op Data
• Lifestyle / Attitudinal / Survey
• Census / Geo-Demographic Data
External Data
• Test multiple data sources
1. Depth of information / Uniqueness
2. Coverage / Match Rate
3. Consistency / Update Cycle
4. Price
5. User-friendliness
6. Delivery Options – Real-time?
• Learn about the data sources
o What’s real and what’s imputed?
o Don’t stop at Demographic: always consider “Behavioral” data
External Data Check List
“Missing data can be meaningful.”
• For Numeric Data (e.g., $, Counters, Dates, etc.)
o Incalculable vs. data-append non-matches
o Missing is missing: DO NOT fill in with 0’s
• For Categorical Data (e.g., Codes, Text, etc.)
o Leave room for “N/A” (e.g., blank, “N/A”, “0”, “.”, etc.)
o Code “non-matches” to external files differently
And yet, unstructured databases only store what’s available.
Missing Values
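The two rules above (do not zero-fill numeric gaps, and code non-matches differently from empty fields) can be sketched as follows. The code values `"NM"` and `"NA"` and the function names are hypothetical choices for this example.

```python
NON_MATCH = "NM"      # record did not match the external file (hypothetical code)
NOT_AVAILABLE = "NA"  # record matched, but the field itself is empty (hypothetical code)


def code_appended_value(matched, value):
    """Return a categorical code that keeps non-matches distinct from empty fields."""
    if not matched:
        return NON_MATCH
    if value is None or str(value).strip() in ("", "."):
        return NOT_AVAILABLE
    return str(value).strip()


def mean_ignoring_missing(values):
    """Average only the known values; None stays missing instead of becoming 0."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


if __name__ == "__main__":
    incomes = [50000, None, 70000, None]
    zero_filled = [v if v is not None else 0 for v in incomes]
    print(mean_ignoring_missing(incomes))       # average of known values only
    print(sum(zero_filled) / len(zero_filled))  # zero-filling drags the average down
```

With two of four incomes missing, zero-filling halves the apparent average, which is the kind of distortion the slide warns about.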
• Agree on imputation rules
o Do it upfront
o Must be part of scoring codes
• Educate non-analysts
o Hard to undo when combined with other values
o Train vendors
• Always check % missing
o Development Sample vs. Live Database
More on missing data
• Development Sample vs. Live Database
o Database Structure
o Variable List/Name
o Variable Value
o Imputation Assumption
Leads to disasters if “anything” is different
• Do NOT play with model groups that are set in the development sample
Scoring – Sample vs. Database
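One way to catch such differences before scoring is to compare the variable specifications of the development sample and the live database. The sketch below assumes each spec is a dictionary mapping a variable name to a `(type, imputation value)` tuple; that layout and the function name `compare_layouts` are illustrative assumptions.

```python
def compare_layouts(dev_spec, live_spec):
    """List every difference between two variable specs.

    Each spec maps variable name -> (data type, imputation value).
    An empty list means the two layouts agree.
    """
    issues = []
    for name in sorted(set(dev_spec) | set(live_spec)):
        if name not in live_spec:
            issues.append(f"{name}: missing from live database")
        elif name not in dev_spec:
            issues.append(f"{name}: not in development sample")
        elif dev_spec[name] != live_spec[name]:
            issues.append(f"{name}: {dev_spec[name]} (dev) != {live_spec[name]} (live)")
    return issues
```

A renamed variable shows up twice (once as missing, once as unexpected), and a changed imputation value (for example `None` in development but `0` in production) is reported as a mismatch.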
Most troubles happen after the models are built…
Check:
• Model Group Distributions
• Variable distributions (values and indices)
• Missing Values
• Match rate for appended data
• Scoring codes, including score breaks
• Compare to previous runs – Check Deterioration
Set parameters for acceptable differences and enforce them
Scoring QC
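The model group distribution check can be sketched like this: apply the score breaks fixed on the development sample to the live scores, then flag any group whose share moves beyond an agreed tolerance. The function names, the convention that group 1 holds the highest scores, and the 5-point default tolerance are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter


def assign_group(score, breaks):
    """Map a raw score to a model group using fixed, ascending score breaks.

    Group 1 holds the highest scores (an assumed convention).
    """
    return len(breaks) + 1 - bisect_right(breaks, score)


def group_distribution(scores, breaks):
    """Return the share of records falling into each model group."""
    if not scores:
        raise ValueError("no scores to check")
    counts = Counter(assign_group(s, breaks) for s in scores)
    return {g: counts.get(g, 0) / len(scores) for g in range(1, len(breaks) + 2)}


def qc_distribution(expected, observed, tolerance=0.05):
    """Return groups whose share differs from the expected share by more than tolerance."""
    return {
        g: (expected[g], observed.get(g, 0.0))
        for g in expected
        if abs(expected[g] - observed.get(g, 0.0)) > tolerance
    }
```

Note that the breaks are passed in rather than recomputed from the live scores. Recomputing them would hide exactly the drift this check is meant to reveal.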
• Plan ahead to store model scores in the database and data-marts
o Sync with the main database
• Store raw scores, not just model groups
• Match the levels of scores
o Household
o Individual
o Email
o Product
Score installation in the database
Close the Loop Properly!
• 1-to-1 Marketing 101: “Learn from past campaigns”
• Must plan ahead
o No excuse for not doing it
o Schedule ahead & budget properly
o Keep historical promo/response data (even without marketing databases)
• Set metrics upfront
o List/Data Source
o By Offer, Creative, Season, Product, etc.
Back-end Analysis
• Start with “Canned” Reports and BI Dashboards from vendors
• Don’t insist on real-time for every report
• Get ready to create “Custom” reports and dashboard views
o Prioritize what you want
• Format and Delivery
• Timing and Interval
• Timeline to be covered (YTD, 12-mo, etc.)
Response Reports & BI Analytics
Set ROI Metrics, such as:
• Open, Click-through, Conversion Rates
o “Denominator” in each?
• Revenue
o Per 1,000 mailed/calls
o Per Order
o Per Display, Email, Click-through, Conversion
• By
o Source, Campaign, Time Period, Model Group, Offer, Creative, Targeting Criteria, Channel (in & outbound), Ad server, Publisher, Keyword, Script, Day-part, etc.
• Key variables must reside in the database
o Keep them in “ready-to-use” format
Key ROI Metrics
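The “denominator” question matters because the same campaign can look very different depending on what each rate is divided by. The sketch below fixes one set of denominators explicitly; those choices (click-through per open, conversion per click) and the function names are assumptions for illustration, not definitions from the deck.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def campaign_metrics(sent, opens, clicks, orders, revenue):
    """Compute a few ROI metrics with the denominator of each stated explicitly."""
    return {
        "open_rate": rate(opens, sent),              # denominator: pieces sent
        "click_through_rate": rate(clicks, opens),   # denominator: opens (assumption)
        "conversion_rate": rate(orders, clicks),     # denominator: clicks (assumption)
        "revenue_per_1000_sent": rate(1000 * revenue, sent),
        "revenue_per_order": rate(revenue, orders),
    }
```

The same function can be run per source, campaign, model group, or offer, as long as those keys are stored with the promotion and response history.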
Spec it out:
• Project Goal
• Data Source List (as detailed as possible)
• Final Variable List
• Project Flow:
o Collection
o Conversion & Categorization
o Summarization
o Matching / Data Append
o Development Sample
o Scoring
o Storage
o Back-end Analysis
Where to Begin with Analytical Sandbox
• In-house vs. Outsourcing; must consider
o Platform
o Software
o Programming
o Staffing
• Cost it out
o Don’t forget the update cost
• Back to the analysts for variable list review
• Don’t be shy about asking consultants for help
Who will build the sandbox?
It is not just about modeling, but all surrounding services as well
10 essential items to consider when outsourcing analytical projects
1. Consulting capability: Translate marketing goals into mathematics
2. Data processing: Conversion, edit, summarization, data-append, etc.
3. Pricing structure: Model development is only one part; hidden fees?
4. Track record in the industry: Not in rocket science, but in marketing
5. Types of models supported: Watch out for one-trick ponies
6. Speed of execution: Turnaround time measured in days, not weeks
7. Documentation: Full disclosure of algorithms, charts and reports
8. Scoring validation: Job not done until fully scored and validated
9. Back-end analysis: For true “Closed-loop” marketing
10. Ongoing Support: Periodic review and update
10 essential items to consider when outsourcing analytics
• Know what you need, but don’t overdo it
“Modeling is making the best of what’s available”
• Take a phased approach
o If budget is tight, start with low-hanging fruit
o Proof of concept without full database commitment in the beginning
o Maintain consistency
o Keep historical data
Scope it out
• Invest in analytics: It is about the insights, not just the data
• Set business & analytical goals first
• Advanced analytics requires its own sandbox
• Ensure consistent data every step of the way: From sampling to scoring
• Check every data source, but don’t wait for a perfect data set
• Match the levels of data (Data Summary)
• Don’t overdo it – employ a phased approach
• Ask for help
Key Takeaways
Stephen H. Yu · Data Strategy & Analytics Consultant· [email protected] · 201.218.2068