Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC
TRANSCRIPT
All the data and still not enough!
Claudia Perlich Chief Scientist
@claudia_perlich
Predictive Modeling: Algorithms that Learn Functions
Income Age Buy
123,000 30 yes
51,100 40 yes
68,000 55 no
74,000 46 no
23,000 47 yes
100,000 49 no
Data for Predictive Modeling
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
?
yes
yes
no
no
yes
no
Rules for Predictive Modeling
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Data should be:
Large enough
Independently Identically Distributed
Paradox of Big Data: “You never have the data you want”
Art of making do with second best
IBM: Sales Force Optimization
Wallet is NEVER observed
We observe this in the data
But we do not observe this
[Diagram labels: IBM Sales to this Company; Company Revenue (D&B); Wallet/Opportunity]
How can we make this a
predictive modeling problem?
Wallet
10
5
31
17
39
4
Data for Wallet Estimation?
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
REALISTIC Wallets as quantiles
Motivation
Imagine 100 identical firms with identical IT needs
Consider the distribution of the IBM sales to these firms
Bottom firms should spend as much as the top
Define wallet as a high percentile of spending conditional on the customer attributes
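The definition above can be sketched in a few lines. This is only an illustration, not the method used in the IBM project: it assumes that "conditional on the customer attributes" can be approximated by grouping firms into hypothetical segments, and it takes the 80th percentile of observed sales within each segment as the wallet. The segment names, the sales figures, and the helper names are all made up for the example.

```python
from collections import defaultdict

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (1..100) by the nearest-rank method."""
    ordered = sorted(values)
    # Smallest rank such that at least pct% of the values are at or below it.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def wallet_by_segment(firms, pct=80):
    """Estimate the wallet of each segment as a high percentile of observed sales.

    firms: iterable of (segment, observed_sales) pairs, where a segment is a
    hypothetical group of firms with similar attributes.
    """
    sales = defaultdict(list)
    for segment, amount in firms:
        sales[segment].append(amount)
    return {seg: nearest_rank_percentile(vals, pct) for seg, vals in sales.items()}

# Hypothetical data: sales to ten firms in two segments.
firms = [
    ("small", 1), ("small", 2), ("small", 4), ("small", 5), ("small", 10),
    ("large", 10), ("large", 17), ("large", 31), ("large", 39), ("large", 40),
]
wallets = wallet_by_segment(firms, pct=80)
print(wallets)
```

Every firm in a segment receives the same wallet estimate, so a firm with low current sales shows a large gap between wallet and sales, which is the opportunity.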
[Figure: distribution of IBM Sales (Frequency on the y-axis), with the Wallet Estimate marked]
Revenue
10
5
31
17
39
4
Data for Wallet Estimation
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Quantile Regression optimizing weighted absolute loss
[Figure: scatter plots of IBM Revenue against Firm Sales and Company Sales, with the Opportunity for companies C1 and C2 marked]
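The weighted absolute loss named in the slide title is often called the pinball loss. The sketch below is a minimal illustration with a constant prediction rather than a full regression. It reuses the six example revenue values from the earlier slide as toy data. The function names are assumptions made for this example.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean weighted absolute loss for quantile level q (0 < q < 1).

    Under-prediction (y > prediction) is weighted by q,
    over-prediction by (1 - q).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

def best_constant(y, q):
    """Return the observed value that minimizes the pinball loss as a constant prediction."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), q))

revenue = [10, 5, 31, 17, 39, 4]  # toy values taken from the slide's example column

high = best_constant(revenue, 0.9)  # a high quantile, the "wallet" view
low = best_constant(revenue, 0.1)   # a low quantile, for contrast
print(high, low)
```

With q close to 1, under-predicting costs far more than over-predicting, so the minimizer is pushed toward the top of the observed values. A quantile regression applies the same loss to a model that depends on the features instead of a constant.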
Medical Diagnosis: Breast Cancer
© IBM Corporation 2008 Slide 13
Siemens: Computer-Aided Detection of Breast Cancer in Mammograms
1712 Patients 6816 Images
105,000 Candidates
[ x1 , x2 , … , x117 ] Image feature vector
Malignant
?
MLO CC MLO CC
Siemens Medical: fMRI breast cancer data
245 Patients: 36% Cancer
414 Patients: 1% Cancer
1027 Patients: 0% Cancer
18 Patients: 85% Cancer
[Figure: Model score (y-axis) vs. Log of Patient ID (x-axis); every point is a candidate]
In essence, the most predictive variable is the patient ID
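One simple way to spot this kind of leakage is to check whether the target rate changes across ranges of an identifier that should carry no signal. The sketch below is an assumption-laden illustration, not the analysis from the talk: the records, the bucket boundaries, and the function name are hypothetical.

```python
from bisect import bisect_left

def positive_rate_by_id_bucket(records, upper_bounds):
    """Compute the share of positive labels per ID bucket.

    records: iterable of (patient_id, has_cancer) pairs.
    upper_bounds: sorted inclusive upper bounds of the ID buckets;
    IDs above the last bound fall into one extra bucket.
    """
    counts = [[0, 0] for _ in range(len(upper_bounds) + 1)]  # [positives, total]
    for patient_id, has_cancer in records:
        bucket = bisect_left(upper_bounds, patient_id)
        counts[bucket][0] += int(has_cancer)
        counts[bucket][1] += 1
    return [pos / total if total else None for pos, total in counts]

# Hypothetical records: (patient ID, cancer label).
records = [
    (12, True), (15, True), (18, False),
    (150, False), (180, False), (220, True),
    (5000, False), (5100, False),
    (90000, False), (95000, False),
]
rates = positive_rate_by_id_bucket(records, upper_bounds=[100, 1000, 10000])
print(rates)
```

If the data were one independent and identically distributed sample, the rates would be roughly equal across buckets. Large differences suggest that the ID encodes the data source.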
Data for Diagnosis from Multiple Sources
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Cancer
yes
no
no
no
no
no
Modeling the Sources …
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Source Cancer
1 yes
2 no
1 no
1 no
4 no
3 no
Digital Advertising
Online Display Advertising
Do people buy stuff after seeing an ad?
Data collection for post-view purchase conversion
[Diagram: timeline (Time) following a cohort of random prospects; outcome: ?]
Data For Advertising
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
PV Buy
no
no
no
no
yes
yes
Multi-Armed Bandit: Exploration vs. exploitation
Show some random ads to learn a good model
Tradeoff between learning and using
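The tradeoff can be illustrated with an epsilon-greedy policy, one of the simplest bandit strategies. It is chosen here only as an example; the talk does not say which strategy was used. The conversion rates, the parameter values, and the function name below are hypothetical.

```python
import random

def epsilon_greedy(true_rates, rounds, epsilon, seed=0):
    """Simulate showing ads with an epsilon-greedy policy.

    true_rates: hypothetical conversion probability of each ad (unknown to the policy).
    epsilon: share of rounds spent on random exploration.
    Returns (shows, wins) per ad.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    shows = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon or sum(shows) == 0:
            # Explore: show a random ad to keep learning.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: show the ad with the best observed conversion rate so far.
            arm = max(range(n_arms),
                      key=lambda a: wins[a] / shows[a] if shows[a] else 0.0)
        shows[arm] += 1
        wins[arm] += int(rng.random() < true_rates[arm])
    return shows, wins

shows, wins = epsilon_greedy([0.01, 0.05, 0.02], rounds=5000, epsilon=0.1)
print(shows, wins)
```

A larger epsilon yields more unbiased training data at the cost of showing more poorly performing ads; a smaller epsilon does the opposite.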
Size of the Training Sample?
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Buy
no
no
no
no
yes
yes
Very few luxury cars are bought online
Maserati $128,000
Reality of Online Purchases
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Buy
no
no
no
no
no
yes
Online Display Advertising
Proxy for purchase? How about click?
Click?
yes
yes
no
no
yes
no
Optimizing Clicks in Advertising?
Click Optimization: Fumbling in the Dark
Top 10 Apps by CTR
How Big Data and Optimization are killing Metrics
90% of clicks are ‘accidental/non-intentional’
10% are meaningful, and changes can be measured
Optimization can find structure in the other 90%
You will end up with only non-intentional …
Online Display Advertising
Who cares about the ad anyway?
Predict other indicators: search, brand site visit, or scheduling a test drive
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Site Visit
no
no
no
yes
yes
yes
Advertising Fraud
Is there really a person on the other end wanting to see the site?
Data for Fraud Detection
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Human?
yes
no
no
yes
yes
no
Telling the difference between an algorithm and a human
Turing test, CAPTCHA
Bot traffic networks
Online Display Advertising
Who should you really advertise to???
Data for Advertising Impact
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Impact
1
0.3
0.5
0
0
0.1
Alternative Histories (Counterfactual)
Fundamentally Impossible!
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Impact
1
0.3
0.5
0
0
0.1
Build two separate models and calculate impact as the difference
Site Visit
yes
no
no
yes
no
no
Site Visit
yes
no
no
yes
no
no
Examples 1: seen ad
Examples 2: not seen ad
Expected Impact: p(SV|ad) - p(SV|no ad), where SV = site visit
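The two-model approach can be sketched as follows. This is a deliberately small illustration: each "model" is just a per-segment site-visit rate, standing in for whatever predictive model would be used in practice. The segment names and the data are hypothetical. The sketch also assumes that, within a segment, who saw the ad is not related to who would have visited anyway.

```python
from collections import defaultdict

def fit_rate_model(examples):
    """Fit a trivial model: the site-visit rate per segment.

    examples: iterable of (segment, visited) pairs.
    Returns a dict mapping segment -> estimated p(SV).
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [visits, total]
    for segment, visited in examples:
        counts[segment][0] += int(visited)
        counts[segment][1] += 1
    return {seg: visits / total for seg, (visits, total) in counts.items()}

# Hypothetical data for the two groups.
seen_ad = [
    ("auto_fan", True), ("auto_fan", True), ("auto_fan", False), ("auto_fan", False),
    ("casual", True), ("casual", False), ("casual", False), ("casual", False),
]
not_seen_ad = [
    ("auto_fan", True), ("auto_fan", False), ("auto_fan", False), ("auto_fan", False),
    ("casual", True), ("casual", False), ("casual", False), ("casual", False),
]

model_ad = fit_rate_model(seen_ad)         # estimates p(SV | ad)
model_no_ad = fit_rate_model(not_seen_ad)  # estimates p(SV | no ad)

# Expected impact per segment: the difference between the two predictions.
impact = {seg: model_ad[seg] - model_no_ad[seg] for seg in model_ad}
print(impact)
```

In this toy data the ad changes behavior only for one segment, even though both segments contain people who visit the site.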
Use predictive models to measure impact
Negative Test: wrong ad
Positive Test: A/B comparison
Relationship of organic conversion rate and causal impact
[Figure: Additive causal impact (y-axis, -0.001 to 0.006) vs. Organic conversion propensity (x-axis, 0.40% to 1.40%)]
Audiences in Video Advertising
Pleasing the advertising oracle …
Audience reports from matched populations in Facebook
68% of the ads were shown to females
Makeup for 32% of ads
The Oracle
Data for Audience Optimization
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Gender
male
female
female
male
male
female
Weighted Logistic Regression on aggregated
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Weight Gender
0.32 male
0.68 female
0.32 male
0.68 female
0.73 male
0.27 female
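The weighting idea in the table above can be sketched like this: when only an aggregate report is available, each example is entered twice, once per gender label, weighted by the reported share. The code is a minimal pure-Python illustration with a single hypothetical binary feature. The feature, the learning rate, and the function names are assumptions made for the example, not details from the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logreg(rows, steps=5000, lr=0.5):
    """Fit p(label=1 | x) = sigmoid(w * x + b) by gradient descent on weighted log loss.

    rows: iterable of (x, label, weight) with label in {0, 1}.
    """
    rows = list(rows)
    total_weight = sum(weight for _, _, weight in rows)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, label, weight in rows:
            err = sigmoid(w * x + b) - label
            grad_w += weight * err * x
            grad_b += weight * err
        w -= lr * grad_w / total_weight
        b -= lr * grad_b / total_weight
    return w, b

# label 1 = female, 0 = male. x is a hypothetical binary feature of the example.
# Examples with x = 1 come from a report showing 68% female,
# examples with x = 0 from a report showing 27% female.
rows = [
    (1, 1, 0.68), (1, 0, 0.32),
    (0, 1, 0.27), (0, 0, 0.73),
]
w, b = fit_weighted_logreg(rows)
p_female_x1 = sigmoid(w + b)
p_female_x0 = sigmoid(b)
print(round(p_female_x1, 3), round(p_female_x0, 3))
```

With one feature the fitted probabilities simply reproduce the reported shares. With many features shared across reports, the same setup lets the model attribute the shares to individual features.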
Hyperlocal Targeting?
Foursquare locations: very noisy…
Data for Location Reliability in Auction
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Reliable?
yes
no
no
yes
yes
no
30% of smartphone users travel faster than the speed of sound …
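A check of this kind can be sketched by computing the speed implied by two consecutive location reports from the same device. The ping format (timestamp in seconds, latitude, longitude), the threshold, and the function names are assumptions for this illustration; the talk does not describe the actual implementation.

```python
import math

SPEED_OF_SOUND_M_S = 343.0    # approximate, in air at about 20 degrees C
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def implied_speed_m_s(ping_a, ping_b):
    """Speed needed to get from one ping to the other; pings are (t_seconds, lat, lon)."""
    (t1, lat1, lon1), (t2, lat2, lon2) = ping_a, ping_b
    distance = haversine_m(lat1, lon1, lat2, lon2)
    elapsed = abs(t2 - t1)
    if elapsed == 0:
        return math.inf if distance > 0 else 0.0
    return distance / elapsed

def is_plausible(ping_a, ping_b, max_speed=SPEED_OF_SOUND_M_S):
    """Flag a pair of pings as implausible if it implies supersonic travel."""
    return implied_speed_m_s(ping_a, ping_b) <= max_speed

new_york = (0, 40.7128, -74.0060)
los_angeles_one_minute_later = (60, 34.0522, -118.2437)
print(is_plausible(new_york, los_angeles_one_minute_later))
```

A device that fails this check has at least one unreliable location, though the check alone cannot say which of the two pings is wrong.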
Catalan traditions pop up everywhere ….
Data for Location Reliability in Auction
[Diagram: data matrix of Examples (rows) by Features (columns), with a Target column]
Reliable?
maybe
no
no
maybe
maybe
no
Paradox of Big Data: “You never have the data you want”
Art of making do with second best
All a matter of how creative you are at cheating…