Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

All the data and still not enough! Claudia Perlich Chief Scientist @claudia_perlich

Upload: sessionsevents

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

All the data and still not enough!

Claudia Perlich Chief Scientist

@claudia_perlich

Page 2: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Predictive Modeling: Algorithms that Learn Functions

Page 3: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Predictive Modeling

Features: Income, Age. Target: Buy. Each row is one example.

Income    Age   Buy
123,000   30    yes
51,100    40    yes
68,000    55    no
74,000    46    no
23,000    47    yes
100,000   49    no

Page 4: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Rules for Predictive Modeling

Target (?): yes, yes, no, no, yes, no
(Rows are examples; columns are features.)

Data should be:

Large enough

Independent and identically distributed (i.i.d.)

Page 5: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Paradox of Big Data: “You never have the data you want”

Art of making do with second best

Page 6: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

IBM: Sales Force Optimization

Page 7: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Wallet is NEVER observed

[Diagram labels: Company Revenue (D&B); Wallet/Opportunity; IBM Sales to this Company]

We observe this in the data: IBM sales to this company

But we do not observe this: wallet/opportunity

How can we make this a predictive modeling problem?

Page 8: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Wallet Estimation?

Target (Wallet): 10, 5, 31, 17, 39, 4
(Rows are examples; columns are features.)

Page 9: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

REALISTIC Wallets as quantiles

Motivation

Imagine 100 identical firms with identical IT needs

Consider the distribution of the IBM sales to these firms

Bottom firms should spend as much as the top

Define wallet as high percentile of spending conditional on the customer attributes

[Figure: distribution of IBM Sales (x-axis) by Frequency (y-axis), with the Wallet Estimate marked]
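
As a minimal sketch of the quantile definition above: for a group of comparable firms, the wallet estimate is an empirical high quantile of the observed IBM sales. The sales figures below are simulated, and the 90th percentile is an assumption (the slide only says "high percentile"); the helper name `empirical_quantile` is illustrative.

```python
import random

def empirical_quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

random.seed(0)
# Simulated IBM sales to 100 firms with identical attributes (made-up numbers).
ibm_sales = [random.lognormvariate(2.0, 0.8) for _ in range(100)]

WALLET_QUANTILE = 0.90  # assumption: which "high percentile" to use
wallet_estimate = empirical_quantile(ibm_sales, WALLET_QUANTILE)

# Opportunity per firm: estimated wallet minus what the firm already spends.
opportunity = [max(wallet_estimate - s, 0.0) for s in ibm_sales]
share_below = sum(s <= wallet_estimate for s in ibm_sales) / len(ibm_sales)
print(f"wallet estimate: {wallet_estimate:.2f}, firms at or below: {share_below:.0%}")
```

With real data the firms are not identical, which is why the quantile has to be conditioned on customer attributes rather than computed over one pooled group.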

Page 10: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Wallet Estimation

Target (Revenue): 10, 5, 31, 17, 39, 4
(Rows are examples; columns are features.)

Page 11: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Quantile Regression optimizing weighted absolute loss

[Figure: plots of IBM Revenue against Firm Sales / Company Sales (axis ticks 10 to 80 and 1 to 9), with the opportunity for two example companies, C1 and C2, marked]
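
The weighted absolute loss named on this slide is often called the pinball loss. The sketch below fits a single-feature quantile line by brute force on simulated data, assuming a quantile of 0.9. It relies on the linear-programming property that some optimal line passes through two data points, so it only suits tiny data sets and is not how a production solver works. All names and numbers are illustrative.

```python
import itertools
import random

def pinball_loss(y_true, y_pred, tau):
    """Weighted absolute loss: under-predictions cost tau, over-predictions 1 - tau."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total

def fit_quantile_line(x, y, tau):
    """Try the line through every pair of points and keep the lowest pinball loss."""
    best = None
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, skip
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = pinball_loss(y, [intercept + slope * v for v in x], tau)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

random.seed(1)
# Simulated data: company sales and the IBM revenue from each company.
firm_sales = [random.uniform(10, 80) for _ in range(40)]
ibm_revenue = [0.1 * s * random.random() for s in firm_sales]

TAU = 0.9  # assumption: the quantile used as the wallet
intercept, slope = fit_quantile_line(firm_sales, ibm_revenue, TAU)
fitted_wallet = [intercept + slope * s for s in firm_sales]

# Opportunity: distance from each company's revenue up to the quantile line.
opportunity = [max(w - r, 0.0) for w, r in zip(fitted_wallet, ibm_revenue)]
share_on_or_below = sum(
    r <= w + 1e-9 for r, w in zip(ibm_revenue, fitted_wallet)
) / len(firm_sales)
print(f"share of companies on or below the line: {share_on_or_below:.2f}")
```

Roughly a fraction `TAU` of the companies end up on or below the fitted line, which is what makes the line usable as a conditional wallet estimate.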

Page 12: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Medical Diagnosis: Breast Cancer

Page 13: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

© IBM Corporation 2008 Slide 13

Siemens: Computer-Aided Detection of Breast Cancer in Mammograms

1712 patients, 6816 images, 105,000 candidates

Each candidate is described by an image feature vector [ x1 , x2 , … , x117 ]

Target: malignant?

[Images: MLO CC MLO CC]

Page 14: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Siemens Medical: fMRI breast cancer data

Patients   Cancer
245        36%
414        1%
1027       0%
18         85%

[Figure: model score (y-axis) against log of patient ID (x-axis). Every point is a candidate.]

In essence, the most predictive variable is the patient ID
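
Using only the four group sizes and cancer rates shown on this slide, a short calculation illustrates how much the patient-ID range alone reveals. The variable names are illustrative, and treating the percentages as per-patient rates is an assumption.

```python
# (number of patients, cancer rate) for the four patient-ID groups on the slide.
groups = {
    "245 patients": (245, 0.36),
    "414 patients": (414, 0.01),
    "1027 patients": (1027, 0.00),
    "18 patients": (18, 0.85),
}

total_patients = sum(n for n, _ in groups.values())
expected_cases = {name: n * rate for name, (n, rate) in groups.items()}
total_cases = sum(expected_cases.values())
overall_rate = total_cases / total_patients

for name, (n, rate) in groups.items():
    share_of_patients = n / total_patients
    share_of_cases = expected_cases[name] / total_cases
    print(f"{name}: rate {rate:.0%}, {share_of_patients:.1%} of patients, "
          f"{share_of_cases:.1%} of expected cancer cases")
print(f"overall cancer rate: {overall_rate:.1%}")
```

Under that assumption the overall rate is about 6%, while knowing the ID group alone moves the estimate anywhere between 0% and 85%. That is the sense in which the identifier leaks the target.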

Page 15: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Diagnosis from Multiple Sources

Target (Cancer): yes, no, no, no, no, no
(Rows are examples; columns are features.)

Page 16: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Modeling the Sources …

Added feature: Source. Target: Cancer. Each row is one example.

Source   Cancer
1        yes
2        no
1        no
1        no
4        no
3        no

Page 17: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Digital Advertising

Page 18: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Online Display Advertising

Do people buy stuff after seeing an ad?

Page 20: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data For Advertising

Target (PV Buy): no, no, no, no, yes, yes
(Rows are examples; columns are features.)

Page 21: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Multi-Armed Bandit: Exploration vs. exploitation

Show some random ads to learn a good model

Tradeoff between learning and using
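
One simple way to trade off learning against using is an epsilon-greedy policy: show a random ad with a small probability and the current best ad otherwise. The sketch below is an illustration under assumed numbers (three hypothetical ads with made-up response rates, an epsilon of 0.1); the slides do not say which bandit strategy is used in practice.

```python
import random

def epsilon_greedy(true_rates, n_rounds, epsilon, seed=0):
    """Simulate showing ads; returns (times shown, successes) per ad."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    shows = [0] * n_arms
    successes = [0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: show a random ad to keep learning.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: show the ad with the best observed rate so far.
            estimates = [
                successes[a] / shows[a] if shows[a] else 0.0
                for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda a: estimates[a])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
    return shows, successes

# Hypothetical response rates for three ads (unknown to the policy).
shows, successes = epsilon_greedy([0.1, 0.3, 0.6], n_rounds=5000, epsilon=0.1)
print("times each ad was shown:", shows)
```

A larger epsilon learns faster but wastes more impressions on ads already known to be worse.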

Page 22: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Size of the Training Sample?

Target (Buy): no, no, no, no, yes, yes
(Rows are examples; columns are features.)

Page 24: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Reality of Online Purchases

Target (Buy): no, no, no, no, no, yes
(Rows are examples; columns are features.)

Page 25: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Online Display Advertising

Proxy for purchase? How about click?

Page 26: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Optimizing Clicks in Advertising?

Target (Click?): yes, yes, no, no, yes, no

Page 27: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Click Optimization: Fumbling in the Dark Top 10 Apps by CTR

Page 28: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

How Big Data and Optimization are killing Metrics

90% of clicks are ‘accidental/non intentional’

10% are meaningful, and changes can be measured

Optimization can find structure in the other 90%

You will end up with only non-intentional …
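
A toy calculation can make this mechanism concrete. The placements and rates below are entirely hypothetical, and equal impression volume per placement is assumed. Ranking by click-through rate (CTR) favors the placements where accidental clicks dominate, so the selected traffic contains an even smaller share of intentional clicks than the starting pool.

```python
# Hypothetical placements: (name, intentional click rate, accidental click rate).
placements = [
    ("news_site", 0.0010, 0.0010),
    ("recipe_blog", 0.0008, 0.0015),
    ("weather_app", 0.0004, 0.0040),
    ("flashlight_app", 0.0001, 0.0120),
    ("kids_game", 0.0000, 0.0200),
]

def ctr(placement):
    """Observed CTR cannot tell intentional and accidental clicks apart."""
    _, intentional, accidental = placement
    return intentional + accidental

def intentional_share(rows):
    """Share of clicks that are intentional, assuming equal impressions per row."""
    return sum(r[1] for r in rows) / sum(ctr(r) for r in rows)

# "Optimize for clicks": keep the two placements with the highest CTR.
top_by_ctr = sorted(placements, key=ctr, reverse=True)[:2]

print("selected:", [name for name, _, _ in top_by_ctr])
print(f"intentional share, all placements: {intentional_share(placements):.1%}")
print(f"intentional share, selected:       {intentional_share(top_by_ctr):.1%}")
```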

Page 29: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Online Display Advertising

Who cares about the ad anyway?

Page 30: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Predict Other indicators: search or brand site visit/schedule test drive

Target (Site Visit): no, no, no, yes, yes, yes
(Rows are examples; columns are features.)

Page 31: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Advertising Fraud

Page 32: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Is there really a person on the other end wanting to see the site?

Page 33: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Fraud Detection

Target (Human?): yes, no, no, yes, yes, no
(Rows are examples; columns are features.)

Page 35: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Bot traffic networks

Page 36: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Online Display Advertising

Who should you really advertise to???

Page 37: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Advertising Impact

Target (Impact): 1, 0.3, 0.5, 0, 0, 0.1
(Rows are examples; columns are features.)

Page 38: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Alternative Histories (Counterfactual)

Page 39: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Fundamentally Impossible!

Target (Impact): 1, 0.3, 0.5, 0, 0, 0.1
(Rows are examples; columns are features.)

Page 40: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Build two separate models and calculate impact as the difference

Examples 1 (seen ad), target (Site Visit): yes, no, no, yes, no, no

Examples 2 (not seen ad), target (Site Visit): yes, no, no, yes, no, no

Expected Impact: p(SV | ad) - p(SV | no ad)
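
The two-model idea can be sketched with the simplest possible "model": the observed site-visit rate per segment, fitted separately on people who saw the ad and on people who did not. The segments and records below are hypothetical, and a real system would use a proper classifier on many features instead of a lookup table.

```python
from collections import defaultdict

def fit_rate_model(rows):
    """Estimate p(site visit | segment) as the observed rate per segment."""
    visits = defaultdict(int)
    counts = defaultdict(int)
    for segment, site_visit in rows:
        counts[segment] += 1
        visits[segment] += site_visit
    return {segment: visits[segment] / counts[segment] for segment in counts}

# Hypothetical records: (segment, site_visit) with site_visit in {0, 1}.
seen_ad = [
    ("sports", 1), ("sports", 1), ("sports", 0), ("sports", 0),
    ("news", 1), ("news", 0), ("news", 0), ("news", 0),
]
not_seen_ad = [
    ("sports", 1), ("sports", 0), ("sports", 0), ("sports", 0),
    ("news", 1), ("news", 0), ("news", 0), ("news", 0),
]

model_ad = fit_rate_model(seen_ad)         # p(SV | ad)
model_no_ad = fit_rate_model(not_seen_ad)  # p(SV | no ad)

# Expected impact per segment: p(SV | ad) - p(SV | no ad).
expected_impact = {
    segment: model_ad[segment] - model_no_ad[segment] for segment in model_ad
}
print(expected_impact)
```

The difference only reads as a causal effect if the two groups are comparable, for example because ad exposure was randomized.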

Page 41: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Use predictive models to measure impact

Negative Test: wrong ad

Positive Test: A/B comparison

Page 42: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Relationship of organic conversion rate and causal impact

[Figure: additive causal impact (y-axis, -0.001 to 0.006) against organic conversion propensity (x-axis, 0.40% to 1.40%)]

Page 43: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Audiences in Video Advertising

Page 44: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC
Page 45: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Pleasing the advertising oracle …

Audience reports from matched populations in Facebook

68% of the ads were shown to females

Makeup for 32% of ads The Oracle

Page 46: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Audience Optimization

Target (Gender): male, female, female, male, male, female
(Rows are examples; columns are features.)

Page 47: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Weighted Logistic Regression on aggregated

Each example appears twice, once per gender, weighted by the reported share.

Weight   Gender
0.32     male
0.68     female
0.32     male
0.68     female
0.73     male
0.27     female
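
A minimal sketch of this weighting scheme, using the weights from the slide: each example is duplicated into a male copy and a female copy, weighted by the aggregate share reported for its audience, and a logistic regression is fitted with those weights. The single `in_audience_b` feature, the plain gradient-descent fit, and the learning rate are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(rows, n_features, lr=0.5, n_iter=5000):
    """Gradient descent on weighted log loss. rows: (features, label, weight)."""
    theta = [0.0] * n_features
    total_weight = sum(w for _, _, w in rows)
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for x, y, w in rows:
            p = sigmoid(sum(t * v for t, v in zip(theta, x)))
            for k in range(n_features):
                grad[k] += w * (p - y) * x[k]
        theta = [t - lr * g / total_weight for t, g in zip(theta, grad)]
    return theta

# Each example: ([bias, in_audience_b], reported share of females in its audience).
examples = [
    ([1.0, 0.0], 0.68),
    ([1.0, 0.0], 0.68),
    ([1.0, 1.0], 0.27),
]

rows = []
for features, female_share in examples:
    rows.append((features, 0, 1.0 - female_share))  # male copy
    rows.append((features, 1, female_share))        # female copy

theta = fit_weighted_logistic(rows, n_features=2)
p_female_a = sigmoid(theta[0])
p_female_b = sigmoid(theta[0] + theta[1])
print(f"p(female | audience A) = {p_female_a:.3f}")
print(f"p(female | audience B) = {p_female_b:.3f}")
```

With one indicator per audience the model simply recovers the reported shares. The point of the construction is that richer features let it generalize those aggregate shares to individual examples.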

Page 48: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Hyperlocal Targeting?

Foursquare locations: very noisy…

Page 49: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Data for Location Reliability in Auction

Target (Reliable?): yes, no, no, yes, yes, no
(Rows are examples; columns are features.)

Page 50: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

30% of smartphone users travel faster than the speed of sound …
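
A check of this kind can be sketched by computing the speed implied by two consecutive location reports from the same device. The ping format `(unix_seconds, latitude, longitude)`, the function names, and the example coordinates are assumptions; the distance uses the standard haversine formula on a spherical Earth.

```python
import math

EARTH_RADIUS_M = 6_371_000
SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at about 20 degrees Celsius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def implied_speed_m_s(ping_a, ping_b):
    """Speed needed to get from one ping to the next."""
    (t1, lat1, lon1), (t2, lat2, lon2) = ping_a, ping_b
    distance = haversine_m(lat1, lon1, lat2, lon2)
    elapsed = abs(t2 - t1)
    if elapsed == 0:
        return math.inf if distance > 0 else 0.0
    return distance / elapsed

def is_supersonic(ping_a, ping_b):
    return implied_speed_m_s(ping_a, ping_b) > SPEED_OF_SOUND_M_S

# Hypothetical pings: New York, then Los Angeles one minute later.
jump = ((0, 40.7128, -74.0060), (60, 34.0522, -118.2437))
# Hypothetical pings: about 111 m apart, one minute apart (walking pace).
walk = ((0, 40.7128, -74.0060), (60, 40.7138, -74.0060))

print("jump flagged:", is_supersonic(*jump))
print("walk flagged:", is_supersonic(*walk))
```

Devices that trip such a check repeatedly are candidates for unreliable location data.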

Page 53: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

Paradox of Big Data: “You never have the data you want”

Art of making do with second best

Page 54: Claudia Perlich, Chief Scientist, Dstillery at MLconf NYC

All a matter of how creative you are at cheating …