Claudia Perlich, Chief Scientist, Dstillery, at MLconf NYC

Post on 15-Jul-2015

All the data and still not enough!

Claudia Perlich, Chief Scientist

@claudia_perlich

Predictive Modeling: Algorithms that Learn Functions

Income    Age   Buy
123,000   30    yes
51,100    40    yes
68,000    55    no
74,000    46    no
23,000    47    yes
100,000   49    no

Data for Predictive Modeling

[Diagram: data matrix with examples as rows and features as columns, plus a target column (?): yes, yes, no, no, yes, no]

Rules for Predictive Modeling

[Diagram: examples as rows, features as columns, plus a target column]

Data should be:

Large enough

Independent and identically distributed (i.i.d.)

Paradox of Big Data: “You never have the data you want”

Art of making do with second best

IBM: Sales Force Optimization

Wallet is NEVER observed

[Diagram labels: IBM Sales to this Company; Company Revenue (D&B); Wallet/Opportunity. Callouts: "We observe this in the data" and "But we do not observe this"]

How can we make this a predictive modeling problem?

Data for Wallet Estimation?

[Diagram: examples as rows, features as columns, plus a target column "Wallet": 10, 5, 31, 17, 39, 4]

REALISTIC Wallets as Quantiles

Motivation:

Imagine 100 identical firms with identical IT needs

Consider the distribution of the IBM sales to these firms

Bottom firms should spend as much as the top

Define wallet as a high percentile of spending conditional on the customer attributes

[Figure: distribution of IBM Sales (x-axis) by Frequency (y-axis), with the Wallet Estimate marked]

Data for Wallet Estimation

[Diagram: examples as rows, features as columns, plus a target column "Revenue": 10, 5, 31, 17, 39, 4]

Quantile Regression optimizing weighted absolute loss
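The "weighted absolute loss" on this slide is commonly written as the pinball loss: under-predictions cost q per unit and over-predictions cost (1 - q). The following is a minimal sketch, not the model used in the talk. It only fits a constant, and the function names are illustrative assumptions. The sales values reuse the target column from the preceding slide.

```python
def pinball_loss(y_true, y_pred, q):
    """Average weighted absolute (pinball) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction costs q per unit, over-prediction costs (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Return the observed value that minimizes the pinball loss as a constant prediction."""
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * len(y_true), q))


# Target values from the slide, read here as IBM sales to six comparable firms.
sales = [10, 5, 31, 17, 39, 4]
wallet_like = best_constant(sales, 0.9)  # a high quantile stands in for the wallet
```

With q = 0.9 the minimizer sits near the top of the observed spending, which is the sense in which a high quantile can stand in for the unobserved wallet. A real quantile regression would replace the constant with a function of the customer attributes.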

[Figure series: IBM Revenue (y-axis) plotted against Firm Sales / Company Sales (x-axis), with "Opportunity for C1" and "Opportunity for C2" marked for two example companies, C1 and C2]

Medical Diagnosis: Breast Cancer

© IBM Corporation 2008

Siemens: Computer-Aided Detection of Breast Cancer in Mammograms

1,712 patients, 6,816 images (MLO and CC views), 105,000 candidates

Each candidate: image feature vector [x1, x2, …, x117]

Target: Malignant?

Siemens Medical: fMRI breast cancer data

[Figure: model score (y-axis) against log of patient ID (x-axis); every point is a candidate. The patient IDs fall into four groups:]

245 patients: 36% cancer

414 patients: 1% cancer

1,027 patients: 0% cancer

18 patients: 85% cancer

In essence, the most predictive variable is the patient ID
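One simple way to spot this kind of leakage is to bucket an identifier that should carry no signal and compare the target rate across buckets. The sketch below is an illustration under assumptions: the function name, the bucket boundaries, and the toy records are made up and are not the Siemens data.

```python
def target_rate_by_id_range(records, boundaries):
    """Group (patient_id, label) records into ID ranges and return the positive rate per range.

    boundaries is a sorted list of exclusive upper bounds; IDs at or above the
    last bound fall into a final open-ended range. Empty ranges yield None.
    """
    counts = [[0, 0] for _ in range(len(boundaries) + 1)]  # [positives, total]
    for patient_id, label in records:
        bucket = sum(patient_id >= b for b in boundaries)
        counts[bucket][0] += int(label)
        counts[bucket][1] += 1
    return [pos / total if total else None for pos, total in counts]


# Toy, made-up records: (patient_id, has_cancer).
records = [(3, 1), (7, 0), (9, 1), (120, 0), (150, 0), (180, 0), (5000, 0), (5200, 0)]
rates = target_rate_by_id_range(records, boundaries=[100, 1000])
```

If the rates differ sharply between ID ranges, as they do on the slide, the ID is probably encoding the data source rather than anything about the patient.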

Data for Diagnosis from Multiple Sources

[Diagram: examples as rows, features as columns, plus a target column "Cancer": yes, no, no, no, no, no]

Modeling the Sources …

[Diagram: examples as rows, features as columns, plus "Source" and target "Cancer" columns]

Source   Cancer
1        yes
2        no
1        no
1        no
4        no
3        no

Digital Advertising

Online Display Advertising

Do people buy stuff after seeing an ad?

Data For Advertising

[Diagram: examples as rows, features as columns, with columns labeled "PV" and "Buy"; Buy values: no, no, no, no, yes, yes]

Multi-Armed Bandit: Exploration vs. exploitation

Show some random ads to learn a good model

Tradeoff between learning and using
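The tradeoff can be illustrated with epsilon-greedy, one of the simplest bandit strategies. This is a sketch under assumptions, not the method from the talk: the conversion rates, epsilon, and function name are invented for the simulation.

```python
import random


def epsilon_greedy(true_rates, rounds, epsilon, seed=0):
    """Simulate epsilon-greedy ad selection and return (pull counts, estimated rates)."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    successes = [0] * len(true_rates)
    for _ in range(rounds):
        estimates = [s / c if c else 0.0 for s, c in zip(successes, counts)]
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore: show a random ad
        else:
            arm = max(range(len(true_rates)), key=estimates.__getitem__)  # exploit
        counts[arm] += 1
        successes[arm] += rng.random() < true_rates[arm]  # simulated conversion
    estimates = [s / c if c else 0.0 for s, c in zip(successes, counts)]
    return counts, estimates


# Two hypothetical ads with made-up conversion rates.
counts, estimates = epsilon_greedy(true_rates=[0.2, 0.8], rounds=2000, epsilon=0.1)
```

A larger epsilon spends more impressions on learning, while a smaller one spends more on using what has been learned so far.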

Size of the Training Sample?

[Diagram: examples as rows, features as columns, plus a target column "Buy": no, no, no, no, yes, yes]

Reality of Online Purchases

[Diagram: examples as rows, features as columns, plus a target column "Buy": no, no, no, no, no, yes]

Online Display Advertising

Proxy for purchase? How about click?

[Diagram: target column "Click?": yes, yes, no, no, yes, no]

Optimizing Clicks in Advertising?

Click Optimization: Fumbling in the Dark

[Figure: Top 10 Apps by CTR]

How Big Data and Optimization Are Killing Metrics

90% of clicks are "accidental/non-intentional"

10% are meaningful, and changes can be measured

Optimization can find structure in the other 90%

You will end up with only non-intentional …

Online Display Advertising

Who cares about the ad anyway?

Predict other indicators: search, or brand site visit / scheduled test drive

[Diagram: examples as rows, features as columns, plus a target column "Site Visit": no, no, no, yes, yes, yes]

Advertising Fraud

Is there really a person on the other end wanting to see the site?

Data for Fraud Detection

[Diagram: examples as rows, features as columns, plus a target column "Human?": yes, no, no, yes, yes, no]

Bot traffic networks

Online Display Advertising

Who should you really advertise to???

Data for Advertising Impact

[Diagram: examples as rows, features as columns, plus a target column "Impact": 1, 0.3, 0.5, 0, 0, 0.1]

Alternative Histories (Counterfactual)

Fundamentally Impossible!

[Diagram: examples as rows, features as columns, plus a target column "Impact": 1, 0.3, 0.5, 0, 0, 0.1]

Build two separate models and calculate impact as the difference

[Diagram: two training sets, "Examples 1: seen ad" and "Examples 2: not seen ad", each with target column "Site Visit": yes, no, no, yes, no, no]

Expected Impact: p(SV | ad) - p(SV | no ad)
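The two-model idea can be sketched with a deliberately simple stand-in for each model: the site-visit rate per segment, fitted separately on users who saw the ad and users who did not. The segment names and examples are made up for illustration, and a real system would use a proper predictive model in place of the rate table.

```python
from collections import defaultdict


def fit_rate_model(examples):
    """Fit a deliberately simple 'model': the site-visit rate per segment."""
    visits, totals = defaultdict(int), defaultdict(int)
    for segment, visited in examples:
        visits[segment] += int(visited)
        totals[segment] += 1
    return {seg: visits[seg] / totals[seg] for seg in totals}


def expected_impact(model_ad, model_no_ad, segment):
    """Estimate p(SV | ad) - p(SV | no ad) for one segment."""
    return model_ad[segment] - model_no_ad[segment]


# Toy, made-up examples: (segment, site_visit). Segment names are hypothetical.
seen_ad = [("auto_intender", True), ("auto_intender", True), ("auto_intender", False),
           ("auto_intender", False), ("casual", False), ("casual", False)]
not_seen_ad = [("auto_intender", True), ("auto_intender", False), ("auto_intender", False),
               ("auto_intender", False), ("casual", False), ("casual", False)]

model_ad = fit_rate_model(seen_ad)
model_no_ad = fit_rate_model(not_seen_ad)
impact = expected_impact(model_ad, model_no_ad, "auto_intender")
```

The difference only reads as causal impact when the two groups are comparable, for example when ad exposure was randomized.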

Use predictive models to measure impact

Negative Test: wrong ad

Positive Test: A/B comparison

Relationship of organic conversion rate and causal impact

[Figure: additive causal impact (y-axis, -0.001 to 0.006) against organic conversion propensity (x-axis, 0.40% to 1.40%)]

Audiences in Video Advertising

Pleasing the advertising oracle …

Audience reports from matched populations in Facebook

68% of the ads were shown to females

Makeup for 32% of ads

[Image label: The Oracle]

Data for Audience Optimization

[Diagram: examples as rows, features as columns, plus a target column "Gender": male, female, female, male, male, female]

Weighted Logistic Regression on aggregated

[Diagram: each example appears twice, once per gender, with a weight column]

Weight   Gender
0.32     male
0.68     female
0.32     male
0.68     female
0.73     male
0.27     female
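One plausible reading of this slide is that each impression is duplicated, once per gender, and weighted by the share reported for its audience. The sketch below fits a logistic regression on such rows by plain gradient ascent. The feature layout ([intercept, placement_b]) and function names are assumptions for illustration, and only the weights come from the slide.

```python
import math


def fit_weighted_logistic(rows, lr=1.0, steps=3000):
    """Fit logistic regression by gradient ascent on the weighted log-likelihood.

    rows: list of (features, label, weight) with label 1 = female, 0 = male.
    """
    n_features = len(rows[0][0])
    coef = [0.0] * n_features
    total_weight = sum(w for _, _, w in rows)
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-sum(c * xi for c, xi in zip(coef, x))))
            for j, xi in enumerate(x):
                grad[j] += w * (y - p) * xi
        coef = [c + lr * g / total_weight for c, g in zip(coef, grad)]
    return coef


def predict_female(coef, x):
    """Predicted probability that an impression with features x reaches a female."""
    return 1.0 / (1.0 + math.exp(-sum(c * xi for c, xi in zip(coef, x))))


# Hypothetical features: [intercept, placement_b]. Weights are the ones on the slide.
rows = [
    ([1.0, 0.0], 0, 0.32), ([1.0, 0.0], 1, 0.68),
    ([1.0, 0.0], 0, 0.32), ([1.0, 0.0], 1, 0.68),
    ([1.0, 1.0], 0, 0.73), ([1.0, 1.0], 1, 0.27),
]
coef = fit_weighted_logistic(rows)
```

Because this toy model has one parameter per placement, its predictions should reproduce the reported shares. With richer features the same weighting would let individual-level predictors be learned from aggregate reports.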

Hyperlocal Targeting?

Foursquare locations: very noisy…

Data for Location Reliability in Auction

[Diagram: examples as rows, features as columns, plus a target column "Reliable?": yes, no, no, yes, yes, no]

30% of smartphone users travel faster than the speed of sound …
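A check of this kind can be sketched by computing the implied speed between two consecutive location pings from the same device. The ping layout (timestamp in seconds, latitude, longitude) and the function names are assumptions for illustration, not the pipeline from the talk.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 °C


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_supersonic(ping_a, ping_b):
    """Each ping is (timestamp_seconds, lat, lon); True if the implied speed exceeds sound."""
    (t1, lat1, lon1), (t2, lat2, lon2) = ping_a, ping_b
    elapsed = abs(t2 - t1)
    if elapsed == 0:
        # Two different places at the same instant is treated as impossible.
        return (lat1, lon1) != (lat2, lon2)
    return haversine_m(lat1, lon1, lat2, lon2) / elapsed > SPEED_OF_SOUND_M_S


# Hypothetical pings: New York, then Los Angeles ten minutes later.
suspicious = is_supersonic((0, 40.7128, -74.0060), (600, 34.0522, -118.2437))
```

Devices that trip this check repeatedly could then be labeled as unreliable examples for the location-reliability target.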

Paradox of Big Data: “You never have the data you want”

Art of making do with second best

All a matter of how creative you are at cheating …
