Alina Beygelzimer, Senior Research Scientist, Yahoo Labs, at MLconf NYC


Learning with Exploration

Alina Beygelzimer, Yahoo Labs, New York

(based on work by many)

Interactive Learning

Repeatedly:

1 A user comes to Yahoo

2 Yahoo chooses content to present (URLs, ads, news stories)

3 The user reacts to the presented information (clicks on something)

Making good content decisions requires learning from user feedback.

Abstracting the Setting

For t = 1, . . . ,T :

1 The world produces some context x ∈ X

2 The learner chooses an action a ∈ A

3 The world reacts with reward r(a, x)

Goal: Learn a good policy for choosing actions given context
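
As a concrete illustration of this protocol, here is a minimal runnable sketch in Python. The environment, the uniform-random action choice, and the toy setup are illustrative only (the click rates reuse the bagels/pizza example that appears later in the talk); the protocol itself is the one above.

```python
import random

# Contexts are cities, actions are content choices, rewards are clicks (1) or not (0).
ACTIONS = ["bagels", "pizza"]
TRUE_CTR = {("NY", "bagels"): 0.6, ("NY", "pizza"): 1.0,
            ("Chicago", "bagels"): 0.7, ("Chicago", "pizza"): 0.4}

T = 10
for t in range(T):
    x = random.choice(["NY", "Chicago"])          # 1. the world produces a context x
    a = random.choice(ACTIONS)                    # 2. the learner chooses an action a (here: uniformly at random)
    r = int(random.random() < TRUE_CTR[(x, a)])   # 3. the world reacts with reward r(a, x)
    print(t, x, a, r)                             # only the chosen action's reward is ever observed
```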

Dominant Solution

1 Deploy some initial system

2 Collect data using this system

3 Use machine learning to build a reward predictor r̂(a, x) from the collected data (see the sketch below)

4 Evaluate the new system, arg max_a r̂(a, x), via offline evaluation on past data or a bucket test

5 If metrics improve, switch to this new system and repeat
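
A rough sketch of steps 2–4 of this pipeline, under simple assumptions (tabular contexts and a per-(context, action) empirical mean as the reward predictor; the data, the fallback value, and all names are illustrative):

```python
from collections import defaultdict

# Step 2: data collected by the deployed system, as (context, action, reward) triples.
logged = [("NY", "bagels", 1), ("NY", "bagels", 0),
          ("Chicago", "pizza", 0), ("Chicago", "pizza", 1)]

# Step 3: build a reward predictor r_hat(a, x) from the collected data.
totals, counts = defaultdict(float), defaultdict(int)
for x, a, r in logged:
    totals[(x, a)] += r
    counts[(x, a)] += 1

def r_hat(a, x, default=0.5):
    # Empirical mean where we have data; an arbitrary default where we do not.
    return totals[(x, a)] / counts[(x, a)] if counts[(x, a)] else default

# Step 4: the candidate new system serves the action with the highest predicted reward.
ACTIONS = ["bagels", "pizza"]
def new_system(x):
    return max(ACTIONS, key=lambda a: r_hat(a, x))

print(new_system("NY"), new_system("Chicago"))
```

Note that r_hat is forced to guess for (context, action) pairs the deployed system never tried, which is exactly what goes wrong in the example that follows.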


Example: Bagels vs. Pizza for New York and Chicago users

Initial system: NY gets bagels, Chicago gets pizza.

Each cell shows Observed CTR / Estimated CTR / True CTR:

            Pizza              Bagels
New York    ? / 0.4595 / 1     0.6 / 0.6 / 0.6
Chicago     0.4 / 0.4 / 0.4    0.7 / 0.7 / 0.7

Bagels win. Switch to serving bagels for all and update the model based on the new data. Yikes! Missed out big in NY!

Basic Observations

1 Standard machine learning is not enough. The model fits the collected data perfectly.

2 More data doesn’t help: Observed = True where data was collected.

3 Better data helps! Exploration is required.

4 Prediction errors are not a proxy for controlled exploration.


Attempt to fix

New policy: bagels in the morning, pizza at night for both cities

This will overestimate the CTR for both!

Solution: The deployed system should be randomized, with probabilities recorded.


Offline Evaluation

Evaluating a new system on data collected by the deployed system may mislead badly. Each cell shows Observed CTR / Estimated CTR / True CTR:

            Pizza              Bagels
New York    ? / 1 / 1          0.6 / 0.6 / 0.5
Chicago     0.4 / 0.4 / 0.4    0.7 / 0.7 / 0.7

The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.

The Evaluation Problem

Given a new policy, how do we evaluate it?

One possibility: Deploy it in the world.

Very Expensive! Need a bucket for every candidate policy.


A/B testing for evaluating two policies

Policy 1: Use the "pizza for New York, bagels for Chicago" rule

Policy 2: Use the "bagels for everyone" rule

Segment users randomly into Policy 1 and Policy 2 groups:

Policy 2    Policy 1    Policy 2    Policy 1    · · ·
no click    no click    click       no click    · · ·

Two weeks later, evaluate which is better.


Instead, randomize every transaction (at least for transactions you plan to use for learning and/or evaluation).

Simplest strategy: ε-greedy. Go with the empirically best policy, but always choose a random action with probability ε > 0.

no click          no click          click             no click          · · ·
(x, b, 0, p_b)    (x, p, 0, p_p)    (x, p, 1, p_p)    (x, b, 0, p_b)

Offline evaluation

Later evaluate any policy using the same events. Each evaluation is cheap and immediate.
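
A minimal sketch of ε-greedy serving that records the propensity of each chosen action, so that every logged event has the (x, a, r_a, p_a) form shown above. The greedy rule, action set, and logging details are illustrative placeholders.

```python
import random

def epsilon_greedy_serve(x, actions, greedy_action, epsilon=0.1):
    """Choose an action for context x and return (action, probability it was chosen with)."""
    best = greedy_action(x)            # action the empirically best policy would take
    if random.random() < epsilon:
        a = random.choice(actions)     # explore: uniform over all actions
    else:
        a = best                       # exploit: empirically best action
    # Probability that this particular action was selected under the randomization above.
    p_a = epsilon / len(actions) + (1 - epsilon) * (a == best)
    return a, p_a

log = []                               # exploration samples (x, a, r_a, p_a)

# Toy usage with a hypothetical greedy rule that always serves bagels:
a, p_a = epsilon_greedy_serve("NY", ["bagels", "pizza"], greedy_action=lambda x: "bagels")
r_a = 0                                # suppose the user did not click
log.append(("NY", a, r_a, p_a))
```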


The Importance Weighting Trick

Let π : X → A be a policy. How do we evaluate it?

Collect exploration samples of the form

(x, a, r_a, p_a),

where x = context, a = action, r_a = reward for the action, p_a = probability of action a,

then evaluate

Value(π) = Average( r_a · 1{π(x) = a} / p_a )
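
A direct transcription of this estimator in Python; the exploration samples are assumed to be (x, a, r_a, p_a) tuples, e.g. the log produced by the ε-greedy sketch above.

```python
def ips_value(policy, exploration_data):
    """Importance-weighted estimate of the value of `policy`.

    exploration_data: iterable of (x, a, r_a, p_a) tuples collected with randomization.
    """
    total, n = 0.0, 0
    for x, a, r_a, p_a in exploration_data:
        # An event contributes only when the evaluated policy agrees with the logged action,
        # and is then weighted up by 1 / p_a to compensate for how rarely that action was taken.
        total += r_a * (policy(x) == a) / p_a
        n += 1
    return total / n if n else 0.0

# Example: evaluate the "bagels for everyone" policy on a logged dataset `log`:
# print(ips_value(lambda x: "bagels", log))
```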


The Importance Weighting Trick

Theorem

Value(π) is an unbiased estimate of the expected reward of π:

E_{(x, r⃗) ∼ D} [ r_{π(x)} ] = E[ Value(π) ],

with deviations bounded by O( 1 / (√T · min_x p_{π(x)}) ).

Example:

Action        1        2
Reward        0.5      1
Probability   1/4      3/4
Estimate      2 | 0    0 | 4/3
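
To check the unbiasedness on this two-action example directly, take the expectation over which action the logging distribution draws; the short computation below reproduces the true rewards 0.5 and 1.

```python
# Expected importance-weighted estimate for each fixed-action policy in the table above.
p = {1: 1/4, 2: 3/4}     # logging probabilities
r = {1: 0.5, 2: 1.0}     # rewards

for target in (1, 2):    # the policy that always plays `target`
    expected = sum(p[logged] * (r[logged] / p[logged] if logged == target else 0.0)
                   for logged in (1, 2))
    print(target, expected)   # prints 0.5 for action 1 and 1.0 for action 2 -- the true rewards
```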


Can we do better?

Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?

Value′(π) = Average( (r_a − r̂(a, x)) · 1{π(x) = a} / p_a + r̂(π(x), x) )

Why does this work?

E_{a ∼ p} [ r̂(a, x) · 1{π(x) = a} / p_a ] = r̂(π(x), x)

So the estimate stays unbiased, and it helps because when r_a − r̂(a, x) is small, the variance is reduced.
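
The same estimator written out in Python (a sketch; r_hat can be any reward predictor, even a poor one, without breaking unbiasedness):

```python
def dr_value(policy, exploration_data, r_hat):
    """Doubly-robust estimate of the value of `policy`, using a reward predictor r_hat(a, x)."""
    total, n = 0.0, 0
    for x, a, r_a, p_a in exploration_data:
        pi_x = policy(x)
        # Importance-weighted correction on the logged action, centered by the model's prediction...
        correction = (r_a - r_hat(a, x)) * (pi_x == a) / p_a
        # ...plus the model's prediction for the action the evaluated policy would take.
        total += correction + r_hat(pi_x, x)
        n += 1
    return total / n if n else 0.0
```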


How do you directly optimize based on past exploration data?

1 Learn r̂(a, x).

2 Compute for each x and a′ ∈ A (sketched below):

(r_a − r̂(a, x)) · 1{a′ = a} / p_a + r̂(a′, x)

3 Learn π using a cost-sensitive multiclass classifier.
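
A sketch of step 2: turning one exploration event into a per-action reward estimate that a cost-sensitive multiclass learner can consume. The function name and dictionary output are illustrative; the cost-sensitive learner itself is not shown.

```python
def per_action_rewards(x, a, r_a, p_a, actions, r_hat):
    """Doubly-robust reward estimate for every candidate action a' on a single logged event."""
    return {a_prime: (r_a - r_hat(a, x)) * (a_prime == a) / p_a + r_hat(a_prime, x)
            for a_prime in actions}

# Step 3 then fits a cost-sensitive multiclass classifier to these per-action rewards
# (or to costs such as max_reward - reward), yielding the learned policy pi.
```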

Take home summary

Using exploration data

1 There are techniques for using past exploration data to evaluate any policy.

2 You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization.

Doing exploration

1 There has been much recent progress on practical regret-optimal algorithms.

2 ε-greedy has suboptimal regret but is a reasonable choice in practice.

Comparison of Approaches

               Supervised               ε-greedy                        Optimal CB algorithms
Feedback       full                     bandit                          bandit
Regret         O(√(ln(|Π|/δ) / T))      O((|A| ln(|Π|/δ) / T)^(1/3))    O(√(|A| ln(|Π|/δ) / T))
Running time   O(T)                     O(T)                            O(T^1.5)

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014.

M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits, 2011.

A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2011.