
Page 1:

Machine Learning in the Bandit Setting

Algorithms, Evaluation, and Case Studies

Lihong Li

Machine Learning, Yahoo! Research

SEWM, 2012-05-25

Page 2:

[Diagram: DATA → (statistics, ML, DM, …) → KNOWLEDGE (e.g., "E = mc²") → UTILITY → ACTION → MORE DATA; closing this loop is Reinforcement Learning]

Page 3:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 4:

Yahoo-User Interaction

[Diagram: a POLICY (serving strategy) observes a CONTEXT (gender, age, …), takes an ACTION (ads, news, ranking, …), and receives a REWARD (click, conversion, revenue, …)]

Goal: maximize total REWARD by optimizing the POLICY

Page 5:

Today Module @ Yahoo! Front Page

A small pool of articles chosen by editors

"Featured Article"

Page 6:

Objectives and Challenges

• Objectives
  • (informally) choose the most interesting articles for individual users
  • (formally) maximize click-through rate (CTR)

• Challenges
  • Dynamic content pool → fast learning
  • Sparse user visits → transfer interests among users
  • Partial user feedback → efficient explore/exploit

Page 7:

Challenge: Explore/Exploit

• Observation: only displayed articles get user click feedback

EXPLOIT (choose good articles) ←→ article CTR estimates ←→ EXPLORE (choose novel articles)

How to trade off?
… with dynamic article pools
… while considering user interests

Page 8:

Insufficient Exploration Example

One arm always pays $5/round; another pays $100 a quarter of the time (so $25/round on average).

Round:            1     2     3     4     5     6     7     8
Observed payoff: $5    $5    $0    $0    $0    $5    $5    $5

It turns out…   $100  $100  $100   $0    $0    $5    $5    $5

With too little exploration, the algorithm settles for the $5 arm and misses the better one.

Page 9:

Contextual Bandit Formulation

Multi-armed contextual bandit [LZ'08]

For t = 1, 2, …:
  Observe K arms $A_t$ and "context" $x_t \in \mathbb{R}^d$
  Select $a_t \in A_t$
  Receive reward $r_{t,a_t} \in [0,1]$
  $t \leftarrow t + 1$

In Today Module:
  $A_t$: available articles at time t
  $x_t$: user features (age, gender, interests, …)
  $a_t$: the displayed article at time t
  $r_{t,a_t}$: 1 for click, 0 for no click

Formally, we want to maximize $\sum_{t=1}^{T} r_{t,a_t}$.
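To make the protocol above concrete, here is a minimal Python sketch of the observe/select/reward loop; the `policy` and `env` interfaces (`observe`, `select`, `reward`, `update`) are illustrative assumptions, not part of the talk.

```python
def run_contextual_bandit(policy, env, T):
    """Minimal sketch of the contextual bandit protocol:
    observe arms and context, select an arm, receive a bounded reward."""
    total_reward = 0.0
    for t in range(T):
        arms, x = env.observe()      # K arms A_t and context x_t in R^d
        a = policy.select(arms, x)   # choose a_t in A_t
        r = env.reward(a)            # observe r_{t, a_t} in [0, 1]
        policy.update(x, a, r)       # learn only from the displayed arm's feedback
        total_reward += r
    return total_reward
```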

Page 10:

Another Example – Display Ads

$A_t$: eligible ads in the current page view
$x_t$: page/user features
$a_t$: the displayed ad(s)
$r_{t,a_t}$: the $ amount if clicked/converted, 0 otherwise

Page 11:

Yet Another Example – Ranking

$A_t$: possible document rankings for query $q_t$
$x_t$: query/document features
$a_t$: the displayed ranking for query $q_t$
$r_{t,a_t}$: 1 if the session succeeds, 0 otherwise

Page 12:

Related Work

• Standard information retrieval and collaborative filtering
  • Also concern (personalized) recommendation
  • But with (almost) static users/items
    → training often done in batch/offline mode
    → no need for online exploration

• Full reinforcement learning
  • General: includes bandit problems as special cases
  • Needs to tackle "temporal credit assignment"

Page 13:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 14:

Prior Bandit Algorithms

• Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
• Bayesian optimal solution: John Gittins

Page 15:

Traditional K-armed Bandits

Assumption: CTR (click-through rate) not affected by user features

$\mathrm{CTR}_1 \approx \mu_1$, $\mathrm{CTR}_2 \approx \mu_2$, $\mathrm{CTR}_3 \approx \mu_3$

ε-greedy:
  with prob 1−ε: choose article $\arg\max_a \hat\mu_a$
  with prob ε: choose a random article

UCB1:
  choose article $\arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$
  The more "a" has been displayed, the less uncertainty in $\mathrm{CTR}_a$.

CTR estimates: $\hat\mu_a$ = #clicks / #impressions

No contexts → no personalization
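As a concrete reference for the two context-free rules above, here is a small Python sketch; the click/view bookkeeping and the names `eps` and `alpha` are illustrative, and the UCB rule follows the slide's $\hat\mu_a + \alpha/\sqrt{N_a}$ form rather than the classical UCB1 bonus.

```python
import numpy as np

def egreedy(clicks, views, eps, rng=None):
    """epsilon-greedy over K arms given per-arm click/view counts (context-free)."""
    rng = rng or np.random.default_rng()
    ctr = clicks / np.maximum(views, 1)        # mu_hat = #clicks / #impressions
    if rng.random() < eps:
        return int(rng.integers(len(ctr)))     # explore: random article
    return int(np.argmax(ctr))                 # exploit: best estimated CTR

def ucb1(clicks, views, alpha):
    """UCB-style rule from the slide: mu_hat + alpha / sqrt(N_a)."""
    ctr = clicks / np.maximum(views, 1)
    bonus = alpha / np.sqrt(np.maximum(views, 1))
    return int(np.argmax(ctr + bonus))
```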

Page 16:

Contextual Bandit Algorithms

• EXP4 [ACFS'02], EXP4.P [BLLRS'11], elimination [ADKLS'12]
  • Strong theoretical guarantees
  • But computationally expensive

• Epoch-greedy [LZ'08]
  • Similar to ε-greedy
  • Simple, general, and less expensive
  • But not the most effective

• This talk: algorithms with compact, parametric models
  • Both efficient and effective
  • Extension of UCB1 to linear models
  • … and to generalized linear models
  • Randomized algorithm with Thompson sampling

Page 17:

LinUCB: UCB for Linear Models

• Linear model assumption: $E[r_a \mid x] = x^\top \theta_a$
• Standard least-squares ridge regression:
  $\hat\theta_a = (D_a^\top D_a + I)^{-1} D_a^\top c_a$,
  where $D_a = [x_1^\top; x_2^\top; \ldots]$ stacks the contexts observed for arm a, $c_a = [r_1; r_2; \ldots]$ stacks the corresponding rewards, and $A_a = D_a^\top D_a + I$.
• Reward prediction for a new user: $x^\top \hat\theta_a \approx x^\top \theta_a$
• Whether to explore requires quantifying parameter uncertainty:
  prediction error $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$ (with high probability),
  where $\sqrt{x^\top A_a^{-1} x}$ measures how "dissimilar" x is to previous users.

Page 18:

LinUCB: UCB for Linear Models (II)

With high probability: $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$

LinUCB always selects an arm with the highest UCB:

$a^* = \arg\max_a \left\{ x^\top \hat\theta_a + \alpha \sqrt{x^\top A_a^{-1} x} \right\}$

(the first term is to exploit, the second to explore)

Recall UCB1: $a^* = \arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$

LinRel [Auer 2002] works similarly but in a more complicated way.
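A minimal sketch of the (disjoint) LinUCB update and arm selection just described; the class layout and names are illustrative, and a production version would cache $A_a^{-1}$ instead of inverting it on every call.

```python
import numpy as np

class LinUCBArm:
    """Per-arm state for disjoint LinUCB: A_a = D_a^T D_a + I, b_a = D_a^T c_a."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # ridge Gram matrix
        self.b = np.zeros(d)      # feature-weighted reward sum
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                     # ridge-regression estimate
        mean = x @ theta_hat                           # exploitation term
        width = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
        return mean + width

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

def select_arm(arms, x):
    """Pick the arm (key of the dict) with the highest UCB for context x."""
    return max(arms, key=lambda a: arms[a].ucb(x))
```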

Page 19:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 20:

Evaluation of Bandit Algorithms

Goal: estimate the average reward of running π on i.i.d. contexts x.

• Static π: $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$

• Adaptive π: $V(\pi, T) := \frac{1}{T}\, E\!\left[ \sum_{t=1}^{T} r\big(x_t, \pi(x_t, h_t)\big) \right]$

Gold standard
• Run π in the real system and see how well it works
• … but expensive and risky

Page 21:

Offline Evaluation

• Benefits
  • Cheap and risk-free!
  • Avoids frequent bucket tests
  • Replicable / fair comparisons

• Common in non-interactive learning problems (e.g., classification)
  • Benchmark data organized as (input, label) pairs

• … but not straightforward for interactive learning problems
  • Data in bandits usually consist of (context, arm, reward) triples
  • No reward signal for any other arm′ ≠ arm

Page 22:

Common/Prior Evaluation Approaches

data $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$
  → classification / regression / density estimation
  → reward simulator $\hat r(x, a) \approx E[r \mid x, a]$
  → bandit algorithm π
  → unreliable evaluation

This (difficult) simulator-building step is often biased.

In contrast, our approach
• avoids explicit user modeling → simple
• gives unbiased evaluation results → reliable

Page 23:

Our Evaluation Method: "Replay"

Want to estimate $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$

Given logged data $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$ and a bandit algorithm π:

For i = 1, 2, …, L:
  reveal $x_i$
  choose $\hat a_i = \pi(x_i)$
  reveal $r_i$ only if $\hat a_i = a_i$ (a "match")

Finally, output $\hat V = \frac{K}{L} \sum_{i=1}^{L} r_i \cdot I(\hat a_i = a_i)$

Key requirement for data collection: $a_i \sim \mathrm{unif}(A)$ (arms chosen uniformly at random)
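A minimal sketch of the replay estimator, assuming the logged events were collected with uniformly random arms over a pool of size K; the `(x, a, r)` log format and the `policy(x, history)` interface are illustrative (an adaptive algorithm can be fed the matched events as its history).

```python
def replay_evaluate(policy, logged, K):
    """Replay estimator: logged is a list of (x, a, r) collected with
    a ~ uniform over K arms; policy(x, history) returns an arm.
    Returns V_hat = (K / L) * sum of rewards on matched events."""
    total, history = 0.0, []
    for x, a, r in logged:
        a_hat = policy(x, history)
        if a_hat == a:                     # a "match": the logged reward is revealed
            total += r
            history.append((x, a, r))      # adaptive policies may learn from matches
    return K * total / len(logged)
```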

Page 24:

Theoretical Guarantees

Thm 1: Our estimator is unbiased. Mathematically, $V(\pi) = E[\hat V]$.
So on average $\hat V$ reflects real, online performance.

Thm 2: The estimation error → 0 with more data. Mathematically, $\big| V(\pi) - \hat V \big| = O\big(\sqrt{K/L}\big)$.
So accuracy is guaranteed with a large volume of data.

Page 25:

Case Study in Today Module [LCLW'11]

Data:
› Large volume of real user traffic in Today Module

Policies being evaluated:
› EMP [ACE'09]
› SEMP/CEMP: personalized EMP variants
› Use the policies' online bucket CTR as "truth"

Random bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› Use it to offline-evaluate the policies' CTR

Are they close?

Page 26:

Unbiasedness (Article nCTR)

[Scatter plot: estimated nCTR vs. recorded online nCTR per article]

The offline estimate is indeed unbiased!

Page 27:

Unbiasedness (Daily nCTR)

[Plot: estimated nCTR vs. recorded online nCTR over ten days in November 2009]

The offline estimate is indeed unbiased!

Page 28:

Estimation Error

[Plot: nCTR estimation error vs. number of data L, decaying like $1/\sqrt{L}$]

Recall our theoretical error bound:

Thm 2 (error bound): $\left| V(\pi) - \hat V \right| = O\big(\sqrt{K/L}\big)$

Page 29:

Unbiased Offline Evaluation: Recap

What we have shown:
› A principled method for benchmark data collection
› which allows reliable/unbiased evaluation
› of any bandit algorithm

Analogue: UCI, Caltech101, … datasets for supervised learning

The first such benchmark was released by Yahoo!:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

2nd and 3rd versions available for the PASCAL2 Challenge
› ICML 2012 workshop

Page 30:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 31:

Experiment Setup: Architecture

• Traffic is split into a small "Learning Bucket" (5%), where explore/exploit happens, and a "Deployment Bucket" (95%), which exploits only
• Model updated every 5 minutes
• Main metric: overall normalized CTR in the deployment bucket
  • nCTR = CTR × secretNumber (to protect sensitive business information)

Page 32:

Experiment Setup: Data

• May 1, 2009 data for parameter tuning
• May 3-9, 2009 data for performance evaluation (33M visits)
• Number of candidate articles per user visit is about 20
• Dimension reduction on user features [CBP+'09]
  • 6 features
• Model: $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a$
• Data available from Yahoo! Research's Webscope program:
  http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Page 33:

CTR in Deployment Bucket [LCLS'10]

[Bar chart of deployment-bucket CTR; annotations mark a "cheating" policy and a no-feature baseline]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered

Page 34:

Article CTR Lift

[Scatter plot of per-article CTR lift, no-context vs. linear model; "+" marks ε-greedy, "o" marks UCB]

Page 35:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 36:

LinUCB for Hybrid Linear Models

Previous assumption: $E[r_a \mid x] = x^\top \theta_a$ (article-specific information; e.g., Californian males like this article)

New assumption: $E[r_a \mid x] = x^\top \theta_a + z_a^\top \beta$, where β captures information shared by all articles (e.g., teens like articles about Harry Potter)

Advantage: learns faster when there are few data

Challenge: seems to require unbounded computational cost

Good news! An efficient implementation is made possible by block matrix manipulations.
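To illustrate the hybrid assumption, here is a naive joint ridge fit of the shared coefficients β and the per-article coefficients $\theta_a$; it ignores the efficient block-matrix implementation mentioned above, and the `(a, x, z, r)` sample format is an assumption for illustration only.

```python
import numpy as np

def fit_hybrid_ridge(samples, d, k, lam=1.0):
    """Naive joint ridge fit for the hybrid model E[r | x] = x^T theta_a + z_a^T beta.
    samples: list of (a, x, z, r) with per-article features x (dim d),
    shared features z (dim k), and reward r."""
    arms = sorted({a for a, _, _, _ in samples})
    idx = {a: i for i, a in enumerate(arms)}
    p = k + d * len(arms)                      # parameter layout: [beta; theta_1; ...]
    X = np.zeros((len(samples), p))
    y = np.zeros(len(samples))
    for row, (a, x, z, r) in enumerate(samples):
        X[row, :k] = z                         # shared block
        start = k + d * idx[a]
        X[row, start:start + d] = x            # arm-specific block
        y[row] = r
    w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    beta, thetas = w[:k], w[k:].reshape(len(arms), d)
    return beta, {a: thetas[idx[a]] for a in arms}
```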

Page 37:

Overall CTR in Deployment Bucket

[Bar chart; annotation highlights the advantage of the hybrid model]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered
• The hybrid model is better when data are scarce

Page 38:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 39:

Extensions to GLMs

Linear models are unnatural for binary events: $E[r_a \mid x] = x^\top \theta_a$

Generalized linear models (GLMs): $E[r_a \mid x] = g^{-1}\big(x^\top \theta_a\big)$, where $g^{-1}: \mathbb{R} \to [0,1]$ is the "inverse link function"

Logistic regression: $E[r_a \mid x] = \dfrac{1}{1 + \exp(-x^\top \theta_a)}$ (the logistic function of $x^\top \theta_a$)

Probit regression: $E[r_a \mid x] = \Phi\big(x^\top \theta_a\big)$ (Φ: CDF of the standard Gaussian)

Page 40:

Model Fitting in GLMs

• Maintain a Gaussian (Bayesian) posterior $N(\mu_a, \Sigma_a)$ over the parameter $\theta_a$
• Apply Bayes' formula with new data (x, r):
  $p(\theta_a) \propto N(\theta_a; \mu_a, \Sigma_a) \cdot \big(1 + \exp(-(2r-1)\, x^\top \theta_a)\big)^{-1}$
  (current posterior × likelihood)
• A Laplace approximation turns this into the new posterior $N(\mu_a', \Sigma_a')$
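A minimal sketch of one Laplace-approximation update for Bayesian logistic regression, matching the posterior-update step above; using `scipy.optimize.minimize` to find the posterior mode and updating on a single observation are illustrative simplifications.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_update(mu, Sigma, x, r):
    """One Laplace update: prior N(mu, Sigma), observation x with r in {0, 1}.
    Returns the Gaussian approximation N(mu_new, Sigma_new) of the posterior."""
    y = 2 * r - 1                                    # map {0,1} -> {-1,+1}
    Sigma_inv = np.linalg.inv(Sigma)

    def neg_log_post(theta):
        prior = 0.5 * (theta - mu) @ Sigma_inv @ (theta - mu)
        lik = np.log1p(np.exp(-y * (x @ theta)))     # logistic negative log-likelihood
        return prior + lik

    mu_new = minimize(neg_log_post, mu).x            # posterior mode (MAP)
    p = 1.0 / (1.0 + np.exp(-(x @ mu_new)))          # predicted click prob at the mode
    hess = Sigma_inv + p * (1 - p) * np.outer(x, x)  # Hessian of neg log posterior
    Sigma_new = np.linalg.inv(hess)
    return mu_new, Sigma_new
```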

Page 41:

UCB Heuristics for GLMs

• Use the posterior $N(\mu_a, \Sigma_a)$ to derive (approximate) upper confidence bounds [LCLMW'12]:

$E[r_a \mid x_a] \le \begin{cases} x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a} & \text{(linear)} \\[4pt] \dfrac{1 + \alpha\big(\exp(x_a^\top \Sigma_a x_a) - 1\big)}{1 + \exp(-x_a^\top \mu_a)} & \text{(logistic)} \\[4pt] \Phi\big(x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a}\big) & \text{(probit)} \end{cases}$

Page 42:

Experiment Setup

• One week of data from June 2009 (34M user visits)
• About 20 candidate articles per user visit
• Features: 20 features by PCA on raw binary user features
• Model updated every 5 minutes
• Main metric: overall (normalized) CTR in the deployment bucket
• Bucket setup as before: a "Learning Bucket" (5%), where explore/exploit happens, and a "Deployment Bucket" (95%), exploitation only

Page 43:

GLM Comparisons

[Bar chart of deployment-bucket CTR for linear, logistic, and probit models, under ε-greedy exploration and UCB exploration]

Obs #1: active exploration is necessary
Obs #2: logistic/probit > linear
Obs #3: UCB > ε-greedy

Page 44:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 45:

Limitations of UCB Exploration

• Exploration can be too much
  • may explore the whole space exhaustively
  • difficult to use prior knowledge
• Exploration is deterministic
  • poor performance when rewards are delayed
• Deriving an (approximate) UCB is not always easy

Page 46:

Thompson Sampling (1933)

Algorithmic idea: "probability matching"

Pr(a | x) = Pr(a is optimal for x)

• Randomized action selection (by definition)
• More robust to reward delay
• Straightforward to implement [CL'12]:
  • Maintain the parameter posterior: $\Pr(\theta_a \mid D)$
  • Draw random models: $\tilde\theta_a \sim \Pr(\theta_a \mid D)$
  • Act accordingly: $a(x) = \arg\max_a f(x, a; \tilde\theta_a)$
• Easily combined with other (non-)parametric models
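A minimal sketch of the sample-then-act step above, assuming each arm's posterior is a Gaussian $N(\mu_a, \Sigma_a)$ over the parameters of a logistic reward model (as in the experiment on the next slide); the dictionary interface is illustrative.

```python
import numpy as np

def thompson_select(arms, x, rng=None):
    """One Thompson sampling step: draw a random model from each arm's
    Gaussian posterior N(mu, Sigma), score context x with a logistic
    reward model, and act greedily with respect to the sampled models."""
    rng = rng or np.random.default_rng()
    scores = {}
    for a, (mu, Sigma) in arms.items():
        theta_tilde = rng.multivariate_normal(mu, Sigma)       # theta ~ Pr(theta | D)
        scores[a] = 1.0 / (1.0 + np.exp(-(x @ theta_tilde)))   # predicted CTR
    return max(scores, key=scores.get)
```

After observing the chosen arm's reward, its posterior would be updated (for example with a Laplace step like the one sketched earlier), which is what makes the randomized exploration adapt over time.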

Page 47:

Thompson Sampling

One week of data from the Today Module on Yahoo!'s front page
Logistic regression with Gaussian posteriors

Obs #1: TS is uniformly competitive
Obs #2: TS is more robust to reward delay

Page 48:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 49:

Regret-based Competitive Analysis

$\mathrm{Regret}(T) = E\!\left[\sum_{t=1}^{T} r_{t,a_t^*}\right] - E\!\left[\sum_{t=1}^{T} r_{t,a_t}\right]$

The first term is the best we could do if we knew all $\theta_a$; the second is what the algorithm actually achieves.

An algorithm "learns" if $\mathrm{Regret}(T) = O(T^\alpha)$ with $\alpha < 1$.
An algorithm "learns fast" if $\alpha$ is small.

Page 50:

Regret Bounds

• LinUCB [CLRS'11]: $O\big(\sqrt{KdT}\big)$, with a matching lower bound
  • Average reward converges to optimal at the rate $O\big(\sqrt{Kd/T}\big)$
  • Example: K = 20, d = 50, T = 10M, so $\sqrt{Kd/T} = 0.01$
• Generalized LinUCB: still open
  • A variant [FCGSz'11]: $O\big(d\sqrt{T}\big)$
• Thompson sampling
  • A variant [L'12]: $O\big(K^{1/3} T^{2/3}\big)$

Page 51:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique

Conclusions

Page 52:

Extensions

Uniformly random data are sometimes a luxury…
› System/cost constraints, user experience considerations, …

A randomized log suffices (by importance weighting):

$V(\pi) = E_{(x,r)\sim D}\big[r_{\pi(x)}\big] \approx \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}}$

where τ controls the bias/variance trade-off [SLLK'11]

Variance reduction with the "doubly robust" technique [DLL'11]

Better bias/variance trade-off by soft rejection sampling [DDLL'12]

Page 53:

Offline Evaluation with Non-Uniform Data

Key idea: importance reweighting

$V(\pi) = E_{(x,r)\sim D}\big[r_{\pi(x)}\big] = E_{(x,r)\sim D}\!\left[\sum_a r_a \cdot I(\pi(x)=a)\right] = E_{(x,r)\sim D}\!\left[\sum_a \frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\, p(a \mid x)\right] = E_{(x,r)\sim D,\, a\sim p}\!\left[\frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\right]$

We can use a weighted empirical average with an estimated $\hat p(a \mid x)$:

$\hat V = \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}} \approx V(\pi)$

where τ controls the bias/variance trade-off [SLLK'11]
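A minimal sketch of the clipped importance-weighting estimator above; the `(x, a, r, p)` log format (with p the logging probability of the logged arm) and the parameter `tau` mirror the formula, while everything else is illustrative.

```python
import numpy as np

def ips_estimate(policy, logged, tau=0.01):
    """Clipped inverse-propensity estimate of V(pi) from a randomized log.
    logged: list of (x, a, r, p) tuples, with p ~= p(a | x) under the
    logging policy; tau clips small propensities to trade bias for variance."""
    values = [r * (policy(x) == a) / max(p, tau) for x, a, r, p in logged]
    return float(np.mean(values))
```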

Page 54:

Results in Today Module Data [SLLK'11]

Page 55:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique

Conclusions

Page 56:

Doubly Robust Estimation

Importance-weighted formula:

$V(\pi) = E_{(x,r)\sim D,\, a\sim p}\!\left[\frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\right] \approx \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}}$

The estimate has high variance if $p(a \mid x)$ is small.

Doubly robust technique, with a reward model $\hat r_a \approx E[r \mid x, a]$:

$\hat V_{DR} = \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \left[\frac{\big(r_a - \hat r_a\big) \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}} + \hat r_{\pi(x)}\right]$

Unbiased if either $\hat r$ or $\hat p$ is correct.

The DR estimate usually decreases variance [DLL'11].
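A minimal sketch of the doubly robust estimator above; `reward_model(x, a)` stands in for $\hat r(x, a)$, and the `(x, a, r, p)` log format is the same illustrative one used for the importance-weighted sketch.

```python
import numpy as np

def dr_estimate(policy, reward_model, logged, tau=0.01):
    """Doubly robust estimate of V(pi): combine a reward model with clipped
    importance weighting.  Unbiased if either the reward model or the
    propensities are correct."""
    values = []
    for x, a, r, p in logged:
        a_pi = policy(x)
        correction = (r - reward_model(x, a)) * (a_pi == a) / max(p, tau)
        values.append(correction + reward_model(x, a_pi))
    return float(np.mean(values))
```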

Page 57:

Multiclass Classification

K-class classification as a K-armed bandit

Each labeled example induces a loss vector:
$(x, c) \Rightarrow (x, r_1, r_2, \ldots, r_K)$, where $r_a = 0$ if $a = c$ and $r_a = 1$ otherwise

Training data
› In the usual (non-bandit) setting, $D = \{(x_i, c_i)\}_{i=1,\ldots,m}$
› In the bandit setting, $D = \{(x_i, a_i, p_i, r_{i,a_i})\}_{i=1,\ldots,m}$

[Illustration: an m × K loss matrix with $r_{ij}$ in entry (i, j); in the usual setting the whole row is observed, in the bandit setting only the entry of the chosen arm is.]
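A small sketch of turning fully labeled multiclass data into the bandit-setting log format $\{(x_i, a_i, p_i, r_{i,a_i})\}$ described above, using uniformly random logged actions; the function name and the uniform logging policy are illustrative assumptions.

```python
import numpy as np

def to_bandit_feedback(X, y, K, rng=None):
    """Turn fully labeled K-class data into partially labeled bandit data:
    pick an arm uniformly at random per example and reveal only its loss
    (0 if the arm equals the true class, 1 otherwise)."""
    rng = rng or np.random.default_rng()
    records = []
    for x, c in zip(X, y):
        a = int(rng.integers(K))     # logged action
        p = 1.0 / K                  # logging propensity (uniform)
        r = 0.0 if a == c else 1.0   # observed loss for the chosen arm only
        records.append((x, a, p, r))
    return records
```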

Page 58:

Experimental Results on UCI Datasets

Split data 50/50 for training (fully labeled) and testing (partially labeled)

Train π on the training data, evaluate π on the test data

Repeated 500 times

Page 59:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 60:

Conclusions

• The contextual bandit as a principled formulation for
  • News article recommendation
  • Internet advertising
  • Web search
  • …
• An offline evaluation method for bandit algorithms
  • unbiased
  • accurate compared to online bucket results
• Encouraging results in significant applications
  • strong performance of UCB/TS exploration

Page 61:

Future Work

• Offline evaluation
  • Better use of non-uniform data
  • Extension to full reinforcement learning
• Use of prior knowledge
• Variants of bandits
  • Bandits with budgets
  • Bandits with many arms
  • Bandits with multiple objectives
  • Bandits with submodular rewards
  • Bandits with delayed reward observations
  • …

Page 62:

References

Offline policy evaluation
› [LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
› [SLLK] Learning from logged implicit exploration data. NIPS, 2010.
› [DLL] Doubly robust policy evaluation and learning. ICML, 2011.
› [DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.

Bandit algorithms
› [LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
› [CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011.
› [BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
› [CL] An empirical evaluation of Thompson sampling. NIPS, 2011.
› [LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.

Page 63:

Thank You!