
Page 1:

Machine Learning in the Bandit Setting

Algorithms, Evaluation, and Case Studies

Lihong Li

Machine Learning, Yahoo! Research

SEWM, 2012-05-25

Page 2:

[Diagram: DATA → (statistics, ML, DM, …) → KNOWLEDGE (e.g., "E = mc²") → UTILITY → ACTION → MORE DATA; closing this loop is Reinforcement Learning]

Page 3:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 4:

Yahoo-User Interaction

[Diagram: a POLICY (serving strategy) observes a CONTEXT (gender, age, …), takes an ACTION (ads, news, ranking, …), and receives a REWARD (click, conversion, revenue, …)]

Goal: maximize total REWARD by optimizing the POLICY

Page 5:

Today Module @ Yahoo! Front Page

A small pool of articles chosen by editors

"Featured Article"

Page 6:

Objectives and Challenges

• Objectives
  • (informally) choose the most interesting articles for individual users
  • (formally) maximize click-through rate (CTR)

• Challenges
  • Dynamic content pool → fast learning
  • Sparse user visits → transfer interests among users
  • Partial user feedback → efficient explore/exploit

Page 7:

Challenge: Explore/Exploit

• Observation: only displayed articles get user click feedback

EXPLOIT (choose good articles) ←→ article CTR estimates ←→ EXPLORE (choose novel articles)

How to trade off?
… with dynamic article pools
… while considering user interests

Page 8:

Insufficient Exploration Example

One arm always pays $5/round; another pays $100 a quarter of the time (so $25/round on average).

Round:            1     2     3     4     5     6     7     8
Observed payoff: $5    $5    $0    $0    $0    $5    $5    $5

It turns out…   $100  $100  $100   $0    $0    $5    $5    $5

With too little exploration, the algorithm settles for the $5 arm and misses the better one.

Page 9:

Contextual Bandit Formulation

Multi-armed contextual bandit [LZ'08]

For t = 1, 2, …:
  Observe K arms $A_t$ and "context" $x_t \in \mathbb{R}^d$
  Select $a_t \in A_t$
  Receive reward $r_{t,a_t} \in [0,1]$
  $t \leftarrow t + 1$

In Today Module:
  $A_t$: available articles at time t
  $x_t$: user features (age, gender, interests, …)
  $a_t$: the displayed article at time t
  $r_{t,a_t}$: 1 for click, 0 for no click

Formally, we want to maximize $\sum_{t=1}^{T} r_{t,a_t}$.
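To make the protocol above concrete, here is a minimal Python sketch of the observe/select/reward loop; the `policy` and `env` interfaces (`observe`, `select`, `reward`, `update`) are illustrative assumptions, not part of the talk.

```python
def run_contextual_bandit(policy, env, T):
    """Minimal sketch of the contextual bandit protocol:
    observe arms and context, select an arm, receive a bounded reward."""
    total_reward = 0.0
    for t in range(T):
        arms, x = env.observe()      # K arms A_t and context x_t in R^d
        a = policy.select(arms, x)   # choose a_t in A_t
        r = env.reward(a)            # observe r_{t, a_t} in [0, 1]
        policy.update(x, a, r)       # learn only from the displayed arm's feedback
        total_reward += r
    return total_reward
```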

Page 10:

Another Example – Display Ads

$A_t$: eligible ads in the current page view
$x_t$: page/user features
$a_t$: the displayed ad(s)
$r_{t,a_t}$: the $ amount if clicked/converted, 0 otherwise

Page 11:

Yet Another Example – Ranking

$A_t$: possible document rankings for query $q_t$
$x_t$: query/document features
$a_t$: the displayed ranking for query $q_t$
$r_{t,a_t}$: 1 if the session succeeds, 0 otherwise

Page 12:

Related Work

• Standard information retrieval and collaborative filtering
  • Also concern (personalized) recommendation
  • But with (almost) static users/items
    → training often done in batch/offline mode
    → no need for online exploration

• Full reinforcement learning
  • General: includes bandit problems as special cases
  • Needs to tackle "temporal credit assignment"

Page 13:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 14:

Prior Bandit Algorithms

• Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
• Bayesian optimal solution: John Gittins

Page 15:

Traditional K-armed Bandits

Assumption: CTR (click-through rate) not affected by user features

$\mathrm{CTR}_1 \approx \mu_1$, $\mathrm{CTR}_2 \approx \mu_2$, $\mathrm{CTR}_3 \approx \mu_3$

ε-greedy:
  with prob 1−ε: choose article $\arg\max_a \hat\mu_a$
  with prob ε: choose a random article

UCB1:
  choose article $\arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$
  The more "a" has been displayed, the less uncertainty in $\mathrm{CTR}_a$.

CTR estimates: $\hat\mu_a$ = #clicks / #impressions

No contexts → no personalization
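As a concrete reference for the two context-free rules above, here is a small Python sketch; the click/view bookkeeping and the names `eps` and `alpha` are illustrative, and the UCB rule follows the slide's $\hat\mu_a + \alpha/\sqrt{N_a}$ form rather than the classical UCB1 bonus.

```python
import numpy as np

def egreedy(clicks, views, eps, rng=None):
    """epsilon-greedy over K arms given per-arm click/view counts (context-free)."""
    rng = rng or np.random.default_rng()
    ctr = clicks / np.maximum(views, 1)        # mu_hat = #clicks / #impressions
    if rng.random() < eps:
        return int(rng.integers(len(ctr)))     # explore: random article
    return int(np.argmax(ctr))                 # exploit: best estimated CTR

def ucb1(clicks, views, alpha):
    """UCB-style rule from the slide: mu_hat + alpha / sqrt(N_a)."""
    ctr = clicks / np.maximum(views, 1)
    bonus = alpha / np.sqrt(np.maximum(views, 1))
    return int(np.argmax(ctr + bonus))
```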

Page 16:

Contextual Bandit Algorithms

• EXP4 [ACFS'02], EXP4.P [BLLRS'11], elimination [ADKLS'12]
  • Strong theoretical guarantees
  • But computationally expensive

• Epoch-greedy [LZ'08]
  • Similar to ε-greedy
  • Simple, general, and less expensive
  • But not the most effective

• This talk: algorithms with compact, parametric models
  • Both efficient and effective
  • Extension of UCB1 to linear models
  • … and to generalized linear models
  • Randomized algorithm with Thompson sampling

Page 17:

LinUCB: UCB for Linear Models

• Linear model assumption: $E[r_a \mid x] = x^\top \theta_a$
• Standard least-squares ridge regression:
  $\hat\theta_a = (D_a^\top D_a + I)^{-1} D_a^\top c_a$,
  where $D_a = [x_1^\top; x_2^\top; \ldots]$ stacks the contexts observed for arm a, $c_a = [r_1; r_2; \ldots]$ stacks the corresponding rewards, and $A_a = D_a^\top D_a + I$.
• Reward prediction for a new user: $x^\top \hat\theta_a \approx x^\top \theta_a$
• Whether to explore requires quantifying parameter uncertainty:
  prediction error $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$ (with high probability),
  where $\sqrt{x^\top A_a^{-1} x}$ measures how "dissimilar" x is to previous users.

Page 18:

LinUCB: UCB for Linear Models (II)

With high probability: $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$

LinUCB always selects an arm with the highest UCB:

$a^* = \arg\max_a \left\{ x^\top \hat\theta_a + \alpha \sqrt{x^\top A_a^{-1} x} \right\}$

(the first term is to exploit, the second to explore)

Recall UCB1: $a^* = \arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$

LinRel [Auer 2002] works similarly but in a more complicated way.
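A minimal sketch of the (disjoint) LinUCB update and arm selection just described; the class layout and names are illustrative, and a production version would cache $A_a^{-1}$ instead of inverting it on every call.

```python
import numpy as np

class LinUCBArm:
    """Per-arm state for disjoint LinUCB: A_a = D_a^T D_a + I, b_a = D_a^T c_a."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # ridge Gram matrix
        self.b = np.zeros(d)      # feature-weighted reward sum
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                     # ridge-regression estimate
        mean = x @ theta_hat                           # exploitation term
        width = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
        return mean + width

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

def select_arm(arms, x):
    """Pick the arm (key of the dict) with the highest UCB for context x."""
    return max(arms, key=lambda a: arms[a].ucb(x))
```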

Page 19:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 20:

Evaluation of Bandit Algorithms

Goal: estimate the average reward of running π on i.i.d. contexts x.

• Static π: $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$

• Adaptive π: $V(\pi, T) := \frac{1}{T}\, E\!\left[ \sum_{t=1}^{T} r\big(x_t, \pi(x_t, h_t)\big) \right]$

Gold standard
• Run π in the real system and see how well it works
• … but expensive and risky

Page 21:

Offline Evaluation

• Benefits
  • Cheap and risk-free!
  • Avoids frequent bucket tests
  • Replicable / fair comparisons

• Common in non-interactive learning problems (e.g., classification)
  • Benchmark data organized as (input, label) pairs

• … but not straightforward for interactive learning problems
  • Data in bandits usually consist of (context, arm, reward) triples
  • No reward signal for any other arm′ ≠ arm

Page 22:

Common/Prior Evaluation Approaches

data $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$
  → classification / regression / density estimation
  → reward simulator $\hat r(x, a) \approx E[r \mid x, a]$
  → bandit algorithm π
  → unreliable evaluation

This (difficult) simulator-building step is often biased.

In contrast, our approach
• avoids explicit user modeling → simple
• gives unbiased evaluation results → reliable

Page 23:

Our Evaluation Method: "Replay"

Want to estimate $V(\pi) := E_x\!\left[ r\big(x, \pi(x)\big) \right]$

Given logged data $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$ and a bandit algorithm π:

For i = 1, 2, …, L:
  reveal $x_i$
  choose $\hat a_i = \pi(x_i)$
  reveal $r_i$ only if $\hat a_i = a_i$ (a "match")

Finally, output $\hat V = \frac{K}{L} \sum_{i=1}^{L} r_i \cdot I(\hat a_i = a_i)$

Key requirement for data collection: $a_i \sim \mathrm{unif}(A)$ (arms chosen uniformly at random)
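A minimal sketch of the replay estimator, assuming the logged events were collected with uniformly random arms over a pool of size K; the `(x, a, r)` log format and the `policy(x, history)` interface are illustrative (an adaptive algorithm can be fed the matched events as its history).

```python
def replay_evaluate(policy, logged, K):
    """Replay estimator: logged is a list of (x, a, r) collected with
    a ~ uniform over K arms; policy(x, history) returns an arm.
    Returns V_hat = (K / L) * sum of rewards on matched events."""
    total, history = 0.0, []
    for x, a, r in logged:
        a_hat = policy(x, history)
        if a_hat == a:                     # a "match": the logged reward is revealed
            total += r
            history.append((x, a, r))      # adaptive policies may learn from matches
    return K * total / len(logged)
```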

Page 24:

Theoretical Guarantees

Thm 1: Our estimator is unbiased. Mathematically, $V(\pi) = E[\hat V]$.
So on average $\hat V$ reflects real, online performance.

Thm 2: The estimation error → 0 with more data. Mathematically, $\big| V(\pi) - \hat V \big| = O\big(\sqrt{K/L}\big)$.
So accuracy is guaranteed with a large volume of data.

Page 25:

Case Study in Today Module [LCLW'11]

Data:
› Large volume of real user traffic in Today Module

Policies being evaluated:
› EMP [ACE'09]
› SEMP/CEMP: personalized EMP variants
› Use the policies' online bucket CTR as "truth"

Random bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› Use it to offline-evaluate the policies' CTR

Are they close?

Page 26:

Unbiasedness (Article nCTR)

[Scatter plot: estimated nCTR vs. recorded online nCTR per article]

The offline estimate is indeed unbiased!

Page 27:

Unbiasedness (Daily nCTR)

[Plot: estimated nCTR vs. recorded online nCTR over ten days in November 2009]

The offline estimate is indeed unbiased!

Page 28:

Estimation Error

[Plot: nCTR estimation error vs. number of data L, decaying like $1/\sqrt{L}$]

Recall our theoretical error bound:

Thm 2 (error bound): $\left| V(\pi) - \hat V \right| = O\big(\sqrt{K/L}\big)$

Page 29:

Unbiased Offline Evaluation: Recap

What we have shown:
› A principled method for benchmark data collection
› which allows reliable/unbiased evaluation
› of any bandit algorithm

Analogue: UCI, Caltech101, … datasets for supervised learning

The first such benchmark was released by Yahoo!:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

2nd and 3rd versions available for the PASCAL2 Challenge
› ICML 2012 workshop

Page 30:

Outline

Introduction

Basic Solutions
› Algorithms
› Evaluation
› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 31:

Experiment Setup: Architecture

• Traffic is split into a small "Learning Bucket" (5%), where explore/exploit happens, and a "Deployment Bucket" (95%), which exploits only
• Model updated every 5 minutes
• Main metric: overall normalized CTR in the deployment bucket
  • nCTR = CTR × secretNumber (to protect sensitive business information)

Page 32:

Experiment Setup: Data

• May 1, 2009 data for parameter tuning
• May 3-9, 2009 data for performance evaluation (33M visits)
• Number of candidate articles per user visit is about 20
• Dimension reduction on user features [CBP+'09]
  • 6 features
• Model: $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a$
• Data available from Yahoo! Research's Webscope program:
  http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Page 33:

CTR in Deployment Bucket [LCLS'10]

[Bar chart of deployment-bucket CTR; annotations mark a "cheating" policy and a no-feature baseline]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered

Page 34:

Article CTR Lift

[Scatter plot of per-article CTR lift, no-context vs. linear model; "+" marks ε-greedy, "o" marks UCB]

Page 35:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 36:

LinUCB for Hybrid Linear Models

Previous assumption: $E[r_a \mid x] = x^\top \theta_a$ (article-specific information; e.g., Californian males like this article)

New assumption: $E[r_a \mid x] = x^\top \theta_a + z_a^\top \beta$, where β captures information shared by all articles (e.g., teens like articles about Harry Potter)

Advantage: learns faster when there are few data

Challenge: seems to require unbounded computational cost

Good news! An efficient implementation is made possible by block matrix manipulations.
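To illustrate the hybrid assumption, here is a naive joint ridge fit of the shared coefficients β and the per-article coefficients $\theta_a$; it ignores the efficient block-matrix implementation mentioned above, and the `(a, x, z, r)` sample format is an assumption for illustration only.

```python
import numpy as np

def fit_hybrid_ridge(samples, d, k, lam=1.0):
    """Naive joint ridge fit for the hybrid model E[r | x] = x^T theta_a + z_a^T beta.
    samples: list of (a, x, z, r) with per-article features x (dim d),
    shared features z (dim k), and reward r."""
    arms = sorted({a for a, _, _, _ in samples})
    idx = {a: i for i, a in enumerate(arms)}
    p = k + d * len(arms)                      # parameter layout: [beta; theta_1; ...]
    X = np.zeros((len(samples), p))
    y = np.zeros(len(samples))
    for row, (a, x, z, r) in enumerate(samples):
        X[row, :k] = z                         # shared block
        start = k + d * idx[a]
        X[row, start:start + d] = x            # arm-specific block
        y[row] = r
    w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    beta, thetas = w[:k], w[k:].reshape(len(arms), d)
    return beta, {a: thetas[idx[a]] for a in arms}
```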

Page 37:

Overall CTR in Deployment Bucket

[Bar chart; annotation highlights the advantage of the hybrid model]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered
• The hybrid model is better when data are scarce

Page 38:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 39:

Extensions to GLMs

Linear models are unnatural for binary events: $E[r_a \mid x] = x^\top \theta_a$

Generalized linear models (GLMs): $E[r_a \mid x] = g^{-1}\big(x^\top \theta_a\big)$, where $g^{-1}: \mathbb{R} \to [0,1]$ is the "inverse link function"

Logistic regression: $E[r_a \mid x] = \dfrac{1}{1 + \exp(-x^\top \theta_a)}$ (the logistic function of $x^\top \theta_a$)

Probit regression: $E[r_a \mid x] = \Phi\big(x^\top \theta_a\big)$ (Φ: CDF of the standard Gaussian)

Page 40:

Model Fitting in GLMs

• Maintain a Gaussian (Bayesian) posterior $N(\mu_a, \Sigma_a)$ over the parameter $\theta_a$
• Apply Bayes' formula with new data (x, r):
  $p(\theta_a) \propto N(\theta_a; \mu_a, \Sigma_a) \cdot \big(1 + \exp(-(2r-1)\, x^\top \theta_a)\big)^{-1}$
  (current posterior × likelihood)
• A Laplace approximation turns this into the new posterior $N(\mu_a', \Sigma_a')$
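A minimal sketch of one Laplace-approximation update for Bayesian logistic regression, matching the posterior-update step above; using `scipy.optimize.minimize` to find the posterior mode and updating on a single observation are illustrative simplifications.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_update(mu, Sigma, x, r):
    """One Laplace update: prior N(mu, Sigma), observation x with r in {0, 1}.
    Returns the Gaussian approximation N(mu_new, Sigma_new) of the posterior."""
    y = 2 * r - 1                                    # map {0,1} -> {-1,+1}
    Sigma_inv = np.linalg.inv(Sigma)

    def neg_log_post(theta):
        prior = 0.5 * (theta - mu) @ Sigma_inv @ (theta - mu)
        lik = np.log1p(np.exp(-y * (x @ theta)))     # logistic negative log-likelihood
        return prior + lik

    mu_new = minimize(neg_log_post, mu).x            # posterior mode (MAP)
    p = 1.0 / (1.0 + np.exp(-(x @ mu_new)))          # predicted click prob at the mode
    hess = Sigma_inv + p * (1 - p) * np.outer(x, x)  # Hessian of neg log posterior
    Sigma_new = np.linalg.inv(hess)
    return mu_new, Sigma_new
```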

Page 41:

UCB Heuristics for GLMs

• Use the posterior $N(\mu_a, \Sigma_a)$ to derive (approximate) upper confidence bounds [LCLMW'12]:

$E[r_a \mid x_a] \le \begin{cases} x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a} & \text{(linear)} \\[4pt] \dfrac{1 + \alpha\big(\exp(x_a^\top \Sigma_a x_a) - 1\big)}{1 + \exp(-x_a^\top \mu_a)} & \text{(logistic)} \\[4pt] \Phi\big(x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a}\big) & \text{(probit)} \end{cases}$

Page 42:

Experiment Setup

• One week of data from June 2009 (34M user visits)
• About 20 candidate articles per user visit
• Features: 20 features by PCA on raw binary user features
• Model updated every 5 minutes
• Main metric: overall (normalized) CTR in the deployment bucket
• Bucket setup as before: a "Learning Bucket" (5%), where explore/exploit happens, and a "Deployment Bucket" (95%), exploitation only

Page 43:

GLM Comparisons

[Bar chart of deployment-bucket CTR for linear, logistic, and probit models, under ε-greedy exploration and UCB exploration]

Obs #1: active exploration is necessary
Obs #2: logistic/probit > linear
Obs #3: UCB > ε-greedy

Page 44:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 45:

Limitations of UCB Exploration

• Exploration can be too much
  • may explore the whole space exhaustively
  • difficult to use prior knowledge
• Exploration is deterministic
  • poor performance when rewards are delayed
• Deriving an (approximate) UCB is not always easy

Page 46:

Thompson Sampling (1933)

Algorithmic idea: "probability matching"

Pr(a | x) = Pr(a is optimal for x)

• Randomized action selection (by definition)
• More robust to reward delay
• Straightforward to implement [CL'12]:
  • Maintain the parameter posterior: $\Pr(\theta_a \mid D)$
  • Draw random models: $\tilde\theta_a \sim \Pr(\theta_a \mid D)$
  • Act accordingly: $a(x) = \arg\max_a f(x, a; \tilde\theta_a)$
• Easily combined with other (non-)parametric models
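A minimal sketch of the sample-then-act step above, assuming each arm's posterior is a Gaussian $N(\mu_a, \Sigma_a)$ over the parameters of a logistic reward model (as in the experiment on the next slide); the dictionary interface is illustrative.

```python
import numpy as np

def thompson_select(arms, x, rng=None):
    """One Thompson sampling step: draw a random model from each arm's
    Gaussian posterior N(mu, Sigma), score context x with a logistic
    reward model, and act greedily with respect to the sampled models."""
    rng = rng or np.random.default_rng()
    scores = {}
    for a, (mu, Sigma) in arms.items():
        theta_tilde = rng.multivariate_normal(mu, Sigma)       # theta ~ Pr(theta | D)
        scores[a] = 1.0 / (1.0 + np.exp(-(x @ theta_tilde)))   # predicted CTR
    return max(scores, key=scores.get)
```

After observing the chosen arm's reward, its posterior would be updated (for example with a Laplace step like the one sketched earlier), which is what makes the randomized exploration adapt over time.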

Page 47:

Thompson Sampling

One week of data from the Today Module on Yahoo!'s front page
Logistic regression with Gaussian posteriors

Obs #1: TS is uniformly competitive
Obs #2: TS is more robust to reward delay

Page 48:

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory

Advanced Offline Evaluation

Conclusions

Page 49:

Regret-based Competitive Analysis

$\mathrm{Regret}(T) = E\!\left[\sum_{t=1}^{T} r_{t,a_t^*}\right] - E\!\left[\sum_{t=1}^{T} r_{t,a_t}\right]$

The first term is the best we could do if we knew all $\theta_a$; the second is what the algorithm actually achieves.

An algorithm "learns" if $\mathrm{Regret}(T) = O(T^\alpha)$ with $\alpha < 1$.
An algorithm "learns fast" if $\alpha$ is small.

Page 50:

Regret Bounds

• LinUCB [CLRS'11]: $O\big(\sqrt{KdT}\big)$, with a matching lower bound
  • Average reward converges to optimal at the rate $O\big(\sqrt{Kd/T}\big)$
  • Example: K = 20, d = 50, T = 10M, so $\sqrt{Kd/T} = 0.01$
• Generalized LinUCB: still open
  • A variant [FCGSz'11]: $O\big(d\sqrt{T}\big)$
• Thompson sampling
  • A variant [L'12]: $O\big(K^{1/3} T^{2/3}\big)$

Page 51:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique

Conclusions

Page 52:

Extensions

Uniformly random data are sometimes a luxury…
› System/cost constraints, user experience considerations, …

A randomized log suffices (by importance weighting):

$V(\pi) = E_{(x,r)\sim D}\big[r_{\pi(x)}\big] \approx \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}}$

where τ controls the bias/variance trade-off [SLLK'11]

Variance reduction with the "doubly robust" technique [DLL'11]

Better bias/variance trade-off by soft rejection sampling [DDLL'12]

Page 53:

Offline Evaluation with Non-Uniform Data

Key idea: importance reweighting

$V(\pi) = E_{(x,r)\sim D}\big[r_{\pi(x)}\big] = E_{(x,r)\sim D}\!\left[\sum_a r_a \cdot I(\pi(x)=a)\right] = E_{(x,r)\sim D}\!\left[\sum_a \frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\, p(a \mid x)\right] = E_{(x,r)\sim D,\, a\sim p}\!\left[\frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\right]$

We can use a weighted empirical average with an estimated $\hat p(a \mid x)$:

$\hat V = \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}} \approx V(\pi)$

where τ controls the bias/variance trade-off [SLLK'11]
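A minimal sketch of the clipped importance-weighting estimator above; the `(x, a, r, p)` log format (with p the logging probability of the logged arm) and the parameter `tau` mirror the formula, while everything else is illustrative.

```python
import numpy as np

def ips_estimate(policy, logged, tau=0.01):
    """Clipped inverse-propensity estimate of V(pi) from a randomized log.
    logged: list of (x, a, r, p) tuples, with p ~= p(a | x) under the
    logging policy; tau clips small propensities to trade bias for variance."""
    values = [r * (policy(x) == a) / max(p, tau) for x, a, r, p in logged]
    return float(np.mean(values))
```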

Page 54:

Results in Today Module Data [SLLK'11]

Page 55:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique

Conclusions

Page 56:

Doubly Robust Estimation

Importance-weighted formula:

$V(\pi) = E_{(x,r)\sim D,\, a\sim p}\!\left[\frac{r_a \cdot I(\pi(x)=a)}{p(a \mid x)}\right] \approx \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \frac{r_a \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}}$

The estimate has high variance if $p(a \mid x)$ is small.

Doubly robust technique, with a reward model $\hat r_a \approx E[r \mid x, a]$:

$\hat V_{DR} = \frac{1}{|S|} \sum_{(x,a,r_a)\in S} \left[\frac{\big(r_a - \hat r_a\big) \cdot I(\pi(x)=a)}{\max\{\hat p(a \mid x), \tau\}} + \hat r_{\pi(x)}\right]$

Unbiased if either $\hat r$ or $\hat p$ is correct.

The DR estimate usually decreases variance [DLL'11].
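A minimal sketch of the doubly robust estimator above; `reward_model(x, a)` stands in for $\hat r(x, a)$, and the `(x, a, r, p)` log format is the same illustrative one used for the importance-weighted sketch.

```python
import numpy as np

def dr_estimate(policy, reward_model, logged, tau=0.01):
    """Doubly robust estimate of V(pi): combine a reward model with clipped
    importance weighting.  Unbiased if either the reward model or the
    propensities are correct."""
    values = []
    for x, a, r, p in logged:
        a_pi = policy(x)
        correction = (r - reward_model(x, a)) * (a_pi == a) / max(p, tau)
        values.append(correction + reward_model(x, a_pi))
    return float(np.mean(values))
```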

Page 57:

Multiclass Classification

K-class classification as a K-armed bandit

Each labeled example induces a loss vector:
$(x, c) \Rightarrow (x, r_1, r_2, \ldots, r_K)$, where $r_a = 0$ if $a = c$ and $r_a = 1$ otherwise

Training data
› In the usual (non-bandit) setting, $D = \{(x_i, c_i)\}_{i=1,\ldots,m}$
› In the bandit setting, $D = \{(x_i, a_i, p_i, r_{i,a_i})\}_{i=1,\ldots,m}$

[Illustration: an m × K loss matrix with $r_{ij}$ in entry (i, j); in the usual setting the whole row is observed, in the bandit setting only the entry of the chosen arm is.]
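A small sketch of turning fully labeled multiclass data into the bandit-setting log format $\{(x_i, a_i, p_i, r_{i,a_i})\}$ described above, using uniformly random logged actions; the function name and the uniform logging policy are illustrative assumptions.

```python
import numpy as np

def to_bandit_feedback(X, y, K, rng=None):
    """Turn fully labeled K-class data into partially labeled bandit data:
    pick an arm uniformly at random per example and reveal only its loss
    (0 if the arm equals the true class, 1 otherwise)."""
    rng = rng or np.random.default_rng()
    records = []
    for x, c in zip(X, y):
        a = int(rng.integers(K))     # logged action
        p = 1.0 / K                  # logging propensity (uniform)
        r = 0.0 if a == c else 1.0   # observed loss for the chosen arm only
        records.append((x, a, p, r))
    return records
```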

Page 58:

Experimental Results on UCI Datasets

Split data 50/50 for training (fully labeled) and testing (partially labeled)

Train π on the training data, evaluate π on the test data

Repeated 500 times

Page 59:

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions

Page 60:

Conclusions

• The contextual bandit as a principled formulation for
  • News article recommendation
  • Internet advertising
  • Web search
  • …
• An offline evaluation method for bandit algorithms
  • unbiased
  • accurate compared to online bucket results
• Encouraging results in significant applications
  • strong performance of UCB/TS exploration

Page 61:

Future Work

• Offline evaluation
  • Better use of non-uniform data
  • Extension to full reinforcement learning
• Use of prior knowledge
• Variants of bandits
  • Bandits with budgets
  • Bandits with many arms
  • Bandits with multiple objectives
  • Bandits with submodular rewards
  • Bandits with delayed reward observations
  • …

Page 62:

References

Offline policy evaluation
› [LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
› [SLLK] Learning from logged implicit exploration data. NIPS, 2010.
› [DLL] Doubly robust policy evaluation and learning. ICML, 2011.
› [DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.

Bandit algorithms
› [LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
› [CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011.
› [BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
› [CL] An empirical evaluation of Thompson sampling. NIPS, 2011.
› [LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.

Page 63:

Thank You!