Dynamic resource allocation: Bandit problems and extensions
Aurélien Garivier
Institut de Mathématiques de Toulouse
MAD Seminar, Université Toulouse 1, October 3rd, 2014
The Bandit Model
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
The Bandit Model
Dynamic resource allocation
Imagine you are a doctor:
patients visit you one after another for a given disease
you prescribe one of the (say) 5 treatments available
the treatments are not equally efficient
you do not know which one is the best; you observe the effect of the prescribed treatment on each patient
⇒ What do you do?
You must choose each prescription using only the previous observations
Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible
The Bandit Model
The (stochastic) Multi-Armed Bandit Model
Environment: K arms with parameters θ = (θ_1, ..., θ_K) such that for any possible choice of arm a_t ∈ {1, ..., K} at time t, one receives the reward

X_t = X_{a_t, t}

where, for any 1 ≤ a ≤ K and s ≥ 1, X_{a,s} ∼ ν_a, and the (X_{a,s})_{a,s} are independent.

Reward distributions: ν_a ∈ F_a, a parametric family or not. Examples: canonical exponential family, general bounded rewards.

Example: Bernoulli rewards, θ ∈ [0, 1]^K, ν_a = B(θ_a)

Strategy: the agent's actions follow a dynamical strategy π = (π_1, π_2, ...) such that

A_t = π_t(X_1, ..., X_{t−1})
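To make the protocol concrete, here is a minimal simulation sketch of this model with Bernoulli rewards (the class and function names are illustrative, not from the talk):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli environment: arm a yields X ~ B(theta_a)."""
    def __init__(self, theta, seed=None):
        self.theta = np.asarray(theta)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # X_{a,s} ~ B(theta_a), independent of everything else
        return float(self.rng.random() < self.theta[a])

def play(bandit, strategy, horizon):
    """Run a dynamical strategy pi: A_t may depend only on the past."""
    history = []                       # list of (A_t, X_t) pairs
    for _ in range(horizon):
        a = strategy(history)          # A_t = pi_t(X_1, ..., X_{t-1})
        history.append((a, bandit.pull(a)))
    return history
```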
The Bandit Model
Real challenges
Randomized clinical trials
original motivation since the 1930s
dynamic strategies can save resources
Recommender systems:
advertisement
website optimization
news, blog posts, . . .
Computer experiments
large systems can be simulated in order to optimize some criterion over a set of parameters
but the simulation cost may be high, so that only few parameter choices are possible
Games and planning (tree-structured options)
The Bandit Model
Performance Evaluation, Regret
Cumulated Reward: S_T = ∑_{t=1}^T X_t

Our goal: choose π so as to maximize

E[S_T] = ∑_{t=1}^T ∑_{a=1}^K E[ E[X_t 1{A_t = a} | X_1, ..., X_{t−1}] ] = ∑_{a=1}^K µ_a E[N_a^π(T)]

where N_a^π(T) = ∑_{t≤T} 1{A_t = a} is the number of draws of arm a up to time T, and µ_a = E(ν_a).

Regret: minimization is equivalent to minimizing

R_T = T µ* − E[S_T] = ∑_{a: µ_a < µ*} (µ* − µ_a) E[N_a^π(T)]

where µ* = max{µ_a : 1 ≤ a ≤ K}
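As a hedged sketch: in a simulation, where the true means µ are known, the pseudo-regret of one run follows directly from the draw counts N_a(T):

```python
import numpy as np

def pseudo_regret(history, mu):
    """R_T = sum_a (mu* - mu_a) N_a(T), computed from one run of `play`."""
    mu = np.asarray(mu, dtype=float)
    draws = np.bincount([a for a, _ in history], minlength=len(mu))
    return float(np.dot(mu.max() - mu, draws))
```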
Lower Bound for the Regret
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Lower Bound for the Regret
Asymptotically Optimal Strategies
A strategy π is said to be consistent if, for any (ν_a)_a ∈ F^K,

(1/T) E[S_T] → µ*

The strategy is efficient if for all θ ∈ [0, 1]^K and all α > 0,

R_T = o(T^α)

There are efficient strategies, and we consider the best achievable asymptotic performance among efficient strategies.
Lower Bound for the Regret
The Bound of Lai and Robbins
One-parameter reward distributions: ν_a = ν_{θ_a}, θ_a ∈ Θ ⊂ R.

Theorem [Lai and Robbins '85]
If π is an efficient strategy, then, for any θ ∈ Θ^K,

lim inf_{T→∞} R_T / log(T) ≥ ∑_{a: µ_a < µ*} (µ* − µ_a) / KL(ν_a, ν*)

where KL(ν, ν′) denotes the Kullback-Leibler divergence.

For example, in the Bernoulli case:

KL(B(p), B(q)) = d_ber(p, q) = p log(p/q) + (1 − p) log((1−p)/(1−q))
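As an illustration (a sketch, with names of my own choosing): the Bernoulli divergence d_ber, and the resulting Lai & Robbins constant, i.e. the coefficient of log(T) in the lower bound:

```python
import numpy as np

def d_ber(p, q):
    """Bernoulli Kullback-Leibler divergence KL(B(p), B(q))."""
    eps = 1e-12                               # stay away from the boundary
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(mu):
    """Coefficient of log(T) in the Lai & Robbins lower bound."""
    mu = np.asarray(mu, dtype=float)
    mu_star = mu.max()
    return sum((mu_star - m) / d_ber(m, mu_star) for m in mu if m < mu_star)

# For the two-arm scenario theta = (0.9, 0.8) used later in the talk:
# lai_robbins_constant([0.9, 0.8]) = 0.1 / d_ber(0.8, 0.9) ~ 2.25
```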
Lower Bound for the Regret
The Bound of Burnetas and Katehakis
More general reward distributions: ν_a ∈ F_a

Theorem [Burnetas and Katehakis '96]
If π is an efficient strategy, then, for any θ ∈ [0, 1]^K,

lim inf_{T→∞} R_T / log(T) ≥ ∑_{a: µ_a < µ*} (µ* − µ_a) / K_inf(ν_a, µ*)

where

K_inf(ν_a, µ*) = inf{ KL(ν_a, ν′) : ν′ ∈ F_a, E(ν′) ≥ µ* }

[Figure: simplex picture of K_inf(ν_a, µ*), showing ν_a, ν*, and the Dirac masses δ_0, δ_{1/2}, δ_1.]
Lower Bound for the Regret
Intuition
First assume that µ* is known and that T is fixed.

How many draws n_a of ν_a are necessary to know that µ_a < µ* with probability at least 1 − 1/T?

Test: H_0: µ_a = µ* against H_1: ν = ν_a

Stein's Lemma: if the type I error satisfies α_{n_a} ≤ 1/T, then

β_{n_a} ≳ exp(−n_a K_inf(ν_a, µ*))

⇒ it can be smaller than 1/T only if

n_a ≥ log(T) / K_inf(ν_a, µ*)

How to do as well without knowing µ* and T in advance? And not only asymptotically?
Optimistic Algorithms
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Optimistic Algorithms
Optimism in the Face of Uncertainty
Optimism is a heuristic principle popularized by [Lai&Robbins '85; Agrawal '95] which consists in letting the agent

play as if the environment were the most favorable among all environments that are sufficiently likely given the observations accumulated so far.

Surprisingly, this simple heuristic principle can be instantiated into algorithms that are robust, efficient and easy to implement in many scenarios pertaining to reinforcement learning.
Optimistic Algorithms
Upper Confidence Bound Strategies
UCB [Lai&Robbins '85; Agrawal '95; Auer&al '02]

Construct an upper confidence bound for the expected reward of each arm:

S_a(t)/N_a(t) + √( log(t) / (2 N_a(t)) )

(estimated reward + exploration bonus)

Choose the arm with the highest UCB.

It is an index strategy [Gittins '79].
Its behavior is easily interpretable and intuitively appealing.
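A minimal implementation sketch of this index policy for rewards in [0, 1] (the function signature is my own; the index matches the formula above):

```python
import numpy as np

def ucb(arms, horizon, seed=None):
    """Vanilla UCB: play each arm once, then the highest index."""
    rng = np.random.default_rng(seed)
    K = len(arms)
    N = np.zeros(K)                    # N_a(t): number of draws of arm a
    S = np.zeros(K)                    # S_a(t): sum of rewards of arm a
    for t in range(horizon):
        if t < K:
            a = t                      # initialization: one draw per arm
        else:
            index = S / N + np.sqrt(np.log(t) / (2 * N))
            a = int(np.argmax(index))
        S[a] += arms[a](rng)           # each arm is a callable returning a reward
        N[a] += 1
    return S.sum(), N

# Example with two Bernoulli arms:
# arms = [lambda r: float(r.random() < 0.9), lambda r: float(r.random() < 0.8)]
# total_reward, draws = ucb(arms, horizon=10_000)
```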
Optimistic Algorithms
UCB in Action
Optimistic Algorithms
UCB in Action
Optimistic Algorithms
Performance of UCB
For rewards in [0, 1], the regret of UCB is upper-bounded as

E[R_T] = O(log(T))

(finite-time regret bound) and

lim sup_{T→∞} E[R_T] / log(T) ≤ ∑_{a: µ_a < µ*} 1 / (2(µ* − µ_a))

Yet, in the case of Bernoulli variables, the right-hand side is greater than suggested by the bound of Lai & Robbins.

Many variants have been suggested to incorporate an estimate of the variance in the exploration bonus (e.g., [Audibert&al '09]).
An Optimistic Algorithm based on Kullback-Leibler Divergence
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
An Optimistic Algorithm based on Kullback-Leibler Divergence
The KL-UCB algorithm [Cappé, G. & al '13]

Parameters: an operator Π_F : M_1(S) → F; a non-decreasing function f : N → R
Initialization: pull each arm of {1, ..., K} once
for t = K to T − 1 do
  compute for each arm a the quantity
    U_a(t) = sup{ E(ν) : ν ∈ F and KL(Π_F(ν̂_a(t)), ν) ≤ f(t)/N_a(t) }
  pick an arm A_{t+1} ∈ arg max_{a ∈ {1,...,K}} U_a(t)
end for
Parametric setting: the kl-UCB Algorithm
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Parametric setting: the kl-UCB Algorithm
Exponential Family Rewards

Assume that F_a = F = a canonical exponential family, i.e., such that the pdf of the rewards is given by

p_{θ_a}(x) = exp( x θ_a − b(θ_a) + c(x) ), 1 ≤ a ≤ K,

for a parameter θ ∈ R^K, with expectation µ_a = b′(θ_a). The KL-UCB index is simply:

U_a(t) = sup{ µ ∈ I : d(µ̂_a(t), µ) ≤ f(t)/N_a(t) }

For instance,

for Bernoulli rewards: d_ber(p, q) = p log(p/q) + (1 − p) log((1−p)/(1−q))

for exponential rewards p_{θ_a}(x) = θ_a e^{−θ_a x}: d_exp(u, v) = u/v − 1 − log(u/v)

The analysis is generic and yields a non-asymptotic regret bound, optimal in the sense of Lai and Robbins.
Parametric setting: the kl-UCB Algorithm
Parametric version: the kl-UCB algorithm
Parameters: F parameterized by the expectation µ ∈ I ⊂ R with divergence d; a non-decreasing function f : N → R
Initialization: pull each arm of {1, ..., K} once
for t = K to T − 1 do
  compute for each arm a the quantity
    U_a(t) = sup{ µ ∈ I : d(µ̂_a(t), µ) ≤ f(t)/N_a(t) }
  pick an arm A_{t+1} ∈ arg max_{a ∈ {1,...,K}} U_a(t)
end for
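Since d(µ̂, ·) is increasing above the empirical mean (as in the exponential-family case), U_a(t) can be computed by bisection. A sketch, reusing d_ber from the earlier snippet:

```python
import numpy as np

def klucb_index(mean, n, t, d=None, upper=1.0, tol=1e-6):
    """U = sup{ mu : d(mean, mu) <= f(t)/n }, with f(t) = log(t),
    found by bisection on [mean, upper]."""
    d = d or d_ber                     # d_ber as defined in the earlier sketch
    level = np.log(t) / n
    lo, hi = mean, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mean, mid) <= level:      # mid still inside the confidence set
            lo = mid
        else:
            hi = mid
    return lo
```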
Parametric setting: the kl-UCB Algorithm
The kl Upper Confidence Bound in Picture
If Z_1, ..., Z_s iid∼ B(θ_0), x < θ_0, and if p̂_s = (Z_1 + ... + Z_s)/s, then by Chernoff's inequality

P_{θ_0}(p̂_s ≤ x) ≤ exp(−s d_ber(x, θ_0)).

[Figure: the function d_ber(·, θ_0), the level −log(α)/s, and the points x, p̂_s, u_s.]

In other words, if α = exp(−s d_ber(x, θ_0)):

P_{θ_0}(p̂_s ≤ x) = P_{θ_0}( d_ber(p̂_s, θ_0) ≥ −log(α)/s, p̂_s < θ_0 ) ≤ α

⇒ Upper Confidence Bound for p at risk α:

u_s = sup{ θ > p̂_s : d_ber(p̂_s, θ) ≤ −log(α)/s }.
Parametric setting: the kl-UCB Algorithm
Key Tool: Deviation Inequality for Self-Normalized Sums
Problem: random number of summands
Solution: peeling trick (as in the proof of the LIL)
Theorem: For all ε > 1,

P( µ_a > µ̂_a(t) and N_a(t) d(µ̂_a(t), µ_a) ≥ ε ) ≤ e ⌈ε log(t)⌉ e^{−ε}.

Thus,

P( U_a(t) < µ_a ) ≤ e ⌈f(t) log(t)⌉ e^{−f(t)}
Parametric setting: the kl-UCB Algorithm
Regret bound
Theorem: Assume that all arms belong to a canonical, regular, exponential family F = {ν_θ : θ ∈ Θ} of probability distributions indexed by its natural parameter space Θ ⊆ R. Then, with the choice f(t) = log(t) + 3 log log(t) for t ≥ 3, the number of draws of any suboptimal arm a is upper bounded for any horizon T ≥ 3 as

E[N_a(T)] ≤ log(T)/d(µ_a, µ⋆)
  + 2 √( 2π σ²_{a,⋆} (d′(µ_a, µ⋆))² / (d(µ_a, µ⋆))³ ) √( log(T) + 3 log(log(T)) )
  + ( 4e + 3/d(µ_a, µ⋆) ) log(log(T))
  + 8 σ²_{a,⋆} ( d′(µ_a, µ⋆)/d(µ_a, µ⋆) )² + 6,

where σ²_{a,⋆} = max{ Var(ν_θ) : µ_a ≤ E(ν_θ) ≤ µ⋆ } and where d′(·, µ⋆) denotes the derivative of d(·, µ⋆).
Parametric setting: the kl-UCB Algorithm
Results: Two-Arm Scenario
[Figure: Performance of UCB, MOSS, UCB-Tuned, UCB-V, DMED and KL-UCB when θ = (0.9, 0.8). Left: average number of draws N_2(n) of the sub-optimal arm as a function of time n (log scale), together with the lower bound. Right: box-and-whiskers plot of the number of draws of the sub-optimal arm at time T = 5,000. Results based on 50,000 independent replications.]
Parametric setting: the kl-UCB Algorithm
Results: Ten-Arm Scenario with Low Rewards
[Figure: Average regret R_n as a function of time n (log scale) for UCB, MOSS, UCB-V, UCB-Tuned, DMED, KL-UCB, CP-UCB, DMED+ and KL-UCB+, when θ = (0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01). Red line: Lai & Robbins lower bound; thick line: average regret; shaded regions: central 99% region and upper 99.95% quantile.]
Non-parametric setting and Empirical Likelihood
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Non-parametric setting and Empirical Likelihood
Non-parametric setting
Rewards are only assumed to be bounded (say in [0, 1])
Need for an estimation procedure
with non-asymptotic guarantees
efficient in the sense of Stein / Bahadur
=⇒ Idea 1: use dber (Hoeffding)
=⇒ Idea 2: Empirical Likelihood [Owen ’01]
Bad idea: use Bernstein / Bennett
Non-parametric setting and Empirical Likelihood
First idea: use dber
Idea: rescale to [0, 1], and take the divergence d_ber
→ because Bernoulli distributions maximize deviations among bounded variables with a given expectation:

Lemma (Hoeffding '63)
Let X denote a random variable such that 0 ≤ X ≤ 1 and denote by µ = E[X] its mean. Then, for any λ ∈ R,

E[exp(λX)] ≤ 1 − µ + µ exp(λ).

This fact is well known for the variance, but it is also true for all exponential moments and thus for Cramér-type deviation bounds.
Non-parametric setting and Empirical Likelihood
Regret Bound for kl-UCB
Theorem
With the divergence d_ber, for all T > 3,

E[N_a(T)] ≤ log(T)/d_ber(µ_a, µ⋆)
  + √(2π) log( µ⋆(1−µ_a) / (µ_a(1−µ⋆)) ) / (d_ber(µ_a, µ⋆))^{3/2} √( log(T) + 3 log(log(T)) )
  + ( 4e + 3/d_ber(µ_a, µ⋆) ) log(log(T))
  + 2 ( log( µ⋆(1−µ_a) / (µ_a(1−µ⋆)) ) )² / (d_ber(µ_a, µ⋆))² + 6.

kl-UCB thus satisfies an improved logarithmic finite-time regret bound.
Besides, it is asymptotically optimal in the Bernoulli case.
Non-parametric setting and Empirical Likelihood
Comparison to UCB

KL-UCB addresses exactly the same problem as UCB, with the same generality, but it always has a smaller regret, as can be seen from Pinsker's inequality:

d_ber(µ_1, µ_2) ≥ 2(µ_1 − µ_2)²

[Figure: kl(0.7, q) versus 2(0.7 − q)² as functions of q.]
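A quick numerical check of this domination, reusing the d_ber sketch from above:

```python
import numpy as np

# Pinsker: d_ber(p, q) >= 2 (p - q)^2, so the kl confidence interval is
# always contained in the Hoeffding interval used by UCB.
qs = np.linspace(0.01, 0.99, 99)
assert np.all(d_ber(0.7, qs) >= 2 * (0.7 - qs) ** 2)
```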
Non-parametric setting and Empirical Likelihood
Idea 2: Empirical Likelihood
U(ν̂_n, ε) = sup{ E(ν′) : ν′ ∈ M_1(Supp(ν̂_n)) and KL(ν̂_n, ν′) ≤ ε }

or, rather, modified Empirical Likelihood:

U(ν̂_n, ε) = sup{ E(ν′) : ν′ ∈ M_1(Supp(ν̂_n) ∪ {1}) and KL(ν̂_n, ν′) ≤ ε }

[Figure: empirical mean µ̂_n and EL upper confidence bound U_n.]
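A sketch of the modified EL bound as a small convex program, solved here with a generic optimizer from scipy rather than a dedicated routine (all names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def el_upper_bound(samples, eps):
    """sup{ E(nu') : nu' on Supp(nu_n) U {1}, KL(nu_n, nu') <= eps }."""
    support, counts = np.unique(np.asarray(samples, dtype=float),
                                return_counts=True)
    if support[-1] != 1.0:                 # augment the support with {1}
        support = np.append(support, 1.0)
        counts = np.append(counts, 0)
    p = counts / counts.sum()              # empirical weights nu_n

    mask = p > 0                           # terms with p_j = 0 vanish in the KL
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},
            {"type": "ineq",               # eps - KL(p, q) >= 0
             "fun": lambda q: eps - np.sum(p[mask] * np.log(p[mask] / q[mask]))}]
    q0 = np.full(len(support), 1.0 / len(support))
    res = minimize(lambda q: -np.dot(q, support), q0,
                   bounds=[(1e-12, 1.0)] * len(support),
                   constraints=cons, method="SLSQP")
    return float(-res.fun)

# e.g. el_upper_bound(np.random.rand(50), eps=np.log(50) / 50)
```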
Non-parametric setting and Empirical Likelihood
Coverage properties of the modified EL confidence bound
Proposition: Let ν_0 ∈ M_1([0, 1]) with E(ν_0) ∈ (0, 1), and let X_1, ..., X_n be independent random variables with common distribution ν_0, not necessarily with finite support. Then, for all ε > 0,

P{ U(ν̂_n, ε) ≤ E(ν_0) } ≤ P{ K_inf(ν̂_n, E(ν_0)) ≥ ε } ≤ e(n + 2) exp(−nε).

Remark: For {0, 1}-valued observations, it is readily seen that U(ν̂_n, ε) boils down to the upper-confidence bound above.
⇒ This proposition is at least not always optimal: the presence of the factor n in front of the exponential term exp(−nε) is questionable.
Non-parametric setting and Empirical Likelihood
Regret bound
Theorem: Assume that F is the set of finitely supported probability distributions over S = [0, 1], that µ_a > 0 for all arms a, and that µ⋆ < 1. There exists a constant M(ν_a, µ⋆) > 0, depending only on ν_a and µ⋆, such that, with the choice f(t) = log(t) + log(log(t)) for t ≥ 2, for all T ≥ 3:

E[N_a(T)] ≤ log(T)/K_inf(ν_a, µ⋆)
  + 36/(µ⋆)⁴ (log(T))^{4/5} log(log(T))
  + ( 72/(µ⋆)⁴ + 2µ⋆/((1 − µ⋆) K_inf(ν_a, µ⋆)²) ) (log(T))^{4/5}
  + (1 − µ⋆)² M(ν_a, µ⋆)/(2(µ⋆)²) (log(T))^{2/5}
  + log(log(T))/K_inf(ν_a, µ⋆)
  + 2µ⋆/((1 − µ⋆) K_inf(ν_a, µ⋆)²) + 4.
Non-parametric setting and Empirical Likelihood
Example: truncated Poisson rewards
Each arm 1 ≤ a ≤ 6 is associated with ν_a, a Poisson distribution with expectation (2 + a)/4, truncated at 10.
N = 10,000 Monte-Carlo replications over a horizon of T = 20,000 steps.

[Figure: Regret of kl-Poisson-UCB, KL-UCB, UCB-V, kl-UCB and UCB as a function of time (log scale) in the truncated Poisson scenario.]
Non-parametric setting and Empirical Likelihood
Example: truncated Exponential rewards

exponential rewards with respective parameters 1/5, 1/4, 1/3, 1/2 and 1, truncated at x_max = 10;
kl-UCB uses the divergence d(x, y) = x/y − 1 − log(x/y) prescribed for genuine exponential distributions, but it ignores the fact that the rewards are truncated.

[Figure: Regret of kl-exp-UCB, KL-UCB, UCB-V, kl-UCB and UCB as a function of time (log scale) in the truncated exponential scenario.]
Non-parametric setting and Empirical Likelihood
Take-home message on bandit algorithms
1 Use kl-UCB rather than UCB-1 or UCB-2
2 Use KL-UCB if speed is not a problem
3 To do: improve on the deviation bounds, address general non-parametric families of distributions
4 Alternative: Bayesian-flavored methods:
Bayes-UCB [Kaufmann, Cappé, G.]
Thompson sampling [Kaufmann & al.]
Extensions
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Extensions
Non-stationary Bandits [G. Moulines ’11]
Changepoint: reward distributions change abruptly
Goal: follow the best arm
Application: scanning tunnelling microscope
Variants D-UCB and SW-UCB including a progressive discount of the past
Bounds O(√n log n) are proved, which is (almost) optimal
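A sketch of the discounted index used by D-UCB (the function and parameter names are mine; the factor 2B√(ξ·) follows the usual statement of the algorithm, under the assumption that every arm has already been played):

```python
import numpy as np

def ducb_indices(arms_played, rewards, K, gamma=0.99, B=1.0, xi=0.6):
    """Discounted-UCB indices: rewards are weighted by gamma**age, so
    observations from before a changepoint fade away geometrically."""
    t = len(rewards)
    w = gamma ** np.arange(t - 1, -1, -1)   # weight gamma**(t-1-s) for round s
    N = np.zeros(K)                          # discounted draw counts
    S = np.zeros(K)                          # discounted reward sums
    for s, (a, x) in enumerate(zip(arms_played, rewards)):
        N[a] += w[s]
        S[a] += w[s] * x
    # assumes every arm has been played at least once
    return S / N + 2 * B * np.sqrt(xi * np.log(N.sum()) / N)
```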
Extensions
(Generalized) Linear Bandits [Filippi, Cappe, G. &Szepesvari ’10]
Bandit with contextual information:

E[X_t | A_t] = µ(m_{A_t}ᵀ θ*)

where θ* ∈ R^d is an unknown parameter and µ : R → R is a link function.

Example: binary rewards,

µ(x) = exp(x) / (1 + exp(x))

Application: targeted web ads

GLM-UCB: regret bound depending on the dimension d and not on the number of arms
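A hedged sketch of the GLM-UCB index in the logistic-link case (theta_hat and the exploration factor rho would come from an online maximum-likelihood estimate and from the analysis, both omitted here; names are mine):

```python
import numpy as np

def glm_ucb_indices(M, theta_hat, V, rho):
    """Index of each arm: mu(m_a' theta_hat) + rho * ||m_a||_{V^{-1}},
    where M is the (K, d) matrix of arm features and V = sum_s m_{A_s} m_{A_s}'.
    The bonus depends on the dimension d through V, not on the number of arms."""
    mu = lambda x: 1.0 / (1.0 + np.exp(-x))          # logistic link
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("kd,de,ke->k", M, V_inv, M))
    return mu(M @ theta_hat) + rho * widths
```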
Extensions
Stochastic Optimization
Goal: find the maximum of a function f : C ⊂ R^d → R, (possibly) observed in noise
Application: DAS

[Figure: a sample function with its confidence band.]

Model: f is the realization of a Gaussian Process (or has a small norm in some RKHS)
GP-UCB: evaluate f at the point x ∈ C where the confidence interval for f(x) has the highest upper bound
Extensions
Recent Advances in Continuous bandits
Unimodal bandits without smoothness: trisection algorithms, and better [Combes, Proutière '14]. Application to internet network traffic optimization.
Reserve Price Optimization in Second-price Auctions [Cesa-Bianchi, Gentile, Mansour '13]. Application to advertisement systems.
Extensions
Markov Decision Processes (MDP) [Filippi, Cappe & G.’10]
The system is in state S_t, which evolves as a Markov Chain:

S_{t+1} ∼ P(·; S_t, A_t) and R_t = r(S_t, A_t) + ε_t

Optimistic algorithm: search for the best transition matrix in a neighborhood of the ML estimate
The use of Kullback-Leibler neighborhoods leads to better performance and has desirable properties.
Extensions
Optimal Exploration with Probabilistic Expert Advice
Search space: B ⊂ Ω, a discrete set
Probabilistic experts: P_a ∈ M_1(Ω) for a ∈ A
Requests: at time t, calling expert A_t yields a realization of X_t = X_{A_t,t}, independent with law P_{A_t}
Goal: find as many distinct elements of B as possible with few requests:

F_n = Card( B ∩ {X_1, ..., X_n} )

≠ bandit: finding the same element twice is of no use!
Oracle: selects the expert with the highest 'missing mass'

A*_{t+1} = arg max_{a∈A} P_a( B \ {X_1, ..., X_t} )
Extensions
The model
Subset A ⊂ X of important items
|X| ≫ 1, |A| ≪ |X|
Access to X only by probabilistic experts (P_i)_{1≤i≤K}: sequential independent draws
Goal: discover rapidly the elements of A
Extensions
Goal
At each time step t = 1, 2, ...:

pick an index I_t = π_t(I_1, Y_1, ..., I_{t−1}, Y_{t−1}) ∈ {1, ..., K} according to past observations
observe Y_t = X_{I_t, n_{I_t,t}} ∼ P_{I_t}, where n_{i,t} = ∑_{s≤t} 1{I_s = i}

Goal: design the strategy π = (π_t)_t so as to maximize the number of important items found after t requests:

F^π(t) = |A ∩ {Y_1, ..., Y_t}|

Assumption: non-intersecting supports,

A ∩ supp(P_i) ∩ supp(P_j) = ∅ for i ≠ j
Extensions
Is it a Bandit Problem ?
It looks like a bandit problem. . .
sequential choices among K options
want to maximize cumulative rewards
exploration vs exploitation dilemma
. . . but it is not a bandit problem !
rewards are not i.i.d.
destructive rewards: there is no point in observing the same important item twice
all strategies eventually equivalent
Extensions
The oracle strategy
Proposition: Under the non-intersecting support hypothesis, the greedy oracle strategy

I*_t ∈ arg max_{1≤i≤K} P_i( A \ {Y_1, ..., Y_t} )

is optimal: for every possible strategy π, E[F^π(t)] ≤ E[F*(t)].

Remark: the proposition is false if the supports may intersect.

⇒ estimate the "missing mass of important items"!
Extensions
Estimating the missing mass
Notation: X_t iid∼ P ∈ M_1(Ω), O_n(ω) = ∑_{t=1}^n 1{X_t = ω},
Z_n(ω) = 1{O_n(ω) = 0}, H_n(ω) = 1{O_n(ω) = 1}, H_n = ∑_{ω∈B} H_n(ω)

Problem: estimate the missing mass

R_n = ∑_{ω∈B} P(ω) Z_n(ω)

Good-Turing: the 'estimator' R̂_n = H_n/n satisfies E[R̂_n − R_n] ∈ [0, 1/n].

Concentration: by McDiarmid's inequality, with probability ≥ 1 − δ,

|R̂_n − E[R̂_n]| ≤ (2/n + p_max) √( n log(2/δ) / 2 )
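The Good-Turing estimator itself is essentially one line; a sketch (here `is_important` tests membership in B, which is assumed recognizable on sight):

```python
from collections import Counter

def good_turing_missing_mass(samples, is_important):
    """R_hat = H_n / n, where H_n = number of important items seen
    exactly once among the n samples."""
    counts = Counter(samples)
    h_n = sum(1 for x, m in counts.items() if m == 1 and is_important(x))
    return h_n / len(samples)
```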
Extensions
The Good-UCB algorithm [Bubeck, Ernst & G.]
Optimistic algorithm based on the Good-Turing estimator:

A_{t+1} = arg max_{a∈A} H_a(t)/N_a(t) + c √( log(t)/N_a(t) )

N_a(t) = number of draws of P_a up to time t
H_a(t) = number of elements of B seen exactly once thanks to P_a
c = tuning parameter
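A runnable sketch of Good-UCB along these lines (the expert interface and all names are mine; the default constant follows the theorem below):

```python
from collections import defaultdict
import numpy as np

def good_ucb(experts, horizon, is_important, c=(1 + np.sqrt(2)) * np.sqrt(3)):
    """`experts` is a list of K callables, each returning one independent
    draw from its distribution P_a; returns the important items found."""
    K = len(experts)
    counts = [defaultdict(int) for _ in range(K)]  # per-arm item multiplicities
    N = np.zeros(K)
    found = set()

    def pull(a):
        x = experts[a]()
        counts[a][x] += 1
        N[a] += 1
        if is_important(x):
            found.add(x)

    for a in range(K):                 # initialization: one call per expert
        pull(a)
    for t in range(K, horizon):
        # H_a(t): important items seen exactly once through expert a
        H = np.array([sum(1 for x, m in counts[a].items()
                          if m == 1 and is_important(x)) for a in range(K)])
        pull(int(np.argmax(H / N + c * np.sqrt(np.log(t) / N))))
    return found
```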
Extensions
Classical analysis
Theorem: For any t ≥ 1, under the non-intersecting support assumption, Good-UCB (with constant C = (1 + √2)√3) satisfies

E[F*(t) − F^UCB(t)] ≤ 17√(Kt log(t)) + 20√(Kt) + K + K log(t/K)

Remark: a usual-looking result for a bandit problem, but a not-so-simple analysis.
Extensions
Good-UCB in action
Extensions
The macroscopic limit
Restricted framework: P_i = U({1, ..., N}), N → ∞,
|A ∩ supp(P_i)|/N → q_i ∈ (0, 1), q = ∑_i q_i
Extensions
The Oracle behaviour
The limiting discovery process of the Oracle strategy is deterministic.

Proposition: For every λ ∈ (0, q_1) and every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

lim_{N→∞} T^N_*(λ_N)/N = ∑_i ( log(q_i/λ) )_+
Extensions
Oracle vs. uniform sampling
Oracle: the proportion of important items not found after Nt draws tends to

q − F*(t) = I(t) q̲_{I(t)} exp(−t/I(t)) ≤ K q̲_K exp(−t/K)

with q̲_K = (∏_{i=1}^K q_i)^{1/K} the geometric mean of the (q_i)_i.

Uniform: the proportion of important items not found after Nt draws tends to K q̄_K exp(−t/K), with q̄_K the arithmetic mean of the (q_i)_i.

⇒ Asymptotic ratio of efficiency:

ρ(q) = q̄_K / q̲_K = ( (1/K) ∑_{i=1}^K q_i ) / ( ∏_{i=1}^K q_i )^{1/K} ≥ 1,

larger if the (q_i)_i are unbalanced.
Extensions
Macroscopic optimality
Theorem: Take C = (1 + √2)√(c + 2) with c > 3/2 in the Good-UCB algorithm.

For every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

lim sup_{N→+∞} T^N_UCB(λ_N)/N ≤ ∑_i ( log(q_i/λ) )_+

The proportion of items found after Nt steps, F^GUCB, converges uniformly to F* as N goes to infinity.
Extensions
Simulation
Number of items found by Good-UCB (line), by the oracle (bold dashed), and by uniform sampling (light dotted) as a function of time, for sample sizes N = 128, N = 500, N = 1000 and N = 10000, in an environment with 7 experts.
Extensions
Bibliography
1 [Abbasi-Yadkori&al '11] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Online least squares estimation with self-normalized processes: an application to bandit problems. CoRR abs/1102.2670, 2011.
2 [Agrawal '95] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
3 [Audibert&al '09] J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
4 [Auer&al '02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
5 [Bubeck, Ernst&G. '11] S. Bubeck, D. Ernst, and A. Garivier. Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality. Journal of Machine Learning Research, 14:601-623, Feb. 2013.
6 [De La Pena&al '04] V.H. de la Peña, M.J. Klass, and T.L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902-1933, 2004.
7 [Filippi, Cappe&Garivier '10] S. Filippi, O. Cappé, and A. Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In Allerton Conf. on Communication, Control, and Computing, Monticello, US, 2010.
8 [Filippi, Cappe, G.&Szepesvari '10] S. Filippi, O. Cappé, A. Garivier, and Cs. Szepesvári. Parametric bandits: the generalized linear case. In Neural Information Processing Systems (NIPS), 2010.
9 [G.&Cappe '11] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 24th Conf. on Learning Theory (COLT), Budapest, Hungary, 2011.
10 [Cappe,G.&al '13] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, accepted.
11 [G.&Leonardi '11] A. Garivier and F. Leonardi. Context tree selection: a unifying view. Stochastic Processes and their Applications, 121(11):2488-2506, Nov. 2011.
12 [G.&Moulines '11] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory (ALT), volume 6925 of Lecture Notes in Computer Science, 2011.
13 [Lai&Robbins '85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.