TRANSCRIPT
Bandit Optimization: Theory and Applications
Richard Combes (Centrale-Supélec / L2S, France) and Alexandre Proutière (KTH, Royal Institute of Technology, Sweden)
SIGMETRICS 2015
Outline
Introduction and examples of application
Tools and techniques
Discrete bandits with independent arms
A first example: sequential treatment allocation
- There are T patients with the same symptoms awaiting treatment
- Two treatments exist, and one is better than the other
- Based on past successes and failures, which treatment should you use?
The model
- At time n, choose action x_n ∈ X, observe feedback y_n(x_n) ∈ Y, and obtain reward r_n(x_n) ∈ R_+.
- "Bandit feedback": rewards and feedback depend on the actions chosen (often y_n ≡ r_n)
- Admissible algorithm:

x_{n+1} = f_{n+1}(x_0, r_0(x_0), y_0(x_0), ..., x_n, r_n(x_n), y_n(x_n))

- Performance metric: regret

R(T) = max_{x∈X} E[Σ_{n=1}^T r_n(x)] − E[Σ_{n=1}^T r_n(x_n)],

where the first term is the oracle and the second is your algorithm.
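The interaction loop and the regret above can be sketched in a few lines of Python (a minimal simulation with hypothetical Bernoulli arms; all names are illustrative):

```python
import random

def run_bandit(theta, policy, T, seed=0):
    """Simulate a stochastic bandit; theta[x] is the mean reward of arm x.
    Returns an empirical estimate of the regret R(T)."""
    rng = random.Random(seed)
    counts = [0] * len(theta)     # t_x(n): number of plays of each arm
    sums = [0.0] * len(theta)     # cumulative reward of each arm
    total = 0.0
    for n in range(T):
        x = policy(n, counts, sums)
        r = 1.0 if rng.random() < theta[x] else 0.0  # Bernoulli(theta[x]) reward
        counts[x] += 1
        sums[x] += r
        total += r
    # The oracle plays the best arm in every round
    return T * max(theta) - total

# A policy that always plays arm 0 suffers linear regret when arm 0 is suboptimal
regret = run_bandit([0.3, 0.7], lambda n, counts, sums: 0, T=1000)
```

With the gap 0.7 − 0.3 = 0.4, this naive policy loses about 0.4 per round, i.e. regret ≈ 400 after 1000 rounds; the algorithms later in the tutorial bring this down to O(log(T)).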
Bandit taxonomy: adversarial vs stochastic

Stochastic bandit:
- Game against a stochastic environment
- Unknown parameters θ ∈ Θ
- (r_n(x))_n is i.i.d. with expectation θ_x

Adversarial bandit:
- Game against a non-adaptive adversary
- For all x, (r_n(x))_n is an arbitrary sequence of rewards
- At time 0, the adversary "writes down (r_n(x))_{n,x} in an envelope"

Engineering problems are mainly stochastic.
Independent vs correlated arms
- Independent arms: Θ = [0,1]^K
- Correlated arms: Θ ≠ [0,1]^K; choosing arm 1 gives information on arms 1 and 2

Correlation enables (sometimes much) faster learning.
Bandit taxonomy: frequentist vs Bayesian

How should an algorithm applied to a set of problems be assessed?

Frequentist (classical):
- Problem-dependent regret: R^π_θ(T), θ fixed
- Minimax regret: max_{θ∈Θ} R^π_θ(T)
- The two regret scalings are usually very different

Bayesian:
- Prior distribution θ ∼ P, known to the algorithm
- Bayesian regret: E_{θ∼P}[R^π_θ(T)]
- P naturally encodes information on the problem structure
Bandit taxonomy: cardinality of the set of arms

Discrete bandits:
- X = {1, ..., K}
- All arms can be sampled infinitely many times
- Regret O(log(T)) (stochastic), O(√T) (adversarial)

Infinite bandits:
- X = N, Bayesian setting (otherwise the problem is trivial)
- Explore o(T) arms until a good one is found
- Regret: O(√T)

Continuous bandits:
- X ⊂ R^d convex, x ↦ µ_θ(x) has a structure
- Structures: convex, Lipschitz, linear, unimodal (quasi-convex), etc.
- Similar to derivative-free stochastic optimization
- Regret: O(poly(d)√T)
Bandit taxonomy: regret minimization vs best arm identification

Sample arms and output the best arm with a given probability, similar to PAC learning.

Fixed budget setting:
- T fixed; sample arms x_1, ..., x_T, and output x̂_T
- Easier problem: estimation + budget allocation
- Goal: minimize P[x̂_T ≠ x*]

Fixed confidence setting:
- δ fixed; sample arms x_1, ..., x_τ, and output x̂_τ
- Harder problem: estimation + budget allocation + optimal stopping (τ is a stopping time)
- Goal: minimize E[τ] s.t. P[x̂_τ ≠ x*] ≤ δ
Example 1: rate adaptation in wireless networks
- Adapt the modulation/coding scheme to the radio environment
- Rates: r_1, r_2, ..., r_K
- Success probabilities: θ_1, θ_2, ..., θ_K
- Throughputs: µ_1, µ_2, ..., µ_K

Structure: unimodality + θ_1 > θ_2 > · · · > θ_K.
Example 2: shortest path routing
- Choose a path minimizing expected delay
- Stochastic link delays: X_i(n) ∼ Geo(1/θ_i)
- Path M ∈ {0,1}^d, expected delay ⟨M, θ⟩
- Semi-bandit feedback: observe X_i(n) for {i : M_i(n) = 1}

[Figure: a 16-node network graph; the path 1 → 6 → 11 → 15 → 16 is highlighted, with link parameters θ_{1,6}, θ_{6,11}, θ_{11,15}, θ_{15,16}.]
Example 3: learning to rank (search engines)
- Given a query: N relevant items, L display slots
- A user is shown L items, scrolls down, and selects the first relevant item
- One must show the most relevant items in the first slots
- θ_n: probability of clicking on item n (independence between items is assumed)
- Reward r(l) if the user clicks on the l-th item, and 0 if the user does not click

[Figure: screenshot of a Google results page for the query "jaguar".]
Example 4: ad-display optimization
- Users are shown ads relevant to their queries
- Advertisers k ∈ {1, ..., K}, with click-through rate µ_k and budget per unit of time c_k
- Bandit with budgets: each arm has a budget of plays
- The displayed advertiser is charged per impression/click
Outline
Introduction and examples of application
Tools and techniques
Discrete bandits with independent arms
Optimism in the face of uncertainty
- Replace arm values by upper confidence bounds
- "Index" b_x(n) such that b_x(n) ≥ θ_x with high probability
- Select the arm with the highest index: x_n ∈ arg max_{x∈X} b_x(n)
- Analysis idea:

E[t_x(T)] ≤ Σ_{n=1}^T P[b_{x*}(n) ≤ θ*] + Σ_{n=1}^T P[x_n = x, b_x(n) ≥ θ*],

where the first sum is o(log(T)) and the second is the dominant term.

Almost all algorithms in the literature are optimistic (sic!)
Information theory and statistics
- Distributions P, Q with densities p and q w.r.t. a measure m
- Kullback-Leibler divergence:

D(P||Q) = ∫ p(x) log(p(x)/q(x)) m(dx)

- Pinsker's inequality:

D(P||Q) ≥ (1/2) (∫ |p(x) − q(x)| m(dx))^2

- If P, Q ∼ Ber(p), Ber(q):

D(P||Q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

- Also (Pinsker + the inequality log(x) ≤ x − 1):

2(p − q)^2 ≤ D(P||Q) ≤ (p − q)^2 / (q(1 − q))

The KL divergence is ubiquitous in bandit problems.
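The Bernoulli KL divergence and the two-sided bound above are easy to check numerically (a minimal sketch; the values 0.3 and 0.5 are arbitrary):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Sanity check of 2(p - q)^2 <= D(P||Q) <= (p - q)^2 / (q(1 - q))
p, q = 0.3, 0.5
d = kl_bernoulli(p, q)
assert 2 * (p - q) ** 2 <= d <= (p - q) ** 2 / (q * (1 - q))
```

Note how tight the lower bound is here: 2(p − q)^2 = 0.08 against D ≈ 0.0823; the gap widens as q approaches 0 or 1, which is exactly where KL-based indices beat Hoeffding-based ones.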
Empirical divergence and Sanov's inequality
- P a discrete distribution with support P
- P̂_n the empirical distribution of an i.i.d. sample of size n from P
- Sanov's inequality:

P[D(P̂_n||P) ≥ δ] ≤ binom(n + |P| − 1, |P| − 1) e^{−nδ}

- This suggests confidence regions (risk α) of the form:

{P : n D(P̂_n||P) ≤ log(1/α)}
Hypothesis testing and sample complexity
- How many samples are needed to distinguish P from Q?
- Observe X = (X_1, ..., X_n) i.i.d. with distribution P^n or Q^n
- Additivity of the KL divergence: D(P^n||Q^n) = n D(P||Q)
- Test φ(X) ∈ {0,1}, risk α > 0:

E_P[φ(X)] + E_Q[1 − φ(X)] ≤ α

- Tsybakov's inequality:

(1/2) e^{−min{D(P^n||Q^n), D(Q^n||P^n)}} ≤ E_P[φ(X)] + E_Q[1 − φ(X)]

- Minimal number of samples:

n ≥ (log(1/α) − log(2)) / min{D(P||Q), D(Q||P)}
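The sample-complexity bound above can be evaluated for concrete Bernoulli pairs (a sketch with hypothetical parameter values; close arms require many samples):

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def min_samples(p, q, alpha):
    """Lower bound n >= (log(1/alpha) - log(2)) / min{D(P||Q), D(Q||P)}
    on the samples needed to distinguish Ber(p) from Ber(q) at risk alpha."""
    d = min(kl_bernoulli(p, q), kl_bernoulli(q, p))
    return (math.log(1 / alpha) - math.log(2)) / d

n = min_samples(0.5, 0.6, 0.01)  # roughly 195 samples for this close pair
```

Distinguishing Ber(0.1) from Ber(0.9) at the same risk needs only a handful of samples, since the KL divergence is then much larger.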
Regret lower bounds: general technique
- Decision x, two parameters θ, λ, with x*(λ) = x ≠ x*(θ)
- Consider an algorithm with R^π(T) = O(log(T)) for all parameters (uniformly good):

E_θ[t_x(T)] = O(log(T)), E_λ[t_x(T)] = T − O(log(T))

- Markov's inequality:

P_θ[t_x(T) ≥ T/2] + P_λ[t_x(T) < T/2] ≤ O(T^{−1} log(T))

- 1{t_x(T) ≤ T/2} is a hypothesis test with risk O(T^{−1} log(T))
- Hence (Neyman-Pearson / Tsybakov):

Σ_x E_θ[t_x(T)] I(θ_x, λ_x) ≥ log(T) − O(log(log(T))),

where the left-hand side is the KL divergence of the observations.
Concentration inequalities: Chernoff bounds
- Building indexes requires tight concentration inequalities
- Chernoff bounds: upper bound the moment generating function (MGF)
- X = (X_1, ..., X_n) independent with mean µ, S_n = Σ_{n'=1}^n X_{n'}
- G such that log(E[e^{λ(X_n − µ)}]) ≤ G(λ) for λ ≥ 0
- Generic technique:

P[S_n − nµ ≥ δ] = P[e^{λ(S_n − nµ)} ≥ e^{λδ}]
              ≤ e^{−λδ} E[e^{λ(S_n − nµ)}]   (Markov)
              ≤ exp(nG(λ) − λδ)              (independence)
              ≤ exp(−n max_{λ≥0} {λδ/n − G(λ)})
Concentration inequalities: Chernoff and Hoeffding's inequality
- Bounded variables: if X_n ∈ [a,b] a.s., then E[e^{λ(X_n − µ)}] ≤ e^{λ^2 (b − a)^2 / 8} (Hoeffding's lemma)
- Hoeffding's inequality:

P[S_n − nµ ≥ δ] ≤ exp(−2δ^2 / (n(b − a)^2))

- Subgaussian variables: E[e^{λ(X_n − µ)}] ≤ e^{σ^2 λ^2 / 2}, similar bounds
- Bernoulli variables: E[e^{λ(X_n − µ)}] = µe^{λ(1 − µ)} + (1 − µ)e^{−λµ}
- Chernoff's inequality:

P[S_n − nµ ≥ δ] ≤ exp(−n I(µ + δ/n, µ))

- By Pinsker's inequality, Chernoff is stronger than Hoeffding.
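A quick numerical comparison of the two tail bounds illustrates the last point (a sketch with hypothetical values; I denotes the Bernoulli KL divergence):

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Tail bound on P[S_n - n*mu >= delta] for n Bernoulli(mu) samples
n, mu, delta = 100, 0.1, 10.0
hoeffding = math.exp(-2 * delta ** 2 / n)                # (b - a) = 1
chernoff = math.exp(-n * kl_bernoulli(mu + delta / n, mu))
assert chernoff < hoeffding  # Chernoff is tighter, as Pinsker's inequality predicts
```

Here Hoeffding gives e^{−2} ≈ 0.135 while Chernoff gives about 0.012: for means far from 1/2, the KL-based bound is an order of magnitude tighter.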
Concentration inequalities: variable sample size and peeling
- In bandit problems, the sample size is random and depends on the samples themselves
- Intervals N_k = {n_k, ..., n_{k+1}}, N = ∪_{k=1}^K N_k
- Idea: Z_n = e^{λ(S_n − nµ)} is a positive sub-martingale:

P[max_{n∈N_k} (S_n − µn) ≥ δ] = P[max_{n∈N_k} Z_n ≥ e^{λδ}]
  ≤ e^{−λδ} E[Z_{n_{k+1}}]   (Doob's inequality)
  ≤ exp(−λδ + n_{k+1} G(λ))
  ≤ exp(−n_{k+1} max_{λ≥0} {λδ n_{k+1}^{−1} − G(λ)})

- Peeling trick (Neveu): union bound over k, with n_k = (1 + α)^k
Concentration inequalities: self-normalized versions
- Self-normalized versions of the classical inequalities
- Garivier's inequality:

P[max_{1≤n≤T} n I(S_n/n, µ) ≥ δ] ≤ 2e ⌈log(T) δ⌉ e^{−δ}

- From Pinsker's inequality (self-normalized Hoeffding):

P[max_{1≤n≤T} √n |S_n/n − µ| ≥ δ] ≤ 4e ⌈log(T) δ^2⌉ e^{−2δ^2}

- Multi-dimensional version, with Y^k_{n_k} = n_k I(S_{n_k}/n_k, µ):

P[max_{(n_1,...,n_K)∈[1,T]^K} Σ_{k=1}^K Y^k_{n_k} ≥ δ] ≤ C_K (log(T) δ)^K e^{−δ}
Outline
Introduction and examples of application
Tools and techniques
Discrete bandits with independent arms
The Lai-Robbins bound
- Actions X = {1, ..., K}
- Rewards θ = (θ_1, ..., θ_K) ∈ [0,1]^K
- Uniformly good algorithm: R(T) = O(log(T)) for all θ

Theorem (Lai '85)
For any uniformly good algorithm, and any x s.t. θ_x < θ*:

lim inf_{T→∞} E[t_x(T)] / log(T) ≥ 1 / I(θ_x, θ*)

- For x ≠ x*, apply the generic technique with:

λ = (θ_1, ..., θ_{x−1}, θ* + ε, θ_{x+1}, ..., θ_K)
Optimistic algorithms
- Select the arm with the highest index: x_n ∈ arg max_x b_x(n)
- UCB algorithm (Hoeffding's inequality):

b_x(n) = θ̂_x(n) + √(2 log(n) / t_x(n)),

where θ̂_x(n) is the empirical mean and the square-root term is the exploration bonus.
- KL-UCB algorithm (self-normalized Chernoff inequality):

b_x(n) = max{q ≤ 1 : t_x(n) I(θ̂_x(n), q) ≤ f(n)},

with f(n) = log(n) + 3 log(log(n)); t_x(n) I(θ̂_x(n), q) is a likelihood ratio and f(n) plays the role of log(1/confidence level).
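Both indices can be sketched in a few lines of Python (a minimal version for Bernoulli rewards; the KL-UCB maximization is solved by bisection, which works because q ↦ I(θ̂, q) is increasing for q ≥ θ̂):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-12  # clamp away from 0 and 1 to guard against log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ucb_index(mean, t, n):
    """UCB index: empirical mean + exploration bonus sqrt(2 log(n) / t)."""
    return mean + math.sqrt(2 * math.log(n) / t)

def kl_ucb_index(mean, t, n):
    """KL-UCB index: largest q <= 1 with t * I(mean, q) <= f(n), by bisection."""
    f_n = math.log(n) + 3 * math.log(math.log(n))
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t * kl_bernoulli(mean, mid) <= f_n:
            lo = mid  # constraint still satisfied: the index is at least mid
        else:
            hi = mid
    return lo
```

The KL-UCB index is always at most the UCB one while still being a valid upper confidence bound, which is why it explores less and achieves the Lai-Robbins constant.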
Regret of optimistic algorithms

Theorem (Auer '02)
Under algorithm UCB, for all x s.t. θ_x < θ*:

E[t_x(T)] ≤ 8 log(T) / (θ_x − θ*)^2 + π^2/6

Theorem (Garivier '11)
Under algorithm KL-UCB, for all x s.t. θ_x < θ* and for all δ < θ* − θ_x:

E[t_x(T)] ≤ log(T) / I(θ_x + δ, θ*) + C log(log(T)) + δ^{−2}
Regret of KL-UCB: sketch of proof

Decompose:

E[t_x(T)] ≤ E[|A|] + E[|B|] + E[|C|], where
A = {n ≤ T : b_{x*}(n) ≤ θ*},
B = {n ≤ T : n ∉ A, x_n = x, |θ̂_x(n) − θ_x| ≥ δ},
C = {n ≤ T : n ∉ A, x_n = x, |θ̂_x(n) − θ_x| ≤ δ, t_x(n) ≤ f(T)/I(θ_x + δ, θ*)}.

Union bound:
E[|A|] ≤ C log(log(T))        (index property)
E[|B|] ≤ δ^{−2}               (Hoeffding + union bound)
|C| ≤ f(T)/I(θ_x + δ, θ*)     (counting)
Randomized algorithms: Thompson sampling
- Prior distribution θ ∼ P
- At time n, select x with probability

P[θ_x = max_{x'} θ_{x'} | x_0, r_0, ..., x_n, r_n]

Bernoulli rewards with uniform priors:
- Z_x(n) ∼ Beta(t_x(n) θ̂_x(n) + 1, t_x(n)(1 − θ̂_x(n)) + 1)
- x_{n+1} ∈ arg max_x Z_x(n)

Theorem (Kaufmann '12)
Thompson sampling is asymptotically optimal: if θ_x < θ*, then

lim sup_{T→∞} E[t_x(T)] / log(T) ≤ 1 / I(θ_x, θ*)
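The Bernoulli/Beta version above takes only a few lines (a minimal sketch; `random.betavariate` draws the Beta posterior samples, and the arm means 0.3 and 0.7 are illustrative):

```python
import random

def thompson_step(successes, failures, rng):
    """One round of Thompson sampling for Bernoulli arms with Beta(1,1) priors.
    successes[x] and failures[x] count the past outcomes of arm x."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda x: samples[x])

# Simulate: plays should concentrate on the best arm (arm 1 here)
rng = random.Random(0)
theta = [0.3, 0.7]
succ, fail = [0, 0], [0, 0]
for n in range(2000):
    x = thompson_step(succ, fail, rng)
    if rng.random() < theta[x]:
        succ[x] += 1
    else:
        fail[x] += 1
plays_best = succ[1] + fail[1]
```

The posterior Beta(t θ̂ + 1, t(1 − θ̂) + 1) concentrates around θ̂ as t grows, so exploration of the suboptimal arm dies out at the O(log(T)) rate of the theorem.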
Illustration of algorithms

[Figure: confidence intervals around the empirical means µ̂_1(n), µ̂_2(n), µ̂_3(n) at n = 10², under UCB (plays t(n) = (20, 29, 54)) and under KL-UCB (plays t(n) = (2, 5, 96)); and the regret R(T) of UCB, KL-UCB, and Thompson sampling as a function of time T.]
The EXP3 algorithm
- Adversarial setting: arm selection must be randomized
- At time n, select x_n from distribution p(n)
- Unbiased reward estimate:

R_x(n) = Σ_{n'=1}^n r̃_x(n'), with r̃_x(n') = (r_{n'} / p_x(n')) 1{x_{n'} = x}

- Action distribution: p_x(n) ∝ exp(η R_x(n)), with η > 0 fixed
- Favor actions with good historical rewards, and explore a bit: p(n) is a soft approximation of the max function
- For small η, EXP3 follows the replicator dynamics (!)
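The update above can be sketched as follows (a minimal Python version; the constant "adversary" with rewards 0.4 and 0.6 is a hypothetical example, and the max is subtracted before exponentiating for numerical stability):

```python
import math
import random

def exp3(rewards, eta, seed=0):
    """EXP3: exponential weights on importance-weighted reward estimates.
    rewards[n][x] in [0, 1] is the (adversarial) reward of arm x at time n."""
    rng = random.Random(seed)
    K = len(rewards[0])
    R = [0.0] * K  # cumulative reward estimates R_x(n)
    total = 0.0
    for r in rewards:
        m = max(R)
        w = [math.exp(eta * (Rx - m)) for Rx in R]
        s = sum(w)
        p = [wx / s for wx in w]          # p_x(n) proportional to exp(eta * R_x(n))
        x = rng.choices(range(K), weights=p)[0]
        total += r[x]
        R[x] += r[x] / p[x]               # unbiased: E[(r_x / p_x) 1{x_n = x}] = r_x
    return total

# Learning rate tuned as in the regret theorem: eta = sqrt(2 log(K) / (K T))
T, K = 1000, 2
eta = math.sqrt(2 * math.log(K) / (K * T))
rewards = [[0.4, 0.6] for _ in range(T)]  # a constant "adversary"
total = exp3(rewards, eta)
```

With this η, the cumulative reward stays within roughly √(2TK log(K)) ≈ 53 of the best fixed arm's total of 600.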
Regret of EXP3

Theorem
Under EXP3 with η = √(2 log(K) / (KT)), the regret is upper bounded by:

R^π(T) ≤ √(2TK log(K))

- Larger exponent in T than in the stochastic case, but smaller dependence on K
- Suggests two regimes for (K, T): a stochastic regime vs. an adversarial regime
- Matching lower bound: consider a stochastic adversary with close arms
Bibliography

Discrete bandits
- Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, 1933
- Robbins, Some aspects of the sequential design of experiments, 1952
- Lai and Robbins, Asymptotically efficient adaptive allocation rules, 1985
- Lai, Adaptive treatment allocation and the multi-armed bandit problem, 1987
- Gittins, Bandit Processes and Dynamic Allocation Indices, 1989
- Auer, Cesa-Bianchi and Fischer, Finite time analysis of the multiarmed bandit problem, 2002
- Garivier and Moulines, On upper-confidence bound policies for non-stationary bandit problems, 2008
- Slivkins and Upfal, Adapting to a changing environment: the brownian restless bandits, 2008
Bibliography

- Garivier and Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, 2011
- Honda and Takemura, An Asymptotically Optimal Bandit Algorithm for Bounded Support Models, 2010

Discrete bandits with correlated arms
- Anantharam, Varaiya, and Walrand, Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, 1987
- Graves and Lai, Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997
- György, Linder, Lugosi and Ottucsák, The on-line shortest path problem under partial monitoring, 2007
- Yu and Mannor, Unimodal bandits, 2011
- Cesa-Bianchi and Lugosi, Combinatorial bandits, 2012
Bibliography

- Gai, Krishnamachari, and Jain, Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations, 2012
- Chen, Wang and Yuan, Combinatorial multi-armed bandit: General framework and applications, 2013
- Combes and Proutiere, Unimodal bandits: Regret lower bounds and optimal algorithms, 2014
- Magureanu, Combes, and Proutiere, Lipschitz bandits: Regret lower bounds and optimal algorithms, 2014

Thompson Sampling
- Chapelle and Li, An Empirical Evaluation of Thompson Sampling, 2011
- Korda, Kaufmann and Munos, Thompson Sampling: an asymptotically optimal finite-time analysis, 2012
- Korda, Kaufmann and Munos, Thompson Sampling for one-dimensional exponential family bandits, 2013
Bibliography

- Agrawal and Goyal, Further optimal regret bounds for Thompson Sampling, 2013
- Agrawal and Goyal, Thompson Sampling for contextual bandits with linear payoffs, 2013

Discrete adversarial bandits
- Auer, Cesa-Bianchi, Freund and Schapire, The non-stochastic multi-armed bandit, 2002

Continuous Bandits (Lipschitz)
- R. Agrawal, The continuum-armed bandit problem, 1995
- Auer, Ortner, and Szepesvári, Improved rates for the stochastic continuum-armed bandit problem, 2007
- Bubeck, Munos, Stoltz, and Szepesvári, Online optimization in X-armed bandits, 2008
Bibliography

- Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, 2004
- Kleinberg, Slivkins, and Upfal, Multi-armed bandits in metric spaces, 2008
- Bubeck, Stoltz and Yu, Lipschitz bandits without the Lipschitz constant, 2011

Continuous Bandits (strongly convex)
- Cope, Regret and convergence bounds for a class of continuum-armed bandit problems, 2009
- Flaxman, Kalai, and McMahan, Online convex optimization in the bandit setting: gradient descent without a gradient, 2005
- Shamir, On the complexity of bandit and derivative-free stochastic convex optimization, 2013
- Agarwal, Foster, Hsu, Kakade, and Rakhlin, Stochastic convex optimization with bandit feedback, 2013
Bibliography

Continuous Bandits (linear)
- Dani, Hayes, and Kakade, Stochastic linear optimization under bandit feedback, 2008
- Rusmevichientong and Tsitsiklis, Linearly Parameterized Bandits, 2010
- Abbasi-Yadkori, Pal, and Szepesvári, Improved Algorithms for Linear Stochastic Bandits, 2011

Best Arm Identification
- Mannor and Tsitsiklis, The sample complexity of exploration in the multi-armed bandit problem, 2004
- Even-Dar, Mannor, and Mansour, Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems, 2006
- Audibert, Bubeck, and Munos, Best arm identification in multi-armed bandits, 2010
Bibliography

- Kalyanakrishnan, Tewari, Auer, and Stone, PAC subset selection in stochastic multi-armed bandits, 2012
- Kaufmann and Kalyanakrishnan, Information complexity in bandit subset selection, 2013
- Kaufmann, Garivier and Cappé, On the Complexity of A/B Testing, 2014

Infinite Bandits
- Berry, Chen, Zame, Heath, and Shepp, Bandit problems with infinitely many arms, 1997
- Wang, Audibert, and Munos, Algorithms for infinitely many-armed bandits, 2008
- Bonald and Proutiere, Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards, 2013