Dynamic resource allocation: Bandit problems and extensions
Aurélien Garivier
Institut de Mathématiques de Toulouse
MAD Seminar, Université Toulouse 1, October 3rd, 2014
The Bandit Model
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
The Bandit Model
Dynamic resource allocation
Imagine you are a doctor:
patients visit you one after another for a given disease
you prescribe one of the (say) 5 treatments available
the treatments are not equally efficient
you do not know which one is the best; you observe the effect of the prescribed treatment on each patient
⇒ What do you do?
You must choose each prescription using only the previous observations
Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible
The Bandit Model
The (stochastic) Multi-Armed Bandit Model
Environment: K arms with parameters θ = (θ_1, ..., θ_K) such that for any possible choice of arm a_t ∈ {1, ..., K} at time t, one receives the reward

X_t = X_{a_t, t}

where, for any 1 ≤ a ≤ K and s ≥ 1, X_{a,s} ∼ ν_a, and the (X_{a,s})_{a,s} are independent.

Reward distributions: ν_a ∈ F_a, a parametric family or not. Examples: canonical exponential family, general bounded rewards.

Example: Bernoulli rewards, θ ∈ [0, 1]^K, ν_a = B(θ_a)

Strategy: the agent's actions follow a dynamical strategy π = (π_1, π_2, ...) such that

A_t = π_t(X_1, ..., X_{t−1})
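To make the protocol concrete, here is a minimal simulation sketch of this model with Bernoulli rewards (the class and function names are illustrative, not from the talk):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli environment: arm a yields X ~ B(theta_a)."""
    def __init__(self, theta, seed=None):
        self.theta = np.asarray(theta)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # X_{a,s} ~ B(theta_a), independent of everything else
        return float(self.rng.random() < self.theta[a])

def play(bandit, strategy, horizon):
    """Run a dynamical strategy pi: A_t may depend only on the past."""
    history = []                       # list of (A_t, X_t) pairs
    for _ in range(horizon):
        a = strategy(history)          # A_t = pi_t(X_1, ..., X_{t-1})
        history.append((a, bandit.pull(a)))
    return history
```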
The Bandit Model
Real challenges
Randomized clinical trials
original motivation since the 1930s
dynamic strategies can save resources
Recommender systems:
advertisement
website optimization
news, blog posts, . . .
Computer experiments
large systems can be simulated in order to optimize some criterion over a set of parameters
but the simulation cost may be high, so that only few parameter choices are possible
Games and planning (tree-structured options)
The Bandit Model
Performance Evaluation, Regret
Cumulated Reward: S_T = ∑_{t=1}^T X_t

Our goal: choose π so as to maximize

E[S_T] = ∑_{t=1}^T ∑_{a=1}^K E[ E[X_t 1{A_t = a} | X_1, ..., X_{t−1}] ] = ∑_{a=1}^K µ_a E[N_a^π(T)]

where N_a^π(T) = ∑_{t≤T} 1{A_t = a} is the number of draws of arm a up to time T, and µ_a = E(ν_a).

Regret: minimization is equivalent to minimizing

R_T = T µ* − E[S_T] = ∑_{a: µ_a < µ*} (µ* − µ_a) E[N_a^π(T)]

where µ* = max{µ_a : 1 ≤ a ≤ K}
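As a hedged sketch: in a simulation, where the true means µ are known, the pseudo-regret of one run follows directly from the draw counts N_a(T):

```python
import numpy as np

def pseudo_regret(history, mu):
    """R_T = sum_a (mu* - mu_a) N_a(T), computed from one run of `play`."""
    mu = np.asarray(mu, dtype=float)
    draws = np.bincount([a for a, _ in history], minlength=len(mu))
    return float(np.dot(mu.max() - mu, draws))
```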
Lower Bound for the Regret
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Lower Bound for the Regret
Asymptotically Optimal Strategies
A strategy π is said to be consistent if, for any (ν_a)_a ∈ F^K,

(1/T) E[S_T] → µ*

The strategy is efficient if for all θ ∈ [0, 1]^K and all α > 0,

R_T = o(T^α)

There are efficient strategies, and we consider the best achievable asymptotic performance among efficient strategies.
Lower Bound for the Regret
The Bound of Lai and Robbins
One-parameter reward distributions: ν_a = ν_{θ_a}, θ_a ∈ Θ ⊂ R.

Theorem [Lai and Robbins '85]
If π is an efficient strategy, then, for any θ ∈ Θ^K,

lim inf_{T→∞} R_T / log(T) ≥ ∑_{a: µ_a < µ*} (µ* − µ_a) / KL(ν_a, ν*)

where KL(ν, ν′) denotes the Kullback-Leibler divergence.

For example, in the Bernoulli case:

KL(B(p), B(q)) = d_ber(p, q) = p log(p/q) + (1 − p) log((1−p)/(1−q))
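As an illustration (a sketch, with names of my own choosing): the Bernoulli divergence d_ber, and the resulting Lai & Robbins constant, i.e. the coefficient of log(T) in the lower bound:

```python
import numpy as np

def d_ber(p, q):
    """Bernoulli Kullback-Leibler divergence KL(B(p), B(q))."""
    eps = 1e-12                               # stay away from the boundary
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(mu):
    """Coefficient of log(T) in the Lai & Robbins lower bound."""
    mu = np.asarray(mu, dtype=float)
    mu_star = mu.max()
    return sum((mu_star - m) / d_ber(m, mu_star) for m in mu if m < mu_star)

# For the two-arm scenario theta = (0.9, 0.8) used later in the talk:
# lai_robbins_constant([0.9, 0.8]) = 0.1 / d_ber(0.8, 0.9) ~ 2.25
```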
Lower Bound for the Regret
The Bound of Burnetas and Katehakis
More general reward distributions: ν_a ∈ F_a

Theorem [Burnetas and Katehakis '96]
If π is an efficient strategy, then, for any θ ∈ [0, 1]^K,

lim inf_{T→∞} R_T / log(T) ≥ ∑_{a: µ_a < µ*} (µ* − µ_a) / K_inf(ν_a, µ*)

where

K_inf(ν_a, µ*) = inf{ KL(ν_a, ν′) : ν′ ∈ F_a, E(ν′) ≥ µ* }

[Figure: simplex picture of K_inf(ν_a, µ*), showing ν_a, ν*, and the Dirac masses δ_0, δ_{1/2}, δ_1.]
Lower Bound for the Regret
Intuition
First assume that µ* is known and that T is fixed.

How many draws n_a of ν_a are necessary to know that µ_a < µ* with probability at least 1 − 1/T?

Test: H_0: µ_a = µ* against H_1: ν = ν_a

Stein's Lemma: if the type I error satisfies α_{n_a} ≤ 1/T, then

β_{n_a} ≳ exp(−n_a K_inf(ν_a, µ*))

⇒ it can be smaller than 1/T only if

n_a ≥ log(T) / K_inf(ν_a, µ*)

How to do as well without knowing µ* and T in advance? And not only asymptotically?
Optimistic Algorithms
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Optimistic Algorithms
Optimism in the Face of Uncertainty
Optimism is a heuristic principle popularized by [Lai&Robbins '85; Agrawal '95] which consists in letting the agent

play as if the environment were the most favorable among all environments that are sufficiently likely given the observations accumulated so far.

Surprisingly, this simple heuristic principle can be instantiated into algorithms that are robust, efficient and easy to implement in many scenarios pertaining to reinforcement learning.
Optimistic Algorithms
Upper Confidence Bound Strategies
UCB [Lai&Robbins '85; Agrawal '95; Auer&al '02]

Construct an upper confidence bound for the expected reward of each arm:

S_a(t)/N_a(t) + √( log(t) / (2 N_a(t)) )

(estimated reward + exploration bonus)

Choose the arm with the highest UCB.

It is an index strategy [Gittins '79].
Its behavior is easily interpretable and intuitively appealing.
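A minimal implementation sketch of this index policy for rewards in [0, 1] (the function signature is my own; the index matches the formula above):

```python
import numpy as np

def ucb(arms, horizon, seed=None):
    """Vanilla UCB: play each arm once, then the highest index."""
    rng = np.random.default_rng(seed)
    K = len(arms)
    N = np.zeros(K)                    # N_a(t): number of draws of arm a
    S = np.zeros(K)                    # S_a(t): sum of rewards of arm a
    for t in range(horizon):
        if t < K:
            a = t                      # initialization: one draw per arm
        else:
            index = S / N + np.sqrt(np.log(t) / (2 * N))
            a = int(np.argmax(index))
        S[a] += arms[a](rng)           # each arm is a callable returning a reward
        N[a] += 1
    return S.sum(), N

# Example with two Bernoulli arms:
# arms = [lambda r: float(r.random() < 0.9), lambda r: float(r.random() < 0.8)]
# total_reward, draws = ucb(arms, horizon=10_000)
```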
Optimistic Algorithms
UCB in Action
Optimistic Algorithms
UCB in Action
Optimistic Algorithms
Performance of UCB
For rewards in [0, 1], the regret of UCB is upper-bounded as

E[R_T] = O(log(T))

(finite-time regret bound) and

lim sup_{T→∞} E[R_T] / log(T) ≤ ∑_{a: µ_a < µ*} 1 / (2(µ* − µ_a))

Yet, in the case of Bernoulli variables, the right-hand side is greater than suggested by the bound of Lai & Robbins.

Many variants have been suggested to incorporate an estimate of the variance in the exploration bonus (e.g., [Audibert&al '09]).
An Optimistic Algorithm based on Kullback-Leibler Divergence
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
An Optimistic Algorithm based on Kullback-Leibler Divergence
The KL-UCB algorithm [Cappé, G. & al '13]

Parameters: an operator Π_F : M_1(S) → F; a non-decreasing function f : N → R
Initialization: pull each arm of {1, ..., K} once
for t = K to T − 1 do
  compute for each arm a the quantity
    U_a(t) = sup{ E(ν) : ν ∈ F and KL(Π_F(ν̂_a(t)), ν) ≤ f(t)/N_a(t) }
  pick an arm A_{t+1} ∈ arg max_{a ∈ {1,...,K}} U_a(t)
end for
Parametric setting: the kl-UCB Algorithm
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Parametric setting: the kl-UCB Algorithm
Exponential Family Rewards

Assume that F_a = F = a canonical exponential family, i.e., such that the pdf of the rewards is given by

p_{θ_a}(x) = exp( x θ_a − b(θ_a) + c(x) ), 1 ≤ a ≤ K,

for a parameter θ ∈ R^K, with expectation µ_a = b′(θ_a). The KL-UCB index is simply:

U_a(t) = sup{ µ ∈ I : d(µ̂_a(t), µ) ≤ f(t)/N_a(t) }

For instance,

for Bernoulli rewards: d_ber(p, q) = p log(p/q) + (1 − p) log((1−p)/(1−q))

for exponential rewards p_{θ_a}(x) = θ_a e^{−θ_a x}: d_exp(u, v) = u/v − 1 − log(u/v)

The analysis is generic and yields a non-asymptotic regret bound, optimal in the sense of Lai and Robbins.
Parametric setting: the kl-UCB Algorithm
Parametric version: the kl-UCB algorithm
Parameters: F parameterized by the expectation µ ∈ I ⊂ R with divergence d; a non-decreasing function f : N → R
Initialization: pull each arm of {1, ..., K} once
for t = K to T − 1 do
  compute for each arm a the quantity
    U_a(t) = sup{ µ ∈ I : d(µ̂_a(t), µ) ≤ f(t)/N_a(t) }
  pick an arm A_{t+1} ∈ arg max_{a ∈ {1,...,K}} U_a(t)
end for
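Since d(µ̂, ·) is increasing above the empirical mean (as in the exponential-family case), U_a(t) can be computed by bisection. A sketch, reusing d_ber from the earlier snippet:

```python
import numpy as np

def klucb_index(mean, n, t, d=None, upper=1.0, tol=1e-6):
    """U = sup{ mu : d(mean, mu) <= f(t)/n }, with f(t) = log(t),
    found by bisection on [mean, upper]."""
    d = d or d_ber                     # d_ber as defined in the earlier sketch
    level = np.log(t) / n
    lo, hi = mean, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if d(mean, mid) <= level:      # mid still inside the confidence set
            lo = mid
        else:
            hi = mid
    return lo
```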
Parametric setting: the kl-UCB Algorithm
The kl Upper Confidence Bound in Picture
If Z_1, ..., Z_s iid∼ B(θ_0), x < θ_0, and if p̂_s = (Z_1 + ... + Z_s)/s, then by Chernoff's inequality

P_{θ_0}(p̂_s ≤ x) ≤ exp(−s d_ber(x, θ_0)).

[Figure: the function d_ber(·, θ_0), the level −log(α)/s, and the points x, p̂_s, u_s.]

In other words, if α = exp(−s d_ber(x, θ_0)):

P_{θ_0}(p̂_s ≤ x) = P_{θ_0}( d_ber(p̂_s, θ_0) ≥ −log(α)/s, p̂_s < θ_0 ) ≤ α

⇒ Upper Confidence Bound for p at risk α:

u_s = sup{ θ > p̂_s : d_ber(p̂_s, θ) ≤ −log(α)/s }.
Parametric setting: the kl-UCB Algorithm
Key Tool: Deviation Inequality for Self-Normalized Sums
Problem: random number of summands
Solution: peeling trick (as in the proof of the LIL)
Theorem: For all ε > 1,

P( µ_a > µ̂_a(t) and N_a(t) d(µ̂_a(t), µ_a) ≥ ε ) ≤ e ⌈ε log(t)⌉ e^{−ε}.

Thus,

P( U_a(t) < µ_a ) ≤ e ⌈f(t) log(t)⌉ e^{−f(t)}
Parametric setting: the kl-UCB Algorithm
Regret bound
Theorem: Assume that all arms belong to a canonical, regular, exponential family F = {ν_θ : θ ∈ Θ} of probability distributions indexed by its natural parameter space Θ ⊆ R. Then, with the choice f(t) = log(t) + 3 log log(t) for t ≥ 3, the number of draws of any suboptimal arm a is upper bounded for any horizon T ≥ 3 as

E[N_a(T)] ≤ log(T)/d(µ_a, µ⋆)
  + 2 √( 2π σ²_{a,⋆} (d′(µ_a, µ⋆))² / (d(µ_a, µ⋆))³ ) √( log(T) + 3 log(log(T)) )
  + ( 4e + 3/d(µ_a, µ⋆) ) log(log(T))
  + 8 σ²_{a,⋆} ( d′(µ_a, µ⋆)/d(µ_a, µ⋆) )² + 6,

where σ²_{a,⋆} = max{ Var(ν_θ) : µ_a ≤ E(ν_θ) ≤ µ⋆ } and where d′(·, µ⋆) denotes the derivative of d(·, µ⋆).
Parametric setting: the kl-UCB Algorithm
Results: Two-Arm Scenario
[Figure: Performance of UCB, MOSS, UCB-Tuned, UCB-V, DMED and KL-UCB when θ = (0.9, 0.8). Left: average number of draws N_2(n) of the sub-optimal arm as a function of time n (log scale), together with the lower bound. Right: box-and-whiskers plot of the number of draws of the sub-optimal arm at time T = 5,000. Results based on 50,000 independent replications.]
Parametric setting: the kl-UCB Algorithm
Results: Ten-Arm Scenario with Low Rewards
[Figure: Average regret R_n as a function of time n (log scale) for UCB, MOSS, UCB-V, UCB-Tuned, DMED, KL-UCB, CP-UCB, DMED+ and KL-UCB+, when θ = (0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01). Red line: Lai & Robbins lower bound; thick line: average regret; shaded regions: central 99% region and upper 99.95% quantile.]
Non-parametric setting and Empirical Likelihood
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Non-parametric setting and Empirical Likelihood
Non-parametric setting
Rewards are only assumed to be bounded (say in [0, 1])
Need for an estimation procedure
with non-asymptotic guarantees
efficient in the sense of Stein / Bahadur
=⇒ Idea 1: use dber (Hoeffding)
=⇒ Idea 2: Empirical Likelihood [Owen ’01]
Bad idea: use Bernstein / Bennett
Non-parametric setting and Empirical Likelihood
First idea: use dber
Idea: rescale to [0, 1], and take the divergence d_ber
→ because Bernoulli distributions maximize deviations among bounded variables with a given expectation:

Lemma (Hoeffding '63)
Let X denote a random variable such that 0 ≤ X ≤ 1 and denote by µ = E[X] its mean. Then, for any λ ∈ R,

E[exp(λX)] ≤ 1 − µ + µ exp(λ).

This fact is well known for the variance, but it is also true for all exponential moments and thus for Cramér-type deviation bounds.
Non-parametric setting and Empirical Likelihood
Regret Bound for kl-UCB
Theorem
With the divergence d_ber, for all T > 3,

E[N_a(T)] ≤ log(T)/d_ber(µ_a, µ⋆)
  + √(2π) log( µ⋆(1−µ_a) / (µ_a(1−µ⋆)) ) / (d_ber(µ_a, µ⋆))^{3/2} √( log(T) + 3 log(log(T)) )
  + ( 4e + 3/d_ber(µ_a, µ⋆) ) log(log(T))
  + 2 ( log( µ⋆(1−µ_a) / (µ_a(1−µ⋆)) ) )² / (d_ber(µ_a, µ⋆))² + 6.

kl-UCB thus satisfies an improved logarithmic finite-time regret bound.
Besides, it is asymptotically optimal in the Bernoulli case.
Non-parametric setting and Empirical Likelihood
Comparison to UCB

KL-UCB addresses exactly the same problem as UCB, with the same generality, but it always has a smaller regret, as can be seen from Pinsker's inequality:

d_ber(µ_1, µ_2) ≥ 2(µ_1 − µ_2)²

[Figure: kl(0.7, q) versus 2(0.7 − q)² as functions of q.]
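A quick numerical check of this domination, reusing the d_ber sketch from above:

```python
import numpy as np

# Pinsker: d_ber(p, q) >= 2 (p - q)^2, so the kl confidence interval is
# always contained in the Hoeffding interval used by UCB.
qs = np.linspace(0.01, 0.99, 99)
assert np.all(d_ber(0.7, qs) >= 2 * (0.7 - qs) ** 2)
```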
Non-parametric setting and Empirical Likelihood
Idea 2: Empirical Likelihood
U(ν̂_n, ε) = sup{ E(ν′) : ν′ ∈ M_1(Supp(ν̂_n)) and KL(ν̂_n, ν′) ≤ ε }

or, rather, modified Empirical Likelihood:

U(ν̂_n, ε) = sup{ E(ν′) : ν′ ∈ M_1(Supp(ν̂_n) ∪ {1}) and KL(ν̂_n, ν′) ≤ ε }

[Figure: empirical mean µ̂_n and EL upper confidence bound U_n.]
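A sketch of the modified EL bound as a small convex program, solved here with a generic optimizer from scipy rather than a dedicated routine (all names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def el_upper_bound(samples, eps):
    """sup{ E(nu') : nu' on Supp(nu_n) U {1}, KL(nu_n, nu') <= eps }."""
    support, counts = np.unique(np.asarray(samples, dtype=float),
                                return_counts=True)
    if support[-1] != 1.0:                 # augment the support with {1}
        support = np.append(support, 1.0)
        counts = np.append(counts, 0)
    p = counts / counts.sum()              # empirical weights nu_n

    mask = p > 0                           # terms with p_j = 0 vanish in the KL
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},
            {"type": "ineq",               # eps - KL(p, q) >= 0
             "fun": lambda q: eps - np.sum(p[mask] * np.log(p[mask] / q[mask]))}]
    q0 = np.full(len(support), 1.0 / len(support))
    res = minimize(lambda q: -np.dot(q, support), q0,
                   bounds=[(1e-12, 1.0)] * len(support),
                   constraints=cons, method="SLSQP")
    return float(-res.fun)

# e.g. el_upper_bound(np.random.rand(50), eps=np.log(50) / 50)
```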
Non-parametric setting and Empirical Likelihood
Coverage properties of the modified EL confidence bound
Proposition: Let ν_0 ∈ M_1([0, 1]) with E(ν_0) ∈ (0, 1), and let X_1, ..., X_n be independent random variables with common distribution ν_0, not necessarily with finite support. Then, for all ε > 0,

P{ U(ν̂_n, ε) ≤ E(ν_0) } ≤ P{ K_inf(ν̂_n, E(ν_0)) ≥ ε } ≤ e(n + 2) exp(−nε).

Remark: For {0, 1}-valued observations, it is readily seen that U(ν̂_n, ε) boils down to the upper-confidence bound above.
⇒ This proposition is at least not always optimal: the presence of the factor n in front of the exponential term exp(−nε) is questionable.
Non-parametric setting and Empirical Likelihood
Regret bound
Theorem: Assume that F is the set of finitely supported probability distributions over S = [0, 1], that µ_a > 0 for all arms a, and that µ⋆ < 1. There exists a constant M(ν_a, µ⋆) > 0, depending only on ν_a and µ⋆, such that, with the choice f(t) = log(t) + log(log(t)) for t ≥ 2, for all T ≥ 3:

E[N_a(T)] ≤ log(T)/K_inf(ν_a, µ⋆)
  + 36/(µ⋆)⁴ (log(T))^{4/5} log(log(T))
  + ( 72/(µ⋆)⁴ + 2µ⋆/((1 − µ⋆) K_inf(ν_a, µ⋆)²) ) (log(T))^{4/5}
  + (1 − µ⋆)² M(ν_a, µ⋆)/(2(µ⋆)²) (log(T))^{2/5}
  + log(log(T))/K_inf(ν_a, µ⋆)
  + 2µ⋆/((1 − µ⋆) K_inf(ν_a, µ⋆)²) + 4.
Non-parametric setting and Empirical Likelihood
Example: truncated Poisson rewards
Each arm 1 ≤ a ≤ 6 is associated with ν_a, a Poisson distribution with expectation (2 + a)/4, truncated at 10.
N = 10,000 Monte-Carlo replications over a horizon of T = 20,000 steps.

[Figure: Regret of kl-Poisson-UCB, KL-UCB, UCB-V, kl-UCB and UCB as a function of time (log scale) in the truncated Poisson scenario.]
Non-parametric setting and Empirical Likelihood
Example: truncated Exponential rewards

exponential rewards with respective parameters 1/5, 1/4, 1/3, 1/2 and 1, truncated at x_max = 10;
kl-UCB uses the divergence d(x, y) = x/y − 1 − log(x/y) prescribed for genuine exponential distributions, but it ignores the fact that the rewards are truncated.

[Figure: Regret of kl-exp-UCB, KL-UCB, UCB-V, kl-UCB and UCB as a function of time (log scale) in the truncated exponential scenario.]
Non-parametric setting and Empirical Likelihood
Take-home message on bandit algorithms
1 Use kl-UCB rather than UCB-1 or UCB-2
2 Use KL-UCB if speed is not a problem
3 To do: improve on the deviation bounds, address general non-parametric families of distributions
4 Alternative: Bayesian-flavored methods:
Bayes-UCB [Kaufmann, Cappé, G.]
Thompson sampling [Kaufmann & al.]
Extensions
Roadmap
1 The Bandit Model
2 Lower Bound for the Regret
3 Optimistic Algorithms
4 An Optimistic Algorithm based on Kullback-Leibler Divergence
5 Parametric setting: the kl-UCB Algorithm
6 Non-parametric setting and Empirical Likelihood
7 Extensions
Extensions
Non-stationary Bandits [G. Moulines ’11]
Changepoint: reward distributions change abruptly
Goal: follow the best arm
Application: scanning tunnelling microscope
Variants D-UCB and SW-UCB including a progressive discount of the past
Bounds O(√n log n) are proved, which is (almost) optimal
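A sketch of the discounted index used by D-UCB (the function and parameter names are mine; the factor 2B√(ξ·) follows the usual statement of the algorithm, under the assumption that every arm has already been played):

```python
import numpy as np

def ducb_indices(arms_played, rewards, K, gamma=0.99, B=1.0, xi=0.6):
    """Discounted-UCB indices: rewards are weighted by gamma**age, so
    observations from before a changepoint fade away geometrically."""
    t = len(rewards)
    w = gamma ** np.arange(t - 1, -1, -1)   # weight gamma**(t-1-s) for round s
    N = np.zeros(K)                          # discounted draw counts
    S = np.zeros(K)                          # discounted reward sums
    for s, (a, x) in enumerate(zip(arms_played, rewards)):
        N[a] += w[s]
        S[a] += w[s] * x
    # assumes every arm has been played at least once
    return S / N + 2 * B * np.sqrt(xi * np.log(N.sum()) / N)
```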
Extensions
(Generalized) Linear Bandits [Filippi, Cappe, G. &Szepesvari ’10]
Bandit with contextual information:

E[X_t | A_t] = µ(m_{A_t}ᵀ θ*)

where θ* ∈ R^d is an unknown parameter and µ : R → R is a link function.

Example: binary rewards,

µ(x) = exp(x) / (1 + exp(x))

Application: targeted web ads

GLM-UCB: regret bound depending on the dimension d and not on the number of arms
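A hedged sketch of the GLM-UCB index in the logistic-link case (theta_hat and the exploration factor rho would come from an online maximum-likelihood estimate and from the analysis, both omitted here; names are mine):

```python
import numpy as np

def glm_ucb_indices(M, theta_hat, V, rho):
    """Index of each arm: mu(m_a' theta_hat) + rho * ||m_a||_{V^{-1}},
    where M is the (K, d) matrix of arm features and V = sum_s m_{A_s} m_{A_s}'.
    The bonus depends on the dimension d through V, not on the number of arms."""
    mu = lambda x: 1.0 / (1.0 + np.exp(-x))          # logistic link
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("kd,de,ke->k", M, V_inv, M))
    return mu(M @ theta_hat) + rho * widths
```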
Extensions
Stochastic Optimization
Goal: find the maximum of a function f : C ⊂ R^d → R, (possibly) observed in noise
Application: DAS

[Figure: a sample function with its confidence band.]

Model: f is the realization of a Gaussian Process (or has a small norm in some RKHS)
GP-UCB: evaluate f at the point x ∈ C where the confidence interval for f(x) has the highest upper bound
Extensions
Recent Advances in Continuous bandits
Unimodal bandits without smoothness: trisection algorithms, and better [Combes, Proutière '14]. Application to internet network traffic optimization.
Reserve Price Optimization in Second-price Auctions [Cesa-Bianchi, Gentile, Mansour '13]. Application to advertisement systems.
Extensions
Markov Decision Processes (MDP) [Filippi, Cappe & G.’10]
The system is in state S_t, which evolves as a Markov Chain:

S_{t+1} ∼ P(·; S_t, A_t) and R_t = r(S_t, A_t) + ε_t

Optimistic algorithm: search for the best transition matrix in a neighborhood of the ML estimate
The use of Kullback-Leibler neighborhoods leads to better performance and has desirable properties.
Extensions
Optimal Exploration with Probabilistic Expert Advice
Search space: B ⊂ Ω, a discrete set
Probabilistic experts: P_a ∈ M_1(Ω) for a ∈ A
Requests: at time t, calling expert A_t yields a realization of X_t = X_{A_t,t}, independent with law P_{A_t}
Goal: find as many distinct elements of B as possible with few requests:

F_n = Card( B ∩ {X_1, ..., X_n} )

≠ bandit: finding the same element twice is of no use!
Oracle: selects the expert with the highest 'missing mass'

A*_{t+1} = arg max_{a∈A} P_a( B \ {X_1, ..., X_t} )
Extensions
The model
Subset A ⊂ X of important items
|X| ≫ 1, |A| ≪ |X|
Access to X only by probabilistic experts (P_i)_{1≤i≤K}: sequential independent draws
Goal: discover rapidly the elements of A
Extensions
Goal
At each time step t = 1, 2, ...:

pick an index I_t = π_t(I_1, Y_1, ..., I_{t−1}, Y_{t−1}) ∈ {1, ..., K} according to past observations
observe Y_t = X_{I_t, n_{I_t,t}} ∼ P_{I_t}, where n_{i,t} = ∑_{s≤t} 1{I_s = i}

Goal: design the strategy π = (π_t)_t so as to maximize the number of important items found after t requests:

F^π(t) = |A ∩ {Y_1, ..., Y_t}|

Assumption: non-intersecting supports,

A ∩ supp(P_i) ∩ supp(P_j) = ∅ for i ≠ j
Extensions
Is it a Bandit Problem ?
It looks like a bandit problem. . .
sequential choices among K options
want to maximize cumulative rewards
exploration vs exploitation dilemma
. . . but it is not a bandit problem !
rewards are not i.i.d.
destructive rewards: there is no point in observing the same important item twice
all strategies eventually equivalent
Extensions
The oracle strategy
Proposition: Under the non-intersecting support hypothesis, the greedy oracle strategy

I*_t ∈ arg max_{1≤i≤K} P_i( A \ {Y_1, ..., Y_t} )

is optimal: for every possible strategy π, E[F^π(t)] ≤ E[F*(t)].

Remark: the proposition is false if the supports may intersect.

⇒ estimate the "missing mass of important items"!
Extensions
Estimating the missing mass
Notation: X_t iid∼ P ∈ M_1(Ω), O_n(ω) = ∑_{t=1}^n 1{X_t = ω},
Z_n(ω) = 1{O_n(ω) = 0}, H_n(ω) = 1{O_n(ω) = 1}, H_n = ∑_{ω∈B} H_n(ω)

Problem: estimate the missing mass

R_n = ∑_{ω∈B} P(ω) Z_n(ω)

Good-Turing: the 'estimator' R̂_n = H_n/n satisfies E[R̂_n − R_n] ∈ [0, 1/n].

Concentration: by McDiarmid's inequality, with probability ≥ 1 − δ,

|R̂_n − E[R̂_n]| ≤ (2/n + p_max) √( n log(2/δ) / 2 )
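The Good-Turing estimator itself is essentially one line; a sketch (here `is_important` tests membership in B, which is assumed recognizable on sight):

```python
from collections import Counter

def good_turing_missing_mass(samples, is_important):
    """R_hat = H_n / n, where H_n = number of important items seen
    exactly once among the n samples."""
    counts = Counter(samples)
    h_n = sum(1 for x, m in counts.items() if m == 1 and is_important(x))
    return h_n / len(samples)
```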
Extensions
The Good-UCB algorithm [Bubeck, Ernst & G.]
Optimistic algorithm based on the Good-Turing estimator:

A_{t+1} = arg max_{a∈A} H_a(t)/N_a(t) + c √( log(t)/N_a(t) )

N_a(t) = number of draws of P_a up to time t
H_a(t) = number of elements of B seen exactly once thanks to P_a
c = tuning parameter
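A runnable sketch of Good-UCB along these lines (the expert interface and all names are mine; the default constant follows the theorem below):

```python
from collections import defaultdict
import numpy as np

def good_ucb(experts, horizon, is_important, c=(1 + np.sqrt(2)) * np.sqrt(3)):
    """`experts` is a list of K callables, each returning one independent
    draw from its distribution P_a; returns the important items found."""
    K = len(experts)
    counts = [defaultdict(int) for _ in range(K)]  # per-arm item multiplicities
    N = np.zeros(K)
    found = set()

    def pull(a):
        x = experts[a]()
        counts[a][x] += 1
        N[a] += 1
        if is_important(x):
            found.add(x)

    for a in range(K):                 # initialization: one call per expert
        pull(a)
    for t in range(K, horizon):
        # H_a(t): important items seen exactly once through expert a
        H = np.array([sum(1 for x, m in counts[a].items()
                          if m == 1 and is_important(x)) for a in range(K)])
        pull(int(np.argmax(H / N + c * np.sqrt(np.log(t) / N))))
    return found
```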
Extensions
Classical analysis
Theorem: For any t ≥ 1, under the non-intersecting support assumption, Good-UCB (with constant C = (1 + √2)√3) satisfies

E[F*(t) − F^UCB(t)] ≤ 17√(Kt log(t)) + 20√(Kt) + K + K log(t/K)

Remark: a usual-looking result for a bandit problem, but a not-so-simple analysis.
Extensions
Good-UCB in action
Extensions
The macroscopic limit
Restricted framework: P_i = U({1, ..., N}), N → ∞,
|A ∩ supp(P_i)|/N → q_i ∈ (0, 1), q = ∑_i q_i
Extensions
The Oracle behaviour
The limiting discovery process of the Oracle strategy is deterministic.

Proposition: For every λ ∈ (0, q_1) and every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

lim_{N→∞} T^N_*(λ_N)/N = ∑_i ( log(q_i/λ) )_+
Extensions
Oracle vs. uniform sampling
Oracle: the proportion of important items not found after Nt draws tends to

q − F*(t) = I(t) q̲_{I(t)} exp(−t/I(t)) ≤ K q̲_K exp(−t/K)

with q̲_K = (∏_{i=1}^K q_i)^{1/K} the geometric mean of the (q_i)_i.

Uniform: the proportion of important items not found after Nt draws tends to K q̄_K exp(−t/K), with q̄_K the arithmetic mean of the (q_i)_i.

⇒ Asymptotic ratio of efficiency:

ρ(q) = q̄_K / q̲_K = ( (1/K) ∑_{i=1}^K q_i ) / ( ∏_{i=1}^K q_i )^{1/K} ≥ 1,

larger if the (q_i)_i are unbalanced.
Extensions
Macroscopic optimality
Theorem: Take C = (1 + √2)√(c + 2) with c > 3/2 in the Good-UCB algorithm.

For every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

lim sup_{N→+∞} T^N_UCB(λ_N)/N ≤ ∑_i ( log(q_i/λ) )_+

The proportion of items found after Nt steps, F^GUCB, converges uniformly to F* as N goes to infinity.
Extensions
Simulation
Number of items found by Good-UCB (line), by the oracle (bold dashed), and by uniform sampling (light dotted) as a function of time, for sample sizes N = 128, N = 500, N = 1000 and N = 10000, in an environment with 7 experts.
Extensions
Bibliography
1 [Abbasi-Yadkori&al '11] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Online least squares estimation with self-normalized processes: an application to bandit problems. CoRR abs/1102.2670, 2011.
2 [Agrawal '95] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
3 [Audibert&al '09] J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
4 [Auer&al '02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
5 [Bubeck, Ernst&G. '11] S. Bubeck, D. Ernst, and A. Garivier. Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality. Journal of Machine Learning Research, 14:601-623, Feb. 2013.
6 [De La Pena&al '04] V.H. de la Peña, M.J. Klass, and T.L. Lai. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902-1933, 2004.
7 [Filippi, Cappe&Garivier '10] S. Filippi, O. Cappé, and A. Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In Allerton Conf. on Communication, Control, and Computing, Monticello, US, 2010.
8 [Filippi, Cappe, G.&Szepesvari '10] S. Filippi, O. Cappé, A. Garivier, and Cs. Szepesvári. Parametric bandits: the generalized linear case. In Neural Information Processing Systems (NIPS), 2010.
9 [G.&Cappe '11] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 24th Conf. on Learning Theory (COLT), Budapest, Hungary, 2011.
10 [Cappe,G.&al '13] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, accepted.
11 [G.&Leonardi '11] A. Garivier and F. Leonardi. Context tree selection: a unifying view. Stochastic Processes and their Applications, 121(11):2488-2506, Nov. 2011.
12 [G.&Moulines '11] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory (ALT), volume 6925 of Lecture Notes in Computer Science, 2011.
13 [Lai&Robbins '85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.