Regret of Narendra Shapiro Bandit Algorithms
S. Gadat
Toulouse School of Economics
Joint work with F. Panloup and S. Saadane.
Toulouse, April 5, 2015
I - Introduction
  I - 1 Motivations
  I - 2 Stochastic multi-armed bandit model
II - Narendra Shapiro algorithm
  II - 1 An historical algorithm (1969)
  II - 2 Improvement through penalization
IV Conclusion
I - 1 Motivations - Stochastic Bandit Games
Problem: you want to win as much as possible in a casino.
- You are in a casino and want to play the slot machines.
- Each machine can give you a potential gain, but these gains are not equivalent.
- You sequentially play one of the arms of the bandit machines.
How can we design a good policy to sequentially optimize the gain?
I - 1 Motivations - Dynamic Resource Allocation
Problem: optimizing a sequence of clinical trials.
Imagine you are a doctor:
- A sequence of patients visits you one after another for a given disease.
- You choose one treatment/drug among (say) 5 available ones.
- The treatments are not equivalent.
- You do not know which drug is best, but you observe the effect of the prescribed treatment on each patient.
- You expect to find the best drug despite some uncertainty on the effect of each treatment.
How can we design a good sequence of clinical trials?
I - 1 Motivations - Dynamic Resource Allocation
Problem: a “fast fashion” retailer.
Source: Farias & Madan, Operations Research, Vol. 59, No. 2, 2011.
Imagine you are a firm selling clothes:
- A population of customers visits you sequentially (one after another) each week/day.
- You observe weekly/daily sales and measure each item's popularity.
- You want to restock popular items and weed out unpopular ones on-line.
- You expect to maximize your profit while finding the best items.
How can we design a good sequence of fast-fashion operations?
I - 1 Motivations - Dynamic Resource Allocation
Other motivating examples
- Pricing a product with uncertain demand to maximize revenue.
- Trading: sequentially allocate a fraction of a fund to the most efficient trader.
- Recommender systems:
  - advertisement
  - website optimization
  - news, blog posts
- Computer experiments:
  - a code can be simulated in order to optimize a criterion;
  - the simulation depends on a set of parameters;
  - simulation is costly, so only a few choices of parameters are possible.
I - 2 Stochastic multi-armed bandit model
Environment:
- At your disposal: $d$ arms with unknown parameters $\theta_1, \ldots, \theta_d$.
- At any time $t$, for any choice $I_t \in \{1, \ldots, d\}$, you receive a reward $A_t^{I_t}$.
- For any arm $i$, rewards are i.i.d.: $(A_t^i)_{t \ge 0} \sim \nu_{\theta_i}$.
Reward distribution:
- In general, the reward distributions $\nu_\theta$ belong to a parametric family (Exponential, Poisson, ...).
- In this talk, we consider the simplest case of Bernoulli rewards $\nu_p = \mathcal{B}(p)$:
  - you obtain a gain of 1 with probability $p$,
  - and 0 otherwise (with probability $1 - p$).
- The probabilities of success $(p_1, \ldots, p_d)$ are unknown. Without loss of generality, we assume
$$p_1 > \max_{2 \le j \le d} p_j.$$
Admissible policy:
The agent's actions follow a dynamic strategy defined on-line:
$$I_t = \pi\bigl(A_{t-1}^{I_{t-1}}, \ldots, A_1^{I_1}\bigr).$$
Final goal: maximize (in expectation) the cumulative reward:
$$\mathbb{E}\left[\sum_{t=1}^{n} A_t^{I_t}\right].$$
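To fix ideas, here is a minimal Python sketch of this environment and interaction loop. Everything in it (the arm probabilities `p`, the helpers `pull` and `play`, the uniform example policy) is an illustrative assumption, not something from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli arms; p[0] plays the role of p_1 > max_j p_j.
p = np.array([0.7, 0.5, 0.4])

def pull(arm):
    """Draw a reward A_t^i ~ B(p_i): 1 with probability p_i, else 0."""
    return rng.binomial(1, p[arm])

def play(policy, n):
    """Run an on-line policy for n rounds and return the cumulative reward.

    `policy(history)` maps the past (arm, reward) pairs to the next arm I_t,
    mirroring I_t = pi(A_{t-1}^{I_{t-1}}, ..., A_1^{I_1})."""
    history, gain = [], 0
    for _ in range(n):
        arm = policy(history)
        reward = pull(arm)
        history.append((arm, reward))
        gain += reward
    return gain

# Example: the (poor) policy that picks an arm uniformly at random.
print(play(lambda history: rng.integers(len(p)), n=1000))
```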
I - 2 Regret of Stochastic multi-armed bandit algorithms
The quality of an algorithm $(I_t)_{t \ge 1}$ is measured by the expected regret $R_n$, which we seek to minimize:
$$\mathbb{E}R_n = \mathbb{E}\max_{1 \le j \le d} \sum_{t=1}^{n} A_t^j \;-\; \mathbb{E}\sum_{t=1}^{n} A_t^{I_t} \;=\; \mathbb{E}\max_{1 \le j \le d} \sum_{t=1}^{n} \bigl(A_t^j - A_t^{I_t}\bigr)$$
The expectation of the maximum makes the regret difficult to handle, but...

Proposition (Pseudo-regret). If we define $\bar{R}_n := \max_{1 \le j \le d} \mathbb{E}\bigl[\sum_{t=1}^{n} (A_t^j - A_t^{I_t})\bigr]$, one has
$$\mathbb{E}R_n \le \bar{R}_n + \sqrt{\frac{n \log d}{2}}.$$
This upper bound is useful because of the following lower bound.
Proposition (Lower bound; Auer, Cesa-Bianchi, Freund, Schapire 2002). Uniformly over all policies $\pi$ and all Bernoulli reward distributions:
$$\min_{\pi} \; \max_{\sup_{2 \le j \le d} p_j \,<\, p_1} \; \mathbb{E}R_n \;\ge\; \frac{\sqrt{nd}}{20}.$$
Conclusion: upper bounds on $\bar{R}_n$ of order $\sqrt{nd}$ are competitive (optimal).
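Continuing the earlier sketch (reusing `p`, `play`, and `rng` from it), the pseudo-regret of a policy can be estimated by Monte Carlo: for Bernoulli arms $\mathbb{E}\sum_{t=1}^{n} A_t^j = n p_j$, so $\bar{R}_n = n p_1 - \mathbb{E}\sum_{t=1}^{n} A_t^{I_t}$. The function name and the number of runs are arbitrary choices.

```python
import numpy as np

def pseudo_regret(policy, n, runs=200):
    """Monte Carlo estimate of bar(R)_n = n*p_1 - E[sum_t A_t^{I_t}]
    (valid for Bernoulli arms, where E[sum_t A_t^j] = n*p_j)."""
    gains = [play(policy, n) for _ in range(runs)]
    return n * p.max() - np.mean(gains)

# The uniform policy suffers a linear pseudo-regret, far above sqrt(n*d).
print(pseudo_regret(lambda history: rng.integers(len(p)), n=1000))
```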
I - 3 Roadmap
In this talk, we will :
- Briefly describe a standard, old-fashioned method:
$$X_{t+1} = X_t + \gamma_{t+1} h(X_t) + \gamma_{t+1} \Delta M_{t+1}$$
- Introduce a new one, whose regret will be studied in depth from a non-asymptotic point of view:
$$\forall n \in \mathbb{N}^*, \quad \bar{R}_n \le C\sqrt{n}$$
- Provide an asymptotic limit of this penalized bandit, up to a suitable scaling:
$$\beta_n (X_n - \delta_1) \xrightarrow[n \to +\infty]{w^*} \mu$$
- Describe the ergodic properties of the rescaled process (a Piecewise Deterministic Markov Process).
II - 1 An historical algorithm (1969)
The so-called Narendra-Shapiro bandit algorithm (NS bandit for short) defines a probability vector of the simplex $\mathcal{S}_d$:
$$X_t = (X_t^1, \ldots, X_t^d), \qquad \sum_{j=1}^{d} X_t^j = 1.$$
Idea: use $X_t$ to sample one arm at step $t$, and update this probability according to the reward obtained at time $t$.
- In the two-armed situation with $p_2 < p_1$, denote $X_t = (x_t, 1 - x_t)$:
$$x_{t+1} = x_t + \begin{cases} \gamma_{t+1}(1 - x_t) & \text{if arm 1 is selected and wins} \\ -\gamma_{t+1}\, x_t & \text{if arm 2 is selected and wins} \\ 0 & \text{otherwise} \end{cases}$$
- Multi-armed situation: $I_t$ is the arm sampled at time $t$ and $A_t^{I_t}$ the obtained reward. Update:
$$\forall j \in \{1, \ldots, d\}, \quad X_t^j = X_{t-1}^j + \gamma_t \bigl[\mathbb{1}_{\{I_t = j\}} - X_{t-1}^j\bigr] A_t^{I_t}$$
- To sum up:
  - If you win: reinforce the probability of sampling $I_t$ and decrease the probabilities of sampling the other arms $(X_t^j)_{j \ne I_t}$ accordingly.
  - If you lose ($A_t^{I_t} = 0$): do nothing.
- Common step size: $\gamma_t = (1 + t/C)^{-\alpha}$, $\alpha \in (0, 1)$, with $C$ large enough (see the code sketch below).
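A minimal sketch of the multi-armed NS recursion with this step size; the Bernoulli parameters and the defaults for `alpha`, `C`, and the horizon are illustrative assumptions.

```python
import numpy as np

def ns_bandit(p, n, alpha=0.5, C=10.0, seed=0):
    """Narendra-Shapiro recursion:
    X_t^j = X_{t-1}^j + gamma_t * (1{I_t=j} - X_{t-1}^j) * A_t^{I_t}."""
    rng = np.random.default_rng(seed)
    d = len(p)
    X = np.full(d, 1.0 / d)                  # uniform starting probabilities
    for t in range(1, n + 1):
        gamma = (1.0 + t / C) ** (-alpha)    # common step size
        arm = rng.choice(d, p=X)             # sample I_t ~ X_{t-1}
        reward = rng.binomial(1, p[arm])     # Bernoulli reward A_t^{I_t}
        onehot = np.zeros(d)
        onehot[arm] = 1.0
        X += gamma * (onehot - X) * reward   # no move when the arm loses
    return X

# X stays in the simplex; over many seeds it sometimes locks onto a
# suboptimal arm, illustrating the fallibility discussed below.
print(ns_bandit(np.array([0.7, 0.5, 0.4]), n=10_000))
```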
II - 1 An historical algorithm (1969)
A few words about the NS bandit:
- a recursive stochastic algorithm;
- an anytime policy;
- involves nontrivial mathematical difficulties.
It can be written as mean drift + martingale increment:
$$X_{t+1} = X_t + \gamma_{t+1} h(X_t) + \gamma_{t+1} \Delta M_{t+1}.$$
In the two-armed setting ($p_2 < p_1$ and $X_t = (x_t, 1 - x_t)$): $h(x) = (p_1 - p_2)\, x(1 - x)$.
O.D.E. approximation $\dot{x} = h(x)$: stable equilibrium at $\{1\}$ (the target) and a local trap at $\{0\}$.
But the conditional variance vanishes at 0 and 1, which makes it impossible to use Duflo's argument on the escape from local traps.
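To see where this drift comes from (a one-line check added here, with $x_t$ the probability of sampling arm 1): conditioning on which arm is drawn and whether it wins in the update rule above,
$$\mathbb{E}\bigl[x_{t+1} - x_t \mid \mathcal{F}_t\bigr] = \gamma_{t+1}\bigl[x_t\, p_1\, (1 - x_t) + (1 - x_t)\, p_2\, (-x_t)\bigr] = \gamma_{t+1}\, (p_1 - p_2)\, x_t (1 - x_t) = \gamma_{t+1}\, h(x_t).$$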
Indeed, for any sequence $\gamma_t = \bigl(\frac{C}{t + C}\bigr)^{\alpha}$ with $\alpha \in (0, 1)$, the algorithm is fallible:
$$P\Bigl(\lim_{t \to \infty} x_t = 0\Bigr) > 0 \;\Longrightarrow\; \mathbb{E}R_n \gtrsim C n \gg \sqrt{n}$$
II - 2 Improvement through penalization
- What's wrong with the NS bandit? Gittins, JRSS(B)'79:
Good regret properties only occur with an exploration/exploitation trade-off...
- The NS bandit is a pure exploitation method: there is no exploration term to exit local traps.
- Main idea: introduce a penalty term [Pagès & Lamberton, EJP'09].
- In the two-armed setting ($p_2 < p_1$ and $X_t = (x_t, 1 - x_t)$):
$$x_{t+1} = x_t + \begin{cases} +\gamma_{t+1}(1 - x_t) & \text{if arm 1 is selected and wins} \\ -\gamma_{t+1}\, x_t & \text{if arm 2 is selected and wins} \\ -\rho_{t+1}\gamma_{t+1}\, x_t & \text{if arm 1 is selected and loses} \\ +\rho_{t+1}\gamma_{t+1}(1 - x_t) & \text{if arm 2 is selected and loses} \end{cases}$$
When an arm fails, the probability of sampling it is decreased.
LP'09: the penalized two-armed bandit is infallible (a.s. convergence to the right target) iff
$$\rho_t = \rho_1 t^{-\beta}, \quad \gamma_t = \gamma_1 t^{-\alpha}, \quad \text{with } 0 < \beta \le \alpha, \; \alpha + \beta \le 1.$$
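A minimal sketch of this penalized two-armed recursion; the reward probabilities and the exponents $\alpha = 0.6$, $\beta = 0.4$ (chosen to satisfy $0 < \beta \le \alpha$ and $\alpha + \beta \le 1$, with $\gamma_1 = \rho_1 = 1$) are illustrative assumptions.

```python
import numpy as np

def penalized_ns(n, p1=0.7, p2=0.5, alpha=0.6, beta=0.4, seed=0):
    """Penalized two-armed NS recursion (LP'09-style sketch).
    x is the probability of sampling arm 1; gamma_t = t^-alpha and
    rho_t = t^-beta with 0 < beta <= alpha, alpha + beta <= 1."""
    rng = np.random.default_rng(seed)
    x = 0.5
    for t in range(1, n + 1):
        gamma, rho = t ** (-alpha), t ** (-beta)
        arm1 = rng.random() < x                     # sample I_t
        win = rng.random() < (p1 if arm1 else p2)   # Bernoulli reward
        if arm1 and win:
            x += gamma * (1 - x)
        elif win:                                   # arm 2 wins
            x -= gamma * x
        elif arm1:                                  # arm 1 loses: penalize it
            x -= rho * gamma * x
        else:                                       # arm 2 loses: penalize it
            x += rho * gamma * (1 - x)
    return x

# Infallible regime: x should approach 1 (the best arm).
print(penalized_ns(100_000))
```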
II - 3 Over-penalized NS bandit
This additional penalty term is still not efficient from the minimax-regret point of view. As a last resort, increase the penalty effect to reinforce the escape from local traps:
$$x_{t+1} = x_t + \begin{cases} +\gamma_{t+1}(1 - x_t) - \rho_{t+1}\gamma_{t+1}\, x_t & \text{if arm 1 is selected and wins} \\ -\gamma_{t+1}\, x_t + \rho_{t+1}\gamma_{t+1}(1 - x_t) & \text{if arm 2 is selected and wins} \\ -\rho_{t+1}\gamma_{t+1}\, x_t & \text{if arm 1 is selected and loses} \\ +\rho_{t+1}\gamma_{t+1}(1 - x_t) & \text{if arm 2 is selected and loses} \end{cases}$$
Whatever happens to the selected arm, it is penalized (this forces the escape from local traps).
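The over-penalized variant changes only the two winning branches of the previous sketch: the selected arm now pays the penalty $\rho_{t+1}\gamma_{t+1}$ in every case. Same illustrative parameters as before.

```python
import numpy as np

def over_penalized_ns(n, p1=0.7, p2=0.5, alpha=0.6, beta=0.4, seed=0):
    """Over-penalized two-armed NS recursion (sketch): the selected arm
    always pays the rho_t * gamma_t penalty, whether it wins or loses."""
    rng = np.random.default_rng(seed)
    x = 0.5
    for t in range(1, n + 1):
        gamma, rho = t ** (-alpha), t ** (-beta)
        arm1 = rng.random() < x
        win = rng.random() < (p1 if arm1 else p2)
        move = 0.0
        if win:                                    # usual NS reinforcement
            move = gamma * (1 - x) if arm1 else -gamma * x
        pen = -rho * gamma * x if arm1 else rho * gamma * (1 - x)
        x += move + pen                            # penalty applies in all cases
    return x

# Same target x -> 1, but the systematic penalty forces exploration,
# matching the sqrt(n)-type regret announced in the roadmap.
print(over_penalized_ns(100_000))
```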
IV Conclusion
Important remarks:
- Importance of the statistical framework (classification / discriminant analysis).
- The margin assumption yields an acceleration of the risk.
- The key point is to understand the neighborhood structure and the size of the small probabilistic balls around each observation.
- k-nearest neighbors must be used with care (descriptive variables, randomness). ECG data: the classification error drops from 25% to under 2% when using the invariants.
Mathematical extensions:
- No use of the regularity of $\eta$ (cf. Samworth '12). Weighting?
- Non-optimal result in the Gaussian case (minimax loss in $d_n$ instead of $n^{2/(2+d)}$).
- Non-optimal (but almost) rate in discriminant analysis, with a $\log(n)$ factor present.
- A better concentration inequality? An alternative approach to Poissonization via negatively associated (N.A.) variables (this is almost the case for $\eta_{n,k}$ as $n \to +\infty$).
- Input variables perturbed by an operator that is partially known/unknown from a mathematical-statistics viewpoint; semi-parametric or non-parametric aspects.
- Lower bound in the functional approach...
- Couple with a sparse PCA (reduction of the dimension and of the bias in the nearest neighbors).
- ...
Thank you for your attention!