


Regulating Greed Over Time

STEFANO TRACÀ AND CYNTHIA RUDIN

Abstract

In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the number of visitors, or increases in visitors just before major holidays (e.g., Christmas). The current paradigm of multi-armed bandit analysis does not take these known patterns into account, which means that despite the firm theoretical foundation of these methods, they are fundamentally flawed when it comes to real applications. This work provides a remedy that takes the time-dependent patterns into account, and we show how this remedy is implemented in the UCB and ε-greedy methods. In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected methods, we present a set of bounds that provide insight into why we would want to exploit during periods of high reward, and discuss the impact on regret. Our proposed methods have excellent performance in experiments, and were inspired by a high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo! Front Page. That entry heavily used time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods.

Keywords: Multi-armed bandit, exploration-exploitation trade-off, time series, retail management, marketing, online applications, regret bounds.

1 Introduction

Consider the classic pricing problem faced by retailers, where the price of a product is chosen to maximize the expected profit. The optimal price is learned through a mix of exploring various pricing choices and exploiting those known to yield higher profits. Exploration-exploitation problems occur not only in retail (both in stores and online), but in marketing, on websites such as Yahoo! Front Page, where the goal is to choose which of a set of articles to display to the user, on other webpages where ads are shown to the user on a sidebar (e.g., YouTube, Slashdot), and even on software like Skype, which displays pop-up windows with targeted ads. In all of these applications, the classic assumption of random rewards with a static probability distribution is badly violated. For retailers, there are almost always clear trends in customer arrivals, and they are often predictable. For instance, Figure 1a shows clear weekly periodicity of customer arrivals, with a large dip on Christmas Eve and a huge peak on Boxing Day. Figure 1b shows clear half-day periodicity of the number of Skype users, Figure 1c shows daily trends in the number of Google queries, and Figure 1d shows trends in the browsing time of Taobao.com shoppers. These dramatic trends might have a substantial impact on which policy we would use to show ads and price products – if we knew how to handle them – yet these trends are completely ignored in classic multi-armed bandit analysis, where we assume each arm has a static fixed distribution of rewards.

The time series patterns within the application areas discussed above take many different forms, but in all cases, the same effect is present, where regulating greed over time could be beneficial. For retailers, the number of customers is much larger on certain days than others, and for these days we should exploit by choosing the best prices, and not explore. For Yahoo! Front Page, articles have a short lifespan and some articles are much better than others, in which case, if we find a particularly good article, we should exploit by repeatedly showing that one, and not explore new articles. (Incidentally, this insight was the key to good performance on the Exploration and Exploitation 3 Phase 1 dataset.) For online advertising, there are certain periods where customers are more likely to make a purchase, so prices should be set to optimal values during those times, and we should not be exploring suboptimal choices.

In this work, we consider a simple but effective way to model a time-dependent effect on the rewards, which is that the static rewards for all arms are multiplied by a known time series, the reward multiplier G(t). Even in this simple case, we can intuitively see how exploiting when the reward multiplier is high and exploring when it is low can lead to better performance. The reward multiplier approach is general, and can encode “micro-lock” periods (i.e., playing the same arm many times in one round) or “lock” periods (i.e., playing the same arm in a sequence of rounds) where the prices cannot change, but the purchase rate has dynamically changing trends. As a result of the reward multiplier function, theoretical regret bound analysis of the multi-armed bandit problem becomes more complicated, because now the distribution of rewards depends explicitly on time. We not only care how many times each suboptimal arm is played, but exactly when they are played. For instance, if suboptimal



Figure 1: (a) English users shopping online. Source: ispreview.co.uk. (b) Users on Skype. Source: voiceontheweb.biz. (c) Google users on tablet, mobile or laptop devices. Source: online-marketingdenver.net. (d) Taobao online shoppers. Source: chinainternetwatch.com.


arms are played only when the reward multiplier is low, intuitively it should not hurt the overall regret; this is one insight that the classic analysis cannot provide.

We provide a way to modify some of the popular bandit algorithms to take into account the reward multiplier, provide regret bounds in the classical style, and show that a logarithmic bound on the mean regret holds.

The ideas in this paper were inspired by a high-scoring entry in the Exploration and Exploitation 3 Phase 1 data mining competition, where the goal was to build a better recommendation system for Yahoo! Front Page news articles. At each time, several articles were available to choose from, and these articles would appear only for short time periods and would never be available again. One of the main ideas in this entry was simple yet effective: if any article gets more than 9 clicks out of the last 100 times we show it, keep displaying it until its clickthrough rate goes down. This alone increased the clickthrough rate by almost a quarter of a percent. The way we distilled the problem within this paper allows us to isolate and study this effect.

2 Related Work

The setup of this work differs from other works considering time-dependent multi-armed bandit problems – we do not assume the mean rewards of the arms exhibit random changes over time, and we assume that the reward multiplier is known in advance, in accordance with what we observe in real scenarios. No previous works that we know of consider regulating greed over time based on known reward trends. In contrast, Liu et al. [2013] consider a problem where each arm transitions in an unknown Markovian way to a different reward state when it is played, and evolves according to an unknown random process when it is not played. Garivier and Moulines [2008] presented an analysis of a discounted version of the UCB and a sliding window version of the UCB, where the distribution of rewards can have abrupt changes and stays stationary in between. Besbes et al. [2014] consider the case where the mean rewards for each arm can change, where the variation of that change is bounded. Slivkins and Upfal [2007] consider an extreme case where the rewards exhibit Brownian motion, leading to regret bounds that scale differently than typical bounds (linear in T rather than logarithmic). One of the works that is extremely relevant to ours is that of Chakrabarti et al. [2009], who consider “mortal bandits” that disappear, just like Yahoo! Front Page news articles or online advertisements. The setting of mortal bandits in the Yahoo! dataset inspired the framework proposed here, since in the mortal setting, we claim one would want to curb exploration when a high reward mortal bandit is available. An interesting setting is discussed by Komiyama et al. [2013], where there are lock-up periods when one is required to play the same arm several times in a row. In our scenario, the micro-lock-up periods occur at each step of the game, and their effective lengths are given by G(t). In some of the algorithms that we propose, when the reward multiplier is above a certain threshold, we create lock-up periods ourselves, where exploration is stopped, but the best arm is allowed to change during high-reward zones as we gather information over rounds. The methods proposed in this work are contextual [e.g., Li et al., 2010], in the sense that we consider externally available information in the form of time-dependent trends.

3 Algorithms

Formally, we have m arms, with reward X_j(t)G(t) provided at time t if arm j is played. The reward multiplier G(t) is assumed to be known in advance: this is (for instance) the periodic function of customer arrivals shown for Google, Skype, etc. The distribution from which the X_j(t) are drawn does not change over time. 1{I_t=j} is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0), and ∆_j = µ_* − µ_j is the difference between the means of the reward distributions of X_*(t) and X_j(t), where arm “*” is the best arm, with the highest mean reward. Let d be a constant less than 1 such that 0 < d < min_j ∆_j, let t̃ be the number of rounds where G(t) is under the threshold z up to round t, and let c be a constant greater than 10. The total number of turns played in one game is n (we assume n > m).
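To make the reward model concrete, the following is a minimal Python sketch (not from the paper; the number of arms, the mean rewards, and the specific multiplier are illustrative choices) of how one round generates the observed reward X_j(t)G(t):

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                                        # number of arms (illustrative)
mu = rng.uniform(0.2, 0.8, size=m)           # static mean rewards, unknown to the player
G = lambda t: 21 + 20 * np.sin(0.25 * t)     # known reward multiplier (the "Wave Greed" used later)

def play(j, t):
    """Play arm j at round t: a static draw X_j(t) scaled by the known multiplier G(t)."""
    x = rng.normal(mu[j], 0.05)              # X_j(t): i.i.d. over time for each arm
    return G(t) * x                          # observed (scaled) reward
```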

3.1 Regulating greed with a threshold in the ε-greedy algorithm

In Algorithm 1 we present a variation of the ε-greedy algorithm of Auer et al. [2002], in which a threshold z has been introduced in order to regulate greed. When the rewards are high (i.e., the G(t) multiplier is above the threshold z) the algorithm exploits the best arm found, that is, arm j with the highest mean estimate:

\[
\bar{X}_j = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_j(s), \tag{1}
\]

where T_j(t−1) is the number of times arm j has been played before round t. The following theorem provides a bound on the mean regret of this policy (the proof is given in Appendix A).


Algorithm 1: ε-greedy algorithm with threshold
Input: number of rounds n, number of arms m, threshold z, a constant c > 10, a constant d such that d < min_j ∆_j and 0 < d < 1, sequences {ε_t}_{t=1}^n = min{1, cm/(d² t̃)} and {G(t)}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    if G(t) < z then
        with probability ε_t play an arm uniformly at random (each arm has probability 1/m of being selected);
        otherwise (with probability 1 − ε_t) play arm j such that X̄_j ≥ X̄_i ∀i
    else
        play arm j such that X̄_j ≥ X̄_i ∀i
    end
    Get reward G(t)X_j;
    Update X̄_j;
end
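As an illustration, here is a minimal Python sketch of the logic of Algorithm 1 (not the authors' code); the helper pull(j, t), assumed to return the scaled reward G(t)·X_j(t), and the incremental-mean bookkeeping are our own choices:

```python
import numpy as np

def eps_greedy_threshold(n, m, z, c, d, G, pull, rng=None):
    """Sketch of Algorithm 1: epsilon-greedy with a hard threshold z on G(t).

    The known multiplier G(t) is divided out of each observed reward before
    updating the estimate, since equation (1) averages the unscaled X_j draws."""
    rng = rng or np.random.default_rng()
    x_bar = np.zeros(m)          # running means of the unscaled rewards
    counts = np.zeros(m)         # T_j(t-1)
    t_low = 0                    # number of rounds with G(t) < z so far (t-tilde)

    for j in range(m):                           # initialization: play each arm once
        x_bar[j], counts[j] = pull(j, j + 1) / G(j + 1), 1

    for t in range(m + 1, n + 1):
        if G(t) < z:
            t_low += 1
            eps = min(1.0, c * m / (d ** 2 * t_low))
            if rng.random() < eps:
                j = int(rng.integers(m))         # explore uniformly at random
            else:
                j = int(np.argmax(x_bar))        # exploit the current best estimate
        else:
            j = int(np.argmax(x_bar))            # high-reward round: pure exploitation
        reward = pull(j, t)
        counts[j] += 1
        x_bar[j] += (reward / G(t) - x_bar[j]) / counts[j]   # incremental mean update
    return x_bar, counts
```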

Theorem 3.1 (ε-greedy algorithm with hard threshold). The bound on the mean regret E[R_n] at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{2}
\]
\[
+ \sum_{t=m+1}^{n} G(t) \sum_{j:\mu_j<\mu_*} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta_j(t) \right) 1_{\{G(t)<z\}} \tag{3}
\]
\[
+ \sum_{t\in B} \ \sum_{j:\mu_j<\mu_*} \Delta_j \beta_j(t) \left[ \sum_{s=t}^{\tau_t} G(s) \right], \tag{4}
\]

where

\[
\beta_j(t) = \frac{c}{d^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c}{10 d^2}} \ln\!\left( \frac{d^2 \tilde{t}}{mce} \right) + \frac{4}{\Delta_j^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c \Delta_j^2}{4 d^2}}, \tag{5}
\]

B = {t : G(t−1) < z, G(t) ≥ z} is the set of rounds where the high-reward zone is entered, and τ_t is the last round of the high-reward zone entered at time t.

The sum in (2) is the exact mean regret during the initialization phase of Algorithm 1. In (3) we have a bound on the expected regret for turns that present low values of G(t), where the quantity in the parenthesis is the bound on the probability of playing arm j (β_j(t) is the bound on the probability that arm j is the best arm at round t, and 1/m is the probability of choosing arm j when the choice is made at random). Finally, in (4) we have a bound on the expected regret for turns that present high values of G(t), and in this case we consider only the probability β_j(t) that arm j is the best arm, since we do not explore at random during high reward periods. The usual ε-greedy algorithm is a special case when G(t) = 1 ∀t and z > 1. Notice that ε_t is a quantity θ(1/t), while β_j(t) is o(1/t), so that an asymptotic logarithmic bound in n holds for E[R_n] if t̃ grows at the same rate as t (because of the logarithmic bound on the harmonic series).

We want to compare this bound with the one of the usual version of the ε-greedy algorithm but, since the old version is not well suited for the setting in which the rewards are altered by the multiplier function, we discount the rewards obtained at each round (by simply dividing them by G(t)) so that it can also produce accurate estimates of the mean reward for each arm. This “smarter” version of the ε-greedy algorithm is presented in Algorithm 6. The bound on the probability of playing a suboptimal arm j for the usual ε-greedy algorithm is given by β_j(t) with t̃ = t, and we refer to it as β^{old}_j(t). In general, β^{old}_j(t) is lower than β_j(t) (since t̃ ≤ t). Intuitively, this reflects the fact that the new algorithm performs fewer exploration steps. Moreover, in the usual ε-greedy algorithm, the probability of choosing arm j at time t is given by

\[
P(\{I^{old}_t = j\}) = \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta^{old}_j(t),
\]

which is less than the probability of the new algorithm in case of low G(t),

\[
P(\{I^{new}_t = j\}) = \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta_j(t),
\]

but can easily be higher than the probability of the new algorithm in case of high rewards (which is given by only β_j(t)). In fact,

\[
P(\{I^{old}_t = j\}) - P(\{I^{new}_t = j\}) = \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta^{old}_j(t) - \beta_j(t)
= \frac{1}{m}\min\left\{1,\frac{cm}{d^2 t}\right\} - \beta^{old}_j(t)\min\left\{1,\frac{cm}{d^2 t}\right\} + \beta^{old}_j(t) - \beta_j(t);
\]

if t > cm/d² we get

\[
\frac{1}{m}\,\frac{cm}{d^2 t} + \beta^{old}_j(t)\left(1 - \frac{cm}{d^2 t}\right) - \beta_j(t), \tag{6}
\]

if t ≤ cm/d² we get

\[
\frac{1}{m} - \beta^{old}_j(t) + \beta^{old}_j(t) - \beta_j(t) = \frac{1}{m} - \beta_j(t), \tag{7}
\]

and for t large enough both expressions are positive, since β_j(t) is o(1/t) and we assume that t̃ is θ(t). Having (6) and (7) positive means that if we are in a high-rewards period the probability of choosing a suboptimal arm decreases faster in Algorithm 1. In that case, Algorithm 1 would have lower regret than the ε-greedy algorithm.

3.2 Soft-ε-greedy algorithm

We present now in Algorithm 2 a “soft version” of the ε-greedy algorithm where greed is regulated gradually (in contrast with the hard threshold of the previous section). Again, in high reward zones, exploitation will be preferred, while in low reward zones the algorithm will explore the arms more. Let us define the following function

\[
\psi(t) = \frac{\log\!\left(1 + \frac{1}{G(t)}\right)}{\log\!\left(1 + \frac{1}{\min_{s\in\{m+1,\dots,n\}} G(s)}\right)}, \tag{8}
\]

and let γ = min_{s∈{m+1,...,n}} ψ(s). Notice that 0 < ψ(t) ≤ 1 ∀t and that its values are close to 0 when G(t) is high, while they are close to 1 for low values of G(t). We generally assume that min_{s∈{m+1,...,n}} G(s) = 1, and that the usual case is recovered when G(t) = 1 for all t.

Algorithm 2: Soft-ε-greedy algorithm
Input: number of rounds n, number of arms m, a constant c > 10, a constant d such that d < min_j ∆_j and 0 < d < 1, sequences {ε_t}_{t=1}^n = min{ψ(t), cm/(d²t)} and {G(t)}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    with probability ε_t play an arm uniformly at random (each arm has probability 1/m of being selected);
    otherwise (with probability 1 − ε_t) play arm j such that X̄_j ≥ X̄_i ∀i
    Get reward G(t)X_j;
    Update X̄_j;
end
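A minimal Python sketch of Algorithm 2 follows (again not the authors' code; pull(j, t) is assumed to return the scaled reward G(t)·X_j(t)), showing how the exploration probability is capped by ψ(t):

```python
import numpy as np

def soft_eps_greedy(n, m, c, d, G, pull, rng=None):
    """Sketch of Algorithm 2: exploration probability shrinks smoothly as G(t) grows."""
    rng = rng or np.random.default_rng()
    g_min = min(G(s) for s in range(m + 1, n + 1))     # min_s G(s), used in psi

    def psi(t):
        # Equation (8): close to 1 when G(t) is low, close to 0 when G(t) is high.
        return np.log1p(1.0 / G(t)) / np.log1p(1.0 / g_min)

    x_bar, counts = np.zeros(m), np.zeros(m)
    for j in range(m):                                 # initialization: play each arm once
        x_bar[j], counts[j] = pull(j, j + 1) / G(j + 1), 1

    for t in range(m + 1, n + 1):
        eps = min(psi(t), c * m / (d ** 2 * t))
        j = int(rng.integers(m)) if rng.random() < eps else int(np.argmax(x_bar))
        reward = pull(j, t)
        counts[j] += 1
        x_bar[j] += (reward / G(t) - x_bar[j]) / counts[j]
    return x_bar, counts
```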

The following theorem (proved in Appendix B) shows that a logarithmic bound holds in this case too (because the ε_t are θ(1/t) and their sum over time is logarithmically bounded, while the β^S_j(t) term is o(1/t)).


Theorem 3.2 (Regret bound for the soft-ε-greedy algorithm). The bound on the mean regret E[R_n] at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{9}
\]
\[
+ \sum_{t=m+1}^{n} G(t) \sum_{j:\mu_j<\mu_*} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta^S_j(t) \right), \tag{10}
\]

where

\[
\beta^S_j(t) = \frac{c}{d^2} \left( \frac{d^2 \gamma t}{mce} \right)^{-\frac{c}{10 d^2}} \ln\!\left( \frac{d^2 \gamma t}{mce} \right) + \frac{4}{\Delta_j^2} \left( \frac{d^2 \gamma t}{mce} \right)^{-\frac{c \Delta_j^2}{4 d^2}}. \tag{11}
\]

The sum in (9) is the exact mean regret during the initialization of Algorithm 2. For the rounds after the initialization phase, the quantity in the parenthesis of (10) is the bound on the probability of playing arm j (where β^S_j(t) is the bound on the probability that arm j is the best arm at round t, and 1/m is the probability of choosing arm j when the choice is made at random).

As before, we want to compare this bound with the “smarter” version of the ε-greedy algorithm presented in Algorithm 6. In the usual ε-greedy algorithm, after the “critical time” n′ = cm/d², the probability P(X̄_{j,T_j(t−1)} ≥ X̄_{i,T_i(t−1)}) of arm j being the current best arm can be bounded by a quantity β^{old}_j(t) that is o(1/t) as t grows. Before time n′, the decay of P(X̄_{j,T_j(t−1)} ≥ X̄_{i,T_i(t−1)}) is faster and the bound is a quantity that is o(1/t^λ) ∀λ as t grows (see Remark 1 in Appendix A). The probability of choosing a suboptimal arm j changes as follows:

• if t < n′, P({I_t = j}) = 1/m;

• if t ≥ n′, P({I_t = j}) = c/(d²t) + (1 − cm/(d²t)) β^{old}_j(t), which is θ(1/t) as t grows.

In the soft-ε-greedy algorithm, before time w = min{s ∈ {1,...,n} : cm/(d²s) < γ}, we have that β^S_j(t), which is the bound on the probability P(X̄_{j,T_j(t−1)} ≥ X̄_{i,T_i(t−1)}) of arm j being the current best arm, is a quantity that is o(1/(γt)^λ) ∀λ as t grows (the argument is similar to Remark 1 in Appendix A). After w, it can be bounded by a quantity that is o(1/(γt)) as t grows. The probability of choosing a suboptimal arm j changes as follows:

• if t < n′, P({I_t = j}) = (1/m) ψ(t) + (1 − ψ(t)) β^S_j(t);

• if n′ ≤ t ≤ w, P({I_t = j}) = (1/m) min{ψ(t), cm/(d²t)} + (1 − min{ψ(t), cm/(d²t)}) β^S_j(t);

• if t > w, P({I_t = j}) = c/(d²t) + (1 − cm/(d²t)) β^S_j(t).

In order to interpret these quantities, let us see what happens for high or low values of the multiplier G(t) as t grows in Table 1. For brevity, we abuse notation when using Landau's symbols, because in some cases t is not allowed to go to infinity; it is convenient to still use the “little o” notation to compare the decay rates of the probabilities of choosing a suboptimal arm, which also gives a qualitative explanation of what happens when using the algorithms. For the soft-ε-greedy algorithm, the rate at which the probability of choosing a suboptimal arm decays is faster when G(t) is high, and worse when G(t) is low. Notice that the parameter γ slows down the decay with respect to the usual ε-greedy algorithm. This is a direct consequence of the slower exploration. An example of the typical behavior of ψ(t) and ε^{old}_t is shown in Figure 1, where G(t) = 20 + 19 sin(t/2); a small numerical illustration follows below.
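As a rough numerical companion to the figure, the short Python sketch below evaluates ψ(t) and the two exploration schedules for the same example multiplier; the constants m, c, d and the horizon are illustrative choices, not values from the paper:

```python
import numpy as np

m, c, d, n = 10, 11, 0.5, 500                   # illustrative constants (c > 10, 0 < d < 1)
G = lambda t: 20 + 19 * np.sin(t / 2)           # the example multiplier from the text
g_min = min(G(s) for s in range(m + 1, n + 1))
psi = lambda t: np.log1p(1 / G(t)) / np.log1p(1 / g_min)

for t in (m + 1, 50, 200, 500):
    eps_old = min(1.0, c * m / (d ** 2 * t))        # usual epsilon-greedy schedule
    eps_soft = min(psi(t), c * m / (d ** 2 * t))    # soft schedule, capped by psi(t)
    print(f"t={t:4d}  G={G(t):5.1f}  eps_old={eps_old:.3f}  eps_soft={eps_soft:.3f}")
```

With these (illustrative) constants, the soft schedule explores far less than the usual one during the high-G(t) portions of the wave, which is the qualitative behavior Table 1 summarizes.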


Table 1: Summary of the decay rates of the probabilities of choosing a suboptimal arm for the soft-ε-greedy algorithm and the usual ε-greedy algorithm (supposing it takes the time patterns into account). The decay depends on the time regions of the game presented in Figure 1.

Region 1 (t < n′):
    G(t) high: P({I_t = j})^{old} = 1/m;  P({I_t = j})^{soft} = o(1/(γt)^λ) ∀λ;  P({I^{soft}_t = j}) < P({I^{old}_t = j})? yes, much better.
    G(t) low:  P({I_t = j})^{old} = 1/m;  P({I_t = j})^{soft} close to 1/m;  P({I^{soft}_t = j}) < P({I^{old}_t = j})? no, but not by much.

Region 2 (n′ ≤ t ≤ w):
    G(t) high: P({I_t = j})^{old} = θ(1/t) + o(1/t);  P({I_t = j})^{soft} = o(1/(γt)^λ) ∀λ;  P({I^{soft}_t = j}) < P({I^{old}_t = j})? yes, much better.
    G(t) low:  P({I_t = j})^{old} = θ(1/t) + o(1/t);  P({I_t = j})^{soft} = θ(1/t) + o(1/(γt)^λ) ∀λ;  P({I^{soft}_t = j}) < P({I^{old}_t = j})? yes, but not by much.

Region 3 (t > w):
    G(t) high: P({I_t = j})^{old} = θ(1/t) + o(1/t);  P({I_t = j})^{soft} = θ(1/t) + o(1/(γt));  P({I^{soft}_t = j}) < P({I^{old}_t = j})? no, but not by much.
    G(t) low:  P({I_t = j})^{old} = θ(1/t) + o(1/t);  P({I_t = j})^{soft} = θ(1/t) + o(1/(γt));  P({I^{soft}_t = j}) < P({I^{old}_t = j})? no, but not by much.

Figure 1: Comparison of the exploration probabilities over the number of rounds. Before n′, ε^{old}_t is 1 and always greater than ψ(t). After w, ε^{old}_t is always less than ψ(t).


3.3 Regulating greed in the UCB algorithm

Following what has been presented to improve the ε-greedy algorithm in this setting, we introduce in Algorithm 3 a modification of the UCB algorithm. We again set a threshold z and, if the multiplier of the rewards G(t) is above this level, the new algorithm exploits the best arm.

Algorithm 3: UCB algorithm with threshold
Input: number of rounds n, number of arms m, threshold z, sequence {G(t)}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (as defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    if G(t) < z then
        play arm j with the highest X̄_{j,T_j(t−1)} + √(2 log t / T_j(t−1))
    else
        play arm j such that X̄_j ≥ X̄_i ∀i
    end
    Get reward G(t)X_j;
    Update X̄_j;
end
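The following minimal Python sketch mirrors the structure of Algorithm 3 (our own illustrative implementation; pull(j, t) is assumed to return G(t)·X_j(t)):

```python
import numpy as np

def ucb_threshold(n, m, z, G, pull):
    """Sketch of Algorithm 3: UCB index below the threshold, pure exploitation above it."""
    x_bar, counts = np.zeros(m), np.zeros(m)
    for j in range(m):                                        # initialization: play each arm once
        x_bar[j], counts[j] = pull(j, j + 1) / G(j + 1), 1

    for t in range(m + 1, n + 1):
        if G(t) < z:
            index = x_bar + np.sqrt(2 * np.log(t) / counts)   # usual UCB index
            j = int(np.argmax(index))
        else:
            j = int(np.argmax(x_bar))                         # high-reward round: exploit
        reward = pull(j, t)
        counts[j] += 1
        x_bar[j] += (reward / G(t) - x_bar[j]) / counts[j]
    return x_bar, counts
```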

It is possible to prove that also in this case the regret can be bounded logarithmically in n. Let B = {t : G(t−1) < z, G(t) > z} be the set of rounds where the high-reward zone is entered, and let τ_t be the last round of the high-reward zone that was entered at time t. Let us call y_1, y_2, ..., y_{|B|} the elements of B and order them in increasing order such that y_1 < y_2 < ... < y_{|B|}. Let us also define for every k ∈ {1,...,|B|} the set Y_k = {t : t ≥ y_k, G(t) > z, t < y_{k+1}} (where y_{|B|+1} = n) of times in the high-reward period entered at time y_k, and let Λ_k = max_{t∈Y_k} G(t) be the highest value of G(t) on Y_k. Finally, for every k, let R_k = Λ_k |Y_k|.

Now, given a game of n total rounds, we can “collapse” the kth high reward zone into the entering time y_k by defining G(y_k) = R_k, for all k. Then the maximum regret over B is given by (max_j ∆_j) Σ_{k=1}^{|B|} R_k. By eliminating the set B from the game, we have transformed the original game into a shorter one, with η steps, where G(t) is bounded by z and the usual UCB algorithm is played. When the size of the set B decreases with n (is of order θ(1/t) after an arbitrary time), the total regret has a logarithmic bound in n. This can be achieved by regulating the greed function in such a way that it will fall under the threshold, for example by introducing a sequence of exponents α = {α_t}_{t=1}^n that decreases G(t) to the value G(t)^{α_t}. A short sketch of the collapsing quantities follows below.
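To make the collapsing construction concrete, here is a small Python sketch (the threshold and multiplier are illustrative) that extracts the entry times B, the high-reward blocks Y_k, and the quantities Λ_k and R_k appearing in Theorem 3.3:

```python
import numpy as np

n, z = 200, 30
G = [21 + 20 * np.sin(0.25 * t) for t in range(n + 1)]   # illustrative multiplier, indexed G[t]

# B: rounds where the high-reward zone is entered (G crosses the threshold upward).
B = [t for t in range(1, n + 1) if G[t - 1] < z and G[t] >= z]

blocks = []
for k, y in enumerate(B):
    nxt = B[k + 1] if k + 1 < len(B) else n + 1
    Y_k = [t for t in range(y, nxt) if G[t] >= z]            # the k-th high-reward block
    Lambda_k = max(G[t] for t in Y_k)                        # largest multiplier in the block
    blocks.append((y, Y_k, Lambda_k, Lambda_k * len(Y_k)))   # last entry is R_k

sum_R = sum(R for *_, R in blocks)   # multiplied by max_j Delta_j in the regret bound
```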

The ε-greedy methods are more amenable to this type of analysis than UCB methods, because the proofs require bounds on the probability of choosing the wrong arm at each turn. The UCB proofs instead require us to bound the expected number of times the suboptimal arms are played, without regard to when those arms were chosen. We were able to avoid using the maximum of the G(t) values in the ε-greedy proofs, but this is unavoidable in the UCB proofs without leaving terms in the bound that cannot be explicitly calculated or simplified (an alternate proof would use weaker Central Limit Theorem arguments).

Theorem 3.3 (Regret bound for the regulated UCB algorithm). The bound on the mean regret E[R_n] at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{12}
\]
\[
+ z\left[ \sum_{j:\mu_j<\mu_*} \frac{\ln \eta}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{m} \Delta_j \right] \tag{13}
\]
\[
+ \left(\max_j \Delta_j\right) \sum_{k=1}^{|B|} R_k. \tag{14}
\]

The first sum in (12) is the exact mean regret of the initialization phase of Algorithm 3, the third sum in (14) is the bound on the regret from the high-reward zones that have been collapsed, and the second term in (13) is the bound on the regret for the η rounds when G(t) is under the threshold z; it follows from the usual bound on the UCB algorithm (for n rounds the UCB algorithm has a mean regret bounded by Σ_{j=1}^m (ln n)/∆_j + (1 + π²/3) Σ_{j=1}^m ∆_j).

3.4 The α-soft version of the regulated UCB algorithm

In Algorithm 4, we present a “soft version” of the UCB algorithm where greed is regulated gradually (in contrast with the hard threshold of the previous section). Again, in high reward zones, exploitation will be preferred, while in low reward zones the algorithm will explore the arms. α can be a real number or a sequence α = {α_t}_{t=1}^n that can be used to mitigate or enhance the effects that the multiplier function G(t) has on the exploitation rate. Let us define the following function:

\[
\xi_\alpha(t) = \left(1 + \frac{t}{G(t)^{\alpha_t}}\right). \tag{15}
\]

One of the main difficulties in the formulation of these bounds is to define a correct functional form for ξ_α(t) so that it is possible to obtain smoothness in the arm decision, reasonable Chernoff-Hoeffding inequality bounds while working out the proof (see Appendix C), and a convergent series (the second summation in (17)).

Algorithm 4: α-soft UCB algorithm
Input: number of rounds n, number of arms m, sequence {G(t)^α}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (as defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    play arm j with the highest X̄_{j,T_j(t−1)} + √(2 log(ξ_α(t)) / T_j(t−1))
    Get reward G(t)X_j;
    Update X̄_j;
end
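A minimal Python sketch of Algorithm 4 (our own illustrative implementation; α is passed as a function of t, and pull(j, t) is assumed to return G(t)·X_j(t)):

```python
import numpy as np

def alpha_soft_ucb(n, m, G, alpha, pull):
    """Sketch of Algorithm 4: the exploration bonus uses log(xi_alpha(t)) instead of log(t)."""
    xi = lambda t: 1.0 + t / (G(t) ** alpha(t))          # equation (15); alpha(t) may be constant
    x_bar, counts = np.zeros(m), np.zeros(m)
    for j in range(m):                                   # initialization: play each arm once
        x_bar[j], counts[j] = pull(j, j + 1) / G(j + 1), 1

    for t in range(m + 1, n + 1):
        index = x_bar + np.sqrt(2 * np.log(xi(t)) / counts)
        j = int(np.argmax(index))
        reward = pull(j, t)
        counts[j] += 1
        x_bar[j] += (reward / G(t) - x_bar[j]) / counts[j]
    return x_bar, counts
```

When G(t) is large, ξ_α(t) stays close to 1, so the exploration bonus nearly vanishes and the algorithm exploits; when G(t) is small, ξ_α(t) grows roughly like t and the usual UCB behavior is recovered.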

Also in this case, it is possible to achieve a bound that grows logarithmically in n.

Theorem 3.4 (Regret bound for the soft-UCB algorithm). The bound on the mean regret E[R_n] at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{16}
\]
\[
+ \max_{t\in\{m+1,\dots,n\}} G(t) \left[ \sum_{j:\mu_j<\mu_*} \frac{8}{\Delta_j} \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) + 2 \sum_{j=1}^{m} \Delta_j \sum_{t=m+1}^{n} (\xi_\alpha(t))^{-4} (t-1-m)^2 \right]. \tag{17}
\]

The first sum in (16) is the exact mean regret of the initialization phase of Algorithm 4. For the rounds after the initialization phase, the mean regret is bounded by the quantity in (17), which is almost identical to the bound of the usual UCB algorithm if we assume G(t) = 1 (i.e., rewards are not modified by the multiplier function).

4 Experimental results

We are going to consider three types of multiplier function G(t):

• The Wave Greed (Figure 2a): in this case customers come in waves: G(t) = 21 + 20 sin(0.25t). We want to exploit the best arm found so far during the peaks, and explore the other arms during low-reward periods;

• The Christmas Greed (Figure 2b): again, G(t) = 21 + 20 sin(0.25t), but when t ∈ {650, 651, ..., 670}, G(t) = 1000, which shows that there is a peak in the rewards offered by the game (which we call “Christmas”, in analogy to the phenomenon of the boom of customers during the Christmas holidays);

• The Step Greed (Figure 2c): this case is similar to the Wave Greed case, but this time the function is not smooth: G(t) = 200, but for t ∈ {600, 601, ..., 800} ∪ {1000, 1001, ..., 1200} ∪ {1400, 1401, ..., 1600} we have G(t) = 400.


Figure 2: Multiplier functions used in the experiments. (a) The Wave Greed; (b) The Christmas Greed; (c) The Step Greed.

We consider a game with 500 arms and rewards normally distributed. Each arm j ∈ {1,...,500} has mean reward µ_j = 0.1 + (200 + 1.5(500 − j + 1))/(1.5 × 500) and common standard deviation σ = 0.05. The arms were chosen in this way so that X_j would take values (with high probability) in [0, 1]. Having a bounded support for X_j is a standard assumption made when proving regret bounds (see Auer et al. [2002]). We play 2000 rounds each game. After 2000 rounds the algorithms all essentially have determined which arm is the best and tend to perform very similarly from that point onwards.
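For reference, the three multiplier functions can be written down directly in a short Python sketch; the arm means below are illustrative placeholders rather than the exact formula used in the experiments:

```python
import numpy as np

def wave(t):                       # the Wave Greed
    return 21 + 20 * np.sin(0.25 * t)

def christmas(t):                  # the Christmas Greed: Wave Greed with a spike for t in {650,...,670}
    return 1000.0 if 650 <= t <= 670 else wave(t)

def step(t):                       # the Step Greed
    high = (600 <= t <= 800) or (1000 <= t <= 1200) or (1400 <= t <= 1600)
    return 400.0 if high else 200.0

rng = np.random.default_rng(0)
m, n_rounds, sigma = 500, 2000, 0.05
mu = np.linspace(0.37, 0.95, m)[::-1]        # illustrative decreasing means, NOT the paper's formula

def pull(j, t, G=wave):
    """One scaled normal reward, as in the experiments of Figures 3-5."""
    return G(t) * rng.normal(mu[j], sigma)
```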

The well-known UCB and ε-greedy algorithms are not suitable for the setting in which the rewards are altered by the multiplier function. Thus, in their current form, we cannot compare directly with them. The fact that rewards are multiplied would irremediably bias all the estimations of the mean rewards, leading UCB and ε-greedy to choose arms that look good just because they happened to be played in a high reward period. For example, suppose we show an ad on a website at lunch time: many people will see it because at that time web-surfing is at its peak (i.e., the G(t) multiplier is high). So even if the ad was bad, we may register more clicks than for a good ad shown at 3:00AM (i.e., the G(t) multiplier is low). To obtain a fair comparison, we created “smarter” versions of the UCB and ε-greedy algorithms in which the rewards are discounted at each round (by simply dividing them by G(t)), so that the old versions of the algorithms can also produce accurate estimates of the mean reward for each arm. The smarter version of the usual UCB algorithm is presented in Algorithm 5 and the one for the ε-greedy algorithm is shown in Algorithm 6. For the three multiplier functions, we report the performance of the algorithms in Figures 3, 4, and 5.

In Figures 6, 7, and 8, we change the rewards to have a Bernoulli distribution (the assumption of bounded support is verified). Similarly to the normal case, each arm j ∈ {1,...,500} has probability of success p_j = 0.1 + (200 + 1.5(500 − j + 1))/(1.5 × 500). One of the advantages of the ε-greedy algorithm is that there are no assumptions on the distribution of the rewards, while in UCB the rewards need bounded support ([0, 1] for convenience, so it is easier to use Hoeffding's inequality).

Algorithm 5: Smarter version of the usual UCB algorithm
Input: number of rounds n, number of arms m, sequence {G(t)}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (as defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    play arm j with the highest X̄_{j,T_j(t−1)} + √(2 log(t) / T_j(t−1))
    Get reward G(t)X_j;
    Update X̄_j;
end
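A minimal Python sketch of Algorithm 5 (illustrative; the only difference from standard UCB is that each observed reward is divided by the known G(t) before updating the mean estimate):

```python
import numpy as np

def smarter_ucb(n, m, G, pull):
    """Sketch of Algorithm 5: standard UCB run on rewards discounted by G(t)."""
    x_bar, counts = np.zeros(m), np.zeros(m)
    for j in range(m):                                   # initialization: play each arm once
        x_bar[j], counts[j] = pull(j, j + 1) / G(j + 1), 1
    for t in range(m + 1, n + 1):
        j = int(np.argmax(x_bar + np.sqrt(2 * np.log(t) / counts)))
        counts[j] += 1
        x_bar[j] += (pull(j, t) / G(t) - x_bar[j]) / counts[j]   # discount by G(t)
    return x_bar, counts
```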


Figure 3: Comparison for the Wave Greed case. (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.

Figure 4: Comparison for the Christmas Greed case. (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.

Figure 5: Comparison for the Step Greed case. (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.


Figure 6: Comparison for the Wave Greed case (Bernoulli rewards). (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.

Figure 7: Comparison for the Christmas Greed case (Bernoulli rewards). (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.

Figure 8: Comparison for the Step Greed case (Bernoulli rewards). (a) ε-greedy algorithms' rewards; (b) UCB algorithms' rewards; (c) final rewards comparison.


Algorithm 6: Smarter version of the usual ε-greedy algorithm
Input: number of rounds n, number of arms m, a constant c > 10, a constant d such that d < min_j ∆_j and 0 < d < 1, sequences {ε_t}_{t=1}^n = min{1, cm/(d²t)} and {G(t)}_{t=1}^n
Initialization: play all arms once and initialize X̄_j (as defined in (1)) for each j = 1, ..., m
for t = m+1 to n do
    with probability ε_t play an arm uniformly at random (each arm has probability 1/m of being selected);
    otherwise (with probability 1 − ε_t) play arm j such that X̄_j ≥ X̄_i ∀i
    Get reward G(t)X_j;
    Update X̄_j;
end

4.1 Discussion of the Yahoo! contest

The motivation of this work comes from a high-scoring entry in the Exploration and Exploitation 3 contest, where the goal was to build a better recommender system for Yahoo! Front Page news article recommendations. The contest data, which was from Yahoo! and allows for unbiased evaluations, is described by Li et al. [2010]. These data had several challenging characteristics, including broad trends over time in click through rate, arms (news articles) appearing and disappearing over time, the inability to access the data in order to cross-validate, and other complexities. This paper does not aim to handle all of these, but only the one which led to a key insight in increased performance, which is the regulation of greed over time. Although there were features available for each time, none of the contestants were able to successfully use the features to substantially boost performance, and the exploration/exploitation aspects turned out to be more important. Here are the main insights leading to large performance gains, all involving regulating greed over time:

• “Peak grabber”: Stop exploration when a good arm appears. Specifically, when an article was clicked more than 9 times out of its last 100 displays, keep showing it and stop exploration altogether until the arm's click through rate drops below that of another arm (a minimal sketch of this rule appears after this list). Since this strategy does not handle the massive global trends we observed in the data, it needed to be modified as follows:

• “Dynamic peak grabber”: Stop exploration when the click through rate of one arm is at least 15% above the global click through rate.

• Stop exploring old articles: We can determine approximately how long the arm is likely to stay, and we reduce explo-ration gradually as the arm gets older.

• Do not fully explore new arms: When a new arm appears, do not use 1 as the upper confidence bound for the probability of a click, which would force a UCB algorithm to explore it; use .88 instead. This allows the algorithm to continue exploiting the arms that are known to be good rather than exploring new ones.
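As referenced in the first bullet, here is a minimal sketch of the “peak grabber” rule (a reconstruction of the heuristic described above, not the contest code; the bookkeeping with per-article deques is our own choice):

```python
from collections import deque

def peak_grabber_choice(candidate_from_bandit, history, threshold=9, window=100):
    """If any article got more than `threshold` clicks in its last `window` displays,
    keep showing it instead of exploring; otherwise defer to the underlying bandit.
    `history[j]` is assumed to be a deque(maxlen=window) of 0/1 click outcomes."""
    hot = [j for j, h in history.items() if len(h) == window and sum(h) > threshold]
    if hot:
        # exploit the article with the highest recent click-through rate;
        # exploration resumes once its rate drops back below the threshold
        return max(hot, key=lambda j: sum(history[j]) / len(history[j]))
    return candidate_from_bandit

# Bookkeeping per article j: history[j] = deque(maxlen=100); history[j].append(click)
```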

The peak grabber strategies inspired the abstracted setting here, where one can think of a good article appearing during periods of high G(t), where we would want to limit exploration; however, the other strategies are also relevant cases where the exploration/exploitation tradeoff is regulated over time. There were no “lock-up” periods in the contest dataset, though as discussed earlier, the G(t) function is also relevant for modeling that setting. The large global trends we observed in the contest data click through rates are very relevant to the G(t) model, since obviously one would want to explore less when the click rate is high in order to get more clicks overall.

5 Conclusions

The dynamic trends we observe in most retail and marketing settings are dramatic. It is possible that understanding these dynamics and how to take advantage of them is central to the success of multi-armed bandit algorithms. We showed in this work how to adapt regret bound analysis to this setting, where we now need to consider not only how many times an arm was pulled in the past, but precisely when the arm was pulled. The key element of our algorithms is that they regulate greed (exploitation) over time, where during high reward periods, less exploration is performed.

There are many possible extensions to this work. In particular, if G(t) is not known in advance, it may be easy to estimate from data in real time, as in the dynamic peak grabber strategy. The analysis of the algorithms in this paper could be extended


to other important multi-armed bandit algorithms besides ε-greedy and UCB. Further, it would be potentially interesting to better explore the connection of mortal bandits with the G(t) setting, since for mortal bandits, each bandit's G(t) function can change at a different rate.

Acknowledgments

We thank team members Ed Su and Virot Ta Chiraphadhanakul for their work on the ICML Exploration and Exploitation 3 contest that inspired the work here. Thank you also to Philippe Rigollet for helpful comments and encouragement. This project was partially funded by the Ford-MIT alliance.

References

Jean-Yves Audibert, Sébastien Bubeck, et al. Best arm identification in multi-armed bandits. Conference on Learning Theory 2010 - Proceedings, 2010.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). Springer, 1985.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Available at SSRN 2436629, 2014.

Olivier Bousquet. New approaches to statistical learning theory. Annals of the Institute of Statistical Mathematics, 55(2):371–389, 2003.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. In Advances in Neural Information Processing Systems, pages 273–280, 2009.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. arXiv preprint arXiv:1102.2490, 2011.

Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012.

Junpei Komiyama, Issei Sato, and Hiroshi Nakagawa. Multi-armed bandit problem with lock-up periods. In Asian Conference on Machine Learning, pages 116–132, 2013.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

Haoyang Liu, Keqin Liu, and Qing Zhao. Learning in a changing world: Restless multiarmed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.

Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.

Cynthia Rudin, Virot Ta Chiraphadhanakul, and Edward Su. Regulating greed over time for Yahoo! front page news article recommendations, from the Exploration and Exploitation 3 Challenge. Lecture slides, 2012.

Aleksandrs Slivkins and Eli Upfal. Adapting to a stochastically changing environment: The dynamic multi-armed bandits problem. Technical Report CS-07-05, Brown University, 2007.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.

A Regret bound for the ε-greedy algorithm with hard threshold

The regret at round n is given by

\[
R_n = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{I_t=j\}}, \tag{18}
\]

where G(t) is the greed function evaluated at time t, 1_{\{I_t=j\}} is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0) and ∆_j = µ_* − µ_j is the difference between the mean of the best arm reward distribution and the mean of the j-th arm reward distribution. By considering the threshold z, which determines which rule is applied to decide what arm to play, we can rewrite the regret as

\[
R_n = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{G(t)<z\}}\, 1_{\{I_t=j\}}
 + \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{G(t)\ge z\}}\, 1_{\{I_t=j\}}.
\]

By taking the expectation we have that

\[
E[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{G(t)<z\}}\, P(\{I_t=j\})
 + \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{G(t)\ge z\}}\, P(\{I_t=j\}). \tag{19}
\]

For the rounds of the algorithm where G(t) < z, we are in the standard setting, so for those times, we follow the standard proof of Auer et al. [2002]. For the times that are over the threshold, we need to create a separate bound. Let us now bound the probability of playing the sub-optimal arm j at time t when the greed function is above the threshold z:

\[
P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i) \le P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}) \tag{20}
\]
\[
\le P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) + P\!\left(\bar{X}_{*,T_*(t-1)} \le \mu_* - \frac{\Delta_j}{2}\right), \tag{21}
\]

where the last inequality follows from the fact that

\[
\left\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\right\} \subset \left( \left\{\bar{X}_{*,T_*(t-1)} \le \mu_* - \frac{\Delta_j}{2}\right\} \cup \left\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\} \right). \tag{22}
\]

In fact, suppose that the inclusion (22) does not hold. Then we would have that

\[
\left\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\right\} \subset \left( \left\{\bar{X}_{*,T_*(t-1)} \le \mu_* - \frac{\Delta_j}{2}\right\} \cup \left\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\} \right)^C \tag{23}
\]
\[
= \left\{\bar{X}_{*,T_*(t-1)} > \mu_* - \frac{\Delta_j}{2}\right\} \cap \left\{\bar{X}_{j,T_j(t-1)} < \mu_j + \frac{\Delta_j}{2}\right\}, \tag{24}
\]

but from the intersection of events given in (24) it follows that X̄_{*,T_*(t−1)} > µ_* − ∆_j/2 = µ_j + ∆_j/2 > X̄_{j,T_j(t−1)}, which contradicts (23). Let us consider the first term of (21) (the computations for the second term are similar):

\[
P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right)
= \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s,\ \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)
\]
\[
= \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) P\!\left(\bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)
\]
\[
\le \sum_{s=1}^{t-1} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) e^{-\frac{\Delta_j^2}{2}s}, \tag{25}
\]

where in the last inequality we used the Chernoff-Hoeffding bound. Let us define T^R_j(t−1) as the number of times arm j is played at random (note that T^R_j(t−1) ≤ T_j(t−1) and that T^R_j(t−1) = Σ_{s=1}^{t−1} B_s, where B_s is a Bernoulli r.v. with parameter ε_s/m), and let us define

\[
x_0 = \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \varepsilon_s,
\]

where t̃ is the number of rounds played under the threshold z up to time t. Then,

\[
(25) \le \sum_{s=1}^{\lfloor x_0 \rfloor} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \sum_{s=\lfloor x_0 \rfloor + 1}^{t-1} e^{-\frac{\Delta_j^2}{2}s}
\]
\[
\le \sum_{s=1}^{\lfloor x_0 \rfloor} P\!\left(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2}\lfloor x_0 \rfloor}
\]
\[
\le \sum_{s=1}^{\lfloor x_0 \rfloor} P\!\left(T^R_j(t-1) \le s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2}\lfloor x_0 \rfloor}
\]
\[
\le \lfloor x_0 \rfloor\, P\!\left(T^R_j(t-1) \le \lfloor x_0 \rfloor\right) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2}\lfloor x_0 \rfloor}, \tag{26}
\]

where for the first ⌊x_0⌋ terms of the sum we upper bounded e^{−∆_j² s/2} by 1, and for the remaining terms we used the fact that Σ_{s=⌊x_0⌋+1}^∞ e^{−ks} ≤ (1/k) e^{−k⌊x_0⌋}, where in our case k = ∆_j²/2. We have that

\[
E[T^R_j(t-1)] = \frac{1}{m} \sum_{s=1}^{\tilde{t}} \varepsilon_s, \qquad
Var(T^R_j(t-1)) = \sum_{s=1}^{\tilde{t}} \frac{\varepsilon_s}{m}\left(1 - \frac{\varepsilon_s}{m}\right) \le \frac{1}{m} \sum_{s=1}^{\tilde{t}} \varepsilon_s = E[T^R_j(t-1)],
\]

and, using the Bernstein inequality P(S_n ≤ E[S_n] − a) ≤ exp{−(a²/2)/(σ² + a/2)} with S_n = T^R_j(t−1) and a = (1/2) E[T^R_j(t−1)],

\[
P(T^R_j(t-1) \le \lfloor x_0 \rfloor) = P\!\left(T^R_j(t-1) \le E[T^R_j(t-1)] - \frac{1}{2}E[T^R_j(t-1)]\right)
\le \exp\left\{ - \frac{\frac{1}{8}\,(E[T^R_j(t-1)])^2}{E[T^R_j(t-1)] + \frac{1}{4}E[T^R_j(t-1)]} \right\}
\]
\[
= \exp\left\{ -\frac{4}{5}\cdot\frac{1}{8}\, E[T^R_j(t-1)] \right\} \le \exp\left\{ -\frac{1}{5}\lfloor x_0 \rfloor \right\}. \tag{27}
\]


Now we need a lower bound on ⌊x_0⌋. Let us define n′ = ⌊cm/d²⌋; then

\[
x_0 = \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \varepsilon_s
= \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \min\left\{1, \frac{cm}{d^2 s}\right\}
= \frac{1}{2m} \sum_{s=1}^{n'} 1 + \frac{1}{2m} \sum_{s=n'+1}^{\tilde{t}} \frac{cm}{d^2 s}
= \frac{n'}{2m} + \frac{1}{2m}\left( \sum_{s=1}^{\tilde{t}} \frac{cm}{d^2 s} - \sum_{s=1}^{n'} \frac{cm}{d^2 s} \right)
\]
\[
\ge \frac{n'}{2m} + \frac{c}{2d^2}\Big( \ln(\tilde{t}+1) - (\ln(n') + \ln(e)) \Big)
\ge \frac{c}{2d^2} \ln\!\left( \frac{n'}{m}\,\frac{d^2}{c} \right) + \frac{c}{2d^2} \ln\!\left( \frac{\tilde{t}}{n' e} \right)
= \frac{c}{2d^2} \ln\!\left( \frac{d^2 \tilde{t}}{mce} \right). \tag{28}
\]

Remark 1. Note that if t̃ (or t in the usual ε-greedy algorithm) were less than n′, then we would have x_0 = t̃/(2m), yielding an exponential decay of the bound on the probability of j being the best arm. In fact, t̃ < n′ would imply that

\[
(27) \le \frac{\tilde{t}}{2m} \exp\left\{ -\frac{1}{5}\,\frac{\tilde{t}}{2m} \right\}.
\]

Then, using (28) combined with (27) in (26), we get a bound for the first term in (21) given by

\[
\frac{c}{2d^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c}{10 d^2}} \ln\!\left( \frac{d^2 \tilde{t}}{mce} \right) + \frac{2}{\Delta_j^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c \Delta_j^2}{4 d^2}}. \tag{29}
\]

Since the computations for the second term in (21) are similar, a bound on P(X̄_{j,T_j(t−1)} ≥ X̄_{i,T_i(t−1)} ∀i) is given by

\[
\beta_j(t) = \frac{c}{d^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c}{10 d^2}} \ln\!\left( \frac{d^2 \tilde{t}}{mce} \right) + \frac{4}{\Delta_j^2} \left( \frac{d^2 \tilde{t}}{mce} \right)^{-\frac{c \Delta_j^2}{4 d^2}}. \tag{30}
\]

Using (19), the bound on the mean regret at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j
+ \sum_{t=m+1}^{n} G(t) \sum_{j=1}^{m} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta_j(t) \right) 1_{\{G(t)<z\}}
+ \sum_{t\in B} \sum_{j=1}^{m} \Delta_j \beta_j(t) \left[ \sum_{s=t}^{\tau_t} G(s) \right],
\]

where B = {t : G(t−1) < z, G(t) ≥ z} is the set of rounds where the high-reward zone is entered, and τ_t is the last round of the high-reward zone entered at time t. This, combined with the bound for β_j(t) above, proves the theorem.

B Logarithmic bound for the Soft-ε-greedy algorithm

At each round t, arm j is played with probability

\[
\frac{\varepsilon_t}{m} + (1-\varepsilon_t)\, P(\bar{X}_j \ge \bar{X}_i \ \forall i),
\]

where ε_t = min{ψ(t), cm/(d²t)} and

\[
\psi(t) = \frac{\log\!\left(1 + \frac{1}{G(t)}\right)}{\log\!\left(1 + \frac{1}{\min_{s\in\{m+1,\dots,n\}} G(s)}\right)}. \tag{31}
\]

Recall that γ = min_{1≤t≤n} ψ(t). Let us bound the probability P({I_t = j}) of playing the sub-optimal arm j at time t. We have that

\[
P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i)
\le P(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)})
\le P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) + P\!\left(\bar{X}_{*,T_*(t-1)} \le \mu_* - \frac{\Delta_j}{2}\right). \tag{32}
\]

For the bound on the two addends in (32), we have identical steps to the proof of Theorem 3.1, and thus

\[
P\!\left(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) \le \lfloor x_0 \rfloor\, P\!\left(T^R_j(t-1) \le \lfloor x_0 \rfloor\right) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2}\lfloor x_0 \rfloor}, \tag{33}
\]
\[
P\!\left(\bar{X}_{*,T_*(t-1)} \le \mu_* - \frac{\Delta_j}{2}\right) \le \lfloor x_0 \rfloor\, P\!\left(T^R_j(t-1) \le \lfloor x_0 \rfloor\right) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2}\lfloor x_0 \rfloor}, \tag{34}
\]

and we again have

\[
P\!\left(T^R_j(t-1) \le \lfloor x_0 \rfloor\right) \le \exp\left\{ -\frac{1}{5}\lfloor x_0 \rfloor \right\}. \tag{35}
\]

Now we need a lower bound on ⌊x_0⌋. Let t > w, where w = min{s ∈ {1,...,n} : cm/(d²s) < γ}. Then,

\[
x_0 = \frac{1}{2m} \sum_{s=1}^{t} \varepsilon_s
= \frac{1}{2m} \sum_{s=1}^{t} \min\left\{\psi(s), \frac{cm}{d^2 s}\right\}
\ge \frac{1}{2m} \sum_{s=1}^{w} \gamma + \frac{1}{2m}\left( \sum_{s=1}^{t} \frac{cm}{d^2 s} - \sum_{s=1}^{w} \frac{cm}{d^2 s} \right)
\]
\[
\ge \frac{w\gamma}{2m} + \frac{c}{2d^2}\Big( \ln(t+1) - (\ln(w) + \ln(e)) \Big)
\ge \frac{c}{2d^2} \ln\!\left( \frac{w\gamma}{m}\,\frac{d^2}{c} \right) + \frac{c}{2d^2} \ln\!\left( \frac{t}{we} \right)
= \frac{c}{2d^2} \ln\!\left( \frac{d^2 \gamma t}{mce} \right). \tag{36}
\]

Using (36) in (35), combined with (33) and (34), from (32) the bound on P(X̄_{j,T_j(t−1)} ≥ X̄_{i,T_i(t−1)} ∀i) is given by

\[
\beta^S_j(t) = \frac{c}{d^2} \ln\!\left( \frac{d^2 \gamma t}{mce} \right) \left( \frac{d^2 \gamma t}{mce} \right)^{-\frac{c}{10 d^2}} + \frac{4}{\Delta_j^2} \left( \frac{d^2 \gamma t}{mce} \right)^{-\frac{c \Delta_j^2}{4 d^2}}.
\]

Since the mean regret is given by

\[
E[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, P(\{I_t=j\}), \tag{37}
\]

the bound on the mean regret at time n is given by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j
+ \sum_{t=m+1}^{n} G(t) \sum_{j=1}^{m} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1-\varepsilon_t)\beta^S_j(t) \right).
\]


C The regret bound of the α-soft UCB algorithm

The regret at round n is given by

\[
R_n = \sum_{j=1}^{m} G(j)\Delta_j + \sum_{t=m+1}^{n}\sum_{j=1}^{m} \Delta_j G(t)\, 1_{\{I_t=j\}}
\le \sum_{j=1}^{m} G(j)\Delta_j + \left( \max_{t\in\{m+1,\dots,n\}} G(t) \right) \sum_{j=1}^{m} \Delta_j \sum_{t=m+1}^{n} 1_{\{I_t=j\}}.
\]

The expected regret E[R_n] at round n is bounded by

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \left( \max_{t\in\{m+1,\dots,n\}} G(t) \right) \sum_{j=1}^{m} \Delta_j\, E[T_j(n)], \tag{38}
\]

where T_j(n) = Σ_{t=1}^n 1{I_t=j} is the number of times the sub-optimal arm j has been chosen up to round n. Recall from (1) that

\[
\bar{X}_j = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_j(s). \tag{39}
\]

From the Chernoff-Hoeffding inequality we have that

\[
P\!\left( \frac{1}{T_j(t-1)} \sum_{i=1}^{T_j(t-1)} X_{j,i} - \mu_j \le -\varepsilon \right) \le \exp\{-2 T_j(t-1)\varepsilon^2\},
\]

and

\[
P\!\left( \frac{1}{T_j(t-1)} \sum_{i=1}^{T_j(t-1)} X_{j,i} - \mu_j \ge \varepsilon \right) \le \exp\{-2 T_j(t-1)\varepsilon^2\}. \tag{40}
\]

Let us define the following function:

\[
\xi_\alpha(t) = \left(1 + \frac{t}{G(t)^{\alpha_t}}\right);
\]

by selecting ε = √(2 log(ξ_α(t)) / T_j(t−1)) we have

\[
P\!\left( \bar{X}_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \le \mu_j \right) \le (\xi_\alpha(t))^{-4}, \tag{41}
\]

and

\[
P\!\left( \bar{X}_j - \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \mu_j \right) \le (\xi_\alpha(t))^{-4}. \tag{42}
\]

Equivalently, we may write, for every j,

\[
\mu_j - \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \le \bar{X}_j \quad \text{with probability at least } 1 - (\xi_\alpha(t))^{-4}, \tag{43}
\]
\[
\mu_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \bar{X}_j \quad \text{with probability at least } 1 - (\xi_\alpha(t))^{-4}. \tag{44}
\]

If we choose arm j at round t (i.e., the event {I_t = j} occurs), we have that

\[
\bar{X}_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \bar{X}_* + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_*(t-1)}}. \tag{45}
\]


Let us use (44) to upper bound the LHS and (43) to lower bound the RHS of (45); then we get, with probability at least 1 − 2(ξ_α(t))^{−4},

\[
\mu_j + 2\sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \mu_*,
\]

from which we get, with probability at least 1 − 2(ξ_α(t))^{−4},

\[
T_j(t-1) \le \frac{8}{\Delta_j^2} \log(\xi_\alpha(t)). \tag{46}
\]

In order to emphasize the dependence of X̄_j on T_j(t−1), we will sometimes write X̄_{j,T_j(t−1)}. In the following, notice that in (48) the summation starts from m+1 because in the first m initialization rounds each arm is played once. Moreover, step (49) follows from (48) by assuming that arm j has already been played u times. By using (45) we get (50); then, for each t,

\[
\left\{ \bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \bar{X}_{*,T_*(t-1)} + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_*(t-1)}},\ T_j(t-1) \ge u \right\}
\subset
\left\{ \max_{s_j\in\{u,\dots,T_j(t-1)\}} \bar{X}_{j,s_j} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \ \ge\ \min_{s_*\in\{1,\dots,T_*(t-1)\}} \bar{X}_{*,s_*} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right\}, \tag{47}
\]

which justifies (51). We also have that (47) is included in

\[
\bigcup_{s_*=1}^{T_*(t-1)} \bigcup_{s_j=u}^{T_j(t-1)} \left\{ \bar{X}_{j,s_j} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \ge \bar{X}_{*,s_*} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right\}.
\]

Thus, for any integer u, we may write

\[
T_j(n) = 1 + \sum_{t=m+1}^{n} 1\{I_t = j\} \tag{48}
\]
\[
\le u + \sum_{t=m+1}^{n} 1\{I_t = j,\ T_j(t-1) \ge u\} \tag{49}
\]
\[
\le u + \sum_{t=m+1}^{n} 1\left\{ \bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \bar{X}_{*,T_*(t-1)} + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_*(t-1)}},\ T_j(t-1) \ge u \right\} \tag{50}
\]
\[
\le u + \sum_{t=m+1}^{n} 1\left\{ \max_{s_j\in\{u,\dots,T_j(t-1)\}} \bar{X}_{j,s_j} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \ \ge\ \min_{s_*\in\{1,\dots,T_*(t-1)\}} \bar{X}_{*,s_*} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right\} \tag{51}
\]
\[
\le u + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} 1\left\{ \bar{X}_{j,s_j} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \ge \bar{X}_{*,s_*} + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right\}. \tag{52}
\]

When

\[
1\left\{ \bar{X}_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}} \ge \bar{X}_* + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_*(t-1)}} \right\} \tag{53}
\]

is equal to one, at least one of the following has to be true:

\[
\bar{X}_* \le \mu_* - \sqrt{\frac{2\log(\xi_\alpha(t))}{T_*(t-1)}}; \tag{54}
\]
\[
\bar{X}_j \ge \mu_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}}; \tag{55}
\]
\[
\mu_* < \mu_j + 2\sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}}. \tag{56}
\]

(In fact, suppose none of them hold simultaneously. Then from (54) we would have that X̄_* > µ_* − √(2 log(ξ_α(t))/T_*(t−1)); then, by applying (56) (with the inequality reversed, since we are assuming it does not hold) we get X̄_* > µ_j + 2√(2 log(ξ_α(t))/T_j(t−1)) − √(2 log(ξ_α(t))/T_*(t−1)), and then from (55) (again, reversed) it follows that X̄_* > X̄_j + √(2 log(ξ_α(t))/T_j(t−1)) − √(2 log(ξ_α(t))/T_*(t−1)), which is in contradiction with (53).) Now, if we set u = ⌈(8/∆_j²) log(max_{t∈{m+1,...,n}} ξ_α(t))⌉, then for T_j(t−1) ≥ u,

\[
\mu_* - \mu_j - 2\sqrt{\frac{2\log(\xi_\alpha(t))}{T_j(t-1)}}
\ \ge\ \mu_* - \mu_j - 2\sqrt{\frac{2\log(\xi_\alpha(t))}{u}}
= \mu_* - \mu_j - \Delta_j \sqrt{\frac{\log(\xi_\alpha(t))}{\left\lceil \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) \right\rceil}}
\ \ge\ \mu_* - \mu_j - \Delta_j = 0;
\]

therefore, with this choice of u, (56) cannot hold. Thus, using (52), we have that

\[
T_j(n) \le \left\lceil \frac{8}{\Delta_j^2} \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) \right\rceil
+ \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} 1\left\{ \bar{X}_{*,s_*} \le \mu_* - \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right\}
+ \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} 1\left\{ \bar{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \right\},
\]

and by taking the expectation,

\[
E[T_j(n)] \le \left\lceil \frac{8}{\Delta_j^2} \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) \right\rceil
+ \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} P\!\left( \bar{X}_{*,s_*} \le \mu_* - \sqrt{\frac{2\log(\xi_\alpha(t))}{s_*}} \right)
+ \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} P\!\left( \bar{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(\xi_\alpha(t))}{s_j}} \right)
\]
\[
\le \left\lceil \frac{8}{\Delta_j^2} \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) \right\rceil + 2 \sum_{t=m+1}^{n} (\xi_\alpha(t))^{-4} (t-1-m)^2,
\]

where in the last step we upper bound T_*(t−1) and T_j(t−1) by (t−1−m) (cases where we have only played the best arm or arm j). Therefore, by using (38),

\[
E[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j
+ \max_{t\in\{m+1,\dots,n\}} G(t) \left[ \sum_{j:\mu_j<\mu_*} \frac{8}{\Delta_j} \log\!\left( \max_{t\in\{m+1,\dots,n\}} \xi_\alpha(t) \right) + 2 \sum_{j=1}^{m} \Delta_j \sum_{t=m+1}^{n} (\xi_\alpha(t))^{-4} (t-1-m)^2 \right].
\]
