
Multi-Armed Bandits with Correlated Arms

Samarth Gupta, Shreyas Chaudhari, Gauri Joshi and Osman Yağan
{samarthg,schaudh2,gaurij,oyagan}@andrew.cmu.edu
Carnegie Mellon University, Pittsburgh, PA, USA

arXiv:1911.03959v1 [stat.ML] 6 Nov 2019

ABSTRACT

We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. The correlation information is captured in terms of pseudo-rewards, which are bounds on the rewards of the other arms given a reward realization, and which can capture many general correlation structures. We leverage these pseudo-rewards to design a novel approach that extends any classical bandit algorithm to the correlated multi-armed bandit setting studied in this framework. In each round, our proposed C-Bandit algorithm identifies some arms as empirically non-competitive and avoids exploring them for that round. Through a unified regret analysis of the proposed C-Bandit algorithm, we show that C-UCB and C-TS (the correlated bandit versions of Upper Confidence Bound and Thompson Sampling) pull certain arms, called non-competitive arms, only O(1) times. As a result, we effectively reduce a K-armed bandit problem to a C+1-armed bandit problem, where C is the number of competitive arms, as only C sub-optimal arms are pulled O(log T) times. In many practical scenarios C can be zero, in which case our proposed C-Bandit algorithms achieve bounded regret. In the special case where rewards are correlated through a latent random variable X, we give a regret lower bound showing that bounded regret is possible only when C = 0. In addition to simulations, we validate the proposed algorithms via experiments on two real-world recommendation datasets, Movielens and Goodreads, and show that C-UCB and C-TS significantly outperform classical bandit algorithms.

CCS CONCEPTS

• Theory of computation → Online learning theory; Sequential decision making; Regret bounds.

KEYWORDS

multi-armed bandits, regret analysis, recommendation systems

ACM Reference Format:

Samarth Gupta, Shreyas Chaudhari, Gauri Joshi and Osman Yağan. 2020. Multi-Armed Bandits with Correlated Arms. In Under review at SIGMETRICS '20, June 08–12, 2020, Boston, MA. ACM, New York, NY, USA, 24 pages. https://doi.org/xx.xxxx/xxxxxxx.xxxxxxx

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Under review at SIGMETRICS '20, June 08–12, 2020, Boston, MA
© 2020 Association for Computing Machinery.
ACM ISBN xxx-x-xxxx-xxxx-x/xx/xx. . . $xx.xx
https://doi.org/xx.xxxx/xxxxxxx.xxxxxxx

1 INTRODUCTION

1.1 Background and Motivation

Classical Multi-armed Bandits. The multi-armed bandit (MAB) problem falls under the class of sequential decision-making problems. In the classical multi-armed bandit problem, there are K arms, with each arm having an unknown reward distribution. At each round t, we need to decide an arm kt ∈ K and we receive a random reward Rt drawn from the reward distribution of arm kt. The goal in the classical multi-armed bandit is to maximize the long-term cumulative reward. In order to maximize cumulative reward, it is important to balance the exploration-exploitation trade-off, i.e., learning the mean reward of each arm while trying to make sure that the arm with the highest mean reward is played as many times as possible. This problem has been well studied for a long time, starting with the work of Lai and Robbins [1] that proposed the upper confidence bound (UCB) arm-selection algorithm and studied its fundamental limits in terms of bounds on regret. Subsequently, several other algorithms [2], including Thompson Sampling [3] and KL-UCB [4], have been proposed for this setting. The classical multi-armed bandit model is useful in numerous applications involving medical diagnosis [5], system testing [6], scheduling in computing systems [7–9], and web optimization [10, 11], among others.

Of particular interest to this work is the problem of optimal ad-selection. Suppose that a company is to run a display advertising campaign for one of its products, and its creative team has designed several different versions that can be displayed. It is expected that the user engagement (in terms of click probability and time spent looking at the ad) depends on the version of the ad that is displayed. In order to maximize the total user engagement over the course of the ad campaign, multi-armed bandit algorithms can be used; different versions of the ad correspond to the arms, and the reward from selecting an arm is given by the clicks or time spent looking at the ad version corresponding to that arm.

Personalized recommendations using Contextual and Structured bandits. Although the ad-selection problem can be solved by standard MAB algorithms, there are several specialized MAB variants that are designed to give better performance. For instance, the contextual bandit problem [12, 13] has been studied to provide personalized displays of the ads to the users. Here, before making a choice at each time step (i.e., deciding which version to show to a user), we observe the context associated with that user (e.g., age/occupation/income features). Contextual bandit algorithms learn the mapping from the context θ to the most favored version of the ad k∗(θ) in an online manner and thus are useful for personalized recommendations. A closely related problem is the structured bandit problem [14–17], in which the context θ (age/income/occupational features) is hidden but the mean rewards for different versions of the ad (arms) as a function of the hidden context θ are known.



Figure 1: The ratings of a user corresponding to different versions of the same ad are likely to be correlated. For example, if a person likes the first version, there is a good chance that they will also like the second, as it also has George Clooney in it. However, the population composition is unknown, i.e., the fraction of people liking the first/second or the last version is unknown. Photos of advertisement taken from https://www.nespresso.com.

Such models prove useful for personalized recommendation in which the context of the user is unknown, but the reward mappings µk(θ) are known through surveyed data.

Global Recommendations using Correlated-Reward Bandits. In this work we study a variant of the classical multi-armed bandit problem in which the rewards corresponding to different arms are correlated with each other. In many practical settings, the rewards we get from different arms at any given step are likely to be correlated. In the ad-selection example given in Figure 1, a user reacting positively (by clicking, ordering, etc.) to the first version of the ad with George Clooney might also be more likely to click the second version that also has Clooney; of course, one can construct examples where there is negative correlation between click events on different ads. The model we study in this paper explicitly captures these correlations. Similar to the classical MAB setting, the goal here is to display versions of the ad that maximize user engagement. Unlike contextual bandits, we do not observe the context (age/occupation/income) features of the user and do not focus on providing personalized recommendations. Instead, our goal is to provide global recommendations to a population whose demographics are unknown. Unlike structured bandits, we do not assume that the mean rewards are functions of a hidden context parameter θ. In structured bandits, although the mean rewards depend on θ, the reward realizations can still be independent.

1.2 Summary of Main Results.

Model overview. Motivated by the presence of correlation in user choices in multi-armed bandit environments, we study a multi-armed bandit problem that explicitly models correlations among rewards. These correlations are captured in the form of pseudo-rewards, which provide upper bounds on the conditional expectation of rewards. For example, in the context of displaying ad versions, where the user either likes or dislikes a version, pseudo-rewards represent an upper bound on the chances that the user likes version B of the ad given that they liked/disliked version A. We show that the knowledge of such bounds, even when they are not all tight, can lead to significant improvement in the cumulative reward obtained, by reducing the amount of exploration compared to classical MAB algorithms. Figure 2 presents an illustration of our correlation model, where the pseudo-rewards, denoted by sℓ,k(r), provide an upper bound on the reward that we could have received from arm ℓ given that pulling arm k led to a reward of r.

Figure 2: Upon observing a reward r from an arm k, the pseudo-reward sℓ,k(r) gives us an upper bound on the conditional expectation of the reward from arm ℓ given that we observed reward r from arm k. These pseudo-rewards model the correlation in rewards corresponding to different arms.

Pseudo-rewards in practice. The pseudo-rewards sℓ,k(r) can be obtained through domain knowledge or from historical data.

Pseudo-rewards from domain knowledge. For instance, in the context of medical testing, where the goal is to identify the best drug to treat an ailment from among a set of K possible options, the effectiveness of two drugs is correlated when the drugs share some common ingredients. Through the domain knowledge of doctors, it is possible to answer questions such as "what are the chances that drug B would be effective given that drug A was not effective?", through which we can infer the pseudo-rewards.

Pseudo-rewards from surveyed data. In the context of displaying advertisements, such correlations can be learned through surveys in which a user is asked to rate different versions of the ad in an experimental setup. Once these pseudo-rewards are learned, the company can then use them to perform ad-selection at a global level (where the reward distributions corresponding to different arms are unknown). However, the correlations will still be present because of the inherent similarities and differences between the different versions of the ad.

A key advantage of our problem setup is that these pseudo-rewards are just upper bounds on the conditional expected rewards and can be arbitrarily loose. The proposed algorithm adapts accordingly and, in any case, performs at least as well as the classical bandit algorithms. This also makes the model practical: if some pseudo-rewards are unknown due to a lack of domain knowledge/data, they can simply be replaced by the maximum possible reward, which serves as a natural upper bound.

Algorithm Overview. We use the knowledge of pseudo-rewards to extend any classical bandit strategy to the correlated MAB setting.


To do so, in each round t, the algorithm performs the following three steps.

(1) Select the arm kmax that has been pulled the most number of times so far, until step t − 1.

(2) Identify the set At of arms that are empirically competitive with respect to kmax, that is, arms that have empirical mean pseudo-rewards larger than the empirical mean reward of arm kmax.

(3) Use a classical multi-armed bandit algorithm (for example, UCB or Thompson Sampling) over the reduced set of arms At ∪ {kmax} to determine the arm that is pulled in round t.

We refer to this algorithm as C-Bandit, where Bandit refers to the classical bandit algorithm used in the last step of the algorithm (i.e., UCB/TS/KL-UCB).

Regret Analysis and the Notion of Competitive Arms. Through a regret analysis of C-UCB and C-TS, we obtain the following upper bound on their expected regret.

Proposition 1.1 (Upper Bound on Expected Regret). The expected cumulative regret of the C-UCB and C-TS algorithms is upper bounded as

E[Reg(T)] ≤ C · O(log T) + O(1). (1)

Here C denotes the number of competitive arms. We call an arm k competitive if the expected pseudo-reward of arm k with respect to the optimal arm k∗ is larger than the mean reward of arm k∗. Formally, an arm k is competitive if E[sk,k∗(r)] ≥ µk∗, and the arm is said to be non-competitive otherwise. The result in Proposition 1.1 arises from the fact that the C-UCB and C-TS algorithms end up pulling the non-competitive arms only O(1) times, and only the competitive arms are pulled O(log T) times. In contrast to UCB/TS, which pull all K − 1 sub-optimal arms O(log T) times, our proposed C-UCB and C-TS algorithms pull only C ≤ K − 1 arms O(log T) times. In this sense, we reduce a K-armed bandit problem to a C+1-armed bandit problem. We emphasize that k∗, µk∗ and C are all unknown to the algorithm at the beginning. In fact, when C = 0, our proposed algorithms achieve bounded regret, meaning that after some finite step, no arm but the optimal one will be selected.

Simulations and Experiments. Figure 3 illustrates the performance of C-UCB and C-TS relative to UCB in a correlated multi-armed bandit setting with three arms. The value of C depends on the underlying hidden joint probability distribution; we show a setup where C = 0 in Figure 3(a), C = 1 in Figure 3(b) and C = 2 in Figure 3(c). We see that when C = 0, our proposed algorithms achieve bounded regret. In Figure 3(b), we see a reduction in regret over UCB as only one arm is pulled O(log T) times by C-UCB and C-TS, and in Figure 3(c) we see performance of C-UCB similar to UCB as both sub-optimal arms are competitive. We extensively validate our results by performing experiments on two real-world datasets, namely Movielens and Goodreads, which show that the proposed approach yields drastically smaller regret than classical multi-armed bandit strategies.

Figure 3: The cumulative regret of C-UCB and C-TS depends on the number of competitive arms, i.e., C. C itself depends on the unknown joint probability distribution of rewards and is not known beforehand. We consider a setup where C = 0 in (a), C = 1 in (b) and C = 2 in (c).

1.3 Key Contributions.

i) A General and Previously Unexplored Correlated Multi-Armed Bandit Model. In Section 2 we describe our novel correlated multi-armed bandit model, in which the rewards of a user corresponding to different arms are correlated with each other. This correlation is captured by the knowledge of pseudo-rewards, which are upper bounds on the conditional mean reward of arm ℓ given the reward of arm k. While the pseudo-rewards are known, they can be arbitrarily loose. Thus, our framework captures very general settings where only partial information about correlations is available.

ii) An approach to generalize algorithms to the Correlated MAB setting. We propose a novel approach in Section 3 that extends any classical bandit algorithm (such as UCB, TS, KL-UCB, etc.) to the correlated MAB setting studied in this paper. This is done by identifying some arms as empirically non-competitive in each round from the samples obtained so far. The empirically non-competitive arms are likely to be sub-optimal, and hence our algorithm focuses on picking one of the empirically competitive arms through any classical bandit algorithm of choice. Being able to choose any bandit algorithm for selection among the empirically competitive arms allows us to leverage algorithms such as Thompson Sampling and KL-UCB that are known to outperform UCB.

iii) Unified regret analysis. We present our regret bounds and analysis in Section 4. A rigorous analysis of the regret achieved under both C-UCB and C-TS is given through a unified technique. This technique can be of broad interest since it provides a recipe to obtain a regret analysis for any C-Bandit algorithm. Our regret bounds for C-UCB and C-TS show that the non-competitive arms are pulled only O(1) times, as opposed to the O(log T) times typical in bandit problems. As a result, only C ≤ K − 1 (competitive) arms are pulled O(log T) times, leading to a significant reduction in regret relative to UCB or TS, which pull each of the K − 1 sub-optimal arms O(log T) times.

iv) Evaluation using real-world datasets. We also perform simulations to validate our theoretical results in Section 5 and show performance in the special case where rewards are correlated through a hidden random variable X. Our experimental results, given in Section 6 on the Movielens [18] and Goodreads [19] datasets, demonstrate the applicability of our C-Bandit approach in practical settings. In particular, they demonstrate how the pseudo-rewards can be learned in practice. The results show significant improvement over the performance of classical bandit approaches for recommendation system applications.


Pseudo-rewards:
r  s2,1(r)        r  s1,2(r)
0  0.7            0  0.8
1  0.4            1  0.5

(a)      R1 = 0  R1 = 1        (b)      R1 = 0  R1 = 1
R2 = 0   0.2     0.4           R2 = 0   0.2     0.3
R2 = 1   0.2     0.2           R2 = 1   0.4     0.1

Table 1: The top row shows the pseudo-rewards of arms 1 and 2, i.e., upper bounds on the conditional expected rewards (which are known to the player). The bottom row depicts two possible joint probability distributions (unknown to the player). Under distribution (a), Arm 1 is optimal, whereas Arm 2 is optimal under distribution (b).


2 PROBLEM FORMULATION

2.1 Correlated Multi-Armed Bandit Model

Consider a multi-armed bandit setting with K arms {1, 2, . . . , K}. At each round t, a user enters the system and we need to decide an arm kt to display to the user. Upon displaying arm kt, we receive a random reward Rkt ∈ [0, B]. Our goal is to maximize the cumulative reward over time. The expected reward (over the population of users) of arm k is denoted by µk. If we knew the arm with the highest mean, i.e., k∗ = arg maxk∈K µk, beforehand, then we would always pull arm k∗ to maximize the expected cumulative reward. We now define the cumulative regret, minimizing which is equivalent to maximizing cumulative reward:

Reg(T) = Σ_{t=1}^{T} (µk∗ − µkt) = Σ_{k≠k∗} nk(T) ∆k. (2)

Here, nk(T) denotes the number of times sub-optimal arm k is pulled over T rounds, and ∆k denotes the sub-optimality gap of arm k, i.e., ∆k = µk∗ − µk.
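As a concrete illustration of (2), the following minimal sketch (our own simulation bookkeeping, not part of any algorithm in the paper) computes the cumulative regret of a sequence of pulls given the true mean rewards, which are of course unknown to the player.

import numpy as np

def cumulative_regret(pulls, mu):
    """Compute Reg(t) = sum over rounds of (mu[k*] - mu[k_t]) for a pull sequence.

    pulls : sequence of arm indices k_1, ..., k_T
    mu    : array of true mean rewards mu[k] (used only for evaluation)
    """
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu[np.asarray(pulls)]   # Delta_{k_t} for every round t
    return gaps.cumsum()                      # Reg(1), ..., Reg(T)

# Example: 3 arms with means (0.6, 0.4, 0.3); arm 0 is optimal.
print(cumulative_regret([0, 1, 2, 0, 0], [0.6, 0.4, 0.3])[-1])   # 0.2 + 0.3 = 0.5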

The classical multi-armed bandit setting assumes the rewards to be independent across arms. More formally, Pr(Rℓ = rℓ | Rk = r) = Pr(Rℓ = rℓ) for all rℓ, r. Consequently, E[Rℓ | Rk = r] = E[Rℓ] for all r. However, in most practical scenarios this assumption is unlikely to be true. In fact, the rewards of a user corresponding to different arms are likely to be correlated. Motivated by this, we consider a setup where the conditional distribution of the reward from arm ℓ given the reward from arm k is not equal to the marginal distribution of the reward from arm ℓ, i.e., fRℓ|Rk(rℓ | rk) ≠ fRℓ(rℓ), with fRℓ(rℓ) denoting the probability distribution function of the reward from arm ℓ. Consequently, due to such correlations, we have E[Rℓ | Rk] ≠ E[Rℓ].

In our problem setting, the rewards obtained from a user corresponding to different arms are correlated, and this correlation is modeled by the knowledge of pseudo-rewards, which constitute upper bounds on the conditional expected rewards.

r  s2,1(r)  s3,1(r)      r  s1,2(r)  s3,2(r)      r  s1,3(r)  s2,3(r)
0  0.7      2            0  0.5      1.5          0  1.5      2
1  0.8      1.2          1  1.3      2            1  2        1.3
2  2        1            2  2        0.8          2  0.7      0.75

Table 2: If some pseudo-reward entries are unknown (due to lack of prior knowledge/data), those entries can be replaced with the maximum possible reward and then used in the C-BANDIT algorithm. We do that here by entering 2 for the entries where pseudo-rewards are unknown.

Definition 1 (Pseudo-Reward). Suppose we pull arm k and observe reward r. Then the pseudo-reward of arm ℓ with respect to arm k, denoted by sℓ,k(r), is an upper bound on the conditional expected reward of arm ℓ, i.e.,

E[Rℓ | Rk = r] ≤ sℓ,k(r). (3)

2.2 Illustration

Consider the example shown in Table 1, where we have a 2-armed bandit problem in which the reward is either 0 or 1. Table 1 shows example values of the pseudo-rewards for this problem. While the pseudo-rewards are known, the underlying joint probability distribution is unknown. For instance, when the joint probability distribution is as shown in Table 1(a), Arm 1 is optimal, whereas Arm 2 is optimal if the joint probability distribution is as shown in Table 1(b).

In practice, these pseudo-rewards can be learned from prior available data, or through offline surveys in which users are presented with all K arms, allowing us to sample R1, . . . , RK jointly. Through such data, we can evaluate an estimate of the conditional expected rewards. For example, in Table 1, we can look at all users who obtained 0 reward for Arm 1 and calculate their average reward for Arm 2, say µ2,1(0). This average provides an estimate of the conditional expected reward. If the training data is large, one can use this value directly as s2,1(0) because, by the law of large numbers, the empirical average approaches E[R2 | R1 = 0]. Since we only need an upper bound on E[R2 | R1 = 0], we can use several approaches to construct the pseudo-rewards. For example, we can set s2,1(0) = µ2,1(0) + σ2,1(0), with σ2,1(0) denoting the empirical standard deviation of the conditional reward of Arm 2. In addition, the pseudo-reward for any unknown conditional mean reward can be filled in with the maximum possible reward for the corresponding arm. Table 2 shows an example of a 3-armed bandit problem where some pseudo-reward entries are unknown, e.g., due to lack of data. We can fill these missing entries with the maximum possible reward (i.e., 2) as shown in Table 2 to complete the pseudo-reward entries.
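The following sketch illustrates one way such estimation could be implemented; it is our own illustration of the recipe above (empirical conditional mean plus one standard deviation as the bound, maximum reward for entries with no data), and the function and variable names are hypothetical rather than the paper's.

import numpy as np

def estimate_pseudo_rewards(samples, reward_values, r_max):
    """Estimate pseudo-rewards s_{l,k}(r) from jointly surveyed rewards.

    samples       : (N, K) array; row i holds user i's reward for every arm.
    reward_values : possible reward realizations r (assumed discrete here).
    r_max         : maximum possible reward, used when no data is available.
    Returns a dict mapping (l, k, r) -> estimated s_{l,k}(r).
    """
    samples = np.asarray(samples, dtype=float)
    K = samples.shape[1]
    s = {}
    for k in range(K):
        for r in reward_values:
            rows = samples[samples[:, k] == r]          # surveyed users with R_k = r
            for l in range(K):
                if l == k:
                    continue
                if len(rows) == 0:
                    s[(l, k, r)] = r_max                # no data: fall back to the max reward
                else:
                    cond = rows[:, l]
                    # empirical conditional mean + one standard deviation, capped at r_max
                    s[(l, k, r)] = min(cond.mean() + cond.std(), r_max)
    return s

# Tiny illustration with binary rewards as in Table 1 (four surveyed users, made up here).
survey = [[0, 1], [0, 0], [1, 0], [1, 1]]
print(estimate_pseudo_rewards(survey, [0, 1], r_max=1))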

Remark 1 (Reduction to Classical Multi-Armed Bandits). When all pseudo-reward entries are unknown, all entries can be filled with the maximum possible reward for each arm. In such a case, the problem framework studied in this paper reduces to the setting of the classical multi-armed bandit problem, and our proposed C-Bandit algorithms perform exactly as standard bandit algorithms (e.g., UCB, TS).


Figure 4: Rewards for different arms are correlated through a hidden random variable X. At each round, X takes a realization in X. The reward obtained from arm k is Yk(X). The figure illustrates lower and upper bounds on Yk(X) (through dotted lines). For instance, when X takes the realization 1, the reward of arm 3 is a random variable bounded between 1 and 3.


2.3 Special Case: Correlated Bandits with a Latent Random Source

Our proposed correlated multi-armed bandit framework subsumes many interesting and previously unexplored multi-armed bandit settings. One such special case is the correlated multi-armed bandit model where the rewards depend on a common latent source of randomness. More concretely, the rewards of different arms are correlated through a hidden random variable X (see Figure 4). At each round t, X takes an i.i.d. realization Xt ∈ X (unobserved by the player) and, upon pulling arm k, we observe a random reward Yk(Xt). The latent random variable X here could represent the features (i.e., age/occupation, etc.) of the user arriving in the system, to whom we show one of the K arms. These features of the user are hidden in the problem due to privacy concerns. For the application of ad-selection, the random reward Yk(Xt) represents the preference of a user with context Xt for the kth version of the ad.

In this problem setup, upper and lower bounds on Yk(X), namely ḡk(X) and gk(X), are known. For instance, the information on upper and lower bounds of Yk(Xt) could represent knowledge of the form that children of age 5-10 rate documentaries only in the range 1-3 out of 5. Such information can be known or learned through prior available data. While the bounds on Yk(X) are known, the distribution of X and the reward distributions within the bounds are unknown, due to which the optimal arm is not known beforehand. Thus, an online approach is needed to minimize the regret. We now show how this setting is covered by our general framework, which allows us to use the algorithms proposed in this paper to solve the correlated multi-armed bandit problem with a latent random source.

It is possible to translate this setting to the general framework described above by transforming the mappings Yk(X) into pseudo-rewards sℓ,k(r). Recall that the pseudo-rewards represent an upper bound on the conditional expectation of the rewards.

Figure 5: An illustration of how to calculate pseudo-rewards in correlated MAB with a latent random source. Upon observing a reward of 4 from arm 1, we can see that the maximum possible reward for arms 2 and 3 is 3.5 and 4 respectively. Therefore, s2,1(4) = 3.5 and s3,1(4) = 4.

In this framework, sℓ,k(r) can be calculated as

sℓ,k(r) = max_{x : gk(x) < r < ḡk(x)} ḡℓ(x),

where ḡk(x) and gk(x) denote the upper and lower bounds on Yk(x), respectively. Upon observing a reward realization from arm k, it is possible to evaluate the maximum possible reward that could have been obtained from arm ℓ through the knowledge of the bounds on Yk(X).

Figure 5 illustrates how pseudo-reward is evaluated when weobtain a reward r = 4 by pulling arm 1. We first infer that X lies in[0, 0.8] if r = 4 and then find the maximum possible reward for arm2 and arm 3 in [0, 0.8]. Once these pseudo-rewards are constructed,the problem fits in the general framework described in this paperand we can use the algorithms proposed for this setting directly.
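A minimal sketch of this construction, assuming the bound functions are available as Python callables and the latent variable X is discretized on a grid; the names, the grid-based search, and the non-strict boundary comparison are our own illustrative choices rather than the paper's implementation.

import numpy as np

def pseudo_reward_from_bounds(r, k, ell, g_upper, g_lower, x_grid):
    """s_{ell,k}(r): max of arm ell's upper bound over latent values x consistent with reward r from arm k.

    g_upper, g_lower : lists of callables giving the known upper/lower bounds on Y_k(x).
    x_grid           : grid of candidate realizations of the latent variable X.
    """
    # Non-strict comparisons are used here so that boundary observations stay consistent.
    consistent = [x for x in x_grid if g_lower[k](x) <= r <= g_upper[k](x)]
    if not consistent:
        # Observed reward rules out every grid point; fall back to the global upper bound.
        return max(g_upper[ell](x) for x in x_grid)
    return max(g_upper[ell](x) for x in consistent)

# Example in the spirit of Section 5.2: Y1(X) in [2x-1, 2x+1], Y2(X) in [(3-x)^2-1, (3-x)^2+1].
g_up = [lambda x: 2 * x + 1, lambda x: (3 - x) ** 2 + 1]
g_lo = [lambda x: 2 * x - 1, lambda x: (3 - x) ** 2 - 1]
xs = np.linspace(0, 6, 601)
print(pseudo_reward_from_bounds(4.0, 0, 1, g_up, g_lo, xs))   # upper bound on E[Y2 | Y1 = 4]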

2.4 Comparison with parametric (structured) models

As mentioned in Section 1, a seemingly related model is the structured bandit model [14, 15, 20]. Structured bandits are a class of problems that cover linear bandits [16], generalized linear bandits [21], Lipschitz bandits [22], global bandits [23], regional bandits [24], etc. In the structured bandit setup, the mean rewards corresponding to different arms are related to one another through a hidden parameter θ. The underlying value of θ is fixed and the mean reward mappings θ → µk(θ) are known. Similarly, [25] studies a dependent-armed bandit problem in which the mean rewards corresponding to different arms are also related to one another. It considers a parametric model, where the mean rewards of different arms are drawn from one of the K clusters, each having an unknown parameter πi. All of these models are fundamentally different from the problem setting considered in this paper. We list some of the differences with structured bandits (and the model in [25]) below.

(1) In this work we explicitly model the correlations in the rewards of a user corresponding to different arms. While the mean rewards are related to each other in structured bandits and [25], the reward realizations are not necessarily correlated.


(2) Another key difference is that the model studied here is non-parametric in the sense that there is no hidden feature spaceas is the case in structured bandits and [25].

(3) An important point to highlight is that the reward mappings from θ to µk(θ) in the structured bandit setup need to be exact. If they happen to be incorrect, then the algorithms for structured bandits cannot be used, as they rely on the correctness of µk(θ) to construct confidence intervals on the unknown parameter θ. In contrast, the model studied here relies on the pseudo-rewards being upper bounds on the conditional expectations. These bounds need not be tight, and the proposed C-Bandit algorithms adjust accordingly and perform at least as well as the corresponding classical bandit algorithm. In other words, our approach can be useful in any setting where prior data is available with given confidence intervals (which can then be converted into upper bounds on the conditional mean rewards), while the structured setting requires exact mean values to be known.

3 THE PROPOSED C-BANDIT ALGORITHMS

We now propose an approach that extends the classical multi-armed bandit algorithms (such as UCB, Thompson Sampling, KL-UCB) to the correlated MAB setting. At each round t + 1, the UCB algorithm [26] selects the arm with the highest UCB index Ik,t, i.e.,

kt+1 = arg max_{k∈K} Ik,t,   Ik,t = µk(t) + B √(2 log(t) / nk(t)), (4)

where µk(t) is the empirical mean of the rewards received from arm k until round t, and nk(t) is the number of times arm k is pulled till round t. The second term in the UCB index causes the algorithm to explore arms that have been pulled only a few times (small nk(t)). Recall that we assume all rewards to be bounded within an interval of size B. When the index t is implied by context, we abbreviate µk(t) and Ik(t) to µk and Ik respectively in the rest of the paper.

Under Thompson Sampling [27], the arm kt+1 = arg max_{k∈K} Sk,t is selected at time step t + 1. Here, Sk,t is the sample obtained from the posterior distribution of µk. That is,

kt+1 = arg max_{k∈K} Sk,t,   Sk,t ∼ N(µk(t), βB / (nk(t) + 1)). (5)
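For concreteness, here is a small sketch (our own, with made-up statistics) of the two quantities maintained by the base algorithms in (4) and (5): the UCB index and the Gaussian Thompson sample.

import numpy as np

rng = np.random.default_rng(0)

def ucb_index(mu_hat, n, t, B=1.0):
    """UCB index of (4): empirical mean plus an exploration bonus."""
    if n == 0:
        return np.inf                         # unexplored arms are pulled first
    return mu_hat + B * np.sqrt(2.0 * np.log(t) / n)

def thompson_sample(mu_hat, n, B=1.0, beta=1.0):
    """Gaussian Thompson sample of (5): N(mu_hat, beta*B/(n+1)); numpy takes the std, hence the sqrt."""
    return rng.normal(mu_hat, np.sqrt(beta * B / (n + 1)))

# One decision step over K = 3 arms with illustrative statistics.
mu_hat = np.array([0.55, 0.40, 0.48]); n = np.array([30, 10, 5]); t = 45
print(np.argmax([ucb_index(m, c, t) for m, c in zip(mu_hat, n)]))      # arm chosen by UCB
print(np.argmax([thompson_sample(m, c) for m, c in zip(mu_hat, n)]))   # arm chosen by TS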

In the correlated MAB framework, the rewards observed from one arm can help estimate the rewards from other arms. Our key idea is to use this information to reduce the amount of exploration required. We do so by evaluating the empirical pseudo-reward of every other arm ℓ with respect to an arm k. If this pseudo-reward is smaller than the empirical reward of arm k, then arm ℓ is considered to be empirically non-competitive with respect to arm k, and we do not consider it as a candidate in UCB/Thompson Sampling/any other bandit algorithm.

We define the notion of empirically competitive arms in Section 3.2 and then describe how we modify the classical bandit algorithms to operate in the considered correlated MAB setting in Section 3.3.

3.1 Empirical and Expected Pseudo-Rewards

In our correlated MAB framework, the pseudo-reward of arm ℓ with respect to arm k provides us an estimate of the reward of arm ℓ through the reward samples obtained from arm k. We now define the notion of empirical pseudo-reward, which can be used to obtain an optimistic estimate of µℓ using just the reward samples of arm k.

Definition 2 (Empirical and Expected Pseudo-Reward). After t rounds, arm k has been pulled nk(t) times. Using these nk(t) reward realizations, we can construct the empirical pseudo-reward ϕℓ,k(t) for each arm ℓ with respect to arm k as follows:

ϕℓ,k(t) ≜ ( Σ_{τ=1}^{t} 1{kτ = k} sℓ,k(rτ) ) / nk(t),   ℓ ∈ {1, . . . , K} \ {k}. (6)

The expected pseudo-reward of arm ℓ with respect to arm k is defined as

ϕℓ,k ≜ E[sℓ,k(r)]. (7)

Observe that E[sℓ,k(r)] ≥ E[E[Rℓ | Rk = r]] = µℓ. Due to this, the empirical pseudo-reward ϕℓ,k(t) can be used to obtain an estimated upper bound on µℓ. Note that the empirical pseudo-reward ϕℓ,k(t) is defined with respect to arm k and is only a function of the rewards observed by pulling arm k.
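A short sketch, in our own notation, of how ϕℓ,k(t) in (6) could be computed from the pull history; an actual implementation would update these averages incrementally, as Algorithm 1 does.

import numpy as np

def empirical_pseudo_rewards(history, s, K):
    """Compute phi_{l,k}(t) of (6) from history = [(k_tau, r_tau), ...].

    s : dict mapping (l, k) -> callable r -> s_{l,k}(r).
    Entries stay at infinity until arm k has been pulled at least once.
    """
    phi = np.full((K, K), np.inf)             # phi[l, k]
    for k in range(K):
        rewards_k = [r for (arm, r) in history if arm == k]
        if not rewards_k:
            continue
        for l in range(K):
            if l != k:
                phi[l, k] = np.mean([s[(l, k)](r) for r in rewards_k])
    return phi

# Example: the 2-armed setting of Table 1 (Arms 1 and 2 become indices 0 and 1) after three pulls.
s = {(1, 0): lambda r: 0.7 if r == 0 else 0.4,    # s_{2,1}(r)
     (0, 1): lambda r: 0.8 if r == 0 else 0.5}    # s_{1,2}(r)
print(empirical_pseudo_rewards([(0, 1), (0, 0), (1, 1)], s, K=2))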

3.2 Competitive and Non-competitive arms with respect to Arm k

Using the pseudo-reward estimates defined above, we can classify each arm ℓ ≠ k as competitive or non-competitive with respect to arm k. To this end, we first define the notion of the pseudo-gap.

Definition 3 (Pseudo-Gap). The pseudo-gap ∆ℓ,k of arm ℓ with respect to arm k is defined as

∆ℓ,k ≜ µk − ϕℓ,k, (8)

i.e., the difference between the expected reward of arm k and the expected pseudo-reward of arm ℓ with respect to arm k.

From the definition of pseudo-reward, it follows that the expected pseudo-reward ϕℓ,k is greater than or equal to the expected reward µℓ of arm ℓ. Thus, a positive pseudo-gap ∆ℓ,k > 0 indicates that it is possible to classify arm ℓ as sub-optimal using only the rewards observed from arm k (with high probability as the number of pulls of arm k gets large); thus, arm ℓ need not be explored. Such arms are called non-competitive, as we define below.

Definition 4 (Competitive and Non-Competitive arms). An arm ℓ is said to be non-competitive if its pseudo-gap with respect to the optimal arm k∗ is positive, that is, ∆ℓ,k∗ > 0. Similarly, an arm ℓ is said to be competitive if ∆ℓ,k∗ < 0. The unique best arm k∗ has ∆k∗,k∗ = 0 and is not counted in the set of competitive arms.

Since the reward distribution of each arm is unknown, we cannot find the pseudo-gap of each arm and thus have to resort to empirical estimates based on observed rewards. In our algorithm, we use a noisy notion of the competitiveness of an arm, defined as follows. Note that since the optimal arm k∗ is also not known, the empirical competitiveness of an arm ℓ is defined with respect to each of the other arms k ≠ ℓ.


Definition 5 (Empirically Competitive and Non-Competitive arms). An arm ℓ is said to be "empirically non-competitive with respect to arm k at round t" if its empirical pseudo-reward is less than the empirical reward of arm k, that is, µk(t) − ϕℓ,k(t) > 0. Similarly, an arm ℓ ≠ k is deemed empirically competitive with respect to arm k at round t if µk(t) − ϕℓ,k(t) ≤ 0.
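Definition 5 written out as a small helper (our own sketch); it returns the set of empirically competitive arms with respect to a given arm, which is exactly the set At used in step 2 of the C-Bandit approach described in Section 3.3.

import numpy as np

def empirically_competitive_set(mu_hat, phi, k_max):
    """Arms whose empirical pseudo-reward w.r.t. k_max is at least the empirical mean of k_max (Definition 5)."""
    K = len(mu_hat)
    return {l for l in range(K) if l != k_max and phi[l, k_max] >= mu_hat[k_max]}

# Toy example with 3 arms; arm 0 has been pulled the most.
mu_hat = np.array([0.6, 0.3, 0.2])
phi = np.array([[np.inf, 0.9, 0.8],      # phi[l, k]: row l, column k
                [0.65, np.inf, 0.7],
                [0.5, 0.6, np.inf]])
print(empirically_competitive_set(mu_hat, phi, k_max=0))   # {1}: 0.65 >= 0.6 while 0.5 < 0.6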

3.3 The C-Bandit Algorithm

The central idea in our correlated C-BANDIT approach is that after pulling the optimal arm k∗ a sufficiently large number of times, the non-competitive (and thus sub-optimal) arms can be classified as empirically non-competitive with increasing confidence, and thus need not be explored. As a result, the non-competitive arms will be pulled only O(1) times. However, the competitive arms cannot be discerned as sub-optimal by just using the rewards observed from the optimal arm, and have to be explored O(log T) times each. Thus, we are able to reduce a K-armed bandit problem to a C+1-armed bandit problem, where C is the number of competitive arms.

Using this idea, our C-BANDIT approach proceeds as follows. After every round t, we maintain the empirical reward µk(t) for each arm k. These empirical estimates are based on the nk(t) reward samples that have been observed for arm k till round t. In addition to this, we maintain the empirical pseudo-reward ϕℓ,k(t) of arm ℓ with respect to arm k, for all pairs of arms (ℓ, k). In each round t, the algorithm performs the following steps:

(1) Select the arm kmax = arg maxk nk(t − 1) that has been pulled the most until round t − 1.

(2) Identify empirically competitive arms At: identify the set At of arms that are empirically competitive with respect to arm kmax, i.e.,

At = {k ∈ K : µkmax ≤ ϕk,kmax}.

(3) Play the BANDIT algorithm in At ∪ {kmax}: for instance, C-UCB pulls the arm

kt = arg max_{k ∈ At ∪ {kmax}} Ik,t−1,

where Ik,t−1 is the UCB index defined in (4). Similarly, C-TS pulls the arm

kt = arg max_{k ∈ At ∪ {kmax}} Sk,t−1,

where Sk,t is the Thompson sample defined in (5).

(4) Update the empirical pseudo-rewards ϕℓ,kt(t) for all ℓ, and the empirical reward of arm kt.

In step 1, we choose the arm that has been pulled the most number of times because we have the maximum number of reward samples from this arm. Thus, it is likely to most accurately identify the non-competitive arms. This property enables the proposed algorithm to achieve an O(1) regret contribution from non-competitive arms, as we show in Section 4 below.

Note that our C-BANDIT approach allows using any classical multi-armed bandit algorithm in the correlated multi-armed bandit setting. This is important because some algorithms, such as Thompson Sampling and KL-UCB, are known to obtain much better empirical performance than UCB. Extending those to the correlated MAB setting allows us to retain this superior empirical performance over UCB even in the correlated setting. This benefit is demonstrated in our simulations and experiments described in Section 5 and Section 6.

Algorithm 1 C-UCB: Correlated UCB Algorithm

1: Input: Pseudo-rewards sℓ,k(r)
2: Initialize: nk = 0, Ik = ∞ for all k ∈ {1, 2, . . . , K}
3: for each round t do
4:   Find kmax = arg maxk nk(t − 1), the arm that has been pulled the most times until round t − 1
5:   Initialize the empirically competitive set At = {1, 2, . . . , K} \ {kmax}
6:   for k ≠ kmax do
7:     if µkmax > ϕk,kmax then
8:       Remove arm k from the empirically competitive set: At = At \ {k}
9:     end if
10:  end for
11:  Apply UCB1 over arms in At ∪ {kmax} by pulling arm kt = arg max_{k ∈ At ∪ {kmax}} Ik(t − 1)
12:  Receive reward rt, and update nkt = nkt + 1
13:  Update the empirical reward: µkt(t) = (µkt(t − 1)(nkt(t) − 1) + rt) / nkt(t)
14:  Update the UCB index: Ikt(t) = µkt + B √(2 log t / nkt)
15:  Update the empirical pseudo-rewards for all k ≠ kt: ϕk,kt(t) = Σ_{τ: kτ = kt} sk,kt(rτ) / nkt
16: end for

Algorithm 2 C-TS: Correlated TS Algorithm

1: Steps 1–10 as in C-UCB
2: Apply TS over arms in At ∪ {kmax} by pulling arm kt = arg max_{k ∈ At ∪ {kmax}} Sk,t, where Sk,t ∼ N(µk(t), βB / (nk(t) + 1))
3: Receive reward rt, and update nkt, µkt and the empirical pseudo-rewards ϕk,kt(t)
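For illustration, here is a compact Python sketch of Algorithms 1 and 2 under our own naming conventions; the data structures, the initial round-robin over the arms, and the Gaussian-posterior Thompson sampling variant are assumptions made for a self-contained example, not the authors' implementation.

import numpy as np

def c_bandit(pull_arm, s, K, T, B=1.0, base="ucb", beta=1.0, seed=0):
    """Sketch of C-UCB (base="ucb") and C-TS (base="ts").

    pull_arm : function k -> random reward in [0, B]
    s        : dict (l, k) -> callable r -> pseudo-reward s_{l,k}(r), for all pairs l != k
    Returns the list of pulled arms.
    """
    rng = np.random.default_rng(seed)
    n = np.zeros(K, dtype=int)                   # n_k(t)
    mu_hat = np.zeros(K)                         # empirical mean rewards
    phi_sum = np.zeros((K, K))                   # running sums of pseudo-rewards, phi_sum[l, k]
    pulls = []
    for t in range(1, T + 1):
        if t <= K:
            k_t = t - 1                          # pull every arm once to initialize the statistics
        else:
            k_max = int(np.argmax(n))            # step 1: arm pulled the most so far
            phi = phi_sum[:, k_max] / n[k_max]   # empirical pseudo-rewards w.r.t. k_max
            cand = [k for k in range(K)          # step 2: empirically competitive arms, plus k_max
                    if k == k_max or phi[k] >= mu_hat[k_max]]
            if base == "ucb":                    # step 3: base bandit algorithm on the reduced set
                idx = [mu_hat[k] + B * np.sqrt(2 * np.log(t) / n[k]) for k in cand]
            else:
                idx = [rng.normal(mu_hat[k], np.sqrt(beta * B / (n[k] + 1))) for k in cand]
            k_t = cand[int(np.argmax(idx))]
        r = pull_arm(k_t)                        # step 4: observe reward and update all statistics
        mu_hat[k_t] = (mu_hat[k_t] * n[k_t] + r) / (n[k_t] + 1)
        n[k_t] += 1
        for l in range(K):
            if l != k_t:
                phi_sum[l, k_t] += s[(l, k_t)](r)
        pulls.append(k_t)
    return pulls

Note that if every pseudo-reward is set to the maximum reward B, the reduced set always contains all arms and this sketch degenerates to plain UCB/TS, consistent with Remark 1.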


4 REGRET ANALYSIS AND BOUNDS

We now characterize the performance of the C-UCB algorithm by analyzing the expected value of the cumulative regret (2). The expected regret can be expressed as

E[Reg(T)] = Σ_{k=1}^{K} E[nk(T)] ∆k, (9)

where ∆k = µk∗ − µk is the sub-optimality gap of arm k with respect to the optimal arm k∗, and nk(T) is the number of times arm k is pulled in T slots.

For the regret analysis, we assume without loss of generality that the rewards are between 0 and 1 for all k ∈ {1, 2, . . . , K}. Note that the C-Bandit algorithms do not require this condition, and the regret analysis can also be generalized to any bounded rewards.


4.1 Regret Bounds

In order to bound E[Reg(T)] in (9), we can analyze the expected number of times sub-optimal arms are pulled, that is, E[nk(T)] for all k ≠ k∗. Theorem 1 and Theorem 2 below show that E[nk(T)] scales as O(1) and O(log T) for non-competitive and competitive arms respectively. Recall that a sub-optimal arm is said to be non-competitive if its pseudo-gap ∆k,k∗ > 0, and competitive otherwise.

Theorem 1 (Expected Pulls of a Non-competitive Arm). The expected number of times a non-competitive arm with pseudo-gap ∆k,k∗ is pulled by C-UCB is upper bounded as

E[nk(T)] ≤ K t0 + K² Σ_{t=K t0}^{T} 3 (t/K)^{−2} + Σ_{t=1}^{T} t^{−3} (10)
         = O(1), (11)

and for C-TS it is bounded as

E[nk(T)] ≤ K tb + K² Σ_{t=K tb}^{T} ( 3 (t/K)^{−2} + (t/K)^{1−2β} ) + Σ_{t=1}^{T} t^{−3} (12)
         = O(1) for β > 1, (13)

where

t0 = inf{ τ ≥ 2 : ∆min, ∆k,k∗ ≥ 4 √(K log τ / τ) },
tb = inf{ τ ≥ exp(11β) : ∆min, ∆k,k∗ ≥ 6 √(2Kβ log τ / τ) }.

Theorem 2 (Expected Pulls of a Competitive Arm). The expected number of times a competitive arm is pulled by the C-UCB algorithm is upper bounded as

E[nk(T)] ≤ 8 log(T) / ∆k² + (1 + π²/3) + Σ_{t=1}^{T} t exp(−t ∆min² / (2K)) (14)
         = O(log T), where ∆min = mink ∆k > 0, (15)

and for C-TS it is bounded as

E[nk(T)] ≤ 18 log(T ∆k²) / ∆k² + exp(11β) + 18² / ∆k² + Σ_{t=1}^{T} t exp(−t ∆min² / (2K))
         = O(log T), where ∆min = mink ∆k > 0. (16)

Substituting the bounds on E[nk(T)] derived in Theorem 1 and Theorem 2 into (9), we get the following upper bound on the expected regret.

Corollary 1 (Upper Bound on Expected Regret). The expected cumulative regret of the C-UCB and C-TS algorithms is upper bounded as

E[Reg(T)] ≤ Σ_{k ∈ C} ∆k U_k^(c)(T) + Σ_{k′ ∈ {1,...,K} \ {C ∪ k∗}} ∆k′ U_k′^(nc)(T) (17)
          = C · O(log T) + O(1), (18)

where C ⊆ {1, . . . , K} \ {k∗} is the set of competitive arms with cardinality C, U_k^(c)(T) is the upper bound on E[nk(T)] for competitive arms given in Theorem 2, and U_k^(nc)(T) is the corresponding upper bound for non-competitive arms given in Theorem 1.

4.2 Proof Sketch

We now present an outline of our regret analysis of C-UCB and C-TS. A key strength of our analysis is that it can be extended very easily to any C-BANDIT algorithm. The results that are independent of the last step of the algorithm are presented in Appendix B, while the rigorous regret upper bounds for C-UCB and C-TS are presented in Appendices D and F.

There are three key components to prove the result in Theorem 1and Theorem 2. The first two components hold independent ofwhich bandit algorithm (UCB/TS/KL-UCB) is used for selecting thearm from the set of competitive arms, which makes our analysiseasy to extend to any C-BANDIT algorithm.

i) The probability of the optimal arm being identified as empirically non-competitive at round t (denoted by Pr(E1(t))) is small. In particular, we show that

Pr(E1(t)) ≤ t exp(−t ∆min² / (2K)).

This ensures that the optimal arm is identified as empirically non-competitive only O(1) times. We show that the number of times a competitive arm is pulled is bounded as

E[nk(T)] ≤ Σ_{t=1}^{T} ( Pr(E1(t)) + Pr(E1^c(t), kt = k, Ik,t−1 > Ik∗,t−1) ). (19)

The first term sums to a constant, while the second term is upper bounded by the number of times UCB pulls the sub-optimal arm k. Due to this, the upper bound on the number of pulls of a competitive arm by C-UCB / C-TS is only an additive constant more than the upper bound on the number of pulls of an arm by the UCB / TS algorithms, and hence we have the same pre-log constants in the upper bound on the pulls of competitive arms.

ii) The probability of identifying a non-competitive arm as empirically competitive jointly with the optimal arm being kmax(t) is small. Notice that the first two steps of our algorithm involve identifying kmax(t), the arm that has been pulled the most number of times so far, and eliminating arms that are empirically non-competitive with respect to kmax(t) for round t. We show that the joint probability that arm k∗ is kmax(t) and a non-competitive arm k is identified as empirically competitive is small. Formally,

Pr(kt+1 = k, k∗ = kmax(t)) ≤ t exp(−t ∆k,k∗² / (2K)). (20)

This occurs because upon obtaining a large number of samples of arm k∗, the expected reward of arm k∗ (i.e., µk∗) and the expected pseudo-reward of arm k with respect to arm k∗ (i.e., ϕk,k∗) can be estimated fairly accurately. Since the pseudo-gap of arm k is positive (i.e., µk∗ > ϕk,k∗), the probability that arm k is identified as empirically competitive is small.

An implication of (20) is that the expected number of times a non-competitive arm is identified as empirically competitive jointly with the optimal arm having the maximum number of pulls is bounded above by a constant.


p1(r)  r  s2,1(r)  s3,1(r)
0.2    0  0.7      2
0.2    1  0.8      1.2
0.6    2  2        1

Table 3: Suppose Arm 1 is optimal and its unknown probability distribution over rewards {0, 1, 2} is (0.2, 0.2, 0.6). Then µ1 = 1.4, while ϕ2,1 = 1.5 and ϕ3,1 = 1.2. Due to this, Arm 2 is competitive while Arm 3 is non-competitive.

iii) The probability that a sub-optimal arm is kmax at round t is small. Formally, we show that for C-UCB we have

Pr(k = kmax(t)) ≤ 3K (t/K)^{−2}   for all t > K t0, k ≠ k∗. (21)

This component of our analysis is specific to the classical bandit algorithm used in C-BANDIT. We show a similar result for C-TS rigorously in Lemma 10. Intuitively, a result of this kind should hold for any well-performing classical multi-armed bandit algorithm. We reach the result of (21) for C-UCB by showing that

Pr(kt+1 = k, nk(t) > t/(2K)) ≤ t^{−3}   for all t > t0, k ≠ k∗. (22)

The probability of selecting a sub-optimal arm k after it has been pulled significantly many times is small because, with more pulls, the exploration component in the UCB index of arm k becomes small, and consequently it is likely to be smaller than the UCB index of the optimal arm k∗ (as k∗ has a larger empirical mean reward or has been pulled fewer times). Our analysis in Lemma 8 shows how the result in (22) can be translated to obtain (21) (this translation is again not dependent on which bandit algorithm is used in C-BANDIT).

We show that the expected number of pulls of a non-competitive arm k can be bounded as

E[nk(T)] ≤ Σ_{t=1}^{T} ( Pr(kt+1 = k, k∗ = kmax(t)) + Pr(k∗ ≠ kmax(t)) ). (23)

The first term in (23) is O(1) due to (20) and the second term is O(1) due to (21). Refer to Appendices D and F for the rigorous regret analysis of C-UCB and C-TS.

4.3 Discussion on Regret Bounds

Competitive Arms. Recall that an arm k is said to be competitive if its expected pseudo-reward with respect to the optimal arm is at least the expected reward of arm k∗, i.e., ϕk,k∗ = E[sk,k∗(r)] ≥ µk∗. Since the distribution of the reward of each arm is unknown, the algorithm initially does not know which arms are competitive and which arms are non-competitive.

Reduction in effective number of arms. Interestingly, our result from Theorem 1 shows that the C-UCB and C-TS algorithms, which operate in a sequential fashion, make sure that non-competitive arms are pulled only O(1) times. Due to this, only the competitive arms are pulled O(log T) times. Moreover, the pre-log terms in the upper bounds of UCB and C-UCB (and correspondingly TS and C-TS) for these arms are the same. In this sense, our C-BANDIT approach reduces a K-armed bandit problem to a C+1-armed bandit problem. Effectively, only C ≤ K − 1 arms are pulled O(log T) times, while the other arms stop being pulled after a finite time.

Depending on the joint probability distribution, different arms can be optimal, competitive or non-competitive. Table 3 shows a case where Arm 1 is optimal and its reward distribution is (0.2, 0.2, 0.6), which leads to µ1 = 1.4 > ϕ3,1 = 1.2 and µ1 = 1.4 < ϕ2,1 = 1.5. Due to this, Arm 2 is competitive while Arm 3 is non-competitive.

Achieving Bounded Regret. If the set of competitive arms C is empty (i.e., the number of competitive arms C = 0), then our algorithm leads to an expected regret of O(1) (see (18)), instead of the typical O(log T) regret scaling in classical multi-armed bandits. One scenario in which this can happen is if the pseudo-rewards sk,k∗ of all arms with respect to the optimal arm k∗ match the conditional expectations. Formally, if sk,k∗(r) = E[Rk | Rk∗ = r] for all k, then E[sk,k∗(r)] = E[Rk] = µk < µk∗. Due to this, all arms are non-competitive and our algorithms achieve only O(1) regret. We now evaluate a lower bound for a special case of our model, where rewards are correlated through a latent random variable X as described in Section 2.3.

We present a lower bound on the expected regret for the model described in Section 2.3. Intuitively, if an arm ℓ is competitive, it cannot be deemed sub-optimal by only pulling the optimal arm k∗ infinitely many times. This indicates that exploration is necessary for competitive arms. The proof of this bound closely follows that of the 2-armed classical bandit problem [1]; i.e., we construct a new bandit instance under which a previously sub-optimal arm becomes optimal without affecting the reward distribution of any other arm.

Theorem 3 (Lower Bound for Correlated MAB with latent random source). For any algorithm that achieves sub-polynomial regret, the expected cumulative regret for the model described in Section 2.3 is lower bounded as

lim inf_{T→∞} E[Reg(T)] / log(T) ≥ max_{k ∈ C} ∆k / D(fRk || f̃Rk)   if C > 0,
lim inf_{T→∞} E[Reg(T)] / log(T) ≥ 0   if C = 0. (24)

Here fRk is the reward distribution of arm k, which is linked with fX since Rk = Yk(X). The term f̃Rk represents the reward distribution of arm k in the new bandit instance in which arm k becomes optimal while the distribution fRk∗ is unaffected. The divergence term represents "the amount of distortion needed in the reward distribution of arm k to make it better than arm k∗", and hence captures the problem difficulty in the lower bound expression.

Bounded regret whenever possible for the special case of Section 2.3. From Corollary 1, we see that whenever C > 0, our proposed algorithms achieve O(log T) regret, matching the lower bound given in Theorem 3 order-wise. Also, when C = 0, our algorithms achieve O(1) regret. Thus, our algorithms achieve bounded regret whenever possible, i.e., when C = 0 for the model described in Section 2.3. In the general problem setting, a lower bound of Ω(log T) exists whenever it is possible to change the joint distribution of rewards such that the marginal distribution of the optimal arm k∗ is unaffected and the pseudo-rewards sℓ,k(r) still remain upper bounds on E[Rℓ | Rk = r] under the new joint probability distribution. In general, this can happen even if C = 0; we discuss one such scenario in Appendix G.2 and explain the algorithmic challenges that would need to be addressed to meet the lower bound.


Figure 6: Cumulative regret (vs. total rounds) for UCB, C-UCB and C-TS corresponding to the problem shown in Table 1. For setting (a) in Table 1, Arm 1 is optimal and Arm 2 is non-competitive; in setting (b) of Table 1, Arm 2 is optimal while Arm 1 is competitive.

5 SIMULATIONS

We now present the empirical performance of the proposed algorithms. For all the results presented in this section, we compare the performance of all algorithms on the same reward realizations and plot the cumulative regret averaged over 100 independent trials.

5.1 Simulations with known pseudo-rewards

Consider the example shown in Table 1, with the top row showing the pseudo-rewards, which are known to the player, and the bottom row showing two possible joint probability distributions (a) and (b), which are unknown to the player. We show the simulation results of our proposed algorithms C-UCB and C-TS against UCB in Figure 6 for the settings considered in Table 1.

Case (a): Bounded regret. For the probability distribution (a), notice that Arm 1 is optimal with µ1 = 0.6 and µ2 = 0.4. Moreover, ϕ2,1 = 0.4 × 0.7 + 0.6 × 0.4 = 0.52. Since ϕ2,1 < µ1, Arm 2 is non-competitive. Hence, in Figure 6(a), we see that our proposed C-UCB and C-TS algorithms achieve bounded regret, whereas UCB incurs logarithmic regret.

Case (b): All arms competitive. For the probability distribution (b), Arm 2 is optimal with µ2 = 0.5 and µ1 = 0.4. The expected pseudo-reward of Arm 1 with respect to Arm 2 in this case is ϕ1,2 = 0.8 × 0.5 + 0.5 × 0.5 = 0.65. Since ϕ1,2 > µ2, the sub-optimal arm (i.e., Arm 1) is competitive and hence C-UCB and C-TS also end up exploring Arm 1. Due to this, C-UCB and C-TS achieve regret similar to UCB in Figure 6(b). C-TS has empirically smaller regret than C-UCB, as Thompson Sampling performs better empirically than the UCB algorithm. The design of our C-Bandit approach allows the use of any other bandit algorithm in the last step, e.g., KL-UCB. A minimal simulation sketch of this setup appears below.
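The following is a minimal, self-contained Python sketch of C-UCB on a two-armed example, written from our reading of the C-Bandit rule as it is used in the analysis (identify empirically non-competitive arms with respect to the most-pulled arm, then run UCB on the remaining arms); it is not the authors' reference implementation. The joint reward distribution and pseudo-reward tables below are illustrative placeholders chosen to be consistent with the numbers quoted for case (a), not necessarily the exact Table 1 entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution over (R1, R2), consistent with mu_1 = 0.6, mu_2 = 0.4.
joint = {(1, 1): 0.2, (1, 0): 0.4, (0, 1): 0.2, (0, 0): 0.2}
outcomes, probs = list(joint.keys()), list(joint.values())

# Assumed pseudo-reward tables s[(l, k)][r] >= E[R_l | R_k = r].
s = {(1, 0): {0: 0.7, 1: 0.4},    # s_{2,1}(r)
     (0, 1): {0: 0.8, 1: 0.5}}    # s_{1,2}(r)

K, T = 2, 50_000
pulls = np.zeros(K, dtype=int)
means = np.zeros(K)                # empirical means mu_hat_k
pseudo_sum = np.zeros((K, K))      # running sums of s_{l,k}(r) over pulls of k

for t in range(1, T + 1):
    if t <= K:                     # initialization: pull each arm once
        k = t - 1
    else:
        k_max = int(np.argmax(pulls))
        # Keep the most-pulled arm plus every arm whose empirical pseudo-reward
        # w.r.t. k_max is at least mu_hat_{k_max} (the others are "empirically
        # non-competitive" for this round).
        phi = pseudo_sum[:, k_max] / pulls[k_max]
        keep = [l for l in range(K) if l == k_max or phi[l] >= means[k_max]]
        ucb = means + np.sqrt(2 * np.log(t) / pulls)      # UCB index I_k(t)
        k = max(keep, key=lambda l: ucb[l])
    r_joint = outcomes[rng.choice(len(outcomes), p=probs)]
    r = r_joint[k]                 # only the pulled arm's reward is observed
    pulls[k] += 1
    means[k] += (r - means[k]) / pulls[k]
    other = 1 - k
    pseudo_sum[other, k] += s[(other, k)][r]

print("pulls per arm:", pulls)     # in this case, Arm 2 (index 1) gets only O(1) pulls
```

Under these assumptions the sub-optimal arm is non-competitive (ϕ2,1 = 0.52 < µ1 = 0.6), so the sketch exhibits the bounded-regret behavior of case (a); swapping in any other index policy for the last step changes only the `ucb` line.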

5.2 Simulations for the model in Section 2.3

We now show the performance of C-UCB and C-TS against UCB for the model considered in Section 2.3, where the rewards corresponding to different arms are correlated through a latent random variable X. We consider a setting where the reward obtained from Arm 1, given a realization x of X, is bounded between 2x − 1 and 2x + 1, i.e.,

Figure 7: Rewards corresponding to the two arms are correlated through a random variable X lying in (0, 6). The lines represent lower and upper bounds on the rewards of Arm 1, Y1(X), and Arm 2, Y2(X), given the realization of the random variable X.

Figure 8: Simulation results (cumulative regret vs. total rounds for UCB, C-UCB and C-TS) for the example shown in Figure 7. In (a), X ∼ Beta(1, 1) and in (b), X ∼ Beta(1.5, 5).

2X − 1 ≤ Y1(X) ≤ 2X + 1. Similarly, the conditional reward of Arm 2 satisfies (3 − X)^2 − 1 ≤ Y2(X) ≤ (3 − X)^2 + 1. Figure 7 shows these upper and lower bounds on Yk(X). We run C-UCB, C-TS and UCB in this setting for two different distributions of X. For the simulations, we set the conditional reward of both arms to be distributed uniformly between the upper and lower bounds; this information is not known to the algorithms.

Case (a): X ∼ Beta(1, 1). When X is distributed as Beta(1, 1), Arm 1 is optimal while Arm 2 is non-competitive. Due to this, we observe that C-UCB and C-TS achieve bounded regret in Figure 8(a).

Case (b): X ∼ Beta(1.5, 5). In the scenario where X has the distribution Beta(1.5, 5), Arm 2 is optimal while Arm 1 is competitive. Due to this, C-UCB and C-TS do not stop exploring Arm 1 in finite time and we see cumulative regret similar to UCB in Figure 8(b). A minimal sketch of how such correlated rewards can be generated is given below.
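As an illustration of this reward model, the following short Python sketch (ours, under stated assumptions) draws the latent X and generates each arm's reward uniformly between its known bounds. Since Figure 7 shows X ranging over (0, 6) while the distributions are Beta, we scale the Beta draw by 6 as an assumption; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_bounds(arm, x):
    """Known lower/upper bounds on Y_k(X): Arm 1 is 2x +/- 1, Arm 2 is (3-x)^2 +/- 1."""
    center = 2 * x if arm == 0 else (3 - x) ** 2
    return center - 1, center + 1

def sample_reward(arm, x):
    """Conditional reward: uniform between the bounds (this is unknown to the algorithms)."""
    lo, hi = reward_bounds(arm, x)
    return rng.uniform(lo, hi)

def sample_round(a=1.0, b=1.0, scale=6.0):
    """Draw the latent X (scaled Beta, an assumption on our part) and both arms' rewards."""
    x = scale * rng.beta(a, b)
    return x, [sample_reward(0, x), sample_reward(1, x)]

# Monte Carlo estimate of the mean rewards under case (a), X ~ Beta(1, 1):
rewards = np.array([sample_round()[1] for _ in range(100_000)])
print("estimated means:", rewards.mean(axis=0))   # Arm 1 should appear optimal here
```

Re-running the last two lines with `sample_round(1.5, 5.0)` mimics case (b), where Arm 2 becomes optimal.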

Our next simulation considers a setting where the known upper and lower bounds on Yk(X) match, so that the reward Yk corresponding to a realization of X is deterministic, i.e., Yk(X) = gk(X). We show simulation results for the reward functions described in Figure 9 with three different distributions of X. For X ∼ Beta(4, 4), Arm 1 is optimal and Arms 2 and 3 are non-competitive, leading to bounded regret for C-UCB and C-TS in Figure 3(a).


Figure 9: Reward functions g1(x), g2(x) and g3(x) used for the simulation results presented in Figure 3.

In setting (b), we consider X ∼ Beta(2, 5), for which Arm 1 is optimal, Arm 2 is competitive and Arm 3 is non-competitive. Due to this, our proposed C-UCB and C-TS algorithms stop pulling Arm 3 after some time and hence achieve significantly reduced regret relative to UCB in Figure 3(b). For the third scenario (c), we set X ∼ Beta(1, 5), which makes Arm 3 optimal while Arms 1 and 2 are competitive. Hence, our algorithms explore both sub-optimal arms and have regret comparable to that of UCB in Figure 3(c).

6 EXPERIMENTS

We now show the performance of our proposed algorithms in real-world settings. Through the use of the MovieLens and Goodreads datasets, we demonstrate how the correlated MAB framework can be used in practical recommendation-system applications. In such systems, it is possible to use prior data (from a certain population) to learn the correlation structure in the form of pseudo-rewards. When trying to design a campaign to maximize user engagement in a new, unknown demographic, the learned correlation information in the form of pseudo-rewards can significantly reduce the regret, as we show in the results described below.

6.1 Experiments on the MovieLens dataset

The MovieLens dataset [18] contains a total of 1M ratings for 3883 movies rated by 6040 users. Each movie is rated on a scale of 1-5 by the users. Moreover, each movie is associated with one (and in some cases, multiple) genres. For our experiments, of the possibly several genres associated with each movie, one is picked uniformly at random. To perform our experiments, we split the data into two parts, with the first half containing the ratings of the users who provided the largest number of ratings. This half is used to learn the pseudo-reward entries; the other half is the test set, which is used to evaluate the performance of the proposed algorithms. Such a split ensures that the rating distribution differs between the training and test data.

Recommending the Best Genre. In our first experiment, the goal is to provide the best genre recommendation to a population with unknown demographics. We use the training dataset to learn the pseudo-reward entries. The pseudo-reward entry sℓ,k(r) is evaluated by taking the empirical average of the ratings of genre ℓ given by the users who rated genre k as r. To capture the fact that it might not be possible in practice to fill all pseudo-reward entries, we randomly remove a fraction p of the pseudo-reward entries.

Figure 10: Cumulative regret (vs. number of rounds) for UCB, C-UCB and C-TS for the application of recommending the best genre in the MovieLens dataset, where a fraction p of the pseudo-reward entries is replaced with the maximum reward, i.e., 5. In (a), p = 0.1; in (b), p = 0.25; and in (c), p = 0.5.

Figure 11: Cumulative regret (vs. number of rounds) of UCB, C-UCB and C-TS for providing the best movie recommendations in the MovieLens dataset. Each pseudo-reward entry is increased by 0.1 in (a) and by 0.4 in (b).

The removed pseudo-reward entries are replaced by the maximum possible rating, i.e., 5 (as that gives a natural upper bound on the conditional mean reward). Using these pseudo-rewards, we evaluate our proposed algorithms on the test data. Upon recommending a particular genre (arm), the rating (reward) is obtained by sampling one of the ratings for the chosen arm in the test data. Our experimental results for this setting are shown in Figure 10, with p = 0.1, 0.25 and 0.5 (i.e., the fraction of pseudo-reward entries that are removed). We see that the proposed C-UCB and C-TS algorithms significantly outperform UCB in all three settings. This occurs because some of the 18 arms stop being pulled after some time. This shows that even when only a subset of the correlations is known, it is possible to exploit them to improve the performance of classical bandit algorithms. A minimal sketch of how such a pseudo-reward table can be estimated from training data is given below.
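The sketch below (ours, not the authors' data pipeline) illustrates one way to build the pseudo-reward table from a ratings table as described above: average the ratings of arm ℓ over the users who rated arm k as r, and default missing or randomly removed entries to the maximum rating 5. The column names (user, genre, rating) and the rounding of per-user genre ratings are assumptions.

```python
import numpy as np
import pandas as pd

def estimate_pseudo_rewards(train: pd.DataFrame, p: float, seed: int = 0) -> dict:
    """Return a dict s[(l, k, r)] approximating E[rating of l | rating of k == r],
    with a random fraction p of entries replaced by the maximum rating (5)."""
    rng = np.random.default_rng(seed)
    arms = sorted(train["genre"].unique())
    # Mean rating of each user for each arm (a user may rate an arm several times).
    user_arm = train.groupby(["user", "genre"])["rating"].mean().unstack()
    s = {}
    for k in arms:
        for r in range(1, 6):
            raters = user_arm.index[user_arm[k].round() == r]   # users who rated k as r
            for l in arms:
                if l == k:
                    continue
                vals = user_arm.loc[raters, l].dropna()
                # Unknown or removed entries default to the maximum rating, 5.
                if len(vals) == 0 or rng.random() < p:
                    s[(l, k, r)] = 5.0
                else:
                    s[(l, k, r)] = float(vals.mean())
    return s

# Hypothetical usage:
# train = pd.read_csv("movielens_train.csv")   # columns: user, genre, rating
# s = estimate_pseudo_rewards(train, p=0.1)
```

A safety buffer, as used in the later experiments, can be added by simply incrementing each estimated entry before use.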

Recommending the Best Movie. We now consider the goal of providing the best movie recommendations to the population. To do so, we consider the 50 most rated movies in the dataset, containing 109,804 user ratings given by 6,025 users. In the testing phase, the goal is to recommend one of these 50 movies to each user. As in the previous experiment, we learn the pseudo-reward entries from the training data.


Figure 12: Cumulative regret (vs. number of rounds) of UCB, C-UCB and C-TS for providing the best poetry book recommendation in the Goodreads dataset. Every pseudo-reward entry is increased by q and a fraction p of the pseudo-reward entries is removed, with (a) p = 0.1, q = 0.1 and (b) p = 0.3, q = 0.1.

Instead of using the learned pseudo-rewards directly, we add a safety buffer to each pseudo-reward entry; i.e., we set the pseudo-reward to the empirical conditional mean plus the safety buffer. Adding a buffer is needed in practice, as the conditional expectations learned from the training data are likely to be noisy, and the safety buffer helps ensure that the pseudo-rewards remain upper bounds on the conditional expectations. Figure 11 shows the performance of C-UCB and C-TS relative to UCB for this setting, with the safety buffer set to 0.1 in Figure 11(a) and to 0.4 in Figure 11(b). In both cases, even after the addition of safety buffers, our proposed C-UCB and C-TS algorithms outperform the UCB algorithm.

6.2 Experiments on the Goodreads dataset

The Goodreads dataset [19] contains ratings for 1,561,465 books by a total of 808,749 users. Each rating is on a scale of 1-5. For our experiments, we only consider the poetry section and focus on the goal of providing the best poetry recommendations to a whole population whose demographics are unknown. The poetry dataset has 36,182 different poems rated by 267,821 different users. We pre-process the Goodreads dataset in the same manner as the MovieLens dataset, splitting it into train and test halves, where the train half contains the ratings of the users with the largest number of ratings.

Recommending the best poetry book. We consider the 25 most rated books in the dataset and use them as the set of arms to recommend in the testing phase. These 25 poems have 349,523 user ratings given by 171,433 users. As with the MovieLens dataset, the pseudo-reward entries are learned on the training data. In practical situations it might not be possible to obtain all pseudo-reward entries. Therefore, we randomly select a fraction p of the pseudo-reward entries and replace them with the maximum possible reward (i.e., 5). To each of the remaining pseudo-reward entries we add a safety buffer of q. Our results in Figure 12 show the performance of C-UCB and C-TS relative to UCB in two scenarios. In scenario (a), 10% of the pseudo-reward entries are replaced by 5 and the remaining entries are padded with a safety buffer of 0.1. In scenario (b), 30% of the entries are replaced by 5 and the safety buffer is 0.1. In both cases, our proposed C-UCB and C-TS algorithms significantly outperform UCB.

7 CONCLUSION

This work presents a new correlated multi-armed bandit problem in which the rewards obtained from different arms are correlated. We capture this correlation through the knowledge of pseudo-rewards, which represent upper bounds on conditional mean rewards and could, in practice, be obtained from domain knowledge or learned from prior data. Using the knowledge of these pseudo-rewards, we propose the C-Bandit algorithm, which fundamentally generalizes any classical bandit algorithm to the correlated multi-armed bandit setting. A key strength of our paper is that it allows the pseudo-rewards to be loose (when there is little prior information); even then, our C-Bandit algorithms adapt and provide performance at least as good as that of classical bandit algorithms.

We provide a unified method to analyze the regret of C-Bandit algorithms. In particular, the analysis shows that C-UCB and C-TS pull non-competitive arms only O(1) times; i.e., they stop pulling such arms after a finite time. Due to this, C-UCB and C-TS pull only C ≤ K − 1 of the K − 1 sub-optimal arms O(log T) times, as opposed to UCB/TS, which pull all K − 1 sub-optimal arms O(log T) times. In this sense, our C-Bandit algorithms reduce a K-armed bandit problem to a C+1-armed bandit problem. We present several cases where C = 0, for which C-UCB and C-TS achieve bounded regret. For the special case when rewards are correlated through a latent random variable X, we show that bounded regret is possible only when C = 0; if C > 0, then O(log T) regret cannot be avoided. Thus, our C-UCB and C-TS algorithms achieve bounded regret whenever possible. Simulation results validate the theoretical findings, and we perform experiments on the MovieLens and Goodreads datasets to demonstrate the applicability of our framework in the context of recommendation systems. The experiments on real-world datasets show that our C-UCB and C-TS algorithms significantly outperform the UCB algorithm.

There are several interesting open problems to be studied. We plan to study the problem of best-arm identification in the correlated multi-armed bandit setting, i.e., identifying the best arm with confidence 1 − δ in as few samples as possible. Since the rewards are correlated with each other, we believe the sample complexity can be significantly improved relative to state-of-the-art algorithms, such as LIL-UCB [28, 29], which are designed for classical multi-armed bandits. Another open direction is to improve the C-Bandit approach so that it achieves bounded regret whenever possible in the general framework studied in this paper.

REFERENCES

[1] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[2] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[3] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
[4] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.


[5] Sofía S. Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
[6] Cem Tekin and Eralp Turgay. Multi-objective contextual multi-armed bandit problem with a dominant objective. arXiv preprint arXiv:1708.05655, 2017.
[7] J. Nino-Mora. Stochastic scheduling. In Encyclopedia of Optimization, pages 3818–3824. Springer, New York, 2nd edition, 2009.
[8] Subhashini Krishnasamy, Rajat Sen, Ramesh Johari, and Sanjay Shakkottai. Regret of queueing bandits. CoRR, abs/1604.06377, 2016.
[9] Gauri Joshi. Efficient Redundancy Techniques to Reduce Delay in Cloud Systems. PhD thesis, Massachusetts Institute of Technology, June 2016.
[10] John White. Bandit Algorithms for Website Optimization. O'Reilly Media, Inc., 2012.
[11] Deepak Agarwal, Bee-Chung Chen, and Pradheep Elango. Explore/exploit schemes for web content optimization. In Ninth IEEE International Conference on Data Mining (ICDM), pages 1–10. IEEE, 2009.
[12] Li Zhou. A survey on contextual multi-armed bandits. CoRR, abs/1508.03326, 2015.
[13] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning (ICML), pages 1638–1646, 2014.
[14] Richard Combes, Stefan Magureanu, and Alexandre Proutière. Minimal exploration in structured stochastic bandits. In NIPS, 2017.
[15] Tor Lattimore and Rémi Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558, 2014.
[16] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[17] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), 2008.
[18] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), Article 19, 2015.
[19] Mengting Wan and Julian McAuley. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 86–94. ACM, 2018.
[20] Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, and Osman Yağan. Exploiting correlation in finite-armed structured bandits. arXiv preprint arXiv:1810.08164, 2018.
[21] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
[22] Stefan Magureanu, Richard Combes, and Alexandre Proutière. Lipschitz bandits: Regret lower bounds and optimal algorithms. arXiv preprint arXiv:1405.4758, 2014.
[23] Onur Atan, Cem Tekin, and Mihaela van der Schaar. Global multi-armed bandits with Hölder continuity. In AISTATS, 2015.
[24] Zhiyang Wang, Ruida Zhou, and Cong Shen. Regional multi-armed bandits. In AISTATS, 2018.
[25] Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. Multi-armed bandit problems with dependent arms. In Proceedings of the International Conference on Machine Learning, pages 721–728, 2007.
[26] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[27] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[28] K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Proceedings of the Annual Conference on Information Sciences and Systems (CISS), pages 1–6, March 2014.
[29] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[30] Tor Lattimore. Instance dependent lower bounds. http://banditalgs.com/2016/09/30/instance-dependent-lower-bounds/.
[31] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
[32] J. Bretagnolle and C. Huber. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47:119–137, 1979.
[33] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions, pages 299–300, 1964.


A STANDARD RESULTS FROM PREVIOUS WORKS

Fact 1 (Hoeffding's inequality). Let $Z_1, Z_2, \ldots, Z_n$ be i.i.d. random variables bounded in $[a, b]$: $a \le Z_i \le b$. Then, for any $\delta > 0$, we have
\[
\Pr\left(\frac{\sum_{i=1}^{n} Z_i}{n} - \mathbb{E}[Z_i] \ge \delta\right) \le \exp\left(\frac{-2n\delta^2}{(b-a)^2}\right).
\]
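As a quick numerical illustration of Fact 1 (our own addition, not part of the original appendix), the following sketch compares the empirical tail probability with the Hoeffding bound for Bernoulli samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 100, 0.1, 200_000
a, b = 0.0, 1.0                       # Bernoulli rewards lie in [0, 1]

z = rng.random((trials, n)) < 0.3     # i.i.d. Bernoulli(0.3) samples, E[Z_i] = 0.3
empirical = np.mean(z.mean(axis=1) - 0.3 >= delta)
bound = np.exp(-2 * n * delta**2 / (b - a) ** 2)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
# The empirical tail probability stays below the bound (~0.135), as Fact 1 guarantees.
```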

Lemma 1 (Standard result used in the bandit literature). If $\hat{\mu}_{k,n_k(t)}$ denotes the empirical mean of arm $k$ after pulling it $n_k(t)$ times through any algorithm and $\mu_k$ denotes the mean reward of arm $k$, then we have
\[
\Pr\left(\hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon, \; \tau_2 \ge n_k(t) \ge \tau_1\right) \le \sum_{s=\tau_1}^{\tau_2} \exp\left(-2s\epsilon^2\right).
\]

Proof. Let $Z_1, Z_2, \ldots, Z_t$ be the reward samples of arm $k$ drawn separately. If the algorithm chooses to play arm $k$ for the $m$-th time, it observes reward $Z_m$. The probability of the event $\{\hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon, \, \tau_2 \ge n_k(t) \ge \tau_1\}$ can be upper bounded as follows:
\begin{align}
\Pr\left(\hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon, \tau_2 \ge n_k(t) \ge \tau_1\right)
&= \Pr\left(\frac{\sum_{i=1}^{n_k(t)} Z_i}{n_k(t)} - \mu_k \ge \epsilon, \; \tau_2 \ge n_k(t) \ge \tau_1\right) \tag{25} \\
&\le \Pr\left(\bigcup_{m=\tau_1}^{\tau_2}\left\{\frac{\sum_{i=1}^{m} Z_i}{m} - \mu_k \ge \epsilon\right\}, \; \tau_2 \ge n_k(t) \ge \tau_1\right) \tag{26} \\
&\le \Pr\left(\bigcup_{m=\tau_1}^{\tau_2}\left\{\frac{\sum_{i=1}^{m} Z_i}{m} - \mu_k \ge \epsilon\right\}\right) \tag{27} \\
&\le \sum_{s=\tau_1}^{\tau_2} \exp\left(-2s\epsilon^2\right). \tag{28}
\end{align}

Lemma 2 (From the proof of Theorem 1 in [26]). Let $I_k(t)$ denote the UCB index of arm $k$ at round $t$, and let $\mu_k = \mathbb{E}[g_k(X)]$ denote the mean reward of that arm. Then, we have
\[
\Pr\left(\mu_k > I_k(t)\right) \le t^{-3}.
\]
Observe that this bound does not depend on the number $n_k(t)$ of times arm $k$ is pulled. The UCB index is defined in equation (6) of the main paper.

Proof. This proof follows directly from [26]. We present it here for completeness, as we use it frequently in the paper.
\begin{align}
\Pr(\mu_k > I_k(t)) &= \Pr\left(\mu_k > \hat{\mu}_{k,n_k(t)} + \sqrt{\frac{2\log t}{n_k(t)}}\right) \tag{29} \\
&\le \sum_{m=1}^{t} \Pr\left(\mu_k > \hat{\mu}_{k,m} + \sqrt{\frac{2\log t}{m}}\right) \tag{30} \\
&= \sum_{m=1}^{t} \Pr\left(\hat{\mu}_{k,m} - \mu_k < -\sqrt{\frac{2\log t}{m}}\right) \tag{31} \\
&\le \sum_{m=1}^{t} \exp\left(-2m\,\frac{2\log t}{m}\right) \tag{32} \\
&= \sum_{m=1}^{t} t^{-4} \tag{33} \\
&= t^{-3}. \tag{34}
\end{align}
Here, (30) follows from the union bound and is a standard trick (Lemma 1) to deal with the random variable $n_k(t)$; we use this trick repeatedly in the proofs. Inequality (32) follows from Hoeffding's inequality.

Lemma 3. Let $\mathbb{E}\!\left[\mathbb{1}_{I_k > I_{k^*}}(T)\right]$ be the expected number of times $I_k(t) > I_{k^*}(t)$ in $T$ rounds. Then, we have
\[
\mathbb{E}\!\left[\mathbb{1}_{I_k > I_{k^*}}(T)\right] = \sum_{t=1}^{T} \Pr\left(I_k > I_{k^*}\right) \le \frac{8\log T}{\Delta_k^2} + \left(1 + \frac{\pi^2}{3}\right).
\]


The proof follows the analysis in Theorem 1 of [26]. The analysis of $\Pr(I_k > I_{k^*})$ is done by conditioning on the event that arm $k$ has been pulled at least $\frac{8\log T}{\Delta_k^2}$ times. Conditioned on this event, $\Pr\left(I_k(t) > I_{k^*}(t) \mid n_k(t)\right) \le t^{-2}$.

Lemma 4 (Theorem 2 of [1]). Consider a two-armed bandit problem with reward distributions $\Theta = \{f_{R_1}(r), f_{R_2}(r)\}$, where the reward distribution of the optimal arm is $f_{R_1}(r)$ and that of the sub-optimal arm is $f_{R_2}(r)$, with $\mathbb{E}\!\left[f_{R_1}(r)\right] > \mathbb{E}\!\left[f_{R_2}(r)\right]$; i.e., arm 1 is optimal. If it is possible to create an alternate problem with distributions $\Theta' = \{f_{R_1}(r), \tilde{f}_{R_2}(r)\}$ such that $\mathbb{E}\!\left[\tilde{f}_{R_2}(r)\right] > \mathbb{E}\!\left[f_{R_1}(r)\right]$ and $0 < D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right) < \infty$ (equivalent to assumption 1.6 in [1]), then for any policy that achieves sub-polynomial regret, we have
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[n_2(T)]}{\log T} \ge \frac{1}{D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right)}.
\]

Proof. The proof is derived from the analysis done in [30]; we show it here for completeness. A bandit instance $v$ is defined by the reward distributions of arm 1 and arm 2. Since policy $\pi$ achieves sub-polynomial regret, for any instance $v$, $\mathbb{E}_{v,\pi}[Reg(T)] = O(T^p)$ as $T \to \infty$, for all $p > 0$.

Consider the bandit instances $\Theta = \{f_{R_1}(r), f_{R_2}(r)\}$ and $\Theta' = \{f_{R_1}(r), \tilde{f}_{R_2}(r)\}$, where $\mathbb{E}\!\left[f_{R_2}(r)\right] < \mathbb{E}\!\left[f_{R_1}(r)\right] < \mathbb{E}\!\left[\tilde{f}_{R_2}(r)\right]$. The bandit instance $\Theta'$ is constructed by changing the reward distribution of arm 2 in the original instance, in such a way that arm 2 becomes optimal in instance $\Theta'$ without changing the reward distribution of arm 1 from the original instance.

From the divergence decomposition lemma (derived in [30]), it follows that
\[
D\!\left(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}\right) = \mathbb{E}_{\Theta,\pi}[n_2(T)]\, D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right).
\]
The high-probability Pinsker's inequality (Lemma 2.6 in [31], originally from [32]) gives that for any event $A$,
\[
P_{\Theta,\pi}(A) + P_{\Theta',\pi}(A^c) \ge \frac{1}{2}\exp\left(-D\!\left(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}\right)\right),
\]
or equivalently,
\[
D\!\left(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}\right) \ge \log \frac{1}{2\left(P_{\Theta,\pi}(A) + P_{\Theta',\pi}(A^c)\right)}.
\]
If arm 2 is sub-optimal in a 2-armed bandit problem, then $\mathbb{E}[Reg(T)] = \Delta_2 \mathbb{E}[n_2(T)]$. The expected regret in $\Theta$ satisfies
\[
\mathbb{E}_{\Theta,\pi}[Reg(T)] \ge \frac{T\Delta_2}{2}\, P_{\Theta,\pi}\left(n_2(T) \ge \frac{T}{2}\right),
\]
and similarly the regret in bandit instance $\Theta'$ satisfies
\[
\mathbb{E}_{\Theta',\pi}[Reg(T)] \ge \frac{T\delta}{2}\, P_{\Theta',\pi}\left(n_2(T) < \frac{T}{2}\right),
\]
since the sub-optimality gap of arm 1 in $\Theta'$ is $\delta$. Define $\kappa(\Delta_2,\delta) = \frac{\min(\Delta_2,\delta)}{2}$. Then we have
\[
P_{\Theta,\pi}\left(n_2(T) \ge \frac{T}{2}\right) + P_{\Theta',\pi}\left(n_2(T) < \frac{T}{2}\right) \le \frac{\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)]}{\kappa(\Delta_2,\delta)\, T}.
\]
On applying the high-probability Pinsker's inequality and the divergence decomposition lemma stated above, we get
\begin{align}
D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right)\mathbb{E}_{\Theta,\pi}[n_2(T)] &\ge \log\left(\frac{\kappa(\Delta_2,\delta)\, T}{2\left(\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)]\right)}\right) \tag{35} \\
&= \log\left(\frac{\kappa(\Delta_2,\delta)}{2}\right) + \log T - \log\left(\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)]\right). \tag{36}
\end{align}
Since policy $\pi$ achieves sub-polynomial regret for any bandit instance, $\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)] \le \gamma T^p$ for all $T$ and any $p > 0$. Hence,
\begin{align}
\liminf_{T\to\infty} \frac{D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right)\mathbb{E}_{\Theta,\pi}[n_2(T)]}{\log T} &\ge 1 - \limsup_{T\to\infty} \frac{\log\left(\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)]\right)}{\log T} + \liminf_{T\to\infty} \frac{\log\left(\frac{\kappa(\Delta_2,\delta)}{2}\right)}{\log T} \tag{37} \\
&= 1. \tag{38}
\end{align}
Hence,
\[
\liminf_{T\to\infty} \frac{\mathbb{E}_{\Theta,\pi}[n_2(T)]}{\log T} \ge \frac{1}{D\!\left(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)\right)}.
\]


B RESULTS FOR ANY C-BANDIT ALGORITHM

Lemma 5. Define $E_1(t)$ to be the event that arm $k^*$ is deemed empirically non-competitive in round $t+1$. Then,
\[
\Pr(E_1(t)) \le t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right),
\]
where $\Delta_{\min} = \min_{k \ne k^*} \Delta_k$ is the gap between the best and second-best arms.

Proof. We analyze the probability that arm $k^*$ is empirically non-competitive by conditioning on the event that arm $k^*$ has not been pulled the maximum number of times up to round $t$:
\begin{align}
\Pr(E_1(t)) &= \Pr\left(E_1(t),\, n_{k^*}(t) \ne \max_k n_k(t)\right) \tag{39} \\
&= \sum_{k \ne k^*} \Pr\left(E_1(t),\, n_k(t) = \max_{k'} n_{k'}(t)\right) \tag{40} \\
&\le \max_k \Pr\left(E_1(t),\, n_k(t) = \max_{k'} n_{k'}(t)\right) \tag{41} \\
&= \max_k \Pr\left(\hat{\mu}_k(t) > \hat{\phi}_{k^*,k}(t),\, n_k(t) = \max_{k'} n_{k'}(t)\right) \tag{42} \\
&\le \max_k \Pr\left(\hat{\mu}_k(t) > \hat{\phi}_{k^*,k}(t),\, n_k(t) \ge \frac{t}{K}\right) \tag{43} \\
&= \max_k \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\, r_\tau}{n_k(t)} > \frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\, s_{k^*,k}(r_\tau)}{n_k(t)},\, n_k(t) \ge \frac{t}{K}\right) \tag{44} \\
&= \max_k \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\left(r_\tau - s_{k^*,k}(r_\tau)\right)}{n_k(t)} > 0,\, n_k(t) \ge \frac{t}{K}\right) \tag{45} \\
&= \max_k \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\left(r_\tau - s_{k^*,k}(r_\tau)\right)}{n_k(t)} - (\mu_k - \phi_{k^*,k}) > \phi_{k^*,k} - \mu_k,\, n_k(t) \ge \frac{t}{K}\right) \tag{46} \\
&\le \max_k \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\left(r_\tau - s_{k^*,k}(r_\tau)\right)}{n_k(t)} - (\mu_k - \phi_{k^*,k}) > \Delta_k,\, n_k(t) \ge \frac{t}{K}\right) \tag{47} \\
&\le \max_k\; t\exp\left(\frac{-t\Delta_k^2}{2K}\right) \tag{48} \\
&= t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right). \tag{49}
\end{align}
Here, (42) follows from the fact that for arm $k^*$ to be empirically non-competitive, the empirical mean of arm $k$ must exceed the empirical pseudo-reward of arm $k^*$ with respect to arm $k$. Inequality (43) follows since $n_k(t) \ge \frac{t}{K}$ is a necessary condition for $n_k(t) = \max_{k'} n_{k'}(t)$. We have (47) since $\phi_{k^*,k} \ge \mu_{k^*}$, so that $\phi_{k^*,k} - \mu_k \ge \Delta_k$. We have (48) from Hoeffding's inequality, noting that the terms $\{r_\tau - s_{k^*,k}(r_\tau) : \tau = 1,\ldots,t,\, k_\tau = k\}$ form a collection of i.i.d. random variables, each bounded in $[-1, 1]$, with mean $(\mu_k - \phi_{k^*,k})$. The factor $t$ before the exponent in (48) arises because the random variable $n_k(t)$ can take values from $t/K$ to $t$ (Lemma 1).

Lemma 6. If for a sub-optimal arm $k \ne k^*$ we have $\Delta_{k,k^*} > 0$, then
\[
\Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) \le t\exp\left(\frac{-t\Delta_{k,k^*}^2}{2K}\right).
\]
Moreover, if $\Delta_{k,k^*} \ge 2\sqrt{\frac{2K\log t_0}{t_0}}$ for some constant $t_0 > 0$, then
\[
\Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) \le t^{-3} \quad \forall t > t_0.
\]


Proof. We bound this probability as
\begin{align}
\Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) &= \Pr\left(\hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\, I_k(t) = \max_{k'} I_{k'}(t),\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) \tag{50} \\
&\le \Pr\left(\hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) \tag{51} \\
&\le \Pr\left(\hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\, n_{k^*}(t) \ge \frac{t}{K}\right) \tag{52} \\
&\le \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k^*}\, r_\tau}{n_{k^*}(t)} < \frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k^*}\, s_{k,k^*}(r_\tau)}{n_{k^*}(t)},\, n_{k^*}(t) \ge \frac{t}{K}\right) \tag{53} \\
&= \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k^*}\left(r_\tau - s_{k,k^*}(r_\tau)\right)}{n_{k^*}(t)} - (\mu_{k^*} - \phi_{k,k^*}) < -\Delta_{k,k^*},\, n_{k^*}(t) \ge \frac{t}{K}\right) \tag{54} \\
&\le t\exp\left(\frac{-t\Delta_{k,k^*}^2}{2K}\right) \tag{55} \\
&\le t^{-3} \quad \forall t > t_0. \tag{56}
\end{align}
Here, (55) follows from Hoeffding's inequality, noting that the terms $\{r_\tau - s_{k,k^*}(r_\tau) : \tau = 1,\ldots,t,\, k_\tau = k^*\}$ form a collection of i.i.d. random variables, each bounded in $[-1,1]$, with mean $(\mu_{k^*} - \phi_{k,k^*})$. The factor $t$ before the exponent in (55) arises because the random variable $n_{k^*}(t)$ can take values from $t/K$ to $t$ (Lemma 1). Step (56) follows from the assumption $\Delta_{k,k^*} \ge 2\sqrt{\frac{2K\log t_0}{t_0}}$ for some constant $t_0 > 0$.

C ALGORITHM SPECIFIC RESULTS FOR C-UCB

Lemma 7. If $\Delta_{\min} \ge 4\sqrt{\frac{K\log t_0}{t_0}}$ for some constant $t_0 > 0$, then
\[
\Pr\left(k_{t+1} = k,\, n_k(t) \ge s\right) \le 3t^{-3} \quad \text{for } s > \frac{t}{2K},\ \forall t > t_0.
\]

Proof. Noting that $k_{t+1} = k$ requires arm $k$ to have the highest index among the set of arms that are not empirically non-competitive (denoted by $\mathcal{A}$), we have
\begin{align}
\Pr(k_{t+1} = k,\, n_k(t) \ge s) &= \Pr\left(I_k(t) = \arg\max_{k' \in \mathcal{A}} I_{k'}(t),\, n_k(t) \ge s\right) \tag{57} \\
&\le \Pr\left(E_1(t) \cup \left(E_1^c(t),\, I_k(t) > I_{k^*}(t)\right),\, n_k(t) \ge s\right) \tag{58} \\
&\le \Pr\left(E_1(t),\, n_k(t) \ge s\right) + \Pr\left(E_1^c(t),\, I_k(t) > I_{k^*}(t),\, n_k(t) \ge s\right) \tag{59} \\
&\le t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s\right). \tag{60}
\end{align}
Here $E_1(t)$ is the event described in Lemma 5. If arm $k^*$ is not deemed empirically non-competitive at round $t$, then arm $k$ can only be pulled in round $t+1$ if $I_k(t) > I_{k^*}(t)$, which gives (58). Inequalities (59) and (60) follow from the union bound and Lemma 5, respectively.


We now bound the second term in (60):
\begin{align}
\Pr(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s) &= \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s,\, \mu_{k^*} \le I_{k^*}(t)\right) \nonumber \\
&\quad + \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s \mid \mu_{k^*} > I_{k^*}(t)\right)\Pr\left(\mu_{k^*} > I_{k^*}(t)\right) \tag{61} \\
&\le \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s,\, \mu_{k^*} \le I_{k^*}(t)\right) + \Pr\left(\mu_{k^*} > I_{k^*}(t)\right) \tag{62} \\
&\le \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s,\, \mu_{k^*} \le I_{k^*}(t)\right) + t^{-3} \tag{63} \\
&\le \Pr\left(I_k(t) > \mu_{k^*},\, n_k(t) \ge s\right) + t^{-3} \tag{64} \\
&= \Pr\left(\hat{\mu}_k(t) + \sqrt{\frac{2\log t}{n_k(t)}} > \mu_{k^*},\, n_k(t) \ge s\right) + t^{-3} \tag{65} \\
&= \Pr\left(\hat{\mu}_k(t) - \mu_k > \mu_{k^*} - \mu_k - \sqrt{\frac{2\log t}{n_k(t)}},\, n_k(t) \ge s\right) + t^{-3} \tag{66} \\
&= \Pr\left(\frac{\sum_{\tau=1}^{t}\mathbb{1}_{k_\tau=k}\, r_\tau}{n_k(t)} - \mu_k > \Delta_k - \sqrt{\frac{2\log t}{n_k(t)}},\, n_k(t) \ge s\right) + t^{-3} \tag{67} \\
&\le t\exp\left(-2s\left(\Delta_k - \sqrt{\frac{2\log t}{s}}\right)^2\right) + t^{-3} \tag{68} \\
&\le t^{-3}\exp\left(-2s\left(\Delta_k^2 - 2\Delta_k\sqrt{\frac{2\log t}{s}}\right)\right) + t^{-3} \tag{69} \\
&\le 2t^{-3} \quad \text{for all } t > t_0. \tag{70}
\end{align}
Equation (61) holds because $P(A) = P(A \mid B)P(B) + P(A \mid B^c)P(B^c)$; inequality (63) follows from Lemma 2. We have (65) from the definition of $I_k(t)$. Inequality (68) follows from Hoeffding's inequality, and the factor $t$ before the exponent arises because the random variable $n_k(t)$ can take values from $s$ to $t$ (Lemma 1). Inequality (70) follows from the facts that $s > \frac{t}{2K}$ and $\Delta_k \ge 4\sqrt{\frac{K\log t_0}{t_0}}$ for some constant $t_0 > 0$.

Plugging this into (60) gives
\begin{align}
\Pr(k_{t+1} = k,\, n_k(t) \ge s) &\le t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \Pr\left(I_k(t) > I_{k^*}(t),\, n_k(t) \ge s\right) \tag{71} \\
&\le t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + 2t^{-3} \tag{72} \\
&\le 3t^{-3}. \tag{73}
\end{align}
Here, (73) follows from the fact that $\Delta_{\min} \ge 2\sqrt{\frac{2K\log t_0}{t_0}}$ for some constant $t_0 > 0$.

Lemma 8. If $\Delta_{\min} \ge 4\sqrt{\frac{K\log t_0}{t_0}}$ for some constant $t_0 > 0$, then
\[
\Pr\left(n_k(t) > \frac{t}{K}\right) \le 3K\left(\frac{t}{K}\right)^{-2} \quad \forall t > Kt_0.
\]

Proof. We expand $\Pr\left(n_k(t) > \frac{t}{K}\right)$ as
\begin{align}
\Pr\left(n_k(t) \ge \frac{t}{K}\right) &= \Pr\left(n_k(t) \ge \frac{t}{K} \,\Big|\, n_k(t-1) \ge \frac{t}{K}\right)\Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + \Pr\left(k_t = k,\, n_k(t-1) = \frac{t}{K} - 1\right) \tag{74} \\
&\le \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + \Pr\left(k_t = k,\, n_k(t-1) = \frac{t}{K} - 1\right) \tag{75} \\
&\le \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + 3(t-1)^{-3} \quad \forall (t-1) > t_0. \tag{76}
\end{align}
Here, (76) follows from Lemma 7.


This gives us
\[
\Pr\left(n_k(t) \ge \frac{t}{K}\right) - \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) \le 3(t-1)^{-3}, \quad \forall (t-1) > t_0.
\]
Now consider the summation
\[
\sum_{\tau = t/K}^{t} \left[\Pr\left(n_k(\tau) \ge \frac{t}{K}\right) - \Pr\left(n_k(\tau-1) \ge \frac{t}{K}\right)\right] \le \sum_{\tau = t/K}^{t} 3(\tau-1)^{-3}.
\]
This gives us
\[
\Pr\left(n_k(t) \ge \frac{t}{K}\right) - \Pr\left(n_k\left(\frac{t}{K} - 1\right) \ge \frac{t}{K}\right) \le \sum_{\tau = t/K}^{t} 3(\tau-1)^{-3}.
\]
Since $\Pr\left(n_k\left(\frac{t}{K} - 1\right) \ge \frac{t}{K}\right) = 0$, we have
\begin{align}
\Pr\left(n_k(t) \ge \frac{t}{K}\right) &\le \sum_{\tau = t/K}^{t} 3(\tau-1)^{-3} \tag{77} \\
&\le 3K\left(\frac{t}{K}\right)^{-2} \quad \forall t > Kt_0. \tag{78}
\end{align}

D REGRET BOUNDS FOR C-UCB

Proof of Theorem 1. We bound $\mathbb{E}[n_k(T)]$ as
\begin{align}
\mathbb{E}[n_k(T)] &= \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}_{k_t = k}\right] \tag{79} \\
&= \sum_{t=0}^{T-1} \Pr(k_{t+1} = k) \tag{80} \\
&= \sum_{t=1}^{Kt_0} \Pr(k_t = k) + \sum_{t=Kt_0}^{T-1} \Pr(k_{t+1} = k) \tag{81} \\
&\le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left(n_{k'}(t) = \max_{k''} n_{k''}(t)\right)\Pr\left(k_{t+1} = k \mid n_{k'}(t) = \max_{k''} n_{k''}(t)\right) \tag{82} \\
&\le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left(n_{k'}(t) = \max_{k''} n_{k''}(t)\right) \tag{83} \\
&\le Kt_0 + \sum_{t=Kt_0}^{T-1} t^{-3} + \sum_{t=Kt_0}^{T} \sum_{k' \ne k^*} \Pr\left(n_{k'}(t) \ge \frac{t}{K}\right) \tag{84} \\
&\le Kt_0 + \sum_{t=1}^{T} t^{-3} + K(K-1)\sum_{t=Kt_0}^{T} 3\left(\frac{t}{K}\right)^{-2}. \tag{85}
\end{align}
Here, (84) follows from Lemma 6 and (85) follows from Lemma 8.

Proof of Theorem 2.


For any sub-optimal arm $k \ne k^*$,
\begin{align}
\mathbb{E}[n_k(T)] &\le \sum_{t=1}^{T} \Pr(k_t = k) \tag{86} \\
&= \sum_{t=1}^{T} \Pr\left(\left(E_1(t),\, k_t = k\right) \cup \left(E_1^c(t),\, I_k(t-1) > I_{k^*}(t-1),\, k_t = k\right)\right) \tag{87} \\
&\le \sum_{t=1}^{T} \left[\Pr(E_1(t)) + \Pr\left(E_1^c(t),\, I_k(t-1) > I_{k^*}(t-1),\, k_t = k\right)\right] \tag{88} \\
&\le \sum_{t=1}^{T} \left[\Pr(E_1(t)) + \Pr\left(I_k(t-1) > I_{k^*}(t-1),\, k_t = k\right)\right] \tag{89} \\
&\le \sum_{t=1}^{T} t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \sum_{t=0}^{T-1} \Pr\left(I_k(t) > I_{k^*}(t),\, k_{t+1} = k\right) \tag{90} \\
&\le \sum_{t=1}^{T} t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \mathbb{E}\!\left[\mathbb{1}_{I_k > I_{k^*}}(T)\right] \tag{91} \\
&\le \frac{8\log T}{\Delta_k^2} + \left(1 + \frac{\pi^2}{3}\right) + \sum_{t=1}^{T} t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right). \tag{92}
\end{align}
Here, (90) follows from Lemma 5. We have (91) from the definition of $\mathbb{E}\!\left[\mathbb{1}_{I_k > I_{k^*}}(T)\right]$ in Lemma 3, and (92) follows from Lemma 3.

Proof of Theorem 3: Follows directly by combining the results of Theorem 1 and Theorem 2.

E ALGORITHM SPECIFIC RESULTS FOR C-TS

As done in [27], let us define two thresholds for each arm $k \in \mathcal{K}$: an upper threshold $U_k$ and a lower threshold $L_k$,
\[
U_k = \mu_k + \frac{\Delta_k}{3}, \qquad L_k = \mu_{k^*} - \frac{\Delta_k}{3}. \tag{93}
\]
Let $E_k^{\mu}(t)$ and $E_k^{S}(t)$ be the events
\[
E_k^{\mu}(t) = \left\{\hat{\mu}_k(t) \le U_k\right\}, \qquad E_k^{S}(t) = \left\{S_k(t) \le L_k\right\}. \tag{94}
\]
Recall that $S_k(t)$ is the sample obtained from the posterior distribution on the mean reward of arm $k$ at round $t$.

Fact 2 ([33]). For a Gaussian random variable $Z$ with mean $\mu$ and variance $\sigma^2$, for any constant $c$,
\[
\frac{1}{4\sqrt{\pi}}\exp\left(-\frac{7c^2}{2}\right) < \Pr\left(|Z - \mu| > c\sigma\right) \le \frac{1}{2}\exp\left(-\frac{c^2}{2}\right).
\]

Fact 3 ([33]). For a Gaussian random variable $Z$ with mean $\mu$ and variance $\sigma^2$, for any constant $c$,
\[
\Pr\left(|Z - \mu| > c\sigma\right) \ge \frac{1}{\sqrt{2\pi}}\,\frac{c}{c^2+1}\exp\left(-\frac{c^2}{2}\right).
\]

Lemma 9. If $\Delta_{\min} \ge 6\sqrt{\frac{2\beta K\log t_0}{t_0}}$ for some constant $t_0 > \exp(11\beta\sigma^2)$ and $s > \frac{t}{K}$, then
\[
\Pr\left(k_t = k,\, n_k(t-1) \ge s\right) \le 3t^{-3} + 2t^{-2\beta} \quad \forall t > t_0,
\]
where $k \ne k^*$ is a sub-optimal arm.


Proof. We start by bounding the probability of a pull of arm $k$ at round $t$ as follows:
\begin{align}
\Pr(k_t = k,\, n_k(t-1) \ge s) &\le \Pr\left(E_1(t),\, k_t = k,\, n_k(t-1) \ge s\right) + \Pr\left(E_1^c(t),\, k_t = k,\, n_k(t-1) \ge s\right) \nonumber \\
&\le t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \Pr\left(E_1^c(t),\, k_t = k,\, n_k(t-1) \ge s\right) \tag{95} \\
&\le t^{-3} + \underbrace{\Pr\left(k_t = k,\, E_k^{\mu}(t),\, E_k^{S}(t),\, n_k(t-1) \ge s\right)}_{\text{term A}} + \underbrace{\Pr\left(k_t = k,\, E_k^{\mu}(t),\, \overline{E_k^{S}(t)},\, n_k(t-1) \ge s\right)}_{\text{term B}} \nonumber \\
&\quad + \underbrace{\Pr\left(k_t = k,\, \overline{E_k^{\mu}(t)},\, n_k(t-1) \ge s\right)}_{\text{term C}}, \tag{96}
\end{align}
where (95) follows from Lemma 5, and in (96) we used that $t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) \le t^{-3}$ for $t > t_0$. We now treat each term in (96) individually. From [27] we know that for all $s \ge \exp(11\beta)$,
\[
\text{(A)} \le t^{-2\beta}.
\]

For term B we can show that
\begin{align}
\text{(B)} &\le \Pr\left(E_k^{\mu}(t),\, \overline{E_k^{S}(t)},\, n_k(t-1) > s\right) \tag{97} \\
&\overset{(a)}{\le} \Pr\left(\mathcal{N}\!\left(U_k, \frac{\beta}{n_k(t)+1}\right) > L_k,\, n_k(t-1) > s\right) \tag{98} \\
&\overset{(b)}{=} \Pr\left(\mathcal{N}\!\left(\mu_k + \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1}\right) > \mu_{k^*} - \frac{\Delta_k}{3},\, n_k(t-1) > s\right) \tag{99} \\
&\overset{(c)}{=} \Pr\left(\mathcal{N}\!\left(\mu_k + \frac{\Delta_k}{3} - \frac{2\Delta_k}{3}, \frac{\beta}{n_k(t)+1}\right) > \mu_{k^*} - \frac{\Delta_k}{3} - \frac{2\Delta_k}{3},\, n_k(t-1) > s\right) \tag{100} \\
&\le \Pr\left(\mathcal{N}\!\left(\mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1}\right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}},\, n_k(t-1) > s\right)\cdot\Pr\left(\frac{\Delta_k}{3} \ge \sqrt{\frac{8\beta\log t}{s}}\right) \tag{101} \\
&\quad + \Pr\left(\mathcal{N}\!\left(\mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1}\right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \frac{\Delta_k}{3},\, n_k(t-1) > s\right)\cdot\Pr\left(\frac{\Delta_k}{3} < \sqrt{\frac{8\beta\log t}{s}}\right) \tag{102} \\
&\overset{(d)}{\le} \Pr\left(\mathcal{N}\!\left(\mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1}\right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}},\, n_k(t-1) > s\right) \tag{103} \\
&\overset{(e)}{\le} \sum_{m=s}^{t} \frac{1}{2}\exp\left(-\frac{m\left(\mu_{k^*} - \frac{2\Delta_k}{3} - \mu_k + \frac{\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}}\right)^2}{2\beta}\right) \tag{104} \\
&\overset{(f)}{\le} \frac{t}{2}\exp\left(-\frac{s}{2\beta}\left(\frac{8\beta\log t}{s} + \frac{4\Delta_k^2}{9} - \frac{4\Delta_k}{3}\sqrt{\frac{8\beta\log t}{s}}\right)\right) \tag{105} \\
&\overset{(g)}{\le} \frac{t^{-3}}{2}\exp\left(-\frac{s}{2\beta}\left(\frac{4\Delta_k^2}{9} - \frac{4\Delta_k}{3}\sqrt{\frac{8\beta\log t}{s}}\right)\right) \overset{(h)}{\le} \frac{t^{-3}}{2}. \tag{106}
\end{align}
Here (a) follows since $\hat{\mu}_k \le U_k$ (through the event $E_k^{\mu}(t)$) and $\overline{E_k^{S}(t)}$ is the event that $\mathcal{N}\!\left(\hat{\mu}_k, \frac{\beta}{n_k(t)+1}\right) > L_k$. Equality (b) follows by substituting the expressions for $L_k$ and $U_k$. Inequality (d) follows since, for $t > t_0$ and $s > \frac{t}{K}$, $\Delta_k > 6\sqrt{\frac{2\beta\log t}{s}}$. Inequality (e) follows from Fact 2. We have (h) since $\Delta_k > 6\sqrt{\frac{2\beta\log t}{s}}$ for all $s > t/K$ and $t > t_0$.


Finally, for the last term C, we can show that
\begin{align}
\text{(C)} &= \Pr\left(k_t = k,\, \overline{E_k^{\mu}(t)},\, n_k(t-1) \ge s\right) \tag{107} \\
&\le \Pr\left(\overline{E_k^{\mu}(t)},\, n_k(t-1) \ge s\right) \tag{108} \\
&= \Pr\left(\hat{\mu}_k - \mu_k > \frac{\Delta_k}{3},\, n_k(t-1) \ge s\right) \tag{109} \\
&\le 2t\exp\left(-\frac{2s\Delta_k^2}{9}\right) \tag{110} \\
&\le 2t^{-3} \quad \forall t > t_0. \tag{111}
\end{align}
Here (110) follows from Hoeffding's inequality and the union-bound trick used to handle the random variable $n_k(t-1)$. We have (111) since $\Delta_k > 6\sqrt{\frac{2K\beta\log t_0}{t_0}}$ for some $t_0 > 0$, $s > \frac{t}{K}$ and $\beta > 1$.

Lemma 10. If $\Delta_{\min} \ge 6\sqrt{\frac{2\beta K\log t_0}{t_0}}$ for some constant $t_0 > 0$, then
\[
\Pr\left(n_k(t) > \frac{t}{K}\right) \le 3K\left(\frac{t}{K}\right)^{-2} + K\left(\frac{t}{K}\right)^{1-2\beta} \quad \forall t > Kt_0.
\]

Proof. We expand $\Pr\left(n_k(t) > \frac{t}{K}\right)$ as
\begin{align}
\Pr\left(n_k(t) \ge \frac{t}{K}\right) &= \Pr\left(n_k(t) \ge \frac{t}{K} \,\Big|\, n_k(t-1) \ge \frac{t}{K}\right)\Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + \Pr\left(k_t = k,\, n_k(t-1) = \frac{t}{K} - 1\right) \tag{112} \\
&\le \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + \Pr\left(k_t = k,\, n_k(t-1) = \frac{t}{K} - 1\right) \tag{113} \\
&\le \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) + 3(t-1)^{-3} + (t-1)^{-2\beta} \quad \forall (t-1) > t_0. \tag{114}
\end{align}
Here, (114) follows from Lemma 9.

This gives us
\[
\Pr\left(n_k(t) \ge \frac{t}{K}\right) - \Pr\left(n_k(t-1) \ge \frac{t}{K}\right) \le 3(t-1)^{-3} + (t-1)^{-2\beta}, \quad \forall (t-1) > t_0.
\]
Now consider the summation
\[
\sum_{\tau = t/K}^{t} \left[\Pr\left(n_k(\tau) \ge \frac{t}{K}\right) - \Pr\left(n_k(\tau-1) \ge \frac{t}{K}\right)\right] \le \sum_{\tau = t/K}^{t} \left[3(\tau-1)^{-3} + (\tau-1)^{-2\beta}\right].
\]
This gives us
\[
\Pr\left(n_k(t) \ge \frac{t}{K}\right) - \Pr\left(n_k\left(\frac{t}{K} - 1\right) \ge \frac{t}{K}\right) \le \sum_{\tau = t/K}^{t} \left[3(\tau-1)^{-3} + (\tau-1)^{-2\beta}\right].
\]
Since $\Pr\left(n_k\left(\frac{t}{K} - 1\right) \ge \frac{t}{K}\right) = 0$, we have
\begin{align}
\Pr\left(n_k(t) \ge \frac{t}{K}\right) &\le \sum_{\tau = t/K}^{t} \left[3(\tau-1)^{-3} + (\tau-1)^{-2\beta}\right] \tag{115} \\
&\le 3K\left(\frac{t}{K}\right)^{-2} + K\left(\frac{t}{K}\right)^{1-2\beta} \quad \forall t > Kt_0. \tag{116}
\end{align}


F REGRET BOUNDS FOR C-TS

Proof of Theorem 1. Following the same steps as in Appendix D, we get
\begin{align}
\mathbb{E}[n_k(T)] &\le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left(k_{t+1} = k,\, n_{k^*}(t) = \max_{k'} n_{k'}(t)\right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left(n_{k'}(t) = \max_{k''} n_{k''}(t)\right) \tag{117} \\
&\le Kt_0 + \sum_{t=Kt_0}^{T-1} t^{-3} + \sum_{t=Kt_0}^{T} \sum_{k' \ne k^*} \Pr\left(n_{k'}(t) \ge \frac{t}{K}\right) \tag{118} \\
&\le Kt_0 + \sum_{t=1}^{T} t^{-3} + K(K-1)\sum_{t=Kt_0}^{T} \left(3\left(\frac{t}{K}\right)^{-2} + \left(\frac{t}{K}\right)^{1-2\beta}\right) \tag{119} \\
&= O(1) \quad \text{for } \beta > 1. \tag{120}
\end{align}
Here, (118) follows from Lemma 6 and (119) follows from Lemma 10.

Proof of Theorem 2. Following the same steps as in Appendix D, we get, for any sub-optimal arm $k \ne k^*$,

\begin{align}
\mathbb{E}[n_k(T)] &\le \sum_{t=1}^{T} t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right) + \sum_{t=0}^{T-1} \Pr\left(S_k(t) > S_{k^*}(t),\, k_t = k\right) \tag{121} \\
&\le \frac{18\log\left(T\Delta_k^2\right)}{\Delta_k^2} + \exp(11\beta) + 5 + \frac{13}{2\Delta_k^2} + \sum_{t=1}^{T} t\exp\left(\frac{-t\Delta_{\min}^2}{2K}\right). \tag{122}
\end{align}
We have (122) from the analysis of Thompson sampling in [27].

G LOWER BOUNDS

For the proofs in this appendix, we write $R_k = Y_k(X)$ for the reward of arm $k$ in the original instance and $\tilde{R}_k = \tilde{Y}_k(\tilde{X})$ for its reward in the new instance constructed below, where $f_X(x)$ and $f_{\tilde{X}}(x)$ are the probability density functions of $X$ and $\tilde{X}$, respectively. Similarly, $f_{R_k}(r)$ and $f_{\tilde{R}_k}(r)$ denote the reward distributions of arm $k$ in the original and new bandit instances.

G.1 Proof of Theorem 4

Let arm $k$ be a competitive sub-optimal arm, i.e., $\Delta_{k,k^*} < 0$. To prove that the regret is $\Omega(\log T)$ in this setting, we need to create a new bandit instance in which the reward distribution of the optimal arm is unaffected, but the previously competitive sub-optimal arm $k$ becomes optimal. We do so by constructing a bandit instance with latent randomness $\tilde{X}$ and random rewards $\tilde{Y}_k(\tilde{X})$, where $\tilde{Y}_k(\tilde{X})$ denotes the random reward obtained upon pulling arm $k$ given the realization of $\tilde{X}$. To make arm $k$ optimal in the new bandit instance, we construct $\tilde{Y}_k(\tilde{X})$ and $\tilde{X}$ in the following manner. Let $\mathcal{Y}_k$ denote the support of $Y_k(X)$. Define
\[
\tilde{Y}_k(\tilde{X}) =
\begin{cases}
g_k(\tilde{X}) & \text{w.p. } 1 - \epsilon_1, \\
\text{Uniform}(\mathcal{Y}_k) & \text{w.p. } \epsilon_1.
\end{cases}
\]
This changes the conditional reward of arm $k$ in the new bandit instance (with increased mean). Furthermore, define
\[
\tilde{X} =
\begin{cases}
S(R_{k^*}) & \text{w.p. } 1 - \epsilon_2, \\
\text{Uniform over } \mathcal{X} & \text{w.p. } \epsilon_2,
\end{cases}
\]
with $S(R_{k^*}) = \arg\max_{x \,:\, \underline{g}_{k^*}(x) < R_{k^*} < \overline{g}_{k^*}(x)}\, g_k(x)$, where $\underline{g}_{k^*}$ and $\overline{g}_{k^*}$ denote the known lower and upper bounds on the reward of arm $k^*$ and $R_{k^*}$ represents the random reward of arm $k^*$ in the original bandit instance.

This construction of $\tilde{X}$ is possible for some $\epsilon_1, \epsilon_2 > 0$ whenever arm $k$ is competitive, by definition. Moreover, under such a construction one can change the reward distribution of $\tilde{Y}_{k^*}(\tilde{X})$ such that the reward $\tilde{R}_{k^*}$ has the same distribution as $R_{k^*}$. This is done by changing the conditional reward distribution to
\[
f_{\tilde{Y}_{k^*} \mid \tilde{X}}(r) = f_{Y_{k^*} \mid X}(r)\,\frac{f_X(x)}{f_{\tilde{X}}(x)}.
\]
Due to this, if an arm is competitive, there exists a new bandit instance with latent randomness $\tilde{X}$ and conditional rewards $\tilde{Y}_{k^*} \mid \tilde{X}$ and $\tilde{Y}_k \mid \tilde{X}$ such that $f_{\tilde{R}_{k^*}} = f_{R_{k^*}}$ and $\mathbb{E}\!\left[\tilde{R}_k\right] > \mu_{k^*}$, with $f_{\tilde{R}_k}$ denoting the probability distribution of the reward from arm $k$ and $\tilde{R}_k$ representing the reward from arm $k$ in the new bandit instance. Therefore, if these are the only two arms in the problem, then from Lemma 4,
\[
\liminf_{T\to\infty} \frac{\mathbb{E}[n_k(T)]}{\log T} \ge \frac{1}{D\!\left(f_{R_k}(r) \,\|\, f_{\tilde{R}_k}(r)\right)},
\]
where $f_{\tilde{R}_k}(r)$ represents the reward distribution of arm $k$ in the new bandit instance.


r    s2,1(r)        r    s1,2(r)
0    2/3            0    3/4
1    6/7            1    2/3

(a)          R2 = 0    R2 = 1
R1 = 0       0.1       0.2
R1 = 1       0.3       0.4

(b)          R2 = 0    R2 = 1
R1 = 0       a         b
R1 = 1       c         d

Table 4: The top row shows the pseudo-rewards of Arms 1 and 2, i.e., upper bounds on the conditional expected rewards (which are known to the player). The bottom row depicts two possible joint probability distributions (unknown to the player). Under distribution (a), Arm 1 is optimal and all pseudo-rewards except s2,1(1) are tight.

Moreover, if we have $K - 1$ sub-optimal arms instead of just one, then
\[
\liminf_{T\to\infty} \frac{\mathbb{E}\!\left[\sum_{\ell \ne k^*} n_\ell(T)\right]}{\log T} \ge \frac{1}{D\!\left(f_{R_k}(r) \,\|\, f_{\tilde{R}_k}(r)\right)}.
\]
Consequently, since $\mathbb{E}[Reg(T)] = \sum_{\ell=1}^{K} \Delta_\ell\, \mathbb{E}[n_\ell(T)]$, we have
\[
\liminf_{T\to\infty} \frac{\mathbb{E}[Reg(T)]}{\log T} \ge \max_{k \in \mathcal{C}} \frac{\Delta_k}{D\!\left(f_{R_k} \,\|\, f_{\tilde{R}_k}\right)}. \tag{123}
\]

G.2 Lower bound discussion in general framework

Consider the example shown in Table 4. For the joint probability distribution (a), Arm 1 is optimal. Moreover, all pseudo-rewards except $s_{2,1}(1)$ are tight, i.e., $s_{\ell,k}(r) = \mathbb{E}[R_\ell \mid R_k = r]$. For the joint probability distribution shown in (a), the expected pseudo-reward of Arm 2 is 0.8, which exceeds µ1 = 0.7, and hence Arm 2 is competitive. Due to this, our C-UCB and C-TS algorithms pull Arm 2 O(log T) times.

However, it is not possible to construct an alternate bandit environment with a joint probability distribution of the form shown in Table 4(b) such that Arm 2 becomes optimal while the marginal distribution of Arm 1 is maintained and the pseudo-rewards still remain upper bounds on the conditional expected rewards. Formally, there do not exist $a, b, c, d$ such that $a + b + c + d = 1$, $c + d = 0.7$, $b + d > 0.7$ (so that Arm 2 becomes optimal), $\frac{c}{a+c} < 3/4$, $\frac{b}{a+b} < 2/3$, $\frac{d}{b+d} < 2/3$ and $\frac{d}{d+c} < 6/7$. This suggests that there should be a way to achieve O(1) regret in this scenario. We believe this can be done by using all the constraints imposed by the knowledge of the pairwise pseudo-rewards to shrink the space of possible joint probability distributions when calculating the empirical pseudo-rewards. However, this becomes difficult to implement, as the ratings can take multiple possible values and the number of arms exceeds two. We leave the design of a practically feasible and easy-to-implement algorithm that achieves bounded regret whenever possible in the general setup as an interesting open problem. A small numerical check of the infeasibility claim above is sketched below.
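For concreteness, the following is a small brute-force check (ours, not part of the paper) that no joint distribution (a, b, c, d) satisfies the constraints listed above; the grid resolution is arbitrary, and the strict inequality b + d > 0.7 encodes the requirement that Arm 2 become optimal.

```python
import numpy as np

step = 0.001
feasible = []
for b in np.arange(0, 0.3 + step, step):         # a + b = 0.3 fixes a given b
    a = 0.3 - b
    for d in np.arange(0, 0.7 + step, step):     # c + d = 0.7 fixes c given d
        c = 0.7 - d
        if (
            b + d > 0.7                              # Arm 2 strictly better than Arm 1
            and (a + c) > 0 and c / (a + c) < 3 / 4  # E[R1 | R2 = 0] below s_{1,2}(0)
            and (a + b) > 0 and b / (a + b) < 2 / 3  # E[R2 | R1 = 0] below s_{2,1}(0)
            and (b + d) > 0 and d / (b + d) < 2 / 3  # E[R1 | R2 = 1] below s_{1,2}(1)
            and (c + d) > 0 and d / (c + d) < 6 / 7  # E[R2 | R1 = 1] below s_{2,1}(1)
        ):
            feasible.append((a, b, c, d))

print("feasible points found:", len(feasible))       # expected output: 0
```

The search returns no feasible point: the constraints force b < 0.2 and d < 2b, so b + d < 0.6 can never exceed µ1 = 0.7, which is exactly why the Ω(log T) lower-bound construction fails here even though C > 0 for C-UCB and C-TS.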