
Adaptive Sequential Experiments

    with Unknown Information Arrival Processes

    Yonatan Gur

    Stanford University

    Ahmadreza Momeni∗

    Stanford University

    April 26, 2020

    Abstract

Sequential experiments are often designed to strike a balance between maximizing immediate payoffs based on available information, and acquiring new information that is essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied, typically when at each time epoch feedback is received only on the action that was selected at that epoch. However, in many practical settings, including product recommendations, dynamic pricing, retail management, and health care, additional information may become available between decision epochs. We introduce a generalized MAB formulation in which auxiliary information may appear arbitrarily over time. By obtaining matching lower and upper bounds, we characterize the minimax complexity of this family of problems as a function of the information arrival process, and study how salient characteristics of this process impact policy design and achievable performance. In terms of achieving optimal performance, we establish that: (i) upper confidence bound and posterior sampling policies possess natural robustness with respect to the information arrival process without any adjustments, which uncovers a novel property of these popular families of policies and further lends credence to their appeal; and (ii) policies with exogenous exploration rate do not possess such robustness. For such policies, we devise a novel virtual time indices method for dynamically controlling the effective exploration rate. We apply our method for designing εt-greedy-type policies that, without any prior knowledge on the information arrival process, attain the best performance (in terms of regret rate) that is achievable when the information arrival process is a priori known. We use data from a large media site to analyze the value that may be captured in practice by leveraging auxiliary information for designing content recommendations.

Keywords: sequential experiments, data-driven decisions, product recommendation systems, online learning, adaptive algorithms, multi-armed bandits, exploration-exploitation, minimax regret.

    1 Introduction

    1.1 Background and motivation

    Effective design of sequential experiments requires balancing between new information acquisition

    (exploration), and optimizing payoffs based on available information (exploitation). A well-studied

∗The authors are grateful to Omar Besbes and Lawrence M. Wein for their valuable comments. An initial version of this work, including some preliminary results, appeared in Gur and Momeni (2018). Correspondence: [email protected], [email protected].


framework that captures the trade-off between exploration and exploitation is the one of multi-armed

    bandits (MAB) that first emerged in Thompson (1933) in the context of drug testing, and was later

    extended by Robbins (1952) to a more general setting. In this framework, an agent repeatedly chooses

    between different actions (arms), where each action is associated with an a priori unknown reward

    distribution, and the objective is to maximize cumulative return over a certain time horizon.

    The MAB framework focuses on balancing exploration and exploitation under restrictive assumptions

    on the future information collection process: in every round noisy observations are collected only on the

    action that was selected at that period. Correspondingly, policy design is typically predicated on that

    information collection procedure. However, in many practical settings to which bandit models have been

    applied for designing sequential experimentation, additional information might be realized and leveraged

    over time. Examples of such settings include dynamic pricing (Bastani et al. 2019, Bu et al. 2019), retail

    management (Zhang et al. 2010), clinical trials (Bertsimas et al. 2016, Anderer et al. 2019), as well as

    many machine learning domains (see a list of relevant settings in the survey by Pan and Yang 2009).

    To illustrate one particular such setting, consider the problem of designing product recommendations

    with limited data. Product recommendation systems are widely deployed over the web with the objective

    of helping users navigate through content and consumer products while increasing volume and revenue for

    service and e-commerce platforms. These systems commonly apply techniques that leverage information

    such as explicit and implicit user preferences, product consumption, and consumer ratings (see, e.g., Hill

    et al. 1995, Konstan et al. 1997, Breese et al. 1998). While effective when ample relevant information

    is available, these techniques tend to under-perform when encountering products that are new to the

    system and have little or no trace of activity. This phenomenon, known as the cold-start problem, has

    been documented and studied in the literature; see, e.g., Schein et al. (2002), Park and Chu (2009).

    In the presence of new products, a recommendation system needs to strike a balance between

    maximizing instantaneous performance indicators (such as revenue) and collecting valuable information

    that is essential for optimizing future recommendations. To illustrate this tradeoff (see Figure 1), consider

    a consumer (A) that is searching for a product using the organic recommendation system of an online

    retailer (e.g., Amazon). The consumer provides key product characteristics (e.g., “headphones” and

    “bluetooth”), and based on this description (and, perhaps, additional factors such as the browsing history

    of the consumer) the retailer recommends products to the consumer. While some of the candidate

    products that fit the desired description may have already been recommended many times to consumers,

    and the mean returns from recommending them are known, other candidate products might be new brands

    that were not recommended to consumers before. Evaluating the mean returns from recommending new

    products requires experimentation that is essential for improving future recommendations, but might be


Figure 1: Additional information in product recommendations. In the depicted scenario, consumers of type (A) sequentially arrive to an online retailer’s organic recommendation system to search for products that match the terms “headphones” and “bluetooth.” Based on these terms, the retailer can recommend one of two products: an old brand that was already recommended many times to consumers, and a new brand that was never recommended before. In parallel, consumers of type (B) arrive to the new brand’s product page directly from a search engine (e.g., Google) by searching for “headphones,” “bluetooth,” and “Denisy” (the name of the new brand).

    costly if consumers are less likely to purchase these new products.

    Several MAB settings were suggested and applied for designing recommendation algorithms that

    effectively balance information acquisition and instantaneous revenue maximization, where arms represent

    candidate recommendations; see the overview in Madani and DeCoste (2005), as well as later studies by

    Agarwal et al. (2009), Caron and Bhagat (2013), Tang et al. (2014), and Wang et al. (2017). Aligned

    with traditional MAB frameworks, these studies consider settings where in each time period information

    is obtained only on items that are recommended at that period, and suggest recommendation policies

    that are predicated on this information collection process. However, in many instances additional

    browsing and consumption information may be maintained in parallel to the sequential recommendation

    process, as a significant fraction of website traffic may take place through means other than its organic

    recommender system; see browsing data analysis in Sharma et al. (2015) that estimate that product

    recommendation systems are responsible for at most 30% of site traffic and revenue and that substantial

    fraction of remaining traffic arrives to product pages directly from search engines, as well as Besbes et al.

    (2016) that report similar findings for content recommendations in media sites.

    To illustrate an additional source of traffic and revenue information, let us revert to Figure 1 and

    consider a different consumer (B) that does not use the organic recommender system of the retailer,

    but rather arrives to a certain product page directly from a search engine (e.g., Google) by searching

    for that particular brand. While consumers (A) and (B) may have inherently different preferences, the

    actions taken by consumer (B) after arriving to the product page (e.g., whether or not she bought the

    product) can be informative about the returns from recommending that product to other consumers.

    This additional information could potentially be used to improve the performance of recommendation


algorithms, especially when the browsing history that is associated with some products is limited.

    The above example illustrates that additional information often becomes available between decision

    epochs of recommender systems, and that this information might be essential for designing effective

    product recommendations. In comparison with the restrictive information collection process that is

    assumed in most MAB formulations, the availability of such auxiliary information (and even the prospect

    of obtaining it in the future) might impact the achievable performance and the design of sequential

    experiments and learning policies. When additional information is available one may potentially need

to “sacrifice” fewer decisions for exploration in order to obtain effective estimators for the mean rewards.

    While this intuition suggests that exploration can be reduced in the presence of additional information, it

    is a priori not clear what type of performance improvement can be expected given different information

    arrival processes, and how policy design should depend on these processes.

    Moreover, monitoring exploration rates in real time in the presence of an arbitrary information arrival

    process introduces additional challenges that have distinct practical relevance. Most importantly, an

    optimal exploration rate may depend on several characteristics of the information arrival process, such

    as the amount of information that arrives and the time at which it appears (e.g., early on versus later

    on along the decision horizon). Since it may be hard to predict upfront the salient characteristics of the

    information arrival process, an important challenge is to identify the extent to which it is possible to

    adapt in real time to an a priori unknown information arrival process, in the sense of approximating the

    performance that is achievable under prior knowledge of the sample path of information arrivals. This

    paper is concerned with addressing these challenges.

    The research questions that we study in this paper are along two lines. First, we study how the

    information arrival process and its characteristics impact the performance one could aspire to when

designing a sequential experiment. Second, we study what type of policies one should adopt for designing

    sequential experiments in the presence of general information arrival processes, especially when the

    structure of these processes is a priori unknown.

    1.2 Main contributions

    The main contribution of this paper lies in (1) introducing a new, generalized MAB framework with

    unknown and arbitrary information arrival process; (2) characterizing the minimax regret complexity of

    this broad class of problems and, by doing so, providing a sharp criterion for identifying policies that are

    rate-optimal in the presence of auxiliary information arrivals; and (3) identifying policies (and proposing

    new ones) that are rate-optimal in the presence of general and a priori unknown information arrival

    processes. More specifically, our contribution is along the following lines.


(1) Modeling. To capture the salient features of settings with auxiliary information arrivals we

    formulate a new class of MAB problems that generalizes the classical stochastic MAB framework, by

    relaxing strong assumptions that are typically imposed on the information collection process. Our

    formulation considers a broad class of distributions that are informative about mean rewards, and allows

    observations from these distributions to arrive at arbitrary and a priori unknown rates and times. Our

    model therefore captures a large variety of real-world phenomena, yet maintains tractability that allows

    for a crisp characterization of the performance improvement that can be expected when more information

    might become available throughout the sequential decision process.

    (2) Analysis of achievable performance. We establish lower bounds on the performance that is

    achievable by any non-anticipating policy, where performance is measured in terms of regret relative to

    the performance of an oracle that constantly selects the action with the highest mean reward. We further

    show that our lower bounds can be achieved through suitable policy design. These results identify the

    minimax complexity associated with the MAB problem with unknown information arrival processes, as a

    function of the underlying information arrival process itself (together with other problem characteristics

    such as the length of the problem horizon). This provides a yardstick for evaluating the performance

    of policies, and a sharp criterion for identifying rate-optimal ones. Our results identify a spectrum of

    minimax regret rates ranging from the classical regret rates that appear in the stochastic MAB literature

    when there is no auxiliary information, to regret that is uniformly bounded over time when information

    arrives frequently and/or early enough.

(3) Policy design. We establish that policies that are based on posterior sampling and upper confidence

    bounds may leverage additional information naturally, without any prior knowledge on the information

    arrival process. In that sense, policies such as Thompson sampling and UCB exhibit remarkable

    robustness with respect to the information arrival process: without any adjustments, these policies

    guarantee the best performance that can be achieved (in terms of regret rate). Moreover, the regret rate

    that is guaranteed by these policies cannot be improved even with prior knowledge of the information

    arrival process. The “best of all worlds” type of guarantee that is established implies that Thompson

    sampling and UCB achieve rate optimality uniformly over a general class of information arrival processes;

Figure 2 visualizes the regret incurred by these policies as the amount of auxiliary information increases.

    This result identifies a new important property of these popular policies, generalizes the class of problems

    to which they have been applied thus far, and further lends credence to their appeal.

    Nevertheless, we observe that policies with an exogenous exploration rate do not exhibit such robustness.

    For such policies, we devise a novel virtual time indices method for endogenizing the exploration rate.

We apply this method for designing εt-greedy-type policies that, without any prior knowledge on the


Figure 2: Impact of stochastic information arrivals (see §3.1.1) with different rates, λ, of auxiliary information arrivals: little auxiliary data (λ = 0.001), more auxiliary data (λ = 0.01), and a lot of auxiliary data (λ = 0.05). The top plots depict regret incurred by UCB1 tuned by c = 1.0 with (aUCB1) and without (UCB1) accounting for auxiliary information arrivals. Bottom plots depict regret incurred by Thompson sampling tuned by c = 0.5 with (aTS) and without (TS) accounting for auxiliary observations. The setup consists of horizon length T = 10⁴, three arms, and Gaussian rewards with means µ = (0.7, 0.5, 0.5)⊤ and standard deviation σ = 0.5. Auxiliary observations are i.i.d. samples from the reward distributions. Plots are averaged over 400 replications.

    information arrival process, uniformly attain the best performance (in terms of regret rate) that is

    achievable when the information arrival process is a priori known. Figure 3 uses the settings described

    in Figure 2 to visualize the impact of adjusting the exploration rate of an εt-greedy policy using virtual

    time indices as the amount of auxiliary information increases. The virtual time indices method adjusts

    the exploration rate of the policy in real time, through replacing the time index used by the policy

    with virtual time indices that are dynamically updated over time. Whenever auxiliary information on

    a certain action arrives, the virtual time index associated with that action is carefully advanced to

    effectively reduce the rate at which the policy is experimenting with that action.

Figure 3: The information arrival settings described in Figure 2, and performance of εt-greedy-type policies (tuned by c = 1.0) that: ignore auxiliary observations (EG); use this information to adjust estimates of mean rewards without appropriately adjusting the exploration rate of the policy (nEG); and adjust reward estimates and dynamically adjust exploration rates using virtual time indices (aEG).

    Finally, using data from a large media site, we analyze the value that may be captured by leveraging

    auxiliary information in order to recommend content with unknown impact on the future browsing path


of readers. The auxiliary information we utilize consists of users that arrived to media pages directly

from search, and the browsing path of these users once they arrived to that media page.

    1.3 Related work

    Multi-armed bandits. Since its inception, the MAB framework has been adopted for studying a

    variety of applications including clinical trials (Zelen 1969), strategic pricing (Bergemann and Välimäki

    1996), assortment selection (Caro and Gallien 2007), online auctions (Kleinberg and Leighton 2003),

    online advertising (Pandey et al. 2007), and product recommendations (Madani and DeCoste 2005, Li

    et al. 2010). For a comprehensive overview of MAB formulations we refer the readers to Berry and

    Fristedt (1985) and Gittins et al. (2011) for Bayesian / dynamic programming formulations, as well as

    Cesa-Bianchi and Lugosi (2006) and Bubeck and Cesa-Bianchi (2012) that cover the machine learning

    literature and the so-called adversarial setting. A sharp regret characterization for the more traditional

    formulation (random rewards realized from stationary distributions), often referred to as the stochastic

    MAB problem, was first established by Lai and Robbins (1985), followed by analysis of policies such as

εt-greedy, UCB1, and Thompson sampling; see, e.g., Auer et al. (2002) and Agrawal and Goyal (2013a).

    The MAB framework focuses on balancing exploration and exploitation under restrictive assumptions

    on the future information collection process. Correspondingly, optimal policy design is typically predicated

    on the assumption that at each period a reward observation is collected only on the action that is

    selected at that time period (exceptions to this common information collection process are discussed

    below). In that sense, such policy design does not account for information that may become available

    between decision epochs, and that might be essential for achieving good performance in many practical

    settings. In the current paper, we relax the assumptions on the information arrival process in the MAB

    framework by allowing arbitrary information arrival processes. Our focus is on studying the impact of

    the information arrival characteristics (such as frequency and timing) on achievable performance and

    policy design, especially when the information arrival process is a priori unknown.

    Few MAB settings deviate from the information collection process that is described above, but these

    settings consider specific information arrival processes that are known a priori to the agent, as opposed

    to our formulation where the information arrival process is arbitrary and a priori unknown. One example

    include the so-called contextual MAB setting, also referred to as bandit problem with side observations

    (Wang et al. 2005), or associative bandit problem (Strehl et al. 2006), where at each trial the decision

    maker observes a context carrying information about other arms. (In Appendix E we demonstrate

    that the policy design approach we advance here could be applied to contextual MAB settings as well.)

    Another example is the full-information adversarial MAB setting, where rewards are arbitrary and


can even be selected by an adversary (Auer et al. 1995, Freund and Schapire 1997). In this setting,

    after each decision epoch the agent has access to a reward observation from each arm (not only the one

    that was selected). The adversarial nature of this setting makes it fundamentally different in terms of

    achievable performance, analysis, and policy design, from the stochastic formulation that is adopted

    in this paper. Another related example, which could be viewed as a special case of our framework, is

    the setting of Shivaswamy and Joachims (2012), who consider a stochastic MAB problem where initial

    reward observations are available at the beginning of the horizon.

    Balancing and regulating exploration. Several papers have considered settings of dynamic opti-

    mization with partial information and distinguished between cases where myopic policies guarantee

    optimal performance, and cases where exploration is essential, in the sense that myopic policies may lead

    to incomplete learning and large losses; see, e.g., Araman and Caldentey (2009), Farias and Van Roy

    (2010), Harrison et al. (2012), and den Boer and Zwart (2013) for dynamic pricing without knowing

    the demand function, Huh and Rusmevichientong (2009) and Besbes and Muharremoglu (2013) for

    inventory management without knowing the demand distribution, and Lee et al. (2003) in the context

    of technology development. Bastani et al. (2017) consider the contextual MAB framework and show

    that if the distribution of contexts guarantees sufficient diversity, then exploration becomes unnecessary

    and greedy policies can leverage the natural exploration that is embedded in the information diversity

    to achieve asymptotic optimality. Relatedly, Woodroofe (1979) and Sarkar (1991) consider a Bayesian

    one-armed contextual MAB problem and show that a myopic policy is asymptotically optimal when

    the discount factor converges to one. Considering sequential recommendations to customers that make

    decisions based on the relevance of recommendations, Bastani et al. (2018) show that classical MAB

    policies may over-explore and propose proper modifications for those policies.

    On the other hand, few papers have studied cases where exploration is not only essential but should

    be enhanced in order to maintain optimality. For example, Cesa-Bianchi et al. (2006) introduce a partial

    monitoring setting where after playing an arm the agent does not observe the incurred loss but only a

    limited signal about it, and show that such feedback structure requires higher exploration rates. Besbes

    et al. (2019) consider a general framework where the reward distribution may change over time according

    to a budget of variation, and characterize the manner in which optimal exploration rates increase as a

    function of said budget. Shah et al. (2018) consider a platform in which the preferences of arriving users

    may depend on the experience of previous users, show that in this setting classical MAB policies may

    under-explore, and propose a balanced-exploration approach that leads to optimal performance.

    The above studies demonstrate a variety of practical settings where the extent of exploration that is

    required to maintain optimality strongly depends on particular problem characteristics that may often be


a priori unknown to the decision maker. This introduces the challenge of dynamically endogenizing the

    rate at which a decision-making policy explores to approximate the best performance that is achievable

    under ex ante knowledge of the underlying problem characteristics. In this paper we address this

    challenge from an information collection perspective. We identify conditions on the information arrival

    process that guarantee the optimality of myopic policies, and further identify adaptive MAB policies

    that guarantee rate-optimality without prior knowledge on the information arrival process.

    Recommender systems. An active stream of literature has been studying recommender systems,

focusing on modelling and maintaining connections between users and products; see, e.g., Ansari et al. (2000), the survey by Adomavicius and Tuzhilin (2005), and the book by Ricci et al. (2011). One key

    element that impacts the performance of recommender systems is the often limited data that is available.

    Focusing on the prominent information acquisition aspect of the problem, several studies (to which we

    referred earlier) have addressed sequential recommendation problems using a MAB framework where at

    each time period information is obtained only on items that are recommended at that period. Another

    approach is to identify and leverage additional sources of relevant information. Following that avenue,

    Farias and Li (2019) consider the problem of estimating user-item propensities, and propose a method

    to incorporate auxiliary data such as browsing and search histories to enhance the predictive power of

recommender systems. While their work is concerned with the impact of auxiliary information in an offline

    prediction context, our paper focuses on the impact of auxiliary information streams on the design,

    information acquisition, and appropriate exploration rate in a sequential experimentation framework.

    2 Problem formulation

    We formulate a class of multi-armed bandit problems with auxiliary information arrivals. We note that

    various modeling assumptions can be generalized and are made only to simplify exposition and analysis.

    Some model assumptions, as well as few extensions of our model, are discussed in §2.1.

    Let K = {1, . . . ,K} be a set of arms (actions) and let T = {1, . . . , T} denote a sequence of decision

    epochs. At each time period t ∈ T , a decision maker selects one of the K arms. When selecting an

    arm k ∈ K at time t ∈ T , a reward Xk,t ∈ R is realized and observed. For each t ∈ T and k ∈ K,

    the reward Xk,t is assumed to be independently drawn from some σ2-sub-Gaussian distribution with

mean µk.¹ We denote the profile of rewards at time t by Xt = (X1,t, . . . , XK,t)⊤ and the profile of mean rewards by µ = (µ1, . . . , µK)⊤. We further denote by ν = (ν1, . . . , νK)⊤ the distribution of the rewards profile Xt. We assume that rewards are independent across time periods and arms. We denote the highest expected reward and the best arm by µ∗ and k∗ respectively, that is:²

$$\mu^* = \max_{k\in\mathcal{K}}\{\mu_k\}, \qquad k^* = \arg\max_{k\in\mathcal{K}} \mu_k.$$

We denote by ∆k = µ∗ − µk the difference between the expected reward of the best arm and the expected reward of arm k, and by ∆ a lower bound such that 0 < ∆ ≤ min_{k∈K\{k∗}} ∆k.

¹A real-valued random variable X is said to be sub-Gaussian if there is some σ > 0 such that for every λ ∈ R one has $\mathbb{E}\, e^{\lambda(X-\mathbb{E}X)} \le e^{\sigma^2\lambda^2/2}$. This broad class of distributions includes, for instance, Gaussian random variables, as well as any random variable with a bounded support (if X ∈ [a, b] then X is (a−b)²/4-sub-Gaussian) such as Bernoulli random variables. In particular, if a random variable is σ²-sub-Gaussian, it is also σ̃²-sub-Gaussian for all σ̃ > σ.

    Information arrival process. Before each round t, the agent may or may not observe auxiliary

    information for some of the arms without pulling them. For each period t and arm k, we denote by

    hk,t ∈ {0, 1} the indicator of observing auxiliary information on arm k between decision epochs t − 1

    and t. We denote by ht = (h1,t, . . . , hK,t)> the vector of auxiliary information indicators at period t, and

    by H = (h1, . . . ,hT ) the information arrival matrix; we assume that this matrix is independent of the

    policy’s actions. If hk,t = 1, then between decision epochs t−1 and t the agent observes a random variable

    Yk,t ∼ ν ′k. For each arm k we assume that there exists a mapping φk : R→ R through which an unbiased

    estimator for µk can be constructed, that is, E [φk(Yk,t)] = µk, and that φk(Yk,t) is σ̂2-sub-Gaussian

for some σ̂ > 0. We denote Yt = (Y1,t, . . . , YK,t)⊤, and assume that the random variables Yk,t are

    independent across time periods and arms, and are also independent from reward realizations. We

denote the vector of information received between epochs t− 1 and t by Zt = (Z1,t, . . . , ZK,t)⊤ where for

    any k one has Zk,t = hk,t · φk(Yk,t).

    Example 1 (Linear mappings) A special case of the information arrival processes we formalized

    consists of observations of independent random variables that are linear mappings of rewards. That is,

    there exist vectors (α1, . . . , αK) ∈ RK and (β1, . . . , βK) ∈ RK such that for all k and t,

    E[βk + αkYk,t] = µk.

    This simple class of mappings illustrates the possibility of utilizing available data in many domains,

    including maximizing conversions of product recommendations (for example, in terms of purchases of

    recommended products). The practicality of using linear mappings for improving product recommendations

    will be demonstrated empirically in §6 using data from a large media site.
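To make the role of the mappings φk concrete, the following is a minimal Python sketch of Example 1: auxiliary observations Yk,t are generated so that a known linear mapping φk(y) = βk + αk·y is an unbiased estimator of the mean reward µk. The specific values of µk, αk, and βk below are hypothetical and used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: the auxiliary signal Y_{k,t} is a shifted and scaled
# (noisy) version of the reward, so that phi_k(y) = beta_k + alpha_k * y
# satisfies E[phi_k(Y_{k,t})] = mu_k.
mu_k, alpha_k, beta_k = 0.7, 2.0, -0.1
Y = (mu_k - beta_k) / alpha_k + rng.normal(0.0, 0.5, size=100_000)

phi_k = lambda y: beta_k + alpha_k * y   # the known linear mapping of Example 1
print(np.mean(phi_k(Y)))                 # ~0.7, an unbiased estimate of mu_k
```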

    Admissible policies, performance, and regret. Let U be a random variable defined over a prob-

    ability space (U,U ,Pu). Let πt : Rt−1 × RK×t × {0, 1}K×t × U → K for t = 1, 2, 3, . . . be measurable

functions (with some abuse of notation we also denote the action at time t by πt ∈ K) given by²

$$\pi_t = \begin{cases} \pi_1(\mathbf{Z}_1, \mathbf{h}_1, U) & t = 1, \\[4pt] \pi_t\!\left(X_{\pi_{t-1},t-1}, \ldots, X_{\pi_1,1}, \mathbf{Z}_t, \ldots, \mathbf{Z}_1, \mathbf{h}_t, \ldots, \mathbf{h}_1, U\right) & t = 2, 3, \ldots \end{cases}$$

The mappings {πt : t = 1, . . . , T}, together with the distribution Pu, define the class of admissible policies, denoted by P. Policies in P depend only on the past history of actions and observations as well as auxiliary information arrivals, and allow for randomization via their dependence on U. We denote by S = S(∆, σ², σ̂²) the class that includes pairs of allowed reward distribution profiles (ν, ν′).

²For simplicity, when using the arg min and arg max operators we assume ties to be broken in favor of the smaller index.

    We evaluate the performance of a policy π ∈ P by the regret it incurs under information arrival

    process H relative to the performance of an oracle that selects the arm with the highest expected reward.

    We define the worst-case regret as follows:

$$\mathcal{R}^\pi_\mathcal{S}(H, T) = \sup_{(\nu,\nu')\in\mathcal{S}} \mathbb{E}^\pi_{(\nu,\nu')}\left[\sum_{t=1}^{T} \left(\mu^* - \mu_{\pi_t}\right)\right], \qquad (1)$$

where the expectation Eπ(ν,ν′)[·] is taken with respect to the noisy rewards and noisy auxiliary observations, as well as to the policy’s actions (throughout the paper we will denote by Pπ(ν,ν′), Eπ(ν,ν′), and Rπ(ν,ν′) the probability, expectation, and regret when the arms are selected according to policy π, rewards are

    distributed according to ν, and auxiliary observations are distributed according to ν ′). We note that

    regret is incurred over decisions made in epochs t = 1, . . . ,T; the main distinction relative to classical

    regret formulations is that in (1) the mappings {πt; t = 1, . . . , T} can be measurable with respect to the

    σ-field that also includes information that arrives between decision epochs, as captured by the matrix H.

We denote by R∗S(H, T ) = inf_{π∈P} RπS(H, T ) the best achievable guaranteed performance: the minimal

    regret that can be guaranteed by an admissible policy π ∈ P. In the following sections we study the

    magnitude of R∗S(H, T ) as a function of the information arrival process H.

2.1 Discussion on model assumptions and extensions

    Known mappings. A key modelling assumption in our framework is that the mappings {φk} through

which auxiliary observations may carry information on reward distributions, are a priori known to

    the decision maker. The interpretability of feedback signals for the purpose of forming mean rewards

    estimates is a simplifying modeling assumption that is essential for tractability in many online dynamic

    optimization frameworks, and in particular in the MAB literature.3 Therefore, our model could be

    viewed as extending prior work by relaxing assumptions on the information collection process that are

³Typically in the MAB literature feedback simply consists of the reward observations. However, even when this relation between feedback and rewards distribution is more general, e.g., in Russo et al. (2018), it is assumed to be a priori known.


common in the MAB literature, while offering assumptions on the information structure (and what is

    known about it) that are comparable to what is assumed in existing literature.

    From a practical perspective, in many settings firms might be able to estimate the mappings {φk}

    from existing data. In the context of product recommendations, similar products might exhibit similar

    mappings from the same source of auxiliary information; for example, conversion rates of customers that

    arrive to similar products directly from search might be mapped to conversion rates from recommendations

    through similar mappings. In §6 we use data from a large US media site to evaluate the additional

    value that might be captured by utilizing one specific form of auxiliary information, for which the

    mappings {φk} are unknown upfront, and are estimated from existing data. Finally, the challenge of

    adapting to unknown mappings {φk} is further discussed in §7.

    Extension to contextual MAB. Our model extends the stochastic MAB framework of Lai and

    Robbins (1985) to allow for auxiliary information arrivals. We nevertheless note it can be applied to more

    general frameworks, including ones where connection between auxiliary observations and mean rewards

    is established through the parametric structure of the problem rather than through mappings {φk}. In

    Appendix E we demonstrate this by extending the model of Goldenshluger and Zeevi (2013) where mean

    rewards linearly depend on stochastic context vectors, to derive performance bounds in the presence of

    an unknown information arrival process. This extension also captures additional problem features that

    are relevant to various application domains, such as idiosyncratic consumer characteristics.

    Multiple information arrivals. For simplicity we assume that only one information arrival can occur

    between consecutive time periods for each arm (that is, hk,t ∈ {0, 1} for each time t and arm k). In fact,

all our results hold without any adjustments when there is more than one information arrival per time

    step per arm (allowing any integer values for the entries {hk,t}).

    Endogenous information arrival processes. We focus on a setting where the information arrival

    process (the matrix H) is exogenous and does not depend on the sequence of decisions. In Appendix D

    we study policy design and achievable performance under a broad class of information arrival processes

    that are reactive to the past decisions of the policy; see also related discussion in §7.

    Non-separable mean rewards. For the sake of simplicity we refer to the lower bound ∆ on the

    differences in mean rewards relative to the best arm as a fixed parameter that is independent of the

    horizon length T . This corresponds to the case of separable mean rewards, which is prominent in the

    stochastic MAB literature. Nevertheless, we do not make any explicit assumption on the separability

    of mean rewards and our analysis and results hold for the more general case where ∆ is a decreasing

    function of the horizon length T .


3 Impact on achievable performance

In this section we study the impact that the information arrival process may have on the performance that

    one could aspire to achieve. Our first result formalizes what cannot be achieved, establishing a lower

    bound on the best achievable performance as a function of the information arrival process.

    Theorem 1 (Lower bound on the best achievable performance) For any T ≥ 1 and information

    arrival matrix H, the worst-case regret for any admissible policy π ∈ P is bounded from below as follows:

$$\mathcal{R}^\pi_\mathcal{S}(H, T) \;\ge\; \frac{C_1}{\Delta} \sum_{k=1}^{K} \log\left(\frac{C_2\Delta^2}{K} \sum_{t=1}^{T} \exp\left(-C_3\Delta^2 \sum_{s=1}^{t} h_{k,s}\right)\right),$$

    where C1, C2, and C3 are positive constants that only depend on σ and σ̂.

    The precise expressions of C1, C2, and C3 are provided in the discussion below. Theorem 1 establishes a

    lower bound on the achievable performance in the presence of an information arrival process. This lower

    bound depends on an arbitrary sample path of information arrivals, captured by the elements of the

    matrix H. In that sense, Theorem 1 provides a spectrum of bounds on achievable performances, mapping

    many potential information arrival trajectories to the best performance they allow. In particular, when

    there is no additional information over what is assumed in the classical MAB setting, one recovers a

lower bound of order (K/∆) log T that coincides with the bounds established in Lai and Robbins (1985) and

    Bubeck et al. (2013) for that setting. Theorem 1 further establishes that when additional information is

    available, achievable regret rates may become lower, and that the impact of information arrivals on the

    achievable performance depends on the frequency of these arrivals, but also on the time at which these

    arrivals occur; we further discuss these observations in §3.1.
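To illustrate how Theorem 1 maps an information arrival trajectory to a bound on achievable performance, the following is a minimal numerical sketch that evaluates the lower bound for a given matrix H, plugging in the explicit K-arm expression derived in the proof sketch that follows. The example trajectories and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def theorem1_lower_bound(H, Delta, sigma=0.5, sigma_hat=0.5):
    """Evaluate the K-arm lower bound from the proof sketch of Theorem 1 for a
    given information arrival matrix H of shape (K, T) with 0/1 entries."""
    K, T = H.shape
    cum = np.cumsum(H, axis=1)                                  # sum_{s<=t} h_{k,s}
    inner = (Delta**2 / (sigma**2 * K)) * np.exp(
        -(2 * Delta**2 / sigma_hat**2) * cum).sum(axis=1)
    # The bound can become vacuous (negative) when information is abundant.
    return sigma**2 * (K - 1) / (4 * K * Delta) * np.log(inner).sum()

# Hypothetical trajectories: no auxiliary information vs. frequent early arrivals.
K, T, Delta = 3, 10_000, 0.2
H_none = np.zeros((K, T))
H_early = np.zeros((K, T)); H_early[:, :200] = 1
print(theorem1_lower_bound(H_none, Delta))    # grows like log T
print(theorem1_lower_bound(H_early, Delta))   # flattens out: early information helps
```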

    Key ideas in the proof. The proof of Theorem 1 adapts to our framework ideas of identifying a

    worst-case nature “strategy.” While the full proof is deferred to the appendix, we next illustrate its key

    ideas using the special case of two arms. We consider two possible profiles of reward distributions, ν and

    ν ′, that are “close” enough in the sense that it is hard to distinguish between the two, but “separated”

    enough such that a considerable regret may be incurred when the “correct” profile of distributions is

    misidentified. In particular, we assume that the decision maker is a priori informed that the first arm

generates rewards according to a normal distribution with standard deviation σ and a mean that is

    either −∆ (according to ν) or +∆ (according to ν ′), and that the second arm generates rewards with

normal distribution of standard deviation σ and mean zero. To quantify a notion of distance between the

    possible profiles of reward distributions we use the Kullback-Leibler (KL) divergence. The KL divergence

    between two positive measures ρ and ρ′ with ρ absolutely continuous with respect to ρ′, is defined as:


$$\mathrm{KL}(\rho, \rho') := \int \log\left(\frac{d\rho}{d\rho'}\right) d\rho = \mathbb{E}_\rho \log\left(\frac{d\rho}{d\rho'}(X)\right),$$

    where Eρ denotes the expectation with respect to probability measure ρ. Using Lemma 2.6 from Tsybakov

    (2009) that connects the KL divergence to error probabilities, we establish that at each period t the

    probability of selecting a suboptimal arm must be at least

$$p^{\mathrm{sub}}_t = \frac{1}{4}\exp\left(-\frac{2\Delta^2}{\sigma^2}\left(\mathbb{E}_{\nu,\nu'}\left[\tilde{n}_{1,T}\right] + \sum_{s=1}^{t} \frac{\sigma^2}{\hat{\sigma}^2}\, h_{1,s}\right)\right),$$

where ñ1,t denotes the number of times the first arm is pulled up to time t. Each selection of a suboptimal arm contributes ∆ to the regret, and therefore the cumulative regret must be at least $\Delta \sum_{t=1}^{T} p^{\mathrm{sub}}_t$. We further observe that if arm 1 has a mean reward of −∆, the cumulative regret must also be at least ∆ · Eν,ν′ [ñ1,T ]. Therefore the regret is lower bounded by $\frac{\Delta}{2}\left(\sum_{t=1}^{T} p^{\mathrm{sub}}_t + \mathbb{E}_{\nu,\nu'}[\tilde{n}_{1,T}]\right)$, which is greater than
$$\frac{\sigma^2}{4\Delta} \log\left(\frac{\Delta^2}{2\sigma^2} \sum_{t=1}^{T} \exp\left(-\frac{2\Delta^2}{\hat{\sigma}^2} \sum_{s=1}^{t} h_{1,s}\right)\right).$$
The argument can be repeated by switching arms 1 and 2.

    For K arms, we follow the above lines and average over the established bounds to obtain:

$$\mathcal{R}^\pi_\mathcal{S}(H, T) \;\ge\; \frac{\sigma^2(K-1)}{4K\Delta} \sum_{k=1}^{K} \log\left(\frac{\Delta^2}{\sigma^2 K} \sum_{t=1}^{T} \exp\left(-\frac{2\Delta^2}{\hat{\sigma}^2} \sum_{s=1}^{t} h_{k,s}\right)\right),$$

    which establishes the result.

    3.1 Discussion and subclasses of information arrival processes

    Theorem 1 suggests that auxiliary information arrivals may be leveraged to reduce regret rates, and

    that their impact on the achievable performance increases when information arrives more frequently,

    and earlier. This observation is consistent with the following intuition: (i) at early time periods we

    have collected only few observations and therefore the marginal impact of an additional observation on

    the stochastic error rates is large; and (ii) when information appears early on, there are more future

    opportunities where this information can be used. To emphasize this observation we next demonstrate

    the implications on achievable performance of two information arrival processes of natural interest: a

    process with a fixed arrival rate, and a process with a decreasing arrival rate.

    3.1.1 Stationary information arrival process

    Assume that hk,t’s are i.i.d. Bernoulli random variables with mean λ. Then, for any T ≥ 1 and admissible

    policy π ∈ P, one obtains the following lower bound for the achievable performance:


1. If $\lambda \le \frac{\hat{\sigma}^2}{4\Delta^2 T}$, then
$$\mathbb{E}_H\left[\mathcal{R}^\pi_\mathcal{S}(H, T)\right] \;\ge\; \frac{\sigma^2(K-1)}{4\Delta} \log\left(\frac{(1 - e^{-1/2})\,\Delta^2 T}{K}\right).$$

2. If $\lambda \ge \frac{\hat{\sigma}^2}{4\Delta^2 T}$, then
$$\mathbb{E}_H\left[\mathcal{R}^\pi_\mathcal{S}(H, T)\right] \;\ge\; \frac{\sigma^2(K-1)}{4\Delta} \log\left(\frac{1 - e^{-1/2}}{2\lambda K \sigma^2/\hat{\sigma}^2}\right).$$

    This class includes instances in which, on average, information arrives at a constant rate λ. Analyzing

    these arrival processes reveals two different regimes. When the information arrival rate is low enough,

    auxiliary observations become essentially ineffective, and one recovers the performance bounds that were

    established for the classical stochastic MAB problem. In particular, as long as there are no more than

    order ∆−2 information arrivals over T time periods, this information does not impact the achievable regret

    rates. When ∆ is fixed and independent of the horizon length T , the lower bound scales logarithmically

with T. When ∆ can scale with T, a bound of order √T is recovered when ∆ is of order T−1/2. In

    both cases, there are known policies (such as UCB1) that guarantee rate-optimal performance; for more

    details see policies, analysis, and discussion in Auer et al. (2002).

    On the other hand, when there are more than order ∆−2 observations over T periods, the lower bound

    on the regret becomes a function of the arrival rate λ. When the arrival rate is independent of the

    horizon length T , the regret is bounded by a constant that is independent of T , and a myopic policy

    (e.g., a policy that for the first K periods pulls each arm once, and at each later period pulls the arm

    with the current highest estimated mean reward, while randomizing to break ties) is optimal. For more

    details see sections C.3 and C.4 of the Appendix.
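As an illustration of the two regimes above, the following sketch generates stationary information arrival matrices with the arrival rates used in Figure 2 and compares each rate against the threshold σ̂²/(4∆²T). The parameter values mirror the setup of Figure 2 and are otherwise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, Delta, sigma_hat = 3, 10_000, 0.2, 0.5
threshold = sigma_hat**2 / (4 * Delta**2 * T)       # boundary between the two regimes

for lam in (0.001, 0.01, 0.05):                     # arrival rates from Figure 2
    H = rng.binomial(1, lam, size=(K, T))           # stationary arrivals: h_{k,t} ~ Bernoulli(lam)
    regime = "log T lower bound" if lam <= threshold else "constant (in T) lower bound"
    print(f"lambda={lam}: about {H.sum() / K:.0f} arrivals per arm -> {regime}")
```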

    3.1.2 Diminishing information arrival process

    Assume that hk,t’s are random variables such that for each arm k ∈ K and time step t,

$$\mathbb{E}\left[\sum_{s=1}^{t} h_{k,s}\right] = \left\lfloor \frac{\hat{\sigma}^2 \kappa}{2\Delta^2} \log t \right\rfloor,$$

    for some fixed κ > 0. Then, for any T ≥ 1 and admissible policy π ∈ P, one obtains the following lower

    bounds for the achievable performance:

1. If κ < 1, then:
$$\mathbb{E}_H\left[\mathcal{R}^\pi_\mathcal{S}(H, T)\right] \;\ge\; \frac{\sigma^2(K-1)}{4\Delta} \log\left(\frac{\Delta^2/K\sigma^2}{1-\kappa}\left((T+1)^{1-\kappa} - 1\right)\right).$$


2. If κ > 1, then:
$$\mathbb{E}_H\left[\mathcal{R}^\pi_\mathcal{S}(H, T)\right] \;\ge\; \frac{\sigma^2(K-1)}{4\Delta} \log\left(\frac{\Delta^2/K\sigma^2}{\kappa-1}\left(1 - \frac{1}{(T+1)^{\kappa-1}}\right)\right).$$

    This class includes information arrival processes under which the expected number of information arrivals

    up to time t is of order log t. Therefore, it demonstrates the impact of the timing of information arrivals

    on the achievable performance, and suggests that a constant regret may be achieved when the rate of

    information arrivals is decreasing. Whenever κ < 1, the lower bound on the regret is logarithmic in T ,

    and there are known policies (e.g., UCB1) that guarantee rate-optimal performance. When κ > 1, the

    lower bound becomes a constant, and one may observe that when κ is large enough a myopic policy is

    asymptotically optimal. (In the limit κ→ 1 the lower bound is of order log log T .) For more details see

    sections C.5 and C.6 of the Appendix.
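A minimal sketch of simulating such a diminishing arrival process is given below; the increments are hypothetical Bernoulli draws whose success probabilities decay like 1/t, so that the expected cumulative number of arrivals grows roughly like (σ̂²κ/2∆²) log t.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T, Delta, sigma_hat, kappa = 3, 10_000, 0.2, 0.5, 2.0
C = sigma_hat**2 * kappa / (2 * Delta**2)        # target: E[sum_{s<=t} h_{k,s}] ~ C * log t

# Diminishing arrivals: success probability decays like C / t.
t_grid = np.arange(1, T + 1)
p = np.minimum(1.0, C / t_grid)
H = rng.binomial(1, p, size=(K, T))
print(H.cumsum(axis=1)[:, [99, 999, T - 1]])     # roughly C * log(t) arrivals per arm
```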

    3.1.3 Discussion

    One may contrast the classes of information arrival processes described in §3.1.1 and §3.1.2 by selecting

$\kappa = \frac{2\Delta^2 \lambda T}{\hat{\sigma}^2 \log T}$. Then, in both settings the total number of information arrivals for each arm is λT on average.

    However, while in the first class the information arrival rate is fixed over the horizon, in the second class

    this arrival rate is higher in the beginning of the horizon and decreases over time. The different timing of

the λT information arrivals may lead to different regret rates. For example, selecting $\lambda = \frac{\hat{\sigma}^2 \log T}{\Delta^2 T}$ implies

    κ = 2. Then, the lower bound in §3.1.1 is logarithmic in T (establishing the impossibility of constant

    regret in that setting), but the lower bound in §3.1.2 is constant and independent of T (in §4 we show

    that constant regret is indeed achievable in this setting). This observation conforms to the intuition that

    earlier observations have higher impact on achievable performance, as at early periods there is only little

    information that is available (and therefore the marginal impact of an additional observation is larger),

    and since earlier information can be used for more decisions (as the remaining horizon is longer).4

    The analysis above suggests that effective optimal policy design might depend on the information

    arrival process: while policies that explore (and in that sense are not myopic) may be rate-optimal in

    some cases, a myopic policy that does not explore (except perhaps in a small number of periods in the

⁴The subclasses described in §3.1.1 and §3.1.2 are special cases of the following setting. Let hk,t's be independent random variables such that for each arm k and time t, the expected number of information arrivals up to time t satisfies:
$$\mathbb{E}\left[\sum_{s=1}^{t} h_{k,s}\right] = \lambda T \cdot \frac{t^{1-\gamma} - 1}{T^{1-\gamma} - 1}.$$
While the expected number of total information arrivals for each arm, λT, is determined by the parameter λ, the concentration of arrivals is governed by the parameter γ. When γ = 0 the arrival rate is constant, corresponding to the class described in §3.1.1. As γ increases, information arrivals concentrate in the beginning of the horizon, and γ → 1 leads to $\mathbb{E}\left[\sum_{s=1}^{t} h_{k,s}\right] = \lambda T \frac{\log t}{\log T}$, corresponding to the class in §3.1.2. Then, when λT is of order $T^{1-\gamma}$ or higher, the lower bound is a constant independent of T.


beginning of the horizon) can be rate-optimal in other cases. However, this identification of rate-optimal

    policies relies on prior knowledge of the information arrival process. In the following sections we therefore

    study the adaptation of polices to unknown information arrival processes, in the sense of guaranteeing

    rate-optimality without any prior knowledge on the information arrival process.

    4 Natural adaptation to the information arrival process

    In this section we establish that policies based on posterior sampling and upper confidence bounds

    may naturally adapt to an a priori unknown information arrival process to achieve the lower bound of

    Theorem 1 uniformly over the general class of information arrival processes that we consider.

    4.1 Robustness of Thompson sampling

Consider Thompson sampling with Gaussian priors (Agrawal and Goyal 2012, 2013a). In the following adjustment of this policy, posteriors are updated both after the policy’s actions as well as after the

    auxiliary information arrivals. Denote by nk,t and X̄k,nk,t the weighted number of times a sample from

    arm k has been observed and the weighted empirical average reward of arm k up to time t, respectively:

$$n_{k,t} := \sum_{s=1}^{t-1} \mathbb{1}\{\pi_s = k\} + \sum_{s=1}^{t} \frac{\sigma^2}{\hat{\sigma}^2}\, h_{k,s}, \qquad \bar{X}_{k,n_{k,t}} := \frac{\displaystyle\sum_{s=1}^{t-1} \frac{1}{\sigma^2}\mathbb{1}\{\pi_s = k\}\, X_{k,s} + \sum_{s=1}^{t} \frac{1}{\hat{\sigma}^2}\, Z_{k,s}}{\displaystyle\sum_{s=1}^{t-1} \frac{1}{\sigma^2}\mathbb{1}\{\pi_s = k\} + \sum_{s=1}^{t} \frac{1}{\hat{\sigma}^2}\, h_{k,s}}. \qquad (2)$$

    Thompson sampling with auxiliary observations. Inputs: a tuning constant c.

    1. Initialization: set initial counters nk,0 = 0 and initial empirical means X̄k,0 = 0 for all k ∈ K

    2. At each period t = 1, . . . , T :

    (a) Observe the vectors ht and Zt

(b) Sample θk,t ∼ N(X̄k,nk,t , cσ²(nk,t + 1)⁻¹) for all k ∈ K, and select the arm $\pi_t = \arg\max_{k} \theta_{k,t}$.

    (c) Receive and observe a reward Xπt,t
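The following is a minimal Python sketch of the policy above, assuming (as in the setup of Figure 2) Gaussian rewards and auxiliary observations that are i.i.d. samples from the reward distributions with the identity mapping φk. The weighted counters and empirical means follow (2); all function and variable names are illustrative.

```python
import numpy as np

def thompson_with_aux(mu, H, T, phi=lambda y: y, c=0.5, sigma=0.5, sigma_hat=0.5, seed=0):
    """Sketch of Thompson sampling with auxiliary observations: posteriors are
    updated after the policy's own pulls and after every auxiliary arrival."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    prec = np.zeros(K)   # accumulated precision: 1/sigma^2 per pull, 1/sigma_hat^2 per arrival
    wsum = np.zeros(K)   # precision-weighted sum of observations
    regret = 0.0
    for t in range(T):
        # (a) observe auxiliary arrivals h_t and mapped signals Z_t = h * phi(Y)
        for k in range(K):
            if H[k, t]:
                y = rng.normal(mu[k], sigma_hat)   # assumption: Y_{k,t} ~ N(mu_k, sigma_hat^2)
                prec[k] += 1.0 / sigma_hat**2
                wsum[k] += phi(y) / sigma_hat**2
        # (b) sample theta_k ~ N(Xbar_k, c * sigma^2 / (n_k + 1)) and pick the best sample
        n = sigma**2 * prec                                         # weighted counters n_{k,t} of (2)
        xbar = np.divide(wsum, prec, out=np.zeros(K), where=prec > 0)
        theta = rng.normal(xbar, np.sqrt(c * sigma**2 / (n + 1.0)))
        k_t = int(np.argmax(theta))
        # (c) receive a reward and update the pulled arm's estimates
        x = rng.normal(mu[k_t], sigma)
        prec[k_t] += 1.0 / sigma**2
        wsum[k_t] += x / sigma**2
        regret += mu.max() - mu[k_t]
    return regret

# Example run with the setup of Figure 2 and stationary arrivals at rate 0.01.
mu = np.array([0.7, 0.5, 0.5])
H = np.random.default_rng(1).binomial(1, 0.01, size=(3, 10_000))
print(thompson_with_aux(mu, H, T=10_000))
```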

    The next result establishes that by deploying a Thompson sampling policy that updates posteriors

    after the policy’s actions and whenever auxiliary information arrives, one may guarantee rate optimal

    performance in the presence of unknown information arrival processes.


Theorem 2 (Near optimality of Thompson sampling with auxiliary observations) Let π be

    a Thompson sampling with auxiliary observations policy, with c > 0. For every T ≥ 1, and auxiliary

    information arrival matrix H:

$$\mathcal{R}^\pi_\mathcal{S}(H, T) \;\le\; \sum_{k\in\mathcal{K}\setminus\{k^*\}} \left(\frac{C_4}{\Delta_k} \log\left(\sum_{t=0}^{T} \exp\left(-\frac{\Delta_k^2}{C_4} \sum_{s=1}^{t} h_{k,s}\right)\right) + \frac{C_5}{\Delta_k^3} + C_6\Delta_k\right),$$

for some positive constants C4, C5, and C6 that depend only on c, σ, and σ̂.

    The upper bound in Theorem 2 holds for any arbitrary sample path of information arrivals that is

    captured by the matrix H, and matches the lower bound in Theorem 1 with respect to the dependence

    on the sample path of information arrivals hk,t’s, as well as the time horizon T , the number of arms K,

    and the minimum expected reward difference ∆. This establishes that, by updating posteriors after

    both policy’s actions and auxiliary information arrivals, the Thompson sampling policy guarantees rate

    optimality uniformly over the general class of information arrival processes defined in §2. In particular,

    for any information arrival matrix H, the regret rate guaranteed by Thompson sampling cannot be

    improved by any policy, even when the information arrival process is known upfront.

    When put together, Theorem 1 and Theorem 2 establish the minimax complexity of our problem,

    which is characterized as follows.

    Remark 1 Theorems 1 and 2 together identify the minimax regret rate for the MAB problem with any

    unknown information arrival process H as a function of the entries of H:

$$\mathcal{R}^*_\mathcal{S}(H, T) \;\asymp\; \sum_{k\in\mathcal{K}} \log\left(\sum_{t=1}^{T} \exp\left(-c \cdot \sum_{s=1}^{t} h_{k,s}\right)\right),$$

    where c is a constant that depends on problem parameters such as K, ∆, and σ.

    Remark 1 has a couple of important implications. First, it identifies a spectrum of minimax regret

    rates ranging from the classical regret rates that appear in the stochastic MAB literature when there

    is no auxiliary information, to regret that is uniformly bounded over time when information arrives

    frequently and/or early enough. Furthermore, identifying the minimax complexity provides a yardstick

    for evaluating the performance of policies, and a sharp criterion for identifying rate-optimal ones. This

    already identifies Thompson sampling as rate optimal, but will shortly be used for establishing the

    optimality (or sub-optimality) of other policies.

    Key ideas in the proof. To establish Theorem 2 we decompose the regret associated with each

    suboptimal arm k into three components: (i) Regret from pulling an arm when its empirical mean


deviates from its expectation; (ii) Regret from pulling an arm when its empirical mean does not deviate

    from its expectation, but θk,t deviates from the empirical mean; and (iii) Regret from pulling an arm

    when its empirical mean does not deviate from its expectation and θk,t does not deviate from the empirical

mean. Following the analysis in Agrawal and Goyal (2013b), one may use concentration inequalities such as the Chernoff-Hoeffding bound to bound the cumulative expressions of cases (i) and (iii) uniformly over time. Case (ii) can be analyzed by considering an arrival process with an arrival rate that decreases

    exponentially with each arrival. We bound the total number of arrivals of this process (Lemma 3) and

    establish that this suffices to bound the cumulative regret associated with case (ii).

    4.2 Robustness of UCB

We next consider a UCB1 policy (Auer et al. 2002) in which, in addition to updating mean rewards and

    observation counters after the policy’s actions, these are also updated after auxiliary observations. We

    denote by nk,t and X̄k,nk,t the observation counters and mean reward estimates as defined in (2).

    UCB1 with auxiliary observations. Inputs: a constant c.

    1. At each period t = 1, . . . , T :

    (a) Observe the vectors ht and Zt

    (b) Select the arm

$$\pi_t = \begin{cases} t & \text{if } t \le K \\[4pt] \arg\max_{k \in \mathcal{K}} \left\{ \bar{X}_{k,n_{k,t}} + \sqrt{\dfrac{c\sigma^2 \log t}{n_{k,t}}} \right\} & \text{if } t > K \end{cases}$$

(c) Receive and observe a reward $X_{\pi_t,t}$
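A minimal Python sketch of the index computation above follows; the function and argument names are illustrative, and the counters and empirical means are assumed to be maintained as before, i.e., updated after both the policy's own pulls and any reward-scale auxiliary observations.

```python
import numpy as np

def ucb1_aux_select_arm(t, counts, means, c=2.5, sigma=1.0):
    """Arm selection for UCB1 with auxiliary observations (illustrative sketch).

    t is the 1-based period index and K = len(counts); counts and means are
    assumed to include both the policy's own observations and the mapped
    auxiliary observations, so counts[k] >= 1 whenever t > K.
    """
    K = len(counts)
    if t <= K:
        return t - 1                                 # pull each arm once (0-based index)
    bonus = np.sqrt(c * sigma ** 2 * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```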

The next result establishes that by updating counters and reward estimates after both the policy’s actions

    and auxiliary information arrivals, the UCB1 policy guarantees rate-optimal performance over the class

    of general information arrival processes defined in §2.

    Theorem 3 Let π be UCB1 with auxiliary observations, tuned by c > 2. Then, for any T ≥ 1, K ≥ 2,

    and additional information arrival matrix H:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}} \left( \frac{C_7}{\Delta_k^2} \log\left( \sum_{t=0}^{T} \exp\left( -\frac{\Delta_k^2}{C_7} \sum_{s=1}^{t} h_{k,s} \right) \right) + C_8 \Delta_k \right),$$

where C7 and C8 are positive constants that depend only on σ and σ̂.

    The upper bound in Theorem 3 holds for any arbitrary sample path of information arrivals that is

captured by the matrix H. It establishes that, similarly to Thompson sampling, when accounting for the auxiliary observations as suggested above, UCB guarantees rate optimality uniformly over the general

    class of information arrival processes defined in §2.

    Key ideas in the proof. The proof adjusts the analysis of UCB1 in Auer et al. (2002). Pulling a

suboptimal arm k at time step t implies that at least one of the following three events occurs: (i) the

    empirical average of the best arm deviates from its mean; (ii) the empirical mean of arm k deviates

    from its mean; or (iii) arm k has not been pulled sufficiently often in the sense that

$$\tilde{n}_{k,t-1} \;\le\; \hat{l}_{k,t} - \sum_{s=1}^{t} \frac{\sigma^2}{\hat{\sigma}^2}\, h_{k,s},$$

where $\hat{l}_{k,t} = \dfrac{4c\sigma^2 \log(\tau_{k,t})}{\Delta^2}$ with $\tau_{k,t} := \sum_{s=1}^{t} \exp\left( \dfrac{\Delta_k^2}{4c\hat{\sigma}^2} \sum_{\tau=s}^{t} h_{k,\tau} \right)$, and $\tilde{n}_{k,t-1}$ is the number of times arm k is

    pulled up to time t. The probability of the first two events can be bounded using Chernoff-Hoeffding

    inequality, and the probability of the third one can be bounded using:

$$\sum_{t=1}^{T} \mathbb{1}\left\{ \pi_t = k,\; \tilde{n}_{k,t-1} \le \hat{l}_{k,t} - \sum_{s=1}^{t} \frac{\sigma^2}{\hat{\sigma}^2}\, h_{k,s} \right\} \;\le\; \max_{1\le t\le T} \left\{ \hat{l}_{k,t} - \sum_{s=1}^{t} \frac{\sigma^2}{\hat{\sigma}^2}\, h_{k,s} \right\}.$$

    Therefore, we establish that for c > 2,

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}\setminus\{k^*\}} \frac{C_7}{\Delta_k^2} \cdot \max_{1\le t\le T} \log\left( \sum_{m=1}^{t} \exp\left( \frac{\Delta_k^2}{C_7} \sum_{s=m}^{t} h_{k,s} - \frac{\Delta_k^2}{C_7} \sum_{s=1}^{t} h_{k,s} \right) \right) + C_8 \Delta_k.$$

    Numerical analysis. We further evaluate the performance of Thompson sampling and UCB policies

    through numerical experiments that are detailed in Appendix F and include various selections of tuning

    parameters and information arrival scenarios. Results are consistent with the ones depicted in Figure 2

    across different settings. While optimal values for the tuning parameters are case dependent, we note

    that, in general, for both Thompson sampling and UCB, appropriate values for the parameter c decrease

    as the amount of auxiliary information that is realized over time grows large.

    4.3 Discussion

Theorems 2 and 3 establish that Thompson sampling and UCB policies guarantee the best achievable

    performance (up to some multiplicative constant) uniformly over the class of information arrival processes

    defined in §2, that is, under any arbitrary information arrival process captured by the matrix H. For

    example, upper bounds that match the lower bounds established for the settings in §3.1.1 and §3.1.2 for

    any values of λ and κ are given in Appendix C.2 (see Corollaries 1 and 2).

    Our results imply that Thompson sampling and UCB possess remarkable robustness with respect to


the information arrival process: without any adjustments they guarantee the best achievable performance,

    and furthermore, guarantee a regret rate that cannot be improved even when the information arrival

    process is known upfront. These results therefore uncover a novel property of these policies, and generalize

    the class of problems to which they have been applied.

Nevertheless, policies in which the exploration rate is exogenous, such as forced sampling and εt-

    greedy-type policies, do not exhibit such robustness (as we will observe in §5.1).5 In order to study how

    policies with exogenous exploration rate can be adjusted to account for arrival of auxiliary information,

we devote §5 to devising a novel virtual time indices method for endogenizing the effective exploration rate. We apply this method for designing εt-greedy-type policies that, without any prior knowledge on

    the information arrival process, attain the best performance that is achievable when the information

    arrival process is a priori known.

    5 Endogenizing exploration via virtual time indices

    5.1 Policy design intuition

In this subsection we first demonstrate, through the case of the εt-greedy policy, that policies with exogenous

    exploration rate may fail to achieve the lower bound in Theorem 1 in the presence of auxiliary information

    arrivals, and then develop intuition for how such policies may be adjusted to better leverage auxiliary

    information through the virtual time indices method. Formal policy design and analysis follow in §5.2.

The inefficiency of “naive” adaptations of εt-greedy. Despite the robustness that was established

    in §4 for Thompson sampling and UCB policies, in general, accounting for auxiliary information while

    otherwise maintaining the policy structure may not suffice for achieving the lower bound established

in Theorem 1. To demonstrate this, consider the εt-greedy policy (Auer et al. 2002), which at each period t selects an arm randomly with probability εt that is proportional to $t^{-1}$, and with probability 1 − εt selects the arm with the highest reward estimate. Without auxiliary information, εt-greedy

    guarantees rate-optimal regret of order log T , but this optimality does not carry over to settings with

    other information arrival processes: as was visualized in Figure 3, using auxiliary observations to update

    estimates without appropriately adjusting the exploration rate leads to sub-optimality.6 For example,

5 One advantage of policies with exogenous exploration is that they allow the collection of independent samples (without inner-sample correlation) and thus facilitate statistical inference based on standard methods; see, e.g., the analysis in Nie et al. (2017) and the empirical study in Villar et al. (2015). For discussion on other appealing features of such policies, such as generality and practicality, see, e.g., Kveton et al. (2018).

6 A natural alternative is to increase the time index by one (or any other constant) each time auxiliary information arrives. Such an approach, which could be implemented by arm-dependent time indices with the update rule $\tau_{k,t} = \tau_{k,t-1} + 1 + h_{k,t}$, may lead to sub-optimality as well. For example, suppose $h_{k,1} = \left\lfloor \frac{C}{\Delta_k^2} \log T \right\rfloor$ for some constant C > 0, and $h_{k,t} = 0$ for all t > 1. By Theorems 2 and 3, when the constant C is sufficiently large, constant regret is achievable. However, following the above update rule, the regret incurred due to exploration is at least of order $\log\!\left(\frac{\Delta_k^2 T}{\log T}\right)$.


consider stationary information arrivals (described in §3.1.1), with an arrival rate $\lambda \ge \frac{\hat{\sigma}^2}{4\Delta^2 T}$. While constant regret is achievable in this case, without adjusting its exploration rate, εt-greedy explores with

    suboptimal arms at a rate that is independent of the number of auxiliary observations, and therefore

    still incurs regret of order log T .

    Over-exploration in the presence of additional information. For simplicity, consider a 2-armed

bandit problem with µ1 > µ2. At each time t the εt-greedy policy explores over arm 2 independently with probability $\epsilon_t = ct^{-1}$ for some constant c > 0. As a continuous-time proxy for the minimal number of times arm 2 is selected by time t, consider the function

$$\int_{s=1}^{t} \epsilon_s\, ds \;=\; c \int_{s=1}^{t} \frac{ds}{s} \;=\; c \log t.$$

    The probability of best-arm misidentification at period t can be bounded from above by:7

$$\exp\left( -\bar{c} \int_{s=1}^{t} \epsilon_s\, ds \right) \;\le\; \tilde{c}\, t^{-1},$$

for some constants c̄, c̃ > 0. Thus, setting $\epsilon_t = ct^{-1}$ balances losses from exploration and exploitation.8

    Next, assume that just before time t0, an additional independent reward observation of arm 2 is

    collected. Then, at time t0, the minimal number of observations from arm 2 increases to (1 + c log t0),

    and the upper bound on the probability of best-arm misidentification decreases by factor e−c̄:

$$\exp\left( -\bar{c}\left( 1 + \int_{s=1}^{t} \epsilon_s\, ds \right) \right) \;\le\; e^{-\bar{c}} \cdot \tilde{c}\, t^{-1}, \qquad \forall\, t \ge t_0.$$

    Therefore, when there are many auxiliary observations, the loss from best-arm misidentification is

    guaranteed to diminish, but performance might still be sub-optimal due to over-exploration.

    Endogenizing the exploration rate. Note that there exists a future time period t̂0 ≥ t0 such that:

$$1 + c \int_{1}^{t_0} \frac{ds}{s} \;=\; c \int_{1}^{\hat{t}_0} \frac{ds}{s}. \tag{3}$$

    In words, the minimal number of observations from arm 2 by time t0, including the one that arrived just

    before t0, equals (in expectation) the minimal number of observations from arm 2 by time t̂0 without

any additional information arrivals. Therefore, replacing $\epsilon_{t_0} = c\, t_0^{-1}$ with $\epsilon_{t_0} = c\, \hat{t}_0^{-1}$ would adjust the

    exploration rate to fit the amount of information actually collected at t = t0, and the respective loss from


7 This upper bound on the probability of misidentifying the best arm can be obtained using standard concentration inequalities and is formalized, for example, in Step 5 of the proof of Theorem 4.

8 Design of exploration schedules that balance losses from experimenting with suboptimal arms and from misidentification of the best arm is common; see, e.g., related discussions in Auer et al. (2002), Langford and Zhang (2008), Goldenshluger and Zeevi (2013), and Bastani and Bayati (2015).


Figure 4: Losses from exploration and exploitation (normalized through division by (µ1 − µ2)) when an additional observation is collected just before time t0, with and without replacing the standard time index t with a virtual time index τ(t). With a standard time index, exploration is performed at rate $ct^{-1}$, which results in sub-optimality after time t0. With a virtual time index that is advanced at time t0, the exploration rate becomes $c(\tau(t))^{-1}$, which re-balances losses from exploration and exploitation (dashed lines coincide).

    exploitation. The regret reduction by this adjustment is illustrated in Figure 5. We therefore adjust

the exploration rate to be $\epsilon_t = c(\tau(t))^{-1}$ for some virtual time τ(·). We set τ(t) = t for all t < t0, and

    τ(t) = t̂0 + (t− t0) for t ≥ t0. Solving (3) for t̂0, we write τ(t) in closed form:

$$\tau(t) = \begin{cases} t & t < t_0 \\ c_0\, t_0 + (t - t_0) & t \ge t_0, \end{cases}$$

for some constant $c_0 > 1$. Therefore, the virtual time grows together with t, and is advanced by a multiplicative constant whenever an auxiliary observation is collected.
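As a sanity check on this closed form (a step not spelled out above), one may solve (3) directly under the schedule $\epsilon_t = ct^{-1}$:

$$1 + c\log t_0 = c\log \hat{t}_0 \quad\Longrightarrow\quad \hat{t}_0 = e^{1/c}\, t_0,$$

so one may take $c_0 = e^{1/c} > 1$; this mirrors the exponential multiplicative factor by which the virtual time indices are advanced in the policy of §5.2.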

5.2 A rate-optimal adaptive εt-greedy policy

We apply the ideas discussed in §5.1 to design an εt-greedy policy with adaptive exploration that

    dynamically adjusts the exploration rate in the presence of an unknown information arrival process. For

simplicity, and consistent with standard versions of εt-greedy (see, e.g., Auer et al. 2002), the policy

    below assumes prior knowledge of the parameter ∆. In Appendix C.1 we provide an approach for

designing an εt-greedy policy that estimates ∆ throughout the decision process, and establish performance

    guarantees for this policy in some special cases.

Define nk,t and X̄k,nk,t as in (2), and consider the following adaptation of the εt-greedy policy.

εt-greedy with adaptive exploration. Input: a tuning parameter c > 0.

    1. Initialization: set initial virtual times τk,0 = 0 for all k ∈ K

    2. At each period t = 1, 2, . . . , T :



    (a) Observe the vectors ht and Zt, and update virtual time indices for all k ∈ K:

$$\tau_{k,t} = (\tau_{k,t-1} + 1) \cdot \exp\left( \frac{h_{k,t}\, \Delta^2}{c\hat{\sigma}^2} \right)$$

(b) With probability $\min\left\{ 1,\; \frac{c\sigma^2}{\Delta^2} \sum_{k'=1}^{K} \tau_{k',t}^{-1} \right\}$ select an arm at random: (exploration)

$$\pi_t = k \quad \text{with probability} \quad \frac{\tau_{k,t}^{-1}}{\sum_{k'=1}^{K} \tau_{k',t}^{-1}}, \qquad \text{for all } k \in \mathcal{K}$$

Otherwise, select an arm with the highest estimated reward: (exploitation)

$$\pi_t = \arg\max_{k\in\mathcal{K}} \bar{X}_{k,n_{k,t}}$$

(c) Receive and observe a reward $X_{\pi_t,t}$
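A minimal Python sketch of this policy follows. The class and method names are illustrative; the gap ∆ and the variance proxies σ and σ̂ are assumed known, as in the pseudo-code, auxiliary observations are assumed already mapped to the reward scale, and step (a) must be invoked every period (possibly with zero arrivals) before step (b).

```python
import numpy as np

class AdaptiveEpsilonGreedy:
    """Sketch of epsilon_t-greedy with adaptive exploration via virtual time indices."""

    def __init__(self, n_arms, delta, c=16.0, sigma=1.0, sigma_hat=1.0, rng=None):
        self.delta, self.c = delta, c
        self.sigma, self.sigma_hat = sigma, sigma_hat
        self.tau = np.zeros(n_arms)          # virtual time indices tau_{k,t}
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)
        self.rng = rng if rng is not None else np.random.default_rng()

    def _update(self, arm, value):
        self.counts[arm] += 1
        self.means[arm] += (value - self.means[arm]) / self.counts[arm]

    def start_period(self, h, mapped_values):
        # Step (a): h[k] counts auxiliary observations of arm k since the last period
        # (possibly all zeros); mapped_values[k] lists their reward-scale values.
        # Virtual times advance by +1 and by an exponential factor per observation.
        h = np.asarray(h, dtype=float)
        self.tau = (self.tau + 1.0) * np.exp(h * self.delta ** 2 / (self.c * self.sigma_hat ** 2))
        for k, values in enumerate(mapped_values):
            for v in values:
                self._update(k, v)

    def select_arm(self):
        # Step (b): explore with probability min{1, (c*sigma^2/Delta^2) * sum_k 1/tau_k},
        # choosing arm k with probability proportional to 1/tau_k; otherwise exploit.
        inv_tau = 1.0 / self.tau
        explore_prob = min(1.0, self.c * self.sigma ** 2 / self.delta ** 2 * inv_tau.sum())
        if self.rng.random() < explore_prob:
            return int(self.rng.choice(len(self.tau), p=inv_tau / inv_tau.sum()))
        return int(np.argmax(self.means))

    def observe_reward(self, arm, reward):
        # Step (c).
        self._update(arm, reward)
```

Note that with h = 0 the virtual times simply advance by one each period, so in the absence of auxiliary information the sketch reduces to a standard εt-greedy exploration schedule.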

At every period t, the εt-greedy with adaptive exploration policy dynamically reacts to the information

    sample path by advancing virtual time indices associated with different arms based on auxiliary

    observations that were collected since the last period. Then, the policy explores with probability that is

proportional to $\sum_{k'=1}^{K} \tau_{k',t}^{-1}$, and otherwise pulls the arm with the highest empirical mean reward.

    Every time additional information on arm k is observed, a carefully selected multiplicative factor is

    used to advance the virtual time index τk,t according to the update rule τk,t = (τk,t−1 + 1) · exp (δ · hk,t),

    for some suitably selected δ. In doing so, the policy effectively reduces exploration rates in order to

    explore over each arm k at a rate that would have been appropriate without auxiliary information arrivals

    at a future time step τk,t. This guarantees that the loss due to exploration is balanced with the loss due

    to best-arm misidentification throughout the horizon. Advancing virtual times based on the information

sample path and the impact on the exploration rate of a policy are illustrated in Figure 5.

Figure 5: Illustration of the adaptive exploration approach. (Left) Virtual time index τ(·) is advanced using multiplicative factors whenever auxiliary information arrives. (Right) Exploration rate decreases as a function of τ(·), exhibiting discrete “jumps” whenever auxiliary information is collected.


The following result characterizes the guaranteed performance and establishes the rate optimality of

εt-greedy with adaptive exploration in the presence of unknown information arrival processes.

Theorem 4 (Near optimality of εt-greedy with adaptive exploration) Let π be an εt-greedy with adaptive exploration policy, tuned by $c > \max\left\{ 16, \frac{10\Delta^2}{\sigma^2} \right\}$. Then, for every T ≥ 1, and for any information arrival matrix H, one has:

$$\mathcal{R}^{\pi}_{\mathcal{S}}(H,T) \;\le\; \sum_{k\in\mathcal{K}} \Delta_k \left( \frac{C_9}{\Delta^2} \log\left( \sum_{t=t^*+1}^{T} \exp\left( -\frac{\Delta^2}{C_9} \sum_{s=1}^{t} h_{k,s} \right) \right) + C_{10} \right),$$

where C9 and C10 are positive constants that depend only on σ and σ̂.

    The upper bound in Theorem 4 holds for any arbitrary sample path of information arrivals that is

    captured by the matrix H. This establishes that, similarly to Thompson sampling and UCB policies

(but unlike the standard εt-greedy policy), the virtual time indices method, as applied in the εt-greedy

    with adaptive exploration policy, leads to rate optimality uniformly over the general class of information

    arrival processes defined in §2.

Furthermore, we note that the virtual time indices method can also be applied to other, more general MAB settings. In Appendix E we demonstrate this by extending the contextual MAB model of

    Goldenshluger and Zeevi (2013) where mean rewards linearly depend on stochastic context vectors, and

    applying the virtual time indices method to derive improved performance bounds in the presence of

    unknown information arrival processes.

    Key ideas in the proof. The proof decomposes regret into exploration and exploitation time periods.

To bound the regret at exploration time periods, we express the virtual time indices as

$$\tau_{k,t} = \sum_{s=1}^{t} \exp\left( \frac{\Delta^2}{c\hat{\sigma}^2} \sum_{\tau=s}^{t} h_{k,\tau} \right).$$

    Denoting by tm the time step at which the mth auxiliary observation for arm k was collected, we establish

    an upper bound on the expected number of exploration time periods for arm k in the time interval

[tm, tm+1 − 1], which scales linearly with $\frac{c\sigma^2}{\Delta^2}\log\!\left(\frac{\tau_{k,t_{m+1}}}{\tau_{k,t_m}}\right) - 1$. Summing over all values of m, we obtain

    that the regret over exploration time periods is bounded from above by

$$\sum_{k\in\mathcal{K}} \Delta_k \cdot \frac{c\sigma^2}{\Delta^2} \log\left( 2\sum_{t=0}^{T} \exp\left( -\frac{\Delta^2}{c\hat{\sigma}^2} \sum_{s=1}^{t} h_{k,s} \right) \right).$$

To analyze regret at exploitation time periods, we first lower bound the number of observations of each arm using Bernstein's inequality, and then apply the Chernoff-Hoeffding inequality to bound the probability that

    a sub-optimal arm would have the highest estimated reward, given the minimal number of observations

on each arm. When $c > \max\left\{ 16, \frac{10\Delta^2}{\sigma^2} \right\}$, this regret component decays at a rate of at most $t^{-1}$.

Numerical analysis. Appendix F also includes numerical analysis of the adjusted εt-greedy policy

    across different selections of the tuning parameter and various information arrival scenarios, relative to

    other policies (including, for example, a more “naive” version of this policy that updates the empirical

    means and the reward observation counters upon auxiliary information arrivals but does not update

    the time indices as prescribed). We also tested for robustness with respect to misspecification of the

gap parameter ∆, and evaluated the performance of an εt-greedy policy that is designed to estimate the

    gap ∆ throughout the decision process (this policy is provided in Appendix C.1 along with performance

    guarantees that are established in some special cases).

    6 Empirical proof of concept using content recommendations data

    To demonstrate the value that may be captured by leveraging auxiliary information, we use data from a

    large US media site to empirically evaluate achievable performance when recommending articles that

    have unknown impact on the future browsing path of readers. We next provide some relevant context on

    content recommendation services, followed by the setup and the results of our analysis.

Background. Content recommendations, which point readers to content they “may like,” are a dynamic

    service provided by media sites and third-party providers, with the objective of increasing readership

    and revenue streams for media sites. In that context, Besbes et al. (2016) validate the necessity of two

    performance indicators that are fundamental for recommendations that guarantee good performance

    along the reading path of readers: (i) the likelihood of a reader to click on a recommendation, and (ii)

    when clicking on it, the likelihood of the reader to continue and consume additional content afterwards.

    The former indicator is the click-through rate (CTR) of the recommendation, and the latter indicator

    can be viewed as the conversion rate (CVR) of the recommendation.9

A key operational challenge on which we focus here is that the likelihood of readers to continue

    consuming content after reading an article is affected by features of that article, such as length and

    number of photos, that may change several times after the article’s release. Changes in article features

may take place with the purpose of increasing readership, or due to editing the content itself; for example, changes in the structure of news articles are very common during the first few days after their

9 While CTR is a common performance indicator in online services, future path has been increasingly recognized as another key performance indicator in dynamic services that continue after the first click. In the context of content recommendations, a field experiment in Besbes et al. (2016) found that recommendation methods that account for the future reading path of users can increase clicks per visit by 10%.


release. In both cases, it is common that these changes occur after estimates for click-through rates are

    already formed (which is typically the case a few hours after the article’s release).

    Our setup will describe sequential experiments designed with the goal of evaluating how structural

    changes in a given article impact the likelihood of readers to continue consuming content after reading it

    (the CVR associated with recommending that article). Using this setup, we will simulate the extent to

    which the performance of such experiments could be improved when utilizing auxiliary observations,

    available in the form of the browsing path of readers that arrived to these articles directly from external

    search (and not by clicking a recommendation).

    Our data set includes a list of articles, and consists of: (i) times at which these articles were

    recommended and the in-site browsing path that followed these recommendations; and (ii) times at

    which readers arrived to these articles directly from external search engines (such as Google) and the

    browsing path that followed these visits.

    Setup. While we defer the complete setup description to Appendix G, we next provide its key elements.

    Rather than which article to recommend, the problem we aim to analyze here is which version of the

    article to recommend. For that purpose, for each article in our data we considered a one-armed bandit

    setting with a known outside option to simulate experimentation with a new version of that article.

    Each one-armed bandit experiment was based on one day of data associated with the article at hand.

    We constructed the decision horizon t = 1, 2, . . . , 2000 based on the first 2,000 time-stamps of that day

    at which the article was recommended from the highest position (out of 5 recommendations that are

presented on each page). For a fixed article a and day d, we denote by CTRa,d (click-through rate) the

    fraction of occasions where readers clicked on a recommendation pointing to that article during day d,

    out of all occasions where article a was recommended in first position, and by CVRrecoma,d (conversion

    rate) the fraction of occasions where readers continued to read another article after arriving to article a

    by clicking on a content recommendation during day d.

    We assume that the click-through rates CTRa,d are known and independent of the structure of the

    article.10 However, we assume that conversion rates depend on the structure of the article itself, which

    is subject to modifications. We denote by CVRrecoma,d,0 the known conversion rate of article a at its current

    structure, and by CVRrecoma,d,1 a new (unknown) conversion rate that corresponds to a new structure that

    is under consideration for article a. We assume that the new conversion rate CVRrecoma,d,1 can be either a

    bit higher or a bit lower than the old conversion rate CVRrecoma,d,0 .

    We assume that each time the article is recommended, the publisher can choose between recommending

10 This is a realistic assumption as these rates are driven mainly by the title of the article, which is visible to the reader in the recommendation, and typically does not change over time.


its current or new form. We abstract away from idiosyncratic reader characteristics and potential

combinatorial externalities associated with other links or ads, and adopt the objective of maximizing

    the (revenue-normalized) one-step lookahead recommendation value, which was shown by Besbes et al.

    (2016) to be an effective approximation of the recommendation value. Then, the problem of which

    version of the article to recommend is reduced to a 1-armed bandit problem with a known outside

option: Recommending version k ∈ {0, 1} of the article generates a random payoff $X^{(1)}_{k,t}\left(1 + X^{(2)}_{k,t}\right)$, where $X^{(1)}_{k,t} \sim \mathrm{Ber}(\mathrm{CTR}_{a,d})$ and $X^{(2)}_{k,t} \sim \mathrm{Ber}(\mathrm{CVR}^{\mathrm{recom}}_{a,d,k})$.
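As an illustration of this reward structure, the Python snippet below draws a few payoffs for a hypothetical article version; the CTR and CVR values are made up for the example and are not taken from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_payoff(ctr, cvr, rng):
    """One payoff draw X1 * (1 + X2): X1 ~ Ber(CTR) is the click indicator,
    and X2 ~ Ber(CVR) indicates continued reading after the click."""
    x1 = float(rng.random() < ctr)
    x2 = float(rng.random() < cvr)
    return x1 * (1.0 + x2)

# Hypothetical rates for one article/day: incumbent version k = 0 with a known
# conversion rate, and a new version k = 1 whose conversion rate is unknown.
ctr = 0.04
cvr = {0: 0.30, 1: 0.33}
print([draw_payoff(ctr, cvr[1], rng) for _ in range(5)])
```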

    Auxiliary information. While the observations above are drawn randomly, the matrix H and the

    sample path of auxiliary observations are fixed and determined from the data as follows. In each

one-armed bandit experiment the matrix H is reduced to a vector $\{h_{1,t}\}_{t=1}^{T}$. For each decision epoch t,

    the entry h1,t represents the number of readers that arrived to the article from an external search engine

    between epochs t− 1 and t (that is, between the time-stamps that correspond with these epochs).

    We note that the information arrival process in our data often includes two or more observations

    between consecutive decision epochs. It is therefore important to highlight that, as discussed in §2.1,

    the performance bounds established in this paper hold without any adjustment for any integer values

    assigned to the entries of the matrix H. For each epoch t and arrival-from-search m ∈ {1, . . . , h1,t} we

    denote by Y1,t,m ∈ {0, 1} the indicator of whether the reader continued to read another article after

    visiting the article at hand. We denote by CVRsearcha,d the fraction of readers that continued to another

    article after arriving to article a from search during day d. We define:

$$\alpha_{a,d} := \frac{\mathrm{CVR}^{\mathrm{recom}}_{a,d}}{\mathrm{CVR}^{\mathrm{search}}_{a,d}}$$

to be the ratio between the conversion rate of users that arrive to article a by clicking on a recommendation and that of users that arrive to it from search. Conversion rates of readers that click on a recommendation

    are typically higher than conversion rates of readers that arrive directly from search.11 In our data,

    values of αa,d are in the range [1, 16].

    As the CTR is identical across k ∈ {0, 1}, it suffices to focus on the conversion feedback with

    Z1,t,m = αa,dYk,t,m; see Appendix G for more details. Then, one may observe that the auxiliary

    observations {Y1,t,m} can be mapped to reward observations using a mapping φ defined by the class of

    linear mappings from Example 1, with β1 = 0, and α1 = αa,d.

    Estimating the mapping to reward observations. To demonstrate the estimation of φ from existing

data, we construct an estimator of the ratio αa,d based on αa,d−1, the ratio of the two conversion

11 See related discussions in Besbes et al. (2016) on experienced versus inexperienced readers, and in Caro and Martínez-de Albéniz (2020) on followers versus new readers.


rates from the previous day. Note that for each a and d, up to a multiplicative constant, αa,d is a ratio of two binomial random variables. As such, αa,d−1 is not an unbiased estimator of αa,d. However, it was shown that the distribution of such ratios can be approximated using a log-normal distribution (see, e.g., Katz et al. 1978). Furthermore, since the ratio of two log-normal random variables also

    has a log-normal distribution, we assume that (αa,d−1/αa,d) is a log-normal random variable, that is,

$\alpha_{a,d-1} = \alpha_{a,d} \cdot \exp(\tilde{\sigma} W)$ for a standard normal random variable W ∼ N(0, 1) and a global parameter σ̃ > 0 that we estimate from the data. Then, one obtains the following unbiased estimator of αa,d:

$$\hat{\alpha}_{a,d} = \alpha_{a,d-1} \cdot \exp\left(-\tilde{\sigma}^2/2\right).$$
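The bias correction above amounts to a single multiplicative factor; a small Python sketch follows, with the previous-day ratio chosen arbitrarily for illustration.

```python
import numpy as np

def corrected_alpha(alpha_prev_day, sigma_tilde_sq):
    """Bias-corrected estimate of alpha_{a,d} from the previous day's ratio,
    under alpha_{a,d-1} = alpha_{a,d} * exp(sigma_tilde * W) with W ~ N(0, 1)."""
    return alpha_prev_day * np.exp(-sigma_tilde_sq / 2.0)

# With sigma_tilde^2 of about 0.53 (the value estimated from the data), the correction
# factor is exp(-0.265), roughly 0.77; e.g., a previous-day ratio of 4.0 shrinks to about 3.07.
print(corrected_alpha(4.0, 0.53))
```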

Results. We compare the performance of a Thompson sampling policy that ignores auxiliary observations with that achieved when utilizing auxiliary observations (see §4.2) based on the estimates α̂. The right-hand side of Figure 6 depicts the fitted log-normal distribution for the ratio

    α̂/α, with the estimated parameter σ̃2 ≈ 0.53. This implies that practical estimators for the mapping φ

    could be computed based on the previous day, with reasonable estimation errors. We note that these

    errors could be significantly reduced using additional data and more sophisticated estimation methods.

Figure 6: (Left) Performance of a “standard” Thompson Sampling that ignores auxiliary observations relative to one that utilizes auxiliary observations (see §4.2), both tuned by c = 0.25. Each point corresponds to the (normalized) mean cumulative regret averaged over 200 replications for a certain article on a certain day, and the fixed sample path of auxiliary observations for that article on that day. (Right) Histogram of log(αa,d−1/αa,d), with the fitted normal distribution suggesting $\alpha_{a,d-1} \sim \mathrm{Lognormal}(\log \alpha_{a,d}, \tilde{\sigma}^2)$ with $\tilde{\sigma}^2 \approx 0.53$.

    The performance comparison appears on the left side of Figure 6. Each point on the scatter plot

    corresponds to the normalized long-run average regret incurred for a certain article on a certain day, by

    a Thompson sampling policy, either with or without utilizing auxiliary information. Without utilizing

auxiliary observations, the performance is independent of the information arrival process. When utilizing auxiliary information, performance is comparable to that of ignoring auxiliary information when there are very few auxiliary observations, and improves significantly when the amount of auxiliary information


increases. The main sources of variability in performance are (i) variability in the baseline conversion rates

    CVRrecoma,d , affecting both versions of the algorithm; and (ii) variability in the estimation quality of α,

    affecting performance only when utilizing auxiliary observations. Utilizing auxiliary information arrivals

led to a regret reduction of 27.8% on average. We note that this performance improvement is achieved

    despite (i) suffering occasionally from inaccuracies in the estimate of the mapping φ(·), and (ii) tuning

    both algorithms by the value c that optimizes the performance of the standard Thompson sampling.

    7 Concluding remarks and extensions

    In this paper we considered an extension of the multi-armed bandit framework, allowing for unknown

    and arbitrary auxiliary information arrival processes. We studied the impact of the information arrival

    process on the performance that can be achieved. Through matching lower and upper bounds we

    identified the minimax (regret) complexity of this class of problems as a function of the information

    arrival process, which provides a sharp criterion for identifying rate-optimal policies.

    We established that Thompson sampling and UCB policies can be leveraged to uniformly achieve rate-

    optimality and, in that sense, possess natural robustness to the information arrival process. Moreover,

    the regret rate that is guaranteed by these policies cannot be improved even with prior knowledge of

    the information arrival process. This uncovers a novel property of these popular families of policies and

    further lends credence to their appeal. We also observed that policies with exogenous exploration rate

    do not possess such robustness. For such policies, we devised a novel virtual time indices method to

    endogenously control the exploration rate to guarantee rate-optimality. Using content recommendations

    data, we demonstrated the value that can be captured in practice by leveraging auxiliary information.

    We next discuss a couple of model extensions and directions for future research.

    Reactive information arrival processes. In Appendix D we analy