
Private and Byzantine-Proof Cooperative Decision-Making

Abhimanyu Dubey
Massachusetts Institute of Technology
dubeya@mit.edu

Alex Pentland
Massachusetts Institute of Technology
[email protected]

ABSTRACT
The cooperative bandit problem is a multi-agent decision problem involving a group of agents that interact simultaneously with a multi-armed bandit, while communicating over a network with delays. The central idea in this problem is to design algorithms that can efficiently leverage communication to obtain improvements over acting in isolation. In this paper, we investigate the stochastic bandit problem under two settings - (a) when the agents wish to make their communication private with respect to the action sequence, and (b) when the agents can be byzantine, i.e., they provide (stochastically) incorrect information. For both these problem settings, we provide upper-confidence bound algorithms that obtain optimal regret while being (a) differentially-private and (b) tolerant to byzantine agents. Our decentralized algorithms require no information about the network of connectivity between agents, making them scalable to large dynamic systems. We test our algorithms on a competitive benchmark of random graphs and demonstrate their superior performance with respect to existing robust algorithms. We hope that our work serves as an important step towards creating distributed decision-making systems that maintain privacy.

KEYWORDS
multi-armed bandits; sequential decision-making; differential privacy; robustness

ACM Reference Format:
Abhimanyu Dubey and Alex Pentland. 2020. Private and Byzantine-Proof Cooperative Decision-Making. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9–13, 2020, IFAAMAS, 13 pages.

1 INTRODUCTION
Cooperative decision-making in multiagent systems has been a prominent area of scientific inquiry, with interest from a variety of disciplines from robotics and control systems to macroeconomics and game theory. Applications of cooperative decision-making range from robotics [9, 21] to sensor networks [8, 20].

However, as the availability of large-scale dense data (often collected through a network of sources) increases, the problem of cooperative learning among multiple agents becomes increasingly relevant, moving beyond systems that are "effectively" single-agent to much larger, real-time systems that are decentralized and have agents operating independently. Examples of such applications include seamless personalization across multiple devices in IoT networks [19, 33, 43], which constantly share data between agents.

Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), B. An, N. Yorke-Smith, A. El Fallah Seghrouchni, G. Sukthankar (eds.), May 9–13, 2020, Auckland, New Zealand. © 2020 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

In such a framework, it is imperative to ensure that information is securely and robustly shared between agents, and that individual concerns around privacy are not violated in the pursuit of performance on inference problems. Most generally, the learning setting assumed in such problems and applications is sequential decision-making, where observations are provided in a sequence, and each agent takes actions with the knowledge of this stream of information.

The multi-armed bandit [38] is a classic framework for investigating sequential decision-making. It provides an elegant mathematical framework to make precise statements about the exploration-exploitation dilemma that is central to sequential decision-making, and has thus become the environment of choice across a wide variety of application domains including, most prominently, online advertising [17]. An extension is the cooperative multiagent bandit, where a group of agents collectively interact with the same decision problem, and must cooperate to obtain optimal performance.

However, in a decentralized environment, cooperative decision-making comes with several caveats. Since it is impossible to control any individual agent's behavior from a centralized server, enforcing important constraints such as private computation with respect to observations becomes a difficult technical challenge. Additionally, if any agent is malicious (i.e., provides incorrect or adversarial communication), achieving optimal performance becomes non-trivial.

In this paper, we consider two problems in the context of cooperative multiagent decision-making. In the multi-armed (context-free) bandit environment, we investigate (a) privacy in inter-agent communication and (b) tolerance to byzantine or malicious agents that deliberately provide stochastically corrupt communication. Our contributions can be listed as follows.

1. We provide a multi-agent UCB-like algorithm, Private Multiagent UCB, for the cooperative bandit problem, that guarantees communication of the reward sequence of any agent to be private to any other receiving agent, and provides optimal group regret. Our algorithm is completely decentralized, i.e., its performance or operation does not require any knowledge of the agents' communication network, and each agent chooses actions autonomously (i.e., without a central server). It also maintains privacy of the source agents' messages regardless of the individual behavior of any receiving agent (i.e., it is robust to defection in individual behavior).

2. We provide a multi-agent UCB-like algorithm, dubbed Byzantine-Proof Multiagent UCB, for the cooperative bandit problem with byzantine agents that corrupt their messages following Huber's ϵ-contamination model, and provide optimal (in terms of communication overhead) group regret bounds for its performance. Like our private algorithm, this too is completely decentralized, and operates without any knowledge of the communication network.

3. We validate the theoretical bounds on the group regret of our algorithms on a family of random graphs under a variety of different parameter settings, and demonstrate their superiority over existing single-agent algorithms and robustness to parameter changes.


The paper is organized as follows. First, we provide an overview of the background and preliminaries essential to our problem setup. Next, we examine the private setting, followed by the byzantine setting. We then provide results on our experimental setup and survey related work in this problem domain before closing remarks.

2 BACKGROUND AND PRELIMINARIES
The Cooperative Stochastic Bandit. The multi-armed stochastic bandit problem is an online learning problem that proceeds in rounds. At every round t ∈ {1, 2, ..., T} (denoted in shorthand as [T]), the learner selects an arm A_t ∈ [K] and the bandit draws a random reward X_t from a corresponding reward distribution ν_{A_t} with mean µ_{A_t}. We assume the rewards are drawn from probability densities with bounded support in [0, 1]¹, a typical assumption in the bandit literature. The objective is to minimize the regret R(T):

$$R(T) = T \cdot \max_{k \in [K]} \mu_k - \sum_{t \in [T]} \mu_{A_t}. \tag{1}$$
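For concreteness, the following is a minimal sketch of how the pseudo-regret in (1) could be measured empirically; the function name and example values are ours, purely for illustration.

```python
import numpy as np

def regret(mu, actions):
    """Pseudo-regret (1): T * max_k mu_k minus the sum of the means
    of the arms actually pulled over the T rounds."""
    mu = np.asarray(mu)
    actions = np.asarray(actions)
    return len(actions) * mu.max() - mu[actions].sum()

# Example: 3 arms with means (0.2, 0.5, 0.9); pulling arm 0 five times
# incurs regret 5 * 0.9 - 5 * 0.2 = 3.5.
print(regret([0.2, 0.5, 0.9], [0, 0, 0, 0, 0]))
```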

The cooperative multi-armed bandit problem is a distributed extension of the stochastic bandit problem. In this setting, a group of M agents that communicate over a (connected, undirected) network G = (V, E) face an identical stochastic K-armed bandit, and must cooperate to collectively minimize their group regret R_G(T):

$$R_G(T) = TM \max_{k \in [K]} \mu_k - \sum_{m \in [M]} \sum_{t \in [T]} \mu_{A_{m,t}}. \tag{2}$$

Here, we denote the action taken by learner m at time t as A_{m,t} and the corresponding reward received as X_{m,t}; d(m,m′) denotes the length of the shortest path between agents m and m′. For any T, n_k(T) denotes the total number of times arm k is pulled across all agents, and n_k^m(T) denotes the number of times agent m has pulled the arm. In the cooperative bandit setting, we denote the policy of each agent m ∈ [M] by (π_{m,t})_{t∈[T]}. The collective policy for all agents is denoted by Π = (π_{m,t})_{m∈[M], t∈[T]}. Since G is not assumed to be complete (i.e., each agent is not assumed to communicate directly with every other agent), communication introduces a delay in how information flows between agents. However, to provide an understanding of the limits of cooperative decision-making, consider the scenario when G is complete. In this setting, after every trial of the problem, agents can instantaneously communicate all required information, and therefore the problem can effectively be thought of as a centralized agent pulling M arms at every trial. Anantharam et al. [2] provide a lower bound on the group regret achievable in this setting, given as follows.

Theorem 2.1 (Lower Bound on Centralized Estimation [2]). For the stochastic bandit with K arms, each with mean µ_k, let ∆_k = max_{k′∈[K]} µ_{k′} − µ_k, and let D(µ*, µ_k) denote the Kullback-Leibler divergence between the optimal arm k* = argmax_{k∈[K]} µ_k and arm k. Then, for any algorithm that pulls M arms at every trial t ∈ [T], the total regret incurred satisfies

$$\liminf_{T \to \infty} \frac{R(T)}{\ln T} \geq \sum_{k : \Delta_k > 0} \frac{\Delta_k}{D(\mu^*, \mu_k)}.$$

¹This can be extended to any bounded support by renormalization, and we choose this setup for notational convenience.

This implies that the per-agent regret in this case is O(M⁻¹ ln T). While the above result holds for centralized estimation, we see that one cannot hope for (a) a dependence on T that grows sublogarithmically, and (b) a per-agent regret with a better dependence on M than O(M⁻¹), since delays will only limit the amount of information possessed by any agent. We will demonstrate that this bound can be matched (up to additive constants, in some algorithmic settings) in the two problems we consider, for any connected G.

Communication Protocol. While the existing work on the cooperative bandit problem assumes communication via a gossip or consensus protocol [23, 24], such a protocol requires complete knowledge of the graph G, which is not feasible in a decentralized setting. In this paper, we therefore adopt a variant of the Local communication protocol that is utilized widely in distributed optimization [31], asynchronous online learning [36] and non-stochastic bandit settings [4, 10]. Under this protocol, at any time t, an agent m ∈ [M] creates a message q_m(t) that it broadcasts to all its neighbors in G. q_m(t) is a tuple of the following form:

$$q_m(t) = \langle m, t, f_1(\mathcal{H}_m(t)), ..., f_L(\mathcal{H}_m(t)) \rangle. \tag{3}$$

Here H_m(t) denotes the history of action-reward pairs obtained by the agent until time t, i.e., H_m(t) = {X_{m,1}, A_{m,1}, ..., X_{m,t}, A_{m,t}}, and f_1, ..., f_L are real-valued functions for some message length L > 0. When a message is created, it is forwarded from agent to agent γ times before it is discarded. Hence, at any time t, the oldest message being propagated in the network is one created at time t − γ. As long as a message has not already been received by an agent previously, the agent will broadcast the messages it receives (including its own) to all its neighbors. Under this protocol, we see that two agents that are a distance γ apart can communicate messages to each other, and they will receive messages from each other at a delay of γ trials. An important concept we will use in the remainder of the paper, based on this protocol, is the power graph of order γ of G, denoted as G_γ, which is a graph that contains an edge (i, j) if there is a path of length at most γ between i and j in G.
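To make the forwarding scheme concrete, here is a minimal sketch of γ-hop flooding over an adjacency-list graph; the Message class and flood routine are our own illustrative constructions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    origin: int    # creating agent m
    created: int   # creation time t
    payload: tuple # (f_1(H_m(t)), ..., f_L(H_m(t)))

def flood(graph, msg, gamma):
    """Propagate msg for gamma hops; each agent rebroadcasts a message
    at most once, so exactly the agents within distance gamma of the
    origin receive it, with delay equal to that distance."""
    received = {msg.origin: 0}   # agent -> delay at receipt
    frontier = {msg.origin}
    for hop in range(1, gamma + 1):
        frontier = {nb for m in frontier for nb in graph[m]} - received.keys()
        for m in frontier:
            received[m] = hop
    return received

# Path graph 0-1-2-3 with gamma = 2: agent 3 never hears agent 0's message.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(flood(graph, Message(0, 0, ()), gamma=2))  # {0: 0, 1: 1, 2: 2}
```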

In the single-agent setting, the policy of the learner π = (π_t)_{t∈[T]} is a sequence of functions π_t that define a probability distribution over the space of actions, conditioned on the history of decision-making of the learner, i.e., π_t = π_t(· ∈ [K] | X_1, A_1, ..., X_t, A_t). In the decentralized multi-agent setting, we have a set of policies Π = (π_m)_{m∈[M]} for each agent m ∈ [M], where π_m = (π_{m,t})_{t∈[T]} defines a probability distribution over the space of actions conditioned on both the history of the agent and the messages it has received from other agents until time t. If, for any pair of agents m, m′ in G, d(m,m′) denotes the distance between the two agents in G, the policy for agent m at time t is of the following form:

$$\pi_{m,t} = \pi_{m,t}\left(\cdot \in [K] \;\Big|\; \mathcal{H}_m(t) \cup \bigcup_{m' \in \mathcal{N}_\gamma(m)} q_{m'}\big(t - d(m, m')\big)\right). \tag{4}$$

Here N_γ(m) denotes the γ-neighborhood of agent m in G, and we use the shorthand I_m(t) = ∪_{m′∈N_γ(m)} q_{m′}(t − d(m,m′)) to denote the set of all messages possessed by agent m at time t. We will now describe the first class of algorithms that are designed for privacy in the cooperative setting.


3 PRIVATE COOPERATIVE BANDITS
We begin by first providing an overview of ϵ-differential privacy and the techniques we will use in our algorithm.

3.1 Differential Privacy
Differential privacy, proposed by Dwork [13], is a formalisation of the amount of information that is leaked about the input of an algorithm to an adversary observing its output. Existing work in the single-agent [30, 40] and the (centralized) multi-agent [40] bandit setting has assumed the algorithms only to have actions (outputs) that are differentially-private with respect to their rewards (inputs). However, since we are considering a decentralized setting, we assume here that each learner acts in isolation, while communicating with other agents by sharing messages. Its policy, therefore, is a function of its own decision history and the messages it receives from other agents. By making the messages differentially-private with respect to the reward sequence an agent observes, we can ensure that any policy using these messages is also private with respect to the rewards obtained by other agents. We use this to define differential privacy in the decentralized bandit context.

Definition 3.1 (ϵ-differentially private message protocol). A message q_m(t) composed of L functions (f_i)_{i∈[L]} for agent m at time t is ϵ-differentially private with respect to the rewards it obtains if, for all histories H_m(t) and H′_m(t) that differ in at most one reward sample, we have, for all i ∈ [L] and S ⊆ ℝ:

$$e^{-\epsilon} \leq \frac{\Pr(f_i \in S \mid \mathcal{H}_m(t))}{\Pr(f_i \in S \mid \mathcal{H}'_m(t))} \leq e^{\epsilon}.$$

There are several advantages of having such a mechanism to introduce privacy in the cooperative decision-making process. First, we see that unlike centralized or single-agent bandit algorithms that inject noise as a part of the central algorithm itself, our definition requires each message to individually be private, regardless of the recipient agent's algorithm. Such a protocol ensures that even if a receiving agent attempts to recover the reward sequence obtained by its neighbors, it will not be able to do so. In later sections we will demonstrate that, for any agent, guaranteeing the differential privacy of the messages it generates with respect to its own rewards suffices to ensure that our algorithm is differentially private with respect to the rewards of other agents. We now introduce concepts crucial to the development of our algorithm.

The Laplace Mechanism. One of the main techniques to introduce differential privacy for univariate problems is the Laplace mechanism, which operates on the notion of sensitivity. For any domain U and function f : U* → ℝ, we can define the sensitivity of f over two datasets D, D′ ∈ U* that are neighbors (differ only in one entry) as follows:

$$s(f) = \max_{D, D' \in \mathcal{U}^*} |f(D) - f(D')|. \tag{5}$$

The Laplace mechanism operates by adding zero-mean noise to the output of f with a scale that is governed by the level of privacy required and the sensitivity of f, as determined by the following.

Lemma 3.2 (Theorem 4 of Dwork and Roth [14]). For any real function f of the input data D, adding a Laplace noise sample X with scale parameter β ensures that f(D) + X is (s(f)/β)-differentially private with respect to D, where s(f) is the sensitivity of f.
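As a quick illustration of Lemma 3.2, here is a sketch of privately releasing the empirical mean of n rewards in [0, 1], whose sensitivity is 1/n; the function name and the target ϵ are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_mean(rewards, eps):
    """Release the empirical mean plus Laplace noise. By Lemma 3.2,
    noise of scale beta yields (s(f)/beta)-DP, so choosing
    beta = s(f)/eps = 1/(n * eps) gives eps-differential privacy."""
    n = len(rewards)
    beta = (1.0 / n) / eps   # sensitivity of the mean of n values in [0, 1] is 1/n
    return float(np.mean(rewards)) + rng.laplace(scale=beta)

print(private_mean(rng.uniform(size=100), eps=0.5))
```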

The Laplace mechanism is the underlying approach that we utilize in guaranteeing privacy in the multi-agent setting. We will now describe our message-passing protocol, which uses this mechanism to guarantee differential privacy between any pair of agents.

3.2 Private Message-Passing
The general approach to privacy in the (single-agent) bandit setting involves using a binary tree or hybrid mechanism to compute the running sum of rewards for any arm [30, 40]. This mechanism involves maintaining a binary tree of individual rewards, where each node in the tree stores a partial sum. This implies that updating the tree is logarithmic in the number of elements present, and a similar (logarithmic) bound on the sensitivity can be derived; therefore, for any arm k that has been pulled m times, one requires introducing Laplace noise $\mathcal{L}(\frac{\ln m}{\epsilon})$ to achieve ϵ-differential privacy. Under this mechanism, [40] provide a bound on the regret.

Theorem 3.3 (Regret of DP-UCB Algorithm [40]). The single-agent DP-UCB algorithm, when run for T trials, obtains regret of

$$O\left(K \ln T \cdot \max\left\{ \frac{4\sqrt{8}}{\epsilon} \ln(\ln T),\ \frac{8}{\Delta_{\min}} \right\}\right).$$

While this approach is feasible in the single-agent case, in the distributed setting we would have to maintain M separate binary trees (one for each agent), which would create an overhead of a factor of M in the regret (since the sensitivity is increased by a factor of M). To mitigate this overhead we can alternatively adapt the interval mechanism, introduced in [40], to the multi-agent setting. Under this mechanism, the mean of an arm is updated only (approximately) T/ϵ times, which makes it possible to add Laplacian noise of a lower scale, greatly improving the regret obtained by the algorithm: [40] demonstrate that this mechanism obtains the optimal single-agent regret with an additive increase due to differential privacy. We demonstrate that, using a message-passing algorithm inspired by the interval mechanism, we obtain a group regret that has a much smaller than O(M) dependence on the number of agents, and is additive in terms of the privacy overhead.

For the private setting, in the message-passing protocol described in the earlier sections, consider for any agent m the following message created at time t ∈ [T]:

$$q_m(t) = \langle m, t, \gamma, \hat{v}_m(t), n_m(t) \rangle. \tag{6}$$

Here, v̂_m(t) = (v̂_k^m(t))_{k∈[K]} is a vector of arm-wise reward means that comprise the empirical mean of rewards with specific Laplace noise added, and n_m(t) = (n_k^m(t))_{k∈[K]} is a vector of the number of times each arm k has been pulled by agent m. Since the interval mechanism proceeds by making the updates to the mean estimate infrequent, this strategy is adopted in our message-passing protocol as well, as described in the Lemma below. We describe the complete message-passing protocol in Algorithm 1.

Lemma 3.4 (Private release of means [40]). The interval w to update the mean for each arm follows a series such that, for the given value of ϵ ∈ (0, 1) and v ∈ (1, 1.5), w_n = W_{n+1} − W_n with W_0 = 0, and W_{n+1} such that

$$W_{n+1} = \inf_{x' \in \mathbb{N}} \left\{ x' \geq W_n + 1 : \sum_{i = W_n + 1}^{x'} \frac{1}{\sqrt{i^v}} \geq \frac{1}{\epsilon \, x'^v} \right\}.$$


Protocol Description: The protocol followed is straightforward: at any time t, the agent updates the sum of its own rewards with the reward obtained from the bandit algorithm for the arm pulled. Then, for a set of intervals w_0, ..., w_T as defined in Lemma 3.4, the broadcasted version of this personal mean is updated to the latest version, with additional Laplace noise L(n_k^m(t)^{v/2−1}), for some constant v ∈ (1, 1.5). This mechanism ensures that the broadcast of the personal mean is differentially-private with respect to the reward sequence observed, for each arm. After this update procedure has concluded for each arm, the agent broadcasts this (and all previously incoming messages from other agents) to its neighbors using the system routine SendMessages, which is assumed to be constant-time. It then updates its own group estimates of the mean for each arm (µ̂_k^m(t)) using the incoming messages from all other agents. This is done by using the latest message from each agent in N_γ(m). It then discards stale messages (with life parameter l = 0), and returns the updated group means for all arms for the bandit algorithm to use. We now describe the privacy guarantee for the messages, and then describe the Private Multi-Agent UCB algorithm and its privacy and regret guarantees.

Lemma 3.5 (Privacy of Outgoing Messages). For any agent m ∈ [M], t ∈ [T], and v ∈ (1, 1.5), each outgoing message q_m(t) is n_k^m(t)^{−v/2}-differentially private w.r.t. the reward sequence of each arm k ∈ [K].

Proof. This follows directly from the fact that the only element of the message that depends on the reward sequence of arm k is its noisy mean v̂_k^m(t), which is n_k^m(t)^{−v/2}-differentially private, since we add Laplace noise of scale n_k^m(t)^{v/2−1} and the sensitivity of the empirical mean is n_k^m(t)^{−1}. □
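In code, the release step of Lemma 3.5 for a single arm looks as follows; this is a sketch with our own naming, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def release_noisy_mean(reward_sum, n_pulls, v=1.25):
    """Broadcast value for one arm: the empirical mean plus Laplace
    noise of scale n^(v/2 - 1). Since the mean has sensitivity 1/n,
    the release is n^(-v/2)-differentially private (Lemma 3.5)."""
    scale = n_pulls ** (v / 2.0 - 1.0)
    return reward_sum / n_pulls + rng.laplace(scale=scale)

# The injected noise shrinks as the arm accumulates pulls:
for n in (10, 100, 1000):
    print(n, release_noisy_mean(0.6 * n, n))
```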

3.3 Private Multi-Agent UCB
The private multi-agent UCB algorithm is straightforward. During a trial, any agent m ∈ [M] constructs the group means for each arm using the reward samples from its own pulls and the pulls obtained from its neighbors in N_γ(m). This mean is given by the estimate µ̂_k^m(t) for any arm k, as obtained in Algorithm 1. Using these samples, it then follows the popular UCB strategy, where it constructs an upper confidence bound for each arm, and pulls the arm with the largest upper confidence bound. The complete algorithm is described in Algorithm 2. First, we describe the privacy guarantee with respect to its neighboring agents in N_γ(m), and then prove the regret (performance) guarantee on the group regret.

Lemma 3.6 (Privacy Guarantee). After t trials, any agent m ∈ [M] is (ϵ′_{m′}, δ′)-differentially private with respect to the reward sequence of any arm observed by any other agent m′ ∈ N_γ(m) communicating via Algorithm 1 with privacy parameter v ∈ (1, 1.5), where, for ϵ ∈ (0, 1] and δ′ ∈ (0, 1], ϵ′_{m′} satisfies

$$\epsilon'_{m'} \leq \min\left( \frac{\epsilon \, (t - d(m, m'))^{1 - v/2}}{1 - v/2},\ 2\epsilon\zeta(v) + \sqrt{2\epsilon\zeta(v)\ln(1/\delta')} \right).$$

Proof. We can see that for any arm k, the estimate µ̂_k^m(t) is composed of the sum of rewards from all other agents in N_γ(m). However, with respect to the reward sequence of any single agent m′ ∈ N_γ(m), this term only depends on the differential privacy of the outgoing messages q_{m′} obtained by agent m until time t − d(m,m′) (since it takes at least d(m,m′) trials for a message from m′ to reach m). However, since for each arm a new mean is released only t/f times (where f is the interval in Algorithm 1), a k-fold composition theorem [15] identical to [40] can be applied with t′ = t − d(m,m′). Using Theorem 3.4 and then Corollary 3.1 of [40] subsequently provides us the result. □

Algorithm 1 Private Message-Passing Protocol
1: Input: Agent m ∈ [M], iteration t ∈ [T], series W = (w_0, w_1, ..., w_T) such that W(i) = w_i, series counter i_k^m for each k ∈ [K] (i_k^m = w_0 if t = 0 for all k), set of existing messages Q_m(t−1) (Q_m(t) = ∅ if t = 0), privacy constant v ∈ (1, 1.5), existing reward sums s_{m,k}(t) for all k ∈ [K], with s_{m,k}(0) = 0 for all k.
2: Obtain X_{m,t}, A_{m,t} from bandit algorithm.
3: Set s_{m,k}(t) = s_{m,k}(t−1) + X_{m,t} for k = A_{m,t}.
4: Set n_k^m(t) = n_k^m(t−1) + 1 for k = A_{m,t}.
5: for arm k in [K] do
6:     if n_k^m(t) = W(i_k^m) then
7:         Set v̂_k^m(t) = s_{m,k}(t)/n_k^m(t) + L(n_k^m(t)^{v/2−1}).
8:         Set i_k^m = i_k^m + 1.
9:     else
10:        Set v̂_k^m(t) = v̂_k^m(t−1).
11:    end if
12: end for
13: Set q_m(t) = ⟨m, t, γ, v̂_m(t), n_m(t)⟩.
14: Set Q_m(t) = Q_m(t−1) ∪ {q_m(t)}.
15: for each neighbor m′ in N_1(m) do
16:     SendMessages(m, m′, Q_m(t)).
17: end for
18: for each neighbor m′ in N_1(m) do
19:     Q′ = ReceiveMessages(m′, m).
20:     Q_m(t) = Q_m(t) ∪ Q′.
21: end for
22: Set N_k^m(t) = n_k^m(t), µ̂_k^m(t) = s_{m,k}(t) for all k ∈ [K].
23: for q′ = ⟨m′, t′, x′_1, ..., x′_K, a′_1, ..., a′_K⟩ ∈ Q_m(t) do
24:     if IsLatestMessage(q′) then
25:         for arm k ∈ [K] do
26:             Set N_k^m(t) = N_k^m(t) + a′_k.
27:             Set µ̂_k^m(t) = µ̂_k^m(t) + a′_k · x′_k.
28:         end for
29:     end if
30: end for
31: for arm k ∈ [K] do
32:     µ̂_k^m(t) = µ̂_k^m(t)/N_k^m(t).
33: end for

Remark 1 (Robustness to Strategy). It is important to note that if all agents m ∈ [M] follow the protocol in Algorithm 1, then the privacy guarantee is sufficient, regardless of the algorithm any agent individually uses to make decisions. This is true since, for any agent, the complete sequence of messages it receives from any other agent is differentially private with respect to the origin's reward sequence for any arm and at any instant of the problem. Hence, if the agent chooses to ignore other agents and follow an arbitrary decision-making algorithm, the privacy guarantee only gets stronger, never worse (since it cannot gain any additional information from the messages). Therefore, this communication protocol is robust to any agent's individual decision-making policy.


Algorithm 2 Private Multi-Agent UCB
1: Input: Agent m ∈ [M], trial t ∈ [T], arms k ∈ [K], mean µ̂_k^m(t) and count N_k^m(t) estimates for each arm k ∈ [K], from Algorithm 1.
2: if t ≤ K⌈1/ϵ⌉ then
3:     A_{m,t} = t mod K.
4: else
5:     for arm k ∈ [K] do
6:         UCB_k^m(t) = √(2 ln t / N_k^m(t)).
7:     end for
8:     A_{m,t} = argmax_{k∈[K]} {µ̂_k^m(t) + UCB_k^m(t)}.
9: end if
10: X_{m,t} = Pull(A_{m,t}).
11: return A_{m,t}, X_{m,t}.


Remark 2 (Robustness to Defection). The communication protocol crucially involves broadcasting rewards with privacy alterations at the source, and not at the destination. Consider an alternate protocol where the agents broadcasted their true personal average rewards (without noise), and the receiving agent added an appropriately scaled Laplacian noise on receiving a message. While this protocol would yield slightly better guarantees (both on privacy and regret, since the noise will be one Laplace sample, and not the sum of several samples in the current case), it is not robust to defection, i.e., if the agent decides to learn the individual reward sequence, it can. Since our mechanism adds noise at the source, it maintains robustness under defection.

Theorem 3.7. If all agents m ∈ [M] each follow the Private Multi-Agent UCB algorithm with the messaging protocol described in Algorithm 1, then the group regret incurred after T trials obeys the following upper bound:

$$R_G(T) \leq \sum_{k : \Delta_k > 0} \bar{\chi}(G_\gamma)\left(\frac{8 \ln T}{\Delta_k}\right) + \left(\sum_{k : \Delta_k > 0} \Delta_k\right)\left(\bar{\chi}(G_\gamma)(M\gamma + 2) + M\left(\frac{1}{\epsilon} + \zeta(1.5)\right)\right).$$

Here, χ̄ is the clique covering number, and ζ is the Riemann zeta function.

Proof. (Sketch) The proof proceeds by decomposing G_γ into its minimum clique partitions, and analysing each clique individually. We then derive a UCB-like decomposition for the group regret of each clique. We first bound the delay in information propagation through this clique, and then, using concentration results for bounded and Laplace random variables, we bound the number of pulls of a suboptimal arm within the clique, and subsequently bound the total group regret. A full proof is presented in the Appendix. □

Remark 3 (Deviation from Optimal Rates). Upon examining the bound, we see that there are two O(1) terms. The first, O(χ̄(G_γ)Mγ), is the loss from delays in communication, since information flows between two agents with delays, and path lengths between agents are (on average) larger than 1 (unless G is complete). When G is complete, this reduces to a constant M + 2, which matches the single-agent UCB case for MT total pulls. The second term, M(1/ϵ + ζ(1.5)), arises from the level of differential privacy required. When ϵ = 1, i.e., no privacy, this term also reduces to a constant of M, matching the single-agent non-private case with MT pulls. Finally, consider the overhead in the logarithmic term, χ̄(G_γ). This too arises from the delays in the network. When G is complete, this reduces to 1, matching the optimal lower bound.

Remark 4 (Dependence on γ and Asymptotic Optimality). In addition to the dependencies discussed earlier, a major factor in controlling the asymptotic optimality is the life parameter γ, which controls "how far" information flows from any single agent. This dependence is present within the bound as well, as we can tune the performance of the algorithm greatly by tuning γ. If we set γ = diameter(G), then χ̄(G_γ) = 1 (since G_γ will be complete), which implies that the group regret incurred asymptotically (as T → ∞) matches the lower bound. Alternatively, if we were to run the algorithm with γ = 0, i.e., no communication, then we obtain a regret of O(M ln T), which is equivalent to the group regret obtained by M agents operating in isolation. Interestingly, the privacy loss has no dependence on G or γ. This is intuitive, since our mechanism introduces suboptimality at the source, and not at the destinations.

We defer experimental evaluations to the end of the paper. We now describe the algorithm and message-passing protocol for the setup with byzantine agents.

4 BYZANTINE-PROOF BANDITS
In this section, we examine the cooperative multi-agent bandit problem where agents cannot be trusted. In this setting, there exist byzantine agents that, at any trial, instead of reporting their true means, give a random sample from an alternate (but fixed) distribution with some probability ϵ. Once a message is created, however, we assume that it is received correctly by all subsequent agents, and any corruption occurs only at the source.

One might consider the setting where, in addition to messages being corrupted at the source, it is possible for a message to be corrupted by any intermediary byzantine agent with the same probability ϵ. From a technical perspective, this setting is no different from the first, since the probability of any incoming message being corrupted becomes at most γϵ (by the union bound and the fact that d(m,m′) ≤ γ for any pair of agents m, m′ that can communicate), and the remainder of the analysis is identical. Therefore, we simply consider the first setting. We now formalize the problem and provide some technical tools for this setting.

Problem Setting and Messaging Protocol. For the byzantine-proof setting, in the message-passing protocol described in the earlier sections, consider for any agent m the following message created at time t ∈ [T]:

$$q_m(t) = \langle m, t, A_{m,t}, \hat{X}_{m,t} \rangle. \tag{7}$$

Here, for byzantine agents, X̂_{m,t} = X_{m,t} with probability (1 − ϵ), and a random sample from an unknown (but fixed) distribution Q with probability ϵ. For non-byzantine agents, X̂_{m,t} = X_{m,t} with probability 1.
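A minimal sketch of this corruption model follows; the choice of Q here (uniform noise) is ours, purely for illustration, since Q is arbitrary but fixed.

```python
import numpy as np

rng = np.random.default_rng(2)

def byzantine_report(x, eps, q_sampler):
    """With probability eps, replace the true reward x with a draw from
    the fixed alternate distribution Q (Huber eps-contamination);
    otherwise report x truthfully."""
    return q_sampler() if rng.random() < eps else x

# A byzantine agent that occasionally substitutes Uniform[0, 1] noise:
reports = [byzantine_report(0.7, eps=0.2, q_sampler=rng.random)
           for _ in range(10)]
print(reports)
```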


Contrasted with the protocol used in the previous problem setting, there are several key differences. First, instead of communicating reward means for all arms, we communicate just the individual rewards obtained along with the action taken. This is essential owing to the nature of our algorithm, as will be made precise later on (cf. Remark 6).

4.1 Robust Estimation
The general problem of estimating statistics of a distribution P from samples of a mixture distribution of the form (1 − ϵ)P + ϵQ, for ϵ < 1/2, is a classic contamination model in robust statistics, known as Huber's ϵ-contamination model [18]. In the univariate mean estimation setting, a popular approach is to utilize a trimmed estimator of the mean that is robust to outlying samples and works as long as ϵ is small. We utilize the trimmed estimator to design our algorithm as well; it is defined as follows.

Definition 4.1 (Robust Trimmed Estimator). Consider a set of 2N samples X_1, Y_1, ..., X_N, Y_N of a mixture distribution (1 − ϵ)P + ϵQ for some ϵ ∈ [0, 1/2). Let Y*_1 ≤ Y*_2 ≤ ... ≤ Y*_N denote a non-decreasing arrangement of Y_1, ..., Y_N. For some confidence level δ ∈ (0, 1), let α = max(ϵ, ln(1/δ)/N), and let Z be the shortest interval containing

$$N\left(1 - 2\alpha - \sqrt{\frac{2\alpha \ln(4/\delta)}{N}} - \frac{\ln(4/\delta)}{N}\right)$$

points of {Y*_1, ..., Y*_N}. Then the univariate trimmed estimator µ̂_R({X_1, Y_1, ..., X_N, Y_N}, δ) is given by

$$\hat{\mu}_R = \frac{1}{\sum_{i=1}^{N} \mathbb{1}\{X_i \in Z\}} \sum_{i=1}^{N} X_i \, \mathbb{1}\{X_i \in Z\}. \tag{8}$$
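The following is a direct, unoptimized sketch of Definition 4.1 under the stated assumptions (variable names and the test data are ours):

```python
import numpy as np

def trimmed_mean(x, y, eps, delta):
    """Univariate trimmed estimator (Definition 4.1): use the samples y
    to locate the shortest interval Z containing the prescribed number
    of points, then average the samples x that fall inside Z."""
    x, y = np.asarray(x), np.sort(y)
    n = len(y)
    alpha = max(eps, np.log(1.0 / delta) / n)
    frac = (1.0 - 2.0 * alpha
            - np.sqrt(2.0 * alpha * np.log(4.0 / delta) / n)
            - np.log(4.0 / delta) / n)
    k = max(1, int(np.floor(frac * n)))     # points Z must contain
    widths = y[k - 1:] - y[:n - k + 1]      # widths of all k-point windows
    i = int(np.argmin(widths))
    lo, hi = y[i], y[i + k - 1]             # shortest such interval Z
    mask = (x >= lo) & (x <= hi)
    return float(x[mask].mean())

rng = np.random.default_rng(3)
clean = rng.normal(0.5, 0.1, 900)
corrupt = rng.normal(5.0, 0.1, 100)          # 10% contamination from Q
sample = rng.permutation(np.concatenate([clean, corrupt]))
x, y = sample[:500], sample[500:]
print(trimmed_mean(x, y, eps=0.1, delta=0.01))  # close to 0.5, not 0.95
```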

The fundamental idea behind the trimmed mean is to use half of the input points to identify quantiles, and then use the remaining half to estimate the "trimmed" mean within those quantiles. A natural question arises regarding the optimality of this estimator and its rate of concentration around the true mean of P. Recent work from [32] provides such a result.

Theorem 4.2 (Confidence Bound for the Trimmed Mean [32]). Let δ ∈ (0, 1). Then for any distribution P with mean µ and finite variance σ², we have, with probability at least 1 − δ,

$$|\hat{\mu}_R - \mu| \leq \sigma\sqrt{\epsilon} + \sqrt{\frac{\sigma \ln(1/\delta)}{N}}.$$

Using these results, we can design an algorithm for the byzantine cooperative bandit problem, as described next.

4.2 Byzantine-Proof Multi-Agent UCB
The byzantine-proof algorithm works, for an agent m, by making the conservative assumption that all agents (including itself) are byzantine, and constructing a robust mean estimator under this assumption. It then uses this estimator and an associated upper confidence bound (UCB) to choose an arm, similar to the single-agent UCB algorithm. The assumption that all agents are byzantine is made primarily to aid the analysis of the algorithm: from the perspective of any agent, if all M agents are byzantine, then it can expect to obtain approximately O(TMϵ) corrupted messages.

Algorithm 3 Byzantine-Proof Multi-Agent UCB
1: Input: Agent m, arms k ∈ [K], mean estimator µ̂_R(n, δ).
2: Set S_k^m = ∅ for all k ∈ [K], Q_m(t) = ∅.
3: for t ∈ [T] do
4:     if t ≤ K then
5:         A_{m,t} = t.
6:     else
7:         for arm k ∈ [K] do
8:             µ̂_k^m(t) = µ̂_R(S_k^m, 1/t²).
9:             UCB_k^m(t) = σ√ϵ + √(σ ln(1/δ) / |S_k^m(t)|).
10:        end for
11:        A_{m,t} = argmax_{k∈[K]} {µ̂_k^m(t) + UCB_k^m(t)}.
12:    end if
13:    X_{m,t} = Pull(A_{m,t}).
14:    S^m_{A_{m,t}} = S^m_{A_{m,t}} ∪ {X_{m,t}}.
15:    Q_m(t) = PruneDeadMessages(Q_m(t)).
16:    Q_m(t) = Q_m(t) ∪ {⟨m, t, γ, A_{m,t}, X_{m,t}⟩}.
17:    Set l = l − 1 for all ⟨m′, t′, a′, x′⟩ in Q_m(t).
18:    for each neighbor m′ in N_1(m) do
19:        SendMessages(m, m′, Q_m(t)).
20:    end for
21:    Q_m(t+1) = ∅.
22:    for each neighbor m′ in N_1(m) do
23:        Q′ = ReceiveMessages(m′, m).
24:        Q_m(t+1) = Q_m(t+1) ∪ Q′.
25:    end for
26:    for ⟨m′, t′, a′, x′⟩ ∈ Q_m(t+1) do
27:        S_{a′}^m = S_{a′}^m ∪ {x′}.
28:    end for
29: end for

If, however, only a fraction f < 1 of agents are byzantine, then it can expect to obtain approximately O(TMfϵ) corrupted messages, which would imply that (in expectation) all M agents are byzantine with ϵ′ = fϵ. This information can be incorporated at runtime as well, and hence we proceed with the conservative assumption. Moreover, if we overestimate ϵ during initialization, the performance of the algorithm will remain unchanged, since the robust mean estimator for any value ϵ is also applicable to any other mixture proportion ϵ′ < ϵ without any degradation in performance.

In this setting, for our description of the algorithm, we present both the algorithm and the message-passing protocol in the same setup, unlike the previous case, since there is no explicit distinction to be made between the decision-making algorithm and the message-passing algorithm, as they both operate together. The complete algorithm for all agents is described in Algorithm 3.

Algorithm Description. We will consider an agent m ∈ [M]. The agent maintains a set of rewards S_k^m for each arm k, which it updates with the reward samples it obtains from messages from other agents, and with its own rewards. At every trial, for each arm, it uses the robust mean estimator µ̂_R to compute the trimmed mean of all the reward samples present in S_k^m, and uses the number of samples to estimate an upper confidence bound as well. It then chooses the arm with the largest sum of the robust mean and the upper confidence bound. After choosing an arm, it updates its set S_k^m and sends messages to all its neighbors, similar to the previous algorithm.
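For a single arm, the index computed on lines 7-11 of Algorithm 3 can be sketched as follows, reusing the trimmed_mean sketch above (σ and ϵ are assumed known, as in the text):

```python
import numpy as np

def robust_ucb_index(samples, sigma, eps, t):
    """Robust index for one arm at trial t: trimmed-mean estimate at
    confidence delta = 1/t^2, plus the exploration bonus from
    Theorem 4.2: sigma*sqrt(eps) + sqrt(sigma * ln(1/delta) / N)."""
    samples = np.asarray(samples)
    n = len(samples)
    half = n // 2              # one half locates Z, the other is averaged
    delta = 1.0 / t ** 2
    mu_r = trimmed_mean(samples[:half], samples[half:], eps, delta)
    bonus = sigma * np.sqrt(eps) + np.sqrt(sigma * np.log(1.0 / delta) / n)
    return mu_r + bonus
```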

Theorem 4.3. If all agents m ∈ [M] run Algorithm 3 with mean estimator µ̂_R, then for ϵ < ∆_min/2σ, the group regret after T iterations obeys the following upper bound:

$$R_G(T) \leq \bar{\chi}(G_\gamma)\left(\sum_{k : \Delta_k > 0} \frac{4\sigma^2}{(\Delta_k - 2\sigma\sqrt{\epsilon})^2}\right)\ln T + \left(3M + \gamma\bar{\chi}(G_\gamma)(M - 1)\right)\left(\sum_{k : \Delta_k > 0} \Delta_k\right).$$

Here, χ̄(·) again refers to the clique covering number.

Proof. (Sketch). We proceed in a manner similar to the previous regret bound, by decomposing the group regret into a sum of clique-wise group regrets. We then construct similar UCB-like "exploration" vs "exploitation" events, and bound the probability of pulling an incorrect arm under these events using the confidence bounds of the robust mean estimator coupled with the worst-case delay within the network. The full proof is in the Appendix. □

Remark 5 (Deviation from Optimal Rates). When comparing the group regret bound to the optimal rate achievable in the single-agent case, there is an additive constant that arises from the delay in the network. Identical to the differentially private algorithm, this additive constant reduces to the constant corresponding to MT individual trials of the UCB algorithm when information flows instantly throughout the network, i.e., when G is complete. Additionally, we observe identical dependencies on other graph parameters in the leading term as well, except for the modified denominator (∆_k − 2σ√ϵ)². This term arises from the inescapable bias that the mixture distribution Q introduces into estimation, and [32] show that this bias is optimal and unimprovable in general. This can intuitively be explained by the fact that beyond a certain ϵ, there will be so many corrupted samples that it would be impossible to distinguish the best arm from the next best arm, regardless of the number of samples (since the observed, noisy distributions for the arms become indistinguishable). We must also note that the dependence of the asymptotic optimality on γ remains identical to the previous algorithm, and when all agents are truthful (ϵ = 0) and γ = diameter(G), this algorithm obtains optimal multi-agent performance, matching the lower bound.

Remark 6 (Simultaneous Privacy and Robustness). A natural question to ask is whether both privacy with respect to other agents' rewards and robustness with respect to byzantine agents can be simultaneously achieved. This is a non-trivial question, since the fundamental approaches to both problems are conflicting. First, we must notice that to achieve differential privacy without damaging regret, one must convey summary statistics (such as the mean or sum of rewards) directly, with appropriately added noise. However, for robust estimation in byzantine-tolerance, communicating individual rewards is essential, to allow for an appropriate rate of concentration of the robust mean. These two approaches seem contradictory at first inquiry, and hence we leave this line of research for future work.

In the next section, we describe the experimental comparisons of our proposed algorithms with existing benchmark algorithms.

5 EXPERIMENTS
For our experimental setup, we consider rewards drawn from randomly initialized Bernoulli distributions, with K = 5 as our default operating environment. We initialize the connectivity graph G as a sample from a random Erdos-Renyi [6] family with edge probability p = 0.1, and set the communication parameter γ = diameter(G)/2. We describe the experiments and associated benchmark algorithms individually.
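A sketch of this experimental configuration, assuming the networkx library for graph generation (this is our reconstruction, not the authors' code):

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(4)

M, K, p = 200, 5, 0.1
G = nx.erdos_renyi_graph(M, p, seed=4)
while not nx.is_connected(G):          # the analysis assumes a connected G
    G = nx.erdos_renyi_graph(M, p)
gamma = nx.diameter(G) // 2            # communication parameter
arm_means = rng.uniform(size=K)        # randomly initialized Bernoulli arms
print(f"M={M}, K={K}, gamma={gamma}, means={np.round(arm_means, 2)}")
```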

Testing Private Multiagent UCB. In this setting, we compare the group regret R_G of M = 200 agents over 100 randomly initialized trials of the above setup. The benchmark algorithms we compare with are (a) single-agent UCB1 [3] running individually on each agent, and (b) DP-UCB-INT of [40] running individually on each agent, under the exact setup for ϵ and δ as described in [40]. We choose this benchmark as it is the state-of-the-art in the single-agent stochastic bandit case. The results of this experiment are plotted in Figure 1(a). We observe that the performance of our algorithm is significantly better than both private benchmarks; however, it incurs higher regret (as expected) than the non-private version.

Testing Byzantine-Proof Multiagent UCB. To test the performance of our byzantine-proof algorithm, we use the same setup as the previous case, with M = 200 agents, repeated over 100 randomly initialized trials, except that we constrain the reward probability of each arm to lie within [0.3, 0.7], and set the contamination probability to ϵ = 10⁻³. The benchmark algorithms in this case are (a) the single-agent UCB1 [3] running individually on each agent, and (b) the byzantine-proof multiagent UCB with ϵ = 0 (since no other work explicitly studies the Huber ϵ-contamination model as used in our setting). The results of this experiment are summarized in Figure 1(b). Similar to the previous case, we observe a large improvement in the group regret compared to the single-agent UCB1, and with no contamination, the performance is better still.

Testing the effect of γ. To understand the effect of γ on the obtained regret, we repeated the same experiment (M = 200, 100 trials of randomly generated Erdos-Renyi graphs with p = 0.1) with our two algorithms and compared their obtained group regret at T = 1000 trials. We observe a sharp decline in regret as γ increases from 1 to diameter(G), matching the optimal group regret at γ = diameter(G), as hypothesized by our regret bound. The results of this experiment are summarized in Figure 1(c).

Figure 1: Experimental comparisons on random graphs. Each figure is constructed by averaging over 100 trials.

6 RELATED WORK
Our work relates to a vast body of work across several fields in the sequential decision-making literature.

Multi-Agent Bandit Algorithms. Multi-agent bandit problems have been studied widely in the context of distributed control [5, 26–28]. In this line of work, the problem is competitive rather than cooperative: a group of agents compete for a set of arms, and collisions occur when two or more agents select the same arms. Contextual bandit algorithms have also been explored in the context of peer-to-peer networks and collaborative filtering over multiple agent recommendations [11, 16, 17, 22], where "agents" are clustered or averaged by their similarity in contexts. While some of these problem domains assume an underlying network structure between agents, none assume "delay" in the observed feedback between agents, and communication is instantaneous with the knowledge of the graph, making these algorithms effectively single-agent. For the cooperative bandit problem, however, general settings have been investigated in both the stochastic case using a consensus protocol [23–25] and in the non-stochastic case using a message-passing protocol similar to ours [4, 10]. To the best of our knowledge, our work is the first to consider the cooperative bandit problem with differential privacy or byzantine agents.

Private Sequential Decision-Making. Our work uses the immensely influential differential privacy framework of Dwork [13], which has been the foundation of almost all work in private sequential decision-making. For the bandit setting, there has been significant research interest in the single-agent case, including work on private stochastic bandit algorithms [30, 37, 40], linear bandit algorithms [35] and the nonstochastic case [41]. In the competitive multi-agent stochastic bandit case, Tossou and Dimitrakakis [39] provide a UCB-based algorithm based on Time-Division Fair Sharing (TDFS).

Robust Sequential Decision-Making. Robust estimation has a rich history in the bandit literature. Robustness to heavy-tailed reward distributions has been extensively explored in the stochastic multi-armed setting, from the initial work of Bubeck et al. [7], which proposed a UCB algorithm based on robust mean estimators, to the subsequent work of [12, 42, 44] on both Bayesian and frequentist algorithms for the same. Contamination in bandits has also been explored in the stochastic single-agent case, such as the work of [1] on best-arm identification under contamination, and algorithms that are jointly optimal for the stochastic and nonstochastic cases [34], which use a modification of the popular EXP3 algorithm. Our work builds on the work in robust statistics of Huber [18] and Lugosi and Mendelson [29].

To the best of our knowledge, our work is the first to consider privacy and contamination in the context of the multi-agent cooperative bandit problem, with the additional benefit of being completely decentralized.

7 CONCLUSION
In this paper, we discussed the cooperative multi-armed stochastic bandit problem under two important practical settings – private communication and the existence of byzantine agents that follow an ϵ-contamination model. We provided two algorithms that are completely decentralized and provide optimal group regret guarantees when run with certain parameter settings. Our work is the first to investigate these practical scenarios of cooperative decision-making; however, it leaves many open questions for future work.

First, we realise that achieving both differential privacy and byzantine-tolerance simultaneously is non-trivial, even in the stochastic case: differential privacy requires the computation of summary statistics (such as the sum or average of rewards) in order to provide feasible regret; however, byzantine-tolerance requires the use of robust estimators, which explicitly require individual reward samples to compute, making their combination a difficult problem that can be investigated in future work. Next, we analysed a very specific setting for byzantine agents, which can be generalized to adversarial corruptions, as done in the single-agent case. Finally, extending the stochastic multi-armed setting to contextual settings is also an important direction of research that our work opens up in multi-agent sequential decision-making.

REFERENCES[1] Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. 2019. Best Arm

Identification for Contaminated Bandits. Journal of Machine Learning Research20, 91 (2019), 1–39.

[2] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. 1987. Asymptot-ically efficient allocation rules for the multiarmed bandit problem with multipleplays-Part I: IID rewards. IEEE Trans. Automat. Control 32, 11 (1987), 968–976.

[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis ofthe multiarmed bandit problem. Machine learning 47, 2-3 (2002), 235–256.

[4] Yogev Bar-On and Yishay Mansour. 2019. Individual Regret in CooperativeNonstochastic Multi-Armed Bandits. arXiv preprint arXiv:1907.03346 (2019).

[5] Ilai Bistritz and Amir Leshem. 2018. Distributed multi-player bandits-a game ofthrones approach. In Advances in Neural Information Processing Systems. 7222–7232.

[6] Béla Bollobás and Oliver M Riordan. 2003. Mathematical results on scale-freerandom graphs. Handbook of graphs and networks: from the genome to the internet(2003), 1–34.

[7] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. 2013. Bandits withheavy tail. IEEE Transactions on Information Theory 59, 11 (2013), 7711–7717.

[8] John Byers and Gabriel Nasser. 2000. Utility-based decision-making in wirelesssensor networks. In Proceedings of the 1st ACM international symposium on Mobilead hoc networking & computing. IEEE Press, 143–144.

[9] Y Uny Cao, Alex S Fukunaga, and Andrew Kahng. 1997. Cooperative mobilerobotics: Antecedents and directions. Autonomous robots 4, 1 (1997), 7–27.

[10] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. 2019. Delay andcooperation in nonstochastic bandits. The Journal of Machine Learning Research20, 1 (2019), 613–650.

[11] Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. 2013. A gang ofbandits. In Advances in Neural Information Processing Systems. 737–745.

[12] Abhimanyu Dubey and Alex ‘Sandy’ Pentland. 2019. Thompson Samplingon Symmetric Alpha-Stable Bandits. In Proceedings of the Twenty-Eighth In-ternational Joint Conference on Artificial Intelligence, IJCAI-19. InternationalJoint Conferences on Artificial Intelligence Organization, 5715–5721. https://doi.org/10.24963/ijcai.2019/792

[13] Cynthia Dwork. 2011. Differential privacy. Encyclopedia of Cryptography andSecurity (2011), 338–340.

[14] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differ-ential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4(2014), 211–407.

[15] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differ-ential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer

Page 9: Private and Byzantine-Proof Cooperative Decision-Makingweb.mit.edu/dubeya/www/files/aamas20_bandits.pdfAbhimanyu Dubey Massachusetts Institute of Technology dubeya@mit.edu Alex Pentland

[16] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. 2017. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 1253–1262.
[17] Claudio Gentile, Shuai Li, and Giovanni Zappella. 2014. Online clustering of bandits. In International Conference on Machine Learning. 757–765.
[18] Peter J Huber. 2011. Robust statistics. Springer.
[19] Harald R Kisch and Claudia Lage Rebello da Motta. 2017. Real World Examples of Agent based Decision Support Systems for Deep Learning based on Complex Feed Forward Neural Networks. In COMPLEXIS. 94–101.
[20] Lawrence A Klein. 2004. Sensor and data fusion: a tool for information assessment and decision making. Vol. 324. SPIE Press, Bellingham, WA.
[21] Jelle R Kok, Matthijs TJ Spaan, Nikos Vlassis, et al. 2003. Multi-robot decision making using coordination graphs. In Proceedings of the 11th International Conference on Advanced Robotics, ICAR, Vol. 3. 1124–1129.
[22] Nathan Korda, Balázs Szörényi, and Li Shuai. 2016. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, Vol. 48. International Machine Learning Society, 1301–1309.
[23] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. 2016. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 167–172.
[24] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. 2016. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC). IEEE, 243–248.
[25] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. 2018. Social Imitation in Cooperative Multiarmed Bandits: Partition-Based Algorithms with Strictly Local Information. In 2018 IEEE Conference on Decision and Control (CDC). IEEE, 5239–5244.
[26] Keqin Liu and Qing Zhao. 2010. Decentralized multi-armed bandit with multiple distributed players. In 2010 Information Theory and Applications Workshop (ITA). IEEE, 1–10.
[27] Keqin Liu and Qing Zhao. 2010. Distributed learning in cognitive radio networks: Multi-armed bandit with distributed multiple players. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 3010–3013.
[28] Keqin Liu and Qing Zhao. 2010. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58, 11 (2010), 5667–5681.
[29] Gabor Lugosi and Shahar Mendelson. 2019. Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391 (2019).
[30] Nikita Mishra and Abhradeep Thakurta. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press, 592–601.
[31] Ciamac Cyrus Moallemi. 2007. A message-passing paradigm for optimization. Vol. 68.
[32] Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. 2019. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927 (2019).
[33] Rodrigo Roman, Jianying Zhou, and Javier Lopez. 2013. On the features and challenges of security and privacy in distributed internet of things. Computer Networks 57, 10 (2013), 2266–2279.
[34] Yevgeny Seldin and Aleksandrs Slivkins. 2014. One Practical Algorithm for Both Stochastic and Adversarial Bandits. In ICML. 1287–1295.
[35] Roshan Shariff and Or Sheffet. 2018. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems. 4296–4306.
[36] Jukka Suomela. 2013. Survey of local algorithms. ACM Computing Surveys 45, 2 (2013), 24.
[37] Abhradeep Guha Thakurta and Adam Smith. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems. 2733–2741.
[38] William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285–294.
[39] Aristide CY Tossou and Christos Dimitrakakis. [n. d.]. Differentially private, multi-agent multi-armed bandits.
[40] Aristide CY Tossou and Christos Dimitrakakis. 2016. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence.
[41] Aristide Charles Yedia Tossou and Christos Dimitrakakis. 2017. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence.
[42] Sattar Vakili, Keqin Liu, and Qing Zhao. 2013. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing 7, 5 (2013), 759–767.
[43] Lizhe Wang and Rajiv Ranjan. 2015. Processing distributed internet of things data in clouds. IEEE Cloud Computing 2, 1 (2015), 76–80.
[44] Xiaotian Yu, Han Shao, Michael R Lyu, and Irwin King. 2018. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Conference on Uncertainty in Artificial Intelligence. 937–946.


APPENDIX

Theorem 7.1. If all agents $m \in [M]$ follow the Private Multi-Agent UCB algorithm with the messaging protocol described in Algorithm 1, then the group regret incurred after $T$ trials obeys the following upper bound:
$$ R_G(T) \le \sum_{k:\Delta_k>0} \chi(G_\gamma)\left(\frac{8\sigma^2 \ln T}{\Delta_k}\right) + \left(\sum_{k:\Delta_k>0}\Delta_k\right)\left(\chi(G_\gamma)(M\gamma + 2) + M\left(\frac{1}{\epsilon} + \zeta(1.5)\right)\right). $$
Here, $\chi$ is the clique number and $\zeta$ is the Riemann zeta function.

Proof. Let a clique covering of $G_\gamma$ be given by $\mathcal{C}_\gamma$. We begin by decomposing the group regret:
$$ R_G(T) = \sum_{m=1}^{M} R_m(T) \qquad (9) $$
$$ \le \sum_{C\in\mathcal{C}_\gamma}\sum_{m\in C}\sum_{k=1}^{K} \Delta_k\,\mathbb{E}\big[n_k^m(T)\big] \qquad (10) $$
$$ = \sum_{C\in\mathcal{C}_\gamma}\sum_{k=1}^{K} \Delta_k\left(\sum_{m\in C}\sum_{t=1}^{T} P\big(A_{m,t}=k\big)\right) \qquad (11) $$

Consider the cumulative regret $R_C(T)$ within the clique $C$. For some time $T_C^k$, assume that each agent has pulled arm $k$ for $\eta_m^k$ trials, where $\eta_C^k = \sum_{m\in C}\eta_m^k$. Then,
$$ R_C(T) \le \sum_{k=1}^{K}\Delta_k\left(\eta_C^k + \sum_{m\in C}\sum_{t=T_C^k}^{T} P\big(A_{m,t}=k,\; N_k^C(t)\ge\eta_C^k\big)\right). \qquad (12) $$

Here $N_k^C(t)$ denotes the number of times arm $k$ has been pulled by any agent in $C$. We now examine the probability of agent $m\in C$ pulling arm $k$. First note that the empirical mean of any arm can be given by the latest messages accumulated by agent $m$ until that time, i.e., for any arm $k\in[K]$,
$$ \hat{\mu}_k^m(t-1) = \sum_{m'\in\mathcal{N}(m)}\left(\frac{\sum_{u=1}^{n_k^{m'}(t-d(m,m'))} X_{m',u}^k}{n_k^{m'}(t-d(m,m'))} + Y_k^{m'}\right) \qquad (13) $$
Here, $Y_k^{m'} \sim \mathcal{L}\big(n_k^{m'}(t-d(m,m'))^{v/2-1}\big)$. For convenience, denote the noise-free mean $\hat{\mu}_k^m(t-1) - \sum_{m'\in\mathcal{N}(m)} Y_k^{m'}$ by $Z_k^m(t-1)$, and let $N_k^m(t) = n_k^m(t) + \sum_{m'\in\mathcal{N}(m)} n_k^{m'}(t-d(m,m'))$.

Note that arm $k$ is pulled only if one of the following three events occurs:
$$ \text{Event (A):}\quad Z_*^m(t-1) \le \mu_* - \sigma\sqrt{\frac{2\ln t}{N_*^m(t)}} - \sum_{m'\in\mathcal{N}(m)} Y_*^{m'} \qquad (14) $$
$$ \text{Event (B):}\quad Z_k^m(t-1) \ge \mu_k + \sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} + \sum_{m'\in\mathcal{N}(m)} Y_k^{m'} \qquad (15) $$
$$ \text{Event (C):}\quad \mu_* \le \mu_k + 2\sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} \qquad (16) $$

We first analyse events (A) and (B). We know from Dwork and Roth [14] that for any $N$ random variables $Y_i \sim \mathcal{L}(b_i)$ and any $\omega\in(0,1)$,
$$ \Pr\left(\left|\sum_i Y_i\right| \ge \ln(1/\omega)\sqrt{8\sum_i b_i^2}\right) \le \omega. \qquad (17) $$


For some $\omega\in(0,1)$, let $h_{k,m}(t) = \ln(1/\omega)\sqrt{8\sum_{m'}\big(n_k^{m'}(t-d(m,m'))\big)^{v-2}}$. Let us examine the probability of event (B) occurring.

$$ \Pr(B) = \Pr\left(Z_k^m(t-1) \ge \mu_k + \sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} + \sum_{m'\in\mathcal{N}(m)} Y_k^{m'}\right) \qquad (18) $$
$$ = \Pr\left(\hat{\mu}_k^m(t-1) - Z_k^m(t-1) \ge h_{k,m}(t),\;\; Z_k^m(t-1) \ge \mu_k + \sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - h_{k,m}(t)\right) \qquad (19) $$
$$ \le \Pr\Big(\hat{\mu}_k^m(t-1) - Z_k^m(t-1) \ge h_{k,m}(t)\Big) + \Pr\left(Z_k^m(t-1) \ge \mu_k + \sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - h_{k,m}(t)\right) \qquad (20) $$
$$ \le \omega + \Pr\left(Z_k^m(t-1) \ge \mu_k + \sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - h_{k,m}(t)\right) \qquad (21) $$
$$ \le \omega + \exp\left(-2N_k^m(t)\left(\sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - h_{k,m}(t)\right)^2\right) \qquad (22) $$
$$ = t^{-3.5} + \exp\left(-2N_k^m(t)\left(\sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - 3.5\ln t\sqrt{8\sum_{m'}\big(n_k^{m'}(t-d(m,m'))\big)^{v-2}}\right)^2\right) \qquad (23) $$
$$ \le t^{-3.5} + \exp\left(-2N_k^m(t)\left(\sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - 3.5\ln t\sqrt{8M}\,\big(\epsilon^{-1}-\gamma\big)^{v/2-1}\right)^2\right) \qquad (24) $$
$$ \le 2t^{-3.5} \qquad (25) $$

Here, we set $\omega = t^{-3.5}$ and use Hoeffding's inequality in Equation (22); in Equation (24) we use the facts that $v\in(1,1.5)$ and $n_k^m(t)\ge\epsilon^{-1}$, and we choose $\sum_{m'} n_k^{m'}(t-d(m,m'))$ such that
$$ \exp\left(-2N_k^m(t)\left(\sigma\sqrt{\frac{2\ln t}{N_k^m(t)}} - 3.5\ln t\sqrt{8M}\,\big(\epsilon^{-1}-\gamma\big)^{v/2-1}\right)^2\right) \le t^{-3.5}. $$
This holds as long as
$$ N_k^C(t) \ge \big(\epsilon^{-1}-\gamma\big)^{v/2-1}\big(2\sqrt{2}\sigma - \sqrt{3.5}\big)\sqrt{24M\ln t} + \frac{|C|}{\epsilon}, $$
following the fact that $\gamma < 1/\epsilon$ and the argument presented in [39] for the single-agent case. We can repeat the same process for event (A). Finally, let us analyse event (C). For (C) to hold, we must have
$$ N_k^m(t) < \frac{8\sigma^2\ln t}{\Delta_k^2}. \qquad (26) $$

Since $N_k^m(t) \ge N_k^C(t) - M\gamma$, we see that this event does not occur for any agent $m\in C$ if we set
$$ \eta_C^k = \left\lceil \max\left\{ \frac{8\sigma^2\ln T}{\Delta_k^2} + M\gamma,\;\; \big(\epsilon^{-1}-\gamma\big)^{v/2-1}\big(2\sqrt{2}\sigma-\sqrt{3.5}\big)\sqrt{24M\ln t} + \frac{|C|}{\epsilon} \right\}\right\rceil. $$
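As a concrete illustration, this pull threshold can be computed directly from the problem parameters. The sketch below uses hypothetical values and a function name of our choosing.

```python
import math

def eta_clique(delta_k, sigma, T, t, M, gamma, eps, v, clique_size):
    """Sketch of the threshold eta_C^k above: the larger of the UCB-style
    exploration term and the privacy-noise term, rounded up."""
    ucb_term = 8.0 * sigma**2 * math.log(T) / delta_k**2 + M * gamma
    noise_term = ((1.0 / eps - gamma) ** (v / 2.0 - 1.0)
                  * (2.0 * math.sqrt(2.0) * sigma - math.sqrt(3.5))
                  * math.sqrt(24.0 * M * math.log(t))
                  + clique_size / eps)
    return math.ceil(max(ucb_term, noise_term))

# e.g. eta_clique(delta_k=0.2, sigma=1.0, T=10_000, t=10_000,
#                 M=20, gamma=2, eps=0.1, v=1.25, clique_size=5)
```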


Next, notice that the second term in the maximum above decreases relative to the first as $t$ increases. We can therefore decompose the regret of the entire clique, following the argument in [39], identically:
$$ R_C(T) \le \sum_{k=1}^{K}\Delta_k\left(\eta_C^k + \frac{|C|}{\epsilon} + \sum_{m\in C}\sum_{t=T_C^k}^{T} P\big(A_{m,t}=k,\; N_k^C(t)\ge\eta_C^k\big)\right) \qquad (27) $$
$$ \le \sum_{k:\Delta_k>0}\Delta_k\left(\frac{8\sigma^2\ln T}{\Delta_k^2} + M\gamma + 2 + \frac{|C|}{\epsilon} + \sum_{m\in C}\sum_{t=1}^{T} 4t^{-1.5}\right) \qquad (28) $$
$$ \le \sum_{k:\Delta_k>0}\Delta_k\left(\frac{8\sigma^2\ln T}{\Delta_k^2} + M\gamma + 2 + |C|\left(\frac{1}{\epsilon} + \zeta(1.5)\right)\right) \qquad (30) $$

Summing over all cliques $C\in\mathcal{C}_\gamma$, we get
$$ R_G(T) \le \sum_{C\in\mathcal{C}_\gamma}\sum_{k:\Delta_k>0}\Delta_k\left(\frac{8\sigma^2\ln T}{\Delta_k^2} + M\gamma + 2 + |C|\left(\frac{1}{\epsilon} + \zeta(1.5)\right)\right). \qquad (31) $$
Choosing $\mathcal{C}_\gamma$ to be the minimal clique partition of $G_\gamma$, we obtain the final form of the bound. □
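For intuition, the bound of Theorem 7.1 can be evaluated numerically for given problem parameters. The sketch below uses hypothetical values; the function name is ours, and $\zeta(1.5)$ is hard-coded to four significant digits.

```python
import math

def private_regret_bound(deltas, sigma, eps, T, M, gamma, chi):
    """Sketch of the Theorem 7.1 bound: a logarithmic exploration term
    scaled by the clique number chi(G_gamma), plus a horizon-independent
    privacy/delay term."""
    zeta_15 = 2.612  # Riemann zeta function at 1.5
    log_term = chi * sum(8.0 * sigma**2 * math.log(T) / d for d in deltas)
    const_term = sum(deltas) * (chi * (M * gamma + 2)
                                + M * (1.0 / eps + zeta_15))
    return log_term + const_term

# e.g. private_regret_bound(deltas=[0.3, 0.5], sigma=1.0, eps=0.5,
#                           T=100_000, M=20, gamma=2, chi=4)
```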

Theorem 7.2. If all agents $m\in[M]$ run Algorithm 3 with mean estimator $\hat{\mu}_R$, then for $\epsilon < (\Delta_{\min}/2\sigma)^2$ (so that $\Delta_k > 2\sigma\sqrt{\epsilon}$ for every suboptimal arm), the group regret after $T$ iterations obeys the following upper bound:
$$ R_G(T) \le \chi\big(G_\gamma\big)\left(\sum_{k:\Delta_k>0}\frac{4\sigma^2}{(\Delta_k - 2\sigma\sqrt{\epsilon})^2}\right)\ln T + \Big(3M + \gamma\chi\big(G_\gamma\big)(M-1)\Big)\left(\sum_{k:\Delta_k>0}\Delta_k\right). $$
Here, $\chi(\cdot)$ refers to the minimum clique number.

Proof. Let a clique covering of $G_\gamma$ be given by $\mathcal{C}_\gamma$. We begin by decomposing the group regret:
$$ R_G(T) = \sum_{m=1}^{M} R_m(T) \qquad (32) $$
$$ \le \sum_{C\in\mathcal{C}_\gamma}\sum_{m\in C}\sum_{k=1}^{K}\Delta_k\,\mathbb{E}\big[n_k^m(T)\big] \qquad (33) $$
$$ = \sum_{C\in\mathcal{C}_\gamma}\sum_{k=1}^{K}\Delta_k\left(\sum_{m\in C}\sum_{t=1}^{T} P\big(A_{m,t}=k\big)\right) \qquad (34) $$

Consider the group regret $R_C(T)$ within the clique $C$. For some time $T_C^k$, assume that each agent has pulled arm $k$ for $\eta_m^k$ trials, where $\eta_C^k = \sum_{m\in C}\eta_m^k$. Then,
$$ R_C(T) \le \sum_{k=1}^{K}\Delta_k\left(\eta_C^k + \sum_{m\in C}\sum_{t=T_C^k}^{T} P\big(A_{m,t}=k,\; N_k^C(t)\ge\eta_C^k\big)\right). \qquad (35) $$
Here $N_k^C(t)$ denotes the number of times arm $k$ has been pulled by any agent in $C$. We now examine the probability of agent $m\in C$ pulling arm $k$.

Note that arm $k$ is pulled only if one of the following three events occurs:
$$ \text{Event (A):}\quad \hat{\mu}_*^m(t-1) \le \mu_* - \sigma\sqrt{\epsilon} - \sqrt{\frac{\sigma\ln(1/\delta)}{|S_*^m(t)|}} \qquad (36) $$
$$ \text{Event (B):}\quad \hat{\mu}_k^m(t-1) \ge \mu_k + \sigma\sqrt{\epsilon} + \sqrt{\frac{\sigma\ln(1/\delta)}{|S_k^m(t)|}} \qquad (37) $$
$$ \text{Event (C):}\quad \mu_* \le \mu_k + 2\sigma\sqrt{\epsilon} + 2\sqrt{\frac{\sigma\ln(1/\delta)}{|S_k^m(t)|}} \qquad (38) $$


Now, let us examine the occurrence of event (C). For (C) to hold, we must have
$$ \Delta_k - 2\sigma\sqrt{\epsilon} \le 2\sqrt{\frac{\sigma\ln(1/\delta)}{|S_k^m(t)|}} \qquad (39) $$
$$ \implies |S_k^m(t)| \le \frac{4\sigma^2\ln t}{(\Delta_k - 2\sigma\sqrt{\epsilon})^2} \qquad (40) $$

Let us now lower bound $N_k^m(t) = |S_k^m(t)|$. Let $P_k^m(t)$ denote the set of reward samples obtained by agent $m$ from its own pulls of arm $k$ until time $t$. Then $P_k^m(t) = P_k^m(t-1) \cup \{X_{m,t}\}$ if arm $k$ was pulled at time $t$, and $P_k^m(t) = P_k^m(t-1)$ otherwise. Additionally, any message from an agent $m'\in G$ takes $d(m,m')-1$ iterations to reach agent $m$. Therefore:
$$ S_k^m(t) = P_k^m(t) \cup \bigcup_{m'\in G\setminus\{m\}} P_k^{m'}\big(t - d(m',m) + 1\big). \qquad (41) $$
Note that $P_k^m(t)$ and $P_{k'}^{m'}(t')$ are disjoint for all $m\ne m'$, $k$, $k'$, $t$, $t'$. Let $n(S)$ denote the cardinality of $S$. Then,
$$ n\big(S_k^m(t)\big) = n\big(P_k^m(t)\big) + \sum_{m'\in G\setminus\{m\}} n\Big(P_k^{m'}\big(t - d(m',m) + 1\big)\Big). \qquad (42) $$

Now, in iterations $t - d(m,m') + 1$ through $t$, agent $m'$ can pull arm $k$ at most $d(m,m') - 1$ times and at least $0$ times. Therefore,
$$ N_k^{G_\gamma(m)}(t) \ge n\big(S_k^m(t)\big) \ge \max\Big\{0,\; N_k^{G_\gamma(m)}(t) - \sum_{m'\in G_\gamma(m)\setminus\{m\}}\big(d(m,m') - 1\big)\Big\} \qquad (43) $$
$$ \ge \max\Big\{0,\; N_k^{G_\gamma(m)}(t) + |G_\gamma(m)|(1-\gamma)\Big\} \qquad (44) $$
Hence, $N_k^m(t) \ge N_k^C(t) - (M-1)(\gamma-1)$ for all $t$. Therefore, if we set
$$ \eta_C^k = \left\lceil \frac{4\sigma^2\ln T}{(\Delta_k - 2\sigma\sqrt{\epsilon})^2} + (M-1)(\gamma-1) \right\rceil, $$
we know that event (C) will not occur. Additionally, using the union bound over $N_*^m(t)$ and $N_k^m(t)$, and Theorem 4.2, we have:
$$ P\big(\text{Event (A) or (B) occurs}\big) \le 2\sum_{s=1}^{t}\frac{1}{t^4} \le \frac{2}{t^3}. \qquad (45) $$

Combining all probabilities and substituting into Equation (35), we have:
$$ R_C(T) \le \sum_{k=1}^{K}\Delta_k\left(\eta_C^k + \sum_{m\in C}\sum_{t=T_C^k}^{T} P\big(A_{m,t}=k,\; N_k^C(t)\ge\eta_C^k\big)\right) \qquad (46) $$
$$ \le \sum_{k=1}^{K}\Delta_k\left(\left\lceil\frac{4\sigma^2\ln T}{(\Delta_k-2\sigma\sqrt{\epsilon})^2} + (M-1)(\gamma-1)\right\rceil + \sum_{m\in C}\sum_{t=1}^{T}\frac{2}{t^3}\right) \qquad (47) $$
$$ \le \sum_{k=1}^{K}\Delta_k\left(\left\lceil\frac{4\sigma^2\ln T}{(\Delta_k-2\sigma\sqrt{\epsilon})^2} + (M-1)(\gamma-1)\right\rceil + 4|C|\right) \qquad (48) $$
$$ \le \sum_{k=1}^{K}\Delta_k\left(\frac{4\sigma^2\ln T}{(\Delta_k-2\sigma\sqrt{\epsilon})^2} + (M-1)(\gamma-1) + 1 + 4|C|\right). \qquad (49) $$

We can now substitute this result into the total regret:
$$ R_G(T) \le \sum_{C\in\mathcal{C}_\gamma} R_C(T) \qquad (50) $$
$$ \le \sum_{C\in\mathcal{C}_\gamma}\sum_{k=1}^{K}\Delta_k\left(\frac{4\sigma^2\ln T}{(\Delta_k-2\sigma\sqrt{\epsilon})^2} + (M-1)(\gamma-1) + 1 + 4|C|\right). \qquad (51) $$
Choosing $\mathcal{C}_\gamma$ as the minimum clique covering gets us the result. □
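As with Theorem 7.1, the bound of Theorem 7.2 can be evaluated numerically for given problem parameters. The sketch below uses hypothetical values and a function name of our choosing.

```python
import math

def byzantine_regret_bound(deltas, sigma, eps, T, M, gamma, chi):
    """Sketch of the Theorem 7.2 bound: a logarithmic term scaled by the
    clique number chi(G_gamma), inflated by the contamination level eps,
    plus a horizon-independent term. Requires 2*sigma*sqrt(eps) to be
    smaller than every gap in deltas."""
    assert 2.0 * sigma * math.sqrt(eps) < min(deltas)
    log_term = chi * sum(4.0 * sigma**2
                         / (d - 2.0 * sigma * math.sqrt(eps))**2
                         for d in deltas) * math.log(T)
    const_term = (3.0 * M + gamma * chi * (M - 1)) * sum(deltas)
    return log_term + const_term

# e.g. byzantine_regret_bound(deltas=[0.3, 0.5], sigma=1.0, eps=0.01,
#                             T=100_000, M=20, gamma=2, chi=4)
```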