Distributed Multi-Agent Online Learning Based on Global Feedback

Jie Xu, Cem Tekin, Simpson Zhang, and Mihaela van der Schaar

Abstract—In many types of multi-agent systems, distributed agents cooperate with each other to take actions with the goal of maximizing an overall system reward. However, in many of these systems, agents only receive a (perhaps noisy) global feedback about the realized overall reward rather than individualized feedback about the relative merit of their own actions with respect to the overall reward. If the contribution of an agent's actions to the overall reward is unknown a priori, it is crucial for the agents to utilize a distributed algorithm that still allows them to learn their best actions. In this paper, we rigorously formalize this problem and develop online learning algorithms which enable the agents to cooperatively learn how to maximize the overall reward in these global feedback scenarios without exchanging any information among themselves. We prove that, if the agents observe the global feedback without errors, the distributed nature of the considered multi-agent system results in no performance loss compared with the case where agents can exchange information. When the agents' individual observations are erroneous, existing centralized algorithms, including popular ones like UCB1, break down. To address this challenge, we propose a novel class of distributed algorithms that are robust to individual observation errors and whose performance can be analytically bounded. We prove that our algorithms' learning regrets (the losses incurred by the algorithms due to uncertainty) are logarithmically increasing in time and thus the time average reward converges to the optimal average reward. Moreover, we also illustrate how the regret depends on the size of the action space, and we show that this relationship is influenced by the informativeness of the reward structure with regard to each agent's individual action. We prove that when the overall reward is fully informative, regret is linear in the total number of actions of all the agents. When the reward function is not informative, regret is linear in the number of joint actions. Our analytic and numerical results show that the proposed learning algorithms significantly outperform existing online learning solutions in terms of regret and learning speed. We illustrate how our theoretical framework can be used in practice by applying it to online Big Data mining using distributed classifiers. However, our framework can be applied to many other applications including online distributed decision making in cooperative multi-agent systems (e.g. packet routing or network coding in multi-hop networks), cross-layer optimization (e.g. parameter selection in different layers), multi-core processors, etc.

Index Terms—Multi-agent learning, online learning, multi-armed bandits, Big Data mining, distributed cooperative learning, reward informativeness.

I. INTRODUCTION

In this paper, we consider a multi-agent decision making and learning problem, in which a set of distributed agents select

J. Xu, C. Tekin, and M. van der Schaar are with the Dept. of Electrical Engineering, University of California Los Angeles (UCLA). S. Zhang is with the Dept. of Economics, University of California Los Angeles (UCLA). This research is supported by the US Air Force Office of Scientific Research under the DDDAS Program.

actions from their own action sets in order to maximize the overall system reward, which depends on the joint action of all agents. In the considered scenario, agents do not know a priori how their actions influence the overall system reward, or how their influence may change dynamically over time. Therefore, in order to maximize the overall system reward, agents must dynamically learn how to select their best actions over time. But agents can only observe/measure the overall system performance and hence, they only obtain global feedback that depends on the joint actions of all agents. Since individualized feedback about individual actions is absent, it is impossible for the agents to learn how their actions alone affect the overall performance without cooperating with each other. However, because agents are distributed they are unable to communicate and coordinate their action choices. Moreover, agents' observations of the global feedback may be subject to individual errors, and thus it may be extremely difficult for an agent to conjecture other agents' actions based solely on its own observed reward history. The fact that individualized feedback is missing, communication is not possible, and the global feedback is noisy makes the development of efficient learning algorithms which maximize the joint reward very challenging. Importantly, the considered multi-agent learning scenario differs significantly from the existing solutions [12][20][21], in which agents receive individualized rewards. To help illustrate the differences, Figures 1(a) and (b) portray conventional learning in multi-agent systems based on individualized feedback and the considered learning in multi-agent systems based on global feedback with individual noise, respectively. The considered problem has many application scenarios. For instance, in a stream mining system that uses multiple classifiers for event detection in video streams, individual classifiers select operating points to classify specific objects or actions of interest, the results of which are synthesized to derive an event classification result. If the global feedback is only about whether the event classification is correct or not, individualized feedback about individual contributions is not available. As another example, in a cooperative communication system, a set of wireless nodes forward signals of the same message copy to a destination node through noisy wireless channels. Each forwarding node selects its transmission scheme (e.g. power level) and the destination combines the forwarded signals to decode the original message using, e.g., a maximal ratio combining scheme. Since the message is only decoded using the combined signal but not the individual signals, only a global reward depending on the joint effort of the forwarding nodes is available, not the nodes' individual contributions.


[Figure 1: (a) N learners each select an action from their own action set A_n and receive individual noisy rewards (IR: Individual Reward); (b) the learners' actions are aggregated (Aggr.) and each learner observes only a noisy copy of the overall reward (OR: Overall Reward).]

Fig. 1. Learning in multi-agent systems with (a) individualized feedback; (b) global feedback (this paper).

In this paper, we formalize for the first time the above multi-agent decision making framework and propose a systematic solution based on the theory of multi-armed bandits [9][10]. We propose multi-agent learning algorithms which enable the various agents to individually learn how to make decisions to maximize the overall system reward without exchanging information with other agents. In order to quantify the loss due to learning and operating in an unknown environment, we define the regret of an online learning algorithm for the set of agents as the difference between the expected reward of the best joint action of all agents and the expected reward of the algorithm used by the agents. We prove that, if the global feedback is received without errors by the agents, then all deterministic algorithms can be implemented in a distributed manner without message exchanges. This implies that the distributed nature of the system does not introduce any performance loss compared with a centralized system, since there exist deterministic algorithms that are optimal. Subsequently, we show that if agents receive the global feedback with different (individual) errors, existing deterministic algorithms may break down and hence, there is a need for novel distributed algorithms that are robust to such errors. For this, we develop a class of algorithms which achieve a logarithmic upper bound on the regret, implying that the average reward converges to the optimal average reward¹. The upper bound on the regret also gives a lower bound on the convergence rate to the optimal average reward. For our first algorithm, DisCo, we start without any additional assumptions on the problem structure and show that the regret is still logarithmic in time. Although the time order of the regret of the DisCo algorithm is logarithmic, its linear dependence on the cardinality of the joint action space, which increases exponentially with the number of agents, makes the regret large and the convergence rate very slow when there are many agents. Next, we define the informativeness of the overall reward function based on how effectively agents are able to distinguish the impact of their actions from the actions of others, and we exploit this informativeness in order to design improved learning algorithms. When the overall reward function is fully informative about the optimality of individual actions, the improved learning algorithm DisCo-FI achieves a regret that is linear in the size of the action space of each agent, and logarithmic in time. The crucial idea behind this result is that, when the overall reward is fully informative, instead of using the exact reward estimates of every joint action, the agents can use the relative reward estimates of each individual action to learn their optimal actions at a much faster speed. Finally, we consider a more general setting where the global rewards are only partially informative. Our third algorithm (DisCo-PI) works when the overall reward function is informative for a group of agents instead of each agent individually, and it achieves a regret which is in between those of the first two algorithms. As an application of our theoretical framework, we then run simulations utilizing our algorithms for the problem of online Big Data mining using distributed classifiers [3][4][5][6]. We show that the proposed algorithms achieve a very high classification accuracy when compared with existing solutions. Our framework could also be similarly applied to many other applications including online distributed decision making in cooperative multi-agent systems such as multi-path or multi-hop networks, cross-layer design, multi-core processing systems, etc.

¹ It is shown in [9] that logarithmic regret is the best possible even for simple single-agent learning problems. However, convergence to the optimal reward is a much weaker result than logarithmic regret. Any algorithm with a sublinear bound on regret will converge to the optimal average reward asymptotically.

The remainder of this paper is organized as follows. In Section II, we describe the related work and highlight the differences from our work. In Section III, we build the system model and define the regret of a learning algorithm with respect to the optimal policy. In Section IV, we characterize the class of algorithms that can be implemented by distributed agents without errors, and we show that, with errors, centralized algorithms may break down. In Section V, the basic distributed cooperative learning algorithm is proposed and its regret performance is analyzed. Two improved learning algorithms, for fully informative and partially informative overall reward functions, are developed and analyzed in Sections VI and VII. Some discussions and extensions are provided in Section VIII. In Section IX, we evaluate the proposed algorithms through numerical results for the problem of Big Data stream mining. Finally, the concluding remarks are given in Section X.

II. RELATED WORKS

The literature on multi-armed bandit problems can be traced back to [7][8], which study a Bayesian formulation and require priors over the unknown distributions. In our paper, such information is not needed. A general policy based on upper confidence bounds is presented in [9] that achieves asymptotically logarithmic regret in time given that the rewards from each arm are drawn from an independent and identically distributed (i.i.d.) process. It also shows that no policy can do better than Ω(K ln t) (i.e. linear in the number of arms and logarithmic in time)² and therefore, this policy is order optimal in terms of time. In [10], upper confidence bound (UCB) algorithms are presented which are proved to achieve logarithmic regret uniformly over time, rather than only asymptotically. These policies are shown to be order optimal when the arm rewards are generated independently of each other. When the rewards are generated by a Markov process, algorithms with logarithmic regret with respect to the best static policy are proposed in [14] and [15]. However, all of these algorithms intrinsically assume that the reward process of each arm is independent, and hence they do not exploit any correlations that might be present between the rewards of different arms. In this paper the rewards may be highly correlated, and so it is important to design algorithms that take this into account.

² We adopt the standard asymptotic notations Ω(·) and O(·) as in [24].

Another interesting bandit problem, in which the goal is to exploit the correlations between the rewards, is the combinatorial bandit problem [23]. In this problem, the agent chooses an action vector and receives a reward which depends on some linear or non-linear combination of the individual rewards of the actions. In a combinatorial bandit problem the set of arms grows exponentially with the dimension of the action vector; thus standard bandit policies like the one in [10] will have a large regret. The idea in these problems is to exploit the correlations between the rewards of different arms to improve the learning rate and thereby reduce the regret [11][12]. Most of the works on combinatorial bandits assume that the expected reward of an arm is a linear function of the chosen actions for that arm. For example, [13] assumes that after an action vector is selected, the individual rewards for each non-zero element of the action vector are revealed. Another work [22] considers combinatorial bandit problems with more general reward functions, defines the approximation regret and shows that it grows logarithmically in time. The approximation regret compares the performance of the learning algorithm with an oracle that acts approximately optimally, while we compare our algorithm with the optimal policy. This work also assumes that individual observations are available. However, in this paper we assume that only global feedback is available and individuals cannot observe each other's actions. Agents have to learn their optimal actions based only on the feedback about the overall reward. Other bandit problems which use linear reward models are studied in [14][15][16]. These consider the case where only the overall reward of the action profile is revealed but not the individual rewards of each action. However, our analysis is not restricted to linear reward models, but is instead much more general. In addition, in most of the previous work on multi-armed bandits [9][10][11][12], the rewards of the actions (arms) are assumed to come from an unknown but fixed distribution. We also have this assumption in most of our analysis in this paper. However, in Section VIII we propose learning algorithms which can be used when the distribution over rewards is changing over time (i.e. exhibits accuracy drift³).

³ Accuracy drift is more general than concept drift and is formally defined later in Section VIII.

Another line of work considers online optimization problems, where the goal is to minimize the loss due to learning the optimal vector of actions which maximizes the expected reward function. These works show sublinear (greater than logarithmic) regret bounds for linear or submodular expected reward functions, when the rewards are generated by an adversary to minimize the gain of the agent. The difference in our work is that we consider a more general reward function and prove logarithmic regret bounds. Recently, distributed bandit algorithms have been developed in [28] in network settings. In that work, agents have the same set of arms with the same unknown distributions and are allowed to communicate with neighbors to share their observed rewards. In contrast, in the current paper, agents have distinct sets of arms, the reward depends on the joint action of the agents, and agents do not communicate at run-time.

Table I summarizes the comparison with existing works on bandits.

TABLE I
COMPARISON WITH EXISTING WORKS ON BANDITS

                                        [9,10,11]   [12,17,18]   [13]   [22]   This work
Decentralized                              No          Yes        No     No      Yes
Combinatorial with linear rewards          No          No         Yes    Yes     Yes
Combinatorial with nonlinear rewards       No          No         No     Yes     Yes
Only overall reward revealed               N/A         N/A        No     No      Yes
Exploit reward informativeness             No          No         No     No      Yes
Accuracy drift                             No          No         No     No      Yes

III. SYSTEM MODEL

There are N agents indexed by the set N = {1, 2, ..., N}. Each agent has access to an action set A_n, with the cardinality of the action set denoted by K_n = |A_n|. Since we model the system using the multi-armed bandit framework, we will use "arm" and "action" interchangeably in this paper. In addition to the number of its own arms, each agent n knows the number of arms K_j of all the other agents j ≠ n. The model is set in discrete time t = 1, 2, ..., T. In each time slot, each agent selects a single one of its own arms a_n(t) ∈ A_n. Agents are distributed and thus cannot observe the arm selections of the other agents. We denote by a(t) the vector of arm selections by all the agents at time t, which we call the joint arm that is selected at time t.

Given any joint arm selection, a random reward r_t(a(t)) will be generated according to an unknown distribution, with a dynamic range bounded by a value D'. For now, we will assume that this global reward is i.i.d. across time. We denote the expected reward given a selection a(t) by μ(a) = E[r_t(a(t))]. The agents do not know the reward function r_t(a(t)) initially and must learn it over time. Every period each agent n privately observes a signal r_t^n = r_t + ε_t^n, equal to the global reward r_t plus a random noise term ε_t^n. We assume that ε_t^n has zero mean, is bounded in magnitude by D'' and i.i.d. across time, but it does not need to be i.i.d. across agents. Let D = D' + D''. Agents cannot communicate, so at any time t each agent has access to only its own history of noisy reward observations H_t^n = {r_τ^n}_{τ=1}^t.

Agents operate according to an algorithm π_n(H_t^n), which tells it which arm to choose after every history of observations. This algorithm can be deterministic, meaning that given any history it will map to a unique arm, or probabilistic, meaning that for some histories it will map to a probability distribution over arms. Let π(H_t^1, ..., H_t^N) = (π_1(H_t^1), ..., π_N(H_t^N)) denote the joint algorithm that is used by all agents after every possible history of observations. Since agents cannot communicate, the joint algorithm may only select actions for each agent based on that agent's private observation history. We denote the joint arm selected at time t given the joint algorithm as a^π(t). Fixing any joint algorithm, we can compute the expected reward at time 0 as E[∑_{t=1}^T r_t(a^π(t))].

This paper will propose a group of joint algorithms that can achieve sublinear regret in time given different restrictions on the expected reward function μ(a). Denote the optimal joint action by a^* := argmax_a μ(a). We will always assume that the optimal joint action is unique. The regret of a joint algorithm π(H_t^1, ..., H_t^N) is given by

R(T; π) := T μ(a^*) − E[∑_{t=1}^T r_t(a^π(t))]    (1)

Regret gives the convergence rate of the total expected reward of the learning algorithm to the value of the optimal solution. Any algorithm whose regret is sublinear will converge to the optimal solution in terms of the average reward.
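To make the notation concrete, the following minimal Python sketch (not part of the paper; all names are illustrative) simulates the model above: a dictionary plays the role of μ(a), each call to step() draws a noisy global reward and returns one privately perturbed copy per agent, and regret() evaluates the expected-reward form of Eq. (1). Gaussian noise is used purely for convenience, whereas the paper only assumes zero-mean bounded noise.

```python
import numpy as np

class GlobalFeedbackEnv:
    """Toy model of Section III: agents pick a joint arm, a global reward is drawn,
    and each agent privately observes that reward plus its own noise term."""

    def __init__(self, mu, n_agents, noise_std=0.1, seed=0):
        self.mu = mu                      # dict: joint arm (tuple) -> expected reward mu(a)
        self.n_agents = n_agents
        self.noise_std = noise_std        # stands in for the bounds D', D'' in the text
        self.rng = np.random.default_rng(seed)
        self.opt = max(mu.values())       # mu(a*)
        self.cum_expected = 0.0
        self.t = 0

    def step(self, joint_arm):
        self.t += 1
        self.cum_expected += self.mu[joint_arm]
        r = self.mu[joint_arm] + self.rng.normal(0.0, self.noise_std)   # global reward r_t
        # agent n privately observes r_t^n = r_t + eps_t^n
        return [r + self.rng.normal(0.0, self.noise_std) for _ in range(self.n_agents)]

    def regret(self):
        # expected-reward form of Eq. (1): R(T) = T mu(a*) - sum_t mu(a(t))
        return self.t * self.opt - self.cum_expected
```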

IV. ROBUSTNESS OF ALGORITHMS WITH DISTRIBUTED IMPLEMENTATION

In the considered setting, there is no individual reward observation associated with each individual arm but only an overall reward which depends on the arms selected by all agents. Therefore agents have to learn how their individual arm selections influence the overall reward, and choose the best joint set of arms in a cooperative but isolated manner. In general, agents may observe different noisy versions of the overall reward realization at each time, so we would like the algorithms to be robust to errors and perform efficiently in a noisy environment. But we will start by considering situations where there are no errors, and show that in this case agents are able to achieve the optimal expected reward even if they are distributed and unable to communicate.

A. Scenarios without individual observation errors

Let Π_c be the set of algorithms that can be implemented in a scenario where agents are allowed to exchange messages (reward observations, selected arms, etc.) at run-time. Let Π_d be the set of algorithms that can be implemented in scenarios where agents cannot exchange messages at run-time. Obviously Π_d ⊆ Π_c. At first sight, it seems that the restrictions on communication may result in efficiency loss compared to the scenario where agents can exchange messages. Next, we prove a perhaps surprising result: there is no efficiency loss even if agents cannot exchange messages at run-time, as long as the agents observe the same overall reward realization in each time slot. Such a result is thus applicable if there are no errors, or even if the error terms ε_t^n are the same for every agent at every time t.

Theorem 1: If agents observe the same reward realization in each time slot, then min_{π∈Π_c} R(T; π) = min_{π∈Π_d} R(T; π), ∀T.

Proof: See Appendix A in [31].

Theorem 1 reveals that even if agents are distributed and not able to exchange messages at run-time, all existing deterministic algorithms proposed for centralized scenarios can still be used when agents observe the same reward realizations. The reason is that even though agents cannot directly communicate, as long as they know the algorithms of the other agents before run-time they can correctly predict which arms the other agents will choose based on the global reward history. In particular, the classic UCB1 algorithm can be implemented in distributed scenarios without loss of performance.
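The argument behind Theorem 1 can be made concrete with a small sketch (hypothetical class names, not the authors' code): every agent keeps synchronized UCB1 copies for every agent's arm set and feeds all of them the common global reward, so every agent reconstructs the same joint arm without any message exchange. The same construction is the "distributed version of UCB1" referred to in Proposition 1 below; once each agent instead feeds its copies its own noisy observation, the copies drift apart and the predicted joint arm becomes inconsistent.

```python
import math

class SyncUCB1Copies:
    """One agent's local simulation of UCB1 instances for all N agents (Theorem 1 idea).
    Consistency requires that every agent observes the *same* reward realization."""

    def __init__(self, arm_counts):                  # arm_counts = [K_1, ..., K_N]
        self.counts = [[0] * K for K in arm_counts]
        self.means  = [[0.0] * K for K in arm_counts]
        self.t = 0

    def joint_arm(self):
        """Every agent runs this identically, so all agents agree on the prediction."""
        self.t += 1
        choice = []
        for cnt, mean in zip(self.counts, self.means):
            if 0 in cnt:                             # play each arm once first
                choice.append(cnt.index(0))
            else:
                ucb = [m + math.sqrt(2 * math.log(self.t) / c) for m, c in zip(mean, cnt)]
                choice.append(ucb.index(max(ucb)))
        return tuple(choice)

    def update(self, joint_arm, global_reward):
        # identical update at every agent => the simulated copies never drift apart
        for n, a in enumerate(joint_arm):
            self.counts[n][a] += 1
            c = self.counts[n][a]
            self.means[n][a] += (global_reward - self.means[n][a]) / c
```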

B. Scenarios with individual observation errors

When agents observe different noisy versions of the reward realizations, it is difficult for them to infer the correct actions of other agents based on their own private reward histories, since their beliefs about others could be wrong and inconsistent. For instance, one agent may observe a high reward for a joint arm, while another agent observes a low reward. Then the first agent may decide to keep playing that joint arm, and believe that the other agent is also still playing it, while in actuality the other agent has already moved on to testing other joint arms. In such scenarios, even a single small observation error could cause inconsistent beliefs among agents and lead to error propagation that is never corrected in the future. In Proposition 1, we show this effect for the classic UCB1 algorithm and prove that UCB1 is not robust to errors when implemented in a distributed way.

Proposition 1: In distributed networks where agents do not exchange messages at run-time, if the observations of the overall reward realization are subject to individual errors, then the expected regret of the distributed version of the UCB1 algorithm, in which each agent keeps an instance of UCB1 for its own actions and N − 1 different instances of UCB1 for the actions of the other agents, can be linear.

Proof: See Appendix B in [31].

Proposition 1 implies that even if we implement existing deterministic algorithms such as UCB1 for distributed agents using the reward history, there is no guarantee on their performance when individual observation errors exist. Therefore, there is a need to develop new algorithms that are robust to errors in distributed scenarios. In the next few sections, we will propose such a class of algorithms that are robust to individual errors and can achieve logarithmic regret in time even in the noisy environment.

V. DISTRIBUTED COOPERATIVE LEARNING ALGORITHM

In this section, we propose the basic distributed cooperative learning (DisCo) algorithm, which is suitable for any overall reward function. The proposed learning algorithm achieves logarithmic regret (i.e., R(T) = O(∏_{n=1}^N K_n ln T)). In Sections VI and VII, we will identify some useful reward structures and exploit them to design improved learning algorithms (the DisCo-FI and DisCo-PI algorithms) which achieve even better regret results.

[Figure 2: flowchart of the phase transition. If E(t) > 0 the slot is an exploration slot; otherwise, if γ(t) ≤ ζ(t) a new exploration phase starts, and if γ(t) > ζ(t) the slot is an exploitation slot. At the end of an exploration slot, E(t+1) := mod(E(t)+1, L_1+1); when E(t+1) = 0, γ(t+1) = γ(t)+1; then t := t+1.]

Fig. 2. Flowchart of the phase transition.

A. Description of the Algorithm

The DisCo algorithm is divided into phases: exploration and exploitation. Each agent using DisCo will alternate between these two phases, in a way that at any time t either all agents are exploring or all are exploiting. In the exploration phase, each agent selects an arm only to learn about the effects on the expected reward, without considering reward maximization, and updates the reward estimates of the arm it selected. In the exploitation phase, each agent exploits the best (estimated) arm to maximize the overall reward.

Knowledge, Counters and Estimates: There is a deterministic control function ζ(t) of the form ζ(t) = A ln t commonly known by all agents. This function will be designed and determined before run-time, and thus is an input of the algorithm. Each exploration phase has a fixed length of L_1 = ∏_{n=1}^N K_n slots, equal to the total number of joint arms. Each agent maintains two counters⁴. The first counter γ(t) records the number of exploration phases that they have experienced by time slot t. The second counter E(t) ∈ {0, 1, ..., L_1} represents whether the current slot is an exploration slot and, if yes, which relative position it is at. Specifically, E(t) = 0 means that the current slot is an exploitation slot; E(t) > 0 means that the current slot is the E(t)-th slot in the current exploration phase. Both counters are initialized to zero: γ(0) = 0, E(0) = 0. Each agent n maintains L_1 sample mean reward estimates r̄_n(l), ∀l ∈ {1, ..., L_1}, one for each relative slot position in an exploration phase. Let b_l^n denote the arm selected by agent n in the l-th position in an exploration phase. These reward estimates are initialized to r̄_n(l) = 0 and will be updated over time using the realized rewards (the exact updating method will be explained shortly).

Phase Transition: Whether a new slot t is an exploration slot or an exploitation slot will be determined by the values of ζ(t), γ(t) and E(t). At the beginning of each slot t, the agents first check the counter E(t) to see whether they are still in the exploration phase: if E(t) > 0, then the slot is an exploration slot; if E(t) = 0, whether the slot is an exploration slot or an exploitation slot will then be determined by γ(t) and ζ(t). If γ(t) ≤ ζ(t), then the agents start a new exploration phase, and at this point E(t) is set to E(t) = 1. If γ(t) > ζ(t), then the slot is an exploitation slot. At the end of each exploration slot, the counter E(t+1) for the next slot is updated to E(t+1) ← mod(E(t)+1, L_1+1). When E(t+1) = 0, the current exploration phase ends, and hence the counter γ(t+1) for the next slot is updated to γ(t+1) ← γ(t)+1. Figure 2 provides the flowchart of the phase transition for the algorithm.

⁴ Agents maintain these counters by themselves, γ_n(t), E_n(t), ∀n. However, since agents update these counters in the same way, the superscript for the agent index is omitted in our analysis.
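For concreteness, here is a minimal sketch of the phase-transition rule just described (illustrative code, not the authors' implementation); it assumes t starts at 1 and that L is the exploration-phase length (L_1 for DisCo).

```python
import math

def next_phase_state(E, gamma, t, A, L):
    """One step of the phase transition in Fig. 2.
    E      : position within the current exploration phase (0 = not exploring)
    gamma  : number of completed exploration phases
    A, L   : control-function constant (zeta(t) = A ln t) and exploration-phase length
    Returns (explore_now, E_next, gamma_next) for slot t."""
    if E > 0:                                     # still inside an exploration phase
        explore_now = True
    elif gamma <= A * math.log(max(t, 1)):        # gamma(t) <= zeta(t): start a new phase
        explore_now, E = True, 1
    else:                                         # gamma(t) > zeta(t): exploitation slot
        explore_now = False
    if explore_now:
        E_next = (E + 1) % (L + 1)                # E(t+1) = mod(E(t)+1, L+1)
        gamma_next = gamma + 1 if E_next == 0 else gamma
    else:
        E_next, gamma_next = 0, gamma
    return explore_now, E_next, gamma_next
```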

Prescribed Actions: The algorithm prescribes different actions for agents in different slots and in different phases.

(i) Exploration phase: As is clear from the Phase Transition, an exploration phase consists of L_1 slots. In each phase, the agents select their own arms in such a way that every joint arm is selected exactly once. This is possible without communication if agents agree on a selection order for the joint arms before run-time. At the end of each exploration slot (the l-th slot), r̄_n(l) is updated to

r̄_n(l) ← [(γ(t) − 1) r̄_n(l) + r_t^n] / γ(t)    (2)

Note that the observed reward realization r_t^n at time t may be different for different agents due to errors.

(ii) Exploitation phase: Each exploitation phase has a variable length which depends on the control function ζ(t) and the counter γ(t). At each exploitation slot t, each agent n selects a_n = b_{l*}^n with l* = argmax_l r̄_n(l). That is, each agent n selects the arm with the best reward estimate among r̄_n(l), ∀l ∈ {1, ..., L_1}. Note that in the exploitation slots, an agent n does not need to know the other agents' selected arms. Since agents have individual observation noises, it is also possible that l* is different for different agents.
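The exploration and exploitation rules above can be summarized in the following sketch of a single DisCo agent (hypothetical names, not the authors' implementation). It assumes the 1-based index γ of the current exploration phase is supplied externally, e.g. by the phase-transition routine sketched earlier.

```python
import itertools

class DisCoAgent:
    """Sketch of one agent's bookkeeping in the DisCo algorithm (Section V-A).
    The enumeration order of joint arms is fixed before run-time, so no
    communication is needed."""

    def __init__(self, n, arm_counts):
        self.n = n
        # agreed enumeration of joint arms; position l of an exploration phase
        # corresponds to joint arm self.joint_arms[l - 1]
        self.joint_arms = list(itertools.product(*[range(K) for K in arm_counts]))
        self.L1 = len(self.joint_arms)
        self.r_bar = [0.0] * self.L1              # sample-mean estimates r_bar_n(l)

    def explore_arm(self, l):
        """Arm b_l^n that agent n plays in the l-th exploration slot (l = 1..L1)."""
        return self.joint_arms[l - 1][self.n]

    def update(self, l, gamma, r_obs):
        # Eq. (2): running average over the gamma >= 1 exploration phases so far
        self.r_bar[l - 1] = ((gamma - 1) * self.r_bar[l - 1] + r_obs) / gamma

    def exploit_arm(self):
        """In exploitation slots, play own component of the best-estimated joint arm."""
        l_star = max(range(self.L1), key=lambda l: self.r_bar[l])
        return self.joint_arms[l_star][self.n]
```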

B. Analysis of the regret

At any exploitation slot, agents need sufficiently many reward observations from all sets of arms in order to estimate the best joint arm correctly with a probability high enough such that the expected number of mistakes is small. On the other hand, if the agents spend too much time in exploring, then the regret will be too large because they are not exploiting the best joint arm sufficiently often. The control function ζ(t) determines when the agents should explore and when they should exploit and hence balances exploration and exploitation. In Theorem 2, we will establish conditions on the control function ζ(t) such that the expected regret bound of the proposed DisCo algorithm is logarithmic in time. Let Δ^max = max_{a ≠ a^*} [μ(a^*) − μ(a)] be the maximum reward loss incurred by selecting any suboptimal joint arm, and let Δ^min = min_{a ≠ a^*} [μ(a^*) − μ(a)] be the reward difference between the best joint arm and the second-best joint arm.

Theorem 2: If ζ(t) = A ln t with A > 2(D/Δ^min)², then the expected regret of the DisCo algorithm after any number of periods T is bounded by

R(T) ≤ A L_1 Δ^max ln T + B_1    (3)

where B_1 = L_1 Δ^max + 2 N L_1 Δ^max ∑_{t=1}^∞ t^{−(A/2)(Δ^min/D)²} is a constant.

Proof: See Appendix C in [31].


The regret bound proved in Theorem 2 is logarithmic in time, which guarantees convergence in terms of the average reward, i.e. lim_{T→∞} E[R(T)]/T = 0. In fact, the order of the regret bound, i.e. O(L_1 ln T), is the lowest possible that can be achieved [9]. However, since the impact of individual arms on the overall (expected) reward is unknown and may be coupled in a complex way, it is necessary to explore every possible joint arm to learn its performance. This leads to a large constant that multiplies ln T, which is on the order of L_1 = ∏_{n=1}^N K_n. If there are many agents, then ∏_{n=1}^N K_n will be very large and hence, a large reward loss will be incurred in the exploration phases. This motivates us to design improved learning algorithms which do not require exploring all possible joint arms, in order to improve the learning regret. In the next section, we will exploit the informativeness (defined formally later) of the expected reward function to develop improved learning algorithms based on the basic DisCo algorithm. We first consider the best case (Full Informativeness) and then extend to the more general case (Partial Informativeness).

VI. A LEARNING ALGORITHM FOR FULLY INFORMATIVE REWARDS

In many application scenarios, even if we do not know exactly how the actions of agents determine the expected overall rewards, some structural properties of the overall reward function may be known. For example, in the classification problem which uses multiple classifiers [?], the overall classification accuracy is increasing in each individual classifier's accuracy, even though each individual's optimal action is unknown a priori. Thus, some overall reward functions may provide higher levels of informativeness about the optimality of individual actions. In this section, we will develop learning algorithms that achieve improved regret results and faster learning speed by exploiting such information.

A. Reward Informativeness

We first define the informativeness of an expected overall reward function.

Definition 1: (Informativeness) An expected overall reward function μ(a) is said to be informative with respect to agent n if there exists a unique arm a_n^* ∈ A_n such that, ∀a_{-n}, a_n^* = argmax_{a_n} μ(a_n, a_{-n}).

In words, if the reward is informative with respect to agent n, then for any choice of arms selected by the other agents, agent n's best arm in terms of the expected overall reward is the same. Lemma 1 helps explain why such a reward function is "informative".

Lemma 1: Suppose that μ(a) is informative with respect to agent n and the unique optimal arm is a_n^*; then the following is true:

a_n^* = argmax_{a_n} ∑_{a_{-n}} θ_{a_{-n}} μ(a_n, a_{-n}),  ∀θ_{a_{-n}} ≥ 0 with ∑_{a_{-n}} θ_{a_{-n}} = 1    (4)

Proof: This is a direct result of Definition 1.

Lemma 1 states that, for an agent n, the weighted average of the expected overall reward over all possible choices of arms by the other agents is maximized at the optimal arm a_n^*. Moreover, the optimal arm is the same for all possible weights θ_{a_{-n}}, ∀a_{-n}. It further implies that, instead of using the exact expected overall reward estimate r̄_n(a_n, a_{-n}) to evaluate the optimality of an arm a_n, agent n can also use the relative overall reward estimate (i.e. the weighted average reward estimate). In this way, agent n needs to maintain only K_n relative overall reward estimates r̄_n(a_n), one for each of its arms a_n, and can use these estimates to learn and select the optimal arm. In particular, let w_{a_n, a_{-n}} be the number of times that the joint arm (a_n, a_{-n}) is selected in the exploration slots that are used to estimate r̄_n(a_n). Then,

E[r̄_n(a_n)] = ∑_{a_{-n}} w_{a_n, a_{-n}} μ(a_n, a_{-n}) / ∑_{a_{-n}} w_{a_n, a_{-n}}    (5)

If w_{a_n, a_{-n}} = w_{a'_n, a_{-n}}, ∀a_n, a'_n, ∀a_{-n}, then the weights w_{a_n, a_{-n}} / ∑_{a_{-n}} w_{a_n, a_{-n}} =: θ_{a_{-n}} are the same for all a_n. Note that we do not need to know the exact values of w_{a_n, a_{-n}} as long as w_{a_n, a_{-n}} = w_{a'_n, a_{-n}}, ∀a_n, a'_n, ∀a_{-n}. Therefore, the relative reward estimates r̄_n(a_n) can be used to learn the optimal action a_n^* even if agents are not exactly sure which arms have been played by the other agents.

Definition 2: (Fully Informative) An expected overall reward function μ(a) is said to be fully informative if it is informative with respect to all agents.

If the overall reward function is fully informative, then the agents only need to record the relative overall reward estimates instead of the exact overall reward estimates. Therefore, the key design problem of the learning algorithm is, for each agent n, to ensure that the weights in (4) are the same for the relative reward estimates of all its arms, so that it is sufficient for agent n to learn the optimal arm using only these relative reward estimates.

We emphasize the importance of the weights θ_{a_{-n}}, ∀a_{-n}, being the same for all a_n ∈ A_n of each agent n, even though agent n does not need to know these weights exactly. If the weights are different for different a_n, then it is possible that r̄_n(a'_n) > r̄_n(a_n^*) merely because the other agents are using their good arms when agent n is selecting a suboptimal arm a'_n, while the other agents are using their bad arms when agent n is selecting the optimal arm a_n^*. Hence, simply relying on the relative reward estimates does not guarantee obtaining the correct information needed to find the optimal arm.
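A hypothetical two-agent example (not from the paper) makes this point numerically: with equal weights agent 1 ranks its arms correctly, while with arm-dependent weights the ranking flips.

```python
# Hypothetical illustration of why the weights theta_{a_{-n}} must be the same
# across agent n's arms.  mu is informative w.r.t. agent 1: arm 'a' beats 'a2'
# for either arm of agent 2.
mu = {('a', 'b'): 0.9, ('a', 'b2'): 0.5, ('a2', 'b'): 0.8, ('a2', 'b2'): 0.4}

equal     = {'b': 0.5, 'b2': 0.5}   # same weights for both of agent 1's arms
skewed_a  = {'b': 0.0, 'b2': 1.0}   # arm 'a'  sampled only while agent 2 plays 'b2'
skewed_a2 = {'b': 1.0, 'b2': 0.0}   # arm 'a2' sampled only while agent 2 plays 'b'

avg = lambda arm, w: sum(w[b] * mu[(arm, b)] for b in w)
print(avg('a', equal), avg('a2', equal))         # 0.7 > 0.6 -> correct ranking
print(avg('a', skewed_a), avg('a2', skewed_a2))  # 0.5 < 0.8 -> wrong ranking
```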

Reward functions that are fully informative exist in many applications. We identify a class of overall reward functions that are fully informative below.

Fully Informative Reward Functions: For each agent n, if there exists a function f_n: A_n → R such that, for all joint arms a, the expected reward can be expressed as a function F: R^N → R with μ(a) = F(f_1(a_1), ..., f_N(a_N)) and μ is monotone in f_n, ∀f_{-n}, ∀n, then μ(a) is fully informative.

We provide two concrete examples below.

(1) Classification with multiple classifiers. In the problem of event classification using multiple classifiers, each classifier is in charge of the classification problem of one specific feature of the target event [3][4][5][6]. The event is accurately classified if and only if all classifiers have their corresponding features classified correctly. Let f_n(a_n) be the unknown feature classification accuracy of classifier n when it selects the operating point a_n. Assuming that the features are independent, the event classification accuracy given the joint operating points a of all classifiers can be expressed as μ(a) = ∏_{n=1}^N f_n(a_n). Hence the event classification accuracy is fully informative.

(2) Network security monitoring using distributed learning: A set of distributed learners (each in charge of monitoring a specific sub-network) make predictions about a potential security attack based on their own observed data (e.g. packets from different IP addresses to their corresponding sub-networks). Let the prediction of learner n be y_n(t|a_n(t)) ∈ {1, −1} at time t when it chooses a classification function a_n. Based on these predictions, an ensemble learner uses a weighted majority vote rule [27] to make the final prediction, i.e. ŷ(t|a(t)) = sgn(w · y(t|a(t))), and takes the security measures accordingly. In the end, the distributed learners observe the outcome r_t^n(a(t)) of the system, which depends on the accuracy of the prediction, i.e. on |y(t) − ŷ(t|a(t))|, with y(t) being the true security condition. Let f_n(a_n) = E[|y_n(a_n) − y|] be the accuracy of learner n when choosing the classification function a_n. The reward function μ(a) is also monotone in f_n(a_n) and hence is fully informative.
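For small problems, Definition 1 (and hence Definition 2) can be checked by brute force; the sketch below (illustrative only, exponential in the number of agents, ties ignored) tests whether μ is informative with respect to a given agent. For the product-form accuracy μ(a) = ∏_n f_n(a_n) of example (1), it returns True for every n.

```python
import itertools

def is_informative(mu, arm_counts, n):
    """Brute-force check of Definition 1: does agent n have a single arm that is
    best for every fixed choice a_{-n} of the other agents?  mu maps joint arms
    (tuples in agent order) to expected rewards."""
    others = [range(K) for i, K in enumerate(arm_counts) if i != n]
    best = set()
    for a_minus_n in itertools.product(*others):
        def joint(a_n):
            a = list(a_minus_n)
            a.insert(n, a_n)              # re-insert agent n's arm at position n
            return tuple(a)
        best.add(max(range(arm_counts[n]), key=lambda a_n: mu[joint(a_n)]))
    return len(best) == 1                 # same best arm for every a_{-n}
```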

We note that in the first example different agents have orthogonal learning tasks (classification with respect to different features), while in the second example different agents have the same learning task (detecting the security attack). However, both examples exhibit the fully informative property and our proposed learning algorithms handle both cases effectively. The difference comes from the speed of learning. When agents have orthogonal learning tasks, they are more pivotal and so their actions have a greater influence on the rewards, which allows them to learn faster as well. This is highlighted in our simulation results in Section IX, where it is shown that when an agent becomes more pivotal it discovers its optimal action more quickly.

B. Description of the Algorithm

In this subsection, we describe an improved learning algorithm. We call this new algorithm the DisCo-FI algorithm, where "FI" stands for "Fully Informative"⁵. The key difference from the basic DisCo algorithm is that, in DisCo-FI, the agents will maintain relative reward estimates instead of the exact reward estimates.

Knowledge, Counters and Estimates: Agents know a common deterministic function ζ(t) and maintain two counters γ(t) and E(t). Now each exploration phase has a fixed length of L_2 = ∑_{n=1}^N K_n slots and hence E(t) ∈ {0, 1, ..., L_2}, with E(t) = 0 representing that the slot is an exploitation slot and E(t) > 0 representing that it is the E(t)-th relative slot in the current exploration phase. As before, both counters are initialized to γ(0) = 0, E(0) = 0. Each agent n maintains K_n sample mean (relative) reward estimates r̄_n(a_n), ∀a_n ∈ A_n, one for each of its own arms. These (relative) reward estimates are initialized to r̄_n(a_n) = 0 and will be updated over time using the realized rewards.

⁵ The algorithm can run in the general case, but we bound its regret only when the overall reward function is fully informative.

[Figure 3: one exploration phase with 3 agents and 3 arms each, divided into Subphases 1-3. In its own subphase each agent sweeps its arms 1, 2, 3 while the other agents repeat their current best arms; e.g. agent 1 updates its best arm (to 2) after subphase 1 and agent 2 updates its best arm (to 3) after subphase 2.]

Fig. 3. Illustration of one exploration phase with 3 agents, each of which has 3 arms.

Phase Transition: The transition between exploration phases and exploitation phases is almost identical to that in the DisCo algorithm. The only difference is that at the end of each exploration slot, the counter E(t+1) for the next slot is updated to E(t+1) ← mod(E(t)+1, L_2+1). Hence, we ensure that each exploration phase has only L_2 slots.

Prescribed Actions: The algorithm prescribes different actions for agents in different slots and in different phases.

(i) Exploration phase: As is clear from the Phase Transition, an exploration phase consists of L_2 slots. These slots are further divided into N subphases and the length of the n-th subphase is K_n. In the n-th subphase, agents take actions as follows (Figure 3 provides an illustration):

1) Agent n selects each of its arms a_n ∈ A_n in turn, each arm for one slot. At the end of each slot in this subphase, it updates its reward estimate using the realized reward in this slot as follows:

   r̄_n(a_n) ← [γ(t) r̄_n(a_n) + r_t^n] / (γ(t) + 1)    (6)

2) Every agent i ≠ n selects the arm with the highest reward estimate for every slot in this subphase, i.e. a_i(t) = argmax_{a_i ∈ A_i} r̄_i(a_i).

(ii) Exploitation phase: Each exploitation phase has a variable length which depends on the control function ζ(t) and the counter γ(t). In each exploitation slot t, each agent n selects a_n(t) = argmax_{a ∈ A_n} r̄_n(a).
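The subphase structure above can be sketched as follows (illustrative code with hypothetical names; the phase-transition bookkeeping is the same as before with L_2 in place of L_1).

```python
class DisCoFIAgent:
    """Sketch of agent n's behaviour during one DisCo-FI exploration phase
    (Section VI-B).  The phase has L2 = K_1 + ... + K_N slots, split into N
    subphases; subphase i has K_i slots."""

    def __init__(self, n, arm_counts):
        self.n = n
        self.arm_counts = arm_counts
        self.r_bar = [0.0] * arm_counts[n]        # relative estimates r_bar_n(a_n)

    def explore_arm(self, E):
        """Arm to play in the E-th slot (E = 1..L2) of an exploration phase."""
        i, pos = 0, E - 1
        while pos >= self.arm_counts[i]:          # locate the active subphase i
            pos -= self.arm_counts[i]
            i += 1
        if i == self.n:                           # my subphase: sweep my own arms
            return pos
        return self.best_arm()                    # otherwise: repeat current best estimate

    def update(self, own_arm, gamma, r_obs):
        # Eq. (6): only applied during agent n's own subphase
        self.r_bar[own_arm] = (gamma * self.r_bar[own_arm] + r_obs) / (gamma + 1)

    def best_arm(self):
        """Also used in every exploitation slot."""
        return max(range(self.arm_counts[self.n]), key=lambda a: self.r_bar[a])
```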

C. Analysis of regret

We bound the regret of the DisCo-FI algorithm in Theorem 3. Let Δ_n^min = min_{a_n ≠ a_n^*, a_{-n}} [μ(a_n^*, a_{-n}) − μ(a_n, a_{-n})] be the reward difference between agent n's best arm and its second-best arm, and let Δ_FI^min = min_n Δ_n^min.

Theorem 3: Suppose μ(a) is fully informative. If ζ(t) = A ln t with A ≥ 2(D/Δ_FI^min)², then the expected regret of the DisCo-FI algorithm after any number of slots T is bounded by

R(T) < A L_2 Δ^max ln T + B_2    (7)

where B_2 = L_2 Δ^max + 2 L_2 Δ^max ∑_{t=1}^∞ t^{−(A/2)(Δ_FI^min/D)²} is a constant.

Proof: See Appendix D in [31].

The regret bound proved in Theorem 3 is also logarithmic in time for any finite time horizon T. Therefore, the average reward is guaranteed to converge to the optimal reward when the time horizon goes to infinity, i.e. lim_{T→∞} E[R(T)]/T = 0. Importantly, the proposed DisCo-FI algorithm exploits the informativeness of the expected overall reward function and achieves a much smaller constant that multiplies ln T. Instead of learning every joint arm, agents can directly learn their own optimal arms through the relative reward estimates.

VII. A LEARNING ALGORITHM FOR PARTIALLY INFORMATIVE REWARDS

In the previous section, we developed the DisCo-FI algorithm for reward functions that are fully informative. However, in problems where the full informativeness property does not hold, the DisCo-FI algorithm cannot guarantee a logarithmic regret bound. In this section, we extend DisCo-FI to the more general case where the full informativeness constraint is relaxed. For example, in the classification problem which uses multiple classifiers, each classifier consists of multiple components, each of which is considered as an independent agent. The accuracy of each individual classifier may depend on the configurations of these components in a complex way, but the overall classification accuracy is still increasing in the accuracy of each individual classifier. Specifically, if the accuracy of one of these classifiers is increased, then the overall accuracy will increase independently of which configuration of the components of that classifier is chosen.

A. Partial Informativeness

We define an agent group and a group partition first.

Definition 3: (Agent Group and Group Partition) An agent group g consists of a set of agents. A group partition G = {g_1, ..., g_M} of size M is a set of M agent groups such that each agent n ∈ N belongs to exactly one group.

We will call the set of arms selected by the agents in a group g_m a group-joint arm with respect to group g_m, denoted by a_m = {a_n}_{n∈g_m}.⁶ Denote the size of group g_m by N_m. It is clear that ∑_{m=1}^M N_m = N.

Definition 4: (Group-Informativeness) An expected overall reward function μ(a) is said to be informative with respect to a group g_m if there exists a unique group-joint arm a_m^* ∈ ×_{n∈g_m} A_n such that, ∀a_{-m}, a_m^* = argmax_{a_m ∈ ×_{n∈g_m} A_n} μ(a_m, a_{-m}).

In words, for any choices of arms by the other agents, group g_m's best group-joint arm is the same. Note that this is a generalization of Definition 1 because a group can also consist of only a single agent. Lemma 2 immediately follows.

Lemma 2: If μ(a) is informative with respect to an agent group g_m and the unique optimal group-joint arm is a_m^*, then the following is true:

a_m^* = argmax_{a_m} ∑_{a_{-m}} θ_{a_{-m}} μ(a_m, a_{-m}),  ∀θ_{a_{-m}} ≥ 0 with ∑_{a_{-m}} θ_{a_{-m}} = 1    (8)

Proof: This is a direct result of Definition 4.

⁶ We abuse notation by using a_m for a_{g_m}. This should not introduce confusion given the specific context.

Lemma 2 states that, for an agent group g_m, the weighted average of the expected reward over all possible choices of arms by the other agents is maximized at the optimal group-joint arm a_m^*. Moreover, the optimal group-joint arm is the same for all possible weights. Therefore, to evaluate the optimality of a group-joint arm a_m, the agents in group g_m (∀n ∈ g_m) can use the relative reward estimate for that group-joint arm r̄_n(a_m) instead of the exact expected reward estimate r̄_n(a_m, a_{-m}), as long as the weights θ_{a_{-m}}, ∀a_{-m}, are the same for all a_m.

Definition 5: (Partially Informative) An expected overall reward function μ(a) is said to be partially informative with respect to a group partition G = {g_1, ..., g_M} if it is informative with respect to all groups in G.

Consider a surveillance problem in a wireless sensor network. Assume that there are multiple areas that are monitored by clusters of sensors, and let g_m denote the m-th cluster of sensors. Each sensor selects a surveillance action. For instance, this action can be the position of the video camera, the channel listened to by the sensor, etc. Let μ_m(a_m) be the reward of the joint surveillance action taken by the sensors in cluster m. For example, this reward can be the probability of detecting an intruder that enters the area surveyed by the sensors in cluster m. Then, depending on the strategic importance of these areas, the global reward is a linear combination of the rewards of the clusters, i.e., μ(a) = ∑_m w_m μ_m(a_m). However, improving each individual sensor's action may not necessarily improve the accuracy of the cluster. In this case, the global reward is monotone in each cluster's reward but may not be monotone in each individual sensor's action. Thus, the reward function is partially informative.

If a reward function is fully informative, then it is also partially informative with respect to any group partition of the agents. On the other hand, if we take the entire agent set as one single group, then any reward function is partially informative with respect to this partition. Therefore, "Partially Informative" can apply to all possible reward functions by defining the group partition appropriately.

B. Description of the Algorithm

In this subsection, we propose the improved algorithm whose regret can be bounded for reward functions that are partially informative. We call this new algorithm the DisCo-PI algorithm, where "PI" stands for "Partially Informative".

Knowledge, Counters and Estimates: Agents know a common deterministic function ζ(t) and maintain two counters γ(t) and E(t). In the DisCo-PI algorithm, each exploration phase has a fixed length of L_3 = ∑_{m=1}^M ∏_{n∈g_m} K_n slots and hence E(t) ∈ {0, 1, ..., L_3}, with E(t) = 0 representing that the slot is not an exploration slot and E(t) > 0 representing that it is the E(t)-th relative slot in the current exploration phase. Both counters are initialized to γ(0) = 0, E(0) = 0. Each agent n in group g_m maintains S_m = ∏_{i∈g_m} K_i reward estimates r̄_n(l), ∀l ∈ {1, 2, ..., S_m}. Let b_l^n ∈ A_n denote the arm selected by agent n in the l-th slot of an exploration subphase. These (relative) reward estimates are initialized to r̄_n(l) = 0 and will be updated over time using the realized rewards.


Phase Transition: The algorithm works in a similar way as the first two algorithms in determining whether a slot is an exploration slot or an exploitation slot. The only difference is that the counter E(t+1) is updated to E(t+1) ← mod(E(t)+1, L_3+1). This ensures that each exploration phase has L_3 slots.

Prescribed Actions: The algorithm prescribes different actions in different slots and in different phases.

(i) Exploration phase: An exploration phase consists of L_3 slots. These slots are further divided into M subphases, and the length of the m-th subphase is ∏_{n∈g_m} K_n. In the m-th subphase, agents take actions as follows:

1) Agents in group g_m select their arms in such a way that every group-joint arm a_m with respect to group g_m is selected exactly once in this exploration subphase. At the end of the l-th slot in the exploration subphase, r_n(l) is updated as

r_n(l) ← (γ(t) r_n(l) + r^n_t) / (γ(t) + 1)   (9)

2) Each agent i in a group g_j ≠ g_m selects the component arm that forms the group-joint arm with the highest reward estimate, i.e. a_i = b^i_{l*} with l* = argmax_l r_i(l), for every slot in this subphase.

(ii) Exploitation phase: Each exploitation phase has a variable length which depends on the control function ζ(t) and the counter γ(t). In each exploitation slot t, each agent n of group g_m selects a_n = b^n_{l*} with l* = argmax_l r_n(l).
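The per-slot logic can be summarized by the following sketch (again hypothetical Python continuing the bookkeeping sketch above, with our own names such as disco_pi_slot and best_component). It assumes the same explore/exploit rule as DisCo, i.e. a slot is an exploration slot whenever an exploration phase is in progress (E(t) > 0) or fewer than ζ(t) exploration phases have been completed, that group-joint arms are enumerated in a fixed lexicographic order during each subphase, and that ζ(t) = A ln t.

# A minimal per-slot sketch of DisCo-PI for a single agent (our own code, not the paper's);
# it reuses the AgentState fields introduced above.
import math
from itertools import product

def group_joint_arms(group, K):
    """All group-joint arms of a group, in a fixed lexicographic order."""
    return list(product(*[range(K[i]) for i in group]))

def subphase_and_slot(E, groups, K):
    """Map the relative exploration slot E in {1,...,L3} to (subphase index m, slot l)."""
    offset = E - 1
    for m, g in enumerate(groups):
        size = math.prod(K[i] for i in g)
        if offset < size:
            return m, offset
        offset -= size
    raise ValueError("E exceeds L3")

def best_component(agent, K):
    """Agent n's component of the group-joint arm with the highest estimate, b^n_{l*}."""
    l_star = max(range(agent.S_m), key=lambda l: agent.r[l])
    return group_joint_arms(agent.group, K)[l_star][agent.group.index(agent.n)]

def disco_pi_slot(agent, t, groups, K, L3, A):
    """Select agent n's action for slot t and record which estimate this slot's reward should update."""
    if agent.E > 0 or agent.gamma < A * math.log(max(t, 2)):      # assumed explore/exploit rule
        agent.E += 1                                   # E(t+1) = mod(E(t)+1, L3+1)
        m, l = subphase_and_slot(agent.E, groups, K)
        if agent.n in groups[m]:
            arms = group_joint_arms(groups[m], K)      # enumerate g_m's group-joint arms
            action = arms[l][groups[m].index(agent.n)]
            agent.pending = (l, agent.gamma)           # update r_n(l) when the reward arrives
        else:
            action = best_component(agent, K)          # agents outside g_m play their best arm so far
            agent.pending = None
        if agent.E == L3:                              # exploration phase completed
            agent.E, agent.gamma = 0, agent.gamma + 1
    else:
        action = best_component(agent, K)              # exploitation slot: a_n = b^n_{l*}
        agent.pending = None
    return action

def update_estimate(agent, reward):
    """Relative reward update of Eq. (9): r_n(l) <- (gamma r_n(l) + r^n_t) / (gamma + 1)."""
    if agent.pending is not None:
        l, g = agent.pending
        agent.r[l] = (g * agent.r[l] + reward) / (g + 1)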

C. Analysis of Regret

We bound the regret incurred by running the DisCo-PI algorithm in Theorem 4. Let ∆min_m = min_{a_m ≠ a*_m, a_{−m}} [µ(a*_m, a_{−m}) − µ(a_m, a_{−m})] be the reward difference between the best group-joint arm of g_m and the second-best group-joint arm of g_m, and let ∆min_PI = min_m ∆min_m.

Theorem 4: Suppose µ(a) is partially informative with respect to a group partition G. If ζ(t) = A ln t with A ≥ 2(D/∆min_PI)^2, then the expected regret of the DisCo-PI algorithm after any number T of slots is bounded by

R(T) < A L_3 ∆max ln T + B_3   (10)

where

B_3 = L_3 ∆max + 2 ∑_{m=1}^{M} N_m ∏_{n∈g_m} K_n ∆max ∑_{t=1}^{∞} t^{−(A/2)(∆min_PI / D)^2}   (11)

is a constant.

Proof: See Appendix E in [31].
The regret bound proved in Theorem 4 is also logarithmic in time for any finite time horizon T. Therefore, the average reward is guaranteed to converge to the optimal reward as the time horizon goes to infinity, i.e. lim_{T→∞} E[R(T)]/T = 0. However, instead of learning every joint arm as in DisCo, agents in each group only need to learn their own optimal group-joint arm using the relative reward estimates. Note that the constant that multiplies ln T is smaller than that of DisCo but larger than that of DisCo-FI. Table II summarizes the characteristics of the three proposed algorithms.

TABLE II
COMPARISON OF THE PROPOSED THREE ALGORITHMS.

                          DisCo                       DisCo-PI                                DisCo-FI
Reward Informativeness    Any                         Partially Informative                   Fully Informative
Learning Speed            Slow                        Medium                                  Fast
Regret order              O(∏_{n=1}^{N} K_n ln T)     O(∑_{m=1}^{M} ∏_{n∈g_m} K_n ln T)       O(∑_{n=1}^{N} K_n ln T)
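For a concrete sense of these orders, take the setting used in our simulations in Section IX: N = 3 classifiers with K_n = 4 operating points each and the group partition {1, 2},{3}. The multiplicative constants in the three regret orders are then ∏_n K_n = 4^3 = 64 for DisCo, ∑_m ∏_{n∈g_m} K_n = 16 + 4 = 20 for DisCo-PI, and ∑_n K_n = 12 for DisCo-FI.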

VIII. EXTENSIONS

A. Missing and Delayed Feedback

In the previous analysis, we assumed that the global feedback is provided to the agents immediately at the end of each slot. In practice, this feedback may be missing or delayed, and the delay may also be different for different agents since the agents are distributed. We study the extension of our algorithm to these two scenarios in this subsection.

Consider the missing feedback case where the global feedback at time t may be missing with probability 1 − p. For instance, in the example of multiple classifiers, the ground truth label used to compute the global reward is not always available due to the high labeling cost, and thus the global reward may sometimes be unavailable. Our proposed algorithm can be easily extended as follows: whenever the feedback is missing in the exploration phase (which is observed by all agents), the agents repeat their current actions in the next slot until the feedback is received. Let R_nm(T) denote the regret bound of the proposed algorithms when there is no missing feedback and R_m(T) denote the regret bound of the modified algorithm with missing feedback by time T. Then we have the following proposition.

Proposition 2: R_m(T) = R_nm(T) + ((1 − p)/p)(A ln T + 1) L ∆max.
Proof: See Appendix F in [31].

Next, consider the delayed feedback case where the global feedback at time t arrives at time t + L_n(t) for agent n, where L_n(t) is a random variable with support in {0, 1, ..., L_max} and L_max > 0 is an integer. Our proposed algorithms can be modified as follows: the exploration phase is extended by L_max slots. The agents update their reward estimates when the corresponding feedback is received. In each of the extended L_max slots, agents select the arms that maximize their estimated rewards in that slot. In this way, the algorithm ensures that sufficient feedback has been received and hence that the reward estimates are sufficiently accurate in any exploitation slot. Let R_nd(T) denote the regret bound of the proposed algorithms with no delay and R_d(T) denote the regret bound of the modified algorithm with delays by time T. Then we have the following proposition.

Proposition 3: R_d(T) = R_nd(T) + (A ln T + 1) L_max ∆max.

Proof: See Appendix G in [31].
In both cases with missing or delayed feedback, the modified algorithms incur larger regret bounds than the original algorithms. However, the regret bounds are still logarithmic in time.
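The missing-feedback modification amounts to a small wrapper around the per-slot logic sketched earlier. The following hypothetical snippet (our own names; env.step returning None when feedback is missing is an assumed interface, not the paper's) illustrates the repeat-until-feedback rule for an exploration slot.

# Hypothetical sketch of the missing-feedback modification: in an exploration slot the
# prescribed action is repeated until the global feedback actually arrives, so every reward
# estimate is still updated exactly once per exploration slot.
def exploration_slot_with_missing_feedback(agent, t, env, select_action, update_estimate):
    action = select_action(agent, t)       # prescribed exploration action for this slot
    feedback = env.step(action)            # assumed to return None when the feedback is missing
    while feedback is None:                # all agents observe that the feedback is missing
        t += 1
        feedback = env.step(action)        # repeat the same action in the next slot
    update_estimate(agent, feedback)       # relative update of Eq. (9)
    return t                               # slot index after the feedback finally arrived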


B. Accuracy Drift

In the previous analysis, we assumed that even though the reward distributions, and hence the expected rewards of the joint actions, are unknown a priori, they do not change over time. However, in some scenarios these rewards can be both unknown and time-varying due to changing system characteristics (for example, in a Big Data system the distribution of the data stream may change, which results in a change in the classification accuracies). We refer to this as accuracy drift. In this case, the expected reward of selecting a joint arm a is also a time variable µ_t(a). Then the optimal joint arm is also a time variable, i.e. a*(t) := argmax_a µ_t(a). The learning regret becomes R(T) := ∑_{t=1}^{T} µ_t(a*(t)) − E[∑_{t=1}^{T} r_t(a^π(t))]. Moreover, ∆max, ∆min, ∆min_FI and ∆min_PI also become time variables ∆max(t), ∆min(t), ∆min_FI(t), ∆min_PI(t). We assume that ∆max(t) is upper bounded by ∆max and that ∆min(t), ∆min_FI(t), ∆min_PI(t) are lower bounded by ∆min.

Definition 6: (Accuracy Drift) The accuracy drift of the reward function for any two slots t, t′ is defined to be |µ_t(a) − µ_{t′}(a)|.

The proposed algorithms can be modified for use in deployment scenarios exhibiting accuracy drift. Since the expected rewards are changing, using the realized rewards from the beginning of the system to estimate the expected rewards in the current slot would be very inaccurate. Therefore, agents should use only the most recent realized rewards to update the reward estimates. To do this, the counter γ(t) now maintains the number of exploration phases that have been experienced in the last Γ slots. The deterministic control function ζ(t) becomes a single control parameter which is independent of time, i.e. ζ(t) = A. Whether a new slot is an exploration slot or an exploitation slot is still determined by the values of ζ(t), γ(t) and E(t) in a way similar to the previous algorithms. We bound the time-average regret in the following proposition.

Proposition 4: Suppose the accuracy drift satisfies |µ_t(a) − µ_{t′}(a)| ≤ cΓ, ∀t′ : t ≥ t′ ≥ t − Γ, ∀t. If ζ(t) = A with A < Γ/L, then the time-average expected regret R(T)/T of the modified algorithms after any number T > Γ of slots is bounded by

[A L_1 + 2(Γ − A L_1) N L_1 e^{−(A/2)((∆min − 2cΓ)/D)^2}] ∆max / Γ,

[A L_2 + 2(Γ − A L_2) L_2 e^{−(A/2)((∆min_FI − 2cΓ)/D)^2}] ∆max / Γ,

[A L_3 + 2(Γ − A L_3) ∑_{m=1}^{M} N_m ∏_{n=1}^{N} K_n e^{−(A/2)((∆min_PI − 2cΓ)/D)^2}] ∆max / Γ   (12)

for DisCo, DisCo-FI and DisCo-PI, respectively.
Proof: See Appendix H in [31].

We note that the modified algorithms for accuracy drift do not achieve logarithmic regret in time, since they have to track the changes in the expected reward by continuously exploring the arms at a constant rate. However, by exploiting the informativeness of the reward functions, better regret results can still be obtained by adopting the DisCo-FI and DisCo-PI algorithms rather than the basic DisCo algorithm. Note that the bound given in Proposition 4 is not tight and that the right hand side of (12) is time independent. When the right hand side of (12) is greater than D, this bound does not give us any information about the performance of the algorithm, since the maximum one-step loss of any algorithm is bounded by D. We also see that the regret bound decays exponentially in the difference between ∆min and 2cΓ, which is intuitive since a larger ∆min − 2cΓ implies that it is easier to identify the best joint arm using only the observations from the last Γ time steps.
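As an illustration of the sliding-window estimation described above, the following hypothetical sketch (our own code and names, under the assumption that a simple time-indexed window of the last Γ exploration rewards is kept per group-joint arm) shows one way an agent could maintain the recent-reward estimates; the paper's actual bookkeeping via γ(t) and E(t) is analyzed in Appendix H of [31].

# Hypothetical sliding-window estimator for the accuracy-drift setting: reward estimates are
# formed only from exploration rewards observed during the last Gamma slots, so the estimates
# track a drifting mu_t(a) instead of averaging over the whole history.
from collections import deque

class WindowedEstimator:
    def __init__(self, num_arms, gamma_window):
        self.window = gamma_window                           # Gamma, the look-back horizon in slots
        self.samples = [deque() for _ in range(num_arms)]    # (slot, reward) pairs per arm index l

    def update(self, t, l, reward):
        """Record the reward observed for arm index l at slot t and drop outdated samples."""
        self.samples[l].append((t, reward))
        self._evict(t)

    def estimate(self, t, l):
        """Average of the rewards for arm index l observed within the last Gamma slots."""
        self._evict(t)
        values = [r for (_, r) in self.samples[l]]
        return sum(values) / len(values) if values else 0.0

    def _evict(self, t):
        for dq in self.samples:
            while dq and dq[0][0] <= t - self.window:
                dq.popleft()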

IX. ILLUSTRATIVE RESULTS

In this section, we illustrate the performance of the proposed learning algorithms via simulation results for the Big Data mining problem using multiple classifiers.

A. Big Data Mining using Multiple Classifiers

A plethora of online Big Data applications, such as video surveillance, traffic monitoring in a city, network security monitoring, social media analysis etc., require processing and analyzing streams of raw data to extract valuable information in real-time [1]. A key research challenge [26] in a real-time stream mining system is that the data may be gathered online by multiple distributed sources; it is subsequently locally processed and classified to extract knowledge and actionable intelligence, and then sent to a centralized entity which is in charge of making global decisions or predictions. The various local classifiers are not collocated and cannot communicate with each other due to the lack of a communication infrastructure (because of delays or other costs such as complexity [?][6]). Another stream mining problem may involve the processing of the same or multiple data streams, but require the use of classifier chains (rather than multiple distributed single classifiers as mentioned before). For instance, video event detection [2][19] requires finding events of interest or abnormalities, which could involve determining the concurrent occurrence (i.e. classification) of a set of basic objects and features (e.g. motion trajectories) by chaining together multiple classifiers which can jointly determine the presence of the event or phenomenon of interest. The classifiers are often implemented at various locations to ensure scalability, reliability and low complexity [3][?]. For all incoming data, each classifier needs to select an operating point from its own set, whose accuracy and cost (e.g. delay) are unknown and may depend on the incoming data characteristics, in order to classify its corresponding feature and maximize the event classification accuracy (i.e. the overall system reward). Hence, classifiers need to learn from past data instances and the event classification performance to construct the optimal chain of classifiers. This classifier chain learning problem can be directly mapped into the considered multi-agent decision making and learning problem: agents are the component classifiers, actions are the operating points, and the overall system reward is the event classification performance (i.e. accuracy minus cost).

B. Experiment Setup

Our proposed algorithm is tested using classifiers and videos provided by IBM's TRECVID 2007 project [25]. By extracting features such as color histogram, color correlogram, and co-occurrence texture, the classifiers are trained to detect high-level features, such as whether the video shot takes place


Fig. 4. Performance comparison for various algorithms: (a) Rule 1, (b) Rule 2. (Accuracy vs. time; curves: Random, SE, UCB1, DisCo, DisCo-FI, Optimal.)

outdoors or in an office building, or whether there is an animal or a car in the video. The classifiers are SVM-based and can therefore dynamically set detection thresholds for the output scores for each image without changing the underlying implementation. We chose this dataset due to the wide range of high-level features detected, which best models distributed classifiers trained across different sites. In the simulations, we use three classifiers (agents) to classify three features: (i) whether the image contained cars (CAR), (ii) whether the image contained mountains (MOU), and (iii) whether the image is sports related (SPO). By synthesizing the feature classification results, the event detection result is obtained under two different rules.

1) Rule 1: The event is correctly classified if all three features are correctly classified.

2) Rule 2: The event is correctly classified if Feature 1 (CAR) is correctly classified and either Feature 2 (MOU) or Feature 3 (SPO) is correctly classified.

In the simulations, each classifier can choose from 4 operating points which result in different accuracies. Let p_n denote the classification accuracy with respect to feature n. Assuming that the classification of the features is independent across classifiers, the event classification accuracy p_event depends on the feature classification accuracies as follows:

p_event = p_1 p_2 p_3 under Rule 1,
p_event = p_1 (1 − (1 − p_2)(1 − p_3)) under Rule 2.   (13)
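As a quick illustration with hypothetical accuracies (not taken from the experiments), if p_1 = p_2 = p_3 = 0.9, then (13) gives p_event = 0.9^3 = 0.729 under Rule 1 and p_event = 0.9(1 − 0.1·0.1) = 0.891 under Rule 2.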

Hence, the reward structure is fully informative for both event synthesis rules.

C. Performance Comparison

We implement the proposed algorithms and compare their performance against four benchmark schemes:

TABLE III
FALSE ALARM AND MISS DETECTION RATES.

                 DisCo    DisCo-FI   UCB1    SE      Random
False Alarm      0.039    0.029      0.050   0.065   0.069
Miss Detection   0.249    0.194      0.356   0.469   0.496

(1) Random: In each period, each classifier randomly selects one operating point.

(2) Safe Experimentation (SE): This is a method used in [5] when there is no uncertainty about the accuracy of the classifiers. In each period t, each classifier selects its baseline action with probability 1 − ϵ_t or selects a new random action with probability ϵ_t. When the realized reward is higher than the baseline reward, the classifiers update their baseline actions to the new action.

(3) UCB1: This is a classic multi-armed bandit algorithm proposed in [10]. As we showed in Proposition 1, there may be problems implementing this centralized algorithm in a distributed setting without message exchange. Nevertheless, for the sake of our simulations we assume that there are no individual errors in the observation of the global feedback when we implement UCB1, and hence it can be perfectly implemented in our distributed environment.

(4) Optimal: In this benchmark, the classifiers choose the optimal joint operating points (trained offline) in all periods.

Figure 4 shows the achieved event classification accuracy over time under both Rule 1 and Rule 2. All curves are obtained by averaging 50 simulation runs. We also note that agents may receive noisy versions of the outcome (except for UCB1). Under both rules, SE performs almost as poorly as the Random benchmark in terms of event detection accuracy. Due to the uncertainty in the detection results, updating the baseline action to a new action with a higher realized reward does not necessarily lead to selecting a better baseline action. Hence, SE is not able to learn the optimal operating points of the classifiers. UCB1 achieves a much higher accuracy than the Random and SE algorithms and is able to learn the optimal joint operating points over time. However, the learning speed is slow because the joint arm space is large, i.e. 4^3 = 64. The proposed DisCo algorithm can also learn the optimal joint action. However, since the joint arm space is large, the classifiers have to stay in the exploration phases for a relatively long time in the initial periods to gain sufficiently high confidence in the reward estimates, while the exploitation phases are rare and short. Thus, the classification accuracy is low initially. After the initial exploration phases, the classifiers begin to exploit and hence the average accuracy increases rapidly. Since the reward structure satisfies the Fully Informative condition, DisCo-FI rapidly learns the optimal joint action and performs the best among all schemes. Table III shows the false alarm and miss detection rates under Rule 1, obtained by treating one event as the null hypothesis and the remaining events as the alternative hypothesis.

D. Informativeness

Next, we compare the learning performance of the three proposed algorithms. For the DisCo-PI algorithm, we consider two group partitions, {1, 2},{3} and {1},{2, 3}. Figure 5


Fig. 5. Performance comparison for DisCo, DisCo-FI and DisCo-PI: (a) Rule 1, (b) Rule 2. (Accuracy vs. time; curves: DisCo, DisCo-PI {1,2},{3}, DisCo-PI {1},{2,3}, DisCo-FI, Optimal.)

shows the learning performance over time for DisCo, DisCo-FI and DisCo-PI under Rule 1 and Rule 2. In both cases, DisCo-FI achieves the smallest learning regret and hence the fastest learning speed, while the basic DisCo algorithm performs the worst. This is because DisCo-FI fully exploits the problem structure. The performance of the DisCo-PI algorithm lies in between that of DisCo-FI and the basic DisCo algorithm. However, different group partitions have different impacts on the performance. Under Rule 1, the two group partitions perform similarly since the impacts of the three classifiers on the final classification result are symmetric. Under Rule 2, the impacts of classifier 2 and classifier 3 are coupled in a more complex way. Since the group partition {1},{2, 3} captures this coupling effect better, it performs better than the group partition {1, 2},{3}. We note that even though DisCo-FI performs the best in this simulation, in other scenarios where the reward function is only partially informative or even not informative, DisCo-PI and DisCo may perform better.

E. Impacts of Reward Function on Learning Speed

For both synthesis rules, the reward functions are fully informative, and so classifiers can learn their own optimal operating points using only the relative rewards. However, the same classifier will learn its optimal operating point at different speeds under different rules due to the differences in that classifier's impact on the global reward. Note that under the first rule, all the classifiers are processing different tasks of equal importance, whereas under the second rule classifiers 2 and 3 are less critical than classifier 1. Thus the learning speed of classifier 2 will be slower under the second rule because its impact is lower. This learning speed depends on the overall reward difference between the classifier's best operating point

Fig. 6. Classifier 2 learns its optimal operating point at different speeds under different rules. (Percentage of choosing the optimal operating point vs. time; curves: Scenario 1, Scenario 2.)

Fig. 7. Learning performance in scenarios with missing feedback: (a) Rule 1, (b) Rule 2. (Accuracy vs. time; curves: no missing feedback, missing prob. 0.1, missing prob. 0.3.)

and its second-best operating point, i.e. ∆min_n. For classifier 2: under Rule 1, ∆min_2 = (min p_1)(min p_3) ∆p_2, and under Rule 2, ∆min_2 = (min p_1)(min(1 − p_3)) ∆p_2, where ∆p_2 is the accuracy difference between classifier 2's best and second-best operating points. Since p_n is usually much larger than 0.5, ∆min_2 under Rule 2 is much smaller than under Rule 1 and hence, classifier 2 learns its optimal operating point at a much slower speed under Rule 2 than under Rule 1. Figure 6 illustrates the percentage of time classifier 2 chooses the optimal operating point under the different rules.
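For a concrete (hypothetical) numerical illustration, suppose p_1 = p_3 = 0.9 for all operating points and ∆p_2 = 0.1: then ∆min_2 = 0.9 · 0.9 · 0.1 = 0.081 under Rule 1 but only 0.9 · 0.1 · 0.1 = 0.009 under Rule 2, roughly an order of magnitude smaller, which is consistent with the slower learning of classifier 2 observed in Figure 6.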

F. Missing and Delayed Feedback

In this set of simulations, we study the impact of missing and delayed global feedback on the learning performance of the proposed algorithm. In Figure 7, we show the accumulating accuracy of the modified DisCo-FI algorithm for three missing feedback scenarios: no missing feedback, a missing probability of 0.1, and a missing probability of 0.3. A larger missing probability induces


a lower classification accuracy at any given time. Nevertheless, the proposed algorithm is not very sensitive to missing feedback: even if the missing probability is relatively large, the degradation in learning performance is small.

In Figure 8, we show the accumulating accuracy of the modified DisCo-FI algorithm for three delayed feedback scenarios: no delay, a maximal delay of 50 slots, and a maximal delay of 100 slots. Under both synthesis rules, learning is fastest without feedback delays, and the larger the delay, the slower the learning speed. However, even with delays, the proposed DisCo-FI algorithm is still able to achieve logarithmic regret.

G. Accuracy Drift

Finally, we study the impact of accuracy drift on the learning performance of the proposed algorithm. In this set of simulations, the accuracies of the operating points of the classifiers change every fixed number of periods (this is realized by swapping the indices of the operating points for each classifier). Figure 9 shows the accumulating average accuracy over time for DisCo-FI under Rule 1 and Rule 2, with the period length being 3000 slots, 1000 slots and 500 slots. The longer the period length, the smaller the accuracy drift; if the period length were infinite, the expected accuracy would be static and there would be no accuracy drift. Several observations are worth noting. First, the learning performance is better when the accuracy drift is smaller. However, in all cases, the learning accuracy is almost constant and does not approach the optimal accuracy over time. This is because, in order to track the changes in the accuracy, the algorithm uses only the recent reward feedback to estimate the accuracies of the operating points; hence, the estimation error does not diminish as time increases. Second, the degree of impact of the accuracy drift depends on the reward function: in this simulation, the reward function induced by Rule 2 is less vulnerable to accuracy drift. Third, we also note that the learning accuracy in the first period is relatively higher than that in later periods. This is because the initial reward estimates are less corrupted by reward feedback received under different accuracies and hence are more accurate.

X. CONCLUSIONS

In this paper, we studied a general multi-agent decision making problem in which decentralized agents learn their best actions to maximize the system reward using only noisy observations of the overall reward. The challenging part is that individualized feedback is missing, communication among agents is impossible, and the global feedback is subject to individual observation errors. We proposed a class of distributed cooperative learning algorithms that addresses all these problems. These algorithms were proved to achieve logarithmic regret in time. We also proved that, by exploiting the informativeness of the reward function, much better regret results can be achieved by our algorithms compared with existing solutions. Through simulations, we applied the proposed learning algorithms to Big Data stream mining problems and showed significant performance improvements. Importantly, our theoretical framework can also be applied to learning


Fig. 8. Learning performance in scenarios with delayed feedback: (a) Rule 1, (b) Rule 2. (Accuracy vs. time; curves: no delay, MaxDelay = 50, MaxDelay = 100.)

Fig. 9. Learning performance in scenarios with accuracy drift: (a) Rule 1, (b) Rule 2. (Accuracy vs. time; curves: Period = 3000, Period = 1000, Period = 500.)

in other types of multi-agent systems where communication between agents is not possible and agents observe only noisy global feedback.


REFERENCES

[1] M. Shah, J. Hellerstein, M. Franklin, "Flux: An adaptive partitioning operator for continuous query systems," International Conference on Data Engineering (ICDE), 2003.

[2] Y. Jiang, S. Bhattacharya, S. Chang, and M. Shah, "High-level event recognition in unconstrained videos," Int. J. Multimed. Info. Retr., Nov. 2013.

[3] F. Fu, D. Turaga, O. Verscheure, and M. van der Schaar, "Configuring competing classifier chains in distributed stream mining systems," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, Dec. 2007.

[4] J. Xu, C. Tekin and M. van der Schaar, "Learning optimal classifier chains for real-time big data mining," 51st Annual Allerton Conference on Communication, Control and Computing, 2013.

[5] B. Foo and M. van der Schaar, "A rules-based approach for configuring chains of classifiers in real-time stream mining systems," EURASIP Journal on Advances in Signal Processing, vol. 2009, 2009.

[6] R. Ducasse, D. Turaga, and M. van der Schaar, "Adaptive topologic optimization for large-scale stream mining," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 3, June 2010.

[7] J. C. Gittins, "Bandit processes and dynamic allocation indices," Journal of the Royal Statistical Society, Series B (Methodological), pp. 148-177, 1979.

[8] P. Whittle, "Multi-armed bandits and the Gittins index," Journal of the Royal Statistical Society, Series B (Methodological), 1980.

[9] T. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Adv. Appl. Math., vol. 6, no. 1, 1985.

[10] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Mach. Learn., vol. 47, no. 2-3, 2002.

[11] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: I.I.D. rewards," IEEE Trans. Autom. Control, vol. AC-32, no. 11, 1987.

[12] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, "Distributed algorithms for learning and cognitive medium access with logarithmic regret," IEEE J. Sel. Areas Commun., vol. 29, no. 4, Apr. 2011.

[13] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations," IEEE/ACM Trans. Networking, vol. 20, no. 5, 2012.

[14] P. Rusmevichientong and J. N. Tsitsiklis, "Linearly parameterized bandits," Math. Oper. Res., vol. 35, no. 2, 2010.

[15] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," J. Mach. Learn. Res., vol. 3, pp. 397-422, 2002.

[16] V. Dani, T. P. Hayes, and S. Kakade, "Stochastic linear optimization under bandit feedback," in Proc. 21st Annu. COLT, 2008.

[17] C. Tekin and M. Liu, "Online learning of rested and restless bandits," IEEE Trans. on Info. Theory, vol. 58, no. 5, pp. 5588-5611, 2012.

[18] H. Liu, K. Liu and Q. Zhao, "Learning in a changing world: restless multi-armed bandit with unknown dynamics," IEEE Trans. on Info. Theory, vol. 59, no. 3, pp. 1902-1916, 2012.

[19] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, "Enhancing video concept detection with the use of tomographs," in Proc. IEEE International Conference on Image Processing (ICIP), 2013.

[20] C. Tekin and M. Liu, "Performance and convergence of multi-user online learning and its application in dynamic spectrum sharing," in Mechanisms and Games for Dynamic Spectrum Allocation, Cambridge University Press, available from January 2014.

[21] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Trans. Signal Process., vol. 58, no. 11, Nov. 2010.

[22] W. Chen, Y. Wang and Y. Yuan, "Combinatorial multi-armed bandit: general framework and applications," in Proc. 30th Intl. Conf. Machine Learning (ICML-13), pp. 151-159, 2013.

[23] N. Cesa-Bianchi, G. Lugosi, "Combinatorial bandits," J. Comp. Sys. Sci., vol. 78, no. 5, pp. 1404-1422, 2012.

[24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, The MIT Press, 2001.

[25] M. Campbell, A. Haubold, M. Liu, et al., “IBM Research TRECVID-2007 Video Retrieval System”.

[26] R. Ducasse and M. van der Schaar, "Finding It Now: Construction and Configuration of Networked Classifiers in Real-Time Stream Mining Systems," Handbook of Signal Processing Systems, Springer New York, Ed. S. S. Bhattacharyya, F. Deprettere, R. Leupers and J. Takala, 2013.

[27] N. Littlestone and M. K. Warmuth, "The weighted majority algorithm," Inf. Comput., vol. 108, no. 2, pp. 212-261, Feb. 1994.

[28] B. Szorenyi et al., "Gossip-based distributed stochastic bandit algorithms," in Proc. 30th International Conference on Machine Learning, 2013.

[29] J. Xu, M. van der Schaar, J. Liu and H. Li, "Forecasting popularity of videos using social media," IEEE J. Sel. Topics Signal Process. (forthcoming), Nov. 2014.

[30] J. Xu, D. Deng, U. Demiryurek, C. Shahabi and M. van der Schaar,“Mining the situation: spatiotemporal traffic prediction with big data,”IEEE J. Sel. Topics Signal Process. (forthcoming), 2015.

[31] Online appendix http://medianetlab.ee.ucla.edu/papers/XuTSP15App.pdf.