
Distributed Batch Gaussian Process Optimization

Erik A. Daxberger 1 Bryan Kian Hsiang Low 2

Abstract

This paper presents a novel distributed batch Gaussian process upper confidence bound (DB-GP-UCB) algorithm for performing batch Bayesian optimization (BO) of highly complex, costly-to-evaluate black-box objective functions. In contrast to existing batch BO algorithms, DB-GP-UCB can jointly optimize a batch of inputs (as opposed to selecting the inputs of a batch one at a time) while still preserving scalability in the batch size. To realize this, we generalize GP-UCB to a new batch variant amenable to a Markov approximation, which can then be naturally formulated as a multi-agent distributed constraint optimization problem in order to fully exploit the efficiency of its state-of-the-art solvers for achieving linear time in the batch size. Our DB-GP-UCB algorithm offers practitioners the flexibility to trade off between the approximation quality and time efficiency by varying the Markov order. We provide a theoretical guarantee for the convergence rate of DB-GP-UCB via bounds on its cumulative regret. Empirical evaluation on synthetic benchmark objective functions and a real-world optimization problem shows that DB-GP-UCB outperforms the state-of-the-art batch BO algorithms.

1 Ludwig-Maximilians-Universität, Munich, Germany. A substantial part of this research was performed during his student exchange program at the National University of Singapore under the supervision of Bryan Kian Hsiang Low and culminated in his Bachelor's thesis. 2 Department of Computer Science, National University of Singapore, Republic of Singapore. Correspondence to: Bryan Kian Hsiang Low <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Introduction

Bayesian optimization (BO) has recently gained considerable traction due to its capability of finding the global maximum of a highly complex (e.g., non-convex, no closed-form expression nor derivative), noisy black-box objective function with a limited budget of (often costly) function evaluations, consequently witnessing its use in an increasing diversity of application domains such as robotics, environmental sensing/monitoring, automatic machine learning, among others (Brochu et al., 2010; Shahriari et al., 2016). A number of acquisition functions (e.g., probability of improvement or expected improvement (EI) over the currently found maximum (Brochu et al., 2010), entropy-based (Villemonteix et al., 2009; Hennig & Schuler, 2012; Hernández-Lobato et al., 2014), and upper confidence bound (UCB) (Srinivas et al., 2010)) have been devised to perform BO: They repeatedly select an input for evaluating/querying the black-box function (i.e., until the budget is depleted) that intuitively trades off between sampling where the maximum is likely to be given the current, possibly imprecise belief of the function modeled by a Gaussian process (GP) (i.e., exploitation) vs. improving the GP belief of the function over the entire input domain (i.e., exploration) to guarantee finding the global maximum.

The rapidly growing affordability and availability of hardware resources (e.g., computer clusters, sensor networks, robot teams/swarms) have motivated the recent development of BO algorithms that can repeatedly select a batch of inputs for querying the black-box function in parallel instead. Such batch/parallel BO algorithms can be classified into two types: On one extreme, batch BO algorithms like multi-points EI (q-EI) (Chevalier & Ginsbourger, 2013), parallel predictive entropy search (PPES) (Shah & Ghahramani, 2015), and the parallel knowledge gradient method (q-KG) (Wu & Frazier, 2016) jointly optimize the batch of inputs and hence scale poorly in the batch size. On the other extreme, greedy batch BO algorithms (Azimi et al., 2010; Contal et al., 2013; Desautels et al., 2014; González et al., 2016) boost the scalability by selecting the inputs of the batch one at a time. We argue that such a highly suboptimal approach to gain scalability is an overkill: In practice, each function evaluation is often much more computationally and/or economically costly (e.g., hyperparameter tuning for deep learning, drug testing on human subjects), which justifies dedicating more time to obtain better BO performance. In this paper, we show that it is in fact possible to jointly optimize the batch of inputs and still preserve scalability in the batch size by giving practitioners the flexibility to trade off BO performance for time efficiency.

To achieve this, we first observe that, interestingly, batch BO can be perceived as a cooperative multi-agent decision making problem whereby each agent optimizes a separate input of the batch while coordinating with the other agents doing likewise. To the best of our knowledge, this has not been considered in the BO literature. In particular, if batch BO can be framed as some known class of multi-agent decision making problems, then it can be solved efficiently and scalably by the latter's state-of-the-art solvers. The key technical challenge would therefore be to investigate how batch BO can be cast as one of such to exploit its advantage of scalability in the number of agents (hence, batch size) while at the same time theoretically guaranteeing the resulting BO performance.

To tackle the above challenge, this paper presents a novel distributed batch BO algorithm (Section 3) that, in contrast to greedy batch BO algorithms (Azimi et al., 2010; Contal et al., 2013; Desautels et al., 2014; González et al., 2016), can jointly optimize a batch of inputs and, unlike the batch BO algorithms (Chevalier & Ginsbourger, 2013; Shah & Ghahramani, 2015; Wu & Frazier, 2016), still preserve scalability in the batch size. To realize this, we generalize GP-UCB (Srinivas et al., 2010) to a new batch variant amenable to a Markov approximation, which can then be naturally formulated as a multi-agent distributed constraint optimization problem (DCOP) in order to fully exploit the efficiency of its state-of-the-art solvers for achieving linear time in the batch size. Our proposed distributed batch GP-UCB (DB-GP-UCB) algorithm offers practitioners the flexibility to trade off between the approximation quality and time efficiency by varying the Markov order. We provide a theoretical guarantee for the convergence rate of our DB-GP-UCB algorithm via bounds on its cumulative regret. We empirically evaluate the cumulative regret incurred by our DB-GP-UCB algorithm and its scalability in the batch size on synthetic benchmark objective functions and a real-world optimization problem (Section 4).

2. Problem Statement, Background, and Notations

Consider the problem of sequentially optimizing an unknown objective function $f : D \to \mathbb{R}$ where $D \subset \mathbb{R}^d$ denotes a domain of $d$-dimensional input feature vectors. We consider the domain to be discrete as it is known how to generalize results to a continuous, compact domain via suitable discretizations (Srinivas et al., 2010). In each iteration $t = 1, \ldots, T$, a batch $D_t \subset D$ of inputs is selected for evaluating/querying $f$ to yield a corresponding column vector $y_{D_t} \triangleq (y_x)^{\top}_{x \in D_t}$ of noisy observed outputs $y_x \triangleq f(x) + \varepsilon$ with i.i.d. Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$ and noise variance $\sigma_n^2$.

Regret. Supposing our goal is to get close to the global maximum $f(x^*)$ as rapidly as possible where $x^* \triangleq \arg\max_{x \in D} f(x)$, this can be achieved by minimizing a standard batch BO objective such as the batch or full cumulative regret (Contal et al., 2013; Desautels et al., 2014): The notion of regret intuitively refers to a loss in reward from not knowing $x^*$ beforehand. Formally, the instantaneous regret incurred by selecting a single input $x$ to evaluate its corresponding $f$ is defined as $r_x \triangleq f(x^*) - f(x)$. Assuming a fixed cost of evaluating $f$ for every possible batch $D_t$ of the same size, the batch and full cumulative regrets are, respectively, defined as sums (over iterations $t = 1, \ldots, T$) of the smallest instantaneous regret incurred by any input within every batch $D_t$, i.e., $R_T \triangleq \sum_{t=1}^{T} \min_{x \in D_t} r_x$, and of the instantaneous regrets incurred by all inputs of every batch $D_t$, i.e., $R'_T \triangleq \sum_{t=1}^{T} \sum_{x \in D_t} r_x$. The convergence rate of a batch BO algorithm can then be assessed based on some upper bound on the average regret $R_T/T$ or $R'_T/T$ (Section 3) since the currently found maximum after $T$ iterations is no further away from $f(x^*)$ than $R_T/T$ or $R'_T/T$. It is desirable for a batch BO algorithm to asymptotically achieve no regret, i.e., $\lim_{T\to\infty} R_T/T = 0$ or $\lim_{T\to\infty} R'_T/T = 0$, implying that it will eventually converge to the global maximum.
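To make these two regret notions concrete, the short sketch below computes both for a given sequence of selected batches; the function and argument names are illustrative assumptions, not part of the paper.

```python
def regrets(f, x_star, batches):
    """Sketch: batch cumulative regret R_T and full cumulative regret R'_T for a
    sequence of selected batches, following the definitions above. `f` maps an
    input to its (noiseless) objective value; `x_star` is the global maximizer."""
    f_star = f(x_star)
    inst = [[f_star - f(x) for x in D_t] for D_t in batches]   # r_x for each input
    R_T = sum(min(r_t) for r_t in inst)         # sum of smallest regret per batch
    R_T_full = sum(sum(r_t) for r_t in inst)    # sum of all regrets per batch
    return R_T, R_T_full
```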

Gaussian Processes (GPs). To guarantee no regret (Section 3), the unknown objective function $f$ is modeled as a sample of a GP. Let $\{f(x)\}_{x \in D}$ denote a GP, that is, every finite subset of $\{f(x)\}_{x \in D}$ follows a multivariate Gaussian distribution (Rasmussen & Williams, 2006). Then, the GP is fully specified by its prior mean $m_x \triangleq \mathbb{E}[f(x)]$ and covariance $k_{xx'} \triangleq \mathrm{cov}[f(x), f(x')]$ for all $x, x' \in D$, which, for notational simplicity (and w.l.o.g.), are assumed to be zero, i.e., $m_x = 0$, and bounded, i.e., $k_{xx'} \le 1$, respectively. Given a column vector $y_{D_{1:t-1}} \triangleq (y_x)^{\top}_{x \in D_{1:t-1}}$ of noisy observed outputs for some set $D_{1:t-1} \triangleq D_1 \cup \ldots \cup D_{t-1}$ of inputs after $t - 1$ iterations, a GP model can perform probabilistic regression by providing a predictive distribution $p(f_{D_t}|y_{D_{1:t-1}}) = \mathcal{N}(\mu_{D_t}, \Sigma_{D_tD_t})$ of the latent outputs $f_{D_t} \triangleq (f(x))^{\top}_{x \in D_t}$ for any set $D_t \subseteq D$ of inputs selected in iteration $t$ with the following posterior mean vector and covariance matrix:
$$\mu_{D_t} \triangleq K_{D_tD_{1:t-1}}(K_{D_{1:t-1}D_{1:t-1}} + \sigma_n^2 I)^{-1} y_{D_{1:t-1}}\,,\qquad
\Sigma_{D_tD_t} \triangleq K_{D_tD_t} - K_{D_tD_{1:t-1}}(K_{D_{1:t-1}D_{1:t-1}} + \sigma_n^2 I)^{-1} K_{D_{1:t-1}D_t} \qquad (1)$$
where $K_{BB'} \triangleq (k_{xx'})_{x \in B, x' \in B'}$ for all $B, B' \subset D$.
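As a minimal illustration of (1), the following sketch computes the posterior mean vector and covariance matrix of a candidate batch from a precomputed prior covariance matrix over the discrete domain; the function name and arguments are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gp_posterior(K, y_obs, obs_idx, query_idx, noise_var):
    """Sketch of the GP predictive distribution in (1): `K` is the prior covariance
    matrix over the whole discrete domain D, `obs_idx` indexes D_{1:t-1} (with noisy
    outputs `y_obs`), and `query_idx` indexes the candidate batch D_t."""
    K_oo = K[np.ix_(obs_idx, obs_idx)]
    K_qo = K[np.ix_(query_idx, obs_idx)]
    K_qq = K[np.ix_(query_idx, query_idx)]
    S = K_oo + noise_var * np.eye(len(obs_idx))       # K_{D_{1:t-1}D_{1:t-1}} + sigma_n^2 I
    mu = K_qo @ np.linalg.solve(S, y_obs)             # posterior mean vector mu_{D_t}
    Sigma = K_qq - K_qo @ np.linalg.solve(S, K_qo.T)  # posterior covariance Sigma_{D_tD_t}
    return mu, Sigma
```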

GP-UCB and its Greedy Batch Variants. Inspired by the UCB algorithm for the multi-armed bandit problem, the GP-UCB algorithm (Srinivas et al., 2010) selects, in each iteration, an input $x \in D$ for evaluating/querying $f$ that trades off between sampling close to an expected maximum (i.e., with large posterior mean $\mu_{\{x\}}$) given the current GP belief of $f$ (i.e., exploitation) vs. that of high predictive uncertainty (i.e., with large posterior variance $\Sigma_{\{x\}\{x\}}$) to improve the GP belief of $f$ over $D$ (i.e., exploration), that is, $\max_{x \in D}\ \mu_{\{x\}} + \beta_t^{1/2}\Sigma_{\{x\}\{x\}}^{1/2}$ where the parameter $\beta_t > 0$ is set to trade off between exploitation vs. exploration for bounding its cumulative regret.

Existing generalizations of GP-UCB such as GP batch UCB (GP-BUCB) (Desautels et al., 2014) and GP-UCB with pure exploration (GP-UCB-PE) (Contal et al., 2013) are greedy batch BO algorithms that select the inputs of the batch one at a time (Section 1). Specifically, to avoid selecting the same input multiple times within a batch (hence reducing to GP-UCB), they update the posterior variance (but not the posterior mean) after adding each input to the batch, which can be performed prior to evaluating its corresponding $f$ since the posterior variance is independent of the observed outputs (1). They differ in that GP-BUCB greedily adds each input to the batch using GP-UCB (without updating the posterior mean) while GP-UCB-PE selects the first input using GP-UCB and each remaining input of the batch by maximizing only the posterior variance (i.e., pure exploration). Similarly, a recently proposed UCB-DPP-SAMPLE algorithm (Kathuria et al., 2016) selects the first input using GP-UCB and the remaining inputs by sampling from a determinantal point process (DPP). Like GP-BUCB, GP-UCB-PE, and UCB-DPP-SAMPLE, we can theoretically guarantee the convergence rate of our DB-GP-UCB algorithm, which, from a theoretical point of view, signifies an advantage of GP-UCB-based batch BO algorithms over those (e.g., q-EI and PPES) inspired by other acquisition functions such as EI and PES. Unlike these greedy batch BO algorithms (Contal et al., 2013; Desautels et al., 2014), our DB-GP-UCB algorithm can jointly optimize the batch of inputs while still preserving scalability in batch size by casting it as a DCOP, to be described next.
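For contrast with the joint optimization pursued in this paper, here is a hedged sketch of the greedy, variance-only batch construction described above; it is only a GP-BUCB-style illustration under assumed inputs (a posterior mean vector and covariance matrix over a discrete candidate set), not a reference implementation of any of these algorithms.

```python
import numpy as np

def greedy_ucb_batch(mu, Sigma, beta, batch_size, noise_var):
    """Sketch of a GP-BUCB-style greedy batch rule: pick inputs one at a time by
    maximizing a UCB score, updating only the posterior variance after each pick
    (the mean cannot be updated because the new outputs are not yet observed)."""
    Sigma = Sigma.copy()
    batch = []
    for _ in range(batch_size):
        ucb = mu + np.sqrt(beta) * np.sqrt(np.diag(Sigma))
        i = int(np.argmax(ucb))
        batch.append(i)
        # condition the covariance on a (yet-to-be-observed) noisy output at input i,
        # which shrinks its variance and discourages re-selecting it
        s = Sigma[:, i].copy()
        Sigma -= np.outer(s, s) / (Sigma[i, i] + noise_var)
    return batch
```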

Distributed Constraint Optimization Problem (DCOP). A DCOP can be defined as a tuple $(X, V, A, h, W)$ that comprises a set $X$ of input random vectors, a set $V$ of $|X|$ corresponding finite domains (i.e., a separate domain for each random vector), a set $A$ of agents, a function $h : X \to A$ assigning each input random vector to an agent responsible for optimizing it, and a set $W \triangleq \{w_n\}_{n=1,\ldots,N}$ of non-negative payoff functions such that each function $w_n$ defines a constraint over only a subset $X_n \subseteq X$ of input random vectors and represents the joint payoff that the corresponding agents $A_n \triangleq \{h(x)\,|\,x \in X_n\} \subseteq A$ achieve. Solving a DCOP involves finding the input values of $X$ that maximize the sum of all functions $w_1, \ldots, w_N$ (i.e., social welfare maximization), that is, $\max_X \sum_{n=1}^{N} w_n(X_n)$. To achieve a truly decentralized solution, each agent can only optimize its local input random vector(s) based on the assignment function $h$ but communicate with its neighboring agents: Two agents are considered neighbors if there is a function/constraint involving input random vectors that the agents have been assigned to optimize. Complete and approximation algorithms exist for solving a DCOP; see (Chapman et al., 2011; Leite et al., 2014) for reviews of such algorithms.
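The sketch below spells out this social-welfare-maximization objective for a toy DCOP by brute force; all names are illustrative, and real DCOP solvers such as max-sum exploit the constraint structure instead of enumerating joint assignments as done here.

```python
from itertools import product

def solve_dcop_bruteforce(domains, payoffs):
    """Sketch: exhaustive social-welfare maximization for a tiny DCOP.
    `domains` maps each variable name to its finite domain; `payoffs` is a list of
    (scope, fn) pairs where `scope` is a tuple of variable names and `fn` returns
    the non-negative payoff w_n for an assignment to that scope."""
    names = list(domains)
    best_val, best_assign = float("-inf"), None
    for values in product(*(domains[v] for v in names)):
        assign = dict(zip(names, values))
        val = sum(fn(*(assign[v] for v in scope)) for scope, fn in payoffs)
        if val > best_val:
            best_val, best_assign = val, assign
    return best_assign, best_val
```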

3. Distributed Batch GP-UCB (DB-GP-UCB)

A straightforward generalization of GP-UCB (Srinivas et al., 2010) to jointly optimize a batch of inputs is to simply consider summing the GP-UCB acquisition function over all inputs of the batch. This, however, results in selecting the same input $|D_t|$ times within a batch, hence reducing to GP-UCB, as explained earlier in Section 2. To resolve this issue but not suffer from the suboptimal behavior of greedy batch BO algorithms such as GP-BUCB (Desautels et al., 2014) and GP-UCB-PE (Contal et al., 2013), we propose a batch variant of GP-UCB that jointly optimizes a batch of inputs in each iteration $t = 1, \ldots, T$ according to
$$\max_{D_t \subset D}\ \mathbf{1}^{\top}\mu_{D_t} + \alpha_t^{1/2}\, I[f_D; y_{D_t}|y_{D_{1:t-1}}]^{1/2} \qquad (2)$$
where the parameter $\alpha_t > 0$, which performs a similar role to that of $\beta_t$ in GP-UCB, is set to trade off between exploitation vs. exploration for bounding its cumulative regret (Theorem 1) and the conditional mutual information¹ $I[f_D; y_{D_t}|y_{D_{1:t-1}}]$ can be interpreted as the information gain on $f$ over $D$ (i.e., equivalent to $f_D \triangleq (f(x))^{\top}_{x \in D}$) by selecting the batch $D_t$ of inputs for evaluating/querying $f$ given the noisy observed outputs $y_{D_{1:t-1}}$ from the previous $t - 1$ iterations. So, in each iteration $t$, our proposed batch GP-UCB algorithm (2) selects a batch $D_t \subset D$ of inputs for evaluating/querying $f$ that trades off between sampling close to expected maxima (i.e., with a large sum of posterior means $\mathbf{1}^{\top}\mu_{D_t} = \sum_{x \in D_t}\mu_{\{x\}}$) given the current GP belief of $f$ (i.e., exploitation) vs. that yielding a large information gain $I[f_D; y_{D_t}|y_{D_{1:t-1}}]$ on $f$ over $D$ to improve its GP belief (i.e., exploration). It can be derived that $I[f_D; y_{D_t}|y_{D_{1:t-1}}] = 0.5\log|I + \sigma_n^{-2}\Sigma_{D_tD_t}|$ (Appendix A), which implies that the exploration term in (2) can be maximized by spreading the batch $D_t$ of inputs far apart to achieve large posterior variance individually and small magnitude of posterior covariance between them to encourage diversity.
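A direct (if exponential-time) reading of (2) is sketched below: it scores one candidate batch from its posterior mean vector and covariance matrix, with the information gain computed via the log-determinant identity derived in Appendix A. The function name and arguments are assumptions for illustration.

```python
import numpy as np

def batch_gp_ucb_score(mu, Sigma, alpha, noise_var):
    """Sketch of the joint batch acquisition value in (2) for a single candidate
    batch D_t, given its posterior mean `mu` and covariance `Sigma` from (1)."""
    # information gain I[f_D; y_{D_t} | y_{D_{1:t-1}}] = 0.5 log |I + sigma_n^{-2} Sigma|
    info_gain = 0.5 * np.linalg.slogdet(np.eye(len(mu)) + Sigma / noise_var)[1]
    return float(np.sum(mu) + np.sqrt(alpha * info_gain))
```

Maximizing this score over all batches $D_t \subset D$ requires evaluating exponentially many candidates, which is precisely the scalability issue addressed next.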

Unfortunately, our proposed batch variant of GP-UCB (2) involves evaluating prohibitively many batches of inputs (i.e., exponential in the batch size), hence scaling poorly in the batch size. However, we will show in this section that our batch variant of GP-UCB is, interestingly, amenable to a Markov approximation, which can then be naturally formulated as a multi-agent DCOP in order to fully exploit the efficiency of its state-of-the-art solvers for achieving linear time in the batch size.

¹ In contrast to the BO algorithm of Contal et al. (2014) that also uses mutual information, our work here considers batch BO by exploiting the correlation information between inputs of a batch in our acquisition function in (2) to encourage diversity.

Markov Approximation. The key idea is to design the structure of a matrix $\widehat{\Lambda}_{D_tD_t}$ whose log-determinant can closely approximate that of $\Lambda_{D_tD_t} \triangleq I + \sigma_n^{-2}\Sigma_{D_tD_t}$ residing in the $I[f_D; y_{D_t}|y_{D_{1:t-1}}]$ term in (2) and at the same time be decomposed into a sum of log-determinant terms, each of which is defined by submatrices of $\widehat{\Lambda}_{D_tD_t}$ that all depend on only a subset of the batch. Such a decomposition enables our resulting approximation of (2) to be formulated as a DCOP (Section 2).

At first glance, our proposed idea may be naively implemented by constructing a sparse block-diagonal matrix $\widehat{\Lambda}_{D_tD_t}$ using, say, the $N > 1$ diagonal blocks of $\Lambda_{D_tD_t}$. Then, $\log|\widehat{\Lambda}_{D_tD_t}|$ can be decomposed into a sum of log-determinants of its diagonal blocks², each of which depends on only a disjoint subset of the batch. This, however, entails an issue similar to that discussed at the beginning of this section of selecting the same $|D_t|/N$ inputs $N$ times within a batch due to the assumption of independence of outputs between different diagonal blocks of $\widehat{\Lambda}_{D_tD_t}$. To address this issue, we significantly relax this assumption and show that it is in fact possible to construct a more refined, dense matrix approximation $\widehat{\Lambda}_{D_tD_t}$ by exploiting a Markov assumption, which consequently correlates the outputs between all its constituent blocks and is, perhaps surprisingly, still amenable to the decomposition to achieve scalability in the batch size.

Specifically, evenly partition the batch $D_t$ of inputs into $N \in \{1, \ldots, |D_t|\}$ disjoint subsets $D_{t1}, \ldots, D_{tN}$ and $\Lambda_{D_tD_t}$ ($\widehat{\Lambda}_{D_tD_t}$) into $N \times N$ square blocks, i.e., $\Lambda_{D_tD_t} \triangleq [\Lambda_{D_{tn}D_{tn'}}]_{n,n'=1,\ldots,N}$ ($\widehat{\Lambda}_{D_tD_t} \triangleq [\widehat{\Lambda}_{D_{tn}D_{tn'}}]_{n,n'=1,\ldots,N}$). Our first result below derives a decomposition of the log-determinant of any symmetric positive definite block matrix $\widehat{\Lambda}_{D_tD_t}$ into a sum of log-determinant terms, each of which is defined by a separate diagonal block of the Cholesky factor of $\widehat{\Lambda}^{-1}_{D_tD_t}$:

Proposition 1. Consider the Cholesky factorization of a symmetric positive definite $\widehat{\Lambda}^{-1}_{D_tD_t} \triangleq U^{\top}U$ where the Cholesky factor $U \triangleq [U_{nn'}]_{n,n'=1,\ldots,N}$ (i.e., partitioned into $N \times N$ square blocks) is an upper triangular block matrix (i.e., $U_{nn'} = \mathbf{0}$ for $n > n'$). Then, $\log|\widehat{\Lambda}_{D_tD_t}| = \sum_{n=1}^{N}\log|(U_{nn}^{\top}U_{nn})^{-1}|$.

Its proof (Appendix B) utilizes properties of the determinant and that the determinant of an upper triangular block matrix is a product of determinants of its diagonal blocks (i.e., $|U| = \prod_{n=1}^{N}|U_{nn}|$). Proposition 1 reveals a subtle possibility of imposing some structure on the inverse of $\widehat{\Lambda}_{D_tD_t}$ such that each diagonal block $U_{nn}$ of its Cholesky factor (and hence each $\log|(U_{nn}^{\top}U_{nn})^{-1}|$ term) will depend on only a subset of the batch. The following result presents one such possibility:

² The determinant of a block-diagonal matrix is a product of determinants of its diagonal blocks.
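A quick numerical check of Proposition 1, assuming a generic symmetric positive definite matrix partitioned into blocks (the function name and helper choices are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky

def check_logdet_decomposition(Lam_hat, block_sizes):
    """Sketch: verify log|Lam_hat| = sum_n log|(U_nn^T U_nn)^{-1}| where U is the
    upper-triangular Cholesky factor of Lam_hat^{-1} (Proposition 1)."""
    U = cholesky(np.linalg.inv(Lam_hat), lower=False)    # Lam_hat^{-1} = U^T U
    idx = np.cumsum([0] + list(block_sizes))
    rhs = 0.0
    for n in range(len(block_sizes)):
        Unn = U[idx[n]:idx[n + 1], idx[n]:idx[n + 1]]    # n-th diagonal block of U
        rhs += np.linalg.slogdet(np.linalg.inv(Unn.T @ Unn))[1]
    lhs = np.linalg.slogdet(Lam_hat)[1]                  # log |Lam_hat|
    return bool(np.isclose(lhs, rhs))

# e.g.: A = np.random.randn(6, 6); check_logdet_decomposition(A @ A.T + 6 * np.eye(6), [2, 2, 2])
```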

Proposition 2. Let $B \in \{1, \ldots, N-1\}$ be given. If $\widehat{\Lambda}^{-1}_{D_tD_t}$ is $B$-block-banded³, then
$$(U_{nn}^{\top}U_{nn})^{-1} = \widehat{\Lambda}_{D_{tn}D_{tn}} - \widehat{\Lambda}_{D_{tn}D^B_{tn}}\, \widehat{\Lambda}^{-1}_{D^B_{tn}D^B_{tn}}\, \widehat{\Lambda}_{D^B_{tn}D_{tn}} \qquad (3)$$
for $n = 1, \ldots, N$ where $\eta \triangleq \min(n + B, N)$, $D^B_{tn} \triangleq \bigcup_{n'=n+1}^{\eta} D_{tn'}$, $\widehat{\Lambda}_{D_{tn}D^B_{tn}} \triangleq [\widehat{\Lambda}_{D_{tn}D_{tn'}}]_{n'=n+1,\ldots,\eta}$, $\widehat{\Lambda}_{D^B_{tn}D^B_{tn}} \triangleq [\widehat{\Lambda}_{D_{tn'}D_{tn''}}]_{n',n''=n+1,\ldots,\eta}$, and $\widehat{\Lambda}_{D^B_{tn}D_{tn}} \triangleq \widehat{\Lambda}^{\top}_{D_{tn}D^B_{tn}}$.

Its proof follows directly from a block-banded matrix result of (Asif & Moura, 2005) (i.e., Theorem 1). Proposition 2 indicates that if $\widehat{\Lambda}^{-1}_{D_tD_t}$ is $B$-block-banded (Fig. 1b), then each $\log|(U_{nn}^{\top}U_{nn})^{-1}|$ term depends on only the subset $D_{tn} \cup D^B_{tn} = \bigcup_{n'=n}^{\eta} D_{tn'}$ of the batch $D_t$ (Fig. 1c).

Our next result defines a structure of $\widehat{\Lambda}_{D_tD_t}$ in terms of the blocks within the $B$-block band of $\Lambda_{D_tD_t}$ to induce a $B$-block-banded inverse of $\widehat{\Lambda}_{D_tD_t}$:

Proposition 3. Let
$$\widehat{\Lambda}_{D_{tn}D_{tn'}} \triangleq \begin{cases} \Lambda_{D_{tn}D_{tn'}} & \text{if } |\delta| \le B,\\ \widehat{\Lambda}_{D_{tn}D^B_{tn}}\, \widehat{\Lambda}^{-1}_{D^B_{tn}D^B_{tn}}\, \widehat{\Lambda}_{D^B_{tn}D_{tn'}} & \text{if } \delta < -B,\\ \widehat{\Lambda}_{D_{tn}D^B_{tn'}}\, \widehat{\Lambda}^{-1}_{D^B_{tn'}D^B_{tn'}}\, \widehat{\Lambda}_{D^B_{tn'}D_{tn'}} & \text{if } \delta > B; \end{cases} \qquad (4)$$
where $\delta \triangleq n - n'$ for $n, n' = 1, \ldots, N$ (see Fig. 1a). Then, $\widehat{\Lambda}^{-1}_{D_tD_t}$ is $B$-block-banded (see Fig. 1b).

Its proof follows directly from a block-banded matrix result of (Asif & Moura, 2005) (i.e., Theorem 3). It can be observed from (4) and Fig. 1 that (a) though $\widehat{\Lambda}^{-1}_{D_tD_t}$ is a sparse $B$-block-banded matrix, $\widehat{\Lambda}_{D_tD_t}$ is a dense matrix approximation for $B = 1, \ldots, N-1$; (b) when $B = N-1$ or $N = 1$, $\widehat{\Lambda}_{D_tD_t} = \Lambda_{D_tD_t}$; and (c) the blocks within the $B$-block band of $\widehat{\Lambda}_{D_tD_t}$ (i.e., $|n - n'| \le B$) coincide with that of $\Lambda_{D_tD_t}$ while each block outside the $B$-block band of $\widehat{\Lambda}_{D_tD_t}$ (i.e., $|n - n'| > B$) is fully specified by the blocks within the $B$-block band of $\Lambda_{D_tD_t}$ (i.e., $|n - n'| \le B$) due to its recursive series of $|n - n'| - B$ reduced-rank approximations (Fig. 1a). Note, however, that the $\log|(U_{nn}^{\top}U_{nn})^{-1}|$ terms (3) for $n = 1, \ldots, N$ depend on only the blocks within (and not outside) the $B$-block band of $\widehat{\Lambda}_{D_tD_t}$ (Fig. 1c).

³ A block matrix $P \triangleq [P_{nn'}]_{n,n'=1,\ldots,N}$ (i.e., partitioned into $N \times N$ square blocks) is $B$-block-banded if any block $P_{nn'}$ outside its $B$-block band (i.e., $|n - n'| > B$) is $\mathbf{0}$.

Figure 1. $\widehat{\Lambda}_{D_tD_t}$, $\widehat{\Lambda}^{-1}_{D_tD_t}$, and $U = \mathrm{cholesky}(\widehat{\Lambda}^{-1}_{D_tD_t})$ with $B = 1$ and $N = 4$. (a) Shaded blocks (i.e., $|n - n'| \le B$) form the $B$-block band while unshaded blocks (i.e., $|n - n'| > B$) fall outside the band; each arrow denotes a recursive call. (b) Unshaded blocks outside the $B$-block band of $\widehat{\Lambda}^{-1}_{D_tD_t}$ (i.e., $|n - n'| > B$) are $\mathbf{0}$, which result in (c) the unshaded blocks of its Cholesky factor $U$ being $\mathbf{0}$ (i.e., $n - n' > 0$ or $n' - n > B$). Using (3) and (4), $U_{11}$, $U_{22}$, $U_{33}$, and $U_{44}$ depend on only the shaded blocks of $\widehat{\Lambda}_{D_tD_t}$ enclosed in red, green, blue, and purple, respectively.

Remark 1. Proposition 3 provides an attractive principled interpretation: Let $\varepsilon_x \triangleq \sigma_n^{-1}(y_x - \mu_{\{x\}})$ denote a scaled residual incurred by the GP predictive mean (1). Its covariance is then $\mathrm{cov}[\varepsilon_x, \varepsilon_{x'}] = \Lambda_{\{x\}\{x'\}}$. In the same spirit as a Gaussian Markov random process, imposing a $B$-th order Markov property on the residual process $\{\varepsilon_x\}_{x \in D_t}$ is equivalent to approximating $\Lambda_{D_tD_t}$ with $\widehat{\Lambda}_{D_tD_t}$ (4) whose inverse is $B$-block-banded. In other words, if $|n - n'| > B$, then $\{\varepsilon_x\}_{x \in D_{tn}}$ and $\{\varepsilon_x\}_{x \in D_{tn'}}$ are conditionally independent given $\{\varepsilon_x\}_{x \in D_t \setminus (D_{tn} \cup D_{tn'})}$. This conditional independence assumption therefore becomes more relaxed with a larger batch $D_t$. Proposition 2 demonstrates the importance of such a $B$-th order Markov assumption (or, equivalently, the sparsity of the $B$-block-banded $\widehat{\Lambda}^{-1}_{D_tD_t}$) to achieving scalability in the batch size.

Remark 2. Regarding the approximation quality of $\widehat{\Lambda}_{D_tD_t}$ (4), the following result (see Appendix C for its proof) shows that the Kullback-Leibler (KL) distance of $\widehat{\Lambda}_{D_tD_t}$ from $\Lambda_{D_tD_t}$ measures an intuitive notion of the approximation error of $\widehat{\Lambda}_{D_tD_t}$, namely the difference in information gain when relying on our Markov approximation, which can be bounded by some quantity $\nu_t$:

Proposition 4. Let the KL distance $D_{KL}(\Psi, \widetilde{\Psi}) \triangleq 0.5\big(\mathrm{tr}(\Psi\widetilde{\Psi}^{-1}) - \log|\Psi\widetilde{\Psi}^{-1}| - |D_t|\big)$ between two symmetric positive definite $|D_t| \times |D_t|$ matrices $\Psi$ and $\widetilde{\Psi}$ measure the error of approximating $\Psi$ with $\widetilde{\Psi}$. Also, let $\widehat{I}[f_D; y_{D_t}|y_{D_{1:t-1}}] \triangleq 0.5\log|\widehat{\Lambda}_{D_tD_t}|$ denote the approximated information gain, and $C \ge I[f_{\{x\}}; y_{D_t}|y_{D_{1:t-1}}]$ for all $x \in D$ and $t \in \mathbb{N}$. Then, for all $t \in \mathbb{N}$,
$$D_{KL}(\Lambda_{D_tD_t}, \widehat{\Lambda}_{D_tD_t}) = \widehat{I}[f_D; y_{D_t}|y_{D_{1:t-1}}] - I[f_D; y_{D_t}|y_{D_{1:t-1}}] \le (\exp(2C) - 1)\, I[f_D; y_{D_t}|y_{D_{1:t-1}}] \triangleq \nu_t\,.$$

Proposition 4 implies that the approximated information gain $\widehat{I}[f_D; y_{D_t}|y_{D_{1:t-1}}]$ is never smaller than the exact information gain $I[f_D; y_{D_t}|y_{D_{1:t-1}}]$ since $D_{KL}(\Lambda_{D_tD_t}, \widehat{\Lambda}_{D_tD_t}) \ge 0$, with equality when $N = 1$, in which case $\widehat{\Lambda}_{D_tD_t} = \Lambda_{D_tD_t}$ (4). Thus, intuitively, our proposed Markov approximation hallucinates information into $\widehat{\Lambda}_{D_tD_t}$ to yield an optimistic estimate of the information gain (by selecting a particular batch), ultimately making our resulting algorithm overconfident in selecting a batch. This overconfidence is information-theoretically quantified by the approximation error $D_{KL}(\Lambda_{D_tD_t}, \widehat{\Lambda}_{D_tD_t}) \le \nu_t$.

Remark 3. The KL distance $D_{KL}(\Lambda_{D_tD_t}, \widehat{\Lambda}_{D_tD_t})$ of $\widehat{\Lambda}_{D_tD_t}$ from $\Lambda_{D_tD_t}$ is also the least among all $|D_t| \times |D_t|$ matrices with a $B$-block-banded inverse, as proven in Appendix D.
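The identity in Proposition 4 is easy to verify numerically once a band-matching approximation with a block-banded inverse is available (e.g., the markov_band_approximation sketch above); the function below is a self-contained, hedged check with illustrative names.

```python
import numpy as np

def kl_equals_info_gain_gap(Lam, Lam_hat):
    """Sketch: check D_KL(Lam, Lam_hat) = hat{I} - I (Proposition 4), assuming
    Lam_hat agrees with Lam on its B-block band and has a B-block-banded inverse,
    so that tr(Lam @ inv(Lam_hat)) = |D_t|."""
    d = Lam.shape[0]
    M = Lam @ np.linalg.inv(Lam_hat)
    dkl = 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - d)
    gap = 0.5 * (np.linalg.slogdet(Lam_hat)[1] - np.linalg.slogdet(Lam)[1])
    return bool(np.isclose(dkl, gap))
```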

DCOP Formulation. By exploiting the approximated information gain $\widehat{I}[f_D; y_{D_t}|y_{D_{1:t-1}}]$ (Proposition 4), Proposition 1, (3), and (4), our batch variant of GP-UCB (2) can be reformulated in an approximate sense⁴ to a distributed batch GP-UCB (DB-GP-UCB) algorithm⁵ that jointly optimizes a batch of inputs in each iteration $t = 1, \ldots, T$ according to
$$D_t \triangleq \arg\max_{D_t \subset D} \sum_{n=1}^{N} w_n(D_{tn} \cup D^B_{tn})\,,\qquad w_n(D_{tn} \cup D^B_{tn}) \triangleq \mathbf{1}^{\top}\mu_{D_{tn}} + \big(0.5\,\alpha_t \log|\Lambda_{D_{tn}D_{tn}|D^B_{tn}}|\big)^{1/2} \qquad (5)$$
with $\Lambda_{D_{tn}D_{tn}|D^B_{tn}} \triangleq \Lambda_{D_{tn}D_{tn}} - \Lambda_{D_{tn}D^B_{tn}}\,\Lambda^{-1}_{D^B_{tn}D^B_{tn}}\,\Lambda_{D^B_{tn}D_{tn}}$.

⁴ Note that our acquisition function (5) uses $\sum_{n=1}^{N}(\log|\cdot|)^{1/2}$ instead of $(\sum_{n=1}^{N}\log|\cdot|)^{1/2}$ to enable the decomposition.

⁵ Pseudocode for DB-GP-UCB is provided in Appendix E.

Note that (5) is equivalent to our batch variant of GP-UCB (2) when $N = 1$. It can also be observed that (5) is naturally formulated as a multi-agent DCOP (Section 2) whereby every agent $a_n \in A$ is responsible for optimizing a disjoint subset $D_{tn}$ of the batch $D_t$ for $n = 1, \ldots, N$ and each function $w_n$ defines a constraint over only the subset $D_{tn} \cup D^B_{tn} = \bigcup_{n'=n}^{\eta} D_{tn'}$ of the batch $D_t$ and represents the joint payoff that the corresponding agents $A_n \triangleq \{a_{n'}\}_{n'=n}^{\eta} \subseteq A$ achieve. As a result, (5) can be efficiently and scalably solved by the state-of-the-art DCOP algorithms (Chapman et al., 2011; Leite et al., 2014). For example, the time complexity of an iterative message-passing algorithm called max-sum (Farinelli et al., 2008) scales exponentially in only the largest arity $\max_{n \in \{1,\ldots,N\}} |D_{tn} \cup D^B_{tn}| = (B+1)|D_t|/N$ of the functions $w_1, \ldots, w_N$. Given a limited time budget, a practitioner can set a maximum arity of $\omega$ for any function $w_n$, after which the number $N$ of functions is adjusted to $\lceil (B+1)|D_t|/\omega \rceil$ so that the time incurred by max-sum to solve the DCOP in (5) is $O(|D|^{\omega}\omega^{3}B|D_t|)$⁶ per iteration (i.e., linear in the batch size $|D_t|$ by assuming $\omega$ and the Markov order $B$ to be constants). In contrast, our batch variant of GP-UCB (2) incurs exponential time in the batch size $|D_t|$. The max-sum algorithm is also amenable to a distributed implementation on a cluster of parallel machines to boost scalability further. If a solution quality guarantee is desired, then a variant of max-sum called bounded max-sum (Rogers et al., 2011) can be used⁷. Finally, the Markov order $B$ can be varied to trade off between the approximation quality of $\widehat{\Lambda}_{D_tD_t}$ (4) and the time efficiency of max-sum in solving the DCOP in (5).

⁶ We assume the use of online sparse GP models (Csato & Opper, 2002; Hensman et al., 2013; Hoang et al., 2015; 2017; Low et al., 2014b; Xu et al., 2014) that can update the GP predictive/posterior distribution (1) in constant time in each iteration.

⁷ Bounded max-sum was previously used in (Rogers et al., 2011) to solve a related maximum entropy sampling problem (Shewry & Wynn, 1987) formulated as a DCOP. But the largest arity of any function $w_n$ in this DCOP is still the batch size $|D_t|$ and, unlike the focus of our work here, no attempt is made in (Rogers et al., 2011) to reduce it, thus causing max-sum and bounded max-sum to incur exponential time in $|D_t|$. In fact, our proposed Markov approximation can be applied to this problem to reduce the largest arity of any function $w_n$ in this DCOP to again $(B+1)|D_t|/N$.

Regret Bounds. Our main result to follow derives probabilistic bounds on the cumulative regret of DB-GP-UCB:

Theorem 1. Let $\delta \in (0, 1)$ be given, $C_1 \triangleq 4/\log(1 + \sigma_n^{-2})$, $\gamma_T \triangleq \max_{D_{1:T} \subset D} I[f_D; y_{D_{1:T}}]$, $\alpha_t \triangleq C_1 |D_t| \exp(2C) \log(|D| t^2 \pi^2/(6\delta))$, and $\nu_T \triangleq \sum_{t=1}^{T} \nu_t$. Then, for the batch and full cumulative regrets incurred by our DB-GP-UCB algorithm (5),
$$R_T \le \frac{2\gamma_T}{|D_T|}\big(2\alpha_T N(\gamma_T + \nu_T)\big)^{1/2} \quad\text{and}\quad R'_T \le 2\big(T\alpha_T N(\gamma_T + \nu_T)\big)^{1/2}$$
hold with probability of at least $1 - \delta$.

Its proof (Appendix F), when compared to that of GP-UCB (Srinivas et al., 2010) and its greedy batch variants (Contal et al., 2013; Desautels et al., 2014), requires tackling the additional technical challenges associated with jointly optimizing a batch $D_t$ of inputs in each iteration $t$. Note that the uncertainty sampling based initialization strategy proposed by Desautels et al. (2014) can be employed to replace the $\sqrt{\exp(2C)}$ term (i.e., growing linearly in $|D_t|$) appearing in our regret bounds by a kernel-dependent constant factor of $C'$ that is independent of $|D_t|$; values of $C'$ for the most commonly-used kernels are replicated in Table 2 in Appendix G (see Section 4.5 in (Desautels et al., 2014) for a more detailed discussion on this issue).

Table 1 in Appendix G compares the bounds on $R_T$ of DB-GP-UCB (5), GP-UCB-PE, GP-BUCB, GP-UCB, and UCB-DPP-SAMPLE. Compared to the bounds on $R_T$ of GP-UCB-PE and UCB-DPP-SAMPLE, our bound includes the additional kernel-dependent factor of $C'$, which is similar to GP-BUCB. In fact, our regret bound is of the same form as that of GP-BUCB except that our bound incorporates a parameter $N$ of our Markov approximation and an upper bound $\nu_T$ on the cumulative approximation error, both of which vanish for our batch variant of GP-UCB (2):

Corollary 1. For our batch variant of GP-UCB (2), the cumulative regrets reduce to $R_T \le \frac{2\gamma_T}{|D_T|}\big(2\alpha_T\gamma_T\big)^{1/2}$ and $R'_T \le 2\big(T\alpha_T\gamma_T\big)^{1/2}$.

Corollary 1 follows directly from Theorem 1 and by noting that for our batch variant (2), $N = 1$ (since $\widehat{\Lambda}_{D_tD_t}$ then trivially reduces to $\Lambda_{D_tD_t}$) and $\nu_t = 0$ for $t = 1, \ldots, T$.

Finally, the convergence rate of our DB-GP-UCB algorithm is dominated by the growth behavior of $\gamma_T + \nu_T$. While it is well-known that the bounds on the maximum mutual information $\gamma_T$ established for the commonly-used linear, squared exponential, and Matérn kernels in (Srinivas et al., 2010; Kathuria et al., 2016) (i.e., replicated in Table 2 in Appendix G) only grow sublinearly in $T$, it is not immediately clear how the upper bound $\nu_T$ on the cumulative approximation error behaves. Our next result reveals that $\nu_T$ in fact only grows sublinearly in $T$ as well:

Corollary 2. $\nu_T \le (\exp(2C) - 1)\gamma_T$.

Corollary 2 follows directly from the definitions of $\nu_t$ in Proposition 4 and of $\nu_T$ and $\gamma_T$ in Theorem 1 and applying the chain rule for mutual information. Since $\gamma_T$ grows sublinearly in $T$ for the above-mentioned kernels (Srinivas et al., 2010) and $C$ can be chosen to be independent of $T$ (e.g., $C \triangleq \gamma_{|D_t|-1}$) (Desautels et al., 2014), it follows from Corollary 2 that $\nu_T$ grows sublinearly in $T$. As a result, Theorem 1 guarantees sublinear cumulative regrets for the above-mentioned kernels, which implies that our DB-GP-UCB algorithm (5) asymptotically achieves no regret, regardless of the degree of our proposed Markov approximation (i.e., configuration of $[N, B]$). Thus, our batch variant of GP-UCB (2) achieves no regret as well.

4. Experiments and Discussion

This section evaluates the cumulative regret incurred by our DB-GP-UCB algorithm (5) and its scalability in the batch size empirically on two synthetic benchmark objective functions, namely Branin-Hoo (Lizotte, 2008) and gSobol (González et al., 2016) (Table 3 in Appendix H), and a real-world pH field of Broom's Barn farm (Webster & Oliver, 2007) (Fig. 3 in Appendix H) spatially distributed over a 1200 m by 680 m region discretized into a 31 × 18 grid of sampling locations. These objective functions and the real-world pH field are each modeled as a sample of a GP whose prior covariance is defined by the widely-used squared exponential kernel $k_{xx'} \triangleq \sigma_s^2 \exp(-0.5 (x - x')^{\top}\Lambda^{-2}(x - x'))$ where $\Lambda \triangleq \mathrm{diag}[\ell_1, \ldots, \ell_d]$ and $\sigma_s^2$ are its length-scale and signal variance hyperparameters, respectively. These hyperparameters together with the noise variance $\sigma_n^2$ are learned using maximum likelihood estimation (Rasmussen & Williams, 2006).
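For concreteness, the squared exponential kernel above can be evaluated as in the following sketch (vectorized over two sets of inputs); the names are illustrative.

```python
import numpy as np

def squared_exponential_kernel(X1, X2, lengthscales, signal_var):
    """Sketch of k_{xx'} = s^2 exp(-0.5 (x - x')^T Lambda^{-2} (x - x')) with
    Lambda = diag(lengthscales); X1 and X2 are arrays of shape (n, d) and (m, d)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales   # scaled differences
    return signal_var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```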

The performance of our DB-GP-UCB algorithm (5) is compared with the state-of-the-art batch BO algorithms such as GP-BUCB (Desautels et al., 2014), GP-UCB-PE (Contal et al., 2013), SM-UCB (Azimi et al., 2010), q-EI (Chevalier & Ginsbourger, 2013), and BBO-LP by plugging in GP-UCB (González et al., 2016), whose implementations⁸ are publicly available. These batch BO algorithms are evaluated using a performance metric that measures the cumulative regret incurred by a tested algorithm: $\sum_{t=1}^{T} f(x^*) - f(\widetilde{x}_t)$ where $\widetilde{x}_t \triangleq \arg\max_{x_t \in D} \mu_{\{x_t\}}$ (1) is the recommendation of the tested algorithm after $t$ batch evaluations. For each experiment, 5 noisy observations are randomly selected and used for initialization. This is independently repeated 64 times and we report the resulting mean cumulative regret incurred by a tested algorithm. All experiments are run on a Linux system with an Intel® Xeon® E5-2670 at 2.6 GHz with 96 GB memory.
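The performance metric itself is simple to compute; a minimal sketch (with illustrative names) follows.

```python
def cumulative_regret_metric(f, x_star, recommendations):
    """Sketch of the reported metric: sum over t of f(x*) - f(x_tilde_t), where
    x_tilde_t is the tested algorithm's recommendation (the posterior-mean
    maximizer) after t batch evaluations."""
    return float(sum(f(x_star) - f(x_t) for x_t in recommendations))
```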

For our experiments, we use a fixed budget of $T|D_T| = 64$ function evaluations and analyze the trade-off between batch size $|D_T|$ (i.e., 2, 4, 8, 16) vs. time horizon $T$ (respectively, 32, 16, 8, 4) on the performance of the tested algorithms. This experimental setup represents a practical scenario of costly function evaluations: On one hand, when a function evaluation is computationally costly (i.e., time-consuming), it is more desirable to evaluate $f$ for a larger batch (e.g., $|D_T| = 16$) of inputs in parallel in each iteration $t$ (i.e., if hardware resources permit) to reduce the total time needed (hence smaller $T$). On the other hand, when a function evaluation is economically costly, one may be willing to instead invest more time (hence larger $T$) to evaluate $f$ for a smaller batch (e.g., $|D_T| = 2$) of inputs in each iteration $t$ in return for a higher frequency of information and consequently a more adaptive BO to achieve potentially better performance. In some settings, both factors may be equally important, that is, moderate values of $|D_T|$ and $T$ are desired. To the best of our knowledge, such a form of empirical analysis does not seem to be available in the batch BO literature.

⁸ Details on the used implementations are given in Table 4 in Appendix I. We implemented DB-GP-UCB in MATLAB to exploit the GPML toolbox (Rasmussen & Williams, 2006).

Fig. 2 shows results⁹ of the cumulative regret incurred by the tested algorithms to analyze their trade-off between batch size $|D_T|$ (i.e., 2, 4, 8, 16) vs. time horizon $T$ (respectively, 32, 16, 8, 4) using a fixed budget of $T|D_T| = 64$ function evaluations for the Branin-Hoo function (left column), gSobol function (middle column), and real-world pH field (right column). Our DB-GP-UCB algorithm uses the configurations of $[N, B] = [4, 2], [8, 5], [16, 10]$ in the experiments with batch size $|D_T| = 4, 8, 16$, respectively; in the case of $|D_T| = 2$, we use our batch variant of GP-UCB (2), which is equivalent to DB-GP-UCB when $N = 1$. It can be observed that DB-GP-UCB achieves lower cumulative regret than GP-BUCB, GP-UCB-PE, SM-UCB, and BBO-LP in all experiments (with the only exception being the gSobol function for the smallest batch size of $|D_T| = 2$, where BBO-LP performs slightly better) since DB-GP-UCB can jointly optimize a batch of inputs while GP-BUCB, GP-UCB-PE, SM-UCB, and BBO-LP are greedy batch algorithms that select the inputs of a batch one at a time. Note that as the real-world pH field is not as well-behaved as the synthetic benchmark functions (see Fig. 3 in Appendix H), the estimate of the Lipschitz constant by BBO-LP is potentially worse, hence likely degrading its performance. Furthermore, DB-GP-UCB can scale to a much larger batch size of 16 than the other batch BO algorithms that also jointly optimize the batch of inputs, which include q-EI, PPES (Shah & Ghahramani, 2015), and q-KG (Wu & Frazier, 2016): Results of q-EI are not available for $|D_T| \ge 4$ as they require a prohibitively huge computational effort to be obtained¹⁰ while PPES can only operate with a small batch size of up to 3 for the Branin-Hoo function and up to 4 for other functions, as reported in (Shah & Ghahramani, 2015), and q-KG can only operate with a small batch size of 4 for all tested functions (including the Branin-Hoo function and four others), as reported in (Wu & Frazier, 2016). The scalability of DB-GP-UCB is attributed to our proposed Markov approximation of our batch variant of GP-UCB (2) (Section 3), which can then be naturally formulated as a multi-agent DCOP (5) in order to fully exploit the efficiency of one of its state-of-the-art solvers called max-sum (Farinelli et al., 2008).

⁹ Error bars are omitted in Fig. 2 to preserve the readability of the graphs. A replication of the graphs in Fig. 2 including standard error bars is provided in Appendix K.

¹⁰ In the experiments of González et al. (2016), q-EI can reach a batch size of up to 10 but performs much worse than GP-BUCB, which is likely due to a considerable downsampling of possible batches available for selection in each iteration.

Figure 2. Cumulative regret incurred by the tested algorithms (legend: DB-GP-UCB, GP-UCB-PE, GP-BUCB, SM-UCB, BBO-LP, and, for $|D_T| = 2$, q-EI) with varying batch sizes $|D_T| = 2, 4, 8, 16$ (rows from top to bottom) using a fixed budget of $T|D_T| = 64$ function evaluations for the Branin-Hoo function (left column), gSobol function (middle column), and real-world pH field (right column); each panel plots the regret against the number $T$ of iterations.

In the experiments with the largest batch size of $|D_T| = 16$, we have reduced the number of iterations in max-sum to fewer than 5 without waiting for convergence to preserve the efficiency of DB-GP-UCB, thus sacrificing its BO performance. Nevertheless, DB-GP-UCB can still outperform the other tested batch BO algorithms.

We have also investigated and analyzed the trade-off between approximation quality and time efficiency of our DB-GP-UCB algorithm and reported the results in Appendix J due to lack of space. To summarize, it can be observed from our results that the approximation quality improves near-linearly with an increasing Markov order $B$ at the expense of higher computational cost (i.e., exponential in $B$).

5. Conclusion

This paper develops a novel distributed batch GP-UCB (DB-GP-UCB) algorithm for performing batch BO of highly complex, costly-to-evaluate, noisy black-box objective functions. In contrast to greedy batch BO algorithms (Azimi et al., 2010; Contal et al., 2013; Desautels et al., 2014; González et al., 2016), our DB-GP-UCB algorithm can jointly optimize a batch of inputs and, unlike (Chevalier & Ginsbourger, 2013; Shah & Ghahramani, 2015; Wu & Frazier, 2016), still preserve scalability in the batch size. To realize this, we generalize GP-UCB (Srinivas et al., 2010) to a new batch variant amenable to a Markov approximation, which can then be naturally formulated as a multi-agent DCOP in order to fully exploit the efficiency of its state-of-the-art solvers such as max-sum (Farinelli et al., 2008; Rogers et al., 2011) for achieving linear time in the batch size. Our proposed DB-GP-UCB algorithm offers practitioners the flexibility to trade off between the approximation quality and time efficiency by varying the Markov order. We provide a theoretical guarantee for the convergence rate of our DB-GP-UCB algorithm via bounds on its cumulative regret. Empirical evaluation on synthetic benchmark objective functions and a real-world pH field shows that our DB-GP-UCB algorithm can achieve lower cumulative regret than the greedy batch BO algorithms such as GP-BUCB, GP-UCB-PE, SM-UCB, and BBO-LP, and scale to larger batch sizes than the other batch BO algorithms that also jointly optimize the batch of inputs, which include q-EI, PPES, and q-KG. For future work, we plan to generalize DB-GP-UCB (a) to the nonmyopic context by appealing to existing literature on nonmyopic BO (Ling et al., 2016) and active learning (Cao et al., 2013; Hoang et al., 2014a;b; Low et al., 2008; 2009; 2011; 2014a) as well as (b) to be performed by a multi-robot team to find hotspots in environmental sensing/monitoring by seeking inspiration from existing literature on multi-robot active sensing/learning (Chen et al., 2012; 2013b; 2015; Low et al., 2012; Ouyang et al., 2014). For applications with a huge budget of function evaluations, we would like to couple DB-GP-UCB with the use of parallel/distributed sparse GP models (Chen et al., 2013a; Hoang et al., 2016; Low et al., 2015) to represent the belief of the unknown objective function efficiently.

Acknowledgements

This research is supported by Singapore Ministry of Education Academic Research Fund Tier 2, MOE2016-T2-2-156. Erik A. Daxberger would like to thank Volker Tresp for his advice throughout this research project.

References

Anderson, B. S., Moore, A. W., and Cohn, D. A non-parametric approach to noisy and costly optimization. In Proc. ICML, 2000.

Asif, A. and Moura, J. M. F. Block matrices with L-block-banded inverse: Inversion algorithms. IEEE Trans. Signal Processing, 53(2):630-642, 2005.

Azimi, J., Fern, A., and Fern, X. Z. Batch Bayesian optimization via simulation matching. In Proc. NIPS, pp. 109-117, 2010.

Brochu, E., Cora, V. M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599, 2010.

Cao, N., Low, K. H., and Dolan, J. M. Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS, pp. 7-14, 2013.

Chapman, A., Rogers, A., Jennings, N. R., and Leslie, D. A unifying framework for iterative approximate best-response algorithms for distributed constraint optimisation problems. The Knowledge Engineering Review, 26(4):411-444, 2011.

Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp. 163-173, 2012.

Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp. 152-161, 2013a.

Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b.

Chen, J., Low, K. H., Jaillet, P., and Yao, Y. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Trans. Autom. Sci. Eng., 12:901-921, 2015.

Chevalier, C. and Ginsbourger, D. Fast computation of the multi-points expected improvement with applications in batch selection. In Proc. 7th International Conference on Learning and Intelligent Optimization, pp. 59-69, 2013.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Proc. ECML/PKDD, pp. 225-240, 2013.

Contal, E., Perchet, V., and Vayatis, N. Gaussian process optimization with mutual information. In Proc. ICML, pp. 253-261, 2014.

Csato, L. and Opper, M. Sparse online Gaussian processes. Neural Computation, 14(3):641-669, 2002.

Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. JMLR, 15:4053-4103, 2014.

Farinelli, A., Rogers, A., Petcu, A., and Jennings, N. R. Decentralised coordination of low-power embedded devices using the max-sum algorithm. In Proc. AAMAS, pp. 639-646, 2008.

González, J., Dai, Z., Hennig, P., and Lawrence, N. D. Batch Bayesian optimization via local penalization. In Proc. AISTATS, 2016.

Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. JMLR, 13:1809-1837, 2012.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Proc. UAI, 2013.

Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Proc. NIPS, pp. 918-926, 2014.

Hoang, Q. M., Hoang, T. N., and Low, K. H. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, pp. 2007-2014, 2017.

Hoang, T. N., Low, K. H., Jaillet, P., and Kankanhalli, M. Active learning is planning: Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ECML/PKDD Nectar Track, pp. 494-498, 2014a.

Hoang, T. N., Low, K. H., Jaillet, P., and Kankanhalli, M. Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ICML, pp. 739-747, 2014b.

Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp. 569-578, 2015.

Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp. 382-391, 2016.

Kathuria, T., Deshpande, A., and Kohli, P. Batched Gaussian process bandit optimization via determinantal point processes. In Proc. NIPS, pp. 4206-4214, 2016.

Leite, A. R., Enembreck, F., and Barthes, J.-P. A. Distributed constraint optimization problems: Review and perspectives. Expert Systems with Applications, 41(11):5139-5157, 2014.

Ling, C. K., Low, K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pp. 1860-1866, 2016.

Lizotte, D. J. Practical Bayesian optimization. Ph.D. Thesis, University of Alberta, 2008.

Low, K. H., Dolan, J. M., and Khosla, P. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pp. 23-30, 2008.

Low, K. H., Dolan, J. M., and Khosla, P. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pp. 233-240, 2009.

Low, K. H., Dolan, J. M., and Khosla, P. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pp. 753-760, 2011.

Low, K. H., Chen, J., Dolan, J. M., Chien, S., and Thompson, D. R. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pp. 105-112, 2012.

Low, K. H., Chen, J., Hoang, T. N., Xu, N., and Jaillet, P. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In Ravela, S. and Sandu, A. (eds.), Dynamic Data-Driven Environmental Systems Science: First International Conference, DyDESS 2014, pp. 167-181. LNCS 8964, Springer International Publishing, 2014a.

Low, K. H., Xu, N., Chen, J., Lim, K. K., and Ozgul, E. B. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pp. 499-503, 2014b.

Low, K. H., Yu, J., Chen, J., and Jaillet, P. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pp. 2821-2827, 2015.

Ouyang, R., Low, K. H., Chen, J., and Jaillet, P. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, pp. 573-580, 2014.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rogers, A., Farinelli, A., Stranders, R., and Jennings, N. R. Bounded approximate decentralised coordination via the max-sum algorithm. AIJ, 175(2):730-759, 2011.

Shah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Proc. NIPS, pp. 3312-3320, 2015.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.

Shewry, M. C. and Wynn, H. P. Maximum entropy sampling. J. Applied Statistics, 14(2):165-170, 1987.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pp. 1015-1022, 2010.

Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. J. Glob. Optim., 44(4):509-534, 2009.

Webster, R. and Oliver, M. Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., 2007.

Wu, J. and Frazier, P. The parallel knowledge gradient method for batch Bayesian optimization. In Proc. NIPS, pp. 3126-3134, 2016.

Xu, N., Low, K. H., Chen, J., Lim, K. K., and Ozgul, E. B. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pp. 2585-2592, 2014.

A. Derivation of $I[f_D; y_{D_t}|y_{D_{1:t-1}}]$ Term in (2)

By the definition of conditional mutual information,
$$\begin{aligned}
I[f_D; y_{D_t}|y_{D_{1:t-1}}] &= H[y_{D_t}|y_{D_{1:t-1}}] - H[y_{D_t}|f_D, y_{D_{1:t-1}}]\\
&= H[y_{D_t}|y_{D_{1:t-1}}] - H[y_{D_t}|f_{D_t}]\\
&= 0.5|D_t|\log(2\pi e) + 0.5\log|\sigma_n^2 I + \Sigma_{D_tD_t}| - 0.5|D_t|\log(2\pi e) - 0.5\log|\sigma_n^2 I|\\
&= 0.5\log\big(|\sigma_n^2 I + \Sigma_{D_tD_t}|\,|\sigma_n^2 I|^{-1}\big)\\
&= 0.5\log\big(|\sigma_n^2 I + \Sigma_{D_tD_t}|\,|\sigma_n^{-2} I|\big)\\
&= 0.5\log|I + \sigma_n^{-2}\Sigma_{D_tD_t}|
\end{aligned}$$
where the third equality is due to the definition of Gaussian entropy, that is, $H[y_{D_t}|y_{D_{1:t-1}}] \triangleq 0.5|D_t|\log(2\pi e) + 0.5\log|\sigma_n^2 I + \Sigma_{D_tD_t}|$ and $H[y_{D_t}|f_{D_t}] \triangleq 0.5|D_t|\log(2\pi e) + 0.5\log|\sigma_n^2 I|$, the latter of which follows from $\varepsilon = y_x - f(x) \sim \mathcal{N}(0, \sigma_n^2)$ for all $x \in D_t$ and hence $p(y_{D_t}|f_{D_t}) = \mathcal{N}(\mathbf{0}, \sigma_n^2 I)$.

B. Proof of Proposition 1

$$\begin{aligned}
\log|\widehat{\Lambda}_{D_tD_t}| &= -\log|\widehat{\Lambda}^{-1}_{D_tD_t}| = -\log|U^{\top}U| = -\log|U^{\top}||U| = -\log|U|^2\\
&= -2\log\prod_{n=1}^{N}|U_{nn}| = -2\sum_{n=1}^{N}\log|U_{nn}| = -\sum_{n=1}^{N}\log|U_{nn}|^2 = -\sum_{n=1}^{N}\log|U_{nn}^{\top}||U_{nn}|\\
&= -\sum_{n=1}^{N}\log|U_{nn}^{\top}U_{nn}| = \sum_{n=1}^{N}\log|U_{nn}^{\top}U_{nn}|^{-1} = \sum_{n=1}^{N}\log|(U_{nn}^{\top}U_{nn})^{-1}|
\end{aligned}$$
where the first, third, fourth, eighth, ninth, and last equalities follow from the properties of the determinant, the second equality is due to the Cholesky factorization of $\widehat{\Lambda}^{-1}_{D_tD_t}$, and the fifth equality follows from the property that the determinant of an upper triangular block matrix is a product of determinants of its diagonal blocks (i.e., $|U| = \prod_{n=1}^{N}|U_{nn}|$).

Page 12: Distributed Batch Gaussian Process Optimization

Distributed Batch Gaussian Process Optimization

C. Proof of Proposition 4

From the definition of $D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t})$,
\begin{align*}
D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t})
&= 0.5\big(\mathrm{tr}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) - \log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}| - |\mathcal{D}_t|\big)\\
&= 0.5\big(-\log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}|\big)\\
&= 0.5\big(-\log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}| + \log|\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}|\big)\\
&= 0.5\log|\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}| - 0.5\log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}|\\
&= \hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] - I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\ .
\end{align*}
The second equality is due to $\mathrm{tr}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) = \mathrm{tr}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) = \mathrm{tr}(I) = |\mathcal{D}_t|$, which follows from the observations that the blocks within the $B$-block bands of $\Lambda_{\mathcal{D}_t\mathcal{D}_t}$ and $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ coincide and $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}$ is $B$-block-banded (Proposition 3). It follows that
\begin{align*}
D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t})
&= \hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] - I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\\
&= 0.5\log|\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}| - \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&\le 0.5\log\Big(\prod_{\mathbf{x}\in\mathcal{D}_t}\big(1 + \sigma_n^{-2}\,\Sigma_{\{\mathbf{x}\}\{\mathbf{x}\}}\big)\Big) - \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&= \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big) - \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&\le \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\exp(2C)\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big) - \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&\le \exp(2C)\sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big) - \sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&= \big(\exp(2C) - 1\big)\sum_{b=1}^{|\mathcal{D}_t|} 0.5\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&= \big(\exp(2C) - 1\big)\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\ .
\end{align*}
The second and last equalities are due to Lemma 4 in Appendix F, and $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ is defined in Definition 1 in Appendix F. The first inequality is due to Hadamard's inequality and the observation that the blocks within the $B$-block bands of $\Lambda_{\mathcal{D}_t\mathcal{D}_t}$ and $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ (and thus their diagonal elements) coincide. The second inequality is due to Lemma 2 in Appendix F. The third inequality is due to Bernoulli's inequality.

Remark. The first inequality can also be interpreted as bounding the approximated information gain for an arbitrary $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ by the approximated information gain for the $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ with the highest possible degree of our proposed Markov approximation, i.e., for $N = |\mathcal{D}_t|$ and $B = 0$. In this case, all inputs of the batch are assumed to have conditionally independent corresponding outputs such that the determinant of the approximated matrix reduces to the product of its diagonal elements, which are equal to the diagonal elements of the original matrix. Thus, $|\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}| \le \prod_{\mathbf{x}\in\mathcal{D}_t}\big(1 + \sigma_n^{-2}\,\Sigma_{\{\mathbf{x}\}\{\mathbf{x}\}}\big)$, which interestingly coincides with Hadamard's inequality. Note that we only consider $B \ge 1$ for our proposed algorithm (Proposition 2) since the case of $B = 0$ entails an issue similar to that discussed at the beginning of Section 3 of selecting the same input $|\mathcal{D}_t|$ times within a batch.
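The remark above rests on Hadamard's inequality for symmetric positive definite matrices; the tiny snippet below (ours, on an arbitrary test matrix) illustrates that the extreme $B = 0$ (diagonal) approximation indeed upper-bounds the exact log-determinant, i.e., the approximated information gain over-estimates the exact one.

```python
# Illustration (ours) of Hadamard's inequality: for an SPD matrix Lambda standing in for
# I + sigma_n^{-2} Sigma_{D_t D_t}, the B = 0 (diagonal) approximation over-estimates
# the exact log-determinant.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Lambda = np.eye(6) + A @ A.T
exact = 0.5 * np.linalg.slogdet(Lambda)[1]              # exact information gain
diag_approx = 0.5 * np.sum(np.log(np.diag(Lambda)))     # B = 0 (diagonal) approximation
assert diag_approx >= exact
```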

D. Minimal KL Distance of Approximated Matrix

For the approximation quality of $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ (4), the following result shows that the Kullback-Leibler (KL) distance of $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ from $\Lambda_{\mathcal{D}_t\mathcal{D}_t}$ is the least among all $|\mathcal{D}_t| \times |\mathcal{D}_t|$ matrices with a $B$-block-banded inverse:

Proposition 5. Let the KL distance $D_{\mathrm{KL}}(\Lambda, \widetilde{\Lambda}) \triangleq 0.5\big(\mathrm{tr}(\Lambda\widetilde{\Lambda}^{-1}) - \log|\Lambda\widetilde{\Lambda}^{-1}| - |\mathcal{D}_t|\big)$ between two $|\mathcal{D}_t| \times |\mathcal{D}_t|$ symmetric positive definite matrices $\Lambda$ and $\widetilde{\Lambda}$ measure the error of approximating $\Lambda$ with $\widetilde{\Lambda}$. Then, $D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}) \le D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \widetilde{\Lambda})$ for any matrix $\widetilde{\Lambda}$ with a $B$-block-banded inverse.

Proof.
\begin{align*}
& D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}) + D_{\mathrm{KL}}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}, \widetilde{\Lambda})\\
&= 0.5\big(\mathrm{tr}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) - \log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}| - |\mathcal{D}_t|\big) + 0.5\big(\mathrm{tr}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}\widetilde{\Lambda}^{-1}) - \log|\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}\widetilde{\Lambda}^{-1}| - |\mathcal{D}_t|\big)\\
&= 0.5\big(\mathrm{tr}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}\widetilde{\Lambda}^{-1}) - \log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}| - \log|\widetilde{\Lambda}^{-1}| - |\mathcal{D}_t|\big)\\
&= 0.5\big(\mathrm{tr}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}\widetilde{\Lambda}^{-1}) - \log|\Lambda_{\mathcal{D}_t\mathcal{D}_t}\widetilde{\Lambda}^{-1}| - |\mathcal{D}_t|\big)\\
&= D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \widetilde{\Lambda})\ .
\end{align*}
The second equality is due to $\mathrm{tr}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) = \mathrm{tr}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}) = \mathrm{tr}(I) = |\mathcal{D}_t|$, which follows from the observations that the blocks within the $B$-block bands of $\Lambda_{\mathcal{D}_t\mathcal{D}_t}$ and $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$ coincide and $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}^{-1}$ is $B$-block-banded (Proposition 3). The third equality follows from the first observation above and the definition that $\widetilde{\Lambda}^{-1}$ is $B$-block-banded. Since $D_{\mathrm{KL}}(\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}, \widetilde{\Lambda}) \ge 0$, $D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}) \le D_{\mathrm{KL}}(\Lambda_{\mathcal{D}_t\mathcal{D}_t}, \widetilde{\Lambda})$.

E. Pseudocode for DB-GP-UCB

Algorithm 1 DB-GP-UCB
Input: objective function $f$, input domain $\mathcal{D}$, batch size $|\mathcal{D}_t|$, time horizon $T$, prior mean $m_{\mathbf{x}}$ and kernel $k_{\mathbf{x}\mathbf{x}'}$, approximation parameters $B$ and $N$
for $t = 1, \ldots, T$ do
    Select acquisition function
    $$a(\mathcal{D}_t) \triangleq \begin{cases} \mathbf{1}^{\top}\boldsymbol{\mu}_{\mathcal{D}_t} + \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\quad\text{(2)} & \text{if } B = N - 1\,,\\[4pt] \sum_{n=1}^{N}\Big(\mathbf{1}^{\top}\boldsymbol{\mu}_{\mathcal{D}_{tn}} + \sqrt{0.5\,\alpha_t\log|\Lambda_{\mathcal{D}_{tn}\mathcal{D}_{tn}|\mathcal{D}^{B}_{tn}}|}\Big)\quad\text{(5)} & \text{otherwise}\end{cases}$$
    Select batch $\mathcal{D}_t \triangleq \arg\max_{\mathcal{D}_t\subset\mathcal{D}} a(\mathcal{D}_t)$
    Query batch $\mathcal{D}_t$ to obtain $\mathbf{y}_{\mathcal{D}_t} \triangleq \big(f(\mathbf{x}) + \epsilon\big)^{\top}_{\mathbf{x}\in\mathcal{D}_t}$
end for
Output: Recommendation $\widetilde{\mathbf{x}} \triangleq \arg\max_{\mathbf{x}\in\mathcal{D}} \mu_{\{\mathbf{x}\}}$
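The following is a high-level, hypothetical sketch (ours, not the authors' implementation) of the loop structure of Algorithm 1. The GP posterior, the batch acquisition in (2)/(5), and its maximization (cast as a DCOP in the paper) are abstract callables; `gp_posterior`, `acquisition`, and `maximize_acquisition` are placeholder names introduced purely for illustration.

```python
# Hypothetical skeleton (ours) of the DB-GP-UCB outer loop in Algorithm 1.
import numpy as np

def db_gp_ucb(f, domain, batch_size, horizon_T, gp_posterior,
              acquisition, maximize_acquisition, noise_std, rng=None):
    """Run T rounds of joint batch selection and return the recommendation argmax_x mu_{x}."""
    if rng is None:
        rng = np.random.default_rng()
    X_hist, y_hist = [], []                                  # data queried so far
    for t in range(1, horizon_T + 1):
        mu, Sigma = gp_posterior(domain, X_hist, y_hist)     # GP belief given y_{D_{1:t-1}}
        # Jointly select the whole batch D_t by maximizing the batch acquisition function.
        batch = maximize_acquisition(lambda D: acquisition(D, mu, Sigma, t), batch_size)
        y_batch = [f(x) + noise_std * rng.standard_normal() for x in batch]  # parallel queries
        X_hist.extend(batch)
        y_hist.extend(y_batch)
    mu, _ = gp_posterior(domain, X_hist, y_hist)
    return domain[int(np.argmax(mu))]                        # recommendation x~
```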

F. Proof of Theorem 1

We first define a different notion of posterior variance:

Definition 1 (Updated Posterior Variance). Let $\mathcal{D}_t \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_{|\mathcal{D}_t|}\}$ be the batch selected in iteration $t$. Assume an arbitrary ordering of the inputs in $\mathcal{D}_t$. Then, for $0 \le b - 1 < |\mathcal{D}_t|$, $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ is defined as the updated posterior variance at input $\mathbf{x}_b$ that is obtained by applying (1) conditioned on the previous inputs in the batch $\mathcal{D}^{b-1}_t \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_{b-1}\}$. Note that performing this update is possible without querying $\mathcal{D}^{b-1}_t$ since $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ is independent of the outputs $\mathbf{y}_{\mathcal{D}^{b-1}_t}$. For $b - 1 = 0$, $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ reduces to $\Sigma_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$.
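A minimal sketch of the updated posterior variance in Definition 1 is given below (our illustration, assuming a zero-mean GP with kernel `k`, noise variance $\sigma_n^2$, and that the posterior update (1) conditions on noisy observations at the given inputs). No outputs are needed: GP posterior variances depend on the inputs only.

```python
# Sketch (ours) of the updated posterior variance at x_b given the earlier inputs of the
# current batch and the inputs queried in earlier iterations.
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared exponential kernel matrix between row-wise inputs A and B."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def updated_posterior_variance(k, X_hist, D_prev, x_b, sigma_n2):
    """Variance at x_b conditioned on inputs X_hist (past iterations) and D_prev (x_1..x_{b-1})."""
    Z = np.vstack([X_hist, D_prev]) if len(D_prev) else np.atleast_2d(X_hist)
    K_ZZ = k(Z, Z) + sigma_n2 * np.eye(len(Z))
    k_Zx = k(Z, x_b)
    return float(k(x_b, x_b) - k_Zx.T @ np.linalg.solve(K_ZZ, k_Zx))

X_hist = np.array([[0.0], [1.0]])     # inputs queried in iterations 1, ..., t-1
D_prev = np.array([[0.5]])            # D_t^{b-1}: earlier inputs of the current batch
x_b = np.array([[0.6]])
print(updated_posterior_variance(rbf, X_hist, D_prev, x_b, sigma_n2=0.1))
```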

The following lemmas are necessary for proving our main result here:

Lemma 1. Let $\delta\in(0,1)$ be given and $\beta_t \triangleq 2\log(|\mathcal{D}|\pi_t/\delta)$ where $\sum_{t=1}^{\infty}\pi_t^{-1} = 1$ and $\pi_t > 0$. Then,
$$\Pr\Big(\forall\mathbf{x}\in\mathcal{D}\ \ \forall t\in\mathbb{N}\quad |f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \beta_t^{1/2}\,\Sigma^{1/2}_{\{\mathbf{x}\}\{\mathbf{x}\}}\Big) \ge 1-\delta\ .$$

Lemma 1 above corresponds to Lemma 5.1 in (Srinivas et al., 2010); see its proof therein. For example, $\pi_t = t^2\pi^2/6 > 0$ satisfies $\sum_{t=1}^{\infty}\pi_t^{-1} = 1$.

Lemma 2. For $f$ sampled from a known GP prior with known noise variance $\sigma_n^2$, the ratio of $\Sigma_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ to $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ for all $\mathbf{x}_b\in\mathcal{D}_t$ is bounded by
$$\frac{\Sigma_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}}{\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}} = \exp\big(2\, I[f_{\{\mathbf{x}_b\}};\mathbf{y}_{\mathcal{D}^{b-1}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\big) \le \exp(2C)$$
where $\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}$ and $\mathcal{D}^{b-1}_t$ are previously defined in Definition 1, and, for all $\mathbf{x}\in\mathcal{D}$ and $t\in\mathbb{N}$,
$$C \ge I[f_{\{\mathbf{x}\}};\mathbf{y}_{\mathcal{D}^{b-1}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]$$
is a suitable constant.

Lemma 2 above is a combination of Proposition 1 and equation 9 in (Desautels et al., 2014); see their proofs therein. The only difference is that we equivalently bound the ratio of variances instead of the ratio of standard deviations, thus leading to an additional factor of 2 in the argument of $\exp$.

Remark. Since the upper bound $\exp(2C)$ will appear in our regret bounds, we need to choose $C$ suitably. A straightforward choice $C \triangleq \gamma_{|\mathcal{D}_t|-1} = \max_{\mathcal{A}\subset\mathcal{D},\,|\mathcal{A}|\le|\mathcal{D}_t|-1} I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{A}}] \ge \max_{\mathcal{A}\subset\mathcal{D},\,|\mathcal{A}|\le|\mathcal{D}_t|-1} I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{A}}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] \ge I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}^{b-1}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] \ge I[f_{\{\mathbf{x}\}};\mathbf{y}_{\mathcal{D}^{b-1}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]$ (see equations 11, 12, and 13 in (Desautels et al., 2014)) is unfortunately unsatisfying from the perspective of asymptotic scaling since it grows at least as $\Omega(\log|\mathcal{D}_t|)$, thus implying that $\exp(2C)$ grows at least linearly in $|\mathcal{D}_t|$ and yielding a regret bound that is also at least linear in $|\mathcal{D}_t|$. The work of Desautels et al. (2014) shows that when initializing an algorithm suitably, one can obtain a constant $C$ independent of the batch size $|\mathcal{D}_t|$. Refer to Section 4 in (Desautels et al., 2014) for a more detailed discussion.

Lemma 3. For all $t\in\mathbb{N}$ and $\mathbf{x}_b\in\mathcal{D}_t$,
$$\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}} \le 0.5\,C_0\,\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)$$
where $C_0 \triangleq 2/\log(1 + \sigma_n^{-2})$.

Lemma 3 above corresponds to an intermediate step of Lemma 5.4 in (Srinivas et al., 2010); see its proof therein.

Lemma 4. The information gain for a batch $\mathcal{D}_t$ chosen in any iteration $t$ can be expressed in terms of the updated posterior variances of the individual inputs $\mathbf{x}_b\in\mathcal{D}_t$, $b\in\{1,\ldots,|\mathcal{D}_t|\}$, of the batch $\mathcal{D}_t$. That is, for all $t\in\mathbb{N}$,
$$I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] = 0.5\sum_{b=1}^{|\mathcal{D}_t|}\log\big(1 + \sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\ .$$

Lemma 4 above corresponds to Lemma 5.3 in (Srinivas et al., 2010) (the only difference being that we equivalently sum over $1, \ldots, |\mathcal{D}_t|$ instead of $1, \ldots, T$); see its proof therein.
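The decomposition in Lemma 4 is an exact identity (the chain rule for mutual information), so it can be confirmed numerically; the snippet below (ours) does so with a synthetic RBF covariance standing in for $\Sigma_{\mathcal{D}_t\mathcal{D}_t}$ and updated posterior variances computed by conditioning on noisy observations at the earlier batch inputs.

```python
# Numerical confirmation (ours) of Lemma 4: 0.5*log|I + sigma_n^{-2} K| equals the sum over
# b of 0.5*log(1 + sigma_n^{-2} v_b), where v_b is the posterior variance at x_b after
# conditioning on noisy observations at x_1, ..., x_{b-1}.
import numpy as np

rng = np.random.default_rng(3)
m, sigma_n2 = 5, 0.2
X = rng.uniform(-1.0, 1.0, size=(m, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)               # covariance of f at the batch inputs

lhs = 0.5 * np.linalg.slogdet(np.eye(m) + K / sigma_n2)[1]
rhs = 0.0
for b in range(m):
    if b == 0:
        v_b = K[0, 0]                           # no earlier batch inputs: prior variance
    else:
        K_prev = K[:b, :b] + sigma_n2 * np.eye(b)
        v_b = K[b, b] - K[b, :b] @ np.linalg.solve(K_prev, K[:b, b])
    rhs += 0.5 * np.log(1.0 + v_b / sigma_n2)
assert np.isclose(lhs, rhs)
```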

Lemma 5. Let $\delta\in(0,1)$ be given, $C_0 \triangleq 2/\log(1+\sigma_n^{-2})$, and $\alpha_t \triangleq C_0\,|\mathcal{D}_t|\exp(2C)\,\beta_t$ where $\beta_t$ and $\exp(2C)$ are previously defined in Lemmas 1 and 2, respectively. Then,
$$\Pr\Big(\forall\mathcal{D}_t\subset\mathcal{D}\ \ \forall t\in\mathbb{N}\quad \textstyle\sum_{\mathbf{x}\in\mathcal{D}_t}|f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\Big) \ge 1-\delta\ .$$

Proof. For all $\mathcal{D}_t\subset\mathcal{D}$ and $t\in\mathbb{N}$,
\begin{align*}
\sum_{\mathbf{x}\in\mathcal{D}_t}\beta_t\,\Sigma_{\{\mathbf{x}\}\{\mathbf{x}\}}
&= \beta_t\sum_{b=1}^{|\mathcal{D}_t|}\Sigma_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\\
&\le \beta_t\sum_{b=1}^{|\mathcal{D}_t|}\exp(2C)\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\\
&\le 0.5\,C_0\exp(2C)\,\beta_t\sum_{b=1}^{|\mathcal{D}_t|}\log\big(1+\sigma_n^{-2}\,\Sigma^{b-1}_{\{\mathbf{x}_b\}\{\mathbf{x}_b\}}\big)\\
&= C_0\exp(2C)\,\beta_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\\
&= |\mathcal{D}_t|^{-1}\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]
\end{align*}
where the first inequality is due to Lemma 2, the second inequality is due to Lemma 3, and the second equality is due to Lemma 4. Thus,
$$\sum_{\mathbf{x}\in\mathcal{D}_t}\beta_t^{1/2}\,\Sigma^{1/2}_{\{\mathbf{x}\}\{\mathbf{x}\}} \le \sqrt{|\mathcal{D}_t|\sum_{\mathbf{x}\in\mathcal{D}_t}\beta_t\,\Sigma_{\{\mathbf{x}\}\{\mathbf{x}\}}} \le \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}$$
where the first inequality is due to the Cauchy-Schwarz inequality. It follows that
\begin{align*}
&\Pr\Big(\forall\mathcal{D}_t\subset\mathcal{D}\ \ \forall t\in\mathbb{N}\quad \textstyle\sum_{\mathbf{x}\in\mathcal{D}_t}|f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\Big)\\
&\ge \Pr\Big(\forall\mathcal{D}_t\subset\mathcal{D}\ \ \forall t\in\mathbb{N}\quad \textstyle\sum_{\mathbf{x}\in\mathcal{D}_t}|f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \sum_{\mathbf{x}\in\mathcal{D}_t}\beta_t^{1/2}\,\Sigma^{1/2}_{\{\mathbf{x}\}\{\mathbf{x}\}}\Big)\\
&\ge \Pr\Big(\forall\mathbf{x}\in\mathcal{D}\ \ \forall t\in\mathbb{N}\quad |f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \beta_t^{1/2}\,\Sigma^{1/2}_{\{\mathbf{x}\}\{\mathbf{x}\}}\Big)\\
&\ge 1-\delta
\end{align*}
where the first two inequalities are due to the property that for logical propositions $A$ and $B$, $[A \Rightarrow B] \Rightarrow [\Pr(A) \le \Pr(B)]$, and the last inequality is due to Lemma 1.

Lemma 6. Let $\nu_t \ge \hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] - I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]$ be an upper bound on the approximation error of $\hat{\Lambda}_{\mathcal{D}_t\mathcal{D}_t}$. Then, for all $t\in\mathbb{N}$,
$$\sum_{n=1}^{N}\sqrt{0.5\log|\Lambda_{\mathcal{D}_{tn}\mathcal{D}_{tn}|\mathcal{D}^{B}_{tn}}|} \le \sqrt{N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\ .$$

Proof.
\begin{align*}
\sum_{n=1}^{N}\sqrt{0.5\log|\Lambda_{\mathcal{D}_{tn}\mathcal{D}_{tn}|\mathcal{D}^{B}_{tn}}|}
&\le \sqrt{N\sum_{n=1}^{N} 0.5\log|\Lambda_{\mathcal{D}_{tn}\mathcal{D}_{tn}|\mathcal{D}^{B}_{tn}}|}\\
&= \sqrt{N\,\hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\\
&= \sqrt{N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] - I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]\big)}\\
&\le \sqrt{N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}
\end{align*}
where the first inequality is due to the Cauchy-Schwarz inequality.

Lemma 7. Let $t\in\mathbb{N}$ be given. If
$$\sum_{\mathbf{x}\in\mathcal{D}_t}|f(\mathbf{x}) - \mu_{\{\mathbf{x}\}}| \le \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]} \qquad (6)$$
for all $\mathcal{D}_t\subset\mathcal{D}$, then
$$\sum_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}} \le 2\sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}
\quad\text{and}\quad
\min_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}} \le 2\sqrt{|\mathcal{D}_t|^{-2}\,\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\ .$$

Proof.
\begin{align*}
\sum_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}}
&= \sum_{\mathbf{x}\in\mathcal{D}_t}\big(f(\mathbf{x}^*) - f(\mathbf{x})\big)\\
&= \sum_{\mathbf{x}\in\mathcal{D}_t} f(\mathbf{x}^*) - \sum_{\mathbf{x}\in\mathcal{D}_t} f(\mathbf{x})\\
&\le \Big(\sum_{\mathbf{x}\in\mathcal{D}_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\Big) - \sum_{\mathbf{x}\in\mathcal{D}_t} f(\mathbf{x})\\
&= \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)} + \Big(\sum_{\mathbf{x}\in\mathcal{D}_t}\mu_{\{\mathbf{x}\}} - \sum_{\mathbf{x}\in\mathcal{D}_t} f(\mathbf{x})\Big)\\
&= \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)} + \sum_{\mathbf{x}\in\mathcal{D}_t}\big(\mu_{\{\mathbf{x}\}} - f(\mathbf{x})\big)\\
&\le \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)} + \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\\
&= 2\sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\ . \qquad (7)
\end{align*}
The first equality in (7) is by definition (Section 2). The first inequality in (7) is due to
\begin{align*}
\sum_{\mathbf{x}\in\mathcal{D}_t} f(\mathbf{x}^*)
&= \sum_{\mathbf{x}\in\mathcal{D}^*_t} f(\mathbf{x})\\
&\le \sum_{\mathbf{x}\in\mathcal{D}^*_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t\, I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}^*_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\\
&\le \sum_{\mathbf{x}\in\mathcal{D}^*_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t\, \hat{I}[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}^*_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}]}\\
&= \sum_{\mathbf{x}\in\mathcal{D}^*_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t\sum_{n=1}^{N} 0.5\log|\Lambda_{\mathcal{D}^*_{tn}\mathcal{D}^*_{tn}|\mathcal{D}^{*B}_{tn}}|}\\
&\le \sum_{\mathbf{x}\in\mathcal{D}^*_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t}\sum_{n=1}^{N}\sqrt{0.5\log|\Lambda_{\mathcal{D}^*_{tn}\mathcal{D}^*_{tn}|\mathcal{D}^{*B}_{tn}}|}\\
&\le \sum_{\mathbf{x}\in\mathcal{D}_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t}\sum_{n=1}^{N}\sqrt{0.5\log|\Lambda_{\mathcal{D}_{tn}\mathcal{D}_{tn}|\mathcal{D}^{B}_{tn}}|}\\
&\le \sum_{\mathbf{x}\in\mathcal{D}_t}\mu_{\{\mathbf{x}\}} + \sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)} \qquad (8)
\end{align*}
where, in (8), the first inequality is due to (6), the second inequality is due to Proposition 4 (see the paragraph after this proposition in particular), the third inequality is due to the simple observation that $\sum_{n=1}^{N}\sqrt{a_n} \ge \sqrt{\sum_{n=1}^{N} a_n}$, the fourth inequality follows from the definition of $\mathcal{D}_t$ in (5) and, with a slight abuse of notation, $\mathcal{D}^*_t$ is defined as a batch of $|\mathcal{D}_t|$ inputs $\mathbf{x}^*$, and the last inequality is due to Lemma 6. The last inequality in (7) follows from (6) and an argument equivalent to the one in (8) (i.e., by substituting $\mathcal{D}^*_t$ by $\mathcal{D}_t$).

From (7),
$$\min_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}} \le \frac{1}{|\mathcal{D}_t|}\sum_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}} \le 2\sqrt{|\mathcal{D}_t|^{-2}\,\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\ .$$


Main Proof.
\begin{align*}
R'_T &= \sum_{t=1}^{T}\sum_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}}\\
&\le \sum_{t=1}^{T} 2\sqrt{\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\\
&\le 2\sqrt{T\sum_{t=1}^{T}\alpha_t\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \nu_t\big)}\\
&\le 2\sqrt{T\,\alpha_T\, N\Big(\sum_{t=1}^{T} I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_t}\,|\,\mathbf{y}_{\mathcal{D}_{1:t-1}}] + \sum_{t=1}^{T}\nu_t\Big)}\\
&= 2\sqrt{T\,\alpha_T\, N\big(I[f_{\mathcal{D}};\mathbf{y}_{\mathcal{D}_{1:T}}] + \nu_T\big)}\\
&\le 2\sqrt{T\,\alpha_T\, N\,(\gamma_T + \nu_T)}\\
&= \sqrt{C_2\,T\,|\mathcal{D}_T|\exp(2C)\,\beta_T\, N\,(\gamma_T + \nu_T)}
\end{align*}
holds with probability $1-\delta$ where the first equality is by definition (Section 2), the first inequality follows from Lemmas 5 and 7, the second inequality is due to the Cauchy-Schwarz inequality, the third inequality is due to the non-decreasing $\alpha_t$ with increasing $t$, the second equality follows from the chain rule for mutual information and the definition of $\nu_T \triangleq \sum_{t=1}^{T}\nu_t$, the fourth inequality is by definition (Theorem 1), and the third equality is due to the definition of $\alpha_t$ in Lemma 5, $|\mathcal{D}_1| = \ldots = |\mathcal{D}_T|$, and the definition that $C_2 \triangleq 4C_0 = 8/\log(1+\sigma_n^{-2})$.

Analogous reasoning leads to the result that
$$R_T = \sum_{t=1}^{T}\min_{\mathbf{x}\in\mathcal{D}_t} r_{\mathbf{x}} \le 2\sqrt{T\,|\mathcal{D}_T|^{-2}\,\alpha_T\, N(\gamma_T + \nu_T)} = \sqrt{C_2\,T\,|\mathcal{D}_T|^{-1}\exp(2C)\,\beta_T\, N(\gamma_T + \nu_T)}$$
holds with probability $1-\delta$, where the first equality is by definition (Section 2).

G. Comparison of Regret Bounds

Table 1. Bounds on $R_T$ ($\beta_T \triangleq 2\log(|\mathcal{D}|T^2\pi^2/(6\delta))$, $C_1 \triangleq 4/\log(1+\sigma_n^{-2})$, $C_2 \triangleq 2C_1$, $C_3 \triangleq 9C_1$). Note that $|\mathcal{D}_T| = 1$ in $\gamma_T$ for GP-UCB and $H_{\mathrm{DPP}} \triangleq \sum_{t=1}^{T} H(\mathrm{DPP}(K_t))$ with $H(\mathrm{DPP}(K))$ denoting the entropy of a $(|\mathcal{D}_t|-1)$-DPP with kernel $K$ (see (Kathuria et al., 2016) for details on their proposed kernels). Also, note that for DB-GP-UCB and GP-BUCB, we assume the use of the initialization strategy proposed by Desautels et al. (2014); otherwise, the factor $C_0$ is replaced by $\sqrt{\exp(2C)}$.

BO Algorithm | Bound on $R_T$
DB-GP-UCB (5) | $C_0\sqrt{C_2\,T\,|\mathcal{D}_T|^{-1}\beta_T\,N(\gamma_T+\nu_T)}$
GP-UCB-PE (Contal et al., 2013) | $\sqrt{C_1\,T\,|\mathcal{D}_T|^{-1}\beta_T\,\gamma_T}$
GP-BUCB (Desautels et al., 2014) | $C_0\sqrt{C_2\,T\,|\mathcal{D}_T|^{-1}\beta_{T|\mathcal{D}_T|}\,\gamma_T}$
GP-UCB (Srinivas et al., 2010) | $\sqrt{C_2\,T\,\beta_T\,\gamma_T}$
UCB-DPP-SAMPLE (Kathuria et al., 2016) | $\sqrt{2C_3\,T\,|\mathcal{D}_T|\,\beta_T\,[\gamma_T - H_{\mathrm{DPP}} + |\mathcal{D}_T|\log(|\mathcal{D}|)]}$


Table 2. Bounds on maximum mutual information $\gamma_T$ (Srinivas et al., 2010; Kathuria et al., 2016) and values of $C_0$ (Desautels et al., 2014) for different commonly-used kernels ($\alpha \triangleq d(d+1)/(2\nu + d(d+1)) \le 1$ with $\nu$ being the Matérn parameter).

Kernel | $\gamma_T$ | $C_0$
Linear | $d\log(T|\mathcal{D}_T|)$ | $\exp(2/e)$
RBF | $(\log(T|\mathcal{D}_T|))^d$ | $\exp((2d/e)^d)$
Matérn | $(T|\mathcal{D}_T|)^{\alpha}\log(T|\mathcal{D}_T|)$ | $e$

H. Synthetic Benchmark Objective Functions and Real-World pH Field

Table 3. Synthetic benchmark objective functions (transcribed in the code sketch below).

Name | Function | $\mathcal{D}$
Branin-Hoo | $f(\mathbf{x}) = a(x_2 - bx_1^2 + cx_1 - r)^2 + s(1-t)\cos(x_1) + s$, where $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $s = 10$, and $t = 1/(8\pi)$ | $[-5, 15]^2$
gSobol | $f(\mathbf{x}) = \prod_{i=1}^{d}\dfrac{|4x_i - 2| + a_i}{1 + a_i}$, where $d = 2$ and $a_i = 1$ for $i = 1, \ldots, d$ | $[-5, 5]^2$
Mixture of cosines | $f(\mathbf{x}) = 1 - \sum_{i=1}^{2}\big(g(x_i) - r(x_i)\big)$, where $g(x_i) = (1.6x_i - 0.5)^2$ and $r(x_i) = 0.3\cos(3\pi(1.6x_i - 0.5))$ | $[-1, 1]^2$
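The snippet below gives direct transcriptions (ours) of the three benchmark functions in Table 3; each expects a 2-D input $\mathbf{x} = (x_1, x_2)$ lying in the stated domain.

```python
# Transcriptions (ours) of the synthetic benchmark objective functions in Table 3.
import numpy as np

def branin_hoo(x):
    """Branin-Hoo on [-5, 15]^2."""
    x1, x2 = x
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def gsobol(x, a=(1.0, 1.0)):
    """gSobol with d = 2 and a_i = 1 on [-5, 5]^2."""
    x, a = np.asarray(x, dtype=float), np.asarray(a, dtype=float)
    return float(np.prod((np.abs(4 * x - 2) + a) / (1 + a)))

def mixture_of_cosines(x):
    """Mixture of cosines on [-1, 1]^2."""
    x = np.asarray(x, dtype=float)
    g = (1.6 * x - 0.5) ** 2
    r = 0.3 * np.cos(3 * np.pi * (1.6 * x - 0.5))
    return float(1.0 - np.sum(g - r))

print(branin_hoo((np.pi, 2.275)))   # ~0.398, a known global minimum value of Branin-Hoo
```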


Figure 3. Real-world pH field of Broom’s Barn farm (Webster & Oliver, 2007).


I. Details on the Implementations of Batch BO Algorithms

Table 4. Details on the available implementations of the batch BO algorithms for comparison with DB-GP-UCB in our experiments.

BO Algorithm | Language | URL of Source Code
GP-BUCB | MATLAB | http://www.gatsby.ucl.ac.uk/~tdesautels/
SM-UCB | MATLAB | http://www.gatsby.ucl.ac.uk/~tdesautels/
GP-UCB-PE | MATLAB | http://econtal.perso.math.cnrs.fr/software/
q-EI | R | http://cran.r-project.org/web/packages/DiceOptim/
BBO-LP | Python | http://sheffieldml.github.io/GPyOpt/

J. Analysis of the Trade-Off between the Approximation Quality vs. Time Efficiency of DB-GP-UCB

We now analyze the trade-off between the approximation quality vs. time efficiency of DB-GP-UCB by varying the Markov order $B$ and number $N$ of functions in DCOP. The mixture of cosines function (Anderson et al., 2000) is used as the objective function $f$ and modeled as a sample of a GP. A large batch size $|\mathcal{D}_T| = 16$ is used as it allows us to compare a multitude of different configurations of $[N, B] \in \{[16, 14], [16, 12], \ldots, [16, 0], [8, 6], [8, 4], [8, 2], [8, 0], [4, 2], [4, 0], [2, 0]\}$. The acquisition function in our batch variant of GP-UCB (2) is used as the performance metric to evaluate the approximation quality of the batch $\mathcal{D}_T$ (i.e., by plugging $\mathcal{D}_T$ into (2) to compute the value of the acquisition function) produced by our DB-GP-UCB algorithm (5) for each configuration of $[N, B]$.

Fig. 4 (top) shows results of the normalized values of the acquisition function in (2) achieved by plugging in the batch $\mathcal{D}_T$ produced by DB-GP-UCB (5) for different configurations of $[N, B]$ such that the optimal value of (2) (i.e., achieved in the case of $N = 1$) is normalized to 1. Fig. 4 (bottom) shows the corresponding time complexity of DB-GP-UCB plotted in $\log_{|\mathcal{D}|}$-scale, thus displaying the values of $(B + 1)|\mathcal{D}_T|/N$. It can be observed that the approximation quality improves near-linearly with an increasing Markov order $B$ at the expense of higher computational cost (i.e., exponential in $B$).
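A tiny helper (ours) reproduces the exponents plotted in Fig. 4 (bottom): assuming the stated time complexity of $|\mathcal{D}|^{(B+1)|\mathcal{D}_T|/N}$, the value shown on a $\log_{|\mathcal{D}|}$ scale for a configuration $[N, B]$ with batch size $|\mathcal{D}_T| = 16$ is simply $(B + 1)\cdot 16/N$.

```python
# Exponents (ours) of the time complexity |D|^((B+1)|D_T|/N) for the tested configurations.
batch_size = 16
configs = [(16, 14), (16, 12), (16, 10), (16, 8), (16, 6), (16, 4), (16, 2), (16, 0),
           (8, 6), (8, 4), (8, 2), (8, 0), (4, 2), (4, 0), (2, 0)]
for N, B in configs:
    print(f"[N={N:2d}, B={B:2d}] -> exponent (B+1)|D_T|/N = {(B + 1) * batch_size / N:.1f}")
```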


Figure 4. (Top) Mean of the normalized value of the acquisition function in (2) (over 64 experiments of randomly selected noisy observations of size 5) achieved by plugging in the batch $\mathcal{D}_T$ (of size 16) produced by our DB-GP-UCB algorithm (5) for different configurations of $[N, B]$ (including the case of $N = 1$ yielding the optimal value of (2)); note that the horizontal line is set at the optimal baseline of $y = 1$ for easy comparison and the y-axis starts at $y = 0.915$. (Bottom) Time complexity of DB-GP-UCB for different configurations of $[N, B]$ plotted in $\log_{|\mathcal{D}|}$-scale.


K. Replication of Regret Graphs including Error Bars

[Figure 5: cumulative regret vs. $T$ curves; legend: DB-GP-UCB, GP-UCB-PE, GP-BUCB, SM-UCB, BBO-LP, q-EI; columns: Branin-Hoo, gSobol, pH field.]

Figure 5. Cumulative regret incurred by tested algorithms with varying batch sizes $|\mathcal{D}_T| = 2, 4, 8, 16$ (rows from top to bottom) using a fixed budget of $T|\mathcal{D}_T| = 64$ function evaluations for the Branin-Hoo function, gSobol function, and real-world pH field. The error bars denote the standard error.