Bayesian Optimization with Gradients

Jian Wu, Matthias Poloczek, Andrew Gordon Wilson, and Peter I. Frazier

School of Operations Research and Information Engineering
Cornell University

{jw926,poloczek,andrew,pf98}@cornell.edu

Abstract

In recent years, Bayesian optimization has proven to be exceptionally successful for global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not make use of derivative information. In this paper we show how Bayesian optimization can exploit derivative information to greatly decrease the number of objective function evaluations required for good performance. We find that derivative information particularly complements the knowledge gradient acquisition function, relative to the popular expected improvement acquisition function. Moreover, we develop a batch Bayesian optimization procedure which exploits noisy and incomplete derivative information, with state-of-the-art performance compared to a wide range of optimization procedures with and without gradients, on several benchmarks as well as kernel learning and k-nearest neighbor applications.

1 Introduction

Bayesian optimization is able to find global optima with a remarkably small number of function evaluations [Brochu et al., 2010, Kleijnen, 2014, Jones et al., 1998]. Bayesian optimization has thus been particularly adopted for automatic hyperparameter tuning of machine learning algorithms [Snoek et al., 2012, Swersky et al., 2013, Gelbart et al., 2014, Gardner et al., 2014], where learning objectives can be extremely expensive to evaluate, noisy, and multimodal.

Bayesian optimization models observations of an objective function as being sampled from a probabilistic distribution over functions, typically given by a Gaussian process (GP).

The objective function, for example, could be predictive performance, and its input could be the hyperparameters of a deep neural network, such as the dropout rate. The goal is then to find the optimal setting of hyperparameters with as few evaluations of the objective as possible, since each evaluation requires retraining all the weights of a neural network, which is a computationally expensive procedure. To decide where to query the objective function next, one uses an acquisition function such as expected improvement [Jones et al., 1998, Huang et al., 2006, Picheny et al., 2013], upper confidence bound [Srinivas et al., 2010], or knowledge gradient [Frazier et al., 2009, Scott et al., 2011], which determines the trade-off between exploration (moving to where the GP has high predictive uncertainty) and exploitation (moving to where the GP has a high predictive mean).

Unlike alternative optimization approaches, Bayesian optimization procedures do not generally leverage derivative information. Derivative information may seemingly be unavailable, or lead to more expensive Gaussian process inference procedures.

On the other hand: (1) The value of gradient information is particularly compelling in the context of Bayesian optimization: one can update an estimate of the optimum using all of the derivative information at previous evaluations of the objective function, rather than just local gradient information at the current iteration. (2) In many Bayesian optimization applications it is not GP inference which is the computational bottleneck, but rather evaluating an expensive objective function. Reducing the number of times one needs to query the objective can more than compensate for some additional computational overhead during GP inference. (3) Derivative information is available in many applications, often at little additional cost. For example, in PDE-constrained problems, gradients can be obtained cheaply via adjoint methods [Plessix, 2006, Duffy, 2009]. Moreover, recent work [Maclaurin et al., 2015, Luketina et al., 2015, Fu et al., 2016] makes gradient information available for hyperparameter tuning problems. (4) Even when derivative information is not readily available, we can usually compute approximate derivatives, for instance through finite differences, as sketched below. (5) Moreover, Bayesian optimization procedures can be designed to naturally accommodate noisy or incomplete derivative information.
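For point (4), the following is a minimal sketch of how such approximate derivatives can be formed by central finite differences; the routine and its names are our own illustration, not part of the paper's method, and it costs 2d extra evaluations per gradient, so it is attractive only when evaluations are cheap relative to the optimization budget.

```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x by central differences.

    Uses 2 * d extra evaluations of f, where d = x.size.
    """
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Example: gradient of a quadratic at x = (1, 2) is approximately (2, 4).
print(finite_difference_gradient(lambda x: (x ** 2).sum(), np.array([1.0, 2.0])))
```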

This paper investigates the use of derivative information with Bayesian optimization. In particular:

• We show how Bayesian optimization can efficiently exploit derivative information to greatly decrease the number of objective function evaluations required for good performance.

• We show that gradient information naturally complements the knowledge gradient acquisition function, outperforming the popular expected improvement acquisition function, and we provide insights into this result.

• We show that random feature expansions [Lazaro-Gredilla et al., 2010] can effectively manage the additional computational expenses of gradient methods.

• We show how one can naturally handle noisy and incomplete gradient information.

• We propose a batch Bayesian optimization approach which leverages derivative information, and provides substantially improved performance over many alternatives. We compare with state-of-the-art batch Bayesian optimization algorithms with and without derivative information, and with the gradient-based optimizer BFGS when full gradients are available.

We assume familiarity with Gaussian processes and Bayesian optimization, for which we recommend Rasmussen and Williams [2006] and Shahriari et al. [2016] for a review. We begin in section 2 by further describing related work. Our Bayesian optimization algorithm that exploits derivative information is described in section 3. In section 4, we compare the performance of our algorithm with several competing methods on a collection of synthetic functions and real problems.

2 Related Work

Despite its fundamental importance, surprisingly little work has been done to extend or understand Bayesian optimization with derivatives.

Lizotte [2008, Sect. 4.2.1 and Sect. 5.2.4] incorporates derivatives into Bayesian optimization, modelling the derivatives of a Gaussian process as in Rasmussen and Williams [2006, Sect. 9.4]. Lizotte [2008] shows that Bayesian optimization with the expected improvement acquisition function and complete gradient information at each sample can outperform BFGS.

Alternatively, Osborne et al. [2009] propose fully Bayesian optimization procedures that use observations of derivatives to improve the conditioning of the Gaussian process covariance matrix. If a sample is taken near a previously observed point, only the derivative information at the new location is used to update the covariance matrix.

Our approach is focused on using derivative information for fast convergence, and is more closely related to Lizotte [2008], with several key differences: (i) we consider derivative information with Bayesian optimization in much greater generality; (ii) we explore the effects of gradients with different acquisition functions, showing empirical and theoretical advantages to combining derivatives with the knowledge gradient acquisition function; (iii) we consider how to leverage derivative information efficiently; (iv) we accommodate partial and noisy derivative information; (v) we develop a powerful batch parallel Bayesian optimization method that leverages derivative information and outperforms state-of-the-art alternatives on several benchmarks and consequential applications such as kernel learning.

Recently, several batch Bayesian optimization algorithms have been proposed that in each iteration choose a set of points rather than a single point at which the function is evaluated. Snoek et al. [2012] construct the batch by iteratively adding a point of maximum value under a single point EI criterion, averaged over the posterior distribution of previously selected points. Wang et al. [2015] develop a Monte Carlo method to determine a set of points that maximizes a batch EI criterion. Marmin et al. [2016] propose a fast way to evaluate the closed-form formula of the batch EI criterion.

Batch acquisition algorithms can also be developed from upper confidence bounds [Contal et al., 2013, Desautels et al., 2014] or entropy search [Shah and Ghahramani, 2015].

Our approach to handling batch observations is most closely related to the batch knowledge gradient (KG) of Wu and Frazier [2016], who extended the knowledge gradient policy of Frazier et al. [2009] to the batch setting. We provide a detailed description of this approach in section 3.2.

3 Batch Knowledge Gradient with Derivatives

In section 3.1 we discuss a general approach to incorporating derivative information into Gaussian processes for Bayesian optimization. We propose an acquisition function based on the knowledge gradient that selects a batch of points to sample in each iteration, utilizing derivative information. This approach is introduced in section 3.2. We then detail how to implement the algorithm efficiently in section 3.4.


3.1 Derivative Information

Given an expensive-to-evaluate function $f$, our goal is to find $\operatorname{argmin}_{x \in \mathbb{A}} f(x)$, where $\mathbb{A} \subset \mathbb{R}^d$ is the domain we optimize over. We place a Gaussian process prior over the function $f : \mathbb{A} \to \mathbb{R}$, which is specified by its mean function $\mu(\cdot) : \mathbb{A} \to \mathbb{R}$ and its kernel function $K(\cdot, \cdot) : \mathbb{A} \times \mathbb{A} \to \mathbb{R}_{\ge 0}$, where $\mathbb{R}_{\ge 0}$ denotes the nonnegative reals. We initially suppose that for each sample we observe the function value and all $d$ partial derivatives, and then later show how to relax this assumption.

For $x \in \mathbb{A}$ we denote the function value by $f(x)$ and the gradient by $\nabla f(x)$. It is convenient to jointly model the function and its gradient via a multi-output Gaussian process with mean function $\tilde{\mu}$ and kernel function $\tilde{K}$ defined as follows:

$$\tilde{\mu}(x) = \left( \mu(x), \nabla \mu(x) \right)^T,$$

$$\tilde{K}(x, x') = \begin{pmatrix} K(x, x') & J(x, x') \\ J(x', x)^T & H(x, x') \end{pmatrix}, \tag{3.1}$$

where $J(x, x') = \left( \frac{\partial K(x, x')}{\partial x'_1}, \cdots, \frac{\partial K(x, x')}{\partial x'_d} \right)$ and $H(x, x')$ is the $d \times d$ Hessian of $K(x, x')$. Since the gradient is a linear operator, the gradient of a GP is also a GP (see also section 9.4 in Rasmussen and Williams [2006]).

We are particularly interested in the ability of acquisition algorithms to leverage noisy observations of partial derivatives. Accordingly, we suppose that the observations of the function value and the gradient are subject to noise. That is, when evaluating $f(x)$ at point $x$, we observe the $(d+1)$-dimensional vector

$$y(x) \,\Big|\, \left( f(x), \nabla f(x) \right) \sim \mathcal{N}\left( \begin{pmatrix} f(x) \\ \nabla f(x) \end{pmatrix}, \operatorname{diag}\left( \sigma^2(x) \right) \right),$$

where $\sigma^2 : \mathbb{A} \to \mathbb{R}^{d+1}_{\ge 0}$ gives the variance of the observational noise at each point for the function value and its $d$ partial derivatives. Then $\operatorname{diag}(\sigma^2(x))$ is the diagonal matrix that gives the variance for each observation, i.e. either of the function $f$ or of a partial derivative. If $\sigma^2$ is not known, we estimate it from data.

The posterior distribution is again a GP with mean function $\mu^{(n)}(\cdot)$ and kernel function $\tilde{K}^{(n)}(\cdot, \cdot)$. Their formulae are given in the appendix for completeness.

To relax the assumption of complete derivatives, we note that if some entries of $(f(x), \nabla f(x))$ are not provided, then the remaining values associated with $x$ still obey the multivariate normal distribution imposed by the GP. Thus, we may simply omit the entries of the mean vector corresponding to outputs of the GP that are not available. Accordingly, we omit the rows and columns of the covariance matrix that correspond to values that were not provided.
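To make Eq. (3.1) concrete, the following sketch (ours, not the paper's C++ implementation) assembles the $(d+1) \times (d+1)$ covariance block between two inputs for a squared exponential kernel $K(x, x') = \exp(-\|x - x'\|^2 / (2\ell^2))$, whose derivative and Hessian blocks are available in closed form; the lengthscale $\ell$ and all names are illustrative assumptions. Rows and columns of the block can be dropped to handle missing partial derivatives, as described above.

```python
import numpy as np

def rbf_joint_kernel(x1, x2, ell=1.0):
    """(d+1) x (d+1) covariance block of Eq. (3.1) for a squared
    exponential kernel K(x, x') = exp(-||x - x'||^2 / (2 ell^2)).

    Row/column 0 corresponds to the function value; rows/columns
    1..d correspond to the partial derivatives.
    """
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    d = diff.size
    k = np.exp(-diff @ diff / (2.0 * ell ** 2))
    block = np.empty((d + 1, d + 1))
    block[0, 0] = k                                    # K(x, x')
    block[0, 1:] = k * diff / ell ** 2                 # J(x, x')
    block[1:, 0] = -k * diff / ell ** 2                # J(x', x)^T
    block[1:, 1:] = k * (np.eye(d) / ell ** 2
                         - np.outer(diff, diff) / ell ** 4)  # H(x, x')
    return block
```

Stacking such blocks over all pairs of observed points, and adding $\operatorname{diag}(\sigma^2(\cdot))$ to the diagonal, yields the full covariance matrix of the noisy joint observations.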

3.2 The Acquisition Algorithm dKG

We extend the batch version of the knowledge gradient of Wu and Frazier [2016] to exploit available derivative information. We refer to this algorithm as the derivative-enabled knowledge gradient (dKG).

The algorithm proceeds iteratively: in each iteration dKG selects a batch of $q$ points in $\mathbb{A}$ that has a maximum value of information (VOI). Suppose that we have observed $n$ points and let $\mu^{(n)}(x)$ for each $x \in \mathbb{A}$ be the $(d+1)$-dimensional vector that gives the posterior mean for $f(x)$ and its $d$ partial derivatives at $x$. Note that we showed in Sect. 3.1 how to remove the assumption that all $d+1$ values are provided.

Note that the expected value of $f(x)$ under the posterior distribution is given by $e_1^T \mu^{(n)}(x)$, where $e_1$ is the $(d+1)$-dimensional vector whose first entry is one and whose other entries are zero. If we were to make an irrevocable (risk-neutral) decision now, we would pick $\operatorname{argmin}_{x \in \mathbb{A}} e_1^T \mu^{(n)}(x)$ (for a minimization problem). Therefore, we define the dKG factor for a given set of $q$ candidate points $z^{(1:q)}$ as

$$\mathrm{dKG}(z^{(1:q)}, \mathbb{A}) = \min_{x \in \mathbb{A}} e_1^T \mu^{(n)}(x) - \mathbb{E}_n\left[ \min_{x \in \mathbb{A}} e_1^T \mu^{(n+q)}(x) \,\Big|\, y(z^{(1:q)}) \right], \tag{3.2}$$

where $\mathbb{E}_n[\cdot]$ is the expectation taken with respect to the posterior distribution after $n$ evaluations, and $y(z^{(1:q)})$ are the observations of both the function values and partial derivatives at the points $z^{(1:q)}$. It is crucial to note that the dKG factor takes the posterior distribution over the derivatives at the points $z^{(1:q)}$ into account by conditioning on $y(z^{(1:q)})$, although Eq. (3.2) is formulated as a difference between posterior means under the function $f$. In the sequel we will refer to Eq. (3.2) as the inner optimization problem.

Now the next batch to evaluate is a set of $q$ points that maximizes the dKG factor,

$$\max_{z^{(1:q)} \subset \mathbb{A}} \mathrm{dKG}(z^{(1:q)}, \mathbb{A}). \tag{3.3}$$

We refer to Eq. (3.3) as the outer optimization problem.

3.3 An Illustration of the Effects of Incorporating Derivative Information

We examine how observing derivative information affects the posterior distribution and the value of information analyses of the knowledge gradient and the expected improvement criteria.

The two topmost plots of Fig. 1 depict the posterior surfaces of a function sampled from a one-dimensional Gaussian process, without taking into account partial derivatives (on the left-hand side) and after incorporating observations of the full respective gradients at the sample locations (on the right-hand side). We see that the uncertainty is considerably reduced if derivative information is taken into account.

The two plots in the second row illustrate how the acquisition criteria of the knowledge gradient and expected improvement are affected by including derivative information. Here we suppose a batch size of one. Note that EI, KG, and even dEI pick essentially the same location for the next sample, whereas dKG prefers a different sample.

The plots in the third and fourth row show the posterior surface after observing the next sample chosen by the respective acquisition criterion. We see that the posterior uncertainty is smaller away from the global optimum for the algorithms that utilize the gradient observations than for those that do not. Interestingly, we notice that the knowledge gradient seems to benefit considerably more from derivative information than expected improvement (fourth row): dKG has sampled a point whose observation gives an accurate knowledge of the location of the optimum, while dEI is still forced to make a greedy sampling decision. We will investigate this observation in more detail in our experimental evaluation.

3.4 An Efficient Formulation of dKG

Recall that the computation of a batch of maximum value of information is difficult since each evaluation of the objective function $\mathrm{dKG}(z^{(1:q)}, \mathbb{A})$ requires an optimal solution to the inner optimization problem in Eq. (3.2), which is stated over the continuous space $\mathbb{A}$. To make this problem tractable in practice, we propose a novel discretization that improves over previous approaches. Then we can compute the dKG factor and its gradient over the discrete set, which allows us to find a batch of maximum value of information efficiently via a gradient-based optimizer.

An Improved Discretization of $\mathbb{A}$. We discretize the set $\mathbb{A}$ in the inner optimization problem stated in Eq. (3.2). How to perform this step is an interesting topic of research itself [Scott et al., 2011]. The discrete set $\mathbb{A}_n$ is not chosen statically, but evolves over time. For example, one can draw $M$ samples from the posterior over the global maximizer (please refer to Sect. B and also to Hernandez-Lobato et al. [2014], Shah and Ghahramani [2015] for a description of this technique). This sample set, denoted by $\mathbb{A}_n^M$, is then extended by the locations of previously sampled points $x^{(1:n)}$ and the set of points $z^{(1:q)}$ whose value of information we wish to compute. Then the inner optimization problem can be restated as

$$\mathrm{dKG}(z^{(1:q)}, \mathbb{A}_n) = \min_{x \in \mathbb{A}_n} e_1^T \mu^{(n)}(x) - \mathbb{E}_n\left[ \min_{x \in \mathbb{A}_n} e_1^T \mu^{(n+q)}(x) \,\Big|\, y(z^{(1:q)}) \right], \tag{3.4}$$

where $\mathbb{A}_n = \mathbb{A}_n^M \cup x^{(1:n)} \cup z^{(1:q)}$. For the experimental evaluation we recompute $\mathbb{A}_n^M$ in every iteration after updating the posterior of the Gaussian process.

The Computation of dKG and its Gradient. Next we show how the dKG factor can be computed efficiently, using the above discretization in the inner optimization problem. Recall that $\tilde{K}^{(n)}$ and $\mu^{(n)}$ are the kernel and mean function, respectively, of the posterior after evaluating $n$ points. It is well known (e.g., see Frazier et al. [2009], Wu and Frazier [2016]) that, conditioned on $z^{(1:q)}$ and the knowledge after $n$ evaluations, $y(z^{(1:q)}) - \mu^{(n)}(z^{(1:q)})$ is normally distributed with zero mean and covariance matrix $\tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \operatorname{diag}\{\sigma^2(z^{(1)}), \cdots, \sigma^2(z^{(q)})\}$. Recall that $y(z^{(1:q)})$ contains the function value and the $d$ partial derivatives for each of the $q$ points in the batch.

Following Wu and Frazier [2016], we express $\mu^{(n+q)}(x)$ as

$$\mu^{(n+q)}(x) = \mu^{(n)}(x) + \tilde{K}^{(n)}(x, z^{(1:q)}) \left( \tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \operatorname{diag}\{\sigma^2(z^{(1)}), \cdots, \sigma^2(z^{(q)})\} \right)^{-1} \left( y(z^{(1:q)}) - \mu^{(n)}(z^{(1:q)}) \right).$$

Thus, we can rewrite $\mu^{(n+q)}(x)$ as

$$\mu^{(n+q)}(x) = \mu^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z^{q(d+1)}, \tag{3.5}$$

where $Z^{q(d+1)}$ is a $q \cdot (d+1)$-dimensional standard normal vector and

$$\tilde{\sigma}^{(n)}(x, z^{(1:q)}) = \tilde{K}^{(n)}(x, z^{(1:q)}) \left( \left( D^{(n)}(z^{(1:q)}) \right)^T \right)^{-1}.$$

Here $D^{(n)}(z^{(1:q)})$ is the Cholesky factor of the covariance matrix $\tilde{K}^{(n)}(z^{(1:q)}, z^{(1:q)}) + \operatorname{diag}\{\sigma^2(z^{(1)}), \cdots, \sigma^2(z^{(q)})\}$. Now we can compute the dKG factor using Monte Carlo sampling. To compute the gradient of the dKG factor, we apply infinitesimal perturbation analysis (IPA), which allows us to exchange the expectation operator and the gradient operator (see Wu and Frazier [2016] for further details).

Figure 1: The topmost plots show the posterior surfaces of a function sampled from a one-dimensional Gaussian process, with and without incorporating observations of the gradients. Note that the posterior variance is considerably smaller if the gradients are incorporated. The plots in the second row show the utility of sampling each point under the value of information criteria of KG and EI in both settings. If no derivatives are observed, both KG and EI will query a point with high potential gain (i.e. a small expected function value). On the other hand, when gradients are observed, dKG makes a considerably better sampling decision, whereas dEI samples essentially the same location as EI. The plots in the third and fourth row depict the posterior surface after the respective sample. Interestingly, KG benefits more from observing the gradients than EI (fourth row): dKG samples a point whose observation yields an accurate knowledge of the location of the optimum, while dEI still has considerable uncertainty around the optimum.

Specifically, by Eq. (3.5), we can rewrite the expression of the approximate dKG factor as

$$\begin{aligned} \mathrm{dKG}(z^{(1:q)}, \mathbb{A}_n) &= \min_{x \in \mathbb{A}_n} e_1^T \mu^{(n)}(x) - \mathbb{E}_n\left[ \min_{x \in \mathbb{A}_n} e_1^T \mu^{(n+q)}(x) \,\Big|\, y(z^{(1:q)}) \right] \\ &= \mathbb{E}_{Z^{q(d+1)}}\left[ \min_{x \in \mathbb{A}_n} e_1^T \mu^{(n)}(x) - \min_{x \in \mathbb{A}_n} e_1^T \left( \mu^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z^{q(d+1)} \right) \right]. \end{aligned}$$

Now let

$$x^{1,*} = \operatorname{argmin}_{x \in \mathbb{A}_n} e_1^T \mu^{(n)}(x) \quad \text{and} \quad x^{2,*} = \operatorname{argmin}_{x \in \mathbb{A}_n} e_1^T \left( \mu^{(n)}(x) + \tilde{\sigma}^{(n)}(x, z^{(1:q)})\, Z^{q(d+1)} \right);$$

then the partial derivative of $\mathrm{dKG}(z^{(1:q)}, \mathbb{A}_n)$ with respect to $z_{ij}$ is

$$\frac{\partial\, \mathrm{dKG}(z^{(1:q)}, \mathbb{A}_n)}{\partial z_{ij}} = \mathbb{E}_{Z^{q(d+1)}}\left[ \frac{\partial\, e_1^T \mu^{(n)}(x^{1,*})}{\partial z_{ij}} - \frac{\partial}{\partial z_{ij}}\, e_1^T \left( \mu^{(n)}(x^{2,*}) + \tilde{\sigma}^{(n)}(x^{2,*}, z^{(1:q)})\, Z^{q(d+1)} \right) \right],$$

where $z_{ij}$ is the $j$-th dimension of the $i$-th point in $z^{(1:q)}$. Therefore, we can utilize multi-start gradient-based optimization to select the next batch.
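As an illustration of the Monte Carlo computation above, the following schematic estimator of Eq. (3.4) applies the reparameterization of Eq. (3.5). The callables `mu_n` and `sigma_tilde`, standing in for $\mu^{(n)}$ and $\tilde{\sigma}^{(n)}$, are placeholders assumed to come from a fitted GP; this is a sketch of the estimator, not the paper's optimized implementation.

```python
import numpy as np

def dkg_factor_mc(mu_n, sigma_tilde, A_n, z_batch, d, num_mc=128, rng=None):
    """Monte Carlo estimate of the dKG factor in Eq. (3.4).

    mu_n(x)           -> posterior mean vector for (f(x), grad f(x))
    sigma_tilde(x, z) -> the (d+1) x q(d+1) matrix sigma^(n)(x, z^(1:q))
    A_n               -> discretized candidate set (list of points)
    z_batch           -> the q candidate points z^(1:q)
    """
    rng = np.random.default_rng(rng)
    q = len(z_batch)
    # Current best posterior mean of f over the discretization
    # (e_1^T mu^(n)(x) picks out the function-value component).
    best_now = min(mu_n(x)[0] for x in A_n)
    improvements = []
    for _ in range(num_mc):
        Z = rng.standard_normal(q * (d + 1))  # Z^{q(d+1)} in Eq. (3.5)
        # One-step-ahead posterior mean of f at each discretization point.
        best_next = min(mu_n(x)[0] + sigma_tilde(x, z_batch)[0] @ Z
                        for x in A_n)
        improvements.append(best_now - best_next)
    return float(np.mean(improvements))
```

Averaging the gradient of each sampled improvement with respect to `z_batch`, as licensed by IPA, would give the stochastic gradient used by the multi-start optimizer.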

4 Experiments

We evaluate the performance of the proposed algorithm dKG on six standard synthetic benchmarks in Sect. 4.1. Moreover, we examine its ability to tune the hyperparameters for the kernel-weighted k-nearest neighbor algorithm (KNN) (see Sect. 4.2) and for a spectral mixture kernel (cp. Sect. 4.3). Note that in the former application not all hyperparameters are differentiable. We compare its performance to state-of-the-art methods in Bayesian optimization:

• the batch expected improvement method (EI) of Wang et al. [2014] that does not utilize derivative information,

• our extension of the above batch expected improvement method that incorporates derivative information (dEI), and

• the batch knowledge gradient algorithm without derivative information (KG) of Wu and Frazier [2016].

Note that all of the above algorithms can be run even if not all partial derivatives are given. In benchmarks that provide the full gradient, we additionally compare to the gradient-based method L-BFGS-B provided in SciPy. We suppose that the objective function $f$ is drawn from a Gaussian process $GP(\mu, \Sigma)$, where $\mu$ is a constant mean function and $\Sigma$ is the squared exponential kernel. The hyperparameters of the kernel are obtained via maximum marginal likelihood estimation and updated after every iteration. The parameter $M$ that determines the number of samples drawn from the posterior over the global maximizer is set to 200 (cp. Sect. 3.4).

In the plots for the synthetic benchmark functions in Sect. 4.1 we report the immediate regret of the solution that each algorithm would pick, as a function of the number of function evaluations. Recall that the immediate regret is defined as the loss with respect to a global optimum. For the other experiments the plots depict the objective value of the solution instead of the immediate regret. The error bars give the mean value plus and minus one standard error. The number of replications varies and is stated in the description of the respective benchmark below. Our method was implemented in C++ with a Python interface, inheriting the open-source implementation of batch EI from the Metrics Optimization Engine [Wang et al., 2014].

4.1 Results on synthetic functions

We evaluate the algorithms on six test functions chosen from Bingham [2015]. In order to demonstrate the ability to benefit from noisy derivative information, we suppose additive normally distributed noise with zero mean and variance $\sigma^2 = 0.25$ for both the function value and its partial derivatives. Note that $\sigma$ is not known to the algorithms but has to be estimated from observations. Moreover, we investigate how the performance of the algorithms is affected if partial derivatives are not given for all parameters. We also experiment with two different batch sizes: batch size $q = 4$ is used for the Branin, Rosenbrock, and Ackley functions; otherwise we use batch size $q = 8$. The experimental results are summarized in Fig. 2.

Functions with Full Gradient Information. For the 2d Branin function on the domain $[-15, 15]^2$, the 5d Ackley function on $[-2, 2]^5$, and the 6d Hartmann function on $[0, 1]^6$, we assume that the full gradient is available.

Looking at the results for the Branin function (cp. Fig. 2), dKG outperforms its competitors after 40 function evaluations and obtains the best solution overall (within the limit of function evaluations). BFGS makes faster progress than the Bayesian optimization methods during the first 20 evaluations; however, it stalls subsequently and fails to obtain a competitive solution. On the Ackley function dEI makes fast progress during the first 50 evaluations but fails to make any progress subsequently. dKG requires about 100 evaluations to improve on the performance of dEI; dKG exhibits the best overall performance again. For the Hartmann function dKG clearly dominates its competitors over all function evaluations.

Functions with Incomplete Derivative Information. For the 3d Rosenbrock function on $[-2, 2]^3$ we only provide a noisy observation of the third partial derivative. Both EI and dEI get stuck early. dKG, on the other hand, finds an almost optimal solution after about 50 function evaluations; KG catches up after about 75 evaluations and has comparable performance afterwards. The 4d Levy benchmark on $[-10, 10]^4$, where the fourth partial derivative is observable with noise, shows a different ordering of the algorithms: here EI has the best performance, beating even its formulation that utilizes derivative information. A possible explanation could be that the smoothness and regularized shape of the function surface benefit this acquisition criterion. For the 8d Cosine mixture function on $[-1, 1]^8$ we provide two noisy partial derivatives. Both KG-based acquisition functions show surprising behavior: their function values actually become worse initially. After about 50 function evaluations both algorithms recover and achieve the best performances, with dKG beating KG slightly.

Summing up, we see that dKG successfully exploits noisy derivative information and has the best overall performance.

4.2 Kernel-weighted k-Nearest Neighbor

Suppose a cab company wishes to predict the duration of trips for its vehicles and customers. Clearly, the duration not only depends on the endpoints of the trip, but also on the day and time. In this benchmark we tune a kernel-weighted k-nearest neighbor (KNN) metric to optimize predictions of these durations, based on historical data. A trip is described by the pick-up time $t$, the pick-up location $(p_1, p_2)$, and the drop-off point $(d_1, d_2)$. Then the estimate of the duration is obtained as a weighted average over all trips $D_{m,t}$ in our database that happened in the time interval $t \pm m$ minutes, where $m$ is a tunable hyperparameter:

$$\mathrm{Prediction}(t, p_1, p_2, d_1, d_2) = \frac{\sum_{i \in D_{m,t}} \mathrm{duration}_i \times \mathrm{weight}(i)}{\sum_{i \in D_{m,t}} \mathrm{weight}(i)}.$$

The weight of trip $i \in D_{m,t}$ in this prediction is given by

$$\mathrm{weight}(i) = \exp\left\{ -\frac{(t - t^i)^2}{l_1^2} - \frac{(p_1 - p_1^i)^2}{l_2^2} - \frac{(p_2 - p_2^i)^2}{l_3^2} - \frac{(d_1 - d_1^i)^2}{l_4^2} - \frac{(d_2 - d_2^i)^2}{l_5^2} \right\},$$

where $(t^i, p_1^i, p_2^i, d_1^i, d_2^i)$ are the respective parameter values for trip $i$, and $(l_1, l_2, l_3, l_4, l_5)$ are tunable hyperparameters. Thus, we have 6 hyperparameters to tune: $(m, l_1, l_2, l_3, l_4, l_5)$. We choose $m$ in $[30, 100]$, $l_1^2$ in $[10^2, 10^4]$, and $l_2^2, l_3^2, l_4^2, l_5^2$ each in $[10^{-5}, 10^{-2}]$.

We use the yellow cab NYC public data set from June 2016, sampling 4000 records from June 1st to June 25th as training data and 1000 trip records from June 26th to 30th as validation data. Our test criterion is the root mean squared error (RMSE), for which we compute the partial derivatives on the validation dataset with respect to the hyperparameters $(l_1, l_2, l_3, l_4, l_5)$, while the hyperparameter $m$ is not differentiable.
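A direct transcription of the prediction rule above may clarify the role of the six hyperparameters; the array layout and names below are illustrative assumptions, and the empty-window edge case is ignored for brevity.

```python
import numpy as np

def predict_duration(query, trips, m, lengthscales):
    """Kernel-weighted KNN estimate of a trip's duration.

    query        -> (t, p1, p2, d1, d2) for the new trip
    trips        -> array of shape (N, 6): columns (t, p1, p2, d1, d2, duration)
    m            -> half-width of the time window in minutes
    lengthscales -> (l1, ..., l5), one per feature
    """
    t = query[0]
    window = trips[np.abs(trips[:, 0] - t) <= m]  # the set D_{m,t}
    sq_dist = (window[:, :5] - np.asarray(query)) ** 2
    weights = np.exp(-(sq_dist / np.asarray(lengthscales) ** 2).sum(axis=1))
    return weights @ window[:, 5] / weights.sum()
```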

The experimental results show that our algorithm as well as the batch knowledge gradient perform considerably better than both algorithms based on the expected improvement criterion. For either pair, exploiting derivative information provides an advantage. Fig. 3 summarizes the results for batch size $q = 8$.

Figure 3: The best function value for the KNN benchmark (tuning 6 hyperparameters with batch size 8; derivatives available for five of the six hyperparameters), averaged over 100 replications.

4.3 Kernel Learning

In this benchmark we examine the performance of the optimization algorithms on a complex kernel learning task. Although we have access to an analytic closed-form (marginal likelihood) objective, this objective is (i) expensive to evaluate and (ii) highly multimodal, while (iii) derivative information is available. Thus learning flexible kernel functions is a perfect candidate for our approach.

[Figure 2 comprises six panels: 2d Branin (batch size 4, noisy full gradient), 3d Rosenbrock (batch size 4, noisy 3rd derivative), 4d Levy (batch size 8, noisy 4th derivative), 5d Ackley (batch size 4, noisy full gradient), 6d Hartmann (batch size 8, noisy full gradient), and 8d Cosine mixture (batch size 8, noisy 1st and 2nd derivatives).]

Figure 2: The average performance of 100 replications (the $\log_{10}$ of the immediate regret vs. the number of function evaluations). For the Branin, Ackley, and Hartmann functions, we assume that a noisy observation of the full gradient is available. On the other functions only one or two partial derivatives can be observed (with noise). dKG performs significantly better than its competitors for all benchmarks except the Levy function.

Spectral mixture kernels [Wilson and Adams, 2013] can be used for flexible kernel learning to enable long-term extrapolation. These kernels are obtained by modeling a spectral density by a mixture of Gaussians. While any stationary kernel can be described by a spectral mixture kernel with a particular setting of its hyperparameters, initializing and learning these parameters can be difficult, due to a highly multimodal marginal likelihood objective.

In this experiment, the task is to train a 3-component spectral mixture kernel on an airline data set used by Wilson and Adams [2013]. We have to determine the mixture weights, means, and variances for each of the three Gaussians. We run the algorithms with batch size $q = 8$ on this highly multimodal function. Their performance is summarized in Fig. 4. On this application, BFGS tends to either perform reasonably well or become trapped in a very bad local optimum, depending highly on initialization and human intervention. We have chosen a run where BFGS performs reasonably. dKG, on the other hand, can more consistently find a good solution; here dKG finds the best solution within the step limit. Overall, we observe that gradient information is highly valuable in performing this kernel learning task.

Figure 4: Top: The average performance for the spectral mixture kernel benchmark (tuning a 3-component spectral mixture kernel, 9 hyperparameters, with batch size 8; full gradient available). Bottom: The test performance of one final recommendation of hyperparameters found by dKG.

5 Discussion

Bayesian optimization is primarily applied to low-dimensional problems where we wish to find a good solution with a very small number of objective function evaluations. We have shown that in this context derivative information can be extremely useful: we can greatly decrease the number of objective function evaluations, especially when using the knowledge gradient acquisition function. Moreover, our approach provides considerably better performance even when derivative information is noisy and only available for some variables. Our batch Bayesian optimization method outperforms state-of-the-art techniques with and without derivatives on a number of common benchmarks in addition to significant kernel learning and k-nearest neighbor applications.

Bayesian optimization is increasingly being used to automate parameter tuning in machine learning, where objective functions can be extremely expensive to evaluate. For example, the parameters could even represent the architecture of a deep neural network. In the future we expect derivative information with Bayesian optimization to enable such promising applications, moving us towards fully automatic and principled approaches to statistical machine learning.

Acknowledgments

Wilson was partially supported by NSF IIS-1563887. Frazier, Poloczek, and Wu were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.

A The Posterior Distribution of the Multivariate GP

Suppose that we have sampled $f$ at $n$ points $X := \{x^{(1)}, x^{(2)}, \cdots, x^{(n)}\}$ so far and observed $y^{(1:n)}$, where each observation consists of the function value and the gradient at $x^{(i)}$. Then the posterior distribution is a multivariate Gaussian process with mean function $\mu^{(n)}(\cdot)$ and kernel function $\tilde{K}^{(n)}(\cdot, \cdot)$, where

$$\begin{aligned} \mu^{(n)}(x) &= \tilde{\mu}(x) + \tilde{K}(x, X) \left( \tilde{K}(X, X) + \operatorname{diag}\{\sigma^2(x^{(1)}), \cdots, \sigma^2(x^{(n)})\} \right)^{-1} \left( y^{(1:n)} - \tilde{\mu}(X) \right), \\ \tilde{K}^{(n)}(x_1, x_2) &= \tilde{K}(x_1, x_2) - \tilde{K}(x_1, X) \left( \tilde{K}(X, X) + \operatorname{diag}\{\sigma^2(x^{(1)}), \cdots, \sigma^2(x^{(n)})\} \right)^{-1} \tilde{K}(X, x_2). \end{aligned} \tag{A.1}$$

The rows and columns in Eq. (A.1) corresponding to partial derivatives (or function values) that were not observed are to be omitted.
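For concreteness, Eq. (A.1) reduces to standard linear algebra once the prior covariance blocks have been assembled (e.g., via the joint kernel of Eq. (3.1)). The following sketch, with our own argument names, uses a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

def gp_posterior(K_xX, K_XX, K_xx, noise_var, y, mu_x, mu_X):
    """Posterior mean and covariance per Eq. (A.1).

    K_xX      -> prior covariance between test outputs and observed outputs
    K_XX      -> prior covariance among observed outputs
    K_xx      -> prior covariance among test outputs
    noise_var -> observation noise variance, one entry per observed output
    y         -> observed function values and partial derivatives
    mu_x/mu_X -> prior means at the test and observed outputs
    """
    A = K_XX + np.diag(noise_var)
    # Solve A z = [y - mu_X, K_Xx] in one call instead of inverting A.
    solved = np.linalg.solve(A, np.column_stack([y - mu_X, K_xX.T]))
    mean = mu_x + K_xX @ solved[:, 0]
    cov = K_xx - K_xX @ solved[:, 1:]
    return mean, cov
```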

B Spectral Density Approximation of the Gaussian Process

In this paper, we use random features to approximate a Gaussian process in order to

• obtain a better discretization of the set $\mathbb{A}$ used in the inner optimization problem of dKG (see Sect. 3.4), and

• improve the scalability of kernel learning,

thereby following ideas of Hernandez-Lobato et al. [2014] and Lazaro-Gredilla et al. [2010].

Denote by $s(w)$ the Fourier dual of a stationary kernel function and by $p(w) := s(w)/\alpha$ the associated normalized density, where $\alpha = \int s(w)\, dw$. We approximate the Gaussian process with a finite set of $m$ random features, specifically

$$K(x_1, x_2) = \frac{2\alpha}{m}\, \mathbb{E}_{p(W, b)}\left[ \cos(W x_1 + b)^T \cos(W x_2 + b) \right],$$

where $W$ is an $m \times d$ random matrix with $W_{ij} \sim p(w)$ and $b$ is an $m \times 1$ random vector with $b_i \sim U(0, 2\pi)$ [Hernandez-Lobato et al., 2014, Sect. A].

Let $\Phi(x) = \sqrt{2\alpha/m}\, \cos(W x + b)$. We approximate the Gaussian process prior for $f$ via a Bayesian linear model $f(x) = \Phi(x)^T \theta$, where $\theta \sim \mathcal{N}(0, I)$. Conditioned on the collected data, the posterior of $\theta$ is multivariate normal with mean and covariance

$$m = \left( \Phi^T \Phi + \Sigma \right)^{-1} \Phi^T y, \qquad V = \left( \Phi^T \Phi + \Sigma \right)^{-1} \Sigma,$$

where $\Sigma = \operatorname{diag}(\sigma^2(x))$ denotes the variance for observations of function values and partial derivatives (see Sect. 3.1).

To sample from the posterior over the global optima, we first sample $m$ random features $\Phi^{(i)}(x)$ and their corresponding weights $\theta^{(i)}$, and then construct $f^{(i)}(x) = \Phi^{(i)}(x)^T \theta^{(i)}$. This is a sample from the approximate posterior of $f$ conditioned on the data, on which we locate global optima using a gradient-based optimizer (see also Sect. 2.1 in Hernandez-Lobato et al. [2014] and Sect. 3.2 in Shah and Ghahramani [2015]).

Recall that we re-estimate the hyperparameters of the kernel regularly as more function observations are obtained. To speed up the kernel learning when the number of samples $n$ exceeds $m = 1000$, we apply the above approximation. Then we seek hyperparameters that maximize $-\log \det\left( \Phi \Phi^T + \Sigma \right) - y^T \left[ \Phi \Phi^T + \Sigma \right]^{-1} y$. With this approximation, the computation time is $O(m^2 n)$ instead of $O(n^3)$.
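The following sketch illustrates the feature construction and a posterior sample $f^{(i)}$ for a squared exponential kernel, whose normalized spectral density $p(w)$ is Gaussian with covariance $\ell^{-2} I$. For simplicity it assumes a scalar noise variance rather than the per-output $\Sigma$, and all names are our own; the returned callable is cheap and differentiable, so its optima can be located with a gradient-based optimizer as described above.

```python
import numpy as np

def sample_posterior_function(X, y, noise_var, ell=1.0, alpha=1.0, m=500, rng=None):
    """Draw one approximate posterior sample f^(i) via m random features."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.standard_normal((m, d)) / ell       # rows drawn from p(w)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)   # b_i ~ U(0, 2*pi)

    def features(x):
        return np.sqrt(2.0 * alpha / m) * np.cos(x @ W.T + b)

    Phi = features(X)                            # n x m design matrix
    # Posterior over theta in the linear model f(x) = Phi(x)^T theta,
    # with prior theta ~ N(0, I) and scalar observation noise.
    A = Phi.T @ Phi / noise_var + np.eye(m)
    mean = np.linalg.solve(A, Phi.T @ y / noise_var)
    cov = np.linalg.inv(A)
    theta = rng.multivariate_normal(mean, cov)
    return lambda x: features(x) @ theta
```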


References

NYC trip record data. http://www.nyc.gov/html/tlc/, 2016. Last accessed on 2016-10-10.

D. Bingham. Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html, 2015.

E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103–112. Eurographics Association, 2010.

E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.

T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.

A. C. Duffy. An introduction to gradient computation by the discrete adjoint method. Technical report, Florida State University, 2009.

P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

J. Fu, H. Luo, J. Feng, and T.-S. Chua. Distilling reverse-mode automatic differentiation (DrMAD) for optimizing hyperparameters of deep neural networks. arXiv preprint arXiv:1601.00917, 2016.

J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.

M. Gelbart, J. Snoek, and R. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 250–259, Corvallis, Oregon, 2014. AUAI Press.

J. M. Hernandez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441–466, 2006.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

J. P. Kleijnen. Simulation-optimization via kriging and bootstrapping: A survey. Journal of Simulation, 8(4):241–250, 2014.

M. Lazaro-Gredilla, J. Quinonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 11:1865–1881, 2010.

D. J. Lizotte. Practical Bayesian optimization. University of Alberta, 2008.

J. Luketina, M. Berglund, and T. Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. arXiv preprint arXiv:1511.06727, 2015.

D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

S. Marmin, C. Chevalier, and D. Ginsbourger. Efficient batch-sequential Bayesian optimization with moments of truncated Gaussian vectors. arXiv preprint arXiv:1609.02700, 2016.

M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15. Citeseer, 2009.

V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2–13, 2013.

R.-E. Plessix. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495–503, 2006.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN 0-262-18253-X.

W. Scott, P. Frazier, and W. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320, 2015.

B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1015–1022, 2010.

K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.

J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Metrics optimization engine. http://yelp.github.io/MOE/, 2014. Last accessed on 2016-01-21.

J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. 2015.

A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In ICML, pages 1067–1075, 2013.

J. Wu and P. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, 2016.