
Sparse Orthogonal Variational Inference for Gaussian Processes

Jiaxin Shi (Tsinghua University), Michalis K. Titsias (DeepMind), Andriy Mnih (DeepMind)

Abstract

We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows us to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes, and report state-of-the-art results on CIFAR-10 among purely GP-based models.

1 INTRODUCTION

Gaussian processes (GPs) (Rasmussen and Williams, 2006) are nonparametric models for representing distributions over functions, which can be seen as a generalization of multivariate Gaussian distributions to infinite dimensions. The simplicity and elegance of these models have led to their wide adoption in uncertainty estimation for machine learning, including supervised learning (Williams and Rasmussen, 1996; Williams and Barber, 1998), sequential decision making (Srinivas et al., 2010), model-based planning (Deisenroth and Rasmussen, 2011), and unsupervised data analysis (Lawrence, 2005; Damianou et al., 2016).

Despite the successful application of these models, they suffer from O(N³) computation and O(N²) storage requirements given N training data points, which has motivated a large body of research on sparse GP methods (Csato and Opper, 2002; Lawrence et al., 2002; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005a; Titsias, 2009; Hensman et al., 2013; Bui et al., 2017). GPs have also been unfavourably compared to deep learning models for lacking representation learning capabilities.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Figure 1: The graphical model of SOLVE-GP (training and prediction). The prior f ∼ GP(0, k) is decomposed into two independent GPs (denoted by thick horizontal lines): f‖ ∼ p‖ and f⊥ ∼ p⊥. The variables connected by thick lines form a multivariate Gaussian. X, y denote the training data. X_* are the test inputs. f‖ = f‖(X), f⊥ = f⊥(X). u = f‖(Z) denote the inducing variables in standard SVGP methods. SOLVE-GP introduces another set of inducing variables v⊥ = f⊥(O) to summarize p⊥.

Sparse variational GP (SVGP) methods (Titsias, 2009; Hensman et al., 2013, 2015a) based on variational learning of inducing points have shown promise in addressing these limitations. Such methods leave the prior distribution of the GP model unchanged and instead enforce sparse structure in the posterior approximation through variational inference. This gives O(M²N + M³) computation and O(MN + M²) storage with M inducing points. Moreover, they allow us to perform mini-batch training by sub-sampling data points.


Successful application of SVGP has allowed scalable GP models to be trained on billions of data points (Salimbeni and Deisenroth, 2017). These advances in inference methods have also led to more flexibility in model design. A recent convolutional GP model (van der Wilk et al., 2017) encodes translation invariance by summing over GPs that take image patches as inputs. The inducing points, which can be interpreted as image patches in this model, play a role similar to that of convolutional filters in neural networks. Their work showed that it is possible to implement representation learning in GP models. Further extensions of such models into deep hierarchies (Blomqvist et al., 2018; Dutordoir et al., 2019) significantly boosted the performance of GPs for natural images.

As these works suggest, currently the biggest challenge in this area still lies in scalable inference. The computational cost of the widely used SVGP methods scales cubically with the number of inducing points, making it difficult to improve the flexibility of posterior approximations (Shi et al., 2019). For example, state-of-the-art models like deep convolutional GPs use only 384 inducing points for inference in each layer to get a manageable running time (Dutordoir et al., 2019).

We introduce a new framework, called SOLVE-GP, which allows increasing the number of inducing points given a fixed computational budget. It is based on decomposing the GP prior as the sum of a low-rank approximation using inducing points and a full-rank residual process. We observe that the standard SVGP methods can be reinterpreted under such a decomposition. By introducing another set of inducing variables for the orthogonal complement, we can increase the number of inducing points at a much lower additional computational cost. With our method, doubling the number of inducing points leads to a 2-fold increase in the cost of Cholesky decomposition, compared to the 8-fold increase for the original SVGP method. We show that SOLVE-GP is equivalent to a structured covariance approximation for SVGP defined over the union of the two sets of inducing points. Interestingly, under this interpretation our work can be seen as a generalization of the recently proposed decoupled-inducing-points method (Salimbeni et al., 2018). As the decoupled method often comes with a complex dual formulation, our framework provides a simpler derivation and a more intuitive understanding of it.

We conducted experiments on convolutional GPs and their deep variants. To the best of our knowledge, we are the first to train a purely GP-based model, without any neural network components, to achieve over 80% test accuracy on CIFAR-10. No data augmentation was used to obtain these results. Besides classification, we also evaluated our method on a range of regression datasets whose sizes range from tens of thousands to millions of data points. Our results show that SOLVE-GP is often competitive with the more expensive SVGP counterpart that uses the same number of inducing points, and outperforms SVGP when given the same computational budget.

2 BACKGROUND

Here, we briefly review Gaussian processes and sparse variational GP methods. A GP is an uncountable collection of random variables indexed by a real-valued vector x taking values in X ⊂ R^d, of which any finite subset has a multivariate Gaussian distribution. A GP is defined by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = Cov[f(x), f(x′)]:

f ∼ GP(m(x), k(x,x′)).

Let X = [x₁, x₂, . . . , x_N]^⊤ ∈ R^{N×d} be (the matrix containing) the training data points and f = f(X) ∈ R^N denote the corresponding function values. Similarly, we denote the test data points by X_* and their function values by f_*. Assuming a zero mean function, the joint distribution over f, f_* is given by:

p(f, f_*) := N( [f; f_*] | 0, [[K_ff, K_f*], [K_*f, K_**]] ),

where K_ff is an N × N kernel matrix with (i, j)-th entry k(x_i, x_j), and similarly [K_f*]_ij = k(x_i, x_*j), [K_**]_ij = k(x_*i, x_*j). In practice we often observe the training function values through some noisy measurements y, generated by the likelihood function p(y|f). For regression, the likelihood usually models independent Gaussian observation noise: y_n = f_n + ε_n, ε_n ∼ N(0, σ²). In this situation the exact posterior distribution p(f_*|y) can be computed in closed form:

f_* | y ∼ N( K_*f (K_ff + σ²I)^{-1} y, K_** − K_*f (K_ff + σ²I)^{-1} K_f* ).    (1)

As seen from Eq. (1), exact prediction involves the inverse of the matrix K_ff + σ²I, which requires O(N³) computation. For large datasets, we need to avoid the cubic complexity by resorting to approximations.
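As a concrete reference point, the following sketch computes the exact predictive distribution in Eq. (1); the `kernel` function, the function name, and the jitter term are illustrative assumptions, and the Cholesky-based solve is the usual way to implement the matrix inverse.

```python
# Exact GP regression prediction (Eq. (1)); a minimal sketch assuming a
# user-supplied kernel(A, B) that returns the matrix [k(a_i, b_j)].
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def exact_gp_predict(X, y, X_star, kernel, sigma2, jitter=1e-6):
    N = X.shape[0]
    Kff = kernel(X, X) + (sigma2 + jitter) * np.eye(N)   # K_ff + sigma^2 I
    Kfs = kernel(X, X_star)                              # K_f*
    Kss = kernel(X_star, X_star)                         # K_**
    L = cho_factor(Kff, lower=True)                      # the O(N^3) step
    mean = Kfs.T @ cho_solve(L, y)
    cov = Kss - Kfs.T @ cho_solve(L, Kfs)
    return mean, cov
```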

Inducing points have played a central role in previous works on scalable GP inference. The general idea is to summarize f with a small number of variables u = f(Z), where Z = [z₁, . . . , z_M]^⊤ ∈ R^{M×d} is a set of parameters, called inducing points, in the input space. The augmented joint distribution over u, f, f_* is p(f, f_*|u) p(u), where p(u) = N(0, K_uu) and K_uu denotes the kernel matrix of the inducing points, with (i, j)-th entry k(z_i, z_j).


There is a long history of developing sparse approximations for GPs by making different independence assumptions for the conditional distribution p(f, f_*|u) to reduce the computational cost (Quiñonero-Candela and Rasmussen, 2005b). However, these methods made modifications to the GP prior and tended to suffer from degeneracy and overfitting problems.

Sparse variational GP methods (SVGP), first proposed in Titsias (2009) and later extended for mini-batch training and non-conjugate likelihoods (Hensman et al., 2013, 2015a), provide an elegant solution to these problems. By reformulating the posterior inference problem as variational inference and restricting the variational distribution to be q(f, f_*, u) := q(u) p(f, f_*|u), the variational lower bound for minimizing KL[q(f, f_*, u) ‖ p(f, f_*, u|y)] simplifies to:

Σ_{n=1}^N E_{q(u) p(f_n|u)}[log p(y_n|f_n)] − KL[q(u) ‖ p(u)].    (2)

For GP regression, the bound has a collapsed form obtained by solving for the optimal q(u) and plugging it into (2) (Titsias, 2009):

log N(y | 0, Q_ff + σ²I) − (1/2σ²) tr(K_ff − Q_ff),    (3)

where Q_ff = K_fu K_uu^{-1} K_uf. Computing this objective requires O(M²N + M³) operations, in contrast to the O(N³) complexity of exact inference. The inducing points Z can be learned as variational parameters by maximizing the lower bound. More generally, if we do not collapse q(u) and let q(u) = N(m_u, S_u), where m_u, S_u are trainable parameters, we can use the uncollapsed bound for mini-batch training and non-Gaussian likelihoods (Hensman et al., 2013, 2015a).
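The collapsed bound in Eq. (3) is cheap to evaluate; the sketch below, under the same hypothetical `kernel` assumption as above, spells out the computation (the log-density under the low-rank covariance Q_ff + σ²I plus the trace correction).

```python
# Collapsed SVGP (SGPR) lower bound, Eq. (3); a minimal sketch.
import numpy as np
from scipy.stats import multivariate_normal

def sgpr_collapsed_bound(X, y, Z, kernel, sigma2, jitter=1e-6):
    N, M = X.shape[0], Z.shape[0]
    Kuu = kernel(Z, Z) + jitter * np.eye(M)
    Kuf = kernel(Z, X)
    Kff_diag = np.diag(kernel(X, X))
    L = np.linalg.cholesky(Kuu)
    B = np.linalg.solve(L, Kuf)              # B = L^{-1} K_uf, so Q_ff = B^T B
    Qff = B.T @ B
    logpdf = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=Qff + sigma2 * np.eye(N))
    trace_term = (Kff_diag - np.diag(Qff)).sum() / (2 * sigma2)
    return logpdf - trace_term
```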

3 SOLVE-GP

Despite the success of SVGP methods, their O(M³) complexity makes it difficult for the flexibility of the posterior approximation to grow with the dataset size. We present a new framework called Sparse OrthogonaL Variational infErence for Gaussian Processes (SOLVE-GP), which allows the use of an additional set of inducing points at a lower computational cost than the standard SVGP methods.

3.1 Reinterpreting SVGP

We start by reinterpreting SVGP methods using a simple reparameterization, which will then lead us to possible ways of improving the approximation. First we notice that the covariance of the conditional distribution p(f|u) = N(K_fu K_uu^{-1} u, K_ff − Q_ff) does not depend on u.¹ Therefore, samples from p(f|u) can be reparameterized as

f⊥ ∼ p⊥(f⊥) := N(0, K_ff − Q_ff),    f = f⊥ + K_fu K_uu^{-1} u.    (4)

¹ Note that kernel matrices like K_uu depend on Z instead of u; the subscript only indicates that this is the covariance matrix of u.

The reason for denoting the zero-mean component as f⊥ shall become clear later. Now we can reparameterize the augmented prior distribution p(f, u) as

u ∼ p(u),    f⊥ ∼ p⊥(f⊥),    f = K_fu K_uu^{-1} u + f⊥,    (5)

and the joint distribution of the GP model becomes

p(y, u, f⊥) = p(y | f⊥ + K_fu K_uu^{-1} u) p(u) p⊥(f⊥).    (6)

Posterior inference for f in the original model then turns into inference for u and f⊥. If we approximate the above GP model by considering a factorised approximation q(u) p⊥(f⊥), where q(u) is a variational distribution and p⊥(f⊥) is the prior distribution of f⊥ that also appears in Eq. (6), we arrive at the standard SVGP method. To see this, note that minimizing KL[q(u) p⊥(f⊥) ‖ p(u, f⊥|y)] is equivalent to maximizing the variational lower bound

E_{q(u) p⊥(f⊥)}[log p(y | f⊥ + K_fu K_uu^{-1} u)] − KL[q(u) ‖ p(u)],

which is the SVGP objective (Eq. (2)) using the reparameterization in Eq. (4).
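To make the reparameterization in Eqs. (4)-(5) concrete, the sketch below draws a joint sample of f by first sampling u ∼ p(u) and f⊥ ∼ p⊥(f⊥); the `kernel` helper, the function name, and the jitter terms are illustrative assumptions, not the paper's code.

```python
# Sampling f at inputs X via f = f_perp + K_fu K_uu^{-1} u (Eq. (4)).
import numpy as np

def sample_f_decomposed(X, Z, kernel, rng, jitter=1e-6):
    M, N = Z.shape[0], X.shape[0]
    Kuu = kernel(Z, Z) + jitter * np.eye(M)
    Kuf = kernel(Z, X)
    Kff = kernel(X, X)
    Kuu_inv_Kuf = np.linalg.solve(Kuu, Kuf)
    Qff = Kuf.T @ Kuu_inv_Kuf
    u = rng.multivariate_normal(np.zeros(M), Kuu)                          # u ~ p(u)
    f_perp = rng.multivariate_normal(np.zeros(N), Kff - Qff + jitter * np.eye(N))
    f_par = Kuu_inv_Kuf.T @ u                                              # K_fu K_uu^{-1} u
    return f_par + f_perp                                                  # a draw of f ~ GP(0, k) at X
```

Replacing the prior draw of u with a draw from q(u) gives samples from the SVGP approximation.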

Under this interpretation of the standard SVGP method, it becomes clear that we can modify the form of the variational distribution q(u) p⊥(f⊥) to improve the accuracy of the posterior approximation. There are two natural options: (i) keep p⊥(f⊥) as part of the approximation and alter q(u) so that it has some dependence on f⊥, or (ii) keep q(u) independent of f⊥ and replace p⊥(f⊥) with a more structured variational distribution q(f⊥). While both options lead to new bounds and more accurate approximations than the standard method, we defer the discussion of (i) to appendix A and focus on (ii), because it is amenable to large-scale training, as we will show next.

3.2 Orthogonal Decomposition

As suggested in section 3.1, we consider improving the variational distribution for f⊥. However, the complexity of inferring f⊥ is the same as for f, and thus cubic. Resolving this problem requires a better understanding of the reparameterization we used in section 3.1.

The key observation here is that the reparameterization in Eq. (5) corresponds to an orthogonal decomposition in the function space. For simplicity, we first derive this decomposition in the Reproducing Kernel Hilbert Space (RKHS) induced by k, and then generalize the result to the GP sample space. The RKHS H with kernel k is the closure of the space {Σ_{i=1}^ℓ c_i k(x′_i, ·) : c_i ∈ R, ℓ ∈ N₊, x′_i ∈ X}, with the inner product defined so that ⟨f, k(x, ·)⟩_H = f(x) for all f ∈ H. Let V denote the linear span of the kernel basis functions indexed by the inducing points: V := {Σ_{j=1}^M α_j k(z_j, ·) : α = [α₁, . . . , α_M]^⊤ ∈ R^M}. For any function f ∈ H, we can decompose it (Cheng and Boots, 2016) as

f = f‖ + f⊥,    f‖ ∈ V and f⊥ ⊥ V.

Assuming f‖ = Σ_{j=1}^M α′_j k(z_j, ·), we can solve for the coefficients (details in appendix B): α′ = k(Z,Z)^{-1} f(Z), where k(Z,Z) denotes the kernel matrix of Z. Therefore,

f‖(x) = k(x,Z) k(Z,Z)^{-1} f(Z),    f⊥ = f − f‖.    (7)

Here k(x,Z) := [k(z₁,x), . . . , k(z_M,x)]. Although Eq. (7) is derived by assuming f ∈ H, it motivates us to study the same decomposition for f ∼ GP(0, k). Then f‖ becomes k(·,Z) K_uu^{-1} u. Interestingly, we can verify that this is a sample from a GP with a zero mean function and covariance function Cov[f‖(x), f‖(x′)] = k(x,Z) K_uu^{-1} k(Z,x′). Similarly, we can show that f⊥ is a sample from another GP, and we denote these two independent GPs as p‖ and p⊥ (Hensman et al., 2017):

f‖ ∼ p‖ ≡ GP(0, k(x,Z) K_uu^{-1} k(Z,x′)),
f⊥ ∼ p⊥ ≡ GP(0, k(x,x′) − k(x,Z) K_uu^{-1} k(Z,x′)).
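For completeness, the verification mentioned above is a one-line covariance calculation, using u = f(Z) with Cov[u, u] = K_uu:

Cov[f‖(x), f‖(x′)] = k(x,Z) K_uu^{-1} Cov[u, u] K_uu^{-1} k(Z,x′) = k(x,Z) K_uu^{-1} k(Z,x′),
Cov[f‖(x), f⊥(x′)] = Cov[f‖(x), f(x′)] − Cov[f‖(x), f‖(x′)] = k(x,Z) K_uu^{-1} k(Z,x′) − k(x,Z) K_uu^{-1} k(Z,x′) = 0.

Since the two components are jointly Gaussian, zero cross-covariance implies that p‖ and p⊥ are independent.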

Marginalizing out the GPs at the training points X, it is easy to show that

f‖ = f‖(X) = K_fu K_uu^{-1} u ∼ N(0, K_fu K_uu^{-1} K_uf),
f⊥ = f⊥(X) ∼ N(0, K_ff − K_fu K_uu^{-1} K_uf).

This is exactly the decomposition we used in section 3.1, and the meaning of f⊥ becomes clear.

3.3 SOLVE-GP Lower Bound

The decomposition described in the previous section gives new insights for improving the variational distribution for f⊥. Specifically, we can introduce a second set of inducing variables v⊥ := f⊥(O) to approximate p⊥, as illustrated in Fig. 1. We call this second set O = [o₁, . . . , o_{M₂}]^⊤ ∈ R^{M₂×d} the orthogonal inducing points. The joint model distribution is then

p(y | f⊥ + K_fu K_uu^{-1} u) p(u) p⊥(f⊥|v⊥) p⊥(v⊥).

First notice that the standard SVGP methods correspond to using the variational distribution q(u) p⊥(v⊥) p⊥(f⊥|v⊥). To obtain better approximations we can replace the prior factor p⊥(v⊥) with a tunable variational factor q(v⊥) := N(m_v, S_v):

q(u, f⊥, v⊥) = q(u) q(v⊥) p⊥(f⊥|v⊥).

This gives the SOLVE-GP variational lower bound:

E_{q(u) q⊥(f⊥)}[log p(y | f⊥ + K_fu K_uu^{-1} u)] − KL[q(u) ‖ p(u)] − KL[q(v⊥) ‖ p⊥(v⊥)],    (8)

where q⊥(·) := ∫ p⊥(·|v⊥) q(v⊥) dv⊥ is the variational predictive distribution for p⊥. Simple computations show that q⊥(f⊥) = N(C_fv C_vv^{-1} m_v, S_f⊥), where S_f⊥ = C_ff + C_fv C_vv^{-1} (S_v − C_vv) C_vv^{-1} C_vf. Here C_ff := K_ff − Q_ff is the covariance matrix of p⊥ on the training inputs, and similarly for the other matrices. Because the likelihood factorizes given f (i.e., f⊥ + K_fu K_uu^{-1} u), the first term of Eq. (8) simplifies to Σ_{n=1}^N E_{q(u) q(f⊥(x_n))}[log p(y_n | f⊥(x_n) + k(x_n,Z) K_uu^{-1} u)]. Therefore, we only need to compute marginals of q⊥(f⊥) at individual data points. In the general setting, the SOLVE-GP lower bound can be maximized in O(NM̄² + M̄³) time per gradient update, where M̄ = max(M, M₂). In mini-batch training N is replaced by the batch size. The predictive density at test data points can be found in appendix D.
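The "simple computations" above are the standard Gaussian marginalization identity; spelled out, with p⊥(f⊥|v⊥) = N(C_fv C_vv^{-1} v⊥, C_ff − C_fv C_vv^{-1} C_vf) and q(v⊥) = N(m_v, S_v),

q⊥(f⊥) = ∫ p⊥(f⊥|v⊥) q(v⊥) dv⊥
       = N( C_fv C_vv^{-1} m_v, C_ff − C_fv C_vv^{-1} C_vf + C_fv C_vv^{-1} S_v C_vv^{-1} C_vf )
       = N( C_fv C_vv^{-1} m_v, C_ff + C_fv C_vv^{-1} (S_v − C_vv) C_vv^{-1} C_vf ).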

To intuitively understand the improvement over the standard SVGP methods, we derive a collapsed bound for GP regression using (8) and compare it to the Titsias (2009) bound. Plugging in the optimal q(u) and simplifying (see appendix C) gives the bound

log N(y | C_fv C_vv^{-1} m_v, Q_ff + σ²I) − (1/2σ²) tr(S_f⊥) − KL[N(m_v, S_v) ‖ N(0, C_vv)].    (9)

With an appropriate choice of q(v⊥) this bound can be tighter than the Titsias (2009) bound. For example, notice that when q(v⊥) is equal to the prior p⊥(v⊥), i.e., m_v = 0 and S_v = C_vv, the bound in (9) reduces to the one in (3). Another interesting special case arises when the variational distribution has the same covariance matrix as the prior (i.e., S_v = C_vv), while the mean m_v is learnable. Then the bound becomes

log N(y | C_fv C_vv^{-1} m_v, Q_ff + σ²I) − (1/2σ²) tr(K_ff − Q_ff) − (1/2) m_v^⊤ C_vv^{-1} m_v.    (10)

Here we see that the second set of inducing variables v⊥ mostly determines the mean prediction over y, which is zero in the Titsias (2009) bound (Eq. (3)).

Our method introduces another set of inducing points to improve the variational approximation. One natural question to ask is: how does this compare to the standard SVGP algorithm with the inducing points chosen to be the union of the two sets? We answer it as follows: 1) given the same number of inducing points, SOLVE-GP is more computationally efficient than the standard SVGP method; 2) SOLVE-GP can be interpreted as using a structured covariance in the variational approximation for SVGP.

Computational Benefits. For a quick comparison, we analyze the cost of the Cholesky decomposition in both methods. We assume the time complexity of decomposing an M × M matrix is cM³, where c is constant w.r.t. M. For SOLVE-GP, to compute the inverse and the determinant of K_uu and C_vv, we need their Cholesky factors, which cost c(M³ + M₂³). For SVGP with M inducing points, we need the Cholesky factor of K_uu, which costs cM³. Adding another M inducing points in SVGP leads to an 8-fold increase (i.e., from cM³ to 8cM³) in the cost of the Cholesky decomposition, compared to the 2-fold increase if we switch to SOLVE-GP with M₂ = M orthogonal inducing points. A more rigorous analysis is given in appendix D, where we enumerate all the cubic-cost operations needed when we compute the bound.

Structured Covariance. We can express our variational approximation w.r.t. the original GP. Let v = f(O) denote the function outputs at the orthogonal inducing points. We then have the following relationship between u, v and u, v⊥:

[u; v] = [[I, 0], [K_vu K_uu^{-1}, I]] [u; v⊥].

Therefore, the joint variational distribution over u and v that corresponds to the factorized q(u) q(v⊥) is also Gaussian. By change of variables we can express it as q(u, v) = N(m_{u,v}, S_{u,v}), where m_{u,v} = [m_u; m_v + K_vu K_uu^{-1} m_u] and

S_{u,v} = [[S_u, S_u K_uu^{-1} K_uv], [K_vu K_uu^{-1} S_u, S_v + K_vu K_uu^{-1} S_u K_uu^{-1} K_uv]].

From S_{u,v} we can see that our approach is different from making the mean-field assumption q(u, v) = q(u) q(v); instead, it captures the covariance between u and v through a structured parameterization.
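The expression for S_{u,v} follows directly from the linear map above: writing A = [[I, 0], [K_vu K_uu^{-1}, I]] and using the independence of u and v⊥ under q,

m_{u,v} = A [m_u; m_v] = [m_u; m_v + K_vu K_uu^{-1} m_u],
S_{u,v} = A [[S_u, 0], [0, S_v]] A^⊤ = [[S_u, S_u K_uu^{-1} K_uv], [K_vu K_uu^{-1} S_u, S_v + K_vu K_uu^{-1} S_u K_uu^{-1} K_uv]].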

4 EXTENSIONS

One direct extension of SOLVE-GP involves using more than two sets of inducing points by repeatedly applying the decomposition. However, this adds more complexity to the implementation. Below we show that the SOLVE-GP framework can be easily extended to different GP models where the standard SVGP method applies.

Inter-domain and Convolutional GPs. Similar to SVGP methods, SOLVE-GP can deal with inter-domain inducing points (Lazaro-Gredilla and Figueiras-Vidal, 2009), which lie in a different domain from the input space. The inducing variables u, which we used to represent outputs of the GP at the inducing points, are now defined as u = g(Z) := [g(z₁), . . . , g(z_M)]^⊤, where g is a function different from f that takes inputs in the domain of the inducing points. In convolutional GPs (van der Wilk et al., 2017), the input domain is the space of images, while the inducing points are in the space of image patches. The convolutional GP function is defined as f(x) = Σ_p w_p g(x^[p]), where g ∼ GP(0, k_g), x^[p] is the p-th patch in x, and w = [w₁, . . . , w_P]^⊤ are the assigned weights for different patches. In SOLVE-GP, we can choose either Z, O, or both to be inter-domain, as long as we can compute the covariance between u, v and f. For convolutional GPs, we let Z and O both be collections of image patches. Examples of the covariance matrices we need for this model include K_vf and K_vu (used for C_vv). They can be computed as

[K_vf]_ij = Cov[g(o_i), f(x_j)] = Σ_p w_p k_g(o_i, x_j^[p]),
[K_vu]_ij = Cov[g(o_i), g(z_j)] = k_g(o_i, z_j).
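A minimal sketch of these inter-domain covariances, assuming single-channel H×W images, square patches of size h, an RBF patch kernel, and given patch weights w; the helper names are illustrative, not from the authors' implementation.

```python
# Covariances K_vu and K_vf for a convolutional GP with inducing patches.
import numpy as np

def extract_patches(x, h):
    # All overlapping h x h patches of a single-channel image, flattened.
    H, W = x.shape
    return np.array([x[i:i+h, j:j+h].ravel()
                     for i in range(H - h + 1) for j in range(W - h + 1)])

def kg(A, B, lengthscale=1.0):
    # RBF kernel between flattened patches.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def conv_gp_covariances(O, Z, images, w, h):
    # O: orthogonal inducing patches (M2, h*h); Z: inducing patches (M, h*h).
    Kvu = kg(O, Z)                                                  # [Kvu]_ij = kg(o_i, z_j)
    Kvf = np.stack([kg(O, extract_patches(x, h)) @ w for x in images], axis=1)
    return Kvu, Kvf                                                 # [Kvf]_ij = sum_p w_p kg(o_i, x_j^[p])
```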

Deep GPs. We show that we can integrate SOLVE-GP with popular doubly stochastic variational inference algorithms for deep GPs (Salimbeni and Deisenroth, 2017). The joint distribution of a deep GP model with inducing variables in all layers is

p(y, f^{1:L}, u^{1:L}) = p(y | f^L) Π_{ℓ=1}^L [ p(f^ℓ | u^ℓ, f^{ℓ−1}) p(u^ℓ) ],

where we define f⁰ = X and f^ℓ is the output of the ℓ-th-layer GP. The doubly stochastic algorithm applies SVGP methods to each layer, conditioned on samples from the variational distribution in the previous layer. The variational distribution over u^{1:L}, f^{1:L} is q(f^{1:L}, u^{1:L}) = Π_{ℓ=1}^L [ p(f^ℓ | u^ℓ, f^{ℓ−1}) q(u^ℓ) ]. This gives a similar objective as in the single-layer case (Eq. (2)): E_{q(f^L)}[log p(y | f^L)] − Σ_{ℓ=1}^L KL[q(u^ℓ) ‖ p(u^ℓ)], where q(f^L) = ∫ Π_{ℓ=1}^L [ p(f^ℓ | u^ℓ, f^{ℓ−1}) q(u^ℓ) du^ℓ ] df^{1:L−1}. Extending this using SOLVE-GP is straightforward: we simply introduce orthogonal inducing variables v⊥^{1:L} for all layers, which yields the lower bound:

E_{q(u^L, f⊥^L)}[log p(y | f⊥^L + K^L_fu (K^L_uu)^{-1} u^L)] − Σ_{ℓ=1}^L { KL[q(u^ℓ) ‖ p(u^ℓ)] + KL[q(v⊥^ℓ) ‖ p⊥(v⊥^ℓ)] }.    (11)

The expression for q(u^L, f⊥^L) is given in appendix E.
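A structural sketch of the resulting doubly stochastic estimator is shown below; `layer.marginal` and `log_lik` are hypothetical helpers (the per-layer marginal would implement Eqs. (33)-(34) of appendix D), so this is only an outline under those assumptions.

```python
# Doubly stochastic SOLVE-GP bound for an L-layer deep GP: propagate
# reparameterized samples layer by layer and subtract per-layer KL terms
# for both inducing sets (Eq. (11)).
import numpy as np

def deep_solve_gp_bound(X_batch, y_batch, layers, log_lik, n_total):
    f = X_batch
    kl_total = 0.0
    for layer in layers:
        # Hypothetical API: mean/variance of q(f^l(.)) at the sampled inputs,
        # plus the KL terms KL[q(u^l)||p(u^l)] and KL[q(v^l_perp)||p_perp(v^l_perp)].
        mean, var, kl_u, kl_v = layer.marginal(f)
        f = mean + np.sqrt(var) * np.random.randn(*mean.shape)   # reparameterized sample
        kl_total += kl_u + kl_v
    scale = n_total / X_batch.shape[0]                           # mini-batch rescaling of the data term
    return scale * log_lik(y_batch, f).sum() - kl_total
```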


5 RELATED WORK

Many approximate algorithms have been proposed to overcome the computational limitations of GPs. The simplest of these are based on subsampling, such as subset-of-data training (Rasmussen and Williams, 2006) and the Nyström approximation (Williams and Seeger, 2001). Better approximations can be constructed by learning a set of inducing points to summarize the dataset. As mentioned in section 2, these works can be divided into approximations to the GP prior (SoR, DTC, FITC, etc.; Quiñonero-Candela and Rasmussen, 2005b) and sparse variational methods (Titsias, 2009; Hensman et al., 2013, 2015a).

Recently there have been many attempts to reduce the computational cost of using a large set of inducing points. A notable line of work (Wilson and Nickisch, 2015; Evans and Nair, 2018; Gardner et al., 2018) imposes grid structure on the locations of Z to perform fast structure-exploiting computations. However, to get such benefits Z needs to be fixed due to the structure constraints, which often suffers from the curse of dimensionality in the input space.

Another direction for allowing the use of more inducing points is the decoupled method (Cheng and Boots, 2017), where two different sets of inducing points are used for modeling the mean and the covariance function. This gives linear complexity in the number of mean inducing points, which allows using many more of them. Despite the increasing interest in decoupled inducing points (Havasi et al., 2018; Salimbeni et al., 2018), the method has not been well understood due to its complexity. We found that SOLVE-GP is closely related to a recent development of decoupled methods: the orthogonally decoupled variational GP (ODVGP, Salimbeni et al., 2018), as explained next.

Connection with Decoupled Inducing Points. If we set the β and γ inducing points in ODVGP (Salimbeni et al., 2018) to be Z and O, their approach becomes equivalent to using the variational distribution q′(u, v) = N(m′_{u,v}, S′_{u,v}), where

m′_{u,v} = [m_u; m_v + K_vu K_uu^{-1} m_u],
S′_{u,v} = [[S_u, S_u K_uu^{-1} K_uv], [K_vu K_uu^{-1} S_u, K_vv + K_vu K_uu^{-1} (S_u − K_uu) K_uu^{-1} K_uv]].

By comparing S_{u,v} to S′_{u,v}, we can see that we generalize their method by introducing S_v, which replaces the original residual K_vv − K_vu K_uu^{-1} K_uv (i.e., C_vv), so that we allow more flexible covariance modeling while still keeping the block structure. Thus ODVGP is a special case of SOLVE-GP where q(v⊥) is restricted to have the same covariance C_vv as the prior.

6 EXPERIMENTS

Since ODVGP is a special case of SOLVE-GP, we use M, M₂ to refer to |β| and |γ| in their algorithm, respectively.

6.1 1D Regression

We begin by illustrating our method on Snelson's 1D regression problem (Snelson and Ghahramani, 2006) with 100 training points and mini-batch size 20. We compare the following methods: SVGP with 5 and 10 inducing points, ODVGP (M = 5, M₂ = 100), and SOLVE-GP (M = 5, M₂ = 5).

The results are plotted in Fig. 2. First we can see that 5 inducing points are insufficient to summarize the training set: SVGP (M = 5) cannot fit the data well and underestimates the variance in regions beyond the training data. Increasing M to 10 fixes the issues, but requires 8x more computation for the Cholesky decomposition than using 5 inducing points². The decoupled formulation provides a cheaper alternative, and we have tried ODVGP (M = 5, M₂ = 100), which has 100 additional inducing points for modeling the mean function. Comparing Fig. 2a and Fig. 2b, we can see that this results in a much better fit for the mean function. However, the model still overestimates the predictive variance. As ODVGP is a special case of the SOLVE-GP framework, we can improve on it in terms of covariance modeling. As seen in Fig. 2c, adding 5 orthogonal inducing points can closely approximate the results of SVGP (M = 10), with only a 2-fold increase in the cost of the Cholesky decomposition relative to SVGP (M = 5).

6.2 Convolutional GP Models

One class of applications that benefit from the SOLVE-GP framework is the training of large, hierarchical GP models where the true posterior distribution is difficult to approximate with a small number of inducing points. Convolutional GPs (van der Wilk et al., 2017) and their deep variants (Blomqvist et al., 2018; Dutordoir et al., 2019) are such models. Their inducing points are feature detectors, just like CNN filters, which play a critical role in predictive performance. As explained in section 4, it is straightforward to apply SOLVE-GP to these models.

Convolutional GPs. We train convolutional GPs on the CIFAR-10 dataset, using GPs with TICK kernels (Dutordoir et al., 2019) to define the patch response functions.

² In practice the cost is negligible in this toy problem, but we are analyzing the theoretical complexity.


Figure 2: Posterior processes on the Snelson dataset: (a) SVGP, 5; (b) ODVGP, 5 + 100; (c) SOLVE-GP, 5 + 5; (d) SVGP, 10. Shaded bands correspond to intervals of ±3 standard deviations. The learned inducing locations are shown at the bottom of each figure, where + corresponds to Z; blue and dark triangles correspond to O in ODVGP and SOLVE-GP, respectively.

Figure 3: Test RMSE and predictive log-likelihoods during training on HouseElectric, comparing SOLVE-GP (1024+1024), SVGP (1024), SVGP (2048), and ODVGP (1024+8192): (a) without whitening; (b) with whitening (except ODVGP).

Table 1 shows the results for SVGP with 1K and 2K inducing points, SOLVE-GP (M = 1K, M₂ = 1K), and SVGP (M = 1.6K), which has a similar running time on GPU to SOLVE-GP. Clearly SOLVE-GP outperforms SVGP (M = 1K). It also outperforms SVGP (M = 1.6K), which has the same running time, and performs on par with the more expensive SVGP (M = 2K), which is very encouraging. This suggests that the structured covariance approximation is fairly accurate even for this large, non-conjugate model.

Deep Convolutional GPs. We further extend SOLVE-GP to deep convolutional GPs using the techniques described in section 4. We experiment with 2-layer and 3-layer models that have 1K inducing points in the output layer and 384 inducing points in the other layers. The results are summarized in Table 3. These models are already quite slow to train on a single GPU, as indicated by the time per iteration. SOLVE-GP allows us to double the number of inducing points in each layer with only a 2-fold increase in computation. This gives superior performance in both accuracy and test predictive likelihood. The double-size SVGP takes a week to run and is included only for comparison purposes.

As shown above, on both single-layer and deep convolutional GPs, we improve the state-of-the-art results for CIFAR-10 classification by 3-4 percentage points. This leads to more than 80% accuracy on CIFAR-10 with a purely GP-based model, without any neural network components, closing the gap between GP/kernel regression and the CNN baselines presented in Novak et al. (2019); Arora et al. (2019). Note that all the results are obtained without data augmentation.

Table 1: Convolutional GPs for CIFAR-10 classification. Previous SOTA is 64.6% by SVGP with 1K inducing points (van der Wilk et al., 2017).

Method      M(+M₂)     Test Acc   Test LL   Time
SVGP        1K         66.07%     -1.59     0.241 s/iter
SVGP        1.6K       67.18%     -1.54     0.380 s/iter
SOLVE-GP    1K + 1K    68.19%     -1.51     0.370 s/iter
SVGP        2K         68.06%     -1.48     0.474 s/iter

6.3 Regression Benchmarks

Besides classification experiments, we evaluate our method on 10 regression datasets, with sizes ranging from tens of thousands to millions. The settings follow Wang et al. (2019) and are described in detail in appendix G. We implemented SVGP with M = 1024 and 2048 inducing points, ODVGP and SOLVE-GP (M = 1024, M₂ = 1024), as well as SVGP with M = 1536 inducing points, which has roughly the same training time per iteration on GPU as the SOLVE-GP objective. An attractive property of ODVGP is that by restricting the covariance of q(v⊥) to be the same as the prior covariance C_vv, it can use a far larger M₂, because the complexity is linear in M₂ when sub-sampling the columns of K_vv for each gradient update. Thus, for a fair comparison, we also include ODVGP (M₂ = 8096), where in each iteration 1024 columns of K_vv are sampled to estimate the gradient. Other experimental details are given in appendix G.

Table 2: Test log-likelihood values for the regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted, and asterisks indicate statistical significance.

                       Kin40k          Protein         KeggDirected    KEGGU           3dRoad          Song            Buzz            HouseElectric
N                      25,600          29,267          31,248          40,708          278,319         329,820         373,280         1,311,539
d                      8               9               20              27              3               90              77              9
SVGP (1024)            0.094(0.003)    -0.963(0.006)   0.967(0.005)    0.678(0.004)    -0.698(0.002)   -1.193(0.001)   -0.079(0.002)   1.304(0.002)
SVGP (1536)            0.129(0.003)    -0.949(0.005)   0.944(0.006)    0.673(0.004)    -0.674(0.003)   -1.193(0.001)   -0.079(0.002)   1.304(0.003)
ODVGP (1024 + 1024)    0.137(0.003)    -0.956(0.005)   -0.199(0.067)   0.105(0.033)    -0.664(0.003)   -1.193(0.001)   -0.078(0.001)   1.317(0.002)
ODVGP (1024 + 8096)    0.144(0.002)    -0.946(0.005)   -0.136(0.063)   0.109(0.033)    -0.657(0.003)   -1.193(0.001)   -0.079(0.001)   1.319(0.004)
SOLVE-GP (1024 + 1024) *0.187(0.002)   -0.943(0.005)   0.973(0.003)    0.680(0.003)    -0.659(0.002)   -1.192(0.001)   *-0.071(0.001)  *1.333(0.003)
SVGP (2048)            0.137(0.003)    -0.940(0.005)   0.907(0.003)    0.665(0.004)    -0.669(0.002)   -1.192(0.001)   -0.079(0.002)   1.304(0.003)

Table 3: Deep convolutional GPs for CIFAR-10 classification. Previous SOTA is 76.17% by a 3-layer model with 384 inducing points in all layers (Dutordoir et al., 2019).

(a) 2-layer model
              SVGP           SOLVE-GP                 SVGP
M(+M₂)        384, 1K        384 + 384, 1K + 1K       768, 2K
Test Acc      76.35%         77.80%                   77.46%
Test LL       -1.04          -0.98                    -0.98
Time          0.392 s/iter   0.657 s/iter             1.104 s/iter

(b) 3-layer model
              SVGP           SOLVE-GP                         SVGP
M(+M₂)        384, 384, 1K   384 + 384, 384 + 384, 1K + 1K    768, 768, 2K
Test Acc      78.76%         80.30%                           80.33%
Test LL       -0.88          -0.79                            -0.82
Time          0.418 s/iter   0.752 s/iter                     1.246 s/iter

We report the predictive log-likelihoods on test data in Table 2. For space reasons, we provide the results on two small datasets (Elevators, Bike) in appendix H. We can see that the performance of SOLVE-GP is competitive with SVGP (M = 2048), which involves a 4x more expensive Cholesky decomposition. Perhaps surprisingly, despite using a less flexible covariance in the variational distribution, SOLVE-GP often outperforms SVGP (M = 2048). We believe this is due to the optimization difficulties introduced by the 2048×2048 covariance matrix and will test this hypothesis on the HouseElectric dataset below. On most datasets, using a large number of additional inducing points for modeling the mean function did improve the performance, as shown by the comparison between ODVGP (M₂ = 1024) and ODVGP (M₂ = 8096). However, more flexible covariance modeling seems to be more important, as SOLVE-GP outperforms ODVGP (M₂ = 8096) on all datasets except 3dRoad.

In Fig. 3a we plot the evolution of test RMSE and test log-likelihood during training on HouseElectric. Interestingly, ODVGP (M₂ = 8096) performs on par with SOLVE-GP early in training before falling substantially behind it. The beginning stage is likely where the additional inducing points give good predictions but are not yet in the best configuration for maximizing the training lower bound. This phenomenon is also observed on Protein, Elevators, and Kin40k. We believe such a mismatch between the training lower bound and predictive performance is caused by fixing the covariance matrix of q(v⊥) to the prior covariance. SVGP (M = 2048) does not improve over SVGP (M = 1024) and is outperformed by SOLVE-GP. As suggested above, this might be due to the difficulty of optimising large covariance matrices. To verify this, we tried the "whitening" trick (Murray and Adams, 2010; Hensman et al., 2015b), described in appendix F, which is often used to make optimization easier by reducing the correlation in the posterior distributions. As shown in Fig. 3b, the performance of SVGP (M = 2048) and SOLVE-GP becomes similar with whitening. We did not use whitening in ODVGP because it has a slightly different parameterization to allow sub-sampling K_vv.

7 CONCLUSION

We proposed SOLVE-GP, a new variational inference framework for GPs using inducing points that unifies and generalizes previous sparse variational methods. It increases the number of inducing points we can use for a fixed computational budget, which allows us to improve the performance of large, hierarchical GP models at a manageable computational cost. Future work includes experiments on challenging datasets like ImageNet and investigating other ways to improve the variational distribution, as mentioned in section 3.1.


Acknowledgements

We thank Alex Matthews and Yutian Chen for helpful suggestions on improving the paper.

References

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional Gaussian processes. arXiv preprint arXiv:1810.03052, 2018.
Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18(104):1–72, 2017.
Ching-An Cheng and Byron Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4410–4418, 2016.
Ching-An Cheng and Byron Boots. Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pages 5184–5194, 2017.
L. Csato and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668, 2002.
Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17(42):1–62, 2016.
Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011.
Vincent Dutordoir, Mark van der Wilk, Artem Artemev, Marcin Tomczak, and James Hensman. Translation insensitivity for deep convolutional Gaussian processes. arXiv preprint arXiv:1902.05888, 2019.
Trefor Evans and Prasanth Nair. Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pages 1416–1425, 2018.
Jacob Gardner, Geoff Pleiss, Ruihan Wu, Kilian Weinberger, and Andrew Wilson. Product kernel interpolation for scalable Gaussian processes. In Artificial Intelligence and Statistics, pages 1407–1416, 2018.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics, pages 249–256, 2010.
Marton Havasi, Jose Miguel Hernandez-Lobato, and Juan Jose Murillo-Fuentes. Deep Gaussian processes with decoupled inducing inputs. arXiv preprint arXiv:1801.02939, 2018.
James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351–360, 2015a.
James Hensman, Alexander G. Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015b.
James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–151, 2017.
Daniel Hernandez-Lobato, Jose M. Hernandez-Lobato, and Pierre Dupont. Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems, pages 280–288, 2011.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems. MIT Press, 2002.
Neil Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov):1783–1816, 2005.
Miguel Lazaro-Gredilla and Anibal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.
Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.
Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019.
J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005a.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005b.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.
Hugh Salimbeni, Ching-An Cheng, Byron Boots, and Marc Deisenroth. Orthogonally decoupled variational Gaussian processes. In Advances in Neural Information Processing Systems, pages 8711–8720, 2018.
M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence and Statistics. MIT Press, 2003.
Jiaxin Shi, Mohammad Emtiyaz Khan, and Jun Zhu. Scalable training of inference networks for Gaussian-process models. In International Conference on Machine Learning, pages 5758–5768, 2019.
Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
Michalis Titsias and Miguel Lazaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979, 2014.
Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.
Ke Alexander Wang, Geoff Pleiss, Jacob R. Gardner, Stephen Tyree, Kilian Q. Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. arXiv preprint arXiv:1903.08114, 2019.
Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.


A Tighter Sparse Variational Bounds for GP Regression

As mentioned in section 3.1, another way to improve the variational distribution q(u) p⊥(f⊥) in SVGP is to make u and f⊥ dependent. The best possible approximation of this type is obtained by setting q(u) to the optimal exact posterior conditional q*(u) = p(u | f⊥, y). The corresponding collapsed bound for GP regression can be derived by analytically marginalising out u from the joint model in Eq. (6),

p(y | f⊥) = ∫ p(y | f⊥ + K_fu K_uu^{-1} u) p(u) du = N(y | f⊥, Q_ff + σ²I),    (12)

and then taking the expectation of its logarithm under the approximating distribution p⊥(f⊥):

E_{p⊥(f⊥)} log N(y | f⊥, Q_ff + σ²I).    (13)

This bound has the closed form

log N(y | 0, Q_ff + σ²I) − (1/2) tr[(Q_ff + σ²I)^{-1} (K_ff − Q_ff)].    (14)

Applying the matrix inversion lemma to (Q_ff + σ²I)^{-1}, we obtain an equivalent form that can be directly compared with Eq. (3):

log N(y | 0, Q_ff + σ²I) − (1/2σ²) tr(K_ff − Q_ff)  [= Eq. (3)]
+ (1/2σ⁴) tr[K_fu (K_uu + σ^{-2} K_uf K_fu)^{-1} K_uf (K_ff − Q_ff)],    (15)

where the first two terms recover Eq. (3), suggesting that this is a tighter bound than the Titsias (2009) bound. This bound is not amenable to large-scale datasets because of its O(N²) storage and O(MN²) computation time (dominated by the matrix multiplication K_uf K_ff) requirements. However, it is still of theoretical interest and can be applied to medium-sized regression datasets, just like the SGPR algorithm using the Titsias (2009) bound.

B Details of Orthogonal Decomposition

In section 3.2 we described the following orthogonal decomposition for f ∈ H, where H is the RKHS induced by the kernel k:

f = f‖ + f⊥,    f‖ ∈ V and f⊥ ⊥ V.    (16)

Here V is the subspace spanned by the inducing basis: V = {Σ_{j=1}^M α_j k(z_j, ·) : α = [α₁, . . . , α_M]^⊤ ∈ R^M}. Since f‖ ∈ V, we let f‖ = Σ_{j=1}^M α′_j k(z_j, ·). According to the properties of orthogonal projection, we have

⟨f, g⟩_H = ⟨f‖, g⟩_H,    ∀g ∈ V,    (17)

where ⟨·,·⟩_H is the RKHS inner product, which satisfies the reproducing property ⟨f, k(x, ·)⟩_H = f(x). Similarly, let g = Σ_{j=1}^M β_j k(z_j, ·). Then ⟨f, g⟩_H = Σ_{j=1}^M β_j f(z_j) and ⟨f‖, g⟩_H = Σ_{i=1}^M Σ_{j=1}^M α′_i β_j k(z_i, z_j). Plugging these into Eq. (17) and rearranging the terms, we have

β^⊤ (f(Z) − k(Z,Z) α′) = 0,    ∀β ∈ R^M,    (18)

where f(Z) = [f(z₁), . . . , f(z_M)]^⊤, k(Z,Z) is the matrix with (i, j)-th entry k(z_i, z_j), and α′ = [α′₁, . . . , α′_M]^⊤. Therefore,

α′ = k(Z,Z)^{-1} f(Z),    (19)

and it follows that f‖(x) = k(x,Z) k(Z,Z)^{-1} f(Z), where k(x,Z) = [k(z₁,x), . . . , k(z_M,x)]. The above analysis arrives at the decomposition:

f‖ = k(·,Z) k(Z,Z)^{-1} f(Z),    f⊥ = f − f‖.    (20)

Although the derivation from Eq. (16) to (19) relies on the fact that f ∈ H, so that the inner product is well-defined, the decomposition in Eq. (20) is valid for any function f on X. This motivates us to study it for f ∼ GP(0, k). Substituting u for f(Z) and K_uu for k(Z,Z), we have, for f ∼ GP(0, k),

f‖ = k(·,Z) K_uu^{-1} u ∼ p‖ ≡ GP(0, k(x,Z) K_uu^{-1} k(Z,x′)),    (21)
f⊥ ∼ p⊥ ≡ GP(0, k(x,x′) − k(x,Z) K_uu^{-1} k(Z,x′)).    (22)


C The Collapsed SOLVE-GP Lower Bound

We derive the collapsed SOLVE-GP lower bound in Eq. (9) by seeking the optimal q(u) that is independent of f⊥. First we rearrange the terms in the uncollapsed SOLVE-GP bound (Eq. (8)) as

E_{q(u)}{ E_{q⊥(f⊥)}[ log N(y | f⊥ + K_fu K_uu^{-1} u, σ²I) ] } − KL[q(u) ‖ p(u)] − KL[q(v⊥) ‖ p⊥(v⊥)],    (23)

where q⊥(f⊥) = N(m_f⊥, S_f⊥), with m_f⊥ = C_fv C_vv^{-1} m_v and S_f⊥ = C_ff + C_fv C_vv^{-1} (S_v − C_vv) C_vv^{-1} C_vf. In the first term we can simplify the expectation over f⊥ as:

E_{q⊥(f⊥)} log N(y | f⊥ + K_fu K_uu^{-1} u, σ²I)
= E_{q⊥(f⊥)}[ −(N/2) log 2π − (N/2) log σ² − (1/2σ²) (y − f⊥ − K_fu K_uu^{-1} u)^⊤ (y − f⊥ − K_fu K_uu^{-1} u) ]
= [ −(N/2) log 2π − (N/2) log σ² − (1/2σ²) (y − m_f⊥ − K_fu K_uu^{-1} u)^⊤ (y − m_f⊥ − K_fu K_uu^{-1} u) ] − E_{q⊥(f⊥)}[ (1/2σ²) (f⊥ − m_f⊥)^⊤ (f⊥ − m_f⊥) ]
= log N(y | K_fu K_uu^{-1} u + m_f⊥, σ²I) − (1/2σ²) tr(S_f⊥).    (24)

Plugging this into Eq. (23) and rearranging the terms, we have

E_{q(u)}[ log N(y | K_fu K_uu^{-1} u + m_f⊥, σ²I) ] − KL[q(u) ‖ p(u)]   [≤ log ∫ N(y | K_fu K_uu^{-1} u + m_f⊥, σ²I) p(u) du]
− (1/2σ²) tr(S_f⊥) − KL[q(v⊥) ‖ p⊥(v⊥)].    (25)

Clearly the leading two terms form a variational lower bound for the joint distribution N(y | K_fu K_uu^{-1} u + m_f⊥, σ²I) p(u). The optimal q(u) turns it into the log marginal likelihood:

log ∫ N(y | K_fu K_uu^{-1} u + m_f⊥, σ²I) p(u) du = log N(y | m_f⊥, Q_ff + σ²I).    (26)

Plugging this back, we have the collapsed SOLVE-GP bound in Eq. (9):

log N(y | C_fv C_vv^{-1} m_v, Q_ff + σ²I) − (1/2σ²) tr(S_f⊥) − KL[N(m_v, S_v) ‖ N(0, C_vv)].    (27)

Moreover, we can find the optimal q*(v⊥) = N(m*_v, S*_v) by setting the derivatives w.r.t. m_v and S_v to zero:

m*_v = C_vv [C_vv + C_vf A^{-1} C_fv]^{-1} C_vf A^{-1} y,    (28)
S*_v = C_vv [C_vv + σ^{-2} C_vf C_fv]^{-1} C_vv,    (29)

where A = Q_ff + σ²I. Then the collapsed bound with the optimal q(v⊥) is

log N(y | C_fv C_vv^{-1} m*_v, A) − (1/2σ²) tr[C_ff − B(B + σ²I)^{-1} B] − KL[N(m*_v, S*_v) ‖ N(0, C_vv)],    (30)

where B = C_fv C_vv^{-1} C_vf.

D Computational Details

D.1 Training

To compute the lower bound in Eq. (8), we write it as

Σ_{n=1}^N E_{q(f(x_n); Θ)}[log p(y_n | f(x_n))] − KL[q(u) ‖ p(u)] − KL[q(v⊥) ‖ p⊥(v⊥)],    (31)


Algorithm 1 The SOLVE-GP lower bound via Cholesky decomposition. We parameterize the variational covariance matrices with their Cholesky factors S_u = L_u L_u^⊤, S_v = L_v L_v^⊤. A = L⁰_u \ K_uv denotes the solution of L⁰_u A = K_uv. ⊙ denotes elementwise multiplication. The differences from SVGP are shown in blue.

Input: X (training inputs), y (targets), Z, O (inducing points), m_u, L_u, m_v, L_v (variational parameters)
1: K_uu = k(Z,Z), K_vv = k(O,O)
2: L⁰_u = Cholesky(K_uu), K_uv = k(Z,O), A := L⁰_u \ K_uv, C_vv = K_vv − A^⊤A, L⁰_v = Cholesky(C_vv)
3: K_uf = k(Z,X), K_vf = k(O,X)
4: B := L⁰_u \ K_uf, C_vf = K_vf − A^⊤B, D := L⁰_v \ C_vf
5: E := (L⁰_u)^⊤ \ B, F := L_u^⊤ E, G := (L⁰_v)^⊤ \ D, H := L_v^⊤ G
6: μ(X) = E^⊤ m_u + G^⊤ m_v
7: σ²(X) = diag(K_ff) + (F⊙F)^⊤1 − (B⊙B)^⊤1 + (H⊙H)^⊤1 − (D⊙D)^⊤1
8: Compute LLD = Σ_{n=1}^N E_{N(μ(x_n), σ²(x_n))} log p(y_n | f(x_n)) in closed form or using quadrature/Monte Carlo.
9: function Compute_KL(m, L, L⁰)
10:   P = L⁰ \ L, a = L⁰ \ m
11:   return log(diag(L⁰))^⊤1 − log(diag(L))^⊤1 + 1/2 ((P⊙P)^⊤1 + a^⊤a − M)
12: end function
13: KL_u = Compute_KL(m_u, L_u, L⁰_u), KL_v = Compute_KL(m_v, L_v, L⁰_v)
14: return LLD − KL_u − KL_v

where Θ := {m_u, S_u, m_v, S_v, Z, O} and q(f(x_n); Θ) defines the marginal distribution of f = f⊥ + K_fu K_uu^{-1} u for the n-th data point given u ∼ q(u) and f⊥ ∼ q⊥(f⊥). We can write q(f(x_n); Θ) as

q(f(x_n); Θ) = N(μ(x_n), σ²(x_n)),    (32)

where

μ(x_n) = k(x_n,Z) K_uu^{-1} m_u + c(x_n,O) C_vv^{-1} m_v,    (33)
σ²(x_n) = k(x_n,Z) K_uu^{-1} S_u K_uu^{-1} k(Z,x_n) + c(x_n,x_n) + c(x_n,O) C_vv^{-1} (S_v − C_vv) C_vv^{-1} c(O,x_n).    (34)

Here c(x,x′) := k(x,x′) − k(x,Z) K_uu^{-1} k(Z,x′) denotes the covariance function of p⊥. The univariate expectation of log p(y_n | f(x_n)) under q(f(x_n); Θ) can be computed in closed form (e.g., for Gaussian likelihoods) or using quadrature (Hensman et al., 2015b). It can also be estimated by Monte Carlo with the reparameterization trick (Kingma and Welling, 2013; Titsias and Lazaro-Gredilla, 2014; Rezende et al., 2014) to propagate gradients. For large datasets, an unbiased estimate of the sum can be used for mini-batch training: (N/|B|) Σ_{(x,y)∈B} E_{q(f(x); Θ)}[log p(y | f(x))], where B denotes a small batch of data points.

Besides the log-likelihood term, we need to compute the two KL divergence terms:

KL[q(u) ‖ p(u)] = 1/2 [ log det K_uu − log det S_u − M + tr(K_uu^{-1} S_u) + m_u^⊤ K_uu^{-1} m_u ],    (35)
KL[q(v⊥) ‖ p⊥(v⊥)] = 1/2 [ log det C_vv − log det S_v − M₂ + tr(C_vv^{-1} S_v) + m_v^⊤ C_vv^{-1} m_v ].    (36)

We note that if the blue parts in Eqs. (31) to (34) are removed, then we recover the SVGP lower bound in Eq. (2). An implementation of the above computations using the Cholesky decomposition is shown in Algorithm 1.
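For concreteness, the following NumPy/SciPy sketch mirrors Algorithm 1 for a Gaussian likelihood; the `kernel` function, the jitter terms, and the function name are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of Algorithm 1 (SOLVE-GP bound) with Gaussian noise.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def solve_gp_bound(X, y, Z, O, m_u, L_u, m_v, L_v, kernel, sigma2, jitter=1e-6):
    Kuu = kernel(Z, Z) + jitter * np.eye(Z.shape[0])
    Kvv = kernel(O, O) + jitter * np.eye(O.shape[0])
    L0u = cholesky(Kuu, lower=True)
    A = solve_triangular(L0u, kernel(Z, O), lower=True)       # L0u A = K_uv
    Cvv = Kvv - A.T @ A
    L0v = cholesky(Cvv + jitter * np.eye(O.shape[0]), lower=True)
    B = solve_triangular(L0u, kernel(Z, X), lower=True)       # L0u B = K_uf
    Cvf = kernel(O, X) - A.T @ B
    D = solve_triangular(L0v, Cvf, lower=True)
    E = solve_triangular(L0u.T, B, lower=False)                # E = K_uu^{-1} K_uf
    F = L_u.T @ E
    G = solve_triangular(L0v.T, D, lower=False)                # G = C_vv^{-1} C_vf
    H = L_v.T @ G
    mu = E.T @ m_u + G.T @ m_v                                 # Eq. (33)
    kff_diag = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])
    var = kff_diag + (F**2).sum(0) - (B**2).sum(0) + (H**2).sum(0) - (D**2).sum(0)  # Eq. (34)
    # Expected log-likelihood under N(mu, var), closed form for Gaussian noise.
    lld = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + ((y - mu)**2 + var) / sigma2)

    def compute_kl(m, L, L0):
        P = solve_triangular(L0, L, lower=True)
        a = solve_triangular(L0, m, lower=True)
        return (np.log(np.diag(L0)).sum() - np.log(np.diag(L)).sum()
                + 0.5 * ((P**2).sum() + a @ a - m.shape[0]))

    return lld - compute_kl(m_u, L_u, L0u) - compute_kl(m_v, L_v, L0v)
```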

D.2 Prediction

We can predict the function value at a test point x_* with the approximate posterior by substituting x_* for x_n in Eq. (32). For multiple test points X_*, we denote the joint predictive density by N(f_* | μ_*, Σ_*), where the predicted mean and covariance are

μ_* = K_*u K_uu^{-1} m_u + C_*v C_vv^{-1} m_v,    (37)
Σ_* = K_*u K_uu^{-1} S_u K_uu^{-1} K_u* + C_** − C_*v C_vv^{-1} (C_vv − S_v) C_vv^{-1} C_v*.    (38)
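A sketch of Eqs. (37)-(38), reusing the same hypothetical `kernel` and the Cholesky factors L⁰_u of K_uu and L⁰_v of C_vv from the training sketch above.

```python
# SOLVE-GP predictive mean and covariance at test inputs Xs.
import numpy as np
from scipy.linalg import solve_triangular

def solve_gp_predict(Xs, Z, O, m_u, S_u, m_v, S_v, kernel, L0u, L0v):
    A = solve_triangular(L0u, kernel(Z, O), lower=True)
    Bs = solve_triangular(L0u, kernel(Z, Xs), lower=True)      # L0u^{-1} k(Z, X_*)
    Csv = kernel(Xs, O) - Bs.T @ A                              # C_{*v}
    Es = solve_triangular(L0u.T, Bs, lower=False)               # K_uu^{-1} k(Z, X_*)
    Ds = solve_triangular(L0v, Csv.T, lower=True)
    Gs = solve_triangular(L0v.T, Ds, lower=False)               # C_vv^{-1} C_{v*}
    Css = kernel(Xs, Xs) - Bs.T @ Bs - Ds.T @ Ds                # C_** - C_{*v} C_vv^{-1} C_{v*}
    mean = Es.T @ m_u + Gs.T @ m_v                              # Eq. (37)
    cov = Es.T @ S_u @ Es + Css + Gs.T @ S_v @ Gs               # Eq. (38)
    return mean, cov
```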


Table 4: Cubic-cost operations in SOLVE-GP and SVGP, following the implementation in Algorithm 1.

| Operation | SVGP | SOLVE-GP |
|---|---|---|
| Matrix multiplication | $O(NM^2) \times 1$ | $O(NM^2) \times 1$, $O(NM_2^2) \times 1$, $O(NMM_2) \times 1$, $O(MM_2^2) \times 1$ |
| Cholesky | $O(M^3) \times 1$ | $O(M^3) \times 1$, $O(M_2^3) \times 1$ |
| Solving triangular matrix equations | $O(M^3) \times 1$, $O(NM^2) \times 2$ | $O(M^3) \times 1$, $O(NM^2) \times 2$, $O(M_2^3) \times 1$, $O(NM_2^2) \times 2$, $O(M^2M_2) \times 1$ |

[Figure 4: bar charts omitted. Panels: (a) N ≈ M; (b) M ≪ N. Bars compare SVGP (M), SVGP (2M), and SOLVE-GP (M+M), broken down by operation type: Matmul, Cholesky, Trisolve.]

Figure 4: Comparison of computational cost for SVGP and SOLVE-GP. For each method and each type of cubic-cost operation, we plot the factor of increase in cost compared to a single operation on M × M matrices.

D.3 Computational Complexity

As mentioned in Section 3.3, the time complexity of SOLVE-GP is $O(N\tilde{M}^2 + \tilde{M}^3)$ per gradient update, where $\tilde{M} = \max(M, M_2)$ and $N$ is the batch size. Here we provide a more fine-grained analysis by counting cubic-cost operations and comparing to the standard SVGP method. We count all the cubic-cost operations in Algorithm 1, namely matrix multiplications, Cholesky decompositions, and triangular matrix solves, for both SVGP and SOLVE-GP. The results are summarized in Table 4.

For comparison purposes, we study two cases of mini-batch training: (i) N ≈ M and (ii) M ≪ N. We consider SOLVE-GP with M₂ = M, which has 2M inducing points in total, and compare to SVGP with M and 2M inducing points. For each method and each type of operation, we plot the factor of increase in cost compared to a single operation on M × M matrices. For instance, when N ≈ M (Fig. 4a), SVGP with M inducing points requires solving three triangular matrix equations for M × M matrices. Doubling the number of inducing points in SVGP increases the cost by a factor of 8, plotted as 24 for SVGP (2M). In contrast, SOLVE-GP with M orthogonal inducing points only needs to solve 7 triangular matrix equations for M × M matrices. The comparison for the case M ≪ N is shown in Fig. 4b. In this case SOLVE-GP additionally introduces one O(M³) matrix multiplication operation, but overall the algorithm is still much faster than SVGP (2M), given the speed-up in Cholesky decomposition and solving matrix equations.


E Details of Eq. (11)

The variational distribution in Eq. (11) is defined as:

$$q(u^L, f^L_\perp) = \int \prod_{\ell=1}^{L} \left[ p_\perp(f^\ell_\perp \mid v^\ell_\perp, f^{\ell-1}_\perp, u^{\ell-1})\, q(v^\ell_\perp)\, q(u^\ell)\, dv^\ell_\perp \right] \prod_{\ell=1}^{L-1} du^\ell\, df^\ell_\perp. \qquad (39)$$

F Whitening

Similar to the practice in SVGP methods, we can apply the "whitening" trick (Murray and Adams, 2010; Hensman et al., 2015b) to SOLVE-GP. The goal is to improve the optimization of variational approximations by reducing correlation in the posterior distributions. Specifically, we could "whiten" $u$ by using $u' = K_{uu}^{-1/2}u$, where $K_{uu}^{1/2}$ denotes the Cholesky factor of the prior covariance $K_{uu}$. Posterior inference for $u$ then turns into inference for $u'$, which has an isotropic Gaussian prior $\mathcal{N}(0, I)$, and we parameterize the variational distribution w.r.t. $u'$: $q(u') = \mathcal{N}(m_u, S_u)$. Whitening $q(v_\perp)$ is similar: we parameterize the variational distribution w.r.t. $v'_\perp = C_{vv}^{-1/2}v_\perp$ and set $q(v'_\perp) = \mathcal{N}(m_v, S_v)$. The algorithm can be derived by replacing $m_u, m_v$ with $L_u^0 m_u, L_v^0 m_v$, and $S_u, S_v$ with $L_u^0 S_u L_u^{0\top}, L_v^0 S_v L_v^{0\top}$ in Algorithm 1 and removing the canceled terms.
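As a small illustration of this substitution, the sketch below maps whitened parameters back to the parameterization used in the earlier bound sketch; this is a naive version, and a practical implementation would instead simplify the bound symbolically so the canceled terms disappear.

```python
import numpy as np


def unwhiten(m_white, L_white, L0):
    """Map whitened variational parameters back to the parameterization of Algorithm 1."""
    m = L0 @ m_white          # m  <-  L^0 m'
    L = L0 @ L_white          # so S = L L^T = L^0 S' (L^0)^T, matching the substitution above
    return m, L

# Usage sketch with the bound function sketched earlier (L0u, L0v are the prior Cholesky factors):
# m_u, L_u = unwhiten(m_u_white, L_u_white, L0u)
# m_v, L_v = unwhiten(m_v_white, L_v_white, L0v)
# bound = solvegp_lower_bound(X, y, Z, O, kern, m_u, L_u, m_v, L_v, noise_var)
```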

G Experiment Details

For all experiments, we use kernels with a shared lengthscale across dimensions. All model hyperparameters, including kernel parameters, patch weights in convolutional GP models, and observation variances in regression experiments, are optimized jointly with variational parameters using ADAM. The variational distributions q(u) and q(v⊥) are by default initialized to the prior distributions. Unless stated otherwise, no "whitening" trick (Murray and Adams, 2010; Hensman et al., 2015b) is used for SVGP or SOLVE-GP.

G.1 1D Regression

We randomly sample 100 points from Snelson's dataset (Snelson and Ghahramani, 2006) as the training data. All models use Gaussian RBF kernels and are trained for 10K iterations with learning rate 0.01 and mini-batch size 20. The GP kernel is initialized with lengthscale 1 and variance 1. The Gaussian likelihood is initialized with variance 0.1.

G.2 Convolutional GP Models

All models are trained for 300K iterations with learning rate 0.003 and batch size 64. The learning rate is annealed by a factor of 0.25 every 50K iterations to ensure convergence. We use a zero mean function and the robust multi-class classification likelihood (Hernandez-Lobato et al., 2011). The Gaussian RBF kernels for the patch response GPs in all layers are initialized with lengthscale 5 and variance 5. We use the TICK kernel (Dutordoir et al., 2019) for the output layer GP, with a Matern32 kernel between patch locations whose lengthscale is initialized to 3. We initialize the inducing patch locations to random values in [0, H] × [0, W], where [H, W] is the shape of the output feature map in patch extraction.

Convolutional GPs: We set the patch size to 5 × 5 and the stride to 1. We use the whitening trick for u (and v⊥) in all single-layer experiments, since we find it consistently improves performance. Inducing points are initialized with cluster centers obtained by running K-means on M × 100 (for SVGP) or (M + M₂) × 100 (for SOLVE-GP) image patches, which are randomly sampled from 1K images randomly selected from the dataset.
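The following sketch illustrates this initialization for grayscale images; the function name, the array shapes, and the use of scikit-learn's KMeans are assumptions, and for SOLVE-GP the resulting M + M₂ centers would be split between Z and O.

```python
import numpy as np
from sklearn.cluster import KMeans


def init_inducing_patches(images, num_inducing, patch_size=5, patches_per_center=100, seed=0):
    """Sketch: initialize inducing patches by K-means on randomly sampled image patches.

    `images` is assumed to have shape (num_images, H, W), e.g., 1K images drawn from the dataset.
    """
    rng = np.random.default_rng(seed)
    n, H, W = images.shape
    num_patches = num_inducing * patches_per_center
    img = rng.integers(0, n, size=num_patches)
    row = rng.integers(0, H - patch_size + 1, size=num_patches)
    col = rng.integers(0, W - patch_size + 1, size=num_patches)
    patches = np.stack([images[i, r:r + patch_size, c:c + patch_size].ravel()
                        for i, r, c in zip(img, row, col)])
    km = KMeans(n_clusters=num_inducing, n_init=10, random_state=seed).fit(patches)
    return km.cluster_centers_    # shape (num_inducing, patch_size * patch_size)
```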

Deep Convolutional GPs: The detailed model configurations are summarized in Table 5. No whitening trick is used for multi-layer experiments because we find it hurts performance. Inducing points in the input layer are initialized in the same way as in the single-layer model. In Blomqvist et al. (2018); Dutordoir et al. (2019), three-layer models were initialized with the trained values of a two-layer model to avoid getting stuck in bad


local minima. Here we design an initialization scheme that allows training deeper models without the need for pretraining. We initialize the inducing points in deeper layers by running K-means on M × 100 (for SVGP) or (M + M₂) × 100 (for SOLVE-GP) image patches, which are randomly sampled from the projections of 1K images to these layers. The projections are done using a convolution operation with random filters generated with the Glorot uniform initializer (Glorot and Bengio, 2010). We also note that when implementing the forward sampling for approximating the log-likelihood term, we follow previous practice (Dutordoir et al., 2019) and ignore the correlations between outputs of different patches to obtain faster sampling, which works well in practice. While it is also possible to account for these correlations when sampling, since doing so only increases the computation cost by a constant factor, it might require multi-GPU training due to the additional memory requirements.

Table 5: Model configurations of deep convolutional GPs.

| Layer | 2-layer | 3-layer |
|---|---|---|
| Layer 0 | patch size 5×5, stride 1, out channel 10 | patch size 5×5, stride 1, out channel 10 |
| Layer 1 | patch size 4×4, stride 2 | patch size 4×4, stride 2, out channel 10 |
| Layer 2 | – | patch size 5×5, stride 1 |

G.3 Regression Benchmarks

The experiment settings follow Wang et al. (2019): we use GPs with Matern32 kernels and 80% / 20% training / test splits. A 20% subset of the training set is used for validation. We repeat each experiment 5 times with random splits and report the mean and standard error of the performance metrics. For all datasets we train for 100 epochs with learning rate 0.01 and mini-batch size 1024.
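For clarity, a minimal sketch of this splitting protocol; the function name and the exact shuffling are assumptions, not the original experiment code.

```python
import numpy as np


def make_splits(num_data, seed):
    """Sketch: 80%/20% train/test split, with 20% of the training set held out for validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_data)
    n_train = int(0.8 * num_data)
    train, test = perm[:n_train], perm[n_train:]
    n_val = int(0.2 * len(train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test

# Repeat each experiment 5 times with different random splits:
# splits = [make_splits(N, seed) for seed in range(5)]
```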

H Additional Results

H.1 Regression Benchmarks

Due to space limitations in the main text, we include the Root Mean Squared Error (RMSE) on test data in Table 6. The results on Elevators and Bike are shown in Table 7.

Table 6: Test RMSE values of regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted, and asterisks indicate statistical significance.

| Method | Kin40k | Protein | KeggDirected | KEGGU | 3dRoad | Song | Buzz | HouseElectric |
|---|---|---|---|---|---|---|---|---|
| N | 25,600 | 29,267 | 31,248 | 40,708 | 278,319 | 329,820 | 373,280 | 1,311,539 |
| d | 8 | 9 | 20 | 27 | 3 | 90 | 77 | 9 |
| SVGP (1024) | 0.193(0.001) | 0.630(0.004) | 0.098(0.003) | 0.123(0.001) | 0.482(0.001) | 0.797(0.001) | 0.263(0.001) | 0.063(0.000) |
| SVGP (1536) | 0.182(0.001) | 0.621(0.004) | 0.098(0.002) | 0.123(0.001) | 0.470(0.001) | 0.797(0.001) | 0.263(0.001) | 0.063(0.000) |
| ODVGP (1024 + 1024) | 0.183(0.001) | 0.625(0.004) | 0.176(0.012) | 0.156(0.004) | 0.467(0.001) | 0.797(0.001) | 0.263(0.001) | 0.062(0.000) |
| ODVGP (1024 + 8096) | 0.180(0.001) | 0.618(0.004) | 0.157(0.009) | 0.157(0.004) | 0.462(0.002) | 0.797(0.001) | 0.263(0.001) | 0.062(0.000) |
| SOLVE-GP (1024 + 1024) | *0.172(0.001) | 0.618(0.004) | 0.095(0.002) | 0.123(0.001) | 0.464(0.001) | 0.796(0.001) | 0.261(0.001) | *0.061(0.000) |
| SVGP (2048) | 0.177(0.001) | 0.615(0.004) | 0.100(0.003) | 0.124(0.001) | 0.467(0.001) | 0.796(0.001) | 0.263(0.000) | 0.063(0.000) |

H.2 Convolutional GP Models

We include here the full tables for CIFAR-10 classification, where we also report the accuracies and predictive log-likelihoods on the training data. Table 8 contains the results for convolutional GPs. Table 9 and Table 10 include the results for two- and three-layer deep convolutional GPs, respectively.


Table 7: Regression results on Elevators (N = 10,623, d = 18) and Bike (N = 11,122, d = 17). Best mean values are highlighted.

| Method | Elevators Test LL | Elevators RMSE | Bike Test LL | Bike RMSE |
|---|---|---|---|---|
| SVGP (1024) | -0.516(0.006) | 0.398(0.004) | -0.218(0.006) | 0.283(0.003) |
| SVGP (1536) | -0.511(0.007) | 0.396(0.004) | -0.203(0.006) | 0.279(0.003) |
| ODVGP (1024 + 1024) | -0.518(0.006) | 0.397(0.004) | -0.191(0.006) | 0.272(0.003) |
| ODVGP (1024 + 8096) | -0.523(0.006) | 0.399(0.004) | -0.186(0.006) | 0.270(0.003) |
| SOLVE-GP (1024 + 1024) | -0.509(0.007) | 0.395(0.004) | -0.189(0.006) | 0.272(0.003) |
| SVGP (2048) | -0.507(0.007) | 0.395(0.004) | -0.193(0.006) | 0.276(0.003) |

Table 8: Convolutional GPs for CIFAR-10 classification.

| Method | Inducing Points | Train Acc | Train LL | Test Acc | Test LL | Time |
|---|---|---|---|---|---|---|
| SVGP | 1000 | 77.81% | -1.36 | 66.07% | -1.59 | 0.241 s/iter |
| SVGP | 1600 | 78.44% | -1.26 | 67.18% | -1.54 | 0.380 s/iter |
| SOLVE-GP | 1000 + 1000 | 79.32% | -1.20 | 68.19% | -1.51 | 0.370 s/iter |
| SVGP | 2000 | 79.46% | -1.22 | 68.06% | -1.48 | 0.474 s/iter |

Table 9: 2-layer deep convolutional GPs for CIFAR-10 classification.

| Method | Inducing Points | Train Acc | Train LL | Test Acc | Test LL | Time |
|---|---|---|---|---|---|---|
| SVGP | 384, 1K | 84.86% | -0.82 | 76.35% | -1.04 | 0.392 s/iter |
| SOLVE-GP | 384 + 384, 1K + 1K | 87.59% | -0.72 | 77.80% | -0.98 | 0.657 s/iter |
| SVGP | 768, 2K | 87.25% | -0.74 | 77.46% | -0.98 | 1.104 s/iter |

Table 10: 3-layer deep convolutional GPs for CIFAR-10 classification.

| Method | Inducing Points | Train Acc | Train LL | Test Acc | Test LL | Time |
|---|---|---|---|---|---|---|
| SVGP | 384, 384, 1K | 87.70% | -0.67 | 78.76% | -0.88 | 0.418 s/iter |
| SOLVE-GP | (384 + 384) × 2, 1K + 1K | 89.88% | -0.57 | 80.30% | -0.79 | 0.752 s/iter |
| SVGP | 768, 768, 2K | 90.01% | -0.58 | 80.33% | -0.82 | 1.246 s/iter |