
Neurocomputing 80 (2012) 10–18

Predictive active set selection methods for Gaussian processes

Ricardo Henao a,*, Ole Winther b

a DTU Informatics, Technical University of Denmark, Denmark
b Bioinformatics Centre, University of Copenhagen, Denmark

Article info

Available online 7 November 2011

Keywords: Gaussian process classification; Active set selection; Predictive distribution; Expectation propagation

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2011.09.017
* Corresponding author.
E-mail addresses: [email protected] (R. Henao), [email protected] (O. Winther).

Abstract

We propose an active set selection framework for Gaussian process classification for cases when the dataset is large enough to render inference prohibitive. Our scheme consists of a two-step alternating procedure of active set update rules and hyperparameter optimization based upon marginal likelihood maximization. The active set update rules rely on the ability of the predictive distributions of a Gaussian process classifier to estimate the relative contribution of a data point when it is either included in or removed from the model. This means that we can use them to include points with potentially high impact on the classifier decision process while removing those that are less relevant. We introduce two active set rules based on different criteria: the first prefers a model with interpretable active set parameters, whereas the second puts computational complexity first, i.e. a model with active set parameters that directly control its complexity. We also provide both theoretical and empirical support for our active set selection strategy being a good approximation of a full Gaussian process classifier. Our extensive experiments show that our approach can compete with state-of-the-art classification techniques with reasonable time complexity. Source code is publicly available at http://cogsys.imm.dtu.dk/passgp.

1. Introduction

Classification with Gaussian process (GP) priors has many attractive features: it is non-parametric, exceptionally flexible through covariance function design, provides fully probabilistic outputs, and offers Bayesian model comparison as a principled framework for automatic hyperparameter elicitation and variable selection. However, this set of features comes with a great disadvantage: the computational cost of performing inference scales cubically with the size, N, of the training set. In addition, the memory requirements scale quadratically with N. This means that the applicability of Gaussian process classifiers (GPCs) is limited to problems with dataset sizes in the lower ten thousands. The poor scaling of non-linear classification methods in particular has inspired a considerable amount of research focused on sparse approximations [1–7]. See particularly [1,2] for a detailed overview of sparse approximations in GPCs. These methods generally attempt to decrease the computational cost of inference by one degree w.r.t. N, i.e. to $O(NM^2)$, where $M < N$ and M is the size of a working set consisting of a subset of the training data or a set of auxiliary unobserved variables. Both ways of defining the working set target the same objective of getting as close as possible to the classifier that uses the information of the entire training set; however, they approach it from different angles. Using a subset of the entire data pool amounts to keeping those data points that contribute most to the classification task and discarding the remaining ones through some suitable data selection/ranking procedure [6–10]. Alternatively, building an auxiliary set tries to directly reduce the difference in distribution between the classifier using N points and the one using only M, by estimating the location of an auxiliary set in the input space, usually called a pseudo-input set [1,4,11]. The latter approach is evidently more principled; however, the number of parameters to be learnt grows with the number and size of the auxiliary set, making it infeasible for datasets in the upper ten thousands and sensitive to overfitting due to the number of free parameters in the model. From a fully Bayesian perspective, the authors of [12] propose an efficient MCMC based inference that is made possible by using a sparse and approximate basis function expansion over the training dataset. The main computational burden is therefore the same as for other sparse kernel methods, possibly with a larger pre-factor due to sampling.

Keeping in mind that our main goal is to obtain the best classification performance at the least computational cost possible, we do not attempt to estimate auxiliary sets but rather

to select a subset of the training data. The framework presented here, Predictive Active Set Selection (PASS-GP), uses the predictive distribution of a GPC to quantify the relative importance of each data point and then uses it to iteratively update an active set. Recently, Kapoor et al. [6] proposed a similar criterion in the context of active learning with Gaussian processes. We use the term active set because it is ultimately the set used to estimate the predictive distribution that produces the classification rule and the active set updating scheme. In a nutshell, our framework consists of alternating between active set updates and hyperparameter optimization based upon the marginal likelihood of the active set. We provide two active set update schemes that target different practical scenarios. The first, simply called PASS-GP, builds the active set by including/removing points with small/large predictive probability until no more or too few data points are included in the active set. This means that the size of the active set is not known in advance, and neither is the expected computational complexity. The second scheme acknowledges that in some applications it is very important to keep the computational complexity and/or memory requirements on a budget, so being able to specify the size of the active set beforehand is essential. In fixed PASS-GP (fPASS-GP) we keep the size of the active set constant by including and removing the same number of data points in each update.

The remainder of the paper is organized as follows. Section 2 gives a concise description of expectation propagation based inference for GPCs. Section 3 continues with our proposed framework for active set selection, followed by some theoretical insights based upon a 'representer theorem' for the predictive mean of a GP classifier in Section 4. Marginal likelihood approximations to the full GP classifier are introduced in Section 5. Finally, experimental results and discussion appear in Sections 6 and 7, respectively.

2. Gaussian processes for classification

Given a set of input random variables $X = [x_1, \ldots, x_N]^\top$, a Gaussian process is defined as a joint Gaussian distribution over function values at the input points $f = [f_1, \ldots, f_N]^\top$ with mean vector $m$ (taken to be zero in the following) and covariance matrix $K$ with elements $K_{ij} = k(x_i, x_j)$ and hyperparameters $\theta$. For classification, assuming independently observed binary $\pm 1$ labels $y = [y_1, \ldots, y_N]^\top$ and a probit (cumulative Gaussian) likelihood function $t(y_n|f_n) = \Phi(f_n y_n)$, we end up with an intractable posterior

$$p(f|X,y) = Z^{-1}\, p(f|X) \prod_{n=1}^{N} t(y_n|f_n),$$

where the normalizing constant $Z = p(y|X)$ is the marginal likelihood. If we want to perform inference we must resort to approximations. Here we use expectation propagation (EP) because it is currently the most accurate deterministic approximation, see e.g. [2,13]. In EP, the likelihood function is locally approximated by an un-normalized Gaussian distribution to obtain

$$q(f|X,y) = Z_{EP}^{-1}\, p(f|X) \prod_{n=1}^{N} z_n^{-1}\, \tilde{t}(y_n|f_n) = Z_{EP}^{-1}\, p(f|X)\, \mathcal{N}(f\,|\,\tilde{m}, \tilde{C}) = \mathcal{N}(f\,|\,m, C), \qquad (1)$$

where $q(f|X,y) \approx p(f|X,y)$, the $z_n$ are normalization coefficients, and the $\tilde{t}(y_n|f_n)$ with $\mathcal{N}(f|\tilde{m},\tilde{C})$ form the Gaussian site approximations to the $t(y_n|f_n)$. To obtain $q(f|X,y)$, one starts from $q(f|X,y) = p(f|X)$, i.e. all site approximations set to one, and updates the individual site approximations $\tilde{t}_n$ sequentially. For this purpose, we delete the site approximation $\tilde{t}_n$ from the current posterior, leading to the so-called cavity distribution

$$q_{\backslash n}(f|X, y_{\backslash n}) = p(f|X) \prod_{i \neq n} z_i^{-1}\, \tilde{t}(y_i|f_i),$$

from which we can obtain a cavity predictive distribution

$$q_{\backslash n}(y_n|X, y_{\backslash n}) = \int t(y_n|f_n)\, q_{\backslash n}(f|X, y_{\backslash n})\, df = \Phi\!\left(\frac{y_n m_{\backslash n}}{\sqrt{1 + v_{\backslash n}}}\right), \qquad (2)$$

where $m_{\backslash n} = v_{\backslash n}(C_{nn}^{-1} m_n - \tilde{C}_{nn}^{-1} \tilde{m}_n)$ and $v_{\backslash n} = (C_{nn}^{-1} - \tilde{C}_{nn}^{-1})^{-1}$. We then combine the cavity distribution with the exact likelihood $t(y_n|f_n)$ to obtain the so-called tilted distribution $\hat{q}_n(f|X,y) = z_n^{-1}\, t(y_n|f_n)\, q_{\backslash n}(f|X, y_{\backslash n})$. Since we need to choose the parameters of the site approximations, we must minimize some divergence measure. It is well known that when $q(f|X,y)$ is Gaussian, minimizing $\mathrm{KL}(\hat{q}_n(f)\,\|\,q(f))$ is equivalent to moment matching between those two distributions, including zeroth moments for the normalizing constants. The EP algorithm iterates by updating each site approximation in turn and makes several passes over the training data.
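To make the site updates concrete, here is a minimal Python sketch of one EP sweep for the probit likelihood (our own illustration in the standard natural parameterization, cf. [2]; it recomputes the posterior from scratch at every site for readability, whereas practical implementations use Cholesky based rank-one updates):

import numpy as np
from scipy.stats import norm

def ep_sweep(K, y, tau_tilde, nu_tilde):
    """One EP pass over all sites of a probit GPC.

    K: (N, N) prior covariance; y: labels in {-1, +1}; tau_tilde, nu_tilde:
    natural parameters of the Gaussian site approximations (zeros initially)."""
    N = len(y)
    for n in range(N):
        # Posterior q(f) = N(mu, Sigma) from prior and current sites
        # (direct inversion for clarity only).
        Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(tau_tilde))
        mu = Sigma @ nu_tilde
        # Cavity: remove site n from the marginal of f_n.
        tau_cav = 1.0 / Sigma[n, n] - tau_tilde[n]
        nu_cav = mu[n] / Sigma[n, n] - nu_tilde[n]
        m_cav, v_cav = nu_cav / tau_cav, 1.0 / tau_cav
        # Moments of the tilted distribution Phi(y_n f_n) N(f_n | m_cav, v_cav).
        z = y[n] * m_cav / np.sqrt(1.0 + v_cav)
        r = norm.pdf(z) / norm.cdf(z)      # guard small Phi(z) in robust code
        m_hat = m_cav + y[n] * v_cav * r / np.sqrt(1.0 + v_cav)
        v_hat = v_cav - v_cav**2 * r * (z + r) / (1.0 + v_cav)
        # Moment matching: new natural site parameters.
        tau_tilde[n] = 1.0 / v_hat - tau_cav
        nu_tilde[n] = m_hat / v_hat - nu_cav
    return tau_tilde, nu_tilde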

With the Gaussian approximation to the posterior distribution in Eq. (1), it is possible to calculate the predictive distribution for a new data point $x_*$ as

$$q(y_*|X, y, x_*) = \int t(y_*|f_*)\, q(f_*|X, y, x_*)\, df_* = \Phi\!\left(\frac{y_* m_*}{\sqrt{1 + v_*}}\right), \qquad (3)$$

where $q(f_*|X, y, x_*)$ is the approximate predictive Gaussian distribution (the marginal of $q(f, f_*|X, y, x_*)$ w.r.t. $f$) with mean $m_* = k_*^\top (K + \tilde{C})^{-1} \tilde{m}$ and variance $v_* = k_{**} - k_*^\top (K + \tilde{C})^{-1} k_*$. In addition, the approximation to the marginal likelihood $p(y|X)$ results from the normalization constant in Eq. (1), i.e. $q(y|X) = Z_{EP}$. The logarithm of $Z_{EP}(\theta, X, y)$ and its derivatives can be used jointly with conjugate gradient updates to perform model selection under the evidence maximization framework. For a detailed presentation of GPs, including implementation details, consult [2,13].

3. Predictive Active Set Selection

The EP algorithm proceeds by iterative updates of each site approximation using the whole dataset $\{X, y\}$. In the active set scenario, on the other hand, we only want to approximate the posterior distribution in Eq. (1) using a small subset, the active set $\{X_A, y_A\}$. Since exploring all possible active sets is obviously intractable even for a fixed active set size M, the problem is how to select an active set that delivers performance as good as possible within the available computing resources. The Informative Vector Machine (IVM) [8], for instance, computes in each iteration the differential entropy score for all data points not already part of the active set, $\{X_I, y_I\}$, and performs updates by including the single point leading to the maximum score. Despite being a greedy heuristic, IVM has proved to behave quite well in practice, giving the best GP performance reported so far on the USPS and MNIST tasks [8,9]. We propose an iterative approach in the same spirit, with two main conceptual changes:

• Active set inclusion/deletion is based directly upon the data point's weight in prediction. The 'representer theorem' for the mean prediction, discussed in Section 4, leads directly to the weight being expressed in terms of (a derivative of) the cavity predictive probability. This means that we can actually use the predictive distribution for a point in the inactive set to predict the weight it would have if it were included in the active set. For classification we use the (cavity) predictive probability to decide upon deletion and inclusion because it is monotonically related to the weight and is a readily interpretable quantity.

• Hyperparameter optimization must be an integral part of the algorithm, because the weights of the examples (and thus the active set) are conditioned on the hyperparameter values and vice versa. We therefore alternate between active set updates and hyperparameter optimization using several passes over the dataset.

Next we discuss the details of our (f)PASS-GP framework, followed by a detailed comparison with the IVM. First we need to define rules for including and deleting points of the active set. As already mentioned, we use the predictive distribution in Eq. (3) for inclusions, since data points with small predictive probability are more likely to improve the classifier performance and the quality of the active set. For deletions, we use the cavity predictive distribution in Eq. (2) because, when examined carefully, it can be seen as a leave-one-out estimator [14]. This means that points with cavity probability close to one do not contribute to the decision rule and can thus be discarded from the active set. With the two ranking measures set, i.e. Eqs. (2) and (3), we have essentially two possibilities: we can set probability thresholds on the distributions and let the model decide the size of the active set, or we can specify the number of inclusions/deletions directly. In PASS-GP, we include points in the active set with predictive probability less than $p_{inc}$ and remove them with cavity probability greater than $p_{del}$. The appealing aspect of these thresholds is that they can be interpreted: for instance, if we set $p_{inc} = 0.5$ we will include all misclassified observations in the current active set, whereas if $p_{inc} = 0.6$ we will also include points near the decision boundary. We require two thresholds because we only want to remove points that the classifier finds very easy to classify, so unlike $p_{inc}$, $p_{del}$ must be close to one. In fPASS-GP, we want to keep the computational complexity of the classifier under control, so we want the size of the active set to be fixed. For this purpose we only have to make sure that each active set update includes and removes the same number of points. In practice we define $p_{exc}$ as the exchange proportion w.r.t. M, meaning that each update replaces the fixed proportion of hardest to classify points in the inactive set with those most surely classified in the current active set. This update rule assumes that the active set is large enough to contain points with cavity probability close to one.

From a practical point of view, ranking every point in the inactive set at each iteration for inclusion could become prohibitive for large datasets. However, we still want to cover the whole dataset rather than selecting a random subset for ranking. We therefore split the data into $N_{sub}$ non-overlapping subsets and process one of them in each iteration, such that each batch has between 100 and 1000 data points.

Hyperparameter selection is a very important feature and needs to be done jointly with the active set update procedure. Algorithm 1 starts from a randomly selected active set of size $N_{init}$ (that is, M in fPASS-GP), large enough to provide good initial hyperparameter values. Next we alternate between active set updates and hyperparameter optimization. Keeping in mind that we only expect small changes of the hyperparameters from one iteration to another, we reuse the current values of $\theta$ as initial values for the next iteration to speed up the learning process. The addition and deletion rules in Algorithm 1 have parameters $\{p_{inc}, p_{del}\}$ and $p_{exc}$ for PASS-GP and fPASS-GP, respectively.

Algorithm 1. Predictive Active Set Selection.

Input: {X, y}, θ and {N_init, N_sub, N_pass}
Input: p_inc and p_del (PASS-GP)
Input: p_exc (fPASS-GP)
Output: q(f_A | X_A, y_A), θ_new and A
begin
    A ← {1, ..., N_init}
    {X, y}_sub^(1), ..., {X, y}_sub^(N_sub) ← {X, y}
    for i = 1 to N_pass do
        for j = 1 to N_sub do
            θ_new = argmax_θ log Z_EP(θ, X_A, y_A)
            Get q(f_A | X_A, y_A) and q(y_* | X_A, y_A, x_*)
            forall {x_n, y_n} ∈ {X_A, y_A} do
                if RemoveRule(q_\n(y_n | X_A, y_A\n)) then A ← A \ {n}
            end
            forall {x_n, y_n} ∈ {X, y}_sub^(j) do
                if AdditionRule(q(y_n | X_A, y_A, x_n)) then A ← A ∪ {n}
            end
        end
    end
end
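The following Python skeleton mirrors Algorithm 1; it is a sketch, not the released Matlab code, and the helpers fit_ep (EP inference plus maximization of log Z_EP on the active set), cavity_prob (Eq. (2)) and pred_prob (Eq. (3)) are hypothetical names supplied by the caller:

import numpy as np

def pass_gp(X, y, theta, *, fit_ep, cavity_prob, pred_prob,
            N_init=300, N_sub=10, N_pass=2, p_inc=0.6, p_del=0.99, seed=0):
    """PASS-GP skeleton: alternate active set updates with hyperparameter
    optimization; thresholds p_inc/p_del implement the addition/removal rules."""
    rng = np.random.default_rng(seed)
    N = len(y)
    A = set(rng.choice(N, size=N_init, replace=False).tolist())
    batches = np.array_split(rng.permutation(N), N_sub)  # cover the whole dataset
    for _ in range(N_pass):
        for batch in batches:
            idx = sorted(A)
            post, theta = fit_ep(X[idx], y[idx], theta)   # EP + evidence maximization
            for n in idx:                                 # removals: easy points, Eq. (2)
                if cavity_prob(post, X[n], y[n]) > p_del:
                    A.discard(n)
            for n in batch:                               # additions: hard points, Eq. (3)
                if n not in A and pred_prob(post, X[n], y[n]) < p_inc:
                    A.add(int(n))
    return A, theta

For fPASS-GP the two inner loops would instead swap a fixed proportion $p_{exc}$ of the most confidently classified active points with the hardest inactive ones, keeping |A| constant.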

3.1. Differences between (f)PASS-GP and IVM

Since IVM is the closest relative of our active set selection method, we briefly discuss the main differences between the two. (i) The active set size, and thus the computational complexity, is usually fixed beforehand in IVM; PASS-GP works with inclusion and deletion thresholds instead. (ii) IVM does not allow deletions from the active set, which is a clear disadvantage, as points often become irrelevant at a later stage, when more points have been included. In (f)PASS-GP we can make an (almost) unbiased common ranking of all training points, active as well as inactive, using a quantity that is meaningful and directly related to the weight of the training point in predictions. Using both inclusions/deletions and several passes over the training set makes (f)PASS-GP quite insensitive to the initial choice of active set. (iii) When the dataset is considerably large, IVM randomly selects a subset of points from the inactive set to be ranked, meaning that some points of the dataset are likely never considered for inclusion in the active set. (iv) Hyperparameter optimization is part of the algorithm in (f)PASS-GP, working on subsets of data between updates and iterating over the dataset several times; IVM makes a single inclusion per step and in principle stops when the limit for the active set is reached. (v) In terms of time complexity per iteration, IVM is faster than (f)PASS-GP, $O(NM)$ against $O(M^2(2 + N/N_{sub}))$, where M is the size of A; however, the storage requirements of (f)PASS-GP are considerably lower, $O(M^2)$ compared to $O(NM)$.

4. Representer for mean prediction

The 'representer theorem' for the posterior mean of $f$ [14] connects the predictive probability and the weight of a data point. Using that $f\, p(f|X) = -K (\partial/\partial f)\, p(f|X)$, we get the exact relation for the posterior mean $\langle f \rangle = K\alpha$, with the weight of element $n$ being

$$\alpha_n = \frac{1}{p(y|X)} \int p(f|X)\, \frac{\partial}{\partial f_n} p(y|f)\, df = \frac{\langle p'(y_n|f_n) \rangle_{\backslash n}}{\langle p(y_n|f_n) \rangle_{\backslash n}} = \frac{\partial}{\partial h} \log \langle p(y_n|f_n + h) \rangle_{\backslash n} \Big|_{h=0},$$

where $\langle \cdot \rangle_{\backslash n}$ denotes an average over the posterior without the $n$-th data point (so that $\langle f_n \rangle_{\backslash n} = m_{\backslash n}$) and $p'(y_n|f_n) = \partial p(y_n|f_n)/\partial f_n$. The final expression implies that the weight is nothing but the log derivative of the cavity predictive probability $\langle p(y_n|f_n) \rangle_{\backslash n} = p(y_n|X, y_{\backslash n})$. For regression, $p(y_n|f_n) = \mathcal{N}(y_n|f_n, \sigma^2)$ and $\alpha_n = (y_n - \langle f_n \rangle_{\backslash n})(\sigma^2 + v_{\backslash n})^{-1}$ with $v_{\backslash n} = \langle f_n^2 \rangle_{\backslash n} - \langle f_n \rangle_{\backslash n}^2$. The element $\alpha_n$ will therefore be small when the cavity mean has a small deviation from the target relative to the variance. For a new data point pair $\{x_*, y_*\}$, we can calculate the weight of this point exactly, replacing the cavity average with the full average in the expression above. We can therefore predict, without rerunning EP, how much weight this new point would have. For classification we can calculate the weight using the current EP approximation. When $z_n = y_n \langle f_n \rangle_{\backslash n} / \sqrt{1 + v_{\backslash n}}$ is above $\approx 4$, the cavity probability in Eq. (2) approaches one and $\alpha_n \approx y_n \exp(-z_n^2/2)/\sqrt{2\pi(1 + v_{\backslash n})}$. This fast decay indicates that the GPC will in many cases be effectively sparse even though $\alpha$ strictly does not contain zeros.

In the inclusion/deletion steps we rank data points according to their weights. For classification we can use the predictive probability directly, since it is a monotonic function of the weight. Including a new data point will of course affect the values of all other weights, leading to a rearrangement of their ranks. Including multiple data points will also invalidate the predicted values of the weights (think of the extreme case of two new data points being identical). We therefore have to recalculate the weights, by retraining with EP for classification or simply updating the posterior for regression, before going to the next step. If we already have an active set covering the decision regions well enough, this rearrangement step will amount to minor adjustments and the approximation will work well.

In this work we have only used the representer theorem for active set selection. It is also possible, though not tested here, to use all training points for prediction while only calculating the posterior on the active set. The inactive set weights are then simply set to the values predicted from the active set posterior. To get the full predictive probability one also has to calculate the contribution to the predictive variances, which can be obtained by a similar theorem for the predictive variance, see [14].

5. Marginal likelihood approximations

In this section we decompose the marginal likelihood into its active and inactive set contributions. We will argue that the contribution from the active set dominates, justifying why we can limit ourselves to optimizing the hyperparameters over this set. In the following section we investigate this assumption empirically. The marginal likelihood can be decomposed via the chain rule as

$$p(y|X) = p(y_I|y_A, X_A, X_I)\, p(y_A|X_A), \qquad (4)$$

where we have used the marginalization property of GPs,

$$p(y_A|X) = \int p(y_A|f_A)\, p(f_A|X_A)\, df_A = p(y_A|X_A),$$

which we approximate as $q(y_A|X_A) = Z_{EP,A}$ and identify as the marginal likelihood for the active set A. The conditional marginal likelihood term can be written as

$$p(y_I|y_A, X_A, X_I) = \int p(y_I|f_I)\, p(f_I|X_I, X_A, f_A)\, p(f_A|X_A, y_A)\, df_A\, df_I, \qquad (5)$$

where we used $p(f|X, y_A) = p(f_I|X_I, X_A, f_A)\, p(f_A|X_A, y_A)$. We can make an EP approximation here, just as in Eq. (1), by replacing the posterior $p(f_A|X_A, y_A)$ with the multivariate Gaussian $q(f_A|X_A, y_A) = \mathcal{N}(f_A|m_A, C_{AA})$, where the active set specific means and covariances are found by EP. Marginalizing over $f_A$ in Eq. (5) now makes it tractable,

$$q(y_I|y_A, X_A, X_I) \approx \int p(y_I|f_I)\, \mathcal{N}(f_I|m_{I|A}, C_{II|A})\, df_I,$$

with parameters

$$m_{I|A} = K_{IA}(K_{AA} + \tilde{C}_{AA})^{-1} \tilde{m}_A, \qquad C_{II|A} = K_{II} - K_{IA}(K_{AA} + \tilde{C}_{AA})^{-1} K_{AI},$$

where the site moments $\tilde{m}_A$ and $\tilde{C}_{AA}$ are as defined in Section 2. When the inactive set consists of a single example, we recover the EP predictive distribution in Eq. (3); otherwise we have to solve for a new marginal likelihood. Denoting the marginal likelihood for a set $\{X, y\}$ with a non-zero mean GP prior by

$$Z(\theta, X, y, m) = \int p(y|f)\, \mathcal{N}(f|m, K)\, df,$$

and its EP approximation by $Z_{EP}(\theta, X, y, m)$, we can write the approximation to the marginal likelihood in Eq. (4) as

$$Z_{ACC} \approx Z_{EP}(\theta, X_I, y_I, m_{I|A})\, Z_{EP}(\theta, X_A, y_A, 0).$$

Using this approximate decomposition reduces the complexity of EP from $O(N^3 N_{pass})$ to $O((|I|^3 + M^3) N_{pass})$, where $|I|$ is the size of the inactive set. Unfortunately this is still too costly for large N. A final low complexity approximation to the marginal likelihood, which we denote by $Z_{APP}$, is to replace $p(y_I|y_A, X)$ with the product of marginals $\prod_{i \in I} p(y_i|y_A, X_A, x_i)$. Empirically (see Fig. 3) this approximation turns out to be lower than the actual marginal likelihood, i.e. the joint distribution enforces the labels relative to the product of the marginals.
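Given the active set posterior, $Z_{APP}$ only requires the per-point EP predictive marginals of the inactive points, costing $O(|I| M^2)$ overall. A minimal sketch, assuming the predictive moments $m_{i|A}$, $v_{i|A}$ have been computed as in Section 2:

import numpy as np
from scipy.stats import norm

def log_Z_app(log_Z_ep_A, m_I, v_I, y_I):
    """log Z_APP = log Z_EP,A + sum_{i in I} log q(y_i | y_A, X_A, x_i).

    m_I, v_I: predictive means/variances of the inactive points; y_I: labels."""
    return log_Z_ep_A + np.sum(norm.logcdf(y_I * m_I / np.sqrt(1.0 + v_I)))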

6. Experiments

The results presented in this section cover several classification tasks performed on three well known datasets, namely USPS, MNIST and IJCNN. The first two are handwritten digit databases, while the third is a physical-system-inspired dataset assembled for the IJCNN 2001 neural network competition. We compare the two approaches introduced in Section 3 against the IVM and the reduced complexity SVM (RSVM) [3]. As performance measures we consider not only classification errors, but also the error-cost trade-off and prediction uncertainty. We also present results for the approximations to the marginal likelihood of the full GP presented in Section 5. All experiments were performed on a 2.0 GHz desktop machine with 2 GB RAM.

6.1. USPS

The USPS digits database contains 9298 grayscale images of size 16×16 pixels, scaled and translated so pixel values fall within the range from −1 to 1. Here we adopt the traditional data split, i.e. 7291 observations for training and the remaining 2007 for testing. For each binary one-against-rest classifier we use the same model setup, consisting of a squared exponential covariance function plus additive jitter,

$$k(x_i, x_j) = \theta_1 \exp\left(-\frac{\|x_i - x_j\|^2}{2\theta_2}\right) + \theta_3 \delta_{ij}, \qquad (6)$$

where $\delta_{ij} = 1$ if $i = j$ and zero otherwise. We thus have three hyperparameters in $\theta$: signal variance, characteristic length scale and jitter coefficient. Since the four active set methods being considered may depend upon random initialization, we repeated all tasks 10 times.
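In code, Eq. (6) reads as follows (a routine vectorized implementation, ours for illustration):

import numpy as np

def k_se_jitter(X1, X2, theta1, theta2, theta3):
    """Squared exponential covariance plus additive jitter, Eq. (6)."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    K = theta1 * np.exp(-np.maximum(sq, 0.0) / (2.0 * theta2))
    if X1 is X2:                  # jitter acts only on the diagonal, delta_ij
        K = K + theta3 * np.eye(len(X1))
    return K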

Individual settings for each method are:

• PASS-GP: $N_{init} = 300$, $N_{sub} = 10$, $N_{pass} = 2$, $p_{inc} = 0.6$ and $p_{del} = 0.99$.
• fPASS-GP: $N_{init} = 300$, $N_{sub} = 10$, $N_{pass} = 4$, $p_{exc} = 0.02$. We allow fPASS-GP to perform more passes through the data because it progresses more slowly due to $p_{exc}$ being small.
• RSVM: $M = 500$, $\theta = [1\ 1/16\ 0]$, $C = 10$ and $k = 10$. More precisely, $\theta$ and the regularization parameter $C$ were obtained by grid search cross-validation, while $k$ was set to the value suggested by the authors of [3].
• IVM: $M = 300$ and $N_{pass} = 8$. In the publicly available version of IVM, hyperparameter selection is done by alternating between full active set selection and hyperparameter optimization. Since IVM starts from an empty active set, it can be very sensitive to the initial values of $\theta$. We found, however, that adding a linear term, $\theta_4 x_i^\top x_j$, to the covariance function in Eq. (6) makes IVM quite insensitive to initialization. The results reported here include such a linear term because we found that using Eq. (6) alone makes the IVM perform very poorly.

Fig. 1(a) shows mean test errors for every one-against-rest task using PASS-GP, fPASS-GP, RSVM, IVM and the full GPC with hyperparameter optimization. In addition, Fig. 1(b) shows the active set sizes for each digit using PASS-GP.

Fig. 1. Error rates and active set sizes for USPS data. (a) Mean classification errors for each digit using PASS-GP, fPASS-GP, RSVM, IVM and the full GPC with hyperparameter optimization. (b) Active set sizes for PASS-GP. Note that fPASS-GP and IVM use M = 300, whereas RSVM uses M = 500 for the results in (a). Error bars are standard deviations over 10 repetitions.

From the figure it can be seen that the Gaussian process based active set methods perform similarly, and slightly better than the RSVM. The full GPC was only run once due to its computational requirements, which explains the lack of error bars in Fig. 1(a). Furthermore, compared to fPASS-GP, IVM (M = 300) and RSVM (M = 500), PASS-GP seems to require smaller active sets to achieve similar classification performance. It is important to mention that we also tried larger values of M for the fixed active set algorithms, but without any significant improvement in performance.

Fig. 2 shows classification errors for digits 2 and 4 against the rest, in the top and bottom panels respectively, as a function of both the active set size and the running time. For fPASS-GP, RSVM and IVM we used $M = \{200, \ldots, 600\}$, and for PASS-GP we used $p_{inc} = \{0.2, 0.3, \ldots, 0.9\}$. We also included the classification error obtained by the full GPC with hyperparameter optimization, depicted as a horizontal dashed line. See [7] for a more detailed comparison between PASS-GP and full GPCs. Several features of Fig. 2 are worth highlighting. (i) The Gaussian process based methods approach the full GP for large values of M, as expected. (ii) As in Fig. 1(a), PASS-GP seems to consistently outperform fPASS-GP for similar sizes of M. (iii) For small values of M, RSVM and IVM perform better than our active set methods; however, further increasing M does not considerably improve their performance. When M is small enough, it is very likely that our approaches cannot obtain plausible estimates of the hyperparameters of the covariance function, hence their poor performance compared to RSVM, which uses fixed values. Given that the full GPC takes 8.6e5 s to run, PASS-GP and fPASS-GP are approximately three orders of magnitude faster than the full GPC with hyperparameter optimization, see [7]. From Fig. 2(b) and (d), we see that for similar active set sizes, PASS-GP and fPASS-GP have comparable computational costs, as one might expect. Similarly, RSVM and IVM scale better than our active set selection methods. In terms of error-cost trade-off, RSVM has a clear edge, while the Gaussian process based methods can be regarded as comparable. It is important to note that for RSVM the difference in computational costs seen in Fig. 2(b) and (d) should not be considered significant, since we are not counting the time used to obtain the parameters of the RSVM, which unfortunately need to be selected by expensive grid search with cross-validation. The IVM turned out to be time-wise comparable to our active set methods, not because of its selection procedure but due to the hyperparameter optimization scheme used.

The results obtained on USPS suggest that (f)PASS-GP performs slightly better than the full GPC. This could be due to numerical instability produced by the size of the problem, the iterative nature of the EP algorithm and/or not enough iterations for the hyperparameter selection procedure. However, it could also mean that optimizing on the active set achieves a better 'local' fit around the decision boundary region. A priori, one cannot expect a single set of hyperparameters to be able to describe all regions of input space, and thus every possible active set. The same kind of local improvement observed here was also reported in [15,4] using GPC auxiliary set methods.

Combining the ten binary tasks into a one-against-rest multi-class classifier, PASS-GP obtained 4.51±0.17%, which is comparable to or better¹ than 4.61±0.11% by fPASS-GP, 4.88±0.12% by RSVM and 4.38±0.11% by IVM. Baselines are 5.13% by GPC with hyperparameter optimization, 4.78% by GPC with fixed $\theta$, and 9.75±0.40% by GPC with random active set selection. Other relevant results found in the literature include 5.15% by online GP [16] and 4.98% by IVM with randomized greedy selection [9].

¹ Assuming independent errors, the standard deviation on the performance is $\sqrt{E(1-E)/N_{test}}$, giving approximately 0.4% for USPS and 0.1% for MNIST.
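As a quick check of these figures, taking the multi-class error levels reported here, roughly $E \approx 0.045$ for USPS ($N_{test} = 2007$) and $E \approx 0.013$ for MNIST ($N_{test} = 10\,000$):

$$\sqrt{\frac{0.045 \times 0.955}{2007}} \approx 0.0046 \approx 0.46\%, \qquad \sqrt{\frac{0.013 \times 0.987}{10\,000}} \approx 0.0011 \approx 0.11\%.$$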

Fig. 2. Results for selected individual digits of USPS data. (a) and (c) show mean classification errors as a function of the active set size for digits 2 and 4 vs the rest, respectively. (b) and (d) show mean classification errors as a function of the running time, matching panels (a) and (c). The horizontal dashed line in all plots is the performance of the full GPC with hyperparameter selection. In both cases, the full GPC took approximately 8.6e5 s [7]. Values represent averages over ten independent repetitions with error bars omitted for clarity.

All three Gaussian process based methods are comparable with state-of-the-art techniques such as the SVM, see [17]. It is worth pointing out that the best result we could obtain from IVM using the covariance function in Eq. (6) alone was 6.27±0.21% for M = 1500, which is substantially worse than the performance of the full GPC. For reference, the human error rate has been shown to be approximately 2.5%.

Next we evaluate the two approximations to the marginal likelihood proposed in Section 5. We proceed by computing the accurate but expensive approximation $Z_{ACC}$, the less accurate but affordable $Z_{APP}$, and the marginal likelihoods of the full GPC and of the active set, denoted $Z_{EP}$ and $Z_{EP,A}$, respectively. In order to show how the approximations depend on the size of the active set, we compute them for $p_{inc} = \{0.5, 0.6, \ldots, 0.9, 0.99, 1\}$, with $p_{inc} = 1$ being the full GPC. Fig. 3 shows that the three approximations approach the marginal likelihood of the full GPC as the inclusion threshold, and thus the active set size, increases. As expected, $Z_{ACC}$ is the best approximation; however, the computational effort needed to compute it is roughly two orders of magnitude larger than the cost of computing $Z_{APP}$ and $Z_{EP,A}$. It is very interesting that even with a value as large as $p_{inc} = 0.99$ the size of the active set remains below 10% of the training data, and the contribution to the log-marginal likelihood from the inactive set, $Z_{EP}(\theta, X_I, y_I, m_{I|A})$, basically vanishes, since $Z_{APP}$ and $Z_{EP,A}$ are essentially the same.

Finally, we assess the uncertainty of the predictions made by the Gaussian process based methods by comparing the predictive probabilities with the true outcomes. Fig. 4 shows estimated log predictive densities for PASS-GP, fPASS-GP, IVM and the full GPC, using all USPS predictions made on the test set, separated into correct and incorrect predictions. Assuming no labeling errors, the true density consists of two point masses at {0, 1}, given our one-against-rest setting. As one might expect, the full GPC achieves the best approximation, followed by fPASS-GP and PASS-GP. IVM suggests more predictive uncertainty because of the two 'spurious' modes in Fig. 4(a). Another way to assess the predictive uncertainty is to compute the Brier score, which measures the average squared deviation between estimated and true predictive probabilities. For the USPS dataset we obtained 0.53±0.03, 0.27±0.01, 0.71±0.02 and 0.14±0.00 for PASS-GP, fPASS-GP, IVM and the full GPC, respectively. Note that the Brier scores are in agreement with what we observe in Fig. 4.

Fig. 3. Marginal log-likelihood approximations as a function of the active set size for digit 3 vs the rest. The plots show means and standard deviations (error bars) over ten repetitions. Each marker indicates a different inclusion threshold $p_{inc} = \{0.5, 0.6, \ldots, 0.9, 0.99\}$. In the left panel, $Z_{EP}$ is for the full GPC ($p_{inc} = 1$), $Z_{EP,A}$ is for the active set only, and the remaining two, $Z_{ACC}$ and $Z_{APP}$, are the proposed approximations. The middle and right panels show the computation times required to compute $\{Z_{APP}, Z_{EP,A}\}$ and $Z_{ACC}$, respectively.

Fig. 4. Predictive density estimation for USPS data. Densities for correct and incorrect predictions are shown separately in (a) and (b), respectively. The ground truth for (a) is a two point mass mixture at {0, 1} and a flat distribution for (b).

6.2. MNIST

The MNIST digits database has 60 000 training and 10 000 testing examples. Each example is a grayscale image of 28×28 pixels. The estimated human test error is around 0.2%. The settings used for the algorithm are nearly the same as those for USPS, with only two differences: $N_{sub}$ is set to 100, since the training set of MNIST is almost ten times larger than that of USPS, and we update the hyperparameters not in each iteration but every 10th, in order to make the training process faster. We also ran our algorithm with hyperparameter updates in every single iteration without any noticeable improvement in performance (results not shown). Fig. 5 shows test error rates, active set sizes, multi-class errors and running times for each binary classifier based on PASS-GP, fPASS-GP and RSVM using a 9-th degree polynomial covariance function

$$K(x_i, x_j) = \theta_1 (x_i \cdot x_j + 1)^9.$$

We use this covariance function instead of the standard squared exponential from Eq. (6) because a polynomial covariance is well known to provide optimal results for the MNIST dataset [18]. Results for the squared exponential covariance function can be found in [7] and confirm that the polynomial covariance behaves slightly better for this dataset. For IVM we could not make the polynomial covariance work properly, so we decided to use Eq. (6) plus a linear term, as in the USPS experiment.
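In the style of the earlier kernel sketch, the polynomial covariance is a one-liner:

import numpy as np

def k_poly9(X1, X2, theta1):
    """9-th degree polynomial covariance: theta1 * (x_i . x_j + 1)^9."""
    return theta1 * (X1 @ X2.T + 1.0) ** 9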

From Fig. 5(b) it can be seen that in every case the size of the active set is less than 4% of the training set. The results for fPASS-GP and RSVM were obtained using M = 2000. We did try larger values of M, but the reduction in error was not significant compared to the overhead in computational cost. Fig. 5(a) shows the classification error for each digit. The performance of the three approaches considered is comparable, but with PASS-GP having an edge over the other two, both in terms of errors and variances. Fig. 5(c) shows the results of combining the ten binary classifiers. Again, PASS-GP behaves slightly better than the others; however, looking at the run times in Fig. 5(d) we can see that RSVM is computationally more affordable than our approaches, even more so considering that it uses M = 2000. Comparing PASS-GP to fPASS-GP, the former has a smaller mean run time but with larger variance. fPASS-GP is more stable time-wise, but takes more time because it uses a fixed M = 2000. As far as the authors know, these are the first GP based results on MNIST using the whole database. IVM [8] has been tried with sub-sampled images of size 13×13, producing a test error rate of 1.54±0.04%. Seeger [9] made additional tests on some digits (5, 8 and 9) with the full size images without any further improvement. On the other hand, PASS-GP is again comparable with state-of-the-art techniques that do not include preprocessing stages and/or data augmentation; for instance, the SVM attains 1.4% and 1.22% using RBF and 9-th degree polynomial kernels, respectively. The reported sizes of the support vector sets are approximately two times larger than our active sets [18].

Fig. 5. Error rates, active set sizes and run times for MNIST data. (a) Mean classification errors for each digit task using PASS-GP, fPASS-GP, RSVM and IVM. (b) Active set sizes for PASS-GP. Note that fPASS-GP, RSVM and IVM use M = 2000. (c) Mean multi-class classification errors and (d) average timings over one-against-the-rest classifiers and repetitions. Error bars in (a), (c) and (d) are standard deviations computed over 10 repetitions of the experiment.

6.3. Incorporating invariances

It has been shown that a good way to improve the overall performance of a classifier is to incorporate additional prior knowledge into the training procedure, particularly by explicitly handling invariances of the data. In [18] it is shown that instead of dealing with the invariances by augmenting the original dataset, which turns out to be infeasible in many cases, it is better to augment only the support vector set of an SVM.

Table 1
Results for USPS and MNIST using PASS-GP and active set invariances. Figures are averages over 10 and 5 repetitions, respectively.

Digit         0     1     2      3      4     5      6     7     8      9
USPS (%)      0.63  0.38  1.01   0.69   0.93  1.16   0.51  0.37  0.59   0.65
Active set    870   442   1251   1316   1654  1425   1242  987   1532   1281
MNIST (%)     0.14  0.14  0.24   0.24   0.29  0.22   0.17  0.35  0.29   0.35
Active set    6505  4372  11401  12988  9776  11960  7360  9872  15194  14790

We therefore try the same procedure as suggested in [18], consisting of four 1-pixel translations (in the left, up, right and down directions) of each element of the active set for USPS, and eight 1-pixel translations (including diagonals) for MNIST, resulting in new training sets of size 5×M and 9×M, respectively. In this case we used the same settings as in the previous experiments, with only two differences. First, the hyperparameters were set to those found using the original dataset. Second, we made the important observation that in order to get a performance improvement a large active set was needed; for training on the augmented dataset we therefore increased $p_{inc}$ from 0.6 to 0.99 for USPS and to 0.9 for MNIST. We conjecture that we could get even better performance, at the expense of a substantial increase in complexity, by increasing $p_{inc}$ in the initial run to get a larger initial active set to work with.

Results in Table 1 show that, performance-wise, PASS-GP reached 3.35±0.03% for USPS and 0.86±0.02% for MNIST on the multi-class task, which is comparable to state-of-the-art techniques. For instance, the SVM obtained 3.2% on USPS and 0.68% on MNIST with an equivalent procedure. The difference in performance is probably due to our active sets not being large enough, since the support set sizes reported for SVMs are typically twice as large [18].

6.4. IJCNN

As a final experiment, we compare fPASS-GP, RSVM and IVM on common ground. For this purpose we use the IJCNN dataset, which is widely used by the SVM research community. It consists of 49 990 training examples and 91 701 test examples, each observation having 22 features. We consider $M = \{100, 200, \ldots, 1000\}$ with a squared exponential covariance function and fixed hyperparameters, the latter using the values suggested in [3], that is, $\theta = [1\ 1/8\ 1/16]$ for fPASS-GP and IVM, and $\theta = [1\ 1/8\ 0]$, $C = 16$ for RSVM. For IVM we include a linear term, as in the previous experiments, with $\theta_4 = 1$. Each setting was repeated 10 times to collect statistics. Fig. 6 summarizes the results obtained. More specifically, Fig. 6(a) shows the mean classification error as a function of the active set size. We can see that fPASS-GP is slightly better than RSVM and IVM over the entire range of M, and seems particularly good for small values of M. When we plot mean errors as a function of running time, as a proxy for the computational cost, we see that there are two regimes: one for small values of M where fPASS-GP outperforms RSVM and IVM, and another where the cubic complexity of the GPCs starts hurting fPASS-GP, leaving RSVM and IVM with a better error-cost trade-off.

Fig. 6. Error rates and run times for IJCNN data. (a) Mean classification error as a function of the active set size using fPASS-GP, RSVM and IVM. (b) Mean classification error as a function of the run time. Error bars correspond to standard deviations computed over 10 repetitions of the experiment.

7. Discussion

We have proposed a framework for active set selection in GPC. The core of our active set update rule is that the predictive distribution of a GPC can be used to quantify the relative weight of points in the active set, which can be marked for deletion, and of points in the inactive set with low predictive probabilities, which makes them ideal for inclusion. The algorithmic skeleton of our framework consists of two alternating steps, namely active set updates and hyperparameter optimization. We designed two active set update criteria that target two different practical scenarios. The first, which we call PASS-GP, focuses on the interpretability of the parameters of the update rule by thresholding the predictive distributions of the GPC. The second acknowledges that in some applications having a fixed computational cost is key; fPASS-GP therefore keeps the size of the active set fixed so the overall cost and memory requirements can be known beforehand.

We presented theoretical and practical support showing that our active set selection strategy is efficient while still retaining the most appealing benefits of GPC: prediction uncertainty, model selection, leverage of prior knowledge and state-of-the-art performance. Compared to other approximative methods, PASS-GP provides better results, although it is slower than IVM [8] and RSVM [3]. We did not consider any auxiliary set method like FITC [4] because for tasks of the size of, for example, MNIST or IJCNN it is prohibitive. Additionally, we have noticed in practice that our approximation is quite insensitive to the initial active set selection, and also that more than two or three passes through the data yield neither improved performance nor larger active sets. The code used in this work is based on the Matlab toolbox provided with [2] and is publicly available at http://cogsys.imm.dtu.dk/passgp.

A less satisfying feature of active set approximations is that we ignore some of the training data. Although some of our findings on the USPS dataset actually suggest that this can be beneficial for performance, it is of interest to develop a modified version where the inactive set is used, approximately, in a cost-efficient way. The representer theorem for the mean prediction and the approximations to the marginal likelihood discussed in this paper might provide inspiration for such methods. In conclusion, efficient methods for GPs are still much needed when data is abundant, as in ordinal regression for collaborative filtering.

References

[1] J. Quiñonero-Candela, C.E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, J. Mach. Learn. Res. 6 (2005) 1939–1959.

[2] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, Cambridge, MA, 2006.

[3] S.S. Keerthi, O. Chapelle, D. DeCoste, Building support vector machines with reduced classifier complexity, J. Mach. Learn. Res. 7 (2006) 1493–1515.

[4] A. Naish-Guzman, S. Holden, The generalized FITC approximation, in: J.C. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems, vol. 20, The MIT Press, Cambridge, MA, 2008, pp. 1057–1064.

[5] T. Joachims, C.-N.J. Yu, Sparse kernel SVMs via cutting-plane training, Mach. Learn. 76 (2009) 179–193.

[6] A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Gaussian processes for object categorization, Int. J. Comput. Vision 88 (2) (2010) 169–188.

[7] R. Henao, O. Winther, PASS-GP: predictive active set selection for Gaussian processes, in: 2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2010, pp. 148–153.

[8] N.D. Lawrence, M. Seeger, R. Herbrich, Fast sparse Gaussian process methods: the informative vector machine, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems, vol. 15, The MIT Press, Cambridge, MA, 2003, pp. 600–616.

[9] M. Seeger, Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations, Ph.D. Thesis, University of Edinburgh, 2003.

[10] N.D. Lawrence, J.C. Platt, M.I. Jordan, Extensions of the informative vector machine, in: J. Winkler, N.D. Lawrence, M. Niranjan (Eds.), Proceedings of the Sheffield Machine Learning Workshop, Springer-Verlag, Berlin, 2005.

[11] M. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, vol. 5, 2009, pp. 567–574.

[12] Z. Zhang, G. Dai, M.I. Jordan, Bayesian generalized kernel mixed models, J. Mach. Learn. Res. 12 (2011) 111–139.

[13] M. Kuss, C.E. Rasmussen, Assessing approximate inference for binary Gaussian process classification, J. Mach. Learn. Res. 6 (2005) 1679–1704.

[14] M. Opper, O. Winther, Gaussian processes for classification: mean-field algorithms, Neural Comput. 12 (11) (2000) 2655–2684.

[15] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Y. Weiss, B. Schölkopf, J.C. Platt (Eds.), Advances in Neural Information Processing Systems, vol. 18, The MIT Press, 2006.

[16] L. Csató, Gaussian Processes—Iterative Sparse Approximations, Ph.D. Thesis, Aston University, 2002.

[17] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, The MIT Press, 2001.

[18] D. DeCoste, B. Schölkopf, Training invariant support vector machines, Mach. Learn. 46 (1–3) (2002) 161–190.

Ricardo Henao received a degree in electronic engineering and a masters in industrial automation from the Universidad Nacional de Colombia, Manizales, in 2002 and 2004, respectively. He is currently a Ph.D. student at DTU Informatics, Technical University of Denmark. His research interests include graphical model learning and supervised modeling with applications to bioinformatics.

Ole Winther works in machine learning research with applications in bioinformatics, data mining, green technology and collaborative filtering. Ole Winther, M.Sc. in physics and computer science ('94) and Ph.D. in physics ('98), both at the University of Copenhagen (KU), has published more than 60 scientific papers and has previously held positions at Lund University and the Center for Biological Sequence Analysis, Technical University of Denmark (DTU). He currently holds joint positions as associate professor at Cognitive Systems, DTU Informatics, and group leader at Bioinformatics, KU.