TRANSCRIPT
Sparse Gaussian Process Classification
With Multiple Classes
Matthias W. Seeger, Michael I. Jordan
University of California, Berkeley
www.cs.berkeley.edu/~mseeger
Gaussian Processes are different
Kernel Machines:
- Estimate a single "best" function to solve the problem
Bayesian Gaussian Processes:
- Inference over random functions: mean predictions plus uncertainty estimates
- Gives a posterior distribution over functions: more expressive
- Powerful empirical Bayesian model selection
- Combination into larger probabilistic structures
Harder to run, but worth it!
The Need for Linear Time
"So Gaussian Processes aim for more than Kernel Machines --- do they run much slower, then?"
Not necessarily (anymore)! GP multi-way classification:
- Linear in the number of datapoints
- Linear in the number of classes
- No artificial "output coding"
- Predictive uncertainties
- Empirical Bayesian model selection
Sparse GP Approximations
Lawrence, Seeger, Herbrich: IVM (NIPS 02)
- Home in on an active set $I \subset \{1,\dots,n\}$ of size $d \ll n$
- Replace the likelihood by a likelihood approximation: a Gaussian function of the active-set variables $u_I$ only
- Use an information criterion to find $I$ greedily (sketch below)
- Restricted to models with one process only (like other sparse GP methods)
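The slides contain no code, so here is a minimal Python/NumPy sketch of greedy active-set selection. It uses a simple variance-reduction score (pivoted incomplete Cholesky) as a stand-in for the IVM information-gain criterion; all names are ours.

    import numpy as np

    def greedy_active_set(K, d):
        """Greedily grow an active set I of size d << n by picking the
        point with the largest remaining prior variance (pivoted
        incomplete Cholesky); a stand-in for the IVM score."""
        n = K.shape[0]
        diag = K.diagonal().astype(float).copy()   # residual variances
        L = np.zeros((n, d))                       # partial Cholesky factor
        I = []
        for j in range(d):
            i = int(np.argmax(diag))               # most "informative" point
            I.append(i)
            col = (K[:, i] - L @ L[i]) / np.sqrt(diag[i])
            L[:, j] = col
            diag = np.maximum(diag - col**2, 0.0)  # update residuals
            diag[i] = 0.0
        return I, L                                # K is approximated by L @ L.T

In the actual method the score is an information gain computed from the posterior marginals (see the criterion slide below), not the prior variance.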
Multi-Class Models
Multinomial likelihood ("softmax"):
$P(y_i = c \mid u_i) = \exp(u_c(x_i)) / \sum_{c'} \exp(u_{c'}(x_i))$
- Use one process $u_c(\cdot)$ for each class
- Processes are independent a priori
- Different kernel $K^{(c)}$ for each class
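A small sketch of the model's two ingredients, assuming an (n, C) array of latent values and one kernel callable per class (hypothetical interfaces):

    import numpy as np

    def softmax_likelihood(U):
        """Multinomial ("softmax") likelihood. U has shape (n, C) with
        U[i, c] = u_c(x_i); returns the matrix of P(y_i = c | u_i)."""
        U = U - U.max(axis=1, keepdims=True)        # numerical stability
        E = np.exp(U)
        return E / E.sum(axis=1, keepdims=True)

    def sample_prior(X, kernels, rng):
        """Each class process u_c ~ GP(0, K^(c)) is drawn independently:
        the processes are coupled only through the likelihood."""
        n = X.shape[0]
        U = np.empty((n, len(kernels)))
        for c, kern in enumerate(kernels):
            Kc = kern(X, X) + 1e-9 * np.eye(n)      # jitter for stability
            U[:, c] = rng.multivariate_normal(np.zeros(n), Kc)
        return U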
"But That's Easy…"
… we thought back then, but consider the posterior covariance $A = (K^{-1} + \Pi)^{-1}$: the prior covariance $K$ is block-diagonal w.r.t. classes, while the likelihood precision $\Pi$ is block-diagonal w.r.t. datapoints.
Both are block-diagonal, but in different systems! Together, $A$ has no simple structure!
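The clash is easy to see numerically. A small demo (our construction): the prior is block-diagonal when variables are ordered class-by-class, the likelihood precision is block-diagonal when ordered point-by-point, and their combination is dense.

    import numpy as np

    n, C = 4, 3
    rng = np.random.default_rng(0)

    # Prior: block-diagonal over CLASSES (class-major variable ordering)
    K = np.zeros((n * C, n * C))
    for c in range(C):
        B = rng.standard_normal((n, n))
        K[c*n:(c+1)*n, c*n:(c+1)*n] = B @ B.T + np.eye(n)

    # Likelihood precision: block-diagonal over DATAPOINTS (point-major),
    # with the diagonal-minus-rank-1 softmax blocks from the next slide
    P = np.zeros((n * C, n * C))
    for i in range(n):
        pi = rng.dirichlet(np.ones(C))
        P[i*C:(i+1)*C, i*C:(i+1)*C] = np.diag(pi) - np.outer(pi, pi)

    # Permute K into point-major order, then combine
    perm = np.arange(n * C).reshape(C, n).T.ravel()
    A = np.linalg.inv(np.linalg.inv(K[np.ix_(perm, perm)]) + P)
    print(np.count_nonzero(np.abs(A) > 1e-10))   # dense: ~ (n*C)^2 entries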
Second Order Approximation
- The $u^{(c)}$ should be coupled a posteriori; a diagonal approximation is not useful
- The Hessian of $-\log P(y_i \mid u_i)$ has a simple form
- Allow the likelihood coupling to be represented exactly up to second order: site precision blocks of the form $\mathrm{diag}(\pi_i) - \pi_i \pi_i^T$, diagonal minus rank 1
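For the softmax likelihood the Hessian of $-\log P(y_i = c \mid u_i)$ is exactly $\mathrm{diag}(\pi_i) - \pi_i \pi_i^T$: diagonal minus rank 1, independent of the observed class. A quick finite-difference check:

    import numpy as np

    def softmax(u):
        e = np.exp(u - u.max())
        return e / e.sum()

    def nll_hessian(u):
        """Hessian of -log P(y = c | u): diag(pi) - pi pi^T."""
        pi = softmax(u)
        return np.diag(pi) - np.outer(pi, pi)

    u, c, eps = np.array([0.3, -1.2, 0.7, 0.1]), 2, 1e-5
    nll = lambda v: -np.log(softmax(v)[c])
    H = np.empty((4, 4))
    for j in range(4):
        for k in range(4):
            ej = np.eye(4)[j] * eps                # finite-difference steps
            ek = np.eye(4)[k] * eps
            H[j, k] = (nll(u+ej+ek) - nll(u+ej) - nll(u+ek) + nll(u)) / eps**2
    print(np.allclose(H, nll_hessian(u), atol=1e-4))   # True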
Subproblems
- Efficient representation exploiting the prior independence and the constrained form
- ADF projections onto constrained Gaussians to compute the site precision blocks
- Forward selection of $I$
- Extensions of the simple myopic scheme
- Model selection based on conditional inference
Representation
- Exploits the block-diagonal matrix structures
- Nontrivial to get the numerics right (Cholesky factors)
- Cost is dominated by "stub" buffers, used to compute the marginal moments
- Stubs are updated after each inclusion; linear in $n$ and $C$ in total
Restricted ADF Projection
Hard (non-convex) because constrained Use double-loop scheme: outer loop analytic,
inner loop convex very fast Initialization matters. Our choice can be
motivated from second order approximation (once more)
Information Gain Criterion
- The selection score measures the "informativeness" of candidates, given the current belief: the information gained about the posterior after inclusion of candidate $i$
- Favours points close to, or on the wrong side of, class boundaries
- Requires marginals computed from the stubs
- Score candidates prior to each inclusion (sketch below)
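A minimal sketch of one common instantiation of such a score, the entropy reduction in a candidate's C-dimensional marginal; the exact IVM criterion differs in its details:

    import numpy as np

    def gaussian_entropy(Sigma):
        """Differential entropy of a C-dimensional Gaussian N(m, Sigma)."""
        C = Sigma.shape[0]
        _, logdet = np.linalg.slogdet(Sigma)
        return 0.5 * (C * (1.0 + np.log(2.0 * np.pi)) + logdet)

    def info_gain_score(Sigma_before, Sigma_after):
        """Entropy reduction from including a candidate: points the
        current belief is uncertain about score highest."""
        return gaussian_entropy(Sigma_before) - gaussian_entropy(Sigma_after)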
Extensions of Myopic Scheme
[Diagram: the active set I is split into a solid set and a liquid set; a new point i enters the liquid set (inclusion), and the oldest liquid point is moved into the solid set (freezing).]
- Solid set: growing; fixed site parameters (for efficiency)
- Liquid set: fixed size; site parameters iteratively updated using EP
Overview: Inference Algorithm
- Inclusion phase: include the pattern; move the oldest liquid point to the solid active set
- EP phase: run EP updates iteratively on the liquid set site parameters
- Selection phase: compute marginals, score O(n/C) candidates, select the winner
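A Python skeleton of this three-phase loop, with placeholder callbacks include/ep_update/score standing in for the real site and stub computations (all names are ours):

    from collections import deque

    def run_inference(n, d_final, L, n_cand, include, ep_update, score):
        """Three-phase loop: selection, inclusion (with freezing into the
        solid set), then EP refinement of the liquid site parameters."""
        solid, liquid = [], deque()              # frozen vs. EP-refined sites
        remaining = list(range(n))
        while remaining and len(solid) + len(liquid) < d_final:
            # Selection phase: score a subset of candidates from marginals
            cand = remaining[:n_cand]            # e.g. O(n/C) candidates
            winner = max(cand, key=score)
            remaining.remove(winner)
            # Inclusion phase: include winner; oldest liquid site freezes
            include(winner)
            liquid.append(winner)
            if len(liquid) > L:
                solid.append(liquid.popleft())
            # EP phase: iterate EP updates on liquid site parameters
            for j in liquid:
                ep_update(j)
        return solid + list(liquid)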
Model Selection
- Use a variational bound on the marginal likelihood, based on the inference approximation
- Gradient costs: inference plus …
- Minimize using quasi-Newton, reselecting $I$ and the site parameters for new search directions (a non-standard optimization problem)
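A sketch of this outer optimization, assuming hypothetical callbacks: refit reruns sparse inference (reselecting I and the site parameters), and bound_and_grad returns the variational bound and its gradient at the current kernel parameters:

    from scipy.optimize import minimize

    def select_hyperparameters(theta0, refit, bound_and_grad):
        """Quasi-Newton (L-BFGS) minimization of the variational bound.
        Because I and the site parameters are reselected per evaluation,
        the objective itself moves: a non-standard optimization problem."""
        def objective(theta):
            refit(theta)                     # reselect I, site parameters
            return bound_and_grad(theta)     # (bound value, gradient)
        return minimize(objective, theta0, jac=True, method="L-BFGS-B").x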
Preliminary Experiments
- Small part of MNIST (even digits, $C = 5$, $n = 800$)
- No model selection (MS not yet tested); all $K^{(c)}$ the same
- $d_{\mathrm{final}} = 150$, $L = 25$ (liquid set size)
Preliminary Experiments (2)-(4)
[Results plots]
Future Experiments
- Much larger experiments are in preparation, including model selection
- They use a novel, powerful object-oriented Matlab/C++ interface:
  - Control over very large persistent C++ objects from Matlab
  - Faster transition from prototype (Matlab) to product (C++)
  - Powerful matrix classes (masking, LAPACK/BLAS)
  - Optimization code
- Will be released into the public domain
Future Work
- Experiments on much larger tasks
- Model selection with independent, heavily parameterized kernels (ARD, …)
- The present scheme cannot be used for large C
Future Work (2)
Gaussian process priors in large structured networks:
- Gaussian process conditional random fields, …
- Previous work addresses function "point estimation"; we aim for GP inference including uncertainty estimates
- Have to deal with a huge random field: correlations not only between datapoints, but also along time
- Automatic factorizations will be crucial
- The multi-class scheme will be a major building block