
Page 1: Kernels for Automatic Pattern Discovery and Extrapolation

Kernels for Automatic Pattern Discovery and Extrapolation

Andrew Gordon Wilson
[email protected]
mlg.eng.cam.ac.uk/andrew
University of Cambridge

Joint work with Ryan Adams (Harvard)

1 / 21

Page 2: Kernels for Automatic Pattern Discovery and Extrapolation

Pattern Recognition

[Figure: CO2 concentration (ppm), 320–400, plotted against year, 1968–2004.]

2 / 21

Page 3: Kernels for Automatic Pattern Discovery and Extrapolation

Pattern Recognition

[Figure: CO2 concentration (ppm), 300–400, plotted against year, now extended to 2013.]

3 / 21

Page 4: Kernels for Automatic Pattern Discovery and Extrapolation

Pattern Recognition

[Figure: a) observations plotted against input x, over −15 ≤ x ≤ 15.]

4 / 21

Page 5: Kernels for Automatic Pattern Discovery and Extrapolation

Gaussian processes

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

- Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x1), . . . , f(xN)) ∼ N(µ, K), with µi = m(xi) and Kij = cov(f(xi), f(xj)) = k(xi, xj).

p(f(x) | D) ∝ p(D | f(x)) p(f(x))
(GP posterior ∝ likelihood × GP prior)

[Figure: a) Gaussian process sample prior functions; b) Gaussian process sample posterior functions. Both panels plot output f(t) against input t, for −10 ≤ t ≤ 10.]
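The prior-to-posterior update above can be sketched numerically. The following is a minimal illustration of GP regression with an SE kernel; the toy data, lengthscale, and jitter value are hypothetical choices for illustration, not from the slides:

```python
import numpy as np

def k_se(x1, x2, ell=1.0):
    """Squared exponential kernel between two 1-D input sets."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * d**2 / ell**2)

# Toy training data (hypothetical)
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X)
Xs = np.linspace(-10, 10, 200)     # test inputs
jitter = 1e-6                      # small noise variance for stability

K = k_se(X, X) + jitter * np.eye(len(X))
Ks = k_se(Xs, X)
Kss = k_se(Xs, Xs)

# Standard GP posterior mean and covariance
alpha = np.linalg.solve(K, y)
mu = Ks @ alpha
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
```

With near-zero noise, the posterior mean interpolates the observations, and the posterior variance collapses near the training inputs, matching the behavior of the sampled posterior functions in the figure.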

5 / 21

Page 6: Kernels for Automatic Pattern Discovery and Extrapolation

Gaussian Process Covariance Kernels

Let τ = x − x′:

k_SE(τ) = exp(−0.5 τ²/ℓ²)    (1)

k_MA(τ) = a (1 + √3 τ/ℓ) exp(−√3 τ/ℓ)    (2)

k_RQ(τ) = (1 + τ²/(2αℓ²))^(−α)    (3)

k_PE(τ) = exp(−2 sin²(π τ ω)/ℓ²)    (4)
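These four standard kernels are straightforward to evaluate. A minimal sketch, with arbitrary illustrative hyperparameter values for ℓ, a, α, and ω:

```python
import numpy as np

ell, a, alpha, omega = 1.0, 1.0, 1.0, 1.0  # illustrative hyperparameters

def k_se(tau):
    """Squared exponential, eq. (1)."""
    return np.exp(-0.5 * tau**2 / ell**2)

def k_ma(tau):
    """Matern (nu = 3/2 form), eq. (2)."""
    r = np.sqrt(3.0) * np.abs(tau) / ell
    return a * (1.0 + r) * np.exp(-r)

def k_rq(tau):
    """Rational quadratic, eq. (3)."""
    return (1.0 + tau**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def k_pe(tau):
    """Periodic, eq. (4); period 1/omega."""
    return np.exp(-2.0 * np.sin(np.pi * tau * omega)**2 / ell**2)
```

Each kernel equals 1 at τ = 0 (with a = 1) and decays with |τ|, except the periodic kernel, which returns to 1 at every multiple of its period.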

6 / 21

Page 7: Kernels for Automatic Pattern Discovery and Extrapolation

CO2 Extrapolation with Standard Kernels

[Figure: CO2 concentration (ppm), 320–400, plotted against year, with extrapolations from the standard kernels.]

7 / 21

Page 8: Kernels for Automatic Pattern Discovery and Extrapolation

Gaussian processes

“How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?”

David MacKay, 1998.

8 / 21

Page 9: Kernels for Automatic Pattern Discovery and Extrapolation

More Expressive Covariance Functions

[Diagram: input x feeds latent node functions f̂1(x), . . . , f̂q(x), which are combined through weight functions W11(x), . . . , Wpq(x) to produce outputs y1(x), . . . , yp(x).]

Gaussian Process Regression Networks. Wilson et al., ICML 2012.

9 / 21

Page 10: Kernels for Automatic Pattern Discovery and Extrapolation

Gaussian Process Regression Network

[Figure: Gaussian process regression network predictions plotted over longitude and latitude, with a color scale.]

10 / 21

Page 11: Kernels for Automatic Pattern Discovery and Extrapolation

Expressive Covariance Functions

- GPs in Bayesian neural-network-like architectures (Salakhutdinov and Hinton, 2008; Wilson et al., 2012; Damianou and Lawrence, 2012). Task-specific, difficult inference, no closed-form kernels.

- Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011; Rasmussen and Williams, 2006). In the general case, difficult to interpret, difficult inference, struggle with over-fitting.

One can learn almost nothing about the covariance function of a stochastic process from a single realization if the covariance function could be any positive definite function. Most commonly, one therefore restricts attention to stationary kernels, meaning that covariances are invariant to translations in the input space.

11 / 21

Page 12: Kernels for Automatic Pattern Discovery and Extrapolation

Bochner’s Theorem

Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary, mean-square continuous, complex-valued random process on R^P if and only if it can be represented as

k(τ) = ∫_{R^P} e^{2πi sᵀτ} ψ(ds) ,    (5)

where ψ is a positive finite measure.

If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:

k(τ) = ∫ S(s) e^{2πi sᵀτ} ds ,    (6)

S(s) = ∫ k(τ) e^{−2πi sᵀτ} dτ .    (7)

12 / 21

Page 13: Kernels for Automatic Pattern Discovery and Extrapolation

Idea

k and S are Fourier duals:

k(τ) = ∫ S(s) e^{2πi sᵀτ} ds ,    (8)

S(s) = ∫ k(τ) e^{−2πi sᵀτ} dτ .    (9)

- If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy.

- We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy.

- A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components.
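The duality in (8)–(9) can be checked numerically. The sketch below, assuming NumPy, approximates S(s) for the SE kernel k(τ) = exp(−τ²/(2ℓ²)) by a Riemann sum and compares it with the known closed form √(2π) ℓ exp(−2π²ℓ²s²):

```python
import numpy as np

# Spectral density of the SE kernel via eq. (9), computed numerically.
ell = 1.0
tau = np.linspace(-20.0, 20.0, 4001)   # wide grid; k has decayed to ~0 at the ends
dtau = tau[1] - tau[0]
k = np.exp(-0.5 * tau**2 / ell**2)     # SE kernel

s = np.linspace(-2.0, 2.0, 81)
# S(s) = ∫ k(τ) e^{−2πi s τ} dτ, approximated by a Riemann sum
S_num = np.array([(k * np.exp(-2j * np.pi * si * tau)).sum() * dtau
                  for si in s]).real

# Known Fourier transform of a Gaussian
S_exact = np.sqrt(2.0 * np.pi) * ell * np.exp(-2.0 * np.pi**2 * ell**2 * s**2)
```

The numerically recovered spectral density is itself a Gaussian centered at s = 0, which is why a mixture of Gaussians in the spectral domain is such a natural modeling choice.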

13 / 21

Page 14: Kernels for Automatic Pattern Discovery and Extrapolation

Kernels for Pattern Discovery

Let τ = x − x′ ∈ R^P. From Bochner’s Theorem,

k(τ) = ∫_{R^P} S(s) e^{2πi sᵀτ} ds    (10)

For simplicity, assume τ ∈ R^1 and let

S(s) = [N(s; µ, σ²) + N(−s; µ, σ²)]/2 .    (11)

Then

k(τ) = exp{−2π²τ²σ²} cos(2πτµ) .    (12)

More generally, if S(s) is a symmetrized mixture of diagonal-covariance Gaussians on R^P, with the qth component having covariance matrix M_q = diag(v_q^(1), . . . , v_q^(P)), then

k(τ) = Σ_{q=1}^{Q} w_q Π_{p=1}^{P} exp{−2π² τ_p² v_q^(p)} cos(2π τ_p µ_q^(p)) .    (13)
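Equation (13) is simple to implement directly. A sketch for one-dimensional inputs (P = 1), where the function name and array layout are my own illustrative choices:

```python
import numpy as np

def sm_kernel(tau, w, mu, v):
    """Spectral mixture kernel for 1-D inputs, eq. (13) with P = 1.

    w  : mixture weights, shape (Q,)
    mu : spectral means, shape (Q,)
    v  : spectral variances, shape (Q,)
    """
    tau = np.asarray(tau, dtype=float)[..., None]  # broadcast over Q components
    comp = (np.exp(-2.0 * np.pi**2 * tau**2 * v)
            * np.cos(2.0 * np.pi * tau * mu))
    return (w * comp).sum(axis=-1)
```

With Q = 1 and unit weight this reduces exactly to eq. (12); with larger Q, the weights, means, and variances become the hyperparameters learned from data.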

14 / 21

Page 15: Kernels for Automatic Pattern Discovery and Extrapolation

Results, CO2

15 / 21

Page 16: Kernels for Automatic Pattern Discovery and Extrapolation

Results, Reconstructing Standard Covariances

16 / 21

Page 17: Kernels for Automatic Pattern Discovery and Extrapolation

Results, Negative Covariances

17 / 21

Page 18: Kernels for Automatic Pattern Discovery and Extrapolation

Results, Sinc Pattern

18 / 21

Page 19: Kernels for Automatic Pattern Discovery and Extrapolation

Results, Airline Passengers

19 / 21

Page 20: Kernels for Automatic Pattern Discovery and Extrapolation

Gaussian Process Kernels for Automatic Pattern Discovery

- Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation.

- We introduce new, simple, closed-form kernels, which can be used with Gaussian processes to enable automatic pattern discovery and extrapolation. Code available at: http://mlg.eng.cam.ac.uk/andrew

- These kernels form a basis for all stationary covariance functions.

- These kernels can be used with non-Bayesian methods.

- In the future, it would be interesting to reverse engineer covariance functions that are induced by expressive architectures, like deep neural networks, to develop powerful interpretable models with simple inference procedures and closed-form kernels.

20 / 21

Page 21: Kernels for Automatic Pattern Discovery and Extrapolation

Results

Table: We compare the test performance of the proposed spectral mixture (SM) kernel with squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels. The SM kernel consistently has the lowest mean squared error (MSE) and highest log likelihood (L).

                   SM        SE      MA      RQ      PE
CO2       MSE      9.5       1200    1200    980     1200
          L        170       −320    −240    −100    −1800
NEG COV   MSE      62        210     210     210     210
          L        −25       −70     −70     −70     −70
SINC      MSE      0.000045  0.16    0.10    0.11    0.05
          L        3900      2000    1600    2000    600
AIRLINE   MSE      460       43000   37000   4200    46000
          L        −190      −260    −240    −280    −370

21 / 21