Lecture 3: Gaussian Processes
Jens Kober
Knowledge-Based Control Systems (SC42050)
Cognitive Robotics
3mE, Delft University of Technology, The Netherlands
19-02-2018
Outline
1 Properties
2 Gaussian distributions
3 Inference
4 A different representation
5 Gaussian processes
6 Kernels
7 Applications
Lecture based on lectures by David MacKay and by Oliver Stegle & Karsten Borgwardt
Function Approximators
[Figure: map of function-approximator parameterizations for controllers and process models, including radial basis functions, splines, wavelets, fuzzy systems, and sigmoidal neural networks]
Properties of Gaussian Processes
• nonlinear regression
• accurate and flexible
• predictions with error bars
• non-parametric
[Figure: GP regression example showing the mean prediction with error bars]
1D Gaussian Distribution
• Normal distribution
• Probability density function:
\mathcal{N}(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}
[Figure: 1D Gaussian probability density]
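A tiny numpy sketch of this density, included only as a reference implementation of the formula above:

```python
import numpy as np

def normal_pdf(y, mu=0.0, sigma=1.0):
    """1-D Gaussian density N(y | mu, sigma)."""
    return np.exp(-0.5 * (y - mu)**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
```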
nD Gaussian Distribution
• Probability density function:
\mathcal{N}(y \mid \mu, K) = \frac{1}{\sqrt{\det(2\pi K)}}\, e^{-\frac{1}{2}(y-\mu)^T K^{-1} (y-\mu)}
• K is the covariance matrix (or kernel matrix)
[Figure: surface plot of a 2D Gaussian density over (y_1, y_2)]
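The same density, sketched for the multivariate case; slogdet and solve stand in for forming det(2πK) and K⁻¹ explicitly, a numerical-stability choice not taken from the slides:

```python
import numpy as np

def mvn_pdf(y, mu, K):
    """n-D Gaussian density N(y | mu, K)."""
    d = y - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * K)   # log det(2 pi K)
    return np.exp(-0.5 * d @ np.linalg.solve(K, d) - 0.5 * logdet)
```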
2D Gaussian Distribution: Contour & Samples
[Figure: contour plot and samples of a 2D Gaussian over (y_1, y_2)]
2D Gaussian Distribution: Covariance Matrices
K is positive-semidefinite and symmetric
[Figure: contour plots of three 2D Gaussians over (y_1, y_2), one per covariance matrix below]
K = \begin{bmatrix} 1 & 0.1 \\ 0.1 & 1 \end{bmatrix} \qquad K = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \qquad K = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}
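A small numpy sketch that draws samples under each of these covariance matrices; the empirical covariance of the samples should approach the corresponding K:

```python
import numpy as np

rng = np.random.default_rng(0)
for K in ([[1, 0.1], [0.1, 1]],
          [[1, 0.5], [0.5, 1]],
          [[1, -0.9], [-0.9, 1]]):
    samples = rng.multivariate_normal(np.zeros(2), np.array(K), size=500)
    print(np.cov(samples.T).round(2))   # close to K for many samples
```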
Inference
[Figure: contour plot of a conditional 2D Gaussian over (y_1, y_2)]
Inference
• Joint probability:
p(y_1, y_2 \mid \mu, K) = \mathcal{N}([y_1, y_2] \mid \mu, K)
• Conditional probability given the observation y_1 = a:
p(y_2 \mid y_1, \mu, K) = \frac{p(y_1, y_2 \mid \mu, K)}{p(y_1 \mid \mu, K)} = \mathcal{N}(y_2 \mid \bar{\mu}, \bar{K})
• Partitioning:
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}
• New mean and covariance:
\bar{\mu} = \mu_2 + K_{21} K_{11}^{-1} (a - \mu_1) \qquad \bar{K} = K_{22} - K_{21} K_{11}^{-1} K_{12}
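A minimal numpy sketch (not from the slides) of these conditioning formulas; the values of mu, K, and a are made up for illustration:

```python
# Conditioning a 2D Gaussian on y_1 = a, following the formulas above.
import numpy as np

mu = np.array([1.0, 2.0])        # [mu_1, mu_2]
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])       # [[K_11, K_12], [K_21, K_22]]
a = 1.5                          # observed value of y_1

mu_bar = mu[1] + K[1, 0] / K[0, 0] * (a - mu[0])   # mu_2 + K_21 K_11^{-1} (a - mu_1)
K_bar = K[1, 1] - K[1, 0] / K[0, 0] * K[0, 1]      # K_22 - K_21 K_11^{-1} K_12

print(mu_bar, K_bar)             # mean and variance of p(y_2 | y_1 = a)
```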
A Different Representation
[Figure: the same 2D Gaussian shown as a contour plot over (y_1, y_2) and, alternatively, with each sample plotted against its dimension index]
Two samples: y = [0.86, 2.00] and y = [1.78, 2.22]
Higher-Dimensional Gaussians
[Figure: samples of a 3D Gaussian and a 10D Gaussian plotted against the dimension index]
2D Conditional
[Figure: conditional 2D Gaussian shown as a contour plot and against the dimension index, with the conditional mean and 2-std error bars]
10D Conditional
[Figure: conditional 10D Gaussian plotted against the dimension index, with conditional means and 2-std error bars]
• Looks like nonlinear regression!
• Where did K come from?
10D Conditional: K
K =
\begin{bmatrix}
1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 & 0.00 & 0.00 \\
0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 & 0.00 \\
0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 \\
0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 \\
0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 \\
0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 \\
0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 \\
0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 \\
0.00 & 0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 \\
0.00 & 0.00 & 0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00
\end{bmatrix}
• Diagonal band structure
• Generated by a covariance function (kernel function):
K_{i,j} = k(x_i, x_j; \theta_K)
• Here the squared exponential kernel:
k_{\text{SE}}(x_i, x_j; \sigma_f, l) = \sigma_f^2\, e^{-\frac{1}{2 l^2}(x_i - x_j)^2}
(hyperparameters: l horizontal lengthscale, \sigma_f vertical lengthscale)
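As a check, a short numpy sketch assuming inputs x = 1, ..., 10: with sigma_f = 1 and l = 2 the squared exponential kernel reproduces the entries of the matrix above (these hyperparameter values are inferred from the entries, not stated on the slide):

```python
# Building the kernel matrix K_ij = k_SE(x_i, x_j) for inputs 1..10.
# sigma_f = 1 and l = 2 are inferred from the matrix entries shown above
# (e.g. exp(-0.5 * 1 / 4) = 0.88 for neighbouring inputs).
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=2.0):
    """Squared exponential kernel k_SE(x_i, x_j; sigma_f, l)."""
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

x = np.arange(1, 11, dtype=float)
K = k_se(x[:, None], x[None, :])   # broadcast to the full 10x10 matrix
print(K.round(2))                  # matches the matrix on this slide
```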
Definition
Gaussian Process
A Gaussian process is a collection of random variables with the property that the joint distribution of any finite subset is a Gaussian.
Function Space View
• Kernel function and hyperparameters = prior belief on function properties (smoothness, lengthscales, etc.)
• Construct a joint Gaussian for an arbitrary selection of input locations X
• Prior on the (infinite-dimensional) function f(x):
p(f(x)) = \mathcal{GP}(f(x) \mid k)
• Prior on function values f = (f_1, \ldots, f_N):
p(f \mid X, \theta_K) = \mathcal{N}(f \mid 0, K_{X,X}(\theta_K))
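A short sketch, assuming numpy, of drawing function values from this prior at a finite set of input locations; the jitter term is a standard numerical fix, not part of the model:

```python
# Sampling f ~ N(0, K_{X,X}) at a finite set of input locations X.
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=1.0):
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

X = np.linspace(-5, 5, 100)
K = k_se(X[:, None], X[None, :]) + 1e-9 * np.eye(len(X))  # jitter: K is ill-conditioned
f = np.random.default_rng(1).multivariate_normal(np.zeros(len(X)), K, size=3)
# each row of f is one draw of the function values at the inputs X
```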
Noise-free Observations
• Noise-free training data D = \{x_n, f_n\}_{n=1}^{N}
• Want to predict f^* at test points X^*
• Joint distribution of f and f^*:
p([f, f^*] \mid X, X^*, \theta_K) = \mathcal{N}\left([f, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
(dependence of K on \theta_K dropped for brevity)
• Real data is rarely noise-free
Inference
• Noisy training data D = \{X, y\}
• Joint distribution of latent function values f and f^* given y:
p([f, f^*] \mid X, X^*, y, \theta_K, \sigma^2) \propto \underbrace{\mathcal{N}\left([f, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)}_{\text{Prior}} \, \underbrace{\prod_{n=1}^{N} \mathcal{N}\left(y_n \mid f_n, \sigma^2\right)}_{\text{Likelihood}}
• Integrating out f yields
p([y, f^*] \mid X, X^*, \theta_K, \sigma^2) \propto \mathcal{N}\left([y, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} + \sigma^2 I & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
Predictions
• Turn the joint distribution into a conditional distribution
• Joint distribution:
p([y, f^*] \mid X, X^*, \theta_K, \sigma^2) \propto \mathcal{N}\left([y, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} + \sigma^2 I & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
• Gaussian predictive distribution for f^*:
p(f^* \mid X, X^*, y, \theta_K, \sigma^2) = \mathcal{N}(f^* \mid \mu^*, \Sigma^*) with
\mu^* = K_{X^*,X} \left(K_{X,X} + \sigma^2 I\right)^{-1} y
\Sigma^* = K_{X^*,X^*} + \sigma^2 I - K_{X^*,X} \left(K_{X,X} + \sigma^2 I\right)^{-1} K_{X,X^*}
• Matrix inverse (training) is expensive
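A minimal numpy sketch of these predictive equations; it uses a Cholesky factorization in place of the explicit matrix inverse, a standard implementation choice rather than anything prescribed by the slides:

```python
# GP prediction: mu* = K_{X*,X} (K_{X,X} + sigma^2 I)^{-1} y and
# Sigma* = K_{X*,X*} + sigma^2 I - K_{X*,X} (K_{X,X} + sigma^2 I)^{-1} K_{X,X*}.
import numpy as np

def gp_predict(X, y, X_star, k, sigma2):
    K = k(X[:, None], X[None, :]) + sigma2 * np.eye(len(X))
    K_s = k(X[:, None], X_star[None, :])         # K_{X,X*}
    K_ss = k(X_star[:, None], X_star[None, :])   # K_{X*,X*}
    L = np.linalg.cholesky(K)                    # the O(N^3) "training" step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s)
    mu_star = K_s.T @ alpha
    Sigma_star = K_ss + sigma2 * np.eye(len(X_star)) - v.T @ v
    return mu_star, Sigma_star

# usage: mu_star, Sigma_star = gp_predict(X, y, X_star, k_se, 0.01)
```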
Example
[Figure: GP prediction with 2-std error bars, true function, and training data]
Kernel Parameters
horizontal lengthscale (too narrow, too wide)
[Figures: GP predictions with a horizontal lengthscale chosen too narrow and too wide, each showing prediction (2 std), true function, and training data]
→ optimize kernel parameters as part of training
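One common way to do this, sketched below under the assumption that "training" means maximizing the log marginal likelihood log p(y | X, θ) of an SE kernel with noise (the slide does not spell out the objective); scipy's generic minimizer stands in for whatever optimizer the course uses, and the data here are made up:

```python
# Tuning (sigma_f, l, sigma) by maximizing the log marginal likelihood.
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    sigma_f, l, sigma = np.exp(log_theta)        # log-parameters stay positive
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)
    K += sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log N(y | 0, K) up to the constant (N/2) log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))

rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
sigma_f, l, sigma = np.exp(res.x)
```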
Kernels
• Squared exponential
k_{\text{SE}}(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2\, l_d^{-2}\right)
• Linear
k_{\text{lin}}(x_i, x_j) = \sigma_o^2 + x_i x_j
• Brownian motion (Wiener process)
k_{\text{Brownian}}(x_i, x_j) = \min(x_i, x_j)
• Periodic
k_{\text{periodic}}(x_i, x_j) = \exp\left(-2 \sin^2\left(\frac{x_i - x_j}{2}\right) \lambda^{-2}\right)
• Many more (they need to correspond to an inner product k(x_i, x_j) = \phi(x_i)^T \phi(x_j))
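A minimal sketch of these kernels as numpy functions of scalar (or broadcastable) inputs; the SE version is the 1-D case, and the periodic lengthscale λ is written as lam:

```python
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=1.0):            # squared exponential
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

def k_lin(xi, xj, sigma_o=1.0):                  # linear
    return sigma_o**2 + xi * xj

def k_brownian(xi, xj):                          # Brownian motion (Wiener process)
    return np.minimum(xi, xj)

def k_periodic(xi, xj, lam=1.0):                 # periodic
    return np.exp(-2.0 * np.sin(0.5 * (xi - xj))**2 / lam**2)
```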
Examples of priors
[Figures: samples of the squared exponential prior, the Brownian motion prior, and the linear prior]
Constructing Kernels
• Sum
k_{\text{sum}}(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)
addition of independent processes
• Product
k_{\text{prod}}(x_i, x_j) = k_1(x_i, x_j)\, k_2(x_i, x_j)
product of independent processes
• Convolution
k_{\text{conv}}(x_i, x_j) = \int \int h(x_i, z_i)\, k(z_i, z_j)\, h(x_j, z_j)\, \mathrm{d}z_i\, \mathrm{d}z_j
blurring with kernel h
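A short sketch of the sum and product constructions as higher-order functions, assuming the kernel functions from the previous snippet are in scope:

```python
def k_sum(k1, k2):
    """Kernel of the sum of two independent processes."""
    return lambda xi, xj: k1(xi, xj) + k2(xi, xj)

def k_prod(k1, k2):
    """Kernel of the product of two independent processes."""
    return lambda xi, xj: k1(xi, xj) * k2(xi, xj)

# e.g. a locally periodic kernel: periodic correlation that decays with distance
k_locally_periodic = k_prod(k_periodic, k_se)
```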
Active Learning
select next training point (here the one with highest uncertainty)
[Figures: three rounds of active learning, each showing prediction (2 std), true function, training data, and the next data point]
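A minimal sketch of the selection rule above, assuming the gp_predict function from the Predictions slide: evaluate the predictive uncertainty on a candidate grid and query where it is largest.

```python
import numpy as np

def next_query_point(X, y, X_candidates, k, sigma2):
    """Return the candidate input with the highest predictive uncertainty."""
    _, Sigma_star = gp_predict(X, y, X_candidates, k, sigma2)
    std = np.sqrt(np.diag(Sigma_star))
    return X_candidates[np.argmax(std)]
```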
Bayesian Optimization [1]
https://youtu.be/dOux-bXFCU4
[1] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe (May 2016). "Automatic LQR Tuning Based on Gaussian Process Global Optimization". In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Summary
• Easy to use
• Model with an infinite number of parameters
• State-of-the-art performance
• Many standard regression methods are special cases of GPs
  – Radial basis functions
  – Splines
  – Linear (ridge) regression
  – Feed-forward neural networks with one hidden layer
• Problems
  – Ill-conditioned
  – N^3 complexity is bad news for N > 1000 → approximate/sparse methods