Lecture 3: Gaussian Processes
Jens Kober
Knowledge-Based Control Systems (SC42050)
Cognitive Robotics
3mE, Delft University of Technology, The Netherlands
19-02-2018
Outline
1 Properties
2 Gaussian distributions
3 Inference
4 A different representation
5 Gaussian processes
6 Kernels
7 Applications
Lecture based on lectures by David MacKay and by Oliver Stegle & Karsten Borgwardt
Function Approximators
[Figure: map of function-approximator parameterizations for controllers and process models, including radial basis functions, splines, wavelets, fuzzy systems, and sigmoidal neural networks]
Properties of Gaussian Processes
• nonlinear regression
• accurate and flexible
• predictions with error bars
• non-parametric
[Figure: GP regression example showing the mean prediction with error bars]
1D Gaussian Distribution
• Normal distribution
• Probability density function:
\mathcal{N}(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}
[Figure: 1D Gaussian probability density]
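A tiny numpy sketch of this density, included only as a reference implementation of the formula above:

```python
import numpy as np

def normal_pdf(y, mu=0.0, sigma=1.0):
    """1-D Gaussian density N(y | mu, sigma)."""
    return np.exp(-0.5 * (y - mu)**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
```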
nD Gaussian Distribution
• Probability density function:
\mathcal{N}(y \mid \mu, K) = \frac{1}{\sqrt{\det(2\pi K)}}\, e^{-\frac{1}{2}(y-\mu)^T K^{-1} (y-\mu)}
• K is the covariance matrix (or kernel matrix)
[Figure: surface plot of a 2D Gaussian density over (y_1, y_2)]
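The same density, sketched for the multivariate case; slogdet and solve stand in for forming det(2πK) and K⁻¹ explicitly, a numerical-stability choice not taken from the slides:

```python
import numpy as np

def mvn_pdf(y, mu, K):
    """n-D Gaussian density N(y | mu, K)."""
    d = y - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * K)   # log det(2 pi K)
    return np.exp(-0.5 * d @ np.linalg.solve(K, d) - 0.5 * logdet)
```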
2D Gaussian Distribution: Contour & Samples
[Figure: contour plot and samples of a 2D Gaussian over (y_1, y_2)]
2D Gaussian Distribution: Covariance Matrices
K is positive-semidefinite and symmetric
[Figure: contour plots of three 2D Gaussians over (y_1, y_2), one per covariance matrix below]
K = \begin{bmatrix} 1 & 0.1 \\ 0.1 & 1 \end{bmatrix} \qquad K = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \qquad K = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}
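A small numpy sketch that draws samples under each of these covariance matrices; the empirical covariance of the samples should approach the corresponding K:

```python
import numpy as np

rng = np.random.default_rng(0)
for K in ([[1, 0.1], [0.1, 1]],
          [[1, 0.5], [0.5, 1]],
          [[1, -0.9], [-0.9, 1]]):
    samples = rng.multivariate_normal(np.zeros(2), np.array(K), size=500)
    print(np.cov(samples.T).round(2))   # close to K for many samples
```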
Inference
[Figure: contour plot of a conditional 2D Gaussian over (y_1, y_2)]
Inference
• Joint probability:
p(y_1, y_2 \mid \mu, K) = \mathcal{N}([y_1, y_2] \mid \mu, K)
• Conditional probability given the observation y_1 = a:
p(y_2 \mid y_1, \mu, K) = \frac{p(y_1, y_2 \mid \mu, K)}{p(y_1 \mid \mu, K)} = \mathcal{N}(y_2 \mid \bar{\mu}, \bar{K})
• Partitioning:
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}
• New mean and covariance:
\bar{\mu} = \mu_2 + K_{21} K_{11}^{-1} (a - \mu_1) \qquad \bar{K} = K_{22} - K_{21} K_{11}^{-1} K_{12}
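A minimal numpy sketch (not from the slides) of these conditioning formulas; the values of mu, K, and a are made up for illustration:

```python
# Conditioning a 2D Gaussian on y_1 = a, following the formulas above.
import numpy as np

mu = np.array([1.0, 2.0])        # [mu_1, mu_2]
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])       # [[K_11, K_12], [K_21, K_22]]
a = 1.5                          # observed value of y_1

mu_bar = mu[1] + K[1, 0] / K[0, 0] * (a - mu[0])   # mu_2 + K_21 K_11^{-1} (a - mu_1)
K_bar = K[1, 1] - K[1, 0] / K[0, 0] * K[0, 1]      # K_22 - K_21 K_11^{-1} K_12

print(mu_bar, K_bar)             # mean and variance of p(y_2 | y_1 = a)
```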
A Different Representation
[Figure: the same 2D Gaussian shown as a contour plot over (y_1, y_2) and, alternatively, with each sample plotted against its dimension index]
Two samples: y = [0.86, 2.00] and y = [1.78, 2.22]
Higher-Dimensional Gaussians
[Figure: samples of a 3D Gaussian and a 10D Gaussian plotted against the dimension index]
2D Conditional
[Figure: conditional 2D Gaussian shown as a contour plot and against the dimension index, with the conditional mean and 2-std error bars]
10D Conditional
[Figure: conditional 10D Gaussian plotted against the dimension index, with conditional means and 2-std error bars]
• Looks like nonlinear regression!
• Where did K come from?
10D Conditional: K
K =
\begin{bmatrix}
1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 & 0.00 & 0.00 \\
0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 & 0.00 \\
0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 & 0.00 \\
0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 & 0.01 \\
0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 & 0.04 \\
0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 & 0.14 \\
0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 & 0.32 \\
0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 & 0.61 \\
0.00 & 0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00 & 0.88 \\
0.00 & 0.00 & 0.00 & 0.01 & 0.04 & 0.14 & 0.32 & 0.61 & 0.88 & 1.00
\end{bmatrix}
• Diagonal band structure
• Generated by a covariance function (kernel function):
K_{i,j} = k(x_i, x_j; \theta_K)
• Here the squared exponential kernel:
k_{\text{SE}}(x_i, x_j; \sigma_f, l) = \sigma_f^2\, e^{-\frac{1}{2 l^2}(x_i - x_j)^2}
(hyperparameters: l horizontal lengthscale, \sigma_f vertical lengthscale)
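As a check, a short numpy sketch assuming inputs x = 1, ..., 10: with sigma_f = 1 and l = 2 the squared exponential kernel reproduces the entries of the matrix above (these hyperparameter values are inferred from the entries, not stated on the slide):

```python
# Building the kernel matrix K_ij = k_SE(x_i, x_j) for inputs 1..10.
# sigma_f = 1 and l = 2 are inferred from the matrix entries shown above
# (e.g. exp(-0.5 * 1 / 4) = 0.88 for neighbouring inputs).
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=2.0):
    """Squared exponential kernel k_SE(x_i, x_j; sigma_f, l)."""
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

x = np.arange(1, 11, dtype=float)
K = k_se(x[:, None], x[None, :])   # broadcast to the full 10x10 matrix
print(K.round(2))                  # matches the matrix on this slide
```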
Definition
Gaussian Process
A Gaussian process is a collection of random variables with the property that the joint distribution of any finite subset is a Gaussian.
Function Space View
• Kernel function and hyperparameters = prior belief on function properties (smoothness, lengthscales, etc.)
• Construct a joint Gaussian for an arbitrary selection of input locations X
• Prior on the (infinite-dimensional) function f(x):
p(f(x)) = \mathcal{GP}(f(x) \mid k)
• Prior on function values f = (f_1, \ldots, f_N):
p(f \mid X, \theta_K) = \mathcal{N}(f \mid 0, K_{X,X}(\theta_K))
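A short sketch, assuming numpy, of drawing function values from this prior at a finite set of input locations; the jitter term is a standard numerical fix, not part of the model:

```python
# Sampling f ~ N(0, K_{X,X}) at a finite set of input locations X.
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=1.0):
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

X = np.linspace(-5, 5, 100)
K = k_se(X[:, None], X[None, :]) + 1e-9 * np.eye(len(X))  # jitter: K is ill-conditioned
f = np.random.default_rng(1).multivariate_normal(np.zeros(len(X)), K, size=3)
# each row of f is one draw of the function values at the inputs X
```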
Noise-free Observations
• Noise-free training data D = \{x_n, f_n\}_{n=1}^{N}
• Want to predict f^* at test points X^*
• Joint distribution of f and f^*:
p([f, f^*] \mid X, X^*, \theta_K) = \mathcal{N}\left([f, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
(dependence of K on \theta_K dropped for brevity)
• Real data is rarely noise-free
Inference
• Noisy training data D = \{X, y\}
• Joint distribution of latent function values f and f^* given y:
p([f, f^*] \mid X, X^*, y, \theta_K, \sigma^2) \propto \underbrace{\mathcal{N}\left([f, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)}_{\text{Prior}} \, \underbrace{\prod_{n=1}^{N} \mathcal{N}\left(y_n \mid f_n, \sigma^2\right)}_{\text{Likelihood}}
• Integrating out f yields
p([y, f^*] \mid X, X^*, \theta_K, \sigma^2) \propto \mathcal{N}\left([y, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} + \sigma^2 I & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
Predictions
• Turn the joint distribution into a conditional distribution
• Joint distribution:
p([y, f^*] \mid X, X^*, \theta_K, \sigma^2) \propto \mathcal{N}\left([y, f^*] \,\middle|\, 0, \begin{bmatrix} K_{X,X} + \sigma^2 I & K_{X,X^*} \\ K_{X^*,X} & K_{X^*,X^*} \end{bmatrix}\right)
• Gaussian predictive distribution for f^*:
p(f^* \mid X, X^*, y, \theta_K, \sigma^2) = \mathcal{N}(f^* \mid \mu^*, \Sigma^*) with
\mu^* = K_{X^*,X} \left(K_{X,X} + \sigma^2 I\right)^{-1} y
\Sigma^* = K_{X^*,X^*} + \sigma^2 I - K_{X^*,X} \left(K_{X,X} + \sigma^2 I\right)^{-1} K_{X,X^*}
• Matrix inverse (training) is expensive
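A minimal numpy sketch of these predictive equations; it uses a Cholesky factorization in place of the explicit matrix inverse, a standard implementation choice rather than anything prescribed by the slides:

```python
# GP prediction: mu* = K_{X*,X} (K_{X,X} + sigma^2 I)^{-1} y and
# Sigma* = K_{X*,X*} + sigma^2 I - K_{X*,X} (K_{X,X} + sigma^2 I)^{-1} K_{X,X*}.
import numpy as np

def gp_predict(X, y, X_star, k, sigma2):
    K = k(X[:, None], X[None, :]) + sigma2 * np.eye(len(X))
    K_s = k(X[:, None], X_star[None, :])         # K_{X,X*}
    K_ss = k(X_star[:, None], X_star[None, :])   # K_{X*,X*}
    L = np.linalg.cholesky(K)                    # the O(N^3) "training" step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s)
    mu_star = K_s.T @ alpha
    Sigma_star = K_ss + sigma2 * np.eye(len(X_star)) - v.T @ v
    return mu_star, Sigma_star

# usage: mu_star, Sigma_star = gp_predict(X, y, X_star, k_se, 0.01)
```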
Example
[Figure: GP prediction with 2-std error bars, true function, and training data]
Kernel Parameters
horizontal lengthscale (too narrow, too wide)
[Figures: GP predictions with a horizontal lengthscale chosen too narrow and too wide, each showing prediction (2 std), true function, and training data]
→ optimize kernel parameters as part of training
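One common way to do this, sketched below under the assumption that "training" means maximizing the log marginal likelihood log p(y | X, θ) of an SE kernel with noise (the slide does not spell out the objective); scipy's generic minimizer stands in for whatever optimizer the course uses, and the data here are made up:

```python
# Tuning (sigma_f, l, sigma) by maximizing the log marginal likelihood.
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    sigma_f, l, sigma = np.exp(log_theta)        # log-parameters stay positive
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)
    K += sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log N(y | 0, K) up to the constant (N/2) log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))

rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
sigma_f, l, sigma = np.exp(res.x)
```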
Kernels
• Squared exponential
k_{\text{SE}}(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2\, l_d^{-2}\right)
• Linear
k_{\text{lin}}(x_i, x_j) = \sigma_o^2 + x_i x_j
• Brownian motion (Wiener process)
k_{\text{Brownian}}(x_i, x_j) = \min(x_i, x_j)
• Periodic
k_{\text{periodic}}(x_i, x_j) = \exp\left(-2 \sin^2\left(\frac{x_i - x_j}{2}\right) \lambda^{-2}\right)
• Many more (they need to correspond to an inner product k(x_i, x_j) = \phi(x_i)^T \phi(x_j))
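A minimal sketch of these kernels as numpy functions of scalar (or broadcastable) inputs; the SE version is the 1-D case, and the periodic lengthscale λ is written as lam:

```python
import numpy as np

def k_se(xi, xj, sigma_f=1.0, l=1.0):            # squared exponential
    return sigma_f**2 * np.exp(-0.5 * (xi - xj)**2 / l**2)

def k_lin(xi, xj, sigma_o=1.0):                  # linear
    return sigma_o**2 + xi * xj

def k_brownian(xi, xj):                          # Brownian motion (Wiener process)
    return np.minimum(xi, xj)

def k_periodic(xi, xj, lam=1.0):                 # periodic
    return np.exp(-2.0 * np.sin(0.5 * (xi - xj))**2 / lam**2)
```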
Examples of priors
[Figures: samples of the squared exponential prior, the Brownian motion prior, and the linear prior]
Constructing Kernels
• Sum
k_{\text{sum}}(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)
addition of independent processes
• Product
k_{\text{prod}}(x_i, x_j) = k_1(x_i, x_j)\, k_2(x_i, x_j)
product of independent processes
• Convolution
k_{\text{conv}}(x_i, x_j) = \int \int h(x_i, z_i)\, k(z_i, z_j)\, h(x_j, z_j)\, \mathrm{d}z_i\, \mathrm{d}z_j
blurring with kernel h
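A short sketch of the sum and product constructions as higher-order functions, assuming the kernel functions from the previous snippet are in scope:

```python
def k_sum(k1, k2):
    """Kernel of the sum of two independent processes."""
    return lambda xi, xj: k1(xi, xj) + k2(xi, xj)

def k_prod(k1, k2):
    """Kernel of the product of two independent processes."""
    return lambda xi, xj: k1(xi, xj) * k2(xi, xj)

# e.g. a locally periodic kernel: periodic correlation that decays with distance
k_locally_periodic = k_prod(k_periodic, k_se)
```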
Active Learning
select next training point (here the one with highest uncertainty)
[Figures: three rounds of active learning, each showing prediction (2 std), true function, training data, and the next data point]
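A minimal sketch of the selection rule above, assuming the gp_predict function from the Predictions slide: evaluate the predictive uncertainty on a candidate grid and query where it is largest.

```python
import numpy as np

def next_query_point(X, y, X_candidates, k, sigma2):
    """Return the candidate input with the highest predictive uncertainty."""
    _, Sigma_star = gp_predict(X, y, X_candidates, k, sigma2)
    std = np.sqrt(np.diag(Sigma_star))
    return X_candidates[np.argmax(std)]
```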
Bayesian Optimization [1]
https://youtu.be/dOux-bXFCU4
[1] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe (May 2016). "Automatic LQR Tuning Based on Gaussian Process Global Optimization". In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Summary
• Easy to use
• Model with an infinite number of parameters
• State-of-the-art performance
• Many standard regression methods are special cases of GPs
  – Radial basis functions
  – Splines
  – Linear (ridge) regression
  – Feed-forward neural networks with one hidden layer
• Problems
  – Ill-conditioned
  – N^3 complexity is bad news for N > 1000 → approximate/sparse methods