
Gaussian Processes: From the Basics to the State-of-the-Art

Dr. Richard E. Turner (ret26@cam.ac.uk)
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

with Thang Bui, Yingzhen Li, Jose Miguel Hernandez Lobato, Daniel Hernandez Lobato, Josiah Jan, Alex Navarro, Felipe Tobar, and Maneesh Sahani

1 / 335

Motivation: non-linear regression

[Figure, slides 2–6: data points y against x, with x running from 1 to 20 and y from −2 to 2; slide 3 marks an unobserved input with a "?". The data are built up across the slides.]

Can we do this with a plain old Gaussian?

2–6 / 335

Gaussian distribution

p(y|Σ) ∝ exp(−(1/2) yᵀ Σ⁻¹ y)

Σ = [ 1   .7
      .7   1 ]

[Figure, slides 7–14: contours and samples of the bivariate Gaussian in the (y1, y2) plane, both axes running from −3 to 3; across the slides the off-diagonal entry of Σ is varied through .7, .6, .4, .1 and .0, then returns to .7.]

7–14 / 335
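The bivariate Gaussians on these slides are easy to reproduce numerically. A minimal sketch (using NumPy; the sample count and seed are arbitrary choices, not from the slides) that draws samples for a few of the correlations shown and checks the empirical correlation:

```python
import numpy as np

def sample_bivariate(corr, n=10000, seed=0):
    """Draw samples from a zero-mean bivariate Gaussian with unit
    variances and off-diagonal correlation `corr`."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, corr], [corr, 1.0]])
    return rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=n)

# As on the slides: the higher the correlation, the more the
# (y1, y2) cloud concentrates along the diagonal.
for corr in [0.7, 0.4, 0.0]:
    y = sample_bivariate(corr)
    print(corr, round(np.corrcoef(y[:, 0], y[:, 1])[0, 1], 2))
```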

Gaussian distribution

Conditioning on an observed value of y1 gives another Gaussian:

p(y2|y1, Σ) ∝ exp(−(1/2)(y2 − µ∗) Σ∗⁻¹ (y2 − µ∗))

where, by the standard Gaussian conditioning identities, µ∗ = Σ21 Σ11⁻¹ y1 and Σ∗ = Σ22 − Σ21 Σ11⁻¹ Σ12.

[Figure, slides 15–21: the (y1, y2) plane with axes from −3 to 3; a vertical slice through the joint density at the observed y1 is renormalised to give the conditional distribution over y2.]

15–21 / 335
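The conditioning step on these slides can be sketched directly from the bivariate formulas µ∗ = Σ21 Σ11⁻¹ y1 and Σ∗ = Σ22 − Σ21 Σ11⁻¹ Σ12 (the covariance value 0.9 and observed y1 = 1.0 below are illustrative choices, not taken from the slides):

```python
import numpy as np

def condition_gaussian(Sigma, y1):
    """Condition a zero-mean bivariate Gaussian N(0, Sigma) on an
    observed value y1 of the first variable. Returns the mean and
    variance of p(y2 | y1) using the standard conditioning formulas
    mu* = S21/S11 * y1 and S* = S22 - S21*S12/S11."""
    S11, S12 = Sigma[0, 0], Sigma[0, 1]
    S21, S22 = Sigma[1, 0], Sigma[1, 1]
    mu_star = S21 / S11 * y1
    var_star = S22 - S21 * S12 / S11
    return mu_star, var_star

Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
mu, var = condition_gaussian(Sigma, y1=1.0)
# Observing a high y1 pulls the mean of y2 up and shrinks its variance.
print(mu, var)  # 0.9 0.19...
```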

New visualisation

Instead of plotting samples as points in the (y1, y2) plane, plot each sample's values y against the variable index (1, 2).

[Figure, slides 22–64: left panel, samples from the bivariate Gaussian in the (y1, y2) plane; right panel, the same samples re-plotted as y against variable index, with samples accumulating across the slides.]

Σ = [ 1   .9
      .9   1 ]

22–64 / 335

New visualisation

[Figure, slides 65–104: left panel, the (y1, y5) projection of samples from a 5-dimensional Gaussian; right panel, the same samples plotted as y against variable index 1–5. Because correlation decays with index separation, each sample traces a smooth curve.]

Σ =
⎡  1  .9  .8  .6  .4 ⎤
⎢ .9   1  .9  .8  .6 ⎥
⎢ .8  .9   1  .9  .8 ⎥
⎢ .6  .8  .9   1  .9 ⎥
⎣ .4  .6  .8  .9   1 ⎦

65–104 / 335
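The 5-variable visualisation can be reproduced by sampling from the 5 × 5 covariance above and reading each sample off against its variable index (the seed and number of samples are arbitrary choices):

```python
import numpy as np

# The 5x5 covariance from the slides: correlation decays with
# distance between variable indices.
Sigma = np.array([
    [1.0, 0.9, 0.8, 0.6, 0.4],
    [0.9, 1.0, 0.9, 0.8, 0.6],
    [0.8, 0.9, 1.0, 0.9, 0.8],
    [0.6, 0.8, 0.9, 1.0, 0.9],
    [0.4, 0.6, 0.8, 0.9, 1.0],
])

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(5), Sigma, size=3)

# Print each sample against its variable index, as in the right-hand
# panels of the slides: nearby indices take similar values.
for s in samples:
    print("  ".join(f"{v:+.2f}" for v in s))
```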

New visualisation

[Figure, slides 105–144: left panel, the (y1, y20) projection of samples from a 20-dimensional Gaussian; right panel, the samples plotted as y against variable index 1–20; inset, Σ displayed as a 20 × 20 image with axis ticks at 1, 10 and 20.]

105–144 / 335

Regression using Gaussians

[Figure, slides 145–148: y against variable index 1–20; some variables are observed as data points and the remaining variables are predicted with uncertainty; inset, Σ displayed as a 20 × 20 image.]

Σ(x1, x2) = K(x1, x2) + I σy²

K(x1, x2) = σ² exp(−(x1 − x2)² / (2 l²))

145–148 / 335
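The covariance on this slide is straightforward to build: a squared-exponential kernel evaluated on the variable indices, plus observation noise on the diagonal. A sketch (the hyper-parameter values σ = 1, l = 2 and σy = 0.1 are illustrative, not from the slide):

```python
import numpy as np

def sq_exp_K(x1, x2, sigma=1.0, ell=2.0):
    """Squared-exponential kernel from the slide:
    K(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 * ell^2)).
    x1, x2 are 1-D arrays of input locations."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))

x = np.arange(1, 21, dtype=float)        # variable indices 1..20
sigma_y = 0.1                            # observation-noise std
Sigma = sq_exp_K(x, x) + np.eye(len(x)) * sigma_y ** 2
print(Sigma.shape)  # (20, 20); nearby indices are strongly correlated
```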

Regression: probabilistic inference in function space

[Figure, slides 149–154: data with a GP function estimate and uncertainty, y against variable index 1–20; inset, Σ displayed as a 20 × 20 image.]

Non-parametric (∞-parametric) model:

p(y|θ) = N(0, Σ)   (function estimate with uncertainty)

Σ(x1, x2) = K(x1, x2) + I σy²   (σy²: noise)

K(x1, x2) = σ² exp(−(x1 − x2)² / (2 l²))   (l: horizontal-scale, σ: vertical-scale)

Parametric model:

y(x) = f(x; θ) + σy ε,   ε ∼ N(0, 1)

149–154 / 335

Mathematical Foundations: Definition

Gaussian process = generalisation of the multivariate Gaussian distribution to infinitely many variables.

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector, µ, and covariance matrix Σ:

f = (f1, . . . , fn) ∼ N(µ, Σ),   indices i = 1, . . . , n

A Gaussian process is fully specified by a mean function m(x) and covariance function K(x, x′):

f(x) ∼ GP(m(x), K(x, x′)),   indices x

155 / 335
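The definition says that the function values at any finite set of inputs are jointly Gaussian, so a "draw from a GP" at chosen inputs is just a draw from N(0, K). A sketch under that definition (function names, the lengthscale 1.5 and the jitter value are illustrative assumptions; a zero mean function m(x) = 0 is assumed):

```python
import numpy as np

def gp_draw(xs, kernel, jitter=1e-6, seed=0):
    """Draw f(xs) for a zero-mean GP: by the definition, the values at
    any finite set of inputs xs are jointly Gaussian with covariance
    K(xs, xs)."""
    rng = np.random.default_rng(seed)
    K = kernel(xs[:, None], xs[None, :])
    # Small jitter keeps the Cholesky factorisation numerically stable.
    L = np.linalg.cholesky(K + jitter * np.eye(len(xs)))
    return L @ rng.standard_normal(len(xs))

# Squared-exponential kernel with lengthscale 1.5 (illustrative).
se = lambda a, b: np.exp(-(a - b) ** 2 / (2.0 * 1.5 ** 2))
f = gp_draw(np.linspace(0, 10, 50), se)
print(f.shape)  # (50,)
```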

Mathematical Foundations: Regression

Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)

place GP prior over the non-linear function (smoothly wiggling functions expected)

sum of Gaussian variables = Gaussian: induces a GP over the observations y

156 / 335

Mathematical Foundations: Marginalisation

Q2. A GP is "like" a Gaussian distribution with an infinitely long mean vector and an "infinite by infinite" covariance matrix, so how do we represent it on a computer?

We are saved by the marginalisation property of Gaussians: integrating out variables simply deletes the corresponding entries of the mean vector and covariance matrix,

p(f1) = ∫ p(f1, f2) df2 = N(f1; µ1, Σ11)

Only need to represent finite dimensional projections of GPs on a computer.

157 / 335
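The marginalisation property can be checked numerically: the covariance of any subset of function values is just the corresponding sub-block of the larger covariance, identical to evaluating the kernel at those inputs alone (kernel, inputs and the chosen indices below are illustrative):

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    """Squared-exponential kernel on 1-D input arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

# Marginalisation property: the joint over 100 function values, with
# 98 of them integrated out, is the same 2-D Gaussian we get by
# evaluating the kernel at the two remaining inputs directly.
xs = np.linspace(0, 5, 100)
K_big = se_kernel(xs, xs)
idx = [10, 70]
K_marginal = K_big[np.ix_(idx, idx)]     # sub-block of the big covariance
K_direct = se_kernel(xs[idx], xs[idx])   # built from just the two inputs
print(np.allclose(K_marginal, K_direct))  # True — no infinite object needed
```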

Mathematical Foundations: Prediction

Q3. How do we make predictions? Condition the joint Gaussian over the observed values y and the test values f∗ on the data:

predictive mean (linear in the data):

µ∗ = K∗f (Kff + σy² I)⁻¹ y

predictive covariance (prior uncertainty minus reduction in uncertainty, so predictions are more confident than the prior):

Σ∗ = K∗∗ − K∗f (Kff + σy² I)⁻¹ Kf∗

158 / 335
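The standard GP predictive equations (mean linear in the data, covariance equal to the prior minus a reduction term) can be sketched in a few lines. Everything here — the training points, the squared-exponential hyper-parameters, the function names — is an illustrative choice, not taken from the slides:

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, ell=1.0, sigma=1.0, sigma_y=0.1):
    """GP regression predictive distribution with a squared-exponential
    kernel, following the standard formulas:
      mean = K*f (Kff + sigma_y^2 I)^-1 y          (linear in the data)
      cov  = K** - K*f (Kff + sigma_y^2 I)^-1 Kf*  (prior minus reduction)"""
    k = lambda a, b: sigma**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))
    Kff = k(x_train, x_train) + sigma_y**2 * np.eye(len(x_train))
    Ksf = k(x_test, x_train)
    Kss = k(x_test, x_test)
    mean = Ksf @ np.linalg.solve(Kff, y_train)
    cov = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)
    return mean, cov

x = np.array([1.0, 3.0, 5.0])
y = np.sin(x)
mu, cov = gp_predict(x, y, np.array([2.0, 10.0]))
# Near the data the predictive std is small; far from the data it
# returns to the prior std — predictions more confident than the prior.
print(mu, np.sqrt(np.diag(cov)))
```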


What effect do the hyper-parameters have?

[Figure, slides 160–171: a growing grid of panels, each showing GP fits of y against x, as the kernel hyper-parameters are varied.]

Σ(x1, x2) = K(x1, x2) + I σy²

p(y|θ) = N(0, Σ)

K(x1, x2) = σ² exp(−(x1 − x2)² / (2 l²))   (l: horizontal-scale, σ: vertical-scale, σy: noise)

y(x) = f(x; θ) + σy ε,   ε ∼ N(0, 1)

160–171 / 335

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

172 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

173 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

174 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

175 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

176 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

177 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

178 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

179 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

180 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

181 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

182 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

183 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

184 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

185 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

short horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

186 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

187 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

188 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

189 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

190 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

191 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

192 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

193 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

194 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

195 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

short horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

196 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

197 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

198 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

199 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

200 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)short horizontal length-scale

Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

201 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

long horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

202 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

long horizontal length-scale

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

203 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

204 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

205 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

206 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

207 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

208 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

209 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

210 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

211 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

212 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

213 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

214 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

215 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

216 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

217 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

218 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

219 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

220 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

221 / 335

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

long horizontal length-scale

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model

y(x) = f (x; θ) + σyε

ε ∼ N (0, 1)K(x1, x2) = σ2e−1

2l2(x1−x2)2

222 / 335
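The length-scale behaviour shown in the figures above can be reproduced by sampling directly from the GP prior. A minimal NumPy sketch (the grid, seed, and chosen length-scales are illustrative assumptions, not from the slides):

```python
import numpy as np

def se_kernel(x1, x2, sigma2=1.0, ell=1.0):
    """Squared-exponential kernel K(x1, x2) = sigma^2 exp(-(x1 - x2)^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)

for ell in [0.2, 1.0, 3.0]:  # short, medium, long horizontal length-scales
    K = se_kernel(x, x, sigma2=1.0, ell=ell)
    # small jitter keeps the Cholesky factorisation numerically stable
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
    f = L @ rng.standard_normal(len(x))  # one draw from N(0, K)
    print(f"l = {ell}: sample std ~ {f.std():.2f}")
```

Plotting `f` against `x` for each `ell` recovers the wiggly-to-smooth progression in the figures: small `l` gives rapidly varying functions, large `l` gives slowly varying ones, while `sigma2` rescales them vertically.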

What effect do the hyper-parameters have?

K(x1, x2) = σ² exp(−(x1 − x2)²/(2l²))

Hyper-parameters have a strong effect:
- l controls the horizontal length-scale
- σ² controls the vertical scale of the data

⇒ need automatic learning of hyper-parameters from data, e.g.

arg max_{l, σ²} log p(y|θ)

[Figure: samples drawn with l = short, l = medium, and l = long; axes x, y ∈ [−3, 3]]

223 / 335
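The learning step arg max_{l,σ²} log p(y|θ) can be sketched as a search over the Gaussian log marginal likelihood. Everything below (the synthetic data, the grid of candidate length-scales, the fixed noise level) is an illustrative assumption, not part of the slides:

```python
import numpy as np

def log_marginal_likelihood(y, x, ell, sigma2, sigma2_y):
    """log p(y|theta) for y ~ N(0, K + sigma_y^2 I) with an SE kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]
    Sigma = sigma2 * np.exp(-0.5 * (d / ell) ** 2) + sigma2_y * np.eye(n)
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -1/2 y^T Sigma^-1 y - 1/2 log|Sigma| - n/2 log(2 pi)
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
# synthetic data generated with a known length-scale of 1.0 plus noise
d = x[:, None] - x[None, :]
K_true = np.exp(-0.5 * d ** 2)
y = np.linalg.cholesky(K_true + 1e-6 * np.eye(60)) @ rng.standard_normal(60)
y = y + 0.1 * rng.standard_normal(60)

ells = [0.1, 0.3, 1.0, 3.0, 10.0]
best = max(ells, key=lambda l: log_marginal_likelihood(y, x, l, 1.0, 0.01))
print("best length-scale on the grid:", best)
```

In practice the slides' arg max is done with gradient-based optimisation of log p(y|θ) with respect to (l, σ², σ²_y) rather than a grid, but the objective is the same.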

Higher dimensional input spaces

[Figure: a sample surface drawn from a Gaussian process over a two-dimensional
input space; axes x1, x2 ∈ [0, 100], y ∈ [−2, 2]]

225 / 335


What effect does the form of the covariance function have?

[Figure: samples drawn using the Laplacian covariance function (Brownian motion,
Ornstein-Uhlenbeck process); axes x, y ∈ [−3, 3], with the corresponding Σ]

K(x1, x2) = σ² exp(−|x1 − x2| / (2l²))

241 / 335

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Laplacian covariance function

Browninan motion

Ornstein-Uhlenbeck

K(x1, x2) = σ2 exp(− 1

2l2 |x1 − x2|)

258 / 335

What effect does the form of the covariance function have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Laplacian covariance function

Browninan motion

Ornstein-Uhlenbeck

K(x1, x2) = σ2 exp(− 1

2l2 |x1 − x2|)

259 / 335

What effect does the form of the covariance function have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Laplacian covariance function

Browninan motion

Ornstein-Uhlenbeck

K(x1, x2) = σ2 exp(− 1

2l2 |x1 − x2|)

260 / 335

What effect does the form of the covariance function have?

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Laplacian covariance function

Browninan motion

Ornstein-Uhlenbeck

K(x1, x2) = σ2 exp(− 1

2l2 |x1 − x2|)

261 / 335
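Samples like the ones plotted on these slides can be reproduced in a few lines of NumPy. This is a sketch, not code from the talk (the variable names and the jitter constant are mine): build Σ from the kernel on a grid, Cholesky-factor it, and multiply white noise.

```python
import numpy as np

def ou_kernel(x1, x2, sigma=1.0, l=1.0):
    # Laplacian / Ornstein-Uhlenbeck covariance from the slide:
    # K(x1, x2) = sigma^2 exp(-|x1 - x2| / (2 l^2))
    return sigma**2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / (2 * l**2))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
Sigma = ou_kernel(x, x)
# a small jitter keeps the Cholesky factorisation numerically stable
L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(len(x)))
samples = L @ rng.standard_normal((len(x), 5))  # five draws from N(0, Sigma)
```

Each column of `samples` is one of the rough, non-differentiable sample paths characteristic of the OU covariance.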

What effect does the form of the covariance function have?

[Figure: samples y(x) on x ∈ [−3, 3] and the covariance matrix Σ]

Rational Quadratic

K(x1, x2) = σ² (1 + |x1 − x2| / (2αl²))^(−α)

262–282 / 335
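One property worth checking numerically: as α → ∞, the rational quadratic above (written in |x1 − x2| as on the slides) converges to the exponential Ornstein-Uhlenbeck covariance of the previous slides. A small sketch (the function names are mine):

```python
import numpy as np

def rq_kernel(d, sigma=1.0, l=1.0, alpha=1.0):
    # Rational quadratic as written on the slide, in d = |x1 - x2|
    return sigma**2 * (1 + d / (2 * alpha * l**2)) ** (-alpha)

def ou_kernel(d, sigma=1.0, l=1.0):
    return sigma**2 * np.exp(-d / (2 * l**2))

d = np.linspace(0.0, 6.0, 100)
# (1 + d/(2 alpha l^2))^(-alpha) -> exp(-d/(2 l^2)) as alpha -> infinity
gap = np.max(np.abs(rq_kernel(d, alpha=1e6) - ou_kernel(d)))
```

For moderate α the RQ covariance mixes many lengthscales, which is why its samples look different from the single-lengthscale OU draws.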

What effect does the form of the covariance function have?

[Figure: samples y(x) on x ∈ [−3, 3] and the covariance matrix Σ]

Periodic
sinusoid × squared exponential

K(x1, x2) = σ² cos(ω(x1 − x2)) exp(−(x1 − x2)² / (2l²))

283–303 / 335
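The periodic covariance is a product of two valid covariances (a cosine and a squared exponential), so it is itself a valid, positive semi-definite covariance. A quick numerical check on a plotting grid (a sketch; the choices of ω and l here are arbitrary, not from the slides):

```python
import numpy as np

def periodic_kernel(x1, x2, sigma=1.0, omega=2.0, l=1.0):
    # sinusoid x squared exponential, as on the slide
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.cos(omega * d) * np.exp(-d**2 / (2 * l**2))

x = np.linspace(-3, 3, 150)
Sigma = periodic_kernel(x, x)
# a valid covariance matrix has no (meaningfully) negative eigenvalues
min_eig = np.linalg.eigvalsh(Sigma).min()
```

The SE envelope makes the samples only locally periodic: correlations between points many lengthscales apart decay, so the oscillation can drift.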

The covariance function has a large effect

[Figure: samples from the OU, RQ and periodic covariances side by side]

Bayesian model comparison: prior over models, marginal likelihood

Health warnings: hard to compute (need approximations); often results are very sensitive to the priors

304–308 / 335
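For GP regression the marginal likelihood used in this comparison is available in closed form: log p(y) = −½ yᵀ(K + σₙ²I)⁻¹y − ½ log|K + σₙ²I| − (n/2) log 2π. A sketch comparing a well-matched and a mismatched periodic covariance on synthetic data (all settings here are illustrative, not from the talk):

```python
import numpy as np

def log_marginal_likelihood(x, y, kernel, noise=0.1):
    # log p(y) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi),  Ky = K + noise^2 I
    Ky = kernel(x[:, None] - x[None, :]) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(40)  # data with frequency 2

matched = lambda d: np.cos(2 * d) * np.exp(-d**2 / 2)     # periodic kernel, omega = 2
mismatched = lambda d: np.cos(5 * d) * np.exp(-d**2 / 2)  # wrong frequency, omega = 5
lml_good = log_marginal_likelihood(x, y, matched)
lml_bad = log_marginal_likelihood(x, y, mismatched)
```

The kernel whose frequency matches the data earns a far higher marginal likelihood, which is the quantitative version of the comparison sketched on the slide.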

Scaling Gaussian Processes to Large Datasets

309 / 335

Motivation: Gaussian Process Regression

[Figure: mapping from inputs to outputs; inference & learning run into two intractabilities: computational and analytic]

310 / 335

A Brief History of Gaussian Process Approximations

Methods employing pseudo-data fall into two camps:
approximate generative model, exact inference: FITC, PITC, DTC
exact generative model, approximate inference: VFE, EP, PP

FITC: Snelson et al., "Sparse Gaussian Processes using Pseudo-inputs"
PITC: Snelson et al., "Local and global sparse Gaussian process approximations"
EP: Csato and Opper 2002 / Qi et al., "Sparse-posterior Gaussian Processes for general likelihoods"
VFE: Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes"
DTC / PP: Seeger et al., "Fast Forward Selection to Speed Up Sparse Gaussian Process Regression"

A Unifying View of Sparse Approximate Gaussian Process Regression, Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC ...)

311 / 335

EP pseudo-point approximation

The true posterior (marginal likelihood × posterior) is approximated by the exact joint of a new GP regression model on 'pseudo' data: the input locations of the pseudo data, together with their outputs and covariance.

312 / 335

EP algorithm

1. remove: take out one pseudo-observation likelihood (leaving the cavity)
2. include: add in one true observation likelihood (giving the tilted distribution)
3. project: project onto the approximating family, minimising the KL between unnormalised stochastic processes (a rank-1 update)
4. update: update the pseudo-observation likelihood

At the minimum, moments are matched at the pseudo-inputs; for Gaussian regression the projection matches moments everywhere.

313 / 335
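Step 3 (project) is a KL minimisation, and for a Gaussian approximating family it reduces to moment matching. A minimal 1-D sketch of that computation by numerical integration (the grid, the names, and the worked numbers are mine, not from the talk):

```python
import numpy as np

# dense grid over the latent function value f at a single site
f = np.linspace(-10.0, 10.0, 20001)
df = f[1] - f[0]

def project(cavity_mu, cavity_var, lik):
    # Moments of the tilted distribution cavity(f) * likelihood(y | f):
    # the KL-optimal Gaussian matches exactly these moments
    tilted = np.exp(-0.5 * (f - cavity_mu) ** 2 / cavity_var) * lik
    Z = tilted.sum() * df
    mu = (f * tilted).sum() * df / Z
    var = ((f - mu) ** 2 * tilted).sum() * df / Z
    return mu, var

# with a Gaussian likelihood the projection is exact:
# N(f; 0, 1) x N(y=1; f, 0.5) has posterior mean 2/3 and variance 1/3
lik = np.exp(-0.5 * (1.0 - f) ** 2 / 0.5)
mu, var = project(0.0, 1.0, lik)
```

This is why, for Gaussian regression, the projection matches moments everywhere: the tilted distribution is already Gaussian, so no information is lost. Non-Gaussian likelihoods are where the quadrature (or analytic moment formulas) actually earn their keep.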

A Brief History of Gaussian Process Approximations

314 / 335

Fixed points of EP = FITC approximation

Methods employing pseudo-data split into approximate generative model with exact inference (FITC, PITC, DTC) and exact generative model with approximate inference (VFE, EP, PP); the fixed points of EP coincide with the FITC approximation.

This interpretation resolves issues with FITC: why does it work so well? And are we allowed to increase M with N?

Unifying treatments: Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC); Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...).

315 / 335
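For concreteness, the FITC construction on the approximate-model side replaces the exact prior covariance K_ff with Q_ff + diag(K_ff − Q_ff), where Q_ff = K_fu K_uu⁻¹ K_uf involves the pseudo-inputs. The sketch below assumes a squared-exponential kernel; the helper names are illustrative, not from the slides.

```python
import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def fitc_prior_cov(x, z, ell=1.0, sf2=1.0, jitter=1e-8):
    """FITC approximate prior covariance over f(x) given pseudo-inputs z:
    Q_ff + diag(K_ff - Q_ff), where Q_ff = K_fu K_uu^{-1} K_uf."""
    Kff = rbf(x, x, ell, sf2)
    Kuu = rbf(z, z, ell, sf2) + jitter * np.eye(len(z))  # jitter for stability
    Kfu = rbf(x, z, ell, sf2)
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    # correct the diagonal so marginal prior variances stay exact
    return Qff + np.diag(np.diag(Kff - Qff))
```

By construction the diagonal of the FITC covariance equals the diagonal of the exact K_ff, which is one reason the approximation behaves well despite the low-rank off-diagonal structure.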

Power EP algorithm (as tractable as EP)

1. remove: take out a fraction of one pseudo-observation likelihood → leaves the cavity
2. include: add in a fraction of one true observation likelihood → gives the tilted distribution
3. project: project onto the approximating family (KL between unnormalised stochastic processes)
4. update: update the pseudo-observation likelihood (a rank-1 change)

Properties of the projection:
1. at a minimum, moments are matched at the pseudo-inputs
2. for Gaussian regression, it matches moments everywhere

317 / 335
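The fractional updates can be sketched in the same toy setting as plain EP: remove and include only a fraction α of each factor, and scale the site update by 1/α. This is an illustrative sketch with made-up names; note that with exactly Gaussian likelihoods the α-dependence cancels and every α reaches the same fixed point, consistent with the projection matching moments everywhere for Gaussian regression.

```python
import numpy as np

def pep_1d_gaussian(y, alpha=0.5, v0=1.0, noise=0.1, n_sweeps=5):
    """Power EP sketch: scalar theta with prior N(0, v0) and Gaussian
    likelihood factors. Sites in natural parameters (tau, nu)."""
    n = len(y)
    tau_site = np.zeros(n)
    nu_site = np.zeros(n)
    tau_post, nu_post = 1.0 / v0, 0.0     # global approx starts at the prior
    for _ in range(n_sweeps):
        for i in range(n):
            # 1. remove a FRACTION alpha of pseudo-observation likelihood i
            tau_cav = tau_post - alpha * tau_site[i]
            nu_cav = nu_post - alpha * nu_site[i]
            # 2./3. include a fraction alpha of the true likelihood and
            # project (exact here: the tilted distribution stays Gaussian)
            tau_til = tau_cav + alpha / noise
            nu_til = nu_cav + alpha * y[i] / noise
            # 4. update: new site = (projected - cavity) / alpha
            new_tau = (tau_til - tau_cav) / alpha
            new_nu = (nu_til - nu_cav) / alpha
            tau_post += new_tau - tau_site[i]
            nu_post += new_nu - nu_site[i]
            tau_site[i], nu_site[i] = new_tau, new_nu
    return nu_post / tau_post, 1.0 / tau_post
```

With non-Gaussian likelihoods the projection in step 3 is approximate, and α then interpolates between EP-like (α = 1) and VFE-like (α → 0) behaviour.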

Power EP: a unifying framework

FITC: Csató and Opper, 2002; Snelson and Ghahramani, 2005
VFE: Titsias, 2009

318 / 335

Power EP: a unifying framework

[Figure: map of sparse approximation methods for GP regression and GP classification, organised by inference framework (PEP / VFE / EP), covering FITC, PITC, inter-domain and structured approximations; bracketed numbers refer to the key below.]

[4] Quiñonero-Candela et al., 2005; [5] Snelson et al., 2005; [6] Snelson, 2006; [7] Schwaighofer, 2002; [8] Titsias, 2009; [9] Csató, 2002; [10] Csató et al., 2002; [11] Seeger et al., 2003; [12] Naish-Guzman et al., 2007; [13] Qi et al., 2010; [14] Hensman et al., 2015; [15] Hernández-Lobato et al., 2016; [16] Matthews et al., 2016; [17] Figueiras-Vidal et al., 2009

* = optimised pseudo-inputs; ** = structured versions of VFE recover VFE

319 / 335

How should I set the power parameter α?

Classification: 6 UCI classification datasets, 20 random splits, M = 10, 50, 100; hypers and inducing inputs optimised.
Regression: 8 UCI regression datasets, 20 random splits, M = 0-200; hypers and inducing inputs optimised.

[Figure: average error rank and log-loss rank plotted against α ∈ [0, 1] for both benchmark suites.]

α = 0.5 does well on average

320 / 335

Deep Gaussian Processes for Regression

321 / 335

Pros and cons of Gaussian Process Regression

Probabilistic ✓
- need well-calibrated predictive uncertainty (decision making)
- big models (even with small data)
- neural networks do not provide well-calibrated uncertainty estimates

Non-parametric (∞-parametric) ✓
- model parameters form an information bottleneck between training and test data:
  training data → θ → test data
- unbounded model (θ grows with data) pruned by data

Deep (and wide) ✗
- intrinsic hierarchical structure in real-world data
- simpler to learn a series of weak non-linearities

322 / 335


From Gaussian Processes to Deep Gaussian Processes

[Figure: a GP draw over the input is fed in as the input to a further GP layer; panels labelled "input", "long" and "short".]

DGP = multi-layer neural network, infinitely wide hidden layer

323 / 335

Deep Gaussian Processes

f_l ∼ GP(0, k(·, ·))
y_n = g(x_n) = f_L(f_{L−1}(··· f_2(f_1(x_n)))) + ε_n
h_{L−1,n} := f_{L−1}(··· f_1(x_n)),   y_n = f_L(h_{L−1,n}) + ε_n

Deep GPs¹ are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.

Questions:
- How to perform inference and learning tractably?
- How do Deep GPs compare to alternatives, e.g. Bayesian neural networks?

¹ Damianou and Lawrence (2013) [unsupervised learning]

[Graphical model: x_n → f_1 → h_{1,n} → f_2 → h_{2,n} → f_3 → y_n, plate over N.]

324 / 335
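A draw from the deep GP prior defined above, at a finite set of inputs, can be sketched by composing finite-dimensional GP samples: each layer's sample becomes the next layer's input. This is a minimal illustration; the function names, kernel choice and hyperparameters are assumptions for the example, not from the slides.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def sample_deep_gp_prior(x, n_layers=3, ell=1.0, noise=0.01, seed=0):
    """Draw y_n = f_L(... f_1(x_n) ...) + eps_n at inputs x by composing
    finite-dimensional GP draws, one per layer."""
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for _ in range(n_layers):
        K = rbf(h, h, ell) + 1e-8 * np.eye(len(h))   # jitter for stability
        h = rng.multivariate_normal(np.zeros(len(h)), K)  # h_l = f_l(h_{l-1})
    return h + noise * rng.standard_normal(len(h))        # add observation noise

ys = sample_deep_gp_prior(np.linspace(0.0, 5.0, 50))
```

Plotting such samples for increasing depth shows the characteristic non-stationary, warped functions that motivate the "series of weak non-linearities" view.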

Approximate inference for (Deep) Gaussian Processes

[Figure: the map of sparse GP approximation methods from slide 319 (PEP / VFE / EP for regression and classification; FITC, PITC, inter-domain and structured variants), now the starting point for approximate inference in Deep GPs.]

325 / 335

Experiment: Value function of the mountain car problem

Setup: the car receives r = −1 on each step and r = 10 at the goal; the state is [x, ẋ].

[Figure: value function over the state space (x₁, x₂), alongside the GP fit and the DGP fit.]

326 / 335

Experiment: Comparison to Bayesian neural networks

Compare DGPs with GPs and Bayesian neural networks with one and two hidden layers using:

- VI(G): Graves' VI [diagonal Gaussian, without the reparam. trick]
- VI(KW): Kingma and Welling's VI [with the reparam. trick]
- PBP: ADF with Probabilistic Backpropagation
- Dropout: combining dropout predictions at test time
- SGLD: stochastic gradient Langevin dynamics
- HMC: Hamiltonian Monte Carlo [only for small networks]

327 / 335

Experiment: Comparison to Bayesian neural networks

Rankings of all methods across all datasets

[Figure: MLL average rank (ranks 1-15, with critical-difference (CD) interval) for PBP-1, Dropout-1, SGLD-1, DGP-1 50, SGLD-2, VI(KW)-1, GP 50, DGP-3 100, DGP-2 100, DGP-3 50, DGP-2 50, GP 100, HMC-1, VI(KW)-2 and DGP-1 100.]

328 / 335

Experiment: Comparison to Bayesian neural networks [Best results]

[Figure: average test log-likelihood (nats) for BNN-deterministic, BNN-sampling, GP and DGP on ten UCI datasets: boston (N = 506, D = 13), concrete (N = 1030, D = 8), energy (N = 768, D = 8), kin8nm (N = 8192, D = 8), naval (N = 11934, D = 16), power (N = 9568, D = 4), protein (N = 45730, D = 9), red wine (N = 1588, D = 11), yacht (N = 308, D = 6), year (N = 515345, D = 90).]

329 / 335


Pros and cons of Gaussian Process Regression

Probabilistic ✓
- need well-calibrated predictive uncertainty (decision making)
- big models (even with small data)
- neural networks do not provide well-calibrated uncertainty estimates

Non-parametric (∞-parametric) ✓
- model parameters form an information bottleneck between training and test data:
  training data → θ → test data
- unbounded model (θ grows with data) pruned by data

Deep (and wide) ✓
- intrinsic hierarchical structure in real-world data
- simpler to learn a series of weak non-linearities

334 / 335

References (hyperlinked)

Deep Gaussian Processes:
- Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

Approximate inference in GPs:
- A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint 2016

Scalable approximate inference:
- Stochastic Expectation Propagation, NIPS 2015
- Black-box α-divergence Minimization, ICML 2016

335 / 335
