
Corso di Sistemi di Elaborazione dell’Informazione

Prof. Giuseppe Boccignone

Dipartimento di Informatica

Università di Milano

[email protected]

boccignone.di.unimi.it/FM_2016.html

SCUOLA DI SPECIALIZZAZIONE IN FISICA MEDICA

Regression: Gaussian Processes and Machine Learning

Supervised learning //Regression

• Gaussian processes: taking any finite number of random variables from the ensemble that forms the random process itself, they have a joint (multivariate) Gaussian probability distribution

• More precisely: a process $\{X_t\}$ is Gaussian if, for any choice of times $t_1, \ldots, t_n$, the random vector $(X_{t_1}, \ldots, X_{t_n})$ has a multivariate normal distribution

Stochastic processes // Classes of processes

Chapter 3

Gaussian Processes and Stochastic Differential Equations

3.1 Gaussian Processes

Gaussian processes are a reasonably large class of processes that have many applications, including in finance, time-series analysis and machine learning. Certain classes of Gaussian processes can also be thought of as spatial processes. We will use these as a vehicle to start considering more general spatial objects.

Definition 3.1.1 (Gaussian Process). A stochastic process $\{X_t\}_{t \ge 0}$ is Gaussian if, for any choice of times $t_1, \ldots, t_n$, the random vector $(X_{t_1}, \ldots, X_{t_n})$ has a multivariate normal distribution.

3.1.1 The Multivariate Normal Distribution

Because Gaussian processes are defined in terms of the multivariate normal distribution, we will need to have a pretty good understanding of this distribution and its properties.

Definition 3.1.2 (Multivariate Normal Distribution). A vector $X = (X_1, \ldots, X_n)$ is said to be multivariate normal (multivariate Gaussian) if all linear combinations of $X$, i.e. all random variables of the form

$$\sum_{k=1}^{n} \alpha_k X_k$$

have univariate normal distributions.



Simulating Multivariate Normals

One possible way to simulate a multivariate normal random vector is using the Cholesky decomposition.

Lemma 3.1.11. If $Z \sim N(0, I)$, $\Sigma = AA^\top$ is a covariance matrix, and $X = \mu + AZ$, then $X \sim N(\mu, \Sigma)$.

Proof. We know from Lemma 3.1.4 that $AZ$ is multivariate normal and so is $\mu + AZ$. Because a multivariate normal random vector is completely described by its mean vector and covariance matrix, we simply need to find $\mathbb{E}X$ and $\mathrm{Var}(X)$. Now,

$$\mathbb{E}X = \mathbb{E}\mu + \mathbb{E}AZ = \mu + A0 = \mu$$

and

$$\mathrm{Var}(X) = \mathrm{Var}(\mu + AZ) = \mathrm{Var}(AZ) = A\,\mathrm{Var}(Z)\,A^\top = AIA^\top = AA^\top = \Sigma.$$

Thus, we can simulate a random vector $X \sim N(\mu, \Sigma)$ by

(i) Finding $A$ such that $\Sigma = AA^\top$.

(ii) Generating $Z \sim N(0, I)$.

(iii) Returning $X = \mu + AZ$.

Matlab makes things a bit confusing because its function `chol` produces a decomposition of the form $B^\top B$. That is, $A = B^\top$. It is important to be careful and think about whether you want to generate a row or column vector when generating multivariate normals.

The following code produces column vectors.

Listing 3.1: Matlab code

mu = [1 2 3]';
Sigma = [3 1 2; 1 2 1; 2 1 5];
A = chol(Sigma);
X = mu + A' * randn(3,1);

The following code produces row vectors.

Listing 3.2: Matlab code

mu = [1 2 3];
Sigma = [3 1 2; 1 2 1; 2 1 5];
A = chol(Sigma);
X = mu + randn(1,3) * A;
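As a quick sanity check (not in the original notes), one can draw many row vectors and compare the sample mean and covariance with mu and Sigma; the sample size N below is an arbitrary choice.

% empirical check of Lemma 3.1.11, row-vector convention
mu = [1 2 3];
Sigma = [3 1 2; 1 2 1; 2 1 5];
A = chol(Sigma);                 % Sigma = A' * A
N = 1e5;                         % number of samples (arbitrary)
X = repmat(mu, N, 1) + randn(N,3) * A;
disp(mean(X));                   % should be close to mu
disp(cov(X));                    % should be close to Sigma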


If $A$ is also symmetric, then $A$ is called symmetric positive semi-definite (SPSD). Note that if $A$ is SPSD then there is a real-valued decomposition

$$A = LL^\top,$$

though it is not necessarily unique and $L$ may have zeroes on the diagonals.

Lemma 3.1.9. Covariance matrices are SPSD.

Proof. Given an $n \times 1$ real-valued vector $x$ and a random vector $Y$ with covariance matrix $\Sigma$, we have

$$\mathrm{Var}(x^\top Y) = x^\top \mathrm{Var}(Y)\, x.$$

Now this must be non-negative (as it is a variance). That is, it must be the case that

$$x^\top \mathrm{Var}(Y)\, x \ge 0,$$

so $\Sigma$ is positive semi-definite. Symmetry comes from the fact that $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$.

Lemma 3.1.10. SPSD matrices are covariance matrices.

Proof. Let $A$ be an SPSD matrix and $Z$ be a vector of random variables with $\mathrm{Var}(Z) = I$. Now, as $A$ is SPSD, $A = LL^\top$. So,

$$\mathrm{Var}(LZ) = L\,\mathrm{Var}(Z)\,L^\top = LIL^\top = A,$$

so $A$ is the covariance matrix of $LZ$.


Densities of Multivariate Normals

If $X = (X_1, \ldots, X_n)$ is $N(\mu, \Sigma)$ and $\Sigma$ is positive definite, then $X$ has the density

$$f(x) = f(x_1, \ldots, x_n) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).$$
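A minimal sketch (not from the notes) that evaluates this density at a point directly from the formula, with no toolbox dependency; mu, Sigma and x are illustrative values:

% evaluate the multivariate normal density f(x) from the formula above
mu = [1 2 3]';
Sigma = [3 1 2; 1 2 1; 2 1 5];
x = [0 0 0]';
n = length(mu);
d = x - mu;
f = exp(-0.5 * d' * (Sigma \ d)) / sqrt((2*pi)^n * det(Sigma));
disp(f);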


Gaussian Processes (GPs) and Machine Learning

• Gaussian processes: taking any finite number of random variables from the collection that forms the random process itself, they have a joint (multivariate) Gaussian probability distribution

• A Gaussian process is completely specified by its mean and covariance matrix


Stochastic processes // Simulating Gaussian processes

The complexity of the Cholesky decomposition of an arbitrary real-valued matrix is $O(n^3)$ floating point operations.

3.1.2 Simulating a Gaussian Process: Version 1

Because Gaussian processes are defined to be processes where, for any choice of times $t_1, \ldots, t_n$, the random vector $(X_{t_1}, \ldots, X_{t_n})$ has a multivariate normal density, and a multivariate normal density is completely described by a mean vector $\mu$ and a covariance matrix $\Sigma$, the probability distribution of a Gaussian process is completely described if we have a way to construct the mean vector and covariance matrix for an arbitrary choice of $t_1, \ldots, t_n$.

We can do this using an expectation function

$$\mu(t) = \mathbb{E}X_t$$

and a covariance function

$$r(s,t) = \mathrm{Cov}(X_s, X_t).$$

Using these, we can simulate the values of a Gaussian process at fixed times $t_1, \ldots, t_n$ by calculating $\mu$, where $\mu_i = \mu(t_i)$, and $\Sigma$, where $\Sigma_{i,j} = r(t_i, t_j)$, and then simulating a multivariate normal vector.

In the following examples, we simulate the Gaussian processes at evenly spaced times $t_1 = h, t_2 = 2h, \ldots, t_n = 1$, with $h = 1/m$.

Example 3.1.12 (Brownian Motion). Brownian motion is a very important stochastic process which is, in some sense, the continuous-time, continuous-state-space analogue of a simple random walk. It has expectation function $\mu(t) = 0$ and covariance function $r(s,t) = \min(s,t)$.

Listing 3.3: Matlab code

%use mesh size h
m = 1000; h = 1/m; t = h : h : 1;
n = length(t);

%Make the mean vector
mu = zeros(1,n);
for i = 1:n
    mu(i) = 0;
end

%Make the covariance matrix
Sigma = zeros(n,n);
for i = 1:n
    for j = 1:n
        Sigma(i,j) = min(t(i),t(j));
    end
end

%Generate the multivariate normal vector
A = chol(Sigma);
X = mu + randn(1,n) * A;

%Plot
plot(t,X);


[Figure 3.1.1: Plot of Brownian motion on [0, 1]; axes t and X_t.]

Example 3.1.13 (Ornstein-Uhlenbeck Process). Another very important Gaussian process is the Ornstein-Uhlenbeck process. This has expectation function $\mu(t) = 0$ and covariance function $r(s,t) = e^{-\alpha|s-t|/2}$.

Listing 3.4: Matlab code

%use mesh size h
m = 1000; h = 1/m; t = h : h : 1;
n = length(t);
%parameter of OU process
alpha = 10;

%Make the mean vector
mu = zeros(1,n);
for i = 1:n
    mu(i) = 0;
end

%Make the covariance matrix
Sigma = zeros(n,n);
for i = 1:n
    for j = 1:n
        Sigma(i,j) = exp(-alpha * abs(t(i) - t(j)) / 2);
    end
end

%Generate the multivariate normal vector
A = chol(Sigma);
X = mu + randn(1,n) * A;

%Plot
plot(t,X);


[Figure 3.1.2: Plot of an Ornstein-Uhlenbeck process on [0, 1] with α = 10; axes t and X_t.]

Example 3.1.14 (Fractional Brownian Motion). Fractional Brownian motion (fBm) is a generalisation of Brownian motion. It has expectation function $\mu(t) = 0$ and covariance function $\mathrm{Cov}(s,t) = \frac{1}{2}\left(t^{2H} + s^{2H} - |t-s|^{2H}\right)$, where $H \in (0,1)$ is called the Hurst parameter. When $H = 1/2$, fBm reduces to standard Brownian motion. Brownian motion has independent increments. In contrast, for $H > 1/2$ fBm has positively correlated increments and for $H < 1/2$ fBm has negatively correlated increments.

Stochastic processes // Simulating Gaussian processes

• Simulating an Ornstein-Uhlenbeck (O-U) process:


[Figure 3.1.3: Plot of an Ornstein-Uhlenbeck process on [0, 1] with α = 100; axes t and X_t.]

Listing 3.5: Matlab code

%Hurst parameter
H = .9;

%use mesh size h
m = 1000; h = 1/m; t = h : h : 1;
n = length(t);

%Make the mean vector
mu = zeros(1,n);
for i = 1:n
    mu(i) = 0;
end

%Make the covariance matrix
Sigma = zeros(n,n);
for i = 1:n
    for j = 1:n
        Sigma(i,j) = 1/2 * (t(i)^(2*H) + t(j)^(2*H)...
            - (abs(t(i) - t(j)))^(2 * H));
    end
end

%Generate the multivariate normal vector
%(the original listing is truncated here; the closing steps
% below follow the same pattern as Listings 3.3-3.4)
A = chol(Sigma);
X = mu + randn(1,n) * A;

%Plot
plot(t,X);


Gaussian Processes (GPs) and Machine Learning

• Gaussian processes: taking any finite number of random variables from the collection that forms the random process itself, they have a joint (multivariate) Gaussian probability distribution

• A Gaussian process is completely specified by its mean and covariance matrix

• The problem for Machine Learning: learning the parameters of the covariance function

Further reading

Many more topics and code: http://www.gaussianprocess.org/gpml/

MCMC inference for GPs is implemented in FBM: http://www.cs.toronto.edu/~radford/fbm.software.html

Gaussian processes for ordinal regression, Chu and Ghahramani, JMLR, 6:1019–1041, 2005.

Flexible and efficient Gaussian process models for machine learning, Edward L. Snelson, PhD thesis, UCL, 2007.

http://www.gaussianprocess.org/gpml/chapters/

http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html

Gaussian processes: an intuitive presentation //basic ideas

• Suppose we have 6 noisy data points: we want to predict at x* = 0.2

• We do not assume a parametric model

• We let the data speak for themselves

• We assume the data were generated by an n-variate Gaussian distribution

Gaussian processes //basic ideas

• We assume the data were generated by an n-variate Gaussian distribution with:

• zero mean

• covariance function k(x, x')

• Characteristics of k(x, x'):

• maximal for x close to x'

• small when x is far from x'

• example (the squared-exponential kernel used in the code below): k(x, x') = exp(-(x - x')^2 / (2*tau^2))

x close to x'  =>  y correlated with y'
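A minimal sketch of this behaviour, using the squared-exponential kernel that reappears in the example code later (the bandwidth tau = 1 is an arbitrary illustrative choice):

% covariance decays with the distance between x and x'
tau = 1.0;
k = @(x,z) exp(-(x-z).^2 / (2*tau^2));
disp(k(0, 0.1));   % x close to x': covariance near its maximum, 1
disp(k(0, 3.0));   % x far from x': covariance near 0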

Gaussian processes //basic ideas

• We assume the data were generated by an n-variate Gaussian distribution with additive Gaussian noise

• We incorporate the noise into the kernel

• We want to predict y*, not the true underlying function f*

the point is the mean of y* and f*

same mean, different covariance

Gaussian processes //basic ideas

• We compute the covariances between all possible pairs of points

• obtaining three matrices (K_train_train, K_test_train and K_test_test in the code below)

Gaussian processes //basic ideas

• Regression with GPs

• We assume the data were generated by an n-variate Gaussian distribution with additive Gaussian noise

• We want to predict y*

• Using the properties of Gaussians we obtain a prediction (the predictive mean) and its uncertainty (the predictive variance)

Gaussian processes //basic ideas

• GP regression for the point y*

• In the example, the noise level is estimated from the error bars, together with an appropriate choice of the kernel parameters

Gaussian processes //basic ideas

• GP regression for 1000 points

• We can incorporate the "physics" of the process into the kernel function


Example

• Definition of the kernel function k(x, z)

• Creation of a data set

% kernel and bandwidth definition
tau = 1.0;
k = @(x,z) exp(-(x-z)'*(x-z) / (2*tau^2));

% create a training set corrupted by noise
m_train = 10;
i_train = floor(rand(m_train,1) * size(X,1) + 1);
X_train = X(i_train);
y_train = h(i_train) + sigma * randn(m_train,1);

true (latent) function
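The snippet above uses X, h and sigma without defining them; a minimal setup sketch (these particular choices are assumptions, not from the slides):

% hypothetical setup for the snippet above
X = linspace(0, 2*pi, 200)';   % input grid (one point per row)
h = sin(X);                    % true (latent) function on the grid
sigma = 0.1;                   % noise standard deviation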

Example

• Prediction:

X_test = X;
K_test_test = compute_kernel_matrix(k,X_test,X_test);
K_test_train = compute_kernel_matrix(k,X_test,X_train);
K_train_train = compute_kernel_matrix(k,X_train,X_train);

G = K_train_train + sigma^2 * eye(size(X_train,1));
mu_test = K_test_train * (G \ y_train);
Sigma_test = K_test_test + sigma^2 * eye(size(X_test,1)) ...
    - K_test_train * (G \ (K_test_train'));

function [K] = compute_kernel_matrix(k, X, Z)
    m = size(X,1);
    n = size(Z,1);
    K = zeros(m,n);
    for i = 1:m
        for j = 1:n
            K(i,j) = k(X(i,:)', Z(j,:)');
        end
    end
end
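To visualize the fit, a short sketch (assuming the variables computed above) that plots the predictive mean with +/- 2 standard deviation bands:

% predictive mean and uncertainty bands
s = sqrt(diag(Sigma_test));          % predictive std at each test point
plot(X_test, mu_test, 'b-'); hold on;
plot(X_test, mu_test + 2*s, 'b--');
plot(X_test, mu_test - 2*s, 'b--');
plot(X_train, y_train, 'r+');        % noisy training points
hold off;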

• Fantastic! But...

• what have we actually done, and how does it relate to the Bayesian approach?

• how do we make an appropriate choice of the parameters??

• Let's go back to linear regression

Gaussian processes //basic ideas

The Gaussian prior probability on the weights is equivalent to a Gaussian prior on the functions y

Design matrix

Jointly normal

• It can be shown that Cov(y) = Phi * Sigma_w * Phi', the Gram matrix

A Gaussian prior on the functions y

Mean and variance suffice to specify it

With zero mean, the variance alone suffices to specify it

The data suffice to specify the variance
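A minimal sketch of this equivalence: a Gaussian prior w ~ N(0, I) on the weights induces a Gaussian prior on the function values y = Phi*w, with covariance K = Phi*Phi' (the Gram matrix). The radial-basis design matrix below is an assumption chosen for illustration.

% Gaussian prior on weights => Gaussian prior on functions
x = linspace(-5, 5, 100)';
c = linspace(-5, 5, 20);                       % basis-function centres
Phi = exp(-bsxfun(@minus, x, c).^2 / 2);       % design matrix (100 x 20)
K = Phi * Phi';                                % implied Gram matrix
for s = 1:3                                    % three functions from the prior
    w = randn(20, 1);                          % w ~ N(0, I)
    plot(x, Phi * w); hold on;
end
hold off;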

Gaussian processes //conceptual foundations

(Joint) Gaussian probability over the points of the function

Gaussian prior probability over the function

The parameters w have "disappeared"!

f is a random variable

Gaussian processes //conceptual foundations

function [f] = sample_gp_prior(k, X)
    K = compute_kernel_matrix(k, X, X);
    % sample from a zero-mean Gaussian distribution
    % with covariance matrix K
    [U,S,V] = svd(K);
    A = U * sqrt(S) * V';
    f = A * randn(size(X,1),1);
end
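A usage sketch: draw three functions from the GP prior with the squared-exponential kernel used earlier (tau = 1 and the grid below are arbitrary choices; compute_kernel_matrix is the helper defined above):

% three function draws from the GP prior
tau = 1.0;
k = @(x,z) exp(-(x-z)'*(x-z) / (2*tau^2));
X = linspace(0, 5, 100)';
for s = 1:3
    f = sample_gp_prior(k, X);
    plot(X, f); hold on;
end
hold off;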

Gaussian processes //conceptual foundations

• When I sample from the Gaussian process, I sample a function

Gaussian process

The parameters w have "disappeared"!

Gaussian processes //conceptual foundations

That is: is it a probability distribution over functions?

And is a function therefore a "random variable"?

• When I sample from the Gaussian process, I sample a function

Gaussian processes //conceptual foundations

• Let's look at it this way..... we sample from a 2D Gaussian

• ...which we can plot in a "horizontal manner", if we wish

• ...and where we can display a 2-D point using both plots

[Figure: samples from a bivariate Gaussian shown both as a scatter plot of (x1, x2) and as a two-column "horizontal" plot; vertical axis from -4 to 4]

Gaussian processes //conceptual foundations

• ..... and we join the two coordinates with a straight line

[Figure: the two coordinates of each 2-D sample joined by a line segment in the horizontal plot; vertical axis from -4 to 4]
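A minimal sketch of this picture (the correlation value 0.9 is an arbitrary illustrative choice): sample points from a correlated 2-D Gaussian and plot each sample "horizontally", joining the two coordinates with a line.

% horizontal plot of 2-D Gaussian samples
Sigma = [1 0.9; 0.9 1];
A = chol(Sigma);                   % Sigma = A' * A
for s = 1:5
    p = randn(1,2) * A;            % one sample (x1, x2)
    plot([1 2], p, 'o-'); hold on; % coordinate index on the horizontal axis
end
axis([0.5 2.5 -4 4]);              % vertical range as in the slides
hold off;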

Gaussian processes //conceptual foundations

• Now from a 3D Gaussian

• ...but we're not restricted to bivariate Gaussians

• Let's introduce some extra dimensions to our multivariate Gaussian distribution (without plotting it!)

• Perhaps unsurprisingly, as we generate points from our new trivariate distribution, we can equivalently show these on our three-column "horizontal" plot

[Figure: three-column horizontal plot with columns x1, x2, x3; vertical axis from -4 to 4]

Gaussian processes //conceptual foundations

• Now from a 5D Gaussian

• Why stop at a trivariate Gaussian distribution? We can add as many dimensions to our multivariate Gaussian as we wish

[Figure: five-column horizontal plot with columns x1, ..., x5; vertical axis from -4 to 4]

Gaussian processes //conceptual foundations

• Now, 3 draws from a 25D Gaussian

Building large Gaussians. Three draws from a 25D Gaussian:

[Figure: three sampled 25-dimensional vectors plotted as f against x; vertical axis roughly -1 to 2]

To produce this, we needed a mean: I used zeros(25,1).

The covariances were set using a kernel function: Sigma_ij = k(x_i, x_j).

The x's are the positions that I planted the tics on the axis.

Later we'll find k's that ensure Sigma is always positive semi-definite.

Gaussian process

The parameters w have "disappeared"!

f is a random variable

Gaussian processes //conceptual foundations

Gaussian distribution vs. Gaussian process

• A Gaussian process defines a distribution over functions p(f), where f is a function

• f can be considered an infinite-dimensional vector (a stochastic process)

• Definition: p(f) is a Gaussian process if, for every finite subset of function values, the distribution over that subset (an n-dimensional vector) is a multivariate Gaussian

Gaussian processes //formal definition

what does the kernel tell us?

• The kernel function, hence the covariance, tells us how correlated two points are

Gaussian processes //What the kernel tells us....

Gaussian processes //conceptual foundations

Gaussian process: using different kernels....

[Figures: samples from GP priors with different kernels]

Gaussian processes //Graphical model

n = 1, ..., N

kernel parameters

Gaussian processes //Regression

• The model, assuming Gaussian noise: y = f(x) + eps, with eps ~ N(0, sigma^2)

• The joint distribution over the targets y and the function values f

• Prediction for a new data point x*:

• The joint distribution over (y, y*)

• We partition the covariance matrix as (consistent with G = K_train_train + sigma^2*I in the code above)

$$\begin{pmatrix} K(X,X) + \sigma^2 I & K(X, x_*) \\ K(x_*, X) & k(x_*, x_*) + \sigma^2 \end{pmatrix}$$

Gaussian processes //Regression

• Exploiting the properties of jointly Gaussian variables in matrix form, the prediction is given by (as in the mu_test and Sigma_test code above)

$$\mu_* = K(x_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y}$$

$$\sigma_*^2 = k(x_*, x_*) + \sigma^2 - K(x_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} K(X, x_*)$$

Gaussian processes //Regression: example

[Figure sequence: GP regression example plots]

• The covariance matrix must be positive semi-definite

• In general, the covariance function has parameters to be learned: e.g. the signal variance and the length scale (correlation length)

Gaussian processes //Kernel parameters

• ML approach: maximize the marginal likelihood p(y | X, theta)

• MAP approach: maximize the posterior p(theta | X, y), which is proportional to p(y | X, theta) p(theta)

• fully Bayesian approach: intractable, requires approximations (a minimal sketch of the ML approach follows below)

Gaussian processes //Kernel parameters
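A minimal sketch of the ML approach, assuming training data X_train, y_train and the compute_kernel_matrix helper defined earlier: minimize the negative log marginal likelihood over the kernel bandwidth tau and the noise level sigma with fminsearch (the log-parameterization enforces positivity).

% negative log marginal likelihood (save as its own function file)
function [nll] = neg_log_marglik(theta, X, y)
    tau = exp(theta(1)); sigma = exp(theta(2));
    k = @(a,b) exp(-(a-b)'*(a-b) / (2*tau^2));
    K = compute_kernel_matrix(k, X, X) + sigma^2 * eye(size(X,1));
    L = chol(K, 'lower');                  % K = L * L'
    alpha = L' \ (L \ y);                  % alpha = K \ y
    nll = 0.5 * (y' * alpha) + sum(log(diag(L))) ...
          + 0.5 * size(X,1) * log(2*pi);
end

% usage: fit tau and sigma by maximum (marginal) likelihood
theta0 = log([1.0; 0.1]);                  % initial guesses (arbitrary)
theta = fminsearch(@(t) neg_log_marglik(t, X_train, y_train), theta0);
tau = exp(theta(1)); sigma = exp(theta(2));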
