TRANSCRIPT
Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Linear Models Of Regression:
Bias-Variance Decomposition,
Bayesian Linear Regression
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/
July 30, 2014
Contents

The bias-variance decomposition
Ridge Regression, Shrinkage, Regularization effect of Big Data
Bayesian linear regression, Parameter posterior distribution, A Note on Data Centering, Numerical Example
Predictive distribution, Gaussian Processes
Bayesian inference in linear regression when σ² is unknown, Zellner's g-Prior, Uninformative (Semi-Conjugate) Prior, Evidence Approximation
References: Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7; Chris Bishop's PRML book, Chapter 3; Regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3, pmtk3-1nov12/docs/tutorial/html/tutRegr.html).
The Bias-Variance Decomposition
MLE (i.e. least squares) leads to severe over-fitting if complex
models are trained using data sets of limited size.
Over-fitting occurs whenever the number of basis functions is
large (i.e. for complex models) and the training data set is of
limited size.
Limiting the number of basis functions limits the flexibility of the
model.
Regularization controls over-fitting, but one needs to determine the regularization parameter λ.
Over-fitting is a property of MLE and does not arise when we marginalize over the parameters in a Bayesian setting.
Before returning to a Bayesian setting, we will discuss the bias-
variance tradeoff (a frequentist viewpoint of model complexity).
Loss Function and the Regression Function
Recall the regression loss function

$$L(t, y(\mathbf{x})) = \{y(\mathbf{x}) - t\}^2$$

The decision problem is to minimize the expected loss:

$$\mathbb{E}[L] = \iint \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$

The solution of this minimization problem, known as the regression function, is

$$y(\mathbf{x}) = \mathbb{E}_t[t\,|\,\mathbf{x}] = \int t\, p(t\,|\,\mathbf{x})\, dt,$$

i.e. the average of t conditioned on x.

A useful expression was derived in an earlier lecture:

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t\,|\,\mathbf{x}]\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{\mathbb{E}[t\,|\,\mathbf{x}] - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
The Bias-Variance Decomposition
We can write the expected squared loss as

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$

where h(x) is the conditional expectation, given by

$$h(\mathbf{x}) = \mathbb{E}[t\,|\,\mathbf{x}] = \int t\, p(t\,|\,\mathbf{x})\, dt$$

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss.

If we model h(x) using a parametric function y(x, w) governed by w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w.
The Bias-Variance Decomposition
A frequentist treatment involves making a point estimate of w based on the data set D and interpreting the uncertainty of this estimate as follows.
Suppose we had a large number of data sets each of size N
and each drawn independently from p(x, t).
For any given data set D, we run our learning algorithm and
obtain a prediction function y(x; D). Different data sets from the
ensemble give different functions and values of the squared
loss.
The performance of the learning algorithm is then assessed by
taking the average over this ensemble of data sets.
The Bias-Variance Decomposition
For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Consider the squared difference from the regression function and expand:

$$\{y(\mathbf{x};D) - h(\mathbf{x})\}^2 = \{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)] + \mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2$$
$$= \{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2 + \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2 + 2\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}\{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}$$

Here $\mathbb{E}_D[y(\mathbf{x};D)]$ is the average of our prediction function over all data sets.

Take the expectation of this expression with respect to D and note that the final term vanishes, giving

$$\mathbb{E}_D\big[\{y(\mathbf{x};D) - h(\mathbf{x})\}^2\big] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\big[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\big]}_{\text{variance}}$$

Recall that h(x) is the desired regression function.
The Bias-Variance Decomposition
So far, we have considered a single input value x. If we substitute this expansion back into the expected squared loss function shown above, we obtain the following decomposition of the expected squared loss:

Expected loss = (bias)² + variance + noise

where

$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x}$$

$$\text{variance} = \int \mathbb{E}_D\big[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\big]\, p(\mathbf{x})\, d\mathbf{x}$$

$$\text{noise} = \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
The Bias-Variance Decomposition
Expected loss = (bias)² + variance + noise

There is a tradeoff between bias and variance:
flexible models have low bias and high variance;
rigid models have high bias and low variance.

The bias-variance decomposition provides insight into model complexity, but it is of limited practical use since it requires an ensemble of data sets D.
The Bias-Variance Decomposition

[Figure: dependence of bias and variance on model complexity, governed by the regularization parameter λ, using the sinusoidal data set (generated from sin(2πx)). L = 100 data sets, each with N = 25 data points; 24 Gaussian basis functions plus a bias term, so M = 25 parameters. Left: 20 of the 100 fits for ln λ = 2.6. Right: the average of the 100 fits against the generating function. Large λ: high bias, low variance. (MatLab code)]
The Bias-Variance Decomposition

[Figure: the same experiment for an intermediate regularization parameter, ln λ = -0.31. Left: 20 of the 100 fits. Right: the average of the 100 fits against the generating function sin(2πx). L = 100 data sets, N = 25 points each, M = 25 parameters. (MatLab code)]
The Bias-Variance Decomposition

[Figure: the same experiment for the smallest regularization parameter shown, ln λ = -2.4 (low bias, large variance). Left: 20 of the 100 fits. Right: the average of the 100 fits. L = 100 data sets, N = 25 points each, M = 25 parameters. (MatLab code)]

Note that averaging many solutions of the complex model (M = 25) obtained from different data sets gives a very good fit.

Weighted averaging is also done in the Bayesian approach, but there it is with respect to the posterior distribution of the parameters.
Trade-off Quantities
The average prediction over L data sets is estimated from

$$\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x)$$

The integrated squared bias and integrated variance are then given by

$$(\text{bias})^2 = \frac{1}{N}\sum_{n=1}^{N}\big\{\bar{y}(x_n) - h(x_n)\big\}^2$$

$$\text{variance} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{L}\sum_{l=1}^{L}\big\{y^{(l)}(x_n) - \bar{y}(x_n)\big\}^2$$

These are finite-sample estimates of the integrated quantities $\int \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\,d\mathbf{x}$ and $\int \mathbb{E}_D[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2]\, p(\mathbf{x})\,d\mathbf{x}$. A MatLab-style sketch of this estimation is given below.
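The following is a minimal MatLab-style sketch of the bias/variance experiment, not the course's linked PMTK3/MatLab code; the noise level, basis width, and the use of a grid of evaluation points (rather than the training inputs) are assumptions made for illustration.

% Sketch: empirical (bias)^2 and variance for regularized least squares
% on the sin(2*pi*x) example (L = 100 data sets, N = 25 points each).
L = 100; N = 25; lambda = exp(-0.31); noiseStd = 0.3;        % assumed noise level
mu = linspace(0, 1, 24); s = 0.1;                            % Gaussian basis centres / width (assumed)
phi = @(x) [ones(numel(x),1), exp(-bsxfun(@minus, x(:), mu).^2/(2*s^2))];
xGrid = linspace(0, 1, 100)';  h = sin(2*pi*xGrid);          % regression function h(x)
Yhat = zeros(numel(xGrid), L);
for l = 1:L
    x = rand(N,1);  t = sin(2*pi*x) + noiseStd*randn(N,1);   % one data set
    Phi = phi(x);
    w = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*t);     % regularized LS fit
    Yhat(:,l) = phi(xGrid)*w;                                % prediction y(x; D_l)
end
ybar  = mean(Yhat, 2);                                       % average prediction over data sets
bias2 = mean((ybar - h).^2);                                 % integrated (bias)^2
vari  = mean(mean((Yhat - repmat(ybar, 1, L)).^2, 2));       % integrated variance

Sweeping lambda and plotting bias2, vari and their sum reproduces the behavior shown in the following figures.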
The Bias-Variance Decomposition
Note that a small regularization parameter λ allows the model to be finely tuned to the noise in each individual data set, leading to small bias but large variance.

Conversely, a large λ forces all the wi towards zero, leading to large bias but small variance.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 versus ln λ; complex models lie at small ln λ, simple models at large ln λ. (MatLab code)]

λ       Model complexity   Variance   Bias
low     complex            high       low
high    simple             low        high
The Bias-Variance Decomposition
Plot of the squared bias and variance, together with their sum. Also shown is the average test-set error for a test data set of 1000 points.

The minimum value of (bias)² + variance occurs around ln λ = -0.31, which is close to the value that gives the minimum error on the test data.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 versus ln λ. (MatLab code)]
Ridge Regression
MLE overfits because it picks the parameter values that are best for modeling the training data. If the data are noisy, such parameters often correspond to complex functions.

Consider fitting a degree-14 polynomial to N = 21 data points using least squares. The resulting curve is wiggly, and the coefficients wi (excluding w0) take large values in order to interpolate the data nearly perfectly. If we changed the data a little, the coefficients would change a lot.

[Figure: least-squares fit of a degree-14 polynomial to N = 21 points. Run linregPolyVsRegDemo from PMTK3.]
Ridge Regression
Degree-14 polynomial fit to N = 21 data points with increasing amounts of L2 regularization. The data were generated from noise with σ² = 4.

The error bars, representing the noise σ², get wider as the fit gets smoother, since we are ascribing more of the data variation to the noise.

[Figure: regularized fits for ln λ = -20.135 and ln λ = -8.571. Run linregPolyVsRegDemo from PMTK3.]

The objective being minimized is

$$\min_{\mathbf{w}} \sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 + \lambda\,\mathbf{w}^T\mathbf{w}$$
Ridge Regression
For a Bayesian perspective on regularized least squares, consider MAP estimation with a zero-mean Gaussian prior

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \tau^2\mathbf{I})$$

so that the log posterior takes the form

$$\ln p(\mathbf{w}\,|\,\mathcal{D}) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 - \frac{1}{2\tau^2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$

The MAP estimation problem is therefore the same as

$$\min_{\mathbf{w}} \sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 + \lambda\,\mathbf{w}^T\mathbf{w}, \qquad \lambda = \sigma^2/\tau^2$$

Here λ is the complexity penalty. Ridge regression (penalized least squares) leads to the estimate

$$\hat{\mathbf{w}}_{\text{Ridge}} = \big(\lambda\,\mathbf{I}_D + \mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{t}$$

w0 is not regularized, as it does not affect the complexity of the function. Regularization ensures the function is simple (e.g. w = 0 corresponds to a straight line). Increasing λ results in smoother functions and smaller wi. A sketch of computing this estimate follows.
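A minimal MatLab-style sketch of the closed-form ridge estimate, assuming the offset w0 is handled by centering (so it is not penalized); function and variable names are illustrative, not the course's code.

% Sketch: closed-form ridge estimate; X is N x D, t is N x 1, lambda >= 0.
function [w, w0] = ridgeFit(X, t, lambda)
    xbar = mean(X, 1);  tbar = mean(t);
    Xc = X - repmat(xbar, size(X,1), 1);           % centred inputs
    tc = t - tbar;                                 % centred targets
    D  = size(X, 2);
    w  = (Xc'*Xc + lambda*eye(D)) \ (Xc'*tc);      % (lambda I + X'X)^{-1} X' t
    w0 = tbar - xbar*w;                            % offset recovered afterwards
end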
Ridge Regression
We continue with the degree-14 polynomial fit by ridge regression, plotted vs log(λ) (N = 21, σ² = 4).

On the left, notice that as λ increases, the training error increases; the test error has the classical U shape. Cross-validation can be used to select an optimal λ.

On the right, we estimate performance using the training set alone, via 5-fold cross-validation and the negative log marginal likelihood -log p(D|λ) (the plots are vertically rescaled to [0,1]).

[Figure: left, training and test mean squared error vs log λ (complex models on the left, simple models on the right); right, -log p(D|λ) and the 5-fold cross-validation estimate of future MSE vs log λ. Run linregPolyVsRegDemo from PMTK3.]
Numerically Stable Ridge Estimate

To avoid inverting $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_D)$, we augment the original data with virtual data coming from the prior as follows:

$$\tilde{\mathbf{X}} = \begin{pmatrix}\mathbf{X}\\ \sqrt{\lambda}\,\mathbf{I}_D\end{pmatrix}, \qquad \tilde{\mathbf{t}} = \begin{pmatrix}\mathbf{t}\\ \mathbf{0}_{D}\end{pmatrix}$$

With these definitions, the penalized LS problem looks like ordinary least squares on the augmented data:

$$\min_{\mathbf{w}}\ (\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})$$

This is clear since

$$(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w}) = (\mathbf{t} - \mathbf{X}\mathbf{w})^T(\mathbf{t} - \mathbf{X}\mathbf{w}) + \lambda\,\mathbf{w}^T\mathbf{w}$$

The ridge estimate is $\hat{\mathbf{w}}_{\text{Ridge}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{t}}$. Consider a QR decomposition $\tilde{\mathbf{X}} = \mathbf{Q}\mathbf{R}$. Then we only have to invert R, which is upper triangular (in MatLab, w = Xtil\ttil). The cost is O(ND²):

$$\hat{\mathbf{w}}_{\text{Ridge}} = (\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\tilde{\mathbf{t}} = (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\tilde{\mathbf{t}} = \mathbf{R}^{-1}\mathbf{Q}^T\tilde{\mathbf{t}}$$

A sketch of this QR-based computation follows.
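A minimal MatLab-style sketch of the augmented-data / QR route, under the assumption that the inputs are already centred; names are illustrative.

% Sketch: numerically stable ridge via data augmentation + QR.
function w = ridgeQR(X, t, lambda)
    D    = size(X, 2);
    Xtil = [X; sqrt(lambda)*eye(D)];      % augmented design matrix
    ttil = [t; zeros(D, 1)];              % virtual targets from the prior
    [Q, R] = qr(Xtil, 0);                 % economy-size QR decomposition
    w = R \ (Q'*ttil);                    % w = R^{-1} Q' t~ ; only R is "inverted"
end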
Numerically Stable Ridge Estimate

Consider now the case where D ≫ N. We first compute a thin SVD decomposition

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T, \qquad \mathbf{U}^T\mathbf{U} = \mathbf{I}_N, \quad \mathbf{V}^T\mathbf{V} = \mathbf{I}_N, \quad \mathbf{S}\ (N\times N)\ \text{diagonal}$$

Defining Z = US, an N×N matrix, we can write the ridge estimate as

$$\hat{\mathbf{w}}_{\text{Ridge}} = \mathbf{V}\big(\mathbf{Z}^T\mathbf{Z} + \lambda\,\mathbf{I}_N\big)^{-1}\mathbf{Z}^T\mathbf{t}$$

In essence, we replace the D-dimensional xi with the N-dimensional zi and perform the penalized fit as before. We then transform the N-dimensional solution to the D-dimensional solution by multiplying by V.

Geometrically, we are rotating to a new coordinate system in which all but the first N coordinates are zero. This does not affect the solution since the spherical Gaussian prior is rotationally invariant. The overall time is now O(DN²) operations. A sketch of this computation follows.
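A minimal MatLab-style sketch of the SVD route for D ≫ N; names are illustrative and the inputs are assumed already centred.

% Sketch: ridge for D >> N using the thin SVD X = U*S*V', cost O(D N^2).
function w = ridgeSVD(X, t, lambda)
    [U, S, V] = svd(X, 'econ');            % for N < D: U is NxN, S is NxN, V is DxN
    Z  = U*S;                              % N x N reduced design matrix
    wz = (Z'*Z + lambda*eye(size(Z,2))) \ (Z'*t);   % penalized fit in the reduced space
    w  = V*wz;                             % map back to the D-dimensional space
end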
Connections with PCA

Using the singular values of X, the ridge predictions on the training set can be written as

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{\text{Ridge}} = \mathbf{U}\mathbf{S}\mathbf{V}^T\,\mathbf{V}\big(\mathbf{S}^2 + \lambda\mathbf{I}_N\big)^{-1}\mathbf{S}\,\mathbf{U}^T\mathbf{t} = \mathbf{U}\tilde{\mathbf{S}}\mathbf{U}^T\mathbf{t} = \sum_{j}\mathbf{u}_j\,\frac{\sigma_j^2}{\sigma_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{t}$$

where σj are the singular values of X. Recall that ordinary least squares gives

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{\text{LS}} = \mathbf{U}\mathbf{U}^T\mathbf{t} = \sum_{j}\mathbf{u}_j\,\mathbf{u}_j^T\mathbf{t}$$

If $\sigma_j^2 \ll \lambda$, then direction $\mathbf{u}_j$ has little effect on the prediction. This motivates the definition of the effective number of degrees of freedom:

$$\text{dof}(\lambda) = \sum_{j}\frac{\sigma_j^2}{\sigma_j^2 + \lambda}, \qquad \text{dof}(\lambda = 0) = D, \quad \text{dof}(\lambda \to \infty) = 0$$

We will see later in this lecture that for a uniform prior on w, the posterior covariance is

$$\text{cov}[\mathbf{w}\,|\,\mathcal{D}] = \sigma^2\big(\mathbf{X}^T\mathbf{X}\big)^{-1}$$
Shrinkage

$$\text{cov}[\mathbf{w}\,|\,\mathcal{D}] = \sigma^2\big(\mathbf{X}^T\mathbf{X}\big)^{-1}$$

Thus the directions in which we are most uncertain about w are determined by the eigenvectors of $\mathbf{X}^T\mathbf{X}$ with the smallest eigenvalues; the eigenvalues of $(\mathbf{X}^T\mathbf{X})^{-1}$ are $1/\sigma_j^2$. Hence small singular values σj correspond to directions with high posterior variance. It is these directions which ridge regression shrinks the most.

[Figure: contours of the likelihood and of the prior in (w1, w2) space, with the MLE, the prior mean, and the posterior mean marked.]

w1 is not well determined by the data (it has high posterior variance), but w2 is well determined. Ill-determined parameters are reduced in size towards 0:

$$\hat{w}_1^{\text{MAP}} \approx w_1^{\text{prior}} = 0, \qquad \hat{w}_2^{\text{MAP}} \approx \hat{w}_2^{\text{MLE}}$$

This is called shrinkage.
Regularization Effect of Big Data

Another effective regularizing approach is to use lots of data: more training data in general implies better learning.

The test-set error decreases to a plateau as N increases. This is illustrated by plotting the MSE incurred on the test set by polynomial regression of different degrees vs N (the learning curve).

The level of the plateau for the test error consists of two terms:
an irreducible component (that all models incur) due to the intrinsic variability of the generating process (the noise floor); and
a component that depends on the discrepancy between the generating process (the "truth") and the model (the structural error).
Learning Curves
Truth is a degree-2 polynomial, and we fit polynomials of degrees 1, 2 and 25 (σ² = 4).

The structural error for M2 and M25 is zero, as both capture the true generating process. The structural error for M1 is substantial: the plateau occurs high above the noise floor.

[Figure: training and test MSE vs size of the training set for truth = degree 2 and models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Test Error and Simple Models
For any model that is expressive enough to capture the truth (i.e., M2, M10, M25), the test error goes to the noise floor as N → ∞. However, the test error reaches this floor faster for simpler models, since there are fewer parameters to estimate.

[Figure: training and test MSE vs size of the training set for models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Approximation Error
For finite N, there is a discrepancy between the parameters that we estimate and the best parameters that we could estimate given the particular model class.

The approximation error goes to zero as N → ∞, but it goes to zero faster for simpler models.

[Figure: training and test MSE vs size of the training set for models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Multi-task Learning
In domains with lots of data, simple methods work very well. More often, however, we have little data. E.g., in web search one has lots of data overall, but personalizing the results leaves only a few data points per user.

In multi-task learning, we often borrow statistical strength from tasks with lots of data and share it with tasks with little data.
Halevy, A., P. Norvig, and F. Pereira (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2),
8–12.
Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by λ, and still by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive (see the earlier example on choosing the regularization using test data).
A Bayesian treatment of linear regression avoids the over-
fitting of maximum likelihood.
Bayesian approaches lead to automatic methods of
determining model complexity using the training data alone.
Bayesian Linear Regression

Assume additive Gaussian noise with known precision β. The likelihood function p(t|w) is the exponential of a quadratic function of w:

$$p(\mathbf{t}\,|\,\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n\,|\,\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\beta^{-1}\big) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

Its conjugate prior is Gaussian:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_0, \mathbf{S}_0)$$

Combining this with the likelihood and using the results for marginal and conditional Gaussian distributions gives the posterior

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$
Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed:

$$p(\mathbf{w}\,|\,\mathbf{m}_0,\mathbf{S}_0) \propto \exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)$$

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

$$p(\mathbf{w}\,|\,\mathbf{t},\mathbf{x},\beta) \propto \exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) - \frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

Collecting the terms that are quadratic and linear in w,

$$p(\mathbf{w}\,|\,\mathbf{t},\mathbf{x},\beta) \propto \exp\left(-\frac{1}{2}\mathbf{w}^T\Big(\mathbf{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\Big)\mathbf{w} + \mathbf{w}^T\Big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n\Big)\right)$$

and completing the square in w gives

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N,\mathbf{S}_N), \qquad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T$$
Bayesian Linear Regression

Note that because the posterior distribution is Gaussian, its mode coincides with its mean, so

$$\mathbf{w}_{\text{MAP}} = \mathbf{m}_N$$

The above expressions for the posterior mean and covariance,

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N,\mathbf{S}_N), \qquad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi},$$

can also be written for a sequential calculation: having already observed N data points, we now consider an additional data point (x_{N+1}, t_{N+1}). In this case

$$p(\mathbf{w}\,|\,t_{N+1}, \mathbf{x}_{N+1}, \mathbf{m}_N, \mathbf{S}_N) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_{N+1}, \mathbf{S}_{N+1})$$

$$\mathbf{m}_{N+1} = \mathbf{S}_{N+1}\big(\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})\,t_{N+1}\big), \qquad \mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})\,\boldsymbol{\phi}(\mathbf{x}_{N+1})^T$$
Bayesian Linear Regression

Let us consider as a prior a zero-mean isotropic Gaussian governed by a single precision parameter α (see the earlier example), so that

$$p(\mathbf{w}\,|\,\alpha) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \alpha^{-1}\mathbf{I})$$

The corresponding posterior distribution over w is then given by

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\mathbf{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$

The log of the posterior is the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

$$\ln p(\mathbf{w}\,|\,\mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 - \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$

Thus the MAP estimate is the same as regularized least squares (ridge regression) with λ = α/β.
A Note on Data Centering

In linear regression, it helps to center the data in a way that does not require us to compute the offset term μ. Write the likelihood as

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\mu,\beta) \propto \exp\left(-\frac{\beta}{2}\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)\right)$$

where the design matrix collects the basis functions evaluated at the inputs,

$$\mathbf{\Phi} = \begin{pmatrix}\phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_M(\mathbf{x}_1)\\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_M(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_M(\mathbf{x}_N)\end{pmatrix}, \qquad \boldsymbol{\phi}(\mathbf{x}_i) = \big(\phi_1(\mathbf{x}_i), \ldots, \phi_M(\mathbf{x}_i)\big)^T$$

Let us assume that the input data are centered in each dimension such that

$$\sum_{i=1}^{N}\phi_j(\mathbf{x}_i) = 0, \qquad j = 1,\ldots,M$$

The mean of the output is equally likely to be positive or negative. Let us put an improper prior p(μ) ∝ 1 on the offset and integrate μ out.
A Note on Data Centering

Introducing $\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$, the marginal likelihood becomes

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)\right)d\mu$$

Completing the square in μ (using the centering of the input, $\mathbf{1}_N^T\mathbf{\Phi} = \mathbf{0}$) gives

$$\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big) = N\mu^2 - 2\mu N\bar{t} + \big(\mathbf{t} - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mathbf{\Phi}\mathbf{w}\big)$$

so that, after integrating over μ,

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\big(\mathbf{t}_c - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t}_c - \mathbf{\Phi}\mathbf{w}\big)\right), \qquad \mathbf{t}_c = \mathbf{t} - \bar{t}\,\mathbf{1}_N$$

Our model is thus simplified if, instead of t, we use the centered output t_c.

Recall that the MLE estimate for μ is $\hat{\mu} = \bar{t} - \bar{\boldsymbol{\phi}}^T\mathbf{w}$, where $\bar{\boldsymbol{\phi}}$ is formed by averaging each column of Φ (and vanishes here because of the centering).
A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form

$$y(\mathbf{x}\,|\,\mathbf{w}) = w_0 + \mathbf{w}^T\mathbf{x}$$

In the context, e.g., of MLE, we need to minimize

$$\min_{w_0,\mathbf{w}} \sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big)^2$$

Minimization with respect to w0 gives

$$\sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big) = 0 \;\Rightarrow\; N\bar{t} - N w_0 - N\,\mathbf{w}^T\bar{\mathbf{x}} = 0 \;\Rightarrow\; w_0 = \bar{t} - \mathbf{w}^T\bar{\mathbf{x}}$$

where

$$\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i, \qquad \bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \frac{1}{N}\Big(\sum_i x_{i1},\ \ldots,\ \sum_i x_{iD}\Big)^T$$

Thus, once w is known, the MLE estimate of w0 follows immediately from the data means.
A Note on Data Centering

Substituting the bias term into our objective function gives

$$\min_{\mathbf{w}} \sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big)^2 = \min_{\mathbf{w}} \sum_{i=1}^{N}\big(t_i - \bar{t} - \mathbf{w}^T(\mathbf{x}_i - \bar{\mathbf{x}})\big)^2$$

Minimization with respect to w gives

$$\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T\,\mathbf{w} = \sum_{i=1}^{N}(t_i - \bar{t})(\mathbf{x}_i - \bar{\mathbf{x}})$$

We thus first compute the MLE of w using the centered input and output:

$$\hat{\mathbf{w}} = \big(\mathbf{X}_c^T\mathbf{X}_c\big)^{-1}\mathbf{X}_c^T\mathbf{t}_c = \left(\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T\right)^{-1}\sum_{i=1}^{N}(t_i - \bar{t})(\mathbf{x}_i - \bar{\mathbf{x}})$$

where $\mathbf{X}_c$ has rows $(\mathbf{x}_i - \bar{\mathbf{x}})^T$ and $\mathbf{t}_c = \mathbf{t} - \bar{t}\,\mathbf{1}_N$. We can then estimate the MLE of w0 as

$$\hat{w}_0 = \bar{t} - \hat{\mathbf{w}}^T\bar{\mathbf{x}}$$

A short numerical check of these formulas is sketched below.
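A minimal MatLab-style sketch verifying the centering route against ordinary least squares with an explicit offset column; the synthetic data and all names are assumptions for illustration.

% Sketch: the centring formulas reproduce the LS fit with an offset column.
N = 50; D = 3;
X = randn(N, D);  wTrue = [1.5; -2; 0.5];
t = 4 + X*wTrue + 0.1*randn(N, 1);               % synthetic data with offset 4
% (a) direct LS with a column of ones
wFull = [ones(N,1), X] \ t;                      % returns [w0; w]
% (b) centring route from the slides
xbar = mean(X, 1);  tbar = mean(t);
Xc = X - repmat(xbar, N, 1);  tc = t - tbar;
w  = (Xc'*Xc) \ (Xc'*tc);
w0 = tbar - xbar*w;
disp(max(abs(wFull - [w0; w])))                  % should be at machine precision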
Bayesian Regression: Example
We generate synthetic data from the function f(x, a) = a0 + a1 x with parameter values a0 = -0.3 and a1 = 0.5, by first choosing values of xn from the uniform distribution U(x|-1, 1), then evaluating f(xn, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values tn.

We assume β = (1/0.2)² = 25 and α = 2.0.

We perform Bayesian inference sequentially, one point at a time, so the posterior at each step becomes the new prior.

We show results after 1, 2 and 22 points have been collected. The results include the likelihood contours (for one point), the posterior, and samples of the regression function from the posterior. A MatLab-style sketch of the sequential updates follows.
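The following is a minimal sketch of the sequential updates for this example (not the course's linked MatLab code); the data generation follows the description above and the variable names are illustrative.

% Sketch: sequential Bayesian updates for the straight-line example.
a = [-0.3; 0.5];  alpha = 2;  beta = 25;  Ndata = 22;
x = 2*rand(Ndata, 1) - 1;                        % x_n ~ U(-1, 1)
t = a(1) + a(2)*x + 0.2*randn(Ndata, 1);         % noisy targets
m = [0; 0];  S = (1/alpha)*eye(2);               % prior N(0, alpha^{-1} I) on w = (w0, w1)
for n = 1:Ndata                                  % one point at a time
    phi     = [1; x(n)];                         % basis vector (1, x_n)
    SinvOld = inv(S);
    Sinv    = SinvOld + beta*(phi*phi');         % precision update
    S       = inv(Sinv);
    m       = S*(SinvOld*m + beta*phi*t(n));     % the posterior becomes the new prior
end
% m and S now hold the posterior mean and covariance after 22 points.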
Bayesian Regression: Example

[Figure: prior, no data yet. Left: contours of the prior over (w0, w1). Right: y(x, w) using samples of w from the prior. (MatLab Code)]
Example: One Data Point Collected

[Figure: left, likelihood contours for the first data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that the regression lines pass close to the data point (shown with a circle).
Example: 2nd Data Point Collected

[Figure: left, likelihood contours for the second data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that the regression lines now pass close to both data points.
Example: 22 Data Points Collected

[Figure: left, likelihood contours for the latest data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that, after 22 data points have been collected, the posterior is sharply peaked and the sampled regression lines cluster tightly around the data.
Summary of Results

[Figure: grid of panels summarizing the sequential inference. Rows correspond to no data, 1 point, 2 points, and many points; columns show the likelihood for the latest point, the prior/posterior over (w0, w1), and the data space with sampled regression functions. (MatLab Code)]
Summary of Results

[Figure: the same summary in (W0, W1) space after 20 data points: likelihood, prior/posterior, and data-space panels for successive updates. Run bayesLinRegDemo2d from PMTK3.]
Predictive Distribution
In practice, we are not interested in w itself but in making predictions of t for new values of x. This requires that we evaluate the predictive distribution

$$p(t\,|\,x,\mathbf{t},\alpha,\beta) = \int p(t\,|\,x,\mathbf{w},\beta)\,p(\mathbf{w}\,|\,\mathbf{t},\alpha,\beta)\,d\mathbf{w} = \mathcal{N}\big(t\,|\,\mathbf{m}_N^T\boldsymbol{\phi}(x),\,\sigma_N^2(x)\big)$$

where

$$\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation, the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution: given the training data x and t and a new test point x, we want

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \int p(t\,|\,x,\mathbf{w})\,p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t})\,d\mathbf{w}$$

where

$$p(t\,|\,x,\mathbf{w}) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}),\,\beta^{-1}\big)$$

and

$$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t}) \propto \exp\left(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N)$$

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\,t_n, \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I}_M + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T$$

To compute the needed marginal, we use a result from an earlier lecture.
Appendix: Useful Result

For the linear Gaussian model above, we proved in these notes the following very useful results about marginal and conditional Gaussian models. Given

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}), \qquad p(\mathbf{y}\,|\,\mathbf{x}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$$

we have

$$p(\mathbf{y}) = \mathcal{N}\big(\mathbf{y}\,|\,\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\big)$$

$$p(\mathbf{x}\,|\,\mathbf{y}) = \mathcal{N}\big(\mathbf{x}\,|\,\boldsymbol{\Sigma}\{\mathbf{A}^T\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\},\; \boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = \big(\boldsymbol{\Lambda} + \mathbf{A}^T\mathbf{L}\mathbf{A}\big)^{-1}$$
Predictive Distribution

Thus, for our problem, the correspondence with the result above is

$$\mathbf{x} \to \mathbf{w}, \quad \boldsymbol{\mu} \to \mathbf{m}_N, \quad \boldsymbol{\Lambda}^{-1} \to \mathbf{S}_N, \quad \mathbf{y} \to t, \quad \mathbf{A} \to \boldsymbol{\phi}(x)^T, \quad \mathbf{b} \to 0, \quad \mathbf{L}^{-1} \to \beta^{-1}$$

with

$$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \qquad p(t\,|\,x,\mathbf{w},\beta) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}), \beta^{-1}\big)$$

The predictive distribution now takes the form

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \mathcal{N}\big(t\,|\,\boldsymbol{\phi}(x)^T\mathbf{m}_N,\; \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\big)$$
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution: given the training data x and t and a new test point x,

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \int p(t\,|\,x,\mathbf{w})\,p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t})\,d\mathbf{w} = \mathcal{N}\big(t\,|\,m(x),\,s^2(x)\big), \qquad p(t\,|\,x,\mathbf{w}) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}), \beta^{-1}\big)$$

where the mean and variance are given by

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\,t_n = \boldsymbol{\phi}(x)^T\mathbf{m}_N$$

$$s^2(x) = \underbrace{\beta^{-1}}_{\text{uncertainty in the data}} + \underbrace{\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)}_{\text{uncertainty in } \mathbf{w}}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

Note that

$$s_{N+1}^2(x) \le s_N^2(x),$$

i.e. the predictive variance shrinks as more data are observed.
Predictive Distribution

It is easy to show that

$$s_{N+1}^2(x) \le s_N^2(x)$$

Note that

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T$$

and recall the (Sherman-Morrison) identity

$$\big(\mathbf{M} + \mathbf{v}\mathbf{v}^T\big)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{v}\mathbf{v}^T\mathbf{M}^{-1}}{1 + \mathbf{v}^T\mathbf{M}^{-1}\mathbf{v}}$$

Using these results, we can write

$$s_{N+1}^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\big(\mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T\big)^{-1}\boldsymbol{\phi}(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) - \frac{\beta\,\big(\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})\big)^2}{1 + \beta\,\boldsymbol{\phi}(x_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})} \le s_N^2(x)$$
Predictive Distribution: Summary

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \mathcal{N}\big(t\,|\,m(x),\,s^2(x)\big)$$

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\,t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

Note: the predictive mean and variance are functions of x.

The notation used here is as follows. For polynomial regression,

$$\boldsymbol{\phi}(x_n) = \big(1,\, x_n,\, x_n^2,\, \ldots,\, x_n^M\big)^T$$

and in general

$$\boldsymbol{\phi}(\mathbf{x}_n) = \big(\phi_0(\mathbf{x}_n),\, \phi_1(\mathbf{x}_n),\, \phi_2(\mathbf{x}_n),\, \ldots,\, \phi_M(\mathbf{x}_n)\big)^T,$$

with I the unit matrix of matching size. A sketch of computing the predictive mean and variance follows.
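A minimal MatLab-style sketch of the predictive mean and variance with Gaussian basis functions, matching the setup of the following figures (α = 5·10⁻³, β = 11.1, 9 Gaussians plus a bias term); the training data, basis width, and names are assumptions.

% Sketch: Bayesian predictive mean and pointwise variance, Gaussian basis.
alpha = 5e-3;  beta = 11.1;  s = 0.1;
mu = linspace(0, 1, 9);                                   % 9 Gaussian basis centres (assumed)
phi = @(x) [ones(numel(x),1), exp(-bsxfun(@minus, x(:), mu).^2/(2*s^2))];
x = rand(10, 1);  t = sin(2*pi*x) + 0.3*randn(10, 1);     % assumed training data
Phi  = phi(x);   M = size(Phi, 2);                        % M = 10 parameters
Sinv = alpha*eye(M) + beta*(Phi'*Phi);                    % S_N^{-1}
SN   = inv(Sinv);
mN   = beta*(SN*(Phi'*t));                                % posterior mean m_N
xs   = linspace(0, 1, 200)';  Ps = phi(xs);
predMean = Ps*mN;                                         % m(x) on a grid
predVar  = 1/beta + sum((Ps*SN).*Ps, 2);                  % s^2(x) pointwise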
Pointwise Uncertainty in the Predictions

The model uses M = 9 Gaussian basis functions (10 parameters), with the scale of the Gaussians adjusted to the data; α = 5·10⁻³ and β = 11.1. We fit using N = 1, 2, 4 and 10 data points.

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N.

[Figure: predictive distribution for M = 9 with N = 1 (left) and N = 2 (right): generating function sin(2πx), the random data points, the predictive mean, and the pointwise predictive standard deviation. (MatLab code)]
Pointwise Uncertainty in the Predictions

[Figure: predictive distribution for M = 9 with N = 4 (left) and N = 10 (right): generating function sin(2πx), the random data points, the predictive mean, and the pointwise predictive standard deviation. (MatLab code)]
Summary of Results

[Figure: the four predictive-distribution panels (M = 9; N = 1, 2, 4, 10) collected together: generating function sin(2πx), data points, predictive mean, and predictive standard deviation. (MatLab code)]
Plugin Approximation

In the plugin approximation, the posterior over w is replaced by a point mass at an estimate ŵ (e.g. the MLE):

$$p(t\,|\,\mathbf{x},\mathcal{D}) = \int p(t\,|\,\mathbf{x},\mathbf{w})\,\delta_{\hat{\mathbf{w}}}(\mathbf{w})\,d\mathbf{w} = p(t\,|\,\mathbf{x},\hat{\mathbf{w}})$$

[Figure: plugin approximation (MLE) vs the posterior predictive with known variance, and functions sampled from the plugin approximation to the posterior vs functions sampled from the posterior. Run linregPostPredDemo from PMTK3.]
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 1 (left) and N = 2 (right). (MatLab Code)]
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 4 (left) and N = 12 (right). (MatLab Code)]
Summary of Results

[Figure: the four sampled-function panels collected together. (MatLab Code)]
Gaussian Basis vs. Gaussian Process
If we use localized basis functions such as Gaussians, then in
regions away from the basis function support, the contribution
from the second term in the predictive variance will go to zero,
leaving only the noise contribution β−1.
The model becomes very confident in its predictions when
extrapolating outside the region occupied by the basis functions.
This is an undesirable behavior.
This problem can be avoided by adopting an alternative
Bayesian approach to regression (Gaussian processes).
$$\sigma_N^2(\mathbf{x}) = \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of the basis functions}$$
Bayesian Inference when σ² is Unknown
Let us extend the previous results for linear regression, assuming now that σ² is unknown.

Assume a likelihood of the form*

$$p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w},\sigma^2) = \mathcal{N}\big(\mathbf{y}\,|\,\mathbf{X}\mathbf{w},\,\sigma^2\mathbf{I}_N\big) \propto (\sigma^2)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})\right)$$

A conjugate prior has the normal-inverse-Gamma (NIG) form

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_0,\mathbf{V}_0,a_0,b_0) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_0,\sigma^2\mathbf{V}_0)\,\mathrm{InvGamma}(\sigma^2\,|\,a_0,b_0)$$
$$= \frac{b_0^{a_0}}{(2\pi)^{D/2}|\mathbf{V}_0|^{1/2}\,\Gamma(a_0)}\,(\sigma^2)^{-(a_0 + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T\mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + 2b_0}{2\sigma^2}\right)$$

The posterior is now derived as

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) \propto (\sigma^2)^{-(a_0 + (D+N)/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T\mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + 2b_0 + (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})}{2\sigma^2}\right)$$

*In the remainder of this lecture, the response is denoted as y and the dimensionality of w is taken as D.
Bayesian Inference when σ² is Unknown
Let us define the following:

$$\mathbf{V}_N = \big(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\big)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\big(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\big)$$

$$a_N = a_0 + N/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\big)$$

With these definitions, simple algebra shows that the posterior is again NIG:

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_N,\mathbf{V}_N,a_N,b_N) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_N,\sigma^2\mathbf{V}_N)\,\mathrm{InvGamma}(\sigma^2\,|\,a_N,b_N)$$

The posterior marginals can now be derived explicitly:

$$p(\sigma^2\,|\,\mathcal{D}) = \mathrm{InvGamma}(\sigma^2\,|\,a_N, b_N)$$

$$p(\mathbf{w}\,|\,\mathcal{D}) = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\Big)$$

i.e. the marginal over w is a multivariate Student-t distribution. A sketch of these updates follows.
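A minimal MatLab-style sketch of the NIG posterior updates above; the function name and interface are illustrative.

% Sketch: posterior NIG parameters for linear regression with unknown sigma^2.
function [wN, VN, aN, bN] = nigPosterior(X, y, w0, V0, a0, b0)
    N  = size(X, 1);
    VN = inv(inv(V0) + X'*X);
    wN = VN*(V0\w0 + X'*y);
    aN = a0 + N/2;
    bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));
end
% Posterior marginals: sigma^2 | D ~ InvGamma(aN, bN),
% w | D ~ Student-t with mean wN, scale (bN/aN)*VN and 2*aN degrees of freedom.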
Posterior Marginals
The marginal posterior over w can be derived directly by integrating out σ²:

$$p(\mathbf{w}\,|\,\mathcal{D}) = \int_0^{\infty} p(\mathbf{w},\sigma^2\,|\,\mathcal{D})\,d\sigma^2 \propto \int_0^{\infty}(\sigma^2)^{-(a_N + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right)d\sigma^2$$

$$\propto \left[1 + \frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N)}{2 b_N}\right]^{-(a_N + D/2)} = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\Big)$$

To compute the integral above, simply set λ = σ⁻², dσ² = -λ⁻²dλ, and use the normalizing factor of the Gamma distribution, $\int_0^{\infty}\lambda^{a-1}e^{-b\lambda}\,d\lambda = \Gamma(a)\,b^{-a}$.
Posterior Predictive Distribution
Consider the posterior predictive for m new test inputs, collected in $\tilde{\mathbf{X}}$:

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \int\!\!\int (\sigma^2)^{-m/2}\exp\left(-\frac{(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})}{2\sigma^2}\right)(\sigma^2)^{-(a_N + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right)d\mathbf{w}\,d\sigma^2$$

As a first step, integrate over w by completing the square:

$$(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}) + (\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) = (\mathbf{w} - \mathbf{w}_*)^T\mathbf{V}_*^{-1}(\mathbf{w} - \mathbf{w}_*) + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N - \mathbf{w}_*^T\mathbf{V}_*^{-1}\mathbf{w}_*$$

with $\mathbf{V}_* = (\mathbf{V}_N^{-1} + \tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}$ and $\mathbf{w}_* = \mathbf{V}_*(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N)$; the first (Gaussian) term cancels out from the integration in w.

We next integrate over λ = 1/σ² using the normalization of the Gamma distribution, $\int_0^{\infty}\lambda^{m/2 + a_N - 1}e^{-c\lambda/2}\,d\lambda \propto c^{-(m/2 + a_N)}$, which leaves

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \Big[2b_N + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N - \mathbf{w}_*^T\mathbf{V}_*^{-1}\mathbf{w}_*\Big]^{-(a_N + m/2)}$$

Using the Sherman-Morrison-Woodbury formula (symmetry of V0 is assumed),

$$\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1} = \mathbf{I}_m - \tilde{\mathbf{X}}\big(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\big)^{-1}\tilde{\mathbf{X}}^T,$$

one can show that the bracketed remainder equals $2b_N + (\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)^T\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1}(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)$, so that

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \left[1 + \frac{(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)^T\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1}(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)}{2 b_N}\right]^{-(a_N + m/2)}$$
Bayesian Inference when σ² is Unknown
The posterior predictive is also a Student-t:

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) = \mathcal{T}\Big(\tilde{\mathbf{y}}\,\Big|\,\tilde{\mathbf{X}}\mathbf{w}_N,\ \frac{b_N}{a_N}\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big),\ 2a_N\Big)$$

The predictive variance has two terms: $\frac{b_N}{a_N}\mathbf{I}_m$, due to the measurement noise, and $\frac{b_N}{a_N}\tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T$, due to the uncertainty in w. The second term depends on how close a test input is to the training data. A sketch of this computation follows.
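A minimal MatLab-style sketch returning the parameters of the Student-t posterior predictive; the function name and interface are illustrative.

% Sketch: Student-t posterior predictive parameters for new inputs Xtest,
% given the NIG posterior (wN, VN, aN, bN) computed above.
function [mu, Sigma, dof] = nigPredict(Xtest, wN, VN, aN, bN)
    mu    = Xtest*wN;                                 % predictive mean
    m     = size(Xtest, 1);
    Sigma = (bN/aN)*(eye(m) + Xtest*VN*Xtest');       % scale matrix (noise + w-uncertainty)
    dof   = 2*aN;                                     % degrees of freedom
end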
Zellner's g-Prior
It is common to set a0 = b0 = 0, corresponding to an uninformative prior for σ², and to set w0 = 0 and V0 = g(XᵀX)⁻¹ for some positive value g:

$$p(\mathbf{w},\sigma^2\,|\,g) = \mathrm{NIG}\big(\mathbf{w},\sigma^2\,|\,\mathbf{0},\,g(\mathbf{X}^T\mathbf{X})^{-1},\,0,\,0\big) = \mathcal{N}\big(\mathbf{w}\,|\,\mathbf{0},\,g\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\big)\,\mathrm{InvGamma}(\sigma^2\,|\,0,0)$$

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression; however, the prior covariance is proportional to (XᵀX)⁻¹ rather than I. This ensures that the posterior is invariant to scaling of the inputs.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North Holland.
Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.
Unit Information Prior
We will see below that, if we use an uninformative prior, the posterior precision given N measurements is $\mathbf{V}_N^{-1} = \mathbf{X}^T\mathbf{X}$.

The unit information prior is defined to contain as much information as one sample. To create a unit information prior for linear regression, we therefore use $\mathbf{V}_0^{-1} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$, which is equivalent to the g-prior with g = N.
Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928–934.
Uninformative Prior
An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = 0 and b0 = 0, which gives

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{0},\infty\mathbf{I},0,0) \propto \sigma^{-(D+2)}$$

Alternatively, we can start with the semi-conjugate prior p(w, σ²) = p(w)p(σ²) and take each term to its uninformative limit individually, which gives p(w, σ²) ∝ σ⁻². This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = -D/2 and b0 = 0:

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}\big(\mathbf{w},\sigma^2\,|\,\mathbf{0},\infty\mathbf{I},-\tfrac{D}{2},0\big) \propto \sigma^{-2}$$
Uninformative Prior
Using the uninformative prior p(w, σ²) ∝ σ⁻², the corresponding posterior and marginal posteriors are given by

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_N,\mathbf{V}_N,a_N,b_N)$$

$$p(\mathbf{w}\,|\,\mathcal{D}) = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{s^2}{N-D}\,\mathbf{C},\ N-D\Big)$$

where

$$\mathbf{V}_N = \mathbf{C} = \big(\mathbf{X}^T\mathbf{X}\big)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\mathbf{X}^T\mathbf{y} = \big(\mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{y} = \hat{\mathbf{w}}_{\text{MLE}}$$

$$a_N = a_0 + N/2 = (N-D)/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(\mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\big) = \tfrac{s^2}{2}, \qquad s^2 \equiv (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{\text{MLE}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{\text{MLE}})$$
The Caterpillar Example
The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics:

$$p(w_j\,|\,\mathcal{D}) = \mathcal{T}\Big(w_j\,\Big|\,\hat{w}_j,\ \frac{C_{jj}\,s^2}{N-D},\ N-D\Big)$$

This is equivalent to the sampling distribution of the MLE, which is given by

$$\frac{w_j - \hat{w}_j}{s_j} \sim t_{N-D}, \qquad s_j^2 = \frac{C_{jj}\,s^2}{N-D},$$

where $s_j$ is the standard error of the estimated parameter. Thus the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same in this case. A short sketch of this computation follows the references below.
Rice, J. (1995). Mathematical statistics and data analysis. Duxbury. 2nd edition (page 542)
Casella, G. and R. Berger (2002). Statistical inference. Duxbury. 2nd edition (page 554)
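A minimal MatLab-style sketch of the 95% marginal credible intervals under the uninformative prior, which coincide with the frequentist confidence intervals; tinv requires the Statistics Toolbox, and the design matrix X is assumed to include a column of ones.

% Sketch: 95% credible/confidence intervals under p(w, sigma^2) prop to sigma^{-2}.
function CI = credibleIntervals(X, y)
    [N, D] = size(X);
    wHat = X \ y;                              % MLE = posterior mean w_N
    C    = inv(X'*X);
    v    = sum((y - X*wHat).^2) / (N - D);     % residual variance estimate s^2/(N-D)
    se   = sqrt(diag(C) * v);                  % standard errors s_j
    tq   = tinv(0.975, N - D);                 % Student-t quantile with N-D dof
    CI   = [wHat - tq*se, wHat + tq*se];       % 95% intervals, one row per coefficient
end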
The Caterpillar Example
As a worked example of this, consider the caterpillar
dataset. We can compute the posterior mean and
standard deviation, and the 95% credible intervals (CI) for
the regression coefficients.
coeff mean stddev 95pc CI sig
w0 10.998 3.06027 [ 4.652, 17.345] *
w1 -0.004 0.00156 [ -0.008, -0.001] *
w2 -0.054 0.02190 [ -0.099, -0.008] *
w3 0.068 0.09947 [ -0.138, 0.274]
w4 -1.294 0.56381 [ -2.463, -0.124] *
w5 0.232 0.10438 [ 0.015, 0.448] *
w6 -0.357 1.56646 [ -3.605, 2.892]
w7 -0.237 1.00601 [ -2.324, 1.849]
w8 0.181 0.23672 [ -0.310, 0.672]
w9 -1.285 0.86485 [ -3.079, 0.508]
w10 -0.433 0.73487 [ -1.957, 1.091]
The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods.
Run linregBayesCaterpillar
from PMTK3
Marin, J.-M. and C. Robert (2007). Bayesian Core: a practical approach to
computational Bayesian statistics. Springer.
The Caterpillar Example
We can use these marginal posteriors to check whether a coefficient is significantly different from 0, i.e. whether its 95% CI excludes 0.

The CIs for coefficients 0, 1, 2, 4, 5 are all significant. These results are the same as those produced by a frequentist approach using p-values at the 5% level.
But note that the MLE does not even exist when N <D, so
standard frequentist inference theory breaks down in this
setting. Bayesian inference theory still works using proper
priors.
Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.
Empirical Bayes for Linear Regression
We describe next an empirical Bayes procedure for picking the hyper-parameters in the prior.

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior, p(w) = N(w|0, α⁻¹I).

This is known as the evidence procedure.
MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for
supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603–643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035–1068.
Empirical Bayes for Linear Regression
The evidence procedure provides an alternative to using
cross validation.
In the Figure, the log marginal likelihood is plotted for
different values of α, as well as the maximum value found
by the optimizer.
[Figure: log evidence vs log α, with the maximum found by the optimizer marked. Run linregPolyVsRegDemo from PMTK3.]
Empirical Bayes for Linear Regression
[Figure: left, log evidence vs log α; right, -log p(D|λ) and the 5-fold CV estimate of MSE vs log λ. Run linregPolyVsRegDemo from PMTK3.]

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods).

The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature. A sketch of the evidence computation follows.
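A minimal MatLab-style sketch of the log marginal likelihood (evidence) for the Gaussian-prior linear model and a grid search over α, following the standard linear-Gaussian evidence formula; the function name, grid range, and the assumption that β = 1/σ² is fixed are illustrative.

% Sketch: log evidence ln p(t | alpha, beta) for the linear-Gaussian model.
function logEv = logEvidence(Phi, t, alpha, beta)
    [N, M] = size(Phi);
    A  = alpha*eye(M) + beta*(Phi'*Phi);       % posterior precision
    mN = beta*(A \ (Phi'*t));                  % posterior mean
    E  = beta/2*sum((t - Phi*mN).^2) + alpha/2*(mN'*mN);
    logEv = M/2*log(alpha) + N/2*log(beta) - E ...
            - 0.5*log(det(A)) - N/2*log(2*pi);
end
% Usage: evaluate over a grid of alpha values and pick the maximizer.
% alphas = exp(linspace(-25, 5, 100));
% logEvs = arrayfun(@(a) logEvidence(Phi, t, a, beta), alphas);
% [~, idx] = max(logEvs);  alphaHat = alphas(idx);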
Automatic Relevancy Determination
The evidence procedure can be used to perform feature selection (automatic relevancy determination, or ARD).

The evidence procedure is also useful when comparing different kinds of models:

$$p(\mathcal{D}\,|\,m) = \int\!\!\int p(\mathcal{D}\,|\,\mathbf{w},m)\,p(\mathbf{w}\,|\,\boldsymbol{\eta},m)\,p(\boldsymbol{\eta}\,|\,m)\,d\mathbf{w}\,d\boldsymbol{\eta} \;\approx\; \max_{\boldsymbol{\eta}}\int p(\mathcal{D}\,|\,\mathbf{w},m)\,p(\mathbf{w}\,|\,\boldsymbol{\eta},m)\,d\mathbf{w}$$

It is important to (at least approximately) integrate over η rather than setting it arbitrarily.

Variational Bayes models our uncertainty in η rather than computing point estimates.