Linear Models of Regression: Bias-Variance Decomposition


Page 1: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Linear Models Of Regression:

Bias-Variance Decomposition,

Bayesian Linear Regression

Prof. Nicholas Zabaras

School of Engineering

University of Warwick

Coventry CV4 7AL

United Kingdom

Email: [email protected]

URL: http://www.zabaras.com/

July 30, 2014


Page 2: Linear Models Of Regression: Bias-Variance Decomposition

Contents

The bias-variance decomposition

Ridge regression, shrinkage, the regularization effect of big data

Bayesian linear regression, the parameter posterior distribution, a note on data centering, a numerical example

The predictive distribution, Gaussian processes

Bayesian inference in linear regression when σ² is unknown, Zellner's g-prior, the uninformative (semi-conjugate) prior, the evidence approximation

References: Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7; Chris Bishop's PRML book, Chapter 3; regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3, pmtk3-1nov12/docs/tutorial/html/tutRegr.html).

Page 3: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

MLE (i.e. least squares) leads to severe over-fitting if complex models are trained using data sets of limited size.

Over-fitting occurs whenever the number of basis functions is large (i.e. for complex models) and the training data set is of limited size.

Limiting the number of basis functions limits the flexibility of the model.

Regularization controls over-fitting, but one then needs to determine the regularization parameter λ.

Over-fitting is a property of MLE and does not arise when we marginalize over the parameters in a Bayesian setting.

Before returning to the Bayesian setting, we discuss the bias-variance tradeoff (a frequentist viewpoint of model complexity).

Page 4: Linear Models Of Regression: Bias-Variance Decomposition

Loss Function and the Regression Function

Recall the regression loss function

$$L(t, y(x)) = \{y(x) - t\}^2.$$

The decision problem is to minimize the expected loss:

$$\mathbb{E}[L] = \iint \{y(x) - t\}^2\, p(x, t)\, dx\, dt.$$

The solution of this minimization problem, known as the regression function, is

$$y(x) = \mathbb{E}_t[t \mid x] = \int t\, p(t \mid x)\, dt,$$

i.e. the average of t conditioned on x.

A useful expression was derived in an earlier lecture:

$$\mathbb{E}[L] = \int \{y(x) - \mathbb{E}[t \mid x]\}^2 p(x)\, dx + \iint \{\mathbb{E}[t \mid x] - t\}^2 p(x, t)\, dx\, dt.$$

Page 5: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

We can write the expected squared loss as

$$\mathbb{E}[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt,$$

where h(x) is the conditional expectation, given by

$$h(x) = \mathbb{E}[t \mid x] = \int t\, p(t \mid x)\, dt.$$

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss.

If we model h(x) using a parametric function y(x, w) governed by w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w.

Page 6: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

A frequentist treatment involves making a point estimate of w based on the data set D, and interprets the uncertainty of this estimate as follows.

Suppose we had a large number of data sets, each of size N and each drawn independently from p(x, t).

For any given data set D, we run our learning algorithm and obtain a prediction function y(x; D). Different data sets from the ensemble give different functions and different values of the squared loss.

The performance of the learning algorithm is then assessed by taking the average over this ensemble of data sets.

$$\mathbb{E}[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

Page 7: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Consider the integrand of the first term for a single input x, and add and subtract the ensemble average $\mathbb{E}_D[y(x; D)]$:

$$\{y(x; D) - h(x)\}^2 = \{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2 + \{\mathbb{E}_D[y(x; D)] - h(x)\}^2 + 2\{y(x; D) - \mathbb{E}_D[y(x; D)]\}\{\mathbb{E}_D[y(x; D)] - h(x)\}.$$

Here $\mathbb{E}_D[y(x; D)]$ is the average of our prediction function over all data sets.

Take the expectation of this expression with respect to D and note that the final (cross) term vanishes, giving

$$\mathbb{E}_D\big[\{y(x; D) - h(x)\}^2\big] = \underbrace{\{\mathbb{E}_D[y(x; D)] - h(x)\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\big[\{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2\big]}_{\text{variance}}.$$

Recall that h(x) is the desired regression function.

Page 8: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

So far we have considered a single input value x. If we substitute this expansion back into the expected squared loss function shown above, we obtain the following decomposition of the expected squared loss:

Expected loss = (bias)² + variance + noise,

where

$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(x; D)] - h(x)\}^2 p(x)\, dx,$$

$$\text{variance} = \int \mathbb{E}_D\big[\{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2\big]\, p(x)\, dx,$$

$$\text{noise} = \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt.$$

Page 9: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

Expected loss = (bias)² + variance + noise, with the (bias)², variance, and noise terms defined as on the previous slide.

There is a tradeoff between bias and variance:

flexible models have low bias and high variance;

rigid models have high bias and low variance.

The bias-variance decomposition provides insight into model complexity, but it is of limited practical use since several data sets D are needed.

Page 10: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

[Figure: left, 20 of the 100 fits for ln λ = 2.6; right, the average of the 100 fits compared with the sinusoidal function from which the data were generated.]

Dependence of bias and variance on model complexity, governed by a regularization parameter λ, using the sinusoidal data set. On the left only 20 of the 100 fits are shown.

L = 100 data sets, each with N = 25 data points; 24 Gaussian basis functions, number of parameters M = 25. (MatLab code)

Large λ: high bias, low variance.

Page 11: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

[Figure: left, 20 of the 100 fits for ln λ = −0.31; right, the average of the 100 fits compared with the sinusoidal generating function.]

Dependence of bias and variance on model complexity, governed by a regularization parameter λ, using the sinusoidal data set. On the left only 20 of the 100 fits are shown.

L = 100 data sets, each with N = 25 data points; 24 Gaussian basis functions, number of parameters M = 25. (MatLab code)

Page 12: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

[Figure: left, 20 of the 100 fits for ln λ = −2.4, the lowest λ (low bias, large variance); right, the average of the 100 fits compared with the sinusoidal generating function.]

Note that averaging many solutions of the complex model (M = 25) obtained from different data sets gives a very good fit.

Weighted averaging is also done in the Bayesian approach, but there it is with respect to the posterior distribution of the parameters!

L = 100 data sets, each with N = 25 data points; 24 Gaussian basis functions, number of parameters M = 25. (MatLab code)

Page 13: Linear Models Of Regression: Bias-Variance Decomposition

Trade-off Quantities

The average prediction over the L data sets is estimated from

$$\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x).$$

The integrated squared bias and integrated variance, approximating $(\text{bias})^2 = \int \{\mathbb{E}_D[y(x; D)] - h(x)\}^2 p(x)\, dx$ and $\text{variance} = \int \mathbb{E}_D[\{y(x; D) - \mathbb{E}_D[y(x; D)]\}^2]\, p(x)\, dx$, are then given by

$$(\text{bias})^2 = \frac{1}{N}\sum_{n=1}^{N} \{\bar{y}(x_n) - h(x_n)\}^2,$$

$$\text{variance} = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{L}\sum_{l=1}^{L} \{y^{(l)}(x_n) - \bar{y}(x_n)\}^2.$$
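To make the trade-off quantities concrete, here is a minimal MATLAB sketch of the experiment described above (L = 100 sinusoidal data sets, 24 Gaussian basis functions plus a bias term, a regularized LS fit at a fixed λ). The basis width, noise level, and evaluation grid are illustrative assumptions; this is not the lecture's MatLab code, and it averages over a test grid rather than the training inputs.

```matlab
% Empirical estimate of integrated (bias)^2 and variance for a ridge fit.
L = 100; N = 25; M = 25; lambda = exp(-0.31); s = 0.1;     % s = basis width (assumed)
mu = linspace(0, 1, M-1);                                  % centres of the Gaussian bases
Phi = @(x) [ones(numel(x),1), exp(-(x(:) - mu).^2/(2*s^2))];
xg = linspace(0, 1, 200)';  h = sin(2*pi*xg);              % evaluation grid and true function
Y = zeros(numel(xg), L);
for l = 1:L
    x = rand(N,1);  t = sin(2*pi*x) + 0.3*randn(N,1);      % one training set
    P = Phi(x);
    w = (lambda*eye(M) + P'*P) \ (P'*t);                   % regularized least squares
    Y(:,l) = Phi(xg)*w;                                    % prediction on the grid
end
ybar = mean(Y, 2);
bias2    = mean((ybar - h).^2);                            % integrated squared bias
variance = mean(mean((Y - ybar).^2, 2));                   % integrated variance
fprintf('bias^2 = %.4f, variance = %.4f\n', bias2, variance);
```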

Page 14: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

Note that a small regularization parameter λ allows the model to be finely tuned to the noise in each individual data set, leading to small bias but large variance.

Conversely, a large λ forces all the weights towards zero, leading to large bias but small variance.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 plotted against ln λ; complex models (small λ) on the left, simple models (large λ) on the right.] (MatLab code)

λ      Model complexity   Variance   Bias
low    Complex            High       Low
high   Simple             Low        High

Page 15: Linear Models Of Regression: Bias-Variance Decomposition

The Bias-Variance Decomposition

Plot of the squared bias and variance together with their sum. Also shown is the test-set rms squared error for a test data set of 1000 points.

The minimum value of (bias)² + variance occurs around ln λ = −0.31, which is close to the value that gives the minimum error on the test data.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 versus ln λ.] (MatLab code)

Page 16: Linear Models Of Regression: Bias-Variance Decomposition

Ridge Regression

MLE overfits because it picks the parameter values that are best for modeling the training data.

If the data is noisy, such parameters result in complex functions.

Consider fitting a degree-14 polynomial to N = 21 data points using LS. The resulting curve is wiggly, and the coefficients w_i (excluding w0) take large values in order to interpolate the data almost perfectly. If we changed the data a little, the coefficients would change a lot.

[Figure: degree-14 least-squares polynomial fit to the N = 21 data points.]

Run linregPolyVsRegDemo from PMTK3.

Page 17: Linear Models Of Regression: Bias-Variance Decomposition

Ridge Regression

Degree-14 polynomial fit to N = 21 data points with increasing amounts of L2 regularization. The data were generated with noise of variance σ² = 4.

The error bars, representing the noise σ², get wider as the fit gets smoother, since we are ascribing more of the data variation to the noise.

[Figures: fits for ln λ = −20.135 and ln λ = −8.571.]

Run linregPolyVsRegDemo from PMTK3.

$$\min_{w}\; \sum_{n=1}^{N}\big(t_n - (w_0 + w^T x_n)\big)^2 + \lambda\, w^T w$$

Page 18: Linear Models Of Regression: Bias-Variance Decomposition

Ridge Regression

For a Bayesian perspective on regularized least squares, consider MAP estimation with a zero-mean Gaussian prior:

$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I), \qquad
\ln p(w \mid t) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - w_0 - w^T x_n\big)^2 - \frac{\alpha}{2} w^T w + \text{const}.$$

The MAP estimation problem is therefore the same as

$$\min_{w}\; \sum_{n=1}^{N}\big(t_n - (w_0 + w^T x_n)\big)^2 + \lambda\, w^T w, \qquad \lambda = \alpha/\beta.$$

Here λ is the complexity penalty. Ridge regression (penalized least squares) leads to the estimate

$$\hat{w}_{\text{ridge}} = \big(\lambda I_M + X^T X\big)^{-1} X^T t.$$

w0 is not regularized, as it does not affect the model complexity. Regularization ensures the function is simple (e.g. w = 0 corresponds to a straight line). Increasing λ results in smoother functions and smaller w_i.

Page 19: Linear Models Of Regression: Bias-Variance Decomposition

Ridge Regression

We continue with the degree-14 polynomial fit by ridge regression, plotted vs. log(λ) (N = 21, σ² = 4).

On the left, notice that as λ increases, the training error increases. The test error has the classical U shape. Cross-validation can be used to select an optimal λ.

On the right, we estimate performance using the training set alone, via 5-fold cross-validation (estimate of future MSE) and the negative log marginal likelihood −log p(D|λ); the plots are vertically rescaled to [0, 1].

[Figures: left, training and test MSE vs. log λ (complex models on the left, simple models on the right); right, CV estimate of MSE and −log p(D|λ) vs. log λ.]

Run linregPolyVsRegDemo from PMTK3.

Page 20: Linear Models Of Regression: Bias-Variance Decomposition

Numerically Stable Ridge Estimate

To avoid inverting $(X^T X + \lambda I_M)$ directly, we augment the original data with virtual data coming from the prior:

$$\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_M \end{pmatrix}, \qquad \tilde{t} = \begin{pmatrix} t \\ 0_M \end{pmatrix}.$$

Note that with these definitions the penalized LS problem looks like an ordinary LS problem:

$$\min_{w}\; (\tilde{t} - \tilde{X} w)^T(\tilde{t} - \tilde{X} w).$$

This is clear since

$$(\tilde{t} - \tilde{X} w)^T(\tilde{t} - \tilde{X} w) = (t - X w)^T(t - X w) + \lambda\, w^T w.$$

The ridge estimate is $\hat{w}_{\text{ridge}} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{t}$. Consider a QR decomposition $\tilde{X} = QR$. Then

$$\hat{w}_{\text{ridge}} = (R^T Q^T Q R)^{-1} R^T Q^T \tilde{t} = (R^T R)^{-1} R^T Q^T \tilde{t} = R^{-1} Q^T \tilde{t},$$

so we only invert R, which is upper triangular (MatLab implements this as w = X̃\t̃). Cost: O(ND²).
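A minimal MATLAB sketch of the data-augmentation trick, on synthetic data (illustrative only): the backslash solve of the augmented system, which uses a QR factorization internally, should agree with the textbook ridge formula.

```matlab
% Augmented-data ridge estimate vs. the direct formula.
N = 50; M = 8; lambda = 0.1;
X = randn(N, M);  t = randn(N, 1);               % synthetic design matrix and targets
w_direct = (lambda*eye(M) + X'*X) \ (X'*t);      % textbook ridge formula
Xtil = [X; sqrt(lambda)*eye(M)];                 % augment with "virtual" prior data
ttil = [t; zeros(M, 1)];
w_qr = Xtil \ ttil;                              % backslash solves via QR
fprintf('max difference: %g\n', max(abs(w_direct - w_qr)));
```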

Page 21: Linear Models Of Regression: Bias-Variance Decomposition

Numerically Stable Ridge Estimate

Consider now the case where D ≫ N. We first perform and truncate an SVD decomposition:

$$X = U S V^T, \qquad U^T U = I_N, \quad V^T V = I_N, \quad S \text{ diagonal, } N \times N.$$

Defining Z = US, an N × N matrix, we can write the ridge estimate as

$$\hat{w}_{\text{ridge}} = V\big(Z^T Z + \lambda I_N\big)^{-1} Z^T t.$$

In essence, we replace the D-dimensional x_i with the N-dimensional z_i and perform the penalized fit as before. We then transform the N-dimensional solution to the D-dimensional solution by multiplying by V.

Geometrically, we are rotating to a new coordinate system in which all but the first N coordinates are zero. This does not affect the solution since the spherical Gaussian prior is rotationally invariant. The overall time is now O(DN²) operations.
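A short MATLAB sketch of the SVD route for D ≫ N (synthetic sizes, illustrative only); it checks that the O(DN²) estimate matches the naive O(D³) computation.

```matlab
% SVD trick for ridge regression when D >> N.
N = 20; D = 500; lambda = 0.1;
X = randn(N, D);  t = randn(N, 1);
[U, S, V] = svd(X, 'econ');                     % economy SVD: U NxN, S NxN, V DxN
Z = U*S;                                        % N x N reduced design matrix
w_svd = V * ((Z'*Z + lambda*eye(N)) \ (Z'*t));  % O(D N^2) route
w_dir = (X'*X + lambda*eye(D)) \ (X'*t);        % naive O(D^3) route, for comparison
fprintf('max difference: %g\n', max(abs(w_svd - w_dir)));
```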

Page 22: Linear Models Of Regression: Bias-Variance Decomposition

Connections with PCA

Using the singular values of X, the ridge predictions on the training set can be written as

$$\hat{y} = X\hat{w}_{\text{ridge}} = U S V^T V (S^2 + \lambda I)^{-1} S U^T t = U S (S^2 + \lambda I)^{-1} S U^T t = \sum_{j=1}^{M} u_j\, \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\, u_j^T t,$$

where the σ_j are the singular values of X.

Recall that ordinary least squares gives

$$\hat{y} = X\hat{w}_{\text{LS}} = \sum_{j=1}^{M} u_j\, u_j^T t.$$

If $\sigma_j^2 \ll \lambda$, then direction u_j has little effect on the prediction. This motivates the definition of the effective number of degrees of freedom:

$$\text{dof}(\lambda) = \sum_{j=1}^{M} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}, \qquad 0 \le \text{dof}(\lambda) \le M, \quad \text{dof}(0) = M, \quad \text{dof}(\lambda \to \infty) = 0.$$

We will see later in this lecture that, for a uniform prior on w, the posterior covariance is

$$\text{cov}[w \mid D] = \sigma^2 (X^T X)^{-1}.$$
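A tiny MATLAB sketch of the effective degrees of freedom as a function of λ, computed from the singular values of a synthetic design matrix (illustrative only).

```matlab
% Effective degrees of freedom of ridge regression.
X = randn(50, 8);
s2 = svd(X).^2;                              % squared singular values sigma_j^2
dof = @(lambda) sum(s2 ./ (s2 + lambda));    % dof(lambda) = sum sigma_j^2/(sigma_j^2+lambda)
fprintf('dof(0) = %.1f, dof(10) = %.2f, dof(1e6) = %.4f\n', dof(0), dof(10), dof(1e6));
```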

Page 23: Linear Models Of Regression: Bias-Variance Decomposition

Shrinkage

Thus the directions in which we are most uncertain about w are determined by the eigenvectors of $X^T X$ with the smallest eigenvalues; the corresponding eigenvalues of the posterior covariance $\text{cov}[w \mid D] = \sigma^2 (X^T X)^{-1}$ are proportional to $1/\sigma_j^2$. Hence small σ_j correspond to directions with high posterior variance, and it is these directions which ridge regression shrinks the most.

[Figure: contours of the likelihood and of the spherical prior centered at the origin, together with the MLE, the prior mean, and the posterior mean.]

w1 is not well determined by the data (it has high posterior variance), but w2 is well determined. Ill-determined parameters are reduced in size towards 0: $\hat{w}_1^{\text{MAP}} \approx w_1^{\text{prior}} = 0$, while $\hat{w}_2^{\text{MAP}} \approx \hat{w}_2^{\text{MLE}}$. This is called shrinkage.

Page 24: Linear Models Of Regression: Bias-Variance Decomposition

Regularization Effect of Big Data

Another effective regularizing approach is to use lots of data. More training data in general implies better learning.

The test-set error decreases to a plateau as N increases. This is illustrated by plotting the MSE incurred on the test set by polynomial regression of different degrees vs. N (the learning curve).

The level of the plateau for the test error consists of two terms:

an irreducible component (that all models incur) due to the intrinsic variability of the generating process (the noise floor); and

a component that depends on the discrepancy between the generating process (the "truth") and the model (the structural error).

Page 25: Linear Models Of Regression: Bias-Variance Decomposition

Learning Curves

The truth is a degree-2 polynomial (with noise σ² = 4), and we fit polynomials of degrees 1, 2 and 25.

The structural error for M2 and M25 is zero, as both can capture the true generating process. The structural error for M1 is substantial: its plateau occurs high above the noise floor.

[Figures: training and test MSE vs. the size of the training set, for truth = degree 2 and models of degree 1, 2, 10 and 25.]

Run linregPolyVsN from PMTK3.

Page 26: Linear Models Of Regression: Bias-Variance Decomposition

Test Error and Simple Models

For any model that is expressive enough to capture the truth (i.e. M2, M10, M25), the test error goes to the noise floor as N → ∞.

However, the test error approaches this plateau faster for simpler models (there are fewer parameters to estimate).

[Figures: training and test MSE vs. the size of the training set, for truth = degree 2 and models of degree 1, 2, 10 and 25.]

Run linregPolyVsN from PMTK3.

Page 27: Linear Models Of Regression: Bias-Variance Decomposition

Approximation Error

For finite N, there is a discrepancy between the parameters that we estimate and the best parameters that we could estimate given the particular model class.

This approximation error goes to zero as N → ∞, but it goes to zero faster for simpler models.

[Figures: training and test MSE vs. the size of the training set, for truth = degree 2 and models of degree 1, 2, 10 and 25.]

Run linregPolyVsN from PMTK3.

Page 28: Linear Models Of Regression: Bias-Variance Decomposition

Multi-task Learning

In domains with lots of data, simple methods work very well. More often, however, we have little data.

E.g., in web search there is a lot of data overall, but personalizing the results leaves only a few data points per user.

In general, in multi-task learning we borrow statistical strength from tasks with lots of data and share it with tasks with little data.

Halevy, A., P. Norvig, and F. Pereira (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8–12.

Page 29: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by λ, though still also by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive (see the earlier example on choosing the optimal regularization using test data sets).

Page 30: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Linear Regression

A Bayesian treatment of linear regression avoids the over-fitting of maximum likelihood.

Bayesian approaches lead to automatic methods of determining model complexity using the training data alone.

Page 31: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Linear Regression

Assume additive Gaussian noise with known precision β. The likelihood function p(t|w) is the exponential of a quadratic function of w,

$$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid w^T\phi(x_n), \beta^{-1}\big) = \left(\frac{\beta}{2\pi}\right)^{N/2} \exp\left\{-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - w^T\phi(x_n)\big)^2\right\},$$

and its conjugate prior is Gaussian:

$$p(w) = \mathcal{N}(w \mid m_0, S_0).$$

Combining this with the likelihood, and using the results for marginal and conditional Gaussian distributions, gives the posterior

$$p(w \mid t) = \mathcal{N}(w \mid m_N, S_N),$$

where

$$m_N = S_N\big(S_0^{-1} m_0 + \beta\, \Phi^T t\big), \qquad S_N^{-1} = S_0^{-1} + \beta\, \Phi^T \Phi.$$

Page 32: Linear Models Of Regression: Bias-Variance Decomposition

Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed by completing the square in w:

$$p(w \mid m_0, S_0) \propto \exp\left\{-\tfrac{1}{2}(w - m_0)^T S_0^{-1}(w - m_0)\right\},$$

$$p(t \mid w, X, \beta) \propto \exp\left\{-\tfrac{\beta}{2}\sum_{n=1}^{N}\big(t_n - w^T\phi(x_n)\big)^2\right\},$$

$$p(w \mid t, X, \beta) \propto \exp\left\{-\tfrac{1}{2}(w - m_0)^T S_0^{-1}(w - m_0) - \tfrac{\beta}{2}\sum_{n=1}^{N}\big(t_n - w^T\phi(x_n)\big)^2\right\}.$$

Collecting the terms that are quadratic and linear in w (completing the square) gives

$$p(w \mid t, X, \beta) = \mathcal{N}(w \mid m_N, S_N), \qquad
m_N = S_N\big(S_0^{-1} m_0 + \beta\, \Phi^T t\big), \qquad
S_N^{-1} = S_0^{-1} + \beta\, \Phi^T\Phi = S_0^{-1} + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.$$

Page 33: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Linear Regression

Note that because the posterior distribution is Gaussian, its mode coincides with its mean, so $w_{\text{MAP}} = m_N$, with

$$p(w \mid t) = \mathcal{N}(w \mid m_N, S_N), \qquad m_N = S_N\big(S_0^{-1} m_0 + \beta\, \Phi^T t\big), \qquad S_N^{-1} = S_0^{-1} + \beta\, \Phi^T\Phi.$$

The above expressions for the posterior mean and covariance can also be written for a sequential calculation: having already observed N data points, we now consider an additional data point (x_{N+1}, t_{N+1}). In this case,

$$p(w \mid t_{N+1}, x_{N+1}, m_N, S_N) = \mathcal{N}(w \mid m_{N+1}, S_{N+1}),$$

$$m_{N+1} = S_{N+1}\big(S_N^{-1} m_N + \beta\, t_{N+1}\,\phi(x_{N+1})\big), \qquad
S_{N+1}^{-1} = S_N^{-1} + \beta\, \phi(x_{N+1})\phi(x_{N+1})^T.$$

Page 34: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Linear Regression

Let us consider as prior a zero-mean isotropic Gaussian governed by a single precision parameter α (see the earlier example), so that

$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I),$$

and the corresponding posterior distribution over w is then given by

$$m_N = \beta\, S_N \Phi^T t, \qquad S_N^{-1} = \alpha I + \beta\, \Phi^T\Phi.$$

The log of the posterior is the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

$$\ln p(w \mid t) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - w^T\phi(x_n)\big)^2 - \frac{\alpha}{2} w^T w + \text{const}.$$

Thus the MAP estimate is the same as regularized least squares (ridge regression) with λ = α/β.
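A minimal MATLAB sketch of this posterior update for the isotropic prior, using the straight-line model and the α, β values of the later numerical example (the data generation here is illustrative, not the lecture's MatLab code):

```matlab
% Posterior N(w | mN, SN) for y = w0 + w1*x with an isotropic Gaussian prior.
alpha = 2.0; beta = 25;                        % prior and noise precisions (as in the example)
x = 2*rand(20,1) - 1;                          % inputs in [-1, 1]
t = -0.3 + 0.5*x + randn(size(x))/sqrt(beta);  % noisy targets from a0 = -0.3, a1 = 0.5
Phi = [ones(size(x)), x];                      % design matrix, phi(x) = (1, x)
SN = inv(alpha*eye(2) + beta*(Phi'*Phi));      % posterior covariance S_N
mN = beta*SN*(Phi'*t);                         % posterior mean m_N (= MAP = ridge estimate)
disp(mN')
```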

Page 35: Linear Models Of Regression: Bias-Variance Decomposition

A Note on Data Centering

In linear regression it helps to center the data in a way that does not require us to compute the offset term μ. Write the likelihood as

$$p(t \mid X, w, \mu, \beta) \propto \exp\left\{-\frac{\beta}{2}\big(t - \mu\mathbf{1} - \Phi w\big)^T\big(t - \mu\mathbf{1} - \Phi w\big)\right\},$$

where the design matrix is

$$\Phi = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N) \end{pmatrix}, \qquad \phi(x_i)^T = \big(\phi_1(x_i), \ldots, \phi_M(x_i)\big).$$

Let us assume that the input data are centered in each dimension, such that

$$\sum_{i=1}^{N} \phi_j(x_i) = 0, \qquad j = 1, \ldots, M.$$

The mean of the output is equally likely to be positive or negative. Let us put an improper prior p(μ) ∝ 1 and integrate μ out.

Page 36: Linear Models Of Regression: Bias-Variance Decomposition

A Note on Data Centering

Introducing $A \equiv t - \Phi w$ and $\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$, the marginal likelihood becomes

$$p(t \mid X, w) \propto \int \exp\left\{-\frac{\beta}{2}\big(A - \mu\mathbf{1}\big)^T\big(A - \mu\mathbf{1}\big)\right\} d\mu.$$

Completing the square in μ (and using the centering of the inputs, so that $\mathbf{1}^T\Phi = 0$ and $\mathbf{1}^T A = N\bar{t}$) gives

$$p(t \mid X, w) \propto \int \exp\left\{-\frac{\beta}{2}\big(A^T A - 2\mu N\bar{t} + N\mu^2\big)\right\} d\mu \propto \exp\left\{-\frac{\beta}{2}\big(A^T A - N\bar{t}^{\,2}\big)\right\}.$$

Our model is now simplified: if instead of t we use the centered output $\tilde{t} = t - \bar{t}\mathbf{1}$, the likelihood is simply written as

$$p(\tilde{t} \mid X, w) \propto \exp\left\{-\frac{\beta}{2}\big(\tilde{t} - \Phi w\big)^T\big(\tilde{t} - \Phi w\big)\right\}.$$

Recall that the MLE estimate for μ is

$$\hat{\mu} = \bar{t} - \bar{\phi}^T w,$$

where $\bar{\phi}$ is formed by averaging each column of Φ.

Page 37: Linear Models Of Regression: Bias-Variance Decomposition

A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form

$$y(x \mid w) = w_0 + w^T x.$$

In the context e.g. of MLE, we need to minimize

$$\min_{w_0, w}\; \sum_{i=1}^{N}\big(t_i - w_0 - w^T x_i\big)^2.$$

Minimization with respect to w0 gives

$$\sum_{i=1}^{N}\big(t_i - w_0 - w^T x_i\big) = 0 \;\Rightarrow\; N\bar{t} - N w_0 - N\, w^T\bar{x} = 0,$$

where

$$\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

Thus the MLE estimate of w0 is

$$\hat{w}_0 = \bar{t} - w^T\bar{x}.$$

Page 38: Linear Models Of Regression: Bias-Variance Decomposition

A Note on Data Centering

Substituting the bias term $w_0 = \bar{t} - w^T\bar{x}$ into our objective function gives

$$\min_{w}\; \sum_{i=1}^{N}\big(t_i - w_0 - w^T x_i\big)^2 = \min_{w}\; \sum_{i=1}^{N}\big((t_i - \bar{t}) - w^T(x_i - \bar{x})\big)^2.$$

Minimization with respect to w gives

$$\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T\, w = \sum_{i=1}^{N}(x_i - \bar{x})(t_i - \bar{t}).$$

We thus first compute the MLE of w using the centered input $X_c$ (with rows $(x_i - \bar{x})^T$) and the centered output $t_c = t - \bar{t}\mathbf{1}$:

$$\hat{w} = \big(X_c^T X_c\big)^{-1} X_c^T t_c = \left[\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T\right]^{-1} \sum_{i=1}^{N}(x_i - \bar{x})(t_i - \bar{t}).$$

We can then estimate the MLE of w0 as

$$\hat{w}_0 = \bar{t} - \hat{w}^T\bar{x}.$$
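A short MATLAB sketch of this centering recipe on synthetic data (the true weights and noise level are illustrative assumptions):

```matlab
% Fit the slope from centered data, then recover the offset.
N = 100; D = 3;
X = randn(N, D);  w_true = [1; -2; 0.5];  w0_true = 4;
t = w0_true + X*w_true + 0.1*randn(N,1);
xbar = mean(X, 1);  tbar = mean(t);
Xc = X - xbar;  tc = t - tbar;              % centered inputs and outputs
w_hat  = (Xc'*Xc) \ (Xc'*tc);               % slope estimate from centered data
w0_hat = tbar - xbar*w_hat;                 % offset recovered afterwards
fprintf('w0_hat = %.3f\n', w0_hat);  disp(w_hat')
```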

Page 39: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Regression: Example

We generate synthetic data from the function f(x, a) = a0 + a1 x, with parameter values a0 = −0.3 and a1 = 0.5, by first choosing values of xn from the uniform distribution U(x|−1, 1), then evaluating f(xn, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values tn.

We assume β = (1/0.2)² = 25 and α = 2.0.

We perform Bayesian inference sequentially, one point at a time, so the posterior at each stage becomes the new prior.

We show results after 1, 2 and 22 points have been collected. The results include the likelihood contours (for the latest point), the posterior, and samples of the regression function drawn from the posterior.
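A minimal MATLAB sketch of the sequential update used in this example, applying the rank-one posterior recursion from the earlier slide (the random data generation is illustrative, not the lecture's MatLab code):

```matlab
% Sequential Bayesian update for y = w0 + w1*x with a0 = -0.3, a1 = 0.5.
alpha = 2.0; beta = 25;
m = [0; 0];  S = eye(2)/alpha;            % prior N(w | 0, alpha^{-1} I)
for n = 1:22
    x = 2*rand - 1;                       % x_n ~ U(-1, 1)
    t = -0.3 + 0.5*x + 0.2*randn;         % noisy target
    phi = [1; x];                         % phi(x_n)
    Sinv = inv(S) + beta*(phi*phi');      % S_{n}^{-1} = S_{n-1}^{-1} + beta*phi*phi'
    Snew = inv(Sinv);
    m = Snew*(S\m + beta*t*phi);          % m_n = S_n (S_{n-1}^{-1} m_{n-1} + beta*t_n*phi)
    S = Snew;
end
disp(m')                                  % posterior mean approaches [-0.3, 0.5]
```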

Page 40: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Regression: Example

[Figure: prior over (w0, w1) before any data are seen (left), and y(x, w) for samples of w drawn from the prior (right).] (MatLab Code)

Page 41: Linear Models Of Regression: Bias-Variance Decomposition

Example: One Data Point Collected

[Figure: likelihood contours for the first data point (left), contours of the posterior (middle), and y(x, w) for samples of w drawn from the posterior (right).]

Note that the regression lines pass close to the data point (shown with a circle). (MatLab Code)

Page 42: Linear Models Of Regression: Bias-Variance Decomposition

Example: 2nd Data Point Collected

[Figure: likelihood contours for the second data point (left), contours of the posterior (middle), and y(x, w) for samples of w drawn from the posterior (right).]

Note that the regression lines now pass close to both data points. (MatLab Code)

Page 43: Linear Models Of Regression: Bias-Variance Decomposition

Example: 22 Data Points Collected

[Figure: likelihood contours for the latest data point (left), contours of the posterior (middle), and y(x, w) for samples of w drawn from the posterior (right).]

Note how tightly the regression lines cluster after 22 data points have been collected. (MatLab Code)

Page 44: Linear Models Of Regression: Bias-Variance Decomposition

Summary of Results

[Figure: rows show the state before any data and after successive observations; the columns show the likelihood of the latest point, the prior/posterior over (w0, w1), and the data space with sampled regression lines.] (MatLab Code)

Page 45: Linear Models Of Regression: Bias-Variance Decomposition

Summary of Results

[Figure: likelihood, prior/posterior over (w0, w1), and data-space samples, shown before any data and after 1, 2 and 20 data points.]

Run bayesLinRegDemo2d from PMTK3.

Page 46: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution

In practice we are not interested in w itself but in making predictions of t for new values of x. This requires evaluating the predictive distribution

$$p(t \mid x, t, \alpha, \beta) = \int p(t \mid x, w, \beta)\, p(w \mid t, \alpha, \beta)\, dw = \mathcal{N}\big(t \mid m_N^T\phi(x),\, \sigma_N^2(x)\big),$$

where

$$\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\, \phi(x), \qquad S_N^{-1} = \alpha I + \beta\, \Phi^T\Phi.$$

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
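A minimal MATLAB sketch of the predictive mean and variance above, with Gaussian basis functions on sin(2πx) data; the basis centres, width, and α, β values are illustrative assumptions rather than the lecture's settings.

```matlab
% Predictive mean m_N'*phi(x*) and variance 1/beta + phi(x*)' S_N phi(x*).
alpha = 2.0; beta = 25; M = 10;
mu = linspace(0, 1, M); s = 0.1;                             % Gaussian basis centres/width (assumed)
phi = @(x) exp(-(x(:) - mu).^2/(2*s^2));                     % rows phi(x)^T
x = rand(10,1);  t = sin(2*pi*x) + randn(10,1)/sqrt(beta);   % training data
Phi = phi(x);
SN = inv(alpha*eye(M) + beta*(Phi'*Phi));
mN = beta*SN*(Phi'*t);
xs = linspace(0, 1, 100)';  Ps = phi(xs);
pred_mean = Ps*mN;
pred_var  = 1/beta + sum((Ps*SN).*Ps, 2);                    % phi' S_N phi for each test point
plot(xs, pred_mean, xs, pred_mean + sqrt(pred_var), '--', xs, pred_mean - sqrt(pred_var), '--');
```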

Page 47: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution

In a full Bayesian treatment we want to compute the predictive distribution: given the training data X and t and a new test point x, we want

$$p(t \mid x, X, t) = \int p(t \mid x, w)\, p(w \mid X, t)\, dw,$$

where

$$p(t \mid x, w) = \mathcal{N}\big(t \mid y(x, w), \beta^{-1}\big), \qquad
p(w \mid X, t) = \mathcal{N}(w \mid m_N, S_N), \quad m_N = \beta\, S_N \Phi^T t, \quad S_N^{-1} = \alpha I + \beta\, \Phi^T\Phi.$$

To compute the needed marginal, we use a result from an earlier lecture (next slide).

Page 48: Linear Models Of Regression: Bias-Variance Decomposition

Appendix: Useful Result

For the linear-Gaussian model below, we proved in earlier notes the following very useful results about the marginal and conditional Gaussian distributions. Given

$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1}),$$

we have

$$p(y) = \mathcal{N}\big(y \mid A\mu + b,\ L^{-1} + A\Lambda^{-1}A^T\big),$$

$$p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\{A^T L(y - b) + \Lambda\mu\},\ \Sigma\big), \qquad \Sigma = (\Lambda + A^T L A)^{-1}.$$

Page 49: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution

Thus for our problem we identify

$$x \to w, \quad \mu \to m_N, \quad \Lambda^{-1} \to S_N, \qquad y \to t, \quad A \to \phi(x)^T, \quad b \to 0, \quad L^{-1} \to \beta^{-1},$$

since $p(t \mid x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$ and $p(w \mid X, t) = \mathcal{N}(w \mid m_N, S_N)$ with $m_N = \beta\, S_N \Phi^T t$.

The predictive distribution now takes the form

$$p(t \mid x, X, t) = \mathcal{N}\big(t \mid \phi(x)^T m_N,\ \beta^{-1} + \phi(x)^T S_N\, \phi(x)\big).$$

Page 50: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution

In the full Bayesian treatment, the predictive distribution given the training data and a new test point x is

$$p(t \mid x, X, t) = \int p(t \mid x, w)\, p(w \mid X, t)\, dw = \mathcal{N}\big(t \mid m(x), \sigma_N^2(x)\big),$$

where the mean and variance are given by

$$m(x) = \phi(x)^T m_N = \beta\, \phi(x)^T S_N \Phi^T t, \qquad
\sigma_N^2(x) = \underbrace{\beta^{-1}}_{\text{uncertainty in the data}} + \underbrace{\phi(x)^T S_N\, \phi(x)}_{\text{uncertainty in } w}, \qquad
S_N^{-1} = \alpha I + \beta\, \Phi^T\Phi.$$

Note that

$$\sigma_{N+1}^2(x) \le \sigma_N^2(x),$$

i.e. the predictive variance shrinks as more data are observed.

Page 51: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution

It is easy to show that $\sigma_{N+1}^2(x) \le \sigma_N^2(x)$. Note that

$$S_{N+1}^{-1} = S_N^{-1} + \beta\, \phi(x_{N+1})\phi(x_{N+1})^T,$$

and use the matrix identity

$$\big(M + v v^T\big)^{-1} = M^{-1} - \frac{M^{-1} v\, v^T M^{-1}}{1 + v^T M^{-1} v}.$$

Using these results, we can write

$$\sigma_{N+1}^2(x) = \beta^{-1} + \phi(x)^T S_{N+1}\, \phi(x)
= \sigma_N^2(x) - \frac{\beta\,\big(\phi(x)^T S_N\, \phi(x_{N+1})\big)^2}{1 + \beta\, \phi(x_{N+1})^T S_N\, \phi(x_{N+1})} \le \sigma_N^2(x).$$

Page 52: Linear Models Of Regression: Bias-Variance Decomposition

Predictive Distribution: Summary

$$p(t \mid x, X, t) = \mathcal{N}\big(t \mid m(x), \sigma_N^2(x)\big),$$

$$m(x) = \beta\, \phi(x)^T S_N \sum_{n=1}^{N}\phi(x_n)\, t_n, \qquad
\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\, \phi(x), \qquad
S_N^{-1} = \alpha I + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.$$

The notation used here is as follows. For polynomial regression,

$$\phi(x_n) = \big(1,\ x_n,\ x_n^2,\ \ldots\big)^T,$$

and in general $\phi(x_n) = \big(\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n)\big)^T$, with I the M × M unit matrix.

Note: the predictive mean and variance are functions of x.

Page 53: Linear Models Of Regression: Bias-Variance Decomposition

Pointwise Uncertainty in the Predictions

[Figures: predictive distribution for M = 9 Gaussian basis functions (10 parameters) fitted to N = 1 and N = 2 random data points from the generating function sin(2πx), showing the generating function, the data, the predictive mean and the predictive standard deviation.]

The scale of the Gaussians is adjusted with the data; α = 5×10⁻³, β = 11.1. Results are shown for N = 1, 2, 4, 10 (data are given here).

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N. (MatLab code)

Page 54: Linear Models Of Regression: Bias-Variance Decomposition

Pointwise Uncertainty in the Predictions

[Figures: predictive distribution for M = 9 Gaussian basis functions fitted to N = 4 and N = 10 data points, showing the generating function sin(2πx), the data, the predictive mean and the predictive standard deviation.] (MatLab code)

Page 55: Linear Models Of Regression: Bias-Variance Decomposition

Summary of Results

[Figures: predictive distribution (M = 9 Gaussian basis functions) for N = 1, 2, 4 and 10 data points, each showing the generating function sin(2πx), the random data points, the predictive mean and the predictive standard deviation.] (MatLab code)

Page 56: Linear Models Of Regression: Bias-Variance Decomposition

Plugin Approximation

$$p(t \mid x, X, t) = \int p(t \mid x, w)\, p(w \mid X, t)\, dw \;\approx\; p(t \mid x, \hat{w}) \quad \text{when } p(w \mid X, t) \text{ is replaced by a point mass at } \hat{w}.$$

[Figures: the plug-in approximation (MLE) vs. the posterior predictive with known variance, and functions sampled from the posterior vs. functions sampled from the plug-in approximation to the posterior.]

Run linregPostPredDemo from PMTK3.

Page 57: Linear Models Of Regression: Bias-Variance Decomposition

Covariance Between the Predictions

Draw samples from the posterior over w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figures: plots of y(x, w), where w is a sample from the posterior over w, for N = 1 and N = 2 data points.] (MatLab Code)

Page 58: Linear Models Of Regression: Bias-Variance Decomposition

Covariance Between the Predictions

Draw samples from the posterior over w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figures: plots of y(x, w), where w is a sample from the posterior over w, for N = 4 and N = 12 data points.] (MatLab Code)

Page 59: Linear Models Of Regression: Bias-Variance Decomposition

Summary of Results

[Figures: samples of y(x, w) drawn from the posterior over w for an increasing number of data points.] (MatLab Code)

Page 60: Linear Models Of Regression: Bias-Variance Decomposition

Gaussian Basis vs. Gaussian Process

If we use localized basis functions such as Gaussians, then in regions away from the basis-function support the contribution from the second term in the predictive variance goes to zero, leaving only the noise contribution β⁻¹:

$$\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\, \phi(x) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of } \phi(x).$$

The model therefore becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This is undesirable behavior.

This problem can be avoided by adopting an alternative Bayesian approach to regression (Gaussian processes).

Page 61: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Inference when σ² is Unknown

Let us extend the previous results for linear regression, assuming now that σ² is unknown.*

Assume a likelihood of the form

$$p(y \mid X, w, \sigma^2) = \mathcal{N}(y \mid X w, \sigma^2 I_N) \propto (\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X w)^T(y - X w)\right\}.$$

A conjugate prior has the normal-inverse-Gamma (NIG) form

$$p(w, \sigma^2) = \text{NIG}(w, \sigma^2 \mid w_0, V_0, a_0, b_0) = \mathcal{N}(w \mid w_0, \sigma^2 V_0)\,\text{InvGamma}(\sigma^2 \mid a_0, b_0)$$
$$= \frac{b_0^{a_0}}{(2\pi)^{D/2}|V_0|^{1/2}\,\Gamma(a_0)}\,(\sigma^2)^{-(a_0 + D/2 + 1)} \exp\left\{-\frac{(w - w_0)^T V_0^{-1}(w - w_0) + 2 b_0}{2\sigma^2}\right\}.$$

The posterior is then derived as

$$p(w, \sigma^2 \mid D) \propto (\sigma^2)^{-(a_0 + (D+N)/2 + 1)} \exp\left\{-\frac{(w - w_0)^T V_0^{-1}(w - w_0) + 2 b_0 + (y - X w)^T(y - X w)}{2\sigma^2}\right\}.$$

* In the remainder of this lecture, the response is denoted as y and the dimensionality of w is taken to be D.

Page 62: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Inference when σ² is Unknown

Let us define the following:

$$V_N = \big(V_0^{-1} + X^T X\big)^{-1}, \qquad w_N = V_N\big(V_0^{-1} w_0 + X^T y\big),$$
$$a_N = a_0 + N/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(w_0^T V_0^{-1} w_0 + y^T y - w_N^T V_N^{-1} w_N\big).$$

With these definitions, simple algebra shows that

$$p(w, \sigma^2 \mid D) = \text{NIG}(w, \sigma^2 \mid w_N, V_N, a_N, b_N) = \mathcal{N}(w \mid w_N, \sigma^2 V_N)\,\text{InvGamma}(\sigma^2 \mid a_N, b_N)$$
$$\propto (\sigma^2)^{-(a_N + D/2 + 1)} \exp\left\{-\frac{(w - w_N)^T V_N^{-1}(w - w_N) + 2 b_N}{2\sigma^2}\right\}.$$

The posterior marginals can now be derived explicitly:

$$p(\sigma^2 \mid D) = \text{InvGamma}(\sigma^2 \mid a_N, b_N),$$

$$p(w \mid D) = \mathcal{T}\Big(w \,\Big|\, w_N,\ \frac{b_N}{a_N} V_N,\ 2 a_N\Big) \propto \left[1 + \frac{(w - w_N)^T V_N^{-1}(w - w_N)}{2 b_N}\right]^{-(a_N + D/2)}.$$
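A minimal MATLAB sketch of the NIG posterior update defined above; the synthetic data and the weak prior settings are illustrative assumptions.

```matlab
% NIG posterior update for linear regression with unknown sigma^2.
N = 40; D = 3;
X = randn(N, D);  w_true = [1; 0; -2];  sigma = 2;
y = X*w_true + sigma*randn(N,1);
w0 = zeros(D,1);  V0 = 10*eye(D);  a0 = 0.1;  b0 = 0.1;  % weak NIG prior (assumed)
VN = inv(inv(V0) + X'*X);
wN = VN*(V0\w0 + X'*y);
aN = a0 + N/2;
bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));
sigma2_mean = bN/(aN - 1);                               % posterior mean of sigma^2
fprintf('E[sigma^2 | D] = %.3f\n', sigma2_mean);  disp(wN')
```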

Page 63: Linear Models Of Regression: Bias-Variance Decomposition

Posterior Marginals

The marginal posterior of w can be derived directly:

$$p(w \mid D) = \int_0^\infty p(w, \sigma^2 \mid D)\, d\sigma^2 \propto \int_0^\infty (\sigma^2)^{-(a_N + D/2 + 1)} \exp\left\{-\frac{(w - w_N)^T V_N^{-1}(w - w_N) + 2 b_N}{2\sigma^2}\right\} d\sigma^2$$
$$\propto \left[1 + \frac{(w - w_N)^T V_N^{-1}(w - w_N)}{2 b_N}\right]^{-(a_N + D/2)} = \mathcal{T}\Big(w \,\Big|\, w_N,\ \frac{b_N}{a_N} V_N,\ 2 a_N\Big).$$

To compute the integral above, simply set λ = σ⁻², dσ² = −λ⁻²dλ, and use the normalizing factor of the Gamma distribution, $\int_0^\infty \lambda^{a-1} e^{-b\lambda}\, d\lambda = \Gamma(a)\, b^{-a}$.

Page 64: Linear Models Of Regression: Bias-Variance Decomposition

Posterior Predictive Distribution

Consider the posterior predictive for m new test inputs $\tilde{X}$:

$$p(\tilde{y} \mid \tilde{X}, D) = \iint \mathcal{N}(\tilde{y} \mid \tilde{X} w, \sigma^2 I_m)\,\text{NIG}(w, \sigma^2 \mid w_N, V_N, a_N, b_N)\, dw\, d\sigma^2.$$

As a first step, integrate over w by completing the square in w inside the exponent

$$2 b_N + (\tilde{y} - \tilde{X} w)^T(\tilde{y} - \tilde{X} w) + (w - w_N)^T V_N^{-1}(w - w_N);$$

the Gaussian integral over w removes the w-dependent quadratic (which cancels out of the integration) and leaves the w-independent remainder $2 b_N + (\tilde{y} - \tilde{X} w_N)^T(I_m + \tilde{X} V_N \tilde{X}^T)^{-1}(\tilde{y} - \tilde{X} w_N)$. Use the Sherman–Morrison–Woodbury formula here (symmetry of V_N is assumed) to show that

$$\big(I_m + \tilde{X} V_N \tilde{X}^T\big)^{-1} = I_m - \tilde{X}\big(\tilde{X}^T\tilde{X} + V_N^{-1}\big)^{-1}\tilde{X}^T.$$

We next integrate over λ = 1/σ² and use the normalization of the Gamma distribution,

$$p(\tilde{y} \mid \tilde{X}, D) \propto \int_0^\infty \lambda^{a_N + m/2 - 1} \exp\{-\lambda(\cdots)\}\, d\lambda,$$

which gives

$$p(\tilde{y} \mid \tilde{X}, D) \propto \left[1 + \frac{(\tilde{y} - \tilde{X} w_N)^T(I_m + \tilde{X} V_N \tilde{X}^T)^{-1}(\tilde{y} - \tilde{X} w_N)}{2 b_N}\right]^{-(a_N + m/2)}.$$

Page 65: Linear Models Of Regression: Bias-Variance Decomposition

Bayesian Inference when σ² is Unknown

The posterior predictive is also a Student-t:

$$p(\tilde{y} \mid \tilde{X}, D) = \mathcal{T}\Big(\tilde{y} \,\Big|\, \tilde{X} w_N,\ \frac{b_N}{a_N}\big(I_m + \tilde{X} V_N \tilde{X}^T\big),\ 2 a_N\Big).$$

The predictive variance has two terms: $\frac{b_N}{a_N} I_m$, due to the measurement noise, and $\frac{b_N}{a_N}\tilde{X} V_N \tilde{X}^T$, due to the uncertainty in w. The second term depends on how close a test input is to the training data.

Page 66: Linear Models Of Regression: Bias-Variance Decomposition

Zellner's G-Prior

It is common to set a0 = b0 = 0, corresponding to an uninformative prior for σ², and to set w0 = 0 and V0 = g(XᵀX)⁻¹ for some positive value g:

$$p(w, \sigma^2) = \text{NIG}\big(w, \sigma^2 \mid 0,\ g(X^T X)^{-1},\ 0,\ 0\big) = \mathcal{N}\big(w \mid 0,\ g\sigma^2 (X^T X)^{-1}\big)\,\text{InvGamma}(\sigma^2 \mid 0, 0).$$

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression. However, the prior covariance is proportional to (XᵀX)⁻¹ rather than I. This ensures that the posterior is invariant to scaling of the inputs.

Recall the general conjugate form $p(w, \sigma^2) = \text{NIG}(w, \sigma^2 \mid w_0, V_0, a_0, b_0) = \mathcal{N}(w \mid w_0, \sigma^2 V_0)\,\text{InvGamma}(\sigma^2 \mid a_0, b_0)$.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North Holland.

Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.

Page 67: Linear Models Of Regression: Bias-Variance Decomposition

Unit Information Prior

We will see below that if we use an uninformative prior, the posterior precision given N measurements is $V_N^{-1} = X^T X$.

The unit information prior is defined to contain as much information as one sample.

To create a unit information prior for linear regression, we need to use $V_0^{-1} = \frac{1}{N} X^T X$, which is equivalent to the g-prior with g = N:

$$p(w, \sigma^2) = \text{NIG}\big(w, \sigma^2 \mid 0,\ g(X^T X)^{-1},\ 0,\ 0\big), \qquad g = N.$$

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928–934.

Page 68: Linear Models Of Regression: Bias-Variance Decomposition

Uninformative Prior

An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = 0 and b0 = 0, which gives

$$p(w, \sigma^2) = \text{NIG}(w, \sigma^2 \mid 0, \infty I, 0, 0) \propto \sigma^{-(D+2)}.$$

Alternatively, we can start with the semi-conjugate prior p(w, σ²) = p(w)p(σ²) and take each term to its uninformative limit individually, which gives p(w, σ²) ∝ σ⁻². This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = −D/2 and b0 = 0:

$$p(w, \sigma^2) = \text{NIG}(w, \sigma^2 \mid 0, \infty I, -D/2, 0) \propto \sigma^{-2}.$$

Page 69: Linear Models Of Regression: Bias-Variance Decomposition

Uninformative Prior

Using the uninformative prior p(w, σ²) ∝ σ⁻², the corresponding posterior and marginal posteriors are given by

$$p(w, \sigma^2 \mid D) = \text{NIG}(w, \sigma^2 \mid w_N, V_N, a_N, b_N), \qquad
p(w \mid D) = \mathcal{T}\Big(w \,\Big|\, \hat{w}_{\text{MLE}},\ \frac{s^2}{N - D}\, C,\ N - D\Big),$$

where

$$V_N = C = (X^T X)^{-1}, \qquad w_N = (X^T X)^{-1} X^T y = \hat{w}_{\text{MLE}},$$
$$a_N = a_0 + N/2 = (N - D)/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(y^T y - w_N^T V_N^{-1} w_N\big) = \tfrac{1}{2} s^2, \qquad s^2 = (y - X\hat{w}_{\text{MLE}})^T(y - X\hat{w}_{\text{MLE}}).$$

Page 70: Linear Models Of Regression: Bias-Variance Decomposition

The Caterpillar Example

The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics:

$$p(w_j \mid D) = \mathcal{T}\Big(w_j \,\Big|\, \hat{w}_j,\ \frac{C_{jj}\, s^2}{N - D},\ N - D\Big).$$

This is equivalent to the sampling distribution of the MLE, which is given by

$$\frac{\hat{w}_j - w_j}{s_j} \sim t_{N-D}, \qquad s_j = \sqrt{\frac{C_{jj}\, s^2}{N - D}},$$

where s_j is the standard error of the estimated parameter.

Hence the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same.

Rice, J. (1995). Mathematical Statistics and Data Analysis. Duxbury, 2nd edition (page 542).
Casella, G. and R. Berger (2002). Statistical Inference. Duxbury, 2nd edition (page 554).
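A small MATLAB sketch of this equivalence: the 95% marginal credible intervals under the uninformative prior coincide with the classical t-based confidence intervals. The data here are synthetic (not the caterpillar set), and tinv requires the Statistics and Machine Learning Toolbox.

```matlab
% 95% intervals for the regression coefficients under the uninformative prior.
N = 30; D = 4;
X = [ones(N,1), randn(N, D-1)];
y = X*[10; -1; 0; 2] + 2*randn(N,1);
C    = inv(X'*X);
what = C*(X'*y);                           % MLE = posterior mean
s2   = sum((y - X*what).^2);               % residual sum of squares
se   = sqrt(diag(C)*s2/(N-D));             % standard errors s_j
tq   = tinv(0.975, N-D);                   % Student-t quantile (Statistics Toolbox)
CI   = [what - tq*se, what + tq*se];       % credible = confidence intervals
disp([what, se, CI])
```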

Page 71: Linear Models Of Regression: Bias-Variance Decomposition

The Caterpillar Example

As a worked example, consider the caterpillar dataset. We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI), for the regression coefficients.

coeff   mean     stddev    95% CI               sig
w0      10.998   3.06027   [  4.652, 17.345]    *
w1      -0.004   0.00156   [ -0.008, -0.001]    *
w2      -0.054   0.02190   [ -0.099, -0.008]    *
w3       0.068   0.09947   [ -0.138,  0.274]
w4      -1.294   0.56381   [ -2.463, -0.124]    *
w5       0.232   0.10438   [  0.015,  0.448]    *
w6      -0.357   1.56646   [ -3.605,  2.892]
w7      -0.237   1.00601   [ -2.324,  1.849]
w8       0.181   0.23672   [ -0.310,  0.672]
w9      -1.285   0.86485   [ -3.079,  0.508]
w10     -0.433   0.73487   [ -1.957,  1.091]

The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods.

Run linregBayesCaterpillar from PMTK3.

Marin, J.-M. and C. Robert (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.

Page 72: Linear Models Of Regression: Bias-Variance Decomposition

The Caterpillar Example

We can use these marginal posteriors to determine whether the coefficients are significantly different from 0: check whether the 95% CI excludes 0.

The CIs for coefficients 0, 1, 2, 4, 5 are all significant.

These results are the same as those produced by a frequentist approach using p-values at the 5% level.

But note that the MLE does not even exist when N < D, so standard frequentist inference theory breaks down in this setting. Bayesian inference theory still works, using proper priors.

Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.

Page 73: Linear Models Of Regression: Bias-Variance Decomposition

Empirical Bayes for Linear Regression

We describe next an empirical Bayes procedure for picking the hyper-parameters of the prior.

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior p(w) = N(w | 0, α⁻¹I).

This is known as the evidence procedure.

MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603–643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035–1068.

Page 74: Linear Models Of Regression: Bias-Variance Decomposition

Empirical Bayes for Linear Regression

The evidence procedure provides an alternative to using cross-validation.

In the figure, the log marginal likelihood (log evidence) is plotted for different values of α, as well as the maximum value found by the optimizer.

[Figure: log evidence vs. log α.]

Run linregPolyVsRegDemo from PMTK3.
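A minimal MATLAB sketch of the evidence procedure: evaluate the log marginal likelihood of the Gaussian-prior linear model on a grid of α with β held fixed, and pick the maximizer. The polynomial basis, data, and grid are illustrative assumptions, not the PMTK3 demo.

```matlab
% Grid search over alpha by maximizing the log evidence (standard Gaussian-prior expression).
beta = 25;  N = 25;  M = 10;
x = rand(N,1);  t = sin(2*pi*x) + randn(N,1)/sqrt(beta);
Phi = x.^(0:M-1);                                  % polynomial design matrix (illustrative)
log_alphas = linspace(-5, 5, 101);  logev = zeros(size(log_alphas));
for k = 1:numel(log_alphas)
    alpha = exp(log_alphas(k));
    A  = alpha*eye(M) + beta*(Phi'*Phi);
    mN = beta*(A \ (Phi'*t));
    E  = beta/2*sum((t - Phi*mN).^2) + alpha/2*(mN'*mN);
    logev(k) = M/2*log(alpha) + N/2*log(beta) - E ...
             - 0.5*log(det(A)) - N/2*log(2*pi);    % log marginal likelihood
end
[~, kbest] = max(logev);
fprintf('best log(alpha) = %.2f\n', log_alphas(kbest));
```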

Page 75: Linear Models Of Regression: Bias-Variance Decomposition

Empirical Bayes for Linear Regression

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods).

The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature.

[Figures: log evidence vs. log α (left); CV estimate of MSE and negative log marginal likelihood vs. log λ (right).]

Run linregPolyVsRegDemo from PMTK3.

Page 76: Linear Models Of Regression: Bias-Variance Decomposition

Automatic Relevancy Determination

The evidence procedure can be used to perform feature selection (automatic relevancy determination, or ARD).

The evidence procedure is also useful when comparing different kinds of models:

$$p(D \mid m) = \iint p(D \mid w)\, p(w \mid \eta, m)\, p(\eta \mid m)\, dw\, d\eta
\;\approx\; \int p(D \mid w)\, p(w \mid \hat{\eta}, m)\, dw, \qquad \hat{\eta} = \arg\max_{\eta}\, p(\eta \mid D, m).$$

It is important to (at least approximately) integrate over η rather than setting it arbitrarily.

Variational Bayes models our uncertainty in η rather than computing point estimates.