TRANSCRIPT
Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Linear Models Of Regression:
Bias-Variance Decomposition,
Bayesian Linear Regression
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/
July 30, 2014
Contents

The bias-variance decomposition
Ridge Regression, Shrinkage, Regularization effect of Big Data
Bayesian linear regression, Parameter posterior distribution, A Note on Data Centering, Numerical Example
Predictive distribution, Gaussian Processes
Bayesian inference in linear regression when σ² is unknown, Zellner's g-Prior, Uninformative (Semi-Conjugate) Prior, Evidence Approximation
References: Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7; Chris Bishop's PRML book, Chapter 3; Regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3, pmtk3-1nov12/docs/tutorial/html/tutRegr.html).
The Bias-Variance Decomposition
MLE (i.e. least squares) leads to severe over-fitting if complex
models are trained using data sets of limited size.
Over-fitting occurs whenever the number of basis functions is
large (i.e. for complex models) and the training data set is of
limited size.
Limiting the number of basis functions limits the flexibility of the
model.
Regularization controls over-fitting, but one needs to determine the regularization parameter λ.
Over-fitting is a property of MLE and does not arise when we marginalize over the parameters in a Bayesian setting.
Before returning to a Bayesian setting, we will discuss the bias-
variance tradeoff (a frequentist viewpoint of model complexity).
Loss Function and the Regression Function
Recall the regression loss function

$$L(t, y(\mathbf{x})) = \{y(\mathbf{x}) - t\}^2$$

The decision problem is to minimize the expected loss:

$$\mathbb{E}[L] = \iint \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$

The solution of this minimization problem, known as the regression function, is

$$y(\mathbf{x}) = \mathbb{E}_t[t\,|\,\mathbf{x}] = \int t\, p(t\,|\,\mathbf{x})\, dt,$$

i.e. the average of t conditioned on x.

A useful expression was derived in an earlier lecture:

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t\,|\,\mathbf{x}]\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{\mathbb{E}[t\,|\,\mathbf{x}] - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
The Bias-Variance Decomposition
We can write the expected squared loss as

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$

where h(x) is the conditional expectation, given by

$$h(\mathbf{x}) = \mathbb{E}[t\,|\,\mathbf{x}] = \int t\, p(t\,|\,\mathbf{x})\, dt$$

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss.

If we model h(x) using a parametric function y(x, w) governed by w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w.
The Bias-Variance Decomposition
A frequentist treatment involves making a point estimate of w based on the data set D and interpreting the uncertainty of this estimate as follows.
Suppose we had a large number of data sets each of size N
and each drawn independently from p(x, t).
For any given data set D, we run our learning algorithm and
obtain a prediction function y(x; D). Different data sets from the
ensemble give different functions and values of the squared
loss.
The performance of the learning algorithm is then assessed by
taking the average over this ensemble of data sets.
The Bias-Variance Decomposition
For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Consider the squared difference from the regression function and expand:

$$\{y(\mathbf{x};D) - h(\mathbf{x})\}^2 = \{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)] + \mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2$$
$$= \{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2 + \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2 + 2\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}\{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}$$

Here $\mathbb{E}_D[y(\mathbf{x};D)]$ is the average of our prediction function over all data sets.

Take the expectation of this expression with respect to D and note that the final term vanishes, giving

$$\mathbb{E}_D\big[\{y(\mathbf{x};D) - h(\mathbf{x})\}^2\big] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\big[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\big]}_{\text{variance}}$$

Recall that h(x) is the desired regression function.
The Bias-Variance Decomposition
So far, we have considered a single input value x. If we substitute this expansion back into the expected squared loss function shown above, we obtain the following decomposition of the expected squared loss:

Expected loss = (bias)² + variance + noise

where

$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x}$$

$$\text{variance} = \int \mathbb{E}_D\big[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\big]\, p(\mathbf{x})\, d\mathbf{x}$$

$$\text{noise} = \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
The Bias-Variance Decomposition
Expected loss = (bias)² + variance + noise

There is a tradeoff between bias and variance:
flexible models have low bias and high variance;
rigid models have high bias and low variance.

The bias-variance decomposition provides insight into model complexity, but it is of limited practical use since it requires an ensemble of data sets D.
The Bias-Variance Decomposition

[Figure: dependence of bias and variance on model complexity, governed by the regularization parameter λ, using the sinusoidal data set (generated from sin(2πx)). L = 100 data sets, each with N = 25 data points; 24 Gaussian basis functions plus a bias term, so M = 25 parameters. Left: 20 of the 100 fits for ln λ = 2.6. Right: the average of the 100 fits against the generating function. Large λ: high bias, low variance. (MatLab code)]
The Bias-Variance Decomposition

[Figure: the same experiment for an intermediate regularization parameter, ln λ = -0.31. Left: 20 of the 100 fits. Right: the average of the 100 fits against the generating function sin(2πx). L = 100 data sets, N = 25 points each, M = 25 parameters. (MatLab code)]
The Bias-Variance Decomposition

[Figure: the same experiment for the smallest regularization parameter shown, ln λ = -2.4 (low bias, large variance). Left: 20 of the 100 fits. Right: the average of the 100 fits. L = 100 data sets, N = 25 points each, M = 25 parameters. (MatLab code)]

Note that averaging many solutions of the complex model (M = 25) obtained from different data sets gives a very good fit.

Weighted averaging is also done in the Bayesian approach, but there it is with respect to the posterior distribution of the parameters.
Trade-off Quantities
The average prediction over L data sets is estimated from

$$\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x)$$

The integrated squared bias and integrated variance are then given by

$$(\text{bias})^2 = \frac{1}{N}\sum_{n=1}^{N}\big\{\bar{y}(x_n) - h(x_n)\big\}^2$$

$$\text{variance} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{L}\sum_{l=1}^{L}\big\{y^{(l)}(x_n) - \bar{y}(x_n)\big\}^2$$

These are finite-sample estimates of the integrated quantities $\int \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\,d\mathbf{x}$ and $\int \mathbb{E}_D[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2]\, p(\mathbf{x})\,d\mathbf{x}$. A MatLab-style sketch of this estimation is given below.
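The following is a minimal MatLab-style sketch of the bias/variance experiment, not the course's linked PMTK3/MatLab code; the noise level, basis width, and the use of a grid of evaluation points (rather than the training inputs) are assumptions made for illustration.

% Sketch: empirical (bias)^2 and variance for regularized least squares
% on the sin(2*pi*x) example (L = 100 data sets, N = 25 points each).
L = 100; N = 25; lambda = exp(-0.31); noiseStd = 0.3;        % assumed noise level
mu = linspace(0, 1, 24); s = 0.1;                            % Gaussian basis centres / width (assumed)
phi = @(x) [ones(numel(x),1), exp(-bsxfun(@minus, x(:), mu).^2/(2*s^2))];
xGrid = linspace(0, 1, 100)';  h = sin(2*pi*xGrid);          % regression function h(x)
Yhat = zeros(numel(xGrid), L);
for l = 1:L
    x = rand(N,1);  t = sin(2*pi*x) + noiseStd*randn(N,1);   % one data set
    Phi = phi(x);
    w = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*t);     % regularized LS fit
    Yhat(:,l) = phi(xGrid)*w;                                % prediction y(x; D_l)
end
ybar  = mean(Yhat, 2);                                       % average prediction over data sets
bias2 = mean((ybar - h).^2);                                 % integrated (bias)^2
vari  = mean(mean((Yhat - repmat(ybar, 1, L)).^2, 2));       % integrated variance

Sweeping lambda and plotting bias2, vari and their sum reproduces the behavior shown in the following figures.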
The Bias-Variance Decomposition
Note that a small regularization parameter λ allows the model to be finely tuned to the noise in each individual data set, leading to small bias but large variance.

Conversely, a large λ forces all the wi towards zero, leading to large bias but small variance.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 versus ln λ; complex models lie at small ln λ, simple models at large ln λ. (MatLab code)]

λ       Model complexity   Variance   Bias
low     complex            high       low
high    simple             low        high
The Bias-Variance Decomposition
Plot of the squared bias and variance, together with their sum. Also shown is the average test-set error for a test data set of 1000 points.

The minimum value of (bias)² + variance occurs around ln λ = -0.31, which is close to the value that gives the minimum error on the test data.

[Figure: squared bias, variance, bias² + variance, and test rms error / 3 versus ln λ. (MatLab code)]
Ridge Regression
MLE overfits because it picks the parameter values that are best for modeling the training data. If the data are noisy, such parameters often correspond to complex functions.

Consider fitting a degree-14 polynomial to N = 21 data points using least squares. The resulting curve is wiggly, and the coefficients wi (excluding w0) take large values in order to interpolate the data nearly perfectly. If we changed the data a little, the coefficients would change a lot.

[Figure: least-squares fit of a degree-14 polynomial to N = 21 points. Run linregPolyVsRegDemo from PMTK3.]
Ridge Regression
Degree-14 polynomial fit to N = 21 data points with increasing amounts of L2 regularization. The data were generated from noise with σ² = 4.

The error bars, representing the noise σ², get wider as the fit gets smoother, since we are ascribing more of the data variation to the noise.

[Figure: regularized fits for ln λ = -20.135 and ln λ = -8.571. Run linregPolyVsRegDemo from PMTK3.]

The objective being minimized is

$$\min_{\mathbf{w}} \sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 + \lambda\,\mathbf{w}^T\mathbf{w}$$
Ridge Regression
For a Bayesian perspective on regularized least squares, consider MAP estimation with a zero-mean Gaussian prior

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \tau^2\mathbf{I})$$

so that the log posterior takes the form

$$\ln p(\mathbf{w}\,|\,\mathcal{D}) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 - \frac{1}{2\tau^2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$

The MAP estimation problem is therefore the same as

$$\min_{\mathbf{w}} \sum_{n=1}^{N}\big(t_n - (w_0 + \mathbf{w}^T\mathbf{x}_n)\big)^2 + \lambda\,\mathbf{w}^T\mathbf{w}, \qquad \lambda = \sigma^2/\tau^2$$

Here λ is the complexity penalty. Ridge regression (penalized least squares) leads to the estimate

$$\hat{\mathbf{w}}_{\text{Ridge}} = \big(\lambda\,\mathbf{I}_D + \mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{t}$$

w0 is not regularized, as it does not affect the complexity of the function. Regularization ensures the function is simple (e.g. w = 0 corresponds to a straight line). Increasing λ results in smoother functions and smaller wi. A sketch of computing this estimate follows.
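A minimal MatLab-style sketch of the closed-form ridge estimate, assuming the offset w0 is handled by centering (so it is not penalized); function and variable names are illustrative, not the course's code.

% Sketch: closed-form ridge estimate; X is N x D, t is N x 1, lambda >= 0.
function [w, w0] = ridgeFit(X, t, lambda)
    xbar = mean(X, 1);  tbar = mean(t);
    Xc = X - repmat(xbar, size(X,1), 1);           % centred inputs
    tc = t - tbar;                                 % centred targets
    D  = size(X, 2);
    w  = (Xc'*Xc + lambda*eye(D)) \ (Xc'*tc);      % (lambda I + X'X)^{-1} X' t
    w0 = tbar - xbar*w;                            % offset recovered afterwards
end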
Ridge Regression
We continue with the degree-14 polynomial fit by ridge regression, plotted vs log(λ) (N = 21, σ² = 4).

On the left, notice that as λ increases, the training error increases; the test error has the classical U shape. Cross-validation can be used to select an optimal λ.

On the right, we estimate performance using the training set alone, via 5-fold cross-validation and the negative log marginal likelihood -log p(D|λ) (the plots are vertically rescaled to [0,1]).

[Figure: left, training and test mean squared error vs log λ (complex models on the left, simple models on the right); right, -log p(D|λ) and the 5-fold cross-validation estimate of future MSE vs log λ. Run linregPolyVsRegDemo from PMTK3.]
Numerically Stable Ridge Estimate

To avoid inverting $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_D)$, we augment the original data with virtual data coming from the prior as follows:

$$\tilde{\mathbf{X}} = \begin{pmatrix}\mathbf{X}\\ \sqrt{\lambda}\,\mathbf{I}_D\end{pmatrix}, \qquad \tilde{\mathbf{t}} = \begin{pmatrix}\mathbf{t}\\ \mathbf{0}_{D}\end{pmatrix}$$

With these definitions, the penalized LS problem looks like ordinary least squares on the augmented data:

$$\min_{\mathbf{w}}\ (\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})$$

This is clear since

$$(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{t}} - \tilde{\mathbf{X}}\mathbf{w}) = (\mathbf{t} - \mathbf{X}\mathbf{w})^T(\mathbf{t} - \mathbf{X}\mathbf{w}) + \lambda\,\mathbf{w}^T\mathbf{w}$$

The ridge estimate is $\hat{\mathbf{w}}_{\text{Ridge}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{t}}$. Consider a QR decomposition $\tilde{\mathbf{X}} = \mathbf{Q}\mathbf{R}$. Then we only have to invert R, which is upper triangular (in MatLab, w = Xtil\ttil). The cost is O(ND²):

$$\hat{\mathbf{w}}_{\text{Ridge}} = (\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\tilde{\mathbf{t}} = (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\tilde{\mathbf{t}} = \mathbf{R}^{-1}\mathbf{Q}^T\tilde{\mathbf{t}}$$

A sketch of this QR-based computation follows.
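A minimal MatLab-style sketch of the augmented-data / QR route, under the assumption that the inputs are already centred; names are illustrative.

% Sketch: numerically stable ridge via data augmentation + QR.
function w = ridgeQR(X, t, lambda)
    D    = size(X, 2);
    Xtil = [X; sqrt(lambda)*eye(D)];      % augmented design matrix
    ttil = [t; zeros(D, 1)];              % virtual targets from the prior
    [Q, R] = qr(Xtil, 0);                 % economy-size QR decomposition
    w = R \ (Q'*ttil);                    % w = R^{-1} Q' t~ ; only R is "inverted"
end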
Numerically Stable Ridge Estimate

Consider now the case where D ≫ N. We first compute a thin SVD decomposition

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T, \qquad \mathbf{U}^T\mathbf{U} = \mathbf{I}_N, \quad \mathbf{V}^T\mathbf{V} = \mathbf{I}_N, \quad \mathbf{S}\ (N\times N)\ \text{diagonal}$$

Defining Z = US, an N×N matrix, we can write the ridge estimate as

$$\hat{\mathbf{w}}_{\text{Ridge}} = \mathbf{V}\big(\mathbf{Z}^T\mathbf{Z} + \lambda\,\mathbf{I}_N\big)^{-1}\mathbf{Z}^T\mathbf{t}$$

In essence, we replace the D-dimensional xi with the N-dimensional zi and perform the penalized fit as before. We then transform the N-dimensional solution to the D-dimensional solution by multiplying by V.

Geometrically, we are rotating to a new coordinate system in which all but the first N coordinates are zero. This does not affect the solution since the spherical Gaussian prior is rotationally invariant. The overall time is now O(DN²) operations. A sketch of this computation follows.
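A minimal MatLab-style sketch of the SVD route for D ≫ N; names are illustrative and the inputs are assumed already centred.

% Sketch: ridge for D >> N using the thin SVD X = U*S*V', cost O(D N^2).
function w = ridgeSVD(X, t, lambda)
    [U, S, V] = svd(X, 'econ');            % for N < D: U is NxN, S is NxN, V is DxN
    Z  = U*S;                              % N x N reduced design matrix
    wz = (Z'*Z + lambda*eye(size(Z,2))) \ (Z'*t);   % penalized fit in the reduced space
    w  = V*wz;                             % map back to the D-dimensional space
end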
Connections with PCA

Using the singular values of X, the ridge predictions on the training set can be written as

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{\text{Ridge}} = \mathbf{U}\mathbf{S}\mathbf{V}^T\,\mathbf{V}\big(\mathbf{S}^2 + \lambda\mathbf{I}_N\big)^{-1}\mathbf{S}\,\mathbf{U}^T\mathbf{t} = \mathbf{U}\tilde{\mathbf{S}}\mathbf{U}^T\mathbf{t} = \sum_{j}\mathbf{u}_j\,\frac{\sigma_j^2}{\sigma_j^2 + \lambda}\,\mathbf{u}_j^T\mathbf{t}$$

where σj are the singular values of X. Recall that ordinary least squares gives

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{\text{LS}} = \mathbf{U}\mathbf{U}^T\mathbf{t} = \sum_{j}\mathbf{u}_j\,\mathbf{u}_j^T\mathbf{t}$$

If $\sigma_j^2 \ll \lambda$, then direction $\mathbf{u}_j$ has little effect on the prediction. This motivates the definition of the effective number of degrees of freedom:

$$\text{dof}(\lambda) = \sum_{j}\frac{\sigma_j^2}{\sigma_j^2 + \lambda}, \qquad \text{dof}(\lambda = 0) = D, \quad \text{dof}(\lambda \to \infty) = 0$$

We will see later in this lecture that for a uniform prior on w, the posterior covariance is

$$\text{cov}[\mathbf{w}\,|\,\mathcal{D}] = \sigma^2\big(\mathbf{X}^T\mathbf{X}\big)^{-1}$$
Shrinkage

$$\text{cov}[\mathbf{w}\,|\,\mathcal{D}] = \sigma^2\big(\mathbf{X}^T\mathbf{X}\big)^{-1}$$

Thus the directions in which we are most uncertain about w are determined by the eigenvectors of $\mathbf{X}^T\mathbf{X}$ with the smallest eigenvalues; the eigenvalues of $(\mathbf{X}^T\mathbf{X})^{-1}$ are $1/\sigma_j^2$. Hence small singular values σj correspond to directions with high posterior variance. It is these directions which ridge regression shrinks the most.

[Figure: contours of the likelihood and of the prior in (w1, w2) space, with the MLE, the prior mean, and the posterior mean marked.]

w1 is not well determined by the data (it has high posterior variance), but w2 is well determined. Ill-determined parameters are reduced in size towards 0:

$$\hat{w}_1^{\text{MAP}} \approx w_1^{\text{prior}} = 0, \qquad \hat{w}_2^{\text{MAP}} \approx \hat{w}_2^{\text{MLE}}$$

This is called shrinkage.
Regularization Effect of Big Data

Another effective regularizing approach is to use lots of data: more training data in general implies better learning.

The test-set error decreases to a plateau as N increases. This is illustrated by plotting the MSE incurred on the test set by polynomial regression of different degrees vs N (the learning curve).

The level of the plateau for the test error consists of two terms:
an irreducible component (that all models incur) due to the intrinsic variability of the generating process (the noise floor); and
a component that depends on the discrepancy between the generating process (the "truth") and the model (the structural error).
Learning Curves
Truth is a degree-2 polynomial, and we fit polynomials of degrees 1, 2 and 25 (σ² = 4).

The structural error for M2 and M25 is zero, as both capture the true generating process. The structural error for M1 is substantial: the plateau occurs high above the noise floor.

[Figure: training and test MSE vs size of the training set for truth = degree 2 and models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Test Error and Simple Models
For any model that is expressive enough to capture the truth (i.e., M2, M10, M25), the test error goes to the noise floor as N → ∞. However, the test error reaches this floor faster for simpler models, since there are fewer parameters to estimate.

[Figure: training and test MSE vs size of the training set for models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Approximation Error
For finite N, there is a discrepancy between the parameters that we estimate and the best parameters that we could estimate given the particular model class.

The approximation error goes to zero as N → ∞, but it goes to zero faster for simpler models.

[Figure: training and test MSE vs size of the training set for models of degree 1, 2, 10 and 25. Run linregPolyVsN from PMTK3.]
Multi-task Learning
In domains with lots of data, simple methods work very well. More often, however, we have little data. E.g., in web search one has lots of data overall, but personalizing the results leaves only a few data points per user.

In multi-task learning, we often borrow statistical strength from tasks with lots of data and share it with tasks with little data.
Halevy, A., P. Norvig, and F. Pereira (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2),
8–12.
Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by λ, and still by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive (see the earlier example on choosing the regularization using test data).
A Bayesian treatment of linear regression avoids the over-
fitting of maximum likelihood.
Bayesian approaches lead to automatic methods of
determining model complexity using the training data alone.
Bayesian Linear Regression

Assume additive Gaussian noise with known precision β. The likelihood function p(t|w) is the exponential of a quadratic function of w:

$$p(\mathbf{t}\,|\,\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n\,|\,\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\beta^{-1}\big) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

Its conjugate prior is Gaussian:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_0, \mathbf{S}_0)$$

Combining this with the likelihood and using the results for marginal and conditional Gaussian distributions gives the posterior

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$
Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed:

$$p(\mathbf{w}\,|\,\mathbf{m}_0,\mathbf{S}_0) \propto \exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)$$

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

$$p(\mathbf{w}\,|\,\mathbf{t},\mathbf{x},\beta) \propto \exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) - \frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right)$$

Collecting the terms that are quadratic and linear in w,

$$p(\mathbf{w}\,|\,\mathbf{t},\mathbf{x},\beta) \propto \exp\left(-\frac{1}{2}\mathbf{w}^T\Big(\mathbf{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\Big)\mathbf{w} + \mathbf{w}^T\Big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n\Big)\right)$$

and completing the square in w gives

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N,\mathbf{S}_N), \qquad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T$$
Bayesian Linear Regression

Note that because the posterior distribution is Gaussian, its mode coincides with its mean, so

$$\mathbf{w}_{\text{MAP}} = \mathbf{m}_N$$

The above expressions for the posterior mean and covariance,

$$p(\mathbf{w}\,|\,\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N,\mathbf{S}_N), \qquad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\mathbf{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi},$$

can also be written for a sequential calculation: having already observed N data points, we now consider an additional data point (x_{N+1}, t_{N+1}). In this case

$$p(\mathbf{w}\,|\,t_{N+1}, \mathbf{x}_{N+1}, \mathbf{m}_N, \mathbf{S}_N) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_{N+1}, \mathbf{S}_{N+1})$$

$$\mathbf{m}_{N+1} = \mathbf{S}_{N+1}\big(\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})\,t_{N+1}\big), \qquad \mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})\,\boldsymbol{\phi}(\mathbf{x}_{N+1})^T$$
Bayesian Linear Regression

Let us consider as a prior a zero-mean isotropic Gaussian governed by a single precision parameter α (see the earlier example), so that

$$p(\mathbf{w}\,|\,\alpha) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \alpha^{-1}\mathbf{I})$$

The corresponding posterior distribution over w is then given by

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\mathbf{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$

The log of the posterior is the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

$$\ln p(\mathbf{w}\,|\,\mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 - \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} + \text{const}$$

Thus the MAP estimate is the same as regularized least squares (ridge regression) with λ = α/β.
A Note on Data Centering

In linear regression, it helps to center the data in a way that does not require us to compute the offset term μ. Write the likelihood as

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\mu,\beta) \propto \exp\left(-\frac{\beta}{2}\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)\right)$$

where the design matrix collects the basis functions evaluated at the inputs,

$$\mathbf{\Phi} = \begin{pmatrix}\phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_M(\mathbf{x}_1)\\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_M(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_M(\mathbf{x}_N)\end{pmatrix}, \qquad \boldsymbol{\phi}(\mathbf{x}_i) = \big(\phi_1(\mathbf{x}_i), \ldots, \phi_M(\mathbf{x}_i)\big)^T$$

Let us assume that the input data are centered in each dimension such that

$$\sum_{i=1}^{N}\phi_j(\mathbf{x}_i) = 0, \qquad j = 1,\ldots,M$$

The mean of the output is equally likely to be positive or negative. Let us put an improper prior p(μ) ∝ 1 on the offset and integrate μ out.
A Note on Data Centering

Introducing $\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$, the marginal likelihood becomes

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)\right)d\mu$$

Completing the square in μ (using the centering of the input, $\mathbf{1}_N^T\mathbf{\Phi} = \mathbf{0}$) gives

$$\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mu\mathbf{1}_N - \mathbf{\Phi}\mathbf{w}\big) = N\mu^2 - 2\mu N\bar{t} + \big(\mathbf{t} - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t} - \mathbf{\Phi}\mathbf{w}\big)$$

so that, after integrating over μ,

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\big(\mathbf{t}_c - \mathbf{\Phi}\mathbf{w}\big)^T\big(\mathbf{t}_c - \mathbf{\Phi}\mathbf{w}\big)\right), \qquad \mathbf{t}_c = \mathbf{t} - \bar{t}\,\mathbf{1}_N$$

Our model is thus simplified if, instead of t, we use the centered output t_c.

Recall that the MLE estimate for μ is $\hat{\mu} = \bar{t} - \bar{\boldsymbol{\phi}}^T\mathbf{w}$, where $\bar{\boldsymbol{\phi}}$ is formed by averaging each column of Φ (and vanishes here because of the centering).
A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form

$$y(\mathbf{x}\,|\,\mathbf{w}) = w_0 + \mathbf{w}^T\mathbf{x}$$

In the context, e.g., of MLE, we need to minimize

$$\min_{w_0,\mathbf{w}} \sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big)^2$$

Minimization with respect to w0 gives

$$\sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big) = 0 \;\Rightarrow\; N\bar{t} - N w_0 - N\,\mathbf{w}^T\bar{\mathbf{x}} = 0 \;\Rightarrow\; w_0 = \bar{t} - \mathbf{w}^T\bar{\mathbf{x}}$$

where

$$\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i, \qquad \bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \frac{1}{N}\Big(\sum_i x_{i1},\ \ldots,\ \sum_i x_{iD}\Big)^T$$

Thus, once w is known, the MLE estimate of w0 follows immediately from the data means.
A Note on Data Centering

Substituting the bias term into our objective function gives

$$\min_{\mathbf{w}} \sum_{i=1}^{N}\big(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\big)^2 = \min_{\mathbf{w}} \sum_{i=1}^{N}\big(t_i - \bar{t} - \mathbf{w}^T(\mathbf{x}_i - \bar{\mathbf{x}})\big)^2$$

Minimization with respect to w gives

$$\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T\,\mathbf{w} = \sum_{i=1}^{N}(t_i - \bar{t})(\mathbf{x}_i - \bar{\mathbf{x}})$$

We thus first compute the MLE of w using the centered input and output:

$$\hat{\mathbf{w}} = \big(\mathbf{X}_c^T\mathbf{X}_c\big)^{-1}\mathbf{X}_c^T\mathbf{t}_c = \left(\sum_{i=1}^{N}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T\right)^{-1}\sum_{i=1}^{N}(t_i - \bar{t})(\mathbf{x}_i - \bar{\mathbf{x}})$$

where $\mathbf{X}_c$ has rows $(\mathbf{x}_i - \bar{\mathbf{x}})^T$ and $\mathbf{t}_c = \mathbf{t} - \bar{t}\,\mathbf{1}_N$. We can then estimate the MLE of w0 as

$$\hat{w}_0 = \bar{t} - \hat{\mathbf{w}}^T\bar{\mathbf{x}}$$

A short numerical check of these formulas is sketched below.
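A minimal MatLab-style sketch verifying the centering route against ordinary least squares with an explicit offset column; the synthetic data and all names are assumptions for illustration.

% Sketch: the centring formulas reproduce the LS fit with an offset column.
N = 50; D = 3;
X = randn(N, D);  wTrue = [1.5; -2; 0.5];
t = 4 + X*wTrue + 0.1*randn(N, 1);               % synthetic data with offset 4
% (a) direct LS with a column of ones
wFull = [ones(N,1), X] \ t;                      % returns [w0; w]
% (b) centring route from the slides
xbar = mean(X, 1);  tbar = mean(t);
Xc = X - repmat(xbar, N, 1);  tc = t - tbar;
w  = (Xc'*Xc) \ (Xc'*tc);
w0 = tbar - xbar*w;
disp(max(abs(wFull - [w0; w])))                  % should be at machine precision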
Bayesian Regression: Example
We generate synthetic data from the function f(x, a) = a0 + a1 x with parameter values a0 = -0.3 and a1 = 0.5, by first choosing values of xn from the uniform distribution U(x|-1, 1), then evaluating f(xn, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values tn.

We assume β = (1/0.2)² = 25 and α = 2.0.

We perform Bayesian inference sequentially, one point at a time, so the posterior at each step becomes the new prior.

We show results after 1, 2 and 22 points have been collected. The results include the likelihood contours (for one point), the posterior, and samples of the regression function from the posterior. A MatLab-style sketch of the sequential updates follows.
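The following is a minimal sketch of the sequential updates for this example (not the course's linked MatLab code); the data generation follows the description above and the variable names are illustrative.

% Sketch: sequential Bayesian updates for the straight-line example.
a = [-0.3; 0.5];  alpha = 2;  beta = 25;  Ndata = 22;
x = 2*rand(Ndata, 1) - 1;                        % x_n ~ U(-1, 1)
t = a(1) + a(2)*x + 0.2*randn(Ndata, 1);         % noisy targets
m = [0; 0];  S = (1/alpha)*eye(2);               % prior N(0, alpha^{-1} I) on w = (w0, w1)
for n = 1:Ndata                                  % one point at a time
    phi     = [1; x(n)];                         % basis vector (1, x_n)
    SinvOld = inv(S);
    Sinv    = SinvOld + beta*(phi*phi');         % precision update
    S       = inv(Sinv);
    m       = S*(SinvOld*m + beta*phi*t(n));     % the posterior becomes the new prior
end
% m and S now hold the posterior mean and covariance after 22 points.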
Bayesian Regression: Example

[Figure: prior, no data yet. Left: contours of the prior over (w0, w1). Right: y(x, w) using samples of w from the prior. (MatLab Code)]
Example: One Data Point Collected

[Figure: left, likelihood contours for the first data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that the regression lines pass close to the data point (shown with a circle).
Example: 2nd Data Point Collected

[Figure: left, likelihood contours for the second data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that the regression lines now pass close to both data points.
Example: 22 Data Points Collected

[Figure: left, likelihood contours for the latest data point; middle, contours of the posterior; right, y(x, w) using samples of w from the posterior. (MatLab Code)]

Note that, after 22 data points have been collected, the posterior is sharply peaked and the sampled regression lines cluster tightly around the data.
Summary of Results

[Figure: grid of panels summarizing the sequential inference. Rows correspond to no data, 1 point, 2 points, and many points; columns show the likelihood for the latest point, the prior/posterior over (w0, w1), and the data space with sampled regression functions. (MatLab Code)]
Summary of Results

[Figure: the same summary in (W0, W1) space after 20 data points: likelihood, prior/posterior, and data-space panels for successive updates. Run bayesLinRegDemo2d from PMTK3.]
Predictive Distribution
In practice, we are not interested in w itself but in making predictions of t for new values of x. This requires that we evaluate the predictive distribution

$$p(t\,|\,x,\mathbf{t},\alpha,\beta) = \int p(t\,|\,x,\mathbf{w},\beta)\,p(\mathbf{w}\,|\,\mathbf{t},\alpha,\beta)\,d\mathbf{w} = \mathcal{N}\big(t\,|\,\mathbf{m}_N^T\boldsymbol{\phi}(x),\,\sigma_N^2(x)\big)$$

where

$$\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{\Phi}^T\mathbf{\Phi}$$

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation, the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution: given the training data x and t and a new test point x, we want

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \int p(t\,|\,x,\mathbf{w})\,p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t})\,d\mathbf{w}$$

where

$$p(t\,|\,x,\mathbf{w}) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}),\,\beta^{-1}\big)$$

and

$$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t}) \propto \exp\left(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2\right) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N)$$

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\,t_n, \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I}_M + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T$$

To compute the needed marginal, we use a result from an earlier lecture.
Appendix: Useful Result

For the linear Gaussian model above, we proved in these notes the following very useful results about marginal and conditional Gaussian models. Given

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}), \qquad p(\mathbf{y}\,|\,\mathbf{x}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$$

we have

$$p(\mathbf{y}) = \mathcal{N}\big(\mathbf{y}\,|\,\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\big)$$

$$p(\mathbf{x}\,|\,\mathbf{y}) = \mathcal{N}\big(\mathbf{x}\,|\,\boldsymbol{\Sigma}\{\mathbf{A}^T\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\},\; \boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = \big(\boldsymbol{\Lambda} + \mathbf{A}^T\mathbf{L}\mathbf{A}\big)^{-1}$$
Predictive Distribution

Thus, for our problem, the correspondence with the result above is

$$\mathbf{x} \to \mathbf{w}, \quad \boldsymbol{\mu} \to \mathbf{m}_N, \quad \boldsymbol{\Lambda}^{-1} \to \mathbf{S}_N, \quad \mathbf{y} \to t, \quad \mathbf{A} \to \boldsymbol{\phi}(x)^T, \quad \mathbf{b} \to 0, \quad \mathbf{L}^{-1} \to \beta^{-1}$$

with

$$p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)t_n, \qquad p(t\,|\,x,\mathbf{w},\beta) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}), \beta^{-1}\big)$$

The predictive distribution now takes the form

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \mathcal{N}\big(t\,|\,\boldsymbol{\phi}(x)^T\mathbf{m}_N,\; \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\big)$$
Predictive Distribution

In a full Bayesian treatment, we want to compute the predictive distribution: given the training data x and t and a new test point x,

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \int p(t\,|\,x,\mathbf{w})\,p(\mathbf{w}\,|\,\mathbf{x},\mathbf{t})\,d\mathbf{w} = \mathcal{N}\big(t\,|\,m(x),\,s^2(x)\big), \qquad p(t\,|\,x,\mathbf{w}) = \mathcal{N}\big(t\,|\,y(x,\mathbf{w}), \beta^{-1}\big)$$

where the mean and variance are given by

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\,t_n = \boldsymbol{\phi}(x)^T\mathbf{m}_N$$

$$s^2(x) = \underbrace{\beta^{-1}}_{\text{uncertainty in the data}} + \underbrace{\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)}_{\text{uncertainty in } \mathbf{w}}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

Note that

$$s_{N+1}^2(x) \le s_N^2(x),$$

i.e. the predictive variance shrinks as more data are observed.
Predictive Distribution

It is easy to show that

$$s_{N+1}^2(x) \le s_N^2(x)$$

Note that

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T$$

and recall the (Sherman-Morrison) identity

$$\big(\mathbf{M} + \mathbf{v}\mathbf{v}^T\big)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{v}\mathbf{v}^T\mathbf{M}^{-1}}{1 + \mathbf{v}^T\mathbf{M}^{-1}\mathbf{v}}$$

Using these results, we can write

$$s_{N+1}^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\big(\mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T\big)^{-1}\boldsymbol{\phi}(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) - \frac{\beta\,\big(\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})\big)^2}{1 + \beta\,\boldsymbol{\phi}(x_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})} \le s_N^2(x)$$
Predictive Distribution: Summary

$$p(t\,|\,x,\mathbf{x},\mathbf{t}) = \mathcal{N}\big(t\,|\,m(x),\,s^2(x)\big)$$

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\,t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

Note: the predictive mean and variance are functions of x.

The notation used here is as follows. For polynomial regression,

$$\boldsymbol{\phi}(x_n) = \big(1,\, x_n,\, x_n^2,\, \ldots,\, x_n^M\big)^T$$

and in general

$$\boldsymbol{\phi}(\mathbf{x}_n) = \big(\phi_0(\mathbf{x}_n),\, \phi_1(\mathbf{x}_n),\, \phi_2(\mathbf{x}_n),\, \ldots,\, \phi_M(\mathbf{x}_n)\big)^T,$$

with I the unit matrix of matching size. A sketch of computing the predictive mean and variance follows.
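A minimal MatLab-style sketch of the predictive mean and variance with Gaussian basis functions, matching the setup of the following figures (α = 5·10⁻³, β = 11.1, 9 Gaussians plus a bias term); the training data, basis width, and names are assumptions.

% Sketch: Bayesian predictive mean and pointwise variance, Gaussian basis.
alpha = 5e-3;  beta = 11.1;  s = 0.1;
mu = linspace(0, 1, 9);                                   % 9 Gaussian basis centres (assumed)
phi = @(x) [ones(numel(x),1), exp(-bsxfun(@minus, x(:), mu).^2/(2*s^2))];
x = rand(10, 1);  t = sin(2*pi*x) + 0.3*randn(10, 1);     % assumed training data
Phi  = phi(x);   M = size(Phi, 2);                        % M = 10 parameters
Sinv = alpha*eye(M) + beta*(Phi'*Phi);                    % S_N^{-1}
SN   = inv(Sinv);
mN   = beta*(SN*(Phi'*t));                                % posterior mean m_N
xs   = linspace(0, 1, 200)';  Ps = phi(xs);
predMean = Ps*mN;                                         % m(x) on a grid
predVar  = 1/beta + sum((Ps*SN).*Ps, 2);                  % s^2(x) pointwise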
Pointwise Uncertainty in the Predictions

The model uses M = 9 Gaussian basis functions (10 parameters), with the scale of the Gaussians adjusted to the data; α = 5·10⁻³ and β = 11.1. We fit using N = 1, 2, 4 and 10 data points.

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N.

[Figure: predictive distribution for M = 9 with N = 1 (left) and N = 2 (right): generating function sin(2πx), the random data points, the predictive mean, and the pointwise predictive standard deviation. (MatLab code)]
Pointwise Uncertainty in the Predictions

[Figure: predictive distribution for M = 9 with N = 4 (left) and N = 10 (right): generating function sin(2πx), the random data points, the predictive mean, and the pointwise predictive standard deviation. (MatLab code)]
Summary of Results

[Figure: the four predictive-distribution panels (M = 9; N = 1, 2, 4, 10) collected together: generating function sin(2πx), data points, predictive mean, and predictive standard deviation. (MatLab code)]
Plugin Approximation

In the plugin approximation, the posterior over w is replaced by a point mass at an estimate ŵ (e.g. the MLE):

$$p(t\,|\,\mathbf{x},\mathcal{D}) = \int p(t\,|\,\mathbf{x},\mathbf{w})\,\delta_{\hat{\mathbf{w}}}(\mathbf{w})\,d\mathbf{w} = p(t\,|\,\mathbf{x},\hat{\mathbf{w}})$$

[Figure: plugin approximation (MLE) vs the posterior predictive with known variance, and functions sampled from the plugin approximation to the posterior vs functions sampled from the posterior. Run linregPostPredDemo from PMTK3.]
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 1 (left) and N = 2 (right). (MatLab Code)]
Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 4 (left) and N = 12 (right). (MatLab Code)]
Summary of Results

[Figure: the four sampled-function panels collected together. (MatLab Code)]
Gaussian Basis vs. Gaussian Process
If we use localized basis functions such as Gaussians, then in
regions away from the basis function support, the contribution
from the second term in the predictive variance will go to zero,
leaving only the noise contribution β−1.
The model becomes very confident in its predictions when
extrapolating outside the region occupied by the basis functions.
This is an undesirable behavior.
This problem can be avoided by adopting an alternative
Bayesian approach to regression (Gaussian processes).
$$\sigma_N^2(\mathbf{x}) = \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of the basis functions}$$
Bayesian Inference when σ² is Unknown
Let us extend the previous results for linear regression, assuming now that σ² is unknown.

Assume a likelihood of the form*

$$p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w},\sigma^2) = \mathcal{N}\big(\mathbf{y}\,|\,\mathbf{X}\mathbf{w},\,\sigma^2\mathbf{I}_N\big) \propto (\sigma^2)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})\right)$$

A conjugate prior has the normal-inverse-Gamma (NIG) form

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_0,\mathbf{V}_0,a_0,b_0) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_0,\sigma^2\mathbf{V}_0)\,\mathrm{InvGamma}(\sigma^2\,|\,a_0,b_0)$$
$$= \frac{b_0^{a_0}}{(2\pi)^{D/2}|\mathbf{V}_0|^{1/2}\,\Gamma(a_0)}\,(\sigma^2)^{-(a_0 + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T\mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + 2b_0}{2\sigma^2}\right)$$

The posterior is now derived as

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) \propto (\sigma^2)^{-(a_0 + (D+N)/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_0)^T\mathbf{V}_0^{-1}(\mathbf{w}-\mathbf{w}_0) + 2b_0 + (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})}{2\sigma^2}\right)$$

*In the remainder of this lecture, the response is denoted as y and the dimensionality of w is taken as D.
Bayesian Inference when σ² is Unknown
Let us define the following:

$$\mathbf{V}_N = \big(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{X}\big)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\big(\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{X}^T\mathbf{y}\big)$$

$$a_N = a_0 + N/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(\mathbf{w}_0^T\mathbf{V}_0^{-1}\mathbf{w}_0 + \mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\big)$$

With these definitions, simple algebra shows that the posterior is again NIG:

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_N,\mathbf{V}_N,a_N,b_N) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{w}_N,\sigma^2\mathbf{V}_N)\,\mathrm{InvGamma}(\sigma^2\,|\,a_N,b_N)$$

The posterior marginals can now be derived explicitly:

$$p(\sigma^2\,|\,\mathcal{D}) = \mathrm{InvGamma}(\sigma^2\,|\,a_N, b_N)$$

$$p(\mathbf{w}\,|\,\mathcal{D}) = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\Big)$$

i.e. the marginal over w is a multivariate Student-t distribution. A sketch of these updates follows.
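A minimal MatLab-style sketch of the NIG posterior updates above; the function name and interface are illustrative.

% Sketch: posterior NIG parameters for linear regression with unknown sigma^2.
function [wN, VN, aN, bN] = nigPosterior(X, y, w0, V0, a0, b0)
    N  = size(X, 1);
    VN = inv(inv(V0) + X'*X);
    wN = VN*(V0\w0 + X'*y);
    aN = a0 + N/2;
    bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));
end
% Posterior marginals: sigma^2 | D ~ InvGamma(aN, bN),
% w | D ~ Student-t with mean wN, scale (bN/aN)*VN and 2*aN degrees of freedom.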
Posterior Marginals
The marginal posterior over w can be derived directly by integrating out σ²:

$$p(\mathbf{w}\,|\,\mathcal{D}) = \int_0^{\infty} p(\mathbf{w},\sigma^2\,|\,\mathcal{D})\,d\sigma^2 \propto \int_0^{\infty}(\sigma^2)^{-(a_N + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right)d\sigma^2$$

$$\propto \left[1 + \frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N)}{2 b_N}\right]^{-(a_N + D/2)} = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{b_N}{a_N}\mathbf{V}_N,\ 2a_N\Big)$$

To compute the integral above, simply set λ = σ⁻², dσ² = -λ⁻²dλ, and use the normalizing factor of the Gamma distribution, $\int_0^{\infty}\lambda^{a-1}e^{-b\lambda}\,d\lambda = \Gamma(a)\,b^{-a}$.
Posterior Predictive Distribution
Consider the posterior predictive for m new test inputs, collected in $\tilde{\mathbf{X}}$:

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \int\!\!\int (\sigma^2)^{-m/2}\exp\left(-\frac{(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})}{2\sigma^2}\right)(\sigma^2)^{-(a_N + D/2 + 1)}\exp\left(-\frac{(\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) + 2b_N}{2\sigma^2}\right)d\mathbf{w}\,d\sigma^2$$

As a first step, integrate over w by completing the square:

$$(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}) + (\mathbf{w}-\mathbf{w}_N)^T\mathbf{V}_N^{-1}(\mathbf{w}-\mathbf{w}_N) = (\mathbf{w} - \mathbf{w}_*)^T\mathbf{V}_*^{-1}(\mathbf{w} - \mathbf{w}_*) + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N - \mathbf{w}_*^T\mathbf{V}_*^{-1}\mathbf{w}_*$$

with $\mathbf{V}_* = (\mathbf{V}_N^{-1} + \tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}$ and $\mathbf{w}_* = \mathbf{V}_*(\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} + \mathbf{V}_N^{-1}\mathbf{w}_N)$; the first (Gaussian) term cancels out from the integration in w.

We next integrate over λ = 1/σ² using the normalization of the Gamma distribution, $\int_0^{\infty}\lambda^{m/2 + a_N - 1}e^{-c\lambda/2}\,d\lambda \propto c^{-(m/2 + a_N)}$, which leaves

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \Big[2b_N + \tilde{\mathbf{y}}^T\tilde{\mathbf{y}} + \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N - \mathbf{w}_*^T\mathbf{V}_*^{-1}\mathbf{w}_*\Big]^{-(a_N + m/2)}$$

Using the Sherman-Morrison-Woodbury formula (symmetry of V0 is assumed),

$$\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1} = \mathbf{I}_m - \tilde{\mathbf{X}}\big(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} + \mathbf{V}_N^{-1}\big)^{-1}\tilde{\mathbf{X}}^T,$$

one can show that the bracketed remainder equals $2b_N + (\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)^T\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1}(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)$, so that

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) \propto \left[1 + \frac{(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)^T\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big)^{-1}(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}_N)}{2 b_N}\right]^{-(a_N + m/2)}$$
Bayesian Inference when σ² is Unknown
The posterior predictive is also a Student-t:

$$p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{X}},\mathcal{D}) = \mathcal{T}\Big(\tilde{\mathbf{y}}\,\Big|\,\tilde{\mathbf{X}}\mathbf{w}_N,\ \frac{b_N}{a_N}\big(\mathbf{I}_m + \tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T\big),\ 2a_N\Big)$$

The predictive variance has two terms: $\frac{b_N}{a_N}\mathbf{I}_m$, due to the measurement noise, and $\frac{b_N}{a_N}\tilde{\mathbf{X}}\mathbf{V}_N\tilde{\mathbf{X}}^T$, due to the uncertainty in w. The second term depends on how close a test input is to the training data. A sketch of this computation follows.
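A minimal MatLab-style sketch returning the parameters of the Student-t posterior predictive; the function name and interface are illustrative.

% Sketch: Student-t posterior predictive parameters for new inputs Xtest,
% given the NIG posterior (wN, VN, aN, bN) computed above.
function [mu, Sigma, dof] = nigPredict(Xtest, wN, VN, aN, bN)
    mu    = Xtest*wN;                                 % predictive mean
    m     = size(Xtest, 1);
    Sigma = (bN/aN)*(eye(m) + Xtest*VN*Xtest');       % scale matrix (noise + w-uncertainty)
    dof   = 2*aN;                                     % degrees of freedom
end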
Zellner's g-Prior
It is common to set a0 = b0 = 0, corresponding to an uninformative prior for σ², and to set w0 = 0 and V0 = g(XᵀX)⁻¹ for some positive value g:

$$p(\mathbf{w},\sigma^2\,|\,g) = \mathrm{NIG}\big(\mathbf{w},\sigma^2\,|\,\mathbf{0},\,g(\mathbf{X}^T\mathbf{X})^{-1},\,0,\,0\big) = \mathcal{N}\big(\mathbf{w}\,|\,\mathbf{0},\,g\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\big)\,\mathrm{InvGamma}(\sigma^2\,|\,0,0)$$

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression; however, the prior covariance is proportional to (XᵀX)⁻¹ rather than I. This ensures that the posterior is invariant to scaling of the inputs.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North Holland.
Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.
Unit Information Prior
We will see below that, if we use an uninformative prior, the posterior precision given N measurements is $\mathbf{V}_N^{-1} = \mathbf{X}^T\mathbf{X}$.

The unit information prior is defined to contain as much information as one sample. To create a unit information prior for linear regression, we therefore use $\mathbf{V}_0^{-1} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$, which is equivalent to the g-prior with g = N.
Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928–934.
Uninformative Prior
An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = 0 and b0 = 0, which gives

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{0},\infty\mathbf{I},0,0) \propto \sigma^{-(D+2)}$$

Alternatively, we can start with the semi-conjugate prior p(w, σ²) = p(w)p(σ²) and take each term to its uninformative limit individually, which gives p(w, σ²) ∝ σ⁻². This is equivalent to an improper NIG prior with w0 = 0, V0 = ∞I, a0 = -D/2 and b0 = 0:

$$p(\mathbf{w},\sigma^2) = \mathrm{NIG}\big(\mathbf{w},\sigma^2\,|\,\mathbf{0},\infty\mathbf{I},-\tfrac{D}{2},0\big) \propto \sigma^{-2}$$
Uninformative Prior
Using the uninformative prior p(w, σ²) ∝ σ⁻², the corresponding posterior and marginal posteriors are given by

$$p(\mathbf{w},\sigma^2\,|\,\mathcal{D}) = \mathrm{NIG}(\mathbf{w},\sigma^2\,|\,\mathbf{w}_N,\mathbf{V}_N,a_N,b_N)$$

$$p(\mathbf{w}\,|\,\mathcal{D}) = \mathcal{T}\Big(\mathbf{w}\,\Big|\,\mathbf{w}_N,\ \frac{s^2}{N-D}\,\mathbf{C},\ N-D\Big)$$

where

$$\mathbf{V}_N = \mathbf{C} = \big(\mathbf{X}^T\mathbf{X}\big)^{-1}, \qquad \mathbf{w}_N = \mathbf{V}_N\mathbf{X}^T\mathbf{y} = \big(\mathbf{X}^T\mathbf{X}\big)^{-1}\mathbf{X}^T\mathbf{y} = \hat{\mathbf{w}}_{\text{MLE}}$$

$$a_N = a_0 + N/2 = (N-D)/2, \qquad b_N = b_0 + \tfrac{1}{2}\big(\mathbf{y}^T\mathbf{y} - \mathbf{w}_N^T\mathbf{V}_N^{-1}\mathbf{w}_N\big) = \tfrac{s^2}{2}, \qquad s^2 \equiv (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{\text{MLE}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}_{\text{MLE}})$$
The Caterpillar Example
The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics:

$$p(w_j\,|\,\mathcal{D}) = \mathcal{T}\Big(w_j\,\Big|\,\hat{w}_j,\ \frac{C_{jj}\,s^2}{N-D},\ N-D\Big)$$

This is equivalent to the sampling distribution of the MLE, which is given by

$$\frac{w_j - \hat{w}_j}{s_j} \sim t_{N-D}, \qquad s_j^2 = \frac{C_{jj}\,s^2}{N-D},$$

where $s_j$ is the standard error of the estimated parameter. Thus the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same in this case. A short sketch of this computation follows the references below.
Rice, J. (1995). Mathematical statistics and data analysis. Duxbury. 2nd edition (page 542)
Casella, G. and R. Berger (2002). Statistical inference. Duxbury. 2nd edition (page 554)
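A minimal MatLab-style sketch of the 95% marginal credible intervals under the uninformative prior, which coincide with the frequentist confidence intervals; tinv requires the Statistics Toolbox, and the design matrix X is assumed to include a column of ones.

% Sketch: 95% credible/confidence intervals under p(w, sigma^2) prop to sigma^{-2}.
function CI = credibleIntervals(X, y)
    [N, D] = size(X);
    wHat = X \ y;                              % MLE = posterior mean w_N
    C    = inv(X'*X);
    v    = sum((y - X*wHat).^2) / (N - D);     % residual variance estimate s^2/(N-D)
    se   = sqrt(diag(C) * v);                  % standard errors s_j
    tq   = tinv(0.975, N - D);                 % Student-t quantile with N-D dof
    CI   = [wHat - tq*se, wHat + tq*se];       % 95% intervals, one row per coefficient
end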
The Caterpillar Example
As a worked example of this, consider the caterpillar
dataset. We can compute the posterior mean and
standard deviation, and the 95% credible intervals (CI) for
the regression coefficients.
coeff mean stddev 95pc CI sig
w0 10.998 3.06027 [ 4.652, 17.345] *
w1 -0.004 0.00156 [ -0.008, -0.001] *
w2 -0.054 0.02190 [ -0.099, -0.008] *
w3 0.068 0.09947 [ -0.138, 0.274]
w4 -1.294 0.56381 [ -2.463, -0.124] *
w5 0.232 0.10438 [ 0.015, 0.448] *
w6 -0.357 1.56646 [ -3.605, 2.892]
w7 -0.237 1.00601 [ -2.324, 1.849]
w8 0.181 0.23672 [ -0.310, 0.672]
w9 -1.285 0.86485 [ -3.079, 0.508]
w10 -0.433 0.73487 [ -1.957, 1.091]
The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods.
Run linregBayesCaterpillar
from PMTK3
Marin, J.-M. and C. Robert (2007). Bayesian Core: a practical approach to
computational Bayesian statistics. Springer.
The Caterpillar Example
We can use these marginal posteriors to check whether a coefficient is significantly different from 0, i.e. whether its 95% CI excludes 0.

The CIs for coefficients 0, 1, 2, 4, 5 are all significant. These results are the same as those produced by a frequentist approach using p-values at the 5% level.
But note that the MLE does not even exist when N <D, so
standard frequentist inference theory breaks down in this
setting. Bayesian inference theory still works using proper
priors.
Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.
Empirical Bayes for Linear Regression
We describe next an empirical Bayes procedure for picking the hyper-parameters in the prior.

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior, p(w) = N(w|0, α⁻¹I).

This is known as the evidence procedure.
MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for
supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603–643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035–1068.
Empirical Bayes for Linear Regression
The evidence procedure provides an alternative to using
cross validation.
In the Figure, the log marginal likelihood is plotted for
different values of α, as well as the maximum value found
by the optimizer.
[Figure: log evidence vs log α, with the maximum found by the optimizer marked. Run linregPolyVsRegDemo from PMTK3.]
Empirical Bayes for Linear Regression
[Figure: left, log evidence vs log α; right, -log p(D|λ) and the 5-fold CV estimate of MSE vs log λ. Run linregPolyVsRegDemo from PMTK3.]

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods).

The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature. A sketch of the evidence computation follows.
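A minimal MatLab-style sketch of the log marginal likelihood (evidence) for the Gaussian-prior linear model and a grid search over α, following the standard linear-Gaussian evidence formula; the function name, grid range, and the assumption that β = 1/σ² is fixed are illustrative.

% Sketch: log evidence ln p(t | alpha, beta) for the linear-Gaussian model.
function logEv = logEvidence(Phi, t, alpha, beta)
    [N, M] = size(Phi);
    A  = alpha*eye(M) + beta*(Phi'*Phi);       % posterior precision
    mN = beta*(A \ (Phi'*t));                  % posterior mean
    E  = beta/2*sum((t - Phi*mN).^2) + alpha/2*(mN'*mN);
    logEv = M/2*log(alpha) + N/2*log(beta) - E ...
            - 0.5*log(det(A)) - N/2*log(2*pi);
end
% Usage: evaluate over a grid of alpha values and pick the maximizer.
% alphas = exp(linspace(-25, 5, 100));
% logEvs = arrayfun(@(a) logEvidence(Phi, t, a, beta), alphas);
% [~, idx] = max(logEvs);  alphaHat = alphas(idx);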
Automatic Relevancy Determination
The evidence procedure can be used to perform feature selection (automatic relevancy determination, or ARD).

The evidence procedure is also useful when comparing different kinds of models:

$$p(\mathcal{D}\,|\,m) = \int\!\!\int p(\mathcal{D}\,|\,\mathbf{w},m)\,p(\mathbf{w}\,|\,\boldsymbol{\eta},m)\,p(\boldsymbol{\eta}\,|\,m)\,d\mathbf{w}\,d\boldsymbol{\eta} \;\approx\; \max_{\boldsymbol{\eta}}\int p(\mathcal{D}\,|\,\mathbf{w},m)\,p(\mathbf{w}\,|\,\boldsymbol{\eta},m)\,d\mathbf{w}$$

It is important to (at least approximately) integrate over η rather than setting it arbitrarily.

Variational Bayes models our uncertainty in η rather than computing point estimates.