Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University
Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part IV
Linear Regression 2
Linear Regression
Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Training and Testing: (Non-Bayesian) Point Estimate

Training phase: a model with adjustable parameter w is fit to the training data x and the training targets t, and the most appropriate w* is fixed.
Test phase: the model with the fixed parameter w* maps new test data x to a predicted test target t.
Bayesian Regression

Bayes' theorem: posterior = (likelihood × prior) / normalisation,

    p(w | t) = p(t | w) p(w) / p(t)

where we leave out the conditioning on x (always assumed) and on β, which is assumed to be constant.
The i.i.d. regression likelihood for additive Gaussian noise is

    p(t | w) = ∏_{n=1}^N N(t_n | y(x_n, w), β⁻¹)
             = ∏_{n=1}^N N(t_n | wᵀφ(x_n), β⁻¹)
             = const × exp{ −(β/2) (t − Φw)ᵀ(t − Φw) }
             = N(t | Φw, β⁻¹ I)
How to choose a prior?

The choice of prior affords an intuitive control over our inductive bias.
All inference schemes have such biases, and they often arise more opaquely than the prior in Bayes' rule.
Can we find a prior for the given likelihood which
  makes sense for the problem at hand, and
  allows us to find the posterior in a 'nice' form?
An answer to the second question:

Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood     Conjugate Prior
    Bernoulli      Beta
    Binomial       Beta
    Poisson        Gamma
    Multinomial    Dirichlet

Table: Continuous likelihood distributions

    Likelihood             Conjugate Prior
    Uniform                Pareto
    Exponential            Gamma
    Normal                 Normal (mean parameter)
    Multivariate normal    Multivariate normal (mean parameter)
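As a minimal sketch (not from the slides) of the first table row: with a Beta(a, b) prior on a Bernoulli parameter, the posterior after observing the data is again a Beta distribution, with the counts simply added to the prior parameters. The particular values of a, b and the true parameter below are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Beta(a, b) prior on the Bernoulli parameter mu (illustrative values)
    a, b = 2.0, 2.0

    # Observe 20 coin flips from a Bernoulli with true parameter 0.7
    x = rng.binomial(1, 0.7, size=20)
    heads, tails = x.sum(), len(x) - x.sum()

    # Conjugacy: the posterior stays in the Beta family
    a_post, b_post = a + heads, b + tails
    print(f"posterior: Beta({a_post}, {b_post}), mean = {a_post / (a_post + b_post):.3f}")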
Conjugate Prior to a Gaussian Distribution

Example: if the likelihood function is Gaussian, choosing a Gaussian prior for the mean will ensure that the posterior distribution is also Gaussian.
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x of the form

    p(x) = N(x | μ, Λ⁻¹)
    p(y | x) = N(y | Ax + b, L⁻¹)

we get

    p(y) = N(y | Aμ + b, L⁻¹ + AΛ⁻¹Aᵀ)
    p(x | y) = N(x | Σ{Aᵀ L (y − b) + Λμ}, Σ)

where Σ = (Λ + Aᵀ L A)⁻¹.
Note that the covariance Σ does not involve y.
Conjugate Prior to a Gaussian Distribution (intuition)

Given

    p(x) = N(x | μ, Λ⁻¹)
    p(y | x) = N(y | Ax + b, L⁻¹)   ⇔   y = Ax + b + ε,  ε ~ N(0, L⁻¹)

we have E[y] = E[Ax + b] = Aμ + b, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,

    cov[y] = cov[Ax + b] + cov[ε] = A cov[x] Aᵀ + L⁻¹ = AΛ⁻¹Aᵀ + L⁻¹.

So y is Gaussian with

    p(y) = N(y | Aμ + b, L⁻¹ + AΛ⁻¹Aᵀ).

Then letting Σ = (Λ + Aᵀ L A)⁻¹ and

    p(x | y) = N(x | Σ{Aᵀ L (y − b) + Λμ}, Σ)   ⇔   x = Σ{Aᵀ L (y − b) + Λμ} + ε′,  ε′ ~ N(0, Σ)

yields the correct moments for x, since

    E[x] = E[Σ{Aᵀ L (y − b) + Λμ}] = Σ{Aᵀ L (Aμ + b − b) + Λμ}
         = Σ{Aᵀ L A μ + Λμ} = (Λ + Aᵀ L A)⁻¹ (Aᵀ L A + Λ) μ = μ,

and it is similar (but tedious; don't do it) to recover cov[x] = Λ⁻¹.
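The moments of p(y) above are easy to check numerically. The following sketch samples from the generative model y = Ax + b + ε and compares the empirical mean and covariance with Aμ + b and L⁻¹ + AΛ⁻¹Aᵀ; the dimensions and parameter values are arbitrary illustrative choices, not from the slides.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative parameters (any valid choices work)
    mu = np.array([1.0, -1.0, 0.5])
    Lam = np.diag([2.0, 1.0, 4.0])            # precision of p(x)
    A = rng.normal(size=(2, 3))
    b = np.array([0.3, -0.2])
    L = np.diag([5.0, 3.0])                   # precision of p(y | x)

    # Sample x ~ N(mu, Lam^{-1}), then y = A x + b + eps with eps ~ N(0, L^{-1})
    n = 200_000
    x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
    eps = rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)
    y = x @ A.T + b + eps

    # Compare empirical moments with the closed-form mean and covariance of p(y)
    print("mean diff:", np.abs(y.mean(axis=0) - (A @ mu + b)).max())
    print("cov  diff:", np.abs(np.cov(y.T) - (np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)).max())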
Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0:

    p(w) = N(w | m_0, S_0)

Same likelihood as before (here written in vector form):

    p(t | w, β) = N(t | Φw, β⁻¹ I)

Given N data pairs (x_n, t_n), the posterior is

    p(w | t) = N(w | m_N, S_N)

where

    m_N = S_N (S_0⁻¹ m_0 + β Φᵀ t)
    S_N⁻¹ = S_0⁻¹ + β Φᵀ Φ

(derive this with the identities on the previous slides)
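As a minimal sketch (assuming NumPy and a design matrix Phi whose rows are φ(x_n)ᵀ), the posterior update above is a few lines of linear algebra; the helper name `posterior` is introduced here for illustration only.

    import numpy as np

    def posterior(Phi, t, m0, S0, beta):
        """Gaussian posterior N(w | mN, SN) for Bayesian linear regression
        with prior N(w | m0, S0) and noise precision beta."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi          # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
        return mN, SN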
Bayesian Regression: Zero Mean, Isotropic Prior

For simplicity we proceed with m_0 = 0 and S_0 = α⁻¹ I, so

    p(w | α) = N(w | 0, α⁻¹ I)

The posterior becomes p(w | t) = N(w | m_N, S_N) with

    m_N = β S_N Φᵀ t
    S_N⁻¹ = α I + β Φᵀ Φ

For α ≪ β we get

    m_N → w_ML = (Φᵀ Φ)⁻¹ Φᵀ t

The log of the posterior is the sum of the log likelihood and the log of the prior:

    ln p(w | t) = −(β/2) (t − Φw)ᵀ(t − Φw) − (α/2) wᵀw + const
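A small numerical check of the limit above, on an arbitrary toy regression problem (the design matrix, true weights and noise precision below are illustrative assumptions): as α becomes negligible relative to β, the posterior mean m_N approaches the maximum likelihood solution.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy design matrix and targets (not the slide's data)
    N, M = 50, 4
    Phi = rng.normal(size=(N, M))
    w_true = np.array([0.5, -1.0, 2.0, 0.0])
    beta = 25.0                                        # noise precision
    t = Phi @ w_true + rng.normal(scale=beta**-0.5, size=N)

    def posterior_isotropic(Phi, t, alpha, beta):
        SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
        return beta * SN @ Phi.T @ t, SN               # m_N, S_N

    w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]      # (Phi^T Phi)^{-1} Phi^T t
    for alpha in (10.0, 0.1, 1e-6):                    # alpha << beta recovers ML
        mN, _ = posterior_isotropic(Phi, t, alpha, beta)
        print(alpha, np.abs(mN - w_ml).max())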
Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of the prior:

    ln p(w | t) = −(β/2) ‖t − Φw‖²  −  (α/2) ‖w‖²  + const
                  [sum-of-squares error]  [regulariser]

The maximum a posteriori estimator

    w_MAP = argmax_w p(w | t)

corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient λ = α/β.
The posterior is Gaussian, so mode = mean: w_MAP = m_N.
For α ≪ β we recover unregularised least squares (equivalently, MAP approaches maximum likelihood), for example in the case of
  an infinitely broad prior with α → 0, or
  an infinitely precise likelihood with β → ∞.
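The MAP/ridge correspondence is also easy to verify numerically. The sketch below (toy data again, all values illustrative) checks that the posterior mean m_N coincides with the quadratically regularised least-squares solution when λ = α/β.

    import numpy as np

    rng = np.random.default_rng(3)
    Phi = rng.normal(size=(50, 4))                         # toy design matrix
    t = Phi @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.2, size=50)
    alpha, beta = 2.0, 25.0
    lam = alpha / beta                                     # regularisation coefficient

    # Posterior mean for the zero-mean isotropic prior ...
    SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t

    # ... equals the quadratically regularised least-squares solution
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
    print(np.allclose(mN, w_ridge))                        # True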
Bayesian Inference in General: Sequential Update of Belief

If we have not yet seen any data point (N = 0), the posterior is equal to the prior.
Sequential arrival of data points: the posterior given the data observed so far acts as the prior for the future data.
This fits nicely into a sequential learning framework.
Sequential Update of the Posterior

Example of a linear basis function model: single input x, single output t, linear model y(x, w) = w_0 + w_1 x.
True data generating procedure:
  1. Choose x_n from the uniform distribution U(x | −1, +1).
  2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = −0.3, a_1 = 0.5.
  3. Add Gaussian noise with standard deviation σ = 0.2, i.e. sample t_n from N(t_n | f(x_n, a), 0.04).
Set the precision of the zero-mean isotropic Gaussian prior to α = 2.0.
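A minimal sketch of this example in code, assuming the generative procedure and α above and the posterior update from the earlier slide (β = 1/σ² = 25); processing the data one point at a time, the posterior after n points serves as the prior for point n + 1.

    import numpy as np

    rng = np.random.default_rng(4)
    a0, a1 = -0.3, 0.5                               # true parameters from the slide
    sigma = 0.2
    alpha, beta = 2.0, 1.0 / sigma**2

    # Start from the prior N(w | 0, alpha^{-1} I) and update sequentially
    m, S = np.zeros(2), np.eye(2) / alpha
    for n in range(20):
        x = rng.uniform(-1, 1)
        t = a0 + a1 * x + rng.normal(scale=sigma)
        phi = np.array([1.0, x])                     # basis (1, x) for y = w0 + w1 x
        S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
        m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
        S = S_new

    print("posterior mean after 20 points:", m)      # typically close to (-0.3, 0.5)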
Predictive Distribution

In the training phase, data 𝐱 and targets 𝐭 are provided (boldface denotes the training set).
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data 𝐱 and the training targets 𝐭,

    p(t | x, 𝐱, 𝐭)

This is the predictive distribution (cf. the posterior distribution, which is over the parameters).
How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule:

    p(t | x, 𝐱, 𝐭) = ∫ p(t, w | x, 𝐱, 𝐭) dw
                   = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

The test target t depends only on the test data x and the model parameter w, but not on the training data and the training targets:

    p(t | w, x, 𝐱, 𝐭) = p(t | w, x)

The model parameters w are learned from the training data 𝐱 and the training targets 𝐭 only:

    p(w | x, 𝐱, 𝐭) = p(w | 𝐱, 𝐭)

Predictive distribution:

    p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x) p(w | 𝐱, 𝐭) dw
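Before the closed form derived later, a generic way to read this integral is as an average over the posterior: draw w from p(w | 𝐱, 𝐭) and then t from p(t | w, x). The sketch below does exactly that by Monte Carlo; the helper name `predictive_samples` and the particular m_N, S_N, β values are placeholders for illustration, not from the slides.

    import numpy as np

    rng = np.random.default_rng(5)

    def predictive_samples(mN, SN, beta, phi_x, n_samples=5000):
        """Monte Carlo approximation of p(t | x, training data):
        draw w ~ N(mN, SN), then t ~ N(t | w^T phi(x), beta^{-1})."""
        w = rng.multivariate_normal(mN, SN, size=n_samples)
        return w @ phi_x + rng.normal(scale=beta**-0.5, size=n_samples)

    # Example: predictive samples at x = 0.5 for the model y = w0 + w1 x
    mN, SN, beta = np.array([-0.3, 0.5]), 0.01 * np.eye(2), 25.0   # placeholder posterior
    t_samp = predictive_samples(mN, SN, beta, np.array([1.0, 0.5]))
    print(t_samp.mean(), t_samp.std())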
Proof of the Predictive Distribution

The predictive distribution is

    p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

because

    ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ [ p(t, w, x, 𝐱, 𝐭) / p(w, x, 𝐱, 𝐭) ] [ p(w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) ] dw
                                          = ∫ p(t, w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) dw
                                          = p(t, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭)
                                          = p(t | x, 𝐱, 𝐭),

or simply

    ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ p(t, w | x, 𝐱, 𝐭) dw = p(t | x, 𝐱, 𝐭).
Predictive Distribution with Simplified Prior

Find the predictive distribution

    p(t | 𝐭, α, β) = ∫ p(t | w, β) p(w | 𝐭, α, β) dw

(remember: conditioning on the inputs x is often suppressed to simplify the notation).
Now we know (again neglecting to notate the conditioning on x)

    p(t | w, β) = N(t | wᵀφ(x), β⁻¹)

and the posterior was

    p(w | 𝐭, α, β) = N(w | m_N, S_N)

where

    m_N = β S_N Φᵀ 𝐭
    S_N⁻¹ = α I + β Φᵀ Φ
Predictive Distribution with Simplified Prior

If we do the integral (it turns out to be the convolution of two Gaussians), we get for the predictive distribution

    p(t | x, 𝐭, α, β) = N(t | m_Nᵀ φ(x), σ_N²(x))

where the variance σ_N²(x) is given by

    σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x).

This is more easily shown using a similar approach to the earlier "intuition" slide, again with the Bienaymé formula, now using

    t = wᵀφ(x) + ε,  ε ~ N(0, β⁻¹).

However, this is a linear-Gaussian specific trick; in general we need to integrate out the parameters.
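The closed form above is trivial to evaluate once m_N and S_N are known. A minimal sketch (the helper name `predictive` and the m_N, S_N, β values are illustrative placeholders); its output should agree with the Monte Carlo estimate from the earlier sketch.

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        """Closed-form predictive distribution N(t | mN^T phi(x), sigma_N^2(x))."""
        mean = mN @ phi_x
        var = 1.0 / beta + phi_x @ SN @ phi_x
        return mean, var

    mN, SN, beta = np.array([-0.3, 0.5]), 0.01 * np.eye(2), 25.0   # placeholder posterior
    mean, var = predictive(np.array([1.0, 0.5]), mN, SN, beta)
    print(mean, var**0.5)    # predictive mean and standard deviation at x = 0.5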
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 1.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 2.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 4.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 25.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 1.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 2.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 4.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 25.
[Figure: sampled functions plotted as t against x.]
Limitations of Linear Basis Function Models

Basis functions φ_j(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
  Data lie close to a nonlinear manifold whose intrinsic dimensionality is much smaller than D. We need algorithms which place basis functions only where data are (e.g. kernel methods / Gaussian processes).
  Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (e.g. linear methods or shallow neural networks).
Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using the intuition gained from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε?
The volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D
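Plugging a few numbers into this formula makes the effect concrete: even a very thin shell contains almost all of the volume once D is large. A short sketch (the values of ε and D are illustrative):

    # Fraction of a unit D-ball's volume within distance eps of its surface
    eps = 0.01
    for D in (1, 2, 5, 20, 200):
        print(D, 1 - (1 - eps)**D)
    # The fraction approaches 1 as D grows, even for eps = 0.01.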
Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε:

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D

[Figure: volume fraction 1 − (1 − ε)^D as a function of ε for D = 1, 2, 5, 20.]
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D.
[Figure: p(r) against r for D = 1, 2, 20.]
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I:

    N(x | 0, I) = (1/2π) exp{ −½ xᵀx } = (1/2π) exp{ −½ (x_1² + x_2²) }

Coordinate transformation:

    x_1 = r cos(φ)    x_2 = r sin(φ)

Probability density in the new coordinates:

    p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J|

where |J| = r is the determinant of the Jacobian of the given coordinate transformation, so

    p(r, φ | 0, I) = (1/2π) r exp{ −½ r² }
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

    p(r, φ | 0, I) = (1/2π) r exp{ −½ r² }

Integrating over all angles φ:

    p(r | 0, I) = ∫_0^{2π} (1/2π) r exp{ −½ r² } dφ = r exp{ −½ r² }

[Figure: p(r) against r for D = 1, 2, 20.]
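The same behaviour can be seen empirically without any change of variables: sample standard Gaussian vectors in D dimensions and look at the distribution of their norms. A minimal sketch (sample size and the chosen values of D are illustrative); the mass concentrates in a thin shell whose radius grows with D.

    import numpy as np

    rng = np.random.default_rng(6)

    # Empirical distribution of r = ||x|| for x ~ N(0, I_D)
    for D in (1, 2, 20):
        r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
        print(f"D = {D:2d}: mean radius = {r.mean():.2f}, std = {r.std():.2f}")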
Summary: Linear Regression

Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality