Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University
Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part IV
Linear Regression 2
Linear Regression
Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Training and Testing: (Non-Bayesian) Point Estimate

Training phase: a model with adjustable parameter w is fit to the training data x and the training targets t, and the most appropriate w* is fixed.
Test phase: the model with the fixed parameter w* maps new test data x to a predicted test target t.
Bayesian Regression

Bayes' theorem: posterior = (likelihood × prior) / normalisation,

    p(w | t) = p(t | w) p(w) / p(t)

where we leave out the conditioning on x (always assumed) and on β, which is assumed to be constant.
The i.i.d. regression likelihood for additive Gaussian noise is

    p(t | w) = ∏_{n=1}^N N(t_n | y(x_n, w), β⁻¹)
             = ∏_{n=1}^N N(t_n | wᵀφ(x_n), β⁻¹)
             = const × exp{ −(β/2) (t − Φw)ᵀ(t − Φw) }
             = N(t | Φw, β⁻¹ I)
How to choose a prior?

The choice of prior affords an intuitive control over our inductive bias.
All inference schemes have such biases, and they often arise more opaquely than the prior in Bayes' rule.
Can we find a prior for the given likelihood which
  makes sense for the problem at hand, and
  allows us to find the posterior in a 'nice' form?
An answer to the second question:

Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood     Conjugate Prior
    Bernoulli      Beta
    Binomial       Beta
    Poisson        Gamma
    Multinomial    Dirichlet

Table: Continuous likelihood distributions

    Likelihood             Conjugate Prior
    Uniform                Pareto
    Exponential            Gamma
    Normal                 Normal (mean parameter)
    Multivariate normal    Multivariate normal (mean parameter)
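As a minimal sketch (not from the slides) of the first table row: with a Beta(a, b) prior on a Bernoulli parameter, the posterior after observing the data is again a Beta distribution, with the counts simply added to the prior parameters. The particular values of a, b and the true parameter below are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Beta(a, b) prior on the Bernoulli parameter mu (illustrative values)
    a, b = 2.0, 2.0

    # Observe 20 coin flips from a Bernoulli with true parameter 0.7
    x = rng.binomial(1, 0.7, size=20)
    heads, tails = x.sum(), len(x) - x.sum()

    # Conjugacy: the posterior stays in the Beta family
    a_post, b_post = a + heads, b + tails
    print(f"posterior: Beta({a_post}, {b_post}), mean = {a_post / (a_post + b_post):.3f}")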
Conjugate Prior to a Gaussian Distribution

Example: if the likelihood function is Gaussian, choosing a Gaussian prior for the mean will ensure that the posterior distribution is also Gaussian.
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x of the form

    p(x) = N(x | μ, Λ⁻¹)
    p(y | x) = N(y | Ax + b, L⁻¹)

we get

    p(y) = N(y | Aμ + b, L⁻¹ + AΛ⁻¹Aᵀ)
    p(x | y) = N(x | Σ{Aᵀ L (y − b) + Λμ}, Σ)

where Σ = (Λ + Aᵀ L A)⁻¹.
Note that the covariance Σ does not involve y.
Conjugate Prior to a Gaussian Distribution (intuition)

Given

    p(x) = N(x | μ, Λ⁻¹)
    p(y | x) = N(y | Ax + b, L⁻¹)   ⇔   y = Ax + b + ε,  ε ~ N(0, L⁻¹)

we have E[y] = E[Ax + b] = Aμ + b, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,

    cov[y] = cov[Ax + b] + cov[ε] = A cov[x] Aᵀ + L⁻¹ = AΛ⁻¹Aᵀ + L⁻¹.

So y is Gaussian with

    p(y) = N(y | Aμ + b, L⁻¹ + AΛ⁻¹Aᵀ).

Then letting Σ = (Λ + Aᵀ L A)⁻¹ and

    p(x | y) = N(x | Σ{Aᵀ L (y − b) + Λμ}, Σ)   ⇔   x = Σ{Aᵀ L (y − b) + Λμ} + ε′,  ε′ ~ N(0, Σ)

yields the correct moments for x, since

    E[x] = E[Σ{Aᵀ L (y − b) + Λμ}] = Σ{Aᵀ L (Aμ + b − b) + Λμ}
         = Σ{Aᵀ L A μ + Λμ} = (Λ + Aᵀ L A)⁻¹ (Aᵀ L A + Λ) μ = μ,

and it is similar (but tedious; don't do it) to recover cov[x] = Λ⁻¹.
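The moments of p(y) above are easy to check numerically. The following sketch samples from the generative model y = Ax + b + ε and compares the empirical mean and covariance with Aμ + b and L⁻¹ + AΛ⁻¹Aᵀ; the dimensions and parameter values are arbitrary illustrative choices, not from the slides.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative parameters (any valid choices work)
    mu = np.array([1.0, -1.0, 0.5])
    Lam = np.diag([2.0, 1.0, 4.0])            # precision of p(x)
    A = rng.normal(size=(2, 3))
    b = np.array([0.3, -0.2])
    L = np.diag([5.0, 3.0])                   # precision of p(y | x)

    # Sample x ~ N(mu, Lam^{-1}), then y = A x + b + eps with eps ~ N(0, L^{-1})
    n = 200_000
    x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
    eps = rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)
    y = x @ A.T + b + eps

    # Compare empirical moments with the closed-form mean and covariance of p(y)
    print("mean diff:", np.abs(y.mean(axis=0) - (A @ mu + b)).max())
    print("cov  diff:", np.abs(np.cov(y.T) - (np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)).max())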
Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0:

    p(w) = N(w | m_0, S_0)

Same likelihood as before (here written in vector form):

    p(t | w, β) = N(t | Φw, β⁻¹ I)

Given N data pairs (x_n, t_n), the posterior is

    p(w | t) = N(w | m_N, S_N)

where

    m_N = S_N (S_0⁻¹ m_0 + β Φᵀ t)
    S_N⁻¹ = S_0⁻¹ + β Φᵀ Φ

(derive this with the identities on the previous slides)
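As a minimal sketch (assuming NumPy and a design matrix Phi whose rows are φ(x_n)ᵀ), the posterior update above is a few lines of linear algebra; the helper name `posterior` is introduced here for illustration only.

    import numpy as np

    def posterior(Phi, t, m0, S0, beta):
        """Gaussian posterior N(w | mN, SN) for Bayesian linear regression
        with prior N(w | m0, S0) and noise precision beta."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi          # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
        return mN, SN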
Bayesian Regression: Zero Mean, Isotropic Prior

For simplicity we proceed with m_0 = 0 and S_0 = α⁻¹ I, so

    p(w | α) = N(w | 0, α⁻¹ I)

The posterior becomes p(w | t) = N(w | m_N, S_N) with

    m_N = β S_N Φᵀ t
    S_N⁻¹ = α I + β Φᵀ Φ

For α ≪ β we get

    m_N → w_ML = (Φᵀ Φ)⁻¹ Φᵀ t

The log of the posterior is the sum of the log likelihood and the log of the prior:

    ln p(w | t) = −(β/2) (t − Φw)ᵀ(t − Φw) − (α/2) wᵀw + const
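A small numerical check of the limit above, on an arbitrary toy regression problem (the design matrix, true weights and noise precision below are illustrative assumptions): as α becomes negligible relative to β, the posterior mean m_N approaches the maximum likelihood solution.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy design matrix and targets (not the slide's data)
    N, M = 50, 4
    Phi = rng.normal(size=(N, M))
    w_true = np.array([0.5, -1.0, 2.0, 0.0])
    beta = 25.0                                        # noise precision
    t = Phi @ w_true + rng.normal(scale=beta**-0.5, size=N)

    def posterior_isotropic(Phi, t, alpha, beta):
        SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
        return beta * SN @ Phi.T @ t, SN               # m_N, S_N

    w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]      # (Phi^T Phi)^{-1} Phi^T t
    for alpha in (10.0, 0.1, 1e-6):                    # alpha << beta recovers ML
        mN, _ = posterior_isotropic(Phi, t, alpha, beta)
        print(alpha, np.abs(mN - w_ml).max())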
Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of the prior:

    ln p(w | t) = −(β/2) ‖t − Φw‖²  −  (α/2) ‖w‖²  + const
                  [sum-of-squares error]  [regulariser]

The maximum a posteriori estimator

    w_MAP = argmax_w p(w | t)

corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient λ = α/β.
The posterior is Gaussian, so mode = mean: w_MAP = m_N.
For α ≪ β we recover unregularised least squares (equivalently, MAP approaches maximum likelihood), for example in the case of
  an infinitely broad prior with α → 0, or
  an infinitely precise likelihood with β → ∞.
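The MAP/ridge correspondence is also easy to verify numerically. The sketch below (toy data again, all values illustrative) checks that the posterior mean m_N coincides with the quadratically regularised least-squares solution when λ = α/β.

    import numpy as np

    rng = np.random.default_rng(3)
    Phi = rng.normal(size=(50, 4))                         # toy design matrix
    t = Phi @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.2, size=50)
    alpha, beta = 2.0, 25.0
    lam = alpha / beta                                     # regularisation coefficient

    # Posterior mean for the zero-mean isotropic prior ...
    SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t

    # ... equals the quadratically regularised least-squares solution
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
    print(np.allclose(mN, w_ridge))                        # True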
Bayesian Inference in General: Sequential Update of Belief

If we have not yet seen any data point (N = 0), the posterior is equal to the prior.
Sequential arrival of data points: the posterior given the data observed so far acts as the prior for the future data.
This fits nicely into a sequential learning framework.
Sequential Update of the Posterior

Example of a linear basis function model: single input x, single output t, linear model y(x, w) = w_0 + w_1 x.
True data generating procedure:
  1. Choose x_n from the uniform distribution U(x | −1, +1).
  2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = −0.3, a_1 = 0.5.
  3. Add Gaussian noise with standard deviation σ = 0.2, i.e. sample t_n from N(t_n | f(x_n, a), 0.04).
Set the precision of the zero-mean isotropic Gaussian prior to α = 2.0.
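A minimal sketch of this example in code, assuming the generative procedure and α above and the posterior update from the earlier slide (β = 1/σ² = 25); processing the data one point at a time, the posterior after n points serves as the prior for point n + 1.

    import numpy as np

    rng = np.random.default_rng(4)
    a0, a1 = -0.3, 0.5                               # true parameters from the slide
    sigma = 0.2
    alpha, beta = 2.0, 1.0 / sigma**2

    # Start from the prior N(w | 0, alpha^{-1} I) and update sequentially
    m, S = np.zeros(2), np.eye(2) / alpha
    for n in range(20):
        x = rng.uniform(-1, 1)
        t = a0 + a1 * x + rng.normal(scale=sigma)
        phi = np.array([1.0, x])                     # basis (1, x) for y = w0 + w1 x
        S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
        m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
        S = S_new

    print("posterior mean after 20 points:", m)      # typically close to (-0.3, 0.5)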
Predictive Distribution

In the training phase, data 𝐱 and targets 𝐭 are provided (boldface denotes the training set).
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data 𝐱 and the training targets 𝐭,

    p(t | x, 𝐱, 𝐭)

This is the predictive distribution (cf. the posterior distribution, which is over the parameters).
How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule:

    p(t | x, 𝐱, 𝐭) = ∫ p(t, w | x, 𝐱, 𝐭) dw
                   = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

The test target t depends only on the test data x and the model parameter w, but not on the training data and the training targets:

    p(t | w, x, 𝐱, 𝐭) = p(t | w, x)

The model parameters w are learned from the training data 𝐱 and the training targets 𝐭 only:

    p(w | x, 𝐱, 𝐭) = p(w | 𝐱, 𝐭)

Predictive distribution:

    p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x) p(w | 𝐱, 𝐭) dw
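Before the closed form derived later, a generic way to read this integral is as an average over the posterior: draw w from p(w | 𝐱, 𝐭) and then t from p(t | w, x). The sketch below does exactly that by Monte Carlo; the helper name `predictive_samples` and the particular m_N, S_N, β values are placeholders for illustration, not from the slides.

    import numpy as np

    rng = np.random.default_rng(5)

    def predictive_samples(mN, SN, beta, phi_x, n_samples=5000):
        """Monte Carlo approximation of p(t | x, training data):
        draw w ~ N(mN, SN), then t ~ N(t | w^T phi(x), beta^{-1})."""
        w = rng.multivariate_normal(mN, SN, size=n_samples)
        return w @ phi_x + rng.normal(scale=beta**-0.5, size=n_samples)

    # Example: predictive samples at x = 0.5 for the model y = w0 + w1 x
    mN, SN, beta = np.array([-0.3, 0.5]), 0.01 * np.eye(2), 25.0   # placeholder posterior
    t_samp = predictive_samples(mN, SN, beta, np.array([1.0, 0.5]))
    print(t_samp.mean(), t_samp.std())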
Proof of the Predictive Distribution

The predictive distribution is

    p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

because

    ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ [ p(t, w, x, 𝐱, 𝐭) / p(w, x, 𝐱, 𝐭) ] [ p(w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) ] dw
                                          = ∫ p(t, w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) dw
                                          = p(t, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭)
                                          = p(t | x, 𝐱, 𝐭),

or simply

    ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ p(t, w | x, 𝐱, 𝐭) dw = p(t | x, 𝐱, 𝐭).
Predictive Distribution with Simplified Prior

Find the predictive distribution

    p(t | 𝐭, α, β) = ∫ p(t | w, β) p(w | 𝐭, α, β) dw

(remember: conditioning on the inputs x is often suppressed to simplify the notation).
Now we know (again neglecting to notate the conditioning on x)

    p(t | w, β) = N(t | wᵀφ(x), β⁻¹)

and the posterior was

    p(w | 𝐭, α, β) = N(w | m_N, S_N)

where

    m_N = β S_N Φᵀ 𝐭
    S_N⁻¹ = α I + β Φᵀ Φ
Predictive Distribution with Simplified Prior

If we do the integral (it turns out to be the convolution of two Gaussians), we get for the predictive distribution

    p(t | x, 𝐭, α, β) = N(t | m_Nᵀ φ(x), σ_N²(x))

where the variance σ_N²(x) is given by

    σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x).

This is more easily shown using a similar approach to the earlier "intuition" slide, again with the Bienaymé formula, now using

    t = wᵀφ(x) + ε,  ε ~ N(0, β⁻¹).

However, this is a linear-Gaussian specific trick; in general we need to integrate out the parameters.
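The closed form above is trivial to evaluate once m_N and S_N are known. A minimal sketch (the helper name `predictive` and the m_N, S_N, β values are illustrative placeholders); its output should agree with the Monte Carlo estimate from the earlier sketch.

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        """Closed-form predictive distribution N(t | mN^T phi(x), sigma_N^2(x))."""
        mean = mN @ phi_x
        var = 1.0 / beta + phi_x @ SN @ phi_x
        return mean, var

    mN, SN, beta = np.array([-0.3, 0.5]), 0.01 * np.eye(2), 25.0   # placeholder posterior
    mean, var = predictive(np.array([1.0, 0.5]), mN, SN, beta)
    print(mean, var**0.5)    # predictive mean and standard deviation at x = 0.5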
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 1.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 2.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 4.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 25.
[Figure: mean of the predictive distribution (red) and region of one standard deviation around the mean (red shaded), plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 1.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 2.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 4.
[Figure: sampled functions plotted as t against x.]
Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 25.
[Figure: sampled functions plotted as t against x.]
Limitations of Linear Basis Function Models

Basis functions φ_j(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
  Data lie close to a nonlinear manifold whose intrinsic dimensionality is much smaller than D. We need algorithms which place basis functions only where data are (e.g. kernel methods / Gaussian processes).
  Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (e.g. linear methods or shallow neural networks).
Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using the intuition gained from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε?
The volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D
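Plugging a few numbers into this formula makes the effect concrete: even a very thin shell contains almost all of the volume once D is large. A short sketch (the values of ε and D are illustrative):

    # Fraction of a unit D-ball's volume within distance eps of its surface
    eps = 0.01
    for D in (1, 2, 5, 20, 200):
        print(D, 1 - (1 - eps)**D)
    # The fraction approaches 1 as D grows, even for eps = 0.01.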
Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε:

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D

[Figure: volume fraction 1 − (1 − ε)^D as a function of ε for D = 1, 2, 5, 20.]
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D.
[Figure: p(r) against r for D = 1, 2, 20.]
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I:

    N(x | 0, I) = (1/2π) exp{ −½ xᵀx } = (1/2π) exp{ −½ (x_1² + x_2²) }

Coordinate transformation:

    x_1 = r cos(φ)    x_2 = r sin(φ)

Probability density in the new coordinates:

    p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J|

where |J| = r is the determinant of the Jacobian of the given coordinate transformation, so

    p(r, φ | 0, I) = (1/2π) r exp{ −½ r² }
Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

    p(r, φ | 0, I) = (1/2π) r exp{ −½ r² }

Integrating over all angles φ:

    p(r | 0, I) = ∫_0^{2π} (1/2π) r exp{ −½ r² } dφ = r exp{ −½ r² }

[Figure: p(r) against r for D = 1, 2, 20.]
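The same behaviour can be seen empirically without any change of variables: sample standard Gaussian vectors in D dimensions and look at the distribution of their norms. A minimal sketch (sample size and the chosen values of D are illustrative); the mass concentrates in a thin shell whose radius grows with D.

    import numpy as np

    rng = np.random.default_rng(6)

    # Empirical distribution of r = ||x|| for x ~ N(0, I_D)
    for D in (1, 2, 20):
        r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
        print(f"D = {D:2d}: mean radius = {r.mean():.2f}, std = {r.std():.2f}")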
Summary: Linear Regression

Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality