
Statistical Machine Learning

Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020

© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Outline:
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part II: Introduction

Topics: Polynomial Curve Fitting · Probability Theory · Probability Densities · Expectations and Covariances


Flavour of this course

Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, and linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices made when designing machine learning methods


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.


Polynomial Curve Fitting

Some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

[Figure: the training data points t plotted against x ∈ [0, 1], with t ∈ [−1, 1].]
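As a concrete illustration (not from the slides), a minimal sketch of how such a data set can be generated with NumPy. The noise level 0.3 and the evenly spaced inputs are assumptions; the slides only state sin(2πx) plus random noise on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_std=0.3):
    """Sample n targets t = sin(2*pi*x) + Gaussian noise at evenly spaced x.

    noise_std = 0.3 is an assumed noise level; the slides do not specify it.
    """
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

x, t = make_data(10)  # N = 10, as on the following slides
```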


Polynomial Curve Fitting - Input Specification

N = 10

x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T

x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N


Polynomial Curve Fitting - Model Specification

M : order of the polynomial

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m

y(x, w) is a nonlinear function of x, but a linear function of the unknown model parameters w.
How can we find good parameters w = (w_0, ..., w_M)^T? (A least-squares sketch follows below.)
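One standard answer is linear least squares on the design matrix Φ with entries Φ_{nm} = x_n^m. A minimal sketch, assuming NumPy; the names fit_polynomial and predict are illustrative, not from the slides.

```python
def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial: w minimising ||Phi w - t||^2."""
    Phi = np.vander(x, M + 1, increasing=True)  # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_m w_m * x^m."""
    return np.vander(np.atleast_1d(x), len(w), increasing=True) @ w
```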


Learning is Improving Performance

[Figure: each error contribution is the vertical displacement y(x_n, w) − t_n between the model prediction and the target at x_n.]

Performance measure: the error between the targets and the predictions of the model on the training data,

E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they?).
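The error function transcribes directly into code; this sketch reuses predict from the least-squares sketch above.

```python
def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, as on the slide."""
    return 0.5 * np.sum((predict(x, w) - t) ** 2)
```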



Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=0} = w_0

[Figure: M = 0 fit (a constant) against the data, x ∈ [0, 1], t ∈ [−1, 1].]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=1} = w_0 + w_1 x

[Figure: M = 1 fit (a straight line) against the data.]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3

[Figure: M = 3 fit against the data.]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9

Overfitting: with M = 9 the polynomial has as many free coefficients (10) as data points (N = 10), so it can fit the training data exactly while behaving wildly between the points.

[Figure: M = 9 fit against the data.]


Testing the Model

Train the model and get w⋆.
Get 100 new data points.
Root-mean-square (RMS) error:

E_{RMS} = \sqrt{2 E(w⋆) / N}

[Figure: E_RMS on the training and test sets as a function of M ∈ {0, ..., 9}.]
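E_RMS also transcribes directly: dividing by N makes errors comparable across data sets of different size, and the square root puts the error on the same scale as the targets t. This sketch reuses the illustrative helpers defined above.

```python
def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N)."""
    return np.sqrt(2.0 * sum_of_squares_error(w, x, t) / len(x))

w_star = fit_polynomial(x, t, M=3)  # M = 3 is one of the orders compared above
print(rms_error(w_star, x, t))      # training error; use fresh data for test error
```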


Testing the Model

         M = 0    M = 1    M = 3         M = 9
w⋆_0      0.19     0.82     0.31          0.35
w⋆_1              -1.27     7.99        232.37
w⋆_2                      -25.43      -5321.83
w⋆_3                       17.37      48568.31
w⋆_4                                -231639.30
w⋆_5                                 640042.26
w⋆_6                               -1061800.52
w⋆_7                                1042400.18
w⋆_8                                -557682.99
w⋆_9                                 125201.43

Table: Coefficients w⋆ for polynomials of various order.


More Data

N = 15

[Figure: M = 9 fit with N = 15 data points, x ∈ [0, 1], t ∈ [−1, 1].]


More Data

N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: M = 9 fit with N = 100 data points.]


Regularisation

How can we constrain the growth of the coefficients w? Add a regularisation term to the error function:

E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2 + \frac{\lambda}{2} \|w\|^2

Squared norm of the parameter vector w:

\|w\|^2 ≡ w^T w = w_0^2 + w_1^2 + \cdots + w_M^2

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?). A closed-form sketch follows below.
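For this quadratic objective, setting the gradient to zero gives the regularised normal equations (Φ^T Φ + λI) w = Φ^T t; the slides do not show this derivation, so treat the sketch below as one standard way to compute the minimiser. Note that, matching the slide's ‖w‖², it also penalises w_0.

```python
def fit_polynomial_regularised(x, t, M, lam):
    """Minimise 1/2 ||Phi w - t||^2 + lam/2 ||w||^2.

    Solves (Phi^T Phi + lam * I) w = Phi^T t; for lam > 0 the system
    matrix is positive definite, so the minimiser is unique.
    """
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

w_reg = fit_polynomial_regularised(x, t, M=9, lam=np.exp(-18))  # ln(lambda) = -18
```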


Regularisation

M = 9

[Figure: M = 9 fit with regularisation, ln λ = −18.]


Regularisation

M = 9

[Figure: M = 9 fit with regularisation, ln λ = 0.]


Regularisation

M = 9

[Figure: E_RMS on the training and test sets as a function of ln λ, for ln λ ∈ [−35, −20].]


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!


Probability Theory

[Figure: histograms of the joint distribution p(X, Y) for Y = 1 and Y = 2 over the values of X.]


Probability Theory

Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as histograms of p(X, Y) for Y = 1 and Y = 2.]


Sum Rule

Using the counts table above:

p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

p(X = d) = \sum_Y p(X = d, Y)

and in general (the sum rule):

p(X) = \sum_Y p(X, Y)
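The sum rule is easy to verify numerically on the counts table. A sketch, assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# Joint counts from the table: rows Y = 2 and Y = 1, columns X = a, ..., i.
counts = np.array([
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
])
joint = counts / counts.sum()      # p(X, Y); the total count is 60

p_X = joint.sum(axis=0)            # sum rule: p(X) = sum_Y p(X, Y)
p_Y = joint.sum(axis=1)            # p(Y) = sum_X p(X, Y)

print(p_X[3])   # p(X = d) = 9/60
print(p_Y[1])   # p(Y = 1) = 34/60
```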


Sum Rule

p(X) = \sum_Y p(X, Y)

p(Y) = \sum_X p(X, Y)

[Figure: the marginal histograms p(X) and p(Y) computed from the joint counts.]


Product Rule

Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = \sum_X p(X, Y = 1) = 34/60

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

and in general (the product rule):

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is renormalisation of relative frequencies:

p(X | Y) = \frac{p(X, Y)}{p(Y)}
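The renormalisation view corresponds to dividing a row of the joint by its marginal; this continues the counts-table sketch above.

```python
# p(X | Y = 1): renormalise the Y = 1 row of the joint distribution.
p_X_given_Y1 = joint[1] / p_Y[1]

print(p_X_given_Y1[3])           # p(X = d | Y = 1) = 8/34
# Product rule check: p(X = d, Y = 1) = p(X = d | Y = 1) * p(Y = 1) = 8/60
print(p_X_given_Y1[3] * p_Y[1])
```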


Sum and Product Rules

p(X) = \sum_Y p(X, Y)

p(X | Y) = \frac{p(X, Y)}{p(Y)}

[Figure: the marginal p(X) and the conditional p(X | Y = 1) plotted against X.]


Sum Rule and Product Rule

Sum rule:

p(X) = \sum_Y p(X, Y)

Product rule:

p(X, Y) = p(X | Y) p(Y)

These rules form the basis of Bayesian machine learning, and of this course!


Bayes Theorem

Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes' theorem:

p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)}

only defined for p(X) > 0, and

p(X) = \sum_Y p(X, Y)              (sum rule)
     = \sum_Y p(X | Y) p(Y)        (product rule)
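Continuing the counts-table sketch, Bayes' theorem can be checked numerically:

```python
# p(Y = 1 | X = d) via Bayes' theorem, reusing the arrays defined above.
p_Y1_given_Xd = p_X_given_Y1[3] * p_Y[1] / p_X[3]
print(p_Y1_given_Xd)             # (8/34)(34/60) / (9/60) = 8/9

# Direct check from the joint: p(Y = 1 | X = d) = p(X = d, Y = 1) / p(X = d).
print(joint[1, 3] / p_X[3])      # also 8/9
```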


Probability Densities

Real-valued variable x ∈ R.
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.

p(x ∈ (a, b)) = \int_a^b p(x) \, dx

[Figure: a density p(x), the probability mass p(x) δx over a small interval δx, and the cumulative distribution function P(x).]


Constraints on p(x)

Nonnegative:

p(x) ≥ 0

Normalisation:

\int_{-\infty}^{\infty} p(x) \, dx = 1


Cumulative distribution function P(x)

P(x) = \int_{-\infty}^{x} p(z) \, dz

or, equivalently,

\frac{d}{dx} P(x) = p(x)


Multivariate Probability Density

Vector x ≡ (x_1, ..., x_D)^T

Nonnegative:

p(x) ≥ 0

Normalisation:

\int p(x) \, dx = 1

This means

\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x) \, dx_1 \ldots dx_D = 1


Sum and Product Rule for Probability Densities

Sum rule:

p(x) = \int_{-\infty}^{\infty} p(x, y) \, dy

Product rule:

p(x, y) = p(y | x) p(x)


Expectations

The weighted average of a function f(x) under the probability distribution p(x):

E[f] = \sum_x p(x) f(x)             (discrete distribution p(x))

E[f] = \int p(x) f(x) \, dx         (probability density p(x))


How to approximate E [f ]

Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)

How do we draw points from a probability distribution p(x)? A lecture on "Sampling" is coming; a numerical sketch follows below.
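A numerical sketch of the approximation, assuming NumPy. The choices f(x) = x² and a standard normal p(x) are illustrative; for that pair the exact value is E[x²] = µ² + σ² = 1 (see the Gaussian slides later in this part).

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f] with f(x) = x**2 and p(x) a standard normal.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
estimate = np.mean(samples ** 2)
print(estimate)   # close to the exact value 1; the error shrinks as N grows
```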


Expectation of a function of several variables

For an arbitrary function f(x, y):

E_x[f(x, y)] = \sum_x p(x) f(x, y)          (discrete distribution p(x))

E_x[f(x, y)] = \int p(x) f(x, y) \, dx      (probability density p(x))

Note that E_x[f(x, y)] is a function of y.


Conditional Expectation

For an arbitrary function f(x):

E_x[f | y] = \sum_x p(x | y) f(x)           (discrete distribution p(x))

E_x[f | y] = \int p(x | y) f(x) \, dx       (probability density p(x))

Note that E_x[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why?)

E_y[E_x[f(x) | y]] = \sum_y p(y) E_x[f | y] = \sum_y p(y) \sum_x p(x | y) f(x)
                   = \sum_{x,y} f(x) p(x, y)
                   = \sum_x f(x) p(x)
                   = E_x[f(x)]
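The identity can also be checked numerically on the counts table from the probability slides, reusing joint, p_X, and p_Y from that sketch; taking f(x) to be the position of x in a, ..., i is an arbitrary illustrative choice.

```python
f = np.arange(9.0)                        # arbitrary f(x) over the 9 values of X

E_f = p_X @ f                             # E_x[f(x)]
E_f_given_y = (joint / p_Y[:, None]) @ f  # E_x[f | y], one entry per value of y
print(E_f, p_Y @ E_f_given_y)             # the two values agree
```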


Variance

For an arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2

Special case: f(x) = x

var[x] = E[(x − E[x])^2] = E[x^2] − E[x]^2


Covariance

Two random variables x ∈ R and y ∈ R:

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_{x,y}[x] − a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] − a b − a b + a b    (since E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)
          = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]

The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.


Covariance for Vector Valued Variables

Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])] = E_{x,y}[x y^T] − E[x] E[y^T]
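A Monte Carlo sanity check of the matrix identity, assuming NumPy. The linear relationship y = A x + noise is an arbitrary construction chosen so that the cross-covariance is nonzero; with x ~ N(0, I) it equals A^T.

```python
rng = np.random.default_rng(1)

N = 100_000
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
x = rng.normal(size=(N, 2))                 # x ~ N(0, I), D = 2
y = x @ A.T + 0.1 * rng.normal(size=(N, 2))

# cov[x, y] = E[x y^T] - E[x] E[y]^T, estimated from samples.
cov_xy = (x.T @ y) / N - np.outer(x.mean(axis=0), y.mean(axis=0))
print(cov_xy)                               # approaches A^T = [[1, 0], [0.5, 2]]
```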


The Gaussian Distribution

x ∈ R
Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x − µ)^2 \right\}

[Figure: the Gaussian density N(x | µ, σ²), centred at its mean µ.]


The Gaussian Distribution

N(x | µ, σ²) > 0

\int_{-\infty}^{\infty} N(x | µ, σ²) \, dx = 1

Expectation of x:

E[x] = \int_{-\infty}^{\infty} N(x | µ, σ²) \, x \, dx = µ

Expectation of x²:

E[x²] = \int_{-\infty}^{\infty} N(x | µ, σ²) \, x² \, dx = µ² + σ²

Variance of x:

var[x] = E[x²] − E[x]² = σ²
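These moments are easy to confirm by sampling; µ = 1.5 and σ = 0.5 below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 1.5, 0.5
x = rng.normal(mu, sigma, size=200_000)

print(x.mean())          # ~ mu = 1.5
print((x ** 2).mean())   # ~ mu^2 + sigma^2 = 2.5
print(x.var())           # ~ sigma^2 = 0.25
```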


Strategy in this course

Estimate the best predictor = training = learning.
Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results