
Statistical Machine Learning

Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020

© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Outline:
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part II: Introduction

Topics: Polynomial Curve Fitting · Probability Theory · Probability Densities · Expectations and Covariances


Flavour of this course

Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, and linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices made when designing machine learning methods


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.


Polynomial Curve Fitting

Some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

[Figure: the training data points t plotted against x ∈ [0, 1], with t ∈ [−1, 1].]
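As a concrete illustration (not from the slides), a minimal sketch of how such a data set can be generated with NumPy. The noise level 0.3 and the evenly spaced inputs are assumptions; the slides only state sin(2πx) plus random noise on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_std=0.3):
    """Sample n targets t = sin(2*pi*x) + Gaussian noise at evenly spaced x.

    noise_std = 0.3 is an assumed noise level; the slides do not specify it.
    """
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

x, t = make_data(10)  # N = 10, as on the following slides
```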


Polynomial Curve Fitting - Input Specification

N = 10

x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T

x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N


Polynomial Curve Fitting - Model Specification

M : order of the polynomial

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m

y(x, w) is a nonlinear function of x, but a linear function of the unknown model parameters w.
How can we find good parameters w = (w_0, ..., w_M)^T? (A least-squares sketch follows below.)
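One standard answer is linear least squares on the design matrix Φ with entries Φ_{nm} = x_n^m. A minimal sketch, assuming NumPy; the names fit_polynomial and predict are illustrative, not from the slides.

```python
def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial: w minimising ||Phi w - t||^2."""
    Phi = np.vander(x, M + 1, increasing=True)  # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_m w_m * x^m."""
    return np.vander(np.atleast_1d(x), len(w), increasing=True) @ w
```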


Learning is Improving Performance

[Figure: each error contribution is the vertical displacement y(x_n, w) − t_n between the model prediction and the target at x_n.]

Performance measure: the error between the targets and the predictions of the model on the training data,

E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they?).
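The error function transcribes directly into code; this sketch reuses predict from the least-squares sketch above.

```python
def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, as on the slide."""
    return 0.5 * np.sum((predict(x, w) - t) ** 2)
```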



Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=0} = w_0

[Figure: M = 0 fit (a constant) against the data, x ∈ [0, 1], t ∈ [−1, 1].]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=1} = w_0 + w_1 x

[Figure: M = 1 fit (a straight line) against the data.]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3

[Figure: M = 3 fit against the data.]


Model Comparison or Model Selection

y(x, w) = \sum_{m=0}^{M} w_m x^m \Big|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9

Overfitting: with M = 9 the polynomial has as many free coefficients (10) as data points (N = 10), so it can fit the training data exactly while behaving wildly between the points.

[Figure: M = 9 fit against the data.]


Testing the Model

Train the model and get w⋆.
Get 100 new data points.
Root-mean-square (RMS) error:

E_{RMS} = \sqrt{2 E(w⋆) / N}

[Figure: E_RMS on the training and test sets as a function of M ∈ {0, ..., 9}.]
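E_RMS also transcribes directly: dividing by N makes errors comparable across data sets of different size, and the square root puts the error on the same scale as the targets t. This sketch reuses the illustrative helpers defined above.

```python
def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N)."""
    return np.sqrt(2.0 * sum_of_squares_error(w, x, t) / len(x))

w_star = fit_polynomial(x, t, M=3)  # M = 3 is one of the orders compared above
print(rms_error(w_star, x, t))      # training error; use fresh data for test error
```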


Testing the Model

         M = 0    M = 1    M = 3         M = 9
w⋆_0      0.19     0.82     0.31          0.35
w⋆_1              -1.27     7.99        232.37
w⋆_2                      -25.43      -5321.83
w⋆_3                       17.37      48568.31
w⋆_4                                -231639.30
w⋆_5                                 640042.26
w⋆_6                               -1061800.52
w⋆_7                                1042400.18
w⋆_8                                -557682.99
w⋆_9                                 125201.43

Table: Coefficients w⋆ for polynomials of various order.


More Data

N = 15

[Figure: M = 9 fit with N = 15 data points, x ∈ [0, 1], t ∈ [−1, 1].]


More Data

N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: M = 9 fit with N = 100 data points.]


Regularisation

How can we constrain the growth of the coefficients w? Add a regularisation term to the error function:

E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2 + \frac{\lambda}{2} \|w\|^2

Squared norm of the parameter vector w:

\|w\|^2 ≡ w^T w = w_0^2 + w_1^2 + \cdots + w_M^2

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?). A closed-form sketch follows below.
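For this quadratic objective, setting the gradient to zero gives the regularised normal equations (Φ^T Φ + λI) w = Φ^T t; the slides do not show this derivation, so treat the sketch below as one standard way to compute the minimiser. Note that, matching the slide's ‖w‖², it also penalises w_0.

```python
def fit_polynomial_regularised(x, t, M, lam):
    """Minimise 1/2 ||Phi w - t||^2 + lam/2 ||w||^2.

    Solves (Phi^T Phi + lam * I) w = Phi^T t; for lam > 0 the system
    matrix is positive definite, so the minimiser is unique.
    """
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

w_reg = fit_polynomial_regularised(x, t, M=9, lam=np.exp(-18))  # ln(lambda) = -18
```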


Regularisation

M = 9

[Figure: M = 9 fit with regularisation, ln λ = −18.]


Regularisation

M = 9

[Figure: M = 9 fit with regularisation, ln λ = 0.]


Regularisation

M = 9

[Figure: E_RMS on the training and test sets as a function of ln λ, for ln λ ∈ [−35, −20].]


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!


Probability Theory

[Figure: histograms of the joint distribution p(X, Y) for Y = 1 and Y = 2 over the values of X.]


Probability Theory

Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as histograms of p(X, Y) for Y = 1 and Y = 2.]


Sum Rule

Using the counts table above:

p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

p(X = d) = \sum_Y p(X = d, Y)

and in general (the sum rule):

p(X) = \sum_Y p(X, Y)
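The sum rule is easy to verify numerically on the counts table. A sketch, assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# Joint counts from the table: rows Y = 2 and Y = 1, columns X = a, ..., i.
counts = np.array([
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
])
joint = counts / counts.sum()      # p(X, Y); the total count is 60

p_X = joint.sum(axis=0)            # sum rule: p(X) = sum_Y p(X, Y)
p_Y = joint.sum(axis=1)            # p(Y) = sum_X p(X, Y)

print(p_X[3])   # p(X = d) = 9/60
print(p_Y[1])   # p(Y = 1) = 34/60
```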


Sum Rule

p(X) = \sum_Y p(X, Y)

p(Y) = \sum_X p(X, Y)

[Figure: the marginal histograms p(X) and p(Y) computed from the joint counts.]


Product Rule

Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = \sum_X p(X, Y = 1) = 34/60

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

and in general (the product rule):

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is renormalisation of relative frequencies:

p(X | Y) = \frac{p(X, Y)}{p(Y)}
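The renormalisation view corresponds to dividing a row of the joint by its marginal; this continues the counts-table sketch above.

```python
# p(X | Y = 1): renormalise the Y = 1 row of the joint distribution.
p_X_given_Y1 = joint[1] / p_Y[1]

print(p_X_given_Y1[3])           # p(X = d | Y = 1) = 8/34
# Product rule check: p(X = d, Y = 1) = p(X = d | Y = 1) * p(Y = 1) = 8/60
print(p_X_given_Y1[3] * p_Y[1])
```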


Sum and Product Rules

p(X) = \sum_Y p(X, Y)

p(X | Y) = \frac{p(X, Y)}{p(Y)}

[Figure: the marginal p(X) and the conditional p(X | Y = 1) plotted against X.]


Sum Rule and Product Rule

Sum rule:

p(X) = \sum_Y p(X, Y)

Product rule:

p(X, Y) = p(X | Y) p(Y)

These rules form the basis of Bayesian machine learning, and of this course!


Bayes Theorem

Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes' theorem:

p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)}

only defined for p(X) > 0, and

p(X) = \sum_Y p(X, Y)              (sum rule)
     = \sum_Y p(X | Y) p(Y)        (product rule)
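Continuing the counts-table sketch, Bayes' theorem can be checked numerically:

```python
# p(Y = 1 | X = d) via Bayes' theorem, reusing the arrays defined above.
p_Y1_given_Xd = p_X_given_Y1[3] * p_Y[1] / p_X[3]
print(p_Y1_given_Xd)             # (8/34)(34/60) / (9/60) = 8/9

# Direct check from the joint: p(Y = 1 | X = d) = p(X = d, Y = 1) / p(X = d).
print(joint[1, 3] / p_X[3])      # also 8/9
```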


Probability Densities

Real-valued variable x ∈ R.
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.

p(x ∈ (a, b)) = \int_a^b p(x) \, dx

[Figure: a density p(x), the probability mass p(x) δx over a small interval δx, and the cumulative distribution function P(x).]


Constraints on p(x)

Nonnegative:

p(x) ≥ 0

Normalisation:

\int_{-\infty}^{\infty} p(x) \, dx = 1


Cumulative distribution function P(x)

P(x) = \int_{-\infty}^{x} p(z) \, dz

or, equivalently,

\frac{d}{dx} P(x) = p(x)


Multivariate Probability Density

Vector x ≡ (x_1, ..., x_D)^T

Nonnegative:

p(x) ≥ 0

Normalisation:

\int p(x) \, dx = 1

This means

\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x) \, dx_1 \ldots dx_D = 1


Sum and Product Rule for Probability Densities

Sum rule:

p(x) = \int_{-\infty}^{\infty} p(x, y) \, dy

Product rule:

p(x, y) = p(y | x) p(x)


Expectations

The weighted average of a function f(x) under the probability distribution p(x):

E[f] = \sum_x p(x) f(x)             (discrete distribution p(x))

E[f] = \int p(x) f(x) \, dx         (probability density p(x))


How to approximate E [f ]

Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)

How do we draw points from a probability distribution p(x)? A lecture on "Sampling" is coming; a numerical sketch follows below.
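A numerical sketch of the approximation, assuming NumPy. The choices f(x) = x² and a standard normal p(x) are illustrative; for that pair the exact value is E[x²] = µ² + σ² = 1 (see the Gaussian slides later in this part).

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f] with f(x) = x**2 and p(x) a standard normal.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
estimate = np.mean(samples ** 2)
print(estimate)   # close to the exact value 1; the error shrinks as N grows
```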


Expectation of a function of several variables

For an arbitrary function f(x, y):

E_x[f(x, y)] = \sum_x p(x) f(x, y)          (discrete distribution p(x))

E_x[f(x, y)] = \int p(x) f(x, y) \, dx      (probability density p(x))

Note that E_x[f(x, y)] is a function of y.


Conditional Expectation

For an arbitrary function f(x):

E_x[f | y] = \sum_x p(x | y) f(x)           (discrete distribution p(x))

E_x[f | y] = \int p(x | y) f(x) \, dx       (probability density p(x))

Note that E_x[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why?)

E_y[E_x[f(x) | y]] = \sum_y p(y) E_x[f | y] = \sum_y p(y) \sum_x p(x | y) f(x)
                   = \sum_{x,y} f(x) p(x, y)
                   = \sum_x f(x) p(x)
                   = E_x[f(x)]
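The identity can also be checked numerically on the counts table from the probability slides, reusing joint, p_X, and p_Y from that sketch; taking f(x) to be the position of x in a, ..., i is an arbitrary illustrative choice.

```python
f = np.arange(9.0)                        # arbitrary f(x) over the 9 values of X

E_f = p_X @ f                             # E_x[f(x)]
E_f_given_y = (joint / p_Y[:, None]) @ f  # E_x[f | y], one entry per value of y
print(E_f, p_Y @ E_f_given_y)             # the two values agree
```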


Variance

For an arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2

Special case: f(x) = x

var[x] = E[(x − E[x])^2] = E[x^2] − E[x]^2


Covariance

Two random variables x ∈ R and y ∈ R:

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_{x,y}[x] − a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] − a b − a b + a b    (since E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)
          = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]

The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.


Covariance for Vector Valued Variables

Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])] = E_{x,y}[x y^T] − E[x] E[y^T]
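A Monte Carlo sanity check of the matrix identity, assuming NumPy. The linear relationship y = A x + noise is an arbitrary construction chosen so that the cross-covariance is nonzero; with x ~ N(0, I) it equals A^T.

```python
rng = np.random.default_rng(1)

N = 100_000
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
x = rng.normal(size=(N, 2))                 # x ~ N(0, I), D = 2
y = x @ A.T + 0.1 * rng.normal(size=(N, 2))

# cov[x, y] = E[x y^T] - E[x] E[y]^T, estimated from samples.
cov_xy = (x.T @ y) / N - np.outer(x.mean(axis=0), y.mean(axis=0))
print(cov_xy)                               # approaches A^T = [[1, 0], [0.5, 2]]
```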


The Gaussian Distribution

x ∈ R
Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x − µ)^2 \right\}

[Figure: the Gaussian density N(x | µ, σ²), centred at its mean µ.]


The Gaussian Distribution

N(x | µ, σ²) > 0

\int_{-\infty}^{\infty} N(x | µ, σ²) \, dx = 1

Expectation of x:

E[x] = \int_{-\infty}^{\infty} N(x | µ, σ²) \, x \, dx = µ

Expectation of x²:

E[x²] = \int_{-\infty}^{\infty} N(x | µ, σ²) \, x² \, dx = µ² + σ²

Variance of x:

var[x] = E[x²] − E[x]² = σ²
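These moments are easy to confirm by sampling; µ = 1.5 and σ = 0.5 below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 1.5, 0.5
x = rng.normal(mu, sigma, size=200_000)

print(x.mean())          # ~ mu = 1.5
print((x ** 2).mean())   # ~ mu^2 + sigma^2 = 2.5
print(x.var())           # ~ sigma^2 = 0.25
```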


Strategy in this course

Estimate the best predictor = training = learning.
Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results