Lecture 3:
Inferences using Least-Squares
Abstraction
Vector of N random variables, x
with joint probability density p(x)
expectation x̄
and covariance Cx
[figure: cloud of samples in the (x1, x2) plane]
Shown as 2D here, but actually N-dimensional
the multivariate normal distribution
p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -½ (x - x̄)^T Cx^(-1) (x - x̄) }
has expectation x̄
covariance Cx
and is normalized to unit area (it integrates to 1)
examples
x̄ = [2, 1]^T, Cx = [1 0; 0 1]   [plot of p(x1, x2)]
x̄ = [2, 1]^T, Cx = [2 0; 0 1]   [plot of p(x1, x2)]
x̄ = [2, 1]^T, Cx = [1 0; 0 2]   [plot of p(x1, x2)]
x̄ = [2, 1]^T, Cx = [1 0.5; 0.5 1]   [plot of p(x1, x2)]
x̄ = [2, 1]^T, Cx = [1 -0.5; -0.5 1]   [plot of p(x1, x2)]
Remember this from last lecture?
[figure: joint distribution p(x1, x2) with its two marginals]
p(x1) = ∫ p(x1, x2) dx2   the distribution of x1 (irrespective of x2)
p(x2) = ∫ p(x1, x2) dx1   the distribution of x2 (irrespective of x1)
[figure: joint distribution p(x, y) with its marginals p(x) and p(y)]
p(y) = ∫ p(x, y) dx
p(x) = ∫ p(x, y) dy
Remember
p(x, y) = p(x|y) p(y) = p(y|x) p(x)
from the last lecture?
we can compute p(x|y) and p(y|x) as follows
p(x|y) = p(x, y) / p(y)
p(y|x) = p(x, y) / p(x)
[figure: the joint distribution p(x, y) and the two conditional distributions p(x|y) and p(y|x)]
Any linear function of a normal distribution is a normal distribution
if p(x) = (2π)^(-N/2) |Cx|^(-1/2) exp{ -½ (x - x̄)^T Cx^(-1) (x - x̄) }
and y = Mx, then
p(y) = (2π)^(-N/2) |Cy|^(-1/2) exp{ -½ (y - ȳ)^T Cy^(-1) (y - ȳ) }
with ȳ = M x̄ and Cy = M Cx M^T
Memorize!
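(A quick numerical check of this rule, not from the lecture: a minimal NumPy sketch, with x̄, Cx and M chosen arbitrarily for illustration.)

```python
import numpy as np

# illustration values only
xbar = np.array([2.0, 1.0])
Cx = np.array([[1.0, 0.5],
               [0.5, 1.0]])
M = np.array([[2.0, 0.0],
              [1.0, 3.0]])

rng = np.random.default_rng(0)
x = rng.multivariate_normal(xbar, Cx, size=100_000)   # samples of x
y = x @ M.T                                           # y = M x for each sample

# the empirical mean and covariance of y should match the rule
print(y.mean(axis=0), M @ xbar)                       # ybar = M xbar
print(np.cov(y, rowvar=False), M @ Cx @ M.T)          # Cy = M Cx M^T (agree to ~2 decimals)
```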
Do you remember this from a previous lecture?
if d = G m
then the standard least-squares solution is
m^est = [G^T G]^(-1) G^T d
and the rule for error propagation gives
Cm = σd² [G^T G]^(-1)
Example: all the data are assumed to have the same true value, m1, and each is measured with the same variance, σd²

[d1]   [1]
[d2]   [1]
[d3] = [1] m1
[… ]   […]
[dN]   [1]
        G

G^T G = N, so [G^T G]^(-1) = 1/N
G^T d = Σᵢ dᵢ
m^est = [G^T G]^(-1) G^T d = (Σᵢ dᵢ) / N
Cm = σd² / N
m1^est = (Σᵢ dᵢ) / N … the traditional formula for the mean!
the estimated mean has variance Cm = σd² / N = σm²
note then that σm = σd / √N
the estimated mean is a normally-distributed random variable
the width of this distribution, σm, decreases with the square root of the number of measurements
Accuracy grows only slowly with N
[figure: p(m1^est) for N = 1, 10, 100, 1000; the distribution narrows as N increases]
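(A minimal simulation of this, not part of the lecture: the true value, σd, and the number of repeated experiments are arbitrary illustration choices.)

```python
import numpy as np

rng = np.random.default_rng(1)
true_m1, sigma_d = 10.0, 2.0                      # assumed true value and data standard deviation

for N in (1, 10, 100, 1000):
    # many repeated experiments, each consisting of N measurements
    d = rng.normal(true_m1, sigma_d, size=(20_000, N))
    m_est = d.mean(axis=1)                        # m1_est = (sum of d_i) / N for each experiment
    print(N, m_est.std(), sigma_d / np.sqrt(N))   # empirical width vs sigma_d / sqrt(N)
```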
Estimating the variance of the data
What σd² do you use in this formula?
Prior estimates of σd:
based on knowledge of the limits of your measuring technique …
my ruler has only mm tics, so I'm going to assume that σd = 0.5 mm
the manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I'll assume σd = 0.025
Posterior estimate of the error:
based on the error measured with respect to the best fit
σd² = (1/N) Σᵢ (dᵢ^obs - dᵢ^pre)² = (1/N) Σᵢ eᵢ²
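(A one-function sketch of the posterior estimate; d_obs and d_pre are hypothetical arrays standing in for the observed data and the best-fit predictions.)

```python
import numpy as np

def posterior_variance(d_obs, d_pre):
    """Posterior estimate of the data variance from the misfit to the best fit."""
    e = d_obs - d_pre                  # individual errors e_i
    return np.mean(e ** 2)             # sigma_d^2 = (1/N) * sum(e_i^2)
```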
[1  x1]          [d1]
[1  x2]  [a]  =  [d2]
[ …  …]  [b]     [… ]
[1  xN]          [dN]
    G     m        d

m^est = [G^T G]^(-1) G^T d is normally distributed with covariance
Cm = σd² [G^T G]^(-1)
p(m) = p(a, b) = p(intercept, slope)
[figure: contour plot of p(a, b), with intercept and slope on the axes]
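(A runnable sketch of this straight-line fit, not from the lecture: the intercept, slope, σd, and x values are arbitrary illustration choices.)

```python
import numpy as np

rng = np.random.default_rng(2)
N, a_true, b_true, sigma_d = 100, 1.0, 2.0, 0.5     # assumed illustration values

x = np.linspace(0.0, 10.0, N)
d = a_true + b_true * x + rng.normal(0.0, sigma_d, N)

G = np.column_stack([np.ones(N), x])                # rows are [1, x_i]
GTG_inv = np.linalg.inv(G.T @ G)
m_est = GTG_inv @ G.T @ d                           # m_est = [G^T G]^-1 G^T d
Cm = sigma_d ** 2 * GTG_inv                         # Cm = sigma_d^2 [G^T G]^-1

print("intercept, slope:", m_est)
print("their standard deviations:", np.sqrt(np.diag(Cm)))
```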
How probable is a dataset ?
N data d are all drawn from the same distribution p(d)
the probable-ness of a single measurement di is p(di)
So the probable-ness of the whole dataset is
p(d1) p(d2) … p(dN) = Πᵢ p(dᵢ)
L = ln Πᵢ p(dᵢ) = Σᵢ ln p(dᵢ)
L is called the “likelihood” of the data
Now imagine that the distribution p(d) is known up to a vector m of unknown parameters
write p(d; m), with the semicolon as a reminder
that it's not a joint probability
Then L is a function of m
L(m) = Σᵢ ln p(dᵢ; m)
The Principle of Maximum Likelihood
choose m so that it maximizes L(m)
the dataset that was in fact observed is the most probable one that could have been observed
The best choice of parameters m is the one that makes the dataset likely
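(A small numerical illustration of the principle, not from the lecture: normally distributed data with known unit variance and an unknown mean m; the grid and sample size are arbitrary. The maximum of L(m) lands on the sample mean.)

```python
import numpy as np

rng = np.random.default_rng(4)
d = rng.normal(5.0, 1.0, size=50)          # assumed data: unknown mean m, known sigma_d = 1

m_grid = np.linspace(3.0, 7.0, 2001)
# L(m) = sum_i ln p(d_i; m) for a normal distribution with unit variance
L = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (d - m) ** 2) for m in m_grid])

m_ml = m_grid[np.argmax(L)]
print(m_ml, d.mean())                      # the maximum-likelihood m equals the sample mean
```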
the multivariate normal distribution for data, d
p(d) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d - d̄)^T Cd^(-1) (d - d̄) }
Let's assume that the expectation d̄ is given by a general linear model
d̄ = Gm
and that the covariance Cd is known (the prior covariance)
Then we have a distribution p(d; m) with unknown parameters, m
p(d; m) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d - Gm)^T Cd^(-1) (d - Gm) }
We can now apply the principle of maximum likelihood
to estimate the unknown parameters m
Find the m that maximizes L(m) = ln p(d; m)
with
p(d; m) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d - Gm)^T Cd^(-1) (d - Gm) }
L(m) = ln p(d; m) = -½ N ln(2π) - ½ ln|Cd| - ½ (d - Gm)^T Cd^(-1) (d - Gm)
The first two terms do not contain m, so the principle of maximum likelihood becomes
Maximize -½ (d - Gm)^T Cd^(-1) (d - Gm)
or
Minimize (d - Gm)^T Cd^(-1) (d - Gm)
Minimize (d - Gm)^T Cd^(-1) (d - Gm)
Special case of uncorrelated data with equal variance:
Cd = σd² I
Minimize σd^(-2) (d - Gm)^T (d - Gm) with respect to m
which is the same as
Minimize (d - Gm)^T (d - Gm) with respect to m
This is the Principle of Least Squares
But back to the general case …
What formula for m does the rule
Minimize (d - Gm)^T Cd^(-1) (d - Gm)
imply?
Answer (after a lot of algebra):
m^est = [G^T Cd^(-1) G]^(-1) G^T Cd^(-1) d
and then by the usual rules of error propagation
Cm = [G^T Cd^(-1) G]^(-1)
This special case is often called Weighted Least Squares
Note that the total error is
E = e^T Cd^(-1) e = Σᵢ σᵢ^(-2) eᵢ²
Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight
Example: fitting a straight line
100 data; the first 50 have a different σd than the last 50
[figure, three panels:
(1) Equal variance: left 50 have σd = 5, right 50 have σd = 5
(2) Left has smaller variance: first 50 have σd = 5, last 50 have σd = 100
(3) Right has smaller variance: first 50 have σd = 100, last 50 have σd = 5]
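(A sketch of the weighted fit for a dataset like this; only the two noise levels, 5 and 100, come from the example, and the straight-line parameters are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = np.linspace(0.0, 10.0, N)
sigma = np.where(np.arange(N) < 50, 5.0, 100.0)     # first 50: sigma_d = 5, last 50: sigma_d = 100
d = 1.0 + 2.0 * x + rng.normal(0.0, sigma)          # assumed intercept 1 and slope 2

G = np.column_stack([np.ones(N), x])
Cd_inv = np.diag(1.0 / sigma ** 2)                  # Cd^-1 for uncorrelated data

# m_est = [G^T Cd^-1 G]^-1 G^T Cd^-1 d   and   Cm = [G^T Cd^-1 G]^-1
Cm = np.linalg.inv(G.T @ Cd_inv @ G)
m_est = Cm @ G.T @ Cd_inv @ d

print("intercept, slope:", m_est)                   # controlled mostly by the low-variance half
print("parameter standard deviations:", np.sqrt(np.diag(Cm)))
```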
What can go wrong in least-squares
m^est = [G^T G]^(-1) G^T d
fails when the matrix G^T G is singular, so that [G^T G]^(-1) does not exist
EXAMPLE: a straight line fit

[d1]   [1  x1]
[d2]   [1  x2]
[d3] = [1  x3]  [a]
[… ]   [ …  …]  [b]
[dN]   [1  xN]
  d        G     m

G^T G = [ N        Σᵢ xᵢ  ]
        [ Σᵢ xᵢ    Σᵢ xᵢ² ]

det(G^T G) = N Σᵢ xᵢ² - [Σᵢ xᵢ]²
[G^T G]^(-1) is singular when this determinant is zero
N = 1, only one measurement (x, d):
N Σᵢ xᵢ² - [Σᵢ xᵢ]² = x² - x² = 0
you can't fit a straight line to only one point
N > 1, but all data measured at the same x:
N Σᵢ xᵢ² - [Σᵢ xᵢ]² = N² x² - N² x² = 0
measuring the same point over and over doesn't help
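(A tiny numerical illustration of the second case; the repeated x value of 3.0 and N = 10 are arbitrary.)

```python
import numpy as np

x = np.full(10, 3.0)                      # ten measurements, all made at the same x
G = np.column_stack([np.ones_like(x), x])

GTG = G.T @ G
print(np.linalg.det(GTG))                 # 0: N * sum(x_i^2) - (sum(x_i))^2 = 0
# np.linalg.inv(GTG) would fail (raise LinAlgError) or be numerically meaningless
```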
det(G^T G) = N Σᵢ xᵢ² - [Σᵢ xᵢ]² = 0
This sort of 'missing measurement' might be difficult to recognize in a complicated problem
but it happens all the time …
Example - Tomography
in this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region …
[figure: X-ray beams crossing the subject; a small region has no data coverage]
What to do ?
Introduce prior information
assumptions about the behavior of the unknowns
that ‘fill in’ the data gaps
Examples of Prior Information
The unknowns:
are close to some already-known value
the density of the mantle is close to 3000 kg/m3
vary smoothly with time or with geographical position
ocean currents have length scales of 10’s of km
obey some physical law embodied in a PDE
water is incompressible and thus its velocity satisfies div(v) = 0
Are you only fooling yourself ?
It depends …
are your assumptions good ones?
Application of the Maximum Likelihood Method
to this problem
so, let’s have a foray into the world of probability
Overall Strategy
1. Represent the observed data as a probability distribution
2. Represent prior information as a probability distribution
3. Represent the relationship between data and model parameters as a probability distribution
4. Combine the three distributions in a way that embodies combining the information that they contain
5. Apply maximum likelihood to the combined distribution
How to combine distributions in a way that embodies combining the information that they contain …
Short answer: multiply them
[figure: p1(x) is nonzero for x between x1 and x3; p2(x) is nonzero for x between x2 and x4; their product pT(x) is nonzero only where both are, i.e. for x between x2 and x3]
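(A small sketch of "multiply them", not from the lecture: two normal distributions on an arbitrary grid; the renormalized product is narrower than either factor and peaks between their means.)

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 2001)
p1 = np.exp(-0.5 * (x - 0.0) ** 2 / 1.0 ** 2)   # first piece of information about x
p2 = np.exp(-0.5 * (x - 1.0) ** 2 / 0.5 ** 2)   # a second, more precise piece of information

pT = p1 * p2
pT /= np.trapz(pT, x)                           # renormalize so the product integrates to 1

print(x[np.argmax(pT)])                         # ~0.8, between the two means, nearer the precise one
```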
Overall Strategy
1. Represent the observed data as a Normal probability distribution
pA(d) ∝ exp{ -½ (d - d^obs)^T Cd^(-1) (d - d^obs) }
In the absence of any other information, the best estimate of the mean of the data is the observed data itself.
Prior covariance of the data.
I don’t feel like typing the normalization
Overall Strategy
2. Represent prior information as a Normal probability distribution
pA(m) ∝ exp{ -½ (m - mA)^T Cm^(-1) (m - mA) }
Prior estimate of the model, your best guess as to what it would be, in the absence of any observations.
Prior covariance of the model quantifies how good you think your prior estimate is …
example
one observation: d^obs = 0.8 ± 0.4
one model parameter with mA = 1.0 ± 1.25
[figure: the product pA(d) pA(m) in the (m, d) plane, axes running 0 to 2, centered at mA = 1 and d^obs = 0.8]
Overall Strategy
3. Represent the relationship between data and model parameters as a probability distribution
pT(d, m) ∝ exp{ -½ (d - Gm)^T CG^(-1) (d - Gm) }
Prior covariance of the theory quantifies how good you think your linear theory is.
linear theory, Gm=d, relating data, d, to model parameters, m.
example
theory: d=m
but only accurate to ± 0.2
[figure: pT(d, m) concentrated along the line d = m in the (m, d) plane, axes 0 to 2, with mA = 1 and d^obs = 0.8 marked]
Overall Strategy
4. Combine the three distributions in a way that embodies combining the information that they contain
p(m, d) = pA(d) pA(m) pT(m, d)
∝ exp{ -½ [ (d - d^obs)^T Cd^(-1) (d - d^obs) + (m - mA)^T Cm^(-1) (m - mA) + (d - Gm)^T CG^(-1) (d - Gm) ] }
a bit of a mess, but it can be simplified …
[figure: the combined distribution p(d, m) = pA(d) pA(m) pT(d, m) in the (m, d) plane, axes 0 to 2]
Overall Strategy
5. Apply maximum likelihood to the combined distribution, p(d,m) = pA(d) pA(m) pT(m,d)
[figure: p(d, m) with the maximum likelihood point marked; the coordinates of the peak are m^est and d^pre]
special case of an exact theory
Exact Theory: the covariance CG is very small; take the limit CG → 0
After projecting p(d, m) to p(m) by integrating over all d,
p(m) ∝ exp{ -½ [ (Gm - d^obs)^T Cd^(-1) (Gm - d^obs) + (m - mA)^T Cm^(-1) (m - mA) ] }
maximizing p(m) is equivalent to minimizing
(Gm - d^obs)^T Cd^(-1) (Gm - d^obs) + (m - mA)^T Cm^(-1) (m - mA)
i.e. the weighted "prediction error" plus the weighted "distance of the model from its prior value"
solution, calculated via the usual messy minimization process:
m^est = mA + M [ d^obs - G mA ]
where M = [G^T Cd^(-1) G + Cm^(-1)]^(-1) G^T Cd^(-1)
Don't memorize, but be prepared to use it
interesting interpretation
m^est - mA = M [ d^obs - G mA ]
(estimated model minus its prior) = M × (observed data minus the prediction of the prior model)
the linear connection between the two is a generalized form of least squares
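(A sketch of this formula applied to the one-parameter example from earlier in the lecture, d^obs = 0.8 ± 0.4, mA = 1.0 ± 1.25, with the exact theory d = m; the arrays are written in matrix form so that G could be any matrix.)

```python
import numpy as np

# one datum, one model parameter, exact theory d = m, i.e. G = [[1]]
G = np.array([[1.0]])
d_obs = np.array([0.8]); Cd = np.array([[0.4 ** 2]])   # observation and its variance
m_A = np.array([1.0]);   Cm = np.array([[1.25 ** 2]])  # prior model and its variance

Cd_inv, Cm_inv = np.linalg.inv(Cd), np.linalg.inv(Cm)
M = np.linalg.inv(G.T @ Cd_inv @ G + Cm_inv) @ G.T @ Cd_inv
m_est = m_A + M @ (d_obs - G @ m_A)

print(m_est)    # ~0.82, pulled toward d_obs because the datum is more precise than the prior
```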
special uncorrelated case: Cm = σm² I and Cd = σd² I
M = [G^T Cd^(-1) G + Cm^(-1)]^(-1) G^T Cd^(-1)
  = [ G^T G + (σd/σm)² I ]^(-1) G^T
this formula is sometimes called "damped least squares", with "damping factor" ε = σd/σm
Damped Least Squares makes the process of avoiding singular matrices associated with insufficient data trivially easy
you just add ε²I to G^T G before computing the inverse
G^T G → G^T G + ε²I
this process regularizes the matrix, so its inverse always exists
its interpretation is: in the absence of relevant data, assume the model parameter has its prior value
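(A sketch of damped least squares applied to the earlier singular example, all measurements at the same x; the damping factor ε = 0.1, the data values, and the implicit zero prior model are illustration choices, not values from the lecture.)

```python
import numpy as np

x = np.full(10, 3.0)                            # all data at the same x, so G^T G is singular
d = np.full(10, 7.0)                            # assumed observations
G = np.column_stack([np.ones_like(x), x])

eps = 0.1                                       # damping factor epsilon = sigma_d / sigma_m
GTG_damped = G.T @ G + eps ** 2 * np.eye(2)     # regularized matrix; its inverse now exists
m_est = np.linalg.inv(GTG_damped) @ G.T @ d     # damped least-squares estimate (prior model = 0)

print(np.linalg.det(G.T @ G), np.linalg.det(GTG_damped))   # 0 versus nonzero
print(m_est)
```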
Are you only fooling yourself ?
It depends …
is the assumption - that you know the prior value - a good one?