Lecture 8: Advanced Topics in Least Squares - Part Two
You often spend more time futzing with reading files that are in inscrutable formats than on the intellectually interesting side of data analysis.
Sample MATLAB Code
cs = importdata('test.txt','=');   % non-occurring delimiter forces each complete line into one cell
Ns = length(cs);                   % cs is a "cellstr": an array of strings, one per line
mag = zeros(Ns,1);
Nm = 0;
for i = 1:Ns
    s = char(cs(i));               % convert the "cellstr" element to an ordinary string
    smag = s(48:50);               % characters holding the magnitude value
    stype = s(51:52);              % characters holding the magnitude type
    if( strcmp(stype,'Mb') )
        Nm = Nm+1;
        mag(Nm,1) = str2num(smag); % convert the string to a number
    end
end
mag = mag(1:Nm);                   % keep only the magnitudes actually found
a routine to read a text file
choose non-occurring delimiter to force complete line into one cell
returns “cellstr” data type: array of strings
convert “cellstr” element to string
convert string to number
EXAMPLE - a straight line fit, di = m1 + m2 xi, that is, d = Gm:

[ d1 ]   [ 1  x1 ]
[ d2 ]   [ 1  x2 ]
[ d3 ] = [ 1  x3 ]  [ m1 ]
[ …  ]   [ …   … ]  [ m2 ]
[ dN ]   [ 1  xN ]
GTG = [ N,     Σi xi  ;
        Σi xi, Σi xi2 ]

det(GTG) = N Σi xi2 – [Σi xi]2

GTG is singular, so [GTG]-1 fails to exist, when the determinant is zero
Case 1: N = 1, only one measurement (x, d):
N Σi xi2 – [Σi xi]2 = x2 – x2 = 0
you can't fit a straight line to only one point

Case 2: N > 1, but all data measured at the same x:
N Σi xi2 – [Σi xi]2 = N2 x2 – N2 x2 = 0
measuring the same point over and over doesn't help

In both cases det(GTG) = N Σi xi2 – [Σi xi]2 = 0
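As a quick numerical check (a MATLAB sketch with made-up x values, not part of the original lecture), you can build G for a straight-line fit and watch det(GTG) drop to zero when every datum is measured at the same x:

x = [1; 2; 3; 4];               % measurement positions (arbitrary example values)
G = [ones(length(x),1), x];     % N x 2 matrix: a column of ones and a column of x's
disp( det(G'*G) );              % nonzero, so a straight line can be fit

xsame = 2*ones(4,1);            % all data measured at the same x
Gsame = [ones(4,1), xsame];
disp( det(Gsame'*Gsame) );      % zero: G'*G is singular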
another example – sums and differences

Ns sums, si, and Nd differences, di, of two unknowns m1 and m2:

[ s1  ]   [ 1   1 ]
[ s2  ]   [ 1   1 ]
[ …   ]   [ …   … ]
[ sNs ] = [ 1   1 ]  [ m1 ]
[ d1  ]   [ 1  -1 ]  [ m2 ]
[ …   ]   [ …   … ]
[ dNd ]   [ 1  -1 ]

GTG = [ Ns+Nd, Ns-Nd ;
        Ns-Nd, Ns+Nd ]

det(GTG) = [Ns+Nd]2 - [Ns-Nd]2 = [Ns2 + Nd2 + 2NsNd] - [Ns2 + Nd2 - 2NsNd] = 4NsNd

zero when Ns=0 or Nd=0, that is, when there are only measurements of one kind
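A similar MATLAB sketch (the counts Ns and Nd are my own illustrative choices) verifies the determinant formula and shows the singularity when one kind of measurement is missing:

Ns = 3; Nd = 2;                                 % number of sums and of differences
G = [ones(Ns,1), ones(Ns,1); ones(Nd,1), -ones(Nd,1)];
disp( G'*G );                                   % equals [Ns+Nd, Ns-Nd; Ns-Nd, Ns+Nd]
disp( det(G'*G) );                              % equals 4*Ns*Nd = 24
Gsumsonly = [ones(Ns,1), ones(Ns,1)];           % Nd = 0: only sums measured
disp( det(Gsumsonly'*Gsumsonly) );              % zero: singular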
This sort of 'missing measurement' might be difficult to recognize in a complicated problem, but it happens all the time …
in this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region …
(figure: ray coverage of the subject, with one small region having no data coverage)
What to do? Introduce prior information:
assumptions about the behavior of the unknowns that 'fill in' the data gaps
Examples of Prior Information

The unknowns:
- are close to some already-known value (e.g. the density of the mantle is close to 3000 kg/m3)
- vary smoothly with time or with geographical position (e.g. ocean currents have length scales of 10's of km)
- obey some physical law embodied in a PDE (e.g. water is incompressible and thus its velocity satisfies div(v)=0)
Application of the Maximum Likelihood Method
to this problem
so, let’s have a foray into the world of probability
Overall Strategy
1. Represent the observed data as a probability distribution
2. Represent prior information as a probability distribution
3. Represent the relationship between data and model parameters as a probability distribution
4. Combine the three distributions in a way that embodies combining the information that they contain
5. Apply maximum likelihood to the combined distribution
How to combine distributions in a way that embodies combining the information that they contain …
Short answer: multiply them
But let's step through a more carefully reasoned analysis of why we should do that …
(figure: two probability distributions p1(x) and p2(x), and their combination pT(x), each plotted against x)
how to quantify the information in a distribution p(x)
Information compared to what?
Compared to a distribution pN(x) that represents the state of complete ignorance
Example: pN(x) = a uniform distribution
The information content should be a scalar quantity, Q
Q = ∫ ln[ p(x)/pN(x) ] p(x) dx
Q is the expected value of ln[ p(x)/pN(x) ]
Properties:
Q=0 when p(x) = pN(x)
Q ≥ 0 always (since the limit as p→0 of p ln(p) is 0)
Q is invariant under a change of variables x→y
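The following MATLAB sketch (the grid, the interval, and the distributions are my own choices, not from the lecture) checks these properties numerically for a Normal distribution measured against a uniform null distribution:

dx = 0.01;
x  = (-10:dx:10)';
pN = ones(size(x)) / 20;                 % uniform null distribution on [-10,10]
p  = exp(-0.5*x.^2) / sqrt(2*pi);        % Normal distribution, zero mean, unit variance
Q  = sum( p .* log(p./pN) ) * dx;        % approximates the integral defining Q
disp(Q);                                 % positive: the Normal carries information
disp( sum( pN .* log(pN./pN) ) * dx );   % zero: Q = 0 when p(x) = pN(x)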
Combining distributions pA(x) and pB(x)
Desired properties of the combination:
pA(x) combined with pB(x) is the same as pB(x) combined with pA(x)
pA(x) combined with [ pB(x) combined with pC(x) ] is the same as [ pA(x) combined with pB(x) ] combined with pC(x)
Q of [ pA(x) combined with pN(x) ] = QA, since combining with the null distribution adds no information
pAB(x) = pA(x) pB(x) / pN(x)
When pN(x) is the uniform distribution …
… combining is just multiplying.
But note that for points on the surface of a sphere, the null distribution p(θ,φ), with θ colatitude and φ longitude, would not be uniform, but rather proportional to sin(θ).
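As an illustration of combining-by-multiplying (a MATLAB sketch with made-up means and variances), the renormalized product of two Normal distributions is again a Normal whose mean is the precision-weighted average of the two:

dx = 0.01;
x  = (-5:dx:5)';
p1 = exp(-0.5*(x-1).^2) / sqrt(2*pi);               % mean 1, variance 1
p2 = exp(-0.5*(x+1).^2/0.25) / (0.5*sqrt(2*pi));    % mean -1, variance 0.25
pT = p1 .* p2;                                      % combine (uniform null distribution)
pT = pT / (sum(pT)*dx);                             % renormalize
disp( sum(x.*pT)*dx );                              % about -0.6 = (1*1 + 4*(-1))/(1 + 4)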
Overall Strategy
1. Represent the observed data as a Normal probability distribution
pA(d) ∝ exp{ -½ (d-dobs)T Cd-1 (d-dobs) }
In the absence of any other information, the best estimate of the mean of the data is the observed data itself.
Prior covariance of the data.
I don’t feel like typing the normalization
Overall Strategy
2. Represent prior information as a Normal probability distribution
pA(m) ∝ exp{ -½ (m-mA)T Cm-1 (m-mA) }
Prior estimate of the model, your best guess as to what it would be, in the absence of any observations.
Prior covariance of the model quantifies how good you think your prior estimate is …
Overall Strategy
3. Represent the relationship between data and model parameters as a probability distribution
pT(d,m) ∝ exp{ -½ (d-Gm)T CG-1 (d-Gm) }
Prior covariance of the theory quantifies how good you think your linear theory is.
linear theory, Gm=d, relating data, d, to model parameters, m.
Overall Strategy
4. Combine the three distributions in a way that embodies combining the information that they contain
p(m,d) = pA(d) pA(m) pT(m,d)
∝ exp{ -½ [ (d-dobs)T Cd-1 (d-dobs) + (m-mA)T Cm-1 (m-mA) + (d-Gm)T CG-1 (d-Gm) ] }

a bit of a mess, but it can be simplified …
Overall Strategy
5. Apply maximum likelihood to the combined distribution, p(d,m) = pA(d) pA(m) pT(m,d)
There are two distinct ways we could do this:
1. Find the (d,m) combination that maximizes the joint probability distribution, p(d,m)
2. Find the m that maximizes the individual probability distribution, p(m) = ∫ p(d,m) dd
These do not necessarily give the same value for m
special case of an exact theory
in the limit CG → 0

exp{ -½ (d-Gm)T CG-1 (d-Gm) } → δ(d-Gm)

and p(m) = ∫ p(d,m) dd
= pA(m) ∫ pA(d) δ(d-Gm) dd
= pA(m) pA(d=Gm)

so for Normal distributions
p(m) ∝ exp{ -½ [ (Gm-dobs)T Cd-1 (Gm-dobs) + (m-mA)T Cm-1 (m-mA) ] }

δ is the Dirac delta function, with property ∫ f(x) δ(x-y) dx = f(y)
special case of an exact theory
maximizing p(m) is equivalent to minimizing
(Gm-dobs)T Cd-1 (Gm-dobs) + (m-mA)T Cm-1 (m-mA)
= weighted "prediction error" + weighted "distance of the model from its prior value"
solution calculated via the usual messy minimization process
mest = mA + M [ dobs – G mA ]
where M = [GT Cd-1 G + Cm-1]-1 GT Cd-1
Don't memorize, but be prepared to use (e.g. in homework).
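A minimal MATLAB sketch of evaluating this formula (the straight-line G, the data, the prior model, and both covariances are my own illustrative choices):

G    = [ones(5,1), (1:5)'];               % straight-line fit with N = 5
dobs = [1.1; 2.0; 2.8; 4.2; 5.1];         % observed data (made up)
mA   = [0; 1];                            % prior model: intercept 0, slope 1
Cd   = 0.1^2 * eye(5);                    % prior covariance of the data
Cm   = 1.0^2 * eye(2);                    % prior covariance of the model
M    = (G'*(Cd\G) + inv(Cm)) \ (G'/Cd);   % M = [GT Cd-1 G + Cm-1]-1 GT Cd-1
mest = mA + M*(dobs - G*mA);
disp(mest);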
interesting interpretation
mest - mA = M [ dobs – GmA]
estimated model minus its prior
observed data minus the prediction of the prior model
linear connection between the two
special case of no prior information: Cm → ∞ (so Cm-1 → 0)

M = [GT Cd-1 G + Cm-1]-1 GT Cd-1 → [GT Cd-1 G]-1 GT Cd-1

mest = mA + [GT Cd-1 G]-1 GT Cd-1 [ dobs – G mA ]
     = mA + [GT Cd-1 G]-1 GT Cd-1 dobs – [GT Cd-1 G]-1 GT Cd-1 G mA
     = mA + [GT Cd-1 G]-1 GT Cd-1 dobs – mA
     = [GT Cd-1 G]-1 GT Cd-1 dobs
recovers weighted least squares
special case of infinitely accurate prior information: Cm → 0

M = [GT Cd-1 G + Cm-1]-1 GT Cd-1 → 0

mest = mA + 0 = mA
recovers prior value of m
special uncorrelated case: Cm = σm2 I and Cd = σd2 I

M = [GT Cd-1 G + Cm-1]-1 GT Cd-1
  = [σd-2 GTG + σm-2 I]-1 GT σd-2
  = [ GTG + (σd/σm)2 I ]-1 GT

this formula is sometimes called "damped least squares", with "damping factor" ε = σd/σm
Damped Least Squares makes the process of avoiding singular matrices associated with insufficient data trivially easy:
you just add ε2 I to GTG before computing the inverse,
GTG → GTG + ε2 I
this process regularizes the matrix, so its inverse always exists
its interpretation is: in the absence of relevant data,
assume the model parameter has its prior value
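For example (a sketch; ε and the data below are arbitrary choices of mine), damping lets you compute an estimate even when GTG by itself is singular:

G       = [ones(4,1), 2*ones(4,1)];    % all data at the same x, so G'*G is singular
dobs    = [3.9; 4.1; 4.0; 4.2];        % observed data (made up)
epsilon = 0.1;                         % damping factor
mest    = (G'*G + epsilon^2*eye(2)) \ (G'*dobs);
disp(mest);                            % finite answer; ordinary least squares would fail here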
Are you only fooling yourself?
It depends …
is the assumption - that you know the prior value - a good one?
Smoothness Constraints
e.g. model is smooth when its second derivative is small
d2mi/dx2 ∝ mi-1 - 2mi + mi+1
(assuming the data are organized according to one spatial variable)
matrix D approximates the second derivative:

D = [ 1 -2  1  0  0  0  …
      0  1 -2  1  0  0  …
      …
      0  0  0  …  1 -2  1 ]

d2m/dx2 ∝ Dm
Choosing a smooth solution is thus equivalent to minimizing
(Dm)T (Dm) = mT (DTD) m
comparing this to the (m-mA)T Cm-1 (m-mA) minimization implied by the general solution

mest = mA + M [ dobs – G mA ]   where M = [GT Cd-1 G + Cm-1]-1 GT Cd-1

indicates that, to implement smoothness, we should make the choices

mA = 0 and Cm-1 = DTD
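To make the recipe concrete, here is a MATLAB sketch (the number of model parameters Nm is my own choice) that builds D and the corresponding Cm-1:

Nm = 6;                                % number of model parameters (illustrative)
D  = zeros(Nm-2, Nm);
for i = 1:Nm-2
    D(i, i:i+2) = [1, -2, 1];          % each row approximates d2m/dx2 at one interior point
end
CmInv = D'*D;                          % stands in for inv(Cm), together with mA = 0
% it can then be used in the general solution, e.g.
% M = (G'*(Cd\G) + CmInv) \ (G'/Cd);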