TRANSCRIPT
Statistical Inverse Problems,
Model Reduction and
Inverse Crimes
Erkki Somersalo, Helsinki University of Technology, Finland
Firenze, March 22–26, 2004
CONTENTS OF THE LECTURES
1. Statistical inverse problems: A brief review
2. Model reduction, discretization invariance
3. Inverse crimes
Material based on the forthcoming book
Jari Kaipio and Erkki Somersalo: Computational and Statistical Inverse Problems. Springer-Verlag (2004)
STATISTICAL INVERSE PROBLEMS
Bayesian paradigm, or “subjective probability”:
1. All variables are random variables
2. The randomness reflects the subject's uncertainty about the actual values
3. The uncertainty is encoded into probability distributions of the variables
Notation: Random variables X, Y , E etc.
Realizations: If X : Ω → Rn, we denote
X(ω) = x ∈ Rn.
Probability densities:
P(X ∈ B) = ∫_B π_X(x) dx = ∫_B π(x) dx.
Hierarchy of the variables:
1. Unobservable variables of primary interest, X
2. Unobservable variables of secondary interest, E
3. Observable variables, Y
Example: Linear inverse problem with additive noise,
y = Ax + e, A ∈ Rm×n.
Stochastic extension: Y = AX + E.
Conditioning: Joint probability density of X and Y :
P(X ∈ A, Y ∈ B) = ∫_{A×B} π(x, y) dx dy.
Marginal densities:
P(X ∈ A) = P(X ∈ A, Y ∈ R^m) = ∫_{A×R^m} π(x, y) dx dy,
in other words,
π(x) = ∫_{R^m} π(x, y) dy.
Conditional probability:
P(X ∈ A | Y ∈ B) = ∫_{A×B} π(x, y) dx dy / ∫_B π(y) dy.
Shrink B into a single point y:
P(X ∈ A | Y = y) = ∫_A [π(x, y)/π(y)] dx = ∫_A π(x | y) dx,
where
π(x | y) = π(x, y)/π(y), or π(x, y) = π(x | y) π(y).
Bayesian solution of an inverse problem: Given a measurement y = yobserved
of the observable variable Y , find the posterior density of X,
πpost(x) = π(x | yobserved).
Prior density πpr(x) expresses all prior information independent of the measurement.
Likelihood density π(y | x) is the likelihood of a measurement outcome y given x.
Bayes formula:
π(x | y) = πpr(x) π(y | x) / π(y).
Three steps of Bayesian inversion:
1. Construct the prior density
2. Construct the likelihood density
3. Extract useful information from the posterior density
Example: Linear model with additive noise,
Y = AX + E,
where the density πnoise is known. Fixing X = x yields
π(y | x) = πnoise(y −Ax),
and so π(x | y) ∝ πpr(x) πnoise(y − Ax).
Assume that X and E are mutually independent and Gaussian,
X ∼ N (x0,Γpr), E ∼ N (0,Γe),
where Γpr ∈ Rn×n and Γe ∈ Rm×m are symmetric positive (semi)definite.
πpr(x) ∝ exp(−(1/2)(x − x_0)^T Γpr^{−1} (x − x_0)),
π(y | x) ∝ exp(−(1/2)(y − Ax)^T Γe^{−1} (y − Ax)).
From Bayes formula, the posterior density is Gaussian,
π(x | y) ∼ N(x*, Γpost),
where
x* = x_0 + Γpr A^T (A Γpr A^T + Γe)^{−1} (y − A x_0),
Γpost = Γpr − Γpr A^T (A Γpr A^T + Γe)^{−1} A Γpr.
Special case: Assume that
x0 = 0, Γpr = γ2I, Γe = σ2I.
In this case,
x* = A^T (A A^T + α²I)^{−1} y, α = σ/γ,
known as Wiener filtered solution (m×m problem), or, equivalently,
x∗ = (ATA + α2I)−1ATy,
which is the Tikhonov regularized solution (n× n problem).
Engineering rule of thumb: If n < m, use Tikhonov, if m < n use Wiener.
(In practice, ATA or AAT should often not be calculated.)
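The equivalence of the two forms can be checked numerically. The following sketch (an illustration added here, with an arbitrary A, data y and parameter α assumed) computes both and verifies that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
alpha = 0.3

# Wiener form: solve an m x m system.
x_wiener = A.T @ np.linalg.solve(A @ A.T + alpha**2 * np.eye(m), y)

# Tikhonov form: solve an n x n system.
x_tikh = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ y)

print(np.max(np.abs(x_wiener - x_tikh)))  # agree to rounding error
```

Note that, following the rule of thumb above, with m = 20 < n = 50 the Wiener form is the cheaper one here.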
Frequently asked question: How do you determine α?
Bayesian paradigm: Either
1. You know γ and σ; then α = σ/γ,
or
2. You don’t know them; make them part of the estimation problem.
This is the empirical Bayes approach.
Example: If γ in the previous example is unknown, write
πpr(x | γ) ∝ (1/γ^n) exp(−‖x‖²/(2γ²)),
and write
πpr(x, γ) = πpr(x | γ) πh(γ),
where πh is a hyperprior or hierarchical prior.
Determine π(x, γ | y).
BAYESIAN ESTIMATION
Classical inversion methods produce estimates of the unknown.
In contrast, the Bayesian approach produces a probability density that can be used
• to produce estimates,
• to assess the quality of estimates (statistical and classical).
Example: Conditional mean (CM) and maximum a posteriori (MAP) estimates:
x_CM = ∫_{R^n} x π(x | y) dx,
x_MAP = arg max π(x | y).
Calculating the MAP estimate is an optimization problem; calculating the CM estimate is an integration problem.
Monte Carlo integration: If n is large, quadrature methods are not feasible.
MC methods: Assume that we have a sample,
S = {x^1, x^2, . . . , x^N}, x^j ∈ R^n.
Write
x_CM = ∫_{R^n} x π(x | y) dx ≈ Σ_{j=1}^{N} w_j x^j,
where w_j = π(x^j | y).
Importance sampling: Generate the sample S randomly.
Simple but inefficient (in particular when n is large).
A better idea: Generate the sample using the density π(x | y).
Ideal case: The points x^j are distributed according to the density π(x | y), and
x_CM = ∫_{R^n} x π(x | y) dx ≈ (1/N) Σ_{j=1}^{N} x^j.
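The two sampling strategies can be compared on a toy problem (an illustration added here; the one-dimensional posterior N(2.0, 0.5²), for which x_CM = 2.0 exactly, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy posterior pi(x | y) = N(2.0, 0.5^2), so x_CM = 2.0 exactly.
def post(x):
    return np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)

# Importance sampling: uniform proposal on [-5, 5], weights w_j ~ pi(x_j | y).
xs = rng.uniform(-5.0, 5.0, size=200_000)
w = post(xs)
x_cm_is = np.sum(w * xs) / np.sum(w)

# Ideal case: points drawn from the posterior itself, plain average.
xs_post = rng.normal(2.0, 0.5, size=200_000)
x_cm_mc = xs_post.mean()

print(x_cm_is, x_cm_mc)  # both close to 2.0
```

The uniform proposal wastes most of its points where the posterior is negligible; in high dimensions this waste becomes catastrophic, which is why sampling from π(x | y) itself is preferred.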
Markov chain Monte Carlo methods (MCMC): Generate the sample sequentially,
x^0 → x^1 → . . . → x^j → x^{j+1} → . . . → x^N.
Idea: Define a transition probability P(x^j, B_{j+1}),
P(x^j, B_{j+1}) = P(X^{j+1} ∈ B_{j+1} | X^j = x^j).
Assuming that X^j has probability density π_j(x^j),
P(X^{j+1} ∈ B_{j+1}) = ∫_{R^n} P(x^j, B_{j+1}) π_j(x^j) dx^j = π_{j+1}(B_{j+1}).
Choose the transition kernel so that π(x | y) is invariant measure:
∫_B π(x | y) dx = ∫_{R^n} P(x′, B) π(x′ | y) dx′.
Then all the variables Xj are distributed according to π(x | y).
Best known algorithms:
Metropolis-Hastings, Gibbs sampler.
Gibbs sampler: Update one component at a time as follows:
Given x^j = [x^j_1, x^j_2, . . . , x^j_n],
draw x^{j+1}_1 from t ↦ π(t, x^j_2, . . . , x^j_n | y),
draw x^{j+1}_2 from t ↦ π(x^{j+1}_1, t, x^j_3, . . . , x^j_n | y),
...
draw x^{j+1}_n from t ↦ π(x^{j+1}_1, x^{j+1}_2, . . . , x^{j+1}_{n−1}, t | y).
Define a cost function Ψ : Rn × Rn → R.
The Bayes cost of an estimator x̂ = x̂(y) is defined as
B(x̂) = E{Ψ(X, x̂(Y))} = ∫∫ Ψ(x, x̂(y)) π(x, y) dx dy.
Further, we can write
B(x̂) = ∫ [∫ Ψ(x, x̂) π(y | x) dy] πpr(x) dx = ∫ B(x̂ | x) πpr(x) dx = E{B(x̂ | x)},
where
B(x̂ | x) = ∫ Ψ(x, x̂) π(y | x) dy
is the conditional Bayes cost.
The Bayes cost method: Fix Ψ and define the estimator x̂_B so that
B(x̂_B) ≤ B(x̂)
for all estimators x̂ of x.
By Bayes formula,
B(x̂) = ∫ [∫ Ψ(x, x̂) π(x | y) dx] π(y) dy.
Since π(y) ≥ 0 and x̂(y) depends only on y,
x̂_B(y) = arg min ∫ Ψ(x, x̂) π(x | y) dx = arg min E{Ψ(x, x̂) | y}.
Mean square error criterion: Choose Ψ(x, x̂) = ‖x − x̂‖², giving
B(x̂) = E{‖X − X̂‖²} = trace(corr(X − X̂)),
where X̂ = x̂(Y), and
corr(X − X̂) = E{(X − X̂)(X − X̂)^T} ∈ R^{n×n}.
This Bayes estimator is called the mean square estimator x̂_MS. We have
x̂_MS = ∫ x π(x | y) dx = x_CM.
We have
E{‖X − x̂‖² | y} = E{‖X‖² | y} − 2 E{X | y}^T x̂ + ‖x̂‖²
= E{‖X‖² | y} − ‖E{X | y}‖² + ‖E{X | y} − x̂‖²
≥ E{‖X‖² | y} − ‖E{X | y}‖²,
and the equality holds only if
x̂(y) = E{X | y} = x_CM.
Furthermore,
E{X − x_CM} = E{X − E{X | y}} = 0.
Question: xCM is optimal, but is it informative?
[Figure: two one-dimensional posterior densities, panels (a) and (b), with the CM and MAP estimates marked.]
No estimate is foolproof. Optimality is subjective.
DISCRETIZED MODELS
Consider a linear model with additive noise,
y = Af + e, f ∈ H, y, e ∈ Rm.
Discretization, e.g. by collocation,
xn = [f(p1); f(p2); . . . ; f(pn)] ∈ Rn,
Af ≈ Anxn, An ∈ Rm×n.
Assume that the discretization scheme is convergent,
lim_{n→∞} ‖Af − A_n x_n‖ = 0.
Accurate discrete model:
y = ANxN + e, ‖ANxN −Af‖ < tol .
Stochastic extension: Y = A_N X_N + E,
where Y , XN and E are random variables.
Passing into a coarse mesh. Possible reasons:
1. 2D and 3D applications, problems too large
2. Real time applications
3. Inverse modelling based on prescribed meshing
Coarse mesh model with n < N ,
Af ≈ Anxn, ‖Anxn −Af‖ > tol .
Stochastic extension of the simple reduced model is
Y = AnXn + E.
Inverse crime:
• Write
Y = A_n X_n + E, (1)
and develop the inversion scheme based on this model,
• generate data with the simple reduced model and test the inversion method with these data.
Usually, inverse crime results are overly optimistic.
Questions:
1. How to model the discretization error?
2. How to model the prior information?
3. Is the inverse crime always significant?
PRIOR MODELLING
Assume a Gaussian model,
X^N ∼ N(x^N_0, Γ^N),
i.e., the prior density is
πpr(x^N) ∝ exp(−(1/2)(x^N − x^N_0)^T (Γ^N)^{−1} (x^N − x^N_0)).
Projection (e.g. interpolation, averaging or downsampling),
P : R^N → R^n, X^N ↦ X^n.
Then,
E{X^n} = E{P X^N} = P E{X^N} = P x^N_0,
E{X^n (X^n)^T} = E{P X^N (X^N)^T P^T} = P E{X^N (X^N)^T} P^T,
and therefore,
X^n ∼ N(x^n_0, Γ^n) = N(P x^N_0, P Γ^N P^T).
However, this is not what we normally do!
Example: H = continuous functions on [0, 1].
Discretization by multiresolution bases. Let
ϕ(t) = 1, if 0 ≤ t < 1; 0, if t < 0 or t ≥ 1.
Define V^j, 0 ≤ j < ∞, V^j ⊂ V^{j+1},
V^j = span{ϕ^j_k | 1 ≤ k ≤ 2^j},
where
ϕ^j_k(t) = 2^{j/2} ϕ(2^j t − k + 1).
Discrete representation,
f^j(t) = Σ_{k=1}^{2^j} x^j_k ϕ^j_k(t) ∈ V^j.
Projector P : x^j ↦ x^{j−1},
P = (1/√2) I_{2^{j−1}} ⊗ [1 1]
  = (1/√2) [ 1 1 0 0 . . . 0 0 ; 0 0 1 1 . . . 0 0 ; . . . ; 0 0 0 0 . . . 1 1 ] ∈ R^{2^{j−1} × 2^j}.
Assume the prior information f ∈ C^2_0([0, 1]).
Second order smoothness prior of X^N, N = 2^j:
πpr(x^N) ∝ exp(−(α/2)‖L^N x^N‖²) = exp(−(1/2)(x^N)^T [α(L^N)^T L^N] x^N),
where
L^N = 2^{2j} tridiag(1, −2, 1) ∈ R^{N×N},
the scaled second difference matrix.
The prior covariance is
Γ^N = [α(L^N)^T L^N]^{−1}.
Passing to level n = 2^{j−1} = N/2:
L^n = 2^{2(j−1)} tridiag(1, −2, 1) = P L^N P^T ∈ R^{n×n}.
Natural candidate for the smoothness prior for Xn is
πpr(x^n) ∝ exp(−(α/2)‖L^n x^n‖²) = exp(−(1/2)(x^n)^T [α(L^n)^T L^n] x^n).
But this is inconsistent, since
Γ^n = [α(L^n)^T L^n]^{−1} ≠ P [α(L^N)^T L^N]^{−1} P^T = P Γ^N P^T.
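The inconsistency can be checked numerically. The sketch below (an illustration added here; it uses the Haar projector and the second-difference matrix defined above, with α = 1 and level j = 6 assumed) compares the two coarse-level covariances:

```python
import numpy as np

def second_diff(j, size):
    # 2^(2j) * tridiag(1, -2, 1): the scaled second difference matrix.
    return 4.0**j * (-2.0 * np.eye(size) + np.eye(size, k=1) + np.eye(size, k=-1))

j = 6                      # fine level, N = 2^j
N, n = 2**j, 2**(j - 1)
alpha = 1.0

# Haar projector onto the coarse level: pairwise sums scaled by 1/sqrt(2).
P = np.kron(np.eye(n), np.array([[1.0, 1.0]])) / np.sqrt(2.0)

LN = second_diff(j, N)
Ln = second_diff(j - 1, n)
Gamma_n = np.linalg.inv(alpha * Ln.T @ Ln)                # coarse-mesh prior
Gamma_proj = P @ np.linalg.inv(alpha * LN.T @ LN) @ P.T   # projected prior

# The two coarse-level covariances differ substantially.
rel_diff = np.linalg.norm(Gamma_n - Gamma_proj) / np.linalg.norm(Gamma_proj)
print(rel_diff)
```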
Numerical example:
Af(t) = ∫_0^1 K(t − s) f(s) ds, K(s) = e^{−κs²},
where κ = 15. Sampling:
y_j = Af(t_j) + e_j, t_j = (j − 1/2)/50, 1 ≤ j ≤ 50,
and
E ∼ N(0, σ²I), σ = 2% of max_j (Af(t_j)).
Smoothness prior
πpr(x^N) ∝ exp(−(α/2)‖L^N x^N‖²), N = 512.
Reduced model with n = 8.
Figure 1: MAP estimate with N = 512, n = 8. Black dots correspond to Γ^n, red dots to P Γ^N P^T.
DISCRETIZATION ERROR
From fine mesh to coarse mesh: Complete error model
Y = A_N X^N + E    (2)
  = A_n X^n + (A_N − A_n P) X^N + E
  = A_n X^n + E_discr + E.
Error covariance: Assume that E and X^N are mutually independent,
E ∼ N(0, Γe), X^N ∼ N(x^N_0, Γ^N).
The complete error Ẽ = E_discr + E is Gaussian,
Ẽ ∼ N(e_0, Γ̃e),
where
e_0 = (A_N − A_n P) x^N_0,
Γ̃e = (A_N − A_n P) Γ^N (A_N − A_n P)^T + Γe.
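The complete error covariance can be sketched numerically (an illustration added here; the convolution-type forward maps on the two meshes, the pairwise-averaging P and the white-noise prior are assumptions, not the lectures' example):

```python
import numpy as np

m, N = 30, 64
n = N // 2

# Hypothetical smoothing forward maps on the fine and coarse meshes.
t = np.linspace(0, 1, m)[:, None]
s_fine = np.linspace(0, 1, N)[None, :]
s_coarse = np.linspace(0, 1, n)[None, :]
AN = np.exp(-15 * (t - s_fine) ** 2) / N
An = np.exp(-15 * (t - s_coarse) ** 2) / n

# Pairwise-averaging projection from the fine mesh to the coarse mesh.
P = np.kron(np.eye(n), np.array([[0.5, 0.5]]))

Gamma_N = np.eye(N)           # simple white-noise prior on the fine mesh
Gamma_e = 1e-4 * np.eye(m)    # measurement noise covariance

D = AN - An @ P               # discretization error operator
Gamma_tilde = D @ Gamma_N @ D.T + Gamma_e   # complete error covariance

var_discr = np.trace(D @ Gamma_N @ D.T)
var_noise = np.trace(Gamma_e)
print(var_discr, var_noise)   # which term dominates the complete error?
```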
Error variance:
var(Ẽ) = E{‖Ẽ − e_0‖²} = E{‖E_discr − e_0‖²} + E{‖E‖²}
= trace((A_N − A_n P) Γ^N (A_N − A_n P)^T) + trace(Γe)
= var(E_discr) + var(E).
The complete error model is noise dominated if
var(E_discr) < var(E),
and modelling error dominated if
var(E_discr) > var(E).
Enhanced error model: Use the likelihood and prior
π(y | x^n) ∝ exp(−(1/2)(y − A_n x^n − y_0)^T Γ̃e^{−1} (y − A_n x^n − y_0)),
πpr(x^n) ∝ exp(−(1/2)(x^n − x^n_0)^T (Γ^n_pr)^{−1} (x^n − x^n_0)),
where
y_0 = E{Y} = A_n E{X^n} + e_0 = A_n P x^N_0 + (A_N − A_n P) x^N_0 = A_N x^N_0.
The MAP estimate, denoted by x^n_eem, is
x^n_eem = argmin { ‖L^n_pr(x^n − x^n_0)‖² + ‖L_e(A_n x^n − (y − y_0))‖² }
= argmin ‖ [ L^n_pr ; L_e A_n ] x^n − [ L^n_pr x^n_0 ; L_e(y − y_0) ] ‖²,
where L^n_pr and L_e are Cholesky factors of the inverse covariances,
(L^n_pr)^T L^n_pr = (Γ^n_pr)^{−1}, L_e^T L_e = Γ̃e^{−1}.
This leads to a normal equation of size n× n.
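The stacked least-squares form can be sketched as follows (an illustration added here; the model, prior and complete-error statistics are arbitrary stand-ins, and x^n_0 = 0 is taken so that e_0 = y_0). The sketch also checks agreement with the n × n normal equations:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 20

# Hypothetical coarse-mesh model, prior and complete-error statistics.
An = rng.standard_normal((m, n))
x0 = np.zeros(n)
Gamma_pr = np.eye(n)
e0 = 0.1 * np.ones(m)            # mean of the complete error
Gamma_te = 0.01 * np.eye(m)      # complete error covariance

# Whitening factors: Lpr^T Lpr = Gamma_pr^{-1}, Le^T Le = Gamma_te^{-1}.
Lpr = np.linalg.cholesky(np.linalg.inv(Gamma_pr)).T
Le = np.linalg.cholesky(np.linalg.inv(Gamma_te)).T

y = An @ rng.standard_normal(n) + e0 + 0.1 * rng.standard_normal(m)

# Stacked least-squares form of the MAP estimate: one (m + n) x n problem.
K = np.vstack([Lpr, Le @ An])
b = np.concatenate([Lpr @ x0, Le @ (y - e0)])
x_eem, *_ = np.linalg.lstsq(K, b, rcond=None)

# Equivalent n x n normal equations.
H = Lpr.T @ Lpr + An.T @ (Le.T @ Le) @ An
g = Lpr.T @ (Lpr @ x0) + An.T @ (Le.T @ Le) @ (y - e0)
x_ne = np.linalg.solve(H, g)
print(np.max(np.abs(x_eem - x_ne)))  # agree to rounding error
```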
Note: The enhanced error model is not the complete error model, because X^n is correlated with the complete error Ẽ through X^N.
Complete error model: Assume, for a while, that X^n and Y are zero mean. We have
X^n = P X^N, Y = A_N X^N + E.
The variable Z = [X^n; Y] is Gaussian, with mean zero and covariance
E{Z Z^T} = [ E{X^n(X^n)^T}  E{X^n Y^T} ; E{Y(X^n)^T}  E{Y Y^T} ]
= [ P Γ^N P^T   P Γ^N (A_N)^T ; A_N Γ^N P^T   A_N Γ^N (A_N)^T + Γe ].
From this, calculate the conditional density π(xn | y).
π(x^n | y) ∼ N(x^n_cem, Γ^n_cem),
where
x^n_cem = P x^N_0 + P Γ^N (A_N)^T [A_N Γ^N (A_N)^T + Γe]^{−1} (y − A_N x^N_0),
and
Γ^n_cem = P Γ^N P^T − P Γ^N (A_N)^T [A_N Γ^N (A_N)^T + Γe]^{−1} A_N Γ^N P^T.
Note: The computation of x^n_cem requires solving an m × m system, independently of n. (Compare to x^n_eem.)
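A sketch of the complete error model estimate (an illustration added here, with random stand-in matrices and zero means as above):

```python
import numpy as np

rng = np.random.default_rng(6)
m, N = 25, 40
n = N // 2

# Hypothetical fine-mesh model, prior, projection and noise (zero means).
AN = rng.standard_normal((m, N)) / N
P = np.kron(np.eye(n), np.array([[0.5, 0.5]]))   # pairwise averaging
Gamma_N = np.eye(N)
Gamma_e = 0.01 * np.eye(m)

xN = rng.standard_normal(N)                       # a "true" fine-mesh unknown
y = AN @ xN + 0.1 * rng.standard_normal(m)

# Conditional mean and covariance of X^n = P X^N given Y = y:
# only an m x m system is solved, independently of n.
S = AN @ Gamma_N @ AN.T + Gamma_e
K = P @ Gamma_N @ AN.T @ np.linalg.inv(S)
x_cem = K @ y
Gamma_cem = P @ Gamma_N @ P.T - K @ AN @ Gamma_N @ P.T
```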
Example: Full angle tomography.
X−ray source
Detector
Figure 2: True object and the discretized model.
Intensity decrease along a line segment d`:
dI = −Iµd`,
where µ = µ(p) ≥ 0, p ∈ Ω is the mass absorption.
Let I0 be the intensity of the transmitted X-ray.
The received intensity I satisfies
log(I/I_0) = ∫_{I_0}^{I} dI/I = −∫_ℓ μ(p) dℓ(p).
Inverse problem of X-ray tomography: Estimate μ : Ω → R_+ from the values of its integrals along a set of straight lines passing through Ω.
Figure 3: Sinogram data.
Gaussian structural smoothness prior: Three weakly correlated subregions. Inside each region, pixels are mutually correlated.
Figure 4: Prior geometry
Construction of the prior: Pixel centers pj , 1 ≤ j ≤ N .
Divide the pixels into cliques C_1, C_2 and C_3. In medical imaging, this is called image segmentation.
Define the neighbourhood system N = {N_i | 1 ≤ i ≤ N}, N_i ⊂ {1, 2, . . . , N}, where
j ∈ N_i if and only if pixels p_i and p_j are neighbours and in the same clique.
Define the density of a Markov random field X as
πMRF(x) ∝ exp(−(α/2) Σ_{j=1}^{N} |x_j − c_j Σ_{i∈N_j} x_i|²) = exp(−(α/2) x^T B x),
where the coupling constant c_j depends on the clique.
The matrix B is singular.
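The singularity is easy to verify for a toy one-dimensional chain (an illustration added here; a single clique with c_j = 1/|N_j|, i.e. averaging over the neighbours, is assumed): constant images get zero energy, so B has a nontrivial null space:

```python
import numpy as np

# Hypothetical 1D chain of N pixels; neighbours are the adjacent pixels.
N = 10
W = np.eye(N, k=1) + np.eye(N, k=-1)     # adjacency matrix
c = 1.0 / W.sum(axis=1)                  # c_j = 1 / (number of neighbours)

# x^T B x = sum_j |x_j - c_j * sum_{i in N_j} x_i|^2
M = np.eye(N) - np.diag(c) @ W
B = M.T @ M

# B is singular: a constant image has zero "energy" under this prior.
ones = np.ones(N)
print(np.abs(ones @ B @ ones))           # zero up to rounding
print(np.linalg.matrix_rank(B))          # N - 1 < N
```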
Remedy: Select a few points {p_j | j ∈ I′′}, where I′′ ⊂ I = {1, 2, . . . , N}. Let I′ = I \ I′′.
Denote x = [x′; x′′].
The conditional density πMRF(x′ | x′′) (i.e., x′′ fixed) is a proper measure with respect to x′.
Define
πpr(x) = πMRF(x′ | x′′) π_0(x′′),
where π_0 is Gaussian, e.g.,
π_0 ∼ N(0, γ_0² I).
Figure 5: Four random draws from the prior density.
Data generated in an N = 84 × 84 mesh, inverse solutions computed in an n = 42 × 42 mesh.
Proper data y ∈ Rm and inverse crime data yic ∈ Rm:
y = A_N x^N_true + e, y_ic = A_n P x^N_true + e,
where x^N_true is drawn from the prior density, and e is a realization of
E ∼ N(0, σ²I),
where
σ² = κ m^{−1} trace((A_N − A_n P) Γ^N (A_N − A_n P)^T), 0.1 ≤ κ ≤ 10.
In other words,
0.1 ≤ κ = (noise variance)/(discretization error variance) ≤ 10.
What is the structure of the discretization error? Can we approximate it by Gaussian white noise?
Figure 6: The diagonal and the first off-diagonal of the discretization error covariance.
Error analysis:
1. Draw a sample x^N_1, x^N_2, . . . , x^N_S, S = 500, from the prior density.
2. Choose the noise level σ = σ(κ) and generate data y_1(κ), y_2(κ), . . . , y_S(κ), both proper and inverse crime versions.
3. Calculate the estimates x̂(y_1(κ)), x̂(y_2(κ)), . . . , x̂(y_S(κ)).
4. Estimate the estimation error,
E{‖X − X̂(κ)‖²} ≈ (1/S) Σ_{j=1}^{S} ‖x̂(y_j(κ)) − x_j‖².
Estimators: CM, CM with enhanced error model, and truncated CGNR with the Morozov discrepancy principle, with discrepancy
δ² = τ E{‖E‖²} = τ m σ(κ)², τ = 1.1.
Figure 7: Estimation errors ‖x̂ − x‖² (CG, CG IC, CM, CM Corr) at various noise levels. Dashed line is var(E_discr).
[Figure: reconstructions at error levels 0.0029247, 0.0047491, 0.0060516, 0.0077115 and 0.11093.]
Example: Estimation error: If x̂ = x̂(y) is an estimator, define the relative estimation error as
D(x̂) = E{‖X − X̂‖²} / E{‖X‖²}.
Observe: D(0) = 1, and
D(x_CM) ≤ D(x̂)
for any estimator x̂.
Test case: Limited angle tomography. Reconstructions with truncated singular value decomposition (TSVD) versus the CM estimate.
Calculate D(xTSVD) and D(xCM) by ensemble averaging (S = 500).
TSVD estimate: y = Ax + e.
SVD: A = U D V^T, where
U = [u_1, u_2, . . . , u_m] ∈ R^{m×m}, V = [v_1, v_2, . . . , v_n] ∈ R^{n×n},
and
D = diag(d_1, d_2, . . . , d_min(n,m)) ∈ R^{m×n}, d_1 ≥ d_2 ≥ . . . ≥ d_min(n,m) ≥ 0.
x_TSVD(y, r) = Σ_{j=1}^{r} (1/d_j)(u_j^T y) v_j,
and the truncation parameter r is chosen, e.g., by the Morozov discrepancy principle,
‖y − A x_TSVD(y, r)‖² ≤ τ E{‖E‖²} < ‖y − A x_TSVD(y, r − 1)‖².
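A sketch of TSVD with the Morozov stopping rule (an illustration added here; a random mildly ill-conditioned A and a known noise level σ are assumptions, not the lectures' tomography example):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 30
A = rng.standard_normal((m, n)) @ np.diag(0.9 ** np.arange(n))  # decaying singular values
x_true = rng.standard_normal(n)
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(m)

U, d, Vt = np.linalg.svd(A, full_matrices=False)
tau = 1.1
delta2 = tau * m * sigma**2          # tau * E||E||^2 for E ~ N(0, sigma^2 I)

# Increase r until the residual drops to the discrepancy level.
x = np.zeros(n)
for r in range(1, len(d) + 1):
    x = Vt[:r].T @ ((U[:, :r].T @ y) / d[:r])
    if np.sum((y - A @ x) ** 2) <= delta2:
        break
print(r, np.sum((y - A @ x) ** 2))
```

Stopping at the discrepancy level leaves the small singular values, which mostly amplify noise, out of the reconstruction.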
[Figure: reconstructions and histograms of the squared errors ‖x̂ − x‖² for the CM and TSVD estimates.]
CONCLUSIONS
• The Bayesian approach is useful for incorporating complex prior information into inverse solvers.
• It is not a method for producing a single estimator, although it can be used as a tool for that, too.
• It facilitates error analysis of discretization, modelling and estimation by deterministic methods.
• Working with ensembles makes it possible to analyze non-linear problems as well (e.g. EIT, OAST).