TRANSCRIPT
Statistics for Inverse Problems 101
June 5, 2011
Introduction

• Inverse problem: to recover an unknown object from indirect noisy observations
• Example: deblurring of a signal f
  Forward problem: µ(x) = ∫₀¹ A(x − t) f(t) dt
  Noisy data: yᵢ = µ(xᵢ) + εᵢ, i = 1, …, n
  Inverse problem: recover f(t) for t ∈ [0, 1]
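A discretized sketch of this forward problem. The Gaussian kernel and the toy signal below are illustrative choices, not from the lecture:

```python
import numpy as np

# Discretized sketch of mu(x) = int_0^1 A(x - t) f(t) dt with an assumed
# Gaussian kernel A and a toy piecewise signal f (both hypothetical).
n = 200
t = (np.arange(n) + 0.5) / n
A_kernel = lambda u: np.exp(-0.5 * (u / 0.03) ** 2)            # assumed blur
f = (np.abs(t - 0.3) < 0.1) + 0.5 * (np.abs(t - 0.7) < 0.05)   # toy signal

# Quadrature approximation of the integral at each observation point x_i
mu = np.array([np.mean(A_kernel(x - t) * f) for x in t])
rng = np.random.default_rng(0)
y = mu + 0.01 * rng.standard_normal(n)                          # noisy data y_i
```

Recovering f from y is the (ill-posed) inverse problem discussed below.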
General Model

• A = (Aᵢ), Aᵢ : H → ℝ, f ∈ H
• yᵢ = Aᵢ[f] + εᵢ, i = 1, …, n
• The recovery of f from A[f] is ill-posed
• Errors εᵢ modeled as random
• The estimate f̂ is an H-valued random variable
Basic Question (UQ)

How good is the estimate f̂?
• What is the distribution of f̂?
• Summarize characteristics of the distribution of f̂
  E.g., how 'likely' is it that f̂ will be 'far' from f?
Review: E, Var & Cov

(1) Expected value of a random variable X with pdf f(x):

    E(X) = µ_X = ∫ x f(x) dx

(2) Variance of X: Var(X) = E( (X − µ_X)² )
(3) Covariance of random variables X and Y:

    Cov(X, Y) = E( (X − µ_X)(Y − µ_Y) )

(4) Expected value of a random vector X = (Xᵢ):

    E(X) = µ_X = ( E(Xᵢ) )

(5) Covariance matrix of X:

    Var(X) = E( (X − µ_X)(X − µ_X)ᵗ ) = ( Cov(Xᵢ, Xⱼ) )

(6) For a fixed n × m matrix A:

    E(AX) = A E(X),   Var(AX) = A Var(X) Aᵗ
Example: X = (X₁, …, Xₙ) correlated, Var(X) = σ²Σ. Then

    Var( (X₁ + · · · + Xₙ)/n ) = Var(1ᵗX/n) = (σ²/n²) 1ᵗΣ1
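A quick Monte Carlo check of this identity. The covariance matrix below is an arbitrary positive-definite toy example:

```python
import numpy as np

# Monte Carlo check of Var((X1+...+Xn)/n) = (sigma^2/n^2) 1' Sigma 1
# for correlated X with Var(X) = sigma^2 Sigma (toy Sigma, illustrative).
rng = np.random.default_rng(0)
n, sigma2 = 4, 1.5
B = rng.standard_normal((n, n))
Sigma = B @ B.T / n                        # a positive-definite matrix
L = np.linalg.cholesky(sigma2 * Sigma)

X = rng.standard_normal((200_000, n)) @ L.T   # rows ~ N(0, sigma^2 Sigma)
empirical = X.mean(axis=1).var()              # sample variance of the mean
ones = np.ones(n)
theoretical = sigma2 / n**2 * ones @ Sigma @ ones
```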
Review: least-squares

    y = Aβ + ε

• A is n × m with n > m, and AᵗA has a stable inverse
• E(ε) = 0, Var(ε) = σ²I
• Least-squares estimate of β:

    β̂ = arg min_b ‖y − Ab‖² = (AᵗA)⁻¹Aᵗy

• β̂ is an unbiased estimator of β: E_β( β̂ ) = β ∀β
• Covariance matrix of β̂: Var_β( β̂ ) = σ² (AᵗA)⁻¹
• If ε is Gaussian N(0, σ²I), then β̂ is the MLE and

    β̂ ∼ N( β, σ² (AᵗA)⁻¹ )

• Unbiased estimator of σ²:

    σ̂² = ‖y − Aβ̂‖²/(n − m) = ‖y − ŷ‖²/(n − dof(H)),

  where ŷ = Aβ̂ = Hy, H = A(AᵗA)⁻¹Aᵗ = 'hat matrix'
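A minimal sketch of these formulas on simulated data (all sizes and values are illustrative):

```python
import numpy as np

# Least-squares sketch: beta_hat = (A'A)^{-1} A'y, hat matrix H,
# and the unbiased variance estimate sigma2_hat = ||y - y_hat||^2/(n - m).
rng = np.random.default_rng(1)
n, m, sigma = 100, 3, 0.5
A = rng.standard_normal((n, m))
beta = np.array([1.0, -2.0, 0.5])
y = A @ beta + sigma * rng.standard_normal(n)

AtA_inv = np.linalg.inv(A.T @ A)
beta_hat = AtA_inv @ A.T @ y          # least-squares estimate
H = A @ AtA_inv @ A.T                 # hat matrix: y_hat = H y
y_hat = H @ y
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - m)   # unbiased for sigma^2
cov_beta_hat = sigma2_hat * AtA_inv   # estimated Var(beta_hat)
```

Note that dof(H) = Trace(H) = m here, since H projects onto the column space of A.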
Confidence regions

• 1 − α confidence interval for βᵢ:

    Iᵢ(α): β̂ᵢ ± t_{α/2, n−m} σ̂ √( (AᵗA)⁻¹ )ᵢᵢ

• C.I.'s are pre-data
• Pointwise vs simultaneous 1 − α coverage:
  pointwise: P[ βᵢ ∈ Iᵢ(α) ] = 1 − α ∀i
  simultaneous: P[ βᵢ ∈ Iᵢ(α) ∀i ] = 1 − α
• It is easy to find a 1 − α joint confidence region R_α for β that gives
  simultaneous coverage, and if R^f_α = { f(b) : b ∈ R_α } then
  P[ f(β) ∈ R^f_α ] ≥ 1 − α
Residuals

    r = y − Aβ̂ = y − ŷ

• E(r) = 0, Var(r) = σ²(I − H)
• If ε ∼ N(0, σ²I), then r ∼ N(0, σ²(I − H))
  ⇒ the residuals do not behave like the original errors ε
• Corrected residuals: r̃ᵢ = rᵢ/√(1 − Hᵢᵢ)
• Corrected residuals may be used for model validation
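A sketch of the correction, on a toy straight-line fit (the design and noise level are illustrative):

```python
import numpy as np

# Corrected residuals r_i / sqrt(1 - H_ii) have constant variance sigma^2
# under the model, unlike the raw residuals. Toy data, illustrative only.
rng = np.random.default_rng(2)
n = 50
A = np.column_stack([np.ones(n), np.linspace(0, 1, n)])   # line fit
y = A @ np.array([1.0, 3.0]) + 0.2 * rng.standard_normal(n)

H = A @ np.linalg.inv(A.T @ A) @ A.T
r = y - H @ y                          # raw residuals, Var(r) = sigma^2 (I - H)
h = np.diag(H)                         # leverages H_ii
r_corrected = r / np.sqrt(1.0 - h)     # variance sigma^2 for every i
```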
Review: Bayes theorem

• Conditional probability measure P(· | B):

    P(A | B) = P(A ∩ B)/P(B)  if P(B) ≠ 0

• Bayes theorem: for P(A) ≠ 0,

    P(A | B) = P(B | A) P(A)/P(B)

• For X and Θ with pdf's:

    f_{Θ|X}(θ | x) ∝ f_{X|Θ}(x | θ) f_Θ(θ)

  More generally:

    dµ_{Θ|X}/dµ_Θ ∝ dµ_{X|Θ}/dν

• Basic tools: conditional probability measures and Radon–Nikodym derivatives
• Conditional probabilities can be defined via Kolmogorov's definition of
  conditional expectations or using conditional distributions (e.g., via
  disintegration of measures)
Review: Gaussian Bayesian linear model

    y | β ∼ N(Aβ, σ²I),   β ∼ N(µ, τ²Σ),   µ, σ, τ, Σ known

• Goal: update the prior distribution of β in light of the data y, i.e.,
  compute the posterior distribution of β given y (use Bayes theorem):

    β | y ∼ N(µ_y, Σ_y)

    Σ_y = σ² ( AᵗA + (σ²/τ²)Σ⁻¹ )⁻¹

    µ_y = ( AᵗA + (σ²/τ²)Σ⁻¹ )⁻¹ ( Aᵗy + (σ²/τ²)Σ⁻¹µ )
        = weighted average of µ_ls and µ

• Summarizing the posterior:
  µ_y = E(β | y) = mean and mode of the posterior (MAP)
  Σ_y = Var(β | y) = covariance matrix of the posterior
  1 − α credible interval for βᵢ:

    (µ_y)ᵢ ± z_{α/2} √( Σ_y )ᵢᵢ

  One can also construct a 1 − α joint credible region for β
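The posterior update above is a couple of lines of linear algebra; here is a sketch with illustrative sizes and an identity prior covariance:

```python
import numpy as np

# Gaussian posterior update (toy setup, all values illustrative):
# Sigma_y = sigma^2 (A'A + (sigma^2/tau^2) Sigma^{-1})^{-1}
# mu_y    = (A'A + (sigma^2/tau^2) Sigma^{-1})^{-1}(A'y + (sigma^2/tau^2) Sigma^{-1} mu)
rng = np.random.default_rng(3)
n, m = 30, 4
sigma, tau = 0.3, 1.0
A = rng.standard_normal((n, m))
Sigma = np.eye(m)                      # prior covariance (assumed)
mu = np.zeros(m)                       # prior mean (assumed)
beta = rng.standard_normal(m)
y = A @ beta + sigma * rng.standard_normal(n)

ratio = sigma**2 / tau**2
Sigma_inv = np.linalg.inv(Sigma)
M = np.linalg.inv(A.T @ A + ratio * Sigma_inv)
Sigma_y = sigma**2 * M                                # posterior covariance
mu_y = M @ (A.T @ y + ratio * Sigma_inv @ mu)         # posterior mean (MAP)
```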
Some properties

• µ_y minimizes E‖δ(y) − β‖² among all estimators δ(y)
• µ_y → β̂_ls as σ²/τ² → 0;  µ_y → µ as σ²/τ² → ∞
• µ_y can be used as a frequentist estimate of β:
  – µ_y is biased
  – its variance is NOT obtained by sampling from the posterior
• Warning: a 1 − α credible region MAY NOT have 1 − α frequentist coverage
• If µ, σ, τ or Σ are unknown ⇒ use hierarchical or empirical Bayesian
  methods and MCMC
Parametric reductions for IP

• Transform the infinite-dimensional problem to a finite-dimensional one,
  i.e., A[f] ≈ Aa
• Assumptions:

    y = A[f] + ε = Aa + δ + ε

  Function representation: f(x) = φ(x)ᵗa
  Penalized least-squares estimate:

    â = arg min_b ‖y − Ab‖² + λ² bᵗSb = (AᵗA + λ²S)⁻¹Aᵗy

    f̂(x) = φ(x)ᵗâ
• Example: yᵢ = ∫₀¹ A(xᵢ − t) f(t) dt + εᵢ
  Quadrature approximation:

    ∫₀¹ A(xᵢ − t) f(t) dt ≈ (1/m) Σⱼ₌₁ᵐ A(xᵢ − tⱼ) f(tⱼ)

  Discretized system:

    y = Af + δ + ε,   Aᵢⱼ = A(xᵢ − tⱼ),   fᵢ = f(xᵢ)

  Regularized estimate:

    f̂ = arg min_{g∈ℝᵐ} ‖y − Ag‖² + λ²‖Dg‖² = (AᵗA + λ²DᵗD)⁻¹Aᵗy ≡ Ly

    f̂(xᵢ) = f̂ᵢ = eᵢᵗLy
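A sketch of this regularized deconvolution, with an illustrative Gaussian blur kernel, a first-difference penalty D, and a hand-picked λ (all assumptions, not values from the lecture):

```python
import numpy as np

# Tikhonov-regularized deconvolution on a quadrature grid:
# f_hat = (A'A + lambda^2 D'D)^{-1} A'y = L y. Toy kernel and signal.
rng = np.random.default_rng(4)
m = 100
t = (np.arange(m) + 0.5) / m
f = np.sin(2 * np.pi * t) + 0.5 * np.sin(6 * np.pi * t)   # "true" signal

kernel = lambda u: np.exp(-0.5 * (u / 0.05) ** 2)         # assumed blur
A = kernel(t[:, None] - t[None, :]) / m                   # quadrature matrix
y = A @ f + 0.01 * rng.standard_normal(m)                 # noisy data

D = np.diff(np.eye(m), axis=0)                            # first differences
lam = 1e-3
L = np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)      # f_hat = L y
f_hat = L @ y
```

In practice λ would be chosen by GCV, the discrepancy principle, etc., as discussed later.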
• Example: Aᵢ continuous on a Hilbert space H. Then there exist orthonormal
  φ_k ∈ H, orthonormal v_k ∈ ℝⁿ and non-increasing λ_k ≥ 0 such that
  A[φ_k] = λ_k v_k, and f = f₀ + f₁ with f₀ ∈ Null(A), f₁ ∈ Null(A)⊥;
  f₀ is not constrained by the data, f₁ = Σₖ₌₁ⁿ a_k φ_k
  Discretized system:

    y = Va + ε,   V = ( λ₁v₁ · · · λₙvₙ )

  Regularized estimate:

    â = arg min_{b∈ℝⁿ} ‖y − Vb‖² + λ²‖b‖² ≡ Ly

    f̂(x) = φ(x)ᵗLy,   φ(x) = (φᵢ(x))
• Example: Aᵢ continuous on the RKHS H = W_m([0,1]). Then H = N_{m−1} ⊕ H_m
  and there exist φⱼ ∈ H and κⱼ ∈ H such that

    ‖y − A[f]‖² + λ² ∫₀¹ (f₁⁽ᵐ⁾)² = Σᵢ ( yᵢ − ⟨κᵢ, f⟩ )² + λ²‖P_{H_m} f‖²

    f̂ = Σⱼ âⱼφⱼ + η,   η ∈ span{φ_k}⊥

  Discretized system:

    y = Xa + δ + ε

  Regularized estimate:

    â = arg min_{b∈ℝⁿ} ‖y − Xb‖² + λ² bᵗPb ≡ Ly

    f̂(x) = φ(x)ᵗLy,   φ(x) = (φᵢ(x))
To summarize:

    y = Aa + δ + ε,   f(x) = φ(x)ᵗa

    â = arg min_{b∈ℝⁿ} ‖y − Ab‖² + λ² bᵗPb = (AᵗA + λ²P)⁻¹Aᵗy

    f̂(x) = φ(x)ᵗâ
Bias, variance and MSE of f̂

• Is f̂(x) close to f(x) on average?

    Bias( f̂(x) ) = E( f̂(x) ) − f(x) = φ(x)ᵗB_λa + φ(x)ᵗG_λAᵗδ

• Pointwise sampling variability:

    Var( f̂(x) ) = σ²‖AG_λφ(x)‖²

• Pointwise MSE:

    MSE( f̂(x) ) = Bias( f̂(x) )² + Var( f̂(x) )

• Integrated MSE:

    IMSE( f̂ ) = E ∫ |f̂(x) − f(x)|² dx = ∫ MSE( f̂(x) ) dx
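The MSE = Bias² + Variance decomposition can be checked by simulation for a regularized estimator (the matrices and λ below are an illustrative toy setup):

```python
import numpy as np

# Monte Carlo check of MSE = ||Bias||^2 + total Variance for the
# penalized estimator a_hat = (A'A + lambda^2 D'D)^{-1} A'y.
rng = np.random.default_rng(5)
n, m, sigma, lam = 40, 20, 0.1, 0.5
A = rng.standard_normal((n, m))
a = rng.standard_normal(m)                       # "true" coefficients
D = np.eye(m)
G = np.linalg.inv(A.T @ A + lam**2 * D.T @ D)    # G_lambda
L = G @ A.T                                      # a_hat = L y

mu = A @ a
draws = mu + sigma * rng.standard_normal((5000, n))   # replicated data sets
a_hats = draws @ L.T                                  # replicated estimates

bias = a_hats.mean(axis=0) - a                   # Monte Carlo bias
mse_mc = np.mean(np.sum((a_hats - a) ** 2, axis=1))
var_mc = np.sum(a_hats.var(axis=0))              # summed pointwise variances
```

The decomposition holds exactly for the empirical moments: `mse_mc` equals `sum(bias**2) + var_mc` up to floating-point error.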
Review: are unbiased estimators good?

• They don't have to exist
• If they exist, they may be very difficult to find
• A biased estimator may still be better: bias-variance trade-off

i.e., don't get too attached to them
What about the bias of f̂?

• Upper bounds are sometimes possible
• Geometric information can be obtained from bounds
• Optimization approach: find the max/min of the bias subject to f ∈ S
• Average bias: choose a prior for f and compute the mean bias over the prior
• Example: f in an RKHS H, f(x) = ⟨ρ_x, f⟩:

    Bias( f̂(x) ) = ⟨A_x − ρ_x, f⟩,   |Bias( f̂(x) )| ≤ ‖A_x − ρ_x‖ ‖f‖
Confidence intervals

• If ε is Gaussian and |Bias( f̂(x) )| ≤ B(x), then

    f̂(x) ± z_{α/2} ( σ‖AG_λφ(x)‖ + B(x) )

  is a 1 − α C.I. for f(x)
• Simultaneous 1 − α C.I.'s for E(f̂(x)), x ∈ S: find β such that

    P[ sup_{x∈S} |ZᵗV(x)| ≥ β ] ≤ α,

  Zᵢ iid N(0,1), V(x) = KG_λφ(x)/‖KG_λφ(x)‖, and use results from the
  theory of maxima of Gaussian processes
Residuals

    r = y − Aâ = y − ŷ

• Moments:

    E(r) = −A Bias( â ) + δ,   Var(r) = σ²(I − H_λ)²

  Note: Bias( Aâ ) = A Bias( â ) may be small even if Bias( â ) is significant
• Corrected residuals: r̃ᵢ = rᵢ/(1 − (H_λ)ᵢᵢ)
Estimating σ and λ

• λ can be chosen using the same data y (with GCV, L-curve, discrepancy
  principle, etc.) or independent training data
  This selection introduces an additional error
• If δ ≈ 0, σ² can be estimated as in least-squares:

    σ̂² = ‖y − ŷ‖²/(n − dof(H_λ));

  otherwise one may consider y = µ + ε, µ = A[f], and use methods from
  nonparametric regression
Resampling methods

Idea: if f̂ and σ̂ are 'good' estimates, then one should be able to generate
synthetic data consistent with the actual observations

• If f̂ is a good estimate ⇒ make synthetic data

    y*₁ = A[f̂] + ε*₁, …, y*_b = A[f̂] + ε*_b

  with ε*ᵢ from the same distribution as ε. Use the same estimation
  procedure to get f̂*ᵢ from y*ᵢ:

    y*₁ → f̂*₁, …, y*_b → f̂*_b

  Approximate the distribution of f̂ with that of f̂*₁, …, f̂*_b
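A parametric bootstrap sketch of this idea (kernel, λ and σ are toy choices; here σ is treated as known for brevity):

```python
import numpy as np

# Parametric bootstrap for a regularized estimator: refit on synthetic data
# y* = A f_hat + eps* and use the spread of the refits f_hat* to
# approximate the variability of f_hat. Illustrative toy setup.
rng = np.random.default_rng(6)
n, lam, sigma = 60, 0.1, 0.05
t = np.linspace(0, 1, n)
A = np.exp(-((t[:, None] - t[None, :]) / 0.1) ** 2) / n   # assumed kernel
f = np.sin(2 * np.pi * t)
y = A @ f + sigma * rng.standard_normal(n)

L = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T)
f_hat = L @ y                                             # original estimate

B = 200
f_stars = np.empty((B, n))
for b in range(B):
    y_star = A @ f_hat + sigma * rng.standard_normal(n)   # synthetic data
    f_stars[b] = L @ y_star                               # same procedure

se_pointwise = f_stars.std(axis=0)   # bootstrap pointwise spread of f_hat
```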
Generating ε* similar to ε

• Parametric resampling: ε* ∼ N(0, σ̂²I)
• Nonparametric resampling: sample ε* with replacement from the corrected
  residuals
• Problems:
  – Possibly badly biased and computationally intensive
  – Residuals may also need to be corrected for correlation structure
  – A bad f̂ may lead to misleading results
• Training sets: let {f₁, …, f_k} be a training set (e.g., historical data)
  – For each fⱼ use resampling to estimate MSE(f̂ⱼ)
  – Study the variability of MSE(f̂ⱼ)/‖fⱼ‖
Bayesian inversion

• Hierarchical model (simple Gaussian case):

    y | a, θ ∼ N(Aa + δ, σ²I),   a | θ ∼ F_a(· | θ),   θ ∼ F_θ,   θ = (δ, σ, τ)

• It all reduces to sampling from the posterior F(a | y)
  ⇒ use MCMC methods
• Possible problems:
  – Selection of priors
  – Convergence of MCMC
  – Computational cost of MCMC in high dimensions
  – Interpretation of results
Warning

Parameters can be tuned so that f̂ = E(f | y), but this does not mean that
the uncertainty of f̂ is obtained from the posterior of f given y
Warning

Bayesian and frequentist inversions can both be used for uncertainty
quantification, but:
• The uncertainties have different interpretations in each framework, and
  these should be understood
• Both make assumptions (sometimes stringent) that should be revealed
• The validity of the assumptions should be explored
  ⇒ exploratory analysis and model validation
Exploratory data analysis (by example)

Example: ℓ₂ and ℓ₁ estimates

    data = y = Af + ε = AWβ + ε

    f̂_ℓ₂ = arg min_g ‖y − Ag‖₂² + λ²‖Dg‖₂²,   λ: from GCV

    f̂_ℓ₁ = Wβ̂,   β̂ = arg min_b ‖y − AWb‖₂² + λ‖b‖₁,   λ: from the discrepancy principle

    E(ε) = 0,   Var(ε) = σ²I,   σ̂ = smoothing spline estimate
Example: y, Af & f̂ [figure]
Tools: robust simulation summaries

• Sample mean and variance of x₁, …, x_m:

    X̄ = (x₁ + · · · + x_m)/m,   S² = (1/m) Σᵢ (xᵢ − X̄)²

• Recursive formulas:

    X̄ₙ₊₁ = X̄ₙ + (xₙ₊₁ − X̄ₙ)/(n + 1)

    Tₙ₊₁ = Tₙ + n(xₙ₊₁ − X̄ₙ)²/(n + 1),   Sₙ² = Tₙ/n

• Not robust
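The recursive updates above (Welford-style) can be sketched as follows, on arbitrary toy data:

```python
import numpy as np

# Recursive mean/variance updates: one pass, no storage of past samples.
rng = np.random.default_rng(7)
x = rng.standard_normal(1000)

mean, T = 0.0, 0.0
for n, xi in enumerate(x):            # n = number of points seen so far
    delta = xi - mean
    mean += delta / (n + 1)           # X_{n+1} = X_n + (x_{n+1} - X_n)/(n+1)
    T += n * delta**2 / (n + 1)       # T_{n+1} = T_n + n (x_{n+1} - X_n)^2/(n+1)
var = T / len(x)                      # S_n^2 = T_n / n
```

The results agree with the batch formulas up to floating-point error.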
Robust measures

• Median and median absolute deviation from the median (MAD):

    X̃ = median{x₁, …, x_m}

    MAD = median{ |x₁ − X̃|, …, |x_m − X̃| }

• Approximate recursive formulas with good asymptotics are available
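A quick illustration of why these are preferred for simulation summaries: one wild outlier barely moves the median/MAD but wrecks the mean/standard deviation (toy data):

```python
import numpy as np

# Robustness of median/MAD vs mean/std under a single gross outlier.
rng = np.random.default_rng(8)
x = rng.standard_normal(99)
x_out = np.append(x, 1e6)               # contaminate with one outlier

med = np.median(x_out)
mad = np.median(np.abs(x_out - med))    # MAD about the median

assert abs(med) < 1.0 and mad < 2.0           # robust summaries stay put
assert x_out.mean() > 1e3 and x_out.std() > 1e3   # non-robust ones blow up
```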
Stability of f̂_ℓ₂ [figure]

Stability of f̂_ℓ₁ [figure]
Tools: trace estimators

    Trace(H) = Σᵢ Hᵢᵢ = Σᵢ eᵢᵗHeᵢ

• Expensive for large H
• Randomized estimator: H symmetric n × n, U₁, …, U_m with Uᵢⱼ iid
  Unif{−1, 1}. Define

    T_m(H) = (1/m) Σᵢ₌₁ᵐ UᵢᵗHUᵢ

• E( T_m(H) ) = Trace(H)
• Var( T_m(H) ) = (2/m)[ Trace(H²) − ‖diag(H)‖² ]
Some bounds

• Relative variance:

    V_r(T_m) = Var( T_m(H) )/Trace(H)² ≤ (2/(mn)) Sσ²/σ̄²,

    σ̄ = (1/n) Σᵢ σᵢ,   Sσ² = (1/n) Σᵢ (σᵢ − σ̄)²

• Concentration: for any t > 0,

    P( T_m(H) ≥ Trace(H)(1 + t) ) ≤ e^{−mt²/4V_r(T₁)}

    P( T_m(H) ≤ Trace(H)(1 − t) ) ≤ e^{−mt²/4V_r(T₁)}

• Example: the 'hat matrix' H in least-squares projects onto a
  k-dimensional space:

    V_r(T_m) ≤ (2/(mk))(1 − k/n),   P( T_m(H) ≥ Trace(H)(1 + t) ) ≤ e^{−mkt²/8}

• Example: approximating the determinant of A > 0:

    log( Det(A) ) = Trace( log(A) ) ≈ (1/m) Σᵢ Uᵢᵗ log(A)Uᵢ

  log(A)U requires only matrix-vector products with A
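A sketch of the randomized log-determinant estimate. For clarity, the matrix logarithm is formed explicitly via an eigendecomposition here; in practice log(A)U would be applied with matrix-vector products only (e.g., via Lanczos):

```python
import numpy as np

# log det(A) = Trace(log A) ~ (1/m) sum_i U_i' log(A) U_i  (toy matrix).
rng = np.random.default_rng(10)
n, m = 100, 1000
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                   # positive-definite test matrix

w, Q = np.linalg.eigh(A)
logA = (Q * np.log(w)) @ Q.T                  # matrix logarithm of A

U = rng.choice([-1.0, 1.0], size=(m, n))      # Rademacher probes
est = np.mean(np.einsum('ij,ij->i', U @ logA, U))

exact = np.linalg.slogdet(A)[1]               # reference value
```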
CI for bias of Af̂_ℓ₂

• A good estimate of f ⇒ a good estimate of Af
• y: a direct observation of Af; less sensitive to λ
• 1 − α CIs for the bias of Af̂ ( ŷ = Hy ):

    (yᵢ − ŷᵢ)/(1 − Hᵢᵢ) ± z_{α/2} σ̂ √( 1 + ((H²)ᵢᵢ − Hᵢᵢ²)/(1 − Hᵢᵢ)² )

    (yᵢ − ŷᵢ)/(Trace(I − H)/n) ± z_{α/2} σ̂ / √(Trace(I − H)/n)
95% CIs for bias [figure]

Coverage of 95% CIs for bias [figure]
Bias bounds for f̂_ℓ₂

• For fixed λ:

    Var( f̂_ℓ₂ ) = σ² G(λ)⁻¹AᵗA G(λ)⁻¹,   Bias( f̂_ℓ₂ ) = E( f̂_ℓ₂ ) − f = B(λ),

    G(λ) = AᵗA + λ²DᵗD,   B(λ) = −λ²G(λ)⁻¹DᵗDf

• Median bias: Bias_M( f̂_ℓ₂ ) = median( f̂_ℓ₂ ) − f
• For λ estimated:

    Bias_M( f̂_ℓ₂ ) ≈ median[ B(λ̂) ]

• Hölder's inequality:

    |B(λ)ᵢ| ≤ λ² U_{p,i}(λ) ‖Df‖_q,   |B(λ)ᵢ| ≤ λ² U^w_{p,i}(λ) ‖β‖_q,

    U_{p,i}(λ) = ‖DG(λ)⁻¹eᵢ‖_p  or  U^w_{p,i}(λ) = ‖WDᵗDG(λ)⁻¹eᵢ‖_p

• Plots of U_{p,i}(λ) and U^w_{p,i}(λ) vs xᵢ [figure]
Assessing fit

• Compare y to Af̂ + ε
• Parametric/nonparametric bootstrap samples y*:
  (i) (parametric) ε*ᵢ ∼ N(0, σ̂²I), or (nonparametric) sample from the
      corrected residuals r_c
  (ii) y*ᵢ = Af̂ + ε*ᵢ
• Compare statistics of y and {y*₁, …, y*_B} (blue, red)
• Compare statistics of r_c and N(0, σ̂²I) (black)
• Compare parametric vs nonparametric results (red)
• Example: simulations for Af̂_ℓ₂ with:
  y⁽¹⁾ good,  y⁽²⁾ with skewed noise,  y⁽³⁾ biased
Example

    T₁(y) = min(y)
    T₂(y) = max(y)
    T_k(y) = (1/n) Σᵢ₌₁ⁿ ( yᵢ − mean(y) )ᵏ / std(y)ᵏ   (k = 3, 4)
    T₅(y) = sample median(y)
    T₆(y) = MAD(y)
    T₇(y) = 1st sample quartile of y
    T₈(y) = 3rd sample quartile of y
    T₉(y) = # runs above/below median(y)

    P̂ⱼ = P( |Tⱼ(y*)| ≥ |Tⱼᵒ| ),   Tⱼᵒ = Tⱼ(y)
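A sketch of one of these bootstrap checks, using the sample skewness T₃; the "observed" data are deliberately given skewed noise so the check should flag them (everything here is an illustrative toy setup):

```python
import numpy as np

# Bootstrap goodness-of-fit: compare the observed statistic T3(y) with its
# distribution over synthetic data sets generated under the fitted model.
rng = np.random.default_rng(11)

def T3(y):
    """Sample skewness: (1/n) sum (y_i - mean)^3 / std^3."""
    return np.mean((y - y.mean()) ** 3) / y.std() ** 3

n, sigma = 200, 1.0
mu = np.zeros(n)                               # stand-in for A f_hat
y_obs = mu + rng.gamma(2.0, 1.0, n) - 2.0      # skewed "observed" noise

B = 2000
T_star = np.array([T3(mu + sigma * rng.standard_normal(n)) for _ in range(B)])
p_hat = np.mean(np.abs(T_star) >= abs(T3(y_obs)))   # bootstrap p-value
```

A small `p_hat` indicates that the observed data are inconsistent with the assumed Gaussian noise, as intended in this example.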
Assessing fit: Bayesian case

• Hierarchical model:

    y | f, σ, γ ∼ N(Af, σ²I)
    f | γ ∝ exp( −fᵗDᵗDf/2γ² )
    (σ, γ) ∼ π

  with

    E(f | y) = ( AᵗA + (σ²/γ²)DᵗD )⁻¹Aᵗy

• Posterior mean = f̂_ℓ₂ for γ = σ/λ
• Sample (f*, σ*) from the posterior of (f, σ) | y ⇒ simulated data
  y* = Af* + σ*ε* with ε* ∼ N(0, I)
• Explore the consistency of y with the simulated y*
Simulating f

• Modify the simulations from y*ᵢ = Af̂ + ε*ᵢ to

    y*ᵢ = Afᵢ + ε*ᵢ,   fᵢ 'similar' to f

• Frequentist 'simulation' of f? One option: take the fᵢ from historical
  data (a training set)
• Check the relative MSE for the different fᵢ
Bayesian or frequentist?

• The results have different interpretations; the two frameworks assign
  different meanings to probability
• Frequentists can use Bayesian methods to derive procedures; Bayesians can
  use frequentist methods to evaluate procedures
• Consistent results for small problems and lots of data
• Possibly very different results for complex large-scale problems
• Theorems do not address the relation of theory to reality
Personal view

• Neither is a substitute for detective work, common sense and honesty
• Some say that Bayesian is the only way, others that frequentist is the
  best way, while those with vision see …
Application: experimental design

Classic framework:

    yᵢ = f(xᵢ)ᵗθ + εᵢ  ⇒  y = Fθ + ε

• Experimental configurations xᵢ ∈ X ⊂ ℝᵖ
• θ ∈ ℝᵐ unknown parameters
• εᵢ iid with zero mean and variance σ²
• Experimental design question: how many times should each configuration xᵢ
  be used to get the 'best' estimate of θ? I.e., the variance σᵢ² for each
  selected xᵢ is controlled by averaging independent replications
• More general approach: control the σᵢ² to get the 'best' estimate of θ
  (i.e., not necessarily by averaging). Let σ_{t,i} be the 'optimal' target
  variance for xᵢ and set the constraint

    Σᵢ σᵢ²/σ²_{t,i} = N

  for fixed N ∈ ℝ. Define wᵢ = σᵢ²/(Nσ²_{t,i}) ⇒

    wᵢ ≥ 0,   Σᵢ wᵢ = 1,   σ²_{t,i} = σᵢ²/(Nwᵢ)
• i-th experimental configuration: (xᵢ, wᵢ)
• Experiment ξ: {x₁, …, xₙ, w₁, …, wₙ}
• Well-posed case: FᵗF = Σᵢ f(xᵢ)f(xᵢ)ᵗ is well conditioned:

    θ̂ = arg min_θ Σᵢ wᵢ( yᵢ − f(xᵢ)ᵗθ )² = ( FᵗWF )⁻¹FᵗWy,   W = diag(wᵢ)

  θ̂ is unbiased with

    Var(θ̂) = σ² ( FᵗWF )⁻¹ = σ² M(ξ)⁻¹,

    M(ξ) = information matrix = FᵗWF = Σᵢ wᵢ f(xᵢ)f(xᵢ)ᵗ

  Write D(ξ) = M(ξ)⁻¹
• Optimization problem:

    min_w Ψ( D(ξ) )  s.t.  w ≥ 0, Σ wᵢ = 1

  D-design: Ψ = det
  A-design: Ψ = Trace ⇒ minimization of MSE(θ̂)
  The optimal wᵢ are independent of σ and N
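An A-optimal design of this kind can be computed numerically; here is a sketch for a toy quadratic-regression design (the configurations and the use of SciPy's SLSQP solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# A-optimal design sketch: minimize Trace((F' W F)^{-1}) over weights w
# on the probability simplex, for a toy quadratic regression design.
x = np.linspace(-1, 1, 7)
F = np.column_stack([np.ones_like(x), x, x**2])   # f(x) = (1, x, x^2)

def a_criterion(w):
    M = F.T @ (w[:, None] * F)        # information matrix F' diag(w) F
    return np.trace(np.linalg.inv(M))

n = len(x)
w0 = np.full(n, 1.0 / n)              # start from the uniform design
res = minimize(a_criterion, w0, method='SLSQP',
               bounds=[(1e-9, 1)] * n,
               constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}])
w_opt = res.x                         # A-optimal weights
```

The optimized weights should do at least as well as the uniform design under the A-criterion.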
Ill-posed problems

• 1D magnetotelluric example:

    dⱼ = ∫₀¹ exp(−αⱼx) cos(γⱼx) m(x) dx + εⱼ,   j = 1, …, n

  γⱼ: recording frequencies
  αⱼ: determined by the frequencies and the background conductivity
  m(x): conductivity
• Goal: choose the γⱼ, αⱼ to determine a smooth m consistent with the data
Ill-posed problems

• Ill-conditioning of F requires regularization
• Regularized estimate:

    θ̂ = arg min_θ Σᵢ wᵢ[ yᵢ − f(xᵢ)ᵗθ ]² + λ²‖Dθ‖²
       = ( FᵗWF + λ²DᵗD )⁻¹FᵗWy

• MSE( θ̂ ) depends on the unknown θ ⇒ control an 'average' MSE
• Use prior moment conditions:

    E(y | θ, γ) = Fθ,   Var(y | θ, γ) = σ²W⁻¹/N
    E(θ | γ) = θ₀,   Var(θ | γ) = γ²Σ,   E(γ²) = µ²_γ,

  with σ and Σ known and γ = σ/λ
• Affine estimate minimizing the Bayes risk, with δ² = E(1/λ²):

    θ̂ = ( Nδ² FᵗWF + Σ⁻¹ )⁻¹( Nδ² FᵗWy + Σ⁻¹θ₀ )

    E[ (θ̂ − θ)(θ̂ − θ)ᵗ ] = (τσ²/N)( τFᵗWF + (1 − τ)Σ⁻¹ )⁻¹,   τ ∈ (0, 1)

• A-design optimization:

    min_w Trace( τFᵗWF + (1 − τ)Σ⁻¹ )⁻¹  s.t.  w ≥ 0, Σ wᵢ = 1

• The optimal wᵢ now depend on σ and N
Readings

Basic Bayesian methods:
• Carlin, BP & Louis, TA, 2008, Bayesian Methods for Data Analysis (3rd Edn.), Chapman & Hall
• Gelman, A, Carlin, JB, Stern, HS & Rubin, DB, 2003, Bayesian Data Analysis (2nd Edn.), Chapman & Hall

Bayesian methods for inverse problems:
• Calvetti, D & Somersalo, E, 2007, Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, Springer
• Kaipio, J & Somersalo, E, 2004, Statistical and Computational Inverse Problems, Springer

Discussion of frequentist vs Bayesian statistics:
• Freedman, D, 1995, Some issues in the foundation of statistics, Foundations of Science, 1, 19–39; rejoinder pp. 69–83

Tutorial on frequentist and Bayesian methods for inverse problems:
• Stark, PB & Tenorio, L, 2011, A primer of frequentist and Bayesian inference in inverse problems. In Computational Methods for Large-Scale Inverse Problems and Quantification of Uncertainty, pp. 9–32, Wiley

Statistics and inverse problems:
• Evans, SN & Stark, PB, 2003, Inverse problems as statistics, Inverse Problems, 18, R55
• O'Sullivan, F, 1986, A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502–527
• Tenorio, L, Andersson, F, de Hoop, M & Ma, P, 2011, Data analysis tools for uncertainty quantification of inverse problems, Inverse Problems, 27, 045001
• Wahba, G, 1990, Spline Models for Observational Data, SIAM
• Special section in Inverse Problems: Volume 24, Number 3, June 2008

Mathematical statistics:
• Casella, G & Berger, RL, 2001, Statistical Inference, Duxbury Press
• Schervish, MJ, 1996, Theory of Statistics, Springer