TRANSCRIPT
Statistics for Inverse Problems 101
June 5, 2011
Introduction

• Inverse problem: to recover an unknown object from indirect noisy observations
• Example: deblurring of a signal f
  Forward problem: µ(x) = ∫₀¹ A(x − t) f(t) dt
  Noisy data: yᵢ = µ(xᵢ) + εᵢ, i = 1, …, n
  Inverse problem: recover f(t) for t ∈ [0, 1]
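A discretized sketch of this forward problem. The Gaussian kernel and the toy signal below are illustrative choices, not from the lecture:

```python
import numpy as np

# Discretized sketch of mu(x) = int_0^1 A(x - t) f(t) dt with an assumed
# Gaussian kernel A and a toy piecewise signal f (both hypothetical).
n = 200
t = (np.arange(n) + 0.5) / n
A_kernel = lambda u: np.exp(-0.5 * (u / 0.03) ** 2)            # assumed blur
f = (np.abs(t - 0.3) < 0.1) + 0.5 * (np.abs(t - 0.7) < 0.05)   # toy signal

# Quadrature approximation of the integral at each observation point x_i
mu = np.array([np.mean(A_kernel(x - t) * f) for x in t])
rng = np.random.default_rng(0)
y = mu + 0.01 * rng.standard_normal(n)                          # noisy data y_i
```

Recovering f from y is the (ill-posed) inverse problem discussed below.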
General Model

• A = (Aᵢ), Aᵢ : H → ℝ, f ∈ H
• yᵢ = Aᵢ[f] + εᵢ, i = 1, …, n
• The recovery of f from A[f] is ill-posed
• Errors εᵢ modeled as random
• The estimate f̂ is an H-valued random variable
Basic Question (UQ)

How good is the estimate f̂?
• What is the distribution of f̂?
• Summarize characteristics of the distribution of f̂
  E.g., how 'likely' is it that f̂ will be 'far' from f?
Review: E, Var & Cov

(1) Expected value of a random variable X with pdf f(x):

    E(X) = µ_X = ∫ x f(x) dx

(2) Variance of X: Var(X) = E( (X − µ_X)² )
(3) Covariance of random variables X and Y:

    Cov(X, Y) = E( (X − µ_X)(Y − µ_Y) )

(4) Expected value of a random vector X = (Xᵢ):

    E(X) = µ_X = ( E(Xᵢ) )

(5) Covariance matrix of X:

    Var(X) = E( (X − µ_X)(X − µ_X)ᵗ ) = ( Cov(Xᵢ, Xⱼ) )

(6) For a fixed n × m matrix A:

    E(AX) = A E(X),   Var(AX) = A Var(X) Aᵗ
Example: X = (X₁, …, Xₙ) correlated, Var(X) = σ²Σ. Then

    Var( (X₁ + · · · + Xₙ)/n ) = Var(1ᵗX/n) = (σ²/n²) 1ᵗΣ1
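A quick Monte Carlo check of this identity. The covariance matrix below is an arbitrary positive-definite toy example:

```python
import numpy as np

# Monte Carlo check of Var((X1+...+Xn)/n) = (sigma^2/n^2) 1' Sigma 1
# for correlated X with Var(X) = sigma^2 Sigma (toy Sigma, illustrative).
rng = np.random.default_rng(0)
n, sigma2 = 4, 1.5
B = rng.standard_normal((n, n))
Sigma = B @ B.T / n                        # a positive-definite matrix
L = np.linalg.cholesky(sigma2 * Sigma)

X = rng.standard_normal((200_000, n)) @ L.T   # rows ~ N(0, sigma^2 Sigma)
empirical = X.mean(axis=1).var()              # sample variance of the mean
ones = np.ones(n)
theoretical = sigma2 / n**2 * ones @ Sigma @ ones
```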
Review: least-squares

    y = Aβ + ε

• A is n × m with n > m, and AᵗA has a stable inverse
• E(ε) = 0, Var(ε) = σ²I
• Least-squares estimate of β:

    β̂ = arg min_b ‖y − Ab‖² = (AᵗA)⁻¹Aᵗy

• β̂ is an unbiased estimator of β: E_β( β̂ ) = β ∀β
• Covariance matrix of β̂: Var_β( β̂ ) = σ² (AᵗA)⁻¹
• If ε is Gaussian N(0, σ²I), then β̂ is the MLE and

    β̂ ∼ N( β, σ² (AᵗA)⁻¹ )

• Unbiased estimator of σ²:

    σ̂² = ‖y − Aβ̂‖²/(n − m) = ‖y − ŷ‖²/(n − dof(H)),

  where ŷ = Aβ̂ = Hy, H = A(AᵗA)⁻¹Aᵗ = 'hat matrix'
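A minimal sketch of these formulas on simulated data (all sizes and values are illustrative):

```python
import numpy as np

# Least-squares sketch: beta_hat = (A'A)^{-1} A'y, hat matrix H,
# and the unbiased variance estimate sigma2_hat = ||y - y_hat||^2/(n - m).
rng = np.random.default_rng(1)
n, m, sigma = 100, 3, 0.5
A = rng.standard_normal((n, m))
beta = np.array([1.0, -2.0, 0.5])
y = A @ beta + sigma * rng.standard_normal(n)

AtA_inv = np.linalg.inv(A.T @ A)
beta_hat = AtA_inv @ A.T @ y          # least-squares estimate
H = A @ AtA_inv @ A.T                 # hat matrix: y_hat = H y
y_hat = H @ y
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - m)   # unbiased for sigma^2
cov_beta_hat = sigma2_hat * AtA_inv   # estimated Var(beta_hat)
```

Note that dof(H) = Trace(H) = m here, since H projects onto the column space of A.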
Confidence regions

• 1 − α confidence interval for βᵢ:

    Iᵢ(α): β̂ᵢ ± t_{α/2, n−m} σ̂ √( (AᵗA)⁻¹ )ᵢᵢ

• C.I.'s are pre-data
• Pointwise vs simultaneous 1 − α coverage:
  pointwise: P[ βᵢ ∈ Iᵢ(α) ] = 1 − α ∀i
  simultaneous: P[ βᵢ ∈ Iᵢ(α) ∀i ] = 1 − α
• It is easy to find a 1 − α joint confidence region R_α for β that gives
  simultaneous coverage, and if R^f_α = { f(b) : b ∈ R_α } then
  P[ f(β) ∈ R^f_α ] ≥ 1 − α
Residuals

    r = y − Aβ̂ = y − ŷ

• E(r) = 0, Var(r) = σ²(I − H)
• If ε ∼ N(0, σ²I), then r ∼ N(0, σ²(I − H))
  ⇒ the residuals do not behave like the original errors ε
• Corrected residuals: r̃ᵢ = rᵢ/√(1 − Hᵢᵢ)
• Corrected residuals may be used for model validation
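A sketch of the correction, on a toy straight-line fit (the design and noise level are illustrative):

```python
import numpy as np

# Corrected residuals r_i / sqrt(1 - H_ii) have constant variance sigma^2
# under the model, unlike the raw residuals. Toy data, illustrative only.
rng = np.random.default_rng(2)
n = 50
A = np.column_stack([np.ones(n), np.linspace(0, 1, n)])   # line fit
y = A @ np.array([1.0, 3.0]) + 0.2 * rng.standard_normal(n)

H = A @ np.linalg.inv(A.T @ A) @ A.T
r = y - H @ y                          # raw residuals, Var(r) = sigma^2 (I - H)
h = np.diag(H)                         # leverages H_ii
r_corrected = r / np.sqrt(1.0 - h)     # variance sigma^2 for every i
```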
Review: Bayes theorem

• Conditional probability measure P(· | B):

    P(A | B) = P(A ∩ B)/P(B)  if P(B) ≠ 0

• Bayes theorem: for P(A) ≠ 0,

    P(A | B) = P(B | A) P(A)/P(B)

• For X and Θ with pdf's:

    f_{Θ|X}(θ | x) ∝ f_{X|Θ}(x | θ) f_Θ(θ)

  More generally:

    dµ_{Θ|X}/dµ_Θ ∝ dµ_{X|Θ}/dν

• Basic tools: conditional probability measures and Radon–Nikodym derivatives
• Conditional probabilities can be defined via Kolmogorov's definition of
  conditional expectations or using conditional distributions (e.g., via
  disintegration of measures)
Review: Gaussian Bayesian linear model

    y | β ∼ N(Aβ, σ²I),   β ∼ N(µ, τ²Σ),   µ, σ, τ, Σ known

• Goal: update the prior distribution of β in light of the data y, i.e.,
  compute the posterior distribution of β given y (use Bayes theorem):

    β | y ∼ N(µ_y, Σ_y)

    Σ_y = σ² ( AᵗA + (σ²/τ²)Σ⁻¹ )⁻¹

    µ_y = ( AᵗA + (σ²/τ²)Σ⁻¹ )⁻¹ ( Aᵗy + (σ²/τ²)Σ⁻¹µ )
        = weighted average of µ_ls and µ

• Summarizing the posterior:
  µ_y = E(β | y) = mean and mode of the posterior (MAP)
  Σ_y = Var(β | y) = covariance matrix of the posterior
  1 − α credible interval for βᵢ:

    (µ_y)ᵢ ± z_{α/2} √( Σ_y )ᵢᵢ

  One can also construct a 1 − α joint credible region for β
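The posterior update above is a couple of lines of linear algebra; here is a sketch with illustrative sizes and an identity prior covariance:

```python
import numpy as np

# Gaussian posterior update (toy setup, all values illustrative):
# Sigma_y = sigma^2 (A'A + (sigma^2/tau^2) Sigma^{-1})^{-1}
# mu_y    = (A'A + (sigma^2/tau^2) Sigma^{-1})^{-1}(A'y + (sigma^2/tau^2) Sigma^{-1} mu)
rng = np.random.default_rng(3)
n, m = 30, 4
sigma, tau = 0.3, 1.0
A = rng.standard_normal((n, m))
Sigma = np.eye(m)                      # prior covariance (assumed)
mu = np.zeros(m)                       # prior mean (assumed)
beta = rng.standard_normal(m)
y = A @ beta + sigma * rng.standard_normal(n)

ratio = sigma**2 / tau**2
Sigma_inv = np.linalg.inv(Sigma)
M = np.linalg.inv(A.T @ A + ratio * Sigma_inv)
Sigma_y = sigma**2 * M                                # posterior covariance
mu_y = M @ (A.T @ y + ratio * Sigma_inv @ mu)         # posterior mean (MAP)
```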
Some properties

• µ_y minimizes E‖δ(y) − β‖² among all estimators δ(y)
• µ_y → β̂_ls as σ²/τ² → 0;  µ_y → µ as σ²/τ² → ∞
• µ_y can be used as a frequentist estimate of β:
  – µ_y is biased
  – its variance is NOT obtained by sampling from the posterior
• Warning: a 1 − α credible region MAY NOT have 1 − α frequentist coverage
• If µ, σ, τ or Σ are unknown ⇒ use hierarchical or empirical Bayesian
  methods and MCMC
Parametric reductions for IP

• Transform the infinite-dimensional problem to a finite-dimensional one,
  i.e., A[f] ≈ Aa
• Assumptions:

    y = A[f] + ε = Aa + δ + ε

  Function representation: f(x) = φ(x)ᵗa
  Penalized least-squares estimate:

    â = arg min_b ‖y − Ab‖² + λ² bᵗSb = (AᵗA + λ²S)⁻¹Aᵗy

    f̂(x) = φ(x)ᵗâ
• Example: yᵢ = ∫₀¹ A(xᵢ − t) f(t) dt + εᵢ
  Quadrature approximation:

    ∫₀¹ A(xᵢ − t) f(t) dt ≈ (1/m) Σⱼ₌₁ᵐ A(xᵢ − tⱼ) f(tⱼ)

  Discretized system:

    y = Af + δ + ε,   Aᵢⱼ = A(xᵢ − tⱼ),   fᵢ = f(xᵢ)

  Regularized estimate:

    f̂ = arg min_{g∈ℝᵐ} ‖y − Ag‖² + λ²‖Dg‖² = (AᵗA + λ²DᵗD)⁻¹Aᵗy ≡ Ly

    f̂(xᵢ) = f̂ᵢ = eᵢᵗLy
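A sketch of this regularized deconvolution, with an illustrative Gaussian blur kernel, a first-difference penalty D, and a hand-picked λ (all assumptions, not values from the lecture):

```python
import numpy as np

# Tikhonov-regularized deconvolution on a quadrature grid:
# f_hat = (A'A + lambda^2 D'D)^{-1} A'y = L y. Toy kernel and signal.
rng = np.random.default_rng(4)
m = 100
t = (np.arange(m) + 0.5) / m
f = np.sin(2 * np.pi * t) + 0.5 * np.sin(6 * np.pi * t)   # "true" signal

kernel = lambda u: np.exp(-0.5 * (u / 0.05) ** 2)         # assumed blur
A = kernel(t[:, None] - t[None, :]) / m                   # quadrature matrix
y = A @ f + 0.01 * rng.standard_normal(m)                 # noisy data

D = np.diff(np.eye(m), axis=0)                            # first differences
lam = 1e-3
L = np.linalg.solve(A.T @ A + lam**2 * D.T @ D, A.T)      # f_hat = L y
f_hat = L @ y
```

In practice λ would be chosen by GCV, the discrepancy principle, etc., as discussed later.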
• Example: Aᵢ continuous on a Hilbert space H. Then there exist orthonormal
  φ_k ∈ H, orthonormal v_k ∈ ℝⁿ and non-increasing λ_k ≥ 0 such that
  A[φ_k] = λ_k v_k, and f = f₀ + f₁ with f₀ ∈ Null(A), f₁ ∈ Null(A)⊥;
  f₀ is not constrained by the data, f₁ = Σₖ₌₁ⁿ a_k φ_k
  Discretized system:

    y = Va + ε,   V = ( λ₁v₁ · · · λₙvₙ )

  Regularized estimate:

    â = arg min_{b∈ℝⁿ} ‖y − Vb‖² + λ²‖b‖² ≡ Ly

    f̂(x) = φ(x)ᵗLy,   φ(x) = (φᵢ(x))
• Example: Aᵢ continuous on the RKHS H = W_m([0,1]). Then H = N_{m−1} ⊕ H_m
  and there exist φⱼ ∈ H and κⱼ ∈ H such that

    ‖y − A[f]‖² + λ² ∫₀¹ (f₁⁽ᵐ⁾)² = Σᵢ ( yᵢ − ⟨κᵢ, f⟩ )² + λ²‖P_{H_m} f‖²

    f̂ = Σⱼ âⱼφⱼ + η,   η ∈ span{φ_k}⊥

  Discretized system:

    y = Xa + δ + ε

  Regularized estimate:

    â = arg min_{b∈ℝⁿ} ‖y − Xb‖² + λ² bᵗPb ≡ Ly

    f̂(x) = φ(x)ᵗLy,   φ(x) = (φᵢ(x))
To summarize:

    y = Aa + δ + ε,   f(x) = φ(x)ᵗa

    â = arg min_{b∈ℝⁿ} ‖y − Ab‖² + λ² bᵗPb = (AᵗA + λ²P)⁻¹Aᵗy

    f̂(x) = φ(x)ᵗâ
Bias, variance and MSE of f̂

• Is f̂(x) close to f(x) on average?

    Bias( f̂(x) ) = E( f̂(x) ) − f(x) = φ(x)ᵗB_λa + φ(x)ᵗG_λAᵗδ

• Pointwise sampling variability:

    Var( f̂(x) ) = σ²‖AG_λφ(x)‖²

• Pointwise MSE:

    MSE( f̂(x) ) = Bias( f̂(x) )² + Var( f̂(x) )

• Integrated MSE:

    IMSE( f̂ ) = E ∫ |f̂(x) − f(x)|² dx = ∫ MSE( f̂(x) ) dx
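The MSE = Bias² + Variance decomposition can be checked by simulation for a regularized estimator (the matrices and λ below are an illustrative toy setup):

```python
import numpy as np

# Monte Carlo check of MSE = ||Bias||^2 + total Variance for the
# penalized estimator a_hat = (A'A + lambda^2 D'D)^{-1} A'y.
rng = np.random.default_rng(5)
n, m, sigma, lam = 40, 20, 0.1, 0.5
A = rng.standard_normal((n, m))
a = rng.standard_normal(m)                       # "true" coefficients
D = np.eye(m)
G = np.linalg.inv(A.T @ A + lam**2 * D.T @ D)    # G_lambda
L = G @ A.T                                      # a_hat = L y

mu = A @ a
draws = mu + sigma * rng.standard_normal((5000, n))   # replicated data sets
a_hats = draws @ L.T                                  # replicated estimates

bias = a_hats.mean(axis=0) - a                   # Monte Carlo bias
mse_mc = np.mean(np.sum((a_hats - a) ** 2, axis=1))
var_mc = np.sum(a_hats.var(axis=0))              # summed pointwise variances
```

The decomposition holds exactly for the empirical moments: `mse_mc` equals `sum(bias**2) + var_mc` up to floating-point error.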
Review: are unbiased estimators good?

• They don't have to exist
• If they exist, they may be very difficult to find
• A biased estimator may still be better: bias-variance trade-off

i.e., don't get too attached to them
What about the bias of f̂?

• Upper bounds are sometimes possible
• Geometric information can be obtained from bounds
• Optimization approach: find the max/min of the bias subject to f ∈ S
• Average bias: choose a prior for f and compute the mean bias over the prior
• Example: f in an RKHS H, f(x) = ⟨ρ_x, f⟩:

    Bias( f̂(x) ) = ⟨A_x − ρ_x, f⟩,   |Bias( f̂(x) )| ≤ ‖A_x − ρ_x‖ ‖f‖
Confidence intervals

• If ε is Gaussian and |Bias( f̂(x) )| ≤ B(x), then

    f̂(x) ± z_{α/2} ( σ‖AG_λφ(x)‖ + B(x) )

  is a 1 − α C.I. for f(x)
• Simultaneous 1 − α C.I.'s for E(f̂(x)), x ∈ S: find β such that

    P[ sup_{x∈S} |ZᵗV(x)| ≥ β ] ≤ α,

  Zᵢ iid N(0,1), V(x) = KG_λφ(x)/‖KG_λφ(x)‖, and use results from the
  theory of maxima of Gaussian processes
Residuals

    r = y − Aâ = y − ŷ

• Moments:

    E(r) = −A Bias( â ) + δ,   Var(r) = σ²(I − H_λ)²

  Note: Bias( Aâ ) = A Bias( â ) may be small even if Bias( â ) is significant
• Corrected residuals: r̃ᵢ = rᵢ/(1 − (H_λ)ᵢᵢ)
Estimating σ and λ

• λ can be chosen using the same data y (with GCV, L-curve, discrepancy
  principle, etc.) or independent training data
  This selection introduces an additional error
• If δ ≈ 0, σ² can be estimated as in least-squares:

    σ̂² = ‖y − ŷ‖²/(n − dof(H_λ));

  otherwise one may consider y = µ + ε, µ = A[f], and use methods from
  nonparametric regression
Resampling methods

Idea: if f̂ and σ̂ are 'good' estimates, then one should be able to generate
synthetic data consistent with the actual observations

• If f̂ is a good estimate ⇒ make synthetic data

    y*₁ = A[f̂] + ε*₁, …, y*_b = A[f̂] + ε*_b

  with ε*ᵢ from the same distribution as ε. Use the same estimation
  procedure to get f̂*ᵢ from y*ᵢ:

    y*₁ → f̂*₁, …, y*_b → f̂*_b

  Approximate the distribution of f̂ with that of f̂*₁, …, f̂*_b
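A parametric bootstrap sketch of this idea (kernel, λ and σ are toy choices; here σ is treated as known for brevity):

```python
import numpy as np

# Parametric bootstrap for a regularized estimator: refit on synthetic data
# y* = A f_hat + eps* and use the spread of the refits f_hat* to
# approximate the variability of f_hat. Illustrative toy setup.
rng = np.random.default_rng(6)
n, lam, sigma = 60, 0.1, 0.05
t = np.linspace(0, 1, n)
A = np.exp(-((t[:, None] - t[None, :]) / 0.1) ** 2) / n   # assumed kernel
f = np.sin(2 * np.pi * t)
y = A @ f + sigma * rng.standard_normal(n)

L = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T)
f_hat = L @ y                                             # original estimate

B = 200
f_stars = np.empty((B, n))
for b in range(B):
    y_star = A @ f_hat + sigma * rng.standard_normal(n)   # synthetic data
    f_stars[b] = L @ y_star                               # same procedure

se_pointwise = f_stars.std(axis=0)   # bootstrap pointwise spread of f_hat
```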
Generating ε* similar to ε

• Parametric resampling: ε* ∼ N(0, σ̂²I)
• Nonparametric resampling: sample ε* with replacement from the corrected
  residuals
• Problems:
  – Possibly badly biased and computationally intensive
  – Residuals may also need to be corrected for correlation structure
  – A bad f̂ may lead to misleading results
• Training sets: let {f₁, …, f_k} be a training set (e.g., historical data)
  – For each fⱼ use resampling to estimate MSE(f̂ⱼ)
  – Study the variability of MSE(f̂ⱼ)/‖fⱼ‖
Bayesian inversion

• Hierarchical model (simple Gaussian case):

    y | a, θ ∼ N(Aa + δ, σ²I),   a | θ ∼ F_a(· | θ),   θ ∼ F_θ,   θ = (δ, σ, τ)

• It all reduces to sampling from the posterior F(a | y)
  ⇒ use MCMC methods
• Possible problems:
  – Selection of priors
  – Convergence of MCMC
  – Computational cost of MCMC in high dimensions
  – Interpretation of results
Warning

Parameters can be tuned so that f̂ = E(f | y), but this does not mean that
the uncertainty of f̂ is obtained from the posterior of f given y
Warning

Bayesian and frequentist inversions can both be used for uncertainty
quantification, but:
• The uncertainties have different interpretations in each framework, and
  these should be understood
• Both make assumptions (sometimes stringent) that should be revealed
• The validity of the assumptions should be explored
  ⇒ exploratory analysis and model validation
Exploratory data analysis (by example)

Example: ℓ₂ and ℓ₁ estimates

    data = y = Af + ε = AWβ + ε

    f̂_ℓ₂ = arg min_g ‖y − Ag‖₂² + λ²‖Dg‖₂²,   λ: from GCV

    f̂_ℓ₁ = Wβ̂,   β̂ = arg min_b ‖y − AWb‖₂² + λ‖b‖₁,   λ: from the discrepancy principle

    E(ε) = 0,   Var(ε) = σ²I,   σ̂ = smoothing spline estimate
Example: y, Af & f̂ [figure]
Tools: robust simulation summaries

• Sample mean and variance of x₁, …, x_m:

    X̄ = (x₁ + · · · + x_m)/m,   S² = (1/m) Σᵢ (xᵢ − X̄)²

• Recursive formulas:

    X̄ₙ₊₁ = X̄ₙ + (xₙ₊₁ − X̄ₙ)/(n + 1)

    Tₙ₊₁ = Tₙ + n(xₙ₊₁ − X̄ₙ)²/(n + 1),   Sₙ² = Tₙ/n

• Not robust
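The recursive updates above (Welford-style) can be sketched as follows, on arbitrary toy data:

```python
import numpy as np

# Recursive mean/variance updates: one pass, no storage of past samples.
rng = np.random.default_rng(7)
x = rng.standard_normal(1000)

mean, T = 0.0, 0.0
for n, xi in enumerate(x):            # n = number of points seen so far
    delta = xi - mean
    mean += delta / (n + 1)           # X_{n+1} = X_n + (x_{n+1} - X_n)/(n+1)
    T += n * delta**2 / (n + 1)       # T_{n+1} = T_n + n (x_{n+1} - X_n)^2/(n+1)
var = T / len(x)                      # S_n^2 = T_n / n
```

The results agree with the batch formulas up to floating-point error.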
Robust measures

• Median and median absolute deviation from the median (MAD):

    X̃ = median{x₁, …, x_m}

    MAD = median{ |x₁ − X̃|, …, |x_m − X̃| }

• Approximate recursive formulas with good asymptotics are available
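A quick illustration of why these are preferred for simulation summaries: one wild outlier barely moves the median/MAD but wrecks the mean/standard deviation (toy data):

```python
import numpy as np

# Robustness of median/MAD vs mean/std under a single gross outlier.
rng = np.random.default_rng(8)
x = rng.standard_normal(99)
x_out = np.append(x, 1e6)               # contaminate with one outlier

med = np.median(x_out)
mad = np.median(np.abs(x_out - med))    # MAD about the median

assert abs(med) < 1.0 and mad < 2.0           # robust summaries stay put
assert x_out.mean() > 1e3 and x_out.std() > 1e3   # non-robust ones blow up
```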
Stability of f̂_ℓ₂ [figure]

Stability of f̂_ℓ₁ [figure]
Tools: trace estimators

    Trace(H) = Σᵢ Hᵢᵢ = Σᵢ eᵢᵗHeᵢ

• Expensive for large H
• Randomized estimator: H symmetric n × n, U₁, …, U_m with Uᵢⱼ iid
  Unif{−1, 1}. Define

    T_m(H) = (1/m) Σᵢ₌₁ᵐ UᵢᵗHUᵢ

• E( T_m(H) ) = Trace(H)
• Var( T_m(H) ) = (2/m)[ Trace(H²) − ‖diag(H)‖² ]
Some bounds

• Relative variance:

    V_r(T_m) = Var( T_m(H) )/Trace(H)² ≤ (2/(mn)) Sσ²/σ̄²,

    σ̄ = (1/n) Σᵢ σᵢ,   Sσ² = (1/n) Σᵢ (σᵢ − σ̄)²

• Concentration: for any t > 0,

    P( T_m(H) ≥ Trace(H)(1 + t) ) ≤ e^{−mt²/4V_r(T₁)}

    P( T_m(H) ≤ Trace(H)(1 − t) ) ≤ e^{−mt²/4V_r(T₁)}

• Example: the 'hat matrix' H in least-squares projects onto a
  k-dimensional space:

    V_r(T_m) ≤ (2/(mk))(1 − k/n),   P( T_m(H) ≥ Trace(H)(1 + t) ) ≤ e^{−mkt²/8}

• Example: approximating the determinant of A > 0:

    log( Det(A) ) = Trace( log(A) ) ≈ (1/m) Σᵢ Uᵢᵗ log(A)Uᵢ

  log(A)U requires only matrix-vector products with A
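A sketch of the randomized log-determinant estimate. For clarity, the matrix logarithm is formed explicitly via an eigendecomposition here; in practice log(A)U would be applied with matrix-vector products only (e.g., via Lanczos):

```python
import numpy as np

# log det(A) = Trace(log A) ~ (1/m) sum_i U_i' log(A) U_i  (toy matrix).
rng = np.random.default_rng(10)
n, m = 100, 1000
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                   # positive-definite test matrix

w, Q = np.linalg.eigh(A)
logA = (Q * np.log(w)) @ Q.T                  # matrix logarithm of A

U = rng.choice([-1.0, 1.0], size=(m, n))      # Rademacher probes
est = np.mean(np.einsum('ij,ij->i', U @ logA, U))

exact = np.linalg.slogdet(A)[1]               # reference value
```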
CI for bias of Af̂_ℓ₂

• A good estimate of f ⇒ a good estimate of Af
• y: a direct observation of Af; less sensitive to λ
• 1 − α CIs for the bias of Af̂ ( ŷ = Hy ):

    (yᵢ − ŷᵢ)/(1 − Hᵢᵢ) ± z_{α/2} σ̂ √( 1 + ((H²)ᵢᵢ − Hᵢᵢ²)/(1 − Hᵢᵢ)² )

    (yᵢ − ŷᵢ)/(Trace(I − H)/n) ± z_{α/2} σ̂ / √(Trace(I − H)/n)
95% CIs for bias [figure]

Coverage of 95% CIs for bias [figure]
Bias bounds for f̂_ℓ₂

• For fixed λ:

    Var( f̂_ℓ₂ ) = σ² G(λ)⁻¹AᵗA G(λ)⁻¹,   Bias( f̂_ℓ₂ ) = E( f̂_ℓ₂ ) − f = B(λ),

    G(λ) = AᵗA + λ²DᵗD,   B(λ) = −λ²G(λ)⁻¹DᵗDf

• Median bias: Bias_M( f̂_ℓ₂ ) = median( f̂_ℓ₂ ) − f
• For λ estimated:

    Bias_M( f̂_ℓ₂ ) ≈ median[ B(λ̂) ]

• Hölder's inequality:

    |B(λ)ᵢ| ≤ λ² U_{p,i}(λ) ‖Df‖_q,   |B(λ)ᵢ| ≤ λ² U^w_{p,i}(λ) ‖β‖_q,

    U_{p,i}(λ) = ‖DG(λ)⁻¹eᵢ‖_p  or  U^w_{p,i}(λ) = ‖WDᵗDG(λ)⁻¹eᵢ‖_p

• Plots of U_{p,i}(λ) and U^w_{p,i}(λ) vs xᵢ [figure]
Assessing fit

• Compare y to Af̂ + ε
• Parametric/nonparametric bootstrap samples y*:
  (i) (parametric) ε*ᵢ ∼ N(0, σ̂²I), or (nonparametric) sample from the
      corrected residuals r_c
  (ii) y*ᵢ = Af̂ + ε*ᵢ
• Compare statistics of y and {y*₁, …, y*_B} (blue, red)
• Compare statistics of r_c and N(0, σ̂²I) (black)
• Compare parametric vs nonparametric results (red)
• Example: simulations for Af̂_ℓ₂ with:
  y⁽¹⁾ good,  y⁽²⁾ with skewed noise,  y⁽³⁾ biased
Example

    T₁(y) = min(y)
    T₂(y) = max(y)
    T_k(y) = (1/n) Σᵢ₌₁ⁿ ( yᵢ − mean(y) )ᵏ / std(y)ᵏ   (k = 3, 4)
    T₅(y) = sample median(y)
    T₆(y) = MAD(y)
    T₇(y) = 1st sample quartile of y
    T₈(y) = 3rd sample quartile of y
    T₉(y) = # runs above/below median(y)

    P̂ⱼ = P( |Tⱼ(y*)| ≥ |Tⱼᵒ| ),   Tⱼᵒ = Tⱼ(y)
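A sketch of one of these bootstrap checks, using the sample skewness T₃; the "observed" data are deliberately given skewed noise so the check should flag them (everything here is an illustrative toy setup):

```python
import numpy as np

# Bootstrap goodness-of-fit: compare the observed statistic T3(y) with its
# distribution over synthetic data sets generated under the fitted model.
rng = np.random.default_rng(11)

def T3(y):
    """Sample skewness: (1/n) sum (y_i - mean)^3 / std^3."""
    return np.mean((y - y.mean()) ** 3) / y.std() ** 3

n, sigma = 200, 1.0
mu = np.zeros(n)                               # stand-in for A f_hat
y_obs = mu + rng.gamma(2.0, 1.0, n) - 2.0      # skewed "observed" noise

B = 2000
T_star = np.array([T3(mu + sigma * rng.standard_normal(n)) for _ in range(B)])
p_hat = np.mean(np.abs(T_star) >= abs(T3(y_obs)))   # bootstrap p-value
```

A small `p_hat` indicates that the observed data are inconsistent with the assumed Gaussian noise, as intended in this example.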
Assessing fit: Bayesian case

• Hierarchical model:

    y | f, σ, γ ∼ N(Af, σ²I)
    f | γ ∝ exp( −fᵗDᵗDf/2γ² )
    (σ, γ) ∼ π

  with

    E(f | y) = ( AᵗA + (σ²/γ²)DᵗD )⁻¹Aᵗy

• Posterior mean = f̂_ℓ₂ for γ = σ/λ
• Sample (f*, σ*) from the posterior of (f, σ) | y ⇒ simulated data
  y* = Af* + σ*ε* with ε* ∼ N(0, I)
• Explore the consistency of y with the simulated y*
Simulating f

• Modify the simulations from y*ᵢ = Af̂ + ε*ᵢ to

    y*ᵢ = Afᵢ + ε*ᵢ,   fᵢ 'similar' to f

• Frequentist 'simulation' of f? One option: take the fᵢ from historical
  data (a training set)
• Check the relative MSE for the different fᵢ
Bayesian or frequentist?

• The results have different interpretations; the two frameworks assign
  different meanings to probability
• Frequentists can use Bayesian methods to derive procedures; Bayesians can
  use frequentist methods to evaluate procedures
• Consistent results for small problems and lots of data
• Possibly very different results for complex large-scale problems
• Theorems do not address the relation of theory to reality
Personal view

• Neither is a substitute for detective work, common sense and honesty
• Some say that Bayesian is the only way, others that frequentist is the
  best way, while those with vision see …
Application: experimental design

Classic framework:

    yᵢ = f(xᵢ)ᵗθ + εᵢ  ⇒  y = Fθ + ε

• Experimental configurations xᵢ ∈ X ⊂ ℝᵖ
• θ ∈ ℝᵐ unknown parameters
• εᵢ iid with zero mean and variance σ²
• Experimental design question: how many times should each configuration xᵢ
  be used to get the 'best' estimate of θ? I.e., the variance σᵢ² for each
  selected xᵢ is controlled by averaging independent replications
• More general approach: control the σᵢ² to get the 'best' estimate of θ
  (i.e., not necessarily by averaging). Let σ_{t,i} be the 'optimal' target
  variance for xᵢ and set the constraint

    Σᵢ σᵢ²/σ²_{t,i} = N

  for fixed N ∈ ℝ. Define wᵢ = σᵢ²/(Nσ²_{t,i}) ⇒

    wᵢ ≥ 0,   Σᵢ wᵢ = 1,   σ²_{t,i} = σᵢ²/(Nwᵢ)
• i-th experimental configuration: (xᵢ, wᵢ)
• Experiment ξ: {x₁, …, xₙ, w₁, …, wₙ}
• Well-posed case: FᵗF = Σᵢ f(xᵢ)f(xᵢ)ᵗ is well conditioned:

    θ̂ = arg min_θ Σᵢ wᵢ( yᵢ − f(xᵢ)ᵗθ )² = ( FᵗWF )⁻¹FᵗWy,   W = diag(wᵢ)

  θ̂ is unbiased with

    Var(θ̂) = σ² ( FᵗWF )⁻¹ = σ² M(ξ)⁻¹,

    M(ξ) = information matrix = FᵗWF = Σᵢ wᵢ f(xᵢ)f(xᵢ)ᵗ

  Write D(ξ) = M(ξ)⁻¹
• Optimization problem:

    min_w Ψ( D(ξ) )  s.t.  w ≥ 0, Σ wᵢ = 1

  D-design: Ψ = det
  A-design: Ψ = Trace ⇒ minimization of MSE(θ̂)
  The optimal wᵢ are independent of σ and N
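An A-optimal design of this kind can be computed numerically; here is a sketch for a toy quadratic-regression design (the configurations and the use of SciPy's SLSQP solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# A-optimal design sketch: minimize Trace((F' W F)^{-1}) over weights w
# on the probability simplex, for a toy quadratic regression design.
x = np.linspace(-1, 1, 7)
F = np.column_stack([np.ones_like(x), x, x**2])   # f(x) = (1, x, x^2)

def a_criterion(w):
    M = F.T @ (w[:, None] * F)        # information matrix F' diag(w) F
    return np.trace(np.linalg.inv(M))

n = len(x)
w0 = np.full(n, 1.0 / n)              # start from the uniform design
res = minimize(a_criterion, w0, method='SLSQP',
               bounds=[(1e-9, 1)] * n,
               constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}])
w_opt = res.x                         # A-optimal weights
```

The optimized weights should do at least as well as the uniform design under the A-criterion.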
Ill-posed problems

• 1D magnetotelluric example:

    dⱼ = ∫₀¹ exp(−αⱼx) cos(γⱼx) m(x) dx + εⱼ,   j = 1, …, n

  γⱼ: recording frequencies
  αⱼ: determined by the frequencies and the background conductivity
  m(x): conductivity
• Goal: choose the γⱼ, αⱼ to determine a smooth m consistent with the data
Ill-posed problems

• Ill-conditioning of F requires regularization
• Regularized estimate:

    θ̂ = arg min_θ Σᵢ wᵢ[ yᵢ − f(xᵢ)ᵗθ ]² + λ²‖Dθ‖²
       = ( FᵗWF + λ²DᵗD )⁻¹FᵗWy

• MSE( θ̂ ) depends on the unknown θ ⇒ control an 'average' MSE
• Use prior moment conditions:

    E(y | θ, γ) = Fθ,   Var(y | θ, γ) = σ²W⁻¹/N
    E(θ | γ) = θ₀,   Var(θ | γ) = γ²Σ,   E(γ²) = µ²_γ,

  with σ and Σ known and γ = σ/λ
• Affine estimate minimizing the Bayes risk, with δ² = E(1/λ²):

    θ̂ = ( Nδ² FᵗWF + Σ⁻¹ )⁻¹( Nδ² FᵗWy + Σ⁻¹θ₀ )

    E[ (θ̂ − θ)(θ̂ − θ)ᵗ ] = (τσ²/N)( τFᵗWF + (1 − τ)Σ⁻¹ )⁻¹,   τ ∈ (0, 1)

• A-design optimization:

    min_w Trace( τFᵗWF + (1 − τ)Σ⁻¹ )⁻¹  s.t.  w ≥ 0, Σ wᵢ = 1

• The optimal wᵢ now depend on σ and N
Readings

Basic Bayesian methods:
• Carlin, BP & Louis, TA, 2008, Bayesian Methods for Data Analysis (3rd Edn.), Chapman & Hall
• Gelman, A, Carlin, JB, Stern, HS & Rubin, DB, 2003, Bayesian Data Analysis (2nd Edn.), Chapman & Hall

Bayesian methods for inverse problems:
• Calvetti, D & Somersalo, E, 2007, Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, Springer
• Kaipio, J & Somersalo, E, 2004, Statistical and Computational Inverse Problems, Springer

Discussion of frequentist vs Bayesian statistics:
• Freedman, D, 1995, Some issues in the foundation of statistics, Foundations of Science, 1, 19–39; rejoinder pp. 69–83

Tutorial on frequentist and Bayesian methods for inverse problems:
• Stark, PB & Tenorio, L, 2011, A primer of frequentist and Bayesian inference in inverse problems. In Computational Methods for Large-Scale Inverse Problems and Quantification of Uncertainty, pp. 9–32, Wiley

Statistics and inverse problems:
• Evans, SN & Stark, PB, 2003, Inverse problems as statistics, Inverse Problems, 18, R55
• O'Sullivan, F, 1986, A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502–527
• Tenorio, L, Andersson, F, de Hoop, M & Ma, P, 2011, Data analysis tools for uncertainty quantification of inverse problems, Inverse Problems, 27, 045001
• Wahba, G, 1990, Spline Models for Observational Data, SIAM
• Special section in Inverse Problems: Volume 24, Number 3, June 2008

Mathematical statistics:
• Casella, G & Berger, RL, 2001, Statistical Inference, Duxbury Press
• Schervish, MJ, 1996, Theory of Statistics, Springer