Lecture 1: Numerical Issues from Inverse Problems
(Parameter Estimation, Regularization Theory, and Parallel Algorithms)

Youzuo Lin 1
Joint work with: Rosemary A. Renaut 2, Brendt Wohlberg 1, Hongbin Guo 2

1. Los Alamos National Laboratory
2. School of Mathematical and Statistical Sciences, Arizona State University

Graduate Student Workshop on Inverse Problems, 2016



Outline

1 Introduction: Inverse Problem and Regularization

2 Topic 1: Multisplitting for Regularized Least Squares
  • Regularization Parallelization
  • Multiple Right Hand Side Problem

3 Topic 2: Total-Variation Regularization
  • Numerical Methods for Total Variation
  • Optimal Parameter Selection, UPRE Approach

4 Topic 3: Projected Krylov Subspace Solvers on GPU
  • Fine-Grained Parallelism Model for Krylov Subspace Solvers on GPU
  • Optimizations of Krylov Subspace Algorithms on GPU
  • Numerical Results

5 References


Inverse Problems and Ill-posedness

• General Linear Inverse Problem:

    b = Af + η,

  with measured data b and system (or transform) matrix A; the goal is to recover the input data f.

• Image Restoration - Deconvolution: the transform matrix A is a blurring matrix, η is the noise.

• Image Reconstruction - Tomography: the transform matrix A is the Radon transform, η is the noise.


Direct Inverse Filtering

• How about f = A^{-1} b?

Figure: Left: blurred and noisy image with SNR of 12.21 dB. Right: restored image by direct inversion with SNR of −11.17 dB.


Constrained Least Squares and Regularization Model

Constrained Least Squares:

    min_f J(f),  subject to  ||Af − b||_2^2 = σ^2.

Regularization Model:

    min_f { ||Af − b||_2^2 + λ^2 J(f) },  λ > 0,

where
    ||Af − b||_2^2 : data fidelity term,
    J(f) : regularization term,
    λ : regularization parameter.

Specific Regularization

Tikhonov Regularization: when J(f) is a quadratic functional f^T L^T L f,

    min_f { ||Af − b||_2^2 + λ^2 ||Lf||_2^2 }.

Total Variation (TV) Regularization: when J(f) is a nonquadratic functional such as Total Variation,

    min_f { ||Af − b||_2^2 + λ^2 ||∇f||_1 }.

Others: learning-based regularization, compressive sensing, etc.
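As a concrete illustration, here is a minimal numpy sketch (not from the slides; A, L, b, and lam are placeholder names) of the Tikhonov solve via the equivalent stacked least-squares problem min_f ||[A; λL] f − [b; 0]||_2:

    import numpy as np

    def tikhonov_solve(A, b, L, lam):
        # minimize ||A f - b||^2 + lam^2 ||L f||^2 by stacking [A; lam*L]
        K = np.vstack([A, lam * L])
        rhs = np.concatenate([b, np.zeros(L.shape[0])])
        f, *_ = np.linalg.lstsq(K, rhs, rcond=None)
        return f

    # toy 1-D example with a first-difference operator L
    rng = np.random.default_rng(0)
    A = rng.standard_normal((80, 40))
    b = A @ np.ones(40) + 0.05 * rng.standard_normal(80)
    L = np.eye(40, k=1)[:-1] - np.eye(40)[:-1]
    f_hat = tikhonov_solve(A, b, L, lam=0.5)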


Topic 1: Multisplitting for Regularized Least Squares [Renaut, Lin and Guo, 2009]

• Regularization Parallelization
  • Bottleneck for Applications
  • Formulation: Multisplitting and Domain Decomposition
  • Local Solver Selection and Computational Cost
  • Update Scheme
• Multiple Right-Hand Sides Problem
  • A Second Look at Multisplitting
  • Hybrid Regularization with Multisplitting and Multiple Right-Hand Sides
  • Computational Cost Analysis
  • Application in Image Reconstruction and Restoration
• Numerical Results

Tikhonov Solution and Memory Bottleneck

• Solving the TK regularization is equivalent to solving a large-scale linear system:

    (A^T A + λ^2 L^T L) f = A^T b.

• What about the RAM requirement? In a positron emission tomography (PET) image reconstruction example in single precision, for an image of size 256 × 256 with projection angles from 0° to 180° at a 2° increment, the RAM required to store A is:

    65536 × 33030 × 4 / 1024^3 ≈ 8.06 GB!

• What can we do about this? We apply the Domain Decomposition idea to our problem.
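A quick back-of-the-envelope check of that storage estimate (the row count 33030 is taken from the slide's projection geometry):

    n_pixels = 256 * 256        # 65536 unknowns
    n_rows = 33030              # projection measurements
    bytes_per_entry = 4         # single precision
    ram_gb = n_pixels * n_rows * bytes_per_entry / 1024**3
    print(f"{ram_gb:.2f} GB")   # -> 8.06 GB just to store A densely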

Formulation - Multisplitting

Linear multisplitting (LMS): Given a matrix A ∈ ℝ^{n×n} and a collection of matrices M^(i), N^(i), E^(i) ∈ ℝ^{n×n}, i = 1, ..., p, satisfying

  1 A = M^(i) − N^(i) for each i, i = 1, ..., p;
  2 M^(i) is regular (nonsingular), i = 1, ..., p;
  3 E^(i) is a non-negative diagonal matrix, i = 1, ..., p, and Σ_{i=1}^p E^(i) = I;

then the collection of triples (M^(i), N^(i), E^(i)), i = 1, ..., p, is called a multisplitting of A, and the LMS method is defined by the iteration:

    x^(k+1) = Σ_{i=1}^p E^(i) (M^(i))^{-1} (N^(i) x^(k) + b).

Formulation - Domain Decomposition

• Domain decomposition is a particular method of multisplitting in which the submatrices M^(i) are defined to be consistent with a partitioning of the domain:

    x = (x_1^T, x_2^T, ..., x_p^T)^T.

• e.g., different splitting schemes


Formulation - Operator Split [Renaut 1998]

• Corresponding to a given splitting of the image x, we have a splitting of the kernel operator A:

    A = (A_1, A_2, ..., A_p).

• The linear system Ax ≅ b is replaced with the split systems

    A_i y_i ≅ b_i(x),

  where b_i(x) = b − Σ_{j≠i} A_j x_j = b − Ax + A_i x_i.


Formulation - Local Problem

• Local Problem: solving Ax ≅ b turns into solving p subproblems,

    min_{y_i ∈ ℝ^{n_i}} ||A_i y_i − b_i(x)||_2,  1 ≤ i ≤ p.

• Update Scheme: the global solution update from local solutions at step k to step k + 1 is given by

    x^(k+1) = Σ_{i=1}^p τ_i^(k+1) (x_local)_i^(k+1),

  where (x_local)_i^(k+1) = ((x_1^(k))^T, ..., (x_{i−1}^(k))^T, (y_i^(k+1))^T, (x_{i+1}^(k))^T, ..., (x_p^(k))^T)^T.


Formulation - Regularized Multisplitting

• Tikhonov Regularization:

    min_x { ||Ax − b||_2^2 + λ^2 ||Lx||_2^2 },  λ > 0,

• is equivalent to

    min_x || [A; ΛL] x − [b; 0] ||_2^2,

  where Λ is block diagonal with entries λ_i I_{n_i}.


Formulation - Regularized Multisplitting

• Similarly, we have a splitting of the stacked operator:

    [A; ΛL] = ( [A_1; λ_1 L_1], [A_2; λ_2 L_2], ..., [A_p; λ_p L_p] ).

• The i-th block equation of the normal equations is given by

    (A_i^T A_i + λ_i^2 L_i^T L_i) y_i = A_i^T b_i(x) − λ_i Σ_{j=1, j≠i}^p λ_j L_i^T L_j x_j.

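A minimal serial numpy sketch of one sweep of this regularized multisplitting iteration (illustrative names and dense blocks; a direct least-squares call stands in for CGLS, and it assumes L splits block-diagonally so the L_i^T L_j cross terms vanish):

    import numpy as np

    def multisplitting_sweep(A_blocks, L_blocks, lams, x_blocks, b):
        # one block-Jacobi sweep: block i solves
        #   min ||A_i y - b_i(x)||^2 + lam_i^2 ||L_i y||^2,
        # with b_i(x) = b - sum_{j != i} A_j x_j
        Ax = [A @ x for A, x in zip(A_blocks, x_blocks)]
        p = len(A_blocks)
        new_blocks = []
        for i in range(p):
            b_i = b - sum(Ax[j] for j in range(p) if j != i)
            K = np.vstack([A_blocks[i], lams[i] * L_blocks[i]])
            rhs = np.concatenate([b_i, np.zeros(L_blocks[i].shape[0])])
            y_i, *_ = np.linalg.lstsq(K, rhs, rcond=None)
            new_blocks.append(y_i)        # tau_i = 1 update
        return new_blocks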

Local Solver Selection and Computational Cost

• How to select the local solver?
• Computation cost and memory cost:
  • Iterative solver: Conjugate Gradient for Least Squares (CGLS)
    • Computation cost: C_CGLS ≈ 2 n_i^2 m̃ + K (2(k_i + 1) n_i^2 + 10 k_i n_i)
    • Memory cost: only matrix-vector multiplications, so we only need to store the matrix A_i
• Solver: CGLS seems to be reasonable at this point!


Four Different Update Schemes

• Block Jacobi Update: set τ_i^(k+1) = τ_i = 1, so

    x^(k+1) = ((y_1^(k+1))^T, (y_2^(k+1))^T, ..., (y_p^(k+1))^T)^T.

• Convex Update: set τ_i^(k+1) = τ_i = 1/p, so

    x^(k+1) = x^(k) + (1/p)(y^(k+1) − x^(k)).

• Local Update:

    x^(k+1) = (x_local)_i^(k+1).

• Optimal τ Update:

    min_τ r(x^(k+1)),

  where r(x^(k+1)) = r(x^(k)) − Σ_{i=1}^p τ_i^(k+1) [A_i; λ_i L_i] δ_i^(k+1).


A Second Look at the Subproblem

The i-th block equation of the normal equations is given by

    (A_i^T A_i + λ_i^2 L_i^T L_i) y_i = A_i^T b_i(x) − λ_i Σ_{j=1, j≠i}^p λ_j L_i^T L_j x_j.

• The system matrix is unchanged;

A Second Look at the Subproblem

Rewrite the right-hand side (RHS) as

    b_i(x^(k+1)) = b_i(x^(k)) − Σ_{j=1, j≠i}^p τ_j^(k+1) B_j^(k+1) + OB_i^(k+1),

where B_j^(k+1) = A_j (x_j^(k+1) − x_j^(k)), and OB_i^(k+1) is the update of the overlapped problem.

• The new RHS is updated from the previous RHS;

Linear System with Multiple RHS

• Multiple RHS Problem Setup:

    AX = B = [b^(1), ..., b^(s)],

  where A ∈ ℝ^{m×n}, X ∈ ℝ^{n×s}, and B ∈ ℝ^{m×s}.

• What about solving the s systems independently?
  • Direct method (LU): major computational cost ≈ O(n^3) + s · O(n^2);
  • Iterative method (CGLS): major computational cost ≈ s · O(n^2);
• Several algorithms have been proposed to speed up solving the above system.


Efficient Iterative Algorithm for Solving Multiple RHS

• If b^(1), ..., b^(s) are random or completely orthogonal to each other, there is no way to speed up the algorithm!
• If b^(1), ..., b^(s) are close to each other and share information:
  • Galerkin-projection-based Krylov subspace methods have been proposed as efficient solvers.
  • We choose CG as the specific Krylov subspace method, as proposed in [Chan and Wan, 1997], because it is well supported theoretically.


Basic Idea of Projected CG Algorithm for Solving MRHS

• Step 1: Solve the seed system.
  • Select the seed;
  • Apply the CGLS algorithm to the seed system and save the conjugate directions for Step 2.
• Step 2: Solve the remaining systems by projection (see the sketch below).
  • Use the conjugate directions from the seed system to span the Krylov subspace;
  • Project the systems onto the Krylov subspace for an approximate solution.
• Step 3: Restart the seed and refine the solution.
  • A new seed may be selected from the remaining systems to restart the procedure;
  • Refine the approximation if the accuracy is insufficient.

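A minimal numpy sketch of the Step 2 projection (a hypothetical helper; it assumes the seed solve saved the conjugate directions as the columns of P, together with AP = A·P):

    import numpy as np

    def galerkin_project(P, AP, b_new):
        # seed the new system A x = b_new from the saved directions:
        # x = sum_j alpha_j p_j with alpha_j = <p_j, r>/<p_j, A p_j>
        x = np.zeros(P.shape[0])
        r = b_new.copy()                 # residual for x = 0
        for j in range(P.shape[1]):
            alpha = (P[:, j] @ r) / (P[:, j] @ AP[:, j])
            x += alpha * P[:, j]
            r -= alpha * AP[:, j]
        return x, r                      # refine with CG if ||r|| is too large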

Adapt to Our Problem: Solving Each Subproblem from RLSMS, Scheme 1

• Scheme 1 (Project/Restart): Treat the i-th subproblem at every iteration as a multiple-RHS problem with s = 2, i.e., at iteration k = 1 solve the subproblem as the seed; for iterations k > 1, solve it by projection and refine as necessary.
• Comment: A straightforward adaptation, but usually the accuracy remaining after the projection is not enough, and extra steps are needed for further refinement.


Adapt to Our Problem: Solving Each Subproblem from RLSMS, Scheme 2

• Scheme 2 (Project/Augment): With seed selection as in Scheme 1, during the refinement stages of solving the remaining systems, store the newly generated conjugate directions together with those from solving the seeds.
• Comment: The motivation for this change is again to improve the Krylov subspace. By updating the conjugate directions, we are able to add new information that does not come from solving the seed system. Numerical tests indicate this scheme works best on large-scale problems!


Computational Cost Analysis

The computational cost of projected CGLS consists of two components: the cost of solving the seed system, which is the cost of CGLS itself, and the cost of solving the remaining systems, which includes the projection plus the few extra CGLS steps taken as needed:

    C_ProjCG ≈ 2(K + 2k_i + 2) m n_i + (6(K − 1) k_i + 10 k_i + 1) n_i + µ · C_CG.

1-D Phillips, Restoration Result

Figure: Restoration results using different methods for the Phillips sample of size 1024 × 1, noise level 6%.

1-D Phillips, Global and Local Iteration Result

Figure: Maximum numbers of inner iterations against outer iteration count.

2-D Seismic Reconstruction, Phantom Size 64 × 64

Figure: SNR (Without Splitting): 11.65 dB; SNR (CGLS/Zero): 11.68 dB; SNR (Project/Restart): 11.63 dB; SNR (Project/Augment): 11.60 dB.

2-D Seismic Reconstruction, Global and Local Iteration Result

Figure: Maximum numbers of inner iterations against outer iteration count.


Topic 2: Total-Variation Regularization

• Numerical Methods for Total Variation
  • Iteratively Reweighted Norm Algorithm
  • Lagged Diffusivity Fixed Point Iteration Algorithm
• Optimal Regularization Parameter Selection
  • Trace Approximation
  • UPRE Approach
• Numerical Results

Iteratively Reweighted Norm Algorithm

Approximate the L1 norm by a weighted L2 norm; TV regularization becomes:

    min_f { (1/2) ||Af − b||_2^2 + (λ/2) ||W_r^{1/2} D f||_2^2 }

    ⇔ min_f { (1/2) || [I, 0; 0, W_r^{1/2}] ( [A; √λ D] f − [b; 0] ) ||_2^2 },

where W_r = diag( 2((D_x f)^2 + (D_y f)^2)^{−1/2} ).

This minimization problem can be solved by a CG solver.
Ref: B. Wohlberg, P. Rodríguez, "An Iteratively Reweighted Norm Algorithm for Minimization of Total Variation Functionals," IEEE Signal Processing Letters, 2007.
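For intuition, a minimal 1-D denoising sketch of the IRN idea under these definitions (illustrative only: A = I, D a first-difference matrix, a direct solve in place of CG, and a small eps guarding the weights where Df vanishes):

    import numpy as np

    def irn_tv_denoise(b, lam, n_iters=20, eps=1e-8):
        n = b.size
        D = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]      # first differences
        f = b.copy()
        for _ in range(n_iters):
            w = 2.0 / np.sqrt((D @ f) ** 2 + eps)     # W_r recomputed from f
            # normal equations of the weighted problem: (I + lam D^T W D) f = b
            f = np.linalg.solve(np.eye(n) + lam * D.T @ (w[:, None] * D), b)
        return f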

Lagged Diffusivity Fixed Point Iteration Algorithm

Approximate the L1 norm by Σ ψ(f) = Σ √( (df/dx)^2 + β^2 ), so that

    Gradient = A^T (Af − b) + λ L(f) f,
    Approximate Hessian = A^T A + λ L(f),

where L(f) = Δx D^T diag(ψ′(f)) D.
Based on the quasi-Newton idea, we solve for f iteratively by

    f_{k+1} = f_k − (A^T A + λ L(f_k))^{-1} · Gradient,

which is called the "Lagged Diffusivity Fixed Point Iteration".
Ref: C. R. Vogel, M. E. Oman, "Iterative Methods for Total Variation," SIAM J. on Sci. Comp., 17 (1996).

Image Deblurring Comparison using TV and Tikhonov

Figure: Left: original clean image. Right: blurred and noisy image.

Image Deblurring Comparison using TV and Tikhonov

Figure: Left: TK-based deblurred result. Right: TV-based deblurred result.

Problem Description - Simple Example

Fundamental Idea of UPRE

Derived from the mean squared error (MSE) of the predictive error,

    (1/n) ||P_λ||^2 = (1/n) ||A f_λ − A f_true||^2,

where f_λ is the regularized solution and f_true is the true solution. To avoid using f_true, we define a functional that is an unbiased estimator of the above and name it UPRE (Unbiased Predictive Risk Estimator):

    λ_opt = arg min_λ UPRE(λ).

UPRE for Tikhonov

Vogel in [1] proved that, because the solution of Tikhonov regularization depends linearly on the right-hand side, i.e.,

    f_TK = R_TK (b + η),

where R_TK = (A^T A + λ I)^{-1} A^T, we have

    UPRE_Tikhonov(λ) = E( (1/n) ||P_λ||^2 )
                     = (1/n) ||r_λ||^2 + (2σ^2/n) trace(A_λ) − σ^2,

where r_λ = A f_λ − b, A_λ = A (A^T A + λ I)^{-1} A^T, and σ^2 is the variance of the noise.
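A dense-matrix sketch of evaluating this functional over a grid of λ values (an illustration only: direct solves and an explicit trace, so feasible just for small problems; the trace-estimation slides below give the scalable alternative):

    import numpy as np

    def upre_tikhonov(A, b, lam, sigma2):
        m, n = A.shape                                # m plays the role of n in the slide's formula
        H = A.T @ A + lam * np.eye(n)
        f = np.linalg.solve(H, A.T @ b)               # regularized solution
        r = A @ f - b
        trace_Alam = np.trace(A @ np.linalg.solve(H, A.T))
        return (r @ r) / m + 2.0 * sigma2 * trace_Alam / m - sigma2

    # lam_opt = min(lams, key=lambda l: upre_tikhonov(A, b, l, sigma2))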

UPRE for TV

• Difficulty 1: Nonquadratic penalty functional in the TV term.
• Hint: Linearization approximation!
• In the Lagged Diffusivity approach,

    f_TV = R_TV (b + η),

  where R_TV = (A^T A + (λ/2) L(f_k))^{-1} A^T, and the influence matrix is

    A_λ = A (A^T A + (λ/2) L(f_k))^{-1} A^T.

A_λ can be proved to be symmetric, which gives us the same derivation of the UPRE functional for TV.

UPRE for TV

• Difficulty 2: How to compute trace(A_λ) = trace( A (A^T A + (λ/2) L(f_k))^{-1} A^T )?
• Hint: Krylov space method!
• Basic idea:

    trace(f(M)) ≃ E(u^T f(M) u),

  where u is a discrete multivariate random vector whose entries take the values −1 and +1 each with probability 0.5, and the matrix M is symmetric positive definite (SPD).
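A minimal sketch of this randomized (Hutchinson-style) trace estimate, with a plain dense solve standing in for the Lanczos machinery on the next slide (A, L_fk, and lam are placeholder names):

    import numpy as np

    def estimate_trace_influence(A, L_fk, lam, n_probes=20, rng=None):
        # trace(A (A^T A + (lam/2) L(f_k))^-1 A^T) ~ mean of u^T A_lambda u
        rng = np.random.default_rng() if rng is None else rng
        m = A.shape[0]
        H = A.T @ A + 0.5 * lam * L_fk
        est = 0.0
        for _ in range(n_probes):
            u = rng.choice([-1.0, 1.0], size=m)       # +/-1 probe vector
            est += u @ (A @ np.linalg.solve(H, A.T @ u))
        return est / n_probes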

UPRE for TV

• Golub [2] pointed out that

    E(u^T f(M) u) = Σ_{i=1}^k ω_i f(θ_i),

  where the θ_i are the eigenvalues of M, and the ω_i are the squares of the first components of the normalized eigenvectors of M.
• Start the Lanczos procedure on M, and find the most significant eigenvectors until the accuracy is satisfied.

Trace Approximation Results

Figure: Left: accurate trace vs. estimated trace. Right: time cost of computing the accurate trace vs. time cost of computing the estimated trace.

TV Based Deblurring Using Optimal and Random Parameters

Figure: Left: using the optimal parameter. Right: using a random parameter.

More General Results for Deblurring

Figure: Plot of optimal SNR and estimated SNR versus original noisy SNR on the satellite image.

More General Results for Denoising

Figure: Plot of optimal SNR and estimated SNR versus original noisy SNR on the satellite image. Ref: G. Gilboa, N. Sochen, Y. Y. Zeevi, "Estimation of optimal PDE-based denoising in the SNR sense," IEEE Transactions on Image Processing, 15(8), 2006.

Conclusions

• Tikhonov Regularization:
  • Pros: easy to implement, cheap cost;
  • Cons: the result is usually smoothed out.
• Total Variation (TV) Regularization:
  • Pros: preserves image edges (given a reasonable parameter); piecewise-constant results;
  • Cons: expensive cost.


Topic 3: Projected Krylov Subspace Solvers on GPU [Lin and Renaut, 2010]

• Problem Description and Krylov Subspace Solvers
  • 3-D Medical Image Reconstruction
  • Inverse Linear Systems with Multiple Right-hand Sides
  • Conjugate Gradient Algorithm
• Fine-Grained Parallelism Model for Krylov Subspace Solvers on GPU
  • GPU Structure
  • BLAS Performance on GPU
  • CG on GPU (ver 0)
• Optimizations of Krylov Subspace Algorithms on GPU
  • Projected CG
  • Modified Projected CG
• Numerical Results

3-D Medical Image Reconstruction

Figure: Image Slices from a 3-D Scan

Inverse Linear Systems with Multiple Right-hand Sides

• Multiple Right-hand Sides System:

    AX = B = [b^(1), ..., b^(s)],

  where b^(i) stands for the i-th measurement with respect to the i-th slice, as in a 3-D medical image.
• Tikhonov regularization is applied to the above systems.
• The regularization parameter is assumed to be close enough to the optimal value.

Conjugate Gradient Algorithm

    k = 0; r_0 = b − A x_0;
    while (||r_k||_2 > TOL · ||b||_2)
        k = k + 1;
        if (k == 1)
            p_1 = r_0;
        else
            β_k = <r_{k−1}, r_{k−1}> / <r_{k−2}, r_{k−2}>;
            p_k = r_{k−1} + β_k p_{k−1};
        end;
        α_k = <r_{k−1}, r_{k−1}> / <p_k, A p_k>;
        x_k = x_{k−1} + α_k p_k;
        r_k = r_{k−1} − α_k A p_k;
    end (while).
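A runnable numpy transcription of that loop (illustrative; the comments mark the operation classes, Mat-Vec / Dot-Prod / SAXPY, that the GPU profiles below break the run time into):

    import numpy as np

    def cg(A, b, x0=None, tol=1e-5, max_iter=1000):
        x = np.zeros_like(b) if x0 is None else x0.copy()
        r = b - A @ x                     # Mat-Vec
        p = r.copy()
        rho = r @ r                       # Dot-Prod
        b_norm = np.linalg.norm(b)
        while np.sqrt(rho) > tol * b_norm and max_iter > 0:
            Ap = A @ p                    # Mat-Vec (the dominant cost)
            alpha = rho / (p @ Ap)        # Dot-Prod
            x += alpha * p                # SAXPY
            r -= alpha * Ap               # SAXPY
            rho, rho_old = r @ r, rho     # Dot-Prod
            p = r + (rho / rho_old) * p   # SAXPY-style update
            max_iter -= 1
        return x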

GPU Structure - NVIDIA Tesla C1060

Figure: NVIDIA Tesla C1060

GPU Structure - NVIDIA Tesla C1060

Streaming Multiprocessors (SM): 30
Streaming Processors (SP): 8 per SM
Registers: 16 K per SM
Shared Memory: 16 KB per SM
Constant Memory: 64 KB
Global Memory: 4 GB
Bandwidth: 102 GB/sec

GFLOPS (peak):
  Single Precision: 933
  Double Precision: 78

Figure: Structure of NVIDIA Tesla C1060

BLAS Performance on GPU - Matrix Vector Multiplication

• Testing Environment:
  Operating System: Ubuntu, 64-bit
  CPU (Host): Intel Quad Core Xeon E5506, 2.13 GHz
  GPU (Device): NVIDIA Tesla C1060
  Programming on Host: Matlab using single / multiple core(s)
  Programming on Device: CUDA + CUBLAS, Matlab + Jacket

BLAS Performance on GPU - Matrix Vector Multiplication

• Testing matrices are selected from Matrix Market:

    Benchmark Prob.    Size
    bcsstm19           817
    1138_bus           1138
    bcsstm23           3134
    bcsstk16           4884
    s1rmt3m1           5489
    bcsstk18           15439

BLAS Performance on GPU - Matrix Vector Multiplication

Figure: Left: time cost. Right: speedup ratio.

BLAS Performance on GPU - Matrix Vector Multiplication

Figure: Relative Error

BLAS Performance on GPU - BLAS 1, BLAS 2 and BLAS 3

Figure: GFLOPS Comparison

• Comment: BLAS 3 outperforms BLAS 1 and BLAS 2 significantly.

CG on GPU (ver 0)

• A synthetic problem setup using Chan's example [Chan and Wan, 1997]:

    A = diag(1, ..., n), n = 10240; b_i(t) = sin(t + iΔt), i = 1, ..., n, with Δt = 2π/100.

• TOL = 1.0 × 10^{−5}.
• Results:
  CG on Host: 31.90 sec on a single core, or 17.53 sec on multiple cores.
  CG on Device: 5.49 sec.

CG on GPU (ver 0) - A Closer Look

Total Time: 5.49 sec
  Mat-Vec: 3.98 sec (72.5%)
  Dot-Prod: 0.57 sec (10.4%)
  SAXPY: 0.16 sec (2.9%)

Comment: Mat-Vec is too expensive!

Projected CG (PrCG) - Algorithm

• Applying the Lanczos-Galerkin projection method to CG yields PrCG [Chan and Wan, 1997]:

    r_0^(q,1) = b^(q) − A x_0^(q,1);
    for i = 1 to k
        α_i^(q,1) = <p_i^(1), r_{i−1}^(q,1)> / <p_i^(1), A p_i^(1)>;
        x_i^(q,1) = x_{i−1}^(q,1) + α_i^(q,1) p_i^(1);
        r_i^(q,1) = r_{i−1}^(q,1) − α_i^(q,1) A p_i^(1);
    end (for);
    % Restart the system if needed;

Projected CG (PrCG)

Total Time: 1.91 sec
  Mat-Vec: 1.01 sec (52.8%)
  Dot-Prod: 0.45 sec (23.6%)
  SAXPY: 0.29 sec (15.2%)

Comment: SAXPY and Dot-Prod are becoming relatively expensive.

Augmented Projected CG (APrCG)

Using the orthogonality properties <p_j^(1), A p_i^(1)> = 0 for j ≠ i, and <p_j^(1), r_i^(q,1)> = 0 for j < i, take the dot product with p_i^(1) on both sides of

    r_{k+1}^(q,1) = b^(q) − Σ_{i=1}^k α_i^(q,1) A p_i^(1),

and we can rewrite α_i as

    α_i^(q,1) = <p_i^(1), b^(q)> / <p_i^(1), A p_i^(1)>.

Augmented Projected CG (APrCG) - Algorithm

• Let P_k = [p_1, ..., p_k] and AP_k = [A p_1, ..., A p_k].

    r_0^(q,1) = b^(q) − A x_0^(q,1);
    α = <P_k, r_0^(q,1)> ./ diagVec(<P_k, AP_k>);
    Λ = diag(α);
    x^(q,1) = x_0^(q,1) + sum(P_k · Λ);
    r^(q,1) = r_0^(q,1) − sum(AP_k · Λ);
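The same projection in batched numpy form (illustrative): all the α_i come from matrix products over P_k and AP_k rather than a loop of vector operations, which is exactly the restructuring toward BLAS 2/3-style work that pays off on the GPU:

    import numpy as np

    def aprcg_project(P, AP, r0):
        # alpha_i = <p_i, r0> / <p_i, A p_i>, all i at once
        alpha = (P.T @ r0) / np.einsum('ij,ij->j', P, AP)
        x_update = P @ alpha              # sum_i alpha_i p_i
        r = r0 - AP @ alpha               # updated residual
        return x_update, r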

Augmented Projected CG (APrCG)

Figure: Left: PrCG. Right: APrCG.

Numerical Test / 3D Shepp-Logan Phantom

Figure: Slice Show of a 3D Shepp-Logan Phantom

3D Shepp-Logan Phantom - Results

    Slice   CG: Cost / SNR   PrCG: Cost / SNR   APrCG: Cost / SNR   ImpRate
    65      17.41 / 13.67    2.16 / 13.67       2.06 / 13.67        4.63%
    66      16.34 / 13.80    2.74 / 13.83       2.58 / 13.83        5.84%
    67      14.35 / 13.91    3.11 / 13.95       2.81 / 13.95        9.65%
    68      13.03 / 13.80    3.53 / 13.83       3.46 / 13.83        1.98%

Table: Reconstruction of four consecutive slices, 65 to 68; the 64th slice is selected as the seed. (ImpRate is the relative cost reduction of APrCG over PrCG.)

3D Shepp-Logan Phantom - Results

Figure: Reconstruction results on Host and Device, slice index = 66.


References

1 R. A. Renaut, Y. Lin, and H. Guo, "Solving Least Squares Problems by Using a Regularized Multisplitting Method," Numerical Linear Algebra with Applications, vol. 19, issue 4, 2009.
2 Y. Lin, B. Wohlberg, and H. Guo, "UPRE Method for Total Variation Parameter Selection," Signal Processing, vol. 90, no. 8, 2010.