-
Numerical Linear Algebra
A biased survey
Daniel Kressner
Chair of Numerical Algorithms and HPC
MATHICSE / SMA / SB / EPF Lausanne
[email protected]
Banff, 7.10.2014
http://anchp.epfl.ch
-
Purpose of this talk
▶ give a flavor of the general field of numerical linear algebra¹
▶ discuss the role of sparsity¹
▶ discuss the role of optimization in numerical linear algebra¹
▶ point out some exciting new developments¹
¹ according to the strongly biased/limited view of the speaker
-
Outline
▶ Numerical Linear Algebra
  ▶ general principles
  ▶ the beauty of black boxes
  ▶ the beauty of error analysis
▶ Numerical Linear Algebra and Sparsity and Optimization
  ▶ sparse and data-sparse matrices
  ▶ sparse and data-sparse vectors
-
General principles
-
Numerical linear algebra
Typical tasks for a matrix A:
▶ Linear systems: Ax = b
▶ Eigenvalue problems: Ax = λx
▶ Matrix functions: exp(A), log(A), √A, sign(A), ...
Recurring principle: Exploitation of structure.
Well established: structure in A (sparsity, symmetry, low rank, ...)
Current trends: structure in x (sparsity, low rank, incorporation of information from the underlying application, ...)
-
Numerical linear algebra in scientific computing...
Application → Mathematical Model → Discretization / Linearization → Linear Algebra problem

Picture taken from http://en.wikipedia.org/wiki/Food_chain
-
...a fundamental component
Numerical linear algebra
▶ often dominates computational time in scientific computing ⇒ imposes limitations on model complexity, discretization accuracy, ...
▶ transfers knowledge across different disciplines
▶ dives deeply into algorithmic design and analysis
▶ provides black-box solvers
▶ provides software libraries (LAPACK) at the heart of virtually every scientific computing package (Maple, Mathematica, NumPy, MATLAB, R, Trilinos)
▶ is at the forefront of high-performance computing (LINPACK benchmark)
-
Software development – a recent example
before 2005. Eigenvalue solver in ScaLAPACK 1.x known to be slow and buggy.
2005–2009. Algorithmic developments (pipelined multishifts, parallel aggressive early deflation) and research code. Jointly with R. Granat and B. Kågström. Main publication: A novel parallel QR algorithm for hybrid distributed memory HPC systems. SIAM J. Sci. Comput., 32(4):2345–2378, 2010.
2009–2011. Further algorithmic improvements, debugging, bug fixing, documentation, testing/benchmarking, incorporation of feedback from Intel and others. Efforts joined by PhD student M. Shao.
November 2011. Release of production code as routine PxHSEQR in ScaLAPACK 2.0.
2012–∞. Code maintenance.
201? ACM TOMS software publication.
-
Software development – a recent example
Old eigenvalue solver in ScaLAPACK 1.x vs. new eigenvalue solver (research code) vs. new eigenvalue solver in ScaLAPACK 2.0 (production code)
[Bar charts of runtimes]
16 000 × 16 000 matrix on 100 cores¹: ScaLAPACK 1.x 16 755 sec, research code (2009) 575 sec, ScaLAPACK 2.0 320 sec.
100 000 × 100 000 matrix on 1 024 cores: research code (2009) about 7 hours, ScaLAPACK 2.0 about 2 hours.
¹ Intel Xeon quadcore L5420 2.5 GHz nodes
-
The beauty of black boxes
No need to care about what is inside.
-
A black box
eig    Eigenvalues and eigenvectors.
    E = eig(A) produces a column vector E containing
    the eigenvalues of a square matrix A.
    ...

Underlying variant of the QR algorithm:
▶ Braman/Byers/Mathias: The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal., 2002.
▶ Braman/Byers/Mathias: The multishift QR algorithm. Part II: Aggressive early deflation.² SIAM J. Matrix Anal., 2002.
² Awarded SIAG/Linear Algebra + SIAM outstanding paper prizes.
-
Not a black box
eig    Eigenvalues and eigenvectors.
    E = eig(A,abstol,reltol,maxit,ns,aed,adhoc,...) produces
    a column vector E containing the eigenvalues of a square
    matrix A, where
- abstol is the tolerance for declaring eigenvalues converged
- reltol is a relative convergence criterion, which may or may not improve accuracy. Choose at your own risk.
  Wilkinson [quote from Stewart’2002]: "... if you want to argue about it, I would rather be on your side."
- ns is the number of shifts used in every iteration. Choose carefully! The correct choice depends on your machine as well as on your application!
- aedsize is the size of the aggressive early deflation window. Choose carefully! The correct choice depends on your machine as well as on your application!
- adhoc is a magic number used when things start falling apart
- ...
-
The beauty of error analysis
-
Arnoldi/Lanczos/CG in a nutshell
▶ Assume A allows for cheap matrix-vector multiplications.
▶ For a starting vector x0, consider the Krylov subspace
  K_k(A, x0) := span{x0, A x0, A²x0, ..., A^(k−1) x0}.
▶ Arnoldi method = Gram-Schmidt + twist, applied to K_k(A, x0).
▶ Produces an orthonormal basis Uk ∈ R^(n×k) of K_k(A, x0).
▶ Arnoldi decomposition:
  A Uk = Uk Hk + rank 1,
  where Hk is k × k Hessenberg.
Symmetric A ⇒
▶ Hk tridiagonal
▶ three-term recurrence ⇒ Lanczos method (a minimal sketch follows below)
▶ no need to store the full basis Uk
▶ CG/Lanczos = Galerkin with K_k(A, x0)
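To make the three-term recurrence concrete, here is a minimal Lanczos sketch (a dense NumPy toy without reorthogonalization; the test matrix and starting vector are arbitrary assumptions, not from the talk):

```python
import numpy as np

def lanczos(A, x0, k):
    """Plain Lanczos: returns U (n x k), tridiagonal T (k x k), and the
    residual vector r such that A @ U = U @ T + r e_k^T (exact arithmetic)."""
    n = len(x0)
    U = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)                 # beta[j] couples columns j and j+1
    u, u_prev, b = x0 / np.linalg.norm(x0), np.zeros(n), 0.0
    for j in range(k):
        U[:, j] = u
        w = A @ u - b * u_prev         # three-term recurrence
        alpha[j] = u @ w
        w -= alpha[j] * u
        b = np.linalg.norm(w)
        if j < k - 1:
            beta[j] = b
        u_prev, u = u, w / b
    T = np.diag(alpha) + np.diag(beta[:k-1], 1) + np.diag(beta[:k-1], -1)
    return U, T, b * u                 # b*u is the rank-1 remainder (acts on the last column)

# Check the decomposition on a random symmetric matrix.
rng = np.random.default_rng(0)
n, k = 200, 30
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
U, T, r = lanczos(A, rng.standard_normal(n), k)
E = A @ U - U @ T
E[:, -1] -= r
print(np.linalg.norm(E))                       # ~ machine precision
print(np.linalg.norm(U.T @ U - np.eye(k)))     # orthogonality; degrades in finite precision
```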
-
Numerical example
Lanczos applied to a symmetric matrix with an isolated eigenvalue:

[Plots: ‖Ukᵀ Uk − I‖₂ and convergence of the 3 largest Ritz values, over 100 iterations]

Total loss of orthogonality of Uk in finite-precision arithmetic ⇒ total failure of CG/Lanczos?
-
Error analysis
Paige’1976/1980: Lanczos performed in finite-precision arithmetic yields a perturbed Arnoldi decomposition
  A Uk = Uk Hk + rank 1 + ΔUk,   ‖ΔUk‖ = O(machine precision)
▶ Starting point for a deep understanding of the behavior of Lanczos in finite-precision arithmetic.
▶ Subsequent results by Greenbaum, Meurant, Strakoš, Wülling, Zemke, and others.
▶ Restores the reputation of several properties that hold in exact arithmetic, but there are subtle differences!
▶ Survey [Meurant/Strakoš’2006] and book [Meurant’2006].
-
Data-sparse matrices
-
A tridiagonal matrix
A =
[Spy plot of a 50 × 50 tridiagonal matrix; nz = 148]
Exact matrix sparsity is a shy deer.
-
Inverse of tridiagonal matrix
A⁻¹ =
[Spy plot: fully dense; nz = 2500]
▶ Explicit inverse needed, e.g., in sparse covariance matrix estimation.
▶ More common situation: inverse implicitly present in Schur complements, e.g., direct sparse factorizations / signal processing on graphs.
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 3
A⁻¹ =
[Magnitude plot of A⁻¹; |white entries| ≤ 10⁻¹⁵]
▶ Classical result [Demko/Moss/Smith’1984] for banded matrices: exponential decay of [A⁻¹]_ij with respect to |i − j|. Decay rate for positive definite A:
  q = (√cond(A) − 1) / (√cond(A) + 1)
▶ Proof based on polynomial approximation of 1/x on [λmin, λmax]. (A small numerical illustration follows below.)
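A quick numerical check of this decay (a sketch; the shifted 1D second-difference matrix is an ad-hoc SPD tridiagonal test matrix, not the one shown on the slides):

```python
import numpy as np

n, shift = 50, 1.0
# SPD tridiagonal test matrix: second-difference matrix plus a shift controlling cond(A)
A = (np.diag((2 + shift) * np.ones(n))
     + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1))
Ainv = np.linalg.inv(A)

kappa = np.linalg.cond(A)
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # Demko/Moss/Smith decay rate

i = n // 2
for dist in [0, 5, 10, 20]:
    # entries decay like q^|i-j| up to a constant factor
    print(dist, abs(Ainv[i, i + dist]), q ** dist)
```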
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 5
A⁻¹ =
[Magnitude plot of A⁻¹; |white entries| ≤ 10⁻¹⁵]
▶ Classical result [Demko/Moss/Smith’1984] for banded matrices: exponential decay of [A⁻¹]_ij with respect to |i − j|. Decay rate for positive definite A:
  q = (√cond(A) − 1) / (√cond(A) + 1)
▶ Proof based on polynomial approximation of 1/x on [λmin, λmax].
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 40
A⁻¹ =
[Magnitude plot of A⁻¹; |white entries| ≤ 10⁻¹⁵]
▶ Classical result [Demko/Moss/Smith’1984] for banded matrices: exponential decay of [A⁻¹]_ij with respect to |i − j|. Decay rate for positive definite A:
  q = (√cond(A) − 1) / (√cond(A) + 1)
▶ Proof based on polynomial approximation of 1/x on [λmin, λmax].
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 10³
A⁻¹ =
[Magnitude plot of A⁻¹; |white entries| ≤ 10⁻¹⁵]
▶ Classical result [Demko/Moss/Smith’1984] for banded matrices: exponential decay of [A⁻¹]_ij with respect to |i − j|. Decay rate for positive definite A:
  q = (√cond(A) − 1) / (√cond(A) + 1)
▶ Proof based on polynomial approximation of 1/x on [λmin, λmax].
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 10³
A⁻¹ =
[Magnitude plot of A⁻¹; |white entries| ≤ 10⁻¹⁵]
Extensions:
▶ More general graph structures: decay with respect to the distance between two nodes [Benzi/Razouk’2007].
▶ 2D structures [Canuto/Simoncini/Verani’2014].
▶ Operator algebra framework [Bickel/Lindner’2012].
-
Inverse of symmetric tridiagonal matrix, cond(A) ≈ 10³
A⁻¹ =
[Magnitude plot of A⁻¹]
▶ Each off-diagonal block has rank 1
  + nestedness of singular vectors
  = semi-separable matrix (represented by 2 vectors of length n). (A quick numerical check follows below.)
▶ 2 books by [Vandebril/Van Barel/Mastronardi’2007/2008].
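The rank-1 structure of the off-diagonal blocks is easy to verify numerically; a small sketch (with an arbitrary, diagonally dominant random tridiagonal matrix as an assumed test case):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = (np.diag(rng.uniform(2, 3, n))
     + np.diag(rng.uniform(-1, 1, n - 1), 1)
     + np.diag(rng.uniform(-1, 1, n - 1), -1))
Ainv = np.linalg.inv(A)

# Every off-diagonal block of A^{-1} (rows 1..k, columns k+1..n) has rank 1.
for k in [10, 25, 40]:
    s = np.linalg.svd(Ainv[:k, k:], compute_uv=False)
    print(k, s[0], s[1])    # second singular value ~ machine precision
```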
-
Hierarchical matrices
▶ cluster tree of column/row indices ⇒ block partitioning
▶ each admissible block replaced by a low-rank matrix using, e.g., adaptive cross approximation (a minimal sketch follows after this list)
▶ clustering/admissibility decided via analytic properties (smoothness of the kernel) or topological properties (graph clustering)
▶ LU factorization, inversion, QR factorization, ... can be performed within the format
▶ Most frequently used in applications featuring dense matrices: integral operators with nonlocal kernel.
▶ HSS matrices / H² matrices = hierarchical matrices + nestedness of low-rank factors: O(n log n) storage → O(n) storage
▶ Books by Bebendorf, Börm, and Hackbusch.
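To make "adaptive cross approximation" concrete, here is a minimal cross-approximation sketch with full pivoting on an admissible kernel block (true ACA uses partial pivoting so that the block never has to be formed explicitly; the 1/|x − y| kernel and the cluster geometry are illustrative assumptions only):

```python
import numpy as np

def cross_approximation(B, tol=1e-10, max_rank=30):
    """Greedy rank-1 cross updates with full pivoting: B ≈ U @ V."""
    R = B.copy()
    U_cols, V_rows = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        if abs(R[i, j]) <= tol:
            break
        U_cols.append(R[:, j] / R[i, j])
        V_rows.append(R[i, :].copy())
        R -= np.outer(U_cols[-1], V_rows[-1])   # residual now vanishes on row i and column j
    return np.column_stack(U_cols), np.vstack(V_rows)

# Admissible block: interactions between two well-separated 1D clusters.
x = np.linspace(0.0, 1.0, 200)     # source cluster
y = np.linspace(3.0, 4.0, 200)     # well-separated target cluster
B = 1.0 / np.abs(x[:, None] - y[None, :])

U, V = cross_approximation(B)
print(U.shape[1], np.linalg.norm(B - U @ V) / np.linalg.norm(B))   # small rank, small error
```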
-
Hierarchical matrices: Current trends
▶ Inject additional analytical information to limit ranks: Gillman, Greengard, Hao, Martinsson, Zorin, ...
▶ Randomized algorithms for low-rank compression: Chiu, Demanet, Greengard, Martinsson, ...
▶ Parallel implementation: Dongarra et al., Keyes et al., Kriemann, Li et al., ...
▶ Combination with sparse LU factorization: Chandrasekaran, Li, Xia, ...
▶ Application to machine learning: kernel methods [Si/Hsieh/Dhillon’2014], covariance matrix estimation [Ballani/DK’2014].
-
Data-sparse vectors
-
Setting
▶ Linear system or eigenvalue problem
  Ax = b,   Ax = λx,
  with x ∈ Rⁿ.
▶ Very large/HUGE n requires exploiting additional information beyond the structure of A.
⇒ Exploit compressibility of x.
Examples:
▶ stable wavelet/frame discretizations of PDEs ⇒ approximate sparsity
▶ Lyapunov matrix equations in control theory / model reduction ⇒ approximate low (matrix) rank
▶ discretizations of certain high-dimensional PDEs ⇒ approximate low (tensor) rank
-
General algorithms
1. Iterate and compress:
▶ Combine an existing iterative solver with compression of the iterates.
▶ Analyzed only for stationary iterations.
▶ Examples:
  ▶ Adaptive wavelet method by Cohen/Dahmen/DeVore’2001/2002.
  ▶ Power method + low tensor rank by Beylkin/Mohlenkamp’2002.
  ▶ Richardson iteration + low tensor rank for stochastic PDEs by Khoromskij/Schwab’2011.
  ▶ CG/BiCGstab + low tensor rank for parametric PDEs by DK/Tobler’2011.
  ▶ Combination of adaptive wavelets + low tensor rank by Bachmayr/Dahmen’2014.
-
General algorithms
2. Constrain and optimize:
▶ Reformulate the linear algebra problem as an optimization problem.
▶ Constrain the admissible set to compressed vectors.
▶ Works best for low-rank structures. Often much more efficient than iterate-and-compress.
▶ Allows for greedy strategies.
3. Specialized algorithms:
▶ Low-rank solvers for Lyapunov and Riccati equations; see [Benner/Saak’2013], [Simoncini’2014] for surveys.
▶ ...
-
Example: PDE-eigenvalue problem
Goal: Compute the smallest eigenvalue of
  Δu(ξ) + V(ξ)u(ξ) = λu(ξ) in Ω = [0,1]^d,   u(ξ) = 0 on ∂Ω.
Assumption: Potential represented as
  V(ξ) = Σ_{j=1..s} V_j^(1)(ξ1) V_j^(2)(ξ2) ··· V_j^(d)(ξd).
⇒ finite difference / tensorized finite element discretization
  A u = (A_L + A_V) u = λu,
with
  A_L = Σ_{j=1..d} I ⊗ ··· ⊗ I ⊗ A_L ⊗ I ⊗ ··· ⊗ I   (d − j identity factors to the left of A_L, j − 1 to the right),
  A_V = Σ_{j=1..s} A_V,j^(d) ⊗ ··· ⊗ A_V,j^(2) ⊗ A_V,j^(1).
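A sketch of how A_L and A_V can be assembled from small one-dimensional factors with Kronecker products (scipy.sparse; only feasible for tiny d and n since the full matrix has size n^d x n^d; the 1D second-difference matrix and the separable toy potential are illustrative assumptions, not the talk's discretization):

```python
import numpy as np
import scipy.sparse as sp

def kron_chain(mats):
    """Kronecker product mats[0] ⊗ mats[1] ⊗ ... (sparse)."""
    out = mats[0]
    for M in mats[1:]:
        out = sp.kron(out, M, format="csr")
    return out

n, d = 8, 3                                   # keep tiny: the result is n^d x n^d
h = 1.0 / (n + 1)
I = sp.identity(n, format="csr")
AL_1d = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr") / h**2

# A_L = sum_j I ⊗ ... ⊗ I ⊗ AL_1d ⊗ I ⊗ ... ⊗ I   (AL_1d in the j-th position)
AL = kron_chain([AL_1d if k == 0 else I for k in range(d)])
for j in range(1, d):
    AL = AL + kron_chain([AL_1d if k == j else I for k in range(d)])

# A_V with a single separable term: toy potential V(ξ) = ξ_1² ξ_2² ··· ξ_d²
grid = np.linspace(h, 1 - h, n)
AV = kron_chain([sp.diags(grid**2) for _ in range(d)])

A = AL + AV
print(A.shape)                                # (n^d, n^d) = (512, 512)
```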
-
Example: Henon-Heiles potential
Consider Ω = [−10, 2]^d and the potential ([Meyer et al. 1990; Raab et al. 2000; Faou et al. 2009])
  V(ξ) = (1/2) Σ_{j=1..d} σ_j ξ_j² + Σ_{j=1..d−1} ( σ∗ (ξ_j ξ_{j+1}² − ξ_j³/3) + (σ∗²/16) (ξ_j² + ξ_{j+1}²)² )
with σ_j ≡ 1, σ∗ = 0.2.
Discretization with n = 128 dof/dimension for d = 20 dimensions.
▶ Eigenvector has n^d ≈ 10⁴² entries.
▶ Explicit storage of the eigenvector would require 10²⁵ exabytes! (See the arithmetic below.)
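A quick sanity check of these figures (assuming 8-byte double-precision entries):

```latex
n^d = 128^{20} = 2^{140} \approx 1.4 \times 10^{42}\ \text{entries},
\qquad
1.4 \times 10^{42} \cdot 8\ \text{bytes} \approx 1.1 \times 10^{43}\ \text{bytes} \approx 10^{25}\ \text{exabytes}.
```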
Solved with accuracy 10⁻¹² in less than 1 hour on a laptop.
-
Rayleigh quotients wrt low-rank matrices
d = 2: symmetric n² × n² matrix A.
  λmin(A) = min_{x ≠ 0} ⟨x, Ax⟩ / ⟨x, x⟩.
We now...
▶ reshape the vector x into an n × n matrix X;
▶ reinterpret Ax as a linear operator A : X ↦ A(X);
▶ for example, if A = Σ_{k=1..s} B_k ⊗ A_k then
  A(X) = Σ_{k=1..s} B_k X A_kᵀ.
  (A small numerical check follows below.)
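A small numerical check of this reshaping identity (a sketch; the formula as written matches NumPy's default row-major reshape between x and X, for which (B_k ⊗ A_k) vec(X) = vec(B_k X A_kᵀ)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 6, 3
Bs = [rng.standard_normal((n, n)) for _ in range(s)]
As = [rng.standard_normal((n, n)) for _ in range(s)]

A_full = sum(np.kron(B, Ak) for B, Ak in zip(Bs, As))   # Kronecker-structured n^2 x n^2 matrix

X = rng.standard_normal((n, n))
x = X.reshape(-1)                                       # row-major vec

lhs = A_full @ x                                        # matrix times the vector x
rhs = sum(B @ X @ Ak.T for B, Ak in zip(Bs, As)).reshape(-1)   # operator A(X) = sum_k B_k X A_k^T
print(np.linalg.norm(lhs - rhs))                        # ~ machine precision
```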
-
Rayleigh quotients wrt low-rank matrices
d = 2: symmetric n² × n² matrix A.
  λmin(A) = min_{X ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩
with matrix inner product ⟨·, ·⟩. We now...
▶ restrict X to low-rank matrices.
-
Rayleigh quotients wrt low-rank matrices
d = 2: symmetric n² × n² matrix A.
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
▶ Approximation error governed by the low-rank approximability of X.
▶ Solved by Riemannian optimization techniques or an alternating linear scheme (ALS).
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Initially:
▶ fix a target rank r
▶ choose U ∈ R^(m×r), V ∈ R^(n×r) randomly, such that V has orthonormal columns
λ̃ − λ = 6 × 10³,   residual = 3 × 10³
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Fix V, optimize for U:
  ⟨X, A(X)⟩ = vec(UVᵀ)ᵀ A vec(UVᵀ) = vec(U)ᵀ (V ⊗ I)ᵀ A (V ⊗ I) vec(U)
⇒ Compute the smallest eigenvalue of the reduced (rn × rn) matrix
  (V ⊗ I)ᵀ A (V ⊗ I).
Note: Computation of the reduced matrix benefits from the Kronecker structure of A (sketch below).
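A sketch of why the reduced matrix is cheap (conventions for vec and Kronecker ordering differ between references; this sketch uses column-major vec, order='F', for which vec(UVᵀ) = (V ⊗ I) vec(U) and each Kronecker term collapses to a small factor):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, s = 8, 3, 2
Bs = [rng.standard_normal((n, n)) for _ in range(s)]
As = [rng.standard_normal((n, n)) for _ in range(s)]
A_full = sum(np.kron(B, Ak) for B, Ak in zip(Bs, As))   # n^2 x n^2

U = rng.standard_normal((n, r))
V = np.linalg.qr(rng.standard_normal((n, r)))[0]        # orthonormal columns
P = np.kron(V, np.eye(n))                               # vec(U V^T) = P vec(U) (column-major vec)
print(np.linalg.norm((U @ V.T).reshape(-1, order="F") - P @ U.reshape(-1, order="F")))

# Reduced rn x rn matrix: never formed as P^T A P in practice, since each
# Kronecker term collapses to (V^T B_k V) ⊗ A_k with a small r x r first factor.
reduced_cheap = sum(np.kron(V.T @ B @ V, Ak) for B, Ak in zip(Bs, As))
reduced_full = P.T @ A_full @ P
print(np.linalg.norm(reduced_cheap - reduced_full))     # ~ machine precision
```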
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Fix V, optimize for U.
λ̃ − λ = 2 × 10³,   residual = 2 × 10³
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Orthonormalize U, fix U, optimize for V:
  ⟨X, A(X)⟩ = vec(UVᵀ)ᵀ A vec(UVᵀ) = vec(Vᵀ)ᵀ (I ⊗ U)ᵀ A (I ⊗ U) vec(Vᵀ)
⇒ Compute the smallest eigenvalue of the reduced (rn × rn) matrix
  (I ⊗ U)ᵀ A (I ⊗ U).
Note: Computation of the reduced matrix benefits from the Kronecker structure of A.
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Orthonormalize U, fix U, optimize for V.
λ̃ − λ = 1.5 × 10⁻⁷,   residual = 7.7 × 10⁻³
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Orthonormalize V, fix V, optimize for U.
λ̃ − λ = 1 × 10⁻¹²,   residual = 6 × 10⁻⁷
-
ALS
ALS for solving
  λmin(A) ≈ min_{X = UVᵀ ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.
Orthonormalize U, fix U, optimize for V.
λ̃ − λ = 7.6 × 10⁻¹³,   residual = 7.2 × 10⁻⁸
-
d ≫ 1: Low-rank tensor formats
-
Tensor network diagrams
▶ Introduced by Roger Penrose.
▶ Heavily used in quantum mechanics (spin networks).
-
These are two matrices A, B
-
This is the matrix product C = AB:
  C_ij = Σ_{k=1..r} A_ik B_kj
-
This is the matrix product C = U Σ Vᵀ:
  C_ij = Σ_{k=1..r} Σ_{ℓ=1..r} U_ik Σ_kℓ V_jℓ
If r ≪ n: implicit representation of C via the smaller matrices U, V, Σ.
-
This is a tensor X of order 3
▶ X ∈ R^(n1×n2×n3) is a 3D array
▶ X_ijk denotes entry (i, j, k)
-
This is a tensor X of order 3 in Tucker decomposition
  X_ijk = Σ_{ℓ1=1..r1} Σ_{ℓ2=1..r2} Σ_{ℓ3=1..r3} C_{ℓ1ℓ2ℓ3} U_{iℓ1} V_{jℓ2} W_{kℓ3}
Implicit representation of X via
▶ r1 × r2 × r3 core tensor C
▶ n1 × r1 matrix U spanning the first mode
▶ n2 × r2 matrix V spanning the second mode
▶ n3 × r3 matrix W spanning the third mode.
-
Tucker decomposition & multilinear rank
Reshape the tensor into a matrix by slicing, e.g., for the first dimension:
  X ↦ X_(1) ∈ R^(n1 × (n2·n3))
Multilinear rank of a tensor X ∈ R^(n1×n2×n3): the tuple r = (r1, r2, r3) with r_i = rank(X_(i)).
Representation of a rank-r tensor (Tucker decomposition):
  X = C ×1 U ×2 V ×3 W
with U ∈ R^(n1×r1), V ∈ R^(n2×r2), W ∈ R^(n3×r3), and core tensor C ∈ R^(r1×r2×r3).
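A sketch of the unfoldings, the multilinear ranks, and a higher-order SVD in NumPy (the standard textbook construction, used here purely for illustration):

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization: that mode becomes the row index."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    """Tucker decomposition via higher-order SVD: X ≈ C ×1 U[0] ×2 U[1] ×3 U[2]."""
    U = [np.linalg.svd(unfold(X, mu), full_matrices=False)[0][:, :r]
         for mu, r in enumerate(ranks)]
    C = X
    for mu, Umu in enumerate(U):
        # multiply mode mu by Umu^T (projection onto the mode-mu subspace)
        C = np.moveaxis(np.tensordot(Umu.T, C, axes=(1, mu)), 0, mu)
    return C, U

# Build a tensor with exact multilinear rank (2, 3, 4) from a random core.
rng = np.random.default_rng(4)
r1, r2, r3 = 2, 3, 4
core = rng.standard_normal((r1, r2, r3))
Us = [np.linalg.qr(rng.standard_normal((n, r)))[0] for n, r in zip((10, 11, 12), (r1, r2, r3))]
X = np.einsum("abc,ia,jb,kc->ijk", core, *Us)

print([np.linalg.matrix_rank(unfold(X, mu)) for mu in range(3)])   # multilinear rank [2, 3, 4]

C, U = hosvd(X, (r1, r2, r3))
print(np.linalg.norm(X - np.einsum("abc,ia,jb,kc->ijk", C, *U)))   # ~ machine precision
```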
-
This is a tensor X of order 6 in TT decomposition
▶ X implicitly represented by four r × n × r tensors and two n × r matrices
▶ Quantum mechanics: MPS (matrix product states)
▶ Introduced in numerical analysis by Oseledets and Tyrtyshnikov.
-
Ranks of a tensor in TT decomposition
This partition corresponds to the low-rank factorization
  X^(1,2,3) = U Vᵀ,   X^(1,2,3) ∈ R^(n1n2n3 × n4n5n6),   U ∈ R^(n1n2n3 × r3),   V ∈ R^(n4n5n6 × r3)
X^(1,2,3) is a matricization/unfolding/flattening/reshape of X: merge the multi-indices (1,2,3) into row indices and the multi-indices (4,5,6) into column indices.
The ranks of X^(1,...,µ) for µ = 1, ..., d − 1 are the TT ranks of X.
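A sketch of how TT ranks and TT cores can be computed by d − 1 sequential SVDs of exactly these unfoldings (the standard TT-SVD construction; it assumes the full tensor fits in memory, so it is only a conceptual illustration):

```python
import numpy as np

def tt_svd(X, tol=1e-12):
    """TT decomposition of a full tensor X via sequential truncated SVDs.
    Returns cores of shape (r_{mu-1}, n_mu, r_mu) with r_0 = r_d = 1."""
    dims, d = X.shape, X.ndim
    cores, r_prev = [], 1
    M = X.reshape(r_prev * dims[0], -1)
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(1, int(np.sum(s > tol * s[0])))          # numerical rank = TT rank r_mu
        cores.append(U[:, :r].reshape(r_prev, dims[mu], r))
        M = (s[:r, None] * Vt[:r]).reshape(r * dims[mu + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_full(cores):
    """Contract TT cores back to the full tensor (for checking only)."""
    X = cores[0]
    for G in cores[1:]:
        X = np.tensordot(X, G, axes=(-1, 0))
    return X.squeeze(axis=(0, -1))

# Tensor of samples of f(x) = sin(x1 + ... + x5), which has all TT ranks equal to 2.
grids = [np.linspace(0, 1, n) for n in (6, 7, 8, 9, 10)]
S = grids[0].reshape(-1, 1, 1, 1, 1)
for mu, g in enumerate(grids[1:], start=1):
    S = S + g.reshape((1,) * mu + (-1,) + (1,) * (4 - mu))
X = np.sin(S)

cores = tt_svd(X)
print([G.shape for G in cores])             # TT ranks appear as the core dimensions
print(np.linalg.norm(X - tt_full(cores)))   # ~ machine precision
```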
-
When to expect good low-rank approximations?
▶ Consider a given function-related tensor and study the best approximation error wrt the TT ranks r.
▶ Approximation error from separation wrt {x1, ..., xa}:
  f(x1, ..., xa, xa+1, ..., xd) ≈ Σ_{k=1..r} g_k(x1, ..., xa) h_k(xa+1, ..., xd)
  for a = 1, ..., d − 1.
▶ Well known: for analytic functions
  error ≲ exp(−r^max{1/a, 1/(d−a)}).
▶ [Temlyakov’1992, Uschmajew/Schneider’2013]: for f ∈ B^s_mix
  error ≲ r^(−2s) (log r)^(2s(max{a, d−a}−1)).
Smoothness is neither sufficient nor necessary for high dimensions!
Need to take into account the topology of the problem (e.g., [Hastings’2008] for quantum ground states). Extended in [DK/Uschmajew’2014].
-
Two tensors X, Y of order 6 in TT decomposition
-
Inner product of two tensors in TT decomposition
▶ Carrying out the contractions requires O(dnr⁴) instead of O(n^d) operations for tensors of order d (sketch below).
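A sketch of that contraction as a left-to-right sweep over the cores (self-contained on two random TT tensors; tt_full repeats the small contraction helper from the TT-SVD sketch above only so that the result can be checked against the brute-force O(n^d) computation):

```python
import numpy as np

def tt_inner(X_cores, Y_cores):
    """Inner product <X, Y> of two tensors given in TT format, swept left to right.
    Cost is linear in d and n and polynomial in the TT ranks, instead of O(n^d)."""
    M = np.ones((1, 1))                               # running r_X x r_Y contraction
    for G, H in zip(X_cores, Y_cores):
        T = np.tensordot(M, H, axes=(1, 0))           # shape (r_X, n, r'_Y)
        M = np.tensordot(G, T, axes=([0, 1], [0, 1])) # contract left rank and mode index
    return M.item()

def tt_full(cores):
    X = cores[0]
    for G in cores[1:]:
        X = np.tensordot(X, G, axes=(-1, 0))
    return X.squeeze(axis=(0, -1))

# Two random order-6 TT tensors with mode size n and TT ranks rX, rY.
rng = np.random.default_rng(5)
d, n, rX, rY = 6, 4, 3, 2
rsX = [1] + [rX] * (d - 1) + [1]
rsY = [1] + [rY] * (d - 1) + [1]
X_cores = [rng.standard_normal((rsX[mu], n, rsX[mu + 1])) for mu in range(d)]
Y_cores = [rng.standard_normal((rsY[mu], n, rsY[mu + 1])) for mu in range(d)]

ref = np.sum(tt_full(X_cores) * tt_full(Y_cores))     # brute-force check over all n^d entries
print(tt_inner(X_cores, Y_cores), ref)
```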
-
This is a tensor X of order 16 in PEPS
▶ PEPS = Projected Entangled Pair States
▶ Schuch/Wolf/Verstraete/Cirac’2007: the inner product of two PEPS is NP-hard
▶ Landsberg/Qi/Ye’2012: PEPS are not Zariski closed
-
ALS for TT decompositions
▶ Originates from quantum mechanics = one-site DMRG.
▶ More general (numerical analysis) viewpoint developed in [Holtz/Rohwedder/Schneider’2012; Dolgov/Oseledets’2012].
Goal:
  min { ⟨X, A(X)⟩ / ⟨X, X⟩ : X ∈ M_r, X ≠ 0 }
where M_r is the set of tensors with fixed TT ranks r (depicted as a TT network diagram on the slide).
ALS: Choose one node t, fix all other nodes, and determine the new tensor at node t by minimizing the Rayleigh quotient ⟨X, A(X)⟩ / ⟨X, X⟩. This is done for all nodes (a sweep), and sweeps are continued until convergence.
-
ALS for TT decompositions: Nuts & Bolts
▶ Subproblems are of size nr² × nr²:
  ▶ iterative method (LOBPCG) needed if nr² > O(10²)
  ▶ availability of a good preconditioner is crucial
  ▶ the preconditioner can be inherited from the full problem
▶ Rank adaptation strategies:
  ▶ DMRG: merge/optimize/split neighbouring cores
  ▶ cheaper alternative: enrich neighbouring cores with local (preconditioned) gradient information [Dolgov/Savostyanov’2013], [DK/Steinlechner/Uschmajew’2013]
▶ Computation of several eigenvalues possible with the block-TT format [Dolgov/Khoromskij/Oseledets/Savostyanov’2014].
▶ Local convergence results in [Rohwedder/Uschmajew’2013, Uschmajew/Vandereycken’2013].
-
Numerical Experiments - Henon-Heiles, d = 20
[Convergence plots for ALS: eigenvalue error (err_lambda), residual (res), and number of iterations (nr_iter) against execution time in seconds]
Size = 128²⁰ ≈ 10⁴². Maximal TT rank 40.
-
Numerical Experiments - Henon-Heiles, d = 100
[Plot: residual and eigenvalue error against execution time in seconds]
▶ spectral discretization with 10 dof/dimension ⇒ size 10¹⁰⁰
▶ Algorithm: combination of ALS with preconditioned residuals [DK/Steinlechner/Uschmajew’2013]
▶ rank adaptivity
▶ computed smallest eigenvalue = 70.7415
-
Low-rank tensor techniques
▶ Emerged during the last five years in numerical analysis.
▶ Successfully applied to:
  ▶ parameter-dependent / multi-dimensional integrals;
  ▶ electronic structure calculations: Hartree-Fock / DFT;
  ▶ stochastic and parametric PDEs;
  ▶ high-dimensional Boltzmann / chemical master / Fokker-Planck / Schrödinger equations;
  ▶ micromagnetism;
  ▶ rational approximation problems;
  ▶ computational homogenization;
  ▶ computational finance;
  ▶ multivariate regression and machine learning;
  ▶ queuing models;
  ▶ context-aware recommender systems;
  ▶ ...
▶ For references on these applications, see
  ▶ L. Grasedyck, DK, Ch. Tobler (2013). A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1).
  ▶ W. Hackbusch (2012). Tensor Spaces and Numerical Tensor Calculus, Springer.
-
Conclusions
▶ Numerical linear algebra is alive and healthy.
▶ Lots of potential to incorporate techniques from compressed sensing / optimization / ... into numerical linear algebra algorithms.
Some important aspects not covered:
▶ Preconditioning.
▶ Randomized algorithms.
▶ Communication-avoiding algorithms.
▶ Merging iterative methods with (adaptive) discretization.
▶ ...