Dense Linear Algebra
Mark Gates
[email protected]
http://www.icl.utk.edu/~mgates3/
CS 594 — Scientific Computing for Engineers — Spring 2017
Outline

• Legacy Software
  • BLAS
  • LINPACK
  • LAPACK
  • ScaLAPACK
• New Software
  • PLASMA using OpenMP
  • DPLASMA and PaRSEC
  • MAGMA for CUDA, OpenCL, or Xeon Phi
  • Case study: 2-stage SVD
Memory hierarchy

| Level | Example (Haswell) | Speed / cost / capacity |
| --- | --- | --- |
| CPU registers | 168 integer, 168 float | Fastest, most expensive, smallest capacity |
| CPU cache (L1, L2, L3) | 32 KiB L1 per core, 256 KiB L2 per core, 1.5 MiB L3 per core | Fast, expensive, small capacity |
| Random Access Memory (RAM): SDRAM, DDR, GDDR, HBM, etc. | 64 GiB | Modest speed, modest cost, modest capacity |
| Solid State Drive (SSD, Flash): burst buffers, non-volatile memory, files | 500 GiB | Slow, cheap, large capacity |
| Mechanical Hard Drive (HDD): virtual memory, files | 4 TiB | Slowest, cheapest, largest capacity |

Adapted from illustration by Ryan Leng.
Compute vs. memory speed

• Machine balance: # flops per memory access
• Flops “free,” memory expensive
  • Good for dense, BLAS-3 operations (matrix multiply)
• Flops & memory access balanced
  • Good for sparse & vector operations

Data from Stream benchmark (McCalpin) and vendor information pages.
BLAS: Basic Linear Algebra Subroutines

• Level 1 BLAS — vector operations, e.g., y = αx + βy
  • O(n) data and flops (floating point operations)
  • Memory bound: O(1) flops per memory access
• Level 2 BLAS — matrix-vector operations, e.g., y = αAx + βy
  • O(n^2) data and flops
  • Memory bound: O(1) flops per memory access
• Level 3 BLAS — matrix-matrix operations, e.g., C = αAB + βC
  • O(n^2) data, O(n^3) flops
  • Surface-to-volume effect
  • Compute bound: O(n) flops per memory access
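The surface-to-volume effect is easy to quantify by counting flops against elements of memory traffic; a short Python sketch (operation counts only; the traffic constants are back-of-envelope assumptions that ignore caching):

```python
# Rough flops per element of memory traffic for each BLAS level.
def flops_per_element(level, n):
    if level == 1:                       # _axpy: 2n flops, ~3n elements moved
        return 2 * n / (3 * n)
    if level == 2:                       # _gemv: 2n^2 flops, ~n^2 + 3n elements
        return 2 * n**2 / (n**2 + 3 * n)
    if level == 3:                       # _gemm: 2n^3 flops, ~4n^2 elements
        return 2 * n**3 / (4 * n**2)

n = 1000
print(flops_per_element(1, n))           # ~0.67: memory bound
print(flops_per_element(2, n))           # ~2:    memory bound
print(flops_per_element(3, n))           # 500:   compute bound, grows as n/2
```

Only Level 3 has a ratio that grows with n, which is why blocked algorithms chase BLAS-3.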
BLAS: Basic Linear Algebra Subroutines

• Intel Sandy Bridge (2-socket E5-2670)
  • Peak 333 Gflop/s = 2.6 GHz × 16 cores × 8 double-precision flops/cycle †
  • Max memory bandwidth 51 GB/s ‡

† http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2
‡ http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI
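These two numbers determine the machine balance; a quick check of the arithmetic on the figures quoted above:

```python
# Machine balance for the 2-socket Sandy Bridge figures above.
peak_flops = 2.6e9 * 16 * 8         # 2.6 GHz x 16 cores x 8 DP flops/cycle ~= 333 Gflop/s
bandwidth = 51e9                    # bytes/s
doubles_per_second = bandwidth / 8  # 8 bytes per double

balance = peak_flops / doubles_per_second
print(round(balance, 1))            # ~52 flops per double moved from memory
```

So a kernel must perform roughly 52 flops per double it touches to stay compute bound on this machine: gemm, with O(n) flops per element, can get there; axpy, with O(1), cannot.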
Memory bandwidth

• Intel Ark
  • https://ark.intel.com/
  • Search for the model number (e.g., “E5-2697 v4” for Broadwell; see /proc/cpuinfo)
• Stream benchmark
  • https://www.cs.virginia.edu/stream/
  • Add -fopenmp or equivalent to the Makefile CFLAGS and FFLAGS
  • Adjust STREAM_ARRAY_SIZE depending on cache size; see instructions in the code

```
prompt> less /proc/cpuinfo
...
model name  : Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
cache size  : 46080 KB
physical id : 0
siblings    : 18
...
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
...
```
BLAS naming scheme

Example: dgemm, C = AB + C

• One or two letter data type (precision)
  • i = integer (e.g., index)
  • s = single (float)
  • d = double
  • c = single-complex
  • z = double-complex
• Two letter matrix type (BLAS 2, 3)
  • ge = general nonsymmetric
  • sy = symmetric (A = A^T)
  • he = complex Hermitian (A = A^H)
  • tr = triangular (L or U)
  • Also banded and packed formats
• Two or more letter function, e.g.
  • mv = matrix-vector product
  • mm = matrix-matrix product
  • etc.

BLAS 1 examples
• sdot: result = x^T y (single)
• ddot: result = x^T y (double)
• cdotc: result = x^H y (single-complex)
• cdotu: result = x^T y (single-complex)
• zdotc: result = x^H y (double-complex)
• zdotu: result = x^T y (double-complex)
• _axpy: y = αx + y
• _scal: y = αy
• _copy: y = x
• _swap: x ↔ y
• _nrm2: result = ‖x‖₂
• _asum: result = Σᵢ |xᵢ|
• i_amax: index = argmaxᵢ |xᵢ|
• _rot: apply Givens rotation, [x; y] ← [c, s; −s, c] [x; y]
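The convention is regular enough to decode mechanically; a toy decoder (the lookup tables mirror the slide, while the decode helper itself is illustrative, not part of BLAS):

```python
PRECISION = {"i": "integer", "s": "single", "d": "double",
             "c": "single-complex", "z": "double-complex"}
MATRIX = {"ge": "general", "sy": "symmetric", "he": "Hermitian", "tr": "triangular"}

def decode(name):
    """Split a BLAS 2/3 routine name into precision, matrix type, operation."""
    return PRECISION[name[0]], MATRIX[name[1:3]], name[3:]

print(decode("dgemm"))   # ('double', 'general', 'mm')
print(decode("ztrsm"))   # ('double-complex', 'triangular', 'sm')
```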
BLAS 2 examples
• _gemv: y = Ax + y, A general
• _symv: y = Ax + y, A symmetric
• _hemv: y = Ax + y, A Hermitian
• _ger: C = xy^T + C, C general
• _syr: C = xx^T + C, C symmetric
• _her: C = xx^H + C, C Hermitian
• _trmv: x = Ax, A triangular
• _trsv: solve Ax = b, A triangular

BLAS 3 examples
• _gemm: C = AB + C, all general
• _symm: C = AB + C, A symmetric
• _hemm: C = AB + C, A Hermitian
• _syrk: C = AA^T + C, C symmetric
• _herk: C = AA^H + C, C Hermitian
• _trmm: X = AX, A triangular
• _trsm: solve AX = B, A triangular
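The BLAS 3 updates map directly onto matrix expressions; here the _gemm and _syrk semantics are checked with NumPy as a stand-in for an optimized BLAS (NumPy calls one internally for `@`):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 5, 3
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))
alpha, beta = 2.0, 0.5

# _gemm update: C <- alpha*A@B + beta*C
C_new = alpha * (A @ B) + beta * C

# _syrk update: C <- alpha*A@A^T + beta*C; the result stays symmetric,
# which is why the BLAS only needs to touch one triangle of C
S = alpha * (A @ A.T) + beta * np.eye(m)
assert np.allclose(S, S.T)
```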
Why is BLAS so important?

• BLAS are efficient, portable, parallel, and widely available
  • Vendors (Intel, AMD, IBM, Cray, NVIDIA, ...) provide highly optimized versions
  • Open-source libraries available (ATLAS, OpenBLAS)
• Commonly used for high-quality linear algebra and HPC software
• Performance of many applications depends on the underlying BLAS
Outline• Legacy Software
• BLAS• LINPACK • LAPACK• ScaLAPACK
• New Software• PLASMA using OpenMP• DPLASMA and PaRSEC• MAGMA for CUDA, OpenCL, or Xeon Phi• Case study: 2-stage SVD
13
![Page 17: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/17.jpg)
LINPACK — LINear algebra PACKage

• Math software library
  • Solving dense & banded linear systems
  • Singular value decomposition (SVD)
• Written in the 1970s in Fortran 66
  • Aiming for software portability and efficiency
  • Uses Level 1 BLAS
• Four primary contributors:
  • Jack Dongarra, Argonne
  • Jim Bunch, U. California, San Diego
  • Cleve Moler, New Mexico
  • Pete Stewart, U. of Maryland
• Superseded by LAPACK
Computing in 1974

• High-performance computers: IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
• Fortran 66
• EISPACK released in 1974
  • Eigenvalue and singular value problems
  • Translation of Algol into Fortran
  • Did not use BLAS
• Level 1 BLAS — vector operations (1979)
  • Aimed at vector supercomputer architectures
• LINPACK released in 1979
  • About the time of the Cray 1
LU Factorization in LINPACK

• Factor one column at a time
  • i_amax and _scal
• Update each column of the trailing matrix, one column at a time
  • _axpy
• Level 1 BLAS
• Bulk synchronous
  • Single main thread
  • Parallel work in BLAS
  • “Fork-and-join” model

[Figure: a single main thread forks into vectorized or multi-threaded BLAS calls, with a sync after each one]
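A NumPy sketch of this column-at-a-time factorization (modeled loosely on LINPACK's dgefa; only i_amax-, _scal-, and _axpy-style operations touch the matrix):

```python
import numpy as np

def lu_unblocked(A):
    """LU with partial pivoting, one column at a time. A is overwritten by
    L (implicit unit diagonal) and U; returns the factors and the row
    permutation."""
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))       # i_amax: pick the pivot row
        if p != k:
            A[[k, p], :] = A[[p, k], :]
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] *= 1.0 / A[k, k]               # _scal: form column of L
        for j in range(k + 1, n):                 # trailing columns, one at a time
            A[k+1:, j] -= A[k, j] * A[k+1:, k]    # _axpy
    return A, piv

rng = np.random.default_rng(42)
A0 = rng.standard_normal((6, 6))
LU, piv = lu_unblocked(A0)
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
assert np.allclose(L @ U, A0[piv])                # P A = L U
```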
LAPACK — Linear Algebra PACKage

• Reformulates linear algebra algorithms as blocked algorithms using BLAS-3
• Linear systems
  • Nearly 100% BLAS-3
• Singular value and symmetric eigenvalue problems
  • About 50% BLAS-3
• Nonsymmetric eigenvalue problems
  • About 80% BLAS-3
• 4 data types: single, double, single-complex, double-complex
LU factorization in LAPACK

• Factor a panel of nb columns
  • getf2, unblocked BLAS-2 code
• Level 3 BLAS update of the block row of U
  • trsm
• Level 3 BLAS update of the trailing matrix
  • gemm
• Aimed at machines with a cache hierarchy
• Bulk synchronous
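The same factorization in blocked form, mirroring the getf2 / trsm / gemm structure (a sketch without pivoting, so the test matrix is made diagonally dominant; the real getrf pivots):

```python
import numpy as np

def lu_blocked(A, nb=4):
    """Right-looking blocked LU, LAPACK-style structure; no pivoting."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # getf2: unblocked BLAS-2 factorization of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # trsm: block row of U, solving L11 @ U12 = A12
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # gemm: trailing-matrix update A22 -= L21 @ U12
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A

rng = np.random.default_rng(0)
A0 = rng.standard_normal((10, 10)) + 10 * np.eye(10)
LU = lu_blocked(A0, nb=4)
L = np.tril(LU, -1) + np.eye(10)
U = np.triu(LU)
assert np.allclose(L @ U, A0)
```

Almost all the flops land in the single gemm line, which is the point of the blocked formulation.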
Parallelism in LAPACK

• Most flops are in the gemm update
  • The 2/3 n^3 term
  • Easily parallelized using multi-threaded BLAS
  • Done in any reasonable software
• Other operations are lower order
  • Potentially expensive if not parallelized

[Figure: blocked LU steps — getf2 panel factorization, laswp row swaps, trsm solve for the block row of U, gemm trailing-matrix multiply]
LAPACK routine, solve AX = B

• dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• Input: n × n matrix A (leading dimension lda) and n × nrhs right-hand sides B (leading dimension ldb); matrices stored column-wise
• Output: L and U overwrite A, with an implicit unit diagonal for L; the solution X overwrites B; pivots are returned in ipiv
• info (error code)
  • = 0: no error
  • < 0: invalid argument
  • > 0: numerical error (e.g., singular)
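From Python, the same driver is reachable through SciPy's LAPACK bindings (assuming SciPy is available; scipy.linalg.lapack.dgesv wraps this routine):

```python
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(0)
n, nrhs = 4, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, nrhs))

# lu holds L and U packed in one n x n array, piv the pivots,
# x the solution, info the error code described above
lu, piv, x, info = lapack.dgesv(A, B)

assert info == 0              # 0: no error, < 0: bad argument, > 0: singular
assert np.allclose(A @ x, B)  # X solves A X = B
```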
LAPACK naming

• Similar to BLAS naming: data type + matrix type + function
• Additional matrix types:
  • po = real symmetric / complex Hermitian positive definite (SPD / HPD, all λ > 0)
  • or = orthogonal (AA^T = I)
  • un = unitary (AA^H = I)
  • Also banded, packed, etc. formats
LAPACK routines

• ~480 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv — solve linear system Ax = b
    • gesv: general non-symmetric (LU)
    • posv: symmetric positive definite (Cholesky)
    • sysv: symmetric indefinite (LDL^T)
    • Also packed, banded, tridiagonal storage
  • ls — linear least squares: Ax ≅ b
    • gglse: linear equality-constrained least squares
    • ggglm: general Gauss-Markov linear model
  • ev — eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd: symmetric and symmetric generalized
    • geev: non-symmetric and non-symmetric generalized
  • svd — singular value decomposition: A = UΣV^H
    • gesvd: standard and generalized
    • gesdd: divide & conquer (faster)

http://www.icl.utk.edu/~mgates3/docs/lapack.html
LAPACK routines

• Computational routines: one step of a problem
  • trf: triangular factorization (LU, Cholesky, LDL^T)
  • trs: triangular solve
  • qrf: orthogonal QR factorization
  • mqr: multiply by Q
  • gqr: generate Q
  • etc.
• Auxiliary routines, beginning with “la”
  • lan__: matrix norm (one, inf, Frobenius, max); see SVD for 2-norm
  • lascl: scale matrix
  • lacpy: copy matrix
  • laset: set matrix & diagonal to constants
  • etc.

http://www.icl.utk.edu/~mgates3/docs/lapack.html
ScaLAPACK — Scalable Linear Algebra PACKage

• Dense linear algebra for distributed memory
• Message passing
  • Clusters of SMPs
  • Supercomputers
• Modules
  • PBLAS: Parallel BLAS
  • BLACS: Basic Linear Algebra Communication Subprograms
PBLAS

• Similar to BLAS in functionality and naming
• Built on BLAS and BLACS
• Provides a global view of the matrix
  • LAPACK: dge___( m, n, A(ia, ja), lda, ... ) — submatrix offsets implicit in the pointer
  • ScaLAPACK: pdge___( m, n, A, ia, ja, descA, ... ) — pass submatrix offsets and a matrix descriptor
ScaLAPACK structure

[Figure: software stack — ScaLAPACK calls the PBLAS; the PBLAS call LAPACK and the BLACS; LAPACK calls the BLAS; the BLACS call MPI. ScaLAPACK and the PBLAS use global addressing, LAPACK and the BLAS local addressing; the BLAS and MPI are platform specific, the layers above them platform independent.]
ScaLAPACK routine, solve AX = B

• LAPACK: dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• ScaLAPACK: pdgesv( n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info )
• Input: n × n matrix A and n × nrhs right-hand sides B, distributed in mb × nb blocks; global matrix point of view
• Output: L and U overwrite A, with an implicit unit diagonal for L; X overwrites B; pivots are returned in distributed ipiv
• info (error code)
  • = 0: no error
  • < 0: invalid argument
  • > 0: numerical error (e.g., singular)
2D block-cyclic layout

• m × n matrix, partitioned into mb × nb blocks
• p × q process grid
• Blocks are dealt out cyclically: with 0-based indices, global block (i, j) lives on process (i mod p, j mod q)
• Two views: the global matrix view, and the local process point of view, where each process stores its own blocks contiguously

[Figure: a 9 × 8 grid of blocks distributed over a 2 × 3 process grid, shown both in the global matrix view and in the local view of each of the six processes]
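The mapping from global block to owning process is just a pair of mod/div operations; a sketch with 0-based indices (the slide's figure numbers blocks and processes from 1):

```python
def owner(ib, jb, p, q):
    """Process-grid coordinates owning global block (ib, jb), 0-based."""
    return (ib % p, jb % q)

def local_block(ib, jb, p, q):
    """Where that block sits in the owner's local storage."""
    return (ib // p, jb // q)

# 9 x 8 blocks on a 2 x 3 process grid, as in the figure:
assert owner(0, 0, 2, 3) == (0, 0)        # global block (1,1) -> process (1,1)
assert owner(0, 3, 2, 3) == (0, 0)        # block (1,4) is also on process (1,1),
assert local_block(0, 3, 2, 3) == (0, 1)  # ...stored as its local block (1,2)
assert owner(1, 1, 2, 3) == (1, 1)        # block (2,2) -> process (2,2)
```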
Why 2D block cyclic?

• Why not a simple 2D block distribution, where each process owns one large contiguous block of the matrix?

[Figure: the same 9 × 8 grid of blocks in a simple 2D block distribution — process (1,1) owns the top-left blocks, process (2,3) the bottom-right blocks]
Why 2D block cyclic?

• Better load balancing! As the factorization proceeds, the active trailing submatrix shrinks; with a cyclic distribution every process keeps a share of the remaining work, instead of going idle once its contiguous block is factored.

[Figure: global matrix view vs. local process point of view on the 2 × 3 process grid]
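The claim can be checked by counting active trailing-submatrix blocks per process at each factorization step (a small counting sketch; the "simple" layout here is an assumed contiguous-slab distribution used only for comparison):

```python
def load_imbalance(nblocks, p, q, cyclic):
    """Ratio of most- to least-loaded process, counting trailing-submatrix
    blocks owned summed over every step k of a blocked factorization."""
    counts = {(i, j): 0 for i in range(p) for j in range(q)}
    rows_per = -(-nblocks // p)   # ceil division, for the simple slab layout
    cols_per = -(-nblocks // q)
    for k in range(nblocks):      # active submatrix is blocks [k:, k:]
        for ib in range(k, nblocks):
            for jb in range(k, nblocks):
                if cyclic:
                    proc = (ib % p, jb % q)
                else:
                    proc = (min(ib // rows_per, p - 1), min(jb // cols_per, q - 1))
                counts[proc] += 1
    return max(counts.values()) / min(counts.values())

print(load_imbalance(12, 2, 3, cyclic=True))    # ~1.4: nearly balanced
print(load_imbalance(12, 2, 3, cyclic=False))   # ~4.4: badly imbalanced
```

With the slab layout, the process owning the top-left corner finishes early and idles, while the bottom-right owner stays busy to the end; the cyclic layout spreads every trailing submatrix across all six processes.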
ScaLAPACK routines

• ~140–160 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv — solve linear system Ax = b
    • gesv: general non-symmetric (LU)
    • posv: symmetric positive definite (Cholesky)
    • sysv: symmetric indefinite (LDL^T)
    • Also packed, banded, tridiagonal storage
  • ls — linear least squares: Ax ≅ b
    • gglse: linear equality-constrained least squares
    • ggglm: general Gauss-Markov linear model
  • ev — eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd: symmetric and symmetric generalized
    • non-symmetric: only gehrd Hessenberg reduction, no geev
  • svd — singular value decomposition: A = UΣV^H
    • gesvd: standard and generalized (not optimized for tall matrices)
    • gesdd: divide & conquer (faster)
Parallelism in ScaLAPACK

• Similar to LAPACK: bulk-synchronous
• Most flops in the gemm update (the 2/3 n^3 term)
• Can use sequential BLAS:
  p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS:
  p × q = # nodes = # MPI processes, num_threads = # cores/node

[Figure: blocked LU steps — getf2 panel, laswp row swaps, trsm solve, gemm multiply]
Legacy software libraries
43
LINPACK (70s)vector operations Level 1 BLAS
LAPACK (80s)block operations Level 3 BLAS
ScaLAPACK (90s)2D block cyclic
distribution
PBLAS BLACS MPI
![Page 48: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/48.jpg)
Outline• Legacy Software
• BLAS• LINPACK• LAPACK• ScaLAPACK
• New Software• PLASMA using OpenMP • DPLASMA and PaRSEC• MAGMA for CUDA, OpenCL, or Xeon Phi• Case study: 2-stage SVD
44
![Page 49: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/49.jpg)
Emerging software solutions
45
• PLASMA
  • Tile layout & algorithms
  • Dynamic scheduling — OpenMP 4
• DPLASMA — PaRSEC
  • Distributed with accelerators
  • Tile layout & algorithms
  • Dynamic scheduling — parameterized task graph
• MAGMA
  • Hybrid multicore + accelerator (GPU, Xeon Phi)
  • Block algorithms (LAPACK style)
  • Standard layout
  • Static scheduling
2007 2009 2011
![Page 50: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/50.jpg)
Tile matrix layout
• Tile layout
• Each tile is contiguous (column major)
• Enables dataflow scheduling
• Cache and TLB efficient (reduces conflict misses and false sharing)
• MPI messaging efficiency (zero-copy communication)
• In-place, parallel layout translation
46
[Figure: LAPACK column-major layout (m × n, leading dimension lda) vs. (D)PLASMA tile layout (nb × nb contiguous tiles)]
![Page 51: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/51.jpg)
Tile algorithms: Cholesky
47
LAPACK Algorithm (right looking):
• panel: diagonal block = chol(diagonal block)
• trsm: panel blocks 1, 2, 3 below the diagonal solved against the Cholesky factor
• syrk: trailing matrix −= panel · panelᵀ (one big update)

Tile Algorithm:
• potrf: diagonal tile = chol(diagonal tile)
• trsm: panel tiles 1, 2, 3 each solved independently
• syrk: each trailing diagonal tile updated with its own panel tile (−1·1ᵀ, −2·2ᵀ, −3·3ᵀ)
• gemm: each trailing off-diagonal tile updated with a pair of panel tiles (−2·1ᵀ, −3·1ᵀ, −3·2ᵀ)
![Page 52: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/52.jpg)
Track dependencies — Directed acyclic graph (DAG)
48
Classical fork-join schedulewith artificial synchronizations
Reordered for 3 cores,without synchronizations
![Page 55: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/55.jpg)
Execution trace
• LAPACK-style fork-join leaves cores idle
49
[Trace: Gantt chart of panels over time 0–822.8; kernels potrf, trsm, syrk, gemm, idle]
24 cores; matrix is 8000 x 8000, tile size is 400 x 400.
![Page 56: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/56.jpg)
Execution trace
• PLASMA squeezes out idle time
50
[Trace: Gantt chart of panels over time 0–701.5; kernels potrf, trsm, syrk, gemm, idle]
24 cores; matrix is 8000 x 8000, tile size is 400 x 400.
![Page 57: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/57.jpg)
superscalar scheduling
serial code
side-effect-free tasks
dependency resolution
resolving data hazards
read after write
write after read
write after write
similar approach
StarPU from Inria
SMPSs from Barcelona
SuperGlue and DuctTEiP from Uppsala
Jade from Stanford (historical)
OpenMP
QUARK multithreading
Dataflow scheduling
• Exploit parallelism
• Load balance
• Remove artificial synchronization between steps
• Maximize data locality
• Parallel correctness inherited from serial tile algorithm
• Runtime automatically resolves data hazards (read after write, write after read, write after write)
51
![Page 58: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/58.jpg)
Dynamic scheduling runtimes
• Jade — Stanford University
• SMPSs / OMPSs — Barcelona Supercomputing Center
• StarPU — INRIA Bordeaux
• QUARK — University of Tennessee
• SuperGlue & DuctTEiP — Uppsala University
• OpenMP 4
52
May 2008 OpenMP 3.0
April 2009 GCC 4.4
July 2013 OpenMP 4.0
April 2014 GCC 4.9
#pragma omp task
#pragma omp task depend
![Page 59: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/59.jpg)
PLASMA architecture
53
![Page 60: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/60.jpg)
Cholesky: pseudo-code
54
// Sequential Tile Cholesky
for k = 1 .. ntiles {
    potrf( Akk )
    for i = k+1 .. ntiles {
        trsm( Akk, Aik )
    }
    for i = k+1 .. ntiles {
        syrk( Aik, Aii )
        for j = i+1 .. ntiles {
            gemm( Ajk, Aik, Aij )
        }
    }
}

// PLASMA OpenMP Tile Cholesky
for k = 1 .. ntiles {
    omp task potrf( ... )
    for i = k+1 .. ntiles {
        omp task trsm( ... )
    }
    for i = k+1 .. ntiles {
        omp task syrk( ... )
        for j = i+1 .. ntiles {
            omp task gemm( ... )
        }
    }
}
![Page 61: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/61.jpg)
Cholesky: OpenMP
55
#pragma omp parallel
#pragma omp master
{
    for (k = 0; k < nt; k++) {
        #pragma omp task depend(inout:A(k,k)[0:nb*nb])
        info = LAPACKE_dpotrf_work(
            LAPACK_COL_MAJOR, lapack_const(PlasmaLower),
            nb, A(k,k), nb);

        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(k,k)[0:nb*nb]) \
                             depend(inout:A(m,k)[0:nb*nb])
            cblas_dtrsm(
                CblasColMajor, CblasRight, CblasLower,
                CblasTrans, CblasNonUnit,
                nb, nb, 1.0, A(k,k), nb,
                             A(m,k), nb);
        }
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                             depend(inout:A(m,m)[0:nb*nb])
            cblas_dsyrk(
                CblasColMajor, CblasLower, CblasNoTrans,
                nb, nb, -1.0, A(m,k), nb,
                         1.0, A(m,m), nb);

            for (n = k+1; n < m; n++) {
                #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                                 depend(in:A(n,k)[0:nb*nb]) \
                                 depend(inout:A(m,n)[0:nb*nb])
                cblas_dgemm(
                    CblasColMajor, CblasNoTrans, CblasTrans,
                    nb, nb, nb, -1.0, A(m,k), nb,
                                      A(n,k), nb,
                                 1.0, A(m,n), nb);
            }
        }
    }
}
![Page 62: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/62.jpg)
Cholesky performance
56
[Plot: Gflop/s (0–600) vs. matrix size (0–20000)]
double precision Cholesky factorization
Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores
Legend: OpenMP, MKL, QUARK
![Page 63: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/63.jpg)
Merging DAGs
48 cores, matrix is 4000 x 4000, tile size is 200 x 200.
57
Cholesky A = LLT
Invert L-1
Multiply A-1 = L-T L-1
Cholesky matrix inverse
NOTE: Please do not explicitly invert matrices to compute X = A⁻¹B. Solve using gesv, posv, or sysv; it is faster and more accurate than inverting and multiplying.
![Page 68: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/68.jpg)
Cholesky inversion performance
58
[Plot: Gflop/s (0–600) vs. matrix size (0–20000)]
double precision Cholesky inversion
Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores
Legend: OpenMP, MKL, QUARK
![Page 69: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/69.jpg)
LU performance
• Multi-threaded panel
• Factorization reaches peak performance quickly
59
[Plot: LU panel factorization, Intel Sandy Bridge (32 cores); Gflop/s (0–140) vs. panel height (0–35000), panel width = 256; curves for 2, 4, 8, 16 threads]
[Plot: LU factorization, Intel Sandy Bridge (32 cores); Gflop/s (0–350) vs. matrix size (0–20000); curves: LAPACK, MKL, PLASMA]
S. Donfack, J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, I. Yamazaki, "A survey of recent developments in parallel implementations of Gaussian elimination," Concurrency and Computation: Practice and Experience, 27(5):1292–1309, 2015. DOI: 10.1002/cpe.3110
![Page 70: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/70.jpg)
PLASMA QR
• Tile QR
  • Great for square matrices
  • Great for multicore
  • Pairwise reductions ("domino")
• TSQR / CAQR
  • Tall-skinny QR / communication-avoiding QR
  • Tree of pairwise reductions
  • Great for tall-skinny matrices (least squares)
  • Great for distributed memory
  • But triangle-triangle (TT) pairs less efficient than square-square (SS) or triangle-square (TS)
60
![Page 71: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/71.jpg)
TSQR performance
• Fixed cols, increase rows from square (left) to tall-skinny (right)
61
[Plot: Performance (Gflop/s, 0–3000) vs. M rows (0–300000), N = 4,480; curves: TSQR, ScaLAPACK (MKL)]
60 nodes, 8 cores/node, Intel Nehalem Xeon E5520, 2.27 GHz; theoretical peak: 4358.4 Gflop/s
![Page 72: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/72.jpg)
Outline• Legacy Software
• BLAS• LINPACK• LAPACK• ScaLAPACK
• New Software• PLASMA using OpenMP• DPLASMA and PaRSEC • MAGMA for CUDA, OpenCL, or Xeon Phi• Case study: 2-stage SVD
62
![Page 73: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/73.jpg)
Distributed dataflow execution
63
Dataflow execution, distributed memory
![Page 74: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/74.jpg)
64
Dataflow execution, distributed memory
local dependency
communication
![Page 75: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/75.jpg)
DPLASMA architecture
65
DPLASMA software stack
![Page 76: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/76.jpg)
Parameterized task graph (PTG)
• Symbolic DAG representation
• Problem size independent
• Completely distributed
• DAG is O(n³); PTG avoids explicitly creating it
• Runtime
  • Data-driven execution
  • Locality-aware scheduling
  • Communication overlap
66
![Page 77: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/77.jpg)
From serial code to PTG
67
[PTG representation (symbolic dataflow rules): for each task class — DGEQRT(k), DORMQR(k,n), DTSQRT(m,k), DTSMQR(m,n,k) — every argument's input and output dependences are listed symbolically; e.g., DGEQRT(k) reads A(k,k) or the output of DTSMQR(k,k,k−1), and sends its triangular part to DORMQR(k, k+1..N) and DTSQRT(k+1,k)]
FOR k = 0 TO N-1
    DGEQRT(inout Akk)
    FOR n = k+1 TO N
        DORMQR(in Akk⬕, inout Akn)
    FOR m = k+1 TO N
        DTSQRT(inout Akk⬔, inout Amk)
        FOR n = k+1 TO N
            DTSMQR(in Amk, inout Akn, inout Amn)
Serial code PTG representation
![Page 78: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/78.jpg)
QR performance
• Fixed rows, increase cols, from tall-skinny (left) to square (right)
68
DPLASMA performance, distributed (multicore)
![Page 79: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/79.jpg)
Outline• Legacy Software
• BLAS• LINPACK• LAPACK• ScaLAPACK
• New Software• PLASMA using OpenMP• DPLASMA and PaRSEC• MAGMA for CUDA, OpenCL, or Xeon Phi • Case study: 2-stage SVD
69
![Page 80: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/80.jpg)
Challenges of GPUs
• High levels of parallelism
• Expensive branches (if-then-else)
• Computation vs. communication gap growing
  • NVIDIA Pascal P100 has 4670 Gflop/s, 732 GB/s memory, 16 GB/s PCIe (80 GB/s NVLink)
• Use hybrid approach
  • Small, non-parallelizable tasks on CPU
  • Large, parallel tasks on GPU
  • Overlap communication & computation
Chapter 1. Introduction
CUDA C Programming Guide Version 4.0 3
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1-2.
Figure 1-2. The GPU Devotes More Transistors to Data Processing
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.
1.2 CUDA™: a General-Purpose Parallel Computing Architecture In November 2006, NVIDIA introduced CUDA™, a general purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to
Figure from NVIDIA CUDA C Programming Guide
70
![Page 81: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/81.jpg)
Panel — Look-ahead — Trailing matrix (A = QA)
One-sided factorization• LU, Cholesky, QR factorizations for solving linear systems
Level 2 BLAS on CPU — Level 3 BLAS on GPU
DAG
71
critical path
![Page 87: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/87.jpg)
72
![Page 88: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/88.jpg)
Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB), 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
73
![Page 89: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/89.jpg)
Two-sided factorization
• Hessenberg, tridiagonal factorizations for eigenvalue problems
Level 2 BLAS on GPU — Level 2 BLAS on CPU — Level 3 BLAS on GPU
Panel — Trailing matrix (A = QᵀAQ)
yi = Avi
column ai
74
![Page 90: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/90.jpg)
MAGMA Hessenberg in double precision
[Plot: Gflop/s vs. matrix size]
Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB), 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
75
![Page 91: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/91.jpg)
Outline• Legacy Software
• BLAS• LINPACK• LAPACK• ScaLAPACK
• New Software• PLASMA using OpenMP• DPLASMA and PaRSEC• MAGMA for CUDA, OpenCL, or Xeon Phi• Case study: 2-stage SVD
76
![Page 92: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/92.jpg)
Computing SVD in 3 Phases
77
![Page 95: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/95.jpg)
1. Reduction to bidiagonal form: A = U1 B V1T
2. Bidiagonal SVD: B = U2 Σ V2T
3. Back transform singular vectors: U = U1 U2, V = V1 V2
![Page 96: Dense Linear Algebra - netlib.org · Compute vs. memory speed • Machine balance (# flops per memory access) • Flops “free,” memory expensive • Good for dense, BLAS-3 operations](https://reader031.vdocuments.mx/reader031/viewer/2022022108/5c00c23e09d3f252338b93c7/html5/thumbnails/96.jpg)
Computing SVD in 3 Phases
1. Reduction to bidiagonal form (2 stages)
   1a. Full to band
   1b. Band to bidiagonal
2. Bidiagonal SVD
3. Back transform singular vectors
78
A = Ua Aband VaTAband = Ub B VbT
U Ua Ub U2=
V Va= Vb V2
A
Aband
B Σ
B = U2 Σ V2T
One stage reduction

• For i = 1 to n by nb:
  • Eliminate cols i : i+nb and rows i : i+nb
  • Update trailing matrix using block Householder (BLAS 3 matrix multiplies)
• (4/3) n³ flops in BLAS 2 gemv
• (4/3) n³ flops in BLAS 3 gemm
• Performance limited to twice gemv speed
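The loop above can be sketched in NumPy in unblocked form (one reflector at a time; the blocked variant on the slide aggregates nb reflectors so the trailing update becomes a BLAS-3 multiply, whereas here every update is the rank-1, gemv-style form that makes the one-stage reduction memory bound). The `bidiagonalize` and `householder` names are illustrative:

```python
import numpy as np

def householder(x):
    """Unit Householder vector v with (I - 2 v v^T) x = -sign(x[0]) ||x|| e1."""
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    nv = np.linalg.norm(v)
    return v / nv if nv > 0 else v

def bidiagonalize(A):
    """Unblocked Golub-Kahan bidiagonalization: returns U1, B, V1 with A = U1 @ B @ V1.T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    U1 = np.eye(n)
    V1 = np.eye(n)
    for k in range(n):
        # Left reflector eliminates column k below the diagonal (rank-1, gemv-style update).
        v = householder(A[k:, k])
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
        U1[:, k:] -= 2.0 * np.outer(U1[:, k:] @ v, v)
        if k < n - 2:
            # Right reflector eliminates row k to the right of the superdiagonal.
            w = householder(A[k, k + 1:])
            A[k:, k + 1:] -= 2.0 * np.outer(A[k:, k + 1:] @ w, w)
            V1[:, k + 1:] -= 2.0 * np.outer(V1[:, k + 1:] @ w, w)
    return U1, A, V1

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
U1, B, V1 = bidiagonalize(A)
assert np.allclose(U1 @ B @ V1.T, A)                                     # factorization holds
assert np.allclose(np.tril(B, -1), 0) and np.allclose(np.triu(B, 2), 0)  # B is upper bidiagonal
assert np.allclose(np.linalg.svd(B, compute_uv=False),
                   np.linalg.svd(A, compute_uv=False))                   # singular values preserved
```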
Two stage reduction

• 1st stage: full to band (A → Aband)
  • QR factorize panel, update trailing matrix
  • LQ factorize panel, update trailing matrix
• (8/3) n³ flops in BLAS 3 gemm
• More efficient with large bandwidth nb
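A minimal NumPy sketch of the first stage (the `full_to_band` name and the per-panel calls to `np.linalg.qr` are illustrative; the production kernels apply the same reflectors as blocked BLAS-3 updates):

```python
import numpy as np

def full_to_band(A, nb):
    """Reduce A to an upper band matrix of bandwidth nb via alternating
    panel QR and LQ factorizations (orthogonal transforms on both sides)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        # QR of the column panel zeroes everything below the diagonal block;
        # apply Q^T to the trailing columns.
        Q, _ = np.linalg.qr(A[k:, k:k + nb], mode="complete")
        A[k:, k:] = Q.T @ A[k:, k:]
        if k + nb < n:
            # LQ of the row panel (QR of its transpose) zeroes everything
            # beyond the first nb trailing columns; apply from the right.
            Q2, _ = np.linalg.qr(A[k:k + nb, k + nb:].T, mode="complete")
            A[k:, k + nb:] = A[k:, k + nb:] @ Q2
    return A

rng = np.random.default_rng(2)
n, nb = 12, 3
A = rng.standard_normal((n, n))
Aband = full_to_band(A, nb)
assert np.allclose(np.tril(Aband, -1), 0)      # nothing below the diagonal
assert np.allclose(np.triu(Aband, nb + 1), 0)  # bandwidth nb above it
assert np.allclose(np.linalg.svd(Aband, compute_uv=False),
                   np.linalg.svd(A, compute_uv=False))  # singular values preserved
```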
Two stage reduction

• 2nd stage: band to bidiagonal (Aband → B)
  • Cache friendly kernels
  • Pipeline multiple sweeps in parallel
• O(nb n²) flops in BLAS 2
• More efficient with small bandwidth nb
• Tune nb to balance stage 1 (favors large nb) against stage 2 (favors small nb)
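Why accept the extra flops of two stages? A back-of-envelope model (the gemv/gemm rates and the stage-2 constant are illustrative assumptions, not measurements): the one-stage reduction runs half its flops at gemv speed, while the two-stage reduction runs almost everything at gemm speed:

```python
# Illustrative rates: memory-bound gemv ~50 Gflop/s, compute-bound gemm ~5 Tflop/s.
GEMV = 50e9
GEMM = 5000e9

def one_stage_time(n):
    # (4/3) n^3 flops in gemv + (4/3) n^3 flops in gemm.
    return (4 / 3) * n**3 / GEMV + (4 / 3) * n**3 / GEMM

def two_stage_time(n, nb):
    # Stage 1: (8/3) n^3 flops in gemm.
    # Stage 2: O(nb n^2) memory-bound flops; constant 6 assumed for illustration.
    return (8 / 3) * n**3 / GEMM + 6 * nb * n**2 / GEMV

n, nb = 20000, 64
assert two_stage_time(n, nb) < one_stage_time(n)  # two-stage wins despite more flops
```

With these assumed rates the one-stage model is dominated by its gemv half, matching the slide's bound of twice gemv speed; the two-stage model is dominated by gemm.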
Accelerated phases

[Figure: time per phase, NVIDIA Pascal P100 vs. Intel Haswell 2×10-core 2.3 GHz; annotated speedups of 8× and 16×. Lower is better.]
Overall performance

[Figure 10: SVD performance, NVIDIA Pascal P100 vs. Intel Haswell 2×10-core 2.3 GHz; higher is better. (a) Singular values only (no vectors), using (8/3) n³ for the operation count in all cases. (b) With singular vectors, using 9n³ for the operation count in all cases.]

[Figure 11: Profile of SVD implementations showing each phase, for n = 20000. With QR iteration, U1 and V1 are first generated explicitly, then multiplied by U2 and V2 during QR iteration; D&C explicitly generates U2 and V2, then multiplies them by the implicit U1 and V1 during the back transformation.]
Questions?