class18_linalg-ii-handout.pdf
TRANSCRIPT
High Performance Linear Algebra II
M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo
State University of New York
High Performance Computing I, 2012
Introduction
Dense & Parallel
In this topic we will still discuss dense (as opposed to sparse) matrices, but now focus on parallel execution.
Best performance (and scalability) most often comes down to making the best use of (local) Level 3 BLAS
Optimized BLAS are generally crucial for achieving performance gains
Introduction
BLAS Recapitulated
BLAS are critical:
Foundation for all (computational) linear algebra
Parallel versions still call serial versions on each processor
Performance goes up with increasing ratio of floating-point operations to memory references
Well-tuned BLAS takes maximum advantage of memory hierarchy
BLAS Level   Calculation   Memory Refs.   Flop Count   Ratio (Flop/Mem)
1            DDOT          3N             2N           2/3
2            DGEMV         N^2            2N^2         2
3            DGEMM         4N^2           2N^3         N/2
Introduction
Simple Memory Model
Let us assume just two levels of memory - slow and fast. Let

m    = number of memory elements (words) moved between fast and slow memory,
t_m  = time per slow memory operation,
f_op = number of floating-point operations,
t_f  = time per floating-point operation,

and q = f_op/m. The minimum time is just f_op t_f, but more realistically:

time = f_op t_f + m t_m = f_op t_f [1 + t_m/(q t_f)].

Key observation - we need optimal reuse of data, i.e., larger q (certainly t_m >> t_f).
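As a quick worked example (using only the q values from the BLAS table above): for DDOT, q = 2/3, so

time = f_op t_f [1 + (3/2)(t_m/t_f)],

and with t_m >> t_f the memory term dominates no matter how fast the floating-point units are. For DGEMM, q = N/2, so

time = f_op t_f [1 + (2/N)(t_m/t_f)],

and the memory penalty vanishes as N grows - this is why well-blocked Level 3 BLAS can run near peak.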
Matrix Multiplication Sequential
Simple Matrix Multiplication
Everyone has probably written this out at least once ...

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    c[i][j] = 0.0;
    for (k = 0; k < N; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
  }
}
Matrix Multiplication Partitioning
Partitioning
In this case, partitioning is also known as block matrix multiplication:

Divide into s^2 submatrices, with N/s x N/s elements in each submatrix
Let m = N/s, and

for (p = 0; p < s; p++) {
  for (q = 0; q < s; q++) {
    C_pq = 0;                      /* m x m zero submatrix */
    for (r = 0; r < s; r++) {
      C_pq = C_pq + A_pr * B_rq;   /* m x m submatrix multiply-add */
    }
  }
}
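A minimal runnable C sketch of the blocked scheme above; the function name matmult_blocked, the row-major flat arrays, and the assumption that the block size NB divides N are ours, not from the handout:

#include <stdio.h>
#include <stdlib.h>

#define N  512
#define NB 64            /* block size m = N/s; assumed to divide N */

/* C += A*B one NB x NB block at a time: block (p,q) of C accumulates
   products of block row p of A with block column q of B, so each block
   triple fits in cache and is reused many times (larger q ratio). */
void matmult_blocked(const double *A, const double *B, double *C)
{
    for (int p = 0; p < N; p += NB)
        for (int q = 0; q < N; q += NB)
            for (int r = 0; r < N; r += NB)
                for (int i = p; i < p + NB; i++)
                    for (int j = q; j < q + NB; j++) {
                        double sum = C[i*N + j];
                        for (int k = r; k < r + NB; k++)
                            sum += A[i*N + k] * B[k*N + j];
                        C[i*N + j] = sum;
                    }
}

int main(void)
{
    double *A = malloc(N*N*sizeof *A), *B = malloc(N*N*sizeof *B);
    double *C = calloc(N*N, sizeof *C);
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmult_blocked(A, B, C);
    printf("C[0] = %g (expect %g)\n", C[0], 2.0*N);
    free(A); free(B); free(C);
    return 0;
}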
Matrix Multiplication Divide & Conquer
Recursive/Divide & Conquer
Let N be a power of 2 (not strictly necessary) and divide A and B into 4 submatrices, delimited by:

A = [ A_pp  A_pq ]
    [ A_qp  A_qq ]

Then the solution requires 8 pairs of submatrix multiplications, which can be done recursively ...
Matrix Multiplication Divide & Conquer
matmultR(A, B, s) {
  if (s == 1) {
    C = A*B;   /* termination condition */
  } else {
    s = s/2;
    P0 = matmultR(A_pp, B_pp, s);  P1 = matmultR(A_pq, B_qp, s);
    P2 = matmultR(A_pp, B_pq, s);  P3 = matmultR(A_pq, B_qq, s);
    P4 = matmultR(A_qp, B_pp, s);  P5 = matmultR(A_qq, B_qp, s);
    P6 = matmultR(A_qp, B_pq, s);  P7 = matmultR(A_qq, B_qq, s);
    C_pp = P0 + P1;  C_pq = P2 + P3;
    C_qp = P4 + P5;  C_qq = P6 + P7;
  }
  return(C);
}
Matrix Multiplication Divide & Conquer
Well suited for SMP with cache hierarchy
Size of data continually reduced and localized
Can be highly performing when making maximum reuse of data within the cache memory hierarchy
More generally we want to address cases for which the matrix will not fit within the memory of a single machine; a message passing/distributed model is required
Matrix Multiplication Cannon's Algorithm

Cannon's Algorithm

Cannon's algorithm:

Uses a mesh of processors with a torus topology (periodic boundary conditions)
Elements are shifted to an aligned position: row i of A is shifted i places to the left, while column j of B is shifted j spots upward. This puts A_{i,j+i} and B_{i+j,j} in processor P_{i,j} (namely, we have the appropriate submatrices/elements to multiply in P_{i,j})
Usually submatrices are used, but we can use single elements to simplify the notation a bit
Matrix Multiplication Cannon's Algorithm

After alignment, Cannon's algorithm proceeds as follows:

Each process P_{i,j} multiplies its elements
Row i of A is shifted one place left, and column j of B is shifted one place up (brings together the adjacent elements of A and B needed for summing results)
Each process P_{i,j} multiplies its elements and adds to the accumulating sum
The preceding two steps are repeated until the results are obtained (N - 1 shifts), as in the sketch below
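A minimal MPI sketch of these shifts, assuming one element per process on an n x n periodic grid (the simplified notation above); the function name cannon and the test values in main are ours, not from the handout:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

/* Cannon's algorithm, one matrix element per process on an n x n torus */
void cannon(double a, double b, double *c, MPI_Comm grid, int n)
{
    int rank, coords[2], src, dst, left, right, up, down;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* alignment: shift row i of A left by i, column j of B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);

    /* multiply-accumulate, then shift A one place left and B one up */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    *c = 0.0;
    for (int step = 0; step < n; step++) {
        *c += a * b;
        MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, left, 0, right, 0, grid,
                             MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, up, 0, down, 0, grid,
                             MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nproc, grank;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    int n = (int)(sqrt((double)nproc) + 0.5);       /* assumes nproc = n*n */
    int dims[2] = { n, n }, periods[2] = { 1, 1 };  /* periodic = torus    */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &grank);
    double a = grank + 1.0, b = grank + 2.0, c;     /* arbitrary test data */
    cannon(a, b, &c, grid, n);
    printf("process %d: c = %g\n", grank, c);
    MPI_Finalize();
    return 0;
}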
Matrix Multiplication Cannon's Algorithm
Cannon Illustrated
[Figure: Cannon's algorithm showing the flow of data for process P_{i,j}: blocks of A flow left along row i, and blocks of B flow up along column j.]
Matrix Multiplication Cannon's Algorithm
For s submatrices (m = N/s), the communication time for Cannon's algorithm is given by

t_comm = 4(s - 1)(t_lat + m^2 t_dat),

where t_lat is the latency and t_dat is the time required to send one value (word). The computation time in Cannon's algorithm:

t_comp = 2 s m^3 = 2 m^2 N,

or O(m^2 N).
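A one-line consequence of the two estimates above (our own observation, following directly from the formulas):

t_comm / t_comp = 4(s - 1)(t_lat + m^2 t_dat) / (2 s m^3) \approx 2 t_dat / m   (fixed s, large m),

so larger submatrices amortize the communication cost.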
Cannon's algorithm is also known as the ScaLAPACK outer product algorithm (can you see another numerical library coming?) ...
Matrix Multiplication 2D Pipeline
2D Pipeline
Another useful algorithm is the so-called 2D pipeline, in which the data flows into a rectangular array of processors. Labeling the process grid using (0,0) as the top left corner, the data flows from the left and from the top, with process P_{i,j} doing:

1 RECV(A from P_{i,j-1})   { A data from left }
2 RECV(B from P_{i-1,j})   { B data from above }
3 Accumulate C_{i,j}
4 SEND(A to P_{i,j+1})     { A data to right }
5 SEND(B to P_{i+1,j})     { B data down }
2D Pipeline Illustrated
[Figure: 2D pipeline (or systolic array) algorithm showing the flow of data. Rows of A (A_{i,3} ... A_{i,0}) stream in from the left and columns of B (B_{3,j} ... B_{0,j}) stream in from the top of a 4 x 4 grid of processes computing C_{0,0} ... C_{3,3}; each successive row of A and column of B enters with a one-cycle delay.]
Strassen's Method
The techniques discussed thus far are fundamentally O(N^3), but there is a clever method due to Strassen^1 which is O(N^2.81):

A = [ A_11  A_12 ]    B = [ B_11  B_12 ]    C = [ C_11  C_12 ]
    [ A_21  A_22 ]        [ B_21  B_22 ]        [ C_21  C_22 ]

Q_1 = (A_11 + A_22)(B_11 + B_22),     C_11 = Q_1 + Q_4 - Q_5 + Q_7
Q_2 = (A_21 + A_22) B_11,             C_12 = Q_3 + Q_5
Q_3 = A_11 (B_12 - B_22),             C_21 = Q_2 + Q_4
Q_4 = A_22 (-B_11 + B_21),            C_22 = Q_1 + Q_3 - Q_2 + Q_6
Q_5 = (A_11 + A_12) B_22,
Q_6 = (-A_11 + A_21)(B_11 + B_12),
Q_7 = (A_12 - A_22)(B_21 + B_22)

Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds
^1 V. Strassen, "Gaussian Elimination Is Not Optimal," Numerische Mathematik 13, 353-356 (1969).
Matrix Multiplication Strassen's Method
If N is not a power of 2, one can pad with zeros
Reiterate until the submatrices are reduced to numbers, or to the optimal matrix multiply size for the particular processor (Strassen switches to standard matrix multiplication for small enough submatrices)
Eliminates 1 matrix multiply at each level of recursion: instead of O(N^{log_2 8}) = O(N^3), we get O(N^{log_2 7}) = O(N^2.81)
Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds; Strassen takes 7 N^{log_2 7} - 6 N^2 operations
Requires additional storage for the intermediate matrices, and can be less stable numerically ...
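A quick sanity check of the seven products in the 1 x 1 base case (scalars); a minimal sketch with arbitrary test values of our own choosing:

#include <stdio.h>

int main(void)
{
    /* arbitrary test values for the 2x2 (scalar-block) case */
    double A11 = 1, A12 = 2, A21 = 3, A22 = 4;
    double B11 = 5, B12 = 6, B21 = 7, B22 = 8;

    /* Strassen's seven products */
    double Q1 = (A11 + A22)*(B11 + B22);
    double Q2 = (A21 + A22)*B11;
    double Q3 = A11*(B12 - B22);
    double Q4 = A22*(-B11 + B21);
    double Q5 = (A11 + A12)*B22;
    double Q6 = (-A11 + A21)*(B11 + B12);
    double Q7 = (A12 - A22)*(B21 + B22);

    /* recombination, versus the conventional 8-multiply result */
    printf("C11 = %g (expect %g)\n", Q1 + Q4 - Q5 + Q7, A11*B11 + A12*B21);
    printf("C12 = %g (expect %g)\n", Q3 + Q5,           A11*B12 + A12*B22);
    printf("C21 = %g (expect %g)\n", Q2 + Q4,           A21*B11 + A22*B21);
    printf("C22 = %g (expect %g)\n", Q1 + Q3 - Q2 + Q6, A21*B12 + A22*B22);
    return 0;
}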
LU Decomposition in Parallel
Sequential LU/Gaussian Elimination Code
LU code fragment:

for (k = 0; k < N-1; k++) {
  for (i = k+1; i < N; i++) {
    l[i][k] = a[i][k]/a[k][k];             /* multipliers for column k */
    for (j = k+1; j < N; j++) {
      a[i][j] = a[i][j] - l[i][k]*a[k][j]; /* eliminate */
    }
  }
}
Parallel LU/Gaussian Elimination
One strategy for parallelizing the Gaussian elimination part of LU decomposition is to do so over the middle loop (j) in the algorithm. Decomposing the columns, and using a message passing implementation, we might have something like the following:
LU Decomposition in Parallel
do k=1,N-1
   if (column k is mine) then          ! columnwise decomposition
      do i=k+1,N
         L(i,k) = A(i,k)/A(k,k)
      end do
      BCAST( L(k+1:N,k), root = owner of k )
   else
      RECV( L(k+1:N,k) )
   end if
   do j=k+1,N (modulo I own column j)  ! columnwise decomposition
      do i=k+1,N
         A(i,j) = A(i,j) - L(i,k)*A(k,j)
      end do
   end do
end do
Parallel LU Efficiency
If we do a (not overly crude) analysis of the preceding parallel algorithm, for each column k we have:

Broadcast N - k values, let's say taking time c_b per value
Compute (N - k)^2 multiply-adds, with time c_fma per multiply-add

t_k \approx c_b (N - k) + c_fma (N - k)^2 / Np,

Summing over k we find:

t(Np) \approx (c_b/2) N^2 + (c_fma/3) N^3 / Np. (1)
Parallel LU Speedup
Now, looking at the parallel speedup factor, we have:

S(Np) = t(1)/t(Np)
      \approx (c_fma N^3/3) / (c_b N^2/2 + c_fma N^3/(3 Np))
      = [ (3/(2N))(c_b/c_fma) + 1/Np ]^{-1}.

See the ratio of communication cost to computation? It never goes away.
Minimizing communication costs is again the key to improving the efficiency ...
Parallel LU Efficiency
The efficiency is given by:

E(Np) = S(Np)/Np
      \approx [ (3Np/(2N))(c_b/c_fma) + 1 ]^{-1}
      \approx 1 - (3Np/(2N))(c_b/c_fma).

Note that this algorithm is scalable (in time) by our previous definition, for a given fixed ratio of Np/N
Note that it is not scalable in terms of memory, since the memory requirements grow steadily as O(N^2)
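To put illustrative (hypothetical) numbers into this estimate: with N = 10^4, Np = 100, and a cost ratio c_b/c_fma = 10,

E \approx 1 - (3 \cdot 100)/(2 \cdot 10^4) \cdot 10 = 1 - 0.15 = 0.85,

and doubling Np at fixed N doubles the lost fraction - hence the need to grow N along with Np.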
Improving Parallel LU
The most obvious way to improve the efficiency of our parallel Gaussian elimination is to overlap computation with the necessary communication. Consider the revised algorithm ...
! 1st processor computes L(2:N,1) and BCAST( L(2:N,1) )
do k=1,N-1
   if (column k is not mine) then      ! post RECV for multipliers
      RECV( L(k+1:N,k) from BCAST(next_L) )
   end if
   if (column k+1 is mine) then        ! compute next set of multipliers
      do i=k+1,N                       ! and eliminate for column k+1
         A(i,k+1) = A(i,k+1) - L(i,k)*A(k,k+1)
         next_L(i,k+1) = A(i,k+1)/A(k+1,k+1)
      end do
      BCAST( next_L(k+2:N,k+1), root = owner of k+1 )
   end if
   do j=k+2,N (modulo I own column j)  ! perform eliminations
      do i=k+1,N
         A(i,j) = A(i,j) - L(i,k)*A(k,j)
      end do
   end do
end do
ScaLAPACK
Scalable LAPACK library (LAPACK is the general purpose Linear Algebra PACKage); contains LAPACK-style driver routines formulated for distributed memory parallel processing:

An exceptional example of an MPI library:
1 Portable - based on optimized low-level routines
2 User friendly - parallel versions of the common LAPACK routines have similar syntax, names just prefixed with a "P"
3 User of the library need never write parallel processing code - it's all under the covers
4 Efficient

Source code and documentation available from NETLIB: http://www.netlib.org/scalapack
Under the ScaLAPACK Hood
Uses a block-cyclic decomposition of matrices onto a P_row x P_col 2D process grid. If we now consider an NB x NB sub-block (starting at index ib, ending at index end):

do ib=1,N,NB
   end = MIN( N, ib+NB-1 )
   do i=ib,end
      (a) Find pivot row k, BCAST column
      (b) Swap rows k and i in block column, BCAST row k
      (c) A(i+1:N,i) = A(i+1:N,i)/A(i,i)
      (d) A(i+1:N,i+1:end) = A(i+1:N,i+1:end) - A(i+1:N,i)*A(i,i+1:end)
   end do
   (e) BCAST all swap information to right and left
   (f) Apply all row swaps to other columns
   (g) BCAST row k to right
   (h) A(ib:end,end+1:N) = LL\A(ib:end,end+1:N)   (triangular solve, LL = lower triangle of the diagonal block)
   (i) BCAST A(ib:end,end+1:N) down
   (j) BCAST A(end+1:N,ib:end) right
   (k) Eliminate A(end+1:N,end+1:N)
end do

this is our old friend LU decomposition, available as the ScaLAPACK PGETRF routine.
ScaLAPACK Schematic
[Figure: general schematic of the ScaLAPACK library. ScaLAPACK sits on top of the PBLAS (global layer); the PBLAS are built from the serial LAPACK/BLAS (local layer) plus the BLACS communication layer, which in turn runs over MPI or PVM.]
ScaLAPACK Efficiency
Variable        Description
C_f N^3         Total number of FP operations
C_v N^2/\sqrt{Np}   Total number of data items communicated
C_m N/NB        Total number of messages
t_f             Time per FP operation
t_v             Time per data item communicated
t_m             Time per message

With these quantities, the time for the ScaLAPACK drivers is given by:

t(N, Np) = (C_f N^3 / Np) t_f + (C_v N^2 / \sqrt{Np}) t_v + (C_m N / NB) t_m,
and the scaling:

E(N, Np) \approx [ 1 + (1/NB)(C_m t_m / C_f t_f)(Np/N^2) + (C_v t_v / C_f t_f)(\sqrt{Np}/N) ]^{-1}.
Note that:
Values of C_f, C_v, and C_m depend on the driver routine (and underlying algorithm)
Scalable for constant N^2/Np ...
Small problems: dominated by t_m/t_f, the ratio of latency to time per FP operation (see why latency matters?)
Medium problems: significantly impacted by the ratio of network bandwidth to FP operation rate, t_f/t_v
Large problems: node Flop/s (1/t_f) dominates
Block-Cyclic Decomposition
ScaLAPACK uses a block-cyclic decomposition:

On the left, a 4-processor column-wise decomposition; to the right, the block-cyclic version using P_r = P_c = 2, where each square is an NB x NB block (2 x 2 in this example). The advantage of block-cyclic is that it allows Level 2 and 3 BLAS operations on the subvectors and submatrices within each processor, as in the index sketch below.
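As a sketch of the bookkeeping involved (our own illustration; ScaLAPACK's INDXG2P/INDXG2L utilities implement the same rule per grid dimension, here assuming the first block starts on process 0):

#include <stdio.h>

/* Block-cyclic ownership in one dimension: for global (0-based) index g,
   block size NB, and P processes, blocks are dealt out cyclically, so
   block g/NB lands on process (g/NB) mod P and becomes that process's
   local block number (g/NB)/P. */
void owner(int g, int NB, int P, int *proc, int *loc)
{
    int blk = g/NB;
    *proc = blk % P;
    *loc  = (blk/P)*NB + g % NB;
}

int main(void)
{
    /* the NB = 2, 2-processes-per-dimension layout from the example */
    for (int g = 0; g < 8; g++) {
        int p, l;
        owner(g, 2, 2, &p, &l);
        printf("global %d -> process %d, local %d\n", g, p, l);
    }
    return 0;
}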
Sample ScaLAPACK Program
"Simplest" program to solve A x = b using PDGESV (LU), available atwww.netlib.org/scalapack/examples/example1.f
      PROGRAM EXAMPLE1
*
*     Example Program solving Ax=b via ScaLAPACK routine PDGESV
*
*     .. Parameters ..
      INTEGER            DLEN_, IA, JA, IB, JB, M, N, MB, NB, RSRC,
     $                   CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT,
     $                   MXLOCR, MXLOCC, MXRHSC
      PARAMETER          ( DLEN_ = 9, IA = 1, JA = 1, IB = 1, JB = 1,
     $                   M = 9, N = 9, MB = 2, NB = 2, RSRC = 0,
     $                   CSRC = 0, MXLLDA = 5, MXLLDB = 5, NRHS = 1,
     $                   NBRHS = 1, NOUT = 6, MXLOCR = 5, MXLOCC = 4,
     $                   MXRHSC = 1 )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
*     ..
*     .. Local Scalars ..
      INTEGER            ICTXT, INFO, MYCOL, MYROW, NPCOL, NPROW
      DOUBLE PRECISION   ANORM, BNORM, EPS, RESID, XNORM
*     ..
*     .. Local Arrays ..
      INTEGER            DESCA( DLEN_ ), DESCB( DLEN_ ),
     $                   IPIV( MXLOCR+NB )
      DOUBLE PRECISION   A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ),
     $                   B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ),
     $                   WORK( MXLOCR )
*     ..
*     .. External Functions ..
      DOUBLE PRECISION   PDLAMCH, PDLANGE
      EXTERNAL           PDLAMCH, PDLANGE
*     ..
*     .. External Subroutines ..
      EXTERNAL           BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO,
     $                   DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY,
     $                   SL_INIT
*     ..
*     .. Intrinsic Functions ..
      INTRINSIC          DBLE
*     ..
*     .. Data statements ..
      DATA               NPROW / 2 / , NPCOL / 3 /
*     ..
*     .. Executable Statements ..
*
*     INITIALIZE THE PROCESS GRID
*
      CALL SL_INIT( ICTXT, NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*
*     If I'm not in the process grid, go to the end of the program
*
      IF( MYROW.EQ.-1 )
     $   GO TO 10
*
*     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
*     Initialize the array descriptors for the matrices A and B
*
      CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
     $               INFO )
      CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
     $               MXLLDB, INFO )
*
*     Generate matrices A and B and distribute to the process grid
*
      CALL MATINIT( A, DESCA, B, DESCB )
*
*     Make a copy of A and B for checking purposes
*
      CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
      CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
*
*     CALL THE SCALAPACK ROUTINE
*     Solve the linear system A * X = B
*
      CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
     $             INFO )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         WRITE( NOUT, FMT = 9999 )
         WRITE( NOUT, FMT = 9998 ) M, N, NB
         WRITE( NOUT, FMT = 9997 ) NPROW*NPCOL, NPROW, NPCOL
         WRITE( NOUT, FMT = 9996 ) INFO
      END IF
*
*     Compute residual ||A * X - B|| / ( ||X|| * ||A|| * eps * N )
*
      EPS = PDLAMCH( ICTXT, 'Epsilon' )
      ANORM = PDLANGE( 'I', N, N, A, 1, 1, DESCA, WORK )
      BNORM = PDLANGE( 'I', N, NRHS, B, 1, 1, DESCB, WORK )
      CALL PDGEMM( 'N', 'N', N, NRHS, N, ONE, A0, 1, 1, DESCA, B, 1, 1,
     $             DESCB, -ONE, B0, 1, 1, DESCB )
      XNORM = PDLANGE( 'I', N, NRHS, B0, 1, 1, DESCB, WORK )
      RESID = XNORM / ( ANORM*BNORM*EPS*DBLE( N ) )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         IF( RESID.LT.10.0D+0 ) THEN
            WRITE( NOUT, FMT = 9995 )
            WRITE( NOUT, FMT = 9993 ) RESID
         ELSE
            WRITE( NOUT, FMT = 9994 )
            WRITE( NOUT, FMT = 9993 ) RESID
         END IF
      END IF
*
*     RELEASE THE PROCESS GRID
*     Free the BLACS context
*
      CALL BLACS_GRIDEXIT( ICTXT )
   10 CONTINUE
*
*     Exit the BLACS
*
      CALL BLACS_EXIT( 0 )
*
 9999 FORMAT( / 'ScaLAPACK Example Program #1 - May 1, 1997' )
 9998 FORMAT( / 'Solving Ax=b where A is a ', I3, ' by ', I3,
     $        ' matrix with a block size of ', I3 )
 9997 FORMAT( 'Running on ', I3, ' processes, where the process grid',
     $        ' is ', I3, ' by ', I3 )
 9996 FORMAT( / 'INFO code returned by PDGESV = ', I3 )
 9995 FORMAT( /
     $ 'According to the normalized residual the solution is correct.'
     $ )
 9994 FORMAT( /
     $ 'According to the normalized residual the solution is incorrect.'
     $ )
 9993 FORMAT( / '||A*x - b|| / ( ||x||*||A||*eps*N ) = ', 1P, E16.8 )
      STOP
      END
Compiling and running on U2 (use the Intel MKL link advisor to get the linking right):
[k07n14:~/d_scalapack]$ module load intel
[k07n14:~/d_scalapack]$ module load intel-mpi
[k07n14:~/d_scalapack]$ module load mkl/10.3
[k07n14:~/d_scalapack]$ mpiifort -o example1 example1.f -L$MKL/lib/intel64 \
    -lmkl_scalapack_lp64 -lmkl_solver_lp64_sequential -Wl,--start-group \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 \
    -Wl,--end-group -lpthread
[k07n14:~/d_scalapack]$ ldd example1
    linux-vdso.so.1 => (0x00007fff495ff000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000000370b400000)
    libmkl_scalapack_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00002b723da34000)
    libmkl_intel_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b723e120000)
    libmkl_sequential.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_sequential.so (0x00002b723e7d2000)
    libmkl_core.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so (0x00002b723edc4000)
    libmkl_blacs_intelmpi_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00002b723fc68000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cc9200000)
    libmpi.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpi.so.4 (0x00002b723fdc4000)
    libmpigf.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpigf.so.4 (0x00002b724028e000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003cc9a00000)
    libm.so.6 => /lib64/libm.so.6 (0x000000390ac00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003cc8600000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ccbe00000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003cc8200000)
[u2:~/d_scalapack]$ qsub -q debug -l nodes=1:ppn=8,walltime=01:00:00 -I
qsub: waiting for job 1112961.d15n41.ccr.buffalo.edu to start
qsub: job 1112961.d15n41.ccr.buffalo.edu ready

Job 1112961.d15n41.ccr.buffalo.edu has requested 8 cores/processors per node.
PBSTMPDIR is /scratch/1112961.d15n41.ccr.buffalo.edu
[d06n40b:~]$ cd $PBS_O_WORKDIR
[d06n40b:~/d_scalapack]$ NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
[d06n40b:~/d_scalapack]$ mpdboot -n $NNODES -f $PBS_NODEFILE -v
running mpdallexit on d06n40b
LAUNCHED mpd on d06n40b via
RUNNING: mpd on d06n40b
[d06n40b:~/d_scalapack]$ module load intel-mpi
[d06n40b:~/d_scalapack]$ module load intel
[d06n40b:~/d_scalapack]$ module load mkl/10.3
[d06n40b:~/d_scalapack]$ mpiexec -np 6 ./example1

ScaLAPACK Example Program #1 - May 1, 1997

Solving Ax=b where A is a 9 by 9 matrix with a block size of 2
Running on 6 processes, where the process grid is 2 by 3

INFO code returned by PDGESV = 0

According to the normalized residual the solution is correct.

||A*x - b|| / ( ||x||*||A||*eps*N ) = 0.00000000E+00
ScaLAPACK - HPL Benchmark
ScaLAPACK Driver Illustrated - HPL
HPL (High Performance Linpack) benchmark, still used to rank the Top500 list (www.top500.org)
Super-sized version of the previous example code; uses PDGESV to solve a randomly seeded linear system, and checks the results to ensure that they are accurate
The following plot shows results for U2's Myrinet nodes (1536 processors total) and Qlogic QDR Infiniband (4416 cores total) - note the scalability
Linpack (HPL) Benchmark
[Figure: Linpack (HPL) benchmark on the CCR U2 clusters. HPL measured performance (GFlop/s, log scale, 10^1 to 10^6) versus number of processors (1 to 8192) for the Myrinet MX and Qlogic QDR IB networks; problem sizes range from N=14,800 through N=29,600, N=59,800, N=70,656, N=112,320, N=228,000, and N=400,000 up to N=1,078,272.]
Iterative Methods Jacobi Iteration
Jacobi Iteration
Our method for improving on solutions to systems of linear equations can be formalized into a solution technique in its own right (Jacobi did it first):

x_i^{(k)} = (1/A_{i,i}) [ b_i - \sum_{j \ne i} A_{i,j} x_j^{(k-1)} ],
Particularly useful in the solution of (O|P)DEs using finite differences
Advantage of small memory requirements (especially for sparse systems)
Disadvantage of convergence problems (converges slowly, or not at all)
Initial guess - often take x^{(0)} = b
Measuring convergence for Jacobi iteration can be tricky, particularly in parallel. At first glance, you might be tempted by:

|x_i^{(k+1)} - x_i^{(k)}| < \epsilon,  for all i,

but that says little about solution accuracy.

| \sum_j A_{i,j} x_j^{(k)} - b_i | < \epsilon

is a better form, and can reuse values already computed in the previous iteration.
Iterative Methods Jacobi Iteration
Sequential
do i=1,N
   x(i) = b(i)
end do
do iter=1,MAXITER
   do i=1,N
      sum = 0.0
      do j=1,N
         if (i /= j) sum = sum + A(i,j)*x(j)
      end do
      new_x(i) = (b(i) - sum)/A(i,i)
   end do
   do i=1,N
      x(i) = new_x(i)
   end do
end do
Iterative Methods Jacobi Iteration
Block Parallel
Have each process accountable for solving a particular block of unknowns, and simply use an Allgather (or Allgatherv, for an unequal distribution of work on the processes). The resulting computation time is

t_comp = (N/Np)(2N + 4) N_iter,

with a time for communication

t_comm = (Np t_lat + N t_dat) N_iter,
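A minimal MPI sketch of this block scheme (our own illustration, assuming Np divides N so a plain Allgather suffices; each rank holds only its nloc rows of A and entries of b, and the caller initializes the full x, e.g. to b):

#include <mpi.h>
#include <stdlib.h>

/* Block-parallel Jacobi: each rank updates its nloc unknowns from the
   full current x, then an Allgather replicates the new x everywhere. */
void jacobi(const double *Aloc, const double *bloc, double *x,
            int n, int maxiter, MPI_Comm comm)
{
    int np, me;
    MPI_Comm_size(comm, &np);
    MPI_Comm_rank(comm, &me);
    int nloc = n/np, lo = me*nloc;        /* my block of unknowns */
    double *xnew = malloc(nloc*sizeof *xnew);

    for (int iter = 0; iter < maxiter; iter++) {
        for (int i = 0; i < nloc; i++) {  /* global row lo+i */
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                if (j != lo + i) sum += Aloc[i*n + j]*x[j];
            xnew[i] = (bloc[i] - sum)/Aloc[i*n + lo + i];
        }
        /* everyone contributes its block and receives the full vector */
        MPI_Allgather(xnew, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);
    }
    free(xnew);
}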
and the speedup factor:

S(Np) = N(2N + 4) / [ N(2N + 4)/Np + Np t_lat + N t_dat ]
      = Np / (1 + t_comm/t_comp).

Note that this is scalable, but only if the ratio N/Np is maintained with increasing Np.
Iterative Methods Gauss-Seidel
Gauss-Seidel Relaxation
Technique for convergence acceleration; usually converges faster than Jacobi:

x_i^{(k)} = (1/A_{i,i}) [ b_i - \sum_{j=1}^{i-1} A_{i,j} x_j^{(k)} - \sum_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ],

making use of the just-updated solution values. More difficult to parallelize, but can be done through particular patterning (say using a checkerboard allocation) in which regions can be computed simultaneously, as in the sketch below.
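For instance, a minimal serial sketch of the checkerboard (red-black) ordering for a 2D finite-difference Laplace problem (our own illustration; the grid size M, boundary values, and 5-point stencil are assumptions, not from the handout). All points of one color depend only on the other color, so each half-sweep could be computed in parallel:

#include <stdio.h>

#define M 64   /* grid dimension (assumed) */

/* one half-sweep: update only points whose checkerboard color matches
   (i+j) % 2 == color; they depend only on the opposite-color neighbors */
void half_sweep(double u[M][M], int color)
{
    for (int i = 1; i < M-1; i++)
        for (int j = 1; j < M-1; j++)
            if ((i + j) % 2 == color)
                u[i][j] = 0.25*(u[i-1][j] + u[i+1][j]
                              + u[i][j-1] + u[i][j+1]);
}

int main(void)
{
    static double u[M][M];          /* zero interior to start */
    for (int i = 0; i < M; i++)     /* fixed boundary values  */
        u[0][i] = u[M-1][i] = u[i][0] = u[i][M-1] = 1.0;
    for (int iter = 0; iter < 1000; iter++) {
        half_sweep(u, 0);           /* "red" points   */
        half_sweep(u, 1);           /* "black" points */
    }
    printf("u center = %f\n", u[M/2][M/2]);  /* tends to 1.0 as it converges */
    return 0;
}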
Iterative Methods Successive Overrelaxation
Overrelaxation
Another improved convergence technique, in which a factor (1 - \omega) x_i^{(k-1)} is added:

x_i^{(k)} = (\omega/A_{i,i}) [ b_i - \sum_{j \ne i} A_{i,j} x_j^{(k-1)} ] + (1 - \omega) x_i^{(k-1)},

where 0 < \omega < 1. For the Gauss-Seidel method, this is slightly modified:

x_i^{(k)} = (\omega/A_{i,i}) [ b_i - \sum_{j=1}^{i-1} A_{i,j} x_j^{(k)} - \sum_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ] + (1 - \omega) x_i^{(k-1)},

with 0 < \omega < 2.
Iterative Methods Multigrid
Multigrid Method
Makes use of successively finer grids to improve the solution:

A coarse starting grid quickly takes distant effects into account
Initial values on the finer grids are available through interpolation from the existing grid values
Of course, we can even have regions of variable grid density using so-called adaptive grid methods