
    High Performance Linear Algebra II

    M. D. Jones, Ph.D.

    Center for Computational Research
    University at Buffalo, State University of New York

    High Performance Computing I, 2012


    Introduction

    Dense & Parallel

    In this topic we will still discuss dense (as opposed to sparse) matrices, but now focus on parallel execution.

    Best performance (and scalability) most often comes down to making the best use of (local) Level 3 BLAS

    Optimized BLAS are generally crucial for achieving performance gains
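    For reference, a minimal Level 3 BLAS call through the C interface (a sketch only, assuming a CBLAS-style header such as the one shipped with MKL, OpenBLAS, or ATLAS; the wrapper name gemm_example is just for illustration and is not from the handout):

        #include <cblas.h>

        /* C = A*B for N x N row-major matrices, done by the (optimized) Level 3 BLAS. */
        void gemm_example(int N, const double *A, const double *B, double *C)
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N,
                        1.0, A, N,     /* alpha, A, lda */
                        B, N,          /* B, ldb        */
                        0.0, C, N);    /* beta, C, ldc  */
        }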


    Introduction

    BLAS Recapitulated

    BLAS are critical:

    Foundation for all (computational) linear algebra

    Parallel versions still call serial versions on each processor

    Performance goes up with increasing ratio of floating point operations to memory references

    Well-tuned BLAS takes maximum advantage of the memory hierarchy

        BLAS Level   Calculation   Memory Refs.   Flop Count   Ratio (Flop/Mem)
        1            DDOT          3N             2N           2/3
        2            DGEMV         N^2            2N^2         2
        3            DGEMM         4N^2           2N^3         N/2


    Introduction

    Simple Memory Model

    Let us assume just two levels of memory - slow and fast - and define:

        m    = # memory elements (words) moved between fast and slow memory,
        t_m  = time per slow memory operation,
        f_op = number of floating-point operations,
        t_f  = time per floating-point operation,

    and q = f_op/m. The minimum time is just f_op t_f, but more realistically:

        time = f_op t_f + m t_m = f_op t_f [ 1 + t_m/(q t_f) ] .

    Key observation - we need optimal reuse of data, i.e. a larger q (certainly t_m >> t_f).
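    As a quick illustration (the numbers here are assumed for the example, not taken from the handout), suppose a slow-memory access costs 100 times a flop, t_m = 100 t_f. Using the ratios from the BLAS table above:

        DDOT  (q = 2/3):            time \approx f_op t_f (1 + 150)   - completely memory bound
        DGEMM (q = N/2, N = 1000):  time \approx f_op t_f (1 + 0.2)   - close to floating-point peak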


    Matrix Multiplication Sequential

    Simple Matrix Multiplication

    Everyone has probably written this out at least once ...

        for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
              c[i][j] += a[i][k]*b[k][j];
          }


    Matrix Multiplication Partitioning

    Partitioning

    In this case, partitioning is also known as block matrix multiplication:

    Divide into s^2 submatrices, (N/s) x (N/s) elements in each submatrix

    Let m = N/s, and

        for (p = 0; p < s; p++)
          for (q = 0; q < s; q++) {
            C_pq = 0;                        /* zero the m x m submatrix C_pq */
            for (r = 0; r < s; r++)
              C_pq = C_pq + A_pr * B_rq;     /* m x m submatrix multiply-add  */
          }


    Matrix Multiplication Divide & Conquer

    Recursive/Divide & Conquer

    Let N be a power of 2 (not strictly necessary) and divide A and B into 4 submatrices, delimited by:

        A = \begin{pmatrix} A_{pp} & A_{pq} \\ A_{qp} & A_{qq} \end{pmatrix} .

    Then the solution requires 8 submatrix multiplications, which can be done recursively ...


    Matrix Multiplication Divide & Conquer

        matmultR(A, B, s) {
          if (s == 1) {
            C = A*B;                       /* termination condition */
          } else {
            s = s/2;
            P0 = matmultR(A_{pp}, B_{pp}, s);
            P1 = matmultR(A_{pq}, B_{qp}, s);
            P2 = matmultR(A_{pp}, B_{pq}, s);
            P3 = matmultR(A_{pq}, B_{qq}, s);
            P4 = matmultR(A_{qp}, B_{pp}, s);
            P5 = matmultR(A_{qq}, B_{qp}, s);
            P6 = matmultR(A_{qp}, B_{pq}, s);
            P7 = matmultR(A_{qq}, B_{qq}, s);
            C_{pp} = P0 + P1;
            C_{pq} = P2 + P3;
            C_{qp} = P4 + P5;
            C_{qq} = P6 + P7;
          }
          return(C);
        }


    Matrix Multiplication Divide & Conquer

    Well suited for SMP with cache hierarchy

    Size of data continually reduced and localized

    Can be highly performing when making maximum reuse of data within the cache memory hierarchy

    More generally we want to address cases for which the matrix will not fit within the memory of a single machine; a message passing/distributed model is required


    Matrix Multiplication Cannon's Algorithm

    Cannon's Algorithm

    Cannon's algorithm:

    Uses a mesh of processors with a torus topology (periodic boundary conditions)

    Elements are shifted to an aligned position: row i of A is shifted i places to the left, while column j of B is shifted j places upward. This puts A_{i,j+i} and B_{i+j,j} in processor P_{i,j} (namely, we have the appropriate submatrices/elements to multiply in P_{i,j})

    Usually submatrices are used, but we can use single elements to simplify the notation a bit


    Matrix Multiplication Cannon's Algorithm

    After alignment, Cannon's algorithm proceeds as follows:

    Each process P_{i,j} multiplies its elements

    Row i of A is shifted one place left, and column j of B is shifted one place up (this brings together the adjacent elements of A and B needed for summing results)

    Each process P_{i,j} multiplies its elements and adds to its accumulating sum

    The preceding two steps are repeated until the results are obtained (N - 1 shifts); see the sketch below
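    A minimal sketch of this element-per-process version in C with MPI (my own illustration, not code from the handout), assuming an n x n periodic process grid created with MPI_Cart_create; the name cannon_element and the argument layout are purely illustrative:

        #include <mpi.h>

        /* One element of A and B per process on an n x n torus; 'grid' is assumed
           to come from MPI_Cart_create with both dimensions periodic. */
        double cannon_element(double a, double b, int n, MPI_Comm grid)
        {
            int rank, coords[2], src, dst;
            double c = 0.0;

            MPI_Comm_rank(grid, &rank);
            MPI_Cart_coords(grid, rank, 2, coords);   /* coords[0] = row i, coords[1] = col j */

            /* Alignment: shift row i of A left by i places ... */
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            /* ... and column j of B up by j places. */
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

            /* Multiply-accumulate, then shift A one place left and B one place up. */
            for (int step = 0; step < n; step++) {
                c += a * b;
                MPI_Cart_shift(grid, 1, -1, &src, &dst);
                MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
                MPI_Cart_shift(grid, 0, -1, &src, &dst);
                MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            }
            return c;   /* the final shift simply restores the aligned layout */
        }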


    Matrix Multiplication Cannon's Algorithm

    Cannon Illustrated

    [Figure: Cannon's algorithm showing the flow of data for process P_{ij} - elements of A shift left along row i while elements of B shift up along column j.]


    Matrix Multiplication Cannon's Algorithm

    For s^2 submatrices (m = N/s) the communication time for Cannon's algorithm is given by

        t_comm = 4(s - 1)(t_lat + m^2 t_dat),

    where t_lat is the latency and t_dat is the time required to send one value (word). The computation time in Cannon's algorithm is

        t_comp = 2 s m^3 = 2 m^2 N,

    or O(m^2 N).

    Cannon's algorithm is also known as the ScaLAPACK outer product algorithm (can you see another numerical library coming?) ...


    Matrix Multiplication 2D Pipeline

    2D Pipeline

    Another useful algorithm is the so-called 2D pipeline, in which the data flows into a rectangular array of processors. Labeling the process grid using (0, 0) as the top left corner, the data flows from the left and from the top, with process P_{i,j}:

    1. RECV(A from P_{i,j-1})   { A data from left }
    2. RECV(B from P_{i-1,j})   { B data from above }
    3. Accumulate C_{i,j}
    4. SEND(A to P_{i,j+1})     { A data to right }
    5. SEND(B to P_{i+1,j})     { B data down }

    (illustrated with a sketch below)
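    A rough per-process sketch in C with MPI (my own illustration, not from the handout): this is the body for an interior process P(i,j); edge processes would inject their own row of A or column of B instead of receiving, and MPI_PROC_NULL can be used for right/down on the far boundary.

        #include <mpi.h>

        /* Interior node of the systolic array: accumulate C(i,j) while streaming
           A to the right and B downward; left/up/right/down are neighbor ranks. */
        double pipeline_node(int N, int left, int up, int right, int down, MPI_Comm comm)
        {
            double a, b, c = 0.0;
            for (int k = 0; k < N; k++) {
                MPI_Recv(&a, 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);  /* A from left  */
                MPI_Recv(&b, 1, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);  /* B from above */
                c += a * b;                                                     /* accumulate   */
                MPI_Send(&a, 1, MPI_DOUBLE, right, 0, comm);                    /* A to right   */
                MPI_Send(&b, 1, MPI_DOUBLE, down,  1, comm);                    /* B down       */
            }
            return c;
        }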


    2D Pipeline Illustrated

    [Figure: 2D pipeline (or systolic array) algorithm showing the flow of data for a 4 x 4 process grid - C_{00} ... C_{33} accumulate in place while rows of A stream in from the left and columns of B stream in from the top, each successive row/column delayed by one cycle.]


    Strassen's Method

    The techniques discussed thus far are fundamentally O(N^3), but there is a clever method due to Strassen^1 which is O(N^{2.81}):

        A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} ,
        B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} ,
        C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}

        Q_1 = (A_{11} + A_{22})(B_{11} + B_{22}),      C_{11} = Q_1 + Q_4 - Q_5 + Q_7
        Q_2 = (A_{21} + A_{22}) B_{11},                C_{12} = Q_3 + Q_5
        Q_3 = A_{11} (B_{12} - B_{22}),                C_{21} = Q_2 + Q_4
        Q_4 = A_{22} (B_{21} - B_{11}),                C_{22} = Q_1 + Q_3 - Q_2 + Q_6
        Q_5 = (A_{11} + A_{12}) B_{22},
        Q_6 = (A_{21} - A_{11})(B_{11} + B_{12}),
        Q_7 = (A_{12} - A_{22})(B_{21} + B_{22})

    Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds

    ^1 V. Strassen, "Gaussian Elimination Is Not Optimal," Numerische Mathematik, 13, 353-356 (1969).

    Matrix Multiplication Strassen's Method


    If N is not a power of 2, we can pad with zeros

    Reiterate until the submatrices are reduced to numbers, or to the optimal matrix multiply size for the particular processor (Strassen switches to standard matrix multiplication for small enough submatrices)

    Eliminates 1 matrix multiply at each level of recursion; instead of O(N^{log_2 8}) = O(N^3) we get O(N^{log_2 7}) = O(N^{2.81})

    Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds; Strassen takes 7 N^{log_2 7} - 6 N^2 operations in total

    Requires additional storage for intermediate matrices, and can be less stable numerically ...
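    As a purely illustrative sketch (not code from the handout), here is one level of the recursion written out for scalar "blocks", i.e. a 2 x 2 matrix, using the Q's exactly as defined above; in a real implementation each product would itself be a recursive call (or a tuned multiply) on m x m submatrices:

        /* One level of Strassen for a 2x2 matrix: a[0][0] = A11, a[0][1] = A12, etc. */
        void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2])
        {
            double q1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
            double q2 = (a[1][0] + a[1][1]) *  b[0][0];
            double q3 =  a[0][0] * (b[0][1] - b[1][1]);
            double q4 =  a[1][1] * (b[1][0] - b[0][0]);
            double q5 = (a[0][0] + a[0][1]) *  b[1][1];
            double q6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
            double q7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

            c[0][0] = q1 + q4 - q5 + q7;     /* C11 */
            c[0][1] = q3 + q5;               /* C12 */
            c[1][0] = q2 + q4;               /* C21 */
            c[1][1] = q1 - q2 + q3 + q6;     /* C22 */
        }

    Seven multiplies replace the conventional eight, at the cost of extra additions and the temporaries q1-q7.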


    LU Decomposition in Parallel


    Sequential LU/Gaussian Elimination Code

    LU code fragment:

        for (k = 0; k < N-1; k++) {                /* loop over columns            */
          for (i = k+1; i < N; i++)                /* multipliers for column k     */
            l[i][k] = a[i][k]/a[k][k];
          for (j = k+1; j < N; j++)                /* eliminate below the diagonal */
            for (i = k+1; i < N; i++)
              a[i][j] = a[i][j] - l[i][k]*a[k][j];
        }


    Parallel LU/Gaussian Elimination

    One strategy for parallelizing the Gaussian elimination part of LU decomposition is to do so over the middle loop (j) in the algorithm.

    Decomposing the columns, and using a message passing implementation, we might have something like the following:


        do k = 1, N-1
          if (column k is mine) then               ! columnwise decomposition
            do i = k+1, N
              L(i,k) = A(i,k) / A(k,k)
            end do
            BCAST( L(k+1:N,k), root = owner of k )
          else
            RECV( L(k+1:N,k) )
          end if
          do j = k+1, N (modulo I own column j)    ! columnwise decomposition
            do i = k+1, N
              A(i,j) = A(i,j) - L(i,k)*A(k,j)
            end do
          end do
        end do


    Parallel LU Efficiency

    If we do a (not overly crude) analysis of the preceding parallel algorithm, for each column k we have:

    Broadcast N - k values, let's say taking time c_b each

    Compute (N - k)^2 multiply-adds, time c_fma each, spread over N_p processes:

        \tau_k \simeq c_b (N - k) + c_fma (N - k)^2 / N_p ,

    Summing over k we find:

        \tau(N_p) \simeq (c_b/2) N^2 + (c_fma/3) N^3/N_p .        (1)


    Parallel LU Speedup

    Now, looking at the parallel speedup factor we have:

        S(N_p) = \tau(1)/\tau(N_p) \simeq \frac{c_fma N^3/3}{c_b N^2/2 + c_fma N^3/(3 N_p)}
               = \left[ \frac{3 c_b}{2 N c_fma} + \frac{1}{N_p} \right]^{-1} .

    See the ratio of communication cost to computation cost? It never goes away.

    Minimizing communication costs is again the key to improving the efficiency ...


    Parallel LU Efficiency

    The efficiency is given by:

        E(N_p) = S(N_p)/N_p \simeq \left[ \frac{3 N_p}{2 N} \frac{c_b}{c_fma} + 1 \right]^{-1}
               \simeq 1 - \frac{3 N_p}{2 N} \frac{c_b}{c_fma} .

    Note that this algorithm is scalable (in time) by our previous definition, for a given fixed ratio of N_p/N

    Note that it is not scalable in terms of memory, since the memory requirements grow steadily as O(N^2)


    Improving Parallel LU

    The most obvious way to improve the efficiency of our parallel Gaussian elimination is to overlap computation with the necessary communication. Consider the revised algorithm ...


        ! 1st processor computes L(2:N,1) and BCAST( L(2:N,1) )
        do k = 1, N-1
          if (column k is not mine) then            ! post RECV for multipliers
            RECV( L(k+1:N,k) from BCAST(next_L) )
          end if
          if (column k+1 is mine) then              ! compute next set of multipliers
            do i = k+1, N                           ! and eliminate for column k+1
              A(i,k+1) = A(i,k+1) - L(i,k)*A(k,k+1)
              next_L(i,k+1) = A(i,k+1) / A(k+1,k+1)
            end do
            BCAST( next_L(k+2:N,k+1), root = owner of k+1 )
          end if
          do j = k+2, N (modulo I own column j)     ! perform eliminations
            do i = k+1, N
              A(i,j) = A(i,j) - L(i,k)*A(k,j)
            end do
          end do
        end do


    ScaLAPACK

    Scalable LAPACK library (LAPACK is the general purpose Linear Algebra PACKage), contains LAPACK-style driver routines formulated for distributed memory parallel processing.

    An exceptional example of an MPI library:

    1. Portable - based on optimized low-level routines
    2. User friendly - parallel versions of the common LAPACK routines have similar syntax, names just prefixed with a "P"
    3. User of the library need never write parallel processing code - it's all under the covers
    4. Efficient

    Source code and documentation available from NETLIB: http://www.netlib.org/scalapack


    Under the ScaLAPACK Hood

    Uses a block-cyclic decomposition of matrices, into a P_row x P_col 2D representation. If we now consider an NB x NB sub-block (starting at index ib, ending at index end):

        do ib = 1, N, NB
          end = MIN(N, ib+NB-1)
          do i = ib, end
            (a) Find pivot row k, BCAST column
            (b) Swap rows k and i in block column, BCAST row k
            (c) A(i+1:N,i) = A(i+1:N,i) / A(i,i)
            (d) A(i+1:N,i+1:end) = A(i+1:N,i+1:end) - A(i+1:N,i)*A(i,i+1:end)
          end do
          (e) BCAST all swap information to right and left
          (f) Apply all row swaps to other columns
          (g) BCAST row k to right
          (h) A(ib:end,end+1:N) = LL \ A(ib:end,end+1:N)
          (i) BCAST A(ib:end,end+1:N) down
          (j) BCAST A(end+1:N,ib:end) right
          (k) Eliminate A(end+1:N,end+1:N)
        end do

    this is our old friend LU decomposition, available as the ScaLAPACK PGETRF routine.


    ScaLAPACK Schematic

    [Figure: layered structure of ScaLAPACK - the global layer (ScaLAPACK and the PBLAS) is built on the local layer (LAPACK, BLAS, and the BLACS, which in turn sit on MPI or PVM).]

    General schematic of the ScaLAPACK library.


    ScaLAPACK Efficiency

        Variable              Description
        C_f N^3               Total number of FP operations
        C_v N^2/\sqrt{N_p}    Total amount of data communicated
        C_m N/NB              Total number of messages
        t_f                   Time per FP operation
        t_v                   Time per data item communicated
        t_m                   Time per message

    With these quantities, the time for the ScaLAPACK drivers is given by:

        \tau(N, N_p) = \frac{C_f N^3}{N_p} t_f + \frac{C_v N^2}{\sqrt{N_p}} t_v + \frac{C_m N}{NB} t_m ,


    and the scaling:

        E(N, N_p) \simeq \left[ 1 + \frac{1}{NB} \frac{C_m t_m}{C_f t_f} \frac{N_p}{N^2}
                   + \frac{C_v t_v}{C_f t_f} \frac{\sqrt{N_p}}{N} \right]^{-1} .

    Note that:

    Values of C_f, C_v, and C_m depend on the driver routine (and underlying algorithm)

    Scalable for constant N^2/N_p ...

    Small problems: dominated by t_m/t_f, the ratio of latency to time per FP operation (see why latency matters?)

    Medium problems: significantly impacted by the ratio of network bandwidth to FP operation rate, t_f/t_v

    Large problems: node Flop/s (1/t_f) dominates


    Block Cyclic Decomposition

    ScaLAPACK uses a block-cyclic decomposition:

    [Figure: on the left, a 4-processor column-wise decomposition; on the right, the block-cyclic version using P_r = P_c = 2, where each square is an NB x NB block (2 x 2 in this example).]

    The advantage of block-cyclic is that it allows Level 2 and 3 BLAS operations on subvectors and submatrices within each processor.
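    To make the mapping concrete, a tiny sketch (my own illustration, not ScaLAPACK code) of which process coordinates own global entry (i, j) under a block-cyclic distribution with block size NB on a P_r x P_c grid:

        /* Block-cyclic owner of global entry (i,j), 0-based, first block on process (0,0). */
        void owner(int i, int j, int NB, int Pr, int Pc, int *prow, int *pcol)
        {
            *prow = (i / NB) % Pr;   /* block row, cycled over the process rows    */
            *pcol = (j / NB) % Pc;   /* block column, cycled over the process cols */
        }

    For example, with NB = 2 and P_r = P_c = 2 as above, global entry (5, 2) lands on process (0, 1).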


    Sample ScaLAPACK Program

    "Simplest" program to solve A x = b using PDGESV (LU), available atwww.netlib.org/scalapack/examples/example1.f

    1 PROGRAM EXAMPLE12 3 Example Program so lv in g Ax=b vi a ScaLAPACK r ou ti ne PDGESV4 5 . . P ar am et er s . .6 INTEGER DLEN_, IA , JA, IB , JB, M, N, MB, NB, RSRC,

    7 $ CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT,8 $ MXLOCR, MXLOCC, MXRHSC9 PARAMETER ( DLEN_ = 9 , IA = 1 , JA = 1 , IB = 1 , JB = 1 ,

    10 $ M = 9 , N = 9 , MB = 2 , NB = 2 , RSRC = 0 ,11 $ CSRC = 0 , MXLLDA = 5 , MXLLDB = 5 , NRHS = 1 ,12 $ NBRHS = 1 , NOUT = 6 , MXLOCR = 5 , MXLOCC = 4 ,13 $ MXRHSC = 1 )14 DOUBLE PRECISION ONE15 PARAMETER ( ONE = 1.0D+0 )

    16 . .17 . . L oc al S ca la rs . .18 INTEGER ICTXT , INFO , MYCOL, MYROW, NPCOL, NPROW19 DOUBLE PRECISION ANORM, BNORM, EPS, RESID , XNORM


    *     ..
    *     .. Local Arrays ..
          INTEGER            DESCA( DLEN_ ), DESCB( DLEN_ ),
         $                   IPIV( MXLOCR+NB )
          DOUBLE PRECISION   A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ),
         $                   B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ),
         $                   WORK( MXLOCR )
    *     ..
    *     .. External Functions ..
          DOUBLE PRECISION   PDLAMCH, PDLANGE
          EXTERNAL           PDLAMCH, PDLANGE
    *     ..
    *     .. External Subroutines ..
          EXTERNAL           BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO,
         $                   DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY,
         $                   SL_INIT
    *     ..
    *     .. Intrinsic Functions ..
          INTRINSIC          DBLE
    *     ..
    *     .. Data statements ..
          DATA               NPROW / 2 / , NPCOL / 3 /
    *     ..
    *     .. Executable Statements ..
    *
    *     INITIALIZE THE PROCESS GRID
    *
          CALL SL_INIT( ICTXT, NPROW, NPCOL )
          CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )


    *
    *     If I'm not in the process grid, go to the end of the program
    *
          IF( MYROW.EQ.-1 )
         $   GO TO 10
    *
    *     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
    *     Initialize the array descriptors for the matrices A and B
    *
          CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
         $               INFO )
          CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
         $               MXLLDB, INFO )
    *
    *     Generate matrices A and B and distribute to the process grid
    *
          CALL MATINIT( A, DESCA, B, DESCB )
    *
    *     Make a copy of A and B for checking purposes
    *
          CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
          CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
    *
    *     CALL THE SCALAPACK ROUTINE
    *     Solve the linear system A * X = B
    *
          CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
         $             INFO )


          IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
             WRITE( NOUT, FMT = 9999 )
             WRITE( NOUT, FMT = 9998 ) M, N, NB
             WRITE( NOUT, FMT = 9997 ) NPROW*NPCOL, NPROW, NPCOL
             WRITE( NOUT, FMT = 9996 ) INFO
          END IF
    *
    *     Compute residual ||A * X - B|| / ( ||X|| * ||A|| * eps * N )
    *
          EPS = PDLAMCH( ICTXT, 'Epsilon' )
          ANORM = PDLANGE( 'I', N, N, A, 1, 1, DESCA, WORK )
          BNORM = PDLANGE( 'I', N, NRHS, B, 1, 1, DESCB, WORK )
          CALL PDGEMM( 'N', 'N', N, NRHS, N, ONE, A0, 1, 1, DESCA, B, 1, 1,
         $             DESCB, -ONE, B0, 1, 1, DESCB )
          XNORM = PDLANGE( 'I', N, NRHS, B0, 1, 1, DESCB, WORK )
          RESID = XNORM / ( ANORM*BNORM*EPS*DBLE( N ) )
    *
          IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
             IF( RESID.LT.10.0D+0 ) THEN
                WRITE( NOUT, FMT = 9995 )
                WRITE( NOUT, FMT = 9993 ) RESID
             ELSE
                WRITE( NOUT, FMT = 9994 )
                WRITE( NOUT, FMT = 9993 ) RESID
             END IF
          END IF


    *
    *     RELEASE THE PROCESS GRID
    *     Free the BLACS context
    *
          CALL BLACS_GRIDEXIT( ICTXT )
       10 CONTINUE
    *
    *     Exit the BLACS
    *
          CALL BLACS_EXIT( 0 )
    *
     9999 FORMAT( / 'ScaLAPACK Example Program #1 - May 1, 1997' )
     9998 FORMAT( / 'Solving Ax=b where A is a ', I3, ' by ', I3,
         $        ' matrix with a block size of ', I3 )
     9997 FORMAT( 'Running on ', I3, ' processes, where the process grid',
         $        ' is ', I3, ' by ', I3 )
     9996 FORMAT( / 'INFO code returned by PDGESV = ', I3 )
     9995 FORMAT( / 'According to the normalized residual the solution is correct.' )
     9994 FORMAT( / 'According to the normalized residual the solution is incorrect.' )
     9993 FORMAT( / '||A*x - b|| / ( ||x||*||A||*eps*N ) = ', 1P, E16.8 )
          STOP
          END


    Compiling and running on U2 (use the Intel MKL link advisor to get the linking right):

        [k07n14:~/d_scalapack]$ module load intel
        [k07n14:~/d_scalapack]$ module load intelmpi
        [k07n14:~/d_scalapack]$ module load mkl/10.3
        [k07n14:~/d_scalapack]$ mpiifort -o example1 example1.f -L$MKL/lib/intel64 \
            -lmkl_scalapack_lp64 -lmkl_solver_lp64_sequential -Wl,--start-group \
            -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 \
            -Wl,--end-group -lpthread
        [k07n14:~/d_scalapack]$ ldd example1
            linux-vdso.so.1 => (0x00007fff495ff000)
            libdl.so.2 => /lib64/libdl.so.2 (0x000000370b400000)
            libmkl_scalapack_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00002b723da34000)
            libmkl_intel_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b723e120000)
            libmkl_sequential.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_sequential.so (0x00002b723e7d2000)
            libmkl_core.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so (0x00002b723edc4000)
            libmkl_blacs_intelmpi_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00002b723fc68000)
            libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cc9200000)
            libmpi.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpi.so.4 (0x00002b723fdc4000)
            libmpigf.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpigf.so.4 (0x00002b724028e000)
            librt.so.1 => /lib64/librt.so.1 (0x0000003cc9a00000)
            libm.so.6 => /lib64/libm.so.6 (0x000000390ac00000)
            libc.so.6 => /lib64/libc.so.6 (0x0000003cc8600000)
            libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ccbe00000)
            /lib64/ld-linux-x86-64.so.2 (0x0000003cc8200000)


        [u2:~/d_scalapack]$ qsub -q debug -lnodes=1:ppn=8,walltime=01:00:00 -I
        qsub: waiting for job 1112961.d15n41.ccr.buffalo.edu to start
        qsub: job 1112961.d15n41.ccr.buffalo.edu ready

        Job 1112961.d15n41.ccr.buffalo.edu has requested 8 cores/processors per node.
        PBSTMPDIR is /scratch/1112961.d15n41.ccr.buffalo.edu
        [d06n40b:~]$ cd $PBS_O_WORKDIR
        [d06n40b:~/d_scalapack]$ NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
        [d06n40b:~/d_scalapack]$ mpdboot -n $NNODES -f $PBS_NODEFILE -v
        running mpdallexit on d06n40b
        LAUNCHED mpd on d06n40b via
        RUNNING: mpd on d06n40b
        [d06n40b:~/d_scalapack]$ module load intelmpi
        [d06n40b:~/d_scalapack]$ module load intel
        [d06n40b:~/d_scalapack]$ module load mkl/10.3
        [d06n40b:~/d_scalapack]$ mpiexec -np 6 ./example1

        ScaLAPACK Example Program #1 - May 1, 1997

        Solving Ax=b where A is a 9 by 9 matrix with a block size of 2
        Running on 6 processes, where the process grid is 2 by 3

        INFO code returned by PDGESV = 0

        According to the normalized residual the solution is correct.

        ||A*x - b|| / ( ||x||*||A||*eps*N ) = 0.00000000E+00


    ScaLAPACK Driver Illustrated - HPL


    The HPL (High Performance Linpack) benchmark is still used to rank the Top500 list (www.top500.org)

    Super-sized version of the previous example code: uses PDGESV to solve a randomly seeded linear system, and checks the results to ensure that they are accurate

    The following plot shows results for U2's Myrinet nodes (1536 processors total) and Qlogic QDR InfiniBand (4416 cores total) - note the scalability


    Linpack (HPL) Benchmark


    [Figure: Linpack (HPL) Benchmark, CCR U2 Clusters - measured HPL performance (GFlop/s, 10^1 to 10^6 on a log scale) versus number of processors (1 to 8192) for the Myrinet MX and Qlogic QDR IB partitions, with problem sizes ranging from N = 14,800 up to N = 1,078,272.]


    Iterative Methods Jacobi Iteration

    Jacobi Iteration


    Our method for improving on solutions to systems of linear equations can be formalized into a solution technique in its own right (Jacobi did it first):

        x_i^{(k)} = \frac{1}{A_{ii}} \left( b_i - \sum_{j \neq i} A_{ij} x_j^{(k-1)} \right) ,

    Particularly useful in the solution of (O|P)DEs using finite differences

    Advantage of small memory requirements (especially for sparse systems)

    Disadvantage of convergence problems (converging slowly, or not at all)

    Initial guess - often take x^{(0)} = b


    Measuring convergence for Jacobi iteration can be tricky, particularly in parallel. At first glance, you might be tempted by:

        | x_i^{(k+1)} - x_i^{(k)} | < \epsilon ,  for all i,

    but that says little about solution accuracy.

        \left| \sum_j A_{ij} x_j^{(k)} - b_i \right| < \epsilon

    is a better form, and can reuse values already computed in the previous iteration.


    Iterative Methods Jacobi Iteration

    Sequential


        do i = 1, N
          x(i) = b(i)
        end do
        do iter = 1, MAXITER
          do i = 1, N
            sum = 0.0
            do j = 1, N
              if (i /= j) sum = sum + A(i,j)*x(j)
            end do
            new_x(i) = (b(i) - sum)/A(i,i)
          end do
          do i = 1, N
            x(i) = new_x(i)
          end do
        end do


    Iterative Methods Jacobi Iteration

    Block Parallel


    Have each process accountable for solving a particular block of unknowns, and simply use an Allgather or Allgatherv (for an unequal distribution of work on the processes). The resulting computation time is

        t_comp = \frac{N}{N_p} (2N + 4) N_iter ,

    with a time for communication

        t_comm = ( N_p t_lat + N t_dat ) N_iter ,
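    A minimal sketch of one such iteration in C with MPI (my own illustration, not code from the handout), assuming N is divisible by N_p so that every process owns nlocal = N/N_p contiguous unknowns:

        #include <mpi.h>

        /* One Jacobi sweep: each process updates its block of unknowns, then an
           Allgather makes the full updated x available everywhere. */
        void jacobi_iteration(int N, int nlocal, int first,        /* my rows: [first, first+nlocal) */
                              const double *A, const double *b,    /* A stored by rows (N*N), b (N)  */
                              double *x, double *xnew, MPI_Comm comm)
        {
            for (int i = 0; i < nlocal; i++) {
                int gi = first + i;                   /* global row index */
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != gi)
                        sum += A[gi*N + j] * x[j];
                xnew[i] = (b[gi] - sum) / A[gi*N + gi];
            }
            /* Gather every process's block into the full solution vector x. */
            MPI_Allgather(xnew, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);
        }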


    and the speedup factor:

        S(N_p) = \frac{N(2N + 4)}{N(2N + 4)/N_p + N_p t_lat + N t_dat} = \frac{N_p}{1 + t_comm/t_comp} .

    Note that this is scalable, but only if the ratio N/N_p is maintained with increasing N_p.


    Iterative Methods Gauss-Seidel

    Gauss-Seidel Relaxation


    Technique for convergence acceleration; it usually converges faster than Jacobi,

        x_i^{(k)} = \frac{1}{A_{ii}} \left( b_i - \sum_{j=1}^{i-1} A_{ij} x_j^{(k)} - \sum_{j=i+1}^{N} A_{ij} x_j^{(k-1)} \right) ,

    making use of the just-updated solution values. More difficult to parallelize, but it can be done through particular patterning (say, using a checkerboard allocation) in which regions can be computed simultaneously.


    Iterative Methods Successive Overrelaxation

    Overrelaxation


    Another improved convergence technique, in which a factor (1 - \omega) x_i is added:

        x_i^{(k)} = \frac{\omega}{A_{ii}} \left( b_i - \sum_{j \neq i} A_{ij} x_j^{(k-1)} \right) + (1 - \omega) x_i^{(k-1)} ,

    where 0 < \omega < 1. For the Gauss-Seidel method, this is slightly modified:

        x_i^{(k)} = \frac{\omega}{A_{ii}} \left( b_i - \sum_{j=1}^{i-1} A_{ij} x_j^{(k)} - \sum_{j=i+1}^{N} A_{ij} x_j^{(k-1)} \right) + (1 - \omega) x_i^{(k-1)} ,

    with 0 < \omega < 2.
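    A small serial sketch of the Gauss-Seidel form with over-relaxation (my own illustration, not from the handout; omega is the relaxation factor \omega):

        /* One SOR sweep over x (updated in place, so j < i entries are already new). */
        void sor_sweep(int N, const double *A, const double *b, double *x, double omega)
        {
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != i)
                        sum += A[i*N + j] * x[j];
                x[i] = omega * (b[i] - sum) / A[i*N + i] + (1.0 - omega) * x[i];
            }
        }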


    Iterative Methods Multigrid

    Multigrid Method


    Makes use of successively finer grids to improve the solution:

    A coarse starting grid quickly takes distant effects into account

    Initial values on the finer grids are available through interpolation from the existing grid values

    Of course, we can even have regions of variable grid density using so-called adaptive grid methods
