
    High Performance Linear Algebra II

    M. D. Jones, Ph.D.

    Center for Computational Research
    University at Buffalo, State University of New York

    High Performance Computing I, 2012


    Introduction

    Dense & Parallel

    In this topic we will still discuss dense (as opposed to sparse) matrices, but now focus on parallel execution.

    Best performance (and scalability) most often comes down to making the best use of (local) Level 3 BLAS

    Optimized BLAS are generally crucial for achieving performance gains
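    For reference, a minimal Level 3 BLAS call through the C interface (a sketch only, assuming a CBLAS-style header such as the one shipped with MKL, OpenBLAS, or ATLAS; the wrapper name gemm_example is just for illustration and is not from the handout):

        #include <cblas.h>

        /* C = A*B for N x N row-major matrices, done by the (optimized) Level 3 BLAS. */
        void gemm_example(int N, const double *A, const double *B, double *C)
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N,
                        1.0, A, N,     /* alpha, A, lda */
                        B, N,          /* B, ldb        */
                        0.0, C, N);    /* beta, C, ldc  */
        }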


    Introduction

    BLAS Recapitulated

    BLAS are critical:

    Foundation for all (computational) linear algebra

    Parallel versions still call serial versions on each processor

    Performance goes up with increasing ratio of floating point operations to memory references

    Well-tuned BLAS takes maximum advantage of the memory hierarchy

        BLAS Level   Calculation   Memory Refs.   Flop Count   Ratio (Flop/Mem)
        1            DDOT          3N             2N           2/3
        2            DGEMV         N^2            2N^2         2
        3            DGEMM         4N^2           2N^3         N/2


    Introduction

    Simple Memory Model

    Let us assume just two levels of memory - slow and fast - and define:

        m    = # memory elements (words) moved between fast and slow memory,
        t_m  = time per slow memory operation,
        f_op = number of floating-point operations,
        t_f  = time per floating-point operation,

    and q = f_op/m. The minimum time is just f_op t_f, but more realistically:

        time = f_op t_f + m t_m = f_op t_f [ 1 + t_m/(q t_f) ] .

    Key observation - we need optimal reuse of data, i.e. a larger q (certainly t_m >> t_f).
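    As a quick illustration (the numbers here are assumed for the example, not taken from the handout), suppose a slow-memory access costs 100 times a flop, t_m = 100 t_f. Using the ratios from the BLAS table above:

        DDOT  (q = 2/3):            time \approx f_op t_f (1 + 150)   - completely memory bound
        DGEMM (q = N/2, N = 1000):  time \approx f_op t_f (1 + 0.2)   - close to floating-point peak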


    Matrix Multiplication Sequential

    Simple Matrix Multiplication

    Everyone has probably written this out at least once ...

        for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
              c[i][j] += a[i][k]*b[k][j];
          }


    Matrix Multiplication Partitioning

    Partitioning

    In this case, partitioning is also known as block matrix multiplication:

    Divide into s^2 submatrices, (N/s) x (N/s) elements in each submatrix

    Let m = N/s, and

        for (p = 0; p < s; p++)
          for (q = 0; q < s; q++) {
            C_pq = 0;                        /* zero the m x m submatrix C_pq */
            for (r = 0; r < s; r++)
              C_pq = C_pq + A_pr * B_rq;     /* m x m submatrix multiply-add  */
          }


    Matrix Multiplication Divide & Conquer

    Recursive/Divide & Conquer

    Let N be a power of 2 (not strictly necessary) and divide A and B into 4 submatrices, delimited by:

        A = \begin{pmatrix} A_{pp} & A_{pq} \\ A_{qp} & A_{qq} \end{pmatrix} .

    Then the solution requires 8 submatrix multiplications, which can be done recursively ...


    Matrix Multiplication Divide & Conquer

        matmultR(A, B, s) {
          if (s == 1) {
            C = A*B;                       /* termination condition */
          } else {
            s = s/2;
            P0 = matmultR(A_{pp}, B_{pp}, s);
            P1 = matmultR(A_{pq}, B_{qp}, s);
            P2 = matmultR(A_{pp}, B_{pq}, s);
            P3 = matmultR(A_{pq}, B_{qq}, s);
            P4 = matmultR(A_{qp}, B_{pp}, s);
            P5 = matmultR(A_{qq}, B_{qp}, s);
            P6 = matmultR(A_{qp}, B_{pq}, s);
            P7 = matmultR(A_{qq}, B_{qq}, s);
            C_{pp} = P0 + P1;
            C_{pq} = P2 + P3;
            C_{qp} = P4 + P5;
            C_{qq} = P6 + P7;
          }
          return(C);
        }


    Matrix Multiplication Divide & Conquer

    Well suited for SMP with cache hierarchy

    Size of data continually reduced and localized

    Can be highly performing when making maximum reuse of data within the cache memory hierarchy

    More generally we want to address cases for which the matrix will not fit within the memory of a single machine; a message passing/distributed model is required


    Matrix Multiplication Cannon's Algorithm

    Cannon's Algorithm

    Cannon's algorithm:

    Uses a mesh of processors with a torus topology (periodic boundary conditions)

    Elements are shifted to an aligned position: row i of A is shifted i places to the left, while column j of B is shifted j places upward. This puts A_{i,j+i} and B_{i+j,j} in processor P_{i,j} (namely, we have the appropriate submatrices/elements to multiply in P_{i,j})

    Usually submatrices are used, but we can use single elements to simplify the notation a bit


    Matrix Multiplication Cannon's Algorithm

    After alignment, Cannon's algorithm proceeds as follows:

    Each process P_{i,j} multiplies its elements

    Row i of A is shifted one place left, and column j of B is shifted one place up (this brings together the adjacent elements of A and B needed for summing results)

    Each process P_{i,j} multiplies its elements and adds to its accumulating sum

    The preceding two steps are repeated until the results are obtained (N - 1 shifts); see the sketch below
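    A minimal sketch of this element-per-process version in C with MPI (my own illustration, not code from the handout), assuming an n x n periodic process grid created with MPI_Cart_create; the name cannon_element and the argument layout are purely illustrative:

        #include <mpi.h>

        /* One element of A and B per process on an n x n torus; 'grid' is assumed
           to come from MPI_Cart_create with both dimensions periodic. */
        double cannon_element(double a, double b, int n, MPI_Comm grid)
        {
            int rank, coords[2], src, dst;
            double c = 0.0;

            MPI_Comm_rank(grid, &rank);
            MPI_Cart_coords(grid, rank, 2, coords);   /* coords[0] = row i, coords[1] = col j */

            /* Alignment: shift row i of A left by i places ... */
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            /* ... and column j of B up by j places. */
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

            /* Multiply-accumulate, then shift A one place left and B one place up. */
            for (int step = 0; step < n; step++) {
                c += a * b;
                MPI_Cart_shift(grid, 1, -1, &src, &dst);
                MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
                MPI_Cart_shift(grid, 0, -1, &src, &dst);
                MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            }
            return c;   /* the final shift simply restores the aligned layout */
        }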


    Matrix Multiplication Cannon's Algorithm

    Cannon Illustrated

    [Figure: Cannon's algorithm showing the flow of data for process P_{ij} - elements of A shift left along row i while elements of B shift up along column j.]


    Matrix Multiplication Cannon's Algorithm

    For s^2 submatrices (m = N/s) the communication time for Cannon's algorithm is given by

        t_comm = 4(s - 1)(t_lat + m^2 t_dat),

    where t_lat is the latency and t_dat is the time required to send one value (word). The computation time in Cannon's algorithm is

        t_comp = 2 s m^3 = 2 m^2 N,

    or O(m^2 N).

    Cannon's algorithm is also known as the ScaLAPACK outer product algorithm (can you see another numerical library coming?) ...


    Matrix Multiplication 2D Pipeline

    2D Pipeline

    Another useful algorithm is the so-called 2D pipeline, in which the data flows into a rectangular array of processors. Labeling the process grid using (0, 0) as the top left corner, the data flows from the left and from the top, with process P_{i,j}:

    1. RECV(A from P_{i,j-1})   { A data from left }
    2. RECV(B from P_{i-1,j})   { B data from above }
    3. Accumulate C_{i,j}
    4. SEND(A to P_{i,j+1})     { A data to right }
    5. SEND(B to P_{i+1,j})     { B data down }

    (illustrated with a sketch below)
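    A rough per-process sketch in C with MPI (my own illustration, not from the handout): this is the body for an interior process P(i,j); edge processes would inject their own row of A or column of B instead of receiving, and MPI_PROC_NULL can be used for right/down on the far boundary.

        #include <mpi.h>

        /* Interior node of the systolic array: accumulate C(i,j) while streaming
           A to the right and B downward; left/up/right/down are neighbor ranks. */
        double pipeline_node(int N, int left, int up, int right, int down, MPI_Comm comm)
        {
            double a, b, c = 0.0;
            for (int k = 0; k < N; k++) {
                MPI_Recv(&a, 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);  /* A from left  */
                MPI_Recv(&b, 1, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);  /* B from above */
                c += a * b;                                                     /* accumulate   */
                MPI_Send(&a, 1, MPI_DOUBLE, right, 0, comm);                    /* A to right   */
                MPI_Send(&b, 1, MPI_DOUBLE, down,  1, comm);                    /* B down       */
            }
            return c;
        }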


    2D Pipeline Illustrated

    [Figure: 2D pipeline (or systolic array) algorithm showing the flow of data for a 4 x 4 process grid - C_{00} ... C_{33} accumulate in place while rows of A stream in from the left and columns of B stream in from the top, each successive row/column delayed by one cycle.]


    Strassen's Method

    The techniques discussed thus far are fundamentally O(N^3), but there is a clever method due to Strassen^1 which is O(N^{2.81}):

        A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} ,
        B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} ,
        C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}

        Q_1 = (A_{11} + A_{22})(B_{11} + B_{22}),      C_{11} = Q_1 + Q_4 - Q_5 + Q_7
        Q_2 = (A_{21} + A_{22}) B_{11},                C_{12} = Q_3 + Q_5
        Q_3 = A_{11} (B_{12} - B_{22}),                C_{21} = Q_2 + Q_4
        Q_4 = A_{22} (B_{21} - B_{11}),                C_{22} = Q_1 + Q_3 - Q_2 + Q_6
        Q_5 = (A_{11} + A_{12}) B_{22},
        Q_6 = (A_{21} - A_{11})(B_{11} + B_{12}),
        Q_7 = (A_{12} - A_{22})(B_{21} + B_{22})

    Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds

    ^1 V. Strassen, "Gaussian Elimination Is Not Optimal," Numerische Mathematik, 13, 353-356 (1969).

    Matrix Multiplication Strassen's Method


    If N is not a power of 2, we can pad with zeros

    Reiterate until the submatrices are reduced to numbers, or to the optimal matrix multiply size for the particular processor (Strassen switches to standard matrix multiplication for small enough submatrices)

    Eliminates 1 matrix multiply at each level of recursion; instead of O(N^{log_2 8}) = O(N^3) we get O(N^{log_2 7}) = O(N^{2.81})

    Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds; Strassen takes 7 N^{log_2 7} - 6 N^2 operations in total

    Requires additional storage for intermediate matrices, and can be less stable numerically ...
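    As a purely illustrative sketch (not code from the handout), here is one level of the recursion written out for scalar "blocks", i.e. a 2 x 2 matrix, using the Q's exactly as defined above; in a real implementation each product would itself be a recursive call (or a tuned multiply) on m x m submatrices:

        /* One level of Strassen for a 2x2 matrix: a[0][0] = A11, a[0][1] = A12, etc. */
        void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2])
        {
            double q1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
            double q2 = (a[1][0] + a[1][1]) *  b[0][0];
            double q3 =  a[0][0] * (b[0][1] - b[1][1]);
            double q4 =  a[1][1] * (b[1][0] - b[0][0]);
            double q5 = (a[0][0] + a[0][1]) *  b[1][1];
            double q6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
            double q7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

            c[0][0] = q1 + q4 - q5 + q7;     /* C11 */
            c[0][1] = q3 + q5;               /* C12 */
            c[1][0] = q2 + q4;               /* C21 */
            c[1][1] = q1 - q2 + q3 + q6;     /* C22 */
        }

    Seven multiplies replace the conventional eight, at the cost of extra additions and the temporaries q1-q7.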


    LU Decomposition in Parallel


    Sequential LU/Gaussian Elimination Code

    LU code fragment:

        for (k = 0; k < N-1; k++) {                /* loop over columns            */
          for (i = k+1; i < N; i++)                /* multipliers for column k     */
            l[i][k] = a[i][k]/a[k][k];
          for (j = k+1; j < N; j++)                /* eliminate below the diagonal */
            for (i = k+1; i < N; i++)
              a[i][j] = a[i][j] - l[i][k]*a[k][j];
        }


    Parallel LU/Gaussian Elimination

    One strategy for parallelizing the Gaussian elimination part of LU decomposition is to do so over the middle loop (j) in the algorithm.

    Decomposing the columns, and using a message passing implementation, we might have something like the following:


        do k = 1, N-1
          if (column k is mine) then               ! columnwise decomposition
            do i = k+1, N
              L(i,k) = A(i,k) / A(k,k)
            end do
            BCAST( L(k+1:N,k), root = owner of k )
          else
            RECV( L(k+1:N,k) )
          end if
          do j = k+1, N (modulo I own column j)    ! columnwise decomposition
            do i = k+1, N
              A(i,j) = A(i,j) - L(i,k)*A(k,j)
            end do
          end do
        end do


    Parallel LU Efficiency

    If we do a (not overly crude) analysis of the preceding parallel algorithm, for each column k we have:

    Broadcast N - k values, let's say taking time c_b each

    Compute (N - k)^2 multiply-adds, time c_fma each, spread over N_p processes:

        \tau_k \simeq c_b (N - k) + c_fma (N - k)^2 / N_p ,

    Summing over k we find:

        \tau(N_p) \simeq (c_b/2) N^2 + (c_fma/3) N^3/N_p .        (1)


    Parallel LU Speedup

    Now, looking at the parallel speedup factor we have:

        S(N_p) = \tau(1)/\tau(N_p) \simeq \frac{c_fma N^3/3}{c_b N^2/2 + c_fma N^3/(3 N_p)}
               = \left[ \frac{3 c_b}{2 N c_fma} + \frac{1}{N_p} \right]^{-1} .

    See the ratio of communication cost to computation cost? It never goes away.

    Minimizing communication costs is again the key to improving the efficiency ...


    Parallel LU Efficiency

    The efficiency is given by:

        E(N_p) = S(N_p)/N_p \simeq \left[ \frac{3 N_p}{2 N} \frac{c_b}{c_fma} + 1 \right]^{-1}
               \simeq 1 - \frac{3 N_p}{2 N} \frac{c_b}{c_fma} .

    Note that this algorithm is scalable (in time) by our previous definition, for a given fixed ratio of N_p/N

    Note that it is not scalable in terms of memory, since the memory requirements grow steadily as O(N^2)


    Improving Parallel LU

    The most obvious way to improve the efficiency of our parallel Gaussian elimination is to overlap computation with the necessary communication. Consider the revised algorithm ...


        ! 1st processor computes L(2:N,1) and BCAST( L(2:N,1) )
        do k = 1, N-1
          if (column k is not mine) then            ! post RECV for multipliers
            RECV( L(k+1:N,k) from BCAST(next_L) )
          end if
          if (column k+1 is mine) then              ! compute next set of multipliers
            do i = k+1, N                           ! and eliminate for column k+1
              A(i,k+1) = A(i,k+1) - L(i,k)*A(k,k+1)
              next_L(i,k+1) = A(i,k+1) / A(k+1,k+1)
            end do
            BCAST( next_L(k+2:N,k+1), root = owner of k+1 )
          end if
          do j = k+2, N (modulo I own column j)     ! perform eliminations
            do i = k+1, N
              A(i,j) = A(i,j) - L(i,k)*A(k,j)
            end do
          end do
        end do


    ScaLAPACK

    Scalable LAPACK library (LAPACK is the general purpose Linear Algebra PACKage), contains LAPACK-style driver routines formulated for distributed memory parallel processing.

    An exceptional example of an MPI library:

    1. Portable - based on optimized low-level routines
    2. User friendly - parallel versions of the common LAPACK routines have similar syntax, names just prefixed with a "P"
    3. User of the library need never write parallel processing code - it's all under the covers
    4. Efficient

    Source code and documentation available from NETLIB: http://www.netlib.org/scalapack


    Under the ScaLAPACK Hood

    Uses a block-cyclic decomposition of matrices, into a P_row x P_col 2D representation. If we now consider an NB x NB sub-block (starting at index ib, ending at index end):

        do ib = 1, N, NB
          end = MIN(N, ib+NB-1)
          do i = ib, end
            (a) Find pivot row k, BCAST column
            (b) Swap rows k and i in block column, BCAST row k
            (c) A(i+1:N,i) = A(i+1:N,i) / A(i,i)
            (d) A(i+1:N,i+1:end) = A(i+1:N,i+1:end) - A(i+1:N,i)*A(i,i+1:end)
          end do
          (e) BCAST all swap information to right and left
          (f) Apply all row swaps to other columns
          (g) BCAST row k to right
          (h) A(ib:end,end+1:N) = LL \ A(ib:end,end+1:N)
          (i) BCAST A(ib:end,end+1:N) down
          (j) BCAST A(end+1:N,ib:end) right
          (k) Eliminate A(end+1:N,end+1:N)
        end do

    this is our old friend LU decomposition, available as the ScaLAPACK PGETRF routine.


    ScaLAPACK Schematic

    [Figure: layered structure of ScaLAPACK - the global layer (ScaLAPACK and the PBLAS) is built on the local layer (LAPACK, BLAS, and the BLACS, which in turn sit on MPI or PVM).]

    General schematic of the ScaLAPACK library.


    ScaLAPACK Efficiency

        Variable              Description
        C_f N^3               Total number of FP operations
        C_v N^2/\sqrt{N_p}    Total amount of data communicated
        C_m N/NB              Total number of messages
        t_f                   Time per FP operation
        t_v                   Time per data item communicated
        t_m                   Time per message

    With these quantities, the time for the ScaLAPACK drivers is given by:

        \tau(N, N_p) = \frac{C_f N^3}{N_p} t_f + \frac{C_v N^2}{\sqrt{N_p}} t_v + \frac{C_m N}{NB} t_m ,


    and the scaling:

        E(N, N_p) \simeq \left[ 1 + \frac{1}{NB} \frac{C_m t_m}{C_f t_f} \frac{N_p}{N^2}
                   + \frac{C_v t_v}{C_f t_f} \frac{\sqrt{N_p}}{N} \right]^{-1} .

    Note that:

    Values of C_f, C_v, and C_m depend on the driver routine (and underlying algorithm)

    Scalable for constant N^2/N_p ...

    Small problems: dominated by t_m/t_f, the ratio of latency to time per FP operation (see why latency matters?)

    Medium problems: significantly impacted by the ratio of network bandwidth to FP operation rate, t_f/t_v

    Large problems: node Flop/s (1/t_f) dominates


    Block Cyclic Decomposition

    ScaLAPACK uses a block-cyclic decomposition:

    [Figure: on the left, a 4-processor column-wise decomposition; on the right, the block-cyclic version using P_r = P_c = 2, where each square is an NB x NB block (2 x 2 in this example).]

    The advantage of block-cyclic is that it allows Level 2 and 3 BLAS operations on subvectors and submatrices within each processor.
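    To make the mapping concrete, a tiny sketch (my own illustration, not ScaLAPACK code) of which process coordinates own global entry (i, j) under a block-cyclic distribution with block size NB on a P_r x P_c grid:

        /* Block-cyclic owner of global entry (i,j), 0-based, first block on process (0,0). */
        void owner(int i, int j, int NB, int Pr, int Pc, int *prow, int *pcol)
        {
            *prow = (i / NB) % Pr;   /* block row, cycled over the process rows    */
            *pcol = (j / NB) % Pc;   /* block column, cycled over the process cols */
        }

    For example, with NB = 2 and P_r = P_c = 2 as above, global entry (5, 2) lands on process (0, 1).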


    Sample ScaLAPACK Program

    "Simplest" program to solve A x = b using PDGESV (LU), available atwww.netlib.org/scalapack/examples/example1.f

    1 PROGRAM EXAMPLE12 3 Example Program so lv in g Ax=b vi a ScaLAPACK r ou ti ne PDGESV4 5 . . P ar am et er s . .6 INTEGER DLEN_, IA , JA, IB , JB, M, N, MB, NB, RSRC,

    7 $ CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT,8 $ MXLOCR, MXLOCC, MXRHSC9 PARAMETER ( DLEN_ = 9 , IA = 1 , JA = 1 , IB = 1 , JB = 1 ,

    10 $ M = 9 , N = 9 , MB = 2 , NB = 2 , RSRC = 0 ,11 $ CSRC = 0 , MXLLDA = 5 , MXLLDB = 5 , NRHS = 1 ,12 $ NBRHS = 1 , NOUT = 6 , MXLOCR = 5 , MXLOCC = 4 ,13 $ MXRHSC = 1 )14 DOUBLE PRECISION ONE15 PARAMETER ( ONE = 1.0D+0 )

    16 . .17 . . L oc al S ca la rs . .18 INTEGER ICTXT , INFO , MYCOL, MYROW, NPCOL, NPROW19 DOUBLE PRECISION ANORM, BNORM, EPS, RESID , XNORM


    *     ..
    *     .. Local Arrays ..
          INTEGER            DESCA( DLEN_ ), DESCB( DLEN_ ),
         $                   IPIV( MXLOCR+NB )
          DOUBLE PRECISION   A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ),
         $                   B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ),
         $                   WORK( MXLOCR )
    *     ..
    *     .. External Functions ..
          DOUBLE PRECISION   PDLAMCH, PDLANGE
          EXTERNAL           PDLAMCH, PDLANGE
    *     ..
    *     .. External Subroutines ..
          EXTERNAL           BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO,
         $                   DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY,
         $                   SL_INIT
    *     ..
    *     .. Intrinsic Functions ..
          INTRINSIC          DBLE
    *     ..
    *     .. Data statements ..
          DATA               NPROW / 2 / , NPCOL / 3 /
    *     ..
    *     .. Executable Statements ..
    *
    *     INITIALIZE THE PROCESS GRID
    *
          CALL SL_INIT( ICTXT, NPROW, NPCOL )
          CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )


    *
    *     If I'm not in the process grid, go to the end of the program
    *
          IF( MYROW.EQ.-1 )
         $   GO TO 10
    *
    *     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
    *     Initialize the array descriptors for the matrices A and B
    *
          CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
         $               INFO )
          CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
         $               MXLLDB, INFO )
    *
    *     Generate matrices A and B and distribute to the process grid
    *
          CALL MATINIT( A, DESCA, B, DESCB )
    *
    *     Make a copy of A and B for checking purposes
    *
          CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
          CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
    *
    *     CALL THE SCALAPACK ROUTINE
    *     Solve the linear system A * X = B
    *
          CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
         $             INFO )


          IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
             WRITE( NOUT, FMT = 9999 )
             WRITE( NOUT, FMT = 9998 ) M, N, NB
             WRITE( NOUT, FMT = 9997 ) NPROW*NPCOL, NPROW, NPCOL
             WRITE( NOUT, FMT = 9996 ) INFO
          END IF
    *
    *     Compute residual ||A * X - B|| / ( ||X|| * ||A|| * eps * N )
    *
          EPS = PDLAMCH( ICTXT, 'Epsilon' )
          ANORM = PDLANGE( 'I', N, N, A, 1, 1, DESCA, WORK )
          BNORM = PDLANGE( 'I', N, NRHS, B, 1, 1, DESCB, WORK )
          CALL PDGEMM( 'N', 'N', N, NRHS, N, ONE, A0, 1, 1, DESCA, B, 1, 1,
         $             DESCB, -ONE, B0, 1, 1, DESCB )
          XNORM = PDLANGE( 'I', N, NRHS, B0, 1, 1, DESCB, WORK )
          RESID = XNORM / ( ANORM*BNORM*EPS*DBLE( N ) )
    *
          IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
             IF( RESID.LT.10.0D+0 ) THEN
                WRITE( NOUT, FMT = 9995 )
                WRITE( NOUT, FMT = 9993 ) RESID
             ELSE
                WRITE( NOUT, FMT = 9994 )
                WRITE( NOUT, FMT = 9993 ) RESID
             END IF
          END IF


    *
    *     RELEASE THE PROCESS GRID
    *     Free the BLACS context
    *
          CALL BLACS_GRIDEXIT( ICTXT )
       10 CONTINUE
    *
    *     Exit the BLACS
    *
          CALL BLACS_EXIT( 0 )
    *
     9999 FORMAT( / 'ScaLAPACK Example Program #1 - May 1, 1997' )
     9998 FORMAT( / 'Solving Ax=b where A is a ', I3, ' by ', I3,
         $        ' matrix with a block size of ', I3 )
     9997 FORMAT( 'Running on ', I3, ' processes, where the process grid',
         $        ' is ', I3, ' by ', I3 )
     9996 FORMAT( / 'INFO code returned by PDGESV = ', I3 )
     9995 FORMAT( / 'According to the normalized residual the solution is correct.' )
     9994 FORMAT( / 'According to the normalized residual the solution is incorrect.' )
     9993 FORMAT( / '||A*x - b|| / ( ||x||*||A||*eps*N ) = ', 1P, E16.8 )
          STOP
          END


    Compiling and running on U2 (use the Intel MKL link advisor to get the linking right):

        [k07n14:~/d_scalapack]$ module load intel
        [k07n14:~/d_scalapack]$ module load intelmpi
        [k07n14:~/d_scalapack]$ module load mkl/10.3
        [k07n14:~/d_scalapack]$ mpiifort -o example1 example1.f -L$MKL/lib/intel64 \
            -lmkl_scalapack_lp64 -lmkl_solver_lp64_sequential -Wl,--start-group \
            -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 \
            -Wl,--end-group -lpthread
        [k07n14:~/d_scalapack]$ ldd example1
            linux-vdso.so.1 => (0x00007fff495ff000)
            libdl.so.2 => /lib64/libdl.so.2 (0x000000370b400000)
            libmkl_scalapack_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00002b723da34000)
            libmkl_intel_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b723e120000)
            libmkl_sequential.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_sequential.so (0x00002b723e7d2000)
            libmkl_core.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so (0x00002b723edc4000)
            libmkl_blacs_intelmpi_lp64.so => /util/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00002b723fc68000)
            libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cc9200000)
            libmpi.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpi.so.4 (0x00002b723fdc4000)
            libmpigf.so.4 => /util/intel/impi/4.0.3.008/intel64/lib/libmpigf.so.4 (0x00002b724028e000)
            librt.so.1 => /lib64/librt.so.1 (0x0000003cc9a00000)
            libm.so.6 => /lib64/libm.so.6 (0x000000390ac00000)
            libc.so.6 => /lib64/libc.so.6 (0x0000003cc8600000)
            libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ccbe00000)
            /lib64/ld-linux-x86-64.so.2 (0x0000003cc8200000)


        [u2:~/d_scalapack]$ qsub -q debug -lnodes=1:ppn=8,walltime=01:00:00 -I
        qsub: waiting for job 1112961.d15n41.ccr.buffalo.edu to start
        qsub: job 1112961.d15n41.ccr.buffalo.edu ready

        Job 1112961.d15n41.ccr.buffalo.edu has requested 8 cores/processors per node.
        PBSTMPDIR is /scratch/1112961.d15n41.ccr.buffalo.edu
        [d06n40b:~]$ cd $PBS_O_WORKDIR
        [d06n40b:~/d_scalapack]$ NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
        [d06n40b:~/d_scalapack]$ mpdboot -n $NNODES -f $PBS_NODEFILE -v
        running mpdallexit on d06n40b
        LAUNCHED mpd on d06n40b via
        RUNNING: mpd on d06n40b
        [d06n40b:~/d_scalapack]$ module load intelmpi
        [d06n40b:~/d_scalapack]$ module load intel
        [d06n40b:~/d_scalapack]$ module load mkl/10.3
        [d06n40b:~/d_scalapack]$ mpiexec -np 6 ./example1

        ScaLAPACK Example Program #1 - May 1, 1997

        Solving Ax=b where A is a 9 by 9 matrix with a block size of 2
        Running on 6 processes, where the process grid is 2 by 3

        INFO code returned by PDGESV = 0

        According to the normalized residual the solution is correct.

        ||A*x - b|| / ( ||x||*||A||*eps*N ) = 0.00000000E+00


    ScaLAPACK Driver Illustrated - HPL


    The HPL (High Performance Linpack) benchmark is still used to rank the Top500 list (www.top500.org)

    Super-sized version of the previous example code: uses PDGESV to solve a randomly seeded linear system, and checks the results to ensure that they are accurate

    The following plot shows results for U2's Myrinet nodes (1536 processors total) and Qlogic QDR InfiniBand (4416 cores total) - note the scalability


    Linpack (HPL) Benchmark


    [Figure: Linpack (HPL) Benchmark, CCR U2 Clusters - measured HPL performance (GFlop/s, 10^1 to 10^6 on a log scale) versus number of processors (1 to 8192) for the Myrinet MX and Qlogic QDR IB partitions, with problem sizes ranging from N = 14,800 up to N = 1,078,272.]


    Iterative Methods Jacobi Iteration

    Jacobi Iteration


    Our method for improving on solutions to systems of linear equations can be formalized into a solution technique in its own right (Jacobi did it first):

        x_i^{(k)} = \frac{1}{A_{ii}} \left( b_i - \sum_{j \neq i} A_{ij} x_j^{(k-1)} \right) ,

    Particularly useful in the solution of (O|P)DEs using finite differences

    Advantage of small memory requirements (especially for sparse systems)

    Disadvantage of convergence problems (converging slowly, or not at all)

    Initial guess - often take x^{(0)} = b


    Measuring convergence for Jacobi iteration can be tricky, particularly in parallel. At first glance, you might be tempted by:

        | x_i^{(k+1)} - x_i^{(k)} | < \epsilon ,  for all i,

    but that says little about solution accuracy.

        \left| \sum_j A_{ij} x_j^{(k)} - b_i \right| < \epsilon

    is a better form, and can reuse values already computed in the previous iteration.


    Iterative Methods Jacobi Iteration

    Sequential


        do i = 1, N
          x(i) = b(i)
        end do
        do iter = 1, MAXITER
          do i = 1, N
            sum = 0.0
            do j = 1, N
              if (i /= j) sum = sum + A(i,j)*x(j)
            end do
            new_x(i) = (b(i) - sum)/A(i,i)
          end do
          do i = 1, N
            x(i) = new_x(i)
          end do
        end do


    Iterative Methods Jacobi Iteration

    Block Parallel


    Have each process accountable for solving a particular block of unknowns, and simply use an Allgather or Allgatherv (for an unequal distribution of work on the processes). The resulting computation time is

        t_comp = \frac{N}{N_p} (2N + 4) N_iter ,

    with a time for communication

        t_comm = ( N_p t_lat + N t_dat ) N_iter ,
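    A minimal sketch of one such iteration in C with MPI (my own illustration, not code from the handout), assuming N is divisible by N_p so that every process owns nlocal = N/N_p contiguous unknowns:

        #include <mpi.h>

        /* One Jacobi sweep: each process updates its block of unknowns, then an
           Allgather makes the full updated x available everywhere. */
        void jacobi_iteration(int N, int nlocal, int first,        /* my rows: [first, first+nlocal) */
                              const double *A, const double *b,    /* A stored by rows (N*N), b (N)  */
                              double *x, double *xnew, MPI_Comm comm)
        {
            for (int i = 0; i < nlocal; i++) {
                int gi = first + i;                   /* global row index */
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != gi)
                        sum += A[gi*N + j] * x[j];
                xnew[i] = (b[gi] - sum) / A[gi*N + gi];
            }
            /* Gather every process's block into the full solution vector x. */
            MPI_Allgather(xnew, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);
        }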


    and the speedup factor:

        S(N_p) = \frac{N(2N + 4)}{N(2N + 4)/N_p + N_p t_lat + N t_dat} = \frac{N_p}{1 + t_comm/t_comp} .

    Note that this is scalable, but only if the ratio N/N_p is maintained with increasing N_p.


    Iterative Methods Gauss-Seidel

    Gauss-Seidel Relaxation


    Technique for convergence acceleration; it usually converges faster than Jacobi,

        x_i^{(k)} = \frac{1}{A_{ii}} \left( b_i - \sum_{j=1}^{i-1} A_{ij} x_j^{(k)} - \sum_{j=i+1}^{N} A_{ij} x_j^{(k-1)} \right) ,

    making use of the just-updated solution values. More difficult to parallelize, but it can be done through particular patterning (say, using a checkerboard allocation) in which regions can be computed simultaneously.


    Iterative Methods Successive Overrelaxation

    Overrelaxation


    Another improved convergence technique, in which a factor (1 - \omega) x_i is added:

        x_i^{(k)} = \frac{\omega}{A_{ii}} \left( b_i - \sum_{j \neq i} A_{ij} x_j^{(k-1)} \right) + (1 - \omega) x_i^{(k-1)} ,

    where 0 < \omega < 1. For the Gauss-Seidel method, this is slightly modified:

        x_i^{(k)} = \frac{\omega}{A_{ii}} \left( b_i - \sum_{j=1}^{i-1} A_{ij} x_j^{(k)} - \sum_{j=i+1}^{N} A_{ij} x_j^{(k-1)} \right) + (1 - \omega) x_i^{(k-1)} ,

    with 0 < \omega < 2.
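    A small serial sketch of the Gauss-Seidel form with over-relaxation (my own illustration, not from the handout; omega is the relaxation factor \omega):

        /* One SOR sweep over x (updated in place, so j < i entries are already new). */
        void sor_sweep(int N, const double *A, const double *b, double *x, double omega)
        {
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    if (j != i)
                        sum += A[i*N + j] * x[j];
                x[i] = omega * (b[i] - sum) / A[i*N + i] + (1.0 - omega) * x[i];
            }
        }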


    Iterative Methods Multigrid

    Multigrid Method


    Makes use of successively finer grids to improve the solution:

    A coarse starting grid quickly takes distant effects into account

    Initial values on the finer grids are available through interpolation from the existing grid values

    Of course, we can even have regions of variable grid density using so-called adaptive grid methods
