Combining Building Blocks for Parallel Multilevel Matrix Multiplication
Sascha Hunold¹, Thomas Rauber¹, Gudula Rünger²
¹Department of Computer Science, University of Bayreuth
²Department of Computer Science, Chemnitz University of Technology
PMAA’06, Rennes
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Introduction
matrix multiplication is one of the core computations in many algorithms from scientific computing; → many efficient realizations have been invented;
efficient sequential implementations: ATLAS, PHiPAC;
efficient parallel implementations: SUMMA, PDGEMM;
execution platforms differ in important characteristics (memory hierarchy, communication architecture); an efficient implementation must exploit these characteristics; → adaptivity is an important property;
Introduction
multiprocessor task (M-task) implementations have been successful on many platforms in different application areas; advantage: reduction in communication time, better scalability;
approach: M-task implementation of matrix multiplication;
goal: competitiveness with ScaLAPACK and vendor-specific implementations;
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Specific Approach
hierarchical M-task implementation with a combination of different algorithms to exploit specific characteristics of the execution platform;
lower level: efficient adaptive and/or highly tuned sequential algorithm to exploit the single-processor architecture;
upper level: hierarchical (recursive) algorithm to improve communication behavior → Strassen algorithm;
intermediate level: hierarchical (recursive) algorithm tpMM to improve memory access and communication behavior for multi-processor nodes;
Specific Approach
result: a multi-level algorithm which allows adaptation to different execution platforms:
different combinations of algorithms;
different recursion depths for the upper-level Strassen algorithm;
different lower-level sequential algorithms;
different block sizes for the low-level algorithms;
different schedulings for the upper-level Strassen algorithm;
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Execution environment: Tlib
The runtime library Tlib allows the realization of hierarchically structured M-task programs;
Tlib is based on MPI and allows the use of arbitrary MPI functions.
The Tlib API provides support for
* the creation and administration of a dynamic hierarchy of processor groups;
* the coordination and mapping of nested M-tasks;
* the handling and termination of recursive calls and group splittings;
* the organization of communication between M-tasks;
The splitting and mapping can be adapted to the execution platform.
Interface of the Tlib library
Type of all Tlib tasks (basic and coordination):
void *F (void *arg, MPI_Comm comm, T_Descr *pdescr)
• void *arg: packed arguments for F;
• comm: MPI communicator for internal communication;
• pdescr: pointer to a descriptor of the executing processor group;
The function F may contain calls of Tlib functions to create sub-groups and to map sub-computations to sub-groups.
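For illustration, a minimal sketch of a basic task conforming to this type; the header name tlib.h and the packed-argument struct are assumptions, not part of the slides:

#include <stdlib.h>
#include <mpi.h>
#include "tlib.h"   /* assumed header providing T_Descr and the Tlib API */

struct vec_arg { double *v; int n; };   /* hypothetical argument pack */

/* Each group member sums a cyclic slice of the vector; the group then
   combines the partial sums over its own communicator. */
void *vec_sum(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    struct vec_arg *va = (struct vec_arg *) arg;
    double local = 0.0, *total = malloc(sizeof(double));
    int rank, size, i;

    (void) pdescr;                  /* not needed in this basic task */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (i = rank; i < va->n; i += size)
        local += va->v[i];
    MPI_Allreduce(&local, total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;   /* result, retrievable through a pres argument */
}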
Splitting operations for processor groups
int T_SplitGrp( T_Descr *pdescr,
                T_Descr *pdescr1,
                float per1,
                float per2)
• pdescr: current group descriptor
• pdescr1: new group descriptor for the two new subgroups
• per1: relative size of the first subgroup
• per2: relative size of the second subgroup

int T_SplitGrpParfor( int n,
                      T_Descr *pdescr,
                      T_Descr *pdescr1,
                      float p[])

int T_SplitGrpExpl( T_Descr *pdescr,
                    T_Descr *pdescr1,
                    int color,
                    int key)

(Diagrams: the group subdivision produced by each splitting operation.)
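A possible use of T_SplitGrp, sketched only from the signature above and assumed to be called inside a task that received pdescr:

T_Descr descr1;
/* split the group described by pdescr into two subgroups of equal
   size; afterwards descr1 describes the pair of subgroups */
T_SplitGrp(pdescr, &descr1, 0.5, 0.5);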
Concurrent execution of M-tasks
Mapping of two concurrent M-tasks to processor groups:
int T_Par( void *(*f1)(void *, MPI_Comm, T_Descr *),
           void *parg1, void *pres1,
           void *(*f2)(void *, MPI_Comm, T_Descr *),
           void *parg2, void *pres2,
           T_Descr *pdescr)
• f1: function for the first M-task
• parg1, pres1: packed arguments and results for f1
• f2: function for the second M-task
• parg2, pres2: packed arguments and results for f2
• pdescr: pointer to the current group descriptor describing two groups
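Continuing the sketch from the splitting slide, two tasks can then be mapped onto the subgroups described by descr1; task_a, task_b, their argument packs, and the result-passing convention are assumptions:

void *res_a = NULL, *res_b = NULL;
/* run task_a on the first subgroup and task_b on the second,
   concurrently; results are assumed to arrive via the pres arguments */
T_Par(task_a, arg_a, &res_a,
      task_b, arg_b, &res_b,
      &descr1);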
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Algorithm combinations
A combination of different algorithms for matrix multiplication can lead to very competitive implementations: Strassen method, tpMM, ATLAS;
(Illustration: the multilevel structure, with Strassen at the upper level; ring, pdgemm, and tpMM at the intermediate level; and BLAS dgemm from ESSL or ATLAS at the low level.)
Recursive splitting – Strassen algorithm
matrix multiplication C = AB with A, B, C ∈ ℝ^{n×n}: decompose A, B into square blocks of size n/2 × n/2:

  ( C11 C12 )   ( A11 A12 ) ( B11 B12 )
  ( C21 C22 ) = ( A21 A22 ) ( B21 B22 )

compute the submatrices C11, C12, C21, C22 separately according to
C11 = Q1 + Q4 − Q5 + Q7 C12 = Q3 + Q5
C21 = Q2 + Q4 C22 = Q1 + Q3 − Q2 + Q6
compute Q1, . . . ,Q7 by recursive calls:
Q1 = strassen(A11 + A22, B11 + B22);
Q2 = strassen(A21 + A22, B11);
Q3 = strassen(A11, B12 − B22);
Q4 = strassen(A22, B21 − B11);
Q5 = strassen(A11 + A12, B22);
Q6 = strassen(A21 − A11, B11 + B12);
Q7 = strassen(A12 − A22, B21 + B22);
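For a consistency check (elementary algebra, not from the slides), expanding the first combination recovers the blockwise product:

\[
\begin{aligned}
C_{11} &= Q_1 + Q_4 - Q_5 + Q_7 \\
       &= (A_{11}{+}A_{22})(B_{11}{+}B_{22}) + A_{22}(B_{21}{-}B_{11})
          - (A_{11}{+}A_{12})B_{22} + (A_{12}{-}A_{22})(B_{21}{+}B_{22}) \\
       &= A_{11}B_{11} + A_{12}B_{21},
\end{aligned}
\]

as blockwise multiplication requires; seven recursive multiplications thus replace the eight of the standard block scheme.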
Strassen algorithm – task scheduling
organization into four tasks Task_C11(), Task_C12(), Task_C21(), Task_C22():

Task_C11      Task_C12      Task_C21      Task_C22
on group 0    on group 1    on group 2    on group 3
compute Q1    compute Q3    compute Q2    compute Q1
compute Q7    compute Q5    compute Q4    compute Q6
receive Q5    send Q5       send Q2       receive Q2
receive Q4    send Q3       send Q4       receive Q3
compute C11   compute C12   compute C21   compute C22

realization with Tlib: Task_C11() and Task_q1()
Strassen algorithm – main Tlib coordination function
void *Strassen(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  T_Descr descr1;
  void *(*f[4])(void *, MPI_Comm, T_Descr *);
  void *pres[4], *parg[4];
  float per[4];

  per[0] = per[1] = per[2] = per[3] = 0.25;
  T_SplitGrpParfor(4, pdescr, &descr1, per);

  f[0] = Task_C11; f[1] = Task_C12;
  f[2] = Task_C21; f[3] = Task_C22;
  parg[0] = parg[1] = parg[2] = parg[3] = arg;
  T_Parfor(f, parg, pres, &descr1);
  /* combine the four result blocks into the result matrix */
  return assemble_matrix(pres[0], pres[1], pres[2], pres[3]);
}
Strassen algorithm – Task C11()
void *Task_C11(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  double **q1, **q4, **q5, **q7, **res;
  int i, j, n2;

  /* extract arguments from arg, including matrix size n */
  n2 = n / 2;
  q1 = Task_q1(arg, comm, pdescr);
  q7 = Task_q7(arg, comm, pdescr);
  /* allocate res, q4 and q5 as matrices of size n/2 x n/2 */
  /* receive q4 from group 2 and q5 from group 1 using the parent communicator */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++)
      res[i][j] = q1[i][j] + q4[i][j] - q5[i][j] + q7[i][j];
  return res;
}
Strassen algorithm – Task q1()
void *Task_q1(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  double **q11, **q12, **q1;
  struct MM_mul *mm, *arg1;
  int i, j, n2;

  mm = (struct MM_mul *) arg;
  n2 = (*mm).n / 2;
  /* allocate q1, q11, q12 as matrices of size n2 x n2 */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++) {
      q11[i][j] = (*mm).a[i][j] + (*mm).a[i+n2][j+n2]; /* A11 + A22 */
      q12[i][j] = (*mm).b[i][j] + (*mm).b[i+n2][j+n2]; /* B11 + B22 */
    }
  arg1 = (struct MM_mul *) malloc(sizeof(struct MM_mul));
  (*arg1).a = q11; (*arg1).b = q12; (*arg1).n = n2;
  q1 = Strassen(arg1, comm, pdescr);
  return q1;
}
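The slides do not show the definition of struct MM_mul; a plausible reconstruction from its uses in Task_q1() above (field names a, b, n as accessed there):

struct MM_mul {
    double **a;   /* left factor A */
    double **b;   /* right factor B */
    int n;        /* matrix dimension */
};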
tpmm for memory hierarchies
block structure of tpMM: p = 2^i processors

(Figure: block decomposition of the result matrix for p = 8: row 1 holds C_11, ..., C_18; row 2 holds C_21, ..., C_24; row 3 holds C_31, C_32; row 4 holds C_41; row l is computed in phase l.)

tpMM(n, p) =
  for (l = 1 to log p + 1)
    for k = 1, ..., p/2^(l-1) compute in parallel
      compute block (C_lk);
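A minimal sketch of the phase structure implied by this pseudocode; compute_block is a hypothetical placeholder, and the real blocks are computed by growing processor groups, which is not shown here:

/* Hypothetical per-block routine: computes block C_lk in phase l. */
void compute_block(int l, int k);

/* Phase structure of tpMM for p = 2^i processors: phase l computes the
   p/2^(l-1) blocks of row l; processor groups halve in number (and
   double in size) from one phase to the next. */
void tpmm(int n, int p)
{
    int l = 1, groups = p;   /* n: matrix size, used inside compute_block */
    while (groups >= 1) {                    /* log2(p) + 1 phases */
        for (int k = 1; k <= groups; k++)    /* "compute in parallel" */
            compute_block(l, k);
        groups /= 2;
        l++;
    }
}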
tpmm for memory hierarchies
tpMM: data distribution for matrix B and computation order for 8 processors:

(Figure: four snapshots for P0, ..., P7 showing how B is redistributed among the processors from phase to phase, while the already computed block rows 0, 1, 2 accumulate in the result.)
bottom level: ATLAS for one-processor matrix multiplication
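For instance, the one-processor multiplication at the bottom can be a call to the CBLAS interface that ATLAS provides; a generic sketch, not code from the slides:

#include <cblas.h>

/* C = A * B for square n x n matrices in row-major storage;
   ATLAS supplies an autotuned implementation of cblas_dgemm. */
void block_multiply(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,          /* M, N, K */
                1.0, A, n,        /* alpha, A, lda */
                B, n,             /* B, ldb */
                0.0, C, n);       /* beta, C, ldc */
}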
tpmm performance in isolation
MFLOPS per processor for 4 and 8 processors on IBM Regatta:
(Plots: MFLOPS per processor vs. matrix size; left: pdgemm vs. tpmm on 4 processors, sizes 1024 to 8192; right: pdgemm vs. tpmm on 8 processors, sizes 1024 to 9216.)
Combining Strassen and tpMM
p = 4^i · 2^j processors with i ≥ 1 and j ≥ 0:
4 processor groups are built at each recursion level of Strassen;
2^j processors are used for the execution of tpMM;
(e.g., p = 16 allows i = 2, j = 0, i.e. two Strassen levels with sequential tpMM, or i = 1, j = 2, i.e. one Strassen level with tpMM on groups of 4 processors;)
data layout for 16 processors:
(Figure: the four quadrant groups P0, ..., P3; P4, ..., P7; P8, ..., P11; P12, ..., P15, one per Strassen task.)
Combining Strassen and PDGEMM
p = 4^i · p̂ processors with p̂ = 2^d, i ≥ 1 and d ∈ {0, 1};
The processors are mapped row-blockwise onto the blocks of A, B and C so that the blocks have size n/r × n/c.
data layout for 16 processors:
(Figure: initial processor mapping, logical processor mapping, and the mapping after the split by colors for P0, ..., P15.)
Tuning of PDGEMM
impact of blocking factor on PDGEMM performance:
(Plots: MFLOPS per processor vs. blocking factor, from 1664 down to 1, for p = 64 with logical block size 128 (left) and 512 (right); one curve per matrix size from 1024 to 13312.)
Opteron cluster with Infiniband interconnect
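The blocking factor varied above is presumably the distribution block size nb of the ScaLAPACK array descriptors; a generic sketch of how it enters a PDGEMM call, under that assumption (standard ScaLAPACK usage, not the authors' tuning code; BLACS grid setup and error handling omitted):

/* Assumed C prototypes for the Fortran ScaLAPACK routines. */
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb,
                    double *beta, double *c, int *ic, int *jc, int *descc);

/* C = A * B for n x n matrices distributed in nb x nb blocks over the
   BLACS context ictxt; nb is the blocking factor varied above. */
void pdgemm_multiply(int ictxt, int n, int nb, int lld,
                     double *A, double *B, double *C)
{
    int desc[9], info, izero = 0, ione = 1;
    double alpha = 1.0, beta = 0.0;

    descinit_(desc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    pdgemm_("N", "N", &n, &n, &n, &alpha, A, &ione, &ione, desc,
            B, &ione, &ione, desc, &beta, C, &ione, &ione, desc);
}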
Tuning of PDGEMM
impact of logical block size on PDGEMM performance:
(Plot: MFLOPS per processor vs. logical block size lb, 0 to 1100, for PDGEMM with p = 32 and nb = 128; one curve per matrix size 6144, 8192, 10240, 12288.)
Opteron cluster with Infiniband interconnect
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Experimental evaluation: IBM Regatta p690
p = 32 and p = 64: MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix size 2048 to 12288. Left, 32 processors: pdgemm_64, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring, strassen_tpmm. Right, 64 processors: pdgemm, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_r3, strassen_ring, strassen_tpmm.)
Experimental evaluation: Dual Xeon Cluster 3 GHz
MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix size 2048 to 10240 for 16 processors (left) and 32 processors (right); curves: pdgemm_tester, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring, strassen_tpmm_plug.)
MFLOPS per processor for 16 and 32 processors
Experimental evaluation: Infiniband Cluster
MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix dimension 4096 to 12288, TCP over Infiniband, for 32 processors (left) and 64 processors (right); curves: Strassen + PDGEMM (def/opt), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM, Strassen + PDGEMM (2 recs, def/opt), and on 64 processors additionally Strassen + DGEMM (3 recs).)
MFLOPS per processor for 32 and 64 processors
Experimental evaluation: SGI Altix
p = 16: MFLOPS for different algorithmic combinations:
(Plot: MFLOPS per processor vs. matrix dimension 4096 to 12288 for 16 processors of HLRB II; curves: Strassen + PDGEMM (def/opt), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM, Strassen + DGEMM (2 recs).)
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Conclusions
The combination of different algorithms as building blocks leads to competitive parallel implementations;
There are many different ways to combine the building blocks; for different platforms, different combinations lead to the best performance;
automatic support for finding the best combination is useful;
outlook: the combination of different algorithms might be especially useful for heterogeneous platforms;
further information: [email protected], [email protected]