
Page 1: pmaa06.irisa.fr/pres/33-Rauber-PMAA06.pdf (2006-09-11)

Combining Building Blocks for Parallel Multilevel Matrix Multiplication

Sascha Hunold¹, Thomas Rauber¹, Gudula Rünger²

¹ Department of Computer Science, University of Bayreuth
² Department of Computer Science, Chemnitz University of Technology

PMAA’06, Rennes

Hunold, Rauber, Rünger: Building Blocks for Multilevel Matrix Multiplication, PMAA'06, Rennes, 1/33

Page 2

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 3

Introduction

matrix multiplication is one of the core computations in many algorithms from scientific computing; → many efficient realizations have been invented;

efficient sequential implementations: ATLAS, PHiPAC;
efficient parallel implementations: SUMMA, PDGEMM;

execution platforms differ in important characteristics (memory hierarchy, communication architecture); an efficient implementation must exploit these characteristics; → adaptivity is an important property;

Page 4

Introduction

multiprocessor task (M-task) implementations have been successful on many platforms in different application areas;
advantage: reduction in communication time, better scalability;

approach: M-task implementation of matrix multiplication;

goal: competitiveness with ScaLAPACK and vendor-specific implementations;

Page 5

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 6

Specific Approach

hierarchical M-task implementation with a combination of different algorithms to exploit specific characteristics of the execution platform;

lower level: efficient, adaptive and/or highly tuned sequential algorithm to exploit the single-processor architecture;

upper level: hierarchical (recursive) algorithm to improve communication behavior → Strassen algorithm;

intermediate level: hierarchical (recursive) algorithm tpmm to improve memory access and communication behavior for multi-processor nodes;

Page 7

Specific Approach

result: a multi-level algorithm that allows adaptation to different execution platforms:

different combinations of algorithms;

different levels of recursion of the upper-level Strassen algorithm;

different lower level sequential algorithms;

different block sizes for low-level algorithms;

different schedules for the upper-level Strassen algorithm;

Page 8

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 9

Execution environment: Tlib

The runtime library Tlib allows the realization of hierarchically structured M-task programs;

Tlib is based on MPI and allows the use of arbitrary MPI functions.

The Tlib API provides support for

* the creation and administration of a dynamic hierarchy of processor groups;
* the coordination and mapping of nested M-tasks;
* the handling and termination of recursive calls and group splittings;
* the organization of communication between M-tasks;

The splitting and mapping can be adapted to the execution platform.

Page 10

Interface of the Tlib library

Type of all Tlib tasks (basic and coordination):

void *F(void *arg, MPI_Comm comm, T_Descr *pdescr);

• arg: packed arguments for F;
• comm: MPI communicator for internal communication;
• pdescr: pointer to a descriptor of the executing processor group;

The function F may contain calls of Tlib functions to create sub-groups and to map sub-computations to sub-groups.

Page 11

Splitting operations for processor groups

int T_SplitGrp(T_Descr *pdescr, T_Descr *pdescr1, float per1, float per2);

• pdescr: current group descriptor;
• pdescr1: new group descriptor for the two new subgroups;
• per1: relative size of the first subgroup;
• per2: relative size of the second subgroup;

Further group-subdivision operations:

int T_SplitGrpParfor(int n, T_Descr *pdescr, T_Descr *pdescr1, float p[]);

int T_SplitGrpExpl(T_Descr *pdescr, T_Descr *pdescr1, int color, int key);

Page 12

Concurrent execution of M-tasks

Mapping of two concurrent M-tasks to processor groups:

int T_Par(void *(*f1)(void *, MPI_Comm, T_Descr *),
          void *parg1, void *pres1,
          void *(*f2)(void *, MPI_Comm, T_Descr *),
          void *parg2, void *pres2,
          T_Descr *pdescr);

• f1: function for the first M-task;
• parg1, pres1: packed arguments and results for f1;
• f2: function for the second M-task;
• parg2, pres2: packed arguments and results for f2;
• pdescr: pointer to the current group descriptor describing two groups;

Page 13

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 14

Algorithm combinations

A combination of different algorithms for matrix multiplication can lead to very competitive implementations: Strassen method, tpMM, ATLAS;

Illustration:

(Diagram: building-block hierarchy — Strassen on top; ring, pdgemm and tpMM on the intermediate level; low level: BLAS dgemm from ESSL or ATLAS.)

Page 15

Recursive splitting – Strassen algorithm

matrix multiplication C = AB with A, B, C ∈ R^(n×n): decompose A, B into square blocks of size n/2:

    ( C11 C12 )   ( A11 A12 )   ( B11 B12 )
    ( C21 C22 ) = ( A21 A22 ) * ( B21 B22 )

compute the submatrices C11, C12, C21, C22 separately according to

    C11 = Q1 + Q4 − Q5 + Q7        C12 = Q3 + Q5
    C21 = Q2 + Q4                  C22 = Q1 + Q3 − Q2 + Q6

compute Q1, . . . ,Q7 by recursive calls:

Q1 = strassen(A11 + A22, B11 + B22);

Q2 = strassen(A21 + A22, B11);

Q3 = strassen(A11, B12 − B22);

Q4 = strassen(A22, B21 − B11);

Q5 = strassen(A11 + A12, B22);

Q6 = strassen(A21 − A11, B11 + B12);

Q7 = strassen(A12 − A22, B21 + B22);
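A single level of this recursion can be sketched in plain, sequential C to make the seven products concrete. This is only an illustration with a naive base case, not the parallel Tlib realization on the following slides; the helper names (`strassen_level`, `addb`, `mul`) are made up here:

```c
#include <assert.h>

#define N 4            /* matrix size, N even (illustrative) */
#define H (N/2)        /* block size n/2 */

/* res = a + sgn*b, elementwise, for H x H blocks */
static void addb(const double *a, const double *b, int sgn, double *res) {
    for (int i = 0; i < H * H; i++) res[i] = a[i] + sgn * b[i];
}

/* naive H x H multiply: c = a*b (base case; the real algorithm recurses here) */
static void mul(const double *a, const double *b, double *c) {
    for (int i = 0; i < H; i++)
        for (int j = 0; j < H; j++) {
            double s = 0.0;
            for (int k = 0; k < H; k++) s += a[i*H + k] * b[k*H + j];
            c[i*H + j] = s;
        }
}

/* One Strassen level: takes the four H x H blocks of A and B and fills
   the four blocks of C = A*B using the seven products Q1..Q7 above. */
void strassen_level(const double *A11, const double *A12,
                    const double *A21, const double *A22,
                    const double *B11, const double *B12,
                    const double *B21, const double *B22,
                    double *C11, double *C12, double *C21, double *C22) {
    double t1[H*H], t2[H*H], Q[7][H*H];
    addb(A11, A22, +1, t1); addb(B11, B22, +1, t2); mul(t1, t2, Q[0]);  /* Q1 */
    addb(A21, A22, +1, t1);                         mul(t1, B11, Q[1]); /* Q2 */
    addb(B12, B22, -1, t2);                         mul(A11, t2, Q[2]); /* Q3 */
    addb(B21, B11, -1, t2);                         mul(A22, t2, Q[3]); /* Q4 */
    addb(A11, A12, +1, t1);                         mul(t1, B22, Q[4]); /* Q5 */
    addb(A21, A11, -1, t1); addb(B11, B12, +1, t2); mul(t1, t2, Q[5]);  /* Q6 */
    addb(A12, A22, -1, t1); addb(B21, B22, +1, t2); mul(t1, t2, Q[6]);  /* Q7 */
    for (int i = 0; i < H * H; i++) {
        C11[i] = Q[0][i] + Q[3][i] - Q[4][i] + Q[6][i];
        C12[i] = Q[2][i] + Q[4][i];
        C21[i] = Q[1][i] + Q[3][i];
        C22[i] = Q[0][i] + Q[2][i] - Q[1][i] + Q[5][i];
    }
}
```

Recursing in `mul` instead of multiplying naively, and stopping the recursion at a tuned sequential dgemm, yields the multilevel scheme discussed in the talk.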

Page 16

Strassen algorithm – task scheduling

organization into four tasks Task_C11(), Task_C12(), Task_C21(), Task_C22():

Task C11      Task C12      Task C21      Task C22
on group 0    on group 1    on group 2    on group 3
-----------   -----------   -----------   -----------
compute Q1    compute Q3    compute Q2    compute Q1
compute Q7    compute Q5    compute Q4    compute Q6
receive Q5    send Q5       send Q2       receive Q2
receive Q4    send Q3       send Q4       receive Q3
compute C11   compute C12   compute C21   compute C22

realization with Tlib: Task_C11() and Task_q1()

Page 17

Strassen algorithm – main Tlib coordination function

void *Strassen(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    T_Descr descr1;
    void *(*f[4])(), *pres[4], *parg[4];
    float per[4];

    per[0] = per[1] = per[2] = per[3] = 0.25;
    T_SplitGrpParfor(4, pdescr, &descr1, per);

    f[0] = Task_C11; f[1] = Task_C12;
    f[2] = Task_C21; f[3] = Task_C22;
    parg[0] = parg[1] = parg[2] = parg[3] = arg;
    T_Parfor(f, parg, pres, &descr1);
    assemble_matrix(pres[0], pres[1], pres[2], pres[3]);
}

Page 18

Strassen algorithm – Task C11()

void *Task_C11(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    double **q1, **q4, **q5, **q7, **res;
    int i, j, n2;

    /* extract arguments from arg, including matrix size n */
    n2 = n / 2;
    q1 = Task_q1(arg, comm, pdescr);
    q7 = Task_q7(arg, comm, pdescr);

    /* allocate res, q4 and q5 as matrices of size n/2 x n/2 */
    /* receive q4 from group 2 and q5 from group 1 using the parent comm. */
    for (i = 0; i < n2; i++)
        for (j = 0; j < n2; j++)
            res[i][j] = q1[i][j] + q4[i][j] - q5[i][j] + q7[i][j];

    return res;
}

Page 19

Strassen algorithm – Task q1()

void *Task_q1(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    double **res, **q11, **q12, **q1;
    struct MM_mul *mm, *arg1;
    int i, j, n2;

    mm = (struct MM_mul *) arg;
    n2 = mm->n / 2;

    /* allocate q1, q11, q12 as matrices of size n2 x n2 */
    for (i = 0; i < n2; i++)
        for (j = 0; j < n2; j++) {
            q11[i][j] = mm->a[i][j] + mm->a[i+n2][j+n2];  /* A11 + A22 */
            q12[i][j] = mm->b[i][j] + mm->b[i+n2][j+n2];  /* B11 + B22 */
        }

    arg1 = (struct MM_mul *) malloc(sizeof(struct MM_mul));
    arg1->a = q11; arg1->b = q12; arg1->n = n2;
    q1 = Strassen(arg1, comm, pdescr);
    return q1;
}

Page 20

tpmm for memory hierarchies

block structure of tpmm for p = 2^i processors:

(Figure: blocks C_lk of the result matrix, computed in Phases 1–4.)

tpMM(n, p):
    for l = 1 to log2(p) + 1
        for k = 1, ..., p / 2^(l-1) compute in parallel:
            compute block(C_lk);
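The loop structure implies that level l computes p/2^(l-1) blocks in parallel, i.e. 2p − 1 block computations in total over all levels. A tiny counting sketch (the function name is assumed, not from the talk):

```c
#include <assert.h>

/* Count the C_lk blocks tpmm's schedule visits for p = 2^i processors:
   level l = 1 .. log2(p)+1 computes p / 2^(l-1) blocks in parallel,
   so the total is p + p/2 + ... + 1 = 2p - 1. */
int tpmm_block_count(int p) {
    int total = 0;
    for (int width = p; width >= 1; width /= 2)  /* width = p / 2^(l-1) */
        total += width;
    return total;
}
```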

Page 21

tpmm for memory hierarchies

tpmm: data distribution for matrix B and computation order for 8 processors:

(Figure: per-phase placement of the blocks of B on P0–P7 and the resulting computation order.)

bottom level: Atlas for one-processor matrix multiplication

Page 22

tpmm performance in isolation

MFLOPS per processor for 4 and 8 processors on IBM Regatta:

(Plots: MFLOPS per processor vs. matrix size, 1024–8192 for pdgemm and tpmm on 4 processors, 1024–9216 on 8 processors.)

Page 23

Combining Strassen and tpMM

p = 4^i · 2^j processors with i ≥ 1 and j ≥ 0: four processor groups are built at each recursion level of Strassen; 2^j processors are used for the execution of tpmm;

data layout for 16 processors:

(Figure: four groups of four processors each, P0–P3, P4–P7, P8–P11, P12–P15, mapped onto the matrix blocks.)
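A small helper can illustrate how such a processor count decomposes into Strassen levels and tpMM groups; the function name and the greedy choice of maximizing Strassen levels are assumptions here, not taken from the talk:

```c
#include <assert.h>

/* Decompose p = 4^i * 2^j (i >= 1, j >= 0) into the number of Strassen
   recursion levels i (each level splits into 4 groups) and the tpMM
   group size 2^j.  Greedily peels off as many factors of 4 as possible;
   returns 0 if p admits no such decomposition. */
int split_processors(int p, int *strassen_levels, int *tpmm_group) {
    int i = 0;
    while (p % 4 == 0) { p /= 4; i++; }          /* peel off factors of 4 */
    if (i < 1 || (p & (p - 1)) != 0) return 0;    /* remainder must be 2^j */
    *strassen_levels = i;
    *tpmm_group = p;                              /* remaining power of two */
    return 1;
}
```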

Page 24

Combining Strassen and PDGEMM

p = 4^i · 2^d processors with i ≥ 1 and d ∈ {0, 1}; the processors are mapped row-blockwise onto the blocks of A, B and C so that the blocks have size n/r × n/c.

data layout for 16 processors:

(Figure: initial processor mapping of P0–P15, and the logical processor mapping after the split by colors.)

Page 25

Tuning of PDGEMM

impact of blocking factor on PDGEMM performance:

(Plots: MFLOPS per processor vs. blocking factor, from 1664 down to 1, for p = 64 with logical block sizes 128 and 512; matrix sizes 1024–13312.)

Opteron cluster with Infiniband interconnect
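The blocking factor studied here is PDGEMM's distribution granularity; the same blocking idea at the cache level can be illustrated with a tiled sequential multiply in plain C (N and NB are illustrative values, not PDGEMM parameters):

```c
#include <assert.h>

#define N  8    /* matrix dimension (illustrative) */
#define NB 4    /* blocking factor / tile size (illustrative, divides N) */

/* C += A * B with square NB x NB tiling; the tiles keep the working set
   small so the innermost loops run out of cache, which is why the
   blocking factor has such a strong performance impact. */
void dgemm_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += NB)
        for (int kk = 0; kk < N; kk += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int k = kk; k < kk + NB; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + NB; j++)
                            C[i][j] += aik * B[k][j];
                    }
}
```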

Page 26

Tuning of PDGEMM

impact of logical block size on PDGEMM performance:

(Plot: MFLOPS per processor vs. logical block size lb, 0–1100, for PDGEMM with p = 32, nb = 128; matrix sizes 6144, 8192, 10240, 12288.)

Opteron cluster with Infiniband interconnect

Page 27

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 28

Experimental evaluation: IBM Regatta p690

p = 32 and p = 64: MFLOPS for different algorithmic combinations:

(Plots: MFLOPS per processor vs. matrix size, 2048–12288; 32 processors: pdgemm_64, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring, strassen_tpmm; 64 processors: pdgemm, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_r3, strassen_ring, strassen_tpmm.)

Page 29

Experimental evaluation: Dual Xeon Cluster 3 GHz

MFLOPS for different algorithmic combinations:

(Plots: MFLOPS per processor vs. matrix size, 2048–10240, for pdgemm_tester, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring and strassen_tpmm_plug on 16 and 32 processors.)

MFLOPS per processor for 16 and 32 processors

Page 30

Experimental evaluation: Infiniband Cluster

MFLOPS for different algorithmic combinations:

(Plots: MFLOPS per processor vs. matrix dimension, 4096–12288, TCP over Infiniband, 32 and 64 processors; variants: Strassen + PDGEMM (def/opt, also with 2 and 3 recursions), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM.)

MFLOPS per processor for 32 and 64 processors

Page 31

Experimental evaluation: SGI Altix

p = 16: MFLOPS for different algorithmic combinations:

(Plot: MFLOPS per processor vs. matrix dimension, 4096–12288, 16 processors, HLRB II; variants: Strassen + PDGEMM (def/opt), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM, Strassen + DGEMM (2 recs).)

Page 32

Outline

1 Introduction

2 Specific Approach

3 Execution environment: Tlib

4 Algorithm combinations
      Recursive splitting – Strassen algorithm
      tpmm - a new recursive algorithm for memory hierarchies
      Tuning of PDGEMM for better performance
      Combination of different algorithms

5 Experimental results

6 Conclusions

Page 33

Conclusions

The combination of different algorithms as building blocks leads to competitive parallel implementations;

There are many different ways to combine the building blocks; for different platforms, different combinations lead to the best performance;

automatic support for finding the best combination is useful;

outlook: the combination of different algorithms might be especiallyuseful for heterogeneous platforms;

further information: [email protected], [email protected] (cs.tu-chemnitz.de)
