Combining Building Blocks for Parallel Multilevel Matrix Multiplication
Sascha Hunold¹, Thomas Rauber¹, Gudula Rünger²
¹Department of Computer Science, University of Bayreuth
²Department of Computer Science, Chemnitz University of Technology
PMAA’06, Rennes
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Introduction
matrix multiplication is one of the core computations in many algorithms from scientific computing; → many efficient realizations have been invented;
efficient sequential implementations: ATLAS, PHiPAC;
efficient parallel implementations: SUMMA, PDGEMM;
execution platforms differ in important characteristics (memory hierarchy, communication architecture); an efficient implementation must exploit these characteristics; → adaptivity is an important property;
Introduction
multiprocessor task (M-task) implementations have been successful on many platforms in different application areas; advantage: reduction in communication time, better scalability;
approach: M-task implementation of matrix multiplication;
goal: competitiveness with ScaLAPACK and vendor-specific implementations;
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Specific Approach
hierarchical M-task implementation with a combination of different algorithms to exploit specific characteristics of the execution platform;
lower level: efficient adaptive and/or highly tuned sequential algorithm to exploit the single-processor architecture;
upper level: hierarchical (recursive) algorithm to improve communication behavior → Strassen algorithm;
intermediate level: hierarchical (recursive) algorithm tpMM to improve memory access and communication behavior for multi-processor nodes;
Specific Approach
result: a multi-level algorithm which allows adaptation to different execution platforms:
different combinations of algorithms;
different recursion depths for the upper-level Strassen algorithm;
different lower-level sequential algorithms;
different block sizes for the low-level algorithms;
different schedulings for the upper-level Strassen algorithm;
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Execution environment: Tlib
The runtime library Tlib allows the realization of hierarchically structured M-task programs;
Tlib is based on MPI and allows the use of arbitrary MPI functions.
The Tlib API provides support for
* the creation and administration of a dynamic hierarchy of processor groups;
* the coordination and mapping of nested M-tasks;
* the handling and termination of recursive calls and group splittings;
* the organization of communication between M-tasks;
The splitting and mapping can be adapted to the execution platform.
Interface of the Tlib library
Type of all Tlib tasks (basic and coordination):
void *F (void *arg, MPI_Comm comm, T_Descr *pdescr)
• void *arg: packed arguments for F;
• comm: MPI communicator for internal communication;
• pdescr: pointer to a descriptor of the executing processor group;
The function F may contain calls of Tlib functions to create sub-groups and to map sub-computations to sub-groups.
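For illustration, a minimal sketch of a basic task conforming to this type; the header name tlib.h and the packed-argument struct are assumptions, not part of the slides:

#include <stdlib.h>
#include <mpi.h>
#include "tlib.h"   /* assumed header providing T_Descr and the Tlib API */

struct vec_arg { double *v; int n; };   /* hypothetical argument pack */

/* Each group member sums a cyclic slice of the vector; the group then
   combines the partial sums over its own communicator. */
void *vec_sum(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    struct vec_arg *va = (struct vec_arg *) arg;
    double local = 0.0, *total = malloc(sizeof(double));
    int rank, size, i;

    (void) pdescr;                  /* not needed in this basic task */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (i = rank; i < va->n; i += size)
        local += va->v[i];
    MPI_Allreduce(&local, total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;   /* result, retrievable through a pres argument */
}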
Splitting operations for processor groups
int T_SplitGrp( T_Descr *pdescr,
                T_Descr *pdescr1,
                float per1,
                float per2)
• pdescr: current group descriptor
• pdescr1: new group descriptor for the two new subgroups
• per1: relative size of the first subgroup
• per2: relative size of the second subgroup

int T_SplitGrpParfor( int n,
                      T_Descr *pdescr,
                      T_Descr *pdescr1,
                      float p[])

int T_SplitGrpExpl( T_Descr *pdescr,
                    T_Descr *pdescr1,
                    int color,
                    int key)

(Diagrams: the group subdivision produced by each splitting operation.)
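A possible use of T_SplitGrp, sketched only from the signature above and assumed to be called inside a task that received pdescr:

T_Descr descr1;
/* split the group described by pdescr into two subgroups of equal
   size; afterwards descr1 describes the pair of subgroups */
T_SplitGrp(pdescr, &descr1, 0.5, 0.5);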
Concurrent execution of M-tasks
Mapping of two concurrent M-tasks to processor groups:
int T_Par( void *(*f1)(void *, MPI_Comm, T_Descr *),
           void *parg1, void *pres1,
           void *(*f2)(void *, MPI_Comm, T_Descr *),
           void *parg2, void *pres2,
           T_Descr *pdescr)
• f1: function for the first M-task
• parg1, pres1: packed arguments and results for f1
• f2: function for the second M-task
• parg2, pres2: packed arguments and results for f2
• pdescr: pointer to the current group descriptor describing two groups
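Continuing the sketch from the splitting slide, two tasks can then be mapped onto the subgroups described by descr1; task_a, task_b, their argument packs, and the result-passing convention are assumptions:

void *res_a = NULL, *res_b = NULL;
/* run task_a on the first subgroup and task_b on the second,
   concurrently; results are assumed to arrive via the pres arguments */
T_Par(task_a, arg_a, &res_a,
      task_b, arg_b, &res_b,
      &descr1);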
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Algorithm combinations
A combination of different algorithms for matrix multiplication can lead to very competitive implementations: Strassen method, tpMM, ATLAS;
(Illustration: the multilevel structure, with Strassen at the upper level; ring, pdgemm, and tpMM at the intermediate level; and BLAS dgemm from ESSL or ATLAS at the low level.)
Recursive splitting – Strassen algorithm
matrix multiplication C = AB with A, B, C ∈ ℝ^{n×n}: decompose A, B into square blocks of size n/2 × n/2:

  ( C11 C12 )   ( A11 A12 ) ( B11 B12 )
  ( C21 C22 ) = ( A21 A22 ) ( B21 B22 )

compute the submatrices C11, C12, C21, C22 separately according to
C11 = Q1 + Q4 − Q5 + Q7 C12 = Q3 + Q5
C21 = Q2 + Q4 C22 = Q1 + Q3 − Q2 + Q6
compute Q1, . . . ,Q7 by recursive calls:
Q1 = strassen(A11 + A22, B11 + B22);
Q2 = strassen(A21 + A22, B11);
Q3 = strassen(A11, B12 − B22);
Q4 = strassen(A22, B21 − B11);
Q5 = strassen(A11 + A12, B22);
Q6 = strassen(A21 − A11, B11 + B12);
Q7 = strassen(A12 − A22, B21 + B22);
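For a consistency check (elementary algebra, not from the slides), expanding the first combination recovers the blockwise product:

\[
\begin{aligned}
C_{11} &= Q_1 + Q_4 - Q_5 + Q_7 \\
       &= (A_{11}{+}A_{22})(B_{11}{+}B_{22}) + A_{22}(B_{21}{-}B_{11})
          - (A_{11}{+}A_{12})B_{22} + (A_{12}{-}A_{22})(B_{21}{+}B_{22}) \\
       &= A_{11}B_{11} + A_{12}B_{21},
\end{aligned}
\]

as blockwise multiplication requires; seven recursive multiplications thus replace the eight of the standard block scheme.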
Strassen algorithm – task scheduling
organization into four tasks Task_C11(), Task_C12(), Task_C21(), Task_C22():

Task_C11      Task_C12      Task_C21      Task_C22
on group 0    on group 1    on group 2    on group 3
compute Q1    compute Q3    compute Q2    compute Q1
compute Q7    compute Q5    compute Q4    compute Q6
receive Q5    send Q5       send Q2       receive Q2
receive Q4    send Q3       send Q4       receive Q3
compute C11   compute C12   compute C21   compute C22

realization with Tlib: Task_C11() and Task_q1()
Strassen algorithm – main Tlib coordination function
void *Strassen(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  T_Descr descr1;
  void *(*f[4])(void *, MPI_Comm, T_Descr *);
  void *pres[4], *parg[4];
  float per[4];

  per[0] = per[1] = per[2] = per[3] = 0.25;
  T_SplitGrpParfor(4, pdescr, &descr1, per);

  f[0] = Task_C11; f[1] = Task_C12;
  f[2] = Task_C21; f[3] = Task_C22;
  parg[0] = parg[1] = parg[2] = parg[3] = arg;
  T_Parfor(f, parg, pres, &descr1);
  /* combine the four result blocks into the result matrix */
  return assemble_matrix(pres[0], pres[1], pres[2], pres[3]);
}
Strassen algorithm – Task C11()
void *Task_C11(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  double **q1, **q4, **q5, **q7, **res;
  int i, j, n2;

  /* extract arguments from arg, including matrix size n */
  n2 = n / 2;
  q1 = Task_q1(arg, comm, pdescr);
  q7 = Task_q7(arg, comm, pdescr);
  /* allocate res, q4 and q5 as matrices of size n/2 x n/2 */
  /* receive q4 from group 2 and q5 from group 1 using the parent communicator */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++)
      res[i][j] = q1[i][j] + q4[i][j] - q5[i][j] + q7[i][j];
  return res;
}
Strassen algorithm – Task q1()
void *Task_q1(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
  double **q11, **q12, **q1;
  struct MM_mul *mm, *arg1;
  int i, j, n2;

  mm = (struct MM_mul *) arg;
  n2 = (*mm).n / 2;
  /* allocate q1, q11, q12 as matrices of size n2 x n2 */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++) {
      q11[i][j] = (*mm).a[i][j] + (*mm).a[i+n2][j+n2]; /* A11 + A22 */
      q12[i][j] = (*mm).b[i][j] + (*mm).b[i+n2][j+n2]; /* B11 + B22 */
    }
  arg1 = (struct MM_mul *) malloc(sizeof(struct MM_mul));
  (*arg1).a = q11; (*arg1).b = q12; (*arg1).n = n2;
  q1 = Strassen(arg1, comm, pdescr);
  return q1;
}
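The slides do not show the definition of struct MM_mul; a plausible reconstruction from its uses in Task_q1() above (field names a, b, n as accessed there):

struct MM_mul {
    double **a;   /* left factor A */
    double **b;   /* right factor B */
    int n;        /* matrix dimension */
};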
tpmm for memory hierarchies
block structure of tpMM: p = 2^i processors

(Figure: block decomposition of the result matrix for p = 8: row 1 holds C_11, ..., C_18; row 2 holds C_21, ..., C_24; row 3 holds C_31, C_32; row 4 holds C_41; row l is computed in phase l.)

tpMM(n, p) =
  for (l = 1 to log p + 1)
    for k = 1, ..., p/2^(l-1) compute in parallel
      compute block (C_lk);
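A minimal sketch of the phase structure implied by this pseudocode; compute_block is a hypothetical placeholder, and the real blocks are computed by growing processor groups, which is not shown here:

/* Hypothetical per-block routine: computes block C_lk in phase l. */
void compute_block(int l, int k);

/* Phase structure of tpMM for p = 2^i processors: phase l computes the
   p/2^(l-1) blocks of row l; processor groups halve in number (and
   double in size) from one phase to the next. */
void tpmm(int n, int p)
{
    int l = 1, groups = p;   /* n: matrix size, used inside compute_block */
    while (groups >= 1) {                    /* log2(p) + 1 phases */
        for (int k = 1; k <= groups; k++)    /* "compute in parallel" */
            compute_block(l, k);
        groups /= 2;
        l++;
    }
}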
tpmm for memory hierarchies
tpMM: data distribution for matrix B and computation order for 8 processors:

(Figure: four snapshots for P0, ..., P7 showing how B is redistributed among the processors from phase to phase, while the already computed block rows 0, 1, 2 accumulate in the result.)
bottom level: ATLAS for one-processor matrix multiplication
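For instance, the one-processor multiplication at the bottom can be a call to the CBLAS interface that ATLAS provides; a generic sketch, not code from the slides:

#include <cblas.h>

/* C = A * B for square n x n matrices in row-major storage;
   ATLAS supplies an autotuned implementation of cblas_dgemm. */
void block_multiply(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,          /* M, N, K */
                1.0, A, n,        /* alpha, A, lda */
                B, n,             /* B, ldb */
                0.0, C, n);       /* beta, C, ldc */
}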
tpmm performance in isolation
MFLOPS per processor for 4 and 8 processors on IBM Regatta:
(Plots: MFLOPS per processor vs. matrix size; left: pdgemm vs. tpmm on 4 processors, sizes 1024 to 8192; right: pdgemm vs. tpmm on 8 processors, sizes 1024 to 9216.)
Combining Strassen and tpMM
p = 4^i · 2^j processors with i ≥ 1 and j ≥ 0:
4 processor groups are built at each recursion level of Strassen;
2^j processors are used for the execution of tpMM;
(e.g., p = 16 allows i = 2, j = 0, i.e. two Strassen levels with sequential tpMM, or i = 1, j = 2, i.e. one Strassen level with tpMM on groups of 4 processors;)
data layout for 16 processors:
(Figure: the four quadrant groups P0, ..., P3; P4, ..., P7; P8, ..., P11; P12, ..., P15, one per Strassen task.)
Combining Strassen and PDGEMM
p = 4^i · p̂ processors with p̂ = 2^d, i ≥ 1 and d ∈ {0, 1};
The processors are mapped row-blockwise onto the blocks of A, B and C so that the blocks have size n/r × n/c.
data layout for 16 processors:
(Figure: initial processor mapping, logical processor mapping, and the mapping after the split by colors for P0, ..., P15.)
Tuning of PDGEMM
impact of blocking factor on PDGEMM performance:
(Plots: MFLOPS per processor vs. blocking factor, from 1664 down to 1, for p = 64 with logical block size 128 (left) and 512 (right); one curve per matrix size from 1024 to 13312.)
Opteron cluster with Infiniband interconnect
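The blocking factor varied above is presumably the distribution block size nb of the ScaLAPACK array descriptors; a generic sketch of how it enters a PDGEMM call, under that assumption (standard ScaLAPACK usage, not the authors' tuning code; BLACS grid setup and error handling omitted):

/* Assumed C prototypes for the Fortran ScaLAPACK routines. */
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdgemm_(char *transa, char *transb, int *m, int *n, int *k,
                    double *alpha, double *a, int *ia, int *ja, int *desca,
                    double *b, int *ib, int *jb, int *descb,
                    double *beta, double *c, int *ic, int *jc, int *descc);

/* C = A * B for n x n matrices distributed in nb x nb blocks over the
   BLACS context ictxt; nb is the blocking factor varied above. */
void pdgemm_multiply(int ictxt, int n, int nb, int lld,
                     double *A, double *B, double *C)
{
    int desc[9], info, izero = 0, ione = 1;
    double alpha = 1.0, beta = 0.0;

    descinit_(desc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    pdgemm_("N", "N", &n, &n, &n, &alpha, A, &ione, &ione, desc,
            B, &ione, &ione, desc, &beta, C, &ione, &ione, desc);
}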
Tuning of PDGEMM
impact of logical block size on PDGEMM performance:
(Plot: MFLOPS per processor vs. logical block size lb, 0 to 1100, for PDGEMM with p = 32 and nb = 128; one curve per matrix size 6144, 8192, 10240, 12288.)
Opteron cluster with Infiniband interconnect
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Experimental evaluation: IBM Regatta p690
p = 32 and p = 64: MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix size 2048 to 12288. Left, 32 processors: pdgemm_64, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring, strassen_tpmm. Right, 64 processors: pdgemm, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_r3, strassen_ring, strassen_tpmm.)
Experimental evaluation: Dual Xeon Cluster 3 GHz
MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix size 2048 to 10240 for 16 processors (left) and 32 processors (right); curves: pdgemm_tester, strassen_pdgemm_r1, strassen_pdgemm_r2, strassen_ring, strassen_tpmm_plug.)
MFLOPS per processor for 16 and 32 processors
Experimental evaluation: Infiniband Cluster
MFLOPS for different algorithmic combinations:
(Plots: MFLOPS per processor vs. matrix dimension 4096 to 12288, TCP over Infiniband, for 32 processors (left) and 64 processors (right); curves: Strassen + PDGEMM (def/opt), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM, Strassen + PDGEMM (2 recs, def/opt), and on 64 processors additionally Strassen + DGEMM (3 recs).)
MFLOPS per processor for 32 and 64 processors
Experimental evaluation: SGI Altix
p = 16: MFLOPS for different algorithmic combinations:
(Plot: MFLOPS per processor vs. matrix dimension 4096 to 12288 for 16 processors of HLRB II; curves: Strassen + PDGEMM (def/opt), PDGEMM (def/opt), Strassen + Ring, Strassen + tpMM, Strassen + DGEMM (2 recs).)
Outline
1 Introduction
2 Specific Approach
3 Execution environment: Tlib
4 Algorithm combinations
   Recursive splitting – Strassen algorithm
   tpMM – a new recursive algorithm for memory hierarchies
   Tuning of PDGEMM for better performance
   Combination of different algorithms
5 Experimental results
6 Conclusions
Conclusions
The combination of different algorithms as building blocks leads to competitive parallel implementations;
There are many different ways to combine the building blocks; for different platforms, different combinations lead to the best performance;
automatic support for finding the best combination is useful;
outlook: the combination of different algorithms might be especially useful for heterogeneous platforms;
further information: [email protected], [email protected]