improving communication performance in dense linear algebra … · 2016. 8. 18. · collective...


Improving communication performance in dense linear algebra via topology-aware collectives

Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗

∗ University of California, Berkeley
† Lawrence Livermore National Laboratory

Supercomputing, November 2011




Outline

- Collective communication
  - Rectangular collectives
- 2.5D algorithms
  - 2.5D Matrix Multiplication
  - 2.5D LU factorization
- Modelling exascale
  - Multicast performance
  - MM and LU performance




Performance of multicast (BG/P vs Cray)

[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. x-axis: #nodes (8 to 4096); y-axis: bandwidth (MB/sec), 128 to 8192; series: BG/P, XE6, XT5.]




Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
    - Route in different dimensional order
    - Use both directions of bidirectional network
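The binomial multicast described above can be sketched as a round-by-round schedule over a flat list of ranks. This is a minimal illustration, not the Cray implementation: the function name `binomial_broadcast_schedule` and the choice of rank 0 as root are assumptions made here. Because the tree is built from rank numbers only, nothing ties a sender and its receiver to neighbouring positions on the torus, which is how several copies can end up sharing a link.

```python
def binomial_broadcast_schedule(num_nodes):
    """Return a list of rounds; each round is a list of (sender, receiver) ranks.

    Rank 0 is the root (an assumption of this sketch). At the start of a round,
    ranks 0 .. have_message-1 hold the message, and each of them forwards one
    copy to the rank have_message positions above it, if that rank exists.
    """
    rounds = []
    have_message = 1
    while have_message < num_nodes:
        sends = [(src, src + have_message)
                 for src in range(have_message)
                 if src + have_message < num_nodes]
        rounds.append(sends)
        have_message *= 2
    return rounds


schedule = binomial_broadcast_schedule(8)
for r, sends in enumerate(schedule):
    print(f"round {r}: {sends}")

# Count how many copies the root itself injects into the network.
root_sends = sum(1 for sends in schedule for src, _ in sends if src == 0)
print("copies sent by root:", root_sends)
```

For 8 nodes the schedule has 3 rounds and the root sends 3 copies, one per round.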




2D rectangular multicast trees

[Figure: a 2D 4x4 torus with the root marked; spanning trees 1 through 4 shown separately, then all 4 trees combined.]

[Watts and Van De Geijn 95]




Another look at that first plot

How much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on XE6
  - 1/30th of link bandwidth
- Rectangular collectives on BG/P
  - 4X the link bandwidth
- 120X improvement in efficiency!

How can we apply this?

[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. x-axis: #nodes (8 to 4096); y-axis: bandwidth (MB/sec), 128 to 8192; series: BG/P, XE6, XT5.]
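The 120X figure follows from the two ratios quoted on this slide, each taken relative to the link bandwidth of its own machine. As a one-line check:

```latex
\frac{\text{BG/P rectangular efficiency}}{\text{XE6 binomial efficiency}}
  = \frac{4}{1/30}
  = 4 \times 30
  = 120
```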




Matrix multiplication

[Figure: matrices A and B and their product AB.]





2D matrix multiplication

[Figure: A, B, and AB distributed over 16 CPUs arranged in a 4x4 grid.]

[Cannon 69], [Van De Geijn and Watts 97]
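A 2D algorithm of the kind cited here can be simulated serially. The sketch below follows the SUMMA pattern (the variant named in the later performance plots) on a q-by-q grid; the helper names `matmul`, `block`, and `summa_2d` are made up for this illustration, and the "multicasts" are simply block reads, so it shows the data flow rather than any real communication.

```python
def matmul(X, Y):
    """Plain dense product of two lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def block(M, bi, bj, b):
    """Extract the b-by-b block at block coordinates (bi, bj)."""
    return [row[bj * b:(bj + 1) * b] for row in M[bi * b:(bi + 1) * b]]


def summa_2d(A, B, q):
    """Simulate SUMMA on a q-by-q processor grid (n must be divisible by q).

    Processor (i, j) owns block (i, j) of A, B, and C. In step k, the owners of
    block column k of A multicast along their grid rows, the owners of block
    row k of B multicast along their grid columns, and every processor
    accumulates one local block product.
    """
    n = len(A)
    b = n // q
    C_blocks = {(i, j): [[0] * b for _ in range(b)]
                for i in range(q) for j in range(q)}
    for k in range(q):
        for i in range(q):
            a_ik = block(A, i, k, b)        # stands in for the row multicast
            for j in range(q):
                b_kj = block(B, k, j, b)    # stands in for the column multicast
                update = matmul(a_ik, b_kj)
                local = C_blocks[(i, j)]
                for r in range(b):
                    for c in range(b):
                        local[r][c] += update[r][c]
    # Gather the distributed blocks into one matrix so the result can be checked.
    C = [[0] * n for _ in range(n)]
    for (i, j), blk in C_blocks.items():
        for r in range(b):
            for c in range(b):
                C[i * b + r][j * b + c] = blk[r][c]
    return C


n, q = 8, 4  # 16 simulated processors, as in the 4x4 grid on the slide
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i - 2 * j for j in range(n)] for i in range(n)]
print(summa_2d(A, B, q) == matmul(A, B))
```

Each of the q steps needs one multicast per grid row and one per grid column, which is why multicast performance matters for this algorithm.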





3D matrix multiplication

[Figure: A, B, and AB distributed over 64 CPUs arranged in a 4x4x4 grid, holding 4 copies of the matrices.]

[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89]





2.5D matrix multiplication

[Figure: A, B, and AB distributed over 32 CPUs arranged in a 4x4x2 grid, holding 2 copies of the matrices.]
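The three layouts shown so far (4x4, 4x4x4, 4x4x2) fit one pattern: p processors holding c copies of the matrices form a grid with c layers of q-by-q processors, where q = sqrt(p/c). A minimal sketch of that relationship follows; the function name `grid_shape` is hypothetical and it assumes p/c is a perfect square.

```python
import math


def grid_shape(p, c):
    """Return (q, q, c) for p processors holding c copies of the matrices.

    Assumes p / c is a perfect square, as in the layouts on the slides.
    """
    if p % c != 0:
        raise ValueError("c must divide p")
    q = math.isqrt(p // c)
    if q * q * c != p:
        raise ValueError("p / c must be a perfect square")
    return (q, q, c)


print(grid_shape(16, 1))   # 2D layout   -> (4, 4, 1)
print(grid_shape(32, 2))   # 2.5D layout -> (4, 4, 2)
print(grid_shape(64, 4))   # 3D layout   -> (4, 4, 4)
```

With c = 1 this is the 2D layout, and when c equals q it is the 3D layout; 2.5D covers the range in between.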





Strong scaling matrix multiplication

[Figure: 2.5D MM on BG/P (n=65,536). x-axis: #nodes (256 to 2048); y-axis: percentage of machine peak (0 to 100); series: 2.5D SUMMA, 2D SUMMA, ScaLAPACK PDGEMM.]





2.5D MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P, using 16 matrix copies. x-axis: n (8192 and 131072); y-axis: percentage of machine peak (0 to 100); series: 2D MM, 2.5D MM; annotations: 12X faster, 2.7X faster.]





Cost breakdown of MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P. Bars: n=8192 2D; n=8192 2.5D; n=131072 2D; n=131072 2.5D. y-axis: execution time normalized by 2D (0 to 1.4); stacked components: computation, idle, communication; annotation: 95% reduction in comm.]





2.5D LU factorization

[Figure: 2.5D LU, step (A). Block labels: L₀₀, U₀₀, L₁₀, L₂₀, L₃₀, U₀₁, U₀₃.]






[Figure: 2.5D LU, steps (A) and (B). Block labels: L₀₀, U₀₀, L₁₀, L₂₀, L₃₀, U₀₁, U₀₃.]






Look at how this update is distributed.

[Figure: matrices A and B and their product AB.]

Same 3D update as in matrix multiplication
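To see why an LU update can reuse the multiplication layout, it helps to write out one step of a standard right-looking block LU without pivoting: after the leading block is factored, the trailing block is updated by subtracting a matrix product. The sketch below is a serial illustration of that generic step, not the 2.5D algorithm itself, and the assumption that this trailing update is the one the slide refers to is made here. All function names are hypothetical; exact `Fraction` arithmetic keeps the check free of rounding.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain dense product of two lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def lu_nopivot(A):
    """Unblocked LU without pivoting: returns unit-lower L and upper U."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def one_block_step(A, b):
    """One step of right-looking block LU on the leading b-by-b block.

    Returns (L00, U00, L10, U01, S), where S = A11 - L10 * U01 is the trailing
    update, computed as a matrix product.
    """
    n = len(A)
    A00 = [row[:b] for row in A[:b]]
    A01 = [row[b:] for row in A[:b]]
    A10 = [row[:b] for row in A[b:]]
    A11 = [row[b:] for row in A[b:]]
    L00, U00 = lu_nopivot(A00)
    # U01 solves L00 * U01 = A01 by forward substitution (unit diagonal).
    U01 = [[Fraction(x) for x in row] for row in A01]
    for i in range(b):
        for k in range(i):
            for j in range(n - b):
                U01[i][j] -= L00[i][k] * U01[k][j]
    # L10 solves L10 * U00 = A10, one column of L10 at a time.
    L10 = [[Fraction(x) for x in row] for row in A10]
    for j in range(b):
        for i in range(n - b):
            for k in range(j):
                L10[i][j] -= L10[i][k] * U00[k][j]
            L10[i][j] /= U00[j][j]
    prod = matmul(L10, U01)  # the multiplication-shaped part of the step
    S = [[A11[i][j] - prod[i][j] for j in range(n - b)] for i in range(n - b)]
    return L00, U00, L10, U01, S


A = [[4, 1, 2, 1], [1, 5, 1, 2], [2, 1, 6, 1], [1, 2, 1, 7]]
L00, U00, L10, U01, S = one_block_step(A, 2)
print(matmul(L00, U00) == [row[:2] for row in A[:2]])
```

The triangular solves touch only a panel of the matrix; the product `L10 * U01` covers the whole trailing block, so it dominates the work and is the part that benefits from a replicated layout.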






[Figure: 2.5D LU, steps (A) through (D). Block labels: L₀₀, U₀₀, L₁₀, L₂₀, L₃₀, U₀₁, U₀₃, L, U.]

[Solomonik and Demmel, EuroPar ’11, Distinguished Paper]





2.5D LU on 65,536 cores

[Figure: LU on 16,384 nodes of BG/P (n=131,072). Bars: NO-pivot 2D; NO-pivot 2.5D; CA-pivot 2D; CA-pivot 2.5D. y-axis: time (sec), 0 to 100; stacked components: compute, idle, communication; annotations: 2X faster (no pivoting), 2X faster (CA pivoting).]





Rectangular (RCT) vs binomial (BNM) collectives

[Figure: Binomial vs rectangular collectives on BG/P (n=131,072, p=16,384). x-axis: MM, LU, LU+PVT; y-axis: percentage of machine peak (0 to 80); series: 2D BNM, 2.5D BNM, 2.5D RCT.]




A model for rectangular multicasts

$t_{mcast} = m/B_n + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

Our multicast model consists of 3 terms:

1. $m/B_n$, the bandwidth cost
2. $2(d+1) \cdot o + 3L$, the multicast start-up overhead
3. $d \cdot P^{1/d} \cdot (2o + L)$, the path overhead
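The three terms above translate directly into a small function. The slide does not define the symbols, so the readings used here are assumptions: m is the message size in bytes, B_n the network bandwidth in bytes per second, o a per-message software overhead in seconds, L a latency in seconds, d the number of torus dimensions, and P the number of nodes. The numeric values in the example are made up for illustration and are not measurements from any machine.

```python
def t_mcast(m, B_n, o, L, d, P):
    """Rectangular multicast time from the three-term model on the slide.

    Symbol meanings are assumed (see the text above); units are bytes and
    seconds.
    """
    bandwidth_cost = m / B_n
    startup_overhead = 2 * (d + 1) * o + 3 * L
    path_overhead = d * P ** (1.0 / d) * (2 * o + L)
    return bandwidth_cost + startup_overhead + path_overhead


# Hypothetical parameters: 1 MB message, 1 GB/s, 1 microsecond for o and L,
# on an 8x8x8 torus.
m, B_n, o, L = 1e6, 1e9, 1e-6, 1e-6
t = t_mcast(m, B_n, o, L, d=3, P=512)
print(f"modelled time: {t * 1e3:.3f} ms")
print(f"effective bandwidth: {m / t / 1e6:.1f} MB/s")
```

With these made-up values the bandwidth term dominates, so the effective bandwidth stays close to B_n; only for small messages do the start-up and path terms matter.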





A model for binomial multicasts

$t_{bnm} = \log_2(P) \cdot (m/B_n + 2o + L)$

- The root of the binomial tree sends $\log_2(P)$ copies of the message
- The setup overhead is overlapped with the path overhead
- We assume no contention
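The binomial model can be evaluated the same way. As before, the symbol meanings are assumptions (m in bytes, B_n in bytes per second, o and L in seconds, P nodes) and the numbers are hypothetical. The sketch shows the main consequence of the formula: for large messages the effective bandwidth approaches B_n divided by log2(P), even before any contention is counted.

```python
import math


def t_bnm(m, B_n, o, L, P):
    """Binomial multicast time from the model on the slide (no contention)."""
    return math.log2(P) * (m / B_n + 2 * o + L)


# Hypothetical parameters: 1 MB message, 1 GB/s, 1 microsecond for o and L.
m, B_n, o, L = 1e6, 1e9, 1e-6, 1e-6
for P in (8, 64, 512, 4096):
    t = t_bnm(m, B_n, o, L, P)
    print(f"P={P:5d}: effective bandwidth = {m / t / 1e6:7.1f} MB/s")
```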





Model verification: one dimension

[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P. x-axis: msg size (KB), 1 to 262144; y-axis: bandwidth (MB/sec), 200 to 1000; series: t_rect model, DCMF rectangle dput, t_bnm model, DCMF binomial.]





Model verification: two dimensions

[Figure: DCMF Broadcast on 64 (8x8) nodes of BG/P. x-axis: msg size (KB), 1 to 262144; y-axis: bandwidth (MB/sec), 0 to 2000; series: t_rect model, DCMF rectangle dput, t_bnm model, DCMF binomial.]





Model verification: three dimensions

[Figure: DCMF Broadcast on 512 (8x8x8) nodes of BG/P. x-axis: msg size (KB), 1 to 262144; y-axis: bandwidth (MB/sec), 0 to 3000; series: t_rect model, Faraj et al data, DCMF rectangle dput, t_bnm model, DCMF binomial.]





Modelling collectives at exascale (p = 262,144)

[Figure: Exascale broadcast performance. x-axis: msg size (MB), 0.125 to 262144; y-axis: bandwidth (GB/sec), 0 to 200; series: t_rect 6D, t_rect 5D, t_rect 4D, t_rect 3D, t_rect 2D, t_rect 1D, t_bnm 3D.]
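One reason the torus dimension matters in this comparison is the path-overhead term of the rectangular multicast model, d · P^{1/d} · (2o + L). The short sketch below tabulates only the multiplier d · P^{1/d} for the node count on this slide; the function name `path_factor` is made up here, and no hardware parameters are assumed.

```python
def path_factor(d, P):
    """Return d * P**(1/d), the multiplier on (2o + L) in the path-overhead term."""
    return d * P ** (1.0 / d)


P = 262144  # = 2**18 nodes
for d in range(1, 7):
    print(f"d={d}: d * P^(1/d) = {path_factor(d, P):.1f}")
```

The multiplier falls from 262,144 on a 1D ring to about 48 on a 6D torus, so higher-dimensional networks pay far less path overhead for the same node count.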





Modelling matrix multiplication at exascale

[Figure: MM strong scaling at exascale (xy plane to full xyz torus). x-axis: z dimension of partition (1 to 64); y-axis: parallel efficiency (0.4 to 1); series: 2.5D with rectangular (c=z), 2.5D with binomial (c=z), 2D with binomial.]





Modelling LU factorization at exascale

[Figure: LU strong scaling at exascale (xy plane to full xyz torus). x-axis: z dimension of partition (1 to 64); y-axis: parallel efficiency (0 to 1); series: 2.5D with rectangular (c=z), 2.5D with binomial (c=z), 2D LU with binomial.]





Conclusion

- Topology-aware scheduling
  - Present in IBM BG but not in Cray supercomputers
  - Avoids network contention/congestion
  - Enables optimized communication collectives
  - Leads to simple communication performance models
- Future work
  - An automated framework for topology-aware mapping
  - Tensor computations mapping
  - Better models for network contention





Acknowledgements

- Krell CSGF DOE fellowship (DE-AC02-06CH11357)
- Resources at Argonne National Lab and Lawrence Berkeley National Lab
  - DE-SC0003959
  - DE-SC0004938
  - DE-FC02-06-ER25786
  - DE-SC0001845
  - DE-AC02-06CH11357
- Berkeley ParLab funding
  - Microsoft (Award #024263) and Intel (Award #024894)
  - U.C. Discovery (Award #DIG07-10227)
- Released by Lawrence Livermore National Laboratory as LLNL-PRES-514231


Reductions

Backup slides



A model for rectangular reductions

$t_{red} = \max[m/(8\gamma),\ 3m/\beta,\ m/B_n] + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - each node operates on 2m data
  - both the memory bandwidth and computation cost can be overlapped
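The reduction model differs from the multicast model only in its first term, where the slowest of three overlapped costs sets the pace. The sketch below evaluates it under assumed symbol meanings that the slide does not spell out: γ is taken to be the floating-point rate in operations per second (with 8-byte words, hence m/8 operations), β the memory bandwidth in bytes per second, and the remaining symbols as in the multicast model. All numeric values are hypothetical.

```python
def t_red(m, B_n, o, L, d, P, gamma, beta):
    """Rectangular reduction time from the model on the slide.

    Assumed meanings: gamma = flop rate (operations/s), beta = memory
    bandwidth (bytes/s), B_n = network bandwidth (bytes/s), m in bytes.
    """
    compute_cost = m / (8 * gamma)   # one operation per 8-byte word
    memory_cost = 3 * m / beta       # memory traffic term as given on the slide
    network_cost = m / B_n
    overlapped = max(compute_cost, memory_cost, network_cost)
    startup_overhead = 2 * (d + 1) * o + 3 * L
    path_overhead = d * P ** (1.0 / d) * (2 * o + L)
    return overlapped + startup_overhead + path_overhead


# Hypothetical parameters chosen so that memory bandwidth is the bottleneck.
t = t_red(m=1e6, B_n=1e9, o=1e-6, L=1e-6, d=3, P=512, gamma=1e9, beta=1e9)
print(f"modelled reduction time: {t * 1e3:.3f} ms")
```

Because the three costs are combined with a max rather than a sum, a reduction is limited by whichever resource is slowest, not by their total.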



Rectangular reduction performance on BG/P

[Figure: DCMF Reduce peak bandwidth (largest message size). x-axis: #nodes (8, 64, 512); y-axis: bandwidth (MB/sec), 50 to 400; series: torus rectangle ring, torus binomial, torus rectangle, torus short binomial.]

BG/P rectangular reduction performs significantly worse than multicast



Performance of custom line reduction

[Figure: Performance of custom Reduce/Multicast on 8 nodes. x-axis: msg size (KB), 512 to 32768; y-axis: bandwidth (MB/sec), 100 to 800; series: MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 2, Custom Ring Reduce 1, MPI Reduce.]



A new LU latency lower bound

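[Figure: an n-by-n matrix A with diagonal blocks A₀₀, A₁₁, A₂₂, A₃₃, A₄₄, ..., A_{d-1,d-1} of sizes k₀, k₁, k₂, k₃, k₄, ..., k_{d-1}; the critical path is marked.]

- flops lower bound requires $d = \Omega(\sqrt{p})$ blocks/messages
- bandwidth lower bound requires $d = \Omega(\sqrt{cp})$ blocks/messages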



2.5D LU strong scaling (without pivoting)

[Figure: LU without pivoting on BG/P (n=65,536). x-axis: #nodes (256 to 2048); y-axis: percentage of machine peak (0 to 100); series: ideal scaling, 2.5D LU, 2D LU.]



Strong scaling of 2.5D LU with tournament pivoting

[Figure: LU with tournament pivoting on BG/P (n=65,536). x-axis: #nodes (256 to 2048); y-axis: percentage of machine peak (0 to 100); series: ideal scaling, 2.5D LU, 2D LU, ScaLAPACK PDGETRF.]

