improving communication performance in dense linear algebra … · 2016. 8. 18. · collective...

36
Collective communication 2.5D algorithms Modelling exascale Improving communication performance in dense linear algebra via topology-aware collectives Edgar Solomonik * , Abhinav Bhatele , and James Demmel * * University of California, Berkeley Lawrence Livermore National Laboratory Supercomputing, November 2011 Edgar Solomonik Mapping dense linear algebra 1/ 29

Upload: others

Post on 28-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • Collective communication2.5D algorithms

    Modelling exascale

    Improving communication performance in denselinear algebra via topology-aware collectives

    Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗

    ∗ University of California, Berkeley† Lawrence Livermore National Laboratory

    Supercomputing, November 2011

    Edgar Solomonik Mapping dense linear algebra 1/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Outline

    Collective communicationRectangular collectives

    2.5D algorithms2.5D Matrix Multiplication2.5D LU factorization

    Modelling exascaleMulticast performanceMM and LU performance

    Edgar Solomonik Mapping dense linear algebra 2/ 29

  • Collective communication2.5D algorithms

    Modelling exascaleRectangular collectives

    Performance of multicast (BG/P vs Cray)

    128

    256

    512

    1024

    2048

    4096

    8192

    8 64 512 4096

    Ban

    dwid

    th (M

    B/s

    ec)

    #nodes

    1 MB multicast on BG/P, Cray XT5, and Cray XE6

    BG/PXE6XT5

    Edgar Solomonik Mapping dense linear algebra 3/ 29

  • Collective communication2.5D algorithms

    Modelling exascaleRectangular collectives

    Why the performance discrepancy in multicasts?

    I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus

    I BG/P uses rectangular multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees

    I Route in different dimensional orderI Use both directions of bidirectional network

    Edgar Solomonik Mapping dense linear algebra 4/ 29

  • Collective communication2.5D algorithms

    Modelling exascaleRectangular collectives

    2D rectangular multicasts trees

    root2D 4X4 Torus Spanning tree 1 Spanning tree 2

    Spanning tree 3 Spanning tree 4 All 4 trees combined

    [Watts and Van De Geijn 95]

    Edgar Solomonik Mapping dense linear algebra 5/ 29

  • Collective communication2.5D algorithms

    Modelling exascaleRectangular collectives

    Another look at that first plot

    How much better are rectangularalgorithms on P = 4096 nodes?

    I Binomial collectives on XE6I 1/30th of link

    bandwidth

    I Rectangular collectives onBG/P

    I 4X the link bandwidth

    I 120X improvement inefficiency!

    How can we apply this?

    128

    256

    512

    1024

    2048

    4096

    8192

    8 64 512 4096

    Ban

    dwid

    th (M

    B/s

    ec)

    #nodes

    1 MB multicast on BG/P, Cray XT5, and Cray XE6

    BG/PXE6XT5

    Edgar Solomonik Mapping dense linear algebra 6/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    Matrix multiplication

    A

    BA

    BA

    B

    AB

    Edgar Solomonik Mapping dense linear algebra 7/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2D matrix multiplication

    A

    BA

    BA

    B

    AB

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    16 CPUs (4x4)

    [Cannon 69], [Van De Geijn and Watts 97]

    Edgar Solomonik Mapping dense linear algebra 8/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    3D matrix multiplication

    A

    BA

    BA

    B

    AB

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    64 CPUs (4x4x4)

    4 copies of matrices

    [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]

    Edgar Solomonik Mapping dense linear algebra 9/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D matrix multiplication

    A

    BA

    BA

    B

    AB

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    CPU CPU CPU CPU

    32 CPUs (4x4x2)

    2 copies of matrices

    Edgar Solomonik Mapping dense linear algebra 10/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    Strong scaling matrix multiplication

    0

    20

    40

    60

    80

    100

    256 512 1024 2048

    Per

    cent

    age

    of m

    achi

    ne p

    eak

    #nodes

    2.5D MM on BG/P (n=65,536)

    2.5D SUMMA2D SUMMA

    ScaLAPACK PDGEMM

    Edgar Solomonik Mapping dense linear algebra 11/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D MM on 65,536 cores

    0

    20

    40

    60

    80

    100

    8192 131072

    Per

    cent

    age

    of m

    achi

    ne p

    eak

    n

    Matrix multiplication on 16,384 nodes of BG/P

    12X faster

    2.7X faster

    Using 16 matrix copies

    2D MM2.5D MM

    Edgar Solomonik Mapping dense linear algebra 12/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    Cost breakdown of MM on 65,536 cores

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    n=8192, 2D

    n=8192, 2.5D

    n=131072, 2D

    n=131072, 2.5D

    Exe

    cutio

    n tim

    e no

    rmal

    ized

    by

    2D

    Matrix multiplication on 16,384 nodes of BG/P

    95% reduction in comm computationidle

    communication

    Edgar Solomonik Mapping dense linear algebra 13/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D LU factorization

    L₀₀U₀₀

    U₀₃

    U₀₃

    U₀₁

    L₂₀L₃₀

    L₁₀(A)

    U₀₀

    U₀₀

    L₀₀

    L₀₀

    U₀₀L₀₀

    Edgar Solomonik Mapping dense linear algebra 14/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D LU factorization

    L₀₀U₀₀

    U₀₃

    U₀₃

    U₀₁

    L₂₀L₃₀

    L₁₀(A)

    (B)U₀₀

    U₀₀

    L₀₀

    L₀₀

    U₀₀L₀₀

    Edgar Solomonik Mapping dense linear algebra 15/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D LU factorization

    Look at how thisupdate is distributed.

    A

    BA

    BA

    B

    AB

    Same 3D updatein multiplication

    Edgar Solomonik Mapping dense linear algebra 16/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D LU factorization

    L₀₀U₀₀

    U₀₃

    U₀₃

    U₀₁

    L₂₀L₃₀

    L₁₀(A)

    (B)

    U

    L

    (C) (D)

    U₀₀

    U₀₀

    L₀₀

    L₀₀

    U₀₀L₀₀

    [Solomonik and Demmel, EuroPar ’11, Distinguished Paper]

    Edgar Solomonik Mapping dense linear algebra 17/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    2.5D LU on 65,536 cores

    0

    20

    40

    60

    80

    100

    NO-pivot 2D

    NO-pivot 2.5D

    CA-pivot 2D

    CA-pivot 2.5D

    Tim

    e (s

    ec)

    LU on 16,384 nodes of BG/P (n=131,072)

    2X faster

    2X faster

    computeidle

    communication

    Edgar Solomonik Mapping dense linear algebra 18/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    2.5D Matrix Multiplication2.5D LU factorization

    Rectangular (RCT) vs binomial (BNM) collectives

    0

    10

    20

    30

    40

    50

    60

    70

    80

    MM LU LU+PVT

    Per

    cent

    age

    of m

    achi

    ne p

    eak

    Binomial vs rectangular collectives on BG/P (n=131,072, p=16,384)

    2D BNM2.5D BNM2.5D RCT

    Edgar Solomonik Mapping dense linear algebra 19/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    A model for rectangular multicasts

    tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)

    Our multicast model consists of 3 terms

    1. m/Bn, the bandwidth cost

    2. 2(d + 1) · o + 3L, the multicast start-up overhead3. d · P1/d · (2o + L), the path overhead

    Edgar Solomonik Mapping dense linear algebra 20/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    A model for binomial multicasts

    tbnm = log2(P) · (m/Bn + 2o + L)

    I The root of the binomial tree sends log2(P) copies of message

    I The setup overhead is overlapped with the path overhead

    I We assume no contention

    Edgar Solomonik Mapping dense linear algebra 21/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Model verification: one dimension

    200

    400

    600

    800

    1000

    1 8 64 512 4096 32768 262144

    Ban

    dwid

    th (M

    B/s

    ec)

    msg size (KB)

    DCMF Broadcast on a ring of 8 nodes of BG/P

    trect modelDCMF rectangle dput

    tbnm modelDCMF binomial

    Edgar Solomonik Mapping dense linear algebra 22/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Model verification: two dimensions

    0

    500

    1000

    1500

    2000

    1 8 64 512 4096 32768 262144

    Ban

    dwid

    th (M

    B/s

    ec)

    msg size (KB)

    DCMF Broadcast on 64 (8x8) nodes of BG/P

    trect modelDCMF rectangle dput

    tbnm modelDCMF binomial

    Edgar Solomonik Mapping dense linear algebra 23/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Model verification: three dimensions

    0

    500

    1000

    1500

    2000

    2500

    3000

    1 8 64 512 4096 32768 262144

    Ban

    dwid

    th (M

    B/s

    ec)

    msg size (KB)

    DCMF Broadcast on 512 (8x8x8) nodes of BG/P

    trect modelFaraj et al data

    DCMF rectangle dputtbnm model

    DCMF binomial

    Edgar Solomonik Mapping dense linear algebra 24/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Modelling collectives at exascale (p = 262, 144)

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    200

    0.125 1 8 64 512 4096 32768 262144

    Ban

    dwid

    th (G

    B/s

    ec)

    msg size (MB)

    Exascale broadcast performance

    trect 6Dtrect 5Dtrect 4Dtrect 3Dtrect 2Dtrect 1Dtbnm 3D

    Edgar Solomonik Mapping dense linear algebra 25/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Modelling matrix multiplication at exascale

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    1 8 64

    Par

    alle

    l effi

    cien

    cy

    z dimension of partition

    MM strong scaling at exascale (xy plane to full xyz torus)

    2.5D with rectangular (c=z)2.5D with binomial (c=z)

    2D with binomial

    Edgar Solomonik Mapping dense linear algebra 26/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Modelling LU factorization at exascale

    0

    0.2

    0.4

    0.6

    0.8

    1

    1 8 64

    Par

    alle

    l effi

    cien

    cy

    z dimension of partition

    LU strong scaling at exascale (xy plane to full xyz torus)

    2.5D with rectangular (c=z)2.5D with binomial (c=z)

    2D LU with binomial

    Edgar Solomonik Mapping dense linear algebra 27/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Conclusion

    I Topology-aware schedulingI Present in IBM BG but not in Cray supercomputersI Avoids network contention/congestionI Enables optimized communication collectivesI Leads to simple communication performance models

    I Future workI An automated framework for topology-aware mappingI Tensor computations mappingI Better models for network contention

    Edgar Solomonik Mapping dense linear algebra 28/ 29

  • Collective communication2.5D algorithms

    Modelling exascale

    Multicast performanceMM and LU performance

    Acknowledgements

    I Krell CSGF DOE fellowship (DE-AC02-06CH11357)I Resources at Argonne National Lab and Lawrence Berkeley

    National LabI DE-SC0003959I DE-SC0004938I DE-FC02-06-ER25786I DE-SC0001845I DE-AC02-06CH11357

    I Berkeley ParLab fundingI Microsoft (Award #024263) and Intel (Award #024894)I U.C. Discovery (Award #DIG07-10227)

    I Released by Lawrence Livermore National Laboratory asLLNL-PRES-514231

    Edgar Solomonik Mapping dense linear algebra 29/ 29

  • Reductions

    Backup slides

    Edgar Solomonik Mapping dense linear algebra 30/ 29

  • Reductions

    A model for rectangular reductions

    tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)

    I Any multicast tree can be inverted to produce a reduction treeI The reduction operator must be applied at each node

    I each node operates on 2m dataI both the memory bandwidth and computation cost can be

    overlapped

    Edgar Solomonik Mapping dense linear algebra 31/ 29

  • Reductions

    Rectangular reduction performance on BG/P

    50

    100

    150

    200

    250

    300

    350

    400

    8 64 512

    Ban

    dwid

    th (M

    B/s

    ec)

    #nodes

    DCMF Reduce peak bandwidth (largest message size)

    torus rectangle ringtorus binomial

    torus rectangletorus short binomial

    BG/P rectangular reduction performs significantly worse thanmulticast

    Edgar Solomonik Mapping dense linear algebra 32/ 29

  • Reductions

    Performance of custom line reduction

    100

    200

    300

    400

    500

    600

    700

    800

    512 4096 32768

    Ban

    dwid

    th (M

    B/s

    ec)

    msg size (KB)

    Performance of custom Reduce/Multicast on 8 nodes

    MPI BroadcastCustom Ring Multicast

    Custom Ring Reduce 2Custom Ring Reduce 1

    MPI Reduce

    Edgar Solomonik Mapping dense linear algebra 33/ 29

  • Reductions

    A new LU latency lower bound

    k₁

    k₀

    k₂

    k₃

    k₄

    k

    A₀₀

    A₂₂

    A₃₃

    A₄₄

    A

    n

    n

    critical path

    d-1,d-1d-1

    A₁₁

    flops lower bound requires d = Ω(√p) blocks/messages

    bandwidth lower bound required d = Ω(√cp) blocks/messages

    Edgar Solomonik Mapping dense linear algebra 34/ 29

  • Reductions

    2.5D LU strong scaling (without pivoting)

    0

    20

    40

    60

    80

    100

    256 512 1024 2048

    Per

    cent

    age

    of m

    achi

    ne p

    eak

    #nodes

    LU without pivoting on BG/P (n=65,536)

    ideal scaling2.5D LU

    2D LU

    Edgar Solomonik Mapping dense linear algebra 35/ 29

  • Reductions

    Strong scaling of 2.5D LU with tournament pivoting

    0

    20

    40

    60

    80

    100

    256 512 1024 2048

    Per

    cent

    age

    of m

    achi

    ne p

    eak

    #nodes

    LU with tournament pivoting on BG/P (n=65,536)

    ideal scaling2.5D LU

    2D LUScaLAPACK PDGETRF

    Edgar Solomonik Mapping dense linear algebra 36/ 29

    Collective communicationRectangular collectives

    2.5D algorithms2.5D Matrix Multiplication2.5D LU factorization

    Modelling exascaleMulticast performanceMM and LU performance

    AppendixReductions