
  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 1

    Lattice Gauge Computing Summary

    Don Holmgren, Fermilab

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 2

    Outline

    - Communities / Collaborations
    - Existing Facilities
    - LQCD on Purpose-Built Machines
    - LQCD on Clusters

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 3

    Lattice Gauge Computing Requirements

    - Very large floating point requirements:
      - Need 100s to 1000s of Gflop/sec-years for results
      - Primary work is inverting large 4-D sparse matrices (a solver sketch follows this slide)
      - Sustained performance is often low (20% of peak)
      - Multiprocessors are essential - clusters or massively parallel
    - Inter-processor communications requirements:
      - High bandwidth (100s of Mbyte/sec)
      - Low latency - O(µsec)
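
    The "inverting large 4-D sparse matrices" workload is dominated by Krylov
    solvers whose cost is almost entirely sparse matrix-vector products and
    global inner products. A minimal conjugate-gradient sketch in C is below;
    the matrix-vector product is a caller-supplied callback standing in for
    the Dirac operator (which in production codes is complex-valued and is
    usually attacked via the normal equations or BiCGStab), so this only
    illustrates the iteration structure, not actual LQCD production code.

      /* Minimal conjugate gradient: solve M x = b for a large sparse,
       * symmetric positive-definite M supplied as a callback. */
      #include <stdlib.h>
      #include <math.h>

      typedef void (*matvec_fn)(const double *x, double *y, int n);  /* y = M x */

      static double dot(const double *a, const double *b, int n)
      {
          double s = 0.0;
          for (int i = 0; i < n; i++) s += a[i] * b[i];
          return s;
      }

      void cg_solve(matvec_fn apply_M, double *x, const double *b,
                    int n, double tol, int max_iter)
      {
          double *r  = malloc(n * sizeof *r);
          double *p  = malloc(n * sizeof *p);
          double *Mp = malloc(n * sizeof *Mp);

          apply_M(x, Mp, n);                              /* r = b - M x0 */
          for (int i = 0; i < n; i++) { r[i] = b[i] - Mp[i]; p[i] = r[i]; }
          double rr = dot(r, r, n);

          for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
              apply_M(p, Mp, n);                          /* dominant cost per iteration */
              double alpha = rr / dot(p, Mp, n);          /* step length */
              for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Mp[i]; }
              double rr_new = dot(r, r, n);               /* needs a global sum in parallel */
              double beta = rr_new / rr;
              for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
              rr = rr_new;
          }
          free(r); free(p); free(Mp);
      }

    On a parallel machine the matrix application needs nearest-neighbour
    boundary exchanges and the dot products need global sums, which is what
    drives the bandwidth and latency requirements above.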

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 4

    Communities / Collaborations

    - LATFOR (Lattice Forum) (K. Jansen)
      - Forum of German lattice physicists, plus groups in Austria (Graz, Wien)
        and Switzerland (Bern)
      - Universities: Berlin, Bielefeld, Darmstadt, Leipzig, Münster, Regensburg, Wuppertal
      - Research Labs: DESY, GSI, NIC
    - Efforts:
      - Coordinate physics program
      - Share software
      - Share configurations/propagators (ILDG = International Lattice Data Grid)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 5

    Communities / Collaborations

    - LATFOR - Requirements
      - Typical application profile:

            Lattice size    32^3 x 64
            Memory          100 Gbyte
            I/O request     0.1 Gbyte/Gflop
            Runtime         15 Teraflops-years

      - Need 12.5 Teraflops sustained (a quick consistency check follows this slide)
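
    A quick consistency check of these figures; the 1.2-year running period
    and the single-field storage count below are illustrative assumptions,
    not numbers from the talk.

      /* Back-of-the-envelope check of the LATFOR requirements. */
      #include <stdio.h>

      int main(void)
      {
          /* 15 Teraflops-years of work delivered over ~1.2 calendar years */
          double work_tflop_years = 15.0;
          double wall_years       = 1.2;                   /* assumed */
          printf("required sustained rate: %.1f Tflops\n",
                 work_tflop_years / wall_years);           /* ~12.5 Tflops */

          /* One double-precision gauge field on a 32^3 x 64 lattice:
           * 4 links/site x 3x3 complex matrices x 16 bytes per complex */
          double sites       = 32.0 * 32.0 * 32.0 * 64.0;
          double gauge_bytes = sites * 4 * 9 * 16;
          printf("one gauge field: %.2f Gbyte\n", gauge_bytes / 1e9);  /* ~1.2 Gbyte */
          /* The quoted 100 Gbyte covers many such fields, propagators,
           * and solver workspace. */
          return 0;
      }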

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 6

    Communities / Collaborations

    - SciDAC Lattice Gauge Computing
      - Majority of US community
      - Labs: Brookhaven, Fermilab, Jefferson Lab
      - Universities: Indiana, Utah, UCSB, UCSD, MIT, BU, ASU, Illinois, OSU, Columbia…
      - http://www.lqcd.org/
    - Efforts
      - QCDOC (Columbia/BNL)
      - Cluster prototypes (Fermilab, JLab)
      - Software (QMP, QLA, QIO, QDP) to allow applications to run on either
        type of hardware (Chulwoo Jung)
      - O(10 Tflops) in 2004 (QCDOC), 2005/6 (clusters)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 7

    Facilities

    - SciDAC:
      - Jefferson Lab
        Now: 128 nodes / 128 processors (2.0 GHz Xeon), Myrinet
        Soon: 256 nodes / 256? processors (Xeon), GigE mesh
      - Fermilab
        Now: 176 nodes / 352 processors (2.0/2.4 GHz Xeon), Myrinet
        Autumn: O(256) processor expansion (Xeon? Itanium2? Myrinet? GigE?)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 8

    Facilities

    - DESY Clusters:
      - DESY Hamburg
        16 dual 1.7 GHz Xeon
        16 dual 2.0 GHz Xeon
        Myrinet
      - DESY Zeuthen
        16 dual 1.7 GHz Xeon
        Myrinet

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 9

    [Figure slide]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 10

    LQCD On Purpose-Built Hardware

    - QCDOC: (T. Wettig)
      - "QCD on a Chip" ASIC (0.18µ):
        - Partner with IBM
        - Based on PPC440 + 64-bit FPU, 1 Gflops peak
        - 4 MB EDRAM + controller for external DDR
        - 6 bidirectional LVDS channels, each 2 x 500 Mbps
        - Ethernet
        - ~5 Watts
        - $1/MF assuming 50% of peak (see the scaling sketch after this slide)
      - Packaging:
        - 2 ASICs/daughterboard, 32 daughterboards per motherboard
        - 8 motherboards per backplane (512 processors)
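
    How the quoted cost and packaging figures combine, using only the numbers
    on this slide:

      /* Scaling of the "$1/MF at 50% of peak" figure across the packaging. */
      #include <stdio.h>

      int main(void)
      {
          double peak_gflops_per_asic = 1.0;
          double sustained_fraction   = 0.5;               /* assumed 50% of peak */
          double dollars_per_mflop    = 1.0;

          double sustained_mflops = peak_gflops_per_asic * 1000.0 * sustained_fraction;
          printf("implied cost per ASIC: $%.0f\n",
                 sustained_mflops * dollars_per_mflop);    /* ~$500 */

          /* 8 motherboards x 32 daughterboards x 2 ASICs = 512 processors */
          int asics_per_backplane = 8 * 32 * 2;
          printf("sustained per backplane: %.0f Gflops\n",
                 asics_per_backplane * sustained_mflops / 1000.0);   /* 256 Gflops */
          return 0;
      }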

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 11

    QCDOC

    - Schedule:
      - Funded: 10 Tflops each in late 2003 for Edinburgh, RIKEN-BNL;
        5 Tflops for Columbia (SciDAC) [pending]
      - 10 - 20 Tflops for US @ BNL in 2004
      - ASIC design is done, 1st chips in May

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 12

    QCDOC

    - Performance: (P. Boyle)
      - Single-node kernels, based on simulations:
        - SU3 x SU3        800 Mflops/node
        - SU3 x 2spinor    780 Mflops/node
      - Multinode:
        - Example - Wilson 2^4, 470 Mflops/node, 22 µsec/iteration; each
          iteration has 16 distinct communications (see the arithmetic after
          this slide)
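
    What the multinode figures imply, using only arithmetic on the numbers
    above:

      /* Work and communication spacing implied by the Wilson 2^4 example. */
      #include <stdio.h>

      int main(void)
      {
          double mflops_per_node = 470.0;
          double usec_per_iter   = 22.0;
          int    comms_per_iter  = 16;

          /* Mflops x microseconds = flops */
          printf("useful flops per node per iteration: ~%.0f\n",
                 mflops_per_node * usec_per_iter);          /* ~10,300 flops */
          printf("one communication every ~%.1f usec\n",
                 usec_per_iter / comms_per_iter);           /* ~1.4 usec */
          /* Sub-2 usec spacing between communications is why the O(usec)
           * latency requirement from slide 3 matters at small local volumes. */
          return 0;
      }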

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 13

    QCDOC

    - Fast global sums via network passthru (a cluster-side MPI equivalent is sketched below):
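
    On QCDOC the sum is accumulated by the network hardware as data passes
    through each node. In cluster codes the same reduction (for example the
    inner products of the solver) is typically an MPI_Allreduce; a minimal
    equivalent for comparison:

      /* Global sum of a local partial result across all nodes.
       * Call between MPI_Init and MPI_Finalize. */
      #include <mpi.h>

      double global_sum(double local)
      {
          double global = 0.0;
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          return global;
      }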

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 14

    APEnext (R. De Pietri)

    - In 0.18µ ASIC, native implementation of a fused multiply-add on complex
      numbers: a x b + c (see the sketch after this slide)
    - 256 registers, each holding a complex number (double precision)
    - 200 MHz, ~5 watts, 1.6 Gflops/chip, $1/Mflop @ 50% of peak
    - 6+1 channels of LVDS communications, each 200 Mbyte/sec
      - 3D torus (x, y, z)
      - Fast I/O link to UNIX host
    - Schedule:
      - 300 to 600 ASICs to be delivered in June '03
      - From these, assemble and test a 256-processor crate
      - Mass production to start in Sept '03
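
    The complex fused multiply-add a x b + c expands to 8 real floating-point
    operations, which is where the peak figure comes from (8 flops x 200 MHz =
    1.6 Gflops). A C99 sketch of the operation, for illustration only:

      /* One APEnext-style complex multiply-add in portable C99. */
      #include <complex.h>

      double complex cfma(double complex a, double complex b, double complex c)
      {
          /* equivalent real arithmetic (8 flops):
           *   re = creal(a)*creal(b) - cimag(a)*cimag(b) + creal(c)
           *   im = creal(a)*cimag(b) + cimag(a)*creal(b) + cimag(c) */
          return a * b + c;
      }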

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 15

    [Figure: APEnext node/network diagram - recoverable labels: DDR-MEM, J&T,
    processors 0-15, links X+(cables), Y+(bp), Z+(bp), X+ … Z-]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 16

    LQCD on Clusters

    - Performance (DESY: P. Wegner, Fermilab: D. Holmgren)
    - Common themes:
      - SSE/SSE2 very important for optimizing performance (see the kernel sketch after this slide)
      - Large memory bandwidth increase in Pentium 4 critical to LQCD
      - Building clusters with Myrinet, investigating other networks
      - Pay attention to PCI bus performance, which varies with chipset
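
    To give the flavor of the SSE2 hand-optimization (an illustrative
    fragment, not the actual SciDAC or DESY kernel code): a double-precision
    complex multiply-accumulate, the innermost operation of the Dirac kernel,
    written with SSE2 intrinsics. One 128-bit register holds one complex
    double, and pointers must be 16-byte aligned.

      /* acc += a * b for complex doubles stored as {re, im}. */
      #include <emmintrin.h>

      static inline void cmadd_sse2(const double *a, const double *b, double *acc)
      {
          __m128d va   = _mm_load_pd(a);                /* [ar, ai]        */
          __m128d vb   = _mm_load_pd(b);                /* [br, bi]        */
          __m128d are  = _mm_unpacklo_pd(va, va);       /* [ar, ar]        */
          __m128d aim  = _mm_unpackhi_pd(va, va);       /* [ai, ai]        */
          __m128d bsw  = _mm_shuffle_pd(vb, vb, 1);     /* [bi, br]        */
          __m128d sign = _mm_set_pd(1.0, -1.0);         /* [-1, +1]        */

          /* [ar*br - ai*bi, ar*bi + ai*br] */
          __m128d prod = _mm_add_pd(_mm_mul_pd(are, vb),
                                    _mm_mul_pd(_mm_mul_pd(aim, bsw), sign));
          _mm_store_pd(acc, _mm_add_pd(_mm_load_pd(acc), prod));
      }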

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 17
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    [Bar chart: Parallel (1-dim) Dirac operator benchmark (SSE), even-odd
    preconditioned, 2 x 16^3 lattice, Xeon CPUs, single-CPU performance in
    MFLOPS for 4/8/16 single-CPU and 2/4/8/16 dual-CPU configurations;
    series: 1.7 GHz (i860), 2.0 GHz (i860), 2.4 GHz (E7500);
    Myrinet2000 bandwidth - i860: 90 MB/s, E7500: 190 MB/s]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 18
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    Maximal Efficiency of external I/O

    Interconnect                                          MFLOPS           MFLOPS         Maximal bandwidth   Efficiency
                                                          (without comm.)  (with comm.)   (Mbyte/sec)
    Infiniband, non-SSE                                   370              297            210 + 210           0.80
    Gigabit Ethernet, non-SSE                             390              228            100 + 100           0.58
    Myrinet/Parastation (E7500), non-blocking, non-SSE    406              368            hidden              0.91
    Myrinet/Parastation (E7500), SSE                      675              446            181 + 181           0.66
    Myrinet/GM (E7500), SSE                               631              432            190 + 190           0.68
    Myrinet (i860), SSE                                   579              307            90 + 90             0.53
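
    The Efficiency column is the sustained rate with communication divided by
    the compute-only rate; a quick cross-check of the table values:

      /* Recompute the Efficiency column from the MFLOPS columns. */
      #include <stdio.h>

      int main(void)
      {
          struct { const char *net; double with_comm, without_comm; } rows[] = {
              { "Infiniband, non-SSE",                      297, 370 },
              { "Gigabit Ethernet, non-SSE",                228, 390 },
              { "Myrinet/Parastation, non-blocking",        368, 406 },
              { "Myrinet/Parastation (E7500), SSE",         446, 675 },
              { "Myrinet/GM (E7500), SSE",                  432, 631 },
              { "Myrinet (i860), SSE",                      307, 579 },
          };
          for (int i = 0; i < 6; i++)
              printf("%-40s %.2f\n", rows[i].net,
                     rows[i].with_comm / rows[i].without_comm);
          return 0;
      }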

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 19
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    Infiniband interconnect

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 20

    LQCD on Clusters - Management

    - See talks by A. Gellrich, A. Singh
    - Common themes:
      - Linux, PBS, Myrinet, MPI, private networks
      - Web-based monitoring and alarms
      - MRTG history plots (e.g. temperatures, fans)
      - Some Myrinet hardware reliability issues