
  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 1

    Lattice Gauge Computing Summary

    Don Holmgren, Fermilab

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 2

    Outline

    - Communities / Collaborations
    - Existing Facilities
    - LQCD on Purpose-Built Machines
    - LQCD on Clusters

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 3

    Lattice Gauge Computing Requirements

    - Very large floating point requirements:
      - Need 100s to 1000s of Gflop/sec-years for results
      - Primary work is inverting large 4-D sparse matrices (a solver sketch follows this slide)
      - Sustained performance is often low (20% of peak)
      - Multiprocessors are essential - clusters or massively parallel
    - Inter-processor communications requirements:
      - High bandwidth (100s of Mbyte/sec)
      - Low latency - O(µsec)
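
    The "inverting large 4-D sparse matrices" workload is dominated by Krylov
    solvers whose cost is almost entirely sparse matrix-vector products and
    global inner products. A minimal conjugate-gradient sketch in C is below;
    the matrix-vector product is a caller-supplied callback standing in for
    the Dirac operator (which in production codes is complex-valued and is
    usually attacked via the normal equations or BiCGStab), so this only
    illustrates the iteration structure, not actual LQCD production code.

      /* Minimal conjugate gradient: solve M x = b for a large sparse,
       * symmetric positive-definite M supplied as a callback. */
      #include <stdlib.h>
      #include <math.h>

      typedef void (*matvec_fn)(const double *x, double *y, int n);  /* y = M x */

      static double dot(const double *a, const double *b, int n)
      {
          double s = 0.0;
          for (int i = 0; i < n; i++) s += a[i] * b[i];
          return s;
      }

      void cg_solve(matvec_fn apply_M, double *x, const double *b,
                    int n, double tol, int max_iter)
      {
          double *r  = malloc(n * sizeof *r);
          double *p  = malloc(n * sizeof *p);
          double *Mp = malloc(n * sizeof *Mp);

          apply_M(x, Mp, n);                              /* r = b - M x0 */
          for (int i = 0; i < n; i++) { r[i] = b[i] - Mp[i]; p[i] = r[i]; }
          double rr = dot(r, r, n);

          for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
              apply_M(p, Mp, n);                          /* dominant cost per iteration */
              double alpha = rr / dot(p, Mp, n);          /* step length */
              for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Mp[i]; }
              double rr_new = dot(r, r, n);               /* needs a global sum in parallel */
              double beta = rr_new / rr;
              for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
              rr = rr_new;
          }
          free(r); free(p); free(Mp);
      }

    On a parallel machine the matrix application needs nearest-neighbour
    boundary exchanges and the dot products need global sums, which is what
    drives the bandwidth and latency requirements above.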

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 4

    Communities / Collaborations

    - LATFOR (Lattice Forum) (K. Jansen)
      - Forum of German lattice physicists, plus groups in Austria (Graz, Wien)
        and Switzerland (Bern)
      - Universities: Berlin, Bielefeld, Darmstadt, Leipzig, Münster, Regensburg, Wuppertal
      - Research Labs: DESY, GSI, NIC
    - Efforts:
      - Coordinate physics program
      - Share software
      - Share configurations/propagators (ILDG = International Lattice Data Grid)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 5

    Communities / Collaborations

    - LATFOR - Requirements
      - Typical application profile:

            Lattice size    32^3 x 64
            Memory          100 Gbyte
            I/O request     0.1 Gbyte/Gflop
            Runtime         15 Teraflops-years

      - Need 12.5 Teraflops sustained (a quick consistency check follows this slide)
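
    A quick consistency check of these figures; the 1.2-year running period
    and the single-field storage count below are illustrative assumptions,
    not numbers from the talk.

      /* Back-of-the-envelope check of the LATFOR requirements. */
      #include <stdio.h>

      int main(void)
      {
          /* 15 Teraflops-years of work delivered over ~1.2 calendar years */
          double work_tflop_years = 15.0;
          double wall_years       = 1.2;                   /* assumed */
          printf("required sustained rate: %.1f Tflops\n",
                 work_tflop_years / wall_years);           /* ~12.5 Tflops */

          /* One double-precision gauge field on a 32^3 x 64 lattice:
           * 4 links/site x 3x3 complex matrices x 16 bytes per complex */
          double sites       = 32.0 * 32.0 * 32.0 * 64.0;
          double gauge_bytes = sites * 4 * 9 * 16;
          printf("one gauge field: %.2f Gbyte\n", gauge_bytes / 1e9);  /* ~1.2 Gbyte */
          /* The quoted 100 Gbyte covers many such fields, propagators,
           * and solver workspace. */
          return 0;
      }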

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 6

    Communities / Collaborations

    - SciDAC Lattice Gauge Computing
      - Majority of US community
      - Labs: Brookhaven, Fermilab, Jefferson Lab
      - Universities: Indiana, Utah, UCSB, UCSD, MIT, BU, ASU, Illinois, OSU, Columbia…
      - http://www.lqcd.org/
    - Efforts
      - QCDOC (Columbia/BNL)
      - Cluster prototypes (Fermilab, JLab)
      - Software (QMP, QLA, QIO, QDP) to allow applications to run on either
        type of hardware (Chulwoo Jung)
      - O(10 Tflops) in 2004 (QCDOC), 2005/6 (clusters)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 7

    Facilities

    - SciDAC:
      - Jefferson Lab
        Now: 128 nodes / 128 processors (2.0 GHz Xeon), Myrinet
        Soon: 256 nodes / 256? processors (Xeon), GigE mesh
      - Fermilab
        Now: 176 nodes / 352 processors (2.0/2.4 GHz Xeon), Myrinet
        Autumn: O(256) processor expansion (Xeon? Itanium2? Myrinet? GigE?)

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 8

    Facilities

    - DESY Clusters:
      - DESY Hamburg
        16 dual 1.7 GHz Xeon
        16 dual 2.0 GHz Xeon
        Myrinet
      - DESY Zeuthen
        16 dual 1.7 GHz Xeon
        Myrinet

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 9

    [Figure slide]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 10

    LQCD On Purpose-Built Hardware

    - QCDOC: (T. Wettig)
      - "QCD on a Chip" ASIC (0.18µ):
        - Partner with IBM
        - Based on PPC440 + 64-bit FPU, 1 Gflops peak
        - 4 MB EDRAM + controller for external DDR
        - 6 bidirectional LVDS channels, each 2 x 500 Mbps
        - Ethernet
        - ~5 Watts
        - $1/MF assuming 50% of peak (see the scaling sketch after this slide)
      - Packaging:
        - 2 ASICs/daughterboard, 32 daughterboards per motherboard
        - 8 motherboards per backplane (512 processors)
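
    How the quoted cost and packaging figures combine, using only the numbers
    on this slide:

      /* Scaling of the "$1/MF at 50% of peak" figure across the packaging. */
      #include <stdio.h>

      int main(void)
      {
          double peak_gflops_per_asic = 1.0;
          double sustained_fraction   = 0.5;               /* assumed 50% of peak */
          double dollars_per_mflop    = 1.0;

          double sustained_mflops = peak_gflops_per_asic * 1000.0 * sustained_fraction;
          printf("implied cost per ASIC: $%.0f\n",
                 sustained_mflops * dollars_per_mflop);    /* ~$500 */

          /* 8 motherboards x 32 daughterboards x 2 ASICs = 512 processors */
          int asics_per_backplane = 8 * 32 * 2;
          printf("sustained per backplane: %.0f Gflops\n",
                 asics_per_backplane * sustained_mflops / 1000.0);   /* 256 Gflops */
          return 0;
      }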

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 11

    QCDOC

    - Schedule:
      - Funded: 10 Tflops each in late 2003 for Edinburgh, RIKEN-BNL;
        5 Tflops for Columbia (SciDAC) [pending]
      - 10 - 20 Tflops for US @ BNL in 2004
      - ASIC design is done, 1st chips in May

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 12

    QCDOC

    - Performance: (P. Boyle)
      - Single-node kernels, based on simulations:
        - SU3 x SU3        800 Mflops/node
        - SU3 x 2spinor    780 Mflops/node
      - Multinode:
        - Example - Wilson 2^4, 470 Mflops/node, 22 µsec/iteration; each
          iteration has 16 distinct communications (see the arithmetic after
          this slide)
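
    What the multinode figures imply, using only arithmetic on the numbers
    above:

      /* Work and communication spacing implied by the Wilson 2^4 example. */
      #include <stdio.h>

      int main(void)
      {
          double mflops_per_node = 470.0;
          double usec_per_iter   = 22.0;
          int    comms_per_iter  = 16;

          /* Mflops x microseconds = flops */
          printf("useful flops per node per iteration: ~%.0f\n",
                 mflops_per_node * usec_per_iter);          /* ~10,300 flops */
          printf("one communication every ~%.1f usec\n",
                 usec_per_iter / comms_per_iter);           /* ~1.4 usec */
          /* Sub-2 usec spacing between communications is why the O(usec)
           * latency requirement from slide 3 matters at small local volumes. */
          return 0;
      }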

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 13

    QCDOC

    - Fast global sums via network passthru (a cluster-side MPI equivalent is sketched below):
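
    On QCDOC the sum is accumulated by the network hardware as data passes
    through each node. In cluster codes the same reduction (for example the
    inner products of the solver) is typically an MPI_Allreduce; a minimal
    equivalent for comparison:

      /* Global sum of a local partial result across all nodes.
       * Call between MPI_Init and MPI_Finalize. */
      #include <mpi.h>

      double global_sum(double local)
      {
          double global = 0.0;
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          return global;
      }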

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 14

    APEnext (R. De Pietri)

    - In 0.18µ ASIC, native implementation of a fused multiply-add on complex
      numbers: a x b + c (see the sketch after this slide)
    - 256 registers, each holding a complex number (double precision)
    - 200 MHz, ~5 watts, 1.6 Gflops/chip, $1/Mflop @ 50% of peak
    - 6+1 channels of LVDS communications, each 200 Mbyte/sec
      - 3D torus (x, y, z)
      - Fast I/O link to UNIX host
    - Schedule:
      - 300 to 600 ASICs to be delivered in June '03
      - From these, assemble and test a 256-processor crate
      - Mass production to start in Sept '03
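
    The complex fused multiply-add a x b + c expands to 8 real floating-point
    operations, which is where the peak figure comes from (8 flops x 200 MHz =
    1.6 Gflops). A C99 sketch of the operation, for illustration only:

      /* One APEnext-style complex multiply-add in portable C99. */
      #include <complex.h>

      double complex cfma(double complex a, double complex b, double complex c)
      {
          /* equivalent real arithmetic (8 flops):
           *   re = creal(a)*creal(b) - cimag(a)*cimag(b) + creal(c)
           *   im = creal(a)*cimag(b) + cimag(a)*creal(b) + cimag(c) */
          return a * b + c;
      }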

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 15

    [Figure: APEnext node/network diagram - recoverable labels: DDR-MEM, J&T,
    processors 0-15, links X+(cables), Y+(bp), Z+(bp), X+ … Z-]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 16

    LQCD on Clusters

    - Performance (DESY: P. Wegner, Fermilab: D. Holmgren)
    - Common themes:
      - SSE/SSE2 very important for optimizing performance (see the kernel sketch after this slide)
      - Large memory bandwidth increase in Pentium 4 critical to LQCD
      - Building clusters with Myrinet, investigating other networks
      - Pay attention to PCI bus performance, which varies with chipset
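
    To give the flavor of the SSE2 hand-optimization (an illustrative
    fragment, not the actual SciDAC or DESY kernel code): a double-precision
    complex multiply-accumulate, the innermost operation of the Dirac kernel,
    written with SSE2 intrinsics. One 128-bit register holds one complex
    double, and pointers must be 16-byte aligned.

      /* acc += a * b for complex doubles stored as {re, im}. */
      #include <emmintrin.h>

      static inline void cmadd_sse2(const double *a, const double *b, double *acc)
      {
          __m128d va   = _mm_load_pd(a);                /* [ar, ai]        */
          __m128d vb   = _mm_load_pd(b);                /* [br, bi]        */
          __m128d are  = _mm_unpacklo_pd(va, va);       /* [ar, ar]        */
          __m128d aim  = _mm_unpackhi_pd(va, va);       /* [ai, ai]        */
          __m128d bsw  = _mm_shuffle_pd(vb, vb, 1);     /* [bi, br]        */
          __m128d sign = _mm_set_pd(1.0, -1.0);         /* [-1, +1]        */

          /* [ar*br - ai*bi, ar*bi + ai*br] */
          __m128d prod = _mm_add_pd(_mm_mul_pd(are, vb),
                                    _mm_mul_pd(_mm_mul_pd(aim, bsw), sign));
          _mm_store_pd(acc, _mm_add_pd(_mm_load_pd(acc), prod));
      }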

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 17
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    [Bar chart: Parallel (1-dim) Dirac operator benchmark (SSE), even-odd
    preconditioned, 2 x 16^3 lattice, Xeon CPUs, single-CPU performance in
    MFLOPS for 4/8/16 single-CPU and 2/4/8/16 dual-CPU configurations;
    series: 1.7 GHz (i860), 2.0 GHz (i860), 2.4 GHz (E7500);
    Myrinet2000 bandwidth - i860: 90 MB/s, E7500: 190 MB/s]

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 18
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    Maximal Efficiency of external I/O

    Interconnect                                          MFLOPS           MFLOPS         Maximal bandwidth   Efficiency
                                                          (without comm.)  (with comm.)   (Mbyte/sec)
    Infiniband, non-SSE                                   370              297            210 + 210           0.80
    Gigabit Ethernet, non-SSE                             390              228            100 + 100           0.58
    Myrinet/Parastation (E7500), non-blocking, non-SSE    406              368            hidden              0.91
    Myrinet/Parastation (E7500), SSE                      675              446            181 + 181           0.66
    Myrinet/GM (E7500), SSE                               631              432            190 + 190           0.68
    Myrinet (i860), SSE                                   579              307            90 + 90             0.53
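
    The Efficiency column is the sustained rate with communication divided by
    the compute-only rate; a quick cross-check of the table values:

      /* Recompute the Efficiency column from the MFLOPS columns. */
      #include <stdio.h>

      int main(void)
      {
          struct { const char *net; double with_comm, without_comm; } rows[] = {
              { "Infiniband, non-SSE",                      297, 370 },
              { "Gigabit Ethernet, non-SSE",                228, 390 },
              { "Myrinet/Parastation, non-blocking",        368, 406 },
              { "Myrinet/Parastation (E7500), SSE",         446, 675 },
              { "Myrinet/GM (E7500), SSE",                  432, 631 },
              { "Myrinet (i860), SSE",                      307, 579 },
          };
          for (int i = 0; i < 6; i++)
              printf("%-40s %.2f\n", rows[i].net,
                     rows[i].with_comm / rows[i].without_comm);
          return 0;
      }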

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 19
    (slide from Peter Wegner, DESY, CHEP03, 25 March 2003)

    Infiniband interconnect

  • Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 20

    LQCD on Clusters - Management

    - See talks by A. Gellrich, A. Singh
    - Common themes:
      - Linux, PBS, Myrinet, MPI, private networks
      - Web-based monitoring and alarms
      - MRTG history plots (e.g. temperatures, fans)
      - Some Myrinet hardware reliability issues