-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 1
Lattice Gauge Computing Summary
Don Holmgren
Fermilab
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 2
Outline
- Communities / Collaborations
- Existing Facilities
- LQCD on Purpose-Built Machines
- LQCD on Clusters
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 3
Lattice Gauge Computing Requirements
- Very large floating point requirements:
  - Need 100's to 1000's of Gflop/sec-years for results
  - Primary work is inverting large 4-D sparse matrices (a minimal solver sketch follows this list)
  - Sustained performance is often low (20% of peak)
  - Multiprocessors are essential – clusters or massively parallel machines
- Inter-processor communications requirements:
  - High bandwidth (100's of Mbyte/sec)
  - Low latency: O(µsec)
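
To make the sparse-matrix point concrete: the dominant kernel in production running is an iterative Krylov solve of M x = b, where M is the lattice Dirac matrix. Below is a minimal conjugate-gradient sketch in plain C, assuming a hermitian positive-definite operator; the function pointer apply_M is a hypothetical stand-in for the sparse matrix-vector product, which supplies both the flop count and the nearest-neighbor communication that drive the requirements above.

#include <stddef.h>

/* Minimal conjugate-gradient sketch for M x = b (M treated here as a generic
   real, hermitian positive-definite sparse operator).  apply_M stands in for
   the lattice Dirac matrix-vector product; in production codes it touches
   only nearest-neighbor sites and dominates both flops and communication. */

typedef void (*apply_M_fn)(const double *in, double *out, size_t n);

static double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;                      /* needs a global sum on a parallel machine */
}

void cg_solve(apply_M_fn apply_M, const double *b, double *x,
              double *r, double *p, double *t,    /* caller-provided workspace */
              size_t n, double tol, int max_iter) {
    for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rr = dot(r, r, n);
    for (int k = 0; k < max_iter && rr > tol * tol; k++) {
        apply_M(p, t, n);                          /* t = M p : the sparse kernel */
        double alpha = rr / dot(p, t, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * t[i]; }
        double rr_new = dot(r, r, n);
        for (size_t i = 0; i < n; i++) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
}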
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 4
Communities / Collaborations
- LATFOR (Lattice Forum) (K. Jansen)
  - Forum of German lattice physicists, plus groups in Austria (Graz, Wien) and Switzerland (Bern)
  - Universities: Berlin, Bielefeld, Darmstadt, Leipzig, Münster, Regensburg, Wuppertal
  - Research labs: DESY, GSI, NIC
- Efforts:
  - Coordinate physics program
  - Share software
  - Share configurations/propagators (ILDG = International Lattice Data Grid)
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 5
Communities / Collaborations
- LATFOR - Requirements
  - Typical application profile:
      Lattice size   32^3 x 64
      Memory         100 Gbyte
      I/O request    0.1 Gbyte/Gflop
      Runtime        15 Teraflops-years
  - Need 12.5 Teraflops sustained
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 6
Communities / Collaborations
- SciDAC Lattice Gauge Computing
  - Majority of US community
  - Labs: Brookhaven, Fermilab, Jefferson Lab
  - Universities: Indiana, Utah, UCSB, UCSD, MIT, BU, ASU, Illinois, OSU, Columbia…
  - http://www.lqcd.org/
- Efforts
  - QCDOC (Columbia/BNL)
  - Cluster prototypes (Fermilab, JLAB)
  - Software (QMP, QLA, QIO, QDP) to allow applications to run on either type of hardware (Chulwoo Jung); an illustrative sketch follows this list
  - O(10 Tflops) in 2004 (QCDOC), 2005/6 (clusters)
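
One way to picture the role of the software layers named above: the application is written against a small, fixed communications interface, and the build selects a backend appropriate to the hardware. The fragment below is purely illustrative and is not the actual QMP API; the names comm_init, comm_global_sum_double, and the USE_MPI switch are assumptions made for this sketch.

/* Illustrative only -- not the QMP API.  The application calls comm_* and the
   build picks either an MPI backend (clusters) or a trivial single-node stub. */
#ifdef USE_MPI
#include <mpi.h>

void comm_init(int *argc, char ***argv)   { MPI_Init(argc, argv); }
void comm_finalize(void)                  { MPI_Finalize(); }
void comm_global_sum_double(double *x) {
    double local = *x;
    MPI_Allreduce(&local, x, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

#else   /* single-node build: same interface, nothing to communicate */

void comm_init(int *argc, char ***argv)   { (void)argc; (void)argv; }
void comm_finalize(void)                  { }
void comm_global_sum_double(double *x)    { (void)x; }

#endif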
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 7
Facilities
- SciDAC:
  - Jefferson Lab
    Now: 128 nodes / 128 processors (2.0 GHz Xeon), Myrinet
    Soon: 256 nodes / 256? processors (Xeon), GigE mesh
  - Fermilab
    Now: 176 nodes / 352 processors (2.0/2.4 GHz Xeon), Myrinet
    Autumn: O(256)-processor expansion (Xeon? Itanium2? Myrinet? GigE?)
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 8
Facilities
- DESY Clusters:
  - DESY Hamburg
    16 dual 1.7 GHz Xeon
    16 dual 2.0 GHz Xeon
    Myrinet
  - DESY Zeuthen
    16 dual 1.7 GHz Xeon
    Myrinet
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 9
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 10
LQCD On Purpose-Built Hardware
- QCDOC: (T. Wettig)
  - "QCD on a Chip" ASIC (0.18µ):
    - Partnership with IBM
    - Based on PPC440 core + 64-bit FPU, 1 Gflops peak
    - 4 MB EDRAM + controller for external DDR
    - 6 bidirectional LVDS channels, each 2 x 500 Mbps
    - Ethernet
    - ~5 Watts
    - $1/Mflop assuming 50% of peak
  - Packaging:
    - 2 ASICs per daughterboard, 32 daughterboards per motherboard
    - 8 motherboards per backplane (512 processors)
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 11
QCDOC
- Schedule:
  - Funded: 10 Tflops each in late 2003 for Edinburgh and RIKEN-BNL; 5 Tflops for Columbia (SciDAC) [pending]
  - 10 – 20 Tflops for US @ BNL in 2004
  - ASIC design is done, first chips in May
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 12
QCDOC
- Performance: (P. Boyle)
  - Single-node kernels, based on simulations:
    - SU3 x SU3: 800 Mflops/node (a plain-C rendering follows this list)
    - SU3 x 2spinor: 780 Mflops/node
  - Multinode:
    - Example – Wilson 2^4: 470 Mflops/node, 22 µsec/iteration; each iteration has 16 distinct communications
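
For reference, the SU3 x SU3 kernel quoted above is a complex 3x3 matrix multiply applied at every lattice site. The plain-C version below is illustrative only (the QCDOC kernel itself is hand-optimized); each call performs 9 complex dot products of length 3, i.e. 198 real floating-point operations.

#include <complex.h>

/* Plain-C SU(3) x SU(3) multiply, c = a * b, for illustration only.
   198 real floating-point operations per call. */
typedef double complex su3_matrix[3][3];

void su3_mul(const su3_matrix a, const su3_matrix b, su3_matrix c) {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double complex s = 0.0;
            for (int k = 0; k < 3; k++)
                s += a[i][k] * b[k][j];     /* complex multiply-add */
            c[i][j] = s;
        }
}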
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 13
QCDOC
- Fast global sums via network passthru:
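
For comparison with the hardware passthru above: done purely in software, a global sum along one machine dimension needs one store-and-forward hop per node, each paying full send/receive overhead. The MPI fragment below is an illustration of that software ring sum, not QCDOC code; the network passthru accelerates the same operation by forwarding data through intermediate nodes in the network itself.

#include <mpi.h>

/* Software ring global sum of one double (illustrative MPI sketch).
   Each node forwards the value it just received to its next neighbour;
   after size-1 hops every node holds the full sum. */
double ring_global_sum(double local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    double total = local;      /* running sum on this node */
    double send_val = local;   /* value being passed around the ring */
    double recv_val;

    for (int hop = 0; hop < size - 1; hop++) {
        MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, next, 0,
                     &recv_val, 1, MPI_DOUBLE, prev, 0,
                     comm, MPI_STATUS_IGNORE);
        total += recv_val;     /* accumulate the neighbour's contribution */
        send_val = recv_val;   /* and forward it on the next hop */
    }
    return total;
}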
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 14
APEnext (R. De Pietri)
- In 0.18µ ASIC, native implementation of "fused multiply-add": a x b + c on complex numbers (see the sketch after this list)
- 256 registers, each holding a complex number (double precision)
- 200 MHz, ~5 watts, 1.6 Gflops/chip, $1/Mflop @ 50% of peak
- 6+1 channels of LVDS communications, each 200 Mbyte/sec
  - 3D torus (x, y, z)
  - Fast I/O link to UNIX host
- Schedule:
  - 300 to 600 ASICs to be delivered in June '03
  - From these, assemble and test a 256-processor crate
  - Mass production to start in Sept '03
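
A concrete rendering of the native operation above: the complex fused multiply-add a x b + c. The sketch below is ordinary C, not APEnext code; it just spells out the arithmetic. One such operation is 8 real floating-point operations, which at one operation per 200 MHz cycle accounts for the 1.6 Gflops/chip peak quoted above.

#include <complex.h>

/* Ordinary C spelling of the complex fused multiply-add a*b + c.
   4 multiplies + 4 additions/subtractions = 8 real floating-point
   operations per call. */
static inline double complex complex_fma(double complex a, double complex b,
                                         double complex c) {
    double re = creal(a) * creal(b) - cimag(a) * cimag(b) + creal(c);
    double im = creal(a) * cimag(b) + cimag(a) * creal(b) + cimag(c);
    return re + im * I;
}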
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 15
[APEnext board diagram: processors 0–15, each with DDR-MEM and 3D-torus links X+…Z-; Z+ and Y+ connect over the backplane, X+ over cables; a J&T connection is also shown]
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 16
LQCD on Clusters
- Performance (DESY: P. Wegner; Fermilab: D. Holmgren)
- Common themes:
  - SSE/SSE2 very important for optimizing performance (an illustrative fragment follows this list)
  - Large memory bandwidth increase in Pentium 4 critical to LQCD
  - Building clusters with Myrinet, investigating other networks
  - Pay attention to PCI bus performance, which varies with chipset
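
To illustrate the SSE/SSE2 point: SSE2 operates on two double-precision values per instruction, which is what makes hand-vectorized Dirac kernels so much faster on the Pentium 4/Xeon. The fragment below is an illustrative axpy-style loop using SSE2 intrinsics, not the actual DESY or Fermilab kernels; it assumes n is even and the arrays are 16-byte aligned.

#include <emmintrin.h>
#include <stddef.h>

/* Illustrative SSE2 intrinsics for y[i] += a * x[i] on double-precision data.
   Each intrinsic operates on two doubles at once. */
void axpy_sse2(double a, const double *x, double *y, size_t n) {
    __m128d va = _mm_set1_pd(a);              /* broadcast a into both lanes */
    for (size_t i = 0; i < n; i += 2) {
        __m128d vx = _mm_load_pd(&x[i]);      /* load two doubles from x */
        __m128d vy = _mm_load_pd(&y[i]);      /* load two doubles from y */
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));   /* y += a * x, two at a time */
        _mm_store_pd(&y[i], vy);              /* store the result back */
    }
}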
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 17
Peter Wegner, DESY CHEP03, 25 March 2003 10
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16^3 lattice, Xeon CPUs, single-CPU performance
[Bar chart: MFLOPS per CPU (scale 0–500) for 4, 8, and 16 single-CPU nodes and 2, 4, 8, and 16 dual-CPU nodes, comparing 1.7 GHz Xeon (i860), 2.0 GHz Xeon (i860), and 2.4 GHz Xeon (E7500); Myrinet2000 bandwidth: i860 90 MB/s, E7500 190 MB/s]
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 18
Peter Wegner, DESY CHEP03, 25 March 2003 12
Maximal Efficiency of external I/O

Network / configuration                            | MFLOPS (no comm.) | MFLOPS (with comm.) | Maximal bandwidth (MB/s) | Efficiency
Infiniband, non-SSE                                | 370               | 297                 | 210 + 210                | 0.80
Gigabit Ethernet, non-SSE                          | 390               | 228                 | 100 + 100                | 0.58
Myrinet/Parastation (E7500), non-blocking, non-SSE | 406               | 368                 | hidden                   | 0.91
Myrinet/Parastation (E7500), SSE                   | 675               | 446                 | 181 + 181                | 0.66
Myrinet/GM (E7500), SSE                            | 631               | 432                 | 190 + 190                | 0.68
Myrinet (i860), SSE                                | 579               | 307                 | 90 + 90                  | 0.53

(Efficiency = MFLOPS with communication / MFLOPS without communication.)
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 19
Peter Wegner, DESY CHEP03, 25 March 2003 15
Infiniband interconnect
-
Don Holmgren - CHEP'03 - La Jolla - March 28, 2003 20
LQCD on Clusters - Management
- See talks by A. Gellrich and A. Singh
- Common themes:
  - Linux, PBS, Myrinet, MPI, private networks
  - Web-based monitoring and alarms
  - MRTG history plots (e.g. temperatures, fans)
  - Some Myrinet hardware reliability issues