TRANSCRIPT
Lattice QCD and GPUs
Robert Edwards, Theory Group; Chip Watson, HPC & CIO;
Jie Chen & Balint Joo, HPC
Jefferson Lab
Outline
This talk describes how:
• Capability computing + capacity computing + SciDAC deliver science & NP milestones
• Collaborative efforts involve USQCD + JLab and the DOE + NSF user communities
Hadronic & Nuclear Physics with LQCD
• Hadronic spectroscopy
  – Hadron resonance determinations
  – Exotic meson spectrum (JLab 12 GeV)
• Hadronic structure
  – 3-D picture of hadrons from gluon & quark spin + flavor distributions
  – Ground & excited E&M transition form factors (JLab 6 GeV + 12 GeV + Mainz)
  – E&M polarizabilities of hadrons (Duke + CERN + Lund)
• Nuclear interactions
  – Nuclear processes relevant for stellar evolution
  – Hyperon-hyperon scattering
  – 3- & 4-nucleon interaction properties [Collab. w/ LLNL] (JLab + LLNL)
• Beyond the Standard Model
  – Neutron decay constraints on BSM from the Ultra Cold Neutron source (LANL)
Spectroscopy
• Spectroscopy reveals fundamental aspects of hadronic physics
  – Essential degrees of freedom?
  – Gluonic excitations in mesons: exotic states of matter?
• Status
  – Can extract excited hadron energies & identify spins
  – Pursuing full QCD calculations with realistic quark masses
• New spectroscopy programs world-wide
  – E.g., BES III (Beijing), GSI/PANDA (Darmstadt)
  – Crucial complement to the 12 GeV program at JLab
• Excited nucleon spectroscopy (JLab)
• JLab GlueX: search for gluonic excitations
USQCD National Effort
US Lattice QCD effort: Jefferson Lab, BNL, and FNAL
• FNAL: weak matrix elements
• BNL: RHIC physics
• JLab: hadronic physics
SciDAC is the R&D vehicle (software R&D).
INCITE resources (~20 TF-yr) + USQCD cluster facilities (17 TF-yr):
impact on DOE's High Energy & Nuclear Physics Program.
QCD: Theory of Strong Interactions
• QCD: the theory of quarks & gluons
• Lattice QCD: approximate space-time with a grid
  – Systematically improvable
• Gluon (gauge) generation:
  – Produce "configurations" via importance sampling
  – Rewritten as differential equations: a sparse-matrix solve per step avoids the "determinant" problem
• Analysis:
  – Compute observables as averages over configurations (sketch below)
• Requires large-scale computing resources
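To make the analysis step concrete, here is a minimal sketch in plain C++ of averaging an observable over an ensemble of configurations. measureObservable() is a hypothetical placeholder rather than any actual JLab code, and real analyses use jackknife or bootstrap error estimates instead of the naive error shown here.

  // Minimal sketch: an observable is measured on each gauge configuration
  // and the physics result is the ensemble average with a statistical error.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Hypothetical placeholder: a real code would load configuration `cfg`
  // and compute correlation functions; here we just return a dummy value.
  double measureObservable(int cfg) {
      return 1.0 + 0.01 * std::sin(cfg);
  }

  int main() {
      const int nCfg = 500;                 // number of configurations in the ensemble
      std::vector<double> obs(nCfg);
      for (int i = 0; i < nCfg; ++i)
          obs[i] = measureObservable(i);

      double mean = 0.0;
      for (double o : obs) mean += o;
      mean /= nCfg;

      double var = 0.0;                     // naive error estimate on the mean
      for (double o : obs) var += (o - mean) * (o - mean);
      const double err = std::sqrt(var / (nCfg * (nCfg - 1.0)));

      std::printf("<O> = %g +/- %g\n", mean, err);
      return 0;
  }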
Gauge Generation: Cost Scaling
• Cost for reasonable statistics, box size, and "physical" pion mass
• Extrapolated in lattice spacing: 10 ~ 100 PF-yr
[Figure: cost (PF-years) vs. pion mass for state-of-the-art calculations; today ~10 TF-yr, 2011 ~100 TF-yr]
Typical LQCD Workflow
• Generate the configurations
  – Leadership level: ~24k cores, 10 TF-yr
  – Few big jobs, few big files
[Diagram: measurement on the lattice from t = 0 to t = T]
• Analyze
  – Typically mid-range level, ~256 cores
  – Many small jobs, many big files
  – I/O movement
• Extract information from the measured observables
Computational Requirements
Ratio of gauge generation (INCITE) to analysis (LQCD) for current calculations:
• Weak matrix elements: 1 : 1
• Baryon spectroscopy: 1 : 10
• Nuclear structure: 1 : 4
Computational requirements, INCITE : LQCD Computing — 1 : 1 (2005), 1 : 3 (2010)
Current availability: INCITE (~20 TF) : LQCD (17 TF)
Core work: solving a sparse matrix equation iteratively
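As a sketch of that core kernel, the following plain C++ conjugate-gradient solver works on the normal equations (A = M†M for the Dirac matrix M). The sparse operator application is passed in as a callback, since the actual Dirac operator is not shown in these slides.

  // Conjugate gradient for A x = b with A Hermitian positive definite,
  // e.g. the normal equations M^dag M x = M^dag b for the Dirac matrix M.
  #include <cmath>
  #include <cstddef>
  #include <functional>
  #include <vector>

  using Vec = std::vector<double>;

  void cg(const std::function<void(const Vec&, Vec&)>& applyA,   // y = A x (sparse)
          const Vec& b, Vec& x, int maxIter, double tol)
  {
      const size_t n = b.size();
      x.assign(n, 0.0);
      Vec r = b, p = b, Ap(n);               // with x0 = 0 the initial residual is b
      auto dot = [](const Vec& a, const Vec& c) {
          double s = 0.0;
          for (size_t i = 0; i < a.size(); ++i) s += a[i] * c[i];
          return s;
      };
      double rr = dot(r, r);
      for (int k = 0; k < maxIter && std::sqrt(rr) > tol; ++k) {
          applyA(p, Ap);                                   // sparse matrix-vector product
          const double alpha = rr / dot(p, Ap);            // step length
          for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
          const double rrNew = dot(r, r);
          const double beta = rrNew / rr;                  // search-direction update
          for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
          rr = rrNew;
      }
  }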
SciDAC Impact
• Software development
  – QCD-friendly APIs and libraries enable high user productivity
  – Allow rapid prototyping & optimization
  – Significant software effort for GPUs
• Algorithm improvements
  – Operators & contractions on clusters (Distillation: PRL (2009))
  – Mixed-precision Dirac solvers on INCITE machines, clusters, and GPUs: 2-3x (sketch below)
  – Adaptive multigrid solvers on clusters: ~8x (?)
• Hardware development via USQCD Facilities
  – Adding support for new hardware: GPUs
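A minimal sketch of the mixed-precision idea behind that 2-3x gain, assuming a standard defect-correction (iterative refinement) scheme: the inner solve runs in single precision while the outer loop keeps the solution and the true residual in double precision. The function arguments here are hypothetical placeholders, not the SciDAC API.

  // Mixed-precision defect-correction sketch (plain C++ / CUDA host code).
  // Outer loop in double precision; inner approximate solve in single precision.
  #include <cmath>
  #include <cstddef>
  #include <functional>
  #include <vector>

  using VecD = std::vector<double>;
  using VecF = std::vector<float>;

  void mixedPrecisionSolve(
      const std::function<void(const VecD&, VecD&)>& applyA,              // y = A x, double precision
      const std::function<void(const VecF&, VecF&, double)>& solveSingle, // e ~ A^{-1} r, single precision
      const VecD& b, VecD& x, double tol, int maxOuter)
  {
      const size_t n = b.size();
      x.assign(n, 0.0);
      VecD r = b, Ax(n);
      for (int k = 0; k < maxOuter; ++k) {
          // Residual norm computed in double precision
          double rnorm = 0.0;
          for (double ri : r) rnorm += ri * ri;
          if (std::sqrt(rnorm) < tol) break;

          // Demote the residual and solve approximately in single precision
          VecF rf(n), ef(n, 0.0f);
          for (size_t i = 0; i < n; ++i) rf[i] = static_cast<float>(r[i]);
          solveSingle(rf, ef, 0.1);                        // loose inner tolerance

          // Promote the correction, update, and recompute the true residual
          for (size_t i = 0; i < n; ++i) x[i] += static_cast<double>(ef[i]);
          applyA(x, Ax);
          for (size_t i = 0; i < n; ++i) r[i] = b[i] - Ax[i];
      }
  }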
Modern GPU Characteristics
• Hundreds of simple cores: high flop rate
• SIMD architecture (single instruction, multiple data)
• Complex (high-bandwidth) memory hierarchy
• Fast context switching hides memory access latency (see the kernel sketch below)
• Gaming cards: no memory error correction (ECC), a reliability issue
• I/O bandwidth << memory bandwidth

  Commodity processors     x86 CPU                NVIDIA GT200             New Fermi GPU
  # cores                  8                      240                      480
  Clock speed              3.2 GHz                1.4 GHz                  1.4 GHz
  Main memory bandwidth    20 GB/s                160 GB/s (gaming card)   180 GB/s (gaming card)
  I/O bandwidth            7 GB/s (dual QDR IB)   3 GB/s                   4 GB/s
  Power                    80 watts               200 watts                250 watts
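The programming model these characteristics imply, in a minimal CUDA kernel: one lightweight thread per array element, with enough threads in flight that the scheduler can hide memory latency. This is a generic axpy sketch, not the actual inverter code.

  // Minimal CUDA sketch: many lightweight threads, one element each (SIMD-style).
  // Oversubscribing the cores with threads lets the hardware hide memory latency.
  #include <cuda_runtime.h>

  __global__ void axpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
      if (i < n)
          y[i] = a * x[i] + y[i];                      // one fused multiply-add per thread
  }

  void runAxpy(int n, float a, const float* d_x, float* d_y) {
      int threads = 256;                               // threads per block
      int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n elements
      axpy<<<blocks, threads>>>(n, a, d_x, d_y);
      cudaDeviceSynchronize();
  }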
A Large Capacity Resource
• 530 GPUs at Jefferson Lab (July): 200,000 cores (1,600 million core-hours/year)
• 600 Tflops peak single precision
• 100 Tflops aggregate sustained in the inverter (mixed half/single precision)
• Significant increase in dedicated USQCD resources
• All this for only $1M, including hosts, networking, etc.
Disclaimer:
• To exploit this performance, code has to run on the GPUs, not the CPU (Amdahl's Law problem).
• SciDAC-2 (& 3) software effort: move more inverters & other code to the GPU.
New Science Reach in 2010-2011: QCD Spectrum
• Gauge generation (next dataset)
  – INCITE: Crays & BG/Ps, ~16K-24K cores
  – Double precision
• Analysis (existing dataset): two classes
  – Propagators (Dirac matrix inversions)
    • Few-GPU level
    • Single + half precision
    • No memory error correction
  – Contractions
    • Clusters: few cores
    • Double precision + large memory footprint
[Chart: analysis cost (TF-yr) — new: 10 TF-yr, old: 1 TF-yr]
Exotic Matter
• Exotics: first GPU results
• Suggests (many) exotics within range of JLab Hall D
• Previous work: photoproduction rates are high
• Current GPU work: (strong) decays, an important experimental input

Baryon Spectrum
• "Missing resonance problem"
  – What are the collective modes?
  – What is the structure of the states?
  – Major focus of (and motivation for) JLab Hall B
  – Not resolved experimentally at 6 GeV
Nucleon & Delta Spectrum
• First results from GPUs
[Figure: nucleon & Delta spectra with < 2% error bars; states labeled by quark-model multiplets such as [70,1-] P-wave and [56,2+] D-wave]
• Discern structure via wave-function overlaps
• Does it change at light quark mass? Decays!
• Suggests a spectrum at least as dense as the quark model
Towards Resonance Determinations
• Augment with multi-particle operators
  – Needs "annihilation diagrams", provided by Distillation (ideally suited for GPUs)
• Resonance determination
  – Scattering in a finite box gives discrete energy levels
  – Lüscher finite-volume techniques (see the relation quoted below)
  – Phase shifts → widths
• First results (partially from GPUs)
  – Seems practical (arXiv:0905.2160)
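For orientation, the s-wave Lüscher relation in its standard form (quoted from the general literature, not from these slides): a discrete two-particle energy level in a box of spatial size L determines a momentum p, which maps onto the infinite-volume phase shift via

$$ p\,\cot\delta_0(p) \;=\; \frac{2}{\sqrt{\pi}\,L}\,\mathcal{Z}_{00}\!\left(1;\, q^2\right), \qquad q = \frac{pL}{2\pi}, $$

where \(\mathcal{Z}_{00}\) is the generalized zeta function; the resonance width then follows from the energy dependence of \(\delta_0(p)\) across the resonance.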
Extending science reach
• USQCD:
  – Next calculations at physical quark masses: 100 TF-yr to 1 PF-yr
  – New INCITE + Early Science applications (ANL + ORNL + NERSC)
  – NSF Blue Waters petascale (PRAC)
• Need SciDAC-3
  – Significant software effort for next-generation GPUs & heterogeneous environments
  – Participate in emerging ASCR Exascale initiatives
• INCITE + LQCD synergy:
  – ARRA GPU system is well matched to current leadership facilities
Path to Exascale
• Enabled by some hybrid GPU system? Cray + NVIDIA??
• NSF GaTech: Tier 2 (experimental facility)
  – Phase 1: HP cluster + GPU (NVIDIA Tesla)
  – Phase 2: hybrid GPU + <partner>
• ASCR Exascale facility
  – Case studies for science, software + runtime, hardware
• An Exascale capacity resource will also be needed
Summary
• Capability + capacity + SciDAC deliver science & HEP + NP milestones
• Petascale (leadership) + petascale (capacity) + SciDAC-3: spectrum + decays; first contact with experimental resolution
• Exascale (leadership) + exascale (capacity) + SciDAC-3: full resolution; spectrum + transitions; nuclear structure
• Collaborative efforts: USQCD + JLab user communities
Hardware: ARRA GPU Cluster
• Host:
  – 2.4 GHz Nehalem
  – 48 GB memory/node
  – 65 nodes, 200 GPUs
• Original configuration:
  – 40 nodes w/ 4 GTX-285 GPUs
  – 16 nodes w/ 2 GTX-285 + QDR IB
  – 2 nodes w/ 4 Tesla C1050 or S1070
• One quad-GPU node = one rack of conventional nodes
SciDAC Software Stack
• QCD-friendly APIs/libraries: http://www.usqcd.org
[Diagram: layered stack — application level; high-level (LAPACK-like); architectural level (data parallel); GPUs]
Dirac Inverter with Parallel GPUs
Divide the problem among nodes:
• Trade-offs
  – On-node vs. off-node bandwidths
  – Locality vs. memory bandwidth
• Efficient at large problem size per node (see the sketch below)
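The large-problem-size statement is a surface-to-volume effect: off-node halo traffic scales with the surface of the local sub-lattice, while the arithmetic scales with its volume. A schematic estimate in plain C++, assuming a simple 4D domain decomposition (not the production code):

  // Schematic surface-to-volume estimate for a 4D local sub-lattice of extent L.
  // Off-node halo traffic ~ surface sites, on-node work ~ volume sites,
  // so a larger local volume per node improves parallel efficiency.
  #include <cstdio>

  int main() {
      for (int L = 4; L <= 32; L *= 2) {
          long volume  = (long)L * L * L * L;   // L^4 sites of local work
          long surface = 8L * L * L * L;        // 2 faces in each of 4 dimensions: 8 L^3 sites
          std::printf("L=%2d  volume=%8ld  surface=%7ld  surface/volume=%.3f\n",
                      L, volume, surface, (double)surface / volume);
      }
      return 0;
  }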
Amdahl’s Law (Problem)
Also disappointing: the GPU is idle 80% of the time!
Conclusion: need to move more code to the GPU, and/or need task level parallelism (overlap CPU and GPU)
Jefferson Lab has split this workload into two jobs (red and black), for 2 machines (conventional, GPU)
• 2x clock time improvement
• A major challenge in exploiting GPUs is Amdahl’s Law:• If 60% of the code is GPU accelerated by 6x, • the net gain is only 2x.
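The 2x figure is just Amdahl's law with accelerated fraction f = 0.6 and speedup s = 6:

$$ S \;=\; \frac{1}{(1-f) + f/s} \;=\; \frac{1}{0.4 + 0.6/6} \;=\; \frac{1}{0.5} \;=\; 2. $$

Of the remaining 0.5 units of normalized runtime, only 0.1 is GPU work, i.e. the GPU is busy about 20% of the time, matching the 80% idle figure above.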
Considerable Software R&D is Needed
Up until now: the O/S & RTS form a 'thin layer' between application and hardware.
• Current stack (bottom to top): hardware; device drivers; Linux or mKernel; RTS, MPI (?); user application space.
• Exascale X-Stack (?) (bottom to top): hardware; device drivers; power, RAS, memory management; RTS (scheduling, load balancing, work stealing, programming-model coexistence); MPI (?); programming model (hybrid MPI + node parallelism; PGAS? Chapel?); libraries (BLAS, PETSc, Trilinos, ...); user application space.
Need SciDAC-3 to move to Exascale
The SciDAC software stack, layer by layer:
• Application layer: Chroma, CPS, MILC
• Level 3 (Optimization): MDWF, QOP (optimized Dirac operators)
• Level 2 (Data Parallel): QDP++, QDP/C, QIO
• Level 1 (Basics): QMP (message passing), QLA, QMT (threads)
• Plus: QA0, GCC-BGL, workflow, viz. tools
• Plus tools from collaborations with other SciDAC projects, e.g. PERI
Need SciDAC-3
• Application porting to new programming models/languages
  – Node abstraction for portability (like QDP++ now?)
  – Interactions with a more restrictive (liberating?) exascale stack?
• Performance libraries for exascale hardware
  – Like Level 3 currently
  – Will need productivity tools
• Domain-specific languages (QDP++ is almost this)
• Code generators (more QA0, BAGEL, etc.)
• Performance monitoring
• Debugging, simulation
• Algorithms for greater concurrency / reduced synchronization