TRANSCRIPT
Lattice QCD and GPUs
Robert Edwards, Theory Group; Chip Watson, HPC & CIO;
Jie Chen & Balint Joo, HPC
Jefferson Lab
Outline
This talk describes how:
• Capability computing + capacity computing + SciDAC deliver science & NP milestones
• Collaborative efforts involve USQCD + JLab and the DOE + NSF user communities
Hadronic & Nuclear Physics with LQCD
• Hadronic spectroscopy
  – Hadron resonance determinations
  – Exotic meson spectrum (JLab 12 GeV)
• Hadronic structure
  – 3-D picture of hadrons from gluon & quark spin + flavor distributions
  – Ground & excited E&M transition form factors (JLab 6 GeV + 12 GeV + Mainz)
  – E&M polarizabilities of hadrons (Duke + CERN + Lund)
• Nuclear interactions
  – Nuclear processes relevant for stellar evolution
  – Hyperon-hyperon scattering
  – 3- & 4-nucleon interaction properties [Collab. w/ LLNL] (JLab + LLNL)
• Beyond the Standard Model
  – Neutron decay constraints on BSM from the Ultra Cold Neutron source (LANL)
Spectroscopy
• Spectroscopy reveals fundamental aspects of hadronic physics
  – Essential degrees of freedom?
  – Gluonic excitations in mesons: exotic states of matter?
• Status
  – Can extract excited hadron energies & identify spins
  – Pursuing full QCD calculations with realistic quark masses
• New spectroscopy programs world-wide
  – E.g., BES III (Beijing), GSI/PANDA (Darmstadt)
  – Crucial complement to the 12 GeV program at JLab
• Excited nucleon spectroscopy (JLab)
• JLab GlueX: search for gluonic excitations
USQCD National Effort
US Lattice QCD effort: Jefferson Lab, BNL, and FNAL
• FNAL: weak matrix elements
• BNL: RHIC physics
• JLab: hadronic physics
SciDAC is the R&D vehicle (software R&D).
INCITE resources (~20 TF-yr) + USQCD cluster facilities (17 TF-yr):
impact on DOE's High Energy & Nuclear Physics Program.
QCD: Theory of Strong Interactions
• QCD: the theory of quarks & gluons
• Lattice QCD: approximate space-time with a grid
  – Systematically improvable
• Gluon (gauge) generation:
  – Produce "configurations" via importance sampling
  – Rewritten as differential equations: a sparse-matrix solve per step avoids the "determinant" problem
• Analysis:
  – Compute observables as averages over configurations (sketch below)
• Requires large-scale computing resources
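To make the analysis step concrete, here is a minimal sketch in plain C++ of averaging an observable over an ensemble of configurations. measureObservable() is a hypothetical placeholder rather than any actual JLab code, and real analyses use jackknife or bootstrap error estimates instead of the naive error shown here.

  // Minimal sketch: an observable is measured on each gauge configuration
  // and the physics result is the ensemble average with a statistical error.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Hypothetical placeholder: a real code would load configuration `cfg`
  // and compute correlation functions; here we just return a dummy value.
  double measureObservable(int cfg) {
      return 1.0 + 0.01 * std::sin(cfg);
  }

  int main() {
      const int nCfg = 500;                 // number of configurations in the ensemble
      std::vector<double> obs(nCfg);
      for (int i = 0; i < nCfg; ++i)
          obs[i] = measureObservable(i);

      double mean = 0.0;
      for (double o : obs) mean += o;
      mean /= nCfg;

      double var = 0.0;                     // naive error estimate on the mean
      for (double o : obs) var += (o - mean) * (o - mean);
      const double err = std::sqrt(var / (nCfg * (nCfg - 1.0)));

      std::printf("<O> = %g +/- %g\n", mean, err);
      return 0;
  }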
Gauge Generation: Cost Scaling
• Cost for reasonable statistics, box size, and "physical" pion mass
• Extrapolated in lattice spacing: 10 ~ 100 PF-yr
[Figure: cost (PF-years) vs. pion mass for state-of-the-art calculations; today ~10 TF-yr, 2011 ~100 TF-yr]
Typical LQCD Workflow
• Generate the configurations
  – Leadership level: ~24k cores, 10 TF-yr
  – Few big jobs, few big files
[Diagram: measurement on the lattice from t = 0 to t = T]
• Analyze
  – Typically mid-range level, ~256 cores
  – Many small jobs, many big files
  – I/O movement
• Extract information from the measured observables
Computational Requirements
Ratio of gauge generation (INCITE) to analysis (LQCD) for current calculations:
• Weak matrix elements: 1 : 1
• Baryon spectroscopy: 1 : 10
• Nuclear structure: 1 : 4
Computational requirements, INCITE : LQCD Computing — 1 : 1 (2005), 1 : 3 (2010)
Current availability: INCITE (~20 TF) : LQCD (17 TF)
Core work: solving a sparse matrix equation iteratively
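As a sketch of that core kernel, the following plain C++ conjugate-gradient solver works on the normal equations (A = M†M for the Dirac matrix M). The sparse operator application is passed in as a callback, since the actual Dirac operator is not shown in these slides.

  // Conjugate gradient for A x = b with A Hermitian positive definite,
  // e.g. the normal equations M^dag M x = M^dag b for the Dirac matrix M.
  #include <cmath>
  #include <cstddef>
  #include <functional>
  #include <vector>

  using Vec = std::vector<double>;

  void cg(const std::function<void(const Vec&, Vec&)>& applyA,   // y = A x (sparse)
          const Vec& b, Vec& x, int maxIter, double tol)
  {
      const size_t n = b.size();
      x.assign(n, 0.0);
      Vec r = b, p = b, Ap(n);               // with x0 = 0 the initial residual is b
      auto dot = [](const Vec& a, const Vec& c) {
          double s = 0.0;
          for (size_t i = 0; i < a.size(); ++i) s += a[i] * c[i];
          return s;
      };
      double rr = dot(r, r);
      for (int k = 0; k < maxIter && std::sqrt(rr) > tol; ++k) {
          applyA(p, Ap);                                   // sparse matrix-vector product
          const double alpha = rr / dot(p, Ap);            // step length
          for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
          const double rrNew = dot(r, r);
          const double beta = rrNew / rr;                  // search-direction update
          for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
          rr = rrNew;
      }
  }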
SciDAC Impact
• Software development
  – QCD-friendly APIs and libraries enable high user productivity
  – Allow rapid prototyping & optimization
  – Significant software effort for GPUs
• Algorithm improvements
  – Operators & contractions on clusters (Distillation: PRL (2009))
  – Mixed-precision Dirac solvers on INCITE machines, clusters, and GPUs: 2-3x (sketch below)
  – Adaptive multigrid solvers on clusters: ~8x (?)
• Hardware development via USQCD Facilities
  – Adding support for new hardware: GPUs
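A minimal sketch of the mixed-precision idea behind that 2-3x gain, assuming a standard defect-correction (iterative refinement) scheme: the inner solve runs in single precision while the outer loop keeps the solution and the true residual in double precision. The function arguments here are hypothetical placeholders, not the SciDAC API.

  // Mixed-precision defect-correction sketch (plain C++ / CUDA host code).
  // Outer loop in double precision; inner approximate solve in single precision.
  #include <cmath>
  #include <cstddef>
  #include <functional>
  #include <vector>

  using VecD = std::vector<double>;
  using VecF = std::vector<float>;

  void mixedPrecisionSolve(
      const std::function<void(const VecD&, VecD&)>& applyA,              // y = A x, double precision
      const std::function<void(const VecF&, VecF&, double)>& solveSingle, // e ~ A^{-1} r, single precision
      const VecD& b, VecD& x, double tol, int maxOuter)
  {
      const size_t n = b.size();
      x.assign(n, 0.0);
      VecD r = b, Ax(n);
      for (int k = 0; k < maxOuter; ++k) {
          // Residual norm computed in double precision
          double rnorm = 0.0;
          for (double ri : r) rnorm += ri * ri;
          if (std::sqrt(rnorm) < tol) break;

          // Demote the residual and solve approximately in single precision
          VecF rf(n), ef(n, 0.0f);
          for (size_t i = 0; i < n; ++i) rf[i] = static_cast<float>(r[i]);
          solveSingle(rf, ef, 0.1);                        // loose inner tolerance

          // Promote the correction, update, and recompute the true residual
          for (size_t i = 0; i < n; ++i) x[i] += static_cast<double>(ef[i]);
          applyA(x, Ax);
          for (size_t i = 0; i < n; ++i) r[i] = b[i] - Ax[i];
      }
  }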
Modern GPU Characteristics
• Hundreds of simple cores: high flop rate
• SIMD architecture (single instruction, multiple data)
• Complex (high-bandwidth) memory hierarchy
• Fast context switching hides memory access latency (see the kernel sketch below)
• Gaming cards: no memory error correction (ECC), a reliability issue
• I/O bandwidth << memory bandwidth

  Commodity processors     x86 CPU                NVIDIA GT200             New Fermi GPU
  # cores                  8                      240                      480
  Clock speed              3.2 GHz                1.4 GHz                  1.4 GHz
  Main memory bandwidth    20 GB/s                160 GB/s (gaming card)   180 GB/s (gaming card)
  I/O bandwidth            7 GB/s (dual QDR IB)   3 GB/s                   4 GB/s
  Power                    80 watts               200 watts                250 watts
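The programming model these characteristics imply, in a minimal CUDA kernel: one lightweight thread per array element, with enough threads in flight that the scheduler can hide memory latency. This is a generic axpy sketch, not the actual inverter code.

  // Minimal CUDA sketch: many lightweight threads, one element each (SIMD-style).
  // Oversubscribing the cores with threads lets the hardware hide memory latency.
  #include <cuda_runtime.h>

  __global__ void axpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
      if (i < n)
          y[i] = a * x[i] + y[i];                      // one fused multiply-add per thread
  }

  void runAxpy(int n, float a, const float* d_x, float* d_y) {
      int threads = 256;                               // threads per block
      int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n elements
      axpy<<<blocks, threads>>>(n, a, d_x, d_y);
      cudaDeviceSynchronize();
  }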
A Large Capacity Resource
• 530 GPUs at Jefferson Lab (July): 200,000 cores (1,600 million core-hours/year)
• 600 Tflops peak single precision
• 100 Tflops aggregate sustained in the inverter (mixed half/single precision)
• Significant increase in dedicated USQCD resources
• All this for only $1M, including hosts, networking, etc.
Disclaimer:
• To exploit this performance, code has to run on the GPUs, not the CPU (Amdahl's Law problem).
• SciDAC-2 (& 3) software effort: move more inverters & other code to the GPU.
New Science Reach in 2010-2011: QCD Spectrum
• Gauge generation (next dataset)
  – INCITE: Crays & BG/Ps, ~16K-24K cores
  – Double precision
• Analysis (existing dataset): two classes
  – Propagators (Dirac matrix inversions)
    • Few-GPU level
    • Single + half precision
    • No memory error correction
  – Contractions
    • Clusters: few cores
    • Double precision + large memory footprint
[Chart: analysis cost (TF-yr) — new: 10 TF-yr, old: 1 TF-yr]
Exotic Matter
• Exotics: first GPU results
• Suggests (many) exotics within range of JLab Hall D
• Previous work: photoproduction rates are high
• Current GPU work: (strong) decays, an important experimental input

Baryon Spectrum
• "Missing resonance problem"
  – What are the collective modes?
  – What is the structure of the states?
  – Major focus of (and motivation for) JLab Hall B
  – Not resolved experimentally at 6 GeV
Nucleon & Delta Spectrum
• First results from GPUs
[Figure: nucleon & Delta spectra with < 2% error bars; states labeled by quark-model multiplets such as [70,1-] P-wave and [56,2+] D-wave]
• Discern structure via wave-function overlaps
• Does it change at light quark mass? Decays!
• Suggests a spectrum at least as dense as the quark model
Towards Resonance Determinations
• Augment with multi-particle operators
  – Needs "annihilation diagrams", provided by Distillation (ideally suited for GPUs)
• Resonance determination
  – Scattering in a finite box gives discrete energy levels
  – Lüscher finite-volume techniques (see the relation quoted below)
  – Phase shifts → widths
• First results (partially from GPUs)
  – Seems practical (arXiv:0905.2160)
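For orientation, the s-wave Lüscher relation in its standard form (quoted from the general literature, not from these slides): a discrete two-particle energy level in a box of spatial size L determines a momentum p, which maps onto the infinite-volume phase shift via

$$ p\,\cot\delta_0(p) \;=\; \frac{2}{\sqrt{\pi}\,L}\,\mathcal{Z}_{00}\!\left(1;\, q^2\right), \qquad q = \frac{pL}{2\pi}, $$

where \(\mathcal{Z}_{00}\) is the generalized zeta function; the resonance width then follows from the energy dependence of \(\delta_0(p)\) across the resonance.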
Extending science reach
• USQCD:
  – Next calculations at physical quark masses: 100 TF-yr to 1 PF-yr
  – New INCITE + Early Science applications (ANL + ORNL + NERSC)
  – NSF Blue Waters petascale (PRAC)
• Need SciDAC-3
  – Significant software effort for next-generation GPUs & heterogeneous environments
  – Participate in emerging ASCR Exascale initiatives
• INCITE + LQCD synergy:
  – ARRA GPU system is well matched to current leadership facilities
Path to Exascale
• Enabled by some hybrid GPU system? Cray + NVIDIA??
• NSF GaTech: Tier 2 (experimental facility)
  – Phase 1: HP cluster + GPU (NVIDIA Tesla)
  – Phase 2: hybrid GPU + <partner>
• ASCR Exascale facility
  – Case studies for science, software + runtime, hardware
• An Exascale capacity resource will also be needed
Summary
• Capability + capacity + SciDAC deliver science & HEP + NP milestones
• Petascale (leadership) + petascale (capacity) + SciDAC-3: spectrum + decays; first contact with experimental resolution
• Exascale (leadership) + exascale (capacity) + SciDAC-3: full resolution; spectrum + transitions; nuclear structure
• Collaborative efforts: USQCD + JLab user communities
Hardware: ARRA GPU Cluster
• Host:
  – 2.4 GHz Nehalem
  – 48 GB memory/node
  – 65 nodes, 200 GPUs
• Original configuration:
  – 40 nodes w/ 4 GTX-285 GPUs
  – 16 nodes w/ 2 GTX-285 + QDR IB
  – 2 nodes w/ 4 Tesla C1050 or S1070
• One quad-GPU node = one rack of conventional nodes
SciDAC Software Stack
• QCD-friendly APIs/libraries: http://www.usqcd.org
[Diagram: layered stack — application level; high-level (LAPACK-like); architectural level (data parallel); GPUs]
Dirac Inverter with Parallel GPUs
Divide the problem among nodes:
• Trade-offs
  – On-node vs. off-node bandwidths
  – Locality vs. memory bandwidth
• Efficient at large problem size per node (see the sketch below)
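The large-problem-size statement is a surface-to-volume effect: off-node halo traffic scales with the surface of the local sub-lattice, while the arithmetic scales with its volume. A schematic estimate in plain C++, assuming a simple 4D domain decomposition (not the production code):

  // Schematic surface-to-volume estimate for a 4D local sub-lattice of extent L.
  // Off-node halo traffic ~ surface sites, on-node work ~ volume sites,
  // so a larger local volume per node improves parallel efficiency.
  #include <cstdio>

  int main() {
      for (int L = 4; L <= 32; L *= 2) {
          long volume  = (long)L * L * L * L;   // L^4 sites of local work
          long surface = 8L * L * L * L;        // 2 faces in each of 4 dimensions: 8 L^3 sites
          std::printf("L=%2d  volume=%8ld  surface=%7ld  surface/volume=%.3f\n",
                      L, volume, surface, (double)surface / volume);
      }
      return 0;
  }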
Amdahl’s Law (Problem)
Also disappointing: the GPU is idle 80% of the time!
Conclusion: need to move more code to the GPU, and/or need task level parallelism (overlap CPU and GPU)
Jefferson Lab has split this workload into two jobs (red and black), for 2 machines (conventional, GPU)
• 2x clock time improvement
• A major challenge in exploiting GPUs is Amdahl’s Law:• If 60% of the code is GPU accelerated by 6x, • the net gain is only 2x.
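The 2x figure is just Amdahl's law with accelerated fraction f = 0.6 and speedup s = 6:

$$ S \;=\; \frac{1}{(1-f) + f/s} \;=\; \frac{1}{0.4 + 0.6/6} \;=\; \frac{1}{0.5} \;=\; 2. $$

Of the remaining 0.5 units of normalized runtime, only 0.1 is GPU work, i.e. the GPU is busy about 20% of the time, matching the 80% idle figure above.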
Considerable Software R&D is Needed
Up until now: the O/S & RTS form a 'thin layer' between application and hardware.
• Current stack (bottom to top): hardware; device drivers; Linux or mKernel; RTS, MPI (?); user application space.
• Exascale X-Stack (?) (bottom to top): hardware; device drivers; power, RAS, memory management; RTS (scheduling, load balancing, work stealing, programming-model coexistence); MPI (?); programming model (hybrid MPI + node parallelism; PGAS? Chapel?); libraries (BLAS, PETSc, Trilinos, ...); user application space.
Need SciDAC-3 to move to Exascale
The SciDAC software stack, layer by layer:
• Application layer: Chroma, CPS, MILC
• Level 3 (Optimization): MDWF, QOP (optimized Dirac operators)
• Level 2 (Data Parallel): QDP++, QDP/C, QIO
• Level 1 (Basics): QMP (message passing), QLA, QMT (threads)
• Plus: QA0, GCC-BGL, workflow, viz. tools
• Plus tools from collaborations with other SciDAC projects, e.g. PERI
Need SciDAC-3
• Application porting to new programming models/languages
  – Node abstraction for portability (like QDP++ now?)
  – Interactions with a more restrictive (liberating?) exascale stack?
• Performance libraries for exascale hardware
  – Like Level 3 currently
  – Will need productivity tools
• Domain-specific languages (QDP++ is almost this)
• Code generators (more QA0, BAGEL, etc.)
• Performance monitoring
• Debugging, simulation
• Algorithms for greater concurrency / reduced synchronization