Quarks, GPUs and Exotic Matter
Bálint Joó, Jefferson Lab
Ron Babich, NVIDIA (presenter)
NVIDIA Theater, SC'12, Salt Lake City, Utah
Nov 2012
Acknowledgements
• Science Results: Hadron Spectrum Collaboration
• Software:
– The QUDA Community & NVIDIA
– Frank Winter for his work on the JIT version of QDP++
• Machines:
– USQCD National Facility for access to clusters at JLab; the JLab SciComp Team
– LLNL for access to the Edge Cluster
– NERSC for access to the Dirac Cluster
– Oak Ridge Leadership Computing Facility, for access to TitanDev and for a Director's Discretionary Allocation
– NSF NICS for access to the Keeneland Cluster
– NCSA for access to Blue Waters
• Funding: US DOE
– Contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory
– Grant No. DE-FC02-06ER41440 (USQCD SciDAC-II project)
• Funding: NSF
– Grants PHY-0835713 and OCI-0946441
• Special thanks from Bálint to Ron for stepping in to present this talk.
Nuclear Physics and QCD
• Ordinary matter is made up of atoms
– atom = nucleus + "orbiting" electrons
– nucleus = protons + neutrons (nucleons)
– nucleon = quarks + gluons
• Almost all of our mass comes from quarks & gluons
• Quantum Chromodynamics (QCD) is the theory of quarks and gluons
– quarks carry color charge (r, g, b)
– gluons carry the color interactions, e.g. (-r, +b)
• We can only see things with net zero color charge
– we never see individual quarks or gluons, only combinations
– color charges must cancel between quarks and gluons
– QCD allows "exotics": quark-gluon excitations, glueballs
QCD in Nuclear Physics
Hägler, Musch, Negele, Schäfer, EPL 88 61001
• Can QCD predict the spectrum of hadrons?
– what is the role of the gluons?
– what about exotics?
– GlueX experiment at Jefferson Lab 12 GeV, Hall D
• How do quarks and gluons make nucleons?
– what are the distributions of quarks, gluons, spin, etc.?
– GPD experiments, e.g. Jefferson Lab, Halls A & B
• QCD must explain nuclear interactions
– ab initio calculations for simple systems
– bridges to higher-level effective theories
• QCD phase structure, equation of state
– experiments at RHIC
– input to higher-level effective theories
– astrophysics (physics of the Early Universe)
Lattice QCD
• Lattice QCD is the only known model-independent, non-perturbative technique for carrying out QCD calculations
– replace continuum space-time with a lattice
– gluons live on links as SU(3) matrices
– quarks live on sites as vectors/spinors
– change QCD into a system similar to a crystal
• Evaluate the path integral using the Markov Chain Monte Carlo method
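As a minimal illustration of "gluons live on links, quarks live on sites", here is a hypothetical container sketch (assumed names; these are not the actual QDP++/Chroma types):

```cpp
#include <array>
#include <complex>
#include <vector>

// Hypothetical sketch of lattice QCD degrees of freedom:
// gluons = one SU(3) matrix per link, quarks = one spinor per site.
using Complex = std::complex<double>;
using SU3 = std::array<std::array<Complex, 3>, 3>;     // gluon: 3x3 color matrix
using Spinor = std::array<std::array<Complex, 3>, 4>;  // quark: 4 spin x 3 color

struct Lattice {
    std::array<int, 4> dims;      // sites in x, y, z, t
    std::vector<SU3> links;       // one SU(3) matrix per site per direction
    std::vector<Spinor> fermion;  // one spinor per site

    explicit Lattice(std::array<int, 4> d) : dims(d) {
        std::size_t volume = 1;
        for (int n : dims) volume *= n;
        links.resize(volume * 4);  // 4 forward links per site
        fermion.resize(volume);
    }
};
```

Even a modest 4×4×4×8 lattice already carries thousands of complex numbers, which is why the talk stresses memory bandwidth throughout.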
Large Scale LQCD Simulations
• Stage 1: Generate Configurations
– snapshots of the QCD vacuum
– configurations generated in sequence
– capability computing needed for large lattices and light quarks
• Stage 2a: Compute quark propagators
– task parallelizable (per configuration)
– capacity workload (but can also use capability hardware)
• Stage 2b: Contract propagators into correlation functions
– determines the physics you'll see
– complicated multi-index tensor contractions
• Stage 3: Extract Physics
– on workstations, small cluster partitions
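Stage 1's "configurations generated in sequence" is a Markov chain. A toy sketch of the idea (a single real variable with a quadratic action and Metropolis updates, standing in for the ~10⁸-dimensional gauge field and the HMC algorithm actually used):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Toy Metropolis Markov chain sampling exp(-S(x)) for S(x) = x^2/2.
// Each chain entry plays the role of one "configuration": a snapshot
// drawn in sequence from the previous one.
double action(double x) { return 0.5 * x * x; }

std::vector<double> metropolis_chain(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> step(-1.0, 1.0), accept(0.0, 1.0);
    std::vector<double> chain;
    chain.reserve(n);
    double x = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double xp = x + step(rng);  // propose a local update
        if (accept(rng) < std::exp(action(x) - action(xp)))
            x = xp;                 // Metropolis accept/reject
        chain.push_back(x);         // one new "configuration"
    }
    return chain;
}
```

The sequential dependence of each configuration on the previous one is exactly why Stage 1 cannot be task-parallelized and needs capability machines.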
Titan Image Courtesy of Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory
The Lattice Dirac Equation
• Describes how quarks interact with the gluons
• Must be solved in gauge generation (Stage 1)
– O(1M) times, in sequence
– ~60%-80% of the workload is spent in solvers
• Must be solved to generate quark propagators (Stage 2)
– O(10M) times, but task parallel
– the solver is >90% of the workload
• The operator has dimension ~100M, but is very sparse
– efficient matrix-vector operations are crucial
– need optimized solvers
[ A_ee   -D_eo ]
[ -D_oe   A_oo ] φ = χ
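The even-odd (checkerboard) block structure above is commonly exploited by eliminating the odd sites via a Schur complement, so the solver only works on half the lattice. This step is standard in LQCD solvers, though not spelled out on the slide:

```latex
\begin{aligned}
\begin{pmatrix} A_{ee} & -D_{eo} \\ -D_{oe} & A_{oo} \end{pmatrix}
\begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix}
&= \begin{pmatrix} \chi_e \\ \chi_o \end{pmatrix} \\
\left(A_{ee} - D_{eo} A_{oo}^{-1} D_{oe}\right)\phi_e
&= \chi_e + D_{eo} A_{oo}^{-1}\chi_o, \\
\phi_o &= A_{oo}^{-1}\left(\chi_o + D_{oe}\,\phi_e\right)
\end{aligned}
```

Since $A_{oo}$ is site-local (trivially invertible), the expensive iterative solve is only needed for the even-site system.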
Software: Chroma + QUDA
• Chroma is a large lattice QCD framework
– algorithms for gauge generation, quark propagators, etc.
– abstractions for components (solvers)
– open source: http://usqcd.jlab.org/usqcd-docs/chroma/
– developed/maintained through US DOE SciDAC funding
– integrates the QUDA library as a solver component
• R. G. Edwards, B. Joo, Nucl. Phys. Proc. Suppl. 140 (2005) 832
• QUDA is a highly optimized library for lattice QCD on GPUs
– linear solvers, force terms, interfaces to code bases
– open source: http://lattice.github.com/quda
– developed/maintained by NVIDIA & the QUDA Community
• M. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Comput. Phys. Commun. 181:1517-1528, 2010
QUDA Performance Optimization
• LQCD is typically memory bound
– Dslash: nearest-neighbour stencil in 4D
– Wilson formulation: 0.92 FLOP/B (SP)
– staggered formulation: ~0.66 FLOP/B (SP)
– key optimizations focus on being memory friendly
• Layout data for coalesced memory access
• Use symmetries to compress SU(3) matrices
– 2-row storage or 8-parameter storage
– reconstruct the 3rd row with "free" FLOPs
– trade bandwidth for compute
• Use reduced precision where possible (e.g. 16-bit)
– mixed-precision solver
– iterative refinement + reliable updates
• Fuse BLAS-like kernels to increase reuse
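The 2-row compression mentioned above works because for a special unitary 3×3 matrix the third row is the complex conjugate of the cross product of the first two. A sketch of the reconstruction (hypothetical helper, not the QUDA kernel code):

```cpp
#include <array>
#include <complex>

// 2-row ("12-number") SU(3) compression: store only the first two rows of a
// link matrix and rebuild the third as conj(row1 x row2), trading memory
// bandwidth for a handful of "free" FLOPs.
using C = std::complex<double>;
using Row = std::array<C, 3>;

Row reconstruct_third_row(const Row& a, const Row& b) {
    Row c;
    // c = conj(a x b); exact for any special unitary 3x3 matrix
    c[0] = std::conj(a[1] * b[2] - a[2] * b[1]);
    c[1] = std::conj(a[2] * b[0] - a[0] * b[2]);
    c[2] = std::conj(a[0] * b[1] - a[1] * b[0]);
    return c;
}
```

On a bandwidth-bound kernel like Dslash, loading 12 numbers instead of 18 per link is a direct win, because the extra multiplications ride along on otherwise idle arithmetic units.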
[Figure: GPU field layout — arrays of (V-1 sites) × 12 floats and (V-1 sites) × 4 floats, with padding so each block is aligned]
Using GPUs in Capacity Mode
• USQCD National Facility (FNAL, JLab, BNL)
– a distributed computational facility for LQCD
– JLab and Fermilab operate GPU clusters
– the JLab GPU cluster is used for generating quark propagators
[Figure: solver performance on GPU clusters. Orange bars: data from the NERSC Dirac Cluster; other data from the JLab 9G & 10G clusters. JLab 9G GPU cluster pictured.]
JLab: 127 quad nodes, a mix of Tesla C2050, M2050, and GTX 285/480/580 GPUs
FNAL: 72 dual nodes, M2050 GPUs
Science From GPU Clusters
J. J. Dudek, R. G. Edwards, "Hybrid Baryons in QCD", Phys. Rev. D85, 054016
• Hybrid excitations in mesons and baryons at a common scale of ~1200 MeV
• Pattern suggests a chromo-magnetic excitation
– common to mesons and baryons
– an "effective degree of freedom"?
– a first-principles calculation can agree with or disfavor effective models
Point to take home here:
• These analysis computations are extremely demanding.
• Need (apart from the gauge configurations):
– innovation in the method of the computation
• the so-called "distillation" technique
• variational method, with a large operator basis
– an optimized formulation of the lattice theory
• anisotropic lattices: cleaner determination of excited states
– availability of cheap capacity FLOPs
• GPUs are highly cost effective
• recall: O(10M) solves of the Dirac equation
• lots of partitions of 4-16 GPUs (today)
• => 32-64 GPUs tomorrow for larger lattices
Gauge Generation on GPUs
• Gauge generation is not task parallel
– proceeds sequentially
– O(1M) solves of the Dirac equation
– needs the concentrated power of capability computing facilities
• Need to scale to 100s-1000s of GPUs
• Two main obstacles:
– Host/Accelerator model & Amdahl's law
• code not running on the GPU limits speedup
– hardware bottleneck
• ratio of peak device memory to PCIe2 bandwidth ~ 170/16 (for Fermi)
• PCIe3, GPUDirect, etc. should help here
S_app = 1 / ((1 - P) + P/S)
[Figure: Strong Scaling, 48³×512 lattice (weak field), Chroma + QUDA. Sustained Tflops (log scale) vs. Interlagos sockets (16 cores/socket, 16-8192). Curves: Titan XK6 CPU-only and Rosa XE6 CPU-only single-precision reliable IBiCGStab solvers; Titan XK6 GPU-only single-precision (single/single) reliable BiCGStab; GPU-only mixed-precision (half/single) reliable BiCGStab; and GPU-only mixed-precision (half/single) GCR with a domain-decomposed preconditioner. The 100 Tflops level is marked.]
Architecture Aware Algorithms
• A domain-decomposed preconditioner combined with a GCR solver
– reduced communication needs in the linear solver
– strong-scaled to 768 nodes on the TitanDev Cray XK6 system (Fermi Tesla GPUs) at the OLCF
Our work on strong scaling targets the newly installed Cray XK7 Titan system at the Oak Ridge Leadership Computing Facility (OLCF) (pictured above) and other large-scale GPU-based systems such as NCSA Blue Waters, Keeneland, and others.
R. Babich, M. Clark, B. Joo, G. Shi, R. Brower, S. Gottlieb, SC'11, Seattle
Image Courtesy of Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory
Moving more code to GPUs
• Work with Frank Winter, University of Edinburgh
• Re-wrote the QDP++ layer, on which Chroma is based, to run on GPUs
• Innovative Just-In-Time (JIT) compilation of C++ expression templates to GPU kernels
• Works on the Cray XK too, in "Just-Before-Time" mode
– pre-generate kernels on a regular Linux system
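The expression-template mechanism behind QDP++/QDP-JIT can be illustrated in miniature (hypothetical types, not the real QDP++ API): an expression like `a = b + c + d` builds a compile-time expression object, and the assignment evaluates the whole expression in one fused loop — the loop QDP-JIT compiles into a GPU kernel.

```cpp
#include <cstddef>
#include <vector>

// Minimal expression-template sketch: Add<L,R> records the expression tree;
// Vec::operator= walks it once per element, fusing all the arithmetic.
template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    template <class L, class R>
    Vec& operator=(const Add<L, R>& e) {  // single fused evaluation loop
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

inline Add<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <class L, class R>
Add<Add<L, R>, Vec> operator+(const Add<L, R>& a, const Vec& b) { return {a, b}; }
```

No temporaries are materialized for the intermediate sums; JIT compiling that single fused loop for the GPU is what lets the rest of Chroma run on the accelerator.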
[Figure: time per trajectory (sec, log scale) vs. number of XK6 nodes (16-256) for a 2-flavor Wilson HMC (gauge + 2-flavor + Hasenbusch monomials) on a 32³×96 lattice. Curves: Chroma (CPU only); Chroma (CPU) + QUDA solvers; Chroma (QDP-JIT) + QUDA. Preliminary data from TitanDev at OLCF, B. Joo & F. Winter.]
• The QDP-JIT version is fastest (a significant gain from QDP-JIT)
• Still suffers strong-scaling effects eventually
– the sublattice per GPU becomes too small; this problem size is small (fits on 1 GPU)
– expect better on current, larger lattice sizes
– F. Winter, "Accelerating QDP++ using GPUs", arXiv:1105.2279 [hep-lat]
Conclusions
• GPUs have brought a disruptive leap in the cost effectiveness of lattice QCD calculations at the capacity level
– enabled new analysis methods (e.g. distillation)
– are producing discovery-level science of great interest to nuclear physics experiments (e.g. at Jefferson Lab)
• By using architecture-aware solvers, we have been able to strong-scale LQCD to over 100 Tflops sustained performance on TitanDev
– expect even more performance from Kepler GPUs
• The QDP-JIT effort will allow us to move Chroma completely to the accelerators
– reduce the Amdahl's-law penalty, maximize speedup
• We look forward to more exciting science from GPUs