TRANSCRIPT
ORNL is managed by UT-Battelle
for the US Department of Energy
What does Titan tell us about preparing for exascale supercomputers?
Jack Wells
Director of Science
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
HPC Day
16 April 2015
West Virginia University
Morgantown, WV
Outline
• U.S. DOE Leadership Computing Program
• Hardware Trends: Increasing Parallelism
• The Titan Project: Accelerating Leadership
Computing
– Application Readiness & Lessons Learned
– Science on Titan
– Industrial Partnerships
• The Summit Project: ORNL’s pre-exascale
supercomputer
– Application Readiness
DOE’s Office of Science
Computation User Facilities
• DOE is a leader in open high-performance computing
• Provide the world’s most powerful computational tools for open science
• Access is free to researchers who publish
• Boost US competitiveness
• Attract the best and brightest researchers
NERSC: Edison is 2.57 PF
OLCF: Titan is 27 PF
ALCF: Mira is 10 PF
Origin of Leadership Computing Facility
Department of Energy High-End
Computing Revitalization Act of
2004 (Public Law 108-423):
The Secretary of Energy, acting
through the Office of Science, shall
• Establish and operate
Leadership Systems Facilities
• Provide access [to Leadership Systems
Facilities] on a competitive, merit-
reviewed basis to researchers in U.S.
industry, institutions of higher education,
national laboratories and other Federal
agencies
What is the Leadership Computing
Facility (LCF)?
• Collaborative DOE Office of Science user-facility program at ORNL and ANL
• Mission: Provide the computational and data resources required to solve the most challenging problems.
• Two centers and two architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive 10x to 100x more resources than at other generally available centers.
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).
Three primary user programs for access to LCF
Distribution of allocable hours:
• 60% INCITE
• 30% ALCC (ASCR Leadership Computing Challenge)
• 10% Director’s Discretionary
Innovative and Novel Computational Impact on Theory and Experiment (INCITE)
Call for Proposals: April 15 – June 26, 2015. Allocations will be for calendar year 2016.
The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of the leadership-class systems.
INCITE is an annual, peer-reviewed allocation program that provides unprecedented computational and data science resources:
• 5.8 billion core-hours to be awarded for 2016 on the 27-petaflops Cray XK7 “Titan” and the 10-petaflops IBM BG/Q “Mira”
• Average awards expected to be ~75M and 100M core-hours for Titan and Mira, respectively
• INCITE is open to any science domain
• INCITE seeks computationally intensive, large-scale research campaigns
• Proposal-writing webinars: April and May
Contact information: Julia C. White, INCITE Manager, www.doeleadershipcomputing.org
INCITE resources for high-impact research
• Center expertise is key to INCITE support model
• Multi-year proposals (1, 2, or 3 years)
– Embrace the long-term effort of high impact science
                        Argonne LCF         Oak Ridge LCF
System                  IBM Blue Gene/Q     Cray XK7
Name                    Mira                Titan
Compute nodes           49,152              18,688
Node architecture       PowerPC, 16 cores   AMD Opteron, 16 cores + NVIDIA Kepler GPU
Processing cores        786,432             299,008
Memory per node (GB)    16                  32 + 6
Peak performance (PF)   10                  27
ORNL has increased system performance by 1,000 times, 2004-2010
Hardware scaled from single-core through dual-core to quad-core and dual-socket, 12-core SMP nodes.
Scaling applications and system software was the biggest challenge.
• 2005: Cray X1, 3 TF; Cray XT3 (single-core), 26 TF
• 2006: Cray XT3 (dual-core), 54 TF
• 2007: Cray XT4 (dual-core), 119 TF
• 2008: Cray XT4 (quad-core), 263 TF
• 2009: Cray XT5 (12-core, dual-socket SMP), 2.3 PF
Hardware Trend of ORNL’s Systems
1985 - 2013
The Free Lunch is Over
(Herb Sutter, Dr. Dobb’s Journal, 2005, http://www.gotw.ca/publications/concurrency-ddj.htm)
• Moore’s Law continues (green)
• But CPU clock rates stopped increasing in 2003 (dark blue)
• Power (light blue) is capped by heat dissipation and $$$
• Single-thread performance is growing slowly (magenta)
Microprocessor Performance “Expectation Gap”
National Research Council (NRC), Computer Science and Telecommunications Board (2012)
Peak FLOPS per Node was #1 Hardware
Requirement in 2009 User Survey
“Preparing for Exascale” OLCF Application
Requirements and Strategy, December 2009
Power is THE problem
Power consumption of the 2.3 PF (peak) Jaguar: 7 megawatts, equivalent to that of a small city (5,000 homes)
Using traditional CPUs is not economically feasible: a 20 PF+ system would require 30 megawatts (30,000 homes)
Why GPUs? Hierarchical Parallelism
High performance and power efficiency
on path to exascale
• Expose more parallelism through code refactoring and source code directives (a minimal sketch follows at the end of this slide)
– Doubles CPU performance of many codes
• Use right type of processor for each task
• Data locality: Keep data near processing
– GPU has high bandwidth to local memory
for rapid access
– GPU has large internal cache
• Explicit data management: Explicitly
manage data movement between CPU
and GPU memories
CPU: optimized for sequential multitasking.
GPU accelerator: optimized for many simultaneous tasks; 10x the performance per socket and 5x more energy-efficient systems.
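To make the directive-based approach above concrete, here is a minimal, hypothetical sketch in C of OpenACC usage; the routine and array names are illustrative and not taken from any Titan application. The data directive expresses the explicit CPU-to-GPU data management described above, and the parallel loop directive exposes the fine-grained parallelism; a compiler without OpenACC support simply ignores the pragmas and runs the loop on the CPU.

#include <stdlib.h>

/* Hypothetical kernel: y = a*x + y over n elements. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* Explicit data management: copy x to the GPU, move y both ways. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Expose the loop's n independent iterations as GPU threads. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 3.0f, x, y);   /* every y[i] is 5.0f afterwards */
    free(x);
    free(y);
    return 0;
}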
ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron + NVIDIA Tesla processors
200 cabinets, 4,352 ft² (404 m²), 8.9 MW peak power
SYSTEM SPECIFICATIONS:
• 27.1 PF/s peak performance (24.5 PF GPU + 2.6 PF CPU)
• 17.59 PF/s sustained performance (LINPACK)
• 18,688 compute nodes, each with:
– 16-core AMD Opteron CPU
– NVIDIA Tesla “K20x” GPU
– 32 + 6 GB memory
• 710 TB total system memory
• 32 PB parallel filesystem (Lustre)
• Cray Gemini 3D torus interconnect
• 512 service and I/O nodes
If you are not using the GPUs, you are throwing away 90% of the available performance.
Center for Accelerated Application
Readiness (CAAR)
• We created CAAR as part of the Titan project to help prepare applications for accelerated architectures
• Goals:
– Work with code teams to develop and implement strategies for exposing hierarchical parallelism in our users’ applications
– Maintain code portability across modern architectures
– Learn from and share our results
• We selected six applications from across different science domains and algorithmic motifs
Application Readiness for Titan
• WL-LSMS: Illuminating the role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
• S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.
• NRDF: Radiation transport, important in astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging, computed on AMR grids.
• CAM-SE: Answering questions about specific climate change adaptation and mitigation scenarios; realistically representing features like precipitation patterns/statistics and tropical storms.
• Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
• LAMMPS: Molecular dynamics simulation of organic polymers for applications in organic photovoltaic heterojunctions, de-wetting phenomena, and biosensors.
Application Power Efficiency of the Cray XK7
WL-LSMS for CPU-only and Accelerated Computing
• Runtime is 8.6X faster for the accelerated code
• Energy consumed is 7.3X less
o GPU-accelerated code consumed 3,500 kW-hr
o CPU-only code consumed 25,700 kW-hr
Power consumption traces for identical WL-LSMS runs
with 1024 Fe atoms on 18,561 Titan nodes (99% of Titan)
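As a quick check on the numbers above: 25,700 kW-hr / 3,500 kW-hr ≈ 7.3, consistent with the quoted energy saving. The energy ratio (7.3X) is a bit smaller than the runtime speedup (8.6X) because, from these same figures, the accelerated run draws roughly 8.6/7.3 ≈ 1.2 times the average power while it executes; it wins on energy by finishing much sooner.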
CAAR Lessons Learned
• Up to 1-3 person-years required to port each code
– Takes work, but an unavoidable step required for exascale
– Also pays off for other systems—the ported codes often run significantly faster CPU-only (Denovo 2X, CAM-SE >1.7X)
• An estimated 70-80% of developer time is spent in code restructuring, regardless of whether using CUDA, OpenCL, OpenACC, … (a brief sketch of this kind of restructuring follows this list)
• Each code team must make its own choice of CUDA vs. OpenCL vs. OpenACC based on the specific case; the conclusion may differ for each code
• Science codes are under active development, so porting to the GPU can mean pursuing a “moving target” that is challenging to manage
• More available flops on the node should lead us to think of new science opportunities enabled—e.g., more DOF per grid cell
• We may need to look in unconventional places to get another ~30X thread parallelism that may be needed for exascale—e.g., parallelism in time
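As a purely illustrative sketch of the restructuring effort noted above (the grid sizes, loop structure, and names are hypothetical, not drawn from any CAAR code): much of the work is typically reorganizing loops so the accelerator sees enough independent iterations, and hoisting data transfers out of iteration loops so arrays stay resident in GPU memory, rather than writing the directives themselves.

#define NX 1024
#define NY 1024

/* Hypothetical field update repeated over many steps.
 * Restructured so that (1) the two grid loops are collapsed into one
 * large parallel iteration space (NX*NY independent updates instead of
 * only NX), and (2) the enclosing data region keeps both fields on the
 * GPU for all steps instead of copying them back and forth each step. */
void integrate(double field[NX][NY], const double source[NX][NY],
               int nsteps, double dt)
{
    #pragma acc data copy(field[0:NX][0:NY]) copyin(source[0:NX][0:NY])
    for (int step = 0; step < nsteps; ++step) {
        #pragma acc parallel loop collapse(2)
        for (int i = 0; i < NX; ++i)
            for (int j = 0; j < NY; ++j)
                field[i][j] += dt * source[i][j];
    }
}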
2014 Strategic Science Accomplishments from Titan
• Cosmology: Salman Habib (Argonne National Laboratory) and collaborators used the HACC code on Titan’s CPU–GPU system to conduct today’s largest cosmological structure simulation at resolutions needed for modern-day galactic surveys. [K. Heitmann, 2014. arXiv:1411.3396]
• Combustion: Jacqueline Chen (Sandia National Laboratories) and collaborators for the first time performed direct numerical simulation of a jet flame burning dimethyl ether (DME) at new turbulence scales over space and time. [A. Bhagatwala, et al. 2014. Proc. Combust. Inst. 35]
• Superconducting Materials: Paul Kent (ORNL) and collaborators performed the first ab initio simulation of a cuprate and were the first team to validate quantum Monte Carlo simulations for high-temperature superconductors. [K. Foyevtsova, et al. 2014. Phys. Rev. X 4]
• Molecular Science: Michael Klein (Temple University); researchers at Procter & Gamble (P&G) and Temple University delivered a comprehensive picture in full atomistic detail of the molecular properties that drive skin barrier disruption. [M. Paloncyova, et al. 2014. Langmuir 30; C. M. MacDermaid, et al. 2014. J. Chem. Phys. 141]
• Fusion: C. S. Chang (PPPL) and collaborators used the XGC1 code on Titan to obtain a fundamental understanding of the divertor heat-load width physics and its dependence on the plasma current in present-day tokamak devices. [C. S. Chang, et al. 2014. Proceedings of the 25th Fusion Energy Conference, IAEA, October 13–18, 2014]
2014 High Impact Science at the OLCF
• Climate Science: Palaeoclimate evidence has suggested that the El Niño Southern Oscillation (ENSO), the most important driver of interannual climate variability, has swung from periods of intense variability to relative quiescence. [Zhengyu Liu, Nature, Vol. 515, 2014]
• Biological Sciences: Coarse-grained protein models on leadership computers allow the study of single-protein properties, DNA–RNA complexes, amyloid fibril formation, and protein suspensions in a crowded environment. [Fabio Sterpone, Chemical Society Reviews, Vol. 43, 2014]
• Materials Sciences: Using OLCF resources, researchers quantified the shape evolution of a platinum nanoparticle by tracking the propagation of different facets. [Hong-Gang Liao, Science, Vol. 345, 2014]
• Climate Science: Using a general circulation model (GCM), the team showed that allowing Northern Hemisphere ice sheet meltwater to flow into the ocean generates a bipolar seesaw effect, with the Southern Hemisphere warming at the expense of the Northern Hemisphere. [Christo Buizert, Science, Vol. 345, 2014]
• Materials Sciences: Theory, modeling, and simulation can accelerate the process of materials understanding and design by providing atomic-level understanding of the underlying physicochemical phenomena. [Bobby Sumpter, Accounts of Chemical Research, Vol. 47, 2014]
• Materials Sciences: First-principles many-body methods are used to study the spin dynamics and superconducting pairing symmetry of a large number of iron-based compounds, showing that high-temperature superconductors have both dispersive high-energy and strong low-energy commensurate or nearly commensurate spin excitations. [Z. P. Yin, Nature Physics, Vol. 10, 2014]
Industrial Partnerships: Accelerating Competitiveness through Computational Sciences
[Chart: Number of OLCF industry projects operational per calendar year, 2005–2014, growing from a handful of projects in 2005 to the mid-30s by 2014]
Innovation through Industrial Partnerships
• Human skin barrier: Demonstrated that small molecules can have a large and varying impact on skin permeability depending on their molecular characteristics, which is important for product efficacy and safety
• Global flood maps: Developed fluvial and pluvial high-resolution global flood maps to enable insurance firms to better price risk and reduce loss of life and property
• Engine cycle-to-cycle variation: Developing a novel approach using massively parallel, multiple simultaneous combustion-cycle simulations to address cycle-to-cycle variations in spark-ignition engines
• Fuel-efficient jet engines: Conducting first-of-a-kind high-fidelity LES computations of flow in turbomachinery components for more fuel-efficient, next-generation jet engines
• Wind turbine resilience: First-time simulation of ice formation within million-molecule water droplets is expanding understanding of freezing at the molecular level to enhance wind turbine resilience in cold climates
• Welding software: Evaluating large-scale HPC and GPU capability of critical welding simulation software and further developing and testing a weld optimization algorithm
Innovation through Industrial Partnerships
• Aircraft design: Unexpected discovery of multiple solutions for the steady RANS equations with separated flow helps explain why numerical modeling sometimes fails to capture maximum lift
• Consumer product stability: Developed a method to measure the impact of additives, such as dyes and perfumes, on the properties of lipid systems such as fabric enhancers and other formulated products
• Gasoline engine injector: Optimizing multi-hole gasoline spray injector nozzle designs for better in-cylinder fuel-air mixture distributions, greater fuel efficiency, and fewer physical prototypes
• Jet engine efficiency: Accurate predictions of the atomization of liquid fuel by aerodynamic forces enhance combustion stability, improve efficiency, and reduce emissions
• Li-ion batteries: New classes of solid inorganic Li-ion electrolytes could deliver high ionic and low electronic conductivity and good electrochemical stability
• Underhood cooling: Developed a new, efficient, and automatic analytical cooling-package optimization process leading to one-of-a-kind design optimization of cooling systems
Innovation through Industrial Partnerships
• Catalysis: Demonstrated biomass as a viable, sustainable feedstock for hydrogen production for fuel cells; showed nickel is a feasible catalytic alternative to platinum
• Design innovation: Accelerating the design of shock-wave turbo compressors for carbon capture and sequestration
• Aircraft design: Simulated takeoff and landing scenarios improved a critical code for estimating characteristics of commercial aircraft, including lift, drag, and controllability
• Industrial fire suppression: Developing high-fidelity modeling capability for fire growth and suppression; fire losses account for 30% of U.S. property loss costs
• Turbo machinery efficiency: Simulated unsteady flow in turbo machinery, opening new opportunities for design innovation and efficiency improvements
• Long-haul truck fuel efficiency: Simulations reduced by 50% the time to develop a unique system of add-on parts that increases fuel efficiency by 7-12% for long-haul (18-wheeler) trucks
Additional lessons from Titan
• Complexity (certainly increased parallelism) is the new norm
• The community needs to work together to help manage this
• Education and training are very important.
• The LCF is very active in advanced training, for example the OLCF “Hack-a-thon” on OpenACC compiler directives:
– 6 teams chosen, 2 mentors assigned per team, 6 weeks of guided planning and work, 1 final intense week of coding
https://www.olcf.ornl.gov/training-event/hackathon-openacc2014/
Roadmap to Exascale
• 2012: Titan, 27 PF, 600 TB DRAM, hybrid GPU/CPU
• 2017: OLCF-4, 100-250 PF, 4,000 TB memory, < 20 MW
• 2022: OLCF-5, 1 EF, 20 MW
Our science requires that we advance computational capability 1,000x over the next decade.
What are the challenges?
What is CORAL (Partnership for 2017 System)
• CORAL is a Collaboration of Oak Ridge, Argonne, and Lawrence Livermore Labs to acquire three systems for delivery in 2017.
• DOE’s Office of Science (DOE/SC) and National Nuclear Security Administration (NNSA) signed an MOU agreeing to collaborate on HPC research and acquisitions
• Collaboration grouping of DOE labs was done based on common acquisition timings. Collaboration is a win-win for all parties.
– It reduces the number of RFPs vendors have to respond to
– It improves the number and quality of proposals
– It allows pooling of R&D funds
– It strengthens the alliance between SC/NNSA on road to exascale
– It encourages sharing technical expertise between Labs
Accelerating Future DOE Leadership
Systems (“CORAL”)
“Summit” System “Sierra” System
5X – 10X Higher Application Performance
IBM POWER CPUs, NVIDIA Tesla GPUs, Mellanox EDR 100Gb/s InfiniBand
Paving The Road to Exascale Performance
2017 OLCF Leadership System
Hybrid CPU/GPU architecture
Vendor: IBM (Prime) / NVIDIA™ / Mellanox Technologies®
At least 5X Titan’s Application Performance
Approximately 3,400 nodes, each with:
• Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla® GPUs using the NVIDIA Volta architecture
• CPUs and GPUs completely connected with high speed NVLink
• Large coherent memory: over 512 GB (HBM + DDR4)
– all directly addressable from the CPUs and GPUs
• An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
• over 40 TF peak performance
Dual-rail Mellanox® EDR-IB full, non-blocking fat-tree interconnect
IBM Elastic Storage (GPFS™): 1 TB/s I/O and 120 PB disk capacity.
How does Summit compare to Titan?
Feature                                          Summit                        Titan
Application performance                          5-10x Titan                   Baseline
Number of nodes                                  ~3,400                        18,688
Node performance                                 > 40 TF                       1.4 TF
Memory per node                                  > 512 GB (HBM + DDR4)         38 GB (GDDR5 + DDR3)
NVRAM per node                                   800 GB                        0
Node interconnect                                NVLink (5-12x PCIe 3)         PCIe 2
System interconnect (node injection bandwidth)   Dual-rail EDR-IB (23 GB/s)    Gemini (6.4 GB/s)
Interconnect topology                            Non-blocking fat tree         3D torus
Processors                                       IBM POWER9 + NVIDIA Volta™    AMD Opteron™ + NVIDIA Kepler™
File system                                      120 PB, 1 TB/s, GPFS™         32 PB, 1 TB/s, Lustre®
Peak power consumption                           10 MW                         9 MW
Two Tracks for Future Large Systems
Hybrid Multi-Core
• CPU / GPU Hybrid systems
• Likely to have multiple CPUs and
GPUs per node
• Small number of very fat nodes
• Expect data movement to be much easier than on previous systems, with coherent shared memory within a node
• Multiple levels of memory – on
package, DDR, and non-volatile
Many Core
• Tens of thousands of nodes with
millions of cores
• Homogeneous cores
• Multiple levels of memory – on
package, DDR, and non-volatile
• Unlike prior generations, future products are likely to be self-hosted
Cori at NERSC
• Self-hosted many-core system
• Intel/Cray
• 9,300 single-socket nodes
• Intel® Xeon Phi™ Knights Landing (KNL)
• 16 GB HBM, 64-128 GB DDR4
• Target delivery date: June 2016

Summit at OLCF
• Hybrid CPU/GPU system
• IBM/NVIDIA
• 3,400 multi-socket nodes
• POWER9/Volta
• More than 512 GB coherent memory per node
• Target delivery date: 2017

ALCF-3 at ALCF
• 3rd-generation Intel Xeon Phi (Knights Hill, KNH)
• > 50,000 compute nodes
• Target delivery date: 2018
ASCR Computing Upgrades At a Glance

NERSC Now (Edison): 2.6 PF system peak; 2 MW peak power; 357 TB total system memory; 0.460 TF node performance; Intel Ivy Bridge processors; 5,600 nodes; Aries interconnect; 7.6 PB, 168 GB/s Lustre® file system
OLCF Now (Titan): 27 PF system peak; 9 MW peak power; 710 TB total system memory; 1.452 TF node performance; AMD Opteron + NVIDIA Kepler processors; 18,688 nodes; Gemini interconnect; 32 PB, 1 TB/s Lustre® file system
ALCF Now (Mira): 10 PF system peak; 4.8 MW peak power; 768 TB total system memory; 0.204 TF node performance; 64-bit PowerPC A2 processors; 49,152 nodes; 5D torus interconnect; 26 PB, 300 GB/s GPFS™ file system
NERSC Upgrade (Cori, 2016): > 30 PF system peak; < 3.7 MW peak power; ~1 PB DDR4 + high-bandwidth memory (HBM) + 1.5 PB persistent memory; > 3 TF node performance; Intel Knights Landing many-core CPUs, with Intel Haswell CPUs in a data partition; 9,300 nodes plus 1,900 nodes in the data partition; Aries interconnect; 28 PB, 744 GB/s Lustre® file system
OLCF Upgrade (Summit, 2017-2018): 150 PF system peak; 10 MW peak power; > 1.74 PB DDR4 + HBM + 2.8 PB persistent memory; > 40 TF node performance; multiple IBM POWER9 CPUs and multiple NVIDIA Volta GPUs per node; ~3,500 nodes; dual-rail EDR-IB interconnect; 120 PB, 1 TB/s GPFS™ file system
ALCF Upgrade (Aurora, 2018-2019): 180 PF system peak; 13 MW peak power; > 7 PB DRAM and persistent memory; node performance > 15 times Mira; Intel Knights Hill many-core CPUs; > 50,000 nodes; Intel Omni-Path Architecture interconnect; 150 PB, 1 TB/s Lustre® file system

Notes: Edison is a Cray XC30 with Intel Xeon E5-2695 v2 (12-core, 2.4 GHz) processors and the Aries interconnect. The Summit and Aurora upgrades are the CORAL acquisitions.
Center for Accelerated Application
Readiness: Summit
Center for Accelerated Application Readiness (CAAR)
• Performance analysis of community applications
• Technical plan for code restructuring and optimization
• Deployment on OLCF-4
OLCF-4 issued a call for proposals in FY2015 for
application development partnerships between
community developers, OLCF staff and the OLCF
Vendor Center of Excellence.
New CAAR Applications
Application   Domain                  Principal Investigator   Institution
ACME (N)      Climate Science         David Bader              Lawrence Livermore National Laboratory
DIRAC         Relativistic Chemistry  Lucas Visscher           Free University of Amsterdam
FLASH         Astrophysics            Bronson Messer           Oak Ridge National Laboratory
GTC (NE)      Plasma Physics          Zhihong Lin              University of California – Irvine
HACC (N)      Cosmology               Salman Habib             Argonne National Laboratory
LSDALTON      Chemistry               Poul Jørgensen           Aarhus University
NAMD (NE)     Biophysics              Klaus Schulten           University of Illinois – Urbana-Champaign
NUCCOR        Nuclear Physics         Gaute Hagen              Oak Ridge National Laboratory
NWCHEM (N)    Chemistry               Karol Kowalski           Pacific Northwest National Laboratory
QMCPACK       Materials Science       Paul Kent                Oak Ridge National Laboratory
RAPTOR        Engineering             Joseph Oefelein          Sandia National Laboratories
SPECFEM       Seismic Science         Jeroen Tromp             Princeton University
XGC (N)       Plasma Physics          C. S. Chang              Princeton Plasma Physics Laboratory
Are we seeing the real exascale challenges?
Advanced simulation and modeling applications face the challenges of scale, power, resilience, and the memory wall.
While conquering the petascale problems of today, beware being eaten alive by the exascale problems of tomorrow.
Conclusions
• Leadership computing is for the critically important
problems that need the most powerful compute and data
infrastructure
• Accelerated, hybrid-multicore computing solutions are
performing well on real, complex scientific applications.
– But you must work to expose the parallelism in your codes.
– This refactoring of codes is largely common to all massively
parallel architectures
• The Summit project will extend the hybrid-multicore
supercomputer architecture into the (pre)-exascale era.
• OLCF resources are available to industry, academia, and
labs, through open, peer-reviewed allocation
mechanisms.
National Climate Research Center
(NCRC) – A NOAA and DOE Partnership Project
• NCRC grew out of a 2009 interagency
agreement between NOAA and the
Department of Energy.
• A subsequent work-for-others agreement
directly with ORNL allows NOAA to take
advantage of ORNL’s expertise in acquiring,
deploying, and operating world-class
supercomputers.
• It also provides additional opportunities for
collaborative work between NOAA and
ORNL.
• NOAA uses Gaea to study the planet’s
notoriously complex climate from a variety
of angles.
– Gaea is a 1.1-petaflop Cray XE6. The system’s 40
cabinets contain 7,520 16-core AMD Interlagos
processors.
• NOAA is ORNL’s partner in the Titan project.
Acknowledgements
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.