TRANSCRIPT
Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA
Carl Ponder, PhD, Applications Software Engineer, NVIDIA, Austin, TX, USA
NVIDIA Update and Directions on GPU Acceleration for Earth System Models
TOPICS OF DISCUSSION
• NVIDIA GPU UPDATE
• ESM GPU PROGRESS
• WRF DEVELOPMENTS
NVIDIA GPU: Introduction and Hardware Features
[Diagram] GPU introduction: CPU and Tesla P100 GPU connected by PCIe or NVLink with Unified Memory; CPU-only performance (1x) vs. GPU-accelerated (3x – 10x); example system: ORNL Titan, #3 on Top500.org, with 18,688 GPUs.
• Co-processor to the CPU
• Threaded Parallel (SIMT)
• CPUs: x86 | Power | ARM
• HPC Motivation:
  o Performance
  o Efficiency
  o Cost Savings
Current GPUs since 2014 (Tesla K80, Tesla K40) vs. next GPU (Tesla P100, Q4 2016):

GPU Feature           | Tesla P100             | Tesla K80              | Tesla K40
Stream Processors     | 3584                   | 2 x 2496               | 2880
Core Clock            | 1328 MHz               | 562 MHz                | 745 MHz
Boost Clock(s)        | 1480 MHz               | 875 MHz                | 810 MHz, 875 MHz
Memory Clock          | 1.4 Gbps HBM2          | 5 Gbps GDDR5           | 6 Gbps GDDR5
Memory Bus Width      | 4096-bit               | 2 x 384-bit            | 384-bit
Memory Bandwidth      | 720 GB/sec             | 2 x 240 GB/sec         | 288 GB/sec
VRAM                  | 16 GB                  | 2 x 12 GB              | 12 GB
Half Precision        | 21.2 TFLOPS            | 8.74 TFLOPS            | 4.29 TFLOPS
Single Precision      | 10.6 TFLOPS            | 8.74 TFLOPS            | 4.29 TFLOPS
Double Precision      | 5.3 TFLOPS (1/2 rate)  | 2.91 TFLOPS (1/3 rate) | 1.43 TFLOPS (1/3 rate)
GPU                   | GP100 (610 mm2)        | GK210                  | GK110B
Transistor Count      | 15.3B                  | 2 x 7.1B(?)            | 7.1B
Power Rating          | 300 W                  | 300 W                  | 235 W
Cooling               | N/A                    | Passive                | Active/Passive
Manufacturing Process | TSMC 16nm FinFET       | TSMC 28nm              | TSMC 28nm
Architecture          | Pascal                 | Kepler                 | Kepler
NOTE: P100 nodes available for community remote access on NVIDIA PSG cluster
COSMO Dycore Speedups on P100 GPU http://www.cosmo-model.org/
MeteoSwiss GPU Branch of COSMO Model – Dycore Only
[Chart] Speedup vs. dual-socket Haswell (2x HSW CPU baseline): 2x K80 (4 x GPU) = 3x; 2x P100 = 7x; 4x P100 = 14x; 8x P100 = 27x.
Results from NVIDIA Internal Cluster (US)
(Preliminary – Mar 2016)
• COSMO 5.3 MCH branch
• 128 x 128 horizontal grid, 80 vertical levels
• 10 time steps
• CPU: x86 Xeon Haswell
  o 10 cores @ 2.8 GHz
• GPU: Tesla P100
• Use of 8-GPU single node
• CUDA 8
Socket-to-socket: P100 vs. HSW = 3.5x
Select NVIDIA ESM Highlights Since MultiCore 5
Growth in funded GPU development; large GPU system deployments
• GPUs deployed for operational NWP by MeteoSwiss with the COSMO model
• OpenACC (PGI) developments for ACME Atmosphere in production release
• New NCAR collaboration launched with OpenACC Hackathon Workshop: https://www2.cisl.ucar.edu/news/summertime-hackathon
• KISTI 2-week GPU and OpenACC workshop with focus on MPAS and WRF
• NVIDIA selected for ECMWF ESCAPE program; ESCAPE GPU Workshop: https://software.ecmwf.int/wiki/display/OPTR/NVIDIA+Basic+GPU+Training+with+emphasis+on+Fortran+and+OpenACC
• US DOE ORNL-led GPU Hackathons included several ES model teams: ACME, COAMPS, ECHAM6, FVCOM, NOAA GFDL models
MeteoSwiss and Operational COSMO NWP on GPUs
MeteoSwiss COSMO NWP configurations since 2008, before GPUs:
• COSMO 7 (6.6 km): 3 per day, 3-day forecast
• COSMO 2 (2.2 km): 8 per day, 24-hour forecast
• IFS from ECMWF: 2 per day, 10-day forecast
MeteoSwiss COSMO NWP configurations during 2016, with GPUs:
• COSMO E (2.2 km): 2 per day, 5-day forecast
• COSMO 1 (1.1 km): 8 per day, 24-hour forecast
• IFS from ECMWF: 2 per day, 10-day forecast
"New configurations of higher resolution and ensemble predictions possible owing to the performance-per-energy gains from GPUs" – X. Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
MeteoSwiss Weather Prediction Based on GPUs
World's First GPU-Accelerated NWP
Piz Kesch (Cray CS-Storm), installed at CSCS July 2015:
• 2 racks with 48 total CPUs
• 192 total Tesla K80 GPUs
• High GPU-density nodes: 2 x CPU + 8 x GPU
• > 90% of FLOPS from GPUs
• Operational NWP since March 2016
Image by NVIDIA/MeteoSwiss
MeteoSwiss Operational COSMO-E Benchmark
[Chart] Cray XC40 with "original code" (node = 2 x HSW) vs. Cray CS-Storm with "refactored code" (node = 2 x HSW + 8 x K80); results shown as speedup vs. "original".
[Chart] Same benchmark with a third configuration added: Cray XC40 running the "refactored code" (node = 2 x HSW); results shown as speedup vs. "original" and speedup vs. "refactored".
Update on DOE Pre-Exascale CORAL Systems
CORAL Summit system 5-10x faster than Titan; 1/5th the nodes and the same energy use as Titan (based on the original 150 PF plan).
US DOE CORAL systems:
• ORNL Summit at 200 PF, early 2018
• LLNL Sierra at 150 PF, mid-2018
• Nodes of POWER9 CPUs + Tesla Volta GPUs
• NVLink interconnect for CPUs + GPUs
ORNL Summit system, based on the original 150 PF plan:
• Approximately 3,400 total nodes
• Each node 40+ TF peak performance
• About 1/5 the nodes of #2 Titan (18K+ nodes)
• Same energy use as #2 Titan (27 PF)
Programming Strategies for GPU Acceleration
Three approaches for GPU-accelerating applications, in order of increasing development effort:
• GPU Libraries: provide fast "drop-in" acceleration
• OpenACC Directives: GPU acceleration in standard languages (Fortran, C, C++); see the sketch after this list
• Programming in CUDA: maximum flexibility with GPU architecture and software features
NOTE: Many application developments include a combination of these strategies
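As an illustration of the directives approach, below is a minimal OpenACC sketch in Fortran. It is not taken from any model discussed in this talk; the routine name, arguments, and loop are invented for illustration. An OpenACC-capable compiler (for example pgfortran -acc -ta=tesla) generates the GPU kernel from the directive, while the same source still builds for CPU-only targets.

! Minimal OpenACC sketch (illustrative only): a saxpy-style update
! offloaded to the GPU with a single directive, no CUDA code required.
subroutine saxpy_acc(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a
  real,    intent(in)    :: x(n)
  real,    intent(inout) :: y(n)
  integer :: i

  ! The copyin/copy clauses describe host/device data movement;
  ! the compiler generates the kernel and its launch from the directive.
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do
end subroutine saxpy_acc

The OpenACC ports of COSMO and WRF referenced in this talk follow this same directive-based approach, applied to much larger loop nests and to data regions that keep fields resident on the GPU.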
NVIDIA IndeX: Scalable Rendering for Volume Visualization
o Leverages GPU clusters for large-scale (volume) data visualization and interactive visual computing
o Commercial software solution available and deployed for in-situ visualization of large-scale data
o Plugin for ParaView under development and available soon
o http://www.nvidia-arc.com/products/nvidia-index.html
Example dataset: 1.8 billion cells, 500 time steps; courtesy of Prof. Leigh Orf, UW-Madison and Rob Sisneros, NCSA
GPUs at Convergence of Data and HPC in ESM
Fusion of observations from machine learning with the model:
• Yandex development of "ML + Model" for hyperlocal NWP with WRF: Yandex Introduces Hyperlocal Weather Forecasting Service Based on Machine Learning Technology
• DL a dominant topic at NCAR workshop Climate Informatics 2015
• IBM acquisition of The Weather Company – applied data analytics
Data assimilation – next phase following model development:
• 4DVAR GPU development success with MeteoSwiss and others
• RIKEN study of a 10,240-member ensemble with NICAM (Miyoshi, et al.) – the largest ensemble simulation of global weather using real-world data
WRF GPU UPDATE
CUDA: TQI/SSEC commercial "WRF"
• TempoQuest plans for a CUDA "WRF-based" software product
• NVIDIA providing standard engineering guidance
OpenACC: NVIDIA Open WRF project
• Migrating 3.6.1 routines to 3.8
• Initial projections for the P100 GPU are very good
• Working towards a unified memory capability
• PGI compiler continues to improve and mature
Potential for full-model WRF on GPUs
• Several months away; hybrid approach in the near term
• P100 GPU will improve the hybrid approach
• UM (unified memory) + NVLink will improve data transfer times
• P100 memory bandwidth is 3x that of the Kepler series
(A minimal OpenACC data-region sketch of this approach follows below.)
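To make the hybrid OpenACC and unified-memory points above concrete, here is a minimal, hypothetical sketch; it is not WRF source code, and the routine, field names, sizes, and time step are invented. An OpenACC data region keeps the arrays resident on the GPU across the whole time-stepping loop, so host-device transfers happen once rather than every step. With managed (unified) memory, for example the PGI option -ta=tesla:managed that places Fortran allocatable data in CUDA Unified Memory, the explicit data clauses can largely be dropped and the runtime migrates data on demand.

! Hypothetical sketch of a GPU-resident time-stepping loop with OpenACC.
! Not WRF code: names, sizes, and the update formula are illustrative only.
subroutine advance_field(nsteps, ni, nk, t, tend)
  implicit none
  integer, intent(in)    :: nsteps, ni, nk
  real,    intent(inout) :: t(ni, nk)      ! prognostic field (illustrative)
  real,    intent(in)    :: tend(ni, nk)   ! precomputed tendency (illustrative)
  real,    parameter     :: dt = 30.0      ! time step in seconds (illustrative)
  integer :: step, i, k

  ! Keep both arrays on the GPU for the whole integration; only the final
  ! state of t is copied back to the host when the data region ends.
  !$acc data copy(t) copyin(tend)
  do step = 1, nsteps
     !$acc parallel loop collapse(2)
     do k = 1, nk
        do i = 1, ni
           t(i, k) = t(i, k) + dt * tend(i, k)
        end do
     end do
  end do
  !$acc end data
end subroutine advance_field

On systems where the CPU and GPU are connected by NVLink rather than PCIe, whatever host-device traffic remains moves over a higher-bandwidth link, which is the basis of the data-transfer improvement noted on this slide.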
Stan Posey, [email protected]; Carl Ponder, [email protected]
Questions?