TRANSCRIPT
Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA
Carl Ponder, PhD, Applications Software Engineer, NVIDIA, Austin, TX, USA
NVIDIA Update and Directions on GPU Acceleration for Earth System Models
TOPICS OF DISCUSSION
• NVIDIA GPU UPDATE
• ESM GPU PROGRESS
• WRF DEVELOPMENTS
NVIDIA GPU: Introduction and Hardware Features
[Diagram] GPU introduction: CPU and Tesla P100 GPU connected by PCIe or NVLink with Unified Memory; CPU-only performance (1x) vs. GPU-accelerated (3x – 10x); example system: ORNL Titan, #3 on Top500.org, with 18,688 GPUs.
• Co-processor to the CPU
• Threaded Parallel (SIMT)
• CPUs: x86 | Power | ARM
• HPC Motivation:
  o Performance
  o Efficiency
  o Cost Savings
Current GPUs since 2014 (Tesla K80, Tesla K40) vs. next GPU (Tesla P100, Q4 2016):

GPU Feature           | Tesla P100             | Tesla K80              | Tesla K40
Stream Processors     | 3584                   | 2 x 2496               | 2880
Core Clock            | 1328 MHz               | 562 MHz                | 745 MHz
Boost Clock(s)        | 1480 MHz               | 875 MHz                | 810 MHz, 875 MHz
Memory Clock          | 1.4 Gbps HBM2          | 5 Gbps GDDR5           | 6 Gbps GDDR5
Memory Bus Width      | 4096-bit               | 2 x 384-bit            | 384-bit
Memory Bandwidth      | 720 GB/sec             | 2 x 240 GB/sec         | 288 GB/sec
VRAM                  | 16 GB                  | 2 x 12 GB              | 12 GB
Half Precision        | 21.2 TFLOPS            | 8.74 TFLOPS            | 4.29 TFLOPS
Single Precision      | 10.6 TFLOPS            | 8.74 TFLOPS            | 4.29 TFLOPS
Double Precision      | 5.3 TFLOPS (1/2 rate)  | 2.91 TFLOPS (1/3 rate) | 1.43 TFLOPS (1/3 rate)
GPU                   | GP100 (610 mm2)        | GK210                  | GK110B
Transistor Count      | 15.3B                  | 2 x 7.1B(?)            | 7.1B
Power Rating          | 300 W                  | 300 W                  | 235 W
Cooling               | N/A                    | Passive                | Active/Passive
Manufacturing Process | TSMC 16nm FinFET       | TSMC 28nm              | TSMC 28nm
Architecture          | Pascal                 | Kepler                 | Kepler
NOTE: P100 nodes available for community remote access on NVIDIA PSG cluster
COSMO Dycore Speedups on P100 GPU http://www.cosmo-model.org/
MeteoSwiss GPU Branch of COSMO Model – Dycore Only
[Chart] Speedup vs. dual-socket Haswell (2x HSW CPU baseline): 2x K80 (4 x GPU) = 3x; 2x P100 = 7x; 4x P100 = 14x; 8x P100 = 27x.
Results from NVIDIA Internal Cluster (US)
(Preliminary – Mar 2016)
• COSMO 5.3 MCH branch
• 128 x 128 horizontal grid, 80 vertical levels
• 10 time steps
• CPU: x86 Xeon Haswell
  o 10 cores @ 2.8 GHz
• GPU: Tesla P100
• Use of 8-GPU single node
• CUDA 8
Socket-to-socket: P100 vs. HSW = 3.5x
Select NVIDIA ESM Highlights Since MultiCore 5
Growth in funded GPU development; large GPU system deployments
• GPUs deployed for operational NWP by MeteoSwiss with the COSMO model
• OpenACC (PGI) developments for ACME Atmosphere in production release
• New NCAR collaboration launched with OpenACC Hackathon Workshop: https://www2.cisl.ucar.edu/news/summertime-hackathon
• KISTI 2-week GPU and OpenACC workshop with focus on MPAS and WRF
• NVIDIA selected for ECMWF ESCAPE program; ESCAPE GPU Workshop: https://software.ecmwf.int/wiki/display/OPTR/NVIDIA+Basic+GPU+Training+with+emphasis+on+Fortran+and+OpenACC
• US DOE ORNL-led GPU Hackathons included several ES model teams: ACME, COAMPS, ECHAM6, FVCOM, NOAA GFDL models
MeteoSwiss and Operational COSMO NWP on GPUs
MeteoSwiss COSMO NWP configurations since 2008, before GPUs:
• COSMO 7 (6.6 km): 3 per day, 3-day forecast
• COSMO 2 (2.2 km): 8 per day, 24-hour forecast
• IFS from ECMWF: 2 per day, 10-day forecast
MeteoSwiss COSMO NWP configurations during 2016, with GPUs:
• COSMO E (2.2 km): 2 per day, 5-day forecast
• COSMO 1 (1.1 km): 8 per day, 24-hour forecast
• IFS from ECMWF: 2 per day, 10-day forecast
"New configurations of higher resolution and ensemble predictions possible owing to the performance-per-energy gains from GPUs" – X. Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
MeteoSwiss Weather Prediction Based on GPUs
World's First GPU-Accelerated NWP
Piz Kesch (Cray CS-Storm), installed at CSCS July 2015:
• 2 racks with 48 total CPUs
• 192 total Tesla K80 GPUs
• High GPU-density nodes: 2 x CPU + 8 x GPU
• > 90% of FLOPS from GPUs
• Operational NWP since March 2016
Image by NVIDIA/MeteoSwiss
MeteoSwiss Operational COSMO-E Benchmark
[Chart] Cray XC40 with "original code" (node = 2 x HSW) vs. Cray CS-Storm with "refactored code" (node = 2 x HSW + 8 x K80); results shown as speedup vs. "original".
[Chart] Same benchmark with a third configuration added: Cray XC40 running the "refactored code" (node = 2 x HSW); results shown as speedup vs. "original" and speedup vs. "refactored".
Update on DOE Pre-Exascale CORAL Systems
CORAL Summit system 5-10x faster than Titan; 1/5th the nodes and the same energy use as Titan (based on the original 150 PF plan).
US DOE CORAL systems:
• ORNL Summit at 200 PF, early 2018
• LLNL Sierra at 150 PF, mid-2018
• Nodes of POWER9 CPUs + Tesla Volta GPUs
• NVLink interconnect for CPUs + GPUs
ORNL Summit system, based on the original 150 PF plan:
• Approximately 3,400 total nodes
• Each node 40+ TF peak performance
• About 1/5 the nodes of #2 Titan (18K+ nodes)
• Same energy use as #2 Titan (27 PF)
Programming Strategies for GPU Acceleration
Three approaches for GPU-accelerating applications, in order of increasing development effort:
• GPU Libraries: provide fast "drop-in" acceleration
• OpenACC Directives: GPU acceleration in standard languages (Fortran, C, C++); see the sketch after this list
• Programming in CUDA: maximum flexibility with GPU architecture and software features
NOTE: Many application developments include a combination of these strategies
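As an illustration of the directives approach, below is a minimal OpenACC sketch in Fortran. It is not taken from any model discussed in this talk; the routine name, arguments, and loop are invented for illustration. An OpenACC-capable compiler (for example pgfortran -acc -ta=tesla) generates the GPU kernel from the directive, while the same source still builds for CPU-only targets.

! Minimal OpenACC sketch (illustrative only): a saxpy-style update
! offloaded to the GPU with a single directive, no CUDA code required.
subroutine saxpy_acc(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a
  real,    intent(in)    :: x(n)
  real,    intent(inout) :: y(n)
  integer :: i

  ! The copyin/copy clauses describe host/device data movement;
  ! the compiler generates the kernel and its launch from the directive.
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do
end subroutine saxpy_acc

The OpenACC ports of COSMO and WRF referenced in this talk follow this same directive-based approach, applied to much larger loop nests and to data regions that keep fields resident on the GPU.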
NVIDIA IndeX: Scalable Rendering for Volume Visualization
o Leverages GPU clusters for large-scale (volume) data visualization and interactive visual computing
o Commercial software solution available and deployed for in-situ visualization of large-scale data
o Plugin for ParaView under development and available soon
o http://www.nvidia-arc.com/products/nvidia-index.html
Example dataset: 1.8 billion cells, 500 time steps; courtesy of Prof. Leigh Orf, UW-Madison and Rob Sisneros, NCSA
GPUs at Convergence of Data and HPC in ESM
Fusion of observations from machine learning with the model:
• Yandex development of "ML + Model" for hyperlocal NWP with WRF: Yandex Introduces Hyperlocal Weather Forecasting Service Based on Machine Learning Technology
• DL a dominant topic at NCAR workshop Climate Informatics 2015
• IBM acquisition of The Weather Company – applied data analytics
Data assimilation – next phase following model development:
• 4DVAR GPU development success with MeteoSwiss and others
• RIKEN study of a 10,240-member ensemble with NICAM (Miyoshi, et al.) – the largest ensemble simulation of global weather using real-world data
WRF GPU UPDATE
CUDA: TQI/SSEC commercial "WRF"
• TempoQuest plans for a CUDA "WRF-based" software product
• NVIDIA providing standard engineering guidance
OpenACC: NVIDIA Open WRF project
• Migrating 3.6.1 routines to 3.8
• Initial projections for the P100 GPU are very good
• Working towards a unified memory capability
• PGI compiler continues to improve and mature
Potential for full-model WRF on GPUs
• Several months away; hybrid approach in the near term
• P100 GPU will improve the hybrid approach
• UM (unified memory) + NVLink will improve data transfer times
• P100 memory bandwidth is 3x that of the Kepler series
(A minimal OpenACC data-region sketch of this approach follows below.)
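To make the hybrid OpenACC and unified-memory points above concrete, here is a minimal, hypothetical sketch; it is not WRF source code, and the routine, field names, sizes, and time step are invented. An OpenACC data region keeps the arrays resident on the GPU across the whole time-stepping loop, so host-device transfers happen once rather than every step. With managed (unified) memory, for example the PGI option -ta=tesla:managed that places Fortran allocatable data in CUDA Unified Memory, the explicit data clauses can largely be dropped and the runtime migrates data on demand.

! Hypothetical sketch of a GPU-resident time-stepping loop with OpenACC.
! Not WRF code: names, sizes, and the update formula are illustrative only.
subroutine advance_field(nsteps, ni, nk, t, tend)
  implicit none
  integer, intent(in)    :: nsteps, ni, nk
  real,    intent(inout) :: t(ni, nk)      ! prognostic field (illustrative)
  real,    intent(in)    :: tend(ni, nk)   ! precomputed tendency (illustrative)
  real,    parameter     :: dt = 30.0      ! time step in seconds (illustrative)
  integer :: step, i, k

  ! Keep both arrays on the GPU for the whole integration; only the final
  ! state of t is copied back to the host when the data region ends.
  !$acc data copy(t) copyin(tend)
  do step = 1, nsteps
     !$acc parallel loop collapse(2)
     do k = 1, nk
        do i = 1, ni
           t(i, k) = t(i, k) + dt * tend(i, k)
        end do
     end do
  end do
  !$acc end data
end subroutine advance_field

On systems where the CPU and GPU are connected by NVLink rather than PCIe, whatever host-device traffic remains moves over a higher-bandwidth link, which is the basis of the data-transfer improvement noted on this slide.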
Stan Posey, [email protected]; Carl Ponder, [email protected]
Questions?