
Mike Houston, Chief Architect - AI Systems

Leadership AI Computing


AI Impact On Science

AI-DRIVEN MULTISCALE SIMULATION: Largest AI+MD Simulation Ever

DOCKING ANALYSIS: 58x Full Pipeline Speedup

DEEPMD-KIT: 1,000x Speedup

SQUARE KILOMETER ARRAY: 250GB/s Data Processed End-to-End

AI SUPERCOMPUTING DELIVERING SCIENTIFIC BREAKTHROUGHS: 7 of 10 Gordon Bell Finalists Used NVIDIA AI Platforms

CLIMATE (1.12 EF): LBNL | NVIDIA

GENOMICS (2.36 EF): ORNL

NUCLEAR WASTE REMEDIATION (1.2 EF): LBNL | PNNL | Brown U. | NVIDIA

CANCER DETECTION (1.3 EF): ORNL | Stony Brook U.

EXASCALE AI SCIENCE

FUSING SIMULATION + AI + DATA ANALYTICS: Transforming Scientific Workflows Across Multiple Domains

PREDICTING EXTREME WEATHER EVENTS: 5-Day Forecast with 85% Accuracy

SimNet (PINN) for CFD: 18,000x Speedup

RAPIDS FOR SEISMIC ANALYSIS: 260X Speedup Using K-Means
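The RAPIDS result above refers to GPU K-Means in cuML. A minimal sketch of that call is below; the synthetic feature matrix stands in for real seismic attributes, which are not part of this deck:

```python
import cupy as cp
from cuml.cluster import KMeans  # RAPIDS cuML: GPU-accelerated K-Means

# Synthetic stand-in for a table of seismic attributes (1M samples, 8 features).
X = cp.random.random((1_000_000, 8)).astype(cp.float32)

# Fit K-Means entirely on the GPU and pull back the cluster assignments.
km = KMeans(n_clusters=10, max_iter=100, random_state=0)
labels = km.fit_predict(X)

print(labels[:10])
print(km.cluster_centers_.shape)  # (10, 8)
```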

FERMILAB USES TRITON TO SCALE DEEP LEARNING INFERENCE IN HIGH ENERGY PARTICLE PHYSICS: GPU-Accelerated Offline Neutrino Reconstruction Workflow

Neutrino Event Classification from Reconstruction

400TB Data from Hundreds of Millions of Neutrino Events

17x Speedup of DL Model on T4 GPU Vs. CPU

Triton in Kubernetes Enables “DL Inference” as a Service

Expected to Scale to Thousands of Particle Physics Client Nodes
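The "DL inference as a service" pattern above amounts to many client nodes sending tensors to a shared Triton fleet. A minimal sketch with Triton's Python HTTP client follows; the server URL, model name, and tensor names/shapes are placeholders, not the actual Fermilab deployment:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder endpoint and model; the real Fermilab service details are not in this deck.
client = httpclient.InferenceServerClient(url="triton.example.org:8000")

# Fake batch of detector images standing in for reconstructed neutrino event data.
batch = np.random.rand(8, 3, 256, 256).astype(np.float32)

inp = httpclient.InferInput("INPUT__0", batch.shape, "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("OUTPUT__0")

# Triton batches and schedules the request on the shared GPU pool.
result = client.infer(model_name="neutrino_classifier", inputs=[inp], outputs=[out])
scores = result.as_numpy("OUTPUT__0")
print(scores.shape)
```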

EXPANDING UNIVERSE OF SCIENTIFIC COMPUTING

Simulation | AI | Data Analytics | Visualization | Extreme IO | Edge Streaming | Edge Appliance | Supercomputing | Cloud | Network

THE ERA OF EXASCALE AI SUPERCOMPUTING: AI is the New Growth Driver for Modern HPC

[Chart: HPL vs. AI performance in PFLOPS (log scale), 2017-2022, for Summit, Sierra, Fugaku, JUWELS, Perlmutter, and Leonardo. HPL is based on the #1 system in the June TOP500; AI is peak system FP16 FLOPS.]

FIGHTING COVID-19 WITH SCIENTIFIC COMPUTING: $1.25 Trillion Industry | $2B R&D per Drug | 12½ Years Development | 90% Failure Rate

[Figure: the drug discovery pipeline from biological target to drug candidate. NLP over literature and real-world data feeds search, genomics, structure, docking, simulation, and imaging stages, narrowing from O(10^60) possible chemical compounds through O(10^9) to O(10^2) drug candidates.]

FIGHTING COVID-19 WITH NVIDIA CLARA DISCOVERY

[Figure: the same pipeline annotated with GPU-accelerated tools across the NLP, genomics, structure, docking, simulation, and imaging stages: BioMegatron, BioBERT, Clara Parabricks, RAPIDS, AlphaFold, CryoSPARC, Relion, AutoDock, Schrodinger, NAMD, VMD, OpenMM, MELD, Clara Imaging, and MONAI.]

EXPLODING DATA AND MODEL SIZE

EXPLODING MODEL SIZE: Driving Superhuman Capabilities
[Chart: number of parameters (log scale), 2017-2021: Transformer 65M, BERT 340M, GPT-2 8B 8.3Bn, Turing-NLG 17Bn, GPT-3 175Bn.]

BIG DATA GROWTH: 90% of the World's Data in the Last 2 Years
[Chart: global data growth, 2010-2025, from 58 zettabytes to a projected 175 zettabytes by 2025. Source: IDC - The Digitization of the World (May 2020).]

GROWTH IN SCIENTIFIC DATA: Fueled by Accurate Sensors & Simulations

AI SUPERCOMPUTING NEEDS EXTREME IO

ECMWF: 287 TB/day | COVID-19 Graph Analytics: 393 TB | SKA: 16 TB/sec | NASA Mars Landing Simulation: 550 TB

GPUDIRECT STORAGE / GPUDIRECT RDMA

[Diagram: each node pairs 8x GPUs and NICs behind a PCIe switch, with the CPU and system memory to the side. GPUDirect RDMA moves data NIC-to-GPU between Node A and Node B at 200GB/s, and GPUDirect Storage gives the GPUs a direct 200GB/s path to storage, both bypassing the system-memory bounce buffer.]
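From Python, one way to exercise the GPUDirect Storage path is RAPIDS KvikIO, which wraps cuFile. A minimal sketch, assuming KvikIO and CuPy are installed and the file is a raw float32 dump (the path is a placeholder):

```python
import cupy as cp
import kvikio  # RAPIDS KvikIO: Python bindings for cuFile / GPUDirect Storage

# Placeholder path to a raw float32 dump on an NVMe or parallel filesystem.
path = "/lustre/datasets/sample.f32"

buf = cp.empty(1 << 26, dtype=cp.float32)  # 256 MiB of GPU memory

f = kvikio.CuFile(path, "r")
nbytes = f.read(buf)   # DMA directly into GPU memory when GDS is available,
f.close()              # with an internal bounce-buffer fallback otherwise

print(f"read {nbytes} bytes, first values: {buf[:4]}")
```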

SHARP IN-NETWORK COMPUTING

[Diagram: 16 nodes (0-15) attached to an IB network; the reduction (∑) over nodes 0-15 is performed in-network by SHARP rather than on the endpoints.]

MAGNUM IO

[Stack diagram: application frameworks (AI, DRIVE, METRO, ISAAC, CLARA, RAPIDS, AERIAL, RTX, HPC) built on CUDA-X and CUDA, with Magnum IO providing Storage IO, Network IO, In-Network Compute, and IO Management.]

2x DL Inference Performance | 10x IO Performance | 6x Lower CPU Utilization

IO OPTIMIZATION IMPACT

[Charts: three relative-speedup panels: improving simulation of a phosphorus monolayer with the latest NCCL P2P (MPI vs. NCCL vs. NCCL+P2P), segmenting extreme weather phenomena in climate simulations (NumPy vs. DALI vs. DALI+GDS), and remote file reads at peak fabric bandwidth on DGX A100 (without GDS vs. with GDS, 3.8x). Reported speedup factors: 1x / 1.24x / 1.54x, 1x / 2.4x / 2.9x, and 1x / 3.8x.]
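The DALI / DALI+GDS bars correspond to reading array files straight to the GPU in the input pipeline. A rough sketch with DALI's numpy reader; the data directory and batch settings are placeholders, and device="gpu" uses GPUDirect Storage when it is available:

```python
from nvidia.dali import pipeline_def, fn

# Placeholder dataset location: a directory of .npy files (e.g. climate tiles).
DATA_DIR = "/lustre/datasets/climate_npy"

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def npy_pipeline():
    # device="gpu" reads the .npy payload directly into GPU memory,
    # using GPUDirect Storage when the driver and filesystem support it.
    data = fn.readers.numpy(device="gpu", file_root=DATA_DIR, file_filter="*.npy")
    return data

pipe = npy_pipeline()
pipe.build()
(batch,) = pipe.run()
print(len(batch))  # number of samples in the batch
```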

DATA CENTER ARCHITECTURE

SELENE

DGX SuperPOD Deployment

#1 on MLPerf for commercially available systems

#5 on TOP500 (63.46 PetaFLOPS HPL)

#5 on Green500 (23.98 GF/W); #1 on Green500 for a single scalable unit (26.2 GF/W)

#4 on HPCG (1.6 PetaFLOPS)

#3 on HPL-AI (250 PetaFLOPS)

Fastest Industrial System in U.S. — 3+ ExaFLOPS AI

Built with NVIDIA DGX SuperPOD Architecture

• NVIDIA DGX A100 and NVIDIA Mellanox IB

• NVIDIA’s decade of AI experience

Configuration:

• 4480 NVIDIA A100 Tensor Core GPUs

• 560 NVIDIA DGX A100 systems

• 850 Mellanox 200G HDR IB switches

• 14 PB of all-flash storage

LESSONS LEARNED: How to Build and Deploy HPC Systems with Hyperscale Sensibilities

Speed and feed matching

Thermal and power design

Interconnect design

Deployability

Operability

Flexibility

Expandability


DGX-1 PODs

NVIDIA DGX-1 – original layout

RIKEN RAIDEN NVIDIA DGX-1 – new layout

DGX-1 Multi-POD


DGX SuperPOD with DGX-2


A NEW DATA CENTER DESIGN

DGX SUPERPOD: Fast Deployment Ready - Cold Aisle Containment Design

DGX SuperPOD Cooling / Airflow


A NEW GENERATION OF SYSTEMS: NVIDIA DGX A100

GPUs: 8x NVIDIA A100
GPU Memory: 640 GB total
Peak Performance: 5 petaFLOPS AI | 10 petaOPS INT8
NVSwitches: 6
System Power Usage: 6.5 kW max
CPU: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
System Memory: 2 TB
Networking: 8x single-port Mellanox ConnectX-6 200Gb/s HDR InfiniBand (compute network); 2x dual-port Mellanox ConnectX-6 200Gb/s HDR InfiniBand (storage network, also used for Ethernet*)
Storage (OS): 2x 1.92 TB M.2 NVMe drives
Storage (Internal): 15 TB (4x 3.84 TB) U.2 NVMe drives
Software: Ubuntu Linux OS (5.3+ kernel)
System Weight: 271 lbs (123 kg)
Packaged System Weight: 315 lbs (143 kg)
Height: 6U
Operating Temperature Range: 5°C to 30°C (41°F to 86°F)

* Optional upgrades
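As a quick sanity check against the spec above (8x A100, 640 GB total GPU memory), NVML's Python bindings can enumerate the GPUs on a node. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed:

```python
import pynvml  # nvidia-ml-py: Python bindings for NVML

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
total_gib = 0
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)      # bytes on older versions, str on newer
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    total_gib += mem.total / 2**30
    print(f"GPU {i}: {name} {mem.total / 2**30:.0f} GiB")
print(f"{count} GPUs, {total_gib:.0f} GiB total")  # expect 8 GPUs / ~640 GB on DGX A100
pynvml.nvmlShutdown()
```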


MODULARITY: RAPID DEPLOYMENT

Compute: Scalable Unit (SU)

Compute Fabric and Management

Storage

DGX SUPERPOD: Modular Architecture

1K GPU SuperPOD Cluster

• 140 DGX A100 nodes (1,120 GPUs) in a GPU POD

• 1st-tier fast storage: DDN AI400X with Lustre

• Mellanox HDR 200Gb/s InfiniBand, full fat-tree

• Network optimized for AI and HPC

DGX A100 Nodes

• 2x AMD EPYC 7742 CPUs + 8x A100 GPUs

• NVLink 3.0 fully connected switch

• 8 compute + 2 storage HDR IB ports

A Fast Interconnect

• Modular IB fat-tree

• Separate networks for compute and storage

• Adaptive routing and SHARPv2 support for offload

[Diagram: a 1K GPU POD with DGX A100 #1 through #140 connected to leaf and spine switches under distributed core switches, plus separate storage leaf and spine switches for the storage fabric.]

DGX SUPERPOD: Extensible Architecture

POD to POD

• Modular IB fat-tree or Dragonfly+

• Core IB switches distributed between PODs

• Direct connect POD to POD

[Diagram: multiple 1K GPU PODs (each with DGX A100 #1 through #140 behind leaf and spine switches) joined through distributed core switches, alongside storage leaf and spine switches.]

MULTI-NODE IB COMPUTE: The Details

Designed with the Mellanox 200Gb HDR IB network

Separate compute and storage fabrics
- 8 links for compute
- 2 links for storage (Lustre)
- Both networks share a similar fat-tree design

Modular POD design
- 140 DGX A100 nodes are fully connected in a SuperPOD
- A SuperPOD contains compute nodes and storage
- All nodes and storage are usable between SuperPODs

SHARPv2-optimized design
- Leaf and spine switches are organized in HCA planes
- Within a SuperPOD, HCA1 of all 140 DGX A100 nodes connects to an HCA1-plane fat-tree network
- Traffic from HCA1 to HCA1 between any two nodes in a POD stays at the leaf or spine level
- Core switches are used only when moving data between HCA planes (e.g., mlx5_0 to mlx5_1 in another system) or between SuperPODs

DESIGNING FOR PERFORMANCE: In the Data Center

Scalable Unit (SU)

The entire design takes a radix-optimized approach for SHARPv2 support and fabric performance, and aligns with the design of Mellanox Quantum switches.

SHARP: HDR200 Selene Early Results

128 NVIDIA DGX A100 systems (1,024 GPUs, 1,024 InfiniBand adapters)

[Chart: NCCL AllReduce performance increase factor with SHARP vs. message size, on a 0-3x scale.]
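The measurement behind this chart is essentially an NCCL all-reduce sweep over message sizes. A rough sketch in PyTorch, launched with torchrun or a similar launcher that sets the rendezvous environment variables; SHARP offload itself is enabled outside the script (e.g. via NCCL's NCCL_COLLNET_ENABLE setting on a SHARP-capable fabric):

```python
import os
import time
import torch
import torch.distributed as dist

# NCCL all-reduce sweep over a few message sizes; compare runs with and
# without SHARP/CollNet offload enabled on the fabric.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

for size_mib in (1, 8, 64, 256):
    x = torch.ones(size_mib * 1024 * 1024 // 4, device="cuda")  # fp32 elements
    dist.all_reduce(x)                      # warm-up
    torch.cuda.synchronize()
    t0 = time.time()
    iters = 20
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print(f"{size_mib} MiB: {(time.time() - t0) / iters * 1e3:.2f} ms per all-reduce")

dist.destroy_process_group()
```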


STORAGE

Per SuperPOD:

Fast Parallel FS: Lustre (DDN)

- 10 DDN AI400X Units

- Total Capacity: 2.5 PB

- Max Perf Read/Write: 490/250 GB/s

- 80 HDR-100 cables required

- 16.6KW

Shared FS: Oracle ZFS5-2

- HA Controller Pair/768GB total

- 8U Total Space (4U per Disk Shelf, 2U per controller)

- 76.8 TB Raw - 24x3.2TB SSD

- 16x40GbE

- Key features: NFS, HA, snapshots, dedupe

- 2kW

Parallel filesystem for performance; NFS for home directories


STORAGE HIERARCHY

• Memory (file) cache (aggregate): 224TB/sec - 1.1PB (2TB/node)

• NVMe cache (aggregate): 28TB/sec - 16PB (30TB/node)

• Network filesystem (cache - Lustre): 2TB/sec - 10PB

• Object storage: 100GB/sec - 100+PB

SOFTWARE OPERATIONS

SCALE TO MULTIPLE NODES: Software Stack - Application

• Deep Learning Model:

• Hyperparameters tuned for multi-node scaling

• Multi-node launcher scripts (see the sketch below)

• Deep Learning Container:

• Optimized TensorFlow, GPU libraries, and multi-node software

• Host:

• Host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
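A typical launcher script maps Slurm's per-task environment onto the distributed training setup. A minimal sketch of that glue in PyTorch; MASTER_ADDR and MASTER_PORT are assumed to be exported by the surrounding sbatch/srun script:

```python
import os
import torch
import torch.distributed as dist

def init_distributed_from_slurm():
    """Initialize NCCL from the environment variables srun sets per task."""
    rank = int(os.environ["SLURM_PROCID"])        # global rank
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within the node

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",   # uses MASTER_ADDR / MASTER_PORT from the job script
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = init_distributed_from_slurm()
    if rank == 0:
        print(f"initialized {world_size} ranks")
```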

SCALE TO MULTIPLE NODES: Software Stack - System

• Slurm: User job scheduling & management

• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes

• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm (see the example below)

• DeepOps: NVIDIA open-source toolbox for GPU cluster management with Ansible playbooks

[Stack diagram: login nodes and DGX PODs (DGX servers with the DGX Base OS), the Slurm controller, Enroot | Docker with Pyxis, NGC model containers (PyTorch and TensorFlow from release 19.09), and DCGM.]
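In practice, users run NGC containers through Slurm with the Pyxis options. An illustrative sketch that composes such an srun command from Python; the image tag, mount paths, and training script are placeholders, while --container-image and --container-mounts are the flags Pyxis adds to srun:

```python
import subprocess

# Illustrative only: run a training script inside an NGC PyTorch container
# across 4 nodes via Slurm + Pyxis/Enroot. Paths and tags are placeholders.
cmd = [
    "srun",
    "--nodes=4",
    "--ntasks-per-node=8",
    "--gpus-per-node=8",
    "--container-image=nvcr.io#nvidia/pytorch:20.12-py3",   # Enroot-style image reference
    "--container-mounts=/lustre/datasets:/datasets",
    "python", "train.py", "--data", "/datasets",
]
subprocess.run(cmd, check=True)
```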


INTEGRATING CLUSTERS IN THE DEVELOPMENT WORKFLOW

• Integrating DL-friendly tools like GitLab, Docker w/ HPC systems

Kick off 10,000s of GPU-hours of tests with a single button click in GitLab

… build and package with Docker

… schedule and prioritize with SLURM

… on demand or on a schedule

… reporting via GitLab, ELK stack, Slack, email

Emphasis on keeping things simple for users while hiding integration complexity

Ensure reproducibility and rapid triage

Supercomputer-scale CI (continuous integration, internal at NVIDIA)

LINKS

RESOURCES

GTC Sessions (https://www.nvidia.com/en-us/gtc/session-catalog/):

Under the Hood of the new DGX A100 System Architecture [S21884]

Inside the NVIDIA Ampere Architecture [S21730]

CUDA New Features And Beyond [S21760]

Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing [S21766]

Introducing NVIDIA DGX A100: the Universal AI System for Enterprise [S21702]

Mixed-Precision Training of Neural Networks [S22082]

Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide [S21929]

Developing CUDA kernels to push Tensor Cores to the Absolute Limit on NVIDIA A100 [S21745]

HotChips:

Hot Chips Tutorial - Scale Out Training Experiences – Megatron Language Model

Hot Chips Session - NVIDIA’s A100 GPU: Performance and Innovation for GPU Computing

Pyxis/Enroot https://fosdem.org/2020/schedule/event/containers_hpc_unprivileged/

Presentations


RESOURCES

DGX A100 Page https://www.nvidia.com/en-us/data-center/dgx-a100/

Blogs

DGX SuperPOD https://blogs.nvidia.com/blog/2020/05/14/dgx-superpod-a100/

DDN Blog for DGX A100 Storage https://www.ddn.com/press-releases/ddn-a3i-nvidia-dgx-a100/

Kitchen Keynote summary https://blogs.nvidia.com/blog/2020/05/14/gtc-2020-keynote/

Double Precision Tensor Cores https://blogs.nvidia.com/blog/2020/05/14/double-precision-tensor-cores/

Links and other documentation