TRANSCRIPT
NVIDIA DGX SYSTEM
2
CONTENTS
• NVIDIA DGX
• DGX AI
• DGX POD RA
• DGX POD RA
• DGX
RISE OF GPU COMPUTING
[Chart, 1980-2020: single-threaded performance growth slowed from 1.5x per year to 1.1x per year, while GPU-computing performance grows 1.5x per year, on track for 1000x by 2025. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]
The full stack: APPLICATIONS | SYSTEMS | ALGORITHMS | CUDA | ARCHITECTURE
BEYOND MOORE'S LAW
Progress of the Stack in 6 Years
[Chart: relative performance, March 2012 to March 2019; GPU-accelerated computing outpaces both the CPU and Moore's Law.]
2013, accelerated server with Fermi: Base OS CentOS 6.2 | Resource Mgr r304 | CUDA 5.0 | NPP 5.0 | cuSPARSE 5.0 | cuRAND 5.0 | cuFFT 5.0 | cuBLAS 5.0 | Thrust 1.5.3
2019, accelerated server with Volta: Base OS Ubuntu 16.04 | Resource Mgr r384 | CUDA 10.0 | NPP 10.0 | cuSPARSE 10.0 | cuSOLVER 10.0 | cuRAND 10.0 | cuFFT 10.0 | cuBLAS 10.0 | Thrust 1.9.0
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity
APPS & FRAMEWORKS: +600 applications (Amber, NAMD, ...)
CUSTOMER USE CASES:
• Scientific applications: molecular simulations, weather forecasting, seismic mapping
• Consumer internet & industry applications: speech, translate, recommender; manufacturing, healthcare, finance
• Virtual graphics: creative & technical users, knowledge workers
CUDA-X & NVIDIA SDKs:
• HPC: cuFFT | OpenACC
• Deep learning: cuDNN
• Machine learning: cuML | cuDF | cuGRAPH | cuDNN | CUTLASS | TensorRT
• Virtual GPU: vDWS | vPC | vAPPS
CUDA & CORE LIBRARIES: cuBLAS | NCCL
TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA DGX family | NVIDIA HGX system | OEM | Cloud
NVIDIA ENTERPRISE GPU PRODUCT FAMILY
Computing for Modern Enterprise Workloads: Data Center | Servers | Workstations
SPECIALIZED (Max Performance):
• AI & HPC model development: V100 (Tensor Core, NVLink; 32 GB HBM2; 250W/300W)
• Rendering: RTX 8000/6000 (RT Core, Tensor Core; 48/24 GB GDDR6; 250W/300W)
MAINSTREAM (Max Utility):
• Enterprise application deployment: T4 (Tensor Core, RT Core; 16 GB GDDR6; 70W)
• Data science & visualization: RTX 8000/6000 (Tensor Core; 48/24 GB GDDR6)
• AI development, design & graphics: RTX 8000/6000/5000*/4000* (RT Core, Tensor Core)
*Not designed for, and cannot be qualified for, servers
END-TO-END PRODUCT FAMILY
HPC / TRAINING:
• Desktop: TITAN / GeForce
• Workstation: DGX Station
• Data center: Tesla V100/T4
• Server platform: HGX-1 / HGX-2
• Fully integrated AI systems: DGX-1, DGX-2
INFERENCE:
• Data center: Tesla V100, Tesla T4
• Embedded: Jetson AGX Xavier
• Automotive: Drive AGX Pegasus
• Virtual workstation: Virtual GPU
DGX Station
9
DGX Station
Groundbreaking AI, at your desk
A personal AI supercomputer, built for researchers and data scientists

Key Features
1. 4x NVIDIA Tesla V100 GPUs (now 32 GB)
2. 2nd-gen NVLink (4-way)
3. Water-cooled design
4. 3x DisplayPort (4K resolution)
5. Intel Xeon E5-2698, 20-core
6. 256 GB DDR4 RAM
NVIDIA DGX STATION SPECIFICATIONS
GPUs: 4x NVIDIA® Tesla® V100
TFLOPS (GPU FP16): 500
GPU memory: 32 GB per GPU
NVIDIA Tensor Cores: 2,560 (total)
NVIDIA CUDA Cores: 20,480 (total)
CPU: Intel Xeon E5-2698 v4, 2.2 GHz (20-core)
System memory: 256 GB LRDIMM DDR4
Storage: data: 3x 1.92 TB SSD RAID 0; OS: 1x 1.92 TB SSD
Network: dual 10GBASE-T LAN (RJ45)
Display: 3x DisplayPort, 4K resolution
Additional ports: 2x eSATA, 2x USB 3.1, 4x USB 3.0
Acoustics: < 35 dB
Maximum power requirements: 1,500 W
Operating temperature range: 10-30 °C
Software: Ubuntu Desktop Linux OS, DGX Recommended GPU Driver, CUDA Toolkit
11
DGX-1
12
NVIDIA DGX-1 WITH VOLTA
Highest Performance, Fully Integrated HW System
1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink hybrid cube mesh | 2x Xeon | 7 TB SSD RAID 0 | Quad IB/Ethernet 100 Gbps, dual 10GbE | 3U, 3,200 W
13
DGX-1 NVLINK
300 GB/sec per GPU, 10x faster than PCIe Gen3
NVLink for Tesla Volta: 3 rings
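The 10x claim can be sanity-checked with quick arithmetic. A small sketch, assuming PCIe Gen3 x16 moves about 16 GB/s per direction (~32 GB/s bidirectional) and Volta NVLink is counted as 6 links x 25 GB/s x 2 directions:

```python
# Rough sanity check of the "10x faster than PCIe Gen3" claim.
# Assumptions (not from the slide): PCIe Gen3 x16 at ~16 GB/s per direction;
# Volta NVLink at 6 links x 25 GB/s x 2 directions = 300 GB/s per GPU.
nvlink_gbs = 6 * 25 * 2          # 300 GB/s bidirectional per V100
pcie_gen3_x16_gbs = 16 * 2       # ~32 GB/s bidirectional
speedup = nvlink_gbs / pcie_gen3_x16_gbs
print(round(speedup, 1))         # ~9.4, i.e. roughly the 10x on the slide
```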
NVLINK vs PCIe for DL Training
15
DL DATA PARALLELISM: PCIE BASED
[Diagram: 8 GPUs (0-7) attached through PCIe switches to two CPUs joined by a QPI link.]
Data loading and gradient averaging share communication resources: congestion.
16
DL DATA PARALLELISM: NVLINK
[Diagram: the same PCIe topology carries data loading, while a separate NVLink fabric between the 8 GPUs carries gradient averaging.]
No sharing of communication resources: no congestion.
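The gradient-averaging traffic in these diagrams is typically a ring all-reduce, the collective that NCCL runs over NVLink in data-parallel training. A minimal pure-Python sketch of the algorithm (illustrative only, not NCCL itself):

```python
# Illustrative pure-Python ring all-reduce: the two phases (reduce-scatter,
# then all-gather) that average gradients across data-parallel workers.

def ring_allreduce_mean(grads):
    n = len(grads)                   # number of workers ("GPUs")
    size = len(grads[0])
    assert size % n == 0, "sketch assumes gradient length divisible by n"
    csz = size // n                  # elements per chunk
    bufs = [list(g) for g in grads]  # each worker's local buffer

    def take(r, c):                  # copy chunk c out of worker r's buffer
        return bufs[r][c * csz:(c + 1) * csz]

    # Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod n
    # to its right neighbour, which adds it element-wise.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, take(r, (r - s) % n)) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            for i, v in enumerate(data):
                bufs[dst][c * csz + i] += v

    # Phase 2: all-gather. Worker r now owns the fully reduced chunk
    # (r + 1) mod n and completed chunks circulate around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, take(r, (r + 1 - s) % n)) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            bufs[dst][c * csz:(c + 1) * csz] = data

    # Divide by worker count: every worker ends with the same mean gradient.
    return [[v / n for v in b] for b in bufs]

# 4 workers, each holding a different 4-element "gradient"
out = ring_allreduce_mean([[float(r)] * 4 for r in range(4)])
print(out[0])  # every worker holds the mean gradient [1.5, 1.5, 1.5, 1.5]
```

Each step moves only one chunk per worker, so bandwidth use is balanced around the ring; with a dedicated NVLink fabric this traffic no longer competes with data loading over PCIe.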
17
30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
• Encoder and decoder embedding size of 512
• Batch size of 256 per GPU
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
18
2.54X BETTER PERFORMANCE WITH NVLINK
• Performance benefits increase with increasing encoder/decoder embedding size
• Sockeye neural machine translation single-precision training
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
19
PyTorch Deep Learning Training
PyTorch is a deep learning framework that puts Python first.
VERSION: 1.1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU, multi-node
More information: www.pytorch.org | PyTorch on NGC

PyTorch Deep Learning Framework: Training on V100 GPU Server vs P100 GPU Server
[Chart: speedup vs. a server with 8x P100 SXM2 (0x-6x scale) across ResNet-50 v1.5 (image), SSD (object detection), NCF (recommender), Tacotron2 (speech), and GNMT (translation). Server with 8x V100 PCIe 16GB: 2.0x avg. speedup. DGX-1 server with 8x V100 SXM2 32GB: 3.0x avg. speedup, up to 3x faster.* Bar labels: 6,095 imgs/sec; 31.7M imgs/sec; 3.2M samples/sec; 13,579 tokens/sec; 334,435 tokens/sec; 6,116 imgs/sec; 2,010 imgs/sec; 96.1M samples/sec; 17,185 tokens/sec; 596,891 tokens/sec.]
*NCF performance benefit comes from the larger 32GB GPU memory
GPU server: dual Xeon E5-2698 v4 @ 2.20GHz with GPU servers as shown. Framework: PyTorch v1.1.0; mixed precision; CUDA 10.1.105; NCCL 2.4.6; cuDNN 7.5.0.56; cuBLAS 10.1.105; NVIDIA driver 410.104. Batch sizes: V100 PCIe: 256 for ResNet-50 v1.5/GNMT v2, 64 for SSD, 1,048,576 for NCF, 80 for Tacotron2 | V100 SXM2: 512 for ResNet-50 v1.5/GNMT v2, 64 for SSD, 1,048,576 for NCF, 80 for Tacotron2 | P100 SXM2: 128 for ResNet-50 v1.5/GNMT v2, 32 for SSD, 524,288 for NCF, 48 for Tacotron2.
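These runs use mixed-precision training. A stdlib-only sketch of why FP16 storage is paired with loss scaling (the 1024 scale factor below is an arbitrary illustration, not the value the benchmarks used):

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision, i.e. what storing
    an activation or gradient in FP16 does to its value."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                      # below FP16's smallest subnormal (~6e-8)
print(to_fp16(tiny_grad))             # 0.0: the gradient vanishes in FP16

scale = 1024.0                        # loss scaling: multiply before FP16 storage
scaled = to_fp16(tiny_grad * scale)   # survives as a representable FP16 value
print(scaled / scale)                 # unscale in FP32: ~1e-8 recovered
```

Scaling the loss shifts small gradients into FP16's representable range before they are stored, and the master FP32 copy divides the scale back out before the weight update.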
20
TensorFlow Deep Learning Training
An open-source software library for numerical computation using data flow graphs.
VERSION: 1.13.1
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU and multi-node
More information: www.tensorflow.org | TensorFlow on NGC

TensorFlow Deep Learning Framework: Training on V100 GPU Server vs P100 GPU Server
[Chart: speedup vs. a server with 8x P100 SXM2 (0x-4x scale). Server with 8x V100 PCIe 16GB, 1.7x avg. speedup: GNMT (translation) 81,090 tokens/sec; NCF (recommender) 20,026,780 samples/sec; ResNet-50 v1.5 (image) 5,795 imgs/sec; SSD (object detection) 588 imgs/sec; U-Net Industrial (segmentation) 459 imgs/sec. DGX-1 server with 8x V100 SXM2 32GB, 2.0x avg. speedup, up to 3x faster:* GNMT 136,116 tokens/sec; NCF 67,075,162 samples/sec; ResNet-50 v1.5 6,394 imgs/sec; SSD 661 imgs/sec; U-Net Industrial 524 imgs/sec.]
*NCF performance benefit comes from the larger 32GB GPU memory
GPU server: dual Xeon E5-2698 v4 @ 2.20GHz with GPU servers as shown. Framework: TensorFlow v1.13.1; mixed precision; CUDA 10.1.105; NCCL 2.4.6; cuDNN 7.5.0.56; cuBLAS 10.1.105; NVIDIA driver 410.104. Batch sizes: V100 PCIe: 192 for GNMT v2, 1,048,576 for NCF, 256 for ResNet-50 v1.5, 32 for SSD, 2 for U-Net Industrial | V100 SXM2: 192 for GNMT v2, 1,048,576 for NCF, 512 for ResNet-50 v1.5, 32 for SSD, 2 for U-Net Industrial | P100 SXM2: 128 for GNMT v2/ResNet-50 v1.5, 1,048,576 for NCF, 32 for SSD, 2 for U-Net Industrial.
21
DGX-2
22
NVIDIA DGX-2
The World's Most Powerful Deep Learning System for the Most Complex Deep Learning Challenges
• First 2 PFLOPS system
• 16 V100 32GB GPUs fully interconnected
• NVSwitch: 2.4 TB/s bisection bandwidth
• 24x GPU-GPU bandwidth
• 0.5 TB of unified GPU memory
• 10x deep learning performance
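The 2.4 TB/s bisection figure follows from the per-GPU NVLink bandwidth. A quick check, assuming 300 GB/s bidirectional per V100 as quoted earlier in the deck:

```python
# Sanity check of the NVSwitch bisection-bandwidth figure: cut the 16-GPU
# system in half and count the NVLink bandwidth crossing the cut.
gpus = 16
per_gpu_gb_s = 300               # bidirectional NVLink bandwidth per V100
bisection_tb_s = (gpus // 2) * per_gpu_gb_s / 1000
print(bisection_tb_s)            # 2.4 (TB/s), matching the slide
```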
23
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
1
2
3
5
4
6 Two Intel Xeon Platinum CPUs
7 1.5 TB System Memory
23
30 TB NVME SSDs Internal Storage
NVIDIA Tesla V100 32GB
Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card
Twelve NVSwitches2.4 TB/sec bi-section
bandwidth
Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth
PCIe Switch Complex
8
9
9Dual 10/25 Gb/secEthernet
35
FULL NON-BLOCKING BANDWIDTH
[Diagram: GPU0-GPU7 on one baseboard and GPU8-GPU15 on the other, each GPU connected to all six NVSwitches on its board; the twelve NVSwitches are bridged across the plane card.]
25
HIGHER PERFORMANCE WITH NVSWITCH
DGX-2 vs Multi-System Interconnect
HPC:
• Physics (MILC benchmark), 4D grid: 2x faster
• Weather (ECMWF benchmark), all-to-all: 2.4x faster
AI training:
• Recommender (sparse embedding), reduce & broadcast: 2x faster
• Language model (Transformer with MoE), all-to-all: 2.7x faster
Footnote: the two 8x V100 servers have dual-socket Xeon E5-2698 v4 processors and 8 V100 GPUs each, connected via 4x 100 Gb IB ports; the DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs.
26
RAPIDS BENCHMARKS
[Chart: benchmark times in seconds; CPU nodes = r4.2xlarge EMR.]
DGX-1 OR DGX-2?
DGX-1 multi-system deployments deliver affordable AI scale:
• Excellent performance for 1-8 GPU jobs
• Support for larger pools of concurrent users
• Proven approaches and solutions for multi-system scale
DGX-2 tackles your most complex models:
• High-definition video training, speech, translation
• Large models, more complex networks, model parallelism
• Best multi-GPU performance with single-system simplicity
ANNOUNCING NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Test bed for highest-performance scale-up systems:
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture:
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks:
• Freely available through NGC
System: 96 DGX-2H | 10 Mellanox EDR IB per node | 1,536 V100 Tensor Core GPUs | 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
29
DGX AI
30
AI
AI WORKSTATION AI DATA CENTER
• Universal SW for Deep Learning
• Predictable execution across platforms
• Pervasive reach
DGX SOFTWARE STACK
The Essential Instrument for AI
Research
DGX-1
The Personal AI Supercomputer
DGX Station
The World’s Most Powerful AI System for the Most Complex AI Challenges
DGX-2
31
DGX
DGX Station DGX-1 DGX-2
NVIDIAGPU Cloud
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
DGX
➢ NVIDIA DGX
➢ DGX vs. DIY GPU systems
➢ NVIDIA Enterprise Support for DGX
34
DGX: AI Value to IT and Users
[Diagram: a DIY stack covers only GPUs, NVLink, and the server; DGX systems add the NGC DL SW stack and NVIDIA AI experts on top of GPUs, NVLink, and the server.]
From design to support: end-to-end AI expertise
35
DGX: Accelerate Deep Learning Value
[Diagram: from idea to results. At the desk: procure a DGX Station; install, build, test; fast bring-up and productive experimentation; iterate and refine the model. In the data center: train at scale, refine, re-train. At the edge: deploy for inference and gather insights.]
36
37
AI
38
TRAIN CLOSEST TO WHERE YOUR DATA LIVES
Keep compute where the data lives: ON-PREM
39
AI: Short-term thinking leads to longer-term problems
40
AI EVALUATION CRITERIA
Looking beyond the "spec sheet":
• AI/DL expertise & innovation
• AI/DL software stack
• Operating system image
• Hardware architecture
INSIGHTS GAINED FROM DEEP LEARNING DATA CENTERS
• Rack design: DL drives close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high-bandwidth, ultra-low latency
• Storage: datasets range from 10k's to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt = less data center floorspace required
• Software: scale requires "cluster-aware" software
Example:
• Autonomous vehicle = 1 TB/hr
• Training sets up to 500 PB
• ResNet-50: 113 days to train; objective: 7 days
• 6 simultaneous developers = 97-node cluster
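The sizing arithmetic behind the example, assuming training time scales roughly linearly with node count (an idealization):

```python
import math

# Example from the slide: ResNet-50 takes 113 days on one node; the team
# wants 7-day turnaround for 6 simultaneous developers. Assuming near-linear
# scaling (an idealization), size the cluster:
days_on_one_node = 113
target_days = 7
developers = 6

nodes_per_job = days_on_one_node / target_days      # ~16.1 nodes per job
cluster = math.ceil(nodes_per_job * developers)
print(cluster)                                      # 97, as on the slide
```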
42
DGX POD
43
NVIDIA DGX POD™: POD design with cooling
Nine DGX-1 servers
• Eight Tesla V100 GPUs each
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4x 100 GbE)
Twelve storage nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB total HDD)
• 50 GbE networking
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4x 100 GbE (up to 8)
Rack
• 35 kW power
• 42U x 1200 mm x 700 mm (minimum)
• Rear-door cooler
DGX-1 POD
• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Based on proven SATURNV architecture
44
AI SOFTWARE DEVELOPMENT WORKFLOW
For Large-Scale Multi-User AI Software Development Teams
● Data factory collects raw data and includes tools used to pre-process, index, label, and manage data
● Model training with labeled data uses a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on DGX servers with Volta Tensor Core GPUs
● Model testing and validation adjusts model parameters as needed and repeats training until the desired accuracy is reached
● Model optimization for production deployment (inference) is completed using the NVIDIA TensorRT optimizing inference accelerator
NVIDIA DGX POD: Build an AI Platform Fast
46
NVIDIA AI SOFTWARE
For Large-Scale Multi-User AI Software Development Teams
47
DGX POD AND NGC CONTAINERS
HPC: bigdft | candle | chroma | gamess | gromacs | lammps | lattice-microbes | milc | namd | pgi | picongpu | relion | vmd
Deep Learning: caffe | caffe2 | cntk | cuda | digits | inferenceserver | mxnet | pytorch | tensorflow | tensorrt | theano | torch
HPC Visualization: index | paraview-holodeck | paraview-index | paraview-optix
Partners: chainer | h2oai-driverless | kinetica | mapd | paddlepaddle
NVIDIA/K8s: Kubernetes on NVIDIA GPUs
48
DGX POD MANAGEMENT SOFTWARE
For Large-Scale Multi-User AI Software Development Teams
https://github.com/NVIDIA/deepops
49
DGX POD RA: The AI Platform and Its Value
50
THE VALUE OF THE DGX POD RA AI PLATFORM
• Reference architectures from NVIDIA and leading storage partners
• Simplified, validated, converged infrastructure offerings
• Available through select NPN partners as a turnkey solution
[Diagram: DGX RA solution with partner storage.]
51
"DIY" TCO
Designing, Building and Supporting an AI Infrastructure, from Scratch
[Chart: CAPEX and OPEX from Day 1 to Month 3. OPEX goes to study & exploration, platform design, HW & SW integration, troubleshooting, software engineering, software optimization, design and build for scale, and software re-optimization before productive experimentation, training at scale, and insights are reached.]
Time and budget spent on things other than data science.
52
2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture
[Chart: installing and deploying a DGX RA solution replaces the "DIY" troubleshooting, software engineering, optimization, and re-optimization phases; study & exploration and platform design lead directly to productive experimentation, training at scale, and insights. CAPEX covers the DGX TCO; the deployment cycle is shortened.]
Wasted time/effort: eliminated.
53
2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture
[Chart: with the DGX RA solution installed and deployed, study & exploration reaches productive experimentation, training at scale, and insights by Week 1, versus Month 3 on the "DIY" TCO path; CAPEX dominates the DGX TCO.]
54
TECHNICAL SUPPORT WITHOUT A UNIFIED INTERFACE
[Diagram: the system goes from installed/running to a problem. User: "My PyTorch CNN model is running 30% slower than yesterday!" IT admin: "OK, let me look into it."]
55
[Diagram: to resolve the problem, the IT admin faces multiple paths: framework? libraries? O/S? GPU? drivers? server? network? storage? Each leads to open-source forums or to separate server, storage & network solution providers.]
56
TECHNICAL SUPPORT WITH THE DGX POD INTEGRATED ARCHITECTURE
[Diagram: the same problem, "My PyTorch CNN model is running 30% slower than yesterday!", goes from the IT admin to a single NPN partner backed by NVIDIA AI expertise; the answer, "Update to PyTorch container XX.XX", gets the DGX RA solution running again.]
57
DGX-1 STORAGE: LOCAL STORAGE
OS drive: 1x 480 GB SSD, no redundancy
Cache: 4x 1.92 TB SSD in RAID 0 (cacheFS), no redundancy
➢ Each DGX-1 system has 5 SSDs
➢ Deep learning training typically re-reads the training data many times over repeated epochs, so fast local storage effectively raises IO efficiency as training iterates, and with it GPU utilization.
59
DGX-1 EXTERNAL STORAGE IO REQUIREMENTS (REFERENCE)
➢ The storage network can be 10 GbE or InfiniBand
➢ The table below is a reference recommendation for storage systems, based on the common IO access patterns of deep learning frameworks; treat it as guidance only

Workload | Sufficient read cache? | DGX cache capacity | Recommended network | Network file system
Data analytics | N/A | N/A | 10 GbE | Object storage, NFS, or other storage with good concurrent-read and small-file performance
HPC | N/A | N/A | 10/40/100 GbE, IB | NFS or an HPC parallel storage system supporting many clients with strong single-node performance
DL, 256x256 images | Yes | 63 million images | 10 GbE | NFS or storage with efficient small-file IO
DL, 1080p images | Yes | 13 million images | 10/40 GbE, IB | High-performance NFS or HPC storage, high concurrency
DL, 4K images | Yes | 5 million images | 40 GbE, IB | High-performance NFS or HPC storage, high concurrency, 3 GB/s+ per node
DL, uncompressed images | Yes | 1 million images | 40/100 GbE, IB | High-performance NFS or HPC storage, high concurrency, 3 GB/s+ per node
DL, dataset not cached | No | N/A | 10/40/100 GbE, IB | Performance as above; total must satisfy all applications running concurrently
https://docs.nvidia.com/dgx/bp-dgx/index.html#storage_scaling
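The cache-capacity column can be reproduced roughly from the 4x 1.92 TB cacheFS and an assumed average compressed file size per image class. The sizes below are illustrative guesses chosen to show the arithmetic, not measured values:

```python
# Rough reproduction of the "DGX cache capacity" column: how many images of
# each class fit in the DGX-1 cacheFS (4 x 1.92 TB SSD, RAID 0).
# The average compressed file sizes are illustrative assumptions.
cache_bytes = 4 * 1.92e12

avg_bytes = {
    "256x256 images": 120e3,   # ~120 KB JPEG (assumed)
    "1080p images": 590e3,     # ~590 KB JPEG (assumed)
    "4K images": 1.5e6,        # ~1.5 MB JPEG (assumed)
}
for kind, size in avg_bytes.items():
    print(kind, round(cache_bytes / size / 1e6, 1), "million images")
```

With these assumed sizes the results land near the table's 63M / 13M / 5M figures.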
60
NVIDIA DGX POD: AI Infrastructure Built on NVIDIA Best Practices
Storage partner DGX-1 RA solutions: a growing portfolio of offers
Common benefits:
• Eliminate design guesswork
• Faster, simpler deployment
• Predictable performance at scale
• Simplified, single point of support
Backed by prioritized NPN partners
Reference architectures
NVIDIA DGX POD: Build an AI Platform Fast
62
ONTAP AI: NETAPP VERIFIED ARCHITECTURE
[Diagram: 4x NVIDIA DGX-1, each with 4x 100GbE to a pair of 100GbE switches (2-4 ISLs between them); 1x NetApp AFF A800 attached with 2x 100GbE per controller.]
Key metrics:
• 4 DGX-1s, 1 AFF A800 HA pair
• Peak throughput requested: 5 GB/s
• Sustained throughput requested: 4 GB/s
• Average storage latency: ~600 µs
• All 32 GPUs kept consistently >95% busy
• Storage CPU utilization achieved: ~18%
• The A800 can provide 25 GB/s read throughput
Conclusion:
• Massive headroom for one AFF A800 to support a large number of DGX-1 servers
© 2018 NetApp, Inc. All rights reserved. NetApp Confidential, Limited Use Only
63
64
ACCELERATING THE AI DATA PIPELINE
Training with NVIDIA DGX-1 and NetApp A800
Test environment: 32 GPUs (4 DGX-1 servers), Tensor Cores; throughput measured as images per second; compares metrics for synthetic and ImageNet data
Conclusion: near-linear scaling achieved; performance on the ImageNet dataset is close to synthetic data
START SMALL, SCALE BIG
1:1, 1:4, and 1:5 configurations in a 42U rack, up to a full scale-out configuration*
* Based on 35 kW racks
66
NETAPP ONTAP AI SOLUTION RACK-SCALE ARCHITECTURE
67
NVIDIA DGX-2 POD WITH NETAPP AFF A800
AIRI: AI-READY INFRASTRUCTURE
Extending the power of DGX-1 at scale in every enterprise
HARDWARE
• NVIDIA DGX-1 | 4x DGX-1 systems | 4 PFLOPS
• PURE FLASHBLADE™ | 15x 17TB blades | 1.5M IOPS
• ARISTA | 2x 100Gb Ethernet switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA-optimized frameworks
• AIRI SCALING TOOLKIT | multi-node training made simple
[Diagram: Pure Storage network topology.]
DDN A3I WITH DGX-1
Making AI-Powered Innovation Easier
HARDWARE
• NVIDIA DGX-1 | 4x DGX-1 systems | 4 PFLOPS
• DDN AI200, AI7990 | 20 GB/s | from 30 TB | 350K IOPS
• NETWORK: 2x EDR IB or 100GbE switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA-optimized frameworks
• DDN: high-performance, low-latency parallel file system
• DDN: in-container client for easy deployment, efficiency, performance, and reliability
[Diagram: DDN A3I reference architecture in a 9:1 configuration.]
IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX
The Engine to Power Your AI Data Pipeline
HARDWARE
• NVIDIA DGX-1 | up to 9x DGX-1 systems
• IBM Spectrum Scale NVMe appliance | 40 GB/s per node, 120 GB/s in 6RU | 300 TB per node
• NETWORK: Mellanox SB7700 switch | 2x EDR IB with RDMA
SOFTWARE
• NVIDIA DGX SOFTWARE STACK | NVIDIA-optimized frameworks
• IBM: high-performance, low-latency parallel file system
• IBM: extensible and composable
DELL EMC ISILON WITH NVIDIA DGX
Simplified Enterprise AI Infrastructure
HARDWARE
• NVIDIA DGX-1 | 9x DGX-1 systems = 9 PFLOPS
• DELL EMC ISILON | 4x or 8x Isilon F800 nodes (in 2 chassis), up to 250K IOPS per chassis
• ARISTA | 2x 7060CX2-32S | 32x 40/100GbE
SOFTWARE
• NVIDIA DGX SOFTWARE STACK | NVIDIA-optimized AI/DL frameworks
• ISILON OneFS
74
(TIER 1) DGX-1 STORAGE PARTNERS (PUBLISHED RAs AS OF 12/12/18)

Pure Storage AIRI® / AIRI Mini®
• Storage: Pure FlashBlade™
• Fabric: 100 Gb Ethernet, RoCE
• Protocols: NFSv3, S3, SMB2.1, HTTP(S)
• Configurations: reference architecture with 4 DGX-1s; deployed with 500+ DGX-1s in a customer environment
• Documentation: https://www.purestorage.com/content/dam/purestorage/pdf/datasheets/Pure_Storage_FlashBlade_Datasheet_05.pdf

NetApp® ONTAP® AI
• Storage: NetApp A800 all-flash storage system
• Fabric: 100 Gb Ethernet, RoCE
• Protocols: ONTAP AI NFS filesystem
• Configurations: 4 DGX-1s, with a 9 DGX-1 POD in progress
• Documentation: "Scalable AI Infrastructure for Real-World Deep Learning Use Cases: Deployment Guide"

DDN A3I®
• Storage: AI200, AI400, AI7990
• Fabric: InfiniBand (up to EDR), Ethernet (up to 100 Gb/s)
• Filesystem: DDN has developed an intelligent parallel file system client specifically for DGX-1 server containers that engages multiple high-speed data paths to the storage and delivers the full performance of NVMe flash directly to the application. Under the covers, DDN is running Lustre, but has done a lot of engineering to simplify setup and configuration for AI environments, specific to DGX-1.
• Configurations: 9 DGX-1s; 1:1, 4:1, and 9:1 configurations

Dell EMC Isilon
• Storage: F800 all-flash scale-out NAS
• Fabric: 40/100 Gb Ethernet, RoCE
• Protocols: OneFS, NFS
• Configurations: 9 DGX-1 POD

IBM Spectrum Scale for AI with NVIDIA DGX
• Storage: IBM Spectrum Scale based shared storage
• Fabric: InfiniBand (up to EDR), Ethernet (up to 100 Gb/s)
• Protocols: IBM Spectrum Scale (POSIX); NFS v3/v4.0 and SMB (through cluster export services)
• Configurations: 1-9 DGX-1 servers; 1-3 Spectrum Scale all-flash appliances