TRANSCRIPT
NVIDIA DGX SYSTEM
2
CONTENTS
• NVIDIA DGX
• DGX AI
• DGX POD RA
• DGX POD RA
• DGX
RISE OF GPU COMPUTING
[Chart, 1980-2020: single-threaded performance growth slowed from 1.5x per year to 1.1x per year, while GPU-computing performance grows 1.5x per year, on track for 1000x by 2025. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]
The full stack: APPLICATIONS | SYSTEMS | ALGORITHMS | CUDA | ARCHITECTURE
BEYOND MOORE'S LAW
Progress of the Stack in 6 Years
[Chart: relative performance, March 2012 to March 2019; GPU-accelerated computing outpaces both the CPU and Moore's Law.]
2013, accelerated server with Fermi: Base OS CentOS 6.2 | Resource Mgr r304 | CUDA 5.0 | NPP 5.0 | cuSPARSE 5.0 | cuRAND 5.0 | cuFFT 5.0 | cuBLAS 5.0 | Thrust 1.5.3
2019, accelerated server with Volta: Base OS Ubuntu 16.04 | Resource Mgr r384 | CUDA 10.0 | NPP 10.0 | cuSPARSE 10.0 | cuSOLVER 10.0 | cuRAND 10.0 | cuFFT 10.0 | cuBLAS 10.0 | Thrust 1.9.0
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity
APPS & FRAMEWORKS: +600 applications (Amber, NAMD, ...)
CUSTOMER USE CASES:
• Scientific applications: molecular simulations, weather forecasting, seismic mapping
• Consumer internet & industry applications: speech, translate, recommender; manufacturing, healthcare, finance
• Virtual graphics: creative & technical users, knowledge workers
CUDA-X & NVIDIA SDKs:
• HPC: cuFFT | OpenACC
• Deep learning: cuDNN
• Machine learning: cuML | cuDF | cuGRAPH | cuDNN | CUTLASS | TensorRT
• Virtual GPU: vDWS | vPC | vAPPS
CUDA & CORE LIBRARIES: cuBLAS | NCCL
TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA DGX family | NVIDIA HGX system | OEM | Cloud
NVIDIA ENTERPRISE GPU PRODUCT FAMILY
Computing for Modern Enterprise Workloads: Data Center | Servers | Workstations
SPECIALIZED (Max Performance):
• AI & HPC model development: V100 (Tensor Core, NVLink; 32 GB HBM2; 250W/300W)
• Rendering: RTX 8000/6000 (RT Core, Tensor Core; 48/24 GB GDDR6; 250W/300W)
MAINSTREAM (Max Utility):
• Enterprise application deployment: T4 (Tensor Core, RT Core; 16 GB GDDR6; 70W)
• Data science & visualization: RTX 8000/6000 (Tensor Core; 48/24 GB GDDR6)
• AI development, design & graphics: RTX 8000/6000/5000*/4000* (RT Core, Tensor Core)
*Not designed for, and cannot be qualified for, servers
END-TO-END PRODUCT FAMILY
HPC / TRAINING:
• Desktop: TITAN / GeForce
• Workstation: DGX Station
• Data center: Tesla V100/T4
• Server platform: HGX-1 / HGX-2
• Fully integrated AI systems: DGX-1, DGX-2
INFERENCE:
• Data center: Tesla V100, Tesla T4
• Embedded: Jetson AGX Xavier
• Automotive: Drive AGX Pegasus
• Virtual workstation: Virtual GPU
DGX Station
9
DGX Station
Groundbreaking AI, at your desk
A personal AI supercomputer, built for researchers and data scientists

Key Features
1. 4x NVIDIA Tesla V100 GPUs (now 32 GB)
2. 2nd-gen NVLink (4-way)
3. Water-cooled design
4. 3x DisplayPort (4K resolution)
5. Intel Xeon E5-2698, 20-core
6. 256 GB DDR4 RAM
NVIDIA DGX STATION SPECIFICATIONS
GPUs: 4x NVIDIA® Tesla® V100
TFLOPS (GPU FP16): 500
GPU memory: 32 GB per GPU
NVIDIA Tensor Cores: 2,560 (total)
NVIDIA CUDA Cores: 20,480 (total)
CPU: Intel Xeon E5-2698 v4, 2.2 GHz (20-core)
System memory: 256 GB LRDIMM DDR4
Storage: data: 3x 1.92 TB SSD RAID 0; OS: 1x 1.92 TB SSD
Network: dual 10GBASE-T LAN (RJ45)
Display: 3x DisplayPort, 4K resolution
Additional ports: 2x eSATA, 2x USB 3.1, 4x USB 3.0
Acoustics: < 35 dB
Maximum power requirements: 1,500 W
Operating temperature range: 10-30 °C
Software: Ubuntu Desktop Linux OS, DGX Recommended GPU Driver, CUDA Toolkit
11
DGX-1
12
NVIDIA DGX-1 WITH VOLTA
Highest Performance, Fully Integrated HW System
1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink hybrid cube mesh | 2x Xeon | 7 TB SSD RAID 0 | Quad IB/Ethernet 100 Gbps, dual 10GbE | 3U, 3,200 W
13
DGX-1 NVLINK
300 GB/sec per GPU, 10x faster than PCIe Gen3
NVLink for Tesla Volta: 3 rings
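The 10x claim can be sanity-checked with quick arithmetic. A small sketch, assuming PCIe Gen3 x16 moves about 16 GB/s per direction (~32 GB/s bidirectional) and Volta NVLink is counted as 6 links x 25 GB/s x 2 directions:

```python
# Rough sanity check of the "10x faster than PCIe Gen3" claim.
# Assumptions (not from the slide): PCIe Gen3 x16 at ~16 GB/s per direction;
# Volta NVLink at 6 links x 25 GB/s x 2 directions = 300 GB/s per GPU.
nvlink_gbs = 6 * 25 * 2          # 300 GB/s bidirectional per V100
pcie_gen3_x16_gbs = 16 * 2       # ~32 GB/s bidirectional
speedup = nvlink_gbs / pcie_gen3_x16_gbs
print(round(speedup, 1))         # ~9.4, i.e. roughly the 10x on the slide
```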
NVLINK vs PCIe for DL Training
15
DL DATA PARALLELISM: PCIE BASED
[Diagram: 8 GPUs (0-7) attached through PCIe switches to two CPUs joined by a QPI link.]
Data loading and gradient averaging share communication resources: congestion.
16
DL DATA PARALLELISM: NVLINK
[Diagram: the same PCIe topology carries data loading, while a separate NVLink fabric between the 8 GPUs carries gradient averaging.]
No sharing of communication resources: no congestion.
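The gradient-averaging traffic in these diagrams is typically a ring all-reduce, the collective that NCCL runs over NVLink in data-parallel training. A minimal pure-Python sketch of the algorithm (illustrative only, not NCCL itself):

```python
# Illustrative pure-Python ring all-reduce: the two phases (reduce-scatter,
# then all-gather) that average gradients across data-parallel workers.

def ring_allreduce_mean(grads):
    n = len(grads)                   # number of workers ("GPUs")
    size = len(grads[0])
    assert size % n == 0, "sketch assumes gradient length divisible by n"
    csz = size // n                  # elements per chunk
    bufs = [list(g) for g in grads]  # each worker's local buffer

    def take(r, c):                  # copy chunk c out of worker r's buffer
        return bufs[r][c * csz:(c + 1) * csz]

    # Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod n
    # to its right neighbour, which adds it element-wise.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, take(r, (r - s) % n)) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            for i, v in enumerate(data):
                bufs[dst][c * csz + i] += v

    # Phase 2: all-gather. Worker r now owns the fully reduced chunk
    # (r + 1) mod n and completed chunks circulate around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, take(r, (r + 1 - s) % n)) for r in range(n)]
        for r, c, data in sends:
            dst = (r + 1) % n
            bufs[dst][c * csz:(c + 1) * csz] = data

    # Divide by worker count: every worker ends with the same mean gradient.
    return [[v / n for v in b] for b in bufs]

# 4 workers, each holding a different 4-element "gradient"
out = ring_allreduce_mean([[float(r)] * 4 for r in range(4)])
print(out[0])  # every worker holds the mean gradient [1.5, 1.5, 1.5, 1.5]
```

Each step moves only one chunk per worker, so bandwidth use is balanced around the ring; with a dedicated NVLink fabric this traffic no longer competes with data loading over PCIe.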
17
30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
• Encoder and decoder embedding size of 512
• Batch size of 256 per GPU
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
18
2.54X BETTER PERFORMANCE WITH NVLINK
• Performance benefits increase with increasing encoder/decoder embedding size
• Sockeye neural machine translation single-precision training
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
19
PyTorch Deep Learning Training
PyTorch is a deep learning framework that puts Python first.
VERSION: 1.1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU, multi-node
More information: www.pytorch.org | PyTorch on NGC

PyTorch Deep Learning Framework: Training on V100 GPU Server vs P100 GPU Server
[Chart: speedup vs. a server with 8x P100 SXM2 (0x-6x scale) across ResNet-50 v1.5 (image), SSD (object detection), NCF (recommender), Tacotron2 (speech), and GNMT (translation). Server with 8x V100 PCIe 16GB: 2.0x avg. speedup. DGX-1 server with 8x V100 SXM2 32GB: 3.0x avg. speedup, up to 3x faster.* Bar labels: 6,095 imgs/sec; 31.7M imgs/sec; 3.2M samples/sec; 13,579 tokens/sec; 334,435 tokens/sec; 6,116 imgs/sec; 2,010 imgs/sec; 96.1M samples/sec; 17,185 tokens/sec; 596,891 tokens/sec.]
*NCF performance benefit comes from the larger 32GB GPU memory
GPU server: dual Xeon E5-2698 v4 @ 2.20GHz with GPU servers as shown. Framework: PyTorch v1.1.0; mixed precision; CUDA 10.1.105; NCCL 2.4.6; cuDNN 7.5.0.56; cuBLAS 10.1.105; NVIDIA driver 410.104. Batch sizes: V100 PCIe: 256 for ResNet-50 v1.5/GNMT v2, 64 for SSD, 1,048,576 for NCF, 80 for Tacotron2 | V100 SXM2: 512 for ResNet-50 v1.5/GNMT v2, 64 for SSD, 1,048,576 for NCF, 80 for Tacotron2 | P100 SXM2: 128 for ResNet-50 v1.5/GNMT v2, 32 for SSD, 524,288 for NCF, 48 for Tacotron2.
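These runs use mixed-precision training. A stdlib-only sketch of why FP16 storage is paired with loss scaling (the 1024 scale factor below is an arbitrary illustration, not the value the benchmarks used):

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision, i.e. what storing
    an activation or gradient in FP16 does to its value."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                      # below FP16's smallest subnormal (~6e-8)
print(to_fp16(tiny_grad))             # 0.0: the gradient vanishes in FP16

scale = 1024.0                        # loss scaling: multiply before FP16 storage
scaled = to_fp16(tiny_grad * scale)   # survives as a representable FP16 value
print(scaled / scale)                 # unscale in FP32: ~1e-8 recovered
```

Scaling the loss shifts small gradients into FP16's representable range before they are stored, and the master FP32 copy divides the scale back out before the weight update.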
20
TensorFlow Deep Learning Training
An open-source software library for numerical computation using data flow graphs.
VERSION: 1.13.1
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU and multi-node
More information: www.tensorflow.org | TensorFlow on NGC

TensorFlow Deep Learning Framework: Training on V100 GPU Server vs P100 GPU Server
[Chart: speedup vs. a server with 8x P100 SXM2 (0x-4x scale). Server with 8x V100 PCIe 16GB, 1.7x avg. speedup: GNMT (translation) 81,090 tokens/sec; NCF (recommender) 20,026,780 samples/sec; ResNet-50 v1.5 (image) 5,795 imgs/sec; SSD (object detection) 588 imgs/sec; U-Net Industrial (segmentation) 459 imgs/sec. DGX-1 server with 8x V100 SXM2 32GB, 2.0x avg. speedup, up to 3x faster:* GNMT 136,116 tokens/sec; NCF 67,075,162 samples/sec; ResNet-50 v1.5 6,394 imgs/sec; SSD 661 imgs/sec; U-Net Industrial 524 imgs/sec.]
*NCF performance benefit comes from the larger 32GB GPU memory
GPU server: dual Xeon E5-2698 v4 @ 2.20GHz with GPU servers as shown. Framework: TensorFlow v1.13.1; mixed precision; CUDA 10.1.105; NCCL 2.4.6; cuDNN 7.5.0.56; cuBLAS 10.1.105; NVIDIA driver 410.104. Batch sizes: V100 PCIe: 192 for GNMT v2, 1,048,576 for NCF, 256 for ResNet-50 v1.5, 32 for SSD, 2 for U-Net Industrial | V100 SXM2: 192 for GNMT v2, 1,048,576 for NCF, 512 for ResNet-50 v1.5, 32 for SSD, 2 for U-Net Industrial | P100 SXM2: 128 for GNMT v2/ResNet-50 v1.5, 1,048,576 for NCF, 32 for SSD, 2 for U-Net Industrial.
21
DGX-2
22
NVIDIA DGX-2
The World's Most Powerful Deep Learning System for the Most Complex Deep Learning Challenges
• First 2 PFLOPS system
• 16 V100 32GB GPUs fully interconnected
• NVSwitch: 2.4 TB/s bisection bandwidth
• 24x GPU-GPU bandwidth
• 0.5 TB of unified GPU memory
• 10x deep learning performance
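The 2.4 TB/s bisection figure follows from the per-GPU NVLink bandwidth. A quick check, assuming 300 GB/s bidirectional per V100 as quoted earlier in the deck:

```python
# Sanity check of the NVSwitch bisection-bandwidth figure: cut the 16-GPU
# system in half and count the NVLink bandwidth crossing the cut.
gpus = 16
per_gpu_gb_s = 300               # bidirectional NVLink bandwidth per V100
bisection_tb_s = (gpus // 2) * per_gpu_gb_s / 1000
print(bisection_tb_s)            # 2.4 (TB/s), matching the slide
```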
23
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
1
2
3
5
4
6 Two Intel Xeon Platinum CPUs
7 1.5 TB System Memory
23
30 TB NVME SSDs Internal Storage
NVIDIA Tesla V100 32GB
Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card
Twelve NVSwitches2.4 TB/sec bi-section
bandwidth
Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth
PCIe Switch Complex
8
9
9Dual 10/25 Gb/secEthernet
35
FULL NON-BLOCKING BANDWIDTH
[Diagram: GPU0-GPU7 on one baseboard and GPU8-GPU15 on the other, each GPU connected to all six NVSwitches on its board; the twelve NVSwitches are bridged across the plane card.]
25
HIGHER PERFORMANCE WITH NVSWITCH
DGX-2 vs Multi-System Interconnect
HPC:
• Physics (MILC benchmark), 4D grid: 2x faster
• Weather (ECMWF benchmark), all-to-all: 2.4x faster
AI training:
• Recommender (sparse embedding), reduce & broadcast: 2x faster
• Language model (Transformer with MoE), all-to-all: 2.7x faster
Footnote: the two 8x V100 servers have dual-socket Xeon E5-2698 v4 processors and 8 V100 GPUs each, connected via 4x 100 Gb IB ports; the DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs.
26
RAPIDS BENCHMARKS
[Chart: benchmark times in seconds; CPU nodes = r4.2xlarge EMR.]
DGX-1 OR DGX-2?
DGX-1 multi-system deployments deliver affordable AI scale:
• Excellent performance for 1-8 GPU jobs
• Support for larger pools of concurrent users
• Proven approaches and solutions for multi-system scale
DGX-2 tackles your most complex models:
• High-definition video training, speech, translation
• Large models, more complex networks, model parallelism
• Best multi-GPU performance with single-system simplicity
ANNOUNCING NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Test bed for highest-performance scale-up systems:
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture:
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks:
• Freely available through NGC
System: 96 DGX-2H | 10 Mellanox EDR IB per node | 1,536 V100 Tensor Core GPUs | 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
29
DGX AI
30
AI
AI WORKSTATION AI DATA CENTER
• Universal SW for Deep Learning
• Predictable execution across platforms
• Pervasive reach
DGX SOFTWARE STACK
The Essential Instrument for AI
Research
DGX-1
The Personal AI Supercomputer
DGX Station
The World’s Most Powerful AI System for the Most Complex AI Challenges
DGX-2
31
DGX
DGX Station DGX-1 DGX-2
NVIDIAGPU Cloud
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
DGX
➢ NVIDIA DGX
➢ DGX vs. DIY GPU systems
➢ NVIDIA Enterprise Support for DGX
34
DGX: AI Value to IT and Users
[Diagram: a DIY stack covers only GPUs, NVLink, and the server; DGX systems add the NGC DL SW stack and NVIDIA AI experts on top of GPUs, NVLink, and the server.]
From design to support: end-to-end AI expertise
35
DGX: Accelerate Deep Learning Value
[Diagram: from idea to results. At the desk: procure a DGX Station; install, build, test; fast bring-up and productive experimentation; iterate and refine the model. In the data center: train at scale, refine, re-train. At the edge: deploy for inference and gather insights.]
36
37
AI
38
TRAIN CLOSEST TO WHERE YOUR DATA LIVES
Keep compute where the data lives: ON-PREM
39
AI: Short-term thinking leads to longer-term problems
40
AI EVALUATION CRITERIA
Looking beyond the "spec sheet":
• AI/DL expertise & innovation
• AI/DL software stack
• Operating system image
• Hardware architecture
INSIGHTS GAINED FROM DEEP LEARNING DATA CENTERS
• Rack design: DL drives close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high-bandwidth, ultra-low latency
• Storage: datasets range from 10k's to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt = less data center floorspace required
• Software: scale requires "cluster-aware" software
Example:
• Autonomous vehicle = 1 TB/hr
• Training sets up to 500 PB
• ResNet-50: 113 days to train; objective: 7 days
• 6 simultaneous developers = 97-node cluster
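The sizing arithmetic behind the example, assuming training time scales roughly linearly with node count (an idealization):

```python
import math

# Example from the slide: ResNet-50 takes 113 days on one node; the team
# wants 7-day turnaround for 6 simultaneous developers. Assuming near-linear
# scaling (an idealization), size the cluster:
days_on_one_node = 113
target_days = 7
developers = 6

nodes_per_job = days_on_one_node / target_days      # ~16.1 nodes per job
cluster = math.ceil(nodes_per_job * developers)
print(cluster)                                      # 97, as on the slide
```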
42
DGX POD
43
NVIDIA DGX POD™: POD design with cooling
Nine DGX-1 servers
• Eight Tesla V100 GPUs each
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4x 100 GbE)
Twelve storage nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB total HDD)
• 50 GbE networking
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4x 100 GbE (up to 8)
Rack
• 35 kW power
• 42U x 1200 mm x 700 mm (minimum)
• Rear-door cooler
DGX-1 POD
• NVIDIA DGX POD
• Supports scalability to hundreds of nodes
• Based on proven SATURNV architecture
44
AI SOFTWARE DEVELOPMENT WORKFLOW
For Large-Scale Multi-User AI Software Development Teams
● Data factory collects raw data and includes tools used to pre-process, index, label, and manage data
● Model training with labeled data uses a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on DGX servers with Volta Tensor Core GPUs
● Model testing and validation adjusts model parameters as needed and repeats training until the desired accuracy is reached
● Model optimization for production deployment (inference) is completed using the NVIDIA TensorRT optimizing inference accelerator
NVIDIA DGX POD: Build an AI Platform Fast
46
NVIDIA AI SOFTWARE
For Large-Scale Multi-User AI Software Development Teams
47
DGX POD AND NGC CONTAINERS
HPC: bigdft | candle | chroma | gamess | gromacs | lammps | lattice-microbes | milc | namd | pgi | picongpu | relion | vmd
Deep Learning: caffe | caffe2 | cntk | cuda | digits | inferenceserver | mxnet | pytorch | tensorflow | tensorrt | theano | torch
HPC Visualization: index | paraview-holodeck | paraview-index | paraview-optix
Partners: chainer | h2oai-driverless | kinetica | mapd | paddlepaddle
NVIDIA/K8s: Kubernetes on NVIDIA GPUs
48
DGX POD MANAGEMENT SOFTWARE
For Large-Scale Multi-User AI Software Development Teams
https://github.com/NVIDIA/deepops
49
DGX POD RA: The AI Platform and Its Value
50
THE VALUE OF THE DGX POD RA AI PLATFORM
• Reference architectures from NVIDIA and leading storage partners
• Simplified, validated, converged infrastructure offerings
• Available through select NPN partners as a turnkey solution
[Diagram: DGX RA solution with partner storage.]
51
"DIY" TCO
Designing, Building and Supporting an AI Infrastructure, from Scratch
[Chart: CAPEX and OPEX from Day 1 to Month 3. OPEX goes to study & exploration, platform design, HW & SW integration, troubleshooting, software engineering, software optimization, design and build for scale, and software re-optimization before productive experimentation, training at scale, and insights are reached.]
Time and budget spent on things other than data science.
52
2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture
[Chart: installing and deploying a DGX RA solution replaces the "DIY" troubleshooting, software engineering, optimization, and re-optimization phases; study & exploration and platform design lead directly to productive experimentation, training at scale, and insights. CAPEX covers the DGX TCO; the deployment cycle is shortened.]
Wasted time/effort: eliminated.
53
2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture
[Chart: with the DGX RA solution installed and deployed, study & exploration reaches productive experimentation, training at scale, and insights by Week 1, versus Month 3 on the "DIY" TCO path; CAPEX dominates the DGX TCO.]
54
TECHNICAL SUPPORT WITHOUT A UNIFIED INTERFACE
[Diagram: the system goes from installed/running to a problem. User: "My PyTorch CNN model is running 30% slower than yesterday!" IT admin: "OK, let me look into it."]
55
[Diagram: to resolve the problem, the IT admin faces multiple paths: framework? libraries? O/S? GPU? drivers? server? network? storage? Each leads to open-source forums or to separate server, storage & network solution providers.]
56
TECHNICAL SUPPORT WITH THE DGX POD INTEGRATED ARCHITECTURE
[Diagram: the same problem, "My PyTorch CNN model is running 30% slower than yesterday!", goes from the IT admin to a single NPN partner backed by NVIDIA AI expertise; the answer, "Update to PyTorch container XX.XX", gets the DGX RA solution running again.]
57
DGX-1 STORAGE: LOCAL STORAGE
OS drive: 1x 480 GB SSD, no redundancy
Cache: 4x 1.92 TB SSD in RAID 0 (cacheFS), no redundancy
➢ Each DGX-1 system has 5 SSDs
➢ Deep learning training typically re-reads the training data many times over repeated epochs, so fast local storage effectively raises IO efficiency as training iterates, and with it GPU utilization.
59
DGX-1 EXTERNAL STORAGE IO REQUIREMENTS (REFERENCE)
➢ The storage network can be 10 GbE or InfiniBand
➢ The table below is a reference recommendation for storage systems, based on the common IO access patterns of deep learning frameworks; treat it as guidance only

Workload | Sufficient read cache? | DGX cache capacity | Recommended network | Network file system
Data analytics | N/A | N/A | 10 GbE | Object storage, NFS, or other storage with good concurrent-read and small-file performance
HPC | N/A | N/A | 10/40/100 GbE, IB | NFS or an HPC parallel storage system supporting many clients with strong single-node performance
DL, 256x256 images | Yes | 63 million images | 10 GbE | NFS or storage with efficient small-file IO
DL, 1080p images | Yes | 13 million images | 10/40 GbE, IB | High-performance NFS or HPC storage, high concurrency
DL, 4K images | Yes | 5 million images | 40 GbE, IB | High-performance NFS or HPC storage, high concurrency, 3 GB/s+ per node
DL, uncompressed images | Yes | 1 million images | 40/100 GbE, IB | High-performance NFS or HPC storage, high concurrency, 3 GB/s+ per node
DL, dataset not cached | No | N/A | 10/40/100 GbE, IB | Performance as above; total must satisfy all applications running concurrently
https://docs.nvidia.com/dgx/bp-dgx/index.html#storage_scaling
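The cache-capacity column can be reproduced roughly from the 4x 1.92 TB cacheFS and an assumed average compressed file size per image class. The sizes below are illustrative guesses chosen to show the arithmetic, not measured values:

```python
# Rough reproduction of the "DGX cache capacity" column: how many images of
# each class fit in the DGX-1 cacheFS (4 x 1.92 TB SSD, RAID 0).
# The average compressed file sizes are illustrative assumptions.
cache_bytes = 4 * 1.92e12

avg_bytes = {
    "256x256 images": 120e3,   # ~120 KB JPEG (assumed)
    "1080p images": 590e3,     # ~590 KB JPEG (assumed)
    "4K images": 1.5e6,        # ~1.5 MB JPEG (assumed)
}
for kind, size in avg_bytes.items():
    print(kind, round(cache_bytes / size / 1e6, 1), "million images")
```

With these assumed sizes the results land near the table's 63M / 13M / 5M figures.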
60
NVIDIA DGX POD: AI Infrastructure Built on NVIDIA Best Practices
Storage partner DGX-1 RA solutions: a growing portfolio of offers
Common benefits:
• Eliminate design guesswork
• Faster, simpler deployment
• Predictable performance at scale
• Simplified, single point of support
Backed by prioritized NPN partners
Reference architectures
NVIDIA DGX POD: Build an AI Platform Fast
62
ONTAP AI: NETAPP VERIFIED ARCHITECTURE
[Diagram: 4x NVIDIA DGX-1, each with 4x 100GbE to a pair of 100GbE switches (2-4 ISLs between them); 1x NetApp AFF A800 attached with 2x 100GbE per controller.]
Key metrics:
• 4 DGX-1s, 1 AFF A800 HA pair
• Peak throughput requested: 5 GB/s
• Sustained throughput requested: 4 GB/s
• Average storage latency: ~600 µs
• All 32 GPUs kept consistently >95% busy
• Storage CPU utilization achieved: ~18%
• The A800 can provide 25 GB/s read throughput
Conclusion:
• Massive headroom for one AFF A800 to support a large number of DGX-1 servers
© 2018 NetApp, Inc. All rights reserved. NetApp Confidential, Limited Use Only
63
64
ACCELERATING THE AI DATA PIPELINE
Training with NVIDIA DGX-1 and NetApp A800
Test environment: 32 GPUs (4 DGX-1 servers), Tensor Cores; throughput measured as images per second; compares metrics for synthetic and ImageNet data
Conclusion: near-linear scaling achieved; performance on the ImageNet dataset is close to synthetic data
START SMALL, SCALE BIG
1:1, 1:4, and 1:5 configurations in a 42U rack, up to a full scale-out configuration*
* Based on 35 kW racks
66
NETAPP ONTAP AI SOLUTION RACK-SCALE ARCHITECTURE
67
NVIDIA DGX-2 POD WITH NETAPP AFF A800
AIRI: AI-READY INFRASTRUCTURE
Extending the power of DGX-1 at scale in every enterprise
HARDWARE
• NVIDIA DGX-1 | 4x DGX-1 systems | 4 PFLOPS
• PURE FLASHBLADE™ | 15x 17TB blades | 1.5M IOPS
• ARISTA | 2x 100Gb Ethernet switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA-optimized frameworks
• AIRI SCALING TOOLKIT | multi-node training made simple
[Diagram: Pure Storage network topology.]
DDN A3I WITH DGX-1
Making AI-Powered Innovation Easier
HARDWARE
• NVIDIA DGX-1 | 4x DGX-1 systems | 4 PFLOPS
• DDN AI200, AI7990 | 20 GB/s | from 30 TB | 350K IOPS
• NETWORK: 2x EDR IB or 100GbE switches with RDMA
SOFTWARE
• NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA-optimized frameworks
• DDN: high-performance, low-latency parallel file system
• DDN: in-container client for easy deployment, efficiency, performance, and reliability
[Diagram: DDN A3I reference architecture in a 9:1 configuration.]
IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX
The Engine to Power Your AI Data Pipeline
HARDWARE
• NVIDIA DGX-1 | up to 9x DGX-1 systems
• IBM Spectrum Scale NVMe appliance | 40 GB/s per node, 120 GB/s in 6RU | 300 TB per node
• NETWORK: Mellanox SB7700 switch | 2x EDR IB with RDMA
SOFTWARE
• NVIDIA DGX SOFTWARE STACK | NVIDIA-optimized frameworks
• IBM: high-performance, low-latency parallel file system
• IBM: extensible and composable
DELL EMC ISILON WITH NVIDIA DGX
Simplified Enterprise AI Infrastructure
HARDWARE
• NVIDIA DGX-1 | 9x DGX-1 systems = 9 PFLOPS
• DELL EMC ISILON | 4x or 8x Isilon F800 nodes (in 2 chassis), up to 250K IOPS per chassis
• ARISTA | 2x 7060CX2-32S | 32x 40/100GbE
SOFTWARE
• NVIDIA DGX SOFTWARE STACK | NVIDIA-optimized AI/DL frameworks
• ISILON OneFS
74
(TIER 1) DGX-1 STORAGE PARTNERS (PUBLISHED RAs AS OF 12/12/18)

Pure Storage AIRI® / AIRI Mini®
• Storage: Pure FlashBlade™
• Fabric: 100 Gb Ethernet, RoCE
• Protocols: NFSv3, S3, SMB2.1, HTTP(S)
• Configurations: reference architecture with 4 DGX-1s; deployed with 500+ DGX-1s in a customer environment
• Documentation: https://www.purestorage.com/content/dam/purestorage/pdf/datasheets/Pure_Storage_FlashBlade_Datasheet_05.pdf

NetApp® ONTAP® AI
• Storage: NetApp A800 all-flash storage system
• Fabric: 100 Gb Ethernet, RoCE
• Protocols: ONTAP AI NFS filesystem
• Configurations: 4 DGX-1s, with a 9 DGX-1 POD in progress
• Documentation: "Scalable AI Infrastructure for Real-World Deep Learning Use Cases: Deployment Guide"

DDN A3I®
• Storage: AI200, AI400, AI7990
• Fabric: InfiniBand (up to EDR), Ethernet (up to 100 Gb/s)
• Filesystem: DDN has developed an intelligent parallel file system client specifically for DGX-1 server containers that engages multiple high-speed data paths to the storage and delivers the full performance of NVMe flash directly to the application. Under the covers, DDN is running Lustre, but has done a lot of engineering to simplify setup and configuration for AI environments, specific to DGX-1.
• Configurations: 9 DGX-1s; 1:1, 4:1, and 9:1 configurations

Dell EMC Isilon
• Storage: F800 all-flash scale-out NAS
• Fabric: 40/100 Gb Ethernet, RoCE
• Protocols: OneFS, NFS
• Configurations: 9 DGX-1 POD

IBM Spectrum Scale for AI with NVIDIA DGX
• Storage: IBM Spectrum Scale based shared storage
• Fabric: InfiniBand (up to EDR), Ethernet (up to 100 Gb/s)
• Protocols: IBM Spectrum Scale (POSIX); NFS v3/v4.0 and SMB (through cluster export services)
• Configurations: 1-9 DGX-1 servers; 1-3 Spectrum Scale all-flash appliances