Running TensorFlow at scale on GPUs
TRANSCRIPT
Maggie Zhang (张雪萌)
Accelerate Deep Learning Training at Scale on GPUs
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
3
ResNet50 v1.5 training
● 2015: 36000 Mins (25 Days), 1xK80, CUDA
● 2016: 1200 Mins (20 Hours), DGX-1P, NVLink
● 2017: 480 Mins (8 Hours), DGX-1V, Tensor Core
● 2018: 70 Minutes on MLPerf, DGX-2H, NVSwitch
● 2018: 6.3 Minutes on MLPerf, At Scale, DGX Cluster
● 2019: 52.7 Minutes on MLPerf, DGX-2H, NVSwitch
● 2019: 1.33 Minutes on MLPerf, At Scale, DGX SuperPOD
DL Training: from single GPU to multi-node
4
The whole stack must be considered
● Compute
● Network
● Storage
● Frameworks & Libraries
● Numerical methods
● Training recipes
5
MLPerf: NVIDIA advancing AI training
Time to Train From 8 Hours to 80 Seconds
2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
6
Largest TensorFlow model at scale
Oak Ridge National Lab scales TensorFlow climate analytics model up to 27,360 V100 GPUs
Source: https://arxiv.org/pdf/1810.01993.pdf
2018 Gordon Bell Prize Winner
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
8
● Unlabeled data:
○ Language model: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GAN: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
○ ImageNet (2012) - 1.3M images, 1000 categories
○ Open Images (2019) - 9M images, 6000 categories
○ Semi-autonomous vehicles: 0.5-1.1TB of data for every 8h driving
Datasets getting larger
9
DL models increasing in complexity
Next-level use-cases require gigantic models
● Image Recognition (26M parameters): Autonomous Vehicles, Social Tagging, Visual Search
● NLP (340M parameters): Q&A, Sentiment, Translation
● NLP – Generative Tasks (1.5Bn parameters): Chatbots, E-mail auto-completion, Document Summarization
● Other workloads shown: Speech Recognition, Translation, Object Detection
Project Megatron (https://github.com/NVIDIA/Megatron-LM)
● 8.3B parameters
● 8-way Model Parallel
● 64-way Data Parallel
● 24x larger than BERT
AGENDA
● Introduction
● Why do we need to scale training
● How to achieve scaling
11
Scaling == whack-a-mole ?
Solving one bottleneck and another one pops up
12
Multi-node infrastructure requirements
● System Design
● Data Center Management
● SW Stack
Multi-Node Success
13
● Hardware GPU cluster design:
○ Compute: significant CPU to GPU ratio, interconnect with GPU
○ Storage: high speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLink, GPUDirect RDMA
● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL Framework scale-out optimization
○ Model scale-out implementation & optimization
Challenges of multi-node DL training
14
A basic recipe for deep learning scaling
Step 1: Optimize your single GPU model
Step 2: Scale to multiple GPUs on one node
Step 3: Scale to multiple nodes
15
Case study
• BERT model scripts:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
  Configurations for convergence, from 8 to 1500 GPUs, multi-node ready
• Clone and train your own BERT model on multiple nodes, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task
Bidirectional Encoder Representations from Transformers
Super Human Question & Answering
NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance
16
• Pre-training on unlabeled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.
Why multi-node BERT training
17
BERT multi-node pre-training performance
Metric: Time to train

DGX-1 (16 GB)
Nodes   GPUs   Time to train (Hrs)
1       8      153.6 (6.3 days)
4       32     39.3
16      128    10.4

DGX-2H (32 GB)
Nodes   GPUs   Time to train (Hrs)
1       16     58.4 (2.4 days)
4       64     15.4
16      256    3.9
64      1024   1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
* Above time to train is measured for mixed precision, training loss 1.3 in PyTorch, with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1-, 4-, and 16-node runs
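To make the time-to-train figures above easier to compare, the following minimal sketch computes speedup and scaling efficiency relative to the single-node run. The function name `scaling_report` is illustrative, not from the slides, and because gradient accumulation was applied to some DGX-2H runs, the resulting efficiencies should be read as rough indicators rather than a controlled comparison.

```python
# Time to train in hours, keyed by node count, copied from the figures above.
DGX1 = {1: 153.6, 4: 39.3, 16: 10.4}
DGX2H = {1: 58.4, 4: 15.4, 16: 3.9, 64: 1.2}


def scaling_report(hours_by_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the single-node run."""
    base = hours_by_nodes[1]
    report = {}
    for nodes, hours in sorted(hours_by_nodes.items()):
        speedup = base / hours
        report[nodes] = (speedup, speedup / nodes)
    return report


for name, table in (("DGX-1", DGX1), ("DGX-2H", DGX2H)):
    for nodes, (speedup, eff) in scaling_report(table).items():
        print(f"{name} {nodes:>2} nodes: {speedup:5.1f}x speedup, {eff:6.1%} efficiency")
```

Efficiency here is speedup divided by node count, so a value of 1.0 would mean perfectly linear scaling.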
18
• Create efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure latest GPU libraries
• Develop model in container to facilitate scaling out
Step 1: Optimize model
19
Step 1: Optimize model
• Use tf.data to create performant input pipelines
• Test I/O bottlenecks with a trivial model
• NVIDIA DALI accelerates image-based input pipelines
Data pipeline
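One way to read "test I/O bottlenecks with a trivial model" is to time the input pipeline while the per-batch work is close to free. The sketch below is framework-agnostic and uses only the standard library; `synthetic_batches`, `trivial_model`, and the 2 ms read delay are hypothetical stand-ins, not part of tf.data or DALI.

```python
import time


def batches_per_second(batch_iter, step_fn, num_batches):
    """Pull `num_batches` batches, run `step_fn` on each, and return the rate."""
    start = time.perf_counter()
    done = 0
    for batch in batch_iter:
        step_fn(batch)
        done += 1
        if done == num_batches:
            break
    elapsed = time.perf_counter() - start
    return done / elapsed if elapsed > 0 else float("inf")


def synthetic_batches(batch_size, read_delay_s):
    """Hypothetical input pipeline: each batch costs `read_delay_s` of I/O time."""
    while True:
        time.sleep(read_delay_s)
        yield list(range(batch_size))


def trivial_model(batch):
    """Stand-in for a model that does almost no compute."""
    return sum(batch)


io_rate = batches_per_second(synthetic_batches(32, 0.002), trivial_model, 50)
print(f"input pipeline alone sustains about {io_rate:.0f} batches/s")
```

If the rate measured with the real model is close to the rate measured with the trivial one, the input pipeline rather than the GPU is likely the limiting factor.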
20
# TensorFlow 1.x fragment (uses tf.contrib); assumes `import tensorflow as tf`.
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))
BERT
TFRecord - fast binary format
Parallel read, map, & batch
Fused map & batch op
Data pipeline
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py
21
Step 1: Optimize model
• 1-line optimizer wrapper:
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speed up in training on Tensor Cores with
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint
• Optimal on Volta and Turing GPUs
Automatic Mixed Precision (AMP)
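Mixed precision training usually relies on loss scaling so that small gradients survive the conversion to FP16. The sketch below illustrates only that idea with the standard library's half-precision packing; the fixed `LOSS_SCALE` and the example gradient value are assumptions for illustration, and AMP implementations typically adjust the scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 2.0 ** 15  # assumed constant scale, chosen for illustration

true_grad = 1e-8                           # tiny gradient, representable in FP32
naive = to_fp16(true_grad)                 # underflows to zero in FP16
scaled = to_fp16(true_grad * LOSS_SCALE)   # scaled value fits the FP16 range
recovered = scaled / LOSS_SCALE            # unscale in higher precision

print("without scaling:", naive)
print("with scaling:   ", recovered)
```

Scaling by a power of two keeps the unscale step exact, so the only error left is the FP16 rounding of the scaled value.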
22
Step 1: Optimize model
Automatic Mixed Precision (AMP)
• Robust speedup across different TensorFlow workloads
• https://arxiv.org/abs/1710.03740
23
Step 1: Optimize model
XLA (Accelerated Linear Algebra)
• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes the graph, mostly by fusing compatible kernels
• Set XLA optimization level:
    config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531
System config: Xeon E5-2698 v4 CPU with 256GB system RAM, single V100 Tensor Core GPU 32GB. Tests run using NVIDIA 18.11 TensorFlow container.
24
Step 1: Optimize model
• Latest compatible features and tuning from CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)
Latest GPU optimizations
25
Step 1: Optimize model
• NGC containers: fully featured DL containers
• DL frameworks compiled with latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out
Latest GPU optimizations
26
27
• Understand Data Parallel training concepts
• Ensure optimal inter-GPU communication
• Apply high level API for multi-GPU training
Step 2: Scale to multiple GPUs
28
Step 2: Scale to multiple GPUs
• Single GPU
Under the hood
29
Step 2: Scale to multiple GPUs
• Multiple GPU
• Data parallel training
Under the hood
• Allreduce algorithm
• NCCL: NVIDIA Collective Communication Library
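In data parallel training each GPU computes gradients on its own shard of the batch, and an allreduce makes every GPU end up with the same summed (or averaged) gradients. The sketch below simulates a ring allreduce, one common allreduce algorithm, in plain Python with lists standing in for GPU buffers. It is an illustration of the idea only, not NCCL's implementation, and the slides do not say which algorithm NCCL selects.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring allreduce (sum) over equal-length per-worker gradient lists."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every buffer into n contiguous chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(g) for g in worker_grads]

    def chunk(rank, idx):
        lo, hi = bounds[idx]
        return data[rank][lo:hi]  # slice copy, acts as the message payload

    # Phase 1, reduce-scatter: after n-1 steps, rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, chunk(r, (r - step) % n)) for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            dst = (r + 1) % n
            lo, _ = bounds[idx]
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, chunk(r, (r + 1 - step) % n)) for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            dst = (r + 1) % n
            lo, hi = bounds[idx]
            data[dst][lo:hi] = payload
    return data


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Dividing each element by the number of workers afterwards gives the averaged gradient used for the synchronous update.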
30
• Inter-GPU communication:
Step 2: Scale to multiple GPUs
Under the hood
Effective bandwidth in GB/s
31
• Full non-blocking bandwidth
Step 2: Scale to multiple GPUs
Under the hood
32
Step 2: Scale to multiple GPUs
• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
• Strong NCCL integration
• Sample commands:
• Single-node (4 GPUs):
horovodrun -np 4 -H localhost:4 python train.py
• Multi-node (4 nodes with 4 GPUs each):
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
Approach 1: Horovod
33
Step 2: Scale to multiple GPUs
Approach 1: Horovod
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used (one process per GPU)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
# Scale the learning rate by the number of workers
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
34
• Recently released native API that also supports Allreduce with NCCL
• Multi-GPU: tf.distribute.MirroredStrategy
• Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy
Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy
Source: https://www.tensorflow.org/guide/distributed_training
35
• Adopt optimizer designed for large batch size
• Ensure effective inter-node communication
• Move data close to compute
• Consider full application & system software stack
Step 3: Scale to multiple nodes
36
• Optimizer inspired by LARS
  • Layerwise Adaptive learning rate (You et al.)
• Allows training at huge global batch size
  • Originally, BERT+Adam (Devlin et al.) – global batch 256
  • BERT+LAMB (You et al.) – global batch 64k
• Massive data parallelism
• Lower interconnect pressure with gradient accumulation
Step 3: Scale to multiple nodes
LAMB optimizer
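Gradient accumulation lowers interconnect pressure because several micro-batches are processed locally before a single gradient exchange. The sketch below uses a hypothetical one-parameter linear model in plain Python to show that averaging gradients over equal-sized micro-batches reproduces the large-batch gradient; the data values and function names are illustrative assumptions.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Average gradients over equal-sized micro-batches before one update."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-sized micro-batches"
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        total += mse_grad(w, xs[lo:hi], ys[lo:hi])  # local work, no communication
    return total / steps  # a single allreduce and update would follow here


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(full, accum)
```

With four micro-batches per update, the number of gradient exchanges per processed sample drops by a factor of four in this sketch.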
37
BERT+LAMB
Robustly scale to large batch size
https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py
class LAMBOptimizer(tf.train.Optimizer):
    """A LAMB optimizer that includes "correct" L2 weight decay."""

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="LAMBOptimizer"):
        """Constructs a LAMBOptimizer."""
        super(LAMBOptimizer, self).__init__(False, name)
        ...
Step 3: Scale to multiple nodes
LAMB optimizer
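The core of LAMB is a layer-wise trust ratio that rescales an Adam-style update to the norm of the layer's weights. The following is a minimal single-layer sketch in plain Python, assuming no bias correction and decoupled weight decay; it is an illustration of the published idea, not the code in optimization.py, and `lamb_step` is a hypothetical helper name.

```python
import math


def lamb_step(w, g, m, v, lr, beta_1=0.9, beta_2=0.999,
              epsilon=1e-6, weight_decay_rate=0.0):
    """One LAMB update for a single layer; returns (new_w, new_m, new_v)."""
    m = [beta_1 * mi + (1.0 - beta_1) * gi for mi, gi in zip(m, g)]
    v = [beta_2 * vi + (1.0 - beta_2) * gi * gi for vi, gi in zip(v, g)]
    # Adam-style direction plus decoupled weight decay.
    update = [mi / (math.sqrt(vi) + epsilon) + weight_decay_rate * wi
              for mi, vi, wi in zip(m, v, w)]
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    # Layer-wise trust ratio: rescale the step to the size of the weights.
    ratio = w_norm / u_norm if w_norm > 0.0 and u_norm > 0.0 else 1.0
    new_w = [wi - lr * ratio * ui for wi, ui in zip(w, update)]
    return new_w, m, v


w = [3.0, 4.0]
new_w, m, v = lamb_step(w, g=[1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0],
                        lr=0.001, weight_decay_rate=0.01)
step = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, new_w)))
print(step)  # close to lr * ||w||, independent of the gradient scale
```

Because the step length depends on the weight norm rather than the raw gradient magnitude, each layer moves by a comparable relative amount, which is one reason the method is reported to tolerate very large batch sizes.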
38
• Inter-GPU communication (bigger picture):
Step 3: Scale to multiple nodes
Under the hood
Effective bandwidth in GB/s
42
• Tensor Fusion
  • Batch tensors together during allreduce
  • HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...
• Gradient Compression (FP16 Allreduce):
  • hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
  • Reduces network utilization
Step 3: Scale to multiple nodes
Further Horovod optimizations
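Tensor fusion reduces the number of allreduce calls by packing several small gradient tensors into one buffer. The sketch below models that with a simple greedy packer over tensor sizes; it is a simplified illustration under assumed sizes and an assumed threshold, not Horovod's actual scheduling logic, and `fuse_tensors` is a hypothetical name.

```python
def fuse_tensors(tensor_sizes, threshold_bytes):
    """Greedily pack tensor sizes (bytes) into buffers of at most `threshold_bytes`.

    A tensor larger than the threshold gets a buffer of its own.
    """
    buffers, current, used = [], [], 0
    for size in tensor_sizes:
        if current and used + size > threshold_bytes:
            buffers.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        buffers.append(current)
    return buffers


grad_sizes = [10, 20, 40, 5, 64, 70]  # hypothetical gradient sizes in bytes
buffers = fuse_tensors(grad_sizes, threshold_bytes=64)
print(len(grad_sizes), "allreduce calls become", len(buffers))
```

A larger threshold means fewer, bigger messages, at the cost of waiting longer before communication starts, which is the trade-off the threshold and cycle-time settings expose.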
43
• DNN datasets are large
• Read-dominated at beginning of each epoch
• Keep data close to compute as much as possible:
• RAM disk, SSDs in RAID 0, Fast network attached storage
Step 3: Scale to multiple nodes
Storage
44
• Integrated software and hardware system for multi-node scaling
• State-of-the-art compute, GPU interconnect, node interconnect, and storage
Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD
45
NVIDIA DGX SuperPOD
Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node
[Diagram: Rack 1 … Rack 16 holding 64 DGX-2 nodes, connected to a Compute Backplane Switch and a Storage Backplane Switch with GPFS; link labels: 200 Gb/s per node and 800 Gb/s per node]
White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/
46
• Deep Learning Model:
• Hyperparameters tuned for multi-node scaling
• Multi-node launcher scripts
• Deep Learning Container:
• Optimized DL frameworks, GPU libraries, and multi-node software
• Host:
• Host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
Step 3: Scale to multiple nodes
Software stack - Application
47
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks
Step 3: Scale to multiple nodes
Software stack - System
[Diagram: login nodes; Slurm controller; DGX Pod: DGX servers with DGX Base OS running Enroot | Docker, Pyxis, NGC model containers (PyTorch, TensorFlow from 19.09), and DCGM]
48
Deployment with DeepOps
DeepOps leverages Ansible for automated large-scale cluster deployment (see the DeepOps deployment doc).
• Bootstrap all nodes
• Prepare provisioning node
• Provision all node(s)
• Deploy Slurm on Slurm nodes
• Deploy DL/ML development tools
• Deploy production AI applications
• Deploy management services
- Build your own GPU cluster following the DGX Pod and DGX SuperPOD reference architectures.
- Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you run into any problems.
Step 3: Scale to multiple nodes
49
• Scaling requires careful consideration of algorithms and infrastructure at each step
  • Optimized single-GPU model
  • Efficient & scalable Allreduce library
  • GPU interconnect, networking, storage
  ...
• NVIDIA platform makes scaling DL training easier and more efficient
  • Deep Learning Examples with SOTA accuracy and performance
  • NVIDIA NGC Container with optimized multi-GPU/multi-node software stack
  • Accelerated compute platform designed for performance and scaling
Summary
Scaling is important and we are here to help