Adasum: Adaptive Summation of Gradients for Deep Learning
Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi
Microsoft Research
Emad Barsoum, Sergii Dymchenko, Jaliya Ekanayake, Vadim Eksarevskiy, Maxim Lukiyanov, Tianju Xu
Microsoft
[Slide figures: "THIS IS AN ELEPHANT, REALLY!" DATA analogy; loss surface (Loss over weights w)]
Motivation: Neural Network Training is Inherently Sequential
• Stochastic Gradient Descent (SGD)
  • Workhorse for training a neural network
  • Inherently sequential
Motivation: More Parallelism = Less Accuracy
[Slide figure: the workers' combined, garbled message "THIS ID AN ELAPHANT TEALLY A"]
• Parallel training does not achieve the same "per-epoch accuracy"
• Adasum addresses this
Background: SGD and Gradients
[Slide figures: loss surface (Loss over weights w); training data split into partitions D1, D2, D3; SGD iterates w0 → w1 → w2 → w3, and the same steps started from a shifted point w0 + Δw]
• The other worker's contribution shifts the starting point from w0 to w0 + Δw; a first-order expansion gives
  SGD(w0 + Δw) = SGD(w0) + SGD′(w0)·Δw + O(‖Δw‖²)
• So the local model w3 becomes w3 + M·Δw + O(‖Δw‖²) for a matrix M (a small check follows below)
• M = I − αH, where H is the Hessian matrix and α is the learning rate
• Very costly to calculate H
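To make the expansion above concrete, here is a small NumPy check (my own toy example, not from the slides): for a purely quadratic loss one SGD step is linear in w, its Jacobian is exactly M = I − αH, and the O(‖Δw‖²) term vanishes, so the identity holds exactly. For general losses it holds only up to that second-order term, as the slide states.

```python
# Minimal NumPy sketch: verify SGD(w0 + dw) = SGD(w0) + M·dw for a quadratic
# loss L(w) = 0.5 * w^T H w - b^T w (toy example; names are made up here).
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4)); H = H @ H.T        # symmetric PSD "Hessian"
b = rng.normal(size=4)
alpha = 0.01

def sgd_step(w):
    return w - alpha * (H @ w - b)              # one gradient step on the quadratic loss

w0 = rng.normal(size=4)
dw = 0.1 * rng.normal(size=4)
M = np.eye(4) - alpha * H                       # M = I - alpha*H

lhs = sgd_step(w0 + dw)                         # SGD(w0 + dw)
rhs = sgd_step(w0) + M @ dw                     # SGD(w0) + M·dw
print(np.allclose(lhs, rhs))                    # True: no higher-order term for a quadratic
```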
Adasum: Symbolic SGD
• Applying the other worker's contribution to the local model gives w3 + (I − αH)·Δw = w3 + Δw − αH·Δw
[Slide figure: loss surface (Loss over weights w) with data partitions D1, D2, iterates w0, w1, w2, w3, and the shifted start w0 + Δw]
Adasum: Symbolic SGD
• The correction is Δw − αH·Δw; the second-order term αH·Δw is usually ignored
• For the combined update g1 + g2 it is ok to ignore when the two gradients are (nearly) orthogonal
• It is not ok to ignore when g1 and g2 are (nearly) parallel
• Projecting g2 onto the space orthogonal to g1 gives g2′; for g1 + g2′ the term is now ok to ignore
[Slide figures: loss surfaces for data partitions D1 and D2 with their gradients and iterates w0…w3; Δw′ is a "good" direction]
• Is Δw′ a "good" direction to move along?
• Δw is the gradient w.r.t. the green bowl
• Any direction that has a positive inner product with the gradients decays the loss
• Δw′ is a "good" direction
• The Adasum operator sums Δw with the other update projected onto the space orthogonal to the orange gradient (see the sketch below)
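A minimal NumPy sketch of the projection idea described above; `project_out` is a hypothetical helper name and the vectors are made up for illustration.

```python
# Remove from dw2 its component along dw1, so dw1 + dw2_perp no longer
# double-counts the direction the two updates share.
import numpy as np

def project_out(dw2, dw1):
    """Return the component of dw2 orthogonal to dw1."""
    return dw2 - (dw2 @ dw1) / (dw1 @ dw1) * dw1

dw1 = np.array([1.0, 0.0])
dw2 = np.array([0.6, 0.8])
dw2_perp = project_out(dw2, dw1)
combined = dw1 + dw2_perp
print(dw2_perp, combined)        # [0. 0.8]  [1. 0.8]
```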
Adasum: Adaptive Sum
• Adasum combines Δw from any number of processors
• Adasum combines Δw from different processors by projection and summation (see the sketch after this list). Effectively:
  • They are added when they are orthogonal
  • Only one is taken when they are parallel
• Traditionally, Δw from different processors are:
  • Either summed: can be too aggressive
  • Or averaged: can be too conservative
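A minimal NumPy sketch of the two-gradient rule described above, following the adaptive-sum formula as I recall it from the Adasum paper; this is an illustration, not Horovod's implementation, and `adasum_pair` is a hypothetical name.

```python
import numpy as np

def adasum_pair(g1, g2, eps=1e-12):
    """Adaptively sum two gradients: orthogonal inputs are added,
    parallel inputs of equal size are effectively averaged (one is taken)."""
    dot = g1 @ g2
    return (1.0 - dot / (2.0 * (g1 @ g1) + eps)) * g1 + \
           (1.0 - dot / (2.0 * (g2 @ g2) + eps)) * g2

g = np.array([1.0, 2.0])
print(adasum_pair(g, g))                          # parallel: returns g itself
print(adasum_pair(np.array([1.0, 0.0]),
                  np.array([0.0, 1.0])))          # orthogonal: returns the sum [1. 1.]
```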
Orthogonality of Gradients
• We use the Pythagorean theorem to define orthogonality (a small sketch follows this list)
• For P gradients the measure ranges between:
  • 1 for all orthogonal
  • 1/P for all parallel
• Gradients start out all parallel
• Later in the training they become more orthogonal
• Convergence starts out slow but speeds up later
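One way to write the Pythagorean measure described above; the exact formula is my assumption, chosen only so that it matches the stated endpoints (1 when all gradients are orthogonal, 1/P when all P gradients are parallel).

```python
# Assumed orthogonality measure: sum_i ||g_i||^2 / ||sum_i g_i||^2.
import numpy as np

def orthogonality(grads):
    """Ratio of summed squared norms to the squared norm of the sum."""
    total = np.sum(grads, axis=0)
    return sum(g @ g for g in grads) / (total @ total)

print(orthogonality([np.array([1.0, 0.0]), np.array([0.0, 1.0])]))  # 1.0 (orthogonal)
print(orthogonality([np.array([1.0, 0.0]), np.array([1.0, 0.0])]))  # 0.5 = 1/P (parallel)
```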
[Figure: BERT with 64 GPUs]
Adasum is in Horovod
• Horovod is an open-source distributed training framework by Uber that supports both PyTorch and TensorFlow
• Adasum is integrated in Horovod
• Adasum is easy to use (a usage sketch follows this list):
  • horovod.allreduce(gradients, op=hvd.adasum)
• No extra hyperparameters
• Adasum allows scaling SGD
• Minimizes convergence slowdown at scale
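A hedged PyTorch + Horovod usage sketch, based on my understanding of Horovod's API (the op constant is spelled hvd.Adasum in the releases I know of); check the Horovod documentation for the exact, current signatures.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer so gradients are reduced with the Adasum operation
# instead of the default averaging.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Adasum,
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```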
Results - Adasum vs. Allreduce/Averaging Latency on 64 GPUs
• Adasum has some computation overhead
• But no communication overhead
• Almost negligible overhead overall
• Communication latency dominates the computation latency
[Figure: latency comparison on 64 GPUs]
Results - Adasum Convergence on MNIST
• A standard 2-layer CNN gets 99.3% in 2 epochs sequentially with batch size 32
• Tuned LR for averaging and Adasum
• Default Horovod fails at 16 GPUs, while Adasum still works with 32
• With tuned learning rates Adasum has much better convergence for 16 and 32 GPUs
[Figure: test accuracy at 2 epochs on 4, 8, 16, and 32 GPUs (roughly 97.8% to 99.3%); series: Adasum (tuned LR), Adasum, Horovod (tuned LR), Default Horovod]
Results - Adasum Convergence on word2vec
• word2vec model: embeddings for words
  • London : England :: Paris : France
• Runs on 32 nodes
• Tuned LR for averaging
• Adasum matches sequential accuracy
Results - Adasum Convergence on ResNet
• The MLPerf ResNet50 on ImageNet
  • 16K batch size
  • 64 GPUs
• Adasum reaches the MLPerf target accuracy of 74.9% in 69 epochs
• Default Horovod never reaches the target accuracy in 90 epochs
[Figure: validation accuracy vs. epochs for Adasum and Default Horovod]
Results - Adasum Convergence on BERT
• BERT is a commonly trained model nowadays
• Bigger batch size = more parallelism
  • 64k and 128k
• LAMB is the optimizer used at large scales
• Adasum beats LAMB with a 128k batch size
Please use Adasum!
• And let us know what you think!
• https://github.com/horovod/