Slide 1
Adasum: Adaptive Summation of Gradients for Deep Learning
Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi
Microsoft Research
Emad Barsoum, Sergii Dymchenko, Jaliya Ekanayake, Vadim Eksarevskiy, Maxim Lukiyanov, Tianju Xu
Microsoft
Slide 2

[Figure: a stream of training data ("THIS IS AN ELEPHANT REALLY!") is fed to the model one batch at a time; each step computes the loss of the current weights w, the gradient g, and an update scaled by the learning rate LR.]

Motivation: Neural Network Training is Inherently Sequential
• Stochastic Gradient Descent (SGD)
  • The workhorse for training a neural network
  • Inherently sequential: each update starts from the weights produced by the previous one
Slide 3

Motivation: More parallelism = less accuracy

[Figure: the model trained with naive parallelism produces garbled output ("THIS ID AN ELAPHANT TEALLY A"): parallel runs do not achieve the same "per-epoch accuracy" as sequential training.]

Adasum addresses this.
Slide 4

Background: SGD and Gradients

[Figure: the training data is partitioned as D1 + D2 + D3 + …; for each batch the model computes the loss of the current weights w, the gradient g, and the update w ← w − LR · g.]
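The update rule on this slide can be sketched in plain NumPy. This is an illustrative sketch only, not the paper's code: the toy least-squares loss, the data, and the learning rate are all my own assumptions. Note how each step starts from the weights the previous step produced, which is exactly the sequential dependence the motivation slides describe.

```python
import numpy as np

# Plain sequential SGD on a toy least-squares loss L(w) = 0.5*(w.x - y)^2.
# The loss, data, and learning rate are assumptions for illustration.
def sgd(w, data, lr=0.1, epochs=20):
    for _ in range(epochs):
        for x, y in data:
            g = (w @ x - y) * x   # gradient g of the per-example loss
            w = w - lr * g        # update step: w <- w - LR * g
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
xs = rng.normal(size=(64, 2))
data = [(x, float(x @ w_true)) for x in xs]
w = sgd(np.zeros(2), data)        # recovers w_true from noiseless data
```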
Slide 5

Adasum: Symbolic SGD

[Figure: starting from w0, SGD over batches D1 and D2 produces the sequence of local models w0, w1, w2, w3; starting instead from a perturbed point w0 + Δw shifts every local model, including the final w3.]

Perturbing the starting point shifts the result by a first-order term:

SGD(w0 + Δw) = SGD(w0) + SGD′(w0) · Δw + O(‖Δw‖²)
             = SGD(w0) + M · Δw + O(‖Δw‖²)

• M = SGD′(w0) = I − αH, where H is the Hessian matrix and α is the learning rate
• H is very costly to calculate
• M · Δw = (I − αH) · Δw = Δw − αH · Δw
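On a quadratic loss the first-order relation above is exact, which makes it easy to check numerically. This is a sketch; the particular α, H, starting point, and perturbation are made-up values.

```python
import numpy as np

# One SGD step on the quadratic loss L(w) = 0.5 * w^T H w is
#   SGD(w) = w - alpha * H @ w,
# so SGD(w0 + dw) = SGD(w0) + (I - alpha*H) @ dw holds exactly here;
# for general losses it holds up to O(||dw||^2).
alpha = 0.1                                   # learning rate (made-up value)
H = np.array([[2.0, 0.5], [0.5, 1.0]])        # Hessian of the quadratic loss
step = lambda w: w - alpha * (H @ w)          # one SGD step

w0 = np.array([1.0, -2.0])
dw = np.array([0.3, 0.4])
M = np.eye(2) - alpha * H                     # M = I - alpha*H = SGD'(w0)
lhs = step(w0 + dw)                           # actual result from shifted start
rhs = step(w0) + M @ dw                       # first-order prediction
```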
Slide 6

Adasum: Symbolic SGD (continued)

[Figure: two local SGD runs from w0 produce local models and gradients g_i and g_j; projecting g_j onto the space orthogonal to g_i yields g_j'.]

• In M · Δw = Δw − αH · Δw, the second-order term αH · Δw is usually ignored
• For a single gradient g_i, it is ok to ignore
• For the combined update g_i + g_j, it is not ok to ignore
• Replace g_j with its projection g_j' onto the space orthogonal to g_i: for g_i + g_j' the term is now ok to ignore
Slide 7

Δw′ is a "good" direction

[Figure: the loss surface (a green bowl) with gradients at w0 and the local models w1, w2, w3; Δw′ lies in the space orthogonal to the orange gradient.]

• Is Δw′ a "good" direction to move along?
• Δw is the gradient w.r.t. the green bowl
• Any direction that has a positive inner product with the gradients decreases the loss
• Δw′ is a "good" direction
• The Adasum operator sums Δw with the projection onto the space orthogonal to the other gradient
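The projection step described above fits in a few lines of NumPy. This is a sketch with made-up example vectors; the names g_i, g_j, and g_j' follow the slides' notation.

```python
import numpy as np

def project_orthogonal(g_j, g_i):
    # Remove from g_j its component along g_i; what remains (g_j' in the
    # slides' notation) lies in the space orthogonal to g_i.
    return g_j - (g_j @ g_i) / (g_i @ g_i) * g_i

g_i = np.array([1.0, 0.0])
g_j = np.array([1.0, 1.0])
g_j_prime = project_orthogonal(g_j, g_i)   # -> [0.0, 1.0]
# The cross term g_i . g_j' is now exactly zero.
```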
Slide 8

Adasum: Adaptive Sum

• Traditionally, the Δw from different processors are either:
  • Summed: can be too aggressive
  • Or averaged: can be too conservative
• Adasum combines the Δw from different processors by projection and summation. Effectively:
  • They are added when they are orthogonal
  • Only one is taken when they are parallel
• Adasum combines Δw from any number of processors
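For two gradients, this behavior can be sketched with the closed-form rule given in the Adasum paper: each gradient is scaled down by half of the other's relative projection onto it, then the two are summed. This is a sketch, not Horovod's implementation; the `eps` guard and the example vectors are my additions.

```python
import numpy as np

def adasum(g1, g2, eps=1e-12):
    # Two-gradient Adasum: scale each gradient down by half of the other's
    # projection onto it, then sum. Orthogonal inputs pass through unchanged;
    # identical inputs yield only a single copy.
    dot = float(g1 @ g2)
    return (1.0 - dot / (2.0 * float(g1 @ g1) + eps)) * g1 \
         + (1.0 - dot / (2.0 * float(g2 @ g2) + eps)) * g2

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
adasum(a, b)   # orthogonal -> a + b (full sum)
adasum(a, a)   # parallel   -> a (only one is taken)
```

The interpolation between "sum" and "average-like" behavior is what makes the operator adaptive: no hyperparameter chooses between the two regimes.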
Slide 9

Orthogonality of Gradients

• We use the Pythagorean theorem to define an orthogonality measure
• For P gradients it ranges between:
  • 1 when all are orthogonal
  • 1/P when all are parallel
• Gradients start out all parallel
• Later in training they become more orthogonal
• Convergence starts out slow but speeds up later

[Figure: the orthogonality measure over the course of training BERT with 64 GPUs.]
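One way to read the Pythagorean definition above is the ratio of the sum of squared gradient norms to the squared norm of the gradient sum. That reading (mine, not spelled out on the slide) reproduces the stated range of 1 down to 1/P:

```python
import numpy as np

def orthogonality(grads):
    # Pythagorean orthogonality measure for P gradients:
    #   sum_i ||g_i||^2 / ||sum_i g_i||^2
    # = 1   when the gradients are mutually orthogonal,
    # = 1/P when they are all identical (fully parallel).
    total = np.sum(grads, axis=0)
    return float(sum(g @ g for g in grads) / (total @ total))

ortho = list(np.eye(4))               # 4 mutually orthogonal gradients
same = [np.array([1.0, 2.0])] * 4     # 4 identical gradients
m1 = orthogonality(ortho)             # -> 1.0
m2 = orthogonality(same)              # -> 1/4
```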
Slide 10

Adasum is in Horovod

• Horovod is an open-source distributed training framework by Uber that supports both PyTorch and TensorFlow
• Adasum is integrated into Horovod
• Adasum is easy to use:
  • horovod.allreduce(gradients, op=hvd.Adasum)
• No new hyperparameters
• Adasum allows scaling SGD: it minimizes convergence slowdown at scale
Slide 11

Results – Adasum vs. Allreduce/Averaging Latency on 64 GPUs

• Adasum has some computation overhead, but no communication overhead
• The overhead is almost negligible: communication latency dominates computation latency

[Figure: latency comparison on 64 GPUs.]
Slide 12

Results – Adasum Convergence on MNIST

• A standard 2-layer CNN gets 99.3% in 2 epochs sequentially with batch size 32
• The learning rate was tuned for both averaging and Adasum
• Default Horovod fails at 16 GPUs, while Adasum still works with 32
• With tuned learning rates, Adasum has much better convergence at 16 and 32 GPUs

[Figure: test accuracy at 2 epochs (y-axis 97.8%–99.3%) for 4, 8, 16, and 32 GPUs; series: Adasum (tuned LR), Adasum, Horovod (tuned LR), Default Horovod.]
Slide 13

Results – Adasum Convergence on word2vec

• word2vec model: an embedding for words
  • London is to England as Paris is to France
• Runs on 32 nodes
• The learning rate was tuned for averaging
• Adasum matches sequential accuracy
Slide 14

Results – Adasum Convergence on ResNet

• The MLPerf ResNet-50 benchmark on ImageNet
  • 16K batch size
  • 64 GPUs
• Adasum reaches the MLPerf target accuracy of 74.9% in 69 epochs
• Default Horovod never reaches the target accuracy within 90 epochs

[Figure: validation accuracy vs. epochs for Adasum and Default Horovod.]
Slide 15

Results – Adasum Convergence on BERT

• BERT is one of the most commonly trained models today
• A bigger batch size means more parallelism
  • 64K and 128K
• LAMB is the optimizer used at large scales
• Adasum beats LAMB with a 128K batch size
Slide 16
Please use Adasum!
β’ And let us know what you think!
β’ https://github.com/horovod/