
Adasum: Adaptive Summation of Gradients for Deep Learning

Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi

Microsoft Research

Emad Barsoum, Sergii Dymchenko, Jaliya Ekanayake, Vadim Eksarevskiy, Maxim Lukiyanov, Tianju Xu

Microsoft

[Figure: labeled training data ("This is an elephant, really!") feeding a model with weights w and loss Loss(w) = Σ_i L_i(w)]

Motivation: Neural Network Training is Inherently Sequential

• Stochastic Gradient Descent (SGD)

• Workhorse for training a neural network

• Inherently sequential (see the sketch below)
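A minimal sketch (illustrative only, not from the talk) of why the loop is sequential: each update needs the weights produced by the previous one. The names grad and batches are placeholders.

def sgd(w0, batches, grad, lr=0.01):
    # Plain SGD: step t cannot start until step t-1 has produced its weights.
    w = w0
    for batch in batches:     # inherently sequential loop
        g = grad(w, batch)    # the gradient is evaluated at the current weights
        w = w - lr * g        # the next iteration depends on this result
    return w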

Motivation: More parallelism = less accuracy

• Parallel (large-batch) training runs do not achieve the same "per-epoch accuracy" as sequential training

• Adasum addresses this

Background: SGD and Gradients

𝑀π‘₯

πΏπ‘œπ‘ π‘ 

𝑖

𝐿𝑖⋯

=+

𝐷1

𝐷2

𝐷3

𝑀π‘₯

πΏπ‘œπ‘ π‘ 

𝑀1

𝐷2𝐷1

𝑀π‘₯

πΏπ‘œπ‘ π‘ 

𝑖

𝐿𝑖⋯

𝑀0 𝑀1

𝑀2

𝐷2𝐷1

𝑀0

𝑀3

𝑀0+Δ𝑀

𝑀1

𝑀0

𝑀2

𝑀3

𝑆𝐺𝐷 𝑀0 + Δ𝑀 = 𝑆𝐺𝐷 𝑀0 + +𝑂( Δ𝑀 2)𝑆𝐺𝐷′ 𝑀0 β‹… Δ𝑀

πœŸπ’˜

local model 𝑀3

+ 𝑓(Δ𝑀)+𝑀 β‹… Δ𝑀

β€’ 𝑀 = 𝐼 βˆ’ 𝛼𝐻 where 𝐻 is the Hessian matrix and 𝛼 is the learning rate

β€’ Very costly to calculate 𝐻

a matrix 𝑀

Adasum: Symbolic SGD

SGD(w0 + Δw) ≈ SGD(w0) + (I − αH) · Δw = SGD(w0) + Δw − αH · Δw
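For a single step this first-order form can be checked directly. A short derivation sketch, assuming the plain update rule SGD(w) = w − α∇L(w) and writing H for the Hessian ∇²L(w0):

SGD(w0 + Δw) = w0 + Δw − α∇L(w0 + Δw)
             = w0 + Δw − α[∇L(w0) + H·Δw + O(‖Δw‖²)]
             = (w0 − α∇L(w0)) + (I − αH)·Δw + O(‖Δw‖²)
             = SGD(w0) + Δw − αH·Δw + O(‖Δw‖²)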

[Figure: two workers start from w0 on data shards D1 and D2; one reaches its local model w3 while the other contributes an update Δw, giving the shifted starting point w0 + Δw]

Adasum: Symbolic SGD

SGD(w0 + Δw) ≈ w3 + Δw − αH · Δw

• The correction term −αH · Δw is usually ignored

[Figure: whether ignoring it is safe depends on the direction of Δw: for some updates w3 + Δw it is ok to ignore, for others it is not ok to ignore. Projecting Δw onto the space orthogonal to the other worker's gradient gives Δw′, and for w3 + Δw′ it is now ok to ignore; Δw′ is a "good" direction]

• Is Δw′ a "good" direction to move along?

• Δw is the gradient w.r.t. the green bowl

• Any direction that has a positive inner product with the gradients decays the loss

• Δw′ is a "good" direction

• The Adasum operator sums Δw with the projection onto the space orthogonal to the orange gradient (a small sketch of the projection follows)
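A minimal NumPy sketch of that projection step (the names are illustrative, not from the talk): the component of Δw along the other gradient g is removed, and the result Δw′ is orthogonal to g while keeping a non-negative inner product with the original Δw.

import numpy as np

def project_orthogonal(dw, g, eps=1e-16):
    # Remove from dw its component along g, leaving the part of dw
    # that lies in the space orthogonal to g.
    return dw - (np.dot(dw, g) / (np.dot(g, g) + eps)) * g

rng = np.random.default_rng(0)
dw, g = rng.normal(size=10), rng.normal(size=10)
dw_prime = project_orthogonal(dw, g)
print(np.dot(dw_prime, g))   # ~0: no first-order interaction with g
print(np.dot(dw_prime, dw))  # >= 0: still compatible with the original direction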

Adasum: Adaptive Sum

• Adasum combines Δw from any number of processors

• Adasum combines Δw from different processors by projection and summation (see the two-gradient sketch below). Effectively:

• They are added when they are orthogonal

• Only one is taken when they are parallel

• Traditionally, Δw from different processors are:

• Either summed: can be too aggressive

• Or averaged: can be too conservative
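A minimal NumPy sketch of a two-gradient form of this rule, assuming the pairwise operator from the Adasum paper in which each gradient is reduced by half of the other's projection onto it (names are illustrative):

import numpy as np

def adasum_pair(g1, g2, eps=1e-16):
    # Adaptive sum of two gradients: subtract from each gradient half of the
    # other's projection onto it, then add. Orthogonal gradients are fully
    # added; identical gradients collapse to a single copy.
    dot = np.dot(g1, g2)
    return (1.0 - dot / (2.0 * np.dot(g1, g1) + eps)) * g1 \
         + (1.0 - dot / (2.0 * np.dot(g2, g2) + eps)) * g2

g = np.ones(4)
print(adasum_pair(g, g))                                    # parallel: one copy of g
print(adasum_pair(np.array([1., 0.]), np.array([0., 1.])))  # orthogonal: plain sum

In the paper the operator is applied recursively over pairs, which is how it extends to any number of processors.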

Orthogonality of Gradients

• We use the Pythagorean theorem to define an orthogonality measure (a sketch of one such measure follows this slide)

• For P gradients it ranges between:

• 1 for all orthogonal

• 1/P for all parallel

• Gradients start out all parallel

• Later in the training they become more orthogonal

• Convergence starts out slow but speeds up later

[Figure: gradient orthogonality over the course of BERT training with 64 GPUs]
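One Pythagorean-style measure consistent with the range quoted above is the ratio of the sum of squared gradient norms to the squared norm of their sum; this is a sketch under that assumption, and the exact definition used in the work may differ.

import numpy as np

def orthogonality(grads):
    # 1.0 when the gradients are mutually orthogonal (Pythagorean equality),
    # ~1/P when all P gradients are parallel with equal norms.
    total = np.sum(grads, axis=0)
    return sum(np.dot(g, g) for g in grads) / np.dot(total, total)

print(orthogonality([np.array([1., 0.]), np.array([0., 1.])]))  # 1.0
print(orthogonality([np.array([1., 0.])] * 4))                  # 0.25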

Adasum is in Horovod

• Horovod is an open-source distributed training framework by Uber that supports both PyTorch and TensorFlow

• Adasum is integrated in Horovod

• Adasum is easy to use (see the sketch below):

• hvd.allreduce(gradients, op=hvd.Adasum)

• No hyperparameters

• Adasum allows scaling SGD

• Minimizes convergence slowdown at scale
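A minimal PyTorch + Horovod sketch of switching the reduction to Adasum (assuming a Horovod build with Adasum support; the model and training loop are placeholders, not from the talk):

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 2).cuda()   # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer so gradients from all workers are combined with the
# Adasum operator instead of being averaged.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Adasum,
)

hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...standard training loop: forward pass, loss.backward(), optimizer.step()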

Results – Adasum vs. Allreduce/Averaging Latency on 64 GPUs

• Adasum has some computation overhead

• But no communication overhead

• Almost negligible overhead: communication latency dominates the computation latency

Results – Adasum Convergence on MNIST

• Standard 2-layer CNN gets 99.3% in 2 epochs sequentially with batch size 32

• Tuned LR for averaging and Adasum

• Default Horovod fails at 16 GPUs, while Adasum still works with 32

• With tuned learning rates Adasum has much better convergence for 16 and 32 GPUs

[Figure: test accuracy at 2 epochs (97.8%–99.3%) on MNIST for 4, 8, 16, and 32 GPUs, comparing Adasum (tuned LR), Adasum, Horovod (tuned LR), and Default Horovod]

Results – Adasum Convergence on word2vec

• word2vec model: embeddings for words

• London is to England as Paris is to France

• Runs on 32 nodes

• Tuned LR for averaging

• Adasum matches sequential accuracy


Results – Adasum Convergence on ResNet

• The MLPerf ResNet-50 benchmark on ImageNet

• 16K batch size

• 64 GPUs

• Adasum reaches the MLPerf target accuracy of 74.9% in 69 epochs

• Default Horovod never reaches the target accuracy in 90 epochs

[Figure: validation accuracy vs. epochs for Adasum and Default Horovod]

Results – Adasum Convergence on BERT

• BERT is a commonly trained model nowadays

• Bigger batch size = more parallelism

• 64k and 128k

• LAMB is the optimizer used at large scales

• Adasum beats LAMB with a 128k batch size

Please use Adasum!

• And let us know what you think!

• https://github.com/horovod/
