Large Scale Distributed Deep Networks


Page 1: Large Scale Distributed Deep Networks

Large Scale Distributed Deep Networks

Survey of paper from NIPS 2012

Hiroyuki Vincent Yamazaki, Jan 8, 2016
[email protected]

Page 2: Large Scale Distributed Deep Networks

What is Deep Learning?

How can distributed computing be applied?

Page 3: Large Scale Distributed Deep Networks

“… We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.”

– Jeff Dean, Google
GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015

Page 4: Large Scale Distributed Deep Networks

What is Deep Learning?

Page 5: Large Scale Distributed Deep Networks

Multi-layered neural networks

Functions that take some input and return some output

Input → f → Output

Page 6: Large Scale Distributed Deep Networks

f | Input | Output
AND | (1, 0) | 0
y(x) = 2x + 5 | 7 | 19
Object Classifier | (image) | Cat
Speech Recognizer | (audio) | “Hello world”

Page 7: Large Scale Distributed Deep Networks

Neural Networks

Machine learning models, inspired by the human brain

Layered units with weighted connections

Signals are passed between layers: Input layer → Hidden layers → Output layer

Page 8: Large Scale Distributed Deep Networks

Steps

1. Prepare training, validation and test data

2. Define the model and its initial parameters

3. Train using the data to improve the model

Page 9: Large Scale Distributed Deep Networks

Here to train?

Page 10: Large Scale Distributed Deep Networks

Input → f → Output

Page 11: Large Scale Distributed Deep Networks

Input → Hidden Layers → Output

Page 12: Large Scale Distributed Deep Networks

Input → Hidden Layers → Output

Page 13: Large Scale Distributed Deep Networks

Yes, let’s do it

Page 14: Large Scale Distributed Deep Networks

Feed Forward

1. For each unit, compute its weighted sum based on its input

2. Pass the sum to the activation function to get the output of the unit

z = \sum_{i=1}^{n} x_i w_i + b

y = \sigma(z)

z is the weighted sum
n is the number of inputs
x_i is the i-th input
w_i is the weight for x_i
b is the bias term
y is the output
\sigma is the activation function
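A minimal sketch of these two steps for a single unit in plain Python, assuming a logistic sigmoid as the activation function σ; the function and example values are illustrative only:

    import math

    def unit_forward(x, w, b):
        """Weighted sum of the inputs plus the bias, passed through the activation."""
        z = sum(xi * wi for xi, wi in zip(x, w)) + b   # z = sum_i x_i * w_i + b
        return 1.0 / (1.0 + math.exp(-z))              # y = sigma(z), sigmoid assumed

    # Example: a unit with two inputs, two weights and a bias
    print(unit_forward([1.0, 0.5], [0.3, -0.2], b=0.1))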

Page 15: Large Scale Distributed Deep Networks

Loss

3. Given the output from the last layer, compute the loss using the Mean Squared Error (MSE) or the cross entropy

This is the error that we want to minimize

E(W) = \frac{1}{2} (y - \hat{y})^2

E is the loss/error
W is the weights
y is the target values
\hat{y} is the output values
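A small numeric sketch of the MSE loss above, assuming the per-output losses are averaged and using hypothetical target and output values:

    def mse_loss(targets, outputs):
        """E(W) = 1/2 * (y - y_hat)^2, averaged over the output values."""
        return sum(0.5 * (y - y_hat) ** 2
                   for y, y_hat in zip(targets, outputs)) / len(targets)

    print(mse_loss([1.0, 0.0], [0.8, 0.3]))  # hypothetical target/output values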

Page 16: Large Scale Distributed Deep Networks

Back Propagation

4. Compute the gradient of the loss function with respect to the parameters using Stochastic Gradient Descent (SGD)

5. Take a step proportional (scaled by the learning rate) to the negative of the gradient to adjust the weights

\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}

w_{i,t+1} = w_{i,t} + \Delta w_i

\alpha is the learning rate, typically 10^{-1} to 10^{-3}
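A minimal sketch of steps 4 and 5 for a single linear unit under the MSE loss above; the gradient ∂E/∂w_i = (ŷ − y)·x_i follows from the chain rule, and the names and values are purely illustrative:

    def sgd_step(x, w, b, target, lr=0.1):
        """One SGD update for a linear unit y_hat = sum_i x_i*w_i + b
        under the loss E = 1/2 * (y - y_hat)^2."""
        y_hat = sum(xi * wi for xi, wi in zip(x, w)) + b
        d_out = y_hat - target                                    # dE/dy_hat
        new_w = [wi - lr * d_out * xi for wi, xi in zip(w, x)]    # w - alpha * dE/dw_i
        new_b = b - lr * d_out                                    # b - alpha * dE/db
        return new_w, new_b

    w, b = [0.3, -0.2], 0.1
    for _ in range(100):                  # iteratively repeat the training step
        w, b = sgd_step([1.0, 0.5], w, b, target=1.0)
    print(w, b)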

Page 17: Large Scale Distributed Deep Networks

Improve the accuracy of the network by iteratively repeating these steps

Page 18: Large Scale Distributed Deep Networks

But it takes time

Page 19: Large Scale Distributed Deep Networks

22 layers, 5M parameters

GoogLeNet, Google, ILSVRC 2014

Page 20: Large Scale Distributed Deep Networks

AlexNet, NIPS 2012

7 layers, 650K units, 60M parameters

Page 21: Large Scale Distributed Deep Networks

Yes, train hard

It’s too much

Page 22: Large Scale Distributed Deep Networks

How can distributed computing be applied?

Page 23: Large Scale Distributed Deep Networks

A framework, DistBelief, proposed by researchers at Google in 2012

Page 24: Large Scale Distributed Deep Networks

Here, let me help you with those weights

Page 25: Large Scale Distributed Deep Networks

Asynchrony - Robustness to cope with slow machines and single points of failure

Network Overhead - Manage the amount of data sent across machines

Page 26: Large Scale Distributed Deep Networks

DistBelief

Parallelization - Splitting up the network/model

Model Replication - Processing multiple instances of the network/model asynchronously

Page 27: Large Scale Distributed Deep Networks

DistBelief Parallelization

Page 28: Large Scale Distributed Deep Networks

Split up the network among multiple machines

Speed-up gains for networks with many parameters, up to the point where communication costs dominate

Bold connections in the figure require network traffic
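A toy sketch of the idea, assuming a single fully connected layer whose units (columns of the weight matrix) are split across two simulated machines; NumPy is used only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)           # activations entering the layer
    W = rng.standard_normal((8, 6))      # full weight matrix: 8 inputs -> 6 units

    # Split the layer's units (columns of W) across two simulated "machines".
    W_machine_a, W_machine_b = W[:, :3], W[:, 3:]

    # Each machine computes the outputs of its own units; only the input
    # activations and the partial outputs need to cross machine boundaries.
    out_a = x @ W_machine_a
    out_b = x @ W_machine_b

    # Concatenating the partial outputs reproduces the single-machine result.
    assert np.allclose(np.concatenate([out_a, out_b]), x @ W)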

Page 29: Large Scale Distributed Deep Networks

DistBelief Model Replication

Page 30: Large Scale Distributed Deep Networks

Two optimization algorithms to achieve asynchrony: Downpour SGD and Sandblaster L-BFGS

Page 31: Large Scale Distributed Deep Networks

Downpour SGD
Online Asynchronous Stochastic Gradient Descent

Page 32: Large Scale Distributed Deep Networks

1. Split the training data into shards and assign a model replica to each data shard

2. For each model replica, fetch the parameters from the centralized, sharded parameter server

3. Gradients are computed per model replica and pushed back to the parameter server

Each data shard stores a subset of the complete training data; a minimal single-process sketch of this protocol follows below
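In this sketch the shard counts, parameter names, and the stand-in gradient computation are all hypothetical; a real replica would run backpropagation on its data shard:

    import random

    LEARNING_RATE = 0.1

    # Sharded parameter server: each shard owns a disjoint slice of the parameters.
    param_shards = [{"w0": 0.0, "w1": 0.0}, {"w2": 0.0, "w3": 0.0}]

    # Each model replica is assigned its own data shard (here: random numbers).
    data_shards = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(3)]

    def fetch_parameters():
        """Pull the current parameters from every parameter server shard."""
        params = {}
        for shard in param_shards:
            params.update(shard)
        return params

    def compute_gradient(params, data):
        """Stand-in for running backpropagation on the replica's data shard."""
        return {name: 0.01 * sum(data) / len(data) for name in params}

    def push_gradient(grad):
        """Each shard applies only the part of the gradient it owns."""
        for shard in param_shards:
            for name in shard:
                shard[name] -= LEARNING_RATE * grad[name]

    # Replicas run independently; interleaving their steps in a random order
    # mimics the asynchrony (no replica ever waits for another one).
    for step in range(5):
        for replica in random.sample(range(len(data_shards)), len(data_shards)):
            params = fetch_parameters()
            grad = compute_gradient(params, data_shards[replica])
            push_gradient(grad)

    print(fetch_parameters())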

Page 33: Large Scale Distributed Deep Networks

Asynchrony - Model replicas and parameter server shards process data independently

Network Overhead - Each machine only needs to communicate with a subset of the parameter server shards

Page 34: Large Scale Distributed Deep Networks

Batch Updates - Performing batch updates and batched push/pull to and from the parameter server also reduces network overhead

AdaGrad - Adaptive learning rates per weight using AdaGrad improve the training results (see the sketch after this list)

Stochasticity - Model replicas compute gradients with out-of-date parameters; it is not clear how this affects the training
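A minimal sketch of the AdaGrad per-weight update rule, with illustrative weights and gradients:

    import math

    def adagrad_update(weights, grads, accum, lr=0.01, eps=1e-8):
        """AdaGrad: each weight gets its own effective learning rate, scaled by
        the inverse square root of the sum of its past squared gradients."""
        new_weights, new_accum = [], []
        for w, g, a in zip(weights, grads, accum):
            a = a + g * g                          # accumulate squared gradient
            new_weights.append(w - lr * g / (math.sqrt(a) + eps))
            new_accum.append(a)
        return new_weights, new_accum

    weights, accum = [0.3, -0.2], [0.0, 0.0]
    weights, accum = adagrad_update(weights, grads=[0.5, -1.0], accum=accum)
    print(weights, accum)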

Page 35: Large Scale Distributed Deep Networks

Sandblaster L-BFGS
Batch Distributed Parameter Storage and Manipulation

Page 36: Large Scale Distributed Deep Networks

1. Create model replicas

2. Load balancing by dividing computational tasks into smaller subtasks and letting a coordinator assign those subtasks to appropriate shards

Page 37: Large Scale Distributed Deep Networks

Asynchrony - Model replicas and parameter shards process data independently

Network Overhead - Only a single fetch per batch

Page 38: Large Scale Distributed Deep Networks

Distributed Parameter Server - No need for a single central parameter server that handles all the parameters

Coordinator - A process that balances the load among the shards to prevent slow machines from slowing down or stopping the training (a small sketch follows below)
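A small sketch of the coordinator idea, assuming a shared work queue and two simulated workers of different speeds; the names, task counts, and timings are hypothetical. Because workers pull a new subtask as soon as they finish, a slow machine simply completes fewer subtasks instead of stalling the whole batch:

    import queue
    import threading
    import time

    subtasks = queue.Queue()
    for task_id in range(20):            # the batch, divided into many small subtasks
        subtasks.put(task_id)

    completed = {"fast": 0, "slow": 0}

    def worker(name, seconds_per_task):
        while True:
            try:
                subtasks.get_nowait()
            except queue.Empty:
                return                      # no subtasks left for this batch
            time.sleep(seconds_per_task)    # simulate computing the subtask
            completed[name] += 1

    threads = [
        threading.Thread(target=worker, args=("fast", 0.01)),
        threading.Thread(target=worker, args=("slow", 0.05)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(completed)  # the fast worker ends up handling most of the subtasks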

Page 39: Large Scale Distributed Deep Networks

Results

Page 40: Large Scale Distributed Deep Networks

Training speed-up is how many times faster the parallelized model is compared with a regular model running on a single machine

Page 41: Large Scale Distributed Deep Networks

The numbers in the brackets are the number of model replicas

Page 42: Large Scale Distributed Deep Networks

Closer to the origin is better; in this case, more cost efficient in terms of money

Page 43: Large Scale Distributed Deep Networks

Conclusion

Page 44: Large Scale Distributed Deep Networks

Significant improvements over single-machine training

DistBelief is CPU oriented due to the CPU-GPU data transfer overhead

Unfortunately, it adds unit connectivity limitations

Page 45: Large Scale Distributed Deep Networks

If neural networks continue to scale up, distributed computing will become essential

Page 46: Large Scale Distributed Deep Networks

Purpose-built hardware such as Facebook's Big Sur could address these problems

Page 47: Large Scale Distributed Deep Networks

We are strong together

Page 48: Large Scale Distributed Deep Networks

References

Large Scale Distributed Deep Networks
http://research.google.com/archive/large_deep_networks_nips2012.html

Going Deeper with Convolutions
http://arxiv.org/abs/1409.4842

ImageNet Classification with Deep Convolutional Neural Networks
http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012

Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
http://arxiv.org/abs/1505.04956

GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015
https://github.com/tensorflow/tensorflow/issues/23

Big Sur, Facebook, Dec 11, 2015
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/

Page 49: Large Scale Distributed Deep Networks

Hiroyuki Vincent Yamazaki, Jan 8, 2016
[email protected]