Large Scale Distributed Deep Networks


Page 1: Large Scale Distributed Deep Networks

Large Scale Distributed Deep Networks

Survey of paper from NIPS 2012

Hiroyuki Vincent Yamazaki, Jan 8, 2016
[email protected]

Page 2: Large Scale Distributed Deep Networks

What is Deep Learning?

How can distributed computing be applied?

Page 3: Large Scale Distributed Deep Networks

“… We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.”

– Jeff Dean, Google
GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015

Page 4: Large Scale Distributed Deep Networks

What is Deep Learning?

Page 5: Large Scale Distributed Deep Networks

Multi-layered neural networks

Functions that take some input and return some output

Input → f → Output

Page 6: Large Scale Distributed Deep Networks

f | Input | Output
AND | (1, 0) | 0
y(x) = 2x + 5 | 7 | 19
Object Classifier | (image) | Cat
Speech Recognizer | (audio) | “Hello world”

Page 7: Large Scale Distributed Deep Networks

Neural Networks

Machine learning models, inspired by the human brain

Layered units with weighted connections

Signals are passed between layers: Input layer → Hidden layers → Output layer

Page 8: Large Scale Distributed Deep Networks

Steps

1. Prepare training, validation and test data

2. Define the model and its initial parameters

3. Train using the data to improve the model

Page 9: Large Scale Distributed Deep Networks

Here to train?

Page 10: Large Scale Distributed Deep Networks

Input → f → Output

Page 11: Large Scale Distributed Deep Networks

Input → Hidden Layers → Output

Page 12: Large Scale Distributed Deep Networks

Input → Hidden Layers → Output

Page 13: Large Scale Distributed Deep Networks

Yes, let’s do it

Page 14: Large Scale Distributed Deep Networks

Feed Forward

1. For each unit, compute its weighted sum based on its input

2. Pass the sum to the activation function to get the output of the unit

z = \sum_{i=1}^{n} x_i w_i + b

y = \sigma(z)

z is the weighted sum
n is the number of inputs
x_i is the i-th input
w_i is the weight for x_i
b is the bias term
y is the output
\sigma is the activation function
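A minimal sketch of these two steps for a single unit in plain Python, assuming a logistic sigmoid as the activation function σ; the function and example values are illustrative only:

    import math

    def unit_forward(x, w, b):
        """Weighted sum of the inputs plus the bias, passed through the activation."""
        z = sum(xi * wi for xi, wi in zip(x, w)) + b   # z = sum_i x_i * w_i + b
        return 1.0 / (1.0 + math.exp(-z))              # y = sigma(z), sigmoid assumed

    # Example: a unit with two inputs, two weights and a bias
    print(unit_forward([1.0, 0.5], [0.3, -0.2], b=0.1))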

Page 15: Large Scale Distributed Deep Networks

Loss

3. Given the output from the last layer, compute the loss using the Mean Squared Error (MSE) or the cross entropy

This is the error that we want to minimize

E(W) = \frac{1}{2} (y - \hat{y})^2

E is the loss/error
W is the weights
y is the target values
\hat{y} is the output values
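A small numeric sketch of the MSE loss above, assuming the per-output losses are averaged and using hypothetical target and output values:

    def mse_loss(targets, outputs):
        """E(W) = 1/2 * (y - y_hat)^2, averaged over the output values."""
        return sum(0.5 * (y - y_hat) ** 2
                   for y, y_hat in zip(targets, outputs)) / len(targets)

    print(mse_loss([1.0, 0.0], [0.8, 0.3]))  # hypothetical target/output values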

Page 16: Large Scale Distributed Deep Networks

Back Propagation

4. Compute the gradient of the loss function with respect to the parameters using Stochastic Gradient Descent (SGD)

5. Take a step proportional (scaled by the learning rate) to the negative of the gradient to adjust the weights

\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}

w_{i,t+1} = w_{i,t} + \Delta w_i

\alpha is the learning rate, typically 10^{-1} to 10^{-3}
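A minimal sketch of steps 4 and 5 for a single linear unit under the MSE loss above; the gradient ∂E/∂w_i = (ŷ − y)·x_i follows from the chain rule, and the names and values are purely illustrative:

    def sgd_step(x, w, b, target, lr=0.1):
        """One SGD update for a linear unit y_hat = sum_i x_i*w_i + b
        under the loss E = 1/2 * (y - y_hat)^2."""
        y_hat = sum(xi * wi for xi, wi in zip(x, w)) + b
        d_out = y_hat - target                                    # dE/dy_hat
        new_w = [wi - lr * d_out * xi for wi, xi in zip(w, x)]    # w - alpha * dE/dw_i
        new_b = b - lr * d_out                                    # b - alpha * dE/db
        return new_w, new_b

    w, b = [0.3, -0.2], 0.1
    for _ in range(100):                  # iteratively repeat the training step
        w, b = sgd_step([1.0, 0.5], w, b, target=1.0)
    print(w, b)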

Page 17: Large Scale Distributed Deep Networks

Improve the accuracy of the network by iteratively repeating these steps

Page 18: Large Scale Distributed Deep Networks

But it takes time

Page 19: Large Scale Distributed Deep Networks

22 layers, 5M parameters

GoogLeNet, Google, ILSVRC 2014

Page 20: Large Scale Distributed Deep Networks

AlexNet, NIPS 2012

7 layers, 650K units, 60M parameters

Page 21: Large Scale Distributed Deep Networks

Yes, train hard

It’s too much

Page 22: Large Scale Distributed Deep Networks

How can distributed computing be applied?

Page 23: Large Scale Distributed Deep Networks

A framework, DistBelief, proposed by researchers at Google in 2012

Page 24: Large Scale Distributed Deep Networks

Here, let me help you with those weights

Page 25: Large Scale Distributed Deep Networks

Asynchrony - Robustness to cope with slow machines and single points of failure

Network Overhead - Manage the amount of data sent across machines

Page 26: Large Scale Distributed Deep Networks

DistBelief

Parallelization - Splitting up the network/model

Model Replication - Processing multiple instances of the network/model asynchronously

Page 27: Large Scale Distributed Deep Networks

DistBelief Parallelization

Page 28: Large Scale Distributed Deep Networks

Split up the network among multiple machines

Speed-up gains for networks with many parameters, up to the point where communication costs dominate

Bold connections in the figure require network traffic
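A toy sketch of the idea, assuming a single fully connected layer whose units (columns of the weight matrix) are split across two simulated machines; NumPy is used only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)           # activations entering the layer
    W = rng.standard_normal((8, 6))      # full weight matrix: 8 inputs -> 6 units

    # Split the layer's units (columns of W) across two simulated "machines".
    W_machine_a, W_machine_b = W[:, :3], W[:, 3:]

    # Each machine computes the outputs of its own units; only the input
    # activations and the partial outputs need to cross machine boundaries.
    out_a = x @ W_machine_a
    out_b = x @ W_machine_b

    # Concatenating the partial outputs reproduces the single-machine result.
    assert np.allclose(np.concatenate([out_a, out_b]), x @ W)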

Page 29: Large Scale Distributed Deep Networks

DistBelief Model Replication

Page 30: Large Scale Distributed Deep Networks

Two optimization algorithms to achieve asynchrony: Downpour SGD and Sandblaster L-BFGS

Page 31: Large Scale Distributed Deep Networks

Downpour SGD
Online Asynchronous Stochastic Gradient Descent

Page 32: Large Scale Distributed Deep Networks

1. Split the training data into shards and assign a model replica to each data shard

2. For each model replica, fetch the parameters from the centralized, sharded parameter server

3. Gradients are computed per model replica and pushed back to the parameter server

Each data shard stores a subset of the complete training data; a minimal single-process sketch of this protocol follows below
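In this sketch the shard counts, parameter names, and the stand-in gradient computation are all hypothetical; a real replica would run backpropagation on its data shard:

    import random

    LEARNING_RATE = 0.1

    # Sharded parameter server: each shard owns a disjoint slice of the parameters.
    param_shards = [{"w0": 0.0, "w1": 0.0}, {"w2": 0.0, "w3": 0.0}]

    # Each model replica is assigned its own data shard (here: random numbers).
    data_shards = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(3)]

    def fetch_parameters():
        """Pull the current parameters from every parameter server shard."""
        params = {}
        for shard in param_shards:
            params.update(shard)
        return params

    def compute_gradient(params, data):
        """Stand-in for running backpropagation on the replica's data shard."""
        return {name: 0.01 * sum(data) / len(data) for name in params}

    def push_gradient(grad):
        """Each shard applies only the part of the gradient it owns."""
        for shard in param_shards:
            for name in shard:
                shard[name] -= LEARNING_RATE * grad[name]

    # Replicas run independently; interleaving their steps in a random order
    # mimics the asynchrony (no replica ever waits for another one).
    for step in range(5):
        for replica in random.sample(range(len(data_shards)), len(data_shards)):
            params = fetch_parameters()
            grad = compute_gradient(params, data_shards[replica])
            push_gradient(grad)

    print(fetch_parameters())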

Page 33: Large Scale Distributed Deep Networks

Asynchrony - Model replicas and parameter server shards process data independently

Network Overhead - Each machine only needs to communicate with a subset of the parameter server shards

Page 34: Large Scale Distributed Deep Networks

Batch Updates - Performing batch updates and batched push/pull to and from the parameter server also reduces network overhead

AdaGrad - Adaptive learning rates per weight using AdaGrad improve the training results (see the sketch after this list)

Stochasticity - Model replicas compute gradients with out-of-date parameters; it is not clear how this affects the training
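A minimal sketch of the AdaGrad per-weight update rule, with illustrative weights and gradients:

    import math

    def adagrad_update(weights, grads, accum, lr=0.01, eps=1e-8):
        """AdaGrad: each weight gets its own effective learning rate, scaled by
        the inverse square root of the sum of its past squared gradients."""
        new_weights, new_accum = [], []
        for w, g, a in zip(weights, grads, accum):
            a = a + g * g                          # accumulate squared gradient
            new_weights.append(w - lr * g / (math.sqrt(a) + eps))
            new_accum.append(a)
        return new_weights, new_accum

    weights, accum = [0.3, -0.2], [0.0, 0.0]
    weights, accum = adagrad_update(weights, grads=[0.5, -1.0], accum=accum)
    print(weights, accum)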

Page 35: Large Scale Distributed Deep Networks

Sandblaster L-BFGS
Batch Distributed Parameter Storage and Manipulation

Page 36: Large Scale Distributed Deep Networks

1. Create model replicas

2. Load balancing by dividing computational tasks into smaller subtasks and letting a coordinator assign those subtasks to appropriate shards

Page 37: Large Scale Distributed Deep Networks

Asynchrony - Model replicas and parameter shards process data independently

Network Overhead - Only a single fetch per batch

Page 38: Large Scale Distributed Deep Networks

Distributed Parameter Server - No need for a single central parameter server that handles all the parameters

Coordinator - A process that balances the load among the shards to prevent slow machines from slowing down or stopping the training (a small sketch follows below)
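A small sketch of the coordinator idea, assuming a shared work queue and two simulated workers of different speeds; the names, task counts, and timings are hypothetical. Because workers pull a new subtask as soon as they finish, a slow machine simply completes fewer subtasks instead of stalling the whole batch:

    import queue
    import threading
    import time

    subtasks = queue.Queue()
    for task_id in range(20):            # the batch, divided into many small subtasks
        subtasks.put(task_id)

    completed = {"fast": 0, "slow": 0}

    def worker(name, seconds_per_task):
        while True:
            try:
                subtasks.get_nowait()
            except queue.Empty:
                return                      # no subtasks left for this batch
            time.sleep(seconds_per_task)    # simulate computing the subtask
            completed[name] += 1

    threads = [
        threading.Thread(target=worker, args=("fast", 0.01)),
        threading.Thread(target=worker, args=("slow", 0.05)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(completed)  # the fast worker ends up handling most of the subtasks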

Page 39: Large Scale Distributed Deep Networks

Results

Page 40: Large Scale Distributed Deep Networks

Training speed-up is how many times faster the parallelized model is compared with a regular model running on a single machine

Page 41: Large Scale Distributed Deep Networks

The numbers in the brackets are the number of model replicas

Page 42: Large Scale Distributed Deep Networks

Closer to the origin is better; in this case, more cost efficient in terms of money

Page 43: Large Scale Distributed Deep Networks

Conclusion

Page 44: Large Scale Distributed Deep Networks

Significant improvements over single-machine training

DistBelief is CPU oriented due to the CPU-GPU data transfer overhead

Unfortunately, it adds unit connectivity limitations

Page 45: Large Scale Distributed Deep Networks

If neural networks continue to scale up, distributed computing will become essential

Page 46: Large Scale Distributed Deep Networks

Purpose-built hardware such as Facebook's Big Sur could address these problems

Page 47: Large Scale Distributed Deep Networks

We are strong together

Page 48: Large Scale Distributed Deep Networks

References

Large Scale Distributed Deep Networks
http://research.google.com/archive/large_deep_networks_nips2012.html

Going Deeper with Convolutions
http://arxiv.org/abs/1409.4842

ImageNet Classification with Deep Convolutional Neural Networks
http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012

Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
http://arxiv.org/abs/1505.04956

GitHub Issue - Distributed Version #23, TensorFlow, Nov 11, 2015
https://github.com/tensorflow/tensorflow/issues/23

Big Sur, Facebook, Dec 11, 2015
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/

Page 49: Large Scale Distributed Deep Networks

Hiroyuki Vincent Yamazaki, Jan 8, 2016
[email protected]