Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop

Deep Learning on Hadoop

Scaleout Deep Learning on YARN

Adam Gibson

Email: [email protected]

Twitter: @agibsonccc

Github: https://github.com/agibsonccc

Slideshare: http://slideshare.net/agibsonccc/

Instructor at http://zipfianacademy.com/

Wired Coverage: http://www.wired.com/2014/06/skymind-deep-learning/

Josh Patterson

Email: [email protected]

Twitter: @jpatanooga

Github: https://github.com/jpatanooga

Past

Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”

Grad work in Meta-heuristics, Ant-algorithms

Tennessee Valley Authority (TVA): Hadoop and the Smartgrid

Cloudera: Principal Solution Architect

Today: Patterson Consulting

Overview

• What is Deep Learning?
• Deep Belief Networks
• Implementation on Hadoop/YARN
• Results

What is Deep Learning?

An algorithm that tries to learn simple features in lower layers and more complex features in higher layers

Interesting Properties of Deep Learning

Reduces the overfitting problem in neural networks

Introduces new techniques for “unsupervised feature learning”: more automatic ways to figure out which parts of your data you should feed into your learning algorithm

Chasing Nature

Learning sparse representations of auditory signals leads to filters that closely correspond to neurons in early audio processing in mammals

When applied to speech, learned representations showed a striking resemblance to the cochlear filters in the auditory cortex

Yann LeCun on Deep Learning

Has become the dominant method for acoustic modeling in speech recognition

Quickly becoming the dominant method for several vision tasks such as object recognition, object detection, and semantic segmentation

Deep Belief Networks

What is a Deep Belief Network?

Generative probabilistic model

Composed of one visible layer and many hidden layers

Each hidden layer learns relationships between units in the lower layer

Higher layer representations tend to become more complex

Restricted Boltzmann Machines

Unsupervised model: does feature learning by repeated sampling of the input data

Learns how to reconstruct data for good feature detection

RBMs have different formulas for different kinds of data:

• Binary
• Continuous
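To make "repeated sampling" and "learns how to reconstruct data" concrete, here is a minimal sketch of one common training rule for a binary RBM, single-step contrastive divergence (CD-1). The slides do not say which rule DL4J uses, so treat the choice of CD-1 as an assumption. Every name below (`hidden_probs`, `cd1_step`, `train_rbm`) is hypothetical and is not DL4J's API.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    # P(h_j = 1 | v) for each hidden unit j; W[i][j] links visible i to hidden j
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    # P(v_i = 1 | h) for each visible unit i
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    # Draw a binary state for each unit from its activation probability
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_step(v0, W, b, c, lr, rng):
    """One CD-1 update on one binary example; returns the squared reconstruction error."""
    h0_p = hidden_probs(v0, W, c)       # positive phase, driven by the data
    h0 = sample(h0_p, rng)              # sample hidden states
    v1_p = visible_probs(h0, W, b)      # reconstruct the input
    h1_p = hidden_probs(v1_p, W, c)     # negative phase, driven by the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * h0_p[j] - v1_p[i] * h1_p[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1_p[i])
    for j in range(len(c)):
        c[j] += lr * (h0_p[j] - h1_p[j])
    return sum((v0[i] - v1_p[i]) ** 2 for i in range(len(b)))


def train_rbm(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train on a list of binary vectors; hyperparameters are illustrative only."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b = [0.0] * n_visible   # visible biases
    c = [0.0] * n_hidden    # hidden biases
    errors = []
    for _ in range(epochs):
        errors.append(sum(cd1_step(v, W, b, c, lr, rng) for v in data))
    return W, b, c, errors
```

A continuous-input RBM would keep the same loop but change the visible-layer formula (for example, Gaussian visible units), which is the "different formulas for different kinds of data" point above.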

DeepLearning4J

Implementation in Java

Self-contained & built on Akka, Hazelcast, Jblas

Distributed to run faster and with more features than current Theano-based implementations

Talks to any data source, expects one format

Vectorized Implementation

Handles lots of data concurrently: any number of examples at once, but the code does not change

Faster: allows for native/GPU execution

One format: everything is a matrix

DL4J vs Theano Perf

GPUs are inherently faster than normal native execution

Theano is not distributed, and GPUs have very low RAM

DL4J allows for situations where you have to “throw CPUs at it”

What are Good Applications for Deep Learning?

Image Processing: High MNIST Scores

Audio Processing: Current Champ on TIMIT dataset

Text / NLP Processing: Word2vec, etc

Deep Learning on Hadoop

Past Work: Parallel Iterative Algorithms on YARN

Started with

• Parallel linear, logistic regression
• Parallel Neural Networks

Packaged in Metronome: 100% Java, ASF 2.0 Licensed, on github


Parameter Averaging

McDonald, 2010: Distributed Training Strategies for the Structured Perceptron

Langford, 2007: Vowpal Wabbit

Jeff Dean’s Work on Parallel SGD: DownPour SGD


MapReduce vs. Parallel Iterative

[Diagram: MapReduce (Input, then Map / Map / Map, then Reduce / Reduce, then Output) next to Parallel Iterative (a set of Processors running Superstep 1, Superstep 2, . . .)]


SGD: Serial vs Parallel

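[Diagram: serial SGD trains a single Model over all of the Training Data; parallel SGD gives Split 1, Split 2, Split 3, … to Worker 1, Worker 2, … Worker N, each of which produces a Partial Model that the Master combines into the Global Model]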
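The master/worker picture can be sketched in a few lines. This is a single-process illustration of parameter averaging, assuming a least-squares linear model so the example stays small. The function names (`sgd_pass`, `average`, `train`, `make_splits`) are hypothetical and are not the Metronome or IterativeReduce API, and the workers run one after another here rather than in YARN containers.

```python
import random


def sgd_pass(weights, split, lr):
    """Worker step: one serial SGD pass of least-squares regression over one split."""
    w = list(weights)  # each worker starts from a copy of the global model
    for x, y in split:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w  # the partial model


def average(models):
    """Master step: element-wise mean of the workers' partial models."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def train(splits, n_features, supersteps=50, lr=0.05):
    """Each superstep: every worker trains on its split, then the master averages."""
    global_model = [0.0] * n_features
    for _ in range(supersteps):
        # On a cluster these calls would run in parallel, one per data split.
        partials = [sgd_pass(global_model, split, lr) for split in splits]
        global_model = average(partials)
    return global_model


def make_splits(n_workers, per_worker, true_w, seed=0):
    """Hypothetical noise-free toy data, already partitioned one split per worker."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_workers):
        split = []
        for _ in range(per_worker):
            x = [1.0] + [rng.uniform(-1.0, 1.0) for _ in true_w[1:]]
            y = sum(t * xi for t, xi in zip(true_w, x))
            split.append((x, y))
        splits.append(split)
    return splits
```

For example, `train(make_splits(3, 100, [1.0, 2.0, -3.0]), n_features=3)` recovers weights close to the generating ones. The same loop applies to a neural network by treating all of its weights as one flat parameter vector.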

Managing Resources

Running through YARN on Hadoop is important

Allows for workflow scheduling

Allows for scheduler oversight

Allows the jobs to be first-class citizens on Hadoop and share resources nicely

Parallelizing Deep Belief Networks

Two-phase training

• Pre Train
• Fine tune

Each phase can do multiple passes over the dataset

Entire network is averaged at master

PreTrain and Lots of Data

We’re exploring how to better leverage the unsupervised aspects of the PreTrain phase of Deep Belief Networks

Allows for the use of far less labeled data

Allows us to more easily model the massive amounts of structured data in HDFS

Results

DBNs on IterativeReduce (IR) Performance

Faster to train

Parameter averaging is an automatic form of regularization

Adagrad with IR allows for better generalization of different features and even pacing
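The "even pacing" remark refers to how Adagrad scales each parameter's step by that parameter's own gradient history. Below is a minimal sketch of the standard Adagrad update. The function name and default values are assumptions for illustration and do not come from DL4J or IterativeReduce.

```python
import math


def adagrad_update(weights, grads, hist, lr=0.1, eps=1e-8):
    """In-place Adagrad step; hist[i] is the running sum of squared gradients for weight i."""
    for i, g in enumerate(grads):
        hist[i] += g * g
        # Divide by the root of the accumulated squared gradient, so features with
        # large gradients and features with small gradients move at a similar pace.
        weights[i] -= lr * g / (math.sqrt(hist[i]) + eps)
```

On the first step, a gradient of 10.0 and a gradient of 0.1 both move their weights by roughly `lr`. Later steps shrink as `hist` accumulates.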

Scale Out Metrics

Batches of records can be processed by as many workers as there are data splits

Message passing overhead is minimal

Exhibits linear scaling

Example: 3x workers, 3x faster learning

Usage From Command Line

Run Deep Learning on Hadoop

yarn jar iterativereduce-0.1-SNAPSHOT.jar [props file]

Evaluate model

./score_model.sh [props file]

Handwriting Renders

Faces Renders

…In Which We Gather Lots of Cat Photos

Future Direction

• GPUs
• Better Vectorization tooling
• Move YARN version back over to JBLAS for matrices

References

“A Fast Learning Algorithm for Deep Belief Nets”, Hinton, G. E., Osindero, S. and Teh, Y. - Neural Computation (2006)

“Large Scale Distributed Deep Networks”, Dean, Corrado, Monga - NIPS (2012)

“Visually Debugging Restricted Boltzmann Machine Training with a 3D Example”, Yosinski, Lipson - Representation Learning Workshop (2012)