MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Metronome: YARN and Parallel Iterative Algorithms

Upload: josh-patterson

Post on 06-May-2015


DESCRIPTION

Online learning techniques such as Stochastic Gradient Descent (SGD) are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session we look at how we parallelize parameter estimation for linear models on the next-generation Hadoop framework YARN, using the IterativeReduce framework and the parallel machine learning library Metronome. We also look at non-linear modeling with the introduction of parallel neural network training in Metronome.

TRANSCRIPT

Page 1: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Metronome

YARN and Parallel Iterative Algorithms

Page 2: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Josh Patterson

Email:

[email protected]

Twitter:

@jpatanooga

Github:

https://github.com/jpatanooga

Past

Published in IAAI-09:

“TinyTermite: A Secure Routing Algorithm”

Grad work in Meta-heuristics, Ant-algorithms

Tennessee Valley Authority (TVA)

Hadoop and the Smartgrid

Cloudera

Principal Solution Architect

Today: Consultant

Page 3: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Sections

1. Parallel Iterative Algorithms

2. Parallel Neural Networks

3. Future Directions

Page 4: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

YARN, IterativeReduce and Hadoop

Parallel Iterative Algorithms

Page 5: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Machine Learning and Optimization

Direct Methods

Normal Equation

Iterative Methods

Newton’s Method

Quasi-Newton

Gradient Descent

Heuristics

AntNet

PSO

Genetic Algorithms

Page 6: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Linear Regression

In linear regression, data is modeled using linear predictor functions

Unknown model parameters are estimated from the data

We use optimization techniques like Stochastic Gradient Descent to find the coefficients of the model

Y = (c0*x0) + (c1*x1) + … + (cN*xN), with x0 = 1 for the intercept

Page 7: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Stochastic Gradient Descent

Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11

Hypothesis about data

Cost function

Update function
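For reference, the standard forms of those three pieces for linear regression, in the notation of the Coursera tutorial linked above (the slide itself only names them):

    h_\theta(x) = \sum_{j=0}^{N} \theta_j x_j
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2
    \theta_j := \theta_j - \alpha \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_j^{(i)}  \quad \text{(SGD update, one example at a time)}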

Page 8: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Stochastic Gradient Descent

Training

Simple gradient descent procedure

Loss function needs to be convex (with exceptions)

Linear Regression

Loss Function: squared error of prediction

Prediction: linear combination of coefficients and input variables

[Diagram: SGD consumes the Training Data and produces a Model]
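To make the serial case concrete, here is a minimal sketch of one SGD epoch for linear regression in Java. It is illustrative only; the Example class and epoch() method are made-up names, not Metronome code:

    import java.util.List;

    /** Minimal serial SGD for linear regression (illustrative only). */
    public class SerialLinearRegressionSGD {

      public static class Example {
        public final double[] x;  // features, x[0] == 1.0 for the intercept
        public final double y;    // target value
        public Example(double[] x, double y) { this.x = x; this.y = y; }
      }

      /** One pass (epoch) over the training data, updating the coefficients in place. */
      public static void epoch(double[] theta, List<Example> data, double learningRate) {
        for (Example e : data) {
          // prediction = linear combination of coefficients and input variables
          double prediction = 0.0;
          for (int j = 0; j < theta.length; j++) {
            prediction += theta[j] * e.x[j];
          }
          // squared-error loss gradient for a single example
          double error = prediction - e.y;
          for (int j = 0; j < theta.length; j++) {
            theta[j] -= learningRate * error * e.x[j];
          }
        }
      }
    }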

Page 9: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Mahout’s SGD

Currently Single Process

Multi-threaded parallel, but not cluster parallel

Runs locally, not deployed to the cluster

Tied to logistic regression implementation

Page 10: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Distributed Learning Strategies

McDonald, 2010

Distributed Training Strategies for the Structured Perceptron

Langford, 2007

Vowpal Wabbit

Jeff Dean’s Work on Parallel SGD

DownPour SGD

Page 11: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

MapReduce vs. Parallel Iterative

[Diagram: the single-pass MapReduce flow (Input → Map → Reduce → Output) contrasted with a parallel iterative flow in which a set of processors advances through Superstep 1, Superstep 2, and so on]

Page 12: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


YARN

Yet Another Resource Negotiator

Framework for scheduling distributed applications

Allows any type of parallel application to run natively on Hadoop

MRv2 is now a distributed application

Page 13: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


IterativeReduce API

ComputableMaster

Setup()

Compute()

Complete()

ComputableWorker

Setup()

Compute()

[Diagram: IterativeReduce supersteps; in each superstep the Workers compute partial updates and the Master merges them, repeating until the final pass]
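A hedged sketch of what implementing this API might look like. The method names come from the slide; the generic type and parameter lists are assumptions for illustration, and the real interfaces live in the IterativeReduce repository linked at the end of the deck:

    import java.util.Collection;

    // Hypothetical, simplified signatures; not the actual IterativeReduce API.
    interface ComputableMaster<T> {
      void setup();                            // initialize global state before the first superstep
      T compute(Collection<T> workerUpdates);  // merge the workers' partial updates into a new global update
      void complete();                         // finalize / persist the resulting model
    }

    interface ComputableWorker<T> {
      void setup();                            // load this worker's split of the training data
      T compute(T currentGlobalUpdate);        // run a local pass and emit a partial update
    }

In each superstep the framework would run every worker's compute(), hand the partial updates to the master's compute(), and broadcast the merged update back to the workers, matching the diagram above.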

Page 14: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

SGD: Serial vs Parallel

[Diagram: serial SGD trains one Model over the full Training Data; parallel SGD splits the data (Split 1, Split 2, Split 3), each Worker (Worker 1 through Worker N) produces a Partial Model from its split, and the Master merges them into a Global Model]
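The master-side merge under simple parameter averaging, one of the "parameter averaging variations" mentioned later in the deck, sketched in Java (illustrative, not the actual Metronome merge logic; the class and method names are made up):

    /**
     * Illustrative parameter averaging: the master averages the coefficient
     * vectors produced by the workers to form the new global model.
     */
    public class ParameterAveraging {

      public static double[] average(java.util.List<double[]> partialModels) {
        int n = partialModels.get(0).length;
        double[] global = new double[n];
        for (double[] partial : partialModels) {
          for (int j = 0; j < n; j++) {
            global[j] += partial[j];
          }
        }
        for (int j = 0; j < n; j++) {
          global[j] /= partialModels.size();
        }
        return global;
      }
    }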

Page 15: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Parallel Iterative Algorithms on YARN

Based directly on work we did with Knitting Boar

Parallel logistic regression

And then added

Parallel linear regression

Parallel Neural Networks

Packaged in a new suite of parallel iterative algorithms called Metronome

100% Java, ASF 2.0 licensed, on GitHub

Page 16: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Linear Regression Results

[Chart: Linear Regression, parallel vs. serial runs; x-axis: total megabytes processed (64 to 320), y-axis: total processing time]

Page 17: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Logistic Regression: 20Newsgroups

Input Size vs. Processing Time

[Chart: input size vs. processing time, comparing the OLR (serial) and POLR (parallel) logistic regression runs]

Page 18: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Convergence Testing

Debugging parallel iterative algorithms during testing is hard

Processes on different hosts are difficult to observe

Using the IRUnit unit-testing framework, we can simulate the IterativeReduce framework

We know the plumbing of message passing works

Allows us to focus on parallel algorithm design/testing while still using standard debugging tools
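To make that concrete, here is a toy, fully in-process convergence check in the same spirit (plain Java, not the actual IRUnit API, whose real entry points are in the tests linked at the end of the deck): two simulated workers each run one local SGD epoch on their split of a synthetic linear problem, the "master" averages their coefficients, and the check asserts the averaged model converges to the known solution.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Toy in-process convergence check (illustrative; not the IRUnit API).
     * Two simulated workers train on splits of a noiseless y = 3 + 2x dataset;
     * the "master" averages their coefficients after every local epoch.
     */
    public class SimulatedConvergenceCheck {

      public static void main(String[] args) {
        // each row is {1 (intercept), x, y}
        double[][][] splits = {
          { {1, -3, -3}, {1, -1, 1}, {1, 1, 5} },   // worker 1's split
          { {1, -2, -1}, {1, 2, 7}, {1, 3, 9} }     // worker 2's split
        };
        double learningRate = 0.02;
        double[] global = new double[2];

        for (int superstep = 0; superstep < 200; superstep++) {
          List<double[]> partials = new ArrayList<>();
          for (double[][] split : splits) {
            double[] theta = global.clone();
            for (double[] row : split) {            // one local SGD epoch
              double error = theta[0] * row[0] + theta[1] * row[1] - row[2];
              theta[0] -= learningRate * error * row[0];
              theta[1] -= learningRate * error * row[1];
            }
            partials.add(theta);
          }
          // master step: average the partial models into the new global model
          global[0] = (partials.get(0)[0] + partials.get(1)[0]) / 2.0;
          global[1] = (partials.get(0)[1] + partials.get(1)[1]) / 2.0;
        }

        // convergence assertion: the coefficients should approach (3, 2)
        if (Math.abs(global[0] - 3.0) > 0.1 || Math.abs(global[1] - 2.0) > 0.1) {
          throw new AssertionError("did not converge: " + global[0] + ", " + global[1]);
        }
        System.out.printf("converged to (%.3f, %.3f)%n", global[0], global[1]);
      }
    }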

Page 19: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Let’s Get Non-Linear

Parallel Neural Networks

Page 20: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

What are Neural Networks?

Inspired by the nervous systems of biological organisms

Models layers of neurons in the brain

Can learn non-linear functions

Recently enjoying a surge in popularity

Page 21: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Multi-Layer Perceptron

First layer has input neurons

Last layer has output neurons

Each neuron in a layer is connected to all neurons in the next layer

Each neuron has an activation function, typically sigmoid/logistic

The input to a neuron is the sum of weight * input over its incoming connections
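A minimal sketch of that forward pass for one fully connected layer in Java (illustrative, not Metronome's network code; the class and field names are made up):

    /** Forward pass for one fully connected layer with sigmoid activations (illustrative). */
    public class SigmoidLayer {

      private final double[][] weights;  // weights[j][i] = weight from input i to neuron j
      private final double[] biases;     // one bias per neuron

      public SigmoidLayer(double[][] weights, double[] biases) {
        this.weights = weights;
        this.biases = biases;
      }

      public double[] forward(double[] input) {
        double[] output = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
          // input to the neuron: weighted sum of its connections plus bias
          double net = biases[j];
          for (int i = 0; i < input.length; i++) {
            net += weights[j][i] * input[i];
          }
          // sigmoid / logistic activation
          output[j] = 1.0 / (1.0 + Math.exp(-net));
        }
        return output;
      }
    }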

Page 22: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Backpropagation Learning

Calculates the gradient of the network's error with respect to the network's modifiable weights

Intuition

Run forward pass of example through network

Compute activations and output

Iterating backwards from the output layer to the input layer

For each neuron in the layer

Compute the neuron's responsibility for the error

Update weights on connections
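As a rough illustration of the update inside that loop, here is the output-layer case for sigmoid units with squared error (a sketch, not Metronome's implementation; the names are made up):

    /** Illustrative backprop step for the output layer of a sigmoid network. */
    public class BackpropStep {

      /**
       * Computes output-layer deltas and updates the weights into that layer.
       * delta_j = (output_j - target_j) * output_j * (1 - output_j)
       * w[j][i] -= learningRate * delta_j * hidden_i
       */
      public static void updateOutputLayer(double[][] weights, double[] hidden,
                                           double[] output, double[] target,
                                           double learningRate) {
        for (int j = 0; j < output.length; j++) {
          // the neuron's responsibility for the error (includes the sigmoid derivative)
          double delta = (output[j] - target[j]) * output[j] * (1.0 - output[j]);
          for (int i = 0; i < hidden.length; i++) {
            weights[j][i] -= learningRate * delta * hidden[i];
          }
        }
      }
    }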

Page 23: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Parallelizing Neural Networks

Dean, (NIPS, 2012)

First Steps: Focus on linear convex models, calculating distributed gradient

Model Parallelism must be combined with distributed optimization that leverages data parallelization

simultaneously process distinct training examples in each of the many model replicas

periodically combine their results to optimize our objective function

Single pass frameworks such as MapReduce “ill-suited”

Page 24: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Costs of Neural Network Training

Connections count explodes quickly as neurons and layers increase

Example: a {784, 450, 10} network has 784 × 450 + 450 × 10 = 357,300 connections

Need fast iterative framework

Example: with a 30-second MapReduce job setup cost and 10,000 epochs: 30s × 10,000 = 300,000 seconds of setup time alone

That is 5,000 minutes, or roughly 83 hours

3 ways to speed up training

Subdivide the dataset between workers (data parallelism)

Maximize disk transfer rates and use vector caching to maximize data throughput

Minimize inter-epoch setup times with proper iterative framework

Page 25: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Vector In-Memory Caching

Since we make lots of passes over the same dataset

In memory caching makes sense here

Once a record is vectorized it is cached in memory on the worker node

Speedup (single pass, “no cache” vs “cached”):

~12x
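A minimal sketch of the idea, assuming a simple map keyed by the raw record (not Metronome's actual cache; the class and the toy comma-separated vectorizer are made up): vectorize each record once on first access, then serve every later epoch from memory.

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative per-worker cache: vectorize each record once, reuse it on every epoch. */
    public class VectorCache {

      private final Map<String, double[]> cache = new HashMap<>();

      /** Returns the cached vector for a raw record, vectorizing it on first access. */
      public double[] vectorize(String rawRecord) {
        return cache.computeIfAbsent(rawRecord, VectorCache::parse);
      }

      // toy vectorizer: comma-separated numeric values
      private static double[] parse(String record) {
        String[] fields = record.split(",");
        double[] vector = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
          vector[i] = Double.parseDouble(fields[i].trim());
        }
        return vector;
      }
    }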

Page 26: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Neural Networks Parallelization Speedup

[Chart: training speedup factor (1x to 6x) vs. number of parallel processing units (1 to 5), for UCI Iris, UCI Lenses, UCI Wine, UCI Dermatology, and a downsampled NIST handwriting dataset]

Page 27: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Going Forward

Future Directions

Page 28: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Lessons Learned

Linear scaling continues to be achieved with variations on parameter averaging

Tuning is critical

Need to be good at selecting a learning rate

Page 29: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Future Directions

AdaGrad (adaptive per-parameter learning rates for SGD; see the sketch after this list)

Parallel Quasi-Newton Methods

L-BFGS

Conjugate Gradient

More Neural Network Learning Refinement

Training progressively larger networks
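For reference, a sketch of the standard AdaGrad update the first item refers to (illustrative Java, not Metronome code; the class name is made up): each parameter keeps a running sum of its squared gradients and scales its learning rate by the inverse square root of that sum.

    /** Illustrative AdaGrad update: per-parameter adaptive learning rates for SGD. */
    public class AdaGradUpdate {

      private final double[] sumSquaredGradients;
      private final double baseLearningRate;
      private static final double EPSILON = 1e-8;  // avoids division by zero

      public AdaGradUpdate(int numParameters, double baseLearningRate) {
        this.sumSquaredGradients = new double[numParameters];
        this.baseLearningRate = baseLearningRate;
      }

      /** Applies one AdaGrad step to the parameters, given the current gradient. */
      public void apply(double[] parameters, double[] gradient) {
        for (int j = 0; j < parameters.length; j++) {
          sumSquaredGradients[j] += gradient[j] * gradient[j];
          double adaptedRate = baseLearningRate / (Math.sqrt(sumSquaredGradients[j]) + EPSILON);
          parameters[j] -= adaptedRate * gradient[j];
        }
      }
    }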

Page 30: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Github

IterativeReduce

https://github.com/emsixteeen/IterativeReduce

Metronome

https://github.com/jpatanooga/Metronome

Page 31: MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Unit Testing and IRUnit

Simulates the IterativeReduce parallel framework

Uses the same app.properties file that YARN applications do

Examples

https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java

https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java