FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, SICS
SICS Data Science Day, October 21st, 2015

Apache Flink

What is Apache Flink?

● Large-scale data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
  ○ true streaming dataflow engine
  ○ custom memory manager
  ○ native iterations
  ○ cost-based optimizer

What does Flink give us?

● Expressive APIs
● Pipelined stream processor
● Closed loop iterations

Expressive APIs

● Main distributed data abstraction: DataSet
● Program using functional-style transformations, creating a Dataflow.

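import org.apache.flink.api.scala._

case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
val lines: DataSet[String] = env.readTextFile(...)

// Split each line into words, then sum the frequency per distinct word
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()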

Pipelined Stream Processor

Iterate in the Dataflow

Iterate by looping

● Loop in client submits one job per iteration step
● Reuse data by caching in memory or disk

Iterate in the Dataflow

Delta iterations

Learn more in Vasia’s Gelly talk!

Large-scale Machine Learning

What do we mean?

● Small-scale learning
  ○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
  ○ We have a large-scale learning problem when the active budget constraint is the computing time.

Source: Léon Bottou

Deep learning

What do we mean?

● What about the complexity of the problem?

“When you get to a trillion [parameters], you’re getting to something that’s got a chance of really understanding some stuff.” - Hinton, 2013

Source: Wired Magazine

What do we mean?

● We have a large-scale learning problem when the active budget constraint is the computing time and/or the model complexity.

FlinkML

● New effort to bring large-scale machine learning to Flink
● Goals:
  ○ Truly scalable implementations
  ○ Keep glue code to a minimum
  ○ Ease of use

FlinkML: Overview

● Supervised Learning
  ○ Optimization framework
  ○ SVM
  ○ Multiple linear regression
● Recommendation
  ○ Alternating Least Squares (ALS)
● Pre-processing
  ○ Polynomial features
  ○ Feature scaling
● sklearn-like ML pipelines

FlinkML API

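import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.regression.MultipleLinearRegression

// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

// Configure the learner: step size, iteration cap and convergence threshold
val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)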

FlinkML Pipelines

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)

State of the art in large-scale ML

Alternating Least Squares

R ≅ X ✕ Y

(Figure: matrix R with axes labelled Users and Items, approximated by the product of X and Y.)
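
A minimal sketch of the alternating idea behind R ≅ X ✕ Y, written with plain Scala collections rather than Flink. It assumes a fully observed matrix, rank-1 factors and a regularization constant `lambda`; `Rank1Als` and its methods are hypothetical names for illustration, not FlinkML's ALS implementation.

```scala
object Rank1Als {
  // Solve for one side's factors while the other side is held fixed.
  // For rank 1 the least-squares solution per row is a scalar:
  //   x_u = sum_i(r_ui * y_i) / (sum_i(y_i^2) + lambda)
  def solve(ratings: Array[Array[Double]], fixed: Array[Double], lambda: Double): Array[Double] = {
    val denom = fixed.map(v => v * v).sum + lambda
    ratings.map(row => row.zip(fixed).map { case (r, f) => r * f }.sum / denom)
  }

  // Alternate between updating user factors x and item factors y.
  def fit(r: Array[Array[Double]], iterations: Int, lambda: Double): (Array[Double], Array[Double]) = {
    var x = Array.fill(r.length)(1.0)
    var y = Array.fill(r.head.length)(1.0)
    val rT = r.transpose
    for (_ <- 0 until iterations) {
      x = solve(r, y, lambda)  // fix item factors, solve user factors
      y = solve(rT, x, lambda) // fix user factors, solve item factors
    }
    (x, y)
  }

  // Squared reconstruction error of R against the outer product of x and y.
  def error(r: Array[Array[Double]], x: Array[Double], y: Array[Double]): Double =
    (for (u <- r.indices; i <- r(u).indices)
      yield math.pow(r(u)(i) - x(u) * y(i), 2)).sum
}
```

Holding one factor fixed turns the problem into ordinary least squares for the other, which is why the two `solve` calls can simply alternate.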

Naive Alternating Least Squares

Blocked Alternating Least Squares

Blocked ALS performance

FlinkML blocked ALS performance

Going beyond SGD in large-scale optimization

CoCoA: Communication Efficient Coordinate Ascent

● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates
● Average over batch size → Average over K (nodes) << batch size
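
A minimal sketch of the last bullet only: combining per-node updates by averaging over the K nodes. It assumes each node has already run its local solver and produced an update vector; the local dual coordinate ascent step is omitted, and `CocoaAveraging.averageUpdates` is a hypothetical helper, not a FlinkML call.

```scala
object CocoaAveraging {
  // w_new = w + (1 / K) * sum_k deltaW_k, where K is the number of nodes.
  def averageUpdates(w: Vector[Double], localUpdates: Seq[Vector[Double]]): Vector[Double] = {
    require(localUpdates.nonEmpty, "need at least one node update")
    val k = localUpdates.size
    // Element-wise sum of the K local update vectors.
    val summed = localUpdates.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
    w.zip(summed).map { case (wi, si) => wi + si / k }
  }
}
```

Because the divisor is the number of nodes rather than the number of examples in a batch, each round of communication carries the result of many local steps.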

Primal-dual framework

Source: Smith (2014)

Immediately Apply Updates

Source: Smith (2014)

Average over nodes (K) instead of batches

Source: Smith (2014)

CoCoA: Communication Efficient Coordinate Ascent

CoCoA performance

Source: Jaggi (2014)

Available on FlinkML

SVM

Achieving model parallelism: The parameter server

● The parameter server is essentially a distributed key-value store with two basic commands: push and pull
  ○ push updates the model
  ○ pull retrieves a (lazily) updated model
● Allows us to store a model across multiple nodes, and to read and update it as needed.
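
A minimal single-process sketch of the push/pull interface described above. It assumes additive updates keyed by string; `ToyParameterServer` is a hypothetical class for illustration. A real parameter server shards keys across nodes and may serve lazily updated values, neither of which is modelled here.

```scala
import scala.collection.mutable

// Toy stand-in for a parameter server: a key-value store holding model
// weights, with push (apply an update) and pull (read the current value).
class ToyParameterServer {
  private val weights = mutable.Map.empty[String, Double].withDefaultValue(0.0)

  // push: add a delta to the stored value for a key.
  def push(key: String, delta: Double): Unit = synchronized {
    weights(key) = weights(key) + delta
  }

  // pull: read the current value for a key (0.0 if it was never pushed).
  def pull(key: String): Double = synchronized { weights(key) }
}
```

Workers would call `pull` for the weights they need, compute a local update, and `push` only the deltas back.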

Architecture of a parameter server communicating with groups of workers.

Source: Li (2014)

Comparison with other large-scale learning systems.

Source: Li (2014)

Dealing with stragglers: SSP Iterations

● BSP: Bulk Synchronous Parallel
  ○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
  ○ Every worker can work individually and update the model as needed.
  ○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
  ○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
  ○ Allows for progress, while keeping convergence guarantees.
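
A minimal sketch of the bounded-staleness rule, assuming each worker tracks how many iterations it has completed. `StaleSync.mayProceed` and the `staleness` parameter (the K in the SSP bullet) are hypothetical names; this is not the Flink SSP implementation.

```scala
object StaleSync {
  // clocks(i) is the number of iterations worker i has completed.
  // Worker `id` may start its next iteration only while it is at most
  // `staleness` completed iterations ahead of the slowest worker.
  def mayProceed(id: Int, clocks: IndexedSeq[Int], staleness: Int): Boolean =
    clocks(id) - clocks.min <= staleness
}
```

With `staleness = 0` the rule reduces to the BSP barrier; letting it grow without bound gives ASP behaviour.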

Dealing with stragglers: SSP Iterations

Source: Ho et al. (2013)

SSP Iterations in Flink: Lasso Regression

Source: Peel et al. (2015)

To be merged soon into FlinkML

Current and future work on FlinkML

Coming soon

● Tooling
  ○ Evaluation & cross-validation framework
  ○ Predictive Model Markup Language
● Algorithms
  ○ Quad-tree kNN search
  ○ Efficient streaming decision trees
  ○ k-means and extensions
  ○ Column-wise statistics, histograms

FlinkML Roadmap

● Hyper-parameter optimization
● More communication-efficient optimization algorithms
● Generalized Linear Models
● Latent Dirichlet Allocation

Future of Machine Learning on Flink

● Streaming ML
  ○ Flink already has SAMOA bindings.
  ○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
● “Computation efficient” learning
  ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.

Recent large-scale learning systems

Source: Xing (2015)

How to get here?

Demo?

References

● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Wired: Computer Brain Escapes Google's X Lab to Supercharge Search
● Smith: CoCoA AMPCAMP Presentation
● CMU Petuum: Petuum Project
● Jaggi (2014): "Communication-efficient distributed dual coordinate ascent." NIPS 2014.
● Li (2014): "Scaling distributed machine learning with the parameter server." OSDI 2014.
● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS 2013.
● Peel (2015): "Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism." IEEE BigData 2015.
● Xing (2015): "Petuum: A New Platform for Distributed Machine Learning on Big Data." KDD 2015.

I would like to thank professor Eric Xing for his permission to use parts of the structure from his great tutorial on large-scale machine learning: A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning

“Demo”
