Distributed Machine Learning 101 Using Apache Spark from a Browser (Devoxx.be 2015)
TRANSCRIPT
![Page 1: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/1.jpg)
Distributed Machine Learning using Apache Spark from the Browser
Devoxx Belgium 2015, Antwerpen
![Page 2: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/2.jpg)
● Distributed computing
● What is Machine Learning?
● Spark for machine learning?
● Spark MLlib by examples
● Spark and other libraries
● Wrap up
Outline
![Page 3: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/3.jpg)
Data Fellas
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Trainer Spark, Machine Learning
![Page 4: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/4.jpg)
Distributed Computing: Why you must care, by Data Fellas
Andy Petrella & Xavier Tordoir
![Page 5: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/5.jpg)
Traditionally, tasks are entirely performed on a single computer using three main resources.
Uba ga!
Computing
Processing Power, Memory, Storage
![Page 6: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/6.jpg)
Computing
Oh no!
Hence performance is limited in time and space
Processing Power, Memory, Storage
TIME, SPACE
![Page 7: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/7.jpg)
Distributed computing: [...] A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
The components interact with each other in order to achieve a common goal. [...].
Ref: https://en.wikipedia.org/wiki/Distributed_computing
Distributing
Interesting
![Page 8: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/8.jpg)
Consequences
Oh no!
Algorithms have to work on DATA Partitions and with partial results
The entire dataset cannot be accessed at once
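A minimal sketch of what working on partitions with partial results can look like, in plain Python rather than Spark. The three partitions and the helper names `partial_sum_count` and `merge` are hypothetical; the point is only that each node computes a small partial result locally and only those results are combined.

```python
# Hypothetical dataset, split into partitions that would live on different nodes.
partitions = [
    [2.0, 4.0, 6.0],   # held by node 1
    [8.0],             # held by node 2
    [10.0, 12.0],      # held by node 3
]

def partial_sum_count(partition):
    """Runs locally on one partition and returns a small partial result."""
    return (sum(partition), len(partition))

def merge(a, b):
    """Combines two partial results; only these small tuples need to travel."""
    return (a[0] + b[0], a[1] + b[1])

# Each node computes its partial result without seeing the other partitions.
partials = [partial_sum_count(p) for p in partitions]

# The partial results are merged into the global one.
total, count = partials[0]
for p in partials[1:]:
    total, count = merge((total, count), p)

mean = total / count
print(mean)
```

The mean is never computed from the full dataset in one place: it is rebuilt from per-partition sums and counts.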
![Page 9: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/9.jpg)
New resource!
Damned
Processing Power, Memory, Storage
SPACE
Network
TIME
The network will impact performance...
![Page 10: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/10.jpg)
Oops did it again
Distributing
[Diagram: four nodes, each with its own processing, memory and storage, linked by a network]
![Page 11: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/11.jpg)
Drawback: Partition
Huh?
[Diagram: four nodes (processing, memory, storage) and the network]
![Page 12: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/12.jpg)
Drawback: Partition
Hey, you sank my node!
[Diagram: three nodes (processing, memory, storage) on the network; a fourth node goes BOOM]
![Page 13: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/13.jpg)
Ouch, my rack
Advantage: Elastic scaling
[Diagram: one cluster of four nodes (processing, memory, storage) linked by a network]
What if this cluster happens to not be big enough?
![Page 14: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/14.jpg)
That’s more reasonable
Advantage: Elastic scaling
[Diagram: two clusters of four nodes each (processing, memory, storage), joined by a network]
![Page 15: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/15.jpg)
HPC: computationally intensive applications
Model: specialized hardware (CPU/GPU) and network
They are orchestrated by a scheduler that gathers their computing power and memory.
Yeah! what about?
What about HPC?
![Page 16: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/16.jpg)
Drawbacks:
● Costs and upgrades by large blocks
● Decoupled storage
storage latency = no streaming / no iteration
Got No Money and NO time
What about HPC?
![Page 17: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/17.jpg)
Why process data if not to model?
Machine learning: iterative (streaming & batch)
Data is aggregated in the form of a model (parameters)
Data changes little, model is small
Do that baby!
Iterate
![Page 18: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/18.jpg)
Iterate
you gotta be kidding
[Diagram: four processing/memory nodes and separate storage units]
Moving lots of data again and again...
![Page 19: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/19.jpg)
Distributed computing allows cost-effective parallelism
Efficiency requires distributed storage
Colocated with the processing units
What about programming models?
Summary
Interesting
![Page 20: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/20.jpg)
Distributed storage
Partitions!
HDFS: Apache implementation of Google FS
● Natural fit for distributed storage
● Works as a service
Other chunked sources...
● Apache Cassandra, S3, Tachyon,...
![Page 21: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/21.jpg)
Distributed storage
Split da
[Diagram: Name Node and Data Nodes 1 to 4. A 256 MB file is put to /data/f256.txt with replication factor 2.]
![Page 22: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/22.jpg)
Distributed storage
Split da
[Diagram: Name Node and Data Nodes 1 to 4. The 256 MB file (put /data/f256.txt, replication factor 2) is split into four 64 MB blocks.]
![Page 23: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/23.jpg)
Distributed storage
Everywhere
[Diagram: Name Node and Data Nodes 1 to 4. The 64 MB blocks of the 256 MB file are put on Data Nodes, starting with /data/f256.txt/part-r-00000 (64 MB).]
![Page 24: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/24.jpg)
Distributed storage
Everywhere
[Diagram: Name Node and Data Nodes 1 to 4. The 64 MB blocks are distributed across the Data Nodes.]
![Page 25: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/25.jpg)
Distributed storage
Replicate
[Diagram: Name Node and Data Nodes 1 to 4. The 64 MB blocks are replicated across the Data Nodes (replication factor 2).]
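The arithmetic behind these storage slides can be sketched in a few lines of plain Python. This is an illustration only: the helper names are hypothetical, and the round-robin placement is an assumption made for the example, not how HDFS actually chooses Data Nodes.

```python
import math

def split_into_blocks(file_size_mb, block_size_mb):
    """Returns the sizes of the blocks a file is cut into."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = []
    remaining = file_size_mb
    for _ in range(n_blocks):
        sizes.append(min(block_size_mb, remaining))
        remaining -= sizes[-1]
    return sizes

def place_replicas(n_blocks, data_nodes, replication):
    """Assigns each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    placement = {}
    for b in range(n_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
    return placement

# Values taken from the slides: a 256 MB file, 64 MB blocks, replication factor 2.
blocks = split_into_blocks(256, 64)
nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]
placement = place_replicas(len(blocks), nodes, replication=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], "MB ->", holders)
```

With these numbers the file becomes four blocks and eight block copies, so losing any single Data Node still leaves one copy of every block.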
![Page 26: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/26.jpg)
Map Reduce: High Level Execution
The rocket’s base
[Diagram: four data parts]
Load the data
![Page 27: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/27.jpg)
Map Reduce: High Level Execution
The rocket’s engines
[Diagram: four data parts, each feeding a mapper]
Map and Pair
![Page 28: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/28.jpg)
Map Reduce: High Level Execution
The rocket’s trunk
[Diagram: four data parts, four mappers, GroupByKey]
Shuffle Pairs using Keys
![Page 29: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/29.jpg)
Map Reduce: High Level Execution
The rocket’s cockpit
[Diagram: four data parts, four mappers, GroupByKey, three reducers]
Values per key are Reduced
![Page 30: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/30.jpg)
Map Reduce: High Level Execution
The rocket’s tip
[Diagram: four data parts, four mappers, GroupByKey, three reducers, Results]
We collect the results
![Page 31: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/31.jpg)
Map Reduce: High Level Execution
To the infinite and beyond!
[Diagram: four data parts, four mappers, GroupByKey, three reducers, Results]
The whole#!
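The whole flow (load, map and pair, shuffle by key, reduce, collect) can be sketched with a word count in plain Python. The input strings and the function names `mapper`, `group_by_key` and `reducer` are made up for the illustration; a real framework would run the mappers and reducers on different machines.

```python
from collections import defaultdict

# Load the data: each string stands for one data part.
data_parts = [
    "spark makes map reduce fast",
    "map reduce is simple",
    "spark is fast",
]

def mapper(part):
    """Map and pair: emits a (key, value) pair for every word."""
    return [(word, 1) for word in part.split()]

def group_by_key(pairs):
    """Shuffle: brings all values with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: the values of one key are combined into one result."""
    return (key, sum(values))

mapped = [pair for part in data_parts for pair in mapper(part)]
grouped = group_by_key(mapped)
results = dict(reducer(k, v) for k, v in grouped.items())  # collect the results
print(results)
```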
![Page 32: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/32.jpg)
Map Reduce Matrix-Vector Product
How about word count?
[Figure: matrix-vector product]
![Page 33: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/33.jpg)
Map Reduce Matrix-Vector Product
Back to school...
[Figure: matrix-vector product]
![Page 34: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/34.jpg)
Map Reduce Matrix-Vector Product
Wait, that’s maths
[Figure: matrix-vector product]
![Page 35: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/35.jpg)
Map Reduce Matrix-Vector Product
Where is the RAT?
Store Matrix as ordered
Vector V loaded in memory as ordered
Map function:
Each matrix element mapped on a product
![Page 36: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/36.jpg)
Map Reduce Matrix-Vector Product
OK … I TAKE OVER
MAP
![Page 37: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/37.jpg)
Map Reduce Matrix-Vector Product
just a sum …
REDUCE
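A small sketch of the matrix-vector product as map and reduce steps, in plain Python. The slide formulas are not in this transcript, so the representation is an assumption: the matrix is stored as `(i, j, m_ij)` triples, the vector `v` is held in memory, the map step emits `(i, m_ij * v[j])`, and the reduce step sums per row.

```python
from collections import defaultdict

# Assumed storage format: one (row, column, value) triple per matrix element.
# This encodes M = [[1, 2], [3, 4]].
matrix_entries = [
    (0, 0, 1.0), (0, 1, 2.0),
    (1, 0, 3.0), (1, 1, 4.0),
]
v = [10.0, 20.0]  # the vector, small enough to be loaded in memory

def map_entry(entry):
    """MAP: each matrix element is mapped on a product, keyed by its row."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])

def reduce_row(values):
    """REDUCE: just a sum of the products of one row."""
    return sum(values)

grouped = defaultdict(list)
for key, value in map(map_entry, matrix_entries):
    grouped[key].append(value)

x = [reduce_row(grouped[i]) for i in sorted(grouped)]
print(x)
```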
![Page 38: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/38.jpg)
Map Reduce: Summary
Summary == Reduce?
Simple Abstraction of computations, Map and Reduce
Using simple abstraction of data, key value pairs
![Page 39: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/39.jpg)
Map Reduce: Summary
So what?
Brings transparent:
● parallelization
● distribution
● fault tolerance
![Page 40: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/40.jpg)
Why Apache Spark: MapReduce on steroids
Man… Finally!
Uses
● Functional paradigm
● Lazy computations
Creates dependencies between task definitions and optimizes execution
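Lazy computation can be illustrated with Python generators, used here only as a stand-in: in Spark, transformations are recorded and nothing runs until an action asks for a result. The helpers `lazy_map` and `lazy_filter` and the `log` list are hypothetical names for this sketch.

```python
log = []  # records when work is actually performed

def lazy_map(f, source):
    """Describes a map step; the body only runs when values are requested."""
    for item in source:
        log.append(f"map {item}")
        yield f(item)

def lazy_filter(pred, source):
    """Describes a filter step on top of another lazy step."""
    for item in source:
        if pred(item):
            yield item

# Building the pipeline only defines the dependencies between the steps.
pipeline = lazy_filter(lambda x: x % 2 == 0,
                       lazy_map(lambda x: x * x, range(1, 6)))
ran_before = len(log)    # no map has run yet

# Asking for the result (the "action") triggers the whole chain.
result = list(pipeline)
ran_after = len(log)
print(ran_before, result, ran_after)
```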
![Page 41: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/41.jpg)
Why Apache Spark: MapReduce on steroids
Almost forgot that one
Can cache data in memory or local file system.
Far less IO or network.
![Page 42: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/42.jpg)
What is Machine Learning? Why you must care, by Data Fellas
Andy Petrella & Xavier Tordoir
![Page 43: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/43.jpg)
you cannot prove a vague theory is wrong
[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.
—Richard Feynman [1964]
What is Machine Learning? Science with data
Surely You’re Joking Mr…
![Page 44: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/44.jpg)
● Modelling without first principle…
What is Machine Learning? Overview
2nd law neither...
![Page 45: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/45.jpg)
● Modelling without first principle…
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
Take that Newton...
![Page 46: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/46.jpg)
● Modelling without first principle…
● Modelling dependencies from the data
What is Machine Learning? Overview
With some “a priori” knowledge
![Page 47: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/47.jpg)
● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation
What is Machine Learning? Learning Machine…
You still need a domain expert…
Like me!
Learning Machine
![Page 48: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/48.jpg)
● Estimate dependencies from data
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
[Diagram: Samples Generator, System and Learning Machine, with variables x, y, ỹ and z ?]
![Page 49: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/49.jpg)
● Estimate dependencies from data
● Minimize a risk functional over the set of candidate functions, given the data
What is Machine Learning? Overview
I like them so much in LaTeX2e
[Diagram: Samples Generator, System and Learning Machine, with variables x, y, ỹ and z ?]
![Page 50: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/50.jpg)
● Regression: continuous output
○ Risk = Prediction error
● Classification: categorical output
○ Risk = Probability of misclassification
What is Machine Learning? Supervised learning
$L(y, f(x, w)) = (y - f(x, w))^2$ …
WTF?
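The two risks named on this slide can be sketched as averages of a loss over data. This is a plain Python illustration on made-up values: squared error stands in for the regression prediction error, and the 0-1 loss averages to the observed misclassification rate. The function names are hypothetical.

```python
def squared_loss(y, y_pred):
    """Regression loss: squared prediction error."""
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    """Classification loss: 1 for a misclassification, 0 otherwise."""
    return 0 if y == y_pred else 1

def empirical_risk(loss, ys, preds):
    """Average loss over the available samples."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

# Regression: continuous output.
regression_risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

# Classification: categorical output, one label out of four is wrong.
classification_risk = empirical_risk(
    zero_one_loss, ["a", "b", "a", "b"], ["a", "a", "a", "b"]
)
print(regression_risk, classification_risk)
```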
![Page 51: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/51.jpg)
What is Machine Learning? Unsupervised learning: no output
I like clusters, especially with roasted nuts
● Clustering
○ Risk = Error Distortion (distances to center)
● Density estimation (probability densities)
![Page 52: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/52.jpg)
What is Machine Learning? Bias - Variance, Regression illustration
Playtime!
Notebook!
![Page 53: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/53.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
An inductive principle tells what to do
Finite Data
Inductive principle
Model
![Page 54: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/54.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
Empirical risk minimization
Finite Data → Model
• Function class not defined
• Loss not defined
• Optimization procedure not defined
![Page 55: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/55.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
Regularization
Finite Data → Model
• Control on penalty strength
• Penalize complexity/a priori knowledge
![Page 56: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/56.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
Early stopping rules
Finite Data → Model
• Iterative optimization
• Depends on initial params and algorithm
• Used for neural networks
• Penalize along a path
![Page 57: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/57.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
Structural Risk
Finite Data → Model
• Analytic estimates of empirical risk
![Page 58: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/58.jpg)
What is Machine Learning? Inductive principle
In principle, it should work.
Bayesian inference
Finite Data → Model
• Explicit a priori probabilities
• Learn mixtures
• Hard multidimensional integrations…
![Page 59: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/59.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
We want to control complexity
Finite Data → Model
• Smoothness constraint in a neighborhood
![Page 60: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/60.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data density is key…
Finite Data in a Space
Model Complexity
Inductive principle
![Page 61: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/61.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data density is key… e.g.
● 1-D 0.1 m interval => 10 points/m
● 2-D 0.1 m interval => 100 points/m^2
● d-D 0.1 m interval => 10^d points/m^d
Same smoothness requires lots of data in high dimensional spaces
![Page 62: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/62.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Sampling is hard… e.g.
● 1-D 10% sample => 0.1 x size
● 2-D 10% sample => 0.31 x size
● 10-D 10% sample => 0.79 x size
=> local estimates from samples are difficult
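The density and sampling figures on the curse-of-dimensionality slides can be reproduced with two one-line formulas. A sketch in plain Python, assuming a unit hypercube with uniformly spread data; the function names are hypothetical. A grid with 10 points per axis needs 10^d points, and a sub-cube holding a fraction p of the data needs an edge of p^(1/d) along every axis.

```python
def points_needed(d, points_per_axis=10):
    """Number of grid points to keep the same spacing in d dimensions."""
    return points_per_axis ** d

def edge_for_fraction(d, fraction=0.1):
    """Edge length (as a share of the full range) of a cube holding `fraction` of the data."""
    return fraction ** (1 / d)

for d in (1, 2, 10):
    print(d, points_needed(d), round(edge_for_fraction(d), 2))
```

In 10 dimensions a "local" 10% sample already spans about 79% of the range of every feature, which is why local estimates become difficult.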
![Page 63: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/63.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Data points are closer to edges…
Each data point “sees” itself as an outlier
=> Predictions require lots of extrapolation
![Page 64: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/64.jpg)
What is Machine Learning? Curse of dimensionality
In principle, it should work.
Samples must increase exponentially
… or model complexity must be controlled
![Page 65: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/65.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Data driven penalized risk minimization
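The slide's formula is not part of this transcript. One common way to write a penalized risk, given here as an assumed notation rather than the authors' exact expression, combines the average loss over the n samples with a penalty φ(w) weighted by a strength λ:

```latex
R_{\mathrm{pen}}(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr) \;+\; \lambda \, \phi(w)
```

Setting λ = 0 gives plain empirical risk minimization; larger λ pushes towards simpler models.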
![Page 66: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/66.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Loss functions
![Page 67: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/67.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Regularizers
L2 (ridge)
L1 (lasso)
Elastic net
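The three regularizers can be sketched as penalty functions on a weight vector, in plain Python. The elastic net is written here as a mix of the two penalties controlled by a hypothetical `alpha` parameter; libraries use slightly different scalings, so treat this parametrization as an assumption.

```python
def l2_penalty(w):
    """Ridge: sum of squared weights."""
    return sum(wi * wi for wi in w)

def l1_penalty(w):
    """Lasso: sum of absolute weights."""
    return sum(abs(wi) for wi in w)

def elastic_net_penalty(w, alpha):
    """Mix of L1 and L2; alpha = 1 is pure lasso, alpha = 0 is pure ridge."""
    return alpha * l1_penalty(w) + (1 - alpha) * l2_penalty(w)

w = [3.0, -4.0]
print(l2_penalty(w), l1_penalty(w), elastic_net_penalty(w, 0.5))
```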
![Page 68: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/68.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (there comes the fun…)
Which algorithm to find a minimum in a distributed fashion?
Convex optimization methods (linear methods)
● Gradient descent
● Stochastic gradient descent
● Limited-memory BFGS
![Page 69: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/69.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (there comes the fun…)
Gradient descent
● Efficient steps but needs to read through the whole data
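A minimal sketch of batch gradient descent in plain Python, fitting a single weight `w` in the model y ≈ w·x under squared loss. The toy data, step size and iteration count are arbitrary choices for the illustration. Every step computes the gradient over the whole data, which is the cost the slide points out.

```python
# Toy data generated from y = 2 * x (an assumption for the example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def full_gradient(w):
    """Gradient of the mean squared error; reads every sample."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.0
step = 0.05
for _ in range(100):
    w -= step * full_gradient(w)  # one full pass over the data per step

print(w)
```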
![Page 70: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/70.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (there comes the fun…)
Stochastic gradient descent
● Samples data for each step but converges very slowly
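A matching sketch of stochastic gradient descent in plain Python, again fitting one weight `w` in y ≈ w·x on made-up data. Each step looks at a single randomly drawn sample, so steps are cheap but many more of them are needed. The step size, seed and iteration count are arbitrary; on this noise-free toy data the method still settles on the right value, whereas on real, noisy data the slow convergence is far more visible.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data generated from y = 2 * x (an assumption for the example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
step = 0.01
for _ in range(2000):
    i = random.randrange(len(xs))            # sample one data point
    grad_i = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient on that point only
    w -= step * grad_i

print(w)
```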
![Page 71: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/71.jpg)
What is Machine Learning? Regularization in more detail
In principle, it should work.
Optimization (there comes the fun…)
L-BFGS
● Second-derivative (curvature) estimates by keeping several previous gradients in memory
● Fast convergence
![Page 72: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/72.jpg)
What is Machine Learning? Model selection
all work and no play makes Jack a dull boy
Model Complexity control: Resampling
Selecting the right lambda…
… to minimize prediction risk
![Page 73: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/73.jpg)
What is Machine Learning? Model selection
Enough theory boy!
The universe
![Page 74: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/74.jpg)
What is Machine Learning? Model selection
Enough theory boy!
Our data
![Page 75: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/75.jpg)
What is Machine Learning? Model selection
Enough theory boy!
Our data
Learning Set (70%)
validation set (30%)
![Page 76: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/76.jpg)
What is Machine Learning? Model selection
Enough theory boy!
Our data
Learning Set (70%)
validation set (30%)
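The 70% / 30% split can be sketched in plain Python. The helper name `holdout_split`, the seed and the ten-element dataset are assumptions for the illustration; the data is shuffled first so that both sets are drawn from the same distribution.

```python
import random

def holdout_split(data, learning_fraction=0.7, seed=42):
    """Shuffles the data and cuts it into a learning set and a validation set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(round(learning_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

learning_set, validation_set = holdout_split(range(10))
print(len(learning_set), len(validation_set))
```

The model is fitted on the learning set only, and its prediction risk is estimated on the validation set it has never seen.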
![Page 77: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/77.jpg)
What is Machine Learning? Model selection
Nice flag
K-Fold
K = 4
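K-fold resampling with K = 4 can be sketched as index bookkeeping in plain Python. The generator name `k_fold_indices` and the eight-sample dataset are hypothetical; each fold serves once as validation set while the other three form the learning set.

```python
def k_fold_indices(n_samples, k):
    """Yields (learning, validation) index lists, one pair per fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        learning = [idx
                    for f, fold in enumerate(folds) if f != held_out
                    for idx in fold]
        yield learning, validation

splits = list(k_fold_indices(8, 4))
for learning, validation in splits:
    print(sorted(learning), validation)
```

Averaging the validation risk over the four folds gives one estimate per candidate lambda, which can then be used to pick the value with the lowest estimated prediction risk.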
![Page 78: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/78.jpg)
MLlib: A library to learn them all...
![Page 79: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/79.jpg)
Distributed computing framework
Large Scale Data Processing engine
What is Apache Spark?
I play BIG!
![Page 80: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/80.jpg)
Distributed computing framework
Large Scale Data Processing engine
● SQL & DataFrames
● Streaming
● Graph Processing
● Machine Learning
With all colors!
What is Apache Spark?
![Page 81: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/81.jpg)
Distributed computing framework
Large Scale Data Processing engine
● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
Let the brain do the work...
What is Apache Spark?
![Page 82: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/82.jpg)
Distributed computing framework
Large Scale Data Processing engine
● Interactive
● @ any scale
Breed mixin’
What is Apache Spark?
![Page 83: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/83.jpg)
MLlib: Spark
In principle, it should work.
Intro to Spark… notebook
![Page 84: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/84.jpg)
MLlib: Spark
In principle, it should work.
Intro to Spark… notebook
So we’ve seen…
● Basics of Spark data manipulation
● MLlib data representation
● Linear regression
● Regularization and k-fold cross validation
What else is there?
![Page 85: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/85.jpg)
MLlib: Spark
In principle, it should work.
● Basic statistics
● Classification and regression
● Collaborative filtering
● Clustering
● Dimensionality reduction
● Feature extraction and transformation
● Frequent pattern mining
● Evaluation metrics
● …
http://spark.apache.org/docs/latest/mllib-guide.html
![Page 86: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/86.jpg)
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF)
Playtime!
Some more examples
![Page 87: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/87.jpg)
Genomics: The data
So… that’s what separates us huh?
![Page 88: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/88.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics: The data
Please, don’t mind the colors...
![Page 89: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/89.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics: The data
Woooow, really, you must be kidding me… ahahahahah
![Page 90: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/90.jpg)
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics: The data
Oh… damned… hum huh
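A quick back-of-the-envelope calculation shows why "lots of data" follows from the figures on this slide. The sample and feature counts come from the slide. The storage widths (1 byte per genotype code, 8 bytes per double) are assumptions for illustration only, and real genomics formats are usually more compact than a dense matrix.

```python
def dense_matrix_bytes(n_samples, n_features, bytes_per_value):
    """Size of a dense samples x features matrix, in bytes."""
    return n_samples * n_features * bytes_per_value

N_SAMPLES = 1_000        # ~1000 samples (from the slide)
N_FEATURES = 30_000_000  # ~30M genotypes per sample (from the slide)

# Assumed storage widths: a compact 1-byte code versus a standard 8-byte double.
for label, width in (("1-byte genotype code", 1), ("8-byte double", 8)):
    size_gb = dense_matrix_bytes(N_SAMPLES, N_FEATURES, width) / 1e9
    print(f"{label}: {size_gb:.0f} GB")
# prints:
# 1-byte genotype code: 30 GB
# 8-byte double: 240 GB
```

Either figure is at or beyond what a single machine of the time could comfortably hold in memory, which is the argument for distributing the work.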
![Page 91: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/91.jpg)
MLlib for Genomics? ADAM + MLlib (mixture K-Means+RF)
Playtime!
Notebook!
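The notebook itself is not part of this transcript. As a stand-in, here is a minimal sketch of the K-Means half of the idea, in plain Python rather than ADAM or MLlib. The genotype encoding (0, 1 or 2 per site), the six toy samples and the farthest-first initialisation are assumptions made to keep the example small and reproducible. The random forest step is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then repeatedly
    add the point that is farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def k_means(points, k, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centers."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean_vector(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Hypothetical genotype vectors: one row per sample, one column per site.
samples = [
    [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1],
    [2, 2, 1, 2], [2, 1, 2, 2], [2, 2, 2, 1],
]
centers = k_means(samples, k=2)
labels = [closest(s, centers) for s in samples]
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

In the distributed setting the assignment step runs in parallel over partitions of the samples, and only the per-cluster sums and counts are sent back to update the centers.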
![Page 92: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/92.jpg)
What else? Old and new players are now integrating with Spark (and Scala)
![Page 93: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/93.jpg)
Integrated with Data Frame
Offers an API to create shareable/reusable pipeline constructions (PCA, …)
Spark ML Pipeline
Higher API
![Page 94: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/94.jpg)
Like Pipeline but
Type Safe
Chainable API (andThen-friendly)
Spark ML Keystone
Higher API
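What "chainable" and "andThen-friendly" mean can be sketched in a few lines. This is not the Keystone API, which is a Scala library. The `Transformer` class, the `and_then` method and the two toy stages are hypothetical names. Python type hints are only checked by an external tool such as a type checker, whereas the Scala compiler enforces them at compile time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

@dataclass(frozen=True)
class Transformer(Generic[A, B]):
    """A pipeline stage that turns a value of type A into a value of type B."""
    run: Callable[[A], B]

    def and_then(self, nxt: "Transformer[B, C]") -> "Transformer[A, C]":
        """Chain two stages; the output type of this one must match the input of the next."""
        return Transformer(lambda value: nxt.run(self.run(value)))

    def __call__(self, value: A) -> B:
        return self.run(value)

# Two hypothetical stages: text -> tokens, tokens -> token counts.
tokenize: Transformer[str, List[str]] = Transformer(lambda text: text.lower().split())
count: Transformer[List[str], Dict[str, int]] = Transformer(
    lambda tokens: {t: tokens.count(t) for t in set(tokens)}
)

pipeline = tokenize.and_then(count)
print(pipeline("Spark spark ML") == {"spark": 2, "ml": 1})  # prints True
```

Writing `count.and_then(tokenize)` instead would be flagged by a type checker, because a dictionary of counts is not a valid input for the tokenizer. That is the kind of mistake a type-safe pipeline API catches before anything runs.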
![Page 95: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/95.jpg)
In-memory implementation of “Map-Reduce”
Highly optimised structures for the JVM
Blazing-fast convergent models
H2O
Higher API
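The map-reduce pattern mentioned on this slide can be illustrated with a tiny in-memory example. This is plain Python, not H2O code. The partitions and function names are made up. The point is only that each partition is summarised where it sits and that small partial results are then combined.

```python
from functools import reduce

# Hypothetical data, already split into partitions that stay in memory.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

def map_partition(chunk):
    """Map step: summarise one partition as a (sum, count) pair."""
    return (sum(chunk), len(chunk))

def combine(left, right):
    """Reduce step: merge two partial summaries. Associative, so order is free."""
    return (left[0] + right[0], left[1] + right[1])

total, count = reduce(combine, map(map_partition, partitions))
mean = total / count
print(mean)  # prints 3.5
```

Many model-fitting steps, such as summing gradients over the data, have the same shape, so keeping the partitions in memory between iterations avoids rereading the data on every pass.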
![Page 96: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/96.jpg)
DL4J Spark ML
Higher API
![Page 97: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/97.jpg)
Intel Data Analytics Acceleration Library
DAAL (Intel)
Higher API
![Page 98: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/98.jpg)
Declarative large-scale machine learning
Optimization based on data and cluster characteristics
System ML (IBM)
Higher API
![Page 99: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/99.jpg)
Nitro's Extremely Exciting Deep Learning Engine
MLP, RBM, LSTM and more to come
Needle
Higher API
![Page 100: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/100.jpg)
H2O Sparkling & Deep Learning on genomics
water in fire
Learning structures using the H2O Deep Learning algorithm, integrated in Spark, in a Notebook, on an EC2 cluster
http://h2o.ai/product/sparkling-water/
![Page 101: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/101.jpg)
H2O Sparkling: in-memory data exchange
I remember things better when I remember them twice.
![Page 102: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/102.jpg)
Wrap up: what we hope you have learned
![Page 103: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/103.jpg)
Distributed computing for machine learning
I am ready.
Data is exploding
Distributed Technologies are maturing
Scale up and down, interactivity
![Page 104: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/104.jpg)
Distributed ML on Spark: What is available
What are my options by the way?
Spark MLlib
H2O
DL4J
Needle
EC2
GCE
URIKA-XA
Cloudera
MapR
Hortonworks
HDFS
C*
Kafka
![Page 105: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/105.jpg)
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
Shar3 (Data Fellas)
Role labels on the slide (in extraction order): ops; data; ops data; sci; sci ops; sci; ops data; web ops data; web ops data sci
![Page 106: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/106.jpg)
Shar3 (Data Fellas)
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata
![Page 107: Distributed machine learning 101 using apache spark from a browser devoxx.be2015](https://reader034.vdocuments.mx/reader034/viewer/2022042611/58ed27ba1a28ab2c138b45ef/html5/thumbnails/107.jpg)
That’s all, folks. Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH