Spark ML Pipeline Serving
TRANSCRIPT
Spark Serving
by Stepan Pushkarev, CTO of Hydrosphere.io
Spark users here? Data scientists and Spark users here?
Why do companies hire data scientists?
To make products smarter.
What is the deliverable of a data scientist and a data engineer?
An academic paper? An ML model? An R/Python script? A Jupyter notebook? A BI dashboard?
[Diagram: the data scientist works with the cluster, the data, and the model; a question mark links the model to the web app]
val wordCounts = textFile                // textFile: RDD[String], read earlier
  .flatMap(line => line.split(" "))      // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey((a, b) => a + b)          // sum the counts per word
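The same flatMap/map/reduce pattern can be sketched with plain Scala collections, no Spark required (a small in-memory stand-in for the RDD version above; `groupBy` plays the role of `reduceByKey`):

```scala
// Plain-Scala sketch of the word-count pipeline above (no Spark needed).
val lines = Seq("apache spark", "spark machine learning")

val wordCounts = lines
  .flatMap(_.split(" "))                // split lines into words
  .map(word => (word, 1))               // pair each word with 1
  .groupBy(_._1)                        // group pairs by word (stand-in for reduceByKey)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```

The difference in Spark is only that the collection is partitioned across executors and the reduction happens in parallel.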
[Diagram: the computation is distributed across Spark executors]
Machine Learning: training + serving

Training (Estimation) pipeline: preprocess → preprocess → train
tokenizer (text, label → words, label):
  apache spark            | 1   →   [apache, spark]             | 1
  hadoop mapreduce        | 0   →   [hadoop, mapreduce]         | 0
  spark machine learning  | 1   →   [spark, machine, learning]  | 1

hashing TF (words, label → feature indices, values, label):
  [apache, spark]             | 1   →   [105, 495],     [1.0, 1.0]       | 1
  [hadoop, mapreduce]         | 0   →   [6, 638, 655],  [1.0, 1.0, 1.0]  | 0
  [spark, machine, learning]  | 1   →   [105, 72, 852], [1.0, 1.0, 1.0]  | 1
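The index columns come from the hashing trick: each term is hashed into a fixed-size feature space. A minimal sketch of the idea (this uses `String.hashCode` for simplicity; Spark's `HashingTF` uses MurmurHash3, so the indices will not match the slide's numbers exactly):

```scala
// Illustrative hashing trick: map each term to a bucket in a fixed feature space.
// NOTE: String.hashCode is used for simplicity; Spark's HashingTF uses MurmurHash3.
def hashingTF(words: Seq[String], numFeatures: Int = 1000): Map[Int, Double] =
  words
    .map(w => (w.hashCode % numFeatures + numFeatures) % numFeatures) // non-negative bucket
    .groupBy(identity)
    .map { case (idx, hits) => (idx, hits.size.toDouble) }            // term frequency per bucket
```

The key property is that no vocabulary needs to be stored: the same hash function reproduces the same indices at serving time.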
logistic regression (trained on the feature vectors above), fitted coefficients (class, feature index, weight):
  0   72   -2.7138781446090308
  0   94    0.9042505436914775
  0  105    3.0835670890496645
  0  495    3.2071722417080766
  0  722    0.9042505436914775
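Serving such a model later needs little more than these weights: a sketch of single-row scoring as a sparse dot product plus the logistic link (weights copied from the coefficients above; the intercept is omitted for simplicity):

```scala
// Score one hashed feature vector with the fitted weights: sigmoid(w . x).
// Weights copied from the example above; intercept omitted for simplicity.
val weights = Map(
  72  -> -2.7138781446090308,
  94  ->  0.9042505436914775,
  105 ->  3.0835670890496645,
  495 ->  3.2071722417080766,
  722 ->  0.9042505436914775
)

def score(features: Map[Int, Double]): Double = {
  val margin = features.map { case (idx, v) => weights.getOrElse(idx, 0.0) * v }.sum
  1.0 / (1.0 + math.exp(-margin)) // logistic (sigmoid) link
}

// "apache spark" hashes to indices 105 and 495, both weighted positive,
// so score(Map(105 -> 1.0, 495 -> 1.0)) is close to 1.0 (predicted label 1).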
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")
Prediction pipeline: preprocess → preprocess → predict
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")

val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
./bin/spark-submit …
Pipeline Serving, NOT Model Serving
A model-level API leads to code duplication and inconsistency at the pre-processing stages!

[Diagram: a fraud-detection setup where the web app (Ruby/PHP) re-implements preprocessing over the user logs ("check current user"), while the ML pipeline does its own preprocess + train, saves the result, and then scores/serves the model]
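The duplication problem is easy to reproduce. A hypothetical illustration (both function names and the drift are invented for the example): the training side and the web app each implement "tokenize", and they silently disagree, so the served model sees different features than it was trained on.

```scala
// Hypothetical illustration of preprocess duplication: the training pipeline
// and the web app each implement "tokenize", and they silently disagree.
val trainTokenize: String => Seq[String] = _.toLowerCase.split(" ").toSeq
val webAppTokenize: String => Seq[String] = _.split("\\s+").toSeq // forgot lowercasing

val input = "Apache Spark"
val same = trainTokenize(input) == webAppTokenize(input)
// same is false: at serving time the model receives features it never saw in training
```

Serving the whole pipeline, rather than just the final model, removes this class of bug by construction.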
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
Export to an exchange format (PMML, PFA, MLeap):
- Yet another format lock
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
Docker: package the model, libs, and deps into an image:
- A fat, all-inclusive Docker image is bad practice
- Every new model requires the image to be rebuilt
Expose the running Spark cluster through an API to the web app:
- Needs Spark running
- High latency, low throughput
A dedicated serving API between the web app and the model:
+ Serving skips Spark
+ But re-uses the ML algorithms
+ No new formats and APIs
+ Low latency, though not super-tuned
+ Scalable
+ Stateless
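A minimal sketch of such a stateless, Spark-free scoring endpoint, assuming the pipeline has been reduced to pure functions (tokenize → hash → score). Everything here is illustrative: `scoreText` is a hard-coded stand-in for the real pipeline stages, and a real deployment would add JSON parsing, batching, and monitoring. It uses only the JDK's built-in `HttpServer`:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

// Stand-in for tokenizer + hashingTF + logistic regression weights.
def scoreText(text: String): Double = {
  val tokens = text.toLowerCase.split("\\s+")
  if (tokens.contains("spark")) 0.99 else 0.01
}

// Port 0 lets the OS pick a free port.
val server = HttpServer.create(new InetSocketAddress(0), 0)
server.createContext("/score", (exchange: HttpExchange) => {
  val text = new String(exchange.getRequestBody.readAllBytes)
  val body = scoreText(text).toString.getBytes
  exchange.sendResponseHeaders(200, body.length)
  exchange.getResponseBody.write(body)
  exchange.close()
})
server.start() // each request is scored independently: stateless, hence horizontally scalable
```

Because no request touches Spark or shared state, instances can be added or removed freely behind a load balancer.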
Low-level API challenge (e.g. MS Azure). A deliverable for an ML model needs:
- a single-row serving/scoring layer (xml, json, parquet, pojo, other)
- monitoring and testing integration
- a large-scale batch processing engine
Zooming out: a unified serving/scoring API
- Repository of models: MLlib model, TensorFlow model, other models
- Real-time prediction pipelines

Starting from scratch: Apache SystemML
Multiple execution modes, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Demo Time
Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/