Spark ML Pipeline Serving
TRANSCRIPT
Spark Serving
by Stepan Pushkarev, CTO of Hydrosphere.io
Spark users here? Data scientists and Spark users here?
Why do companies hire data scientists?
To make products smarter.
What is the deliverable of a data scientist and a data engineer?
An academic paper? An ML model? An R/Python script? A Jupyter notebook? A BI dashboard?
[Diagram: the data scientist works with the cluster, the data, and the model; a question mark links the model to the web app]
val wordCounts = textFile                // textFile: RDD[String], read earlier
  .flatMap(line => line.split(" "))      // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey((a, b) => a + b)          // sum the counts per word
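The same flatMap/map/reduce pattern can be sketched with plain Scala collections, no Spark required (a small in-memory stand-in for the RDD version above; `groupBy` plays the role of `reduceByKey`):

```scala
// Plain-Scala sketch of the word-count pipeline above (no Spark needed).
val lines = Seq("apache spark", "spark machine learning")

val wordCounts = lines
  .flatMap(_.split(" "))                // split lines into words
  .map(word => (word, 1))               // pair each word with 1
  .groupBy(_._1)                        // group pairs by word (stand-in for reduceByKey)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```

The difference in Spark is only that the collection is partitioned across executors and the reduction happens in parallel.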
[Diagram: the computation is distributed across Spark executors]
Machine Learning: training + serving

Training (Estimation) pipeline: preprocess → preprocess → train
tokenizer (text, label → words, label):
  apache spark            | 1   →   [apache, spark]             | 1
  hadoop mapreduce        | 0   →   [hadoop, mapreduce]         | 0
  spark machine learning  | 1   →   [spark, machine, learning]  | 1

hashing TF (words, label → feature indices, values, label):
  [apache, spark]             | 1   →   [105, 495],     [1.0, 1.0]       | 1
  [hadoop, mapreduce]         | 0   →   [6, 638, 655],  [1.0, 1.0, 1.0]  | 0
  [spark, machine, learning]  | 1   →   [105, 72, 852], [1.0, 1.0, 1.0]  | 1
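The index columns come from the hashing trick: each term is hashed into a fixed-size feature space. A minimal sketch of the idea (this uses `String.hashCode` for simplicity; Spark's `HashingTF` uses MurmurHash3, so the indices will not match the slide's numbers exactly):

```scala
// Illustrative hashing trick: map each term to a bucket in a fixed feature space.
// NOTE: String.hashCode is used for simplicity; Spark's HashingTF uses MurmurHash3.
def hashingTF(words: Seq[String], numFeatures: Int = 1000): Map[Int, Double] =
  words
    .map(w => (w.hashCode % numFeatures + numFeatures) % numFeatures) // non-negative bucket
    .groupBy(identity)
    .map { case (idx, hits) => (idx, hits.size.toDouble) }            // term frequency per bucket
```

The key property is that no vocabulary needs to be stored: the same hash function reproduces the same indices at serving time.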
logistic regression (trained on the feature vectors above), fitted coefficients (class, feature index, weight):
  0   72   -2.7138781446090308
  0   94    0.9042505436914775
  0  105    3.0835670890496645
  0  495    3.2071722417080766
  0  722    0.9042505436914775
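Serving such a model later needs little more than these weights: a sketch of single-row scoring as a sparse dot product plus the logistic link (weights copied from the coefficients above; the intercept is omitted for simplicity):

```scala
// Score one hashed feature vector with the fitted weights: sigmoid(w . x).
// Weights copied from the example above; intercept omitted for simplicity.
val weights = Map(
  72  -> -2.7138781446090308,
  94  ->  0.9042505436914775,
  105 ->  3.0835670890496645,
  495 ->  3.2071722417080766,
  722 ->  0.9042505436914775
)

def score(features: Map[Int, Double]): Double = {
  val margin = features.map { case (idx, v) => weights.getOrElse(idx, 0.0) * v }.sum
  1.0 / (1.0 + math.exp(-margin)) // logistic (sigmoid) link
}

// "apache spark" hashes to indices 105 and 495, both weighted positive,
// so score(Map(105 -> 1.0, 495 -> 1.0)) is close to 1.0 (predicted label 1).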
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")
Prediction pipeline: preprocess → preprocess → predict
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")

val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
./bin/spark-submit …
Pipeline Serving, NOT Model Serving
A model-level API leads to code duplication and inconsistency at the pre-processing stages!

[Diagram: a fraud-detection setup where the web app (Ruby/PHP) re-implements preprocessing over the user logs ("check current user"), while the ML pipeline does its own preprocess + train, saves the result, and then scores/serves the model]
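The duplication problem is easy to reproduce. A hypothetical illustration (both function names and the drift are invented for the example): the training side and the web app each implement "tokenize", and they silently disagree, so the served model sees different features than it was trained on.

```scala
// Hypothetical illustration of preprocess duplication: the training pipeline
// and the web app each implement "tokenize", and they silently disagree.
val trainTokenize: String => Seq[String] = _.toLowerCase.split(" ").toSeq
val webAppTokenize: String => Seq[String] = _.split("\\s+").toSeq // forgot lowercasing

val input = "Apache Spark"
val same = trainTokenize(input) == webAppTokenize(input)
// same is false: at serving time the model receives features it never saw in training
```

Serving the whole pipeline, rather than just the final model, removes this class of bug by construction.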
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
Export to an exchange format (PMML, PFA, MLeap):
- Yet another format lock
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
Docker: package the model, libs, and deps into an image:
- A fat, all-inclusive Docker image is bad practice
- Every new model requires the image to be rebuilt
Expose the running Spark cluster through an API to the web app:
- Needs Spark running
- High latency, low throughput
A dedicated serving API between the web app and the model:
+ Serving skips Spark
+ But re-uses the ML algorithms
+ No new formats and APIs
+ Low latency, though not super-tuned
+ Scalable
+ Stateless
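A minimal sketch of such a stateless, Spark-free scoring endpoint, assuming the pipeline has been reduced to pure functions (tokenize → hash → score). Everything here is illustrative: `scoreText` is a hard-coded stand-in for the real pipeline stages, and a real deployment would add JSON parsing, batching, and monitoring. It uses only the JDK's built-in `HttpServer`:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

// Stand-in for tokenizer + hashingTF + logistic regression weights.
def scoreText(text: String): Double = {
  val tokens = text.toLowerCase.split("\\s+")
  if (tokens.contains("spark")) 0.99 else 0.01
}

// Port 0 lets the OS pick a free port.
val server = HttpServer.create(new InetSocketAddress(0), 0)
server.createContext("/score", (exchange: HttpExchange) => {
  val text = new String(exchange.getRequestBody.readAllBytes)
  val body = scoreText(text).toString.getBytes
  exchange.sendResponseHeaders(200, body.length)
  exchange.getResponseBody.write(body)
  exchange.close()
})
server.start() // each request is scored independently: stateless, hence horizontally scalable
```

Because no request touches Spark or shared state, instances can be added or removed freely behind a load balancer.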
Low-level API challenge (e.g. MS Azure). A deliverable for an ML model needs:
- a single-row serving/scoring layer (xml, json, parquet, pojo, other)
- monitoring and testing integration
- a large-scale batch processing engine
Zooming out: a unified serving/scoring API
- Repository of models: MLlib model, TensorFlow model, other models
- Real-time prediction pipelines

Starting from scratch: Apache SystemML
Multiple execution modes, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Demo Time
Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/