Automating and Productionizing Machine Learning Pipelines for Real-Time Scoring with Apache Spark
David Crespi, Data Scientist
Jared Piedt, Software Engineer

Overview of Red Ventures
History
• Founded as Red F in 2000
• Red Ventures launched in 2004
• General Atlantic & Silver Lake minority strategic investors
By the Numbers
• 3,500+ Employees
• 1 Culture
Locations
• USA - 13 Locations
• Brazil - Sao Paulo
• United Kingdom - London

Our Use Case – Real-Time Predictions
Requirements
1. Speed
2. Consistency

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

MLlib
Spark SQL & DataFrames

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Old Data Architecture
[Diagram, built up over several slides. Labels: App, DW, Complex ETL, Training Data, Scoring Data]

Pain Points
• Duplication of business logic
• Data drift

Goals
1. Immutable data
2. Write business logic once
3. Make data available in real-time

Event-Driven Architecture
[Diagram labels: Data Pipeline; Web, Chat, Server, IVR]

New Data Architecture
[Diagram labels: Data Pipeline, Amazon S3, Key-Value Store, Business Logic, Training Data, Scoring Data]

Projections
[Diagram: a sequence of events { id: 4, … } { id: 3, … } { id: 2, … } { id: 1, … } and a single record { id: … }]

Credit Card Recommendation
User Id  Keyword            Page View Count  Card Shown  Clicked
a        best travel cards  2                Travel      1
b        credit cards       3                Cash Back   0
c        top credit cards   1                Cash Back   1
d        credit cards       1                Travel      0

[Diagram, built up over several slides. Labels: events E1, E2, E3 on a time axis; reduce; r]

Credit Card Recommendation
User Id  Keyword             Page View Count  Card Shown  Clicked
a        best travel cards   2                Travel      1
b        credit cards        3                Cash Back   0
c        top credit cards    1                Cash Back   1
d        credit cards        1                Travel      0
z        airline miles card  1                Travel      1
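
The reduce idea behind projections can be sketched in plain Python. This is a minimal illustration, not the speakers' implementation: the event types, field names, and the `apply_event` and `project` helpers are all assumptions, chosen so that the result matches row z in the table above. Because the same reducer runs over the full history and over a prefix of it, the business logic is written once for both training and scoring data.

```python
from functools import reduce

# Hypothetical event stream for one user. The field names and event types
# are assumptions; the slides only show events that carry an id.
EVENTS = [
    {"id": 1, "user_id": "z", "type": "search", "keyword": "airline miles card"},
    {"id": 2, "user_id": "z", "type": "page_view"},
    {"id": 3, "user_id": "z", "type": "card_shown", "card": "Travel"},
    {"id": 4, "user_id": "z", "type": "click"},
]

# Starting state of the projection before any event has been applied.
EMPTY = {"user_id": None, "keyword": None, "page_view_count": 0,
         "card_shown": None, "clicked": 0}


def apply_event(state, event):
    """Fold one event into the projection and return a new dict."""
    new_state = dict(state, user_id=event["user_id"])
    kind = event["type"]
    if kind == "search":
        new_state["keyword"] = event["keyword"]
    elif kind == "page_view":
        new_state["page_view_count"] += 1
    elif kind == "card_shown":
        new_state["card_shown"] = event["card"]
    elif kind == "click":
        new_state["clicked"] = 1
    return new_state


def project(events):
    """Reduce events, oldest first, into a single flat record."""
    ordered = sorted(events, key=lambda e: e["id"])
    return reduce(apply_event, ordered, EMPTY)


training_row = project(EVENTS)     # full history, including the click label
scoring_row = project(EVENTS[:3])  # state at the moment the card is shown
print(training_row)
print(scoring_row)
```

The stored events are never modified; each call to `apply_event` copies the state, so a projection can be rebuilt at any time from the raw events.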

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

ML Pipeline
[Diagram labels: Transformer, Estimator]

Spark: Estimators and Transformers
[Diagram, built up over two slides. Labels: PipelineStage; Transformer; Estimator, Transformer]

Transformers
Comment                    Resolved
I need help with internet  1
Setting up my TV           0
My internet won’t work     1
Internet is slow           0
Can’t connect              1
Netflix not working        0

Transformers
Comment                    Resolved  Internet Comment
I need help with internet  1         1
Setting up my TV           0         0
My internet won’t work     1         1
Internet is slow           0         1
Can’t connect              1         0
Netflix not working        0         0
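
A transformer of this kind needs no training: it maps each row to a new row with an extra column. The sketch below is a plain-Python analogy rather than Spark code. The `internet_comment` function, the column names, and the rule (a case-insensitive search for the word "internet") are assumptions that happen to reproduce the flags in the table above.

```python
# Rows from the slide; the straight apostrophes are a typing convenience.
ROWS = [
    {"comment": "I need help with internet", "resolved": 1},
    {"comment": "Setting up my TV", "resolved": 0},
    {"comment": "My internet won't work", "resolved": 1},
    {"comment": "Internet is slow", "resolved": 0},
    {"comment": "Can't connect", "resolved": 1},
    {"comment": "Netflix not working", "resolved": 0},
]


def internet_comment(row):
    """Return a copy of the row with a 0/1 flag for comments mentioning internet."""
    flag = int("internet" in row["comment"].lower())
    return dict(row, internet_comment=flag)


transformed = [internet_comment(row) for row in ROWS]
for row in transformed:
    print(row)
```

The input rows are left untouched, which mirrors how a transformer returns a new data set with an appended column.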

Estimators
Home Square Footage  Sold
1,200                1
2,100                0
3,000                1
NULL                 0
1,350                1
1,725                0

Estimators
Imputer
Fill value = 1,875

Estimators
Home Square Footage  Sold  Home Square Footage Imputed
1,200                1     1,200
2,100                0     2,100
3,000                1     3,000
NULL                 0     1,875
1,350                1     1,350
1,725                0     1,725
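
The difference from a transformer is that an estimator must first learn something from the data. The sketch below is a plain-Python analogy with hypothetical class and column names, not the Spark API. It assumes the fill value is the mean of the observed values, which is consistent with the slide: the five non-NULL square footages average to 1,875.

```python
from statistics import mean


class MeanImputer:
    """Estimator: fit() learns a fill value and returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        observed = [r[self.input_col] for r in rows
                    if r[self.input_col] is not None]
        return MeanImputerModel(self.input_col, self.output_col, mean(observed))


class MeanImputerModel:
    """Transformer produced by fitting: fills missing values with a constant."""

    def __init__(self, input_col, output_col, fill_value):
        self.input_col = input_col
        self.output_col = output_col
        self.fill_value = fill_value

    def transform(self, rows):
        out = []
        for r in rows:
            value = r[self.input_col]
            filled = self.fill_value if value is None else value
            out.append(dict(r, **{self.output_col: filled}))
        return out


HOMES = [
    {"sqft": 1200, "sold": 1},
    {"sqft": 2100, "sold": 0},
    {"sqft": 3000, "sold": 1},
    {"sqft": None, "sold": 0},
    {"sqft": 1350, "sold": 1},
    {"sqft": 1725, "sold": 0},
]

model = MeanImputer("sqft", "sqft_imputed").fit(HOMES)
imputed = model.transform(HOMES)
print(model.fill_value)
print(imputed[3])
```

Once fitted, the model carries its fill value with it, so the same value learned at training time is applied to any record scored later.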

Spark: Estimators and Transformers
Pipeline: Transformer, Transformer, Estimator
PipelineModel: Transformer, Transformer, Transformer

How do ML algorithms fit in?

Spark: Estimators and Transformers
Pipeline: Transformer, Transformer, Estimator
PipelineModel: Transformer, Transformer, Transformer
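
One way to read the Pipeline and PipelineModel relationship: fitting a pipeline walks the stages in order, fits each estimator on the data produced so far, and collects the resulting transformers. An ML algorithm is then just another estimator whose fitted model adds a prediction column. The sketch below is a plain-Python analogy, not Spark's implementation. Every class in it is hypothetical, and the nearest-mean classifier is a toy stand-in for a real algorithm.

```python
from statistics import mean


class MeanImputer:
    """Estimator: learns the column mean and returns a FillMissing transformer."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        fill = mean(r[self.col] for r in rows if r[self.col] is not None)
        return FillMissing(self.col, fill)


class FillMissing:
    """Transformer: replaces missing values with a fixed fill value."""
    def __init__(self, col, fill):
        self.col, self.fill = col, fill

    def transform(self, rows):
        return [dict(r, **{self.col: self.fill if r[self.col] is None else r[self.col]})
                for r in rows]


class ToThousands:
    """Plain transformer: nothing to learn, so it has no fit method."""
    def __init__(self, col):
        self.col = col

    def transform(self, rows):
        return [dict(r, **{self.col: r[self.col] / 1000}) for r in rows]


class NearestMeanClassifier:
    """Estimator standing in for an ML algorithm: learns one mean per label."""
    def __init__(self, feature_col, label_col):
        self.feature_col, self.label_col = feature_col, label_col

    def fit(self, rows):
        labels = {r[self.label_col] for r in rows}
        means = {label: mean(r[self.feature_col] for r in rows
                             if r[self.label_col] == label)
                 for label in labels}
        return NearestMeanModel(self.feature_col, means)


class NearestMeanModel:
    """Transformer: adds the label whose learned mean is closest."""
    def __init__(self, feature_col, means):
        self.feature_col, self.means = feature_col, means

    def transform(self, rows):
        def predict(x):
            return min(self.means, key=lambda label: abs(x - self.means[label]))
        return [dict(r, prediction=predict(r[self.feature_col])) for r in rows]


class Pipeline:
    """Estimator made of stages; fitting it yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data so far; transformers pass through.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Transformer made only of transformers, applied in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


HOMES = [{"sqft": 1200, "sold": 1}, {"sqft": 2100, "sold": 0},
         {"sqft": 3000, "sold": 1}, {"sqft": None, "sold": 0},
         {"sqft": 1350, "sold": 1}, {"sqft": 1725, "sold": 0}]

pipeline = Pipeline([MeanImputer("sqft"), ToThousands("sqft"),
                     NearestMeanClassifier("sqft", "sold")])
model = pipeline.fit(HOMES)
scored = model.transform([{"sqft": 1000}, {"sqft": 4000}])
print(scored)
```

After fitting, every stage in `model.stages` is a transformer, so scoring new records never touches the training logic again.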

Generalizing Data Science
[Diagram: two views of the training data. First: Response, All Features. Second: Response, Raw Text Features, Categorical Features, Numeric Features]

We fit our pipeline… now what?

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Real-time scoring paradigm
[Diagram labels: Production Applications, Prediction API, ?]
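
The slides do not show how the Prediction API is built, so the following is only a sketch of the request path under stated assumptions: a fitted pipeline is loaded once, each request carries a single JSON record, and the record is scored as a one-row batch. `FittedPipeline`, `handle_request`, the `sqft` field, and the threshold rule are all hypothetical and stand in for whatever model and serving layer are used in production.

```python
import json


class FittedPipeline:
    """Hypothetical stand-in for a fitted pipeline loaded once at start-up."""

    def __init__(self, fill_value, threshold):
        self.fill_value = fill_value
        self.threshold = threshold

    def transform(self, rows):
        scored = []
        for row in rows:
            # Apply the fill value learned at training time, then a toy rule.
            sqft = self.fill_value if row.get("sqft") is None else row["sqft"]
            scored.append(dict(row, prediction=int(sqft < self.threshold)))
        return scored


MODEL = FittedPipeline(fill_value=1875, threshold=1500)


def handle_request(body, model=MODEL):
    """Score one JSON record and return the prediction as JSON."""
    record = json.loads(body)
    (scored,) = model.transform([record])  # single record as a one-row batch
    return json.dumps({"prediction": scored["prediction"]})


print(handle_request('{"sqft": 1200}'))
print(handle_request('{"sqft": null}'))
```

Keeping the handler this thin means the calling application sends raw fields only, and all feature logic stays inside the fitted pipeline.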

Model evaluation in real-time – with Spark

Model evaluation in real-time – with MLeap

Data collection
ML pipeline training
Model deployment

Recap
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Questions