Multi-Model Machine Learning by Maximo Gurmendez and Beth Logan
TRANSCRIPT
Multi-Model Machine Learning for
Real Time Bidding over Display Ads
Beth Logan
Senior Director of Optimization
Maximo Gurmendez
Data Science Engineering Team Lead
With credit to our Spark developers:
Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane
We wanted to try Spark but wondered:
• Is Spark thread-safe?
• Is Spark fast enough?
• Does it use too much memory?
Agenda
1. What we do
2. How we do it
3. Why Spark?
4. Challenges Addressed
5. Main Takeaways
DataXu’s Mission
Make marketing
smarter through
Data Science!
What We Do
Taking Action Automatically
• Bid in real-time ad auctions on behalf of advertisers
• Machine Learning System learns from past bids
[Diagram: A user's browser sends an ad request to the ad exchanges; the exchanges send a bid request to the DataXu real-time system, which performs ad selection and bidding; the DataXu machine learning system feeds models to the real-time system.]
DataXu ML System
[Diagram: Ads shown and user actions (purchases, clicks, etc.) land in a Hive database on Hadoop; models are learned, calibrated, and evaluated; only high-quality models are promoted to real-time bidding.]
Why is this hard?
Huge Scale
• 2 petabytes processed daily
• 1.6 million bid decisions per second
• Runs 24×7 on 5 continents
• Thousands of ML models trained per day
Unattended Operation
• Model training and deployment runs automatically every day
Changing Industry
• Need the ability to adapt quickly to new customer requirements
Why Spark?
• Large open source machine learning library
– Fast turnaround from research to production
– Easy to prototype and support new customer use cases
– Built-in upgrades of algorithms
– Increased reliability
• Trains models faster than Hadoop
• Enables iterative models
• Elastic environment via the cloud
Challenges Addressed
• Smart Dataset Partitioning by Campaign
• Categorical Features
• Functional Features
• Pipelines + RowTransformers
• Use of SparkSQL
• Real-time model instantiations
Partitioning the data
• Need 1 RDD per campaign
• "Fat Reducers" or "Many files" problem
• 2-pass solution
Partitioning the data: Solution
• Sample the RDD
• Construct a histogram of campaign sizes
• Use the histogram to allocate more processes to large campaigns (pseudo-sub-partition)
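The allocation step of this two-pass solution can be sketched as plain logic (all names here are illustrative, not DataXu's actual code): given row counts per campaign estimated from the sample, decide how many sub-partitions each campaign should get.

```python
import math

def sub_partitions(sampled_counts, rows_per_partition):
    """Histogram-based allocation sketch: spread large campaigns over
    more processes so no single "fat reducer" handles a huge campaign.
    sampled_counts maps campaign id -> estimated row count (from the
    sampled RDD); rows_per_partition is the target partition size."""
    return {campaign: max(1, math.ceil(rows / rows_per_partition))
            for campaign, rows in sampled_counts.items()}
```

With a target of 300 rows per partition, a campaign estimated at 1,000 rows would be spread over 4 sub-partitions, while a tiny campaign keeps a single one.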
Spark ML Pipelines
Raw Feature Transformation → Feature Encoding → Feature Selection (Transformers) → Decision Tree Trainer (Estimator)
Spark ML Pipelines
• Transformer: transform(DataFrame): DataFrame
• Estimator: fit(DataFrame): Model, where Model extends Transformer
• Great for training, evaluation & experimentation
• Can we use them at bid time?
ML Pipelines: Row Transformer
Problem: at bid time there is no DataFrame!
Solution: use a RowTransformer, which extends Transformer:
• Transformer: transform(DataFrame): DataFrame
• RowTransformer: transform(Row): Row
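The extension can be sketched with stand-in classes (Spark's real Transformer works on DataFrames; these minimal Python stand-ins only illustrate the idea of adding a per-row entry point):

```python
class Transformer:
    """Stand-in for an ML pipeline stage that transforms a whole dataset."""
    def transform(self, rows):
        # Batch path (training/evaluation): apply the row logic to each row.
        return [self.transform_row(row) for row in rows]

class RowTransformer(Transformer):
    """Adds a single-row entry point, usable at bid time, where only one
    incoming request "row" exists and there is no DataFrame."""
    def transform_row(self, row):
        raise NotImplementedError

class NormalizeOs(RowTransformer):
    # Hypothetical example stage: lowercase the "os" field.
    def transform_row(self, row):
        return {**row, "os": row["os"].lower()}
```

The same NormalizeOs stage then runs in batch during training (transform) and per request at bid time (transform_row), so training and bidding share one implementation.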
Meta-Pipeline Extension
• Combines and evaluates several pipelines
• DAG with all steps and dependencies
• JSON configurable
• Pipelines = all possible paths from root to leaves
• Used to train multiple classifiers with little overhead (training time is dominated by data read and transformation)
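The "pipelines = all root-to-leaf paths" rule can be sketched as a small traversal (hypothetical step names; the real meta-pipeline is configured in JSON):

```python
def pipelines(dag, step):
    """Enumerate every root-to-leaf path of a step DAG; each path is one
    pipeline. dag maps a step name to the steps that depend on it."""
    children = dag.get(step, [])
    if not children:
        return [[step]]
    return [[step] + rest
            for child in children
            for rest in pipelines(dag, child)]
```

With a shared prefix feeding two trainers, e.g. {"read": ["encode"], "encode": ["tree", "logistic"]}, this yields two pipelines that share the expensive read and encode steps, which is why extra classifiers add little overhead.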
[Chart: AUC by campaign; the best model varies from campaign to campaign.]
Models at bid time
• Standard Java serialization
• Models are relatively lightweight and fast
[Chart: model size in memory (KB); adding preprocessing metadata brings a model to roughly 130 KB.]
[Chart: average latency (milliseconds) at bid time; the Spark RandomForest is comparable to the current DataXu model.]
Use and abuse of SparkSQL
• Choosing features via the select command
• Functional features and categorical-to-numerical encoding via UDFs
• Top-K feature values via a UDAF
• Reuse of UDFs at bid time
• From imperative to declarative
• Huge savings in lines of code
SparkSQL: TopK UDAF Example
For categorical encoding we first obtain the most popular nominal values:
select topk(os) from training_data
Result:
{windows:1562, macos:928, linux:21}
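The aggregation behind that query reduces to a few lines (plain Python standing in for the SparkSQL UDAF; the function and column names come from the slide):

```python
from collections import Counter

def topk(values, k=3):
    """Count each nominal value and keep the k most popular,
    mirroring `select topk(os) from training_data`."""
    return dict(Counter(values).most_common(k))
```

Applied to a small os column, topk(["windows", "macos", "windows", "linux", "windows", "macos"]) returns {"windows": 3, "macos": 2, "linux": 1}, the same shape of result shown above.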
SparkSQL: Feature Encoding
Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data
Result:
1,3
3,1
2,1
One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …
Result:
1,0,0
0,1,0
0,0,1
Easily encode categorical features using UDFs
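Both encodings reduce to one-line functions; a sketch of what such UDFs compute over the top-k vocabulary (the 0-for-unknown convention is an assumption, not from the slides):

```python
def enumerate_encode(value, vocabulary):
    """Map a nominal value to its 1-based index in the top-k vocabulary;
    values outside the vocabulary map to 0 ("unknown") in this sketch."""
    return vocabulary.index(value) + 1 if value in vocabulary else 0

def onehot(value, target):
    """1 if the value equals the target category, else 0."""
    return 1 if value == target else 0
```

For example, with vocabulary ["windows", "macos", "linux"], "macos" enumerate-encodes to 2, and its one-hot row is onehot against each category in turn: 0, 1, 0.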
Takeaways
• It works!
• Spark SQL: maintainable & declarative
• Models can bid in real time
• Automated & unattended ML at large scale
• ML Pipelines had to be extended