Multi Model Machine Learning by Maximo Gurmendez and Beth Logan

Upload: spark-summit

Post on 21-Apr-2017

Page 1: Multi Model Machine Learning by Maximo Gurmendez and Beth Logan

Multi-Model Machine Learning for Real Time Bidding over Display Ads

Beth Logan, Senior Director of Optimization

Maximo Gurmendez, Data Science Engineering Team Lead

With credit to our Spark developers: Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane

Page 2

We wanted to try Spark but wondered:

• Is Spark thread safe?

• Is Spark fast enough?

• Does it use too much memory?

Page 3

Agenda

1. What we do

2. How we do it

3. Why Spark?

4. Challenges Addressed

5. Main Takeaways

Page 4

DataXu’s Mission

Make marketing smarter through Data Science!

Page 5

What We Do

Page 6

Taking Action Automatically

• Bid in real-time ad auctions on behalf of advertisers

• Machine Learning System learns from past bids

[Diagram labels: User; Browser; Request; Ad; Ad exchanges; Ad Bid Request; Ad Selection + Bid; DataXu Real time system; DataXu Machine Learning system]

Page 7

DataXu ML System

[Diagram labels: Ads shown; User actions (purchases, clicks, etc.); Hive database; Hadoop; Learn Models; Calibrate; Evaluate; Only high quality models; Real Time Bidding]

Page 8

Why is this hard?

Huge Scale

• 2 Petabytes Processed Daily

• 1.6 Million Bid Decisions Per Second

• Runs 24 x 7 on 5 Continents

• Thousands of ML Models Trained per Day

Unattended Operation

• Model training and deployment runs automatically every day

Changing Industry

• Need ability to adapt quickly to new customer requirements

Page 9

Why Spark?

• Large open source Machine Learning library

– Fast turnaround of research to production

– Easy to prototype and support new customer use cases

– Built-in upgrade of algorithms

– Increased reliability

• Trains models faster than Hadoop

• Enables iterative models

• Elastic environment via cloud

Page 10

Challenges Addressed

• Smart Dataset Partitioning by Campaign

• Categorical Features

• Functional Features

• Pipelines + RowTransformers

• Use of SparkSQL

• Real-time model instantiations

Page 11

Partitioning the data

• Need 1 RDD per campaign

• "Fat Reducers" or "Many files" problem

• 2-pass solution

Page 12

Partitioning the data: Solution

• Sample the RDD

• Construct histogram of sizes

• Use histogram to allocate more processes (pseudo-sub-partition)
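The two-pass idea above can be sketched in plain Python. This is an illustration under stated assumptions, not DataXu's Spark code: records are assumed to be (campaign, payload) pairs, and the names `estimate_sizes`, `allocate_buckets`, `bucket_for`, and the `rows_per_bucket` threshold are hypothetical.

```python
import random
from collections import Counter

def estimate_sizes(records, sample_fraction, seed=0):
    """Pass 1: sample the records and build a histogram of estimated rows per campaign."""
    rng = random.Random(seed)
    sampled = Counter(campaign for campaign, _ in records
                      if rng.random() < sample_fraction)
    # Scale the sampled counts back up to an estimate of the true sizes.
    return {campaign: round(n / sample_fraction) for campaign, n in sampled.items()}

def allocate_buckets(histogram, rows_per_bucket):
    """Give each campaign ceil(size / rows_per_bucket) sub-partitions, at least one."""
    return {campaign: max(1, -(-size // rows_per_bucket))
            for campaign, size in histogram.items()}

def bucket_for(campaign, row_index, buckets):
    """Pass 2: route a record to a (campaign, sub-partition) key.

    Campaigns that never appeared in the sample fall back to a single bucket.
    """
    return (campaign, row_index % buckets.get(campaign, 1))

if __name__ == "__main__":
    # Hypothetical skewed data: one large campaign and one small one.
    data = [("big", i) for i in range(10_000)] + [("small", i) for i in range(50)]
    sizes = estimate_sizes(data, sample_fraction=0.1)
    buckets = allocate_buckets(sizes, rows_per_bucket=2_500)
    print(sizes, buckets)
```

Spreading a large campaign over several keys avoids one "fat" reducer, while small campaigns keep a single key so they do not multiply the number of output files.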

Page 13

Spark ML Pipelines

[Diagram: Raw Feature Transformation, Feature Encoding, Feature Selection (Transformers), followed by Decision Tree Trainer (Estimator)]

Page 14

Spark ML Pipelines

Transformer: transform(DataFrame):DataFrame

Estimator: fit(DataFrame):Model

Model extends Transformer

• Great for training, evaluation & experimentation

• Can we use them at bid time?
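The Transformer / Estimator / Model relationship can be illustrated with a minimal plain-Python sketch. It mirrors the shape of the Spark ML contract but is not the Spark API: a dataset is assumed to be a list of dicts, and `LowercaseOs`, `MajorityClassEstimator`, and `fit_pipeline` are hypothetical names.

```python
from collections import Counter

class Transformer:
    """Maps a dataset (here: a list of dicts) to a new dataset."""
    def transform(self, rows):
        raise NotImplementedError

class Estimator:
    """Learns from a dataset and returns a fitted Model."""
    def fit(self, rows):
        raise NotImplementedError

class Model(Transformer):
    """A fitted Estimator; it is itself a Transformer."""

class LowercaseOs(Transformer):
    # Hypothetical raw-feature transformation.
    def transform(self, rows):
        return [{**r, "os": r["os"].lower()} for r in rows]

class MajorityModel(Model):
    def __init__(self, label):
        self.label = label
    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]

class MajorityClassEstimator(Estimator):
    # Stand-in for a real trainer: always predicts the most common label.
    def fit(self, rows):
        label = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityModel(label)

def fit_pipeline(stages, rows):
    """Fit estimators in order; return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        if isinstance(stage, Estimator):
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)
    return fitted
```

The fitted pipeline is just a list of transformers, which is why the same stages can be reused for evaluation and experimentation on any dataset of the same shape.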

Page 15

ML Pipelines: Row Transformer

Problem: At bid time there is no DataFrame!

Solution: Use a row transformer

Transformer: transform(DataFrame):DataFrame

RowTransformer extends Transformer: transform(Row):Row
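A rough plain-Python sketch of the row-transformer idea follows. It assumes a row is a dict and a dataset is a list of rows; the class `HourOfDay`, the method name `transform_row`, and the `timestamp` field are illustrative assumptions, not the talk's actual code.

```python
class RowTransformer:
    """A transformer defined one row at a time.

    The dataset-level transform is derived from the row-level one, so the
    same logic serves batch training and single-request scoring.
    """
    def transform_row(self, row):
        raise NotImplementedError

    def transform(self, rows):
        return [self.transform_row(row) for row in rows]

class HourOfDay(RowTransformer):
    # Hypothetical functional feature: UTC hour derived from an epoch timestamp.
    def transform_row(self, row):
        return {**row, "hour": (row["timestamp"] // 3600) % 24}

if __name__ == "__main__":
    feature = HourOfDay()
    # Training time: apply to a whole dataset.
    training = feature.transform([{"timestamp": 0}, {"timestamp": 7200}])
    # Bid time: apply to the single incoming request, no dataset needed.
    request = feature.transform_row({"timestamp": 3 * 3600 + 15})
    print(training, request)
```

Deriving the batch path from the row path keeps training and bidding on one code path, which reduces the risk of the two drifting apart.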

Page 16

Meta-Pipeline Extension

• Combines and evaluates several pipelines

• DAG with all steps and dependencies

• JSON Configurable

• Pipelines = All possible paths from root to leaves

• Use to train multiple classifiers with little overhead (training time dominated by data read and transformation)
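The "all possible paths from root to leaves" rule can be sketched as below. The JSON layout, the step names, and the function `pipelines` are assumptions made for illustration; the talk does not show the real configuration format.

```python
import json

# Hypothetical meta-pipeline config: each step lists the steps that may follow it.
CONFIG = json.loads("""
{
  "root": "encode",
  "steps": {
    "encode": ["select_top_features", "select_all_features"],
    "select_top_features": ["decision_tree", "logistic_regression"],
    "select_all_features": ["decision_tree"],
    "decision_tree": [],
    "logistic_regression": []
  }
}
""")

def pipelines(config):
    """Return every root-to-leaf path of the step graph (assumed acyclic)."""
    def walk(step, prefix):
        path = prefix + [step]
        children = config["steps"][step]
        if not children:
            yield path  # a leaf closes one complete pipeline
        for child in children:
            yield from walk(child, path)
    return list(walk(config["root"], []))

if __name__ == "__main__":
    for path in pipelines(CONFIG):
        print(" -> ".join(path))
```

Because several paths share the same leading steps, an implementation could compute each shared prefix once and reuse its output, which is consistent with the slide's point that extra classifiers add little overhead.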

Page 17

Best model varies from campaign to campaign

[Chart; axis label: AUC]

Page 18

Models at bid time

• Standard Java serialization

• Models relatively light-weight and fast

[Charts: Model Size in Memory (KB) and Avg. Latency (milliseconds) for Current DataXu Model vs. Spark RandomForest; annotation: Add Preprocessing Metadata ~ 130K]

Page 19

Use and abuse of SparkSQL

• Choosing features via select command

• Functional features and categorical to numerical encoding via UDFs

• Top K feature values via UDAF

• Reuse UDFs at bid time

• Imperative to declarative

• Huge savings in LOC

Page 20

SparkSQL: TopK UDAF Example

For categorical encoding we first obtain the most popular nominals:

select topk(os) from training_data

Result:

{windows:1562, macos:928, linux:21}
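A UDAF typically exposes update, merge, and evaluate steps so that partial aggregates from different partitions can be combined. The plain-Python class below sketches that shape for a top-K count; it is not the Spark UDAF API, and the class and method names are assumptions.

```python
from collections import Counter

class TopK:
    """Aggregate that keeps value counts and reports the k most frequent."""
    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def update(self, value):
        """Fold one input value into this partial aggregate."""
        self.counts[value] += 1

    def merge(self, other):
        """Combine with a partial aggregate built on another partition."""
        self.counts.update(other.counts)

    def evaluate(self):
        """Return the k most frequent values with their counts."""
        return dict(self.counts.most_common(self.k))

if __name__ == "__main__":
    # Two hypothetical partitions of the `os` column.
    left, right = TopK(2), TopK(2)
    for value in ["windows", "macos", "windows"]:
        left.update(value)
    for value in ["linux", "windows", "macos"]:
        right.update(value)
    left.merge(right)
    print(left.evaluate())
```

Keeping full counts until `evaluate` makes the merge exact; truncating to k inside each partition would be cheaper but could drop a value that is globally popular.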

Page 21

SparkSQL: Feature Encoding

Easily encode categorical features using UDFs

Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data

Result:

1,3

3,1

2,1

One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …

Result:

1,0,0

0,1,0

0,0,1
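The two encodings can be sketched as ordinary Python functions of the kind that might be registered as UDFs. The 1-based indexing follows the sample results on the slide; mapping unseen values to 0 and the helper name `make_enumerate_encoder` are assumptions.

```python
def make_enumerate_encoder(vocabulary):
    """Build an encoder mapping each known value to its 1-based position.

    `vocabulary` would typically be the top-K values, most popular first.
    Values outside the vocabulary map to 0 (an assumption of this sketch).
    """
    index = {value: i for i, value in enumerate(vocabulary, start=1)}
    return lambda value: index.get(value, 0)

def onehot(value, target):
    """Return 1 when the value equals the target category, else 0."""
    return 1 if value == target else 0

if __name__ == "__main__":
    encode_os = make_enumerate_encoder(["windows", "macos", "linux"])
    for os_name in ["macos", "windows", "linux", "bsd"]:
        one_hot = [onehot(os_name, t) for t in ("macos", "windows", "linux")]
        print(os_name, encode_os(os_name), one_hot)
```

Since both functions work on a single value, the same code can run inside a SQL query at training time and on one request at bid time.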

Page 22

Takeaways

• It works!

• Spark SQL: maintainable & declarative

• Models can bid in real time

• Automated & unattended ML at large scale

• ML Pipelines had to be extended

Page 23

THANK YOU

[email protected]

[email protected]

dataxu.com/careers

always looking for smart people!!