Multi-Model Machine Learning by Maximo Gurmendez and Beth Logan
TRANSCRIPT
Multi-Model Machine Learning for
Real Time Bidding over Display Ads
Beth Logan
Senior Director of Optimization
Maximo Gurmendez
Data Science Engineering Team Lead
With credit to our Spark developers:
Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane
We wanted to try Spark but wondered:
• Is Spark thread-safe?
• Is Spark fast enough?
• Does it use too much memory?
Agenda
1. What we do
2. How we do it
3. Why Spark?
4. Challenges Addressed
5. Main Takeaways
DataXu’s Mission
Make marketing
smarter through
Data Science!
What We Do
Taking Action Automatically
• Bid in real-time ad auctions on behalf of advertisers
• Machine Learning System learns from past bids
[Diagram: A user's browser sends an ad request to the ad exchanges; the exchanges send a bid request to the DataXu real-time system, which performs ad selection and bidding; the DataXu machine learning system feeds models to the real-time system.]
DataXu ML System
[Diagram: Ads shown and user actions (purchases, clicks, etc.) land in a Hive database on Hadoop; models are learned, calibrated, and evaluated; only high-quality models are promoted to real-time bidding.]
Why is this hard?
Huge Scale
• 2 petabytes processed daily
• 1.6 million bid decisions per second
• Runs 24×7 on 5 continents
• Thousands of ML models trained per day
Unattended Operation
• Model training and deployment runs automatically every day
Changing Industry
• Need the ability to adapt quickly to new customer requirements
Why Spark?
• Large open source machine learning library
– Fast turnaround from research to production
– Easy to prototype and support new customer use cases
– Built-in upgrades of algorithms
– Increased reliability
• Trains models faster than Hadoop
• Enables iterative models
• Elastic environment via the cloud
Challenges Addressed
• Smart Dataset Partitioning by Campaign
• Categorical Features
• Functional Features
• Pipelines + RowTransformers
• Use of SparkSQL
• Real-time model instantiations
Partitioning the data
• Need 1 RDD per campaign
• "Fat Reducers" or "Many files" problem
• 2-pass solution
Partitioning the data: Solution
• Sample the RDD
• Construct a histogram of campaign sizes
• Use the histogram to allocate more processes to large campaigns (pseudo-sub-partition)
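The allocation step of this two-pass solution can be sketched as plain logic (all names here are illustrative, not DataXu's actual code): given row counts per campaign estimated from the sample, decide how many sub-partitions each campaign should get.

```python
import math

def sub_partitions(sampled_counts, rows_per_partition):
    """Histogram-based allocation sketch: spread large campaigns over
    more processes so no single "fat reducer" handles a huge campaign.
    sampled_counts maps campaign id -> estimated row count (from the
    sampled RDD); rows_per_partition is the target partition size."""
    return {campaign: max(1, math.ceil(rows / rows_per_partition))
            for campaign, rows in sampled_counts.items()}
```

With a target of 300 rows per partition, a campaign estimated at 1,000 rows would be spread over 4 sub-partitions, while a tiny campaign keeps a single one.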
Spark ML Pipelines
Raw Feature Transformation → Feature Encoding → Feature Selection (Transformers) → Decision Tree Trainer (Estimator)
Spark ML Pipelines
• Transformer: transform(DataFrame): DataFrame
• Estimator: fit(DataFrame): Model, where Model extends Transformer
• Great for training, evaluation & experimentation
• Can we use them at bid time?
ML Pipelines: Row Transformer
Problem: at bid time there is no DataFrame!
Solution: use a RowTransformer, which extends Transformer:
• Transformer: transform(DataFrame): DataFrame
• RowTransformer: transform(Row): Row
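The extension can be sketched with stand-in classes (Spark's real Transformer works on DataFrames; these minimal Python stand-ins only illustrate the idea of adding a per-row entry point):

```python
class Transformer:
    """Stand-in for an ML pipeline stage that transforms a whole dataset."""
    def transform(self, rows):
        # Batch path (training/evaluation): apply the row logic to each row.
        return [self.transform_row(row) for row in rows]

class RowTransformer(Transformer):
    """Adds a single-row entry point, usable at bid time, where only one
    incoming request "row" exists and there is no DataFrame."""
    def transform_row(self, row):
        raise NotImplementedError

class NormalizeOs(RowTransformer):
    # Hypothetical example stage: lowercase the "os" field.
    def transform_row(self, row):
        return {**row, "os": row["os"].lower()}
```

The same NormalizeOs stage then runs in batch during training (transform) and per request at bid time (transform_row), so training and bidding share one implementation.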
Meta-Pipeline Extension
• Combines and evaluates several pipelines
• DAG with all steps and dependencies
• JSON configurable
• Pipelines = all possible paths from root to leaves
• Used to train multiple classifiers with little overhead (training time is dominated by data read and transformation)
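The "pipelines = all root-to-leaf paths" rule can be sketched as a small traversal (hypothetical step names; the real meta-pipeline is configured in JSON):

```python
def pipelines(dag, step):
    """Enumerate every root-to-leaf path of a step DAG; each path is one
    pipeline. dag maps a step name to the steps that depend on it."""
    children = dag.get(step, [])
    if not children:
        return [[step]]
    return [[step] + rest
            for child in children
            for rest in pipelines(dag, child)]
```

With a shared prefix feeding two trainers, e.g. {"read": ["encode"], "encode": ["tree", "logistic"]}, this yields two pipelines that share the expensive read and encode steps, which is why extra classifiers add little overhead.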
[Chart: AUC by campaign; the best model varies from campaign to campaign.]
Models at bid time
• Standard Java serialization
• Models are relatively lightweight and fast
[Chart: model size in memory (KB); adding preprocessing metadata brings a model to roughly 130 KB.]
[Chart: average latency (milliseconds) at bid time; the Spark RandomForest is comparable to the current DataXu model.]
Use and abuse of SparkSQL
• Choosing features via the select command
• Functional features and categorical-to-numerical encoding via UDFs
• Top-K feature values via a UDAF
• Reuse of UDFs at bid time
• From imperative to declarative
• Huge savings in lines of code
SparkSQL: TopK UDAF Example
For categorical encoding we first obtain the most popular nominal values:
select topk(os) from training_data
Result:
{windows:1562, macos:928, linux:21}
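The aggregation behind that query reduces to a few lines (plain Python standing in for the SparkSQL UDAF; the function and column names come from the slide):

```python
from collections import Counter

def topk(values, k=3):
    """Count each nominal value and keep the k most popular,
    mirroring `select topk(os) from training_data`."""
    return dict(Counter(values).most_common(k))
```

Applied to a small os column, topk(["windows", "macos", "windows", "linux", "windows", "macos"]) returns {"windows": 3, "macos": 2, "linux": 1}, the same shape of result shown above.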
SparkSQL: Feature Encoding
Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data
Result:
1,3
3,1
2,1
One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …
Result:
1,0,0
0,1,0
0,0,1
Easily encode categorical features using UDFs
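Both encodings reduce to one-line functions; a sketch of what such UDFs compute over the top-k vocabulary (the 0-for-unknown convention is an assumption, not from the slides):

```python
def enumerate_encode(value, vocabulary):
    """Map a nominal value to its 1-based index in the top-k vocabulary;
    values outside the vocabulary map to 0 ("unknown") in this sketch."""
    return vocabulary.index(value) + 1 if value in vocabulary else 0

def onehot(value, target):
    """1 if the value equals the target category, else 0."""
    return 1 if value == target else 0
```

For example, with vocabulary ["windows", "macos", "linux"], "macos" enumerate-encodes to 2, and its one-hot row is onehot against each category in turn: 0, 1, 0.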
Takeaways
• It works!
• Spark SQL: maintainable & declarative
• Models can bid in real time
• Automated & unattended ML at large scale
• ML Pipelines had to be extended