training large-scale ad ranking models in spark

Post on 26-Jan-2017








Training Large-scale Ad Ranking Models in Spark

PRESENTED BY Patrick Pletscher | October 19, 2015

About Us


Michal Aharon Oren Somekh Yaacov Fernandess Yair Koren

Amit Kagian Shahar Golan Raz Nissim Patrick Pletscher

Amir Ingber



What We Do


Research focused on ad ranking algorithms for Yahoo Gemini Native Ads

Ad Ranking Overview


• Advertisers run several campaigns, each with several ads

• Each ad has a bid set by the advertiser; different ad price types

- pay per view

- pay per click

- various conversion price types

• Auction for each impression on a Gemini Native enabled property

- auction between all eligible ads (filter by targeting/budget)

- ad with the highest expected revenue is determined

• Need to know the (personalized!) probability of a click

- we mostly get money for clicks / conversions!

[Figure: a user and two candidate ads, Ad 1 and Ad 2, annotated with click probabilities (5%, 1%) and bids (5c, 2c)]
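The auction step above can be sketched as follows. This is a minimal illustration, assuming pay-per-click ads whose expected revenue per impression is the bid times the predicted click probability. The `Ad` case class, the `Auction` object and the example numbers are hypothetical and not taken from the production system.

```scala
// Hypothetical ad record: bid in cents and predicted click probability.
final case class Ad(name: String, bidCents: Double, pClick: Double) {
  // Expected revenue per impression for a pay-per-click ad (assumption).
  def expectedRevenue: Double = bidCents * pClick
}

object Auction {
  // Pick the eligible ad with the highest expected revenue, if there is one.
  def winner(eligible: Seq[Ad]): Option[Ad] =
    if (eligible.isEmpty) None else Some(eligible.maxBy(_.expectedRevenue))
}

// Illustrative numbers only: the ad with the lower bid can still win
// when its predicted click probability is high enough.
val ads = Seq(
  Ad("Ad 1", bidCents = 5.0, pClick = 0.01),
  Ad("Ad 2", bidCents = 2.0, pClick = 0.05)
)
val best = Auction.winner(ads)
```

With these numbers the lower-bid ad wins, which is why an accurate, personalized click probability matters as much as the bid.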

Click-Through Rate (CTR) Prediction


• Given a user and context, predict probability of a click for an ad.

• Probably the most “profitable” machine learning problem in industry

- simple binary problem; but want probabilities, not just the label

- very skewed label distribution: clicks << skips

- tons of data (every impression generates a training example)

- limitations at serving: need to predict quickly

• Basic setting quite well-studied; scale makes it challenging

- Google (McMahan et al. 2013)

- Facebook (He et al. 2014)

- Yahoo (Aharon et al. 2013)

- others (Chapelle et al. 2014)

• Some more involved research topics

- Exploration/Exploitation tradeoff

- Learning from logged feedback

Overview - CTR Prediction for Gemini Native Ads


• Collaborative Filtering approach (Aharon et al. 2013)

- Current production system

- Implemented in Hadoop MapReduce

- Used in Gemini Native ad ranking

• Large-scale Logistic Regression

- A research prototype

- Implemented in Spark

- The combination of Spark & Scala allows us to iterate quickly

- Takes several concepts from the CF approach

Large-scale Logistic Regression in Spark

Apache Spark


• “Apache Spark is a fast and general engine for large-scale data processing”

• Similar to Hadoop

• Advantages over Hadoop MapReduce

- Option to cache data in memory, great for iterative computations

- A lot of syntactic sugar

‣ filter, reduceByKey, distinct, sortByKey, join

‣ in general Spark/Scala code very concise

- Spark Shell, great for interactive/ETL* workflows

- Dataframes interesting for data scientists coming from R / Python

• Includes modules for

- machine learning

- streaming

- graph computations

- SQL / Dataframes

*ETL: Extract, transform, load

Spark at Yahoo


• Spark 1.5.1, the latest version of Spark

• Runs on top of Hadoop YARN 2.6

- integrates nicely with existing Hadoop tools and infrastructure at Yahoo

- data is generally stored in HDFS

• Clusters are centrally managed

• Large Hadoop deployment at Yahoo

- A few different clusters

- Each has at least a few thousand nodes

HDFS (storage)

YARN (resource management)


Dataset for CTR Prediction


• Billions of ad impressions daily

- Need for Streaming / Batched Streaming

- Each impression has a unique id

• Need click information for every impression for learning

- Join impressions with a click stream every x minutes

- Need to wait for the click; introduces some delay

[Figure: timeline at 18:30, 18:45 and 19:00; each batch of impressions is turned into labeled events. In Spark: union & reduceByKey]

Example - Joining Impression & Click RDDs


val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))

val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))

keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) =>
    t match {
      case "ci" => Some(LabeledEvent(event, clicked = 1))
      case "i"  => Some(LabeledEvent(event, clicked = 0))
      case "c"  => None // click without a matching impression
    }
  }

def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1            // de-dupe
    case ("i", "i") => event1            // de-dupe
    case ("c", "i") => ("ci", event2._2) // combine click and impression
    case ("i", "c") => ("ci", event1._2) // combine click and impression
    case ("ci", _)  => event1            // de-dupe
    case (_, "ci")  => event2            // de-dupe
  }
}
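To make the join semantics easier to check without a cluster, here is a minimal local sketch of the same idea using plain Scala collections. `groupBy` followed by `reduce` stands in for `union` + `reduceByKey`. The `Event` and `LabeledEvent` definitions and the `JoinSketch` object are hypothetical, since the real classes are not shown in the slides.

```scala
// Hypothetical minimal event types; the real classes are not shown in the slides.
final case class Event(joinKey: String, payload: String)
final case class LabeledEvent(event: Event, clicked: Int)

object JoinSketch {
  type Tagged = (String, Event)

  // Same merge rule as above: "ci" marks an impression that received a click.
  def smartCombine(a: Tagged, b: Tagged): Tagged = (a._1, b._1) match {
    case ("c", "i") => ("ci", b._2) // keep the impression event
    case ("i", "c") => ("ci", a._2) // keep the impression event
    case ("ci", _)  => a            // already combined
    case (_, "ci")  => b            // already combined
    case _          => a            // ("c", "c") or ("i", "i"): de-dupe
  }

  // Local stand-in for union + reduceByKey + flatMap on RDDs.
  def label(impressions: Seq[Event], clicks: Seq[Event]): Seq[LabeledEvent] = {
    val tagged: Seq[(String, Tagged)] =
      impressions.map(e => (e.joinKey, ("i", e))) ++
        clicks.map(e => (e.joinKey, ("c", e)))
    tagged.groupBy(_._1).values.toSeq.flatMap { group =>
      group.map(_._2).reduce(smartCombine) match {
        case ("ci", e) => Some(LabeledEvent(e, clicked = 1))
        case ("i", e)  => Some(LabeledEvent(e, clicked = 0))
        case _         => None // click without a matching impression
      }
    }
  }
}
```

Because the merge rule gives the same result regardless of the order in which events for a key are combined, it is safe to use with `reduceByKey`, which combines values in no particular order.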

Incremental Learning Architecture


[Figure: incremental learning architecture over batches at 18:30, 18:45 and 19:00: impressions → labeled events → feature extraction → learning examples]



Large-scale Logistic Regression


• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)

• Models the probability of a click as

$$p(\text{click} \mid x) = \frac{1}{1 + \exp(-w^\top x)}$$

- feature vector $x$

‣ high-dimensional vector but sparse (few non-zero values)

‣ model expressivity controlled by the features

‣ a lot of hand-tuning and playing around

- model parameters $w$

‣ need to be learned

‣ generally rather non-sparse
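As a small illustration of why sparsity in the feature vector matters, the prediction only needs to touch the non-zero features. The sketch below assumes a dense weight array and a sparse feature vector stored as an index-to-value map. The `LogisticSketch` object and its names are hypothetical.

```scala
import scala.math.exp

object LogisticSketch {
  // Sparse feature vector: only the non-zero entries, as index -> value.
  type SparseVector = Map[Int, Double]

  // Logistic function applied to the dot product of weights and features.
  // Only the non-zero features are visited, so the cost depends on the
  // number of active features rather than on the full dimensionality.
  def clickProbability(weights: Array[Double], x: SparseVector): Double = {
    val margin = x.foldLeft(0.0) { case (acc, (i, v)) => acc + weights(i) * v }
    1.0 / (1.0 + exp(-margin))
  }
}
```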

Features for Logistic Regression


• Basic features

- age, gender

- browser, device

• Feature crosses

- E.g. age x gender x state (30 year old male from Boston)

- mostly indicator features

- Examples:

‣ gender^age m^30

‣ gender^device m^Windows_NT

‣ gender^section m^5417810

‣ gender^state m^2347579

‣ age^device 30^Windows_NT

• Feature hashing to get a vector of fixed length

- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index

- will introduce collisions! Choose dimensionality large enough
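The hashing step can be sketched as below. This is a minimal illustration, assuming indicator features and Scala's built-in `MurmurHash3`; the slides do not say which hash function is used, and the `HashingSketch` object, the `"="` separator and the bucket count are assumptions.

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  // Map a (feature name, value) pair, e.g. ("gender^age", "m^30"),
  // to an index in [0, numBuckets).
  def index(name: String, value: String, numBuckets: Int): Int = {
    val h = MurmurHash3.stringHash(name + "=" + value)
    ((h % numBuckets) + numBuckets) % numBuckets // non-negative modulo
  }

  // Indicator features: each active pair contributes 1.0 at its index.
  // Pairs that collide simply add up in the same bucket.
  def vectorize(features: Seq[(String, String)], numBuckets: Int): Map[Int, Double] =
    features
      .groupBy { case (n, v) => index(n, v, numBuckets) }
      .map { case (i, fs) => i -> fs.size.toDouble }
}
```

A larger `numBuckets` lowers the collision rate at the cost of a longer weight vector, which is the trade-off behind choosing the dimensionality large enough.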

Parameter Estimation


• Basic Problem: Regularized Maximum Likelihood

$$\min_w \; -\sum_i \log p(y_i \mid x_i; w) + \lambda \, R(w)$$

- first term: fit the training data; second term (regularizer $R$): prevent overfitting

- Often: L1 regularization instead of L2

‣ promotes sparsity in the weight vector

‣ more efficient predictions in serving (also requires less memory!)

- Batch vs. streaming

‣ in our case: batched streaming, every x min perform an incremental model update

• Follow-the-regularized-leader, FTRL (McMahan et al. 2013)

- sequential online algorithm: only use a data point once

- similar to stochastic gradient descent

- per coordinate learning rates

- encourages sparseness

- FTRL stores weight and accumulated gradient per coordinate
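For reference, a single-machine version of the per-coordinate update can be sketched as follows. It follows the FTRL-Proximal update for logistic regression described in McMahan et al. 2013, not the code used in this project. The `Ftrl` class, its field names and the default hyperparameter values are assumptions chosen for illustration.

```scala
import scala.math.{abs, exp, signum, sqrt}

// Hypothetical minimal FTRL-Proximal learner for logistic regression.
// Examples are sparse: a sequence of (index, value) pairs; labels are 0 or 1.
final class Ftrl(dim: Int,
                 alpha: Double = 0.1, beta: Double = 1.0,
                 l1: Double = 1.0, l2: Double = 1.0) {
  private val z = new Array[Double](dim) // adjusted gradient sums
  private val n = new Array[Double](dim) // sums of squared gradients

  // Weights are derived lazily from z and n; L1 clamps small ones to zero.
  def weight(i: Int): Double =
    if (abs(z(i)) <= l1) 0.0
    else -(z(i) - signum(z(i)) * l1) / ((beta + sqrt(n(i))) / alpha + l2)

  def predict(x: Seq[(Int, Double)]): Double = {
    val margin = x.map { case (i, v) => weight(i) * v }.sum
    1.0 / (1.0 + exp(-margin))
  }

  // One online step: each example is used exactly once.
  def update(x: Seq[(Int, Double)], y: Int): Unit = {
    val p = predict(x)
    x.foreach { case (i, v) =>
      val g = (p - y) * v                                    // gradient of log loss
      val sigma = (sqrt(n(i) + g * g) - sqrt(n(i))) / alpha  // per-coordinate rate change
      z(i) += g - sigma * weight(i)
      n(i) += g * g
    }
  }
}
```

Coordinates that never appear in an example keep a weight of exactly zero, and the L1 threshold keeps rarely useful coordinates at zero as well, which is where the sparsity comes from.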

Basic Parallelized FTRL in Spark


def train(examples: RDD[LearningExample]): Unit = {
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }
  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]): Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples

  // hack: actually a single result, but Spark expects an iterator!
  Iterator((deltaWeights, deltaCounts))
}
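The averaging scheme above can be tried out locally with plain Scala collections. The sketch below is only an illustration of the "train per partition, sum the deltas, divide by the number of workers" pattern: a plain gradient step on squared error stands in for the FTRL update, and the `AveragingSketch` object and all names in it are hypothetical.

```scala
object AveragingSketch {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // Hypothetical per-partition learner: starts from the shared weights and
  // returns the change it would apply after seeing its own examples.
  // A plain gradient step on squared error stands in for FTRL here.
  def partitionDelta(examples: Seq[(Vec, Double)], weights: Vec, lr: Double): Vec = {
    val local = weights.clone()
    for ((x, y) <- examples) {
      val err = local.zip(x).map { case (w, v) => w * v }.sum - y
      for (i <- local.indices) local(i) -= lr * err * x(i)
    }
    local.zip(weights).map { case (l, w) => l - w }
  }

  // One round: every partition trains independently, the deltas are summed
  // (the role of treeReduce) and averaged over the number of partitions.
  def trainRound(partitions: Seq[Seq[(Vec, Double)]], weights: Vec, lr: Double): Vec = {
    val summed = partitions.map(partitionDelta(_, weights, lr)).reduce(add)
    weights.zip(summed).map { case (w, d) => w + d / partitions.size }
  }
}
```

Averaging keeps the size of one round's update independent of the number of workers, at the price of each worker not seeing the others' updates within the round.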

Summary: LR with Spark


• Efficient: Can learn on all the data

- before: somewhat aggressive subsampling of the skips

• Possible to do feature pre-processing

- in Hadoop MapReduce much harder: only one pass over data

- drop infrequent features, TF-IDF, …

• Spark-shell as a life-saver

- helps to debug problems as one can inspect intermediate results at scale

- have yet to try Zeppelin notebooks

• Easy to unit test complex workflows

Spark: Lessons Learned



• Spark has a fairly regular 3-month release schedule

• Always run with the latest version

- Lots of bugs get fixed

- Difficult to keep up with new functionality (see DataFrame vs. RDD)

• Speed improvements over the past year



• Our solution

- config directory containing

‣ Logging:

‣ Spark itself: spark-defaults.conf

‣ our code: application.conf

- two versions of configs: local & cluster

- in YARN: specify them using --files argument & SPARK_CONF_DIR variable

• Use Typesafe’s config library for all application related configs

- provide sensible defaults for everything

- override using application.conf

• Do not hard-code any configurations in code



• Use accumulators for ensuring correctness!

• Example:

- parse data, ignore event if there is a problem with the data

- use accumulator to count these failed lines

import org.apache.spark.Accumulator

class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
    } catch {
      case e: Exception =>
        // count the bad line and drop it
        failedLinesAccumulator += 1
        None
    }
  }
}

// pass the accumulator itself; Parser expects an Accumulator[Int], not an Option
val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))

RDD vs. DataFrame in Spark


• Initially Spark advocated Resilient Distributed Datasets (RDDs) for data set abstraction

- type-safe

- usually stores some Scala case class

- code relatively easy to understand

• Recently Spark is pushing towards using DataFrame

- similar to R and Python’s Pandas data frames

- some advantages

‣ less rigid types: can append columns

‣ speed

- disadvantage: code readability suffers for non-basic types

‣ user defined types

‣ user defined functions

• Have not fully migrated to it yet

Every Day I’m Shuffling…


• Careful with operations which send a lot of data over the network

- reduceByKey

- repartition / shuffle

• Careful with sending too much data to the driver

- collect

- reduce

• Found mapPartitions & treeReduce useful in some cases (see FTRL example)

• Play with Spark configurations: frameSize, maxResultSize, timeouts…

[Figure: example pipeline textFile → flatMap → map → reduceByKey]

Machine Learning in Spark


• Relatively basic

- some algorithms don’t scale so well

- not customizable enough for experts:

‣ optimizers that assume a regularizer

‣ built our own DSL for feature extraction & combination

‣ a lot of the APIs are not exposed, i.e. private to Spark

- will hopefully get there eventually

• Nice: new Transformer / Estimator / Pipeline approach

- Inspired by scikit-learn, makes it easy to combine different algorithms

- Requires DataFrame

- Example (from Spark docs)

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val model =

Thank you!
