Scaling Out Logistic Regression with Apache Spark
Barak Gitsis, SimilarWeb case study


TRANSCRIPT

Page 1: Scaling out logistic regression with Spark

Scaling Out Logistic Regression with Apache Spark
Barak Gitsis, SimilarWeb case study

Page 2: Scaling out logistic regression with Spark

Nir Cohen - CTO

Page 3: Scaling out logistic regression with Spark

General Background about the company
› The company was founded 8 years ago
› ~300 employees worldwide
› 240 employees in Israel
› Stay updated about our open positions on our website. You can contact [email protected]
› Nir Cohen – [email protected]

Page 4: Scaling out logistic regression with Spark

The product

Page 5: Scaling out logistic regression with Spark

Data Size
› 650 servers total
› Several Hadoop clusters – 120 servers in the biggest
› 5 HBase clusters
› Couchbase clusters
› Kafka clusters
› MySQL Galera clusters
› 5 TB of new data every day
› Full data backup to S3

Page 6: Scaling out logistic regression with Spark

Plan for the next hour or so
› The need
› Some history
› Spark-related algorithmic intuitions
› Dive into Spark
› Our additions
› Runtime issues
› Current categorization algorithm

Page 7: Scaling out logistic regression with Spark

The Need

Page 8: Scaling out logistic regression with Spark

Need: The Customer

Page 9: Scaling out logistic regression with Spark

Need: The Product

Page 10: Scaling out logistic regression with Spark

Need: The Product – Direct Competitors

Page 11: Scaling out logistic regression with Spark

Need: How would you classify the Web?
› Crawl the web
› Collect data about each website
› Manually classify a few
› Use machine learning to derive a model
› Classify all the websites we’ve seen

Page 12: Scaling out logistic regression with Spark

Some History

Page 13: Scaling out logistic regression with Spark

LEARNING SET: CLASSES

› Shopping
  – Clothing
  – Consumer Electronics
  – Jewelry
  – …

› Sports
  – Baseball
  – Basketball
  – Boxing
  – …

› …

Manually defined: 246 categories, a 2-level tree, 25 parent categories

Page 14: Scaling out logistic regression with Spark

LEARNING SET: FEATURES

› Tag Count Source
  – cnn.com | news | 1
  – bbc.com | culture | 50
  – …

› Html Analyzer Source
  – cnn.com | money | 14
  – nba.com | nba draft | 2
  – …

11 basic sources
Feature is: site | tag | score
Some sources are reintroduced after additional processing; eventually – 16 sources
18 GB of data
4M unique features

Page 15: Scaling out logistic regression with Spark

Our challenge
› Large-scale logistic regression
  – ~500K site samples
  – 4M unique features
  – ~800K features/source
  – 246 classes
  – Eventually apply the model to 400M sites

Page 16: Scaling out logistic regression with Spark

FIRST LOGISTIC REGRESSION ATTEMPT

A single-machine Java logistic regression implementation:
› Highly optimized
› Manually tuned loss function
› Multi-threaded: uses plain arrays and divides "stripes" between threads
› Works on “summed features”

Drawbacks:
› Only scales up
› Pre-combination of features reduces coverage
› Runtime: a few days
› Code is complex, and the algorithm is hard to tweak
› Fails the bus test

Page 17: Scaling out logistic regression with Spark

SECOND LOGISTIC REGRESSION ATTEMPT

What we wanted:
› An out-of-the-box solution
› Customizable
› Open source
› Distributable

Page 18: Scaling out logistic regression with Spark

Why we chose Spark
› Has an out-of-the-box distributed solution for large-scale multinomial logistic regression
› Simplicity
› Lower production maintenance costs compared to R
› Intent to move to Spark for large, complex algorithmics

Page 19: Scaling out logistic regression with Spark

Spark-Related Algorithmics
An intuitive reminder

Page 20: Scaling out logistic regression with Spark

Basic Regression Method
› We want to estimate the value of $y$ based on samples $(x, y)$:
  $y \approx f(x, \beta)$; $\beta$ – unknown function constants
› Define a loss function $l(\beta)$ that corresponds with accuracy, for example:
  $l(\beta) = \sum_i \left( y_i - f(x_i, \beta) \right)^2$
› Find $\beta$ that minimizes $l(\beta)$

Page 21: Scaling out logistic regression with Spark

Logistic Regression
› In the case of classification we want to use the logistic function:
  $p(y = 1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}$
› Define a differentiable loss function (negative log-likelihood):
  $l(\beta) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
› We cannot find the minimizing $\beta$ analytically
› However, $l(\beta)$ is smooth, continuous and convex!
  – Has one global minimum

Page 22: Scaling out logistic regression with Spark

GRADIENT DESCENT

Generally
• $-\nabla l(\beta)$ is a vector that points in the direction of steepest descent
• In every step: $\beta_{k+1} = \beta_k - \gamma \nabla l(\beta_k)$
• $\gamma$ – learning rate
• Converges when $\nabla l(\beta) \approx 0$

Spark
• SGD – stochastic mini-batch GD

Page 23: Scaling out logistic regression with Spark

LINE SEARCH – DETERMINING STEP SIZE

An approximate method. At each iteration:
• Find a step size $\gamma$ that sufficiently decreases $l$
• By reducing the range of possible step sizes

Spark:
• StrongWolfeLineSearch
• The sufficiency check is a function of $l$ and $\nabla l$

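For reference, the strong Wolfe conditions such a sufficiency check enforces, with $p_k$ the search direction and constants $0 < c_1 < c_2 < 1$:

$$l(\beta_k + \gamma p_k) \le l(\beta_k) + c_1 \gamma \, \nabla l(\beta_k)^T p_k \quad \text{(sufficient decrease)}$$
$$\left| \nabla l(\beta_k + \gamma p_k)^T p_k \right| \le c_2 \left| \nabla l(\beta_k)^T p_k \right| \quad \text{(curvature)}$$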
Page 24: Scaling out logistic regression with Spark

Is there a faster way?

Page 25: Scaling out logistic regression with Spark

Function Analysis for $l(\beta)$

At the minimum the derivative is 0, so we want $\beta$ that satisfies $\nabla l(\beta) = 0$

$$\text{gradient vector} \stackrel{\text{def}}{=} \nabla l(\beta) = \begin{pmatrix} \frac{\partial l}{\partial \beta_1} \\ \frac{\partial l}{\partial \beta_2} \\ \vdots \\ \frac{\partial l}{\partial \beta_n} \end{pmatrix}
\qquad
\text{Hessian} \stackrel{\text{def}}{=} H(\beta) = \nabla^2 l(\beta) = \begin{pmatrix} \frac{\partial^2 l}{\partial \beta_1 \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_1 \partial \beta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 l}{\partial \beta_n \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_n \partial \beta_n} \end{pmatrix}$$

In our case the Hessian would be 800K x 800K – way too much…

Page 26: Scaling out logistic regression with Spark

NEWTON’S METHOD (NEWTON-RAPHSON)

"NewtonIteration Ani" by Ralf Pfeifer – NewtonIteration_Ani.gif, https://en.wikipedia.org/wiki/Newton's_method

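For reference, the Newton step the animation illustrates, applied to minimizing $l$ (i.e., finding a root of $\nabla l$):

$$\beta_{k+1} = \beta_k - H(\beta_k)^{-1} \nabla l(\beta_k)$$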
Page 27: Scaling out logistic regression with Spark

Illustration for a simple parabola (1 feature)

[Side-by-side panels: GRADIENT DESCENT vs. NEWTON’S GRADIENT DESCENT]

Images from here

Page 28: Scaling out logistic regression with Spark

Is there a fast and simpler way?

Page 29: Scaling out logistic regression with Spark

SECANT METHOD (QUASI-NEWTON)

Approximation of the derivative – the Hessian is not needed

In our case, we need only $\nabla l$

Animation from here

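For reference, the one-dimensional secant update for a root of $f$ (here $f = \nabla l$), replacing the analytical derivative with a finite difference of the last two iterates:

$$\beta_{k+1} = \beta_k - f(\beta_k) \, \frac{\beta_k - \beta_{k-1}}{f(\beta_k) - f(\beta_{k-1})}$$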
Page 30: Scaling out logistic regression with Spark

Requirements and Convergence Rate

Newton-Raphson:
• Analytical formula for the gradient; compute the gradient at each step
• Analytical formula for the Hessian; compute the inverse Hessian at each step
• Order of convergence q = 2

Quasi-Newton:
• Analytical formula for the gradient; compute the gradient at each step
• Save the last calculations of the gradient (no Hessian)
• Order of convergence q = 1.6

Which is faster? Which is cheaper (memory, CPU) in 1000 iterations for M = 100,000 features? Which of gradient descent, Newton, or quasi-Newton should we use?

Page 31: Scaling out logistic regression with Spark

BFGS – Quasi-Newton with Line Search
› Initially, guess $\beta_0$ and set $H_0 = I$
› In each step k:
  – Calculate the gradient value $\nabla l(\beta_k)$ (the direction)
  – Find a step size using line search (with Wolfe conditions)
  – Update $\beta_{k+1}$
  – Update $H_{k+1}$ (the inverse-Hessian approximation)
› Stop when the improvement is small enough
› More info: BFGS

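For reference, the standard BFGS inverse-Hessian update behind that last bullet, with $s_k = \beta_{k+1} - \beta_k$, $y_k = \nabla l(\beta_{k+1}) - \nabla l(\beta_k)$ and $\rho_k = 1 / (y_k^T s_k)$:

$$H_{k+1} = \left( I - \rho_k s_k y_k^T \right) H_k \left( I - \rho_k y_k s_k^T \right) + \rho_k s_k s_k^T$$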
Page 32: Scaling out logistic regression with Spark

Back To Engineering

Page 33: Scaling out logistic regression with Spark

Challenges Implementing Logistic Regression
› To get the values of the gradient we need to instantiate the formula with the learning set
  – For every iteration we need to go over the learning set
› If we want to speed this up by parallelization, we need to ship the model or the learning set to each thread/process
› Single machine -> the process is CPU bound
› Multiple machines -> network bound
› With a large number of features, memory becomes a problem as well

Page 34: Scaling out logistic regression with Spark

Why we chose to use L-BFGS
› The only out-of-the-box multinomial logistic regression
› Gives good value for money
  – Good tradeoff between cost per iteration and number of iterations
› Uses Spark’s GeneralizedLinearModel API:

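As a rough sketch of that entry point (the 246-class count comes from the learning-set slides; data loading is elided):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // learning set as labeled feature vectors; loading elided
    val training: RDD[LabeledPoint] = ???

    // multinomial logistic regression over our 246 categories,
    // optimized with L-BFGS under the GeneralizedLinear* API
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(246)
      .run(training)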
Page 35: Scaling out logistic regression with Spark

L-BFGS
› L stands for Limited memory
  – Replaces $H_k$, which is an $M \times M$ matrix, with the few (~10) most recent updates of $\Delta\beta$ and $\Delta(\nabla l)$, which are $M$-sized vectors
› spark.LBFGS
  – Distributed wrapper over breeze.LBFGS
  – Mostly, distribution of the gradient calculation
    › The rest is not distributed
    › Ships the model around and collects gradient values
  – Uses L2 regularization
  – Scales features

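To make concrete what spark.LBFGS wraps, here is a minimal standalone use of breeze’s optimizer on a toy convex objective (the objective and all names are illustrative):

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // toy objective ||x - 3||^2 with its gradient 2(x - 3)
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val d = x - 3.0
        (d dot d, d * 2.0)
      }
    }

    // m = 10 is the "limited memory": number of recent update pairs kept
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
    val xOpt = lbfgs.minimize(f, DenseVector.zeros[Double](5))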
Page 36: Scaling out logistic regression with Spark

Spark internals (diagram)
• Distributed sub-loop (max 10)
• Learning set distributed but cached on executors
• Partial aggregation on executors, final aggregation on the driver

Page 37: Scaling out logistic regression with Spark

AGGREGATE & TREE AGGREGATE

Aggregate
• Each executor holds a portion of the learning set
• Broadcast the model to executors
• Collect results to the driver

TreeAggregate
• Simple heuristic to add a level
• Perform partial aggregation by shipping results to other executors (by repartitioning)

(Diagram: weights out to executors, partial gradients back)

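A condensed sketch of that broadcast-and-treeAggregate pattern for the gradient (binary logistic case for brevity; the names and loss bookkeeping are illustrative rather than MLlib’s exact internals, and an active SparkContext `sc` is assumed):

    import breeze.linalg.DenseVector
    import org.apache.spark.rdd.RDD

    val numFeatures = 100000
    val weights = DenseVector.zeros[Double](numFeatures)   // current model
    val points: RDD[(Double, DenseVector[Double])] = ???   // (label, features), loading elided

    val bcWeights = sc.broadcast(weights)  // ship the model to executors once per iteration
    val (gradSum, lossSum) = points.treeAggregate((DenseVector.zeros[Double](numFeatures), 0.0))(
      // executor side: fold each point into a partial gradient and loss
      seqOp = { case ((grad, loss), (y, x)) =>
        val p = 1.0 / (1.0 + math.exp(-(bcWeights.value dot x)))
        grad += x * (p - y)
        (grad, loss - (y * math.log(p) + (1 - y) * math.log(1 - p)))
      },
      // partial aggregation on executors, final combine on the driver
      combOp = { case ((g1, l1), (g2, l2)) => (g1 += g2, l1 + l2) },
      depth = 2)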
Page 38: Scaling out logistic regression with Spark

Job UI – big job

Page 39: Scaling out logistic regression with Spark

Implementation

Page 40: Scaling out logistic regression with Spark

Overfitting
› We have more features than samples
› Some features are poorly represented
› For example:
  – only one sample with the “carbon” tag
  – that sample is labeled “automotive”
› The model would give this feature a high weight for the “automotive” class and 0 for the others
  – Do you think that is correct?
› How would you solve this?

Page 41: Scaling out logistic regression with Spark

Regularization
› A solution internal to the regression mechanism
› We introduce regularization into the cost function; with L2 regularization:
  $l_{reg}(\beta) = l(\beta) + \lambda \lVert \beta \rVert^2$
› $\lambda$ – regularization constant
› What happens if $\lambda$ is too large?
› What happens if $\lambda$ is too small?
› Spark’s LBFGS has L2 built in

Page 42: Scaling out logistic regression with Spark

Finding the Best Lambda
› We choose the best $\lambda$ using cross-validation
  – Set aside 30% of the learning set and use it for testing
› Build a model for every $\lambda$ and compare precision
› Let’s parallelize? Is there a more efficient way to do this?
  – We use the fact that for large $\lambda$ the model is underfitted and converges fast
  – Start from a large $\lambda$ and use its model as the starting point of the next iteration

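A sketch of that warm-started sweep. Stock spark.LBFGS could not accept starting weights (one of the fixes on the extensions slide below), so `trainWithInitialWeights` and `precision` are hypothetical helpers, and `training70` / `heldOut30` stand for the 70/30 split:

    import org.apache.spark.mllib.linalg.Vectors

    val numClasses = 246
    val numFeatures = 100000
    // halve lambda each round, warm-starting from the previous model
    val lambdas = Seq(25.0, 12.5, 6.25, 3.125, 1.563, 0.781, 0.391,
                      0.195, 0.098, 0.049, 0.024, 0.012, 0.006, 0.003)

    var initialWeights = Vectors.zeros((numClasses - 1) * numFeatures) // multinomial packing, no intercept
    val results = lambdas.map { lambda =>
      // hypothetical: train with regParam = lambda, starting from initialWeights
      val model = trainWithInitialWeights(training70, lambda, initialWeights)
      initialWeights = model.weights              // warm start for the next lambda
      (lambda, precision(model, heldOut30))       // hypothetical held-out evaluation
    }
    val bestLambda = results.maxBy(_._2)._1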
Page 43: Scaling out logistic regression with Spark

CHOOSING REGULARIZATION PARAMETER

Lambda    Precision    Iterations
25        35.06%       3
12.5      35.45%       12
6.25      36.68%       5
3.125     38.41%       5
1.563     Failure!
0.781     45.87%       13
0.391     50.64%       10
0.195     55.04%       13
0.098     58.33%       17
0.049     60.93%       19
0.024     62.33%       21
0.012     64.30%       25
0.006     65.95%       42
0.003     65.46%       38

After choosing the best lambda, we can use the complete learning set to calculate the final model

Failures can be caused externally or internally

Avg iteration time 2 sec

Page 44: Scaling out logistic regression with Spark

LBFGS EXTENSION & BUG FIXES

› The Spark layer of LBFGS swallows all failures
  – and returns bad weights
› Feature scaling was always on
  – Redundant in our case
  – Rendered passed-in weights unusable
  – Lowered model precision
› Expose the effective number of iterations to external monitoring

Our changes:
• Enable passing starting weights into LBFGS
• More transparency

Page 45: Scaling out logistic regression with Spark

SPARK ADDITIONS & BUG FIXES

› PoliteLBFGS addition to spark.LBFGS
  – 3-5% more precise (for our data)
  – 30% faster calculation
› Planning to contribute back to Spark

class PoliteLbfgs extends spark.Lbfgs

Was it worth the trouble?

po·lite : pəˈlīt/ having or showing behavior that is respectful and considerate of others.

synonyms: well mannered, civil, courteous, mannerly, respectful, deferential, well behaved

Page 46: Scaling out logistic regression with Spark

Job UI – small job

Page 47: Scaling out logistic regression with Spark

RUNNING

Page 48: Scaling out logistic regression with Spark

Hardware
› 110 machines
› 5.20 TB memory
› 6600 VCores
› Yarn
› Block size 128 MB
› Cluster is shared with other MapReduce jobs and HBase

Per machine:
› 60 VCores
› 64 GB memory
  – ~1 GB per VCore
› 12 cores
  – 5 VCores per physical core (tuned for MapReduce)
› CentOS 6.6
› cdh-5.4.8

Page 49: Scaling out logistic regression with Spark

Execution – Good Neighboring
› Each source has a different number of samples and features
› Execution profiles for a single learning run:

                          Small        Large
#Samples                  ~50K         500K
Input Size                under 1 GB   1-3 GB
#Executors                2            22
Executor Memory           2g           4g
Driver Memory             2g           18g
Yarn Driver Overhead      2g           2g
Yarn Executor Overhead    1g           1g
#Jobs per profile         200          180

Page 50: Scaling out logistic regression with Spark

Execution Example

Hardware: Driver                              2 cores, 20g memory
Hardware: Executors                           22 machines x (2 cores, 5g memory)
Number of Features                            100,000
Number of Samples                             500,000
Total Number of Iterations (14 different λ)   152
Avg Iteration Time                            18.8 sec
Total Learning Time                           2863 sec (48 minutes)
Max Iterations for a single λ                 30

Page 51: Scaling out logistic regression with Spark

Could you guess the reason for the difference?

run    Phase name           real time [sec]    iteration time [sec]    iterations
1      parent-glm-AVTags    29101              153.2                   190
2      parent-glm-AVTags    15226              82.3                    185
3      parent-glm-AVTags    2863               18.8                    152

• OK, I admit, the cluster was very loaded in the first run
• What about the second?
  • org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
  • Increase spark.shuffle.memoryFraction=0.5

Page 52: Scaling out logistic regression with Spark

AKKA IN THE REAL WORLD

› spark.akka.frameSize = 100
› spark.akka.askTimeout = 200
› spark.akka.lookupTimeout = 200

Response times are slower when the cluster is loaded. askTimeout seems to be particularly responsible for executor failures when removing broadcasts and unpersisting RDDs.

Page 53: Scaling out logistic regression with Spark

Kryo Stability
› Kryo uses quite a lot of memory
  – if the buffer is not sufficient, the process will crash
  – spark.kryoserializer.buffer.max.mb = 512

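Pulled together, the settings from the last three slides as they would be set on a Spark 1.x SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.5")        // headroom for shuffle data
      .set("spark.akka.frameSize", "100")
      .set("spark.akka.askTimeout", "200")
      .set("spark.akka.lookupTimeout", "200")
      .set("spark.kryoserializer.buffer.max.mb", "512")  // larger Kryo buffer cap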
Page 54: Scaling out logistic regression with Spark
Page 55: Scaling out logistic regression with Spark

LEARNING SET: CLASSES

› Shopping
  – Clothing
  – Consumer Electronics
  – Jewelry
  – …

› Sports
  – Baseball
  – Basketball
  – Boxing
  – …

› …

Manually defined: 246 categories, a 2-level tree, 25 parent categories

Page 56: Scaling out logistic regression with Spark

LEARNING SET: FEATURES

› Tag Count Source
  – cnn.com | news | 1
  – bbc.com | culture | 50
  – …

› Html Analyzer Source
  – cnn.com | money | 14
  – nba.com | nba draft | 2
  – …

11 basic sources
Feature is: site | tag | score
Some sources are reintroduced after additional processing; eventually – 16 sources
~500K site samples
18 GB of data
4M unique features
~800K features/source

Page 57: Scaling out logistic regression with Spark

Need: How would you improve over time?
› We collect different kinds of data:
  – Tags
  – Links
  – User behavior
  – …
› How to identify where to focus collection efforts?
› How to improve the classification algorithm?

Page 58: Scaling out logistic regression with Spark

Current Approach – Training
› foreach source (16 sources)
  – choose the 100K most influential features
  – train a model for L1 (25 L1 classes)
  – foreach L1 class (avg 9.2 L2 classes per L1)
    › train a model for L2
› foreach source
  – foreach sample in the training set
    › calculate the probabilities P(L1) of belonging to each of the L1 classes
› train a Random Forest using the L1 probabilities set

Page 59: Scaling out logistic regression with Spark

Current Approach – Application
› foreach site to classify
  – foreach source
    › calculate the probabilities P(L1) of belonging to each L1 class
  – aggregate results and estimate L1 (using the RF model)
  – given the estimated L1, foreach source
    › calculate the estimated L2
  – choose (by voting) the final L2

Page 60: Scaling out logistic regression with Spark

OTHER EXTENSIONS

› Extend mllib.LogisticRegressionModel to return probabilities instead of the final decision from the “predict” method
› For example
  – Site: nhl.com
  – Instead of “is L1=sports”
  – We produce:
    › P(news) = 30%
    › P(sports) = 65%
    › P(art) = 5%

model.advise(p: point)

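A rough sketch of what such an extension computes: rebuilding class probabilities from a trained multinomial model’s packed weights (assumes a model trained without intercept; `advise` is the slide’s naming, and the class-0-as-pivot packing follows MLlib’s multinomial layout):

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vector

    // probabilities for all classes, rather than predict()'s argmax label
    def advise(model: LogisticRegressionModel, p: Vector): Array[Double] = {
      val w = model.weights.toArray
      val n = model.numFeatures
      val x = p.toArray
      // margins for classes 1..K-1; class 0 is the pivot in MLlib's packing
      val margins = Array.tabulate(model.numClasses - 1) { c =>
        var m = 0.0
        var i = 0
        while (i < n) { m += w(c * n + i) * x(i); i += 1 }
        m
      }
      val e = margins.map(math.exp)
      val norm = 1.0 + e.sum
      (1.0 +: e).map(_ / norm)  // softmax with the pivot's margin fixed at 0
    }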
Page 61: Scaling out logistic regression with Spark

Summary: This Approach vs. Straight Logistic Regression
› Increases precision by using more features
› Increases coverage by using very granular features
› Gives feedback (from the RF) on the quality of each source
  – Using out-of-bag error
› Natural parallelization by source
› No need for feature scaling

Page 62: Scaling out logistic regression with Spark