Andy Feng, Distinguished Architect, Yahoo, at MLconf SF
DESCRIPTION
Abstract: Scalable Machine Learning at Yahoo. Yahoo scientists have developed a variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising, and personalization. Emerging business needs require us to address two problems:
- Can we apply these libraries to massive datasets (billions of training examples and millions of features) using commodity hardware clusters?
- Can we reduce learning time from days to minutes or seconds?
We have examined system architecture options (including Hadoop, Spark, and Storm) and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with the open source community on a better system architecture for next-gen machine learning applications. Yahoo ML libraries are being revised for much better scalability and latency. In this talk, we will share the system architecture of our ML platform and its use cases.
TRANSCRIPT
Scalable Machine Learning at Yahoo
Andy Feng
Nov 14, 2014
My Background
§ Current
› VP Architecture, Yahoo
› Committer, Apache Storm
› Contributor, Apache Spark & Hadoop
§ Past
› NoSQL
› Online advertising
› Personalization
› Cloud services
Agenda
§ Machine Learning
› Use Cases
› Challenges
§ Scalable ML Architecture
§ Design Patterns
› Batch, real-time and hybrid
Evolution of Big Data @ Yahoo
[Chart: Raw HDFS Storage (in PB) and Number of Servers per Year, 2006-2014; storage grows to ~600 PB and servers to ~45,000]
Milestones:
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Revenue systems with security, multi-tenancy, and SLAs
› Open sourced with Apache
› Hortonworks spinoff for enterprise hardening
› Next-gen Hadoop (Hadoop 0.23)
› New services (HBase, Hive)
› Increased user base with partitioned namespaces
› Hadoop 2.5
Machine Learning
Personalized Homepage (http://www.yahoo.com, Mobile)
› Today Module (2012)
› Content stream w/ native ads (2013)
Web Search & Ads
› Web page rank
› Image/video insertion
› Ads targeting & ranking
Flickr Photo Search
› 2013 … based on user tags
› 2014 … empowered by scalable ML
Machine Learning @ Yahoo
§ Search
› Page ranking per user intention
§ Advertisement
› Ad click prediction
› Identify potential users for an ad campaign
§ Content
› Matching news articles against users
› Object detection, face recognition in photos
§ Security
› Email spam
› Fraudulent login and registration
Our Challenges
§ Scale
› 1,000,000,000's of examples
› 100,000,000's of features
› 10,000's of models
› 10's of algorithms
• Batch learning
• Incremental learning
• Real-time learning
§ Speed
› Temporal nature of user interests
› Time-sensitive content
• Ex., breaking news
› Naïve solutions spend days/hours in model training
• Minutes/seconds desired
Our Approach: Big-Data Machine Learning

Apache Hadoop (http://hadoop.apache.org)
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez and beyond
Apache Storm (http://storm.apache.org)
§ "Hadoop for Realtime"
› Distributed and high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault tolerance
§ Guaranteed data processing
Apache Spark (http://spark.apache.org)
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
› Ex. iterative programming
§ Resilient Distributed Datasets
› In-memory storage
30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days of training on our large datasets
› Pro: high accuracy
› Con: sequential execution
§ 30x speedup enables frequent model training
› GBDT included in data pipeline (Hadoop Oozie workflow)
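The talk does not spell out how GBDT training was parallelized. One common way to distribute tree building (used by several open source GBDT systems) is to compute per-partition gradient histograms and merge them before choosing a split; a toy sketch of that idea follows. All function names, the binning scheme, and the split criterion are illustrative, not Yahoo's code.

```python
from collections import defaultdict

def partition_histogram(rows, n_bins=4):
    """Per-partition pass: bin -> [sum of gradients, example count].
    rows are (feature_value, gradient) pairs with feature_value in [0, 1)."""
    hist = defaultdict(lambda: [0.0, 0])
    for x, grad in rows:
        b = min(int(x * n_bins), n_bins - 1)
        hist[b][0] += grad
        hist[b][1] += 1
    return hist

def merge_histograms(hists):
    """Cheap aggregation step: histograms, not raw rows, cross the network."""
    merged = defaultdict(lambda: [0.0, 0])
    for h in hists:
        for b, (g, c) in h.items():
            merged[b][0] += g
            merged[b][1] += c
    return merged

def best_split(hist, n_bins=4):
    """Pick the bin boundary maximizing a simple variance-gain criterion."""
    total_g = sum(g for g, _ in hist.values())
    total_c = sum(c for _, c in hist.values())
    best, best_gain = None, -1.0
    left_g, left_c = 0.0, 0
    for b in range(n_bins - 1):
        g, c = hist.get(b, (0.0, 0))
        left_g += g
        left_c += c
        right_g, right_c = total_g - left_g, total_c - left_c
        if left_c and right_c:
            gain = left_g ** 2 / left_c + right_g ** 2 / right_c
            if gain > best_gain:
                best, best_gain = b, gain
    return best
```

Because only fixed-size histograms are merged, the per-iteration cost no longer grows with the number of rows per worker, which is what makes the sequential algorithm parallelizable.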
Auto-tag Billions of Flickr Photos
§ 10,000 mappers: pixels -> features (deep network as feature extractor)
› Ex. output: dog, 1, [.2, -.3, …]; dog, 0, [.3, -.5, …]; cat, 1, [.2, -.3, …]; cat, 0, [.3, -.5, …]
§ Shuffle: group examples by class
§ 1,000 reducers: train models (Dog, Cat, …)
› 8000+ classifiers
§ Real-time prediction & training for user experience
› Real-time learning of newly uploaded photos
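The Flickr pipeline above can be sketched as a miniature map/shuffle/reduce in plain Python. The deep-network feature extractor is replaced by a stub, and the per-class trainer is a toy perceptron rather than the production classifier; all names are illustrative.

```python
from collections import defaultdict

def map_photo(photo):
    """Mapper: turn one photo into (class, target, feature-vector) records.
    A stub stands in for the deep-network feature extractor."""
    feats = [photo["pixels"][0] * 0.5, photo["pixels"][1] * 0.5]
    return [(cls, tgt, feats) for cls, tgt in photo["labels"]]

def shuffle(records):
    """Shuffle phase: group records by class name (one reducer per class)."""
    groups = defaultdict(list)
    for cls, tgt, feats in records:
        groups[cls].append((tgt, feats))
    return groups

def reduce_train(examples, lr=0.1, epochs=10):
    """Reducer: train one binary linear classifier for this class."""
    w = [0.0] * len(examples[0][1])
    for _ in range(epochs):
        for tgt, feats in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, feats)) > 0 else 0
            err = tgt - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, feats)]
    return w
```

The structure mirrors the slide: feature extraction parallelizes across mappers, the shuffle routes every class's positives and negatives to one reducer, and the 8000+ classifiers train independently.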
Design Patterns Enabled
1. Batch ML for scale
› Parallel model training (ex. 1,000 models for ad campaigns)
› Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
› Scale + speedy learning (ex. photo auto-tags)
› Enabled by "Parameter Server on Grid"
1a. ML in Hadoop Reducers
§ Basic Requirements
› 100's - 1000's of models
› Training data for each model can be loaded into a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar -Dmapreduce.job.reduces=$num_models -reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models -reducer "svm_train_reducer.py …"
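A minimal sketch of what an `svm_train_reducer.py`-style streaming reducer might look like: it consumes its partition's `label<TAB>features` lines and fits a linear model by SGD. This is an illustrative stand-in, not the actual script from the slide; in a real Hadoop Streaming job the lines would arrive on `sys.stdin`.

```python
def train_from_stream(lines, lr=0.1):
    """Fit a linear model by SGD over 'label \t f1,f2,...' text lines,
    one pass, in the order Hadoop Streaming delivers them."""
    w = None
    for line in lines:
        label_s, feats_s = line.rstrip("\n").split("\t")
        y = float(label_s)
        x = [float(v) for v in feats_s.split(",")]
        if w is None:
            w = [0.0] * len(x)  # lazily size the model from the first row
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = y - pred
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w
```

Because `mapreduce.job.reduces=$num_models` gives each model its own reducer, this single-machine trainer runs once per model with no cross-reducer coordination, which is the whole trick of pattern 1a.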
1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar -input $training_data -output $model_loc -Dmapreduce.job.maps=$num_mappers -mapper "runvw.sh $model_location $span_server $num_mappers" -reducer NONE
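The AllReduce step can be illustrated without MPI: every mapper contributes a local gradient, and after AllReduce each mapper holds the same averaged gradient and applies the same update, so all copies of the model stay identical. The sketch below simulates AllReduce in-process; Vowpal Wabbit's spanning-tree AllReduce behaves this way logically, though over the network.

```python
def allreduce_mean(local_values):
    """Simulated AllReduce: every worker ends up with the element-wise
    mean of all workers' vectors."""
    n = len(local_values)
    dim = len(local_values[0])
    total = [sum(v[d] for v in local_values) for d in range(dim)]
    mean = [t / n for t in total]
    return [list(mean) for _ in range(n)]  # one identical copy per worker

def distributed_step(w, local_grads, lr=0.5):
    """One synchronous SGD step: average gradients, update the shared model."""
    synced = allreduce_mean(local_grads)
    g = synced[0]  # identical on every worker after AllReduce
    return [wi - lr * gi for wi, gi in zip(w, g)]
```

Since no worker ever needs the full dataset, only gradient vectors cross the network, which is what lets "training data too large for a single machine" still produce one model.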
1c. Spark Native ML
§ Spark based
› Yahoo E-Commerce: 30 LOC Spark program for collaborative filtering
§ Spark's MLlib
› Binary classification, linear regression, collaborative filtering, clustering, decision trees, etc.
§ 3rd-party ML libs
› Ex. Alpine Data Labs' Random Forest
1d. Approximate Computing
§ Observations
› A large-scale ML job uses 100's of processes to train models for hours
› Some learner processes get stuck or fail due to hardware issues (ex. disk, network)
› Existing ML algorithms will hang or fail as a result
§ Partial Reducer
› Enables a trade-off between speed and accuracy
› Tolerates failures of a percentage of learner processes

for (i <- 1 to ITERATIONS) {
  val gradient = points.pipe(learner_cmd)
                       .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
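A minimal model of the partial-reducer idea: reduce only the learners that reported back, and fail only if the responding fraction drops below the tolerance threshold. The timeout handling from the Scala snippet is omitted; `None` stands for a stuck or failed learner, and the function name is illustrative.

```python
def partial_reduce(results, reduce_fn, min_fraction=0.99):
    """Fold the results of the workers that responded (None = stuck/failed
    worker). Raise only if too few workers responded to trust the answer."""
    ok = [r for r in results if r is not None]
    if len(ok) < min_fraction * len(results):
        raise RuntimeError("too many failed learners")
    out = ok[0]
    for r in ok[1:]:
        out = reduce_fn(out, r)
    return out
```

Dropping a small fraction of gradient contributions slightly perturbs each step but lets the iteration finish, which is the speed-versus-accuracy trade-off the slide describes.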
2. Realtime Training in Storm Bolts
§ Basic Requirements
› Freshness of ML model is critical
§ Sample Solution

public class TrainingBolt extends BaseBasicBolt {
    Model model;

    public void prepare(Map conf, TopologyContext ctx) {
        System.loadLibrary("VW");
        model = VW.init(conf);
    }

    public void execute(Tuple input, BasicOutputCollector collector) {
        Instance example = (Instance) input.getValue(0);
        model.learn(example);
        if (/* time since last export exceeds threshold */) {
            collector.emit(model);
        }
    }
}
3a. Hybrid Learning
§ Basic Requirements
› Bootstrap models via batch learning from large datasets
› Update models via realtime learning from latest events
§ Sample Solution
› ML in Hadoop + Storm
› ML in Spark + Storm
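A toy illustration of the hybrid pattern: bootstrap a linear model from a batch of historical examples (the Hadoop/Spark phase), then fold fresh events into it one at a time (the Storm phase). Both phases share the same model representation, which is the point of the design; function names and the learning rule are illustrative.

```python
def batch_train(examples, lr=0.1, epochs=20):
    """Batch phase: bootstrap a 2-feature linear model from history."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for y, x in examples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
    return w

def realtime_update(w, event, lr=0.1):
    """Realtime phase: fold one fresh event into the bootstrapped model,
    returning the updated weights without retraining from scratch."""
    y, x = event
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
```

The batch job gives the model its bulk accuracy; the per-event updates supply the freshness that pure batch retraining (hours/days) cannot.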
3b. Parameter Server on Grid
• Billions of features per model
• Millions of operations per second
• Enables asynchronous learning
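A toy parameter server capturing the asynchronous-learning bullet above: one server process holds the (potentially billions-of-features) weight vector, and workers pull a snapshot, compute a local gradient, and push a delta without coordinating with each other. Class and method names, and the squared-loss objective, are illustrative, not the Grid implementation.

```python
class ParameterServer:
    """Toy parameter server: workers pull weights and push gradient
    deltas without synchronizing with one another."""
    def __init__(self, dim):
        self.w = [0.0] * dim

    def pull(self):
        return list(self.w)  # snapshot; may be stale by the time it's used

    def push(self, grad, lr=0.1):
        self.w = [wi - lr * gi for wi, gi in zip(self.w, grad)]

def worker_step(ps, y, x, lr=0.1):
    """One asynchronous worker step against its pulled snapshot."""
    w = ps.pull()
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2 * (pred - y) * xi for xi in x]  # squared-loss gradient
    ps.push(grad, lr)
```

Tolerating slightly stale snapshots is what removes the synchronization barrier and allows millions of push/pull operations per second across the grid.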
Summary: our ML platform stack (top to bottom)
§ Applications: Search Ranking, Photo/Video Services, Online Ads, Personalization, Abuse Detection
§ Machine Learning Libraries: Logistic Regression, Deep Learning, Unsupervised Learning, Decision Trees, …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL
Committed to Apache Open Source
(Per-project committer counts; the project names appeared as logos on the original slide)
› 8 Committers (6 PMCs) | Apache - 80
› 5 Committers (3 PMCs) | Apache - 18
› 3 Committers (2 PMCs) | Apache - 21
› 5 Committers (5 PMCs) | Apache - 17
› 3 Committers | Apache - 32
› 7 Committers (6 PMCs) | Apache - 33
§ Big-Data Blog … http://yahoohadoop.tumblr.com
§ Hiring … http://careers.yahoo.com
Thanks!