Andy Feng, Distinguished Architect, Yahoo, at MLconf SF
DESCRIPTION
Abstract: Scalable Machine Learning at Yahoo. Yahoo scientists have developed a variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising, and personalization. Emerging business needs require us to address two problems:
- Can we apply these libraries to massive datasets (billions of training examples and millions of features) using commodity hardware clusters?
- Can we reduce learning time from days to minutes or seconds?
We have examined system architecture options (including Hadoop, Spark, and Storm) and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with the open source community on a better system architecture for next-gen machine learning applications. Yahoo ML libraries are being revised for much better scalability and latency. In this talk, we will share the system architecture of our ML platform and its use cases.
TRANSCRIPT
Scalable Machine Learning at Yahoo
Andy Feng
Nov 14, 2014
My Background
§ Current
› VP Architecture, Yahoo
› Committer, Apache Storm
› Contributor, Apache Spark & Hadoop
§ Past
› NoSQL
› Online advertising
› Personalization
› Cloud services
Agenda
§ Machine Learning
› Use Cases
› Challenges
§ Scalable ML Architecture
§ Design Patterns
› Batch, real-time and hybrid
Evolution of Big Data @ Yahoo
[Chart: Raw HDFS Storage (in PB) and Number of Servers per Year, 2006-2014; storage grows to ~600 PB and servers to ~45,000]
Milestones:
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Revenue systems with security, multi-tenancy, and SLAs
› Open sourced with Apache
› Hortonworks spinoff for enterprise hardening
› Next-gen Hadoop (Hadoop 0.23)
› New services (HBase, Hive)
› Increased user base with partitioned namespaces
› Hadoop 2.5
Machine Learning
Personalized Homepage (http://www.yahoo.com, Mobile)
› Today Module (2012)
› Content stream w/ native ads (2013)
Web Search & Ads
› Web page rank
› Image/video insertion
› Ads targeting & ranking
Flickr Photo Search
› 2013 … based on user tags
› 2014 … empowered by scalable ML
Machine Learning @ Yahoo
§ Search
› Page ranking per user intention
§ Advertisement
› Ad click prediction
› Identify potential users for an ad campaign
§ Content
› Matching news articles against users
› Object detection, face recognition in photos
§ Security
› Email spam
› Fraudulent login and registration
Our Challenges
§ Scale
› 1,000,000,000's of examples
› 100,000,000's of features
› 10,000's of models
› 10's of algorithms
• Batch learning
• Incremental learning
• Real-time learning
§ Speed
› Temporal nature of user interests
› Time-sensitive content
• Ex., breaking news
› Naïve solutions spend days/hours in model training
• Minutes/seconds desired
Our Approach: Big-Data Machine Learning

Apache Hadoop (http://hadoop.apache.org)
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez and beyond
Apache Storm (http://storm.apache.org)
§ "Hadoop for Realtime"
› Distributed and high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault tolerance
§ Guaranteed data processing
Apache Spark (http://spark.apache.org)
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
› Ex. iterative programming
§ Resilient Distributed Datasets
› In-memory storage
30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days of training on our large datasets
› Pro: high accuracy
› Con: sequential execution
§ 30x speedup enables frequent model training
› GBDT included in data pipeline (Hadoop Oozie workflow)
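The talk does not spell out how GBDT training was parallelized. One common way to distribute tree building (used by several open source GBDT systems) is to compute per-partition gradient histograms and merge them before choosing a split; a toy sketch of that idea follows. All function names, the binning scheme, and the split criterion are illustrative, not Yahoo's code.

```python
from collections import defaultdict

def partition_histogram(rows, n_bins=4):
    """Per-partition pass: bin -> [sum of gradients, example count].
    rows are (feature_value, gradient) pairs with feature_value in [0, 1)."""
    hist = defaultdict(lambda: [0.0, 0])
    for x, grad in rows:
        b = min(int(x * n_bins), n_bins - 1)
        hist[b][0] += grad
        hist[b][1] += 1
    return hist

def merge_histograms(hists):
    """Cheap aggregation step: histograms, not raw rows, cross the network."""
    merged = defaultdict(lambda: [0.0, 0])
    for h in hists:
        for b, (g, c) in h.items():
            merged[b][0] += g
            merged[b][1] += c
    return merged

def best_split(hist, n_bins=4):
    """Pick the bin boundary maximizing a simple variance-gain criterion."""
    total_g = sum(g for g, _ in hist.values())
    total_c = sum(c for _, c in hist.values())
    best, best_gain = None, -1.0
    left_g, left_c = 0.0, 0
    for b in range(n_bins - 1):
        g, c = hist.get(b, (0.0, 0))
        left_g += g
        left_c += c
        right_g, right_c = total_g - left_g, total_c - left_c
        if left_c and right_c:
            gain = left_g ** 2 / left_c + right_g ** 2 / right_c
            if gain > best_gain:
                best, best_gain = b, gain
    return best
```

Because only fixed-size histograms are merged, the per-iteration cost no longer grows with the number of rows per worker, which is what makes the sequential algorithm parallelizable.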
Auto-tag Billions of Flickr Photos
§ 10,000 mappers: pixels -> features (deep network as feature extractor)
› Ex. output: dog, 1, [.2, -.3, …]; dog, 0, [.3, -.5, …]; cat, 1, [.2, -.3, …]; cat, 0, [.3, -.5, …]
§ Shuffle: group examples by class
§ 1,000 reducers: train models (Dog, Cat, …)
› 8000+ classifiers
§ Real-time prediction & training for user experience
› Real-time learning of newly uploaded photos
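The Flickr pipeline above can be sketched as a miniature map/shuffle/reduce in plain Python. The deep-network feature extractor is replaced by a stub, and the per-class trainer is a toy perceptron rather than the production classifier; all names are illustrative.

```python
from collections import defaultdict

def map_photo(photo):
    """Mapper: turn one photo into (class, target, feature-vector) records.
    A stub stands in for the deep-network feature extractor."""
    feats = [photo["pixels"][0] * 0.5, photo["pixels"][1] * 0.5]
    return [(cls, tgt, feats) for cls, tgt in photo["labels"]]

def shuffle(records):
    """Shuffle phase: group records by class name (one reducer per class)."""
    groups = defaultdict(list)
    for cls, tgt, feats in records:
        groups[cls].append((tgt, feats))
    return groups

def reduce_train(examples, lr=0.1, epochs=10):
    """Reducer: train one binary linear classifier for this class."""
    w = [0.0] * len(examples[0][1])
    for _ in range(epochs):
        for tgt, feats in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, feats)) > 0 else 0
            err = tgt - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, feats)]
    return w
```

The structure mirrors the slide: feature extraction parallelizes across mappers, the shuffle routes every class's positives and negatives to one reducer, and the 8000+ classifiers train independently.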
Design Patterns Enabled
1. Batch ML for scale
› Parallel model training (ex. 1,000 models for ad campaigns)
› Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
› Scale + speedy learning (ex. photo auto-tags)
› Enabled by "Parameter Server on Grid"
1a. ML in Hadoop Reducers
§ Basic Requirements
› 100's - 1000's of models
› Training data for each model can be loaded into a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar -Dmapreduce.job.reduces=$num_models -reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models -reducer "svm_train_reducer.py …"
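A minimal sketch of what an `svm_train_reducer.py`-style streaming reducer might look like: it consumes its partition's `label<TAB>features` lines and fits a linear model by SGD. This is an illustrative stand-in, not the actual script from the slide; in a real Hadoop Streaming job the lines would arrive on `sys.stdin`.

```python
def train_from_stream(lines, lr=0.1):
    """Fit a linear model by SGD over 'label \t f1,f2,...' text lines,
    one pass, in the order Hadoop Streaming delivers them."""
    w = None
    for line in lines:
        label_s, feats_s = line.rstrip("\n").split("\t")
        y = float(label_s)
        x = [float(v) for v in feats_s.split(",")]
        if w is None:
            w = [0.0] * len(x)  # lazily size the model from the first row
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = y - pred
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w
```

Because `mapreduce.job.reduces=$num_models` gives each model its own reducer, this single-machine trainer runs once per model with no cross-reducer coordination, which is the whole trick of pattern 1a.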
1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar -input $training_data -output $model_loc -Dmapreduce.job.maps=$num_mappers -mapper "runvw.sh $model_location $span_server $num_mappers" -reducer NONE
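The AllReduce step can be illustrated without MPI: every mapper contributes a local gradient, and after AllReduce each mapper holds the same averaged gradient and applies the same update, so all copies of the model stay identical. The sketch below simulates AllReduce in-process; Vowpal Wabbit's spanning-tree AllReduce behaves this way logically, though over the network.

```python
def allreduce_mean(local_values):
    """Simulated AllReduce: every worker ends up with the element-wise
    mean of all workers' vectors."""
    n = len(local_values)
    dim = len(local_values[0])
    total = [sum(v[d] for v in local_values) for d in range(dim)]
    mean = [t / n for t in total]
    return [list(mean) for _ in range(n)]  # one identical copy per worker

def distributed_step(w, local_grads, lr=0.5):
    """One synchronous SGD step: average gradients, update the shared model."""
    synced = allreduce_mean(local_grads)
    g = synced[0]  # identical on every worker after AllReduce
    return [wi - lr * gi for wi, gi in zip(w, g)]
```

Since no worker ever needs the full dataset, only gradient vectors cross the network, which is what lets "training data too large for a single machine" still produce one model.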
1c. Spark Native ML
§ Spark based
› Yahoo E-Commerce: 30 LOC Spark program for collaborative filtering
§ Spark's MLlib
› Binary classification, linear regression, collaborative filtering, clustering, decision trees, etc.
§ 3rd-party ML libs
› Ex. Alpine Data Labs' Random Forest
1d. Approximate Computing
§ Observations
› A large-scale ML job uses 100's of processes to train models for hours
› Some learner processes get stuck or fail due to hardware issues (ex. disk, network)
› Existing ML algorithms will hang or fail as a result
§ Partial Reducer
› Enables a trade-off between speed and accuracy
› Tolerates failures of a percentage of learner processes

for (i <- 1 to ITERATIONS) {
  val gradient = points.pipe(learner_cmd)
                       .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
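A minimal model of the partial-reducer idea: reduce only the learners that reported back, and fail only if the responding fraction drops below the tolerance threshold. The timeout handling from the Scala snippet is omitted; `None` stands for a stuck or failed learner, and the function name is illustrative.

```python
def partial_reduce(results, reduce_fn, min_fraction=0.99):
    """Fold the results of the workers that responded (None = stuck/failed
    worker). Raise only if too few workers responded to trust the answer."""
    ok = [r for r in results if r is not None]
    if len(ok) < min_fraction * len(results):
        raise RuntimeError("too many failed learners")
    out = ok[0]
    for r in ok[1:]:
        out = reduce_fn(out, r)
    return out
```

Dropping a small fraction of gradient contributions slightly perturbs each step but lets the iteration finish, which is the speed-versus-accuracy trade-off the slide describes.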
2. Realtime Training in Storm Bolts
§ Basic Requirements
› Freshness of ML model is critical
§ Sample Solution

public class TrainingBolt extends BaseBasicBolt {
    Model model;

    public void prepare(Map conf, TopologyContext ctx) {
        System.loadLibrary("VW");
        model = VW.init(conf);
    }

    public void execute(Tuple input, BasicOutputCollector collector) {
        Instance example = (Instance) input.getValue(0);
        model.learn(example);
        if (/* time since last export exceeds threshold */) {
            collector.emit(model);
        }
    }
}
3a. Hybrid Learning
§ Basic Requirements
› Bootstrap models via batch learning from large datasets
› Update models via realtime learning from latest events
§ Sample Solution
› ML in Hadoop + Storm
› ML in Spark + Storm
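A toy illustration of the hybrid pattern: bootstrap a linear model from a batch of historical examples (the Hadoop/Spark phase), then fold fresh events into it one at a time (the Storm phase). Both phases share the same model representation, which is the point of the design; function names and the learning rule are illustrative.

```python
def batch_train(examples, lr=0.1, epochs=20):
    """Batch phase: bootstrap a 2-feature linear model from history."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for y, x in examples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
    return w

def realtime_update(w, event, lr=0.1):
    """Realtime phase: fold one fresh event into the bootstrapped model,
    returning the updated weights without retraining from scratch."""
    y, x = event
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
```

The batch job gives the model its bulk accuracy; the per-event updates supply the freshness that pure batch retraining (hours/days) cannot.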
3b. Parameter Server on Grid
• Billions of features per model
• Millions of operations per second
• Enables asynchronous learning
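A toy parameter server capturing the asynchronous-learning bullet above: one server process holds the (potentially billions-of-features) weight vector, and workers pull a snapshot, compute a local gradient, and push a delta without coordinating with each other. Class and method names, and the squared-loss objective, are illustrative, not the Grid implementation.

```python
class ParameterServer:
    """Toy parameter server: workers pull weights and push gradient
    deltas without synchronizing with one another."""
    def __init__(self, dim):
        self.w = [0.0] * dim

    def pull(self):
        return list(self.w)  # snapshot; may be stale by the time it's used

    def push(self, grad, lr=0.1):
        self.w = [wi - lr * gi for wi, gi in zip(self.w, grad)]

def worker_step(ps, y, x, lr=0.1):
    """One asynchronous worker step against its pulled snapshot."""
    w = ps.pull()
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [2 * (pred - y) * xi for xi in x]  # squared-loss gradient
    ps.push(grad, lr)
```

Tolerating slightly stale snapshots is what removes the synchronization barrier and allows millions of push/pull operations per second across the grid.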
Summary: our ML platform stack (top to bottom)
§ Applications: Search Ranking, Photo/Video Services, Online Ads, Personalization, Abuse Detection
§ Machine Learning Libraries: Logistic Regression, Deep Learning, Unsupervised Learning, Decision Trees, …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL
Committed to Apache Open Source
(Per-project committer counts; the project names appeared as logos on the original slide)
› 8 Committers (6 PMCs) | Apache - 80
› 5 Committers (3 PMCs) | Apache - 18
› 3 Committers (2 PMCs) | Apache - 21
› 5 Committers (5 PMCs) | Apache - 17
› 3 Committers | Apache - 32
› 7 Committers (6 PMCs) | Apache - 33
§ Big-Data Blog … http://yahoohadoop.tumblr.com
§ Hiring … http://careers.yahoo.com
Thanks!