TRANSCRIPT
HADOOP PLATFORM AT YAHOO: A YEAR IN REVIEW
SUMEET SINGH (@sumeetksingh), Sr. Director, Cloud and Big Data Platforms
Agenda
2
1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A
[Chart: Yahoo Hadoop footprint by year, 2006–2016. Left axis: number of servers (0–50,000); right axis: raw HDFS storage in PB (0–800).]
Milestones:
- Yahoo! commits to scaling Hadoop for production use
- Research workloads in Search and Advertising
- Production (modeling) with machine learning and WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Next-gen Hadoop (H 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache H2.7 (scalable ML, latency, utilization, productivity)
Platform Evolution
3
Deployment Models

On-Premise
- Private (dedicated) clusters: large, demanding use cases; new technology not yet platformized; data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters: source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale
- Purpose-built big data clusters: for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools

Public Cloud
- Hosted compute clusters: when more cost effective than on-premise; when time to market and results matter; when data is already in the public cloud
4
Platform Today (Apache / open source projects and Yahoo projects)
- Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
- Storage / Messaging: HDFS, HBase, HCat, Kafka, CMS, DH
- Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
- Compute: YARN, CS, MR, Tez, Spark, Storm
5
Technology Stack Assembly
- HDFS (file system)
- YARN (scheduling, resource management)
- Common base: RHEL6 64-bit, JDK8
- Platformized tech with production support
- In progress: unmet needs or Apache alignment
6
Common Backplane
- Hadoop: NameNode and DataNodes; RM and NodeManagers
- HBase: HBase Master and RegionServers; Storm: Nimbus and Supervisors
- Administration, management, and monitoring
- ZooKeeper pools
- HTTP/HDFS/GDM load proxies
- Applications and data: data feeds, data stores, Oozie Server, HS2/HCat
- Network backplane
7
Research Cluster Consolidation
One-month sample (2015); charts show compute total and used (TB) per cluster.
- Cluster 1: 2,000 servers, HDFS 12 PB, Compute 23 TB, avg. util 26%
- Cluster 2: 3,100 servers, HDFS 21 PB, Compute 52 TB, avg. util 40%
- Cluster 3: 5,400 servers, HDFS 36 PB, Compute 70 TB, avg. util 59%
8
Consolidated Research Cluster Characteristics
One-month sample (2016); chart shows compute total and used (TB).
- Consolidated cluster: HDFS 65 PB, Compute 240 TB, avg. util 70%
- Before: 10,500 servers; after: 2,200 servers
- 40% decrease in TCO
- 65% increase in compute capacity
- 50% increase in avg. utilization
9
Common Hadoop Cluster Configuration
Racks 1..N of CPU servers with JBODs and 10GbE, connected over a network backplane.
10
New Hadoop Cluster Configuration
Racks 1..N of CPU servers with JBODs and 10GbE over a network backplane, plus GPU servers and high-memory servers connected via 100 Gbps InfiniBand.
11
YARN Node Labels
Example: a Hadoop cluster where Queue 1 (40% capacity) is restricted to label x, Queue 2 (40%) can access labels x and y, and Queue 3 (20%) has no label; jobs J1–J4 land on nodes labeled x or y accordingly.

yarn.scheduler.capacity.root.<queue-name>.accessible-node-labels = <label-name>
yarn.scheduler.capacity.root.<queue-name>.default-node-label-expression sets the default label asked for by the queue.
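Putting those two properties in context, a minimal capacity-scheduler.xml sketch might look like the following; the queue name "gpu" and label "x" are illustrative, not from the deck:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment: queue and label
     names (gpu, x) are illustrative examples. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,gpu</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.gpu.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- Queue "gpu" may request containers on nodes labeled "x" -->
    <name>yarn.scheduler.capacity.root.gpu.accessible-node-labels</name>
    <value>x</value>
  </property>
  <property>
    <!-- Share of the label-x partition granted to this queue -->
    <name>yarn.scheduler.capacity.root.gpu.accessible-node-labels.x.capacity</name>
    <value>100</value>
  </property>
  <property>
    <!-- Containers from "gpu" land on label-x nodes unless the app overrides it -->
    <name>yarn.scheduler.capacity.root.gpu.default-node-label-expression</name>
    <value>x</value>
  </property>
</configuration>
```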
12
Agenda recap; up next: 3. CaffeOnSpark for Distributed DL
13
CaffeOnSpark – Distributed Deep Learning
Stack: CaffeOnSpark for DL and MLlib for non-DL, alongside Hive or SparkSQL, all running on Spark, on YARN (RM and scheduling), over HDFS (datasets).
14
A Few Use Cases – Yahoo Weather
15
A Few Use Cases – Flickr Facial Recognition
16
A Few Use Cases – Flickr Scene Detection
17
CaffeOnSpark Architecture – Common Cluster
The Spark Driver coordinates multiple Spark Executors (for data feeding and control). Each executor runs Caffe (enhanced with multi-GPU/CPU support) and a model synchronizer (across nodes) over MPI on RDMA/TCP, reads datasets from HDFS, and writes the model output to HDFS.
18
CaffeOnSpark Architecture – Incremental Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)              // train DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)    // extract features via DL
Feature Engineering: Deep Learning
19
CaffeOnSpark Architecture – Incremental Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)              // train DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)    // extract features via DL

lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
Feature Engineering: Deep Learning
20
Train Classifiers: Non-deep Learning
CaffeOnSpark Architecture – Single Command
spark-submit --num-executors #Exes \
  --class CaffeOnSpark my-caffe-on-spark.jar \
  -devices #GPUs \
  -model dl_model_file \
  -output lr_model_file
21
Distributed Deep Learning
CaffeOnSpark: github.com/yahoo/caffeonspark
- Apache License
- Runs on existing clusters
- Powerful DL platform
- Fully distributed
- High-level API
- Incremental learning
22
Agenda recap; up next: 4. Compute and Sketches
23
Hadoop Compute Sources
- HDFS (file system and storage)
- YARN (resource management and scheduling)
- Execution engines: MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate execution engine)
- Interfaces: Pig (scripting), Hive (SQL), Java MR APIs
- Workloads: data processing, ML, custom apps on Slider
- Data management: Oozie
24
Compute Growth
[Chart: MR, Tez, and Spark jobs per month (in millions), Mar 2013 to Mar 2016, growing from 13.3 to 39.1 million; intermediate points: 20.4, 23.8, 27.2, 32.3, 34.1.]
25
Pushing Batch Compute Boundaries
[Chart: % of total compute (memory-sec) by engine, Q1 2016. MapReduce declined from 78% (Jan) to 67% (Mar), Tez grew from 8% to 21%, and Spark moved from 14% to 12%. 112 million batch jobs ran in Q1 2016.]
26
Multi-tenant Apache Storm
27
Recent Apache Storm Developments at Yahoo
- Multi-tenant and Resource-Aware (RA) scheduler
- Distributed cache API
- 8x throughput
- Improved debuggability
- Pacemaker server
- Streaming benchmark [1]
[1] github.com/yahoo/streaming-benchmarks
28
Data Sketches Algorithms
Data Sketches Algorithms Library: datasketches.github.io
Good-enough approximate answers for hard-to-compute queries.

Characteristics:
- Streamable
- Approximate, with predictable error
- Sub-linear in size
- Mergeable / additive
- Highly parallelizable
- Maven deployable
29
Distinct Count Sketch, High-level View
Basic sketch elements: a big data stream passes through a transform that updates a compact data structure; an estimator then reads that structure to produce a result within +/- ε, with the discarded detail behaving as white noise.
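The transform / data structure / estimator pipeline above can be illustrated with a minimal K-Minimum-Values (KMV) distinct-count sketch in Python. This is a sketch of the idea only, not the DataSketches library API; the class, parameter names, and k value are ours:

```python
import hashlib

class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative).

    Transform: hash each item to a uniform value in (0, 1].
    Data structure: the k smallest hash values seen so far.
    Estimator: with n >> k distinct items, the k-th smallest hash m
    satisfies m ~= k / n, so n ~= (k - 1) / m (the unbiased form).
    """

    def __init__(self, k=256):
        self.k = k
        self.kept = set()  # up to k smallest normalized hash values

    def _hash(self, item):
        # Stable 64-bit hash mapped into (0, 1]; duplicates collide
        # on purpose, which is what dedupes repeated items.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        return (h + 1) / 2**64

    def update(self, item):
        self.kept.add(self._hash(item))
        if len(self.kept) > self.k:
            self.kept.discard(max(self.kept))  # trim back to k smallest

    def estimate(self):
        if len(self.kept) < self.k:
            return float(len(self.kept))  # exact while under k distinct items
        return (self.k - 1) / max(self.kept)

    def merge(self, other):
        # Mergeable/additive: keep the k smallest of the union, which is
        # identical to a sketch built over the combined stream.
        out = KMVSketch(min(self.k, other.k))
        out.kept = set(sorted(self.kept | other.kept)[:out.k])
        return out
```

With k = 256 the sketch stores only 256 values regardless of stream size (sub-linear), the relative error is predictable (roughly 1/sqrt(k)), and merging two sketches gives the distinct count of the union, which is what makes sketches parallelizable across mappers.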
30
Agenda recap; up next: 5. HBase and Omid
32
Apache HBase at Yahoo
- Security
- Isolated deployment
- Multi-tenancy
- Region server groups
- Namespaces
- Unsupported features
33
Security
Authentication: Kerberos (users, processes); delegation tokens (MapReduce, YARN, etc.)
Authorization: HBase ACLs (Read, Write, Create, Admin); permissions granted to a user or Unix group; ACLs at the table, column-family, or column level
34
Region Server Groups
- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, I/O, etc.)
35
Namespaces
- Analogous to a "database"
- Namespace ACL to create tables
- Default group
- Quota (tables, regions)
36
Split Meta to Spread Load and Avoid Large Regions
37
Favored Nodes for HDFS Locality
38
Humongous Tables
39
Scaling HBase to Handle Millions of Regions on a Cluster
- Region server groups
- Split meta
- Split ZK
- Favored nodes
- Humongous tables
40
Transactions on HBase with Omid [1]
- Highly performant and fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products
[1] Omid means "hope" in Persian
41
Omid Components
42
Omid Data Model
43
Agenda recap; up next: 6. Oozie
44
Oozie Data Pipelines
1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic on the message bus
3. A data producer writes to HDFS (distcp, Pig, M/R, ...) under /data/click/2014/06/02 and updates the metadata (ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'); HCatalog pushes a <New Partition> notification
4. The message bus notifies Oozie of the new partition and the workflow starts
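A coordinator wired to that flow could be sketched as below; the host, database/table, partition key, and paths are hypothetical, since the deck only shows the click feed path:

```xml
<!-- Hypothetical Oozie coordinator sketch: host, db/table, partition
     layout, and app-path are illustrative. -->
<coordinator-app name="click-pipeline" frequency="${coord:days(1)}"
                 start="2014-06-02T00:00Z" end="2014-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- HCat-backed dataset: Oozie waits for the partition to be
         published rather than polling HDFS for a flag file -->
    <dataset name="click" frequency="${coord:days(1)}"
             initial-instance="2014-06-01T00:00Z" timezone="UTC">
      <uri-template>hcat://hcat.example.com:9083/default/click/dt=${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="click">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/click/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```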
45
Large Scale Data Pipeline Requirements
Administrative
- One should be able to start, stop, and pause all related pipelines at the same time
Dependency Management
- The output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed; start processing even if only partial data is available
SLA Management
- Monitor pipeline processing to take immediate action in case of failures or SLA misses
- Pipeline owners should get notified if an SLA is missed
Multiple Providers
- If data is available from multiple providers, allow specifying the provider priority
- Combine datasets from multiple providers to fill gaps in the data a single provider may have
46
BCP and Mandatory / Optional Feeds

Pull data from A or B. Specify the dataset as "AorB"; the action starts running as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Dataset B is optional; Oozie starts processing as soon as A is available, including the dataset from A and whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
48
Data Not Guaranteed / Priority Among Dataset Instances

A takes precedence over B, and B takes precedence over C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
49
Oozie starts processing if the available instances of A are >= 10. min can also be combined with wait (as shown for dataset B).

<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>

Combining Datasets From Multiple Providers

The combine function first takes instances from A, then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
50
Agenda recap; up next: 7. Ease of Use
51
Automated Onboarding / Collaboration Portal
52
Built for Tenant Transparency
53
Queue Utilization Dashboard
54
Data Discovery and Access
55
Audits, Compliance, and Efficiency
Starling, a log warehouse, aggregates log sources across the grid:
- FS, job, and task logs from Cluster 1 .. Cluster n
- CF, region, action, and query stats from Cluster 1 .. Cluster n
- DB, table, partition, and column access stats from metastores MS 1 .. MS n
- Data definition, flow, feed, and source info from GDM facilities F 1 .. F n
56
Audits, Compliance, and Efficiency (cont'd)
Data discovery and access is driven by governance classification (Public, Non-sensitive, Financial $, Restricted), with requirements ranging from no additional requirement through LMS integration, Stock Admin integration, and an approval flow.
57
Hosted UI – Hue as a Service
Full-stack HA: users hit a VIP fronting two hot WSGI Hue instances (Hue-1.Cluster-1 and Hue-2.Cluster-1) that serve pages and static content (jQuery, Bootstrap, Knockout.js, Love). Authentication is SAML against an IdP; a MySQL DB (HA) stores cookies, saved queries, workflows, etc. Hue talks REST/Thrift to the Hadoop cluster: HS2, HCat metastore, Oozie Server, YARN RM, WebHDFS, and the NMs.
58
Going Forward
- Increased intelligence
- Greater speed
- Higher efficiency
- Necessary scale
59
Increased Intelligence
ML libraries proven to work at scale on complex problems: GBDT, FTRL, SGD, deep learning, random forests
Applications: click prediction, search ranking, keyword auctions, ad relevance, abuse detection
YARN (Resource Manager): heterogeneous scheduling, long-running services, GPUs, large-memory support, core grid enhancements
Parameter server: globally shared parameters
Compute engines: distributed processing
60
Greater Speed
Productivity dimensions and improvements:
- Data management: real-time pipelines, unified metadata and lineage, fine-grained access control, self-serve data movement, SLA and cost transparency
- Ease of use: intuitive UIs, planning and collaboration tools, central grid portal
- Analytics, BI, and reporting: query times < 1 sec, 4x speedups in ETL, SQL on HBase, limitless BI clients
61
Higher Efficiency
Achieve five 9s availability and 70% average compute utilization across clusters.
62
Hadoop Users at Yahoo
Slingstone & Aviate, Mail Anti-Spam, Gemini Campaign Mgmt., Search Assist, Audience Analytics, Flickr, YAM+ & Targeting, Membership Abuse ... and many more.
63
Yahoo at the Apache Software Foundation
Committers across eight Apache projects (project logos on slide): 10 committers (6 PMC), 7 committers (6 PMC), 6 committers (5 PMC), 3 committers (3 PMC), 3 committers (2 PMC), 3 committers (2 PMC), 1 committer, 1 committer
64
Join Us @ yahoohadoop.tumblr.com
65
THANK YOU
SUMEET SINGH (@sumeetksingh), Sr. Director, Cloud and Big Data Platforms
Icon courtesy: iconfinder.com (under Creative Commons)