
Page 1:

Hadoop Scalability at Facebook

Dmytro Molkov ([email protected])

YaC, Moscow, September 19, 2011

Page 2:

▪ How Facebook uses Hadoop

▪ Hadoop Scalability

▪ Hadoop High Availability

▪ HDFS Raid

Page 3:

How Facebook uses Hadoop

Page 4:

Uses of Hadoop at Facebook

▪ Warehouse

▪ Thousands of machines in the cluster

▪ Tens of petabytes of data

▪ Tens of thousands of jobs/queries a day

▪ Over a hundred million files

▪ Scribe-HDFS

▪ Dozens of small clusters

▪ Append support

▪ High availability

▪ High throughput

Page 5:

Uses of Hadoop at Facebook (contd.)

▪ Realtime Analytics

▪ Medium-sized HBase clusters

▪ High throughput/low latency

▪ FB Messages Storage

▪ Medium-sized HBase clusters

▪ Low latency

▪ High data durability

▪ High Availability

▪ Misc Storage/Backup clusters

▪ Small to medium-sized

▪ Various availability/performance requirements

Page 6:

Hadoop Scalability

Page 7:

Hadoop Scalability

▪ Warehouse Cluster - A “Single Cluster” approach

▪ Good data locality

▪ Ease of data access

▪ Operational Simplicity

▪ NameNode is the bottleneck

▪ Memory pressure - too many files and blocks (see the rough estimate after this list)

▪ CPU pressure - too many metadata operations against a single node

▪ Long Startup Time

▪ JobTracker is the bottleneck

▪ Memory Pressure - too many jobs/tasks/counters in memory

▪ CPU pressure - scheduling computation is expensive
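To make the NameNode memory pressure concrete, here is a back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object and about one block per file; neither figure is from this talk:

100M files + ~100M blocks ≈ 200M namespace objects
200M objects × ~150 bytes/object ≈ 30 GB of NameNode heap

All of that state lives in a single JVM and has to be rebuilt on startup, which is also why restarts take so long.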

Page 8:

HDFS Federation Wishlist

▪ Single Cluster

▪ Preserve Data Locality

▪ Keep Operations Simple

▪ Distribute both CPU and Memory Load

Page 9:

Hadoop Federation Design

[Diagram: NameNode #1 ... NameNode #N, each serving part of the namespace; all of them share the same pool of DataNodes (DataNode ... DataNode)]

Page 10:

HDFS Federation Overview

▪ Each NameNode holds a part of the NameSpace

▪ Hive tables are distributed between namenodes

▪ Hive Metastore stores full locations of the tables (including the namenode) -> Hive clients know which cluster the data is stored in

▪ HDFS Clients have a mount table to know where the data is (sketched after this list)

▪ Each namespace uses all datanodes for storage -> the cluster load is fully balanced (Storage and I/O)

▪ Single Datanode process per node ensures good utilization of resources
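As a rough illustration of the client-side mount table, here is a minimal Java sketch. The mount points and NameNode addresses are hypothetical, and the real HDFS client logic is more involved:

import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: resolve a path in the federated namespace to the
// NameNode that owns it. Mount points and URIs are made-up examples.
public class MountTable {

    // TreeMap lets us scan keys in descending order, so a longer mount
    // point is always tried before any of its prefixes.
    private final TreeMap<String, URI> mounts = new TreeMap<>();

    public MountTable() {
        mounts.put("/warehouse/hive", URI.create("hdfs://namenode1:8020"));
        mounts.put("/scribe",         URI.create("hdfs://namenode2:8020"));
        mounts.put("/",               URI.create("hdfs://namenode0:8020")); // default
    }

    public URI resolve(String path) {
        for (Map.Entry<String, URI> e : mounts.descendingMap().entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        throw new IllegalArgumentException("No mount point for " + path);
    }

    public static void main(String[] args) {
        MountTable mt = new MountTable();
        // Prints hdfs://namenode1:8020 - the table, not the path, decides.
        System.out.println(mt.resolve("/warehouse/hive/ads/part-00000"));
    }
}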

Page 11:

Map-Reduce Federation

▪ Backward compatibility with the old code

▪ Preserve data locality

▪ Make scheduling faster

▪ Ease the resource pressure on the JobTracker

Page 12:

Map Reduce Federation

[Diagram: the Job Client sends a Resource Request to the Cluster Resource Manager; TaskTrackers send Resource Heartbeats to the Cluster Resource Manager; Job Communication flows directly between the Job Client and the TaskTrackers]

Page 13:

MapReduce Federation Overview

▪ Cluster Manager only allocates resources

▪ JobTracker per user -> fewer tasks per JobTracker -> more responsive scheduling

▪ ClusterManager is stateless -> shorter restart times -> better availability (a rough sketch of the split follows)
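A minimal Java sketch of that split, with hypothetical names and signatures (this illustrates the idea, not the actual Facebook implementation):

import java.util.List;

// Hypothetical sketch: a stateless resource allocator plus a per-user
// JobTracker that schedules only its own tasks.
public class FederationSketch {

    // A grant ties a TaskTracker address to a number of task slots.
    static class SlotGrant {
        final String taskTracker;
        final int slots;
        SlotGrant(String taskTracker, int slots) {
            this.taskTracker = taskTracker;
            this.slots = slots;
        }
    }

    // Allocates and reclaims slots but keeps no per-job state, which is
    // what makes its restarts cheap.
    interface ClusterManager {
        List<SlotGrant> requestSlots(String jobId, int numSlots);
        void release(String jobId);
    }

    // One of these runs per user; with only that user's tasks in memory,
    // scheduling decisions stay fast.
    interface PerUserJobTracker {
        void run(String jobId, ClusterManager clusterManager);
    }
}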

Page 14:

Hadoop High Availability

Page 15:

Warehouse High Availability

▪ Full cluster restart takes 90-120 mins

▪ Software upgrade is 20-30 hrs of downtime/year

▪ Cluster crash is 5 hrs of downtime/year

▪ MapReduce tolerates failures
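The first two figures are consistent with each other: at 90-120 minutes per full restart, 20-30 hours of upgrade downtime per year works out to roughly 12-15 planned restarts (this division is an inference from the numbers above, not a figure from the talk):

20-30 hrs/year ÷ 1.5-2 hrs/restart ≈ 12-15 restarts/year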

Page 16:

HDFS High Availability Design

[Diagram: the Primary NN writes the Edits Log to shared NFS and the Standby NN replays it; DataNodes send Block Reports and Block Received messages to both NameNodes]

Page 17:

Clients Design

▪ Using ZooKeeper as a method of name resolution

▪ Under normal conditions ZooKeeper contains the location of the primary NameNode

▪ During a failover the ZooKeeper record is empty, and the clients know to wait for the failover to complete

▪ On a network failure, clients check whether the ZooKeeper entry has changed and retry the command against the new Primary NameNode if a failover has occurred

▪ For large clusters, clients also cache the location of the primary on the local node to ease the load on the ZooKeeper cluster (a lookup sketch follows)
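A minimal Java sketch of that lookup, using the standard ZooKeeper client API; the znode path, ensemble addresses, and retry interval are assumptions:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal sketch: resolve the primary NameNode through ZooKeeper,
// waiting out a failover if the record is empty.
public class PrimaryResolver {

    private static final String ZNODE = "/hdfs/primary-namenode"; // assumed path

    static String resolvePrimary(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        while (true) {
            byte[] data = zk.getData(ZNODE, false, new Stat());
            if (data != null && data.length > 0) {
                return new String(data); // e.g. "namenode1:8020"
            }
            // Empty record means a failover is in progress: wait, then retry.
            Thread.sleep(1000);
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, null);
        System.out.println("Primary NameNode: " + resolvePrimary(zk));
        zk.close();
    }
}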

Page 18:

HDFS Raid

Page 19:

HDFS Raid

▪ 3-way replication

▪ Data locality - necessary only for new data

▪ Data availability - necessary for all kinds of data

▪ Erasure codes

▪ Data locality is worse than with 3-way replication

▪ Data availability is at least as good as with 3-way replication

Page 20:

HDFS Raid Details

XOR:

10 blocks replicated 3 times = 30 physical blocks. Effective replication factor 3.0

10 blocks replicated twice + a checksum (XOR) block replicated twice = 22 physical blocks. Effective replication factor 2.2

Reed-Solomon encoding:

10 blocks replicated 3 times = 30 physical blocks. Effective replication factor 3.0

10 blocks with replication factor 1 + 4 erasure-code (RS) parity blocks replicated once = 14 physical blocks. Effective replication factor 1.4
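To make the XOR numbers concrete, here is a toy Java sketch (not the actual HDFS Raid code) showing why one parity block per stripe is enough to rebuild any single missing block:

// Toy sketch: XOR parity over a stripe of equally sized blocks.
public class XorRaidSketch {

    // Parity block: byte-wise XOR of every block in the stripe.
    static byte[] parity(byte[][] stripe) {
        byte[] p = new byte[stripe[0].length];
        for (byte[] block : stripe)
            for (int i = 0; i < p.length; i++)
                p[i] ^= block[i];
        return p;
    }

    // Rebuild the block at index 'missing' by XOR-ing the parity block
    // with all the surviving blocks.
    static byte[] reconstruct(byte[][] stripe, int missing, byte[] parity) {
        byte[] out = parity.clone();
        for (int b = 0; b < stripe.length; b++)
            if (b != missing)
                for (int i = 0; i < out.length; i++)
                    out[i] ^= stripe[b][i];
        return out;
    }

    public static void main(String[] args) {
        byte[][] stripe = { {1, 2}, {3, 4}, {5, 6} };
        byte[] p = parity(stripe);
        byte[] rebuilt = reconstruct(stripe, 1, p);
        System.out.println(rebuilt[0] == 3 && rebuilt[1] == 4); // true
    }
}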

Page 21:

HDFS Raid Pros and Cons

▪ Saves a lot of space

▪ Provides the same data-availability guarantees

▪ Worse data locality

▪ Need to reconstruct blocks instead of replicating (CPU + Network cost)

▪ Block location in the cluster is important and needs to be maintained

Page 23:

(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0