Availability and Integrity in Hadoop (Strata EU edition)

DESCRIPTION
Strata EU conference slides on HA in Hadoop; demo omitted. A longer slide set is to follow.

TRANSCRIPT
© Hortonworks Inc. 2012
Data Availability and Integrity in Apache Hadoop
Steve Loughran @[email protected]
Questions Hadoop Ops teams ask
•Can Hadoop keep my data safe?
•Can Hadoop keep my data available?
•What happens when things go wrong?
•Can you improve this?
Can Hadoop Keep My Data Safe?

[Diagram: an HDFS cluster — two racks of DataNodes behind ToR switches, a core switch, the NameNode, Secondary NameNode, and (Job Tracker); a file is split into block1, block2, block3…]
Replication handles data integrity

•CRC32 checksum per 512 bytes
•Verified across datanodes on write
•Verified on all reads
•Background verification of all blocks (~weekly)
•Corrupt blocks re-replicated
•All replicas corrupt → operations team intervention

2009: Yahoo! lost 19 out of 329M blocks on 20K servers – bugs now fixed
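The per-chunk checksum idea can be sketched outside Hadoop with the JDK's own CRC32 class. This is a minimal illustration of the scheme (one CRC32 per 512-byte chunk), not HDFS's actual implementation; the class and method names are made up for the sketch:

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksum {
    static final int BYTES_PER_CHECKSUM = 512; // HDFS default: one CRC32 per 512 bytes

    // Compute one CRC32 per 512-byte chunk, the way HDFS checksums block data.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];
        Arrays.fill(block, (byte) 7);
        long[] sums = checksums(block);
        System.out.println(sums.length); // 3 chunks: 512 + 512 + 276 bytes

        // Flip a single bit: the affected chunk's checksum no longer matches,
        // which is how a reader detects a corrupt replica.
        block[0] ^= 1;
        System.out.println(sums[0] != checksums(block)[0]); // true
    }
}
```

Because only the damaged chunk's checksum changes, corruption can be localised to a 512-byte region rather than forcing a whole-block comparison.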
Harder: Switch failure

[Diagram: the same cluster — a failed ToR switch cuts an entire rack of DataNodes off from the NameNode and the rest of the cluster]
Bonded 1 GbE across >1 switch
Avoids hardware problems, not software
NameNode failure: rare but costly

[Diagram: NameNode with shared storage for the filesystem image and journal ("edit log"); the Secondary NameNode receives the streamed journal and checkpoints the filesystem image]

1. Try to reboot/restart
2. Bring up a new NameNode server
   - with the same IP
   - or restart the DataNodes

Yahoo!: 22 NameNode failures on 25 clusters in 18 months = .99999 availability
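A rough sanity check on that availability figure. The slides give only the failure count; the per-failure recovery time here is an assumption for illustration, roughly ten minutes each:

```java
public class Availability {
    public static void main(String[] args) {
        // 25 clusters observed for 18 months (~30-day months) of service time
        double clusterHours = 25 * 18 * 30 * 24;
        // 22 NameNode failures, assumed ~10 minutes of downtime each (assumption, not from the slides)
        double downtimeHours = 22 * 10.0 / 60;
        double availability = 1.0 - downtimeHours / clusterHours;
        System.out.printf("%.5f%n", availability); // prints 0.99999: "five nines"
    }
}
```

With recovery times on that order, the quoted five-nines figure is consistent with 22 failures over 450 cluster-months.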
What to improve
•Address costs of NameNode failure in Hadoop 1
•Add live NN failover (HDFS 2.0)
•Eliminate shared storage (HDFS 2.x)
•Add resilience to the entire stack
Full Stack HA
Add resilience to planned/unplanned outages of the layers underneath
HA in Hadoop 1 (HDP1)
Use existing HA clustering technologies to add cold failover of key manager services:
•VMWare vSphere HA
•RedHat HA Linux
HA Linux: heartbeats & failover

[Diagram: two NameNode servers under RedHat HA Linux; the service addresses (NN IP, 2NN IP, JT IP) float across the physical hosts (IP1–IP4), so the NameNode, Secondary NameNode, and (Job Tracker) fail over between machines behind the ToR switches while the DataNodes keep running]
Linux HA Implementation
•Replace the init.d script with a "Resource Agent" script
•Probe the deep state of HDFS and the Job Tracker
•Detection & handling of hung processes is hard
•Test in virtual + physical environments
•Testing with physical clusters
Yes, but does it work?

public void testKillHungNN() {
  assertRestartsHDFS {
    // signal 19 (SIGSTOP) hangs the NameNode rather than killing it
    nnServer.kill(19, "/var/run/hadoop/hadoop-hadoop-namenode.pid")
  }
}

Groovy JUnit tests: "Tools of Chaos" to break remote hosts and infrastructures
And how long does it take?

Small cluster: 1-3 minutes
Medium cluster: 2-4 minutes
(where "medium" == a petabyte or less)

Cold failover is good enough for small/medium clusters
"Full Stack": IPC client

Configurable retry & time to block:
  ipc.client.connect.max.retries
  dfs.client.retry.policy.enabled

1. Blocking works for most clients (HBase, Pig…)
2. Failure-aware applications can tune/disable
3. Job Tracker added a "Safe Mode" for outages
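The two properties named above live in the client-side Hadoop configuration files. A sketch of what setting them might look like; the property names are from the slide, but the values here are illustrative, not recommendations:

```xml
<!-- core-site.xml: how many times the IPC client retries connecting -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>
</property>

<!-- hdfs-site.xml: enable the blocking/retrying DFS client policy -->
<property>
  <name>dfs.client.retry.policy.enabled</name>
  <value>true</value>
</property>
```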
Putting it all together: Demo
HA in Hadoop HDFS 2

Hadoop 2.0 HA

[Diagram: two NameNodes, one Active and one Standby, each paired with a Failure Controller; the controllers use a three-node ZooKeeper quorum to agree on which NameNode is Active, and the DataNodes report to both]
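The core of the failure-controller design is that the controllers contend for a single lock and exactly one NameNode wins the Active role; when the Active dies, its lock is released and the standby's controller wins the re-election. A toy sketch of that contract, simulating ZooKeeper's ephemeral lock with an in-process AtomicReference (all names here are invented for the sketch; real HDFS 2 uses the ZKFailoverController):

```java
import java.util.concurrent.atomic.AtomicReference;

public class FailoverSketch {
    // Stand-in for the ZooKeeper ephemeral lock the failure controllers contend for.
    static final AtomicReference<String> activeLock = new AtomicReference<>();

    // Each controller tries to grab the lock; only one NameNode becomes Active.
    static String tryBecomeActive(String nn) {
        return activeLock.compareAndSet(null, nn) ? "Active" : "Standby";
    }

    public static void main(String[] args) {
        System.out.println("nn1: " + tryBecomeActive("nn1")); // nn1: Active
        System.out.println("nn2: " + tryBecomeActive("nn2")); // nn2: Standby

        // The Active NameNode dies: its ephemeral lock vanishes,
        // and the standby's controller wins the next election.
        activeLock.compareAndSet("nn1", null);
        System.out.println("nn2: " + tryBecomeActive("nn2")); // nn2: Active
    }
}
```

The real system adds fencing of the old Active before promotion, so a partitioned-but-alive NameNode cannot keep writing; the lock alone is not enough.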
When will HDFS 2 be ready?
Moving from alpha to beta ... production in 2013

Download and play with early releases!
Moving forward
•Retry policies for all remote client protocols/libraries in the stack.
•Dynamic (zookeeper?) service lookup
•YARN needs HA of Resource Manager, individual MR clusters
• “No more Managers”
Summary
•HDFS handles corruption and partial loss of data today
•Hadoop 1 now has cold failover for small/medium clusters
•Hadoop 2 adding hot failover
•Full Stack HA for resilience to outages
Single Points of Failure
There's always a SPOF
Q. How do you find it?
A. It finds you