Hadoop Scheduling - a 7 Year Perspective

Job Scheduling in Hadoop: an exposé, by Joydeep Sen Sarma

Upload: joydeep-sen-sarma

Post on 16-Jan-2015


DESCRIPTION

Talk at Flipkart's SlashN conference 2014. Perspectives on Hadoop

TRANSCRIPT

Page 1: Hadoop Scheduling - a 7 year perspective

Job Scheduling in Hadoopan exposé

Joydeep Sen Sarma

Page 2

About Me

Facebook: Ran/Managed Hadoop ~ 3 years; wrote Hive

Mentor/PM Hadoop Fair-Scheduler

Used Hadoop/Hive (as Warehouse/ETL Dev)

Re-wrote significant chunks of Hadoop Job Scheduling (incl. Corona)

Qubole: Running World’s largest Hadoop clusters on AWS

c 2007

c 2014

Page 3

The Crime

• Statistical Multiplexing – largest jobs only fit on pooled hardware

• Data Locality – easier to manage

Shared Hadoop Clusters

Page 4

… and the Punishment

• “Have you no Hadoop Etiquettes?” (c 2007)

(reducer count capped in response)

• User takes down entire Cluster (OOM) (c 2007-09)
• Bad Job slows down entire Cluster (c 2009)
• Steady State Latencies get intolerable (c 2010-)
• ”How do I know I am getting my fair share?” (c 2011)
• “Too few reducer slots, cluster idle” (c 2013)

Page 5

The Perfect Weapon

• Efficient
• Scalable
• Strong Isolation
• Fair
• Fault Tolerant
• Low Latency

Scheduler

Page 6

Quick Review

• Fair Scheduler (Fairness/Isolation)
• Speculation (Fault Tolerance/Latency)
• Preemption (Fairness)
• Usage Monitoring/Limits (Isolation)

Page 7

And then there’s Hadoop (1.x) …

• Single JobTracker for all Jobs
– Does not scale, SPOF

• Pull Based Architecture
– Scalability and Low Latency at permanent War
– Inefficient – leaves idle time

• Slot Based Scheduling
– Inefficient

• Pessimistic Locking in Tracker
– Scalability Bottleneck

• Long Running Tasks
– Fairness and Efficiency at permanent War

Page 8

insert overwrite table dest select … from ads join campaigns on … group by …;


Poll Driven Scheduling

[Diagram: the JobTracker (master) tracks Map and Reduce tasks; each slave's TaskTracker sends a Heartbeat, receives a MapTask assignment in response, and runs it in a Child process]

Page 9

Pessimistic Locking

getBestTask():
  for pool: sortedPools
    for job: pool.sortedJobs()
      for task: job.tasks()
        if betterMatch(task) …

processHeartbeat():
  synchronized(world):
    return getBestTask()
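The pseudocode above can be made concrete. The sketch below is a toy model, not the real JobTracker code (the pool/job/task structures are invented for illustration): every heartbeat scans all pools, jobs, and tasks while holding one global lock, so heartbeat processing is fully serialized across TaskTrackers.

```python
import threading

class PessimisticScheduler:
    """Toy sketch of Hadoop 1.x-style pessimistic scheduling."""

    def __init__(self, pools):
        self.world = threading.Lock()      # the synchronized(world) lock
        self.pools = pools                 # {pool: {job: [task dicts]}}

    def get_best_task(self):
        # O(#pools * #jobs * #tasks) scan; "betterMatch" reduced to
        # "first unassigned task" for brevity
        for pool in sorted(self.pools):
            for job in sorted(self.pools[pool]):
                for task in self.pools[pool][job]:
                    if task["assigned"] is None:
                        return task
        return None

    def process_heartbeat(self, tracker):
        with self.world:                   # whole scan under the lock
            task = self.get_best_task()
            if task is not None:
                task["assigned"] = tracker
            return task
```

Because the entire scan sits inside the critical section, heartbeat latency grows with cluster size and no two heartbeats can be matched in parallel.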

Page 10

Slot Based Scheduling

• N cpus, M map slots, R reduce slots
– Memory cannot be oversubscribed!

• How to divide?
– M < N: not enough mappers at times
– R < N: not enough reducers at times
– N = M = R: enough memory to run 2N tasks?

• Reduce Tasks Problematic
– Network Intensive to start, CPU wasted
– Memory Intensive later
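The slot-division dilemma is plain arithmetic; a small sketch (the node sizes and task memory figures are illustrative, not from the talk):

```python
def fits_in_memory(map_slots, reduce_slots, task_mem_gb, node_mem_gb):
    """With static slots, the worst case is every map AND reduce slot
    busy at once, so memory must cover M + R concurrent tasks even
    though both slot types are rarely saturated together."""
    worst_case_gb = (map_slots + reduce_slots) * task_mem_gb
    return worst_case_gb <= node_mem_gb

# Example: a 16-core node with 48 GB and 2 GB tasks.
# N = M = R = 16 risks 64 GB of demand -> oversubscribed.
print(fits_in_memory(16, 16, 2, 48))   # False
# M = R = 12 stays within 48 GB, but can leave CPUs idle
# whenever the workload is map-heavy or reduce-heavy.
print(fits_in_memory(12, 12, 2, 48))   # True
```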

Page 11

Long Running Reducers

• Online Scheduling
– No advance information of future workload

• Greedy + Fair Scheduling
– Schedule ASAP
– Preempt if future workload disagrees

• Long Running Reducers
– Preemption causes restart and wasted work
– No effective way to use short bursts of idle CPU
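A minimal sketch of the greedy-plus-preemption tension (pool names, counts, and the function itself are invented for illustration, not the real FairScheduler logic):

```python
def preemption_plan(running, fair_share):
    """running / fair_share: {pool: task count}. Returns how many
    tasks to preempt from each pool running above its fair share,
    freeing slots for pools below their share. With long-running
    reducers, every preempted task loses all work done so far."""
    return {pool: count - fair_share.get(pool, 0)
            for pool, count in running.items()
            if count > fair_share.get(pool, 0)}

# Greedy scheduling gave "etl" 7 slots while "adhoc" was idle;
# when "adhoc" shows up, 3 of etl's tasks must be killed.
print(preemption_plan({"etl": 7, "adhoc": 1}, {"etl": 4, "adhoc": 4}))
```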

Page 12

Optimistic Locking

Task[] getBestTaskCandidates():
  for pool: sortedPools
    for job: pool.sortedJobs.clone()
      for task: job.tasks.clone()
        synchronized(task):
          …

processHeartbeat():
  tasks = getBestTaskCandidates()
  synchronized(world):
    return acquireTasks(tasks)
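The optimistic variant can also be sketched (again a toy structure, not Corona's actual code): the long candidate scan runs against a snapshot with no global lock, and the global lock is held only for a short validate-and-acquire step, where stale candidates that lost a race are simply dropped.

```python
import threading

class OptimisticScheduler:
    """Toy sketch of optimistic-locking task assignment."""

    def __init__(self, tasks):
        self.world = threading.Lock()
        self.tasks = tasks                  # flat list of task dicts

    def get_candidates(self, k):
        snapshot = list(self.tasks)         # clone; read without "world"
        return [t for t in snapshot if t["assigned"] is None][:k]

    def process_heartbeat(self, tracker, k=1):
        cands = self.get_candidates(k)      # expensive scan, lock-free
        acquired = []
        with self.world:                    # short critical section
            for t in cands:
                if t["assigned"] is None:   # re-validate: may have raced
                    t["assigned"] = tracker
                    acquired.append(t)
        return acquired
```

The critical section now does O(k) work instead of scanning the whole cluster, which is what removes the scalability bottleneck named on the earlier slide.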

Page 13

Corona: Push Scheduling

1. JT subscribes for M maps and R reduces
– Receives availability from Cluster Manager (CM)

2. CM publishes availability ASAP
– Pushes events to JT

3. JT pushes tasks to available TT
– In parallel
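The three push steps above can be modeled in a few lines (class and method names are invented for illustration): JobTrackers subscribe with how many slots they want, and the Cluster Manager pushes a grant the moment a node reports free capacity, instead of parking work until the next poll.

```python
class ClusterManager:
    """Toy model of Corona-style push scheduling."""

    def __init__(self):
        self.subs = []                       # [jobtracker, slots_wanted]

    def subscribe(self, jt, wanted):
        self.subs.append([jt, wanted])       # step 1: JT subscribes

    def node_available(self, node):
        # step 2: push the freed node to the first JT still wanting slots
        for sub in self.subs:
            if sub[1] > 0:
                sub[1] -= 1
                sub[0].grant(node)           # step 3: JT pushes a task to it
                return

class JobTracker:
    def __init__(self):
        self.granted = []

    def grant(self, node):
        self.granted.append(node)
```

The key contrast with the earlier heartbeat diagram: here idle capacity flows to the job as an event, so nodes never sit idle waiting for the next TaskTracker poll interval.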

Page 14

Corona/YARN: Scalability

1. JobTracker for each Job now independent
– More Fault Tolerant and Isolated as well

2. Centralized Cluster/Resource Manager
– Must be super-efficient!

3. Fundamental Differences
– Corona ~ Latency
– YARN ~ Heterogeneous workloads

Page 15

Pesky Reducers

• Hadoop 2 removes distinction between M and R slots

• Not Enough
– Reduce Tasks don’t use much CPU in shuffle
– Still long running and bad to preempt

→ Re-architect to run millions of small Reducers

Page 16

The Future is Cloudy

• Data Center Assumption:
– Cluster characteristics known
– Job spec fits to cluster

• In Cloud:
– Cluster can grow/shrink, change node-type
– Job Spec must be dynamic
– Uniform task configuration untenable

Page 17

Questions?

[email protected]
http://www.linkedin.com/in/joydeeps