aggressive cloning of jobs for effective straggler mitigation ganesh ananthanarayanan, ali ghodsi,...

20
Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Upload: odalys-nazworth

Post on 01-Apr-2015

218 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Aggressive Cloning of Jobs for Effective Straggler Mitigation

Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Page 2: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Small jobs increasingly important

• Most jobs are small– 82% of jobs contain less than 10 tasks (Facebook’s

Hadoop cluster)

• Most small jobs are interactive and latency-constrained– Data analyst testing query on small sample

• Small jobs particularly sensitive to stragglers

Page 3: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Straggler Mitigation

• Blacklisting: – Clusters periodically diagnose and eliminate

machines with faulty hardware

• Speculation: Non-deterministic stragglers– Complete systemic modeling is intrinsically complex

[e.g., Dean’12 at Google]– LATE [OSDI’08], Mantri [OSDI’10]…

Page 4: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Despite the mitigation techniques…

LATE: The slowest task runs 8 times slower than the median task

Mantri: The slowest task runs 6 times slower than the median task

• (…but they work well for large jobs)

Page 5: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

State-of-the-Art Straggler Mitigation

Speculative Execution:(in LATE, Mantri, MapReduce)

1. Wait: observe relative progress rates of tasks

2. Speculate: launch copies of tasks that are predicted to be stragglers

Page 6: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Why doesn’t this work for small jobs?

1. Consist of just a few tasks– Statistically hard to predict stragglers– Need to wait longer to accurately predict stragglers

2. Run all their tasks simultaneously– Waiting can constitute considerable fraction of a

small job’s duration

Wait & Speculate is ill-suited to address stragglers in small jobs

Page 7: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Cloning Jobs

• Proactively launch clones of a job, just as they are submitted

• Pick the result from the earliest clone• Probabilistically mitigates stragglers

• Eschews waiting, speculation, causal analysis…

Is this really feasible??

Page 8: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Low Cluster Utilization

• Clusters have median utilization of under 20%– Provisioned for (short burst of) peak utilization

• Cluster energy-efficiency proposals– Not adopted in today’s clusters! – Peak utilization decides half the energy bill– Hardware and software reliability issues…

Page 9: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Power-law exponent = 1.9

Tragedy of commons?

Power-law: 90% of jobs use

6% of resources FB, Bing, Yahoo!

Can clone small jobs with few extra resources

• If every job utilizes the “lowly utilized” cluster…– Instability and negative performance effects

Page 10: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Job

Strawman

Earliest

Easy to implement Directly extends to any framework

M1M2

M2

R1

R1M1

Page 11: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Number of map clones

• Contention for input data by map task clones

• Storage crunch Cannot increase replication

>> 3 clones

Page 12: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Task-level Cloning

Job

Earliest

M1M1

M2

R1

R1M2

Earliest Earliest

Page 13: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

≤3 clones sufficesStrawman Task-level Cloning

Page 14: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Dolly: Cloning Jobs

• Task-level cloning of jobs• Works within a budget– Cap on the extra cluster resources for cloning

Page 15: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Evaluation

• Workload derived from Facebook traces– FB: 3500 node Hadoop cluster, 375K jobs, 1 month

• Trace-driven simulator

• Baselines: LATE and Mantri, + blacklisting• Cloning budget of 5%

Page 16: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Baseline: LATESmall jobs benefit significantly!

Average completion time improves by 44%

Page 17: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Baseline: MantriSmall jobs benefit significantly!

Average completion time improves by 42%

Page 18: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Intermediate Data Contention

• We would like every reduce clone to get its own copy of intermediate data (map output)

• Not replicated, to avoid overheads

• What if a map clone straggles?

Page 19: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Intermediate Data Contention

M1

M1

M2

M2

R1

R1

M1

M2

Wait for exclusive copy or contend for the available copy?

Page 20: Aggressive Cloning of Jobs for Effective Straggler Mitigation Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

Conclusion

• Stragglers in small jobs are not well-handled by traditional mitigation strategies– Guessing task to speculate very hard, waiting wastes

significant computation time

• Dolly: Proactive Cloning of jobs– Power-law Small cloning budget (5%) suffices– Jobs improve by at least 42% w.r.t. state-of-the-art

straggler mitigation strategies

• Low utilization + Power-law + Cloning?