


FARMS: Efficient MapReduce Speculation for Failure Recovery in Short Jobs

Huansong Fu (a), Haiquan Chen (b), Yue Zhu (a), Weikuan Yu (a)

(a) Florida State University, (b) Valdosta State University

Abstract

With the ever-increasing size of software and hardware components and the complexity of configurations, large-scale analytics systems face the challenge of frequent transient faults and permanent failures. As an indispensable part of big data analytics, MapReduce is equipped with a speculation mechanism to cope with run-time stragglers and failures. However, we reveal that the existing speculation mechanism has some major drawbacks that hinder its efficiency during failure recovery, which we refer to as the speculation breakdown.

We use the representative implementation of MapReduce, i.e., YARN and its speculation mechanism, as a case study to demonstrate that the speculation breakdown causes significant performance degradation among MapReduce jobs, especially those with shorter turnaround time. As our experiments show, a single node failure can cause a job slowdown of up to 9.2 times. In order to address the speculation breakdown, we introduce a failure-aware speculation scheme and a refined task scheduling policy. Moreover, we have conducted a comprehensive set of experiments to evaluate the performance of both the individual components and the whole framework. Our experimental results show that our new framework achieves dramatic performance improvement in handling node failures compared to the original YARN.

Keywords: MapReduce, YARN, Speculation, Failure Recovery

1. Introduction

MapReduce-based distributed computing has gained wide popularity since Google introduced it in 2004 [13]. Specifically, Hadoop [2] has become the de facto standard implementation of MapReduce. It has now evolved into its second generation, called YARN [1]. YARN is designed to overcome scalability and flexibility issues encountered in the first generation of Hadoop.

The popularity of Hadoop is largely due to its good performance for big data analytics workloads [13]. To maintain this performance in highly unstable, heterogeneous environments, where task stragglers are common for various reasons [36, 6], a mechanism called speculation is employed. A global speculator proactively makes a copy of a straggler task that may block the job progress. The completion of either the straggler task or the new copy lets the job proceed. Even when a whole computing node goes down, as long as all the tasks on the node are properly speculatively duplicated, the job performance will not degrade too much.

However, we have found that the existing speculation mechanism has several deficiencies, especially for small jobs. Fig. 1 shows the job slowdown caused by a single node failure, i.e., a crashed or unresponsive node. It shows results with input sizes varying from 1 GB to 10 GB and an increasing number of tasks. We can see that, for jobs with 1 to 10 GB of input data or 10 to 100 tasks, a single node failure can degrade job performance by a factor ranging from 3.3x to 9.2x.

Email addresses: [email protected] (Huansong Fu), [email protected] (Haiquan Chen), [email protected] (Yue Zhu), [email protected] (Weikuan Yu)

[Fig. 1 plots the slowdown (times) and the number of tasks against input size (1 GB to 10 GB).]

Fig. 1: Wordcount job performance when one node fails.

How serious is this impact? Although Hadoop is known for its ability to process big data, a significant portion of jobs in real-world environments are actually small jobs, as reported by a wide range of studies [9, 4, 3, 7, 26]. The size of MapReduce jobs in production clusters follows a power-law distribution, with a vast majority of them containing less than 10 GB of input. For example, the distribution of Facebook workloads [4] demonstrates a heavy-tail tendency, where about 90% of jobs have 100 or fewer tasks, and many have 100 GB or less of input data. Thus, many jobs will suffer from the performance degradation shown above. In our experiments we only inject node failures, which are very common in real-world environments. According to [12], there is an average of five node failures during one MapReduce job's execution, using 268 nodes on average. All this evidence indicates a critical need to revisit the existing speculation mechanism in the MapReduce model.

In order to address the aforementioned problem, we have studied the inability of the existing speculation mechanism to handle failure recovery. We then introduce a set of techniques, which we name FARMS (Failure-Aware, Retrospective and Multiplicative Speculation). It includes an optimized speculation mechanism and a fast scheduling policy for failures. Our experimental results show that FARMS achieves dramatic performance improvement in handling failures compared to the original YARN.

In summary, our work makes the following contributions:

• We systematically reveal the drawbacks of the current speculation mechanism and analyze the causes and effects of node failures.

• We improve the efficiency of YARN's existing speculation against failures by introducing our speculation scheme FARMS.

• We equip YARN with a heuristic failure scheduling algorithm named Fast Analytics Scheduling (FAS) to work with FARMS, which adds strong resiliency to heterogeneous real-world environments.

• We demonstrate that our new speculation mechanism improves YARN's performance significantly under failures, especially for small jobs.

This paper is organized as follows. Section 2 details our findings on the existing speculation mechanism with experimental results. Section 3 introduces our solution designs of FARMS and FAS. Section 4 presents the evaluation results of our implementation. We survey related work in Section 5 and conclude the paper in Section 6.

2. Background and Motivation

2.1. Fault Tolerance and Speculation Mechanism of YARN

As a representative implementation of MapReduce, Hadoop strives to provide outstanding performance in terms of job turnaround time, scalability, fault tolerance, etc. [33]. YARN aims to overcome shortcomings of Hadoop and provide lower-level support for various programming models, e.g., MapReduce and MPI [29]. For simplicity, we refer to YARN MapReduce as YARN in this paper. In YARN, each job is comprised of one ApplicationMaster, a.k.a. AM, and many MapTasks and ReduceTasks. Each MapTask reads one input split that contains many <k,v> pairs from HDFS and converts those records into intermediate data in the form of <k',v'> pairs. That intermediate data is organized into a Map Output File (MOF) and stored on the local file system. A MOF contains multiple partitions, one per ReduceTask. After one wave of MapTasks, the AM launches ReduceTasks, overlapping the reduce phase with the map phase of the remaining MapTasks. Once launched, a ReduceTask fetches

its partitions from all MOFs and applies the reduce function on them. The final results are stored to HDFS.

In order to achieve strong fault tolerance, YARN is equipped with data replication and regeneration mechanisms. A task is properly regenerated upon various failures (network, disk, node, etc.). Even if the original input data is unavailable because of failures, the rescheduled task will have access to a replica of the data, so a correct failover is still ensured. In addition, YARN depends on long timeouts to declare a failure for each task. Such long timeouts are necessary to avoid false positive decisions on failures, but they can prolong the recovery when real failures occur. So a simple failure can lead to large performance degradation, especially for small jobs, which have very short turnaround times. To make things worse, failures are prevalent in commodity clusters (as reported by [12, 23, 27, 30, 8, 31]). As a result, YARN's performance can be seriously affected by failures if it relies solely on the naive task-restarting mechanism. Thus, apart from the fail-over scheme, YARN also has a speculation mechanism that can help accelerate detection and recovery.

Speculation has been extensively studied with a variety of focuses [36, 6, 3, 5]. Most of these strategies share a core similarity, i.e., they make a speculative copy of the slowest task during the job execution, a.k.a. the straggler. For instance, the LATE scheduler [36], the default speculation mechanism of Hadoop, estimates the completion time of every task and uses the results to rank those tasks. The task that is estimated to finish the latest will get a speculative copy on a fast node. After a configurable time interval, the speculator searches again for slow tasks and launches speculative copies for them intermittently. This strategy, along with others that vary in how they determine stragglers (e.g., using processed data instead of task progress in Mantri [6]), has been adopted by the mainstream industry to prevent stragglers from delaying job completion.
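To make the ranking concrete, the sketch below estimates each running task's remaining time from its observed progress rate and picks the task expected to finish last, in the spirit of LATE. It is a simplified illustration, not YARN's actual estimator; all class and field names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Simplified LATE-style ranking: estimate time-to-finish from the observed
// progress rate and pick the running task expected to finish last.
public final class LateStyleRanker {

    public static final class TaskStats {
        final String taskId;
        final double progress;      // 0.0 .. 1.0
        final long runningMillis;   // wall-clock time since launch

        TaskStats(String taskId, double progress, long runningMillis) {
            this.taskId = taskId;
            this.progress = progress;
            this.runningMillis = runningMillis;
        }

        double estimatedTimeLeftMillis() {
            double ratePerMs = progress / Math.max(runningMillis, 1);
            return (1.0 - progress) / Math.max(ratePerMs, 1e-12);
        }
    }

    // Returns the running task estimated to finish last, if any.
    public static Optional<TaskStats> pickSpeculationCandidate(List<TaskStats> running) {
        return running.stream()
                .max(Comparator.comparingDouble(TaskStats::estimatedTimeLeftMillis));
    }
}
```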

2.2. Issues With The Existing Speculation

However, we find that the existing speculation mechanism has some major drawbacks, which seriously impede its efficiency in real-world environments, where failures are prevalent. Next, we describe the two issues present in the speculation mechanism. In the rest of the paper, we use speculate to mean launching a speculative copy of a task.

2.2.1. Converged Task Stragglers

To start with, speculation simply makes a copy of the slowest task. But what if all the tasks are slow? For example, if every single task of a job is converged on one single node and the node becomes unresponsive due to a crash or lost connection, the speculator will not speculate any of those tasks since they all have roughly the same progress. The speculation algorithm cannot decide which task is slower, so the whole job will halt until each of the tasks times out and then starts from scratch again. Clearly, those timeouts could be avoided by early speculation as soon as YARN recognizes that the tasks on the same node have stalled.



However, this is not feasible in the current YARN speculator because its speculation decision is based only upon comparison of the progress or processed data of tasks across all participating nodes. Thus, it is unable to discover intra-node converged task stragglers.

One may argue that converged task stragglers can rarely occur for most MapReduce jobs because such convergence works against the distributed computing nature of MapReduce. However, we find that this phenomenon is not rare but indeed extremely common among small jobs. The reason is a design feature of YARN MapReduce. Although the MapReduce framework provides data locality that can help distribute tasks evenly across different nodes, practical implementations such as YARN do not follow this principle strictly. With its default scheduling policy (the capacity scheduler), the scheduler requests several containers at once from one NodeManager, and when it gets enough containers for the job, it stops requesting. When the job is small (so it does not need many containers), the MapTasks have a very high probability of residing on the same node. This design of the ResourceManager is good for YARN's extreme scalability, but unfortunately causes task convergence and degrades the effectiveness of speculation.

2.2.2. Prospective-only Speculation

Another critical issue of the existing speculation relates to the correlation between the map and reduce phases in the MapReduce model. The existing speculation mechanism only speculates running tasks. If a task is finished, it is excluded from the candidates for speculation. The progress comparison of the existing speculation algorithm still uses the completed task's progress (100%, of course) to determine whether other running tasks should be speculated, but it no longer considers making speculative copies of the completed tasks themselves.

Intuitively, it is a reasonable strategy since completed tasks should have no way of delaying a job. However, MapReduce computing typically requires the use of intermediate data (the MOF) produced by the completed MapTasks. Clearly, it will be a problem if that intermediate data is lost, because the job will be held up until it finally finds out that the intermediate data is permanently lost. The MOF partitions of one MapTask are often consumed by multiple ReduceTasks, and one failed node often contains MOFs from multiple MapTasks. Thus, a failed node can cause a large number of delayed ReduceTasks. As a result, completed tasks can also become stragglers, which the current speculation mechanism is unable to address. In other words, the existing speculation mechanism can only make prospective copies of tasks, but not retrospective ones. The issue of prospective-only speculation implies that a task should be considered subject to failure even if its progress has reached 100%. This issue results in serious degradation of job performance, which we will demonstrate in later sections.

2.3. The Breakdown of the Existing Speculation

With the ever-increasing size of software and hardware components and the complexity of configurations, large-scale analytics systems face the challenge of frequent transient faults and permanent failures. Conventionally, such faults and failures are

categorized based on the tolerance level an application may exhibit against them. Soft errors are transient faults upon which a process may experience slow performance or erroneous results. Hard errors are usually permanent failures caused by network disconnection, disk and node failures. Both soft and hard errors can cause the slowdown of all tasks on a single node. In this section, we use hard errors as a study case to demonstrate the impacts of speculation myopia. We test the default YARN and use the Wordcount benchmark as an example. During each job, we inject a failure of a task-hosting node at different stages of map progress. We avoid crashing the master node or the AM-hosting node because that would fail the job entirely. The detailed experimental setup can be found in Section 4.

2.3.1. Speculation Breakdown in Small Jobs

Fig. 2 shows the execution time of individual jobs. Each dot represents a job and its completion time. The dotted baseline indicates the normal job execution time without failures. First of all, Fig. 2(a) shows the results of 1 GB jobs that encounter a node failure at different points of their map phase. The results are astonishing. Most of the jobs take orders of magnitude longer than the failure-free case, but there is an unrelated issue behind this. YARN needs to clean up the container after a task attempt finishes. If the attempt was running on a failed node, YARN keeps trying to connect to the corresponding NodeManager, which is unavailable, and finally throws a timeout exception after a fixed number of retries, which is decided by the IPC connection configuration settings (ipc.client.connect.*). By default, it takes about 30 minutes to declare a NodeManager failure. During this time, the job will not end successfully, even though both the map and reduce progress may have already reached 100%.
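For reference, these retry settings are ordinary Hadoop configuration keys and can be lowered so that an unreachable NodeManager is declared dead far sooner. The sketch below is a minimal illustration using the stock Configuration API; the exact key names and defaults may differ across Hadoop versions, so treat the chosen values as assumptions rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class IpcRetryTuning {
    // Returns a Configuration whose IPC client gives up on a dead NodeManager
    // after a few seconds instead of retrying for roughly 30 minutes.
    public static Configuration withFastFailingIpc() {
        Configuration conf = new Configuration();
        conf.setInt("ipc.client.connect.max.retries", 3);
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
        conf.setInt("ipc.client.connect.timeout", 5000); // ms per connection attempt
        return conf;
    }
}
```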

This long timeout for the container cleanup is not the sole reason the job performance suffers. Fig. 2(b) shows the result of our control group without the container cleanup issue. We rule out that issue by modifying YARN's default retry policy and then conducting the same set of tests (note that we have also ruled out this issue in the experimental results shown in Section 1 and Section 4). In Fig. 2(b), we can still observe significant performance degradation compared to the running time of a normal job. The culprit here is the issue of converged task stragglers, as discussed earlier. When the node containing all the MapTasks becomes unresponsive, the speculator will not speculate any of those MapTasks but waits 600 seconds until they time out.

However, this delay from the MapTask timeout still explains only a portion of our test cases. We can see that, if the node failure occurs at 40% to 60% of the overall map progress, some jobs end only slightly slower than the no-failure case. This is because, as the map phase proceeds, different MapTasks' progress rates can be uneven, meaning that some MapTasks can be much faster than others, eventually becoming fast enough that the progress variation can trigger the speculation of the slowest task. Thus, when a node failure occurs during this time, MapTasks on the failed node are stalled, but the speculative copies of some MapTasks will continue on other nodes.



[Fig. 2 plots job execution time (s) against the percentage of the map phase at which the failure occurs, with a no-failure baseline: (a) 1GB jobs of original YARN; (b) 1GB jobs of modified YARN; (c) 10GB jobs of modified YARN.]

Fig. 2: Running time of MapReduce jobs in the presence of a node failure at different spots.

As the job proceeds even further, when the progress rates of those speculative copies are large enough, they will in turn trigger the speculation of the other MapTasks on the failed node and let the job proceed normally thereafter. Hence, although the job is still slower than usual, the avoidance of long timeouts results in much better performance than in the other failure tests.

If the node failure occurs at even later stages of the overall map progress, the disadvantage of the second issue we have discussed, i.e., the prospective-only speculation, becomes relevant. As many MapTasks are now completed, the ReduceTasks that are trying to fetch their MOFs will experience fetch failures, since the MOFs are unavailable on the failed node. After the time for fetching a MapTask output exceeds the limit (determined by mapreduce.reduce.shuffle.read.timeout), the MapTask will be declared failed and a new task attempt will be scheduled. But by then the overall job progress has been seriously stalled because the ReduceTasks are idle during that time. To make things worse, if the fetch failures experienced by a single ReduceTask exceed another hard limit, that ReduceTask will also be declared failed and rescheduled. Additionally, if the corresponding MapTasks have not been speculated in time, the rescheduled ReduceTask can experience a second fetch failure and thus be scheduled for a third time.

At the same time, converged task stragglers still appear in this phase, although now they are ReduceTasks rather than MapTasks. Recall that, in the MapReduce workflow, the reduce phase does not require the completion of the entire map phase: ReduceTasks can start executing when one wave of MapTasks finishes. So, in the second half of the map phase, some ReduceTasks have already been launched. If the job has only one ReduceTask (often the case for small jobs) and that ReduceTask is on the failed node, it will certainly not be speculated since it has no other ReduceTask to compare with. The entire job will halt until the ReduceTask times out (600 seconds by default, too). Thus, these jobs (mostly those failing during 50% to 100% of the map phase) also have poor performance.

2.3.2. Speculation Breakdown in Larger Jobs

What if the data size is larger? Now, the effect of converged tasks is eliminated, but the cost of a node failure during the map phase is still significantly high. Fig. 2(c) demonstrates the results of the same test with 10 GB of input. The execution times of most jobs with a failure are nonetheless more than twice the no-failure case. Note that the number of MapTasks is now large enough that they are assigned evenly to different nodes. Thus, the speculator can successfully speculate the MapTasks that reside on the failed node as soon as it detects that they are slower than others. However, the majority of jobs still suffer various degrees of performance degradation. We analyze the causes as follows.

• Converged task stragglers may occur among multiple ReduceTasks, too. We have shown that the failure of only one ReduceTask costs the full 600-second ReduceTask timeout. If a crashed node contains multiple ReduceTasks, the progress of the remaining ReduceTasks may not be slow enough for them to be speculated. In Fig. 2(c), the jobs with more than 600 seconds of execution time are mostly due to this cause.

• The other jobs in Fig. 2(c) that take less than 600 seconds but much longer than the no-failure case suffer from the prospective-only speculation. We can see that even when the input size is larger, the cost of re-executing the completed tasks is still unbearable compared to the normal job execution time.

• The fact that speculative tasks are launched intermittently is also inefficient. In Fig. 2(c), the jobs encountering node failures in the early phase have speculation working successfully, but the tasks on the failed node have to wait in line to be speculated. Depending on the number of affected tasks, those jobs experience various delays in their completion time.

2.4. Issue with Shorter Timeouts

We have shown that the default timeouts are too long for the MapReduce framework to detect failures and can prolong the speculation and failure recovery process. A natural idea is to solve the problem by simply decreasing the timeouts. However, the long timeouts are necessary for MapReduce to adapt to heterogeneous environments, since the networking condition is unknown and may be unstable. If the timeout is too short, tasks could be falsely declared failed when the network is merely experiencing temporary congestion. To investigate the feasibility of short timeouts for MapReduce jobs, we conduct experiments with modified timeout settings. First of all, we changed YARN's timeout for MapTasks and ReduceTasks to 5 seconds and ran the jobs in an unstable network.



[Fig. 3 plots the map and reduce progress (%) over time (s): (a) unstable network; (b) unstable network and failed node.]

Fig. 3: The progress roadmap of map and reduce of jobs using shorter timeouts.

In this network, a lot of networking delays, varying from 1 to 8 seconds, are injected randomly. Fig. 3(a) shows the results of this case. We can see that the progress of both map and reduce is seriously affected. They either stall during the network delays when they need to transfer data (at about 80 s, some ReduceTasks are in the shuffle phase), or even backslide if the delays exceed the timeout and the corresponding tasks are declared failed (at about 130 s). Furthermore, we then inject a failed node into the scene, and the result is shown in Fig. 3(b). Many MapTasks (at about 30 s) are quickly declared failed due to network delays, which causes backslides in the overall map progress. The progress is further impeded by a node failure (at about 100 s), after which one ReduceTask is declared failed immediately, but the overall reduce progress cannot proceed because the ReduceTask needs the MOFs on the failed node. So it keeps fetching those MOFs until a fetch failure is thrown. The ReduceTask continues to request other lost MOFs and undergoes two more fetch failures (290 s and 480 s). Only after those missing MOFs are reproduced by the corresponding speculative copies of the affected MapTasks can the reduce phase continue, and the job completes quickly after that.

The above analysis shows how short timeouts can be affected by network jitter and node failures. We further examine the feasibility of short timeouts in dealing with failure recovery using more choices of timeout length. As discussed in the previous section, failures happening at different progress points have distinct impacts on the job. Thus, we inject failures at 0%, 50% and 100% of the map progress. Fig. 4 shows the average job execution times when tuning the task timeout to different lengths. Note that we do not consider ReduceTask failures here, so the results exclude the jobs where one or more ReduceTasks were hosted on the failed node. The figure shows that shorter timeouts can only help reduce the performance degradation of a failure happening at 0% of map progress, where speculation fails to work because of task convergence. They have limited benefits for failures at 50% because they cannot resolve the performance degradation caused by the prospective-only speculation. Additionally, they have nearly no effect on failures at 100% of map progress. Thus, we conclude that although shorter timeouts can avoid the performance degradation of early failures, they cannot help with later failures and, more importantly, they come with the price that the MapReduce framework would be much less effective at defending against transient faults, which are even more frequent than node failures. We cannot simply rely on timeout tuning to solve the problem. Instead, we have to address the internal limitations of the existing speculation mechanism.
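For completeness, the task timeout varied in Fig. 4 (5, 30, 180 and 600 seconds) corresponds to a single MapReduce configuration property. The snippet below is a minimal sketch assuming the standard mapreduce.task.timeout key (in milliseconds); other timeouts mentioned above, such as the shuffle read timeout, have their own keys.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskTimeoutTuning {
    // Returns a Configuration with the task timeout lowered to the given value.
    // The default of 600000 ms corresponds to the 600-second timeout discussed above.
    public static Configuration withTaskTimeout(long millis) {
        Configuration conf = new Configuration();
        conf.setLong("mapreduce.task.timeout", millis);
        return conf;
    }
}
```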

[Fig. 4 plots the average execution time (s) against the task timeout (5, 30, 180, 600 s) for failures at 0%, 50% and 100% of map progress, with a no-failure baseline.]

Fig. 4: Tuning timeouts.

2.5. Proposed Solution

To restore the broken speculation and accelerate MapReduce failure recovery, we propose a hybrid solution, including a run-time failure analyzer, a new speculation algorithm and a new scheduling policy. The failure analyzer, initiated as a YARN component, detects failure occurrences as early as possible, supplying run-time failure awareness to the YARN speculator. The central design is the new speculation scheme named FARMS, which takes advantage of the failure analytics results and bundles all affected tasks, speculating them in a collective manner. The new scheduling policy incorporates the speculation algorithm, providing fast recovery from node failures while keeping the additional overheads incurred by speculation minimal through an accurate node failure detection method.

3. Design and Implementation

In this section, we present our designs and some important implementation features that tackle the aforementioned issues of the existing speculation during failure recovery.



3.1. Failure Awareness of the YARN Speculator

As discussed before, the breakdown of the existing speculation is rooted in its unawareness of failures. We need the YARN speculator to be aware of failures so it can facilitate failure recovery by launching early speculative copies. YARN has an application history service and configurations such as yarn.api.records that can act as an information bank for post-execution analysis. But it does not feature a service dedicated to run-time failure analysis, nor any efficient association between the speculator and failure analysis. YARN needs a standalone service whose responsibility is to gather only the information about system exceptions/failures and guide the failover. To this end, we have designed and implemented a Centralized Fault Analyzer (CFA) that collects and monitors system anomalies at run-time and provides failure analysis to YARN's speculator for the speculation decision.

Our framework is shown in Fig. 5. The CFA is initialized with the ResourceManager as a CompositeService. The CFA keeps two levels of records: job-level and system-level. At the job level, the CFA keeps a record of job logs such as the job ID, task IDs, container assignments, etc. At the system level, the CFA keeps track of node health status, such as the duration of a node's connection loss. Since the CFA and the ResourceManager are collocated on the same master node, information about the running application can be retrieved from the ResourceManager without going through the network. System-level logs are recorded by the individual components that discover them. The CFA gathers those logs and provides its analytics results back to the speculator, which resides on a slave node along with the AppMaster. The speculator consumes the results and adjusts the speculation accordingly. Because all logs are general information gathered every few seconds (five seconds in our implementation), the size of the logs is trivial. We will demonstrate in Section 4 that the CFA's extra I/O is lightweight and incurs minimal overheads.
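To make the system-level record concrete, the sketch below shows the kind of node-liveness bookkeeping a CFA-like analyzer could keep: the last heartbeat seen from each node and how long the node has been silent. This is an illustrative sketch, not the actual CFA code; class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a system-level liveness record: the last time each
// NodeManager was heard from, and how long it has been silent since.
public class NodeLivenessTracker {
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();

    public void recordHeartbeat(String nodeId) {
        lastHeartbeatMillis.put(nodeId, System.currentTimeMillis());
    }

    // Duration (ms) the node has been unresponsive, or 0 if it was never seen.
    public long unresponsiveFor(String nodeId) {
        Long last = lastHeartbeatMillis.get(nodeId);
        return (last == null) ? 0 : System.currentTimeMillis() - last;
    }
}
```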

A failure of the CFA itself can be a tricky issue. However, our design guarantees its availability upon failure. Because all useful information is stored in HDFS with replicas, we simply rely on the fault resiliency of HDFS itself. If the ResourceManager finds the CFA unresponsive, it restarts it and the new CFA extracts the previous state from HDFS.

[Fig. 5 diagram: the CFA, collocated with the ResourceManager, exchanges information with the task speculator in the AppMaster: (1) failure detection, (2) supplying analytics results, (3) launching speculations on other nodes.]

Fig. 5: Information exchange with the involvement of CFA.

3.2. FARMS

We design a new speculation mechanism that is Failure-Aware, Retrospective and Multiplicative Speculation (FARM-Speculation, or FARMS). Fig. 6 illustrates FARMS, with the existing speculation shown on the left and FARMS on the right. Each small box represents a running task and its brightness indicates the task's progress (a darker box indicates a later phase of a task). In the existing speculation, a straggler is speculated upon periodic progress comparison and its speculative copy is appended to the task scheduling queue. FARMS improves on this traditional design in the following ways.

First of all, the existing speculation's inability to address the aforementioned issues is rooted in its unawareness of failures. Because the speculator only compares task progress at the task level, it never consults node-level status. Thus, a simple node failure can delay the whole job. Our solution is straightforward: in FARMS, we leverage the failure information collected by the CFA, so FARMS knows the association between tasks and their host nodes. When a failed node is detected, FARMS lists all the affected tasks as speculation candidates.

Secondly, in FARMS, we keep completed tasks in the list of speculation candidates. We add state transitions that allow completed tasks associated with a failed node to be speculated. When the speculative task attempts complete, ReduceTasks are notified to fetch MOFs from the new task attempts instead of the original ones. Note that the speculation of completed tasks is not based solely on the successful detection of an unresponsive node; the fetch failure of particular MOFs is also taken into consideration, though with a different speculation granularity. If YARN reports a single fetch failure, we speculate that particular completed task. To avoid unnecessary speculative copies, speculation triggered by an unresponsive node and speculation triggered by an intermediate-file fetch failure are mutually exclusive: once a task is speculated by one cause, it will not be speculated again by the other.

Thirdly, FARMS speculates stragglers in a collective manner, meaning that when we decide to launch speculation upon the detection of a node failure, all the affected tasks can be speculated at once rather than one by one. But we are aware that such speculation can sometimes be costly: if we make a wrong decision about the node's unresponsiveness, there can be many unnecessary speculative tasks, along with unnecessary additional resource consumption. Although we have optimized our decision algorithm (introduced in Section 4.4), we still want to minimize this cost. Thus, we incorporate a multiplicative speculation mechanism into FARMS. Upon the detection of an unresponsive node, the number of tasks to speculate increases exponentially. The condition for continuing to make speculative copies is contingent on the liveness of the corresponding node. For example, if one node is unresponsive, we first speculate 2 tasks. We monitor the progress of the problematic tasks, and if they remain slow or unresponsive, we speculate another 4 tasks.
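The sketch below illustrates this multiplicative, collective speculation over the affected tasks of a suspect node; batch sizes grow 2, 4, 8, ... as long as the node remains unresponsive. It is an illustration of the idea rather than the authors' implementation, and the scheduler hook is hypothetical.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative sketch of multiplicative speculation: speculate affected tasks
// in growing batches (2, then 4, then 8, ...) while the suspect node stays silent.
public class MultiplicativeSpeculator {

    public void speculateCollectively(List<String> affectedTaskIds,
                                      BooleanSupplier nodeStillUnresponsive,
                                      long pollIntervalMillis) throws InterruptedException {
        int batch = 2;
        int next = 0;
        while (next < affectedTaskIds.size() && nodeStillUnresponsive.getAsBoolean()) {
            int end = Math.min(next + batch, affectedTaskIds.size());
            for (String taskId : affectedTaskIds.subList(next, end)) {
                launchSpeculativeCopy(taskId); // hypothetical hook into the task scheduler
            }
            next = end;
            batch *= 2;                        // exponential growth in batch size
            Thread.sleep(pollIntervalMillis);  // re-check liveness before the next batch
        }
    }

    private void launchSpeculativeCopy(String taskId) {
        System.out.println("launching speculative copy of " + taskId);
    }
}
```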

3.3. Fast Analytics Scheduling

Finally, we propose a new scheduling procedure for failures based on FARMS. We name it the Fast Analytics Scheduling (FAS).



[Fig. 6 diagram of tasks T1-T6 running on nodes N1 and N2: (a) existing speculation, where a single straggler T1 is chosen for speculation and gets one copy T1'; (b) FARMS, where node N1 is chosen for speculation and all of its tasks get copies T1', T2', T3'.]

Fig. 6: Comparison between existing speculation and FARMS.

As mentioned before, there is an important trade-off between speeding up failure detection and keeping resource consumption low. In FAS, we use a dynamic threshold to determine whether a failure should be speculated. If a node has been unresponsive for longer than the threshold, FAS deems this a positive result; otherwise, it is a negative result. A positive result indicates a failed node, and the tasks on that node are added to the list of speculation candidates. A negative result is tolerated, and we wait to observe the node's further responsiveness.

Before we go into details, the design choices of the scheduling policy need to be sorted out. In general, the algorithm should meet the following principal requirements.

(i) The failure detection should, in most cases, decrease the job execution time.

(ii) The decision should be as accurate as possible, avoiding too much unnecessary additional resource consumption.

(iii) Even when the detection is wrong, its impact on job performance should be trivial.

(iv) The algorithm should fit into the real-world environment.

To meet (i), we need to keep our speculation aggressive enough to gain some performance improvement. That means the threshold cannot be too large. Otherwise, a large threshold for node failure would cause a similar issue as the long task timeouts, which, as we have demonstrated before, can greatly hurt performance. However, to meet (ii), we need the threshold to be dynamically adjusted according to the specific conditions of the running environment. Thus, we have to limit the number of positive results and, with it, the number of speculative tasks imposed on the job execution. Then, to meet (iii), our speculation needs to be cautious in spawning speculative tasks. This is accomplished by the multiplicative speculation mechanism, as discussed before. Finally, we need to tune the key parameters in our algorithm to meet requirement (iv). By doing so, it can have optimal performance when we deploy it with real-world MapReduce systems. These principles indicate that, to balance the trade-off between the effectiveness and efficiency of speculation, the threshold used to determine an unresponsive node is crucially important. For that matter, we introduce a Temporal Window Smoothing algorithm, which assigns varying weights to a node's recent periods of unresponsiveness. This algorithm is detailed in the following section.

3.3.1. Temporal Window Smoothing

To decide the threshold used in the FAS procedure (Fig. 7), we need to consider the recent unresponsiveness of the node. A node is down if it has lost its connection permanently. But as discussed in the previous section, we need to detect the anomaly in a timely manner to conduct proper speculation. A plausible approach is to average some number of recent node disconnection durations. But this approach overlooks the unstable environment the node may reside in. Network latency and throughput do not remain at a plateau; they can be slow for some time and become fast later. Thus, instead of using the average value [18], we adopt a Temporal Window Smoothing mechanism that takes the historical node unresponsiveness into account with varying weights, where earlier unresponsiveness has less impact on the determination of the threshold.

To be more specific, in order to capture the temporal locality between the last L failures and the next failure at node i, we define the length of our smoothing window as L. We then use $R_n$ to represent the real unresponsive time duration of the last failure at node i, while $P_{n+1}$ denotes the predicted unresponsive time duration for the next failure. Given any node i and a smoothing window of length L, $P_{n+1}$ can be estimated as follows:

\[
P_{n+1} = \frac{\sum_{k=1}^{L} \left( 2^{\,L+1-k} \times R_{n+1-k} \right)}{\sum_{k=1}^{L} 2^{k}} \tag{1}
\]

To give an example, if we set L to five, given the last five unresponsive durations of a node i, denoted as $R_{n-4}$, $R_{n-3}$, $R_{n-2}$, $R_{n-1}$ and $R_n$, the threshold should be set as follows:

\[
P_{n+1} = \frac{2^{1} R_{n-4} + 2^{2} R_{n-3} + 2^{3} R_{n-2} + 2^{4} R_{n-1} + 2^{5} R_{n}}{2^{1} + 2^{2} + 2^{3} + 2^{4} + 2^{5}} \tag{2}
\]

The parameter L can be tuned based on the trade-off between prediction accuracy and computing overhead, i.e., the extent to which historical behavior should affect the predicted future responsiveness. The base (in this case, 2) can also be tuned based on the characteristics of the nodes. A higher base should be set if the next period of unresponsiveness is more temporally related to the last few failures.
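As a direct transcription of Eq. (1), the sketch below computes the smoothed prediction from the last L observed unresponsive durations, with the most recent sample weighted 2^L and the oldest weighted 2^1. The assumption that durations are kept in milliseconds is ours, and the class and method names are illustrative.

```java
// Temporal-window-smoothed threshold from Eq. (1): `recent` holds the last L
// observed unresponsive durations, oldest first (R_{n-L+1} .. R_n).
public final class TemporalWindowSmoothing {
    public static double predictNextUnresponsiveMillis(double[] recent) {
        int L = recent.length;
        double weighted = 0.0;
        double weightSum = 0.0;
        for (int j = 0; j < L; j++) {            // j = 0 is the oldest sample
            double weight = Math.pow(2, j + 1);  // oldest -> 2^1, newest -> 2^L
            weighted += weight * recent[j];
            weightSum += weight;
        }
        return weighted / weightSum;
    }
}
```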

3.3.2. Overall FAS Procedure

The procedure of FAS is illustrated in Fig. 7. We take a simple heuristic approach that guides the speculation of FARMS to recover jobs from failures. As discussed before, we use temporal window smoothing to deduce a failed node. If a positive result is correct (i.e., the node has failed), all the computations need to be redone anyway, so those early speculative tasks should not be considered overhead. But in the case of a false positive (i.e., the node has not failed), the speculative tasks may cause some overhead. However, those speculative tasks still act as competitors of the original tasks on the temporarily unresponsive node, and the two groups of tasks can compete for completion. If the original node resumes too late, the speculative tasks can still help the job progress faster. If the node resumes soon, we kill the speculative tasks and adjust the threshold accordingly. We examine the overheads that come with false results in Section 4.

[Fig. 7 diagram elements: tasks start on a node; node unresponsive; speculative copies; node resumes; tasks complete; feedback; new threshold.]

Fig. 7: Workflow of FAS.

Intuitively, it seems good to blacklist nodes that experience frequent failures so that the majority of jobs are free from node failures. However, this approach is not sufficient to mitigate the performance degradation caused by node failures. The current blacklisting mechanism rules out problematic nodes through periodic system checks when no jobs are running, so the next job does not use the blacklisted nodes. Thus, blacklisting may reduce the total number of failure occurrences in the long term, but it cannot prevent failures from happening. Even if we blacklist nodes at run-time, i.e., rule out a failed node so that subsequent tasks are not scheduled on it, it does not help either, because the node failure has probably already caused stragglers and the corresponding performance loss.

4. Experimental Evaluation

4.1. Experiment Environment

Hardware Environment: Our experiments are conducted on two private clusters. The first is a cluster of 21 server nodes connected through 1 Gigabit Ethernet. Each machine is equipped with four 2.67 GHz hex-core Intel Xeon X5650 CPUs, 24 GB of memory and one 500 GB hard disk. The second also has 21 nodes. Each node features dual sockets, 10 Intel Xeon(R) cores and 64 GB of memory. The nodes are connected through a 10 Gigabit Ethernet interconnect.

Software Environment: We use YARN 2.6.0 as the code base with JDK 1.7. One node of the cluster is dedicated to running YARN's ResourceManager and HDFS's NameNode. The key parameters of the whole software stack are listed in Table 1, along with their tuned values. To minimize data usage, we use 2 replicas, which is the minimum needed to recover lost data after a node failure. The minimum and maximum memory allocations (yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb) decide how many containers can be launched on one node; more containers mean more tasks are affected by a node failure.

Table 1: List of key YARN configuration parameters.

Parameter Name                              Value
mapreduce.map.java.opts                     1536 MB
mapreduce.reduce.java.opts                  4096 MB
mapreduce.task.io.sort.factor               100
dfs.replication                             2
dfs.block.size                              128 MB
io.file.buffer.size                         8 MB
yarn.nodemanager.vmem-pmem-ratio            2.1
yarn.scheduler.minimum-allocation-mb        1024 MB
yarn.scheduler.maximum-allocation-mb        6144 MB

Benchmarks: We have selected a wide range of representative MapReduce benchmarks from two MapReduce benchmark suites. The first is the set of built-in benchmarks of YARN, including Terasort, Wordcount, and Secondarysort. The other is the well-known HiBench MapReduce benchmark suite v5.0 [21]. Developed by Intel, HiBench includes a wide range of MapReduce benchmarks emphasizing different MapReduce characteristics, such as map-heavy (K-means, Wordcount, Scan, etc.) and reduce-heavy (Join, Terasort, Pagerank, etc.). For experiments that do not mention the benchmark type, we use the built-in Wordcount benchmark (Section 4.2.2, Section 4.4, and the figures in Sections 1 and 2).

Performance metrics: We measure the job execution time for the performance evaluation of FARMS. To evaluate the FAS algorithm, we use the job execution time plus the Additional Tasks Rate, which is the number of speculative tasks that should not have been launched, i.e., whose original task attempts are not on a failed node. To inject a node failure in a test, we simply kill all Java processes on that node. We compare FARMS against the original YARN, which adopts the LATE scheduler [36] as its default speculator. Throughout the results, we use YARN to denote the original YARN and Ours to denote our framework.

4.2. FARMS Evaluation

We examine the effectiveness of FARMS in tackling the speculation breakdown and performance degradation under node failures. Since a node failure has very different impacts on the job execution (Section 2) depending on the job size and the time of failure occurrence, we conduct different sets of experiments with varying job sizes and timing of failure injection. The first set consists of small jobs with 1 GB of input data. The second set consists of medium jobs with 10 GB of input data. The third set consists of very large jobs with 100 GB and 1 TB of input data. For small jobs, we run different benchmarks and crash a node that hosts the MapTasks at 10 different spots during the job's map phase. For medium jobs, we crash a random node and inject the failure at various time spots. For even larger jobs, the performance difference between failure spots is less pronounced, so we report only one of them.

Fig. 8 and Fig. 9 show the performance comparison between the original YARN and our framework for node failures happening at different map progress spots. At each spot, we run the test five times and report the average, with the highest and lowest execution times shown as error bars. Since the prolonged finishing delays caused by YARN's retry policy are too large and can be avoided by simple reconfiguration, we have precluded that issue in our experiments by modifying YARN's default retry policy, and we still regard this as the "Original YARN" case.

From the figures, it is clear that for small jobs, the performance improvement is striking. FARMS speeds up the job execution time by almost an order of magnitude. It even manages to keep the job completion time comparable with the no-failure case. For medium jobs, it also reduces the job delay significantly, although by a smaller factor. Moreover, the original YARN has very distinct performance at different failure spots because when the spots differ, the causes for delay also differ (Section 2). As shown by the error bars, FARMS smooths out the variation. Next, we discuss the performance variation in more detail.

4.2.1. Performance Variation

As mentioned before, we plot the highest and lowest execution times with error bars in Fig. 8 and Fig. 9. As shown in the figures, a node failure can have distinct impacts even at the same spot of failure occurrence. For small jobs, the variations are more obvious during the middle phase of the job than during the initial or later phases. This is because, as a job approaches the halfway point of its map progress, both types of speculation issues discussed in Section 2 can take effect. Additionally, there is also a possibility that neither of them occurs, which is reflected by some results that have only marginally larger execution time than the no-failure case. More possible causes of delay lead to larger performance variation. In contrast, our optimization eliminates all the issues of the existing speculation in handling failures. The experimental results show insignificant variation in job execution times, which provides consistency and predictability for job executors in real-world deployments.

For medium jobs, the performance penalty caused by failures is much smaller, but the variation is even more pronounced. Here, since there is no issue of task convergence on one single node, the jobs no longer suffer from converged task stragglers and are thus able to avoid the long task timeout. This results in some test cases having only slightly worse performance than the no-failure case. However, many jobs still suffer from the long wait for lost MOFs on the failed node, which appears earlier in the map phase than for small jobs. This is because, for medium jobs, many MapTasks finish a lot faster than others. Such imbalance in MapTask progress means that a node failure is likely to cause MOF loss at an earlier map stage. Thus, the variation between the highest and lowest execution times remains large and, because the performance loss is smaller, appears more significant. In contrast, our optimization can always discover the failure and recover the lost MOFs within a fixed period of time and hence manages to keep the variation insignificant.

Table 2: Job execution time of 100 GB jobs (in seconds).

               Terasort   Wordcount   Secondarysort
No failure        597        318          1338
Failure-YARN      678        489          1445
Failure-Ours      617        355          1344

Table 3: Job execution time of 1 TB jobs (in seconds).

               Terasort   Wordcount   Secondarysort
No failure       5911       2175          7669
Failure-YARN     6078       2352          7949
Failure-Ours     5922       2199          7611

4.2.2. Performance of Even Larger Jobs

Besides gaining a big performance improvement for small and medium jobs, we also want to evaluate our framework on very large jobs to see whether they still benefit from the optimization. To that end, we conduct the same tests but increase the input sizes to 100 GB and 1 TB. We observe that for those jobs, the variation in performance at different failure spots is not as obvious as for the small jobs. This is because in large jobs, a failure at any spot of the map phase almost always results in some MOF loss. Thus, we only report one result for each case, namely the job having a failure at 50% of its map progress. The results are shown in Tables 2 and 3. Among these experiments, for jobs with relatively shorter execution time, e.g., the Wordcount jobs with both 100 GB and 1 TB of input, our optimization can greatly reduce the performance degradation caused by node failure and achieve performance comparable with the failure-free case. For other jobs that take much longer (e.g., Terasort and Secondarysort jobs with 1 TB of input), the performance degradation is less significant, and so is the improvement from our optimization.



[Fig. 8 plots job execution time (s) against the failure spot on the map phase (10% to 100%) for Original YARN, Ours and No Failure: (a) Terasort; (b) Wordcount; (c) Secondarysort.]

Fig. 8: Failure recovery of 1GB job.

[Fig. 9 plots job execution time (s) against the failure spot on the map phase (10% to 100%) for Original YARN, Ours and No Failure: (a) Terasort; (b) Wordcount; (c) Secondarysort.]

Fig. 9: Failure recovery of 10GB job.

[Fig. 10 plots the execution time (s) of each HiBench benchmark for four cases: No failure - Orig, No failure - Ours, Failure - Orig, Failure - Ours.]

Fig. 10: Results of HiBench suite.

Moreover, the results also clearly show that our optimization does not hurt the performance of very large jobs. The additional speculative tasks launched by early speculation may cause some overhead during the early phase. But as discussed before, those tasks are needed regardless of whether the optimization is used, and it is better to execute them earlier rather than later.

4.3. Evaluation with HiBench

We have deployed the HiBench benchmarks with their default configurations, e.g., input size, degree of data parallelism, etc. Note that in HiBench, each benchmark consists of multiple MapReduce jobs with complex data dependencies between the jobs. A random failure may cause undesirable outcomes such as the loss of data needed by dependent jobs, which fails the entire benchmark test. Thus, we uniformly inject the failure at 50% of the map progress of the last job in each benchmark. Fig. 10 shows the results of the HiBench experiments. It shows that for all benchmarks, our framework is able to boost the performance in the failure case. Moreover, for map-heavy jobs such as K-means and Wordcount, failures cause far more disastrous performance degradation than for reduce-heavy jobs, and our optimization provides performance that is only slightly worse than the failure-free case. But even for the reduce-heavy jobs such as Join and Sort, our framework is still capable of quickly recovering from the failure and boosting job performance notably. Note that we have also tested our framework with HiBench in failure-free scenarios. The results show that our framework adds negligible overhead to failure-free jobs. This shows that FARMS can handle the occurrence of failures without disturbing normal job execution. The failure decision algorithm in FAS does not impose much additional overhead in a stable environment. To further evaluate the correctness and overheads of FAS in an unstable environment, we present the corresponding evaluation of FAS in the next section.

4.4. FAS Evaluation

To see whether FAS can adapt to real-world environments where transient network congestion is common, we generate several experimental cases with a variety of settings. First, we test our framework against a single network delay or a single failed node. We inject a network delay by holding packets for a time duration t, where t is drawn from a Poisson distribution. We vary the mean delay, denoted as λ, and generate a number of random delays. We then conduct the same set of tests as in the previous section, but with those delays injected. The node failure is injected at a certain rate, set empirically according to the real-world statistics from [14, 12]. From those reports, the average number of node failures per job is 5 and the average number of nodes in one job is 268. In our experiment, we use 21 machines.
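To make the delay injection concrete, below is a minimal sketch of how such Poisson-distributed delay durations could be generated. It is our own illustration under the stated assumptions: the helper name and the use of NumPy are not from the paper, and applying the delays to actual packets is left to an external traffic-shaping tool.

```python
import numpy as np

def generate_delays(lam, count, seed=None):
    """Draw `count` packet-delay durations (in seconds) with mean `lam`.

    Hypothetical helper: the paper only states that the injected delays
    follow a Poisson distribution with average lambda; this sampling code
    is our sketch of that step, not the authors' tooling.
    """
    rng = np.random.default_rng(seed)
    return rng.poisson(lam=lam, size=count)

# Example: one experiment set with an average delay of 15 s and twenty runs.
delays = generate_delays(lam=15, count=20, seed=0)
```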



Fig. 11: Using FAS for fast and efficient recovery against network delays and node failures. [(a) Only one faulty node: job execution time (s) and additional speculated-task rate versus λ; (b) Multiple faulty nodes: average execution time (s) and additional speculated-task rate versus the number of nodes; series: Time-Original YARN, Time-Ours, Additional tasks.]

Thus, we set the probability of a node failure per run to 0.4 and the probability of any particular node failing to 0.02 (5/268 ≈ 0.02 per node; scaled to our 21 machines, 21 × 0.02 ≈ 0.4 per run). We conduct one set of experiments for each value of λ, and for each set we randomly inject either a network delay or a node failure in each test. We run twenty tests per set and report the average. The maximum λ in our experiments is 30; we do not set λ much larger, because a sufficiently long delay becomes indistinguishable from a node failure. To normalize our tests and make the results comparable, both the delay and the failure occur at the same phase of the job.
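The following sketch illustrates how a single test run could pick its injected fault under these settings. It is our reconstruction for illustration only, not the authors' harness; the function and constant names are our own assumptions.

```python
import numpy as np

NUM_NODES = 21
P_FAIL_PER_RUN = 0.4  # per-run failure probability (about 21 nodes x 0.02 each)

def choose_injection(lam, seed=None):
    """Decide what to inject in one test run (sketch under stated assumptions).

    With probability 0.4 the run suffers a node failure on a randomly chosen
    node; otherwise a Poisson(lam) network delay is injected on a random node.
    """
    rng = np.random.default_rng(seed)
    node = int(rng.integers(NUM_NODES))
    if rng.random() < P_FAIL_PER_RUN:
        return ("node_failure", node, None)
    return ("network_delay", node, int(rng.poisson(lam)))

# Example: twenty runs of one experiment set with an average delay of 15 s.
runs = [choose_injection(lam=15, seed=s) for s in range(20)]
```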

The same experimental setup is used for both the original YARN and ours. Since our evaluation focuses on both job performance and resource consumption, we plot the average job running time along with the rate of wrongfully speculated tasks per job for every experimental setup. We use only the Wordcount benchmark with 10 GB of input for this experiment.

Fig. 11 shows the results of the experiments introduced above. Fig. 11(a) covers the first case, which has only one faulty node, i.e., it experiences either a network delay or a node failure. The figure shows that FAS delivers varying performance improvement over the original YARN as λ varies. The variation comes from the uncertainty in whether a node failure actually occurs. However, even when there is no node failure, our framework still outperforms the original YARN because the impact of network delays is offset by early speculation. Additionally, the rate of additional speculative tasks incurred by network delays remains very low overall.

It is understandable that a single faulty node does not cause much trouble: even if we speculate on it every time regardless of its real status, little excess overhead is imposed on the system. Thus, we conduct another evaluation that injects network delays and/or node failures on multiple nodes. The results are shown in Fig. 11(b). Our framework achieves an even more significant performance improvement over the original YARN. The larger the number of faulty nodes, the greater the improvement, because we save more of the cost caused by delays and failures. Although our additional speculative-task rate increases slightly as the number of delays/failures grows, it still remains at an acceptable level.

4.5. Overall Evaluation

The evaluation of FARMS demonstrates its advantage in handling node failures, and the evaluation of FAS demonstrates the effectiveness of our node-failure detection mechanism. We further examine the overall performance of FARMS+FAS for different benchmarks. We referenced [4] to set the job sizes, as shown in Table 4.

Table 4: Ratio of test group in data size.

Group   Size     Ratio
1       1 GB     85%
2       10 GB    8%
3       50 GB    5%
4       100 GB   2%

Then, we inject various types of failures and exceptions, e.g., task failures, node crashes, and network delays, each with a frequency drawn from real-world experience with MapReduce systems [13, 12, 14, 30]. We conduct tests with the exact same setup (job groups, failure injection, and intervals) for both the original YARN and ours. Fig. 12 shows the results of our overall evaluation. Combining FARMS and FAS provides performance that is almost comparable to no-failure YARN. For smaller jobs, which are largely unaffected by node failures, all three cases are similar, but ours slightly outperforms the original YARN with failures and, surprisingly, is even slightly better than the original YARN without failures. This shows how the aggressiveness of FARMS can help small jobs speed up their turnaround times. For larger jobs, which are more often affected by node failures, the original YARN performs badly under failure, while ours keeps its performance comparable to the no-failure case. Thus, although we cannot gain much improvement for large jobs, the FARMS+FAS implementation at least does not hurt their performance.
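For illustration, here is a minimal sketch of how such a mixed workload could be drawn. The group ratios come from Table 4; the sampling helper itself is our own assumption, not the authors' scripts.

```python
import random

# Job-size groups and their share of the workload, following Table 4.
GROUPS = [("1 GB", 0.85), ("10 GB", 0.08), ("50 GB", 0.05), ("100 GB", 0.02)]

def sample_job_sizes(num_jobs, seed=None):
    """Draw job input sizes so that, in expectation, they match the Table 4 ratios."""
    rng = random.Random(seed)
    sizes = [size for size, _ in GROUPS]
    weights = [ratio for _, ratio in GROUPS]
    return rng.choices(sizes, weights=weights, k=num_jobs)

# Example: a 100-job mix dominated by 1 GB jobs, as in the overall evaluation.
workload = sample_job_sizes(num_jobs=100, seed=1)
```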

5. Related Work

The speculation mechanism was introduced with the initial versions of many representative parallel computing paradigms, such as MapReduce [13] and Dryad [22]. Since then, it has been actively studied from a variety of viewpoints [36, 6, 5, 3].



Fig. 12: Overall performance improvement. [Time (s) per data-size group (#1 1GB, #2 10GB, #3 50GB, #4 100GB) for No Failure, Failure (Original YARN), and Failure (FARMS+FAS).]

But we find that none of these works addresses the deficiency of speculation in handling failures, as discussed in this paper.

To name a few prior efforts, the LATE [36] scheduler takes node heterogeneity into account. It deliberately places speculative copies on fast nodes rather than slow ones, but the question of when to speculate on failure-related stragglers remains unsolved. Also, its intermittent speculative strategy can cause a significant amount of performance loss upon node failure, because the job proceeds only after all speculative tasks are completed. Mantri [6] searches for the causes of stragglers and designs its optimized speculation algorithm based on the straggler categories. It identifies in part the impact of failure-related stragglers; however, it considers only data recomputation as the worst outcome of a failure, without addressing the delayed execution of speculation upon failure. GRASS [5] improves the speculation performance of error-bound and deadline-bound approximation jobs by using two distinct scheduling strategies, namely Greedy Speculative and Resource Aware Speculative scheduling. But neither strategy serves failure cases. Among the studies of speculation, DOLLY [3] has the most similar research motivation but a very different focus and approach compared with our work. It is motivated by the straggler problem of small jobs in the MapReduce framework. However, unlike our paper, which reveals the relation between failures and stragglers, its focus is on the general performance impact of stragglers. They demonstrate that aggressively launching a clone for every task is a good way to ameliorate the performance degradation that stragglers may impose on MapReduce applications. Although their design could also help mitigate the performance degradation from node failures found in this paper, it has an obvious downside: cloning every task incurs much more unnecessary resource consumption and network overhead, especially for a shared MapReduce cluster that is already heavily loaded, as discussed in [11, 28, 32]. In addition, that extra overhead is incurred in every job execution, regardless of whether the nodes are faulty, merely delayed, or perfectly healthy. Without handling failures specifically, relying on such aggressive speculation for fault recovery is impractical.

Besides speculation, our work also touches on the issue of MapReduce's fault tolerance. Existing efforts in this area include analyzing code bugs to prevent failures [20, 35, 19], localizing failures promptly and accurately [24], and enhancing data placement to achieve higher data availability [10]. Although failure resiliency has received much attention, we must be clear that strong failure resiliency does not imply optimal job performance. Failures can significantly degrade job turnaround time even if the job eventually completes successfully, as shown in this paper. Studies such as [25, 16, 15, 34], along with our work, have revealed that because failures are the norm rather than the exception in real-world production deployments, recovering speedily from failures is also essential. Similar to our work, Piranha [17] also recognizes the delays of small jobs in the Hadoop framework, but it focuses more on scheduling optimization.

Quiane-Ruiz et al. [25] introduce RAFTing MapReduce, which preserves the computation status of MapTasks and replicates the MOFs to the reduce side. This design avoids recomputing MapTasks on the failed node, but it requires pre-assignment of ReduceTasks and incurs additional network overhead. Moreover, it addresses only one negative factor of node failure, the loss of MOFs, without handling failures that happen early in the map phase or examining how failures are detected. Thus, it would still suffer from the performance degradation discussed in this paper and have problems dealing with real-world failure scenarios. However, we do think that the idea of conserving MapTask output may also benefit speculation by avoiding unnecessary recomputation.

Dinu et al. [16] conduct a comprehensive study on the impact of node failures in the MapReduce model. They reveal that a single node failure can significantly degrade the performance of MapReduce applications. Specifically, they find that the failure of a node containing ReduceTasks can infect other healthy tasks and nodes, causing drastic performance degradation. Our previous work [34] revealed similar issues, which we refer to as "failure amplification", and, more importantly, also provides techniques to address them. But neither work looks into failures occurring in the map phase. Dinu's subsequent work RCMP [15] studies how to conduct recomputation upon failures at the job level. Our paper is orthogonal to those works: it addresses map-phase failures and leverages an optimized speculation mechanism to expedite job performance at the task level.

6. Conclusion and Future Work

In this paper, we have detailed issues of the existing speculation mechanism that have long been neglected in the representative implementation of the MapReduce model, i.e., YARN. We have revealed that the existing speculation has fundamental flaws in failure recovery for shorter jobs, leading to serious job execution delays. We have presented a comprehensive study of how those issues cause the breakdown of the existing speculation in the presence of node failures. Based on these findings and their implications, we propose a new speculation mechanism called FARMS, together with a refined scheduling policy that leverages it. We have implemented the framework and evaluated it through an extensive set of experiments. The experimental results show that our framework achieves dramatic performance improvement in handling node failures compared with the original YARN and adapts very well to an unstable running environment.



In the future, we plan to further explore the inefficiency of speculation, especially during the reduce phase. We also plan to incorporate a proper work-conserving mechanism for speculation.

Acknowledgments

We are thankful to the anonymous reviewers for their insightful comments. This work is funded in part by National Science Foundation awards 1561041 and 1564647.

References

[1] Apache Hadoop NextGen MapReduce (YARN). http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
[2] Apache Hadoop project. http://hadoop.apache.org/.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.
[4] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica. PACMan: Coordinated memory caching for parallel jobs. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 20–20. USENIX Association, 2012.
[5] G. Ananthanarayanan, M. C.-C. Hung, X. Ren, I. Stoica, A. Wierman, and M. Yu. GRASS: Trimming stragglers in approximation analytics. In Proc. of the 11th USENIX NSDI, 2014.
[6] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 1–16, Berkeley, CA, USA, 2010. USENIX Association.
[7] R. Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, and A. Rowstron. Scale-up vs scale-out for Hadoop: Time to rethink? In Proceedings of the 4th Annual Symposium on Cloud Computing, page 20. ACM, 2013.
[8] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding data center traffic characteristics. ACM SIGCOMM Computer Communication Review, 40(1):92–99, 2010.
[9] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12):1802–1813, 2012.
[10] A. Cidon, R. Escriva, S. Katti, M. Rosenblum, and E. G. Sirer. Tiered replication: A cost-effective alternative to full cluster geo-replication. In Proceedings of the 2015 USENIX Annual Technical Conference, pages 31–43. USENIX Association, 2015.
[11] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI'10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association.
[12] J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT, volume 6, pages 1–1, 2006.
[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation, OSDI '04, pages 137–150, San Francisco, California, USA, 2004. USENIX Association.
[14] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[15] F. Dinu and T. Ng. RCMP: Enabling efficient recomputation based failure resilience for big data analytics. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pages 962–971. IEEE, 2014.
[16] F. Dinu and T. E. Ng. Understanding the effects and implications of compute node related failures in Hadoop. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, pages 187–198, New York, NY, USA, 2012. ACM.
[17] K. Elmeleegy. Piranha: Optimizing short jobs in Hadoop. Proceedings of the VLDB Endowment, 6(11):985–996, 2013.
[18] H. Fu, Y. Zhu, and W. Yu. A case study of MapReduce speculation for failure recovery. In Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems, page 7. ACM, 2015.
[19] H. S. Gunawi, M. Hao, T. Leesatapornwongsa, T. Patana-anake, T. Do, J. Adityatama, K. J. Eliazar, A. Laksono, J. F. Lukman, V. Martin, et al. What bugs live in the cloud?: A study of 3000+ issues in cloud systems. In Proceedings of the ACM Symposium on Cloud Computing, pages 1–14. ACM, 2014.
[20] H. S. Gunawi, C. Rubio-Gonzalez, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and B. Liblit. EIO: Error handling is occasionally correct. In FAST, volume 8, pages 1–16, 2008.
[21] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In New Frontiers in Information and Software as Services, pages 209–228. Springer, 2011.
[22] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007.
[23] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An analysis of traces from a production MapReduce cluster. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), pages 94–103. IEEE, 2010.
[24] R. N. Mysore, R. Mahajan, A. Vahdat, and G. Varghese. Gestalt: Fast, unified fault localization for networked systems. In Proc. USENIX ATC, 2014.
[25] J.-A. Quiane-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 589–600, Washington, DC, USA, 2011. IEEE Computer Society.
[26] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, page 7. ACM, 2012.
[27] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[28] J. Tan, X. Meng, and L. Zhang. Delay tails in MapReduce scheduling. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 5–16, New York, NY, USA, 2012. ACM.
[29] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 5:1–5:16, New York, NY, USA, 2013. ACM.
[30] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 193–204. ACM, 2010.
[31] H. Wang, Q. Jing, R. Chen, B. He, Z. Qian, and L. Zhou. Distributed systems meet economics: Pricing in the cloud. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 6–6. USENIX Association, 2010.
[32] Y. Wang, J. Tan, W. Yu, X. Meng, and L. Zhang. Preemptive ReduceTask scheduling for fair and fast job completion. In Proceedings of the 10th International Conference on Autonomic Computing, ICAC '13, June 2013.
[33] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[34] Y. Wang, H. Fu, and W. Yu. Cracking down MapReduce failure amplification through analytics logging and migration. In 29th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS 2015), Hyderabad, India, May 2015.
[35] D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. U. Jain, and M. Stumm. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 249–265, 2014.
[36] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, pages 29–42, Berkeley, CA, USA, 2008. USENIX Association.
