
MOON: MapReduce On Opportunistic eNvironments

Heshan Lin∗, Jeremy Archuleta∗, Xiaosong Ma†‡, Wu-chun Feng∗, Zhe Zhang† and Mark Gardner∗

∗Department of Computer Science, Virginia Tech
{hlin2, jsarch, feng}@cs.vt.edu, [email protected]

†Department of Computer Science, North Carolina State University
[email protected], [email protected]

‡Computer Science and Mathematics Division, Oak Ridge National Laboratory

Abstract—MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, it results in poor performance due to the volatility of the resources, in particular, the high rate of node unavailability.

Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement to Hadoop in volatile, volunteer computing environments.

I. INTRODUCTION

The maturation of volunteer computing systems with multi-core processors offers a low-cost resource for high-performance computing [1], [2], [3], [4]. However, these systems offer limited programming models and rely on ad-hoc storage solutions, which are insufficient for data-intensive problems. MapReduce is an effective programming model that simplifies large-scale parallel data processing [5], but has been relegated to dedicated computing resources found in high-performance data centers.

While the union of MapReduce services with volunteer computing systems is conceptually appealing, a vital issue needs to be addressed – computing resources in desktop grid systems are significantly more volatile than in dedicated computing environments. For example, while Ask.com's per-server unavailability rate is an astonishingly low 0.000455 [6], availability traces collected from an enterprise volunteer computing system [7] showed a more challenging picture: individual node unavailability rates average around 0.4, with as many as 90% of the resources simultaneously inaccessible (Figure 1). Unlike in dedicated systems, software/hardware failure is not the major contributor to resource volatility on volunteer computing systems: nodes that are shut down at their owners' will become unavailable. Also, typical volunteer computing frameworks such as Condor [1] consider a computer unavailable for running external jobs whenever keyboard or mouse events are detected. In such a volatile environment it is unclear how well existing MapReduce frameworks perform.

Fig. 1. Percentage of unavailable resources measured in a 7-day trace from a production volunteer computing system at the San Diego Supercomputing Center [7]. The trace of each day was collected from 9:00 AM to 5:00 PM. The average percentage unavailability is measured in 10-minute intervals.

In this work, we evaluated Hadoop, a popular, open-source MapReduce runtime system [8], on an emulated volunteer computing system and observed that the volatility of opportunistic resources creates several severe problems. First, the Hadoop Distributed File System (HDFS) provides reliable data storage through replication, which on volatile systems can have a prohibitively high replication cost in order to provide high data availability. For instance, when the machine unavailability rate is 0.4, eleven replicas are needed to achieve 99.99% availability for a single data block, assuming that machine unavailability is independent. Handling large-scale correlated resource unavailability requires even more replication.
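The eleven-replica figure follows directly from the independence assumption: a block with r replicas on nodes of unavailability p is itself unavailable with probability p^r, so the minimum replication degree for a target availability A is

    r = ⌈ ln(1 − A) / ln p ⌉ = ⌈ ln(0.0001) / ln(0.4) ⌉ = ⌈10.05⌉ = 11

for p = 0.4 and A = 0.9999.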

Second, Hadoop does not replicate intermediate results (the output of Map tasks). When a node becomes inaccessible, the Reduce tasks processing intermediate results on this node will stall, resulting in Map task re-execution or even livelock.

Third, Hadoop task scheduling assumes that the majority of the tasks will run smoothly until completion. However, tasks can be frequently suspended or interrupted on volunteer computing systems. The default Hadoop task replication strategy, designed to handle failures, is insufficient to handle the high volatility of volunteer computing platforms.

To mitigate these problems and realize the computing potential of MapReduce on volunteer computing systems, we have created a novel amalgamation of these two technologies to produce MOON, "MapReduce On Opportunistic eNvironments". MOON addresses the challenges of providing MapReduce services within the opportunistic environment of volunteer computing systems in three specific ways:

• adopting a hybrid resource architecture by provisioning a small number of dedicated computers to serve as a system anchor to supplement other personal computers,
• extending HDFS to (1) take advantage of the dedicated resources for better data availability and (2) provide differentiated replication/data-serving services for different types of MapReduce job data, and
• extending the Hadoop speculative task scheduling algorithm to take into account high resource volatility and strategically place Map/Reduce tasks on volatile or dedicated nodes.

We implemented these three enhancements in Hadoop and carried out an extensive evaluation in an opportunistic environment. Our results show that MOON can improve the QoS of MapReduce services significantly, with as much as a 3-fold speedup, and can even finish MapReduce jobs that previously could not be completed in highly volatile environments.

II. BACKGROUND

A. Volunteer Computing

Many volunteer computing systems have been developed to harness idle desktop resources for high-performance or high-throughput computing [1], [2], [3], [4]. A common feature shared by these systems is non-intrusive deployment. While studies have been conducted on aggressively stealing computer cycles [9] and the corresponding impact [10], most production volunteer computing systems allow users to donate their resources in a conservative way by not running external tasks when the machine is actively used. For instance, Condor allows jobs to execute only after 15 minutes of no console activity and when the CPU load is less than 0.3.

B. MapReduce

MapReduce is a programming model designed to simplify parallel data processing [5]. Google has been using MapReduce to handle massive amounts of web search data on large-scale commodity clusters. This programming model has also been found effective in other application areas, including machine learning [11], bioinformatics [12], astrophysics, and cyber-security [13].

A MapReduce application is implemented through two user-supplied primitives: Map and Reduce. Map tasks take input key-value pairs and convert them into intermediate key-value pairs, which are in turn converted to output key-value pairs by Reduce tasks.

In Google's MapReduce implementation, the high-performance distributed file system GFS [14] is used to store the input, intermediate, and output data.

C. Hadoop

Hadoop is an open-source, cluster-based MapReduce implementation written in Java [8]. It is logically separated into two subsystems: the Hadoop Distributed File System (HDFS) and a MapReduce task execution framework.

HDFS consists of a NameNode process running on the master and multiple DataNode processes running on the workers. To provide scalable data access, the NameNode manages only the system metadata, with the actual file contents stored on the DataNodes. Each file in the system is stored as a collection of equal-sized data blocks. For I/O operations, an HDFS client queries the NameNode for the data block locations, with subsequent data transfer occurring directly between the client and the target DataNodes. Like GFS, HDFS achieves high data availability and reliability through data replication, with the replication degree specified by a replication factor.

To control task execution, a single JobTracker process running on the master manages job status and performs task scheduling. On each worker machine, a TaskTracker process tracks the available execution slots. A worker machine can execute up to M Map tasks and R Reduce tasks simultaneously (M and R default to 2). A TaskTracker contacts the JobTracker for an assignment when it detects an empty execution slot on the machine. Tasks of different jobs are scheduled according to job priorities. Within a job, the JobTracker first tries to schedule a non-running task, giving high priority to recently failed tasks; if all tasks for this job have been scheduled, the JobTracker speculatively issues backup tasks for slow-running ones. These speculative tasks help improve job response time.

III. MOON DESIGN RATIONALE AND ARCHITECTURE OVERVIEW

MOON targets institutional intranet environments, where volunteer personal computers (PCs) are connected by a local area network with relatively high bandwidth and low latency. However, PC availability is ephemeral in such an environment. Moreover, large-scale, correlated resource inaccessibility can be normal [15]. For instance, many machines in a computer lab will be occupied simultaneously during a lab session.

Observing that opportunistic PC resources are not dependable enough to offer reliable compute and storage services, MOON supplements a volunteer computing system with a small number of dedicated compute resources. Figure 2 illustrates this hybrid architecture, where a small set of dedicated nodes provides storage and computing resources at a much higher reliability level than the existing volatile nodes.

Fig. 2. Overview of MOON execution environments: (a) volunteer computing environments; (b) MOON. The resources on nodes with a question mark are ephemeral.

The MOON hybrid architecture has multiple advantages. First, placing a replica on dedicated nodes can significantly enhance data availability without imposing a high replication cost on the volatile nodes, thereby improving overall resource utilization and reducing job response time. For example, the well-maintained workstations in our research lab have had only 10 hours of unscheduled downtime in the past year (due to an unannounced power outage), which is equivalent to a 0.001 unavailability rate. Assuming the average unavailability rate of a volunteer computing system is 0.4 and the failure of each volatile node is independent, achieving 99.99% availability only requires a single copy on the dedicated node and three copies on the volatile nodes. Second, long-running tasks with execution times much larger than the mean available interval of volunteered machines may be difficult to finish on purely volatile resources because of frequent interruptions. Scheduling those long-running tasks on dedicated resources can guarantee their completion. Finally, with these more predictable nodes dedicated to assisting a volunteer computing system, it is easier to perform QoS control, especially when the node unavailability rate is high.

Due to the wide range of scheduling policies used in volunteer computing systems, we have designed for the extreme situation where MOON might be wrapped inside a virtual machine and distributed to each PC, as enabled by Condor [1] and Entropia [3]. In this scenario, the volunteer computing system controls when to pause and resume the virtual machine according to the policy chosen by the computer owners. To accommodate such a scenario, MOON assumes that no computation or communication progress can be made on a PC while it is actively used by the owner, and it relies on the heartbeat mechanism in Hadoop to detect when a PC is unavailable.

As we will discuss in Section IV, one design assumption of the current MOON solution is that, collectively, the dedicated nodes have enough aggregate storage for at least one copy of all active data in the system. We argue that this solution is made practical by the decreasing price of commodity servers and hard drives with large capacities. For example, a decent desktop computer with 1.5 TB of disk space can currently be acquired for under $1,000. In the future, we will investigate scenarios where this assumption is relaxed.

IV. MOON DATA MANAGEMENT

In this section, we present our enhancements to Hadoop to provide a reliable MapReduce service from the data management perspective. Within a MapReduce system there are three types of user data – input, intermediate, and output. Input data are provided by a user and used by Map tasks to produce intermediate data, which are in turn consumed by Reduce tasks to create output data. The availability of each type of data has different implications for QoS.

For input data, temporary inaccessibility will stall the computation of the corresponding Map tasks, whereas loss of the input data will cause the entire job to fail. (In Hadoop, an incomplete Map task, e.g., one stalled by inaccessibility of the corresponding input data block, will be rescheduled up to 4 times, after which the Map task is marked as failed and the corresponding job is terminated.) Similar situations will be experienced with temporary unavailability of intermediate or output data. However, these two types of data are more resilient to loss, as they can be reproduced by re-executing the Map and/or Reduce tasks involved. On the other hand, once a job has completed, lost output data is irrecoverable if the input data have been removed from the system. In this case, a user will have to re-stage the previously removed input data and re-issue the entire job, as if the input data had been lost. In any of these scenarios the time-to-completion of the MapReduce job can be substantially prolonged.

As mentioned in Section I, we found that existing Hadoop data management is insufficient to provide high QoS in volatile environments for two main reasons. First, the replication cost to provide the necessary level of data availability for input and output data in HDFS on volunteer computing systems is prohibitive when the volatility is high. Additionally, non-replicated intermediate data can easily become temporarily or permanently unavailable due to user activity or software/hardware failures on the worker node where the data is stored, thereby unnecessarily forcing the relevant Map tasks to be re-executed.

To address these issues, MOON augments Hadoop data management in several ways to leverage the proposed hybrid resource architecture and offer a cost-effective and robust storage service.

A. Multi-dimensional, Cost-effective Replication Service

Existing MapReduce frameworks such as Hadoop are designed for relatively stable environments running on dedicated nodes. In Hadoop, data replication is carried out in a rather static and uniform manner. To extend Hadoop to handle volatile volunteer computing environments, MOON provides a multi-dimensional, cost-effective replication service.

First, MOON manages two types of resources – supplemental dedicated computers and volatile volunteer nodes. The number of dedicated computers is much smaller than the number of volatile nodes for cost-effectiveness purposes. To support this hybrid scheme, MOON extends Hadoop's data management and defines two types of workers: dedicated DataNodes and volatile DataNodes. Accordingly, the replication factor of a file can no longer be adequately represented by a single number. Instead, it is defined by a pair {d, v}, where d and v specify the number of data replicas on the dedicated DataNodes and the volatile DataNodes, respectively.
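To make the {d, v} notation concrete, the following minimal sketch (our own illustration, not MOON's actual classes) computes the block availability implied by a given pair under the paper's independence assumption:

public final class ReplicationFactor {
    public final int d; // replicas on dedicated DataNodes
    public final int v; // replicas on volatile DataNodes

    public ReplicationFactor(int d, int v) {
        this.d = d;
        this.v = v;
    }

    /** A block is unavailable only if every one of its replicas is. */
    public double availability(double pDedicated, double pVolatile) {
        return 1.0 - Math.pow(pDedicated, d) * Math.pow(pVolatile, v);
    }

    public static void main(String[] args) {
        // The Section III example: one dedicated copy (p = 0.001) plus
        // three volatile copies (p = 0.4) gives 1 - 0.001 * 0.4^3 = 0.999936.
        System.out.println(new ReplicationFactor(1, 3).availability(0.001, 0.4));
    }
}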

Intuitively, since dedicated nodes have much higher availability than volatile nodes, placing replicas on dedicated DataNodes can significantly improve data availability and in turn minimize the replication cost on volatile nodes. Because of the limited aggregate network and I/O bandwidth of the dedicated computers, however, the major challenge is how to maximize the utilization of the dedicated resources to improve service quality while preventing the dedicated computers from becoming a system bottleneck. To this end, MOON's replication design differentiates between various data types at the file level and takes into account the load and volatility levels of the DataNodes.

MOON characterizes Hadoop data files into two categories, reliable and opportunistic. Reliable files are defined as data that cannot be lost under any circumstances. One or more dedicated copies are always maintained for reliable files, so that they can tolerate the potential outage of a large percentage of volatile nodes. MOON always stores input data and system data required by the job as reliable files.

In contrast, opportunistic files contain transient data that can tolerate a certain level of unavailability and may or may not have dedicated replicas. Intermediate data will always be stored as opportunistic files. On the other hand, output data will first be stored as opportunistic files while the Reduce tasks are completing, and once all have completed they are then converted to reliable files.

The separation of reliable files from opportunistic files is critical in controlling the load level of dedicated DataNodes. When MOON decides that all dedicated DataNodes are nearly saturated, an I/O request to replicate an opportunistic file on a dedicated DataNode will be declined (details are described in Section IV-B).

Additionally, allowing output data to be first stored as opportunistic files enables MOON to dynamically direct write traffic toward or away from the dedicated DataNodes as necessary. Furthermore, only after all data blocks of the output file have reached their replication factor will the job be marked as complete and the output file be made available to users.

To maximize the utilization of dedicated computers, MOON will attempt to maintain dedicated replicas for opportunistic files when possible. When dedicated replicas cannot be maintained, the availability of an opportunistic file is subject to the volatility of the volunteer PCs, possibly resulting in poor QoS due to forced re-execution of the related Map or Reduce tasks. While this issue can be addressed by using a high replication degree on volatile DataNodes, such a solution inevitably incurs high network and storage overhead.

MOON addresses this issue by adaptively changing the replication requirement to provide the user-defined QoS. Specifically, consider a write request for an opportunistic file with replication factor {d, v}. If the dedicated replicas are rejected because the dedicated DataNodes are saturated, MOON will dynamically adjust v to v′, where v′ is chosen to guarantee that the file availability meets the user-defined availability level (e.g., 0.9) pursuant to the node unavailability rate p (i.e., 1 − p^v′ > 0.9). If p changes before a dedicated replica can be stored, v′ will be recalculated accordingly. Also, no extra replication is needed if an opportunistic file already has a replication degree higher than v′. While we currently estimate p by simply having the NameNode monitor the fraction of unavailable DataNodes during the past interval I, MOON allows for user-defined models to accurately predict p under a given volunteer computing system.
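The adjustment of v to v′ reduces to a one-line calculation; the sketch below (class and method names are ours, not MOON's code) picks the smallest v′ satisfying 1 − p^v′ > target:

public class VolatileReplicationPolicy {
    /**
     * Smallest v' with 1 - p^v' > target, i.e., v' > ln(1 - target) / ln(p).
     *
     * @param p      node unavailability rate estimated by the NameNode
     *               over the past interval I (0 < p < 1)
     * @param target user-defined availability level, e.g., 0.9
     */
    public static int adjustedVolatileReplicas(double p, double target) {
        return (int) Math.floor(Math.log(1.0 - target) / Math.log(p)) + 1;
    }

    public static void main(String[] args) {
        // p = 0.4, target = 0.9: v' = 3, since 1 - 0.4^3 = 0.936 > 0.9.
        System.out.println(adjustedVolatileReplicas(0.4, 0.9));
    }
}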

The rationale for adaptively changing the replication requirement is that when an opportunistic file has a dedicated copy, the availability of the file is high, thereby allowing MOON to decrease the replication degree on volatile DataNodes. Alternatively, MOON can increase the volatile replication degree of a file as necessary to prevent forced task re-execution caused by the unavailability of opportunistic data.

Similar to Hadoop, when any file in the system falls below its replication factor, the file will be put into a replication queue. The NameNode periodically checks this queue and issues replication requests, giving higher priority to reliable files.

Fig. 3. Decision process to determine where data should be stored.

B. Prioritizing I/O Requests

When a large number of volatile nodes is supplemented with a much smaller number of dedicated nodes, providing scalable data access is challenging. As such, MOON prioritizes the I/O requests on the different resources.

To alleviate read traffic on dedicated nodes, MOON factors in the node type when servicing a read request. Specifically, for files with replicas on both volatile and dedicated DataNodes, read requests from clients on volatile DataNodes will always try to fetch data from volatile replicas first. By doing so, read requests from clients on the volatile DataNodes will only reach dedicated DataNodes when none of the volatile replicas are available.

When a write request occurs, MOON prioritizes I/O traffic to the dedicated DataNodes according to data vulnerability. A write request for a reliable file will always be satisfied on dedicated DataNodes. However, a write request for an opportunistic file will be declined if all dedicated DataNodes are close to saturation. As such, write requests for reliable files are fulfilled before those of opportunistic files when the dedicated DataNodes are fully loaded. This decision process is shown in Figure 3.

To determine whether a dedicated DataNode is almost saturated, MOON uses a sliding window-based algorithm, as shown in Algorithm 1. MOON monitors the I/O bandwidth consumed at each dedicated DataNode and sends this information to the NameNode, piggybacked on the heartbeat messages. The throttling algorithm running on the NameNode compares the updated bandwidth with the average I/O bandwidth during a past window. If the consumed I/O bandwidth of a DataNode is increasing, but only by a small margin determined by a threshold Tb, the DataNode is considered saturated. On the contrary, if the updated I/O bandwidth is decreasing and falls below the average by more than the threshold Tb, the dedicated node is considered unsaturated. Such a design avoids false detection of saturation status caused by load oscillation.

Algorithm 1 I/O throttling on dedicated DataNodes
  Let W be the throttling window size
  Let Tb be the control threshold
  Let bw_k be the measured bandwidth at timestep k
  Input: current I/O bandwidth bw_i
  Output: throttling state of the dedicated node

  avg_bw = (sum_{j=i-W}^{i-1} bw_j) / W
  if bw_i > avg_bw then
      if (state == unthrottled) and (bw_i < avg_bw * (1 + Tb)) then
          state = throttled
      end if
  end if
  if bw_i < avg_bw then
      if (state == throttled) and (bw_i < avg_bw * (1 - Tb)) then
          state = unthrottled
      end if
  end if
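A compact Java rendering of Algorithm 1 follows. It is a sketch (the real logic runs in the NameNode and is fed by heartbeat reports), but the window and threshold arithmetic is exactly the algorithm above:

import java.util.ArrayDeque;
import java.util.Deque;

public class DedicatedNodeThrottle {
    private final int w;        // W: throttling window size
    private final double tb;    // Tb: control threshold
    private final Deque<Double> window = new ArrayDeque<>();
    private boolean throttled = false;

    public DedicatedNodeThrottle(int w, double tb) {
        this.w = w;
        this.tb = tb;
    }

    /** Feed bw_i from the latest heartbeat; returns the throttling state. */
    public boolean update(double bw) {
        if (window.size() == w) {
            double avg = window.stream()
                    .mapToDouble(Double::doubleValue).average().getAsDouble();
            if (bw > avg && !throttled && bw < avg * (1 + tb)) {
                throttled = true;   // rising by only a small margin: saturated
            } else if (bw < avg && throttled && bw < avg * (1 - tb)) {
                throttled = false;  // clearly falling: unsaturated again
            }
            window.removeFirst();   // slide the window forward
        }
        window.addLast(bw);
        return throttled;
    }
}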

C. Handling Ephemeral Unavailability

Within the original HDFS, fault tolerance is achieved by periodically monitoring the health of each DataNode and replicating files as needed. If a heartbeat message from a DataNode has not arrived at the NameNode within the NodeExpiryInterval, the DataNode will be declared dead and its files replicated as needed.

This fault tolerance mechanism is problematic for opportunistic environments where transient resource unavailability is common. If the NodeExpiryInterval is shorter than the mean unavailability interval of the volatile nodes, these nodes may frequently switch between live and dead states, causing replication thrashing as HDFS strives to keep the correct number of replicas. Such thrashing significantly wastes network and I/O resources and should be avoided. On the other hand, if the NodeExpiryInterval is set too long, the system will incorrectly consider a "dead" DataNode as "alive". These DataNodes will continue to be sent I/O requests until they are properly identified as dead, thereby degrading overall I/O performance as clients experience timeouts trying to access the nodes.

To address this issue, MOON introduces a hibernate state. A DataNode enters the hibernate state if no heartbeat messages are received for more than a NodeHibernateInterval, which is much shorter than the NodeExpiryInterval. A hibernated DataNode will not be sent any I/O requests, so as to avoid unnecessary access attempts from clients. Observing that a data block with dedicated replicas already has the necessary availability to tolerate transient unavailability of volatile nodes, only opportunistic files without dedicated replicas will be re-replicated. This optimization can greatly reduce replication traffic in the system while preventing task re-executions caused by the compromised availability of opportunistic files.
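The resulting three-state liveness model can be summarized in a few lines (the interval names follow the text; the code itself is our illustration, not MOON's implementation):

public class LivenessClassifier {
    public enum State { LIVE, HIBERNATE, DEAD }

    private final long hibernateMs; // NodeHibernateInterval (much shorter)
    private final long expiryMs;    // NodeExpiryInterval

    public LivenessClassifier(long hibernateMs, long expiryMs) {
        this.hibernateMs = hibernateMs;
        this.expiryMs = expiryMs;
    }

    /** Classify a DataNode by how long its heartbeats have been silent. */
    public State classify(long nowMs, long lastHeartbeatMs) {
        long silence = nowMs - lastHeartbeatMs;
        if (silence > expiryMs) {
            return State.DEAD;      // declare dead; re-replicate its files
        }
        if (silence > hibernateMs) {
            // Send no I/O requests; re-replicate only opportunistic files
            // that lack a dedicated replica.
            return State.HIBERNATE;
        }
        return State.LIVE;
    }
}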

V. MOON TASK SCHEDULING

One important mechanism that Hadoop uses to improve job response time is to speculatively issue backup tasks for "stragglers", i.e., slow-running tasks. Hadoop considers a task a straggler if the task meets two conditions: 1) it has been running for more than one minute, and 2) its progress score lags behind the average progress of all tasks of the same type by 0.2 or more. The per-task progress score, valued between 0 and 1, is calculated as the fraction of data that has been processed in this task.
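In sketch form (the class and field names are ours, not Hadoop 0.17 internals), the straggler test is:

public class StragglerDetector {
    static final long MIN_RUNTIME_MS = 60_000; // condition 1: running > 1 minute
    static final double PROGRESS_GAP = 0.2;    // condition 2: lag behind the mean

    /**
     * @param progress            this task's progress score in [0, 1]
     * @param runtimeMs           how long the task has been running
     * @param avgSameTypeProgress average score of all tasks of the same type
     */
    public static boolean isStraggler(double progress, long runtimeMs,
                                      double avgSameTypeProgress) {
        return runtimeMs > MIN_RUNTIME_MS
                && progress <= avgSameTypeProgress - PROGRESS_GAP;
    }
}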

In Hadoop, all stragglers are treated equally, regardless of the relative differences between their progress scores. The JobTracker (i.e., the master) simply selects stragglers for speculative execution according to the order in which they were originally scheduled, except that for Map stragglers, priority is given to the ones with input data local to the requesting TaskTracker (i.e., the worker). The maximum number of speculative copies (excluding the original copy) for each task is user-configurable, but capped at 1 by default.

Similar to data replication, such static task replication becomes inadequate in volatile volunteer computing environments. The assumption that tasks run smoothly toward completion, except for a small fraction that may be affected by abnormal nodes, is easily invalidated in opportunistic environments; a large percentage of tasks will likely be suspended or interrupted due to temporary or permanent outages of the volatile nodes. Consequently, the existing Hadoop solution of identifying stragglers based solely on tasks' progress scores is too optimistic.

Therefore, MOON adopts speculative task exe-cution strategies that are aggressive for individualtasks to prepare for high node volatility, yet overallcautious considering the collectively unreliable en-vironment. We describe these techniques in the restof this section.

A. Ensuring Sufficient Progress with High Node Volatility

In order to guarantee that sufficient progress is made on all tasks, MOON characterizes stragglers into frozen tasks (tasks where all copies are simultaneously inactive) and slow tasks (tasks that are not frozen, but satisfy the Hadoop criteria for speculative execution). The MOON scheduler composes two separate lists, containing frozen and slow tasks respectively, with tasks selected from the frozen list first. In both lists, tasks are sorted by the progress made thus far, with lower progress ranked higher.
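The two-list selection can be sketched as follows (the Task record and its fields are hypothetical stand-ins for MOON's internal bookkeeping):

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MoonStragglerQueue {
    record Task(String id, double progress,
                boolean allCopiesInactive, boolean meetsSlowCriteria) {}

    /** Frozen tasks first, then slow tasks; lower progress ranked higher. */
    static List<Task> schedulingOrder(List<Task> candidates) {
        Comparator<Task> byProgress = Comparator.comparingDouble(Task::progress);
        Stream<Task> frozen = candidates.stream()
                .filter(Task::allCopiesInactive)
                .sorted(byProgress);
        Stream<Task> slow = candidates.stream()
                .filter(t -> !t.allCopiesInactive() && t.meetsSlowCriteria())
                .sorted(byProgress);
        return Stream.concat(frozen, slow).collect(Collectors.toList());
    }
}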


It is worth noting that Hadoop does offer a task fault-tolerance mechanism to handle node outages. The JobTracker considers a TaskTracker dead if no heartbeat messages have been received from the TaskTracker for a TrackerExpiryInterval (10 minutes by default). All task instances on a dead TaskTracker will be killed and rescheduled. Naively, using a small TrackerExpiryInterval can help detect and relaunch inactive tasks faster. However, using too small a value for the TrackerExpiryInterval will cause many suspended tasks to be killed prematurely, thus wasting resources.

In contrast, MOON considers a TaskTracker suspended if no heartbeat messages have been received from the TaskTracker for a SuspensionInterval, which can be set to a value much smaller than the TrackerExpiryInterval so that the anomaly can be detected early. All task instances running on a suspended TaskTracker are then flagged inactive, in turn triggering frozen-task handling. Inactive task instances are not killed right away, in the hope that they may be resumed when the TaskTracker returns to normal later.

MOON imposes a cap on the number of speculative copies for a slow task, similar to Hadoop. However, a speculative copy will be issued for a frozen task regardless of the number of its copies, so that progress can always be made on the task. To constrain the resources used by task replication, however, MOON enforces a limit on the total concurrent speculative task instances for a job, similar to the approach used by a related Hadoop scheduling study [16]. Specifically, no more speculative tasks will be issued if the concurrent number of speculative tasks of a job is above a percentage of the total currently available execution slots. We found that a threshold of 20% worked well in our experiments.

B. Two-phase Task Replication

The speculative scheduling approach discussed above only issues a backup copy for a task after it is detected as frozen or slow. Such a reactive approach is insufficient to handle fast-progressing tasks that suddenly become inactive. For instance, consider a task that runs normally until it is 99% complete and is then suspended. A speculative copy will only be issued for this task after the task suspension is detected by the system, and the computation must be started all over again. To make matters worse, the speculative copy may also become inactive before its completion. In this scenario, the delay inherent in the reactive scheduling approach can prolong the job response time, especially when the scenario occurs toward the end of the job.

To remedy this, MOON separates job progress into two phases, normal and homestretch, where the homestretch phase begins once the number of remaining tasks for the job falls below H% of the currently available execution slots. The basic idea of this two-phase design is to alleviate the impact of unexpected task interruptions by proactively replicating tasks toward job completion. Specifically, during the homestretch phase, MOON attempts to maintain at least R active copies of any remaining task, regardless of the task progress score. If the unavailability rate of volunteer PCs is p, the probability that a task will become frozen decreases to p^R.

The motivation for the two-phase scheduling stems from two observations. First, when the number of concurrent jobs in the system is small, computational resources become more underutilized as a job gets closer to completion. Second, a suspended task delays the job more the closer the job is to completion. The choice of H and R is important to achieve a good trade-off between the task replication cost and the performance improvement. In our experiments, we found H = 20 and R = 2 yield generally good results.
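For the trace-derived unavailability rate used throughout the paper, the benefit is easy to quantify: with p = 0.4 and R = 2,

    p^R = 0.4^2 = 0.16,

so maintaining two active copies cuts the probability of a task freezing from 0.4 to 0.16 (and R = 3 would cut it to 0.064), under the same independence assumption used in Section IV.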

C. Leveraging the Hybrid Resources

MOON attempts to further decrease the impact of volatility during both the normal and homestretch phases by replicating tasks on the dedicated nodes. Doing this allows us to take advantage of the CPU resources available on the dedicated computers (as opposed to using them as pure data servers). We adopt a best-effort approach in augmenting the MOON scheduling policy to leverage the hybrid architecture. The improved policy schedules a speculative task on dedicated computers if there are empty slots available, with tasks prioritized in a similar way as in task replication on the volunteer computers.


Intuitively, tasks with a dedicated speculative copy are given lower priority in receiving additional task replicas, as the backup support from dedicated computers tends to be much more reliable. Similarly, tasks that already have a dedicated copy do not participate in the homestretch phase.

As a side effect of the above task scheduling approach, long-running tasks that have difficulty finishing on volunteer PCs because of frequent interruptions will eventually be scheduled and guaranteed completion on the dedicated computers.

VI. PERFORMANCE EVALUATION

We now present the performance evaluation of the MOON system. Our experiments are executed on System X at Virginia Tech, comprised of Apple Xserve G5 compute nodes with dual 2.3 GHz PowerPC 970FX processors, 4 GB of RAM, and 80 GB hard drives. System X uses a 10 Gb/s InfiniBand network and 1 Gb/s Ethernet for interconnection. To closely resemble volunteer computing systems, we only use the Ethernet network in our experiments. Each node runs the GNU/Linux operating system with kernel version 2.6.21.1. The MOON system is developed based on Hadoop 0.17.2.

On production volunteer computing systems, machine availability patterns are commonly non-repeatable, making it difficult to fairly compare different strategies. Meanwhile, traces cannot easily be manipulated to create different node availability levels. In our experiments, we therefore emulate a volunteer computing system with synthetic node availability traces, where the node availability level can be adjusted.

We assume that node outages are mutually independent and generate unavailable intervals using a normal distribution, with the mean node-outage interval (409 seconds) extracted from the aforementioned Entropia volunteer computing node trace [7]. The unavailable intervals are then inserted into 8-hour traces following a Poisson distribution, such that in each trace the percentage of unavailable time is equal to a given node unavailability rate. At runtime of each experiment, a monitoring process on each node reads in the assigned availability trace, and suspends and resumes all Hadoop/MOON-related processes on the node accordingly.
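The trace construction can be sketched as follows. The 409-second mean and the 8-hour trace length come from the text; the standard deviation of the outage lengths and the non-overlap handling are assumptions we make purely for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class AvailabilityTraceGenerator {
    static final double MEAN_OUTAGE_S = 409.0;   // from the Entropia trace [7]
    static final double STDDEV_OUTAGE_S = 100.0; // assumed, not from the paper
    static final double TRACE_LENGTH_S = 8 * 3600.0;

    /** Returns [start, end] outage intervals for one node's 8-hour trace. */
    static List<double[]> generate(double unavailabilityRate, Random rng) {
        // Pick the Poisson arrival rate so the expected total outage time
        // is approximately unavailabilityRate * TRACE_LENGTH_S.
        double arrivalsPerSecond = unavailabilityRate / MEAN_OUTAGE_S;
        List<double[]> outages = new ArrayList<>();
        double t = -Math.log(1.0 - rng.nextDouble()) / arrivalsPerSecond;
        while (t < TRACE_LENGTH_S) {
            // Outage lengths are normally distributed around the mean.
            double len = Math.max(1.0,
                    MEAN_OUTAGE_S + STDDEV_OUTAGE_S * rng.nextGaussian());
            outages.add(new double[] {t, Math.min(t + len, TRACE_LENGTH_S)});
            // Exponential inter-arrival gaps give a Poisson process.
            t += len - Math.log(1.0 - rng.nextDouble()) / arrivalsPerSecond;
        }
        return outages;
    }
}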

Our experiments focus on two representative MapReduce applications, sort and word count, that are shipped with the Hadoop distribution. The configurations of the two applications are given in Table I (note that, by default, Hadoop runs 2 Reduce tasks per node). For both applications, the input data is randomly generated using tools distributed with Hadoop.

TABLE I
APPLICATION CONFIGURATIONS

Application   Input Size   # Maps   # Reduces
sort          24 GB        384      0.9 × AvailSlots
word count    20 GB        320      20

A. Speculative Task Scheduling Evaluation

First, we evaluate the MOON scheduling algorithm using job response time as the performance metric. In opportunistic environments, both the scheduling algorithm and the data management policies can largely impact this metric. To isolate the impact of speculative task scheduling, we use the sleep application distributed with Hadoop, which allows us to simulate our two target applications with faithful Map and Reduce task execution times, but generates only an insignificant amount of intermediate and output data (two integers per record of intermediate data and zero output data).

We feed the average Map and Reduce execution times from sort and word count benchmarking runs into sleep. We also configure MOON to replicate the intermediate data as reliable files with one dedicated and one volatile copy, so that intermediate data are always available to Reduce tasks. Since sleep only deals with a small amount of intermediate data, the impact of data management is minimal.

The test environment is configured with 60 volatile nodes and 6 dedicated nodes, resulting in a 10:1 volatile-to-dedicated (V-to-D) node ratio (results with higher V-to-D node ratios will be shown in Section VI-C). We compare the original Hadoop task scheduling policy and two versions of the MOON two-phase scheduling algorithm described in Section V: without and with awareness of the hybrid architecture (MOON and MOON-Hybrid, respectively).

We control how quickly the Hadoop fault-tolerant mechanism reacts to node outages by using 1, 5, and 10 (the default) minutes for the TrackerExpiryInterval. With even larger TrackerExpiryIntervals, Hadoop performance gets worse, and hence those results are not shown here. For MOON, as discussed in Section V-B, the task suspension detection allows using larger TrackerExpiryIntervals to avoid killing tasks prematurely. We use 1 minute for the SuspensionInterval and 30 minutes for the TrackerExpiryInterval.

Fig. 4. Execution time with Hadoop and MOON scheduling policies: (a) sort; (b) word count. Execution time (s) is plotted against machine unavailability rates of 0.1, 0.3, and 0.5 for Hadoop10Min, Hadoop5Min, Hadoop1Min, MOON, and MOON-Hybrid.

Figure 4 shows the execution times at various average node unavailability rates. Overall, the job execution time with the default Hadoop scheduling policy decreases as the TrackerExpiryInterval decreases.

For the sort application (Figure 4(a)), Hadoop1Min (Hadoop with a 1-minute TrackerExpiryInterval) outperforms Hadoop10Min by 48% and Hadoop5Min by 35%, on average. When the node unavailability rate is low, the performance of the MOON scheduling policies is comparable to or slightly better than Hadoop1Min. This is because in these scenarios a high percentage of tasks run normally, and thus the default Hadoop speculative algorithm performs relatively well. However, at the 0.5 unavailability rate, the MOON scheduling policy significantly outperforms Hadoop1Min, by 45% without even being aware of the hybrid architecture. This is because the MOON two-phase scheduling algorithm can handle task suspension without killing tasks prematurely, and reduces the occurrence of tasks failing toward the end of job execution. Finally, when leveraging the hybrid resources, MOON can further improve performance, especially when the unavailability rate is high.

Figure 4(b) shows similar results with word count. While the MOON scheduler still outperforms Hadoop1Min, the improvement is smaller. This is due to the fact that word count has a smaller number of Reduce tasks, providing less room for improvement. Nonetheless, MOON-Hybrid outperforms the best alternative Hadoop policy (Hadoop1Min) by 24% and 35% at the 0.3 and 0.5 unavailability rates, respectively. Note that the efficacy of leveraging the hybrid nodes in word count is slightly different than in sort. MOON-Hybrid does not show a performance improvement over basic MOON at the 0.1 and 0.3 unavailability rates, mainly because the number of Reduce tasks of word count is small, and the speculative tasks issued by the two-phase algorithm on volatile nodes are sufficient to handle the node outages. However, at the 0.5 unavailability rate, the Reduce tasks on volatile nodes are interrupted more frequently, in which case placing Reduce tasks on dedicated nodes can deliver considerable performance improvements.

Another important metric to evaluate is the total number of duplicated tasks issued, as extra tasks consume system resources as well as energy. Figure 5 plots the number of duplicated tasks (including both Map and Reduce) issued with different scheduling policies. For Hadoop, with a smaller TrackerExpiryInterval, the JobTracker is more likely to consider a suspended TaskTracker dead and in turn increase the number of duplicated tasks by re-executing tasks. Meanwhile, a smaller TrackerExpiryInterval can decrease the execution time by reacting to task suspensions more quickly. Conversely, a reduction in the execution time can decrease the probability of a task being suspended. Because of these two complementary factors, we observe that, generally, the Hadoop scheduler creates larger numbers of speculative tasks as a smaller TrackerExpiryInterval is used, with a few exceptions for sort at the 0.1 and 0.3 unavailability rates.

The basic MOON algorithm significantly reduces duplicated tasks. For sort, it issues 14%, 22%, and 44% fewer such tasks at the three unavailability rate levels, respectively. Similar improvements can be observed for word count. With hybrid-resource-aware optimizations, MOON achieves further improvement, averaging 29% gains and peaking at 44% in these tests.

Fig. 5. Number of duplicated tasks issued with different scheduling policies: (a) sort; (b) word count. The number of duplicated tasks is plotted against machine unavailability rates of 0.1, 0.3, and 0.5 for Hadoop10Min, Hadoop5Min, Hadoop1Min, MOON, and MOON-Hybrid.

Overall, we found that the default Hadoop scheduling policy can enhance its capability of handling task suspensions in opportunistic environments, but often at the cost of shortening the TrackerExpiryInterval and issuing more speculative tasks. The MOON scheduling policies, however, can deliver significant performance improvements over the native Hadoop algorithms while creating fewer speculative tasks, especially when the resource volatility is high.

B. Replication of Intermediate Data

In a typical Hadoop job, the shuffle phase, where intermediate data are copied to Reduce tasks, is time-consuming even in dedicated environments. In opportunistic environments, achieving efficient shuffle performance is more challenging, given that intermediate data can be inaccessible due to frequent machine outages. In this section, we evaluate the impact of MOON's intermediate data replication policy on shuffle efficiency and, consequently, job response time.

We compare a volatile-only (VO) replication approach that statically replicates intermediate data only on volatile nodes with the hybrid-aware (HA) replication approach described in Section IV-A. For the VO approach, we increase the number of volatile copies gradually from 1 (VO-V1) to 5 (VO-V5). For the HA approach, we have MOON store one copy on dedicated nodes when possible, and increase the minimum number of volatile copies from 1 (HA-V1) to 3 (HA-V3). Recall that in the HA approach, if a data block does not yet have a dedicated copy, the number of volatile copies of the block is dynamically adjusted such that the availability of the file reaches 0.9.

These experiments use 60 volatile nodes and 6 dedicated nodes. To focus solely on intermediate data, we configure the input/output data to use a fixed replication factor of {1, 3} across all experiments. Also, the task scheduling algorithm is fixed at MOON-Hybrid, which was shown to be the best in the previous section.

In Hadoop, a Reduce task reports a fetch failure if the intermediate data of a Map task is inaccessible. The JobTracker will reschedule a new copy of the Map task if more than 50% of the running Reduce tasks report fetch failures for that Map task. We observed that with this approach the reaction to the loss of Map output is too slow, and as a result a typical job runs for hours. We remedy this by allowing the JobTracker to query the MOON file system to see whether there are active replicas of a Map output; once it observes three fetch failures for the task, it immediately reissues a new copy of the Map task to regenerate the data.
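One way to express this modified rule is sketched below; the MoonFileSystem interface is a hypothetical stand-in for the query the text describes, and the conjunction of the two conditions is our reading of the intended behavior:

public class FetchFailureHandler {
    static final int FAILURE_THRESHOLD = 3;

    interface MoonFileSystem {
        /** True if some replica of this Map output is currently reachable. */
        boolean hasActiveReplica(String mapOutputId);
    }

    /** Reissue the Map task once three fetch failures are observed and
     *  the MOON file system reports no active replica of its output. */
    static boolean shouldReissueMap(int fetchFailures, String mapOutputId,
                                    MoonFileSystem fs) {
        return fetchFailures >= FAILURE_THRESHOLD
                && !fs.hasActiveReplica(mapOutputId);
    }
}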

Figure 6(a) shows the results for sort. As expected, enhanced intermediate data availability through VO replication clearly reduces the overall execution time. When the unavailability rate is low, HA replication does not exhibit much additional performance gain. However, HA replication significantly outperforms VO replication when the node unavailability level is high. While increasing the number of volatile replicas can help improve data availability on a highly volatile system, it incurs a high performance cost. As a result, there is no further execution-time improvement from VO-V3 to VO-V4, and from VO-V4 to VO-V5 the performance actually degrades. With HA replication, having at least one copy written to dedicated nodes substantially improves data availability, with a lower overall replication cost. More specifically, HA-V1 outperforms the best VO configuration, i.e., VO-V3, by 61% at the 0.5 unavailability rate.

With word count, the gap between the best HA configuration and the best VO configuration is small. This is not surprising, as word count generates much smaller intermediate/final output and has far fewer Reduce tasks; thus the cost of fetching intermediate results can be largely hidden by Map tasks. Also, increasing the number of replicas does not incur significant overhead. Nonetheless, at the 0.5 unavailability rate, the HA replication approach still outperforms the best VO replication configuration by about 32.5%.

To further understand the cause of the performance variances of different policies, Table II shows the execution profile collected from the Hadoop job log for tests at the 0.5 unavailability rate. We do not include all policies due to space limits. For sort, the average Map execution time increases rapidly as higher replication degrees are used in the VO replication approach. In contrast, the Map execution time does not change much across different policies for word count, for the reasons discussed earlier.

The most noticeable factor causing performance differences is the average shuffle time. For sort, the average shuffle time of VO-V1 is much higher than that of the other policies due to the low availability of intermediate data. In fact, the average shuffle time of VO-V1 is about 5 times longer than that of HA-V1. For VO replication, increasing the replication degree from 1 to 3 results in a 54% improvement in the shuffle time, but no further improvement is observed beyond this point. This is because the shuffle time is partially affected by the increasing Map execution time, given that the shuffle time is measured from the start of a Reduce task until the end of copying all related Map results. For word count, the shuffle times with different policies are relatively close except with VO-V1, again because of the smaller intermediate data size.

Finally, since the fetch failures of Map results trigger the re-execution of the corresponding Map tasks, the average number of killed Map tasks is a good indication of intermediate data availability. While the number of killed Map tasks decreases as the VO replication degree increases, the HA replication approach in general results in a lower number of Map task re-executions.

C. Overall Performance Impacts of MOON

To evaluate the impact of MOON strategies on overall MapReduce performance, we establish a baseline by augmenting Hadoop to replicate the intermediate data and configuring Hadoop to store six replicas for both input and output data, to attain 99.5% data availability when the average machine unavailability is 0.4 (selected according to the real node availability trace shown in Figure 1). For MOON, we assume the availability of a dedicated node is at least as high as that of three volatile nodes together, with independent failure probability. That is, the unavailability of a dedicated node is less than 0.4^3, which is not hard to achieve for well-maintained workstations. As such, we configure MOON with a replication factor of {1, 3} for both input and output data.
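Both settings can be checked against the independence assumption:

    Hadoop, six volatile replicas: 1 − 0.4^6 ≈ 0.9959
    MOON {1, 3}:  1 − (0.4^3) × (0.4^3) = 1 − 0.4^6 ≈ 0.9959

so the {1, 3} MOON factor matches the six-replica Hadoop baseline exactly when the dedicated node's unavailability is bounded by 0.4^3.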

In testing the native Hadoop system, 60 volatile nodes and 6 dedicated nodes are used. These nodes, however, are all treated as volatile in the Hadoop tests, as Hadoop cannot differentiate between volatile and dedicated nodes. For each test, we use the VO replication configuration that delivers the best performance under the given unavailability rate. It is worth noting that we do not show the performance of the default Hadoop system (without intermediate data replication), which was unable to finish the jobs under high machine unavailability levels due to intermediate data losses and high task failure rates.

The MOON tests are executed on 60 volatile nodes with 3, 4, and 6 dedicated nodes, corresponding to 20:1, 15:1, and 10:1 V-to-D ratios.


Fig. 6. Impact of different replication policies for intermediate data on execution time: (a) sort; (b) word count. Execution time (s) is plotted against machine unavailability rates of 0.1, 0.3, and 0.5 for VO-V1 through VO-V5 and HA-V1 through HA-V3.

TABLE II
EXECUTION PROFILE OF DIFFERENT REPLICATION POLICIES AT 0.5 UNAVAILABILITY RATE

                          sort                                 word count
Policy                VO-V1    VO-V3   VO-V5    HA-V1      VO-V1   VO-V3   VO-V5   HA-V1
Avg Map Time (s)      21.25    42      71.5     41.5       100     110.75  113.5   112
Avg Shuffle Time (s)  1150.25  528     563      210.5      752.5   596.25  584     559
Avg Reduce Time (s)   155.25   84.75   116.25   74.5       50.25   28      28.5    31
Avg #Killed Maps      1389     55.75   31.25    18.75      292.25  32.5    30.5    23
Avg #Killed Reduces   59       47.75   55.25    34.25      18.25   18      15.5    12.5

The intermediate data is replicated with the HA approach using {1, 1} as the replication factor. As shown in Figure 7, MOON clearly outperforms Hadoop-VO at the 0.3 and 0.5 unavailability rates and is competitive at the 0.1 unavailability rate, even at a 20:1 V-to-D ratio. For sort, MOON outperforms Hadoop-VO by a factor of 1.8, 2.2, and 3 with 3, 4, and 6 dedicated nodes, respectively, when the unavailability rate is 0.5. For word count, the MOON performance is slightly better than augmented Hadoop, delivering a speedup factor of 1.5 compared to Hadoop-VO. The only case where MOON performs worse than Hadoop-VO is for the sort application when the unavailability rate is 0.1 and the V-to-D node ratio is 20:1. This is due to the fact that the aggregate I/O bandwidth on the dedicated nodes is insufficient to quickly absorb all of the intermediate and output data.

Fig. 7. Overall performance of MOON vs. Hadoop with VO replication: (a) sort; (b) word count. Execution time (s) is plotted against machine unavailability rates of 0.1, 0.3, and 0.5 for Hadoop-VO and for MOON-Hybrid with 3, 4, and 6 dedicated nodes (MOON-HybridD3, D4, D6).

[Figure] Fig. 7. Overall performance of MOON vs. Hadoop with VO replication: (a) sort, (b) word count. Each panel plots execution time (s) against machine unavailability rate (0.1, 0.3, 0.5) for Hadoop-VO and for MOON-Hybrid with 3, 4, and 6 dedicated nodes (MOON-HybridD3, MOON-HybridD4, MOON-HybridD6).

VII. RELATED WORK

Several storage systems have been designed to aggregate idle disk space on desktop computers within local area network environments [18], [19], [20], [21]. Farsite [18] aims at building a secure file system service, equivalent to a centralized file system, on top of untrusted PCs. It adopts replication to ensure high data reliability and is designed to reduce the replication cost by placing data replicas based on knowledge of the failure correlation between individual machines.

Glacier [19] is a storage system that can deliver high data availability under large-scale correlated failures. It does not assume any knowledge of machine failure patterns and uses erasure coding to reduce the data replication overhead. Both Farsite and Glacier are designed for typical I/O activities on desktop computers and are not sufficient for high-performance, data-intensive computing. Freeloader [20] provides a high-performance storage system; however, it aims at providing a read-only caching space and is not suitable for storing mission-critical data. The BitDew framework [21] intends to provide data management for computational grids, but is currently limited to applications with little or no data dependence between tasks.

There have been studies of executing MapReduce on grid systems, such as GridGain [22]. There are two major differences between GridGain and MOON. First, GridGain provides only a computing service and relies on other data-grid systems for its storage solution, whereas MOON provides an integrated computing and data solution by extending Hadoop. Second, unlike MOON, GridGain is not designed to provide high QoS in opportunistic environments where machines are frequently unavailable. Sun Microsystems' Compute Server technology is also capable of executing MapReduce jobs on a grid by creating a master-worker task pool from which workers iteratively grab tasks to execute [23]. However, based on information gleaned from [23], it appears that this technology is intended for use on large dedicated resources, similar to Hadoop.

When executing Hadoop in heterogeneous environments, Zaharia et al. discovered several limitations of the Hadoop speculative scheduling algorithm and developed the LATE (Longest Approximate Time to End) scheduling algorithm [16]. LATE aims at minimizing Hadoop's job response time by always issuing a speculative copy of the task that is expected to finish last. LATE was designed for heterogeneous, dedicated resources, assuming that the task progress rate is constant on a node. LATE is not directly applicable to opportunistic environments, where a high percentage of tasks can be frequently suspended or interrupted, and in turn the task progress rate is not constant on a node. Currently, the MOON design focuses on environments with homogeneous computers. In the future, we plan to explore the possibility of combining the MOON scheduling principles with LATE to support heterogeneous, opportunistic environments.
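To make the constant-progress-rate assumption concrete, the sketch below illustrates the core LATE estimate as we understand it from [16]. The class names and the slow-task threshold are our own illustrative choices (LATE actually uses a percentile-based SlowTaskThreshold), not Hadoop's or LATE's real API.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class RunningTask {
        double progress;     // fraction complete, in [0, 1]
        double elapsedSecs;  // wall-clock time since the task started

        // LATE assumes a constant per-node progress rate; on opportunistic
        // nodes, suspensions make this estimate unreliable.
        double progressRate() {
            return progress / elapsedSecs;
        }

        // Estimated time to end = remaining work / observed rate.
        double estimatedTimeToEnd() {
            return (1.0 - progress) / progressRate();
        }
    }

    class LateScheduler {
        // Illustrative absolute cutoff; LATE uses the 25th percentile of rates.
        static final double SLOW_RATE_THRESHOLD = 0.25;

        // Speculate on the slow task expected to finish last, per LATE.
        Optional<RunningTask> pickSpeculativeCandidate(List<RunningTask> running) {
            return running.stream()
                    .filter(t -> t.progressRate() < SLOW_RATE_THRESHOLD)
                    .max(Comparator.comparingDouble(RunningTask::estimatedTimeToEnd));
        }
    }

Under this model, a task that is suspended by a node's owner keeps accumulating elapsed time with no progress, so its apparent rate collapses and its time-to-end estimate inflates, which is precisely why the heuristic misfires in opportunistic settings.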

Finally, Ko et al. discovered that the loss of intermediate data may result in a considerable performance penalty in Hadoop, even in dedicated computing environments [24]. Their preliminary studies suggested that simple replication approaches, such as the HDFS replication service relied upon in our paper, could incur high replication overhead and be impractical in dedicated cluster environments. In our study, we show that in opportunistic environments, the replication overhead for intermediate data is well paid off by the performance gain resulting from the increased data availability. Future studies on more efficient intermediate data replication will, of course, complement the MOON design well.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we presented MOON, an adaptive system that supports MapReduce jobs in opportunistic environments, where existing MapReduce run-time policies fail to handle frequent node outages. In particular, we demonstrated the benefit of MOON's data and task replication design in greatly improving the QoS of MapReduce when running on a hybrid resource architecture, where a large group of volatile, volunteer resources is supplemented by a small set of dedicated nodes.

Due to testbed limitations in our experiments, we used homogeneous configurations across the nodes. Although node unavailability creates natural heterogeneity, it did not create disparity in hardware speed (such as disk and network bandwidth). In our future work, we plan to evaluate and further enhance MOON in heterogeneous environments. Additionally, we would like to deploy MOON on various production systems with different degrees of volatility and to evaluate a variety of applications in use on these systems. Lastly, this paper investigated single-job execution; it would be interesting future work to study the scheduling and QoS issues of concurrent MapReduce jobs in opportunistic environments.

REFERENCES

[1] D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, 2004.

[2] D. Anderson, "BOINC: A system for public-resource computing and storage," in IEEE/ACM International Workshop on Grid Computing, 2004.

[3] A. Chien, B. Calder, S. Elbert, and K. Bhatia, "Entropia: Architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, vol. 63, 2003.

[4] Apple Inc., "Xgrid," http://www.apple.com/server/macosx/technology/xgrid.html.

[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, 2008.

[6] M. Zhong, K. Shen, and J. Seiferas, "Replication degree customization for high availability," SIGOPS Oper. Syst. Rev., vol. 42, no. 4, 2008.

[7] D. Kondo, M. Taufer, C. Brooks, H. Casanova, and A. Chien, "Characterizing and evaluating desktop grids: An empirical study," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004.

[8] "Hadoop," http://hadoop.apache.org/core/.

[9] J. Strickland, V. Freeh, X. Ma, and S. Vazhkudai, "Governor: Autonomic throttling for aggressive idle resource scavenging," in Proceedings of the 2nd IEEE International Conference on Autonomic Computing, 2005.

[10] A. Gupta, B. Lin, and P. A. Dinda, "Measuring and understanding user comfort with resource borrowing," in HPDC '04: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, 2004, pp. 214–224.

[11] S. Chen and S. Schlosser, "Map-reduce meets wider varieties of applications," Intel Research, Tech. Rep. IRP-TR-08-05, 2008.

[12] A. Matsunaga, M. Tsugawa, and J. Fortes, "CloudBLAST: Combining MapReduce and virtualization on distributed resources for bioinformatics," Microsoft eScience Workshop, 2008.

[13] M. Grant, S. Sehrish, J. Bent, and J. Wang, "Introducing map-reduce to high end computing," 3rd Petascale Data Storage Workshop, Nov. 2008.

[14] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in Proceedings of the 19th Symposium on Operating Systems Principles, 2003.

[15] B. Javadi, D. Kondo, J.-M. Vincent, and D. P. Anderson, "Mining for statistical models of availability in large-scale distributed systems: An empirical study of SETI@home," in 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Sep. 2009.

[16] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in OSDI, 2008.

[17] Advanced Research Computing, "System X," http://www.arc.vt.edu/arc/SystemX/.

[18] A. Adya, W. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. Douceur, J. Howell, J. Lorch, M. Theimer, and R. Wattenhofer, "FARSITE: Federated, available, and reliable storage for an incompletely trusted environment," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation, 2002.

[19] A. Haeberlen, A. Mislove, and P. Druschel, "Glacier: Highly durable, decentralized storage despite massive correlated failures," in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI '05), May 2005.

[20] S. Vazhkudai, X. Ma, V. Freeh, J. Strickland, N. Tammineedi, and S. Scott, "FreeLoader: Scavenging desktop storage resources for bulk, transient data," in Proceedings of Supercomputing, 2005.

[21] G. Fedak, H. He, and F. Cappello, "BitDew: A programmable environment for large-scale data management and distribution," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 1–12.

[22] GridGain Systems, LLC, "GridGain," http://www.gridgain.com/.

[23] Sun Microsystems, "Compute Server," https://computeserver.dev.java.net/.

[24] S. Ko, I. Hoque, B. Cho, and I. Gupta, "On availability of intermediate data in cloud computations," in 12th Workshop on Hot Topics in Operating Systems (HotOS XII), 2009.