
Loose Synchronization for Large-Scale Networked Systems

Jeannie Albrecht, Christopher Tuttle, Alex C. Snoeren, and Amin Vahdat
University of California, San Diego

{jalbrecht, ctuttle, snoeren, vahdat}@cs.ucsd.edu

Abstract

Traditionally, synchronization barriers ensure that no cooperating process advances beyond a specified point until all processes have reached that point. In heterogeneous large-scale distributed computing environments, with unreliable network links and machines that may become overloaded and unresponsive, traditional barrier semantics are too strict to be effective for a range of emerging applications. In this paper, we explore several relaxations, and introduce a partial barrier, a synchronization primitive designed to enhance liveness in loosely coupled networked systems. Partial barriers are robust to variable network conditions; rather than attempting to hide the asynchrony inherent to wide-area settings, they enable appropriate application-level responses. We evaluate the improved performance of partial barriers by integrating them into three publicly available distributed applications running across PlanetLab. Further, we show how partial barriers simplify a re-implementation of MapReduce that targets wide-area environments.

1 Introduction

Today’s resource-hungry applications are increasingly deployed across large-scale networked systems, where significant variations in processor performance, network connectivity, and node reliability are the norm. These installations lie in stark contrast to the tightly-coupled cluster and supercomputer environments traditionally employed by compute-intensive applications. What remains unchanged, however, is the need to synchronize various phases of computation across the participating nodes. The realities of this new environment place additional demands on synchronization mechanisms; while existing techniques provide correct operation, performance is often severely degraded. We show that new, relaxed synchronization semantics can provide significant performance improvements in wide-area distributed systems.

Synchronizing parallel, distributed computation has long been the focus of significant research effort. At a high level, the goal is to ensure that concurrent computation tasks—across independent threads, processors, nodes in a cluster, or spread across the Internet—are able to make independent forward progress while still maintaining some higher-level correctness semantic. Perhaps the simplest synchronization primitive is the barrier [19], which establishes a rendezvous point in the computation that all concurrent nodes must reach before any are allowed to continue. Bulk synchronous parallel programs running on MPPs or clusters of workstations employ barriers to perform computation and communication in phases, transitioning from one consistent view of shared underlying data structures to another.

In past work, barriers and other synchronization primitives have defined strict semantics that ensure safety—i.e., that no node falls out of lock-step with the others—at the expense of liveness. In particular, if employed naively, a parallel computation using barrier synchronization moves forward at the pace of the slowest node and the entire computation must be aborted if any node fails. In closely-coupled supercomputer or cluster environments, skillful programmers optimize their applications to reduce these synchronization overheads by leveraging knowledge about the relative speed of individual nodes. Further, dataflow can be carefully crafted based upon an understanding of transfer times and access latencies to prevent competing demand for the I/O bus. Finally, recovering from failure is frequently expensive; thus, failure during computation is expected to be rare.

In large-scale networked systems—where individual node speeds are unknown and variable, communication topologies are unpredictable, and failure is commonplace—applications must be robust to a range of operating conditions. The performance tuning common in cluster and supercomputing environments is impractical, severely limiting the performance of synchronized concurrent applications. Further, individual node failures are almost inevitable, hence applications are generally engineered to adapt to or recover from failure. Our work focuses on distributed settings where online performance optimization is paramount and correctness is ensured through other mechanisms. We introduce adaptive mechanisms to control the degree of concurrency in cases where parallel execution appears to be degrading performance due to self-interference.


At a high level, emerging distributed applications all implement some form of traditional synchronizing barriers. While not necessarily executing a SIMD instruction stream, different instances of the distributed computation reach a shared point in the global computation. Relative to traditional barriers, however, a node arriving at a barrier need not necessarily block waiting for all other instances to arrive—doing so would likely sacrifice efficiency or even liveness as the system waits for either slow or failed nodes. Similarly, releasing a barrier does not necessarily imply that all nodes should pass through the barrier simultaneously—e.g., simultaneously releasing thousands of nodes to download a software package effectively mounts a denial-of-service attack against the target repository. Instead, applications manage the entry and release semantics of their logical barriers in an ad hoc, application-specific manner.

This paper defines, implements, and evaluates a new synchronization primitive that meets the requirements of a broad range of large-scale network services. In this context, we make the following contributions:

• We introduce a partial barrier, a synchronization primitive designed for heterogeneous failure-prone environments. By relaxing traditional semantics, partial barriers enhance liveness in the face of slow, failed, and self-interfering nodes.

• Based on the observation that the arrival rate at a barrier will often form a heavy-tailed distribution, we design a heuristic to dynamically detect the knee of the curve—the point at which arrivals slow considerably—allowing applications to continue despite slow nodes. We also adapt the rate of release from a barrier to prevent performance degradation in the face of self-interfering processes.

• We adapt three publicly available, wide-area services to use partial barriers for synchronization, and reimplement a fourth. In particular, we use partial barriers to reduce the implementation complexity of an Internet-scale version of MapReduce [9] running across PlanetLab. In all four cases, partial barriers result in a significant performance improvement.

2 Distributed barriers

We refer to nodes as entering a barrier when they reach a point in the computation that requires synchronization. When a barrier releases or fires (we use the terms interchangeably), the blocked nodes are allowed to proceed. According to strict barrier semantics, ensuring safety, i.e., global synchronization, requires all nodes to reach a synchronization point before any node proceeds. In the face of wide variability in performance and prominent failures, however, strict enforcement may force the majority of nodes to block while waiting for a handful of slow or failed nodes to catch up. Many wide-area applications already have the ability to reconfigure themselves to tolerate node failure. We can harness this functionality to avoid excessive waits at barriers: once slow nodes are identified, they can be removed and possibly replaced by quicker nodes for the remainder of the execution.

One of the important questions, then, is determining when to release a barrier, even if all nodes have not arrived at the synchronization point. That is, it is important to dynamically determine the point where waiting for additional nodes to enter the barrier will cost the application more than the benefit brought by any additional arriving nodes in the future. This determination often depends on the semantics of individual applications. Even with full knowledge of application behavior, making the decision appropriately requires future knowledge. One contribution of this work is a technique to dynamically identify “knees” in the node arrival process, i.e., points where the arrival process significantly slows. By communicating these points to the application we allow it to make informed decisions regarding when to release.

A primary contribution of this work is the definition of relaxed barrier semantics to support a variety of wide-area distributed computations. To motivate our proposed extensions, consider the following application scenarios:

Application initialization. Many distributed applications require a startup phase to initialize their local state and to coordinate with remote participants before beginning their operation. Typically, developers introduce an artificial delay judged to be sufficient to allow the system to stabilize. If each node entered a barrier upon completing the initialization phase, developers would be freed from introducing arbitrarily chosen delays into the interactive development/debugging cycle.

Phased computation and communication. Scientific applications and large-scale computations often operate in phases consisting of one or many sequential, local computations, followed by periods of global communication or parallel computation. These computations naturally map to the barrier abstraction: one phase of the computation must complete before proceeding to the next phase. Other applications operate on a work queue that distributes tasks to available machines based on the rate that they complete individual tasks. Here, a barrier may serve as a natural point for distributing work.

Coordinated measurement. Many distributed applications measure network characteristics. However, uncoordinated measurements can self-interfere, leading to wasted effort and incorrect results. Such systems benefit from a mechanism that limits the number of nodes that simultaneously perform probes. Barriers are capable of providing this needed functionality. In this case, the application uses two barriers to delimit a “critical section” of code (e.g., a network measurement). The first barrier releases some maximum number of nodes into the critical section at a time and waits until these nodes reach the second barrier, thereby exiting the critical section, before releasing the next round.

Figure 1: (a) Traditional semantics: All hosts enter the barrier (indicated by the white boxes) and are simultaneously released (indicated by dotted line). (b) Early entry: The barrier fires after 75% of the hosts arrive. (c) Throttled release: Hosts are released in pairs every ∆T seconds. (d) Counting semaphore: No more than 2 processes are simultaneously allowed into a “critical section” (indicated by the grey regions). When one node exits the critical section, another host is allowed to enter.

To further clarify the goal of our work, it is perhaps worthwhile to consider what we are not trying to accomplish. Partial barriers provide only a loose form of group membership [8, 16, 27]. In particular, partial barriers do not provide any ordering of messages relative to barriers as in virtual synchrony [2, 3], nor do partial barriers require that all participants come to consensus regarding a view change [24]. In effect, we strive to construct an abstraction that exposes the asynchrony and the failures prevalent in wide-area computing environments in a manner that allows the application to make dynamic and adaptive decisions as to how to respond.

It is also important to realize that not all applications will benefit from relaxed synchronization semantics. The correctness of certain classes of applications cannot be guaranteed without complete synchronization. For example, some applications may require a minimum number of hosts to complete a computation. Other applications may approximate a measurement (such as network delay), and continuing without all nodes reduces the accuracy of the result. However, many distributed applications, including the ones described in Section 5, can afford to sacrifice global synchronization without negatively impacting the results. These applications either support dynamically degrading the computation, or are robust to failures and can tolerate mid-computation reconfigurations. Our results indicate that for applications that are willing and able to sacrifice safety, our semantics have the potential to improve performance significantly.

3 Design and implementation

Partial barriers are a set of semantic extensions to the traditional barrier abstraction. Our implementation has a simple interface for customizing barrier functionality. This section describes these extended semantics, details our API, and presents the implementation details.

3.1 Design

We define two new barrier semantics to support emerging wide-area computing paradigms and discuss each below.

Early entry – Traditional barriers require all nodes to enter a barrier before any node may pass through, as in Figure 1(a). A partial barrier with early entry is instantiated with a timeout, a minimum percentage of entering nodes, or both. Once the barrier has met either of the specified preconditions, nodes that have already entered the barrier are allowed to pass through without waiting for the remaining slow nodes (Figure 1(b)). Alternatively, an application may instead choose to receive callbacks as nodes enter and manually release the barrier, enabling the evaluation of arbitrary predicates.

Throttled-release – Typically, a barrier releases all nodes simultaneously when a barrier’s precondition is met. A partial barrier with throttled release specifies a rate of release, such as two nodes every ∆T seconds as shown in Figure 1(c). A special variation of throttled-release barriers allows applications to limit the number of nodes that simultaneously exist in a “critical section” of code, creating an instance of a counting semaphore [10] (shown in Figure 1(d)), which may be used, for example, to throttle the number of nodes that simultaneously perform network measurements or software downloads. A critical distinction between traditional counting semaphores and partial barriers, however, is our support for failure. For instance, if a sufficient number of slow or soon-to-fail nodes pass a counting semaphore, they will limit access to other participants, possibly forever. Thus, as with early-entry barriers, throttled-release barriers eventually time out slow or failed nodes, allowing the system as a whole to make forward progress despite individual failures.

One issue that our proposed semantics introduce that does not arise with strict barrier semantics is handling nodes performing late entry, i.e., arriving at an already released barrier. We support two options to address this case: i) pass-through semantics that allow the node to proceed with the next phase of the computation even though it arrived late; ii) catch-up semantics that issue an exception allowing the application to reintegrate the node into the mainline computation in an application-specific manner, which may involve skipping ahead to the next barrier (omitting the intervening section of code) in an effort to catch up to the other nodes.

class Barrier {
    Barrier(string name, int max, int timeout,
            int percent, int minWait);
    static void setManager(string Hostname);
    void enter(string label, string Hostname);
    void setEnterCallback(bool (*callbackFunc)(string label,
            string Hostname, bool default), int timeout);
    map<string label, string Hostname> getHosts(void);
}

Figure 2: Barrier instantiation API.

3.2 Partial barrier API

Figure 2 summarizes the partial barrier API from an application’s perspective. Each participating node initializes a barrier with a constructor that takes the following arguments: name, max, timeout, percent, and minWait. name is a globally unique identifier for the barrier. max specifies the maximum number of participants in the barrier. (While we do not require a priori knowledge of participant identity, it would be straightforward to add.) The timeout in milliseconds sets the maximum time that can pass from the point where the first node enters a barrier before the barrier is released. The percent field similarly captures a minimum percentage of the maximum number of nodes that must reach the barrier to activate early release. The minWait field is associated with the percent field and specifies a minimum amount of time to wait (even if the specified percentage of nodes has been reached) before releasing the barrier with less than the maximum number of nodes. Without this field, the barrier would deterministically be released upon reaching the threshold percent of entering nodes even when all nodes are entering rapidly. However, the barrier is always released if max nodes arrive, regardless of minWait. The timeout field overrides the percent and minWait fields; the barrier will fire if the timeout is reached, regardless of the percentage of nodes entering the barrier. The last three parameters to the constructor are optional—if left unspecified the barrier will operate as a traditional synchronization barrier.

Coordination of barrier participants is controlled by a barrier manager whose identity is specified at startup by the setManager() method. Participants call the barrier’s enter() method and pass in their Hostname and label when they reach the appropriate point in their execution. (The label argument supports advanced functionality such as load balancing for MapReduce, as described in Section 5.4.) The participant’s enter() method notifies the manager that the particular node has reached the synchronization point. Our implementation supports blocking calls to enter() (as described here) or optionally a callback-based mechanism where the entering node is free to perform other functionality until the appropriate callback is received.
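To make the entry semantics concrete, the following sketch shows how a participating node might use this API between two phases of a computation. It is our illustration rather than code from the paper; the manager hostname and the phase functions are hypothetical.

#include <string>
using std::string;

// Hypothetical application phases.
void installSoftware()  { /* ... fetch and unpack packages ... */ }
void startMeasurement() { /* ... next phase of the experiment ... */ }

void runNode(const string& myHostname) {
    Barrier::setManager("manager.example.org");  // barrier manager (hypothetical)

    // Fire once 80% of 100 nodes enter, but hold for at least 5 seconds;
    // fire unconditionally after 60 seconds.
    Barrier installDone("install-done", 100, 60000, 80, 5000);

    installSoftware();                         // phase 1
    installDone.enter("install", myHostname);  // blocks until the barrier fires
    startMeasurement();                        // phase 2
}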

While our standard API provides simplistic supportfor the early release of a barrier, an application maymaintain its own state to determine when a partic-ular barrier should fire and to manage any side ef-fects associated with barrier entry or release. For in-stance, a barrier manager may wish to kill processesarriving late to a particular (already released) bar-rier. To support application-specific functionality, thesetEnterCallback() method specifies a function tobe called when any node enters a barrier. The callbacktakes the label and Hostname passed to the enter()method and a boolean variable that specifies whetherthe manager would normally release the barrier uponthis entry. The callback function returns a boolean valueto specify whether the barrier should actually be re-leased or not, potentially overriding the manager’s de-cision. A second argument to setEnterCallback()called timeout specifies a maximum amount of timethat may pass before successive invocations of the call-back. This prevents the situation where the applicationwaits a potentially infinite amount of time for the nextnode to arrive before deciding to release the barrier. Weuse this callback API to implement our adaptive releasetechniques presented in Section 4.
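As a sketch of an application-defined release predicate (ours, with a hypothetical policy and host set): force release once a quorum of preferred hosts has entered, and otherwise defer to the manager’s built-in decision.

#include <set>
#include <string>
using std::string;

// Hypothetical policy state: arrivals from a preferred set of hosts.
static std::set<string> preferredHosts = { "nodeA", "nodeB", "nodeC" };
static int preferredArrivals = 0;

// Invoked by the manager on each entry, and at least every `timeout` ms.
bool releaseOnPreferredQuorum(string label, string hostname, bool managerDefault) {
    if (preferredHosts.count(hostname))
        preferredArrivals++;
    if (preferredArrivals >= 2)
        return true;            // quorum reached: release the barrier now
    return managerDefault;      // otherwise keep the manager's decision
}

// Registration, with the callback re-invoked at least every 10 seconds:
//   barrier.setEnterCallback(releaseOnPreferredQuorum, 10000);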

Barrier participants may wish to learn the identity of all hosts that passed through a barrier, similar to (but with relaxed semantics from) view advancement or GBCAST in virtual synchrony [4]. The getHosts() method returns a map of Hostnames and labels through a remote procedure call with the barrier manager. If many hosts are interested in membership information, it can optionally be propagated from the barrier manager to all nodes by default as part of the barrier release operation.

Figure 3 describes a subclass of Barrier, called ThrottleBarrier, with throttled-release semantics. These semantics allow for a predetermined subset of the maximum number of nodes to be released at a specified rate. The methods setThrottleReleasePercent() and setThrottleReleaseCount() periodically release a percentage and number of nodes, respectively, once the barrier fires. setThrottleReleaseTimeout() specifies the periodicity of release.


class ThrottleBarrier extends Barrier {
    void setThrottleReleasePercent(int percent);
    void setThrottleReleaseCount(int count);
    void setThrottleReleaseTimeout(int timeout);
}

class SemaphoreBarrier extends Barrier {
    void setSemaphoreCount(int count);
    void setSemaphoreTimeout(int timeout);
    void release(string label, string Hostname);
    void setReleaseCallback(int (*callbackFunc)(string label,
            string Hostname, int default), int timeout);
}

Figure 3: ThrottleBarrier and SemaphoreBarrier API.

Figure 3 also details a variant of throttled-release barriers, SemaphoreBarrier, which specifies a maximum number of nodes that may simultaneously enter a critical section. A SemaphoreBarrier extends the throttled-release semantics further by placing a barrier at the beginning and end of a critical section of activity to ensure that only a specific number of nodes pass into the critical section simultaneously. One key difference for this type of barrier is that it does not require any minimum number of nodes to enter the barrier before beginning to release nodes into the subsequent critical section. It simply mandates a maximum number of nodes that may simultaneously enter the critical section. The setSemaphoreCount() method sets this maximum number. Nodes call the barrier’s release() method upon completing the subsequent critical section, allowing the barrier to release additional nodes. setSemaphoreTimeout() allows for timing out nodes that enter the critical section but do not complete within a maximum amount of time. In this case, they are assumed to have failed, enabling the release of additional nodes. The setReleaseCallback() method enables application-specific release policies and timeout of slow or failed nodes in the critical section. The callback function in setReleaseCallback() returns the number of hosts to be released.
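A minimal sketch of how a node might use this interface to bound concurrent network probes follows. The barrier name, limits, and probe function are hypothetical, and we assume the constructor takes the same arguments as Barrier in Figure 2.

#include <string>
using std::string;

void runNetworkProbe() { /* ... hypothetical measurement ... */ }

void probeWithAdmissionControl(const string& myHostname) {
    // No minimum entry requirement: nodes flow through as capacity allows.
    SemaphoreBarrier probes("probe-section", 100, 0, 0, 0);
    probes.setSemaphoreCount(8);          // at most 8 nodes probing at once
    probes.setSemaphoreTimeout(30000);    // assume failure after 30 seconds

    probes.enter("probe", myHostname);    // blocks until admitted
    runNetworkProbe();                    // critical section
    probes.release("probe", myHostname);  // let the manager admit another node
}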

3.3 Implementation

Partial barrier participants implement the interface described above while a separate barrier manager coordinates communication across nodes. Our implementation of partial barriers consists of approximately 3,000 lines of C++ code. At a high level, a node calling enter() transmits a BARRIER REACHED message using TCP to the manager with the calling host’s unique identifier, barrier name, and label. The manager updates its local state for the barrier, including the set of nodes that have thus far entered the barrier, and performs appropriate callbacks as necessary. The manager starts a timer to support various release semantics if this is the first node entering the barrier, and subsequently records the inter-arrival times between nodes entering the barrier.

If a sufficient number of nodes enter the barrier or a specified amount of time passes, the manager transmits FIRE messages using TCP to all nodes that have entered the barrier. For throttled-release barriers, the manager releases the specified number of nodes from the barrier in FIFO order. The manager also sets a timer as specified by setThrottleReleaseTimeout() to release additional nodes from the barrier when appropriate.

For semaphore barriers, the manager releases the number of nodes specified by setSemaphoreCount() and, if specified by setSemaphoreTimeout(), also sets a timer to expire for each node entering the critical section. Each call to enter() transmits a SEMAPHORE REACHED message to the manager. In response, the manager starts the timer associated with the calling node. If the semaphore timer associated with the node expires before receiving the corresponding SEMAPHORE RELEASED message, the manager assumes that node has either failed or is proceeding sufficiently slowly that an additional node should be released into the critical section. Each SEMAPHORE RELEASED message releases one additional node into the critical section.

For all barriers, the manager must gracefully handle nodes arriving late, i.e., after the barrier has fired. We employ two techniques to address this case. For pass-through semantics, the manager transmits a LATE FIRE message to the calling node, releasing it from the barrier. In catch-up semantics, the manager issues an exception and transmits a CATCH UP message to the node. Catch-up semantics allow applications to respond to the exception and determine how to reintegrate the node into the computation in an application-specific manner.

The type of barrier—pass-through or catch-up—is specified at barrier creation time (intentionally not shown in Figure 2 for clarity). Nodes calling enter() may register local callbacks for the arrival of either LATE FIRE or CATCH UP messages for application-specific re-synchronization with the mainline computation, or perhaps to exit the local computation altogether if re-synchronization is not possible.
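One way a late-arriving node might react to these two messages, assuming a hypothetical client-side registration hook; the helper names are ours.

#include <string>
using std::string;

// Pass-through (LATE FIRE): the node was released late but simply continues.
void onLateFire() {
    // proceed with the next phase as if released normally
}

// Catch-up (CATCH UP): abandon the missed section of code and rejoin the
// mainline computation at the next barrier.
void onCatchUp(Barrier& nextBarrier, const string& myHostname) {
    nextBarrier.enter("caught-up", myHostname);
}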

3.4 Fault tolerance

One concern with our centralized barrier manager is tolerating manager faults. We improve overall system robustness with support for replicated managers. Our algorithm is a variant of traditional primary/backup systems: each participant maintains an ordered list of barrier controllers. Any message sent from a client to the logical barrier manager is sent to all controllers on the list. Because application-specific entry callbacks may be non-deterministic, a straightforward replicated state machine approach where each barrier controller simultaneously decides when to fire is insufficient. Instead, the primary controller forwards all BARRIER REACHED messages to the backup controllers. These messages act as implicit “keep alive” messages from the primary. If a backup controller receives BARRIER REACHED messages from clients but not the primary for a sufficient period of time, the first backup determines the primary has failed and assumes the role of primary controller. The second backup takes over should the first backup also fail, and so on. Note that our approach admits the case where multiple controllers simultaneously act as the primary controller for a short period of time. Clients ignore duplicate FIRE messages for the same barrier, so progress is assured, and one controller eventually emerges as primary.
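A compact sketch of the backup-side failure detector implied by this scheme follows. It is our illustration, not code from the implementation: the class, method names, and the 10-second grace period are all assumptions. Client BARRIER REACHED traffic serves as evidence that the primary should be forwarding messages, so a backup that hears from clients but not from the primary promotes itself.

#include <chrono>

class BackupController {
    using Clock = std::chrono::steady_clock;
    Clock::time_point lastClientMsg_  = Clock::now();
    Clock::time_point lastPrimaryMsg_ = Clock::now();
    bool isPrimary_ = false;
    static constexpr std::chrono::seconds kGrace{10};  // assumed grace period
public:
    // A BARRIER REACHED message arrived directly from a participant.
    void onClientReached()  { lastClientMsg_ = Clock::now(); maybePromote(); }
    // The primary forwarded a BARRIER REACHED message (implicit keep-alive).
    void onPrimaryForward() { lastPrimaryMsg_ = Clock::now(); }
private:
    void maybePromote() {
        // Clients are alive but the primary has been silent too long: take
        // over. Briefly overlapping primaries are benign because clients
        // ignore duplicate FIRE messages.
        if (!isPrimary_ && lastClientMsg_ - lastPrimaryMsg_ > kGrace)
            isPrimary_ = true;
    }
};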

Although using the replicated manager scheme described above lowers the probability of losing BARRIER REACHED messages, it does not provide any increased reliability with respect to messages sent from the manager(s) to the remote hosts. All messages are sent using reliable TCP connections. If a connection fails between the manager and a remote host, however, messages may be lost. For example, suppose the TCP connections between the manager and some subset of the remote hosts break just after the manager sends a FIRE message to all participants. Similarly, if a group of nodes fails after entering the barrier, but before receiving the FIRE message, the failure may go undetected until after the controller transmits the FIRE messages. In these cases, the manager will attempt to send FIRE messages to all participants, and detect the TCP failures after the connections time out. Such ambiguity is unavoidable in asynchronous systems; the manager simply informs the application of the failure(s) via a callback and lets the application decide the appropriate recovery action. As with any other failure, the application may choose to continue execution and ignore the failures, attempt to find new hosts to replace the failed ones, or even abort the execution entirely.

4 Adaptive release

Unfortunately, our extended barrier semantics introduce additional parameters: the threshold for early release and the concurrency level in throttled release. Experience has shown it is often difficult to select values that are appropriate across heterogeneous and changing network conditions. Hence, we provide adaptive mechanisms to dynamically determine appropriate values.

4.1 Early release

There is a fundamental tradeoff in specifying an early-release threshold. If the threshold is too large, the application will wait unnecessarily for a relatively modest number of additional nodes to enter the barrier; if too small, the application will lose the opportunity to have participation from other nodes had it just waited a bit longer. Thus, we use the barrier’s callback mechanism to determine release points in response to varying network conditions and node performance.

Figure 4: Dynamically determining the knee of arriving processes. Vertical bars indicate a knee detection.

In our experience, the distribution of nodes’ arrivals at a barrier is often heavy-tailed: a relatively large portion of nodes arrive at the barrier quickly, with a long tail of stragglers entering late. In this case, many of our target applications would wish to dynamically determine the “knee” of a particular arrival process and release the barrier upon reaching it. Unfortunately, while it can be straightforward to manually determine the knee off-line once all of the data for an arrival process is available, it is difficult to determine this point on-line.

Our heuristic, inspired by TCP retransmission timers and MONET [1], maintains an exponentially weighted moving average (EWMA) of the host arrival times (arr), and another EWMA of the deviation from this average for each measurement (arrvar). As each host arrives at the barrier, we record the arrival time of the host, as well as the deviation from the average. Then we recompute the EWMA for both arr and arrvar, and use the values to compute a maximum wait threshold of arr + 4 * arrvar. This threshold indicates the maximum time we are willing to wait for the next host to arrive before firing the barrier. If the next host does not arrive at the barrier before the maximum wait threshold passes, we assume that a knee has been reached. Figure 4 illustrates how these values interact for a simulated group of 100 hosts entering a barrier with randomly generated exponential inter-arrival times. Notice that a knee occurs each time the host arrival time intersects the threshold line.

With the capability to detect multiple knees, it is important to provide a way for applications to pick the right knee and avoid firing earlier or later than desired. Aggressive applications may choose to fire the barrier when the first knee is detected. Conservative applications may wish to wait until some specified amount of time has passed, or a minimum percentage of hosts have entered the barrier, before firing. To support both aggressive and conservative applications, our algorithm allows the application to specify a minimum percentage of hosts, a minimum waiting time, or both for each barrier. If an application specifies a minimum waiting time of 5 seconds, knees detected before 5 seconds are ignored. Similarly, if a minimum host percentage of 50% is specified, the knee detector ignores knees detected before 50% of the total hosts have entered the barrier. If both values are specified, the knee detector uses the more conservative threshold so that both requirements (time and host percentage) are met before firing.

One variation in our approach compared to other related approaches is the values for the weights in the moving averages. In the RFC for computing TCP retransmission timers [7], the EWMA of the rtt places a heavier weight (0.875) on previous delay measurements. This value works well for TCP since the average delay is expected to remain relatively constant over time. In our case, however, we expect the average arrival time to increase, and thus we decreased the weight on previous measurements of arr to 0.70. This allows our arr value to more closely follow the actual data being recorded. When measuring the average deviation, which is computed by averaging |sample − arr| (where sample represents the latest arrival time recorded), we used a weight of 0.75 for previous measurements, which is the same weight used in TCP for the variation in rtt.
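For concreteness, a minimal sketch of the detector described above, using the weights just discussed (0.70 on the previous arr, 0.75 on the previous arrvar) and the arr + 4 * arrvar threshold; the class and method names are ours rather than the implementation’s.

#include <cmath>

class KneeDetector {
    double arr_ = 0, arrvar_ = 0;  // EWMAs of arrival time and deviation
    bool first_ = true;
public:
    // Record one arrival, measured in seconds since the first node entered.
    void recordArrival(double t) {
        if (first_) { arr_ = t; arrvar_ = 0; first_ = false; return; }
        // Update the deviation against the old average first, as in TCP.
        arrvar_ = 0.75 * arrvar_ + 0.25 * std::fabs(t - arr_);
        arr_    = 0.70 * arr_    + 0.30 * t;
    }
    // Latest point in time we are willing to wait for the next arrival.
    double threshold() const { return arr_ + 4 * arrvar_; }
    // A knee is declared when `now` passes the threshold with no new arrival.
    bool kneeReached(double now) const { return !first_ && now > threshold(); }
};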

4.2 Throttled release

We also employ an adaptive method to dynamically adjust the amount of concurrency in the “critical section” of a semaphore barrier. In many applications, it is impractical to select a single value which performs well under all conditions. Similar in spirit to SEDA’s thread-pool controller [35], our adaptive release algorithm selects an appropriate concurrency level based upon recent release times. The algorithm starts with a low level of concurrency and increases the degree of concurrency until response times worsen; it then backs off and repeats, oscillating about the optimum value.

Mathematically, the algorithm compares the medians of the distributions of recent and overall release times. For example, if there are 15 hosts in the critical section when the 45th host is released, the algorithm computes the median release time of the last 15 releases, and of all 45. If the latest median is more than 50% greater than the overall median, no additional hosts are released, thus reducing the level of concurrency to 14 hosts. If the latest median is more than 10% but less than 50% greater than the overall median, one host is released, maintaining a level of concurrency of 15. In all other cases, two hosts are released, increasing the concurrency level to 16. The thresholds and differences in size are selected to increase the degree of concurrency whenever possible, but keep the magnitude of each change small.
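The rule above translates almost directly into code. The sketch below is ours and assumes the controller is consulted each time a host exits the critical section, with `times` holding all release times observed so far and `window` the current concurrency level.

#include <algorithm>
#include <vector>

// Median of a copy of v (v is small: one entry per released host).
static double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// How many hosts to release after one host leaves the critical section.
int hostsToRelease(const std::vector<double>& times, size_t window) {
    if (times.size() < window) return 2;      // still ramping up
    std::vector<double> recent(times.end() - window, times.end());
    double recentMed = median(recent);
    double allMed    = median(times);
    if (recentMed > 1.5 * allMed) return 0;   // worsening: shrink concurrency
    if (recentMed > 1.1 * allMed) return 1;   // slightly worse: hold steady
    return 2;                                 // otherwise grow by one
}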

5 Applications

We integrated partial barriers into three wide-area distributed applications with a range of synchronization requirements. We reimplemented a fourth application whose source code was unavailable. While a detailed discussion of these applications is beyond the scope of this paper, we present a brief overview to facilitate understanding of our performance evaluation in Section 6.

5.1 Plush

Plush [29] is a tool for configuring and managing large-scale distributed applications in environments such as PlanetLab [28] and the Grid [13]. Users specify a description of: i) a set of nodes to run their application on; ii) the set of software packages, including the application itself and any necessary data files, that should be installed on each node; and iii) a directed acyclic graph of individual processes that should be run, in order, on each node.

Plush may be used to manage long-running services or interactive experimental evaluations of distributed applications. In the latter case, developers typically configure a set of machines (perhaps installing the latest version of the application binary) and then start the processes at approximately the same time on all participating nodes. A barrier may be naturally inserted between the tasks of node configuration and process startup. However, in the heterogeneous PlanetLab environment, the time to configure a set of nodes with the requisite software can vary widely, or the configuration may fail entirely at individual hosts. In this case, it is often beneficial to time out the “software configuration/install” barrier and either proceed with the available nodes or recruit additional nodes.

5.2 Bullet

Bullet [21] is an overlay-based large-file distribution infrastructure. In Bullet, a source transmits a file to receivers spread across the Internet. As part of the bootstrap process, all receivers join the overlay by initially contacting the source before settling on their final position in the overlay. The published quantitative evaluation of Bullet presents a number of experiments across PlanetLab. However, to make performance results experimentally meaningful when measuring behavior across a large number of receivers, the authors hard-coded a 30-second delay at the sender from the time that it starts to the time that it begins data transmission. While typically sufficient for the particular targeted configuration, the timeout would often be too long, unnecessarily extending turnaround time for experimentation and interactive debugging. Depending on overall system load and the number of participants, the timeout would also sometimes be too short, meaning that some nodes would not complete the join process by the time the sender begins transmitting. While this latter case is not a problem for the correct behavior of the system, it makes interpreting experimental results difficult.

To address this limitation, we integrated partial barriers into the Bullet implementation. This integration was straightforward. Once the join process completes, each node simply enters a barrier. The Bullet sender registers for a callback with the barrier manager to be notified of a barrier release, at which point it begins transmitting data. By calling the getHosts() method, the sender can record the identities of nodes that should be considered in interpreting the experimental results. The barrier manager notes the identity of nodes entering the barrier late and instructs them to exit rather than proceed with retrieving the file.
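A sketch of the sender-side logic as we understand it; for brevity it uses the blocking form of enter() rather than the callback-based notification the text describes, assumes getHosts() maps labels to hostnames, and the transmission helper is a hypothetical stand-in for Bullet internals.

#include <map>
#include <string>
using std::string;

void beginTransmission(const std::map<string, string>& receivers) {
    /* ... start Bullet data transmission to these receivers ... */
}

void senderMain(Barrier& startup, const string& senderHostname) {
    // Block until the startup barrier fires; the knee detector decides when.
    startup.enter("sender", senderHostname);
    // Record exactly which receivers were released, to scope the results.
    std::map<string, string> receivers = startup.getHosts();
    beginTransmission(receivers);
}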

5.3 EMAN

EMAN [12] is a publicly available software package used for reconstructing 3D models of particles using 2D electron micrographs. The program takes a 2D micrograph image as input and then repeatedly runs a “refinement” process on the data to create a 3D model. Each iteration of the refinement consists of both computationally inexpensive sequential computations and computationally expensive parallel computations.

Barriers separate sequential and parallel stages of computation in EMAN. Using partial barrier semantics adds the benefit of being able to detect slow nodes, allowing the application to redistribute tasks to faster machines during the parallel phase of computation. In addition to slow processors, the knee detector also detects machines with low bandwidth capacities and reallocates their work to machines with higher bandwidth. In our test, each refinement phase requires approximately 240 MB of data to be transferred to each node, and machines with low-bandwidth links have a significant impact on the overall completion time if their work is not reallocated to different machines. Since the sequential phases run on a single machine, partial barriers are most applicable to the parallel phases of computation. We use the publicly available version of EMAN for our experiments. We wrote a 50-line Perl script to run the parallel phase of computation on 98 PlanetLab machines using Plush.

Figure 5: MapReduce execution. As each map task completes, enter(i) is called with the unique task identifier. Once all m tasks have entered the Map barrier, it is released. When the MapReduce controller is notified that the Map barrier has fired, the r reduce tasks are distributed and begin execution. When all r reduce tasks have entered the Reduce barrier, MapReduce is complete. In both barriers, callback(i) informs the MapReduce controller of task completions.

5.4 MapReduce

MapReduce [9] is a toolkit for application-specific parallel processing of large volumes of data. The model involves partitioning the input data into smaller splits and spreading them across a cluster of worker nodes. Each worker node applies a map function to the splits of data that it receives, producing intermediate key/value pairs that are periodically written to specific locations on disk. The MapReduce master node tracks these locations and eventually notifies another set of worker nodes that intermediate data is ready. This second set of workers aggregates the data and passes it to the reduce function. This function processes the data to produce a final output file.

Our implementation of MapReduce leverages partial barriers to manage phases of the computation and to orchestrate the flow of data among nodes across the wide area. In our design, we have m map tasks and corresponding input files, n total nodes hosting the computation, and r reduce tasks. A central MapReduce controller distributes the m split input files to a set of available nodes (our current implementation runs on PlanetLab), and spawns the map process on each node. When the map tasks finish, intermediate files are written back to a central repository and then redistributed to the r hosts that eventually execute the r reduce tasks. There are a number of natural barriers in this application, as shown in Figure 5, corresponding to the completion of: i) the initial distribution of m split files to appropriate nodes; ii) executing the m map functions; iii) the redistribution of the intermediate files to appropriate nodes; and iv) executing the r reduce functions.

As with the original MapReduce work, the load balancing aspects corresponding to barriers (ii) and (iv) from the previous paragraph are of particular interest. Recall that although there are m map tasks, the same physical host may execute multiple map tasks. In these cases, the goal is not necessarily to wait for all n hosts to reach the barrier, but for all m or r logical tasks to complete. Thus, we extended the original barrier entry semantics described in Section 3 to support synchronizing barriers at the level of a set of logical, uniquely named tasks or processes, rather than a set of physical hosts. To support this, we simply invoke the enter() method of the barrier API (see Figure 2) upon completing a particular map or reduce function. In addition to the physical hostname, we send a label corresponding to a globally unique name for the particular map or reduce task. Thus, rather than waiting for n hosts, the barrier instead waits for m or r unique labels to enter the barrier before firing.
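For illustration, a worker hosting several logical map tasks might enter the Map barrier once per task, keyed by a unique task label. The sketch below is ours and assumes the callback-based, non-blocking form of enter() mentioned in Section 3.2, so each call returns after notifying the manager; the helper names are hypothetical.

#include <string>
#include <vector>
using std::string;

void runMapFunction(int taskId) { /* ... hypothetical map work ... */ }

void runMapTasks(Barrier& mapBarrier, const string& myHostname,
                 const std::vector<int>& myTaskIds) {
    for (int id : myTaskIds) {
        runMapFunction(id);
        // The label, not the hostname, is what the barrier counts: it fires
        // once m unique task labels have entered.
        mapBarrier.enter("map-" + std::to_string(id), myHostname);
    }
}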

For a sufficiently large and heterogeneous distributed system, performance at individual nodes varies widely. Such variability often results in a heavy-tailed distribution for the completion of individual tasks, meaning that while most tasks will complete quickly, the overall time will be dominated by the performance of the slow nodes. The original MapReduce work noted that one of the common problems experienced during execution is the presence of “straggler” nodes that take an unusually long time to complete a map or reduce task. Although the authors mentioned an application-specific solution to this problem, by using partial barriers in our implementation we were able to provide a more general solution that achieved the same results. We use the arrival rate of map/reduce tasks at their respective barriers to respawn a subset of the tasks that were proceeding slowly.

By using the knee detector described in Section 4, we are able to dynamically determine the transition point between rapid arrivals and the long tail of stragglers. However, rather than releasing the barrier at this point, the MapReduce controller receives a callback from the barrier manager and instead performs load rebalancing by spawning additional copies of outstanding tasks on nodes disjoint from the ones hosting the slower tasks (potentially first distributing the necessary input/intermediate files). As in the original implementation of MapReduce, the barrier is not concerned with which copies of the computation complete first; the goal is for all m or r tasks to complete as quickly as possible.

Figure 6: Scalability of the centralized barrier implementation. The “All hosts” line shows the average time across 5 runs for the barrier manager to receive BARRIER REACHED messages from all hosts. The “90th percentile” line shows the average time across 5 runs for the barrier manager to receive BARRIER REACHED messages from 90% of all hosts.

The completion rate of tasks also provides an estimate of the throughput available at individual nodes, influencing future task placement decisions. Nodes with high throughput are assigned more tasks in the future. Our performance evaluation in Section 6 quantifies the benefits of this approach. Cluster-based MapReduce [9] also found this rebalancing to be critical for improving performance in more tightly controlled cluster settings, but did not describe a precise approach for determining when to spawn additional instances of a given computation.

6 Evaluation

The goal of this section is to quantify the utility of partial barriers for a range of application scenarios. For all experiments, we randomly chose subsets from a pool of 130 responsive PlanetLab nodes.

6.1 Scalability

To estimate baseline barrier scalability, we measure the time it takes to move between two barriers for an increasing number of hosts. In this experiment, the controller waits for all hosts to reach the first barrier. All hosts are released, and then immediately enter the second barrier. We measure the time between when the barrier manager sends the first FIRE message for the first barrier and receives the last BARRIER REACHED message for the second barrier. No partial barrier semantics are used for these measurements. Figure 6 shows the average completion time for varying numbers of nodes, across a total of five runs for each data point, with standard deviation.

Notice that even for 100 nodes, the average time for the barrier manager to receive the last BARRIER REACHED message for the second barrier is approximately 1 second. The large standard deviation values indicate that there is much variability in our results. This is due to the presence of straggler nodes that delay the firing by several seconds or more. The 90th percentile, on the other hand, has little variation and is relatively constant as the number of participants increases. This augurs well for the potential of partial barrier semantics to improve performance in the wide area. Overall, we are satisfied with the performance of our centralized barriers for 100 nodes; we expect to use hierarchy to scale significantly further.

6.2 Admission control

Next, we consider the benefits of a semaphore barrier to perform admission control for parallel software installation in Plush. Plush configures a set of wide-area hosts to execute a particular application. This process often involves installing the same software packages located on a central server. Simultaneously performing this install across hundreds of nodes can lead to thrashing at the server hosting the packages. The overall goal is to ensure sufficient parallelism such that the server is saturated (without thrashing) while balancing the average time to complete the download across all participants.

For our results, we use Plush to install the same 10-MB file on 100 PlanetLab hosts while varying the number of simultaneous downloads using a semaphore barrier. Figure 7 shows the results of this experiment. The data indicates that limiting parallelism can improve the overall completion rate. Releasing too few hosts does not fully consume server resources, while releasing too many taxes available resources, increasing the time to first completion. This is evident in the graph since 25 simultaneous downloads finish more quickly than both 100 and 10 simultaneous transfers.

While statically defining the number of hosts allowed to perform simultaneous downloads works well for our file transfer experiment, varying network conditions mean that statically picking any single value is unlikely to perform well under all conditions. Some applications may benefit from a more dynamic throttled-release technique that attempts to find the optimal number of hosts that maximizes throughput from the server without causing saturation. The “Adaptive Simultaneous Transfers” line in Figure 7 shows the performance of our adaptive release technique. In this example, the initial concurrency level is 15, and the level varies according to the duration of each transfer. In this experiment the adaptive algorithm line reaches 100% before the lines representing a fixed concurrency level of 10 or 100, but the algorithm was too conservative to match the optimal static level of 25 given the network conditions at the time.

Figure 7: Software transfer from a high-speed server to PlanetLab hosts using a SemaphoreBarrier to limit the number of simultaneous file transfers.

6.3 Detecting knees

In this section we consider our ability to dynamically detect knees in heavy-tailed arrival processes. We observe that when choosing a substantial number of time-shared PlanetLab hosts to perform the same amount of work, the completion time varies widely, often following an exponential distribution. This variation makes it difficult to coordinate distributed computation. To quantify our ability to more adaptively set timeouts and coordinate the behavior of multiple wide-area nodes, we used a barrier to synchronize senders and receivers in Bullet while running across a varying number of PlanetLab nodes. We set the barrier to dynamically detect the knee in the arrival process as described in Section 4. Upon reaching the knee, nodes already in the barrier are released; one side effect is that the sender begins data transmission. Bullet ignores all late-arriving nodes.

Figure 8 plots the cumulative distribution of receivers that enter the startup barrier on the y-axis as a function of time progressing on the x-axis. Each curve corresponds to an experiment with 50, 90, or 130 PlanetLab receivers in the initial target set. The goal is to run with as many receivers as possible from the given initial set without waiting an undue amount of time for a small number of stragglers to complete startup. Interestingly, it is insufficient to filter for any static set of known "slow" nodes, as performance tends to vary on fairly small time scales and can be influenced by multiple factors at a particular node (including available CPU, memory, and changing network conditions). Thus, manually choosing an appropriate static set may be sufficient for one particular batch of runs but not likely the next.

Figure 8: A startup barrier that regulates nodes joining a large-scale overlay. Vertical bars indicate when the barrier detects a knee and fires. [Plot: number of hosts (0-120) vs. elapsed time (0-30 sec); curves for 50, 90, and 130 nodes.]

Vertical lines in Figure 8 indicate where the barrier manager detects a knee and releases the barrier. Although we ran the experiment multiple times, for clarity we plot the results from a single run. While differences in time of day or initial node characteristics affect the quantitative behavior, the general shape of the curve is maintained. In all of our experiments, we are satisfied with our ability to dynamically determine the knee of the arrival process. The experiments are typically able to proceed with 85-90% of the initial set participating.

6.4 EMAN

For our next evaluation, we added a partial barrier to the parallel computation of EMAN's refinement process. Upon detecting a knee, we reallocate tasks to faster machines. Figure 9 shows the results of running EMAN with and without partial barrier semantics. In this experiment we ran EMAN on the 98 most responsive PlanetLab machines. The workflow consists of several serial tasks (not shown) and a 98-way image classification step, run in parallel. Each participant first downloads a 40-MB archive containing the EMAN executables and a wrapper script. After unpacking the archive, each node downloads a unique 200-MB data file and begins running the classification process. At the end of the computation, each node generates 77 output files stored on the local disk, which EMAN merges into 77 "master" files once all tasks complete.

After 801 seconds, the barrier detects a knee and reallocates the remaining tasks to idle machines among those initially allocated to the experiment; these finish shortly afterward. The "knee" in the curve at approximately 300 seconds indicates that around 21 hosts have good connectivity to the data repository, while the rest have longer transfer times. While the knee detection algorithm detects a knee at 300 seconds, a minimum threshold of 60% prevents reconfiguration. The second knee is detected at 801 seconds, which starts a reconfiguration of 10 tasks. These tasks complete by 900 seconds, while the original set continues past 2700 seconds, for an overall speedup of more than 3.

Figure 9: EMAN. Knee detected at 801 seconds. Total runtime is over 2700 seconds. [Plot: task count (0-90) vs. elapsed time (0-1800 sec); curves with and without reallocation.]
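The 60% minimum threshold described above might be expressed as a simple guard like the following; the function name, signature, and default are hypothetical illustrations rather than the paper's implementation.

```python
def should_reconfigure(knee_index, tasks_completed, tasks_total,
                       min_fraction=0.6):
    # Act on a detected knee only once enough of the total work has
    # completed, so that an early knee (like the one at 300 seconds
    # in the EMAN run) does not trigger a premature reallocation.
    return (knee_index is not None
            and tasks_completed / tasks_total >= min_fraction)
```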

6.5 MapReduce

We now consider an alternative use of partial barriers: to assist not with the synchronization of physical hosts or processors, but with load balancing of logical tasks spread across cooperating wide-area nodes. Further, we wish to determine whether we can dynamically detect knees in the arrival rate of individual tasks, spawning additional copies of tasks that appear to be proceeding slowly on a subset of nodes.

We conduct these experiments with our implementation of MapReduce (see Section 5.4) with m = 480 map tasks and r = 30 reduce tasks running across n = 30 PlanetLab hosts. During each of the map and reduce rounds, the MapReduce controller evenly partitions the tasks over the available hosts and starts the tasks asynchronously. For this experiment, each map task simply reads 2000 random words from a local file and counts the number of instances of certain words. This count is written to an intermediate output file based on the hash of the words. The task is CPU-bound and requires approximately 7 seconds to complete on an unloaded PlanetLab-class machine. The reduce tasks summarize the intermediate files with matching hash values.

Thus, each map and reduce task performs an approximately equal amount of work as in the original MapReduce work [9], though it would be useful to generalize to variable-length computation. When complete, a map or reduce task enters the associated barrier with a unique identifier for the completed work. The barrier manager monitors the arrival rate and dynamically determines the knee, which is where the completion rate begins to slow. We empirically determined that this slowing results from a handful of nodes that proceed substantially slower than the rest of the system. (Note that this phenomenon is not restricted to our wide-area environment; Dean and Ghemawat observed the same behavior for their runs on tightly coupled and more homogeneous clusters [9].) Thus, upon detecting the knee, a callback to the MapReduce controller indicates that additional copies of the slow tasks should be respawned, ideally on nodes with the smallest number of outstanding tasks. Typically, by the time the knee is detected, a number of hosts have already completed their initial allocation of work.

Figure 10: MapReduce: m = 480, r = 30, n = 30 with uniform prepartitioning of the data. Knee detection occurs at 68 seconds and callbacks enable rebalancing. [Plot: task count (0-450) vs. elapsed time (0-500 sec); curves with and without reallocation.]
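A minimal sketch of the knee-triggered respawn path might look like the following. The on_knee callback registration, the spawn helper, and the controller structure are our own assumptions for illustration; the paper's Barrier class and controller are not reproduced here.

```python
def spawn(host, task_id):
    # Placeholder for actually starting a task remotely
    # (e.g., via the deployment tool's own mechanisms).
    print("respawning task %s on %s" % (task_id, host))

class MapReduceController:
    # Sketch of knee-triggered task respawn over a hypothetical
    # barrier-callback interface.

    def __init__(self, barrier, hosts, assignments):
        self.hosts = hosts
        self.outstanding = assignments        # host -> set of task ids
        barrier.on_knee(self.respawn_stragglers)  # register callback

    def respawn_stragglers(self, completed_ids):
        # Drop finished tasks, then respawn whatever remains on the
        # least-loaded hosts; whichever copy enters the barrier first
        # wins, and late duplicates are ignored.
        for h in self.hosts:
            self.outstanding[h] -= set(completed_ids)
        slow = [t for h in self.hosts for t in self.outstanding[h]]
        for task in slow:
            target = min(self.hosts,
                         key=lambda h: len(self.outstanding[h]))
            self.outstanding[target].add(task)
            spawn(target, task)
```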

Figure 10 shows the performance of one MapReduce run both with and without task respawn upon detecting the knee, plotting the cumulative number of completed tasks on the y-axis as a function of time progressing on the x-axis. We see that the load balancing enabled by barrier synchronization on abstract tasks is critical to overall system performance. With task respawn using knee detection, the barrier manager detects the knee approximately 68 seconds into the experiment, after approximately 53% of the tasks have completed. At this point, the MapReduce controller (via a callback from the Barrier class) repartitions the remaining 47% of the tasks across available wide-area nodes. This is where the curves significantly diverge in the graph. Without dynamic rebalancing, the completion rate transitions to a long tail lasting more than 2500 seconds (though the graph only shows the first 500 seconds), while the completion rate largely maintains its initial slope when rebalancing is enabled. Overall, our barrier-based rebalancing results in a factor of sixteen speedup in completion time compared to proceeding with the initial mapping. Multiple additional runs showed similar results.

Note that this load balancing approach differs from the alternate approach of trying to predict, a priori, the set of nodes likely to deliver the highest level of performance. Unfortunately, predicting the performance of tasks with complex resource requirements on a shared computing infrastructure with dynamically varying CPU, network, and I/O characteristics is challenging in the best case and potentially intractable. We advocate a simpler approach that does not attempt to predict performance characteristics a priori. Rather, we simply choose nodes likely to perform well and empirically observe the utility of our decisions. Of course, this approach may only be appropriate for a particular class of distributed applications, and it comes at the cost of performing more work in absolute terms because certain computations are repeated. For the case depicted in Figure 10, approximately 30% of the work is repeated if we assume that the work on both the fast and slow nodes is run to completion (a pessimistic assumption, as it is typically easy to kill tasks running on slow nodes once the fast instances complete).

7 Design Alternatives

To address potential scalability problems with our centralized approach, a tree of controllers could be built that aggregates BARRIER REACHED messages from children before sending a single message up the tree [18, 26, 37]. This tree could be built randomly from the expected set of hosts, or it could be crafted to match the underlying communication topology, in effect forming an overlay aggregation tree [34, 36]. In these cases, the master would send FIRE messages to its children, which would in turn propagate the message down the tree. One potentially difficult question with this approach is determining when interior nodes should pass summary BARRIER REACHED messages to their parent. Although a tree-based approach may provide better scalability since messages can be aggregated up the tree, the latency required to pass a message to all participants is likely to increase, since the number of hops required to reach all participants is greater than in the centralized approach.
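One simple answer to the summarization question is a fixed batch size, as in the sketch below; the class structure and batching policy are hypothetical illustrations, not a design the paper commits to.

```python
class TreeBarrierNode:
    # Sketch of hierarchical aggregation: an interior node batches
    # BARRIER REACHED reports from its subtree and forwards a single
    # summary count to its parent.

    def __init__(self, parent=None, batch=10):
        self.parent = parent    # None at the root (barrier manager)
        self.batch = batch      # reports to accumulate before summarizing
        self.pending = 0
        self.total = 0

    def barrier_reached(self, count=1):
        self.pending += count
        self.total += count
        if self.parent is not None and self.pending >= self.batch:
            # One summary message replaces `pending` individual ones.
            self.parent.barrier_reached(self.pending)
            self.pending = 0
        # At the root, `total` drives the usual firing logic; FIRE
        # messages would then propagate back down through the tree.
```

Smaller batches reduce the staleness of the root's view at the cost of more messages, which is exactly the trade-off the batching question raises.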

A gossip-based algorithm could also be employed to manage barriers in a fully decentralized manner [2]. In this case, each node acts as a barrier manager and swaps barrier status with a set of remote peers. Given sufficient pair-wise exchanges, some node will eventually observe enough hosts having reached the barrier, and it will fire the barrier locally. Subsequent pair-wise exchanges will propagate the fact that the barrier was fired to the remainder of the nodes in the system, until eventually all active participants have been informed. Alternatively, the node that determines that a barrier should be released could also broadcast the FIRE message to all participants. Fully decentralized solutions like this have the benefit of being highly fault tolerant and scalable, since the work is shared equally among participants and there is no single point of failure. However, since information is propagated in a somewhat ad hoc fashion, it takes more time to propagate information to all participants, and the total amount of network traffic is greater. There is an increased risk of propagating stale information as well. In our experience, we have not yet observed reliability limitations with our centralized barrier implementation significant enough to warrant exploring a fully decentralized approach.
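The pairwise exchange itself can be quite simple. The sketch below simulates it in-process with hypothetical names; a real deployment would carry the same state in network messages.

```python
class GossipBarrier:
    # Sketch of a fully decentralized barrier via pairwise
    # anti-entropy exchange. Every node tracks who has reached the
    # barrier; any node that sees enough entries fires locally, and
    # later exchanges spread the fired flag to the rest.

    def __init__(self, node_id, required):
        self.node_id = node_id
        self.required = required   # entries needed to fire
        self.reached = set()
        self.fired = False

    def reach(self):
        self.reached.add(self.node_id)

    def exchange(self, peer):
        # Swap barrier status with one peer, then re-check locally.
        merged = self.reached | peer.reached
        self.reached, peer.reached = set(merged), set(merged)
        fired = self.fired or peer.fired or len(merged) >= self.required
        self.fired = peer.fired = fired
```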

We expect that all single-controller algorithms will eventually run into scalability limitations based on a single node's ability to manage incoming and outgoing communication with many peers. However, based on our performance evaluation (see Section 6), the performance of centralized barriers is acceptable to at least 100 nodes. In fact, we find that our centralized barrier implementation outperforms an overlay tree with an out-degree of 10 for 100 total participants with regard to the time it takes a single message to propagate to all participants.

8 Related Work

Our work builds upon a number of synchronization concepts in distributed and parallel computing. For traditional parallel programming on tightly coupled multiprocessors, barriers [19] form natural synchronization points. Given the importance of fast primitives for coordinating bulk synchronous SIMD applications, most MPPs have hardware support for barriers [23, 31]. While the synchronization primitives and requirements for the large-scale networked systems discussed here differ from those of traditional barriers, traditional barriers clearly form one basis for our work. Barriers also form a natural consistency point for software distributed shared memory systems [6, 20], often signifying the point where data will be synchronized with remote hosts. Another popular programming model for loosely synchronized parallel machines is message passing. Popular message passing libraries such as PVM [15] and MPI [25] contain implementations of barriers as a fundamental synchronization service.

Our approach is similar to Gupta's work on Fuzzy Barriers [17] in support of SIMD programming on tightly coupled parallel processors. Relative to our approach, Gupta's approach specified an entry point for a barrier, followed by a subsequent set of instructions that could be executed before the barrier is released. Thus, a processor is free to be anywhere within a given region of the overall instruction stream before being forced to block. In this way, processors completing a phase of computation early could proceed to other computation that does not require strict synchronization before finally blocking. Fuzzy barriers were required in SIMD programs because there could be only a single outstanding barrier at any time. We can capture identical semantics using two barriers that communicate state with one another. The first barrier releases all entering nodes and signals state to a second barrier. The second barrier only begins to release nodes once a sufficient number (perhaps all) of the nodes have entered the first barrier.
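A minimal sketch of this two-barrier construction appears below; the class and method names are hypothetical, and for brevity the two barriers are folded into a single object that shares their state.

```python
import threading

class LinkedBarrierPair:
    # Sketch of emulating a fuzzy barrier with two barriers that
    # share state. Nodes pass the first barrier without blocking;
    # the second releases only after `required` nodes have entered
    # the first.

    def __init__(self, required):
        self.required = required
        self.entered = 0
        self.cond = threading.Condition()

    def enter_first(self):
        # Barrier 1: record entry and continue immediately,
        # signalling state to barrier 2.
        with self.cond:
            self.entered += 1
            if self.entered >= self.required:
                self.cond.notify_all()

    def enter_second(self):
        # Barrier 2: block until enough nodes have entered barrier 1.
        with self.cond:
            while self.entered < self.required:
                self.cond.wait()
```

A node calls enter_first(), executes its "fuzzy" region of unsynchronized work, and then calls enter_second(), mirroring the entry point and delayed release of Gupta's scheme.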

Our approach to synchronizing large-scale networked systems is related to the virtual synchrony [2, 3] and extended virtual synchrony [27] communication models. These communication systems closely tie together node inter-communication with group membership services. They ensure that a message multicast to a group is either delivered to all participants or to none. Furthermore, they preserve causal message ordering [22] between both individual messages and changes in group membership. This communication model is clearly beneficial to a significant class of distributed systems, including service replication. However, we have a more modest goal: to provide a convenient synchronization point to coordinate the behavior of more loosely coupled systems. It is important to stress that in our target scenarios this synchronization is delivered mostly as a matter of convenience rather than as a prerequisite for correctness. Any inconsistency resulting from our model is typically detected and corrected at the application level, similar to soft-state optimizations in network protocol stacks that improve common-case performance but are not required for correctness. Other consistent group membership/view advancement protocols include Harp [24] and Cristian's group membership protocol [8].

Another effort related to ours is Golding's weakly consistent group membership protocol [16]. This protocol employs gossip messages to propagate group membership in a distributed system. Assuming the rate of change in membership is low enough, the system will quickly transition from one stable view to another. One key benefit of this approach is that it is entirely decentralized and hence does not require a central coordinator to manage the protocol. As discussed in Section 7, we plan to explore the use of a distributed barrier that employs gossip to manage barrier entry and release. However, our system evaluation indicates that our current architecture delivers sufficient levels of both performance and reliability for our target settings.

Our loose synchronization model is related in spirit to a variety of efforts into relaxed consistency models for updates in distributed systems, including Epsilon Serializability [30], the CAP principle [14], Bayou [32], TACT [38], and Delta Consistency [33].

Finally, we note that we are not the first to use the arrival rate at a barrier to perform scheduling and load balancing. In Implicit Coscheduling [11], the arrival rate at a barrier (and the associated communication) is one consideration in making local scheduling decisions to approximate globally synchronized behavior in a multiprogrammed parallel computing environment. Further, the way in which we use barriers to reallocate work is similar to methods used by work stealing schedulers like CILK [5]. The fundamental difference here is that idle processors in CILK make local decisions to seek out additional pieces of work, whereas all decisions to reallocate work in our barrier-based scheme are made centrally at the barrier manager.

9 Summary

Partial barriers represent a useful relaxation of the traditional barrier synchronization primitive. The ease with which we were able to integrate them into three different existing distributed applications augurs well for their general utility. Perhaps more significant, however, was the straightforward implementation of wide-area MapReduce enabled by our expanded barrier semantics. We are hopeful that partial barriers can be used to bring to the wide area other sophisticated parallel algorithms initially developed for tightly coupled environments. In our work thus far, we find that in many cases it may be as easy as directly replacing existing synchronization primitives with their relaxed partial barrier equivalents.

Dynamic knee detection in the completion time of tasks in heterogeneous environments is also likely to find wider application. Being able to detect multiple knees has added benefits, since many applications exhibit multi-modal distributions. Detecting multiple knees gives applications more control over reconfigurations and increases overall robustness, ensuring forward progress even in somewhat volatile execution environments.

References

[1] D. G. Andersen, H. Balakrishnan, and F. Kaashoek. Improving Web Availability for Clients with MONET. In NSDI, 2005.
[2] K. Birman. Replication and Fault-Tolerance in the ISIS System. In SOSP, 1985.
[3] K. Birman and T. Joseph. Exploiting Virtual Synchrony in Distributed Systems. In SOSP, 1987.
[4] K. P. Birman. The Process Group Approach to Reliable Distributed Computing. CACM, 36(12), 1993.
[5] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In PPOPP, 1995.
[6] B. Bershad, M. Zekauskas, and W. Sawdon. The Midway Distributed Shared Memory System. In CompCon, 1993.
[7] Computing TCP's Retransmission Timer (RFC 2988). http://www.faqs.org/rfcs/rfc2988.html.
[8] F. Cristian. Reaching Agreement on Processor-Group Membership in Synchronous Distributed Systems. DC, 4(4), 1991.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[10] E. Dijkstra. The Structure of the "THE"-Multiprogramming System. CACM, 11(5), 1968.
[11] A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In SIGMETRICS, 1996.
[12] EMAN. http://ncmi.bcm.tmc.edu/EMAN/.
[13] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. GGF, 2002.
[14] A. Fox and E. Brewer. Harvest, Yield, and Scalable Tolerant Systems. In HotOS, 1999.
[15] G. A. Geist and V. S. Sunderam. Network-Based Concurrent Computing on the PVM System. C–P&E, 4(4):293–312, 1992.
[16] R. Golding. A Weak-Consistency Architecture for Distributed Information Services. CS, 5(4):379–405, Fall 1992.
[17] R. Gupta. The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors. In ASPLOS, 1989.
[18] R. Gupta and C. R. Hill. A Scalable Implementation of Barrier Synchronization Using an Adaptive Combining Tree. IJPP, 18(3):161–180, 1990.
[19] H. F. Jordan. A Special Purpose Architecture for Finite Element Analysis. In ICPP, 1978.
[20] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In USENIX, pages 115–131, 1994.
[21] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In SOSP, 2003.
[22] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. CACM, 21(7), 1978.
[23] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong-Chan, S.-W. Yang, and R. Zak. The Network Architecture of the Connection Machine CM-5. JPDC, 33(2):145–158, 1996.
[24] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp File System. In SOSP, 1991.
[25] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical Report UT-CS-94-230, 1994.
[26] S. Moh, C. Yu, B. Lee, H. Y. Youn, D. Han, and D. Lee. Four-ary Tree-Based Barrier Synchronization for 2D Meshes Without Nonmember Involvement. TC, 50(8):811–823, 2001.
[27] L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, and C. A. Lingley-Papadopoulos. Totem: A Fault-Tolerant Multicast Group Communication System. CACM, 39(4), 1996.
[28] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In HotNets, 2002.
[29] Plush. http://plush.ucsd.edu.
[30] C. Pu and A. Leff. Epsilon-Serializability. Technical Report CUCS-054-90, Columbia University, 1991.
[31] S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In ASPLOS, 1996.
[32] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. In SOSP, 1995.
[33] F. Torres-Rojas, M. Ahamad, and M. Raynal. Timed Consistency for Shared Distributed Objects. In PODC, 1999.
[34] R. van Renesse, K. Birman, and W. Vogels. Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining. TCS, 21(2), 2003.
[35] M. Welsh, D. E. Culler, and E. A. Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In SOSP, 2001.
[36] P. Yalagandula and M. Dahlin. A Scalable Distributed Information Management System. In SIGCOMM, 2004.
[37] J.-S. Yang and C.-T. King. Designing Tree-Based Barrier Synchronization on 2D Mesh Networks. TPDS, 9(6):526–534, 1998.
[38] H. Yu and A. Vahdat. Design and Evaluation of a Continuous Consistency Model for Replicated Services. In OSDI, 2000.
