
J Grid Computing (2012) 10:47–68. DOI 10.1007/s10723-012-9204-9

iMapReduce: A Distributed Computing Framework for Iterative Computation

Yanfeng Zhang · Qixin Gao · Lixin Gao · Cuirong Wang

Received: 15 July 2011 / Accepted: 1 March 2012 / Published online: 25 March 2012
© Springer Science+Business Media B.V. 2012

Abstract Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, and online social network analysis. These iterative applications typically involve massive data sets containing millions or billions of data records, which creates demand for distributed computing frameworks that can process massive data sets on a cluster of machines. MapReduce is an example of such a framework.

Y. Zhang (B)
School of Information Science and Engineering, Northeastern University, 11 Wenhua Road, Shenyang, Liaoning 110819, China
e-mail: [email protected]

Q. Gao · C. Wang
Department of Electrical and Information Engineering, Northeastern University at Qinhuangdao, 143 Taishan Road, Qinhuangdao, Hebei 066000, China

Q. Gao
e-mail: [email protected]

C. Wang
e-mail: [email protected]

L. Gao · Y. Zhang
Department of Electrical and Computer Engineering, University of Massachusetts Amherst, 151 Holdsworth Way, Amherst, MA 01002, USA

L. Gao
e-mail: [email protected]

However, MapReduce lacks built-in support for iterative processes, which parse data sets repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and provides support for automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5 times speedup over Hadoop for implementing iterative algorithms.

Keywords Iterative computation · iMapReduce · Distributed computing framework · Hadoop

1 Introduction

With the success of Web 2.0 and the popularity of online social networks, massive amounts of data are being collected, such as Facebook activities, Flickr photos, Web pages, eBay sales records, and cell phone records. The collected data typically contain millions or billions of records. For the purpose of processing large-scale data sets, Google, Microsoft and Yahoo! have built their own data centers [11]. Additionally, Amazon and Microsoft provide commodity hardware for public usage [1, 28]. Users pay to lease a number of virtual server instances to customize their own clusters.

To analyze these massive data sets, a distributed computing framework is needed on top of a cluster of servers. MapReduce [12] is a framework proposed for data-intensive computation in a large cluster environment. Since its introduction, MapReduce, in particular its open-source implementation Hadoop [15], has become extremely popular for analyzing large data sets. It provides a simple programming model and takes care of distributed execution, distributed storage, and fault tolerance, which enables programmers with no experience with distributed systems to exploit a large cluster of commodity machines to perform data-intensive computation.

MapReduce is designed for batch-oriented computation such as log analysis and text processing. However, many data analysis applications [2, 3, 7, 10, 25, 32, 35] require iterative processing of the data, including algorithms for text-based search and machine learning. For example, the well-known PageRank algorithm parses the web linkage graph iteratively to derive pages' ranking scores. The huge amount of data involved in these applications demands an efficient distributed framework for implementing these iterative algorithms. Nevertheless, MapReduce lacks built-in support for iterative processing.

Further, implementing iterative algorithms in MapReduce usually requires users to design a chain of jobs, which imposes several performance penalties. First, it wastes considerable resources on launching, scheduling, and cleaning up these jobs, and as a result leads to longer processing time, even though these jobs perform the same operations. Second, the data used for iterative computation are loaded and shuffled repeatedly in each iteration job, even though they do not change across iterations. Third, the synchronous batched execution of these MapReduce jobs requires finishing the previous iteration job before starting the next one, which can unnecessarily delay the process.

In this paper, we propose iMapReduce. To address the inefficiencies of iterative algorithm implementations in MapReduce, iMapReduce provides an efficient iterative processing framework while preserving MapReduce-like programming interfaces. First, it proposes the concept of persistent tasks to avoid repeated task scheduling. Second, the input data are loaded into the local file system only once instead of being shuffled among workers many times, which can significantly reduce I/O and network communication overhead. Third, it facilitates asynchronous execution of map tasks within the same iteration to break the synchronization barrier between MapReduce jobs.

We implement a prototype of iMapReduce based on Apache Hadoop [15] and the Hadoop Online Prototype (HOP) [16]. Our prototype is backward compatible with Hadoop MapReduce in the sense that it supports any Hadoop MapReduce job. We evaluate our prototype with several well-known iterative algorithms, including Shortest Path, PageRank, and K-means. Our experimental results show that iMapReduce can accelerate the iterative process significantly, up to 5 times faster than the Hadoop implementations.

The rest of the paper is organized as follows. In Section 2, we introduce the MapReduce implementations of two typical iterative algorithms and summarize their limitations. Section 3 describes the design and implementation of iMapReduce. Evaluation results are provided in Section 4. Section 5 generalizes iMapReduce to other iterative algorithms. We review the related work in Section 6 and conclude the paper in Section 7.

2 Iterative Algorithms

Many data mining algorithms involve an iterative process that operates on data recursively. In this section, we first provide two examples of iterative algorithms and describe their MapReduce implementations. Then we summarize the limitations of implementing these algorithms in MapReduce.


2.1 Iterative Algorithm Examples

We describe Single Source Shortest Path (SSSP) and PageRank along with their MapReduce implementations in this section.

2.1.1 Single Source Shortest Path

Single Source Shortest Path (SSSP) [10] is a classical problem that derives the shortest distance from a source node s to all other nodes in a graph. Formally, we describe the shortest path computation as follows. Given a weighted, directed graph G = (V, E), with link weight matrix W mapping edges to real-valued weights, for a source node s, find the minimum distance d(v) to every other vertex v.

To perform the shortest path computation in the MapReduce framework, we can traverse the graph in a breadth-first manner. Starting from a source s, with the distance to the source initialized as 0, i.e., d(s) = 0, and every other node's minimum distance initially set to ∞, the map function is applied on each node u. The map input key is a node id, and the map input value is composed of two parts: the minimum distance from s to u, i.e., d(u), and the set of node u's outgoing links' weight values, i.e., W(u, ∗). Based on these two inputs, the map function generates the output key-value pair 〈v, d(u) + W(u, v)〉, where v is any of u's outbound neighboring nodes and W(u, v) is the link weight from u to v. Meanwhile, to maintain its current shortest distance and the link weight information, the mapper also produces the pair 〈u, [d(u), W(u, ∗)]〉.

The map output key-value pairs are shuffled to the reducers. In each reducer, the possible distance values for a node v from different predecessors u are gathered, and the minimum is selected to update v's shortest distance, but only if it is shorter than v's current shortest distance. Node v's shortest distance combined with its outbound link weight information composes the reduce output value, which is written to DFS for the next iteration's MapReduce job. In the next iteration, the same map/reduce operations are performed on the previous job's output, i.e., the set of updated shortest distance values. To terminate the iterative process, an additional MapReduce job following each iteration job is launched to check the termination condition. The iterative process terminates when no node's shortest distance value changes.

The map and reduce operations can be summarized as follows.

Map For each node u, based on its outgoing links' weight values W(u, ∗), output a key-value pair 〈v, d(u) + W(u, v)〉 for each W(u, v) ∈ W(u, ∗), and output its current shortest distance value as well as its outgoing links' weights, 〈u, [d(u), W(u, ∗)]〉.

Reduce For each node v, select the minimum among d(v) and the values d(u) + W(u, v) received from any u to update d(v), and output the key-value pair 〈v, [d(v), W(v, ∗)]〉.
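To make the per-node logic concrete, the following plain-Java sketch mirrors the Map and Reduce steps above. The Emitter interface and NodeState type are illustrative stand-ins, not iMapReduce or Hadoop API.

```java
import java.util.List;
import java.util.Map;

class SsspFunctions {
    // Value part of a record: current shortest distance plus outgoing link weights W(u, *).
    static class NodeState {
        double distance;                  // d(u)
        Map<Long, Double> outWeights;     // v -> W(u, v); null for pure distance candidates
        NodeState(double d, Map<Long, Double> w) { distance = d; outWeights = w; }
    }

    interface Emitter { void emit(long key, NodeState value); }

    // Map: propagate d(u) + W(u, v) to every out-neighbor v, and re-emit u's own record.
    static void map(long u, NodeState in, Emitter out) {
        for (Map.Entry<Long, Double> e : in.outWeights.entrySet()) {
            out.emit(e.getKey(), new NodeState(in.distance + e.getValue(), null));
        }
        out.emit(u, in);  // keep u's current distance and static link weights
    }

    // Reduce: keep the minimum distance among all candidates received for v.
    static void reduce(long v, List<NodeState> values, Emitter out) {
        double best = Double.POSITIVE_INFINITY;
        Map<Long, Double> weights = null;
        for (NodeState s : values) {
            if (s.distance < best) best = s.distance;
            if (s.outWeights != null) weights = s.outWeights;  // the record carrying W(v, *)
        }
        out.emit(v, new NodeState(best, weights));
    }
}
```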

2.1.2 PageRank

PageRank [3, 6, 32] is a popular algorithm initially proposed for ranking web pages. It has since been used in a wide range of applications, such as link prediction [23, 36] and recommendation systems [2, 17, 37].

The PageRank vector R is defined over a directed graph G = (V, E). Each node v in the graph is associated with a PageRank score R(v). The initial rank of each node is 1/|V|. Each node v updates its rank iteratively as follows:

$$R^{(k+1)}(v) = \frac{1-d}{|V|} + \sum_{u \in N^-(v)} \frac{d \cdot R^{(k)}(u)}{|N^+(u)|}, \qquad (1)$$

where N−(v) is the set of nodes pointing to node v, N+(v) is the set of nodes that v points to, k is the iteration number, and d is a constant representing the damping factor. This iterative process continues for a fixed number of iterations or until the difference between the PageRank scores of two consecutive iterations is smaller than a threshold.

In MapReduce, the map function is applied on each node u, where the input key is the node id and the input value contains node u's ranking score R(u) as well as node u's outbound neighbor set N+(u). The mapper on node u derives the partial ranking score of v, v ∈ N+(u), i.e., d·R(u)/|N+(u)|, which is shuffled to node v. Meanwhile, the retained PageRank score (1−d)/|V| and the outbound neighbor set N+(u) are shuffled to u itself. The reducer on node v accumulates these partial ranking scores and the retained ranking score to produce a new ranking score of v. The updated ranking score of v along with the outbound neighbor set is written to DFS to feed the next iteration's MapReduce job. To stop the iterative process, users have to perform another MapReduce job after each iteration to measure the difference from the last iteration's result.

The map and reduce operations can be summarized as follows.

Map For each node u, output a key-value pair 〈v, d·R(u)/|N+(u)|〉 for each v ∈ N+(u), and output the retained ranking score together with its outbound neighbor set, 〈u, [(1−d)/|V|, N+(u)]〉.

Reduce For each node v, sum (1−d)/|V| and the d·R(u)/|N+(u)| received from any u to update R(v), and output the key-value pair 〈v, [R(v), N+(v)]〉.
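A matching sketch for the PageRank step follows. Emitter and PageState are again illustrative types, and the damping factor value 0.85 is an assumed constant, not taken from the paper.

```java
import java.util.List;

class PageRankFunctions {
    static final double D = 0.85;  // damping factor d (assumed value)
    static long numPages;          // |V|, set once from the job configuration

    static class PageState {
        double score;              // a partial or full PageRank score
        long[] outNeighbors;       // N+(u); null for pure score contributions
        PageState(double s, long[] n) { score = s; outNeighbors = n; }
    }

    interface Emitter { void emit(long key, PageState value); }

    // Map: send d * R(u) / |N+(u)| to each out-neighbor, and keep the
    // retained score (1 - d) / |V| together with the adjacency list on u itself.
    static void map(long u, PageState in, Emitter out) {
        double share = D * in.score / in.outNeighbors.length;
        for (long v : in.outNeighbors) out.emit(v, new PageState(share, null));
        out.emit(u, new PageState((1 - D) / numPages, in.outNeighbors));
    }

    // Reduce: R(v) is the sum of the retained score and all incoming shares.
    static void reduce(long v, List<PageState> values, Emitter out) {
        double sum = 0;
        long[] neighbors = null;
        for (PageState s : values) {
            sum += s.score;
            if (s.outNeighbors != null) neighbors = s.outNeighbors;
        }
        out.emit(v, new PageState(sum, neighbors));
    }
}
```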

2.2 Limitations of MapReduce Implementation

As described above, MapReduce can be used to implement iterative algorithms. However, such implementations suffer from three limitations.

1. The operations in each iteration are the same. Nevertheless, the MapReduce implementation starts a new job for each iteration, which involves repeated task initializations and cleanups. Moreover, these jobs have to load the input data from DFS and dump the output data to DFS repeatedly. This results in unnecessary scheduling overhead.

2. The adjacency information is shuffled between map and reduce in each iteration, despite the fact that it remains the same across all iterations. An alternative approach is to run an additional MapReduce job before each iteration to perform a join between the static adjacency data and the dynamic iterated data. Both of these approaches result in unnecessary communication overhead.

3. The map tasks of an iteration cannot start before all the reduce tasks of the previous iteration have finished. The main loop in the MapReduce implementation requires the completion of the previous iteration job before starting the next one. However, the map tasks should start as soon as their input data are available. This limitation results in unnecessary synchronization overhead.

iMapReduce aims to address these limitations and to provide an efficient distributed computing framework for implementing iterative algorithms. In doing so, we make two observations about these iterative algorithms. First, the processing unit in both the map function and the reduce function is a node, which enables a one-to-one mapping between mappers and reducers. Second, each iteration is expressed by only one MapReduce job. Although these two observations might not hold for all iterative algorithms, they are indeed true for a large class of graph-based iterative algorithms. For ease of exposition, we make these assumptions when presenting iMapReduce in Section 3. In Section 5, we describe how iMapReduce supports iterative algorithms for which the above two assumptions do not hold.

3 iMapReduce

In this section we propose iMapReduce. iMapReduce provides support for iterative processing (Section 3.1), efficient data management (Section 3.2), and asynchronous map task execution (Section 3.3). In addition, we describe the runtime support, including load balancing and fault tolerance mechanisms, in Section 3.4 and the framework APIs in Section 3.5.

3.1 Iterative Processing

In the MapReduce implementations of iterative algorithms, a series of MapReduce jobs, each consisting of a map operation and a reduce operation, is scheduled. Figure 1a shows the dataflow in the MapReduce implementation. Each MapReduce job has to load the input data from DFS before the map operation. After the map operation derives the intermediate key-value pairs, the reduce function operates on the intermediate data and derives the output of this iteration, which is written to DFS. In the next iteration, the map function loads the iterated data from DFS again and repeats the process. These MapReduce jobs, including their component map/reduce tasks, incur unnecessary scheduling overhead. Additionally, the repeated DFS loading/dumping is expensive, even though Hadoop provides a locality optimization that reduces remote communication.

We note that each iteration performs the same operations; in other words, the series of jobs performs the same map and reduce functions. We exploit this property in iMapReduce by making map/reduce tasks persistent. That is, the map/reduce operations in map/reduce tasks keep executing until the iteration is terminated. Further, iMapReduce enables the reduce's output to be passed to the map for the next iteration round.

Fig. 1 a Dataflow of MapReduce. b Dataflow of iMapReduce

Figure 1b shows the dataflow in iMapReduce. The dashed line indicates that the data loading from DFS happens only once, in the initialization stage, and the output data are written to DFS only once, when the iteration terminates. In the following we describe how iMapReduce implements persistent tasks and how the persistent tasks are terminated.

3.1.1 Persistent Tasks

A map/reduce task is a computing process with a specified map/reduce operation applied to a subset of data records. In the Hadoop MapReduce framework, each map/reduce task is assigned to a slave worker processing a subset of the input/shuffled data, and its life cycle ends when it completes processing the assigned data records.

In contrast, each map/reduce task in iMapReduce is persistent. A persistent map/reduce task stays alive during the entire iterative process. When all the assigned data of a persistent task have been parsed, the task becomes dormant, waiting for new input/shuffled data. A map task waits for the results from the reduce task and is reactivated to work on the new input data records when the required data arrive from the reduce task. We describe how the data are passed from the reduce tasks to the map tasks in Section 3.2.1. A reduce task waits for all the map tasks' output and is reactivated synchronously, as in MapReduce.
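The dormant/reactivate cycle can be pictured as a simple event loop. The following sketch is purely illustrative: the blocking queue and the TERMINATE marker are assumptions standing in for iMapReduce's internal data channel and the master's stop notification.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PersistentTask implements Runnable {
    static final Object TERMINATE = new Object();  // hypothetical stop marker from the master
    final BlockingQueue<Object> inbox = new LinkedBlockingQueue<>();

    public void run() {
        while (true) {
            Object batch;
            try {
                batch = inbox.take();  // dormant until new input/shuffled data arrive
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (batch == TERMINATE) return;  // iteration finished; task ends only now
            process(batch);                  // run the same map/reduce operation again
        }
    }

    void process(Object batch) { /* user-defined map or reduce operation */ }
}
```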

To implement persistent tasks, there must be enough available task slots. The number of available map/reduce task slots is the number of map/reduce tasks that the framework can accommodate (or allows to be executed) simultaneously. In Hadoop MapReduce, the master splits a job into many small map/reduce tasks, and the number of map/reduce tasks executed simultaneously cannot exceed the number of available map/reduce task slots (the default in Hadoop is two per slave worker). Once a slave worker completes an assigned task, it requests another one from the master. In iMapReduce, we need to guarantee that there are sufficient available task slots for all the persistent tasks to start at the beginning. This means that the task granularity should be set coarser, yielding fewer tasks. Clearly, this can make load balancing challenging. We address this issue with a load balancing scheme in Section 3.4.2.

3.1.2 Iteration Termination

Iterative algorithms typically stop when a termination condition is met. Users terminate an iterative process in one of two ways: (1) Fixed number of iterations: the iterative algorithm stops after a fixed number of iterations. (2) Bounding the distance between two consecutive iterations: the iterative algorithm stops when the difference between two consecutive iterations is less than a threshold.

iMapReduce performs the termination check after each iteration. To terminate the iterative computation after a fixed number of iterations, we simply record the iteration number in each task and terminate the task when the number exceeds a threshold. To bound the distance between two consecutive iterations, the reduce tasks save the output from two consecutive iterations and calculate the distance between them. Users specify the distance measure, e.g., Euclidean distance or Manhattan distance. To obtain a global distance value, the local distance values from the reduce tasks are merged by the master, and the master checks the termination condition to decide whether to terminate. If the termination condition is satisfied, the master notifies all the persistent map/reduce tasks to terminate their execution.

3.2 Data Management

As described in Section 2.2, even though the graph adjacency information is unchanged from iteration to iteration, the MapReduce implementations reload and reshuffle the unchanged graph data in each iteration, which imposes considerable communication overhead.

To avoid shuffling the unchanged graph data, iMapReduce differentiates the static data from the state data. The state data are updated in each iteration, while the static data remain the same across iterations. For example, in SSSP, the nodes' shortest distance values are the state data, updated at each iteration, while the link weights are the static data, which remain unchanged. In PageRank, the pages' ranking scores are the state data, updated iteratively, while the adjacency lists are the static data, which remain unchanged.

Figure 2 shows the state/static data flow in an iMapReduce worker. Within a worker, the initial state data and the static data are loaded from DFS into the local FS in the initial stage. Before each map-reduce iteration starts, the iterated state data are joined with the local static data for the map operation. Then the state data produced by map are shuffled to reduce, and the updated state data from reduce are passed to map to start another iteration. In the following, we describe how the state data are passed from reduce to map and how the state data are joined with the static data.

3.2.1 Passing State Data from Reduce to Map

In MapReduce, the output of a reduce task is written to DFS and might be used later by the next MapReduce job. In contrast, iMapReduce allows the state data to be passed from the reduce task to the map task directly, so as to trigger the join operation with the static data and to start the map execution for the next iteration. To do so, iMapReduce builds persistent socket connections from the reduce tasks to the map tasks.

Fig. 2 Dataflow of the state data and the static data in an iMapReduce worker


In order to reduce the number of socket connections, we partition the set of graph nodes into multiple subsets. Each map task, accompanied by a reduce task, is assigned a subset of nodes. In other words, there is a one-to-one correspondence between the map tasks and the reduce tasks. Only one socket connection is needed for passing the state data from a reduce task to the corresponding map task, which processes the same subset of nodes. Otherwise, we would have to perform another shuffling process from the reduce tasks to the map tasks, where the number of persistent connections grows quadratically with the number of tasks.

Since each map task has a one-to-one correspondence with a reduce task, they hold the same subset of nodes. We can partition the static data using the same partition function as the one used in the shuffling of the state data. In doing so, the state data are always shuffled to exactly the reduce task whose connected map task holds the corresponding static data. Further, in order to avoid the network communication needed for passing the state data, we prefer a local connection: the task scheduler maintains the mapping information and always assigns a map task and its corresponding reduce task to the same worker.

3.2.2 Joining State Data with Static Data

As shown in Fig. 2, iMapReduce separates the static data from the state data, so that only the state data are updated iteratively. The static data are located in the map tasks and are joined with the iterated state data for the map operation. Therefore, the map operation takes input from both the state data and the static data, while the reduce operation takes input only from the state data. Before executing the map operation, a join between the state data and the static data is performed to obtain the combined state and static data records, which are used in the map operation.

iMapReduce performs this join automatically, without requiring users to write an additional MapReduce job (some previous work has focused on optimizing such an additional join job [5]). We assign the same key to the static data record and the state data record that are to be joined. For the PageRank example, a node's static adjacency list and its iteratively updated ranking score are indexed by the same node id. The static data records and the state data records are each sorted in the natural order of their keys. To implement the join, we read one static data record and one state data record at a time, so that they have the same key. The framework joins the static data and the state data before feeding the joined data to the map operation. The joined state data record and static data record are provided to the map operation as input parameters, so that users can concentrate on implementing the map operation without worrying about maintaining the static data in their iterative algorithm implementations.
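Because both streams are sorted by the same keys, the join reduces to a single merge pass. The sketch below illustrates this under that assumption; Record and Joined are hypothetical helper types rather than iMapReduce classes.

```java
import java.util.Iterator;

class SortedJoin {
    static class Record<K, V> {
        final K key; final V value;
        Record(K key, V value) { this.key = key; this.value = value; }
    }
    static class Joined<K, S, T> {
        final K key; final S state; final T staticData;
        Joined(K key, S state, T staticData) {
            this.key = key; this.state = state; this.staticData = staticData;
        }
    }

    // Read one static record per state record; by construction they carry
    // the same key, so no lookups or buffering are needed.
    static <K extends Comparable<K>, S, T> Iterator<Joined<K, S, T>> join(
            final Iterator<Record<K, S>> stateIt,
            final Iterator<Record<K, T>> staticIt) {
        return new Iterator<Joined<K, S, T>>() {
            public boolean hasNext() { return stateIt.hasNext() && staticIt.hasNext(); }
            public Joined<K, S, T> next() {
                Record<K, S> state = stateIt.next();
                Record<K, T> stat = staticIt.next();
                assert state.key.compareTo(stat.key) == 0;  // same key by construction
                return new Joined<>(state.key, state.value, stat.value);
            }
        };
    }
}
```

A hash join would also work; the sorted merge simply avoids building an in-memory index over the static data.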

3.3 Asynchronous Execution of Map Tasks

In MapReduce implementations of iterative algorithms, two synchronization barriers exist: between maps and reduces, and between MapReduce jobs. Due to the synchronization barrier between jobs, the map tasks of the next iteration job cannot start before the completion of the previous iteration job, which requires all the reduce tasks to complete. However, since a map task needs only the state data from its corresponding reduce task, it can start executing as soon as its input state data arrive, without waiting for the other reduce tasks to complete. In iMapReduce, we therefore schedule the execution of map tasks asynchronously. By enabling asynchronous execution, the synchronization barriers between MapReduce jobs are eliminated, which further speeds up the iterative process.

To implement asynchronous execution, we build a persistent socket connection from a reduce task to its corresponding map task. In a naive implementation, as soon as the reduce task produces a data record, it is immediately sent back to its corresponding map task, and upon receipt the map task performs the map operation on it immediately. However, this eager triggering results in frequent context switches between the reduce operation and the map operation, which hurts performance. Thus, a buffer is maintained in each reduce task. When the buffer grows beyond a threshold, its contents are sent to the corresponding map task.
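The buffering described above can be sketched as follows; the threshold value and the sendToMapTask() call are placeholders for the framework's actual socket-writing code.

```java
import java.util.ArrayList;
import java.util.List;

class ReduceOutputBuffer<R> {
    private final int threshold;
    private final List<R> buffer = new ArrayList<>();

    ReduceOutputBuffer(int threshold) { this.threshold = threshold; }

    // Called for every record the reduce operation produces.
    void collect(R record) {
        buffer.add(record);
        if (buffer.size() >= threshold) flush();
    }

    // Also called once at the end of the iteration to drain any remainder.
    void flush() {
        if (buffer.isEmpty()) return;
        sendToMapTask(new ArrayList<>(buffer));  // batch write over the persistent socket
        buffer.clear();
    }

    void sendToMapTask(List<R> batch) { /* write the batch to the socket connection */ }
}
```

Batching trades a little latency for far fewer context switches between the reduce and map operations.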

In each map task, the join operation is performed as soon as the state data from the corresponding reduce task arrive. Since the static data are always ready in the map task, the join operation can be performed eagerly. The map operation is executed immediately after the join operation produces a joined static and state data record. Note that the reduce tasks cannot start until all the map tasks in the same iteration have completed. In other words, the execution of map tasks and the execution of reduce tasks cannot overlap within the same iteration, just as in MapReduce.

3.4 Runtime Support

Runtime support for load balancing and fault tolerance is essential for a distributed computing framework. Indeed, one of the key reasons for MapReduce's success is its simple and efficient runtime support. In the following, we describe how iMapReduce supports load balancing and fault tolerance.

3.4.1 Fault Tolerance

Fault tolerance is important in a server cluster environment. MapReduce splits a job into multiple fine-grained tasks and reschedules a failed task whenever a task failure is detected. Moreover, MapReduce provides speculative execution [40], designed for clusters of heterogeneous hardware: if extra resources are available, another concurrent task is started to process the same data block, and the first completed task's output is used.

iMapReduce relies on a checkpointing mechanism for fault tolerance. For each map task, the static data block has a replica on DFS, while for each reduce task, the output state data are dumped to DFS as a checkpoint every few iterations. In case of a failure, iMapReduce recovers from the most recent checkpointed iteration, instead of restarting the iterative process from scratch. Since the state data are relatively small, dumping these data to DFS (i.e., making several file copies on other machines for redundancy) is expected to take little time. Note that the checkpointing process is performed in parallel with the iterative process.

3.4.2 Load Balancing

In MapReduce, the master decomposes a submitted job into multiple tasks. A slave worker completes one task and then requests another from the master. This "complete-and-then-feed" task scheduling mechanism makes good use of computing resources. In iMapReduce, all tasks are assigned to slave workers at one time in the beginning, since tasks are persistent. This one-time assignment conflicts with MapReduce's task scheduling strategy, so we cannot inherit this benefit of the original MapReduce framework.

Lack of load balancing support may lead to several problems: (1) Even if the initial input data are evenly partitioned among workers, the computation workload is not necessarily evenly distributed, due to skewed degree distributions. (2) Even if the computation workload is evenly distributed among workers, this still cannot guarantee the best utilization of computing resources, since a large cluster might consist of heterogeneous servers [40].

To address these problems, iMapReduce performs task migration whenever the workload is unbalanced among workers. After each iteration, each reduce task sends an iteration completion report to the master, containing the reduce task id, the iteration number, and the processing time for that iteration. Upon receipt of all the reduce tasks' reports, the master calculates the average processing time for that iteration, excluding the longest and shortest, and based on it calculates the time deviation of each worker, identifying the slower and faster workers. If the time deviation is larger than a threshold, the reduce task along with its connected map task in the slowest worker is migrated to the fastest worker in three steps. The master (1) kills a map-reduce task pair in the slow worker, (2) launches a new map-reduce task pair in the fast worker, and (3) sends a rollback command to the other map/reduce tasks. All the map/reduce tasks that receive the rollback command skip their current work. The rolled-back map tasks reload the latest checkpointed state data from DFS and resume the iterative computation, while the newly launched map tasks load not only the state data but also the corresponding static data from DFS.

However, we notice that when the data partitions are skewed and every worker in the cluster is identical, the large partition will keep moving around inside the cluster, which can significantly degrade performance without helping load balancing. A confined load balancing mechanism can automatically identify the large partition and break it into several small sub-partitions distributed to different workers.
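The master-side imbalance check described above (average the per-task iteration times excluding the longest and shortest, then compare each task's deviation against a threshold) can be sketched as follows; the relative-threshold formulation and the 20% example value are assumptions for illustration.

```java
import java.util.Arrays;

class LoadBalanceCheck {
    // times[i] = processing time of reduce task i for the completed iteration.
    // Returns the index of a task to migrate, or -1 if the load is balanced enough.
    static int findTaskToMigrate(double[] times, double relThreshold) {
        if (times.length < 3) return -1;  // need enough reports to trim min and max
        double[] sorted = times.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) sum += sorted[i];  // drop min and max
        double avg = sum / (sorted.length - 2);

        int slowest = -1;
        double worst = 0;
        for (int i = 0; i < times.length; i++) {
            double dev = (times[i] - avg) / avg;  // relative deviation from the trimmed mean
            if (dev > relThreshold && dev > worst) { worst = dev; slowest = i; }
        }
        return slowest;
    }
}
// e.g., findTaskToMigrate(times, 0.2) flags a task more than 20% slower than average
```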

3.5 APIs

Following these design ideas, we implement a prototype of iMapReduce [18] based on Hadoop [15] and the Hadoop Online Prototype (HOP) [16]. Our prototype supports any Hadoop job, and in addition supports the implementation of iterative algorithms. Users can turn on the iterative processing functionality to implement iterative algorithms, or turn it off to run ordinary MapReduce jobs.

To implement an iterative algorithm in iMapReduce, users should implement the following interfaces:

– void map(Key, StateValue, StaticValue). The map interface in iMapReduce has one input key Key and two input values: the state data value StateValue and the static data value StaticValue. Both values are associated with the same input key. The iMapReduce framework joins the state data and the static data automatically, so that users can focus on describing the map computation.

– void reduce(Key, StateValue). The reduce interface in iMapReduce is the same as that in MapReduce, with an input key and an input value. Note that the input value is state data that has been separated from the static data.

– float distance(Key, PrevState, CurrState). Users implement this interface to specify the distance measure using a key's previous state value and its current state value. The returned float values for all keys are accumulated to obtain the distance between two consecutive iterations' results. For example, the Manhattan distance or the Euclidean distance can be used to quantify the difference (see the sketch after this list).
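For instance, a per-key Manhattan distance for PageRank could be implemented as below. It is written in plain Java because the paper does not reproduce the exact iMapReduce type signatures, so the parameter types are assumptions.

```java
// Per-key Manhattan distance |R_prev(v) - R_curr(v)|; the framework
// accumulates these values over all keys to get the global distance.
class PageRankDistance {
    public float distance(long key, double prevScore, double currScore) {
        return (float) Math.abs(currScore - prevScore);
    }
}
```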

In addition, iMapReduce provides the following job parameters (i.e., JobConf parameters) to help users specify iterative computation:

– job.set("mapred.iterjob.statepath", path). Set the DFS path of the initial state data.

– job.set("mapred.iterjob.staticpath", path). Set the DFS path of the static data.

– job.setInt("mapred.iterjob.maxiter", n). Set the maximum number of iterations n after which the iterative computation terminates.

– job.setFloat("mapred.iterjob.disthresh", ε). Set the distance threshold ε for terminating an iterative computation; this is used together with the distance interface.

To show how to implement iterative algorithms in iMapReduce, an example PageRank implementation is given in Fig. 3. In the map function (Lines 1–4), each page's PageRank score is evenly distributed to its neighbors, while the page retains (1−d)/N itself, where N is the total number of pages in the graph. In the reduce function (Line 5), each page combines the partial PageRank scores from its incoming neighbors with its own retained score. For the distance measurement, we calculate the Manhattan distance (Line 6). Additionally, we specify the location of the initial state data (Line 11) and the location of the static data (Line 12). iMapReduce supports automatic graph partitioning and graph loading for a few particular graph formats (including weighted and unweighted graphs); users can first convert their graphs into one of the supported formats. With the distance-based termination check, the iterative computation terminates when the distance measured by the distance function is smaller than 0.01 (Line 13).

Fig. 3 PageRank implementation in iMapReduce
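Since Fig. 3 is reproduced here only as a caption, the following sketch shows what the driver portion might look like, using only the job parameters listed in Section 3.5. The class name, DFS paths, and the maximum iteration count of 50 are placeholders; only the 0.01 distance threshold comes from the text.

```java
import org.apache.hadoop.mapred.JobConf;

public class PageRankDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(PageRankDriver.class);
        job.setJobName("iMapReduce PageRank");

        // initial state data: every page starts with rank 1/|V|
        job.set("mapred.iterjob.statepath", "/pagerank/initial-ranks");
        // static data: the adjacency lists, loaded once and kept local
        job.set("mapred.iterjob.staticpath", "/pagerank/adjacency-lists");

        // stop after at most 50 iterations (placeholder value), or once the
        // accumulated Manhattan distance between consecutive iterations < 0.01
        job.setInt("mapred.iterjob.maxiter", 50);
        job.setFloat("mapred.iterjob.disthresh", 0.01f);

        // the map/reduce/distance implementations would be registered here
        // and the job submitted, as in an ordinary Hadoop driver
    }
}
```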

4 Evaluation

In this section, we evaluate iMapReduce. Two typical graph-based iterative algorithms are considered: SSSP and PageRank. We compare the performance of the two algorithms implemented in iMapReduce with that of their implementations in Hadoop [15].

4.1 Experiment Setups

4.1.1 Cluster Environment

Our experiments are performed on both a local cluster of commodity machines and a cluster built on the Amazon EC2 cloud [1]. Hadoop's block size is set to 64 MB, and Hadoop's heap size is set to 2 GB. We describe the cluster settings as follows.

Local Cluster A local cluster containing 4 nodes is used to run experiments. Each node has an Intel(R) Core(TM)2 Duo E8200 dual-core 2.66 GHz CPU, 3 GB of RAM, 160 GB of storage, and runs a 32-bit Linux Debian 4.0 OS. These four nodes are connected by a switch with a communication bandwidth of 1 Gbps.

Amazon EC2 cluster We build a test cluster on Amazon EC2. There are 80 small instances involved in our experiments. Each instance has 1.7 GB of memory, an Intel Xeon E5430 2.66 GHz CPU, 146.77 GB of instance storage, and runs a 32-bit Linux Debian 4.0 OS.

4.1.2 Data Sets

We implement SSSP and PageRank under iMapReduce and evaluate their performance on real graphs and synthetic graphs. We generate the synthetic graphs in order to evaluate iMapReduce on graphs of different sizes.

We first describe the graphs used for running the SSSP algorithm. (1) DBLP author cooperation graph: each node represents an author, and each link between two nodes represents a cooperation relationship between the two authors; the link weight is set according to the cooperation frequency between the two linked authors. (2) Facebook user interaction graph [38]: each node represents a Facebook user, and each link between two nodes implies that the two users are friends; the user interaction frequency is used to assign weights to the friendship links. (3) Synthetic graphs: the log-normal parameters of the link weight distribution (shape parameter σ = 1.2, scale parameter μ = 0.4) and of the node out-degree distribution (shape parameter σ = 1.0, scale parameter μ = 1.5) are extracted by averaging over the above two real graphs [8]. Based on these log-normal parameters, we generate three log-normal synthetic graphs, containing 1 million, 10 million, and 50 million nodes, respectively. Table 1 gives a brief description of these data sets.

Table 1 SSSP data sets statistics

Graph      # of nodes   # of edges    File size
DBLP       310,556      1,518,617     16 MB
Facebook   1,204,004    5,430,303     58 MB
SSSP-s     1M           7,868,140     87 MB
SSSP-m     10M          78,873,968    958 MB
SSSP-l     50M          369,455,293   5.19 GB

The data sets used for PageRank are described as follows. Two real graphs are considered: (1) the Google webgraph [22] and (2) the Berkeley-Stanford webgraph [22]. Unlike SSSP's weighted graphs, these graphs are unweighted. (3) The log-normal parameters of the node out-degree distribution (shape parameter σ = 2, scale parameter μ = −0.5) are extracted from the above real graphs to generate three synthetic graphs, containing 1 million, 10 million, and 30 million nodes, respectively. Table 2 gives a brief description of these data sets.

Table 2 PageRank data sets statistics

Graph        # of nodes   # of edges    File size
Google       916,417      6,078,254     49 MB
Berk-Stan    685,230      7,600,595     57 MB
PageRank-s   1M           7,425,360     61 MB
PageRank-m   10M          75,061,501    690 MB
PageRank-l   30M          224,493,620   2.26 GB

4.2 Local Cluster Experiments

On our local cluster, we use real graphs as input to evaluate SSSP and PageRank under both MapReduce and iMapReduce. As discussed in Section 3, three factors help reduce the running time: (1) with iterative processing support, iMapReduce performs one-time initialization rather than spending time on job/task initialization in each iteration as Hadoop does; (2) by maintaining static data locally, iMapReduce avoids static data shuffling; (3) by asynchronous map execution, iMapReduce eliminates execution delay. Besides measuring the running time of MapReduce and iMapReduce, we measure how much each of these factors helps improve performance.

In order to investigate how much each of these three factors contributes to the performance improvement, we proceed in the following steps. (1) We first record the MapReduce job's running time and the iMapReduce job's running time as two extreme reference points; the difference between the two indicates the total time saved by iMapReduce. (2) Using iMapReduce, we let the map tasks execute synchronously and record the running time, so that the difference from (asynchronous) iMapReduce indicates the contribution of asynchronous map execution. (3) To measure the job/task initialization time consumed in the Hadoop implementations, we record the initialization time, i.e., the period from job submission to the average time point at which the map tasks start performing map operations. The time spent on job/task initialization in the Hadoop implementations is the sum of the initialization times over all iteration jobs, while iMapReduce performs initialization only once, in the initialization stage; the estimated initialization time of the Hadoop jobs, minus that of the iMapReduce job, is the time saved by one-time initialization. (4) The contribution from avoiding static data shuffling is estimated by subtracting the savings measured in (2) and (3) from the reference Hadoop running time.

Figure 4 shows the running time of SSSP on the DBLP author cooperation graph, and Fig. 5 shows that on the Facebook user interaction graph. We can see that iMapReduce achieves a factor of 2–3 speedup over the Hadoop implementation. The plot labeled MapReduce (ex. init.) shows the MapReduce running time excluding the initialization time of each job, and the plot labeled iMapReduce (sync.) shows the running time of synchronous iMapReduce, where we let the map tasks execute synchronously.

Fig. 4 The running time of SSSP on the DBLP author cooperation graph (time (s) vs. iterations; curves: MapReduce, MapReduce (ex. init.), iMapReduce (sync.), iMapReduce)

Fig. 5 The running time of SSSP on the Facebook user interaction graph (time (s) vs. iterations; same four curves)

We can see that one-time initialization reduces the running time by about 20%, as illustrated by the plot labeled MapReduce (ex. init.). Another 15% of the running time is saved by asynchronous map execution, as illustrated by the plot labeled iMapReduce (sync.). The remaining improvement is achieved by avoiding static data shuffling, which saves about 20% of the running time.

Figure 6 shows the running time of PageRank on the Google webgraph, and Fig. 7 shows that on the Berkeley-Stanford webgraph. Compared with the Hadoop implementation, iMapReduce achieves about a 2 times speedup. In addition, we can see that 10% of the running time is saved by one-time initialization, 30% by avoiding static data shuffling, and another 10% by asynchronous map execution.

Fig. 6 The running time of PageRank on the Google webgraph (time (s) vs. iterations; curves: MapReduce, MapReduce (ex. init.), iMapReduce (sync.), iMapReduce)

Fig. 7 The running time of PageRank on the Berkeley-Stanford webgraph (time (s) vs. iterations; same four curves)

4.3 EC2 Cluster Experiments

We deploy the iMapReduce prototype on an Amazon EC2 cluster and perform experiments on the synthetic graphs.

4.3.1 Running Time

We run SSSP on the three synthetic graphs SSSP-s, SSSP-m, and SSSP-l on the Amazon EC2 cluster (20 small instances). We limit the computation to ten iterations and compare the running times on the synthetic graphs of different sizes. Figure 8 shows the result. The iMapReduce implementation reduces the running time to 23.2%, 37.0%, and 38.6% of Hadoop's for SSSP-s, SSSP-m, and SSSP-l, respectively. We can see that iMapReduce performs better when the input is small. The main iterative computation takes less time on a small input, so relatively more time is spent on job/task initialization. Since iMapReduce does not perform job/task initialization for each iteration, its one-time initialization reduces the total running time significantly.

Fig. 8 The running time of SSSP on the synthetic graphs

Similarly, PageRank is executed for ten iterations on the three synthetic graphs PageRank-s, PageRank-m, and PageRank-l on the Amazon EC2 cluster (20 small instances). The results are shown in Fig. 9. The most significant speedup is achieved on the PageRank-s graph, where the running time is reduced to 44% of the Hadoop implementation's. For the PageRank-m and PageRank-l graphs, iMapReduce reduces the running time to about 60%.

To explore the contributions of the different factors, Fig. 10 shows the reduction in running time attributable to each factor, i.e., one-time initialization, static data shuffling avoidance, and asynchronous map execution. SSSP and PageRank are each run for 10 iterations, on the SSSP-m graph and the PageRank-m graph, respectively. We can see that the running time reductions from one-time initialization and from asynchronous map execution are both around 5–10%. The time saved by avoiding static data shuffling is proportional to the input static data size (SSSP-m 958 MB and PageRank-m 690 MB): shuffling larger static data takes more time, so relatively more time is saved when the static data are larger.

Fig. 9 The running time of PageRank on the synthetic graphs

Fig. 10 Different factors’ effects on running timereduction

4.3.2 Communication Cost

In Hadoop MapReduce implementations, large amounts of data are communicated between map tasks and reduce tasks. Reducing the amount of data communicated not only helps improve performance but also saves communication resources. iMapReduce saves network communication by avoiding static graph shuffling. To quantify this saving, we measure the total amount of data exchanged when performing the iterative algorithms on the SSSP-l graph and on the PageRank-l graph. As shown in Fig. 11, iMapReduce significantly reduces the communication cost: the amount of data exchanged is reduced to only about 12%.

Fig. 11 Total communication cost


Fig. 12 The speedup over the Hadoop implementations for running SSSP when scaling the cluster size from 20 to 80

4.3.3 Scaling Performance

Since our prototype is implemented on top of Hadoop MapReduce, which scales well, the scalability of the iMapReduce prototype should meet most applications' needs. We scale our Amazon EC2 cluster to 50 instances and 80 instances, and run SSSP and PageRank on the SSSP-l graph and the PageRank-l graph, respectively. As expected, iMapReduce accelerates algorithm convergence when using more computing resources. Moreover, iMapReduce performs even better on a larger cluster.

Figure 12 shows the running time of SSSP as we scale the cluster size: the running-time ratio of iMapReduce to MapReduce drops by 8% when we scale the cluster from 20 instances to 80 instances. Figure 13 shows the corresponding result for PageRank: the ratio drops by 7% over the same scaling. We explain these results as follows. The bigger the cluster, the more network communication occurs; since iMapReduce aims at reducing network communication, its advantages are more pronounced on a bigger cluster.

Fig. 13 The speedup over the Hadoop implementations for running PageRank when scaling the cluster size from 20 to 80

4.3.4 Parallel Efficiency

We measure the parallel efficiencies of iMapReduce and MapReduce in the context of SSSP and PageRank. The parallel efficiency is defined as

$$\text{Parallel efficiency} = \frac{T^*}{T_n \times n}, \qquad (2)$$

where T^* is the running time of an application on a single EC2 instance (no communication and no synchronization between machines), n is the number of instances, and T_n is the running time of the application on an n-instance cluster. We first run an application on a single EC2 instance to obtain T^*, with the partition number set to one. Then we run the application on clusters of 20, 50, and 80 instances, respectively.

Figure 14 shows the parallel efficiencies for the SSSP and PageRank computations. We can see that iMapReduce yields higher parallel efficiency than Hadoop MapReduce for both SSSP and PageRank. For SSSP, the performance slowdown with MapReduce is around 60% when scaling the cluster size to 80, while with iMapReduce it is around 43%. For PageRank, the parallel efficiency of iMapReduce is likewise much better than that of MapReduce.

Fig. 14 Parallel efficiencies of iMapReduce and MapReduce for SSSP and PageRank

5 Extensions of iMapReduce

So far, we have focused on supporting graph-based iterative algorithms in iMapReduce. However, iMapReduce can be extended to implement other iterative algorithms as well. In Section 2.2, we made two assumptions: (1) there is a one-to-one correspondence from reducers to mappers, and (2) each iteration is described by a single MapReduce job. In this section, we present how to extend iMapReduce to support iterative algorithms for which these two assumptions do not hold.

5.1 Accommodating Broadcast from Reduce to Map

In an iterative algorithm, it is possible that a mapper needs all reducers' output. For example, to implement the Jacobi method [4] in MapReduce, according to x^(k+1) = D^−1(b − Rx^(k)), each reducer calculates a part of the iterated vector, and every mapper needs the entire vector x for its computation, where D is a diagonal matrix. Another famous iterative algorithm, K-means clustering, also requires all reducers' output for the map operation. iMapReduce accommodates broadcast from reduces to maps as shown in Fig. 15: each reduce task can send its output key-value pairs to all map tasks. Note that broadcast from reduces to maps is different from the shuffle from maps to reduces. With broadcast, each reduce task sends n copies of its output to the n map tasks (each map task receives the whole output), while with shuffle, each map task splits its output into n partitions and sends them to the n reduce tasks (each reduce task receives a part of the output), where n is the number of map/reduce tasks. In the following, we present the K-means algorithm in iMapReduce to show how iMapReduce accommodates broadcasting.

Fig. 15 Broadcast from reduces to maps

5.1.1 K-means Clustering Algorithm

K-means [25] is a commonly used clustering algorithm, which partitions n nodes into k clusters so that nodes in the same cluster are more similar to each other than to those in other clusters. Briefly, the algorithm proceeds as follows: (1) select k random nodes as initial cluster centroids; (2) assign each node to the nearest cluster centroid; (3) update the k cluster centroids by "averaging" the nodes belonging to each centroid. Steps (2) and (3) are repeated until convergence is reached.

The map and reduce operations for the K-means algorithm can be described as follows.

Map For each key nid (node id), compute the distance from the node to every cluster centroid, and output the closest cluster id cid along with the node's coordinates ncoord, i.e., output the key-value pair 〈cid, ncoord〉.

Reduce For each key cid (cluster id), update the cluster centroid coordinates by averaging the coordinates of all nodes that belong to cid, and output cid along with the cluster's updated centroid coordinates ccoord, i.e., output the key-value pair 〈cid, ccoord〉.
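In plain Java, the two steps above might look as follows; Emitter is again an illustrative stand-in, and the centroids array represents the broadcast state data that every map task receives.

```java
import java.util.List;

class KMeansFunctions {
    interface Emitter { void emit(int cid, double[] coord); }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Map: for node nid with coordinates ncoord, emit <closest cid, ncoord>.
    // centroids holds all k cluster centroids, broadcast from the reduce tasks.
    static void map(long nid, double[] ncoord, double[][] centroids, Emitter out) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (sqDist(ncoord, centroids[c]) < sqDist(ncoord, centroids[best])) best = c;
        out.emit(best, ncoord);
    }

    // Reduce: the new centroid of cluster cid is the mean of its members' coordinates.
    static void reduce(int cid, List<double[]> members, Emitter out) {
        int dim = members.get(0).length;
        double[] mean = new double[dim];
        for (double[] p : members)
            for (int i = 0; i < dim; i++) mean[i] += p[i];
        for (int i = 0; i < dim; i++) mean[i] /= members.size();
        out.emit(cid, mean);
    }
}
```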

The map function needs all the cluster centroids in order to find the closest one for a given node. This means that the mapping from reduce to map is not one-to-one but one-to-all: the state data, i.e., the set of cluster centroids, must be sent from each reduce task to all map tasks. Another difference between K-means and the graph-based iterative algorithms described in Section 2 is that not only does the map operation need the static data (i.e., the coordinates of all nodes) for measuring distances to the centroids, but the reduce operation also needs the static data for the averaging operation. Therefore, the coordinates of nodes have to be shuffled from map to reduce.

5.1.2 iMapReduce Implementation

To support "K-means-like" iterative algorithms, iMapReduce lets reduce tasks broadcast the updated state data to all map tasks. For this one-to-all mapping, the map function operates on a static data record together with multiple state data records. Accordingly, users should specify the mapping type by setting

– job.set("mapred.iterjob.mapping","one2all").

Otherwise, the framework uses "one2one" by default. In addition, the map function's input parameter StateValue (Section 3.5), which originally contains a single state value, is extended to a list of state values. In the case of K-means, the state value list is the set of all cluster centroids.

Further, the map operation needs the output from multiple reduce tasks, and it cannot be started before all of its input data arrive. In the case of K-means, the map operation cannot be started before all the updated cluster centroids are collected. That is, the map tasks cannot be executed asynchronously. Therefore, the option for synchronous map execution should be turned on by setting

– job.setBoolean("mapred.iterjob.sync", true).

In general, we trigger the map operations synchronously once all the reduce operations that supply their input have completed.
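Putting the two settings together, a K-means job in iMapReduce might be configured as in the following sketch. Only the two mapred.iterjob.* settings are quoted from the text above; the surrounding setup mirrors ordinary Hadoop JobConf usage, and KMeansMap/KMeansReduce are placeholder class names.

import org.apache.hadoop.mapred.JobConf;

public class KMeansJobSetup {
  // Sketch only: KMeansMap and KMeansReduce are placeholders for the
  // map and reduce operations of Section 5.1.1.
  public static JobConf configure() {
    JobConf job = new JobConf(KMeansJobSetup.class);
    job.setJobName("kmeans-imapreduce");
    job.setMapperClass(KMeansMap.class);
    job.setReducerClass(KMeansReduce.class);

    // Broadcast reduce output to every map task (one-to-all mapping),
    // so each mapper receives the full set of updated centroids.
    job.set("mapred.iterjob.mapping", "one2all");

    // Map tasks must wait until all reduce tasks of the previous
    // iteration have finished, i.e., synchronous map execution.
    job.setBoolean("mapred.iterjob.sync", true);
    return job;
  }
}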

5.1.3 Experimental Results

We implement K-means in iMapReduce and run the algorithm on a data set collected from Last.fm [21]. The users' listening history log is used to cluster users based on their tastes. This log contains each user's artist preference information, quantified by the number of times the user has listened to the artist; sharing a preferred artist indicates a common taste. The Last.fm data set (1.5 GB) has 359,347 user records, and each user has 48.9 preferred artists on average.

Figure 16 shows the K-means running time, limited to ten iterations, on our local cluster. iMapReduce achieves about 1.2× speedup over Hadoop. This is less significant than the speedup achieved for the SSSP and PageRank algorithms. Nevertheless, it is in line with our expectations, since the K-means implementation needs to shuffle static data and has to execute map operations synchronously.

It is also possible to reduce the shuffling of static data by using the Combiner provided by Hadoop MapReduce. A Combiner performs local aggregation at the map side before shuffling the intermediate data, in order to reduce the amount of shuffled data. We also perform the K-means experiments with a Combiner (all other experiment configurations are the same). As expected, the running time is reduced in both Hadoop and iMapReduce. For the Hadoop implementation, the running time is reduced from 2881 s to 2226 s (a 23% reduction), while for the iMapReduce implementation, it is reduced from 2338 s to 1733 s (a 26% reduction).
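The combiner trick works because an average can be assembled from partial aggregates: each map-side combiner emits, per cluster, the component-wise sum of its local points together with their count, and the reducer divides the global sum by the global count. The following sketch illustrates the idea; the PartialSum carrier type and the emit callback are illustrative assumptions rather than Hadoop classes.

import java.util.function.BiConsumer;

// Illustrative sketch of a map-side combiner for K-means: per cluster,
// pre-aggregate local points into (component-wise sum, count) so that only
// one record per cluster per map task is shuffled.
public class KMeansCombine {

  // Carrier for a partial aggregate; the reducer adds the sums and counts
  // from all map tasks and divides to obtain the new centroid.
  static final class PartialSum {
    final double[] sum; final long count;
    PartialSum(double[] sum, long count) { this.sum = sum; this.count = count; }
  }

  static void combine(int cid, Iterable<double[]> localPoints,
                      BiConsumer<Integer, PartialSum> emit) {
    double[] sum = null;
    long count = 0;
    for (double[] p : localPoints) {
      if (sum == null) sum = new double[p.length];
      for (int i = 0; i < p.length; i++) sum[i] += p[i];
      count++;
    }
    emit.accept(cid, new PartialSum(sum, count));
  }
}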

Fig. 16 Running time of K-means for clustering Last.fm data on the local cluster


5.2 Accommodating Multiple Map-Reduce Phases

In practice, it is common that a single map-reduce pass over the input is not sufficient to express an algorithm's logic. To solve this problem, we can perform multiple map-reduce steps in each iteration. In this section, we discuss the extensions of iMapReduce that support multiple map-reduce phases per iteration.

Figure 17 shows an example of the two-phase case. The state data are processed by two map-reduce phases in each iteration, and the static data can be joined with the state data at the two different map steps respectively. To join the state data with their corresponding static data, the key is to specify the mapping from the reducers of the previous phase to the mappers of the next phase, as well as the mapping from the reducers of the last phase to the mappers of the first phase. In the following, we give an example of matrix power computation to illustrate the multiple map-reduce phases case.

5.2.1 Matrix Power Computation

Fig. 17 Two map-reduce phases for each iteration in iMapReduce

Square matrices can be multiplied by themselves repeatedly. This repeated multiplication can be described as a power of the matrix, i.e., M^k = ∏_{i=1}^{k} M. The operation of each iteration is a matrix multiplication, M^k = M × N, where N = M^{k−1}. If M is a matrix with element m_{ij} in row i and column j, and N is a matrix with element n_{jk} in row j and column k, then the product P = MN is the matrix with element p_{ik} in row i and column k, where p_{ik} = ∑_j m_{ij} n_{jk}.

In the MapReduce framework, we use two map-reduce phases to perform the matrix multiplication [29], which together form each iteration. In the first phase, the map extracts the columns of M and the rows of N, and the reduce joins column j of M with row j of N. In the second phase, the map multiplies a column vector with the joined row vector to obtain a matrix, and the reduce sums these matrices to obtain the final result matrix. The map and reduce operations in each iteration are described as follows.

Map 1 For each key (i, j), map the matrix element m_{ij} of M to the key-value pair 〈j, (M, i, m_{ij})〉. For each key (j, k), map the matrix element n_{jk} of N to the key-value pair 〈j, (N, k, n_{jk})〉.

Reduce 1 For each key j, collect its list of associated values: 〈j, [(M, i1, m_{i1,j}), (M, i2, m_{i2,j}), . . . , (N, k1, n_{j,k1}), (N, k2, n_{j,k2}), . . .]〉.

Map 2 Take the output key-value pairs of Reduce 1. For each value that comes from M, say (M, i, m_{ij}), and each value that comes from N, say (N, k, n_{jk}), produce the key-value pair 〈(i, k), m_{ij} n_{jk}〉. This outputs all the combinations 〈(i1, k1), m_{i1,j} n_{j,k1}〉, 〈(i1, k2), m_{i1,j} n_{j,k2}〉, . . . , 〈(i2, k1), m_{i2,j} n_{j,k1}〉, . . . .

Reduce 2 For each key (i, k), produce the sum of the list of values associated with this key. The resulting key-value pair is 〈(i, k), p_{ik}〉, where p_{ik} = ∑_j m_{ij} n_{jk}.
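To make the second phase concrete, the following is a minimal sketch of Map 2 and Reduce 2 in Java; the Entry carrier type and the emit callbacks are illustrative assumptions, and Map 1/Reduce 1 are omitted since they only regroup elements by j.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative sketch of the second phase of matrix multiplication.
public class MatrixPhase2 {

  // An entry tagged 'M' carries (i, m_ij); one tagged 'N' carries (k, n_jk).
  static final class Entry {
    final char tag; final int index; final double value;
    Entry(char tag, int index, double value) {
      this.tag = tag; this.index = index; this.value = value;
    }
  }

  // Map 2: for key j, pair every M-entry with every N-entry and
  // emit <(i, k), m_ij * n_jk>.
  static void map2(int j, List<Entry> values, BiConsumer<long[], Double> emit) {
    List<Entry> ms = new ArrayList<>(), ns = new ArrayList<>();
    for (Entry e : values) (e.tag == 'M' ? ms : ns).add(e);
    for (Entry m : ms)
      for (Entry n : ns)
        emit.accept(new long[]{m.index, n.index}, m.value * n.value);
  }

  // Reduce 2: sum the partial products for key (i, k) to obtain p_ik.
  static void reduce2(long[] ik, Iterable<Double> products,
                      BiConsumer<long[], Double> emit) {
    double sum = 0;
    for (double v : products) sum += v;
    emit.accept(ik, sum);
  }
}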

5.2.2 iMapReduce Implementation

In iMapReduce, we connect Reduce 1 to Map 2 such that both operate on the same key j, and connect Reduce 2 to Map 1 such that both operate on the same key (i, k). Since each connected reduce and map pair operates on the same type of keys, we can connect them with a one-to-one mapping. To avoid static data shuffling, we note that M is a static multiplier used in each iteration, so we make M the static data, which is joined with N at Map 2. In the first map-reduce phase there is no join operation, so no static data are set.

Accordingly, besides setting up the first map-reduce phase by configuring a JobConf job1, users should set up the second map-reduce phase by configuring another JobConf job2. These two jobs are chained together by setting

– job1.addSuccessor(job2)
– job2.addSuccessor(job1).

Otherwise, if a single map-reduce phase is sufficient to execute an iteration, the job sets the successor to be itself by default. The framework connects these map/reduce functions according to the chained order.
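The following sketch shows how the two phases might be wired together; the Phase1*/Phase2* classes are placeholders, and addSuccessor is the iMapReduce chaining call quoted above (it is not part of stock Hadoop).

import org.apache.hadoop.mapred.JobConf;

public class MatrixPowerSetup {
  // Sketch only: chain the two map-reduce phases of one iteration
  // into a cycle.
  public static void configure() {
    JobConf job1 = new JobConf(MatrixPowerSetup.class); // Map 1 / Reduce 1
    job1.setMapperClass(Phase1Map.class);
    job1.setReducerClass(Phase1Reduce.class);

    JobConf job2 = new JobConf(MatrixPowerSetup.class); // Map 2 / Reduce 2
    job2.setMapperClass(Phase2Map.class);
    job2.setReducerClass(Phase2Reduce.class);

    // Close the loop: phase 1 feeds phase 2, and phase 2 feeds the
    // next iteration's phase 1.
    job1.addSuccessor(job2);
    job2.addSuccessor(job1);
  }
}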

5.2.3 Experimental Results

We perform matrix power computation on a synthetic 1000 × 1000 matrix for five iterations in iMapReduce as well as in Hadoop, conducted on our local cluster. In the matrix power computation, the major portion of the running time is consumed by the intermediate data shuffling between Map 2 and Reduce 2, which is unavoidable. However, as shown in Fig. 18, iMapReduce can still achieve about 10% speedup over Hadoop through its various framework optimizations.

Fig. 18 Running time of matrix power computation on the local cluster

5.3 Accommodating Auxiliary Map-Reduce Phase

Some applications require running an auxiliary map-reduce phase that generates auxiliary information. The main map-reduce phase performs the iterative computation, while the auxiliary map-reduce phase takes the main phase's output and produces auxiliary information. Figure 19 shows a typical structure of this case. The execution of the main phase takes into consideration the result of the auxiliary phase, and the auxiliary phase generates the auxiliary information without pausing the active computation in the main phase. In the following, we illustrate a common usage of the auxiliary map-reduce phase: detecting the convergence of an iterative computation.

Fig. 19 Auxiliary map-reduce phase in iMapReduce

5.3.1 Convergence Detection of K-means

iMapReduce can terminate an iterative computation based on a maximum iteration number or on the difference between two consecutive iterations' results. However, iMapReduce provides only a limited number of choices for measuring the difference (Manhattan distance or Euclidean distance), which is not sufficient to detect convergence for a broad class of algorithms. For example, the K-means algorithm is considered converged when the number of nodes that move between clusters is less than a threshold. The main map-reduce phase (Map 1 and Reduce 1) for computing K-means has been discussed in Section 5.1. We use an auxiliary map-reduce phase to detect convergence. The map and reduce operations in the auxiliary map-reduce phase are described as follows.

Map 2 For each cluster cid, compute the number of nodes in cid that were already assigned to cid in the previous iteration, say num_stay, and output 〈0, num_stay〉, where 0 is a unique key, so that all the mappers' outputs are shuffled to a single reducer.

Reduce 2 In the sole reducer assigned for key 0, sum the num_stay values from the different mappers to derive the total number of nodes that moved between clusters, num_move = |S| − ∑ num_stay, where |S| is the total number of input nodes, and broadcast a termination signal to all the mappers of Map 1 if num_move is less than a threshold.
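A minimal sketch of this auxiliary pair follows; how num_stay is computed against the previous iteration's assignments, and the signalTermination hook through which Reduce 2 notifies the Map 1 tasks, are abstractions we introduce for illustration.

import java.util.function.BiConsumer;

// Illustrative sketch of the auxiliary convergence-detection phase.
public class ConvergenceCheck {

  // A single shared key funnels every map output to one reducer.
  static final int ALL_KEY = 0;

  // Map 2: forward each cluster's stay-count under the shared key.
  static void map2(int cid, long numStay, BiConsumer<Integer, Long> emit) {
    emit.accept(ALL_KEY, numStay);
  }

  // Reduce 2: num_move = |S| - sum(num_stay); signal termination when it
  // drops below the threshold.
  static void reduce2(int key, Iterable<Long> stays, long totalNodes,
                      long threshold, Runnable signalTermination) {
    long stayed = 0;
    for (long s : stays) stayed += s;
    long numMove = totalNodes - stayed;
    if (numMove < threshold) signalTermination.run();
  }
}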

5.3.2 iMapReduce Implementation

In iMapReduce, these two map-reduce phases run in parallel. Once the map tasks in the main phase receive the termination signals from Reduce 2, they terminate themselves and notify the master. To attach an auxiliary map-reduce phase in iMapReduce, users should set

– job1.addAuxiliary(job2)

where job1 corresponds to the main map-reduce phase, and job2 corresponds to the auxiliary map-reduce phase. In addition, the one-to-all mapping from Reduce 2 to Map 1 should be specified.
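Wiring the auxiliary phase to the main phase might then look as follows; we assume here that the one-to-all mapping from Reduce 2 to Map 1 is specified with the same mapred.iterjob.mapping setting used in Section 5.1, which is our own assumption rather than something stated in the text.

import org.apache.hadoop.mapred.JobConf;

public class KMeansWithConvergence {
  // Sketch only: attach the convergence-detection job to the main
  // K-means job; addAuxiliary is the iMapReduce call quoted above.
  public static void configure(JobConf job1 /* main K-means phase */,
                               JobConf job2 /* convergence check */) {
    job1.addAuxiliary(job2);

    // Reduce 2's termination signal must reach every Map 1 task,
    // hence a one-to-all mapping for the auxiliary phase's output.
    job2.set("mapred.iterjob.mapping", "one2all");
  }
}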

Fig. 20 Running time of K-means with convergence detection

5.3.3 Experimental Results

Based on the experiment setup described in Section 5.1.3, we perform K-means with an auxiliary phase for convergence detection in iMapReduce. For comparison with the MapReduce implementation, we also implement an additional Hadoop job for convergence detection, which is executed between two consecutive K-means computation jobs. In iMapReduce, by contrast, the two map-reduce phases run in parallel (asynchronously), which helps reduce the running time. The results are shown in Fig. 20, where the algorithm terminates after six iterations. We can see that the running time is reduced by 25%, which mainly results from eliminating the synchronously executed auxiliary job.

6 Related Work

MapReduce, as a popular distributed framework for data-intensive computation, has gained considerable attention over the past few years [12]. The framework has been extended for diverse application requirements. MapReduce Online [9] pipelines map/reduce operations and performs online aggregation to support efficient online queries, which directly inspired our work.

To support implementing large-scale iterative algorithms, a number of studies have proposed new distributed computing frameworks for iterative processing [5, 13, 19, 20, 24, 27, 30, 33, 34, 39, 41].

A class of these efforts targets managing static data efficiently. Design patterns for running efficient graph algorithms in MapReduce have been introduced in [24]. They partition the static graph adjacency list into n parts and pre-store them on DFS. However, since the MapReduce framework arbitrarily assigns reduce tasks to workers, accessing the graph adjacency list can involve remote reads, so local access to the static data cannot be guaranteed. HaLoop [5] was proposed for iterative processing on large clusters. It realizes the join of the static data and the state data by explicitly specifying an additional MapReduce job, and relies on the task scheduler and caching techniques to maintain local access to static data. iMapReduce, in contrast, relies on persistent tasks to manage static data and to avoid task initialization.

Some studies accelerate iterative algorithms by maintaining the iterated state data in memory. Spark [39] was developed to optimize iterative and interactive computation. It uses caching techniques to dramatically improve the performance of repeated operations. The main idea in Spark is the construction of the resilient distributed dataset (RDD), a read-only collection of objects maintained in memory across iterations that supports fault recovery. [26] presents a generalized architecture for continuous bulk processing (CBP), which performs iterative computations in an incremental fashion by unifying stateful programming with a data-parallel operator. CIEL [31] supports data-dependent iterative or recursive algorithms by building an abstract dynamic task graph. Piccolo [34] allows computations running on different machines to share distributed, mutable state via a key-value table interface. This enables one to implement iterative algorithms that access in-memory distributed tables without worrying about the consistency of the data. PrIter [42] enables prioritized iteration, which exploits the dominant property of some portion of the data and schedules it first for computation, rather than blindly performing computations on all data. This is realized by maintaining a state table and a priority queue in memory. Our iMapReduce framework is built on Hadoop; the iterated state data as well as the static data are maintained in files rather than in memory. Therefore, it is more scalable and more resilient to failures.

Some other efforts focus on graph-based iterative algorithms, an important class of iterative algorithms. PEGASUS [20] models these seemingly different graph iterative algorithms as a generalization of matrix-vector multiplication (GIM-V). By exploiting matrix properties, such as block multiplication, clustered edges, and diagonal block iteration, it can achieve 5× faster performance than the regular job. Pregel [27] chooses a pure message passing model to process graphs. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology. With this model, processing large graphs is expressive and easy to program. iMapReduce exploits the property that the map and reduce functions operate on the same type of keys, i.e., node id, to accelerate graph-based iterative algorithms.

The most relevant work is that of Ekanayake et al., who proposed Twister [13, 14], a stream-based MapReduce implementation that supports iterative applications. Twister employs the novel ideas of loading the input data only once in the initialization stage and performing iterative map-reduce processing with long-running map/reduce daemons. iMapReduce differs from Twister mainly in that Twister stores intermediate data in memory, while iMapReduce stores intermediate data in files. Twister loads all data into distributed memory for fast data access and relies on the assumption that the data sets and intermediate data fit into the distributed memory of the computing infrastructure. iMapReduce aims at providing a MapReduce-based iterative computing framework running on a cluster of commodity machines where each node has limited memory resources. In iMapReduce, the intermediate data, including the intermediate results, the data shuffled between map and reduce, and the static data, are all stored in files. This key difference results in different implementation mechanisms, including different data transfers and different techniques for joining static data and state data. Furthermore, iMapReduce supports asynchronous map execution, which further improves performance. Besides, iMapReduce is implemented based on Hadoop MapReduce, so iterative applications written for Hadoop can be easily modified to run on iMapReduce.

7 Conclusions

In this paper, we propose iMapReduce, which supports the implementation of iterative algorithms in a large cluster environment. iMapReduce extracts the common features of iterative algorithms and provides built-in support for these features. In particular, it proposes the concept of persistent tasks to reduce the job/task initialization overhead, provides efficient data management to avoid the shuffling of static data among tasks, and allows asynchronous map task execution when possible. Accordingly, system performance is greatly improved through these optimizations. For clarity, we first present iMapReduce for supporting graph-based iterative algorithms. Then, the original framework is slightly extended to support more iterative applications, which makes our framework more general in supporting iterative implementations. We demonstrate our results in the context of various applications, showing up to 5 times speedup over traditional Hadoop MapReduce.

Acknowledgements This work is partially supported by U.S. NSF grant CCF-1018114. Yanfeng Zhang was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed.

References

1. Amazon EC2: http://aws.amazon.com/ec2/. Accessed 2011

2. Baluja, S., Seth, R., Sivakumar, D., Jing, Y., Yagnik, J., Kumar, S., Ravichandran, D., Aly, M.: Video suggestion and discovery for YouTube: taking random walks through the view graph. In: Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 895–904 (2008)

3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 107–117 (1998)

4. Bronshtein, I.N., Semendyayev, K.A.: Handbook of Mathematics, 3rd edn. Springer, London (1997)

5. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 285–296 (2010)

6. Chakrabarti, S.: Dynamic personalized pagerank in entity-relation graphs. In: Proceedings of the 16th International Conference on World Wide Web (WWW '07), pp. 571–580 (2007)

7. Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proceedings of the 20th Neural Information Processing Systems (NIPS '06), pp. 281–288 (2006)

8. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)

9. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10), p. 21 (2010)

10. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms, 2nd edn. McGraw-Hill Higher Education (2001)

11. Data center wiki page: http://en.wikipedia.org/wiki/Datacenter. Accessed 2011

12. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

13. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.H., Qiu, J., Fox, G.: Twister: a runtime for iterative MapReduce. In: Proceedings of the 1st International Workshop on MapReduce and its Applications (MAPREDUCE '10), pp. 810–818 (2010)

14. Ekanayake, J., Pallickara, S., Fox, G.: MapReduce for data intensive scientific analyses. In: Proceedings of the 4th IEEE International Conference on eScience (eScience '08), pp. 277–284 (2008)

15. Hadoop MapReduce: http://hadoop.apache.org/. Accessed 2011

16. Hadoop Online Prototype: http://code.google.com/p/hop/. Accessed 2011

17. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22, 5–53 (2004)

18. iMapReduce on Google Code: http://code.google.com/p/i-mapreduce/. Accessed 2012

19. Kambatla, K., Rapolu, N., Jagannathan, S., Grama, A.: Asynchronous algorithms in MapReduce. In: Proceedings of the 2010 IEEE International Conference on Cluster Computing (Cluster '10), pp. 245–254 (2010)

20. Kang, U., Tsourakakis, C., Faloutsos, C.: PEGASUS: a peta-scale graph mining system implementation and observations. In: Proceedings of the 9th IEEE International Conference on Data Mining (ICDM '09), pp. 229–238 (2009)

21. Last.fm web services: http://www.last.fm/api/. Accessed 2011

22. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 695–704 (2008)

23. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 1019–1031 (2007)

24. Lin, J., Schatz, M.: Design patterns for efficient graph algorithms in MapReduce. In: Proceedings of the 8th Workshop on Mining and Learning with Graphs (MLG '10), pp. 78–85 (2010)

25. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28, 129–136 (1982)

26. Logothetis, D., Olston, C., Reed, B., Webb, K.C., Yocum, K.: Stateful bulk processing for incremental analytics. In: Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC '10), pp. 51–62 (2010)

27. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC '09), p. 6 (2009)

28. Microsoft Windows Azure platform: http://www.microsoft.com/windowsazure/. Accessed 2011

29. Mining of massive datasets: http://infolab.stanford.edu/~ullman/mmds/book.pdf. Accessed 2010

30. Murray, D.G., Hand, S.: Scripting the cloud with Skywriting. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10), p. 12 (2010)

31. Murray, D.G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S.: CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 9 (2011)

32. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. In: Proceedings of the 9th International Conference on World Wide Web (WWW '98) (1998)

33. Peng, D., Dabek, F.: Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), pp. 1–15 (2010)

34. Power, R., Li, J.: Piccolo: building fast, distributed programs with partitioned tables. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), pp. 1–14 (2010)

35. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture (HPCA '07), pp. 13–24 (2007)

36. Sarukkai, R.R.: Link prediction and path analysis using Markov chains. Comput. Netw. 33, 377–386 (2000)

37. Takács, G., Pilászy, I., Németh, B., Tikk, D.: Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res. 10, 623–656 (2009)

38. Wilson, C., Boe, B., Sala, A., Puttaswamy, K.P., Zhao, B.Y.: User interactions in social networks and their implications. In: Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09), pp. 205–218 (2009)

39. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10), p. 10 (2010)

40. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08), pp. 29–42 (2008)

41. Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. In: Proceedings of the 1st International Workshop on Data Intensive Computing in the Clouds (DataCloud '11), p. 1112 (2011)

42. Zhang, Y., Gao, Q., Gao, L., Wang, C.: PrIter: a distributed framework for prioritized iterative computations. In: Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), pp. 13:1–13:14 (2011)