![Page 1: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/1.jpg)
Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin and Michael Schatz
University of Maryland
Tuesday, June 29, 2010
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
![Page 2: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/2.jpg)
@lintool
![Page 3: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/3.jpg)
Talk Outline
Graph algorithms
Graph algorithms in MapReduce
Making it efficient
Experimental results
![Page 4: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/4.jpg)
What’s a graph?
G = (V, E), where
  V represents the set of vertices (nodes)
  E represents the set of edges (links)
  Both vertices and edges may contain additional information
Graphs are everywhere:
  E.g., hyperlink structure of the web, interstate highway system, social networks, etc.
Graph problems are everywhere:
  E.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.
![Page 5: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/5.jpg)
Source: Wikipedia (Königsberg)
![Page 6: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/6.jpg)
Graph Representation
G = (V, E)
Typically represented as adjacency lists:
  Each node is associated with its neighbors (via outgoing edges)
[Diagram: directed graph on nodes 1, 2, 3, 4, with adjacency lists]
  1: 2, 4
  2: 1, 3, 4
  3: 1
  4: 1, 3
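As a minimal sketch, the adjacency lists on this slide can be held in a Python dictionary. The helper names `out_degree` and `edges` are illustrative and do not come from the talk.

```python
# Adjacency-list representation of the four-node example graph.
# Each key is a node; its value lists the nodes reached via outgoing edges.
graph = {
    1: [2, 4],
    2: [1, 3, 4],
    3: [1],
    4: [1, 3],
}


def out_degree(graph, node):
    """Number of outgoing edges of `node` (hypothetical helper)."""
    return len(graph[node])


def edges(graph):
    """Yield every directed edge as a (source, destination) pair."""
    for source, neighbors in graph.items():
        for destination in neighbors:
            yield (source, destination)


edge_list = list(edges(graph))
```

Storing only outgoing edges keeps the structure proportional to the number of edges, which suits the sparse graphs discussed in the talk.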
![Page 7: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/7.jpg)
“Message Passing” Graph Algorithms
Large class of iterative algorithms on sparse, directed graphs
At each iteration:
  Computations at each vertex
  Partial results (“messages”) passed (usually) along directed edges
  Computations at each vertex: messages aggregate to alter state
Iterate until convergence
![Page 8: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/8.jpg)
A Few Examples…
Parallel breadth-first search (SSSP)
  Messages are distances from source
  Each node emits current distance + 1
  Aggregation = MIN
PageRank
  Messages are partial PageRank mass
  Each node evenly distributes mass to neighbors
  Aggregation = SUM
DNA Sequence assembly
  Michael Schatz’s dissertation
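The parallel breadth-first search example can be sketched as a plain in-memory loop. This is an illustration of the message-passing idea only, not a MapReduce job: each reached node emits its distance + 1 to its neighbors, and each node keeps the MIN of what it receives. The graph is the four-node example from the Graph Representation slide; the function names are assumptions.

```python
from collections import defaultdict

INF = float("inf")


def bfs_iteration(graph, dist):
    """One round: emit distance + 1 along out-edges, aggregate with MIN."""
    messages = defaultdict(list)
    for node, neighbors in graph.items():
        if dist[node] < INF:  # only nodes already reached send messages
            for neighbor in neighbors:
                messages[neighbor].append(dist[node] + 1)
    # Each node keeps the smaller of its old distance and any message.
    return {node: min([dist[node]] + messages[node]) for node in graph}


def parallel_bfs(graph, source):
    """Iterate until no distance changes (convergence)."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    while True:
        new_dist = bfs_iteration(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist


graph = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
distances = parallel_bfs(graph, source=1)
```

Swapping the emitted value for a share of PageRank mass and MIN for SUM gives the PageRank variant listed on the same slide.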
![Page 9: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/9.jpg)
PageRank in a nutshell…
Random surfer model:
  User starts at a random Web page
  User randomly clicks on links, surfing from page to page
  With some probability, user randomly jumps around
PageRank…
  Characterizes the amount of time spent on any given page
  Mathematically, a probability distribution over pages
![Page 10: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/10.jpg)
PageRank: Defined
Given page x with inlinks t1…tn, where
  C(t) is the out-degree of t
  α is probability of random jump
  N is the total number of nodes in the graph

$$PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

[Diagram: pages t1, t2, …, tn each linking to page X]
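A small sketch of the definition above for a single page x. The function name, the data layout, and the example values (including α = 0.15 and N = 5) are assumptions chosen for illustration; they do not appear in the talk.

```python
def pagerank_of(x, inlinks, out_degree, pr, alpha, n):
    """PR(x) = alpha * (1/N) + (1 - alpha) * sum of PR(t) / C(t) over inlinks t."""
    link_mass = sum(pr[t] / out_degree[t] for t in inlinks[x])
    return alpha * (1.0 / n) + (1.0 - alpha) * link_mass


# Hypothetical example: pages t1 and t2 both link to x in a 5-node graph.
inlinks = {"x": ["t1", "t2"]}     # pages linking to x
out_degree = {"t1": 2, "t2": 1}   # C(t) for each inlinking page
pr = {"t1": 0.2, "t2": 0.3}       # PageRank values from the previous iteration

value = pagerank_of("x", inlinks, out_degree, pr, alpha=0.15, n=5)
```

Here t1 contributes 0.2 / 2 and t2 contributes 0.3 / 1, so the link term is 0.4 before it is scaled by (1 − α) and the random-jump term α / N is added.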
![Page 11: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/11.jpg)
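Sample PageRank Iteration (1)
[Diagram: five-node graph before and after iteration 1]
Before iteration 1: n1 (0.2), n2 (0.2), n3 (0.2), n4 (0.2), n5 (0.2)
Edge labels (mass sent along outgoing edges): 0.1, 0.066, 0.2
After iteration 1: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3)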
![Page 12: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/12.jpg)
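Sample PageRank Iteration (2)
[Diagram: five-node graph before and after iteration 2]
Before iteration 2: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3)
Edge labels (mass sent along outgoing edges): 0.033, 0.083, 0.1, 0.3, 0.166
After iteration 2: n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.2), n5 (0.383)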
![Page 13: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/13.jpg)
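PageRank in MapReduce
[Diagram: Map, then Reduce, over the five-node graph]
Input (adjacency lists): n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]
Map output keys, per input node: n2 n4 | n3 n5 | n4 | n5 | n1 n2 n3
Reduce output (adjacency lists): n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]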
![Page 14: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/14.jpg)
PageRank Pseudo-Code
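One PageRank iteration in the map/reduce style might be sketched as follows. This is not the authors' pseudo-code: the shuffle is simulated in memory, the random jump is left out (as in the sample iterations), and the `"GRAPH"` and `"MASS"` tags are hypothetical names for the two kinds of intermediate values.

```python
from collections import defaultdict


def map_node(node_id, node):
    """Emit the node's structure, then an equal share of its mass per neighbor."""
    rank, neighbors = node
    yield node_id, ("GRAPH", neighbors)  # bookkeeping: pass the structure along
    for neighbor in neighbors:
        yield neighbor, ("MASS", rank / len(neighbors))


def reduce_node(node_id, values):
    """Sum incoming mass (aggregation = SUM) and reattach the adjacency list."""
    neighbors, total = [], 0.0
    for kind, payload in values:
        if kind == "GRAPH":
            neighbors = payload
        else:
            total += payload
    return node_id, (total, neighbors)


def run_iteration(graph):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    shuffled = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in map_node(node_id, node):
            shuffled[key].append(value)
    return dict(reduce_node(key, values) for key, values in sorted(shuffled.items()))


# The five-node graph from the sample iterations, each node starting at 0.2.
graph = {
    "n1": (0.2, ["n2", "n4"]),
    "n2": (0.2, ["n3", "n5"]),
    "n3": (0.2, ["n4"]),
    "n4": (0.2, ["n5"]),
    "n5": (0.2, ["n1", "n2", "n3"]),
}
after_one = run_iteration(graph)
after_two = run_iteration(after_one)
```

Note that every mapper emits the adjacency list alongside the mass messages, so the graph structure is shuffled on every iteration together with the actual computation.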
![Page 15: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/15.jpg)
Why don’t distributed algorithms scale?
![Page 16: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/16.jpg)
Source: http://www.flickr.com/photos/fusedforces/4324320625/
![Page 17: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/17.jpg)
Three Design Patterns
In-mapper combining: efficient local aggregation
Smarter partitioning: create more opportunities
Schimmy: avoid shuffling the graph
![Page 18: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/18.jpg)
In-Mapper Combining
Use combiners
  Perform local aggregation on map output
  Downside: intermediate data is still materialized
Better: in-mapper combining
  Preserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at end
  Downside: requires memory management
[Diagram: a buffer alongside the mapper’s configure, map, and close calls]
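In-mapper combining could be sketched like this in Python, mirroring the configure / map / close lifecycle named in the diagram. The class name and the driver loop are assumptions, and the memory management the slide warns about (for example, flushing the buffer when it grows too large) is deliberately left out.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Mapper that aggregates PageRank mass in a buffer across map calls."""

    def configure(self):
        # Called once before any input record: create the buffer.
        self.buffer = defaultdict(float)

    def map(self, node_id, node):
        # Called once per input record: aggregate locally instead of emitting.
        rank, neighbors = node
        for neighbor in neighbors:
            self.buffer[neighbor] += rank / len(neighbors)

    def close(self):
        # Called once after all input: emit one message per destination node.
        for neighbor, mass in sorted(self.buffer.items()):
            yield neighbor, mass


# Hypothetical input split holding three nodes of the sample graph.
split = {
    "n1": (0.2, ["n2", "n4"]),
    "n3": (0.2, ["n4"]),
    "n5": (0.2, ["n1", "n2", "n3"]),
}
mapper = InMapperCombiningMapper()
mapper.configure()
for node_id, node in split.items():
    mapper.map(node_id, node)
emitted = list(mapper.close())
```

For this split the mapper emits four messages rather than six, because the mass destined for n2 and for n4 is summed before anything leaves the mapper.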
![Page 19: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/19.jpg)
Better Partitioning
Default: hash partitioning
  Randomly assign nodes to partitions
Observation: many graphs exhibit local structure
  E.g., communities in social networks
  Better partitioning creates more opportunities for local aggregation
Unfortunately… partitioning is hard!
  Sometimes, a chicken-and-egg problem
  But in some domains (e.g., webgraphs) take advantage of cheap heuristics
  For webgraphs: range partition on domain-sorted URLs
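The two partitioning strategies can be contrasted with a short sketch. The function names, the CRC32 stand-in for a hash partitioner, the reversed-hostname sort key, and the example URLs are all assumptions made for illustration; the talk only states that webgraphs are range partitioned on domain-sorted URLs.

```python
import zlib
from urllib.parse import urlsplit


def hash_partition(url, num_partitions):
    """Hash partitioning: assignment is effectively random with respect to domain."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def domain_key(url):
    """Sort key that places pages from the same domain next to each other."""
    host = urlsplit(url).hostname or ""
    return (".".join(reversed(host.split("."))), url)


def range_partition(urls, num_partitions):
    """Assign contiguous ranges of domain-sorted URLs to partitions."""
    ordered = sorted(urls, key=domain_key)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return {url: index // size for index, url in enumerate(ordered)}


urls = [
    "http://b.example.org/1",
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://b.example.org/2",
]
assignment = range_partition(urls, num_partitions=2)
```

Because most links on the web stay within a domain, keeping a domain's pages in one partition means more messages share a destination partition and can be aggregated locally.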
![Page 20: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/20.jpg)
Schimmy Design Pattern
Basic implementation contains two dataflows:
  Messages (actual computations)
  Graph structure (“bookkeeping”)
Schimmy: separate the two dataflows, shuffle only the messages
  Basic idea: merge join between graph structure and messages
[Diagram: relations S and T split into S1/T1, S2/T2, S3/T3; both relations consistently partitioned and sorted by join key]
![Page 21: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/21.jpg)
Do the Schimmy!
Schimmy = reduce-side parallel merge join between graph structure and messages
  Consistent partitioning between input and intermediate data
  Mappers emit only messages (actual computation)
  Reducers read graph structure directly from HDFS
[Diagram: three reducers, each joining intermediate data (messages) with graph structure read from HDFS, over partitions S1/T1, S2/T2, S3/T3]
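The reduce-side merge join can be sketched as a single forward pass over two sorted inputs. In this illustration plain Python lists stand in for the graph partition a reducer would read from HDFS and for the shuffled messages. The function name and sample values are assumptions, and the sketch relies on every message key also appearing in the graph partition, which is what consistent partitioning is meant to guarantee.

```python
def schimmy_reduce(graph_partition, grouped_messages):
    """Merge join of sorted graph structure with sorted, grouped messages.

    graph_partition: list of (node_id, neighbors), sorted by node_id.
    grouped_messages: list of (node_id, [mass, ...]), sorted by node_id.
    """
    messages = iter(grouped_messages)
    pending = next(messages, None)
    for node_id, neighbors in graph_partition:
        total = 0.0
        # Both inputs share one sort order, so a single forward pass suffices.
        if pending is not None and pending[0] == node_id:
            total = sum(pending[1])
            pending = next(messages, None)
        yield node_id, (total, neighbors)


# Hypothetical partition: three nodes, two of which received messages.
graph_partition = [("n1", ["n2", "n4"]), ("n2", ["n3", "n5"]), ("n3", ["n4"])]
grouped_messages = [("n1", [0.1]), ("n3", [0.1, 0.2])]
result = list(schimmy_reduce(graph_partition, grouped_messages))
```

The adjacency lists never pass through the shuffle in this arrangement; only the mass messages do, which is the saving the pattern aims for.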
![Page 22: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/22.jpg)
Experiments
Cluster setup:
  10 workers, each 2 cores (3.2 GHz Xeon), 4 GB RAM, 367 GB disk
  Hadoop 0.20.0 on RHELS 5.3
Dataset:
  First English segment of ClueWeb09 collection
  50.2m web pages (1.53 TB uncompressed, 247 GB compressed)
  Extracted webgraph: 1.4 billion links, 7.0 GB
  Dataset arranged in crawl order
Setup:
  Measured per-iteration running time (5 iterations)
  100 partitions
![Page 23: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/23.jpg)
Results
“Best Practices”
![Page 24: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/24.jpg)
Results
[Chart annotations: +18%; 1.4b; 674m]
![Page 25: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/25.jpg)
Results
[Chart annotations: +18%; -15%; 1.4b; 674m]
![Page 26: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/26.jpg)
Results
[Chart annotations: +18%; -15%; -60%; 1.4b; 674m; 86m]
![Page 27: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/27.jpg)
Results
[Chart annotations: +18%; -15%; -60%; -69%; 1.4b; 674m; 86m]
![Page 28: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/28.jpg)
Take-Away Messages
Lots of interesting graph problems!
  Social network analysis
  Bioinformatics
Reducing intermediate data is key
  Local aggregation
  Better partitioning
  Less bookkeeping
![Page 29: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4093ed158b3227431feed/html5/thumbnails/29.jpg)
Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.
http://mapreduce.me/
Source code available in Cloud9
http://cloud9lib.org/
@lintool