Seminar: Massive-Scale Graph Analysis, Summer Semester 2015
MapReduce for Graph Algorithms: Modeling & Approach
Ankur Sharma ([email protected])
May 8, 2015
Agenda
- MapReduce Framework: Big Data, MapReduce
- Graph Algorithms: Graphs, Graph Algorithms
- Design Patterns
- Summary & Conclusion
Ankur Sharma | MapReduce for Graph Algorithms
Big Data: Definition
- Sources: web pages, social networks, tweets, messages, and many more
- How big? Google's index is larger than 100 petabytes and took over 1M computing hours to build[a]
- Number of web pages indexed: 48B (Google) & 5B (Bing)[b]
- The Big Data market will account for $50 billion by 2017[c]
[a] https://www.google.com/search/about/insidesearch/howsearchworks/crawling-indexing.html
[b] www.worldwidewebsize.com
[c] www.wikibon.org
Big Data: Processing
Figure: Comic from Socmedsean (source: www.socmedsean.com)
Big Data: Processing
- Quick and efficient analysis of Big Data requires extreme parallelism.
- Many systems were developed in the late 1990s for parallel data processing, such as SRB[a].
- These systems were efficient but lacked a nice wrapper for the average developer.
- In 2004, Jeffrey Dean and Sanjay Ghemawat from Google introduced MapReduce[b], which is part of the company's proprietary infrastructure.
- A similar open-source infrastructure, Hadoop[c], was developed in late 2005 by Doug Cutting & Mike Cafarella.
[a] Wan et al., The SDSC Storage Resource Broker, CASCON '98
[b] Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
[c] hadoop.apache.org
MapReduce: Framework
Introduction
- MapReduce is motivated by LISP's map and reduce operations:
  - (map square '(1 2 3 4 5)) → (1 4 9 16 25)
  - (reduce + '(1 2 3 4 5)) → (15)
- The framework provides automatic parallelization, fault tolerance, I/O scheduling, and monitoring for data processing.
Programming Model
- Input and output are sets of key-value pairs.
- The user only needs to provide the simple functions map(), reduce(), and (optionally) combine():
  - map(in_k, in_v) → list(out_k, imt_v)
  - combine(out_k, list(imt_v)) → list(out_k, imt_v)
  - reduce(out_k, list(imt_v)) → list(out_v)
MapReduce: Execution Model
Figure: Overview of the MapReduce execution model (source: Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04)
MapReduce Example: Word Count
Map
function M(in_k, in_v)
    for each word in in_v:
        emit(word, 1)
end function
Combine/Reduce
function C_R(out_k, list[imt_v])
    emit(out_k, sum(imt_v))
end function
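The pseudocode above maps directly onto a small, self-contained Python simulation of the map → shuffle → reduce pipeline. This is only a sketch: the in-memory dictionary stands in for the framework's shuffle/sort phase, and names like `run_mapreduce` are illustrative, not part of any real MapReduce API.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce (also usable as a combiner): sum the partial counts.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for k, v in inputs:
        for out_k, out_v in map_fn(k, v):
            groups[out_k].append(out_v)
    # Reduce each group independently.
    result = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

docs = [(1, "to be or not to be"), (2, "to do")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

Because summing is associative and commutative, the same `reduce_fn` can run as a combiner on each mapper's local output without changing the result.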
Graphs
- Graphs are everywhere: social graphs representing connections, or citation graphs representing hierarchy in scientific research.
- Due to their massive scale, it is impractical to use conventional techniques for graph storage and in-memory analysis.
- These constraints drove the development of scalable systems, such as distributed file systems like the Google File System[a] and the Hadoop File System[b].
- MapReduce provides a good way to partition and analyze graphs, as suggested by Cohen[c].
[a] Ghemawat et al., The Google File System, SOSP '03
[b] hadoop.apache.org
[c] Jonathan Cohen, Graph Twiddling in a MapReduce World, Computing in Science & Engineering, 2009
Augmenting Edges with Vertex Degree
Input: all edges as key-value pairs → <-, e_ij>: edge connecting v_i & v_j
MapReduce Job I
- map(<-, e_ij>) → list[<v_i, e_ij>, <v_j, e_ij>]
- reduce(<v_k, list[e_pk, e_qk, ...]>) → [<e_pk, deg(v_k)>, <e_qk, deg(v_k)>, ...], where deg(v_k) = size of the list in the parameter
Input: output of Job I
MapReduce Job II
- map() is an identity function: it does not modify the input
- reduce(<e_ij, list[deg(v_i), deg(v_j)]>) → <e_ij, deg(v_i), deg(v_j)>
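A minimal in-memory Python sketch of the two jobs above, assuming edges are given as vertex-id pairs; the grouping dictionaries stand in for the framework's shuffle phase, and the function name is illustrative.

```python
from collections import defaultdict

def augment_with_degrees(edges):
    """Serial analogue of MapReduce Jobs I and II:
    returns {edge: (deg(v_i), deg(v_j))}."""
    # Job I map: key each edge under both of its endpoints.
    by_vertex = defaultdict(list)
    for (vi, vj) in edges:
        by_vertex[vi].append((vi, vj))
        by_vertex[vj].append((vi, vj))
    # Job I reduce: the list size at v_k is deg(v_k);
    # emit (edge, (v_k, deg(v_k))) for every incident edge.
    partial = defaultdict(list)
    for vk, incident in by_vertex.items():
        deg = len(incident)
        for e in incident:
            partial[e].append((vk, deg))
    # Job II (identity map): merge the two degree records per edge.
    return {(vi, vj): (dict(recs)[vi], dict(recs)[vj])
            for (vi, vj), recs in partial.items()}

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(augment_with_degrees(edges))
# {(1, 2): (2, 2), (1, 3): (2, 3), (2, 3): (2, 3), (3, 4): (3, 1)}
```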
Augmenting Edges with Vertex Degree (contd.): Example
Enumerating Triangles
- The main idea behind enumerating triangles is to find a triad (two edges sharing a vertex) and an edge joining the open ends together.
- Input: augmented edges with vertex valency (degree)
MapReduce Job I
- map(<e_ij, deg(v_i), deg(v_j)>) →
  - output: deg(v_i) < deg(v_j) ? <v_i, e_ij> : <v_j, e_ij>
  - if deg(v_i) == deg(v_j), we can break the tie by any consistent method, such as using vertex names
- reduce(<v_i, list[edges connecting v_i]>) →
  - for each pair of edges connected by v_i, emit <e1, e2> with key <v_p, v_q>, where v_p and v_q represent the open ends of the triad
  - the order of v_p and v_q is decided using some consistent technique, like vertex names
Enumerating Triangles (contd.)
- Input: augmented edge set and output of Job I
MapReduce Job II
- map(<R>) →
  - if R is a record from the augmented edge set, change the key to <v_i, v_j> such that deg(v_i) < deg(v_j)
  - if R is a record from Job I, and hence a triad, change the key to <v_i, v_j> such that deg(v_i) < deg(v_j), where v_i & v_j are the open-end vertices
- reduce(<[v_i, v_j], list[a triad and possibly an edge]>) →
  - if records corresponding to both an edge and a triad share the same key, emit a triangle <e_ij, e_jk, e_ki>
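The two jobs above can be combined into one in-memory sketch, assuming the augmented edge set fits in memory. The low-degree binning guarantees each triangle is emitted exactly once, at the apex that is the "low" endpoint of both of its incident edges.

```python
from collections import defaultdict

def enumerate_triangles(aug_edges):
    """aug_edges: {(vi, vj): (deg_vi, deg_vj)}.
    Returns the set of triangles as sorted vertex triples."""
    def low(e):
        (vi, vj), (di, dj) = e, aug_edges[e]
        # Bin the edge under its lower-degree endpoint;
        # ties broken by vertex name, as on the slide.
        return vi if (di, vi) < (dj, vj) else vj
    # Job I: group edges by low endpoint; every pair of edges at
    # the same apex is a triad, keyed by its open ends.
    by_low = defaultdict(list)
    for e in aug_edges:
        by_low[low(e)].append(e)
    triads = defaultdict(list)
    for v, es in by_low.items():
        for i in range(len(es)):
            for j in range(i + 1, len(es)):
                a = next(x for x in es[i] if x != v)
                b = next(x for x in es[j] if x != v)
                triads[tuple(sorted((a, b)))].append(v)
    # Job II: a triad whose open ends are themselves an edge
    # closes into a triangle.
    edge_keys = {tuple(sorted(e)) for e in aug_edges}
    triangles = set()
    for (a, b), apexes in triads.items():
        if (a, b) in edge_keys:
            for v in apexes:
                triangles.add(tuple(sorted((a, b, v))))
    return triangles

# Edges 1-2, 1-3, 2-3, 1-4, 3-4 with their endpoint degrees.
aug = {(1, 2): (3, 2), (1, 3): (3, 3), (2, 3): (2, 3),
       (1, 4): (3, 2), (3, 4): (3, 2)}
print(enumerate_triangles(aug))
# {(1, 2, 3), (1, 3, 4)}
```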
Enumerating Rectangles
Idea: two triads joined by a common edge can be merged to form a rectangle.
- There are 4! = 24 orderings of four vertices. Since the symmetry group of a 4-cycle (the dihedral group D_4) has 8 elements, there are 24/8 = 3 distinct orderings.
MapReduce Job I
- map(<e_ij, deg(v_i), deg(v_j)>) → [<v_i, H, e_ij>, <v_j, L, e_ij>]
- reduce(<v_k, list[edges incident on v_k]>) → emit a triad for each L-L & H-L record pair
- if we express deg(v) = deg_L(v) + deg_H(v), each vertex appears in O(deg_L(v)^2) low triads and O(deg_L(v) deg_H(v)) mixed triads
MapReduce Job II
- map(<v_k, list[triads]>) → list[<key: edge joining open ends, value: triad>]
- reduce(<e_ij, list[triads open at the same ends]>) → emit rectangle [e_ij, e_jk, e_kl, e_li]
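A minimal in-memory sketch of the idea above: bin triads by their open-end pair, and close any two triads that share open ends into a 4-cycle. For simplicity this sketch ignores the degree-based H/L binning that the distributed version uses to bound the work per vertex.

```python
from collections import defaultdict
from itertools import combinations

def enumerate_rectangles(edges):
    """Return 4-cycles, each represented by its frozenset of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Job I analogue: every pair of edges meeting at apex v is a
    # triad with open ends (a, b).
    triads = defaultdict(list)
    for v in adj:
        for a, b in combinations(sorted(adj[v]), 2):
            triads[(a, b)].append(v)
    # Job II analogue: two triads sharing open ends close the
    # 4-cycle a-v-b-w.  Each rectangle is discovered once per pair
    # of opposite vertices, so a frozenset of edges dedupes it.
    rects = set()
    for (a, b), apexes in triads.items():
        for v, w in combinations(apexes, 2):
            cycle = [frozenset(p) for p in ((a, v), (v, b), (b, w), (w, a))]
            rects.add(frozenset(cycle))
    return rects

print(enumerate_rectangles([(1, 2), (2, 3), (3, 4), (4, 1)]))
```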
Finding k-Truss
Definition: a k-truss is a subgraph in which each edge is part of at least k-2 triangles.
MapReduce Job
Input: augmented edge set & enumerated triangles of the graph
- map(<e_ij, triangle_i>) → if (e_ij ∈ triangle_i): emit <e_ij, 1>
- reduce(<e_ij, list[1, 1, ..., 1]>) → if (Σ list_i) ≥ k−2: emit e_ij
All edges emitted by the MapReduce job belong to the k-truss.
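Assuming the triangles have already been enumerated, the filtering job above can be sketched as follows. Note this is a single pass; a complete k-truss computation would typically re-enumerate triangles on the surviving edges and repeat the filter until no edge is dropped.

```python
from collections import Counter

def truss_filter(edges, triangles, k):
    """One filtering pass: keep edges appearing in >= k-2 triangles."""
    support = Counter()
    # map: emit (edge, 1) for each edge of each triangle
    for (a, b, c) in triangles:
        for e in ((a, b), (a, c), (b, c)):
            support[tuple(sorted(e))] += 1
    # reduce: sum the 1s per edge and apply the threshold
    return {e for e in edges if support[tuple(sorted(e))] >= k - 2}

edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4)]
triangles = [(1, 2, 3), (1, 3, 4)]
print(truss_filter(edges, triangles, 4))
# {(1, 3)}  -- only edge (1, 3) lies in >= 2 triangles
```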
Barycentric Clustering
Clusters are subgraphs that partition the graph without losing the co-relevant information. Barycentric clustering is a highly scalable approach to finding tightly connected subgraphs.
- w_ij: weight of edge e_ij, and d_i = Σ_j w_ij: weighted degree of vertex v_i
- If x = (x_1, x_2, ..., x_n)^T denotes the positions of the n vertices, then x' = Mx moves each vertex to the average of its old position and the positions of its neighborhood.
- The process is repeated; the literature[3] suggests that 5 repetitions are adequate.
- If the length of an edge exceeds a threshold, the edge can be cut.

      [ 1/(d_1+1)       w_12/(d_1+1)   ...   w_1n/(d_1+1) ]
  M = [ w_21/(d_2+1)    1/(d_2+1)      ...   w_2n/(d_2+1) ]
      [ ...                            ...   ...          ]
      [ w_n1/(d_n+1)    w_n2/(d_n+1)   ...   1/(d_n+1)    ]

[3] Cohen et al., Barycentric Graph Clustering, 2008
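One iteration of x' = Mx can be written in a few lines against the matrix above; this sketch uses plain dictionaries, 1-D positions, and unit weights in the demo (the MapReduce formulation distributes exactly this per-vertex update).

```python
def barycentric_step(pos, weights):
    """One update x' = Mx: each vertex moves to the weighted average
    of its own position and its neighbours' positions.
    pos: {v: float}; weights: {v: {u: w_uv}}."""
    new_pos = {}
    for v, p in pos.items():
        d = sum(weights[v].values())          # weighted degree d_v
        s = p + sum(w * pos[u] for u, w in weights[v].items())
        new_pos[v] = s / (d + 1)              # row v of M applied to x
    return new_pos

# Path graph 1-2-3 with unit weights, 1-D positions.
w = {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
x = {1: 0.0, 2: 1.0, 3: 2.0}
for _ in range(5):                            # 5 repetitions, per the slide
    x = barycentric_step(x, w)
print(x)
```

The endpoints drift toward the center while the middle vertex stays put, so the two end edges shorten symmetrically; long edges after the iterations mark weak connections.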
Barycentric Clustering (contd.)
MapReduce Job I
- map(<e_ij>) → [<v_i, e_ij>, <v_j, e_ij>]
- reduce(<v_i, list[e_ij, e_ik, ...]>) → [<e_ij, P(v_i)>, <e_ik, P(v_i)>, ...], where P(v) denotes the position vector of vertex v
MapReduce Job II
- map: (identity)
- reduce(<e_ij, list[P(v_i), P(v_j)]>) → <e_ij, P(v_i), P(v_j)>
Now we execute 5 iterations of the next two MapReduce jobs.
MapReduce Job III
- map(<e_ij, P(v_i), P(v_j)>) → [<v_i, e_ij, P(v_i), P(v_j)>, <v_j, e_ij, P(v_i), P(v_j)>]
- reduce(<v_i, list[<e_ij, P(v_i), P(v_j)>, <e_ik, P(v_i), P(v_k)>, ...]>) → computes the modified position P'(v_i) (row v_i of x' = Mx) and emits [<e_ij, P'(v_i), P(v_j)>, <e_ik, P'(v_i), P(v_k)>, ...]
Barycentric Clustering (contd.)
MapReduce Job IV
- map: (identity)
- reduce(<e_ij, list[<e_ij, P'(v_i), P(v_j)>, <e_ij, P(v_i), P'(v_j)>]>) → <e_ij, P'(v_i), P'(v_j)>
MapReduce Job V
- map(<e_ij, list[P_1, P_2, ..., P_r]>) → computes the length a_ij of edge e_ij from the (averaged) endpoint positions and emits it keyed by each endpoint: [<v_i, e_ij, a_ij>, <v_j, e_ij, a_ij>]
- reduce(<v_i, list[<e_ij, a_ij>, <e_ik, a_ik>, ...]>) → re-emits each incident edge together with the sum of the lengths of all edges incident on v_i
MapReduce Job VI
- map: (identity)
- reduce(<e_ij, list[sums emitted by the previous job]>) → computes the neighborhood average ā_ij and cuts the edge if a_ij > ā_ij, where

  ā_ij = ( Σ_{k∈N_i} a_ik + Σ_{k∈N_j} a_jk − a_ij ) / ( |N_i| + |N_j| − 1 )

- N_k represents the set of vertices in the neighborhood of v_k
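The cut criterion can be checked in a few lines. This sketch assumes one plausible reading of the slide's formula: ā_ij averages the lengths of all edges incident on either endpoint, counting e_ij itself only once.

```python
def cut_edges(lengths, neighbors):
    """lengths: {frozenset({i, j}): a_ij}; neighbors: {v: set of vs}.
    Cut e_ij when its length exceeds the average length of all
    edges incident on v_i or v_j."""
    cut = set()
    for e, a_ij in lengths.items():
        i, j = tuple(e)
        total = (sum(lengths[frozenset({i, k})] for k in neighbors[i]) +
                 sum(lengths[frozenset({j, k})] for k in neighbors[j]) -
                 a_ij)                        # e_ij counted once, not twice
        avg = total / (len(neighbors[i]) + len(neighbors[j]) - 1)
        if a_ij > avg:
            cut.add(e)
    return cut

# Path 1-2-3-4 where the last edge is unusually long.
lengths = {frozenset({1, 2}): 1.0, frozenset({2, 3}): 1.0,
           frozenset({3, 4}): 5.0}
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(cut_edges(lengths, neighbors))
# {frozenset({3, 4})}
```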
Finding Components
Algorithm using MapReduce
- Step 1 consists of two MapReduce jobs focusing on translating vertex-vertex information into zone-zone information:
  - MapReduce Job 1 assigns zones to edges, producing one record for each vertex of an edge, and then translates this information to assign one zone to both vertices
  - MapReduce Job 2 merges the results of Job 1: if distinct zones are connected by an edge, the zone with the lower zone_id captures the other
- Step 2 consists of one MapReduce job:
  - MapReduce Job 1 records any updates to the zone information for each vertex
- Steps 1 and 2 alternate until Step 1 produces no new record, which marks the completion of the zone assignment
Figure: Graph with three components
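A serial sketch of the alternation above: every vertex starts in its own zone, the lower zone id repeatedly captures the other across every edge, and the process stops when a full pass produces no update.

```python
def find_components(edges):
    """Zone assignment by repeated lower-id capture."""
    zone = {v: v for e in edges for v in e}
    changed = True
    while changed:                      # Steps 1 and 2 alternate
        changed = False
        for (u, v) in edges:
            lo = min(zone[u], zone[v])
            if zone[u] != lo or zone[v] != lo:
                zone[u] = zone[v] = lo  # lower zone captures the other
                changed = True
    return zone

# Graph with three components, as in the figure.
print(find_components([(1, 2), (2, 3), (4, 5), (6, 7)]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6, 7: 6}
```

The number of passes depends on how zone ids propagate along chains, which is exactly why the distributed version must iterate until Step 1 is quiescent.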
Design Patterns
Strategies for Efficient MapReduce Implementation
Lin et al.[a] suggested the following design patterns for efficiently implementing graph algorithms using MapReduce:
- Local aggregation using combiners reduces the amount of intermediate data. It is effective only when there is a high duplication factor among intermediate key-value pairs.
- In-Mapper Combining performs better than a regular combiner. Combiners often have to create and destroy objects and perform object serialization and deserialization, which is costly.
- The Schimmy Pattern divides the graph into n partitions, each sorted by vertex id. The number of reducers is then set to n, guaranteeing that reducer R_i processes the same vertex ids, in sorted order, as partition P_i, which reduces the cost of shuffle and sort.
- Range Partitioning proposes to maximize intra-block connectivity and minimize inter-block connectivity. Dividing a graph into blocks with this property reduces the amount of communication over the network.
[a] Jimmy Lin, Michael Schatz, Design Patterns for Efficient Graph Algorithms in MapReduce, MLG '10
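The in-mapper combining pattern can be illustrated with word count: instead of emitting (word, 1) per token and combining later, the mapper aggregates in a local dictionary and emits only when it closes. The class and method names here are illustrative, not a real Hadoop API.

```python
from collections import defaultdict

class WordCountMapper:
    """In-mapper combining: aggregate counts in per-mapper state
    and emit once at close, avoiding per-token object churn and
    combiner (de)serialization."""
    def __init__(self):
        self.counts = defaultdict(int)

    def map(self, doc_id, text):
        for word in text.split():
            self.counts[word] += 1      # local aggregation, no emits yet

    def close(self):
        # Emit the aggregated pairs once per mapper instance.
        return list(self.counts.items())

m = WordCountMapper()
m.map(1, "to be or not to be")
m.map(2, "to do")
print(sorted(m.close()))
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

The trade-off is memory: the per-mapper dictionary must fit in the mapper's heap, so real implementations periodically flush it when it grows too large.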
Design Patterns (contd.): Performance Comparison
Figure: Performance comparison for PageRank using the suggested strategies (source: Lin & Schatz, MLG '10)
Summary & Conclusion
- In general, this is a good contribution that attracts the MapReduce community toward developing distributed systems for graph processing.
- The author focused on breaking bigger problems down into smaller MapReduce jobs and merging their results to solve the overall problem.
- The contribution provides no information about the performance of the implementation, and no comparison to standard approaches.
- Most of the algorithms need a complete re-transformation for implementation in the MapReduce framework. Some become easy, while others, like finding components, become complex.
- Lin et al. introduced some very good strategies for efficient MapReduce implementation that can be used even for non-graph algorithms.
Thank You! Questions?