Seminar: Massive-Scale Graph Analysis, Summer Semester 2015
MapReduce for Graph Algorithms: Modeling & Approach
Ankur Sharma ([email protected])
May 8, 2015
Agenda
- MapReduce Framework: Big Data, MapReduce
- Graph Algorithms: Graphs, Graph Algorithms
- Design Patterns
- Summary & Conclusion
Ankur Sharma | MapReduce for Graph Algorithms
Big Data: Definition
- Sources: web pages, social networks, tweets, messages, and many more
- How big? Google's index is larger than 100 petabytes and took over 1M computing hours to build[a]
- Number of web pages indexed: 48B (Google) & 5B (Bing)[b]
- The Big Data market will account for $50 billion by 2017[c]
[a] https://www.google.com/search/about/insidesearch/howsearchworks/crawling-indexing.html
[b] www.worldwidewebsize.com
[c] www.wikibon.org
Big Data: Processing
Figure: Comic from Socmedsean (source: www.socmedsean.com)
Big Data: Processing
- Quick and efficient analysis of Big Data requires extreme parallelism.
- Many systems were developed in the late 1990s for parallel data processing, such as SRB[a].
- These systems were efficient but lacked a nice wrapper for the average developer.
- In 2004, Jeffrey Dean and Sanjay Ghemawat from Google introduced MapReduce[b], which is part of the company's proprietary infrastructure.
- A similar open-source infrastructure, Hadoop[c], was developed in late 2005 by Doug Cutting & Mike Cafarella.
[a] Wan et al., The SDSC Storage Resource Broker, CASCON '98
[b] Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
[c] hadoop.apache.org
MapReduce: Framework
Introduction
- MapReduce is motivated by LISP's map and reduce operations:
  - (map square '(1 2 3 4 5)) → (1 4 9 16 25)
  - (reduce + '(1 2 3 4 5)) → (15)
- The framework provides automatic parallelization, fault tolerance, I/O scheduling, and monitoring for data processing.
Programming Model
- Input and output are sets of key-value pairs.
- The user only needs to provide the simple functions map(), reduce(), and (optionally) combine():
  - map(in_k, in_v) → list(out_k, imt_v)
  - combine(out_k, list(imt_v)) → list(out_k, imt_v)
  - reduce(out_k, list(imt_v)) → list(out_v)
MapReduce: Execution Model
Figure: Overview of the MapReduce execution model (source: Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04)
MapReduce Example: Word Count
Map
function M(in_k, in_v)
    for each word in in_v:
        emit(word, 1)
end function
Combine/Reduce
function C_R(out_k, list[imt_v])
    emit(out_k, sum(imt_v))
end function
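The pseudocode above maps directly onto a small, self-contained Python simulation of the map → shuffle → reduce pipeline. This is only a sketch: the in-memory dictionary stands in for the framework's shuffle/sort phase, and names like `run_mapreduce` are illustrative, not part of any real MapReduce API.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce (also usable as a combiner): sum the partial counts.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for k, v in inputs:
        for out_k, out_v in map_fn(k, v):
            groups[out_k].append(out_v)
    # Reduce each group independently.
    result = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

docs = [(1, "to be or not to be"), (2, "to do")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

Because summing is associative and commutative, the same `reduce_fn` can run as a combiner on each mapper's local output without changing the result.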
Graphs
- Graphs are everywhere: social graphs representing connections, or citation graphs representing hierarchy in scientific research.
- Due to their massive scale, it is impractical to use conventional techniques for graph storage and in-memory analysis.
- These constraints drove the development of scalable systems, such as distributed file systems like the Google File System[a] and the Hadoop File System[b].
- MapReduce provides a good way to partition and analyze graphs, as suggested by Cohen[c].
[a] Ghemawat et al., The Google File System, SOSP '03
[b] hadoop.apache.org
[c] Jonathan Cohen, Graph Twiddling in a MapReduce World, Computing in Science & Engineering, 2009
Augmenting Edges with Vertex Degree
Input: all edges as key-value pairs → <-, e_ij>: edge connecting v_i & v_j
MapReduce Job I
- map(<-, e_ij>) → list[<v_i, e_ij>, <v_j, e_ij>]
- reduce(<v_k, list[e_pk, e_qk, ...]>) → [<e_pk, deg(v_k)>, <e_qk, deg(v_k)>, ...], where deg(v_k) = size of the list in the parameter
Input: output of Job I
MapReduce Job II
- map() is an identity function: it does not modify the input
- reduce(<e_ij, list[deg(v_i), deg(v_j)]>) → <e_ij, deg(v_i), deg(v_j)>
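A minimal in-memory Python sketch of the two jobs above, assuming edges are given as vertex-id pairs; the grouping dictionaries stand in for the framework's shuffle phase, and the function name is illustrative.

```python
from collections import defaultdict

def augment_with_degrees(edges):
    """Serial analogue of MapReduce Jobs I and II:
    returns {edge: (deg(v_i), deg(v_j))}."""
    # Job I map: key each edge under both of its endpoints.
    by_vertex = defaultdict(list)
    for (vi, vj) in edges:
        by_vertex[vi].append((vi, vj))
        by_vertex[vj].append((vi, vj))
    # Job I reduce: the list size at v_k is deg(v_k);
    # emit (edge, (v_k, deg(v_k))) for every incident edge.
    partial = defaultdict(list)
    for vk, incident in by_vertex.items():
        deg = len(incident)
        for e in incident:
            partial[e].append((vk, deg))
    # Job II (identity map): merge the two degree records per edge.
    return {(vi, vj): (dict(recs)[vi], dict(recs)[vj])
            for (vi, vj), recs in partial.items()}

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(augment_with_degrees(edges))
# {(1, 2): (2, 2), (1, 3): (2, 3), (2, 3): (2, 3), (3, 4): (3, 1)}
```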
Augmenting Edges with Vertex Degree (contd.): Example
Enumerating Triangles
- The main idea behind enumerating triangles is to find a triad (two edges sharing a vertex) and an edge joining the open ends together.
- Input: augmented edges with vertex valency (degree)
MapReduce Job I
- map(<e_ij, deg(v_i), deg(v_j)>) →
  - output: deg(v_i) < deg(v_j) ? <v_i, e_ij> : <v_j, e_ij>
  - if deg(v_i) == deg(v_j), we can break the tie by any consistent method, such as using vertex names
- reduce(<v_i, list[edges connecting v_i]>) →
  - for each pair of edges connected by v_i, emit <e1, e2> with key <v_p, v_q>, where v_p and v_q represent the open ends of the triad
  - the order of v_p and v_q is decided using some consistent technique, like vertex names
Enumerating Triangles (contd.)
- Input: augmented edge set and output of Job I
MapReduce Job II
- map(<R>) →
  - if R is a record from the augmented edge set, change the key to <v_i, v_j> such that deg(v_i) < deg(v_j)
  - if R is a record from Job I, and hence a triad, change the key to <v_i, v_j> such that deg(v_i) < deg(v_j), where v_i & v_j are the open-end vertices
- reduce(<[v_i, v_j], list[a triad and possibly an edge]>) →
  - if records corresponding to both an edge and a triad share the same key, emit a triangle <e_ij, e_jk, e_ki>
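The two jobs above can be combined into one in-memory sketch, assuming the augmented edge set fits in memory. The low-degree binning guarantees each triangle is emitted exactly once, at the apex that is the "low" endpoint of both of its incident edges.

```python
from collections import defaultdict

def enumerate_triangles(aug_edges):
    """aug_edges: {(vi, vj): (deg_vi, deg_vj)}.
    Returns the set of triangles as sorted vertex triples."""
    def low(e):
        (vi, vj), (di, dj) = e, aug_edges[e]
        # Bin the edge under its lower-degree endpoint;
        # ties broken by vertex name, as on the slide.
        return vi if (di, vi) < (dj, vj) else vj
    # Job I: group edges by low endpoint; every pair of edges at
    # the same apex is a triad, keyed by its open ends.
    by_low = defaultdict(list)
    for e in aug_edges:
        by_low[low(e)].append(e)
    triads = defaultdict(list)
    for v, es in by_low.items():
        for i in range(len(es)):
            for j in range(i + 1, len(es)):
                a = next(x for x in es[i] if x != v)
                b = next(x for x in es[j] if x != v)
                triads[tuple(sorted((a, b)))].append(v)
    # Job II: a triad whose open ends are themselves an edge
    # closes into a triangle.
    edge_keys = {tuple(sorted(e)) for e in aug_edges}
    triangles = set()
    for (a, b), apexes in triads.items():
        if (a, b) in edge_keys:
            for v in apexes:
                triangles.add(tuple(sorted((a, b, v))))
    return triangles

# Edges 1-2, 1-3, 2-3, 1-4, 3-4 with their endpoint degrees.
aug = {(1, 2): (3, 2), (1, 3): (3, 3), (2, 3): (2, 3),
       (1, 4): (3, 2), (3, 4): (3, 2)}
print(enumerate_triangles(aug))
# {(1, 2, 3), (1, 3, 4)}
```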
Enumerating Rectangles
Idea: two triads joined by a common edge can be merged to form a rectangle.
- There are 4! = 24 orderings of four vertices. Since the symmetry group of a 4-cycle (the dihedral group D_4) has 8 elements, there are 24/8 = 3 distinct orderings.
MapReduce Job I
- map(<e_ij, deg(v_i), deg(v_j)>) → [<v_i, H, e_ij>, <v_j, L, e_ij>]
- reduce(<v_k, list[edges incident on v_k]>) → emit a triad for each L-L & H-L record pair
- if we express deg(v) = deg_L(v) + deg_H(v), each vertex appears in O(deg_L(v)^2) low triads and O(deg_L(v) deg_H(v)) mixed triads
MapReduce Job II
- map(<v_k, list[triads]>) → list[<key: edge joining open ends, value: triad>]
- reduce(<e_ij, list[triads open at the same ends]>) → emit rectangle [e_ij, e_jk, e_kl, e_li]
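A minimal in-memory sketch of the idea above: bin triads by their open-end pair, and close any two triads that share open ends into a 4-cycle. For simplicity this sketch ignores the degree-based H/L binning that the distributed version uses to bound the work per vertex.

```python
from collections import defaultdict
from itertools import combinations

def enumerate_rectangles(edges):
    """Return 4-cycles, each represented by its frozenset of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Job I analogue: every pair of edges meeting at apex v is a
    # triad with open ends (a, b).
    triads = defaultdict(list)
    for v in adj:
        for a, b in combinations(sorted(adj[v]), 2):
            triads[(a, b)].append(v)
    # Job II analogue: two triads sharing open ends close the
    # 4-cycle a-v-b-w.  Each rectangle is discovered once per pair
    # of opposite vertices, so a frozenset of edges dedupes it.
    rects = set()
    for (a, b), apexes in triads.items():
        for v, w in combinations(apexes, 2):
            cycle = [frozenset(p) for p in ((a, v), (v, b), (b, w), (w, a))]
            rects.add(frozenset(cycle))
    return rects

print(enumerate_rectangles([(1, 2), (2, 3), (3, 4), (4, 1)]))
```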
Finding k-Truss
Definition: a k-truss is a subgraph in which each edge is part of at least k-2 triangles.
MapReduce Job
Input: augmented edge set & enumerated triangles of the graph
- map(<e_ij, triangle_i>) → if (e_ij ∈ triangle_i): emit <e_ij, 1>
- reduce(<e_ij, list[1, 1, ..., 1]>) → if (Σ list_i) ≥ k−2: emit e_ij
All edges emitted by the MapReduce job belong to the k-truss.
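Assuming the triangles have already been enumerated, the filtering job above can be sketched as follows. Note this is a single pass; a complete k-truss computation would typically re-enumerate triangles on the surviving edges and repeat the filter until no edge is dropped.

```python
from collections import Counter

def truss_filter(edges, triangles, k):
    """One filtering pass: keep edges appearing in >= k-2 triangles."""
    support = Counter()
    # map: emit (edge, 1) for each edge of each triangle
    for (a, b, c) in triangles:
        for e in ((a, b), (a, c), (b, c)):
            support[tuple(sorted(e))] += 1
    # reduce: sum the 1s per edge and apply the threshold
    return {e for e in edges if support[tuple(sorted(e))] >= k - 2}

edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4)]
triangles = [(1, 2, 3), (1, 3, 4)]
print(truss_filter(edges, triangles, 4))
# {(1, 3)}  -- only edge (1, 3) lies in >= 2 triangles
```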
Barycentric Clustering
Clusters are subgraphs that partition the graph without losing the co-relevant information. Barycentric clustering is a highly scalable approach to finding tightly connected subgraphs.
- w_ij: weight of edge e_ij, and d_i = Σ_j w_ij: weighted degree of vertex v_i
- If x = (x_1, x_2, ..., x_n)^T denotes the positions of the n vertices, then x' = Mx moves each vertex to the average of its old position and the positions of its neighborhood.
- The process is repeated; the literature[3] suggests that 5 repetitions are adequate.
- If the length of an edge exceeds a threshold, the edge can be cut.

      [ 1/(d_1+1)       w_12/(d_1+1)   ...   w_1n/(d_1+1) ]
  M = [ w_21/(d_2+1)    1/(d_2+1)      ...   w_2n/(d_2+1) ]
      [ ...                            ...   ...          ]
      [ w_n1/(d_n+1)    w_n2/(d_n+1)   ...   1/(d_n+1)    ]

[3] Cohen et al., Barycentric Graph Clustering, 2008
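One iteration of x' = Mx can be written in a few lines against the matrix above; this sketch uses plain dictionaries, 1-D positions, and unit weights in the demo (the MapReduce formulation distributes exactly this per-vertex update).

```python
def barycentric_step(pos, weights):
    """One update x' = Mx: each vertex moves to the weighted average
    of its own position and its neighbours' positions.
    pos: {v: float}; weights: {v: {u: w_uv}}."""
    new_pos = {}
    for v, p in pos.items():
        d = sum(weights[v].values())          # weighted degree d_v
        s = p + sum(w * pos[u] for u, w in weights[v].items())
        new_pos[v] = s / (d + 1)              # row v of M applied to x
    return new_pos

# Path graph 1-2-3 with unit weights, 1-D positions.
w = {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
x = {1: 0.0, 2: 1.0, 3: 2.0}
for _ in range(5):                            # 5 repetitions, per the slide
    x = barycentric_step(x, w)
print(x)
```

The endpoints drift toward the center while the middle vertex stays put, so the two end edges shorten symmetrically; long edges after the iterations mark weak connections.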
Barycentric Clustering (contd.)
MapReduce Job I
- map(<e_ij>) → [<v_i, e_ij>, <v_j, e_ij>]
- reduce(<v_i, list[e_ij, e_ik, ...]>) → [<e_ij, P(v_i)>, <e_ik, P(v_i)>, ...], where P(v) denotes the position vector of vertex v
MapReduce Job II
- map: (identity)
- reduce(<e_ij, list[P(v_i), P(v_j)]>) → <e_ij, P(v_i), P(v_j)>
Now we execute 5 iterations of the next two MapReduce jobs.
MapReduce Job III
- map(<e_ij, P(v_i), P(v_j)>) → [<v_i, e_ij, P(v_i), P(v_j)>, <v_j, e_ij, P(v_i), P(v_j)>]
- reduce(<v_i, list[<e_ij, P(v_i), P(v_j)>, <e_ik, P(v_i), P(v_k)>, ...]>) → computes the modified position P'(v_i) (row v_i of x' = Mx) and emits [<e_ij, P'(v_i), P(v_j)>, <e_ik, P'(v_i), P(v_k)>, ...]
Barycentric Clustering (contd.)
MapReduce Job IV
- map: (identity)
- reduce(<e_ij, list[<e_ij, P'(v_i), P(v_j)>, <e_ij, P(v_i), P'(v_j)>]>) → <e_ij, P'(v_i), P'(v_j)>
MapReduce Job V
- map(<e_ij, list[P_1, P_2, ..., P_r]>) → computes the length a_ij of edge e_ij from the (averaged) endpoint positions and emits it keyed by each endpoint: [<v_i, e_ij, a_ij>, <v_j, e_ij, a_ij>]
- reduce(<v_i, list[<e_ij, a_ij>, <e_ik, a_ik>, ...]>) → re-emits each incident edge together with the sum of the lengths of all edges incident on v_i
MapReduce Job VI
- map: (identity)
- reduce(<e_ij, list[sums emitted by the previous job]>) → computes the neighborhood average ā_ij and cuts the edge if a_ij > ā_ij, where

  ā_ij = ( Σ_{k∈N_i} a_ik + Σ_{k∈N_j} a_jk − a_ij ) / ( |N_i| + |N_j| − 1 )

- N_k represents the set of vertices in the neighborhood of v_k
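The cut criterion can be checked in a few lines. This sketch assumes one plausible reading of the slide's formula: ā_ij averages the lengths of all edges incident on either endpoint, counting e_ij itself only once.

```python
def cut_edges(lengths, neighbors):
    """lengths: {frozenset({i, j}): a_ij}; neighbors: {v: set of vs}.
    Cut e_ij when its length exceeds the average length of all
    edges incident on v_i or v_j."""
    cut = set()
    for e, a_ij in lengths.items():
        i, j = tuple(e)
        total = (sum(lengths[frozenset({i, k})] for k in neighbors[i]) +
                 sum(lengths[frozenset({j, k})] for k in neighbors[j]) -
                 a_ij)                        # e_ij counted once, not twice
        avg = total / (len(neighbors[i]) + len(neighbors[j]) - 1)
        if a_ij > avg:
            cut.add(e)
    return cut

# Path 1-2-3-4 where the last edge is unusually long.
lengths = {frozenset({1, 2}): 1.0, frozenset({2, 3}): 1.0,
           frozenset({3, 4}): 5.0}
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(cut_edges(lengths, neighbors))
# {frozenset({3, 4})}
```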
Finding Components
Algorithm using MapReduce
- Step 1 consists of two MapReduce jobs focusing on translating vertex-vertex information into zone-zone information:
  - MapReduce Job 1 assigns zones to edges, producing one record for each vertex of an edge, and then translates this information to assign one zone to both vertices
  - MapReduce Job 2 merges the results of Job 1: if distinct zones are connected by an edge, the zone with the lower zone_id captures the other
- Step 2 consists of one MapReduce job:
  - MapReduce Job 1 records any updates to the zone information for each vertex
- Steps 1 and 2 alternate until Step 1 produces no new record, which marks the completion of the zone assignment
Figure: Graph with three components
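A serial sketch of the alternation above: every vertex starts in its own zone, the lower zone id repeatedly captures the other across every edge, and the process stops when a full pass produces no update.

```python
def find_components(edges):
    """Zone assignment by repeated lower-id capture."""
    zone = {v: v for e in edges for v in e}
    changed = True
    while changed:                      # Steps 1 and 2 alternate
        changed = False
        for (u, v) in edges:
            lo = min(zone[u], zone[v])
            if zone[u] != lo or zone[v] != lo:
                zone[u] = zone[v] = lo  # lower zone captures the other
                changed = True
    return zone

# Graph with three components, as in the figure.
print(find_components([(1, 2), (2, 3), (4, 5), (6, 7)]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6, 7: 6}
```

The number of passes depends on how zone ids propagate along chains, which is exactly why the distributed version must iterate until Step 1 is quiescent.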
Design Patterns
Strategies for Efficient MapReduce Implementation
Lin et al.[a] suggested the following design patterns for efficiently implementing graph algorithms using MapReduce:
- Local aggregation using combiners reduces the amount of intermediate data. It is effective only when there is a high duplication factor among intermediate key-value pairs.
- In-Mapper Combining performs better than a regular combiner. Combiners often have to create and destroy objects and perform object serialization and deserialization, which is costly.
- The Schimmy Pattern divides the graph into n partitions, each sorted by vertex id. The number of reducers is then set to n, guaranteeing that reducer R_i processes the same vertex ids, in sorted order, as partition P_i, which reduces the cost of shuffle and sort.
- Range Partitioning proposes to maximize intra-block connectivity and minimize inter-block connectivity. Dividing a graph into blocks with this property reduces the amount of communication over the network.
[a] Jimmy Lin, Michael Schatz, Design Patterns for Efficient Graph Algorithms in MapReduce, MLG '10
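The in-mapper combining pattern can be illustrated with word count: instead of emitting (word, 1) per token and combining later, the mapper aggregates in a local dictionary and emits only when it closes. The class and method names here are illustrative, not a real Hadoop API.

```python
from collections import defaultdict

class WordCountMapper:
    """In-mapper combining: aggregate counts in per-mapper state
    and emit once at close, avoiding per-token object churn and
    combiner (de)serialization."""
    def __init__(self):
        self.counts = defaultdict(int)

    def map(self, doc_id, text):
        for word in text.split():
            self.counts[word] += 1      # local aggregation, no emits yet

    def close(self):
        # Emit the aggregated pairs once per mapper instance.
        return list(self.counts.items())

m = WordCountMapper()
m.map(1, "to be or not to be")
m.map(2, "to do")
print(sorted(m.close()))
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

The trade-off is memory: the per-mapper dictionary must fit in the mapper's heap, so real implementations periodically flush it when it grows too large.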
Design Patterns (contd.): Performance Comparison
Figure: Performance comparison for PageRank using the suggested strategies (source: Lin & Schatz, MLG '10)
Summary & Conclusion
- In general, this is a good contribution that attracts the MapReduce community toward developing distributed systems for graph processing.
- The author focused on breaking bigger problems down into smaller MapReduce jobs and merging their results to solve the overall problem.
- The contribution provides no information about the performance of the implementation, and no comparison to standard approaches.
- Most of the algorithms need a complete re-transformation for implementation in the MapReduce framework. Some become easy, while others, like finding components, become complex.
- Lin et al. introduced some very good strategies for efficient MapReduce implementation that can be used even for non-graph algorithms.
Thank You! Questions?