mrongraphs acm-sig-2 (1)

Graph Data Mining at Scale

Nima Sarshar, Ph.D.


My Goals for this Talk

You leave with your inner computer scientist tantalized:

There is more to writing efficient Map-Reduce algorithms than counting words and merging logs

You get a general sense of the state of the research

I convince you of the need for a real graph processing package for Hadoop

You know a bit about our work at Intuit

Plan

Jump right to it with an example (enumerating triangles)

Define the performance metrics (what are we optimizing for?)

Give a classification of known “recipes”

The triangle example with a new trick

Personalized PageRank, connected components

A list of other algorithms

Finding Triangles with Map-Reduce

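[Figure: example graph on nodes 1, 2, 3, 4 with edge list (1,3), (2,3), (2,4), (3,4). Each edge is binned under both of its end nodes, and every pair of neighbors in a bin forms a potential triangle.]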

5 Potential Triangles to Consider

Another round of Map Reduce jobs will check for the existence of the “closing” edge
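A minimal single-process sketch of the two rounds described above, assuming the example edge list (1,3), (2,3), (2,4), (3,4). The function names `round_one` and `round_two` are illustrative only; each stands in for one Map-Reduce job.

```python
from collections import defaultdict
from itertools import combinations

# Example edge list assumed from the slide's figure.
EDGES = [(1, 3), (2, 3), (2, 4), (3, 4)]

def round_one(edges):
    """Map: emit every edge under both endpoints.
    Reduce: pair up the neighbors collected under each node."""
    bins = defaultdict(set)
    for u, v in edges:
        bins[u].add(v)
        bins[v].add(u)
    candidates = []
    for apex, nbrs in bins.items():
        for a, b in combinations(sorted(nbrs), 2):
            candidates.append((apex, a, b))  # potential triangle apex-a-b
    return candidates

def round_two(edges, candidates):
    """Keep only the candidates whose closing edge (a, b) exists."""
    edge_set = {frozenset(e) for e in edges}
    return [c for c in candidates if frozenset(c[1:]) in edge_set]

candidates = round_one(EDGES)
triangles = round_two(EDGES, candidates)
print(len(candidates), triangles)
```

On this graph the first round produces 5 potential triangles, and the single real triangle {2, 3, 4} survives the second round once per vertex.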

Problems with this Approach

1. Each triangle will be detected 3 times – once under each of its 3 vertices

2. Too many “potential” triangles are created in the first reduce step.

For a node with degree d:

$\binom{d}{2} \sim O(d^2)$

Total # of records:

$\sum_{v \in V} d_v^2 = |V| \sum_k p_k k^2 = N \langle k^2 \rangle$

Modified Algorithm [Cohen ‘08]

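[Figure: the same example graph, with each edge binned only under its lowest value node: node 1 gets {3}, node 2 gets {3, 4}, node 3 gets {4}]

For each triangle exactly one potential triangle is created (under the lowest value node)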

The quadratic problem still persists

This is neat. At least we are not triple counting

But the quadratic problem still exists. The number of records is still $O(N \langle k^2 \rangle)$

We want to avoid binning edges under high degree nodes

The ordering of nodes is arbitrary! Let the degree of a node define its order.

Bin an edge under its LOW DEGREE node

Break ties arbitrarily, but consistently
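The effect of the ordering can be sketched by binning each edge under its lower-ranked endpoint and swapping the rank function. This is an illustrative single-process simulation, not the talk's implementation; the hub-and-leaves test graph and the names `candidates`, `triangles` and `degree_rank` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def candidates(edges, rank):
    """Bin each edge under its lower-ranked endpoint, then pair up each bin."""
    bins = defaultdict(set)
    for u, v in edges:
        low, high = (u, v) if rank(u) < rank(v) else (v, u)
        bins[low].add(high)
    return [(apex, a, b)
            for apex, nbrs in bins.items()
            for a, b in combinations(sorted(nbrs), 2)]

def triangles(edges, rank):
    """Keep the candidates whose closing edge exists."""
    edge_set = {frozenset(e) for e in edges}
    return [c for c in candidates(edges, rank) if frozenset(c[1:]) in edge_set]

def degree_rank(edges):
    """Order nodes by degree; ties are broken consistently by node id."""
    deg = Counter(x for e in edges for x in e)
    return lambda v: (deg[v], v)

# Hypothetical test graph: a hub (node 0) with 100 leaves, plus the edge 1-2.
star = [(0, i) for i in range(1, 101)] + [(1, 2)]
by_id = candidates(star, rank=lambda v: v)         # node-id ordering
by_degree = candidates(star, rank=degree_rank(star))  # low-degree ordering
print(len(by_id), len(by_degree))
```

With node-id ordering every edge of the hub lands in the hub's bin, so the bin generates 4950 candidate records; with degree ordering the same graph generates a single candidate, and both orderings report the triangle {0, 1, 2} exactly once.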


The performance

Worst case: $\Theta(M^{3/2})$ records vs. $\Theta(M^2)$

The same as the best serial algorithm [Suri ‘11]

The gain for “real” graphs is fairly substantial. If a graph is reasonably random, it cuts down to: $N \langle k \rangle^2$ vs. $N \langle k^2 \rangle$

For a heavy-tailed social graph (like our Commercial Graph), this gain can be huge

Enumerating Rectangles

Triangles will tell you the friends you have in common with another friend

“People you May Know”: Find another node, not connected to you, who has many friends in common with you. That node is a good candidate for “friendship”.

Basis of user-based or content-based collaborative filtering, if the graph is bipartite
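A small sketch of the “People you May Know” idea: for every pair of nodes that are not connected, count the friends they have in common. The function name and the example edge list are hypothetical, and this is a single-machine illustration rather than a Map-Reduce job.

```python
from collections import Counter, defaultdict
from itertools import combinations

def people_you_may_know(edges):
    """Count common neighbors for every pair of nodes that is NOT connected."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    nbrs = dict(nbrs)  # freeze: no accidental key creation while iterating
    common = Counter()
    for middle, friends in nbrs.items():
        # Every pair of friends of `middle` shares `middle` as a common friend.
        for a, b in combinations(sorted(friends), 2):
            if b not in nbrs[a]:
                common[(a, b)] += 1
    return common

# Hypothetical friendship graph.
friendships = [(1, 2), (1, 3), (4, 2), (4, 3), (1, 5)]
suggestions = people_you_may_know(friendships)
print(suggestions.most_common())
```

Here nodes 1 and 4 are not connected but share friends 2 and 3, so the pair (1, 4) scores 2; those four nodes form the rectangle 1-2-4-3.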


Generalization to Rectangles


There are 4 classes for a rectangle: requires a bit more work

[Figure: a triangle on nodes A, B, C, and rectangles on nodes 1 to 4 shown with different node orderings]

Ordering triangle nodes has a unique equivalence class

Performance Metrics

Computation:

Total computation in all mappers and reducers

Communication:

How many bits are shuffled from the mapper to the reducer

Number of map-reduce steps:

You can work it into the above

The overhead of running jobs


“Recipes” for Graph MR Algorithms

Roughly two classes of algorithms:

1. Partition-Compute then Merge

Create smaller sub-graphs that fit into a single machine's memory

Do computation on the small graphs

Construct the final answer from the answers to the small sub-problems

2. Compute-in-Parallel then Merge

Partition-Compute-Merge


Finding Triangles By Partitioning [Suri ‘11]

1. Partition the nodes into b sets: $V = V_1 \cup V_2 \cup \dots \cup V_b$, with $V_i \cap V_j = \emptyset$ for $i \neq j$

2. For every 3 sets $V_{i,j,k} = V_i \cup V_j \cup V_k$, $i < j < k$, create a reducer.

3. Send an edge to $V_{i,j,k}$ iff both its ends are in $V_{i,j,k}$

4. Detect triangles using a serial algorithm within each reducer

b=4, V1={1}, V2={2}, V3={3}, V4={4}

[Figure: the example graph with edges (1,3), (2,3), (2,4), (3,4), and the edges received by the reducers V1,2,3, V1,3,4 and V2,3,4]
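The partitioning steps can be simulated in one process as a rough sketch. The partition function `(v - 1) % b` is an assumption chosen so that b=4 reproduces V1={1}, ..., V4={4}; `serial_triangles` is a simple stand-in for whichever serial triangle finder a reducer would run.

```python
from collections import defaultdict
from itertools import combinations

def serial_triangles(edges):
    """Stand-in serial triangle finder: intersect neighbor sets of each edge."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return {tuple(sorted((u, v, w))) for u, v in edges for w in nbrs[u] & nbrs[v]}

def partition_triangles(edges, b):
    part = lambda v: (v - 1) % b  # assumed partition function (0-based set index)
    # One reducer per triple of partitions i < j < k.
    reducers = {triple: [] for triple in combinations(range(b), 3)}
    # Map: send an edge to every reducer whose triple covers both endpoints.
    for u, v in edges:
        for triple in reducers:
            if part(u) in triple and part(v) in triple:
                reducers[triple].append((u, v))
    # Reduce: run the serial finder on each reducer's local edge list.
    found = []
    for local_edges in reducers.values():
        found.extend(sorted(serial_triangles(local_edges)))
    return reducers, found

EDGES = [(1, 3), (2, 3), (2, 4), (3, 4)]
reducers, found = partition_triangles(EDGES, b=4)
shuffled = sum(len(local) for local in reducers.values())
print(found, shuffled)
```

With one node per partition, each edge is copied to b - 2 = 2 reducers (8 shuffled records for 4 edges), and the triangle {2, 3, 4} is found only in the reducer for V2,3,4.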

Analysis

Every triangle is detected. All 3 vertices are guaranteed to be in at least one partition

Average # edges in each reducer is $O\left(\frac{M}{b^2}\right)$

Use an optimal serial triangle finder at each reducer. The total amount of work at all reducers is:

$\left(\frac{M}{b^2}\right)^{3/2} \times b^3 = O\left(M^{3/2}\right)$

# of edges sent from the mappers to reducers (communication cost) is

$O(bM) = O\left(M^{3/2}\right)$ for $b = \sqrt{M}$

One Problem

Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected $\binom{b-1}{2} \sim O(b^2)$ times, once for every triple of sets that contains that partition

This can be fixed with a similar ordering-of-nodes trick [Afrati ’12]

Can be generalized to detect other small graph structures efficiently [Afrati ‘12]

Minimum Weight Spanning Tree

1. Partition the nodes into b sets

2. For every pair of sets create a reducer

3. Send all edges that have both their ends in one pair to the corresponding reducer

4. Compute the minimum spanning tree for the graph in each reducer. Remove other edges to sparsify the graph

5. Compute the MST for the sparsified graph
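The five steps can be sketched on a single machine as follows. Kruskal's algorithm is used here as the per-reducer MST routine, which is an assumption (the slide does not name one), and the round-robin partitioning and example weights are made up for illustration. The sketch needs b ≥ 2 so that every edge falls into at least one pair.

```python
from itertools import combinations

def kruskal(nodes, edges):
    """Minimum spanning forest; edges are (weight, u, v) tuples."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def partitioned_mst(nodes, edges, b):
    # Assumed partitioning: round-robin over the sorted node ids.
    part = {v: i % b for i, v in enumerate(sorted(nodes))}
    kept = set()
    for i, j in combinations(range(b), 2):      # one "reducer" per pair of sets
        local = [(w, u, v) for w, u, v in edges
                 if part[u] in (i, j) and part[v] in (i, j)]
        local_nodes = {x for _, u, v in local for x in (u, v)}
        kept.update(kruskal(local_nodes, local))  # drop edges outside the local MST
    return kruskal(nodes, kept)                   # MST of the sparsified graph

NODES = [1, 2, 3, 4, 5, 6]
WEIGHTED = [(1, 1, 2), (4, 1, 3), (3, 2, 3), (2, 3, 4), (7, 2, 4),
            (5, 4, 5), (6, 5, 6), (8, 4, 6), (9, 1, 6)]
mst = partitioned_mst(NODES, WEIGHTED, b=3)
print(mst, sum(w for w, _, _ in mst))
```

An edge discarded by a reducer is the heaviest edge on some cycle of the local sub-graph, so it cannot belong to the global MST either; that is why the sparsified graph still contains the answer.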


Compute-in-parallel and merge


Personalized PageRank

Like the global PageRank, but the random walker comes back to where it started with probability d

For every v you will have a personalized PageRank vector of length N.

We usually keep only a limited number of top personalized PageRanks for each node.

It finds the influential nodes in the proximity of a given node.

Monte Carlo Approximation

Simulate many random walks from every single node. For each walk:

1. A walk starting from node v is identified by v

Keep track of <v, $U_{v,t}$> where $U_{v,t}$ is the current end point at step t for the walk starting at node v

2. In each Map-Reduce step advance the walk by 1 step

Pick a random neighbor of $U_{v,t}$

3. Count the frequency of visits to each node
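A minimal sketch of the Monte Carlo estimate for a single source node, assuming an undirected graph in which every node has at least one neighbor. The parameter values (`num_walks`, `steps`, `d`, `seed`) and the path-graph example are illustrative choices, and each pass of the outer loop stands in for one Map-Reduce step.

```python
import random
from collections import Counter, defaultdict

def monte_carlo_ppr(edges, source, num_walks=2000, steps=20, d=0.15, seed=0):
    """Estimate personalized PageRank of `source` from visit frequencies."""
    rng = random.Random(seed)
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    endpoints = [source] * num_walks   # U_{v,t}: current end point of each walk
    visits = Counter()
    for _ in range(steps):             # one Map-Reduce step per iteration
        for i, u in enumerate(endpoints):
            if rng.random() < d:
                endpoints[i] = source  # walker comes back to where it started
            else:
                endpoints[i] = rng.choice(nbrs[u])  # pick a random neighbor
            visits[endpoints[i]] += 1
    total = sum(visits.values())
    return {node: count / total for node, count in visits.items()}

PATH = [(1, 2), (2, 3), (3, 4), (4, 5)]
scores = monte_carlo_ppr(PATH, source=1)
top = sorted(scores, key=scores.get, reverse=True)[:3]  # keep only the top few
print(top)
```

Nodes close to the source collect most of the visits, which matches the intuition that personalized PageRank finds influential nodes in the proximity of a given node.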


One can do better [Das Sarma ‘08]

This takes T steps for a walk of length T

We can cut it down to $T^{1/2}$ by a simple “stitching” idea

1. Do T/J random walks (segments) of length J from every node, for some J

2. To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment

3. Pick another random segment, etc.

4. If you arrive at a node twice, do not use the same segment (that’s why you need T/J segments)

Total iterations: J + T/J, minimized when $J = T^{1/2}$, giving $O(T^{1/2})$
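The choice of J follows from a one-line minimization of the iteration count. The worked numbers for T = 100 are only an illustration.

```latex
f(J) = J + \frac{T}{J}, \qquad
f'(J) = 1 - \frac{T}{J^2} = 0 \;\Longrightarrow\; J = \sqrt{T}, \qquad
f\!\left(\sqrt{T}\right) = 2\sqrt{T} = O\!\left(T^{1/2}\right)

% Example: T = 100 gives J = 10 and f(10) = 10 + 100/10 = 20 iterations,
% compared with 100 iterations when the walk advances one step per round.
```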


Exponential speed up [Bahmani ‘11]

The stitching was done somewhat serially (at each step, one segment was stitched to another)

Idea: Stitch recursively, which will result in exponentially expanding the walk/segment ratio

It takes a few more tricks to make it work, but you can bring it down to O(log T)

Labeling Connected Components

Assign the same ID to all nodes inside the same component

[Figure: example graph on nodes 1 to 6]

How do we do it on one machine?

1. i=1

2. Pick a random node you have not picked before, assign it id=i and put it in a stack

3. Pop a node from the stack, push all its neighbors we have not seen before onto the stack. Assign them id=i

4. If stack is not empty go to 3, otherwise i ← i+1 and go to 2

Time and memory complexity O(M).
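The stack-based procedure above can be written as a short sketch. For reproducibility it picks seeds in the order the nodes are given instead of at random, and the example edge list is hypothetical since the figure's edges are not recoverable.

```python
from collections import defaultdict

def label_components(nodes, edges):
    """Assign the same id to all nodes of a connected component."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    label = {}
    i = 1
    for seed in nodes:              # step 2: pick a node not picked before
        if seed in label:
            continue
        label[seed] = i
        stack = [seed]
        while stack:                # steps 3-4: grow the component from the seed
            node = stack.pop()
            for neighbor in nbrs[node]:
                if neighbor not in label:
                    label[neighbor] = i
                    stack.append(neighbor)
        i += 1                      # next component gets the next id
    return label

# Hypothetical graph: component {1, 2, 3}, isolated node 4, component {5, 6}.
labels = label_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 6)])
print(labels)
```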


In Map-Reduce: More Parallelism

Instead of growing a frontier zone from a single seed, start growing it from all nodes. When two zones meet, merge them

Example graph: the path 1 - 2 - 3 - 4

Edge File: <v1,v2> <v2,v3> <v3,v4>

Zone File: <v1,z1> <v2,z2> <v3,z3> <v4,z4>

Game Plan

1. Bin Zone and Edge by Node

v1: <v1,v2> <v1,z1> → <[v1,v2],z1>

v2: <v2,v1> <v2,v3> <v2,z2> → <[v1,v2],z2> <[v2,v3],z2>

v3: <v3,v2> <v3,v4> <v3,z3> → <[v2,v3],z3> <[v3,v4],z3>

v4: <v4,v3> <v4,z4> → <[v3,v4],z4>

2. Bin edge to zone map, collect over edges

[v1,v2]: <[v1,v2],z1> <[v1,v2],z2> → <z2,z1>

[v2,v3]: <[v2,v3],z2> <[v2,v3],z3> → <z3,z2>

[v3,v4]: <[v3,v4],z3> <[v3,v4],z4> → <z4,z3>

A zone to zone map: <z2,z1> <z3,z2> <z4,z3>

3. Reconcile zones, reassign zones to nodes

z2: <z2,v2> <z2,z1> → <v2,z1>

z3: <z3,v3> <z3,z2> → <v3,z2>

z4: <z4,v4> <z4,z3> → <v4,z3>

New Zone File: <v1,z1> <v2,z1> <v3,z2> <v4,z3>
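One round of the zone-merging game plan, and its repetition until nothing changes, can be simulated in a few lines. This is a single-process sketch under the assumption that the smaller zone id always wins a merge; `merge_round` and `label_zones` are illustrative names, and each call to `merge_round` stands in for the chain of Map-Reduce jobs in one round.

```python
def merge_round(edges, zone):
    """One round: build the zone-to-zone map from the edges, then reassign."""
    zone_map = {}
    for u, v in edges:
        lo, hi = sorted((zone[u], zone[v]))
        if lo != hi:
            # Two zones meet on this edge: the larger id merges into the smaller.
            zone_map[hi] = min(zone_map.get(hi, lo), lo)
    return {node: zone_map.get(z, z) for node, z in zone.items()}

def label_zones(nodes, edges):
    """Start with one zone per node and repeat rounds until no zone changes."""
    zone = {v: v for v in nodes}
    rounds = 0
    while True:
        new_zone = merge_round(edges, zone)
        if new_zone == zone:
            return zone, rounds
        zone, rounds = new_zone, rounds + 1

PATH_NODES = [1, 2, 3, 4]
PATH_EDGES = [(1, 2), (2, 3), (3, 4)]
first = merge_round(PATH_EDGES, {v: v for v in PATH_NODES})
final, rounds = label_zones(PATH_NODES, PATH_EDGES)
print(first, final, rounds)
```

On the 4-node path the first round yields zones z1, z1, z2, z3 for v1 to v4, and it takes 3 rounds (the diameter of the path) before every node carries zone 1.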

Analysis

Communication: O(M+N)

Number of rounds: O(d) where d is the diameter of the graph.

Most real graphs have small diameters.

Random graph: d=O(log N)

The worst case is a “path graph”, where d grows linearly with N

An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi ’12]

Uses an idea similar to MinHash

References

Cohen, Jonathan. "Graph twiddling in a MapReduce world." Computing in Science & Engineering 11.4 (2009): 29-41.

Suri, Siddharth, and Sergei Vassilvitskii. "Counting triangles and the curse of the last reducer." Proceedings of the 20th international conference on World wide web. ACM, 2011.

Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 37th SIGMOD international conference on Management of data. 2011.

Das Sarma, A., S. Gollapudi, and R. Panigrahy. "Estimating PageRank on graph streams." In PODS, pages 69–78, 2008.

Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. "Enumerating Subgraph Instances Using Map-Reduce." http://arxiv.org/abs/1208.0615, 2012.

Lattanzi, Silvio, et al. "Filtering: a method for solving graph problems in mapreduce." 2011.
