Post on 16-Apr-2017

Page 1: Demystifying Distributed Graph Processing

DEMYSTIFYING

DISTRIBUTED GRAPH PROCESSING

Vasia Kalavri

@vkalavri

Page 2: Demystifying Distributed Graph Processing
Page 3: Demystifying Distributed Graph Processing

WHY DISTRIBUTED GRAPH PROCESSING?

Page 4: Demystifying Distributed Graph Processing

MISCONCEPTION #1

"MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE"

Big Data Ninja

Page 5: Demystifying Distributed Graph Processing

A SOCIAL NETWORK

Page 6: Demystifying Distributed Graph Processing

YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

Page 7: Demystifying Distributed Graph Processing

INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL

▸ Naive Who(m) to Follow:

    ▸ compute a friends-of-friends list per user

    ▸ exclude existing friends

    ▸ rank by common connections
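The three steps above can be sketched in a few lines, which also makes the intermediate-data problem visible. This is a minimal single-machine illustration, not a distributed implementation; the function names `recommend` and `intermediate_pairs` and the tiny `friends` dataset are invented for the example.

```python
from collections import Counter

def recommend(friends, user, top_k=3):
    """Naive who-to-follow: friends-of-friends, minus existing friends,
    ranked by the number of common connections."""
    candidates = Counter()
    for friend in friends[user]:
        for fof in friends[friend]:
            # exclude the user and anyone they already follow
            if fof != user and fof not in friends[user]:
                candidates[fof] += 1
    # most common connections first; ties broken by name for determinism
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]

def intermediate_pairs(friends):
    """Count the (user, friend-of-friend) records the naive approach
    materializes before any filtering or ranking."""
    return sum(len(friends[f]) for u in friends for f in friends[u])

# Hypothetical, symmetric friendship graph (illustration only).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

input_records = sum(len(f) for f in friends.values())
print(recommend(friends, "ann"))                   # ['dan', 'eve']
print(input_records, intermediate_pairs(friends))  # 12 34
```

The intermediate record count grows with the sum of squared degrees, so a few very well connected users can dominate the cost even when the input edge list is small.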

Page 8: Demystifying Distributed Graph Processing

MISCONCEPTION #2

"DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE"

Data Science Rockstar

Page 9: Demystifying Distributed Graph Processing
Page 10: Demystifying Distributed Graph Processing

GRAPHS DON’T APPEAR OUT OF THIN AIR

Expectation…

Page 11: Demystifying Distributed Graph Processing

GRAPHS DON’T APPEAR OUT OF THIN AIR

Reality!

Page 12: Demystifying Distributed Graph Processing

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

Page 13: Demystifying Distributed Graph Processing

GRAPH APPLICATIONS ARE DIVERSE

▸ Iterative value propagation
    ▸ PageRank, Connected Components, Label Propagation

▸ Traversals and path exploration
    ▸ Shortest paths, centrality measures

▸ Ego-network analysis
    ▸ Personalized recommendations

▸ Pattern mining
    ▸ Finding frequent subgraphs

Page 14: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph

Page 15: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph

Highlighted application classes: iterative value propagation

Page 16: Demystifying Distributed Graph Processing

PREGEL: THINK LIKE A VERTEX

[Figure: a graph with vertices 1 to 5, stored as one adjacency list per vertex]

1: 3, 4
2: 1, 4
5: 3
. . .

Page 17: Demystifying Distributed Graph Processing

PREGEL: SUPERSTEPS

(Vi+1, outbox) <- compute(Vi, inbox)

[Figure: the per-vertex adjacency lists (1: 3, 4; 2: 1, 4; 5: 3; . . .) shown at superstep i and again at superstep i+1]
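The superstep rule can be sketched as a small synchronous loop. This is a single-process illustration under stated assumptions: `run_pregel` and `propagate_max` are invented names, halting is simplified to "stop when no messages were sent" instead of Pregel's vote-to-halt, and the out-edges of vertices 3 and 4 are made up because the slide elides them.

```python
def run_pregel(values, out_edges, compute, max_supersteps=30):
    """Run compute() on every vertex once per superstep.

    compute(vertex, value, inbox, targets, superstep) must return
    (new_value, outbox), where outbox is a list of (target, payload).
    Messages sent in superstep i are only visible in superstep i+1.
    """
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        next_values = {}
        next_inbox = {v: [] for v in values}
        sent = 0
        for v in values:
            new_value, outbox = compute(
                v, values[v], inbox[v], out_edges.get(v, []), superstep)
            next_values[v] = new_value
            for target, payload in outbox:
                next_inbox[target].append(payload)
                sent += 1
        # synchronization barrier: swap state only after all vertices ran
        values, inbox = next_values, next_inbox
        if sent == 0:  # simplified halting condition (assumption)
            break
    return values

def propagate_max(vertex, value, inbox, targets, superstep):
    """Each vertex keeps the largest value it has seen so far."""
    new_value = max([value] + inbox)
    if superstep == 0 or new_value > value:
        return new_value, [(t, new_value) for t in targets]
    return new_value, []

# Adjacency lists from the slide; entries for 3 and 4 are assumptions.
out_edges = {1: [3, 4], 2: [1, 4], 3: [2], 4: [5], 5: [3]}
result = run_pregel({v: v for v in out_edges}, out_edges, propagate_max)
print(result)
```

Because every vertex in this example graph can reach every other, all vertices end up holding the largest initial value.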

Page 18: Demystifying Distributed Graph Processing

PREGEL EXAMPLE: PAGERANK

void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)

    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for
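A runnable approximation of the compute function above, for readers who want to step through it. It is a sketch under assumptions: `pregel_pagerank` is an invented name, ranks start at 1/n, the rank update is skipped in the first superstep (when no messages exist yet), every vertex is assumed to have at least one out-edge, and the edges of vertices 3 and 4 in the sample graph are made up.

```python
def pregel_pagerank(out_edges, supersteps=20):
    """Vertex-centric PageRank: one compute() call per vertex per superstep."""
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    inbox = {v: [] for v in vertices}
    for step in range(supersteps):
        outbox = {v: [] for v in vertices}
        for v in vertices:
            if step > 0:  # assumption: no update before any message arrived
                total = sum(inbox[v])              # sum up received messages
                rank[v] = 0.15 / n + 0.85 * total  # update vertex rank
            share = rank[v] / len(out_edges[v])
            for target in out_edges[v]:            # distribute rank to neighbors
                outbox[target].append(share)
        inbox = outbox  # messages become visible in the next superstep
    return rank

# Sample graph; the out-edges of vertices 3 and 4 are assumptions.
graph = {1: [3, 4], 2: [1, 4], 3: [2], 4: [5], 5: [3]}
ranks = pregel_pagerank(graph)
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 4))
```

With no dangling vertices the ranks keep summing to 1 after every superstep, which is a convenient sanity check.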

Page 19: Demystifying Distributed Graph Processing

SIGNAL-COLLECT

Signal: outbox <- signal(Vi)

Collect: Vi+1 <- collect(inbox)

[Figure: the per-vertex adjacency lists (1: 3, 4; 2: 1, 4; 5: 3; . . .) shown in the Signal and Collect phases of superstep i, and again at superstep i+1]

Page 20: Demystifying Distributed Graph Processing

SIGNAL-COLLECT EXAMPLE: PAGERANK

void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for

void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
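The same computation split into the two phases might look as follows. This is a sketch, not the Signal-Collect framework's API: the driver `signal_collect_pagerank` and the explicit `n` parameter are assumptions, ranks start at 1/n, and every vertex is assumed to have at least one out-edge.

```python
def signal(rank, targets):
    """Signal phase: the share of rank sent along each out-edge."""
    return [(target, rank / len(targets)) for target in targets]

def collect(messages, n):
    """Collect phase: the new rank computed from the received signals."""
    return 0.15 / n + 0.85 * sum(messages)

def signal_collect_pagerank(out_edges, supersteps=20):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # Signal: every vertex only produces messages
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for target, share in signal(rank[v], targets):
                inbox[target].append(share)
        # Collect: every vertex only consumes messages
        rank = {v: collect(inbox[v], n) for v in out_edges}
    return rank

cycle = {1: [2], 2: [3], 3: [1]}
print(signal_collect_pagerank(cycle))
```

Separating message production from the value update means `signal` never needs the inbox and `collect` never needs the out-edges, which is the main difference from a single `compute` function.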

Page 21: Demystifying Distributed Graph Processing

GATHER-SUM-APPLY (POWERGRAPH)

[Figure: the Gather, Sum, and Apply phases for a vertex and its neighbors within superstep i, followed by Gather in superstep i+1]

Page 22: Demystifying Distributed Graph Processing

GSA EXAMPLE: PAGERANK

// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()

// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2

// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
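The three GSA functions can be exercised with a small driver. This is a sketch under assumptions: the driver `gsa_pagerank`, the name `sum_ranks`, and the parameter lists are invented; each vertex gathers from its in-neighbors; and `apply` uses the normalized constant 0.15/n to stay comparable with the Pregel and Signal-Collect slides, whereas the slide above uses the unnormalized 0.15.

```python
from functools import reduce

def gather(neighbor_rank, neighbor_out_degree):
    """Gather: partial rank contributed by one neighbor."""
    return neighbor_rank / neighbor_out_degree

def sum_ranks(rank1, rank2):
    """Sum: associative, commutative combination of two partial ranks."""
    return rank1 + rank2

def apply(total, n):
    """Apply: new rank from the combined value (normalized variant)."""
    return 0.15 / n + 0.85 * total

def gsa_pagerank(out_edges, supersteps=20):
    n = len(out_edges)
    # invert the edge list once so each vertex knows its in-neighbors
    in_edges = {v: [] for v in out_edges}
    for source, targets in out_edges.items():
        for target in targets:
            in_edges[target].append(source)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        new_rank = {}
        for v in out_edges:
            partials = [gather(rank[u], len(out_edges[u])) for u in in_edges[v]]
            new_rank[v] = apply(reduce(sum_ranks, partials, 0.0), n)
        rank = new_rank
    return rank

cycle = {1: [2], 2: [3], 3: [1]}
print(gsa_pagerank(cycle))
```

Because the combination step is a pairwise reduction, partial sums for a high-degree vertex can be computed on several machines and merged, which is the property PowerGraph exploits for skewed graphs.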

Page 23: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++

Highlighted application classes: iterative value propagation

Page 24: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++

Highlighted application classes: iterative value propagation, graph traversals

Page 25: Demystifying Distributed Graph Processing

THINK LIKE A (SUB)GRAPH

[Figure: the graph with vertices 1 to 5, shown twice]

▸ compute() on the entire partition

▸ Information flows freely inside each partition

▸ Network communication between partitions, not vertices
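To make the three points concrete, here is a sketch of partition-centric connected components using minimum-label propagation. It is an illustration only, not Giraph++'s API: `partition_centric_components`, the `partition_of` mapping, and the "send only when a label changed" rule are all assumptions of this example.

```python
def partition_centric_components(edges, partition_of):
    """Label every vertex with the smallest vertex id in its component.

    edges: undirected (u, v) pairs; partition_of: vertex -> partition id.
    Returns (labels, number_of_supersteps).
    """
    label = {v: v for v in partition_of}
    neighbors = {v: set() for v in partition_of}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    partitions = {}
    for v, p in partition_of.items():
        partitions.setdefault(p, []).append(v)

    inbox = {p: [] for p in partitions}
    sent = {}  # last label each vertex announced to other partitions
    supersteps = 0
    while True:
        supersteps += 1
        outbox = {p: [] for p in partitions}
        for p, members in partitions.items():
            # apply labels received from other partitions
            for v, candidate in inbox[p]:
                label[v] = min(label[v], candidate)
            # compute() on the whole partition: propagate to a local fixpoint
            changed = True
            while changed:
                changed = False
                for v in members:
                    for w in neighbors[v]:
                        if partition_of[w] == p and label[v] < label[w]:
                            label[w] = label[v]
                            changed = True
            # network messages only cross partition boundaries
            for v in members:
                if sent.get(v) != label[v]:
                    for w in neighbors[v]:
                        if partition_of[w] != p:
                            outbox[partition_of[w]].append((w, label[v]))
                    sent[v] = label[v]
        if not any(outbox.values()):
            break
        inbox = outbox
    return label, supersteps

# A path 1-2-3-4-5-6 split into two partitions of three vertices each.
path = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
halves = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
labels, steps = partition_centric_components(path, halves)
print(labels, steps)
```

On this path the label 1 crosses the single cut edge once and then spreads through the second partition without further supersteps, whereas a vertex-centric run would need roughly one superstep per hop.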

Page 26: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++
▸ 2014: NScale

Highlighted application classes: iterative value propagation, graph traversals

Page 27: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++
▸ 2014: NScale
▸ 2015: Arabesque
▸ Tinkerpop

Highlighted application classes: iterative value propagation, graph traversals, ego-network analysis

Page 28: Demystifying Distributed Graph Processing

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++
▸ 2014: NScale
▸ 2015: Arabesque
▸ Tinkerpop

Highlighted application classes: iterative value propagation, graph traversals, ego-network analysis, pattern matching

Page 29: Demystifying Distributed Graph Processing
Page 30: Demystifying Distributed Graph Processing

CAN WE HAVE IT ALL?

▸ Data pipeline integration: built on top of an efficient distributed processing engine

▸ Graph ETL: high-level API with abstractions and methods to transform graphs

▸ Familiar programming model: support popular programming abstractions

Page 31: Demystifying Distributed Graph Processing

HELLO, GELLY! THE APACHE FLINK GRAPH API

▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API

▸ Transformations, library of common algorithms

    // build a graph from a DataSet of edges
    val graph = Graph.fromDataSet(edges, env)

    // run the library PageRank: damping factor 0.85, 20 iterations
    val ranks = graph.run(new PageRank(0.85, 20))

▸ Iteration abstractions
    ▸ Pregel
    ▸ Signal-Collect
    ▸ Gather-Sum-Apply
    ▸ Partition-Centric*

Page 32: Demystifying Distributed Graph Processing

WHY FLINK?

‣ efficient streaming runtime

‣ native iteration operators

‣ well-integrated

[Figure: diagram with the labels POSIX and Java/Scala Collections]

Page 33: Demystifying Distributed Graph Processing

FEELING GELLY?

▸ Paper references: http://www.citeulike.org/user/vasiakalavri/tag/dotscale

▸ Apache Flink: http://flink.apache.org/

▸ Gelly documentation: http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html

▸ Gelly-Stream: https://github.com/vasia/gelly-streaming