DEMYSTIFYING DISTRIBUTED GRAPH PROCESSING
WHY DISTRIBUTED GRAPH PROCESSING?
MISCONCEPTION #1
“MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE” (Big Data Ninja)
A SOCIAL NETWORK
YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT
INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL
▸ Naive Who(m) to Follow:
  ▸ compute a friends-of-friends list per user
  ▸ exclude existing friends
  ▸ rank by common connections
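The three steps above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the deck's implementation: the toy network, `who_to_follow`, and `two_hop_paths` are hypothetical names introduced here. The point is that the friends-of-friends step emits one record per two-hop path, so the intermediate data grows with the sum of squared degrees rather than with the number of edges.

```python
from collections import Counter


def who_to_follow(friends, user, top_k=3):
    """Rank non-friends of `user` by the number of common connections."""
    # Step 1: friends-of-friends, one entry per two-hop path.
    # This is the (potentially huge) intermediate result.
    candidates = Counter(
        fof
        for friend in friends[user]
        for fof in friends[friend]
    )
    # Step 2: exclude the user and the friends they already have.
    for known in friends[user] | {user}:
        candidates.pop(known, None)
    # Step 3: rank by common connections (ties broken by name).
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def two_hop_paths(friends):
    """Number of friends-of-friends records generated before filtering."""
    return sum(len(friends[f]) for u in friends for f in friends[u])


# Hypothetical toy network: symmetric adjacency sets.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
edges = sum(len(v) for v in friends.values()) // 2

print(who_to_follow(friends, "ann"))
print(edges, "edges ->", two_hop_paths(friends), "intermediate records")
```

Even on this six-edge input the naive approach materializes 34 intermediate records; on a skewed social graph the gap is far larger.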
DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE
Data Science Rockstar
MISCONCEPTION #2
GRAPHS DON’T APPEAR OUT OF THIN AIR
Expectation…
Reality!
HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
GRAPH APPLICATIONS ARE DIVERSE
▸ Iterative value propagation
  ▸ PageRank, Connected Components, Label Propagation
▸ Traversals and path exploration
  ▸ Shortest paths, centrality measures
▸ Ego-network analysis
  ▸ Personalized recommendations
▸ Pattern mining
  ▸ Finding frequent subgraphs
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ Application class: iterative value propagation
PREGEL: THINK LIKE A VERTEX
[Figure: a graph with vertices 1 to 5, next to a per-vertex table with rows "1 | 3, 4", "2 | 1, 4", "5 | 3", . . .]
PREGEL: SUPERSTEPS
(V_{i+1}, outbox) <- compute(V_i, inbox)
[Figure: the per-vertex table (rows "1 | 3, 4", "2 | 1, 4", "5 | 3", . . .) shown at superstep i and at superstep i+1]
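The superstep rule can be illustrated with a small single-machine driver. This is a sketch under assumptions, not Pregel's actual API: `run_supersteps` and `propagate_max` are hypothetical names, the toy graph is invented, and "a vertex is active when it has incoming messages" is a simplification of Pregel's vote-to-halt mechanism. The example propagates the maximum value through the graph.

```python
def run_supersteps(values, out_edges, compute, max_supersteps=50):
    """Apply compute() to every active vertex, one superstep at a time."""
    # Superstep 0: every vertex is active and starts with an empty inbox.
    inboxes = {v: [] for v in values}
    active = set(values)
    superstep = 0
    while active and superstep < max_supersteps:
        next_inboxes = {v: [] for v in values}
        for v in active:
            # (V_{i+1}, outbox) <- compute(V_i, inbox)
            values[v], outbox = compute(values[v], inboxes[v], out_edges[v], superstep)
            for target, message in outbox:
                next_inboxes[target].append(message)
        # Barrier: messages sent in superstep i are only visible in superstep i+1.
        inboxes = next_inboxes
        active = {v for v, messages in inboxes.items() if messages}
        superstep += 1
    return values, superstep


def propagate_max(value, inbox, neighbors, superstep):
    """Keep the largest value seen so far; notify neighbors when it changes."""
    new_value = max([value] + inbox)
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbors]
    return new_value, []


# Hypothetical path graph 1 - 2 - 3 - 4 with arbitrary initial values.
out_edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
values, steps = run_supersteps({1: 3, 2: 6, 3: 2, 4: 1}, out_edges, propagate_max)
print(values, "after", steps, "supersteps")
```

The computation stops on its own once no vertex receives a message, which is the usual termination condition for this family of models.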
PREGEL EXAMPLE: PAGERANK
void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
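For readers who want to run the pseudocode, here is a minimal single-machine Python rendering. Assumptions are labeled: `pregel_pagerank` and the four-vertex graph are hypothetical, every vertex is assumed to have at least one out-edge, and superstep 0 only sends the initial rank without updating it (a common convention that the slide does not show).

```python
def pregel_pagerank(out_edges, supersteps=50, damping=0.85):
    """Vertex-centric PageRank: each vertex sums its inbox, updates, then sends."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps + 1):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if step > 0:
                # sum up received messages and update the vertex rank
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            if step < supersteps:
                # distribute rank to neighbors
                for t in targets:
                    outbox[t].append(rank[v] / len(targets))
        # Barrier: this superstep's outbox is the next superstep's inbox.
        inbox = outbox
    return rank


# Hypothetical graph; vertex 4 has no incoming edges.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = pregel_pagerank(graph)
print(ranks)
```

Because no vertex is dangling, the ranks keep summing to 1, and vertex 4 settles at the teleport term 0.15/4.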
SIGNAL-COLLECT
outbox <- signal(V_i)
V_{i+1} <- collect(inbox)
[Figure: the per-vertex table (rows "1 | 3, 4", "2 | 1, 4", "5 | 3", . . .) across the Signal and Collect phases, from superstep i to superstep i+1]
SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
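The split into two functions can be sketched in Python as follows. This is an illustration under assumptions, not the Signal/Collect framework's API: the driver `signal_collect_pagerank`, the function signatures, and the toy graph are hypothetical. Note that `signal` reads only the vertex value and `collect` reads only the inbox, which is the separation the model enforces.

```python
def signal(rank, targets):
    """Signal phase: build the outbox from the current vertex value only."""
    return [(t, rank / len(targets)) for t in targets]


def collect(messages, num_vertices, damping=0.85):
    """Collect phase: compute the new vertex value from the inbox only."""
    return (1 - damping) / num_vertices + damping * sum(messages)


def signal_collect_pagerank(out_edges, supersteps=50):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        inbox = {v: [] for v in out_edges}
        # Signal phase: every vertex distributes its rank to its neighbors.
        for v, targets in out_edges.items():
            for target, message in signal(rank[v], targets):
                inbox[target].append(message)
        # Collect phase: every vertex sums its messages and updates its rank.
        rank = {v: collect(inbox[v], n) for v in out_edges}
    return rank


# Hypothetical graph in which every vertex has at least one out-edge.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = signal_collect_pagerank(graph)
print(ranks)
```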
GATHER-SUM-APPLY (POWERGRAPH)
[Figure: the Gather, Sum, and Apply phases of one superstep, followed by the Gather phase of the next, from superstep i to superstep i+1]
GSA EXAMPLE: PAGERANK
// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()
// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2
// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
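A runnable Python sketch of the same three functions follows. Several things are assumptions made for illustration: the driver `gsa_pagerank` and the toy graph are hypothetical, the slide's `sum` is renamed `combine` to avoid shadowing Python's builtin, and the vertex passed to `gather` is interpreted as the in-neighbor whose rank is being pulled. As on the slide, `apply` uses the unnormalized form 0.15 + 0.85*sum, so with initial ranks of 1.0 the ranks sum to the number of vertices rather than to 1.

```python
from functools import reduce


def gather(neighbor_rank, neighbor_out_degree):
    """Compute the partial rank contributed by one in-neighbor."""
    return neighbor_rank / neighbor_out_degree


def combine(rank1, rank2):
    """Combine two partial ranks (associative and commutative)."""
    return rank1 + rank2


def apply(total, current_rank):
    """Update the rank from the combined partial ranks."""
    return 0.15 + 0.85 * total


def gsa_pagerank(out_edges, supersteps=50):
    # Build in-neighbor lists once; gather pulls along incoming edges.
    in_neighbors = {v: [] for v in out_edges}
    for source, targets in out_edges.items():
        for t in targets:
            in_neighbors[t].append(source)
    rank = {v: 1.0 for v in out_edges}
    for _ in range(supersteps):
        new_rank = {}
        for v, sources in in_neighbors.items():
            partials = (gather(rank[s], len(out_edges[s])) for s in sources)
            total = reduce(combine, partials, 0.0)
            new_rank[v] = apply(total, rank[v])
        rank = new_rank
    return rank


# Hypothetical graph in which every vertex has at least one out-edge.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = gsa_pagerank(graph)
print(ranks)
```

Because `combine` is associative and commutative, partial ranks can be pre-aggregated wherever the edges live before being shipped to the vertex, which is what makes this model attractive for high-degree vertices.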
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2013: Giraph++ (graph traversals)
THINK LIKE A (SUB)GRAPH
[Figure: the graph with vertices 1 to 5, drawn twice]
▸ compute() runs on the entire partition
▸ Information flows freely inside each partition
▸ Network communication happens between partitions, not between vertices
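To make the partition-centric idea concrete, here is a small Python sketch of connected components by minimum-label propagation. It is an illustration under assumptions, not Giraph++ code: `partition_centric_components`, the edge list, and the partition assignment are all hypothetical. Each round first lets labels spread to a fixpoint inside every partition (no network involved), then exchanges labels only across edges that cross a partition boundary.

```python
def partition_centric_components(edges, partition_of):
    """Min-label connected components with a local phase and a network phase."""
    label = {v: v for v in partition_of}
    internal = [(u, v) for u, v in edges if partition_of[u] == partition_of[v]]
    cut = [(u, v) for u, v in edges if partition_of[u] != partition_of[v]]
    rounds = 0
    while True:
        rounds += 1
        # Local phase: information flows freely inside each partition.
        changed = True
        while changed:
            changed = False
            for u, v in internal:
                low = min(label[u], label[v])
                if label[u] != low or label[v] != low:
                    label[u] = label[v] = low
                    changed = True
        # Network phase: messages travel only along cut edges,
        # and are all built before any of them is applied (barrier).
        messages = []
        for u, v in cut:
            messages.append((v, label[u]))
            messages.append((u, label[v]))
        updated = False
        for target, received in messages:
            if received < label[target]:
                label[target] = received
                updated = True
        if not updated:
            return label, rounds


# Hypothetical input: a path 1-2-3-4-5-6 split across two partitions,
# plus a separate component 7-8.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8)]
partition_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
labels, rounds = partition_centric_components(edges, partition_of)
print(labels, "in", rounds, "rounds")
```

On this input the labels converge after two rounds of cross-partition communication. A purely vertex-centric propagation would need about as many supersteps as the path is long, since label 1 has to travel five hops to reach vertex 6.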
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2014: NScale (ego-network analysis)
▸ 2015: Arabesque (pattern matching)
▸ TinkerPop
CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
HELLO, GELLY! THE APACHE FLINK GRAPH API
▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API
▸ Transformations, library of common algorithms
val graph = Graph.fromDataSet(edges, env)
val ranks = graph.run(new PageRank(0.85, 20))
▸ Iteration abstractions
  ▸ Pregel
  ▸ Signal-Collect
  ▸ Gather-Sum-Apply
  ▸ Partition-Centric*
WHY FLINK?
[Figure: diagram with the labels "POSIX" and "Java/Scala Collections"]
▸ efficient streaming runtime
▸ native iteration operators
▸ well-integrated
FEELING GELLY?
▸ Paper References
http://www.citeulike.org/user/vasiakalavri/tag/dotscale
▸ Apache Flink:
http://flink.apache.org/
▸ Gelly documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html
▸ Gelly-Stream:
https://github.com/vasia/gelly-streaming