12/10/12 1
GraphLab under the hood
Zuhair Khayyat
GraphLab overview: GraphLab 1.0
● GraphLab: A New Framework For Parallel Machine Learning
– high-level abstractions for machine learning problems
– Shared-memory multiprocessor
– Assume no fault tolerance needed
– Concurrent access processing models with sequential-consistency guarantees
GraphLab overview: GraphLab 1.0
● How does GraphLab 1.0 work?
– Represent the user's data by a directed graph
– Each block of data is represented by a vertex and a directed edge
– Shared data table
– User functions:
● Update: modifies the vertex and edge state; read-only access to the shared table
● Fold: sequential aggregation into a key entry in the shared table; may modify vertex data
● Merge: parallelizes the Fold function
● Apply: finalizes the key entry in the shared table
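The four user functions above can be sketched in a few lines of Python. This is only an illustration, not GraphLab's API: the names `vertex_data`, `shared_table` and `sync` are hypothetical, the aggregate (a mean) is an arbitrary choice, and the chunks that Merge combines are folded one after another here rather than in parallel.

```python
from functools import reduce

# Hypothetical vertex data: one number per vertex.
vertex_data = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 6.0}
shared_table = {}

def fold(acc, value):
    """Fold: sequentially aggregate one vertex value into (sum, count)."""
    return (acc[0] + value, acc[1] + 1)

def merge(left, right):
    """Merge: combine two partial Fold results so Fold can run on chunks."""
    return (left[0] + right[0], left[1] + right[1])

def apply(acc):
    """Apply: finalize the aggregate before it is stored under its key."""
    return acc[0] / acc[1]

def sync(key, values, chunk_size=2):
    """Fold each chunk, Merge the partial results, Apply, store the entry."""
    values = list(values)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(fold, chunk, (0.0, 0)) for chunk in chunks]
    shared_table[key] = apply(reduce(merge, partials))

def update(vertex):
    """Update: modifies vertex state and only reads the shared table."""
    vertex_data[vertex] -= shared_table["mean"]

sync("mean", vertex_data.values())
for vertex in vertex_data:
    update(vertex)
```

After the run, `shared_table["mean"]` holds the finalized aggregate and every vertex value has been centered around it.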
GraphLab overview: Distributed GraphLab 1.0
● Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
– Fault tolerance using snapshot algorithm
– Improved distributed parallel processing
– Two-stage partitioning:
● Atoms generated by ParMetis
● Ghosts generated by the intersection of the atoms
– Finalize() function for vertex synchronization
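A minimal sketch of how ghosts could be derived once the atoms are known. The atom assignment `atom_of` is hard-coded here as an assumption (ParMetis is not called), and the function name `ghosts` is hypothetical: a vertex becomes a ghost of an atom when an edge connects it to a vertex owned by that atom.

```python
# Hypothetical input: an undirected edge list and a precomputed atom per vertex.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
atom_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

def ghosts(edges, atom_of):
    """Return, for each atom, the remote vertices it must mirror as ghosts."""
    result = {atom: set() for atom in set(atom_of.values())}
    for u, v in edges:
        if atom_of[u] != atom_of[v]:
            # The edge crosses atoms: each side keeps a ghost of the other end.
            result[atom_of[u]].add(v)
            result[atom_of[v]].add(u)
    return result

ghost_sets = ghosts(edges, atom_of)
```

Only the edge between `b` and `c` crosses the two atoms, so each atom ends up with a single ghost.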
GraphLab overview: Distributed GraphLab 1.0
[Figure: ghosts between Worker 1 and Worker 2]
PowerGraph: Introduction
● GraphLab 2.1
● Problems of highly skewed power-law graphs:
– Workload imbalance ==> performance degradations
– Limiting Scalability
– Hard to partition if the graph is too large
– Storage
– Non-parallel computation
PowerGraph: New Abstraction
● Original Functions:
– Update
– Finalize
– Fold
– Merge
– Apply: The synchronization apply
● Introduce the GAS model:
– Gather: in, out or all neighbors
– Apply: The GAS model apply
– Scatter
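To make the GAS model concrete, here is a small PageRank-style sketch in Python. It is not PowerGraph's C++ API: the function names, the synchronous loop in `run`, the damping factor and the tolerance are all assumptions chosen for illustration. Gather reads in-neighbors, Apply updates the vertex, and Scatter activates out-neighbors when the value changed.

```python
from collections import defaultdict

# Hypothetical directed graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_nbrs, out_nbrs = defaultdict(list), defaultdict(list)
for src, dst in edges:
    out_nbrs[src].append(dst)
    in_nbrs[dst].append(src)
vertices = sorted(set(in_nbrs) | set(out_nbrs))

def gather(rank, src):
    """Gather: contribution of one in-neighbor."""
    return rank[src] / len(out_nbrs[src])

def combine(a, b):
    """Commutative and associative sum of gathered values."""
    return a + b

def apply(acc, damping=0.85):
    """Apply: compute the new vertex value from the gathered sum."""
    return (1.0 - damping) + damping * acc

def scatter(vertex, old, new, tol=1e-9):
    """Scatter: activate out-neighbors only if the value changed."""
    return out_nbrs[vertex] if abs(new - old) > tol else []

def run(max_steps=500):
    rank = {v: 1.0 for v in vertices}
    active = set(vertices)
    for _ in range(max_steps):
        if not active:
            break
        new_rank, next_active = dict(rank), set()
        for v in active:
            acc = 0.0
            for src in in_nbrs[v]:
                acc = combine(acc, gather(rank, src))
            new_rank[v] = apply(acc)
            next_active.update(scatter(v, rank[v], new_rank[v]))
        rank, active = new_rank, next_active
    return rank

ranks = run()
```

In this graph vertex `c` receives contributions from both `a` and `b`, so it ends with the highest rank.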
PowerGraph: Gather
[Figure: gather phase on Worker 1 and Worker 2]
PowerGraph: Apply
[Figure: apply phase on Worker 1 and Worker 2]
PowerGraph: Scatter
[Figure: scatter phase on Worker 1 and Worker 2]
PowerGraph: Vertex Cut
[Figure: example graph on vertices A to I]
PowerGraph: Vertex Cut
[Figure: vertex cut of the example graph on vertices A to I]
PowerGraph: Vertex Cut (Greedy)
[Figure: greedy vertex cut of the example graph on vertices A to I]
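A simplified sketch of greedy edge placement for a vertex cut, in Python. It is an assumption-laden illustration, not PowerGraph's implementation: an edge goes to a worker that already holds replicas of both endpoints if one exists, otherwise to a worker holding either endpoint, otherwise to the least loaded worker, with ties broken by load. The function names and the round-robin baseline are hypothetical.

```python
from collections import defaultdict

def greedy_vertex_cut(edges, num_workers):
    """Assign each edge to a worker, reusing existing vertex replicas."""
    replicas = defaultdict(set)  # vertex -> workers holding a replica
    load = [0] * num_workers     # edges placed on each worker
    assignment = []
    for u, v in edges:
        shared = replicas[u] & replicas[v]
        if shared:
            candidates = shared
        elif replicas[u] or replicas[v]:
            candidates = replicas[u] | replicas[v]
        else:
            candidates = range(num_workers)
        worker = min(candidates, key=lambda w: (load[w], w))
        assignment.append(worker)
        load[worker] += 1
        replicas[u].add(worker)
        replicas[v].add(worker)
    return assignment

def replication_factor(edges, assignment):
    """Average number of workers that hold a replica of a vertex."""
    replicas = defaultdict(set)
    for (u, v), worker in zip(edges, assignment):
        replicas[u].add(worker)
        replicas[v].add(worker)
    return sum(len(workers) for workers in replicas.values()) / len(replicas)

# Hypothetical input: two disjoint triangles, two workers.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f")]
greedy = greedy_vertex_cut(edges, 2)
round_robin = [i % 2 for i in range(len(edges))]
greedy_rf = replication_factor(edges, greedy)
round_robin_rf = replication_factor(edges, round_robin)
```

On this input the greedy placement keeps each triangle on one worker, so no vertex is replicated, while the round-robin placement spreads most vertices over both workers. The sketch considers load only when breaking ties, so a single high-degree vertex could still pull many edges onto one worker.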
PowerGraph: Experiment
PowerGraph: Discussion
● Isn't it similar to the Pregel model?
– Partially process the vertex if a message exists
● The GAS model assumes that gathered values are combined with a commutative and associative operation. What if the computation is not commutative?
– Sum up the message values in a specific order to get the same floating point rounding error.
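The rounding issue is easy to reproduce in Python: floating-point addition is not associative, so the grouping of the additions changes the result. Summing in a fixed (here: sorted) order is one possible way to get the same rounding on every run; the helper name `ordered_sum` is hypothetical.

```python
import itertools

messages = [0.1, 0.2, 0.3]

# The same three values, added with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

def ordered_sum(values):
    """Sum the values in sorted order so arrival order does not matter."""
    total = 0.0
    for value in sorted(values):
        total += value
    return total

# Every arrival order of the messages yields the same sum.
results = {ordered_sum(p) for p in itertools.permutations(messages)}
```

Sorting costs extra time and requires buffering the messages, which is the price paid for a reproducible result.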
PowerGraph and Mizan
● In Mizan we use partial replication:
[Figure: vertices a to g on workers W0 and W1, shown with a replica a' of vertex a; compute phase and communication phase]
GraphChi: Introduction
● Asynchronous, disk-based version of GraphLab
● Uses parallel sliding windows
– Very small number of non-sequential disk accesses
● Support for graph updates
– Based on Kineograph, a distributed system that processes a continuous inflow of graph updates while simultaneously running advanced graph mining algorithms
GraphChi: Graph Constraints
● The graph does not fit in memory
● A vertex, its edges and their values fit in memory
GraphChi: Disk storage
● Compressed Sparse Row (CSR):
– Compressed adjacency list with indexes of the edges
– Fast access to the out-edges of a vertex
● Compressed Sparse Column (CSC):
– CSR for the transpose graph
– Fast access to the in-edges of a vertex
● Shard: stores the edges' data
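A minimal Python sketch of the two layouts, assuming integer vertex ids starting at 0. The names `build_csr`, `build_csc`, `neighbors`, `row_ptr`, `col_idx` and `edge_id` are illustrative, not GraphChi identifiers; `edge_id` stands in for the index that points from the adjacency structure to the separately stored edge data.

```python
def build_csr(num_vertices, edges):
    """Return (row_ptr, col_idx, edge_id) for a directed edge list."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    for v in range(num_vertices):          # prefix sum over out-degrees
        row_ptr[v + 1] += row_ptr[v]
    col_idx = [0] * len(edges)
    edge_id = [0] * len(edges)
    cursor = row_ptr[:-1]                  # next free slot per source vertex
    for eid, (src, dst) in enumerate(edges):
        pos = cursor[src]
        col_idx[pos] = dst
        edge_id[pos] = eid                 # index into the edge data
        cursor[src] += 1
    return row_ptr, col_idx, edge_id

def build_csc(num_vertices, edges):
    """CSC is CSR built on the transpose graph."""
    return build_csr(num_vertices, [(dst, src) for src, dst in edges])

def neighbors(row_ptr, col_idx, vertex):
    """Contiguous slice holding the neighbors of one vertex."""
    return col_idx[row_ptr[vertex]:row_ptr[vertex + 1]]

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
csr_ptr, csr_idx, csr_eid = build_csr(3, edges)
csc_ptr, csc_idx, csc_eid = build_csc(3, edges)
out_of_0 = neighbors(csr_ptr, csr_idx, 0)
into_2 = neighbors(csc_ptr, csc_idx, 2)
```

Both lookups are a single slice, which is why CSR gives fast access to out-edges and CSC to in-edges.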
GraphChi: Loading the graph
● The vertices of the input graph are split into P disjoint intervals, chosen to balance the number of edges; each interval is associated with a shard
● A shard contains the data of the edges of its interval
● The subgraph of an interval is constructed while that interval is read
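The loading step can be sketched as follows. Assumptions: vertex ids are integers starting at 0, a shard holds the edges whose destination lies in its interval, sorted by source (as described in the published GraphChi design), and the balancing rule in `split_intervals` is a simple greedy heuristic chosen for this example. Both function names are hypothetical.

```python
def split_intervals(in_degree, num_intervals):
    """Cut vertex ids into contiguous intervals with similar in-edge counts."""
    target = sum(in_degree) / num_intervals
    intervals, start, acc = [], 0, 0
    for v, degree in enumerate(in_degree):
        acc += degree
        if acc >= target and len(intervals) < num_intervals - 1:
            intervals.append((start, v))   # inclusive bounds
            start, acc = v + 1, 0
    if start < len(in_degree):
        intervals.append((start, len(in_degree) - 1))
    return intervals

def build_shards(edges, intervals):
    """One shard per interval: its in-edges, sorted by source vertex."""
    return [sorted(e for e in edges if lo <= e[1] <= hi)
            for lo, hi in intervals]

# Hypothetical input graph with 6 vertices.
edges = [(0, 1), (2, 1), (3, 1), (1, 2), (4, 3), (5, 3), (0, 4), (1, 5)]
in_degree = [0] * 6
for _, dst in edges:
    in_degree[dst] += 1

intervals = split_intervals(in_degree, 3)
shards = build_shards(edges, intervals)
```

Because the split balances edges rather than vertices, an interval that contains a high-degree vertex covers fewer vertex ids.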
GraphChi: Parallel Sliding Windows
● The vertices of each interval are processed in parallel
● P sequential disk accesses are required to process each interval
● The length of the intervals varies with the graph's degree distribution
● P * P disk accesses are required for one superstep
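The access count can be checked with a small in-memory simulation. Assumptions: each shard holds the in-edges of its interval sorted by source, one "read" stands for one sequential disk access, and the shards are plain Python lists rather than files. The function name `psw_superstep` and the data are hypothetical.

```python
# Hypothetical intervals (inclusive vertex id ranges) and their shards.
intervals = [(0, 1), (2, 3), (4, 5)]
shards = [
    [(0, 1), (2, 1), (3, 1)],   # in-edges of interval (0, 1)
    [(1, 2), (4, 3), (5, 3)],   # in-edges of interval (2, 3)
    [(0, 4), (1, 5)],           # in-edges of interval (4, 5)
]

def psw_superstep(intervals, shards):
    """Visit every interval once and count the sequential reads."""
    reads = 0
    subgraphs = []
    for i, (lo, hi) in enumerate(intervals):
        in_edges = list(shards[i])          # memory shard, read in full
        reads += 1
        out_edges = []
        for j, shard in enumerate(shards):
            if j == i:
                continue
            # Sliding window: the block of this shard whose sources fall in
            # the interval. It is contiguous because the shard is sorted.
            window = [e for e in shard if lo <= e[0] <= hi]
            out_edges.extend(window)
            reads += 1
        subgraphs.append((in_edges, out_edges))
        # Vertex updates for the interval would run here.
    return reads, subgraphs

reads, subgraphs = psw_superstep(intervals, shards)
```

With P = 3 the loop performs one full shard read plus P - 1 window reads per interval, that is P reads per interval and P * P per superstep.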
GraphChi: Example
Intervals: (1,2) (3,4) (5,6)
[Figure: executing interval (1,2)]
GraphChi: Example
Intervals: (1,2) (3,4) (5,6)
[Figure: executing interval (3,4)]
GraphChi: Evolving Graphs
● An added edge is reflected in the intervals and shards when they are read
● A deleted edge is ignored
● Edge additions and deletions are handled after the current interval has been processed
GraphChi: Preprocessing
Thank you
thegraphsblog.wordpress.com/
The Blog wants YOU