1
Fast Failure Recovery in Distributed Graph Processing Systems
Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor
2
Graph analytics
• Emergence of large graphs
  – The web, social networks, spatial networks, …
• Increasing demand for querying large graphs
  – PageRank, reverse web link analysis over the web graph
  – Influence analysis in social networks
  – Traffic analysis, route recommendation over spatial graphs
3
Distributed graph processing
• MapReduce-like systems
• Pregel-like systems
• GraphLab-related systems
• Others
4
Failures of compute nodes
• Increasing graph size → more compute nodes → increase in the number of failed nodes
[Figure: failure probability vs. # of compute nodes (1–10,000), when the avg. failure time of a compute node is ~200 hours]
• Failure rate
  – # of failures per unit of time
  – 1/200 (hours)
• Exponential failure probability
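The exponential failure model can be made concrete with a short sketch. Assuming independent node failures with the ~200-hour mean failure time from the slide (the helper name `job_failure_probability` is illustrative):

```python
import math

def job_failure_probability(num_nodes, job_hours, mean_failure_hours=200.0):
    """Probability that at least one of num_nodes compute nodes fails during
    a job_hours-long job, assuming independent exponential failure times
    with the given mean (failure rate = 1/200 per hour per node)."""
    rate = 1.0 / mean_failure_hours
    return 1.0 - math.exp(-num_nodes * job_hours * rate)

# Even a one-hour job sees frequent failures once the cluster is large:
for n in (1, 10, 100, 1000):
    print(n, round(job_failure_probability(n, 1.0), 3))
```

This reproduces the trend in the figure: the failure probability climbs steeply with the number of compute nodes even though each individual node is fairly reliable.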
5
Outline
• Motivation & background
• Failure recovery problem
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
6
Pregel-like distributed graph processing systems
• Graph model
  – G=(V,E)
  – P: partitions
• Computation model
  – A set of supersteps
  – Invoke compute function for each active vertex
  – Each vertex can
    • Receive and process messages
    • Send messages to other vertices
    • Modify its value, state (active/inactive), and its outgoing edges
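The computation model above can be sketched as a toy in-memory loop (this is not the Pregel/Giraph API; the compute logic here is min-label propagation for connected components, chosen only as a simple example):

```python
def superstep(vertices, edges, inbox):
    """One Pregel-style superstep: each active vertex (or one that received
    mail) processes its messages, updates its value, sends messages to its
    neighbours, and votes to halt. Messages reactivate halted vertices."""
    outbox = {}
    for vid, v in vertices.items():
        msgs = inbox.get(vid, [])
        if not v["active"] and not msgs:
            continue                              # halted and no mail: skip
        new_label = min([v["value"]] + msgs)      # compute(): min-label CC
        if new_label < v["value"] or v["active"]:
            v["value"] = new_label
            for nbr in edges.get(vid, []):        # send along outgoing edges
                outbox.setdefault(nbr, []).append(new_label)
        v["active"] = False                       # vote to halt
    return outbox

def run(vertices, edges):
    """Drive supersteps until every vertex halted and no mail is pending."""
    inbox, steps = {}, 0
    while inbox or any(v["active"] for v in vertices.values()):
        inbox = superstep(vertices, edges, inbox)
        steps += 1
    return steps

vertices = {i: {"value": i, "active": True} for i in range(4)}
edges = {0: [1], 1: [0, 2], 2: [1], 3: []}        # a path 0-1-2, isolated 3
run(vertices, edges)
print({i: v["value"] for i, v in vertices.items()})  # {0: 0, 1: 0, 2: 0, 3: 3}
```

The termination rule matches the slide: a job ends when every vertex has voted to halt and no messages are in flight.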
[Figure: example graph with vertices A–J distributed over compute nodes N1 and N2, shown both as individual vertices and as subgraphs partitioned into P1–P5]
7
Failure recovery problem
• Running example
  – All the vertices compute and send messages to all neighbors in all the supersteps
  – N1 fails when the job executes in superstep 12
  – Two states: Sf records the superstep each vertex has completed when the failure occurs; Sf* records the superstep each vertex must have completed once the failure is recovered
• Problem statement
  – For a failure F(Nf, sf), recover vertex states from Sf to Sf*
[Figure: the running example graph, vertices A–F on N1 and G–J on N2]
Sf: A–F: 10; G–J: 12
Sf*: A–J: 12
8
Challenging issues
• Cascading failures
  – New failures may occur during the recovery phase
  – How to handle all the cascading failures, if any?
    • Existing solution: treat each cascading failure as an individual failure and restart from the latest checkpoint
• Recovery latency
  – Re-execute lost computations to achieve state Sf*
  – Forward messages during recomputation
  – Recover cascading failures
  – How to perform recovery with minimized latency?
9
Existing recovery mechanisms
• Checkpoint-based recovery
  – During normal execution
    • all the compute nodes flush their graph-related information to a reliable storage at the beginning of every checkpointing superstep (e.g., C+1, 2C+1, …, nC+1)
  – During recovery
    • let c+1 be the latest checkpointing superstep
    • use healthy nodes to replace failed ones; all the compute nodes roll back to the latest checkpoint and re-execute the lost computations since then (i.e., from superstep c+1 to sf)
Simple to implement! Can handle cascading failures!
Replay lost computations over the whole graph!
Ignore partially recovered workload!
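The rollback logic can be sketched in a few lines (a simplified model assuming checkpoints at supersteps 1, C+1, 2C+1, …; the function names are illustrative):

```python
def latest_checkpoint(failed_superstep, interval):
    """Latest checkpointing superstep at or before the failure, assuming
    checkpoints are taken at supersteps 1, C+1, 2C+1, ..."""
    return ((failed_superstep - 1) // interval) * interval + 1

def supersteps_to_redo(failed_superstep, interval):
    """Checkpoint-based recovery: ALL nodes roll back to the latest
    checkpoint and re-execute every superstep up to the failure, over
    the whole graph -- even the partitions that never failed."""
    c = latest_checkpoint(failed_superstep, interval)
    return list(range(c, failed_superstep + 1))

print(supersteps_to_redo(12, 5))   # checkpoint at superstep 11 -> [11, 12]
```

This makes the two drawbacks concrete: the redo range covers every partition, and any work a previous recovery attempt already finished is thrown away.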
10
Existing recovery mechanisms
• Checkpoint + log
  – During normal execution:
    • besides checkpointing, every compute node logs its outgoing messages at the end of each superstep
  – During recovery
    • use healthy nodes (replacements) to replace the failed ones
    • Replacements:
      – redo the lost computation and forward messages among each other
      – forward messages to all the nodes in superstep sf
    • Healthy nodes:
      – hold their original partitions and redo the lost computation by forwarding locally logged messages to failed vertices
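One recovery superstep under checkpoint + log might look like the following toy sketch (the log layout and the `recompute` callback are illustrative assumptions, not the actual system's interfaces):

```python
def recovery_superstep(step, failed_vertices, logged, recompute):
    """One checkpoint+log recovery superstep. Replacements redo the lost
    computation for failed vertices; healthy nodes merely replay the
    outgoing messages they logged during normal execution.
    logged[step][(u, v)] is the message u sent v in `step` originally;
    recompute(u, step) redoes u's compute and returns {target: message}."""
    inbox = {}
    for u in failed_vertices:                        # replacements recompute
        for v, msg in recompute(u, step).items():
            if v in failed_vertices:                 # only failed vertices need mail
                inbox.setdefault(v, []).append(msg)
    for (u, v), msg in logged.get(step, {}).items():
        if u not in failed_vertices and v in failed_vertices:
            inbox.setdefault(v, []).append(msg)      # healthy nodes replay logs
    return inbox

failed = ["A", "B"]                                  # toy subset of N1's vertices
logged = {11: {("G", "A"): "g->a", ("G", "H"): "g->h"}}
redo = lambda u, step: {"B": "a->b"} if u == "A" else {"A": "b->a"}
print(recovery_superstep(11, failed, logged, redo))
# {'B': ['a->b'], 'A': ['b->a', 'g->a']}
```

The sketch also exposes the parallelism limit: all real recomputation happens in the first loop, executed only by the replacements.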
11
Existing recovery mechanisms
• Checkpoint + log
  – Suppose the latest checkpoint is made at the beginning of superstep 11; N1 (A–F) fails at superstep 12
  – During recovery
    • superstep 11: A–F perform computation and send messages to each other; G–J send messages to A–F
    • superstep 12: A–F perform computation and send messages along their outgoing edges; G–J send messages to A–F
[Figure: the running example graph, vertices A–F on N1 and G–J on N2]
Less computation and communication cost!
Overhead of local logging! (negligible)
Limited parallelism: replacements handle all the lost computation!
12
Outline
• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
13
Our solution
• Partition-based failure recovery
  – Step 1: generate a reassignment for the failed partitions
  – Step 2: recompute failed partitions
    • Every node is informed of the reassignment
    • Every node loads its newly assigned failed partitions from the latest checkpoint; redoes lost computations
  – Step 3: exchange partitions
    • Re-balance workload after recovery
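The three steps might be wired together as in this sketch (the round-robin assignment is a placeholder for the actual reassignment computation, and `redo`/`rebalance` are illustrative callbacks):

```python
def partition_based_recovery(failed_partitions, healthy_nodes, redo, rebalance):
    """Sketch of partition-based failure recovery.
    Step 1: reassign each failed partition to a healthy node (round robin
            here; the real system searches for a low-latency reassignment).
    Step 2: each node loads its newly assigned partitions from the latest
            checkpoint and redoes the lost supersteps, in parallel.
    Step 3: exchange partitions to rebalance the post-recovery workload."""
    assignment = {p: healthy_nodes[i % len(healthy_nodes)]
                  for i, p in enumerate(failed_partitions)}   # step 1
    for partition, node in assignment.items():                # step 2
        redo(node, partition)
    rebalance(assignment)                                     # step 3
    return assignment

work = []
a = partition_based_recovery(["P1", "P2", "P3"], ["N2", "N3"],
                             redo=lambda n, p: work.append((n, p)),
                             rebalance=lambda asg: None)
print(a)       # {'P1': 'N2', 'P2': 'N3', 'P3': 'N2'}
print(work)    # [('N2', 'P1'), ('N3', 'P2'), ('N2', 'P3')]
```

Because the failed partitions are spread over many nodes, the lost computation is redone in parallel rather than funneled through a few replacements.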
14
Recompute failed partitions
• In superstep i, every compute node iterates through its active vertices. For each vertex u, we:
  – perform computation for vertex u only if:
    • its state after the failure satisfies Sf(u) < i
  – forward a message from u to v only if:
    • v is a failed vertex with Sf(v) < i+1; or,
    • i = sf, so that healthy vertices receive the messages needed to resume normal execution
• Intuition: v will need this message to perform computation in superstep i+1
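In code, the two checks might look like this (assuming `Sf` maps each vertex to the last superstep it completed before the failure; the predicate names are illustrative). A single state comparison subsumes both forwarding cases, since healthy vertices sit at state sf:

```python
def should_compute(u, i, Sf):
    """Recompute u in recovery superstep i only if u has not yet
    completed superstep i."""
    return Sf[u] < i

def should_forward(v, i, Sf):
    """Forward a message to v only if v will need it to perform
    computation in superstep i+1."""
    return Sf[v] < i + 1

# Running example: N1's vertices (A-F) roll back to state 10, the healthy
# vertices (G-J) are at state 12, and recovery redoes supersteps 11 and 12.
Sf = {"A": 10, "G": 12}
print(should_compute("A", 11, Sf))   # True  (A must redo superstep 11)
print(should_compute("G", 11, Sf))   # False (G already completed it)
print(should_forward("A", 11, Sf))   # True  (A needs mail for superstep 12)
print(should_forward("G", 11, Sf))   # False (G only needs mail once i = 12)
```
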
15
Example
• N1 fails in superstep 12
  – Redo supersteps 11 and 12
[Figure: (1) reassignment of N1's failed partitions; (2) in-parallel recomputation across both nodes]
Less computation and communication cost!
16
Handling cascading failures
• N1 fails in superstep 12
• N2 fails in superstep 11 during recovery
[Figure: (1) reassignment; (2) recomputation after the cascading failure]
No need to recover A and B since they have already been recovered!
The same recovery algorithm can be used to recover any failure!
17
Reassignment generation
• When a failure occurs, how do we compute a good reassignment for the failed partitions?
  – Minimize the recovery time
• Calculating the recovery time is complicated because it depends on:
  – The reassignment for the failure
  – Cascading failures
  – The reassignment for each cascading failure
No knowledge about cascading failures!
18
Our insight
• When a failure occurs (it can be a cascading failure), we prefer a reassignment that benefits the remaining recovery process, considering all the cascading failures that have occurred so far
• We collect the state S after the failure and measure the minimum time Tlow to achieve Sf*
  – Tlow provides a lower bound on the remaining recovery time
19
Estimation of Tlow
• Ignore downtime (similar across different recovery methods)
• To estimate computation and communication time, we need to know:
  – Which vertices will perform computation
  – Which messages will be forwarded (across different nodes)
• Maintain the relevant statistics in the checkpoint
20
Reassignment generation problem
• Given a failure, find a reassignment that minimizes Tlow
  – Problem complexity: NP-hard
  – Different from the graph partitioning problem
    • Assignment, not partitioning
    • Not a static graph, but depends on runtime vertex states and messages
    • No “balance” requirement
• Greedy algorithm
  – Start with a random reassignment for the failed partitions and reach a better one (with smaller Tlow) by “moving” the failed partitions
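The greedy search can be sketched as follows. The cost function `t_low` here is an illustrative stand-in for the lower-bound estimate (the demo simply uses the maximum per-node partition count), and a deterministic seed assignment replaces the random start so the example is reproducible:

```python
import collections

def greedy_reassignment(failed_partitions, nodes, t_low, seed_assignment):
    """Start from an initial reassignment and repeatedly move one failed
    partition to whichever node most reduces t_low, until no move helps."""
    assignment = dict(seed_assignment)
    improved = True
    while improved:
        improved = False
        for p in failed_partitions:
            best = min(nodes, key=lambda n: t_low({**assignment, p: n}))
            if t_low({**assignment, p: best}) < t_low(assignment):
                assignment[p] = best
                improved = True
    return assignment

# Demo cost: maximum number of reassigned partitions on any single node.
t_low = lambda asg: max(collections.Counter(asg.values()).values())

start = {"P1": "N2", "P2": "N2", "P3": "N2"}      # everything dumped on N2
print(greedy_reassignment(["P1", "P2", "P3"], ["N2", "N3", "N4"], t_low, start))
# {'P1': 'N3', 'P2': 'N4', 'P3': 'N2'}
```

Each move only ever lowers the estimated recovery time, so the loop terminates at a local optimum of `t_low`.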
21
Outline
• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
22
Experimental evaluation
• Experiment settings
  – In-house cluster with 72 nodes, each with one Intel X3430 2.4GHz processor, 8GB of memory, and two 500GB SATA hard disks, running Hadoop 0.20.203.0 and Giraph 1.0.0
• Comparisons
  – PBR (our proposed solution), CBR (checkpoint-based recovery)
• Benchmark tasks
  – K-means
  – Semi-clustering
  – PageRank
• Datasets
  – Forest
  – LiveJournal
  – Friendster
23
PageRank results
[Figures: logging overhead; single-node failure]
24
PageRank results
[Figures: multiple-node failure; cascading failure]
25
PageRank results (communication cost)
[Figures: multiple-node failure; cascading failure]
26
Conclusions
• Develop a novel partition-based recovery method to parallelize failure recovery workload for distributed graph processing
• Address challenges in failure recovery
  – Handle cascading failures
  – Reduce recovery latency
• Reassignment generation problem
  – Greedy strategy
27
Thank You!
Q & A