1
Fast Failure Recovery in Distributed Graph Processing Systems
Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor
2
Graph analytics
• Emergence of large graphs
  – The web, social networks, spatial networks, …
• Increasing demand for querying large graphs
  – PageRank, reverse web link analysis over the web graph
  – Influence analysis in social networks
  – Traffic analysis, route recommendation over spatial graphs
3
Distributed graph processing
• MapReduce-like systems
• Pregel-like systems
• GraphLab-related systems
• Others
4
Failures of compute nodes
• Increasing graph size → more compute nodes → increase in the number of failed nodes
[Figure: failure probability vs. # of compute nodes (1–10,000), when the avg. failure time of a compute node is ~200 hours]
• Failure rate
  – # of failures per unit of time
  – 1/200 (hours)
• Exponential failure probability
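The exponential failure model can be made concrete with a short sketch. Assuming independent node failures with the ~200-hour mean failure time from the slide (the helper name `job_failure_probability` is illustrative):

```python
import math

def job_failure_probability(num_nodes, job_hours, mean_failure_hours=200.0):
    """Probability that at least one of num_nodes compute nodes fails during
    a job_hours-long job, assuming independent exponential failure times
    with the given mean (failure rate = 1/200 per hour per node)."""
    rate = 1.0 / mean_failure_hours
    return 1.0 - math.exp(-num_nodes * job_hours * rate)

# Even a one-hour job sees frequent failures once the cluster is large:
for n in (1, 10, 100, 1000):
    print(n, round(job_failure_probability(n, 1.0), 3))
```

This reproduces the trend in the figure: the failure probability climbs steeply with the number of compute nodes even though each individual node is fairly reliable.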
5
Outline
• Motivation & background
• Failure recovery problem
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
6
Pregel-like distributed graph processing systems
• Graph model
  – G=(V,E)
  – P: partitions
• Computation model
  – A set of supersteps
  – Invoke compute function for each active vertex
  – Each vertex can
    • Receive and process messages
    • Send messages to other vertices
    • Modify its value, state (active/inactive), and its outgoing edges
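The computation model above can be sketched as a toy in-memory loop (this is not the Pregel/Giraph API; the compute logic here is min-label propagation for connected components, chosen only as a simple example):

```python
def superstep(vertices, edges, inbox):
    """One Pregel-style superstep: each active vertex (or one that received
    mail) processes its messages, updates its value, sends messages to its
    neighbours, and votes to halt. Messages reactivate halted vertices."""
    outbox = {}
    for vid, v in vertices.items():
        msgs = inbox.get(vid, [])
        if not v["active"] and not msgs:
            continue                              # halted and no mail: skip
        new_label = min([v["value"]] + msgs)      # compute(): min-label CC
        if new_label < v["value"] or v["active"]:
            v["value"] = new_label
            for nbr in edges.get(vid, []):        # send along outgoing edges
                outbox.setdefault(nbr, []).append(new_label)
        v["active"] = False                       # vote to halt
    return outbox

def run(vertices, edges):
    """Drive supersteps until every vertex halted and no mail is pending."""
    inbox, steps = {}, 0
    while inbox or any(v["active"] for v in vertices.values()):
        inbox = superstep(vertices, edges, inbox)
        steps += 1
    return steps

vertices = {i: {"value": i, "active": True} for i in range(4)}
edges = {0: [1], 1: [0, 2], 2: [1], 3: []}        # a path 0-1-2, isolated 3
run(vertices, edges)
print({i: v["value"] for i, v in vertices.items()})  # {0: 0, 1: 0, 2: 0, 3: 3}
```

The termination rule matches the slide: a job ends when every vertex has voted to halt and no messages are in flight.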
[Figure: example graph with vertices A–J distributed over compute nodes N1 and N2, shown both as individual vertices and as subgraphs partitioned into P1–P5]
7
Failure recovery problem
• Running example
  – All the vertices compute and send messages to all neighbors in all the supersteps
  – N1 fails when the job executes in superstep 12
  – Two states: Sf records the superstep each vertex has completed when the failure occurs; Sf* records the superstep each vertex must have completed once the failure is recovered
• Problem statement
  – For a failure F(Nf, sf), recover vertex states from Sf to Sf*
[Figure: the running example graph, vertices A–F on N1 and G–J on N2]
Sf: A–F: 10; G–J: 12
Sf*: A–J: 12
8
Challenging issues
• Cascading failures
  – New failures may occur during the recovery phase
  – How to handle all the cascading failures, if any?
    • Existing solution: treat each cascading failure as an individual failure and restart from the latest checkpoint
• Recovery latency
  – Re-execute lost computations to achieve state Sf*
  – Forward messages during recomputation
  – Recover cascading failures
  – How to perform recovery with minimized latency?
9
Existing recovery mechanisms
• Checkpoint-based recovery
  – During normal execution
    • all the compute nodes flush their graph-related information to a reliable storage at the beginning of every checkpointing superstep (e.g., C+1, 2C+1, …, nC+1)
  – During recovery
    • let c+1 be the latest checkpointing superstep
    • use healthy nodes to replace failed ones; all the compute nodes roll back to the latest checkpoint and re-execute the lost computations since then (i.e., from superstep c+1 to sf)
Simple to implement! Can handle cascading failures!
Replay lost computations over the whole graph!
Ignore partially recovered workload!
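The rollback logic can be sketched in a few lines (a simplified model assuming checkpoints at supersteps 1, C+1, 2C+1, …; the function names are illustrative):

```python
def latest_checkpoint(failed_superstep, interval):
    """Latest checkpointing superstep at or before the failure, assuming
    checkpoints are taken at supersteps 1, C+1, 2C+1, ..."""
    return ((failed_superstep - 1) // interval) * interval + 1

def supersteps_to_redo(failed_superstep, interval):
    """Checkpoint-based recovery: ALL nodes roll back to the latest
    checkpoint and re-execute every superstep up to the failure, over
    the whole graph -- even the partitions that never failed."""
    c = latest_checkpoint(failed_superstep, interval)
    return list(range(c, failed_superstep + 1))

print(supersteps_to_redo(12, 5))   # checkpoint at superstep 11 -> [11, 12]
```

This makes the two drawbacks concrete: the redo range covers every partition, and any work a previous recovery attempt already finished is thrown away.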
10
Existing recovery mechanisms
• Checkpoint + log
  – During normal execution:
    • besides checkpointing, every compute node logs its outgoing messages at the end of each superstep
  – During recovery
    • use healthy nodes (replacements) to replace the failed ones
    • Replacements:
      – redo the lost computation and forward messages among each other
      – forward messages to all the nodes in superstep sf
    • Healthy nodes:
      – hold their original partitions and redo the lost computation by forwarding locally logged messages to failed vertices
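One recovery superstep under checkpoint + log might look like the following toy sketch (the log layout and the `recompute` callback are illustrative assumptions, not the actual system's interfaces):

```python
def recovery_superstep(step, failed_vertices, logged, recompute):
    """One checkpoint+log recovery superstep. Replacements redo the lost
    computation for failed vertices; healthy nodes merely replay the
    outgoing messages they logged during normal execution.
    logged[step][(u, v)] is the message u sent v in `step` originally;
    recompute(u, step) redoes u's compute and returns {target: message}."""
    inbox = {}
    for u in failed_vertices:                        # replacements recompute
        for v, msg in recompute(u, step).items():
            if v in failed_vertices:                 # only failed vertices need mail
                inbox.setdefault(v, []).append(msg)
    for (u, v), msg in logged.get(step, {}).items():
        if u not in failed_vertices and v in failed_vertices:
            inbox.setdefault(v, []).append(msg)      # healthy nodes replay logs
    return inbox

failed = ["A", "B"]                                  # toy subset of N1's vertices
logged = {11: {("G", "A"): "g->a", ("G", "H"): "g->h"}}
redo = lambda u, step: {"B": "a->b"} if u == "A" else {"A": "b->a"}
print(recovery_superstep(11, failed, logged, redo))
# {'B': ['a->b'], 'A': ['b->a', 'g->a']}
```

The sketch also exposes the parallelism limit: all real recomputation happens in the first loop, executed only by the replacements.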
11
Existing recovery mechanisms
• Checkpoint + log
  – Suppose the latest checkpoint is made at the beginning of superstep 11; N1 (A–F) fails at superstep 12
  – During recovery
    • superstep 11: A–F perform computation and send messages to each other; G–J send messages to A–F
    • superstep 12: A–F perform computation and send messages along their outgoing edges; G–J send messages to A–F
[Figure: the running example graph, vertices A–F on N1 and G–J on N2]
Less computation and communication cost!
Overhead of local logging! (negligible)
Limited parallelism: replacements handle all the lost computation!
12
Outline
• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
13
Our solution
• Partition-based failure recovery
  – Step 1: generate a reassignment for the failed partitions
  – Step 2: recompute failed partitions
    • Every node is informed of the reassignment
    • Every node loads its newly assigned failed partitions from the latest checkpoint; redoes lost computations
  – Step 3: exchange partitions
    • Re-balance workload after recovery
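The three steps might be wired together as in this sketch (the round-robin assignment is a placeholder for the actual reassignment computation, and `redo`/`rebalance` are illustrative callbacks):

```python
def partition_based_recovery(failed_partitions, healthy_nodes, redo, rebalance):
    """Sketch of partition-based failure recovery.
    Step 1: reassign each failed partition to a healthy node (round robin
            here; the real system searches for a low-latency reassignment).
    Step 2: each node loads its newly assigned partitions from the latest
            checkpoint and redoes the lost supersteps, in parallel.
    Step 3: exchange partitions to rebalance the post-recovery workload."""
    assignment = {p: healthy_nodes[i % len(healthy_nodes)]
                  for i, p in enumerate(failed_partitions)}   # step 1
    for partition, node in assignment.items():                # step 2
        redo(node, partition)
    rebalance(assignment)                                     # step 3
    return assignment

work = []
a = partition_based_recovery(["P1", "P2", "P3"], ["N2", "N3"],
                             redo=lambda n, p: work.append((n, p)),
                             rebalance=lambda asg: None)
print(a)       # {'P1': 'N2', 'P2': 'N3', 'P3': 'N2'}
print(work)    # [('N2', 'P1'), ('N3', 'P2'), ('N2', 'P3')]
```

Because the failed partitions are spread over many nodes, the lost computation is redone in parallel rather than funneled through a few replacements.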
14
Recompute failed partitions
• In superstep i, every compute node iterates through its active vertices. For each vertex u, we:
  – perform computation for vertex u only if:
    • its state after the failure satisfies Sf(u) < i
  – forward a message from u to v only if:
    • v is a failed vertex with Sf(v) < i+1; or,
    • i = sf, so that healthy vertices receive the messages needed to resume normal execution
• Intuition: v will need this message to perform computation in superstep i+1
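In code, the two checks might look like this (assuming `Sf` maps each vertex to the last superstep it completed before the failure; the predicate names are illustrative). A single state comparison subsumes both forwarding cases, since healthy vertices sit at state sf:

```python
def should_compute(u, i, Sf):
    """Recompute u in recovery superstep i only if u has not yet
    completed superstep i."""
    return Sf[u] < i

def should_forward(v, i, Sf):
    """Forward a message to v only if v will need it to perform
    computation in superstep i+1."""
    return Sf[v] < i + 1

# Running example: N1's vertices (A-F) roll back to state 10, the healthy
# vertices (G-J) are at state 12, and recovery redoes supersteps 11 and 12.
Sf = {"A": 10, "G": 12}
print(should_compute("A", 11, Sf))   # True  (A must redo superstep 11)
print(should_compute("G", 11, Sf))   # False (G already completed it)
print(should_forward("A", 11, Sf))   # True  (A needs mail for superstep 12)
print(should_forward("G", 11, Sf))   # False (G only needs mail once i = 12)
```
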
15
Example
• N1 fails in superstep 12
  – Redo supersteps 11 and 12
[Figure: (1) reassignment of N1's failed partitions; (2) in-parallel recomputation across both nodes]
Less computation and communication cost!
16
Handling cascading failures
• N1 fails in superstep 12
• N2 fails in superstep 11 during recovery
[Figure: (1) reassignment; (2) recomputation after the cascading failure]
No need to recover A and B since they have already been recovered!
The same recovery algorithm can be used to recover any failure!
17
Reassignment generation
• When a failure occurs, how do we compute a good reassignment for the failed partitions?
  – Minimize the recovery time
• Calculating the recovery time is complicated because it depends on:
  – The reassignment for the failure
  – Cascading failures
  – The reassignment for each cascading failure
No knowledge about cascading failures!
18
Our insight
• When a failure occurs (it can be a cascading failure), we prefer a reassignment that benefits the remaining recovery process, considering all the cascading failures that have occurred so far
• We collect the state S after the failure and measure the minimum time Tlow to achieve Sf*
  – Tlow provides a lower bound on the remaining recovery time
19
Estimation of Tlow
• Ignore downtime (similar across different recovery methods)
• To estimate computation and communication time, we need to know:
  – Which vertices will perform computation
  – Which messages will be forwarded (across different nodes)
• Maintain the relevant statistics in the checkpoint
20
Reassignment generation problem
• Given a failure, find a reassignment that minimizes Tlow
  – Problem complexity: NP-hard
  – Different from the graph partitioning problem
    • Assignment, not partitioning
    • Not a static graph, but depends on runtime vertex states and messages
    • No “balance” requirement
• Greedy algorithm
  – Start with a random reassignment for the failed partitions and reach a better one (with smaller Tlow) by “moving” the failed partitions
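The greedy search can be sketched as follows. The cost function `t_low` here is an illustrative stand-in for the lower-bound estimate (the demo simply uses the maximum per-node partition count), and a deterministic seed assignment replaces the random start so the example is reproducible:

```python
import collections

def greedy_reassignment(failed_partitions, nodes, t_low, seed_assignment):
    """Start from an initial reassignment and repeatedly move one failed
    partition to whichever node most reduces t_low, until no move helps."""
    assignment = dict(seed_assignment)
    improved = True
    while improved:
        improved = False
        for p in failed_partitions:
            best = min(nodes, key=lambda n: t_low({**assignment, p: n}))
            if t_low({**assignment, p: best}) < t_low(assignment):
                assignment[p] = best
                improved = True
    return assignment

# Demo cost: maximum number of reassigned partitions on any single node.
t_low = lambda asg: max(collections.Counter(asg.values()).values())

start = {"P1": "N2", "P2": "N2", "P3": "N2"}      # everything dumped on N2
print(greedy_reassignment(["P1", "P2", "P3"], ["N2", "N3", "N4"], t_low, start))
# {'P1': 'N3', 'P2': 'N4', 'P3': 'N2'}
```

Each move only ever lowers the estimated recovery time, so the loop terminates at a local optimum of `t_low`.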
21
Outline
• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions
22
Experimental evaluation
• Experiment settings
  – In-house cluster with 72 nodes, each with one Intel X3430 2.4GHz processor, 8GB of memory, and two 500GB SATA hard disks, running Hadoop 0.20.203.0 and Giraph 1.0.0
• Comparisons
  – PBR (our proposed solution), CBR (checkpoint-based recovery)
• Benchmark tasks
  – K-means
  – Semi-clustering
  – PageRank
• Datasets
  – Forest
  – LiveJournal
  – Friendster
23
PageRank results
[Figures: logging overhead; single-node failure]
24
PageRank results
[Figures: multiple-node failure; cascading failure]
25
PageRank results (communication cost)
[Figures: multiple-node failure; cascading failure]
26
Conclusions
• Develop a novel partition-based recovery method to parallelize failure recovery workload for distributed graph processing
• Address challenges in failure recovery
  – Handle cascading failures
  – Reduce recovery latency
• Reassignment generation problem
  – Greedy strategy
27
Thank You!
Q & A