
TAMC 2007, 25th May, 2007

A Distributed Algorithm of Fault

Recovery For Stateful Failover

Indranil Saha

HTS (Honeywell Technology Solutions) Research

Bangalore, India

Email: [email protected]

and

Debapriyay Mukhopadhyay

Ixia Technologies

Kolkata, India

Email: [email protected]


Presentation Outline

I will talk about

• Introduction

• System Models

• Distributed Algorithm for Automated Fault Recovery

• Formal verification of the Distributed Algorithm

• Conclusion


Introduction

• Critical business processes and mission-critical systems should provide a high degree of availability and reliability to the end users.

• Redundancy techniques are mostly used to achieve fault tolerance.

• Redundancy can be achieved by using extra copies of the system's components, which include hardware, software and network components.


Stateful and Stateless Failover

• Stateless Failover

- Occasional loss of application state information or data is tolerable.

- The system can restart without any state or data restoration after a failure.

- Any live node in the network is a promising candidate to take over the processes of any failed node.

• Stateful Failover

- Restoration of the state or data pertaining to the application is required for highly accurate recovery.

- How to distribute the state information of a node across the network is an important issue.


Related Works

• Graph-theoretic models have been extensively used to represent the processor-to-processor interconnection structure of fault-tolerant designs for specific multi-processor architectures (Kuhl80, Yang88, Sridhar91, Mukhopadhyay92, Sung00, Hung01).

• Minimum k-Hamilton graphs are widely used to meet reliability considerations for loop-type communication networks (Mukhopadhyay92, Sung00, Hung01).

• Fault-tolerant networks based on the de Bruijn graph have been proposed, which can tolerate up to $k - 2$ node faults, where the graph is regular of degree $k$ and has $k^n$ vertices for some $n$ (Sridhar91).

None of these works addresses stateful failover.


System Model

• The network consists of the set of nodes $N$ with $|N| = n$.

• Each node is labeled with a unique id from 0 to $n - 1$.

• Each node handles one process initially, and is capable of executing at most $m$ processes simultaneously.

• $P_i$ is the process that node $i$ starts executing initially, when the network becomes functional.

• Failures are of the fail-stop kind, i.e., the nodes in the network can stop operating at any point of time due to a crash.

• When a processor fails, all the links incident on that node also become non-functional.

• Up to $k$ node faults are allowed in the network.


Network Topology

Each node $i \in N$ $(0 \le i \le n - 1)$ in the network is connected to the set of nodes $P_i \subseteq N$ such that $|P_i| = l = k + x$, where $k + x$ $(\le n - 1)$ is even, and

$$P_i = \{\, j \in N : j = (i + p) \bmod n,\ -l/2 \le p \le l/2,\ p \ne 0 \,\}.$$

The underlying undirected graph modeling the network can be written as $(N, E)$, where

$$E = \bigcup_{i=0}^{n-1} \{(i, j) : j \in P_i\}.$$

The state information of processor $i$, $i \in N$, is periodically forwarded to all the nodes in the set $F_i \subseteq N$ such that $|F_i| = k$ and

$$F_i = \{\, j \in N : j = (i + p) \bmod n,\ -\lfloor k/2 \rfloor \le p \le \lceil k/2 \rceil,\ p \ne 0 \,\}.$$
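The two set definitions on this slide can be made concrete with a small sketch. The function names `neighbours` and `state_holders` are illustrative and not from the slides, and the value l = 6 for the n = 10, k = 4 example is an assumption (the smallest even l allowed for even k).

```python
def neighbours(i: int, n: int, l: int) -> set[int]:
    """P_i: the l nodes directly linked to node i (l is assumed even)."""
    half = l // 2
    return {(i + p) % n for p in range(-half, half + 1) if p != 0}


def state_holders(i: int, n: int, k: int) -> set[int]:
    """F_i: the k nodes that receive the state of processor i."""
    lo, hi = -(k // 2), (k + 1) // 2  # -floor(k/2) and ceil(k/2)
    return {(i + p) % n for p in range(lo, hi + 1) if p != 0}


n, k, l = 10, 4, 6  # n and k as in the example network; l = k + 2 is assumed
print(sorted(neighbours(9, n, l)))     # [0, 1, 2, 6, 7, 8]
print(sorted(state_holders(9, n, k)))  # [0, 1, 7, 8]
```

Since the offsets used for F_i lie inside the range used for P_i, every node that holds the state of processor i is also a direct neighbour of i.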


Connectivity

- The graph $(N, E)$ represents a regular network, since the degree of each node is $l$.

- For any $n$ and $k$, the graph $(N, E)$ corresponds to the Harary graph $H_{l,n}$, where

$$l = k + x \ge \begin{cases} k + 2, & \text{for } k \text{ even}, \\ k + 1, & \text{for } k \text{ odd}. \end{cases}$$

The network is $l$-connected with $\chi(G) \ge l\ (> k)$, where $\chi(G)$ denotes the connectivity of $G$.
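For a small instance, the regularity and connectivity claims can be checked by brute force. The sketch below is only an illustration under assumed parameters (n = 10, k = 4, l = 6); the helper names are hypothetical, and exhaustive enumeration is practical only for small n.

```python
from itertools import combinations


def build_graph(n: int, l: int) -> dict[int, set[int]]:
    """Adjacency sets of the circulant graph with offsets 1..l/2 (l even)."""
    half = l // 2
    return {i: {(i + p) % n for p in range(-half, half + 1) if p != 0}
            for i in range(n)}


def is_connected(adj: dict[int, set[int]], alive: set[int]) -> bool:
    """Depth-first search restricted to the surviving nodes."""
    if not alive:
        return True
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] & alive:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == alive


def survives_faults(n: int, l: int, k: int) -> bool:
    """True if the live nodes stay connected after any k node failures."""
    adj, nodes = build_graph(n, l), set(range(n))
    return all(is_connected(adj, nodes - set(dead))
               for dead in combinations(range(n), k))


n, k, l = 10, 4, 6
print(all(len(nbrs) == l for nbrs in build_graph(n, l).values()))  # True
print(survives_faults(n, l, k))                                    # True
```

Removing all l = 6 neighbours of one node isolates it, so the same check fails for 6 faults, which matches a connectivity of exactly l for this instance.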


Theoretical Results

Theorem 1. A. Forwarding the state information of each process to $k$ other nodes in the network ensures $k$-fault tolerance.

B. A sufficient condition to ensure $k$-fault tolerance is for each node to forward its state information to at least $k$ other nodes in the network.

Theorem 2. As long as $k \le \lfloor \frac{m-1}{m} \cdot n \rfloor$, no live node has to execute more than $m$ processes, including its own, and an algorithm that attains this under the proposed framework can also be found.

Theorem 3. To ensure that all the eligible nodes corresponding to a process can be updated about its state information at all times in one hop, the minimum number of nodes to which any node must be directly connected is $2k$ in a network with $n > 2k$ (or $2k - 1$ when $n = 2k$).
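The bound in Theorem 2 matches a simple counting argument: after k failures the n − k live nodes offer (n − k)·m process slots, and these must cover all n processes. A minimal sketch of that arithmetic follows; the function names are illustrative, not from the slides.

```python
def max_tolerable_faults(n: int, m: int) -> int:
    """floor((m - 1) / m * n), computed in exact integer arithmetic."""
    return (m - 1) * n // m


def load_is_bounded(n: int, m: int, k: int) -> bool:
    """True if k faults satisfy the condition of Theorem 2."""
    return k <= max_tolerable_faults(n, m)


print(max_tolerable_faults(10, 2))  # 5
print(load_is_bounded(10, 2, 4))    # True
print(load_is_bounded(8, 2, 3))     # True
```

Both the example network (n = 10, m = 2, k = 4) and the largest verified instance (N = 8, K = 3, M = 2) satisfy the condition.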


Network Example

Illustration of a network with n = 10, m = 2 and k = 4


Message Types

1. INFO

• In the first round, each node $i$ sends an INFO message to all the nodes in the set $F_i$.

• The message consists of the tuple $(j, F_j)$.

2. STATUS

• Starting from the second round, in each successive round every live node $i$ sends a STATUS message for every process $p_j$ that is running on it to all the live nodes in the set $F_j$.

• The message consists of the tuple $(p_j, S_{p_j})$.

• Omission of the STATUS message for a process in a round indicates the failure of the process.

3. RESOLVED

• The message is sent to all the nodes in $F_j$ by the node that has resolved the failure of process $j$.
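The three message types, and failure detection by omission, might be represented as follows. This is a sketch only: the class and field names are hypothetical, and the contents of the RESOLVED message are an assumption, since the slides do not list its fields.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Info:
    node: int                 # j
    holders: frozenset[int]   # F_j


@dataclass(frozen=True)
class Status:
    process: int              # p_j
    state: Any                # S_{p_j}, the state of the process


@dataclass(frozen=True)
class Resolved:
    process: int              # process whose failure was resolved
    resolver: int             # assumed field: the node that took it over


def missing_processes(expected: set[int], inbox: list[object]) -> set[int]:
    """Processes whose STATUS message was omitted in this round."""
    reported = {msg.process for msg in inbox if isinstance(msg, Status)}
    return expected - reported


inbox = [Status(7, "s7"), Status(8, "s8"), Resolved(2, 4)]
print(missing_processes({7, 8, 9}, inbox))  # {9}
```

A node that expects STATUS messages for processes 7, 8 and 9 but hears only about 7 and 8 treats process 9 as failed for that round.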


Preference for the Neighbours

$\mathit{pref}_{ij}$ denotes the preference of node $i$, among the nodes in $F_j$, to take over process $j$ in case of its failure.
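The slides do not state how the preferences are assigned. The sketch below shows one ordering that is consistent with the n = 10, k = 4 example in this deck (node 1 first for process 9, node 0 first and node 7 third for process 8); it is an assumption inferred from that example, not the authors' definition, and `preference_order` is a hypothetical name.

```python
def preference_order(j: int, n: int, k: int) -> list[int]:
    """Nodes of F_j from most to least preferred (assumed ordering).

    Assumption: successors of j come first, farthest one first, followed
    by predecessors, nearest one first.
    """
    lo, hi = -(k // 2), (k + 1) // 2
    offsets = list(range(hi, 0, -1)) + list(range(-1, lo - 1, -1))
    return [(j + p) % n for p in offsets]


print(preference_order(9, 10, 4))  # [1, 0, 8, 7]
print(preference_order(8, 10, 4))  # [0, 9, 7, 6]
print(preference_order(2, 10, 4))  # [4, 3, 1, 0]
```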


Distributed Algorithm: Example

Illustration of the distributed algorithm for a network with n = 10, m = 2 and k = 4

Every node is running its own process.


Node 9 is faulty


Node 1 takes over the process of node 9 after one round, as it is the highest-preference node for process 9.


Node 2 is faulty


Node 4 takes over the process of node 2 after one round, as it is the highest-preference node for process 2.


Node 8 is faulty


Node 0 takes over the process of node 8 after one round, as it is the highest-preference node for process 8.


Node 0 is faulty. The real problem begins: node 0 was running process 8 in addition to its own process 0.


Node 7 takes over the process of node 8 after 3 rounds, as it is the third-preference node for process 8.


Node 1 stops running process 9 and starts running process 0, 6 rounds after node 0's failure.

According to Theorem 1, there is at least one node available to take process 9.


Node 7 stops running process 8 and starts running process 9 after 8 rounds when node 1 stops running process 9.


Node 6 starts running process 8 after 4 rounds when node 7 stops running process 8.

No more failures are possible: k = 4 nodes (9, 2, 8 and 0) have already failed.
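The end state of this walk-through can be checked mechanically. The sketch below encodes the final placement as read from the example text (node 1 runs processes 1 and 0, node 4 runs 4 and 2, node 6 runs 6 and 8, node 7 runs 7 and 9); the variable and function names are illustrative.

```python
n, m, k = 10, 2, 4                 # parameters of the example network
failed = {9, 2, 8, 0}              # the four nodes that crashed
# Final placement: live node -> set of processes it runs.
running = {1: {1, 0}, 3: {3}, 4: {4, 2}, 5: {5}, 6: {6, 8}, 7: {7, 9}}


def state_holders(j: int) -> set[int]:
    """F_j for the example network."""
    lo, hi = -(k // 2), (k + 1) // 2
    return {(j + p) % n for p in range(lo, hi + 1) if p != 0}


def every_process_runs_once() -> bool:
    hosted = sorted(p for procs in running.values() for p in procs)
    return hosted == list(range(n))


def within_capacity() -> bool:
    return all(len(procs) <= m for procs in running.values())


def adopted_by_state_holders() -> bool:
    """Every adopted process runs on a node that holds its state."""
    return all(node in state_holders(p)
               for node, procs in running.items()
               for p in procs if p != node)


print(failed.isdisjoint(running))      # True: only live nodes run processes
print(every_process_runs_once())       # True
print(within_capacity())               # True
print(adopted_by_state_holders())      # True
```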


Analysis of the Algorithm

• At most 2k rounds are required to resolve a single fault.

• To resolve a single fault, the maximum number of RESOLVED messages that must be sent across the network is (k − 2)m + 1, where m is the maximum number of processes that a node is capable of executing.
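As a worked instance of these two bounds, take the parameters of the example network, k = 4 and m = 2:

```latex
\begin{align*}
\text{rounds per fault} &\le 2k = 2 \cdot 4 = 8, \\
\text{RESOLVED messages per fault} &\le (k - 2)\,m + 1 = (4 - 2) \cdot 2 + 1 = 5.
\end{align*}
```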


Correctness of the Algorithm

• We show the correctness of the distributed algorithm through formal verification.

• We use the SPIN model checker for modeling and verification of the algorithm.

• We have been able to verify our model for N=8, K=3 and M=2 and all lower instances.

• Due to the state-space explosion problem inherent in model checking with SPIN, we could not verify our algorithm for more than 8 processors.


Spin Model Checker

• SPIN is a tool for automatically model checking distributed algorithms.

• Promela is a language for modeling systems of concurrent processes that can interact via shared variables and message channels.

• Given a concurrent system modeled by a Promela program, SPIN can check for deadlock, dead code, violations of user-specified assertions, and temporal properties expressed by LTL formulas.

• When a violation of a property is detected, SPIN reports a scenario, i.e., a sequence of transitions, violating the property.


Properties

Safety 1: Whenever a node becomes faulty, at least one of its neighboring nodes is non-faulty.

Safety 2: No node has to take more than M processes at any point of time.

Liveness: Whenever a node becomes faulty, its process is eventually taken up by some other live node.

Timeliness: Every fault is recovered in no more than 2K rounds.
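One possible way to phrase these four properties as temporal-logic formulas is sketched below. The slides do not give the formulas actually checked, so everything here is an assumed formalization: the propositions $\mathit{faulty}_i$, $\mathit{runs}_{j,i}$ (node $j$ runs process $i$), $\mathit{load}_i$ (number of processes on node $i$) and $\mathit{age}_i$ (rounds for which process $i$ has currently been without a host) are hypothetical names, and "neighboring" is read as membership in $P_i$. The symbols $\Box$ and $\Diamond$ need the `amssymb` package.

```latex
\begin{align*}
\text{Safety 1:}\quad & \Box \bigwedge_{i \in N} \Bigl( \mathit{faulty}_i \rightarrow \bigvee_{j \in P_i} \lnot \mathit{faulty}_j \Bigr) \\
\text{Safety 2:}\quad & \Box \bigwedge_{i \in N} \bigl( \mathit{load}_i \le M \bigr) \\
\text{Liveness:}\quad & \Box \bigwedge_{i \in N} \Bigl( \mathit{faulty}_i \rightarrow \Diamond \bigvee_{j \ne i} \bigl( \lnot \mathit{faulty}_j \land \mathit{runs}_{j,i} \bigr) \Bigr) \\
\text{Timeliness:}\quad & \Box \bigwedge_{i \in N} \bigl( \mathit{age}_i \le 2K \bigr)
\end{align*}
```

Timeliness is written here as an invariant over a round counter rather than with a bounded "eventually" operator, since plain LTL has no built-in notion of a round.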


Conclusion

• We have presented a distributed algorithm for automated fault recovery for stateful failover in a network.

• The algorithm can handle a fault in whatever way it arises.

• In at most 2k rounds, the processes of the faulty processor are taken up by one or more eligible live nodes in the network.

• The message complexity of our algorithm is linear in the number of nodes.

• The correctness of the algorithm has been verified by modeling the algorithm in SPIN and checking its desired properties.


Thank You!!
