distributed snapshots

Distributed Snapshots: Determining Global States of Distributed Systems, by K. Mani Chandy and Leslie Lamport

Upload: awesomesos

Post on 28-Nov-2014


DESCRIPTION

My presentation on distributed snapshots for graduate OS course

TRANSCRIPT

Page 1: Distributed Snapshots

Distributed Snapshots: Determining Global States of Distributed Systems

K. Mani Chandy, Leslie Lamport

Page 2: Distributed Snapshots

Overview

Paper presents the Snapshot Algorithm
Aims to discover a global state of the distributed system

Page 3: Distributed Snapshots

Motivation

We want Global State Discovery
Communication latency and clock skew prevent us from doing this well
Applications of global state discovery
  Checkpointing
  Detection of Deadlock with Global Resources … why?
  Consistent view of Distributed Bank Accounts
  Phase Detection (e.g. Barriers)

Page 4: Distributed Snapshots

What is a Global State?

Processes are finite state machines (FSMs)
A global state of a system is a set of states {p_1, …, p_n} such that p_i represents the state of process i.
… is this sufficient?

Page 5: Distributed Snapshots

NO! What about channels?

Insufficient characterization of the system!
Processes communicate using channels
Must account for messages currently in transit
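A minimal sketch of what a global state has to contain once channels are included, using the distributed bank account example from the Motivation slide. The `GlobalState` class, its field names, and the balances are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalState:
    # process id -> recorded local state (here: an account balance)
    process_states: dict = field(default_factory=dict)
    # (sender, receiver) -> messages in transit on that channel, oldest first
    channel_states: dict = field(default_factory=dict)

    def total(self):
        """Sum the balances held by processes plus the amounts still in transit."""
        in_transit = sum(sum(msgs) for msgs in self.channel_states.values())
        return sum(self.process_states.values()) + in_transit


# Two bank branches; a transfer of 20 from p1 to p2 is still in flight.
state = GlobalState(
    process_states={"p1": 80, "p2": 100},
    channel_states={("p1", "p2"): [20], ("p2", "p1"): []},
)
```

Looking only at `process_states` would report 180 in the bank; `state.total()` also counts the transfer in transit and gives 200.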

Page 6: Distributed Snapshots

Stable Properties

Algorithm targeted at specific problems
Check if a stable property holds
  Once it is true, remains true for all later points
“Are all lights currently green?” Is this an example of a stable property?
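One way to build intuition is to test a predicate against a recorded sequence of global states. The sketch below is only an illustration: `holds_stably`, the two predicates, and both traces are made up, and checking a single trace can refute stability but cannot prove it, since stability is a claim about every reachable later state.

```python
def holds_stably(predicate, trace):
    """Return True if, once predicate holds in trace, it holds in every later state."""
    seen_true = False
    for state in trace:
        if predicate(state):
            seen_true = True
        elif seen_true:
            # The predicate was true earlier and is false now: not stable.
            return False
    return True


def all_green(lights):
    return all(colour == "green" for colour in lights)


def terminated(done_flags):
    return all(done_flags)


# Hypothetical traces: each entry is one global state of a two-process system.
light_trace = [("red", "green"), ("green", "green"), ("green", "red")]
done_trace = [(False, False), (True, False), (True, True), (True, True)]
```

On these traces `all_green` becomes true and later false again, so it fails the check, while `terminated` stays true once it is reached.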

Page 7: Distributed Snapshots

Quick Recap

We want Global State Detection
Stable Properties
Moving on …
  System Model
  Assumptions
  Chandy-Lamport Algorithm

Page 8: Distributed Snapshots

Eagle’s Eye View

Page 9: Distributed Snapshots

Assumptions (oh no!)

Channels
  FIFO
  Infinite Buffers
  Error-free
  Finite delivery time
No failures
States can be captured in finite time
Hidden assumption: steps in algorithm must be atomic in terms of process state (why?)

Page 10: Distributed Snapshots

Snapshot (Chandy-Lamport) Algorithm

1. A process decides to take a snapshot “spontaneously” and sends itself a marker.
2. Upon receiving the marker over a channel c, a process will …
   1. If marker not previously seen: record its state, set the state of c to empty, start recording its other incoming channels, and send the marker to its neighbors
   2. Else: stop recording c; the state of c is the sequence of messages recorded since [1]

Will a marker ever be received on the same channel twice?
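The two rules above can be sketched as a small single-threaded simulation. This is an illustration under the slide's assumptions (FIFO, error-free channels, no failures, one snapshot at a time), not a production implementation. The `Process` class, the `connect` helper, and the bank balances are hypothetical; the initiator's "marker to itself" is modeled by `marker_from=None`, so the initiator records all of its incoming channels.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.incoming = {}        # sender name -> FIFO channel into this process
        self.outgoing = {}        # receiver name -> FIFO channel out of this process
        self.recorded_state = None
        self.channel_state = {}   # sender name -> messages recorded on that channel
        self.recording = set()    # incoming channels still being recorded

    def send(self, to, amount):
        self.balance -= amount
        self.outgoing[to].append(amount)

    def start_snapshot(self):
        # Step 1: the initiator behaves as if it had received a marker from itself.
        self._record_and_send_markers(marker_from=None)

    def _record_and_send_markers(self, marker_from):
        self.recorded_state = self.balance
        for sender in self.incoming:
            self.channel_state[sender] = []   # the marker's own channel stays empty
            if sender != marker_from:
                self.recording.add(sender)
        for channel in self.outgoing.values():
            channel.append(MARKER)

    def receive(self, sender):
        msg = self.incoming[sender].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self._record_and_send_markers(marker_from=sender)  # step 2.1
            else:
                self.recording.discard(sender)                     # step 2.2
        else:
            self.balance += msg
            if sender in self.recording:
                self.channel_state[sender].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.recording


def connect(a, b):
    """Create a one-way FIFO channel from a to b."""
    channel = deque()
    a.outgoing[b.name] = channel
    b.incoming[a.name] = channel


p1, p2 = Process("p1", 100), Process("p2", 100)
connect(p1, p2)
connect(p2, p1)

p1.send("p2", 20)      # in transit when the snapshot starts
p2.send("p1", 10)      # in transit when the snapshot starts
p1.start_snapshot()    # p1 records 80 and sends a marker to p2
p1.receive("p2")       # the 10 arrives while p1 records channel p2 -> p1
p2.receive("p1")       # the 20 arrives before the marker (FIFO)
p2.receive("p1")       # marker: p2 records 110, channel p1 -> p2 is empty
p1.receive("p2")       # marker: p1 stops recording channel p2 -> p1
```

In this run the recorded balances (80 and 110) plus the recorded in-transit message (10) add up to the 200 the two accounts started with, even though the processes recorded their states at different moments.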

Page 11: Distributed Snapshots

Algorithm in Action

Page 12: Distributed Snapshots

Termination of Algorithm

When a marker has been received on every incoming channel
How could you distribute the actual snapshot?
How would we handle multiple concurrent snapshots?

Page 13: Distributed Snapshots

Properties of Snapshot

Global state returned is reachable from the state at the start of the snapshot, and the state at the end of the snapshot is reachable from it
System was not necessarily ever in the state of the snapshot
Can obtain a consistent global state with it.
How can we guarantee state returned actually occurred?
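A rough sketch of what "consistent" means here, using the standard notion that every message recorded as received must also be recorded as sent. The function names and the message histories are hypothetical; with FIFO channels the check reduces to a prefix test per channel.

```python
def is_consistent(sent, received):
    """Check a recorded cut for consistency.

    sent[(p, q)]: messages p had sent to q when p's state was recorded.
    received[(p, q)]: messages q had received from p when q's state was recorded.
    With FIFO channels, what q received must be a prefix of what p sent.
    """
    for channel, got in received.items():
        if sent.get(channel, [])[:len(got)] != got:
            return False
    return True


def in_transit(sent, received):
    """Messages recorded as sent but not yet received, i.e. the channel states."""
    return {ch: msgs[len(received.get(ch, [])):] for ch, msgs in sent.items()}


# Consistent cut: m2 was sent but not yet received, so it is in the channel.
sent_ok = {("p1", "p2"): ["m1", "m2"]}
recv_ok = {("p1", "p2"): ["m1"]}

# Inconsistent cut: p2 recorded receiving m1, but p1's recorded state never sent it.
sent_bad = {("p1", "p2"): []}
recv_bad = {("p1", "p2"): ["m1"]}
```

A consistent cut may still describe a state the system never passed through in real time; it only has to be one the system could have passed through.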

Page 14: Distributed Snapshots

Stability Detection

If the stable property is true in the recorded state, it is true by the end of the algorithm.
If it is false in the recorded state, it was false at the beginning of the snapshot.
Intuitive explanation?
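One way to write the argument down, as a sketch. The notation is an assumption made for this note: $S_0$ is the global state when the snapshot starts, $S^*$ the recorded state, $S_1$ the state when the algorithm terminates, $y$ the stable property, and $S \leadsto S'$ means that $S'$ is reachable from $S$.

```latex
% Reachability property of the recorded state:
S_0 \leadsto S^* \leadsto S_1

% Stability of y: once true, true in every reachable later state:
y(S) \wedge (S \leadsto S') \;\Rightarrow\; y(S')

% Combining the two:
y(S_0) \;\Rightarrow\; y(S^*) \;\Rightarrow\; y(S_1)

% Contrapositive of the first implication:
\neg y(S^*) \;\Rightarrow\; \neg y(S_0)
```

The second implication in the chain is the first bullet of the slide; the contrapositive is the second.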

Page 15: Distributed Snapshots

Issues

Many assumptions necessary
Overhead becomes high with methods that work around assumptions
Cannot discover transient properties
Hard to see type of problems to solve with algorithm
How would you deal with failures? Termination?
At best a good guess. How would you do this?

Page 16: Distributed Snapshots

Questions