Distributed Snapshots: Determining Global States of Distributed Systems
K. Mani Chandy, Leslie Lamport
Overview
Paper shows the Snapshot Algorithm
Aims to discover a global state of the distributed system
Motivation
We want Global State Discovery
Communication latency and clock skew prevent us from doing this well
Applications of global state discovery
  Checkpointing
  Detection of Deadlock with Global Resources … why?
  Consistent view of Distributed Bank Accounts
  Phase Detection (e.g. Barriers)
What is a Global State?
Processes are finite state machines (FSMs)
A global state of a system is a set of states {p_1, … , p_n} such that p_i represents the state of process i.
… is this sufficient?
NO! What about channels?
Insufficient characterization of the system!
Processes communicate using channels
Must account for messages currently in transit
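A minimal sketch of why channel state matters, using the distributed bank account application mentioned earlier. The `GlobalState` class, the account names, and the amounts are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalState:
    # process name -> local balance (the p_i of the definition)
    process_states: dict
    # (src, dst) -> amounts sent on that channel but not yet received
    channel_states: dict = field(default_factory=dict)

    def total(self):
        in_transit = sum(sum(msgs) for msgs in self.channel_states.values())
        return sum(self.process_states.values()) + in_transit


# A and B each start with 100. A sends 30 to B; the message has left A
# but has not reached B yet.
processes_only = GlobalState({"A": 70, "B": 100})
with_channels = GlobalState({"A": 70, "B": 100}, {("A", "B"): [30]})

print(processes_only.total())  # 170: the 30 in transit has vanished
print(with_channels.total())   # 200: matches the money actually in the system
```

Recording only process states loses the in-transit transfer, so the bank appears to hold less money than it does.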
Stable Properties
Algorithm targeted at specific problems
Check if a stable property holds
  Once it is true, it remains true at all later points
“Are all lights currently green?” Is this an example of a stable property?
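A small sketch of the definition, assuming a system run is given as a list of global states. The helper `is_stable_on` and both traces are hypothetical. Checking one trace can only refute stability; it cannot prove it for all executions.

```python
def is_stable_on(trace, predicate):
    """Return True if, along this trace, the predicate never goes from true back to false."""
    seen_true = False
    for state in trace:
        holds = predicate(state)
        if seen_true and not holds:
            return False
        seen_true = seen_true or holds
    return True


# Hypothetical trace of two traffic lights: all green at one point, then not.
lights_trace = [("red", "green"), ("green", "green"), ("green", "red")]
all_green = lambda state: all(colour == "green" for colour in state)

# Hypothetical trace of a computation that terminates and stays terminated.
done_trace = [{"terminated": False}, {"terminated": True}, {"terminated": True}]
terminated = lambda state: state["terminated"]

lights_stable = is_stable_on(lights_trace, all_green)
done_stable = is_stable_on(done_trace, terminated)
print(lights_stable, done_stable)
```

On these traces "all lights green" is violated after having held, while "terminated" stays true once it becomes true.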
Quick Recap
We want Global State Detection
Stable Properties
Moving on …
  System Model
  Assumptions
  Chandy-Lamport Algorithm
Eagle’s Eye View
Assumptions (oh no!)
Channels
  FIFO
  Infinite Buffers
  Error-free
  Finite delivery time
No failures
States can be captured in finite time
Hidden assumption: steps in algorithm must be atomic in terms of process state (why?)
Snapshot (Chandy-Lamport) Algorithm
1. A process decides to take a snapshot “spontaneously” and sends itself a marker.
2. Upon receiving the marker over a channel c, a process will …
   1. If marker not previously seen: record its own state, record the state of c as empty, start recording its other incoming channels, and send the marker to its neighbors (on every outgoing channel)
   2. Else: stop recording c; the state of c is the sequence of messages recorded on c since step 2.1
Will a marker ever be received on the same channel twice?
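The steps above can be sketched as a single-threaded simulation, which makes every step atomic by construction. This is a sketch under stated assumptions, not the paper's presentation: `Process`, `MARKER`, `start_snapshot`, and the bank-transfer scenario are hypothetical names and data. The initiator records its state directly instead of sending itself a marker, which has the same effect.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance, peers, network):
        self.name, self.balance = name, balance
        self.peers = peers            # names of all other processes
        self.net = network            # (src, dst) -> deque used as a FIFO channel
        self.recorded_state = None    # local state captured by the snapshot
        self.channel_state = {}       # src -> messages recorded on channel src -> self
        self.recording = set()        # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.net[(self.name, dst)].append(amount)

    def start_snapshot(self):
        # Record own state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.recording = set(self.peers)
        self.channel_state = {p: [] for p in self.peers}
        for p in self.peers:
            self.net[(self.name, p)].append(MARKER)

    def receive(self, src):
        msg = self.net[(src, self.name)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:   # first marker seen
                self.start_snapshot()
            # Stop recording this channel; it stays empty if this was the first marker.
            self.recording.discard(src)
        else:
            self.balance += msg
            if src in self.recording:         # message was in transit at snapshot time
                self.channel_state[src].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.recording


names = ["A", "B", "C"]
net = {(s, d): deque() for s in names for d in names if s != d}
procs = {n: Process(n, 100, [p for p in names if p != n], net) for n in names}

procs["B"].send("A", 25)        # in transit when A starts the snapshot
procs["A"].start_snapshot()
procs["C"].send("A", 5)         # sent before C has seen a marker

while any(net.values()):        # deliver messages until every channel is empty
    for (src, dst), channel in net.items():
        if channel:
            procs[dst].receive(src)

snapshot_total = sum(
    p.recorded_state + sum(sum(msgs) for msgs in p.channel_state.values())
    for p in procs.values()
)
print(snapshot_total)           # 300
```

In this run the recorded process states sum to 270 and the recorded channel states hold the remaining 30, so the snapshot accounts for all 300 units even though no process paused.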
Algorithm in Action
Termination of Algorithm
When a marker has been received on every incoming channel
How could you distribute the actual snapshot?
How would we handle multiple concurrent snapshots?
Properties of Snapshot
Global state returned is reachable from the state at the start of the snapshot, and the state at the end of the snapshot is reachable from it
System was not necessarily ever in the state returned by the snapshot
Can obtain a consistent global state with it.
How can we guarantee state returned actually occurred?
Stability Detection
If the stable property is true in the recorded state, it is true by the end of the algorithm.
If it is false in the recorded state, it was false at the beginning of the snapshot.
Intuitive explanation?
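One way to see this from the reachability property of the recorded state. The symbols are notation assumed here: S_i is the global state when the snapshot starts, S^* the recorded state, S_f the state when it ends, y the stable property, and the starred arrow means "is reachable from". The block needs amsmath.

```latex
% y is stable: if y(S) holds and S' is reachable from S, then y(S') holds.
\begin{align*}
S^* \xrightarrow{*} S_f \ \text{and}\ y(S^*) &\implies y(S_f) \\
S_i \xrightarrow{*} S^* \ \text{and}\ \lnot y(S^*) &\implies \lnot y(S_i)
\end{align*}
% The second line is the contrapositive of stability applied to S_i and S^*:
% if y held at S_i it would still hold at S^*.
```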
Issues
Many assumptions necessary
  Overhead becomes high with methods that work around assumptions
Cannot discover transient properties
Hard to see type of problems to solve with algorithm
How would you deal with failures? Termination?
  At best a good guess. How would you do this?
Questions