Slides for Chapter 14: Distributed transactions
From Coulouris, Dollimore and Kindberg
Distributed Systems: Concepts and Design
Edition 4, © Addison-Wesley 2005
Instructor’s Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 3
© Addison-Wesley Publishers 2000
Commitment of distributed transactions - intro
A distributed transaction is a flat or nested transaction that accesses objects managed by multiple servers.
Atomicity must still be preserved. A process on one of the servers acts as coordinator; it must ensure the same outcome at all of the servers.
The two-phase commit protocol is the most commonly used protocol for achieving this.
Fig. 14.3 A distributed banking transaction
[Figure: a client runs T = openTransaction; a.withdraw(4); c.deposit(4); b.withdraw(3); d.deposit(3); closeTransaction. Participants at BranchX (account A), BranchY (account B) and BranchZ (accounts C and D) join transaction T; for example, the call b.withdraw(3) reaches BranchY as b.withdraw(T, 3).]
Note: the coordinator is in one of the servers, e.g. BranchX
Fig. 14.4 Operations for two-phase commit
canCommit?(trans) -> Yes / No
  Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.
doCommit(trans)
  Call from coordinator to participant to tell participant to commit its part of a transaction.
doAbort(trans)
  Call from coordinator to participant to tell participant to abort its part of a transaction.
haveCommitted(trans, participant)
  Call from participant to coordinator to confirm that it has committed the transaction.
getDecision(trans) -> Yes / No
  Call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.
Fig. 14.5 Two-phase commit protocol
Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the transaction.
2. When a participant receives a canCommit? request, it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No, the participant aborts immediately.
Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes, the coordinator decides to commit the transaction and sends a doCommit request to each of the participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes.
4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and, in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
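The two phases can be sketched in Python. This is a single-process toy: the Participant class and its in-memory vote stand in for remote servers reached by the canCommit?/doCommit/doAbort calls, and all names are illustrative.

```python
class Participant:
    """Stand-in for a remote server taking part in a transaction."""
    def __init__(self, name, will_commit):
        self.name = name
        self.will_commit = will_commit
        self.outcome = None

    def can_commit(self, trans):
        # Before voting Yes, a real participant saves its objects to
        # permanent storage (the "prepared" state).
        return self.will_commit

    def do_commit(self, trans):
        self.outcome = "committed"

    def do_abort(self, trans):
        self.outcome = "aborted"

def two_phase_commit(trans, participants):
    # Phase 1 (voting): collect canCommit? votes from every participant.
    votes = [p.can_commit(trans) for p in participants]
    # Phase 2 (completion): commit only if every vote was Yes.
    if all(votes):
        for p in participants:
            p.do_commit(trans)
        return "committed"
    # Otherwise abort. Strictly only participants that voted Yes need a
    # doAbort, but telling everyone is harmless in this sketch.
    for p in participants:
        p.do_abort(trans)
    return "aborted"
```

A single No vote is enough to abort the whole transaction, which is what makes the unilateral abort of the summary below possible.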
Summary of 2PC
A distributed transaction involves several different servers. A nested transaction structure allows additional concurrency and independent committing by the servers in a distributed transaction.
Atomicity requires that the servers participating in a distributed transaction either all commit it or all abort it.
Atomic commit protocols are designed to achieve this effect, even if servers crash during their execution.
The 2PC protocol allows a server to abort unilaterally. It includes timeout actions to deal with delays due to servers crashing. The 2PC protocol can take an unbounded amount of time to complete, but is guaranteed to complete eventually.
14.5 Distributed deadlocks
Single-server transactions can experience deadlocks. We can either prevent them, or detect and resolve them; the use of timeouts is clumsy, so detection is preferable. Detection uses wait-for graphs.
Distributed transactions lead to distributed deadlocks. In theory, a global wait-for graph can be constructed from the local ones; a cycle in the global wait-for graph that is not in any local one is a distributed deadlock.
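Detection on the assembled global graph reduces to cycle finding. In this sketch the graph is a plain dict mapping each transaction to the transactions it waits for; the contents are hypothetical, and a real detector would first merge the local wait-for graphs reported by each server.

```python
def has_cycle(wait_for):
    # Depth-first search with a recursion stack: a back edge means a
    # cycle, i.e. a set of transactions all waiting on each other.
    visited, on_stack = set(), set()

    def visit(t):
        visited.add(t)
        on_stack.add(t)
        for u in wait_for.get(t, ()):
            if u in on_stack or (u not in visited and visit(u)):
                return True
        on_stack.discard(t)
        return False

    return any(visit(t) for t in wait_for if t not in visited)
```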
14.6 Transaction recovery
The atomicity property of transactions has two aspects: durability and failure atomicity.
Durability requires that objects are saved in permanent storage and will be available indefinitely.
Failure atomicity requires that the effects of transactions are atomic even when the server crashes.
Recovery is concerned with ensuring that a server's objects are durable and that the service provides failure atomicity.
For simplicity, we assume that when a server is running, all of its objects are in volatile memory and all of its committed objects are in a recovery file in permanent storage.
Recovery consists of restoring the server with the latest committed versions of all of its objects from its recovery file.
Recovery manager
The tasks of the Recovery Manager (RM) are:
  to save objects in permanent storage (in a recovery file) for committed transactions;
  to restore the server's objects after a crash;
  to reorganize the recovery file to improve performance;
  to reclaim storage space (in the recovery file).
Media failures, i.e. disk failures affecting the recovery file, require another copy of the recovery file on an independent disk.
Fig 14.18 Types of entry in a recovery file
Type of entry       Description of contents of entry
Object              A value of an object.
Transaction status  Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list     Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.
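The restore step can be sketched from these entry types. Here the recovery file is modeled as a list of dicts whose field names are illustrative, not the on-disk format; recovery replays the intentions lists of committed transactions only.

```python
def recover(recovery_file):
    """Rebuild the latest committed value of every object.

    recovery_file is a list of entries, each a dict with a "type" field:
    "object" (a value), "status" (transaction status) or "intentions"
    (object id paired with the position of its value in the file).
    """
    committed = {e["trans"] for e in recovery_file
                 if e["type"] == "status" and e["status"] == "committed"}
    objects = {}
    for e in recovery_file:
        # Apply intentions lists in file order, but only for transactions
        # whose status entry says they committed.
        if e["type"] == "intentions" and e["trans"] in committed:
            for obj_id, pos in e["intentions"]:
                objects[obj_id] = recovery_file[pos]["value"]
    return objects
```

A transaction that reached only "prepared" before the crash contributes nothing, which is exactly the failure atomicity the section asks for.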
Logging - reorganizing the recovery file
RM is responsible for reorganizing its recovery fileCheckpointing
the process of writing to a new recovery file the current committed values of a server’s objects, transaction status entries and intentions lists of
transactions not yet fully resolved including information related to 2PC (see later)
checkpointing makes recovery faster and saves disk space done after recovery and from time to time
Recovery of the 2PC
Role         Status               Action of RM
Coordinator  prepared or aborted  No decision was reached before the server failed. Sends abortTransaction to all participants on its list and writes aborted to the recovery file.
Coordinator  committed            A decision to commit had been reached before the server failed. Sends a doCommit to all participants on its list and resumes 2PC at step 4.
Participant  committed            The participant sends a haveCommitted message to the coordinator. This allows the coordinator to discard information about the transaction at its next checkpoint.
Participant  uncertain            The participant failed before it knew the outcome of the transaction and cannot determine the status until the coordinator informs it of the decision. Sends a getDecision to the coordinator; when the reply is received, it commits or aborts.
Participant  prepared             The participant has not yet voted and can abort unilaterally.
Coordinator  done                 No action is required.
Time and Global States
ECEN5053 Software Engineering of Distributed Systems, University of Colorado, Boulder
Topics
Clock synchronization
Logical clocks
Global state
How processes can synchronize
Multiple processes must be able to cooperate in granting each other temporary exclusive access to a resource
Also, multiple processes may need to agree on the ordering of events, such as whether message m1 from process P was sent before or after message m2 from process Q.
Centralized system
Time is unambiguous. If a process wants to know the time, it makes a system call and finds out.
If process A asks for the time and gets it, and then process B asks for the time and gets it, the time that B is told will be later than the time that A was told.
Simple, no?
Physical Clocks
Physical computer clocks are not clocks; they are timers.
A quartz crystal oscillates at a well-defined frequency that depends on its physical properties.
There are two registers: a counter and a holding register. Each oscillation decrements the counter by one. When the counter reaches zero, it generates an interrupt and the counter is reloaded from the holding register. Each interrupt is called a clock tick.
The interrupt service procedure adds 1 to the time stored in memory, so the software clock is kept up to date.
The one and the many
What if the clock is "off" by a little? All processes on a single machine use the same clock, so they will still be internally consistent. What matters is relative time.
It is impossible to guarantee that the crystals in different computers run at exactly the same frequency, so gradually the software clocks get out of synch -- this is clock skew.
A program that expects the time to be independent of the machine on which it is run ... fails.
Hey buddy, can you spare me a second?
To provide UTC (Coordinated Universal Time) to those who need precise time, NIST operates a short-wave radio station, WWV, from Fort Collins, CO.
WWV broadcasts a short pulse at the start of each second.
There are stations in other countries, plus satellites. Using either the short-wave or the satellite services requires accurate knowledge of the relative position of the sender and receiver. Why?
To WWV or not to WWV
If one computer has a WWV receiver, the goal is to keep all the others synchronized to it.
If no machines have WWV receivers, each machine keeps track of its own time; the goal is then to keep all machines together as well as possible.
There are many algorithms.
Underlying model for synchronization algorithms
Each machine has a timer that interrupts H times a second. The interrupt handler adds 1 to a software clock that keeps track of the number of ticks since some agreed-upon time in the past.
Call the value of the clock C. Notationally, when the UTC time is t, the value of the clock on machine p is Cp(t).
In a perfect world, Cp(t) = t for all p and all t.
Back to reality
Theoretically, a timer with H=60 should generate 216,000 ticks per hour.
The relative error is about 10^-5, meaning a particular machine gets a value in the range 215,998 to 216,002.
There is a constant called the maximum drift rate, rho: a real timer is guaranteed to tick at the "perfect" rate plus or minus the maximum drift rate.
If two clocks drift in opposite directions, then at a time delta-t after they were synchronized they may be as much as 2 * rho * delta-t apart. To differ by no more than delta, clocks must be resynchronized every delta/(2 * rho) seconds.
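As a small worked example of that bound (the drift rate and target skew values here are illustrative):

```python
def resync_interval(delta, rho):
    # Two clocks drifting apart at up to 2*rho seconds per second must be
    # resynchronized every delta/(2*rho) seconds to stay within delta.
    return delta / (2 * rho)

# With rho = 10^-5 and a target skew of 1 ms, the machines must be
# resynchronized at least every 50 seconds.
```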
Cristian's algorithm
Well suited to the case of one machine with a WWV receiver and the goal of having all other machines stay synchronized with it.
Call the machine with the WWV receiver the time server. Periodically, each machine sends a message to the time server asking for the current time, and the server responds with CUTC as fast as it can.
As a first approximation, the requester sets its clock to CUTC.
What's wrong with that?
Big Trouble
Major problem: time really should never run backward -- why? If the sender's clock was fast, CUTC will be smaller than the sender's current value of C.
The change must be introduced gradually:
If the timer generates 100 interrupts/second, each interrupt adds 10 ms to the time.
To slow down, the ISR adds only 9 ms per interrupt until the clock is correct.
To speed up, it adds 11 ms at each interrupt.
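This gradual adjustment (slewing) can be sketched as follows, assuming the 100-interrupts/second timer from the slide; the class and field names are illustrative.

```python
class SlewedClock:
    """Software clock whose corrections never make time run backward."""
    def __init__(self, time_ms=0):
        self.time_ms = time_ms
        self.offset_ms = 0  # signed number of 1 ms corrections still owed

    def set_target_offset(self, offset_ms):
        self.offset_ms = offset_ms

    def tick(self):
        # Nominal tick is 10 ms; add 9 ms to slow down, 11 ms to speed up.
        if self.offset_ms < 0:       # our clock is fast: slow down
            self.time_ms += 9
            self.offset_ms += 1
        elif self.offset_ms > 0:     # our clock is slow: speed up
            self.time_ms += 11
            self.offset_ms -= 1
        else:
            self.time_ms += 10
```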
Little Trouble
Minor problem: it takes a nonzero amount of time for the time server's reply to get back to the sender, and the delay may be large and vary with the network load.
Cristian attempts to measure this: record the send and receive times, subtract, divide by 2, and add the result to the received CUTC.
Better: also account for the length of the time server's ISR, I, and incoming-message processing time: (T1 - T0 - I)/2.
To improve accuracy, measure several round trips and average.
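The round-trip estimate can be sketched like this. T0 and T1 are the requester's clock readings at send and receive, c_utc is the server's reply, and i is the server's handling time if known; all sample values are illustrative.

```python
def cristian_time(t0, t1, c_utc, i=0.0):
    # Estimate the one-way delay as half the round trip (minus the
    # server's handling time) and add it to the reported time.
    one_way = (t1 - t0 - i) / 2
    return c_utc + one_way
```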
If no WWV Receiver
The Berkeley UNIX algorithm: the time server (actually a time daemon) is active, not passive.
It polls every machine and asks what time it is. Based on the answers, it computes an average time and tells all the machines to adjust their clocks to the new time.
The time daemon's time is set manually by the operator periodically.
This is a centralized algorithm, even though the time daemon does not have a WWV receiver.
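The averaging step can be sketched as below: the daemon polls each machine's clock and sends back a signed adjustment. The clock values are illustrative, and a real poll would also correct for message delay as in Cristian's algorithm.

```python
def berkeley_adjustments(clocks):
    # clocks maps machine name -> reported time; the daemon averages the
    # reports and tells each machine how much to adjust by.
    avg = sum(clocks.values()) / len(clocks)
    return {name: avg - t for name, t in clocks.items()}
```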
Decentralized synchronization
Cristian's algorithm and the Berkeley UNIX algorithm are centralized, with the usual downside. What?
There are several decentralized algorithms; for example:
Divide time into fixed-length resynchronization intervals.
At the beginning of each interval, every machine broadcasts its current time.
Each machine starts a local timer to collect all broadcasts arriving during a certain interval.
It then runs an algorithm to compute a new time based on some or all of them.
Internet Synchronization
New hardware and software technology in the past few years make it possible to keep millions of clocks synchronized to within a few ms of UTC.
New algorithms using these synchronized clocks are beginning to appear. Synchronized clocks can be used:
to achieve cache consistency
to use time-out tickets in distributed system authentication
to handle commitment in atomic transactions
Logical Clocks
See also the notes from 3 weeks ago.
For many purposes, it is sufficient that machines agree on the same time even if it is not the "right" time; it is the internal consistency of the clocks that matters.
Clock synchronization is possible but does not have to be absolute. If two processes do not interact, their clocks need not be synchronized; the lack of synchronization would not be observed.
What is important is that all processes agree on the order in which events occur.
Lamport timestamps
"a happens-before b" means that all processes agree that first event a occurs, then afterward event b occurs. We write this as a --> b.
If a occurs before b in the same process, then a --> b is true.
If event a is the sending of a message and event b is the receiving of that message in another process, then a --> b is also true, because a message cannot be received until after it is sent.
happens-before is transitive.
Ya cain't say
If x and y happen in different processes that do not exchange messages, then we can say neither x --> y nor y --> x. Nothing can be said about when the events happened or which happened first; we call such events concurrent.
Invent time
We need a way of measuring time such that for every event a we can assign a time C(a) on which all processes agree, with the property that if a --> b, then C(a) < C(b).
If a and b are two events in the same process and a happens before b, then C(a) < C(b).
If a is the sending of a message by one process and b is the receiving of that message by another, then C(a) and C(b) must be assigned so that everyone agrees on the values of C(a) and C(b), with C(a) < C(b).
Corrections to C can only be made by addition, never subtraction, so that clock time always goes forward.
If a message leaves at time N, it arrives at time >= N+1.
Each message carries the time according to its sender's clock. When it arrives, if the receiver's clock shows a value prior to the time the message was sent, the receiver fast-forwards its clock to be 1 more than the sending time.
Between every two events, the clock must tick at least once: if a process sends or receives two messages in quick succession, it must advance its clock by (at least) 1 tick in between.
No two events ever occur at exactly the same time.
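These rules can be sketched as a minimal Lamport clock; the class and method names are illustrative. (The slide's "no two events at the same time" additionally needs a process-id tiebreak, omitted here.)

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Every event ticks the clock at least once.
        self.time += 1
        return self.time

    def send(self):
        # A send is an event; the message carries the sender's time.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp if needed, then tick,
        # so the receive is stamped at least msg_time + 1.
        self.time = max(self.time, msg_time) + 1
        return self.time
```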
Totally-ordered Multicast
Consider a bank with replicated data in San Francisco and New York City.
A customer in SF wants to add $100 to an account containing $1000. Meanwhile, a bank employee in NY initiates an update by which the customer's account will be increased by 1% interest.
Due to communication delays, the instructions could arrive at the replicated sites in different orders, with differing final answers: deposit first gives (1000 + 100) * 1.01 = $1111, interest first gives 1000 * 1.01 + 100 = $1110.
The updates should have been performed at both sites in the same order.
Using Lamport timestamps to get totally-ordered multicast
Consider a group of processes multicasting messages to each other. Each message is timestamped with the current (logical) time of its sender. Conceptually, a multicast message is also sent to its sender.
We assume messages from the same sender are received in the order they were sent and that no messages are lost.
Totally-ordered multicast (cont.)
When a process receives a message, the message goes into a local queue ordered according to its timestamp. The receiver multicasts an acknowledgement.
Under Lamport's algorithm for adjusting local clocks, the timestamp of the received message is lower than the timestamp of its acknowledgement.
All processes will eventually have the same copy of the local queue, because each message is multicast, along with the acks.
We assumed messages are delivered in the order sent by the sender.
Totally-ordered multicast (cont.)
Each process inserts a received message in its local queue according to the timestamp in that message.
Lamport's clocks ensure no two messages have the same timestamp, and the timestamps reflect a consistent global ordering of events.
A process delivers a queued message to the application it is running when that message is at the head of the queue and has been acknowledged by every other process. The message is then removed from the queue, along with its associated acks.
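One process's queue and delivery rule can be sketched as follows. Timestamps are (Lamport time, sender id) pairs so that no two are equal; the class and message shapes are illustrative.

```python
import heapq

class TotalOrderQueue:
    """Hold-back queue for totally-ordered multicast at one process."""
    def __init__(self, group):
        self.group = set(group)      # all process ids, including our own
        self.queue = []              # heap ordered by (time, sender)
        self.acks = {}               # timestamp -> set of acking processes

    def on_message(self, timestamp, payload):
        heapq.heappush(self.queue, (timestamp, payload))
        self.acks.setdefault(timestamp, set())

    def on_ack(self, timestamp, from_process):
        self.acks.setdefault(timestamp, set()).add(from_process)

    def deliverable(self):
        # Deliver from the head of the queue only, and only once every
        # process in the group has acknowledged the head message.
        delivered = []
        while self.queue and self.acks.get(self.queue[0][0], set()) >= self.group:
            timestamp, payload = heapq.heappop(self.queue)
            del self.acks[timestamp]
            delivered.append(payload)
        return delivered
```

Because every process applies the same rule to the same (eventually identical) queue, all replicas deliver messages in the same order.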
Vector Timestamps
With Lamport timestamps, nothing can be said about the relationship between events a and b simply by comparing their timestamps C(a) and C(b): just because C(a) < C(b) doesn't mean a happened before b (remember concurrent events).
Consider network news, where processes post articles and react to posted articles. Postings are multicast to all members, and we want reactions delivered after their associated postings.
Will totally-ordered multicasting work?
That scheme does not mean that if message B is delivered after message A, B is a reaction to A; they may be completely independent. What's missing?
If causal relationships are maintained within a group of processes, then receipt of a reaction to an article always follows receipt of the article itself. If two items are independent, their order of delivery should not matter at all.
Vector timestamps capture causality
VT(a) < VT(b) means event a causally precedes event b.
Let each process Pi maintain a vector Vi such that:
Vi[i] is the number of events that have occurred so far at Pi.
If Vi[j] = k, then Pi knows that k events have occurred at Pj.
We increment Vi[i] at the occurrence of each new event that happens at process Pi.
Vectors are piggybacked on the messages that are sent: when Pi sends message m, it sends its current vector along as a timestamp vt.
The receiver thus knows the number of events that have occurred at Pi. It is also told how many events at other processes have taken place before Pi sent message m: the timestamp vt of m tells the receiver how many events in other processes have preceded m, and on which m may causally depend.
When Pj receives m, it adjusts its own vector by setting each entry Vj[k] to max{Vj[k], vt[k]}.
The vector now reflects the number of messages that Pj must receive to have at least seen the same messages that preceded the sending of m.
Vj[i] is incremented by 1, representing the event of receiving message m as the next message from Pi.
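The update rules on this slide can be sketched for a fixed group of n processes identified by index; the class shape is illustrative, and the receive rule follows the slide's wording (entrywise max, then count the receipt against the sender's entry).

```python
class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def event(self):
        # Each new local event increments our own entry.
        self.v[self.pid] += 1

    def send(self):
        # A send is an event; the message carries the current vector.
        self.event()
        return list(self.v)

    def receive(self, vt, sender):
        # Merge entrywise with max, then record the receipt as the next
        # message from the sender, as described above.
        self.v = [max(a, b) for a, b in zip(self.v, vt)]
        self.v[sender] += 1
```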
When are messages delivered?
Vector timestamps are used to deliver messages only when no causality constraints are violated.
When process Pi posts an article, it multicasts that article as a message a with timestamp vt(a) set equal to Vi.
When another process Pj receives a, it will have adjusted its own vector such that Vj[i] > vt(a)[i].
Now suppose Pj posts a reaction by multicasting message r with timestamp vt(r) equal to Vj; then vt(r)[i] > vt(a)[i].
Both message a and message r will arrive at a third process Pk, in some order.
When receiving r, Pk inspects timestamp vt(r) and postpones delivery until all messages that causally precede r have been received as well. In particular, r is delivered only if the following conditions are met:
1. vt(r)[j] = Vk[j] + 1
2. vt(r)[i] <= Vk[i] for all i not equal to j
Condition 1 says r is the next message Pk was expecting from Pj. Condition 2 says Pk has already seen every message that Pj had seen when it sent r; in particular, Pk has already seen message a.
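The two conditions translate directly into a delivery test; the function name and sample vectors are illustrative.

```python
def can_deliver(vt_r, v_k, j):
    # 1. r is the next message Pk expects from Pj;
    # 2. Pk has already seen every message Pj had seen when it sent r.
    if vt_r[j] != v_k[j] + 1:
        return False
    return all(vt_r[i] <= v_k[i] for i in range(len(vt_r)) if i != j)
```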
Controversy
There has been some debate about whether support for totally-ordered and causally-ordered multicasting should be provided as part of the message-communication layer, or whether applications should handle ordering themselves.
The communication layer doesn't know what a message contains, only its potential causality: two messages from the same sender will always be marked as causally related even if they are not.
On the other hand, the application developer may not want to think about it.