CS 542 -- Failure Recovery, Concurrency Control
TRANSCRIPT
CS 542 Database Management Systems: Failure Recovery, Concurrency Control
J Singh, April 4, 2011
© J Singh, 2011
Today’s meeting
• The D in ACID: Durability
• The ACI in ACID
– Consistency is specified by users, in how they define their transactions
– The database is responsible for Atomicity and Isolation
Types of Failures
• Potential sources of failures:
– Power loss, resulting in loss of main-memory state,
– Media failures, resulting in loss of disk state, and
– Software errors, resulting in both.
• Recovery is based on the concept of transactions.
Transactions and Concurrency
• Users submit transactions, and think of each transaction as executing by itself.
• Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions.
• Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins. A transaction can end in two different ways:
– commit: successful end, all actions completed,
– abort: unsuccessful end, only some actions executed.
• Issues: effect of interleaving transactions on the database
– System failures (today's lecture)
– Concurrent transactions (partly today, remainder next week)
Transactions, Logging and Recovery
• We studied Query Processing in the last two lectures
• Now, Log Manager and Recovery Manager
• Second part today, Transaction Manager
Reminder: Buffer Management
• Data must be in RAM for the DBMS to operate on it!
• (Diagram: disk pages move between the DISK and free frames of the BUFFER POOL in MAIN MEMORY, driven by page requests from higher levels; the choice of frame is dictated by the replacement policy.)
Primitive Buffer Operations
• Requests from Transactions
– Read(x,t): Input(x) if necessary; assign the value of x in the block to local variable t (in the buffer).
– Write(x,t): Input(x) if necessary; assign the value of local variable t (in the buffer) to x.
• Requests to Disk
– Input(x): transfer the block containing x from disk to memory (buffer).
– Output(x): transfer the block containing x from the buffer to disk.
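The four primitives above can be sketched as follows. This is a toy model, not a real buffer manager: the dict-backed Buffer class, its field names, and the element-at-a-time granularity are invented for illustration (a real buffer pool works on fixed-size frames with a replacement policy).

```python
class Buffer:
    """Toy model of the four buffer primitives (illustrative only)."""

    def __init__(self, disk):
        self.disk = disk   # element -> value; stands in for stable storage
        self.pool = {}     # element -> value; stands in for buffer frames

    def input(self, x):
        # Input(x): transfer the block containing x from disk to the buffer
        if x not in self.pool:
            self.pool[x] = self.disk[x]

    def read(self, x):
        # Read(x,t): Input(x) if necessary, then copy x's buffered value to t
        self.input(x)
        return self.pool[x]

    def write(self, x, t):
        # Write(x,t): Input(x) if necessary, then assign t to x in the buffer
        self.input(x)
        self.pool[x] = t

    def output(self, x):
        # Output(x): transfer the block containing x from the buffer to disk
        self.disk[x] = self.pool[x]
```

Note that write() only changes the buffered copy; nothing reaches disk until an explicit output(), which is exactly the gap that logging must protect.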
Failure Recovery Approaches
• All of the approaches rely on logging: storing a log of changes to the database so that it is possible to restore its state. They differ in
– what information is logged,
– when that information is forced to stable storage, and
– what the recovery procedure is.
– The approaches are named after their recovery procedures.
• Undo Logging
– The log contains enough information to detect whether a transaction committed, and to roll back its state if it did not.
– When recovering after a failure, walk back through the log and undo the effects of all txns that have no COMMIT entry in the log.
• Other approaches described later.
Undo Logging
• When executing transactions
– Write the log before writing transaction data, and force it to disk.
– Make sure to preserve chronological order.
– The log contains enough information to detect whether a transaction committed, and to roll back its state if it did not.
• When restarting
– Walk back through the log and undo the effects of all uncommitted txns in the log.
– Challenge: how far back do we need to look?
– Answer: until the last checkpoint.
• We define and implement checkpoints momentarily.
An Example Transaction
• Initially:
– A = 8
– B = 8
• Transaction T1:
– A ← 2*A
– B ← 2*B
• Transaction T1 as primitives:
– Read(A,t); t ← t*2
– Write(A,t);
– Read(B,t); t ← t*2
– Write(B,t);
– Output(A);
– Output(B);   (failure strikes before Output(B) completes)
• State at the failure point:
– Memory: A = 16, B = 16
– Disk: A = 16, B = 8
• Undo log entries:
– <T1, start>
– <T1, A, 8>
– <T1, B, 8>
– <T1, Commit>  (would have been written only if the transaction had completed)
– Do we have the info to restore?
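Recovery from the state above can be sketched directly: walk the log backward and, for every transaction with no COMMIT record, write the logged old value back. The tuple encoding of log records below is invented for illustration; a real log is a byte-oriented on-disk structure.

```python
def undo_recover(disk, log):
    """Undo recovery sketch: restore old values of uncommitted txns,
    scanning the log from the end back to the beginning."""
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    for rec in reversed(log):
        if rec[1] == "UPDATE" and rec[0] not in committed:
            txn, _, x, old = rec
            disk[x] = old       # put the old value back on disk
    # a real system would also append <T, ABORT> records here

disk = {"A": 16, "B": 8}        # disk state at the failure point
log = [("T1", "START"),
       ("T1", "UPDATE", "A", 8),   # <T1, A, 8>: old value of A was 8
       ("T1", "UPDATE", "B", 8)]   # no <T1, Commit> ever reached the log
undo_recover(disk, log)
```

After the run, both A and B are back to 8: the log did contain enough information to restore the pre-transaction state.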
Execution with Undo Logging
• Logging Rule:
– If a transaction commits, the COMMIT record must be written to disk after all data records have been written to disk.
– Flushing the log forces all log records to disk.
Recovery with Undo Logging
• Consider all uncommitted transactions, starting with the most recent one and going backward.
• Undo all actions of these transactions.
• Why go backward, not forward?
• Example: T1, T2 and T3 all write A
– T1 executed before T2, before T3
– T1 committed; T2 and T3 incomplete
– time/log: T1 write A ... T1 commit ... T2 write A ... T3 write A ... system failure
More on Undo Logging
• Failure during recovery?
– The recovery algorithm is idempotent: just do it again!
• How much of the log file needs to be processed?
– In principle, we need to examine the entire log.
– Checkpointing limits the part of the log that needs to be considered during recovery, back to a certain point (the checkpoint).
Quiescent Checkpointing
• Simple approach to introduce the concept
• Pause the database:
– stop accepting new transactions,
– wait until all current transactions commit or abort and have written the corresponding log records,
– flush the log to disk,
– write a <CKPT> log record and flush the log,
– resume accepting new transactions.
• Once we encounter a checkpoint record, we know that there are no incomplete transactions before it.
– We do not need to go backward beyond the checkpoint.
– We can afford to throw away any part of the log prior to the checkpoint.
• Pausing the database may not be acceptable for business reasons.
Non-quiescent Checkpointing
• Main idea: START and END checkpoint records bracket the unfinished txns.
– Write a <START CKPT (T1, T2, ..., Tk)> record into the log, where T1, T2, ..., Tk are the unfinished txns.
– Wait until T1, T2, ..., Tk commit or abort, but allow other txns to begin.
– Write an <END CKPT> record into the log.
• Recovery method: scan the log backwards until a checkpoint record is found.
– If <END CKPT>, scan backwards to the matching <START CKPT ...>; no need to look any further.
– If <START CKPT ...> with no matching <END CKPT>, the crash must have occurred during checkpointing.
• The START record tells us the unfinished txns.
• Scan back to the beginning of the oldest of these.
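The backward scan can be sketched as a function that returns how far back recovery must look. The record encoding (tuples with a tag in position 0) is invented for illustration and is not the textbook's format.

```python
def recovery_scan_start(log):
    """Return the index in `log` from which undo recovery must start,
    using non-quiescent checkpoint records (sketch, invented encoding)."""
    # Scan backward for the most recent checkpoint record.
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec[0] == "END_CKPT":
            # Checkpoint completed: stop at the matching START_CKPT record.
            for j in range(i - 1, -1, -1):
                if log[j][0] == "START_CKPT":
                    return j
        if rec[0] == "START_CKPT":
            # Crash hit mid-checkpoint: go back to the start of the oldest
            # transaction listed as unfinished in the START record.
            unfinished = set(rec[1])
            for j in range(len(log)):
                if log[j][0] == "START" and log[j][1] in unfinished:
                    return j
    return 0  # no checkpoint at all: examine the entire log
```

With a completed checkpoint, recovery never looks earlier than the <START CKPT> record; with an interrupted one, it backs up to the oldest transaction the record names.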
Issues with Undo Logging
• Bottlenecks on I/O:
– All log records must be forced to disk before any data is written back.
– All data records must be forced to disk before the COMMIT record is written.
• An alternative: Redo Logging
– Instead of scanning backward from the end, undoing all transactions that were not completed,
– it scans the log forward, reapplying all committed transactions whose changes may not have reached disk.
Logging with Redo Logs
• Creation of the redo log:
– For every action, generate a redo log record.
– <T, X, v> has a different meaning here: v is the new value, not the old one.
– Flush the log at commit.
– All log records for a transaction that modified X (including its COMMIT) must be on disk before X is modified on disk.
– Write an END log record after the DB modifications have been written to disk.
• Recovery algorithm: redo the modifications by committed transactions not yet flushed to disk.
– S = set of txns with <Ti COMMIT> and no <Ti END> in the log.
– For each <Ti, X, v> in the log, in forward order (from earliest to latest): if Ti in S, then Write(X, v); Output(X).
– Finally, write <Ti END> for each Ti in S.
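The recovery pass above can be sketched as follows. As before, the tuple encoding of log records is invented for illustration; only the algorithm (commit-but-no-end set, forward scan, new values reapplied) follows the slide.

```python
def redo_recover(disk, log):
    """Redo recovery sketch: reapply updates of committed transactions
    that have no END record, scanning the log forward."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    ended = {r[1] for r in log if r[0] == "END"}
    S = committed - ended            # committed, but maybe not flushed
    for r in log:                    # forward order: earliest to latest
        if r[0] == "UPDATE" and r[1] in S:
            _, txn, x, new = r
            disk[x] = new            # <T, X, v>: v is the NEW value here
    for txn in S:
        log.append(("END", txn))     # record that the redo work is done
```

The forward order matters: if a transaction wrote X twice, the later (final) value is the one left on disk.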
Logging with Redo Logs
Step | Action     | t  | M-A | M-B | D-A | D-B | Log
1    |            |    |     |     |     |     | <START T>
2    | READ(A,t)  | 8  | 8   |     | 8   | 8   |
3    | t := t*2   | 16 | 8   |     | 8   | 8   |
4    | WRITE(A,t) | 16 | 16  |     | 8   | 8   | <T,A,16>
5    | READ(B,t)  | 8  | 16  | 8   | 8   | 8   |
6    | t := t*2   | 16 | 16  | 8   | 8   | 8   |
7    | WRITE(B,t) | 16 | 16  | 16  | 8   | 8   | <T,B,16>
8    |            |    |     |     |     |     | <COMMIT T>
9    | FLUSH LOG  |    |     |     |     |     |
10   | OUTPUT(A)  | 16 | 16  | 16  | 16  | 8   |
11   | OUTPUT(B)  | 16 | 16  | 16  | 16  | 16  |
Comments on Redo Logging
• Checkpoint algorithms are similar to those for undo logging:
– quiescent as well as non-quiescent algorithms.
• Issues with redo logging:
– Writing data back to disk is not allowed until the transaction's log records have been written out.
• This results in a large memory requirement for the buffer pool.
– A flaw in the checkpointing algorithms (textbook, p. 869):
• Both undo and redo logs may put contradictory requirements on how buffers are handled during a checkpoint, unless the database elements are complete blocks or sets of blocks.
• For instance, if a buffer contains one database element A that was changed by a committed transaction, and another database element B that was changed in the same buffer by a transaction that has not yet had its COMMIT record written to disk, then we are required to copy the buffer to disk because of A, but also forbidden to do so, because rule R1 applies to B.
Undo/Redo Logging (p1)
• Undo logging requires writing modifications to disk at commit time (before the COMMIT record), leading to an unnecessarily large number of I/Os.
• Redo logging requires keeping all modified blocks in the buffer until the transaction commits and the log records have been flushed, increasing the buffer size requirement.
• Undo/redo logging combines undo and redo logging.
– It provides more flexibility in flushing modified blocks, at the expense of maintaining more information in the log.
Undo/Redo Logging (p2)
• Main idea: The log can be used to reconstruct the data
• Update records <T, X, new, old> record both the new and the old value of X.
• The only undo/redo logging rule: the log record must be flushed before the corresponding modified block.
– Also known as write-ahead logging (WAL).
• The block containing X can be flushed before or after T commits, i.e., before or after the COMMIT log record.
• Flush the log at commit.
Undo/Redo Logging (p3)
• Because of the flexibility of flushing X before or after the COMMIT record, we can have uncommitted transactions with modifications on disk and committed transactions with modifications not yet on disk.
• The undo/redo recovery policy is as follows:– Redo committed transactions.– Undo uncommitted transactions.
Undo/Redo Logging Recovery
• More details on the recovery procedure:
– Backward pass:
• From the end of the log back to the latest valid checkpoint, construct the set S of committed transactions.
• Undo the actions of transactions not in S.
– Forward pass:
• From the latest checkpoint forward to the end of the log (or from the beginning of time, if there are no checkpoints),
• redo the actions of transactions in S.
• Alternatively, the redos can also be performed before the undos.
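The two passes can be sketched together. The <T, X, new, old> records carry both values, so one function can undo uncommitted work and redo committed work; the tuple encoding is invented for illustration.

```python
def undo_redo_recover(disk, log):
    """Undo/redo recovery sketch over <T, X, new, old> records:
    undo uncommitted txns backward, then redo committed txns forward."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    # Backward pass: undo the transactions that never committed.
    for r in reversed(log):
        if r[0] == "UPDATE" and r[1] not in committed:
            _, _, x, new, old = r
            disk[x] = old            # restore the OLD value
    # Forward pass: redo the transactions that did commit.
    for r in log:
        if r[0] == "UPDATE" and r[1] in committed:
            _, _, x, new, old = r
            disk[x] = new            # reapply the NEW value
```

This handles both anomalies the slide mentions: an uncommitted transaction whose change reached disk gets undone, and a committed transaction whose change never reached disk gets redone.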
Undo/Redo Checkpointing
• Write a "start checkpoint" record listing all active transactions to the log.
• Flush the log to disk.
• Write to disk all dirty buffers (those containing a changed DB element), whether or not their transactions have committed.
– Implies nothing should be written (not even to memory buffers) until we are sure the transaction will not abort.
– Implies some log records may need to be written to disk first (WAL).
• Write an "end checkpoint" record to the log.
• Flush the log to disk.
• (Log layout: ... <START CKPT, active T's: T1, T2, ...> ... <END CKPT> ...)
Protecting Against Media Failure
• Logging protects from local loss of main memory and disk content, but not against global loss of secondary storage content (media failure).
• To protect against media failures, employ archiving: maintaining a copy of the database on a separate, secure storage device.
– Log also needs to be archived in the same manner.
• Two levels of archiving:– full dump vs. incremental dump.
Protecting Against Media Failure
• Typically, database cannot be shut down for the period of time needed to make a backup copy (dump).
• Need to perform nonquiescent archiving, i.e., create a dump while the DBMS continues to process transactions.
– Goal is to make copy of database at time when the dump began, but transactions may change database content during the dumping.
– Logging continues during the dumping, and discrepancies can be corrected from the log.
Protecting Against Media Failure
• We assume undo/redo (or redo) logging.
• The archiving procedure is as follows:
– Write a log record <START DUMP>.
– Perform a checkpoint for the log.
– Perform a (full or incremental) dump to the secure storage device.
– Make sure that enough of the log has been copied to the secure storage device that at least the log up to the checkpoint will survive a media failure.
– Write a log record <END DUMP>.
Protecting Against Media Failure
• After a media failure, we can restore the DB from the archived DB and archived log as follows:
– Copy the latest full dump (archive) back to the DB.
– Starting with the earliest, apply the modifications recorded in the incremental dump(s) in increasing order of time.
– Further modify the DB using the archived log.
– Use the recovery method corresponding to the chosen type of logging.
Summary
• Logging is an effective way to prepare for system failure.
– Transactions provide a useful building block on which to base log entries.
– Three types of logs:
• Undo logs
• Redo logs
• Undo/redo logs
– Only undo/redo logs are used in practice. Why?
– Periodic checkpoints are necessary for keeping recovery times under control. Why?
• Database dumps (archives) protect against media failure.
– Great for making a "point in time" copy of the database.
On the NoSQL Front…
• Google Datastore
– Recently (1/2011) added a "High Replication" option.
– Replicates the datastore synchronously across multiple data centers.
• Does not use an append-only log.
– Has performance and size impacts.
• CouchDB
– Append-only log that's actually a B-tree.
– No provision for deleting part of the log.
• There is a provision for 'compacting' the log.
• MongoDB
– Recently (12/2010) added a --journal option.
– Has a performance impact; no measurements available.
• Common thread: a tradeoff between performance and durability!
CS 542 Database Management Systems: Concurrency Control
J Singh, April 4, 2011
Concurrency Control
• Goal: Preserving Data Integrity
• Challenge: enforce the ACID rules (while maintaining maximum traffic through the system).
– Committed transactions leave the system in a consistent state.
– Rolled-back transactions behave as if they never happened!
• Historical note:
– Based on "The Transaction Concept: Virtues and Limitations" by Jim Gray, Tandem Computers, 1981.
– ACM Turing Award, 1998.
Transactions
• Concurrent execution of user programs is essential for good DBMS performance.
– Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently.
• A user’s program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned about what data is read/written from/to the database.
• A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes.
– Referred to as a Schedule
– Implemented by a Transaction Scheduler
Scheduler
• The scheduler takes read/write requests from transactions.
– It either executes them in buffers or delays them.
– The scheduler must avoid isolation anomalies.
Isolation Anomalies (p1)
• READ UNCOMMITTED
– Dirty Read: data of an uncommitted transaction is visible to others.
– Sometimes called a WR conflict.
• UNREPEATABLE READ
– Non-repeatable Read: some previously read data changes due to another transaction committing.
– Sometimes called an RW conflict.
T1: R(A), W(A), R(B), W(B), C
T2: R(A), W(A), R(B), W(B), C
T1: R(A), W(A), C
T2: R(A), W(A), C
Isolation Anomalies (p2)
• Overwriting Uncommitted Data
– Sometimes called a WW conflict.
• We need a set of rules to prohibit such isolation anomalies.
– The rules place constraints on the actions of concurrent transactions.
T1: W(A), W(B), C
T2: W(A), W(B), C
Serial Schedules
• Schedule D consists of 3 transactions T1, T2, T3:
– T1 reads and writes object X,
– then T2 reads and writes object Y,
– then T3 reads and writes object Z.
• D is an example of a serial schedule, because the 3 txns are not interleaved.
• Shorthand: R1(X), W1(X), R2(Y), W2(Y), R3(Z), W3(Z)
• Definition: a schedule is a list of actions (i.e., reading, writing, aborting, committing) from a set of transactions.
• A schedule is serial if its transactions are not interleaved.
– Serial schedules observe the ACI properties.
Serializable Schedules
• The order of actions in E is not the same as in D,
– but E gives the same result.
• Shorthand: E = R1(X); R2(Y); R3(Z); W1(X); W2(Y); W3(Z);
• A serializable schedule is one that is equivalent to a serial schedule.
• The Transaction Manager should defer some transactions if the current schedule is not serializable.
Serializability
• Is G serializable?
– Equivalent to the serial schedule <T1,T2>,
– but not <T2,T1>.
• G is conflict-serializable.
• Conflict equivalence: schedules S1 and S2 are conflict-equivalent if:
– both involve the same set of transactions (including the ordering of actions within each transaction), and
– the order of each pair of conflicting actions is the same in S1 and S2.
• Conflict-serializability: a schedule is conflict-serializable when it is conflict-equivalent to some serial schedule.
Serializability of Schedule G
• Precedence graph:
– a node for each transaction,
– an arc from Ti to Tj if an action in Ti precedes and conflicts with an action in Tj.
– Two actions conflict if:
• the actions belong to different transactions,
• at least one of the actions is a write operation, and
• the actions access the same object (read or write).
• Schedule G: R1(A) W1(B) R2(A) W2(A), i.e., T1: R(A) W(B) and T2: R(A) W(A).
– T1 → T2? R1(A) precedes and conflicts with W2(A), so the arc T1 → T2 is present.
– T2 → T1? No action of T2 precedes and conflicts with an action of T1: no arc.
– Theorem: a schedule is conflict-serializable if and only if its precedence graph is acyclic.
• G's precedence graph (T1 → T2) is acyclic, so G is conflict-serializable.
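The precedence-graph test can be sketched in a few lines: collect an arc for every pair of conflicting actions, then check the graph for cycles. Names and the (txn, op, obj) encoding are invented for illustration; the conflict rule and the acyclicity theorem are from the slide.

```python
def conflict_serializable(schedule):
    """Sketch: build the precedence graph of a schedule and
    report whether it is acyclic (i.e., conflict-serializable)."""
    # schedule: list of (txn, op, obj) with op in {"R", "W"}
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            # conflict: different txns, same object, at least one write
            if ti != tj and xi == xj and "W" in (oi, oj):
                edges.add((ti, tj))   # Ti's action came first: arc Ti -> Tj
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)

    def reachable(src, dst, seen=()):
        # depth-first search along the arcs
        return any(n == dst or (n not in seen and reachable(n, dst, seen + (n,)))
                   for n in adj.get(src, ()))

    txns = {t for t, _, _ in schedule}
    return not any(reachable(t, t) for t in txns)   # acyclic <=> serializable
```

Schedule G from the slide yields only the arc T1 → T2 and passes; interleaving two transactions so that each precedes-and-conflicts with the other creates a cycle and fails.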
Enforcing Serializable Schedules
• Prevent cycles in the precedence graph P(S) from occurring.
• Locking primitives:
– Lock (exclusive): li(A)
– Unlock: ui(A)
1. Make transactions consistent:
– Ti: pi(A) becomes Ti: li(A) pi(A) ui(A), where pi(A) is either a READ or a WRITE.
2. Allow only one transaction to hold a lock on A at any time.
3. Two-phase locking for transactions:
– Ti: li(A) ... pi(A) ... ui(A)
– A growing phase with no unlocks, then a shrinking phase with no new locks: once a transaction releases any lock, it may not acquire another.
Legal Schedules?
S1 = l1(A) l1(B) r1(A) w1(B) l2(B) u1(A) u1(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)
S2 = l1(A) r1(A) w1(B) u1(A) u1(B) l2(B) r2(B) w2(B) l3(B) r3(B) u3(B)
S3 = l1(A) r1(A) u1(A) l1(B) w1(B) u1(B) l2(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)
Locking Protocols for Serializable Schedules
• Strict Two-Phase Locking (Strict 2PL) Protocol:
– Each transaction must obtain an S (shared) lock on an object before reading it, and an X (exclusive) lock on an object before writing it.
– All locks held by a transaction are released when the transaction completes.
– Strict 2PL allows only serializable schedules.
– Additionally, it simplifies transaction aborts.
• (Non-strict) 2PL variant: release locks at any time, but acquire no locks after releasing any lock.
– If a txn holds an X lock on an object, no other txn can get a lock (S or X) on that object.
– (Non-strict) 2PL also allows only serializable schedules, but involves more complex abort processing.
– Why is "acquiring after releasing" disallowed? To avoid cascading aborts.
• More in a minute.
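A lock table for strict 2PL can be sketched as follows. This is a minimal illustration, not a production lock manager: the class and method names are invented, there is no request queue or blocking, and a refused request simply returns False for the caller to retry.

```python
class LockManager:
    """Strict-2PL sketch: S/X locks per object, all released at commit."""

    def __init__(self):
        self.table = {}   # obj -> (mode, set of holder txns)

    def acquire(self, txn, obj, mode):
        """Try to take an S or X lock; return False on conflict
        (a real manager would enqueue the request instead)."""
        held = self.table.get(obj)
        if held is None:
            self.table[obj] = (mode, {txn})
            return True
        cur_mode, holders = held
        if mode == "S" and cur_mode == "S":
            holders.add(txn)          # shared locks are compatible
            return True
        if holders == {txn}:          # sole holder may upgrade S -> X
            self.table[obj] = ("X" if mode == "X" else cur_mode, holders)
            return True
        return False                  # conflicting request must wait

    def release_all(self, txn):
        # Strict 2PL: every lock is released together at commit/abort,
        # which is what prevents cascading aborts.
        for obj in list(self.table):
            mode, holders = self.table[obj]
            holders.discard(txn)
            if not holders:
                del self.table[obj]
```

Because release_all is the only way out, no transaction ever reads a value written by an uncommitted peer, matching the strict-2PL rules above.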
Executing Locking Protocols
• Begin with a serialized schedule.
– We know it won't deadlock. How do we know this?
• Beyond this simple 2PL protocol, it is all a matter of improving performance and allowing more concurrency:
– shared locks,
– increment locks,
– multiple granularity,
– other types of concurrency control mechanisms.
Lock Management
• Lock and unlock requests are handled by the lock manager.
• Lock table entry:
– number of transactions currently holding a lock,
– type of lock held (shared or exclusive),
– pointer to the queue of lock requests.
• Locking and unlocking operations:
– atomic,
– support upgrade: a transaction that holds a shared lock can be upgraded to hold an exclusive lock.
• Any level of granularity can be locked:
– database, table, block, tuple.
– Why is this necessary?
Multiple-Granularity Locks
• If a transaction needs to scan all records in a table, we don't really want a lock on every tuple individually: significant locking overhead!
• Put a single lock on the table.
• Granularity hierarchy: the Database contains Tables, which contain Pages, which contain Tuples.
• A lock on a node implicitly locks all descendants.
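One way to read the implicit-lock rule: model each lockable node as its path from the database root, so a node is covered whenever any prefix of its path holds a lock. A minimal sketch, with the tuple-path encoding invented for illustration (intention locks, covered next week, are what make this efficient in practice):

```python
def covered(locked, path):
    """Return True if `path` is locked, directly or via an ancestor.
    `locked` is a set of node paths; ancestors are path prefixes."""
    # e.g. ("db", "t1", "p3", "r7") = database db, table t1, page p3, tuple r7
    return any(path[:i] in locked for i in range(1, len(path) + 1))
```

A single lock on a table then covers every page and tuple beneath it, which is exactly the cheap whole-table scan the slide wants.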
Aborting a Transaction
• If a transaction Ti is aborted, all its actions have to be undone.
– If Tj reads an object last written by Ti, Tj must be aborted as well!
• Most systems avoid such cascading aborts by releasing a transaction’s locks only at commit time.
– If Ti writes an object, Tj can read this only after Ti commits.
• In order to undo the actions of an aborted transaction, the DBMS maintains a log in which every write is recorded.
– The same mechanism is used to recover from system crashes; all active txns at the time of the crash are aborted when the system recovers
Performance Considerations (Again!)
• The 2PL protocol allows transactions to proceed with maximum parallelism.
– The locking algorithm only delays actions that would cause conflicts.
• But the locks are still a bottleneck.
– Need to ensure the lowest possible level of locking granularity.
• A classic memory/performance trade-off.
– Conflict-serialization is too conservative,
• but other methods of serialization are too complex.
– A use case that occurs quite often should be optimized:
• Besides scanning through the table, if we need to modify a few tuples, what kind of lock do we put on the table?
• It has to be X (if we only have S or X),
• but that blocks all other read requests!
– This concurrency control is pessimistic: it acquires and releases locks.
• Alternative: Optimistic Concurrency Control.
Next Week
• Intention Locks
• Optimistic Concurrency Control
• Distributed Commit
• Please read ahead of time:
– "The End of an Architectural Era (It's Time for a Complete Rewrite)", Stonebraker et al., Proc. VLDB, 2007.
– "OLTP Through the Looking Glass, and What We Found There", Harizopoulos et al., Proc. ACM SIGMOD, 2008.