CS 542 -- Failure Recovery, Concurrency Control
TRANSCRIPT
CS 542 Database Management Systems: Failure Recovery, Concurrency Control
J Singh, April 4, 2011
© J Singh, 2011
Today’s meeting
• The D in ACID: Durability
• The ACI in ACID
– Consistency is specified by users, in how they define their transactions
– The database is responsible for Atomicity and Isolation
Types of Failures
• Potential sources of failures:
– Power loss, resulting in loss of main-memory state,
– Media failures, resulting in loss of disk state, and
– Software errors, resulting in both.
• Recovery is based on the concept of transactions.
Transactions and Concurrency
• Users submit transactions, and think of each transaction as executing by itself.
• Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions.
• Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins. A transaction can end in two different ways:
– commit: successful end, all actions completed,
– abort: unsuccessful end, only some actions executed.
• Issues: effect of interleaving transactions on the database
– System failures (today's lecture)
– Concurrent transactions (partly today, remainder next week)
Transactions, Logging and Recovery
• We studied Query Processing in the last two lectures
• Now, Log Manager and Recovery Manager
• Second part today, Transaction Manager
Reminder: Buffer Management
• Data must be in RAM for the DBMS to operate on it!
• (Diagram: disk pages move between the DISK and free frames of the BUFFER POOL in MAIN MEMORY, driven by page requests from higher levels; the choice of frame is dictated by the replacement policy.)
Primitive Buffer Operations
• Requests from Transactions
– Read(x,t): Input(x) if necessary; assign the value of x in the block to local variable t (in the buffer).
– Write(x,t): Input(x) if necessary; assign the value of local variable t (in the buffer) to x.
• Requests to Disk
– Input(x): transfer the block containing x from disk to memory (buffer).
– Output(x): transfer the block containing x from the buffer to disk.
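The four primitives above can be sketched as follows. This is a toy model, not a real buffer manager: the dict-backed Buffer class, its field names, and the element-at-a-time granularity are invented for illustration (a real buffer pool works on fixed-size frames with a replacement policy).

```python
class Buffer:
    """Toy model of the four buffer primitives (illustrative only)."""

    def __init__(self, disk):
        self.disk = disk   # element -> value; stands in for stable storage
        self.pool = {}     # element -> value; stands in for buffer frames

    def input(self, x):
        # Input(x): transfer the block containing x from disk to the buffer
        if x not in self.pool:
            self.pool[x] = self.disk[x]

    def read(self, x):
        # Read(x,t): Input(x) if necessary, then copy x's buffered value to t
        self.input(x)
        return self.pool[x]

    def write(self, x, t):
        # Write(x,t): Input(x) if necessary, then assign t to x in the buffer
        self.input(x)
        self.pool[x] = t

    def output(self, x):
        # Output(x): transfer the block containing x from the buffer to disk
        self.disk[x] = self.pool[x]
```

Note that write() only changes the buffered copy; nothing reaches disk until an explicit output(), which is exactly the gap that logging must protect.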
Failure Recovery Approaches
• All of the approaches rely on logging: storing a log of changes to the database so that it is possible to restore its state. They differ in
– what information is logged,
– when that information is forced to stable storage, and
– what the recovery procedure is.
– The approaches are named after their recovery procedures.
• Undo Logging
– The log contains enough information to detect whether a transaction committed, and to roll back its state if it did not.
– When recovering after a failure, walk back through the log and undo the effects of all txns that have no COMMIT entry in the log.
• Other approaches described later.
Undo Logging
• When executing transactions
– Write the log before writing transaction data, and force it to disk.
– Make sure to preserve chronological order.
– The log contains enough information to detect whether a transaction committed, and to roll back its state if it did not.
• When restarting
– Walk back through the log and undo the effects of all uncommitted txns in the log.
– Challenge: how far back do we need to look?
– Answer: until the last checkpoint.
• We define and implement checkpoints momentarily.
An Example Transaction
• Initially:
– A = 8
– B = 8
• Transaction T1:
– A ← 2*A
– B ← 2*B
• Transaction T1 as primitives:
– Read(A,t); t ← t*2
– Write(A,t);
– Read(B,t); t ← t*2
– Write(B,t);
– Output(A);
– Output(B);   (failure strikes before Output(B) completes)
• State at the failure point:
– Memory: A = 16, B = 16
– Disk: A = 16, B = 8
• Undo log entries:
– <T1, start>
– <T1, A, 8>
– <T1, B, 8>
– <T1, Commit>  (would have been written only if the transaction had completed)
– Do we have the info to restore?
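Recovery from the state above can be sketched directly: walk the log backward and, for every transaction with no COMMIT record, write the logged old value back. The tuple encoding of log records below is invented for illustration; a real log is a byte-oriented on-disk structure.

```python
def undo_recover(disk, log):
    """Undo recovery sketch: restore old values of uncommitted txns,
    scanning the log from the end back to the beginning."""
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    for rec in reversed(log):
        if rec[1] == "UPDATE" and rec[0] not in committed:
            txn, _, x, old = rec
            disk[x] = old       # put the old value back on disk
    # a real system would also append <T, ABORT> records here

disk = {"A": 16, "B": 8}        # disk state at the failure point
log = [("T1", "START"),
       ("T1", "UPDATE", "A", 8),   # <T1, A, 8>: old value of A was 8
       ("T1", "UPDATE", "B", 8)]   # no <T1, Commit> ever reached the log
undo_recover(disk, log)
```

After the run, both A and B are back to 8: the log did contain enough information to restore the pre-transaction state.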
Execution with Undo Logging
• Logging Rule:
– If a transaction commits, the COMMIT record must be written to disk after all data records have been written to disk.
– Flushing the log forces all log records to disk.
Recovery with Undo Logging
• Consider all uncommitted transactions, starting with the most recent one and going backward.
• Undo all actions of these transactions.
• Why go backward, not forward?
• Example: T1, T2 and T3 all write A
– T1 executed before T2, before T3
– T1 committed; T2 and T3 incomplete
– time/log: T1 write A ... T1 commit ... T2 write A ... T3 write A ... system failure
More on Undo Logging
• Failure during recovery?
– The recovery algorithm is idempotent: just do it again!
• How much of the log file needs to be processed?
– In principle, we need to examine the entire log.
– Checkpointing limits the part of the log that needs to be considered during recovery, back to a certain point (the checkpoint).
Quiescent Checkpointing
• Simple approach to introduce the concept
• Pause the database:
– stop accepting new transactions,
– wait until all current transactions commit or abort and have written the corresponding log records,
– flush the log to disk,
– write a <CKPT> log record and flush the log,
– resume accepting new transactions.
• Once we encounter a checkpoint record, we know that there are no incomplete transactions before it.
– We do not need to go backward beyond the checkpoint.
– We can afford to throw away any part of the log prior to the checkpoint.
• Pausing the database may not be acceptable for business reasons.
Non-quiescent Checkpointing
• Main idea: START and END checkpoint records bracket the unfinished txns.
– Write a <START CKPT (T1, T2, ..., Tk)> record into the log, where T1, T2, ..., Tk are the unfinished txns.
– Wait until T1, T2, ..., Tk commit or abort, but allow other txns to begin.
– Write an <END CKPT> record into the log.
• Recovery method: scan the log backwards until a checkpoint record is found.
– If <END CKPT>, scan backwards to the matching <START CKPT ...>; no need to look any further.
– If <START CKPT ...> with no matching <END CKPT>, the crash must have occurred during checkpointing.
• The START record tells us the unfinished txns.
• Scan back to the beginning of the oldest of these.
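The backward scan can be sketched as a function that returns how far back recovery must look. The record encoding (tuples with a tag in position 0) is invented for illustration and is not the textbook's format.

```python
def recovery_scan_start(log):
    """Return the index in `log` from which undo recovery must start,
    using non-quiescent checkpoint records (sketch, invented encoding)."""
    # Scan backward for the most recent checkpoint record.
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec[0] == "END_CKPT":
            # Checkpoint completed: stop at the matching START_CKPT record.
            for j in range(i - 1, -1, -1):
                if log[j][0] == "START_CKPT":
                    return j
        if rec[0] == "START_CKPT":
            # Crash hit mid-checkpoint: go back to the start of the oldest
            # transaction listed as unfinished in the START record.
            unfinished = set(rec[1])
            for j in range(len(log)):
                if log[j][0] == "START" and log[j][1] in unfinished:
                    return j
    return 0  # no checkpoint at all: examine the entire log
```

With a completed checkpoint, recovery never looks earlier than the <START CKPT> record; with an interrupted one, it backs up to the oldest transaction the record names.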
Issues with Undo Logging
• Bottlenecks on I/O:
– All log records must be forced to disk before any data is written back.
– All data records must be forced to disk before the COMMIT record is written.
• An alternative: Redo Logging
– Instead of scanning backward from the end, undoing all transactions that were not completed,
– it scans the log forward, reapplying all committed transactions whose changes may not have reached disk.
Logging with Redo Logs
• Creation of the redo log:
– For every action, generate a redo log record.
– <T, X, v> has a different meaning here: v is the new value, not the old one.
– Flush the log at commit.
– All log records for a transaction that modified X (including its COMMIT) must be on disk before X is modified on disk.
– Write an END log record after the DB modifications have been written to disk.
• Recovery algorithm: redo the modifications by committed transactions not yet flushed to disk.
– S = set of txns with <Ti COMMIT> and no <Ti END> in the log.
– For each <Ti, X, v> in the log, in forward order (from earliest to latest): if Ti in S, then Write(X, v); Output(X).
– Finally, write <Ti END> for each Ti in S.
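The recovery pass above can be sketched as follows. As before, the tuple encoding of log records is invented for illustration; only the algorithm (commit-but-no-end set, forward scan, new values reapplied) follows the slide.

```python
def redo_recover(disk, log):
    """Redo recovery sketch: reapply updates of committed transactions
    that have no END record, scanning the log forward."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    ended = {r[1] for r in log if r[0] == "END"}
    S = committed - ended            # committed, but maybe not flushed
    for r in log:                    # forward order: earliest to latest
        if r[0] == "UPDATE" and r[1] in S:
            _, txn, x, new = r
            disk[x] = new            # <T, X, v>: v is the NEW value here
    for txn in S:
        log.append(("END", txn))     # record that the redo work is done
```

The forward order matters: if a transaction wrote X twice, the later (final) value is the one left on disk.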
Logging with Redo Logs
Step | Action     | t  | M-A | M-B | D-A | D-B | Log
1    |            |    |     |     |     |     | <START T>
2    | READ(A,t)  | 8  | 8   |     | 8   | 8   |
3    | t := t*2   | 16 | 8   |     | 8   | 8   |
4    | WRITE(A,t) | 16 | 16  |     | 8   | 8   | <T,A,16>
5    | READ(B,t)  | 8  | 16  | 8   | 8   | 8   |
6    | t := t*2   | 16 | 16  | 8   | 8   | 8   |
7    | WRITE(B,t) | 16 | 16  | 16  | 8   | 8   | <T,B,16>
8    |            |    |     |     |     |     | <COMMIT T>
9    | FLUSH LOG  |    |     |     |     |     |
10   | OUTPUT(A)  | 16 | 16  | 16  | 16  | 8   |
11   | OUTPUT(B)  | 16 | 16  | 16  | 16  | 16  |
Comments on Redo Logging
• Checkpoint algorithms are similar to those for undo logging:
– quiescent as well as non-quiescent algorithms.
• Issues with redo logging:
– Writing data back to disk is not allowed until the transaction's log records have been written out.
• This results in a large memory requirement for the buffer pool.
– A flaw in the checkpointing algorithms (textbook, p. 869):
• Both undo and redo logs may put contradictory requirements on how buffers are handled during a checkpoint, unless the database elements are complete blocks or sets of blocks.
• For instance, if a buffer contains one database element A that was changed by a committed transaction, and another database element B that was changed in the same buffer by a transaction that has not yet had its COMMIT record written to disk, then we are required to copy the buffer to disk because of A, but also forbidden to do so, because rule R1 applies to B.
Undo/Redo Logging (p1)
• Undo logging requires writing modifications to disk at commit time (before the COMMIT record), leading to an unnecessarily large number of I/Os.
• Redo logging requires keeping all modified blocks in the buffer until the transaction commits and the log records have been flushed, increasing the buffer size requirement.
• Undo/redo logging combines undo and redo logging.
– It provides more flexibility in flushing modified blocks, at the expense of maintaining more information in the log.
Undo/Redo Logging (p2)
• Main idea: The log can be used to reconstruct the data
• Update records <T, X, new, old> record both the new and the old value of X.
• The only undo/redo logging rule: the log record must be flushed before the corresponding modified block.
– Also known as write-ahead logging (WAL).
• The block containing X can be flushed before or after T commits, i.e., before or after the COMMIT log record.
• Flush the log at commit.
Undo/Redo Logging (p3)
• Because of the flexibility of flushing X before or after the COMMIT record, we can have uncommitted transactions with modifications on disk and committed transactions with modifications not yet on disk.
• The undo/redo recovery policy is as follows:– Redo committed transactions.– Undo uncommitted transactions.
Undo/Redo Logging Recovery
• More details on the recovery procedure:
– Backward pass:
• From the end of the log back to the latest valid checkpoint, construct the set S of committed transactions.
• Undo the actions of transactions not in S.
– Forward pass:
• From the latest checkpoint forward to the end of the log (or from the beginning of time, if there are no checkpoints),
• redo the actions of transactions in S.
• Alternatively, the redos can also be performed before the undos.
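The two passes can be sketched together. The <T, X, new, old> records carry both values, so one function can undo uncommitted work and redo committed work; the tuple encoding is invented for illustration.

```python
def undo_redo_recover(disk, log):
    """Undo/redo recovery sketch over <T, X, new, old> records:
    undo uncommitted txns backward, then redo committed txns forward."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    # Backward pass: undo the transactions that never committed.
    for r in reversed(log):
        if r[0] == "UPDATE" and r[1] not in committed:
            _, _, x, new, old = r
            disk[x] = old            # restore the OLD value
    # Forward pass: redo the transactions that did commit.
    for r in log:
        if r[0] == "UPDATE" and r[1] in committed:
            _, _, x, new, old = r
            disk[x] = new            # reapply the NEW value
```

This handles both anomalies the slide mentions: an uncommitted transaction whose change reached disk gets undone, and a committed transaction whose change never reached disk gets redone.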
Undo/Redo Checkpointing
• Write a "start checkpoint" record listing all active transactions to the log.
• Flush the log to disk.
• Write to disk all dirty buffers (those containing a changed DB element), whether or not their transactions have committed.
– Implies nothing should be written (not even to memory buffers) until we are sure the transaction will not abort.
– Implies some log records may need to be written to disk first (WAL).
• Write an "end checkpoint" record to the log.
• Flush the log to disk.
• (Log layout: ... <START CKPT, active T's: T1, T2, ...> ... <END CKPT> ...)
Protecting Against Media Failure
• Logging protects from local loss of main memory and disk content, but not against global loss of secondary storage content (media failure).
• To protect against media failures, employ archiving: maintaining a copy of the database on a separate, secure storage device.
– Log also needs to be archived in the same manner.
• Two levels of archiving:– full dump vs. incremental dump.
Protecting Against Media Failure
• Typically, database cannot be shut down for the period of time needed to make a backup copy (dump).
• Need to perform nonquiescent archiving, i.e., create a dump while the DBMS continues to process transactions.
– Goal is to make copy of database at time when the dump began, but transactions may change database content during the dumping.
– Logging continues during the dumping, and discrepancies can be corrected from the log.
Protecting Against Media Failure
• We assume undo/redo (or redo) logging.
• The archiving procedure is as follows:
– Write a log record <START DUMP>.
– Perform a checkpoint for the log.
– Perform a (full or incremental) dump to the secure storage device.
– Make sure that enough of the log has been copied to the secure storage device that at least the log up to the checkpoint will survive a media failure.
– Write a log record <END DUMP>.
Protecting Against Media Failure
• After a media failure, we can restore the DB from the archived DB and archived log as follows:
– Copy the latest full dump (archive) back to the DB.
– Starting with the earliest, apply the modifications recorded in the incremental dump(s) in increasing order of time.
– Further modify the DB using the archived log.
– Use the recovery method corresponding to the chosen type of logging.
Summary
• Logging is an effective way to prepare for system failure.
– Transactions provide a useful building block on which to base log entries.
– Three types of logs:
• Undo logs
• Redo logs
• Undo/redo logs
– Only undo/redo logs are used in practice. Why?
– Periodic checkpoints are necessary for keeping recovery times under control. Why?
• Database dumps (archives) protect against media failure.
– Great for making a "point in time" copy of the database.
On the NoSQL Front…
• Google Datastore
– Recently (1/2011) added a "High Replication" option.
– Replicates the datastore synchronously across multiple data centers.
• Does not use an append-only log.
– Has performance and size impacts.
• CouchDB
– Append-only log that's actually a B-tree.
– No provision for deleting part of the log.
• There is a provision for 'compacting' the log.
• MongoDB
– Recently (12/2010) added a --journal option.
– Has a performance impact; no measurements available.
• Common thread: a tradeoff between performance and durability!
CS 542 Database Management Systems: Concurrency Control
J Singh, April 4, 2011
Concurrency Control
• Goal: Preserving Data Integrity
• Challenge: enforce the ACID rules (while maintaining maximum traffic through the system).
– Committed transactions leave the system in a consistent state.
– Rolled-back transactions behave as if they never happened!
• Historical note:
– Based on "The Transaction Concept: Virtues and Limitations" by Jim Gray, Tandem Computers, 1981.
– ACM Turing Award, 1998.
Transactions
• Concurrent execution of user programs is essential for good DBMS performance.
– Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently.
• A user’s program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned about what data is read/written from/to the database.
• A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes.
– Referred to as a Schedule
– Implemented by a Transaction Scheduler
Scheduler
• The scheduler takes read/write requests from transactions.
– It either executes them in buffers or delays them.
– The scheduler must avoid isolation anomalies.
Isolation Anomalies (p1)
• READ UNCOMMITTED
– Dirty Read: data of an uncommitted transaction is visible to others.
– Sometimes called a WR conflict.
• UNREPEATABLE READ
– Non-repeatable Read: some previously read data changes due to another transaction committing.
– Sometimes called an RW conflict.
T1: R(A), W(A), R(B), W(B), C
T2: R(A), W(A), R(B), W(B), C
T1: R(A), W(A), C
T2: R(A), W(A), C
Isolation Anomalies (p2)
• Overwriting Uncommitted Data
– Sometimes called a WW conflict.
• We need a set of rules to prohibit such isolation anomalies.
– The rules place constraints on the actions of concurrent transactions.
T1: W(A), W(B), C
T2: W(A), W(B), C
Serial Schedules
• Schedule D consists of 3 transactions T1, T2, T3:
– T1 reads and writes object X,
– then T2 reads and writes object Y,
– then T3 reads and writes object Z.
• D is an example of a serial schedule, because the 3 txns are not interleaved.
• Shorthand: R1(X), W1(X), R2(Y), W2(Y), R3(Z), W3(Z)
• Definition: a schedule is a list of actions (i.e., reading, writing, aborting, committing) from a set of transactions.
• A schedule is serial if its transactions are not interleaved.
– Serial schedules observe the ACI properties.
Serializable Schedules
• The order of actions in E is not the same as in D,
– but E gives the same result.
• Shorthand: E = R1(X); R2(Y); R3(Z); W1(X); W2(Y); W3(Z);
• A serializable schedule is one that is equivalent to a serial schedule.
• The Transaction Manager should defer some transactions if the current schedule is not serializable.
Serializability
• Is G serializable?
– Equivalent to the serial schedule <T1,T2>,
– but not <T2,T1>.
• G is conflict-serializable.
• Conflict equivalence: schedules S1 and S2 are conflict-equivalent if:
– both involve the same set of transactions (including the ordering of actions within each transaction), and
– the order of each pair of conflicting actions is the same in S1 and S2.
• Conflict-serializability: a schedule is conflict-serializable when it is conflict-equivalent to some serial schedule.
Serializability of Schedule G
• Precedence graph:
– a node for each transaction,
– an arc from Ti to Tj if an action in Ti precedes and conflicts with an action in Tj.
– Two actions conflict if:
• the actions belong to different transactions,
• at least one of the actions is a write operation, and
• the actions access the same object (read or write).
• Schedule G: R1(A) W1(B) R2(A) W2(A), i.e., T1: R(A) W(B) and T2: R(A) W(A).
– T1 → T2? R1(A) precedes and conflicts with W2(A), so the arc T1 → T2 is present.
– T2 → T1? No action of T2 precedes and conflicts with an action of T1: no arc.
– Theorem: a schedule is conflict-serializable if and only if its precedence graph is acyclic.
• G's precedence graph (T1 → T2) is acyclic, so G is conflict-serializable.
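The precedence-graph test can be sketched in a few lines: collect an arc for every pair of conflicting actions, then check the graph for cycles. Names and the (txn, op, obj) encoding are invented for illustration; the conflict rule and the acyclicity theorem are from the slide.

```python
def conflict_serializable(schedule):
    """Sketch: build the precedence graph of a schedule and
    report whether it is acyclic (i.e., conflict-serializable)."""
    # schedule: list of (txn, op, obj) with op in {"R", "W"}
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            # conflict: different txns, same object, at least one write
            if ti != tj and xi == xj and "W" in (oi, oj):
                edges.add((ti, tj))   # Ti's action came first: arc Ti -> Tj
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)

    def reachable(src, dst, seen=()):
        # depth-first search along the arcs
        return any(n == dst or (n not in seen and reachable(n, dst, seen + (n,)))
                   for n in adj.get(src, ()))

    txns = {t for t, _, _ in schedule}
    return not any(reachable(t, t) for t in txns)   # acyclic <=> serializable
```

Schedule G from the slide yields only the arc T1 → T2 and passes; interleaving two transactions so that each precedes-and-conflicts with the other creates a cycle and fails.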
Enforcing Serializable Schedules
• Prevent cycles in the precedence graph P(S) from occurring.
• Locking primitives:
– Lock (exclusive): li(A)
– Unlock: ui(A)
1. Make transactions consistent:
– Ti: pi(A) becomes Ti: li(A) pi(A) ui(A), where pi(A) is either a READ or a WRITE.
2. Allow only one transaction to hold a lock on A at any time.
3. Two-phase locking for transactions:
– Ti: li(A) ... pi(A) ... ui(A)
– A growing phase with no unlocks, then a shrinking phase with no new locks: once a transaction releases any lock, it may not acquire another.
Legal Schedules?
S1 = l1(A) l1(B) r1(A) w1(B) l2(B) u1(A) u1(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)
S2 = l1(A) r1(A) w1(B) u1(A) u1(B) l2(B) r2(B) w2(B) l3(B) r3(B) u3(B)
S3 = l1(A) r1(A) u1(A) l1(B) w1(B) u1(B) l2(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)
Locking Protocols for Serializable Schedules
• Strict Two-Phase Locking (Strict 2PL) Protocol:
– Each transaction must obtain an S (shared) lock on an object before reading it, and an X (exclusive) lock on an object before writing it.
– All locks held by a transaction are released when the transaction completes.
– Strict 2PL allows only serializable schedules.
– Additionally, it simplifies transaction aborts.
• (Non-strict) 2PL variant: release locks at any time, but acquire no locks after releasing any lock.
– If a txn holds an X lock on an object, no other txn can get a lock (S or X) on that object.
– (Non-strict) 2PL also allows only serializable schedules, but involves more complex abort processing.
– Why is "acquiring after releasing" disallowed? To avoid cascading aborts.
• More in a minute.
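A lock table for strict 2PL can be sketched as follows. This is a minimal illustration, not a production lock manager: the class and method names are invented, there is no request queue or blocking, and a refused request simply returns False for the caller to retry.

```python
class LockManager:
    """Strict-2PL sketch: S/X locks per object, all released at commit."""

    def __init__(self):
        self.table = {}   # obj -> (mode, set of holder txns)

    def acquire(self, txn, obj, mode):
        """Try to take an S or X lock; return False on conflict
        (a real manager would enqueue the request instead)."""
        held = self.table.get(obj)
        if held is None:
            self.table[obj] = (mode, {txn})
            return True
        cur_mode, holders = held
        if mode == "S" and cur_mode == "S":
            holders.add(txn)          # shared locks are compatible
            return True
        if holders == {txn}:          # sole holder may upgrade S -> X
            self.table[obj] = ("X" if mode == "X" else cur_mode, holders)
            return True
        return False                  # conflicting request must wait

    def release_all(self, txn):
        # Strict 2PL: every lock is released together at commit/abort,
        # which is what prevents cascading aborts.
        for obj in list(self.table):
            mode, holders = self.table[obj]
            holders.discard(txn)
            if not holders:
                del self.table[obj]
```

Because release_all is the only way out, no transaction ever reads a value written by an uncommitted peer, matching the strict-2PL rules above.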
Executing Locking Protocols
• Begin with a serialized schedule.
– We know it won't deadlock. How do we know this?
• Beyond this simple 2PL protocol, it is all a matter of improving performance and allowing more concurrency:
– shared locks,
– increment locks,
– multiple granularity,
– other types of concurrency control mechanisms.
Lock Management
• Lock and unlock requests are handled by the lock manager.
• Lock table entry:
– number of transactions currently holding a lock,
– type of lock held (shared or exclusive),
– pointer to the queue of lock requests.
• Locking and unlocking operations:
– atomic,
– support upgrade: a transaction that holds a shared lock can be upgraded to hold an exclusive lock.
• Any level of granularity can be locked:
– database, table, block, tuple.
– Why is this necessary?
Multiple-Granularity Locks
• If a transaction needs to scan all records in a table, we don't really want a lock on every tuple individually: significant locking overhead!
• Put a single lock on the table.
• Granularity hierarchy: the Database contains Tables, which contain Pages, which contain Tuples.
• A lock on a node implicitly locks all descendants.
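One way to read the implicit-lock rule: model each lockable node as its path from the database root, so a node is covered whenever any prefix of its path holds a lock. A minimal sketch, with the tuple-path encoding invented for illustration (intention locks, covered next week, are what make this efficient in practice):

```python
def covered(locked, path):
    """Return True if `path` is locked, directly or via an ancestor.
    `locked` is a set of node paths; ancestors are path prefixes."""
    # e.g. ("db", "t1", "p3", "r7") = database db, table t1, page p3, tuple r7
    return any(path[:i] in locked for i in range(1, len(path) + 1))
```

A single lock on a table then covers every page and tuple beneath it, which is exactly the cheap whole-table scan the slide wants.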
Aborting a Transaction
• If a transaction Ti is aborted, all its actions have to be undone.
– If Tj reads an object last written by Ti, Tj must be aborted as well!
• Most systems avoid such cascading aborts by releasing a transaction’s locks only at commit time.
– If Ti writes an object, Tj can read this only after Ti commits.
• In order to undo the actions of an aborted transaction, the DBMS maintains a log in which every write is recorded.
– The same mechanism is used to recover from system crashes; all active txns at the time of the crash are aborted when the system recovers
Performance Considerations (Again!)
• The 2PL protocol allows transactions to proceed with maximum parallelism.
– The locking algorithm only delays actions that would cause conflicts.
• But the locks are still a bottleneck.
– Need to ensure the lowest possible level of locking granularity.
• A classic memory/performance trade-off.
– Conflict-serialization is too conservative,
• but other methods of serialization are too complex.
– A use case that occurs quite often should be optimized:
• Besides scanning through the table, if we need to modify a few tuples, what kind of lock do we put on the table?
• It has to be X (if we only have S or X),
• but that blocks all other read requests!
– This concurrency control is pessimistic: it acquires and releases locks.
• Alternative: Optimistic Concurrency Control.
Next Week
• Intention Locks
• Optimistic Concurrency Control
• Distributed Commit
• Please read ahead of time:
– "The End of an Architectural Era (It's Time for a Complete Rewrite)", Stonebraker et al., Proc. VLDB, 2007.
– "OLTP Through the Looking Glass, and What We Found There", Harizopoulos et al., Proc. ACM SIGMOD, 2008.