• Query Processing and Query Optimization
  Prof. Sin-Min Lee
  Department of Computer Science
  San Jose State University

  • Transaction Management
    Until now, our concept of a database has been one in which programs accessing the database run one at a time (serially). Often this is indeed the case. However, there are also numerous applications in which more than one program, or different executions of the same program, run simultaneously (concurrently). This is where TRANSACTION MANAGEMENT comes in.

  • Example #1
    An airline reservation system, where at any one time several agents may be selling tickets, and therefore changing lists of passengers and counts of available seats. The problem here is that if we are not careful when we allow two or more processes to access the database, we could sell the same seat twice. Two processes that read and change the value of the same object must not be allowed to run concurrently, because they might interact in undesirable ways. The concern here is SAFETY.

  • Example #2
    A statistical database, such as census data, where many people may be querying the database at once. Here, as long as no one is changing the data, we do not really care in what order the processes read it. We can let the operating system schedule simultaneous read requests. We want to allow maximum concurrent operation, so time can be saved. The concern here is SPEED.

  • Important Terms
    Serializable schedule -- a linear arrangement of the database calls from several transactions, with the property that the final database state obtained by executing the calls in schedule order is the same as that obtained by running the transactions in some (unspecified) serial order.

  • Important Terms (cont)
    Lock -- an access privilege on a database object, which the DBMS grants to a particular transaction. To allow the privileged transaction to complete its work without undue interference, the lock restricts the access of competing transactions to the database object.

  • Important Terms (cont)
    Two-phase locking protocol -- a discipline that transactions can use to ensure serializability. The first phase is characterized by the monotonic acquisition of new locks or the strengthening of existing locks. The second phase involves the monotonic downgrade or release of existing locks. This protocol is the most important technique for managing concurrency.

  • Important Terms (cont)
    Strict two-phase locking protocol -- each transaction holds all its locks until it commits or rolls back. When the commit or rollback is secure, it releases all the locks. In other words, the shrinking phase occurs all at once, immediately after the transaction terminates.
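    The locking behavior described above can be sketched in code. The following is a minimal, hypothetical lock-manager sketch (the class and method names are invented; a real lock manager would queue waiting transactions and detect deadlocks rather than simply refusing a conflicting request):

```python
# Illustrative sketch of strict two-phase locking (names are invented).
# A transaction acquires locks as it goes and releases them all at once
# at commit or rollback -- the all-at-once "shrinking phase".

class LockManager:
    def __init__(self):
        self.table = {}  # item I -> (lock type L, set of owning transactions T)

    def acquire(self, txn, item, mode):
        """Grant the lock if compatible, else refuse (a real DBMS would wait)."""
        held = self.table.get(item)
        if held is None:
            self.table[item] = (mode, {txn})
            return True
        held_mode, owners = held
        # Shared (S) locks are compatible with each other; anything
        # involving an exclusive (X) lock conflicts.
        if mode == "S" and held_mode == "S":
            owners.add(txn)
            return True
        return owners == {txn}  # ok only if txn is already the sole owner

    def release_all(self, txn):
        """Strict 2PL: drop every lock the transaction holds, all at once."""
        for item in list(self.table):
            mode, owners = self.table[item]
            owners.discard(txn)
            if not owners:
                del self.table[item]

lm = LockManager()
assert lm.acquire("T1", "A", "X")      # T1 gets an exclusive lock on A
assert not lm.acquire("T2", "A", "S")  # T2 is refused: conflict on A
lm.release_all("T1")                   # T1 commits: shrinking phase
assert lm.acquire("T2", "A", "S")      # now T2 can proceed
```

    With strict two-phase locking, every lock a transaction holds is released in a single `release_all` call at commit or rollback, matching the all-at-once shrinking phase described above.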

  • Important Terms (cont)
    Timestamping -- another method for securing concurrency in the face of conflicting transactions.
    Timestamp -- the time of origin of each transaction, assigned by the transaction-manager portion of the DBMS.
    Items -- the units of data to which access is controlled.

  • Important Terms (cont)
    Transaction -- simply a single execution of a program. This program may be a simple query expressed in one of the query languages, or an elaborate host-language program with embedded calls to a query language. Several independent executions of the same program may be in progress simultaneously; each is a different transaction.

  • Main Idea
    To a large extent, transaction management can be seen as an attempt to make complex operations appear atomic. That is, they either occur in their entirety or do not occur at all, and if they occur, nothing else apparently went on during the time of their occurrence. The normal approach to ensuring atomicity of transactions is serialization, which forces transactions to run concurrently in a way that makes it appear that they ran one at a time (serially).

  • Main Idea (cont)
    In reality, transactions are sequences of more elementary steps, such as reading or writing single items from the database and performing simple arithmetic steps in the workspace. When concurrency control is provided, other primitive steps are also needed: steps that set and release locks, commit (complete) transactions, and so on.

  • Items
    To manage concurrency, the database must be partitioned into items, which are the units of data to which access is controlled. The nature and size of items are for the system designer to choose. In the relational model of data, for example, we could choose large items, like relations, or small items, like individual tuples or even components of tuples.

  • Locks
    The most common way in which access to items is controlled is by locks. The lock manager is the part of the DBMS that records, for each item I, whether one or more transactions are reading or writing any part of I. If so, the manager will forbid another transaction from gaining access to I, provided the type of access (read or write) could cause a conflict, such as the duplicate selling of an airline seat.

  • Locks (cont)
    As it is typical for only a small subset of the items to have locks on them at any one time, the lock manager can store the current locks in a lock table, which consists of records (I, L, T). The meaning of record (I, L, T) is that transaction T has a lock of type L on item I.

  • Example of locks
    Let's consider two transactions T1 and T2. Each accesses an item A, which we assume has an integer value, and adds one to A:
      Read A; A := A + 1; Write A;
    T1: Read A   A := A + 1   Write A
    T2: Read A   A := A + 1   Write A

  • Example of locks (cont)
    The most common solution to this problem is to provide a lock on A. Before reading A, a transaction T must lock A, which prevents another transaction from accessing A until T is finished with A. Furthermore, the need for T to set a lock on A prevents T from accessing A if some other transaction is already using A. T must wait until the other transaction unlocks A, which it should do only after finishing with A.
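    The danger in the T1/T2 example can be demonstrated concretely. This is an illustrative Python sketch (the initial value 5 and the schedules are made up) showing how an interleaved schedule loses an update while a serial schedule does not:

```python
# Sketch of the lost-update problem: two transactions each read A,
# add one, and write A. Interleaved, one increment is lost.

db = {"A": 5}

# Interleaved schedule: both transactions read before either writes.
t1_local = db["A"]      # T1: Read A  -> 5
t2_local = db["A"]      # T2: Read A  -> 5
db["A"] = t1_local + 1  # T1: Write A -> 6
db["A"] = t2_local + 1  # T2: Write A -> 6  (T1's update is lost)
assert db["A"] == 6     # should have been 7 after two increments

# Serial schedule: each transaction runs to completion in turn.
db["A"] = 5
for _ in range(2):
    local = db["A"]     # Read A
    db["A"] = local + 1 # A := A + 1; Write A
assert db["A"] == 7     # both increments survive
```

    Locking A forces the interleaved schedule into the serial shape: the second transaction cannot read A until the first has written it and released the lock.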

  • Introduction
    Goals:
      get the right answer to a query
      get it fast

    Possible query representations (after parsing):
      relational algebra (not usually)
      query graph (usually)

  • Strategies for making relational algebra queries better
    Push down selects, e.g., given r1(r1no, a, b, c, d) and r2(r2no, x, y, z, r1no):
      σ_{r1.a = 7}(r1 JOIN r2)  →  (σ_{r1.a = 7} r1) JOIN r2
    Push down projects:
      π_{r1.a, r1.b}(r1 JOIN r2)  →  π_{r1.a, r1.b}((π_{r1no, a, b}(r1)) JOIN r2)

  • Strategies for making relational algebra queries better (cont)
    Eliminate products:
      σ_{r1.r1no = r2.r1no}(r1 × r2)  →  r1 JOIN_{r1.r1no = r2.r1no} r2

    Replace σ_{p1 ∧ p2}(e1) by σ_{p1}(σ_{p2}(e1)).

  • Strategies for making relational algebra queries better (cont)
    ...maybe there is an index to support p2.
    E.g., for σ_{(sal > 10000) ∧ (name = 'elmo')}(emp):
      do name = 'elmo' first, with the index
      test the results for sal > 10000

  • Strategies for making relational algebra queries better (cont)
    Join relations in increasing order of size to keep intermediate results small (saves time). E.g., with |e1| = 1, |e2| = 200, |e3| = 100000:
      e1 JOIN (e2 JOIN e3)  →  (e1 JOIN e2) JOIN e3
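    As a rough illustration of why this ordering matters, the sketch below compares the intermediate result sizes of the two orderings, under a simplifying (and purely assumed) model in which a key/foreign-key join produces about as many tuples as its larger input:

```python
# Back-of-the-envelope check of the join-ordering heuristic.
# est_join is a hypothetical estimator: assume a key/foreign-key join,
# so the result is roughly the size of the larger input.

sizes = {"e1": 1, "e2": 200, "e3": 100_000}

def est_join(n_left, n_right):
    return max(n_left, n_right)  # crude, assumed model -- not a real formula

# e1 JOIN (e2 JOIN e3): the intermediate e2 JOIN e3 has ~100000 tuples.
inner_first = est_join(sizes["e2"], sizes["e3"])
# (e1 JOIN e2) JOIN e3: the intermediate e1 JOIN e2 has only ~200 tuples.
outer_first = est_join(sizes["e1"], sizes["e2"])

assert inner_first == 100_000
assert outer_first == 200   # far smaller intermediate result
```

    Under this assumed model, joining the small relations first shrinks the intermediate result by several orders of magnitude, which is the point of the heuristic.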

  • Query Processing Cost Estimation
    Catalog statistics:
      r = a relation
      n_r = |r| = # tuples in r
      s_r = size of a record of r, in bytes
      V(A, r) = # unique values of attribute A in r
    Given these statistics:
      |r × s| = n_r · n_s
      size of a tuple of r × s = s_r + s_s

  • Query Processing Cost Estimation (cont)
    If a is a constant, then
      |σ_{r.A = a}(r)| ≈ (1 / V(A, r)) · n_r

    Note: histograms can be better for selectivity estimation than the V(A, r) statistic.

  • Selection Predicate Selectivity Estimates

    sel(r.A = a) ≈ 1 / V(A, r)
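    A quick numeric sketch of this estimate (the statistics are made up): assuming values of A are uniformly distributed over the V(A, r) distinct values, the expected size of σ_{r.A = a}(r) is n_r / V(A, r).

```python
# Uniform-distribution selectivity estimate: sel(r.A = a) ~ 1 / V(A, r),
# so the estimated result size is n_r / V(A, r).

def eq_selectivity(v_a):
    """Assume A's values are spread uniformly over v_a distinct values."""
    return 1.0 / v_a

n_r = 10_000   # tuples in r (made-up statistic)
v_a = 50       # distinct values of attribute A (made-up statistic)

est_size = n_r * eq_selectivity(v_a)
assert abs(est_size - 200.0) < 1e-9   # ~200 tuples expected to satisfy A = a
```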


  • Join Predicate Selectivity
    What is |r1(R1) JOIN r2(R2)|?
    If R1 ∩ R2 is a key of r1, then at most one tuple of r1 joins to each r2 tuple. Hence |r1 JOIN r2| ≤ |r2|.
    If R1 ∩ R2 is a key of neither r1 nor r2, then assume R1 ∩ R2 = {A}. A tuple t ∈ r1 produces approximately n_r2 / V(A, r2) tuples when joined to r2. Using the above, |r1 JOIN r2| ≈ (n_r1 · n_r2) / V(A, r2).

  • Join Predicate Selectivity (cont)
    If we had used t ∈ r2 instead of t ∈ r1 to compute the above estimate, we would have gotten |r1 JOIN r2| ≈ (n_r1 · n_r2) / V(A, r1).

    If V(A, r1) ≠ V(A, r2), there are likely to be some dangling tuples that don't participate in the join. Hence, the smaller of the two join-size estimates is probably better.
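    Putting the two estimates together, here is a hedged sketch of the "take the smaller estimate" rule (the function name and the statistics are invented for illustration):

```python
# Join-size estimate: with joining attribute A,
#   |r1 JOIN r2| ~ n1 * n2 / V(A, r2)   and, symmetrically,
#   |r1 JOIN r2| ~ n1 * n2 / V(A, r1);
# the smaller of the two is usually the better estimate.

def join_size_estimate(n1, n2, v_a_r1, v_a_r2):
    est_via_r2 = n1 * n2 / v_a_r2
    est_via_r1 = n1 * n2 / v_a_r1
    return min(est_via_r1, est_via_r2)

# Made-up statistics: n1 = 1000, n2 = 5000, V(A,r1) = 100, V(A,r2) = 200.
assert join_size_estimate(1_000, 5_000, 100, 200) == 25_000.0
```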

  • Maintaining Statistics
    Doing it after every update is too much overhead.

    Hence, most systems gather statistics periodically during slack time (often this is initiated manually).

  • 1-table selection subquery cost
    Without an index, we must use a sequential scan.
      C_I/O = cost of an I/O ≈ 11 msec
      C_CPU = cost of CPU to process 1 tuple ≈ 0.1 msec

    cost(σ_p(r)) = n_r · C_CPU + pages(r) · C_I/O

  • 1-table selection subquery cost (cont)
    With a clustered index (B+ tree on r.A), for p = (r.A = a):
      cost(σ_p(r)) = (1/V(A,r)) · n_r · C_CPU + (height(r.A B+ tree) + (1/V(A,r)) · pages(r)) · C_I/O

  • 1-table selection subquery cost (cont)
    With an unclustered index (secondary B+ tree on r.A), for p = (r.A = a):

    cost(σ_p(r)) = (1/V(A,r)) · n_r · C_CPU + (height(r.A B+ tree) + (1/V(A,r)) · n_r) · C_I/O
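    The three cost formulas above can be collected into one small sketch, using the slides' constants C_I/O ≈ 11 msec and C_CPU ≈ 0.1 msec (the relation statistics in the example are made up):

```python
# Selection-cost formulas: sequential scan, clustered index, unclustered index.
C_IO, C_CPU = 11.0, 0.1   # milliseconds per I/O, per tuple processed

def seq_scan_cost(n_r, pages_r):
    # Touch every tuple and read every page.
    return n_r * C_CPU + pages_r * C_IO

def clustered_index_cost(n_r, pages_r, v_a, height):
    # Read only the fraction 1/V(A,r) of tuples and of (contiguous) pages,
    # plus the B+ tree traversal.
    frac = 1.0 / v_a
    return frac * n_r * C_CPU + (height + frac * pages_r) * C_IO

def unclustered_index_cost(n_r, v_a, height):
    # Unclustered: each matching tuple may cost its own page I/O,
    # so n_r replaces pages(r) in the I/O term.
    frac = 1.0 / v_a
    return frac * n_r * C_CPU + (height + frac * n_r) * C_IO

# Made-up relation: 10,000 tuples, 500 pages, V(A,r) = 100, tree height 3.
assert abs(seq_scan_cost(10_000, 500) - 6_500.0) < 1e-6
```

    With these made-up statistics, the clustered-index cost comes to about 98 msec and the unclustered-index cost to about 1143 msec, against 6500 msec for the sequential scan, which is why the clustered case is so attractive for selective predicates.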

  • Exercise
    What is the cost of solving σ_{a1 < r.A < a2}(r) using a clustered B+ tree on r.A?

  • Join Strategies
    Nested Loop Join (Simple Iteration), NLJ. To solve r1 JOIN_p r2:
      for each t1 ∈ r1 do begin
        for each t2 ∈ r2 do begin
          if p(t1, t2) = true then output (t1, t2) as part of the result
        end
      end
    NOTE: the outer scan is over r1, the inner scan is over r2 (this could have been reversed).

  • QUIZ
    Should the outer scan be over the biggest or the smallest table?
    Hint: consider the buffer pool size.

    Variation of nested loop join: block-oriented iteration. Scan the inner table once per outer-table block instead of once per outer-table tuple.
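    The tuple-at-a-time nested loop join is straightforward to write out. Here is a runnable sketch (relations are represented as plain lists of tuples, an illustrative choice):

```python
# Nested loop join: for each outer tuple, scan the entire inner relation
# and emit every pair satisfying the join predicate p.

def nested_loop_join(r1, r2, p):
    out = []
    for t1 in r1:            # outer scan over r1
        for t2 in r2:        # inner scan over r2, once per outer tuple
            if p(t1, t2):
                out.append((t1, t2))
    return out

# Illustrative relations; join on t1's second field = t2's first field.
r1 = [("a", 1), ("b", 2)]
r2 = [(1, "p"), (2, "q"), (2, "r")]
result = nested_loop_join(r1, r2, lambda t1, t2: t1[1] == t2[0])
assert result == [(("a", 1), (1, "p")),
                  (("b", 2), (2, "q")),
                  (("b", 2), (2, "r"))]
```

    The cost grows as |r1| · |r2| inner-tuple visits, which is what the block-oriented variation and the index-based variants try to reduce.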

  • Sort Merge Join
    To do r1 JOIN_{r1.a = r2.b} r2:
      sort r1 on a
      sort r2 on b
      merge:
        scan r1 sequentially
        iterate over only the needed portion of r2 for each r1 tuple

  • Sort Merge Join (cont)
    Example relations:
      r1(x, a): (a,1), (b,1), (c,2), (d,2), (e,3)
      r2(b, y): (1,p), (2,q), (2,r), (3,s)
    cost = sort cost + constant · |r1 JOIN r2|
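    A runnable sketch of the merge step: after sorting, r1 is scanned sequentially and, for each r1 tuple, only the matching run of r2 is visited. The tuples below are illustrative instances of r1(x, a) and r2(b, y):

```python
# Sort-merge join on r1.a = r2.b. Each relation is a list of tuples:
# r1 tuples are (x, a), r2 tuples are (b, y).

def sort_merge_join(r1, r2):
    r1 = sorted(r1, key=lambda t: t[1])   # sort r1 on a
    r2 = sorted(r2, key=lambda t: t[0])   # sort r2 on b
    out, j = [], 0
    for t1 in r1:                         # scan r1 sequentially
        # advance to the first r2 tuple with b >= t1.a
        while j < len(r2) and r2[j][0] < t1[1]:
            j += 1
        k = j                             # visit only the matching run of r2
        while k < len(r2) and r2[k][0] == t1[1]:
            out.append((t1, r2[k]))
            k += 1
    return out

r1 = [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("e", 3)]
r2 = [(1, "p"), (2, "q"), (2, "r"), (3, "s")]
assert len(sort_merge_join(r1, r2)) == 7   # 1+1+2+2+1 matching pairs
```

    After the sorts, the merge touches each relation roughly once plus the size of the join output, which is where the "sort cost + constant · |r1 JOIN r2|" estimate comes from.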

  • Use of an Index
    In nested loop join, if there is an index on the join attribute of the inner table, you can use it.
    In sort-merge join, if there is a B+ tree index (preferably clustered) on the join attribute(s), you can avoid sorting that table (read it in sorted order).

  • Conclusion
    Query processing:
      query processing operations: sequential scan, index scan, join (nested loop, sort merge)
      query processing operation costs
    Next time: hash join
