department of computer science and engineering, hkust slide 1 1314. query processing and...

1314. Query Processing and Optimization

Introduction
- Users are expected to write efficient queries, but they do not always do so.
- Users typically do not have enough information about the database to write efficient queries (e.g., no information on table sizes).
- Users cannot tell whether a query is efficient without knowing how the DBMS's query processor works.
- The DBMS's job is to optimize the user's query by:
  - converting the query to an internal representation (tree or graph);
  - evaluating the costs of several possible ways of executing the query and finding the best one.

Steps in Query Processing
[Figure: SQL query → Query Parsing → Parse Tree → Query Optimization → Execution Plan → Code Generation → Code → Runtime DB Processor → Result]

Select Operation
- File scan: scan all records of the file to find the records that satisfy the selection condition.
- Binary search: applicable when the file is sorted on the attributes specified in the selection condition.
- Index scan: use an index to locate the qualifying records.
  - Primary index, single-record retrieval: equality comparison on a primary key attribute with a primary index.
  - Primary index, multiple-record retrieval: a range comparison (>, >=, <, <=) on a key field with a primary index.
  - Clustering index to retrieve multiple records.
  - Secondary index to retrieve single or multiple records.
- When would a file scan be better than an index scan?
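One way to approach the closing question is to compare rough block-access counts. The sketch below assumes a simplified cost model that is not stated on the slide: a file scan reads every block once, while a secondary index scan pays the index height plus one block access per matching record (matching records are assumed to lie in different blocks). The function names and all numbers are illustrative.

```python
def file_scan_cost(n_blocks):
    """Block accesses needed to read the whole file once."""
    return n_blocks


def secondary_index_scan_cost(index_height, matching_records):
    """Assumed model: traverse the index, then one block access per match."""
    return index_height + matching_records


# Hypothetical file: 10,000 records, 25 records per block, index of height 3.
n_records, records_per_block, index_height = 10_000, 25, 3
n_blocks = n_records // records_per_block

for matches in (10, 5_000):
    scan = file_scan_cost(n_blocks)
    index = secondary_index_scan_cost(index_height, matches)
    winner = "index scan" if index < scan else "file scan"
    print(f"{matches} matching records: file={scan}, index={index} -> {winner}")
```

Under these assumptions the index wins when few records qualify, and the file scan wins once a large fraction of the file qualifies.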

Conjunctive Conditions
- Form: OP1 AND OP2 (e.g., EmpNo = 123 AND Age = 30).
- Conjunctive selection: evaluate the condition that has an index (i.e., the one that can be evaluated very fast), retrieve the qualifying tuples, and then check whether these tuples satisfy the remaining conditions.
- Conjunctive selection using a composite index: if a composite index exists on the attributes involved in one or more conditions, use the composite index to find the qualifying tuples.
- Conjunctive selection by intersection of record pointers: if secondary indexes are available, evaluate each condition through its index and intersect the resulting sets of record pointers.
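The intersection-of-record-pointers strategy can be sketched as follows. The two dictionaries stand in for secondary indexes that map an attribute value to a set of record ids; their contents and the function name `conjunctive_select` are made up for illustration.

```python
# Hypothetical secondary indexes: attribute value -> set of record pointers.
emp_no_index = {123: {7}, 456: {2}}
age_index = {30: {2, 7, 9}, 41: {4}}


def conjunctive_select(lookups):
    """Intersect the pointer sets returned by each (index, value) lookup."""
    pointer_sets = [index.get(value, set()) for index, value in lookups]
    if not pointer_sets:
        return set()
    # Start from the smallest set so the running intersection stays small.
    pointer_sets.sort(key=len)
    result = set(pointer_sets[0])
    for pointers in pointer_sets[1:]:
        result &= pointers
    return result


# EmpNo = 123 AND Age = 30
matches = conjunctive_select([(emp_no_index, 123), (age_index, 30)])
print(matches)
```

Only the records whose pointers survive the intersection need to be fetched from the data file.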

Conjunctive Conditions (cont.)
- When more than one attribute has an index, use the index that costs the least and returns the smallest number of qualifying tuples.
- Disjunctive select conditions (OP1 OR OP2) are much more costly:
  - potentially a large number of tuples will qualify;
  - evaluation is costly if any one of the conditions doesn't have an index.
- The selectivity of a condition is the number of tuples that satisfy the condition divided by the total number of tuples. The smaller the selectivity, the fewer tuples are retrieved, and the more desirable it is to use that condition to retrieve the records.
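The selectivity definition above translates directly into a small calculation. In this sketch the matching-tuple counts are invented estimates, and `most_selective` is an illustrative helper rather than part of any DBMS.

```python
def selectivity(matching_tuples, total_tuples):
    """Fraction of the relation's tuples that satisfy a condition."""
    return matching_tuples / total_tuples


def most_selective(estimates, total_tuples):
    """Pick the condition with the smallest selectivity."""
    return min(estimates, key=lambda cond: selectivity(estimates[cond], total_tuples))


# Hypothetical estimates for a relation of 10,000 tuples.
estimates = {"EmpNo = 123": 1, "Age = 30": 400}
best = most_selective(estimates, 10_000)
print(best, selectivity(estimates[best], 10_000))
```

Here the condition on EmpNo would be evaluated first, and the Age condition would be checked only on the tuples it returns.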

Join Operation
- Join is one of the most time-consuming operations in query processing.
- A two-way join is a join of two relations, and there are many algorithms to evaluate it.
- A multi-way join is a join of more than two relations; different orders of evaluating a multi-way join have different speeds.
- We shall study methods for implementing two-way joins of the form $R \bowtie_{A=B} S$.

Join Algorithm: Nested (Inner-Outer) Loop
- Nested (inner-outer) loop: for each record r in R (outer loop), retrieve every record s from S (inner loop) and check whether r[A] = s[B].
- $R \bowtie_{A=B} S$:
    for each tuple r in R do
        for each tuple s in S do
            if r[A] = s[B] then output result
        end
    end
- R and S can be reversed.
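A minimal runnable version of the double loop, assuming each relation is a list of dictionaries and the join attributes are passed as key names (the sample rows are invented):

```python
def nested_loop_join(R, S, a, b):
    """Return every pair (r, s) with r[a] == s[b], scanning S once per r."""
    result = []
    for r in R:          # outer loop
        for s in S:      # inner loop
            if r[a] == s[b]:
                result.append((r, s))
    return result


R = [{"A": 1}, {"A": 2}]
S = [{"B": 2}, {"B": 2}, {"B": 3}]
pairs = nested_loop_join(R, S, "A", "B")
print(pairs)
```

Swapping the roles of R and S yields the same matches (with each pair reversed); only the number of passes over each relation changes.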

When One Join Attribute Is Indexed
- If an index (or hash key) exists, say, on attribute B of S, should we put R or S in the outer loop? Why?
- Records in the outer relation are accessed sequentially, so an index on the outer relation doesn't help.
- Records in the inner relation are accessed randomly, so an index can retrieve all records in the inner relation that satisfy the join condition.
    for each tuple r in R do
        look up r[A] in S
        if found then output result
    end
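The indexed variant can be sketched with an in-memory dictionary standing in for the index on S.B. `build_index` and `index_nested_loop_join` are illustrative names, and a real index would live on disk rather than in a Python dict.

```python
from collections import defaultdict


def build_index(S, b):
    """Stand-in for an index on S.b: value -> list of matching records."""
    index = defaultdict(list)
    for s in S:
        index[s[b]].append(s)
    return index


def index_nested_loop_join(R, a, s_index):
    """Scan R sequentially (outer) and probe the index on S (inner)."""
    return [(r, s) for r in R for s in s_index.get(r[a], [])]


R = [{"A": 1}, {"A": 2}, {"A": 4}]
S = [{"B": 2}, {"B": 2}, {"B": 3}]
pairs = index_nested_loop_join(R, "A", build_index(S, "B"))
print(pairs)
```

Each outer record triggers one index probe instead of a full scan of S, which is why the indexed relation belongs in the inner loop.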

Sort-Merge Join
- Sort-merge join: if the records of R and S are sorted on the join attributes A and B, respectively, then the relations are scanned in, say, ascending order, matching the records that have the same values for A and B.
- For $R \bowtie_{A=B} S$, R and S are each scanned only once.
- Even if the relations are not sorted, it is better to sort them first and do a sort-merge join than to do a double-loop join.
- Cost, with n and m the sizes of R and S:
  - if R and S are sorted: $n + m$
  - if not sorted: $n \log n + m \log m + m + n$
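A compact sketch of the merge phase, assuming in-memory lists of dictionaries. It sorts both inputs first, so it corresponds to the "not sorted" case above; the handling of duplicate join values is one common choice rather than something the slide specifies.

```python
def sort_merge_join(R, S, a, b):
    """Sort both inputs on the join attributes, then merge them in one pass."""
    R = sorted(R, key=lambda r: r[a])
    S = sorted(S, key=lambda s: s[b])
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            value = R[i][a]
            # Find the end of the run of S records sharing this value.
            j_end = j
            while j_end < len(S) and S[j_end][b] == value:
                j_end += 1
            # Pair every R record in its run with every S record in the run.
            while i < len(R) and R[i][a] == value:
                for k in range(j, j_end):
                    result.append((R[i], S[k]))
                i += 1
            j = j_end
    return result


R = [{"A": 2}, {"A": 1}, {"A": 2}]
S = [{"B": 2}, {"B": 3}, {"B": 2}]
pairs = sort_merge_join(R, S, "A", "B")
print(len(pairs))
```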

Hash Join Method
- Hash join: R and S are both hashed to the same hash file based on the join attributes. Tuples in the same bucket are then joined.
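The idea can be sketched with in-memory bucket lists standing in for the hash file. The bucket count and the use of Python's built-in `hash` are assumptions made for illustration.

```python
def hash_join(R, S, a, b, n_buckets=4):
    """Partition R and S with the same hash function, then join bucket by bucket."""
    r_buckets = [[] for _ in range(n_buckets)]
    s_buckets = [[] for _ in range(n_buckets)]
    for r in R:
        r_buckets[hash(r[a]) % n_buckets].append(r)
    for s in S:
        s_buckets[hash(s[b]) % n_buckets].append(s)

    result = []
    for r_bucket, s_bucket in zip(r_buckets, s_buckets):
        for r in r_bucket:
            for s in s_bucket:
                # Different values can share a bucket, so re-check equality.
                if r[a] == s[b]:
                    result.append((r, s))
    return result


R = [{"A": 1}, {"A": 5}, {"A": 2}]
S = [{"B": 5}, {"B": 1}, {"B": 9}]
pairs = hash_join(R, S, "A", "B")
print(pairs)
```

Matching tuples always land in the same bucket, so tuples from different buckets never need to be compared.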

Hints on Evaluating Joins
- Disk accesses are based on blocks, not individual tuples.
- A main-memory buffer can significantly reduce the number of disk accesses.
- Use the smaller relation in the outer loop in the nested-loop method.
- Consider the cases where 1 buffer is available, 2 buffers, m buffers.
- When an index is available, either the smaller relation or the one with a large number of matching tuples should be used in the outer loop.
- If the join attributes are not indexed, it may be faster to create the indexes on the fly (hash join is close to generating a hash index on the fly).
- Sort-merge is the most efficient; the relations are often sorted already.
- Hash join is efficient if the hash file can be kept in main memory.

Query Optimization
- Given a relational algebra expression, how do we transform it into a more efficient one?
- Use the query tree as a tool to rearrange the operations of the relational algebra expression.

A Query Tree
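Employee(EmpNo, EmpName, Address, Birthdate, DeptNo)
Department(DeptNo, DeptName, MgrNo)
Project(ProjNo, ProjName, ProjLocation, DeptNo)
WorksOn(EmpNo, ProjNo, Hours)
[Figure: query tree. Root: $\pi_{\text{ProjNo, DeptNo, EmpName, Address, Birthdate}}$; below it, join (3) on MgrNo = EmpNo with Employee; below that, join (2) on DeptNo = DeptNo with Department; at the bottom, selection (1) $\sigma_{\text{ProjLocation = 'Stafford'}}$ on Project. The numbers (1), (2), (3) give the order of execution.]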

Structure and Execution of a Query Tree
- A query tree is a tree structure that corresponds to a relational algebra expression: the input relations are the leaf nodes and the relational algebra operations are the internal nodes.
- Executing the query tree consists of executing an internal node's operation whenever its operands are available, and then replacing that internal node by the relation that results from the operation.
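The bottom-up execution described above can be sketched with nested tuples as tree nodes. The node encoding, the `execute` function, and the sample rows are all assumptions made for illustration; projection here does not eliminate duplicates.

```python
def execute(node, database):
    """Evaluate a query tree: leaves are relation names, internal nodes are operations."""
    if isinstance(node, str):                      # leaf: an input relation
        return database[node]
    op = node[0]
    if op == "select":
        _, predicate, child = node
        return [t for t in execute(child, database) if predicate(t)]
    if op == "project":
        _, attributes, child = node
        return [{a: t[a] for a in attributes} for t in execute(child, database)]
    if op == "join":
        _, predicate, left, right = node
        left_rows = execute(left, database)
        right_rows = execute(right, database)
        return [{**l, **r} for l in left_rows for r in right_rows if predicate(l, r)]
    raise ValueError(f"unknown operation: {op}")


# Hypothetical rows for two of the relations in the schema.
database = {
    "Project": [
        {"ProjNo": 10, "ProjLocation": "Stafford", "DeptNo": 4},
        {"ProjNo": 20, "ProjLocation": "Houston", "DeptNo": 5},
    ],
    "Department": [{"DeptNo": 4, "MgrNo": 987}, {"DeptNo": 5, "MgrNo": 333}],
}

tree = (
    "project", ["ProjNo", "MgrNo"],
    ("join", lambda p, d: p["DeptNo"] == d["DeptNo"],
        ("select", lambda t: t["ProjLocation"] == "Stafford", "Project"),
        "Department"),
)
rows = execute(tree, database)
print(rows)
```

Each recursive call returns the relation that replaces the corresponding internal node, so the selection on Project runs before the join that consumes it.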

Heuristics for Optimizing a Query
- A query may have several equivalent query trees.
- A query parser generates a standard canonical query tree from a SQL query:
  - Cartesian products are applied first (FROM),
  - then the conditions (WHERE),
  - and finally the projection (SELECT).

Heuristics for Optimizing a Query
- The query optimizer transforms this canonical query into an efficient final query.
select ProjNo, DeptNo, EmpName, Address, Birthdate
from Project, Department, Employee
where ProjLocation = 'Stafford' and
      MgrNo = EmpNo and
      Project.DeptNo = Department.DeptNo

Example
Find the names of employees born after 1957 who work on a project named 'Aquarius':
select EmpName
from Employee, WorksOn, Project
where ProjName = 'Aquarius' AND
      Project.ProjNo = WorksOn.ProjNo AND
      Employee.EmpNo = WorksOn.EmpNo AND
      Birthdate > 'DEC-31-1957'
WorksOn (EmpNo, ProjNo, Hours)

Example: Push all the conditions as far down the tree as possible

Example: Rearrange the join sequence according to estimates of relation sizes

Example: Replace each cross product and selection sequence with a join operation

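Example: Push projection as far down the query tree as possible
[Figure: final query tree, equivalent to $\pi_{\text{EmpName}}\big(\pi_{\text{EmpNo}}(\pi_{\text{ProjNo}}(\sigma_{\text{ProjName = 'Aquarius'}}(\text{Project})) \bowtie_{\text{ProjNo = ProjNo}} \pi_{\text{EmpNo, ProjNo}}(\text{WorksOn})) \bowtie_{\text{EmpNo = EmpNo}} \pi_{\text{EmpNo, EmpName}}(\sigma_{\text{Birthdate > 'DEC-31-1957'}}(\text{Employee}))\big)$. The root is labelled LNAME on the slide.]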

Transformation Rules
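1. Cascade of $\sigma$: a conjunctive selection condition can be broken up into a cascade (sequence) of individual $\sigma$ operations:
   $\sigma_{c_1 \text{ AND } c_2 \text{ AND } \dots \text{ AND } c_n}(R) \equiv \sigma_{c_1}(\sigma_{c_2}(\dots(\sigma_{c_n}(R))\dots))$
2. Commutativity of $\sigma$:
   $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$
3. Cascade of $\pi$:
   $\pi_{List_1}(\pi_{List_2}(\dots(\pi_{List_n}(R))\dots)) \equiv \pi_{List_1}(R)$
   if $List_1$ is included in $List_2, \dots, List_n$; the result is null if $List_1$ is not in any of $List_2, \dots, List_n$.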

Transformation Rules (Cont.)
4. Commuting $\sigma$ with $\pi$: if the selection condition c involves only attributes that are in the projection list $List_1$:
   $\pi_{List_1}(\sigma_c(R)) \equiv \sigma_c(\pi_{List_1}(R))$
5. Commutativity of JOIN or $\times$:
   $R \bowtie S \equiv S \bowtie R$
6. Commuting $\sigma$ with JOIN: if all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say, R:
   $\sigma_c(R \bowtie S) \equiv (\sigma_c(R)) \bowtie S$

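Transformation Rules (Cont.)
7. Commuting $\pi$ with JOIN: if List can be separated into $List_1$ and $List_2$ involving only attributes from R and S, respectively, and the join condition c involves only attributes in List:
   $\pi_{List}(R \bowtie_c S) \equiv (\pi_{List_1}(R)) \bowtie_c (\pi_{List_2}(S))$
8. Commuting set operations: $\cup$ and $\cap$ are commutative.
9. JOIN, $\times$, $\cup$, and $\cap$ are associative.
10. $\sigma$ distributes over $\cup$, $\cap$, and $-$, e.g.:
   $\sigma_c(R \cup S) \equiv \sigma_c(R) \cup \sigma_c(S)$
11. $\pi$ distributes over $\cup$:
   $\pi_{List}(R \cup S) \equiv (\pi_{List}(R)) \cup (\pi_{List}(S))$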
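Rule 6 (commuting a selection with a join) can be checked on a small example. The relations, the predicates, and the helper names `select` and `join` below are invented for illustration; the check demonstrates the equivalence on one data set and is not a proof.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [t for t in rows if predicate(t)]


def join(R, S, predicate):
    """Theta join as a filtered Cartesian product, returned as (r, s) pairs."""
    return [(r, s) for r in R for s in S if predicate(r, s)]


R = [{"A": 1, "X": 5}, {"A": 2, "X": 9}, {"A": 3, "X": 9}]
S = [{"B": 2}, {"B": 3}, {"B": 3}]


def on(r, s):            # join condition A = B
    return r["A"] == s["B"]


def c(r):                # selection condition that mentions only attributes of R
    return r["X"] == 9


late = [(r, s) for (r, s) in join(R, S, on) if c(r)]   # select after the join
early = join(select(R, c), S, on)                      # select pushed below the join
print(late == early, len(early))
```

Both plans return the same pairs, but the second one feeds fewer R tuples into the join, which is the motivation for pushing selections down the tree.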

Heuristic Algebraic Optimization
- Use rule 1 to break up any $\sigma$ operation with conjunctive conditions into a sequence of $\sigma$ operations.
- Use rules 2, 4, 6, and 10 concerning the commutativity of $\sigma$ with other operations to move each $\sigma$ operation as far down the query tree as possible, based on the attributes in the $\sigma$ operations.
- Use rule 9 concerning the associativity of binary operations to rearrange the leaf nodes of the tree so that the leaf node relations with the most restrictive $\sigma$ operations are executed first.

Heuristic Algebraic Optimization (Cont.)
- Combine each sequence of a Cartesian product and a $\sigma$ operation representing a join condition into a single JOIN operation.
- Using rules 3, 4, 7, and 11 concerning the cascading of $\pi$ and the commuting of $\pi$ with other operations, break down each $\pi$ and move the projection attributes down the tree as far as possible.
- Identify subtrees that represent groups of operations that can be executed by a single algorithm (select/join followed by project).

Estimation of the Size of Joins
- The Cartesian product $r \times s$ contains $n_r \cdot n_s$ tuples; each tuple occupies $s_r + s_s$ bytes.
- If $R \cap S = \emptyset$, then $r \bowtie s$ is the same as $r \times s$.
- If $R \cap S$ is a key for R, then a tuple of s will join with at most one tuple from r; therefore, the number of tuples in $r \bowtie s$ is no greater than the number of tuples in s.
- If $R \cap S$ is a foreign key in S referencing R, then the number of tuples in $r \bowtie s$ is exactly the same as the number of tuples in s. The case for $R \cap S$ being a foreign key referencing S is symmetric.
[Figure: relations R and S with their matching tuples.]

Example of Size Estimation
- In the example query $depositor \bowtie customer$, customer-name in depositor is a foreign key referencing customer; hence, the result has exactly $n_{depositor}$ tuples, which is 5000.
- Data: R = Customer, S = Depositor
  - $n_{customer} = 10{,}000$, $f_{customer} = 25$, $b_{customer} = 10000 / 25 = 400$
  - $n_{depositor} = 5{,}000$, $f_{depositor} = 50$, $b_{depositor} = 5000 / 50 = 100$
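The block counts and the foreign-key estimate above can be reproduced with a few lines. Reading n as the number of tuples, f as the number of tuples per block, and b as the number of blocks follows the arithmetic on the slide; the function names are illustrative.

```python
import math


def blocks(n_tuples, tuples_per_block):
    """b = ceil(n / f): blocks needed to store the relation."""
    return math.ceil(n_tuples / tuples_per_block)


def fk_join_size(n_referencing):
    """If the join attribute is a foreign key, every referencing tuple matches exactly once."""
    return n_referencing


n_customer, f_customer = 10_000, 25
n_depositor, f_depositor = 5_000, 50

b_customer = blocks(n_customer, f_customer)
b_depositor = blocks(n_depositor, f_depositor)
print(b_customer, b_depositor, fk_join_size(n_depositor))
```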

Estimation of the Size of Joins
- Suppose $R \cap S = \{A\}$ is not a key for R or S.
- If we assume that every tuple t in R produces tuples in $R \bowtie S$, the number of tuples in $R \bowtie S$ is estimated to be $\frac{n_r \cdot n_s}{V(A, s)}$, where $V(A, s)$ is the number of distinct values of A in s.
- If the reverse is true, the estimate obtained is $\frac{n_r \cdot n_s}{V(A, r)}$.
- The lower of these two estimates is probably the more accurate one.
[Figure: each tuple of R matches about $n_s / V(A, s)$ tuples of S.]

Estimation of the Size of Joins
- Compute the size estimates for $depositor \bowtie customer$ without using information about foreign keys:
  - $n_{customer} = 10{,}000$, $n_{depositor} = 5{,}000$
  - $V(\text{customer-name}, depositor) = 2500$
  - $V(\text{customer-name}, customer) = 10000$
- The two estimates are $5000 \times 10000 / 2500 = 20{,}000$ and $5000 \times 10000 / 10000 = 5000$.
- We choose the lower estimate, which, in this case, is the same as our earlier computation using foreign keys.
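The two estimates and the choice of the lower one can be written as a short function. The name `join_size_estimates` is illustrative; the inputs are the statistics listed above.

```python
def join_size_estimates(n_r, n_s, v_a_r, v_a_s):
    """Return (n_r * n_s / V(A, s), n_r * n_s / V(A, r))."""
    return n_r * n_s / v_a_s, n_r * n_s / v_a_r


# r = depositor, s = customer
via_customer, via_depositor = join_size_estimates(
    n_r=5_000, n_s=10_000, v_a_r=2_500, v_a_s=10_000
)
estimate = min(via_customer, via_depositor)
print(via_depositor, via_customer, estimate)
```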

Nested-Loop Join (Tuple-Based)
- Compute the theta join $r \bowtie_\theta s$:
    for each tuple tr in r do begin
        for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition
            if they do, add tr · ts to the result
        end
    end
- r is called the outer relation and s the inner relation of the join.
- Requires no indices and can be used with any kind of join condition.
- Expensive, since it examines every pair of tuples in the two relations.
- For each tuple in the outer relation (r), loop through all $n_s$ tuples in the inner relation (s).
- Cost is $n_r \times n_s$ pairs of tuples examined.

Cost of Nested-Loop Join
- If there is enough memory to hold only one block of each relation, the estimated cost is $n_r \cdot b_s + b_r$ disk accesses.
- If the smaller relation fits entirely in memory, use it as the inner relation. This reduces the cost estimate to $b_r + b_s$ disk accesses.
- $b_r + b_s$ is the minimum possible cost: R and S are each read once.
- Putting both relations in memory won't reduce the cost further.
[Figure: $b_r$ disk accesses to load R into the buffer; for each tuple in r, S has to be read into the buffer, costing $b_s$ disk accesses.]
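Plugging the customer and depositor figures from the size-estimation example into these formulas shows how much the choice of outer relation matters. The function names are illustrative.

```python
def tuple_nested_loop_cost(n_outer, b_outer, b_inner):
    """One block per relation in memory: n_r * b_s + b_r disk accesses."""
    return n_outer * b_inner + b_outer


def inner_in_memory_cost(b_outer, b_inner):
    """Inner relation fits in memory: each relation is read once."""
    return b_outer + b_inner


n_customer, b_customer = 10_000, 400
n_depositor, b_depositor = 5_000, 100

depositor_outer = tuple_nested_loop_cost(n_depositor, b_depositor, b_customer)
customer_outer = tuple_nested_loop_cost(n_customer, b_customer, b_depositor)
best_case = inner_in_memory_cost(b_customer, b_depositor)
print(depositor_outer, customer_outer, best_case)
```

With these numbers, customer as the outer relation costs about half as many disk accesses as depositor as the outer relation, and both are far above the 500 accesses needed when the inner relation fits in memory.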

Nested-Loop Join with Buffers (Still Tuple-Based)
- The algorithm is the same as in the previous slide.
- Tuples are fetched and compared one by one according to the double loop.
- The OS or DBMS fetches a tuple from the buffer if it is already there.
[Figure: $b_r$ disk accesses to load R into the buffer; for each tuple in r, S has to be read into the buffer, costing $b_s$ disk accesses. At this point, one block of r has been read, and the first r-tuple has been compared to 3 s-tuples (1 block of s).]

Nested-Loop Join with Buffers (Still Tuple-Based)
[Figure: at this point, the first r-tuple has been compared to 6 s-tuples.]
- The next step begins with the 2nd tuple in r's buffer; no access to r on disk is needed. However, the s-tuples have to be read from disk again.
- Total cost = $n_r \cdot b_s + b_r$ disk accesses.

Rewriting the Nested-Loop Join
- To make use of the buffer efficiently, the algorithm has to be buffer-aware:
    for each block Br in r do begin
        for each block Bs in s do begin
            join all tuples in Br and Bs: Br ⋈ Bs
        end
    end
- Total cost = $b_r \cdot b_s + b_r$ disk accesses.
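The block-based loop can be simulated by representing each relation as a list of blocks and counting one disk access per block fetched. The data layout and the `reads` counter are assumptions made for illustration.

```python
def block_nested_loop_join(r_blocks, s_blocks, matches):
    """Join block by block; return the result pairs and the number of block reads."""
    reads = 0
    result = []
    for block_r in r_blocks:
        reads += 1                      # fetch one block of r
        for block_s in s_blocks:
            reads += 1                  # fetch one block of s
            for tr in block_r:
                for ts in block_s:
                    if matches(tr, ts):
                        result.append((tr, ts))
    return result, reads


# Hypothetical relations: r has 2 blocks, s has 3 blocks.
r_blocks = [[1, 2, 3], [4, 5, 6]]
s_blocks = [[2, 4], [6, 8], [10, 12]]

pairs, reads = block_nested_loop_join(r_blocks, s_blocks, lambda tr, ts: tr == ts)
print(pairs, reads)
```

The counter agrees with the formula: with 2 blocks of r and 3 blocks of s, the loop performs 2 × 3 + 2 = 8 block reads, independent of how many tuples each block holds.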
