-
Query Processing
-
Steps in Query Processing
Validate and translate the query
  Good syntax.
  All referenced relations exist.
  Translate the SQL to relational algebra.
Optimize
  Make it run faster.
Evaluate
-
Translation Example
Possible SQL Query:
SELECT balance
FROM account
WHERE balance
-
Tree Representation of Relational Algebra
[Figure: relational algebra query tree; surviving node labels: "balance", "balance"]
-
Making An Evaluation Plan
Annotate the query tree with evaluation instructions.
The query can now be executed by the query execution engine.
[Figure: annotated query tree; surviving node labels: "balance", "balance"]
-
Before Optimizing the Query
Must predict the cost of execution plans.
Measured by:
  CPU time,
  Number of disk block reads,
  Network communication (in distributed DBs),
  where C(CPU) < C(Disk) < C(Network).
Major factor is buffer space.
Use statistics found in the catalog to help predict the work required to evaluate a query.
-
Disk Cost
Seek time = rotational latency + arm movement.
Scan time = time to read the data.
Typically, seek time is orders of magnitude greater than the scan time for a single block.
In a non-distributed database, disk cost is assumed to be the highest cost, so it can be used to approximate total cost.
-
Reading Data, No Indices
Linear scan
  Cost is a function of file size (every block is read).
Binary search on ordering attribute
  Cost is lg (log base 2) of the file size in blocks.
  Requires table to be sorted.
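As a rough illustration of the two costs above, here is a minimal sketch that counts block reads only. The function names are made up for this example, and the binary search figure covers locating a single record in a sorted file.

```python
import math

def linear_scan_cost(num_blocks: int) -> int:
    """Worst-case block reads for a linear scan: every block is read once."""
    return num_blocks

def binary_search_cost(num_blocks: int) -> int:
    """Approximate block reads to locate one record in a sorted file."""
    if num_blocks <= 1:
        return 1
    return math.ceil(math.log2(num_blocks))

# A hypothetical file of 1,024 blocks.
print(linear_scan_cost(1024))    # 1024
print(binary_search_cost(1024))  # 10
```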
-
Reading Data with Indices
Primary index: index on sort key.
  Can be dense or sparse.
Secondary index: index on non-sort key.
Queries can be point queries or range queries.
  Point queries return a single record.
  Range queries return a sequence of consecutive records.
-
Point Queries
Point queries
  Cost = index cost + block read cost.
Range queries (c1
-
More on Range Queries
Range query on sort key (c1
-
More Complex Selections
Conditions on multiple attributes
Negations
Disjunctions
Grouping pointers when selection is on multiple attributes:
  Find a set of solutions for each condition.
  Compute either their union or their intersection, depending on the condition (disjunction or conjunction, respectively).
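The pointer-grouping idea can be sketched with Python sets. The two indices, their attribute values, and the record ids below are all hypothetical; each index maps an attribute value to the ids of the matching records.

```python
def select_rids(index: dict, value) -> set:
    """Return the set of record ids the index lists for one value."""
    return set(index.get(value, ()))

# Hypothetical secondary indices: attribute value -> record ids.
branch_index = {"Downtown": [1, 2, 5], "Uptown": [3, 4]}
balance_index = {1000: [2, 3], 2500: [1, 5]}
all_rids = {1, 2, 3, 4, 5}

downtown = select_rids(branch_index, "Downtown")
bal_1000 = select_rids(balance_index, 1000)

conjunction = downtown & bal_1000   # branch = Downtown AND balance = 1000
disjunction = downtown | bal_1000   # branch = Downtown OR balance = 1000
negation = all_rids - downtown      # NOT (branch = Downtown)

print(conjunction, disjunction, negation)
```

Only the records whose ids survive the set operation need to be fetched from disk.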
-
Sorting
Sorted relations are easier to scan.
Sorting a relation and then querying it can cost less than querying the unsorted relation.
Two types of sorts:
  In memory
  Out of memory (a.k.a. external sorting)
-
External Merge Sort
Use this when you cannot fit the relation in memory.
Assume there are M memory buffers.
Two phases:
  Create sorted runs.
  Merge sorted runs.
-
External Merge Sort, Phase 1
Fill the M memory buffers with the next M blocks of the relation.
Sort the M blocks.
Write the sorted blocks to disk as one run.
Repeat until the whole relation has been read.
-
External Merge Sort, Phase 2
Assume there are at most M-1 runs.
Read the first block of each run into memory.
At each iteration, find the lowest record among the M-1 runs.
Move it into the remaining (output) memory buffer, writing that buffer to disk when it fills.
If the in-memory block of any run is exhausted, read that run's next block.
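The two phases can be sketched as follows. This is a simplified illustration, not a real implementation: in-memory lists stand in for disk blocks and runs, the function names are made up, and `heapq.merge` from the standard library plays the role of the "find the lowest record" step. The sketch also merges in several passes when there are more than M-1 runs.

```python
import heapq

def create_runs(blocks, m):
    """Phase 1: sort each group of m blocks into one sorted run."""
    runs = []
    for i in range(0, len(blocks), m):
        records = [rec for block in blocks[i:i + m] for rec in block]
        runs.append(sorted(records))
    return runs

def merge_runs(runs, m):
    """Phase 2: merge up to m - 1 runs at a time until one run remains."""
    if m < 3:
        raise ValueError("need at least 3 buffers: 2 inputs and 1 output")
    fan_in = m - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

def external_merge_sort(blocks, m):
    return merge_runs(create_runs(blocks, m), m)

# Hypothetical relation: 7 blocks of 2 records each, M = 3 buffers.
blocks = [[9, 4], [7, 1], [8, 2], [6, 3], [5, 0], [11, 10], [13, 12]]
print(external_merge_sort(blocks, 3))
```

With these inputs, phase 1 produces three runs, so phase 2 needs two merge passes at a fan-in of two.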
-
External Merge Sort Notes
Can be extended to an arbitrarily large relation using multiple passes.
Cost is:
  Br * (2 * ceil(log_(M-1)(Br / M)) + 1)
Br is the number of blocks for the relation.
M is the number of memory buffers.
B is the size of a memory buffer (one block).
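A small sketch of the cost formula, assuming the usual textbook reading in which the number of merge passes is the logarithm rounded up. The function name is hypothetical, and the passes are counted with integer arithmetic to avoid floating-point rounding.

```python
def external_sort_cost(b_r: int, m: int) -> int:
    """Block transfers for external merge sort: b_r * (2 * passes + 1)."""
    if m < 3:
        raise ValueError("need at least 3 buffers")
    runs = -(-b_r // m)          # ceil(b_r / m) initial runs
    passes = 0
    while runs > 1:              # each pass merges m - 1 runs into one
        runs = -(-runs // (m - 1))
        passes += 1
    return b_r * (2 * passes + 1)

# Hypothetical relation of 1,000 blocks sorted with 11 buffers:
# 91 initial runs, then 2 merge passes at a fan-in of 10.
print(external_sort_cost(1000, 11))  # 5000
```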
-
Nested Loop Join
No indices (for now).
Nested loop: R join S
  R is the outer relation.
  S is the inner relation.
Read a block of R, then read each block of S and compare their contents using the join condition.
Write any matching tuples to another block.
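The loop structure described above might look like this. It is a minimal sketch in which each relation is a list of blocks, each block is a list of tuples, and the sample data and the `nested_loop_join` name are made up.

```python
def nested_loop_join(r_blocks, s_blocks, predicate):
    """Join R (outer) with S (inner), rescanning S for every block of R."""
    result = []
    for r_block in r_blocks:            # read one block of the outer relation
        for s_block in s_blocks:        # then read each block of the inner one
            for r in r_block:
                for s in s_block:
                    if predicate(r, s):
                        result.append(r + s)  # write the matching pair
    return result

# Hypothetical relations, joined on the first attribute.
R = [[(1, "a"), (2, "b")], [(3, "c")]]
S = [[(1, "x")], [(3, "y"), (3, "z")]]
joined = nested_loop_join(R, S, lambda r, s: r[0] == s[0])
print(joined)
```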
-
Nested Loop Join Cost
If you read tuple by tuple, it's:
  #tuples in R * #blocks in S + #blocks in R.
Question: Which should be the inner relation, and which should be the outer?
-
Block Nested Loop
Nested Loop Join, but block by block instead.
Cost for R join S, where R is outer, S is inner:
  #blocks in R * #blocks in S + #blocks in R
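To compare the tuple-at-a-time and block-at-a-time costs, and to explore which relation to make the outer one, here is a small sketch. It uses the standard worst-case formulas, in which the outer relation is read once. The relation sizes are hypothetical.

```python
def nested_loop_cost(tuples_outer, blocks_outer, blocks_inner):
    """Tuple-at-a-time: the inner relation is scanned once per outer tuple."""
    return tuples_outer * blocks_inner + blocks_outer

def block_nested_loop_cost(blocks_outer, blocks_inner):
    """Block-at-a-time: the inner relation is scanned once per outer block."""
    return blocks_outer * blocks_inner + blocks_outer

# Hypothetical sizes: R has 5,000 tuples in 100 blocks,
# S has 10,000 tuples in 400 blocks.
R = {"tuples": 5000, "blocks": 100}
S = {"tuples": 10000, "blocks": 400}

costs = {
    "nested loop, R outer": nested_loop_cost(R["tuples"], R["blocks"], S["blocks"]),
    "nested loop, S outer": nested_loop_cost(S["tuples"], S["blocks"], R["blocks"]),
    "block nested loop, R outer": block_nested_loop_cost(R["blocks"], S["blocks"]),
    "block nested loop, S outer": block_nested_loop_cost(S["blocks"], R["blocks"]),
}
for plan, cost in costs.items():
    print(f"{plan}: {cost} block reads")
```

With these numbers the block-based variant is cheaper by well over an order of magnitude, and it is slightly cheaper with the smaller relation as the outer one.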
-
Block Nested Loop Improvements
Sorted relations?
More memory?
-
Indexed Nested Loop Join
Assume we have an index on a join attribute of one of the relations, R or S.
Questions:
  Which should the index be on?
  Or, if both have indices on them, which should be the outer one?
-
Indexed Nested Loop Join Cost
#blocks in R + #rows in R * Ls
Ls is the cost of looking up a record in S using the index.
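A minimal sketch of the idea, assuming a Python dictionary stands in for the index on S. The function names, the sample tuples, and the lookup cost of 5 are all hypothetical.

```python
from collections import defaultdict

def build_index(s_tuples, key_pos):
    """Stand-in for an index on S: join-attribute value -> matching tuples."""
    index = defaultdict(list)
    for s in s_tuples:
        index[s[key_pos]].append(s)
    return index

def indexed_nested_loop_join(r_tuples, s_index, r_key_pos):
    """For each tuple of R (outer), probe the index on S instead of scanning S."""
    return [r + s for r in r_tuples for s in s_index.get(r[r_key_pos], [])]

def indexed_nested_loop_cost(blocks_r, rows_r, lookup_cost_s):
    """#blocks in R + #rows in R * Ls."""
    return blocks_r + rows_r * lookup_cost_s

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (3, "z")]
s_index = build_index(S, 0)
print(indexed_nested_loop_join(R, s_index, 0))
print(indexed_nested_loop_cost(blocks_r=100, rows_r=5000, lookup_cost_s=5))  # 25100
```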
-
More Joins
Merge join
  Sort R and S, and then merge them.
Hash join
  Hash R and S into buckets, and compare the bucket contents.
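Both joins can be sketched in a few lines for an equality join. This is an in-memory illustration only; the function names, the bucket count, and the sample relations are assumptions made for the example.

```python
def merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join attribute, then merge them in one pass."""
    r = sorted(r, key=lambda t: t[r_key])
    s = sorted(s, key=lambda t: t[s_key])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        rk, sk = r[i][r_key], s[j][s_key]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Find the group of S tuples sharing this key value.
            j_end = j
            while j_end < len(s) and s[j_end][s_key] == rk:
                j_end += 1
            # Pair every R tuple with this key against that group.
            while i < len(r) and r[i][r_key] == rk:
                out.extend(r[i] + s[k] for k in range(j, j_end))
                i += 1
            j = j_end
    return out

def hash_join(r, s, r_key, s_key, n_buckets=4):
    """Hash both inputs into buckets, then compare matching buckets only."""
    r_buckets = [[] for _ in range(n_buckets)]
    s_buckets = [[] for _ in range(n_buckets)]
    for t in r:
        r_buckets[hash(t[r_key]) % n_buckets].append(t)
    for t in s:
        s_buckets[hash(t[s_key]) % n_buckets].append(t)
    out = []
    for rb, sb in zip(r_buckets, s_buckets):
        for a in rb:
            for b in sb:
                if a[r_key] == b[s_key]:
                    out.append(a + b)
    return out

R = [(3, "c"), (1, "a"), (2, "b"), (3, "d")]
S = [(3, "y"), (1, "x"), (3, "z"), (4, "w")]
print(sorted(merge_join(R, S, 0, 0)) == sorted(hash_join(R, S, 0, 0)))
```

Tuples with equal join values always hash to the same bucket number, which is why only buckets at the same position need to be compared.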
-
Evaluation
Materialization: Build intermediate tables as the expression goes up the tree.
Here, one intermediate table is created for the select, and is the input of the project.
[Figure: query tree; surviving node labels: "balance", "balance"]
-
Materialization Cost
Cost of writing out intermediate results to disk.
-
Pipelining
Compute several operations simultaneously.
As soon as a tuple is created from one operation, send it to the next. Here, send selected tuples straight to the projection.
[Figure: query tree; surviving node labels: "balance", "balance"]
-
Implementation of Pipelining
Requires buffers for each operation.
Can be:
  Demand driven: an operator must be asked to generate a tuple.
  Producer driven: an operator generates a tuple whether it's asked for or not.
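Demand-driven pipelining maps naturally onto Python generators, where each operator produces a tuple only when the operator above it asks for one. The sketch below is an illustration under assumptions: the `account` rows, the operator names, and the selection threshold of 1000 are all hypothetical.

```python
def scan(relation):
    """Leaf operator: yield rows one at a time."""
    for row in relation:
        yield row

def select(rows, predicate):
    """Pass along only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Keep only the requested columns of each row."""
    for row in rows:
        yield tuple(row[c] for c in columns)

# Hypothetical account relation.
account = [
    {"account_number": "A-101", "balance": 500},
    {"account_number": "A-102", "balance": 2500},
    {"account_number": "A-103", "balance": 900},
]

# No intermediate table is built: each selected row goes straight to project.
plan = project(select(scan(account), lambda r: r["balance"] < 1000), ["balance"])
first = next(plan)   # only now is any row scanned
rest = list(plan)    # pull the remaining rows on demand
print(first, rest)
```

A materialized plan would instead build the full list of selected rows before the projection starts.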
-
Query Optimization
-
Some Actions of Query Optimization
Reordering joins.
Changing the positions of projects and selects.
Changing the access structures used to read data.
-
Catalog Info
Number of tuples in r.
Number of blocks for r.
Size of a tuple of r.
Blocking factor of r: the number of r tuples that fit in a block.
The number of distinct values of each attribute of r.
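These statistics are related to one another. As a minimal sketch, assuming fixed-size tuples that never span two blocks, the blocking factor and block count can be derived from the others. The class name, field names, and sample numbers are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class RelationStats:
    num_tuples: int   # number of tuples in r
    tuple_size: int   # size of a tuple of r, in bytes
    block_size: int   # size of a disk block, in bytes

    @property
    def blocking_factor(self) -> int:
        """Number of whole tuples that fit in one block."""
        return self.block_size // self.tuple_size

    @property
    def num_blocks(self) -> int:
        """Blocks needed to store r when tuples are packed together."""
        return math.ceil(self.num_tuples / self.blocking_factor)

stats = RelationStats(num_tuples=10_000, tuple_size=100, block_size=4096)
print(stats.blocking_factor, stats.num_blocks)  # 40 250
```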