1 optimization - selection. 2 the selection operation table: reserves(sid, bid, day, agent) a page...
Post on 21-Dec-2015
227 views
TRANSCRIPT
1
Optimization - Selection
2
The Selection Operation
• Table: Reserves(sid, bid, day, agent)• A page (block) can hold 100 Reserves tuples• There are 1,000 pages• Compute the query:
SELECT *
FROM Reserves R
WHERE R.agent = ‘Joe’
3
General Problem
• Compute: A op c(R) where – A is an attribute of R– op is an operator, such as =, < , etc.– c is a constant
• We assume that there are M pages in R.
4
No Index, Unsorted Data
• We can compute the selection by scanning the entire relation.
Cost: M
(For the query on Reserves, Cost = 1,000)
5
No Index, Sorted Data
• Suppose that the file in which R is stored is physically sorted on A.
• Then, we can use binary search to find the first tuple matching the search condition.
• Then, scan to find all additional tuples that match the condition.
Cost: log(M) + #pages with tuples from the result(For the query on Reserves: 10 + #pages with tuples from the result)
6
B+ Tree Index
• Suppose that there is a B+ Tree Index available on attribute A of R.
• Search the tree to find the first index entry that points to a tuple satisfying the condition.
• Scan leaf pages of index to find all entries in which key value satisfies the condition.
• Retrieve the satisfying tuples from the file.
7
Example
• Selection condition: rname < ‘C%’• Assume that names are uniformly distributed
with respect to first letter About 2/26 10% = 10,000 tuples = 100 pages match the condition
• If the B+ Tree is clustered, we can return them in 100 I/Os (plus a few to traverse the tree)
• Otherwise, up to 10,000 I/Os may be needed!!– Can you improve the number of I/Os needed?
8
Hash Index, Equality Selection
• Find appropriate bucket (about 1, 2 I/Os)• Retrieve the qualifying tuples from R
– Time depends on whether index is clustered.
• Example: Consider condition agent = ‘Joe’. Suppose there is an unclustered hash index on agent. Suppose there 100 reservations made by Joe.
• Cost: Could be between 1 and 100.
9
Optimization - Join
10
The Join Operation
• Compute the query:
SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
Number of Pages Tuples per Page
R 1000 (M) 100 (pR)
S 500 (N) 80 (pS)
11
Block Nested Loops Join
• Suppose there are B buffer pages
foreach block of B-2 pages of R do
foreach page of S do {
for all matching in-memory
pairs r, s:
add <r,s> to result
}
12
Cost Analysis
• R is read once: M• S is read ceil(M/(B-2)) times: ceil(M/(B-2))*N• Total: M + ceil(M/(B-2))*N
• For our Sailors, Reserves example. (B = 102)• Reserves is the outer Relation:
– Cost = 1000 + (1,000/100)*500 = 6,000 I/Os
• Sailors is the outer Relation:– Cost = 500 + (500/100)*1,000 = 5,500 I/Os
13
Index Nested Loops Join
• Suppose there is an index on the join attribute of S
• We find the tuple s using the index!
foreach tuple r of R
foreach tuple s of S where ri=sj
add <r,s> to result
14
Cost Analysis
1. If index is B+ Tree, about 2-4 I/Os to find leaf. If index is Hash index, about 1-2 I/Os to find bucket
2. Retrieve tuples. Time depends on whether index is clustered. If so, cost is usually 1 I/O per r tuple. Otherwise, could be 1 I/O per matching s tuple
15
Example (1)
• Hash index on sid of Sailors. (about 1.2 I/Os to find bucket)
• sid is a key in Sailors, so there is at most one matching tuple (actually, exactly 1: why?)
• Scanning Reserves: 1,000• There are 100 * 1,000 = 100,000 tuples in Reserves.• For each tuple, search index (1.2) and retrieve page (1)• Total time: 1,000 + 100,000 * 2.2 = 221,000 I/Os
16
Example
• Hash index on sid of Reserves. • Scanning Sailors: 500• There are 80 * 500 = 40,000 tuples in Sailors• For each tuple, search index (1.2) and retreive page
(??)• There are 100,000 reservations for 40,000 sailors.
Assuming uniform distribution on reservations, each sailor tuple matches about 2.5 reserves tuples. If index is unclustered, they may be on different pages
• Total: 500 + 40,000*(1.2 + 2.5) = 148,500 I/Os
17
Sort-Merge Join
• Sort both relations on join attribute.• This creates “partitions” according to the join
attributes.• Join relations while merging them. Tuples in
corresponding partitions are joined. • Cost depends on whether partitions are large
and therefore, are scanned multiple times.
18
sid sname rating age
22 dustin 7 45
28 yuppy 9 35
31 lubber 8 55
36 lubber 6 36
44 guppy 5 35
58 rusty 10 35sid bid day agent
28 103 12/4/96 Joe
28 103 11/3/96 Frank
31 101 10/2/96 Joe
31 102 12/7/96 Sam
31 101 13/7/96 Sam
58 103 22/6/96 Frank
Reserves
Sailors
19
Cost Analysis
• Sort R: O(M logM)• Sort S: O(N log N)• Merge: M + N (If partitions aren’t scanned
multiple times. Otherwise, worst case is M*N!!)
Cost: O(M+N+MlogM + NlogN)
20
Example
• Sort Reserves: 1000 log 1000 10,000 (actually better with a good algorithm for external sorting)
• Sort Sailors: 500 log 500 4,500• Merge: 1,000 + 500 = 1,500• Total: 1,500 + 10,000 + 4,500 = 16,000
(actually about 7,500 if sorting is done well)
21
Hash Join//Partition R into k partitions
foreach tuple r in R do //flush when fills
read r and add it to buffer page h(ri)
foreach tuple s in S do //flush when fills
read s and add it to buffer page h(sj)
for l = 1..k
//Build in-memory hash table for Rl using h2
foreach tuple r in Rl do
read r and insert into hash table with h2
foreach tuple s in Sl do
read s and probe table using h2
output matching pairs <r,s>
22
Cost Analysis
• Partition phase - Read R, S once and write them once: 2(M + N)
• In the second phase, we can read each partition once, assuming that it fits into memory: M + N
Cost: 3(M + N)
• In our example: 3(1,000 + 500) = 4,500 I/Os
23
Estimating Result Sizes
24
Picking a Query Plan
• Suppose we want to find the natural join of: Reserves, Sailors, Boats.
• The 2 options that appear the best are (ignoring the order within a single join):
(Sailors Reserves) Boats Sailors (Reserves Boats)
• We would like intermediate results to be as small as possible. Which is better?
25
Analyzing Result Sizes
• In order to answer the question in the previous slide, we must be able to estimate the size of (Sailors Reserves) and (Reserves Boats).
• The DBMS stores statistics about the relations and indexes.
• They are updated periodically (not every time the underlying relations are modified).
26
Statistics Maintained by DBMS
• Cardinality: Number of tuples NTuples(R) in each relation R
• Size: Number of pages NPages(R) in each relation R• Index Cardinality: Number of distinct key values
NKeys(I) for each index I• Index Size: Number of pages INPages(I) in each
index I• Index Height: Number of non-leaf levels IHeight(I) in
each B+ Tree index I• Index Range: The minimum ILow(I) and maximum
value IHigh(I) for each index I
27
Estimating Result Sizes
• Consider
• The maximum number of tuples is the product of the cardinalities of the relations in the FROM clause
• The WHERE clause is associating a reduction factor with each term
• Estimated result size is: maximum size times product of reduction factors
SELECT attribute-list
FROM relation-list
WHERE term1 and ... and termn
28
Estimating Reduction Factors
• column = value: 1/NKeys(I) if there is an index I on column. This assumes a uniform distribution. Otherwise, System R assumes 1/10.
• column1 = column2: 1/Max(NKeys(I1),NKeys(I2)) if there is an index I1 on column1 and I2 on column2. If only one column has an index, we use it to estimate the value. Otherwise, use 1/10.
• column > value: (High(I)-value)/(High(I)-Low(I)) if there is an index I on column.
29
Example
• Cardinality(R) = 1,000 * 100 = 100,000• Cardinality(S) = 500 * 80 = 40,000• NKeys(Index on R.agent) = 100• High(Index on Rating) = 10, Low = 0
SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid and S.rating > 3 and
R.agent = ‘Joe’
30
Example (cont.)
• Maximum cardinality: 100,000 * 40,000 • Reduction factor of R.sid = S.sid: 1/40,000• Reduction factor of S.rating > 3: (10–3)/(10-0)
= 7/10• Reduction factor of R.agent = ‘Joe’: 1/100
• Total Estimated size: 700