1 optimization - selection. 2 the selection operation table: reserves(sid, bid, day, agent) a page...

1

Optimization - Selection

2

The Selection Operation

• Table: Reserves(sid, bid, day, agent)• A page (block) can hold 100 Reserves tuples• There are 1,000 pages• Compute the query:

SELECT *

FROM Reserves R

WHERE R.agent = ‘Joe’

3

General Problem

• Compute: A op c(R) where – A is an attribute of R– op is an operator, such as =, < , etc.– c is a constant

• We assume that there are M pages in R.

4

No Index, Unsorted Data

• We can compute the selection by scanning the entire relation.

Cost: M

(For the query on Reserves, Cost = 1,000)

5

No Index, Sorted Data

• Suppose that the file in which R is stored is physically sorted on A.

• Then, we can use binary search to find the first tuple matching the search condition.

• Then, scan to find all additional tuples that match the condition.

Cost: log(M) + #pages with tuples from the result(For the query on Reserves: 10 + #pages with tuples from the result)

6

B+ Tree Index

• Suppose that there is a B+ Tree Index available on attribute A of R.

• Search the tree to find the first index entry that points to a tuple satisfying the condition.

• Scan leaf pages of index to find all entries in which key value satisfies the condition.

• Retrieve the satisfying tuples from the file.

7

Example

• Selection condition: rname < ‘C%’• Assume that names are uniformly distributed

with respect to first letter About 2/26 10% = 10,000 tuples = 100 pages match the condition

• If the B+ Tree is clustered, we can return them in 100 I/Os (plus a few to traverse the tree)

• Otherwise, up to 10,000 I/Os may be needed!!– Can you improve the number of I/Os needed?

8

Hash Index, Equality Selection

• Find appropriate bucket (about 1, 2 I/Os)• Retrieve the qualifying tuples from R

– Time depends on whether index is clustered.

• Example: Consider condition agent = ‘Joe’. Suppose there is an unclustered hash index on agent. Suppose there 100 reservations made by Joe.

• Cost: Could be between 1 and 100.

9

Optimization - Join

10

The Join Operation

• Compute the query:

SELECT *

FROM Reserves R, Sailors S

WHERE R.sid = S.sid

Number of Pages Tuples per Page

R 1000 (M) 100 (pR)

S 500 (N) 80 (pS)

11

Block Nested Loops Join

• Suppose there are B buffer pages

foreach block of B-2 pages of R do

foreach page of S do {

for all matching in-memory

pairs r, s:

add <r,s> to result

}

12

Cost Analysis

• R is read once: M• S is read ceil(M/(B-2)) times: ceil(M/(B-2))*N• Total: M + ceil(M/(B-2))*N

• For our Sailors, Reserves example. (B = 102)• Reserves is the outer Relation:

– Cost = 1000 + (1,000/100)*500 = 6,000 I/Os

• Sailors is the outer Relation:– Cost = 500 + (500/100)*1,000 = 5,500 I/Os

13

Index Nested Loops Join

• Suppose there is an index on the join attribute of S

• We find the tuple s using the index!

foreach tuple r of R

foreach tuple s of S where ri=sj

add <r,s> to result

14

Cost Analysis

1. If index is B+ Tree, about 2-4 I/Os to find leaf. If index is Hash index, about 1-2 I/Os to find bucket

2. Retrieve tuples. Time depends on whether index is clustered. If so, cost is usually 1 I/O per r tuple. Otherwise, could be 1 I/O per matching s tuple

15

Example (1)

• Hash index on sid of Sailors. (about 1.2 I/Os to find bucket)

• sid is a key in Sailors, so there is at most one matching tuple (actually, exactly 1: why?)

• Scanning Reserves: 1,000• There are 100 * 1,000 = 100,000 tuples in Reserves.• For each tuple, search index (1.2) and retrieve page (1)• Total time: 1,000 + 100,000 * 2.2 = 221,000 I/Os

16

Example

• Hash index on sid of Reserves. • Scanning Sailors: 500• There are 80 * 500 = 40,000 tuples in Sailors• For each tuple, search index (1.2) and retreive page

(??)• There are 100,000 reservations for 40,000 sailors.

Assuming uniform distribution on reservations, each sailor tuple matches about 2.5 reserves tuples. If index is unclustered, they may be on different pages

• Total: 500 + 40,000*(1.2 + 2.5) = 148,500 I/Os

17

Sort-Merge Join

• Sort both relations on join attribute.• This creates “partitions” according to the join

attributes.• Join relations while merging them. Tuples in

corresponding partitions are joined. • Cost depends on whether partitions are large

and therefore, are scanned multiple times.

18

sid sname rating age

22 dustin 7 45

28 yuppy 9 35

31 lubber 8 55

36 lubber 6 36

44 guppy 5 35

58 rusty 10 35sid bid day agent

28 103 12/4/96 Joe

28 103 11/3/96 Frank

31 101 10/2/96 Joe

31 102 12/7/96 Sam

31 101 13/7/96 Sam

58 103 22/6/96 Frank

Reserves

Sailors

19

Cost Analysis

• Sort R: O(M logM)• Sort S: O(N log N)• Merge: M + N (If partitions aren’t scanned

multiple times. Otherwise, worst case is M*N!!)

Cost: O(M+N+MlogM + NlogN)

20

Example

• Sort Reserves: 1000 log 1000 10,000 (actually better with a good algorithm for external sorting)

• Sort Sailors: 500 log 500 4,500• Merge: 1,000 + 500 = 1,500• Total: 1,500 + 10,000 + 4,500 = 16,000

(actually about 7,500 if sorting is done well)

21

Hash Join//Partition R into k partitions

foreach tuple r in R do //flush when fills

read r and add it to buffer page h(ri)

foreach tuple s in S do //flush when fills

read s and add it to buffer page h(sj)

for l = 1..k

//Build in-memory hash table for Rl using h2

foreach tuple r in Rl do

read r and insert into hash table with h2

foreach tuple s in Sl do

read s and probe table using h2

output matching pairs <r,s>

22

Cost Analysis

• Partition phase - Read R, S once and write them once: 2(M + N)

• In the second phase, we can read each partition once, assuming that it fits into memory: M + N

Cost: 3(M + N)

• In our example: 3(1,000 + 500) = 4,500 I/Os

23

Estimating Result Sizes

24

Picking a Query Plan

• Suppose we want to find the natural join of: Reserves, Sailors, Boats.

• The 2 options that appear the best are (ignoring the order within a single join):

(Sailors Reserves) Boats Sailors (Reserves Boats)

• We would like intermediate results to be as small as possible. Which is better?

25

Analyzing Result Sizes

• In order to answer the question in the previous slide, we must be able to estimate the size of (Sailors Reserves) and (Reserves Boats).

• The DBMS stores statistics about the relations and indexes.

• They are updated periodically (not every time the underlying relations are modified).

26

Statistics Maintained by DBMS

• Cardinality: Number of tuples NTuples(R) in each relation R

• Size: Number of pages NPages(R) in each relation R• Index Cardinality: Number of distinct key values

NKeys(I) for each index I• Index Size: Number of pages INPages(I) in each

index I• Index Height: Number of non-leaf levels IHeight(I) in

each B+ Tree index I• Index Range: The minimum ILow(I) and maximum

value IHigh(I) for each index I

27

Estimating Result Sizes

• Consider

• The maximum number of tuples is the product of the cardinalities of the relations in the FROM clause

• The WHERE clause is associating a reduction factor with each term

• Estimated result size is: maximum size times product of reduction factors

SELECT attribute-list

FROM relation-list

WHERE term1 and ... and termn

28

Estimating Reduction Factors

• column = value: 1/NKeys(I) if there is an index I on column. This assumes a uniform distribution. Otherwise, System R assumes 1/10.

• column1 = column2: 1/Max(NKeys(I1),NKeys(I2)) if there is an index I1 on column1 and I2 on column2. If only one column has an index, we use it to estimate the value. Otherwise, use 1/10.

• column > value: (High(I)-value)/(High(I)-Low(I)) if there is an index I on column.

29

Example

• Cardinality(R) = 1,000 * 100 = 100,000• Cardinality(S) = 500 * 80 = 40,000• NKeys(Index on R.agent) = 100• High(Index on Rating) = 10, Low = 0

SELECT *

FROM Reserves R, Sailors S

WHERE R.sid = S.sid and S.rating > 3 and

R.agent = ‘Joe’

30

Example (cont.)

• Maximum cardinality: 100,000 * 40,000 • Reduction factor of R.sid = S.sid: 1/40,000• Reduction factor of S.rating > 3: (10–3)/(10-0)

= 7/10• Reduction factor of R.agent = ‘Joe’: 1/100

• Total Estimated size: 700

1 optimization - selection. 2 the selection operation table: reserves(sid, bid, day, agent) a page...

Documents