cpsc 404, laks v.s. lakshmanan1 evaluation of relational operations: other operations chapter 14...

CPSC 404, Laks V.S. Lakshmanan 1

Evaluation of Relational Operations: Other Operations

Chapter 14 Ramakrishnan & Gehrke

(Sections 14.1-14.3; 14.5-14.7)


What you will learn from this lecture

General selections. Union/intersection. Group-by.


Simple Selections Of the form Size of result approximated as size of R * reduction

factor– reduction factor typically estimated with the uniformity

assumption– or estimated with the help of a histogram [will discuss

soon]. With no index, unsorted: (worst case) table

scan only option; cost = M (#pages of R). With an index on selection attribute: Use index to

find qualifying data entries, then retrieve corresponding data records. (Hash index useful only for equality selections.)

What about a sorted indexless relation? Is binary search always superior to table scan?

SELECT *FROM Songs SWHERE S.sname = ‘C%’

R attr valueop R. ( )


Using an Index for Selections

Cost depends on #qualifying tuples, and clustering.– Cost of finding qualifying data entries (typically small) plus

cost of retrieving records (could be large w/o clustering).– For example, assuming uniform distribution of snames(!),

about 1/26 of tuples qualify (20 pages, 1600 tuples). With a clustered index, cost is about 20 I/Os; if unclustered, upto 1600 I/Os! [assumes S has 500 pages.]

Important refinement for unclustered indexes: 1. Find qualifying data entries.2. Sort the rid’s of the data records to be retrieved (or, if the

records are not stored in rid-order, sort the page addresses).3. Fetch rids in order. This ensures that each data page is

looked at just once (though # of such pages likely to be higher than with clustering).


General Selection Conditions

Such selection conditions are first converted to conjunctive normal form (CNF): (time<8/9/04 OR rating=5 OR sid=3 ) AND (uid=‘123 OR rating=5 OR sid=3).

We only discuss the case with no disjunctions (most common in practice!)

disjunctions may or may not be easy– e.g., (time<8/9/04 OR uid=‘123) index on uid only may still warrant a scan

– e.g., (time<8/9/04 OR uid=‘123) AND sid = 3 index on sid helps tremendously

- (time = 8/9/04 OR uid=`123’). Can you think of a clever strategy if given index on time and on uid?

(time<8/9/04 AND uid=123) OR rating=5 OR sid=3


Two Approaches to General Selections

First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining conditions that don’t match the index:– Most selective access path: An index probe or index scan or

table scan that we estimate will require the fewest page I/Os.– e.g., time<8/9/04 AND rating=5 AND sid=3. Suppose we have

indexes (B+tree) on time and (hash) on <rating, sid>. 2 options:

option 1: use the B+ tree index on time (check rating=5 and sid=3 for each retrieved tuple)

option 2: use the hash index on <rating, sid> (check time<8/9/94)

– Conditions that match this index reduce the number of tuples retrieved; other conditions are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.


Intersection of Rids Second approach (if we have 2 or more

matching indexes):– Get sets of rids of data records using each matching

index.– Then intersect these sets of rids – Retrieve the records and apply any remaining

conditions.– e.g., time<8/9/04 AND rating=5 AND sid=3

suppose we have a B+ tree index on time and an index on sid

retrieve rids of records satisfying time<8/9/04 using the first retrieve rids of records satisfying sid=3 using the second intersect the rids finally, retrieve records and check rating=5 How’d you implement intersection efficiently?


The Projection Operation

An approach based on sorting:– Modify Phase 1 of external sort to eliminate

unwanted fields. Thus, sorted sublists (runs) are produced, but tuples in SSLs are smaller than input tuples. (Size ratio depends on # and size of fields dropped.)

– Modify merging passes to eliminate duplicates. Thus, number of result tuples smaller than input. (Difference depends on # of duplicates.)

– What should we sort on?

SELECT DISTINCT R.sid, R.uidFROM Ratings R


Projection Based on Hashing Partitioning phase: Read R using one input buffer. For

each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buckets/buffers.– Result is B-1 partitions (of tuples with no unwanted fields). Any

two tuples from different partitions guaranteed to be distinct. Duplicate elimination phase: For each partition, read it

and build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates.– If partition does not fit in memory, can apply hash-based

projection algorithm recursively to this partition. (but what do you “project” on in recursive invocation(s)?)

Note the similarity to hash join.


Set Operations Intersection and cross-product special cases of join. Union: sorting based and hash based Sorting based approach:

– Sort both relations (on combination of all attributes).– Scan sorted relations and merge them.– Alternative: Merge SSLs from Phase 1 for both relations. – An advantage: result is sorted.

Hash based approach to union:– Partition R and S using hash function h.– For each S-partition, build in-memory hash table (using h2),

scan corr. R-partition and add tuples to table while discarding duplicates.

– What is the buffer requirement for this?


Aggregate Operations (AVG, MIN, etc.)

Without group-by:– In general, requires scanning the relation.– Given index whose search key includes all attributes

in the SELECT & WHERE clauses, may suffice to do index scan. (remember what index scan is?)

With group-by:– Sort on group-by attributes, then scan relation and

compute aggregate for each group. (Can improve upon this by combining sorting and aggregate computation. In this case, the I/O cost is just that of sorting.)

– Similar approach based on hashing on group-by attributes.

Maintain running info. for each group.


Summary

A virtue of relational DBMSs: queries are composed of a few basic operators; the implementation of these operators can be carefully tuned (and it is important to do this!).

Indeed – relational algebra gold standard for algebra design: simple ops; sophisticated ones derivable by composition: e.g., join, division.

Many alternative implementation techniques for each operator; no universally superior technique for most operators.

Must consider available alternatives for each operation in a query and choose best one based on system statistics, etc. This is part of the broader task of optimizing a query composed of several ops.

cpsc 404, laks v.s. lakshmanan1 evaluation of relational operations: other operations chapter 14...

Documents