
Advanced Databases: Lecture 6 Query Optimization (I)

Introduction to Query Processing + Implementing Relational Algebra

Advanced Databases

By Dr. Akhtar Ali

Query Optimization – I


What is Optimization

• Best use of resources.
  – Good time management
  – Effective allocation of lecturers and labs to course units

• Efficient solution to a problem.
  – Quick response to a user query

• Less costly.
  – Solar energy vs. nuclear vs. hydro-electric power
  – Minimum I/O, CPU cycles, memory space


Query Optimization

• A classical component of a DBMS.

• Choosing the best composition of algebraic operators to answer a query.
  – A query (e.g. in SQL) may have several alternative representations in algebra.
  – The optimizer selects the best possible algebraic representation.

• Choosing an efficient and less costly plan to answer a query.
  – One that takes less time to compute.
  – One with the least cost (in terms of I/Os).

• Why Query Optimization?
  – To make query evaluation faster.
  – To reduce the response time of the query processor.
  – To allow the user to write queries without being aware of the physical access mechanisms, and without having to tell the system explicitly how the queries should be evaluated.


Recommended Text

• Database Management Systems, by R. Ramakrishnan, Chapters 12, 13 (copy provided)

• Fundamentals of Database Systems – 3rd Edition, by R. Elmasri and S. B. Navathe, Chapter 18

• An Introduction to Database Systems – 7th Edition, by C. J. Date, Chapter 17


Query Processing – the clear view

[Figure: the user/application sends an SQL query to the DBMS Query Processor and receives the result of the query.]


Query Processing – the clear view

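[Figure: the stages of query processing between the user/application and the database.
  – Translator (scanning, parsing, validating): takes the SQL query, builds a parse tree and produces a Relational Algebra query tree.
  – Logical Optimizer (uses transformations): produces an optimized Relational Algebra query tree.
  – Physical Optimizer (uses a cost model and database statistics): produces code to execute the query.
  – Runtime Database Engine: executes the code and returns the result of the query.
  – The Catalog holds the meta data and the Database holds the data used by these stages.]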


Example database schema

• We will use the following schema throughout this lecture:
    Sailors(sid: integer, sname: string, rating: integer, age: real)
    Reserves(sid: integer, bid: integer, day: date, rname: string)

• Consider the following statistics about the relations.
  – Each tuple of Reserves is 40 bytes long.
  – A data page can hold 100 Reserves tuples.
  – The size of the Reserves relation is 1000 pages.
  – Each tuple of Sailors is 50 bytes long.
  – A data page can hold 80 Sailors tuples.
  – The size of the Sailors relation is 500 pages.
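The statistics above imply a few derived figures that the later cost estimates lean on. The following is a small sketch only: the dictionary and function names are illustrative, every page is assumed to be full, and page headers are ignored.

```python
# Statistics taken from the slide; the names are illustrative.
reserves = {"tuple_bytes": 40, "tuples_per_page": 100, "pages": 1000}
sailors = {"tuple_bytes": 50, "tuples_per_page": 80, "pages": 500}


def total_tuples(rel):
    """Number of tuples, assuming every page is completely full."""
    return rel["tuples_per_page"] * rel["pages"]


def payload_bytes_per_page(rel):
    """Bytes of tuple data held by one page, ignoring any page header."""
    return rel["tuple_bytes"] * rel["tuples_per_page"]


print(total_tuples(reserves), total_tuples(sailors))
print(payload_bytes_per_page(reserves), payload_bytes_per_page(sailors))
```

Under these assumptions Reserves holds 100,000 tuples and Sailors 40,000, and both relations carry 4000 bytes of tuple data per page.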


Translating SQL into Relational Algebra

• After the SQL query has been parsed and found to be syntactically correct, it is mapped onto a Relational Algebra (RA) expression, usually shown as a query tree that is read bottom up.

• Consider the SQL query:

    SELECT S.sname
    FROM Reserves R, Sailors S
    WHERE R.sid = S.sid
      AND R.bid = 100 AND S.rating > 5

• The same query in RA:

    $\pi_{sname}(\sigma_{bid=100 \,\wedge\, rating>5}(Reserves \bowtie_{sid=sid} Sailors))$

• The corresponding query tree:

              π sname
                 |
    σ bid = 100 and rating > 5
                 |
             ⋈ sid=sid
             /       \
       Reserves     Sailors
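To make the bottom-up reading of the query tree concrete, here is a minimal sketch that evaluates the same three operators over tiny in-memory relations. The rows and the helper names `join`, `select` and `project` are made up for illustration. Like SQL without DISTINCT, `project` does not remove duplicates.

```python
# Toy relations as lists of dicts; the rows are invented for illustration.
reserves = [
    {"sid": 1, "bid": 100, "day": "2024-01-01", "rname": "Joe"},
    {"sid": 2, "bid": 101, "day": "2024-01-02", "rname": "Ann"},
    {"sid": 3, "bid": 100, "day": "2024-01-03", "rname": "Bob"},
]
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7, "age": 45.0},
    {"sid": 2, "sname": "Lubber", "rating": 8, "age": 55.5},
    {"sid": 3, "sname": "Rusty", "rating": 3, "age": 35.0},
]


def join(r, s, attr):
    """Equi-join on a shared attribute, using simple nested loops."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


def select(rel, pred):
    """Keep only the tuples for which the predicate holds."""
    return [t for t in rel if pred(t)]


def project(rel, attrs):
    """Keep only the listed attributes (duplicates are not removed)."""
    return [{a: t[a] for a in attrs} for t in rel]


# Evaluate the tree bottom up: join first, then select, then project.
result = project(
    select(join(reserves, sailors, "sid"),
           lambda t: t["bid"] == 100 and t["rating"] > 5),
    ["sname"],
)
print(result)  # [{'sname': 'Dustin'}]
```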


Implementation of Relational Operators

• We will discuss how to implement:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): picks only the required attributes and removes unwanted attributes from a relation.
  – Join (⋈): combines two relations.


Access Paths

• There is usually more than one way to retrieve tuples from a relation, if indexes are available and if the query contains selection conditions.

• The selection condition comes from a select or a join.

• The alternative ways to retrieve tuples from a relation are called access paths.

• An access path is either:
  – A file scan (when there is no selection condition or no index can be used).
  – An index plus a matching selection condition, for example attr op value, where op is an operator (<, >, =) and there is an index available on attr.


Implementing Selection operator

• The implementation depends on the available file organization, that is, whether we have:
  – No index available and the physical file for the relation is unsorted. This is very expensive.
  – No index, but the file is sorted on some attribute.
  – A B+ tree index available.
  – A hash index available.

• The cost of the selection operator differs in each of these cases, and that is the main thing to know.


Selection Operator – an Example Query

• Consider the following query:

    SELECT *
    FROM Reserves
    WHERE rname = 'Joe'

• Assume that 100 tuples qualify for the result of the above query, that is, 100 tuples have rname = 'Joe'.


Selection using no index & no sorting

• For a general selection query $\sigma_{R.attr\ op\ value}(R)$, we have to scan the entire file to get the qualifying tuples. Note that op can be <, >, =, <>, etc.

• Each tuple is tested to see whether the condition (R.attr op value) holds. If the condition holds, the tuple is added to the result.

• The cost of this approach is M I/Os, where M is the number of pages in R.

• For the example query, the cost is 1000 I/Os because there are 1000 pages in the Reserves relation.
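The file scan can be sketched as a loop that reads every page once and tests every tuple on it. This is a simplified model, not the lecture's implementation: the file is a hypothetical in-memory list of pages, and the placement of the 100 'Joe' tuples is invented.

```python
def file_scan(pages, predicate):
    """Return (matching tuples, page I/Os) for a full scan of an unsorted file."""
    ios = 0
    result = []
    for page in pages:      # every page is read exactly once
        ios += 1
        for t in page:      # every tuple on the page is tested
            if predicate(t):
                result.append(t)
    return result, ios


# Hypothetical Reserves file: 1000 pages of 100 tuples each,
# with one 'Joe' tuple placed every 1000 tuples (100 in total).
TUPLES_PER_PAGE, NUM_PAGES = 100, 1000
pages = [
    [{"rname": "Joe" if (p * TUPLES_PER_PAGE + i) % 1000 == 0 else "Other"}
     for i in range(TUPLES_PER_PAGE)]
    for p in range(NUM_PAGES)
]

matches, ios = file_scan(pages, lambda t: t["rname"] == "Joe")
print(len(matches), ios)  # 100 1000
```

The I/O count equals the number of pages regardless of how many tuples qualify, which is why the cost is simply M.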


Binary Search – Divide & Conquer

• An algorithm for searching for an element in a sorted array or file.

Algorithm BinarySearch(A, k, low, high):
  Input: a sorted array A storing n items in ascending order, a search key k, and integers low and high.
  Output: the element of A equal to k if it exists, otherwise the special element NoSuchKey.

  if low > high then
    return NoSuchKey
  else
    mid = (low + high) / 2   /* rounded down to an integer */
    if k = A[mid] then
      return A[mid]
    else if k < A[mid] then
      return BinarySearch(A, k, low, mid - 1)
    else
      return BinarySearch(A, k, mid + 1, high)
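A direct Python transcription of the BinarySearch pseudocode might look as follows. `NO_SUCH_KEY` stands in for the slide's NoSuchKey, and the sample array is invented for illustration.

```python
NO_SUCH_KEY = None  # stand-in for the slide's special element NoSuchKey


def binary_search(a, k, low, high):
    """Recursive binary search over the sorted list a[low..high]."""
    if low > high:
        return NO_SUCH_KEY
    mid = (low + high) // 2          # integer division rounds down
    if k == a[mid]:
        return a[mid]
    if k < a[mid]:
        return binary_search(a, k, low, mid - 1)
    return binary_search(a, k, mid + 1, high)


# Hypothetical sorted array, not the one from the lecture's figure.
items = [2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37]
print(binary_search(items, 22, 0, len(items) - 1))  # 22
print(binary_search(items, 23, 0, len(items) - 1))  # None
```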


Binary Search – Divide & Conquer …

Suppose that in this array we are searching for item 22


Binary Search – Divide & Conquer …

• Initially, the number of candidate items is $n$.
• After the first call to BinarySearch, it is at most $n/2$.
• After the second call to BinarySearch, it is at most $n/4$, i.e. $n/2^2$.
• After the $i$th call to BinarySearch, the number of items remaining is at most $n/2^i$.
• The maximum number of recursive calls performed is the smallest integer $m$ for which no candidate items remain.
• So we can say: $n/2^m < 1$
• In other words: $m > \log_2 n$
• Thus: $m = \lfloor \log_2 n \rfloor + 1$
• Hence: the binary search algorithm runs in $O(\log_2 n)$ time, i.e. in the order of $\log_2 n$.
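The bound can be checked empirically. The sketch below counts the calls that actually examine an element in the worst case, a key larger than every item so that the search always continues in the right half, and compares the count with $\lfloor \log_2 n \rfloor + 1$. The function name `count_calls` is illustrative, and the final call that merely detects low > high is not counted.

```python
import math


def count_calls(n):
    """Worst-case number of calls that examine an element in an array of n items."""
    calls, low, high = 0, 0, n - 1
    while low <= high:
        calls += 1
        mid = (low + high) // 2
        low = mid + 1        # the key is larger than A[mid]: continue on the right
    return calls


for n in (1, 2, 7, 8, 1000):
    assert count_calls(n) == math.floor(math.log2(n)) + 1

print(count_calls(1000))  # 10
```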


Selection using sorting but no index

• For a general selection query $\sigma_{R.attr\ op\ value}(R)$, if R is physically sorted on R.attr, we use a binary search to locate the first qualifying tuple.

• We then keep testing the condition on the tuples in every page that is scanned, adding them to the result until the condition fails to hold.

• The cost of this approach is equal to the cost of the binary search plus the number of pages that are read.
  – The cost of the binary search = $\log_2 M$ I/Os.
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.

• For the example query, the cost is computed as follows:
  – The binary search cost = $\log_2 1000 = \log 1000 / \log 2 \approx 9.97$, rounded up to 10 I/Os.
  – Since the number of qualifying tuples is 100, 1 page will hold these tuples and scanning that page will cost 1 I/O.
  – So the total cost is 10 + 1 = 11 I/Os.


B+ tree Index

• A B+ tree index is a balanced tree in which the internal nodes (the top two levels in the figure) direct the search and the leaf nodes contain the data entries.

• Searching for a record requires just a traversal from the root to the appropriate leaf node.

• The length of the path from the root to a leaf is called the height of the tree (usually 2 or 3).

• To search for entry 9*, we follow the leftmost child pointer from the root (as 9 < 10). Then at level two we follow the right child pointer (as 9 > 6). Once at the leaf node, data entries can be found sequentially.

• Leaf nodes are inter-connected, which makes the index suitable for range queries.

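[Figure: example B+ tree. Root: 10, 20. Level two nodes: (6), (12), (23, 35). Leaf data entries, left to right: 3* 4* | 6* 9* | 10* 10* | 12* 13* | 20* 22* | 23* 31* | 35* 36*.]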
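The root-to-leaf traversal described above can be sketched over the example tree. This is an illustrative model only. The nested tuple/list representation and the `search` helper are assumptions, the leaf entries are copied as they appear in the extracted figure, and keys equal to a separator are assumed to go to the right child.

```python
import bisect

# Internal node: (keys, children). Leaf node: a plain list of data entries.
# Layout reconstructed from the figure as far as it can be recovered.
tree = (
    [10, 20],
    [
        ([6], [[3, 4], [6, 9]]),
        ([12], [[10, 10], [12, 13]]),
        ([23, 35], [[20, 22], [23, 31], [35, 36]]),
    ],
)


def search(node, key):
    """Equality search: return (found, number of nodes visited)."""
    visited = 1
    while isinstance(node, tuple):               # still at an internal node
        keys, children = node
        # Keys >= a separator go right, smaller keys go left (an assumption).
        node = children[bisect.bisect_right(keys, key)]
        visited += 1
    return key in node, visited


print(search(tree, 9))   # (True, 3)
print(search(tree, 11))  # (False, 3)
```

Every search visits the same number of nodes because the tree is balanced, which is why the lookup cost depends only on the height.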


Selection using B+ tree index

• For a general selection query $\sigma_{R.attr\ op\ value}(R)$, a B+ tree index is best when op is not equality (e.g. <, >). It is also good for the = operator.

• We search the B+ tree to find the first page that contains a qualifying tuple. Assume that the tree index is clustered.

• We then read all the pages that contain qualifying tuples.

• The cost of this approach is equal to the sum of the following:
  – The cost of identifying the starting page = 2 or 3 I/Os. We assume 2 I/Os throughout.
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.

• For the example query, the cost is computed as follows:
  – Since the number of qualifying tuples is 100, 1 page will hold these tuples and scanning that page will cost 1 I/O.
  – So the total cost is 2 + 1 = 3 I/Os.


Hash Index

• A hash function is applied to the hash field value (key field) to get the address of the disk page in which the record is stored.

• A bucket is a set of records.

• The directory is an array of size n (4 in the figure); each element is a pointer to a bucket.

• To search for a data entry:
  – The hash function is applied to the search field, and the last two bits of the binary form of the result are used to get a number between 0 and 3.
  – This number gives the array position that holds the pointer to the desired bucket.
  – To locate a record with key field 5 (binary 101), we look at directory element 01 and follow the pointer to the data page (Bucket B).

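[Figure: hash index with a directory and data pages. The directory has global depth 2 and four elements (00, 01, 10, 11), each pointing to a data page. Each bucket has local depth 2: Bucket A holds 4* 12* 32* 16*, Bucket B holds 1* 5* 21*, Bucket C holds 10*, Bucket D holds 15* 7* 19*.]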
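The directory lookup can be sketched using the bucket contents from the figure. The identity hash function is an assumption chosen because it reproduces the figure's layout for integer keys. The names `directory`, `buckets` and `lookup` are illustrative.

```python
GLOBAL_DEPTH = 2  # the directory has 2**2 = 4 elements

# Bucket contents as listed in the figure.
buckets = {
    "A": [4, 12, 32, 16],
    "B": [1, 5, 21],
    "C": [10],
    "D": [15, 7, 19],
}
directory = ["A", "B", "C", "D"]  # positions 00, 01, 10, 11


def h(key):
    """Identity hash: an assumption that matches the figure's integer keys."""
    return key


def lookup(key):
    """Return (bucket name, whether the key is stored in that bucket)."""
    slot = h(key) & ((1 << GLOBAL_DEPTH) - 1)   # keep the last GLOBAL_DEPTH bits
    name = directory[slot]
    return name, key in buckets[name]


print(lookup(5))   # ('B', True): 5 is binary 101, last two bits 01
print(lookup(13))  # ('B', False): same directory element, but 13 is not stored
```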


Selection using Hash Index

• For a general selection query $\sigma_{R.attr\ op\ value}(R)$, a hash index is best when op is equality (=). It is not good for non-equality operators (e.g. <, >, <>).

• We retrieve the index page that contains the rids (record identifiers) of the qualifying tuples.

• Then the pages that contain these tuples are scanned.

• The cost of this approach is equal to the sum of the following:
  – The cost to retrieve the index page = 1 I/O.
  – The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
  – For non-equality operators, T = the number of qualifying tuples.

• For the example query, the cost is computed as follows:
  – Since the number of qualifying tuples is 100, 1 page will hold these tuples and scanning that page will cost 1 I/O.
  – So the total cost is 1 + 1 = 2 I/Os.


Implementation of Selection (summary)

Assuming $\sigma_{R.attr\ op\ value}(R)$

• No index is available on attr and R is not sorted on attr
  – Cost = M I/Os, where M is the number of pages in R.

• No index is available on attr and R is sorted on attr
  – Cost = $\log_2 M$ + T I/Os, where T is the number of pages read for retrieving the qualifying tuples.

• A B+ tree index (clustered) is available on attr
  – Cost = B + T I/Os, where B is the height of the index (i.e. 2).

• A hash index (clustered) is available on attr
  – If attr is not a primary key:
    • Cost = H + T I/Os, where H (i.e. 1) is the I/O required to obtain the rids of the qualifying tuples.
  – If attr is a primary key:
    • Cost = (H + 1) * TP I/Os, where TP is the number of qualifying tuples.
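The summary formulas can be compared side by side for the example query (M = 1000 pages, T = 1 page of qualifying tuples). This is a sketch under the lecture's assumptions: the function names are illustrative, the binary search cost is rounded up to a whole number of I/Os, and only the non-primary-key hash formula is shown.

```python
import math


def cost_file_scan(m):
    """No index, unsorted file: read every page."""
    return m


def cost_sorted_file(m, t):
    """No index, sorted file: binary search plus scan of qualifying pages."""
    return math.ceil(math.log2(m)) + t


def cost_btree(t, height=2):
    """Clustered B+ tree: traverse the index, then scan qualifying pages."""
    return height + t


def cost_hash(t, h=1):
    """Clustered hash index, attr not a primary key: index page plus data pages."""
    return h + t


M, T = 1000, 1
costs = {
    "file scan": cost_file_scan(M),
    "sorted file": cost_sorted_file(M, T),
    "B+ tree": cost_btree(T),
    "hash": cost_hash(T),
}
for path, cost in costs.items():
    print(f"{path}: {cost} I/Os")
```

For the example query this gives 1000, 11, 3 and 2 I/Os respectively, matching the per-slide calculations.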


Summary of the Lecture

• Query Optimization
  – What and why

• Query Processing
  – The various stages through which a query goes

• Translation of SQL into Relational Algebra
  – Internal representation of the query

• Access Paths
  – Different paths and ways to get the same data

• Implementation of the Selection Operator
  – Different ways of evaluating selection using different access paths
