
File Processing : Query Processing

2008, Spring

Pusan National University

Ki-Joune Li

STEMPNU

Basic Concepts of Query

Query: retrieve records satisfying predicates

Types of query: operators, aggregate query, sorting


Relational Operators : Select

Selection σ(condition): retrieve records satisfying predicates

Example: Find Student where Student.Score > 3.5
  σ_{score > 3.5}(Student)

[Figure: Select applies the predicate to the table, using an index or hash]


Relational Operators : Project

Projection π(attributes): extract interesting attributes

Example: Find Student.name where score > 3.5
  π_{name}(σ_{score > 3.5}(Student))

[Figure: Project extracts the interesting attributes with a full scan]
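The two operators above can be sketched in a few lines of Python. This is only an illustration: the sample records and the helper names `select` and `project` are assumptions, and a table is modeled as a list of dictionaries.

```python
# Hypothetical sample data; only the attribute names come from the slides.
students = [
    {"name": "Kim", "score": 3.8},
    {"name": "Lee", "score": 3.2},
    {"name": "Park", "score": 3.6},
]

def select(records, predicate):
    """sigma: keep the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]

def project(records, attributes):
    """pi: keep only the listed attributes of every record."""
    return [{a: r[a] for a in attributes} for r in records]

# pi_{name}(sigma_{score > 3.5}(Student))
result = project(select(students, lambda r: r["score"] > 3.5), ["name"])
```

Here both helpers scan the whole list; an index or hash would only change how `select` finds its candidates.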


Cartesian Product

Cartesian Product (×) Two tables: R1 × R2

Produce all cross products

Join (⋈)

[Figure: R1 = (r11, r12, …, r1m) × R2 = (r21, r22, …, r2n) = every pairing (r1i, r2j): r11 with r21 … r2n, r12 with r21 … r2n, …, r1m with r21 … r2n]
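A minimal sketch of the cross product, assuming each table is a plain Python list and using the standard `itertools.product`; the record names are placeholders in the style of the figure.

```python
from itertools import product

# Hypothetical stand-ins for the m records of R1 and the n records of R2.
R1 = ["r11", "r12", "r13"]
R2 = ["r21", "r22"]

# Every record of R1 is paired with every record of R2,
# so the result holds m * n combined records.
cross = list(product(R1, R2))
```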


Join

Join (⋈): select the combined records of the Cartesian product that have the same value of a common attribute (Natural Join)

Example

Student(StudentName, AdvisorProfessorID, Department, Score)

Professor(ProfessorName, ProfessorID, Department)

Student ⋈_{AdvisorProfessorID = ProfessorID} Professor
  = σ_{AdvisorProfessorID = ProfessorID}(Student × Professor)

Double Scan: expensive operation
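The identity above (a join is a selection over the Cartesian product) can be illustrated as follows. The rows are invented for the example, and the Department attribute is left out of the sample rows so the merged dictionaries have no clashing keys.

```python
from itertools import product

# Hypothetical rows; the attribute names follow the two schemas above.
students = [
    {"StudentName": "Kim", "AdvisorProfessorID": 1, "Score": 3.8},
    {"StudentName": "Lee", "AdvisorProfessorID": 2, "Score": 3.2},
]
professors = [
    {"ProfessorName": "Choi", "ProfessorID": 1},
    {"ProfessorName": "Jung", "ProfessorID": 2},
]

# Scan every (student, professor) pair and keep the pairs whose IDs match.
joined = [
    {**s, **p}
    for s, p in product(students, professors)
    if s["AdvisorProfessorID"] == p["ProfessorID"]
]
```

All four pairs are examined to produce two result rows, which is why the double scan is expensive.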


Relational Algebra

Relational Algebra
  Operand: Table (Relation)
  Operator: Relational Operator (σ, π, ⋈, etc.)

Example: Find Student Name where Student Score > 3.5 and Advisor Professor belongs to CSE Department

  π_{student.name}(σ_{score > 3.5}(Student) ⋈ σ_{Department = 'CSE'}(Professor))

Relational algebra specifies the sequence of operations


Query Processing Mechanism

Query Processing Steps

1. Parsing and translation

2. Optimization

3. Evaluation


Parsing and Translation

Parsing: query statement (e.g. in SQL)
Translation into relational algebra

Equivalent expressions: for the same query statement, several relational algebra expressions are possible

Example
  σ_{balance < 2500}(π_{name}(account))
  π_{name}(σ_{balance < 2500}(account))

Different execution schedules

Query Execution Plan (QEP)
  Determined by relational algebra
  Several QEPs may be produced by parsing and translation


Query Optimization

Choose ONE QEP among the QEPs, based on the execution cost of each QEP, where cost means execution time

How to find the cost of each QEP?
  Real execution: exact but not feasible
  Cost estimation, based on
    Types of operations
    Number of records
    Selectivity
    Distribution of data


Cost Model : Basic Concepts

Cost model: number of block accesses

Cost
  C = C_index + C_data
  where C_index: cost for index access
        C_data: cost for data block retrieval

C_index vs. C_data?
  C_index: depends on the index
  C_data: depends on selectivity, and on random access or sequential access

Selectivity: number (or ratio) of objects selected by the query


Cost Model : Type of Operations

Cost model for each type of operation
  Select
  Project
  Join
  Aggregate Query

Query processing method for each type of operation
  Index/Hash or not


Cost Model : Number of Records

Number of records: N_record, N_blocks

Number of scans
  Single scan
    O(N): linear scan
    O(log N): index
  Multiple scans
    O(N M): multiple linear scans
    O(N log M): multiple scans with index


Selectivity

Selectivity affects C_data

Random access: selected records are scattered over several blocks
  N_block ≈ N_selected

Sequential access: selected records are stored contiguously on blocks
  N_block = N_selected / Bf   (Bf: blocking factor, records per block)
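As a rough illustration of the two cases, the following sketch estimates the number of data blocks read. The function name `data_blocks` and the sample figures (1,000 selected records, 25 records per block) are assumptions, and the random case assumes one block access per selected record.

```python
import math

def data_blocks(n_selected: int, bf: int, sequential: bool) -> int:
    """Estimate C_data, the number of data blocks read for n_selected records.

    bf is the blocking factor (records per block).
    """
    if sequential:
        # Contiguous records share blocks: N_selected / Bf, rounded up.
        return math.ceil(n_selected / bf)
    # Scattered records: assume one block access per selected record.
    return n_selected

random_cost = data_blocks(1000, 25, sequential=False)
sequential_cost = data_blocks(1000, 25, sequential=True)
```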


Selectivity Estimation

Selectivity estimation depends on the data distribution

Example
  Q1: Find students where 60 < weight < 70
  Q2: Find students where 80 < weight < 90

How to find the distribution
  Parametric method
    e.g. Gaussian distribution
    No a priori knowledge
  Non-parametric method
    e.g. histogram
    Smoothing is necessary
      Wavelet, Discrete Cosine

[Figure: frequency histogram, horizontal axis from 30 to 100]
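A histogram-based estimate for Q1 and Q2 might look like the sketch below. The bucket counts are invented for illustration, the buckets are assumed to be equi-width, and values are assumed to be spread uniformly inside each bucket; no smoothing is applied.

```python
# Hypothetical equi-width histogram: bucket [start, start + 10) -> number of students.
histogram = {30: 5, 40: 20, 50: 60, 60: 90, 70: 50, 80: 20, 90: 5}
BUCKET_WIDTH = 10

def estimate_selected(low: float, high: float) -> float:
    """Estimate how many records fall between low and high, assuming values
    are spread uniformly inside each bucket."""
    total = 0.0
    for start, count in histogram.items():
        end = start + BUCKET_WIDTH
        overlap = max(0.0, min(high, end) - max(low, start))
        total += count * overlap / BUCKET_WIDTH
    return total

n_total = sum(histogram.values())
q1 = estimate_selected(60, 70) / n_total  # selectivity of Q1 as a ratio
q2 = estimate_selected(80, 90) / n_total  # selectivity of Q2 as a ratio
```

With these made-up counts Q1 selects far more records than Q2, although both ranges have the same width.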


Select : Linear Search

Algorithm: linear search
  Scan each file block and test all records to see whether they satisfy the selection condition.

Cost estimate (number of disk blocks scanned) = b_r
  b_r denotes the number of blocks containing records from relation r

If the selection is on a key attribute (sorted), cost = b_r / 2 on average: stop on finding the record

Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices
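The scan and its block count can be sketched as follows, assuming a file is a list of blocks and each block a list of records. The `key_lookup` flag is an illustrative name for the "stop on finding the record" case.

```python
def linear_search(blocks, predicate, key_lookup=False):
    """Scan blocks in order; return (matching records, blocks read).

    With key_lookup=True at most one record can match, so stop at the first hit.
    """
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_lookup:
                    return matches, blocks_read
    return matches, blocks_read

# Hypothetical file: 4 blocks of 3 records each, keys 0..11.
file_blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
_, full_cost = linear_search(file_blocks, lambda k: k % 2 == 0)
found, key_cost = linear_search(file_blocks, lambda k: k == 4, key_lookup=True)
```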


Select : Range Search

Algorithm: primary index, comparison (relation is sorted on A)

For σ_{A ≥ v}(r)
  Step 1: use the index to find the first tuple ≥ v
  Step 2: scan the relation sequentially from there

For σ_{A ≤ v}(r)
  Just scan the relation sequentially till the first tuple > v; do not use the index

Algorithm: secondary index, comparison

For σ_{A ≥ v}(r)
  Step 1: use the index to find the first index entry ≥ v
  Step 2: scan the index sequentially from there to find pointers to records

For σ_{A ≤ v}(r)
  Scan the leaf nodes of the index, finding pointers to records, till the first entry > v
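For the primary-index case, a small sketch on a sorted list may help. Here the standard `bisect_left` stands in for the index lookup, and the attribute values are invented.

```python
from bisect import bisect_left

# Hypothetical relation sorted on attribute A; bisect stands in for the index lookup.
A_values = [10, 20, 20, 35, 40, 55, 70]

def select_ge(v):
    """sigma_{A >= v}: locate the first tuple >= v, then scan sequentially to the end."""
    start = bisect_left(A_values, v)
    return A_values[start:]

def select_le(v):
    """sigma_{A <= v}: scan from the beginning until the first tuple > v; no index needed."""
    result = []
    for a in A_values:
        if a > v:
            break
        result.append(a)
    return result
```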


Select : Range Search

Comparison between Searching with Index and Linear Search

Secondary index: retrieval of the records that are pointed to requires an I/O for each record

A linear file scan may be cheaper if the records are scattered over many blocks; clustering is important for this reason
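The trade-off can be made concrete with a rough comparison in block accesses. The function name and the index traversal cost of 4 are assumptions; 400 is the number of customer blocks used in the join examples.

```python
def cheaper_plan(n_selected: int, index_cost: int, b_r: int) -> str:
    """Compare a secondary-index plan with a linear scan by block accesses."""
    # Secondary index: traverse the index, then one I/O per selected record.
    index_plan = index_cost + n_selected
    # Linear scan: read every block of the relation once.
    return "index" if index_plan < b_r else "linear scan"

few = cheaper_plan(n_selected=20, index_cost=4, b_r=400)
many = cheaper_plan(n_selected=2000, index_cost=4, b_r=400)
```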


Select : Complex Query

Conjunction: σ_{θ1 ∧ θ2 ∧ … ∧ θn}(r)

Algorithm: selection using one index
  Step 1: Select a combination of θi (σ_{θi}(r))
  Step 2: Test the other conditions on each tuple after fetching it into the memory buffer.

Algorithm: selection using multiple-key index
  Use an appropriate multiple-attribute index if available.

Algorithm: selection by intersection of identifiers
  Step 1: Requires indices with record pointers.
  Step 2: Intersection of all the obtained sets of record pointers.
  Step 3: Then fetch records from file

Disjunction: σ_{θ1 ∨ θ2 ∨ … ∨ θn}(r)

Algorithm: disjunctive selection by union of identifiers
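Selection by intersection or union of identifiers can be sketched with Python sets. The two indices, their attributes, and the records are all hypothetical; each index maps an attribute value to a set of record pointers.

```python
# Hypothetical secondary indices: attribute value -> set of record pointers (ids).
dept_index = {"CSE": {1, 2, 5}, "EE": {3, 4}}
year_index = {2007: {2, 3, 5}, 2008: {1, 4}}
records = {1: "Kim", 2: "Lee", 3: "Park", 4: "Choi", 5: "Jung"}

# Conjunction: intersect the pointer sets, then fetch only the surviving records.
both = dept_index["CSE"] & year_index[2007]
conjunction = sorted(records[rid] for rid in both)

# Disjunction: union the pointer sets, then fetch.
either = dept_index["CSE"] | year_index[2007]
disjunction = sorted(records[rid] for rid in either)
```

The records themselves are touched only after the pointer sets have been combined.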


Join Operation

Several different algorithms to implement joins
  Nested-loop join
  Block nested-loop join
  Indexed nested-loop join
  Merge-join
  Hash-join

Choice based on cost estimate

Examples use the following information
  Number of records of customer: 10,000, depositor: 5,000
  Number of blocks of customer: 400, depositor: 100


Nested-Loop Join

Algorithm NLJ: the theta join r ⋈_θ s

For each tuple tr in r do begin
  For each tuple ts in s do begin
    test pair (tr, ts) to see if they satisfy the join condition
    if they do, add tr • ts to the result.
  End
End

r: outer relation, s: inner relation. No indices needed; works with any kind of join condition. Expensive.
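A direct Python transcription of the loop above, as a sketch: relations are lists of tuples, `theta` is any two-argument predicate, and the sample tuples are invented.

```python
def nested_loop_join(r, s, theta):
    """Theta join of r (outer) and s (inner) by testing every pair of tuples."""
    result = []
    for tr in r:
        for ts in s:
            if theta(tr, ts):
                result.append(tr + ts)  # tr • ts: concatenate the two tuples
    return result

# Hypothetical tuples: depositor(customer_name, account), customer(customer_name, city).
depositor = [("Kim", "A-101"), ("Lee", "A-215")]
customer = [("Kim", "Busan"), ("Lee", "Seoul"), ("Park", "Daegu")]
joined = nested_loop_join(depositor, customer, lambda tr, ts: tr[0] == ts[0])
```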


Nested-Loop Join : Performance

Worst case: if there is only enough memory to hold one block of each relation, the estimated cost is n_r × b_s + b_r disk accesses.

Example
  5,000 × 400 + 100 = 2,000,100 disk accesses with depositor as the outer relation, and
  10,000 × 100 + 400 = 1,000,400 disk accesses with customer as the outer relation.

If the smaller relation fits entirely in memory, use that as the inner relation. This reduces the cost to b_r + b_s disk accesses.
  If the smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 disk accesses.
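The arithmetic of the example can be checked with a short script. The function name `nlj_cost` is an assumption; the record and block counts are the ones given for customer and depositor.

```python
def nlj_cost(n_outer: int, b_outer: int, b_inner: int) -> int:
    """Worst-case nested-loop join cost: n_r * b_s + b_r block accesses."""
    return n_outer * b_inner + b_outer

# Figures from the slides: customer has 10,000 records in 400 blocks,
# depositor has 5,000 records in 100 blocks.
depositor_outer = nlj_cost(n_outer=5000, b_outer=100, b_inner=400)
customer_outer = nlj_cost(n_outer=10000, b_outer=400, b_inner=100)
in_memory = 400 + 100  # b_r + b_s when the smaller relation fits in memory
```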


Block Nested-Loop Join

Algorithm BNLJ

For each block Br of r do
  Get Block Br
  For each block Bs of s do
    Get Block Bs
    For each tuple tr in Br do
      For each tuple ts in Bs do
        Check if (tr, ts) satisfy the join condition
        if they do, add tr • ts to the result.
      End
    End
  End
End

Disk access happens at the Get Block steps. The two tuple loops (over Br and over Bs) require no disk access.
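The same algorithm in Python, with a counter that makes the disk accesses visible. This is a sketch under the assumption that a relation is a list of blocks and that each `Get Block` costs one read; the sample blocks are invented.

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Join two relations stored as lists of blocks; also count block reads."""
    result, block_reads = [], 0
    for br in r_blocks:
        block_reads += 1              # Get Block Br
        for bs in s_blocks:
            block_reads += 1          # Get Block Bs
            for tr in br:             # in-memory work only from here on
                for ts in bs:
                    if theta(tr, ts):
                        result.append(tr + ts)
    return result, block_reads

# Hypothetical relations: 2 blocks of r and 3 blocks of s, one-attribute tuples.
r_blocks = [[(1,), (2,)], [(3,), (4,)]]
s_blocks = [[(2,), (9,)], [(4,), (8,)], [(1,), (7,)]]
pairs, reads = block_nested_loop_join(
    r_blocks, s_blocks, lambda tr, ts: tr[0] == ts[0]
)
```

With 2 outer blocks and 3 inner blocks the counter ends at 2 × 3 + 2 reads.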


Block Nested-Loop Join : Performance

Worst case estimate: b_r × b_s + b_r block accesses.
  Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation)

Improvements: if M blocks can be buffered
  Use (M - 2) disk blocks as the blocking unit for the outer relation, and use the remaining two blocks to buffer the inner relation and the output

Then the cost becomes ⌈b_r / (M - 2)⌉ × b_s + b_r
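A small calculation shows the effect of extra buffer blocks. The function name `bnlj_cost` and the value M = 12 are assumptions; the block counts are those of depositor (outer) and customer (inner).

```python
import math

def bnlj_cost(b_outer: int, b_inner: int, m: int = 3) -> int:
    """Block nested-loop join cost with m buffer blocks (m >= 3).

    m - 2 blocks hold a chunk of the outer relation; the other two buffer
    the inner relation and the output. m = 3 gives b_r * b_s + b_r.
    """
    return math.ceil(b_outer / (m - 2)) * b_inner + b_outer

worst = bnlj_cost(b_outer=100, b_inner=400)           # depositor as outer
buffered = bnlj_cost(b_outer=100, b_inner=400, m=12)  # M = 12 is an assumed value
```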


Indexed Nested-Loop Join

Index lookups can replace file scans if the join is an equi-join or natural join and an index is available on the inner relation's join attribute
  Can construct an index just to compute a join.

Algorithm INLJ

For each block Br of r do
  Get Block Br
  For each tuple tr in Br do
    Search Index (IDXr , tr.key)
    if found, add tr • ts to the result.
  End
End
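A sketch of the indexed variant, where a Python dictionary stands in for the index on the inner relation (a real system would typically use a B+-tree or hash index). The index is built on the fly here, in the spirit of constructing an index just to compute a join; the function name, parameters, and sample tuples are assumptions.

```python
from collections import defaultdict

def indexed_nested_loop_join(r, s, r_key, s_key):
    """Equi-join: probe an index on the inner relation s once per outer tuple."""
    # Build the index on s's join attribute (a dict stands in for a B+-tree).
    index = defaultdict(list)
    for ts in s:
        index[ts[s_key]].append(ts)

    result = []
    for tr in r:
        # The index lookup replaces a full scan of s.
        for ts in index.get(tr[r_key], []):
            result.append(tr + ts)
    return result

depositor = [("Kim", "A-101"), ("Lee", "A-215"), ("Kim", "A-305")]
customer = [("Kim", "Busan"), ("Lee", "Seoul")]
joined = indexed_nested_loop_join(depositor, customer, r_key=0, s_key=0)
```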


Indexed Nested-Loop Join : Performance

Worst case: the buffer has space for only one page of r

Cost of the join: b_r + n_r × c
  where c is the cost of traversing the index and fetching the matching tuple
  The number of matching tuples may be greater than one.

If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation


Example of Nested-Loop Join Costs

Assume depositor ⋈ customer, with depositor as the outer relation.
  customer has a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node.
  customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
  depositor has 5,000 tuples

Cost of block nested-loop join
  400 × 100 + 100 = 40,100 disk accesses, assuming worst-case memory

Cost of indexed nested-loop join
  100 + 5,000 × 5 = 25,100 disk accesses.
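The two figures follow from the formulas b_r × b_s + b_r and b_r + n_r × c, as this short check shows. The variable names are illustrative; all numbers come from the example.

```python
b_depositor, n_depositor = 100, 5000   # outer relation
b_customer = 400                       # inner relation
tree_height = 4                        # index nodes visited per lookup
c = tree_height + 1                    # plus one access for the data block

block_nlj = b_depositor * b_customer + b_depositor
indexed_nlj = b_depositor + n_depositor * c
```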


Hash-Join

Applicable for equi-joins and natural joins.

A hash function h is used to partition the tuples of both relations
  h: A → {0, 1, ..., n}
  r0, r1, ..., rn: partitions of r tuples
  s0, s1, ..., sn: partitions of s tuples

r tuples in ri need only be compared with s tuples in si.
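The partitioning idea can be sketched as below. Python's built-in `hash` modulo n + 1 plays the role of h, partitions are kept in memory as lists, and the function name and parameters are assumptions; a real hash join would write partitions to disk and build an in-memory hash table per partition.

```python
def hash_join(r, s, r_key, s_key, n=3):
    """Equi-join by partitioning both relations with the same hash function."""
    def h(value):
        return hash(value) % (n + 1)      # h: A -> {0, 1, ..., n}

    r_parts = [[] for _ in range(n + 1)]
    s_parts = [[] for _ in range(n + 1)]
    for tr in r:
        r_parts[h(tr[r_key])].append(tr)
    for ts in s:
        s_parts[h(ts[s_key])].append(ts)

    result = []
    for ri, si in zip(r_parts, s_parts):  # compare ri only with si
        for tr in ri:
            for ts in si:
                if tr[r_key] == ts[s_key]:
                    result.append(tr + ts)
    return result
```

Tuples with equal join values always hash to the same partition number, so no matching pair is missed by comparing only ri with si.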
