file processing : query processing 2008, spring pusan national university ki-joune li

28
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

Upload: alfred-whitehead

Post on 18-Jan-2018

222 views

Category:

Documents


0 download

DESCRIPTION

STEMPNU Relational Operators : Select Selection (  condition )  Retrieve records satisfying predicates  Example Find Student where Student.Score > 3.5  score>3.5 (Student)  Index or Hash Select Predicate

TRANSCRIPT

Page 1: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

File Processing : Query Processing

2008, SpringPusan National UniversityKi-Joune Li

Page 2: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Basic Concepts of Query

Query Retrieve records satisfying predicates

Types of Query Operators Aggregate Query Sorting

Page 3: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Relational Operators : Select

Selection (condition) Retrieve records satisfying predicates Example

Find Student where Student.Score > 3.5 score>3.5(Student)

Index or Hash

Select

Predicate

Page 4: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Relational Operators : Project

Project (attributes) Extract interesting attributes Example

Find Student.name where score > 3.5

name(acore>3.5(Student))

Full Scan

Interesting attributes to get

Extract

Page 5: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Cartisan Product

Cartisan Product () Two Tables : R1 R2

Produce all cross products

Join ( )

r11

r12

r1m

R1

r21

r22

r2n

R2

=

r11

r11

r11

r21

r22

r2n

r12

r12

r12

r21

r22

r2n

r1m r21

r22

r2n

r1m

r1m

Page 6: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Join

Join ( ) Select combined records of cartisan product with same value of

a common attribute (Natural Join) Example

Student (StudentName, AdvisorProfessorID, Department, Score)Professor(ProfessorName, ProfessorID, Department)Student AdivsorProfessorID=ProfessorID Professor

= AdivsorProfessorID=ProfessorID(Student Professor)

Double Scan : Expensive Operation

Page 7: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Relational Algebra

Relational Algebra Operand : Table (Relation) Operator : Relational Operator (, , , etc) Example

Find Student Name where Student Score > 3.5 and Advisor Professor belongs to CSE Departmentstudent.name(acore>3.5(Student) Department=‘CSE’ (Professor) )

Relational Algebra Specifies the sequence of operations

Page 8: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Query Processing Mechanism

Query Processing Steps 1. Parsing and translation2. Optimization3. Evaluation

Page 9: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Parsing and Translation

Parsing Query Statement (e.g. in SQL) Translation into relational algebra Equivalent Expression

For a same query statement several relation algebraic expressions are possible

Example balance 2500(name(account )) name(balance 2500(account ))

Different execution schedules Query Execution Plan (QEP)

Determined by relational algebra Several QEPs may be produced by Parsing and Translation

Page 10: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Query Optimization

Choose ONE QEP among QEPs based on Execution Cost of each QEP, where cost means execution time

How to find cost of each QEP ? Real Execution

Exact but Not Feasible Cost Estimation

Types of Operations Number of Records Selectivity Distribution of data

Page 11: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Cost Model : Basic Concepts

Cost Model : Number of Block Accesses Cost

C = Cindex + Cdata

where Cindex : Cost for Index AccessCdata : Cost for Data Block Retrieval

Cindex vs. Cdata ? Cindex : depends on index Cdata

depends on selectivity Random Access or Sequential Access

Selectivity Number (or Ratio) of Objects Selected by Query

Page 12: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Cost Model : Type of Operations

Cost model for each type of operations Select Project Join Aggregate Query

Query Processing Method for each type of operations

Index/Hash or Not

Page 13: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Cost Model : Number of Records

Number of Records Nrecord Nblocks

Number of Scans Single Scan

O(N) : Linear Scan O(logN ) : Index

Multiple Scans O(NM ) : Multiple Linear Scans O(N logM ) : Multiple Scans with Index

Page 14: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Selectivity

Selectivity Affects on Cdata

Random Access Scattered on several blocks Nblock Nselected

Sequential Access Contiguously stored on blocks Nblock = Nselected / Bf

Page 15: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Selectivity Estimation

Selectivity Estimation Depends on Data Distribution Example

Q1 : Find students where 60 < weight < 70 Q2 : Find students where 80 < weight < 90

How to find the distribution Parametric Method

e.g. Gaussian Distribution No a priori knowledge

Non-Parametric Method e.g. Histogram Smoothing is necessary

Wavelet, Discrete Cosine30 40 50 60 70 80 90 100

Frequency

Page 16: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Select : Linear Search

Algorithm : linear search Scan each file block and test all records to see whether they satis

fy the selection condition. Cost estimate (number of disk blocks scanned) = br

br denotes number of blocks containing records from relation r If selection is on a key attribute (sorted), cost = (br /2)

stop on finding record Linear search can be applied regardless of

selection condition or ordering of records in the file, or availability of indices

Page 17: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Select : Range Search

Algorithm : primary index, comparison Relation is sorted on A For A V (r)

Step 1: use index to find first tuple v and Step 2: scan relation sequentially

For AV (r) just scan relation sequentially till first tuple > v; do not use index

Algorithm : secondary index, comparison For A V (r)

Step 1: use index to find first index entry v and Step 2: scan index sequentially to find pointers to records.

For AV (r) scan leaf nodes of index finding pointers to records, till first entry > v

Page 18: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Select : Range Search

Comparison between Searching with Index and Linear Search

Secondary Index retrieval of records that are pointed to requires an I/O for each record

Linear file scan may be cheaper if records are scattered on many blocks clustering is important for this reason

Page 19: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Select : Complex Query

Conjunction : 1 2 . . . n(r) Algorithm : selection using one index

Step 1: Select a combination of i (i (r) ) Step 2: Test other conditions on tuple after fetching it into memory b

uffer. Algorithm : selection using multiple-key index

Use appropriate multiple-attribute index if available. Algorithm : selection by intersection of identifiers

Step 1: Requires indices with record pointers. Step 2: Intersection of all the obtained sets of record pointers. Step 3: Then fetch records from file

Disjunction : 1 2 . . . n (r) Algorithm : Disjunctive selection by union of identifiers

Page 20: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Join Operation

Several different algorithms to implement joins Nested-loop join Block nested-loop join Indexed nested-loop join Merge-join Hash-join

Choice based on cost estimate Examples use the following information

Number of records of customer: 10,000 depositor: 5000 Number of blocks of customer: 400 depositor: 100

Page 21: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Nested-Loop Join

Algorithm NLJ the theta join r sFor each tuple tr in r do begin

For each tuple ts in s do begintest pair (tr,ts) to see if they satisfy the join cond

ition if they do, add tr • ts to the result.

EndEnd

r : outer relation, s : inner relation. No indices, any kind of join condition. Expensive

Page 22: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Nested-Loop Join : Performance Worst case

the estimated cost is nr bs + br disk accesses, if not enough memory only to hold one block of each relation,

Example 5000 400 + 100 = 2,000,100 disk accesses with depositor as outer

relation, and 1000 100 + 400 = 1,000,400 disk accesses with customer as the

outer relation.

If the smaller relation fits entirely in memory, use that as the inner relation. Reduces cost to br + bs disk accesses. If smaller relation (depositor) fits entirely in memory,

cost estimate will be 500 disk accesses.

Page 23: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Block Nested-Loop Join

Algoritm BNLJFor each block Br of r do

Get Block Br For each block Bs of s do

Get Block Bs For each tuple tr in Br do

For each tuple ts in Bs do Check if (tr, ts) satisfy the join condition if they do, add tr

• ts to the result.End

EndEnd

End

No disk access required

No disk access requiredDisk access happens here

Page 24: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Block Nested-Loop Join : Performance

Worst case Estimate: br bs + br block accesses. Each block in the inner relation s is read once for each block

in the outer relation (instead of once for each tuple in the outer relation)

Improvements : If M blocks can be buffered use (M-2) disk blocks as blocking unit for outer relations, use remaining two blocks to buffer inner relation and output Then the cost becomes br / (M-2) bs + br

Page 25: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Indexed Nested-Loop Join

Index lookups can replace file scans if join is an equi-join or natural join and an index is available on the inner relation’s join attribute

Can construct an index just to compute a join. Algorithm INLJ

For each block Br of r do

Get Block Br For each tuple tr in Br do

Search Index (IDXr , tr.key)

if found, add tr • ts to the result.

EndEnd

Page 26: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Indexed Nested-Loop Join : Performance

Worst case buffer has space for only one page of r, Cost of the join: br + nr c

Where c is the cost of traversing index and fetching matching tuple Number of matching tuples may be greater than one.

If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation

Page 27: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Example of Nested-Loop Join Costs Assume depositor customer,

with depositor as the outer relation. customer have a primary B+-tree index on the join attribute custo

mer-name, which contains 20 entries in each index node. customer has 10,000 tuples,

the height of the tree is 4, and one more access is needed to find the actual data

Depositor has 5000 tuples

Cost of block nested loops join 400*100 + 100 = 40,100 disk accesses assuming worst case memory

Cost of indexed nested loops join 100 + 5000 * 5 = 25,100 disk accesses.

Page 28: File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

STEMPNU

Hash-Join Applicable for equi-joins and natural joins. A hash function h is used to partition tuples of both relations

h : A→ { 0, 1, ..., n } r0, r1, . . ., rn : partitions of r tuples s0, s1. . ., sn : partitions of s tuples r tuples in ri need only to be compared with s tuples in si .