File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li

Post on 02-Jan-2016

TRANSCRIPT

  • File Processing : Query Processing, 2008 Spring, Pusan National University, Ki-Joune Li

    STEMPNU

    Basic Concepts of Query
      Query: retrieve records satisfying predicates
      Types of Query: Operators, Aggregate Query, Sorting

    Relational Operators : Select
      Selection $\sigma_{condition}$: retrieve records satisfying predicates
      Example: Find Student where Student.Score > 3.5
        $\sigma_{score > 3.5}(Student)$

    Index or Hash

    Relational Operators : Project
      Project $\pi_{attributes}$: extract interesting attributes
      Example: Find Student.name where score > 3.5
        $\pi_{name}(\sigma_{score > 3.5}(Student))$

    Full Scan
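
The select and project examples above can be sketched in Python. This is a minimal illustration that assumes a table is simply a list of dicts; the helper names `select` and `project` and the sample rows are hypothetical and do not come from the slides.

```python
# Hypothetical in-memory "table": a list of records (dicts).
students = [
    {"name": "Kim", "score": 3.8},
    {"name": "Lee", "score": 3.2},
    {"name": "Park", "score": 3.6},
]

def select(table, predicate):
    """sigma: keep the records that satisfy the predicate (a full scan here)."""
    return [row for row in table if predicate(row)]

def project(table, attributes):
    """pi: keep only the listed attributes of every record."""
    return [{a: row[a] for a in attributes} for row in table]

# pi_name(sigma_{score > 3.5}(Student))
result = project(select(students, lambda r: r["score"] > 3.5), ["name"])
```

Here `select` always scans the whole list; an index or hash structure would let it skip records that cannot match.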

    Cartesian Product
      Cartesian Product ($\times$)
      Two tables: $R_1 \times R_2$
      Produces all cross products

    Join
      Join ($\bowtie$): select the combined records of the Cartesian product that have the same value of a common attribute (Natural Join)
      Example
        Student(StudentName, AdvisorProfessorID, Department, Score)
        Professor(ProfessorName, ProfessorID, Department)
        $Student \bowtie_{AdvisorProfessorID = ProfessorID} Professor = \sigma_{AdvisorProfessorID = ProfessorID}(Student \times Professor)$

    Double Scan : Expensive Operation
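
The definition of a join as a selection over the Cartesian product can be written out directly. The sketch below assumes tables are lists of dicts; the sample rows and the function names are hypothetical.

```python
from itertools import product

# Hypothetical sample data using the attribute names from the example.
students = [
    {"StudentName": "Kim", "AdvisorProfessorID": 1},
    {"StudentName": "Lee", "AdvisorProfessorID": 2},
]
professors = [
    {"ProfessorName": "Prof A", "ProfessorID": 1},
    {"ProfessorName": "Prof B", "ProfessorID": 3},
]

def cartesian_product(r1, r2):
    """R1 x R2: every record of r1 combined with every record of r2."""
    return [{**a, **b} for a, b in product(r1, r2)]

def join(r1, r2, condition):
    """Join expressed as a selection over the Cartesian product."""
    return [row for row in cartesian_product(r1, r2) if condition(row)]

joined = join(students, professors,
              lambda row: row["AdvisorProfessorID"] == row["ProfessorID"])
```

Building the full product first is what makes this naive form expensive: every pair of records is produced before any is filtered out.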

    Relational Algebra
      Operand: Table (Relation)
      Operator: Relational Operator ($\sigma$, $\pi$, $\bowtie$, etc.)
      Example: Find Student Name where Student Score > 3.5 and Advisor Professor belongs to CSE Department
        $\pi_{student.name}(\sigma_{score > 3.5}(Student) \bowtie \sigma_{Department = CSE}(Professor))$

    Relational Algebra Specifies the sequence of operations

    Query Processing Mechanism
      Query Processing Steps
        1. Parsing and translation
        2. Optimization
        3. Evaluation

    Parsing and Translation
      Parsing the query statement (e.g. in SQL)
      Translation into relational algebra
      Equivalent Expressions
        For the same query statement, several relational algebra expressions are possible
        Example: $\sigma_{balance < 2500}(\pi_{name}(account)) \equiv \pi_{name}(\sigma_{balance < 2500}(account))$
        Different execution schedules
      Query Execution Plan (QEP)
        Determined by the relational algebra expression
        Several QEPs may be produced by parsing and translation

    Query Optimization
      Choose ONE QEP among the candidate QEPs based on the execution cost of each QEP, where cost means execution time

      How to find the cost of each QEP?
        Real execution: exact but not feasible
        Cost estimation, based on
          Types of operations
          Number of records
          Selectivity
          Distribution of data

    Cost Model : Basic Concepts
      Cost model: number of block accesses
      Cost $C = C_{index} + C_{data}$, where
        $C_{index}$: cost for index access
        $C_{data}$: cost for data block retrieval
      $C_{index}$ vs. $C_{data}$?
        $C_{index}$ depends on the index
        $C_{data}$ depends on selectivity and on random vs. sequential access
      Selectivity: number (or ratio) of objects selected by the query

    Cost Model : Type of Operations
      Cost model for each type of operation
        Select
        Project
        Join
        Aggregate Query

    Query Processing Method for each type of operations

    Index/Hash or Not

    Cost Model : Number of Records
      Number of records: $N_{record}$, $N_{blocks}$

      Number of scans
        Single scan
          $O(N)$: linear scan
          $O(\log N)$: index
        Multiple scans
          $O(NM)$: multiple linear scans
          $O(N \log M)$: multiple scans with index

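    Selectivity
      Affects $C_{data}$
      Random access: the selected records are scattered over several blocks
        $N_{block} \approx N_{selected}$
      Sequential access: the selected records are stored contiguously on blocks
        $N_{block} = N_{selected} / B_f$, where $B_f$ is the blocking factor (records per block)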
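
As a rough illustration of how the access pattern changes the data cost, the sketch below computes the number of data blocks for both cases. It assumes one block access per selected record in the random case and rounds the sequential case up to whole blocks; the function names and the sample numbers are hypothetical.

```python
import math

def data_blocks_random(n_selected):
    """Scattered records: roughly one block access per selected record."""
    return n_selected

def data_blocks_sequential(n_selected, bf):
    """Contiguous records: Nselected / Bf blocks, rounded up to whole blocks."""
    return math.ceil(n_selected / bf)

# Hypothetical numbers: 200 selected records, 25 records per block.
random_cost = data_blocks_random(200)
sequential_cost = data_blocks_sequential(200, 25)
```

With these numbers the scattered layout needs 200 block accesses while the contiguous layout needs 8.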

    Selectivity Estimation
      Depends on data distribution
        Example
          Q1: Find students where 60 < weight < 70
          Q2: Find students where 80 < weight < 90
      How to find the distribution
        Parametric method, e.g. Gaussian distribution
          No a priori knowledge
        Non-parametric method, e.g. histogram
          Smoothing is necessary
          Wavelet, Discrete Cosine
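
A histogram-based estimate for queries such as Q1 and Q2 might look like the following. This is only a sketch: it assumes an equi-width histogram and a uniform spread of values inside each bucket, and the weight data, bucket boundaries, and function names are all hypothetical.

```python
def build_histogram(values, low, high, n_buckets):
    """Equi-width histogram: count of values per bucket over [low, high)."""
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        if low <= v < high:
            counts[int((v - low) / width)] += 1
    return counts, width

def estimate_selectivity(counts, width, low, a, b):
    """Estimated fraction of records with a < value < b, assuming values
    are spread uniformly inside each bucket."""
    total = sum(counts)
    selected = 0.0
    for i, c in enumerate(counts):
        lo, hi = low + i * width, low + (i + 1) * width
        overlap = max(0.0, min(b, hi) - max(a, lo))
        selected += c * overlap / width
    return selected / total if total else 0.0

# Hypothetical weights; five buckets of width 10 over [50, 100).
weights = [55, 62, 64, 66, 68, 71, 75, 83, 88, 95]
counts, width = build_histogram(weights, 50, 100, 5)
q1 = estimate_selectivity(counts, width, 50, 60, 70)
q2 = estimate_selectivity(counts, width, 50, 80, 90)
```

The two queries cover ranges of the same width but get different estimates because more of the sample weights fall between 60 and 70 than between 80 and 90.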

    Select : Linear Search
      Algorithm: linear search
        Scan each file block and test all records to see whether they satisfy the selection condition.
        Cost estimate (number of disk blocks scanned) = $b_r$, where $b_r$ denotes the number of blocks containing records from relation $r$
        If the selection is on a key attribute (sorted), the average cost is $b_r / 2$, since the scan stops on finding the record
        Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices

    Select : Range Search
      Algorithm: primary index, comparison (relation is sorted on A)
        For $\sigma_{A \ge v}(r)$: Step 1: use the index to find the first tuple $\ge v$; Step 2: scan the relation sequentially from there
        For $\sigma_{A \le v}(r)$: just scan the relation sequentially until the first tuple $> v$; do not use the index
      Algorithm: secondary index, comparison
        For $\sigma_{A \ge v}(r)$: Step 1: use the index to find the first index entry $\ge v$; Step 2: scan the index sequentially from there to find pointers to records
        For $\sigma_{A \le v}(r)$: scan the leaf nodes of the index, collecting pointers to records, until the first entry $> v$
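
The primary-index case can be sketched with a sorted list. The example assumes the relation is a Python list sorted on A and uses binary search over the key list as a stand-in for the primary index; the data and function names are hypothetical.

```python
from bisect import bisect_left

# Relation sorted on attribute A; the sorted key list stands in for
# a primary index (an assumption made for this sketch).
relation = [(10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e")]
keys = [a for a, _ in relation]

def select_ge(v):
    """sigma_{A >= v}: locate the first tuple >= v, then scan to the end."""
    start = bisect_left(keys, v)
    return relation[start:]

def select_le(v):
    """sigma_{A <= v}: scan from the start until the first tuple > v."""
    result = []
    for row in relation:
        if row[0] > v:
            break
        result.append(row)
    return result
```

For example, `select_ge(35)` starts at the tuple with key 40, while `select_le(30)` never consults the key list and stops at the first tuple with a key above 30.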

    Select : Range Search
      Comparison between searching with an index and linear search
      Secondary index: retrieving the records that are pointed to requires an I/O for each record
      Linear file scan may be cheaper if the records are scattered over many blocks; clustering is important for this reason

    Select : Complex Query
      Conjunction: $\sigma_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n}(r)$
        Algorithm: selection using one index
          Step 1: Select one condition $\theta_i$ and evaluate $\sigma_{\theta_i}(r)$ using its index
          Step 2: Test the other conditions on each tuple after fetching it into the memory buffer
        Algorithm: selection using multiple-key index
          Use an appropriate multiple-attribute index if available
        Algorithm: selection by intersection of identifiers
          Step 1: Use an index with record pointers for each condition to obtain a set of record pointers
          Step 2: Intersect all the obtained sets of record pointers
          Step 3: Fetch the records from the file

      Disjunction: $\sigma_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n}(r)$
        Algorithm: disjunctive selection by union of identifiers
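
Selection by intersection or union of identifiers maps naturally onto set operations. The sketch below assumes each index maps an attribute value to a set of record ids; the records, the two indices, and the function names are hypothetical.

```python
# Hypothetical records keyed by record id, plus two secondary indices
# that map an attribute value to the set of matching record ids.
records = {
    1: {"dept": "CSE", "year": 3},
    2: {"dept": "CSE", "year": 4},
    3: {"dept": "EE", "year": 3},
}
dept_index = {"CSE": {1, 2}, "EE": {3}}
year_index = {3: {1, 3}, 4: {2}}

def conjunctive_select(dept, year):
    """dept AND year: intersect the pointer sets, then fetch the records."""
    rids = dept_index.get(dept, set()) & year_index.get(year, set())
    return [records[rid] for rid in sorted(rids)]

def disjunctive_select(dept, year):
    """dept OR year: union the pointer sets, then fetch the records."""
    rids = dept_index.get(dept, set()) | year_index.get(year, set())
    return [records[rid] for rid in sorted(rids)]
```

In both cases the records themselves are fetched only after the pointer sets have been combined, so no record is read that the combined condition rules out.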

    Join Operation
      Several different algorithms to implement joins
        Nested-loop join
        Block nested-loop join
        Indexed nested-loop join
        Merge-join
        Hash-join
      Choice based on cost estimate
      Examples use the following information
        Number of records: customer 10,000; depositor 5,000
        Number of blocks: customer 400; depositor 100

    Nested-Loop Join
      Algorithm NLJ: the theta join $r \bowtie_\theta s$
        For each tuple tr in r do begin
          For each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition
            if they do, add tr · ts to the result
          End
        End
      r: outer relation, s: inner relation
      No indices required; works with any kind of join condition
      Expensive
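
A runnable version of the NLJ pseudocode, as a minimal sketch that assumes tuples are Python tuples and the join condition is passed as a function; the sample relations are hypothetical.

```python
def nested_loop_join(r, s, condition):
    """Theta join of r (outer) and s (inner) by testing every pair."""
    result = []
    for tr in r:                 # outer relation
        for ts in s:             # inner relation, rescanned per outer tuple
            if condition(tr, ts):
                result.append(tr + ts)   # concatenate the two tuples
    return result

# Hypothetical sample data: (customer-name, account) and (customer-name, city).
depositor = [("Kim", "A-101"), ("Lee", "A-215")]
customer = [("Kim", "Busan"), ("Park", "Seoul")]
joined = nested_loop_join(depositor, customer, lambda tr, ts: tr[0] == ts[0])
```

Because the condition is an arbitrary function, the same loop handles equality and inequality conditions alike, which is why no index is needed.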

    Nested-Loop Join : Performance
      Worst case: the estimated cost is $n_r \times b_s + b_r$ disk accesses, where $n_r$ is the number of tuples in r, if there is only enough memory to hold one block of each relation

      Example: $5000 \times 400 + 100 = 2{,}000{,}100$ disk accesses with depositor as the outer relation, and $10000 \times 100 + 400 = 1{,}000{,}400$ disk accesses with customer as the outer relation

      If the smaller relation fits entirely in memory, use it as the inner relation. This reduces the cost to $b_r + b_s$ disk accesses. If the smaller relation (depositor) fits entirely in memory, the cost estimate is $400 + 100 = 500$ disk accesses.
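
The figures above can be reproduced with two small helper functions. The function names are hypothetical; the record and block counts are the ones given for customer and depositor.

```python
def nlj_worst_cost(n_outer, b_outer, b_inner):
    """n_r * b_s + b_r: the inner relation is scanned once per outer tuple."""
    return n_outer * b_inner + b_outer

def nlj_best_cost(b_outer, b_inner):
    """b_r + b_s: the inner relation fits in memory and is read only once."""
    return b_outer + b_inner

# customer: 10,000 records in 400 blocks; depositor: 5,000 records in 100 blocks.
depositor_outer = nlj_worst_cost(5000, 100, 400)
customer_outer = nlj_worst_cost(10000, 400, 100)
in_memory = nlj_best_cost(400, 100)
```

The worst-case formula is dominated by the product term, so the choice of outer relation changes the estimate by about a factor of two here.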

    Block Nested-Loop Join
      Algorithm BNLJ
        For each block Br of r do
          Get block Br
          For each block Bs of s do
            Get block Bs
            For each tuple tr in Br do
              For each tuple ts in Bs do
                Check if (tr, ts) satisfy the join condition
                if they do, add tr · ts to the result
              End
            End
          End
        End
      The two inner tuple loops work on blocks already in memory: no disk access required
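
A runnable sketch of BNLJ that also counts block reads. It assumes each relation is given as a list of blocks and each block is a list of tuples; the counter, the function name, and the sample data are hypothetical additions for illustration.

```python
def block_nested_loop_join(r_blocks, s_blocks, condition):
    """Join block by block; each element of r_blocks and s_blocks is one
    block, modeled as a list of tuples."""
    result = []
    block_reads = 0
    for br in r_blocks:              # read one outer block
        block_reads += 1
        for bs in s_blocks:          # read every inner block per outer block
            block_reads += 1
            for tr in br:            # in-memory work from here on
                for ts in bs:
                    if condition(tr, ts):
                        result.append(tr + ts)
    return result, block_reads

# Hypothetical data: r has 2 blocks, s has 3 blocks.
r_blocks = [[(1, "a"), (2, "b")], [(3, "c")]]
s_blocks = [[(1, "x")], [(3, "y")], [(4, "z")]]
joined, reads = block_nested_loop_join(
    r_blocks, s_blocks, lambda tr, ts: tr[0] == ts[0])
```

The read counter increases only in the two block-level loops, which matches the point that the tuple-level comparisons cost no disk access.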

    Block Nested-Loop Join : Performance
      Worst case estimate: $b_r \times b_s + b_r$ block accesses. Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation).

      Improvement: if M blocks can be buffered, use (M-2) disk blocks as the blocking unit for the outer relation, and use the remaining two blocks to buffer the inner relation and the output. Then the cost becomes $\lceil b_r / (M-2) \rceil \times b_s + b_r$
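
The effect of the buffer size M on the estimate can be checked with a short function. This is a sketch: the function name is hypothetical, the chunk count is rounded up to a whole number, and the value M = 52 is an arbitrary illustration rather than a figure from the slides.

```python
import math

def bnlj_cost(b_outer, b_inner, m=3):
    """ceil(b_r / (M - 2)) * b_s + b_r block accesses with M buffer blocks.
    M = 3 gives the worst case b_r * b_s + b_r."""
    return math.ceil(b_outer / (m - 2)) * b_inner + b_outer

# depositor (100 blocks) as outer, customer (400 blocks) as inner.
worst = bnlj_cost(100, 400)            # M = 3
buffered = bnlj_cost(100, 400, m=52)   # hypothetical buffer of 52 blocks
```

With 52 buffer blocks the outer relation is processed in two chunks of 50 blocks, so the inner relation is scanned twice instead of 100 times.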

    Indexed Nested-Loop Join
      Index lookups can replace file scans if
        the join is an equi-join or natural join, and
        an index is available on the inner relation's join attribute
      An index can be constructed just to compute a join
      Algorithm INLJ
        For each block Br of r do
          Get block Br
          For each tuple tr in Br do
            Search Index (IDXs, tr.key)
            if found, add tr · ts to the result
          End
        End
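
A runnable sketch of INLJ in which a Python dict plays the role of the index on the inner relation. The index construction, the function names, and the sample data are hypothetical; a real system would use a B+-tree or hash index on disk.

```python
from collections import defaultdict

def build_index(s, key_pos):
    """Hypothetical in-memory index on the inner relation: key -> tuples."""
    index = defaultdict(list)
    for ts in s:
        index[ts[key_pos]].append(ts)
    return index

def indexed_nested_loop_join(r, s_index, key_pos):
    """For each outer tuple, look up the matching inner tuples in the index."""
    result = []
    for tr in r:
        for ts in s_index.get(tr[key_pos], []):
            result.append(tr + ts)
    return result

# Hypothetical sample data joined on customer-name (position 0).
depositor = [("Kim", "A-101"), ("Lee", "A-215"), ("Kim", "A-305")]
customer = [("Kim", "Busan"), ("Park", "Seoul")]
customer_index = build_index(customer, 0)
joined = indexed_nested_loop_join(depositor, customer_index, 0)
```

The inner relation is never scanned during the join itself; each outer tuple triggers one lookup, which is why the technique is limited to equi-joins and natural joins.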

    Indexed Nested-Loop Join : Performance
      Worst case: the buffer has space for only one page of r
        Cost of the join: $b_r + n_r \times c$, where c is the cost of traversing the index and fetching the matching tuples for one tuple of r
        The number of matching tuples may be greater than one

      If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation

    Example of Nested-Loop Join Costs
      Assume depositor $\bowtie$ customer, with depositor as the outer relation
        customer has a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node
        customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
        depositor has 5,000 tuples

      Cost of block nested-loop join: $400 \times 100 + 100 = 40{,}100$ disk accesses, assuming worst-case memory

      Cost of indexed nested-loop join: $100 + 5000 \times 5 = 25{,}100$ disk accesses
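
The two estimates in this example can be recomputed as follows. The function and variable names are hypothetical; the numbers are those stated in the example.

```python
def inlj_cost(b_outer, n_outer, lookup_cost):
    """b_r + n_r * c: read the outer relation once, plus one index
    lookup (cost c) for every outer tuple."""
    return b_outer + n_outer * lookup_cost

# depositor (outer): 100 blocks, 5,000 tuples; customer (inner): 400 blocks.
tree_height = 4
lookup_cost = tree_height + 1          # 4 index node accesses + 1 data access
block_nested_loop = 100 * 400 + 100    # b_r * b_s + b_r
indexed_nested_loop = inlj_cost(100, 5000, lookup_cost)
```

Under these assumptions the index saves 15,000 disk accesses, even though every outer tuple pays five accesses for its lookup.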

    Hash-Join
      Applicable for equi-joins and natural joins
      A hash function h is used to partition the tuples of both relations
        $h : A \rightarrow \{0, 1, \dots, n\}$
        $r_0, r_1, \dots, r_n$: partitions of r tuples
        $s_0, s_1, \dots, s_n$: partitions of s tuples
      r tuples in $r_i$ need only be compared with s tuples in $s_i$
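
The partitioning idea can be sketched in a few lines. This example assumes in-memory lists of tuples and uses Python's built-in `hash` modulo the partition count as a stand-in for h; the function name, the `n_partitions` parameter, and the sample data are hypothetical.

```python
def hash_join(r, s, r_key, s_key, n_partitions=4):
    """Partition both relations with the same hash function h, then join
    each pair of partitions r_i, s_i independently."""
    def h(value):
        return hash(value) % n_partitions

    r_parts = [[] for _ in range(n_partitions)]
    s_parts = [[] for _ in range(n_partitions)]
    for tr in r:
        r_parts[h(tr[r_key])].append(tr)
    for ts in s:
        s_parts[h(ts[s_key])].append(ts)

    result = []
    for ri, si in zip(r_parts, s_parts):   # only r_i with s_i
        for tr in ri:
            for ts in si:
                if tr[r_key] == ts[s_key]:
                    result.append(tr + ts)
    return result

# Hypothetical sample data joined on an integer id at position 0.
r = [(1, "a"), (2, "b"), (5, "c")]
s = [(5, "x"), (1, "y"), (7, "z")]
joined = hash_join(r, s, 0, 0)
```

Tuples with equal join values always hash to the same partition number, so no matching pair is missed by comparing only $r_i$ with $s_i$. The equality test inside the loop is still needed because different values can share a partition.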