Data Mining for Query Optimization

Post on 23-Feb-2016


Implementation of Two Semantic Query Optimization Techniques in IBM DB2

Outline
- Semantic Query Optimization
- Soft Constraints
- Query Optimization via Soft Constraints
- Selectivity Estimation via Soft Constraints

Semantic Query Optimization

Use integrity constraints associated with a database to rewrite a query into a form that may be evaluated more efficiently.

Some techniques:
- Join Elimination
- Predicate Elimination
- Join Introduction
- Predicate Introduction
- Detecting an Empty Answer Set

Commercial implementations of SQO

Few (if any!)

Early experiences:
- Could not spend too much time on optimization
- Few integrity constraints are ever defined
- Association with deductive databases

Join elimination: example

Original query:

    select p_name, p_retailprice, s_name, s_address
    from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
    where p_partkey = ps_partkey and s_suppkey = ps_suppkey and
          ps_partkey = l_partkey and ps_suppkey = l_suppkey;

RI constraints:
- part-partsupp (on partkey)
- supplier-partsupp (on suppkey)
- partsupp-lineitem (on partkey and suppkey)

Rewritten query, with the join to partsupp eliminated:

    select p_name, p_retailprice, s_name, s_address
    from tpcd.lineitem, tpcd.part, tpcd.supplier
    where p_partkey = l_partkey and s_suppkey = l_suppkey;

Algorithm for join elimination
1. Derive column transitivity classes from the join predicates in the query.
2. Divide the relations in the query that are related through RI constraints into removable and non-removable.
3. Eliminate all removable relations from the query.
4. Add an "is not null" predicate to the foreign key columns of all tables whose RI parents were removed.
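The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the DB2 implementation: the function names, the `(table, column)` encoding, and the removability test (a parent is removable when a child in the query joins it on the full key and the query touches no other column of the parent) are assumptions made for the example, and cascading removals are ignored.

```python
from collections import defaultdict


def transitivity_classes(equalities):
    """Step 1: group columns linked by equality predicates (union-find)."""
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for a, b in equalities:
        parent[find(a)] = find(b)
    classes = defaultdict(set)
    for col in list(parent):
        classes[find(col)].add(col)
    return list(classes.values())


def eliminate_joins(tables, select_cols, equalities, ri_constraints):
    """Columns are (table, column) pairs.
    ri_constraints holds (child, fk_cols, parent, pk_cols) tuples."""
    classes = transitivity_classes(equalities)
    class_of = {col: i for i, cls in enumerate(classes) for col in cls}
    used = set(select_cols) | set(class_of)

    # Step 2 (simplified assumption): a parent is removable if a child joins it
    # on the full key and no non-key column of the parent is used in the query.
    removed, not_null = set(), set()
    for child, fk_cols, par, pk_cols in ri_constraints:
        if child not in tables or par not in tables:
            continue
        joined = all(
            class_of.get((child, fk)) is not None
            and class_of.get((child, fk)) == class_of.get((par, pk))
            for fk, pk in zip(fk_cols, pk_cols)
        )
        only_keys = all(c in pk_cols for t, c in used if t == par)
        if joined and only_keys:
            removed.add(par)
            # Step 4: foreign keys of the child must now be non-null.
            not_null |= {(child, fk) for fk in fk_cols}

    # Step 3: drop removable tables; rebuild join predicates from the classes.
    kept_tables = [t for t in tables if t not in removed]
    joins = []
    for cls in classes:
        cols = sorted(c for c in cls if c[0] not in removed)
        joins += [(cols[0], c) for c in cols[1:]]
    return kept_tables, sorted(joins), sorted(not_null)


# The TPC-D example from the slide.
tables = ["lineitem", "partsupp", "part", "supplier"]
select_cols = [("part", "p_name"), ("part", "p_retailprice"),
               ("supplier", "s_name"), ("supplier", "s_address")]
equalities = [
    (("part", "p_partkey"), ("partsupp", "ps_partkey")),
    (("supplier", "s_suppkey"), ("partsupp", "ps_suppkey")),
    (("partsupp", "ps_partkey"), ("lineitem", "l_partkey")),
    (("partsupp", "ps_suppkey"), ("lineitem", "l_suppkey")),
]
ri_constraints = [
    ("partsupp", ("ps_partkey",), "part", ("p_partkey",)),
    ("partsupp", ("ps_suppkey",), "supplier", ("s_suppkey",)),
    ("lineitem", ("l_partkey", "l_suppkey"), "partsupp", ("ps_partkey", "ps_suppkey")),
]
kept, joins, not_null = eliminate_joins(tables, select_cols, equalities, ri_constraints)
```

On this input only partsupp is dropped: part and supplier stay because the select list uses their non-key columns.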

Algorithm for join elimination: example

[Figure: diagram with nodes labeled C.C, S.S, PS.S and O.C]

Experiments
- Windows NT Server 4.0 on a 4-way Pentium II Xeon 450 with 4GB of memory
- DB2 OLAP Server
- APB-1 OLAP Benchmark
- one fact table and six dimension tables
- starview defined as a join of all tables
- Optimization: remove one or more dimension tables from the query

Performance results for join elimination

Predicate Introduction: Example

Original query:

    select sum(l_extendedprice * l_discount) as revenue
    from tpcd.lineitem
    where shipdate > date('1994-01-01');

Rewritten query:

    select sum(l_extendedprice * l_discount) as revenue
    from tpcd.lineitem
    where shipdate > date('1994-01-01') and
          receiptdate >= date('1994-01-01');

Check constraint: receiptdate >= shipdate
Clustered index on receiptdate

Algorithm for Predicate Introduction

N: the set of predicates derivable from the query and the check constraints.

If N is inconsistent, stop. Else, for each predicate A op B in N, add it to the query if:
- A or B is a join column
- B is a major column of an index
- no other index on B's table can be used in the plan for the original query

Experiments
- AIX on JRS/6000 with 512MB of memory
- DB2 Universal Database
- 100MB TPCD Benchmark
- Optimization:
  - detecting an empty answer set
  - index introduction
  - scan reduction
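The predicate introduction example above relies on the introduced predicate never changing the answer as long as the check constraint holds. The following sketch checks that on a toy table using SQLite from Python's standard library rather than DB2; the sample rows, the index name `idx_receipt`, and the use of ISO date strings in place of `date(...)` are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lineitem (
        l_extendedprice INTEGER,
        l_discount      INTEGER,
        shipdate        TEXT,
        receiptdate     TEXT,
        CHECK (receiptdate >= shipdate)
    )""")
# Hypothetical index on the column used by the introduced predicate.
conn.execute("CREATE INDEX idx_receipt ON lineitem (receiptdate)")
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [
        (100, 2, "1993-12-30", "1994-01-02"),
        (200, 3, "1994-01-01", "1994-01-05"),
        (300, 1, "1994-01-02", "1994-01-02"),
        (400, 5, "1994-03-10", "1994-03-20"),
    ],
)

original = (
    "SELECT SUM(l_extendedprice * l_discount) FROM lineitem "
    "WHERE shipdate > '1994-01-01'"
)
# The extra predicate follows from shipdate > D and receiptdate >= shipdate.
rewritten = original + " AND receiptdate >= '1994-01-01'"

revenue_original = conn.execute(original).fetchone()[0]
revenue_rewritten = conn.execute(rewritten).fetchone()[0]
```

Both queries select the same two rows, so the sums agree. Whether the rewritten form is actually faster depends on the optimizer's cost estimates.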

Queries

    select 100.00 * sum(case
                          when p_type like 'PROMO%'
                          then l_extendedprice * (1 - l_discount)
                          else 0
                        end)
           / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
    from tpcd.lineitem, tpcd.part
    where l_partkey = p_partkey and
          l_shipdate >= date('1998-09-01') and
          l_shipdate < date('1998-09-01') + 1 month;

Given the check constraint l_receiptdate >= l_shipdate we may add a new predicate to the query:

    l_receiptdate >= date('1998-09-01')

Performance Results for Index Introduction
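The derivation used for the query above (a lower bound on l_shipdate carries over to l_receiptdate through the check constraint) can be sketched as follows. The function name and the tuple encoding of predicates are hypothetical, and only check constraints of the form a >= b are handled.

```python
def derive_lower_bounds(predicates, check_constraints):
    """predicates: (column, op, constant) triples from the query.
    check_constraints: (a, b) pairs, each meaning a >= b."""
    derived = []
    for a, b in check_constraints:
        for col, op, const in predicates:
            # a >= b and b > c imply a > c; a >= b and b >= c imply a >= c.
            # Upper bounds on b (< or <=) say nothing about a, so skip them.
            if col == b and op in (">", ">="):
                derived.append((a, op, const))
    return derived


query_predicates = [
    ("l_shipdate", ">=", "1998-09-01"),
    ("l_shipdate", "<", "1998-10-01"),  # date('1998-09-01') + 1 month
]
checks = [("l_receiptdate", "l_shipdate")]
new_predicates = derive_lower_bounds(query_predicates, checks)
```

Only the lower bound yields a new predicate: the upper bound on l_shipdate does not restrict l_receiptdate.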

The Culprit

The new query plan uses an index, but the original table scan is still better! Why did this happen?
- incorrect estimate of the filter factor
- underestimation of the CPU cost of locking index pages

Soft Constraints

Traditional (hard) integrity constraints are defined to prevent incorrect updates. A soft constraint is a statement that is true about the current state of the database but is not checked when the database is updated. In fact, a soft constraint can be invalidated by an update.

Soft Constraints (cont.)
- Absolute soft constraints: no violation in the current state of the database. They can be used for optimization in exactly the same way traditional constraints are.
- Statistical soft constraints: can have some (small) degree of violation. They can be used for improved selectivity estimation.

Implementation of Soft Constraints

In Oracle, standard integrity constraints are marked with the RELY option, so that they are not verified on updates.

In DB2, soft constraints are called informational constraints.

Informational Check Constraint

Example 1: Create an employee table where a minimum salary of $25,000 is guaranteed by the application.

    CREATE TABLE emp (
        empno     INTEGER NOT NULL PRIMARY KEY,
        name      VARCHAR(20),
        firstname VARCHAR(20),
        salary    INTEGER CONSTRAINT minsalary
                  CHECK (salary >= 25000)
                  NOT ENFORCED
                  ENABLE QUERY OPTIMIZATION);

Enforcing Validation

Example 2: Alter the employee table to start enforcing the minimum salary of $25,000 using DB2. DB2 will also verify existing data right away.

ALTER TABLE emp ALTER CONSTRAINT minsalary ENFORCED
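Conceptually, verifying existing data before a constraint becomes enforced amounts to counting the rows that violate it. The sketch below shows that check with SQLite from Python's standard library; it is an illustration of the idea, not how DB2 performs the verification, and the helper `violations` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No CHECK clause here: the minimum salary is only promised by the application.
conn.execute(
    "CREATE TABLE emp (empno INTEGER NOT NULL PRIMARY KEY, "
    "name TEXT, firstname TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [(1, "Smith", "Ann", 30000), (2, "Jones", "Bob", 24000)],
)


def violations(conn, table, condition):
    """Count rows for which `condition` is false.
    Rows where it evaluates to NULL are not counted, matching SQL CHECK semantics."""
    sql = f"SELECT COUNT(*) FROM {table} WHERE NOT ({condition})"
    return conn.execute(sql).fetchone()[0]


before = violations(conn, "emp", "salary >= 25000")

# Repair the offending row, after which the constraint could be enforced.
conn.execute("UPDATE emp SET salary = 25000 WHERE empno = 2")
after = violations(conn, "emp", "salary >= 25000")
```

A non-zero count means the soft constraint does not currently hold, so an optimizer relying on it could return wrong answers.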

Informational RI Constraint

Example 3: Create a department table where the application ensures the existence of departments to which the employees belong.

CREATE TABLE dept(deptno INTEGER NOT NULL PRIMARY KEY, deptName VARCHAR(20), budget INTEGER);

ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL CONSTRAINT dept_exist REFERENCES dept NOT ENFORCED ENABLE QUERY OPTIMIZATION;

Query Optimization via Empty Joins

Example

Original query:

    select Model
    from Tickets T, Registration R
    where T.RegNum = R.RegNum and
          T.date > '1990-01-01' and
          R.Model LIKE 'BMW Z3%'

Rewritten query:

    select Model
    from Tickets T, Registration R
    where T.RegNum = R.RegNum and
          T.date > '1997-01-01' and
          R.Model LIKE 'BMW Z3%'

The first BMW Z3 series cars were made in 1997, so the join is empty for earlier dates.

Matrix representation of empty joins

$\pi_{A,B}(R \bowtie S)$

Staircase data structure
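The matrix representation of empty joins can be illustrated with a small sketch: rows are the sorted values of attribute A, columns the sorted values of attribute B, and a cell is 1 when the pair occurs in the projection of the join. The function name, the tuple layout, and the sample data are assumptions made for the illustration; the nested-loop join is only meant for tiny inputs.

```python
def empty_join_matrix(r_rows, s_rows):
    """r_rows: (join_key, a) pairs from R; s_rows: (join_key, b) pairs from S.
    Returns the sorted A values, the sorted B values, and a 0/1 matrix where
    1 marks an (a, b) pair present in the projection of the join on A, B."""
    pairs = {(a, b) for k1, a in r_rows for k2, b in s_rows if k1 == k2}
    a_vals = sorted({a for _, a in r_rows})
    b_vals = sorted({b for _, b in s_rows})
    matrix = [[1 if (a, b) in pairs else 0 for b in b_vals] for a in a_vals]
    return a_vals, b_vals, matrix


# Hypothetical data: (RegNum, ticket year) and (RegNum, model).
tickets = [(1, 1995), (2, 1998), (3, 1999)]
registration = [(1, "Civic"), (2, "Z3"), (3, "Z3")]
years, models, matrix = empty_join_matrix(tickets, registration)
```

The zero in the row for 1995 and the column for "Z3" is an empty region: no Z3 received a ticket in that year, which is the kind of fact the rewrite in the example exploits.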

Properties of the algorithm
- Time complexity O(nm): requires a single scan of the sorted data
- Space complexity O(min(n,m)): only two rows of the matrix need be kept in memory
- Scalable with respect to:
  - number of tuples in the join result
  - number of discovered empty rectangles
  - size of the domain of one of the attributes

How many empty rectangles are there?

Tests were done on 4 pairs of attributes with numerical domains that appear in typical joins in a real-world workload of a health insurance company.

How big are the rectangles?
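To give a feel for a single-scan, row-at-a-time search for empty rectangles, here is a sketch that finds one largest all-zero rectangle in a 0/1 matrix. It is not the staircase algorithm from the slides, which discovers all maximal empty rectangles; it swaps in the classic largest-rectangle-in-a-histogram technique, which also runs in O(nm) time and keeps only one row of column heights in memory. The function name and return format are assumptions.

```python
def largest_empty_rectangle(matrix):
    """Return (area, (top, left, bottom, right)) of a largest all-zero
    rectangle with inclusive bounds, or (0, None) if there is no zero cell."""
    best = (0, None)
    if not matrix:
        return best
    width = len(matrix[0])
    heights = [0] * width  # zeros stacked above the current row, per column
    for r, row in enumerate(matrix):
        for c in range(width):
            heights[c] = heights[c] + 1 if row[c] == 0 else 0
        stack = []  # column indices whose heights are strictly increasing
        for c in range(width + 1):
            h = heights[c] if c < width else 0  # sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = top_h * (c - left)
                if area > best[0]:
                    best = (area, (r - top_h + 1, left, r, c - 1))
            stack.append(c)
    return best


grid = [
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
result = largest_empty_rectangle(grid)
```

Here the largest empty rectangle spans rows 0 to 2 and columns 1 to 2, an area of 6 cells.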

Query rewrite: simple case

select from R, S,...where R.C=S.C and 60