Upload: samantha-franklin

Post on 30-Dec-2015

EN 600.619: Adv. Storage and TP Systems

Cost-Based Query Optimization

The Optimization Process

• Logical query plan

– As an expression tree

• Rewrite query plan to improve performance

• Create physical plan

– Select algorithms to implement the logical plan

An Expression Tree

SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND
      gender = 'F' AND
      starName = name;
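The figure for this slide did not survive extraction. As a minimal sketch of what an expression tree for this query could look like, the Python below builds one plausible naive plan: a projection over a selection over a cross product. The class names (`Relation`, `Product`, `Select`, `Project`) and the `show` helper are hypothetical and chosen only for illustration; the original figure may have drawn the tree differently.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Product:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Select:
    predicates: Tuple[str, ...]  # a conjunction of conditions
    child: "Node"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Node"


Node = Union[Relation, Product, Select, Project]


def show(node: Node) -> str:
    """Render the tree in a compact relational-algebra style."""
    if isinstance(node, Relation):
        return node.name
    if isinstance(node, Product):
        return f"{show(node.left)} x {show(node.right)}"
    if isinstance(node, Select):
        return f"select[{' AND '.join(node.predicates)}]({show(node.child)})"
    return f"project[{', '.join(node.columns)}]({show(node.child)})"


# Naive plan: form the cross product first, then filter, then project.
naive_plan = Project(
    ("title", "birthdate"),
    Select(
        ("year = 1996", "gender = 'F'", "starName = name"),
        Product(Relation("MovieStar"), Relation("StarsIn")),
    ),
)
print(show(naive_plan))
```

Reading the printed expression from the inside out gives the bottom-up evaluation order of the tree.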

An Alternate (Better) Logical Plan

SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND
      gender = 'F' AND
      starName = name;

Query Optimization Heuristics

• Push operators as far down the plan as possible

• Do selections as soon as possible

– Reduce intermediate result sizes

• Select, then project

• Perform joins as late as possible

– They are more costly

• Group associative and commutative operators

– Let the physical plan reorder execution

Improving the Plan

• Through query rewriting

• Split the selection

• Push the projection
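The two rewrites named here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's implementation: the slide does not list the schemas, so `SCHEMA` below assumes that `year` and `starName` belong to StarsIn and that `gender`, `name`, and `birthdate` belong to MovieStar. The helper names (`owners`, `split_and_push`, `push_projection`) are hypothetical.

```python
# Assumed schemas: the slide does not state which relation owns which column.
SCHEMA = {
    "StarsIn": {"title", "year", "starName"},
    "MovieStar": {"name", "gender", "birthdate"},
}

# Each predicate of the WHERE conjunction: (text, attributes it references).
WHERE = [
    ("year = 1996", {"year"}),
    ("gender = 'F'", {"gender"}),
    ("starName = name", {"starName", "name"}),
]


def owners(attrs):
    """Relations that own at least one of the given attributes."""
    return {rel for rel, cols in SCHEMA.items() if cols & attrs}


def split_and_push(predicates):
    """Split a conjunctive selection. Predicates over one relation are pushed
    down to that relation; predicates over two relations become join conditions."""
    pushed = {rel: [] for rel in SCHEMA}
    join_conditions = []
    for text, attrs in predicates:
        rels = owners(attrs)
        if len(rels) == 1:
            pushed[next(iter(rels))].append(text)
        else:
            join_conditions.append(text)
    return pushed, join_conditions


def push_projection(output_cols, join_attrs):
    """Columns each relation must keep after its pushed selection has run:
    only those needed by the final output or by the join condition."""
    needed = set(output_cols) | set(join_attrs)
    return {rel: sorted(cols & needed) for rel, cols in SCHEMA.items()}


pushed, join_conditions = split_and_push(WHERE)
kept = push_projection(["title", "birthdate"], ["starName", "name"])
print(pushed)
print(join_conditions)
print(kept)
```

The pushed projection is applied after the pushed selection ("select, then project"), which is why `year` and `gender` do not need to survive it.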

Grouping Operators

• The physical (not logical) plan should pick the order

The Physical Plan

• Choose algorithms and estimate result sizes to generate a concrete cost for a plan

• E.g., joins

– Discipline: hash, index, sort

– Materialize, pipeline, ripple, parallel, etc.

• Large literature on different disciplines for all operations

– Suitable for an entire (albeit detailed) course

• Also, how to search for good plans

– Branch and bound, hill climbing, dynamic programming, etc.

• Result size and choice of algorithm are independent

– For relational algebra operations
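To make "choose algorithms and generate concrete costs" tangible, here is a rough sketch that prices one join under several algorithms and picks the cheapest. The I/O formulas are common textbook approximations and are an assumption here, not something the slides state. The parameter names `b_r`, `b_s` (input sizes in blocks) and `m` (memory buffers, at least 2) are hypothetical.

```python
import math
from typing import Dict


def join_costs(b_r: int, b_s: int, m: int) -> Dict[str, int]:
    """Approximate disk I/Os for joining R and S, by algorithm.

    Only algorithms whose (approximate) memory requirement is met are listed.
    The cost of writing the join result is excluded: it is the same for
    every algorithm because the result size does not depend on the choice.
    """
    small, large = min(b_r, b_s), max(b_r, b_s)
    costs: Dict[str, int] = {}
    if small <= m - 1:
        # One-pass join: the smaller input fits in memory.
        costs["one-pass"] = b_r + b_s
    # Block nested loop: scan the larger input once per memory-load
    # of the smaller input.
    chunks = math.ceil(small / (m - 1))
    costs["block nested loop"] = small + chunks * large
    if small <= m * m:
        # Partition both inputs to disk, then read the partitions back.
        costs["two-pass hash"] = 3 * (b_r + b_s)
    if b_r + b_s <= m * m:
        # Write sorted runs for both inputs, then merge and join.
        costs["two-pass sort-merge"] = 3 * (b_r + b_s)
    return costs


costs = join_costs(b_r=1000, b_s=500, m=101)
best = min(costs, key=costs.get)
print(costs)
print(best, costs[best])
```

With ample memory (for example `m=501` for the same inputs) the one-pass join becomes available and the ranking changes, which is why the cost must be recomputed per plan rather than fixed per operator.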

Estimating Result Sizes

• Most inaccurate and difficult part of query processing

– Cost of an operation is f(algorithm, size estimate)

– Given exact size, costing is very accurate

• Sometimes sizing can be exact

– Equality queries on unique attributes return 0 or 1 tuples

– Joins on key (foreign key) fields

– Good schema design improves query execution

• For many operations it is difficult

– Joins: expand (cross product) or reduce (more often)

– Range queries: produce multiple tuples

• 50% accuracy is considered good... ugh!
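The easy and hard cases above can be contrasted in a small sketch. The function names and the sample numbers are hypothetical. The non-exact formulas assume uniformly distributed values, which is the usual textbook simplification and exactly where real estimates go wrong.

```python
def eq_select_size(t: int, v: int, unique: bool = False) -> float:
    """Estimated result size of `attr = constant`.

    t: tuples in the relation, v: distinct values of attr.
    """
    if unique:
        return 1.0  # upper bound: the result holds 0 or 1 tuple
    return t / v  # assumes every value is equally frequent


def fk_join_size(t_referencing: int) -> int:
    """Join of a foreign key with the key it references.

    Assuming the constraint is enforced and the foreign key is never NULL,
    each referencing tuple matches exactly one referenced tuple.
    """
    return t_referencing


def general_join_size(t1: int, t2: int, v1: int, v2: int) -> float:
    """Common textbook estimate for an equijoin on non-key attributes."""
    return t1 * t2 / max(v1, v2)


print(eq_select_size(10_000, 50))                 # non-unique attribute
print(eq_select_size(10_000, 10_000, unique=True))
print(fk_join_size(10_000))
print(general_join_size(1_000, 500, 50, 100))
```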

Problems w/ Estimating Size

• Need to know result sizes a priori

– Known exactly only after query execution

• Techniques need to be lightweight

– Performing I/O as part of estimation reduces query performance

• General approach

– Statistics on underlying tables for important queries

– Small, summary data structures (in-memory execution)

• Techniques

– Histograms, sampling, wavelets

Histograms

SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp;

Join estimate = T1 x T2 / V (tuple product / bucket width)

Estimate with histogram:

5 x 20/10 + 10 x 5/10 = 15

Better than the estimate without a histogram:

245 x 245/100 ≈ 600
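The arithmetic on this slide can be checked with a short script. It uses only the numbers that appear above; the histogram figure itself was lost, so the variable names and the reading of each pair as (Jan count, July count) for a bucket where both tables have tuples are assumptions. The two per-bucket terms come to 10 and 5, for a total of 15.

```python
def join_estimate(t1: int, t2: int, v: int) -> float:
    """T1 * T2 / V: both inputs assumed uniform over v distinct values."""
    return t1 * t2 / v


BUCKET_WIDTH = 10  # distinct temperature values covered by one bucket

# Tuple counts for the two buckets in which both tables are non-empty.
overlapping_buckets = [(5, 20), (10, 5)]

with_histogram = sum(
    join_estimate(a, b, BUCKET_WIDTH) for a, b in overlapping_buckets
)

# Without a histogram: 245 tuples per table spread over 100 values.
without_histogram = join_estimate(245, 245, 100)

print(with_histogram)     # 15.0
print(without_histogram)  # 600.25
```

The histogram helps because buckets where either table is empty contribute nothing, whereas the single-bucket estimate spreads both tables over the whole value range.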

On Histograms

• Workload defined

– Keep for important fields. Similar concept to indexes.

• Data defined

– Keep when they improve performance.

– Don’t need a histogram for the uniform distribution

• Complications

– Update queries invalidate statistics

– Need to be pre-computed, often prior to witnessing workload

– Composing histograms (for multiple attributes) leads to inaccuracies

• What the world needs is fully incremental histograms that support multi-attribute queries
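A tiny synthetic example shows why composing single-attribute statistics goes wrong. The table below is made up for illustration: its two attributes are perfectly correlated. Multiplying per-attribute selectivities assumes independence, which is what composing one-dimensional histograms effectively does.

```python
# Synthetic data (hypothetical): 100 rows (a, b) with a == b in every row.
rows = [(i % 10, i % 10) for i in range(100)]


def selectivity(table, col, value):
    """Fraction of rows whose column `col` equals `value`."""
    return sum(1 for r in table if r[col] == value) / len(table)


n = len(rows)
sel_a = selectivity(rows, 0, 3)  # what a 1-D histogram on a would report
sel_b = selectivity(rows, 1, 3)  # what a 1-D histogram on b would report

# Composed estimate for "a = 3 AND b = 3" under the independence assumption.
composed = round(n * sel_a * sel_b, 6)

# True answer, which only a multi-attribute statistic could capture.
actual = sum(1 for a, b in rows if a == 3 and b == 3)

print(composed, actual)
```

Here the composed estimate is off by a factor of ten even though each one-dimensional statistic is exact.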

STHoles

Bruno, Chaudhuri, and Gravano. STHoles: A Multidimensional Workload-Aware Histogram, SIGMOD 2001.

• Generate histograms from analyzing query results

– No examination of data sets

– Leverage workload information and query feedback

• Supports overlapped and nested buckets

– Multi-resolution histogram

– Buckets allocated where they are most needed, e.g., if there are no queries to a region, no statistics are kept
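One way to picture nested buckets is as a tree in which each child is a rectangular hole carved out of its parent. The sketch below estimates a range query over such a tree by assuming tuples are uniform within the part of each bucket not covered by its children. It is a simplified reading of the idea, not the paper's code: the names `Bucket`, `own_volume`, and `estimate` are hypothetical, and children are assumed to lie inside their parent and to be disjoint from one another.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (lo, hi) interval per dimension


def volume(box: Box) -> float:
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)  # empty intersections have zero volume
    return v


def intersect(a: Box, b: Box) -> Box:
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


@dataclass
class Bucket:
    box: Box
    freq: float  # tuples in box, excluding those counted by child buckets
    children: List["Bucket"] = field(default_factory=list)

    def own_volume(self, region: Box) -> float:
        """Volume of region inside this box that no child (hole) covers."""
        inside = intersect(region, self.box)
        covered = sum(volume(intersect(inside, c.box)) for c in self.children)
        return volume(inside) - covered

    def estimate(self, query: Box) -> float:
        """freq scaled by the overlapping fraction, plus the children's share."""
        own = self.own_volume(self.box)
        est = self.freq * self.own_volume(query) / own if own > 0 else 0.0
        return est + sum(c.estimate(query) for c in self.children)


# A 10x10 root bucket with a denser 5x5 hole in one corner.
root = Bucket(((0, 10), (0, 10)), 60, [Bucket(((0, 5), (0, 5)), 40)])
print(root.estimate(((0, 5), (0, 10))))   # left half of the domain
print(root.estimate(((0, 10), (0, 10))))  # whole domain
```

The finer child bucket answers for its own region, so resolution is spent only where a hole was drilled.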

Feedback-Based Optimization

Visualizing Histograms

Histogram Construction

• Start with an empty histogram

• New queries punch ‘holes’ in the histogram, creating regions of refinement
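A minimal one-dimensional sketch of hole punching follows. It is a simplification made for illustration: the real histogram is multidimensional, and the update rule used here (give the hole the observed count and subtract it from the parent, never going below zero) is an assumption rather than a quotation of the paper. The names `Bucket`, `drill_hole`, and `total` are hypothetical, as are the query ranges and counts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bucket:
    lo: float
    hi: float
    freq: float  # tuples in [lo, hi) not accounted for by child buckets
    children: List["Bucket"] = field(default_factory=list)


def drill_hole(parent: Bucket, lo: float, hi: float, observed: float) -> Bucket:
    """Carve [lo, hi) out of `parent` as a new child bucket.

    `observed` is the tuple count that the executed query returned for the
    range. Simplifying assumptions: the range lies inside the parent and
    does not overlap any existing child.
    """
    assert parent.lo <= lo < hi <= parent.hi
    hole = Bucket(lo, hi, observed)
    # The observed tuples are now described by the hole, not the parent.
    parent.freq = max(0.0, parent.freq - observed)
    parent.children.append(hole)
    return hole


def total(bucket: Bucket) -> float:
    """Tuples described by a bucket and everything nested inside it."""
    return bucket.freq + sum(total(c) for c in bucket.children)


# The first query covers [0, 100) and returns 1000 tuples: the root bucket.
root = Bucket(0, 100, 1000)
# Later queries refine the regions they touch.
hot = drill_hole(root, 20, 30, observed=400)
hotter = drill_hole(hot, 22, 25, observed=300)
print(root.freq, hot.freq, hotter.freq, total(root))
```

Each query refines only the region it touched, so the histogram becomes detailed where the workload looks and stays coarse elsewhere.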

Policies

• Identify and drill candidate holes

Policies

• Shrink regions to preserve rectangular spaces

– Ease of description and improved accuracy

Policies

• Merge buckets (with similar densities) to improve histogram under a space budget
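The merging idea can be sketched on a flat, one-dimensional histogram: while the bucket count exceeds the budget, merge the adjacent pair whose densities are closest, since that merge loses the least information. This is a deliberately simplified stand-in, with hypothetical names and sample numbers; it does not reproduce the paper's merge types or penalty function.

```python
from typing import List, Tuple

Bucket = Tuple[float, float, float]  # (lo, hi, freq); list is sorted and adjacent


def density(b: Bucket) -> float:
    lo, hi, freq = b
    return freq / (hi - lo)


def merge_to_budget(buckets: List[Bucket], budget: int) -> List[Bucket]:
    """Merge adjacent buckets with the most similar densities until at most
    `budget` buckets remain."""
    buckets = list(buckets)  # do not modify the caller's list
    while len(buckets) > budget and len(buckets) > 1:
        # Index of the adjacent pair with the smallest density difference.
        i = min(
            range(len(buckets) - 1),
            key=lambda k: abs(density(buckets[k]) - density(buckets[k + 1])),
        )
        lo, _, f1 = buckets[i]
        _, hi, f2 = buckets[i + 1]
        buckets[i : i + 2] = [(lo, hi, f1 + f2)]
    return buckets


histogram = [(0, 10, 100), (10, 20, 110), (20, 30, 500), (30, 40, 20)]
print(merge_to_budget(histogram, budget=3))
print(merge_to_budget(histogram, budget=2))
```

The total frequency is preserved by every merge; only the resolution drops.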

STHoles Redux

• Quality histograms

• Runtime overhead (<10%)

– Dynamic construction of histograms

– But no pre-processing

• Preferable in several situations

– Frequently updated data, where the histogram must follow a changing distribution

– Shifting workloads: STHoles can redirect attention to new regions dynamically. (This is what’s cool.)