query optimization 2 - stanford...

56
Query Optimization 2 Instructor: Matei Zaharia cs245.stanford.edu

Upload: others

Post on 27-Sep-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Query Optimization 2

Instructor: Matei Zahariacs245.stanford.edu

Page 2: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Recap: Data Statistics

Information about tuples in a table that we can use to estimate costs» Must be approximated for intermediate tables

We saw one way to do this for 4 statistics:» T(R) = # of tuples in R» S(R) = average size of tuples in R» B(R) = # of blocks to hold R’s tuples» V(R, A) = # distinct values of attribute A in R

CS 245 2

Page 3: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Another Type of Data Stats: Histograms

CS 245 3

10 20 30 40

5

10

15

12

number of tuplesin R with A valuein a given range

σA≥a(R) = ?

Page 4: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection

Spark SQLCS 245 4

Page 5: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection

Spark SQLCS 245 5

Page 6: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Cost Models

How do we measure a query plan’s cost?

Many possible metrics:» Number of disk I/Os» Number of compute cycles» Combined time metric» Memory usage» Bytes sent on network» …

CS 245 6

We’ll focus on this

Page 7: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Example: Index vs Table Scan

Our query: σp(R) for some predicate p

s = p’s selectivity (fraction tuples passing)

CS 245 7

Table scan:R has B(R) = T(R)⨯S(R)/bblocks on disk

Cost: B(R) I/Os

Index search:Index lookup for p takes L I/Os

We then have to read part of R;Pr[read block i]

≈ 1 – Pr[no match]records in block

= 1 – (1–s)b / S(R)

Cost: L + (1–(1–s)b/S(R)) B(R)

block size

Page 8: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

What If Results Were Clustered?

We’d need to change our estimate of Cindex:

Cindex = L + s B(R)

CS 245 8

Unclustered: records that match p are spread out uniformly

Clustered: records that match p are close together in R’s file

Fraction of R’s blocks read

Less than Cindex for unclustered data

Page 9: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Join Operators

Join orders and algorithms are often the choices that affect performance the most

For a multi-way join R ⨝ S ⨝ T ⨝ …, each join is selective and order matters a lot» Try to eliminate lots of records early

Even for one join R ⨝ S, algorithm matters

CS 245 9

Page 10: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

ExampleSELECT order.date, product.price, customer.nameFROM order, product, customerWHERE order.product_id = product.product_idAND order.cust_id = customer.cust_idAND product.type = “car”AND customer.country = “US”

CS 245 10

⨝order product

(type=car)

customer(country=US)

⨝order customer

(country=US)

product(type=car)

⨝Plan 1: Plan 2:

join conditions

selection predicates

When is each plan better?

Page 11: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Common Join Algorithms

Iteration (nested loops) join

Merge join

Join with index

Hash join

CS 245 11

Page 12: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Iteration Joinfor each rÎR1:

for each sÎR2:if r.C == s.C then output (r, s)

I/Os: one scan of R1 and T(R1) scans of R2, so cost = B(R1) + T(R1) B(R2) reads

Improvement: read M blocks of R1 in RAM at a time then read R2: B(R1) + B(R1) B(R2) / M

CS 245 12

Note: cost of writes is always B(R1 ⨝ R2)

Page 13: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Merge Join

if R1 and R2 not sorted by C then sort themi, j = 1while i £ T(R1) && j £ T(R2):

if R1[i].C = R2[j].C then outputTupleselse if R1[i].C > R2[j].C then j += 1else if R1[i].C < R2[j].C then i += 1

CS 245 13

Page 14: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Merge Join

procedure outputTuples:while R1[i].C == R2[j].C && i £ T(R1):

jj = jwhile R1[i].C == R2[jj].C && jj £ T(R2):

output (R1[i], R2[jj]) jj += 1

i += i+1

CS 245 14

Page 15: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

i R1[i].C R2[j].C j1 10 5 12 20 20 23 20 20 34 30 30 45 40 30 5

50 652 7

Example

CS 245 15

Page 16: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Cost of Merge Join

If R1 and R2 already sorted by C, then

cost = B(R1) + B(R2) reads

CS 245 16

(+ write cost of B(R1 ⨝ R2))

Page 17: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Cost of Merge Join

If Ri is not sorted, can sort it in 4 B(Ri) I/Os:» Read runs of tuples into memory, sort» Write each sorted run to disk» Read from all sorted runs to merge» Write out results

CS 245 17

Page 18: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Join with Indexfor each rÎR1:

list = index_lookup(R2, C, r.C)for each sÎlist:

output (r, s)

Read I/Os: 1 scan of R1, T(R1) index lookups on R2, and T(R1) data lookups

cost = B(R1) + T(R1) (Lindex + Ldata)

CS 245 18

Can be less when R1 is sorted/clustered by C!

Page 19: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Hash Join (R2 Fits in RAM)hash = load R2 into RAM and hash by Cfor each rÎR1:

list = hash_lookup(hash, r.C)for each sÎlist:

output (r, s)

Read I/Os: B(R1) + B(R2)

CS 245 19

Page 20: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Hash Join on Disk

Can be done by hashing both tables to a common set of buckets on disk» Similar to merge sort: 4 (B(R1) + B(R2))

Trick: hash only (key, pointer to record) pairs» Can then sort the pointers to records that

match and fetch them near-sequentially

CS 245 20

Page 21: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Other Concerns

Join selectivity may affect how many records we need to fetch from each relation» If very selective, may prefer methods that join

pointers or do index lookups

CS 245 21

Page 22: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Summary

Join algorithms can have different performance in different situations

In general, the following are used:» Index join if an index exists» Merge join if at least one table is sorted» Hash join if both tables unsorted

CS 245 22

Page 23: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection

Spark SQLCS 245 23

Page 24: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Complete CBO Process

Generate and compare possible query plans

Query

Generate Plans

Prune x x

Estimate Cost Costs

SelectCS 245 24

Pick Min

Page 25: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

How to Generate Plans?

Simplest way: recursive search of the options for each planning choice

CS 245 25

Access pathsfor table 1

Access pathsfor table 2

Algorithmsfor join 1

Algorithmsfor join 2⨯ ⨯ ⨯ ⨯ …

Page 26: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

How to Generate Plans?

Can limit search space: e.g. many DBMSesonly consider “left-deep” joins

CS 245 26

Often interacts well with conventions for specifying join inputs in asymmetric join algorithms (e.g. assume right argument has index)

Page 27: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

How to Generate Plans?

Can prioritize searching through the most impactful decisions first» E.g. join order is one of the most impactful

CS 245 27

Page 28: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

How to Prune Plans?

While computing the cost of a plan, throw it away if it is worse than best so far

Start with a greedy algorithm to find an “OK” initial plan that will allow lots of pruning

CS 245 28

Page 29: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Memoization and Dynamic ProgrammingDuring a search through plans, many subplans will appear repeatedly

Remember cost estimates and statistics (T(R), V(R, A), etc) for those: “memoization”

Can pick an order of subproblems to make it easy to reuse results (dynamic programming)

CS 245 29

Page 30: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Resource Cost of CBO

It’s possible for cost-based optimization itself to take longer than running the query!

Need to design optimizer to not take too long» That’s why we have shortcuts in stats, etc

Luckily, a few “big” decisions drive most of the query execution time (e.g. join order)

CS 245 30

Page 31: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection

Spark SQLCS 245 31

Page 32: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2004: MapReduce published, enables writing large scale data apps on commodity clusters» Cheap but unreliable “consumer” machines,

so system emphasizes fault tolerance» Focus on C++/Java programmers

CS 245 32

Page 33: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2006: Apache Hadoop project formed as an open source MapReduce + distributed FS» Started in Nutch open source search engine» Soon adopted by Yahoo & Facebook

2006: Amazon EC2 service launched as the newest attempt at “utility computing”CS 245 33

Page 34: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2007: Facebook starts Hive (later Apache Hive) for SQL on Hadoop» Other SQL-on-MapReduces existed too» First steps toward “data lake” architecture

CS 245 34

Page 35: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2006-2012: Many other cluster programming frameworks proposed to bring MR’s benefits to other apps

CS 245 35

Pregel Dremel DryadImpala

Page 36: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2010: Spark engine released, built around MapReduce + in-memory computing» Motivation: interactive queries + iterative

algorithms such as graph analytics

Spark then moves to be a general (“unified”) engine, covering existing ones

CS 245 36

Page 37: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Code Size Comparison (2013)

0

20000

40000

60000

80000

100000

120000

140000

HadoopMapReduce

Impala(SQL)

Storm(Streaming)

Giraph(Graph)

Spark

non-test, non-example source lines

Shark

GraphXStreaming

Page 38: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Background

2012: Shark starts as a port of Hive on Spark

2014: Spark SQL starts as a SQL engine built directly on Spark (but interoperable w/ Hive)» Also adds two new features: DataFrames for

integrating relational ops in complex programs and extensible optimizer

CS 245 38

Page 39: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Original Spark API

Resilient Distributed Datasets (RDDs)» Immutable collections of objects that can be

stored in memory or disk across a cluster» Built with parallel transformations (map, filter, …)» Automatically rebuilt on failure

Page 40: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile(“hdfs://...”)errors = lines.filter(s => s.startswith(“ERROR”))messages = errors.map(s => s.split(‘\t’)(2))messages.cache()

Block 1

Block 2

Block 3

Worker

Worker

Worker

Driver

messages.filter(s => s.contains(“foo”)).count()messages.filter(s => s.contains(“bar”)).count()

. . .

tasks

resultsCache 1

Cache 2

Cache 3

Base RDDTransformed RDD

Action

Result: full-text search of Wikipedia in 1 sec(vs 40 s for on-disk data)

Example: Log Mining

Page 41: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Challenges with Spark’s Functional APILooks high-level, but hides many semantics of computation from engine» Functions passed in are arbitrary blocks of

code» Data stored is arbitrary Java/Python objects

Users can mix APIs in suboptimal ways

Page 42: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Example Problempairs = data.map(word => (word, 1))

groups = pairs.groupByKey()

groups.map((k, vs) => (k, vs.sum))

Materializes all groupsas lists of integers

Then promptlyaggregates them

Page 43: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Challenge: Data RepresentationJava objects often many times larger than data

class User(name: String, friends: Array[Int])User(“Bobby”, Array(1, 2))

User 0x… 0x…

String

3

0

1 2

Bobby

5 0x…

int[]

char[] 5

Page 44: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Spark SQL & DataFramesEfficient library for working with structured data» 2 interfaces: SQL for data analysts and external

apps, DataFrames for complex programs» Optimized computation and storage underneath

Page 45: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Spark SQL Architecture

Logical Plan

Physical Plan

Catalog

OptimizerRDDs

DataSource

API

SQL DataFrames

Code

Generator

Page 46: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

DataFrame APIDataFrames hold rows with a known schemaand offer relational operations through a DSL

c = HiveContext()users = c.sql(“select * from users”)

ma_users = users[users.state == “MA”]

ma_users.count()

ma_users.groupBy(“name”).avg(“age”)

ma_users.map(lambda row: row.user.toUpper())

Expression AST

Page 47: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

API Details

Based on data frame concept in R, Python» Spark is the first to make this declarative

Integrated with the rest of Spark» ML library takes DataFrames as input/output» Easily convert RDDs ↔ DataFrames

Google trends for “data frame”

Page 48: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

What DataFrames Enable1. Compact binary representation

• Columnar, compressed cache; rows for processing

2. Optimization across operators (join reordering, predicate pushdown, etc)

3. Runtime code generation

Page 49: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Performance

0 2 4 6 8 10

RDD Scala

RDD Python

DataFrame Scala

DataFrame Python

DataFrame R

DataFrame SQL

Time for aggregation benchmark (s)

Page 50: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Performance

0 2 4 6 8 10

RDD Scala

RDD Python

DataFrame Scala

DataFrame Python

DataFrame R

DataFrame SQL

Time for aggregation benchmark (s)

Page 51: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Data SourcesUniform way to access structured data» Apps can migrate across Hive, Cassandra,

JSON, Parquet, …» Rich semantics allows query pushdown into

data sources

SparkSQL

users[users.age > 20]

select * from users

Page 52: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

ExamplesJSON:

JDBC:

Together:

select user.id, text from tweets

{“text”: “hi”,“user”: {“name”: “bob”,“id”: 15 }

}

tweets.jsonselect age from users where lang = “en”

select t.text, u.agefrom tweets t, users uwhere t.user.id = u.idand u.lang = “en” Spark

SQL{JSON}

select id, age fromusers where lang=“en”

Page 53: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Extensible Optimizer

Uses Scala pattern matching (see demo!)

CS 245 53

Page 54: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Which Spark ComponentsDo People Use?

58%

58%

62%

69%

MLlib + GraphX

Spark Streaming

DataFrames

Spark SQL

75% of users use 2 or more components(2015 survey)

CS 245 54

Page 55: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Which Languages Are Used?

84%

38% 38%

71%

31%

58%

18%

2014 Languages Used 2015 Languages Used

CS 245 55

Page 56: Query Optimization 2 - Stanford Universityweb.stanford.edu/class/cs245/slides/08-Query-Optimization-p2.pdfIt’s possible for cost-based optimization itself to take longer than running

Extensions to Spark SQL

Tens of data sources using the pushdown API

Interval queries on genomic data

Geospatial package (Magellan)

Approximate queries & other research

CS 245 56