Universalizing Approximate Query Processing Yongjoo Park Barzan Mozafari Joseph Sorenson Junhao Wang

Post on 14-Jul-2020

Page 1: Universalizing Approximate Query Processing

Yongjoo Park, Barzan Mozafari, Joseph Sorenson, Junhao Wang

Universal Approximate Query Processing

What is Approximate Query Processing (AQP)?

• Exact query processing: I/O, computation, exact answer
• AQP: less I/O, less computation, approximate answer

Why AQP?

Higher Productivity
• Numerous studies: a latency >2 seconds is no longer interactive and negatively affects creativity!

Lower Cost (Time + Resources)
• Human time: Money
• Machine time: No one loves their EC2 bill!

Jeff Bezos

AQP research in academia

(Timeline figure, axis 1985 to 2020. Years marked: 1984, 1991, 1996, 1997, 1999, 2001, 2003, 2005, 2006, 2007, 2010, 2011, 2013, 2016, 2017. Work shown, in the order listed on the slide:)
• DBLearning
• QuickR, Seek+Sample, Wander join
• BlinkDB
• SciBORQ
• MapReduce Online, COSMOS
• Optimized stratified, Scalable with DBO
• SMS join
• Bootstrap for AQP
• Dynamic sample selection
• STRAT
• AQUA, Ripple join
• Online aggregation
• Approximate count estimator
• Selectivity estimation on random samples
• Double sampling

35 years of research, little industry adoption

AQP is hard to adopt

AQP typically requires significant modifications of DBMS internals
• Error estimation: [BlinkDB '13], [G-OLA '15], …
• Query evaluation: [Online '97], [Join Synopses '99], …
• Relational operators: [ABM '14], …

Traditional DBMS vendors
• Stable codebase, reluctant to make major changes
• Slow in adopting ANYTHING :-)

Newer SQL-on-Hadoop systems: implementing standard features

Users won't abandon their existing DBMS just to use AQP.

Built-in AQP functions in OLAP engines

Available approximate functions (count-distinct or quantile only):
• APPROXIMATE PERCENTILE_DISC
• approxCountDistinct, approxQuantile
• NDV, APX_MEDIAN
• approx_count_distinct, approx_percentile

Good progress! But, too little, too slow

Limitations
1. Good only when the data does not fit in memory
2. Good only for flat queries: no error propagation
3. Applicable only for order statistics: no support for UDAs or arithmetic aggregates

Need for complete AQP solutions that are easy to adopt

Our proposal: Universal AQP

• Today: the user/app sends SQL to the SQL DB and receives an exact result.
• Universal AQP: a thin AQP layer sits between the user/app and the SQL DB. It rewrites the incoming SQL, the SQL DB returns the exact result of the rewritten SQL, and the user/app receives an approximate result plus an error bound.

SQL (from the user/app):
    select avg(price)
    from sales
    where channel = 'online'

Rewritten SQL (sent to the SQL DB):
    select avg(a1), std(a1)
    from (select avg(price) as a1
          from sales
          where channel = 'online'
          group by sid) t1
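As a rough illustration of the rewriting above, the following Python sketch runs the inner group-by of the rewritten SQL on a toy table in SQLite. The table contents, the number of subsamples (10), and the uniform assignment of sid are assumptions made for this example, not details from the slides. avg(a1) and std(a1) are computed in Python because SQLite has no built-in std() aggregate.

```python
import random
import sqlite3
import statistics

# Hypothetical toy data: a "sales" sample whose rows carry a subsample id (sid).
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (price real, channel text, sid integer)")
rows = [
    (random.uniform(1.0, 10.0), random.choice(["online", "store"]), random.randint(1, 10))
    for _ in range(5000)
]
conn.executemany("insert into sales values (?, ?, ?)", rows)

# Inner query of the rewritten SQL: one average per subsample.
per_subsample = [
    a1
    for (a1,) in conn.execute(
        "select avg(price) as a1 from sales "
        "where channel = 'online' group by sid"
    )
]

# Outer query: avg(a1) and std(a1), computed here in Python.
estimate = statistics.mean(per_subsample)
spread = statistics.stdev(per_subsample)

# Reference value: the plain average over the same rows.
(direct,) = conn.execute(
    "select avg(price) from sales where channel = 'online'"
).fetchone()
print(f"estimate={estimate:.3f} spread={spread:.3f} direct={direct:.3f}")
```

The spread of the per-subsample averages is the raw ingredient for an error bound; how it is scaled into a bound is not shown in this sketch.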

Challenges of Universal AQP

1. Statistical correctness (inter-tuple correlations)
• Foreign-key constraints [Join Synopses '99]
• Modifying the join algorithm [Wander Join '16]
• Modifying the query plan [BlinkDB '13, Quickr '16]

2. Middleware efficiency
• Lack of access to DBMS machinery

3. Server efficiency
• Resampling-based techniques [Pol and Jermaine '05, BlinkDB '14]
• Intimate integration of error estimation logic into scan operators [Quickr '16, SnappyData]
• Overriding the relational operators altogether [ABM '14]

(Figure: sales data for AA, NYC, SF; correct error bounds)
(Figure: thin AQP layer, network)

VerdictDB Overview

First Universal AQP system

Deployment

VerdictDB sits between the user/app and the SQL DB:
• user/app to VerdictDB: JDBC, API call
• VerdictDB to SQL DB: JDBC, spark.sql

The SQL DB stores (1) offline-created samples, and (2) VerdictDB-managed metadata.

The only requirements (SQL syntax supported by almost any SQL engine):
• create table as select …
• rand(), agg(col) over (partition by …)
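To make the first requirement concrete, here is a minimal sketch of building a uniform sample offline with nothing but create table as select and rand(). It uses SQLite through Python; the table names (sales, sales_sample), the 10% ratio, and the registered rand() stand-in are assumptions for this example and are not VerdictDB's actual sample-creation statements.

```python
import random
import sqlite3

random.seed(1)
conn = sqlite3.connect(":memory:")
# SQLite has no rand(); register a stand-in that returns a float in [0, 1).
conn.create_function("rand", 0, random.random)

conn.execute("create table sales (price real, channel text)")
conn.executemany(
    "insert into sales values (?, ?)",
    [(float(i % 50), "online" if i % 2 else "store") for i in range(20000)],
)

# Offline sample creation with plain SQL: keep each row with probability 10%.
conn.execute("create table sales_sample as select * from sales where rand() < 0.1")

(n_total,) = conn.execute("select count(*) from sales").fetchone()
(n_sample,) = conn.execute("select count(*) from sales_sample").fetchone()
print(n_total, n_sample)
```

Because the sample is an ordinary table, any engine that accepts these two constructs can host it without changes to its internals.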

Architecture

incoming query → VerdictDB → approximate answer (VerdictDB talks to the SQL DB)

VerdictDB components:
• Query Parser
• Query Rewriter (crucial component)
  1. Chooses an optimal set of samples
  2. Scales values appropriately
  3. Inserts error estimation logic
• DBMS Drivers: Impala driver, Hive driver, Redshift driver, …
• Answer Rewriter
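The "scales values appropriately" step can be sketched for a uniform sample as follows. The sampling ratio, the data, and the variable names are assumptions for this example; the slides do not spell out the scaling rules, so this shows only the standard inverse-ratio scaling for COUNT and SUM.

```python
import random

random.seed(2)
ratio = 0.05  # hypothetical sampling ratio
table = [random.uniform(0.0, 100.0) for _ in range(100_000)]
sample = [x for x in table if random.random() < ratio]

# COUNT and SUM computed on the sample are scaled up by 1/ratio;
# AVG needs no scaling because it is a ratio of the two.
approx_count = len(sample) / ratio
approx_sum = sum(sample) / ratio
approx_avg = sum(sample) / len(sample)

exact_count, exact_sum = len(table), sum(table)
exact_avg = exact_sum / exact_count
print(approx_count, exact_count, approx_avg, exact_avg)
```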

Error estimation in VerdictDB

Error estimation in general

User interested in Q(T)

We compute Q(S) where S is a sample of T

Main question: how close is Q(S) to Q(T)?

Method                                       | Fast?                   | General?
Closed-form (CLT, Hoeffding, HT)             | YES                     | NO (no UDAs, requires IID)
Existing resampling (subsampling, bootstrap) | NO (can be slow in SQL) | YES (Hadamard differentiable)
Ours (variational subsampling)               | YES                     | YES (Hadamard differentiable)
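For the closed-form row, a CLT-based error estimate for Q = avg can be sketched as below. The data distribution, table size, and sample size are assumptions for this example; the point is only that the bound comes from one formula, which is fast but tied to this particular aggregate.

```python
import math
import random
import statistics

random.seed(3)
T = [random.gauss(50.0, 10.0) for _ in range(200_000)]  # full table (hypothetical)
S = random.sample(T, 2_000)                              # uniform sample of T

q_T = statistics.fmean(T)   # exact answer Q(T)
q_S = statistics.fmean(S)   # approximate answer Q(S)

# CLT: Q(S) is roughly normal around Q(T) with standard error s / sqrt(n).
std_err = statistics.stdev(S) / math.sqrt(len(S))
half_width = 1.96 * std_err  # 95% confidence half-width
print(f"Q(S) = {q_S:.3f} +/- {half_width:.3f}, Q(T) = {q_T:.3f}")
```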

Recap: traditional subsampling

• T: original table (size N). Q(T) is slow / expensive.
• S: random sample of T (size n). What is the error of Q(S)?
• s1, s2, s3, …, sb: subsamples of S (size s ≪ n). Each is a random sample without replacement, and each subsample is independent.
• Q(s1), Q(s2), Q(s3), …, Q(sb): the query evaluated on each subsample.

Important properties
1. A tuple may belong to multiple subsamples.
2. The size of every subsample is s.
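A minimal sketch of how the subsample answers Q(s1), …, Q(sb) turn into an error estimate for Q(S), here for Q = avg. The data and the values of n, s, and b are assumptions for this example, and the sqrt(s/n) rescaling is the textbook subsampling correction rather than something stated on the slides.

```python
import math
import random
import statistics

random.seed(4)
n, s, b = 10_000, 100, 200   # sample size, subsample size, number of subsamples
S = [random.expovariate(1 / 5.0) for _ in range(n)]  # hypothetical sample of prices

q_S = statistics.fmean(S)

# Draw b subsamples of size s without replacement; a tuple may land in several.
q_sub = [statistics.fmean(random.sample(S, s)) for _ in range(b)]

# Subsample answers fluctuate on the 1/sqrt(s) scale; rescale to the 1/sqrt(n) scale of Q(S).
std_err = statistics.stdev(q_sub) * math.sqrt(s / n)

clt_std_err = statistics.stdev(S) / math.sqrt(n)  # closed-form reference for avg
print(f"subsampling: {std_err:.4f}, CLT: {clt_std_err:.4f}")
```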

Traditional subsampling in SQL is slow

CITY | PRODUCT | PRICE | 1 | 2 | ··· | b
AA   | egg     | $3.00 | 1 | 0 | ··· | 1
AA   | milk    | $5.00 | 0 | 1 | ··· | 0
AA   | egg     | $3.00 | 0 | 0 | ··· | 1
NYU  | egg     | $4.00 | 0 | 1 | ··· | 0
NYU  | milk    | $6.00 | 0 | 0 | ··· | 1
NYU  | candy   | $2.00 | 1 | 0 | ··· | 0
SF   | milk    | $6.00 | 0 | 1 | ··· | 0
SF   | egg     | $4.00 | 0 | 0 | ··· | 0
SF   | egg     | $4.00 | 0 | 1 | ··· | 1

(n tuples; columns 1 … b are subsample IDs; each subsample-ID column sums to s)

Algorithm:
for i = 1, ..., n
    for j = 1, ..., b
        if sid[i,j] == 1
            sum[j] += price[i]

Time Complexity: O(n·b)

No error est: 0.35 sec
Trad. subsampling: 118 sec (337× slower)
(based on 1G sample, Impala)
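The nested loop above can be made runnable as follows. The data and the values of n, b, and s are assumptions for this example; the steps counter is added only to make the O(n·b) cost visible.

```python
import random

random.seed(5)
n, b, s = 1_000, 20, 50
price = [random.uniform(1.0, 10.0) for _ in range(n)]

# sid[i][j] == 1 iff tuple i belongs to subsample j; each column sums to s.
sid = [[0] * b for _ in range(n)]
for j in range(b):
    for i in random.sample(range(n), s):
        sid[i][j] = 1

sums = [0.0] * b
steps = 0
for i in range(n):
    for j in range(b):
        steps += 1                 # every (tuple, subsample) pair is visited
        if sid[i][j] == 1:
            sums[j] += price[i]

print(steps, n * b)
```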

Our approach: variational subsampling

• T → S: random sample (size n)
• s1, s2, s3, …, sb: subsamples of sizes n1, n2, n3, …, nb

Important properties (compared with traditional subsampling)
1. Traditional: a tuple may belong to multiple subsamples. Variational: each sampled tuple can belong to at most one subsample.
2. Traditional: the size of every subsample is s. Variational: allow subsamples to differ in size.

Can be implemented in SQL as a single group-by query!

Variational subsampling in SQL is fast

CITY | PRODUCT | PRICE | subsample ID
AA   | egg     | $3.00 | 1
AA   | milk    | $5.00 | 3
AA   | egg     | $3.00 | 2
NYU  | egg     | $4.00 | 4
NYU  | milk    | $6.00 | 3
NYU  | candy   | $2.00 | 1
SF   | milk    | $6.00 | 5
SF   | egg     | $4.00 | 4
SF   | egg     | $4.00 | 5

(n tuples; subsample ID = randint(1,b))

We call this augmented table a variational table.

Algorithm:
for i = 1, ..., n
    sum[sid[i]] += price[i]

Time Complexity: O(n)

No error est: 0.35 sec
Trad. subsampling: 118 sec
Var. subsampling: 0.73 sec (162× faster than traditional)
(based on 1G sample, Impala)
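A minimal sketch of the single-pass algorithm, again for Q = avg. The data and the values of n and b are assumptions for this example. The sqrt(n_i) scaling of each subsample's deviation is one reading of "appropriate scaling" for subsamples of unequal size; it is an assumption here, not a formula taken from the slides.

```python
import math
import random
import statistics

random.seed(6)
n, b = 10_000, 100
price = [random.expovariate(1 / 5.0) for _ in range(n)]

# Variational table: each tuple gets exactly one subsample id from randint(1, b).
sid = [random.randint(1, b) for _ in range(n)]

# Single pass, O(n): the equivalent of one GROUP BY sid query.
sums = [0.0] * (b + 1)
counts = [0] * (b + 1)
for i in range(n):
    sums[sid[i]] += price[i]
    counts[sid[i]] += 1

q_S = sum(price) / n
# Per-subsample averages; sizes n_i differ, so each squared deviation is weighted by n_i.
scaled_sq = [
    counts[j] * (sums[j] / counts[j] - q_S) ** 2
    for j in range(1, b + 1)
    if counts[j] > 0
]
std_err = math.sqrt(statistics.fmean(scaled_sq) / n)

clt_std_err = statistics.stdev(price) / math.sqrt(n)  # closed-form reference for avg
print(f"variational: {std_err:.4f}, CLT: {clt_std_err:.4f}")
```

Each tuple touches exactly one accumulator, which is why the work no longer grows with b.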

Main results

Theorem 1 (Consistency) The distribution of the aggregates of variational subsamples, after appropriate scaling, converges to the true distribution of the aggregate of a sample as $n \to \infty$.

Theorem 2 (Convergence Rate) The convergence rate of variational subsampling is equal to that of traditional subsampling when b is finite:

$O\left(n_s^{-1/2} + \frac{n_s}{n} + b^{-1/2}\right)$

where $n_s$ is the subsample size and $b^{-1/2}$ is the error term from the finite b (the Dvoretzky–Kiefer–Wolfowitz inequality).

Page 74:

Experiments

Datasets:
• 500GB TPC-H benchmark / 200GB Instacart dataset / synthetic datasets

Underlying databases:
• Amazon Redshift, Apache Spark SQL, Apache Impala on 10+1 r4.xlarge cluster

1. Does VerdictDB provide enough speedup?

2. Is VerdictDB (UAQP)’s performance comparable to a tightly-integrated AQP engine?

3. Is variational subsampling statistically correct?

Yes

Page 77:

Speedup for Redshift

t3, t10, t15: no speedup (i.e., 1×) due to high-cardinality grouping attributes

Other queries: 26.3× speedups (relative errors were 2%)

[Figure: speedup for Redshift; panels: TPC-H benchmark and micro-benchmark]

Redshift 24.0× Speedup
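One way to see why high-cardinality grouping attributes leave little room for sampling: when there are nearly as many groups as rows, a small uniform sample misses most groups entirely. The sketch below counts how many groups survive a uniform sample. The row count, group counts, and 1% sampling rate are hypothetical, not the settings used in the experiments, and the sketch says nothing about how VerdictDB itself decides when to fall back to the base table.

```python
import random


def groups_covered(num_rows, num_groups, sampling_prob, rng):
    """Fraction of groups with at least one row in a uniform (Bernoulli) sample."""
    seen = set()
    for row in range(num_rows):
        if rng.random() < sampling_prob:
            seen.add(row % num_groups)  # rows are spread evenly over the groups
    return len(seen) / num_groups


rng = random.Random(1)
low = groups_covered(200_000, 10, 0.01, rng)        # low-cardinality group-by
high = groups_covered(200_000, 100_000, 0.01, rng)  # high-cardinality group-by
print(f"groups covered: low-cardinality {low:.0%}, high-cardinality {high:.1%}")
```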

Page 79:

Speedup for Apache Spark & Impala

$$\text{speedup} = \frac{\text{overhead} + \text{processing}}{\text{overhead} + \text{sample processing}}$$

Lower overhead → larger speedup

Spark SQL 12.0× Speedup

Impala 18.6× Speedup
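To make the speedup ratio on this slide concrete, here is a small worked example. The timings are made-up numbers chosen only to show how a fixed overhead caps the achievable speedup; they are not measurements from the experiments.

```python
def speedup(overhead, processing, sample_processing):
    """Ratio from the slide: (overhead + processing) / (overhead + sample processing)."""
    return (overhead + processing) / (overhead + sample_processing)


# Hypothetical timings in seconds: the full scan takes 60 s, the sample scan 0.6 s.
for overhead in (0.5, 2.0, 5.0):
    print(f"overhead {overhead:>4.1f}s -> speedup {speedup(overhead, 60.0, 0.6):.1f}x")
```

With the processing times held fixed, the ratio shrinks as the overhead grows, which is the "lower overhead, larger speedup" relationship stated above.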

Page 82:

UAQP vs. Tightly-integrated AQP

VerdictDB was comparable to SnappyData.

SnappyData ver 0.8 didn’t support the join of two sample tables.

Page 85:

Variational subsampling: correctness

The bars are 5th and 95th percentiles.

Relative errors naturally become smaller for higher selectivity.

The accuracy of variational subsampling ≈ (a) bootstrap and (b) traditional subsampling.

The estimated errors are close to the true errors.
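The relationship between selectivity and relative error can be reproduced with a small simulation. The sketch below repeatedly draws a sample, estimates the fraction of rows passing a filter, and reports the 5th and 95th percentiles of the relative error. The sample size, trial count, and selectivities are hypothetical, and the estimator is a plain sample proportion rather than VerdictDB's.

```python
import random
import statistics


def relative_error_percentiles(selectivity, sample_size, trials, rng):
    """5th and 95th percentiles of the relative error of a sampled count estimate."""
    errors = []
    for _ in range(trials):
        hits = sum(rng.random() < selectivity for _ in range(sample_size))
        estimate = hits / sample_size          # estimated fraction of matching rows
        errors.append(abs(estimate - selectivity) / selectivity)
    cuts = statistics.quantiles(errors, n=20)  # 19 cut points in 5% steps
    return cuts[0], cuts[-1]


rng = random.Random(2)
results = {s: relative_error_percentiles(s, 2_000, 200, rng) for s in (0.01, 0.1, 0.5)}
for s, (p5, p95) in results.items():
    print(f"selectivity {s:>4}: 5th pct {p5:.3f}, 95th pct {p95:.3f}")
```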

Page 88:

Variational subsampling: convergence rate

Variational subsampling was significantly faster.

The accuracy was almost the same for relatively large samples.

Page 95:

Conclusion: Universal AQP is viable

1. Comparable performance to a fully-integrated solution

2. New error estimation technique: variational subsampling

   1. Generality and computational efficiency

   2. The first subsampling-based error estimation technique for AQP

3. Offers considerable speedup (18.45× on average, up to 171×, less than 2-3% errors)

Open-sourced (Apache v2.0): http://verdictdb.org

Page 100:

Future Work

• Adding more drivers (Presto, Teradata, Oracle, SQL Server, …)

Development

• Support for online sampling

• Robust physical designer (see CliffGuard @ SIGMOD 15)

• Integration with ML libraries (sampling-based model tuning)

Research

Page 101:

Thank You

http://verdictdb.org

Page 102:

VerdictDB: current status

• We support
  • aggregates: sum, count, avg, count-distinct, quantiles, UDAs
  • sources: base table, derived table, equi-join
  • filters: comparison, some subquery
  • others: group-by, having, etc.

• Open-sourced under Apache License version 2.0
  • http://verdictdb.org for code and documentation

• Upcoming features
  • Online sampling, automated physical designer

Page 105:

Example of query rewriting

original:

select l_returnflag, count(*) as cc
from lineitem
group by l_returnflag;

rewritten:

select vt1.`l_returnflag` AS `l_returnflag`,
       round(sum((vt1.`cc` * vt1.`sub_size`)) / sum(vt1.`sub_size`)) AS `cc`,
       (stddev(vt1.`count_order`) * sqrt(avg(vt1.`sub_size`))) / sqrt(sum(vt1.`sub_size`)) AS `cc_err`
from (select vt0.`l_returnflag` AS `l_returnflag`,
             ((sum((1.0 / vt0.`sampling_prob`)) / count(*)) * sum(count(*)) OVER (partition BY vt0.`l_returnflag`)) AS `cc`,
             vt0.`sid` AS `sid`,
             count(*) AS `sub_size`
      from lineitem_sample vt0
      GROUP BY vt0.`l_returnflag`, vt0.`sid`) AS vt1
GROUP BY vt1.`l_returnflag`;
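The inverse-probability scaling in the inner query, sum(1.0 / sampling_prob), can be tried on a toy table. The sketch below uses Python's built-in sqlite3 module, builds a hypothetical lineitem table and a uniform 10% sample that carries a sampling_prob column, and compares the scaled counts with the exact ones. It reproduces only the point estimate: the subsample IDs (sid) and the cc_err column are left out, and the table contents are invented for illustration.

```python
import random
import sqlite3

rng = random.Random(3)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_returnflag TEXT)")
conn.execute("CREATE TABLE lineitem_sample (l_returnflag TEXT, sampling_prob REAL)")

# Toy data: three return flags with different frequencies.
flags = rng.choices(["A", "N", "R"], weights=[1, 2, 1], k=40_000)
conn.executemany("INSERT INTO lineitem VALUES (?)", [(f,) for f in flags])

# Uniform sample: keep each row with probability p and record p next to it.
p = 0.1
sampled = [(f, p) for f in flags if rng.random() < p]
conn.executemany("INSERT INTO lineitem_sample VALUES (?, ?)", sampled)

# Exact answer from the base table.
exact = dict(conn.execute(
    "SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag"))
# Approximate answer: each sampled row stands in for 1 / sampling_prob rows.
approx = dict(conn.execute(
    "SELECT l_returnflag, round(sum(1.0 / sampling_prob)) "
    "FROM lineitem_sample GROUP BY l_returnflag"))

for flag in sorted(exact):
    print(flag, exact[flag], approx[flag])
```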

Page 106:

Bibliography

[Pol and Jermaine ’05] Pol, Abhijit, and Christopher Jermaine. "Relational confidence bounds are easy with the bootstrap." SIGMOD, 2005, pp. 587-598.

[BlinkDB ’13] Agarwal, Sameer, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. "BlinkDB: queries with bounded errors and bounded response times on very large data." EuroSys, 2013.

[BlinkDB ’14] Agarwal, Sameer, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, and Ion Stoica. "Knowing when you're wrong: building fast and reliable approximate query processing systems." SIGMOD, 2014.

[Quickr ’16] Kandula, Srikanth, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and Bolin Ding. "Quickr: Lazily approximating complex adhoc queries in BigData clusters." SIGMOD, 2016.

Page 107:

Bibliography

[G-OLA ’15] Zeng, Kai, Sameer Agarwal, Ankur Dave, Michael Armbrust, and Ion Stoica. "G-OLA: Generalized on-line aggregation for interactive analysis on big data." SIGMOD, 2015.

[ABM ’14] Zeng, Kai, Shi Gao, Barzan Mozafari, and Carlo Zaniolo. "The analytical bootstrap: a new method for fast error estimation in approximate query processing." SIGMOD, 2014.

[Join Synopses ’99] Acharya, Swarup, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. "Join synopses for approximate query answering." SIGMOD Record, 1999.

[Wander Join ’16] Li, Feifei, Bin Wu, Ke Yi, and Zhuoyue Zhao. "Wander join: Online aggregation via random walks." SIGMOD, 2016.

[Online ’97] Hellerstein, Joseph M., Peter J. Haas, and Helen J. Wang. "Online aggregation." SIGMOD, 1997.

[Politis ’94] Politis, Dimitris N., and Joseph P. Romano. "Large sample confidence regions based on subsamples under minimal assumptions." The Annals of Statistics, 1994.

Page 110:

Variational subsampling: overhead

Variational subsampling was 100×–237× faster compared to Consolidated Bootstrap.

Overhead of variational subsampling: 0.38–0.87 seconds