Page 1: Presented By  Srikanth  Vadada Fall 2010 - CSE 6339 23 rd  Sep 2010

University of Texas at Arlington

Presented By

Srikanth Vadada

Fall 2010 - CSE 6339

23rd Sep 2010

Dynamic Sample Selection for Approximate Query Processing

Brian Babcock, Surajit Chaudhuri, Gautam Das

ACM SIGMOD 2003

Page 2: 3 Key Terms for the Topic

Approximate Query Processing (AQP)

Biased Sample

Dynamic Selection

Goal: Dynamically construct an appropriate biased sample for approximate query processing

Page 3: Why Approximate Query Processing (AQP)?

Rapid strides in data collection and management technologies have resulted in very large databases.

Effective data analysis methods are a subject of ongoing research.

Analysis queries require aggregation or summarization and have expensive running times.

Requirements of analysis systems (decision support systems):

Short query response time

Exactness of query results is less important

AQP techniques are the solution.

Page 4: Why Sampling Techniques?

Other ways to speed up data analysis queries:

OLAP: materialized views of data cubes

Building indexes

These physical data design methods spend pre-processing time and space.

They are effective when the query workload is known in advance.

It is expensive to build indexes for all possible queries.

AQP and physical database design methods are complementary.

Page 5: Approximate Query Processing (Related Work): Online Aggregation

Hellerstein J., Haas P., Wang H. Online Aggregation, ACM SIGMOD 1997

Approximate answers are produced during the early stages of query processing.

Answers are gradually refined until all the data is processed.

Advantages:

No pre-processing required

Allows progressive refinement of answers at runtime

Disadvantages:

Requires random disk access (slow)

Requires changes to the query processor code

Page 6: Approximate Query Processing (Related Work): Join Synopses - Sampling-based method

Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999

Join queries: primary key joins

Pre-computation: join of fact tables with dimension tables

Disadvantages:

Does not extend to queries that involve non-foreign-key joins

Page 7: Approximate Query Processing (Related Work): Icicles - Weighted sampling method

Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB 2000

Tuples are weighted by how frequently they are accessed by the queries in the workload.

Disadvantages:

Addresses only the low selectivity problem

Page 8: Dynamic Sample Selection

“Appropriately biased” samples give accurate approximate results for most queries.

The appropriate biased sample varies from query to query.

Previous sampling methods vs dynamic sample selection:

Previous sampling methods: a single sample with a fixed bias

Dynamic sample selection: an individually tailored sample for each query

Creation of subsamples is done offline (pre-processing).

Assembly into an overall sample is done online (runtime).

Page 9: Effectiveness of Biased Sampling - Example

The database consists of 100 product tuples:

Product = “Stereo”: 90 tuples

Product = “TV”: 10 tuples

Sampling 10 tuples in 2 ways:

1. 10% of the tuples chosen uniformly, each with weight 10

2. 0% of the “Stereo” tuples and 100% of the “TV” tuples, each with weight 1

Query: count of “TV” tuples. Which sample always gives the correct answer?

2nd sample: always gives the exact answer

1st sample: gives the exact answer only if exactly 1 of the TV tuples is chosen
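The comparison on this slide can be checked with a little arithmetic. The sketch below assumes the uniform sample draws 10 of the 100 tuples without replacement (the slide does not say how the sample is drawn); the function names are illustrative only.

```python
from math import comb

N_STEREO, N_TV = 90, 10      # table contents from the slide
SAMPLE_SIZE = 10             # both sampling schemes keep 10 tuples


def biased_estimate() -> int:
    """Sample 2: every TV tuple is kept with weight 1, no Stereo tuples."""
    sampled_tv = N_TV        # 100% of the TV tuples are in the sample
    return sampled_tv * 1    # each sampled tuple counts once


def prob_uniform_exact() -> float:
    """Sample 1: chance that exactly one TV tuple is among the 10 drawn.

    Each sampled tuple has weight 10, so the estimate equals the true
    count of 10 only when exactly one TV tuple is drawn.
    """
    total = N_STEREO + N_TV
    ways_one_tv = comb(N_TV, 1) * comb(N_STEREO, SAMPLE_SIZE - 1)
    return ways_one_tv / comb(total, SAMPLE_SIZE)


if __name__ == "__main__":
    print("biased sample estimate:", biased_estimate())
    print("P(uniform sample is exact): %.3f" % prob_uniform_exact())
```

Under that assumption the uniform sample returns the exact count well under half the time, while the biased sample is exact by construction.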

Page 10: Static vs Dynamic Sample Selection

Figure: two panels, Standard Sampling and Dynamic Sampling, each showing how SAMPLE tables are drawn from the DATA

Page 11: Dynamic Sample Selection Architecture (1 / 3): Pre-Processing Phase

Extra disk space is not taken advantage of by standard sampling methods.

DSS uses this space effectively by creating a large sample containing a family of differently biased subsamples.

Step 1: Examine the data distribution to choose a set of biased samples. This results in strata that may overlap.

Step 2: Create the samples, with potentially different sampling rates for each stratum, and generate metadata describing the characteristics of each sample.

Figure labels: Query Workload, Data, Select Strata, Build Sample, Sample Data, Meta-Data

Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
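The two pre-processing steps can be sketched as below. This is a minimal illustration, not the paper's algorithm: the strata are supplied by hand as predicates, rows are sampled independently at the stratum's rate, and the names `build_samples`, `strata`, and `rates` are hypothetical.

```python
import random


def build_samples(rows, strata, rates, seed=0):
    """Build one sample per stratum and record metadata about each.

    rows   : list of dict records
    strata : dict name -> predicate(row) -> bool (strata may overlap)
    rates  : dict name -> sampling rate in (0, 1]
    """
    rng = random.Random(seed)
    samples, metadata = {}, {}
    for name, predicate in strata.items():
        members = [row for row in rows if predicate(row)]
        rate = rates[name]
        # Keep each member independently with probability `rate`.
        sample = [row for row in members if rng.random() < rate]
        samples[name] = sample
        metadata[name] = {
            "rate": rate,                  # needed later to scale aggregates
            "stratum_size": len(members),
            "sample_size": len(sample),
        }
    return samples, metadata


if __name__ == "__main__":
    rows = [{"product": "Stereo"}] * 90 + [{"product": "TV"}] * 10
    strata = {
        "all": lambda row: True,
        "tv_only": lambda row: row["product"] == "TV",
    }
    _, metadata = build_samples(rows, strata, {"all": 0.1, "tv_only": 1.0})
    print(metadata)
```

The metadata dictionary plays the role of the Meta-Data box in the figure: it is what a runtime component would consult when deciding which sample fits a query.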

Page 12: Dynamic Sample Selection Architecture (2 / 3): Runtime Phase

When queries are issued at runtime, DSS rewrites them to run against the sample tables.

The appropriate sample tables to use are determined by comparing the query with the metadata.

The algorithms for choosing which samples to build during pre-processing, and which samples to use for query processing, are not described as part of the architecture.

Figure labels: Query, Meta-Data, Sample Data, Choose Samples, Rewrite Query

Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Page 13: Dynamic Sample Selection Architecture (3 / 3): Policies for Sample Selection

The choice of sample is guided by the syntax of the incoming query.

Examples:

A separate sample for each table, with the sample chosen based on the FROM clause

A separate sample for each pre-specified aggregate expression, with the sample chosen based on the SELECT clause
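The first policy, choosing a sample from the FROM clause, can be sketched as follows. Everything here is an assumption made for illustration: the metadata layout, the sample table name `s_employee`, the 30% rate, and the regex-based parsing, which only handles a single-table FROM clause.

```python
import re

# Hypothetical metadata: base table name -> sample table and sampling rate.
SAMPLE_METADATA = {
    "employee_tbl": {"sample_table": "s_employee", "rate": 0.30},
}

FROM_PATTERN = re.compile(r"\bFROM\s+(\w+)", re.IGNORECASE)


def choose_sample(query: str):
    """Pick a sample by looking only at the table named in the FROM clause."""
    match = FROM_PATTERN.search(query)
    if match is None:
        return None
    return SAMPLE_METADATA.get(match.group(1).lower())


def rewrite_query(query: str) -> str:
    """Point the query at the chosen sample and scale count(*) by 1/rate."""
    info = choose_sample(query)
    if info is None:
        return query                       # no sample known: run as issued
    rewritten = FROM_PATTERN.sub("FROM " + info["sample_table"], query, count=1)
    scale = 1.0 / info["rate"]
    return re.sub(r"count\(\*\)", "count(*) * %.4f" % scale,
                  rewritten, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(rewrite_query("SELECT Age, count(*) FROM Employee_tbl GROUP BY Age"))
```

A production rewriter would work on a parsed query rather than on text; the point is only that the policy needs nothing beyond the query syntax and the metadata.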

Page 14: Motivation - Small Group Sampling (1 / 2)

Page 15: Motivation - Small Group Sampling (2 / 2)

Figure labels: Large Groups, Small Groups

Page 16: Small Group Sampling - Example

Example: aggregation queries with “group-bys”

select Age, Income, count(*)

from Employee_Tbl

group by Age, Income

Age   Income (in 1000s)   Designation
30    60                  Developer
35    70                  Developer
25    60                  Developer
25    70                  Lead
30    70                  Lead
25    70                  Developer
25    70                  Developer
30    60                  Developer
30    60                  Developer
40    70                  Developer
35    100                 Manager
25    60                  Developer
25    70                  Developer
35    100                 Developer
30    60                  Developer
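For reference, the exact answer to this query over the 15 rows can be computed directly. The snippet below is only a worked check of the example table; `group_counts` is an illustrative name.

```python
from collections import Counter

# (Age, Income in 1000s, Designation), copied from the slide's table.
EMPLOYEES = [
    (30, 60, "Developer"), (35, 70, "Developer"), (25, 60, "Developer"),
    (25, 70, "Lead"), (30, 70, "Lead"), (25, 70, "Developer"),
    (25, 70, "Developer"), (30, 60, "Developer"), (30, 60, "Developer"),
    (40, 70, "Developer"), (35, 100, "Manager"), (25, 60, "Developer"),
    (25, 70, "Developer"), (35, 100, "Developer"), (30, 60, "Developer"),
]


def group_counts(rows):
    """Equivalent of: select Age, Income, count(*) ... group by Age, Income."""
    return Counter((age, income) for age, income, _ in rows)


if __name__ == "__main__":
    for (age, income), count in sorted(group_counts(EMPLOYEES).items()):
        print(age, income, count)
```

The result has seven groups. Two of them, (30, 60) and (25, 70), hold four rows each, while (35, 70), (30, 70), and (40, 70) hold a single row, which is the kind of group a small uniform sample can easily miss.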

Page 17: Small Group Sampling - Illustration (1/3)

Overall sample: uniform sampling, used for the large groups.

Small group tables: one or more sample tables for the smaller groups.

Pre-Processing Phase:

1. Create an overall sample, s_overall

2. Build the “Age” histogram (column index: 0)

r: base sampling rate; determines the size of the overall sample (e.g., 30%)

t: small group fraction; determines the maximum size of each small group table (e.g., 20%)

Figure labels: small group table s_age, s_overall

Page 18: Small Group Sampling - Illustration (2/3)

Pre-Processing Phase: “Income” histogram (column index: 1)

Figure labels: s_income, s_overall; bitmask value shown: 011
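The pre-processing phase illustrated on these two slides can be sketched as follows. The sketch assumes one particular rule for finding small groups: a column's common values are the smallest set of most frequent values covering at least a (1 - t) fraction of the rows, and all remaining values are small-group values. The function names and the row layout (Age at index 0, Income at index 1) are illustrative.

```python
import random
from collections import Counter

# (Age, Income in 1000s) pairs from the example table.
ROWS = [(30, 60), (35, 70), (25, 60), (25, 70), (30, 70), (25, 70), (25, 70),
        (30, 60), (30, 60), (40, 70), (35, 100), (25, 60), (25, 70),
        (35, 100), (30, 60)]


def uncommon_values(rows, col, t):
    """Values of column `col` left over once the most frequent values
    cover at least (1 - t) of the rows."""
    counts = Counter(row[col] for row in rows)
    needed = len(rows) * (1 - t)
    covered, common = 0, set()
    for value, count in counts.most_common():
        if covered >= needed:
            break
        common.add(value)
        covered += count
    return set(counts) - common


def preprocess(rows, columns, r, t, seed=0):
    """Return (s_overall, small_tables); each stored row carries a bitmask
    whose bit `col` is set when the row's value in `col` is uncommon."""
    rare = {col: uncommon_values(rows, col, t) for col in columns}

    def bitmask(row):
        return sum(1 << col for col in columns if row[col] in rare[col])

    rng = random.Random(seed)
    s_overall = [(row, bitmask(row)) for row in rows if rng.random() < r]
    small_tables = {
        col: [(row, bitmask(row)) for row in rows if row[col] in rare[col]]
        for col in columns
    }
    return s_overall, small_tables


if __name__ == "__main__":
    s_overall, small_tables = preprocess(ROWS, [0, 1], r=0.30, t=0.20)
    print("s_age:", small_tables[0])
    print("s_income:", small_tables[1])
    print("s_overall:", s_overall)
```

With the 15 example rows and t = 20%, this rule marks Age 40 and Income 100 as small-group values, so s_age holds one row with bitmask 001 and s_income holds two rows with bitmask 010.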

Page 19: Small Group Sampling - Illustration (3/3)

Runtime Phase:

Query issued:

SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income

Rewritten query:

SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income

UNION ALL

SELECT Age, Income, count(*) FROM s_income WHERE Bitmask & 1 = 0 GROUP BY Age, Income

/* 1 = 2^0, i.e., 001 (e.g., 010 & 001 = 000; 011 & 001 = 001) */

UNION ALL

SELECT Age, Income, count(*) * (100.0/30) FROM s_overall WHERE Bitmask & 3 = 0 GROUP BY Age, Income

/* 3 = 2^0 + 2^1, i.e., 011 (e.g., 001 & 011 = 001; 011 & 011 = 011; 010 & 011 = 010) */

Figure labels: s_age (column index 0, bit 001), s_income (column index 1, bit 010), s_overall
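The rewritten query can be tried end to end with an in-memory SQLite database. The sketch assumes that Age 40 and Income 100 are the small-group values for the 15 example rows, and that the caller supplies the rows of the 30% overall sample; the helper names are illustrative.

```python
import sqlite3

# (Age, Income in 1000s) pairs from the example table.
ROWS = [(30, 60), (35, 70), (25, 60), (25, 70), (30, 70), (25, 70), (25, 70),
        (30, 60), (30, 60), (40, 70), (35, 100), (25, 60), (25, 70),
        (35, 100), (30, 60)]


def bitmask(age, income):
    # Bit 0 marks a small-group Age value, bit 1 a small-group Income value.
    # Treating Age 40 and Income 100 as the small groups is an assumption.
    return (1 if age == 40 else 0) | (2 if income == 100 else 0)


def approximate_counts(overall_rows):
    """Run the rewritten query; overall_rows is the 30% overall sample."""
    con = sqlite3.connect(":memory:")
    for table in ("s_age", "s_income", "s_overall"):
        con.execute("CREATE TABLE %s (Age INT, Income INT, Bitmask INT)" % table)
    tagged = [(a, i, bitmask(a, i)) for a, i in ROWS]
    con.executemany("INSERT INTO s_age VALUES (?, ?, ?)",
                    [x for x in tagged if x[2] & 1])
    con.executemany("INSERT INTO s_income VALUES (?, ?, ?)",
                    [x for x in tagged if x[2] & 2])
    con.executemany("INSERT INTO s_overall VALUES (?, ?, ?)",
                    [(a, i, bitmask(a, i)) for a, i in overall_rows])
    query = """
        SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income
        UNION ALL
        SELECT Age, Income, count(*) FROM s_income
        WHERE (Bitmask & 1) = 0 GROUP BY Age, Income
        UNION ALL
        SELECT Age, Income, count(*) * (100.0 / 30) FROM s_overall
        WHERE (Bitmask & 3) = 0 GROUP BY Age, Income
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    sample = [(30, 60), (25, 70), (40, 70), (35, 100), (25, 60)]
    for row in approximate_counts(sample):
        print(row)
```

The bitmask filters keep a row from being counted twice: rows already answered from s_age are skipped in s_income, and rows belonging to either small group table are skipped in s_overall, so only the scaled branch carries sampling error.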

Page 20: Accuracy Metrics (1 / 2)

As many groups as possible should be preserved in the approximate answer.

The error in the aggregate value for each group should be small.

$Q$ = aggregation query

$G = \{g_1, g_2, g_3, \dots, g_n\}$ is the set of $n$ groups in the answer to $Q$

$x_i$ = aggregate value for group $g_i$

$A$ = approximate answer to $Q$

$G' = \{g_{i_1}, g_{i_2}, g_{i_3}, \dots, g_{i_m}\}$ is the set of $m$ groups in $A$

$x'_{i_j}$ = aggregate value for group $g_{i_j}$ in $A$

Page 21: Accuracy Metrics (2 / 2)

Percentage of groups from Q missed by A

Average relative error on Q of A

Average squared relative error on Q of A
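The formulas for these three metrics did not survive in the transcript. The sketch below assumes one common form: the percentage of missed groups is 100 (n - m) / n, and both error averages are taken over all n groups of Q, with each missed group charged an error of 1. Treat that charging rule, and the names used, as assumptions rather than the slide's exact definitions.

```python
def accuracy_metrics(exact, approx):
    """Compare an approximate group-by answer with the exact one.

    exact, approx : dict mapping group -> aggregate value
                    (exact values are assumed to be non-zero)
    Returns (pct_groups, rel_err, sq_rel_err). Groups missing from
    `approx` are charged an error of 1 each; groups that appear only
    in `approx` are ignored.
    """
    n = len(exact)
    found = [g for g in exact if g in approx]
    missed = n - len(found)
    pct_groups = 100.0 * missed / n
    rel = sum(abs(exact[g] - approx[g]) / abs(exact[g]) for g in found)
    sq = sum(((exact[g] - approx[g]) / exact[g]) ** 2 for g in found)
    return pct_groups, (missed + rel) / n, (missed + sq) / n


if __name__ == "__main__":
    exact = {"a": 10, "b": 20, "c": 5, "d": 1}
    approx = {"a": 12, "b": 20, "c": 4}       # group "d" is missed
    print(accuracy_metrics(exact, approx))
```

Charging missed groups inside the error averages means an answer cannot look accurate simply by dropping the groups it estimates badly.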

Page 22: Experimental Results (1 / 2)

TPC-H database; COUNT and SUM queries; number of columns in all tables = 245

RelErr and PctGroups increased for both uniform sampling and small group sampling.

The increase was more pronounced for uniform sampling.

Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Page 23: Experimental Results (2 / 2)

Uniform sampling outperforms small group sampling at low skew.

Small group sampling does better at moderate to high skew.

Speedup decreases as the number of grouping columns increases.

Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

Page 24: Conclusions

Dynamic sample selection improves on previous AQP methods.

It makes productive use of additional disk space.

Small group sampling targets aggregate queries with group-bys.

Small group sampling outperforms the other techniques compared, particularly at moderate to high skew.

Page 25: References

Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

http://crystal.uta.edu/~cse6339/Fall08DBIR.htm

http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

Page 26: Q&A

Questions?