presented by srikanth vadada fall 2010 - cse 6339 23 rd sep 2010

University of Texas at Arlington

Presented By

Srikanth Vadada

Fall 2010 - CSE 6339

23rd Sep 2010

Dynamic Sample Selection for Approximate Query Processing

Brian Babcock, Surajit Chaudhuri, Gautam DasACM SIGMOD 2003


3 Key Terms for the Topic

Dynamic Selection

Biased SampleApproximate Query Processing (AQP)

Goal : Dynamically construct an appropriate Biased Sample for Approximate Query Processing


Why Approximate Query Processing (AQP) ? Rapid strides in Data Collection & Management Technologies

Resulting in very large Databases

Effective Data Analysis methods – Ongoing Research Analysis Queries require aggregation or summarizations Expensive Running Times

Requirements of Analysis Systems (Decision Support Systems) Short Query Response Time Exactness of Query Results less important

AQP Techniques are the Solution


Why Sampling Techniques ? Data Analysis Queries

OLAP – Materialized Views of Data Cubes

Building Indexes

Physical Data Design – Use Preprocessing Time & Space

Effective when Query Workload is known in advance

Expensive to build indexes for all possible queries

AQP & Physical Database Design Methods Complementary


Approximate Query Processing (Related Work)Online Aggregation

Hellerstein J., Haas P., Wang H. Online Aggregation, CM SIGMOD 1997

Approximate answers are produced during early stages

Gradual refinement until data is processed

Advantages: No pre-processing required Allows progressive refinement of answers at runtime

Disadvantages : Require random disk access (slow). Requires query processor code change.


Approximate Query Processing (Related Work)Join Synopses - Sampling based method

Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999

Join Queries – Primary Key Joins

Pre Computation - Join of Fact Tables with Dimension Tables

Disadvantages: Does not extend to queries that involve non-foreign key joins


Approximate Query Processing (Related Work)Icicles - Weighted Sampling method

Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB, 2000.

Frequency of Tuple Access by Queries in Workload

Disadvantages: Addresses only low selectivity problem


Dynamic Sample Selection “Appropriate Biased” Samples

Give accurate approximate results for most queries

Appropriate Biased Sample varies from Query to Query

Previous Sampling Methods vs Dynamic Sample Selection Previous sampling - Single Sample with fixed bias

Individual Tailored Sample for each query

Creation of subsamples is done offline - Preprocessing

Assembly into an overall sample is done online - Runtime


Effectiveness of Biased Sampling - Example Database consists of 100 Product Tuples

Product = “Stereo” - 90 Tuples Product = “TV” – 10 Tuples

Sampling 10 tuples in 2 ways 10% of the tuples uniformly each with weight 10 0% of “Stereo” tuples and 100% “TV” tuples with weight 1

Query – Count of “TV” Tuples Which gives a correct answer always?

2nd Sample - Always gives the exact answer

1st Sample - Only if exactly 1 of the TV tuples is chosen


DA

TA

SAM

PLE

DA

TA

SAM

PLESA

MPLE

SAM

PLESA

MPLE

Dynamic SamplingStandard Sampling

Static vs Dynamic Sampling Selection


Dynamic Sample Selection Architecture ( 1 / 3)Pre- Processing Phase

Extra Disk space not taken advantage by Standard Sampling Methods

DSS uses this effectively by creating a large sample containing a family of differently biased subsamples

Step 1 - Examine Data Distribution for creating a set of biased samples – Results into Overlapping Strata

Step 2 – Samples are created with potentially different sampling rates for each stratum. Generate Metadata – Characteristics of each sample

Query Workload

Data

Select Strata

Build Sample

Sample Data

Meta-Data

Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.


Dynamic Sample Selection Architecture ( 2 / 3)Runtime Phase

When Queries are issued at Runtime - DSS re-writes the queries to run against sample tables

Appropriate Sample Tables to use are determined by comparing Query with the Meta data

Algorithms for choosing which samples to build in pre processing and samples to use for Query Processing are not described


Query

Meta-Data

Sample Data

ChooseSamples

RewriteQuery


Dynamic Sample Selection Architecture ( 3 / 3) Policies for Sample Selection

Choice of Sample guided by incoming Query Syntax

Examples Separate sample for each table and choose sample

based on the FROM clause

Separate sample for each pre-specified aggregate expression and choose sample based on the SELECT clause


Motivation - Small Group Sampling (1 / 2)


Motivation - Small Group Sampling ( 2 / 2)

Large Groups

Small Groups


Small Group Sampling - Example Example : Aggregation queries with “group-bys”

select Age, Income, count(*)

from Employee_Tbl

group by Age, Income

Age Income (In 1000s) Designation30 60 Developer

35 70 Developer

25 60 Developer

25 70 Lead

30 70 Lead

25 70 Developer

25 70 Developer

30 60 Developer

30 60 Developer

40 70 Developer

35 100 Manager

25 60 Developer

25 70 Developer

35 100 Developer

30 60 Developer


Small Group Sampling – Illustration (1/3) sample – perform uniform sampling on large groups. Small group tables - one or more sample tables for smaller groups.

Pre-Processing Phase:1. Create a overall sample s_overall 2. “Age” Histogram ( Column Index: 0)

r : Base Sampling rate, determines the size of Overall Sample (eg, 30%) t : small group fraction, max size of each small group table (eg, 20%)

Small group table

s_age

s_overall


Pre-Processing Phase:“Income” Histogram (Column Index: 1)

s_income

s_overall

011

011

Small Group Sampling – Illustration (2/3)


Query Issued - SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income

SELECT Age, Income, count(*) FROM s_age GROUP BY Age, IncomeUNION ALLSELECT Age, Income, count(*) FROM s_income GROUP BY Age, IncomeWHERE Bitmask & 1 = 0 /* ie, 001 . (eg, 010 & 001 = 000 ; 011 & 001 = 1)*/UNION ALLSELECT Age, Income, count(*) * (100/30) FROM s_overall GROUP BY Age, IncomeWHERE Bitmask & 3 = 0 /* 3 = 20 + 21 ie, 011 (eg, 001 & 011 = 1; 011 & 011 = 1; 010 & 011 = 1)*/

s_income (Column Index: 1 - 010)

s_age (Column Index: 0 - 001)

s_overall

Small Group Sampling – Illustration (3/3)Runtime Phase:


Accuracy Metrics (1 / 2)

As many possible groups to be preserved in approximate answer

Error in the aggregate value for each group should be small

Q = Aggregation QueryLet G = {g1, g2, g3, … gn} be the set of n groups in the answer to Qxi = aggregate value for group gi.

A= Approximate Answer to QG’ = {gi1, gi2, gi3, … gim} be the set of m groups in Ax’i1 = aggregate value for group gij.


Accuracy Metrics (2 / 2)

Percentage of Groups from Q missed by A

Average relative error on Q of A

Average squared relative error on Q of A


Experimental Results ( 1 / 2)

TPC – H Database , Count & Sum queries , Number of Columns in all Tables =245

RelErr, PctGroups increased for Uniform Sampling & Small Group Sampling

Increase was more pronounced for Uniform SamplingFig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.


Experimental Results ( 2 / 2)

Uniform Sampling outperforms Small group sampling at low skews

Small group sampling does better at moderate to high skew

Speedup decreases as the number of grouping columns increase



Conclusions

Dynamic Sample Selection improves on previous AQP Methods

Productively utilizes additional Disk Space

Small Group Sampling targets aggregate queries with group bys

Small Group sampling outperforms other techniques


ReferencesBabcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing

http://crystal.uta.edu/~cse6339/Fall08DBIR.htm

http://crystal.uta.edu/~cse6339/Fall09DBIR.htm


AQ&Questions ?Questions ?

presented by srikanth vadada fall 2010 - cse 6339 23 rd sep 2010

Documents