online aggregation joseph m. hellerstein university of california, berkley peter j. haas ibm...

Online aggregation

Joseph M. Hellerstein University of California,

Berkley

Peter J. Haas IBM Research Division

Helen J. Wang University of California,

Berkley

Presented By:Arjav Dave

Overview

• Introduction• Goals• Implementation• Performance Evaluation• Future Work

Problems with Aggregation

• Aggregation is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time and then the final answer is returned.

• Users are forced to wait without feedback while the query is being processed.

• Aggregate queries are generally used to get a rough picture of a large body, yet they are computed with painstaking precision.

What is Online Aggregation and Why?

• Permits users to observe the progress of their aggregation.

• Controls the execution on the fly.

Example

• Consider the query:

select avg(final_grade) from grades where course_name=‘cse186’

• Without index the query will scan all records before returning the answer.

Example using online aggregation

• The user can stop the query processing if the result is within specified confidence interval.

Interface with groups

• Consider a query with group by clause having 6 groups in output

• There are 6 stop signs each for one group. They can be used to stop the group processing.

• Such an interface is easy for the non-statistical users.

Usability Goals

• Continuous Observation: Statistical, Graphical and other intuitive interfaces. Interfaces must be extensible for each aggregate function.

• Control of Time/Precision: User should be able to terminate processing at any time controlling trade off between time and precision.

• Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.

Performance Goals

• Minimum time to Accuracy: Minimize time required to produce useful estimate of the final answer.

• Minimize time to completion: Minimize the time required to produce the final answer.

• Pacing: The running aggregates should be updated at regular rates, without being so frequent that they overburden the user or user interface.

Building a system for Online AggregationTwo Approaches:•Naïve Approach•Modifying a DBMS

Naive Approach

• Consider the query:select running_avg(final_grade),running_confidence(final_grade),running_interval(final_grade) from grades;

• POSTGRES supports user defined function which can be defined for simple aggregates. For complex aggregates performance and functional issues arise.

Modifying a DBMS

• Modify the database engine to support online aggregation.

• Random access to Data:▫ Necessary to get the statistically meaningful estimates

of the precision of running aggregates.

• Access Methods:▫ Heap Scan▫ Index Scan▫ Sampling from Indices

Types of Access Methods

• Heap Scan: ▫ Generally stored in random order, so easy to fetch▫ In clustered files there may be some logical order, so

choose other access method for queries over that attributes.

• Index Scan:▫ Returns tuples based on some attribute or in groups

based on some attribute.

• Sampling from Indices: ▫ Ideal for producing meaningful confidence intervals. ▫ Less efficient than other two.

Fair, Non-Blocking Group By

• Traditional technique does sorting and then grouping. But sorting blocks

• Instead hash the input relation on its grouping columns. But does not perform better as the number of groups increases.

• Hybrid hashing or its optimized version Hybrid Cache can be used.

• For DISTINCT queries same technique can be used.

Index Striding

• Update for the group with few members will be very infrequent.

• Index Striding: It uses Round-Robin technique to fetch tuples from different groups fairly.

• Advantages:▫ Output is updated according to default or user settings.▫ Delivery of tuples of a group can be stopped, so delivery

of tuples for other groups is much faster.

Non-Blocking Join Algorithms

• Sort-Merge Join: Sorting blocks.• Merge Join: Not acceptable for access methods

that fetch tuples in sorted order.• Hybrid Hash Join: Acceptable if inner relation is

small, as it blocks to hash the inner relation.• Pipeline Hash: Less Efficient than hybrid hash

join. But efficient if both the relations are large.• Nested-Loops join: Not useful for joins with large

inner relation non-indexed.

Optimization

• Avoid Sorting

• Function to calculate cost for Blocking sub-operations (e.g. Hashing inner relation in hybrid hash join) according to processing time should be exponential.

• Maximize user control.

• Trade off need to be evaluated between the output rate and the time to completion.

Aggregate Functions

• New aggregate functions must be defined to return running confidence intervals.

• Query Executor must be modified to provide running aggregate values for display.

• An API must be provided to control the rate e.g. stopGroup, speedupGroup, slowDownGroup, setSkipFactor.

Running Confidence Intervals

• Precision of running aggregates is given by running confidence intervals

• Three types:▫ Conservative Confidence Interval▫ Large Sample Intervals▫ Deterministic Confidence Interval

Future Work

• Enhancing User Interface• Support for Nested Queries• Checkpointing and Continuation• Tracking online queries

QUESTIONS?

online aggregation joseph m. hellerstein university of california, berkley peter j. haas ibm...

Documents