
Ripple Joins for Online Aggregation

by Peter J. Haas and Joseph M. Hellerstein, published in June 1999

presented by Ronda Hilton

Overview

This paper tells how to join a bunch of tables and compute SUM, COUNT, or AVG with GROUP BY clauses, showing approximate results immediately, along with a confidence interval, from the first few tuples retrieved, and updating a GUI display with a closer approximation as the join processes more tuples.

Ripple joins compared to our previous topics

General research area: algorithms
another approximation algorithm
online processing, not maintaining a sample set
aggregate queries: joins and GROUP BY
requires random retrieval
uses probabilistic calculations to determine the quality of the approximate result
not optimizing; implemented as middleware on the DBMS

Traditional Hash Join

stores the smaller relation in memory

Two relations R and S with a common attribute: on each distinct value of that attribute, match up the tuples which have the same value.

Example:
select R.roomnumber, COUNT(S.homeroom)
from Rooms R join Student S on R.roomnumber = S.homeroom
group by R.roomnumber

For each tuple r in R:
  add hash(roomnumber) to the hashtable in memory
  if the hashtable has filled up memory:
    for every tuple s in S:
      if hash(homeroom) is found in the hashtable:
        add the matching R tuples and tuple s to the output
    reset the hashtable

Finally, scan S one last time and add the resulting join tuples to the output.
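A minimal sketch of this blocked hash-join procedure in Python, assuming the relations are plain Python lists; max_buckets is a made-up stand-in for "the hash table has filled memory", not something from the paper:

# Blocked hash join for: select R.roomnumber, COUNT(S.homeroom)
# from Rooms R join Student S on R.roomnumber = S.homeroom
def blocked_hash_join(rooms, students, max_buckets=1000):
    output = []
    hashtable = {}                        # roomnumber -> list of matching R tuples

    def probe_with_s():
        # Scan S, emitting matches against whatever R tuples are currently in memory.
        for name, homeroom in students:
            for r in hashtable.get(homeroom, []):
                output.append((r, name, homeroom))

    for roomnumber in rooms:              # build phase over R
        hashtable.setdefault(roomnumber, []).append(roomnumber)
        if len(hashtable) >= max_buckets: # "memory is full"
            probe_with_s()                # probe with S, then reset and keep going
            hashtable.clear()

    probe_with_s()                        # final scan of S for the last partial table
    return output

The aggregate cannot be reported until these full scans finish, which is exactly what ripple join avoids.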

What's different about ripple join?

Traditional hash join blocks until the entire query output is finished. Ripple join reports approximate results after each sampling step, and allows user intervention.

In a traditional join's inner loop, an entire table is scanned. Ripple join instead expands the sample set incrementally.

The most important difference

The tuples are processed in random order.

Pipelining

In pipelining join algorithms, as the join progresses, more and more information gets added to the result.

In ripple joins, each new tuple gets joined with all previously-seen tuples of the other operand(s).

The relative rates of the two (or more) operands are dynamically adjusted.

Worst-case scenario

Ripple join reduces to a nested loop join.

The relations do not have to be of roughly equal size.

Aspect ratio: how many tuples are retrieved from each base relation per sampling step.

e.g. β1 = 1, β2 = 3, …

Ripple join adjusts the aspect ratio according to the sizes of the base relations.
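A minimal sketch of a rectangular ripple join step in Python, assuming R and S are lists already in random order and that the aspect ratio (beta1, beta2) is fixed up front rather than adapted on the fly as in the paper; the names are illustrative only:

def ripple_join(R, S, predicate, beta1=1, beta2=3):
    # R, S: tuples in random order (the key requirement of the algorithm).
    # predicate(r, s): the join condition.
    # beta1, beta2: tuples drawn from R and S per sampling step (aspect ratio).
    seen_r, seen_s, output = [], [], []
    i = j = 0
    while i < len(R) or j < len(S):
        new_r, new_s = R[i:i + beta1], S[j:j + beta2]
        i, j = i + beta1, j + beta2
        # New R tuples join every previously seen S tuple ...
        for r in new_r:
            for s in seen_s:
                if predicate(r, s):
                    output.append((r, s))
        seen_r.extend(new_r)
        # ... and new S tuples join all R tuples seen so far (old and new),
        # so each step fills in the next "ripple" of the R x S rectangle.
        for s in new_s:
            for r in seen_r:
                if predicate(r, s):
                    output.append((r, s))
        seen_s.extend(new_s)
        # Here a real implementation would update the running aggregate and
        # confidence interval, and let the user stop or re-tune the query.
    return output

With beta1 = beta2 = 1 this is the square ripple join; skewing the ratio makes the rectangles grow faster along one relation.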

Rectangular version

What can the end user control?

how many groups continue to process: any one group can be stopped, and all other groups will continue to process (faster)

the speed of the query selection process

What happens to make the process faster? More tuples are skipped in the aggregation, so the approximation will be less accurate, and the confidence interval will be wider.

The end user controls the trade-off between speed and accuracy.

GUI, 1999

Confidence interval

A running confidence interval displays how close this answer is to the final result.

This could be calculated in many ways. The authors present an example calculation built on extending the Central Limit Theorem.

Central Limit Theorem

μ̂n is the estimator for the true mean μ: the average of the n values in the sample, and itself a random quantity.

CLT: for large n (e.g. after joining 30 tuples), μ̂n has approximately a normal distribution with mean μ and variance σ²/n.

Random variable Z

Shift and scale μ̂n to get a "standardized" random variable Z:

Z = (μ̂n − μ) / (σ / √n)

Z then has (approximately) a standard normal distribution.

zp is the point such that Z falls within ±zp with probability p; there are a lot of ways to compute the zp values.

"Interval" column on the GUI

The authors use σ̂n as an estimator for the true standard deviation σ:

εn = (zp σ̂n) / √n

This quantity is displayed as the half-width of the running confidence interval.
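A minimal sketch of this running estimate and CLT-based half-width in Python; it treats the values as a single i.i.d. sample, which glosses over the join-specific estimators the paper actually derives, and zp = 1.96 (roughly a 95% interval) is just an illustrative choice:

import math

def running_estimate(values, zp=1.96):
    # values: per-tuple contributions seen so far (requires n >= 2).
    n = len(values)
    mu_hat = sum(values) / n                      # running estimate of mu
    var_hat = sum((v - mu_hat) ** 2 for v in values) / (n - 1)
    sigma_hat = math.sqrt(var_hat)                # estimator of sigma
    epsilon_n = zp * sigma_hat / math.sqrt(n)     # half-width for the "Interval" column
    return mu_hat, epsilon_n

After each sampling step the display would show mu_hat ± epsilon_n, and the interval narrows as n grows.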

Why call this "Ripple Join"?

1. The algorithm seems to ripple out from a corner of the join.

2. Acronym: "Rectangles of Increasing Perimeter Length"

Variants of ripple join

Block ripple join
Index ripple join
Hash ripple join (sketched below)
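A minimal sketch of the hash-variant idea in Python, assuming (as an illustration, not a quote from the paper) that the point is to keep a hash table on each input so every new tuple can be matched against previously seen tuples of the other input without rescanning them:

def hash_ripple_join(R, S, key_r, key_s):
    # R, S: equijoin inputs in random order; key_r/key_s extract the join keys.
    # Aspect ratio fixed at 1:1 here to keep the sketch short.
    seen_r, seen_s, output = {}, {}, []
    for r, s in zip(R, S):
        # New R tuple joins all previously seen S tuples with the same key.
        for match in seen_s.get(key_r(r), []):
            output.append((r, match))
        seen_r.setdefault(key_r(r), []).append(r)
        # New S tuple joins all R tuples seen so far (including r).
        for match in seen_r.get(key_s(s), []):
            output.append((match, s))
        seen_s.setdefault(key_s(s), []).append(s)
    return output

Each new tuple probes a hash table instead of scanning all previously seen tuples, which is what makes the hash variant faster when the tables fit in memory.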

Performance

Further publications

Eddies: Continuously Adaptive Query Processing, by Ron Avnur and Joseph M. Hellerstein, SIGMOD 2000, Dallas.

Confidence Bounds for Sampling-Based GROUP BY Estimates, by Fei Xu, Christopher Jermaine, and Alin Dobra, ACM Trans. Database Syst. 33, 3 (Aug. 2008).

Wavelet synopsis for hierarchical range queries with workloads, by Sudipto Guha, Hyoungmin Park, and Kyuseok Shim, VLDB Journal (2008) 17:1079–1099.

Questions?