TRANSCRIPT
Ripple Joins for Online Aggregation
by Peter J. Haas and Joseph M. Hellerstein, published in June 1999
presented by Ronda Hilton
Overview
This paper shows how to join multiple tables and compute SUM, COUNT, or AVG aggregates with GROUP BY clauses, displaying approximate results immediately, along with confidence intervals, from the first few tuples retrieved, and updating a GUI display with ever-closer approximations as the join processes more tuples.
Ripple joins compared to our previous topics
General research area: algorithms
another approximation algorithm
online processing, not maintaining a sample set
aggregate queries: joins and group-by
requires random retrieval
uses probabilistic calculations to determine the quality of the approximate result
not optimizing; implemented as middleware on the DBMS
Traditional Hash Join
Stores the smaller relation in memory. Given two relations R and S with a common attribute: on each distinct value of that attribute, match up the tuples which have the same value.
Example:
select R.roomnumber, COUNT(S.homeroom)
from Rooms R join Student S on R.roomnumber = S.homeroom
group by R.roomnumber
For each tuple r in R:
    add r to the hash table in memory, keyed on hash(r.roomnumber)
    if the hash table has filled up memory:
        for every tuple s in S:
            if hash(s.homeroom) is found in the hash table:
                add the matching r and s tuples to the output
        reset the hash table
Finally, scan S once more against the remaining hash table entries and add the resulting join tuples to the output.
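The batched loop above can be sketched in Python. This is not the authors' implementation; the Rooms/Student tuples and the tiny MAX_BUCKETS limit are illustrative assumptions standing in for real tables and a real memory budget.

```python
MAX_BUCKETS = 2  # pretend "memory" holds only 2 build entries at a time


def block_hash_join(rooms, students):
    """Join rooms (roomnumber,) with students (name, homeroom)."""
    output = []
    hashtable = {}  # roomnumber -> list of room tuples

    def probe_and_reset():
        # Scan S and emit matches against the current in-memory batch.
        for s in students:
            for r in hashtable.get(s[1], []):
                output.append((r, s))
        hashtable.clear()

    for r in rooms:
        hashtable.setdefault(r[0], []).append(r)
        if len(hashtable) >= MAX_BUCKETS:  # "memory" has filled up
            probe_and_reset()
    if hashtable:  # final scan of S for the leftover batch
        probe_and_reset()
    return output


rooms = [(101,), (102,), (103,)]
students = [("Ana", 101), ("Bo", 101), ("Cy", 103)]
print(block_hash_join(rooms, students))
```

Note that S is rescanned once per batch of R, which is exactly why this algorithm blocks: no output for a batch appears until the full scan of S for that batch completes.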
What's different about ripple join?
Traditional hash join blocks until the entire query output is finished.Ripple join reports approximate results
after each sampling step, and allows user intervention.
In a traditional join, the inner loop scans an entire table. Ripple join instead expands the sample set incrementally.
Pipelining
In pipelining join algorithms, more and more information gets added to the result as the join progresses.
In ripple joins, each new tuple gets joined with all previously-seen tuples of the other operand(s).
The relative rates of the two (or more) operands are dynamically adjusted.
The relations do not have to be of similar size.
Aspect ratio: how many tuples are retrieved from each base relation per sampling step.
e.g. β1 = 1, β2 = 3, …
Ripple join adjusts the aspect ratio according to the sizes of the base relations.
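A minimal sketch of this sampling loop, assuming list-of-tuples inputs in the Rooms/Student schema from the earlier example and a fixed aspect ratio; the real algorithm also adapts β1 and β2 dynamically and maintains confidence intervals, which is omitted here. The running estimate scales the matches seen so far up to the full cross-product size, here for a COUNT of join tuples.

```python
import random


def ripple_join_count(R, S, beta1=1, beta2=3, seed=0):
    """Rectangular ripple join with a running COUNT(*) estimate."""
    rng = random.Random(seed)
    R, S = R[:], S[:]
    rng.shuffle(R)  # random retrieval order is required
    rng.shuffle(S)
    seen_r, seen_s = [], []
    matches = 0
    estimates = []
    i = j = 0
    while i < len(R) or j < len(S):
        # One sampling step: beta1 new tuples from R, beta2 from S.
        new_r = R[i:i + beta1]; i += beta1
        new_s = S[j:j + beta2]; j += beta2
        # Each new tuple joins with everything previously seen on the
        # other side, plus this step's new tuples (the "ripple").
        for r in new_r:
            for s in seen_s + new_s:
                if r[0] == s[1]:  # roomnumber == homeroom
                    matches += 1
        for s in new_s:
            for r in seen_r:
                if r[0] == s[1]:
                    matches += 1
        seen_r += new_r
        seen_s += new_s
        # Scale the sampled rectangle up to the full |R| x |S| space.
        est = matches * len(R) * len(S) / (len(seen_r) * len(seen_s))
        estimates.append(est)
    return matches, estimates


R = [(101,), (102,), (103,)]
S = [("Ana", 101), ("Bo", 101), ("Cy", 103)]
print(ripple_join_count(R, S))
```

Once both inputs are exhausted, the sampled rectangle covers the whole cross product, so the final estimate equals the exact join count.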
What can the end user control?
which groups continue to process: any one group can be stopped, and all other groups will continue to process (faster)
the speed of the query selection process
What happens to make the process faster? More tuples are skipped in the aggregation, so
the approximation will be less accurate, and the confidence interval will be wider.
The end user controls the trade-off between speed and accuracy.
Confidence interval
A running confidence interval displays how close this answer is to the final result.
This could be calculated in many ways. The authors present an example
calculation built on extending the Central Limit Theorem.
Central Limit Theorem
μ̂n is an estimator for the true mean μ: the average of the n values in the sample, and therefore a random quantity.
CLT: for large n (e.g., after joining 30 tuples),
μ̂n has approximately a normal distribution with mean μ and variance σ²/n.
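A quick empirical check of this claim, using an illustrative Uniform(0, 1) population (not from the paper): averages of n values cluster around the true mean with variance close to σ²/n.

```python
import random
import statistics

random.seed(1)
n = 30        # sample size per estimate
trials = 2000 # number of independent sample means

# Each entry is one mu_hat_n: the mean of n Uniform(0, 1) draws.
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(trials)
]

# For Uniform(0, 1), mu = 0.5 and sigma^2 = 1/12, so the variance
# of the sample means should be near (1/12)/30, about 0.00278.
print(statistics.fmean(sample_means))
print(statistics.variance(sample_means))
```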
Random variable Z
Shift and scale μ̂n to get a "standardized" random variable Z:
Z = (μ̂n − μ) / (σ / √n)
Z also has a standard normal distribution.
There are many ways to compute the zp values (quantiles of the standard normal distribution).
"Interval" column on the GUI
The authors use σ̂n as an estimator for the true standard deviation σ:
εn = (zp σ̂n) / √n
This quantity is displayed as the half-width of the running confidence interval.
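The displayed half-width can be computed as sketched below; the sample values and the choice of a 95% interval (zp ≈ 1.96) are illustrative assumptions.

```python
import math


def half_width(samples, z_p=1.96):
    """eps_n = z_p * sigma_hat_n / sqrt(n) for the running interval."""
    n = len(samples)
    mean = sum(samples) / n
    # sigma_hat_n: sample standard deviation, estimating the true sigma.
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return z_p * math.sqrt(var) / math.sqrt(n)


samples = [10, 12, 9, 11, 13, 10, 12, 11]
eps = half_width(samples)
print(eps)  # the GUI would show: running mean +/- eps
```

As more tuples arrive, n grows and εn shrinks, which is exactly the narrowing "Interval" column the user watches on the GUI.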
Why call this "Ripple Join"?
1. The algorithm seems to ripple out from a corner of the join.
2. Acronym: "Rectangles of Increasing Perimeter Length"
Further publications
Eddies: Continuously Adaptive Query Processing, by Ron Avnur and Joseph M. Hellerstein, SIGMOD 2000, Dallas.
Confidence Bounds for Sampling-Based GROUP BY Estimates, by Fei Xu, Christopher Jermaine, and Alin Dobra, ACM Trans. Datab. Syst. 33, 3 (Aug. 2008).
Wavelet synopsis for hierarchical range queries with workloads, by Sudipto Guha, Hyoungmin Park, and Kyuseok Shim, VLDB Journal (2008) 17:1079–1099.