approximate computation of multidimensional aggregates of sparse data using wavelets based on the...
Post on 20-Dec-2015
![Page 1: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/1.jpg)
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
Based on the work of Jeffrey Scott Vitter and Min Wang
![Page 2: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/2.jpg)
Guidelines
• Overview
• Preliminaries
• The New Approach
• The Construction of the Algorithm
• Experiments and Results
• Summary
![Page 3: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/3.jpg)
The Problem
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many On-Line Analytical Processing (OLAP) applications.
Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment.
Obviously, it is advantageous to have fast, approximate answers to OLAP aggregation queries.
![Page 4: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/4.jpg)
Processing Methods
There are two classes of methods for processing OLAP queries:
• Exact methods: focus on how to compute the exact data cube.
• Approximate methods: becoming attractive in OLAP applications. They have been used in DBMS for a long time.
In choosing a proper approximation technique, there are two concerns:
• Efficiency
• Accuracy
![Page 5: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/5.jpg)
Histograms and Sampling Methods
Advantages:
• Simple and natural
• Construction procedure is very efficient
Disadvantages:
• Inefficient to construct in high dimensions
• Cannot fit in internal memory
Histograms and sampling are used in a variety of important applications where quick approximations of an array of values are needed.
The use of wavelet-based techniques to construct analogs of histograms in databases has shown substantial improvements in accuracy over random sampling and other histogram-based approaches.
![Page 6: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/6.jpg)
The Intended Solution
• Traditional histograms: infeasible for massive high-dimensional data sets.
• Previously developed wavelet techniques: efficient only for dense data.
• Previous approximation techniques: results not accurate enough for typical queries.
The proposed method provides approximate answers to high-dimensional OLAP aggregation queries in MASSIVE SPARSE DATA SETS in a time-efficient and space-efficient manner.
![Page 7: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/7.jpg)
The Compact Data Cube
The performance of this method depends on the compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon multiresolution wavelet decomposition.
In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
![Page 8: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/8.jpg)
The Data Set
A particular characteristic of the data sets is that they are MASSIVE AND SPARSE.
• $D = \{D_1, D_2, \ldots, D_d\}$ denotes the set of dimensions.
• $S$ is a d-dimensional array which represents the underlying data.
• $N = \prod_{i=1}^{d} |D_i|$ denotes the total size of array $S$, where $|D_i|$ is the size of dimension $D_i$.
• $S(i_1, i_2, i_3, \ldots, i_d)$ contains the value of the measure attribute for the corresponding combination of the functional attributes.
• $N_z$ is defined to be the number of populated entries in $S$.
• $\mathrm{density}(S) = N_z / N$
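As a rough illustration of these definitions, the sketch below computes the total size N and the density of a sparse array. Storing the populated entries in a dictionary keyed by index tuples, and the names `total_size` and `density`, are assumptions made for this example and do not come from the slides.

```python
from math import prod


def total_size(dim_sizes):
    """N: the product of the dimension sizes |D_1| * ... * |D_d|."""
    return prod(dim_sizes)


def density(dim_sizes, populated):
    """density(S) = N_z / N, where N_z is the number of populated entries."""
    return len(populated) / total_size(dim_sizes)


# Hypothetical 3-dimensional array with sizes 4 x 5 x 10 and two populated cells.
cells = {(0, 1, 2): 30.0, (3, 4, 9): 12.5}
print(total_size((4, 5, 10)))      # 200
print(density((4, 5, 10), cells))  # 0.01
```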
![Page 9: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/9.jpg)
Range Sum Queries
An important class of aggregation queries are the so-called range sum queries, which are defined by applying the sum operation over a selected continuous range in the domain of some of the attributes.
A range sum query can generally be formulated as follows:
$$\mathrm{sum}(l_1:h_1, \ldots, l_d:h_d) = \sum_{i_1=l_1}^{h_1} \cdots \sum_{i_d=l_d}^{h_d} S(i_1, \ldots, i_d)$$
![Page 10: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/10.jpg)
The d’-Dimensional Range Sum Queries
An interesting subset of the general range sum queries are d’-dimensional range sum queries in which d’<<d.
In this case ranges are specified for only d’ dimensions, and the ranges for the other d-d’ dimensions are implicitly set to be the entire domain:
$$\mathrm{all}_i = \{0, \ldots, |D_i| - 1\}$$
$$\mathrm{sum}(l_1:h_1, \ldots, l_{d'}:h_{d'}) = \mathrm{sum}(l_1:h_1, \ldots, l_{d'}:h_{d'}, \mathrm{all}_{d'+1}, \ldots, \mathrm{all}_d)$$
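To make the query semantics concrete, here is a minimal exact (non-approximate) sketch of a d’-dimensional range sum over a sparse array. It assumes the populated cells are stored in a dictionary keyed by index tuples and that ranges are given for the first d’ dimensions only; the function name `range_sum` is hypothetical.

```python
def range_sum(cells, ranges):
    """Exact range sum over a sparse array.

    cells  : dict mapping index tuples (i_1, ..., i_d) to measure values
    ranges : list of inclusive (l, h) pairs for the first d' dimensions;
             the remaining d - d' dimensions are implicitly unrestricted
    """
    total = 0
    for index, value in cells.items():
        # zip stops after d' components, so later dimensions act as "all".
        if all(l <= i <= h for i, (l, h) in zip(index, ranges)):
            total += value
    return total


cells = {(0, 0, 0): 5, (1, 2, 3): 7, (3, 1, 0): 2}
print(range_sum(cells, [(0, 1)]))          # 12
print(range_sum(cells, [(0, 3), (1, 2)]))  # 9
```

This brute-force scan touches every populated cell, which is the cost the compact data cube is meant to avoid.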
![Page 11: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/11.jpg)
Traditional vs. New Approach
In traditional approaches to answering range sum queries using a data cube, all the subcubes of the data cube need to be precomputed and stored. When a query is given, a search is conducted in the data cube and the relevant information is fetched.
In the new approach, as usual, some preprocessing work is done on the original arrays, but instead of computing and storing all the subcubes, only one, much smaller compact data cube is stored. The compact data cube usually fits in one or a small number of disk blocks.
![Page 12: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/12.jpg)
Approximation Advantages
This approach is preferable to the traditional approaches in two important respects:
• Storage space: the traditional approaches need more storage space for both the precomputation and the storage of the precomputed data cube.
• Response time: even when a huge amount of storage space is available and all the data cubes can be stored comfortably, it may take too long to answer a range sum query, since all cells covered by the range need to be accessed.
![Page 13: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/13.jpg)
I/O Model
The conventional parallel disk model
Restriction: I=1
![Page 14: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/14.jpg)
The Method Outline
The method can be divided into three sequential phases:
1. Decomposition
2. Thresholding
3. Reconstruction
![Page 15: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/15.jpg)
Decomposition
• Compute the wavelet decomposition of the multidimensional array S.
• Obtain a set of C’ wavelet coefficients (C’ ~ $N_z$).
As in practice, it is assumed that the array is very sparse.
![Page 16: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/16.jpg)
Thresholding and Ranking
• Keep only C ($C \ll C'$) wavelet coefficients, corresponding to the desired storage usage and accuracy.
• Rank only the C wavelet coefficients according to their importance in the context of accurately answering typical aggregation queries.
• The C ordered coefficients compose the compact data cube.
![Page 17: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/17.jpg)
Reconstruction
• In the on-line phase, an aggregation query is processed by using the K most significant coefficients to reconstruct an approximate answer.
• The choice of K depends upon the time the user is willing to spend.
Notes:
• More accurate answers can be provided upon request.
• The efficiency is crucial, since it affects the query response time directly.
![Page 18: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/18.jpg)
Wavelet Decomposition
Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space-efficient manner.
HAAR Wavelets:
• Conceptually very simple wavelet basis functions
• Fast to compute
• Easy to implement
![Page 19: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/19.jpg)
HAAR Wavelet - Example
Suppose we have a one-dimensional signal of N=8 data items:
S = [2,2,0,2,3,5,4,4]
Averaging the values pairwise gives [2,1,4,4]; the detail coefficients, half the difference within each pair, are [0,-1,-1,0].
By repeating this process recursively on the averages, we get the full decomposition, which is the wavelet transform.
![Page 20: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/20.jpg)
Wavelet Transform
The wavelet transform is a single coefficient representing the overall average of the original signal, followed by the detail coefficients in the order of increasing resolution:
$$\hat{S} = \left[\,2\tfrac{3}{4},\ -1\tfrac{1}{4},\ \tfrac{1}{2},\ 0,\ 0,\ -1,\ -1,\ 0\,\right]$$
• The individual entries are called the wavelet coefficients.
• Coefficients at the lower resolutions are weighted more than the ones at the higher resolutions.
The decomposition is very efficient:
• O(N) CPU time
• O(N/B) I/Os
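The one-dimensional example can be reproduced with a short sketch of the unnormalized Haar decomposition: pairwise averages and half-differences, repeated on the averages. The function name `haar_decompose` is an assumption for illustration, and the sketch requires the signal length to be a power of two.

```python
def haar_decompose(signal):
    """Unnormalized 1-D Haar transform.

    Returns [overall average, detail coefficients in order of
    increasing resolution]. The length must be a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Details of a coarser level go in front of the finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Each level halves the amount of remaining work, so the total cost is linear in the signal length.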
![Page 21: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/21.jpg)
Building The Compact Data Cube
The goal of this step is to compute the wavelet decomposition of the multidimensional array S, obtaining a set of C’ wavelet coefficients.
1. Partition the d dimensions into g groups, for some $1 \le g \le d$:
$$G_j = \{D_{i_{j-1}+1}, D_{i_{j-1}+2}, \ldots, D_{i_j}\}, \quad \text{where } i_0 = 0,\ i_g = d$$
Each $G_j$ must satisfy
$$|D_{i_{j-1}+1}| \times |D_{i_{j-1}+2}| \times \cdots \times |D_{i_j}| \le \frac{M}{2B}$$
2. The algorithm for constructing the compact data cube consists of g passes. In pass j:
• $G_j$ is read into memory
• the multidimensional decomposition is performed
• the results are written out to be used for the next pass
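One simple way to form such groups is a greedy scan that starts a new group whenever adding the next dimension would push the product of sizes over a memory-derived bound. This is only a sketch under assumptions: the greedy strategy, the name `partition_dimensions`, and the `limit` parameter (standing in for the bound the groups must satisfy) are illustrative and are not taken from the slides.

```python
def partition_dimensions(dim_sizes, limit):
    """Greedily split consecutive dimensions into groups.

    dim_sizes : list of |D_1|, ..., |D_d|
    limit     : assumed upper bound on the product of sizes within a group
    Returns a list of groups, each a list of 0-based dimension indices.
    """
    groups, current, product = [], [], 1
    for j, size in enumerate(dim_sizes):
        if size > limit:
            raise ValueError(f"dimension {j} alone exceeds the limit")
        if current and product * size > limit:
            # Close the current group and start a new one.
            groups.append(current)
            current, product = [], 1
        current.append(j)
        product *= size
    if current:
        groups.append(current)
    return groups


print(partition_dimensions([4, 4, 8, 2, 16], limit=64))
# [[0, 1], [2, 3], [4]]
```

The number of groups returned corresponds to g, the number of passes over the data.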
![Page 22: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/22.jpg)
Eliminating Intermediate Results
One problem is that the density of the intermediate results will increase from pass to pass, since performing wavelet decomposition on sparse data usually results in more nonzero data.
The natural solution is truncation: keep roughly only $N_z$ entries.
Learning process:
• During each pass, on-line statistics of the wavelet coefficients are kept to maintain a cutoff value.
• Any entry whose absolute value is below the cutoff value is thrown away on the fly.
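The slides do not say which statistics are kept, so the following is only one possible sketch of on-the-fly truncation. It assumes the cutoff is the smallest absolute value among the `keep` largest coefficients seen so far, maintained with a min-heap; the function name `truncate_on_the_fly` and this choice of statistic are assumptions.

```python
import heapq


def truncate_on_the_fly(coefficients, keep):
    """Stream over coefficients and retain roughly the `keep` largest.

    The running cutoff is the smallest absolute value currently retained;
    any entry not exceeding it is discarded immediately.
    Returns a dict mapping position -> retained coefficient.
    """
    heap = []  # min-heap of (absolute value, position, value)
    for position, value in enumerate(coefficients):
        if value == 0:
            continue
        item = (abs(value), position, value)
        if len(heap) < keep:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # Evict the current smallest entry and raise the cutoff.
            heapq.heapreplace(heap, item)
    return {position: value for _, position, value in heap}


print(truncate_on_the_fly([0.5, -3, 0, 2, -0.1, 4], keep=3))
```

With `keep` set near the number of populated entries, the intermediate result stays about as sparse as the input.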
![Page 23: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/23.jpg)
Thresholding and Ranking
Given the storage limitation for the compact data cube, it is possible to keep only a limited number of wavelet coefficients. Let:
• C’ be the number of wavelet coefficients.
• C be the number of wavelet coefficients that can be stored.
Since C<<C’, the goal is to determine which are the best C coefficients to keep, so as to minimize the error of approximation.
![Page 24: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/24.jpg)
P-norm
Once the error rate is decided for individual queries, it is meaningful to choose a norm by which to measure the error of a collection of queries.
Let $e = (e_1, e_2, \ldots, e_Q)$ be the vector of errors over a sequence of Q queries.
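The norm's formula is not preserved in this transcript. As a hedged illustration, the sketch below uses the textbook p-norm of the error vector; the original work may define it differently, for example with a 1/Q normalization, so treat the exact form and the name `p_norm` as assumptions.

```python
def p_norm(errors, p):
    """Textbook p-norm of an error vector: (sum |e_i|^p)^(1/p)."""
    return sum(abs(e) ** p for e in errors) ** (1 / p)


print(p_norm([3, 4], 2))      # 5.0
print(p_norm([1, -2, 3], 1))  # 6.0
```

Larger values of p put more emphasis on the worst individual query errors.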
![Page 25: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/25.jpg)
Choosing the Coefficients
Choosing the C largest (in absolute value) wavelet coefficients after normalization is provably optimal in minimizing the 2-norm.
But if a coefficient c is more likely to contribute to a query than another one, its weight w(c) will be greater, where:
$$w(c) = \sum_{j=1}^{k} [\,i_j = 0\,]$$
Here $i_j$ is the index of coefficient c along dimension j, and $[\cdot]$ equals 1 when the condition holds and 0 otherwise.
Finally:
1. Pick the C’’ (C<C’’<C’) largest wavelet coefficients.
2. Among the C’’ coefficients, choose the C with the largest weight.
3. Order the C coefficients in decreasing order to get the compact data cube.
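The three selection steps can be sketched as follows. Several details are assumptions made for this example: coefficients are held in a dictionary keyed by index tuples, the weight of a coefficient is taken to be the number of zero components in its index, and the final ordering is by decreasing absolute value. The names `weight` and `build_compact_cube` are hypothetical.

```python
def weight(index):
    """Assumed weight: how many components of the index are zero."""
    return sum(1 for i in index if i == 0)


def build_compact_cube(coefficients, c, c2):
    """Select c coefficients out of the c2 largest (c < c2).

    coefficients : dict mapping index tuples to coefficient values
    Returns a list of (index, value) pairs ordered by decreasing magnitude.
    """
    by_magnitude = sorted(coefficients.items(),
                          key=lambda kv: abs(kv[1]), reverse=True)
    largest = by_magnitude[:c2]                                   # step 1
    heaviest = sorted(largest, key=lambda kv: weight(kv[0]),
                      reverse=True)[:c]                           # step 2
    return sorted(heaviest, key=lambda kv: abs(kv[1]), reverse=True)  # step 3


coeffs = {(0, 0): 10, (0, 3): -6, (2, 0): 5, (1, 1): 8, (3, 2): 1}
print(build_compact_cube(coeffs, c=3, c2=4))
# [((0, 0), 10), ((0, 3), -6), ((2, 0), 5)]
```

Because Python's sort is stable, coefficients with equal weight keep their magnitude order, so ties are broken in favor of the larger coefficient.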
![Page 26: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/26.jpg)
Answering On-Line Queries
The error tree is built based upon the wavelet transform procedure:
• It mirrors the wavelet transform.
• It is a bottom-up process.
S(l:h) denotes the range sum between S(l) and S(h):
$$S(l:h) = \sum_{i=l}^{h} S(i)$$
![Page 27: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/27.jpg)
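Constructing The Original Signal
The original signal S can be constructed from the tree nodes by the following formulas:
$$\begin{aligned}
S(0) &= \hat{S}(0) + \hat{S}(1) + \hat{S}(2) + \hat{S}(4) \\
S(1) &= \hat{S}(0) + \hat{S}(1) + \hat{S}(2) - \hat{S}(4) \\
S(2) &= \hat{S}(0) + \hat{S}(1) - \hat{S}(2) + \hat{S}(5) \\
S(3) &= \hat{S}(0) + \hat{S}(1) - \hat{S}(2) - \hat{S}(5) \\
S(4) &= \hat{S}(0) - \hat{S}(1) + \hat{S}(3) + \hat{S}(6) \\
S(5) &= \hat{S}(0) - \hat{S}(1) + \hat{S}(3) - \hat{S}(6)
\end{aligned}$$
Not all terms need to be evaluated: only the true contributors are evaluated when answering a query, which keeps the evaluation fast.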
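A point of the signal can be recovered by walking the error tree from the root and adding or subtracting one detail coefficient per level. The sketch below assumes the unnormalized coefficient layout of the 8-item example (overall average first, then details by increasing resolution) and the sign convention that a detail coefficient is added for the left half of its range and subtracted for the right half; the name `reconstruct_point` is hypothetical.

```python
def reconstruct_point(coeffs, i):
    """Recover S(i) from unnormalized Haar coefficients.

    coeffs : [average, details in order of increasing resolution];
             the length must be a power of two
    """
    value = coeffs[0]
    index, start, length = 1, 0, len(coeffs)
    while length > 1:
        half = length // 2
        if i < start + half:
            value += coeffs[index]   # i lies in the left half
            index = 2 * index
        else:
            value -= coeffs[index]   # i lies in the right half
            start += half
            index = 2 * index + 1
        length = half
    return value


s_hat = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print([reconstruct_point(s_hat, i) for i in range(8)])
# [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Only one coefficient per tree level is touched, so a single value costs a logarithmic number of additions.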
![Page 28: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/28.jpg)
Answering A Query
To answer a query of the form $\mathrm{sum}(l_1:h_1, \ldots, l_{d'}:h_{d'})$ using k coefficients of the compact data cube R, the following algorithm is used:
AnswerQuery(R, k, l1, h1, …, ld’, hd’)
    answer = 0;
    for i = 1, 2, …, k do
        if Contribute(R[i], l1, h1, …, ld’, hd’)
            answer = answer + Compute_Contribute(R[i], l1, h1, …, ld’, hd’)
    for j = d’+1, …, d do
        answer = answer x |Dj|
    return answer;
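For intuition, here is a one-dimensional analogue of this procedure. It is a sketch under assumptions, not the multidimensional algorithm itself: the compact cube is taken to be an ordered list of (index, value) pairs of unnormalized Haar coefficients, and the helper names `overlap`, `contribution` and `answer_query` are hypothetical.

```python
def overlap(lo, hi, a, b):
    """Number of integers in the inclusive range [lo, hi] that fall in [a, b)."""
    return max(0, min(hi + 1, b) - max(lo, a))


def contribution(index, value, lo, hi, n):
    """Contribution of one coefficient to the range sum S(lo:hi)."""
    if index == 0:
        return value * (hi - lo + 1)         # the overall average
    level = index.bit_length() - 1
    length = n >> level                      # size of the coefficient's support
    start = (index - (1 << level)) * length
    mid = start + length // 2
    # Added over the left half of the support, subtracted over the right.
    return value * (overlap(lo, hi, start, mid)
                    - overlap(lo, hi, mid, start + length))


def answer_query(cube, k, lo, hi, n):
    """Approximate S(lo:hi) from the first k coefficients of the cube."""
    answer = 0
    for index, value in cube[:k]:
        answer += contribution(index, value, lo, hi, n)
    return answer


cube = list(enumerate([2.75, -1.25, 0.5, 0, 0, -1, -1, 0]))
print(answer_query(cube, 8, 2, 5, 8))  # 10.0, the exact sum 0 + 2 + 3 + 5
print(answer_query(cube, 1, 2, 5, 8))  # 11.0, using the average only
```

Raising k refines the answer progressively, which matches the trade-off between response time and accuracy described for the on-line phase.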
![Page 29: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/29.jpg)
Experiments Description
The experiments were performed using real-world data from the U.S. Census Bureau.
• The data file contains 372 attributes. The measure attribute is income. Functional attributes include, among others: age, sex, education, race, origin.
• Although the dimension sizes are generally small, the high dimensionality results in a 10-dimensional array with more than 16,000,000 cells, density~0.001, Nz=15,985.
• Platform:
  Digital Alpha workstation running Digital UNIX 4.0
  512 MB internal memory (only 1-10 MB are used for the program)
  logical block transfer size 2*4 KB
![Page 30: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/30.jpg)
Experiments Sets - variable density
• Dimension groups were partitioned to satisfy the M/2B condition.
• For all data sets, g=2.
• The small differences in running time were mainly caused by the on-line cutoff effect.
Parameters: d = 10, M = 8 MB.
![Page 31: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/31.jpg)
Experiments Sets - fixed density
Running time scales almost linearly with respect to the input data size.
Parameters: d = 10, M = 8 MB, density = 0.001, $N_z$ from $10^6$ to $16 \times 10^6$, |S| from 4.44 MB to 70 MB.
![Page 32: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/32.jpg)
Accuracy of the Approximate Answers
Comparison with traditional histograms is not meaningful, because they are too inefficient to construct for high-dimensional data.
Comparison with random sampling algorithms depends on the distribution of the nonzero entries (random sampling performs better for a uniform distribution).
![Page 33: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang](https://reader038.vdocuments.mx/reader038/viewer/2022110207/56649d4c5503460f94a2a35a/html5/thumbnails/33.jpg)
Summary
A new wavelet technique for approximate answers to OLAP range sum queries was presented.
Four important issues were discussed and resolved:
• I/O efficiency of the data cube construction, especially when the underlying multidimensional array is very sparse.
• Response time in answering an on-line query.
• Accuracy in answering typical OLAP queries.
• Progressive refinement.