Post on 31-Dec-2015

APPROXIMATE QUERY PROCESSING USING WAVELETS

Kaushik Chakrabarti(Univ Of Illinois)

Minos Garofalakis(Bell Labs)

Rajeev Rastogi(Bell Labs)

Kyuseok Shim(KAIST and AITrc)

Presented By:

Charanmai Koorapati Ramesh

Harika Guniganti

AGENDA

• Introduction
• Motivation
• Prior Work
• Wavelet Decomposition
• Building Wavelet Synopses
• Processing Relational Queries
• Experimental Study
• Quality Metrics
• Query Execution Times
• Conclusion

DECISION SUPPORT SYSTEMS

Comparative sales figures between one week and the next

Projected revenue figures based on new product sales assumptions

The consequences of different decision alternatives, given past experience in a context that is described

MOTIVATION

DSS users pose very complex queries to the underlying DBMS that require complex operations over Gigabytes or Terabytes of disk-resident data.

[Figure: an SQL query posed to a decision support system returns an exact answer, but with long response times.]

Exact answers NOT always required. User may prefer a fast, approximate answer.

[Figure: the SQL query run against the decision support system (GB/TB of data) yields an exact answer with long response times; a "transformed" query run against compact data synopses (KB/MB) yields a fast approximate answer.]

APPROXIMATE QUERY PROCESSING

Viable solution for dealing with:
• Huge amounts of data
• High query complexities
• Increasingly stringent response-time requirements

PRIOR WORK

Sampling-Based Techniques

Limitations:
• A join operator applied to two uniform samples yields a non-uniform sample of the join result
• Non-aggregate queries

Histogram-Based Techniques

Limitations:
• Storage overhead
• Construction cost

Both grow quickly when trying to achieve reasonable error rates for high-dimensional data sets.

WAVELET BASED TECHNIQUES

Wavelet: a mathematical function used to divide a given function or continuous-time signal into different frequency components and study each component with a resolution that matches its scale.

This paper extends the scope of earlier work, establishing the viability and effectiveness of wavelets as a generic approximate query processing tool for modern high-dimensional DSS applications.

APPROXIMATE QUERY PROCESSING USING WAVELETS

Novel approach consisting of two steps:
• Multidimensional Haar wavelets: effective, compact synopses
• Novel query processing algorithms: fast and accurate approximate query answers

WAVELET DECOMPOSITION/TRANSFORM

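One-dimensional Haar Wavelets

Data vector A = [2, 2, 5, 7]

Wavelet transform WA = [4, -2, 0, -1]

Resolution   Averages       Detail Coefficients
2            [2, 2, 5, 7]   -
1            [2, 6]         [0, -1]
0            [4]            [-2]

The wavelet coefficients are the overall average [4] followed by the detail coefficients in order of increasing resolution.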
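The pairwise averaging and differencing in the table above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the function name `haar_transform` is hypothetical and the input length is assumed to be a power of two.

```python
def haar_transform(data):
    """Unnormalized 1-D Haar transform by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]   # half the pairwise difference
        averages = [(a + b) / 2 for a, b in pairs]        # pairwise average
        details = level_details + details                 # coarser levels go in front
    return averages + details


print(haar_transform([2, 2, 5, 7]))  # [4.0, -2.0, 0.0, -1.0]
```

The printed vector matches WA from the slide: the average 4, then the detail -2 at resolution 0, then the details 0 and -1 at resolution 1.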

NORMALIZED WAVELET TRANSFORM

To equalize the importance of all the wavelet coefficients, we normalize the final entries of WA by dividing each wavelet coefficient by √(2^l), where l is the level of resolution at which the coefficient appears. Here 4 and -2 are at level 0, while 0 and -1 are at level 1.

Thus WA = [4, -2, 0, -1/√2]
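As a small sketch of the normalization step, the following Python function divides each coefficient by √(2^l). It assumes the coefficient layout used on the slide (overall average first, then details from coarsest to finest), so the level of the coefficient at index i ≥ 1 is floor(log2(i)); the function name `normalize` is hypothetical.

```python
import math


def normalize(coeffs):
    """Divide each Haar coefficient by sqrt(2**l), l being its resolution level.

    Assumed layout: index 0 is the overall average (level 0); index i >= 1
    is a detail coefficient at level floor(log2(i)).
    """
    normalized = []
    for i, c in enumerate(coeffs):
        level = 0 if i == 0 else i.bit_length() - 1
        normalized.append(c / math.sqrt(2 ** level))
    return normalized


wa = normalize([4, -2, 0, -1])
print(wa)  # the last entry equals -1/sqrt(2)
```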

MULTIDIMENSIONAL HAAR WAVELETS

Standard Decomposition

First, fix an ordering for the data dimensions (say 1, 2, …, d), then apply the complete one-dimensional wavelet transform to each one-dimensional "row" of array cells along dimension k, for all k = 1, 2, …, d.

Non-standard Decomposition

Given an ordering for the data dimensions (1, 2, …, d), perform one step of pairwise averaging and differencing for each one-dimensional row of array cells along dimension k, for each k = 1, …, d. This process is then repeated recursively only on the quadrant containing the averages across all dimensions.
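The non-standard decomposition can be sketched for the two-dimensional case as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the array is assumed square with a power-of-two side, the coefficients are left unnormalized, and the names `step_1d` and `nonstandard_2d` are hypothetical.

```python
def step_1d(row):
    """One step of pairwise averaging and differencing: averages first, then details."""
    pairs = list(zip(row[0::2], row[1::2]))
    return [(a + b) / 2 for a, b in pairs] + [(a - b) / 2 for a, b in pairs]


def nonstandard_2d(array):
    """Non-standard Haar decomposition of a square 2^m x 2^m array (list of lists)."""
    a = [[float(x) for x in row] for row in array]
    n = len(a)
    while n > 1:
        # One averaging/differencing step along dimension 1 (rows of the active block).
        for i in range(n):
            a[i][:n] = step_1d(a[i][:n])
        # One averaging/differencing step along dimension 2 (columns of the active block).
        for j in range(n):
            column = step_1d([a[i][j] for i in range(n)])
            for i in range(n):
                a[i][j] = column[i]
        # Recurse only on the quadrant that holds averages across both dimensions.
        n //= 2
    return a


print(nonstandard_2d([[1, 3], [5, 7]]))  # [[4.0, -1.0], [-2.0, 0.0]]
```

The standard decomposition differs only in that the complete one-dimensional transform is run along each dimension in turn, rather than a single step per dimension per level.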

NON-STANDARD DECOMPOSITION

EXAMPLE DECOMPOSITION OF A 4×4 ARRAY

MULTIDIMENSIONAL HAAR COEFFICIENTS- SEMANTICS AND REPRESENTATION

SUPPORT REGIONS AND SIGNS FOR 16 NONSTANDARD 2-DIMENSIONAL HAAR BASIS FUNCTIONS

A Haar wavelet coefficient can be represented by the triple W = <R, S, v>, where:

1) W.R is the d-dimensional support hyper-rectangle of W. Along each dimension j, 1 <= j <= d:

• Low boundary value: W.R.bound[j].lo
• High boundary value: W.R.bound[j].hi

Coefficient W contributes to each data cell A[i1, …, id] satisfying W.R.bound[j].lo <= ij <= W.R.bound[j].hi for all dimensions j, 1 <= j <= d.

2) W.S stores the sign information for all d-dimensional quadrants of W.R. The two elements of the sign vector of coefficient W along dimension j are denoted W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper half of W.R's extent along dimension j. The sign for a given quadrant is computed as the product of the d sign-vector entries that map to that quadrant.

3) W.v is the (scalar) magnitude of coefficient W. This is exactly the quantity that W contributes, positively or negatively according to W.S, to all data array cells enclosed in W.R.
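The triple W = <R, S, v> maps naturally onto a small record type. The sketch below is one possible encoding, not the paper's data structure: the class name `Coefficient`, its field names, and the convention that the lower half of a support range [lo, hi] ends at (lo + hi) // 2 are all assumptions made for illustration. It rebuilds the one-dimensional vector A = [2, 2, 5, 7] by summing the signed contributions of its four coefficients.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Coefficient:
    bounds: List[Tuple[int, int]]  # W.R.bound[j] as (lo, hi), inclusive
    signs: List[Tuple[int, int]]   # W.S.sign[j] as (lo-half sign, hi-half sign), +1 or -1
    v: float                       # W.v

    def contribution(self, cell):
        """Signed value this coefficient adds to the given cell (0 outside W.R)."""
        sign = 1
        for (lo, hi), (s_lo, s_hi), i in zip(self.bounds, self.signs, cell):
            if not lo <= i <= hi:
                return 0.0
            mid = (lo + hi) // 2           # assumed split between lower and upper half
            sign *= s_lo if i <= mid else s_hi
        return sign * self.v


# Coefficients of A = [2, 2, 5, 7] (unnormalized transform [4, -2, 0, -1]).
coefficients = [
    Coefficient([(0, 3)], [(1, 1)], 4.0),    # overall average
    Coefficient([(0, 3)], [(1, -1)], -2.0),  # detail at resolution 0
    Coefficient([(0, 1)], [(1, -1)], 0.0),   # detail at resolution 1, left pair
    Coefficient([(2, 3)], [(1, -1)], -1.0),  # detail at resolution 1, right pair
]

reconstructed = [sum(w.contribution((i,)) for w in coefficients) for i in range(4)]
print(reconstructed)  # [2.0, 2.0, 5.0, 7.0]
```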

BUILDING WAVELET-COEFFICIENT SYNOPSES

Relation (ROLAP) Representation

Attr1   Attr2   Count
2       0       4
1       1       6
3       1       3

[Figure: joint data distribution array over Attr1 (0 to 3) and Attr2 (0 to 3); its nonzero cells hold the counts 4, 6 and 3.]

Capturing d-dimensional array AR (joint frequency distribution) from relational table R (“set of tuples” ROLAP)
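For the two-attribute example relation, building the joint frequency array amounts to placing each tuple's count in the cell addressed by its attribute values. The sketch below is illustrative only: the function name `joint_distribution_2d` is hypothetical, and the domain size of 4 per attribute is read off the 0 to 3 axis labels in the figure.

```python
def joint_distribution_2d(rows, size1, size2):
    """Build the 2-D joint frequency array from (attr1, attr2, count) rows."""
    array = [[0] * size2 for _ in range(size1)]
    for attr1, attr2, count in rows:
        array[attr1][attr2] += count
    return array


relation = [(2, 0, 4), (1, 1, 6), (3, 1, 3)]
array = joint_distribution_2d(relation, 4, 4)
for row in array:
    print(row)
```

Only 3 of the 16 cells are nonzero, which is typical of the sparse arrays that the wavelet decomposition is then applied to.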

What is the size of the wavelet-coefficient synopsis?

PROCESSING RELATIONAL QUERIES IN WAVELET-COEFFICIENT DOMAIN

[Figure: wavelet synopses can either be rendered into approximate relations and queried in the relation domain (SLOW), or queried directly in the wavelet domain (compressed domain, FAST) to give query results in the wavelet domain, which are then rendered into the final approximate results.]

• Reduce relations into compact wavelet-coefficient synopses

WAVELET QUERY PROCESSING

[Figure: operator tree of select, project and join, with a set of coefficients passed along each edge and a final render step.]

Each operator (e.g., select, project, join, aggregates):
• input: set of wavelet coefficients
• output: set of wavelet coefficients

Finally, the rendering step:
• input: set of wavelet coefficients
• output: (multi)set of tuples
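To make the rendering step concrete, here is a deliberately simple one-dimensional sketch: it inverts the unnormalized Haar transform to recover the counts and then emits each attribute value as many times as its count. This is not the paper's render algorithm. The function names are hypothetical, and rounding counts and clipping negative ones to zero is an assumption made here to cope with approximate coefficients.

```python
def inverse_haar(coeffs):
    """Invert an unnormalized 1-D Haar transform laid out as [average, details...]."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # Each (average, detail) pair expands to (average + detail, average - detail).
        values = [x for avg, d in zip(values, details) for x in (avg + d, avg - d)]
        pos += len(details)
    return values


def render(coeffs):
    """Expand coefficients into a multiset (list) of attribute values."""
    counts = inverse_haar(coeffs)
    return [value
            for value, count in enumerate(counts)
            for _ in range(max(0, round(count)))]  # assumed: round, clip negatives


print(inverse_haar([4, -2, 0, -1]))  # [2, 2, 5, 7]
print(render([4, -2, 0, -1]))
```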

QUICK REVIEW OF NOTATIONS

SELECTION OPERATOR (SELECT)

SELECTION -- RELATIONAL DOMAIN

In relational domain, interested in only those cells inside query range

In wavelet domain, interested in only the coefficients that contribute to those cells

Relation

Dim D1 (Attr1)   Dim D2 (Attr2)   Count
0                6                6
1                2                3
1                3                4
1                5                6
1                6                8
2                6                7
3                0                1
4                2                3
5                2                2
6                1                3
6                2                2
6                5                1
6                6                3

[Figure: joint data distribution array of the relation over dimensions D1 and D2, with the query range marked.]

APPROXIMATE QUERY EXECUTION ENGINE PROCESS FOR SELECT

SELECTION -- WAVELET DOMAIN

[Figure: sign quadrants (+/-) of wavelet coefficients over dimensions D1 and D2, overlaid with the query range.]
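In the wavelet domain, selection keeps only the coefficients that contribute to cells inside the query range, which means coefficients whose support hyper-rectangle overlaps the range. The sketch below shows only that overlap filter; any further adjustment of the retained coefficients is outside its scope. The dictionary layout, the function names, and the sample coefficients and query range are all hypothetical.

```python
def overlaps(bounds, query):
    """True if the support rectangle intersects the query range in every dimension."""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(bounds, query))


def select(coefficients, query):
    """Keep the coefficients that can contribute to some cell inside the query range."""
    return [c for c in coefficients if overlaps(c["bounds"], query)]


# Hypothetical coefficients over an 8 x 8 array; bounds are inclusive (lo, hi) per dimension.
coefficients = [
    {"bounds": [(0, 7), (0, 7)], "v": 2.5},
    {"bounds": [(0, 3), (0, 3)], "v": -1.0},
    {"bounds": [(4, 7), (0, 3)], "v": 0.5},
    {"bounds": [(6, 7), (6, 7)], "v": 3.0},
]
query_range = [(4, 6), (1, 3)]  # 4 <= D1 <= 6 and 1 <= D2 <= 3

selected = select(coefficients, query_range)
print([c["v"] for c in selected])  # [2.5, 0.5]
```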

PROJECTION OPERATOR (PROJECT)

PROJECTION- WAVELET DOMAIN

JOIN OPERATOR (JOIN)

EQUI-JOIN -- RELATIONAL DOMAIN

Relational domain: join count = 7*3 = (A1 - A3)*(B2 + B3)

Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3

Consider all pairs of coefficients:
(1) check joinability (overlap in the join dimension(s))
(2) compute output coefficients

Relation 1

Dim D1 (Attr1)   Dim D2 (Attr2)   Count
6                2                7
4                3                6

Relation 2

Dim D1 (Attr1)   Dim D3 (Attr3)   Count
6                3                3

Join along D1

Dim D1 (Attr1)   Dim D2 (Attr2)   Dim D3 (Attr3)   Count
6                2                3                21

[Figure: joint data distributions of Relation 1 (over D1 and D2) and Relation 2 (over D1 and D3), aligned on the join dimension D1. Coefficients A1 (+) and A3 (-) contribute to the cell with count 7; coefficients B2 (+) and B3 (+) contribute to the cell with count 3.]
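The relational-domain join on this example can be checked with a few lines of Python: tuples that agree on D1 combine, and their counts multiply. The sketch also verifies the expansion (A1 - A3)*(B2 + B3) = A1*B2 + A1*B3 - A3*B2 - A3*B3 numerically. The function name is hypothetical, and the individual values chosen for A1, A3, B2 and B3 are made up for illustration; only their combinations 7 and 3 come from the slide.

```python
def equi_join_counts(rel1, rel2):
    """Join {(d1, d2): count} with {(d1, d3): count} on d1, multiplying counts."""
    result = {}
    for (a1, a2), count1 in rel1.items():
        for (b1, b3), count2 in rel2.items():
            if a1 == b1:
                key = (a1, a2, b3)
                result[key] = result.get(key, 0) + count1 * count2
    return result


relation1 = {(6, 2): 7, (4, 3): 6}
relation2 = {(6, 3): 3}
joined = equi_join_counts(relation1, relation2)
print(joined)  # {(6, 2, 3): 21}

# Hypothetical coefficient values with A1 - A3 = 7 and B2 + B3 = 3.
A1, A3, B2, B3 = 10, 3, 1, 2
pairwise = A1 * B2 + A1 * B3 - A3 * B2 - A3 * B3
print(pairwise)  # 21
```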

EQUI-JOIN -- WAVELET DOMAIN

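[Figure: two input coefficients with magnitudes v1 and v2, drawn over (D1, D2) and (D1, D3), and the resulting join output coefficient over D1, D2 and D3.]

Join output coefficient: v = v1 * v2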
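The pairwise idea (check overlap in the join dimension, then multiply magnitudes) can be illustrated on the simplest possible case: two one-dimensional frequency vectors over the join attribute, for which only the join size is computed. This is a sketch of the arithmetic, not the paper's join operator, which produces output coefficients rather than a single number. The tuple layout (lo, hi, lower-half sign, upper-half sign, v), the function names and the second vector B = [1, 0, 0, 3] are assumptions made for illustration.

```python
def contribution(coef, i):
    """Signed value a 1-D coefficient (lo, hi, s_lo, s_hi, v) adds to cell i."""
    lo, hi, s_lo, s_hi, v = coef
    if not lo <= i <= hi:
        return 0.0
    mid = (lo + hi) // 2  # assumed split between lower and upper half
    return (s_lo if i <= mid else s_hi) * v


def join_count(coefs_a, coefs_b):
    """Join size from all pairs of coefficients with overlapping support."""
    total = 0.0
    for a in coefs_a:
        for b in coefs_b:
            lo, hi = max(a[0], b[0]), min(a[1], b[1])
            if lo > hi:
                continue  # (1) not joinable: no overlap in the join dimension
            # (2) product of the two signed magnitudes, summed over the overlap
            total += sum(contribution(a, i) * contribution(b, i)
                         for i in range(lo, hi + 1))
    return total


# Unnormalized Haar coefficients of A = [2, 2, 5, 7] and of a made-up B = [1, 0, 0, 3].
coefs_a = [(0, 3, 1, 1, 4.0), (0, 3, 1, -1, -2.0), (0, 1, 1, -1, 0.0), (2, 3, 1, -1, -1.0)]
coefs_b = [(0, 3, 1, 1, 1.0), (0, 3, 1, -1, -0.5), (0, 1, 1, -1, 0.5), (2, 3, 1, -1, -1.5)]

print(join_count(coefs_a, coefs_b))  # 23.0, equal to 2*1 + 2*0 + 5*0 + 7*3
```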

EXPERIMENTAL STUDY

Improved Answer Quality

Low Synopsis Construction Costs

Fast Query Execution

ERROR METRICS FOR SET-VALUED QUERY ANSWERS

Need an error metric for (multi)sets that accounts for both:
• differences in element frequencies
• differences in element values

Proposed solutions:
• MAC (Match-And-Compare) Error [IP99]: based on perfect bipartite graph matching
• EMD (Earth Mover's Distance) Error [CGR00, RTG98]: based on bipartite network flows

QUERY EXECUTION TIMES

SELECT-JOIN-SUM QUERY ERRORS ON REAL-LIFE DATA

SELECT QUERY ERRORS ON REAL-LIFE DATA

SELECT-SUM QUERY ERRORS ON REAL-LIFE DATA

CONCLUSION

Multidimensional wavelets as an effective tool for general purpose approximate query processing in modern, high dimensional applications.

The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, thus allowing for very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain.

Extensive experimental study with synthetic as well as real-life data sets that verifies the effectiveness of our wavelet-based approach compared to both sampling and histograms.

Questions???

THANK YOU
