Implementing Data Cubes Efficiently

Vicky :: Cao Hui Ping

Sherman :: Chow Sze Ming

CTH :: Chong Tsz Ho

Ronald :: Woo Lok Yan

Ken :: Yiu Man Lung

Content

Background
Introduction of Datacube
Problem defined
Lattice model
Greedy algorithm
  How to do? How good? How bad?
Evaluations
Conclusion

Background

DSS (Decision Support System)
  Gain competitiveness for business

Data warehouse
  Maintain historical information
  Use “data cube” to summarize results
  Identify trends
  Performance issue (time and space)
  Need to reuse results (materialization of views)

Introduction of datacube

Datacube
  Dimensionality (number of GROUP-BYs)
  Aggregated data: values in each cell
  Dimension of datacube: detail of summary
  Higher dimension: higher detail

Common operations
  Drill down: look in more detail
  Roll up: look in less detail
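To make roll up and drill down concrete, here is a minimal Python sketch. The tiny `sales` table, the `DIMS` mapping and the helper `group_by` are made up for illustration and are not from the slides.

```python
from collections import defaultdict

# Hypothetical fact table: (product, country, quarter, amount)
sales = [
    ("TV", "U.S.A.", "1Qtr", 10),
    ("TV", "U.S.A.", "2Qtr", 20),
    ("TV", "Canada", "1Qtr", 5),
    ("PC", "U.S.A.", "1Qtr", 7),
]
DIMS = {"product": 0, "country": 1, "quarter": 2}

def group_by(rows, dims):
    """Sum the amount for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

# Drill down: group by more dimensions, so each cell is more detailed.
detailed = group_by(sales, ["product", "country"])
# Roll up: group by fewer dimensions, so each cell summarizes more rows.
rolled_up = group_by(sales, ["product"])

print(detailed)
print(rolled_up)
```

Each call corresponds to one GROUP-BY, which is why the number of possible groupings grows with the number of dimensions.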

What is a data cube?

[Figure: sales data cube. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Product shown: TV. Countries shown: Canada, Mexico. Highlighted cell: total annual sales of TV in U.S.A.]

Our problem

Physically materialize the whole data cube
  Best query response
  Heavy pre-computing, large storage space
  i.e. time efficient but space inefficient

Materialize nothing
  Worst query response
  Dynamic query evaluation, less storage space
  i.e. space efficient but time inefficient

Problem on materialized views

Materialize only part of the data cube
  Balance the storage space and the response time
  What is the best subset to materialize?
  Addressed in this paper

Source             Size        Time (sec)   Ratio
From cell itself   1           2.07         N/A
View (s)           10,000      2.38         0.000031
View (p,s)         800,000     20.77        0.000023
View (p,s,c)       6,000,000   226.23       0.000037
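The Ratio column appears to be the extra time over the one-cell baseline divided by the view size, i.e. (time – 2.07) / size. That reading is an assumption, but the short Python check below reproduces all three values in the table. The names `BASE_TIME`, `measurements` and `ratio` are illustrative.

```python
# Timings copied from the table: (source, rows, seconds)
BASE_TIME = 2.07  # time when the answer is read from the cell itself
measurements = [
    ("View (s)", 10_000, 2.38),
    ("View (p,s)", 800_000, 20.77),
    ("View (p,s,c)", 6_000_000, 226.23),
]

def ratio(size, seconds, base=BASE_TIME):
    """Extra seconds per row, relative to the one-cell baseline (assumed formula)."""
    return (seconds - base) / size

ratios = {name: ratio(size, t) for name, size, t in measurements}
for name, r in ratios.items():
    print(f"{name:14s} {r:.6f}")
```

The ratios stay within the same order of magnitude while the view size grows 600 times, which suggests that query time grows roughly linearly with the size of the view that is scanned.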

Data? View?

We use the data cube to model aggregate data.

So what do we use to model views?

Lattice!

Example of lattice diagram

8 possible groupings on the dimensions
  p for Part
  s for Supplier
  c for Customer
Number of rows of data is shown next to each grouping

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

An example of Regular Lattice
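The 8 groupings are simply all subsets of the three dimensions. A minimal Python sketch that enumerates them, with the row counts copied from the diagram; the names `ROWS` and `all_groupings` are illustrative.

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")  # part, supplier, customer

# Row counts copied from the lattice diagram
ROWS = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "none": 1,
}

def all_groupings(dims):
    """Return every subset of dims as a tuple, largest subsets first."""
    result = []
    for size in range(len(dims), -1, -1):
        result.extend(combinations(dims, size))
    return result

groupings = all_groupings(DIMENSIONS)
print(len(groupings))  # 2**3 groupings
for g in groupings:
    name = "".join(g) or "none"
    print(f"{name:5s} {ROWS[name]:>9,d} rows")
```

With n dimensions the same loop yields 2**n groupings, which is why materializing every view quickly becomes expensive.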

≼ operator

Suppose c ≼ d
  The view d can be used to derive the view c
  d is an ancestor of c in the lattice diagram

Impose a partial order on the views

Usage on dimensions
  (part) ≼ (part, customer)
  (part) ⋠ (customer)

Usage within attribute values
  (year) ≼ (quarter) ≼ (month) ≼ (day)
  (year) ≼ (quarter) ≼ (week) ≼ (day)

[Figure: lattice of the time hierarchy; node labels visible include week, month and quarter]

An example of Irregular Lattice
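The ≼ relation can be sketched in a few lines of Python. For groupings on dimensions it is a subset test; for attribute hierarchies it is reachability along "finer level" edges. The edges in `FINER` are copied from the two chains given for year, quarter, month, week and day; the function names are illustrative.

```python
def derivable(c, d):
    """True if c ≼ d: grouping c can be computed from grouping d."""
    return set(c) <= set(d)

# Hierarchy edges copied from the two chains: coarse level -> finer levels
FINER = {
    "year": {"quarter"},
    "quarter": {"month", "week"},
    "month": {"day"},
    "week": {"day"},
    "day": set(),
}

def derivable_level(coarse, fine):
    """True if level `coarse` ≼ level `fine`, following FINER edges."""
    if coarse == fine:
        return True
    return any(derivable_level(nxt, fine) for nxt in FINER[coarse])

print(derivable({"part"}, {"part", "customer"}))  # True
print(derivable({"part"}, {"customer"}))          # False
print(derivable_level("year", "day"))             # True
print(derivable_level("week", "month"))           # False
```

Week and month cannot be derived from each other, so ≼ is only a partial order here.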

Regular lattices with equal domain size

Grouping attributes: A1, A2, …, An (domain size: r)
Attribute for aggregation: B
Efficient algorithm

m: number of rows in the top view
k = ⌈log_r m⌉

Strategy        k, j, and n            Space                 Time
Space-optimal                          m                     m·2^n
Time-optimal    k > j                  (2·r^(r/(r+1)))^n     (2·r^(r/(r+1)))^n
                k < j and k ≤ n/2      m                     m·2^n
                k < j and k > n/2      m                     C(n,j)·r^j
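A small Python sketch of two quantities from this slide: the rank k = ⌈log_r m⌉ and the space-optimal row of the table. The values of r, m and n are hypothetical, and the reading of the time entry (each of the 2^n views is answered from the top view at cost m) is an assumption.

```python
def rank_k(r, m):
    """Smallest k with r**k >= m, i.e. k = ceil(log_r(m)), using integers only."""
    if r < 2:
        raise ValueError("domain size r must be at least 2")
    k, size = 0, 1
    while size < m:
        size *= r
        k += 1
    return k

def space_optimal(m, n):
    """Materialize only the top view: space m, total time m * 2**n."""
    return {"space": m, "time": m * 2 ** n}

r, m, n = 10, 5_000, 6  # hypothetical values
print(rank_k(r, m))          # 4, since 10**3 < 5000 <= 10**4
print(space_optimal(m, n))   # {'space': 5000, 'time': 320000}
```

The integer loop avoids the rounding problems that a floating-point logarithm can show when m is an exact power of r.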

The problem

The previous technique cannot be applied to irregular lattices

Irregular lattices are common in data warehouses

The optimization of views for an irregular lattice is an NP-complete problem (inefficient!)

Use a greedy algorithm, i.e. use heuristics to obtain an approximate solution

Greedy algorithm

Being as greedy as possible in each step!!

Simple example: Use the smallest number of coins to pay 50 cents

Suppose we have many coins of 20 cents, 10 cents and 5 cents.

How to be greedy?

Common sense approach:
  Select the largest coin: 20 cents
  Select the largest coin again: 20 cents
  Remaining amount = 50 – 20 – 20 = 10 cents
  We cannot select the largest coin again.
  We choose the 2nd largest coin, 10 cents, instead.

Only 3 coins are needed! Optimal solution!
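The coin example can be sketched in Python as follows. The function name `greedy_change` is illustrative.

```python
def greedy_change(amount, coins):
    """Pay `amount` by repeatedly taking the largest coin that still fits."""
    chosen = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            chosen.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return chosen

print(greedy_change(50, [20, 10, 5]))  # [20, 20, 10]
```

Greedy is optimal for this coin set, but not for every one: with coins of 4, 3 and 1, paying 6 greedily takes 4 + 1 + 1 (three coins) although 3 + 3 needs only two.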

Definition of “benefit of view”

C(v) denotes the cost of view v
B(v,S) denotes the benefit of a view v relative to a set of views S

For each w ≼ v
  Let u be the view of least cost in S such that w ≼ u
  B_w = max{ C(u) – C(v), 0 }

B(v,S) = ∑_{w ≼ v} B_w

Greedy algorithm

In each step
  Select the view with the most benefit
  Add it to the result

Algorithm
S = {top view};
for i = 1 to k {
    select view v not in S such that B(v,S) is maximized;
    S = S union {v};
}
return S;
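A minimal Python sketch of the benefit definition and the greedy loop. The representation is an assumption made for illustration: `cost[v]` is C(v), and `below[v]` is the set of views w with w ≼ v (including v itself). The diamond lattice used to run it is hypothetical.

```python
def benefit(v, selected, cost, below):
    """B(v, S): total saving over all views w ≼ v if v is added to S."""
    total = 0
    for w in below[v]:
        # Cheapest already-selected view u with w ≼ u
        cheapest = min(cost[u] for u in selected if w in below[u])
        total += max(cheapest - cost[v], 0)
    return total

def greedy_select(top, k, cost, below):
    """Start from the top view, then add the view of greatest benefit k times."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected, cost, below))
        selected.append(best)
    return selected

# Hypothetical diamond lattice: top -> x, y -> bottom
cost = {"top": 100, "x": 40, "y": 60, "bottom": 5}
below = {
    "top": {"top", "x", "y", "bottom"},
    "x": {"x", "bottom"},
    "y": {"y", "bottom"},
    "bottom": {"bottom"},
}
print(greedy_select("top", 2, cost, below))  # ['top', 'x', 'y']
```

In the first step x wins with benefit (100 – 40) x 2 = 120. In the second step y (benefit 40) beats bottom (benefit 40 – 5 = 35), because bottom can already be computed from x.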

Selecting the first view

After selecting coins, let us go back to our problem: selecting views.

We must materialize the top view, i.e. the view grouping by all attributes
  It cannot be constructed from other views
  Materializing it avoids going to the raw data

Selecting k more views

Space is limited! Suppose we can only select k more views.

For each view which is not yet selected, calculate the benefit of materializing it.

Pick the one with maximum benefit!!!

Let’s set k = 2 for the example.

Example

E.g. the cost of constructing view b given the view a is 100

If we choose b to materialize, the new cost of constructing view b is 50.

First round

Notice that not only b, but also d, e, g and h can be calculated from b

So the total benefit is (100 – 50) x 5 = 250

Continue…

Similarly, the benefit of materializing c is (100 – 75) x 5 = 125

Not yet finished…

For e, benefit = (100 – 30) x 3 = 210

Let’s choose b!

For d and f, benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.

Next round?

It seems we should choose e, as it has the second largest benefit.

Let’s see what will happen in the second round.

Second round!

Now, only c and f get benefit if we materialize c (since e, g and h can be more efficiently calculated by using b)

Benefit = (100 – 75) x 2 = 50

How about choosing f?

If we choose f, we find that h can be calculated more efficiently by using f instead of b.

Benefit = (100 – 40) + (50 – 40) = 70

Easy to work out others

Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40
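The two rounds can be checked with a short Python sketch. The costs are the ones used in the arithmetic above. The sets in `below` (which views each view can compute) are an assumption reconstructed from the text, since the lattice figure is not available; in particular, placing g under d is a guess that does not change any of the numbers.

```python
# Costs as used in the arithmetic of the example
cost = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# ASSUMPTION: below[v] = every view computable from v, including v itself
below = {
    "a": set("abcdefgh"),
    "b": set("bdegh"),
    "c": set("cefgh"),
    "d": set("dg"),
    "e": set("egh"),
    "f": set("fh"),
    "g": set("g"),
    "h": set("h"),
}

def benefit(v, selected):
    """Total saving over all views below v if v is materialized as well."""
    total = 0
    for w in below[v]:
        cheapest = min(cost[u] for u in selected if w in below[u])
        total += max(cheapest - cost[v], 0)
    return total

def round_table(selected):
    """Benefit of every view that has not been selected yet."""
    return {v: benefit(v, selected) for v in sorted(cost) if v not in selected}

first = round_table(["a"])
second = round_table(["a", "b"])
print(first)   # b has the largest benefit, 250
print(second)  # f has the largest benefit, 70
```

Recomputing the table after every pick is the essential step: the first-round figures for e and d no longer apply once b is materialized.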

Observation

In the first round, the benefit of choosing f (only 120) is far from the best choice (250)

But in the second round, choosing f gives the maximum benefit!

[Table: benefit of each view in the 1st round and in the 2nd round]

Simple? Optimal?

Trade off again! This simple algorithm is not optimal in all cases!

Consider the following case…

Bad example

[Figure: example lattice; visible labels include two views of cost 100 and groups of "20 nodes, total 1000"]

Bad example


Choose c

Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum

Bad example


Now choose either one of b and d (same benefit)

Bad example


How about these? Very expensive!!!

Optimal solution should be…


Only c is a little bit expensive.
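The arithmetic of the bad example can be sketched in Python. The lattice structure below is an assumption, because the figure is missing: the top view a costs 200; b, c and d cost 100, 99 and 100; under them sit four groups of 20 nodes, where b covers groups 1 and 2, c covers groups 2 and 3, and d covers groups 3 and 4. Only the first-round benefit of c (4141) is stated in the slides; the other totals follow from the assumed structure with k = 2.

```python
# ASSUMED costs and group size, reconstructed from the visible labels
TOP, B, C, D = 200, 100, 99, 100
GROUP = 20  # nodes per group

# Round 1: each of b, c, d improves itself plus two groups of 20 nodes.
first_c = (TOP - C) * (1 + GROUP + GROUP)
first_b = (TOP - B) * (1 + GROUP + GROUP)

# Round 2 after picking c: b only helps itself and the one group
# that c does not cover (the shared group is already served at cost 99).
second_b = (TOP - B) * (1 + GROUP)

greedy_total = first_c + second_b  # greedy picks c, then b (or d)
optimal_total = 2 * first_b        # picking b and d covers all four groups

print(first_c, first_b)                        # 4141 4100
print(greedy_total, optimal_total)             # 6241 8200
print(round(greedy_total / optimal_total, 3))  # 0.761
```

Under these assumptions greedy is lured by the slightly cheaper c and ends up with about three quarters of the optimal benefit.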

Some theoretical result

It can be proved that the greedy algorithm achieves at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
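A quick numeric check of the bound in Python. The k-dependent form 1 – ((k – 1)/k)^k is taken from the referenced paper rather than from this slide, so treat it as an assumption here; the function name is illustrative.

```python
import math

def greedy_guarantee(k):
    """Lower bound on (greedy benefit / optimal benefit) when k views are selected."""
    return 1 - ((k - 1) / k) ** k

limit = (math.e - 1) / math.e
print(round(limit, 3))      # 0.632
print(greedy_guarantee(2))  # 0.75
for k in (1, 2, 5, 10, 100):
    print(k, round(greedy_guarantee(k), 4))
```

The bound decreases towards (e – 1)/e as k grows, so the 63% figure is the worst case over all k.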

Extensions (1)

Problem
  The views in a lattice are unlikely to have the same probability of being requested in a query.

Solution
  We can weight each benefit by its probability.
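One way to read this solution is to scale the saving on each view w by the probability that w is queried. A minimal Python sketch under that assumption; the lattice, the costs and the probabilities in `prob` are hypothetical.

```python
def weighted_benefit(v, selected, cost, below, prob):
    """Like B(v, S), but the saving on each view w is scaled by prob[w]."""
    total = 0.0
    for w in below[v]:
        cheapest = min(cost[u] for u in selected if w in below[u])
        total += prob[w] * max(cheapest - cost[v], 0)
    return total

# Hypothetical diamond lattice: top -> x, y -> bottom
cost = {"top": 100, "x": 40, "y": 60, "bottom": 5}
below = {
    "top": {"top", "x", "y", "bottom"},
    "x": {"x", "bottom"},
    "y": {"y", "bottom"},
    "bottom": {"bottom"},
}
# Hypothetical query probabilities: y is requested most often
prob = {"top": 0.1, "x": 0.1, "y": 0.7, "bottom": 0.1}

scores = {v: weighted_benefit(v, ["top"], cost, below, prob)
          for v in ("x", "y", "bottom")}
print(max(scores, key=scores.get))  # y
```

Without weights x would be picked first, since (100 – 40) x 2 = 120 beats (100 – 60) x 2 = 80; the weights move the choice to the frequently queried y.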

Extensions (2)

Problem
  Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.

Solution
  We can consider the “benefit of each view per unit space”.
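A single greedy step under a space budget could look like the Python sketch below. The candidate benefits and sizes are hypothetical, and skipping views that no longer fit is one possible policy, not something the slides specify.

```python
def pick_per_unit_space(candidates, space_left):
    """Pick the view with the best benefit/size ratio that still fits, or None."""
    affordable = {v: (b, s) for v, (b, s) in candidates.items() if s <= space_left}
    if not affordable:
        return None
    return max(affordable, key=lambda v: affordable[v][0] / affordable[v][1])

# Hypothetical (benefit, size) pairs for one greedy step
candidates = {"x": (120, 40), "y": (80, 60), "bottom": (95, 5)}
print(pick_per_unit_space(candidates, space_left=50))  # bottom
```

Here x has the largest absolute benefit, but bottom gives 19 units of benefit per unit of space against 3 for x, so it is picked first.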

Conclusions

Materialization of views is an essential query optimization strategy for decision-support applications.

Reason to materialize some part of the data cube but not all of the cube.

A lattice framework that models multidimensional analysis very well.

Conclusions (cont.)

Finding optimal solution in the case of irregular lattice is NP-hard.

Introduction of the greedy algorithm

The greedy algorithm works on this lattice and picks almost the right views to materialize.

Conclusions (the end)

There exist cases in which the greedy algorithm fails to produce the optimal solution.

But the greedy algorithm has guaranteed performance.

Extensions of the greedy algorithm.

Reference

Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.

Thank you~

Q & A Section
