1
Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung
Implementing Data Cubes Efficiently
2
Content
Background
Introduction of data cube
Problem defined
Lattice model
Greedy algorithm
  How to do? How good? How bad?
Evaluations
Conclusion
3
Background
DSS (Decision Support System)
  Gain competitiveness for business
Data warehouse
  Maintain historical information
  Use “data cube” to summarize results
  Identify trends
  Performance issue (time and space)
  Need to reuse results (materialization of views)
4
Introduction of data cube
Data cube
  Dimensionality (number of GROUP-BYs)
  Aggregated data: values in each cell
  Dimension of data cube -> detail of summary
  Higher dimension -> higher detail
Common operations
  Drill down: look in more detail
  Roll up: look in less detail
5
What is a data cube?
[Figure: a three-dimensional sales cube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Country (U.S.A, Canada, Mexico, sum). The highlighted cell holds the total annual sales of TV in U.S.A.]
6
Our problem
Physically materialize the whole data cube
  Best query response
  Heavy pre-computing, large storage space
  i.e. time efficient but space inefficient
Materialize nothing
  Worst query response
  Dynamic query evaluation, less storage space
  i.e. space efficient but time inefficient
7
Problem on materialized views
Materialize only part of the data cube
  Balance the storage space and response time
  What is the best subset to materialize?
  Addressed in this paper
Source             Size        Time (sec)   Ratio
From cell itself   1           2.07         N/A
View (s)           10,000      2.38         0.000031
View (p,s)         800,000     20.77        0.000023
View (p,s,c)       6,000,000   226.23       0.000037
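The slide does not say how the Ratio column is computed. One reading that reproduces all three values is the extra query time over the single-cell case, divided by the view size. The sketch below checks that reading; the formula and the names `BASE_TIME` and `ratio` are assumptions, not something stated on the slide.

```python
# Rows copied from the slide: (source, size in rows, query time in seconds).
rows = [
    ("View (s)", 10_000, 2.38),
    ("View (p,s)", 800_000, 20.77),
    ("View (p,s,c)", 6_000_000, 226.23),
]
BASE_TIME = 2.07  # time to answer from the cell itself (size 1)


def ratio(size, seconds, base=BASE_TIME):
    # Assumed definition: extra time per row of the view that is scanned.
    return (seconds - base) / size


ratios = {name: round(ratio(size, t), 6) for name, size, t in rows}
for name, value in ratios.items():
    print(f"{name:14s} {value:.6f}")
```

If this reading is right, the ratio stays nearly constant while the size grows by a factor of 600, which is what makes the row count a reasonable stand-in for query cost.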
8
Data? View?
We use a data cube to model aggregate data.
So what do we use to model views?
Lattice!
9
Example of lattice diagram
8 possible groupings on the dimensions
  p for Part, s for Supplier, c for Customer
# of rows of data shown next to each grouping
psc 6M
pc 6M ps 0.8M sc 6M
p 0.2M s 0.01M c 0.1M
none 1
An example of Regular Lattice
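The 8 groupings are simply all subsets of the three dimensions. A minimal sketch that enumerates them, assuming each view is named by concatenating its dimension letters (the function name `all_groupings` is made up for illustration):

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")  # part, supplier, customer


def all_groupings(dims):
    """Return every subset of the dimensions, largest grouping first."""
    views = []
    for size in range(len(dims), -1, -1):
        for combo in combinations(dims, size):
            # The empty grouping is the view the slide calls "none".
            views.append("".join(combo) or "none")
    return views


views = all_groupings(DIMENSIONS)
print(len(views), views)
```

With n dimensions this yields 2^n views, which is why materializing all of them quickly becomes expensive.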
10
≼ operator
Suppose c ≼ d
  The view d can be used to derive the view c
  d is an ancestor of c in the lattice diagram
Imposes a partial order on the views
Usage on dimensions
  (part) ≼ (part, customer)
  (part) ⋠ (customer)
Usage within attribute values
  (year) ≼ (quarter) ≼ (month) ≼ (day)
  (year) ≼ (quarter) ≼ (week) ≼ (day)
[Figure: hierarchy diagram over day, week, month, quarter and year]
An example of Irregular Lattice
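For groupings on dimensions, the ≼ test reduces to a subset check on the grouping attributes. A small sketch under that assumption (views encoded as Python sets, and the name `derivable_from` is hypothetical); attribute hierarchies such as day/month/year would need an extra relation that is not modelled here.

```python
def derivable_from(c, d):
    """True if view c ≼ d, i.e. c can be computed from d.

    Views are given as sets of grouping attributes (an assumed encoding).
    """
    return set(c) <= set(d)


print(derivable_from({"part"}, {"part", "customer"}))  # (part) ≼ (part, customer)
print(derivable_from({"part"}, {"customer"}))          # (part) ⋠ (customer)
```

Because ≼ is only a partial order, two views can be incomparable in both directions, as (part) and (customer) are.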
11
Regular lattices with equal domain size
Grouping attributes: A1, A2, …, An (each with domain size r)
Attribute for aggregation: B
Efficient algorithm
m: # of rows in the top view
k = ⌈log_r m⌉
Strategy        k, j, and n          Space           Time
Space-optimal                        m               m·2^n
Time-optimal    k > j                (2rr/(r+1))n    (2rr/(r+1))n
                k < j and k ≤ n/2    m               m·2^n
                k < j and k > n/2    m               nCj · r^j
12
The problem
The previous technique cannot be applied to irregular lattices
  Irregular lattices are common in data warehouses
  Optimizing the choice of views for an irregular lattice is an NP-complete problem (inefficient!)
Use Greedy Algorithm
  i.e. use heuristics to obtain an approximate solution
13
Greedy algorithm
Being as greedy as possible in each step!!
Simple example: use the smallest number of coins to pay 50 cents
Suppose we have many coins of 20 cents, 10 cents and 5 cents.
14
How to be greedy?
Common sense approach:
  Select the largest coin: 20 cents
  Select the largest coin again: 20 cents
  Remaining amount = 50 – 20 – 20 = 10 cents
  We cannot select the largest coin again.
  We choose the 2nd largest coin, 10 cents, instead.
Only 3 coins are needed! Optimal solution!
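The steps above can be sketched in a few lines. This is only an illustration of the "largest coin first" rule; the function name `greedy_coins` is made up, and the rule is not guaranteed to be optimal for every set of denominations.

```python
def greedy_coins(amount, denominations):
    """Pay `amount` by always taking the largest coin that still fits."""
    chosen = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            chosen.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid with these coins")
    return chosen


print(greedy_coins(50, [20, 10, 5]))  # [20, 20, 10]
```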
15
Definition of “benefit of view”
C(v) denotes the cost of view v
B(v, S) denotes the benefit of a view v relative to a set of views S
For each w ≼ v
  Let u be the view of least cost in S such that w ≼ u
  B_w = max{ C(u) – C(v), 0 }
B(v, S) = ∑_{w ≼ v} B_w
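A minimal sketch of this definition, using the costs of the eight-view example lattice from the slides that follow. The `DERIVES` sets (which views each view can answer) are an assumption inferred from the benefit figures in that example, since the edges of the diagram did not survive in the transcript.

```python
# Cost of each view (number of rows), taken from the example lattice.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# DERIVES[v] = every view w with w ≼ v (including v itself).
# Assumed structure, inferred from the benefit figures in the example.
DERIVES = {
    "a": set("abcdefgh"),
    "b": set("bdegh"),
    "c": set("cefgh"),
    "d": set("dg"),
    "e": set("egh"),
    "f": set("fh"),
    "g": set("g"),
    "h": set("h"),
}


def benefit(v, selected):
    """B(v, S): total saving over all views w ≼ v if v is added to S."""
    total = 0
    for w in DERIVES[v]:
        # u = cheapest already-selected view that can answer w
        cheapest = min(COST[u] for u in selected if w in DERIVES[u])
        total += max(cheapest - COST[v], 0)
    return total


print(benefit("b", {"a"}))       # 250
print(benefit("f", {"a", "b"}))  # 70
```

The `selected` set must always contain the top view `a`; otherwise some view w would have no u to be answered from.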
16
Greedy algorithm
In each step
  Select the view with the most benefit
  Add it to the result
Algorithm
  S = {top view};
  for i = 1 to k {
    select view v not in S such that B(v,S) is maximized;
    S = S union {v};
  }
  return S;
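The pseudocode can be turned into a short runnable sketch. The costs come from the example lattice in the slides; the `CHILDREN` edges are an assumed structure chosen to be consistent with the benefit figures in the worked example, and the helper names are made up for illustration.

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
# Direct edges from a view to the views computable from it (assumed structure).
CHILDREN = {"a": "bc", "b": "de", "c": "ef", "d": "g", "e": "gh", "f": "h",
            "g": "", "h": ""}


def descendants(v):
    """All views w with w ≼ v, including v itself."""
    seen, stack = set(), [v]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN[node])
    return seen


def benefit(v, selected):
    """B(v, S) as defined on the previous slide."""
    total = 0
    for w in descendants(v):
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total


def greedy_select(top, k):
    """Start with the top view, then add k views of maximum benefit."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected))
        selected.append(best)
    return selected


print(greedy_select("a", 2))  # ['a', 'b', 'f']
```

Benefits are recomputed against the current selection in every round, which is why a view that looked weak in the first round can win the second.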
17
Selecting the first view
After selecting coins, let us go back to our problem: selecting views.
We must materialize the top view
  i.e. the view grouping by all attributes
  Cannot be constructed from other views
  Avoids going to the raw data
18
Selecting k more views
Space is limited! Suppose we can only select k more views.
For each view which is not yet selected, calculate the benefit of materializing it.
Pick the one with maximum benefit!!!
Let’s set k = 2 for our example.
19
Example
[Figure: lattice of views a to h, with the cost of each view]
          a 100
      b 50    c 75
  d 20    e 30    f 40
      g 1     h 10
E.g. the cost of constructing view b given the view a is 100.
If we choose b to materialize, the new cost of constructing view b is 50.
20
First round
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
Notice that not only b, but also d, e, g and h can be calculated from b
So the total benefit is (100 – 50) x 5 = 250
21
Continue…
Similarly, the benefit of materializing c is (100 – 75) x 5 = 125
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
Benefit
b 250
c 125
22
Not yet finished…
For e, Benefit = (100 – 30) x 3 = 210
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
Benefit
b 250
c 125
e 210
23
Let’s choose b!
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.
Benefit
b 250
c 125
d 160
e 210
f 120
24
Next round?
It seems we should choose e next, as it has the second largest benefit.
Let’s see what will happen in the second round.
Benefit
b 250
c 125
d 160
e 210
f 120
25
Second round!
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
Now, only c and f benefit if we materialize c (since e, g and h can be calculated more efficiently by using b)
Benefit = (100 – 75) x 2 = 50
Benefit
c 50
26
How about choosing f?
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
If we choose f, we find that h can be calculated more efficiently by using f instead of b.
Benefit = (100 – 40) + (50 – 40) = 70
Benefit
c 50
f 70
27
Easy to work out others
Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40
[Figure: lattice with costs a 100; b 50, c 75; d 20, e 30, f 40; g 1, h 10]
28
Observation
In the first round, the benefit of choosing f (only 120) is far from the best choice (250).
But in the second round, choosing f gives the maximum benefit!
1st round Benefit
b 250
c 125
d 160
e 210
f 120
2nd round Benefit
c 50
d 60
e 60
f 70
g 49
h 40
29
Simple? Optimal?
Trade off again! This simple algorithm is not optimal in all cases!
Consider the following case…
30
Bad example
[Figure: lattice with costs a 200; b 100, c 99, d 100; groups of 20 nodes (total 1000) below them]
31
Bad example
[Figure: lattice with costs a 200; b 100, c 99, d 100; groups of 20 nodes (total 1000) below them]
Choose c
Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum
32
Bad example
[Figure: lattice with costs a 200; b 100, c 99, d 100; groups of 20 nodes (total 1000) below them]
Now choose either b or d (same benefit)
33
Bad example
[Figure: lattice with costs a 200; b 100, c 99, d 100; groups of 20 nodes (total 1000) below them]
How about these? Very expensive!!!
34
Optimal solution should be…
[Figure: lattice with costs a 200; b 100, c 99, d 100; groups of 20 nodes (total 1000) below them]
Only c is a little bit expensive.
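To put numbers on the bad example, the sketch below compares the greedy choice with the choice of b and d for k = 2. The figure's edges did not survive in the transcript, so the coverage is an assumption: four groups of 20 views, with b answering groups 1 and 2, c answering groups 2 and 3, and d answering groups 3 and 4. That assumption matches the (1 + 20 + 20) factor on the slide. Only b, c and d are considered as candidates.

```python
COST = {"a": 200, "b": 100, "c": 99, "d": 100}
# Assumed coverage: which of the four 20-node groups each view can answer.
GROUPS = {"a": {1, 2, 3, 4}, "b": {1, 2}, "c": {2, 3}, "d": {3, 4}}
GROUP_SIZE = 20


def total_benefit(chosen):
    """Saving versus answering every query from a, for the chosen views."""
    saving = 0
    # Each chosen view answers itself more cheaply than a would.
    for v in chosen:
        saving += COST["a"] - COST[v]
    # Each group is answered from the cheapest chosen view that covers it.
    for g in range(1, 5):
        best = min([COST[v] for v in chosen if g in GROUPS[v]] + [COST["a"]])
        saving += GROUP_SIZE * (COST["a"] - best)
    return saving


def greedy(k):
    """Pick k views among b, c, d, each time maximizing the total saving."""
    chosen = []
    for _ in range(k):
        rest = [v for v in "bcd" if v not in chosen]
        chosen.append(max(rest, key=lambda v: total_benefit(chosen + [v])))
    return chosen


picked = greedy(2)
print(picked, total_benefit(picked))  # ['c', 'b'] 6241
print(total_benefit(["b", "d"]))      # 8200
```

Under these assumptions greedy reaches 6241 of a possible 8200, roughly three quarters of the optimal benefit.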
35
Some theoretical result
It can be proved that the greedy algorithm achieves at least (e – 1)/e (about 63%) of the benefit of the optimal selection.
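A quick numeric check of the constant. The per-k form 1 – ((k – 1)/k)^k used below is recalled from the cited paper rather than stated on the slide, so treat it as an assumption; it decreases towards (e – 1)/e as k grows.

```python
import math


def greedy_bound(k):
    """Assumed lower bound on greedy/optimal benefit when k views are picked."""
    return 1 - ((k - 1) / k) ** k


for k in (1, 2, 5, 100):
    print(k, round(greedy_bound(k), 4))
print("limit:", round((math.e - 1) / math.e, 4))  # 0.6321
```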
36
Extensions (1)
Problem
  The views in a lattice are unlikely to have the same probability of being requested in a query.
Solution
  We can weight each benefit by its probability.
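A small sketch of the weighting idea on a toy four-view lattice. The lattice, the probabilities in `PROB` and the function name are all hypothetical and chosen only to show that weighting can change which view looks best.

```python
COST = {"a": 100, "b": 50, "d": 20, "e": 30}
# DERIVES[v] = views answerable from v (toy structure, not from the slides).
DERIVES = {"a": set("abde"), "b": set("bde"), "d": set("d"), "e": set("e")}
# Hypothetical query probabilities; they sum to 1.
PROB = {"a": 0.1, "b": 0.2, "d": 0.6, "e": 0.1}


def weighted_benefit(v, selected):
    """Benefit of v where each saving is weighted by its query probability."""
    total = 0.0
    for w in DERIVES[v]:
        cheapest = min(COST[u] for u in selected if w in DERIVES[u])
        total += PROB[w] * max(cheapest - COST[v], 0)
    return total


print(weighted_benefit("b", {"a"}))  # about 0.9 * 50
print(weighted_benefit("d", {"a"}))  # about 0.6 * 80
```

Unweighted, b would win (150 against 80); with most queries hitting d, the weighted benefit of d is slightly higher.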
37
Extensions (2)
Problem
  Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
Solution
  We can consider the “benefit of each view per unit space”.
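As a rough illustration, the first-round benefits from the worked example can be re-ranked per unit of space. This assumes the space a view occupies equals its cost (row count) in the example lattice, which the slide does not state.

```python
# First-round benefits and view sizes, copied from the worked example.
BENEFIT = {"b": 250, "c": 125, "d": 160, "e": 210, "f": 120}
SIZE = {"b": 50, "c": 75, "d": 20, "e": 30, "f": 40}  # assumed: size = cost

per_unit = {v: BENEFIT[v] / SIZE[v] for v in BENEFIT}
ranking = sorted(per_unit, key=per_unit.get, reverse=True)
print(ranking)  # ['d', 'e', 'b', 'f', 'c']
```

View b has the largest absolute benefit, but d and e give more benefit for each row stored.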
38
Conclusions
Materialization of views is an essential query optimization strategy for decision-support applications.
There are good reasons to materialize some part of the data cube but not all of it.
A lattice framework models multidimensional analysis very well.
39
Conclusions (cont.)
Finding the optimal solution in the case of an irregular lattice is NP-hard.
Introduction of greedy algorithm
  The greedy algorithm works on this lattice and picks almost the right views to materialize.
40
Conclusions (the end)
There exist cases in which the greedy algorithm fails to produce the optimal solution.
But the greedy algorithm has guaranteed performance.
Extensions of the greedy algorithm.
41
Reference
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.
42
Thank you~
Q & A Section