basics of database tuning
TRANSCRIPT
OLAP Indexes
Jian Pei: Data Mining -- OLAP Indexes 2
What Is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
– W. H. Inmon• Data warehousing: the process of
constructing and using data warehouses
Jian Pei: Data Mining -- OLAP Indexes 3
Index Requirements in OLAP
• Data is read only– No insertion or deletion
• Query types– Point query: look up one specific tuple (rare)– Range query: return the aggregate of a (large)
set of tuples, with group by– Complex queries: need specific algorithms and
index structures, will be discussed later
Jian Pei: Data Mining -- OLAP Indexes 4
OLAP Query Example
• In table (cust, gender, …), find the total number of male customers
• Method 1: scan the table once• Method 2: build a B+ tree index on attribute
gender, still need to access all tuples of male customers
• Can we get the count without scanning many tuples?
Jian Pei: Data Mining -- OLAP Indexes 5
Bitmap Index
• For n tuples, a bitmap index has n bits and can be packed into ⎡n /8⎤ bytes and ⎡n /32⎤words
• From a bit to the row-id: the j-th bit of the p-th byte row-id = p*8 +j cust gender …
Jack M …Cathy F …
… … …Nancy F …
1 0 … 0
Jian Pei: Data Mining -- OLAP Indexes 6
Using Bitmap to Count
• Shcount[] contains the number of bits in the entry subscript– shcount[01100101]=4count = 0;for (i = 0; i < SHNUM; i++)
count += shcount[B[i]];
Jian Pei: Data Mining -- OLAP Indexes 7
Advantages of Bitmap Index
• Efficient in space• Ready for logic composition
– C = C1 AND C2– Bitmap operations can be used
• Bitmap index only works for categorical data with low cardinality– Naively, we need 50 bits per entry to represent
the state of a customer in US– How to represent a sale in dollars?
Jian Pei: Data Mining -- OLAP Indexes 8
Bit-Sliced Index
• A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits– 24 bits is good for up to $167,772.15,
appropriate for many stores• A bit-sliced index is N bitmaps
– Tuple j sets in bitmap k if the k-th bit in its binary representation is on
– The space costs of bit-sliced index is the same as storing the data directly
Jian Pei: Data Mining -- OLAP Indexes 9
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;– Tuples satisfying C is identified by a bitmap B
• Direct access to rows to calculate SUM: scan the whole table once
• B+ tree: find the tuples from the tree• Projection index: only scan attribute sales• Bit-sliced index: get the sum from ∑(B AND
Bk)*2k
Jian Pei: Data Mining -- OLAP Indexes 10
Cost Comparison
• Traditional value-list index (B+ tree) is costly in both I/O and CPU time– Not good for OLAP
• Bit-sliced index is efficient in I/O• Other case studies in [O’Neil and Quass,
SIGMOD’97]
Jian Pei: Data Mining -- OLAP Indexes 11
Horizontal or Vertical Storage
• A fact table for data warehousing is often fat– Tens of even hundreds of dimensions/attributes
• A query is often about only a few attributes• Horizontal storage: tuples are stored one by one• Vertical storage: tuples are stored by attributes
A1 A2 … A100
x1 x2 … x100
… … … …z1 z2 … z100
A1 A2 … A100
x1 x2 … x100
… … … …z1 z2 … z100
Jian Pei: Data Mining -- OLAP Indexes 12
Horizontal Versus Vertical
• Find the information of tuple t– Typical in OLTP– Horizontal storage: get the whole tuple in one search– Vertical storage: search 100 lists
• Find SUM(a100) GROUP BY {a22, a83}– Typical in OLAP– Horizontal storage (no index): search all tuples O(100n),
where n is the number of tuples– Vertical storage: search 3 lists O(3n), 3% of the
horizontal storage method• Projection index: vertical storage
Jian Pei: Data Mining -- OLAP Indexes 13
Outline
• OLAP and data warehousing: concepts• Indexing for OLAP operations• Data cube computation
– Computing complete data cubes– Compression
Jian Pei: Data Mining -- OLAP Indexes 14
MOLAPDate
Prod
uct
Cou
ntrysum
sumTV
VCRPC
1Qtr 2Qtr 3Qtr 4QtrU.S.A
Canada
Mexico
sum
Jian Pei: Data Mining -- OLAP Indexes 15
Pros and Cons
• Easy to implement• Fast retrieval• Many entries may be empty if data is sparse• Space costly
Jian Pei: Data Mining -- OLAP Indexes 16
ROLAP – Data Cube in Table
• A multi-dimensional databaseBase table
Dimensions Measure
S1 P1 Spring 6S1 P2 Spring 12S2 P1 Fall 9
Store Product Season AVG(Sales)
S1 * Spring 9… … … …* * * 9
9FallP1S212SpringP2S16SpringP1S1
MeasureDimensionsSalesSeasonProductStore
Cubing
Jian Pei: Data Mining -- OLAP Indexes 17
Pros and Cons
• Compact in space• Using mature relational technology• Overhead in query answering
Jian Pei: Data Mining -- OLAP Indexes 18
HOLAP
• Hybrid OLAP• Store detail data in ROLAP
– Save space• Store aggregates in MOLAP
– Fast query answering
Jian Pei: Data Mining -- OLAP Indexes 19
Cube Computation
• Given a base table B(D1, …, Dn, M), compute the set of all aggregates in the data cube using aggregate function f()
• D1, …, Dn are n dimensions and M is a measure
Jian Pei: Data Mining -- OLAP Indexes 20
Aggregate Cells
• A cell c=(c1, …, cn) is called an aggregate cell, provided each ci is either in Di or ALL– Generally, we assume that each tuple in B has
distinct dimension values, i.e., (D1, …, Dn) is a key in B
– Value ALL is also conventionally written as *• Cell c is a base cell if (c, m) is a tuple in the
base table• A cell c is not empty if there are some base
cells match all non-ALL attribute values in c
Jian Pei: Data Mining -- OLAP Indexes 21
Example
• (S1, ALL, Spring), (S2, P1, Fall) and (S1, ALL, Winter) are aggregate cells
• (S2, P1, Fall) is a base cell• (S1, ALL, Winter) is empty• (S1, ALL, Spring) = (S1, *, Spring)
Base tableDimensions Measure
S1 P1 Spring 6S1 P2 Spring 12S2 P1 Fall 9
Store Product Season Sales
Jian Pei: Data Mining -- OLAP Indexes 22
Cuboid Lattice
• All aggregate cells belong to the same group by form a cuboid
(*,*,*)
(Store,*,*) (*,Product,*) (*,*,Season)
(Store,Product,*) (Store,*,Season) (*,Product,Season)
(Store,Product,Season)
Drill down
Roll up
Jian Pei: Data Mining -- OLAP Indexes 23
Tuples in the base table
(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9
(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*) (*,P1,f):9
(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9
(*,*,*):9
Cube (Cell) Lattice
Jian Pei: Data Mining -- OLAP Indexes 24
A Naïve Approach
• Set one counter for every aggregate cell• Scan the base table once, compute all
aggregate cells• Time complexity O(n), where n is the
number of tuples in the base table• Space overhead O(π(|Di|+1))
– 10 dimensions, each dimension has cardinality 10, (1110 – 1010) = 15,937,424,601 counters are needed!
Jian Pei: Data Mining -- OLAP Indexes 25
Real Data Sets Are Often Sparse
• Many aggregate cells are empty• Example: 10 dimensions, each dimension
has cardinality 10– 1010 possible base cells– Even a base table has 1 billion tuples, still at
least 90% of the possible base cells are empty!
Jian Pei: Data Mining -- OLAP Indexes 26
Multiway Aggregate Computation
• Compute ABC• Compute AB, AC, and
BC from ABC• Compute A, B, and C
from AB, AC, and BC• Compute ALL from A,
B, C
ABC
A BC
AB AC BC
ALL
Jian Pei: Data Mining -- OLAP Indexes 27
Main Memory Allocation
• Do we really need to allocate main memory for AB, AC, and BC, respectively?– The cuboids still can be very large
• Observation: A, B, and C can be computed from any two of AB, AC, BC
ABC
A B
AC BC
C
AB
ALL
Jian Pei: Data Mining -- OLAP Indexes 28
Multi-way Array Aggregation (1)
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0a2 a3
C
4428 56
4024 52
3620
60
B
Jian Pei: Data Mining -- OLAP Indexes 29
Multi-way Array Aggregation (2)
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1c 0
b3
b2
b1
b0a2 a3
C
4428 56
4024 52
3620
60
B
Jian Pei: Data Mining -- OLAP Indexes 30
Ideas
• Consider a base table (A, B, C), where |A|=10, |B|=100, |C|=1000 Order A-B-C C-B-A
(A,B,C) 1 1(A,B,*) 1 10*100(A,*,C) 1000 10(*,B,C) 100*1000 1(A,*,*) 1 10(*,B,*) 100 100(*,*,C) 1000 1(*,*,*) 1 1Total 102,104 1,124
ABC
A B C
AB AC BC
ALL
Jian Pei: Data Mining -- OLAP Indexes 31
The Array-based Approach
• Sort dimensions smartly– Keep the smaller aggregate planes in main
memory– Output aggregates in larger planes once they
are computed, and then reuse the counters• How to make it work when the main memory
usage is still too large?– Details in [Zhao, Deshpande and Naughton,
SIGMOD 97]
Jian Pei: Data Mining -- OLAP Indexes 32
Summary
• Indexes for OLAP– Bitmap index– Vertical storage
• Data cube computation – the multiway array approach
• Can you find some situations where bitmap index and vertical storage are useful?