basics of database tuning

OLAP Indexes

Jian Pei: Data Mining -- OLAP Indexes 2

What Is a Data Warehouse?

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”

– W. H. Inmon• Data warehousing: the process of

constructing and using data warehouses


Index Requirements in OLAP

• Data is read only– No insertion or deletion

• Query types– Point query: look up one specific tuple (rare)– Range query: return the aggregate of a (large)

set of tuples, with group by– Complex queries: need specific algorithms and

index structures, will be discussed later


OLAP Query Example

• In table (cust, gender, …), find the total number of male customers

• Method 1: scan the table once• Method 2: build a B+ tree index on attribute

gender, still need to access all tuples of male customers

• Can we get the count without scanning many tuples?


Bitmap Index

• For n tuples, a bitmap index has n bits and can be packed into ⎡n /8⎤ bytes and ⎡n /32⎤words

• From a bit to the row-id: the j-th bit of the p-th byte row-id = p*8 +j cust gender …

Jack M …Cathy F …

… … …Nancy F …

1 0 … 0


Using Bitmap to Count

• Shcount[] contains the number of bits in the entry subscript– shcount[01100101]=4count = 0;for (i = 0; i < SHNUM; i++)

count += shcount[B[i]];


Advantages of Bitmap Index

• Efficient in space• Ready for logic composition

– C = C1 AND C2– Bitmap operations can be used

• Bitmap index only works for categorical data with low cardinality– Naively, we need 50 bits per entry to represent

the state of a customer in US– How to represent a sale in dollars?


Bit-Sliced Index

• A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits– 24 bits is good for up to $167,772.15,

appropriate for many stores• A bit-sliced index is N bitmaps

– Tuple j sets in bitmap k if the k-th bit in its binary representation is on

– The space costs of bit-sliced index is the same as storing the data directly


Using Indexes

SELECT SUM(sales) FROM Sales WHERE C;– Tuples satisfying C is identified by a bitmap B

• Direct access to rows to calculate SUM: scan the whole table once

• B+ tree: find the tuples from the tree• Projection index: only scan attribute sales• Bit-sliced index: get the sum from ∑(B AND

Bk)*2k


Cost Comparison

• Traditional value-list index (B+ tree) is costly in both I/O and CPU time– Not good for OLAP

• Bit-sliced index is efficient in I/O• Other case studies in [O’Neil and Quass,

SIGMOD’97]


Horizontal or Vertical Storage

• A fact table for data warehousing is often fat– Tens of even hundreds of dimensions/attributes

• A query is often about only a few attributes• Horizontal storage: tuples are stored one by one• Vertical storage: tuples are stored by attributes

A1 A2 … A100

x1 x2 … x100

… … … …z1 z2 … z100

A1 A2 … A100

x1 x2 … x100

… … … …z1 z2 … z100


Horizontal Versus Vertical

• Find the information of tuple t– Typical in OLTP– Horizontal storage: get the whole tuple in one search– Vertical storage: search 100 lists

• Find SUM(a100) GROUP BY {a22, a83}– Typical in OLAP– Horizontal storage (no index): search all tuples O(100n),

where n is the number of tuples– Vertical storage: search 3 lists O(3n), 3% of the

horizontal storage method• Projection index: vertical storage


Outline

• OLAP and data warehousing: concepts• Indexing for OLAP operations• Data cube computation

– Computing complete data cubes– Compression


MOLAPDate

Prod

uct

Cou

ntrysum

sumTV

VCRPC

1Qtr 2Qtr 3Qtr 4QtrU.S.A

Canada

Mexico

sum


Pros and Cons

• Easy to implement• Fast retrieval• Many entries may be empty if data is sparse• Space costly


ROLAP – Data Cube in Table

• A multi-dimensional databaseBase table

Dimensions Measure

S1 P1 Spring 6S1 P2 Spring 12S2 P1 Fall 9

Store Product Season AVG(Sales)

S1 * Spring 9… … … …* * * 9

9FallP1S212SpringP2S16SpringP1S1

MeasureDimensionsSalesSeasonProductStore

Cubing


Pros and Cons

• Compact in space• Using mature relational technology• Overhead in query answering


HOLAP

• Hybrid OLAP• Store detail data in ROLAP

– Save space• Store aggregates in MOLAP

– Fast query answering


Cube Computation

• Given a base table B(D1, …, Dn, M), compute the set of all aggregates in the data cube using aggregate function f()

• D1, …, Dn are n dimensions and M is a measure


Aggregate Cells

• A cell c=(c1, …, cn) is called an aggregate cell, provided each ci is either in Di or ALL– Generally, we assume that each tuple in B has

distinct dimension values, i.e., (D1, …, Dn) is a key in B

– Value ALL is also conventionally written as *• Cell c is a base cell if (c, m) is a tuple in the

base table• A cell c is not empty if there are some base

cells match all non-ALL attribute values in c


Example

• (S1, ALL, Spring), (S2, P1, Fall) and (S1, ALL, Winter) are aggregate cells

• (S2, P1, Fall) is a base cell• (S1, ALL, Winter) is empty• (S1, ALL, Spring) = (S1, *, Spring)

Base tableDimensions Measure

S1 P1 Spring 6S1 P2 Spring 12S2 P1 Fall 9

Store Product Season Sales


Cuboid Lattice

• All aggregate cells belong to the same group by form a cuboid

(*,*,*)

(Store,*,*) (*,Product,*) (*,*,Season)

(Store,Product,*) (Store,*,Season) (*,Product,Season)

(Store,Product,Season)

Drill down

Roll up


Tuples in the base table

(S1,P1,s):6 (S1,P2,s):12 (S2,P1,f):9

(S1,*,s):9 (S1,P1,*):6 (*,P1,s):6 (S1,P2,*):12 (*,P2,s):12 (S2,*,f):9 (S2,P1,*) (*,P1,f):9

(S1,*,*):9 (*,*,s):9 (*,P1,*):7.5 (*,P2,*):12 (*,*,f):9 (S2,*,*):9

(*,*,*):9

Cube (Cell) Lattice


A Naïve Approach

• Set one counter for every aggregate cell• Scan the base table once, compute all

aggregate cells• Time complexity O(n), where n is the

number of tuples in the base table• Space overhead O(π(|Di|+1))

– 10 dimensions, each dimension has cardinality 10, (1110 – 1010) = 15,937,424,601 counters are needed!


Real Data Sets Are Often Sparse

• Many aggregate cells are empty• Example: 10 dimensions, each dimension

has cardinality 10– 1010 possible base cells– Even a base table has 1 billion tuples, still at

least 90% of the possible base cells are empty!


Multiway Aggregate Computation

• Compute ABC• Compute AB, AC, and

BC from ABC• Compute A, B, and C

from AB, AC, and BC• Compute ALL from A,

B, C

ABC

A BC

AB AC BC

ALL


Main Memory Allocation

• Do we really need to allocate main memory for AB, AC, and BC, respectively?– The cuboids still can be very large

• Observation: A, B, and C can be computed from any two of AB, AC, BC

ABC

A B

AC BC

C

AB

ALL


Multi-way Array Aggregation (1)

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0a2 a3

C

4428 56

4024 52

3620

60

B


Multi-way Array Aggregation (2)

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0a2 a3

C

4428 56

4024 52

3620

60

B


Ideas

• Consider a base table (A, B, C), where |A|=10, |B|=100, |C|=1000 Order A-B-C C-B-A

(A,B,C) 1 1(A,B,*) 1 10*100(A,*,C) 1000 10(*,B,C) 100*1000 1(A,*,*) 1 10(*,B,*) 100 100(*,*,C) 1000 1(*,*,*) 1 1Total 102,104 1,124

ABC

A B C

AB AC BC

ALL


The Array-based Approach

• Sort dimensions smartly– Keep the smaller aggregate planes in main

memory– Output aggregates in larger planes once they

are computed, and then reuse the counters• How to make it work when the main memory

usage is still too large?– Details in [Zhao, Deshpande and Naughton,

SIGMOD 97]


Summary

• Indexes for OLAP– Bitmap index– Vertical storage

• Data cube computation – the multiway array approach

• Can you find some situations where bitmap index and vertical storage are useful?

basics of database tuning

Documents