Download - Strategies for Processing Ad Hoc Queries on Large Data Warehouses

Strategies for ProcessingAd Hoc Queries

on Large Data Warehouses

Kurt StockingerCERN

John Wu & Arie ShoshaniLawrence Berkeley National Lab

November 2002 strategies for processing ad hoc queries 2

Outline

Motivation for designing our own software Many large scientific data warehouses need to

process ad hoc queries Lack of efficient indices

Issues to discuss Vertical partitioning Bitmap index

Compression – how to store the bitmaps Persistent storage – where to store the bitmaps


Example: High-Energy Physics Experiment STAR

Current data size 20 million collision events each event ~10 KB in size

Production data rate 100 million records / year ~ 1 TB per year

Scientists may query any of the 500 or so attributes Each query may involve conditions on 5 ~ 8

attributes Energy > 100 & Particles > 500 & …

Near real-time evaluation desired


Many Scientific Applications Involve Large Datasets

Sloan Digital Sky Survey: http://www.sdss.org

Earth Observing System: http://eos.nasa.gov Large Hadron Collider: http://lhc.web.cern.ch Genomes to life: http://doegenomestolife.org Combustion: http://scidac.psc.edu PCMDI: http://www-pcmdi.llnl.gov

http://www.sdss.org/



http://eos.nasa.gov/






http://lhc.web.cern.ch/






http://doegenomestolife.org/



http://scidac.psc.edu/






http://www-pcmdi.llnl.gov/







Searching and Indexing Requirements

Some common features of the large scientific datasets Read-mostly: data warehouses Large high-dimensional data: millions or billions of

records, each record with tens or hundreds of attributes Many queries are high-dimensional partial range queries Most users desire to modify queries interactively

Existing database software not specialized for these tasks: slow

Need new special purpose software BMI: bitmap index, CERN IBIS: independent bitmap index and search, LBNL


Issues to Be Discussed

Organization of the primary data, i.e., the user data Viewing the primary data as a 2-D table

Horizontal partition: used in transactional systems Vertical partition: good for partial range queries

Indexing strategies: Tree based schemes: not effective for dimensions > 10 Bitmap index: well suited for partial range queries

Storage scheme for the index data BMI: Store bitmaps as objects in an object-oriented

database (ODBMS) IBIS: Store bitmaps as simple files


Horizontal vs. Vertical Partitioning

Horizontal partitioning Data elements of a record

are stored consecutively Good for accessing one

record at a time Used in relational DBMS

systems where records are frequently updated Typically 60~70% of

bytes of each page is used

Vertical partitioning All records of an attribute

are stored consecutively Good for accessing

multiple records by attribute selection

Suitable for data warehousing systems where records are rarely modified May use 100% of bytes

of each page


Performance Advantage of Vertical Partitioning

Experiment with 2.2 million records of STAR data (10 attributes only)

The figure on the right shows the time to search without an index

Query box size is the relative volume of the hypercube formed by range conditions

The disk system supports about 20 MB/s sustained reading

For answering a query like “A > 5”, the time used by a relational DBMS is proportional to number of attributes in the table 500 attributes, 500 times

slower

Vertical partitioning is effectivefor partial range queries


Brief Overview of Index Data Structures

One dimensional index data structures: Total order for one-dimension Hash-based: Optimized for exact match queries, e.g. E =

106 Tree-based: Optimized for range queries, e.g. E < 106

Most widely used: B+-tree (1972): Multidimensional index data structures

No total order for all dimensions Hash-based: Grid-File, Bang-File, … Tree based: R-Trees, Pyramid-Tree, … Bitmap Indices: Effective for data warehousing environments Linearize to introduce total order, then use one-dimensional

indices


Basic Bitmap Index

a) List of attributes b) Bitmap Index (equality encoding)

a) List of 12 attributes with 10 distinct attribute values, i.e attribute cardinality = 10

b) For each distinct attribute value, one bit slice is created, i.e bitmap index consists of 10 bitmaps (E0 to E9)

Bit Slice E2 encodesattributes with value 2


Pros and Cons of Bitmap Indices

Pros: Easy to build and to maintain Easy to identify records that satisfy a complex multi-

attribute predicate (multi-dimensional ad-hoc queries) Very space efficient for attributes with low cardinality

(number of distinct attribute values, e.g. “Yes”, “No”)

Cons: Space inefficient for attributes with high cardinality An effective strategy: Bitmap Compression Other strategies: binning, encoding


Bitmap Compression

Advantages: Less disk space for storing indices Indices can be read from disk faster More indices can be cached in memory

Possible problems: Increases the complexity of the software If bitmaps must be decompressed before

performing Boolean operations, the decompression overhead might outweigh the advantages of compression Use compression schemes that work directly on

compressed data


Various Bitmap Compression Algorithms

Run Length Encoding (RLE): one-sided (asymmetric) vs. two-sided (symmetric)

Gzip (Lempel-Ziv, LZ): verbatim (uncompressed) bitmap is compressed via

zlib ExpGol:

Variable bit length encoding (RLE-bitmap is compressed)

Byte-Aligned Bitmap Compression (BBC): Variable byte length encoding (Oracle patent) One-sided vs. two-sided (BBC1 vs. BBC2)

Word-Aligned Hybrid (WAH): Fixed word based encoding


Relative Strength of Different Compression Schemes

uncompressedWAH

space

speed

better

gzip

BBC

ExpGolPacBits


WAH Compression & Bitmap Index Implementations

Compression Schemes Designed for reducing the CPU-complexity of

logical operations when compared to BBC, 10 X speedup

However, lower compression factor, i.e. the sizes of the WAH-compressed bitmaps are some 40-60% larger than BBC-compressed bitmaps

Storage scheme BMI: Bitmap Index implementation on top of

ODBMS (CERN) IBIS: Bitmap Index implementation based on plain

files (LBL)


Test Setup

Real application data (STAR) : 2.2 million records Synthetic dataset I: 100 million records Synthetic dataset II: 5 million records

Only the performance of the bitwise logical operation “AND” is reported

Other logical operations such as OR, XOR, etc. show similar relative differences

Most of the benchmarks were executed on three different machines with various CPU and I/O subsystems


In Memory Logical Operation“AND”

WAH is always the fastest, 2X – 20X

On tin, 400MHz P3

On dms, 300MHz PII

On dm, 450MHzUltraSPARC


Search Time (Including File IO)

On dm, 20MB/s IO On tin, 2MB/s IO

To answer the queries: read two bitmaps from files, perform one logical “AND”Unless using a very slow disk, it is worth-while to use WAH compression


With BBC, Searching Operation Spends Little Time in IO

On dm, 20MB/s IO On tin, 2MB/s IOThe percentage of time spent in IO on different bitmapsThis percentage is expected to be high, but it is actually low with BBCWAH reduce CPU time, and searching is again IO bound


Sizes of Compressed Bitmaps

The total size of a bitmap index compressed with WAH is typically 40-60% larger than that compressed with BBC

BBC-s: simplified (LBL)BBC-f: full (AT&T + CERN)


Sizes of Compressed Bitmaps

The figure on the right plot the maximum size of the bitmap index against the attribute cardinality of an attribute with 100 million (108) records

In the worst case, the size of the compressed bitmap index is about 400 million words, 4 times the size of the primary data

For most high-cardinality attributes, the compressed bitmap index size is smaller than that of a typical B-tree index(~ 3X primary data)

The compressed bitmap index sizesare usually smaller than B-tree

B-tree


Query PerformanceIBIS vs. RDBMS

Size(MB)

Create(sec)

Query(sec)

IBIS WAH 166 91 0.7

IBIS BBC-s

117 116 2.9

RDBMS 123(247)

2890 3.1

Accessing bitmaps in files (IBIS) has about the same efficiency as accessing bitmaps within an RDBMS

The DBMS tested uses a BBC compressed bitmap index similar to our BBC compressed index

Used real application data

WAH compressed index is 4X more efficient than BBC compressed index


Query PerformanceFile (IBIS) vs. ODBMS (BMI)

b) “warm” files a) “cold” files

Figures on the left time needed to process 5-dimensional queries on tin

Queries on synthetic data IBIS with WAH uses the

least amount of time ODBMS overhead 4X Due to file system

caching, IBIS is ~10X faster on files that have been accessed before (“warm” files)


Conclusions

We have shown that BBC is CPU-bound rather than I/O-bound as assumed in the past

WAH is much more (10X) CPU-efficient than BBC Building bitmap indices on top of ODBMS

introduces about 4X overhead when compared to using plain files

Building bitmap indices inside DBMS (as in many commercial systems) shows higher efficiency

Processing multi-dimensional range queries is efficient with WAH compressed bitmap indices

Read-only data should be vertically partitioned

Download - Strategies for Processing Ad Hoc Queries on Large Data Warehouses

Top Related