Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Materialized

Materialized Sample Views for Database Approximation

Shantanu Joshi, Christopher Jermaine

University of Florida

Department of Computer and Information Sciences and Engineering

Gainesville, FL, USA

{ssjoshi,cjermain}@cise.ufl.edu

Abstract

We consider the problem of creating a sample view of a database table. A sample view is an indexed, materialized view that permits efficient sampling from an arbitrary range query over the view. Our core technical contribution is a new file organization called the ACE Tree that is suitable for organizing and indexing a sample view.

1 Introduction

In this paper, we propose the materialized sample view as a convenient abstraction for allowing efficient random sampling from a relational database. For example, consider the following database schema:

SALE (DAY, CUST, PART, SUPP)

Imagine that we want to support fast, random sampling from this table, and most of our queries include a temporal range predicate on the DAY attribute. A materialized sample view over SALE can be specified with the following SQL-like query:

CREATE MATERIALIZED SAMPLE VIEW MySam
AS SELECT * FROM SALE
INDEX ON DAY;

The primary technical contribution of this paper is a novel index structure called the ACE Tree (Appendability, Combinability, Exponentiality; see Section 3), which can be used to efficiently implement a materialized sample view. Such a view, stored as an ACE Tree, has the following characteristics:

• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [4] or by scanning a randomly permuted file.

• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query.

• The sample view is created efficiently, requiring only two external sorts of the records in the view.

2 Overview of the ACE Tree

Our strategy uses a new data structure called the ACE Tree to index the records in the sample view. At the highest level, the ACE Tree partitions a data set into a large number of different random samples such that each is a random sample without replacement from one particular range query. When an application asks to sample from some arbitrary range query, the ACE Tree and its associated algorithms filter and combine these samples so that very quickly, a large and random subset of the records satisfying the range query is returned in an online fashion.

2.1 ACE Tree Leaf Nodes

The ACE Tree stores database records in a large set of leaf nodes on disk. Every leaf node has two components:

1. A set of h ranges, where h is the height of the ACE Tree. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical: that is, L.R1 ⊃ L.R2 ⊃ ··· ⊃ L.Rh. The first range in any leaf node, L.R1, always contains a uniform random sample of all records of the database.

2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.
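The two leaf-node components described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: keys are integers, ranges are inclusive (low, high) pairs, a record is represented by its key alone, and the names `LeafNode` and `is_well_formed` as well as the example keys are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (low, high) key bounds; integer keys are an assumption


@dataclass
class LeafNode:
    ranges: List[Range]        # ranges[i - 1] plays the role of L.Ri
    sections: List[List[int]]  # sections[i - 1] plays the role of L.Si (record keys only)

    def is_well_formed(self) -> bool:
        """Check that ranges are strictly nested and each key lies in its section's range."""
        if len(self.ranges) != len(self.sections):
            return False
        for (olo, ohi), (ilo, ihi) in zip(self.ranges, self.ranges[1:]):
            inside = olo <= ilo and ihi <= ohi
            if not inside or (olo, ohi) == (ilo, ihi):
                return False
        return all(
            lo <= key <= hi
            for (lo, hi), section in zip(self.ranges, self.sections)
            for key in section
        )


# A leaf of height h = 4 with invented record keys.
leaf = LeafNode(
    ranges=[(0, 100), (0, 50), (26, 50), (38, 50)],
    sections=[[7, 39, 70], [14, 44], [27, 40], [38, 46]],
)
print(leaf.is_well_formed())
```

The check covers only the structural conditions (nesting and key membership); whether each section is in fact a random subset depends on how the tree is built, which this sketch does not model.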

2.2 ACE Tree Internal Nodes

Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes used to index leaf nodes, and leaf nodes used to store the actual data. Each internal node has the following components:

1. A range R of key values associated with the node.

2. A key value k that splits R and partitions the data on the left and right of the node.

3. Pointers ptr_l and ptr_r that point to the left and right children of the node.

4. Counts cnt_l and cnt_r that give the number of database records falling in the ranges associated with the left and right child nodes.
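As a rough illustration of the four components listed above, the sketch below models an internal node and shows how the ranges met on a root-to-leaf path form the nested ranges of a leaf. It assumes integer keys with the left child covering [lo, key] and the right child covering [key + 1, hi]; the class name, the field names, the helper `path_ranges`, and the example split keys (50, 25, 37) are illustrative choices, not definitions taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InternalNode:
    lo: int                                 # lower bound of the node's range R
    hi: int                                 # upper bound of the node's range R
    key: int                                # split key k
    ptr_l: Optional["InternalNode"] = None  # left child (None stands in for a leaf)
    ptr_r: Optional["InternalNode"] = None  # right child (None stands in for a leaf)
    cnt_l: int = 0                          # records in the left child's range
    cnt_r: int = 0                          # records in the right child's range


def path_ranges(root: InternalNode, path: str) -> List[Tuple[int, int]]:
    """Return the ranges seen along a path of 'L'/'R' steps, starting with the root's range.

    The path must not be longer than the depth of the tree below the root.
    """
    ranges = [(root.lo, root.hi)]
    node = root
    for step in path:
        if step == "L":
            ranges.append((node.lo, node.key))
            node = node.ptr_l
        else:
            ranges.append((node.key + 1, node.hi))
            node = node.ptr_r
    return ranges


# Only the nodes on one path are built here.
i32 = InternalNode(26, 50, 37)
i21 = InternalNode(0, 50, 25, ptr_r=i32)
root = InternalNode(0, 100, 50, ptr_l=i21)
print(path_ranges(root, "LRR"))  # [(0, 100), (0, 50), (26, 50), (38, 50)]
```

Under these assumptions, the list returned for a leaf has one entry per level, which matches the idea that a leaf carries h hierarchical ranges.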

[Figure 1: tree diagram over the key range 0-100 with internal nodes I1,1, I2,1, I2,2, and I3,1 through I3,4, leaf nodes L1 through L8, and leaf L4 expanded into its sections L4.S1 to L4.S4.]

Figure 1. Logical Structure of the ACE Tree

2.3 Example Query Execution in ACE Tree

Let Q = [30-65] be our example query posed over the ACE Tree depicted in Figure 1. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the algorithm decides to explore I2,1. Since the left child range of I2,1 has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node, I3,2, the algorithm picks leaf node L3 to be the first leaf node retrieved by the index. Records from section 1 of L3 are filtered for Q and returned immediately as a random sample from Q, while records from sections 2, 3 and 4 are stored in memory.

Next, the algorithm again starts at the root node and now chooses to explore the right child node I2,2. After performing range comparisons, it explores I3,3 since I3,4.R has no overlap with Q. The algorithm chooses to visit the left child node of I3,3 next, which is leaf node L5. The records of L5.S1 are filtered and returned immediately. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are then filtered and returned. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned. Finally, section 4 records are stored in memory for later use.
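The walk-through above can be replayed with a short sketch. It processes L3 and L5 together rather than one at a time, so it shows only the end state after both leaves are read. The section ranges are an assumed reading of Figure 1 (for example, 26-37 and 51-62 for the fourth sections), the record keys are invented, and the helper names `merge_adjacent` and `process_pair` are hypothetical.

```python
Q = (30, 65)

# Assumed section ranges for L3 and L5, with made-up record keys.
L3 = {
    "ranges": [(0, 100), (0, 50), (26, 50), (26, 37)],
    "sections": [[5, 33, 64, 90], [12, 31, 48], [29, 40, 45], [27, 35]],
}
L5 = {
    "ranges": [(0, 100), (51, 100), (51, 75), (51, 62)],
    "sections": [[20, 58, 77], [55, 66, 93], [52, 61, 70], [53, 60]],
}


def merge_adjacent(a, b):
    """Return the union of two inclusive integer ranges if it is one interval, else None."""
    (lo1, hi1), (lo2, hi2) = sorted([a, b])
    if lo2 <= hi1 + 1:
        return (lo1, max(hi1, hi2))
    return None


def process_pair(first, second, q):
    """Pool same-numbered sections; emit filtered keys when the pooled range covers q."""
    emitted, buffered = [], []
    pairs = zip(first["ranges"], second["ranges"])
    for i, (ra, rb) in enumerate(pairs):
        merged = merge_adjacent(ra, rb)
        if merged is not None and merged[0] <= q[0] and q[1] <= merged[1]:
            pool = first["sections"][i] + second["sections"][i]
            emitted.extend(k for k in pool if q[0] <= k <= q[1])
        else:
            buffered.append(i + 1)  # section number kept in memory for later
    return emitted, buffered


emitted, buffered = process_pair(L3, L5, Q)
print(emitted)   # [33, 64, 58, 31, 48, 55, 40, 45, 52, 61]
print(buffered)  # [4]
```

Sections 1, 2 and 3 pool to the ranges 0-100, 0-100 and 26-75, each of which covers Q, so their records are filtered and emitted; the fourth sections cover 26-37 and 51-62, which leave a gap inside Q, so they stay buffered.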

3 Properties of the ACE Tree

The various samples produced from processing a set of leaf nodes are combinable. If k sections of this set of leaf nodes are combined and section i has $n_i$ records, the combined sample will have $n_1 + n_2 + \cdots + n_k$ records. This is the Combinability property of the ACE Tree.

The Appendability property states that, given two leaf nodes $L_1$ and $L_2$, $L_1.S_i \cup L_2.S_i$ is always a true random sample of all records of the database with key values within the range $L_1.R_i \cup L_2.R_i$.

The ranges in a leaf node are exponential. The number of database records that fall in $L.R_i$ is twice the number of records that fall in $L.R_{i+1}$. This allows the ACE Tree to maintain the invariant that for any query $Q$ over a relation $R$ such that at least $h\mu$ database records fall in $Q$, and with $|R|/2^{k+1} \le |\sigma_Q(R)| \le |R|/2^{k}$, $\forall k \le h-1$, there exists a pair of leaf nodes $L_i$ and $L_j$ where at least one-half of the database records falling in $L_i.R_{k+2} \cup L_j.R_{k+2}$ are also in $Q$. Here $\mu$ is the average number of records in each section, and $h$ is the total number of sections in any leaf node.
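To make the bound on k concrete, the sketch below computes k from a relation size and a query result size, using the three selectivities that appear in the experiments (0.25%, 2.5%, 25%). The relation size of 1,000,000 records and the function name `exponent_k` are assumptions made only for this example. When the result size is an exact power-of-two fraction of the relation, both neighbouring values of k satisfy the inequality, and this sketch returns the larger one.

```python
def exponent_k(total_records: int, matching_records: int) -> int:
    """Largest k with matching_records <= total_records / 2**k (integer arithmetic only)."""
    if not 0 < matching_records <= total_records:
        raise ValueError("need 0 < matching_records <= total_records")
    k = 0
    while matching_records * 2 ** (k + 1) <= total_records:
        k += 1
    return k


TOTAL = 1_000_000  # hypothetical |R|
for matching in (2_500, 25_000, 250_000):
    k = exponent_k(TOTAL, matching)
    # The invariant then refers to section number k + 2 of a suitable pair of leaves.
    print(matching, k, k + 2)
```

For these inputs k comes out as 8, 5 and 2, so the sections named by the invariant are numbers 10, 7 and 4; the invariant applies only when the tree is tall enough for those sections to exist.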

The net result of this Exponentiality property is that there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q.

4 Query Algorithm

The algorithm has been designed to meet the primary goal of achieving “fast-first” sampling, which means it attempts to be greedy on the number of records relevant for the query in the early stages of execution.

At a high level, the query answering algorithm retrieves relevant leaf nodes via a series of root-to-leaf node stabs or traversals. The distinctive feature of the algorithm is that at each internal node that is traversed during a stab, it chooses to access the child node that was not chosen the last time the node was traversed. This toggling of the choice of the next node causes the algorithm to shuttle back and forth among the leaf nodes.
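The toggling behaviour can be sketched as follows. This is not the detailed algorithm from the full version of the paper; it is a minimal model that assumes a perfect binary tree, assumes that the toggle applies only when both children overlap the query (otherwise the single overlapping child is taken), and assumes the first visit goes left. The leaf ranges are an assumed reading of Figure 1, and `Node`, `build_tree` and `stab` are hypothetical names.

```python
class Node:
    def __init__(self, lo, hi, left=None, right=None, name=""):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right
        self.name = name
        self.next_left = True  # direction to take the next time both children qualify


def build_tree(leaf_ranges):
    """Pair nodes bottom-up; the number of leaves must be a power of two."""
    nodes = [Node(lo, hi, name=f"L{i + 1}") for i, (lo, hi) in enumerate(leaf_ranges)]
    while len(nodes) > 1:
        nodes = [Node(a.lo, b.hi, a, b) for a, b in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]


def overlaps(node, q_lo, q_hi):
    return node.lo <= q_hi and q_lo <= node.hi


def stab(root, q_lo, q_hi):
    """One root-to-leaf traversal; returns the name of the leaf that is reached."""
    node = root
    while node.left is not None:
        left_ok = overlaps(node.left, q_lo, q_hi)
        right_ok = overlaps(node.right, q_lo, q_hi)
        if left_ok and right_ok:
            go_left = node.next_left
            node.next_left = not node.next_left  # toggle for the next visit
        else:
            go_left = left_ok
        node = node.left if go_left else node.right
    return node.name


LEAF_RANGES = [(0, 12), (13, 25), (26, 37), (38, 50),
               (51, 62), (63, 75), (76, 88), (89, 100)]

tree = build_tree(LEAF_RANGES)
print([stab(tree, 30, 65) for _ in range(4)])  # ['L3', 'L5', 'L4', 'L6']
```

For Q = [30-65] the first two stabs reach L3 and then L5, as in the example of Section 2.3. The sketch does not track which leaves were already retrieved, so a fifth stab would return to L3; a real implementation would need that bookkeeping.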

The advantage of retrieving leaf nodes in this sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a given number of stabs, thus covering large ranges that do not overlap. This facilitates appending sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.

[Figure 2: three plots of number of records (y-axis, ×10^5) against time in seconds (x-axis), each comparing the ACE Tree, the B+ Tree, and the randomly permuted file. (a) Query selectivity 0.25%. (b) Query selectivity 2.5%. (c) Query selectivity 25%.]

Figure 2. Number of random samples obtained versus time for three queries of varying selectivity. Each graph shows the number of records obtained by all three sampling techniques at every time instant.

The detailed algorithm and its analysis appear in the full version of this paper [3].

5 Benchmarking

For the experiments, we implemented the ACE Tree query algorithm and compared it to Antoshenkov’s algorithm [2] for random sampling from a ranked B+-Tree, as well as to sampling from a randomly permuted file, which is the standard sampling technique used in previous work on Online Aggregation [1].
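For readers unfamiliar with the randomly permuted file baseline, the following sketch shows the general idea: the records are shuffled once, and a range query is then answered by scanning the file in order and keeping the records that match. It is an in-memory illustration, not the experimental setup; the function names, the seed parameter, and the 1,000-record data set are assumptions.

```python
import random


def build_permuted_file(records, seed=0):
    """Shuffle the records once, standing in for writing them to disk in random order."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled


def sample_online(permuted, lo, hi):
    """Scan in file order, yielding (key, records_scanned_so_far) for each match."""
    for scanned, key in enumerate(permuted, start=1):
        if lo <= key <= hi:
            yield key, scanned


permuted = build_permuted_file(range(1000), seed=1)
samples = list(sample_online(permuted, 0, 24))  # a query with 2.5% selectivity
print(len(samples))  # 25
```

Because the file order is random, the matches found in any prefix of the scan form a random sample of the query result. The cost is that, on average, only a fraction of the scanned records equal to the query selectivity is useful, so very selective queries return samples slowly.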

5.1 Experiments

For each of the three sampling techniques, Figure 2(a) shows the average number of random samples obtained as a function of time over ten different queries, each with a selectivity of 0.25%. Figures 2(b) and 2(c) show similar results for queries with selectivities of 2.5% and 25%, respectively.

5.2 Discussion of Experimental Results

Several observations follow from the experimental results. Irrespective of the selectivity of the query, the ACE Tree clearly provides a much faster sampling rate during the first few seconds of query execution than the other two approaches.

Another observation is that for highly selective queries, the randomly-permuted file is almost useless and the B+-Tree performs far better, while for less selective queries, the randomly-permuted file works better than the B+-Tree. The net result is that if an ACE Tree were not used, it would probably be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure satisfactory performance in the general case.

6 Conclusion and Future Work

In this paper we have presented the novel concept of a sample view, which is an indexed, materialized view of an underlying database relation. The sample view facilitates efficient random sampling of records satisfying a relational range predicate. We have described the ACE Tree, a new indexing structure that we use to index the sample view. We have shown experimentally that with the ACE Tree index, the sample view can be used to provide an online random sample with much greater efficiency than the obvious alternatives. For applications like online aggregation or data mining that require a random ordering of input records, this makes the ACE Tree and the sample view a natural choice for random sampling.

One area for future work is to add the ability to handle incremental updates to the sample view.

References

[1] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In SIGMOD, pages 171–182, 1997.

[2] G. Antoshenkov. Random Sampling from Pseudo-Ranked B+ Trees. In VLDB, pages 375–382, 1992.

[3] http://www.cise.ufl.edu/~ssjoshi/ACEfullversion.pdf.

[4] F. Olken. Random Sampling from Databases. Ph.D. Dissertation, 1993.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

