new sampling-based summary statistics for improving approximate query answers p. b. gibbons and y....

24
y Statistics for Improvin g Approximate Query Answe rs P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

Upload: susan-arabella-stokes

Post on 18-Dec-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

New Sampling-Based Summary Statistics for Improving App

roximate Query AnswersP. B. Gibbons and Y. Matias (ACM SIGMOD 1998)

Rongfang LiFeb 2007

Page 2: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

2

Outline Introduction Concise samples Counting samples Application to hot list queries and

Experimental evaluation Conclusion

Page 3: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

3

Introduction In large data recording and

warehousing environments, it is often advantageous to provide fast, approximate answers to queries.

Approximate answer engine.

The goal is to develop effective data synopses that capture important information in a concise representation.

Data Warehouse

New Data

Queries

Response

Figure 1:A traditional data warehouse

Data Warehouse

New Data

Queries

Response

Figure 2: Data warehouse set-up for providingapproximate query answers.

Approx.AnswerEngine

Page 4: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

4

Effectiveness of a Synopsis Accuracy of the answers it provides Fast response time Update time

Large sample size is desirable

Effectiveness of a synopsis is evaluated as a function of its footprint, i.e., the number of memory words to store the synopsis.

Page 5: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

5

Two new sampling-based synopses Concise samples

<value, count> represents those values that appear more than once in the sample

Counting samples keep track of all occurrences of a value inserted int

o the relation since the value was selected for the sample.

Fast Incremental Maintenance

Page 6: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

6

Outline Introduction Concise samples Counting samples Application to hot list queries and

Experimental evaluation Conclusion

Page 7: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

7

Concise samples Observation: any value occurring frequently in a samp

le is a wasteful use of the available space

Definition 1: A concise sample is a uniform random sample of the data set such that values appearing more than once in the sample are represented as a value and a count, ex: <value, count>.

More sample points than traditional samples for the same footprint => more accurate

Page 8: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

8

Concise samples properties Consider a relation R with n tuples and an attribute A. The goal is to obtain

a uniform random sample of R.A, i.e., the values of A for a random subset of the tuples in R.

Definition: Let S = {<v1, c1>,…, <vj, cj>, vj+1,..., vl} be a concise sample. Then sample-size(S) = l-j+∑j

i = 1ci, and footprint(S) = l+j at most m/2 distinct values, footprint at most m, Lemma 1 For any footprint m ≥ 2, there exists data sets for which the sa

mple-size of a concise sample is n/m times lager than its footprint, where n is the size of the data set.

n/m times as many sample points as a traditional sample

Page 9: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

9

Concise samples – offline/static Offline/static computation Footprint is m

1. Repeat m times: select a random tuple from the relation and extract its value for attribute A.

2. Semi-sort the set of values, and replace every value occurring multiple times with a <value, count> pair.

3. Continue to sample until either adding the sample point would increase the concise sample footprint to m+1 or n samples have been taken.

For each new value sampled, look-up to see if it is already in the concise sample and then either add a new singleton value, convert a singleton to a <value, count> pair, or increment the count for a pair.

Page 10: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

10

Concise samples – Incremental maintenance More difficult than maintain a traditional sample

Maintenance algorithm: Let S be the current concise sample and consider a new tuple t. Set up an

entry threshold τ(initially 1) for new tuples to be selected for the sample.I. Add t.A to S with probability 1/τ. (flip a coin with heads probability 1/τ) II. Do a look-up on t.A in S.

a) if it is represented by a pair, its count is incremented. b) if t.A is a singleton in S, a pair is created, c) if it is not in S, a singleton is created.

III. Increase footprint by 1 in cases b) and c)IV. Evict existing sample points to create room. Raise the threshold to some τ’.

Subject each sample point in S to this higher threshold---T/T’ probability evicted. Subsequent inserts are selected for the sample with probability 1/τ’

I. For any sequence of insertions, the above algorithm maintains a concise sample.

Page 11: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

11

Experimental evaluation Evaluate the gain in the sample-size of concise sample

over traditional sample Insert 500k new values, value domain[1,D], where D is

varied from 500 to 50k. Footprint m = 1000 Compare three samples: traditional; concise online; concise offline;

Page 12: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

12

Concise Samples – experimental evaluation

Figure 3: Comparing sample-sizes of concise and traditional samples as a function of skew, for varying footprints and D/m ratios. In (a) and (b), authors compare footprint 100 and footprint 1000, respectively, for the same data sets. In (c) and (d), authors compare D/m = 50 and D/m = 5 , respectively, for the same footprint 1000.

D: potential number of distinct valuesm: footprint size

Page 13: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

13

Concise samples Update time overheads

The coin flips that must be performed to decide which inserts are added to the concise sample and to evict values from the concise sample when the threshold is raised

The lookups into the current concise sample to see if a value is already present in the sample

Page 14: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

14

Outline Introduction Concise samples Counting samples Application to hot list queries and

Experimental evaluation Conclusion

Page 15: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

15

Counting samples Counting samples – a variation on concise samples in which t

he counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample.

Definition: A counting sample for R.A with threshold T is any subset of R.A obtained as follows:1. For each value v occurring c times in R, we flip a coin with

probability 1/T of heads until the first heads, up to at most c coin tosses in all; if the ith coin toss is heads, then v occurs c-i+1 times in the subset, else v is not in the subset.

2. Each value v occurring c>1 times in the subset is represented as a pair <v, c>, and each value v occurring exactly once is represented as a singleton v.

Page 16: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

16

Counting samples (cont.) Incremental maintenance algorithm.Let S be the current counting sample and t be a new tuple 1. Set up an entry threshold T for new tuples to be selected.2. Look up on t.A in S3. t.A is not in S and we add it to S with probability 1/T.

Deletion

Theorem 4 Let R be an arbitrary relation, and let T be the current threshold for a counting sample S. (i) Any value v that occurs at least T times in R is expected to be in S. (ii) Any value v that occurs fv times in R will be in S with probability 1-(1-1/T) fv. (iii) For all α>1, if fv≥ αT, then with probability ≥ 1 - e-α, the value will be in S and its count will be at least fv - αT

Page 17: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

17

Outline Introduction Concise samples Counting samples Application to hot list queries and

experiment evaluation Conclusion

Page 18: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

18

Hot list queries Hot list queries

request an ordered set of <value, count> pairs for the k most frequently occurring data values, for some k.

ex: the top selling items in a database of sales transactions.

Page 19: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

19

Hot list queries Algorithms

Using traditional samples Using concise samples Using counting samples Using histogram on disk – maintains a full histogram on disk, i.

e., <value, count> pairs for all distinct values in R, with a copy of the top m/2 pairs stored as a synopsis within the approximate answer engine.

-- is considered only as a baseline for accuracy comparisons

Page 20: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

20

Application to hot list queries (cont.)

x-axis: rank of a valuey-axis: count for the values

Page 21: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

21

Application to hot list queries (cont.)

Page 22: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

22

Application to hot list queries (cont.)

Page 23: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

23

Application to hot list queries – overheads

Page 24: New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007

24

Conclusion Using concise samples may offer the best choice when

considering both accuracy and overheads.

In this paper, a batch-like processing of data warehouse inserts, in which inserts and queries do not intermix, is assumed. To address the more general case , issues of concurrency bottlenecks need to be addressed.

Future work is to explore the effectiveness of using concise samples and counting samples for other concrete approximate answer scenarios.