Top-K Oracle: A New Way to Present Top-K Tuples for Uncertain Data

Chunyao Song, Zheng Li, Tingjian Ge Department of Computer Science, University of Massachusetts, Lowell, MA, USA

{csong, zli, ge}@cs.uml.edu

Abstract—Managing noisy and uncertain data is needed in a great number of modern applications. A major difficulty in managing such data is the sheer number of query result tuples with diverse probabilities. In many cases, users have a preference over the tuples in a deterministic world, determined by a scoring function. Yet it has been a challenging problem to return the top-k for uncertain data. Various semantics have been proposed, and they have been shown to give wildly different tuple rankings. In this paper, we propose a completely different approach. Instead of returning tuples to users, which are merely one point in the complex distribution of top-k tuple vectors, we provide a so-called top-k oracle that users can arbitrarily query. Intuitively, an oracle is a black box that, whenever given an SQL query, returns its result. Any information we give is based on faithful, best-effort estimates of the ground-truth top-k tuples. This is especially critical in emergency response applications and in monitoring top-k applications. Furthermore, we are the first to provide the nested query capability with the uncertain top-k result being a subquery. We devise various query processing algorithms for top-k oracles, and verify their efficiency and accuracy through a systematic evaluation over real-world and synthetic datasets.

I. INTRODUCTION

A great number of data management tasks today call for the need to store and query probabilistic and uncertain data. These applications include data integration [9], information extraction [13], sensor networks [5], smartphones (e.g., location-based applications [22]), social networks [1], and scientific computing [25]. For instance, RFIDs often give ambiguous detections of the presence of an object, with its location possibly at one of a few places. In general, it is infeasible to completely clean the data, and the most common approach is to associate a tuple probability with each detection record, indicating the confidence of the detection. Top-k (a.k.a. ranking) queries prove to be very useful in these applications [16].

EXAMPLE 1. There has been recent research that uses wireless medical sensor networks in emergency response (e.g., [10]). Consider an earthquake or another natural disaster. For years, the "first responders" on the scene manually measured vital signs, documented assessments on paper, and communicated with a medical center over handheld radios. When disasters occurred, the large numbers of casualties quickly and easily overwhelmed the responders. But it is well known that the first hour or two is often crucial in saving a patient's life. In that project [10], the first responders quickly attach a tag to each patient, which contains a variety of sensor add-ons (GPS, pulse oximetry, blood pressure, temperature, ECG) and relays data over a sensor network to a server. A table at the server may look like TABLE I.

Each record of the table contains a detection from a tag. The id attribute identifies the patient at a specific location, which belongs to a certain area. The score attribute indicates the urgency: the higher the score, the more urgent it is that the patient receive medical treatment for survival. The score, the amount of medical supply A needed, and the care category (on-scene care, to be taken to the local trauma center or burn center, etc.) are estimated by various algorithms based on the tag signals, which also give a confidence (a probability, in the last column) of the detection and estimation. The uncertainty comes from the unreliability of the sensors and the wireless connection, the limited vital signs provided to the algorithms due to size constraints of the tags, etc. There can be mutual exclusion correlations among tuples (e.g., the first and fourth tuples) when they come from the same patient.

TABLE I
A TABLE MAINTAINED AT THE SERVER

id | location  | area | score | supply A needed | care category | conf.
1  | (15, 22)  | A2   | 95    | 300             | on scene      | 0.3
2  | (205, 6)  | A1   | 89    | 50              | trauma center | 0.6
3  | (251, 17) | A1   | 79    | 900             | burn center   | 0.7
1  | (15, 22)  | A2   | 76    | 200             | on scene      | 0.5
2  | (205, 6)  | A1   | 73    | 100             | on scene      | 0.4

This system is clearly much more scalable than the old one that requires manual measurement. A staff member at the center can query the most urgent top-k tuples in TABLE I based on the score and confidence, and determine the right actions. For instance, she could request a first responder on the scene to attach more sensors to specific patients to retrieve detections with a higher confidence. She could also send ambulances to take urgent patients to a trauma center, etc. We will use this as a running example. □

The ranking problem on uncertain tuples is difficult due to the complex interaction between preference scores and tuple probabilities. Various semantics have been proposed, such as U-Topk, U-kRanks [24], PT-k [15], and expected ranks [6]. Li et al. [19] empirically illustrate that these natural definitions give wildly different and conflicting top-k results! Instead, they propose general and powerful parameterized ranking functions (PRF) that subsume almost all previous definitions as special cases. The PRF's parameters have to be learned from a set of sample tuples for which the user needs to provide preferences. However, it may not be easy for the user to provide preferences for the training set samples either. For example, seeing the first two tuples of TABLE I, the medical personnel might have no more knowledge than the database server about which tuple should be ranked higher. To put it differently, the user has no more understanding of the interplay between score and probability than the database server.

Indeed, it is difficult to pick k tuples to return to a user. This is not surprising. Using the possible world semantics, Figure 1 shows the top-3 tuple score vector distribution from a table of 15 rows, where the color intensity of each dot indicates the probability magnitude (i.e., a darker dot has a higher probability). Even for k = 3, the possible top-k vectors are already widely dispersed and many points have close probabilities. A greater k (which is hard to visualize) would increase the number of possible worlds in top-k and further decrease the probability of each point. All previous semantics typically return k tuples, which only map to one point in the k-dimensional space. Unfortunately, wherever that point is, it does not tell much about the whole complex distribution. Moreover, the distribution of another attribute in top-k tuples (e.g., area in TABLE I) would even be different from Figure 1 (while ranking is still on the score attribute).

Fig. 1. Distribution of top-3 tuple vectors

Our Contributions. We tackle the problem from a different angle with a novel approach. First, we note that there exists a ground-truth top-k tuple vector (without any uncertainty) v* = (t_1*, …, t_k*). Of course, we do not know v*; but whatever information the system gives to users about top-k should be based on a best-effort estimate of v*. Thus, we do not directly return tuples to users, but let users retrieve any information on v*. We provide what we call a top-k oracle and let users write any queries over it. This is because any k tuples could be arbitrarily far from the ground-truth top-k, as they may have a very small probability. Intuitively, an oracle is a black box that, whenever given a query, returns its result.

The rationale is as follows. Top-k queries have been studied and widely used [16]. There are typically two ways in which a user could use a top-k result: (1) to browse it and get some information, or (2) to further query it, by putting the top-k result in a subquery, a temporary table, or a view. In applications, usually (1) can also be accomplished through (2) from the information retrieval perspective. This is particularly true for top-k over uncertain tuples, as the system needs to provide the capability of (2), while (1) is very complex and hard to do.

The top-k oracle has a relational interface, and can appear wherever a subquery, a temporary table, or a view (i.e., a table expression) can. Following is an example using a temporary table (DB2 syntax):

Q1: WITH top_patients AS
      (SELECT id, location, area, score, supply_A_needed, care_category
       FROM all_patients
       ORDER BY score DESC
       FETCH FIRST 50 ROWS ONLY)
    SELECT COUNT (DISTINCT id)
    FROM top_patients
    WHERE care_category = 'trauma center'

Q1 first defines a temporary table top_patients, which is a top-50 oracle on uncertain tuples. Based on the result of Q1, the staff member at the center (Example 1) can determine how many ambulances to dispatch for delivering patients to the trauma center. We can also ask other queries on the oracle.

A key point is that there is a ground-truth deterministic top-k tuple vector for top_patients. In fact, when the medical personnel arrive at the scene, they will face the ground truth. Prior to that, the server has uncertainty and only gets the distributions of tuples due to the various constraints described in Example 1 (e.g., the limitations of the tags). Therefore, it is crucial that the server give a best-effort and most accurate estimate of Q1's result based on the available data. A top-k oracle exactly serves this purpose. By contrast, Q1's result based on only k tuples (as in previous semantics) may have a very small probability of being the ground truth. Faithfully estimating information of the ground-truth top-k is especially critical in emergency response applications (Example 1) and in monitoring top-k applications [3].

Furthermore, it is important to provide the nested query capability with the top-k result being a subquery. Our work is the first to provide it under the possible world semantics, as all previous semantics that return tuples only retain a tiny fraction of the possible worlds of all top-k tuples. It is generally hard to provide a top-k oracle because the tuples that can be in the top-k have complex correlations, which are hard to represent even with graphical models [27, 19] (e.g., whether a tuple is in the top-k is correlated with all tuples with higher scores).

We devise query processing algorithms for SQL queries over top-k oracles. In particular, we propose an efficient core algorithm that calculates all candidate tuples' probabilities in top-j, for 1 ≤ j ≤ k. It is used as a subroutine in a number of query processing algorithms. For SUM/AVG queries, we devise (1) a randomized approximation algorithm that provides a tradeoff between efficiency and accuracy, and (2) a highly efficient algorithm that computes the expectation and confidence intervals, which is empirically shown to be over three orders of magnitude faster. Moreover, we devise algorithms for point and range queries, other aggregate queries (MAX/MIN, COUNT, COUNT DISTINCT) with predicates, GROUP BY HAVING queries, projection and duplicate elimination, and self joins. In summary, our contributions include:

• A novel approach for retrieving information of top-k results over uncertain data, which provides a best-effort estimate of the ground-truth top-k, and supports nested queries over top-k results under the possible world semantics.

• An efficient core algorithm that computes all candidate tuples' probabilities in top-j (1 ≤ j ≤ k) and is used as a subroutine in a number of query processing algorithms.

• A randomized approximation algorithm for SUM/AVG that has provable accuracy and efficiency, and a highly efficient algorithm that computes the expectation and confidence intervals of SUM/AVG.

• Query processing algorithms for all other types of SQL queries on a top-k oracle.

• A comprehensive experimental study using real-world datasets and synthetic datasets.

II. DEFINING A TOP-K ORACLE

A. Data Model and Notations

We consider an uncertain result set T, which we call a base relation, upon which the top-k operation will be applied. T has a schema (A_1, …, A_m, S, P), where A_1, …, A_m are attributes, S is a special attribute, namely the ranking score, and the attribute P is the marginal probability that a tuple is in T. In addition, T has a set of correlation rules [24]. Each rule contains a set of tuples in T and specifies their correlations. In this work, we focus on the commonly used disjoint-independent probabilistic database model (i.e., correlations are limited to mutual exclusions) [4, 8, 24, 15, 26], and leave the more general graphical models [19] as future work. The uncertainty model that we use is more powerful than the independent tuple uncertainty databases (e.g., [7, 18]), and is sufficient for a wide range of applications (e.g., Example 1 and the real-world datasets in Sec. IV). For instance, a tuple with an uncertain attribute bearing a discrete probability distribution can be modeled as a set of mutually exclusive tuples.

A top-k operation can be applied on T, taking the top-k tuples from each possible world and resulting in a (generally smaller) relation O, which we call a top-k oracle. O has the same schema as T, except that the P attribute signifies the probability that the tuple is in O (i.e., in the top-k of T).

TABLE II
TERMINOLOGIES AND NOTATIONS USED IN THE PAPER

Terms, symbols: Meaning
T (base relation): the result set to apply top-k on
candidate tuples: tuples in T to be considered for the top-k oracle
p_0: probability threshold for candidate tuples
n: number of candidate tuples
t_i: the i'th candidate tuple in rank order
t_i … t_j: {t_i, t_{i+1}, …, t_j}
s_i: score of t_i
p_i: probability of t_i in the base relation
R(t_i): the rule group that t_i is in (a set of tuples)
|R(t_i)|: number of tuples in R(t_i)

A tuple in T that has a positive probability to be in O is called a candidate tuple. The number of candidate tuples can be very large since many tuples may have tiny probabilities to be in top-k. We can generally resort to a cut-off probability threshold p_0 (e.g., 0.001): we do not need to consider a tuple in T as a candidate if it has a probability less than p_0 to be in top-k. Denote the number of candidate tuples as n. One can easily determine a safe value of n (e.g., using Theorem 8 in [15]) so that all tuples above the threshold are guaranteed to be candidate tuples. As discussed earlier, there are complex correlations among the candidate tuples; thus we do not materialize and return them to users, but let users query the oracle.

Denote the candidate tuples as t_1, …, t_n in descending score order, where each tuple t_i has a score s_i and a probability p_i of being in the base relation T. Let t_i belong to a single rule group R(t_i), which is a set of tuples (including t_i itself). When t_i is an independent tuple, R(t_i) = {t_i}. We use |R(t_i)| to denote the number of candidate tuples in R(t_i). The terminologies and notations are summarized in TABLE II for easy reference.

B. Learning the Parameter k

We observe that a user often wants to learn the parameter k first. For instance, in Example 1, the staff member issuing the query may not know how many patients are above an "urgency threshold" that requires immediate attention. Because of resource constraints (e.g., how many doctors or ambulances are available, or how fast queries on a top-k oracle can run), she might have an upper limit of k in mind (say, 500). But subject to that upper limit, she would like to learn k from the actual data. This is easy in a deterministic database: just choose a k such that the tuple ranked at k+1 has a score below a threshold, determined by domain knowledge (e.g., urgency for immediate medical care). In an uncertain database, however, even the tuple ranked at k+1 is not fixed; many tuples can have a rank k+1. Naturally, we can choose k such that Pr[S_{k+1} < τ] ≥ θ, where S_{k+1} is a random variable denoting the score of the tuple ranked at k+1, τ is a score threshold, and θ is a probability threshold.

We solve this problem efficiently using the all-tuple-all-rank algorithm to be presented in Sec. III-A-1, which computes r_{i,j}, the probability that candidate tuple t_i has rank j (for all 1 ≤ i ≤ n and j ≥ 1). From the r_{i,j}'s we can obtain the score distribution S_j of the j'th ranked tuple (based on t_i's score s_i and the r_{i,j}'s for all i). From the S_j's, we can stop at the smallest j such that Pr[S_j < τ] ≥ θ, and k will be set to j − 1. Therefore, in defining a top-k oracle, the user can specify τ, θ, and K, where K is the upper limit of k as discussed earlier. In turn, the system tells the user what k value is chosen.

III. QUERYING A TOP-K ORACLE

A. Point and Range Queries

Point/range queries are based on an arbitrary set of predicates over the A_1, …, A_m and S attributes in a top-k oracle, e.g.:

Q2: SELECT id, location, score
    FROM top_patients
    WHERE area = 'A1'

We assume that top_patients is a top-k oracle similar to Q1, but defined as a view. Under the possible world semantics, a set of tuples is returned, each with the probability that it is in the result, which is the same as the probability that the tuple is in the top-k oracle. The key question is how to calculate this probability.

1) Efficiently Computing Tuple Rank Probability: We will compute the probability that t_i is ranked j, denoted as r_{i,j}, for 1 ≤ i ≤ n, 1 ≤ j ≤ k, using what we call an all-tuple-all-rank algorithm. Then the required tuple probability for point/range queries is:

  Pr[t_i ∈ O] = Σ_{j=1}^{k} r_{i,j}    (1)

Our all-tuple-all-rank algorithm is used as a subroutine in a number of other query processing algorithms, as shown later. As discussed in Sec. II-B, it is also used to learn k. Thus the performance of this algorithm is critical (we will compare our algorithm with the one in previous work [15]). To compute r_{i,j}, we first observe that it suffices to compute the probability that within t_1, …, t_i at least j tuples exist, denoted as q_{i,j}, for 1 ≤ i ≤ n and 1 ≤ j ≤ k. Then:

  r_{i,j} = q_{i,j} − q_{i−1,j}    (2)

Note that Eq. (2) is true regardless of whether tuples are all independent or there are rule groups. This is because the only difference between the scenario that at least j tuples exist in the first i tuples and the scenario that at least j tuples exist in the first i−1 tuples is that t_i exists and has rank j. Calculating q_{i,j} when tuples are independent is easier due to:

  q_{i,j} = p_i · q_{i−1,j−1} + (1 − p_i) · q_{i−1,j}    (3)

Eq. (3), however, does not hold in the presence of mutual exclusions because the existence of t_i can be correlated with tuple existence in t_1, …, t_{i−1}.

Our novel solution to this problem is through finding two functions that recursively invoke each other. By interweaving the processes that solve the two functions, we obtain the needed result with the same O(kn) time complexity. Besides q_{i,j}, we define q̄_{i,j}, the probability that at least j tuples exist in t_1 … t_i excluding R(t_i), i.e., the first i tuples excluding t_i's group. As a warm-up, one might be tempted to write:

  q_{i,j} = p_i · q̄_{i,j−1} + (1 − p_i) · q_{i−1,j}    (4)

which looks intuitive: to have at least j tuples in t_1 … t_i, we either must have at least j−1 tuples in t_1 … t_{i−1} if t_i exists, or have at least j tuples in t_1 … t_{i−1} if t_i does not exist. Unfortunately, Eq. (4) is wrong. This is because when t_i does not exist, we cannot use q_{i−1,j} in (4); we must first divide the probabilities of the tuples in (t_1 … t_{i−1}) ∩ R(t_i) by 1 − p_i (and then find the probability that at least j tuples exist in the modified t_1 … t_{i−1}). That is, the probabilities must be conditioned on "t_i does not exist". Instead, we use two recurrence relations:

  q_{i,j} = (p_i + e_i) · q̄_{i,j−1} + (1 − p_i − e_i) · q̄_{i,j}    (5)

  q_{i−1,j} = e_i · q̄_{i,j−1} + (1 − e_i) · q̄_{i,j}    (6)

where e_i is the sum of the probabilities of the tuples in (t_1 … t_{i−1}) ∩ R(t_i). Both equations are self-explanatory; for instance, Eq. (6) considers the case when some tuple in (t_1 … t_{i−1}) ∩ R(t_i) exists (with probability e_i) and the case when none does. The boundary conditions are q_{i,0} = q̄_{i,0} = 1 and q_{0,j} = q̄_{0,j} = 0, for 1 ≤ i ≤ n, 1 ≤ j ≤ k. Then from Eq. (5) and (6), we can compute all q_{i,j}'s and q̄_{i,j}'s in O(kn) time by first obtaining q̄_{i,j} from (6) and then q_{i,j} from (5), iteratively for j from 1 to k and i from 1 to n.

Thus, our all-tuple-all-rank algorithm has a time cost of O(kn) and a space cost of O(n), a significant reduction from an algorithm in previous work [15], which has a time complexity of O(kn²) ([15] has an additional improvement by prefix sharing, but there is no guarantee on how much can be saved). The efficiency of our algorithm (compared to [15]) comes from the novel usage of double recurrence relations. In Sec. IV, we will also empirically show that our algorithm is much faster (about seven times faster with a real-world dataset).

B. SUM and AVG Queries with Predicates

Previous work and overview of our contributions. We now consider SUM/AVG queries on any attributes in the top-k oracle, possibly with predicates. Ge et al. [11] described a method to obtain an approximate distribution of the sum of scores of top-k tuples. The approximation stems from possibly parsing a very large number of distinct values in a score distribution. The basic idea is as follows: the distribution F_{i,j} of top-j in t_i … t_n can be derived from two distributions: (1) the distribution F_{i+1,j−1} of top-(j−1) in t_{i+1} … t_n for the case where t_i exists, and (2) the distribution F_{i+1,j} of top-j in t_{i+1} … t_n for the case where t_i does not exist. We obtain F_{i,j} by merging F_{i+1,j−1} (plus the score of t_i) and F_{i+1,j}, weighted by p_i and 1 − p_i, respectively. Then as we go through the tuples bottom up for i from n to 1, and for j from 1 to k, we can get the score-sum distribution of top-k in t_1 … t_n. When there are mutual exclusions, the above process is run multiple times, with each run corresponding to the case where the rank-j tuple happens to be a particular tuple. We refer the reader to [11] for details. However, the approximation in [11] does not have guaranteed error bounds. For SUM/AVG on a top-k oracle, we make the following novel contributions and discuss each of them next:

• We devise a randomized algorithm for approximation with provable accuracy. There are two benefits: (1) it gives us a theoretical underpinning so that accuracy is ensured for any dataset; (2) it enables a rigorous choice of parameters to balance efficiency and accuracy.

• We handle SUM/AVG of arbitrary attributes and expressions in a top-k oracle, and with arbitrary predicates.

• We propose a highly efficient method that reports the expectation and confidence intervals of SUM/AVG, rather than a distribution. This is useful when efficiency is crucial, e.g., in data streams and monitoring top-k [3].

1) Randomized Approximation Algorithm: Consider a probability mass function (PMF) D as a set of pairs {(v_1, Pr(v_1)), …, (v_z, Pr(v_z))}, where value v_l has a probability Pr(v_l) (1 ≤ l ≤ z), and v_1 < ⋯ < v_z. We say that v_z − v_1 is the span of the PMF, and z is the size of the PMF, denoted as |D|. Without approximation, the size of an intermediate (and final) PMF during the SUM/AVG algorithm may be as large as the number of distinct k-tuple combinations (i.e., each k-tuple combination may have a unique sum), making the time and space complexity prohibitive.

The key idea of our algorithm is to maintain a budget b (a maximum size) on the sizes of intermediate PMF's. We will discuss the selection of parameter b shortly. When merging two PMF's during the SUM/AVG, if the result PMF D has a size greater than b, we consolidate the values in D into b evenly spaced values in the range, called bars. This is described more precisely in the DISTRIBUTION-MERGE algorithm below.

Lines 1-7 of the algorithm modify the two input distributions to be ready for the merge. For instance, since D_1 corresponds to the case in which t_i does not exist, all its probabilities are multiplied by 1 − p_i. In lines 8-15, if the number of distinct values in D_1 ∪ D_2 is within the budget b, we simply return the merged distribution. Otherwise, in lines 16-17, we calculate the step size δ between two bars. In line 18, we pick a value uniformly at random from the interval [min − δ/2, min + δ/2]; this will be the position of the first bar.

Then all bars have fixed positions, as they are evenly spaced (lines 19-22). In lines 23-26, we "attach" all values in D_1 ∪ D_2 to their closest bars. Finally, we get an approximate distribution D. The intuition behind the random choice in line 18 is that, when this algorithm is run multiple times, sometimes we get positive errors and sometimes negative ones, which will likely cancel each other out (Theorem 1).

Algorithm DISTRIBUTION-MERGE(D_1, D_2, b)

Input: D_1: a PMF for top-j in tuples t_{i+1} … t_n
       D_2: a PMF for top-(j−1) in tuples t_{i+1} … t_n
       b: size budget of PMF's
Output: D: a PMF for top-j in tuples t_i … t_n

1: for each (x, Pr(x)) ∈ D_1 do //probabilities in D_1 are changed
2:   update D_1: Pr(x) ← Pr(x) · (1 − p_i) // p_i is t_i's probability
3: end for
4: for each (x, Pr(x)) ∈ D_2 do
5:   update D_2: x ← x + s_i // s_i is t_i's score
6:   update D_2: Pr(x) ← Pr(x) · p_i
7: end for
8: D ← D_1 ∪ D_2
9: if |D| ≤ b then //result PMF size within budget
10:   for each distinct value x in D_1 or D_2 do
11:     let x's probability in D_1 and D_2 be Pr_1(x), Pr_2(x), resp.
12:     add (x, Pr_1(x) + Pr_2(x)) into result D
13:   end for
14:   return D
15: end if
16: let max and min be the maximum and minimum values in D_1 ∪ D_2, resp.
17: δ ← (max − min) / (b − 1) //step size between two bars
18: v_1 ← uniform(min − δ/2, min + δ/2) //the first bar, from this interval
19: for each l ← 1 … b do
20:   v_l ← v_1 + (l − 1) · δ //values for all b bars
21:   Pr(v_l) ← 0 //initial probability 0
22: end for
23: for each pair (x, Pr(x)) ∈ D_1 ∪ D_2 do
24:   choose l s.t. v_l is closest to x among l ∈ 1..b
25:   Pr(v_l) ← Pr(v_l) + Pr(x)
26: end for
27: add all (v_l, Pr(v_l)) into result D (1 ≤ l ≤ b)
28: return D

THEOREM 1. Let the span of the distribution of SUM/AVG of top-k be m. Consider the SUM/AVG algorithm that is based on DISTRIBUTION-MERGE. For any constant ε > 0, we set b = √k/ε + 1 and let the actual distribution obtained by the algorithm be D̃. Then, any top-k tuple vector v adds its probability Pr[v] at score s̃_v in D̃, where Pr[v] is accurate, and the probability that s̃_v is more than εm away from its true value is no more than 2e^{−2}.

Proof. First, the approximation in DISTRIBUTION-MERGE only shifts score values; the probabilities are always computed accurately. Each original value x in D_1 ∪ D_2 is moved to a position uniformly at random chosen from [x − δ/2, x + δ/2]. Thus, the error incurred for a top-k vector v at each DISTRIBUTION-MERGE, denoted as X_l (1 ≤ l ≤ k), has a uniform distribution in [−δ/2, δ/2]. Since δ ≤ m/(b−1), X_l must be within [−m/(2(b−1)), m/(2(b−1))].

Let v's last tuple (the k'th in v) be t_h (h ≥ k). Thus, v starts at t_h, and the SUM/AVG algorithm will go through t_h, t_{h−1}, …, t_1 in order, building up Pr[v] as well as v's contribution to the score distribution. In particular, if t_i ∈ v (1 ≤ i ≤ h), then we multiply p_i into Pr[v]; otherwise 1 − p_i. Moreover, for each i, the query processing invokes DISTRIBUTION-MERGE once to go from v's input distribution (D_1 if t_i ∉ v, or D_2 if t_i ∈ v) to its output distribution. Each such invocation performs the random approximation (i.e., shifting v's score to the nearest bar) at most once. Since v contains k tuples, overall there are at most k independent random shifts, as each DISTRIBUTION-MERGE makes its own fresh random choice in line 18. Define the total error of v as X = Σ_{l=1}^{k} X_l, where each X_l has a uniform distribution centered at 0. Thus, from Hoeffding's inequality [14]:

  Pr[|X| ≥ εm] ≤ 2 exp(−2(εm)² / (k · (m/(b−1))²)) = 2 exp(−2ε²(b−1)²/k)

With b = √k/ε + 1, we have the required bound 2e^{−2}. □

Theorem 1 shows that we need to set the parameter b to be O(√k/ε) to have an arbitrarily small constant error. Therefore, with this approximation, the worst case time complexity of the whole SUM/AVG algorithm in the presence of mutual exclusion rules is O(n²k√k/ε), and the space complexity is O(k√k/ε).

2) SUM/AVG on Arbitrary Attributes or Expressions and with Predicates: Our discussion so far is limited to SUM/AVG of the scores of tuples in a top-k oracle. We now consider the general case in which we can sum or average any attributes A_1, …, A_m, or expressions over them, optionally with predicates. Consider a simple example:

Q3: SELECT SUM (supply_A_needed)
    FROM top_patients
    WHERE care_category = 'on scene'

Q3 asks for the sum of medical supply A needed by patients who are determined to receive on-scene medical care. Based on the result of Q3, the medical staff can determine how much of medical supply A doctors need to bring to the disaster scene.

Interestingly, we can follow the same algorithmic framework that processes the SUM/AVG of scores, including the approximation algorithm of Sec. III-B-1. The only difference is that the PMF distributions are calculated based on the attribute (or expression) being summed/averaged, instead of scores. Note that the sort order of the tuples t_1, …, t_n is still based on scores (descending order), as the top-k oracle is defined by them. This sort order ensures that the probabilities in the target distribution (on supply_A_needed) are calculated correctly.

In the presence of predicates (e.g., Q3), we change the tuples that are filtered out by the predicates to have value 0 for the attribute in SUM/AVG (e.g., the supply_A_needed attribute in Q3), while keeping the original values of other tuples. This ensures that only those tuples that satisfy the predicates contribute to the final distribution. We cannot simply remove the filtered-out tuples from the algorithm (but need to keep them and assign value 0), as they affect the probability calculation.

AVG queries with predicates. Answering an AVG with predicates is different and is more costly. An AVG without predicates is easier because each possible world has k tuples, and we can simply divide all the values in the SUM distribution by k to get the AVG distribution. With predicates, however, each possible world may have a different number of tuples, and hence we cannot easily get the AVG distribution.

Our solution is to replace any intermediate distribution D in the SUM algorithm by k versions. Specifically, we replace D by at most k versions D^(1), D^(2), …, D^(k). In D^(c), there are exactly c tuples (that satisfy the predicates) in each tuple vector. Clearly, the union of the k versions is the original sum distribution D. The DISTRIBUTION-MERGE algorithm in Sec. III-B-1 will take D_1^(1), …, D_1^(k), D_2^(1), …, D_2^(k), and b as inputs instead. Accordingly, D_2^(c−1) (with t_i's attribute value added) will merge with D_1^(c) and become D^(c) based on the original DISTRIBUTION-MERGE algorithm, for c = 1, …, k. This gives the correct output D^(1), …, D^(k). In the end, each distribution D^(c)'s values will be divided by c (for c = 1, …, k), and the union of these k versions gives the final AVG distribution. The cost of this algorithm increases by a factor of k in the worst case.

3) Efficiently Reporting Expectation and Bounds: We discuss how to efficiently obtain the expectation of the SUM or AVG, and probabilistic upper and lower bounds. This is often sufficient for end users in decision making, especially for top-k monitoring tasks in real-time data stream applications.

Let us first consider the expectation of SUM. For candidate tuples t_1, …, t_n, let the attribute or expression value being summed in t_i be a_i (1 ≤ i ≤ n). Define a random variable Y_i to be t_i's actual contribution to the SUM result. That is:

  Y_i = a_i, if t_i is in the top-k oracle and satisfies the predicates;
  Y_i = 0, otherwise.

Clearly, if t_i is filtered out by a predicate, then Y_i = 0; otherwise, E[Y_i] = a_i · Pr[t_i ∈ O]. Let Y be the SUM result. Then Y = Σ_{i=1}^{n} Y_i. From the linearity of expectation,

  E[Y] = Σ E[Y_i] = Σ_{t_i sat. pred.} a_i · Pr[t_i ∈ O].

The summation is over all t_i's that satisfy the predicates. Now we can use Eq. (1) in Section III-A-1 and have:

  E[Y] = Σ_{t_i sat. pred.} a_i · Σ_{j=1}^{k} r_{i,j}

where r_{i,j} is the probability that t_i has rank j, as computed in Sec. III-A-1. Thus, without pre-computation, the overall time complexity of computing the exact value of the expectation (even in the presence of mutual exclusion rules) is only O(kn), with a space cost of O(n). This is in contrast to the O(n²k√k/ε) time and O(k√k/ε) space cost of computing an approximate PMF.

Let us now study upper and lower bounds of Y. These bounds, in addition to E[Y], give the user a better idea about the distribution. The difficulty is that the Y_i's are all correlated. The following result gives a highly efficient algorithm.

THEOREM 2. Let Y be the SUM result over a top-k oracle and μ = E[Y]. Then for any λ > 0, we have

  Pr[|Y − μ| ≥ λ] ≤ 2 exp(−λ² / (2 Σ_{i=1}^{n} c_i² a_i²)),

where a_i is the attribute value of t_i being summed, and c_i is a binary variable: c_i = 0 if t_i is filtered out by the predicates of the query (if any), or if t_i is the last tuple of a mutual exclusion group (as we scan t_1, …, t_n in order) that has a total probability 1 (including an independent tuple that has probability 1, which can be considered as a single-tuple mutual exclusion group). Otherwise, c_i = 1.

Proof. Define indicator random variables:

  X_i = 1 if t_i is in the top-k oracle, and X_i = 0 otherwise (1 ≤ i ≤ n).

In addition, define X_0 as a trivial random variable that has a constant value 0. We now define n + 1 random variables:

  Z_i = E[Y | X_0, …, X_i], i = 0, 1, …, n

Then Z_0, …, Z_n are a Doob martingale [21] w.r.t. X_0, …, X_n, since it can be verified that E[Z_i | X_0, …, X_{i−1}] = Z_{i−1}. Note that

  Z_0 = E[Y] = μ    (7)
  Z_n = E[Y | X_1, …, X_n] = Y    (8)

We now give a lower and an upper bound of Z_i − Z_{i−1}, and eventually use the Azuma inequality [21]. When the outcomes of X_1, …, X_{i−1} (i.e., x_1, …, x_{i−1}) are revealed, consider what we expect about the sum, i.e., E[Y | x_1, …, x_{i−1}]. Let the sum of the revealed tuples so far be σ. Among the unrevealed tuples, t_i has the highest score. There are two cases:

(Case 1) There are already k tuples in top-k (i.e., Σ_{j<i} x_j = k). In this case, no more unrevealed tuples can be in top-k:

  Z_{i−1} = E[Y | x_1, …, x_{i−1}] = σ = E[Y | x_1, …, x_{i−1}, X_i] = Z_i

(Case 2) Fewer than k tuples are revealed to be in top-k:

  Z_{i−1} = σ + Σ_{j=i}^{n} a_j · Pr[X_j = 1 | x_1, …, x_{i−1}]    (9)

Let p'_i = Pr[X_i = 1 | x_1, …, x_{i−1}]. Since t_i has the highest score among the unrevealed tuples, we have the following information about p'_i. If t_i is an independent tuple, then p'_i = p_i. If t_i is in a mutual exclusion group and one tuple in this group has been revealed to be in top-k, then p'_i = 0. Otherwise let e_i be the total probability of the tuples in (t_1 … t_{i−1}) ∩ R(t_i); then p'_i = p_i / (1 − e_i) (similar to Eq. (4) in Sec. III-A-1).

Now after revealing X_i, (i) if x_i = 1, then:

  Z_i = σ + a_i + Σ_{j=i+1}^{n} a_j · Pr[X_j = 1 | x_1, …, x_{i−1}, X_i = 1]    (10)

Comparing Eq. (9) and (10), firstly, for any j ∈ [i+1, n],

  Pr[X_j = 1 | x_1, …, x_{i−1}, X_i = 1] ≤ Pr[X_j = 1 | x_1, …, x_{i−1}]    (11)

because the fact that t_i is in top-k can only lower the probability that t_j is in top-k. In addition, Z_i ≥ Z_{i−1}, since knowing that the next highest score tuple t_i is in top-k cannot decrease the expectation of SUM. Thus, from Eq. (9), (10), and (11), we have:

  0 ≤ Z_i − Z_{i−1} ≤ (1 − p'_i) · a_i    (12)

(ii) If x_i = 0, similarly, we have:

  Z_i = σ + Σ_{j=i+1}^{n} a_j · Pr[X_j = 1 | x_1, …, x_{i−1}, X_i = 0]    (13)

  Pr[X_j = 1 | x_1, …, x_{i−1}, X_i = 0] ≥ Pr[X_j = 1 | x_1, …, x_{i−1}]    (14)

Hence, from Equations (9), (13) and (14), we have:

  0 ≥ Z_i − Z_{i−1} ≥ −p'_i · a_i    (15)

Combining (12) and (15) of the two scenarios gives:

  −p'_i · a_i ≤ Z_i − Z_{i−1} ≤ (1 − p'_i) · a_i    (16)

We have Z_i − Z_{i−1} = 0 when t_i is the last tuple of R(t_i) that has a total probability 1 (a deterministic tuple also falls in this category as |R(t_i)| = 1), because whether t_i exists is completely determined after knowing X_1, …, X_{i−1} in this case. From this fact and Inequality (16), using the general-form version of the Azuma inequality (e.g., Corollary 6.9 in [21]) gives:

  Pr[|Y − μ| ≥ λ] = Pr[|Z_n − Z_0| ≥ λ] ≤ 2 exp(−λ² / (2 Σ_{i=1}^{n} c_i² a_i²))

where the first equality is based on Eq. (7) and (8). □

The algorithm implied by Theorem 2 to get bounds is extremely efficient: a single parse of the candidate tuples with a cost of O(n). We will empirically evaluate the bounds in Sec. IV.
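The single pass might look as follows in Python (our encoding of the c_i definition; groups[i] is t_i's rule-group id or None, probs[i] is p_i, and lam is λ):

import math

def sum_tail_bound(values, satisfies, groups, probs, lam):
    n = len(values)
    total, last = {}, {}                  # per-group probability mass and last index
    for i in range(n):
        g = groups[i] if groups[i] is not None else ('solo', i)
        total[g] = total.get(g, 0.0) + probs[i]
        last[g] = i
    denom = 0.0
    for i in range(n):
        g = groups[i] if groups[i] is not None else ('solo', i)
        # c_i = 0 if filtered out, or if t_i closes a group of total probability 1
        c = 0 if (not satisfies[i] or
                  (last[g] == i and abs(total[g] - 1.0) < 1e-12)) else 1
        denom += (c * values[i]) ** 2
    # Azuma: Pr[|Y - mu| >= lam] <= 2 exp(-lam^2 / (2 * sum_i c_i^2 a_i^2))
    return 2.0 * math.exp(-lam * lam / (2.0 * denom)) if denom > 0 else 0.0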

C. COUNT (DISTINCT), Projection, Duplicate Elimination

A COUNT query over a top-k oracle without predicates has a trivial result, as each possible world has k tuples. A COUNT query with predicates can be rewritten using "SUM(1)", i.e., a SUM of the constant 1, which we have discussed.

COUNT DISTINCT queries, such as Q1 in Sec. I, require different methods. It turns out that efficiently computing the distribution of COUNT DISTINCT is extremely difficult without using the slow method of computing the probabilities of all top-k vectors. However, we devise algorithms that can compute the probability that each distinct value exists in the top-k oracle. We show that the expectation of the COUNT DISTINCT can be inferred from these probabilities.

Clearly, computing the probabilities that each distinct value exists happens to be the same as computing the result tuple probabilities after projection and duplicate elimination. Thus, we present their algorithms together here.

Algorithm COUNT-DISTINCT-1(t_1, …, t_n, A, P)

Input: t_1, …, t_n: tuples in decreasing score order
       A: attribute/expression to be counted; A has values a_1, …, a_n
       P: predicate of the query
Output: the expectation of the number of distinct values of A

1: for each i ← n, n−1, …, 1 do
2:   for each j ← 1, …, k do
3:     make a clone of D_{i+1,j} as D'_{i+1,j}, and of D_{i+1,j−1} as D'_{i+1,j−1}
4:     if P is true for t_i then
5:       // a_i is the value of t_i's attribute to be counted
6:       update the probability of a_i in D'_{i+1,j−1} to be 1.0
7:     end if
8:     for each (v, Pr(v)) ∈ D'_{i+1,j−1} do
9:       update D'_{i+1,j−1}: Pr(v) ← Pr(v) · p_i // p_i is t_i's prob.
10:    end for
11:    for each (v, Pr(v)) ∈ D'_{i+1,j} do
12:      update D'_{i+1,j}: Pr(v) ← Pr(v) · (1 − p_i)
13:    end for
14:    for each distinct value v in D'_{i+1,j−1} or D'_{i+1,j} do
15:      Pr(v) ← sum of v's probabilities in D'_{i+1,j−1} and D'_{i+1,j}
16:      add (v, Pr(v)) into D_{i,j}
17:    end for
   end for
   end for
18: return Σ_{(v, Pr(v)) ∈ D_{1,k}} Pr(v)

1) A Bottom-up Algorithm: We first show the COUNT-DISTINCT-1 algorithm, which serves as a base for comparison with a more efficient one in Sec. III-C-2. Moreover, the MAX/MIN algorithm in Sec. III-D is related to this algorithm. We present the algorithm in the tuple independence model; the extension to mutual exclusion cases is standard, increasing the cost by at most a factor of n (Sec. III-B). The key idea is as follows. As we scan the candidate tuples bottom up (t_n, …, t_1), we maintain, in D_{i,j}, the occurrence probabilities of each distinct value in the top-j of t_i … t_n. Thus, each D_{i,j} is a set of (value, probability) pairs. In the end, D_{1,k} contains the probabilities of each distinct value in the top-k of all candidate tuples.

Note that the boundary conditions are that distributions D_{i,j} (for i = n+1 or j = 0) have probability 0 for any value. In lines 5-6, we find t_i's attribute value a_i in D'_{i+1,j−1} and update its probability to be 1. In line 18, we simply add up the probabilities of all values in D_{1,k}. We now show the correctness.

THEOREM 3. COUNT-DISTINCT-1 returns the correct result, the expectation of the number of distinct values of A.

Proof. We only need to prove two points: (1) lines 14-17 produce the correct D_{i,j}, and (2) line 18 gives the expectation.

For (1), consider each distinct value v of A. At D_{i,j}, we have:

  Pr[v ∈ top-j of t_i…t_n] = Pr[v ∈ top-j of t_i…t_n ∧ t_i exists] + Pr[v ∈ top-j of t_i…t_n ∧ t_i does not exist]    (17)

If a_i = v, then Pr[v ∈ top-j of t_i…t_n ∧ t_i exists] = Pr[t_i exists] = p_i. This is because if t_i exists, it must be in the top-1 of t_i…t_n (since it has the highest score in t_i…t_n), and thus v must be in the top-j of t_i…t_n. If a_i ≠ v, we have:

  Pr[v ∈ top-j of t_i…t_n ∧ t_i exists]
    = Pr[v ∈ top-j of t_i…t_n | t_i exists] · Pr[t_i exists]
    = Pr[v ∈ top-(j−1) of t_{i+1}…t_n] · p_i.

Thus, for each distinct value v, lines 4-10 update D'_{i+1,j−1} to contain Pr[v ∈ top-j of t_i…t_n ∧ t_i exists]. Further note that:

  Pr[v ∈ top-j of t_i…t_n ∧ t_i does not exist]
    = Pr[v ∈ top-j of t_i…t_n | t_i does not exist] · Pr[t_i does not exist]
    = Pr[v ∈ top-j of t_{i+1}…t_n] · (1 − p_i)

Thus, lines 11-13 update D'_{i+1,j} to be Pr[v ∈ top-j of t_i…t_n ∧ t_i does not exist] for each distinct v. From Eq. (17), lines 14-17 must derive the correct D_{i,j}; point (1) is proved.

We now prove (2). From (1), by induction, D_{1,k} contains the probabilities of occurrence of all distinct values in the top-k oracle. Let the values in D_{1,k} be v_1, …, v_d (clearly d ≤ n), and let their probabilities given by D_{1,k} be q_1, …, q_d. Define random variables W_i = 1 if v_i occurs in the top-k oracle, and W_i = 0 otherwise (1 ≤ i ≤ d). Thus, E[W_i] = Pr[W_i = 1] = q_i. Let W = Σ_{i=1}^{d} W_i. Then W is the number of distinct values, and E[W] = Σ E[W_i] = Σ_{i=1}^{d} q_i. □

The time complexity of COUNT-DISTINCT-1 is O(kn²) and the space complexity is O(kn), because there are no more than n distinct values whose distributions are computed in the algorithm. If we handle mutual exclusion tuples, the time increases to O(kn³) while the space cost remains the same.
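A compact Python rendering of COUNT-DISTINCT-1 for the independent-tuple case (ours; D[j] maps each distinct value to its probability of appearing in the top-j of the suffix scanned so far):

def count_distinct_expectation(vals, probs, satisfies, k):
    n = len(vals)
    D = [dict() for _ in range(k + 1)]        # D[0] stays empty (top-0)
    for i in range(n - 1, -1, -1):            # scan bottom up: t_n, ..., t_1
        newD = [dict()]
        for j in range(1, k + 1):
            d_absent = {v: pr * (1 - probs[i]) for v, pr in D[j].items()}
            d_present = dict(D[j - 1])        # clone of D_{i+1,j-1}
            if satisfies[i]:
                d_present[vals[i]] = 1.0      # if t_i exists, a_i surely appears
            out = dict(d_absent)
            for v, pr in d_present.items():   # weight the "t_i exists" case by p_i
                out[v] = out.get(v, 0.0) + pr * probs[i]
            newD.append(out)
        D = newD
    return sum(D[k].values())                 # E[#distinct] = sum of probabilities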


2) A Top-down Algorithm: Let us now consider a top-down approach (i.e., scanning tuples in t_1, t_2, … order), which improves the performance significantly. Consider a distinct value v. Suppose it is in a set of tuples t_{c_1}, t_{c_2}, …, t_{c_w} (c_1 < c_2 < ⋯ < c_w). Then,

  Pr[v ∈ O] = 1 − Pr[none of t_{c_1}, …, t_{c_w} is in O]    (18)

Moreover,

  Pr[none of t_{c_1}, …, t_{c_w} is in O]
    = Pr[rank-k tuple ∈ t_1…t_{c_1−1}]
    + Pr[rank-k tuple ∈ t_{c_1+1}…t_{c_2−1} | ¬t_{c_1}] · Pr[¬t_{c_1}]
    + Pr[rank-k tuple ∈ t_{c_2+1}…t_{c_3−1} | ¬t_{c_1}, ¬t_{c_2}] · Pr[¬t_{c_1}, ¬t_{c_2}]
    + ⋯ + Pr[rank-k tuple ∈ t_{c_w+1}…t_n | ¬t_{c_1}, …, ¬t_{c_w}] · Pr[¬t_{c_1}, …, ¬t_{c_w}]    (19)

where "rank-k tuple" denotes the tuple that has rank k, and "¬t_{c_1}, …, ¬t_{c_w}" denotes the event "none of t_{c_1}, …, t_{c_w} exists in the base relation", etc. Eq. (19) decomposes the original event into smaller events based on where the rank-k tuple is.

Each term on the right-hand side of Eq. (19) can be obtained from a modified version of q_{i,k}, the probability that at least k tuples exist in t_1 … t_i, as discussed in Sec. III-A. In particular, Pr[rank-k tuple ∈ t_1…t_{c_1−1}] = q_{c_1−1,k}, and

  Pr[rank-k tuple ∈ t_{c_1+1}…t_{c_2−1} | ¬t_{c_1}] · Pr[¬t_{c_1}]
    = (q̃_{c_2−1,k} − q̃_{c_1−1,k}) · (1 − p_{c_1})

where q̃_{i,k} is a modified version of q_{i,k} in which we first remove t_{c_1} from the algorithm that calculates q_{·,·}, and each remaining tuple t in R(t_{c_1}) has a normalized probability p_t / (1 − p_{c_1}); the later terms use analogously modified versions. The reason for this normalization is the same as the one in the proof of Theorem 2. Other terms of Eq. (19) can be obtained in like manner. From Eq. (19) and (18) we can get the probabilities that each distinct value (that satisfies the predicates, if any) exists in the top-k oracle.

Let us look at the complexity of the whole algorithm. Each term on the right-hand side of Eq. (19) has cost O(kn), even in the presence of mutual exclusion rules (as discussed in Section III-A-1). The total number of tuples t_{c_1}, …, t_{c_w} over all distinct values is at most n (it would be less than n if some do not satisfy the predicates). Therefore, the total time complexity is O(kn²) and the space complexity is O(n). This is a significant improvement over COUNT-DISTINCT-1, which has a time complexity of O(kn³) and a space complexity of O(kn).

D. Other Aggregate Queries

1) MAX/MIN Queries: We first discuss MAX/MIN queries on the score attribute and without predicates. In Sec. III-A-1, we devise an O(kn) algorithm to calculate r_{i,j} (1 ≤ i ≤ n, 1 ≤ j ≤ k), the probability that t_i has rank j. We can derive the PMF distributions of MAX/MIN from the r_{i,j}'s.

For MAX, we just need to make one parse over the r_{i,1}'s (i.e., rank j = 1). Let the number of distinct score values in t_1, …, t_n be d, and these distinct scores be s'_1, …, s'_d. Then, the MAX distribution is

  Pr[MAX = s'_l] = Σ_{i: s_i = s'_l} r_{i,1}.

We can simply add up the probabilities of two tuples with the same score because "one tuple has rank 1" and "another tuple has rank 1" are two disjoint events. We can similarly obtain MIN's distribution by making one parse over the r_{i,k}'s. Note that we can do this for quantile queries (i.e., the score distribution of any rank from 1 to k), which are useful in databases due to their robustness against data anomalies, and their usage in optimizers and user interface designs [12].

We now look at MAX/MIN on an arbitrary attribute or expression, and/or with predicates. It turns out that we can use a similar algorithm to COUNT-DISTINCT-1. The only difference is in the steps of obtaining the distributions. We only discuss MAX (MIN is analogous). To get M_{i,j}, the distribution of the MAX in the top-j of t_i … t_n, consider two cases. If t_i does not exist, the MAX should be the same as in M_{i+1,j}. If t_i exists, and if t_i satisfies the predicates (if any), we update M_{i+1,j−1} into M'_{i+1,j−1} as follows.

Let the value of t_i's attribute/expression for which we seek the MAX be a_i. Then M'_{i+1,j−1} is (where x denotes the MAX):

  Pr_{M'}[x] = Σ_{y ≤ a_i} Pr_{M_{i+1,j−1}}[y],  if x = a_i;
  Pr_{M'}[x] = Pr_{M_{i+1,j−1}}[x],  if x > a_i;
  Pr_{M'}[x] = 0,  if x < a_i.

This is because in those possible worlds in which the MAX is no more than a_i in M_{i+1,j−1}, the MAX is exactly a_i when we add t_i. We then perform a "weighted sum" of M'_{i+1,j−1} and M_{i+1,j} to get M_{i,j}, in the same manner as lines 8-17 of COUNT-DISTINCT-1. Finally, we get the MAX distribution in M_{1,k}.

2) GROUP BY HAVING Queries: Consider this query:

Q4: SELECT area, COUNT(*)
    FROM top_patients
    GROUP BY area
    HAVING EXP(COUNT(*)) >= 5

which retrieves the count distributions of each area that has an expected count of at least 5 in the top-k oracle (the EXP function returns the expectation). In general, a GROUP BY query is equivalent to aggregate queries with a predicate for each group. For example, in Q4, we can compute a COUNT distribution over top_patients with a predicate "area = A" for each distinct area A. Thus, the cost of executing such a query is at most the cost of running one aggregate query (with predicates) multiplied by the number of groups (sharing is possible among multiple runs due to common tuple values).

E. Self-Join Queries

We focus on self-joins of a top-k oracle, which are more challenging due to the correlation between a pair of joining tuples (while we can usually assume independence when joining a top-k oracle with another relation). Here is an example.

Q5: SELECT p1.id, p1.location, p2.id, p2.location
    FROM top_patients p1, top_patients p2
    WHERE p1.id < p2.id
      AND p1.care_category = p2.care_category
      AND p1.area = p2.area

Q5 selects pairs of patients in the top-k oracle who are in the same area and have the same care category. For a pair of candidate tuples that satisfy the join predicates, we calculate the probability that they are both in the top-k oracle, which is the tuple probability of the join output tuple. Moreover, this needs to be done efficiently for all pairs of candidate tuples.

The main idea of our algorithm is as follows. We first perform the self-join over the candidate tuples ignoring the probabilities. Then consider a pair of matched tuples (t_a, t_b) at this phase, where a < b (i.e., s_a ≥ s_b). We need the probability that they both exist in the top-k oracle (which is the probability that this tuple pair is in the join result). Imagine that t_a exists in the base relation. Conditioned on this, all we need to do is to remove t_a from the candidate tuples and calculate the probability (say, p') that t_b is in the top (k−1). Then, t_a and t_b must both be in the top-k (with probability p' · p_a), since s_a ≥ s_b. We say that a tuple is a small index tuple in a tuple pair (e.g., t_a here) if it has a smaller subscript index than the other tuple; the other tuple is called the large index tuple (e.g., t_b here).

Algorithm SELF-JOIN(t_1, …, t_n, join_pred)

Input: t_1, …, t_n: candidate tuples in descending score order
       join_pred: join predicates/conditions
Output: pairs of join result tuples and their probabilities

1: do the self-join on the candidate tuples with join_pred
2: J ← set of matched tuple pairs from line 1
3: for each distinct small index tuple t in the tuple pairs in J do
4:   let the list of candidate tuples without t be T'
5:   use T' and the algorithm in Section III-A-1 for the following:
6:   for each large index tuple t_l that pairs with t in J (t ≠ t_l) do
7:     p' ← Pr[t_l is in top (k−1)] using T'
8:     assign tuple probability p' · p_t to tuple pair (t, t_l) in J
9:   end for
10: end for
11: return J

In lines 3-10 of SELF-JOIN, we loop through each distinct small index tuple (defined above) among all matched tuple pairs. In lines 6-9, we iterate through each large index tuple that pairs with the current small index tuple, and calculate the probability that they are both in the top-k as described earlier.

Now consider the cost. Each outer loop iteration has a time cost of at most O(kn) due to the all-tuple-all-rank algorithm in Sec. III-A-1. The whole cost depends on how many matched pairs are produced in line 1, but clearly there are no more than n distinct small index tuples (and thereby no more than n outer loop iterations), making the time cost at most O(kn²).

IV. EXPERIMENTAL EVALUATION

A. Datasets and Setup

We use these datasets to perform an extensive evaluation:

• The real-world datasets are downloaded from the National Center for Biotechnology Information (NCBI)'s Gene Expression Omnibus (GEO) database [29]. The datasets we use [31] are the gene expression data of cancer patients and normal people (for comparison), as output from microarray experiments. We call them the GEO datasets.

• We generated some synthetic datasets. The advantage is that we can programmatically vary the parameter values. The details are explained as we describe the experiments.

Note that the dataset sizes are irrelevant here as long as we ensure that the number of tuples in a base relation is greater than n, the number of candidate tuples required (which is determined by k and the probabilities). We implement all the query processing algorithms. In addition, for all-tuple-all-rank, we implement a related algorithm in previous work [15] for comparison. All experiments are performed on a machine with an Intel Core i7 2.67 GHz CPU and 6 GB of memory.

B. Experimental Results

The GEO datasets contain the expression of all genes (i.e., the expressed strength of each gene) from a number of cancer patients as well as normal people. If one can find the genes that are very different (in gene expression) between cancer patients and normal people, these genes can be identified and studied (e.g., for developing drugs that combat this type of cancer). This difference can be measured by the fold change of a gene [2], defined as the ratio between the expression strengths of the gene in a cancer cell and in a normal cell.

However, the microarray experiments have significant uncertainty [2], and multiple runs on the same patient's tissue samples or on different patients (in the same age group) with the same type of cancer (even at the same stage) can produce results that differ in various degrees. There are four age groups (below 55, 56-65, 66-75, and above 75) and four cancer stages. For each age group and stage combination, if we have more than one patient data sample, we bin the samples, collect the statistics of the frequencies of the bins, and obtain a discrete distribution, in which each bin is assigned a value that is the average of the samples within the bin. Bins in a distribution are mutually exclusive, so that at most one of them may be selected in a possible world.

We first study the semantics of a top-k oracle. We define a view top_genes for a top-k oracle that contains attributes gene_id, age_group, stage, fold_change, and direction. The view is created from a top-k query with the scoring function being fold_change. The fold_change attribute is always at least 1, with a direction attribute value of 0 indicating an increase in gene expression in a cancer cell and a direction of 1 indicating a decrease. As changes in both directions equally deserve the attention of a medical researcher, we simply rank by fold_change. We use a threshold p_0 = 0.01, i.e., n is determined as described in Sec. II-A such that all tuples that have probability at least 0.01 to be in the top-k must be candidate tuples.

A medical researcher may not know a priori how many genes have significant fold changes, and will want to learn the parameter k first, as discussed in Sec. II-B. With the real datasets, we obtain the relationship between the score threshold τ of the rank k+1 tuple and the k value, while fixing θ = 0.1. This is shown in Fig. 2. For instance, if from domain knowledge 50 can be used as a cut-off fold-change threshold, the researcher will have k = 33 for the top-k oracle. We can see from the figure that a smaller score threshold corresponds to a greater k value. The user might of course impose an upper limit on k (say, 300) so that she at most studies a certain number of the most interesting genes.

Recall that a core algorithm that is used many times in processing various queries is all-tuple-all-rank (Sec. III-A). Thus, its performance is crucial. We also implement an algorithm in [15] that can accomplish the same task (although [15] has a different eventual goal). Specifically, we implement the algorithm in Sec. 4.2 and 4.3 of [15] with lazy reordering for prefix sharing (as shown in [15], lazy reordering is never worse than aggressive reordering). The execution time comparison using the GEO datasets is shown in Fig. 3, which indicates that our algorithm is about seven times faster than the one in previous work. This comes as no surprise, since our algorithm has an O(kn) cost while the previous one has an O(kn²) base cost with a further improvement based on prefix sharing. However, there is no guarantee on how much improvement the prefix sharing can bring, which depends on the data.

We now generate synthetic datasets with varying parameters, one of which is f, the fraction of tuples that are in mutual exclusion rules. Fixing k = 200 and all other parameters while varying f from 0 to 0.7, we again compare our all-tuple-all-rank with the previous method, as shown in Fig. 4. An interesting fact is that, when f = 0, the two algorithms have about the same performance. This is because when there are no correlation rules, both algorithms have an O(kn) cost. As f increases, our algorithm maintains a constant cost, while the previous one degrades significantly. We comment that the ability to efficiently handle mutual exclusion rules is very important for real applications, because we can model a probability distribution using a set of mutually exclusive tuples (as we do for the GEO datasets).

Next, fixing f = 0.2 and k = 200, we vary the rule group size parameter. It follows a normal distribution, for which we vary the mean from 2 to 9 while keeping the variance at 1. The result is shown in Fig. 5, where the performance of the previous algorithm degrades dramatically as the rule group size increases. This is because as the group size increases, tuples in a group are likely to span a wider range in the sorted order, making prefix sharing more difficult. On the other hand, our algorithm has a constant cost, showing that its performance is robust against the presence of mutual exclusion rules in any form (either rule tuple ratio or group size).

We now issue a query over the GEO datasets that selects the average fold_change of the top_genes view (the researcher wants to know how relevant these genes are), where we vary the definition of this view with different k values. Our goal is to see how effective our "quick" method of Sec. III-B-3 is compared to our "slow" method, which obtains a full distribution. We first look at accuracy. TABLE III shows the expectations. As discussed in Sec. III-B-3, the quick method actually gives an exact value for the expectation, while the slow method computes an approximate distribution (to avoid being too slow), and hence its expectation is also approximate.
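The query used throughout this experiment is simply the aggregate below; only the k in the definition of top_genes varies across runs:

SELECT AVG (fold_change) FROM top_genes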

We then look at the effectiveness of the bounds in Theorem 2. The result is shown in TABLE IV. Take the first row of TABLE IV as an example. When we write a confidence interval [-30%, +30%], it denotes a range [0.7E, 1.3E], where E is the expectation of the AVG result (as shown in TABLE III). The first row shows that the quick method (Theorem 2) has a confidence of 0.78 that the AVG result is in this range. With the same confidence (0.78), the slow method gives an interval of [0.805E, 1.195E]. Note that Theorem 2 always gives a safe bound (i.e., it must be correct), which is wider than the one inferred from the full distribution. The full distribution usually gives a tighter bound, but is subject to its approximation error. Overall, we see that the quick method gives a reasonable estimate of the AVG result. The fact that the quick method gives tighter bounds for larger k values is also encouraging, because that is exactly when it is most useful (due to its speedup, shown below).
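As a concrete instance using TABLE III's numbers: for k = 50, the quick method's expectation is E = 61.8, so the [-30%, +30%] interval in the first row of TABLE IV works out to [0.7 × 61.8, 1.3 × 61.8] = [43.26, 80.34], and Theorem 2 guarantees that the true AVG lies in this range with confidence at least 0.78.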

A clear advantage of the quick method is, of course, its speed. This is shown in Fig. 6, which indicates that the quick method (i.e., calculating both the expectation and the bounds) is over three orders of magnitude faster than the slow method. Additionally, we use the synthetic datasets and vary the rule tuple fraction, while fixing k = 50 and an average rule group size of 3. The result is shown in Fig. 7. The algorithm that computes a full distribution has an increased cost as the fraction grows, due to its iterations over tuples in rule groups. The quick method, however, keeps a constant cost, because the two core algorithms it uses (all-tuple-all-rank and the bounds algorithm of Theorem 2) are both insensitive to this fraction.

TABLE III
COMPARING THE EXPECTATIONS OF AN AVG QUERY RESULT

k                               50      100     200
Expectation from quick method   61.8    47.1    34.3
Expectation from slow method    65.3    46.2    35.5

TABLE IV
COMPARING THE CONFIDENCE INTERVALS OF TWO METHODS

k      Intervals from quick method   Confidence   Intervals from slow method
50     [-30%, +30%]                  0.78         [-19.5%, +19.5%]
       [-40%, +40%]                  0.96         [-28.7%, +28.7%]
100    [-20%, +20%]                  0.71         [-11.6%, +11.6%]
       [-30%, +30%]                  0.97         [-19.9%, +19.9%]
200    [-15%, +15%]                  0.74         [-7.2%, +7.2%]
       [-20%, +20%]                  0.95         [-12.5%, +12.5%]

We next compare the accuracy and performance with vs. without the approximation discussed in Sec. III-B-1, using the same query as in the previous experiment.

Fig. 2 k vs. the score threshold (y-axis: k parameter; x-axis: score threshold)
Fig. 3 Comparison with previous work (execution time in ms vs. k; our method vs. previous method)
Fig. 4 Varying the rule tuple ratio (execution time in ms vs. rule tuple ratio; our method vs. previous method)
Fig. 5 Varying the rule group size (execution time in ms vs. rule group size; our method vs. previous method)


For the approximation algorithm, we choose the parameter 4√n (n as in Sec. III-B-1). Fig. 8 shows the performance: the approximation gives about two orders of magnitude improvement. Furthermore, we verify the accuracy of the randomized approximation in Fig. 9. In this figure, for each of the k values (10, 20, 30, 40, and 50), we display side by side a pair of two close, fine line segments. The left (resp. right) line is for the algorithm with (resp. without) approximation. The midpoint of a line segment indicates the expectation, while half the length of the segment is the standard deviation, both inferred from the full distribution that the corresponding algorithm obtains. The figure shows that the approximation algorithm is quite accurate, which verifies our analysis in Theorem 1.

We now examine the performance of various query processing algorithms in one figure. Besides the query that selects the AVG fold change of the top-k, we issue the following queries:
Q6: SELECT AVG (fold_change) FROM top_genes
    WHERE gene_id IN (SELECT gene_id FROM gene_functions
                      WHERE function_id = 5856)

Q7: SELECT MAX (fold_change) …… The rest is the same as in Q6 ……

Q8: SELECT COUNT (DISTINCT gene_id) FROM top_genes WHERE stage = 1

Q9: SELECT v1.gene_id, v1.stage, v1.fold_change, v2.stage, v2.fold_change
    FROM top_genes v1, top_genes v2
    WHERE v1.gene_id = v2.gene_id AND v1.stage < v2.stage
      AND ABS(v1.fold_change - v2.fold_change) > 20

Q6 is an AVG query with a predicate containing a subquery that selects from a gene_functions table, which is downloaded from [30]. The gene_functions table contains a gene_id attribute and a function_id attribute; each row indicates that a gene has some function. Usually, each gene has many functions and each function involves many genes; multiple records of the table together encode this many-to-many relationship. Q6 asks for the average fold change of the genes that belong to a specific function and that are in the top-k oracle. A researcher might use Q6 to understand the role of function 5856 in this disease.
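For reference, a minimal sketch of this table follows; the DDL and sample rows are illustrative only ([30] specifies the actual content, and the function id 7210 is made up):

-- Hypothetical DDL for the gene_functions table described above.
CREATE TABLE gene_functions (
  gene_id     INT,
  function_id INT
);
-- A gene with two functions contributes two rows:
INSERT INTO gene_functions VALUES (1432, 5856);
INSERT INTO gene_functions VALUES (1432, 7210);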

Q7 is similar, except that it asks for the maximum fold change of a gene with function 5856. Q8 is a COUNT DISTINCT query, while Q9 is a self-join query that finds those genes which have significantly different fold changes at two different cancer stages. This can help a researcher understand the connection between the progression of the disease and the genes.

We run these queries over the GEO datasets for various k values that define the top-k oracle. The execution times of these queries are shown in Fig. 10. The AVG query with predicates (Q6) is the slowest; it is slower than the SUM/AVG query without predicates because we have to maintain multiple (up to k) versions of distributions during the algorithm (Sec. III-B-2). The MAX/MIN (Q7) and COUNT DISTINCT (Q8) algorithm 1 (Sec. III-C-1) are faster than SUM/AVG because they maintain fewer (value, probability) pairs in the distributions. By contrast, COUNT DISTINCT algorithm 2 (Sec. III-C-2) is fundamentally different from the first one and reduces to using the efficient all-tuple-all-rank algorithm (Sec. III-A-1); consequently, it is an order of magnitude faster. Interestingly, the self-join algorithm (Q9) runs the fastest, since its cost is proportional to the number of matching pairs, which is bounded. In general, query processing algorithms that can take advantage of the all-tuple-all-rank algorithm are faster than those that use the SUM/AVG algorithm framework, as we have analyzed.

Fig. 6 Speed comparison of two methods (execution time in ms vs. k; full distribution vs. expectation and bounds)
Fig. 7 Running times with different rule tuple ratios (execution time in ms vs. rule tuple ratio; full distribution vs. expectation and bounds)
Fig. 8 Speed improvement with approximation (execution time in seconds vs. k; with vs. without approximation)
Fig. 9 Accuracy of the approximation (expectation and standard deviation vs. k; left: with approximation, right: without approximation)
Fig. 10 Speeds of answering various queries (execution time in ms vs. k; SUM/AVG, AVG with predicates, MAX/MIN, COUNT DISTINCT/dup. elim. 1 and 2, JOIN)
Fig. 11 Comparing two algorithms (execution time in ms vs. number of distinct values; Algorithm 1 vs. Algorithm 2)



Finally, we use synthetic datasets to further examine the two algorithms for COUNT DISTINCT (or duplicate elimination). One parameter is the number of distinct values (of the attribute in question) among all candidate tuples of the top-k oracle; we generate synthetic datasets that vary this parameter. The result is shown in Fig. 11. We can see that the cost of the second algorithm is relatively insensitive to the number of distinct values, while the cost of the first algorithm increases almost linearly with it. This is because both the time and space complexities of the first algorithm grow with the number of distinct values. The second algorithm, however, runs the all-tuple-all-rank algorithm once per candidate-tuple modification, regardless of how many unique values there are; thus its cost is insensitive to this parameter.

V. RELATED WORK
There has been significant research on managing uncertain data, including systems such as Trio [4], MystiQ [7], Orion [5], PrDB [18], BayesStore [27], MCDB [17], and CLARO [25], among others. On the other hand, top-k queries have been extensively studied for deterministic data; we refer the reader to [16] for an excellent survey of the work in this area.

Top-k queries on uncertain data are particularly useful. Ré et al. [23] rank result tuples based on their probabilities to get the top-k probable answers. Soliman et al. [24] are the first to consider the complex interplay between scores and tuple probabilities. They propose U-Topk, which returns the k tuples that have the highest probability of being the top-k, and U-kRanks, which returns the winner tuple at each rank. Zhang and Chomicki [28] develop the global top-k semantics. Hua et al. [15] propose a probabilistic threshold approach, while Cormode et al. [6] propose expected ranks. Ge et al. [11] propose the concept of typical top-k results. Our SUM algorithm on an attribute is similar to theirs, but our novel contributions include the proposal to treat top-k results as an oracle and to enable users to run arbitrary types of queries for the information they need. In this paper, we also devise highly efficient algorithms to obtain expectations and confidence intervals. We recently applied a similar randomized approximation idea to subsequence matching [20], a very different problem setting from this work. Li et al. [19] observe that the ranking results of previously proposed semantics are wildly different, and propose general and powerful parameterized ranking functions with efficient algorithms. They determine the parameters of the ranking functions by giving the user sample tuples, letting the user rank them, and learning from the user's feedback. We discussed in detail in Sec. I the motivation for a completely different approach.

VI. CONCLUSIONS AND FUTURE WORK
We observe that any k tuples returned to the user (as in previous top-k semantics) could be far from the ground-truth top-k tuples, which is a problem in many applications. We take a completely different approach that provides an oracle-machine interface, so that users can arbitrarily query information about the uncertain top-k results. We devise various query processing algorithms, some of which provide tradeoffs between efficiency and accuracy. As future work, we will study top-k oracle materialization and maintenance, and the problem of providing the oracle interface for tuples with other correlation models.

ACKNOWLEDGMENT
This work was supported in part by the NSF, under grants IIS-1149417 and IIS-1239176.

REFERENCES
[1] E. Adar, C. Ré. Managing Uncertainty in Social Networks. In Data Engineering Bulletin, 30(2):23-31, 2007.
[2] D. Allison et al. DNA Microarrays and Related Genomics Techniques: Design, Analysis, and Interpretation of Experiments. Chapman & Hall, 2005.
[3] B. Babcock, C. Olston. Distributed Top-K Monitoring. In SIGMOD, 2003.
[4] O. Benjelloun, A. D. Sarma, A. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. In VLDB, 2008.
[5] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
[6] G. Cormode, F. Li, K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, 2009.
[7] N. Dalvi, D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In VLDB, 2004.
[8] N. Dalvi, D. Suciu. Management of Probabilistic Data: Foundations and Challenges. In PODS, 2007.
[9] X. Dong, A. Halevy, C. Yu. Data integration with uncertainty. In VLDB, 2007.
[10] T. Gao et al. Wireless Medical Sensor Networks in Emergency Response: Implementation and Pilot Results. In ICTHS, 2008.
[11] T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: on score distribution and typical answers. In SIGMOD, 2009.
[12] A. Gilbert et al. How to summarize the universe: dynamic maintenance of quantiles. In VLDB, 2002.
[13] R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006.
[14] W. Hoeffding. Probability inequalities for sums of bounded random variables. In JASA, 58(301):13-30, March 1963.
[15] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: a probabilistic threshold approach. In SIGMOD, 2008.
[16] I. Ilyas, G. Beskales, M. Soliman. A Survey of Top-k Query Processing Techniques in Relational Database Systems. In ACM Computing Surveys, 2008.
[17] R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, P. Haas. MCDB: a Monte Carlo approach to managing uncertain data. In SIGMOD, 2008.
[18] B. Kanagal, J. Li, A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In SIGMOD, 2011.
[19] J. Li, B. Saha, and A. Deshpande. A unified approach to ranking in probabilistic databases. In VLDB, 2009.
[20] Z. Li, T. Ge. Online Windowed Subsequence Matching over Probabilistic Sequences. In SIGMOD, 2012.
[21] C. McDiarmid. On the Method of Bounded Differences. In Surveys in Combinatorics, 141:148-188, 1989.
[22] J. Paek, J. Kim, and R. Govindan. Energy-Efficient Rate-Adaptive GPS-based Positioning for Smartphones. In MobiSys, 2010.
[23] C. Ré, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[24] M. Soliman, I. Ilyas, and K. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
[25] T. Tran, L. Peng, B. Li, Y. Diao, A. Liu. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. In SIGMOD, 2010.
[26] C. Wang, L. Yuan, J. You, O. Zaiane, J. Pei. On Pruning for Top-K Ranking in Uncertain Databases. In VLDB, 2011.
[27] D. Wang, E. Michelakis, M. Garofalakis, J. Hellerstein. BayesStore: managing large, uncertain data repositories with probabilistic graphical models. In VLDB, 2008.
[28] X. Zhang and J. Chomicki. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. In DBRank, 2008.
[29] http://www.ncbi.nlm.nih.gov/geo/
[30] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96
[31] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10072