Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data
Wenjie Zhang, University of New South Wales & NICTA, Australia. Joint work with Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA).
![Page 1: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/1.jpg)
Probabilistic Threshold Range Aggregate Query
Processing over Uncertain Data
Wenjie Zhang
University of New South Wales & NICTA, Australia
Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
![Page 2: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/2.jpg)
Outline
DB@UNSW
• Background and Preliminaries
• Probabilistic Threshold Range Aggregate Query
• Exact query processing
• Approximate query processing: Simple Sampling & Double Sampling
• Experiments
• Conclusion
![Page 3: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/3.jpg)
Applications
Many applications involve data that is imperfect due to:
• data randomness and incompleteness
• limitations of equipment
• delay or loss in data transfer
• …
Applications:
• Sensor networks
• Environmental surveillance
• Moving objects
• Data cleaning and integration
• …
![Page 4: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/4.jpg)
Applications
Sensor Networks: sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al, SIGMOD08.)
![Page 5: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/5.jpg)
Applications
Mobile Equipment / Moving Objects: a mobile object reports its location only periodically, so its exact location is often uncertain.
![Page 6: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/6.jpg)
Applications
Satellite data
![Page 7: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/7.jpg)
Applications
Data Quality
Social Data Collection: errors and estimation are inherent in customer surveys and sampling.
![Page 8: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/8.jpg)
Outline
• Background and Preliminaries
• Modeling Uncertainty & Related Work
• Probabilistic Threshold Range Query
• Conclusion
![Page 9: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/9.jpg)
Modeling Uncertainty ( cont. )
Uncertain Objects Model
1. Continuous case: an uncertain object $U$ is described using a probability density function (PDF) $f_U$ such that $\int_{u \in U} f_U(u)\,du = 1$. E.g., uniform distribution, normal distribution.
![Page 10: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/10.jpg)
Modeling Uncertainty ( cont. )
Uncertain Objects Model
2. Discrete case: an uncertain object $U$ is described using a set of instances; each instance $u$ has an occurrence probability $p_u$, with $\sum_{u \in U} p_u = 1$.
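The discrete model can be made concrete with a small data structure. The following is a minimal sketch, not taken from the slides: the class name `UncertainObject` and the `((x, y), p)` instance layout are assumptions chosen for illustration. It only checks the constraint that the occurrence probabilities of one object sum to 1.

```python
from dataclasses import dataclass
from math import isclose


@dataclass(frozen=True)
class UncertainObject:
    """Discrete uncertain object (hypothetical layout for this sketch)."""
    name: str
    instances: tuple  # ((x, y), p_u) pairs, one per instance

    def __post_init__(self):
        # Enforce the model constraint: probabilities are non-negative and sum to 1.
        total = sum(p for _, p in self.instances)
        if any(p < 0 for _, p in self.instances) or not isclose(total, 1.0):
            raise ValueError(f"instance probabilities of {self.name} must sum to 1")


a = UncertainObject("a", (((1.0, 1.0), 0.5), ((2.0, 1.5), 0.3), ((4.0, 3.0), 0.2)))
```

Constructing an object whose probabilities do not sum to 1 raises `ValueError`, which catches malformed input before any query is evaluated.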
![Page 11: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/11.jpg)
Possible World Semantics
Given a set of uncertain objects $U_1, U_2, \ldots, U_n$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of $n$ instances, one instance per uncertain object.
The probability of a possible world is
$P(W) = \prod_{i=1}^{n} P(u_i)$
Let $\Omega$ be the set of all possible worlds; clearly,
$\sum_{W \in \Omega} P(W) = 1$
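As a sanity check of the possible-world semantics, the worlds of a tiny data set can be enumerated directly. This is an illustrative sketch only: the object names, instance labels, and probabilities are made up, and the product form assumes the objects are independent, as the formula for the probability of a possible world implies.

```python
from itertools import product
from math import isclose, prod

# Hypothetical data: each object maps to a list of (instance, probability) pairs.
objects = {
    "U1": [("u1a", 0.6), ("u1b", 0.4)],
    "U2": [("u2a", 0.5), ("u2b", 0.3), ("u2c", 0.2)],
}


def possible_worlds(objs):
    """Yield (world, P(W)) with exactly one instance chosen per uncertain object."""
    for choice in product(*objs.values()):
        world = tuple(inst for inst, _ in choice)
        yield world, prod(p for _, p in choice)


worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)  # sums to 1 up to floating-point error
```

The number of worlds is the product of the instance counts, which is why enumeration is only feasible for toy inputs.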
![Page 12: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/12.jpg)
Probabilistic Queries:
Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
Aggregate Queries [BDJR05, MJ07, CG07]
Join Queries [CSP06, AW07]
Top-k queries [SIC07, YLSK08, RDS07, HJZL08]
Nearest Neighbor Queries [KKR07, CCMC08]
Skyline Queries [PJLY07]
… …
![Page 13: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/13.jpg)
Range query
• Uncertain objects, exact query
• A probability threshold is often assigned
![Page 14: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/14.jpg)
Related Work
Range Queries [TCXNKP05, BPS06, AY08]
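Given a rectangle r and a probabilistic threshold t, find all objects that appear in r with probability at least t.
Appearance probability of an object o with uncertainty region o.region (figure labels: r, o.region):
$\int_{x \in o.\mathit{region} \cap r} o.\mathit{pdf}(x)\,dx$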
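For intuition, the appearance probability has a closed form when the pdf is uniform over a rectangular uncertainty region: it is the overlap area divided by the region area. The sketch below assumes exactly that special case (uniform pdf, axis-aligned rectangles given as `(x1, y1, x2, y2)`); the function names and sample data are hypothetical and not from the cited papers.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def appearance_probability(region, r):
    """Integral of a uniform pdf on `region` over the part of `region` inside r."""
    area = (region[2] - region[0]) * (region[3] - region[1])
    return overlap_area(region, r) / area


def threshold_range_query(objects, r, t):
    """Names of objects that appear in r with probability at least t."""
    return [name for name, region in objects.items()
            if appearance_probability(region, r) >= t]


# Hypothetical uncertainty regions and query rectangle.
objects = {"a": (1, 1, 2, 2), "b": (3, 1, 5, 2), "c": (4.5, 3, 6.5, 4)}
r = (0, 0, 4.5, 3.5)
result = threshold_range_query(objects, r, 0.5)
```

With this data, `a` lies entirely inside `r`, `b` overlaps it with probability 0.75, and `c` only touches its boundary, so the query with t = 0.5 returns `a` and `b`.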
![Page 15: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/15.jpg)
U-tree
Probabilistically Constrained Region (PCR) [TCXNKP05]
Figures: PCR(0.2); multiple PCRs
![Page 16: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/16.jpg)
Outline
• Introduction
• Modeling Uncertainty & Related Work
• Probabilistic Threshold Range Aggregate Query (PTRA)
• Conclusion
![Page 17: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/17.jpg)
Contribution
• Formally define the PTRA query
• aU-Tree structure for exact PTRA queries
• singleSample and doubleSample techniques for approximate answers
![Page 18: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/18.jpg)
Problem Statement
Given a set of uncertain objects and a query q, return the number of uncertain objects whose appearance probability in the query range is no less than the threshold $p_q$.
![Page 19: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/19.jpg)
Problem Definition
Assume the threshold is 0.5. If the appearance probability computed for b is > 0.5 and that for c is < 0.5, then the aggregate returned is 2 (a and b).
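The example can be reproduced with a brute-force PTRA count over discrete objects. This is a minimal sketch under stated assumptions: the coordinates and probabilities for a, b, and c are invented to match the case described (a fully inside the query, b above the threshold, c below it), and `ptra_count` is a hypothetical helper name, not the paper's algorithm.

```python
def inside(point, r):
    """True if the point lies in the axis-aligned rectangle r = (x1, y1, x2, y2)."""
    x, y = point
    return r[0] <= x <= r[2] and r[1] <= y <= r[3]


def appearance_probability(instances, r):
    """Sum of the occurrence probabilities of the instances that fall in r."""
    return sum(p for point, p in instances if inside(point, r))


def ptra_count(objects, r, p_q):
    """Number of objects whose appearance probability in r is at least p_q."""
    return sum(1 for inst in objects.values()
               if appearance_probability(inst, r) >= p_q)


r = (0.0, 0.0, 4.0, 4.0)
objects = {
    "a": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],    # entirely inside r
    "b": [((3.0, 3.0), 0.75), ((5.0, 3.0), 0.25)],  # appearance probability 0.75
    "c": [((3.5, 1.0), 0.25), ((6.0, 1.0), 0.75)],  # appearance probability 0.25
}
count = ptra_count(objects, r, 0.5)
```

Here `count` is 2, contributed by a and b. Every instance of every object is touched, which is the cost the index and sampling techniques aim to avoid.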
![Page 20: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/20.jpg)
Exact Query Processing ( aU-Tree)
• Main idea: add aggregate information to the U-tree.
• Advantage: stop at an intermediate level if a node is pruned or fully covered by the query.
• Disadvantage: otherwise, still need to drill down to the leaf nodes.
  • For a large portion of uncertain objects, the appearance probability needs to be computed.
  • Expensive for a massive number of instances per object!
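The traversal idea can be sketched with a simplified aggregate tree. This is not the aU-tree itself: the sketch assumes each node stores only a bounding rectangle of everything beneath it plus an object count, and it decides "pruned" and "fully covered" with plain rectangle tests rather than the PCR-based rules of the U-tree. All names (`Node`, `ptra`, `app_prob`) are hypothetical, and 0 < p_q <= 1 is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                    # (x1, y1, x2, y2) bounding the subtree
    count: int                                    # aggregate: objects in the subtree
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)   # leaf only: one instance list per object


def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def disjoint(a, b):
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


def app_prob(instances, r):
    return sum(p for (x, y), p in instances
               if r[0] <= x <= r[2] and r[1] <= y <= r[3])


def ptra(node, r, p_q):
    if disjoint(node.mbr, r):
        return 0                  # pruned: no object below can appear in r
    if contains(r, node.mbr):
        return node.count         # fully covered: use the stored aggregate
    if node.children:
        return sum(ptra(c, r, p_q) for c in node.children)
    # Partially overlapping leaf: compute each appearance probability.
    return sum(1 for inst in node.objects if app_prob(inst, r) >= p_q)


leaf1 = Node((0, 0, 2, 2), 2, objects=[[((1, 1), 1.0)], [((2, 2), 1.0)]])
leaf2 = Node((3, 3, 6, 4), 2, objects=[[((3, 3), 0.75), ((6, 3), 0.25)],
                                       [((3.5, 4), 0.25), ((6, 4), 0.75)]])
leaf3 = Node((8, 8, 9, 9), 1, objects=[[((8, 8), 0.5), ((9, 9), 0.5)]])
root = Node((0, 0, 9, 9), 5, children=[leaf1, leaf2, leaf3])
answer = ptra(root, (0, 0, 4, 4), 0.5)
```

In this toy tree, `leaf1` is answered from its stored count, `leaf3` is pruned, and only `leaf2` requires per-object probability computation, which is the expensive case named in the disadvantage.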
![Page 21: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/21.jpg)
Exact Query Processing ( aU-Tree)
![Page 22: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/22.jpg)
singleSample
Sample the instances of each uncertain object. If m’ out of m sampled instances are inside the query region, then the approximate appearance probability is m’/m.
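The m’/m estimate can be sketched in a few lines. The sketch assumes instances are drawn with replacement according to their occurrence probabilities; the slides do not state the sampling scheme, so that choice, the function name `single_sample`, and the data are assumptions.

```python
import random


def single_sample(instances, r, m, rng=random):
    """Estimate the appearance probability as m'/m from m sampled instances."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    # Draw m instances with replacement, weighted by occurrence probability.
    sample = rng.choices(points, weights=weights, k=m)
    m_prime = sum(1 for (x, y) in sample
                  if r[0] <= x <= r[2] and r[1] <= y <= r[3])
    return m_prime / m


r = (0.0, 0.0, 4.0, 4.0)
obj = [((3.0, 3.0), 0.75), ((5.0, 3.0), 0.25)]   # true appearance probability 0.75
estimate = single_sample(obj, r, m=200, rng=random.Random(7))
```

The estimate fluctuates around the true value 0.75; a larger m tightens it at the cost of touching more instances.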
![Page 23: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/23.jpg)
singleSample ( cont. )
An immediate application of the Chernoff-Hoeffding bound
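The slide's exact statement is not in the transcript. As a hedged illustration, the standard two-sided Hoeffding inequality for m independent 0/1 samples with mean p gives $\Pr(|m'/m - p| \ge \epsilon) \le 2e^{-2m\epsilon^2}$, so $m \ge \ln(2/\delta) / (2\epsilon^2)$ samples suffice for error at most $\epsilon$ with probability at least $1 - \delta$. The helper below (a hypothetical name) just evaluates that formula.

```python
from math import ceil, exp, log


def hoeffding_sample_size(eps, delta):
    """Smallest integer m with 2 * exp(-2 * m * eps**2) <= delta."""
    return ceil(log(2.0 / delta) / (2.0 * eps ** 2))


m = hoeffding_sample_size(0.1, 0.05)
```

For ε = 0.1 and δ = 0.05 this gives m = 185, independent of how many instances the object actually has.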
![Page 24: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/24.jpg)
doubleSample
• Single sampling is expensive when there is a massive number of objects!
• Sample the uncertain objects as well.
• Naive: uniformly sample objects from all uncertain objects.
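The naive variant can be sketched as two nested sampling steps: draw a uniform sample of objects, estimate each sampled object's appearance probability from sampled instances, and scale the qualifying count up to the full data set. The function names, the scaling estimator, and the weighted instance sampling are assumptions of this sketch rather than details given on the slide.

```python
import random


def estimate_probability(instances, r, m, rng):
    """m'/m estimate of one object's appearance probability in r."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    sample = rng.choices(points, weights=weights, k=m)
    hits = sum(1 for (x, y) in sample
               if r[0] <= x <= r[2] and r[1] <= y <= r[3])
    return hits / m


def naive_double_sample(objects, r, p_q, n_objects, m, seed=0):
    """Scale the count over a uniform object sample up to the full data set."""
    rng = random.Random(seed)
    sampled = rng.sample(objects, n_objects)       # objects, without replacement
    qualified = sum(1 for inst in sampled
                    if estimate_probability(inst, r, m, rng) >= p_q)
    return qualified * len(objects) / n_objects


r = (0.0, 0.0, 4.0, 4.0)
inside_objs = [[((1.0, 1.0), 1.0)] for _ in range(60)]
outside_objs = [[((9.0, 9.0), 1.0)] for _ in range(40)]
estimate = naive_double_sample(inside_objs + outside_objs, r, 0.5, n_objects=20, m=10)
```

The true answer for this data is 60; the estimate depends on how many of the 20 sampled objects happen to come from the qualifying part of the data, which is where skew hurts uniform sampling.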
![Page 25: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/25.jpg)
doubleSample: Accuracy
• Note: saying that the “appearance probability” of each object follows a uniform distribution means that the spatial locations are uniformly distributed.
• Using the Chernoff-Hoeffding bound.
![Page 26: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/26.jpg)
doubleSample: Our Approach
• Skew!
• Aim: select K disjoint groups covering all objects with the minimum “skew”; i.e., objects in each group follow a “uniform” distribution. (Then do uniform sampling of objects in each group.)
• The optimization problem is NP-hard.
• Observation:
  • Min-skew is a good heuristic to construct such groups.
  • The aU-tree groups objects by a principle similar to min-skew.
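Sampling uniformly inside each group and summing the scaled per-group counts is ordinary stratified sampling. The sketch below assumes the groups are already given (the grouping step, including min-skew, is not shown), and `qualifies` stands in for the per-object threshold test; both names are hypothetical.

```python
import random


def stratified_count(groups, qualifies, per_group, seed=0):
    """Sum of per-group scaled counts; `groups` is a list of object lists."""
    rng = random.Random(seed)
    estimate = 0.0
    for group in groups:
        if not group:
            continue
        n = min(per_group, len(group))
        sampled = rng.sample(group, n)             # uniform within the group
        hits = sum(1 for obj in sampled if qualifies(obj))
        estimate += hits * len(group) / n          # scale up to the group size
    return estimate


# Two perfectly homogeneous groups: every object in the first qualifies, none in the second.
groups = [[1] * 10, [0] * 30]
estimate = stratified_count(groups, lambda obj: obj == 1, per_group=3)
```

With homogeneous groups, three samples per group already recover the exact count of 10, which illustrates why low skew within each group is the goal.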
![Page 27: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/27.jpg)
doubleSample: Our Approach
Step 1: choose K subtrees to cover all objects with the minimum total skew. NP-hard!
  • Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K.
  • Feed the min-skew algorithm with the subtrees at level L.
  • (Note: if the number of nodes at some level L equals K, then these K subtrees are chosen.)
Step 2: sample objects in each subtree.
Step 3: sample instances in each sampled object.
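The level selection in Step 1 can be sketched as a scan over per-level node counts. The sketch assumes levels are numbered from the leaves (index 0) up to the root, so that node counts shrink as the level grows, and it stops at choosing the level; the min-skew algorithm that consumes the chosen subtrees is not shown. `choose_level` is a hypothetical helper name.

```python
def choose_level(nodes_per_level, k):
    """Return (level, exact) for K groups; index 0 is the leaf level."""
    # If some level has exactly K nodes, those K subtrees are used directly.
    for level, n in enumerate(nodes_per_level):
        if n == k:
            return level, True
    # Otherwise find L with fewer than K nodes while level L-1 has more than K.
    for level in range(1, len(nodes_per_level)):
        if nodes_per_level[level] < k < nodes_per_level[level - 1]:
            return level, False      # subtrees at this level go to min-skew
    raise ValueError("no level brackets K")


bracketed = choose_level([1000, 100, 10, 1], 50)
exact = choose_level([1000, 100, 10, 1], 100)
```

In the first call, level 2 (10 nodes) is chosen because level 1 has 100 nodes; in the second call, level 1 has exactly K nodes, so its subtrees are taken as the groups.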
![Page 28: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/28.jpg)
Experiments
Algorithms:
exact, singleSample, doubleSample
Data sets:
LB: 53k objects in Long Beach County
CA: 62k objects in California
Synthetic aircraft dataset in 3D
10k instances for each point, following a Uniform or Constrained-Gaussian distribution
Setting: C++, P4 2.8GHz, 2G memory, Debian Linux, page size 8K
![Page 29: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/29.jpg)
Efficiency
![Page 30: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/30.jpg)
Accuracy
![Page 31: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/31.jpg)
Accuracy ( cont. )
![Page 32: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/32.jpg)
Conclusion
• Definition of PTRA
• aU-Tree technique
• Sampling technique
• Future work: any approach with a theoretical guarantee?
![Page 33: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/33.jpg)
Thanks
![Page 34: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814437550346895db0d214/html5/thumbnails/34.jpg)
Min-Skew technique