Scalable Histograms on Larger Probabilistic Data
Mingwang Tang and Feifei Li
School of Computing, University of Utah
September 21, 2015
Introduction
New Challenges
Large scale data size
Distributed data sources
Uncertainty
Data synopsis on large probabilistic data
Scalable histograms on large probabilistic data
Histograms on deterministic data
V-optimal histogram: given a frequency vector $\vec{v} = \{v_1, \ldots, v_n\}$, where $v_i$ is the frequency of item $i$ in $[n]$, and a space budget of $B$ buckets, it seeks bucket boundaries $[s_k, e_k]$ and representatives $\hat{b}_k$ minimizing the SSE error:
$\min\Big\{\sum_{k=1}^{B} \sum_{i=s_k}^{e_k} (v_i - \hat{b}_k)^2\Big\} \qquad (1)$
[Figure: an example frequency vector over items 1..12 ($n = 12$) split into $B = 3$ buckets]
Computing the optimal $B$-bucket histogram takes $O(Bn^2)$ time.
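To make the $O(Bn^2)$ bound concrete, here is a minimal sketch of the classic V-optimal dynamic program over a deterministic frequency vector; this is the standard textbook formulation, not the authors' implementation, and the function name and example vector are ours.

```python
def v_optimal_histogram(v, B):
    """Minimal SSE of a B-bucket histogram over a frequency vector v (0-indexed list)."""
    n = len(v)
    P = [0.0] * (n + 1)          # prefix sums of v
    PP = [0.0] * (n + 1)         # prefix sums of v^2
    for i, x in enumerate(v, start=1):
        P[i] = P[i - 1] + x
        PP[i] = PP[i - 1] + x * x

    def sse(s, e):               # SSE of one bucket covering items s..e (1-based)
        total = P[e] - P[s - 1]
        return (PP[e] - PP[s - 1]) - total * total / (e - s + 1)

    INF = float("inf")
    # dp[k][j]: minimal SSE covering items 1..j with k buckets.
    dp = [[INF] * (n + 1) for _ in range(B + 1)]
    dp[0][0] = 0.0
    for k in range(1, B + 1):
        for j in range(k, n + 1):
            dp[k][j] = min(dp[k - 1][s - 1] + sse(s, j) for s in range(k, j + 1))
    return dp[B][n]

# Hypothetical smoke test with n = 12, B = 3 (the slide's actual figure values are not known):
# v_optimal_histogram([1, 1, 2, 2, 5, 5, 5, 1, 1, 3, 3, 3], B=3)
```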
Probabilistic Database
Probabilistic database $D$ on domain $[n] = \{1, \ldots, n\}$.
[Figure: item frequencies over the domain 1..16 in two possible worlds W1 and W2]
$D = \{g_1, g_2, \ldots, g_n\}$, where
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\} \qquad (2)$
Probabilistic Data Models
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$
Tuple Model
Each tuple $t_j = \langle (t_{j1}, p_{j1}), \ldots, (t_{j\ell_j}, p_{j\ell_j}) \rangle$, where each $t_{jk}$ is drawn from $[n]$ for $k \in [1, \ell_j]$.
$1 - \sum_{k=1}^{\ell_j} p_{jk}$ specifies the probability that $t_j$ generates no item.
Example:
$t_1 = \{(1, 0.2), (3, 0.3), (7, 0.2)\}$
$t_2 = \{(3, 0.3), (5, 0.1), (9, 0.4)\}$
$t_3 = \{(3, 0.5), (10, 0.4), (13, 0.1)\}$
$\ldots$
$t_{|\tau|}$
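For illustration, a small sketch of how the per-item frequency moments $\mathbb{E}[g_i]$ and $\mathbb{E}[g_i^2]$ can be computed under the tuple model, assuming the usual semantics that each tuple realizes at most one of its alternatives and that tuples are mutually independent; the function name and layout are ours, not the authors' code.

```python
from collections import defaultdict

def tuple_model_moments(tuples, n):
    """tuples: list of [(item, prob), ...] alternatives; returns 1-indexed E[g_i], E[g_i^2]."""
    exp = defaultdict(float)       # E[g_i]   = sum_j p_{j,i}
    var = defaultdict(float)       # Var[g_i] = sum_j p_{j,i} * (1 - p_{j,i})  (independent tuples)
    for t in tuples:
        for item, p in t:
            exp[item] += p
            var[item] += p * (1 - p)
    E1 = [0.0] * (n + 1)           # index 0 unused
    E2 = [0.0] * (n + 1)
    for i in range(1, n + 1):
        E1[i] = exp[i]
        E2[i] = var[i] + exp[i] ** 2   # E[g_i^2] = Var[g_i] + E[g_i]^2
    return E1, E2

# Example with the tuples from the slide:
t1 = [(1, 0.2), (3, 0.3), (7, 0.2)]
t2 = [(3, 0.3), (5, 0.1), (9, 0.4)]
t3 = [(3, 0.5), (10, 0.4), (13, 0.1)]
E1, E2 = tuple_model_moments([t1, t2, t3], n=16)   # E1[3] = E[g_3] = 0.3 + 0.3 + 0.5 = 1.1
```

These per-item moments are exactly the quantities the prefix arrays later in the talk are built from.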
Probabilistic Data Models
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$
Value Model
Each tuple $t_j = \langle j : f_j = ((f_{j1}, p_{j1}), \ldots, (f_{j\ell_j}, p_{j\ell_j})) \rangle$, where $j$ is drawn from $[n]$.
$\Pr(f_j = 0) = 1 - \sum_{k=1}^{\ell_j} p_{jk}$
Example:
$t_1 = \langle 3 : (10, 0.3), (15, 0.2), (20, 0.5) \rangle$
$t_2 = \langle 2 : (6, 0.4), (7, 0.3), (15, 0.3) \rangle$
$t_3 = \langle 1 : (50, 0.2), (7, 0.1), (14, 0.2) \rangle$
$\ldots$
$t_n$
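The analogous computation under the value model is even simpler, since each tuple directly carries the frequency pdf of one item; again an illustrative sketch with names of our choosing, not the authors' code.

```python
def value_model_moments(tuples, n):
    """tuples: dict item -> [(freq, prob), ...]; returns 1-indexed E[f_i], E[f_i^2]."""
    E1 = [0.0] * (n + 1)   # E[f_i]   (index 0 unused)
    E2 = [0.0] * (n + 1)   # E[f_i^2]
    for item, pdf in tuples.items():
        E1[item] = sum(f * p for f, p in pdf)
        E2[item] = sum(f * f * p for f, p in pdf)
    return E1, E2

# Example with the tuples from the slide:
D = {3: [(10, 0.3), (15, 0.2), (20, 0.5)],
     2: [(6, 0.4), (7, 0.3), (15, 0.3)],
     1: [(50, 0.2), (7, 0.1), (14, 0.2)]}
E1, E2 = value_model_moments(D, n=16)   # E1[3] = 10*0.3 + 15*0.2 + 20*0.5 = 16.0
```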
Histograms on Probabilistic data
Possible world semantics
[Figure: item frequencies over the domain 1..16 in two possible worlds W1 and W2]
$g_i$: the frequency of item $i$ becomes a random variable across possible worlds.
Expectation-based histogram:
$H(n, B) = \min\Big\{\mathbb{E}_{\mathcal{W}}\Big[\sum_{k=1}^{B} \sum_{j=s_k}^{e_k} (g_j - \hat{b}_k)^2\Big]\Big\}$
(a short derivation of the optimal per-bucket representative follows below)
[ICDE09] G. Cormode et al., Histograms and wavelets on probabilistic data, ICDE 2009.
[VLDB09] G. Cormode et al., Probabilistic histograms for probabilistic data, VLDB 2009.
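For completeness, the per-bucket optimum behind Eq. (3) on the next slide follows from a short, standard calculation (added here; it is not spelled out on the original slide):

```latex
% For one bucket b = [s, e] with representative \hat{b}, by linearity of expectation:
\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e}(g_i-\hat{b})^2\Big]
  = \sum_{i=s}^{e}\mathbb{E}_{\mathcal{W}}[g_i^2]
    \;-\; 2\hat{b}\,\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big]
    \;+\; (e-s+1)\,\hat{b}^2 .
% This is a quadratic in \hat{b}; setting its derivative to zero gives the optimal
% representative, and substituting it back yields Eq. (3) on the next slide:
\hat{b}^\ast = \frac{1}{e-s+1}\,\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big].
```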
Efficient computation of bucket error
Computing the optimal $B$-bucket histogram takes $O(Bn^2)$ time.
[TKDE10] shows that the minimal error of a bucket $b = (s, e, \hat{b})$ is
$\mathrm{SSE}(b, \hat{b}) = \sum_{i=s}^{e} \mathbb{E}_{\mathcal{W}}[g_i^2] - \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big]^2, \qquad (3)$
obtained by setting $\hat{b} = \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\big[\sum_{i=s}^{e} g_i\big]$.
Based on two precomputed arrays $(A, B)$, $\mathrm{SSE}(b, \hat{b})$ can be computed in constant time (a sketch follows below).
[TKDE10] G. Cormode et al., Histograms and wavelets on probabilistic data, TKDE 2010
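A minimal sketch of the constant-time bucket error using the two prefix arrays, following Eq. (3); the array names $A$, $B$ match the slide, but the code layout is ours.

```python
def build_prefix_arrays(E1, E2):
    """E1[i], E2[i] are E[g_i], E[g_i^2] for items i = 1..n (index 0 unused)."""
    n = len(E1) - 1
    A = [0.0] * (n + 1)        # A[j] = sum_{i<=j} E[g_i^2]
    B = [0.0] * (n + 1)        # B[j] = sum_{i<=j} E[g_i]
    for i in range(1, n + 1):
        A[i] = A[i - 1] + E2[i]
        B[i] = B[i - 1] + E1[i]
    return A, B

def bucket_sse(A, B, s, e):
    """SSE(b, b_hat*) for bucket [s, e] in O(1) time, per Eq. (3)."""
    total = B[e] - B[s - 1]                      # E_W[sum_{i=s..e} g_i]
    return (A[e] - A[s - 1]) - total * total / (e - s + 1)
```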
Pmerge Method
Pmerge is based on a partition-and-merge principle (a simplified sketch of the two phases follows below).
Partition phase: partition the domain $[n]$ into $m$ sub-domains of equal size and compute the locally optimal $B$ buckets for each sub-domain.
Merge phase: merge the $mB$ input buckets from the partition phase into the final $B$ buckets.
[Figure: item frequencies (possible worlds W1 and W2) over the domain 1..16 are partitioned into sub-domains; each local bucket becomes a weighted frequency whose weight w is its width (w = 1, 2, 3 in the example), and these weighted frequencies are merged into the final buckets]
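A simplified sketch of the two Pmerge phases, as mentioned above. `optimal_buckets` (a local exact $B$-bucket solver returning `(start, end, representative)` triples) and `weighted_histogram` (the earlier DP run over weighted points) are hypothetical helpers, and treating each local bucket as one weighted point (value = representative, weight = bucket width) in the merge phase is our reading of the slide's figure, not the authors' exact pseudocode.

```python
def pmerge(E1, E2, B, m, optimal_buckets, weighted_histogram):
    """Partition [n] into m equal sub-domains, solve each locally, merge the m*B buckets."""
    n = len(E1) - 1                       # E1/E2: 1-indexed E[g_i], E[g_i^2]
    width = -(-n // m)                    # ceil(n/m): sub-domain size
    weighted_points = []
    for k in range(m):                    # --- partition phase ---
        s, e = k * width + 1, min((k + 1) * width, n)
        for (bs, be, rep) in optimal_buckets(E1, E2, s, e, B):
            weighted_points.append((rep, be - bs + 1))
    return weighted_histogram(weighted_points, B)   # --- merge phase ---
```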
Recursive Merging Method
Pmerge method:
Approximation quality: Pmerge produces a $\sqrt{10}$-approximation in $O(N + Bn^2/m + B^3 m^2)$ time.
Recursive merging (RPmerge):
Partition $[n]$ into $m^\ell$ sub-domains, producing $B m^\ell$ local buckets.
Use $\ell$ merging iterations, each reducing the domain size by a factor of $m$.
RPmerge takes $O(N + Bn^2/m^\ell + B^3 \sum_{i=1}^{\ell} m^{i+1})$ time and gives a $10^{\ell/2}$-approximation of the optimal $B$-bucket histogram found by OptHist.
In practice, Pmerge and RPmerge provide close-to-optimal approximation quality, as shown in our experiments.
Distributed and Parallel Pmerge
Partition phase in the distributed environment:
[Figure: the probabilistic database $D$ is spread over $\beta$ sites $\tau_1, \ldots, \tau_\beta$; for each item $i$, a site emits $(i, (\mathbb{E}[f_i], \mathbb{E}[f_i^2]))$, routed to one of the $m$ sub-domains by $h(i) = \lceil i / \lceil n/m \rceil \rceil$, where the $A$, $B$ arrays are assembled]
Communication cost:
Computing the $A_k$, $B_k$ arrays in the partition phase: tuple model $O(\beta n)$ bytes, value model $O(n)$ bytes.
$O(Bm)$ bytes in the merge phase for both models.
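A small sketch of the partition-phase routing described above; $h(i)$ matches the slide, while everything else (names, message layout) is ours.

```python
def h(i, n, m):
    """Sub-domain id (1..m) of item i: h(i) = ceil(i / ceil(n/m)), as on the slide."""
    width = -(-n // m)          # ceil(n / m)
    return -(-i // width)       # ceil(i / width)

def route(site_pairs, n, m):
    """site_pairs: iterable of (i, (E[f_i], E[f_i^2])) emitted by one site."""
    out = {k: [] for k in range(1, m + 1)}
    for i, moments in site_pairs:
        out[h(i, n, m)].append((i, moments))
    return out                  # messages destined to each of the m sub-domains

# Example: route([(1, (0.2, 0.2)), (9, (0.4, 0.4))], n=16, m=4)
# -> item 1 goes to sub-domain 1, item 9 goes to sub-domain 3.
```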
Pmerge Based on Sampling
Sampling the $A$, $B$ arrays in the partition phase:
$A_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i^2], \qquad B_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i]$
Estimate the $A_k$, $B_k$ arrays using quantile sampling, with sample probability
$p = \min\big\{\Theta\big(\tfrac{\sqrt{\beta}}{\varepsilon N}\big),\ \Theta\big(\tfrac{1}{\varepsilon^2 N}\big)\big\}$
[Figure: example expected frequencies $\mathbb{E}[f_i] = 2, 3, 5, 9, \ldots$ over items $1, 2, 3, 4, \ldots$, with quantile samples (one per 3 units of cumulative weight) used to approximate the prefix sums]
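Purely as an illustration of the sample-then-rescale idea, here is a Bernoulli-sampling estimator of the prefix array $B_k$; this is not the talk's quantile-sampling scheme, whose guarantees depend on the sample probability $p$ shown above, and the function name is ours.

```python
import random

def sampled_prefix_sums(E1, p, seed=0):
    """E1: 1-indexed E[f_i]; keep each item with probability p, rescale by 1/p."""
    rng = random.Random(seed)
    n = len(E1) - 1
    B_hat = [0.0] * (n + 1)
    for i in range(1, n + 1):
        contrib = E1[i] / p if rng.random() < p else 0.0
        B_hat[i] = B_hat[i - 1] + contrib      # unbiased estimate of B_k[i]
    return B_hat
```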
Pmerge Based on Sketch
Tuple model $A$, $B$ arrays:
Estimate $F_2 = \sum_{i=s_k}^{j} \big(\sum_{\ell=1}^{\beta} \mathbb{E}_{W,\ell}[g_i]\big)^2$ using AMS sketch techniques and a binary (dyadic) decomposition of the domain $[s_k, e_k]$.
[Figure: (a) binary decomposition of $[s_k, e_k]$, with AMS sketches maintained at error scales $F_2 = \varepsilon M''_k, 2\varepsilon M''_k, \ldots, M''_k$ over $\mathbb{E}_{\mathcal{W}}[g_{s_k}], \ldots, \mathbb{E}_{\mathcal{W}}[g_{e_k}]$; (b) local Q-AMS sketches over $\mathbb{E}_{W,\ell}[g_{s_k}], \ldots, \mathbb{E}_{W,\ell}[g_{e_k}]$ at each site]
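A minimal, textbook AMS sketch for estimating $F_2$ of a frequency vector (median of means over random-sign counters); the talk's Q-AMS variant and the dyadic decomposition are not reproduced here, and the hashing shortcut below is only illustrative.

```python
import random, statistics

class AMSSketch:
    """Textbook AMS sketch for F2 = sum_i v_i^2 (median of means of z_t^2)."""
    def __init__(self, num_means=16, num_medians=5, seed=0):
        self.rows = [[0.0] * num_means for _ in range(num_medians)]
        self.seed = seed

    def _sign(self, r, t, item):
        # Illustrative pseudo-random +/-1 sign per (row, counter, item); a careful
        # implementation would use 4-wise independent hashing instead.
        return 1.0 if random.Random(hash((self.seed, r, t, item))).random() < 0.5 else -1.0

    def update(self, item, value):
        for r, row in enumerate(self.rows):
            for t in range(len(row)):
                row[t] += self._sign(r, t, item) * value   # z_t += s_t(item) * value

    def estimate_f2(self):
        return statistics.median(sum(z * z for z in row) / len(row) for row in self.rows)

# Usage: feed each item's aggregated expected frequency once, e.g.
# sk = AMSSketch(); sk.update(3, 1.1); sk.update(9, 0.4); est = sk.estimate_f2()
```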
Outline
1 Optimal B-bucket Histograms
2 Approximate Histograms
3 Pmerge Based on Sampling
4 Experiments
Experiment setup
The tuple-model and value-model datasets are generated from the client id field of the 1998 WorldCup dataset and from atmospheric measurements from the SAMOS project.
The default experimental parameters:

Symbol   Definition              Default Value
B        number of buckets       400
n        domain size             100k (600k)
ℓ        depth of recursions     2
Running time vs. domain size n
[Figure (Tuple Model): running time in seconds (log scale, roughly 10^2 to 10^6) vs. domain size n (×10^3, from 10 to 200) for OptHist, PMerge, and RPMerge]
[Figure (Value Model): running time in seconds (log scale, roughly 10^2 to 10^6) vs. domain size n (×10^3, from 10 to 200) for OptHist, PMerge, and RPMerge]
Approximation ratio vs. domain size n
[Figure (Tuple Model): approximation ratio (between 1 and about 1.08) vs. domain size n (×10^3, from 10 to 200) for PMerge and RPMerge]
[Figure (Value Model): approximation ratio (between 1 and about 1.0004) vs. domain size n (×10^3, from 10 to 200) for PMerge and RPMerge]
Running time on large-scale probabilistic data vs. domain size n
[Figure (Tuple Model): running time in seconds (log scale, roughly 10^2 to 10^5) vs. domain size n (×10^3, from 200 to 1000) for RPMerge, Parallel-PMerge, and Parallel-RPMerge]
[Figure (Value Model): running time in ×10^2 seconds (log scale) vs. domain size n (×10^3, from 200 to 800) for RPMerge, Parallel-PMerge, and Parallel-RPMerge]
Conclusion & Future work
Conclusion
Novel approximation methods for constructing scalable histograms on large probabilistic data.
The quality of the approximate histograms is almost as good as that of the optimal histogram in practice.
Extended the techniques to distributed and parallel settings to further improve scalability.
Future work
Extend our study to probabilistic histograms with pdf bucket representatives, and handle histograms under other error metrics.
The end
Thank You
Q and A