Scalable Histograms on Larger Probabilistic Data
Mingwang Tang and Feifei Li
School of Computing, University of Utah
September 21, 2015
Introduction
New Challenges
Large scale data size
Distributed data sources
Uncertainty
Data synopsis on large probabilistic data
Scalable histograms on large probabilistic data
Histograms on deterministic data
V-optimal histogram: given a frequency vector $\vec{v} = \{v_1, \ldots, v_n\}$, where $v_i$ is the frequency of item $i$ in $[n]$, and a space budget of $B$ buckets, it seeks bucket boundaries $[s_k, e_k]$ and representatives $\hat{b}_k$ minimizing the SSE error:
$\min\Big\{\sum_{k=1}^{B} \sum_{i=s_k}^{e_k} (v_i - \hat{b}_k)^2\Big\} \qquad (1)$
[Figure: an example frequency vector over items 1..12 ($n = 12$) split into $B = 3$ buckets]
Computing the optimal $B$-bucket histogram takes $O(Bn^2)$ time.
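To make the $O(Bn^2)$ bound concrete, here is a minimal sketch of the classic V-optimal dynamic program over a deterministic frequency vector; this is the standard textbook formulation, not the authors' implementation, and the function name and example vector are ours.

```python
def v_optimal_histogram(v, B):
    """Minimal SSE of a B-bucket histogram over a frequency vector v (0-indexed list)."""
    n = len(v)
    P = [0.0] * (n + 1)          # prefix sums of v
    PP = [0.0] * (n + 1)         # prefix sums of v^2
    for i, x in enumerate(v, start=1):
        P[i] = P[i - 1] + x
        PP[i] = PP[i - 1] + x * x

    def sse(s, e):               # SSE of one bucket covering items s..e (1-based)
        total = P[e] - P[s - 1]
        return (PP[e] - PP[s - 1]) - total * total / (e - s + 1)

    INF = float("inf")
    # dp[k][j]: minimal SSE covering items 1..j with k buckets.
    dp = [[INF] * (n + 1) for _ in range(B + 1)]
    dp[0][0] = 0.0
    for k in range(1, B + 1):
        for j in range(k, n + 1):
            dp[k][j] = min(dp[k - 1][s - 1] + sse(s, j) for s in range(k, j + 1))
    return dp[B][n]

# Hypothetical smoke test with n = 12, B = 3 (the slide's actual figure values are not known):
# v_optimal_histogram([1, 1, 2, 2, 5, 5, 5, 1, 1, 3, 3, 3], B=3)
```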
Probabilistic Database
Probabilistic database $D$ on domain $[n] = \{1, \ldots, n\}$.
[Figure: item frequencies over the domain 1..16 in two possible worlds W1 and W2]
$D = \{g_1, g_2, \ldots, g_n\}$, where
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\} \qquad (2)$
Probabilistic Data Models
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$
Tuple Model
Each tuple $t_j = \langle (t_{j1}, p_{j1}), \ldots, (t_{j\ell_j}, p_{j\ell_j}) \rangle$, where each $t_{jk}$ is drawn from $[n]$ for $k \in [1, \ell_j]$.
$1 - \sum_{k=1}^{\ell_j} p_{jk}$ specifies the probability that $t_j$ generates no item.
Example:
$t_1 = \{(1, 0.2), (3, 0.3), (7, 0.2)\}$
$t_2 = \{(3, 0.3), (5, 0.1), (9, 0.4)\}$
$t_3 = \{(3, 0.5), (10, 0.4), (13, 0.1)\}$
$\ldots$
$t_{|\tau|}$
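For illustration, a small sketch of how the per-item frequency moments $\mathbb{E}[g_i]$ and $\mathbb{E}[g_i^2]$ can be computed under the tuple model, assuming the usual semantics that each tuple realizes at most one of its alternatives and that tuples are mutually independent; the function name and layout are ours, not the authors' code.

```python
from collections import defaultdict

def tuple_model_moments(tuples, n):
    """tuples: list of [(item, prob), ...] alternatives; returns 1-indexed E[g_i], E[g_i^2]."""
    exp = defaultdict(float)       # E[g_i]   = sum_j p_{j,i}
    var = defaultdict(float)       # Var[g_i] = sum_j p_{j,i} * (1 - p_{j,i})  (independent tuples)
    for t in tuples:
        for item, p in t:
            exp[item] += p
            var[item] += p * (1 - p)
    E1 = [0.0] * (n + 1)           # index 0 unused
    E2 = [0.0] * (n + 1)
    for i in range(1, n + 1):
        E1[i] = exp[i]
        E2[i] = var[i] + exp[i] ** 2   # E[g_i^2] = Var[g_i] + E[g_i]^2
    return E1, E2

# Example with the tuples from the slide:
t1 = [(1, 0.2), (3, 0.3), (7, 0.2)]
t2 = [(3, 0.3), (5, 0.1), (9, 0.4)]
t3 = [(3, 0.5), (10, 0.4), (13, 0.1)]
E1, E2 = tuple_model_moments([t1, t2, t3], n=16)   # E1[3] = E[g_3] = 0.3 + 0.3 + 0.5 = 1.1
```

These per-item moments are exactly the quantities the prefix arrays later in the talk are built from.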
Probabilistic Data Models
$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$
Value Model
Each tuple $t_j = \langle j : f_j = ((f_{j1}, p_{j1}), \ldots, (f_{j\ell_j}, p_{j\ell_j})) \rangle$, where $j$ is drawn from $[n]$.
$\Pr(f_j = 0) = 1 - \sum_{k=1}^{\ell_j} p_{jk}$
Example:
$t_1 = \langle 3 : (10, 0.3), (15, 0.2), (20, 0.5) \rangle$
$t_2 = \langle 2 : (6, 0.4), (7, 0.3), (15, 0.3) \rangle$
$t_3 = \langle 1 : (50, 0.2), (7, 0.1), (14, 0.2) \rangle$
$\ldots$
$t_n$
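The analogous computation under the value model is even simpler, since each tuple directly carries the frequency pdf of one item; again an illustrative sketch with names of our choosing, not the authors' code.

```python
def value_model_moments(tuples, n):
    """tuples: dict item -> [(freq, prob), ...]; returns 1-indexed E[f_i], E[f_i^2]."""
    E1 = [0.0] * (n + 1)   # E[f_i]   (index 0 unused)
    E2 = [0.0] * (n + 1)   # E[f_i^2]
    for item, pdf in tuples.items():
        E1[item] = sum(f * p for f, p in pdf)
        E2[item] = sum(f * f * p for f, p in pdf)
    return E1, E2

# Example with the tuples from the slide:
D = {3: [(10, 0.3), (15, 0.2), (20, 0.5)],
     2: [(6, 0.4), (7, 0.3), (15, 0.3)],
     1: [(50, 0.2), (7, 0.1), (14, 0.2)]}
E1, E2 = value_model_moments(D, n=16)   # E1[3] = 10*0.3 + 15*0.2 + 20*0.5 = 16.0
```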
Histograms on Probabilistic data
Possible world semantics
[Figure: item frequencies over the domain 1..16 in two possible worlds W1 and W2]
$g_i$: the frequency of item $i$ becomes a random variable across possible worlds.
Expectation-based histogram:
$H(n, B) = \min\Big\{\mathbb{E}_{\mathcal{W}}\Big[\sum_{k=1}^{B} \sum_{j=s_k}^{e_k} (g_j - \hat{b}_k)^2\Big]\Big\}$
(a short derivation of the optimal per-bucket representative follows below)
[ICDE09] G. Cormode et al., Histograms and wavelets on probabilistic data, ICDE 2009.
[VLDB09] G. Cormode et al., Probabilistic histograms for probabilistic data, VLDB 2009.
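For completeness, the per-bucket optimum behind Eq. (3) on the next slide follows from a short, standard calculation (added here; it is not spelled out on the original slide):

```latex
% For one bucket b = [s, e] with representative \hat{b}, by linearity of expectation:
\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e}(g_i-\hat{b})^2\Big]
  = \sum_{i=s}^{e}\mathbb{E}_{\mathcal{W}}[g_i^2]
    \;-\; 2\hat{b}\,\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big]
    \;+\; (e-s+1)\,\hat{b}^2 .
% This is a quadratic in \hat{b}; setting its derivative to zero gives the optimal
% representative, and substituting it back yields Eq. (3) on the next slide:
\hat{b}^\ast = \frac{1}{e-s+1}\,\mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big].
```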
Efficient computation of bucket error
Computing the optimal $B$-bucket histogram takes $O(Bn^2)$ time.
[TKDE10] shows that the minimal error of a bucket $b = (s, e, \hat{b})$ is
$\mathrm{SSE}(b, \hat{b}) = \sum_{i=s}^{e} \mathbb{E}_{\mathcal{W}}[g_i^2] - \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big]^2, \qquad (3)$
obtained by setting $\hat{b} = \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\big[\sum_{i=s}^{e} g_i\big]$.
Based on two precomputed arrays $(A, B)$, $\mathrm{SSE}(b, \hat{b})$ can be computed in constant time (a sketch follows below).
[TKDE10] G. Cormode et al., Histograms and wavelets on probabilistic data, TKDE 2010
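A minimal sketch of the constant-time bucket error using the two prefix arrays, following Eq. (3); the array names $A$, $B$ match the slide, but the code layout is ours.

```python
def build_prefix_arrays(E1, E2):
    """E1[i], E2[i] are E[g_i], E[g_i^2] for items i = 1..n (index 0 unused)."""
    n = len(E1) - 1
    A = [0.0] * (n + 1)        # A[j] = sum_{i<=j} E[g_i^2]
    B = [0.0] * (n + 1)        # B[j] = sum_{i<=j} E[g_i]
    for i in range(1, n + 1):
        A[i] = A[i - 1] + E2[i]
        B[i] = B[i - 1] + E1[i]
    return A, B

def bucket_sse(A, B, s, e):
    """SSE(b, b_hat*) for bucket [s, e] in O(1) time, per Eq. (3)."""
    total = B[e] - B[s - 1]                      # E_W[sum_{i=s..e} g_i]
    return (A[e] - A[s - 1]) - total * total / (e - s + 1)
```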
Pmerge Method
Pmerge is based on a partition-and-merge principle (a simplified sketch of the two phases follows below).
Partition phase: partition the domain $[n]$ into $m$ sub-domains of equal size and compute the locally optimal $B$ buckets for each sub-domain.
Merge phase: merge the $mB$ input buckets from the partition phase into the final $B$ buckets.
[Figure: item frequencies (possible worlds W1 and W2) over the domain 1..16 are partitioned into sub-domains; each local bucket becomes a weighted frequency whose weight w is its width (w = 1, 2, 3 in the example), and these weighted frequencies are merged into the final buckets]
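A simplified sketch of the two Pmerge phases, as mentioned above. `optimal_buckets` (a local exact $B$-bucket solver returning `(start, end, representative)` triples) and `weighted_histogram` (the earlier DP run over weighted points) are hypothetical helpers, and treating each local bucket as one weighted point (value = representative, weight = bucket width) in the merge phase is our reading of the slide's figure, not the authors' exact pseudocode.

```python
def pmerge(E1, E2, B, m, optimal_buckets, weighted_histogram):
    """Partition [n] into m equal sub-domains, solve each locally, merge the m*B buckets."""
    n = len(E1) - 1                       # E1/E2: 1-indexed E[g_i], E[g_i^2]
    width = -(-n // m)                    # ceil(n/m): sub-domain size
    weighted_points = []
    for k in range(m):                    # --- partition phase ---
        s, e = k * width + 1, min((k + 1) * width, n)
        for (bs, be, rep) in optimal_buckets(E1, E2, s, e, B):
            weighted_points.append((rep, be - bs + 1))
    return weighted_histogram(weighted_points, B)   # --- merge phase ---
```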
Recursive Merging Method
Pmerge method:
Approximation quality: Pmerge produces a $\sqrt{10}$-approximation in $O(N + Bn^2/m + B^3 m^2)$ time.
Recursive merging (RPmerge):
Partition $[n]$ into $m^\ell$ sub-domains, producing $B m^\ell$ local buckets.
Use $\ell$ merging iterations, each reducing the domain size by a factor of $m$.
RPmerge takes $O(N + Bn^2/m^\ell + B^3 \sum_{i=1}^{\ell} m^{i+1})$ time and gives a $10^{\ell/2}$-approximation of the optimal $B$-bucket histogram found by OptHist.
In practice, Pmerge and RPmerge provide close-to-optimal approximation quality, as shown in our experiments.
Distributed and Parallel Pmerge
Partition phase in the distributed environment:
[Figure: the probabilistic database $D$ is spread over $\beta$ sites $\tau_1, \ldots, \tau_\beta$; for each item $i$, a site emits $(i, (\mathbb{E}[f_i], \mathbb{E}[f_i^2]))$, routed to one of the $m$ sub-domains by $h(i) = \lceil i / \lceil n/m \rceil \rceil$, where the $A$, $B$ arrays are assembled]
Communication cost:
Computing the $A_k$, $B_k$ arrays in the partition phase: tuple model $O(\beta n)$ bytes, value model $O(n)$ bytes.
$O(Bm)$ bytes in the merge phase for both models.
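A small sketch of the partition-phase routing described above; $h(i)$ matches the slide, while everything else (names, message layout) is ours.

```python
def h(i, n, m):
    """Sub-domain id (1..m) of item i: h(i) = ceil(i / ceil(n/m)), as on the slide."""
    width = -(-n // m)          # ceil(n / m)
    return -(-i // width)       # ceil(i / width)

def route(site_pairs, n, m):
    """site_pairs: iterable of (i, (E[f_i], E[f_i^2])) emitted by one site."""
    out = {k: [] for k in range(1, m + 1)}
    for i, moments in site_pairs:
        out[h(i, n, m)].append((i, moments))
    return out                  # messages destined to each of the m sub-domains

# Example: route([(1, (0.2, 0.2)), (9, (0.4, 0.4))], n=16, m=4)
# -> item 1 goes to sub-domain 1, item 9 goes to sub-domain 3.
```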
Pmerge Based on Sampling
Sampling the $A$, $B$ arrays in the partition phase:
$A_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i^2], \qquad B_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i]$
Estimate the $A_k$, $B_k$ arrays using quantile sampling, with sample probability
$p = \min\big\{\Theta\big(\tfrac{\sqrt{\beta}}{\varepsilon N}\big),\ \Theta\big(\tfrac{1}{\varepsilon^2 N}\big)\big\}$
[Figure: example expected frequencies $\mathbb{E}[f_i] = 2, 3, 5, 9, \ldots$ over items $1, 2, 3, 4, \ldots$, with quantile samples (one per 3 units of cumulative weight) used to approximate the prefix sums]
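Purely as an illustration of the sample-then-rescale idea, here is a Bernoulli-sampling estimator of the prefix array $B_k$; this is not the talk's quantile-sampling scheme, whose guarantees depend on the sample probability $p$ shown above, and the function name is ours.

```python
import random

def sampled_prefix_sums(E1, p, seed=0):
    """E1: 1-indexed E[f_i]; keep each item with probability p, rescale by 1/p."""
    rng = random.Random(seed)
    n = len(E1) - 1
    B_hat = [0.0] * (n + 1)
    for i in range(1, n + 1):
        contrib = E1[i] / p if rng.random() < p else 0.0
        B_hat[i] = B_hat[i - 1] + contrib      # unbiased estimate of B_k[i]
    return B_hat
```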
Pmerge Based on Sketch
Tuple model $A$, $B$ arrays:
Estimate $F_2 = \sum_{i=s_k}^{j} \big(\sum_{\ell=1}^{\beta} \mathbb{E}_{W,\ell}[g_i]\big)^2$ using AMS sketch techniques and a binary (dyadic) decomposition of the domain $[s_k, e_k]$.
[Figure: (a) binary decomposition of $[s_k, e_k]$, with AMS sketches maintained at error scales $F_2 = \varepsilon M''_k, 2\varepsilon M''_k, \ldots, M''_k$ over $\mathbb{E}_{\mathcal{W}}[g_{s_k}], \ldots, \mathbb{E}_{\mathcal{W}}[g_{e_k}]$; (b) local Q-AMS sketches over $\mathbb{E}_{W,\ell}[g_{s_k}], \ldots, \mathbb{E}_{W,\ell}[g_{e_k}]$ at each site]
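A minimal, textbook AMS sketch for estimating $F_2$ of a frequency vector (median of means over random-sign counters); the talk's Q-AMS variant and the dyadic decomposition are not reproduced here, and the hashing shortcut below is only illustrative.

```python
import random, statistics

class AMSSketch:
    """Textbook AMS sketch for F2 = sum_i v_i^2 (median of means of z_t^2)."""
    def __init__(self, num_means=16, num_medians=5, seed=0):
        self.rows = [[0.0] * num_means for _ in range(num_medians)]
        self.seed = seed

    def _sign(self, r, t, item):
        # Illustrative pseudo-random +/-1 sign per (row, counter, item); a careful
        # implementation would use 4-wise independent hashing instead.
        return 1.0 if random.Random(hash((self.seed, r, t, item))).random() < 0.5 else -1.0

    def update(self, item, value):
        for r, row in enumerate(self.rows):
            for t in range(len(row)):
                row[t] += self._sign(r, t, item) * value   # z_t += s_t(item) * value

    def estimate_f2(self):
        return statistics.median(sum(z * z for z in row) / len(row) for row in self.rows)

# Usage: feed each item's aggregated expected frequency once, e.g.
# sk = AMSSketch(); sk.update(3, 1.1); sk.update(9, 0.4); est = sk.estimate_f2()
```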
Outline
1 Optimal B-bucket Histograms
2 Approximate Histograms
3 Pmerge Based on Sampling
4 Experiments
Experiment setup
The tuple-model and value-model datasets are generated from the client id field of the 1998 WorldCup dataset and from atmospheric measurements from the SAMOS project.
The default experimental parameters:

Symbol   Definition              Default Value
B        number of buckets       400
n        domain size             100k (600k)
ℓ        depth of recursions     2
Running time vs. domain size n
[Figure (Tuple Model): running time in seconds (log scale, roughly 10^2 to 10^6) vs. domain size n (×10^3, from 10 to 200) for OptHist, PMerge, and RPMerge]
[Figure (Value Model): running time in seconds (log scale, roughly 10^2 to 10^6) vs. domain size n (×10^3, from 10 to 200) for OptHist, PMerge, and RPMerge]
Approximation ratio vs. domain size n
[Figure (Tuple Model): approximation ratio (between 1 and about 1.08) vs. domain size n (×10^3, from 10 to 200) for PMerge and RPMerge]
[Figure (Value Model): approximation ratio (between 1 and about 1.0004) vs. domain size n (×10^3, from 10 to 200) for PMerge and RPMerge]
Running time on large-scale probabilistic data vs. domain size n
[Figure (Tuple Model): running time in seconds (log scale, roughly 10^2 to 10^5) vs. domain size n (×10^3, from 200 to 1000) for RPMerge, Parallel-PMerge, and Parallel-RPMerge]
[Figure (Value Model): running time in ×10^2 seconds (log scale) vs. domain size n (×10^3, from 200 to 800) for RPMerge, Parallel-PMerge, and Parallel-RPMerge]
Conclusion & Future work
Conclusion
Novel approximation methods for constructing scalable histograms on large probabilistic data.
The quality of the approximate histograms is almost as good as that of the optimal histogram in practice.
Extended the techniques to distributed and parallel settings to further improve scalability.
Future work
Extend our study to probabilistic histograms with pdf bucket representatives, and handle histograms under other error metrics.
The end
Thank You
Q and A