The Complexity of Massive Data Set Computations
Ziv Bar-Yossef, Computer Science Division, U.C. Berkeley. Ph.D. Dissertation Talk, May 6, 2002.
The Complexity of Massive Data Set Computations
Ziv Bar-Yossef
Computer Science Division
U.C. Berkeley
Ph.D. Dissertation Talk
May 6, 2002
What Are Massive Data Sets?
Examples
• The Web
• IP packets
• Supermarket transactions
• Telephone call graph
• Astronomical observations
Characterizing properties
• Huge collections of raw data
• Data is generated and modified continuously
• Distributed over many sites
• Slow storage devices
• Data is not organized / indexed
Nontraditional Computational Challenges
Traditionally: cope with the difficulty of the problem
Massive Data Sets: cope with the size of the data and the restricted access to it
Restricted access to the data
• Random access: expensive
• “Streaming” access: more feasible
• Some data may be unavailable
• Fetching data is expensive
Sub-linear running time
• Ideally, independent of data size
Sub-linear space
• Ideally, logarithmic in data size
Basic Framework
Massive data set computations are typically:
• Approximate
• Randomized
• Run under a restricted access regime
[Figure: Input Data → Access Regime → Algorithm, with randomness ($$) → Approximate Output]
Prominent Computational Models for Massive Data Sets
• Sampling Computations
– Sub-linear running time & space
– Suitable for “insensitive” functions
• Data Stream Computations
– Linear running time, sub-linear space
– Can compute sensitive functions
• Sketch Computations
– Suitable for distributed data
Sampling Computations
[Figure: a sampling algorithm with randomness ($$) queries the input x1, x2, …, xn and outputs an approximation of f(x1,…,xn)]
• Query input at random locations
• Can choose query distribution and can query adaptively
• Complexity measure: query complexity
• Applications
– Statistical parameter estimation
– Computational and statistical learning [Valiant 84, Vapnik 98]
– Property testing [RS96, GGR96]
Data Stream Computations [HRR98, AMS96, FKSV99]
[Figure: the stream x1, x2, x3, …, xn passes through a data stream algorithm with randomness ($$) and limited memory, which outputs an approximation of f(x1,…,xn)]
• Input arrives in a one-way stream in arbitrary order
• Complexity measures: space and time per data item
• Applications
– Database (Frequency moments [AMS96])
– Networking (Lp distance [AMS96, FKSV99, FS00, Indyk 00])
– Web Information Retrieval (Web crawling, Google query logs [CCF02])
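To make the data stream model concrete, here is a minimal sketch of a one-pass, small-memory estimator for the second frequency moment F2, in the spirit of the random ±1 counter idea associated with [AMS96]. It is an illustration rather than the construction from the talk: the class name `F2Sketch`, the parameters `groups` and `per_group`, and the degree-3 polynomial hash over a Mersenne prime (which only approximates 4-wise independent signs) are all assumptions, and items are assumed to be non-negative integers.

```python
import random
import statistics

# Field size for the polynomial hash (an assumption of this sketch).
PRIME = (1 << 61) - 1


class F2Sketch:
    """One-pass estimator of F2 = sum_j f_j^2 using random +/-1 counters."""

    def __init__(self, groups=5, per_group=8, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        # One degree-3 polynomial per counter; its parity gives a pseudo-random sign.
        self.coeffs = [
            [rng.randrange(1, PRIME) for _ in range(4)]
            for _ in range(groups * per_group)
        ]
        self.counters = [0] * (groups * per_group)

    def _sign(self, coeffs, item):
        value = 0
        for c in coeffs:  # Horner evaluation of the polynomial at `item`
            value = (value * item + c) % PRIME
        return 1 if value & 1 else -1

    def update(self, item):
        """Process one stream item; the number of counters never grows."""
        for i, coeffs in enumerate(self.coeffs):
            self.counters[i] += self._sign(coeffs, item)

    def estimate(self):
        """Median over groups of the mean squared counter in each group."""
        means = []
        for g in range(self.groups):
            block = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(z * z for z in block) / self.per_group)
        return statistics.median(means)
```

Each call to `update` touches a fixed number of counters, which is what "space and time per data item" measures. On a stream with a single distinct item repeated m times every counter equals ±m, so the estimate is exactly m².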
Sketch Computations [GM98, BCFM98, FKSV99]
[Figure: sites holding x11 … x1k, x21 … x2k, …, xt1 … xtk each compress their data into a sketch; the sketch algorithm receives the sketches and outputs an approximation of f(x11,…,xtk). Randomness ($$) is available to the sites and to the algorithm.]
• Algorithm computes from data “sketches” sent from sites
• Complexity measure: sketch lengths
• Applications
– Web Information Retrieval (Identifying document similarities [BCFM98])
– Networking (Lp distance [FKSV99])
– Lossy compression, approximate nearest neighbor
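As an illustration of the sketch model, the following is a minimal min-hash style sketch for estimating document resemblance, loosely following the idea usually attributed to [BCFM98]. It is a simplified stand-in, not the method from the talk: the function names, the use of salted `blake2b` hashes in place of random permutations, and the default of 64 hash functions are assumptions. Each site would run `minhash_sketch` on its own non-empty set and send only the short sketch.

```python
import hashlib


def minhash_sketch(items, num_hashes=64):
    """Return the minimum salted hash value of the set, for each of num_hashes salts."""
    sketch = []
    for salt in range(num_hashes):
        best = min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        )
        sketch.append(best)
    return sketch


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of coordinates on which two sketches agree.

    This estimates |A ∩ B| / |A ∪ B| for the underlying sets A and B.
    """
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```

The referee sees only `num_hashes` integers per site, so the sketch length, not the document length, is the complexity measure.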
Main Objective
Explore the limitations of the above computational models
• Develop general lower bound techniques
• Obtain lower bounds for specific functions
Thesis Blueprint
[Diagram] Foundations: Statistical Decision Theory, Information Theory, Communication Complexity
• Sampling Computations: lower bounds for general functions [BKS01, B02]
• Communication Complexity: general CC lower bounds [BJKS02b]; one-way and simultaneous CC lower bounds [BJKS02a]
• Data Stream Computations: reduction from one-way CC
• Sketch Computations: reduction from simultaneous CC
Sampling Lower Bounds
(with R. Kumar and D. Sivakumar, STOC 2001, and Manuscript, 2002)
• Combinatorial lower bound [BKS01]
– bounds the expected query complexity of every function
– tends to be weak
– based on a generalization of Boolean block sensitivity [Nisan 89]
• Statistical lower bounds
– bound the query complexity of symmetric functions
– via Hellinger distance: worst-case query complexity [BKS01]
– via KL distance: expected query complexity [B02]
– tend to be tight
– work by a reduction from statistical hypothesis testing
• Information theory lower bound [B02]
– bounds the worst-case query complexity of symmetric functions
– has better dependence on the domain size
Main Idea
$(\epsilon,\delta)$-approximation: $\Pr(T(x) \in C(x)) \ge 1 - \delta$, where $C(x)$ is the approximation set of x
[Figure: approximation sets C(x), C(y), C(w) of inputs x, y, w; x and y are ε-disjoint inputs]
Main observation:
Since for all x, $T(x) \in C(x)$ w.p. $1 - \delta$, then:
x,y ε-disjoint ⇒ T(x), T(y) are “far” from each other
Main Result
Theorem
For any symmetric f and ε-disjoint inputs x,y, and for any algorithm that $(\epsilon,\delta)$-approximates f,
• Worst-case # of queries $\ge \Omega\left(\frac{1}{h^2(U_x,U_y)} \log\frac{1}{\delta}\right)$
• Expected # of queries $\ge \Omega\left(\frac{1}{KL(U_x,U_y)} \log\frac{1}{\delta}\right)$
• $U_x$ – uniform query distribution on x (induced by: pick i u.a.r., output $x_i$)
• Hellinger: $h^2(U_x,U_y) = 1 - \sum_a (U_x(a)\, U_y(a))^{1/2}$
• KL: $KL(U_x,U_y) = \sum_a U_x(a) \log(U_x(a) / U_y(a))$
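The two distances in the theorem are easy to compute for small inputs. The sketch below follows the definitions on this slide directly; the function names and the dictionary representation of distributions are assumptions made for illustration, and natural logarithms are used for KL.

```python
import math
from collections import Counter


def uniform_query_distribution(x):
    """U_x: pick an index i uniformly at random and output x[i]."""
    n = len(x)
    return {value: count / n for value, count in Counter(x).items()}


def hellinger_sq(p, q):
    """h^2(P, Q) = 1 - sum_a sqrt(P(a) * Q(a))."""
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(a, 0.0) * q.get(a, 0.0)) for a in support)


def kl_divergence(p, q):
    """KL(P, Q) = sum_a P(a) * log(P(a) / Q(a)); infinite if Q misses part of P's support."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0.0:
            continue
        qa = q.get(a, 0.0)
        if qa == 0.0:
            return math.inf
        total += pa * math.log(pa / qa)
    return total
```

For example, with `x = [1] * 6 + [0] * 4` and `y = [1] * 4 + [0] * 6`, the two query distributions differ by 0.1 on each value, and both distances come out on the order of 0.1² (about 0.02 for Hellinger and about 0.08 for KL).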
Example: Mean
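x: a $\frac{1}{2} + \epsilon$ fraction of 1s and a $\frac{1}{2} - \epsilon$ fraction of 0s
y: a $\frac{1}{2} - \epsilon$ fraction of 1s and a $\frac{1}{2} + \epsilon$ fraction of 0s
$h^2(U_x,U_y) = O(\epsilon^2)$ and $KL(U_x,U_y) = O(\epsilon^2)$
Theorem (originally, [CEG95])
Approximating the mean of n numbers in [0,1] to within additive error ε requires $\Omega\left(\frac{1}{\epsilon^2} \log\frac{1}{\delta}\right)$ queries.
Other applications: Selection functions, frequency moments, extractors and dispersers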
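The distance estimates for the mean example can be checked by hand. The following worked computation assumes the two inputs described on this slide (x has a ½ + ε fraction of ones, y has a ½ − ε fraction of ones) and $0 < \epsilon \le 1/4$; the constants are not taken from the talk.

```latex
\begin{align*}
h^2(U_x,U_y) &= 1 - 2\sqrt{\left(\tfrac12+\epsilon\right)\left(\tfrac12-\epsilon\right)}
              = 1 - \sqrt{1 - 4\epsilon^2} \;\le\; 4\epsilon^2, \\
KL(U_x,U_y)  &= \left(\tfrac12+\epsilon\right)\ln\frac{\tfrac12+\epsilon}{\tfrac12-\epsilon}
              + \left(\tfrac12-\epsilon\right)\ln\frac{\tfrac12-\epsilon}{\tfrac12+\epsilon}
              = 2\epsilon \ln\frac{1+2\epsilon}{1-2\epsilon} \\
             &\le 2\epsilon \cdot \frac{4\epsilon}{1-2\epsilon} \;\le\; 16\epsilon^2 .
\end{align*}
```

The first bound uses $1 - \sqrt{1-t} \le t$ for $t \in [0,1]$, and the second uses $\ln(1+u) \le u$ together with $\epsilon \le 1/4$. Substituting either bound into the main theorem gives a query lower bound of order $\frac{1}{\epsilon^2}\log\frac{1}{\delta}$.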
Proof Outline
1. For symmetric functions, WLOG, all queries are uniform without replacement
2. If # of queries is $\le n^{1/2}$, can further assume queries are uniform with replacement
3. For any ε-disjoint inputs x,y: $(\epsilon,\delta)$-approximation of f with k queries ⇒ hypothesis test of $U_x$ against $U_y$ with error δ and k samples
4. Hypothesis testing lower bounds
• via Hellinger distance (worst-case)
• via KL distance (expected) (cf. [Siegmund 85])
Statistical Hypothesis Testing
[Figure: a black box containing P or Q emits k i.i.d. samples to the hypothesis test]
• Black box contains either P or Q
• Test has to decide: “P” or “Q”
• Allowed error probability: δ
• Goal: minimize k
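One standard way to build such a test is to compare likelihoods. The sketch below is an illustrative likelihood-ratio test for two known discrete distributions, not a construction from the talk; the function name and the dictionary encoding of P and Q are assumptions.

```python
import math


def likelihood_ratio_test(samples, p, q):
    """Decide "P" or "Q" from i.i.d. samples by comparing log-likelihoods."""
    log_ratio = 0.0
    for s in samples:
        ps, qs = p.get(s, 0.0), q.get(s, 0.0)
        if ps == 0.0 and qs == 0.0:
            raise ValueError(f"sample {s!r} is impossible under both P and Q")
        if ps == 0.0:
            return "Q"  # the sample cannot have come from P
        if qs == 0.0:
            return "P"  # the sample cannot have come from Q
        log_ratio += math.log(ps / qs)
    return "P" if log_ratio >= 0 else "Q"
```

With `P = {1: 0.6, 0: 0.4}` and `Q = {1: 0.4, 0: 0.6}`, a sample with more ones than zeros is classified as "P". The closer P and Q are, the more samples k the test needs before its error probability drops below δ, which is the quantity the lower bounds in this part of the talk control.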
Sampling Algorithm ⇒ Hypothesis Test
x,y: ε-disjoint inputs ($C(x) \cap C(y) = \emptyset$)
[Figure: a black box containing Ux or Uy emits k i.i.d. samples to the sampling algorithm]
• “Ux” – if output $\in C(x)$
• “Uy” – otherwise
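Lower Bound via Hellinger Distance
Hypothesis test for $U_x$ against $U_y$ with error δ and k samples ⇒ $V(U_x^k, U_y^k) \ge 1 - 2\delta$
Lemma (cf. Le Cam, Yang 90)
1. $1 - h^2(U_x^k, U_y^k) = (1 - h^2(U_x, U_y))^k$
2. $V \le h \sqrt{2 - h^2}$
Corollary: $k \ge \Omega\left(\frac{1}{h^2(U_x,U_y)} \log\frac{1}{\delta}\right)$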
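The corollary follows from a short calculation. The derivation below is a sketch under stated assumptions: V denotes total variation distance, $h^2 = 1 - \sum_a \sqrt{P(a)Q(a)}$, a δ-error test forces $V(U_x^k,U_y^k) \ge 1 - 2\delta$, and the two standard facts used are $V \le h\sqrt{2-h^2}$ and multiplicativity of $1 - h^2$ over independent samples. The restriction $h^2(U_x,U_y) \le 1/2$ and the constants are chosen for convenience.

```latex
\begin{align*}
1 - 2\delta \;\le\; V(U_x^k,U_y^k)
  &\le \sqrt{1 - \bigl(1 - h^2(U_x^k,U_y^k)\bigr)^2} \\
\Longrightarrow\quad \bigl(1 - h^2(U_x,U_y)\bigr)^{2k}
  &= \bigl(1 - h^2(U_x^k,U_y^k)\bigr)^2 \;\le\; 1 - (1-2\delta)^2 \;\le\; 4\delta \\
\Longrightarrow\quad k \cdot \ln\frac{1}{1 - h^2(U_x,U_y)}
  &\ge \tfrac12 \ln\frac{1}{4\delta} \\
\Longrightarrow\quad k
  &\ge \frac{1}{4\,h^2(U_x,U_y)} \ln\frac{1}{4\delta}
  \qquad \text{(using } -\ln(1-t) \le 2t \text{ for } 0 \le t \le \tfrac12\text{)} .
\end{align*}
```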
Communication Complexity [Yao 79]
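[Figure: Alice holds $x \in X$ and Bob holds $y \in Y$; each has randomness ($$); they exchange messages to compute f(x,y)]
$f: X \times Y \to Z$
$R_\delta(f)$ = randomized CC of f with error δ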
Multi-Party Communication
$f: X_1 \times \dots \times X_t \to Z$
[Figure: players P1, P2, P3, …, Pt hold inputs x1, x2, x3, …, xt and jointly compute f(x1,…,xt)]
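Example: Set-disjointness
t-party set-disjointness: $P_i$ gets $S_i \subseteq [n]$,
$\mathrm{Disj}_t = \begin{cases} 1 & \left|\bigcap_i S_i\right| = 1 \\ 0 & \forall i \ne j,\ S_i \cap S_j = \emptyset \end{cases}$
Theorem [KS87, R90]: $R_\delta(\mathrm{Disj}_2) = \Omega(n)$
Theorem [AMS96]: $R_\delta(\mathrm{Disj}_t) = \Omega(n/t^4)$
Best upper bound: $R_\delta(\mathrm{Disj}_t) = O(n/t)$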
Restricted Communication Models
One-Way Communication [PS84, Ablayev 93, KNR95]
[Figure: P1 → P2 → … → Pt → f(x1,…,xt)]
• Reduces to data stream computations
Simultaneous Communication [Yao 79]
[Figure: P1, P2, …, Pt each send a message to a Referee, who outputs f(x1,…,xt)]
• Reduces to sketch computations
Example: Disjointness ⇒ Frequency Moments
Input stream: $a_1,\dots,a_m \in [n]$
For $j \in [n]$, $f_j$ = # of occurrences of j in $a_1,\dots,a_m$
$F_k(a_1,\dots,a_m) = \sum_{j \in [n]} (f_j)^k$ (k-th frequency moment)
Theorem [AMS96]: $\mathrm{Disj}_{n^{1/k}} \Rightarrow F_k$
Corollary: $DS(F_k) = n^{\Omega(1)}$, k > 5
Best upper bounds: $DS(F_k) = n^{O(1)}$, k > 2
$DS(F_k) = O(\log n)$, k = 0,1,2
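The intuition behind reducing disjointness to frequency moments can be seen on small inputs. The sketch below computes $F_k$ exactly (using linear space, so it is not a streaming algorithm) and feeds it a stream formed by concatenating the players' sets; the helper names are hypothetical, and the reduction is presented here only informally.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k = sum_j f_j^k, where f_j counts occurrences of j in the stream."""
    return sum(count ** k for count in Counter(stream).values())


def stream_from_sets(sets):
    """Concatenate the players' sets into one stream, in player order."""
    return [element for s in sets for element in sorted(s)]
```

If the t sets are pairwise disjoint, every frequency is 1 and $F_k$ equals the stream length, which is at most n. If all sets share one element, that element alone contributes $t^k$. For instance, `[{1, 2}, {3, 4}, {5, 6}]` gives $F_3 = 6$ while `[{1, 2}, {1, 3}, {1, 4}]` gives $F_3 = 30$. With $t = n^{1/k}$ the gap is a constant factor, so an approximation of $F_k$ separates the two cases.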
Information Statistics Approach to Communication Complexity
(with T.S. Jayram, R. Kumar, and D. Sivakumar, Manuscript 2002)
A novel lower bound technique for randomized CC based on statistics and information theory
Applications
• General CC lower bounds
– t-party set-disjointness: $\Omega(n/t^2)$ (improving on [AMS96])
– Lp (solving an open problem of [Saks-Sun 02])
– Inner product
• One-way CC lower bounds
– t-party set-disjointness: $\Omega(n/t^{1+\epsilon})$ for any $\epsilon > 0$
• Space lower bounds in the data stream model
– frequency moments: $n^{\Omega(1)}$, k > 2 (proving conjecture of [AMS96])
– Lp distance
Statistical View of Communication Complexity
Π – a δ-error randomized protocol for $f: X \times Y \to Z$
Π(x,y) – distribution over transcripts
Lemma: For any two input pairs (x,y), (x’,y’) with $f(x,y) \ne f(x',y')$,
$V(\Pi(x,y),\Pi(x',y')) \ge 1 - 2\delta$
Proof: By reduction from hypothesis testing.
Corollary: $h^2(\Pi(x,y),\Pi(x',y')) \ge 1 - 2\delta^{1/2}$
Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
For a protocol Π that computes f, how much information does Π(x,y) have to reveal about (x,y)?
μ = (X,Y) – a distribution over inputs of f
Definition: μ-information cost
$\mathrm{icost}_\mu(\Pi) = I(X,Y ; \Pi(X,Y))$
$\mathrm{icost}_{\mu,\delta}(f) = \min_\Pi \{\mathrm{icost}_\mu(\Pi)\}$
$I(X,Y ; \Pi(X,Y)) \le H(\Pi(X,Y)) \le |\Pi(X,Y)|$
Information cost lower bound ⇒ CC lower bound
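For a small example, the information cost of a protocol can be computed by brute force. The sketch below handles only deterministic protocols, which is a simplifying assumption (the talk concerns randomized ones); the function names and the toy input distribution are illustrative and not taken from the talk.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A; B) in bits, for a joint distribution given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), pr in joint.items():
        pa[a] += pr
        pb[b] += pr
    return sum(
        pr * math.log2(pr / (pa[a] * pb[b]))
        for (a, b), pr in joint.items()
        if pr > 0
    )


def information_cost(input_dist, protocol):
    """I(X,Y ; transcript) for a deterministic protocol mapping (x, y) to a transcript."""
    joint = defaultdict(float)
    for (x, y), pr in input_dist.items():
        joint[((x, y), protocol(x, y))] += pr
    return mutual_information(joint)
```

With the toy distribution `{(0, 0): 0.5, (1, 0): 0.25, (0, 1): 0.25}`, a protocol whose transcript is both input bits has information cost 1.5 bits, below its 2-bit transcript length, in line with $I \le H \le |\Pi|$. A protocol that sends only `x & y` reveals nothing on this distribution, because the AND is always 0 there.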
Direct Sum for Information Cost
Decomposable functions:
$f(x,y) = g(h(x_1,y_1),\dots,h(x_n,y_n))$,
$h: X_i \times Y_i \to \{0,1\}$, $g: \{0,1\}^n \to \{0,1\}$
Example: Set Disjointness $\mathrm{Disj}_2(x,y) = (x_1 \wedge y_1) \vee \dots \vee (x_n \wedge y_n)$
Theorem (direct sum): For appropriately chosen μ, μ’,
$\mathrm{icost}_{\mu,\delta}(f) \ge n \cdot \mathrm{icost}_{\mu',\delta}(h)$
Lower bound on icost(h) ⇒ Lower bound on icost(f)
Information Cost of Single-Bit Functions
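In $\mathrm{Disj}_2$, $\mu' = \frac{1}{2}\mu'_1 + \frac{1}{2}\mu'_2$, where:
$\mu'_1 = \frac{1}{2}(1,0) + \frac{1}{2}(0,0)$, $\mu'_2 = \frac{1}{2}(0,1) + \frac{1}{2}(0,0)$
Lemma 1: For any protocol Π for AND,
$\mathrm{icost}_{\mu'}(\Pi) \ge \Omega(h^2(\Pi(0,1),\Pi(1,0)))$
Lemma 2: $h^2(\Pi(0,1),\Pi(1,0)) = h^2(\Pi(1,1),\Pi(0,0))$
Corollary 1: $\mathrm{icost}_{\mu',\delta}(\mathrm{AND}) \ge \Omega(1 - 2\delta^{1/2})$
Corollary 2: $\mathrm{icost}_{\mu,\delta}(\mathrm{Disj}_2) \ge \Omega(n \cdot (1 - 2\delta^{1/2}))$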
Proof of Lemma 2
“Rectangle” property of deterministic protocols:
For any transcript τ, the set of all (x,y) with Π(x,y) = τ is a “combinatorial rectangle”: $S \times T$, where $S \subseteq X$ and $T \subseteq Y$
“Rectangle” property of randomized protocols:
For all $x \in X$, $y \in Y$, there exist functions $p_x: \{0,1\}^* \to [0,1]$ and $q_y: \{0,1\}^* \to [0,1]$, such that for any possible transcript τ,
$\Pr(\Pi(x,y) = \tau) = p_x(\tau) \cdot q_y(\tau)$
$h^2(\Pi(0,1),\Pi(1,0)) = 1 - \sum_\tau (\Pr(\Pi(0,1) = \tau) \cdot \Pr(\Pi(1,0) = \tau))^{1/2}$
$= 1 - \sum_\tau (p_0(\tau) \cdot q_1(\tau) \cdot p_1(\tau) \cdot q_0(\tau))^{1/2} = h^2(\Pi(0,0),\Pi(1,1))$
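The identity in Lemma 2 can be checked numerically on a toy protocol whose transcript probabilities factor as a product of an Alice-dependent term and a Bob-dependent term. In the sketch below, Alice sends a bit a drawn from a distribution depending on x, then Bob sends a bit b drawn from a distribution depending on y and a. The probability tables are hypothetical values chosen for illustration.

```python
import math


def transcript_distribution(p_alice, q_bob, x, y):
    """Pr(transcript = (a, b)) = p_x(a) * q_y(a, b) for a two-message toy protocol."""
    dist = {}
    for a in (0, 1):
        for b in (0, 1):
            dist[(a, b)] = p_alice[x][a] * q_bob[y][a][b]
    return dist


def hellinger_sq(p, q):
    """h^2(P, Q) = 1 - sum_t sqrt(P(t) * Q(t))."""
    return 1.0 - sum(math.sqrt(p[t] * q.get(t, 0.0)) for t in p)


# Hypothetical message probabilities: P_ALICE[x][a] and Q_BOB[y][a][b].
P_ALICE = {0: [0.8, 0.2], 1: [0.3, 0.7]}
Q_BOB = {
    0: [[0.9, 0.1], [0.6, 0.4]],
    1: [[0.5, 0.5], [0.2, 0.8]],
}


def cut_and_paste_gap():
    """Return h^2(Pi(0,1), Pi(1,0)) - h^2(Pi(0,0), Pi(1,1)) for the toy protocol."""
    pi = {
        (x, y): transcript_distribution(P_ALICE, Q_BOB, x, y)
        for x in (0, 1)
        for y in (0, 1)
    }
    lhs = hellinger_sq(pi[(0, 1)], pi[(1, 0)])
    rhs = hellinger_sq(pi[(0, 0)], pi[(1, 1)])
    return lhs - rhs
```

The gap is zero up to floating-point rounding: both sides sum the same four-way products, only with the factors grouped by different input pairs.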
Conclusions
• Studied limitations of computing on massive data sets
– Sampling computations
– Data stream computations
– Sketch computations
• Lower bound methodologies are based on
– Information theory
– Statistical decision theory
– Communication complexity
• Lower bound techniques:
– Reveal novel aspects of the models
– Present a “template” for obtaining specific lower bounds
Open Problems
• Sampling
– Lower bounds for non-symmetric functions
– Property testing lower bounds
• Communication complexity
– Study the communication complexity of approximations
– Tight lower bound for t-party set disjointness
– Under what circumstances are one-way and simultaneous communication equivalent?
Thank You!
Yao’s Lemma [Yao 83]
Definition: μ-distributional CC ($D_{\mu,\delta}(f)$)
Complexity of best deterministic protocol that computes f with error δ on inputs drawn according to μ
Yao’s Lemma: $R_\delta(f) \ge \max_\mu D_{\mu,\delta}(f)$
• Convenient technique to prove randomized CC lower bounds
Communication Complexity Lower Bounds via Information Theory
(with T.S. Jayram, R. Kumar, and D. Sivakumar, Complexity 2002)
• A novel information theory paradigm for proving CC lower bounds
• Applications
– Characterization results (w.r.t. product distributions):
  • 1-way ≈ simultaneous
  • 2-party 1-way ≈ t-party 1-way
  • VC dimension characterization of t-party 1-way CC
– Optimal lower bounds for simultaneous CC
  • t-party set-disjointness: $\Omega(n/t)$
  • Generalized addressing function
Information Theory
[Figure: sender → noisy channel → receiver; the sender transmits $m \in M$ and the receiver gets $r \in R$]
• M – distribution of transmitted messages
• R – distribution of received messages
• Goal of receiver: reconstruct m from r
• $\delta_g$: error probability of a reconstruction function g
Fano’s Inequality: For all g, $H_2(\delta_g) \ge H(M \mid R)$
MLE Principle: $\delta_{MLE} \le H(M \mid R)$
For a Boolean M
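Both statements can be sanity-checked on the simplest noisy channel. The sketch below assumes a uniform bit M sent over a binary symmetric channel that flips the bit with probability `flip`; this channel and the function names are illustrative assumptions. In that setting the maximum-likelihood guess errs with probability min(flip, 1 − flip), and H(M | R) equals the binary entropy of `flip`.

```python
import math


def binary_entropy(p):
    """H2(p) in bits, with H2(0) = H2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_quantities(flip):
    """Return (MLE error probability, H(M | R)) for a uniform bit over a BSC."""
    # The receiver guesses the likelier value of M given R.
    mle_error = min(flip, 1 - flip)
    # By symmetry of the channel, H(M | R) = H2(flip).
    conditional_entropy = binary_entropy(flip)
    return mle_error, conditional_entropy
```

For `flip = 0.1` the MLE error is 0.1 and H(M | R) is about 0.469 bits, so Fano's inequality holds with equality and the MLE principle holds with room to spare.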
Information Theory View of Distributional CC
• x,y are distributed according to μ = (X,Y)
• “God” transmits f(x,y) to Alice & Bob
• Alice & Bob receive the transcript Π(x,y)
• Fano’s inequality: For any δ-error protocol Π for f,
$H_2(\delta) \ge H(f(X,Y) \mid \Pi(X,Y))$
[Figure: “God” sends f(x,y) through the CC protocol, which acts as the channel; Alice & Bob receive Π(x,y)]
Simultaneous CC vs. One-Way CC
Theorem
For every product distribution $\mu = X \times Y$, and every Boolean f,
$D_{\mu,2H_2(\delta),\mathrm{sim}}(f) \le D_{\mu,\delta,A \to B}(f) + D_{\mu,\delta,B \to A}(f)$
Proof
A(x) – message of A on x in a δ-error A → B protocol for f
B(y) – message of B on y in a δ-error B → A protocol for f
Construct a SIM protocol for f:
A → Referee: A(x)
B → Referee: B(y)
Referee outputs MLE(f(X,Y) | A(x), B(y))
Simultaneous CC vs. One-Way CC
Proof (cont.)
By MLE Principle,
$\Pr(\mathrm{MLE}(f(X,Y) \mid A(X),B(Y)) \ne f(X,Y)) \le H(f(X,Y) \mid A(X),B(Y))$
By Fano,
$H(f(X,Y) \mid A(X),Y) \le H_2(\delta)$ and $H(f(X,Y) \mid X,B(Y)) \le H_2(\delta)$
Lemma: For independent X,Y,
$H(f(X,Y) \mid A(X),B(Y)) \le H(f(X,Y) \mid A(X),Y) + H(f(X,Y) \mid X,B(Y))$
Our protocol errs with probability at most $2H_2(\delta)$ □