
Background Skew Multi-round Open Problems

Multi-join Query Evaluation on Big Data, Lecture 4

Dan Suciu

March, 2015

Dan Suciu Multi-Joins – Lecture 4 March, 2015 1 / 29


Multi-join Query Evaluation – Outline

Part 1 Optimal Sequential Algorithms.

Part 2 Lower bounds for Parallel Algorithms.

Part 3 Optimal Parallel Algorithms.

Part 4 Data Skew.


Summary so far

$Q(x) = R_1(x_1), \ldots, R_\ell(x_\ell)$, with $|R_1| = m_1, \ldots, |R_\ell| = m_\ell$

Sequential World. Cost: output size of $Q$.

Upper bound: $m^{\rho^*}$; general: $m_1^{u_1} \cdots m_\ell^{u_\ell}$. Fractional edge cover.

Lower bound (tightness): fractional vertex packing. Generic-join algorithm.

Parallel World. Cost: communication. 1-round, skew-free, equal-cardinalities.

Lower bound: $m / p^{1/\tau^*}$; general: $\left( m_1^{u_1} \cdots m_\ell^{u_\ell} / p \right)^{1/\sum_j u_j}$. Fractional edge packing.

Upper bound: fractional vertex cover. HyperCube algorithm.


Outline of Lecture 4

Background: Hash-Based Partition

Coping with Skew

Multi-rounds (very short)

Open Problems

Will consider only databases without skew


Understanding Skew (Review from Lecture 2)

Given: $R(x, y)$ with $m$ items, $p$ servers. Send each tuple $R(x, y)$ to server $h(y)$, where $h$ = random function.

Claim 1. For each server $u$, its expected load is $E[L_u] = m/p$. Why?

Claim 2. We say that $R$ is skewed if some value $y$ occurs more than $m/p$ times in $R$. Then some server has load $> m/p$.

Claim 3. If $R$ is not skewed, the maximum load of all servers is $O(m/p)$ with high probability. (Details in lecture 4.)

Take-away: we assume no skew, meaning every frequency is $\le m/p$; then:

Max-load = O(Expected load)


Hash-Based Partition

A hash function is a function from some domain $D$ to a range of integers $[p]$. E.g. $h : \text{char}(30) \to \{1, \ldots, p\}$

A random family of hash functions is a set of hash functions, from which we select one at random.

A strongly universal family of hash functions is a set with the property that for any distinct values $x_1, \ldots, x_n \in D$, and any outputs $(u_1, \ldots, u_n) \in [p]^n$,

$$P(h(x_1) = u_1 \wedge \cdots \wedge h(x_n) = u_n) = \frac{1}{p^n}$$


Hash-Based Partition

Let $R$ be a bag ("multi-set") with $m$ elements.

Partition $R$ into $p$ bins, using a hash function $h$: send element $x \in R$ to bin $h(x) \in [p]$.

Denote $L_u$ = number of elements in bin $u \in [p]$ (a random variable).

Example: $h(x) = [(x - \text{'b'}) \bmod 3] + 1$

x:    a a a b c c d e e e
h(x): 3 3 3 1 2 2 3 1 1 1

$L_1 = 4$, $L_2 = 2$, $L_3 = 4$

Q: What is $E[L_u]$, for a fixed $u$? A: $E[L_u] = m/p$

Q: What is $E[\max_u L_u]$? A: May be as large as $m$.
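The example on this slide can be reproduced in a few lines. This is a minimal sketch: the function names `h` and `bin_loads` are illustrative, and the hash is the fixed toy function from the example, not a randomly chosen one.

```python
from collections import Counter

def h(x: str, p: int = 3) -> int:
    """Toy hash from the slide: ((x - 'b') mod p) + 1, on single characters."""
    return (ord(x) - ord("b")) % p + 1

def bin_loads(bag, p=3):
    """Return {bin u: L_u} for bins 1..p."""
    counts = Counter(h(x, p) for x in bag)
    return {u: counts.get(u, 0) for u in range(1, p + 1)}

R = list("aaabccdeee")
print(bin_loads(R))  # {1: 4, 2: 2, 3: 4}
```

A bag that repeats a single value, such as `list("aaaaaa")`, lands entirely in one bin, which is the sense in which the maximum load may be as large as m.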


Hash-Based Partition

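Let $d_x$ denote the number of occurrences of the value $x$ in $R$.

Theorem

Let $\alpha > 0$ be a constant such that $d_x \le \frac{m}{\alpha \cdot p}$, for all $x$. Then, for all $\delta \ge 0$:

$$\forall u \in [p]: \quad P\left(L_u > (1+\delta)\frac{m}{p}\right) < \frac{1}{2^{\alpha \cdot \delta}}$$

$$P\left(\max_u L_u > (1+\delta)\frac{m}{p}\right) < \frac{p}{2^{\alpha \cdot \delta}}$$

Exercise on the Problem Set: prove it (hint: use Bennett's theorem).

Application: if $\alpha\delta \ge (1 + c) \log p$ for some constant $c > 0$, then

$$P\left(\max_u L_u > (1+\delta)\frac{m}{p}\right) < 1/p^c$$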
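The step from the per-server bound to the bound on the maximum, and from there to the application, is a union bound followed by substitution. The sketch below assumes that the logarithm is base 2 and that the theorem's per-server bound reads $2^{-\alpha\delta}$:

```latex
\begin{align*}
P\Big(\max_u L_u > (1+\delta)\tfrac{m}{p}\Big)
  &\le \sum_{u \in [p]} P\Big(L_u > (1+\delta)\tfrac{m}{p}\Big)
   < p \cdot 2^{-\alpha\delta} \\
  &\le p \cdot 2^{-(1+c)\log_2 p}
   = p \cdot p^{-(1+c)}
   = p^{-c}.
\end{align*}
```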


Summary on Hash-Based Partition

Basic Partition. Send item $R(x)$ to bin $h(x)$. If $R$ has no skew (degrees $\le m/p$) then Max-Size = O(Expected-Size) = $O(m/p)$, w.h.p. (hiding a $\log p$ factor)

HyperCube Partition. Send $R(x_1, \ldots, x_k)$ to bin $(h_1(x_1), \ldots, h_k(x_k))$. If $R$ has no skew then Max-Size = O(Expected-Size) = $O(m/p)$ w.h.p.

Given shares $p_1 p_2 \cdots p_k = p$, no skew means: $\forall S \subseteq [k]$, every value of $(x_i)_{i \in S}$ occurs $\le m / \prod_{i \in S} p_i$ times:

$x_1 = a$ occurs $\le m/p_1$ times
$x_2 = b$ occurs $\le m/p_2$ times
$(x_1 = a, x_2 = b)$ occurs $\le m/(p_1 p_2)$ times, etc.

The hidden $\log p$ factor becomes a poly-log factor.
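A minimal sketch of the HyperCube partition and of the no-skew condition above. The names `make_hash`, `hypercube_partition` and `has_skew` are illustrative, and the lazily filled random table is only a stand-in for a hash function drawn from a suitable family; bins are numbered from 0 here.

```python
import random
from collections import Counter
from itertools import combinations
from math import prod

def make_hash(p_i, seed):
    """Stand-in for a random hash into {0, ..., p_i - 1} (an assumption, not a real family)."""
    rng = random.Random(seed)
    table = {}
    def h(value):
        if value not in table:
            table[value] = rng.randrange(p_i)
        return table[value]
    return h

def hypercube_partition(tuples, shares, seed=0):
    """Send each k-tuple to bin (h_1(x_1), ..., h_k(x_k)); return the bin loads."""
    hashes = [make_hash(p_i, seed + i) for i, p_i in enumerate(shares)]
    return Counter(tuple(h(x) for h, x in zip(hashes, t)) for t in tuples)

def has_skew(tuples, shares):
    """True if some value of (x_i)_{i in S} occurs more than m / prod_{i in S} p_i times."""
    m, k = len(tuples), len(shares)
    for size in range(1, k + 1):
        for S in combinations(range(k), size):
            threshold = m / prod(shares[i] for i in S)
            degrees = Counter(tuple(t[i] for i in S) for t in tuples)
            if degrees and max(degrees.values()) > threshold:
                return True
    return False
```

For example, the 16 pairs `(i, j)` with `i, j` in `range(4)` have no skew for shares `(2, 2)`, while 16 pairs sharing the same first component do.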


Heavy Hitters

Definition

(Informal) A value (or tuple of values) is a heavy hitter if it occurs more often than its skew threshold.

Example:

Basic partition of $R(x, y)$ into $p$ bins: $x$ is a heavy hitter if $d_x > m/p$.

HyperCube partition of $R(x, y, z)$ into a cube $p_1 p_2 p_3$:

$x$ is a heavy hitter if $d_x > m/p_1$.
$y$ is a heavy hitter if $d_y > m/p_2$.
A pair $(x, y)$ is a heavy hitter if $d_{xy} > m/(p_1 p_2)$.
A triple $(x, y, z)$ is never heavy. (Why?)
etc.

Fact

There are at most $O(p)$ heavy hitters. (Why?)
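For the basic partition, heavy hitters can be found with one pass that counts degrees. This is a sketch under the assumption that the column fits in memory on one machine; `heavy_hitters` is an illustrative name, and a distributed system would count per server and merge.

```python
from collections import Counter

def heavy_hitters(values, p):
    """Return {value: degree} for the values that occur more than m/p times."""
    m = len(values)
    degrees = Counter(values)
    return {v: d for v, d in degrees.items() if d > m / p}

column = ["a"] * 6 + ["b"] * 3 + ["c"]   # m = 10
print(heavy_hitters(column, p=4))        # {'a': 6, 'b': 3}
```

Because the degrees sum to m, fewer than p values can exceed m/p, so the returned dictionary always has fewer than p entries.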


Heavy Hitters

Two ways to cope with heavy hitters:

Unknown Heavy Hitters. Design the algorithm to be robust to heavy hitters.

Known Heavy Hitters. Find all heavy hitters (only $O(p)$) and treat them specially. This is by far preferred in practice.


Unknown Heavy Hitters

Consider a join: $Q(x, y, z) = R(x, y), S(y, z)$.

Algorithm 1. Use shares $p_x = 1$, $p_y = p$, $p_z = 1$ (standard hash-join).

If data has no skew: $L = m/p$. If data is skewed: $L = m$. Sensitive to skew.

Algorithm 2. Use shares $p_x = p_y = p_z = p^{1/3}$.

If data has no skew: $L = m/p^{2/3}$. If data is skewed: $L = m/p^{1/3}$. Resilient to skew.

This observation generalizes: for any query, we can compute shares that minimize the load under the worst skew.
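The four loads above can be checked on a toy instance. This sketch only counts tuples of R per server and uses `v % share` as a deterministic stand-in for a hash function; the inputs `no_skew` and `skewed` and the helper `max_load_R` are made up for illustration.

```python
from collections import Counter

def max_load_R(tuples, px, py):
    """Max number of R(x, y) tuples on any server when R is hashed on (x, y)
    with shares (px, py). Replication along the z dimension does not change
    the per-server count. Uses v % share as a toy deterministic hash."""
    loads = Counter((x % px, y % py) for x, y in tuples)
    return max(loads.values())

m, p = 6400, 64
share = round(p ** (1 / 3))                    # 4 servers per dimension

no_skew = [(i, i // share) for i in range(m)]  # x and y both spread out
skewed = [(i, 0) for i in range(m)]            # a single y value

# Algorithm 1: shares (1, p, 1), i.e. hash on y only.
print(max_load_R(no_skew, 1, p), max_load_R(skewed, 1, p))            # 100 6400
# Algorithm 2: shares (p^(1/3), p^(1/3), p^(1/3)).
print(max_load_R(no_skew, share, share), max_load_R(skewed, share, share))  # 400 1600
```

With p = 64 the printed values match m/p = 100, m = 6400, m/p^(2/3) = 400 and m/p^(1/3) = 1600.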


Known Heavy Hitters

Let HH be the set of all heavy hitter values. $|HH| = O(p)$

For each relation $R_j(x_j)$, variables $z \subseteq x_j$, constants $v \in \text{Domain}^z$, let:

$m_{j,z}[v]$ = number of tuples in $R_j$ that have $z = v$

In particular, $m_{j,\emptyset}[()] = m_j$

Definition

Given a database instance $R_1, \ldots, R_\ell$, its statistics are the set of numbers:

$$\Sigma = \{ m_{j,z}[v] \mid j = 1, \ldots, \ell,\ z \subseteq x_j,\ v \in HH^z \}$$

Note: $|\Sigma| = O(p)$, small enough that we can broadcast to all servers.

Open Problem: design a Σ-optimal query evaluation algorithm (provably optimal for the set of statistics Σ).


Discussion

A Σ-optimal algorithm represents the sweet spot between worst-case algorithms and instance-optimal algorithms [Ngo'14].

In practice, systems often compute on the fly some statistics in Σ in order to avoid significant skew, e.g. skew-join in PigLatin.

No Σ-optimal algorithm for arbitrary queries is known to date. [Beame'14] describe an algorithm that is optimal only within a poly-log p factor (due to the algorithm, not to the hash function).

Σ-optimal algorithms are known for a few special cases, in particular for a join [Beame'14]. We will discuss this next.


Σ-Optimal Join Algorithm

$Q(x, y, z) = R(x, y), S(y, z)$, with $|R| = m_R$, $|S| = m_S$.

$\forall v \in \text{Domain}$: $m_R[v]$ = degree of $v$ in $R$, and $m_S[v]$ = degree of $v$ in $S$

(Thus: $\sum_v m_R[v] = m_R$ and $\sum_v m_S[v] = m_S$)

Heavy hitters:

$HH_R = \{v \mid m_R[v] \ge m_R/p\}$
$HH_S = \{v \mid m_S[v] \ge m_S/p\}$
$HH = HH_R \cup HH_S$

Statistics:

$$\Sigma = \{m_R, m_S\} \cup \{m_R[v] \mid v \in HH_R\} \cup \{m_S[v] \mid v \in HH_S\}$$
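The statistics Σ for this join can be gathered as sketched below, assuming both relations are available as in-memory lists of pairs. The function name `join_statistics` and its return layout are illustrative choices, not part of the lecture.

```python
from collections import Counter

def join_statistics(R, S, p):
    """R: list of (x, y); S: list of (y, z).
    Returns (m_R, m_S, hh_R, hh_S), where hh_R maps each v in HH_R to m_R[v]
    and hh_S maps each v in HH_S to m_S[v]."""
    m_R, m_S = len(R), len(S)
    deg_R = Counter(y for _, y in R)   # degree of each join value in R
    deg_S = Counter(y for y, _ in S)   # degree of each join value in S
    hh_R = {v: d for v, d in deg_R.items() if d >= m_R / p}
    hh_S = {v: d for v, d in deg_S.items() if d >= m_S / p}
    return m_R, m_S, hh_R, hh_S
```

Each of the two dictionaries holds at most p entries, which is what keeps Σ small enough to broadcast.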


Cartesian Product Revisited

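A heavy hitter transforms the join into a cartesian product; let's revisit it.

$Q(x, z) = R(x), S(z)$.

The load is: $L = \max_u \left( \frac{m_1^{u_1} m_2^{u_2}}{p} \right)^{1/(u_1+u_2)}$ (here $m_1 = m_R$ and $m_2 = m_S$)

u        L(u)                     Shares p_x, p_z
(1,1)    $(m_R m_S / p)^{1/2}$    $\sqrt{p\, m_R / m_S}$, $\sqrt{p\, m_S / m_R}$
(1,0)    $m_R / p$                $p$, $1$
(0,1)    $m_S / p$                $1$, $p$

$L = \max\left( (m_R m_S / p)^{1/2},\ m_R/p,\ m_S/p \right)$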
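The table can be evaluated directly. This is a sketch with an illustrative function name, `cartesian_load`; the shares are left real-valued, whereas a real system would have to round them to integers.

```python
from math import sqrt

def cartesian_load(m_R, m_S, p):
    """Return (load, (p_x, p_z)) for the row of the table with the largest load."""
    rows = [
        (sqrt(m_R * m_S / p), (sqrt(p * m_R / m_S), sqrt(p * m_S / m_R))),  # u = (1,1)
        (m_R / p, (p, 1)),                                                  # u = (1,0)
        (m_S / p, (1, p)),                                                  # u = (0,1)
    ]
    return max(rows, key=lambda row: row[0])

print(cartesian_load(1000, 1000, 100))   # (100.0, (10.0, 10.0))
print(cartesian_load(10000, 1, 100))     # (100.0, (100, 1))
```

In the first call each server receives m_R / p_x = 100 tuples of R and 100 of S. In the second call S is so small that all servers are spent on partitioning R.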


Σ-Optimal Join Algorithm – Intuition

$Q(x, y, z) = R(x, y), S(y, z)$

$\forall v \in HH$, a cartesian product: $Q_v(x, z) = R_v(x), S_v(z)$, where $R_v(x) = R(x, v)$, $S_v(z) = S(v, z)$.

$Q_v$ = residual query; $|R_v| = m_R[v]$, $|S_v| = m_S[v]$.

Allocate $p_v \le p$ servers to compute $Q_v$. Load $L_v = (m_R[v] \cdot m_S[v] / p_v)^{1/2}$ (assuming $> m_R[v]/p,\ m_S[v]/p$)

Determine the number of servers $p_v$ such that $\sum_{v \in HH} p_v = p$, and $\max_v L_v$ is minimized.

Optimal solution: $p_v = p \cdot \frac{m_R[v]\, m_S[v]}{\sum_{w \in HH} m_R[w]\, m_S[w]}$

Optimal load: $L = \left( \frac{\sum_{w \in HH} m_R[w]\, m_S[w]}{p} \right)^{1/2} = L_v$, for all $v \in HH$.
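The allocation can be sanity-checked numerically: giving each heavy hitter a number of servers proportional to m_R[v]·m_S[v] makes every L_v equal to the stated optimal load. In this sketch `allocate_servers` is an illustrative name, the degrees are made-up numbers, and p_v is left fractional.

```python
from math import sqrt, isclose

def allocate_servers(m_R, m_S, p):
    """m_R, m_S: dicts v -> degree over the same set of heavy hitters.
    Returns ({v: p_v}, load) with p_v proportional to m_R[v] * m_S[v]."""
    total = sum(m_R[v] * m_S[v] for v in m_R)
    p_v = {v: p * m_R[v] * m_S[v] / total for v in m_R}
    return p_v, sqrt(total / p)

m_R = {"a": 100, "b": 300}
m_S = {"a": 200, "b": 100}
p_v, load = allocate_servers(m_R, m_S, p=50)

# Every residual query Q_v ends up with the same load L_v = L.
for v in m_R:
    assert isclose(sqrt(m_R[v] * m_S[v] / p_v[v]), load)
```

Here the products are 20000 and 30000, so the 50 servers are split 20 and 30.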


Algorithm HyperSkew

Assume $HH_R = HH_S = HH$.

Light Hitters. Run HyperCube on the light hitters. Load: $\max(m_R/p, m_S/p)$

Heavy Hitters. In parallel, for each $v \in HH$:

Compute $Q_v$ using HyperCube on $p_v = p \cdot \frac{m_R[v]\, m_S[v]}{\sum_{w \in HH} m_R[w]\, m_S[w]}$ servers.

Load: $L = \left( \frac{\sum_{w \in HH} m_R[w]\, m_S[w]}{p} \right)^{1/2}$

Exercise: generalize to $HH_R \ne HH_S$ (Hint: set $m_R[v] = 1$ for $v \in HH_S - HH_R$, etc.).

The total load is:

$$L_{lower} = \max\left( m_R/p,\ m_S/p,\ \left( \frac{\sum_{w \in HH} m_R[w]\, m_S[w]}{p} \right)^{1/2} \right)$$
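The total-load formula can be evaluated from the raw relations as sketched below. This only computes the formula; it does not run HyperSkew itself. The function name `hyperskew_load` is illustrative, and the treatment of one-sided heavy hitters follows the exercise hint (a degree missing from Σ is replaced by 1).

```python
from collections import Counter
from math import sqrt

def hyperskew_load(R, S, p):
    """R: list of (x, y); S: list of (y, z). Evaluates
    max(m_R/p, m_S/p, sqrt(sum_{w in HH} m_R[w] * m_S[w] / p))."""
    m_R, m_S = len(R), len(S)
    deg_R = Counter(y for _, y in R)
    deg_S = Counter(y for y, _ in S)
    hh_R = {v for v, d in deg_R.items() if d >= m_R / p}
    hh_S = {v for v, d in deg_S.items() if d >= m_S / p}
    heavy = sum(
        (deg_R[v] if v in hh_R else 1) * (deg_S[v] if v in hh_S else 1)
        for v in hh_R | hh_S
    )
    return max(m_R / p, m_S / p, sqrt(heavy / p))
```

With 100 tuples in each relation and p = 4, a single shared join value gives a load of 50, while 100 distinct join values give 25.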


One Sided Heavy Hitters

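The contribution of one-sided heavy hitters is negligible:

Lemma

$$\left( \frac{\sum_{w \in HH_R - HH_S} m_R[w]\, m_S[w]}{p} \right)^{1/2} \le \max(m_R, m_S)/p$$

Proof.

$$\sum_{w \in HH_R - HH_S} m_R[w]\, m_S[w] \le \left( \sum_{w \in HH_R - HH_S} m_R[w] \right) \frac{m_S}{p} \le \frac{m_R m_S}{p}$$

The first inequality holds because $m_S[w] < m_S/p$ for $w \notin HH_S$. Dividing by $p$ and taking the square root gives $(m_R m_S)^{1/2}/p \le \max(m_R, m_S)/p$.

Thus, the load due to one-sided heavy hitters will not exceed the load due to light hitters, and it's OK to approximate the missing degree with 1.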


Discussion

Skew may significantly affect the communication cost for joins: the speedup decreases from $1/p$ to $1/p^{1/2}$.

Adapting the algorithm to the statistics Σ is also important: if all heavy hitters are one-sided, then the extra cost can be avoided completely.


Multiple Rounds

Basic idea is very simple: generate a query plan for $Q$, then compute the plan bottom up; each level is one round.

Goal: reduce the load by having multiple rounds.

Challenge (major!): intermediate results may be much bigger, and are hard to estimate.

Two approaches:

In [Beame'13] each operator in the query plan is a conjunctive query (rather than just a single join), and is evaluated using 1-round HyperCube. No control over the intermediate results.

In [Afrati'14] the GYM algorithm computes conjunctive queries only at the leaves s.t. the residual query is acyclic, then uses Yannakakis' semi-join reduction to control the size of the intermediate results.


Grand Summary

Big Data Analytics needs to run complex queries, on big data.

Traditional query processing: one join at a time. Challenge: large and unpredictable intermediate results.

Novel worst-case optimal sequential algorithm: runtime = AGM bound.

Novel parallel algorithm: communication cost = provably optimal.

Many open problems (next)


Open Problem 1: AGM Bounds for Given Statistics

Generalize the AGM bound to databases with known statistics Σ.

Simpler problems:

Generalize the AGM bound for databases with bounded degrees. E.g. $Q = R(x, y), S(y, z)$: the normal AGM upper bound is $m^2$; if all degrees are $\le d$, the upper bound is $dm$.

Generalize the AGM bound to functional dependencies. This immediately proves the previous item (for details, send me email).
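The bounded-degree example can be checked on a small instance: each tuple of R joins with at most d tuples of S, so the output has at most dm tuples. The sketch below uses made-up relations and an illustrative helper, `join_size`, that counts the output without materializing it.

```python
from collections import Counter

def join_size(R, S):
    """|R(x, y) join S(y, z)| = sum over y of deg_R(y) * deg_S(y)."""
    deg_R = Counter(y for _, y in R)
    deg_S = Counter(y for y, _ in S)
    return sum(deg_R[y] * deg_S[y] for y in deg_R)

m, d = 20, 4
R = [(i, i % 5) for i in range(m)]              # every y has degree 4 in R
S = [(y, z) for y in range(5) for z in range(d)]  # every y has degree 4 in S

print(join_size(R, S), d * m, m * m)  # 80 80 400
```

On this instance the output size meets the bound dm exactly and is well below m^2.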


Open Problem 2: Σ-Optimal Sequential Algorithm

Design a sequential algorithm that is worst-case optimal for instances satisfying given statistics Σ.

E.g. $Q = R(x, y), S(y, z), T(z, x)$. Suppose $\forall y$, the degree of $y$ in $S$ is $\le 5$. Then compute $Q$ in time $O(m)$. What if we know one HH $y_0$ with degree $m/2$?


Open Problem 3: Lower Bounds for Multi-Round Parallel Algorithms

Prove lower bounds for 2 or more rounds, assuming servers can send messages encoding arbitrary information.

Example: prove that $R(x, y), S(y, z), T(z, u), K(u, v), L(v, w)$ cannot be computed in 2 rounds with load $m/p$.

Note: [Beame'13] proved such lower bounds, but for a weaker model, where messages consist of tuples, not of arbitrary bits.


Open Problem 4: Optimal Multi-round Algorithms

Find the exact tradeoff between the load/round $L$ and the number of rounds $r$ for a general query.

Note: emphasis here is on algorithms, not lower bounds. For lower bounds it's OK to give the proof for a weaker model (as in [Beame'13]).


Open Problem 5: Σ-Optimal Parallel Algorithms

Generalize HyperSkew from joins to arbitrary queries. It seems straightforward how to generalize it, but it's open whether this is optimal.

Conjectured tight bound:

$$L_{lower} = \max_{u, z} \left( \frac{\sum_{v \in \text{Domain}^z} m_{1,z}^{u_1}[v] \cdots m_{\ell,z}^{u_\ell}[v]}{p} \right)^{1/\sum_j u_j}$$

where $z \subseteq \{x_1, \ldots, x_k\}$, and $u$ is a fractional edge packing of the residual query $q_z$ that covers all variables in $z$.

This formula is known to be a lower bound [Beame'14], but it is not known whether it is tight.


Final Remark

Please solve the 7 problems on the Problem Set. Some are quite easy, some more challenging.

If you work on any of the open problems and make progress, please let me know.


Thank you!
