IBM Research Confidential | © 2009 IBM Corporation Processing Interval Joins on Map-Reduce Bhupesh Chawda, Himanshu Gupta, Sumit Negi, Tanveer Faruquie, L V Subramaniam, Mukesh Mohania IBM Research, India


IBM Research

Confidential | © 2009 IBM Corporation

Processing Interval Joins on Map-Reduce

Bhupesh Chawda, Himanshu Gupta, Sumit Negi, Tanveer Faruquie, L V Subramaniam, Mukesh Mohania

IBM Research, India


Interval Join

Intervals represent range information regarding an entity/event.

Interval – (start-point, end-point), written (ts, te)

Examples: duration of rainfall, periods of high temperature, etc.

Interval Join – Correlates intervals from different relations based on some relationship among the intervals

– Q : Join R1, R2, R3 where R1.I overlaps R2.I and R2.I overlaps R3.I

– R1, R2: High Wind-speed intervals at location L1

– R3 : High pollutant concentration at location L3

[Figure: example intervals u, v, w]
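A minimal sketch of query Q, assuming intervals are (ts, te) tuples and that "overlaps" is Allen's strict predicate (the first interval starts first, the second starts inside it and ends after it). The sample data and the function names are illustrative and do not come from the slides.

```python
from itertools import product

def overlaps(a, b):
    """Allen's 'overlaps': a starts first, b starts inside a and ends after a."""
    return a[0] < b[0] < a[1] < b[1]

def three_way_overlap_join(r1, r2, r3):
    """Nested-loop evaluation of: R1.I overlaps R2.I and R2.I overlaps R3.I."""
    return [(u, v, w) for u, v, w in product(r1, r2, r3)
            if overlaps(u, v) and overlaps(v, w)]

# Hypothetical sample data: (ts, te) pairs.
R1 = [(1, 5), (20, 25)]
R2 = [(3, 8), (30, 35)]
R3 = [(6, 12), (7, 9)]

result = three_way_overlap_join(R1, R2, R3)
print(result)
```

This single-machine version compares every triple; the rest of the deck is about distributing the same computation over reducers.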


Spatial Joins

A rectangle can be visualized as formed by two intervals

Spatial Join – A special case of Interval Join

– Join R1, R2 where R1.Rec overlaps R2.Rec

– Join R1, R2 where R1.L overlaps R2.L and R1.W overlaps R2.W

– Multi-Attribute Interval Join
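The reduction of a rectangle join to two interval conditions can be sketched as follows. This assumes a rectangle is stored as a pair (L, W) of intervals and takes "overlaps" here to mean that the two intervals intersect (a symmetric test); both choices, and all names, are assumptions made for illustration.

```python
def intersects(a, b):
    """Symmetric test: intervals a and b share more than a boundary point."""
    return a[0] < b[1] and b[0] < a[1]

def rect_overlaps(r, s):
    """r and s are (L, W) pairs: one interval per axis."""
    return intersects(r[0], s[0]) and intersects(r[1], s[1])

def spatial_join(r1, r2):
    """Join R1, R2 where R1.L overlaps R2.L and R1.W overlaps R2.W."""
    return [(r, s) for r in r1 for s in r2 if rect_overlaps(r, s)]

# Hypothetical rectangles: ((x_lo, x_hi), (y_lo, y_hi)).
R1 = [((0, 4), (0, 3)), ((10, 12), (10, 12))]
R2 = [((2, 6), (1, 5)), ((2, 6), (4, 8))]

pairs = spatial_join(R1, R2)
print(pairs)
```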


Allen’s Predicates
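The figure for this slide did not survive in the transcript. As a hedged reminder, Allen's interval algebra defines thirteen relations between two intervals; the sketch below classifies a pair of (ts, te) intervals, assuming ts < te for both. The function name and the relation spellings are choices made here.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b (each is (ts, te), ts < te)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Remaining cases: the intervals intersect with distinct starts and ends.
    return "overlaps" if s1 < s2 else "overlapped-by"

print(allen_relation((1, 5), (3, 8)))
```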


Interval Join

Colocation Predicates – Allen’s predicates other than Before/After

Sequential Predicates – Before/After

Colocation Queries

– Consist of only colocation predicates

– Join R1, R2, R3 where R1.I overlaps R2.I and R2.I contains R3.I

Sequential Queries

– Consist of only sequential predicates

– Join R1, R2, R3 where R1.I before R2.I and R2.I after R3.I

Hybrid Queries

– Consist of both colocation and sequential predicates

– Join R1, R2, R3 where R1.I overlaps R2.I and R2.I before R3.I

Multi-Attribute Queries

– Consist of join conditions on multiple attributes
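The query classes above can be told apart mechanically. A small sketch, assuming a query is given as a list of (left, predicate, right) triples; this representation and the function name are hypothetical.

```python
SEQUENTIAL = {"before", "after"}

def classify_query(predicates):
    """Classify a query as 'colocation', 'sequential' or 'hybrid'."""
    if not predicates:
        raise ValueError("query has no join predicates")
    kinds = {"sequential" if p in SEQUENTIAL else "colocation"
             for _, p, _ in predicates}
    if kinds == {"colocation"}:
        return "colocation"
    if kinds == {"sequential"}:
        return "sequential"
    return "hybrid"

# The three example queries from the slide.
q_coloc = [("R1.I", "overlaps", "R2.I"), ("R2.I", "contains", "R3.I")]
q_seq = [("R1.I", "before", "R2.I"), ("R2.I", "after", "R3.I")]
q_hybrid = [("R1.I", "overlaps", "R2.I"), ("R2.I", "before", "R3.I")]

print(classify_query(q_coloc), classify_query(q_seq), classify_query(q_hybrid))
```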


Outline

Project-Split-Replicate

Computing 2-way Join on MR

Naïve Approaches

– Why is computing multi-way joins hard?

Computing Colocation Joins

– Consistent Sets, Crossing Sets

– RCCIS (Replicate Consistent and Crossing Interval-Sets)

– Experimental Evaluation

Computing Sequence Joins

– Load Balancing

– All-Matrix

Computing Hybrid Joins

– Naïve Approaches: FSTC, FCTS

– All-Seq-Matrix and Pruned-All-Seq-Matrix

Computing Multi-Attribute Joins

– Gen-Matrix

Conclusions


Map-Reduce Job-Flow


Join on Point Data vs Intervals

A point can be regarded as an interval of length 0

Join algorithms for point data have already been proposed on MR

Join algorithms presented in this paper reduce to the join algorithms on point data as interval lengths are reduced to 0.

This paper analyzes how to efficiently handle the additional complexity due to intervals having a finite length.

Intervals give rise to multiple colocation predicates; point data has only the equality predicate.


Join On Point Data

Select R.A, R.B, S.D where R.A == S.A

Relation R (input to MAP 1):

      A    B    C
R1    1    10   12
R2    2    20   34
R3    1    10   22
R4    1    30   56
R5    3    40   17

Relation S (input to MAP 2):

      A    D    E
S1    1    20   22
S2    2    30   36
S3    2    10   29
S4    3    50   16
S5    3    40   37

MAP 1 output (A, B): (1, 10), (2, 20), (1, 10), (1, 30), (3, 40)

MAP 2 output (A, D): (1, 20), (2, 30), (2, 10), (3, 50), (3, 40)

Reducer 1 input: (1, [10, 10, 30, 20])

Reducer 1 output (A, B, D): (1, 10, 20), (1, 10, 20), (1, 30, 20)

Reducer 2 input: (2, [20, 30, 10]), (3, [40, 50, 40])

Reducer 2 output (A, B, D): (2, 20, 30), (2, 20, 10), (3, 40, 50), (3, 40, 40)
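The point-data example can be replayed with a small in-memory simulation of a reduce-side equi-join. The relation contents are taken from the slide; tagging each value with its source relation is an assumption added here so the reducer can tell B values from D values, and the function names are illustrative.

```python
from collections import defaultdict

R = [(1, 10, 12), (2, 20, 34), (1, 10, 22), (1, 30, 56), (3, 40, 17)]  # (A, B, C)
S = [(1, 20, 22), (2, 30, 36), (2, 10, 29), (3, 50, 16), (3, 40, 37)]  # (A, D, E)

def map_r(row):
    a, b, _c = row
    return (a, ("R", b))          # key on the join attribute A

def map_s(row):
    a, d, _e = row
    return (a, ("S", d))

def shuffle(pairs):
    """Group map output by key, as the MR framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    bs = [v for tag, v in values if tag == "R"]
    ds = [v for tag, v in values if tag == "S"]
    return [(key, b, d) for b in bs for d in ds]

groups = shuffle([map_r(r) for r in R] + [map_s(s) for s in S])
output = [t for key in sorted(groups) for t in reduce_join(key, groups[key])]
print(output)
```

Every tuple with the same value of A reaches the same reducer, which is exactly what is hard to arrange once A becomes an interval.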


Issues

How to get the tuples satisfying the join predicates on a single reducer?

How to do so with minimum communication cost?

How to do so while keeping the load balanced among the reducers?

Other concerns: reading cost and a minimal number of MR cycles


Interval Joins

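Select R.A, R.B, S.D where R.A overlaps S.A

Relation R (input to MAP 1):

      A        B    C
R1    [1-2]    10   12
R2    [2-4]    20   34
R3    [1-4]    10   22
R4    [1-3]    30   56
R5    [3-4]    40   17

Relation S (input to MAP 2):

      A        D    E
S1    [1-4]    20   22
S2    [2-4]    30   36
S3    [2-7]    10   29
S4    [3-6]    50   16
S5    [3-7]    40   37

MAP 1 output: ??

MAP 2 output: ??

Keyed on the interval itself: R3 gives ([1-4], 10), S3 gives ([2-7], 10)

Keyed on each point: R3 gives (1, 10), (2, 10), (3, 10), (4, 10); S3 gives (2, 10), (3, 10), (4, 10), (5, 10), (6, 10), (7, 10)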


Project

Project(u) : (p1, u)

Project(v) : (p2, v)

Project(R) : [(p1, u), (p2, v)]

[Figure: time axis from t0 to tn]


Split

Split(u) : [(p1, u), (p2, u)]

Split(v) : [(p2, v)]

Split(R) : [(p1, u), (p2, u), (p2, v)]


Replicate

Replicate(u) : [(p1, u), (p2, u), (p3, u), (p4, u)]

Replicate(v) : [(p2, v), (p3, v), (p4, v) ]

Replicate(R) : [(p1, u), (p2, u), (p3, u), (p4, u), (p2, v), (p3, v), (p4, v) ]
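The three operations can be sketched in a few lines. This assumes four equal partition-intervals p1..p4 over [0, 40), that Project sends an interval to the partition containing its start point, that Split sends it to every partition it touches, and that Replicate sends it to its start partition and every partition to the right. These readings match the u and v examples on the slides but are otherwise assumptions, as are the concrete numbers.

```python
from bisect import bisect_right

# Assumed partitioning: p_i covers [BOUNDS[i-1], BOUNDS[i]).
BOUNDS = [0, 10, 20, 30, 40]

def partition_of(t):
    """1-based index of the partition-interval containing time point t."""
    return bisect_right(BOUNDS, t)

def project(iv):
    """Send the interval to the partition where it starts."""
    return [(partition_of(iv[0]), iv)]

def split(iv):
    """Send the interval to every partition it touches."""
    return [(p, iv) for p in range(partition_of(iv[0]), partition_of(iv[1]) + 1)]

def replicate(iv, n_partitions=len(BOUNDS) - 1):
    """Send the interval to its start partition and all partitions to the right."""
    return [(p, iv) for p in range(partition_of(iv[0]), n_partitions + 1)]

u = (5, 15)    # hypothetical: starts in p1, ends in p2
v = (12, 18)   # hypothetical: lies inside p2

print(project(u), project(v))
print(split(u), split(v))
print(replicate(u), replicate(v))
```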


2-way Overlap Join Processing

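Join R1 and R2 where R1 overlaps R2

Split R1 and Project R2

[Figure: intervals u (in R1) and v (in R2) over partition-intervals p1, p2, p3, p4]

MAP: input u, v; output Split(u) = (p1, u), (p2, u) and Project(v) = (p2, v)

REDUCE: reducer p1 receives (p1, u); reducer p2 receives (p2, u), (p2, v) and outputs (u, v)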
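A sketch of the split/project strategy for the 2-way join, simulated in memory. It assumes Allen's strict "overlaps" (so v always starts inside u, which is why projecting R2 on its start point is enough), four equal partition-intervals, and hypothetical sample data.

```python
from bisect import bisect_right
from collections import defaultdict

BOUNDS = [0, 10, 20, 30, 40]   # assumed partition-intervals p1..p4

def partition_of(t):
    return bisect_right(BOUNDS, t)

def overlaps(a, b):
    """Allen's 'overlaps': a starts first, b starts inside a and ends after a."""
    return a[0] < b[0] < a[1] < b[1]

def map_phase(r1, r2):
    pairs = []
    for u in r1:                                   # Split R1
        for p in range(partition_of(u[0]), partition_of(u[1]) + 1):
            pairs.append((p, ("R1", u)))
    for v in r2:                                   # Project R2
        pairs.append((partition_of(v[0]), ("R2", v)))
    return pairs

def reduce_phase(pairs):
    groups = defaultdict(list)
    for p, tagged in pairs:
        groups[p].append(tagged)
    out = []
    for p in sorted(groups):
        us = [iv for tag, iv in groups[p] if tag == "R1"]
        vs = [iv for tag, iv in groups[p] if tag == "R2"]
        out.extend((u, v) for u in us for v in vs if overlaps(u, v))
    return out

R1 = [(5, 15), (22, 38)]
R2 = [(12, 18), (35, 39), (2, 4)]
result = reduce_phase(map_phase(R1, R2))
print(result)
```

Because each interval of R2 reaches exactly one reducer, every output pair is produced once.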


Multi-way Overlap Join

Query: R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4

Splitting all relations does not work

– Reducer p1 will get [u]

– Reducer p2 will get [u, v]

– Reducer p3 will get [v, w, x]

– Reducer p4 will get [w, x]

No single reducer gets all four intervals

[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
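The failure of split-everything can be reproduced with concrete numbers. The interval end points below are hypothetical, chosen so that the reducer contents match the ones listed on the slide; the partitioning into four equal partition-intervals is likewise an assumption.

```python
from bisect import bisect_right
from collections import defaultdict

BOUNDS = [0, 10, 20, 30, 40]   # assumed partition-intervals p1..p4

def partition_of(t):
    return bisect_right(BOUNDS, t)

def split(iv):
    """Partitions touched by the interval."""
    return range(partition_of(iv[0]), partition_of(iv[1]) + 1)

# Hypothetical chain: u overlaps v, v overlaps w, w overlaps x.
intervals = {"u": (5, 15), "v": (12, 25), "w": (22, 35), "x": (28, 38)}

reducers = defaultdict(list)
for name, iv in intervals.items():
    for p in split(iv):
        reducers[p].append(name)

for p in sorted(reducers):
    print(f"p{p}: {reducers[p]}")

complete = [p for p, names in reducers.items() if len(names) == len(intervals)]
print("reducers holding the whole tuple:", complete)
```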


2-way Cascade

- Handle multi-way join as a cascade of 2-way joins

R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4

JR(R1, R2) overlaps JR(R3, R4), giving JR(R1, R2, R3, R4)

- Large intermediate results

- Consequently large reading and communication costs

- Requires multiple map-reduce cycles


All-Replicate

- Replicate all relations and project the right-most one

- Replicate(u): [(p1, u), (p2, u), (p3, u), (p4, u)]

- Replicate(v): [(p2, v), (p3, v), (p4, v)]

- Replicate(w): [(p3, w), (p4, w)]

- Project(x): [(p4, x)]

- Reducer p1 gets [u], Reducer p2 gets [u,v]

- Reducer p3 gets [u,v,w], Reducer p4 gets [u,v,w,x]

- Reducer p4 hence can compute the output tuple

[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
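A sketch of All-Replicate for a chain of overlap conditions, simulated in memory. It assumes Allen's strict "overlaps", replication to the right of the start partition, and hypothetical end points chosen so that x starts in p4 as on the slide.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import product

BOUNDS = [0, 10, 20, 30, 40]          # assumed partition-intervals p1..p4
N_PARTITIONS = len(BOUNDS) - 1

def partition_of(t):
    return bisect_right(BOUNDS, t)

def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]

def all_replicate(relations):
    """Replicate every relation except the last (right-most), which is projected."""
    reducers = defaultdict(list)
    last = len(relations) - 1
    for idx, rel in enumerate(relations):
        for name, iv in rel:
            first = partition_of(iv[0])
            stop = first if idx == last else N_PARTITIONS
            for p in range(first, stop + 1):
                reducers[p].append((idx, name, iv))
    return reducers

def chain_join(received, n_rel):
    """Evaluate R1 overlaps R2 and ... and R(n-1) overlaps Rn on one reducer."""
    by_rel = [[(n, iv) for i, n, iv in received if i == k] for k in range(n_rel)]
    return [tuple(n for n, _ in combo)
            for combo in product(*by_rel)
            if all(overlaps(combo[k][1], combo[k + 1][1]) for k in range(n_rel - 1))]

relations = [[("u", (5, 15))], [("v", (12, 25))], [("w", (22, 35))], [("x", (32, 38))]]
reducers = all_replicate(relations)
received_names = {p: [n for _, n, _ in reducers[p]] for p in sorted(reducers)}
output = {p: chain_join(reducers[p], len(relations)) for p in sorted(reducers)}
print(received_names)
print(output)
```

The join is correct, but ten key-value pairs are shipped for four intervals, which is the communication cost the later slides try to avoid.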


Multi-way joins

Need a method that handles all join predicates at once and does not replicate all the intervals

RCCIS (Replicate Consistent and Crossing Interval-Sets) achieves precisely this

Key Concepts

– Consistent Interval-sets

– Crossing Interval-sets


RCCIS Flow

Cycle 1

Input: Interval Data – Relations (R1, R2, R3, R4)

Map-1 (Split the Data): emits intermediate key-value pairs, giving data sub-sets 1 to 4 for partition-intervals p1, p2, p3, p4

Reduce-1: outputs the intervals to replicate

Cycle 2

Input: Interval Data – Relations (R1, R2, R3, R4)

Map-2 (Split the Data) and Map-2 (Replicate the Data): give data sub-sets 1 to 4 for partition-intervals p1, p2, p3, p4

Reduce-2: outputs the Join Output


Consistent Interval Sets

Consider an interval-set S

S is consistent if any two intervals in S satisfy all the join conditions formed by the relations these intervals belong to.

Consider the set {u, v, w}. Its relation-set is {R1, R2, R3}

Consistency implies that the intervals u, v and w satisfy all the join conditions formed by relations R1, R2, R3 in the query

– i.e., R1 overlaps R2 and R2 overlaps R3

R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
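The consistency test can be sketched as a check over the join conditions whose two relations both appear in the set. The representation (a dict from relation name to one interval, a query as a list of triples) and the sample intervals are assumptions; "overlaps" and "contains" follow Allen's strict definitions.

```python
def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]

def contains(a, b):
    return a[0] < b[0] and b[1] < a[1]

PREDICATES = {"overlaps": overlaps, "contains": contains}

def is_consistent(interval_set, query):
    """True if every join condition between relations present in the set holds."""
    return all(PREDICATES[pred](interval_set[ra], interval_set[rb])
               for ra, pred, rb in query
               if ra in interval_set and rb in interval_set)

QUERY = [("R1", "overlaps", "R2"), ("R2", "overlaps", "R3"), ("R3", "overlaps", "R4")]

u, v, w = (5, 15), (12, 25), (22, 35)      # hypothetical intervals
print(is_consistent({"R1": u, "R2": v, "R3": w}, QUERY))
print(is_consistent({"R1": u, "R3": w}, QUERY))          # no condition links R1 and R3
print(is_consistent({"R1": u, "R2": (16, 25)}, QUERY))
```

Under this reading a set with no applicable condition is trivially consistent, which agrees with the later statement that every subset of an output tuple is consistent.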


Example

{u0, v0} is consistent; its relation-set is {R1, R2}

– u0 and v0 must overlap due to the condition ‘R1 overlaps R2’

R1 overlaps R2 and R2 contains R3 and R3 overlaps R4


Example

{u3, v1, w2} is consistent; its relation-set is {R1, R2, R3}

– u3 and v1 must overlap due to the condition ‘R1 overlaps R2’

– v1 must contain w2 due to the condition ‘R2 contains R3’

R1 overlaps R2 and R2 contains R3 and R3 overlaps R4


Example

{u2, v1, w1, x3} is consistent; its relation-set is {R1, R2, R3, R4}

– u2 and v1 must overlap due to the condition ‘R1 overlaps R2’

– v1 must contain w1 due to the condition ‘R2 contains R3’

– w1 and x3 must overlap due to the condition ‘R3 overlaps R4’

R1 overlaps R2 and R2 contains R3 and R3 overlaps R4


Example

{u1, v1, w1} is not consistent; its relation-set is {R1, R2, R3}

– u1 and v1 do not overlap

R1 overlaps R2 and R2 contains R3 and R3 overlaps R4


Intuition

Each subset of an output tuple is a consistent interval-set.

Each subset of {u, v, w, x} is consistent.

R1 overlaps R2 and R2 contains R3 and R3 overlaps R4

[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]


Crossing Interval Sets

Consider a query Q and its relation-set R.

– Q : R1 overlaps R2 and R2 contains R3 and R3 overlaps R4

– R : {R1, R2, R3, R4}

Consider an interval-set U, its relation-set RU, and the set R – RU

– U : {u, v, w}, RU : {R1, R2, R3}, R – RU : {R4}

Consider the join conditions formed by one relation in RU and one relation in R – RU

– R3 overlaps R4

Set U crosses partition-interval p if the following holds (for the overlaps predicate)

– For Re1 overlaps Re2 with Re1 ∈ RU and Re2 ∈ R – RU: the interval of Re1 crosses partition-interval p on the right

– For Re1 overlaps Re2 with Re1 ∈ R – RU and Re2 ∈ RU: the interval of Re2 crosses partition-interval p on the left
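One possible reading of the crossing rule, sketched for a query made only of overlap conditions. It assumes that "crosses p on the right" means the interval ends at or beyond the right boundary of p, that "crosses p on the left" means it starts before the left boundary, that every boundary condition must hold, and that a set with no boundary condition does not cross. All of these are interpretations rather than statements from the slides, and the sample intervals are hypothetical.

```python
def crosses(interval_set, query, part):
    """interval_set: relation -> (ts, te); query: list of (Re1, Re2) overlap
    conditions; part: (lo, hi) bounds of partition-interval p = [lo, hi)."""
    lo, hi = part
    checked = False
    for re1, re2 in query:
        if re1 in interval_set and re2 not in interval_set:
            checked = True
            if interval_set[re1][1] < hi:       # Re1 must cross p on the right
                return False
        elif re2 in interval_set and re1 not in interval_set:
            checked = True
            if interval_set[re2][0] >= lo:      # Re2 must cross p on the left
                return False
    return checked

QUERY = [("R1", "R2"), ("R2", "R3"), ("R3", "R4")]
u, v, w = (5, 15), (12, 25), (22, 35)           # hypothetical intervals
P1, P2, P3 = (0, 10), (10, 20), (20, 30)        # assumed partition-intervals

print(crosses({"R1": u}, QUERY, P1))
print(crosses({"R1": u, "R2": v}, QUERY, P2))
print(crosses({"R2": v, "R3": w}, QUERY, P3))
print(crosses({"R1": u, "R2": v}, QUERY, P3))
```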


Intuition

An output tuple can be visualized as formed by the union of a crossing set with another consistent set

Consider output tuple – {u, v, w, x}

– {u} crosses p1 and combines with consistent set {v, w, x}

– {u, v} crosses p2 and combines with consistent set {w, x}

– {v, w} crosses p3 and combines with consistent set {u, x}

[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]


Example

{u3,v1,w2} crosses p2

RU is {R1, R2, R3}, R – RU is {R4}

Condition to check: R3 overlaps R4

w2 crosses p2 on the right

R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4


Example

{v3,w2} crosses p2

RU is {R2, R3}, R – RU is {R1, R4}

Conditions to check: R1 overlaps R2 and R3 overlaps R4

v3 crosses p2 on the left and w2 crosses p2 on the right

R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4


Example

{u3, v2} does not cross p2

RU is {R1, R2}, R – RU is {R3, R4}

Condition to check: R2 overlaps R3

v2 does not cross p2 on the right

R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4


Replicate Consistent and Crossing Interval Sets (RCCIS)

Runs in two map-reduce cycles

First Map splits all the relations

Reducers in first cycle determine which intervals to replicate

Second cycle replicates the selected intervals and computes the multi-way join

[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]


RCCIS – Which Intervals to Replicate?

Reducer p computes the interval-sets that are both consistent and crossing

All intervals that belong to such sets and start in the partition-interval p are replicated.


Example

Reducer p2 receives {u1, u2, u3, v1, v2, v3, w1, w2, x1, x2}

Consistent and Crossing sets are [{u3, v1, w2}, {v3, w2}]

– No other set satisfies the conditions of RCCIS

Intervals in these sets are {u3, v1, v3, w2}

p2 replicates {u3, v1, w2}, the intervals that start in p2 (v3 crosses p2 on the left, so it starts before p2)
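The selection step of a first-cycle reducer might look like the sketch below. It is one reading of the slides, not the authors' implementation: it enumerates interval-sets with at most one interval per relation over proper subsets of the relations, uses a chain of overlap conditions as the query, and reuses the same assumed meanings of "consistent" and "crosses" (ends at or beyond the right boundary, starts before the left boundary). The interval end points are hypothetical, picked to mimic the shape of the example above.

```python
from itertools import combinations, product

RELATIONS = ["R1", "R2", "R3", "R4"]
QUERY = [("R1", "R2"), ("R2", "R3"), ("R3", "R4")]   # all conditions are 'overlaps'

def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]

def is_consistent(s):
    return all(overlaps(s[a], s[b]) for a, b in QUERY if a in s and b in s)

def crosses(s, lo, hi):
    checked = False
    for a, b in QUERY:
        if a in s and b not in s:
            checked = True
            if s[a][1] < hi:          # must cross p on the right
                return False
        elif b in s and a not in s:
            checked = True
            if s[b][0] >= lo:         # must cross p on the left
                return False
    return checked

def intervals_to_replicate(received, lo, hi):
    """received: relation -> intervals seen by the reducer of p = [lo, hi)."""
    selected = set()
    for k in range(1, len(RELATIONS)):                  # proper, non-empty subsets
        for rels in combinations(RELATIONS, k):
            for combo in product(*(received.get(r, []) for r in rels)):
                s = dict(zip(rels, combo))
                if is_consistent(s) and crosses(s, lo, hi):
                    selected.update((r, iv) for r, iv in s.items()
                                    if lo <= iv[0] < hi)  # keep only those starting in p
    return selected

# Hypothetical input of reducer p2 = [10, 20).
received = {
    "R1": [(11, 16)],               # u3
    "R2": [(14, 19), (8, 18)],      # v1, v3 (v3 starts before p2)
    "R3": [(15, 18), (17, 24)],     # w1, w2
}
selected = intervals_to_replicate(received, 10, 20)
print(sorted(selected))
```

With this data the qualifying sets are {u3, v1, w2} and {v3, w2}, and v3 is dropped because it does not start in p2. The enumeration is exponential in the number of relations and is only meant to make the definitions concrete.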


Experimental Evaluation on Synthetic Data

Synthetic Data

Time Range: [0, 100K]

Max Interval Length: 100

Query: R1 overlaps R2 and R2 overlaps R3
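A data set with these two parameters could be generated as sketched below. The slides give only the time range and the maximum interval length; the uniform distributions, the relation size and the seed are assumptions made for the sketch.

```python
import random

def generate_relation(n, t_max=100_000, max_len=100, seed=0):
    """Generate n intervals (ts, te) inside [0, t_max] with length at most max_len."""
    rng = random.Random(seed)
    relation = []
    for _ in range(n):
        length = rng.randint(1, max_len)       # assumed uniform length
        ts = rng.randint(0, t_max - length)    # assumed uniform start point
        relation.append((ts, ts + length))
    return relation

R1 = generate_relation(1000, seed=1)
R2 = generate_relation(1000, seed=2)
R3 = generate_relation(1000, seed=3)
print(R1[:3])
```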


Experimental Evaluation on Internet Packet Trace Data

- Packet traces taken from the MAWI group archive: http://tracer.csl.sonmy.co.jp/mawi


Questions?

Discussed the challenges in handling interval joins on Map-Reduce

Presented naïve algorithms for handling interval joins

Presented novel algorithms for handling interval joins and illustrated that these significantly improve on the naïve methods.