IBM Research
Confidential | © 2009 IBM Corporation
Processing Interval Joins on Map-Reduce
Bhupesh Chawda, Himanshu Gupta, Sumit Negi, Tanveer Faruquie, L V Subramaniam, Mukesh Mohania
IBM Research, India
Interval Join
Intervals represent range information regarding an entity/event.
Interval – (start-point, end-point), written (ts, te)
– e.g., duration of rainfall, high temperature, etc.
Interval Join – Correlates intervals from different relations based on some relationships among intervals
– Q : Join R1, R2, R3 where R1.I overlaps R2.I and R2.I overlaps R3.I
– R1, R2: High Wind-speed intervals at location L1
– R3 : High pollutant concentration at location L3
[Figure: example intervals u, v, w]
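A minimal sketch of query Q as a nested-loop reference join. It assumes that "overlaps" is Allen's strict overlaps predicate (the left interval starts first and ends inside the right one); the `Interval` type, `chain_join`, and the sample values are hypothetical and only serve as illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # ts
    end: float    # te


def overlaps(a: Interval, b: Interval) -> bool:
    """Allen-style 'a overlaps b' (assumed): a starts first and ends inside b."""
    return a.start < b.start < a.end < b.end


def chain_join(r1, r2, r3):
    """Q: R1.I overlaps R2.I and R2.I overlaps R3.I, by brute force."""
    return [(u, v, w)
            for u in r1
            for v in r2 if overlaps(u, v)
            for w in r3 if overlaps(v, w)]


# Hypothetical sample relations.
r1 = [Interval(0, 5), Interval(20, 25)]
r2 = [Interval(3, 9)]
r3 = [Interval(7, 12), Interval(30, 31)]
result = chain_join(r1, r2, r3)
print(result)
```

This brute-force version is useful as a correctness baseline when checking a distributed implementation on small inputs.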
Spatial Joins
A rectangle can be visualized as formed by two intervals
Spatial Join – A special case of Interval Join
– Join R1 , R2 where R1.Rec overlaps R2.Rec
– Join R1, R2 where R1.L overlaps R2.L and R1.W overlaps R2.W
– Multi-Attribute Interval Join
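The decomposition of a rectangle overlap into two interval conditions can be sketched as follows. This sketch assumes a symmetric "intersects" test on each axis, which is the usual meaning of rectangle overlap; the `Rect` type and the sample rectangles are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


class Rect(NamedTuple):
    length: Interval  # extent along the L axis
    width: Interval   # extent along the W axis


def intersects(a: Interval, b: Interval) -> bool:
    """Symmetric test (assumed here): the two intervals share at least one point."""
    return a.start <= b.end and b.start <= a.end


def rect_overlaps(a: Rect, b: Rect) -> bool:
    """R1.Rec overlaps R2.Rec iff both per-axis interval conditions hold."""
    return intersects(a.length, b.length) and intersects(a.width, b.width)


# Hypothetical sample rectangles.
a = Rect(Interval(0, 4), Interval(0, 4))
b = Rect(Interval(3, 6), Interval(2, 5))
c = Rect(Interval(3, 6), Interval(5, 9))
print(rect_overlaps(a, b), rect_overlaps(a, c))
```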
Allen’s Predicates
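For reference, Allen's thirteen interval relations can be told apart with a handful of endpoint comparisons. The sketch below assumes proper intervals (start < end) given as `(start, end)` tuples; the function name and the relation labels are illustrative choices.

```python
def allen_relation(a, b):
    """Return the name of the Allen relation that holds between a and b.

    Assumes proper intervals (start < end) given as (start, end) tuples.
    """
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if b_e < a_s:
        return "after"
    if a_e == b_s:
        return "meets"
    if b_e == a_s:
        return "met-by"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started-by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished-by"
    if b_s < a_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_e < a_e:
        return "contains"
    return "overlaps" if a_s < b_s else "overlapped-by"


print(allen_relation((1, 5), (3, 8)))
```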
Interval Join
Colocation Predicates – Allen’s Predicates other than Before/After
Sequential Predicates – Before / After
Colocation Queries – Consist of only co-location predicates
– Join R1, R2, R3 where R1.I overlaps R2.I and R2.I contains R3.I
Sequential Queries – Consist of sequential predicates
– Join R1, R2, R3 where R1.I before R2.I and R2.I after R3.I
Hybrid Queries – Consist of both co-location and sequential predicates
– Join R1, R2, R3 where R1.I overlaps R2.I and R2.I before R3.I
Multi-Attribute Queries – Consist of multiple attributes
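The query classes above follow directly from the predicates a query contains. A small sketch, assuming a query is given as a list of hypothetical `(left, predicate, right)` triples:

```python
SEQUENTIAL_PREDICATES = {"before", "after"}


def classify_query(predicates):
    """Classify a join query by the kinds of predicates it contains.

    predicates: list of (left_relation, predicate_name, right_relation);
    this triple format is an assumption made for illustration.
    """
    if not predicates:
        raise ValueError("a join query needs at least one predicate")
    kinds = {"sequential" if name in SEQUENTIAL_PREDICATES else "colocation"
             for _left, name, _right in predicates}
    if len(kinds) == 2:
        return "hybrid"
    return kinds.pop()


q_colocation = [("R1", "overlaps", "R2"), ("R2", "contains", "R3")]
q_sequential = [("R1", "before", "R2"), ("R2", "after", "R3")]
q_hybrid = [("R1", "overlaps", "R2"), ("R2", "before", "R3")]
print(classify_query(q_colocation), classify_query(q_sequential), classify_query(q_hybrid))
```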
Outline
Project-Split-Replicate
Computing 2-way Join on MR
Naïve Approaches
– Why is computing multi-way joins hard?
Computing Colocation Joins
– Consistent Sets, Crossing Sets
– RCCIS (Replicate Consistent and Crossing Interval-Sets)
– Experimental Evaluation
Computing Sequence Joins
– Load Balancing
– All-Matrix
Computing Hybrid Joins
– Naïve Approaches: FSTC, FCTS
– All-Seq-Matrix and Pruned-All-Seq-Matrix
Computing Multi-Attribute Joins
– Gen-Matrix
Conclusions
Map-Reduce Job-Flow
Join on Point Data vs Intervals
A point can be regarded as an interval of length 0
Join algorithms for point data have already been proposed on MR
Join algorithms presented in this paper reduce to the join algorithms on point data as interval lengths are reduced to 0.
This paper analyzes how to efficiently handle the additional complexity due to intervals having a finite length.
Multiple colocation predicates arise in the case of intervals; in the case of point data there is only the equality predicate
Join On Point Data
Select R.A, R.B, S.D where R.A == S.A

Relation R
      A    B    C
R1    1   10   12
R2    2   20   34
R3    1   10   22
R4    1   30   56
R5    3   40   17

Relation S
      A    D    E
S1    1   20   22
S2    2   30   36
S3    2   10   29
S4    3   50   16
S5    3   40   37

MAP 1 (R), emits (A, B): (1, 10) (2, 20) (1, 10) (1, 30) (3, 40)
MAP 2 (S), emits (A, D): (1, 20) (2, 30) (2, 10) (3, 50) (3, 40)
Reducer 1 receives (1, [10, 10, 30, 20]) and outputs (1, 10, 20) (1, 10, 20) (1, 30, 20)
Reducer 2 receives (2, [20, 30, 10]) (3, [40, 50, 40]) and outputs (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)
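The point-data join on this slide can be simulated in a few lines. This is a single-process sketch of a reduce-side equi-join, not Hadoop code; the `"R"`/`"S"` tags and the function names are assumptions added so the reducer can tell the two inputs apart.

```python
from collections import defaultdict

R = [(1, 10, 12), (2, 20, 34), (1, 10, 22), (1, 30, 56), (3, 40, 17)]  # (A, B, C)
S = [(1, 20, 22), (2, 30, 36), (2, 10, 29), (3, 50, 16), (3, 40, 37)]  # (A, D, E)


def map_phase(r_tuples, s_tuples):
    """Emit (join key, tagged value) pairs, projecting away unused columns."""
    for a, b, _c in r_tuples:
        yield a, ("R", b)
    for a, d, _e in s_tuples:
        yield a, ("S", d)


def shuffle(pairs):
    """Group values by key, as the MR framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every R value with every S value that shares the key."""
    r_vals = [v for tag, v in values if tag == "R"]
    s_vals = [v for tag, v in values if tag == "S"]
    return [(key, b, d) for b in r_vals for d in s_vals]


output = []
for key, values in sorted(shuffle(map_phase(R, S)).items()):
    output.extend(reduce_phase(key, values))
print(output)
```

Because equality on A puts all matching tuples under one key, a single reducer always sees everything it needs; this is the property that intervals break.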
Issues
How to get the tuples satisfying the join predicates on a single reducer?
How to do so, so that the communication costs are minimum?
How to do so, so that the load among the reducers is balanced?
Reading Cost, Minimal Number of MR Cycles
Interval Joins
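Select R.A, R.B, S.D where R.A overlaps S.A

Relation R
      A       B    C
R1    [1-2]   10   12
R2    [2-4]   20   34
R3    [1-4]   10   22
R4    [1-3]   30   56
R5    [3-4]   40   17

Relation S
      A       D    E
S1    [1-4]   20   22
S2    [2-4]   30   36
S3    [2-7]   10   29
S4    [3-6]   50   16
S5    [3-7]   40   37

MAP 1 and MAP 2: which key should an interval be emitted under? (??)
Emitting one key per point covered by the interval:
([1-4], 10) → (1, 10) (2, 10) (3, 10) (4, 10)
([2-7], 10) → (2, 10) (3, 10) (4, 10) (5, 10) (6, 10) (7, 10)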
Project
Project(u) : (p1, u)
Project(v) : (p2, v)
Project(R) : [(p1, u), (p2, v)]
[Figure: time axis from t0 to tn]
Split
Split(u) : [(p1, u), (p2, u)]
Split(v) : [(p2, v)]
Split(R) : [(p1, u), (p2, u), (p2, v)]
Replicate
Replicate(u) : [(p1, u), (p2, u), (p3, u), (p4, u)]
Replicate(v) : [(p2, v), (p3, v), (p4, v) ]
Replicate(R) : [(p1, u), (p2, u), (p3, u), (p4, u), (p2, v), (p3, v), (p4, v) ]
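The three operations can be sketched as functions that return the partition indices an interval is sent to. The sketch assumes equal-width partition-intervals numbered from p1, that Project uses the partition containing the start point, and hypothetical coordinates for u and v chosen to reproduce the outputs listed on the Project, Split, and Replicate slides.

```python
def partition_of(t, width):
    """1-based index of the partition-interval containing point t."""
    return int(t // width) + 1


def project(interval, width):
    """Send the interval only to the partition containing its start point (assumed)."""
    start, _end = interval
    return [partition_of(start, width)]


def split(interval, width):
    """Send the interval to every partition it overlaps."""
    start, end = interval
    return list(range(partition_of(start, width), partition_of(end, width) + 1))


def replicate(interval, width, num_partitions):
    """Send the interval to its start partition and every partition to the right."""
    start, _end = interval
    return list(range(partition_of(start, width), num_partitions + 1))


WIDTH, NUM_PARTITIONS = 10, 4      # hypothetical: p1=[0,10), ..., p4=[30,40)
u, v = (2, 14), (12, 18)           # hypothetical coordinates

print(project(u, WIDTH), split(u, WIDTH), replicate(u, WIDTH, NUM_PARTITIONS))
print(project(v, WIDTH), split(v, WIDTH), replicate(v, WIDTH, NUM_PARTITIONS))
```

The three operations trade communication cost for coverage: Project emits one pair per interval, Split one pair per overlapped partition, and Replicate one pair for every partition from the start point onward.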
2-way Overlap Join Processing
Join R1 and R2 where R1 overlaps R2
Split R1 and Project R2
[Figure: intervals u (R1) and v (R2) over partition-intervals p1, p2, p3, p4]
MAP: u → (p1, u), (p2, u); v → (p2, v)
REDUCE: reducer p1 receives (p1, u); reducer p2 receives (p2, u), (p2, v) and outputs (u, v)
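A single-process sketch of this 2-way join: R1 is split, R2 is projected, and each reducer joins what it receives. It assumes Allen's strict overlaps (so the start of the R2 interval lies inside the R1 interval, which is why projecting R2 on its start point suffices), equal-width partitions, and hypothetical sample data.

```python
from collections import defaultdict

WIDTH = 10  # hypothetical partition width: p1=[0,10), p2=[10,20), ...


def part(t):
    """1-based index of the partition-interval containing point t."""
    return int(t // WIDTH) + 1


def overlaps(a, b):
    """Allen-style 'a overlaps b' (assumed semantics)."""
    return a[0] < b[0] < a[1] < b[1]


def map_r1(u):
    """Split: emit u under every partition it overlaps."""
    for p in range(part(u[0]), part(u[1]) + 1):
        yield p, ("R1", u)


def map_r2(v):
    """Project: emit v only under the partition containing its start point."""
    yield part(v[0]), ("R2", v)


def reduce_overlap(values):
    r1 = [iv for tag, iv in values if tag == "R1"]
    r2 = [iv for tag, iv in values if tag == "R2"]
    return [(u, v) for u in r1 for v in r2 if overlaps(u, v)]


def overlap_join(rel1, rel2):
    groups = defaultdict(list)
    for u in rel1:
        for key, value in map_r1(u):
            groups[key].append(value)
    for v in rel2:
        for key, value in map_r2(v):
            groups[key].append(value)
    out = []
    for p in sorted(groups):
        out.extend(reduce_overlap(groups[p]))
    return out


R1 = [(2, 14), (25, 33)]
R2 = [(12, 18), (31, 39), (5, 8)]
print(overlap_join(R1, R2))
```

Since each R2 interval reaches exactly one reducer, no output pair is produced twice.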
Multi-way Overlap Join
Splitting all relations does not work
– Reducer p1 will get [u]
– Reducer p2 will get [u, v]
– Reducer p3 will get [v, w, x]
– Reducer p4 will get [w, x]
No single reducer gets all the four intervals
[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
Query: R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
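The failure of "split everything" can be reproduced with a short simulation. The coordinates below are hypothetical, chosen so that u, v, w, x form a valid output tuple of the chain query and so that the reducer contents match the ones listed on this slide.

```python
WIDTH, NUM_PARTITIONS = 10, 4  # hypothetical: p1=[0,10), ..., p4=[30,40)


def part(t):
    """1-based index of the partition-interval containing point t."""
    return int(t // WIDTH) + 1


def split(interval):
    """Partitions overlapped by the interval."""
    start, end = interval
    return range(part(start), part(end) + 1)


# Hypothetical coordinates: u overlaps v, v overlaps w, w overlaps x.
intervals = {"u": (2, 14), "v": (12, 24), "w": (22, 34), "x": (28, 38)}

received = {p: [] for p in range(1, NUM_PARTITIONS + 1)}
for name, interval in intervals.items():
    for p in split(interval):
        received[p].append(name)

for p, names in received.items():
    print(f"Reducer p{p} gets {names}")

complete = [p for p, names in received.items() if len(names) == len(intervals)]
print("Reducers holding the full tuple:", complete)
```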
2-way Cascade
- Handle multi-way join as a cascade of 2-way joins
R1 Overlaps R2 and R2 Overlaps R3 and R3 Overlaps R4
- Large intermediate results
- Consequent large reading and communication cost
- Requires multiple map-reduce cycles
JR(R1, R2) Overlaps JR(R3, R4)
JR(R1, R2, R3, R4)
All-Replicate
- Replicate all relations and project the right-most one
- Replicate(u): [(p1, u), (p2, u), (p3, u), (p4, u)]
- Replicate(v): [(p2, v), (p3, v), (p4, v)]
- Replicate(w): [(p3, w), (p4, w)]
- Project(x): [(p4, x)]
- Reducer p1 gets [u], Reducer p2 gets [u,v]
- Reducer p3 gets [u,v,w], Reducer p4 gets [u,v,w,x]
- Reducer p4 hence can compute the output tuple
[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
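A sketch of All-Replicate on one output tuple, with a count of the key-value pairs it ships. The coordinates are hypothetical, chosen to reproduce the reducer contents listed on this slide; Project is assumed to use the partition containing the start point.

```python
WIDTH, NUM_PARTITIONS = 10, 4  # hypothetical: p1=[0,10), ..., p4=[30,40)


def part(t):
    """1-based index of the partition-interval containing point t."""
    return int(t // WIDTH) + 1


def replicate(interval):
    """Start partition and every partition to its right."""
    return range(part(interval[0]), NUM_PARTITIONS + 1)


def project(interval):
    """Only the partition containing the start point (assumed)."""
    return [part(interval[0])]


# Hypothetical coordinates: u overlaps v, v overlaps w, w overlaps x.
u, v, w, x = (2, 14), (12, 24), (22, 34), (32, 38)
plan = [("u", u, replicate), ("v", v, replicate), ("w", w, replicate), ("x", x, project)]

received = {p: [] for p in range(1, NUM_PARTITIONS + 1)}
pairs_emitted = 0
for name, interval, operation in plan:
    for p in operation(interval):
        received[p].append(name)
        pairs_emitted += 1

print(received)
print("key-value pairs emitted:", pairs_emitted)
```

Reducer p4 can compute the output tuple, but every interval of R1, R2, and R3 is shipped to all partitions to its right whether or not it can contribute to an output, which is the cost the slides aim to avoid.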
Multi-way joins
Need a method that handles all join predicates at once
and does not replicate all the intervals
RCCIS (Replicate Consistent and Crossing Interval-Sets) achieves precisely this
Key Concepts
– Consistent Interval-sets
– Crossing Interval-sets
RCCIS Flow
Cycle 1
– Input: Interval Data – Relations (R1, R2, R3, R4)
– Map-1 (Split the Data): Intermediate Key-Value Pairs, data sub-sets 1 to 4 (partition-intervals p1, p2, p3, p4)
– Reduce-1: outputs the intervals to replicate
Cycle 2
– Input: Interval Data – Relations (R1, R2, R3, R4)
– Map-2 (Split the Data) and Map-2 (Replicate the Data): data sub-sets 1 to 4 (partition-intervals p1, p2, p3, p4)
– Reduce-2: Join Output
Consistent Interval Sets
Consider an interval-set S
S is consistent if any two of its intervals satisfy all the join conditions formed by the relations to which these intervals belong.
Consider the set {u, v, w}. Its relation-set is {R1, R2, R3}
Consistency implies that the intervals u, v and w satisfy all the join conditions formed by relations R1, R2, R3 in the query
– i.e., R1 overlaps R2 and R2 overlaps R3
R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
Example
{u0, v0} is consistent; its relation-set is {R1, R2}
– u0 and v0 must overlap due to condition ‘R1 overlaps R2’
R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
Example
{u3, v1, w2} is consistent; its relation-set is {R1, R2, R3}
– u3 and v1 must overlap due to condition ‘R1 overlaps R2’
– v1 must contain w2 due to condition ‘R2 contains R3’
R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
Example
{u2, v1, w1, x3} is consistent; its relation-set is {R1, R2, R3, R4}
– u2 and v1 must overlap due to condition ‘R1 overlaps R2’
– v1 must contain w1 due to condition ‘R2 contains R3’
– w1 and x3 must overlap due to condition ‘R3 overlaps R4’
R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
Example
{u1, v1, w1} is not consistent; its relation-set is {R1, R2, R3}
– u1 and v1 do not overlap
R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
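The consistency check used in these examples can be sketched as below. It assumes an interval-set holds at most one interval per relation and is represented as a dict from relation name to `(start, end)`; the predicate definitions follow Allen's strict forms, and the coordinates are hypothetical.

```python
from itertools import combinations


def overlaps(a, b):
    """Allen-style 'a overlaps b' (assumed semantics)."""
    return a[0] < b[0] < a[1] < b[1]


def contains(a, b):
    """Allen-style 'a contains b': b lies strictly inside a."""
    return a[0] < b[0] and b[1] < a[1]


PREDICATES = {"overlaps": overlaps, "contains": contains}

# Query: R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
QUERY = [("R1", "overlaps", "R2"), ("R2", "contains", "R3"), ("R3", "overlaps", "R4")]


def is_consistent(interval_set, query=QUERY):
    """True if every join condition between two present relations holds."""
    for left, pred, right in query:
        if left in interval_set and right in interval_set:
            if not PREDICATES[pred](interval_set[left], interval_set[right]):
                return False
    return True


# Hypothetical output tuple {u, v, w, x}.
output_tuple = {"R1": (0, 10), "R2": (5, 30), "R3": (12, 20), "R4": (18, 40)}

# Every subset of an output tuple is a consistent interval-set.
all_subsets_ok = all(
    is_consistent({rel: output_tuple[rel] for rel in subset})
    for size in range(1, len(output_tuple) + 1)
    for subset in combinations(output_tuple, size)
)
print(all_subsets_ok)
print(is_consistent({"R1": (0, 4), "R2": (5, 30)}))  # intervals do not overlap
```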
Intuition
Each subset of an output tuple is a consistent interval-set.
Each subset of {u, v, w, x} is consistent.
R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
Crossing Interval Sets
Query: R1 overlaps R2 and R2 contains R3 and R3 overlaps R4
Consider query Q and its relation-set R.
– R : {R1, R2, R3, R4}
Consider an interval-set U, its relation-set RU, and the set R – RU
– U : {u, v, w}, RU : {R1, R2, R3}, R – RU : {R4}
Consider the join conditions formed by one relation in RU and one relation in R – RU
– R3 overlaps R4
Set U crosses partition-interval p if the following holds (for overlap)
– For Re1 overlaps Re2 with Re1 ∈ RU and Re2 ∈ R – RU: the interval of Re1 crosses partition-interval p on the right
– For Re1 overlaps Re2 with Re1 ∈ R – RU and Re2 ∈ RU: the interval of Re2 crosses partition-interval p on the left
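A sketch of the crossing test for overlap conditions only. It assumes that "crosses p on the right" means the interval ends beyond the right boundary of p and "crosses p on the left" means it starts before the left boundary of p, and that every boundary condition must hold. The chain query, the coordinates, and the dict representation are hypothetical.

```python
def crosses(interval_set, query, p_start, p_end):
    """Check whether interval-set U crosses partition-interval [p_start, p_end).

    interval_set maps relation name -> (start, end). Only 'overlaps'
    conditions are handled in this sketch. A set covering every relation
    has no boundary conditions and returns True vacuously.
    """
    present = set(interval_set)
    for left, pred, right in query:
        if pred != "overlaps":
            raise NotImplementedError("sketch handles the overlap case only")
        if left in present and right not in present:
            # Re1 in RU, Re2 outside: Re1's interval must cross p on the right.
            if not interval_set[left][1] > p_end:
                return False
        elif right in present and left not in present:
            # Re2 in RU, Re1 outside: Re2's interval must cross p on the left.
            if not interval_set[right][0] < p_start:
                return False
    return True


CHAIN = [("R1", "overlaps", "R2"), ("R2", "overlaps", "R3"), ("R3", "overlaps", "R4")]
P2 = (10, 20)  # hypothetical partition-interval p2

set_a = {"R1": (8, 14), "R2": (12, 18), "R3": (16, 25)}   # R3 interval ends past p2
set_b = {"R2": (6, 15), "R3": (13, 24)}                   # R2 starts before p2, R3 ends past p2
set_c = {"R1": (8, 14), "R2": (12, 18)}                   # R2 interval ends inside p2

print(crosses(set_a, CHAIN, *P2), crosses(set_b, CHAIN, *P2), crosses(set_c, CHAIN, *P2))
```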
Intuition
An output tuple can be visualized as formed by the union of a crossing set with another consistent set
Consider output tuple – {u, v, w, x}
– {u} crosses p1 and combines with consistent set {v, w, x}
– {u, v} crosses p2 and combines with consistent set {w, x}
– {v, w} crosses p3 and combines with consistent set {u, x}
[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
Example
{u3,v1,w2} crosses p2
RU is {R1, R2, R3}, R – RU is {R4}
– Condition to check: R3 overlaps R4
– w2 crosses p2 on the right
R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
Example
{v3,w2} crosses p2
RU is {R2, R3}, R – RU is {R1, R4}
– Conditions to check: R1 overlaps R2 and R3 overlaps R4
– v3 crosses p2 on the left and w2 crosses p2 on the right
R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
Example
{u3, v2} does not cross p2
RU is {R1, R2}, R – RU is {R3, R4}
– Condition to check: R2 overlaps R3
– v2 does not cross p2 on the right
R1 overlaps R2 and R2 overlaps R3 and R3 overlaps R4
Replicate Consistent and Crossing Interval Sets (RCCIS)
Runs in two map-reduce cycles
First Map splits all the relations
Reducers in first cycle determine which intervals to replicate
Second cycle replicates the selected intervals and computes the multi-way join
[Figure: intervals u, v, w, x over partition-intervals p1, p2, p3, p4]
RCCIS – Which Intervals to Replicate?
Reducer p computes the set of consistent and crossing sets
All intervals belonging to such sets which start in the partition-interval p are replicated.
Example
Reducer p2 receives {u1, u2, u3, v1, v2, v3, w1, w2, x1, x2}
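Consistent and Crossing sets are [{u3, v1, w2}, {v3, w2}]
– No other set satisfies the conditions of RCCIS
Intervals belonging to these sets are {u3, v1, v3, w2}
p2 replicates {u3, v1, w2}; v3 is not replicated by p2 because it does not start in partition-interval p2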
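The selection rule applied by reducer p2 can be sketched as a filter over the consistent and crossing sets it has already found. Only the final selection step is shown; how the sets are enumerated is not covered here. The coordinates are hypothetical, chosen so that v3 starts before p2 while u3, v1, and w2 start inside it.

```python
def intervals_to_replicate(crossing_sets, p_start, p_end):
    """Keep the intervals of consistent-and-crossing sets that start in [p_start, p_end).

    crossing_sets: list of dicts mapping an interval name to (start, end);
    this representation is an assumption made for illustration.
    """
    selected = {}
    for interval_set in crossing_sets:
        for name, (start, end) in interval_set.items():
            if p_start <= start < p_end:
                selected[name] = (start, end)
    return selected


P2 = (10, 20)  # hypothetical partition-interval p2
u3, v1, v3, w2 = (11, 16), (13, 19), (7, 18), (15, 26)  # hypothetical coordinates
crossing_sets = [{"u3": u3, "v1": v1, "w2": w2}, {"v3": v3, "w2": w2}]

to_replicate = intervals_to_replicate(crossing_sets, *P2)
print(sorted(to_replicate))
```

Restricting the choice to intervals that start in p keeps each interval's replication decision with a single reducer, namely the one for the partition containing its start point.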
Experimental Evaluation on Synthetic Data
Synthetic Data
– Time Range: [0, 100K]
– Max Interval Length: 100
– Query: R1 overlaps R2 and R2 overlaps R3
Experimental Evaluation on Internet Packet Trace Data
- Packet traces taken from the MAWI group archive: http://tracer.csl.sonmy.co.jp/mawi
Questions ?
Discussed the challenges in handling Interval-Joins on Map-Reduce
Presented Naïve algorithms for handling interval joins
Presented Novel algorithms for handling interval joins and illustrated that these significantly improve on naïve methods.