Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
Department of Computer Science
University of California, Irvine
Special Interest Group on Management of Data, 2010
Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 1 / 22
Outline
1 Motivation
2 Problem Statement
3 Algorithms
4 Experimental Evaluation
Example: Data Cleaning/Master-Data-Management
Customer data from two departments
Sales
ID   Name          ...
S10  John W Smith  ...
...
Returns
ID   Name          ...
R20  Smith John    ...
...
Master customer data across two departments
Customers
ID   Name          ...
C30  John W Smith  ...
...
Example: Data Cleaning/Master-Data-Management
Customer data from two departments
Sales
ID   Name          ...
S10  John W Smith  ...
...
Returns
ID   Name          ...
R20  John W Smith  ...
...
Master customer data across two departments
Customers
ID   Name          ...
C30  John W Smith  ...
...
Example: Data Cleaning/Master-Data-Management
String → Set
S10: “John W Smith” → {John, Smith, W}
R20: “Smith John” → {John, Smith}
Set-Similarity Metric
Jaccard similarity/Tanimoto coefficient: jaccard(x, y) = |x ∩ y| / |x ∪ y|
jaccard(S10, R20) = 2/3
Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”
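The string-to-set conversion and the Jaccard computation above can be sketched in a few lines of Python. This is only an illustration: whitespace tokenization and the helper names `tokenize` and `jaccard` are assumptions, not taken from the slides.

```python
from __future__ import annotations


def tokenize(name: str) -> frozenset[str]:
    """Split a string on whitespace and return its set of word tokens."""
    return frozenset(name.split())


def jaccard(x: frozenset[str], y: frozenset[str]) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y|.

    Two empty sets are given similarity 0.0 here (an arbitrary choice).
    """
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)


s10 = tokenize("John W Smith")   # {John, Smith, W}
r20 = tokenize("Smith John")     # {John, Smith}
similarity = jaccard(s10, r20)   # 2 shared tokens out of 3 distinct tokens
```

The two sets share John and Smith, and their union has three tokens, which gives the 2/3 shown on the slide.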
Problem Statement: Set-Similarity Join
Input
Two files of records, e.g., R(RID, a, b) and S(RID, c, d)
A join column on each file, e.g., R.a and S.c
A similarity function, sim, e.g., Jaccard
A similarity threshold, τ
Output
All pairs of records from R and S where sim(R.a, S.c) ≥ τ
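As a reference point for the problem statement, a brute-force, single-machine version of the join might look like the sketch below. It compares every record of R with every record of S, which is exactly the cost the parallel algorithms try to avoid. The records S11 and R21, the threshold 0.6, and the function names are hypothetical, and each record is reduced to (RID, join value) for brevity.

```python
from itertools import product


def jaccard(x, y):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def naive_similarity_join(r, s, tau):
    """Return (rid_r, rid_s, sim) for every pair with sim >= tau.

    r and s are lists of (rid, join_value) tuples; join values are
    tokenized on whitespace. Cost is |R| * |S| similarity computations.
    """
    results = []
    for (rid_r, a), (rid_s, c) in product(r, s):
        sim = jaccard(set(a.split()), set(c.split()))
        if sim >= tau:
            results.append((rid_r, rid_s, sim))
    return results


# Hypothetical sample data in the spirit of the Sales/Returns example.
sales = [("S10", "John W Smith"), ("S11", "Mary Jones")]
returns = [("R20", "Smith John"), ("R21", "Jones Mary Ann")]
pairs = naive_similarity_join(sales, returns, tau=0.6)
```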
Parallelizing Set-Similarity Joins
Large amounts of data
E.g., GenBank: 100M, Google N-gram: 1T
Data or processing does not fit in one machine
Use a cluster of machines and a parallel algorithm
MapReduce: shared-nothing data-processing platform
Challenges
Partition problem for parallelism
Solve using Map, Sort, and Reduce
Compute end-to-end set-similarity joins
Deal with out-of-memory situations
MapReduce Review
map (k1, v1) → list(k2, v2);
reduce (k2, list(v2)) → list(k3, v3).
combine (k2, list(v2)) → list(k2, v2).
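The map and reduce signatures above can be mimicked in memory to see how data flows through a job. The `run_mapreduce` helper below is a hypothetical single-process stand-in, not Hadoop's API; it applies the map function, groups the intermediate pairs by key (the shuffle and sort step), and calls the reduce function once per key. The combine step is not simulated.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one map/shuffle/reduce pass over (k1, v1) records in memory."""
    # Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = chain.from_iterable(map_fn(k1, v1) for k1, v1 in records)
    # Shuffle: group all v2 values under their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: visit keys in sorted order, as a sorted shuffle would.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def count_map(rid, name):
    """Emit (token, 1) for every whitespace-separated token."""
    for token in name.split():
        yield token, 1


def count_reduce(token, counts):
    """Sum the partial counts for one token."""
    yield token, sum(counts)


records = [("S10", "John W Smith"), ("R20", "Smith John")]
token_counts = run_mapreduce(records, count_map, count_reduce)
```

Because summation is associative and commutative, `count_reduce` could also serve as the combine function in a real job, shrinking the data sent over the network.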
Parallel Set-Similarity Joins in MapReduce
Main idea
Hash-partition data across the network based on keys
Join values cannot be directly used as keys
Generate signatures from join values and use them as keys
e.g., “John W Smith” → {John, Smith, W}
Three Stages
1 Compute set-element statistics
Records → Statistics
2 Generate similar record-ID (“RID”) pairs
{Records, Statistics} → RID-pairs
3 Generate pairs of joined records
{Records, RID-pairs} → Record-pairs
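The main idea, using signatures as keys so that similar records land on the same reducer, can be illustrated with the sketch below. It treats every token as a signature and uses `zlib.crc32` as a deterministic stand-in for the framework's hash partitioner; both choices, along with the function names and the reducer count, are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def signatures(name):
    """Use each distinct token of the join value as a signature (key)."""
    return sorted(set(name.split()))


def partition(key, num_reducers):
    """Deterministic hash partitioner (stand-in for the framework's)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(records, num_reducers):
    """Return {reducer_id: {signature: [rid, ...]}}."""
    reducers = defaultdict(lambda: defaultdict(list))
    for rid, name in records:
        for sig in signatures(name):
            reducers[partition(sig, num_reducers)][sig].append(rid)
    return reducers


records = [("S10", "John W Smith"), ("R20", "Smith John")]
routed = route(records, num_reducers=4)

# Flatten to see which records meet under which signature.
groups = {sig: rids for by_sig in routed.values() for sig, rids in by_sig.items()}
```

Records S10 and R20 meet under both John and Smith, so the pair can be compared on whichever reducers own those keys. Emitting every token replicates each record once per token, which is why shorter signature sets are preferable.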
Set-Similarity Filtering
E.g., Prefix Filtering
Pigeonhole principle
Global order for set elements:
E.g., sim is overlap size, τ = 4
Prefix length is 2
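The figure for this slide is not in the transcript, so the sketch below assumes sets of five elements, for which the standard overlap prefix length |x| − τ + 1 gives the prefix length 2 stated above. The element names, the alphabetical global order, and the helper names are all hypothetical. The filter relies on the usual prefix-filtering property: if two sets overlap in at least τ elements, their prefixes under a common global order share at least one element.

```python
def prefix(tokens, rank, tau):
    """First |x| - tau + 1 elements of x under the global order `rank`."""
    ordered = sorted(set(tokens), key=lambda t: rank[t])
    return ordered[: max(len(ordered) - tau + 1, 0)]


def may_match(x, y, rank, tau):
    """Candidate test: True if the two prefixes share an element."""
    return bool(set(prefix(x, rank, tau)) & set(prefix(y, rank, tau)))


def overlap(x, y):
    """Overlap similarity: number of shared elements."""
    return len(set(x) & set(y))


# Hypothetical global order A < B < ... < H.
rank = {t: i for i, t in enumerate("ABCDEFGH")}
tau = 4

x = ["B", "C", "D", "E", "F"]   # prefix [B, C]
y = ["A", "B", "C", "D", "E"]   # prefix [A, B], overlap with x is 4
z = ["D", "E", "F", "G", "H"]   # prefix [D, E], overlap with x is 3

keep_xy = may_match(x, y, rank, tau)   # prefixes share B: verify the pair
keep_xz = may_match(x, z, rank, tau)   # prefixes are disjoint: prune
```

A pair that passes the filter still has to be verified with the real similarity function; the filter only guarantees that no qualifying pair is pruned.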
Processing Stages and Alternatives
Stage 1: Token Ordering
Compute the token frequencies and sort
Two MapReduce phases: sort in MapReduce (BTO)
One MapReduce phase: sort in memory (OPTO)
Stage 2: Kernel (RID-Pair Generation)
Use prefix-filter to divide, conquer using:
Nested loops (BK)
Single-machine set-similarity join algorithm (PK)
Stage 3: Record Join
Generate pairs of similar records
Two MapReduce phases: reduce-side join (BRJ)
One MapReduce phase: map-side join (OPRJ)
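For Stage 1, a single-process sketch of "compute the token frequencies and sort" is shown below. It counts tokens and sorts entirely in memory, which is closest in spirit to the one-phase variant. Sorting by increasing frequency (so that rare tokens come first and end up in prefixes) and breaking ties alphabetically are assumptions of this sketch, as are the record S11 and the function name.

```python
from collections import Counter


def token_order(records):
    """Return tokens sorted by increasing frequency, ties broken by name.

    Frequency here is the number of records containing the token.
    """
    counts = Counter(
        token for _, name in records for token in set(name.split())
    )
    return sorted(counts, key=lambda t: (counts[t], t))


# S11 is a hypothetical extra record added to make frequencies differ.
records = [
    ("S10", "John W Smith"),
    ("R20", "Smith John"),
    ("S11", "Mary Smith"),
]
order = token_order(records)
rank = {token: i for i, token in enumerate(order)}  # global order lookup
```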
Stage 2: RID-Pair Generation
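The diagram for this stage did not survive in the transcript. As a rough illustration of the prefix-filter-then-nested-loops idea for a self-join, the sketch below routes each record to one group per prefix token and compares all records within a group. It assumes Jaccard similarity with the standard prefix length |x| − ⌈τ·|x|⌉ + 1; the token ranks, the record S11, and the function names are hypothetical, and the map and reduce steps are simulated in one process.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two non-empty sets."""
    return len(x & y) / len(x | y)


def prefix_tokens(tokens, rank, tau):
    """Jaccard prefix: first |x| - ceil(tau * |x|) + 1 tokens by rank."""
    ordered = sorted(tokens, key=lambda t: rank[t])
    # The small epsilon guards against float error pushing ceil() up by one.
    length = len(ordered) - math.ceil(tau * len(ordered) - 1e-9) + 1
    return ordered[:length]


def rid_pairs(records, rank, tau):
    """Self-join: return sorted RID pairs whose Jaccard similarity >= tau."""
    # "Map": send each record to one group per prefix token.
    groups = defaultdict(list)
    for rid, name in records:
        tokens = frozenset(name.split())
        for token in prefix_tokens(tokens, rank, tau):
            groups[token].append((rid, tokens))
    # "Reduce": nested-loop comparison inside each group, then verify.
    pairs = set()
    for members in groups.values():
        for (rid1, t1), (rid2, t2) in combinations(members, 2):
            if jaccard(t1, t2) >= tau:
                pairs.add(tuple(sorted((rid1, rid2))))  # drop duplicates
    return sorted(pairs)


# Hypothetical global token order: rarest token first.
rank = {"Mary": 0, "W": 1, "John": 2, "Smith": 3}
records = [
    ("S10", "John W Smith"),
    ("R20", "Smith John"),
    ("S11", "Mary Smith"),
]
similar = rid_pairs(records, rank, tau=0.6)
```

With τ = 0.6 the prefixes are [W, John], [John], and [Mary], so only S10 and R20 meet (under John) and only that pair is verified. The frequent token Smith never becomes a key, which keeps its group from growing large.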
Additional Contributions
Solve R-S join case
Optimize memory requirements for R-S join
Handle insufficient memory
Introduce additional filters
Use external memory
Experimental Setting
Hardware
10-node IBM x3650 cluster
Intel Xeon processor E5520 2.26GHz with four cores
Four 300GB hard disks
12GB RAM
Software
Ubuntu 9.06, 64-bit, server edition OS
Java 1.6, 64-bit, server
Hadoop 0.20.1
Experimental Setting
Datasets
DBLP
Average length: 259 bytes
Number of records: 1.2M
Total size: 300MB
CITESEERX
Average length: 1374 bytes
Number of records: 1.3M
Total size: 1.8GB
Increased each up to ×25, preserving join properties
DBLP: 31M records, 8.2GB
CITESEERX: 32M records, 45GB
Running Time
Self-join DBLP×n, n ∈ [5, 25]
10-node cluster
Best time
Bulk of the time
Speedup
[Figure: speedup vs. number of nodes (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
Relative running time
Self-join DBLP×10
Different cluster sizes
Speedup Breakdown
[Figure, Stage 1: speedup vs. number of nodes (2 to 10); series: BTO, OPTO, Ideal]
[Figure, Stage 2: speedup vs. number of nodes (2 to 10); series: BK, PK, Ideal]
[Figure, Stage 3: speedup vs. number of nodes (2 to 10); series: BRJ, OPRJ, Ideal]
Relative running time
Self-join DBLP×10
Different cluster sizes
Scaleup
[Figure: running time (seconds) vs. number of nodes and dataset size (2 to 10); series: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]
Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster
Scaleup Breakdown
[Figure, Stage 1: running time (seconds) vs. number of nodes and dataset size (2 to 10); series: BTO, OPTO, Ideal]
[Figure, Stage 2: running time (seconds) vs. number of nodes and dataset size (2 to 10); series: BK, PK, Ideal]
[Figure, Stage 3: running time (seconds) vs. number of nodes and dataset size (2 to 10); series: BRJ, OPRJ, Ideal]
Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster
Summary
Set-similarity joins in MapReduce
Three-stage approach
Balance workload and minimize replication
End-to-end algorithms
Self-join
R-S join
Memory issues
Optimize memory requirements
Handle insufficient memory
Experiments
40 cores, 40 disks cluster
Speedup and scaleup
More Info
A longer version of the paper, the source-code and the datasets:
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
This work is part of the ASTERIX (http://asterix.ics.uci.edu/) and Flamingo (http://flamingo.ics.uci.edu/) projects at UC Irvine.
Stage 1: Basic Token Ordering (BTO)
Stage 1: One Phase Token Ordering (OPTO)
Stage 2: Basic Kernel (BK)
Stage 2: Indexed Kernel (PK)
Stage 3: Basic Record Join (BRJ)
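The diagram for this slide is missing from the transcript, so the following is only a guess at the general shape of a two-pass, reduce-side record join: the first pass groups by RID so that each record meets the RID pairs that mention it, and the second pass groups by RID pair so that the two halves meet. It runs in a single process, and the function name, the record S11, and the data layout are assumptions rather than the authors' implementation.

```python
from collections import defaultdict


def record_join(records, rid_pairs):
    """Turn RID pairs into pairs of full records using two grouping passes.

    records: list of (rid, record); rid_pairs: list of (rid1, rid2).
    """
    # Pass 1: key by RID; each record is attached to every pair naming it.
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records:
        by_rid[rid]["record"] = record
    for pair in rid_pairs:
        for rid in pair:
            by_rid[rid]["pairs"].append(pair)
    halves = [
        (pair, (rid, group["record"]))
        for rid, group in by_rid.items()
        for pair in group["pairs"]
    ]
    # Pass 2: key by RID pair; the two halves of each pair come together.
    by_pair = defaultdict(dict)
    for pair, (rid, record) in halves:
        by_pair[pair][rid] = record
    return [
        ((pair[0], found[pair[0]]), (pair[1], found[pair[1]]))
        for pair, found in sorted(by_pair.items())
    ]


# S11 is a hypothetical record that takes part in no similar pair.
records = [
    ("S10", "John W Smith"),
    ("R20", "Smith John"),
    ("S11", "Mary Jones"),
]
joined = record_join(records, [("S10", "R20")])
```

Records that appear in no RID pair, such as S11 here, produce no output in the first pass and therefore never reach the second.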
Stage 3: One Phase Record Join (OPRJ)