Efficient Parallel Set-Similarity Joins Using MapReduce

Rares Vernica, Michael J. Carey, Chen Li

Department of Computer Science, University of California, Irvine

Special Interest Group on Management of Data, 2010

Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 1 / 22


Outline

1 Motivation

2 Problem Statement

3 Algorithms

4 Experimental Evaluation


Example: Data Cleaning/Master-Data-Management

Customer data from two departments

Sales
ID   Name          ...
S10  John W Smith  ...
...

Returns
ID   Name          ...
R20  Smith John    ...
...

Master customer data across two departments

Customers
ID   Name          ...
C30  John W Smith  ...
...


Example: Data Cleaning/Master-Data-Management

String → Set
- S10: “John W Smith” → {John, Smith, W}
- R20: “Smith John” → {John, Smith}

Set-Similarity Metric
- Jaccard similarity/Tanimoto coefficient: jaccard(x, y) = |x ∩ y| / |x ∪ y|
- jaccard(S10, R20) = 2/3

Set-Similarity Join
- Set-Similarity Join “Sales” and “Returns” on “Name”
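The string-to-set conversion and the Jaccard metric above can be sketched in a few lines of Python. This is only an illustration: whitespace tokenization and the function names `tokenize` and `jaccard` are assumptions, not something the slides prescribe.

```python
def tokenize(name: str) -> frozenset:
    """Split a string on whitespace and return its set of word tokens."""
    return frozenset(name.split())


def jaccard(x: frozenset, y: frozenset) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y|."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)


s10 = tokenize("John W Smith")   # {John, Smith, W}
r20 = tokenize("Smith John")     # {John, Smith}
print(jaccard(s10, r20))         # 2/3, about 0.667
```

The two names share two tokens out of three distinct tokens overall, which gives the 2/3 shown on the slide.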



Problem Statement: Set-Similarity Join

Input
- Two files of records, e.g., R(RID, a, b) and S(RID, c, d)
- A join column on each file, e.g., R.a and S.c
- A similarity function, sim, e.g., Jaccard
- A similarity threshold, τ

Output
- All pairs of records from R and S where sim(R.a, S.c) ≥ τ
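As a reference point for the problem statement, the following is a minimal single-machine sketch that compares every pair of records. It is a brute-force baseline, not the parallel algorithm of the talk. The records S11 and R21 and the threshold 0.6 are made up for illustration.

```python
from itertools import product


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity of two non-empty token sets."""
    return len(x & y) / len(x | y)


def naive_similarity_join(r, s, sim, tau):
    """Return (rid_r, rid_s, score) for every pair with sim(R.a, S.c) >= tau."""
    result = []
    for (rid_r, a), (rid_s, c) in product(r, s):
        score = sim(a, c)
        if score >= tau:
            result.append((rid_r, rid_s, score))
    return result


# (RID, join-column token set); S11 and R21 are hypothetical records.
R = [("S10", {"John", "W", "Smith"}), ("S11", {"Jane", "Doe"})]
S = [("R20", {"Smith", "John"}), ("R21", {"Doe", "Jane", "A"})]

pairs = naive_similarity_join(R, S, jaccard, tau=0.6)
print([(a, b) for a, b, _ in pairs])
```

The cost is |R| × |S| similarity computations, which is what motivates the filtering and partitioning discussed in the rest of the talk.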


Parallelizing Set-Similarity Joins

Large amounts of data
- E.g., GenBank: 100M, Google N-gram: 1T
- Data or processing does not fit in one machine
- Use a cluster of machines and a parallel algorithm
- MapReduce: shared-nothing data-processing platform

Challenges
- Partition problem for parallelism
- Solve using Map, Sort, and Reduce
- Compute end-to-end set-similarity joins
- Deal with out-of-memory situations


MapReduce Review

map (k1, v1) → list(k2, v2);
reduce (k2, list(v2)) → list(k3, v3).

combine (k2, list(v2)) → list(k2, v2).
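The three signatures above can be exercised with a small single-process stand-in. This is a sketch for intuition only and does not use the Hadoop API. The helper `run_mapreduce` is hypothetical, and treating each input record as its own map task (so the combiner runs per record) is a simplifying assumption.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Group a list of (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_mapreduce(records, mapper, reducer, combiner=None):
    """Simulate map -> optional combine -> sort/group by key -> reduce."""
    shuffled = []
    for k1, v1 in records:
        pairs = list(mapper(k1, v1))                 # map (k1, v1) -> list(k2, v2)
        if combiner is not None:                     # combine (k2, list(v2)) -> list(k2, v2)
            local = group_by_key(pairs)
            pairs = [kv for k2, vs in local.items() for kv in combiner(k2, vs)]
        shuffled.extend(pairs)
    output = []
    for k2, values in sorted(group_by_key(shuffled).items()):
        output.extend(reducer(k2, values))           # reduce (k2, list(v2)) -> list(k3, v3)
    return output


def token_mapper(rid, name):
    """Emit (token, 1) for each word of the record's name."""
    return [(token, 1) for token in name.split()]


def sum_reducer(token, counts):
    """Sum the partial counts of one token."""
    return [(token, sum(counts))]


records = [("S10", "John W Smith"), ("R20", "Smith John")]
counts = run_mapreduce(records, token_mapper, sum_reducer, combiner=sum_reducer)
print(counts)  # [('John', 2), ('Smith', 2), ('W', 1)]
```

Because summation is associative and commutative, the same function can serve as both combiner and reducer here.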


Parallel Set-Similarity Joins in MapReduce

Main idea
- Hash-partition data across the network based on keys
- Join values cannot be directly used as keys
- Generate signatures from join values and use them as keys
  e.g., “John W Smith” → {John, Smith, W}

Three Stages
1 Compute set-element statistics
  Records → Statistics
2 Generate similar record-ID (“RID”) pairs
  {Records, Statistics} → RID-pairs
3 Generate pairs of joined records
  {Records, RID-pairs} → Record-pairs
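The main idea can be illustrated as follows: "John W Smith" and "Smith John" would hash to different partitions as whole strings, but they meet in the same partition once each record is routed by its tokens. The sketch below assumes the simplest possible signature scheme (every word token is a signature), uses a dictionary in place of network partitioning, and includes a made-up record S11.

```python
from collections import defaultdict
from itertools import combinations


def signatures(name):
    """Assumed signature scheme: every word token of the join value."""
    return set(name.split())


def partition_by_signature(records):
    """Stand-in for hash partitioning: one bucket of RIDs per signature."""
    buckets = defaultdict(list)
    for rid, name in records:
        for sig in signatures(name):
            buckets[sig].append(rid)
    return buckets


def candidate_pairs(buckets):
    """RID pairs that meet in at least one bucket, deduplicated."""
    pairs = set()
    for rids in buckets.values():
        for a, b in combinations(sorted(rids), 2):
            pairs.add((a, b))
    return pairs


# S11 is a hypothetical record that shares no token with the others.
records = [("S10", "John W Smith"), ("R20", "Smith John"), ("S11", "Jane Doe")]
buckets = partition_by_signature(records)
print(candidate_pairs(buckets))  # {('R20', 'S10')}
```

Using every token as a key replicates each record once per token; reducing that replication is the purpose of the filtering step.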


Set-Similarity Filtering

E.g., Prefix Filtering
- Pigeonhole principle
- Global order for set elements:
- E.g., sim is overlap size, τ = 4
- Prefix length is 2
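The slide's figure is not preserved in this transcript, so the example below is a reconstruction under assumptions: sets of five elements, alphabetical order as the global order, and overlap size as the similarity. With τ = 4 the prefix length is 5 - 4 + 1 = 2, which matches the slide. The pigeonhole argument: a set has only τ - 1 elements outside its prefix, so if two sets share τ elements, the smallest shared element must lie in both prefixes.

```python
def prefix(sorted_tokens, tau):
    """First |x| - tau + 1 tokens of a set already sorted by the global order."""
    length = len(sorted_tokens) - tau + 1
    return sorted_tokens[:max(length, 0)]


def may_match(x, y, tau):
    """Prefix filter: sets with overlap >= tau must share a prefix token."""
    return bool(set(prefix(x, tau)) & set(prefix(y, tau)))


def overlap(x, y):
    """Overlap similarity: number of shared elements."""
    return len(set(x) & set(y))


TAU = 4
# Hypothetical sets, each listed in the (assumed alphabetical) global order.
x = ["a", "b", "c", "d", "e"]
y = ["b", "c", "d", "e", "f"]
w = ["c", "d", "e", "f", "g"]

print(prefix(x, TAU))                         # ['a', 'b']
print(may_match(x, y, TAU), overlap(x, y))    # True 4
print(may_match(x, w, TAU), overlap(x, w))    # False 3
```

The filter never discards a true match, but it can let through pairs that fail the threshold, so surviving candidates still need verification.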


Processing Stages and Alternatives

Stage 1: Token Ordering
Compute the token frequencies and sort
- Two MapReduce phases: sort in MapReduce (BTO)
- One MapReduce phase: sort in memory (OPTO)

Stage 2: Kernel (RID-Pair Generation)
Use prefix-filter to divide, conquer using:
- Nested loops (BK)
- Single-machine set-similarity join algorithm (PK)

Stage 3: Record Join
Generate pairs of similar records
- Two MapReduce phases: reduce-side join (BRJ)
- One MapReduce phase: map-side join (OPRJ)
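The three stages above can be walked through in memory for a self-join. This sketch only loosely mirrors the stage boundaries and is not the authors' BTO, BK, or BRJ code. Several choices are assumptions not stated on the slides: rarest-first token order with alphabetical tie-breaking, the commonly used Jaccard prefix length |x| - ⌈τ·|x|⌉ + 1, nested-loop verification inside each group, and the record S11.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(x, y):
    return len(x & y) / len(x | y)


def stage1_token_order(records):
    """Stage 1: count token frequencies and sort (assumed: rarest first)."""
    freq = Counter(tok for _, name in records for tok in set(name.split()))
    return sorted(freq, key=lambda tok: (freq[tok], tok))


def stage2_rid_pairs(records, order, tau):
    """Stage 2: group records by prefix token, verify with nested loops."""
    rank = {tok: i for i, tok in enumerate(order)}
    groups = defaultdict(list)
    for rid, name in records:
        tokens = sorted(set(name.split()), key=rank.__getitem__)
        # Assumed Jaccard prefix length; the epsilon guards float round-up.
        plen = len(tokens) - ceil(tau * len(tokens) - 1e-9) + 1
        for tok in tokens[:plen]:
            groups[tok].append((rid, frozenset(tokens)))
    pairs = {}
    for members in groups.values():
        for (rid_a, a), (rid_b, b) in combinations(members, 2):
            score = jaccard(a, b)
            if score >= tau:
                # The dict key deduplicates pairs found in several groups.
                pairs[tuple(sorted((rid_a, rid_b)))] = score
    return pairs


def stage3_record_join(records, pairs):
    """Stage 3: replace each RID pair with the full records."""
    by_rid = dict(records)
    return [(by_rid[a], by_rid[b], s) for (a, b), s in sorted(pairs.items())]


TAU = 0.6  # hypothetical threshold
records = [("S10", "John W Smith"), ("R20", "Smith John"), ("S11", "Jane Doe")]
order = stage1_token_order(records)
rid_pairs = stage2_rid_pairs(records, order, TAU)
joined = stage3_record_join(records, rid_pairs)
print(order)
print(joined)
```

Ordering tokens rarest-first keeps the prefix-token groups small, which is one plausible reason to compute frequencies before generating RID pairs. Carrying only RIDs through stage 2 defers moving full records until stage 3.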


Stage 2: RID-Pair Generation


Additional Contributions

- Solve R-S-join case
- Optimize memory requirements for R-S join
- Handle insufficient memory
  - Introduce additional filters
  - Use external memory


Experimental Setting

Hardware
- 10-node IBM x3650 cluster
  - Intel Xeon processor E5520 2.26GHz with four cores
  - Four 300GB hard disks
  - 12GB RAM

Software
- Ubuntu 9.06, 64-bit, server edition OS
- Java 1.6, 64-bit, server
- Hadoop 0.20.1


Experimental Setting

Datasets
- DBLP
  - Average length: 259 bytes
  - Number of records: 1.2M
  - Total size: 300MB
- CITESEERX
  - Average length: 1374 bytes
  - Number of records: 1.3M
  - Total size: 1.8GB
- Increased each up to ×25, preserving join properties
  - DBLP: 31M records, 8.2GB
  - CITESEERX: 32M records, 45GB


Running Time

- Self-join DBLP×n, n ∈ [5,25]
- 10-node cluster
- Best time
- Bulk of the time


Speedup

Figure: speedup vs. # nodes (2 to 10) for BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, and Ideal.

- Relative running time
- Self-join DBLP×10
- Different cluster sizes


Speedup Breakdown

Figure (three panels): speedup vs. # nodes (2 to 10).
- Stage 1: BTO, OPTO, Ideal
- Stage 2: BK, PK, Ideal
- Stage 3: BRJ, OPRJ, Ideal

- Relative running time
- Self-join DBLP×10
- Different cluster sizes


Scaleup

Figure: time (seconds) vs. # nodes and dataset size (2 to 10) for BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, and Ideal.

- Running time
- Self-joining DBLP×n, n ∈ [5,25]
- Proportional cluster


Scaleup Breakdown

Figure (three panels): time (seconds) vs. # nodes and dataset size (2 to 10).
- Stage 1: BTO, OPTO, Ideal
- Stage 2: BK, PK, Ideal
- Stage 3: BRJ, OPRJ, Ideal

- Running time
- Self-joining DBLP×n, n ∈ [5,25]
- Proportional cluster


Summary

Set-similarity joins in MapReduce
- Three-stage approach
- Balance workload and minimize replication

End-to-end algorithms
- Self-join
- R-S join

Memory issues
- Optimize memory requirements
- Handle insufficient memory

Experiments
- 40 cores, 40 disks cluster
- Speedup and scaleup


More Info

A longer version of the paper, the source code, and the datasets:
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

This work is part of the ASTERIX (http://asterix.ics.uci.edu/) and Flamingo (http://flamingo.ics.uci.edu/) projects at UC Irvine.


Stage 1: Basic Token Ordering (BTO)


Stage 1: One Phase Token Ordering (OPTO)


Stage 2: Basic Kernel (BK)


Stage 2: Indexed Kernel (PK)


Stage 3: Basic Record Join (BRJ)


Stage 3: One Phase Record Join (OPRJ)
