efficient parallel set-similarity joins using mapreduce - sigmod 2010 slides

Efficient Parallel Set-Similarity Joins Using MapReduce

Rares Vernica Michael J. Carey Chen Li

Department of Computer Science
University of California, Irvine

Special Interest Group on Management of Data, 2010

Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 1 / 22

Outline

1 Motivation

2 Problem Statement

3 Algorithms

4 Experimental Evaluation

Example: Data Cleaning/Master-Data-Management

Customer data from two departments

Sales
ID    Name           ...
S10   John W Smith   ...

Returns
ID    Name           ...
R20   Smith John     ...
...

Master customer data across two departments

Customers
ID    Name           ...
C30   John W Smith   ...
R20   John W Smith   ...
...

String → Set
S10: “John W Smith” → {John, Smith, W}
R20: “Smith John” → {John, Smith}

Set-Similarity Metric

Jaccard similarity/Tanimoto coefficient: jaccard(x, y) = |x ∩ y| / |x ∪ y|

jaccard(S10, R20) = 2/3
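The value above can be checked with a few lines of Python. This is only a sketch: the `tokenize` helper and the convention for two empty sets are assumptions made for illustration, not something the slides specify.

```python
def tokenize(name):
    """Turn a string into a set of whitespace-separated word tokens."""
    return set(name.split())


def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention for two empty sets
    return len(x & y) / len(x | y)


s10 = tokenize("John W Smith")   # {John, Smith, W}
r20 = tokenize("Smith John")     # {John, Smith}
similarity = jaccard(s10, r20)   # 2 shared tokens out of 3 distinct tokens
print(similarity)
```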

Set-Similarity Join
Set-similarity join “Sales” and “Returns” on “Name”

Problem Statement: Set-Similarity Join

Input
  Two files of records, e.g., R(RID, a, b) and S(RID, c, d)
  A join column on each file, e.g., R.a and S.c
  A similarity function, sim, e.g., Jaccard
  A similarity threshold, τ

Output
  All pairs of records from R and S where sim(R.a, S.c) ≥ τ
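As a reference point, the problem statement can be written as a brute-force join that compares every pair. This is a minimal single-machine sketch, not one of the algorithms in the talk; the record `S11` and the threshold 0.6 are made up for illustration.

```python
def jaccard(x, y):
    """Jaccard similarity of two non-empty sets."""
    return len(x & y) / len(x | y)


def naive_similarity_join(r_records, s_records, tau):
    """Return (r_rid, s_rid, sim) for all pairs with sim >= tau.

    Each input is a list of (RID, join value) tuples; join values are
    tokenized into word sets. Cost is |R| * |S| comparisons.
    """
    result = []
    for r_rid, r_value in r_records:
        r_set = set(r_value.split())
        for s_rid, s_value in s_records:
            sim = jaccard(r_set, set(s_value.split()))
            if sim >= tau:
                result.append((r_rid, s_rid, sim))
    return result


returns = [("R20", "Smith John")]
sales = [("S10", "John W Smith"), ("S11", "Mary Jones")]  # S11 is hypothetical
result = naive_similarity_join(returns, sales, tau=0.6)
print(result)
```

The quadratic number of comparisons is what the filtering and partitioning ideas later in the deck try to avoid.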

Parallelizing Set-Similarity Joins

Large amounts of data
  E.g., GeneBank: 100M, Google N-gram: 1T
  Data or processing does not fit in one machine
  Use a cluster of machines and a parallel algorithm
  MapReduce: shared-nothing data-processing platform

Challenges
  Partition problem for parallelism
    Solve using Map, Sort, and Reduce
  Compute end-to-end set-similarity joins
  Deal with out-of-memory situations

MapReduce Review

map (k1, v1) → list(k2, v2);
reduce (k2, list(v2)) → list(k3, v3).

combine (k2, list(v2)) → list(k2, v2).
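The three signatures above can be imitated in a single Python process. The sketch below is not Hadoop: `run_mapreduce` is a hypothetical helper, the combiner is applied to the output of each individual map call (a simplification of per-task combining), and the token-counting job is only an example workload.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn, combine_fn=None):
    """Imitate one MapReduce phase over an iterable of (k1, v1) pairs."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        local = defaultdict(list)          # output of a single map call
        for k2, v2 in map_fn(k1, v1):
            local[k2].append(v2)
        for k2, values in local.items():
            if combine_fn is None:
                groups[k2].extend(values)
            else:                          # pre-aggregate before the "shuffle"
                for ck, cv in combine_fn(k2, values):
                    groups[ck].append(cv)
    output = []
    for k2 in sorted(groups):              # keys reach the reducers in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def map_tokens(rid, name):
    """map (RID, name) -> list of (token, 1)."""
    for token in name.split():
        yield token, 1


def sum_counts(token, counts):
    """Usable both as combine and as reduce: (token, counts) -> (token, total)."""
    yield token, sum(counts)


records = [("S10", "John W Smith"), ("R20", "Smith John")]
counts = run_mapreduce(records, map_tokens, sum_counts, combine_fn=sum_counts)
print(counts)
```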

Parallel Set-Similarity Joins in MapReduce

Main idea
  Hash-partition data across the network based on keys
  Join values cannot be directly used as keys
  Generate signatures from join values and use them as keys
    e.g., “John W Smith” → {John, Smith, W}

Three Stages
  1 Compute set-element statistics
      Records → Statistics
  2 Generate similar record-ID (“RID”) pairs
      {Records, Statistics} → RID-pairs
  3 Generate pairs of joined records
      {Records, RID-pairs} → Record-pairs
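The main idea can be illustrated by routing each record once per signature. In this sketch every word token is used as a signature and CRC32 stands in for the partitioner; both are assumptions for illustration, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Deterministic hash partitioner (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route_by_signature(records, num_partitions):
    """Send (token, RID) to the partition chosen by hashing the token."""
    partitions = defaultdict(list)
    for rid, name in records:
        for token in sorted(set(name.split())):
            partitions[partition_of(token, num_partitions)].append((token, rid))
    return partitions


records = [("S10", "John W Smith"), ("R20", "Smith John")]
partitions = route_by_signature(records, num_partitions=4)

# Records that share a token always meet in that token's partition.
john = partitions[partition_of("John", 4)]
print([rid for token, rid in john if token == "John"])
```

Hashing the whole name would send “John W Smith” and “Smith John” to unrelated partitions; hashing tokens brings them together, at the cost of replicating each record once per signature.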

Set-Similarity Filtering

E.g., Prefix Filtering
  Pigeonhole principle
  Global order for set elements

E.g., sim is overlap size, τ = 4
  Prefix length is 2
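The example above is consistent with the usual prefix-filter rule for an overlap threshold: a set of size n keeps its first n − τ + 1 elements under the global order, so sets of size 5 with τ = 4 have prefixes of length 2. The sketch below assumes that rule; the sets `x`, `y`, `z` and the alphabetical global order are invented for illustration.

```python
def prefix(tokens, tau, rank):
    """First len(tokens) - tau + 1 elements of a set under the global order."""
    ordered = sorted(set(tokens), key=rank.__getitem__)
    return ordered[: max(len(ordered) - tau + 1, 0)]


def may_match(x, y, tau, rank):
    """Pigeonhole test: sets with overlap >= tau must share a prefix element."""
    return bool(set(prefix(x, tau, rank)) & set(prefix(y, tau, rank)))


rank = {token: i for i, token in enumerate("ABCDEFGH")}  # hypothetical global order
tau = 4

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}   # overlap with x is 4, so the pair qualifies
z = {"D", "E", "F", "G", "H"}   # overlap with x is 2, so the pair does not

print(prefix(x, tau, rank))          # prefix of length 5 - 4 + 1 = 2
print(may_match(x, y, tau, rank))    # prefixes share "B": kept as a candidate
print(may_match(x, z, tau, rank))    # prefixes are disjoint: pruned
```

The filter only prunes. A pair that passes still has to be verified against the actual similarity function.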

Processing Stages and Alternatives

Stage 1: Token Ordering
  Compute the token frequencies and sort
    Two MapReduce phases: sort in MapReduce (BTO)
    One MapReduce phase: sort in memory (OPTO)
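A single-process sketch of what Stage 1 produces is shown below. It counts and sorts entirely in memory, so it only mirrors the output of the stage, not how BTO or OPTO distribute the work. Sorting by increasing frequency with alphabetical tie-breaking is an assumption; the slide says only that the frequencies are sorted.

```python
from collections import Counter


def token_order(records):
    """Return all tokens sorted by increasing frequency, ties alphabetically."""
    frequency = Counter()
    for _rid, value in records:
        frequency.update(set(value.split()))   # count each token once per record
    return sorted(frequency, key=lambda token: (frequency[token], token))


records = [("S10", "John W Smith"), ("R20", "Smith John")]
order = token_order(records)
print(order)
```

With rare tokens first, the prefix of each record consists of its least frequent tokens, which keeps the groups that share a prefix token small.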

Stage 2: Kernel (RID-Pair Generation)
  Use prefix-filter to divide, conquer using:
    Nested loops (BK)
    Single-machine set-similarity join algorithm (PK)
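The divide-and-conquer idea with nested loops can be sketched for a self-join as follows. This is an illustration in the spirit of BK, not the authors' implementation: it runs in one process, uses the standard Jaccard prefix length n − ⌈τ·n⌉ + 1 as an assumption, and takes a hard-coded token order. Record `X99` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(x, y):
    return len(x & y) / len(x | y)


def rid_pairs(records, order, tau):
    """Self-join: return {(rid1, rid2): sim} for pairs with Jaccard >= tau."""
    rank = {token: i for i, token in enumerate(order)}
    groups = defaultdict(list)
    # "Map": route each record to one group per prefix token.
    for rid, value in records:
        tokens = sorted(set(value.split()), key=rank.__getitem__)
        prefix_length = len(tokens) - math.ceil(tau * len(tokens)) + 1
        for token in tokens[:prefix_length]:
            groups[token].append((rid, frozenset(tokens)))
    # "Reduce": nested loops inside each group, then verify the similarity.
    pairs = {}
    for group in groups.values():
        for (rid1, set1), (rid2, set2) in combinations(group, 2):
            sim = jaccard(set1, set2)
            if sim >= tau:
                # The same pair may show up in several groups; keep it once.
                pairs[tuple(sorted((rid1, rid2)))] = sim
    return pairs


records = [("S10", "John W Smith"), ("R20", "Smith John"), ("X99", "Mary Jones")]
order = ["Jones", "Mary", "W", "John", "Smith"]   # assumed Stage 1 output
pairs = rid_pairs(records, order, tau=0.6)
print(pairs)
```

Here `S10` is replicated to the groups for `W` and `John`, `R20` only to `John`, and `X99` only to `Jones`, so a single comparison is made.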

Stage 3: Record Join
  Generate pairs of similar records
    Two MapReduce phases: reduce-side join (BRJ)
    One MapReduce phase: map-side join (OPRJ)
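A two-step, reduce-side style of record join can be sketched in one process as below. The grouping keys (first by RID, then by RID pair) are an assumption about how such a join is typically organized, and `record_join` is a hypothetical name. The sketch also assumes every RID in a pair appears in the record list.

```python
from collections import defaultdict


def record_join(records, rid_pairs):
    """Turn {(rid1, rid2): sim} into pairs of full records."""
    # Step 1, key = RID: bring each record together with the pairs naming it.
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, value in records:
        by_rid[rid]["record"] = (rid, value)
    for (rid1, rid2), sim in rid_pairs.items():
        by_rid[rid1]["pairs"].append((rid1, rid2, sim))
        by_rid[rid2]["pairs"].append((rid1, rid2, sim))
    # Emit one "half" per (pair, record).
    halves = defaultdict(list)
    for entry in by_rid.values():
        for pair in entry["pairs"]:
            halves[pair].append(entry["record"])
    # Step 2, key = RID pair: combine the two halves into one output row.
    joined = []
    for (rid1, rid2, sim), found in sorted(halves.items()):
        lookup = dict(found)
        joined.append(((rid1, lookup[rid1]), (rid2, lookup[rid2]), sim))
    return joined


records = [("S10", "John W Smith"), ("R20", "Smith John"), ("X99", "Mary Jones")]
joined = record_join(records, {("R20", "S10"): 2 / 3})
print(joined)
```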

Stage 2: RID-Pair Generation

Additional Contributions

Solve R-S-join case
Optimize memory requirements for R-S join
Handle insufficient memory
  Introduce additional filters
  Use external memory

Experimental Setting

Hardware
  10-node IBM x3650 cluster
    Intel Xeon processor E5520 2.26GHz with four cores
    Four 300GB hard disks
    12GB RAM

Software
  Ubuntu 9.06, 64-bit, server edition OS
  Java 1.6, 64-bit, server
  Hadoop 0.20.1

Experimental Setting

Datasets
  DBLP
    Average length: 259 bytes
    Number of records: 1.2M
    Total size: 300MB
  CITESEERX
    Average length: 1374 bytes
    Number of records: 1.3M
    Total size: 1.8GB

Increased each up to ×25, preserving join properties
  DBLP: 31M records, 8.2GB
  CITESEERX: 32M records, 45GB

Running Time

Self-join DBLP×n, n ∈ [5, 25]
10-node cluster
Best time
Bulk of the time

Speedup

[Figure: relative running time vs. # nodes (2 to 10); curves: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]

Relative running time
Self-join DBLP×10
Different cluster sizes

Speedup Breakdown

[Figure: three panels, relative running time vs. # nodes (2 to 10)]
  Stage 1: BTO, OPTO, Ideal
  Stage 2: BK, PK, Ideal
  Stage 3: BRJ, OPRJ, Ideal

Relative running time
Self-join DBLP×10
Different cluster sizes

Scaleup

[Figure: running time vs. # nodes and dataset size (2 to 10 nodes); curves: BTO-BK-BRJ, BTO-PK-BRJ, BTO-PK-OPRJ, Ideal]

Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster

Scaleup Breakdown

[Figure: three panels, running time vs. # nodes and dataset size (2 to 10 nodes)]
  Stage 1: BTO, OPTO, Ideal
  Stage 2: BK, PK, Ideal
  Stage 3: BRJ, OPRJ, Ideal

Running time
Self-joining DBLP×n, n ∈ [5, 25]
Proportional cluster

Summary

Set-similarity joins in MapReduce
  Three-stage approach
  Balance workload and minimize replication

End-to-end algorithms
  Self-join
  R-S join

Memory issues
  Optimize memory requirements
  Handle insufficient memory

Experiments
  40 cores, 40 disks cluster
  Speedup and scaleup

More Info

A longer version of the paper, the source code, and the datasets:
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

This work is part of the ASTERIX (http://asterix.ics.uci.edu/) and Flamingo (http://flamingo.ics.uci.edu/) projects at UC Irvine.

Stage 1: Basic Token Ordering (BTO)

Stage 1: One Phase Token Ordering (OPTO)

Stage 2: Basic Kernel (BK)

Stage 2: Indexed Kernel (PK)

Stage 3: Basic Record Join (BRJ)

Stage 3: One Phase Record Join (OPRJ)
