efficient parallel set-similarity joins using mapreduce - poster

1
Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica Michael J. Carey Chen Li Department of Computer Science, University of California, Irvine http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ Problem Statement Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales ID Name ... S10 John W Smith ... . . . Returns ID Name ... R20 Smith John ... . . . John W Smith Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Parallelizing Set-Similarity Joins Large amounts of data I E.g., GeneBank: 100M, Google N-gram: 1T I Data or processing does not fit in one machine I Use a cluster of machines and a parallel algorithm I MapReduce: shared-nothing data-processing platform Challenges I Partition problem for parallelism I Solve the problem using Map, Sort, and Reduce I Compute end-to-end set-similarity joins I Deal with out-of-memory situations MapReduce Review map (k1,v1) list(k2,v2); reduce (k2,list(v2)) list(k3,v3). combine (k2,list(v2)) list(k2,v2). Prefix Filtering for Data Partitioning I Pigeonhole principle I Global order for set elements: I E.g., sim is overlap size, τ = 4 I Prefix length is 2 Processing Stages and Alternatives Stage 1: Token Ordering I Compute the token frequencies and sort I Two MapReduce phases: sort in MapReduce (BTO) I One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) I Use prefix-filter to divide, conquer using: I Nested loops (BK) I Single-machine set-similarity join algorithm (PK) Stage 3: Record Join I Generate pairs of similar records I Two MapReduce phases: reduce-side join (BRJ) I One MapReduce phase: map-side join (OPRJ) Stage 2: RID-Pair Generation Experimental Setting Hardware I 10-node IBM x3650 cluster I Intel Xeon processor E5520 2.26GHz with four cores I Four 300GB hard disks I 12GB RAM Datasets I DBLP: average length: 259 bytes; 1.2M records; 300MB I CITESEERX: average length: 1374 bytes; 1.3M records; 1.8GB I Increased each up to ×25, preserving join properties Experimental Results 2 3 4 5 6 7 8 9 10 # Nodes 1 2 3 4 5 Speedup BTO-BK-BRJ BTO-PK-BRJ BTO-PK-OPRJ Ideal 2 3 4 5 6 7 8 9 10 # Nodes and Dataset Size 0 200 400 600 800 1000 Time (seconds) BTO-BK-BRJ BTO-PK-BRJ BTO-PK-OPRJ Ideal

Upload: rvernica

Post on 18-Jul-2015

485 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Efficient Parallel Set-Similarity Joins Using MapReduce - Poster

Efficient Parallel Set-Similarity Joins Using MapReduceRares Vernica Michael J. Carey Chen Li

Department of Computer Science, University of California, Irvinehttp://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Problem Statement

Example: Data Cleaning/Master-Data-Management

Customer data from two departmentsSales

ID Name . . .S10 John W Smith . . .

...

ReturnsID Name . . .

R20 Smith John . . .... John W Smith

Master customer data across two departmentsCustomers

ID Name . . .C30 John W Smith . . .

...

Parallelizing Set-Similarity Joins

Large amounts of dataI E.g., GeneBank: 100M, Google N-gram: 1TI Data or processing does not fit in one machineI Use a cluster of machines and a parallel algorithmI MapReduce: shared-nothing data-processing platformChallengesI Partition problem for parallelismI Solve the problem using Map, Sort, and ReduceI Compute end-to-end set-similarity joinsI Deal with out-of-memory situations

MapReduce Review

map (k1,v1) → list(k2,v2);reduce (k2,list(v2)) → list(k3,v3).

combine (k2,list(v2)) → list(k2,v2).

Prefix Filtering for Data Partitioning

I Pigeonhole principleI Global order for set elements:

I E.g., sim is overlap size, τ = 4I Prefix length is 2

Processing Stages and Alternatives

Stage 1: Token OrderingI Compute the token frequencies and sort

I Two MapReduce phases: sort in MapReduce (BTO)I One MapReduce phase: sort in memory (OPTO)

Stage 2: Kernel (RID-Pair Generation)I Use prefix-filter to divide, conquer using:

I Nested loops (BK)I Single-machine set-similarity join algorithm (PK)

Stage 3: Record JoinI Generate pairs of similar records

I Two MapReduce phases: reduce-side join (BRJ)I One MapReduce phase: map-side join (OPRJ)

Stage 2: RID-Pair Generation

Experimental Setting

HardwareI 10-node IBM x3650 cluster

I Intel Xeon processor E5520 2.26GHz with four coresI Four 300GB hard disksI 12GB RAM

DatasetsI DBLP: average length: 259 bytes; 1.2M records; 300MBI CITESEERX: average length: 1374 bytes; 1.3M records; 1.8GBI Increased each up to ×25, preserving join properties

Experimental Results

2 3 4 5 6 7 8 9 10

# Nodes

1

2

3

4

5

Spe

edup

BTO-BK-BRJBTO-PK-BRJBTO-PK-OPRJIdeal

2 3 4 5 6 7 8 9 10

# Nodes and Dataset Size

0

200

400

600

800

1000

Tim

e (s

econ

ds)

BTO-BK-BRJBTO-PK-BRJBTO-PK-OPRJIdeal