PARALLEL SORTED NEIGHBORHOOD BLOCKING WITH MAPREDUCE
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig
http://dbs.uni-leipzig.de
Kaiserslautern, BTW 2011
ENTITY RESOLUTION
• Detection of entities in one or more sources that refer to the same real-world object
ENTITY RESOLUTION (2)
• Runtime-intensive task: O(n²) entity comparisons
• Blocking
  • Semantic grouping of similar entities into blocks
  • Based on blocking keys derived from the entities' attributes
  • Restricts entity comparisons to entities from the same block
• Parallelization
  • MapReduce
  • Exploitation of cloud infrastructures
SORTED NEIGHBORHOOD - RUNNING EXAMPLE (w=3)
• Determine the blocking key for each entity and sort the entities by blocking key
• Move a window of fixed size w over the sorted records and compare all entities within the window
• All entities within a distance of w-1 are compared
• Complexity: O(n²) → O(n) + O(n·log n) + O(n·w)
[Figure: running example for source S = {a, b, c, d, e, f, g, h, i}]
Key generation + sort by key, as (blocking key K, entity): (1, a), (1, d), (2, b), (2, e), (2, f), (2, h), (3, c), (3, g), (3, i)
Sliding window (w=3), compared pairs: a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h, f-c, h-c, h-g, c-g, c-i, g-i
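The sequential procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name sorted_neighborhood and the KEYS lookup table are hypothetical, and the blocking keys are simply read off the running example.

```python
from collections import deque

def sorted_neighborhood(entities, blocking_key, w):
    """Return all entity pairs compared by a sliding window of size w."""
    ordered = sorted(entities, key=blocking_key)  # stable sort, O(n log n)
    window = deque()                              # holds at most the previous w-1 entities
    pairs = []
    for entity in ordered:
        # compare the current entity with everything still in the window
        pairs.extend((other, entity) for other in window)
        window.append(entity)
        if len(window) == w:
            window.popleft()
    return pairs

# Blocking keys of the running example (assumed from the slide).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

pairs = sorted_neighborhood("abcdefghi", KEYS.get, w=3)
print(", ".join(f"{x}-{y}" for x, y in pairs))
```

With w=3 the sketch yields the same 15 pairs as the slide, from a-d through g-i. Each entity is compared with at most w-1 predecessors, which is where the O(n·w) term comes from.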
OUTLINE
• Motivation
• Sorted Neighborhood and SN with MapReduce
  • Challenge 1: Sorted reduce partitions (SRP)
  • Challenge 2: Comparison of boundary entities (JobSN/RepSN)
• Experimental Results
• Conclusions & Future Work
MAPREDUCE
• Computation expressed by two UDFs
  • Contain sequential code
  • Executed in parallel across multiple nodes
  • map: (key_in, value_in) → list(key_tmp, value_tmp)
  • reduce: (key_tmp, list(value_tmp)) → list(key_out, value_out)
• Computation relies on data partitioning and redistribution
  • Number of map tasks m and reduce tasks r
  • Each task is executed by some idle node in the cluster
  • UDF part partitions the map output and distributes it to the r reduce tasks
  • Sorting of key-value pairs
  • Grouping of key-value pairs by key and invocation of reduce for each group
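To make the data flow concrete, the following is a small sequential simulation of one MapReduce job. It is a sketch under stated assumptions, not Hadoop's API: run_mapreduce, map_fn, reduce_fn and part are hypothetical names, reduce tasks are numbered from 0, and the demo job merely counts the entities per blocking key.

```python
from collections import defaultdict
from itertools import groupby

def run_mapreduce(splits, map_fn, reduce_fn, part, r):
    """Simulate one job sequentially: one split per map task, r reduce tasks."""
    partitions = defaultdict(list)
    for split in splits:                      # each split is handled by one map task
        for key_in, value_in in split:
            for key_tmp, value_tmp in map_fn(key_in, value_in):
                # part decides which reduce task receives the pair
                partitions[part(key_tmp, r)].append((key_tmp, value_tmp))
    output = {}
    for task in range(r):
        pairs = sorted(partitions[task], key=lambda kv: kv[0])   # sort by key
        result = []
        for key_tmp, group in groupby(pairs, key=lambda kv: kv[0]):
            # reduce is invoked once per group of equal keys
            result.extend(reduce_fn(key_tmp, [value for _, value in group]))
        output[task] = result
    return output

# Demo: block sizes per blocking key (keys assumed from the running example).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
splits = [[(None, e) for e in chunk] for chunk in ("abc", "def", "ghi")]

def map_fn(_, entity):
    return [(KEYS[entity], entity)]

def reduce_fn(key, entities):
    return [(key, len(entities))]

def part(key, r):
    return key % r

result = run_mapreduce(splits, map_fn, reduce_fn, part, r=2)
print(result)
```

With r=2 and modulo partitioning, task 0 receives blocking key 2 and task 1 receives keys 1 and 3, so each reduce task sees only part of the key range.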
ENTITY RESOLUTION WITH MAPREDUCE (m=3, r=2)
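[Figure: entity resolution data flow with m=3 map tasks and r=2 reduce tasks]
Input split: S = {a, b, c, d, e, f, g, h, i} → map1: {a, b, c}, map2: {d, e, f}, map3: {g, h, i}
Map step (blocking), output as (blocking key K, entity):
map1: (1, a), (2, b), (3, c)
map2: (1, d), (2, e), (2, f)
map3: (3, g), (2, h), (3, i)
Partitioning by “key modulo r”, then sort by key:
reduce1: (1, a), (1, d), (3, c), (3, g), (3, i)
reduce2: (2, b), (2, e), (2, f), (2, h)
Reduce step (matching), matches M:
reduce1: a-d, c-i
reduce2: b-f, e-h
Output merge: M = a-d, c-i, b-f, e-h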
• Map phase
  • Input data is partitioned into m partitions
  • Each partition is processed by one map task that calls map for each input record (“blocking”)
  • UDF part partitions the map output and distributes it to the r reduce tasks
• Reduce phase
  • Sorting of key-value pairs by key
  • Grouping of key-value pairs by key
  • Invocation of reduce for each group (“matching”)
• Challenge 1: SN requires a totally sorted list of entities
  • All entities assigned to reduce task R_i must have a smaller blocking key than all entities assigned to reduce task R_(i+1)
  • “Sorted reduce partitions” (SRP)
  • Must be ensured by part → range partitioning
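The SRP requirement can be checked mechanically. The sketch below compares the "key modulo r" partitioning with a range partitioning on the running example; the helper names partition and is_sorted_reduce_partitioning are hypothetical, tasks are numbered from 0, and the range threshold (keys up to 2 go to the first task) is an assumption matching the example.

```python
def partition(keyed_entities, part, r):
    """Assign (key, entity) pairs to r reduce tasks and sort each task by key."""
    tasks = [[] for _ in range(r)]
    for key, entity in keyed_entities:
        tasks[part(key)].append((key, entity))
    return [sorted(task, key=lambda kv: kv[0]) for task in tasks]

def is_sorted_reduce_partitioning(tasks):
    """True if every key in one task is smaller than every key in the next task."""
    non_empty = [task for task in tasks if task]
    return all(prev[-1][0] < nxt[0][0]
               for prev, nxt in zip(non_empty, non_empty[1:]))

# (blocking key, entity) pairs of the running example, in input order.
KEYED = [(1, "a"), (2, "b"), (3, "c"), (1, "d"), (2, "e"),
         (2, "f"), (3, "g"), (2, "h"), (3, "i")]

modulo = partition(KEYED, lambda k: k % 2, r=2)               # hash-style partitioning
ranged = partition(KEYED, lambda k: 0 if k <= 2 else 1, r=2)  # range partitioning

print(is_sorted_reduce_partitioning(modulo))
print(is_sorted_reduce_partitioning(ranged))
```

Modulo partitioning interleaves the key ranges across tasks, so concatenating the reduce inputs does not give a totally sorted list. Range partitioning does, which is the property the sliding window needs.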
SORTED NEIGHBORHOOD WITH MAPREDUCE – SRP
• reduce:
    // buffer is kept across reduce calls within the same reduce task
    forEach(entity ϵ list(value_tmp)) {
        match(buffer, entity);  // match all buffered entities with entity
        buffer.append(entity);
        if (buffer.size() == w) buffer.removeFirst();
    }
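[Figure: SRP data flow with m=3, r=2, w=3]
Key generation + partition prefix, map output as (partitionPrefix.blockKey, entity):
map1: (1.1, a), (1.2, b), (2.3, c)
map2: (1.1, d), (1.2, e), (1.2, f)
map3: (2.3, g), (1.2, h), (2.3, i)
Partitioning by partition prefix, then sort by composite key:
reduce1: (1.1, a), (1.1, d), (1.2, b), (1.2, e), (1.2, f), (1.2, h)
reduce2: (2.3, c), (2.3, g), (2.3, i)
Sliding window (+ matching), output B:
reduce1: a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h
reduce2: c-g, c-i, g-i
Not compared: f-c?, h-c?, h-g?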
• map outputs a composite key: partitionPrefix.blockKey
  • partitionPrefix(k) = 1 if k <= 2, otherwise 2 (range partitioning)
  • part(partitionPrefix.blockKey) = partitionPrefix
  • Key-value pairs are sorted and grouped by the composite key
• Challenge 2: Boundary entities
  • Comparison of entities that are assigned to different reduce tasks
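The effect of SRP on the running example, including the comparisons it misses, can be reproduced with a short simulation. This is only a sketch: srp and window_pairs are hypothetical helpers, composite keys are modelled as tuples (partitionPrefix, blockKey), and the job is executed sequentially.

```python
def window_pairs(ordered, w):
    """All pairs of entities that are at most w-1 positions apart."""
    return [(ordered[j], ordered[i])
            for i in range(len(ordered))
            for j in range(max(0, i - w + 1), i)]

def partition_prefix(k):
    # range partitioning from the slide: keys up to 2 go to reduce task 1
    return 1 if k <= 2 else 2

def srp(keyed, w, r):
    """Sliding window per reduce task on range-partitioned, sorted input."""
    tasks = {p: [] for p in range(1, r + 1)}
    for k, entity in keyed:
        key = (partition_prefix(k), k)        # composite key partitionPrefix.blockKey
        tasks[key[0]].append((key, entity))   # part(...) = partitionPrefix
    result = {}
    for p, pairs in tasks.items():
        ordered = [e for _, e in sorted(pairs, key=lambda kv: kv[0])]
        result[p] = window_pairs(ordered, w)
    return result

KEYED = [(1, "a"), (2, "b"), (3, "c"), (1, "d"), (2, "e"),
         (2, "f"), (3, "g"), (2, "h"), (3, "i")]

result = srp(KEYED, w=3, r=2)

# Compare against the sequential sliding window over the fully sorted list.
all_ordered = [e for _, e in sorted(KEYED, key=lambda kv: kv[0])]
found = {pair for pairs in result.values() for pair in pairs}
missing = set(window_pairs(all_ordered, 3)) - found
print(sorted(missing))
```

The missing set consists of exactly the three boundary pairs f-c, h-c and h-g, since no single reduce task sees both entities of those pairs.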
SORTED NEIGHBORHOOD WITH MAPREDUCE – JOBSN
• SN realization using two consecutive jobs
• Job1:
  • SRP + additional output of boundary entities
  • Keys of the additionally output entities are prefixed with an additional boundary component
• Job2:
  • SN for boundary entities
  • part(boundary.partitionIndex.blockKey) = boundary % r
  • Sort and group by composite key
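[Figure: JobSN data flow with r=2, w=3]
Job 1, sliding window (+ matching) + boundary prefix:
reduce1: input (1.1, a), (1.1, d), (1.2, b), (1.2, e), (1.2, f), (1.2, h); output B: a-d, ..., f-h; boundary entities (1.2, f), (1.2, h) output as (1.1.2, f), (1.1.2, h)
reduce2: input (2.3, c), (2.3, g), (2.3, i); output B: c-g, c-i, g-i; boundary entities (2.3, c), (2.3, g) output as (1.2.3, c), (1.2.3, g)
Job 2, identity map, partitioning by boundary prefix, sliding window (+ matching):
map1: (1.1.2, f), (1.1.2, h)
map2: (1.2.3, c), (1.2.3, g)
reduce1: input (1.1.2, f), (1.1.2, h), (1.2.3, c), (1.2.3, g); output B: f-c, h-c, h-g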
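The second job of JobSN can be sketched as follows. This is an illustration under assumptions, not the authors' code: boundary_entities and job2 are hypothetical names, keys are tuples (boundary, partitionIndex, blockKey), every partition is assumed to hold at least w-1 entities, and pairs from the same partition are skipped because job 1 has already compared them.

```python
def boundary_entities(partitions, w):
    """Side output of job 1: first/last w-1 entities of each sorted partition."""
    out = []
    for idx, entries in enumerate(partitions, start=1):
        if idx < len(partitions):   # last w-1 entities face boundary idx
            tail = entries[max(0, len(entries) - (w - 1)):]
            out += [((idx, idx, k), e) for k, e in tail]
        if idx > 1:                 # first w-1 entities face boundary idx-1
            out += [((idx - 1, idx, k), e) for k, e in entries[:w - 1]]
    return out

def job2(boundary_pairs, w):
    """Sliding window over the boundary entities, grouped by boundary prefix."""
    groups = {}
    for key, entity in boundary_pairs:
        groups.setdefault(key[0], []).append((key, entity))
    result = []
    for boundary in sorted(groups):
        ordered = sorted(groups[boundary], key=lambda kv: kv[0])
        for i, (key_i, e_i) in enumerate(ordered):
            for key_j, e_j in ordered[max(0, i - w + 1):i]:
                if key_j[1] != key_i[1]:   # same partition: already done in job 1
                    result.append((e_j, e_i))
    return result

# Sorted reduce partitions of job 1, as (blocking key, entity).
P1 = [(1, "a"), (1, "d"), (2, "b"), (2, "e"), (2, "f"), (2, "h")]
P2 = [(3, "c"), (3, "g"), (3, "i")]

boundary = boundary_entities([P1, P2], w=3)
print(job2(boundary, w=3))
```

For the running example the side output contains f, h, c and g, and the second job produces the three correspondences f-c, h-c and h-g that SRP alone leaves out.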
SORTED NEIGHBORHOOD WITH MAPREDUCE - REPSN
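[Figure: RepSN data flow with m=3, r=2, w=3]
Key generation + partition prefix, as (partitionPrefix.blockKey, entity):
map1: (1.1, a), (1.2, b), (2.3, c); replicated: (1.1, a), (1.2, b)
map2: (1.1, d), (1.2, e), (1.2, f); replicated: (1.2, e), (1.2, f)
map3: (2.3, g), (1.2, h), (2.3, i); replicated: (1.2, h)
+ boundary prefix, map output as (boundary.partitionPrefix.blockKey, entity):
map1: (1.1.1, a), (1.1.2, b), (2.2.3, c), (2.1.1, a), (2.1.2, b)
map2: (1.1.1, d), (1.1.2, e), (1.1.2, f), (2.1.2, e), (2.1.2, f)
map3: (2.2.3, g), (1.1.2, h), (2.2.3, i), (2.1.2, h)
Partitioning by boundary prefix, then sort by composite key:
reduce1: (1.1.1, a), (1.1.1, d), (1.1.2, b), (1.1.2, e), (1.1.2, f), (1.1.2, h)
reduce2: (2.1.1, a), (2.1.2, b), (2.1.2, e), (2.1.2, f), (2.1.2, h), (2.2.3, c), (2.2.3, g), (2.2.3, i)
Sliding window (+ matching), output B:
reduce1: a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h
reduce2: f-c, h-c, h-g, c-g, c-i, g-i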
• SN realization using data replication
  • Reduce task i>1 needs the last w-1 entities of the previous partition in front of its input
  • Potential boundary entities are replicated by the map tasks (two key-value pairs)
  • The replica of an entity that is assigned to reduce task R_i is assigned to R_(i+1)
• Implementation
  • Map key is prefixed with a boundary component (like JobSN)
  • boundary = partitionPrefix+1 for replicated entities (boundary = partitionPrefix otherwise)
  • part(boundary.partitionPrefix.blockKey) = boundary
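A sequential sketch of RepSN on the running example is given below. It rests on several assumptions that the slides do not spell out: each map task replicates the w-1 entities with the highest keys per partition of its own split, each reduce task keeps only the last w-1 replicas, and pairs are emitted only when the later entity is an original. The names rep_map, rep_reduce and repsn are hypothetical.

```python
def partition_prefix(k):
    return 1 if k <= 2 else 2   # range partitioning of the running example

def rep_map(split, w, r):
    """One map task: originals plus replicas of potential boundary entities."""
    keyed = [((partition_prefix(k), partition_prefix(k), k), e) for k, e in split]
    out = list(keyed)
    for p in range(1, r):       # only partitions that have a successor
        own = sorted((kv for kv in keyed if kv[0][1] == p), key=lambda kv: kv[0])
        tail = own[max(0, len(own) - (w - 1)):]
        # replica key: boundary = partitionPrefix + 1
        out += [((p + 1, p, k), e) for (_, _, k), e in tail]
    return out

def rep_reduce(pairs, boundary, w):
    """One reduce task: last w-1 replicas in front, then the sliding window."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    replicas = [kv for kv in ordered if kv[0][1] != boundary]
    originals = [kv for kv in ordered if kv[0][1] == boundary]
    entries = replicas[max(0, len(replicas) - (w - 1)):] + originals
    first_original = len(entries) - len(originals)
    result = []
    for i in range(first_original, len(entries)):   # skip replica-replica pairs
        for j in range(max(0, i - w + 1), i):
            result.append((entries[j][1], entries[i][1]))
    return result

def repsn(splits, w, r):
    partitions = {b: [] for b in range(1, r + 1)}
    for split in splits:
        for key, entity in rep_map(split, w, r):
            partitions[key[0]].append((key, entity))   # part(...) = boundary
    return {b: rep_reduce(p, b, w) for b, p in partitions.items()}

SPLITS = [[(1, "a"), (2, "b"), (3, "c")],
          [(1, "d"), (2, "e"), (2, "f")],
          [(3, "g"), (2, "h"), (3, "i")]]

result = repsn(SPLITS, w=3, r=2)
print(result[2])
```

Reduce task 2 receives the replicas a, b, e, f and h but uses only f and h, the last w-1 of them, which yields f-c, h-c and h-g in addition to its own pairs. Together with reduce task 1 this gives all 15 correspondences in a single job, at the cost of shipping replicas that are later discarded.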
EXPERIMENTAL RESULTS
• 1.4 million publication records, blocking by title.substring(2), w=1000
• 4 dual-core nodes, Hadoop 0.20.2
• Runtime reduction: 9h to 1.5h → relative speedup of almost 6
• Runtimes of the implementations differ only slightly
  • JobSN is faster for a small degree of parallelism
  • RepSN completes faster beginning with m=r=4
CONCLUSIONS
• Application of the MapReduce programming model for parallel execution of typical Entity Resolution workflows
• Realization of Sorted Neighborhood Blocking with MapReduce
  • Sorted reduce partitions
    • Range partitioning
  • Boundary entities
    • JobSN: generation of boundary correspondences by an additional job
    • RepSN: SN realization within a single job using data replication in the map phase
• Evaluation of the proposed approaches
• Future work
  • Load balancing mechanisms for handling skewed (blocking key) data
  • Multi-pass blocking within a single job
THANK YOU FOR YOUR ATTENTION