generic entity resolution: identifying real-world entities ...generic entity resolution: identifying...
TRANSCRIPT
-
Generic Entity Resolution:Identifying Real-World Entities in
Large Data Sets
Hector Garcia-MolinaStanford University
Work with: Omar Benjelloun, Qi Su,Jennifer Widom, Tyson Condie,
Nicolas Pombourcq, David Menestrina, Steven Whang
-
2
Entity Resolution
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
-
3
Applications
• comparison shopping• mailing lists• classified ads• customer files• counter-terrorism
N: a A: b CC#: c Ph: e
e1
N: a Exp: d Ph: e
e2
-
4
Outline
• Why is ER challenging?• How is ER done?• Some ER work at Stanford• Confidences
-
5
Challenges (1)
• No keys!• Value matching– “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...
• Record matching
Nm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777
Nm: ThomasAd: 132 Main StPh: (650) 555-1212
-
6
Challenges (2)
• Merging recordsNm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777
Nm: ThomasAd: 132 Main StPh: (650) 555-1212Zp: 94305
Nm: TomNm: ThomasAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777Zp: 94305
-
7
Challenges (3)
• ChainingNm: TomWk: IBMOc: laywerSal: 500K
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM
Nm: ThomasAd: 123 MaimOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K
-
8
Challenges (4)
• Un-mergingNm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K
too young to make500K at IBM!!
-
9
Challenges (5)
• Confidences in dataNm: Tom (0.9)Ad: 123 Main St (1.0)Ph: (650) 555-1212 (0.6)Ph: (650) 777-7777 (0.8)
(0.8)
• In value matching, match rules, merge:
conf = ?
-
10
Taxonomy
• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences• Relationships• Exact vs. approximate• Generic vs application specific• Confidences
-
11
Schema Differences
Name: TomAddress: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777
FirstName: TomStreetName: Main StStreetNumber: 123Tel: (650) 777-7777
-
12
Pair-Wise Snaps vs. Clustering
r1
r2
r3
r4
r5
r6
s9
s8
s7
s10
r1
r2
r3
r4r5
r6
r9
r8
r7
r10
-
13
De-Duplication vs. Fidelity Enhancement
R SB S
N
-
14
Relationships
r1 r2
r5
r7brother
father
businessbusiness
-
15
Using Relationships
p1
p2
p5
p7
a1
a2
a3
a5
a4
authors papers
same??
-
16
Exact vs Approximate ER
products
cameras resolvedcameras
CDs
books
...
resolvedCDs
resolvedbooks
...
ER
ER
ER
-
17
Exact vs Approximate ER
terrorists terroristssortby age
Widom 30match against
ages 25-35
-
18
Generic vs Application Specific
• Match function M(r, s)• Merge function => t
-
19
Taxonomy
• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences• Relationships• Exact vs. approximate• Generic vs application specific• Confidences
-
20
Outline
• Why is ER challenging?• How is ER done?• Some ER work at Stanford• Confidences
-
21
Taxonomy
• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences No• Relationships No• Exact vs. approximate• Generic vs application specific• Confidences ... later on
-
22
Model
Nm: TomWk: IBMOc: laywerSal: 500K
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM
Nm: ThomasAd: 123 MaimOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer
Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K
r1 r3r2
r4:
M(r1, r2) M(r4, r3)
-
23
Correct Answer
r1
r2
r3
r4
r5
r6
s9
s8
s7
s10
ER(R) = All derivable records.....
Minus “dominated” records
-
24
Question
• What is best sequence of match, merge calls that give us right answer?
-
25
Brute Force Algorithm
• Input R:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]
-
26
Brute Force Algorithm
• Input R:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]
• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]
-
27
Brute Force Algorithm• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]
• Repeat:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]– r123 =
[a:1, b:2, c:4, e:5, f:6]
-
28
Question # 1
Brute Force Algorithm
• Input R:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]
• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]
Can we deleter1, r2?
-
29
Question # 2
Brute Force Algorithm
• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]
• Repeat:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]– r123 =
[a:1, b:2, c:4, e:5, f:6]
Can we avoidcomparisons?
-
30
ICAR Properties
• Idempotence:– M(r1, r1) = true; = r1
• Commutativity:– M(r1, r2) = M(r2, r1)– =
• Associativity– =
-
31
More Properties
• Representativity– If = r3, then
for any r4 such that M(r1, r4) is truewe also have M(r3, r4) = true.
r1
r2
r3
r4
-
32
ICAR Properties Efficiency
• Commutativity• Idempotence• Associativity• Representativity
• Can discard records• ER result independent
of processing order
-
33
Swoosh Algorithms
• Record Swoosh• Merges records as soon as they match• Optimal in terms of record comparisons
• Feature Swoosh• Remembers values seen for each feature• Avoids redundant value comparisons
-
34
Swoosh Performance
-
35
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
-
36
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}
-
37
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
-
38
Swoosh Without ICAR Properties
-
39
Distributed Swoosh
P1 P2 P3
r1r2r3r4r5r6...
-
40
Distributed Swoosh
P1 P2 P3
r1
r3r4
r6...
r1r2
r4r5
...
r2r3
r5r6...
-
41
DSwoosh Performance
-
42
Outline
• Why is ER challenging?• How is ER done?• Some ER work at Stanford• Confidences
-
43
Conclusion
• ER is old and important problem• Our approach: generic• Confidences– challenging– two ways to tame:• thresholds• packages
-
44
Thanks.
-
45
Generic Confidence Model
• r1 = 0.7 [ a:v1, b:v2, c: v3]
match.7[a, b, c]
.9[a, c, d]yes (or no)
merge.7[a, b, c]
.9[a, c, d].65[a,b,c,d,x]
-
46
Problem: Properties May Not Hold
• r1 = 0.9 [a, b, c]• r2 = 0.8 [a, d]• say confidences multiplied on merge• = 0.72[a, b, c, d]• < , r1> = 0.648[a, b, c, d]• < , r2> = = 0.72[a, b, c, d]
-
47
ER with Confidences
• Very Expensive:– must compute “all derivations”– cannot delete records after they merge
• What can we do??– thresholds– packages
-
48
Important Property
• If conf(Rx) < threshold• Then for any Ry derived from Rx
conf(Ry) < threshold
r1
r2
r3 r4
C= 0.7 C
-
49
Thresholds - Example
0.9 [ a: v1, b:v2 ]0.8 [ a:v1, c: v3 ]0.6 [ b:v2, c:v3, d:v4]0.75 [ a:v1, b:v2, c:v3]0.5 [a:v1, b:v2, c:v3, d: v4]...
T=0.7
-
50
Thresholds - Example
0.9 [ a: v1, b:v2 ]0.8 [ a:v1, c: v3 ]0.6 [ b:v2, c:v3, d:v4]0.75 [ a:v1, b:v2, c:v3]0.5 [a:v1, b:v2, c:v3, d: v4]...
T=0.7
-
51
Goal: C-Swoosh
baserecords
allpossiblemerges
eliminatedominated
eliminatebelow
threshold
-
52
Goal: C-Swoosh
baserecords
allpossiblemerges
eliminatedominated
eliminatebelow
threshold
earlier
-
53
Does Threshold Property Hold?
• NO: records are evidence
merge.7[a, b, c]
.8[a, b, c].9[a, b, c]
-
54
Does Threshold Property Hold?
• YES: records are beliefs
merge.7[a, b, c]
.8[a, b, c].8[a, b, c]
-
55
Simple Confidence Model
• 0.7 [a, b]
[a, b] [a, b] [a, b, c] [a, b] [a, b]
[a, b] [a, b, d] ??? ??? ???
Alternate Worlds:
-
56
Rules
• 0.7[a, b, c], 0.7[a, b, c]⇒ 0.7 [a, b, c]• 0.7 [a, b], 0.5 [a, b]⇒ 0.7 [a, b]• 0.7 [a, b, c], 0.5 [a, b]⇒ 0.7 [a, b, c]• 0.7 [a, b, c], 0.9[a, b]⇒ 0.7 [a, b, c], 0.9[a, b]• etc
-
57
Matches
0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]
Match with confidence 0.5
worlds
[a,b,c][a,b,d]
[a,b,c,d]
1 2 3 4 5 6 7 8 9 10
-
58
Matches
0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]
worlds
[a,b,c][a,b,d]
[a,b,c,d]
1 2 3 4 5 6 7 8 9 10
0.4[a,b,c,d]0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]
-
59
Summary
• Belief model well suited for ER• Evidence model is very complex and
expensive!
-
60
Packages
• Match does not use confidences– merge does compute confidences
• 4 properties hold for deterministic attributes– e.g., =
ignoring confidences
-
61
Partition Records
– r1 = .9 [a:1, b:2]– r2 = .8 [a:1, c: 4, e:5]– r3 = .7 [b:2, c:4, f:6]– r4 = .8 [a:7, e:5, f:6]– r5 = .9 [a:7, b:2]
r1
r2
r3
r12r123
r45r5
r4
-
62
Expand Packages
– r1 = .9 [a:1, b:2]– r2 = .8 [a:1, c: 4, e:5]– r3 = .7 [b:2, c:4, f:6]– r4 = .8 [a:7, e:5, f:6]– r5 = .9 [a:7, b:2]
r1
r2
r3
r12r123
r1, r2, r3
...
-
63
Conclusion
• ER is old and important problem• Our approach: generic• Confidences– challenging– two ways to tame:• thresholds• packages
-
Thanks.
-
65
Extra Slides
-
66
Taxonomy
• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences No• Relationships No• Exact vs. approximate• Generic vs application specific• Confidences ... later on
-
67
One Confidence Model
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]
[id1, a, b, c, d]
[id3, a, b, c][id3, a, b, d][id3, a, b, f, g][id3, a, b, f, g]
[id3, a, b, f, g]
[id2, a, b, c][id2, a, c, e][id2, a, c, e][id2, a, c, e]
[id2, a, c, e]
shorthand
-
68
Records Are Evidence
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]
[id1, a, b, c, d]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]
not 0.25
-
69
New Evidence
[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]+ [id1, a, b, c, d]
[id1, a, b, c, d]
[id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]+ [id1, a, b, c, d]
-
70
No Ids
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
-
71
No Ids
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]
[a, b, (1/2)c, (1/2)d][a, x][c, d, y]
0.9
0.1
-
72
Queries?
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]
[a, b, (1/2)c, (1/2)d][a, x][c, d, y]
0.9
0.1
Threshold = 0.5; Support = 2Maximal RecordExample: [a, b, c, d]
-
73
Queries?
[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]
[a, x][c, d, y]
[a, b, c][a, b, d][a, x][c, d, y]
0.7
0.3
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]
[a, b, (1/2)c, (1/2)d][a, x][c, d, y]
0.9
0.1
Threshold = 0.5; Support = 2Maximal RecordExample: [a, b, c, d]
-
74
Need Simpler Model?
-
75
Bonus Material
• Entity Resolution, Confidences,and their relationship to Information Privacy
-
76
Privacy
Nm: AliceAd: 32 FoxPh: 5551212
1.0
Nm: AliceAd: 32 FoxPh: 5551212
1.0Nm: AliceAd: 32 Fox
1.0Nm: AliceAd: 32 FoxPh: 5551212
0.7Nm: AliceAd: 32 FoxPh: 5551212Ad: 14 Cat
1.0
Bob
Alice
-
77
Leakage
Nm: AliceAd: 32 FoxPh: 5551212
1.0
Nm: AliceAd: 32 FoxPh: 5550000
0.7
Bob
Alice
L = 0.6 (between 0 and 1)
-
78
Multi-Record Leakage
Nm: AliceAd: 32 FoxPh: 5551212
1.0
Bob
Alice
LL = 0.9 (between 0 and 1,e.g., max L)
r1, L = 0.9r2, L = 0.8r3, L = 0.7
-
79
Q1: Added Vulnerability?
Bob
Alice
ΔLL = ??
r1 r2 r3 r4p
r4 may cause Bob’s records tosnap together!
-
80
Q2: Disinformation?
Bob
Alice
ΔLL = ??
r1 r2 r3 r4 (lies)p
What is mostcost effectivedisinformation?
-
81
Q3: Verification?
Bob
Alicep
What is best factto verify to increaseconfidence in hypothesis?
r1, 0.9r2, 0.8r3, 0.7...
hypothesis h (0.6)
-
82
Summary
• Entity resolution is critical• Efficient resolution important• Confidences are important, but how?• ER is key aspect of info privacy
– check www-db.stanford.edu forSwoosh paper & forthcoming paper
-
83
Thanks.
-
84
Extra Slides
-
85
Challenges
• Exponential growth in complexity
0.9 [ a: v1, b:v2 ]0.8 [ a:v1, c: v3 ]0.6 [ b:v2, c:v3, d:v4]0.75 [ a:v1, b:v2, c:v3]0.5 [a:v1, b:v2, c:v3, d: v4]...
-
86
Three Ideas to Tame Complexity
• Thresholds• Domination• Packages
-
87
Thresholds
0.9 [ a: v1, b:v2 ]0.8 [ a:v1, c: v3 ]0.6 [ b:v2, c:v3, d:v4]0.75 [ a:v1, b:v2, c:v3]0.5 [a:v1, b:v2, c:v3, d: v4]...
T=0.7
-
88
Domination
0.9 [ a: v1, b:v2, c:v3 ]0.8 [ a:v1, b: v2, c: v3 ]0.8 [ b:v2, c:v3 ]...
-
89
Domination
0.9 [ a: v1, b:v2, c:v3 ]0.8 [ a:v1, b: v2, c: v3 ]0.8 [ b:v2, c:v3 ]...
-
90
Summary
• Our approach: pairwise, generic, Swoosh• Confidences• Making Tractable:– threshold– domination– packages
-
91
Thanks You
-
92
What Swoosh Does NOT Do
• Hash table with every pair seen:– records ri, rj– compared values vi, vj
• Swoosh achieves the same effectwith our N2 space
-
93
Swoosh Performance (I)
-
94
Swoosh Performance (II)
-
95
Swoosh Performance (III)
Generic Entity Resolution:�Identifying Real-World Entities in Large Data SetsEntity ResolutionApplicationsOutlineChallenges (1)Challenges (2)Challenges (3)Challenges (4)Challenges (5)TaxonomySchema DifferencesPair-Wise Snaps vs. ClusteringDe-Duplication vs. �Fidelity EnhancementRelationshipsUsing RelationshipsExact vs Approximate ERExact vs Approximate ERGeneric vs Application SpecificTaxonomyOutlineTaxonomyModelCorrect AnswerQuestionBrute Force AlgorithmBrute Force AlgorithmBrute Force AlgorithmQuestion # 1Question # 2ICAR PropertiesMore PropertiesICAR Properties EfficiencySwoosh AlgorithmsSwoosh PerformanceIf ICAR Properties Do Not Hold?If ICAR Properties Do Not Hold?If ICAR Properties Do Not Hold?Swoosh Without ICAR PropertiesDistributed SwooshDistributed SwooshDSwoosh PerformanceOutlineConclusionThanks.Generic Confidence ModelProblem: Properties May Not HoldER with ConfidencesImportant PropertyThresholds - ExampleThresholds - ExampleGoal: C-SwooshGoal: C-SwooshDoes Threshold Property Hold?Does Threshold Property Hold?Simple Confidence ModelRulesMatchesMatchesSummaryPackagesPartition RecordsExpand PackagesConclusionThanks.Extra SlidesTaxonomyOne Confidence ModelRecords Are EvidenceNew EvidenceNo IdsNo IdsQueries?Queries?Need Simpler Model?Bonus MaterialPrivacyLeakageMulti-Record LeakageQ1: Added Vulnerability?Q2: Disinformation?Q3: Verification?SummaryThanks.Extra SlidesChallengesThree Ideas to Tame ComplexityThresholdsDominationDominationSummaryThanks YouWhat Swoosh Does NOT DoSwoosh Performance (I)Swoosh Performance (II)Swoosh Performance (III)