generic entity resolution: identifying real-world entities ...generic entity resolution: identifying...

Generic Entity Resolution:Identifying Real-World Entities in

Large Data Sets

Hector Garcia-MolinaStanford University

Work with: Omar Benjelloun, Qi Su,Jennifer Widom, Tyson Condie,

Nicolas Pombourcq, David Menestrina, Steven Whang

2

Entity Resolution

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

3

Applications

• comparison shopping• mailing lists• classified ads• customer files• counter-terrorism

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

4

Outline

• Why is ER challenging?• How is ER done?• Some ER work at Stanford• Confidences

5

Challenges (1)

• No keys!• Value matching– “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...

• Record matching

Nm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

Nm: ThomasAd: 132 Main StPh: (650) 555-1212

6

Challenges (2)

• Merging recordsNm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

Nm: ThomasAd: 132 Main StPh: (650) 555-1212Zp: 94305

Nm: TomNm: ThomasAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777Zp: 94305

7

Challenges (3)

• ChainingNm: TomWk: IBMOc: laywerSal: 500K

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM

Nm: ThomasAd: 123 MaimOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

8

Challenges (4)

• Un-mergingNm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

too young to make500K at IBM!!

9

Challenges (5)

• Confidences in dataNm: Tom (0.9)Ad: 123 Main St (1.0)Ph: (650) 555-1212 (0.6)Ph: (650) 777-7777 (0.8)

(0.8)

• In value matching, match rules, merge:

conf = ?

10

Taxonomy

• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences• Relationships• Exact vs. approximate• Generic vs application specific• Confidences

11

Schema Differences

Name: TomAddress: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

FirstName: TomStreetName: Main StStreetNumber: 123Tel: (650) 777-7777

12

Pair-Wise Snaps vs. Clustering

r1

r2

r3

r4

r5

r6

s9

s8

s7

s10

r1

r2

r3

r4r5

r6

r9

r8

r7

r10

13

De-Duplication vs. Fidelity Enhancement

R SB S

N

14

Relationships

r1 r2

r5

r7brother

father

businessbusiness

15

Using Relationships

p1

p2

p5

p7

a1

a2

a3

a5

a4

authors papers

same??

16

Exact vs Approximate ER

products

cameras resolvedcameras

CDs

books

...

resolvedCDs

resolvedbooks

...

ER

ER

ER

17

Exact vs Approximate ER

terrorists terroristssortby age

Widom 30match against

ages 25-35

18

Generic vs Application Specific

• Match function M(r, s)• Merge function => t

19

Taxonomy

• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences• Relationships• Exact vs. approximate• Generic vs application specific• Confidences

20

Outline


21

Taxonomy

• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences No• Relationships No• Exact vs. approximate• Generic vs application specific• Confidences ... later on

22

Model

Nm: TomWk: IBMOc: laywerSal: 500K

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM

Nm: ThomasAd: 123 MaimOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

r1 r3r2

r4:

M(r1, r2) M(r4, r3)

23

Correct Answer

r1

r2

r3

r4

r5

r6

s9

s8

s7

s10

ER(R) = All derivable records.....

Minus “dominated” records

24

Question

• What is best sequence of match, merge calls that give us right answer?

25

Brute Force Algorithm

• Input R:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]

26



• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]

27

Brute Force Algorithm• Match all pairs:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]

• Repeat:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]– r123 =

[a:1, b:2, c:4, e:5, f:6]

28

Question # 1




Can we deleter1, r2?

29

Question # 2



• Repeat:– r1 = [a:1, b:2]– r2 = [a:1, c: 4, e:5]– r3 = [b:2, c:4, f:6]– r4 = [a:7, e:5, f:6]– r12 = [a:1, b:2, c:4, e:5]– r123 =

[a:1, b:2, c:4, e:5, f:6]

Can we avoidcomparisons?

30

ICAR Properties

• Idempotence:– M(r1, r1) = true; = r1

• Commutativity:– M(r1, r2) = M(r2, r1)– =

• Associativity– =

31

More Properties

• Representativity– If = r3, then

for any r4 such that M(r1, r4) is truewe also have M(r3, r4) = true.

r1

r2

r3

r4

32

ICAR Properties Efficiency

• Commutativity• Idempotence• Associativity• Representativity

• Can discard records• ER result independent

of processing order

33

Swoosh Algorithms

• Record Swoosh• Merges records as soon as they match• Optimal in terms of record comparisons

• Feature Swoosh• Remembers values seen for each feature• Avoids redundant value comparisons

34

Swoosh Performance

35

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]

r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

r2: [Joe, 123 Main, Ph:123]

r3: [Joe Jr., 123 Main, DL:Y]

36





r2: [Joe, 123 Main, Ph:123]


Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}

37





r2: [Joe, 123 Main, Ph:123]


Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}

38

Swoosh Without ICAR Properties

39

Distributed Swoosh

P1 P2 P3

r1r2r3r4r5r6...

40

Distributed Swoosh

P1 P2 P3

r1

r3r4

r6...

r1r2

r4r5

...

r2r3

r5r6...

41

DSwoosh Performance

42

Outline


43

Conclusion

• ER is old and important problem• Our approach: generic• Confidences– challenging– two ways to tame:• thresholds• packages

44

Thanks.

45

Generic Confidence Model

• r1 = 0.7 [ a:v1, b:v2, c: v3]

match.7[a, b, c]

.9[a, c, d]yes (or no)

merge.7[a, b, c]

.9[a, c, d].65[a,b,c,d,x]

46

Problem: Properties May Not Hold

• r1 = 0.9 [a, b, c]• r2 = 0.8 [a, d]• say confidences multiplied on merge• = 0.72[a, b, c, d]• < , r1> = 0.648[a, b, c, d]• < , r2> = = 0.72[a, b, c, d]

47

ER with Confidences

• Very Expensive:– must compute “all derivations”– cannot delete records after they merge

• What can we do??– thresholds– packages

48

Important Property

• If conf(Rx) < threshold• Then for any Ry derived from Rx

conf(Ry) < threshold

r1

r2

r3 r4

C= 0.7 C

49

Thresholds - Example

0.9 [ a: v1, b:v2 ]0.8 [ a:v1, c: v3 ]0.6 [ b:v2, c:v3, d:v4]0.75 [ a:v1, b:v2, c:v3]0.5 [a:v1, b:v2, c:v3, d: v4]...

T=0.7

50

Thresholds - Example


T=0.7

51

Goal: C-Swoosh

baserecords

allpossiblemerges

eliminatedominated

eliminatebelow

threshold

52

Goal: C-Swoosh

baserecords

allpossiblemerges

eliminatedominated

eliminatebelow

threshold

earlier

53

Does Threshold Property Hold?

• NO: records are evidence

merge.7[a, b, c]

.8[a, b, c].9[a, b, c]

54

Does Threshold Property Hold?

• YES: records are beliefs

merge.7[a, b, c]

.8[a, b, c].8[a, b, c]

55

Simple Confidence Model

• 0.7 [a, b]

[a, b] [a, b] [a, b, c] [a, b] [a, b]

[a, b] [a, b, d] ??? ??? ???

Alternate Worlds:

56

Rules

• 0.7[a, b, c], 0.7[a, b, c]⇒ 0.7 [a, b, c]• 0.7 [a, b], 0.5 [a, b]⇒ 0.7 [a, b]• 0.7 [a, b, c], 0.5 [a, b]⇒ 0.7 [a, b, c]• 0.7 [a, b, c], 0.9[a, b]⇒ 0.7 [a, b, c], 0.9[a, b]• etc

57

Matches

0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]

Match with confidence 0.5

worlds

[a,b,c][a,b,d]

[a,b,c,d]

1 2 3 4 5 6 7 8 9 10

58

Matches

0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]

worlds

[a,b,c][a,b,d]

[a,b,c,d]

1 2 3 4 5 6 7 8 9 10

0.4[a,b,c,d]0.9[a, b, c]0.8[a, b, d][a, x][c, d, y]

59

Summary

• Belief model well suited for ER• Evidence model is very complex and

expensive!

60

Packages

• Match does not use confidences– merge does compute confidences

• 4 properties hold for deterministic attributes– e.g., =

ignoring confidences

61

Partition Records

– r1 = .9 [a:1, b:2]– r2 = .8 [a:1, c: 4, e:5]– r3 = .7 [b:2, c:4, f:6]– r4 = .8 [a:7, e:5, f:6]– r5 = .9 [a:7, b:2]

r1

r2

r3

r12r123

r45r5

r4

62

Expand Packages

– r1 = .9 [a:1, b:2]– r2 = .8 [a:1, c: 4, e:5]– r3 = .7 [b:2, c:4, f:6]– r4 = .8 [a:7, e:5, f:6]– r5 = .9 [a:7, b:2]

r1

r2

r3

r12r123

r1, r2, r3

...

63

Conclusion

• ER is old and important problem• Our approach: generic• Confidences– challenging– two ways to tame:• thresholds• packages

Thanks.

65

Extra Slides

66

Taxonomy

• Pairwise snaps vs. clustering• De-duplication vs. fidelity enhancement• Schema differences No• Relationships No• Exact vs. approximate• Generic vs application specific• Confidences ... later on

67

One Confidence Model

[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]

[id1, a, b, c, d]

[id3, a, b, c][id3, a, b, d][id3, a, b, f, g][id3, a, b, f, g]

[id3, a, b, f, g]

[id2, a, b, c][id2, a, c, e][id2, a, c, e][id2, a, c, e]

[id2, a, c, e]

shorthand

68

Records Are Evidence

[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]

[id1, a, b, c, d]

[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]

not 0.25

69

New Evidence

[id1, a, b, c, d][id1, a, b, d][id1, a, x][id1, b, y]+ [id1, a, b, c, d]

[id1, a, b, c, d]

[id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]

[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]+ [id1, a, b, c, d]

70

No Ids

[a, b, c][a, b, d][a, x][c, d, y] [a, b, (1/2)c, (1/2)d]

[a, x][c, d, y]

[a, b, c][a, b, d][a, x][c, d, y]

0.7

0.3

71

No Ids


[a, x][c, d, y]

[a, b, c][a, b, d][a, x][c, d, y]

0.7

0.3

[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

[a, b, (1/2)c, (1/2)d][a, x][c, d, y]

0.9

0.1

72

Queries?


[a, x][c, d, y]

[a, b, c][a, b, d][a, x][c, d, y]

0.7

0.3

[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

[a, b, (1/2)c, (1/2)d][a, x][c, d, y]

0.9

0.1

Threshold = 0.5; Support = 2Maximal RecordExample: [a, b, c, d]

73

Queries?


[a, x][c, d, y]

[a, b, c][a, b, d][a, x][c, d, y]

0.7

0.3

[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y][a, x]

[a, b, (1/2)c, (1/2)d][a, x][c, d, y]

0.9

0.1

Threshold = 0.5; Support = 2Maximal RecordExample: [a, b, c, d]

74

Need Simpler Model?

75

Bonus Material

• Entity Resolution, Confidences,and their relationship to Information Privacy

76

Privacy

Nm: AliceAd: 32 FoxPh: 5551212

1.0


1.0Nm: AliceAd: 32 Fox

1.0Nm: AliceAd: 32 FoxPh: 5551212

0.7Nm: AliceAd: 32 FoxPh: 5551212Ad: 14 Cat

1.0

Bob

Alice

77

Leakage


1.0


0.7

Bob

Alice

L = 0.6 (between 0 and 1)

78

Multi-Record Leakage


1.0

Bob

Alice

LL = 0.9 (between 0 and 1,e.g., max L)

r1, L = 0.9r2, L = 0.8r3, L = 0.7

79

Q1: Added Vulnerability?

Bob

Alice

ΔLL = ??

r1 r2 r3 r4p

r4 may cause Bob’s records tosnap together!

80

Q2: Disinformation?

Bob

Alice

ΔLL = ??

r1 r2 r3 r4 (lies)p

What is mostcost effectivedisinformation?

81

Q3: Verification?

Bob

Alicep

What is best factto verify to increaseconfidence in hypothesis?

r1, 0.9r2, 0.8r3, 0.7...

hypothesis h (0.6)

82

Summary

• Entity resolution is critical• Efficient resolution important• Confidences are important, but how?• ER is key aspect of info privacy

– check www-db.stanford.edu forSwoosh paper & forthcoming paper

83

Thanks.

84

Extra Slides

85

Challenges

• Exponential growth in complexity


86

Three Ideas to Tame Complexity

• Thresholds• Domination• Packages

87

Thresholds


T=0.7

88

Domination

0.9 [ a: v1, b:v2, c:v3 ]0.8 [ a:v1, b: v2, c: v3 ]0.8 [ b:v2, c:v3 ]...

89

Domination

0.9 [ a: v1, b:v2, c:v3 ]0.8 [ a:v1, b: v2, c: v3 ]0.8 [ b:v2, c:v3 ]...

90

Summary

• Our approach: pairwise, generic, Swoosh• Confidences• Making Tractable:– threshold– domination– packages

91

Thanks You

92

What Swoosh Does NOT Do

• Hash table with every pair seen:– records ri, rj– compared values vi, vj

• Swoosh achieves the same effectwith our N2 space

93

Swoosh Performance (I)

94

Swoosh Performance (II)

95

Swoosh Performance (III)

Generic Entity Resolution:�Identifying Real-World Entities in Large Data SetsEntity ResolutionApplicationsOutlineChallenges (1)Challenges (2)Challenges (3)Challenges (4)Challenges (5)TaxonomySchema DifferencesPair-Wise Snaps vs. ClusteringDe-Duplication vs. �Fidelity EnhancementRelationshipsUsing RelationshipsExact vs Approximate ERExact vs Approximate ERGeneric vs Application SpecificTaxonomyOutlineTaxonomyModelCorrect AnswerQuestionBrute Force AlgorithmBrute Force AlgorithmBrute Force AlgorithmQuestion # 1Question # 2ICAR PropertiesMore PropertiesICAR Properties EfficiencySwoosh AlgorithmsSwoosh PerformanceIf ICAR Properties Do Not Hold?If ICAR Properties Do Not Hold?If ICAR Properties Do Not Hold?Swoosh Without ICAR PropertiesDistributed SwooshDistributed SwooshDSwoosh PerformanceOutlineConclusionThanks.Generic Confidence ModelProblem: Properties May Not HoldER with ConfidencesImportant PropertyThresholds - ExampleThresholds - ExampleGoal: C-SwooshGoal: C-SwooshDoes Threshold Property Hold?Does Threshold Property Hold?Simple Confidence ModelRulesMatchesMatchesSummaryPackagesPartition RecordsExpand PackagesConclusionThanks.Extra SlidesTaxonomyOne Confidence ModelRecords Are EvidenceNew EvidenceNo IdsNo IdsQueries?Queries?Need Simpler Model?Bonus MaterialPrivacyLeakageMulti-Record LeakageQ1: Added Vulnerability?Q2: Disinformation?Q3: Verification?SummaryThanks.Extra SlidesChallengesThree Ideas to Tame ComplexityThresholdsDominationDominationSummaryThanks YouWhat Swoosh Does NOT DoSwoosh Performance (I)Swoosh Performance (II)Swoosh Performance (III)

generic entity resolution: identifying real-world entities ...generic entity resolution: identifying...

Documents