

GENERIC ENTITY RESOLUTION WITH NEGATIVE RULES

Steven Euijong Whang · Omar Benjelloun · Hector Garcia-Molina

Compiled by Darshana Pathak

CONTENTS

1. Introduction

2. Example

3. ER-N model

4. GNR Algorithm

5. ENR Algorithm

6. How to choose negative rules?

7. Conclusion

8. References

10/7/2011

Introduction

What is entity resolution? A two-step process:

I. Identify records that refer to the same real-world entity.

II. Merge them together.

Also known as record linkage, merge-purge, or deduplication.

It is an application-specific, complex, and error-prone process.


Introduction

Why is this so complex?

Most of the time, because the data is:

I. Ambiguous

II. Missing or incomplete

III. Incorrect

These things are difficult to capture no matter what logic is used to decide whether records match or how they should be merged!


Introduction

What are negative rules?

I. Integrity constraints: rules that tell us what data is invalid.

II. Sanity checks used to remove inconsistencies.

Examples of violations:

• One person with two genders.

• The same address location with two different street names.


Example


General ER process:

I. Match function: M(r1, r2) = true if the records refer to the same entity, false otherwise. Denoted by (r1 ≈ r2) or (r1 ≠ r2).

II. Merge function: µ(r1, r2) = ‹r1, r2›

Record | Name | SSN | Gender

r1 | Pat | 999-04-1234 | (missing)

r2 | Patricia | (missing) | F

r3 | Pat | 999-04-1234 | M


Example

Match r1 and r2; if r1 ≈ r2, merge them:

r12 | Pat, Patricia | 999-04-1234 | F

Match r12 with r3; if r12 ≈ r3, merge them:

r123 | Pat, Patricia | 999-04-1234 | F, M

Problem: r123 has two genders, which violates a negative rule.
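The walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout (a frozenset of field/value pairs), the helper names, and the match rule (shared SSN, or one name being a prefix of the other) are all assumptions made for the example.

```python
# Assumed layout: a record is a frozenset of (field, value) pairs, so a field
# can carry several values after a merge.
def make_record(**fields):
    return frozenset(fields.items())

def values(record, field):
    return {v for (k, v) in record if k == field}

def match(a, b):
    # Hypothetical match rule: same SSN, or one name is a prefix of the other.
    if values(a, "ssn") & values(b, "ssn"):
        return True
    return any(x.startswith(y) or y.startswith(x)
               for x in values(a, "name")
               for y in values(b, "name"))

def merge(a, b):
    # The merged record keeps every value seen in either input.
    return a | b

def violates_gender_rule(record):
    # Unary negative rule from the slides: one person cannot have two genders.
    return len(values(record, "gender")) > 1

r1 = make_record(name="Pat", ssn="999-04-1234")
r2 = make_record(name="Patricia", gender="F")
r3 = make_record(name="Pat", ssn="999-04-1234", gender="M")

r12 = merge(r1, r2) if match(r1, r2) else None
r123 = merge(r12, r3) if match(r12, r3) else None

print(sorted(values(r123, "gender")))
print(violates_gender_rule(r123))
```

Each merge step looks reasonable on its own; only the final record exposes the conflict, which is why the check is applied to the merged result rather than to the inputs.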


Example

Why was this constraint not enforced during the ER process?

I. Real constraints are much more complex (the logic may be a large computer program that considers many factors).

II. Patches are added to the program over time by different people.

III. The condition may be acceptable during the ER process and fixable afterwards.


Example

Resolving the inconsistency

Because r123 is not acceptable as a final merge:

I. Unmerge r123 into {r12, r3}.

II. However, no two final records can have the same SSN.

III. So r12 and r3 cannot both be in the final record set.

IV. The problem occurred because r1 was initially merged with r2 instead of r3.

In practice, there is no obvious merge order!


ER-N Model

Basic properties of Match and Merge functions:

I. Idempotence: Any record matches itself, and merging a record with itself yields the same record.

∀r, r ≈ r and <r, r> = r.

II. Commutativity: If r1 matches r2, then r2 matches r1, and the merge result does not depend on argument order.

∀r1, r2, r1 ≈ r2 iff r2 ≈ r1, and if r1 ≈ r2, then <r1, r2> = <r2, r1>.


ER-N Model

Basic properties of Match and Merge functions:

III. Domination: r1 ≤ r2

Record r1 is dominated by r2 if both records refer to the same entity and r2’s information “includes” that of r1, so r1 is redundant.

In particular, r1 ≤ r2 whenever r2 = <r1, r’> for some r’.
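To make the three properties concrete, here is a small sketch on a toy model. The model is an assumption for illustration only: a record is a frozenset of "field:value" strings, two records match when they share a value, merging is set union, and domination is the subset relation.

```python
# Toy model (assumed, not from the paper): set-based records.
def match(a, b):
    return bool(a & b)

def merge(a, b):
    return a | b

def dominated(a, b):
    # a ≤ b: everything in a is also in b.
    return a <= b

r1 = frozenset({"name:Pat", "ssn:999-04-1234"})
r2 = frozenset({"name:Pat", "gender:M"})

# Idempotence: r ≈ r and <r, r> = r
idempotent = match(r1, r1) and merge(r1, r1) == r1
# Commutativity: argument order affects neither match nor merge
commutative = match(r1, r2) == match(r2, r1) and merge(r1, r2) == merge(r2, r1)
# Domination: r1 ≤ <r1, r2>, so r1 is redundant once the merge exists
r1_is_redundant = dominated(r1, merge(r1, r2))

print(idempotent, commutative, r1_is_redundant)
```

With set union as the merge, all three properties hold by construction; a real match or merge function would need to be checked against them.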


ER-N Model: Merge Closure

The merge closure ī contains all the records that can be generated from I using M and μ, where I = {r1, …, rn}.

Definition: The merge closure ī of I satisfies the following conditions:

1. I ⊆ ī

2. ∀r1, r2 ∈ ī s.t. r1 ≈ r2, <r1, r2> ∈ ī.

3. No strict subset of ī satisfies conditions 1,2.


Algorithm overview: steps to find the closure of a set I of records:

I. Start with an empty ī.

II. Repeat steps III to V until I is empty.

III. Pick a record r from I and remove it from I.

IV. For each r’ in ī:

1. If r’ ≈ r, then merged = <r, r’>.

2. If a merged record was produced and it is not in I ∪ ī ∪ {r}, then I = I ∪ {merged}.

V. ī = ī ∪ {r}

VI. Return ī.

* This basic algorithm does not consider negative rules.
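The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Record` type, the `sources` field (which tracks the base records folded into a merged record), and the match rule are hypothetical choices for the Pat/Patricia example, not definitions from the paper.

```python
from typing import NamedTuple

class Record(NamedTuple):
    sources: frozenset  # ids of the base records folded into this record
    fields: frozenset   # (field, value) pairs

def base(rid, **fields):
    return Record(frozenset({rid}), frozenset(fields.items()))

def values(rec, field):
    return {v for (k, v) in rec.fields if k == field}

def match(a, b):
    # Hypothetical rule: same SSN, or one name is a prefix of the other.
    if values(a, "ssn") & values(b, "ssn"):
        return True
    return any(x.startswith(y) or y.startswith(x)
               for x in values(a, "name") for y in values(b, "name"))

def merge(a, b):
    return Record(a.sources | b.sources, a.fields | b.fields)

def merge_closure(records):
    pending = set(records)   # I in the steps above
    closure = set()          # ī in the steps above
    while pending:
        r = pending.pop()
        for other in list(closure):
            if match(r, other):
                merged = merge(r, other)
                # Only queue records that have not been seen yet.
                if merged != r and merged not in pending and merged not in closure:
                    pending.add(merged)
        closure.add(r)
    return closure

r1 = base(1, name="Pat", ssn="999-04-1234")
r2 = base(2, name="Patricia", gender="F")
r3 = base(3, name="Pat", ssn="999-04-1234", gender="M")

closure = merge_closure([r1, r2, r3])
print(sorted(sorted(r.sources) for r in closure))
```

Under these assumptions every pair of records matches, so the closure holds one record per non-empty subset of the three base records. The closure can grow quickly with the input, which is part of why the later slides look at ways to reduce cost.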


Time to apply negative rules!

Negative rules are classified by their number of arguments:

I. Unary negative rule: checks whether a record r is valid by itself.

II. Binary negative rule: checks whether two different records r1 and r2 can coexist.

Two inconsistent records cannot coexist in an ER solution.

Match and merge rules and negative rules cannot be combined.


Properties of negative rules

A set of records is inconsistent if there exists:

I. A single record violating a unary negative rule, and/or

II. A pair of records violating a binary negative rule.

Commutativity for negative rules:

For all r1 and r2, if r1 is not consistent with r2, then r2 is not consistent with r1.
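A consistency check over a set of records might look like the sketch below. The two rules are taken from the running example (two genders; two records with the same SSN), while the dict-of-sets record layout and the function names are assumptions for illustration. Each rule returns True when it is violated.

```python
from itertools import combinations

def two_genders(r):
    # Unary rule: True when the record is invalid by itself.
    return len(r["gender"]) > 1

def same_ssn(a, b):
    # Binary rule: True when the two records cannot coexist.
    return bool(a["ssn"] & b["ssn"])

def is_consistent(records, unary_rules, binary_rules):
    if any(rule(r) for r in records for rule in unary_rules):
        return False
    # Trying both argument orders enforces commutativity of binary rules.
    return not any(rule(a, b) or rule(b, a)
                   for a, b in combinations(records, 2)
                   for rule in binary_rules)

r2 = {"name": {"Patricia"}, "ssn": set(), "gender": {"F"}}
r3 = {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": {"M"}}
r12 = {"name": {"Pat", "Patricia"}, "ssn": {"999-04-1234"}, "gender": {"F"}}
r13 = {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": {"M"}}
r123 = {"name": {"Pat", "Patricia"}, "ssn": {"999-04-1234"}, "gender": {"F", "M"}}

print(is_consistent([r13, r2], [two_genders], [same_ssn]))
print(is_consistent([r12, r3], [two_genders], [same_ssn]))
print(is_consistent([r123], [two_genders], [same_ssn]))
```

Only the first set is consistent: {r12, r3} fails the binary rule and {r123} fails the unary rule.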


ER-N Model

According to the definition of the ER-N model:

Given an instance I and its merge closure ī, an ER-N of I is a consistent set of records J such that:

I. J is a maximal consistent subset of ī.

II. For every record r ∈ (ī − J), either

• there exists r’ ∈ J such that r ≤ r’, or

• J ∪ {r} is inconsistent.


Back to our example

• ī = {r1, r2, r3, r12, r13, r23, r123}

• The instance {r13, r2} is a valid ER-N solution.

where r13 = Pat | 999-04-1234 | M,

r2 = Patricia | … | F

• This satisfies all conditions of ER-N model.


Resolving Inconsistencies

Late approach:

I. An ER solution is generated using the match and merge rules.

II. The solution is checked for inconsistencies.

III. Appropriate fixes are applied to remove the inconsistencies, with guidance from a domain expert (the solver).

Early approach:

I. With the help of the solver, start identifying the records that we want in the final answer J.

II. Fix problems between the selected records in J and the records not yet selected.

* The early approach is preferred over the late approach.


Resolving Inconsistencies

Ways inconsistencies can be fixed:

I. Discard data: the solver may decide to drop the record.

II. Forced merge: the solver decides that two inconsistent records should have been merged.

III. Override negative rule: the solver decides that the flagged record(s) are in fact consistent, i.e. the negative rule was wrong to flag them.

e.g. Comfort Inn vs Comfort Inn Milton


The GNR Algorithm

General algorithm for negative rules:

The solver plays a key role in making decisions. If no solver is available, the algorithm chooses at random (!) or based on some heuristic.

e.g. A record with more fields available is preferable to one with fewer fields.

A solution generated without a solver may not be the “most desirable solution”.


The GNR Algorithm

Algorithm overview:

I. Generate the closure ī using the ER algorithm; set S = ī and J = ∅.

II. Select the set of non-dominated records (ndS) from S.

III. Select a record r from ndS with the help of the solver.

IV. S = S \ {r} (remove r from S).

V. If r is self-inconsistent, discard r and continue from step II.

VI. Otherwise, J = J ∪ {r}.

VII. Remove from S all records that are either inconsistent with r or dominated by r.

VIII. Repeat from step II until S is empty.

IX. Return J.
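The loop above can be sketched in Python as follows. Everything named here is an assumption for illustration: the `Record` type with a `sources` field, domination as "built from a subset of the same base records", the two negative rules, and `scripted_solver`, which stands in for the human expert. The closure is hard-coded for the Pat/Patricia example rather than computed.

```python
from itertools import combinations
from typing import NamedTuple

class Record(NamedTuple):
    sources: frozenset  # ids of the base records merged into this record
    fields: frozenset   # (field, value) pairs

def values(rec, field):
    return {v for (k, v) in rec.fields if k == field}

def dominated(a, b):
    # a ≤ b when b was built from a superset of a's base records.
    return a.sources <= b.sources

def self_inconsistent(r):
    # Unary negative rule (assumed): more than one gender.
    return len(values(r, "gender")) > 1

def inconsistent(a, b):
    # Binary negative rule (assumed): the two records share an SSN.
    return bool(values(a, "ssn") & values(b, "ssn"))

def gnr(closure, solver):
    S, J = set(closure), set()
    while S:
        # Step II: records in S not dominated by any other record in S.
        nd = [r for r in S if not any(o != r and dominated(r, o) for o in S)]
        r = solver(nd)                      # step III
        S.discard(r)                        # step IV
        if self_inconsistent(r):            # step V
            continue
        J.add(r)                            # step VI
        # Step VII: drop records inconsistent with r or dominated by r.
        S = {s for s in S if not inconsistent(r, s) and not dominated(s, r)}
    return J

def scripted_solver(candidates):
    # Stand-in for the expert: picks r13 when offered, else the first candidate.
    for c in candidates:
        if c.sources == frozenset({1, 3}):
            return c
    return candidates[0]

base = {
    1: {("name", "Pat"), ("ssn", "999-04-1234")},
    2: {("name", "Patricia"), ("gender", "F")},
    3: {("name", "Pat"), ("ssn", "999-04-1234"), ("gender", "M")},
}
# Shortcut for this example only: every pair of records is assumed to match,
# so the closure has one merged record per non-empty subset of {r1, r2, r3}.
closure = {
    Record(frozenset(ids), frozenset().union(*(base[i] for i in ids)))
    for n in (1, 2, 3) for ids in combinations(base, n)
}

result = gnr(closure, scripted_solver)
print(sorted(sorted(r.sources) for r in result))
```

With this solver the result is {r13, r2}. The intermediate contents of S depend on how the binary rule is defined, so they may not match a hand trace step for step. A different solver can produce a different answer, which is why the slides stress the solver's role.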


Back to our example

• ī = {r1, r2, r3, r12, r13, r23, r123}

• Select r123, but it is inconsistent, so discard it.

• Select one from {r12, r13, r23}: r13.

• Remove all records dominated by or inconsistent with r13.

• So, S = {r2, r23}

• We discard r23 because it is internally inconsistent.

• Final solution: {r13, r2}


Things to ponder

An important metric for the GNR algorithm is the “human effort” of the solver.

How do we calculate human effort?

Entity resolution is an inherently expensive operation.

If we apply negative rules, it becomes more expensive!


Techniques to reduce cost

Semantic partitioning:

Data is divided into independent blocks using semantic knowledge.

e.g. Category: Book, camera, snacks, …

The technique is commonly known as blocking.

Exploiting properties:

Exploit properties of the match and merge rules to make it possible to find the correct solution with less effort.
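Semantic partitioning can be sketched as grouping records by a key before any comparison happens. The product records and the `partition_by` helper below are hypothetical and only illustrate how blocking by category cuts the number of pairwise comparisons.

```python
from collections import defaultdict

def partition_by(records, key):
    # Group records into blocks; only records in the same block are compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return dict(blocks)

# Hypothetical sample data for illustration.
products = [
    {"category": "book", "title": "Database Systems"},
    {"category": "camera", "title": "Compact Zoom 10x"},
    {"category": "book", "title": "Database Systems, 2nd ed."},
    {"category": "snacks", "title": "Salted Peanuts"},
]

blocks = partition_by(products, key=lambda rec: rec["category"])

# Pairwise comparisons needed with and without blocking.
comparisons_blocked = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
comparisons_all = len(products) * (len(products) - 1) // 2
print(comparisons_blocked, comparisons_all)
```

The saving relies on the assumption that records in different blocks never refer to the same entity; if the category field itself is wrong, blocking will miss that match.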


Exploiting properties

Desirable properties for M and µ:

I. Associativity:

Merge order is irrelevant.

II. Representativity:

A merged record represents its base records and matches all records that match the base records.

May represent simple identity or a range.


ENR algorithm

Enhanced algorithm for negative rules: makes things simpler and more efficient.

Rather than looking at the entire merge closure of I, partition I and look at the merge closure of each partition.

This partitioning is different from blocking, because it does not assume any semantic knowledge.

The partitioning step is similar to our first algorithm, except that once records r and r’ are merged, they are removed from further consideration.

This works because any future record that matches r or r’ also matches the merged record <r, r’>.
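The merge loop described above, where merged inputs are dropped from further consideration, can be sketched as below. This is an illustration under assumptions, not the paper's ENR code: records are frozensets of identifier characters, two records match when they share an identifier, and merging is set union, a toy model in which the merged record matches everything its inputs matched.

```python
def match(a, b):
    # Toy rule (assumed): records match when they share an identifier.
    return bool(a & b)

def merge(a, b):
    return a | b

def resolve(records):
    pending = list(records)
    resolved = []
    while pending:
        r = pending.pop()
        partner = next((x for x in resolved if match(r, x)), None)
        if partner is None:
            resolved.append(r)
        else:
            # r and partner are replaced by their merge; neither is kept,
            # because the merged record matches whatever they matched.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved

records = [frozenset("ab"), frozenset("bc"), frozenset("d"), frozenset("ce")]
resolved = resolve(records)
print(sorted("".join(sorted(r)) for r in resolved))
```

Each surviving record groups the base records that were folded into it, which is one way to read the partitioning the slide describes. The full merge closure would then only need to be explored within each group.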


Some more negative rules

Borderline cases: In practice, rules are often written so that all borderline cases are flagged and the solver can check them.

e.g. The case in which name, DOB, address, and gender all match, but one digit of the SSN differs.

NameAddr negative rules: Special string comparison rules for name and address checks.

e.g. Are two streets with similar names “too far apart” to be in the same record?


How to choose negative rules?

Important: Design negative rules that do not generate too many unnecessary checks.

Always remember: the more records are flagged, the more human effort is required.

A good understanding of the application and of the match and merge rules is necessary.

Knowing the “common errors” (the weak points) of the match and merge rules helps a lot!


Conclusion

In the ER process, negative rules capture “sanity checks”.

The ER process often requires human guidance to handle real-world data and unexpected situations.

The GNR algorithm represents a generic way to solve ER-N.

The ENR algorithm makes the GNR algorithm less costly.

The choice of negative rules is very important.


References

1. Steven Euijong Whang, Omar Benjelloun, Hector Garcia-Molina. Generic entity resolution with negative rules. The VLDB Journal (2009) 18:1261–1277. DOI 10.1007/s00778-009-0136-3

2. http://infoblog.stanford.edu/2008/10/generic-entity-resolution-with-negative.html

3. http://infolab.stanford.edu/serf/


THANK YOU …
