Transcript
Page 1: Entity Resolution: Introduction - Duke University

Entity Resolution: Introduction

Data Cleaning & Integration, CompSci 590.01, Spring 2017

Based on: Getoor & Machanavajjhala’s VLDB 2012 tutorial slides

Cohen’s record linkage tutorial
Elsner & Schudy’s ILP-NLP slides

Page 2

What’s ER?

Entity Resolution: identifying and linking/grouping different manifestations of the same real-world object, e.g.:
• Different ways of addressing (names, emails, Facebook accounts) the same person in text
• Web pages with different descriptions of the same business
• Different photos taken of the same object
• etc.

Page 3

Ironically, ER has duplicate names…

• Record linkage
• Duplicate detection
• Deduplication
• Reference reconciliation
• Reference matching
• Object consolidation
• Fuzzy matching
• Entity clustering
• Hardening soft databases
• …

Page 4

Example: IP aliasing problem

… when measuring Internet topology


IP Aliasing Problem   [Willinger et al. 2009]

Willinger et al. Notices of the AMS, 2009

Page 5

Example cont’d

Figure 3. The IP alias resolution problem in practice. This is reproduced from [48] and shows a comparison between the Abilene/Internet2 topology inferred by Rocketfuel (left) and the actual topology (top right). Rectangles represent routers with interior ovals denoting interfaces. The histograms of the corresponding node degrees are shown in the bottom right plot. © 2008 ACM, Inc. Included here by permission.

…(IP)-speaking) routers encountered en route from the source to the destination. Instead, since IP routers have multiple interfaces, each with its own IP address, what traceroute really generates is the list of (input interface) IP addresses, and a very common property of traceroute-derived routes is that one and the same router can appear on different routes with different IP addresses. Unfortunately, faithfully mapping interface IP addresses to routers is a difficult open problem known as the IP alias resolution problem [51, 28], and despite continued research efforts (e.g., [48, 9]), it has remained a source of significant errors. While the generic problem is illustrated in Figure 2, its impact on inferring the (known) router-level topology of an actual network (i.e., Abilene/Internet2) is highlighted in Figure 3; the inability to solve the alias resolution problem renders in this case the inferred topology irrelevant and produces statistics (e.g., node degree distribution) that have little in common with their actual counterparts.

Another commonly ignored problem is that traceroute, being strictly limited to IP or layer 3, is incapable of tracing through opaque layer-2 clouds that feature circuit technologies such as Asynchronous Transfer Mode (ATM) or Multiprotocol Label Switching (MPLS). These technologies have the explicit and intended purpose of hiding the network’s physical infrastructure from IP, so from the perspective of traceroute, a network that runs these technologies will appear to provide direct connectivity between routers that are separated by local, regional, national, or even global physical network infrastructures. The result is that when traceroute encounters one of these opaque layer-2 clouds, it falsely “discovers” a high-degree node that is really a logical entity (a network potentially spanning many hosts or great distances) rather than a physical node of the Internet’s router-level topology. Thus, reports of high-degree hubs in the core of the router-level Internet, which defy common engineering sense, can often be easily identified as simple artifacts of…


Page 6

Other examples

• Name/attribute ambiguity, data entry errors, missing data, formatting differences, changing attributes…


Traditional Challenges in ER
• Name/attribute ambiguity (e.g., Thomas Cruise; Michael Jordan)
• Errors due to data entry
• Missing values
• Changing attributes
• Data formatting
• Abbreviations / data truncation

Page 7

“Big-data” ER challenges
• Larger + more datasets
• More heterogeneity
  • E.g., not just name matching any more, but matching Amazon profiles with Google browsing history or Facebook friend lists
• More linked
  • Links are crucial to ER; e.g., authors + papers + citations
• More complex structures
  • E.g., Walmart = Walmart Pharmacy?
• Diverse domains
  • No one-size-fits-all method
• Diverse applications
  • Different accuracy requirements; e.g., web search vs. comparison shopping

Page 8

Outline

• Data preparation and matching features
• Pairwise-ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture

Page 9

Normalization

• Schema normalization
  • Schema matching: e.g., contact# vs. phone
  • Compound attributes: e.g., addr vs. (street, city, st, zip)
  • Nested or set-valued attributes: e.g., properties for rent with a set of tags, multiple phone numbers
• Data normalization
  • Capitalization, white-space normalization
  • Correcting typos; replacing abbreviations, variations, nicknames
  • Usually done by employing “dictionaries”: e.g., lists of businesses, postal addresses, etc.

Page 10

Matching features

Given two records, compute a “comparison” vector of similarity scores for corresponding features
• E.g., to match two bibliographical references, compute ⟨1st-author-match-score, title-match-score, venue-match-score, year-match-score, …⟩
• Scores can be Boolean (match or mismatch), or real (based on some distance function)

Page 11

Quick tour of matching features
• Difference between numeric values
• Domain-specific, like Jaro (for names)
• Edit distance: good for typos in strings
  • Levenshtein, Smith-Waterman, affine gap
• Phonetic-based
  • Soundex
• Translation-based
• Set similarity
  • Jaccard, Dice
  • For text fields (set of words) or relational features (e.g., set of authors of a paper)
• Vector-based
  • Cosine similarity, TF/IDF (good for text)

Page 12

Jaro

Specifically designed for names by the U.S. Census
• Given s and t, a character c is common if s_i = t_j = c and |i − j| ≤ min(|s|, |t|)/2
• c_1 and c_2 are a transposition if c_1 and c_2 are common but appear in different orders in s and t
• Jaro similarity = (1/3) · (m/|s| + m/|t| + (m − x)/m), where m = # commons and x = some measure of # transpositions
• Jaro-Winkler further weighs errors early in the strings more heavily

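A minimal Python sketch of the definition above (the function name and the greedy pairing of common characters are my own choices; the matching window |i − j| ≤ min(|s|, |t|)/2 follows this slide, whereas common library implementations use max(|s|, |t|)/2 − 1, so results can differ slightly from off-the-shelf tools):

```python
def jaro(s, t):
    # Jaro similarity, following the slide's definitions.
    if not s or not t:
        return 0.0
    w = min(len(s), len(t)) // 2          # matching window from the slide
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    for i, c in enumerate(s):             # greedily pair up "common" characters
        for j in range(max(0, i - w), min(len(t), i + w + 1)):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                break
    m = sum(s_match)                      # m = number of commons
    if m == 0:
        return 0.0
    s_common = [c for c, f in zip(s, s_match) if f]
    t_common = [c for c, f in zip(t, t_match) if f]
    # x = half the number of common characters that are out of order
    x = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - x) / m) / 3
```

For MARTHA vs. MARHTA this gives (1/3)(6/6 + 6/6 + 5/6) ≈ 0.944: all six characters are common and the swapped TH/HT pair counts as one transposition.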

Page 13

Levenshtein

• Distance between strings s and t = shortest sequence of edit commands that transforms s into t
  • Copy a character from s over to t (cost 0)
  • Delete a character in s (cost 1)
  • Insert a character in t (cost 1)
  • Substitute one character for another (cost 1)

Example alignment (cost row is the running total):

  s:    W I L L - I A M _ C O H E N
  t:    W I L L L I A M _ C O H O N
  op:   C C C C I C C C C C C C S C
  cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

Total distance = 2.

Page 14

Computing Levenshtein

D(i, j) = score of the best alignment between s_1 s_2 ⋯ s_i and t_1 t_2 ⋯ t_j
  = min of:
    D(i − 1, j − 1) + d(s_i, t_j)   (sub/copy)
    D(i − 1, j) + 1                 (delete)
    D(i, j − 1) + 1                 (insert)
where d(s_i, t_j) = 1[s_i ≠ t_j], and let D(0, 0) = 0, D(i, 0) = i, and D(0, j) = j
• Can then normalize using the lengths of s and t: 1 − D(|s|, |t|) / max(|s|, |t|)

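The recurrence translates directly into a dynamic program; a sketch in Python (function names are illustrative):

```python
def levenshtein(s, t):
    # D[i][j] = min edit cost between s[:i] and t[:j], per the recurrence above.
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i                            # delete all of s[:i]
    for j in range(1, len(t) + 1):
        D[0][j] = j                            # insert all of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,  # substitute / copy
                          D[i - 1][j] + 1,      # delete
                          D[i][j - 1] + 1)      # insert
    return D[len(s)][len(t)]

def levenshtein_similarity(s, t):
    # Normalized as on the slide: 1 - D(|s|, |t|) / max(|s|, |t|).
    if not s and not t:
        return 1.0
    return 1 - levenshtein(s, t) / max(len(s), len(t))
```

On the previous slide’s example, levenshtein("WILLIAM_COHEN", "WILLLIAM_COHON") is 2 (one insert plus one substitution).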

Page 15

Smith-Waterman

• Find the longest “soft matching” subsequence

S(i, j) = max of:
    0                               (start over)
    S(i − 1, j − 1) − d(s_i, t_j)   (sub/copy)
    S(i − 1, j) − G                 (delete)
    S(i, j − 1) − G                 (insert)
where d(s_i, t_j) = 1[s_i ≠ t_j] − 2 · 1[s_i = t_j], the (linear) gap penalty G = 1, and let S(0, 0) = 0, S(i, 0) = 0, and S(0, j) = 0

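A sketch of the recurrence in Python, keeping the slide’s scoring (d is 1 on a mismatch and −2 on a match, so a match gains 2; linear gap penalty G = 1); the best local alignment score is the maximum over all cells:

```python
def smith_waterman(s, t, G=1):
    # Local alignment: a match rewards 2, a mismatch costs 1,
    # and each gap character costs G, as on the slide.
    S = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = -2 if s[i - 1] == t[j - 1] else 1   # d(s_i, t_j) from the slide
            S[i][j] = max(0,                        # start over
                          S[i - 1][j - 1] - d,      # sub / copy
                          S[i - 1][j] - G,          # delete
                          S[i][j - 1] - G)          # insert
            best = max(best, S[i][j])
    return best
```

On the worked example of the next slide, smith_waterman("MCCOHN", "COHEN") returns 7.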

Page 16

Smith-Waterman example (s = MCCOHN, t = COHEN; the best local alignment score is the largest entry, +7)

        C   O   H   E   N
  M     0   0   0   0   0
  C    +2  +1   0   0   0
  C    +2  +1   0   0   0
  O    +1  +4  +3  +2  +1
  H     0  +3  +6  +5  +4
  N     0  +2  +5  +5  +7

(computed with the S(i, j) recurrence from the previous slide)

Page 17

Affine gap distance

• Smith-Waterman fails on some pairs that seem quite similar:
  William W. Cohen vs. William W. “Don’t call me Dubya” Cohen
• Intuitively, a single long insert is “cheaper” than a lot of short inserts
• Idea: instead of charging nG for a gap of n characters, charge A + (n − 1)B, where A is the cost of opening a gap and B is the cost of continuing it


Page 18

Dynamic programming, again

• S(i, j) = max of:
    S(i − 1, j − 1) − d(s_i, t_j)
    I_s(i − 1, j − 1) − d(s_i, t_j)
    I_t(i − 1, j − 1) − d(s_i, t_j)
• I_s(i, j) = max of: S(i − 1, j) − A, I_s(i − 1, j) − B
  (best score in which s_i is aligned with a gap)
• I_t(i, j) = max of: S(i, j − 1) − A, I_t(i, j − 1) − B
  (best score in which t_j is aligned with a gap)

(State diagram: opening a gap moves from S into I_s or I_t at cost −A, extending a gap stays in I_s or I_t at cost −B, and a sub/copy returns to S at cost −d(s_i, t_j).)

Page 19

Set similarity

Given two sets A and B
• Jaccard distance: 1 − |A ∩ B| / |A ∪ B|
• Dice distance: 1 − 2|A ∩ B| / (|A| + |B|)
  • Not a distance metric (the triangle inequality doesn’t hold)
  • Note the connection to the F1 measure, which is the harmonic mean of
    • Precision: TP/(TP+FP)
    • Recall: TP/(TP+FN)

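Both distances are one-liners over Python sets; a sketch (the convention for two empty sets is my own choice):

```python
def jaccard_dist(A, B):
    # 1 - |A ∩ B| / |A ∪ B|
    A, B = set(A), set(B)
    if not A and not B:
        return 0.0
    return 1 - len(A & B) / len(A | B)

def dice_dist(A, B):
    # 1 - 2|A ∩ B| / (|A| + |B|)
    A, B = set(A), set(B)
    if not A and not B:
        return 0.0
    return 1 - 2 * len(A & B) / (len(A) + len(B))
```

For A = {a, b} and B = {b, c}: Jaccard distance = 1 − 1/3 = 2/3, Dice distance = 1 − 2/4 = 1/2 — Dice always gives overlap more credit than Jaccard.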

Page 20

Cosine similarity and TF/IDF

• Let U = {x_1, x_2, …, x_n} be the universe of all elements (e.g., possible words in English)
• A multiset D with elements drawn from U (e.g., a document) can be represented as an n-dimensional vector ⟨w_1, w_2, …, w_n⟩
  • Each w_i can be as simple as c(D, x_i), the count of x_i in D
• Cosine similarity between D_1 and D_2 is (D_1 · D_2) / (‖D_1‖ ‖D_2‖), where ‖·‖ is the L_2 (Euclidean) norm

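With counts as the weights w_i, cosine similarity is a short computation; a sketch using Counter, with token lists standing in for the documents:

```python
from collections import Counter
from math import sqrt

def cosine(d1_tokens, d2_tokens):
    # Represent each document by its element counts, then take
    # the dot product divided by the product of L2 norms.
    v1, v2 = Counter(d1_tokens), Counter(d2_tokens)
    dot = sum(c * v2[x] for x, c in v1.items())
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```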

Page 21

TF/IDF

Alternatively, if you have a corpus 𝒟 of D’s, define
• Term frequency TF(D, x) = log10(1 + c(D, x)), where c(D, x) is x’s number of occurrences in D
• Inverse document frequency IDF(𝒟, x) = log10(|𝒟| / DF(𝒟, x)), where DF(𝒟, x) is the number of D’s in 𝒟 containing x
• Let w_i = TF(D, x_i) · IDF(𝒟, x_i)
• Idea: elements that don’t serve to distinguish a D within 𝒟 (e.g., stop words) are weighed down

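A sketch of these definitions, representing each document D as a list of tokens (the helper names are illustrative):

```python
from math import log10

def tf(D, x):
    # TF(D, x) = log10(1 + c(D, x))
    return log10(1 + D.count(x))

def idf(corpus, x):
    # IDF(corpus, x) = log10(|corpus| / DF(corpus, x))
    df = sum(1 for D in corpus if x in D)
    return log10(len(corpus) / df) if df else 0.0

def tfidf_vector(D, corpus, vocab):
    # w_i = TF(D, x_i) * IDF(corpus, x_i) for each x_i in the vocabulary
    return [tf(D, x) * idf(corpus, x) for x in vocab]
```

A term that appears in every document gets IDF = log10(1) = 0, so its weight drops to zero — exactly the slide’s intuition about stop words.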

Page 22

Tokenizing and shingling

What are the “elements” in text? Do we lose the sequencing information by treating text as a bag of elements?
• Simply split by non-alphanumeric characters? How about “San Francisco”?
• Can use a language model to find sequences of words that appear “more than random”
• Or additionally treat n-grams (all subsequences of length n) as your “elements” (shingling)

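Shingling combines naturally with the set-similarity measures from a few slides back; a small sketch using character n-grams, so some sequencing information survives inside each shingle:

```python
def shingles(text, n=3):
    # The set of all character n-grams (subsequences of length n) of a string.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def shingle_jaccard(s, t, n=3):
    # Jaccard similarity over the two shingle sets.
    A, B = shingles(s, n), shingles(t, n)
    return len(A & B) / len(A | B) if A | B else 1.0
```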

Page 23

Outline

• Data preparation and matching features
• Pairwise-ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture

Page 24

Pairwise-ER

Given a vector of component-wise similarity scores for records x and y, compute P(x and y match)

Possible solutions:
• Check the weighted sum of component-wise scores against a threshold to determine match/non-match
  • E.g., 0.5 × 1st-author-match-score + 0.2 × venue-match-score + 0.3 × title-match-score ≥ 0.8
• Formulate rules about what constitutes a match
  • E.g., (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (title-match-score > 0.9 AND venue-match-score > 0.9)

Hard to come up with weights, thresholds, and rules!

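The two rule styles above can be sketched as follows (the feature names, weights, and thresholds echo the slide’s example but are otherwise hypothetical):

```python
# Hypothetical weights, mirroring the slide's weighted-sum example.
WEIGHTS = {"first_author": 0.5, "venue": 0.2, "title": 0.3}

def match_by_weighted_sum(scores, weights=WEIGHTS, threshold=0.8):
    # Declare a match if the weighted sum of scores clears the threshold.
    return sum(w * scores.get(k, 0.0) for k, w in weights.items()) >= threshold

def match_by_rules(s):
    # Hand-written rule, as in the slide's second example.
    return ((s.get("first_author", 0) > 0.7 and s.get("venue", 0) > 0.8) or
            (s.get("title", 0) > 0.9 and s.get("venue", 0) > 0.9))
```

Either way, someone has to pick the numbers — which is the slide’s point, and the motivation for the learning-based approaches that follow.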

Page 25

Fellegi & Sunter (Science, 1969)

• Given record pair r = (x, y) to match, with γ as the score vector
• Let M denote matches and U non-matches
• Decision rule: R = P(γ | r ∈ M) / P(γ | r ∈ U)
  • Non-match if R ≤ t_lower; match if t_upper ≤ R; uncertain otherwise
• Naïve Bayes assumption: P(γ | r ∈ M) = Π_i P(γ_i | r ∈ M)

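A sketch of the decision rule under the naïve Bayes assumption, with Boolean agreement features; p_m[i] and p_u[i] are the per-feature agreement probabilities among matches and non-matches, and t_lower/t_upper correspond to the slide’s two cutoffs (all names are illustrative):

```python
def fellegi_sunter(gamma, p_m, p_u, t_lower, t_upper):
    # gamma: Boolean agreement vector for record pair r = (x, y).
    # p_m[i] = P(gamma_i agrees | r in M); p_u[i] = P(gamma_i agrees | r in U).
    # Under naive Bayes, the likelihood ratio R factorizes over features.
    R = 1.0
    for g, m, u in zip(gamma, p_m, p_u):
        R *= (m / u) if g else ((1 - m) / (1 - u))
    if R <= t_lower:
        return "non-match"
    if R >= t_upper:
        return "match"
    return "uncertain"
```

Agreement on an informative feature (high p_m, low p_u) multiplies R by a large factor; disagreement divides it, pushing the pair toward non-match.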

Page 26

Supervised ML for pairwise ER
• Naïve Bayes, decision trees (Cochinwala et al., IS 2001), support vector machines (Bilenko & Mooney, KDD 2003; Christen, KDD 2008), ensembles of classifiers (Chen et al., SIGMOD 2009), conditional random fields (Gupta & Sarawagi, VLDB 2009), etc.
• Imbalanced classes: typically many more negatives (O(|R|^2)) than positives (O(|R|))
• Pairs/matches are not i.i.d.
  • E.g., (x, y) ∈ M and (y, z) ∈ M implies (x, z) ∈ M
• Constructing a training set is hard
  • Most pairs are “easy non-matches”
  • Some pairs are inherently ambiguous (e.g., is Paris Hilton a person or a business?); others have missing attributes (e.g., Starbucks, Durham, NC)

Page 27

Active learning

• Focus labeling efforts to reduce the “confusion region” of classifiers
• To assess uncertainty, use the classifier’s output (e.g., posterior probabilities of a Bayesian classifier), or votes by a “committee” (multiple weak classifiers)
• Again, beware of the evaluation metric: 0-1 loss is no good; we need to maximize recall with acceptable precision
• Arasu et al. SIGMOD 2010; Bellare et al. KDD 2012

Page 28

Outline

• Data preparation and matching features
• Pairwise-ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture

Page 29

Constraint under record linkage

• Record linkage: link records between two databases (each of which has been deduplicated independently)
• Exclusivity constraint: a record in one database can match at most one record in the other database
  • Pairwise ER may well match one record with multiple records!

Page 30

Weighted bipartite matching

• Nodes in N_1 and N_2 are records from the two respective databases
• For each r_1 ∈ N_1 and r_2 ∈ N_2, draw an edge (r_1, r_2) and assign it a weight based on the pairwise similarity score (e.g., log odds of match)
• Find a matching (i.e., a set of edges without common nodes) that maximizes the sum of weights
  • Can be done in O(|R|^3) time using the Hungarian algorithm
  • In practice, no need to generate all O(|R|^2) edges because some pairs are obviously non-matches (Gupta and Sarawagi, VLDB 2009)

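For intuition, here is a brute-force maximum-weight assignment over a small square weight matrix; the Hungarian algorithm computes the same optimum in O(n^3) instead of O(n!). This sketch assumes positive weights and equal-sized sides, so the optimum is a perfect matching (a demo, not a practical implementation):

```python
from itertools import permutations

def max_weight_assignment(W):
    # Exhaustively try every assignment of row i to column perm[i]
    # and keep the one with the largest total weight.
    n = len(W)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(W[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm
```

With W = [[3, 1], [1, 2]], pairing record 0 with 0 and 1 with 1 totals 5, beating the crossed assignment’s 2 — the exclusivity constraint can override individually tempting pairs.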

Page 31

Outline

• Data preparation and matching features
• Pairwise-ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture

Page 32

Constraint under deduplication

• Deduplication: given a database containing potential duplicate mentions of the same entities, partition the mentions into equivalence classes
• Transitivity constraint:
  • If (x, y) ∈ M and (y, z) ∈ M, we must have (x, z) ∈ M
  • Pairwise ER may or may not give us (x, z) in this case
• A quick fix: compute the transitive closure on the inferred match relationships?
  • Bad idea in some cases: graphs resulting from pairwise ER can have diameter > 20 (Rastogi et al., CoRR 2012)

(Figure: a long chain of pairwise matches, with its distant endpoints linked by edges added by the transitive closure.)

Page 33

Clustering-based ER

• Resolution decisions are not made independently for each pair of records (good)
• Unsupervised (good), although it often still needs pairwise similarity as input
• Existing clustering algorithms may be used, but
  • The number of clusters is not known in advance
  • There are many, many small (possibly singleton) clusters, which is not what most existing clustering algorithms expect

Page 34

Possible clustering approaches

• Hierarchical clustering
  • Bilenko et al. ICDM 2005
• Nearest-neighbor-based methods
  • Chaudhuri et al. ICDE 2005
• Correlation clustering
  • Soon et al. CL 2001; Ng et al. ACL 2002; Bansal et al. ML 2004; Elsner et al. ACL 2008; Ailon et al. JACM 2008; etc.

Page 35

Correlation clustering

• Key advantage: no need to give the number of clusters; the optimal number is found automatically
• Key idea: maximize the sum of
  • Similarities between nodes within the same cluster
  • Dissimilarities between nodes in different clusters

Page 36

Integer linear program formulation
• Constants
  • w+_xy ∈ [0, 1]: cost of clustering x and y together
  • w-_xy ∈ [0, 1]: cost of putting x and y in different clusters
• Variables
  • r_xy = 1 if x and y are in the same cluster, or 0 otherwise
• Minimize Σ_{x,y} [ r_xy · w+_xy + (1 − r_xy) · w-_xy ]
  subject to ∀x, y, z ∈ R: r_xy + r_yz + r_xz ≠ 2
  • The constraint is basically transitivity
  • Note that what matters is the net weight w±_xy = w-_xy − w+_xy
• Setting up weights using pairwise similarity p_xy
  • Additive: w+_xy = 1 − p_xy; w-_xy = p_xy
  • Or logarithmic: w+_xy = log(1 − p_xy); w-_xy = log(p_xy)
• The problem is known to be NP-hard

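Since any feasible ILP solution corresponds to a partition (that is what the transitivity constraint enforces), the objective can be evaluated partition by partition. A brute-force sketch with the additive weights w+ = 1 − p, w- = p (exponential in the number of records, for illustration only; the helper names are mine):

```python
def partitions(items):
    # Generate all set partitions of a small list (Bell-number blowup).
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        yield [[first]] + p

def cc_cost(partition, p_sim):
    # Additive weights from the slide: cost (1 - p) for clustering a pair
    # together, cost p for separating it.
    cluster_of = {x: k for k, c in enumerate(partition) for x in c}
    cost = 0.0
    for (x, y), p in p_sim.items():
        cost += (1 - p) if cluster_of[x] == cluster_of[y] else p
    return cost

def best_clustering(items, p_sim):
    return min(partitions(list(items)), key=lambda q: cc_cost(q, p_sim))
```

With p(a,b) = p(b,c) = 0.9 and p(a,c) = 0.2, the optimum puts all three in one cluster at cost 0.1 + 0.1 + 0.8 = 1.0, even though a and c look dissimilar pairwise — the net weights of the two strong edges outvote the weak one.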

Page 37

Greedy algorithms
• Step through the records in some random order
• To label the next record x, use a heuristic linking rule to pick an existing cluster
  • Or start a new cluster with x by itself
• In practice, run the algorithm multiple times and take the best answer

(Diagram: previously assigned clusters and the next node to place. Red arc: negative net weight w±, prefer separate; green arc: positive w±, prefer together.)

Page 38

FIRST rule (“first link”, Soon et al. CL 2001)
• Link the next node to the cluster of the most recent positive arc
• Or start a new cluster if all arcs are negative

Page 39

BEST rule (“best link”, Ng & Cardie, ACL 2002)
• Link the next node along the highest scoring arc
• Or start a new cluster if all arcs are negative

Page 40

VOTE rule (“voted link”, Elsner et al. ACL 2008)
• Link the next node to the cluster with the highest arc sum
• Or start a new cluster if all arcs are negative
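A sketch of the greedy scheme with the VOTE rule (net weights w± as in the ILP slide: positive means “prefer together”; the function signature is my own):

```python
import random

def greedy_vote(nodes, w_net, seed=0):
    # VOTE linking rule: place each node in the existing cluster with the
    # highest sum of net arc weights, or start a new cluster if no
    # cluster's sum is positive.
    def arc(a, b):
        return w_net.get((a, b), w_net.get((b, a), 0.0))
    order = list(nodes)
    random.Random(seed).shuffle(order)     # step through in random order
    clusters = []
    for x in order:
        best, best_sum = None, 0.0
        for c in clusters:
            s = sum(arc(x, y) for y in c)
            if s > best_sum:
                best, best_sum = c, s
        if best is None:
            clusters.append([x])           # all arc sums non-positive
        else:
            best.append(x)
    return clusters
```

Rerunning with different seeds and keeping the lowest-cost answer matches the “run multiple times, take the best” advice from the greedy-algorithms slide.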

Page 41

PIVOT (Ailon et al. JACM 2008)
• Create each whole cluster at once
• Take the first node as the pivot and add all nodes with positive arcs to its cluster
• Then choose the next unlabeled node as the new pivot, and repeat

Page 42

Comparison of heuristics

Ailon et al. JACM 2008:
• PIVOT has approximation guarantees
  • 5-approximation if w-_xy + w+_xy = 1 (probability constraints)
  • 2-approximation if weights satisfy the triangle inequality
Elsner & Schudy, ILP-NLP 2009:
• VOTE works well in practice
• Local improvement can always be used in post-processing
