entity resolution

8/14/2019 Entity Resolution


1

Entity Resolution

A Real-World Problem of Matching Records

Techniques: Minhashing, Locality-Sensitive Hashing

Measuring the Quality of the Results

2

What Is Entity Resolution?

- Data from several sources may refer to the same entities, e.g., people.
- There is no universal key to help us match records.
- Big question: how do we tell if records refer to the same underlying entity?



3

A Matching Problem

- Company A sold the services of Company B.
- They then got mad at each other and sued over how many customers of B were originally from A.
- B never bothered to store a "from A" bit.
- I was asked to find how many of B's customers came from A.

4

Matching Details

- There were about 1,000,000 records from each company.
- Each had name, address, and phone# fields.
- Because the records were created independently, there were many differences between records that represented the same entity (person).



5

Examples of Differences

1. Typos of many sorts.

2. Abbreviations (St./Street).

3. Nicknames (Bob/Robert).

4. Missing middle name or initial.

5. First/last names reversed.

6. Area-code changes.

7. Etc., etc.

6

Simple Approach

1. Develop a score of how close two name-addr-phone records are.
2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.
- 10^6 x 10^6 = 10^12 scorings --- way too much.
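The simple approach can be sketched in a few lines. This is only an illustration: the field names, the sample records, and the use of Python's `difflib.SequenceMatcher` as the per-field similarity are assumptions, not the scoring formula used in the actual case. The only detail taken from the slides is the scale of up to 100 points for each of name, addr, and phone.

```python
from difflib import SequenceMatcher
from itertools import product

FIELDS = ("name", "addr", "phone")  # hypothetical field names


def field_score(x: str, y: str) -> float:
    """Up to 100 points per field, scaled by a generic string-similarity ratio."""
    return 100.0 * SequenceMatcher(None, x.lower(), y.lower()).ratio()


def score(rec_a: dict, rec_b: dict) -> float:
    """Total score in [0, 300]: the sum of the three per-field scores."""
    return sum(field_score(rec_a[f], rec_b[f]) for f in FIELDS)


def match_all_pairs(a_records, b_records, threshold):
    """Brute force: score every (A, B) pair and keep those above the threshold."""
    return [
        (i, j)
        for (i, ra), (j, rb) in product(enumerate(a_records), enumerate(b_records))
        if score(ra, rb) > threshold
    ]


# Made-up records for illustration only.
a_records = [{"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"}]
b_records = [
    {"name": "Robert Smith", "addr": "12 Main St.", "phone": "555-1234"},
    {"name": "Alice Jones", "addr": "9 Oak Avenue", "phone": "555-9876"},
]
matches = match_all_pairs(a_records, b_records, threshold=250)
print(matches)
```

The double loop hidden inside `product` is the problem: with a million records on each side it calls `score` 10^12 times.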



7

Scoring Matches

- The exact formula used to measure similarity turned out not to matter much, because we were able to measure the quality of any given score.
- In general: an ad-hoc, experimental process.

8

Finding Pairs to Score

- The key problem is deciding which of the 10^12 pairs are worth scoring.
- An example of the near-neighbors problem:
  Given N points, find all pairs of points that are at distance less than some threshold.
  Usually expressed as similarity = 1 - normalized distance.



9

Standard N-N Framework

1. Points are sets.

2. Similarity of sets = ratio of size of intersection to size of union.

3. Minhashing to convert sets into manageable summaries (signatures).

4. Locality-sensitive hashing to focus on pairs likely to be similar.

10

Locality-Sensitive Hashing

1. Choose many hash functions from points to buckets.

2. Arrange that nearby points have a good chance of going to the same bucket.

3. Candidates = pairs of points sent to the same bucket by at least one hash function.

4. Evaluate only candidate pairs.



11

Example: Similar Documents

- Replace a document by its k-shingles = all substrings of length k.
- Example (k = 2): Doc1: abcdb; shingle set = {ab, bc, cd, db}.
- Doc2: cdab; shingle set = {cd, da, ab}.
- |intersection| = 2; |union| = 5; similarity = 40%.
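The shingling example can be reproduced with a short sketch. The function names are illustrative, and treating two empty sets as having similarity 1.0 is a convention chosen here, not something the slides specify.

```python
def shingles(doc: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Similarity of sets = |intersection| / |union|."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


s1 = shingles("abcdb", 2)  # {ab, bc, cd, db}
s2 = shingles("cdab", 2)   # {cd, da, ab}
print(sorted(s1), sorted(s2), jaccard(s1, s2))  # similarity is 2/5 = 0.4
```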

12

Minhashing

- Pick a number of hash functions (say 100) from set elements to integers.
- For each hash function, the minhash value for a set is the smallest integer to which any of its members hash.
- The signature of a set is the list of minhash values for the selected hash functions.



13

Theorem

- The probability that the minhash of two sets is the same = the similarity of the sets.
- Consequence: if we minhash two sets many times, the fraction of hash functions for which their minhashes are the same will approximate the similarity of the sets.
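A minimal sketch of minhash signatures and the estimate the theorem suggests. The hash family is an assumption made for illustration: each element is first mapped to an integer with `zlib.crc32`, then passed through a random linear function modulo a prime. Such functions only approximate the random permutations the theorem assumes, so the estimate is approximate as well. The two sample sets are made up.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the linear hash functions


def make_hash_functions(n: int, seed: int = 0):
    """Return n (a, b) coefficient pairs, one per hash function x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(items, hash_functions):
    """For each hash function, keep the smallest value any member of the set hashes to."""
    base = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in hash_functions]


def estimated_similarity(sig1, sig2) -> float:
    """Fraction of hash functions on which the two signatures agree."""
    agree = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return agree / len(sig1)


# Two made-up sets that share 40 of 120 distinct elements (similarity 1/3).
set_a = {f"item{i}" for i in range(0, 80)}
set_b = {f"item{i}" for i in range(40, 120)}

hash_functions = make_hash_functions(100)
sig_a = minhash_signature(set_a, hash_functions)
sig_b = minhash_signature(set_b, hash_functions)

true_sim = len(set_a & set_b) / len(set_a | set_b)
print(true_sim, estimated_similarity(sig_a, sig_b))
```

With 100 hash functions the estimate typically lands within a few hundredths of the true similarity; more hash functions tighten it at the cost of longer signatures.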

14

Back to LSH

- Represent a set (e.g., sets of shingles of a doc) by the column of (say) 100 minhash values (its signature).
- Matrix M consists of a column per set.
- LSH starts by partitioning the rows into b bands of r rows each.



15

Partition Into Bands

[Figure: matrix M divided into b bands of r rows per band; each column is the signature for one set.]

16

Partition into Bands --- (2)

- For each band, hash its portion of each column to a hash table with many buckets.
- Candidate column pairs are those that hash to the same bucket for at least 1 band.
- Tune b and r to catch most similar pairs, few nonsimilar pairs.
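The banding step can be sketched as follows. As a simplifying assumption, a Python dict keyed by the band's tuple of values stands in for the "hash table with many buckets", so two columns share a bucket only when their portions of the band are identical. The signature values are made up.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Return pairs of column names whose signatures agree on at least one band.

    signatures maps a column name to a list of bands * rows minhash values.
    """
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)  # one hash table per band
        for name, sig in signatures.items():
            key = tuple(sig[start:start + rows])  # this column's portion of the band
            buckets[key].append(name)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Made-up signatures of length 4, split into b = 2 bands of r = 2 rows.
signatures = {
    "C1": [1, 2, 3, 4],
    "C2": [1, 2, 9, 9],
    "C3": [7, 8, 3, 5],
}
candidates = lsh_candidate_pairs(signatures, bands=2, rows=2)
print(candidates)
```

C1 and C2 agree on the whole first band, so they become a candidate pair. C1 and C3 share a single value in the second band, which is not enough.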



17

[Figure: matrix M with b bands of r rows; each band's portion of a column is hashed to buckets.]

18

Example --- Efficiency of LSH

- Suppose 100,000 columns.
- Signatures of 100 integers.
- Therefore, signatures take 40 MB (at 4 bytes per integer).
  So they fit in main memory.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers/band.
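The figures on this slide follow from simple arithmetic, checked below. The 4-byte integer size is an assumption; it is the size that reproduces the 40 MB figure.

```python
from math import comb

columns = 100_000
signature_length = 100
bytes_per_integer = 4  # assumed; consistent with the 40 MB figure

signature_bytes = columns * signature_length * bytes_per_integer
pairs = comb(columns, 2)  # number of unordered pairs of columns
bands = 20
rows_per_band = signature_length // bands

print(signature_bytes)  # 40000000 bytes, i.e. 40 MB
print(pairs)            # 4999950000, i.e. about 5 billion
print(rows_per_band)    # 5
```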



19

Suppose C1, C2 are 80% Similar

- Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328.
- Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035.
  i.e., we miss about 1/3000th of the 80%-similar column pairs.

20

Suppose C1, C2 Only 40% Similar

- Probability C1, C2 identical in any one particular band: (0.4)^5 = 0.01.
- Probability C1, C2 identical in at least 1 of 20 bands: at most 20 * 0.01 = 0.2.
- But false positives much lower for similarities well below 40%.
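The 0.2 figure is an upper bound obtained by adding the per-band probabilities. For comparison, a worked version with the exact value, using the same r = 5 and b = 20:

```latex
\[
P(\text{candidate})
  = 1 - \left(1 - 0.4^{5}\right)^{20}
  = 1 - (1 - 0.01024)^{20}
  \approx 0.186
  \;\le\; 20 \times 0.01024 \approx 0.205 .
\]
```

So roughly 19% of the 40%-similar pairs become candidates, slightly fewer than the bound suggests.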



21

LSH Involves a Tradeoff

- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
- Example: if we had fewer than 20 bands, the number of false positives would go down, but the number of false negatives would go up.

22

LSH --- Graphically

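- Example target: all pairs with Sim > t.
- Partition into bands gives us an S-curve: Prob. = 1 - (1 - s^r)^b, with threshold t ~ (1/b)^(1/r).

[Figure: three plots of Prob. against Sim s, both axes running from 0.0 to 1.0, with t marked on the Sim axis. Panels: "Ideal"; "One hash fn."; partition into bands.]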
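A small sketch that tabulates the S-curve for the running example of r = 5 rows per band and b = 20 bands. The function names are illustrative.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s agree on at least one of b bands of r rows."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(r: int, b: int) -> float:
    """Rough location of the steep part of the S-curve: t ~ (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


r, b = 5, 20
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f}  P(candidate) = {candidate_probability(s, r, b):.4f}")
print(f"threshold t ~ {approximate_threshold(r, b):.3f}")
```

For these parameters the threshold comes out near 0.55, which is why 80%-similar pairs are almost always caught while 40%-similar pairs usually are not.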



23

Back to Entity Resolution

- Name-addr-phone records are not naturally representable by sets (e.g., shingle sets).
- So we adapted the idea by using 3 hash functions:
1. Hash by name.
2. Hash by address.
3. Hash by phone.
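One way to read the three hash functions is as three bucketings by exact field value, with a pair becoming a candidate if it shares a bucket under any of them. The sketch below follows that reading; the field names and sample records are made up for illustration.

```python
from collections import defaultdict


def candidate_pairs_by_field(a_records, b_records, fields=("name", "addr", "phone")):
    """Return (i, j) index pairs where A-record i and B-record j agree exactly on some field."""
    candidates = set()
    for field in fields:
        buckets = defaultdict(list)  # field value -> indices of A-records with that value
        for i, rec in enumerate(a_records):
            buckets[rec[field]].append(i)
        for j, rec in enumerate(b_records):
            for i in buckets.get(rec[field], ()):
                candidates.add((i, j))
    return candidates


a_records = [
    {"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"},
    {"name": "Ann Lee", "addr": "4 Elm Road", "phone": "555-0000"},
]
b_records = [
    {"name": "Bob Smith", "addr": "12 Main St.", "phone": "555-1234"},
    {"name": "Ann Lee", "addr": "77 Pine Lane", "phone": "555-1111"},
    {"name": "Carl Poe", "addr": "1 Bay Ct", "phone": "555-2222"},
]
candidates = candidate_pairs_by_field(a_records, b_records)
print(sorted(candidates))
```

The first pair is found through the phone bucket even though name and address differ; the second is found through the name bucket. Only these candidates then need to be scored.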

24

Entity-Resolution LSH

- False negative for every pair of records that represented the same customer but had none of the three components identical.
- With more cycles, we could have used bigger buckets and gotten fewer false negatives.



25

Example

- Hash on positions 1, 3, and 5 of the (5-digit) zip code.
- Approximately 1000 records from each dataset go into each of 1000 buckets.
- 1 billion candidate pairs to score.
- Need many more hash functions like this one.
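The counts on this slide can be checked with a short sketch. It assumes zip codes are stored as 5-character strings and that records spread evenly over the buckets, which real zip codes only approximate.

```python
def zip_bucket(zip_code: str) -> str:
    """Bucket key built from the 1st, 3rd, and 5th digits of a 5-digit zip code."""
    return zip_code[0] + zip_code[2] + zip_code[4]


records_per_dataset = 1_000_000
buckets = 10 ** 3                               # three decimal digits
per_bucket = records_per_dataset // buckets     # records per bucket, per dataset
candidate_pairs = buckets * per_bucket * per_bucket

print(zip_bucket("94305"))  # "935"
print(per_bucket)           # 1000
print(candidate_pairs)      # 1000000000
```

One such hash function cuts the work from 10^12 scorings to 10^9, but it misses every true pair whose zip codes differ in one of the three chosen positions, hence the need for more hash functions.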

26

How Many False Positives?

- Scoring system: 100 pts. for each of name, addr, phone.
- Pairs with a score of 300 certainly refer to the same entity.
- What about pairs with a score of 220? 150? etc.



27

Using the Time-Lag

- We took advantage of the fact that a B-record was probably created shortly after the A-record.
- For the 300-score pairs, the average delay was 10 days.
- We did not even consider matching records with more than a 90-day lag.

28

Time-Lag-Trick --- (2)

- Bogus-pair time-lag avg. = 45 days.
- Good-pair time-lag avg. = 10 days.
- Suppose the pairs with score s have average time-lag d.
- Fraction of pairs with score s that are good: (45-d)/35.
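The formula follows from treating the pairs with score s as a mixture: if a fraction f of them are good, their average lag is d = 10f + 45(1 - f), and solving for f gives f = (45 - d)/35. A minimal sketch, with illustrative parameter names and the two averages from the slide as defaults:

```python
def fraction_good(avg_lag_days: float, good_avg: float = 10.0, bogus_avg: float = 45.0) -> float:
    """Estimate the fraction of good pairs from the average time-lag of a score group.

    Solves avg_lag = f * good_avg + (1 - f) * bogus_avg for f.
    """
    return (bogus_avg - avg_lag_days) / (bogus_avg - good_avg)


print(fraction_good(10.0))   # 1.0: the group looks like all good pairs
print(fraction_good(27.5))   # 0.5: half good, half bogus
print(fraction_good(45.0))   # 0.0: the group looks like all bogus pairs
```

The estimate is only as reliable as the two averages, and sampling noise can push it slightly outside the range 0 to 1 for small groups.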



29

Profile of Time-Lag

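[Figure: average time-lag plotted against score; the score axis is marked at 300, 185, 120, and 100, and the lag axis at 10 and 45 days.]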

30

Generalizing the Time-Lag Trick

- All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.
- Example: reserve phones for checking.
- Not even essential that all records have the property.



31

Summary

- Entity-resolution: important step in database integration.
- Minhashing: useful tool for converting sets into easily comparable vectors.
- Locality-sensitive hashing: powerful technique for finding similar objects of many kinds.