entity resolution

Upload: matthewriley123

Post on 31-May-2018

  • 8/14/2019 Entity Resolution

    1/16

    1

    Entity Resolution

    A Real-World Problem of Matching Records

    Techniques: Minhashing, Locality-Sensitive Hashing

    Measuring the Quality of the Results

    2

    What Is Entity Resolution?

    • Data from several sources may refer to the same entities, e.g., people.

    • There is no universal key to help us match records.

    • Big question: how do we tell if records refer to the same underlying entity?

    3

    A Matching Problem

    • Company A sold the services of Company B.

    • They then got mad at each other and sued over how many customers of B were originally from A.

    • B never bothered to store a "from A" bit.

    • I was asked to find how many of B's customers came from A.

    4

    Matching Details

    • There were about 1,000,000 records from each company.

    • Each had name, address, and phone# fields.

    • Because the records were created independently, there were many differences between records that represented the same entity (person).

    5

    Examples of Differences

    1. Typos of many sorts.

    2. Abbreviations (St./Street).

    3. Nicknames (Bob/Robert).

    4. Missing middle name or initial.

    5. First/last names reversed.

    6. Area-code changes.

    7. Etc., etc.

    6

    Simple Approach

    1. Develop a score of how close two name-addr-phone records are.

    2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.

    • That is 10^12 scorings (1,000,000 × 1,000,000): way too many.
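The simple approach can be sketched in a few lines, which also makes the cost visible: the loop body runs once per (A, B) pair. The scorer here (`count_identical_fields`) and the dictionary record layout are placeholders invented for illustration; the slides do not give the actual scoring formula.

```python
from itertools import product

FIELDS = ("name", "addr", "phone")


def count_identical_fields(rec_a, rec_b):
    """Toy score (hypothetical): how many of the three fields are exactly equal."""
    return sum(rec_a[f] == rec_b[f] for f in FIELDS)


def brute_force_matches(records_a, records_b, score, threshold):
    """Score every (A, B) pair; this does len(records_a) * len(records_b) scorings."""
    matches = []
    for (i, rec_a), (j, rec_b) in product(enumerate(records_a), enumerate(records_b)):
        if score(rec_a, rec_b) > threshold:
            matches.append((i, j))
    return matches


records_a = [{"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"}]
records_b = [
    {"name": "Robert Smith", "addr": "12 Main St.", "phone": "555-1234"},
    {"name": "Alice Jones", "addr": "99 Oak Avenue", "phone": "555-9876"},
]
print(brute_force_matches(records_a, records_b, count_identical_fields, threshold=1))
# [(0, 0)]
```

With 1,000,000 records on each side, the same loop would call `score` 10^12 times, which is why the remaining slides look for ways to score only promising pairs.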

    7

    Scoring Matches

    • The exact formula used to measure similarity turned out not to matter much, because we were able to measure the quality of any given score.

    • In general: an ad-hoc, experimental process.

    8

    Finding Pairs to Score

    • The key problem is deciding which of the 10^12 pairs are worth scoring.

    • An example of the near-neighbors problem:

      ◦ Given N points, find all pairs of points that are at distance less than some threshold.

      ◦ Usually expressed as similarity = 1 − normalized distance.

    9

    Standard N-N Framework

    1. Points are sets.

    2. Similarity of sets = ratio of size of intersection to size of union.

    3. Minhashing to convert sets into manageable summaries (signatures).

    4. Locality-sensitive hashing to focus on pairs likely to be similar.

    10

    Locality-Sensitive Hashing

    1. Choose many hash functions from points to buckets.

    2. Arrange that nearby points have a good chance of going to the same bucket.

    3. Candidates = pairs of points sent to the same bucket by at least one hash function.

    4. Evaluate only candidate pairs.

    11

    Example: Similar Documents

    • Replace a document by its k-shingles = all substrings of length k.

    • Example (k = 2): Doc1: abcdb; shingle set = {ab, bc, cd, db}.

    • Doc2: cdab; shingle set = {cd, da, ab}.

    • |intersection| = 2; |union| = 5; similarity = 2/5 = 40%.
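The shingling example can be checked mechanically. The sketch below uses helper names chosen for illustration (`shingles`, `jaccard`) and reproduces the two shingle sets and their similarity for k = 2.

```python
def shingles(doc, k):
    """Return the set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc1 = shingles("abcdb", 2)  # {'ab', 'bc', 'cd', 'db'}
doc2 = shingles("cdab", 2)   # {'cd', 'da', 'ab'}
print(jaccard(doc1, doc2))   # 0.4
```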

    12

    Minhashing

    • Pick a number of hash functions (say 100) from set elements to integers.

    • For each hash function, the minhash value for a set is the smallest integer to which any of its members hash.

    • The signature of a set is the list of minhash values for the selected hash functions.
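One possible way to build such signatures is sketched below. The family of hash functions, (a·x + b) mod p applied to a CRC32 of each element, is an assumption made for this example; the slides do not say which hash functions were used. The names `make_hash_functions` and `minhash_signature` are likewise illustrative.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any 32-bit CRC value


def make_hash_functions(count, seed=0):
    """Draw `count` random (a, b) pairs defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def minhash_signature(items, hash_functions):
    """For each hash function, keep the smallest value over all members of the set."""
    if not items:
        raise ValueError("minhash is undefined for an empty set")
    # CRC32 maps each string element to a reproducible integer.
    base = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in hash_functions]


hash_functions = make_hash_functions(100)
signature = minhash_signature({"ab", "bc", "cd", "db"}, hash_functions)
print(len(signature))  # 100
```

The signature depends only on the set's members, not on the order in which they are visited, so two copies of the same set always get identical signatures.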

    13

    Theorem

    • The probability that the minhash of two sets is the same = the similarity of the sets.

    • Consequence: if we minhash two sets many times, the fraction of hash functions for which their minhashes are the same will approximate the similarity of the sets.
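The theorem can be checked empirically. The sketch below treats each minhash as "the first member of the set under a random ordering of the elements", which is one standard way to model a random hash function, and counts how often two sets agree. The function name, the trial count, and the seed are arbitrary choices for this illustration.

```python
import random


def estimate_similarity(a, b, trials=2000, seed=0):
    """Fraction of random orderings under which sets a and b share the same minimum."""
    rng = random.Random(seed)
    universe = sorted(a | b)  # elements outside both sets cannot be either minimum
    agree = 0
    for _ in range(trials):
        rng.shuffle(universe)
        rank = {element: position for position, element in enumerate(universe)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            agree += 1
    return agree / trials


set1 = {"ab", "bc", "cd", "db"}
set2 = {"cd", "da", "ab"}
true_similarity = len(set1 & set2) / len(set1 | set2)  # 2/5
print(true_similarity, estimate_similarity(set1, set2))
```

The estimate should land close to 0.4, with an error that shrinks roughly as one over the square root of the number of trials.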

    14

    Back to LSH

    • Represent a set (e.g., the set of shingles of a doc) by the column of (say) 100 minhash values (its signature).

    • Matrix M consists of a column per set.

    • LSH starts by partitioning the rows into b bands of r rows each.

    15

    Partition Into Bands

    [Figure: matrix M divided into b bands of r rows per band. Each column is the signature for one set.]

    16

    Partition into Bands --- (2)

    • For each band, hash its portion of each column to a hash table with many buckets.

    • Candidate column pairs are those that hash to the same bucket for at least 1 band.

    • Tune b and r to catch most similar pairs, few nonsimilar pairs.
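The banding step might be sketched as follows. Each band's slice of a signature is used directly as a dictionary key, which behaves like a hash table with no accidental collisions; a real implementation would hash to a fixed number of buckets. The function name and the dictionary-of-signatures layout are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands, rows):
    """Return pairs of keys whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)  # a fresh table for each band
        for key, signature in signatures.items():
            if len(signature) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            buckets[chunk].append(key)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],  # agrees with A on the first band only
    "C": [7, 7, 7, 7],
}
print(lsh_candidates(signatures, bands=2, rows=2))  # {('A', 'B')}
```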

    17

    [Figure: matrix M, split into b bands of r rows, with each band's column segments hashed to buckets.]

    18

    Example --- Efficiency of LSH

    • Suppose 100,000 columns.

    • Signatures of 100 integers.

    • Therefore, signatures take 40 MB (at 4 bytes per integer).

      ◦ So they fit in main memory.

    • But 5,000,000,000 pairs of signatures can take a while to compare.

    • Choose 20 bands of 5 integers/band.
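The storage and pair counts follow from simple arithmetic, reproduced here. The 4 bytes per integer is an assumption that makes the 40 MB figure come out; the slide does not state the integer width.

```python
from math import comb

columns = 100_000
signature_length = 100
bytes_per_integer = 4  # assumed width of one minhash value

signature_bytes = columns * signature_length * bytes_per_integer
pair_count = comb(columns, 2)  # unordered pairs of columns

print(signature_bytes)  # 40000000, i.e. 40 MB
print(pair_count)       # 4999950000, i.e. about 5 billion
```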

    19

    Suppose C1, C2 are 80% Similar

    • Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328.

    • Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035.

      ◦ I.e., we miss about 1/3000th of the 80%-similar column pairs.
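These figures can be recomputed directly. The snippet assumes, as the minhash theorem suggests, that each of the 5 rows in a band agrees independently with probability 0.8.

```python
similarity = 0.8
rows_per_band = 5
bands = 20

# All 5 rows of one particular band agree.
p_band_identical = similarity ** rows_per_band

# No band agrees in full, so the pair never becomes a candidate.
p_missed = (1 - p_band_identical) ** bands

print(round(p_band_identical, 3))  # 0.328
print(p_missed)                    # roughly 0.00036, about 1 in 2,800
```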

    20

    Suppose C1, C2 Only 40% Similar

    • Probability C1, C2 identical in any one particular band: (0.4)^5 = 0.01.

    • Probability C1, C2 identical in at least 1 of 20 bands: at most 20 × 0.01 = 0.2.

    • But false positives are much lower for similarities well below 40%.

    21

    LSH Involves a Tradeoff

    • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.

    • Example: if we had fewer than 20 bands, the number of false positives would go down, but the number of false negatives would go up.

    22

    LSH --- Graphically

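    • Example target: all pairs with Sim > t.

    • Partition into bands gives us:

    [Figure: three plots of Prob. (0.0 to 1.0) against Sim (similarity s, 0.0 to 1.0), each with the threshold t marked. Panel "Ideal". Panel "One hash fn.". Panel for banding: the curve 1 − (1 − s^r)^b, with threshold t ~ (1/b)^(1/r).]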
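The banding curve 1 − (1 − s^r)^b and the threshold approximation t ~ (1/b)^(1/r) can be tabulated for the running example of b = 20 bands and r = 5 rows. The function names below are chosen for illustration.

```python
def candidate_probability(s, r, b):
    """Probability that a pair with similarity s agrees on all r rows of at least one of b bands."""
    return 1 - (1 - s ** r) ** b


def approximate_threshold(r, b):
    """Similarity near which the curve rises most steeply: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f}  P(candidate) = {candidate_probability(s, r=5, b=20):.4f}")

print(f"t ~ {approximate_threshold(r=5, b=20):.2f}")  # t ~ 0.55
```

For these parameters the probability climbs from well under 1% at s = 0.2 to over 99.9% at s = 0.8, which is the S-shape that approximates the ideal step at t.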

    23

    Back to Entity Resolution

    • Name-addr-phone records are not naturally representable by sets (e.g., shingle sets).

    • So we adapted the idea by using 3 hash functions:

    1. Hash by name.

    2. Hash by address.

    3. Hash by phone.
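A minimal sketch of the three-hash-function idea: bucket the A-records by each field in turn, then look up each B-record in the same buckets. Using the exact field value as the bucket key is an assumption for this illustration, as are the function name and the dictionary record layout.

```python
from collections import defaultdict

FIELDS = ("name", "addr", "phone")


def candidate_pairs(records_a, records_b):
    """Pairs (i, j) where A-record i and B-record j share at least one identical field."""
    candidates = set()
    for field in FIELDS:
        buckets = defaultdict(list)  # field value -> indexes of A-records
        for i, record in enumerate(records_a):
            buckets[record[field]].append(i)
        for j, record in enumerate(records_b):
            for i in buckets.get(record[field], ()):
                candidates.add((i, j))
    return candidates


records_a = [
    {"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"},
    {"name": "Ann Lee", "addr": "4 Elm Road", "phone": "555-0000"},
]
records_b = [
    {"name": "Bob Smith", "addr": "12 Main St.", "phone": "555-1234"},  # same phone as A[0]
    {"name": "Alice Jones", "addr": "99 Oak Avenue", "phone": "555-9876"},
]
print(candidate_pairs(records_a, records_b))  # {(0, 0)}
```

Only the candidate pairs are then passed to the scoring function, instead of all pairs.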

    24

    Entity-Resolution LSH

    • False negative for every pair of records that represented the same customer but had none of the three components identical.

    • With more cycles, we could have used bigger buckets and gotten fewer false negatives.

    25

    Example

    • Hash on positions 1, 3, and 5 of the (5-digit) zip code.

    • Approximately 1000 records from each dataset go into each of 1000 buckets.

    • 1 billion candidate pairs to score.

    • Need many more hash functions like this one.
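The zip-code hash and the resulting pair count can be sketched as below. The function name is illustrative, and the even spread of records across buckets is the slide's own approximation.

```python
def zip_bucket(zip_code):
    """Bucket key built from positions 1, 3, and 5 (1-indexed) of a 5-digit zip code."""
    if len(zip_code) != 5 or not zip_code.isdigit():
        raise ValueError("expected a 5-digit zip code")
    return zip_code[0] + zip_code[2] + zip_code[4]


print(zip_bucket("94305"))  # 935

bucket_count = 10 ** 3                               # three digits give 1000 possible keys
records_per_bucket = 1_000_000 // bucket_count       # about 1000 from each dataset
candidate_pair_count = bucket_count * records_per_bucket ** 2
print(candidate_pair_count)  # 1000000000
```

Records whose zip codes differ only in positions 2 or 4 still land in the same bucket, so this hash tolerates some typos at the cost of larger buckets.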

    26

    How Many False Positives?

    • Scoring system: 100 pts. for each of name, addr, phone.

    • Pairs with a score of 300 certainly refer to the same entity.

    • What about pairs with a score of 220? 150? etc.
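A scoring system of this shape might look like the sketch below. How partial credit was actually awarded is not stated in the slides; using the `difflib.SequenceMatcher` ratio for each field is purely an assumption to make the example concrete, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

FIELDS = ("name", "addr", "phone")


def field_points(a, b):
    """Up to 100 points for one field, scaled by string similarity (assumed rule)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Total score out of 300: up to 100 points each for name, addr, and phone."""
    return sum(field_points(rec_a[f], rec_b[f]) for f in FIELDS)


a = {"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"}
b = {"name": "Robert Smith", "addr": "12 Main St.", "phone": "555-1234"}

print(match_score(a, a))  # 300.0 for identical records
print(match_score(a, b))  # below 300 because the address is abbreviated
```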

    27

    Using the Time-Lag

    • We took advantage of the fact that a B-record was probably created shortly after the A-record.

    • For the 300-score pairs, the average delay was 10 days.

    • We did not even consider matching records with more than a 90-day lag.

    28

    Time-Lag-Trick --- (2)

    • Bogus-pair time-lag avg. = 45 days.

    • Good-pair time-lag avg. = 10 days.

    • Suppose the pairs with score s have average time-lag d.

    • Fraction of pairs with score s that are good: (45 − d)/35.
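The formula follows from treating the pairs with score s as a mixture. If a fraction f of them are good, their average lag is d = 10f + 45(1 − f), and solving for f gives f = (45 − d)/35. The 45-day figure is consistent with bogus pairs having lags spread evenly over the 0 to 90 day window, though the slides do not spell that out. A small sketch, with names chosen for illustration:

```python
BOGUS_AVG_LAG = 45.0  # average time-lag of bogus pairs, in days
GOOD_AVG_LAG = 10.0   # average time-lag of good pairs, in days


def fraction_good(avg_lag):
    """Estimated fraction of good pairs among pairs whose average lag is avg_lag days."""
    return (BOGUS_AVG_LAG - avg_lag) / (BOGUS_AVG_LAG - GOOD_AVG_LAG)


print(fraction_good(10.0))  # 1.0: looks like all good pairs
print(fraction_good(27.5))  # 0.5: half good, half bogus
print(fraction_good(45.0))  # 0.0: looks like all bogus pairs
```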

    29

    Profile of Time-Lag

    [Figure: average time-lag (axis marked at 10 and 45 days) as a function of score (axis marked at 300, 185, 120, 100).]

    30

    Generalizing the Time-Lag Trick

    • All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.

    • Example: reserve phones for checking.

    • Not even essential that all records have the property.

    31

    Summary

    • Entity-resolution: important step in database integration.

    • Minhashing: useful tool for converting sets into easily comparable vectors.

    • Locality-sensitive hashing: powerful technique for finding similar objects of many kinds.