entity resolution, clustering author references

21
Introduction Background Methodology Summary Entity Resolution, Clustering Author References Vlad Shchogolev [email protected] May 1, 2007 Vlad Shchogolev [email protected] Entity Resolution

Upload: others

Post on 12-Sep-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Entity Resolution, Clustering AuthorReferences

Vlad [email protected]

May 1, 2007

Vlad Shchogolev [email protected] Entity Resolution

Page 2: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Outline

1 IntroductionWhat is Entity Resolution?Motivation

2 BackgroundFormal DefinitionEfficieny ConsiderationsMeasuring Text SimilarityOther approaches

3 MethodologyClustering author referencesLearning distance functionClustering experiment

Vlad Shchogolev [email protected] Entity Resolution

Page 3: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

Set-Up

Given: a table of records, each record has a number of“fields”.Problem: decide which pairs of records refer to the same“entity”.Variation: given two tables, pair up matching records

Vlad Shchogolev [email protected] Entity Resolution

Page 4: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

A new name for an old research area

Record linkageOriginally studied by Dunn, 1946Formalized by Fellegi and Sunter, 1969

Merge/purge problemData matching, object identity problemCoreference resolution, reference reconciliation, etc.

Vlad Shchogolev [email protected] Entity Resolution

Page 5: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

Why is this useful?

Historical researchMedical researchGovernment record-keepingTracking terroristsWal-MartYahoo Local, Google LocalBibliographic citations

Vlad Shchogolev [email protected] Entity Resolution

Page 6: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Formal Definition

Two datasets A and B, and let (a, b) ∈ A × B.Also define M as the set of matched pairs (a = b) and U asthe set of unmatched pairs.For a given pair, there is a list of comparison values for(c1, . . . cn) where ck is the comparison value for the k th

field of the item.

Vlad Shchogolev [email protected] Entity Resolution

Page 7: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Probabilistic Model, Fellegi and Sunter

Define quantities mk = P{ck = 0|(a, b) ∈ M} anduk = P{ck = 0|(a, b) ∈ U}Assume independence assumption (!)Weight assigned to each component is:

wk =

{log(mk/uk ) if ck = 0,log(1 − mk )/(1 − uk )) if ck = 1.

Equivalent to Naive Bayes

Vlad Shchogolev [email protected] Entity Resolution

Page 8: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Canopies

Need methods to find candidate pairsMcCallum, Nigam, Ungar introduced “canopies” approach:clustering is performed in two stages, starting with a roughstage that divides data into overlapping subsetsIf clustering measures distance to cluster using clustercentroid, and canopy is larger than true cluster, nothing islost.

Vlad Shchogolev [email protected] Entity Resolution

Page 9: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Clustering bibliographic references

Goal was to compute the citation graph for research papersFirst pass: used a fast TF-IDF approachSecond pass: used an expensive string edit distancecomputation, combined with a HMM for field extractionResults in equally good accuracy, but orders of magnitudefaster

Vlad Shchogolev [email protected] Entity Resolution

Page 10: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Measuring Text Similarity

Several methods exist to construct a similarity measure for text:edit distance (customizable costs)Jaro’s algorithm (transpositions)character N-gramsTF-IDFstring kernelsterm-vector dot productsoundex

Research typically finds that no single method is best

Vlad Shchogolev [email protected] Entity Resolution

Page 11: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Refinements

Minton, Nanjo et al. introduced “transformation graphs” tohandle higher level concepts:

synonymsmisspellingabbreviationacronymconcatenation

Again, a Naive Bayes approach is used to learn weights fortransformations.

Vlad Shchogolev [email protected] Entity Resolution

Page 12: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Domain-independent approach

Monge and Elkan suggest a “domain-independent”approachEach record is treat as a single long stringSimilarity is measured using edit distance

Vlad Shchogolev [email protected] Entity Resolution

Page 13: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Clustering author references

A related problem to bibliographic referencesExperiment run on a large biology research corpusGoal is to determine when two matching names (e.g.Smith J.) refer to the same personExtra structure: social network consisting of co-authorshipedges

Vlad Shchogolev [email protected] Entity Resolution

Page 14: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Learning distance function

Similar to distance metric learning paper, but not inEuclidean spaceNot domain-independentDomain knowledge used to pick from one of 3 comparisonfunctions for each field:

equalityset intersectioncharacter N-gram similarity

Most important feature: number of common co-authors

Vlad Shchogolev [email protected] Entity Resolution

Page 15: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Clustering experiment

Clustering name references can be useful for judgingco-authorship importanceTried experiments with simple Greedy AgglomerativeClustering.Measured within-cluster dispersion at each clustering step:

Data is clustered into k clusters C1, . . . CkSum of pairwise distances: Dr =

∑i,j∈Cr

di,j

Dispersion measure: Wk =∑k

r=11

2nrDr

Vlad Shchogolev [email protected] Entity Resolution

Page 16: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Gap Statistic

Can we estimate the true number of clusters?Define Gap Statistic (Tibshirani et al. 2000)

Gapn(k) = E∗n (log(Wk ))− log(Wk )

Expectation is over a sample from reference distributionThe quantity log(Wk ) can be thought of as log-likelihood

Vlad Shchogolev [email protected] Entity Resolution

Page 17: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Reference distribution

Gapn(k) = E∗n (log(Wk ))− log(Wk )

For uniform distribution, expectation should decrease atthe rate (2/p) log k .Better: a uniform distribution over a box align with theprincipal components.Goal is to produce evidence against the null model (singlecluster)

Vlad Shchogolev [email protected] Entity Resolution

Page 18: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Estimation Procedure

Sample B Monte Carlo reference datasets from referencedistributionCluster each sampled datasetEstimate the gap:

Gap(k) = (1/B)∑

b

log(W ∗kb)− log(Wk )

Pick smallest k such that Gap(k) > Gap(k + 1)− sk+1

Vlad Shchogolev [email protected] Entity Resolution

Page 19: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Estimation Procedure

Estimation performs poorly; reference distribution notappropriate?Euclidean distance metric not appropriate for highdimensionsHowever: have evidence that Wk graph has some signal

Vlad Shchogolev [email protected] Entity Resolution

Page 20: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Two within-cluster dispersion graphs

Vlad Shchogolev [email protected] Entity Resolution

Page 21: Entity Resolution, Clustering Author References

IntroductionBackground

MethodologySummary

Summary

Rethink cluster estimation for this class of problemsDistance metric learning works well, but...Should take advantage of additional structure

Vlad Shchogolev [email protected] Entity Resolution