entity resolution, clustering author references
TRANSCRIPT
IntroductionBackground
MethodologySummary
Entity Resolution, Clustering AuthorReferences
Vlad [email protected]
May 1, 2007
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Outline
1 IntroductionWhat is Entity Resolution?Motivation
2 BackgroundFormal DefinitionEfficieny ConsiderationsMeasuring Text SimilarityOther approaches
3 MethodologyClustering author referencesLearning distance functionClustering experiment
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
What is Entity Resolution?Motivation
Set-Up
Given: a table of records, each record has a number of“fields”.Problem: decide which pairs of records refer to the same“entity”.Variation: given two tables, pair up matching records
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
What is Entity Resolution?Motivation
A new name for an old research area
Record linkageOriginally studied by Dunn, 1946Formalized by Fellegi and Sunter, 1969
Merge/purge problemData matching, object identity problemCoreference resolution, reference reconciliation, etc.
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
What is Entity Resolution?Motivation
Why is this useful?
Historical researchMedical researchGovernment record-keepingTracking terroristsWal-MartYahoo Local, Google LocalBibliographic citations
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Formal Definition
Two datasets A and B, and let (a, b) ∈ A × B.Also define M as the set of matched pairs (a = b) and U asthe set of unmatched pairs.For a given pair, there is a list of comparison values for(c1, . . . cn) where ck is the comparison value for the k th
field of the item.
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Probabilistic Model, Fellegi and Sunter
Define quantities mk = P{ck = 0|(a, b) ∈ M} anduk = P{ck = 0|(a, b) ∈ U}Assume independence assumption (!)Weight assigned to each component is:
wk =
{log(mk/uk ) if ck = 0,log(1 − mk )/(1 − uk )) if ck = 1.
Equivalent to Naive Bayes
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Canopies
Need methods to find candidate pairsMcCallum, Nigam, Ungar introduced “canopies” approach:clustering is performed in two stages, starting with a roughstage that divides data into overlapping subsetsIf clustering measures distance to cluster using clustercentroid, and canopy is larger than true cluster, nothing islost.
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Clustering bibliographic references
Goal was to compute the citation graph for research papersFirst pass: used a fast TF-IDF approachSecond pass: used an expensive string edit distancecomputation, combined with a HMM for field extractionResults in equally good accuracy, but orders of magnitudefaster
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Measuring Text Similarity
Several methods exist to construct a similarity measure for text:edit distance (customizable costs)Jaro’s algorithm (transpositions)character N-gramsTF-IDFstring kernelsterm-vector dot productsoundex
Research typically finds that no single method is best
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Refinements
Minton, Nanjo et al. introduced “transformation graphs” tohandle higher level concepts:
synonymsmisspellingabbreviationacronymconcatenation
Again, a Naive Bayes approach is used to learn weights fortransformations.
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches
Domain-independent approach
Monge and Elkan suggest a “domain-independent”approachEach record is treat as a single long stringSimilarity is measured using edit distance
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Clustering author references
A related problem to bibliographic referencesExperiment run on a large biology research corpusGoal is to determine when two matching names (e.g.Smith J.) refer to the same personExtra structure: social network consisting of co-authorshipedges
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Learning distance function
Similar to distance metric learning paper, but not inEuclidean spaceNot domain-independentDomain knowledge used to pick from one of 3 comparisonfunctions for each field:
equalityset intersectioncharacter N-gram similarity
Most important feature: number of common co-authors
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Clustering experiment
Clustering name references can be useful for judgingco-authorship importanceTried experiments with simple Greedy AgglomerativeClustering.Measured within-cluster dispersion at each clustering step:
Data is clustered into k clusters C1, . . . CkSum of pairwise distances: Dr =
∑i,j∈Cr
di,j
Dispersion measure: Wk =∑k
r=11
2nrDr
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Gap Statistic
Can we estimate the true number of clusters?Define Gap Statistic (Tibshirani et al. 2000)
Gapn(k) = E∗n (log(Wk ))− log(Wk )
Expectation is over a sample from reference distributionThe quantity log(Wk ) can be thought of as log-likelihood
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Reference distribution
Gapn(k) = E∗n (log(Wk ))− log(Wk )
For uniform distribution, expectation should decrease atthe rate (2/p) log k .Better: a uniform distribution over a box align with theprincipal components.Goal is to produce evidence against the null model (singlecluster)
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Estimation Procedure
Sample B Monte Carlo reference datasets from referencedistributionCluster each sampled datasetEstimate the gap:
Gap(k) = (1/B)∑
b
log(W ∗kb)− log(Wk )
Pick smallest k such that Gap(k) > Gap(k + 1)− sk+1
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Estimation Procedure
Estimation performs poorly; reference distribution notappropriate?Euclidean distance metric not appropriate for highdimensionsHowever: have evidence that Wk graph has some signal
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Clustering authorsLearning distance functionClustering experiment
Two within-cluster dispersion graphs
Vlad Shchogolev [email protected] Entity Resolution
IntroductionBackground
MethodologySummary
Summary
Rethink cluster estimation for this class of problemsDistance metric learning works well, but...Should take advantage of additional structure
Vlad Shchogolev [email protected] Entity Resolution