entity resolution, clustering author references

Introduction | Background | Methodology | Summary

Entity Resolution, Clustering Author References

Vlad Shchogolev

May 1, 2007



Outline

1 Introduction
    What is Entity Resolution?
    Motivation

2 Background
    Formal Definition
    Efficiency Considerations
    Measuring Text Similarity
    Other approaches

3 Methodology
    Clustering author references
    Learning distance function
    Clustering experiment




Set-Up

- Given: a table of records; each record has a number of “fields”.
- Problem: decide which pairs of records refer to the same “entity”.
- Variation: given two tables, pair up matching records.





A new name for an old research area

- Record linkage
    - Originally studied by Dunn, 1946
    - Formalized by Fellegi and Sunter, 1969
- Merge/purge problem
- Data matching, object identity problem
- Coreference resolution, reference reconciliation, etc.





Why is this useful?

- Historical research
- Medical research
- Government record-keeping
- Tracking terrorists
- Wal-Mart
- Yahoo Local, Google Local
- Bibliographic citations




Formal Definition

- Let $A$ and $B$ be two datasets, and let $(a, b) \in A \times B$.
- Define $M$ as the set of matched pairs ($a = b$) and $U$ as the set of unmatched pairs.
- For a given pair, there is a list of comparison values $(c_1, \ldots, c_n)$, where $c_k$ is the comparison value for the $k$th field of the item.
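As a rough illustration of the comparison values, here is a minimal sketch. It assumes records are dictionaries, that a comparison value of 0 means the field agrees (after trimming and lowercasing) and 1 means it does not, and that the field names and sample records are hypothetical.

```python
from typing import Dict, List, Sequence

Record = Dict[str, str]

def comparison_vector(a: Record, b: Record, fields: Sequence[str]) -> List[int]:
    """Return (c_1, ..., c_n): 0 if field k agrees, 1 otherwise (assumed convention)."""
    def norm(value: str) -> str:
        return value.strip().lower()
    return [0 if norm(a.get(f, "")) == norm(b.get(f, "")) else 1 for f in fields]

# Hypothetical records, for illustration only.
a = {"name": "J. Smith", "city": "New York", "year": "1999"}
b = {"name": "j. smith", "city": "Boston", "year": "1999"}

c = comparison_vector(a, b, ["name", "city", "year"])
print(c)  # [0, 1, 0]
```

Real systems usually replace the exact-match test with a per-field similarity measure; the binary version is only the simplest case.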





Probabilistic Model, Fellegi and Sunter

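- Define the quantities $m_k = P\{c_k = 0 \mid (a, b) \in M\}$ and $u_k = P\{c_k = 0 \mid (a, b) \in U\}$.
- Assume the fields are independent (!)
- The weight assigned to each component is:

$$w_k = \begin{cases} \log(m_k / u_k) & \text{if } c_k = 0, \\ \log\big((1 - m_k)/(1 - u_k)\big) & \text{if } c_k = 1. \end{cases}$$

- Equivalent to Naive Bayes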
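The weights can be sketched in a few lines. This is a minimal example, assuming that $c_k = 0$ denotes agreement on field $k$ and that the per-field weights are summed into one score for the pair (the sum is what the independence assumption buys). The probabilities below are made up for illustration, not estimated from data.

```python
import math
from typing import Sequence

def match_weight(c: Sequence[int], m: Sequence[float], u: Sequence[float]) -> float:
    """Sum the per-field weights w_k for one comparison vector c."""
    total = 0.0
    for ck, mk, uk in zip(c, m, u):
        if ck == 0:
            total += math.log(mk / uk)                  # c_k = 0 case
        else:
            total += math.log((1.0 - mk) / (1.0 - uk))  # c_k = 1 case
    return total

# Illustrative (made-up) probabilities for two fields.
m = [0.9, 0.8]  # P(c_k = 0 | pair is in M)
u = [0.1, 0.2]  # P(c_k = 0 | pair is in U)

agree = match_weight([0, 0], m, u)     # both fields have c_k = 0
disagree = match_weight([1, 1], m, u)  # both fields have c_k = 1
print(agree > 0 > disagree)  # True
```

A pair would then be declared a match when its total weight exceeds a chosen threshold; how to set that threshold is not covered here.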





Canopies

- Need methods to find candidate pairs
- McCallum, Nigam, Ungar introduced the “canopies” approach: clustering is performed in two stages, starting with a rough stage that divides the data into overlapping subsets
- If clustering measures distance to a cluster using the cluster centroid, and the canopy is larger than the true cluster, nothing is lost.
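The rough first stage can be sketched as follows. This is a simplified illustration, assuming a cheap token-overlap (Jaccard) distance and two thresholds, a loose one for canopy membership and a tight one for removing points from the list of future centers. The function names, thresholds, and sample strings are hypothetical, and the center is taken deterministically rather than at random.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

def token_jaccard_distance(x: str, y: str) -> float:
    """Cheap distance: 1 - |shared tokens| / |all tokens|."""
    a, b = set(x.split()), set(y.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def canopies(points: Sequence[str], cheap_dist: Callable[[str, str], float],
             t_loose: float, t_tight: float) -> List[Set[int]]:
    """Build overlapping canopies of point indices (requires 0 <= t_tight <= t_loose)."""
    remaining = list(range(len(points)))
    result: List[Set[int]] = []
    while remaining:
        center = remaining[0]  # deterministic choice; a random pick also works
        dists = [cheap_dist(points[center], p) for p in points]
        result.append({i for i, d in enumerate(dists) if d <= t_loose})
        # Points very close to the center can no longer start a canopy.
        remaining = [i for i in remaining if dists[i] > t_tight]
    return result

def candidate_pairs(canopy_list: Sequence[Set[int]]) -> Set[Tuple[int, int]]:
    """Only pairs sharing a canopy go on to the expensive comparison."""
    pairs: Set[Tuple[int, int]] = set()
    for canopy in canopy_list:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs

names = ["j smith", "john smith", "smith j", "a jones", "alice jones"]
pairs = candidate_pairs(canopies(names, token_jaccard_distance, 0.7, 0.3))
print(sorted(pairs))  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

In this toy input, 4 candidate pairs remain out of the 10 possible pairs, so the expensive measure is computed less than half as often.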





Clustering bibliographic references

- Goal was to compute the citation graph for research papers
- First pass: used a fast TF-IDF approach
- Second pass: used an expensive string edit distance computation, combined with an HMM for field extraction
- Result: equally good accuracy, but orders of magnitude faster





Measuring Text Similarity

Several methods exist to construct a similarity measure for text:

- edit distance (customizable costs)
- Jaro’s algorithm (transpositions)
- character N-grams
- TF-IDF
- string kernels
- term-vector dot product
- soundex

Research typically finds that no single method is best
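Two of the listed measures are easy to sketch. The example below assumes unit costs for the edit distance (Levenshtein) and a Jaccard overlap of character bigrams for the N-gram measure; both are common choices rather than the only possible definitions.

```python
from typing import Set

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

def char_ngrams(s: str, n: int = 2) -> Set[str]:
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s: str, t: str, n: int = 2) -> float:
    """Jaccard similarity of the two n-gram sets, in [0, 1]."""
    a, b = char_ngrams(s, n), char_ngrams(t, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("smith", "smyth"))     # 1
print(ngram_similarity("smith", "smyth"))  # 2 shared bigrams out of 6
```

The two measures disagree in useful ways: a single substitution costs one edit but can remove two bigrams, which is one reason no single method dominates.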





Refinements

- Minton, Nanjo et al. introduced “transformation graphs” to handle higher-level concepts:
    - synonyms
    - misspelling
    - abbreviation
    - acronym
    - concatenation
- Again, a Naive Bayes approach is used to learn weights for transformations.





Domain-independent approach

- Monge and Elkan suggest a “domain-independent” approach
- Each record is treated as a single long string
- Similarity is measured using edit distance




Clustering author references

- A problem related to clustering bibliographic references
- Experiment run on a large biology research corpus
- Goal is to determine when two matching names (e.g. Smith J.) refer to the same person
- Extra structure: social network consisting of co-authorship edges





Learning distance function

- Similar to the distance metric learning paper, but not in Euclidean space
- Not domain-independent
- Domain knowledge used to pick one of 3 comparison functions for each field:
    - equality
    - set intersection
    - character N-gram similarity
- Most important feature: number of common co-authors
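To make the per-field comparison functions concrete, here is a minimal sketch that turns a pair of author references into a feature vector. The field names, the assignment of functions to fields, and the sample records are all hypothetical; how the features are weighted and learned is not shown.

```python
from typing import Any, Callable, Dict, List, Set

def equality(x: str, y: str) -> float:
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if x == y else 0.0

def set_intersection(x: Set[str], y: Set[str]) -> float:
    """Number of shared elements, e.g. common co-authors."""
    return float(len(x & y))

def ngram_similarity(x: str, y: str, n: int = 3) -> float:
    """Jaccard similarity of character n-gram sets."""
    gx = {x[i:i + n] for i in range(len(x) - n + 1)}
    gy = {y[i:i + n] for i in range(len(y) - n + 1)}
    union = gx | gy
    return len(gx & gy) / len(union) if union else 0.0

# Hypothetical choice of one comparison function per field.
COMPARATORS: Dict[str, Callable[[Any, Any], float]] = {
    "journal": equality,
    "coauthors": set_intersection,
    "affiliation": ngram_similarity,
}

def pair_features(a: Dict[str, Any], b: Dict[str, Any]) -> List[float]:
    """One comparison value per field, in the order of COMPARATORS."""
    return [compare(a[field], b[field]) for field, compare in COMPARATORS.items()]

r1 = {"journal": "Cell", "coauthors": {"Lee K", "Wong A"},
      "affiliation": "Columbia University"}
r2 = {"journal": "Cell", "coauthors": {"Lee K", "Patel R"},
      "affiliation": "Columbia Univ"}

features = pair_features(r1, r2)
print(features[:2])  # [1.0, 1.0]
```

A learned distance would then combine these values, for example as a weighted sum, with the co-author overlap expected to carry the largest weight.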





Clustering experiment

- Clustering name references can be useful for judging co-authorship importance
- Tried experiments with simple Greedy Agglomerative Clustering.
- Measured within-cluster dispersion at each clustering step:
    - Data is clustered into $k$ clusters $C_1, \ldots, C_k$
    - Sum of pairwise distances: $D_r = \sum_{i,j \in C_r} d_{i,j}$
    - Dispersion measure: $W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$, where $n_r$ is the number of points in $C_r$
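The dispersion measure is straightforward to compute from a distance matrix and a cluster assignment. The sketch below assumes $D_r$ sums over ordered pairs, so each unordered pair is counted twice, which is what the factor $1/(2 n_r)$ compensates for. The four points on a line are a made-up example.

```python
from typing import Dict, List, Sequence

def within_cluster_dispersion(dist: Sequence[Sequence[float]],
                              labels: Sequence[int]) -> float:
    """W_k = sum over clusters of D_r / (2 * n_r)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    w = 0.0
    for members in clusters.values():
        # D_r: sum of d_ij over all ordered pairs (i, j) within the cluster.
        d_r = sum(dist[i][j] for i in members for j in members)
        w += d_r / (2 * len(members))
    return w

# Made-up example: four points on a line, absolute difference as distance.
xs = [0, 1, 10, 11]
dist = [[abs(p - q) for q in xs] for p in xs]

w1 = within_cluster_dispersion(dist, [0, 0, 0, 0])  # one cluster
w2 = within_cluster_dispersion(dist, [0, 0, 1, 1])  # the two natural clusters
print(w1, w2)  # 10.5 1.0
```

The sharp drop from one cluster to two, followed by small decreases afterwards, is the kind of signal one looks for in a plot of the dispersion against the number of clusters.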





Gap Statistic

- Can we estimate the true number of clusters?
- Define the Gap Statistic (Tibshirani et al. 2000):

$$\mathrm{Gap}_n(k) = E^*_n(\log(W_k)) - \log(W_k)$$

- Expectation is over a sample from a reference distribution
- The quantity $\log(W_k)$ can be thought of as a log-likelihood





Reference distribution

$$\mathrm{Gap}_n(k) = E^*_n(\log(W_k)) - \log(W_k)$$

- For a uniform distribution, the expectation should decrease at the rate $(2/p) \log k$, where $p$ is the dimension of the data.
- Better: a uniform distribution over a box aligned with the principal components.
- Goal is to produce evidence against the null model (single cluster)





Estimation Procedure

- Sample $B$ Monte Carlo reference datasets from the reference distribution
- Cluster each sampled dataset
- Estimate the gap:

$$\mathrm{Gap}(k) = \frac{1}{B} \sum_{b} \log(W^*_{kb}) - \log(W_k)$$

- Pick the smallest $k$ such that $\mathrm{Gap}(k) > \mathrm{Gap}(k + 1) - s_{k+1}$
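The whole procedure can be sketched end to end on toy data. This is a rough illustration under several assumptions: single-linkage greedy agglomerative clustering, plain Euclidean distances, a reference distribution that is uniform over the bounding box of the data (not the principal-component box), and $s_k = \mathrm{sd}_k \sqrt{1 + 1/B}$ as in Tibshirani et al. All function names and the sample points are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def agglomerate(points: Sequence[Point], k: int) -> List[List[int]]:
    """Greedy single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

def log_dispersion(points: Sequence[Point], k: int) -> float:
    """log(W_k); assumes W_k > 0, i.e. some cluster has two distinct points."""
    w = 0.0
    for members in agglomerate(points, k):
        d_r = sum(math.dist(points[i], points[j]) for i in members for j in members)
        w += d_r / (2 * len(members))
    return math.log(w)

def gap_statistic(points: Sequence[Point], k_max: int, n_refs: int = 20,
                  seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return ([Gap(1..k_max)], [s_1..s_k_max]) using n_refs reference datasets."""
    rng = random.Random(seed)
    columns = list(zip(*points))
    box = [(min(col), max(col)) for col in columns]
    refs = [[tuple(rng.uniform(lo, hi) for lo, hi in box) for _ in points]
            for _ in range(n_refs)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref_logs = [log_dispersion(ref, k) for ref in refs]
        mean = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean - log_dispersion(points, k))
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    return gaps, errs

def choose_k(gaps: Sequence[float], errs: Sequence[float]) -> int:
    """Smallest k with Gap(k) > Gap(k+1) - s_{k+1}; falls back to k_max."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] > gaps[k] - errs[k]:
            return k
    return len(gaps)

# Toy data: two tight, well-separated groups on a line.
points = [(x / 10,) for x in range(5)] + [(10 + x / 10,) for x in range(5)]
gaps, errs = gap_statistic(points, k_max=3)
k_hat = choose_k(gaps, errs)
print([round(g, 2) for g in gaps], k_hat)
```

On data this clean the gap should jump between one and two clusters. On real, high-dimensional data the choice of reference distribution and distance matters much more than this toy example suggests.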





Estimation Procedure

- Estimation performs poorly; reference distribution not appropriate?
- Euclidean distance metric not appropriate for high dimensions
- However: have evidence that the $W_k$ graph has some signal





Two within-cluster dispersion graphs




Summary

- Rethink cluster estimation for this class of problems
- Distance metric learning works well, but...
- Should take advantage of additional structure

