entity resolution: tutorial
Embed Size (px)
DESCRIPTION
Lise Getoor Ashwin MachanavajjhalaUniversity of MarylandCollege Park MDDuke UniversityTRANSCRIPT
-
Entity Resolution: TutorialEntityResolution:Tutorial
Li G t Ashwin MachanavajjhalaLise GetoorUniversityofMaryland
College Park MD
Ashwin MachanavajjhalaDukeUniversityDurham,NCCollegePark,MD ,
http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
http://goo.gl/f5eym
-
WhatisEntityResolution?Problemofidentifyingandlinking/groupingdifferent
manifestationsofthesamerealworldobject.f f j
Examplesofmanifestationsandobjects:p j Differentwaysofaddressing(names,emailaddresses,FaceBook
accounts)thesamepersonintext. Webpageswithdifferingdescriptionsofthesamebusiness. Differentphotosofthesameobject.
-
Ironically,EntityResolutionhasmanyduplicatenames
DuplicatedetectionRecordlinkage
Coreference resolution
Object consolidation
Coreference resolutionReferencereconciliation
Fuzzymatch
Deduplication
ObjectidentificationObjectconsolidation
Entity clustering
y
Entityclustering
Household matching
Approximatematch
Merge/purgeIdentityuncertainty
ReferencematchingHouseholding
HouseholdmatchingMerge/purge
Hardeningsoftdatabases
Doubles
gg
-
ERMotivatingExamples LinkingCensusRecords Public HealthPublicHealth Websearch Comparison shopping Comparisonshopping Counterterrorism Spam detection Spamdetection MachineReading
-
ERandNetworkAnalysis
before after
-
Motivation:NetworkScience Measuringthetopologyoftheinternetusingtraceroute
-
IPAliasingProblem[Willinger etal.2009]
-
IPAliasingProblem[Willinger etal.2009]
-
TraditionalChallengesinER Name/Attributeambiguity
Thomas CruiseThomasCruise
MichaelJordan
-
TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry
-
TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry MissingValues
[Gilletal;Univ ofOxford2003]
-
TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry MissingValues Changing Attributes ChangingAttributes
Dataformatting
/ Abbreviations/DataTruncation
-
BigDataERChallenges
-
BigDataERChallenges LargerandmoreDatasets
NeedefficientparalleltechniquesM H t it MoreHeterogeneity Unstructured,UncleanandIncompletedata.Diversedatatypes. No longer just matching names with names, but Amazon profiles withNolongerjustmatchingnameswithnames,butAmazonprofileswith
browsinghistoryonGoogleandfriendsnetworkinFacebook.
-
BigDataERChallenges LargerandmoreDatasets
NeedefficientparalleltechniquesM H t it MoreHeterogeneity Unstructured,UncleanandIncompletedata.Diversedatatypes.
More linked Morelinked Needtoinferrelationshipsinadditiontoequality
MultiRelational Dealwithstructureofentities(AreWalmart andWalmart Pharmacy
thesame?)
l d Multidomain Customizablemethodsthatspanacrossdomains
Multiple applications (web search versus comparison shopping) Multipleapplications(websearchversuscomparisonshopping) Servediverseapplicationwithdifferentaccuracyrequirements
-
Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData4 Challenges & Future Directions4. Challenges&FutureDirections
-
Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER
a) DataPreparationandMatchFeaturesb) Pairwise ERc) ConstraintsinERd) Algorithms
Record Linkage
5minutebreak
RecordLinkage Deduplication CollectiveER
3. ScalingERtoBigData4. Challenges&FutureDirections
-
Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData
a) Blocking/Canopy Generationa) Blocking/CanopyGenerationb) DistributedER
4. Challenges & Future Directions4. Challenges&FutureDirections
-
Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData4 Challenges & Future Directions4. Challenges&FutureDirections
-
ScopeoftheTutorial Whatwecover:
FundamentalalgorithmicconceptsinER ScalingERtobigdatasets TaxonomyofcurrentERalgorithms
Whatwedonotcover: Schema/ontology resolutionSchema/ontologyresolution Datafusion/integration/exchange/cleaning Entity/InformationExtraction PrivacyaspectsofEntityResolution DetailsonsimilaritymeasuresTechnical details and proofs Technicaldetailsandproofs
-
ERReferences Book/SurveyArticles
DataQualityandRecordLinkageTechniques[T Herzog F Scheuren W Winkler Springer 07][T.Herzog,F.Scheuren,W.Winkler,Springer, 07]
DuplicateRecordDetection[A.Elmagrid,P.Ipeirotis,V.Verykios,TKDE07] AnIntroductiontoDuplicateDetection[F.Naumann,M.Herschel,M&P
synthesislectures2010]y ] EvaluationofEntityResolutionApproachedonRealworldMatchProblems
[H.Kopke,A.Thor,E.Rahm,PVLDB2010] DataMatching[P.Christen,Springer2012]
Tutorials RecordLinkage:SimilaritymeasuresandAlgorithms
[N.Koudas,S.Sarawagi,D.Srivatsava SIGMOD06] DatafusionResolvingdataconflictsforintegration
[X.Dong,F.Naumann VLDB09]Entity Resolution: Theory Practice and Open Challenges EntityResolution:Theory,PracticeandOpenChallengeshttp://goo.gl/Ui38o [L.Getoor,A.Machanavajjhala AAAI12]
-
PART 1ABSTRACT PROBLEM STATEMENTPART1ABSTRACTPROBLEMSTATEMENT
-
AbstractProblemStatementRealWorld DigitalWorld
Records//Mentions
-
Deduplication ProblemStatement Clustertherecords/mentionsthatcorrespondtosameentityy
-
Deduplication ProblemStatement Clustertherecords/mentionsthatcorrespondtosameentityy Intensional Variant:Computeclusterrepresentative
-
RecordLinkageProblemStatement Linkrecordsthatmatchacrossdatabases
AB
-
ReferenceMatchingProblem Matchnoisyrecordstocleanrecordsinareferencetable
ReferenceblTable
-
AbstractProblemStatementRealWorld DigitalWorld
-
Deduplication ProblemStatement
-
Deduplication withCanonicalization
-
GraphAlignment(&motifsearch)
Graph1 Graph2
-
Relationshipsarecrucial
-
Relationshipsarecrucial
-
Notation R:setofrecords/mentions(typed) H: set of relations / hyperedges (typed)H:setofrelations/hyperedges (typed) M:setofmatches(recordpairsthatcorrespondtosameentity) N: set of non matches ( d i di t diff t titi ) N:setofnonmatches(recordpairscorrespondingtodifferententities) E:setofentities L set of links L:setoflinks
T (M N E L ) di l ld True(Mtrue,Ntrue,Etrue,Ltrue):accordingtorealworldvs Predicted(Mpred,Npred,Epred,Lpred ):byalgorithm
-
RelationshipbetweenMtrue andMpred Mtrue (SameAs ,Equivalence) M (Similar representations and similar attributes)Mpred (Similarrepresentationsandsimilarattributes)
MtrueRxR Mpredtrue pred
-
Metrics Pairwise metrics
Precision/Recall,F1 #ofpredictedmatchingpairs
Clusterlevelmetrics purity,completeness,complexityPrecision/Recall/F1: Cluster level closest cluster MUC B3 Precision/Recall/F1:Clusterlevel,closestcluster,MUC,B3 ,RandIndex
Generalizedmergedistance[Menestrina etal,PVLDB10]
Littleworkthatevaluationscorrectpredictionoflinks
-
TypicalAssumptionsMade Eachrecord/mentionisassociatedwithasinglerealworldentity.y
Inrecordlinkage,noduplicatesinthesamesource If two records/mentions are identical then they are true Iftworecords/mentionsareidentical,thentheyaretruematches
( ) M(,) Mtrue
-
ERversusClassificationFindingmatchesvs nonmatchesisaclassificationproblem
Imbalanced:typicallyO(R)matches,O(R^2)nonmatches
Instancesarepairsofrecords.PairsarenotIID
(,) Mtrue( ) MAND
( , ) Mtrue
(,) MtrueAND(,) Mtrue
-
ERvs (Multirelational)ClusteringComputingentitiesfromrecordsisaclusteringproblem
Intypicalclusteringalgorithms(kmeans,LDA,etc.)number of clusters is a constant or sub linear in R.numberofclustersisaconstantorsublinearinR.
In ER: number of clusters is linear in R and averageInER:numberofclustersislinearinR,andaverageclustersizeisaconstant.Significantfractionofclustersaresingletons.g
-
PART 2ALGORITHMIC FOUNDATIONS OF ERPART2ALGORITHMICFOUNDATIONSOFER
-
OutlineofPart2a) DataPreparationandMatchFeatures
b) Pairwise ER Determiningwhetherornotapairofrecordsmatch
c) ConstraintsinER5minutebreak
d) Algorithms Recordlinkage(Propagationthroughexclusitivity negativeconstraint), Deduplication (Propagationthroughtransitivitypositiveconstraint), Collective(PropagationthroughGeneralConstraints)
-
MOTIVATING EXAMPLE:MOTIVATINGEXAMPLE:BIBLIOGRAPHICDOMAIN
-
Entities&RelationsinBibliographicDomain
Wrote AuthorName Author Mention
NameString
Paper
Research Area NameString
I tit ti Institute MentionWorksAtp
Title# of AuthorsTopicWord1
InstitutionName
Institute MentionNameString
Word1Word 2WordN
Paper MentionTitleString
Cites VenueName
Venue MentionNameStringAppearsIn gpp
:cooccurrencerelationships:resolutionrelationships
:entityrelationships
-
PART 2DATA PREPARATION &PART2aDATAPREPARATION&MATCHFEATURES
-
Normalization Schemanormalization
SchemaMatching e.g.,contactnumberandphonenumberC d tt ib t f ll dd t it t t i Compoundattributes fulladdressvs str,city,state,zip
Nestedattributes Listoffeaturesinonedataset(airconditioning,parking)vs eachfeatureaboolean attributeboolean attribute
Setvaluedattributes Setofphonesvs primary/secondaryphone
Record segmentation from text Recordsegmentationfromtext Datanormalization
Oftenconverttoalllower/allupper;removewhitespaced i d i l h i k hi l detectingandcorrectingvaluesthatcontainknowntypographicalerrorsorvariations,
expandingabbreviationsandreplacingthemwithstandardforms;replacingnicknames with their proper name formsnicknameswiththeirpropernameforms
Usuallydonebasedondictionaries(e.g.,commercialdictionaries,postaladdresses,etc.)
-
MatchingFeatures Fortworeferencesxandy,computeacomparisonvectorof
similarityscoresofcomponentattribute. [1stauthormatchscore,
papermatchscore,venuematchscore,yearmatchscore,.]
Similarityscoresy Boolean(matchornotmatch) Realvaluesbasedondistancefunctions
-
SummaryofMatchingFeaturesH dl
Equalityonaboolean predicate AlignmentbasedorTwotieredGoodforNames
HandleTypographicalerrors
Editdistance Levenstein,SmithWaterman,Affine
Setsimilarity
JaroWinkler,SoftTFIDF,MongeElkan PhoneticSimilarity
Soundex Jaccard,Dice
VectorBased Cosine similarity, TFIDF
Translationbased Numericdistancebetweenvalues DomainspecificCosinesimilarity,TFIDF
Useful packages
pGoodforTextlikereviews/tweets Usefulfor
abbreviations,alternate names. Usefulpackages:
SecondString,http://secondstring.sourceforge.net/ Simmetrics:http://sourceforge.net/projects/simmetrics/
Li Pi htt // li i /li i /i d ht l
alternatenames.
LingPipe,http://aliasi.com/lingpipe/index.html
-
RelationalMatchingFeatures Relationalfeaturesareoftensetbased
Setofcoauthorsforapaperf i i i Setofcitiesinacountry
Setofproductsmanufacturedbymanufacturer
Canusesetsimilarityfunctionsmentionedearlier CommonNeighbors:Intersectionsize Jaccards Coefficient: Normalize by union sizeJaccard s Coefficient:Normalizebyunionsize AdarCoefficient:Weightedsetsimilarity
C b t i il it i t f l Canreasonaboutsimilarityinsetsofvalues AverageorMax Otheraggregates
-
PART 2 bPAIRWISE MATCHINGPART2bPAIRWISE MATCHING
-
Pairwise MatchScoreProblem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch. 0.5*1stauthormatchscore +0.2*venuematchscore +0.3*papermatchscore.p p Hardtopickweights.
Matchonlastnamematchmorepredictivethanloginname. Match on Smith less predictive than match on Getoor or Machanavajjhala Matchon Smith lesspredictivethanmatchon Getoor or Machanavajjhala .
Hardtotuneathreshold.
-
Pairwise MatchScoreProblem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.2 Formulate rules about what constitutes a match2. Formulaterulesaboutwhatconstitutesamatch.
(1stauthormatchscore>0.7ANDvenuematchscore>0.8)OR(papermatchscore>0.9ANDvenuematchscore>0.9)M ll f l ti th i ht t f l i h d Manuallyformulatingtherightsetofrulesishard.
-
BasicMLApproach r =(x,y)isrecordpair, iscomparisonvector,Mmatches,U non
matches
Decisionrule)|()|(
UrPMrPR
)|(
Match rtRMatch-Non rtR
-
Fellegi &Sunter Model[FS,Science69] r =(x,y)isrecordpair, iscomparisonvector,Mmatches,U non
matches
Decisionrule)|()|(
UrPMrPR
)|(
Match rtR l
Match-NonMatchPotential
rtRrtRt
u
ul
NaveBayes Assumption: )|()|( MrPMrP ii
-
MLPairwise Approaches Supervisedmachinelearningalgorithms
Decisiontrees [Co hin ala et al IS01] [Cochinwala etal,IS01]
Supportvectormachines [Bilenko &Mooney,KDD03];[Christen,KDD08]
Ensembles of classifiers Ensemblesofclassifiers [Chenetal.,SIGMOD09]
ConditionalRandomFields(CRF) [Gupta & Sarawagi VLDB09][Gupta&Sarawagi,VLDB09]
Issues:Training set generation Trainingsetgeneration
Imbalancedclasses manymorenegativesthanpositives(evenaftereliminatingobviousnonmatchesusingBlocking)
Misclassification cost Misclassificationcost
-
CreatingaTrainingSetisakeyissue Constructingatrainingsetishard sincemostpairsofrecordsareeasynonmatches.y 100recordsfrom100cities. Only106 pairsoutoftotal108 (1%)comefromthesamecity
Somepairsarehardtojudgeevenbyhumans Inherentlyambiguous
E.g.,ParisHilton(personorbusiness)
Missingattributes Starbucks,Torontovs Starbucks,QueenStreet,Toronto
-
AvoidingTrainingSetGeneration Unsupervised/SemisupervisedTechniques
EMbasedtechniquestolearnparameters [Winkler06,Herzogetal07]
GenerativeModels [Ravikumar & Cohen UAI04][Ravikumar &Cohen,UAI04]
ActiveLearning CommitteeofClassifiers
[Sarawagi etalKDD00,Tajeda etalIS01] Provably optimizing precision/recallProvablyoptimizingprecision/recall
[Arasu etalSIGMOD10,Bellare etalKDD12] Crowdsourcing
[W t l VLDB 12 M t l VLDB 12 ] [WangetalVLDB12,MarcusetalVLDB12,]
-
CommitteeofClassifiers [Tejada etal,IS01]
-
ActiveLearningwithProvableGuarantees
Mostactivelearningtechniquesminimize01loss[Beygelzimer etalNIPS2010].
However,ERisveryimbalanced: Numberofnonmatches>100*numberofmatches. Classifyingallpairsasnonmatcheshaslow01loss(
-
ActiveLearningwithProvableGuarantees
Monotonicity ofPrecision[Arasu etalSIGMOD10]
ThereisalargerfractionofmatchesinC1thaninC2.
Algorithmsearchesfortheoptimalclassifierusingbinaryp g ysearchoneachdimension
-
ActiveLearningwithProvableGuarantees
[Bellare etalKDD12]
O (log2 n) calls to a blackbox 0 1 loss active learning algorithmO (log2 n) calls to a blackbox 0 1 loss active learning algorithmO(log2 n)callstoablackbox 01lossactivelearningalgorithm.
Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10](in the worst case).
O(log2 n)callstoablackbox 01lossactivelearningalgorithm.
Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10](in the worst case).
1. PrecisionConstrainedWeighted01LossProblem(using a Lagrange Multiplier )
(intheworstcase).(intheworstcase).
(usingaLagrangeMultiplier).2. Givenafixedvaluefor,weighted01Losscanbeoptimizedby(onecallto)a
blackbox activelearningclassifier.3 Ri ht l f i t d b hi ll ti l l ifi3. Rightvalueof iscomputedbysearchingoveralloptimalclassifiers.
Classifiersareembeddedina2dplane(precision/recall) Searchisalongtheconvexhulloftheembeddedclassifiers
-
Crowdsourcing Growinginterestinintegratinghumancomputationindeclarative
workflowengines. ERisanimportantproblem(e.g.,forevaluatingfuzzyjoins) [WangetalVLDB12,MarcusetalVLDB12,]
Opportunity:utilizecrowdsourcing forcreatingtrainingsets,orforactivelearning.
Keyopenissue:Handlingerrorsinhumanjudgments InanexperimentonAmazonMechanicalTurk:
Pairwise matchingjudgment,eachgivento5differentpeople Majorityofworkersagreedontruthononly90%ofpairwise judgments.
-
SummaryofSingleEntityERAlgorithms
Manyalgorithmsforindependentclassificationofpairsofrecordsasmatch/nonmatch
MLbasedclassification&FellegiSunterP Ad d f h Pro:Advancedstateoftheart
Con:Buildinghighfidelitytrainingsetsisahardproblem
ActiveLearning&Crowdsourcing forERareactiveareasofresearch.
-
PART 2CONSTRAINTSPART2cCONSTRAINTS
-
Constraints Importantformsofconstraints:
Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2Functional Dependency: If M1 and M2 match then M3 and M4 must FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch
Transitivityiskeytodeduplication Exclusivityiskeytorecordlinkage Functionaldependenciesfordatacleaning,e.g.,
[Ananthakrishna etal.,VLDB02][Fan,PODS08][Bohannonetal ICDE07]al,ICDE07]
-
Positive&NegativeEvidence Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwithM2M2
FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch
Negative Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not matchandM3donotmatch Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2 FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
-
Positive&NegativeEvidence Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwithM2M2
FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch
Negative Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not matchandM3donotmatch Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2 FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
-
ConstraintTypesHardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i
IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues
matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch
NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct
IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct
IfM1,M2dontmatchthenM3,M4cannotmatch
Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch
Iftwovenuesdontmatch,thentheirpapersdontmatch
-
ConstraintTypesHardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i
IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues
matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch
NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct
IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct
IfM1,M2dontmatchthenM3,M4cannotmatch
Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch
Iftwovenuesdontmatch,thentheirpapersdontmatch
-
ConstraintTypesHardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i
IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues
matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch
NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct
IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct
IfM1,M2dontmatchthenM3,M4cannotmatch
Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch
Iftwovenuesdontmatch,thentheirpapersdontmatch
-
ConstraintTypesHardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i
IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues
matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch
NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct
IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct
IfM1,M2dontmatchthenM3,M4cannotmatch
Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch
Iftwovenuesdontmatch,thentheirpapersdontmatch
-
ConstraintTypesHardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i
IfM1,M2matchthenM3,M4morelikelytomatchIf t t h th
Notethatsomeoftheconstraintsmayberelational
Iftwopapersmatch,theirvenuesmatch
Iftwovenuesmatch,thentheirpapersaremorelikelytomatch
yandrequirejoins
May be directionalNegativeEvidence MentionM1andM2mustreferto
distinctentities(Uniqueness)Coauthors are distinct
IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont match
Maybedirectionalorbidirectional
Constraints can be recursiveCoauthorsaredistinct
IfM1,M2dontmatchthenM3,M4cannotmatch
Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch
Constraintscanberecursive,e.g.,iftwoauthorshavematchingcoauthors,thenh hIftwovenuesdontmatch,thentheir
papersdontmatchtheymatch
-
AdditionalConstraints AggregateConstraints[Chaudhuri etal.SIGMOD07]
count constraintscountconstraints EntityAcanlinktoatmostNBs
Authorshaveatmost5papersatanyconference
Oth t lik l Otheraggregateslikesum,averagemorecomplex
Again, these can be either hard or soft constraints,Again,thesecanbeeitherhardorsoftconstraints,providepositiveornegativeevidence
-
MatchDependencies
Whenmatchingdecisionsdependonothermatching decisions (in other words, matchingmatchingdecisions(inotherwords,matchingdecisionsarenotmadeindependently),werefer to the approach as collectiverefertotheapproachascollective
-
MatchExtent Global: Iftwopapersmatch,thentheirvenuesmatch
This constraint can be applied to all instances of venueThisconstraintcanbeappliedtoallinstancesofvenuementions
AlloccurrencesofSIGMODcanbematchedtoInternationalConference on Management of DataConferenceonManagementofData
Local:Iftwopapersmatch,thentheirauthorsmatch This constraint can only be applied locallyThisconstraintcanonlybeappliedlocally
DontwanttomatchalloccurrencesofJ.SmithwithJeffSmith,onlyinthecontextofthecurrentpaper
-
Ex.SemanticIntegrityConstraintsType Example
Aggregate C1=Noresearcher haspublishedmorethanfiveAAAIpapersinayear
Subsumption C2=IfacitationXfromDBLP matchesacitationYinahomepage,theneachauthormentionedinYmatchessomeauthormentionedinX
Neighborhood C3=IfauthorsXandYsharesimilarnamesandsomecoauthors,theylik l t t harelikelytomatch
Incompatible C4 =NoresearcherexistswhohaspublishedinbothHCIandnumericalanalysis
f i i h d h i il hLayout C5=Iftwomentionsinthesamedocumentsharesimilarnames,theyarelikelytomatch
Key/Uniqueness C6=MentionsinthePClistingofaconferenceisto differentresearchersresearchers
Ordering C7=Iftwo citationsmatch,thentheirauthorswillbematchedinorder
Individual C8=TheresearcherwiththenameMayssam Sariahasfewerthanfi ti i DBLP ( d t t d t)fivementionsinDBLP(newgraduatestudent)
[Shen,Li&Doan,AAAI05]
-
AlgorithmsforHandlingConstraints Recordlinkage propagationthroughexclusivity
Weightedkpartitematching
Deduplication propagationthroughtransitivityC l i l i Correlationclustering
Collective propagation through general constraintsCollective propagationthroughgeneralconstraints Similaritypropagation
Dependencygraphs,CollectiveRelationalClusteringP b bili ti h Probabilisticapproaches LDA,CRFs,MarkovLogicNetworks,ProbabilisticRelationalModels,
Hybridapproaches Dedupalog
-
PART 2 dALGORITHMSPART2dALGORITHMS
-
RECORD LINKAGERECORDLINKAGE
-
11assumption Matchingbetween(almost)deduplicated databases. Each record in one database matches at most one recordEachrecordinonedatabasematchesatmostonerecordinanotherdatabase.
Pairwise ERmaymatcharecordinonedatabasewithmorethanonerecordinseconddatabase
-
WeightedKPartiteMatching
WeightedEdges
WeightedEdges Edges Edges
Edgesbetweenpairsofrecordsfromdifferentdatabases Edgeweights
o Pairwise matchscoreL dd f hio Logoddsofmatching
-
WeightedKPartiteMatching
Findamatching(eachrecordmatchesatmostoneotherrecord)fromotherdatabase)thatmaximizethesumofweights.
GeneralproblemisNPhard(3Dmatching) Successive bipartite matching is typically used [G pta & Sara agi VLDB Successivebipartitematchingistypicallyused.[Gupta&Sarawagi,VLDB
09]
-
DEDUPLICATIONDEDUPLICATION
-
Deduplication =>Transitivity Oftenpairwise ERalgorithmoutputinconsistentresults
(x,y)Mpred ,(y,z)Mpred ,but(x,z)Mpred
Idea:Correctthisbyaddingadditionalmatchesusingtransitivelclosure
In certain cases this is a bad ideaIncertaincases,thisisabadidea. Graphsresultingfrompairwise ERhave
diameter>20[Rastogi et al Corr 12]
AddedbyTransitive[Rastogi etalCorr 12]
Needclusteringsolutionsthatdealwiththisproblemdirectlyby
TransitiveClosure
reasoningaboutrecordsjointly.
-
ClusteringbasedER Resolutiondecisionsarenotmadeindependentlyforeachpairofrecords
Basedonvarietyofclusteringalgorithms,but Numberofclustersunknownaprioiri Many,manysmall(possiblysingleton)clusters
Oftentakeapairwisesimilaritygraphasinput
Mayrequiretheconstructionofaclusterrepresentativeorcanonicalentity
-
ClusteringMethodsforER HierarchicalClustering
[Bilenko etal,ICDM05][ , ]
NearestNeighborbasedmethods [Chaudhuri etal,ICDE05]
CorrelationClustering [SoonetalCL01,Bansal etalML04,NgetalACL02,
Ailon etalJACM08,Elsner etalACL08,Elsner etalILPNLP09]
-
IntegerLinearProgrammingviewofER rxy {0,1},rxy =1ifrecordsx andyareinthesamecluster. w+xy [0,1],costofclusteringxandytogetherxy [ ] g y g wxy [0,1],costofplacingxandyindifferentclusters
TransitiveTransitiveclosure
-
CorrelationClustering
Clustermentionssuchthattotal cost is minimi ed
42
totalcostisminimizedSolidedgescontributew+xy totheobjectiveDashededgescontributewxy totheobjective
3
1
y
Costbasedonpairwise similarities
5
Additive:w+xy =pxy andwxy =(1pxy) Logarithmic:w+xy =log(pxy)andwxy =log(1pxy)oga t c xy og(pxy) a d xy og( pxy)
-
CorrelationClustering SolvingtheILPisNPhard[Ailon etal2008JACM]
Anumberofheuristics[Elsner etal2009ILPNLP] Greedy BEST/FIRST/VOTE algorithmsGreedyBEST/FIRST/VOTEalgorithms GreedyPIVOTalgorithm(5approximation) LocalSearch
-
GreedyAlgorithmsS 1 P h d di dStep1:PermutethenodesaccordingarandomStep2:AssignrecordxtotheclusterthatmaximizesQuality
Start a new cluster if Quality < 0StartanewclusterifQuality0[Soon et al 2001 CL] [Soonetal2001CL]
VOTE:Assigntoclusterthatminimizesobjectivefunction. [Elsner etal08ACL]
PracticalNote: Runthealgorithmformanyrandompermutations,andpicktheclusteringwith
bestobjectivevalue(betterthanaveragerun)
-
Greedywithapproximationguarantees
PIVOTAlgorithm [Ailon etal2008JACM] Pickarandom(pivot)recordp.(p ) p Newcluster=
12
={1,2,3,4}C={{1,2,3,4}} ={2,4,1,3}C={{1,2},{4},{3}} ={3,2,4,1}C={{1,3},{2},{4}}
1
43
{ , , , } {{ , }, { }, { }}
Whenweightsare0/1, E(cost(greedy))
-
CANONICALIZATIONPART2dCANONICALIZATION
-
Canonicalization Mergeinformationfromduplicatementionstoconstructaclusterrepresentativewithmaximalinformationp
Starbucks Starbucks,
3457HillsboroughRoad Starbucks3457HillsboroughRoad,Durham,NCPh:(919)3334444
Durham,NCPh:null
Starbacks,
CriticallyimportantinWebportalswhereusersmust beshownaconsolidatedview
HillsboroughRd,DurhamPh:(919)3334444
Eachmentiononlycontainsasubsetoftheattributes
Mentionscontainvariations(ofnames,addresses)
Someofthementionshaveincorrectvalues
-
CanonicalizationAlgorithms Rulebased:
Fornames:typicallylongestnamesareused. Forsetvaluesattributes:UNIONisused.
F t i [C l tt t l KDD07] l dit di t f fi di Forstrings,[Culotta etalKDD07]learnaneditdistanceforfindingthemostrepresentativecentroid.
Canusemajorityruletofixerrors(if4outof5sayabusinessisclosed,thenbusinessisclosed). Thismaynotalwaysworkduetocopying[DongetalVLDB09],orwhen
underlyingdatachanges[PaletalWWW11]
-
CanonicalizationforEfficiency StanfordEntityResolutionFramework[Benjelloun VLDBJ09]
Considerablackbox matchandmergefunction Matchisapairwise boolean operator Merge:constructcanonicalversionofamatchingpair
Canminimizetimetocomputematchesbyinterleavingmatchingandmerging esp.,whenmatchandmergefunctions
satisfymonotonicity properties. r345
r12 r45
r1 r2 r3 r4 r5
-
COLLECTIVE ENTITY RESOLUTIONCOLLECTIVEENTITYRESOLUTION
-
CollectiveApproaches Decisionsforclustermembershipdependsonotherclusters
Nonprobabilisticapproachesp pp SimilarityPropagation
ProbabilisticModelsG i M d l GenerativeModels
UndirectedModels
HybridApproachesy pp
-
SIMILARITY PROPAGATIONSIMILARITYPROPAGATION
-
SimilarityPropagationApproaches Similaritypropagationalgorithmsdefineagraphwhichencodes
thesimilaritybetweenentitymentionsandmatchingdecisions,andcomputematchingdecisionsbypropagatingsimilarityvalues. Detailsofconstructedgraphandhowthesimilarityiscomputedvaries Algorithms are usually defined procedurallyAlgorithmsareusuallydefinedprocedurally Whileprobabilitiesmaybeencodedinvariouswaysinthealgorithms,there
isnoglobalprobabilisticmodeldefined
Approachesoftenmorescalablethanglobalprobabilisticmodels
-
DependencyGraph[Dong et al SIGMOD05 ]
Constructagraphwherenodesrepresentsimilaritycomparisonsbetween attribute values (realvalued) and match decisions based
[Dongetal.,SIGMOD05]
betweenattributevalues(real valued)andmatchdecisionsbasedonmatchingdecisionsofassociatednodes(booleanvalued)
Asmentionsareresolved,enrichedtocontainassociatednodesofallmatchedmentions
Similaritypropagateduntilfixedpointisreached
Negativeconstraints(notmatchnodes)arecheckedaftersimilaritypropagationisperformed,andinconsistenciesarefixed
-
ExploittheDependencyGraphSlid f [D t l SIGMOD05]
(Distributed,Distributed )(a1,a4)
Slidesfrom[Dongetal,SIGMOD05]
(169180,169180)(RobertS.Epstein,Epstein,R.S.)
1 4
(p1,p2)
(a2,a5)
(MichaelStonebraker,Stonebraker,M.)
(a3,a6)(v1,v2)
(EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)
Referencesimilarity Attributesimilarity
-
ExploittheDependencyGraph
(Distributed,Distributed )(a1,a4)
(169180,169180)(RobertS.Epstein,Epstein,R.S.)
1 4
(p1,p2)
(a2,a5)
(MichaelStonebraker,Stonebraker,M.)
(a3,a6)(v1,v2)
(EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)
Reconciled Similar
-
CollectiveRelationalClustering[Bh h & G TKDD07]
Constructagraphwhereleafnodesareindividuali
[Bhattacharya&Getoor,TKDD07]
mentions Performhierarchicalagglomerativeclusteringtomerge
l t f ticlustersofmentions Similaritycomputedbasedonacombinationofattributeand relational similarityandrelationalsimilarity
Whenclustersaremerged,updatethesimilaritiesofanyrelated clusters (clusters corresponding to mentionsrelatedclusters(clusterscorrespondingtomentionswhichcooccurwithmergedmentions)
-
ObjectiveFunctionMi i i Minimize:
),(),( jiRRjiAA ccsimwccsimw weightfor
bweightfor
lsimilarityof
bSimilaritybasedonrelationaledges
b d
jji j
Greedy clustering algorithm: merge cluster pair with max reduction in objective function
attributes relationsattributes betweenci andcj
)()( ** ccsimccsim where for example for cluster representative c*),(),( jAttributesa
ijiA ccsimccsim
))()(()( cNcNsimccsimand
for cluster representative c*
))(),((),( jijaccardjiR cNcNsimccsim where N(c) are the relational neighbors of c
-
RelationalClusteringAlgorithm1. Findsimilarreferencesusingblocking2. Bootstrapclustersusingattributesandrelations3. Computesimilaritiesforclusterpairsandinsertintopriorityqueue
4. Repeatuntilpriorityqueueisemptyp p y q p y5. Findclosestclusterpair6. Stopifsimilaritybelowthreshold7 If no negative constraints violated7. Ifnonegativeconstraintsviolated8. Mergetocreatenewcluster9. Constructcanonicalclusterrepresentative10. Updatesimilarityforrelatedclusters
O(nklogn)algorithmw/efficientimplementation
-
SimilaritypropagationApproachesMethod Notes Constraints Evaluation
RelDC[Kalashnikovetl TODS06]
Referencedisambiguationi i
Modelchoicenodesidentifiedi f
Contextattraction
h
Accuracyandruntime forAuthor
l i dal,TODS06] usingusingRelationshipbaseddatacleaning(RelDC)
usingfeaturebasedsimilarity
measurestherelationalsimilarity
resolutionanddirectorresolutioninMovie database
g ( )
ReferenceReconciliation[Dongetal,
DependencyGraphforpropagating
ReferenceenrichmentExplicitly handle
Both positiveandnegativeconstraints
Precision/Recall,F1onpersonalinformation
SIGMOD05] similarities+enforcenonmatchconstraints
missingvaluesParameterssetbyhand
managementdata(PIM),Coradataset
constraints
CollectiveRelationalClustering
Modifiedhierarchicalagglomerative
Constructscanonical entityasmergesare
Focusoncoauthorresolution and
Precision/Recall,F1onthreebibliographicg
[Bhattacharya&Getoor,TKDD07]
ggclusteringapproach
gmade propagation
g pdatasets:CiteSeer,ArXiv,andBioBase,andsyntheticdata
-
PROBABILISTIC MODELS:PROBABILISTICMODELS:GENERATIVEAPPROACHES
-
GenerativeProbabilisticApproaches ProbabilisticsemanticsbasedonDirectedModels
Model dependencies between match decisions in a generativeModeldependenciesbetweenmatchdecisionsinagenerativemanner
Disadvantage:acyclicity requirement Varietyofapproaches
BasedonLatentDirichlet Allocation,BayesianNetworks Examples
LatentDirichlet Allocation[Bhattacharya&Getoor,SDM07] ProbabilisticRelationalModels[Pasula etal,NIPS02]
-
LDAforEntityResolution:DiscoveringGroups from CoOccurrence RelationsGroupsfromCo OccurrenceRelations
Stephen C JohnsonStephen P Johnson
Alfred V Aho
J ff D Ull
Ravi Sethi
M k C ss
Chris Walshaw Kevin McManus
M tin E tt
Bell Labs Group
Jeffrey D Ullman
Parallel Processing Research Group
Mark Cross Martin Everett
P1: C. Walshaw, M. Cross, M. G. Everett, P4: Alfred V. Aho, Stephen C. Johnson, J ff D UllS. Johnson
P2: C. Walshaw, M. Cross, M. G. Everett,S. Johnson, K. McManus
Jefferey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
P3: C. Walshaw, M. Cross, M. G. Everett P6: A. Aho, R. Sethi, J. Ullman
-
LDAERModel Entity label a and group label z for
each reference r : mixture of groups for each co-
: mixture of groups for each co-
occurrence z: multinomial for choosing entity a
for each group z
z Va: multinomial for choosing
reference r from entity a Dirichlet priors with and
aT
r
T
AV
InferenceusingblockedGibbssamplingforefficiency(andimprovedaccuracy)
PR A
-
GenerativeApproachesMethod Learning/Inference
MethodEvaluation
[Li Morie & Generative Truncated EM to learn F1 on person[Li,Morie,&Roth,AAAI04]
Generativemodelformentionsindocuments
Truncated EMtolearnparametersandMAPinferenceforentities(unsupervised)
F1on personnames,locationsandorganizationsinTRECdataset
ProbabilisticRelationalM d l [P l
ProbabilisticRelationalM d l
Parameterslearnedonseparatedcorpora,i f d i
%ofcorrectlyidentifiedlModels[Pasula
etal.,NIPS03]Models inferencedoneusing
MCMCclusters onsubsetsofCiteSeer data
Latent Dirichlet Latent Dirichlet Blocked Gibbs Precision/RecallLatentDirichletAllocation[Bhattacharya&Getoor,
LatentDirichletAllocationModel
BlockedGibbsSamplingUnsupervisedapproach
Precision/Recall/F1onCiteSeerandHEPdata
SDM06]
-
PROBABILISTIC MODELS:PROBABILISTICMODELS:UNDIRECTEDAPPROACHES
-
UndirectedProbabilisticApproaches ProbabilisticsemanticsbasedonMarkovNetworks
Advantage: no acyclicity requirementsAdvantage:noacyclicity requirements Insomecases,syntaxbasedonfirstorderlogic
Advantage:declarativeg
ExamplesExamples ConditionalRandomFields(CRFs)[McCallum&Wellner,NIPS04]
MarkovLogicNetworks(MLNs)[Singla &Domingos,ICDM06] ProbabilisticSimilarityLogic[Broecheler &Getoor,UAI10]
-
MarkovLogic AlogicalKBisasetofhardconstraints onthesetofpossibleworlds
Makethemsoftconstraints;whenaworldviolatesaformula,itbecomeslessprobablebutnotimpossible
Giveeachformulaaweight Higher weight Stronger constraintHigherweight Strongerconstraint
isfies s it satf formulaweights oP(world) exp
[Richardson&Domingos,06]
-
MarkovLogic AMarkovLogicNetwork(MLN) isasetofpairs(F,w)where F isaformulainfirstorderlogic w isarealnumber
# true groundings of ith clause
Fi
ii xnwZXP )(exp1)( Fi
Iterate over all first-order MLN formulasNormalization Constant
[Richardson&Domingos,06]
-
ERProblemFormulationinMLNs Given
A DB of records representing mentions of entities in the real ADBofrecordsrepresentingmentionsofentitiesintherealworld,e.g.papermentions
A set of fields e g author title venueAsetoffieldse.g.author,title,venue
Eachrecordrepresentedasasetoftypedpredicatese.g.HasAuthor(paper,author),HasVenue(paper,venue)(p p , ), (p p , )
Goald h h f h d /f ld f h Todeterminewhichoftherecords/fieldsrefertothesame
underlyingentity
Slidesfrom[Singla &Domingos,ICDM06]
-
HandlingEquality IntroduceEquals(x,y) orx=y
I t d th i f lit Introducetheaxiomsofequality Reflexivity: x=x
Symmetry: x=y y=x Transitivity: x=y y=z z=x PredicateEquivalence:
x1 = x2 y1 y2 (R(x1, y1) R(x2,y2))x1 x2 y1 y2 (R(x1,y1) R(x2,y2))
-
Positive,SoftEvidence Introducereversepredicateequivalence
S l ti ith th tit i id b t Samerelationwiththesameentitygivesevidenceabouttwoentitiesbeingsame
R(x1,y1) R(x2,y2) x1 =x2 y2 =y2 Nottruelogically,butgivesusefulinformation
ExampleHasAuthor(C1 J Cox) HasAuthor(C2 Cox J ) C1 = C2HasAuthor(C1,J.Cox) HasAuthor(C2,CoxJ.) C1=C2(J.Cox=CoxJ.)
-
FieldComparison Eachfieldisastringcomposedoftokens
Introduce HasWord(field word)IntroduceHasWord(field,word)
Usereversepredicateequivalence
HasWord(f1,w1) HasWord(f2,w2) w1=w2 f1=f2 Example
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) (Cox=Cox)(J.Cox=CoxJ.)
h d ff h f h d Canhavedifferentweightforeachword
-
TwolevelSimilarity Individualwordsasunits:Cantdealwithspellingmistakes
Breakeachwordintongrams:IntroduceHasNgram(word,ngram)
Usereversepredicateequivalenceforwordcomparisons
-
RecordMatching SimplestVersion:Fieldsimilaritiesmeasuredbypresence/absenceofwordsincommon
HasWord(f1,w1) HasWord(f2,w2) HasField(r1,f1)HasField(r2,f2) w1=w2 r1=r2
E l ExampleHasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) HasAuthor(P1,J.Cox) HasAuthor(P2,CoxJ.) (Cox=Cox) (P1=P2)) ( , ) ( ) ( )
Transitivity(f1 = f2) (f2 = f3) ( f3 = f1)(f1 f2) (f2 f3) ( f3 f1)
AdditionalConstraintsHasAuthor(c a ) HasAuthor(c a ) Coauthor(a a )HasAuthor(c,a1) HasAuthor(c,a2) Coauthor(a1,a2)Coauthor(a1,a2) Coauthor(a3,a4) a1=a3 a2=a4
-
Inference Usecheapheuristics(e.g.TFIDFbasedsimilarity)toidentifyplausiblepairsy p p
Inference/learningoverplausiblepairs
Inferencemethod:lazygrounding+MaxWalkSAT
Learning:supervisedandtransfer(learn/handsetononeg p ( /domainandtransferred)
-
ProbabilisticSoftLogic[Broecheler & Getoor UAI10]
DeclarativelanguagefordefiningconstrainedcontinuousMarkov random field (CCMRF) using first order logic
[Broecheler &Getoor,UAI10]
Markovrandomfield(CCMRF)usingfirstorderlogic(FOL)
Soft logic: truth values in [0 1] Softlogic:truthvaluesin[0,1] LogicaloperatorsrelaxedusingLukasiewicz tnorms Mechanisms for incorporating similarity functions and Mechanismsforincorporatingsimilarityfunctions,andreasoningaboutsets
MAP inference is a convex optimization MAPinferenceisaconvexoptimization Efficientsamplingmethodformarginalinference
-
FOLtoCCMRF
PSLconvertsaweightedruleintopotentialfunctionsbypenalizing its distance to satisfactionpenalizingitsdistancetosatisfaction,
isthetruthvalueofgroundruleunderinterpretationx
Thedistributionovertruthvaluesis
: weight of rule r
:weightofruler :allgroundingsofruler
:PSLprogramp g
-
UndirectedApproachesMethod Learning/Inference
MethodEvaluation
[McCallum & Conditional Graph partitioning F1 on DARPA[McCallum&Wellner,NIPS04]
ConditionalRandomFields(CRFs)capturing
Graphpartitioning(Boykov etal.1999),performedviacorrelationclustering
F1on DARPAMUC&ACEdatasets
transitivityconstraints
[Singla &D i
MarkovLogicN k
Supervisedlearningd i f i
ConditionalLoglik lih d dDomingos,
ICDM06]Networks(MLNs)
andinferenceusingMaxWalkSAT &MCMC
likelihood andAUConCoraandBibServdata
[Broecheler &Getoor,UAI10]
ProbabilisticSimilarityLogic
Supervisedlearningandinferenceusing
Precision/Recall/F1Ontology
(PSL) continuousoptimization
Alignment
-
HYBRID APPROACHESHYBRIDAPPROACHES
-
HybridApproaches Constraintbasedapproachesexplicitlyencoderelationalconstraints Theycanbeformulatedashybridofconstraintsandprobabilisticmodels
Orasconstraintoptimizationproblem Examples
ConstraintbasedEntityMatching[Shen,Li&Doan,AAAI05] Dedupalog [Arasu,Re,Suciu,ICDE09]
-
Dedupalog [Arasu etal.,ICDE09]
PaperRef(id, title, conference, publisher, year)Wrote(id authorName Position)
DatatobededuplicatedDatatobe
deduplicatedWrote(id, authorName, Position) deduplicateddeduplicated
TitleSimilar(title1,title2)A th Si il ( th 1 th 2)
(Thresholded)FuzzyJ i O t t
(Thresholded)FuzzyJ i O t tAuthorSimilar(author1,author2) JoinOutputJoinOutput
Step(0)Createinitialapproximatematches;thisisinputtoDedupalog.
Step(1)Declaretheentities ClusterPapers,Publishers,&Authors
Paper!(id) :- PaperRef(id,-,-,-)Publisher!(p) :- PaperRef(-,-,-,p,-)Author!(a) :- Wrote(-,a,-) P bli h (UNA) d P (NOT UNA)
Dedupalog isflexible:UniqueNamesAssumption(UNA)
Dedupalog isflexible:UniqueNamesAssumption(UNA)
( ) ( , , ) Publishers(UNA)andPapers(NOTUNA)
Slidesbasedon[Arasu,Re,Suciu,ICDE09]
-
Step(2)DeclareClusters
PaperRef(id, title, conference, publisher, year)Wrote(id authorName Position)
InputintheDB
Clusterpapers,publishers and authorsWrote(id, authorName, Position)
TitleSimilar(title1,title2)A th Si il ( th 1 th 2)
Paper!(id) :- PaperRef(id,-,-,-)Publisher!(p) :- PaperRef(- - - p -)
publishers,andauthors
AuthorSimilar(author1,author2) Publisher!(p) :- PaperRef(-,-,-,p,-)Author!(a) :- Wrote(-,a,-)
Author*(a1,a2) AuthorSimilar(a1,a2)
Clustersaredeclared using*(likeIDBsorViews):TheseareoutputClustersaredeclared using*(likeIDBsorViews):Theseareoutput
Author1 Author2
AA Arvind Arasu
*IDBsareequivalencerelations:A D d l i
ClusterauthorswithsimilarnamesAA Arvind Arasu
Arvind A Arvind Arasu
qSymmetric,Reflexive,&TransitivelyClosedRelations:i.e.,Clusters
ADedupalog program isasetofdataloglikerules
152
-
SimpleConstraints
Paperswithsimilartitlesshouldlikelybeclusteredtogether
Paper*(id1,id2) PaperRef(id1,t1,-), PaperRef(id2,t2,-),TitleSimilar(t1,t2)
Author*(a1,a2) AuthorSimilar(a1,a2) ()Softconstraints:( )Payacostifviolated.
Paper*(id1 id2)
-
Additional ConstraintsAdditionalConstraintsClusteringtwopapers,thenmustclustertheirfirstauthors
Author*(a1,a2)
-
Dedupalog viaCC
Semantics:TranslateaDedupalog Programtoasetofgraphs
Nodesarereferences(inthe!Relation) EntityReferences:Conference!(c)
VLDBJConference*(c1,c2) ConfSim(c1,c2)
Positiveedges
Negative edges are implicit[ ]
VLDBconfVLDB
Negativeedgesareimplicit[] ICDE
InternationalConf.DEICDT
Forasinglegraphw.o.hardconstraintswe can reuse prior work for O(1) apx
156
wecanreusepriorworkforO(1)apx.
-
CorrelationClustering
Conference*(c1 c2)
-
Voting
Extendalgorithmtowhole languageviavotingtechnique.Support different entity types recursive programs etc
Thm: ArecursivehardThm: ArecursivehardManydedupalog programsManydedupalog programs
Supportdifferententitytypes,recursiveprograms,etc.
constraintsnoO(1)apxconstraintsnoO(1)apxy p g p g
haveanO(1)apxy p g p g
haveanO(1)apx
Thm:AllsoftprogramsO(1) Expert:multiwaycuthard
Systemproperties:(1)StreamingalgorithmSystemproperties:(1)Streamingalgorithm(2)linearin#ofmatches(notn2)(3)Userinteraction(2)linearin#ofmatches(notn2)(3)Userinteraction
Features:Supportforweights,referencetables(partially),andcorrespondinghardnessresults.Features:Supportforweights,referencetables(partially),andcorrespondinghardnessresults.
-
HybridApproachesMethod Evaluation
ConstraintbasedEntity
Twolayermodel:Layer1:Generativemodelfordatasetsthatsatisfy
ResearchersandIMDBwith
Matching[Shen,Li&Doan,AAAI05];
constraints;Layer2:EMalgorithmandtherelaxationlabelingalgorithmtoperformmatching.Ineachiteration,useEM to estimate parameters of the generative model
noiseadded
AAAI05];buildson(Li,Morie,&Roth,AIMag 2004)
EMtoestimateparametersofthegenerativemodelandamatchingassignment,thenemploysrelaxationlabelingtoexploittheconstraints
Dedupalog[Arasu,Re,Suciu,ICDE09]
Declarativespecificationforrichcollectionofconstraintswithnicesyntactic sugaraddedtodatalogforER.Inference:Correlationclustering+voting
Precision/RecallonCora,subsetofACMdataset
-
Summary:CollectiveApproaches Decisionsforclustermembershipdependsonotherclusters
Similaritypropagationapproachesy p p g pp ProbabilisticModels
GenerativeModelsU di d M d l UndirectedModels
HybridApproaches
Nonprobabilisticapproachesoftenscalebetterthangenerativeprobabilisticapproaches
Undirected/constraintbasedmodelsareofteneasiertospecify Scalingundirectedmodelsactiveareaofresearch
-
PART 3SCALING ER TO BIGDATAPART3SCALINGERTOBIG DATA
-
ScalingERtoBigData Blocking/CanopyGeneration Distributed ERDistributedER
-
PART 3BLOCKING/CANOPY GENERATIONPART3aBLOCKING/CANOPYGENERATION
-
Blocking:Motivation Navepairwise:|R|2 pairwise comparisons
1000 business listings each from 1,000 different cities across1000businesslistingseachfrom1,000differentcitiesacrosstheworld
1trillioncomparisons 11.6days (ifeachcomparisonis1s)
Mentionsfromdifferentcitiesareunlikelytobematches BlockingCriterion:City 1billioncomparisons 16minutes(ifeachcomparisonis1s)
-
Blocking:Motivation Mentionsfromdifferentcitiesareunlikelytobematches
May miss potential matchesMaymisspotentialmatches
-
Blocking:MotivationMatchingPairsofRecords
PairsofRecordssatisfying
Blockingcriterion
SetofallPairsofRecords
-
BlockingAlgorithms1 Hashbasedblocking
EachblockCi isassociatedwithahashkeyhi. Mentionx ishashedtoCi ifhash(x)=hi. Withinablock,allpairsarecompared. Each hash function results in disjoint blocks.Eachhashfunctionresultsindisjointblocks.
Whathashfunction? Deterministicfunctionofattributevalues BooleanFunctionsoverattributevalues
[Bilenko etalICDM06,MichelsonetalAAAI06,[ , ,DasSarma etalCIKM12]
minHash (minwiseindependentpermutations)[Broder etalSTOC98]
-
BlockingAlgorithms2 Pairwise Similarity/Neighborhoodbasedblocking
Nearby nodes according to a similarity metric are clusteredNearbynodesaccordingtoasimilaritymetricareclusteredtogether
Resultsinnondisjointcanopies.
Techniques SortedNeighborhoodApproach[HernandezetalSIGMOD95] CanopyClustering[McCallumetalKDD00]
-
SimpleBlocking:InvertedIndexonaKey
Examplesofblockingkeys: First three characters of last nameFirstthreecharactersoflastname City+State+Zip CharacterorTokenngrams Minimuminfrequentngrams
-
LearningOptimalBlockingFunctions Usingoneormoreblockingkeysmaybeinsufficient
2,376,206 Americans shared the surname Smith in the 2000 US2,376,206American ssharedthesurnameSmithinthe2000US NULLvaluesmaycreatelargeblocks.
Solution:Constructblockingfunctionsbycombiningsimplefunctions
-
ComplexBlockingFunctions Conjunctionoffunctions[MichelsonetalAAAI06,Bilenko etalICDM06]
{City}AND{lastfourdigitsofphone}
Chaintrees[DasSarma etalCIKM12]If ({Ci } NULL LA) h {l f di i f h } AND { d } If({City}=NULLorLA)then {lastfourdigitsofphone}AND{areacode}
else {lastfourdigitsofphone}AND{City}
BlkTrees [DasSarma etalCIKM12]
-
LearninganOptimalfunction[Bilenko etalICDM06] Findkblockingfunctionsthateliminatethemostnonmatches,whileretainingalmostallmatches., g Needatrainingsetofpositiveandnegativepairs
AlgorithmIdea:RedBlueSetCover
PositiveExamples
Blocking Keys
PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered
PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered
NegativeExamples
BlockingKeys(b)Numberofrednodescoveredisminimized(b)Numberofrednodescoveredisminimized
-
LearninganOptimalfunction[Bilenko etalICDM06] AlgorithmIdea:RedBlueSetCover
PositiveExamples Pick k Blocking keys such thatPick k Blocking keys such thatp
BlockingKeys
PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered(b)Numberofrednodes
PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered(b)Numberofrednodes
Greedy Algorithm
NegativeExamplescoveredisminimizedcoveredisminimized
GreedyAlgorithm: Constructgoodconjunctionsofblockingkeys{p1,p2,}. Pick k conjunctions {p p p } such that the following is Pickkconjunctions{pi1,pi2,,pik},suchthatthefollowingisminimized
-
minHash (Minwise IndependentPermutations) LetFx beasetoffeaturesformentionx
(functions of) attribute values(functionsof)attributevalues characterngrams optimalblockingfunctions
Let bearandompermutationoffeaturesinFx E.g.,orderimposedbyarandomhashfunction
minHash(x)=minimumelementinFx accordingto
-
WhyminHash works?Surprisingproperty:Forarandompermutation,
How to build a blocking scheme such that only pairs withHowtobuildablockingschemesuchthatonlypairswithJacquardsimilarity>sfallinthesameblock(withhighprob)?
`Probabilitythat
(x,y)mentionsarebl k d t thblockedtogether
Similarity(x,y)
-
BlockingusingminHashes ComputeminHashes usingr*kpermutations(hashfunctions))
Band of rminHashesBandofrminHashes
Signatures that match on 1 out of k bands go to thek blocks
Signature sthatmatchon1outofk bands,gotothesameblock.
-
minHash AnalysisFalseNegatives:(missingmatches)P(pair x y not in the same block
Sim(s) P(notsameblock)P(pairx,y notinthesameblock
withJacquardsim =s)block)
0.9 108
0.8 0.00035should be very low for high similarity pairs
False Positives: (blocking nonmatches)
0.7 0.025
0.6 0.2
shouldbeverylowforhighsimilaritypairs
FalsePositives:(blockingnonmatches)P(pairx,y inthesameblock
with Jacquard sim = s)
0.5 0.52
0.4 0.81withJacquardsim s)0.3 0.95
0.2 0.994
0.1 0.9998
-
CanopyClustering[McCallumetalKDD00]Input:MentionsM,
d(x,y),adistancemetric, Inmultiplecanopies
thresholdsT1 >T2Algorithm:1 Pi k d l t f M
canopies
1. PickarandomelementxfromM2. CreatenewcanopyCx using
mentionsys.t.d(x,y)
-
PART 3 bDISTRIBUTED ERPART3bDISTRIBUTEDER
-
DistributedERM d i l f l k Mapreduceisverypopularforlargetasks Simpleprogrammingmodelformassivelydistributeddata
Hadoop providesfaulttoleranceandisopensource
MapPhase(perrecordcomputation)
ReducePhase(globalcomputation)
Shuffle
-
ERwithDisjointBlocking
Map Phase Reduce PhaseComputeBlocksinMap RemainingERinReduce
MapPhase(perrecordcomputation)
ReducePhase(globalcomputation)
Shuffle
BlockID Noneedtocomparerecordsacross
reducersreducers
-
NondisjointBlocking Howtoblock?
Hashbased: need an efficient technique to group records ifHash based:needanefficienttechniquetogrouprecordsiftheymatchonnoutofkblockingkeys[Vernica etalSIGMOD10]
Distancebased:canopyclusteringonmapreduce[Mahout] IterativeBlocking[Whang etalSIGMOD09]
-
Problem:Informationneededforarecordisinmultiplereducers.p
Informationneededforarecordisinmultiplereducers. Example 1:Example1:
Reducer1:amatcheswithb Reducer2:amatcheswithcN d t i t i d t tl l b Needtocommunicateinordertocorrectlyresolvea,b,c
-
Problem:Informationneededforarecordisinmultiplereducers.p
Example2:Dedup papersandauthorsId Author1 Author2 PaperA1 JohnSmith RichardJohnson IndicesandViewsA2 JSmith RJohnson SQLQueriesA Dr Smyth R Johnson Indices and ViewsA3 Dr.Smyth RJohnson IndicesandViews
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
Problem:Informationneededforarecordisinmultiplereducers.p
Canopyclusteringresultsinnondisjointclusters[McCallumetalKDD00]
JohnSmithRichard
J.Smith JohnS.
J h J bRichardJohnson SmithRichard
SmithRichardM.Johnson
R.Smith
JohnJacob
CanopyCanopy CCCanopyf
Canopyf
Johnson
pyfor
Richard
pyfor
Richard
CanopyforSmithCanopyforSmith
forJohnforJohn
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
Problem:Informationneededforarecordisinmultiplereducers.p
CoAuthor(A1,B1) CoAuthor(A2,B2)match(B1,B2)match(A1,A2)
CoAuthor rulegroundstothecorrelationmatch(RichardJohnson,RJohnson)=>match(J.Smith,JohnSmith)
JohnSmith
J.Smith JohnS.
SteveRichard Smith JohnJacobSteve
JohnsonR.Smith
CanopyCanopy CanopyCanopyR
Johnson
Johnson
Canopyfor
Johnson
Canopyfor
Johnson
CanopyforSmithCanopyforSmith
forJohnforJohn
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
Problem:Informationneededforarecordisinmultiplereducers.p
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,KangetalICDM2009]
C l i Cl i / C ll i ER i h+CorrelationClustering/CollectiveERineachcomponent
Solution2:CorrelationClustering/CollectiveERineachcanopyg / py+MessagePassing[Rastogi etalVLDB11]
-
Problem:Informationneededforarecordisinmultiplereducers.p
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,KangetalICDM2009]
C l i Cl i / C ll i ER i h+CorrelationClustering/CollectiveERineachcomponent
Connectedcomponentscanbelargeinrelational/multientityER.p g / y
Solution2:CorrelationClustering/CollectiveERineachcanopy+MessagePassing[Rastogi etalVLDB11]
-
MessagePassing
Simple Message Passing (SMP)SimpleMessagePassing(SMP)1. RunentitymatcherM locallyineachcanopy2. IfM findsamatch(r1,r2)insomecanopy,passitas1 2
evidencetoallcanopies3. RerunMwithineachcanopyusingnewevidence
il h f d i h4. Repeatuntilnonewmatchesfoundineachcanopy
Runtime: O(k2 f(k) c)Runtime:O(k2 f(k)c) k:maximumsizeofacanopy f(k):TimetakenbyERoncanopyofsizek c:numberofcanopies
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
FormalPropertiesforawellbehavedERmethod
Convergence:No.ofstepsno.ofmatches
Consistency:Outputindependentofthecanopyorder
Soundness:Eachoutputmatchisactuallyatruematch
y p p py
Completeness:Eachtruematchisalsoaoutputmatch
JohnSmith
J.Smith JohnS.
JohnJacobRichardSmith
RichardJohnson
SmithR.SmithRichardM.
Johnson
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
Completeness
Papers2and3matchonlyifacanopyknows thatknowsthatmatch(a1,a2)match(b2,b3)
t h( 2 3)match(c2,c3)
Simplemessagepassingwillnotfindanymatches thus,nomessagesarepassed,noprogress
Solution:Maximalmessagepassing Sendamessageifthereisapotentialformatch
Slideadaptedfrom[Rastogi etalVLDB11]talk
-
SummaryofScalability O(|R|2)pairwise computationscanbeprohibitive.
Blockingeliminatescomparisonsonalargefractionofnonmatches.
EqualitybasedBlocking: Construct(oneormore)blockingkeysfromfeatures
Records not matching on any key are not compared Recordsnotmatchingonanykeyarenotcompared.
Neighbohood basedBlocking: Formoverlappingcanopiesofrecordsbasedonsimilarity. Onlycomparerecordswithinacluster.
Computingconnectedcomponents/MessagePassinginadditiontoblocking can help distribute ERblockingcanhelpdistributeER.
-
P t 4CHALLENGES AND FUTUREPart4CHALLENGESANDFUTUREDIRECTIONS
-
Challenges Sofar,wehaveviewedERasaonetimeprocessappliedtoentire
database;noneoftheseholdinrealworld. Temporal ERTemporalER
ERalgorithmsneedtoaccountforchangeinrealworld Reasoningaboutmultiplesources[Pal&M etal.WWW12] Modeltransitions[LietalVLDB11]
Reasoningaboutsourcequality Sourcesarenotindependent CopyingProblem[DongetalVLDB09]
QueryTimeER Howdoweselectivelydeterminethesmallestnumberofrecordstoresolve,so
wegetaccurateresultsforaparticularquery? Collective resolution for queries [Bhattacharya & Getoor JAIR07]Collectiveresolutionforqueries[Bhattacharya&GetoorJAIR07]
ER&Usergenerateddata Deduplicated entitiesinteractwithusersintherealworld
Userstag/associatephotos/reviewswithbusinessesonGoogle/Yahoo Whatshouldbedonetosupportinteractions?
-
OpenIssues ERisoftenpartofbiggerinferenceproblem
Pipelinedapproachesandjointapproachestoinformationextractionand graph identificationandgraphidentification
HowcanwecharacterizehowERerrorsaffectoverallqualityofresults?
ER Theory ERTheory Needbettersupportfortheorywhichcangiverelationallearning
bounds ER & Privacy ER&Privacy
ERenablesrecordreidentification HowdowedevelopatheoryofprivacypreservingER?ER B h k ERBenchmarks NeedforlargescalerealworldERdatasetswithgroundtruth Syntheticdatausefulforscalingbuthardtocapturerichcomplexities
f l ldofrealworld
-
Summary Growingomnipresenceofmassivelinkeddata,andtheneed
forcreatingknowledgebasesfromtextandunstructureddatamotivate a number of challenges in ERmotivateanumberofchallengesinER
EspeciallyinterestingchallengesandopportunitiesforERand/socialmedia/usergenerateddata
As data noise and knowledge grows greater needs &Asdata,noise,andknowledgegrows,greaterneeds&opportunitiesforintelligentreasoningaboutentityresolution
M th h ll Manyotherchallenges Largescaleidentitymanagement Understandingtheoreticalpotentials&limitsofER
-
THANK YOU!THANKYOU!
-
References IntroW.Willinger etal,MathematicsandtheInternet:ASourceofEnormousConfusionand
GreatPotential,NoticesoftheAMS56(5),2009L Gill and M Goldcare English National Record Linkage of Hospital Episode Statistics andL.GillandM.Goldcare, EnglishNationalRecordLinkageofHospitalEpisodeStatisticsand
DeathRegistrationRecords,ReporttotheDepartmentofHealth,2003T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer2007A. Elmagrid et al, Duplicate Record Detection, TKDE 2007A.Elmagrid etal, DuplicateRecordDetection ,TKDE2007P.Christen,DataMatching,Springer2012N.Koudas etal,RecordLinkage:SimilaritymeasuresandAlgorithms,SIGMOD2006X. Dong & F. Naumann, Data fusionResolving data conflicts for integration, VLDB 2009X.Dong&F.Naumann, Datafusion Resolvingdataconflictsforintegration ,VLDB2009L.Getoor &A.Machanavajjhala,EntityResolution:Theory,PracticeandOpenChallenges,
AAAI2012
-
References SingleEntityERD.Menestrina etal,EvaluationEntityResolutionResults,PVLDB3(12),2010M.Cochinwala etal,Efficientdatareconciliation,InformationSciences137(14),2001M Bilenko & R Mooney Adaptive Duplicate Detection Using Learnable String SimilarityM.Bilenko &R.Mooney, AdaptiveDuplicateDetectionUsingLearnableStringSimilarity
Measures,KDD2003P.Christen,Automaticrecordlinkageusingseedednearestneighbour andsupportvector
machineclassification.,KDD2008Z.Chenetal,Exploitingcontextanalysisforcombiningmultipleentityresolutionsystems,
SIGMOD2009A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
f Coreference,NIPS2004H.Newcombe etal,Automaticlinkageofvitalrecords,Science1959I.Fellegi &A.Sunter,ATheoryforRecordLinkage,JASA1969W.Winkler,OverviewofRecordLinkageandCurrentResearchDirections,ResearchReport
Series,USCensus,2006T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer,2007P R ik & W C h A Hi hi l G hi l M d l f R d Li k UAI 2004P.Ravikumar &W.Cohen,AHierarchicalGraphicalModelforRecordLinkage,UAI2004
-
References SingleEntityER(contd.)S.Sarawagi etal,InteractiveDeduplication usingActiveLearning,KDD2000S.Tejada etal,LearningObjectIdentificationRulesforInformationIntegration,IS2001A Arasu et al On active learning of record matching packages SIGMOD 2010A.Arasu etal, Onactivelearningofrecordmatchingpackages ,SIGMOD2010K.Bellare etal,Activesamplingforentitymatching,KDD2012A.Beygelzimer etal,AgnosticActiveLearningwithoutConstraints,NIPS2010J Wang et al CrowdER: Crowdsourcing Entity Resolution PVLDB 5(11) 2012J.Wangetal, CrowdER:Crowdsourcing EntityResolution ,PVLDB5(11),2012A.Marcusetal,HumanpoweredSortsandJoins,PVLDB5(1),2011
-
References SingleEntityER(contd.)R. Gupta & S. Sarawagi, Answering Table Augmentation Queries from Unstructured Lists on the Web,R.Gupta&S.Sarawagi, AnsweringTableAugmentationQueriesfromUnstructuredListsontheWeb ,
PVLDB2(1),2009A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,CIKM
2012M Bilenko et al Adaptive Product Normalization: Using Online Learning for Record Linkage inM.Bilenko etal, AdaptiveProductNormalization:UsingOnlineLearningforRecordLinkagein
ComparisonShopping,ICDM2005S.Chaudhuri etal,RobustIdentificationofFuzzyDuplicates,ICDE2005W.Soonetal,Amachinelearningapproachtocoreference resolutionofnounphrases,
C t ti l Li i ti 27(4) 2001ComputationalLinguistics27(4)2001N.Bansal etal,CorrelationClustering,MachineLearning56(13),2004V.Ng&C.Cardie,Improvingmachinelearningapproachestocoreference resolution,ACL2002M.Elsner &E.Charnaik,Youtalkingtome?acorpusandalgorithmforconversation, g p g
disentanglement,ACLHLT2008M.Elsner &W.Schudy,BoundingandComparingMethodsforCorrelationClusteringBeyondILP,
ILPNLP2009N Ailon et al Aggregating inconsistent information: Ranking and clustering JACM 55(5) 2008N.Ailon etal, Aggregatinginconsistentinformation:Rankingandclustering ,JACM55(5),2008X.Dongetal,IntegratingConflictingData:TheRoleofSourceDependence,PVLDB2(1),2009A.Paletal,InformationIntegrationoverTimeinUnreliableandUncertainEnvironments,WWW
2012A.Culotta etal,CanonicalizationofDatabaseRecordsusingAdaptiveSimilarityMeasures,KDD2007O.Benjelloun etal,Swoosh:AgenericapproachtoEntityResolution,VLDBJ18(1),2009
-
References Constraints&MultiRelationalERA h k i h l li i i f d li i d h VL 2002R.Ananthakrishna et.al,Eliminatingfuzzyduplicatesindatawarehouses,VLDB2002
A.Arasu etal,LargeScaleDeduplication withConstraintsusingDedupalog,ICDE2009S.Chaudhuri etal.,Leveragingaggregateconstraintsfordeduplication,SIGMOD07X.Dongetal,ReferenceRecounciliation inComplexInformationSpaces,SIGMOD2005I.Bhattacharya&L.Getoor,CollectiveEntityResolutioninRelationalData,TKDD2007I.Bhattacharya&L.Getoor,ALatentDirichlet ModelforUnsupervisedEntityResolution,SDM2007P.Bohannonetal.,ConditionalFunctionalDependenciesforDataCleaning,ICDE 2007M.Broecheler &L.Getoor,ProbabilisticSimilarityLogic,UAI2010, y g ,W.Fan,Dependenciesrevisitedforimprovingdataquality,PODS2008H.Pasula etal,IdentityUncertaintyandCitationMatching,NIPS2002D.Kalashnikovetal,DomainIndependentDataCleaningviaAnalysisofEntityRelationshipGraph,
TODS06J.Laffertyetal,ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabeling
SequenceData.,ICML2001X.Lietal,IdentificationandTracingofAmbiguousNames:DiscriminativeandGenerative
Approaches,AAAI2004A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004M.Richardson&P.Domingos,MarkovLogic,MachineLearning62,2006W.Shen etal.,ConstraintbasedEntityMatching,AAAI2005P.Singla &P.Domingos,EntityResolutionwithMarkovLogic,ICDM2006Whang etal.,GenericEntityResolutionwithNegativeRules,VLDBJ2009Whang elal.,JointEntityResolution,ICDE2012
-
References BlockingM.Bilenko etal,AdaptiveBlocking:LearningtoScaleUpRecordLinkageandClustering,ICDM
2006M.Michelson&C.Knoblock,LearningBlockingSchemesforRecordLinkage,AAAI2006g g gA.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,
CIKM2012A.Broder etal,MinWiseIndependentPermutations,STOC1998G P di l B d 100 illi i i l l bl ki b d l i fG.Papadias etal,Beyond100millionentities:largescaleblockingbasedresolutionfor
heterogenous data,WSDM2012M.Hernandez&S.Stolfo,Themerge/purgeproblemforlargedatabases,SIGMOD1995A. McCallum et al, Efficient clustering of highdimensional data sets with application toA.McCallumetal, Efficientclusteringofhigh dimensionaldatasetswithapplicationto
referencematching,KDD2000L.Kolbetal,Dedoop:Efficientdeduplication withHadoop,(demo)PVLDB5(12),2012R.Vernica etal,EfficientParallelSetSimilarityJoinsUsingMapReduce,SIGMOD2010ApacheMahout:ScalableMachineLearningandDataMining,http://mahout.apache.org/S.Whang etal,EntityResolutionwithIterativeBlocking,SIGMOD2009U.Kangetal,PEGASUS:APetaScaleGraphMiningSystem Implementationand
Observations ICDM 2009Observations ,ICDM2009V.Rastogi etal,FindingConnectedComponentsonMapreduceinPolyLogRounds,Corr 2012V.Rastogi etal,LargeScaleCollectiveEntityMatching,PVLDB4(4),2011
-
References Challenges&FutureDirectionsI.BhattacharyaandL.Getoor,"QuerytimeEntityResolution",JAIR2007X.Dong,L.BertiEquille,D.Srivastava,Truthdiscoveryandcopyingdetectioninadynamic
world VLDB 2009world ,VLDB2009P.Li,X.Dong,A.Maurino,D.Srivastava,LinkingTemporalRecords,VLDB2011A. Pal,V.Rastogi,A.Machanavajjhala,P.Bohannon,Informationintegrationovertimein
unreliable and uncertain environments, WWW 2012unreliableanduncertainenvironments ,WWW2012