entity resolution: tutorial

of 179 /179
Entity Resolution: Tutorial Entity Resolution: Tutorial Li Gt Ashwin Machanavajjhala Lise Getoor University of Maryland College Park MD Ashwin Machanavajjhala Duke University Durham, NC College Park, MD http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf http://goo.gl/f5eym

Author: sudhakar-bhushan

Post on 16-Jan-2016

35 views

Category:

Documents


0 download

Embed Size (px)

DESCRIPTION

Lise Getoor Ashwin MachanavajjhalaUniversity of MarylandCollege Park MDDuke University

TRANSCRIPT

  • Entity Resolution: TutorialEntityResolution:Tutorial

    Li G t Ashwin MachanavajjhalaLise GetoorUniversityofMaryland

    College Park MD

    Ashwin MachanavajjhalaDukeUniversityDurham,NCCollegePark,MD ,

    http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

    http://goo.gl/f5eym

  • WhatisEntityResolution?Problemofidentifyingandlinking/groupingdifferent

    manifestationsofthesamerealworldobject.f f j

    Examplesofmanifestationsandobjects:p j Differentwaysofaddressing(names,emailaddresses,FaceBook

    accounts)thesamepersonintext. Webpageswithdifferingdescriptionsofthesamebusiness. Differentphotosofthesameobject.

  • Ironically,EntityResolutionhasmanyduplicatenames

    DuplicatedetectionRecordlinkage

    Coreference resolution

    Object consolidation

    Coreference resolutionReferencereconciliation

    Fuzzymatch

    Deduplication

    ObjectidentificationObjectconsolidation

    Entity clustering

    y

    Entityclustering

    Household matching

    Approximatematch

    Merge/purgeIdentityuncertainty

    ReferencematchingHouseholding

    HouseholdmatchingMerge/purge

    Hardeningsoftdatabases

    Doubles

    gg

  • ERMotivatingExamples LinkingCensusRecords Public HealthPublicHealth Websearch Comparison shopping Comparisonshopping Counterterrorism Spam detection Spamdetection MachineReading

  • ERandNetworkAnalysis

    before after

  • Motivation:NetworkScience Measuringthetopologyoftheinternetusingtraceroute

  • IPAliasingProblem[Willinger etal.2009]

  • IPAliasingProblem[Willinger etal.2009]

  • TraditionalChallengesinER Name/Attributeambiguity

    Thomas CruiseThomasCruise

    MichaelJordan

  • TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry

  • TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry MissingValues

    [Gilletal;Univ ofOxford2003]

  • TraditionalChallengesinER Name/Attributeambiguity Errors due to data entryErrorsduetodataentry MissingValues Changing Attributes ChangingAttributes

    Dataformatting

    / Abbreviations/DataTruncation

  • BigDataERChallenges

  • BigDataERChallenges LargerandmoreDatasets

    NeedefficientparalleltechniquesM H t it MoreHeterogeneity Unstructured,UncleanandIncompletedata.Diversedatatypes. No longer just matching names with names, but Amazon profiles withNolongerjustmatchingnameswithnames,butAmazonprofileswith

    browsinghistoryonGoogleandfriendsnetworkinFacebook.

  • BigDataERChallenges LargerandmoreDatasets

    NeedefficientparalleltechniquesM H t it MoreHeterogeneity Unstructured,UncleanandIncompletedata.Diversedatatypes.

    More linked Morelinked Needtoinferrelationshipsinadditiontoequality

    MultiRelational Dealwithstructureofentities(AreWalmart andWalmart Pharmacy

    thesame?)

    l d Multidomain Customizablemethodsthatspanacrossdomains

    Multiple applications (web search versus comparison shopping) Multipleapplications(websearchversuscomparisonshopping) Servediverseapplicationwithdifferentaccuracyrequirements

  • Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData4 Challenges & Future Directions4. Challenges&FutureDirections

  • Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER

    a) DataPreparationandMatchFeaturesb) Pairwise ERc) ConstraintsinERd) Algorithms

    Record Linkage

    5minutebreak

    RecordLinkage Deduplication CollectiveER

    3. ScalingERtoBigData4. Challenges&FutureDirections

  • Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData

    a) Blocking/Canopy Generationa) Blocking/CanopyGenerationb) DistributedER

    4. Challenges & Future Directions4. Challenges&FutureDirections

  • Outline1. AbstractProblemStatement2 Algorithmic Foundations of ER2. AlgorithmicFoundationsofER3. ScalingERtoBigData4 Challenges & Future Directions4. Challenges&FutureDirections

  • ScopeoftheTutorial Whatwecover:

    FundamentalalgorithmicconceptsinER ScalingERtobigdatasets TaxonomyofcurrentERalgorithms

    Whatwedonotcover: Schema/ontology resolutionSchema/ontologyresolution Datafusion/integration/exchange/cleaning Entity/InformationExtraction PrivacyaspectsofEntityResolution DetailsonsimilaritymeasuresTechnical details and proofs Technicaldetailsandproofs

  • ERReferences Book/SurveyArticles

    DataQualityandRecordLinkageTechniques[T Herzog F Scheuren W Winkler Springer 07][T.Herzog,F.Scheuren,W.Winkler,Springer, 07]

    DuplicateRecordDetection[A.Elmagrid,P.Ipeirotis,V.Verykios,TKDE07] AnIntroductiontoDuplicateDetection[F.Naumann,M.Herschel,M&P

    synthesislectures2010]y ] EvaluationofEntityResolutionApproachedonRealworldMatchProblems

    [H.Kopke,A.Thor,E.Rahm,PVLDB2010] DataMatching[P.Christen,Springer2012]

    Tutorials RecordLinkage:SimilaritymeasuresandAlgorithms

    [N.Koudas,S.Sarawagi,D.Srivatsava SIGMOD06] DatafusionResolvingdataconflictsforintegration

    [X.Dong,F.Naumann VLDB09]Entity Resolution: Theory Practice and Open Challenges EntityResolution:Theory,PracticeandOpenChallengeshttp://goo.gl/Ui38o [L.Getoor,A.Machanavajjhala AAAI12]

  • PART 1ABSTRACT PROBLEM STATEMENTPART1ABSTRACTPROBLEMSTATEMENT

  • AbstractProblemStatementRealWorld DigitalWorld

    Records//Mentions

  • Deduplication ProblemStatement Clustertherecords/mentionsthatcorrespondtosameentityy

  • Deduplication ProblemStatement Clustertherecords/mentionsthatcorrespondtosameentityy Intensional Variant:Computeclusterrepresentative

  • RecordLinkageProblemStatement Linkrecordsthatmatchacrossdatabases

    AB

  • ReferenceMatchingProblem Matchnoisyrecordstocleanrecordsinareferencetable

    ReferenceblTable

  • AbstractProblemStatementRealWorld DigitalWorld

  • Deduplication ProblemStatement

  • Deduplication withCanonicalization

  • GraphAlignment(&motifsearch)

    Graph1 Graph2

  • Relationshipsarecrucial

  • Relationshipsarecrucial

  • Notation R:setofrecords/mentions(typed) H: set of relations / hyperedges (typed)H:setofrelations/hyperedges (typed) M:setofmatches(recordpairsthatcorrespondtosameentity) N: set of non matches ( d i di t diff t titi ) N:setofnonmatches(recordpairscorrespondingtodifferententities) E:setofentities L set of links L:setoflinks

    T (M N E L ) di l ld True(Mtrue,Ntrue,Etrue,Ltrue):accordingtorealworldvs Predicted(Mpred,Npred,Epred,Lpred ):byalgorithm

  • RelationshipbetweenMtrue andMpred Mtrue (SameAs ,Equivalence) M (Similar representations and similar attributes)Mpred (Similarrepresentationsandsimilarattributes)

    MtrueRxR Mpredtrue pred

  • Metrics Pairwise metrics

    Precision/Recall,F1 #ofpredictedmatchingpairs

    Clusterlevelmetrics purity,completeness,complexityPrecision/Recall/F1: Cluster level closest cluster MUC B3 Precision/Recall/F1:Clusterlevel,closestcluster,MUC,B3 ,RandIndex

    Generalizedmergedistance[Menestrina etal,PVLDB10]

    Littleworkthatevaluationscorrectpredictionoflinks

  • TypicalAssumptionsMade Eachrecord/mentionisassociatedwithasinglerealworldentity.y

    Inrecordlinkage,noduplicatesinthesamesource If two records/mentions are identical then they are true Iftworecords/mentionsareidentical,thentheyaretruematches

    ( ) M(,) Mtrue

  • ERversusClassificationFindingmatchesvs nonmatchesisaclassificationproblem

    Imbalanced:typicallyO(R)matches,O(R^2)nonmatches

    Instancesarepairsofrecords.PairsarenotIID

    (,) Mtrue( ) MAND

    ( , ) Mtrue

    (,) MtrueAND(,) Mtrue

  • ERvs (Multirelational)ClusteringComputingentitiesfromrecordsisaclusteringproblem

    Intypicalclusteringalgorithms(kmeans,LDA,etc.)number of clusters is a constant or sub linear in R.numberofclustersisaconstantorsublinearinR.

    In ER: number of clusters is linear in R and averageInER:numberofclustersislinearinR,andaverageclustersizeisaconstant.Significantfractionofclustersaresingletons.g

  • PART 2ALGORITHMIC FOUNDATIONS OF ERPART2ALGORITHMICFOUNDATIONSOFER

  • OutlineofPart2a) DataPreparationandMatchFeatures

    b) Pairwise ER Determiningwhetherornotapairofrecordsmatch

    c) ConstraintsinER5minutebreak

    d) Algorithms Recordlinkage(Propagationthroughexclusitivity negativeconstraint), Deduplication (Propagationthroughtransitivitypositiveconstraint), Collective(PropagationthroughGeneralConstraints)

  • MOTIVATING EXAMPLE:MOTIVATINGEXAMPLE:BIBLIOGRAPHICDOMAIN

  • Entities&RelationsinBibliographicDomain

    Wrote AuthorName Author Mention

    NameString

    Paper

    Research Area NameString

    I tit ti Institute MentionWorksAtp

    Title# of AuthorsTopicWord1

    InstitutionName

    Institute MentionNameString

    Word1Word 2WordN

    Paper MentionTitleString

    Cites VenueName

    Venue MentionNameStringAppearsIn gpp

    :cooccurrencerelationships:resolutionrelationships

    :entityrelationships

  • PART 2DATA PREPARATION &PART2aDATAPREPARATION&MATCHFEATURES

  • Normalization Schemanormalization

    SchemaMatching e.g.,contactnumberandphonenumberC d tt ib t f ll dd t it t t i Compoundattributes fulladdressvs str,city,state,zip

    Nestedattributes Listoffeaturesinonedataset(airconditioning,parking)vs eachfeatureaboolean attributeboolean attribute

    Setvaluedattributes Setofphonesvs primary/secondaryphone

    Record segmentation from text Recordsegmentationfromtext Datanormalization

    Oftenconverttoalllower/allupper;removewhitespaced i d i l h i k hi l detectingandcorrectingvaluesthatcontainknowntypographicalerrorsorvariations,

    expandingabbreviationsandreplacingthemwithstandardforms;replacingnicknames with their proper name formsnicknameswiththeirpropernameforms

    Usuallydonebasedondictionaries(e.g.,commercialdictionaries,postaladdresses,etc.)

  • MatchingFeatures Fortworeferencesxandy,computeacomparisonvectorof

    similarityscoresofcomponentattribute. [1stauthormatchscore,

    papermatchscore,venuematchscore,yearmatchscore,.]

    Similarityscoresy Boolean(matchornotmatch) Realvaluesbasedondistancefunctions

  • SummaryofMatchingFeaturesH dl

    Equalityonaboolean predicate AlignmentbasedorTwotieredGoodforNames

    HandleTypographicalerrors

    Editdistance Levenstein,SmithWaterman,Affine

    Setsimilarity

    JaroWinkler,SoftTFIDF,MongeElkan PhoneticSimilarity

    Soundex Jaccard,Dice

    VectorBased Cosine similarity, TFIDF

    Translationbased Numericdistancebetweenvalues DomainspecificCosinesimilarity,TFIDF

    Useful packages

    pGoodforTextlikereviews/tweets Usefulfor

    abbreviations,alternate names. Usefulpackages:

    SecondString,http://secondstring.sourceforge.net/ Simmetrics:http://sourceforge.net/projects/simmetrics/

    Li Pi htt // li i /li i /i d ht l

    alternatenames.

    LingPipe,http://aliasi.com/lingpipe/index.html

  • RelationalMatchingFeatures Relationalfeaturesareoftensetbased

    Setofcoauthorsforapaperf i i i Setofcitiesinacountry

    Setofproductsmanufacturedbymanufacturer

    Canusesetsimilarityfunctionsmentionedearlier CommonNeighbors:Intersectionsize Jaccards Coefficient: Normalize by union sizeJaccard s Coefficient:Normalizebyunionsize AdarCoefficient:Weightedsetsimilarity

    C b t i il it i t f l Canreasonaboutsimilarityinsetsofvalues AverageorMax Otheraggregates

  • PART 2 bPAIRWISE MATCHINGPART2bPAIRWISE MATCHING

  • Pairwise MatchScoreProblem:Givenavectorofcomponentwisesimilaritiesforapairof

    records(x,y),computeP(xandymatch).

    Solutions:1. Weightedsumoraverageofcomponentwisesimilarityscores.

    Thresholddeterminesmatchornonmatch. 0.5*1stauthormatchscore +0.2*venuematchscore +0.3*papermatchscore.p p Hardtopickweights.

    Matchonlastnamematchmorepredictivethanloginname. Match on Smith less predictive than match on Getoor or Machanavajjhala Matchon Smith lesspredictivethanmatchon Getoor or Machanavajjhala .

    Hardtotuneathreshold.

  • Pairwise MatchScoreProblem:Givenavectorofcomponentwisesimilaritiesforapairof

    records(x,y),computeP(xandymatch).

    Solutions:1. Weightedsumoraverageofcomponentwisesimilarityscores.

    Thresholddeterminesmatchornonmatch.2 Formulate rules about what constitutes a match2. Formulaterulesaboutwhatconstitutesamatch.

    (1stauthormatchscore>0.7ANDvenuematchscore>0.8)OR(papermatchscore>0.9ANDvenuematchscore>0.9)M ll f l ti th i ht t f l i h d Manuallyformulatingtherightsetofrulesishard.

  • BasicMLApproach r =(x,y)isrecordpair, iscomparisonvector,Mmatches,U non

    matches

    Decisionrule)|()|(

    UrPMrPR

    )|(

    Match rtRMatch-Non rtR

  • Fellegi &Sunter Model[FS,Science69] r =(x,y)isrecordpair, iscomparisonvector,Mmatches,U non

    matches

    Decisionrule)|()|(

    UrPMrPR

    )|(

    Match rtR l

    Match-NonMatchPotential

    rtRrtRt

    u

    ul

    NaveBayes Assumption: )|()|( MrPMrP ii

  • MLPairwise Approaches Supervisedmachinelearningalgorithms

    Decisiontrees [Co hin ala et al IS01] [Cochinwala etal,IS01]

    Supportvectormachines [Bilenko &Mooney,KDD03];[Christen,KDD08]

    Ensembles of classifiers Ensemblesofclassifiers [Chenetal.,SIGMOD09]

    ConditionalRandomFields(CRF) [Gupta & Sarawagi VLDB09][Gupta&Sarawagi,VLDB09]

    Issues:Training set generation Trainingsetgeneration

    Imbalancedclasses manymorenegativesthanpositives(evenaftereliminatingobviousnonmatchesusingBlocking)

    Misclassification cost Misclassificationcost

  • CreatingaTrainingSetisakeyissue Constructingatrainingsetishard sincemostpairsofrecordsareeasynonmatches.y 100recordsfrom100cities. Only106 pairsoutoftotal108 (1%)comefromthesamecity

    Somepairsarehardtojudgeevenbyhumans Inherentlyambiguous

    E.g.,ParisHilton(personorbusiness)

    Missingattributes Starbucks,Torontovs Starbucks,QueenStreet,Toronto

  • AvoidingTrainingSetGeneration Unsupervised/SemisupervisedTechniques

    EMbasedtechniquestolearnparameters [Winkler06,Herzogetal07]

    GenerativeModels [Ravikumar & Cohen UAI04][Ravikumar &Cohen,UAI04]

    ActiveLearning CommitteeofClassifiers

    [Sarawagi etalKDD00,Tajeda etalIS01] Provably optimizing precision/recallProvablyoptimizingprecision/recall

    [Arasu etalSIGMOD10,Bellare etalKDD12] Crowdsourcing

    [W t l VLDB 12 M t l VLDB 12 ] [WangetalVLDB12,MarcusetalVLDB12,]

  • CommitteeofClassifiers [Tejada etal,IS01]

  • ActiveLearningwithProvableGuarantees

    Mostactivelearningtechniquesminimize01loss[Beygelzimer etalNIPS2010].

    However,ERisveryimbalanced: Numberofnonmatches>100*numberofmatches. Classifyingallpairsasnonmatcheshaslow01loss(

  • ActiveLearningwithProvableGuarantees

    Monotonicity ofPrecision[Arasu etalSIGMOD10]

    ThereisalargerfractionofmatchesinC1thaninC2.

    Algorithmsearchesfortheoptimalclassifierusingbinaryp g ysearchoneachdimension

  • ActiveLearningwithProvableGuarantees

    [Bellare etalKDD12]

    O (log2 n) calls to a blackbox 0 1 loss active learning algorithmO (log2 n) calls to a blackbox 0 1 loss active learning algorithmO(log2 n)callstoablackbox 01lossactivelearningalgorithm.

    Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10](in the worst case).

    O(log2 n)callstoablackbox 01lossactivelearningalgorithm.

    Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10](in the worst case).

    1. PrecisionConstrainedWeighted01LossProblem(using a Lagrange Multiplier )

    (intheworstcase).(intheworstcase).

    (usingaLagrangeMultiplier).2. Givenafixedvaluefor,weighted01Losscanbeoptimizedby(onecallto)a

    blackbox activelearningclassifier.3 Ri ht l f i t d b hi ll ti l l ifi3. Rightvalueof iscomputedbysearchingoveralloptimalclassifiers.

    Classifiersareembeddedina2dplane(precision/recall) Searchisalongtheconvexhulloftheembeddedclassifiers

  • Crowdsourcing Growinginterestinintegratinghumancomputationindeclarative

    workflowengines. ERisanimportantproblem(e.g.,forevaluatingfuzzyjoins) [WangetalVLDB12,MarcusetalVLDB12,]

    Opportunity:utilizecrowdsourcing forcreatingtrainingsets,orforactivelearning.

    Keyopenissue:Handlingerrorsinhumanjudgments InanexperimentonAmazonMechanicalTurk:

    Pairwise matchingjudgment,eachgivento5differentpeople Majorityofworkersagreedontruthononly90%ofpairwise judgments.

  • SummaryofSingleEntityERAlgorithms

    Manyalgorithmsforindependentclassificationofpairsofrecordsasmatch/nonmatch

    MLbasedclassification&FellegiSunterP Ad d f h Pro:Advancedstateoftheart

    Con:Buildinghighfidelitytrainingsetsisahardproblem

    ActiveLearning&Crowdsourcing forERareactiveareasofresearch.

  • PART 2CONSTRAINTSPART2cCONSTRAINTS

  • Constraints Importantformsofconstraints:

    Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match

    Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2Functional Dependency: If M1 and M2 match then M3 and M4 must FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch

    Transitivityiskeytodeduplication Exclusivityiskeytorecordlinkage Functionaldependenciesfordatacleaning,e.g.,

    [Ananthakrishna etal.,VLDB02][Fan,PODS08][Bohannonetal ICDE07]al,ICDE07]

  • Positive&NegativeEvidence Positive

    Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match

    Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwithM2M2

    FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch

    Negative Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1

    and M3 do not matchandM3donotmatch Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2 FunctionalDependency:IfM1andM2donotmatch,thenM3and

    M4cannotmatch

  • Positive&NegativeEvidence Positive

    Transitivity:IfM1andM2match,M2andM3match,thenM1andyM3match

    Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwithM2M2

    FunctionalDependency:IfM1andM2match,thenM3andM4mustmatch

    Negative Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1

    and M3 do not matchandM3donotmatch Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2 FunctionalDependency:IfM1andM2donotmatch,thenM3and

    M4cannotmatch

  • ConstraintTypesHardConstraint SoftConstraint

    PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i

    IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues

    matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch

    NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct

    IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct

    IfM1,M2dontmatchthenM3,M4cannotmatch

    Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch

    Iftwovenuesdontmatch,thentheirpapersdontmatch

  • ConstraintTypesHardConstraint SoftConstraint

    PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i

    IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues

    matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch

    NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct

    IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct

    IfM1,M2dontmatchthenM3,M4cannotmatch

    Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch

    Iftwovenuesdontmatch,thentheirpapersdontmatch

  • ConstraintTypesHardConstraint SoftConstraint

    PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i

    IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues

    matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch

    NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct

    IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct

    IfM1,M2dontmatchthenM3,M4cannotmatch

    Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch

    Iftwovenuesdontmatch,thentheirpapersdontmatch

  • ConstraintTypesHardConstraint SoftConstraint

    PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i

    IfM1,M2matchthenM3,M4morelikelytomatchIf t t h thIftwopapersmatch,theirvenues

    matchIftwovenuesmatch,thentheirpapersaremorelikelytomatch

    NegativeEvidence MentionM1andM2mustrefertodistinctentities(Uniqueness)Coauthors are distinct

    IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont matchCoauthorsaredistinct

    IfM1,M2dontmatchthenM3,M4cannotmatch

    Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch

    Iftwovenuesdontmatch,thentheirpapersdontmatch

  • ConstraintTypesHardConstraint SoftConstraint

    PositiveEvidence IfM1,M2matchthenM3,M4mustmatchIf t t h th i

    IfM1,M2matchthenM3,M4morelikelytomatchIf t t h th

    Notethatsomeoftheconstraintsmayberelational

    Iftwopapersmatch,theirvenuesmatch

    Iftwovenuesmatch,thentheirpapersaremorelikelytomatch

    yandrequirejoins

    May be directionalNegativeEvidence MentionM1andM2mustreferto

    distinctentities(Uniqueness)Coauthors are distinct

    IfM1,M2dontmatchthenM3,M4lesslikelytomatchIf institutions dont match

    Maybedirectionalorbidirectional

    Constraints can be recursiveCoauthorsaredistinct

    IfM1,M2dontmatchthenM3,M4cannotmatch

    Ifinstitutionsdon tmatch,thenauthorslesslikelytomatch

    Constraintscanberecursive,e.g.,iftwoauthorshavematchingcoauthors,thenh hIftwovenuesdontmatch,thentheir

    papersdontmatchtheymatch

  • AdditionalConstraints AggregateConstraints[Chaudhuri etal.SIGMOD07]

    count constraintscountconstraints EntityAcanlinktoatmostNBs

    Authorshaveatmost5papersatanyconference

    Oth t lik l Otheraggregateslikesum,averagemorecomplex

    Again, these can be either hard or soft constraints,Again,thesecanbeeitherhardorsoftconstraints,providepositiveornegativeevidence

  • MatchDependencies

    Whenmatchingdecisionsdependonothermatching decisions (in other words, matchingmatchingdecisions(inotherwords,matchingdecisionsarenotmadeindependently),werefer to the approach as collectiverefertotheapproachascollective

  • MatchExtent Global: Iftwopapersmatch,thentheirvenuesmatch

    This constraint can be applied to all instances of venueThisconstraintcanbeappliedtoallinstancesofvenuementions

    AlloccurrencesofSIGMODcanbematchedtoInternationalConference on Management of DataConferenceonManagementofData

    Local:Iftwopapersmatch,thentheirauthorsmatch This constraint can only be applied locallyThisconstraintcanonlybeappliedlocally

    DontwanttomatchalloccurrencesofJ.SmithwithJeffSmith,onlyinthecontextofthecurrentpaper

  • Ex.SemanticIntegrityConstraintsType Example

    Aggregate C1=Noresearcher haspublishedmorethanfiveAAAIpapersinayear

    Subsumption C2=IfacitationXfromDBLP matchesacitationYinahomepage,theneachauthormentionedinYmatchessomeauthormentionedinX

    Neighborhood C3=IfauthorsXandYsharesimilarnamesandsomecoauthors,theylik l t t harelikelytomatch

    Incompatible C4 =NoresearcherexistswhohaspublishedinbothHCIandnumericalanalysis

    f i i h d h i il hLayout C5=Iftwomentionsinthesamedocumentsharesimilarnames,theyarelikelytomatch

    Key/Uniqueness C6=MentionsinthePClistingofaconferenceisto differentresearchersresearchers

    Ordering C7=Iftwo citationsmatch,thentheirauthorswillbematchedinorder

    Individual C8=TheresearcherwiththenameMayssam Sariahasfewerthanfi ti i DBLP ( d t t d t)fivementionsinDBLP(newgraduatestudent)

    [Shen,Li&Doan,AAAI05]

  • AlgorithmsforHandlingConstraints Recordlinkage propagationthroughexclusivity

    Weightedkpartitematching

    Deduplication propagationthroughtransitivityC l i l i Correlationclustering

    Collective propagation through general constraintsCollective propagationthroughgeneralconstraints Similaritypropagation

    Dependencygraphs,CollectiveRelationalClusteringP b bili ti h Probabilisticapproaches LDA,CRFs,MarkovLogicNetworks,ProbabilisticRelationalModels,

    Hybridapproaches Dedupalog

  • PART 2 dALGORITHMSPART2dALGORITHMS

  • RECORD LINKAGERECORDLINKAGE

  • 11assumption Matchingbetween(almost)deduplicated databases. Each record in one database matches at most one recordEachrecordinonedatabasematchesatmostonerecordinanotherdatabase.

    Pairwise ERmaymatcharecordinonedatabasewithmorethanonerecordinseconddatabase

  • WeightedKPartiteMatching

    WeightedEdges

    WeightedEdges Edges Edges

    Edgesbetweenpairsofrecordsfromdifferentdatabases Edgeweights

    o Pairwise matchscoreL dd f hio Logoddsofmatching

  • WeightedKPartiteMatching

    Findamatching(eachrecordmatchesatmostoneotherrecord)fromotherdatabase)thatmaximizethesumofweights.

    GeneralproblemisNPhard(3Dmatching) Successive bipartite matching is typically used [G pta & Sara agi VLDB Successivebipartitematchingistypicallyused.[Gupta&Sarawagi,VLDB

    09]

  • DEDUPLICATIONDEDUPLICATION

  • Deduplication =>Transitivity Oftenpairwise ERalgorithmoutputinconsistentresults

    (x,y)Mpred ,(y,z)Mpred ,but(x,z)Mpred

    Idea:Correctthisbyaddingadditionalmatchesusingtransitivelclosure

    In certain cases this is a bad ideaIncertaincases,thisisabadidea. Graphsresultingfrompairwise ERhave

    diameter>20[Rastogi et al Corr 12]

    AddedbyTransitive[Rastogi etalCorr 12]

    Needclusteringsolutionsthatdealwiththisproblemdirectlyby

    TransitiveClosure

    reasoningaboutrecordsjointly.

  • ClusteringbasedER Resolutiondecisionsarenotmadeindependentlyforeachpairofrecords

    Basedonvarietyofclusteringalgorithms,but Numberofclustersunknownaprioiri Many,manysmall(possiblysingleton)clusters

    Oftentakeapairwisesimilaritygraphasinput

    Mayrequiretheconstructionofaclusterrepresentativeorcanonicalentity

  • ClusteringMethodsforER HierarchicalClustering

    [Bilenko etal,ICDM05][ , ]

    NearestNeighborbasedmethods [Chaudhuri etal,ICDE05]

    CorrelationClustering [SoonetalCL01,Bansal etalML04,NgetalACL02,

    Ailon etalJACM08,Elsner etalACL08,Elsner etalILPNLP09]

  • IntegerLinearProgrammingviewofER rxy {0,1},rxy =1ifrecordsx andyareinthesamecluster. w+xy [0,1],costofclusteringxandytogetherxy [ ] g y g wxy [0,1],costofplacingxandyindifferentclusters

    TransitiveTransitiveclosure

  • CorrelationClustering

    Clustermentionssuchthattotal cost is minimi ed

    42

    totalcostisminimizedSolidedgescontributew+xy totheobjectiveDashededgescontributewxy totheobjective

    3

    1

    y

    Costbasedonpairwise similarities

    5

    Additive:w+xy =pxy andwxy =(1pxy) Logarithmic:w+xy =log(pxy)andwxy =log(1pxy)oga t c xy og(pxy) a d xy og( pxy)

  • CorrelationClustering SolvingtheILPisNPhard[Ailon etal2008JACM]

    Anumberofheuristics[Elsner etal2009ILPNLP] Greedy BEST/FIRST/VOTE algorithmsGreedyBEST/FIRST/VOTEalgorithms GreedyPIVOTalgorithm(5approximation) LocalSearch

  • GreedyAlgorithmsS 1 P h d di dStep1:PermutethenodesaccordingarandomStep2:AssignrecordxtotheclusterthatmaximizesQuality

    Start a new cluster if Quality < 0StartanewclusterifQuality0[Soon et al 2001 CL] [Soonetal2001CL]

    VOTE:Assigntoclusterthatminimizesobjectivefunction. [Elsner etal08ACL]

    PracticalNote: Runthealgorithmformanyrandompermutations,andpicktheclusteringwith

    bestobjectivevalue(betterthanaveragerun)

  • Greedywithapproximationguarantees

    PIVOTAlgorithm [Ailon etal2008JACM] Pickarandom(pivot)recordp.(p ) p Newcluster=

    12

    ={1,2,3,4}C={{1,2,3,4}} ={2,4,1,3}C={{1,2},{4},{3}} ={3,2,4,1}C={{1,3},{2},{4}}

    1

    43

    { , , , } {{ , }, { }, { }}

    Whenweightsare0/1, E(cost(greedy))

  • CANONICALIZATIONPART2dCANONICALIZATION

  • Canonicalization Mergeinformationfromduplicatementionstoconstructaclusterrepresentativewithmaximalinformationp

    Starbucks Starbucks,

    3457HillsboroughRoad Starbucks3457HillsboroughRoad,Durham,NCPh:(919)3334444

    Durham,NCPh:null

    Starbacks,

    CriticallyimportantinWebportalswhereusersmust beshownaconsolidatedview

    HillsboroughRd,DurhamPh:(919)3334444

    Eachmentiononlycontainsasubsetoftheattributes

    Mentionscontainvariations(ofnames,addresses)

    Someofthementionshaveincorrectvalues

  • CanonicalizationAlgorithms Rulebased:

    Fornames:typicallylongestnamesareused. Forsetvaluesattributes:UNIONisused.

    F t i [C l tt t l KDD07] l dit di t f fi di Forstrings,[Culotta etalKDD07]learnaneditdistanceforfindingthemostrepresentativecentroid.

    Canusemajorityruletofixerrors(if4outof5sayabusinessisclosed,thenbusinessisclosed). Thismaynotalwaysworkduetocopying[DongetalVLDB09],orwhen

    underlyingdatachanges[PaletalWWW11]

  • CanonicalizationforEfficiency StanfordEntityResolutionFramework[Benjelloun VLDBJ09]

    Considerablackbox matchandmergefunction Matchisapairwise boolean operator Merge:constructcanonicalversionofamatchingpair

    Canminimizetimetocomputematchesbyinterleavingmatchingandmerging esp.,whenmatchandmergefunctions

    satisfymonotonicity properties. r345

    r12 r45

    r1 r2 r3 r4 r5

  • COLLECTIVE ENTITY RESOLUTIONCOLLECTIVEENTITYRESOLUTION

  • CollectiveApproaches Decisionsforclustermembershipdependsonotherclusters

    Nonprobabilisticapproachesp pp SimilarityPropagation

    ProbabilisticModelsG i M d l GenerativeModels

    UndirectedModels

    HybridApproachesy pp

  • SIMILARITY PROPAGATIONSIMILARITYPROPAGATION

  • SimilarityPropagationApproaches Similaritypropagationalgorithmsdefineagraphwhichencodes

    thesimilaritybetweenentitymentionsandmatchingdecisions,andcomputematchingdecisionsbypropagatingsimilarityvalues. Detailsofconstructedgraphandhowthesimilarityiscomputedvaries Algorithms are usually defined procedurallyAlgorithmsareusuallydefinedprocedurally Whileprobabilitiesmaybeencodedinvariouswaysinthealgorithms,there

    isnoglobalprobabilisticmodeldefined

    Approachesoftenmorescalablethanglobalprobabilisticmodels

  • DependencyGraph[Dong et al SIGMOD05 ]

    Constructagraphwherenodesrepresentsimilaritycomparisonsbetween attribute values (realvalued) and match decisions based

    [Dongetal.,SIGMOD05]

    betweenattributevalues(real valued)andmatchdecisionsbasedonmatchingdecisionsofassociatednodes(booleanvalued)

    Asmentionsareresolved,enrichedtocontainassociatednodesofallmatchedmentions

    Similaritypropagateduntilfixedpointisreached

    Negativeconstraints(notmatchnodes)arecheckedaftersimilaritypropagationisperformed,andinconsistenciesarefixed

  • ExploittheDependencyGraphSlid f [D t l SIGMOD05]

    (Distributed,Distributed )(a1,a4)

    Slidesfrom[Dongetal,SIGMOD05]

    (169180,169180)(RobertS.Epstein,Epstein,R.S.)

    1 4

    (p1,p2)

    (a2,a5)

    (MichaelStonebraker,Stonebraker,M.)

    (a3,a6)(v1,v2)

    (EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)

    Referencesimilarity Attributesimilarity

  • ExploittheDependencyGraph

    (Distributed,Distributed )(a1,a4)

    (169180,169180)(RobertS.Epstein,Epstein,R.S.)

    1 4

    (p1,p2)

    (a2,a5)

    (MichaelStonebraker,Stonebraker,M.)

    (a3,a6)(v1,v2)

    (EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)

    Reconciled Similar

  • CollectiveRelationalClustering[Bh h & G TKDD07]

    Constructagraphwhereleafnodesareindividuali

    [Bhattacharya&Getoor,TKDD07]

    mentions Performhierarchicalagglomerativeclusteringtomerge

    l t f ticlustersofmentions Similaritycomputedbasedonacombinationofattributeand relational similarityandrelationalsimilarity

    Whenclustersaremerged,updatethesimilaritiesofanyrelated clusters (clusters corresponding to mentionsrelatedclusters(clusterscorrespondingtomentionswhichcooccurwithmergedmentions)

  • ObjectiveFunctionMi i i Minimize:

    ),(),( jiRRjiAA ccsimwccsimw weightfor

    bweightfor

    lsimilarityof

    bSimilaritybasedonrelationaledges

    b d

    jji j

    Greedy clustering algorithm: merge cluster pair with max reduction in objective function

    attributes relationsattributes betweenci andcj

    )()( ** ccsimccsim where for example for cluster representative c*),(),( jAttributesa

    ijiA ccsimccsim

    ))()(()( cNcNsimccsimand

    for cluster representative c*

    ))(),((),( jijaccardjiR cNcNsimccsim where N(c) are the relational neighbors of c

  • RelationalClusteringAlgorithm1. Findsimilarreferencesusingblocking2. Bootstrapclustersusingattributesandrelations3. Computesimilaritiesforclusterpairsandinsertintopriorityqueue

    4. Repeatuntilpriorityqueueisemptyp p y q p y5. Findclosestclusterpair6. Stopifsimilaritybelowthreshold7 If no negative constraints violated7. Ifnonegativeconstraintsviolated8. Mergetocreatenewcluster9. Constructcanonicalclusterrepresentative10. Updatesimilarityforrelatedclusters

    O(nklogn)algorithmw/efficientimplementation

  • SimilaritypropagationApproachesMethod Notes Constraints Evaluation

    RelDC[Kalashnikovetl TODS06]

    Referencedisambiguationi i

    Modelchoicenodesidentifiedi f

    Contextattraction

    h

    Accuracyandruntime forAuthor

    l i dal,TODS06] usingusingRelationshipbaseddatacleaning(RelDC)

    usingfeaturebasedsimilarity

    measurestherelationalsimilarity

    resolutionanddirectorresolutioninMovie database

    g ( )

    ReferenceReconciliation[Dongetal,

    DependencyGraphforpropagating

    ReferenceenrichmentExplicitly handle

    Both positiveandnegativeconstraints

    Precision/Recall,F1onpersonalinformation

    SIGMOD05] similarities+enforcenonmatchconstraints

    missingvaluesParameterssetbyhand

    managementdata(PIM),Coradataset

    constraints

    CollectiveRelationalClustering

    Modifiedhierarchicalagglomerative

    Constructscanonical entityasmergesare

    Focusoncoauthorresolution and

    Precision/Recall,F1onthreebibliographicg

    [Bhattacharya&Getoor,TKDD07]

    ggclusteringapproach

    gmade propagation

    g pdatasets:CiteSeer,ArXiv,andBioBase,andsyntheticdata

  • PROBABILISTIC MODELS:PROBABILISTICMODELS:GENERATIVEAPPROACHES

  • GenerativeProbabilisticApproaches ProbabilisticsemanticsbasedonDirectedModels

    Model dependencies between match decisions in a generativeModeldependenciesbetweenmatchdecisionsinagenerativemanner

    Disadvantage:acyclicity requirement Varietyofapproaches

    BasedonLatentDirichlet Allocation,BayesianNetworks Examples

    LatentDirichlet Allocation[Bhattacharya&Getoor,SDM07] ProbabilisticRelationalModels[Pasula etal,NIPS02]

  • LDAforEntityResolution:DiscoveringGroups from CoOccurrence RelationsGroupsfromCo OccurrenceRelations

    Stephen C JohnsonStephen P Johnson

    Alfred V Aho

    J ff D Ull

    Ravi Sethi

    M k C ss

    Chris Walshaw Kevin McManus

    M tin E tt

    Bell Labs Group

    Jeffrey D Ullman

    Parallel Processing Research Group

    Mark Cross Martin Everett

    P1: C. Walshaw, M. Cross, M. G. Everett, P4: Alfred V. Aho, Stephen C. Johnson, J ff D UllS. Johnson

    P2: C. Walshaw, M. Cross, M. G. Everett,S. Johnson, K. McManus

    Jefferey D. Ullman

    P5: A. Aho, S. Johnson, J. Ullman

    P3: C. Walshaw, M. Cross, M. G. Everett P6: A. Aho, R. Sethi, J. Ullman

  • LDAERModel Entity label a and group label z for

    each reference r : mixture of groups for each co-

    : mixture of groups for each co-

    occurrence z: multinomial for choosing entity a

    for each group z

    z Va: multinomial for choosing

    reference r from entity a Dirichlet priors with and

    aT

    r

    T

    AV

    InferenceusingblockedGibbssamplingforefficiency(andimprovedaccuracy)

    PR A

  • GenerativeApproachesMethod Learning/Inference

    MethodEvaluation

    [Li Morie & Generative Truncated EM to learn F1 on person[Li,Morie,&Roth,AAAI04]

    Generativemodelformentionsindocuments

    Truncated EMtolearnparametersandMAPinferenceforentities(unsupervised)

    F1on personnames,locationsandorganizationsinTRECdataset

    ProbabilisticRelationalM d l [P l

    ProbabilisticRelationalM d l

    Parameterslearnedonseparatedcorpora,i f d i

    %ofcorrectlyidentifiedlModels[Pasula

    etal.,NIPS03]Models inferencedoneusing

    MCMCclusters onsubsetsofCiteSeer data

    Latent Dirichlet Latent Dirichlet Blocked Gibbs Precision/RecallLatentDirichletAllocation[Bhattacharya&Getoor,

    LatentDirichletAllocationModel

    BlockedGibbsSamplingUnsupervisedapproach

    Precision/Recall/F1onCiteSeerandHEPdata

    SDM06]

  • PROBABILISTIC MODELS:PROBABILISTICMODELS:UNDIRECTEDAPPROACHES

  • UndirectedProbabilisticApproaches ProbabilisticsemanticsbasedonMarkovNetworks

    Advantage: no acyclicity requirementsAdvantage:noacyclicity requirements Insomecases,syntaxbasedonfirstorderlogic

    Advantage:declarativeg

    ExamplesExamples ConditionalRandomFields(CRFs)[McCallum&Wellner,NIPS04]

    MarkovLogicNetworks(MLNs)[Singla &Domingos,ICDM06] ProbabilisticSimilarityLogic[Broecheler &Getoor,UAI10]

  • MarkovLogic AlogicalKBisasetofhardconstraints onthesetofpossibleworlds

    Makethemsoftconstraints;whenaworldviolatesaformula,itbecomeslessprobablebutnotimpossible

    Giveeachformulaaweight Higher weight Stronger constraintHigherweight Strongerconstraint

    isfies s it satf formulaweights oP(world) exp

    [Richardson&Domingos,06]

  • MarkovLogic AMarkovLogicNetwork(MLN) isasetofpairs(F,w)where F isaformulainfirstorderlogic w isarealnumber

    # true groundings of ith clause

    Fi

    ii xnwZXP )(exp1)( Fi

    Iterate over all first-order MLN formulasNormalization Constant

    [Richardson&Domingos,06]

  • ERProblemFormulationinMLNs Given

    A DB of records representing mentions of entities in the real ADBofrecordsrepresentingmentionsofentitiesintherealworld,e.g.papermentions

    A set of fields e g author title venueAsetoffieldse.g.author,title,venue

    Eachrecordrepresentedasasetoftypedpredicatese.g.HasAuthor(paper,author),HasVenue(paper,venue)(p p , ), (p p , )

    Goald h h f h d /f ld f h Todeterminewhichoftherecords/fieldsrefertothesame

    underlyingentity

    Slidesfrom[Singla &Domingos,ICDM06]

  • HandlingEquality IntroduceEquals(x,y) orx=y

    I t d th i f lit Introducetheaxiomsofequality Reflexivity: x=x

    Symmetry: x=y y=x Transitivity: x=y y=z z=x PredicateEquivalence:

    x1 = x2 y1 y2 (R(x1, y1) R(x2,y2))x1 x2 y1 y2 (R(x1,y1) R(x2,y2))

  • Positive,SoftEvidence Introducereversepredicateequivalence

    S l ti ith th tit i id b t Samerelationwiththesameentitygivesevidenceabouttwoentitiesbeingsame

    R(x1,y1) R(x2,y2) x1 =x2 y2 =y2 Nottruelogically,butgivesusefulinformation

    ExampleHasAuthor(C1 J Cox) HasAuthor(C2 Cox J ) C1 = C2HasAuthor(C1,J.Cox) HasAuthor(C2,CoxJ.) C1=C2(J.Cox=CoxJ.)

  • FieldComparison Eachfieldisastringcomposedoftokens

    Introduce HasWord(field word)IntroduceHasWord(field,word)

    Usereversepredicateequivalence

    HasWord(f1,w1) HasWord(f2,w2) w1=w2 f1=f2 Example

    HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) (Cox=Cox)(J.Cox=CoxJ.)

    h d ff h f h d Canhavedifferentweightforeachword

  • TwolevelSimilarity Individualwordsasunits:Cantdealwithspellingmistakes

    Breakeachwordintongrams:IntroduceHasNgram(word,ngram)

    Usereversepredicateequivalenceforwordcomparisons

  • RecordMatching SimplestVersion:Fieldsimilaritiesmeasuredbypresence/absenceofwordsincommon

    HasWord(f1,w1) HasWord(f2,w2) HasField(r1,f1)HasField(r2,f2) w1=w2 r1=r2

    E l ExampleHasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) HasAuthor(P1,J.Cox) HasAuthor(P2,CoxJ.) (Cox=Cox) (P1=P2)) ( , ) ( ) ( )

    Transitivity(f1 = f2) (f2 = f3) ( f3 = f1)(f1 f2) (f2 f3) ( f3 f1)

    AdditionalConstraintsHasAuthor(c a ) HasAuthor(c a ) Coauthor(a a )HasAuthor(c,a1) HasAuthor(c,a2) Coauthor(a1,a2)Coauthor(a1,a2) Coauthor(a3,a4) a1=a3 a2=a4

  • Inference Usecheapheuristics(e.g.TFIDFbasedsimilarity)toidentifyplausiblepairsy p p

    Inference/learningoverplausiblepairs

    Inferencemethod:lazygrounding+MaxWalkSAT

    Learning:supervisedandtransfer(learn/handsetononeg p ( /domainandtransferred)

  • ProbabilisticSoftLogic[Broecheler & Getoor UAI10]

    DeclarativelanguagefordefiningconstrainedcontinuousMarkov random field (CCMRF) using first order logic

    [Broecheler &Getoor,UAI10]

    Markovrandomfield(CCMRF)usingfirstorderlogic(FOL)

    Soft logic: truth values in [0 1] Softlogic:truthvaluesin[0,1] LogicaloperatorsrelaxedusingLukasiewicz tnorms Mechanisms for incorporating similarity functions and Mechanismsforincorporatingsimilarityfunctions,andreasoningaboutsets

    MAP inference is a convex optimization MAPinferenceisaconvexoptimization Efficientsamplingmethodformarginalinference

  • FOLtoCCMRF

    PSLconvertsaweightedruleintopotentialfunctionsbypenalizing its distance to satisfactionpenalizingitsdistancetosatisfaction,

    isthetruthvalueofgroundruleunderinterpretationx

    Thedistributionovertruthvaluesis

    : weight of rule r

    :weightofruler :allgroundingsofruler

    :PSLprogramp g

  • UndirectedApproachesMethod Learning/Inference

    MethodEvaluation

    [McCallum & Conditional Graph partitioning F1 on DARPA[McCallum&Wellner,NIPS04]

    ConditionalRandomFields(CRFs)capturing

    Graphpartitioning(Boykov etal.1999),performedviacorrelationclustering

    F1on DARPAMUC&ACEdatasets

    transitivityconstraints

    [Singla &D i

    MarkovLogicN k

    Supervisedlearningd i f i

    ConditionalLoglik lih d dDomingos,

    ICDM06]Networks(MLNs)

    andinferenceusingMaxWalkSAT &MCMC

    likelihood andAUConCoraandBibServdata

    [Broecheler &Getoor,UAI10]

    ProbabilisticSimilarityLogic

    Supervisedlearningandinferenceusing

    Precision/Recall/F1Ontology

    (PSL) continuousoptimization

    Alignment

  • HYBRID APPROACHESHYBRIDAPPROACHES

  • HybridApproaches Constraintbasedapproachesexplicitlyencoderelationalconstraints Theycanbeformulatedashybridofconstraintsandprobabilisticmodels

    Orasconstraintoptimizationproblem Examples

    ConstraintbasedEntityMatching[Shen,Li&Doan,AAAI05] Dedupalog [Arasu,Re,Suciu,ICDE09]

  • Dedupalog [Arasu etal.,ICDE09]

    PaperRef(id, title, conference, publisher, year)Wrote(id authorName Position)

    DatatobededuplicatedDatatobe

    deduplicatedWrote(id, authorName, Position) deduplicateddeduplicated

    TitleSimilar(title1,title2)A th Si il ( th 1 th 2)

    (Thresholded)FuzzyJ i O t t

    (Thresholded)FuzzyJ i O t tAuthorSimilar(author1,author2) JoinOutputJoinOutput

    Step(0)Createinitialapproximatematches;thisisinputtoDedupalog.

    Step(1)Declaretheentities ClusterPapers,Publishers,&Authors

    Paper!(id) :- PaperRef(id,-,-,-)Publisher!(p) :- PaperRef(-,-,-,p,-)Author!(a) :- Wrote(-,a,-) P bli h (UNA) d P (NOT UNA)

    Dedupalog isflexible:UniqueNamesAssumption(UNA)

    Dedupalog isflexible:UniqueNamesAssumption(UNA)

    ( ) ( , , ) Publishers(UNA)andPapers(NOTUNA)

    Slidesbasedon[Arasu,Re,Suciu,ICDE09]

  • Step(2)DeclareClusters

    PaperRef(id, title, conference, publisher, year)Wrote(id authorName Position)

    InputintheDB

    Clusterpapers,publishers and authorsWrote(id, authorName, Position)

    TitleSimilar(title1,title2)A th Si il ( th 1 th 2)

    Paper!(id) :- PaperRef(id,-,-,-)Publisher!(p) :- PaperRef(- - - p -)

    publishers,andauthors

    AuthorSimilar(author1,author2) Publisher!(p) :- PaperRef(-,-,-,p,-)Author!(a) :- Wrote(-,a,-)

    Author*(a1,a2) AuthorSimilar(a1,a2)

    Clustersaredeclared using*(likeIDBsorViews):TheseareoutputClustersaredeclared using*(likeIDBsorViews):Theseareoutput

    Author1 Author2

    AA Arvind Arasu

    *IDBsareequivalencerelations:A D d l i

    ClusterauthorswithsimilarnamesAA Arvind Arasu

    Arvind A Arvind Arasu

    qSymmetric,Reflexive,&TransitivelyClosedRelations:i.e.,Clusters

    ADedupalog program isasetofdataloglikerules

    152

  • SimpleConstraints

    Paperswithsimilartitlesshouldlikelybeclusteredtogether

    Paper*(id1,id2) PaperRef(id1,t1,-), PaperRef(id2,t2,-),TitleSimilar(t1,t2)

    Author*(a1,a2) AuthorSimilar(a1,a2) ()Softconstraints:( )Payacostifviolated.

    Paper*(id1 id2)

  • Additional ConstraintsAdditionalConstraintsClusteringtwopapers,thenmustclustertheirfirstauthors

    Author*(a1,a2)

  • Dedupalog viaCC

    Semantics:TranslateaDedupalog Programtoasetofgraphs

    Nodesarereferences(inthe!Relation) EntityReferences:Conference!(c)

    VLDBJConference*(c1,c2) ConfSim(c1,c2)

    Positiveedges

    Negative edges are implicit[ ]

    VLDBconfVLDB

    Negativeedgesareimplicit[] ICDE

    InternationalConf.DEICDT

    Forasinglegraphw.o.hardconstraintswe can reuse prior work for O(1) apx

    156

    wecanreusepriorworkforO(1)apx.

  • CorrelationClustering

    Conference*(c1 c2)

  • Voting

    Extendalgorithmtowhole languageviavotingtechnique.Support different entity types recursive programs etc

    Thm: ArecursivehardThm: ArecursivehardManydedupalog programsManydedupalog programs

    Supportdifferententitytypes,recursiveprograms,etc.

    constraintsnoO(1)apxconstraintsnoO(1)apxy p g p g

    haveanO(1)apxy p g p g

    haveanO(1)apx

    Thm:AllsoftprogramsO(1) Expert:multiwaycuthard

    Systemproperties:(1)StreamingalgorithmSystemproperties:(1)Streamingalgorithm(2)linearin#ofmatches(notn2)(3)Userinteraction(2)linearin#ofmatches(notn2)(3)Userinteraction

    Features:Supportforweights,referencetables(partially),andcorrespondinghardnessresults.Features:Supportforweights,referencetables(partially),andcorrespondinghardnessresults.

  • HybridApproachesMethod Evaluation

    ConstraintbasedEntity

    Twolayermodel:Layer1:Generativemodelfordatasetsthatsatisfy

    ResearchersandIMDBwith

    Matching[Shen,Li&Doan,AAAI05];

    constraints;Layer2:EMalgorithmandtherelaxationlabelingalgorithmtoperformmatching.Ineachiteration,useEM to estimate parameters of the generative model

    noiseadded

    AAAI05];buildson(Li,Morie,&Roth,AIMag 2004)

    EMtoestimateparametersofthegenerativemodelandamatchingassignment,thenemploysrelaxationlabelingtoexploittheconstraints

    Dedupalog[Arasu,Re,Suciu,ICDE09]

    Declarativespecificationforrichcollectionofconstraintswithnicesyntactic sugaraddedtodatalogforER.Inference:Correlationclustering+voting

    Precision/RecallonCora,subsetofACMdataset

  • Summary:CollectiveApproaches Decisionsforclustermembershipdependsonotherclusters

    Similaritypropagationapproachesy p p g pp ProbabilisticModels

    GenerativeModelsU di d M d l UndirectedModels

    HybridApproaches

    Nonprobabilisticapproachesoftenscalebetterthangenerativeprobabilisticapproaches

    Undirected/constraintbasedmodelsareofteneasiertospecify Scalingundirectedmodelsactiveareaofresearch

  • PART 3SCALING ER TO BIGDATAPART3SCALINGERTOBIG DATA

  • ScalingERtoBigData Blocking/CanopyGeneration Distributed ERDistributedER

  • PART 3BLOCKING/CANOPY GENERATIONPART3aBLOCKING/CANOPYGENERATION

  • Blocking:Motivation Navepairwise:|R|2 pairwise comparisons

    1000 business listings each from 1,000 different cities across1000businesslistingseachfrom1,000differentcitiesacrosstheworld

    1trillioncomparisons 11.6days (ifeachcomparisonis1s)

    Mentionsfromdifferentcitiesareunlikelytobematches BlockingCriterion:City 1billioncomparisons 16minutes(ifeachcomparisonis1s)

  • Blocking:Motivation Mentionsfromdifferentcitiesareunlikelytobematches

    May miss potential matchesMaymisspotentialmatches

  • Blocking:MotivationMatchingPairsofRecords

    PairsofRecordssatisfying

    Blockingcriterion

    SetofallPairsofRecords

  • BlockingAlgorithms1 Hashbasedblocking

    EachblockCi isassociatedwithahashkeyhi. Mentionx ishashedtoCi ifhash(x)=hi. Withinablock,allpairsarecompared. Each hash function results in disjoint blocks.Eachhashfunctionresultsindisjointblocks.

    Whathashfunction? Deterministicfunctionofattributevalues BooleanFunctionsoverattributevalues

    [Bilenko etalICDM06,MichelsonetalAAAI06,[ , ,DasSarma etalCIKM12]

    minHash (minwiseindependentpermutations)[Broder etalSTOC98]

  • BlockingAlgorithms2 Pairwise Similarity/Neighborhoodbasedblocking

    Nearby nodes according to a similarity metric are clusteredNearbynodesaccordingtoasimilaritymetricareclusteredtogether

    Resultsinnondisjointcanopies.

    Techniques SortedNeighborhoodApproach[HernandezetalSIGMOD95] CanopyClustering[McCallumetalKDD00]

  • SimpleBlocking:InvertedIndexonaKey

    Examplesofblockingkeys: First three characters of last nameFirstthreecharactersoflastname City+State+Zip CharacterorTokenngrams Minimuminfrequentngrams

  • LearningOptimalBlockingFunctions Usingoneormoreblockingkeysmaybeinsufficient

    2,376,206 Americans shared the surname Smith in the 2000 US2,376,206American ssharedthesurnameSmithinthe2000US NULLvaluesmaycreatelargeblocks.

    Solution:Constructblockingfunctionsbycombiningsimplefunctions

  • ComplexBlockingFunctions Conjunctionoffunctions[MichelsonetalAAAI06,Bilenko etalICDM06]

    {City}AND{lastfourdigitsofphone}

    Chaintrees[DasSarma etalCIKM12]If ({Ci } NULL LA) h {l f di i f h } AND { d } If({City}=NULLorLA)then {lastfourdigitsofphone}AND{areacode}

    else {lastfourdigitsofphone}AND{City}

    BlkTrees [DasSarma etalCIKM12]

  • LearninganOptimalfunction[Bilenko etalICDM06] Findkblockingfunctionsthateliminatethemostnonmatches,whileretainingalmostallmatches., g Needatrainingsetofpositiveandnegativepairs

    AlgorithmIdea:RedBlueSetCover

    PositiveExamples

    Blocking Keys

    PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered

    PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered

    NegativeExamples

    BlockingKeys(b)Numberofrednodescoveredisminimized(b)Numberofrednodescoveredisminimized

  • LearninganOptimalfunction[Bilenko etalICDM06] AlgorithmIdea:RedBlueSetCover

    PositiveExamples Pick k Blocking keys such thatPick k Blocking keys such thatp

    BlockingKeys

    PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered(b)Numberofrednodes

    PickkBlockingkeyssuchthat(a)Atmost bluenodesarenotcovered(b)Numberofrednodes

    Greedy Algorithm

    NegativeExamplescoveredisminimizedcoveredisminimized

    GreedyAlgorithm: Constructgoodconjunctionsofblockingkeys{p1,p2,}. Pick k conjunctions {p p p } such that the following is Pickkconjunctions{pi1,pi2,,pik},suchthatthefollowingisminimized

  • minHash (Minwise IndependentPermutations) LetFx beasetoffeaturesformentionx

    (functions of) attribute values(functionsof)attributevalues characterngrams optimalblockingfunctions

    Let bearandompermutationoffeaturesinFx E.g.,orderimposedbyarandomhashfunction

    minHash(x)=minimumelementinFx accordingto

  • WhyminHash works?Surprisingproperty:Forarandompermutation,

    How to build a blocking scheme such that only pairs withHowtobuildablockingschemesuchthatonlypairswithJacquardsimilarity>sfallinthesameblock(withhighprob)?

    `Probabilitythat

    (x,y)mentionsarebl k d t thblockedtogether

    Similarity(x,y)

  • BlockingusingminHashes ComputeminHashes usingr*kpermutations(hashfunctions))

    Band of rminHashesBandofrminHashes

    Signatures that match on 1 out of k bands go to thek blocks

    Signature sthatmatchon1outofk bands,gotothesameblock.

  • minHash AnalysisFalseNegatives:(missingmatches)P(pair x y not in the same block

    Sim(s) P(notsameblock)P(pairx,y notinthesameblock

    withJacquardsim =s)block)

    0.9 108

    0.8 0.00035should be very low for high similarity pairs

    False Positives: (blocking nonmatches)

    0.7 0.025

    0.6 0.2

    shouldbeverylowforhighsimilaritypairs

    FalsePositives:(blockingnonmatches)P(pairx,y inthesameblock

    with Jacquard sim = s)

    0.5 0.52

    0.4 0.81withJacquardsim s)0.3 0.95

    0.2 0.994

    0.1 0.9998

  • CanopyClustering[McCallumetalKDD00]Input:MentionsM,

    d(x,y),adistancemetric, Inmultiplecanopies

    thresholdsT1 >T2Algorithm:1 Pi k d l t f M

    canopies

    1. PickarandomelementxfromM2. CreatenewcanopyCx using

    mentionsys.t.d(x,y)

  • PART 3 bDISTRIBUTED ERPART3bDISTRIBUTEDER

  • DistributedERM d i l f l k Mapreduceisverypopularforlargetasks Simpleprogrammingmodelformassivelydistributeddata

    Hadoop providesfaulttoleranceandisopensource

    MapPhase(perrecordcomputation)

    ReducePhase(globalcomputation)

    Shuffle

  • ERwithDisjointBlocking

    Map Phase Reduce PhaseComputeBlocksinMap RemainingERinReduce

    MapPhase(perrecordcomputation)

    ReducePhase(globalcomputation)

    Shuffle

    BlockID Noneedtocomparerecordsacross

    reducersreducers

  • NondisjointBlocking Howtoblock?

    Hashbased: need an efficient technique to group records ifHash based:needanefficienttechniquetogrouprecordsiftheymatchonnoutofkblockingkeys[Vernica etalSIGMOD10]

    Distancebased:canopyclusteringonmapreduce[Mahout] IterativeBlocking[Whang etalSIGMOD09]

  • Problem:Informationneededforarecordisinmultiplereducers.p

    Informationneededforarecordisinmultiplereducers. Example 1:Example1:

    Reducer1:amatcheswithb Reducer2:amatcheswithcN d t i t i d t tl l b Needtocommunicateinordertocorrectlyresolvea,b,c

  • Problem:Informationneededforarecordisinmultiplereducers.p

    Example2:Dedup papersandauthorsId Author1 Author2 PaperA1 JohnSmith RichardJohnson IndicesandViewsA2 JSmith RJohnson SQLQueriesA Dr Smyth R Johnson Indices and ViewsA3 Dr.Smyth RJohnson IndicesandViews

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • Problem:Informationneededforarecordisinmultiplereducers.p

    Canopyclusteringresultsinnondisjointclusters[McCallumetalKDD00]

    JohnSmithRichard

    J.Smith JohnS.

    J h J bRichardJohnson SmithRichard

    SmithRichardM.Johnson

    R.Smith

    JohnJacob

    CanopyCanopy CCCanopyf

    Canopyf

    Johnson

    pyfor

    Richard

    pyfor

    Richard

    CanopyforSmithCanopyforSmith

    forJohnforJohn

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • Problem:Informationneededforarecordisinmultiplereducers.p

    CoAuthor(A1,B1) CoAuthor(A2,B2)match(B1,B2)match(A1,A2)

    CoAuthor rulegroundstothecorrelationmatch(RichardJohnson,RJohnson)=>match(J.Smith,JohnSmith)

    JohnSmith

    J.Smith JohnS.

    SteveRichard Smith JohnJacobSteve

    JohnsonR.Smith

    CanopyCanopy CanopyCanopyR

    Johnson

    Johnson

    Canopyfor

    Johnson

    Canopyfor

    Johnson

    CanopyforSmithCanopyforSmith

    forJohnforJohn

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • Problem:Informationneededforarecordisinmultiplereducers.p

    Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,KangetalICDM2009]

    C l i Cl i / C ll i ER i h+CorrelationClustering/CollectiveERineachcomponent

    Solution2:CorrelationClustering/CollectiveERineachcanopyg / py+MessagePassing[Rastogi etalVLDB11]

  • Problem:Informationneededforarecordisinmultiplereducers.p

    Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,KangetalICDM2009]

    C l i Cl i / C ll i ER i h+CorrelationClustering/CollectiveERineachcomponent

    Connectedcomponentscanbelargeinrelational/multientityER.p g / y

    Solution2:CorrelationClustering/CollectiveERineachcanopy+MessagePassing[Rastogi etalVLDB11]

  • MessagePassing

    Simple Message Passing (SMP)SimpleMessagePassing(SMP)1. RunentitymatcherM locallyineachcanopy2. IfM findsamatch(r1,r2)insomecanopy,passitas1 2

    evidencetoallcanopies3. RerunMwithineachcanopyusingnewevidence

    il h f d i h4. Repeatuntilnonewmatchesfoundineachcanopy

    Runtime: O(k2 f(k) c)Runtime:O(k2 f(k)c) k:maximumsizeofacanopy f(k):TimetakenbyERoncanopyofsizek c:numberofcanopies

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • FormalPropertiesforawellbehavedERmethod

    Convergence:No.ofstepsno.ofmatches

    Consistency:Outputindependentofthecanopyorder

    Soundness:Eachoutputmatchisactuallyatruematch

    y p p py

    Completeness:Eachtruematchisalsoaoutputmatch

    JohnSmith

    J.Smith JohnS.

    JohnJacobRichardSmith

    RichardJohnson

    SmithR.SmithRichardM.

    Johnson

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • Completeness

    Papers2and3matchonlyifacanopyknows thatknowsthatmatch(a1,a2)match(b2,b3)

    t h( 2 3)match(c2,c3)

    Simplemessagepassingwillnotfindanymatches thus,nomessagesarepassed,noprogress

    Solution:Maximalmessagepassing Sendamessageifthereisapotentialformatch

    Slideadaptedfrom[Rastogi etalVLDB11]talk

  • SummaryofScalability O(|R|2)pairwise computationscanbeprohibitive.

    Blockingeliminatescomparisonsonalargefractionofnonmatches.

    EqualitybasedBlocking: Construct(oneormore)blockingkeysfromfeatures

    Records not matching on any key are not compared Recordsnotmatchingonanykeyarenotcompared.

    Neighbohood basedBlocking: Formoverlappingcanopiesofrecordsbasedonsimilarity. Onlycomparerecordswithinacluster.

    Computingconnectedcomponents/MessagePassinginadditiontoblocking can help distribute ERblockingcanhelpdistributeER.

  • P t 4CHALLENGES AND FUTUREPart4CHALLENGESANDFUTUREDIRECTIONS

  • Challenges Sofar,wehaveviewedERasaonetimeprocessappliedtoentire

    database;noneoftheseholdinrealworld. Temporal ERTemporalER

    ERalgorithmsneedtoaccountforchangeinrealworld Reasoningaboutmultiplesources[Pal&M etal.WWW12] Modeltransitions[LietalVLDB11]

    Reasoningaboutsourcequality Sourcesarenotindependent CopyingProblem[DongetalVLDB09]

    QueryTimeER Howdoweselectivelydeterminethesmallestnumberofrecordstoresolve,so

    wegetaccurateresultsforaparticularquery? Collective resolution for queries [Bhattacharya & Getoor JAIR07]Collectiveresolutionforqueries[Bhattacharya&GetoorJAIR07]

    ER&Usergenerateddata Deduplicated entitiesinteractwithusersintherealworld

    Userstag/associatephotos/reviewswithbusinessesonGoogle/Yahoo Whatshouldbedonetosupportinteractions?

  • OpenIssues ERisoftenpartofbiggerinferenceproblem

    Pipelinedapproachesandjointapproachestoinformationextractionand graph identificationandgraphidentification

    HowcanwecharacterizehowERerrorsaffectoverallqualityofresults?

    ER Theory ERTheory Needbettersupportfortheorywhichcangiverelationallearning

    bounds ER & Privacy ER&Privacy

    ERenablesrecordreidentification HowdowedevelopatheoryofprivacypreservingER?ER B h k ERBenchmarks NeedforlargescalerealworldERdatasetswithgroundtruth Syntheticdatausefulforscalingbuthardtocapturerichcomplexities

    f l ldofrealworld

  • Summary Growingomnipresenceofmassivelinkeddata,andtheneed

    forcreatingknowledgebasesfromtextandunstructureddatamotivate a number of challenges in ERmotivateanumberofchallengesinER

    EspeciallyinterestingchallengesandopportunitiesforERand/socialmedia/usergenerateddata

    As data noise and knowledge grows greater needs &Asdata,noise,andknowledgegrows,greaterneeds&opportunitiesforintelligentreasoningaboutentityresolution

    M th h ll Manyotherchallenges Largescaleidentitymanagement Understandingtheoreticalpotentials&limitsofER

  • THANK YOU!THANKYOU!

  • References IntroW.Willinger etal,MathematicsandtheInternet:ASourceofEnormousConfusionand

    GreatPotential,NoticesoftheAMS56(5),2009L Gill and M Goldcare English National Record Linkage of Hospital Episode Statistics andL.GillandM.Goldcare, EnglishNationalRecordLinkageofHospitalEpisodeStatisticsand

    DeathRegistrationRecords,ReporttotheDepartmentofHealth,2003T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer2007A. Elmagrid et al, Duplicate Record Detection, TKDE 2007A.Elmagrid etal, DuplicateRecordDetection ,TKDE2007P.Christen,DataMatching,Springer2012N.Koudas etal,RecordLinkage:SimilaritymeasuresandAlgorithms,SIGMOD2006X. Dong & F. Naumann, Data fusionResolving data conflicts for integration, VLDB 2009X.Dong&F.Naumann, Datafusion Resolvingdataconflictsforintegration ,VLDB2009L.Getoor &A.Machanavajjhala,EntityResolution:Theory,PracticeandOpenChallenges,

    AAAI2012

  • References SingleEntityERD.Menestrina etal,EvaluationEntityResolutionResults,PVLDB3(12),2010M.Cochinwala etal,Efficientdatareconciliation,InformationSciences137(14),2001M Bilenko & R Mooney Adaptive Duplicate Detection Using Learnable String SimilarityM.Bilenko &R.Mooney, AdaptiveDuplicateDetectionUsingLearnableStringSimilarity

    Measures,KDD2003P.Christen,Automaticrecordlinkageusingseedednearestneighbour andsupportvector

    machineclassification.,KDD2008Z.Chenetal,Exploitingcontextanalysisforcombiningmultipleentityresolutionsystems,

    SIGMOD2009A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun

    f Coreference,NIPS2004H.Newcombe etal,Automaticlinkageofvitalrecords,Science1959I.Fellegi &A.Sunter,ATheoryforRecordLinkage,JASA1969W.Winkler,OverviewofRecordLinkageandCurrentResearchDirections,ResearchReport

    Series,USCensus,2006T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer,2007P R ik & W C h A Hi hi l G hi l M d l f R d Li k UAI 2004P.Ravikumar &W.Cohen,AHierarchicalGraphicalModelforRecordLinkage,UAI2004

  • References SingleEntityER(contd.)S.Sarawagi etal,InteractiveDeduplication usingActiveLearning,KDD2000S.Tejada etal,LearningObjectIdentificationRulesforInformationIntegration,IS2001A Arasu et al On active learning of record matching packages SIGMOD 2010A.Arasu etal, Onactivelearningofrecordmatchingpackages ,SIGMOD2010K.Bellare etal,Activesamplingforentitymatching,KDD2012A.Beygelzimer etal,AgnosticActiveLearningwithoutConstraints,NIPS2010J Wang et al CrowdER: Crowdsourcing Entity Resolution PVLDB 5(11) 2012J.Wangetal, CrowdER:Crowdsourcing EntityResolution ,PVLDB5(11),2012A.Marcusetal,HumanpoweredSortsandJoins,PVLDB5(1),2011

  • References SingleEntityER(contd.)R. Gupta & S. Sarawagi, Answering Table Augmentation Queries from Unstructured Lists on the Web,R.Gupta&S.Sarawagi, AnsweringTableAugmentationQueriesfromUnstructuredListsontheWeb ,

    PVLDB2(1),2009A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,CIKM

    2012M Bilenko et al Adaptive Product Normalization: Using Online Learning for Record Linkage inM.Bilenko etal, AdaptiveProductNormalization:UsingOnlineLearningforRecordLinkagein

    ComparisonShopping,ICDM2005S.Chaudhuri etal,RobustIdentificationofFuzzyDuplicates,ICDE2005W.Soonetal,Amachinelearningapproachtocoreference resolutionofnounphrases,

    C t ti l Li i ti 27(4) 2001ComputationalLinguistics27(4)2001N.Bansal etal,CorrelationClustering,MachineLearning56(13),2004V.Ng&C.Cardie,Improvingmachinelearningapproachestocoreference resolution,ACL2002M.Elsner &E.Charnaik,Youtalkingtome?acorpusandalgorithmforconversation, g p g

    disentanglement,ACLHLT2008M.Elsner &W.Schudy,BoundingandComparingMethodsforCorrelationClusteringBeyondILP,

    ILPNLP2009N Ailon et al Aggregating inconsistent information: Ranking and clustering JACM 55(5) 2008N.Ailon etal, Aggregatinginconsistentinformation:Rankingandclustering ,JACM55(5),2008X.Dongetal,IntegratingConflictingData:TheRoleofSourceDependence,PVLDB2(1),2009A.Paletal,InformationIntegrationoverTimeinUnreliableandUncertainEnvironments,WWW

    2012A.Culotta etal,CanonicalizationofDatabaseRecordsusingAdaptiveSimilarityMeasures,KDD2007O.Benjelloun etal,Swoosh:AgenericapproachtoEntityResolution,VLDBJ18(1),2009

  • References Constraints&MultiRelationalERA h k i h l li i i f d li i d h VL 2002R.Ananthakrishna et.al,Eliminatingfuzzyduplicatesindatawarehouses,VLDB2002

    A.Arasu etal,LargeScaleDeduplication withConstraintsusingDedupalog,ICDE2009S.Chaudhuri etal.,Leveragingaggregateconstraintsfordeduplication,SIGMOD07X.Dongetal,ReferenceRecounciliation inComplexInformationSpaces,SIGMOD2005I.Bhattacharya&L.Getoor,CollectiveEntityResolutioninRelationalData,TKDD2007I.Bhattacharya&L.Getoor,ALatentDirichlet ModelforUnsupervisedEntityResolution,SDM2007P.Bohannonetal.,ConditionalFunctionalDependenciesforDataCleaning,ICDE 2007M.Broecheler &L.Getoor,ProbabilisticSimilarityLogic,UAI2010, y g ,W.Fan,Dependenciesrevisitedforimprovingdataquality,PODS2008H.Pasula etal,IdentityUncertaintyandCitationMatching,NIPS2002D.Kalashnikovetal,DomainIndependentDataCleaningviaAnalysisofEntityRelationshipGraph,

    TODS06J.Laffertyetal,ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabeling

    SequenceData.,ICML2001X.Lietal,IdentificationandTracingofAmbiguousNames:DiscriminativeandGenerative

    Approaches,AAAI2004A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun

    Coreference,NIPS2004M.Richardson&P.Domingos,MarkovLogic,MachineLearning62,2006W.Shen etal.,ConstraintbasedEntityMatching,AAAI2005P.Singla &P.Domingos,EntityResolutionwithMarkovLogic,ICDM2006Whang etal.,GenericEntityResolutionwithNegativeRules,VLDBJ2009Whang elal.,JointEntityResolution,ICDE2012

  • References BlockingM.Bilenko etal,AdaptiveBlocking:LearningtoScaleUpRecordLinkageandClustering,ICDM

    2006M.Michelson&C.Knoblock,LearningBlockingSchemesforRecordLinkage,AAAI2006g g gA.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,

    CIKM2012A.Broder etal,MinWiseIndependentPermutations,STOC1998G P di l B d 100 illi i i l l bl ki b d l i fG.Papadias etal,Beyond100millionentities:largescaleblockingbasedresolutionfor

    heterogenous data,WSDM2012M.Hernandez&S.Stolfo,Themerge/purgeproblemforlargedatabases,SIGMOD1995A. McCallum et al, Efficient clustering of highdimensional data sets with application toA.McCallumetal, Efficientclusteringofhigh dimensionaldatasetswithapplicationto

    referencematching,KDD2000L.Kolbetal,Dedoop:Efficientdeduplication withHadoop,(demo)PVLDB5(12),2012R.Vernica etal,EfficientParallelSetSimilarityJoinsUsingMapReduce,SIGMOD2010ApacheMahout:ScalableMachineLearningandDataMining,http://mahout.apache.org/S.Whang etal,EntityResolutionwithIterativeBlocking,SIGMOD2009U.Kangetal,PEGASUS:APetaScaleGraphMiningSystem Implementationand

    Observations ICDM 2009Observations ,ICDM2009V.Rastogi etal,FindingConnectedComponentsonMapreduceinPolyLogRounds,Corr 2012V.Rastogi etal,LargeScaleCollectiveEntityMatching,PVLDB4(4),2011

  • References Challenges&FutureDirectionsI.BhattacharyaandL.Getoor,"QuerytimeEntityResolution",JAIR2007X.Dong,L.BertiEquille,D.Srivastava,Truthdiscoveryandcopyingdetectioninadynamic

    world VLDB 2009world ,VLDB2009P.Li,X.Dong,A.Maurino,D.Srivastava,LinkingTemporalRecords,VLDB2011A. Pal,V.Rastogi,A.Machanavajjhala,P.Bohannon,Informationintegrationovertimein

    unreliable and uncertain environments, WWW 2012unreliableanduncertainenvironments ,WWW2012