CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION
BY PETER CHRISTEN
PRESENTED BY JOSEPH PARK
Data Matching
Introduction
“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”
Also known as:
- Record or data linkage
- Entity resolution
- Object identification
- Field matching
Aims & Challenges
Three tasks:
- Schema matching
- Data matching
- Data fusion

Challenges:
- Lack of unique entity identifiers and poor data quality
- Computational complexity
- Lack of training data (e.g. gold standards)
- Privacy and confidentiality (health informatics & data mining)
Overview of Data Matching
Five major steps:
1. Data pre-processing
2. Indexing
3. Record pair comparison
4. Classification
5. Evaluation
Diagram
Data Pre-processing
- Remove unwanted characters and words
- Expand abbreviations and correct misspellings
- Segment attributes into well-defined and consistent output attributes
- Verify the correctness of attribute values
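The first three pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table, the function names `clean_value` and `segment_address`, and the "number followed by street name" address layout are all hypothetical and not taken from the slides. Verification against external reference data is left out.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration);
# a real system would use a much larger, domain-specific lookup.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def clean_value(raw: str) -> str:
    """Lowercase, drop unwanted characters, and expand known abbreviations."""
    text = raw.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()  # also collapses repeated whitespace
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def segment_address(raw: str) -> dict:
    """Segment a cleaned address into street number and street name.

    Assumes the simple layout '<number> <street name ...>'.
    """
    tokens = clean_value(raw).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    street = " ".join(tokens[1:] if number else tokens)
    return {"street_number": number, "street_name": street}
```

For example, `segment_address("42  Main St.")` yields a street number of `"42"` and a street name of `"main street"`, so that differently formatted inputs end up in the same standardized attributes before comparison.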
Example of Data Pre-processing
Indexing
- Reduces computational complexity
- Generates candidate record pairs
- Common technique: blocking
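A minimal sketch of blocking, assuming records are dictionaries with `id`, `surname`, and `postcode` fields. The blocking key used here (first letter of the surname plus the postcode) is a hypothetical choice for illustration. Only records that share a key are paired, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first surname letter plus postcode."""
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records: list) -> set:
    """Return the set of (id, id) pairs whose records fall in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Pair records only within a block, never across blocks.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

With four records, a full comparison would need six pairs. If only two of them share a blocking key, a single candidate pair is generated. The trade-off is that true matches whose keys differ (for example because of a typo in the postcode) are never compared.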
Example of Blocking
Record Pair Comparison
Comparison vector: a vector of numerical similarity values, one per compared attribute
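As a rough sketch, a comparison vector can be built by applying one similarity function per attribute to a candidate record pair. The attribute names, the `COMPARATORS` list, and the choice of a bigram Dice coefficient are assumptions for illustration, not the comparison functions used in the slides.

```python
def bigrams(value: str) -> set:
    """Set of character bigrams (repeated bigrams are counted once)."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Bigram Dice coefficient in [0, 1]."""
    if a == b:
        return 1.0
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def exact_similarity(a: str, b: str) -> float:
    """1.0 if the values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical attribute-to-comparator assignment.
COMPARATORS = [
    ("given_name", dice_similarity),
    ("surname", dice_similarity),
    ("postcode", exact_similarity),
]


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """One numerical similarity value per compared attribute."""
    return [compare(rec_a[field], rec_b[field]) for field, compare in COMPARATORS]
```

Comparing "peter smith 2600" with "peter smyth 2601" under these assumptions gives the vector `[1.0, 0.5, 0.0]`: identical given names, surnames sharing two of four bigrams on each side, and different postcodes.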
Example of Record Pair Comparison
Jaro and Winkler String Comparison
- Jaro: Combines edit distance and q-gram based comparison
- Winkler: Increases Jaro similarity for up to four agreeing initial characters
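The following is a sketch of the commonly published formulation of the two comparison functions, not necessarily the exact variant shown in the slides. Jaro counts characters that agree within a sliding window and penalizes transpositions among them; Winkler then boosts the result for up to four agreeing initial characters. The parameter name `prefix_scale` and its default of 0.1 are assumptions based on the usual convention.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they lie within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(common1, common2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity increased for up to four agreeing initial characters."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return similarity + prefix * prefix_scale * (1.0 - similarity)
```

For the classic pair "MARTHA" and "MARHTA", all six characters are common with one transposition, so the Jaro similarity is about 0.944; the three agreeing initial characters raise the Winkler value to about 0.961.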
Record Pair Classification
- Two-class or three-class classification:
  - Match or non-match
  - Match, non-match, or potential match (requires clerical review)
- Supervised and unsupervised
- Active learning
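One simple way to realize the three-class scheme is to sum the similarity values of a comparison vector and apply two thresholds. The sketch below assumes three compared attributes with similarities in [0, 1]; the function name and the default thresholds `lower` and `upper` are hypothetical and would need tuning on real data.

```python
def classify_pair(comparison_vector: list, lower: float = 1.5, upper: float = 2.5) -> str:
    """Classify a record pair by its summed similarity.

    Pairs scoring between the two thresholds are left for clerical review.
    Setting lower == upper reduces this to a two-class classifier.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    score = sum(comparison_vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "potential match"
```

Widening the gap between the thresholds sends more pairs to clerical review and reduces automatic errors, at the cost of more manual work.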
Example of Record Pair Classification
Unsupervised Classification
- Threshold-based classification
- Probabilistic classification
- Cost-based classification
- Rule-based classification
- Clustering-based classification
Probabilistic Classification
- Three-class based
- Different weights assigned to different attributes
  - Newcombe & Kennedy: cardinalities
- Comparison vectors, binary comparison
- Conditionally independent attributes assumed
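The points above can be illustrated with a sketch in the style of the Fellegi-Sunter model of probabilistic record linkage. Naming that model is an assumption, since the slide text does not identify which formulation it uses. Each attribute gets an agreement weight log2(m/u) and a disagreement weight log2((1-m)/(1-u)), where m and u are the probabilities that the attribute agrees for true matches and for true non-matches. Because the attributes are assumed conditionally independent, the weights are simply summed. The probabilities in `M_U` and both thresholds are made-up values for illustration.

```python
import math

# Hypothetical (m, u) probabilities per attribute:
# m = P(attribute agrees | records match)
# u = P(attribute agrees | records do not match)
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def match_weight(agreements: dict) -> float:
    """Sum per-attribute log-likelihood-ratio weights for a binary comparison."""
    total = 0.0
    for attribute, agrees in agreements.items():
        m, u = M_U[attribute]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 8.0) -> str:
    """Three-class decision with two (hypothetical) thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"
```

An attribute with few distinct values agrees by chance more often, so its u is higher and its agreement weight lower. This is how different attributes end up with different weights.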
Formulae
Example of Probabilistic Classification
Active Learning
1. Trains a model with a small set of seed data
2. Classifies comparison vectors not in the training set as matches or non-matches
3. Asks users for help on the vectors that are most difficult to classify
4. Adds the manually classified vectors to the training data set
5. Trains the next, improved, classification model
6. Repeats until a stopping criterion is met
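The loop above can be sketched with a deliberately tiny model. As a simplifying assumption, each comparison vector is reduced to a single summed score, the "model" is one threshold halfway between the class means, and the hardest example is the one closest to that threshold. The `oracle` callable stands in for the human reviewer, and `max_queries` is a hypothetical stopping criterion.

```python
def train_threshold(labelled: list) -> float:
    """Fit a threshold halfway between the mean match and non-match scores.

    `labelled` holds (score, is_match) tuples and must contain both classes.
    """
    match_scores = [s for s, is_match in labelled if is_match]
    non_match_scores = [s for s, is_match in labelled if not is_match]
    mean_match = sum(match_scores) / len(match_scores)
    mean_non_match = sum(non_match_scores) / len(non_match_scores)
    return (mean_match + mean_non_match) / 2


def active_learning(seed: list, pool: list, oracle, max_queries: int = 3):
    """Return the final threshold and predictions for the remaining pool."""
    labelled = list(seed)
    unlabelled = list(pool)
    threshold = train_threshold(labelled)          # train on the seed data
    for _ in range(max_queries):                   # stopping criterion
        if not unlabelled:
            break
        # The most difficult score is the one closest to the boundary.
        hardest = min(unlabelled, key=lambda s: abs(s - threshold))
        unlabelled.remove(hardest)
        labelled.append((hardest, oracle(hardest)))  # ask the user
        threshold = train_threshold(labelled)        # train the next model
    predictions = [(s, s >= threshold) for s in unlabelled]
    return threshold, predictions
```

Starting from one labelled match at 0.9 and one non-match at 0.1, the initial threshold is 0.5. Each answer from the reviewer on a borderline score moves the threshold towards the reviewer's actual decision boundary, so manual effort is spent only where the model is least certain.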