Issues in Deterministic and Probabilistic Record Linkage
Scott DuVall
Salt Lake City VHA MC
the age of information
informatician
information =
Linkage Adds Information
Linkage Corrects Errors
• Missing information affects patient care.1
• Transitions in care cause breakdowns in communication.2
1 Stiell et al. Prevalence of information gaps in the emergency department and the effect on patient outcomes. CMAJ 2003;169(10):1023-8.
2 Coleman et al. Lost in transition: challenges and opportunities for improving the quality of transitional care. Ann Intern Med 2004;141(7):533-6.
• Resolving duplicates can cost $60 per case.1
1 Thornton SN, Hood SK. Reducing Duplicate Patient Creation Using a Probabilistic Matching Algorithm in an Open-access Community Data Sharing Environment. Proc AMIA Symp 2005:1135.
• “between $0.30 and $0.40 of every dollar spent on health care is wasted on overuse, under use, misuse, duplication, system failures, unnecessary repetition, poor communications and inefficiency.”1
1 Reid PP, Compton WD, Grossman JH, Fanjiang G. Building a Better Delivery System: A New Engineering/Health Care Partnership. National Academies Press, 2005:99.
• Record linkage is a key element of health care information exchange and interoperability, which is estimated to be able to reduce costs by $77.8 billion annually.1
1 Walker J, Pan E, Johnston D, Adler-Milstein J, Bates DW, Middleton B. The value of health care information exchange and interoperability. Health Aff (Millwood). 2005 Jan-Jun;Suppl Web Exclusives: W5-10-W5-18.
Record Matching
• Many systems have record matching software.
• Errors still exist
– 50% missed in CDC survey.1
– 5% missed in 1.5 million records = 75,000 missed.2
1 User Manual for the CDC Deduplication Evaluation Toolkit.
2 Snow LA, DuVall SL. Clinical Data Exchange Through A Looking Glass: A Gray-Box Approach To Record Linkage. NLM 2005.
Old Technology
Misunderstood Technology
Score Is Not Probability
[Figure: score vs. probability]
Information is not Used
MPI: Name + Date of Birth + Social Security Number
Deterministic Linkage
1) IF r1.social_security_number = r2.social_security_number
THEN match.
2) IF SoundexCompare(r1.last_name, r2.last_name) AND
SoundexCompare(r1.first_name, r2.first_name) AND
EditDistance(r1.birth_place, r2.birth_place) < 2 AND
r1.birth_date = r2.birth_date AND
r1.multiplicity = r2.multiplicity AND
r1.birth_order = r2.birth_order
THEN match.
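The two rules above can be sketched in Python as follows. This is a minimal illustration and not the presenter's implementation: the record layout (plain dicts keyed by the field names in the rules), the helper names `soundex`, `edit_distance`, and `is_match`, and the guard that ignores a missing SSN in rule 1 are all assumptions made for the example. `SoundexCompare` is read here as "the American Soundex codes are equal".

```python
def soundex(name: str) -> str:
    """American Soundex code (first letter plus three digits)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in name.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def is_match(r1: dict, r2: dict) -> bool:
    """Apply rule 1, then rule 2, to a pair of records."""
    # Rule 1: identical SSN. Assumption: a missing SSN never matches.
    ssn1 = r1.get("social_security_number")
    if ssn1 and ssn1 == r2.get("social_security_number"):
        return True
    # Rule 2: phonetic name agreement plus the remaining birth fields.
    return (soundex(r1["last_name"]) == soundex(r2["last_name"])
            and soundex(r1["first_name"]) == soundex(r2["first_name"])
            and edit_distance(r1["birth_place"], r2["birth_place"]) < 2
            and r1["birth_date"] == r2["birth_date"]
            and r1["multiplicity"] == r2["multiplicity"]
            and r1["birth_order"] == r2["birth_order"])
```

Every condition in a rule is all-or-nothing, so a single disagreeing field (for example a mistyped birth date) is enough to make rule 2 fail.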
IF contains(0..9)
THEN NUMBER
IF contains(North, South, East, West)
THEN DIRECTION
IF contains(Street, Road, Lane, Drive, ...)
THEN STREET_TYPE
ELSE STREET_NAME
IF (r1.NUMBER = r2.NUMBER) AND (r1.DIRECTION = r2.DIRECTION) AND (r1.STREET_NAME = r2.STREET_NAME) AND (r1.STREET_TYPE = r2.STREET_TYPE)
THEN MATCH
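A rough Python sketch of the address rules above: each token is labeled in the order the rules are listed, and two addresses match only if all four components agree. The function names `parse_address` and `addresses_match` are hypothetical, and the street-type set contains only the four types named in the rule because the rest of the list is elided there.

```python
DIRECTIONS = {"north", "south", "east", "west"}
# Only the types named in the rule; the original list continues ("...").
STREET_TYPES = {"street", "road", "lane", "drive"}


def parse_address(address: str) -> dict:
    """Label each whitespace-separated token of an address."""
    parsed = {"NUMBER": [], "DIRECTION": [], "STREET_TYPE": [], "STREET_NAME": []}
    for token in address.lower().replace(",", " ").split():
        if any(ch.isdigit() for ch in token):
            parsed["NUMBER"].append(token)
        elif token in DIRECTIONS:
            parsed["DIRECTION"].append(token)
        elif token in STREET_TYPES:
            parsed["STREET_TYPE"].append(token)
        else:
            parsed["STREET_NAME"].append(token)
    return parsed


def addresses_match(a: str, b: str) -> bool:
    """Deterministic match: every parsed component must be identical."""
    return parse_address(a) == parse_address(b)
```

As written, "123 North Main Street" and "123 N Main St" do not match, because abbreviations fall through to STREET_NAME. That brittleness is typical of deterministic rules unless a normalization step is added first.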
Probabilistic Linkage
Each field given AGREEMENT and DISAGREEMENT weight
Weight proportional to the field’s DISCRIMINATION and RELIABILITY
Many more parameters, possibility of better matching
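To make the idea of per-field weights concrete, here is a small Python sketch that scores a record pair by summing an agreement or disagreement weight for each field. The weight values, the field names, and the choice to skip a field when either value is missing are illustrative assumptions, not figures from the presentation.

```python
# Hypothetical (agreement, disagreement) weights per field.
WEIGHTS = {
    "last_name": (4.0, -2.0),
    "first_name": (3.0, -1.5),
    "birth_date": (5.0, -3.0),
    "social_security_number": (8.0, -4.0),
}


def pair_score(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Sum the agreement or disagreement weight of every comparable field."""
    score = 0.0
    for field, (agree, disagree) in weights.items():
        v1, v2 = r1.get(field), r2.get(field)
        if v1 is None or v2 is None:
            continue  # assumption: a missing value contributes nothing
        score += agree if v1 == v2 else disagree
    return score
```

Unlike a deterministic rule, a disagreement on one field lowers the score without ruling out the match; the pair is then accepted, rejected, or sent to manual review depending on where the total falls relative to chosen thresholds.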
Record Matching
Understand your Data + Understand Mistakes in your Data
Good Strategy for Linkage
MANUAL REVIEW
Understanding the Data
• Compare characteristics of records in the duplicate subset with records in the full enterprise data warehouse
• Describe instances where records in the duplicate subset are not typical of the database at large
• Provide considerations for others looking at duplicate records in master patient indexes
                                    UUHSC    Friedman
Extra names and titles              34.3%    36.9%
Nicknames, spelling variations      21.8%    13.9%
One letter substitutions            13.6%    13.7%
One letter added or deleted          7.6%    12.9%
Punctuation or spaces                1.9%    11.8%
Different last names for females    12.9%     7.8%
Permuted parts of names              3.2%     1.4%
Different first names                2.8%     1.4%
One letter transposed                1.9%     0.8%
                              UUHSC    Grannis
Missing SSN                   52.4%    35%
Typographical errors          62.7%    35.5%
Spouse (family) collisions    14.8%    47.5%
Unexplained collisions         9.9%    17%
All collisions                24.7%    64.5%
Invalid SSN                   12.6%     0%
Extension of the Probabilistic Model for Approximate Field Comparators
Probabilistic Model
Field in Record A = Field in Record B → Agreement Weight
Field in Record A ≠ Field in Record B → Disagreement Weight
M – probability that field matches in dup pair
U – probability that field matches in non-dup pair
Agreement Weight = LOG(M/U)
Disagreement Weight = LOG((1-M)/(1-U))
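The two weight formulas can be computed directly. The sketch below assumes base-2 logarithms, a common convention for these weights; the slides write only LOG and do not state the base. The function names are hypothetical.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """LOG(M/U): reward when the field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """LOG((1-M)/(1-U)): penalty when the field disagrees."""
    return math.log2((1 - m) / (1 - u))
```

For a field with M = 0.9 and U = 0.1, agreement is nine times more likely in a duplicate pair than in a non-duplicate pair, so the agreement weight is positive and the disagreement weight is negative. A field with M = U carries no information and both weights are zero.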
Field in Record A ≈ Field in Record B?
Approximate Comparator
Edit Distance
ED( Johnathan, Jonathan ) = 1
Approximate Comparator Weight
LOG(Mδ /Uδ)
Mδ – probability that field approximately matches by δ in dup pair
Uδ – probability that field approximately matches by δ in non-dup pair
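One way to estimate these weights from labeled data is sketched below in Python. It assumes "approximately matches by δ" means an edit distance of exactly δ, takes precomputed edit distances for known dup and non-dup pairs as input, and uses base-2 logarithms; the function name `approximate_weights` and the `max_delta` parameter are hypothetical.

```python
import math
from collections import Counter


def approximate_weights(dup_distances, nondup_distances, max_delta=2):
    """Return {delta: LOG(M_delta / U_delta)} for delta = 0..max_delta.

    M_delta and U_delta are the observed fractions of dup and non-dup
    pairs whose field values are exactly delta edits apart.
    """
    dup_counts = Counter(dup_distances)
    nondup_counts = Counter(nondup_distances)
    weights = {}
    for delta in range(max_delta + 1):
        m_delta = dup_counts[delta] / len(dup_distances)
        u_delta = nondup_counts[delta] / len(nondup_distances)
        if m_delta > 0 and u_delta > 0:  # skip distances never observed
            weights[delta] = math.log2(m_delta / u_delta)
    return weights
```

With this extension a near miss such as ED(Johnathan, Jonathan) = 1 earns its own weight, typically between the full agreement and disagreement weights, instead of being scored as a plain disagreement.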
Training, first pass (Initial Parameters):
1) Load and randomize training set
2) Classify with estimated parameters
3) Estimate Dups and Non-Dups
4) Update Parameters

Training, later passes (Updated Parameters):
1) Load and randomize training set
2) Classify with updated parameters
3) Re-estimate Dups and Non-Dups
4) Update Parameters

Validation (Training Set Parameters):
1) Load and randomize validation set
2) Classify with training set parameters
3) Classified Dups and Non-Dups
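The classify-and-update loop can be sketched in Python as below. The slides do not name the estimation algorithm, so this is only one plausible reading: a hard-assignment loop in which each pair is classified with the current M and U values and the parameters are then recomputed from the two resulting groups. The agreement-vector input (one 0/1 flag per field), the zero score threshold, the add-one smoothing, and all function names are assumptions for the example.

```python
import math
import random


def weight(agrees, m, u):
    """Agreement or disagreement weight for one field (base-2 log assumed)."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def classify(pairs, m, u, threshold=0.0):
    """Split agreement vectors into (dups, non_dups) by total score."""
    dups, non_dups = [], []
    for pair in pairs:
        score = sum(weight(a, mi, ui) for a, mi, ui in zip(pair, m, u))
        (dups if score > threshold else non_dups).append(pair)
    return dups, non_dups


def estimate(group, n_fields):
    """Per-field agreement rate; add-one smoothing keeps it inside (0, 1)."""
    return [(sum(p[i] for p in group) + 1) / (len(group) + 2)
            for i in range(n_fields)]


def train(pairs, m, u, iterations=10, seed=0):
    """Repeat classify -> re-estimate -> update until parameters settle."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # load and randomize training set
    for _ in range(iterations):
        dups, non_dups = classify(pairs, m, u)
        new_m, new_u = estimate(dups, len(m)), estimate(non_dups, len(u))
        if new_m == m and new_u == u:
            break  # parameters stopped changing
        m, u = new_m, new_u
    return m, u
```

Validation then amounts to calling `classify` on a held-out set with the parameters returned by `train`. A soft-assignment alternative, in which each pair contributes fractionally to both groups, is expectation-maximization.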
Questions?