
Issues in Deterministic and Probabilistic Record Linkage

Scott DuVall

Salt Lake City VHA MC

The age of information

Informatician

Information =

Linkage Adds Information

Linkage Corrects Errors


• Missing information affects patient care.1

• Transitions in care cause breakdown in communication.2

1 Stiell et al. Prevalence of information gaps in the emergency department and the effect on patient outcomes. CMAJ 2003;169(10):1023-8.

2 Coleman et al. Lost in transition: challenges and opportunities for improving the quality of transitional care. Ann Intern Med 2004;141(7):533-6.

• Resolving duplicates can cost $60 per case.1

1 Thornton SN, Hood SK. Reducing Duplicate Patient Creation Using a Probabilistic Matching Algorithm in an Open-access Community Data Sharing Environment. Proc AMIA Symp 2005:1135.

• “between $0.30 and $0.40 of every dollar spent on health care is wasted on overuse, under use, misuse, duplication, system failures, unnecessary repetition, poor communications and inefficiency.”1

1 Reid PP, Compton WD, Grossman JH, Fanjiang G. Building a Better Delivery System: A New Engineering/Health Care Partnership. National Academies Press, 2005:99.

• Record linkage is a key element of health care information exchange and interoperability, which is estimated to be able to reduce costs by $77.8 billion annually.1

1 Walker J, Pan E, Johnston D, Adler-Milstein J, Bates DW, Middleton B. The value of health care information exchange and interoperability. Health Aff (Millwood). 2005 Jan-Jun;Suppl Web Exclusives: W5-10-W5-18.


Record Matching

• Many systems have record matching software.

• Errors still exist

– 50% missed in CDC survey 1

– 5% missed in 1.5 million records = 75,000 records 2

1 User Manual for the CDC Deduplication Evaluation Toolkit

2 Snow LA, DuVall SL. Clinical Data Exchange Through A Looking Glass: A Gray-Box Approach To Record Linkage. NLM 2005.

Old Technology

Misunderstood Technology

Score Is Not Probability

score

probability

Information is not Used

MPI

Name + Date of Birth + Social Security Number

Deterministic Linkage

1) IF r1.social_security_number = r2.social_security_number
   THEN match.

2) IF SoundexCompare(r1.last_name, r2.last_name) AND
      SoundexCompare(r1.first_name, r2.first_name) AND
      EditDistance(r1.birth_place, r2.birth_place) < 2 AND
      r1.birth_date = r2.birth_date AND
      r1.multiplicity = r2.multiplicity AND
      r1.birth_order = r2.birth_order
   THEN match.
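The two rules above can be sketched in Python. This is only an illustration: the Record class, the helper names, and the choice of American Soundex and Levenshtein edit distance are assumptions, since the slides do not say which variants were used. The sketch also assumes that an empty SSN never counts as agreement under rule 1, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Field names follow the slide; the types are assumptions.
    social_security_number: str
    last_name: str
    first_name: str
    birth_place: str
    birth_date: str
    multiplicity: int
    birth_order: int


def soundex(name: str) -> str:
    """American Soundex code (first letter plus three digits)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in name.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def soundex_compare(a: str, b: str) -> bool:
    return soundex(a) != "" and soundex(a) == soundex(b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_match(r1: Record, r2: Record) -> bool:
    # Rule 1: identical SSN (assumption: a missing SSN never matches).
    if r1.social_security_number and \
            r1.social_security_number == r2.social_security_number:
        return True
    # Rule 2: every listed field must agree, exactly or approximately.
    return (soundex_compare(r1.last_name, r2.last_name)
            and soundex_compare(r1.first_name, r2.first_name)
            and edit_distance(r1.birth_place, r2.birth_place) < 2
            and r1.birth_date == r2.birth_date
            and r1.multiplicity == r2.multiplicity
            and r1.birth_order == r2.birth_order)
```

With this sketch, "Smith, Jon" and "Smyth, John" born on the same date in "Salt Lake City" and "Salt Lake Cty" would match under rule 2, while a single differing birth date would block the match.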

For each token in an address:

IF contains(0..9)
   THEN NUMBER
ELSE IF contains(North, South, East, West)
   THEN DIRECTION
ELSE IF contains(Street, Road, Lane, Drive, ...)
   THEN STREET_TYPE
ELSE STREET_NAME

For two parsed addresses a1 and a2:

IF (a1.NUMBER = a2.NUMBER) AND (a1.DIRECTION = a2.DIRECTION) AND (a1.STREET_NAME = a2.STREET_NAME) AND (a1.STREET_TYPE = a2.STREET_TYPE)
   THEN MATCH
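A minimal Python sketch of the address rules above. The function names and the tokenization (lowercasing and splitting on non-alphanumeric characters) are assumptions made for illustration, and the street type list contains only the four types named on the slide, whose list is open-ended.

```python
import re

DIRECTIONS = {"north", "south", "east", "west"}
# Only the types named on the slide; the slide's list ends with "...".
STREET_TYPES = {"street", "road", "lane", "drive"}


def parse_address(address: str) -> dict:
    """Assign each token to NUMBER, DIRECTION, STREET_TYPE, or STREET_NAME."""
    parts = {"NUMBER": [], "DIRECTION": [], "STREET_TYPE": [], "STREET_NAME": []}
    for token in re.findall(r"[a-z0-9]+", address.lower()):
        if any(ch.isdigit() for ch in token):
            parts["NUMBER"].append(token)
        elif token in DIRECTIONS:
            parts["DIRECTION"].append(token)
        elif token in STREET_TYPES:
            parts["STREET_TYPE"].append(token)
        else:
            parts["STREET_NAME"].append(token)
    return parts


def addresses_match(a1: str, a2: str) -> bool:
    """Deterministic match: all four parsed components must be equal."""
    return parse_address(a1) == parse_address(a2)
```

Because every component must agree exactly, "123 North Main Street" matches "123 north MAIN street." but not "125 North Main Street", and an abbreviation such as "St" would fall through to STREET_NAME and block the match.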

Probabilistic Linkage

Each field given AGREEMENT and DISAGREEMENT weight

Weight proportional to the field’s DISCRIMINATION and RELIABILITY

Many more parameters, possibility of better matching


Record Matching

Understand your Data +

Understand Mistakes in your Data

Good Strategy for Linkage

MANUAL REVIEW

Understanding the Data

• Compare characteristics of records in the duplicate subset with records in the full enterprise data warehouse

• Describe instances where records in the duplicate subset are not typical of the database at large

• Provide considerations for others looking at duplicate records in master patient indexes

                                     UUHSC    Friedman
Extra names and titles               34.3%    36.9%
Nicknames, spelling variations       21.8%    13.9%
One letter substitutions             13.6%    13.7%
One letter added or deleted           7.6%    12.9%
Punctuation or spaces                 1.9%    11.8%
Different last names for females     12.9%     7.8%
Permuted parts of names               3.2%     1.4%
Different first names                 2.8%     1.4%
One letter transposed                 1.9%     0.8%

                                     UUHSC    Grannis
Missing SSN                          52.4%    35%
Typographical errors                 62.7%    35.5%
Spouse (family) collisions           14.8%    47.5%
Unexplained collisions                9.9%    17%
All collisions                       24.7%    64.5%
Invalid SSN                          12.6%     0%

Extension of the Probabilistic Model for Approximate Field Comparators

Probabilistic Model

Field in Record A = Field in Record B → Agreement Weight

Field in Record A ≠ Field in Record B → Disagreement Weight

M – probability that field matches in dup pair

U – probability that field matches in non-dup pair

Agreement Weight = LOG(M / U)

Disagreement Weight = LOG((1 - M) / (1 - U))
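The two weights can be computed directly from M and U. The sketch below assumes base-2 logarithms (the slide only says LOG), and the field names and M and U values in PARAMS are hypothetical, chosen only to show how field weights add up to a pair score.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when the field agrees: LOG(M / U)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added when the field disagrees: LOG((1 - M) / (1 - U))."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical (M, U) values per field, for illustration only.
PARAMS = {
    "last_name": (0.90, 0.01),
    "birth_date": (0.95, 0.003),
    "sex": (0.98, 0.50),
}


def pair_score(agreements: dict) -> float:
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in PARAMS.items():
        if agreements[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total
```

A field that agrees far more often in duplicate pairs than in non-duplicate pairs (large M, small U) gets a large positive agreement weight, while a field such as sex, which agrees by chance about half the time, contributes little when it agrees.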

Field in Record A ≈ Field in Record B?

Approximate Comparator

Edit Distance

ED( Johnathan, Jonathan ) = 1

Approximate Comparator Weight = LOG(Mδ / Uδ)

Mδ – probability that field approximately matches by δ in dup pair

Uδ – probability that field approximately matches by δ in non-dup pair
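One way to estimate Mδ and Uδ is to count how often each edit distance δ occurs among known duplicate and non-duplicate pairs. The sketch below rests on several assumptions not stated in the slides: "approximately matches by δ" is read as "edit distance equals δ", distances above max_delta are pooled into one bucket, add-one smoothing avoids zero probabilities, and logarithms are base 2. The function names are illustrative.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def comparator_weights(dup_pairs, non_dup_pairs, max_delta=2):
    """Return {delta: LOG(M_delta / U_delta)} for delta = 0..max_delta+1.

    The last bucket (max_delta + 1) pools every larger distance.
    """
    buckets = range(max_delta + 2)

    def distribution(pairs):
        counts = Counter(min(edit_distance(a, b), max_delta + 1)
                         for a, b in pairs)
        # Add-one smoothing (an assumption) keeps every probability positive.
        total = len(pairs) + len(buckets)
        return {d: (counts[d] + 1) / total for d in buckets}

    m = distribution(dup_pairs)
    u = distribution(non_dup_pairs)
    return {d: math.log2(m[d] / u[d]) for d in buckets}


# Tiny hand-made example pairs, for illustration only.
dups = [("Johnathan", "Jonathan"), ("Smith", "Smith"),
        ("Smith", "Smyth"), ("Anne", "Anne")]
non_dups = [("Smith", "Jones"), ("Anne", "Robert"),
            ("Lee", "Li"), ("Garcia", "Nguyen")]
weights = comparator_weights(dups, non_dups)
```

In this toy data, small distances are more common among duplicates, so they receive positive weights, and large distances receive negative weights. Real estimates would need far more pairs than this.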

Training, first iteration:

1) Initial parameters
2) Load and randomize training set
3) Classify with estimated parameters
4) Estimate Dups and Non-Dups
5) Update parameters

Training, later iterations:

1) Updated parameters
2) Load and randomize training set
3) Classify with updated parameters
4) Re-estimate Dups and Non-Dups
5) Update parameters

Validation:

1) Training set parameters
2) Load and randomize validation set
3) Classify with training set parameters
4) Classified Dups and Non-Dups
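The training and validation flow above can be sketched as an iterative loop. The slides do not name the estimation algorithm, so this is only one plausible reading: each record pair is reduced to a tuple of per-field agreement flags, pairs are classified as Dup when their summed weight exceeds a threshold of 0, and M and U are re-estimated as smoothed agreement rates within the classified groups. All names, the threshold, the smoothing, and the sample data are assumptions.

```python
import math
import random


def score(pattern, m, u):
    """Sum of agreement or disagreement weights over all fields."""
    total = 0.0
    for agrees, mi, ui in zip(pattern, m, u):
        if agrees:
            total += math.log2(mi / ui)
        else:
            total += math.log2((1 - mi) / (1 - ui))
    return total


def classify(patterns, m, u, threshold=0.0):
    """True marks a pair classified as Dup, False as Non-Dup."""
    return [score(p, m, u) > threshold for p in patterns]


def update(patterns, labels, n_fields):
    """Re-estimate M and U from the currently classified groups."""
    def rates(group):
        # Add-one smoothing keeps every rate strictly between 0 and 1.
        return [(sum(p[i] for p in group) + 1) / (len(group) + 2)
                for i in range(n_fields)]
    dups = [p for p, is_dup in zip(patterns, labels) if is_dup]
    non_dups = [p for p, is_dup in zip(patterns, labels) if not is_dup]
    return rates(dups), rates(non_dups)


def train(patterns, m, u, iterations=5, seed=0):
    """Load and randomize the training set, then classify and update."""
    patterns = list(patterns)
    random.Random(seed).shuffle(patterns)
    for _ in range(iterations):
        labels = classify(patterns, m, u)
        m, u = update(patterns, labels, len(m))
    return m, u


# Hypothetical agreement patterns over three fields.
training = ([(True, True, True)] * 4 + [(True, True, False)] * 2
            + [(False, False, False)] * 5 + [(False, True, False)] * 2
            + [(True, False, False)])
validation = [(True, True, True), (False, False, False), (True, True, False)]

m, u = train(training, m=[0.9, 0.9, 0.9], u=[0.1, 0.1, 0.1])
validation_labels = classify(validation, m, u)
```

The validation step reuses the parameters learned on the training set without updating them, which mirrors the last diagram.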

Questions?
