Andrew Borthwick, PhD§
Vikki Papadouka, PhD, MPH*
Deborah Walker, PhD*
*New York City Department of Health
vpapadou@dohlan.cn.ci.nyc.ny.us
dwalker@dohlan.cn.ci.nyc.ny.us
§ChoiceMaker Technologies, Inc.
Andrew.Borthwick@choicemaker.com
Adapted from a presentation at the 34th National Immunization Conference
Washington, DC
July 7, 2000
The NY Citywide Immunization Registry’s MEDD De-Duplication Project
ChoiceMaker Technologies
New York Citywide Immunization Registry: The MEDD De-duplication Project
The NYC CIR
New York Citywide Immunization Registry was mandated in January 1997
All health-care providers are required to submit immunizations
Goals of the system:
Doctors look up kids’ immunization statuses to determine which shots to give
Notify parents when their children are due for an appointment
Identify citywide immunization trends
Similar registries are being built at the state and local level around the country
NYC CIR Background
About 122,000 children are born in NYC every year
Each month the CIR receives 50,000-100,000 patient records and 80,000-200,000 immunization records
From >1,100 institutions and private providers
Given this volume, hand-matching each new record before it enters the CIR is unrealistic
NYC CIR: Background
Contains 1.8 million records
Very high duplication rate, estimated at 3 records for every 2 children, because of very strict criteria for automatic merging
During April-September 1998 CIR staff reviewed and manually de-duplicated about 260,000 record pairs, which took 1,700 hours
MEDD: What it is
A system for deciding when two records represent the same child
Fast and accurate
Replicates the human decision-making process
MEDD’s Decision-Making Process
For every record pair, MEDD computes a probability between 0 and 100% that the pair should be merged
High probabilities: “merge”
Low probabilities: “don’t merge”
Intermediate probabilities (close to 50%) indicate “don’t know” and require human review
Thresholds dividing the merge / don’t know / don’t merge cases are set by the user
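The three-way decision described on this slide can be sketched as a small function. This is only an illustration: the function name and both default thresholds are assumptions, since the deck says the thresholds are chosen by the user.

```python
def classify(p_merge: float, merge_at: float = 0.98, no_merge_at: float = 0.02) -> str:
    """Map a merge probability (0.0 to 1.0) to one of three decisions.

    merge_at and no_merge_at are hypothetical user-set thresholds.
    """
    if not 0.0 <= p_merge <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_merge >= merge_at:
        return "merge"
    if p_merge <= no_merge_at:
        return "don't merge"
    return "don't know"  # intermediate probability: send to human review
```

Moving the two thresholds closer to 50% lowers the human-review workload at the cost of more automatic errors, which is the trade-off the user controls.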
Maximum Entropy Modeling
MEDD uses “Maximum Entropy Modeling”
A new statistical decision-making technique
Learns the human judgment process by training from examples
Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
Has achieved state-of-the-art results on these problems
Maximum Entropy Modeling: Features
Maximum Entropy uses “Features”
Feature = a function which looks at specific fields in the pair of records to make a “merge” or “don’t merge” decision
MEDD has many different features, each of which is assigned a “weight” during training
Sample MEDD Features
Mother’s Birthday
Match of Mom’s B’day predicts “Merge”
Mismatch of Mom’s B’day predicts “No-Merge”
Neither feature fires if Mom’s B’day wasn’t filled in on both records
We have no evidence in this case
Many other features
Child’s birthday
Child’s first and last name
Medicaid Number
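As a rough illustration of the Mother’s Birthday feature, the sketch below treats a feature as a function of two records that returns a prediction or nothing at all. The record layout (a dict with a `mother_dob` key) and the function name are assumptions made for this example, not MEDD’s actual data model.

```python
from typing import Optional


def mothers_birthday_feature(rec1: dict, rec2: dict) -> Optional[str]:
    """Return "merge", "no-merge", or None when the feature does not fire."""
    a = rec1.get("mother_dob")
    b = rec2.get("mother_dob")
    if not a or not b:
        # Missing on at least one record: no evidence either way.
        return None
    return "merge" if a == b else "no-merge"
```

Returning `None` for incomplete data keeps a blank field from being counted as a mismatch.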
Training the System
[Diagram: record pairs hand-marked with merge/no-merge decisions, together with a set of features, feed the Maximum Entropy Parameter Estimator, which produces a weight for each feature]
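The deck does not say which estimation algorithm the Maximum Entropy Parameter Estimator uses. As a stand-in, the sketch below fits one weight per feature by plain batch gradient ascent on the log-likelihood of the hand-marked decisions, using the probability rule from the Probability Computation slide (product of merge weights over the sum of both products). The function name, data layout, learning rate, and epoch count are all assumptions.

```python
import math


def train_weights(examples, directions, epochs=500, lr=0.1):
    """Fit one multiplicative weight per feature.

    examples:   list of (fired_feature_names, label), label 1 = merge, 0 = no-merge
    directions: feature name -> +1 (predicts merge) or -1 (predicts no-merge)
    """
    lam = {f: 0.0 for f in directions}  # log-weights; weight = exp(lam)
    for _ in range(epochs):
        grad = {f: 0.0 for f in directions}
        for fired, label in examples:
            # log(Merge) - log(NoMerge) for this pair
            score = sum(directions[f] * lam[f] for f in fired)
            p_merge = 1.0 / (1.0 + math.exp(-score))  # = Merge / (Merge + NoMerge)
            for f in fired:
                grad[f] += (label - p_merge) * directions[f]
        for f in lam:
            lam[f] += lr * grad[f]
    return {f: math.exp(v) for f, v in lam.items()}


# Hypothetical hand-marked pairs, for illustration only.
demo = [({"dob_match", "zip_match"}, 1), ({"dob_match", "phone_mismatch"}, 0)]
demo_weights = train_weights(demo, {"dob_match": 1, "zip_match": 1, "phone_mismatch": -1})
```

A weight above 1 means the feature is evidence for its own prediction; a weight near 1 means it carries little information.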
Probability Computation
For a pair of records, MEDD computes the probability that the pair should be merged as:
\[ \frac{\text{Merge}}{\text{Merge} + \text{NoMerge}} \]
Merge = product of weights of all features predicting “merge” for the pair
NoMerge = product of weights of all features predicting “no merge” for the pair
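The computation on this slide is short enough to write out directly. The sketch below is a minimal version; the function name is an assumption, and the weights in the usage lines are the ones shown for the Smith record pair in the first worked example of this deck.

```python
from math import prod


def merge_probability(merge_weights, no_merge_weights):
    """Probability that a pair should be merged, from the fired feature weights."""
    merge = prod(merge_weights)        # product of weights predicting "merge"
    no_merge = prod(no_merge_weights)  # product of weights predicting "no merge"
    return merge / (merge + no_merge)  # an empty list contributes a product of 1


smith_merge = [1.153, 4.708, 1.138, 4.342, 1.103, 3.013, 6.587]
smith_no_merge = [1.350, 2.130]
print(round(merge_probability(smith_merge, smith_no_merge), 3))  # 0.995
```

Because the weights multiply, one strongly weighted feature can outweigh several weak ones pointing the other way.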
High Probability. Human Decision: Merge
Field Name | Record 1 | Record 2 | Feature | Weight | Prediction
Last name | Smith | Smith | Match | 1.153 | Merge
First name | Emily | Emely | No-match | 1.350 | No-merge
First name | Emily | Emely | Soundex | 4.708 | Merge
DOB | [04/28/97] | [04/28/97] | Match | 1.138 | Merge
Multiple birth | N | N | | |
Mom’s Maiden Name | CRUZ (on one record only; no feature fires)
Mother’s DOB | 12/04/76 (on one record only; no feature fires)
Street | 4528 3rd Ave | 4528 3rd Ave | Match | 4.342 | Merge
City | Bronx | Bronx | Match | 1.103 | Merge
State | NY | NY | | |
Zip | 10462 | 10462 | Match | 3.013 | Merge
Phone | 718-123-4567 | 718-123-6789 | No-match | 2.130 | No-merge
Med Rec Number | 11856437503 | 11856437503 | Match | 6.587 | Merge
Merge total = 587.2
No-merge total = 2.9
\[ \frac{587.2}{587.2 + 2.9} \approx 0.995 \]
MEDD predicts “Merge” with 99.5% confidence
Low Probability. Human Decision: No-Merge
Field Name | Record 1 | Record 2 | Feature | Weight | Prediction
Last name | Lopez | Lopez | Match | 1.153 | Merge
First name | Girl | Susan | | |
DOB | [1/11/97] | [1/2/97] | No-match | 28.949 | No-merge
Multiple birth | N | N | | |
Mom’s Maiden Name | Lopez (on one record only; no feature fires)
Mother’s DOB | | | | |
Street | 987 Cornelia | 456 Park | No-match | 2.937 | No-merge
City | Brooklyn | Brooklyn | Match | 1.103 | Merge
State | NY | NY | | |
Zip | 11211 | 11211 | Match | 3.013 | Merge
Phone | 718-123-4567 | 718-234-5678 | No-match | 2.130 | No-merge
Med Rec Number | 1001002 | 567435 | | |
Merge total = 3.8
No-merge total = 181.1
\[ \frac{3.8}{3.8 + 181.1} \approx 0.021 \]
MEDD predicts “No-merge” with 97.9% confidence
Intermediate Probability. Human Decision: Merge
Field Name | Record 1 | Record 2 | Feature | Weight | Prediction
Last name | Hernandez | Hernandez | Match | 1.153 | Merge
First name | Boy | David | | |
DOB | [2/14/97] | [2/14/97] | Match | 1.138 | Merge
Multiple birth | N | N | | |
Mom’s Maiden Name | Hernandez (on one record only; no feature fires)
Mother’s DOB | 11/4/78 (on one record only; no feature fires)
Street | 142 4th Ave | 142 4th Ave | Match | 4.342 | Merge
City | Bronx | Bronx | Match | 1.103 | Merge
State | NY | NY | | |
Zip | 11051 | 11052 | No-match | 2.551 | No-merge
Phone | 718-524-4879 | 718-524-4878 | No-match | 2.130 | No-merge
Med Rec Number | 1001002 | 567435 | | |
Merge total = 6.3
No-merge total = 5.4
\[ \frac{6.3}{6.3 + 5.4} \approx 0.539 \]
Predicts “Merge” with 53.9% confidence (Human review)
Sophisticated MEDD features: Name Frequency
“Rodriguez” is 9 times more common than “Walker” in NYC
Fewer than 3 kids per year are born with the names “Borthwick” and “Papadouka”
Hence we build features categorizing names as “very common”, “somewhat common”, “very rare”, etc.
Given that we have a name match, the fact that the names are very common is a feature predicting “don’t merge”
A match between rare names is a feature predicting “merge”
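One way such name-frequency features could be built is to bucket each name by how often it occurs and fire a different feature per bucket when the names match. In the sketch below the function names, the bucket cut-offs, and the birth counts in the usage example are all hypothetical; only the 9:1 ratio of Rodriguez to Walker and the "fewer than 3 per year" figure come from the slide.

```python
def frequency_bucket(name, births_per_year, rare_below=3, common_from=500):
    """Categorize a name by its yearly count (cut-offs are assumptions)."""
    count = births_per_year.get(name.strip().lower(), 0)  # unseen names count as 0
    if count < rare_below:
        return "very rare"
    if count >= common_from:
        return "very common"
    return "somewhat common"


def name_frequency_feature(name1, name2, births_per_year):
    """Fire a bucket-specific feature only when the two names match."""
    if name1.strip().lower() != name2.strip().lower():
        return None
    return "last name match, " + frequency_bucket(name1, births_per_year)


# Invented counts that merely respect the 9:1 ratio quoted on the slide.
counts = {"rodriguez": 900, "walker": 100, "borthwick": 2}
print(name_frequency_feature("Rodriguez", "RODRIGUEZ", counts))
print(name_frequency_feature("Borthwick", "Borthwick", counts))
```

Each bucket-specific feature would then receive its own weight in training, so a match on a rare name can count for more than a match on a common one.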
Sophisticated MEDD features: Partial Name Match
Soundex: A phonetic representation of names
Connor = Conor = Conner = CNR
When the Soundex representation of two names matches, a feature fires predicting “merge”
Edit Distance: Features firing based on two names having an edit distance of 1
Example: “Borthwich” and “Bortwick” are each one edit away from “Borthwick”
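Both partial-match tests can be sketched in a few lines. The code below uses standard American Soundex and Levenshtein edit distance as plausible stand-ins; the deck does not say which exact variants MEDD uses. Note that the slide writes the shared code informally as CNR, whereas standard Soundex yields C560 for all three spellings.

```python
# Standard Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeat across a vowel is kept
    return (letters[0].upper() + "".join(out) + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


print(soundex("Connor"), soundex("Conor"), soundex("Conner"))  # C560 C560 C560
print(edit_distance("Borthwich", "Borthwick"))  # 1
```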
Special Situation Features
Every database has its quirks
HMO XYZ always sends its data to the CIR with Day of Birth = “1”
Birthday = July 1, 1998 not July 15, 1998
We have a special feature:
If Provider = “HMO XYZ” AND Day of Birth = 1 AND dates differ only on day of birth, THEN predict merge
We plan to allow users to define these types of features themselves
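The HMO XYZ rule translates almost word for word into code. The sketch below is one possible reading of it; the record layout (dicts with `provider` and `dob` keys) and the function name are assumptions for illustration.

```python
from datetime import date


def hmo_day_one_feature(rec1, rec2, provider="HMO XYZ"):
    """Predict "merge" when a placeholder day of 1 explains the DOB difference."""
    for placeholder, other in ((rec1, rec2), (rec2, rec1)):
        d1, d2 = placeholder["dob"], other["dob"]
        same_month_and_year = (d1.year, d1.month) == (d2.year, d2.month)
        if (placeholder["provider"] == provider and d1.day == 1
                and same_month_and_year and d1 != d2):
            return "merge"
    return None  # feature does not fire


from_hmo = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
from_clinic = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
print(hmo_day_one_feature(from_hmo, from_clinic))  # merge
```

Checking both orderings means the feature fires regardless of which record in the pair came from the quirky provider.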
Test Procedure
MEDD was tested on about 3,000 pairs under NYC DOH supervision
Pairs were carefully hand-scored by NYC DOH as Merge/Don’t Merge
ChoiceMaker never saw the test data
MEDD Evaluation Results
Requested Accuracy | % of Records Needing Human Review
1% False Positive, 1% False Negative | 1.4%
0.5% False Positive, 0.5% False Negative | 2.6%
0.3% False Positive, 0.3% False Negative | 3.2%
Even with double-checking, human error rate is no better than 0.3%
Summary: What MEDD Offers
Can be trained on just 3,000 record pairs
Judges nearly 1,000 record-pairs per second
Achieves very high accuracy by finding the optimal weighting of the different clues (“features”) indicating merge/don’t merge
Says “merge”, “don’t merge”, or “I don’t know”
Can be rigorously tested
Registry management can make informed judgments regarding the effort vs. accuracy trade-off
The 5 Stages of the De-duplication Process
1. “Blocking”: Identify list of possible duplicates (SmartSearch)
2. “Decision-Making”: Identify a definitive list of duplicate records (MEDD)
3. Human Review of
   a. Records marked as “don’t know” by MEDD
   b. Records held by special filters (twins, scanty records, etc.)
4. Linkage: Link records that belong to the same child together (if A=B and B=C then A=C)
5. Update the CIR
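The linkage rule in stage 4 (if A=B and B=C then A=C) amounts to grouping records into connected sets. A union-find structure is one common way to do that; the sketch below assumes it as an illustration and is not a description of how the CIR implements linkage.

```python
def link_records(record_ids, merged_pairs):
    """Group records so that every chain of merge decisions ends up in one set."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in merged_pairs:
        parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


print(link_records(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
# [['A', 'B', 'C'], ['D']]
```

Because grouping is transitive, a single wrong merge decision can join two otherwise separate children, which is one reason to keep the automatic-merge threshold conservative.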
Project Avalanche
Project Avalanche: A project by which we systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria
Uses our querying tool SmartSearch and our de-duplication tool MEDD
Project Avalanche I: February-April 2000
Project Avalanche II: May-July 2000
Project Avalanche I
Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as:
Exact match on DOB + Medical Record, or
Exact match on Medicaid number, or
First name + gender + DOB + last name = maiden name (and vice versa), or
Last name + First name + DOB
Used 98% as the cut-off for automatic merging
Hand-reviewed records produced by the filters
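Blocking of this kind is often implemented by giving each record one key per criterion and pairing up records that share any key. The sketch below assumes that approach; it is not SmartSearch’s actual implementation. It covers three of the four criteria listed (the last name = maiden name cross-match is omitted for brevity), and the field names and sample values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(rec):
    """One key per blocking criterion the record has enough data for."""
    keys = []
    if rec.get("dob") and rec.get("med_rec"):
        keys.append(("dob+medrec", rec["dob"], rec["med_rec"]))
    if rec.get("medicaid"):
        keys.append(("medicaid", rec["medicaid"]))
    if rec.get("last") and rec.get("first") and rec.get("dob"):
        keys.append(("name+dob", rec["last"].lower(), rec["first"].lower(), rec["dob"]))
    return keys


def candidate_pairs(records):
    """Pairs of record ids sharing at least one blocking key."""
    buckets = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            buckets[key].append(rec["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "first": "Emily", "last": "Smith", "dob": "1997-04-28", "med_rec": "11856437503"},
    {"id": 2, "first": "Emely", "last": "Smith", "dob": "1997-04-28", "med_rec": "11856437503"},
    {"id": 3, "first": "Emily", "last": "Smith", "dob": "1997-04-28"},
    {"id": 4, "first": "Susan", "last": "Lopez", "dob": "1997-01-02", "medicaid": "X1"},
]
print(candidate_pairs(records))  # [(1, 2), (1, 3)]
```

Only the pairs returned here would be scored by the decision-making stage, which is what keeps a full-registry comparison tractable.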
Project Avalanche I: Results
Cohort | # of Records Before A1 | # of Records After A1 | # Dups* | Dups removed (#) | Dups removed (%)
1996 | 203,000 | 187,000 | 68,000 | 16,000 | 25
1997 | 216,000 | 195,000 | 81,000 | 21,000 | 26
1998 | 208,000 | 184,000 | 73,000 | 24,000 | 32
1999 | 158,000 | 143,000 | ? | 15,000 | ?
TOTAL | 785,000 | 709,000 | 223,000 | 77,000 | avg=28
* Estimated
Project Avalanche II
In April 2000 we loaded 4 months’ worth of data that had been held due to Y2K problems
Used more liberal blocking criteria:
Medical Record Number +
month and year of DOB, or
day and year of DOB, or
day and month of DOB, or
first name
Used 90% as the cut-off for automatic merging
Currently hand-reviewing records produced by the filters
Project Avalanche II: Results
Cohort | # of Records Before A2 | # of Records After A2 | # Dups* | Dups removed (#) | Dups removed (%)
1996 | 190,000 | 182,000 | 55,000 | 9,000 | 16
1997 | 196,000 | 183,000 | 61,000 | 13,000 | 22
1998 | 206,000 | 182,000 | 71,000 | 24,000 | 34
1999 | 210,000 | 182,000 | 75,000 | 28,000 | 37
TOTAL | 802,000 | 728,000 | 262,000 | 74,000 | avg=27
* Estimated
Project Avalanche: Discussion
Using a very conservative cut-off for automatic merging we reduced the duplicates by about 27.5% each time, more than 30% including human review
As a result of Project Avalanche 81% of records now have immunizations vs. 58% 6 months ago
Since MEDD is not yet implemented on the front end of the CIR, you don’t see the total number of duplicates decreasing over time in these early runs
Future of MEDD at the CIR
As part of the Lead and CIR integration MEDD will be inserted on the front end, thus reducing the number of duplicates being created
Improving MEDD’s performance will enable us to automatically merge more duplicates with the same error rate
Will continue with Project Avalanche until we bring the duplication rate down to an acceptable level
Summary: ChoiceMaker Status
Currently have two employees
Andrew Borthwick, Ph.D.
Prof. Arthur Goldberg
Have several major contracts with New York City Dept. of Health
Good prospects of finding similar work with other state and municipal health departments
Summary: De-duplication Marketplace
Immunization Registries have very difficult duplicate record problems
Many others have similar problems
Medical researchers (correlating birth certificate and maternal death records)
Banks, phone companies (correlating clients from different lines of business)
Direct marketers (merging mailing lists)
Summary: ChoiceMaker’s Plans
Do further research to decrease the amount of consulting time needed to deploy MEDD
Seeking first-round investors to fund expansion of R&D and marketing
Have openings for someone with an M.S. in C.S. or similar qualifications, starting 10/1/2000, and for a C.S. Ph.D., starting 11/1/2000