Biolink NL
A national infrastructure for linkage of biobanks
to medical and socioeconomic registries
Adelaide Ariel
SHIP Conference 28th-30th August 2013
2
The Dutch Biolink Project (Biolink NL)
Main goals:
To improve the efficiency and quality of linkage of biobanks to medical and socioeconomic registries, in conformity with statutory and consent obligations to participants;
To set up a national infrastructure to enable these linkages
The Biolink Project is a collaboration project of Dutch universities, University Medical Centers, Statistics Netherlands, and health care institutions.
www.biolink-nl.eu
3
Linking Challenges in the Biolink NL
Unique identifier is lacking
Linking would be performed on personal identifiers
Privacy concerns
Surname might not be allowed for use
Personal identifiers have to be encrypted
Both availability and quality of the personal identifiers may vary across registries
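One way to meet the encryption requirement while keeping exact comparison possible is a keyed hash. The sketch below is an illustration only: the source does not say which encryption scheme Biolink NL uses, and the `pseudonymise` function, the normalisation rules and the key are all assumptions.

```python
import hashlib
import hmac
import unicodedata


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of a normalised identifier (hypothetical scheme)."""
    # Normalise so that trivial formatting differences do not change the digest.
    normalised = unicodedata.normalize("NFKD", value).strip().upper().replace(" ", "")
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"shared-secret-held-by-a-trusted-party"  # placeholder, not a real key

a = pseudonymise("1234 ab", key)  # postal code with different formatting
b = pseudonymise("1234AB", key)
c = pseudonymise("1234AC", key)   # one character differs
```

Here `a` and `b` are identical while `c` is unrelated to both: equal values can still be matched exactly, but a single typing error changes the whole digest, so similarity-based comparison needs extra care on identifiers encoded this way.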
4
Linking Approaches in the Biolink NL
Personal identifiers as linking variables: surname, date of birth, sex, postal code
Take into consideration: surname might not be allowed for use
Research questions:
• Which personal identifier is a ‘must’?
• In which situations does a deterministic or a probabilistic method perform best?
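To make the question of a "must" identifier concrete, a deterministic link can be sketched as an exact join on a chosen subset of identifiers. This is a minimal illustration with made-up records; the field names and the `deterministic_link` helper are hypothetical and do not represent the project's implementation.

```python
from collections import defaultdict


def deterministic_link(records_a, records_b, keys):
    """Link records that agree exactly on every field in `keys`."""
    # Index file B by the tuple of key values for fast lookup.
    index = defaultdict(list)
    for rec in records_b:
        index[tuple(rec[k] for k in keys)].append(rec["id"])
    links = []
    for rec in records_a:
        for id_b in index.get(tuple(rec[k] for k in keys), []):
            links.append((rec["id"], id_b))
    return links


# Made-up example records; A1 and B1 differ only in the spelling of the surname.
biobank = [
    {"id": "A1", "surname": "JANSEN", "dob": "1970-03-12", "sex": "F", "postal_code": "1011AB"},
    {"id": "A2", "surname": "DE VRIES", "dob": "1982-11-02", "sex": "M", "postal_code": "3512JK"},
]
registry = [
    {"id": "B1", "surname": "JANSSEN", "dob": "1970-03-12", "sex": "F", "postal_code": "1011AB"},
    {"id": "B2", "surname": "DE VRIES", "dob": "1982-11-02", "sex": "M", "postal_code": "3512JK"},
]

with_surname = deterministic_link(biobank, registry, ["surname", "dob", "sex", "postal_code"])
without_surname = deterministic_link(biobank, registry, ["dob", "sex", "postal_code"])
```

In this toy example, dropping the surname recovers the pair with the spelling variant. In larger files the weaker key would also increase the chance of false links, which is the trade-off the research questions address.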
5
Project Approach
Development
• Conduct a literature survey on record linkage methodology & applications
• Develop a prototype for the linkage strategy by using simulated data
Testing
• Test the linkage strategy on real data
Evaluation
• Evaluate the linking results by means of:
  • another identifier (encrypted Dutch-ID)
  • a content variable (content validation)
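When a trusted identifier is available in both files, the evaluation step can be expressed as a comparison of the links found against the links implied by that identifier. The sketch below assumes precision and recall as quality measures; the source does not name the measures used, and the `true_id_*` mappings are hypothetical stand-ins for the encrypted Dutch-ID.

```python
def evaluate_links(found_links, true_id_a, true_id_b):
    """Compare found (id_a, id_b) pairs with pairs sharing the same trusted identifier."""
    # Group file B record IDs by their trusted identifier.
    by_true_id = {}
    for id_b, tid in true_id_b.items():
        by_true_id.setdefault(tid, []).append(id_b)
    # All pairs that truly belong together according to the trusted identifier.
    true_links = {
        (id_a, id_b)
        for id_a, tid in true_id_a.items()
        for id_b in by_true_id.get(tid, [])
    }
    found = set(found_links)
    correct = found & true_links
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall


# Made-up example: two true pairs exist; one is found, and one found link is wrong.
true_id_a = {"A1": "x1", "A2": "x2", "A3": "x3"}
true_id_b = {"B1": "x1", "B2": "x2", "B9": "x9"}
found_links = [("A1", "B1"), ("A3", "B9")]

precision, recall = evaluate_links(found_links, true_id_a, true_id_b)
```

Precision here is the share of found links that are correct, and recall is the share of true links that were found.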
6
Current Presentation
Project stages: Development, Testing, Evaluation
Develop a prototype for linkage strategy by using simulated data.
Real data were used as blueprints for simulated data.
Overview:
• Our motivations
• Factors considered in the simulation
• Findings
• Prototype for the linkage strategy
7
Using Simulated Data
Our motivations:
We want to experiment with different approaches without violating privacy requirements.
The simulated data sets are modelled after the real data sets.
We want to include “what-if” scenarios:
• What if not all identifiers are available for linking?
• What if the number of shared records is small?
• What if the error rate is high?
8
Factors Considered for the Simulation
The linkages in the Biolink NL deal with registries of varying size and population covered
[Diagram: registries of varying size and coverage: Pathology Data, Cancer Registry, General Population Registry, Female Cohort, Children Cohort]
9
Factors Considered for the Simulation
The amount of shared records (overlap) may vary
[Diagram: two example pairs, Cancer Registry with General Population Registry and Cancer Registry with Female Cohort, labelled "Large Overlap" and "Small Overlap"]
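Overlap can be controlled in a simulation by drawing both files from one synthetic population. A minimal sketch follows, in which the `simulate_pair` function, the file sizes and the definition of overlap (the share of the smaller file that also appears in the other file) are assumptions made for illustration.

```python
import random


def simulate_pair(population_size, size_a, size_b, overlap, seed=0):
    """Draw two files of person IDs where `overlap` is the fraction of the
    smaller file that also appears in the other file."""
    rng = random.Random(seed)
    population = list(range(population_size))
    rng.shuffle(population)
    n_shared = round(overlap * min(size_a, size_b))
    # Consecutive, non-overlapping slices of the shuffled population.
    shared = population[:n_shared]
    only_a = population[n_shared:n_shared + (size_a - n_shared)]
    start_b = n_shared + len(only_a)
    only_b = population[start_b:start_b + (size_b - n_shared)]
    return set(shared) | set(only_a), set(shared) | set(only_b)


large = simulate_pair(10_000, size_a=1_000, size_b=5_000, overlap=0.9)
small = simulate_pair(10_000, size_a=1_000, size_b=5_000, overlap=0.1)
```

With these settings the two files share 900 persons in the first scenario and 100 in the second, while the file sizes stay the same.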
10
Factors Considered for the Simulation
Personal identifiers are not 100% accurate or consistent, for instance due to:
• Typing errors
• Change of address
• Use of different surnames (married vs maiden name)
We vary the amount of errors up to 30%
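The error mechanism can be sketched as corrupting a chosen fraction of identifier values. The example below is an illustration under stated assumptions: it models only single-character typing errors in surnames, and the `add_typo` and `corrupt` helpers are hypothetical.

```python
import random
import string


def add_typo(value, rng):
    """Replace one character with a different uppercase letter."""
    pos = rng.randrange(len(value))
    replacement = rng.choice([c for c in string.ascii_uppercase if c != value[pos]])
    return value[:pos] + replacement + value[pos + 1:]


def corrupt(values, error_rate, seed=0):
    """Return a copy of `values` with typos in round(error_rate * n) entries."""
    rng = random.Random(seed)
    n_errors = round(error_rate * len(values))
    positions = set(rng.sample(range(len(values)), n_errors))
    return [add_typo(v, rng) if i in positions else v for i, v in enumerate(values)]


surnames = ["JANSEN", "DEVRIES", "BAKKER", "VISSER", "SMIT"] * 20  # 100 made-up values
noisy = corrupt(surnames, error_rate=0.30)
n_changed = sum(a != b for a, b in zip(surnames, noisy))
```

Because each typo always substitutes a different letter, exactly 30 of the 100 values differ from the originals at a 30% error rate. Address changes and married versus maiden names would need their own mechanisms.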
11
Linking Methods
Preferably practical and applicable for encrypted identifiers.
Deterministic linkage method:
• Partial matching
Probabilistic linkage method:
• Simple probabilistic
• Jaro-Winkler
• Bigram
Implemented in SAS 9.2 and RecordLinkage (R package)
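As an illustration of the string comparison behind a bigram method, the sketch below computes a Dice coefficient on character bigrams and feeds it into Fellegi-Sunter style match weights. This is not the SAS or RecordLinkage implementation; the agreement threshold of 0.7 and the m- and u-probabilities are assumed values chosen only for the example.

```python
import math


def bigrams(s):
    """Return the list of adjacent character pairs in `s`."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_similarity(a, b):
    """Dice coefficient on character bigrams: 2 * shared / total bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        # Strings shorter than two characters: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    remaining = list(bb)
    shared = 0
    for g in ba:
        if g in remaining:
            remaining.remove(g)
            shared += 1
    return 2 * shared / (len(ba) + len(bb))


def match_weight(rec_a, rec_b, m, u, threshold=0.7):
    """Sum log2 agreement/disagreement weights over the fields in `m`.
    m[f]: assumed P(agree | true match); u[f]: assumed P(agree | non-match)."""
    total = 0.0
    for field in m:
        agree = bigram_similarity(rec_a[field], rec_b[field]) >= threshold
        if agree:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Assumed probabilities, for illustration only.
m = {"surname": 0.90, "dob": 0.95, "sex": 0.99, "postal_code": 0.85}
u = {"surname": 0.01, "dob": 0.001, "sex": 0.50, "postal_code": 0.001}

a1 = {"surname": "JANSEN", "dob": "1970-03-12", "sex": "F", "postal_code": "1011AB"}
b1 = {"surname": "JANSSEN", "dob": "1970-03-12", "sex": "F", "postal_code": "1011AB"}
b2 = {"surname": "DE VRIES", "dob": "1982-11-02", "sex": "M", "postal_code": "3512JK"}

w_match = match_weight(a1, b1, m, u)
w_nonmatch = match_weight(a1, b2, m, u)
```

The spelling variant JANSEN/JANSSEN still counts as agreement, so the first pair receives a positive total weight and the second a negative one. Pairs above a chosen cut-off would be treated as links.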
12
Simulation Findings (1)
The identifier date of birth should be included.
13
Simulation Findings (2)
Used together, the deterministic and probabilistic methods can help estimate the possible overlap size.
14
Simulation Findings (3)
The deterministic method appears to be more suitable for:
• Small overlap size (< 60%)
The probabilistic method appears to perform best when the following conditions are met:
• Large overlap size (more than 60%)
• All identifiers are used as linkage variables
15
Linking Strategy
[Flowchart: decision tree for choosing a linkage method. Decision nodes: "Less than 20,000 records?", "Include surname?", "Possible overlap size < 50%?", each with Yes/No branches. Outcomes: Deterministic or Probabilistic.]
16
Next Steps
The following linkages will be chosen for testing and evaluation:
• A Dutch female cohort – the Dutch Cancer Registry
• Dutch twin-children cohort – Health Insurance Database
• Dutch children cohort – the Dutch National Pharmacy Database