Biolink NL A national infrastructure for linkage of biobanks to medical and socioeconomic registries Adelaide Ariel SHIP Conference 28th-30th August 2013

Upload: nigel-walters

Post on 19-Jan-2016


Page 1

Biolink NL

A national infrastructure for linkage of biobanks to medical and socioeconomic registries

Adelaide Ariel

SHIP Conference 28th-30th August 2013

Page 2

The Dutch Biolink Project (Biolink NL)

Main goals:

To improve the efficiency and quality of linkage of biobanks to medical and socioeconomic registries, in conformity with statutory and consent obligations to participants.

To set up a national infrastructure to enable these linkages.

The Biolink Project is a collaboration between Dutch universities, University Medical Centers, Statistics Netherlands, and health care institutions.

www.biolink-nl.eu

Page 3

Linking Challenges in Biolink NL

A unique identifier is lacking:

• Linking therefore has to rely on personal identifiers

Privacy concerns:

• Surname might not be allowed for use

• Personal identifiers have to be encrypted

Both the availability and the quality of the personal identifiers may vary across registries
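
The slides do not say how the identifiers are encrypted. As a minimal sketch of one common option, the block below applies a keyed hash (HMAC-SHA256) to normalised identifiers, so that two registries can compare values without exchanging plain text. The key, the field names and the `normalise` and `pseudonymise` helpers are hypothetical and not taken from the project.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical shared secret; in practice a trusted party would manage it.
SECRET_KEY = b"example-key-not-for-production"


def normalise(value: str) -> str:
    """Strip accents, whitespace and case so that equal values hash equally."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(without_marks.split()).upper()


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256) of a normalised identifier."""
    message = normalise(value).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Two hypothetical records for the same person, formatted differently.
record_a = {"dob": "1975-03-14", "sex": "F", "postcode": "1012 ab"}
record_b = {"dob": "1975-03-14", "sex": "f", "postcode": "1012AB"}

encoded_a = {field: pseudonymise(value) for field, value in record_a.items()}
encoded_b = {field: pseudonymise(value) for field, value in record_b.items()}

print(encoded_a == encoded_b)
```

A keyed hash of this kind supports exact comparison only: a single typing error changes the whole hash, which is one reason the quality of the identifiers matters so much for the choice of linkage method.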

Page 4

Linking Approaches in Biolink NL

Personal identifiers as linking variables: surname, date of birth, sex, postal code

Take into consideration: surname might not be allowed for use

Research questions:

• Which personal identifier is a ‘must’?

• In which situations does a deterministic or a probabilistic method perform best?
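
To make the contrast between the two method families concrete, here is a minimal sketch using the four linking variables named above. The deterministic rule requires agreement on every selected field, while the probabilistic score sums Fellegi-Sunter style agreement weights. The m- and u-probabilities and the example records are invented for illustration and are not estimates from the project.

```python
import math

FIELDS = ("surname", "dob", "sex", "postcode")

# Hypothetical probabilities of agreement given a true match (m)
# and given a non-match (u); real values would be estimated from data.
M_PROB = {"surname": 0.90, "dob": 0.97, "sex": 0.99, "postcode": 0.85}
U_PROB = {"surname": 0.01, "dob": 0.001, "sex": 0.50, "postcode": 0.001}


def deterministic_match(a, b, fields=FIELDS):
    """Link only if every selected field agrees exactly."""
    return all(a[f] == b[f] for f in fields)


def match_weight(a, b, fields=FIELDS):
    """Sum log2 agreement or disagreement weights over the selected fields."""
    total = 0.0
    for f in fields:
        m, u = M_PROB[f], U_PROB[f]
        if a[f] == b[f]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Hypothetical pair: same person, but the postal code changed after a move.
biobank = {"surname": "JANSEN", "dob": "1975-03-14", "sex": "F", "postcode": "1012AB"}
registry = {"surname": "JANSEN", "dob": "1975-03-14", "sex": "F", "postcode": "3511CE"}

print(deterministic_match(biobank, registry))
print(match_weight(biobank, registry))
print(match_weight(biobank, registry, fields=("dob", "sex", "postcode")))
```

The last call drops the surname, mirroring the case where it may not be used: the remaining fields have to carry all of the evidence, so the total weight is lower.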

Page 5

Project Approach

Development

• Conduct a literature survey on record linkage methodology and applications

• Develop a prototype for the linkage strategy by using simulated data

Testing

• Test the linkage strategy on real data

Evaluation

• Evaluate the linking results by means of another identifier (encrypted Dutch ID)

• Evaluate the linking results by means of a content variable (content validation)
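
Evaluation against a second identifier can be sketched as follows, assuming both files also carry the same encrypted ID, so that the true links are known. The record IDs, the `evaluate_links` helper and the choice of precision and recall as measures are illustrative assumptions; the slides do not state which measures the project uses.

```python
def evaluate_links(predicted_links, true_links):
    """Compare predicted record pairs with the pairs implied by a gold-standard ID."""
    predicted, truth = set(predicted_links), set(true_links)
    true_pos = len(predicted & truth)
    false_pos = len(predicted - truth)
    false_neg = len(truth - predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical data: record ID -> encrypted national ID.
cohort = {"c1": "idA", "c2": "idB", "c3": "idC"}
registry = {"r1": "idA", "r2": "idB", "r3": "idD"}

# True links are the pairs that share the encrypted ID.
registry_by_id = {pid: rid for rid, pid in registry.items()}
true_links = {(cid, registry_by_id[pid]) for cid, pid in cohort.items() if pid in registry_by_id}

# Links produced by some linkage method under evaluation.
predicted_links = {("c1", "r1"), ("c3", "r3")}

result = evaluate_links(predicted_links, true_links)
print(result)  # precision 0.5 and recall 0.5 for this toy example
```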

Page 6

Current Presentation

Project phases: Development, Testing, Evaluation

Develop a prototype for linkage strategy by using simulated data.

Real data were used as blueprints for simulated data.

Overview:

• Our motivations

• Factors considered in the simulation

• Findings

• Prototype for the linkage strategy

Page 7

Using Simulated Data

Our motivations:

We want to experiment with different approaches without compromising privacy.

The simulated data sets are modelled after the real data sets.

We want to include “what-if” scenarios:

• What if not all identifiers are available for linking?

• What if the number of shared records is small?

• What if the error rate is high?

Page 8

Factors Considered for the Simulation

The linkages in Biolink NL involve registries that vary in size and in the population they cover

[Figure: registries of differing size and coverage: Pathology Data, Cancer Registry, General Population Registry, Female Cohort, Children Cohort]

Page 9

Factors Considered for the Simulation

The amount of shared records (overlap) may vary

[Figure: two registry pairs, Cancer Registry with General Population Registry and Cancer Registry with Female Cohort, labelled Large Overlap and Small Overlap]

Page 10

Factors Considered for the Simulation

Personal identifiers are not 100% accurate or consistent, for instance due to:

• Typing errors

• Changes of address

• Use of different surnames (married vs maiden name)

We vary the error rate up to 30%
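
The three simulation factors (file size, overlap and error rate) can be combined in a small generator. The sketch below is not the project's simulation: the record layout, the single-character typing-error model and the reading of the error rate as a per-field probability are all assumptions made for illustration.

```python
import random
import string

# Fields that receive simulated typing errors (assumption: sex is left intact).
ERROR_FIELDS = ("surname", "dob", "postcode")


def make_person(rng, person_id):
    """Create one synthetic person with random identifiers."""
    letters = string.ascii_uppercase
    return {
        "id": person_id,
        "surname": "".join(rng.choice(letters) for _ in range(7)),
        "dob": f"{rng.randint(1930, 2010)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "sex": rng.choice("MF"),
        "postcode": f"{rng.randint(1000, 9999)}{rng.choice(letters)}{rng.choice(letters)}",
    }


def corrupt(rng, value):
    """Replace one letter or digit with a different one to imitate a typing error."""
    positions = [i for i, ch in enumerate(value) if ch.isalnum()]
    pos = rng.choice(positions)
    alphabet = string.digits if value[pos].isdigit() else string.ascii_uppercase
    replacement = rng.choice([ch for ch in alphabet if ch != value[pos]])
    return value[:pos] + replacement + value[pos + 1:]


def simulate_pair(size_a, size_b, overlap, error_rate, seed=0):
    """Return two files sharing `overlap` persons; errors are injected into file B."""
    rng = random.Random(seed)
    shared = [make_person(rng, i) for i in range(overlap)]
    only_a = [make_person(rng, overlap + i) for i in range(size_a - overlap)]
    only_b = [make_person(rng, size_a + i) for i in range(size_b - overlap)]
    file_a = shared + only_a
    file_b = []
    for person in shared + only_b:
        noisy = dict(person)
        for field in ERROR_FIELDS:
            if rng.random() < error_rate:
                noisy[field] = corrupt(rng, noisy[field])
        file_b.append(noisy)
    rng.shuffle(file_b)
    return file_a, file_b


# A large registry against a small cohort, with the highest error rate considered.
file_a, file_b = simulate_pair(size_a=1000, size_b=300, overlap=150, error_rate=0.30)
print(len(file_a), len(file_b))
```

Because the `id` field records the truth, any linkage method run on `file_a` and `file_b` can be scored exactly, which is the main attraction of simulated data for the “what-if” scenarios.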

Page 11

Linking Methods

Methods should preferably be practical and applicable to encrypted identifiers.

Deterministic linkage method: partial matching

Probabilistic linkage methods: simple probabilistic, Jaro-Winkler, bigram

Implemented in SAS 9.2 and RecordLinkage (R package)
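
As a rough illustration of two of the techniques listed, the sketch below shows a bigram comparator and a partial-matching rule. The bigram similarity uses the Dice coefficient, which is one common definition and not necessarily the one used in the project's SAS or R code. Reading “partial matching” as agreement on at least a minimum number of identifiers is likewise an assumption, and all names here are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the overlapping two-character substrings of a string."""
    upper = text.upper()
    return Counter(upper[i:i + 2] for i in range(len(upper) - 1))


def bigram_similarity(a, b):
    """Dice coefficient over bigram multisets: 1.0 if identical, 0.0 if nothing is shared."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Strings shorter than two characters have no bigrams; compare them directly.
        return 1.0 if a.upper() == b.upper() else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


def partial_match(a, b, fields, min_agree):
    """Deterministic rule: link if at least `min_agree` of the fields agree exactly."""
    agreeing = sum(1 for f in fields if a[f] == b[f])
    return agreeing >= min_agree


rec_a = {"surname": "JANSEN", "dob": "1975-03-14", "sex": "F", "postcode": "1012AB"}
rec_b = {"surname": "JANSSEN", "dob": "1975-03-14", "sex": "F", "postcode": "1012AB"}

print(bigram_similarity(rec_a["surname"], rec_b["surname"]))
print(partial_match(rec_a, rec_b, ("surname", "dob", "sex", "postcode"), min_agree=3))
```

String comparators such as this one operate on readable text. Applied to hashed identifiers they carry no information about how close the underlying values are, which is worth keeping in mind given the stated preference for methods that work on encrypted identifiers.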

Page 12

Simulation Findings (1)

Date of birth should be included as a linking identifier.

Page 13

Simulation Findings (2)

Used together, the deterministic and probabilistic methods can help estimate the likely overlap size.

Page 14

Simulation Findings (3)

The deterministic method appears to be more suitable for:

• Small overlap size (less than 60%)

The probabilistic method appears to perform best when the following conditions are met:

• Large overlap size (more than 60%)

• All identifiers are used as linkage variables

Page 15

Linking Strategy

[Flowchart: decision tree with the questions “Less than 20,000 records?”, “Include surname?” and “Possible overlap size < 50%?”; the Yes/No branches end in either Deterministic or Probabilistic]

Page 16

Next Steps

The following linkages will be chosen for testing and evaluation:

• A Dutch female cohort – the Dutch Cancer Registry

• Dutch twin-children cohort – Health Insurance Database

• Dutch children cohort – the Dutch National Pharmacy Database