

1

Novel Genomic De-Identification Method to Enable Research

Session 220, February 14, 2019

Kenneth Park MD, VP, Offering Development, IQVIA

Ronald Miller PhD, Assoc Dir, Offering Development, IQVIA

2

Conflicts of Interest

Kenneth Park, MD

Has no real or apparent conflicts of interest to report.

Ronald Miller, PhD

Has no real or apparent conflicts of interest to report.

3

Agenda

• Genomic Data – What Is It and Why Do We Care?

• Genomic Privacy Issues and Policies

• Genomic Privacy Protection and De-Identification

4

Learning Objectives

• Describe the growing awareness and concern for genomic data privacy

• Explain why traditional de-identification approaches do not work with genomic data

• Evaluate current genomic data privacy approaches and their relative risks and benefits

• Identify the key features of a comprehensive genomic de-identification approach

5

What Is Genomic Data?

Genomic Data

• DNA sequence information

- Impacts and can predict a person’s physical characteristics and disease

- Can directly cause traits / diseases or affect likelihood of outcome

• Genomic research often requires linkage with phenotypic data (e.g., observational or clinical outcome data)

Phenotypic Data

• Observable traits

- Eye color

- Disease onset

• Influenced by both genetics and environment

• Measured through observation or clinical testing

- EMR

- Claims

- Rx

SOURCE: https://www.news-medical.net/health/Genetics-of-Eye-Color.aspx

6

The Cost of Doing a Genomic Sequence Has Approximately Halved Each Year

SOURCE: https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/

[Chart: Cost per Genome Sequence, log scale from $1,000 to $100,000,000]

• Watson’s genome completed: ~$1M

• First consumer genome sequencing available: ~$50,000

• Sequencing now available for ~$1,000 per genome
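As a rough check on the halving claim, the sketch below asks how many years of steady halving separate the two chart annotations (~$1M and ~$1,000 per genome). The constant annual factor of 0.5 is a simplifying assumption taken from the slide title, not a fitted rate, and the function names are illustrative.

```python
import math

def years_to_reach(start_cost, target_cost, annual_factor=0.5):
    """Years needed for start_cost to fall to target_cost at a constant annual factor."""
    return math.log(target_cost / start_cost) / math.log(annual_factor)

def cost_after(start_cost, years, annual_factor=0.5):
    """Cost after a number of years of constant proportional decline."""
    return start_cost * annual_factor ** years

# Chart annotations: ~$1M (Watson's genome) down to ~$1,000 per genome.
years = years_to_reach(1_000_000, 1_000)
print(f"Halvings needed: {years:.2f}")
print(f"Cost after 10 halvings: ${cost_after(1_000_000, 10):,.2f}")
```

Under this assumption, a thousandfold drop corresponds to roughly ten halvings, since 2^10 = 1,024.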

7

Global Growth of Genomic Data

Country | WGS completed (2018)
UK | 90,000
Iceland | 40,000
China | 32,000
Saudi Arabia | 2,400
Singapore | 400
Portugal | 100
US | 0

Despite the growth of genomic data, ongoing challenges remain

• Limited applicability to clinical care

• Datasets not linkable with other patient data

• No focus on patient consent / privacy

8

Growth of Genomic Data Is Driving New Research Applications

[Figure: research applications across the product lifecycle]

Lifecycle stages: Discovery / basic science; Pre-clinical; Clinical; Commercial; Post-marketing / lifecycle mgmt.

Research applications:

• Disease pathobiology studies

• Molecular target ID

• Predictive biomarker ID & validation

• Clinical study planning and design

• Disease natural history (genomic factors)

• Comparative effectiveness

• Target Product Profile (TPP) definition

• Drug MOA elucidation

• Drug safety

• New indication identification / selection

Contributors & decision makers vary across product lifecycle

9

Industry Is Investing in Genomic Data and Genomic Research…

[Chart: subjects providing genomic information (0 to 800k), by year, 2015 to 2018]

1 Pfizer / 23andMe, Calico / Ancestry, and Roche / FMI / Flatiron subject totals not published.

2 GSK / 23andMe has an opt-in / opt-out model whereby some 80% of the 5 million 23andMe subjects have, by default, opted in but can opt out at any time.

10

…Identifying Genetic Associations with Disease…

Disease | Gene Discovered
Aortic aneurysm | ACTA2 / MYH11
Breast-ovarian cancer | BRCA1 / BRCA2
Familial hypercholesterolemia | APOB / LDLR
Loeys-Dietz syndrome | TGFBR1 / TGFBR2 / SMAD3
Long QT Syndrome | KCNQ1 / KCNH2 / SCN5A
Lynch Syndrome | MLH1 / MSH6 / PMS2
Marfan’s Syndrome | FBN1 / TGFBR1
Retinoblastoma | RB1

SOURCE: https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/

11

…and Leading to New Drugs and Therapies

Disease | Gene Discovered | Known or Candidate Drug
Type 2 Diabetes | SLC30A8 | ZnT-8 antagonists
Type 2 Diabetes | KCNJ11 | Glyburide
Rheumatoid Arthritis | PADI4 | BB-Cl-amidine
Rheumatoid Arthritis | IL6R | Tocilizumab
Psoriasis | IL23A | Risankizumab
Osteoporosis | RANKL | Denosumab
Osteoporosis | ESR1 | Raloxifene and HRT
Schizophrenia | DRD2 | Anti-psychotics
LDL Cholesterol | HMGCR | Pravastatin

SOURCE: Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet. 2017 Jul 6;101(1):5-22. Review.

12

Hallmarks of an Ideal Genomic Dataset

Key Characteristics

• Whole genome sequence

• Rich clinical history

• Sample size >1M

• Diverse ethnic, environmental, social and geographic representation

• Broad disease representation

Current Challenges

• Balancing patient privacy with research use

• Getting comprehensive environmental, social, and clinical data

• Aggregating data at scale

13

Agenda

• Genomic Data – What Is It and Why Do We Care?

• Genomic Privacy Issues and Policies

• Genomic Privacy Protection and De-Identification

14

Genomic Data Is Personally Unique and Identifying

Fingerprints

• Calculations estimate there are ~64B possible combinations

Genomic Data

• 4^3.2B: all possible combinations

• 4^335M: combinations based on observed mutations

SOURCE: http://www.biometricbits.com/Galton-Fingerprints-1892.pdf

https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
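To put the slide's figures on one scale, the sketch below compares them by order of magnitude (log base 10), since 4^3.2B is far too large to compute directly. It assumes "B" means 10^9 and "M" means 10^6, and that each position can take one of four bases; the helper name is illustrative.

```python
import math

def log10_power(base, exponent):
    """Return log10(base ** exponent) without computing the huge power itself."""
    return exponent * math.log10(base)

fingerprint_digits = math.log10(64e9)      # ~64B fingerprint combinations
all_sites_digits = log10_power(4, 3.2e9)   # 4^3.2B: every position free to vary
observed_digits = log10_power(4, 335e6)    # 4^335M: positions with observed mutations

print(f"Fingerprints:        about 10^{fingerprint_digits:.1f}")
print(f"All genome sites:    about 10^{all_sites_digits:.3e}")
print(f"Observed variation:  about 10^{observed_digits:.3e}")
```

Even the smaller genomic figure has an exponent in the hundreds of millions, while the fingerprint estimate has an exponent near 11.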

15

Public Genealogy Databases Can Be Used to Identify Individuals or Relatives

• In 2013, researchers compared genetic profiles with public genealogy databases to identify surnames of individuals

• Information such as age and geographic area enabled researchers to link an individual back to the genomic data

• This early effort underscored re-identification risk (~12%)

• In 2018, other researchers increased identification to ~60% by expanding the matching criteria to third cousins

SOURCE: Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013 Jan 18;339(6117):321-4.

Erlich Y, Shor T, Pe'er I, Carmi S. Identity inference of genomic data using long-range familial searches. Science. 2018 Nov 9;362(6415):690-694.

16

Genomic Data Can Identify Individuals Based on Physical Traits

• Researchers used genomic data to predict skin color, eye color, age, height, and biometric features

• Applying this information to 3D facial models, they could correctly identify the subject from a pool of 1000 diverse individuals more than 80% of the time

Limitations / Criticisms

• Model success dropped to 50% when using a pool of ethnically similar subjects

• Critics of the study claimed that the diversity and limited size of the dataset made it relatively easy to identify individuals

• Despite criticisms, this study highlights the dynamic nature of genomic re-identification risk

SOURCE: Lippert et al. Identification of individuals by trait prediction using whole-genome sequencing data. Proc Natl Acad Sci U S A. 2017 Sep 19;114(38):10166-10171.

17

Policy Approaches to Genomic Privacy

Government Approaches

• HIPAA

• Genetic Information Nondiscrimination Act

• Over 60 state statutes

• GDPR

• Country-level legislation

Private Organization Recommendations

• The responsible and secure sharing of genomic and health data is key to accelerating research and improving human health.

• Individuals' rights include privacy, autonomy, and the ability to choose for themselves how they want to manage risk, consistent with their own personal values and life situations.

SOURCE:

https://www.hhs.gov/ohrp/regulations-and-policy/guidance/guidance-on-genetic-information-nondiscrimination-act/index.html

https://www.genome.gov/policyethics/legdatabase/pubsearch.cfm

https://www.federalregister.gov/documents/2013/01/25/2013-01073/modifications-to-the-hipaa-privacy-security-enforcement-and-breach-notification-rules-under-the

https://www.ncbi.nlm.nih.gov/pubmed/29187736

18

Agenda

• Genomic Data – What Is It and Why Do We Care?

• Genomic Privacy Issues and Policies

• Genomic Privacy Protection and De-Identification

19

Two Main Approaches to Protect Genomic Data

Process Controls

Methodology

• Non-genomic identifiers (name / address / etc.) are removed, and process controls or established user agreements protect privacy

Limitations

• Assumes good behavior of users

• Data is still identifiable and cannot be linked to de-identified data

• Vulnerable in the event of a breach

Masking / Encryption

Methodology

• Data is concealed or encrypted to protect the privacy of individuals

• Approaches include homomorphic encryption, Yao’s protocol, and other cryptographic techniques

Limitations

• Research utility of the dataset is limited in order to provide data privacy

• Often increases analytic complexity, making large-scale analyses difficult

20

An Alternative Approach – Deterministic Variant Tokenization

1 Genomic Sequencing

CTGCTCATCGCTCCTGTCATCGAGGCCCCTGG
CCCAATGGCAGGCGTCTCCCCCTCCTCTGGC
CTGGTCCCGCCTCTCCTGCCCCTTGTGCTCAG
CGCTACCTGCTGCCCGGACAAATCCAGAGCTG

2 Tokenization

Patient-Level Tokenized Genomic Data: ABC123, XYZ789

3 Analysis

Tokenized Aggregated Analytic Results

Variant | Outcome | p value
ABC123 | T2DM | .0023
DEF456 | FBS | .0014

4 De-tokenization of Aggregated Results

Detokenized Aggregated Results

Chr | Pos | Mut | Outcome | p value
20 | 7241 | C | T2DM | .0023
4 | 8902 | T | FBS | .0014
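The tokenize, analyze, and de-tokenize flow on this slide can be sketched as follows. The slides do not say how tokens are generated, so the keyed hash (HMAC-SHA256), the key, the token length, and every function name here are assumptions made for illustration only. The point is that the same variant always maps to the same token, so analyses can run on tokens, and only the key holder's lookup table can map aggregated results back to chromosome and position.

```python
import hashlib
import hmac

# Hypothetical secret held only by the de-identification engine.
SECRET_KEY = b"demo-key-not-for-production"

def tokenize_variant(chrom, pos, alt, key=SECRET_KEY):
    """Deterministically map a variant (chromosome, position, allele) to an opaque token."""
    message = f"{chrom}:{pos}:{alt}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return digest[:12].upper()  # truncated for readability in this sketch

# Step 2: tokenization. The key holder also keeps a token -> variant lookup table.
variants = [("20", 7241, "C"), ("4", 8902, "T")]
token_vault = {tokenize_variant(*v): v for v in variants}

# Step 3: analysis works on tokens only and yields aggregated results.
tokenized_results = [
    {"variant": tokenize_variant("20", 7241, "C"), "outcome": "T2DM", "p_value": 0.0023},
    {"variant": tokenize_variant("4", 8902, "T"), "outcome": "FBS", "p_value": 0.0014},
]

def detokenize_results(results, vault):
    """Step 4: map aggregated (not patient-level) results back to genomic coordinates."""
    detokenized_rows = []
    for row in results:
        chrom, pos, alt = vault[row["variant"]]
        detokenized_rows.append(
            {"chr": chrom, "pos": pos, "mut": alt,
             "outcome": row["outcome"], "p_value": row["p_value"]}
        )
    return detokenized_rows

detokenized = detokenize_results(tokenized_results, token_vault)
for row in detokenized:
    print(row)
```

A keyed hash is used in this sketch rather than a plain hash because the space of possible variants is small enough to enumerate; without a secret key, anyone could rebuild the token table.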

21

De-Identified Genomic Data Can Still Be Annotated for Research

Gene Structure Information

• Identification of location within tokenized genes

• Exon / intron / promoter / enhancer

Variant Classification Information

• Insertion / Deletion

• Non-synonymous / synonymous / frameshift / stop-gain

Pathway Information

• Biological Pathway Data

• Gene Ontology Data

Clinical Information

• Clinical Significance

• Disease Pathway
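One way to picture annotation of de-identified data: whoever holds the token mapping attaches annotations to each token before release, so analysts can filter by variant class or clinical significance without seeing genomic coordinates. The sketch below assumes that arrangement; the annotation values, patient tokens, and function name are hypothetical, and the variant tokens reuse the example tokens from the tokenization slide.

```python
# Hypothetical annotation table keyed by variant token (values are made up).
annotations = {
    "ABC123": {"region": "exon", "variant_class": "non-synonymous",
               "clinical_significance": "pathogenic"},
    "DEF456": {"region": "intron", "variant_class": "deletion",
               "clinical_significance": "uncertain"},
}

# Hypothetical patient-level tokenized data: patient token -> variant tokens.
patient_variants = {
    "patient_tok_1": ["ABC123", "DEF456"],
    "patient_tok_2": ["DEF456"],
}

def carriers_by_class(patient_variants, annotations, variant_class):
    """Return patient tokens carrying at least one variant of the given class."""
    return sorted(
        patient
        for patient, tokens in patient_variants.items()
        if any(
            annotations[t]["variant_class"] == variant_class
            for t in tokens
            if t in annotations
        )
    )

print(carriers_by_class(patient_variants, annotations, "non-synonymous"))
print(carriers_by_class(patient_variants, annotations, "deletion"))
```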

22

Model for De-ID and Linkage of Genomic and Clinical Data from Multiple Sites

[Diagram labels: Genomics Consortium; De-ID Engine (Tokenized Variants, One-Way Hash); De-ID Genomic Data; De-ID Clinical Data; Linked Data; Analytics Platform; Biopharma; Research Query; Analytic Results]

Data Sharing Model Goals

• Whole genome variant data

• Linked to rich clinical history

• Sample size >1M

• Diverse ethnic, environmental, social, and geographic representation

• Broad disease representation
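The "One-Way Hash" label in this model suggests that genomic and clinical records are linked through a hashed patient key rather than through identifiers. A minimal sketch of that idea follows. The slides do not specify the hash, the input fields, or who holds the key, so the shared key, the normalization rules, the sample records, and the function names are all assumptions.

```python
import hashlib
import hmac

# Hypothetical key shared by the de-identification engines at each site.
LINK_KEY = b"shared-linkage-key"

def patient_token(first, last, dob, key=LINK_KEY):
    """One-way keyed hash of normalized identifiers; the same person yields the same token."""
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def deidentify(records, payload_field):
    """Drop direct identifiers, keeping only the linkage token and the payload."""
    return {
        patient_token(r["first"], r["last"], r["dob"]): r[payload_field]
        for r in records
    }

# Made-up records held at two different sites for the same (fictional) person.
genomic_site = [{"first": "Ann", "last": "Lee", "dob": "1970-01-02",
                 "variant_tokens": ["ABC123"]}]
clinical_site = [{"first": " ann", "last": "LEE ", "dob": "1970-01-02",
                  "diagnoses": ["T2DM"]}]

deid_genomic = deidentify(genomic_site, "variant_tokens")
deid_clinical = deidentify(clinical_site, "diagnoses")

# Link on the token alone; names and birth dates never reach the linked dataset.
linked = {
    tok: {"variant_tokens": deid_genomic[tok], "diagnoses": deid_clinical[tok]}
    for tok in deid_genomic.keys() & deid_clinical.keys()
}
print(linked)
```

Normalizing case and whitespace before hashing matters here: a one-way hash gives no partial matches, so any formatting difference between sites would silently break the link.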

23

De-Identification and Data Linking Enable Ideal Data Model

Balancing patient privacy with research use

• Genomic data is never readable at the patient level throughout the research process

Getting comprehensive environmental, social, and clinical data

• De-identification of genomic data enables linkage with other de-identified data assets

Aggregating data at scale

• De-identification of data reduces privacy and competitive concerns from contributing institutions

24

The Future of Genomic Data Privacy

• Deterministic de-identification of genomic sequences offers a novel, pragmatic approach to patient privacy while preserving research utility

• Challenges remain in managing genomic data storage and computational capacity, given both the number and the size of genomic datasets

• Future policy discussions should look to enable research while still maintaining genomic data privacy

25

Questions

Kenneth Park, MD


Ronald Miller, PhD


Please complete online session evaluation