novel genomic de-identification method to enable research
TRANSCRIPT
1
Novel Genomic De-Identification Method to Enable Research
Session 220, February 14, 2019
Kenneth Park MD, VP, Offering Development, IQVIA
Ronald Miller PhD, Assoc Dir, Offering Development, IQVIA
2
Conflicts of Interest
Kenneth Park, MD
Has no real or apparent conflicts of interest to report.
Ronald Miller, PhD
Has no real or apparent conflicts of interest to report.
3
Agenda
• Genomic Data – What Is It and Why Do We Care?
• Genomic Privacy Issues and Policies
• Genomic Privacy Protection and De-Identification
4
Learning Objectives
• Describe the growing awareness and concern for genomic data privacy
• Explain why traditional de-identification approaches do not work with genomic data
• Evaluate current genomic data privacy approaches and their relative risks and benefits
• Identify the key features of a comprehensive genomic de-identification approach
5
What Is Genomic Data?
Genomic Data
• DNA sequence information
- Influences and can predict a person’s physical characteristics and disease
- Can directly cause traits / diseases or affect the likelihood of an outcome
• Genomic research often requires linkage with phenotypic data (e.g., observational or clinical outcome data)
Phenotypic Data
• Observable traits
- Eye Color
- Disease onset
• Influenced by both genetics and
environment
• Measured through observation or
clinical testing
- EMR
- Claims
- Rx
SOURCE: https://www.news-medical.net/health/Genetics-of-Eye-Color.aspx
6
The Cost of Doing a Genomic Sequence Has Approximately Halved Each Year
[Chart: Cost per Genome Sequence, vertical axis in tenfold steps from $1,000 to $100,000,000]
• Watson’s genome completed: ~$1M
• First consumer genome sequencing available: ~$50,000
• Sequencing now available for ~$1,000 per genome
SOURCE: https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
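A quick worked check of the "halved each year" claim, using only the dollar figures annotated on the chart. If cost halves once per year, the number of halvings between two milestones approximates the number of years between them. The variable names are illustrative, and no dates are assumed.

```python
import math

# Milestone costs taken from the chart annotations (approximate).
watson_genome_cost = 1_000_000   # Watson's genome, ~$1M
first_consumer_cost = 50_000     # first consumer genome sequencing, ~$50,000
current_cost = 1_000             # sequencing now, ~$1,000 per genome

def halvings_between(start_cost: float, end_cost: float) -> float:
    """Number of times start_cost must be halved to reach end_cost."""
    return math.log2(start_cost / end_cost)

halvings_from_watson = halvings_between(watson_genome_cost, current_cost)
halvings_from_consumer = halvings_between(first_consumer_cost, current_cost)

print(f"~$1M -> ~$1,000: about {halvings_from_watson:.1f} halvings")
print(f"~$50,000 -> ~$1,000: about {halvings_from_consumer:.1f} halvings")
```

A 1,000-fold drop corresponds to roughly ten halvings, which is the scale of change the slide title describes.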
7
Global Growth of Genomic Data
Countries reporting WGS completed (2018)
Country         WGS Completed
UK              90,000
Iceland         40,000
China           32,000
Saudi Arabia    2,400
Singapore       400
Portugal        100
US              0
Despite the growth of genomic data, ongoing challenges remain
• Limited applicability to clinical care
• Datasets not linkable with other patient data
• No focus on patient consent / privacy
8
Growth of Genomic Data Is Driving New Research Applications
Product lifecycle phases: Discovery / basic science, Pre-clinical, Clinical, Commercial, Post marketing / lifecycle mgmt.
Research applications:
• Disease pathobiology studies
• Molecular target ID
• Predictive biomarker ID & validation
• Clinical study planning and design
• Disease natural history (genomic factors)
• Comparative effectiveness
• Target Product Profile (TPP) definition
• Drug MOA elucidation
• Drug safety
• New indication identification / selection
Contributors & decision makers vary across product lifecycle
9
Industry Is Investing in Genomic Data and Genomic Research…
[Chart: Subjects providing genomic information (0 to 800k), by year, 2015 to 2018]
1 Pfizer / 23andMe, Calico / Ancestry, and Roche / FMI / Flatiron subject totals not published.
2 GSK / 23andMe has an opt-in / opt-out model whereby some 80% of the 5 million 23andMe subjects have, by default, opted in but can opt out at any time.
10
…Identifying Genetic Associations with Disease…
Disease                          Gene Discovered
Aortic aneurysm                  APC / MYH11
Breast-ovarian cancer            BRCA1 / BRCA2
Familial hypercholesterolemia    APOB / LDLR
Loeys-Dietz syndrome             TGFBR1 / TGFBR2 / SMAD3
Long QT Syndrome                 KCNQ1 / KCNH2 / SCN5A
Lynch Syndrome                   MLH1 / MSH6 / PMS2
Marfan’s Syndrome                TGFBR1 / MEN1
Retinoblastoma                   RB1
SOURCE: https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/
11
…and Leading to New Drugs and Therapies
Disease                 Gene Discovered    Known or Candidate Drug
Type 2 Diabetes         SLC30A8            ZnT-8 antagonists
Type 2 Diabetes         KCNJ11             Glyburide
Rheumatoid Arthritis    PADI4              BB-Cl-amidine
Rheumatoid Arthritis    IL6R               Tocilizumab
Psoriasis               IL23A              Risankizumab
Osteoporosis            RANKL              Denosumab
Osteoporosis            ESR1               Raloxifene and HRT
Schizophrenia           DRD2               Anti-psychotics
LDL Cholesterol         HMGCR              Pravastatin
SOURCE: Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet. 2017 Jul 6;101(1):5-22. Review.
12
Hallmarks of an Ideal Genomic Dataset
Key Characteristics
• Whole genome sequence
• Rich clinical history
• Sample size >1M
• Diverse ethnic, environmental, social, and geographic representation
• Broad disease representation
Current Challenges
• Balancing patient privacy with research use
• Getting comprehensive environmental, social, and clinical data
• Aggregating data at scale
13
Agenda
• Genomic Data – What Is It and Why Do We Care?
• Genomic Privacy Issues and Policies
• Genomic Privacy Protection and De-Identification
14
Genomic Data Is Personally Unique and Identifying
Fingerprints
• Calculations estimate there are ~64B possible combinations
Genomic Data
• 4^3.2B possible combinations across all positions
• 4^335M possible combinations based on observed mutations
SOURCE: http://www.biometricbits.com/Galton-Fingerprints-1892.pdf; https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
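To put these figures on a common scale, the sketch below compares them by order of magnitude (base-10 logarithm). It assumes, as the slide's notation implies, that each position can independently take one of four bases; the function and variable names are illustrative only.

```python
import math

def log10_combinations(alphabet_size: int, positions: float) -> float:
    """log10 of alphabet_size ** positions, computed without building the huge number."""
    return positions * math.log10(alphabet_size)

# Fingerprints: ~64 billion possible combinations.
fingerprint_log10 = math.log10(64e9)

# Genomic data: four bases at ~3.2 billion positions, or at ~335 million
# positions where variation has been observed.
genome_all_log10 = log10_combinations(4, 3.2e9)
genome_observed_log10 = log10_combinations(4, 335e6)

print(f"Fingerprints: about 10^{fingerprint_log10:.1f}")
print(f"Genome, all positions: about 10^{genome_all_log10:,.0f}")
print(f"Genome, observed variants: about 10^{genome_observed_log10:,.0f}")
```

The fingerprint estimate is an 11-digit number, while even the smaller genomic figure has roughly 200 million digits, which is why a genome is treated as uniquely identifying.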
15
Public Genealogy Databases Can Be Used to Identify Individuals or Relatives
• In 2013, researchers compared genetic profiles with public genealogy databases to identify surnames of individuals
• Information such as age and geographic area enabled researchers to link an individual back to the genomic data
• Early effort underscored re-identification risk (~12%)
• In 2018, additional researchers increased identification to ~60% by expanding criteria to 3rd cousins
SOURCE: Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Science. 2013 Jan 18;339(6117):321-4.
Erlich Y, Shor T, Pe'er I, Carmi S. Identity inference of genomic data using long-range familial searches. Science. 2018 Nov 9;362(6415):690-694.
16
Genomic Data Can Identify Individuals Based on Physical Traits
• Researchers used genomic data to predict skin color, eye color, age, height, and biometric features
• Applying this information to 3D facial models, they could correctly identify the subject from a pool of 1,000 diverse individuals more than 80% of the time
Limitations / Criticisms
• Model success dropped to 50% when using a pool of ethnically similar subjects
• Critics of the study claimed that the diversity and limited size of the dataset made it relatively easy to identify individuals
• Despite criticisms, this study highlights the dynamic nature of genomic re-identification risk
SOURCE: Lippert et al. Identification of individuals by trait prediction using whole-genome sequencing data. Proc Natl Acad Sci U S A. 2017 Sep 19;114(38):10166-10171.
17
Policy Approaches to Genomic Privacy
Government Approaches
• HIPAA
• Genetic Information Nondiscrimination Act
• Over 60 state statutes
• GDPR
• Country-level legislation
Private Organization Recommendations
• The responsible and secure sharing of genomic and health data is key to accelerating research and improving human health.
• Individuals' rights include privacy, autonomy, and the ability to choose for themselves how they want to manage risk, consistent with their own personal values and life situations.
SOURCE: https://www.hhs.gov/ohrp/regulations-and-policy/guidance/guidance-on-genetic-information-nondiscrimination-act/index.html; https://www.genome.gov/policyethics/legdatabase/pubsearch.cfm; https://www.federalregister.gov/documents/2013/01/25/2013-01073/modifications-to-the-hipaa-privacy-security-enforcement-and-breach-notification-rules-under-the; https://www.ncbi.nlm.nih.gov/pubmed/29187736
18
Agenda
• Genomic Data – What Is It and Why Do We Care?
• Genomic Privacy Issues and Policies
• Genomic Privacy Protection and De-Identification
19
Two Main Approaches to Protect Genomic Data
Process Controls
Methodology
• Non-genomic identifiers (name / address / etc.) are removed, and process controls or established user agreements protect privacy
Limitations
• Assumes good behavior of users
• Data is still identifiable and cannot be linked to de-identified data
• Vulnerable in the event of a breach
Masking / Encryption
Methodology
• Data is concealed or encrypted to protect the privacy of individuals
• Approaches include homomorphic encryption, Yao’s protocol, and other cryptographic techniques
Limitations
• Research utility of the dataset is limited in order to provide data privacy
• Often increases analytic complexity, making large-scale analyses difficult
20
An Alternative Approach: Deterministic Variant Tokenization
1 Genomic Sequencing
CTGCTCATCGCTCCTGTCATCGAGGCCCCTGG
CCCAATGGCAGGCGTCTCCCCCTCCTCTGGC
CTGGTCCCGCCTCTCCTGCCCCTTGTGCTCAG
CGCTACCTGCTGCCCGGACAAATCCAGAGCTG
2 Tokenization
Patient-Level Tokenized Genomic Data
ABC123
…
XYZ789
3 Analysis
Tokenized Aggregated Analytic Results
Variant    Outcome    p value
ABC123     T2DM       .0023
DEF456     FBS        .0014
4 De-tokenization of Aggregated Results
Detokenized Aggregated Results
Chr    Pos     Mut    Outcome    p value
20     7241    C      T2DM       .0023
4      8902    T      FBS        .0014
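The four steps on this slide can be sketched in a few lines. The slide does not say how tokens are generated, so this sketch assumes a keyed hash (HMAC-SHA256) as one possible deterministic tokenizer. The key, the patient labels, the token length, and the carrier-count "analysis" are all hypothetical stand-ins, not the presenters' method.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held only by the de-identification engine.
SECRET_KEY = b"hypothetical-key-held-by-the-de-id-engine"

def tokenize_variant(chrom: str, pos: int, alt: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically map a variant to an opaque token (same input, same token)."""
    message = f"{chrom}:{pos}:{alt}".encode()
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return digest[:12].upper()  # truncated only to keep the example readable

# Step 1 output (illustrative): variants called per patient.
patients = {
    "patient_a": [("20", 7241, "C"), ("4", 8902, "T")],
    "patient_b": [("20", 7241, "C")],
}

# Step 2: tokenize. The vault (token -> variant) stays with the engine.
token_vault = {}
tokenized = {}
for patient_id, variants in patients.items():
    tokens = []
    for variant in variants:
        token = tokenize_variant(*variant)
        token_vault[token] = variant
        tokens.append(token)
    tokenized[patient_id] = tokens

# Step 3: analysis sees tokens only (here, a simple carrier count per token).
carrier_counts = Counter(t for tokens in tokenized.values() for t in tokens)

# Step 4: de-tokenize the aggregated result, never the patient-level rows.
detokenized = {token_vault[token]: n for token, n in carrier_counts.items()}
print(detokenized)
```

Because the mapping is deterministic, the same variant in different patients receives the same token, so association analysis still works on tokenized data; only the aggregated results are mapped back to chromosome, position, and mutation.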
21
De-Identified Genomic Data Can Still Be Annotated for Research
Gene Structure Information
• Identification of location within tokenized genes
• Exon / intron / promoter / enhancer
Variant Classification Information
• Insertion / Deletion
• Non-synonymous / synonymous / frameshift / stop-gain
Pathway Information
• Biological Pathway Data
• Gene Ontology Data
Clinical Information
• Clinical Significance
• Disease Pathway
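One way annotation could survive tokenization is for the annotations to be attached to each token before the data reaches analysts, so that filtering works without revealing the underlying variant. The sketch below assumes that arrangement. The record layout, token values, gene tokens, and clinical-significance labels are illustrative, not taken from the presenters' system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class TokenAnnotation:
    gene_token: str             # tokenized gene identifier, not the gene name
    region: str                 # exon / intron / promoter / enhancer
    variant_class: str          # e.g. insertion, deletion, frameshift
    clinical_significance: str  # illustrative label

# Hypothetical annotation table keyed by variant token.
annotations: Dict[str, TokenAnnotation] = {
    "ABC123": TokenAnnotation("GENE_TOKEN_01", "exon", "non-synonymous", "pathogenic"),
    "DEF456": TokenAnnotation("GENE_TOKEN_01", "intron", "deletion", "benign"),
    "XYZ789": TokenAnnotation("GENE_TOKEN_02", "exon", "frameshift", "pathogenic"),
}

def select_tokens(
    table: Dict[str, TokenAnnotation],
    *,
    region: Optional[str] = None,
    variant_class: Optional[str] = None,
) -> List[str]:
    """Return tokens whose annotations match every filter that was supplied."""
    return sorted(
        token
        for token, ann in table.items()
        if (region is None or ann.region == region)
        and (variant_class is None or ann.variant_class == variant_class)
    )

exonic_tokens = select_tokens(annotations, region="exon")
frameshift_tokens = select_tokens(annotations, variant_class="frameshift")
print(exonic_tokens, frameshift_tokens)
```

An analyst can restrict a study to, say, exonic frameshift variants within a tokenized gene while never seeing a chromosome position or base change.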
22
Model for De-ID and Linkage of Genomic and Clinical Data from Multiple Sites
[Diagram components: Genomics Consortium, De-ID Engine, Analytics Platform, Biopharma]
[Diagram labels: De-ID Genomic Data, De-ID Clinical Data, One-Way Hash, Tokenized Variants, Linked Data, Research Query, Analytic Results]
Data Sharing Model Goals
• Whole genome variant data
• Linked to rich clinical history
• Sample size >1M
• Diverse ethnic, environmental, social, and geographic representation
• Broad disease representation
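The diagram names a one-way hash as part of linking de-identified genomic and clinical data. A minimal sketch of that idea follows, assuming each site hashes the same normalized identifiers with a shared secret salt so that records for one person receive the same linkage token. The salt, field names, and sample records are hypothetical, and a production system would need more robust identifier matching.

```python
import hashlib

# Hypothetical salt shared by contributing sites via the de-ID engine.
SHARED_SALT = b"hypothetical-shared-salt"

def linkage_token(first: str, last: str, dob: str, salt: bytes = SHARED_SALT) -> str:
    """One-way hash of normalized identifiers; the inputs cannot be read back from it."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hashlib.sha256(salt + normalized.encode()).hexdigest()

def deidentify(records, payload_field):
    """Replace direct identifiers with a linkage token, keeping only the payload."""
    return {
        linkage_token(r["first"], r["last"], r["dob"]): r[payload_field]
        for r in records
    }

# Illustrative records held at two different sites.
genomic_site = [
    {"first": "Ada", "last": "Example", "dob": "1970-01-01", "variant_tokens": ["ABC123"]},
]
clinical_site = [
    {"first": " ada", "last": "EXAMPLE", "dob": "1970-01-01", "outcomes": ["T2DM"]},
    {"first": "Bo", "last": "Sample", "dob": "1980-02-02", "outcomes": ["FBS"]},
]

genomic = deidentify(genomic_site, "variant_tokens")
clinical = deidentify(clinical_site, "outcomes")

# Link on the token alone; names and dates of birth never leave the sites.
linked = {
    token: {"variant_tokens": genomic[token], "outcomes": clinical[token]}
    for token in genomic.keys() & clinical.keys()
}
print(len(linked), "linked record(s)")
```

Only the person present at both sites is linked, and the linked record carries tokenized variants and outcomes without any direct identifier.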
23
De-Identification and Data Linking Enable Ideal Data Model
• Balancing patient privacy with research use: genomic data is never readable at the patient level throughout the research process
• Getting comprehensive environmental, social, and clinical data: de-identification of genomic data enables linkage with other de-identified data assets
• Aggregating data at scale: de-identification of data reduces privacy and competitive concerns from contributing institutions
24
The Future of Genomic Data Privacy
• Deterministic de-identification of genomic sequences offers a novel, pragmatic approach to patient privacy while maintaining research utility
• Challenges remain in managing genomic data storage and computational capacity, given both the number and size of genomic datasets
• Future policy discussions should look to enable research while still maintaining genomic data privacy
25
Questions
Kenneth Park, MD
Ronald Miller, PhD
Please complete online session evaluation