Genome Hacking - Genetic Alliance, 10/8/13 - Yaniv Erlich

Genome Hacking

Yaniv Erlich, Whitehead Institute for Biomedical Research

Twitter: @erlichya

Yaniv Erlich Yaniv Erlich 10/8/13

Public data is important for genetic studies

Intro. Risk assessment

The Venter case Anonymous datasets Summary

To make this endeavor sustainable, we must proactively map risks

Co-segregation between Y-chr and surnames

[Figure: Y chromosomes and surnames inherited together along the male line, with www.ysearch.org as the lookup resource. Labels: Y, Smith, Erlich.]

The Venter case

www.ysearch.org:

lobSTR DYS458: 17 repeats

Method

lobSTR: A short tandem repeat profiler for personal genomes

Melissa Gymrek,1,2 David Golan,2,3 Saharon Rosset,3 and Yaniv Erlich2,4

1Harvard–MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; 2Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; 3Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel

Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat STR mapping as gapped alignment, which results in cumbersome processing times and a biased sampling of STR alleles. Here, we present lobSTR, a novel method for profiling STRs in personal genomes. lobSTR harnesses concepts from signal processing and statistical learning to avoid gapped alignment and to address the specific noise patterns in STR calling. The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling. We validated lobSTR’s accuracy by measuring its consistency in calling STRs from whole-genome sequencing of two biological replicates from the same individual, by tracing Mendelian inheritance patterns in STR alleles in whole-genome sequencing of a HapMap trio, and by comparing lobSTR results to traditional molecular techniques. Encouraged by the speed and accuracy of lobSTR, we used the algorithm to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome. We traced the mutation dynamics of close to 100,000 STR loci and observed more than 50,000 STR variations in a single genome. lobSTR’s implementation is an end-to-end solution. The package accepts raw sequencing reads and provides the user with the genotyping results. It is written in C/C++, includes multi-threading capabilities, and is compatible with the BAM format.

[Supplemental material is available for this article.]

Short tandem repeats (STRs), also known as microsatellites, are a class of genetic variations with repetitive elements of 2–6 nucleotides (nt) that consist of approximately a quarter million loci in the human genome (Benson 1999). The repetitive structure of those loci creates unusual secondary DNA conformations that are prone to replication slippage events and result in high variability in the number of repeat elements (Mirkin 2007). The spontaneous mutation rate of STRs exceeds that of any other type of known genetic variation and can reach 1/500 mutations per locus per generation (Walsh 2001; Ballantyne et al. 2010), 200-fold higher than the rate of spontaneous copy number variations (CNV) (Lupski 2007) and 200,000-fold higher than the rate of de novo SNPs (Conrad et al. 2011).
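As a back-of-the-envelope reading of the fold differences quoted above, the upper-end STR rate implies the following orders of magnitude. These values are derived here from the quoted ratios only; the excerpt does not state them, and it does not give the units for the CNV and SNP rates.

```latex
\begin{align*}
\mu_{\mathrm{STR}} &\approx \tfrac{1}{500} = 2 \times 10^{-3} \\
\mu_{\mathrm{CNV}} &\approx \frac{\mu_{\mathrm{STR}}}{200} = 10^{-5} \\
\mu_{\mathrm{SNP}} &\approx \frac{\mu_{\mathrm{STR}}}{200{,}000} = 10^{-8}
\end{align*}
```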

STR variations have been instrumental in wide-ranging areas of human genetics. STR expansions are implicated in the etiology of a variety of genetic disorders, such as Huntington’s Disease and Fragile-X Syndrome (Pearson et al. 2005; Mirkin 2007). Forensics DNA fingerprinting relies on profiling autosomal STR markers and Y-chromosome STR (Y-STR) loci (Kayser and de Knijff 2011). STRs have been extensively used in genetic anthropology, where their high mutation rates create a unique capability to link recent historical events to DNA variations, including the well-known Cohen Modal Haplotype that segregates in patrilineal lines of Jewish priests (Skorecki et al. 1997; Zhivotovsky et al. 2004). Another relatively recent application of STR analysis is tracing cell lineages in cancer samples (Frumkin et al. 2008).

Despite the plurality of applications, STR variations are not routinely analyzed in whole-genome sequencing studies, mainly due to a lack of adequate tools (Treangen and Salzberg 2011). STRs pose a remarkable challenge to mainstream HTS analysis pipelines. First, not all reads that align to an STR locus are informative (Supplemental Fig. 1A). If a single or paired-end read partially encompasses an STR locus, it provides only a lower bound on the number of repeats. Only reads that fully encompass an STR can be used for exact STR allelotyping. Second, mainstream aligners, such as BWA, generally exhibit a trade-off between run time and tolerance to insertions/deletions (indels) (Li and Homer 2010). Thus, profiling STR variations, even for an expansion of three repeats in a trinucleotide STR, would require a cumbersome gapped alignment step and lengthy processing times (Supplemental Fig. 1B). Third, PCR amplification of an STR locus can create stutter noise, in which the DNA amplicons show false repeat lengths due to successive slippage events of DNA polymerase during amplification (Supplemental Fig. 1C; Hauge and Litt 1993; Ellegren 2004). Since PCR amplification is a standard step in library preparation for whole-genome sequencing, an STR profiler should explicitly model and attempt to remove this noise to enhance accuracy.
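The first point above (spanning reads give an exact allele, partial reads only a lower bound) can be sketched in a few lines. This is a minimal illustration, not lobSTR's code: the `Read` type, the function names, and the use of reference coordinates to count repeat units are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Read:
    start: int  # 0-based inclusive start of the alignment on the reference
    end: int    # 0-based exclusive end of the alignment on the reference


def classify_read(read: Read, str_start: int, str_end: int) -> str:
    """Classify a read against an STR locus occupying [str_start, str_end).

    'spanning': covers the whole repeat plus at least one flanking base on
                each side, so the repeat count can be read off exactly.
    'partial':  overlaps the repeat but ends inside it or at its edge, so it
                only bounds the repeat count from below.
    'none':     does not overlap the repeat.
    """
    if read.end <= str_start or read.start >= str_end:
        return "none"
    if read.start < str_start and read.end > str_end:
        return "spanning"
    return "partial"


def lower_bound_repeats(read: Read, str_start: int, str_end: int, period: int) -> int:
    """Whole repeat units of the locus that the read overlaps (simplified:
    counted in reference coordinates, ignoring indels inside the read)."""
    overlap = min(read.end, str_end) - max(read.start, str_start)
    return max(overlap, 0) // period
```

For a hypothetical tetranucleotide locus at [100, 140), a read aligned to [90, 150) is spanning, while a read aligned to [90, 120) is partial and supports at least 5 repeat units.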

Here, we present lobSTR, a rapid and accurate algorithm for STR profiling in whole-genome sequencing data sets (Fig. 1). Briefly, the algorithm has three steps. The first step is sensing: lobSTR swiftly scans genomic libraries, flags informative reads that fully encompass STR loci, and characterizes their STR sequence. This ab initio procedure relies on a signal processing approach that uses rapid entropy measurements to find informative STR reads followed by a Fast Fourier Transform to characterize the repeat sequence. The second step is alignment: lobSTR uses a divide-and-conquer strategy that anchors the nonrepetitive flanking regions

4Corresponding author: Yaniv Erlich. Gymrek M, Golan D, Rosset S, et al. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research, published online April 20, 2012, in advance of the print journal. doi: 10.1101/gr.135780.111. Copyright © 2012 by Cold Spring Harbor Laboratory Press.
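The entropy part of the sensing step in the excerpt above can be illustrated with a small sketch. It assumes, for illustration only, that entropy is measured over overlapping dinucleotides in sliding windows; the window size, step, and threshold are hypothetical values, not lobSTR's parameters, and the Fourier step that characterizes the repeat unit is not shown.

```python
import math
from collections import Counter
from typing import List


def dinucleotide_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the overlapping dinucleotides in seq."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return 0.0
    total = len(pairs)
    counts = Counter(pairs)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def flag_low_entropy_windows(read: str, window: int = 24, step: int = 8,
                             max_entropy: float = 2.2) -> List[int]:
    """Return the start offsets of windows that look repetitive.

    window, step, and max_entropy are hypothetical illustration values.
    """
    starts = []
    for start in range(0, len(read) - window + 1, step):
        if dinucleotide_entropy(read[start:start + window]) <= max_entropy:
            starts.append(start)
    return starts


# Made-up example read: non-repetitive flanks around ten GATA units.
flank = "ACGTTGCAAGCTTCGGATCCATGGTACCGAGCTCGAATTC"
read = flank + "GATA" * 10 + flank
print(flag_low_entropy_windows(read))
```

A pure tetranucleotide repeat contains only four distinct dinucleotides, so its entropy cannot exceed 2 bits, while ordinary sequence usually scores well above that. Only windows lying inside the repeat fall under the threshold here.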

Try it yourself: bit.ly/craig_venter_haplotype_updated

• We recovered a surname from whole-genome sequencing data

Finding Craig Venter

Searching USsearch.com and PeopleFinders.com for:

1. Venter
2. Male
3. California
4. Born in 1946
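The search above works because each attribute on its own is weak, but their intersection is small. A minimal sketch of that narrowing step follows. The `PublicRecord` type, the function name, and every record in the usage note are made up for illustration; this does not reproduce the interface of either search site.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PublicRecord:
    record_id: str   # hypothetical identifier for a public-record entry
    surname: str
    sex: str
    state: str
    birth_year: int


def narrow_candidates(records: Iterable[PublicRecord], surname: str, sex: str,
                      state: str, birth_year: int) -> List[PublicRecord]:
    """Keep only the records that match all four quasi-identifiers."""
    return [
        r for r in records
        if r.surname.lower() == surname.lower()
        and r.sex == sex
        and r.state == state
        and r.birth_year == birth_year
    ]
```

With a toy list such as `[PublicRecord("r1", "Doe", "M", "CA", 1950), PublicRecord("r2", "Doe", "F", "CA", 1950), PublicRecord("r3", "Doe", "M", "NY", 1950)]`, calling `narrow_candidates(records, "doe", "M", "CA", 1950)` keeps only `r1`.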

Two matches, including:

Based on genomics data

Not considered as identifiers by HIPAA

Can we identify anonymous personal genomes?

Recovering the identities of CEU individuals

8 Surname predictions with Utah ancestry

10 CEU genomes

Found an obituary that has the exact description of the pedigree

Probability of a random match < 5 × 10^-9

Winfield Utah

Beginner’s luck?

In total: 5 successful surname recoveries, breaching the privacy of close to 50 CEU samples.

Successful surname recovery (targeted individual)

Patrilineal line from source to target

Person tested by genetic genealogy service (source)

p < 5 × 10^-9, p < 10^-5, p < 5 × 10^-6, p < 5 × 10^-6, p < 10^-5

Summary

Our approach:

- No experimental work involved.
- The identifying information propagates via deep genealogical ties.
- The attack completely relies on public resources.

We tested close to 1,000 Y-STR haplotypes and demonstrated complete identification of Venter and close to 50 CEU individuals.

Recommendations

1. Consent:

- Be honest about risks. Be honest about benefits.

2. Multi-tier approach:

- Give participants options for data sharing.

3. Proactive approach:

- Keep mapping risks. Friendly hacking is far better than a real attack.

4. Technical solutions:

- We did not explore those enough. Much more to do here.

Acknowledgements

Melissa Gymrek (HST – Harvard/MIT)
Amy McGuire (Baylor)
David Golan (Tel-Aviv University)
Eran Halperin (Tel-Aviv University)

1/18/13

Open Access (with FREE registration)

Exploiting genetic genealogy databases

An anecdote?

Can we recover the identity of anonymous sequencing datasets using public resources?

The main idea – a systematic study

Empirical test: what is the probability of recovering a surname?

1. Y-STR of a real person
2. Querying Ysearch and SMGF
3. Calculating surname confidence score
4. Inferring surname
5. Comparing the predicted surname to the true one

Repeated x900

Expectation for US Caucasian males from middle and upper class: 12% successful recoveries
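The empirical test can be sketched as a small loop over test haplotypes. This is a toy illustration under stated assumptions. The database is an in-memory list, not Ysearch or SMGF. Matching counts mismatched repeat numbers over shared markers. The confidence score is simply the share of matching entries that carry the most common surname, which is a stand-in for the score used in the study, since the slides do not define it.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Haplotype = Dict[str, int]      # marker name -> repeat count, e.g. {"DYS458": 17}
Entry = Tuple[str, Haplotype]   # (surname, haplotype) record in a toy database


def mismatches(a: Haplotype, b: Haplotype) -> Optional[int]:
    """Count shared markers whose repeat numbers differ (None if none are shared)."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surname(query: Haplotype, database: List[Entry],
                  max_mismatches: int = 0,
                  min_confidence: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (predicted surname or None, confidence).

    Confidence here is a hypothetical stand-in: the fraction of matching
    database entries that carry the most common surname among the matches.
    """
    hits = []
    for surname, haplotype in database:
        d = mismatches(query, haplotype)
        if d is not None and d <= max_mismatches:
            hits.append(surname)
    if not hits:
        return None, 0.0
    top, count = Counter(hits).most_common(1)[0]
    confidence = count / len(hits)
    return (top if confidence >= min_confidence else None), confidence


def recovery_rate(test_set: List[Tuple[Haplotype, str]], database: List[Entry]) -> float:
    """Fraction of test haplotypes whose predicted surname equals the true one."""
    if not test_set:
        return 0.0
    correct = sum(1 for haplotype, true_surname in test_set
                  if infer_surname(haplotype, database)[0] == true_surname)
    return correct / len(test_set)
```

The thresholds `max_mismatches` and `min_confidence` trade recall against wrong guesses: a stricter confidence cut-off returns fewer surnames but makes each returned surname more likely to be correct.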