computational pan-genomics: status, promises …...computational pan-genomics: status, promises and...
TRANSCRIPT
Computational Pan-Genomics:Status, Promises and Challenges
Tobias Marschall, Manja Marz, Thomas Abeel, Louis Dijkstra, Bas E. Dutilh, AliGhaffaari, Paul Kersey, Wigard P. Kloosterman, Veli Makinen, Adam M. Novak,
Benedict Paten, David Porubsky, Eric Rivals, Can Alkan, Jasmijn Baaijens, Paul I.W. De Bakker, Valentina Boeva, Raoul J. P. Bonnal, Francesca Chiaromonte,
Rayan Chikhi, Francesca D. Ciccarelli, Robin Cijvat, Erwin Datema, Cornelia M.Van Duijn, Evan E. Eichler, Corinna Ernst, Eleazar Eskin, Erik Garrison,
Mohammed El-Kebir, Gunnar W. Klau, Jan O. Korbel, Eric-Wubbo Lameijer,Benjamin Langmead, Marcel Martin, Paul Medvedev, John C. Mu, Pieter
Neerincx, Klaasjan Ouwens, Pierre Peterlongo, Nadia Pisanti, Sven Rahmann,Ben Raphael, Knut Reinert, Dick de Ridder, Jeroen de Ridder, Matthias
Schlesner, Ole Schulz-Trieglaff, Ashley D. Sanders, Siavash Sheikhizadeh, CarlShneider, Sandra Smit, Daniel Valenzuela, Jiayin Wang, Lodewyk Wessels, Ying
Zhang, Victor Guryev, Fabio Vandin, Kai Ye, Alexander Schonhuth
30th November 2016
Rebooting the Human Genome*
”The Human Genome Project was one of mankind’s greatest
triumphs. But the official gene map that resulted in 2003, knownas the “reference genome,” is no longer up to the job.“
(Antonio Regalado, MIT Technology Review, June 3, 2015)
*https://www.technologyreview.com/s/537916/rebooting-the-human-genome
Workshop Paper
Computational pan-genomics: status, promises
and challengesTheComputational Pan-Genomics Consortium*Corresponding author: Tobias Marschall, Center for Bioinformatics at Saarland University and Max Planck Institute for Informatics, Saarland Informatics
Campus, Saarbrucken, Germany. Tel.: +4968130270880; E-mail: [email protected]*The Computational Pan-Genomics Consortium formed at a workshop held from 8 to 12 June 2015, at the Lorentz Center in Leiden, the Netherlands, with
the purpose of providinga cross-disciplinary overview of the emergingdiscipline of Computational Pan-Genomics. Theworkshop was organized by Victor
Guryev, Tobias Marschall, Alexander Schonhuth (chair), Fabio Vandin, and Kai Ye. Consortium members are listed at the end of this article.
Abstract
Many disciplines, from human genetics and oncology to plant breeding,microbiology and virology, commonly face the chal-lengeof analyzing rapidly increasingnumbers of genomes. In caseof Homosapiens, the number of sequenced genomeswillapproach hundreds of thousands in thenext few years. Simply scalingup established bioinformatics pipelines will not besuff cient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computa-tional methods and paradigms are needed.Wewill witness the rapid extension of computational pan-genomics, a newsub-area of research in computational biology. In this article, wegeneralize existingdef nitions and understand a pan-genomeas any collection of genomic sequences to be analyzed jointly or to beused as a reference.Weexaminealreadyavailable approaches to construct and usepan-genomes, discuss the potential benef ts of future technologies andmethod-ologies and review open challenges from thevantagepoint of theabove-mentioned biological disciplines. As a prominentexample for a computational paradigm shift, we particularly highlight the transition from the representation of referencegenomes as strings to representations as graphs.Weoutline how this and other challenges from different application do-mains translate into common computational problems, point out relevant bioinformatics techniques and identify openproblems in computer science.With this review, weaim to increaseawareness that a joint approach to computational pan-genomics can help addressmany of the problems currently faced in various domains.
Key words: pan-genome; sequencegraph; readmapping; haplotypes; data structures
Introduction
In 1995, the complete genome sequence for the bacteriumHaemophilus influenzae was published [1], followed by the se-quence for the eukaryote Saccharomyces cerevisiae in 1996 [2] andthe landmark publication of the human genome in 2001 [3, 4].These sequences, and many more that followed, have served as‘reference genomes’, which formed the basis for both major ad-vances in functional genomics and for studying genetic vari-ation by re-sequencing other individuals from the same species[5–8]. The advent of rapid and cheap ‘next-generation’ sequenc-ing technologies since 2006 has turned re-sequencing into one
of the most popular modern genome analysis workflows. As oftoday, an incredible wealth of genomic variation within popula-tions has already been detected, permitting functional annota-tion of many such variants, and it is reasonable to expect thatthis is only the beginning.
With the number of sequenced genomes steadily increasing,it makes sense to re-think the idea of a ‘reference’ genome[9, 10]. Such a reference sequence can take a number of forms,including:
• the genomeof a single selected individual,• a consensus drawn from an entire population,
Submitted: 14May 2016;Received (in revised form): 17August 2016
VC TheAuthor 2016. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in anymedium, provided the original work is properly cited.
1
Briefings in Bioinformatics, 2016, 1–18
doi: 10.1093/bib/bbw089
Paper
Briefings in Bioinformatics Advance Access published October 21, 2016
at MP
I Com
puter Science on N
ovember 15, 2016
http://bib.oxfordjournals.org/D
ownloaded from
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
What is a Pan-Genome?
Term pan-genome popularized in microbiology in 2005.
Definition (gene-based pan-genome)
The pan-genome of a species (or other taxonomic unit) is theunion of all sets of genes across all individuals.
one genome
genes
What is a Pan-Genome?
Term pan-genome popularized in microbiology in 2005.
Definition (gene-based pan-genome)
The pan-genome of a species (or other taxonomic unit) is theunion of all sets of genes across all individuals.
one genome
genes
What is a Pan-Genome?
Term pan-genome popularized in microbiology in 2005.
Definition (gene-based pan-genome)
The pan-genome of a species (or other taxonomic unit) is theunion of all sets of genes across all individuals.
another genomeone genome
genes
What is a Pan-Genome?
Term pan-genome popularized in microbiology in 2005.
Definition (gene-based pan-genome)
The pan-genome of a species (or other taxonomic unit) is theunion of all sets of genes across all individuals.
pan-genome
core genome
another genomeone genome
Pan-Genome Definition
“... and use the term pan-genome to refer to any collection ofgenomic sequences to be analyzed jointly or to be used as areference. These sequences can be linked in a graph-like structure,or simply constitute sets of (aligned or unaligned) sequences.Questions about efficient data structures, algorithms and statisticalmethods to perform bioinformatic analyses of pan-genomes giverise to the discipline of computational pan-genomics.”
Notes
Not restricted to taxonomic units
Not restricted to full genomes
Not tied to graphs
Intentionally intersects with metagenomics, comparativegenomics, and population genetics
Pan-Genome Definition
“... and use the term pan-genome to refer to any collection ofgenomic sequences to be analyzed jointly or to be used as areference. These sequences can be linked in a graph-like structure,or simply constitute sets of (aligned or unaligned) sequences.Questions about efficient data structures, algorithms and statisticalmethods to perform bioinformatic analyses of pan-genomes giverise to the discipline of computational pan-genomics.”
Notes
Not restricted to taxonomic units
Not restricted to full genomes
Not tied to graphs
Intentionally intersects with metagenomics, comparativegenomics, and population genetics
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
(High-Level) Goals of Computational Pan-Genomics
completeness: containing all functional elements and enoughof the sequence space to serve as a reference for the analysisof additional individuals,
stability: having uniquely identifiable features that can bestudied by different researchers and at different points in time,
comprehensibility: facilitating understanding of thecomplexities of genome structures across many individuals orspecies,
efficiency organizing data in such a way as to acceleratedownstream analysis.
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
Approaches I
ACCCGCTCCTGCCCCGCGTCTGGAGTTCCAAATAAGGCTTGGAAATTT ACCCGCTCCTGCCCCGCGTATTATATTCCAACTCTCTG
C CCAAATAAG CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTCTATTATATT CAACTCTCTG
G ACAAATAAG CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTCTATTATATT CAACTCTCTG
ACCCGCTCCTGCCCCGCGTCTGGAGTTCCAAATAAGGCTTGGAAATTT
C
G
ACCCGCTCCTGCCCCGCGTATTATATTC
C
A
CAACTCTCTG
CAAATAAG
CAAATAAG
CTTGGAAATTT
CTTGGAAATTT
ACCCGCTCCTGCCCCGCG
ACCCGCTCCTGCCCCGCG
TCTGGAGTTC
TCTGGAGTTC
TATTATATT
TATTATATT
CAACTCTCTG
CAACTCTCTG
------------------
------------------
(a) Unaligned sequences
(b) Multiple sequence alignment
(c) De Bruijn graph
CAAA AAAT AATA ATAA TAAG CTTG TTGG GGAATGGA GAAA
AATT
AAAT
ATTT
AAGG
AAGC
AGGC GGCT GCTT
AGCC GCCT CCTT
Haplotype 1
Haplotype 2
Haplotype 3
Haplotype 1
Haplotype 2
Haplotype 3
Approaches II
CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTC TATTATATT
C
T
CAACTCTCTGACCCGCTCCTGCCCCGCGCAAATAAG
C
G
CTTGGAAATTT TCTGGAGTTCCAAATAAG
C
G
ACCCGCTCCTGCCCCGCG TATTATATT
C
T
CAACTCTCTG
(d) Acyclic sequence graph
(e) Cyclic sequence graph
Haplotype 1Haplotype 2Haplotype 3
Approaches III
A TGC
(f) Li-Stephens model
Haplotype 1
Haplotype 2
Haplotype 3
A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC
A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC
A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
Applications
Application Domain: Microbes
Pan-genomics at the gene level: established workflows andmature software are available
For a number of microorganisms, pan-genome sequencedata is already available
Microbial pan-genomes support comparative genomicsstudies (especially given horizontal gene exchange)
Genome-wide association studies (GWAS) for microbes isan emerging field
Application Domain: Metagenomics
Metagenomics: set of genomic sequences co-occurrence inan environment
Questions: taxonomic composition of the sample, presenceof certain gene products or whole pathways, anddetermining which genomes these functional genes areassociated with.
Pan-Genome data structures present the chance to revealcommon adaptations to the environment as well asco-evolution of interactions.
Application Domain: Viruses
Reliable viral haplotype reconstruction is not fully solved
Patient’s viral pan-genome → diagnosis, staging, andtherapy selection
Virus-host interactions: pan-genome structure of a viralpopulation to be directly compared with that of a susceptiblehost population
Application Domain: Plants
Large-scale genomics projects completed / under way:Arabidopsis thaliana, rice, maize, sorghum, and tomato
Plant genomes are large, complex (containing many repeats)and often polyploid
Having a pan-genome available for a given crop that includesits wild relatives provides a single coordinate system toanchor all known variation and phenotype information
Application Domain: Human Genetic Diseases
Numerous genes have been successfully mapped for raremonogenic diseases
Common diseases ← GWAS ← imputation ← catalogs ofhuman genome variation, their linkage disequilibrium (LD)properties
Pan-Genomics can help to achieve this, especially in difficultgenomic regions
Application Domain: Cancer
Improved detection of somatic mutations, throughimproved quality of read mapping to polymorphic regions
Somatic pan-genome describing the general somaticvariability in the human population, → accurate baseline forassessing the impact of somatic alterations.
Vision: personal cancer pan-genome to be built for eachtumor patient: single-cell data, haplotype information,sequencing data from circulating tumor cells and DNA, etc.
Application Domain: Phylogenomics
Computational pan-genomics: repidly extrcat evolutionarysignals, such as gene content tables, sequence alignments ofshared marker genes, genome-wide SNPs, or internaltranscribed spacer (ITS) sequences
Move beyond only using the best aligned, and most wellbehaved residues of a multiple sequence alignment (oftenused in “traditional” phylogenomics)
Table of Content
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
Operations
Sample 1 Sample 2 Sample n
DNA reads
Assembled contigs/whole genome
ATAAGG ACCCGCTC
Assembled contigs/whole genome
ATAAGC ACCCGCTC
Construct
Store
Restore
Pan-genomeAnotherPan-genome
Compare
Annotate Simulate
Assembled contigs/whole genome
ATAAGG ACCCGCTC
Reference genome
ATAAGGACCCGCTC
Sample n+1
Map
ACCCGCTGGAAATTT32M2D16M
Alignment
Variant calling / Phasing
Update
Known variants
ATAAGG ACCCGCTC
ATAAGC ACCCGCTC
Haplotypes
ATAAGG ACCCGCTC
ATAAG - ACCCGCTC
Novel variants
Visualize Retrieve
ATAAGG ACCCGCTC
ATAAG - ACCCGCTC
Haplotypes
ATAAGG ACCCGCTC
ATAAG - ACCCGCTC
Haplotypes
Abstract datatype
Operation Operation output
Operation input Set
Groups multiple inputs/outputs
Pan-genome related data formats
Other data formats
Database
Data processing
Legend
Variants
ATAAGG ACCCGCTC
ATAAG - ACCCGCTC
ATAAGC ACCCGCTC
DNA reads DNA reads DNA reads
Annotation
ACCCGCTGGAAATTT
Gene 1
Visualization
CAAATAAG CTTGGAAATTT ACCCGCTCCTGCCCCGCG TCTGGAGTTC TATTATATT
C
G
Table of Content (What I Skipped)
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation
Design Goals
Construction and Maintenance.Pan-genomes should be constructable from different independentsources, such as
1 existing linear reference genomes and their variants,
2 haplotype reference panels,
3 raw reads, either from bulk sequencing of complex mixtures orfrom multiple samples sequenced separately
The data structure should allow dynamic updates of storedinformation without rebuilding the entire data structure, includinglocal modifications such as adding a new genetic variant, insertionsof new genomes and deletion of contained genomes
Design Goals
Coordinate System.A pan-genome defines the space in which (pan-)genomic analysestake place. It should provide a coordinate system tounambiguously identify genetic loci and (potentially nested)genetic variants. Desirable properties of such a “coordinatesystem” include:
1 nearby positions should have similar coordinates
2 paths representing genomes should correspond to monotonicsequences of coordinates where possible
3 coordinates should be concise and interpretable
Design Goals
Biological Features and Computational Layers.Annotation of biological features should be coherently providedacross all individual genomes. Computationally these featuresrepresent additional layers on top of pan-genomes. This includesinformation about
1 genes, introns, transcription factor binding sites
2 epigenetic properties
3 linkages, including haplotypes
4 gene regulation
5 transcriptional units
6 genomic 3D structure
7 taxonomy among individuals
Design Goals
Data Retrieval.
A pan-genome data structure should provide positionalaccess to individual genome sequences, access to allvariants and to the corresponding allele frequencies.
Haplotypes should be reconstructable including informationabout all maximal blocks and linkage disequilibriumbetween two variants.
Design Goals
Searching within Pan-Genomes.Comparisons of short and long sequences (e.g. reads) with thepan-genome ideally results in the corresponding location and thebest matching individual genome(s). This scenario may occurfor transcriptomic data as well as for DNA re-sequencing data,facilitating the identification of known variants in new samples.
Design Goals
Comparison among Pan-Genomes.Given any pair of genomes within a pan-genome, we expect a datastructure to highlight differences, variable and conservedregions, as well as common syntenic regions. Beyond that, aglobal comparison of two (or more) pan-genomes, e.g. with respectto gene content or population differentiation, should besupported.
Design Goals
Simulation.A pan-genome data structure should support the generation(sampling) of individual genomes similar to the genomes itcontains.
Design Goals
Visualization.All information within a data structure should be easily accessiblefor human eyes by visualization support on different scales. Thisincludes visualization of global genome structure, structuralvariants on genome level and local variants on nucleotide level, butalso biological features and other computational layers should berepresented.
Design Goals
Efficiency.We expect a data structure to use as little space on disk andmemory as possible, while being compatible to computationaltools with a low running time. Supporting specialized hardware,such as general purpose graphics processing units (GPGPUs) orfield-programmable gate arrays (FPGAs), is partly animplementation detail. Yet, in some cases, the target platform caninfluence data structure design significantly.
Table of Content (What I Skipped)
IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics
ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics
Impact of Sequencing Technology on Pan-GenomicsData Structures
Design GoalsApproaches
Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation