computational pan-genomics: status, promises …...computational pan-genomics: status, promises and...

Computational Pan-Genomics:Status, Promises and Challenges

Tobias Marschall, Manja Marz, Thomas Abeel, Louis Dijkstra, Bas E. Dutilh, AliGhaffaari, Paul Kersey, Wigard P. Kloosterman, Veli Makinen, Adam M. Novak,

Benedict Paten, David Porubsky, Eric Rivals, Can Alkan, Jasmijn Baaijens, Paul I.W. De Bakker, Valentina Boeva, Raoul J. P. Bonnal, Francesca Chiaromonte,

Rayan Chikhi, Francesca D. Ciccarelli, Robin Cijvat, Erwin Datema, Cornelia M.Van Duijn, Evan E. Eichler, Corinna Ernst, Eleazar Eskin, Erik Garrison,

Mohammed El-Kebir, Gunnar W. Klau, Jan O. Korbel, Eric-Wubbo Lameijer,Benjamin Langmead, Marcel Martin, Paul Medvedev, John C. Mu, Pieter

Neerincx, Klaasjan Ouwens, Pierre Peterlongo, Nadia Pisanti, Sven Rahmann,Ben Raphael, Knut Reinert, Dick de Ridder, Jeroen de Ridder, Matthias

Schlesner, Ole Schulz-Trieglaff, Ashley D. Sanders, Siavash Sheikhizadeh, CarlShneider, Sandra Smit, Daniel Valenzuela, Jiayin Wang, Lodewyk Wessels, Ying

Zhang, Victor Guryev, Fabio Vandin, Kai Ye, Alexander Schonhuth

30th November 2016

Rebooting the Human Genome*

”The Human Genome Project was one of mankind’s greatest

triumphs. But the official gene map that resulted in 2003, knownas the “reference genome,” is no longer up to the job.“

(Antonio Regalado, MIT Technology Review, June 3, 2015)

*https://www.technologyreview.com/s/537916/rebooting-the-human-genome

https://www.technologyreview.com/s/537916/rebooting-the-human-genome

Workshop Paper

Computational pan-genomics: status, promises

and challengesTheComputational Pan-Genomics Consortium*Corresponding author: Tobias Marschall, Center for Bioinformatics at Saarland University and Max Planck Institute for Informatics, Saarland Informatics

Campus, Saarbrucken, Germany. Tel.: +4968130270880; E-mail: [email protected]*The Computational Pan-Genomics Consortium formed at a workshop held from 8 to 12 June 2015, at the Lorentz Center in Leiden, the Netherlands, with

the purpose of providinga cross-disciplinary overview of the emergingdiscipline of Computational Pan-Genomics. Theworkshop was organized by Victor

Guryev, Tobias Marschall, Alexander Schonhuth (chair), Fabio Vandin, and Kai Ye. Consortium members are listed at the end of this article.

Abstract

Many disciplines, from human genetics and oncology to plant breeding,microbiology and virology, commonly face the chal-lengeof analyzing rapidly increasingnumbers of genomes. In caseof Homosapiens, the number of sequenced genomeswillapproach hundreds of thousands in thenext few years. Simply scalingup established bioinformatics pipelines will not besuff cient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computa-tional methods and paradigms are needed.Wewill witness the rapid extension of computational pan-genomics, a newsub-area of research in computational biology. In this article, wegeneralize existingdef nitions and understand a pan-genomeas any collection of genomic sequences to be analyzed jointly or to beused as a reference.Weexaminealreadyavailable approaches to construct and usepan-genomes, discuss the potential benef ts of future technologies andmethod-ologies and review open challenges from thevantagepoint of theabove-mentioned biological disciplines. As a prominentexample for a computational paradigm shift, we particularly highlight the transition from the representation of referencegenomes as strings to representations as graphs.Weoutline how this and other challenges from different application do-mains translate into common computational problems, point out relevant bioinformatics techniques and identify openproblems in computer science.With this review, weaim to increaseawareness that a joint approach to computational pan-genomics can help addressmany of the problems currently faced in various domains.

Key words: pan-genome; sequencegraph; readmapping; haplotypes; data structures

Introduction

In 1995, the complete genome sequence for the bacteriumHaemophilus influenzae was published [1], followed by the se-quence for the eukaryote Saccharomyces cerevisiae in 1996 [2] andthe landmark publication of the human genome in 2001 [3, 4].These sequences, and many more that followed, have served as‘reference genomes’, which formed the basis for both major ad-vances in functional genomics and for studying genetic vari-ation by re-sequencing other individuals from the same species[5–8]. The advent of rapid and cheap ‘next-generation’ sequenc-ing technologies since 2006 has turned re-sequencing into one

of the most popular modern genome analysis workflows. As oftoday, an incredible wealth of genomic variation within popula-tions has already been detected, permitting functional annota-tion of many such variants, and it is reasonable to expect thatthis is only the beginning.

With the number of sequenced genomes steadily increasing,it makes sense to re-think the idea of a ‘reference’ genome[9, 10]. Such a reference sequence can take a number of forms,including:

• the genomeof a single selected individual,• a consensus drawn from an entire population,

Submitted: 14May 2016;Received (in revised form): 17August 2016

VC TheAuthor 2016. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),

which permits unrestricted reuse, distribution, and reproduction in anymedium, provided the original work is properly cited.

1

Briefings in Bioinformatics, 2016, 1–18

doi: 10.1093/bib/bbw089

Paper

Briefings in Bioinformatics Advance Access published October 21, 2016

at MP

I Com

puter Science on N

ovember 15, 2016

http://bib.oxfordjournals.org/D

ownloaded from

Table of Content

IntroductionDefinition of Computational Pan-GenomicsGoals of Computational Pan-Genomics

ApplicationsMicrobes, Metagenomics, Viruses, Plants, Human GeneticDiseases, Cancer, Phylogenomics

Impact of Sequencing Technology on Pan-GenomicsData Structures

Design GoalsApproaches

Computational ChallengesRead MappingVariant Calling and GenotypingHaplotype PhasingVisualizationData Uncertainty Propagation

What is a Pan-Genome?

Term pan-genome popularized in microbiology in 2005.

Definition (gene-based pan-genome)

The pan-genome of a species (or other taxonomic unit) is theunion of all sets of genes across all individuals.

one genome

genes





another genomeone genome

genes





pan-genome

core genome

another genomeone genome

Pan-Genome Definition

“... and use the term pan-genome to refer to any collection ofgenomic sequences to be analyzed jointly or to be used as areference. These sequences can be linked in a graph-like structure,or simply constitute sets of (aligned or unaligned) sequences.Questions about efficient data structures, algorithms and statisticalmethods to perform bioinformatic analyses of pan-genomes giverise to the discipline of computational pan-genomics.”

Notes

Not restricted to taxonomic units

Not restricted to full genomes

Not tied to graphs

Intentionally intersects with metagenomics, comparativegenomics, and population genetics

Table of Content






(High-Level) Goals of Computational Pan-Genomics

completeness: containing all functional elements and enoughof the sequence space to serve as a reference for the analysisof additional individuals,

stability: having uniquely identifiable features that can bestudied by different researchers and at different points in time,

comprehensibility: facilitating understanding of thecomplexities of genome structures across many individuals orspecies,

efficiency organizing data in such a way as to acceleratedownstream analysis.

Table of Content






Approaches I

ACCCGCTCCTGCCCCGCGTCTGGAGTTCCAAATAAGGCTTGGAAATTT ACCCGCTCCTGCCCCGCGTATTATATTCCAACTCTCTG

C CCAAATAAG CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTCTATTATATT CAACTCTCTG

G ACAAATAAG CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTCTATTATATT CAACTCTCTG

ACCCGCTCCTGCCCCGCGTCTGGAGTTCCAAATAAGGCTTGGAAATTT

C

G

ACCCGCTCCTGCCCCGCGTATTATATTC

C

A

CAACTCTCTG

CAAATAAG

CAAATAAG

CTTGGAAATTT

CTTGGAAATTT

ACCCGCTCCTGCCCCGCG

ACCCGCTCCTGCCCCGCG

TCTGGAGTTC

TCTGGAGTTC

TATTATATT

TATTATATT

CAACTCTCTG

CAACTCTCTG

------------------

------------------

(a) Unaligned sequences

(b) Multiple sequence alignment

(c) De Bruijn graph

CAAA AAAT AATA ATAA TAAG CTTG TTGG GGAATGGA GAAA

AATT

AAAT

ATTT

AAGG

AAGC

AGGC GGCT GCTT

AGCC GCCT CCTT

Haplotype 1

Haplotype 2

Haplotype 3

Haplotype 1

Haplotype 2

Haplotype 3

Approaches II

CTTGGAAATTTACCCGCTCCTGCCCCGCGTCTGGAGTTC TATTATATT

C

T

CAACTCTCTGACCCGCTCCTGCCCCGCGCAAATAAG

C

G

CTTGGAAATTT TCTGGAGTTCCAAATAAG

C

G

ACCCGCTCCTGCCCCGCG TATTATATT

C

T

CAACTCTCTG

(d) Acyclic sequence graph

(e) Cyclic sequence graph

Haplotype 1Haplotype 2Haplotype 3

Approaches III

A TGC

(f) Li-Stephens model

Haplotype 1

Haplotype 2

Haplotype 3

A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC

A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC

A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC A TGC

Table of Content






Applications

Application Domain: Microbes

Pan-genomics at the gene level: established workflows andmature software are available

For a number of microorganisms, pan-genome sequencedata is already available

Microbial pan-genomes support comparative genomicsstudies (especially given horizontal gene exchange)

Genome-wide association studies (GWAS) for microbes isan emerging field

Application Domain: Metagenomics

Metagenomics: set of genomic sequences co-occurrence inan environment

Questions: taxonomic composition of the sample, presenceof certain gene products or whole pathways, anddetermining which genomes these functional genes areassociated with.

Pan-Genome data structures present the chance to revealcommon adaptations to the environment as well asco-evolution of interactions.

Application Domain: Viruses

Reliable viral haplotype reconstruction is not fully solved

Patient’s viral pan-genome → diagnosis, staging, andtherapy selection

Virus-host interactions: pan-genome structure of a viralpopulation to be directly compared with that of a susceptiblehost population

Application Domain: Plants

Large-scale genomics projects completed / under way:Arabidopsis thaliana, rice, maize, sorghum, and tomato

Plant genomes are large, complex (containing many repeats)and often polyploid

Having a pan-genome available for a given crop that includesits wild relatives provides a single coordinate system toanchor all known variation and phenotype information

Application Domain: Human Genetic Diseases

Numerous genes have been successfully mapped for raremonogenic diseases

Common diseases ← GWAS ← imputation ← catalogs ofhuman genome variation, their linkage disequilibrium (LD)properties

Pan-Genomics can help to achieve this, especially in difficultgenomic regions

Application Domain: Cancer

Improved detection of somatic mutations, throughimproved quality of read mapping to polymorphic regions

Somatic pan-genome describing the general somaticvariability in the human population, → accurate baseline forassessing the impact of somatic alterations.

Vision: personal cancer pan-genome to be built for eachtumor patient: single-cell data, haplotype information,sequencing data from circulating tumor cells and DNA, etc.

Application Domain: Phylogenomics

Computational pan-genomics: repidly extrcat evolutionarysignals, such as gene content tables, sequence alignments ofshared marker genes, genome-wide SNPs, or internaltranscribed spacer (ITS) sequences

Move beyond only using the best aligned, and most wellbehaved residues of a multiple sequence alignment (oftenused in “traditional” phylogenomics)

Table of Content






Operations

Sample 1 Sample 2 Sample n

DNA reads

Assembled contigs/whole genome

ATAAGG ACCCGCTC


ATAAGC ACCCGCTC

Construct

Store

Restore

Pan-genomeAnotherPan-genome

Compare

Annotate Simulate


ATAAGG ACCCGCTC

Reference genome

ATAAGGACCCGCTC

Sample n+1

Map

ACCCGCTGGAAATTT32M2D16M

Alignment

Variant calling / Phasing

Update

Known variants

ATAAGG ACCCGCTC

ATAAGC ACCCGCTC

Haplotypes

ATAAGG ACCCGCTC

ATAAG - ACCCGCTC

Novel variants

Visualize Retrieve

ATAAGG ACCCGCTC

ATAAG - ACCCGCTC

Haplotypes

ATAAGG ACCCGCTC

ATAAG - ACCCGCTC

Haplotypes

Abstract datatype

Operation Operation output

Operation input Set

Groups multiple inputs/outputs

Pan-genome related data formats

Other data formats

Database

Data processing

Legend

Variants

ATAAGG ACCCGCTC

ATAAG - ACCCGCTC

ATAAGC ACCCGCTC

DNA reads DNA reads DNA reads

Annotation

ACCCGCTGGAAATTT

Gene 1

Visualization

CAAATAAG CTTGGAAATTT ACCCGCTCCTGCCCCGCG TCTGGAGTTC TATTATATT

C

G

Table of Content (What I Skipped)






Design Goals

Construction and Maintenance.Pan-genomes should be constructable from different independentsources, such as

1 existing linear reference genomes and their variants,

2 haplotype reference panels,

3 raw reads, either from bulk sequencing of complex mixtures orfrom multiple samples sequenced separately

The data structure should allow dynamic updates of storedinformation without rebuilding the entire data structure, includinglocal modifications such as adding a new genetic variant, insertionsof new genomes and deletion of contained genomes

Design Goals

Coordinate System.A pan-genome defines the space in which (pan-)genomic analysestake place. It should provide a coordinate system tounambiguously identify genetic loci and (potentially nested)genetic variants. Desirable properties of such a “coordinatesystem” include:

1 nearby positions should have similar coordinates

2 paths representing genomes should correspond to monotonicsequences of coordinates where possible

3 coordinates should be concise and interpretable

Design Goals

Biological Features and Computational Layers.Annotation of biological features should be coherently providedacross all individual genomes. Computationally these featuresrepresent additional layers on top of pan-genomes. This includesinformation about

1 genes, introns, transcription factor binding sites

2 epigenetic properties

3 linkages, including haplotypes

4 gene regulation

5 transcriptional units

6 genomic 3D structure

7 taxonomy among individuals

Design Goals

Data Retrieval.

A pan-genome data structure should provide positionalaccess to individual genome sequences, access to allvariants and to the corresponding allele frequencies.

Haplotypes should be reconstructable including informationabout all maximal blocks and linkage disequilibriumbetween two variants.

Design Goals

Searching within Pan-Genomes.Comparisons of short and long sequences (e.g. reads) with thepan-genome ideally results in the corresponding location and thebest matching individual genome(s). This scenario may occurfor transcriptomic data as well as for DNA re-sequencing data,facilitating the identification of known variants in new samples.

Design Goals

Comparison among Pan-Genomes.Given any pair of genomes within a pan-genome, we expect a datastructure to highlight differences, variable and conservedregions, as well as common syntenic regions. Beyond that, aglobal comparison of two (or more) pan-genomes, e.g. with respectto gene content or population differentiation, should besupported.

Design Goals

Simulation.A pan-genome data structure should support the generation(sampling) of individual genomes similar to the genomes itcontains.

Design Goals

Visualization.All information within a data structure should be easily accessiblefor human eyes by visualization support on different scales. Thisincludes visualization of global genome structure, structuralvariants on genome level and local variants on nucleotide level, butalso biological features and other computational layers should berepresented.

Design Goals

Efficiency.We expect a data structure to use as little space on disk andmemory as possible, while being compatible to computationaltools with a low running time. Supporting specialized hardware,such as general purpose graphics processing units (GPGPUs) orfield-programmable gate arrays (FPGAs), is partly animplementation detail. Yet, in some cases, the target platform caninfluence data structure design significantly.

Table of Content (What I Skipped)






computational pan-genomics: status, promises …...computational pan-genomics: status, promises and...

Documents