![Page 1: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/1.jpg)
![Page 2: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/2.jpg)
![Page 3: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/3.jpg)
GENOME SEQUENCING AND ASSEMBLY
Mayo/UIUC Summer Course in Computational Biology
![Page 4: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/4.jpg)
Session Outline
Planning a genome sequencing project
Assembly strategies and algorithms
Assessing the quality of the assembly
Assessing the quality of the assemblers
Genome annotation
![Page 5: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/5.jpg)
Genome sequencing
![Page 6: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/6.jpg)
Schematic overview of genome assembly. (a) DNA is collected from the biological sample and sequenced. (b) The output from the sequencer consists of many billions of short, unordered DNA fragments from random positions in the genome. (c) The short fragments are compared with each other to discover how they overlap. (d) The overlap relationships are captured in a large assembly graph, shown as nodes representing k-mers or reads, with edges drawn between overlapping k-mers or reads. (e) The assembly graph is refined to correct errors and simplified into the initial set of contigs, shown as large ovals connected by edges. (f) Finally, mates, markers and other long-range information are used to order and orient the initial contigs into large scaffolds, shown as thin black lines connecting the initial contigs.
Schatz et al. Genome Biology 2012 13:243
![Page 7: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/7.jpg)
Planning a genome sequencing project
How large is my genome?
How much of it is repetitive, and what is the repeat size distribution?
Is a good quality genome of a related species available?
What will be my strategy for performing the assembly?
![Page 8: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/8.jpg)
How large is my genome?
The size of the genome can be estimated from the ploidy of the organism and the DNA content per cell
This will affect:
» How many reads will be required to attain sufficient coverage (typically 10x to 100x)
» What sequencing technology to use
» What computational resources will be needed
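The link between genome size, coverage and read count can be sketched as simple arithmetic: total sequenced bases divided by genome size gives the coverage. The function names and the example values below are illustrative assumptions, not part of the course material.

```python
import math

def reads_needed(genome_size_bp, read_length_nt, coverage):
    """Reads required so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_nt)

def achieved_coverage(n_reads, read_length_nt, genome_size_bp):
    """Average coverage obtained from a given number of reads."""
    return n_reads * read_length_nt / genome_size_bp

# Illustrative values only: a 5 Mb genome, 150 nt reads, 50x target coverage.
n = reads_needed(5_000_000, 150, 50)
print(n, "reads give about", round(achieved_coverage(n, 150, 5_000_000), 1), "x")
```

This ignores reads lost to trimming and filtering, so a real project would likely budget somewhat more.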
![Page 9: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/9.jpg)
Repetitive sequences
Most common source of assembly errors
If the sequencing technology produces reads longer than the repeat size, the impact is much smaller
Most common solution: generate reads or mate pairs with spacing greater than the largest known repeat
![Page 10: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/10.jpg)
![Page 11: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/11.jpg)
Assemblies can collapse around repetitive sequences.
Salzberg S.L. and Yorke J.A. Bioinformatics 2005;21:4320-4321
![Page 12: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/12.jpg)
![Page 13: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/13.jpg)
![Page 14: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/14.jpg)
Genome(s) from related species
Preferably of good quality, with large reliable scaffolds
Help guide the assembly of the target species
Help verify the completeness of the assembly
Can themselves be improved in some cases
But use with caution: they can cause errors when the genome architectures are different!
![Page 15: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/15.jpg)
Strategies for assembly
The sequencing approaches and assembly strategies are interdependent!
» E.g., for bacterial genome assembly, can generate PacBio reads and assemble with Celera Assembler, or generate Illumina reads and assemble with Velvet or SPAdes
» Optimal sequencing strategies are very different for a SOAPdenovo assembly and for an ALLPATHS-LG assembly
![Page 16: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/16.jpg)
Typical sequencing strategies
Bacterial genome:
» 2x300 overlapping paired-end reads from Illumina MiSeq machine, assembly with SPAdes
» PacBio CLR sequences at 200x coverage, self-correction and/or hybrid correction and assembly using Celera Assembler or PBJelly
Vertebrate genome:
» Combination of paired-end (2x250 nt overlapping fragments) and mate-pair (1, 3 and 10 kb libraries) 100 nt reads from Illumina machine at 100x coverage (~1B reads for a 1 Gb genome), assembly with ALLPATHS-LG
![Page 17: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/17.jpg)
![Page 18: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/18.jpg)
Illumina paired end and mate pair sequencing
![Page 19: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/19.jpg)
Additional useful data
Fosmid libraries
» End sequencing adds long-range contiguity information
» Pooled fosmids (~5000) can often be assembled more efficiently
Moleculo (Illumina TSLR) libraries
» Technology acquired by Illumina, allows generation of fully assembled 10 kb sequences
PacBio reads
» Provide 5-8 kb reads, but in most cases need parallel coverage by Illumina data for error correction
![Page 20: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/20.jpg)
Assembly strategies and algorithms
In all cases, start with cleanup and error correction of raw reads
For long reads (>500 nt), Overlap/Layout/Consensus (OLC) algorithms work best
For short reads, De Bruijn graph-based assemblers are most widely used
![Page 21: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/21.jpg)
Cleaning up the data
Trim reads with low quality calls
Remove short reads
Correct errors:
» Find all distinct k-mers (typically k=15) in input data
» Plot coverage distribution
» Correct low-coverage k-mers to match high-coverage k-mers
» Part of several assemblers, also stand-alone Quake or khmer programs
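The first two error-correction steps above (counting k-mers and looking at their coverage distribution) can be sketched as follows. This is a minimal illustration on toy reads. It only flags low-coverage k-mers and does not attempt the correction itself, and it is not how Quake or khmer are implemented. The function names, the toy reads and the threshold are assumptions.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def coverage_histogram(counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def weak_kmers(counts, min_count):
    """k-mers seen fewer than min_count times: candidates for sequencing errors."""
    return {kmer for kmer, c in counts.items() if c < min_count}

# Toy data: three identical reads plus one read with a single substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
print(sorted(coverage_histogram(counts).items()))
print(sorted(weak_kmers(counts, min_count=2)))
```

In real data the histogram usually shows a peak of rare, error-derived k-mers well separated from the main coverage peak, and the threshold is chosen in the valley between them.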
![Page 22: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/22.jpg)
![Page 23: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/23.jpg)
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: reads numbered 1 to 9 connected by their overlaps; example sequence ACCTGAACCTGAAGCTGAACCAGA]
![Page 24: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/24.jpg)
![Page 25: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/25.jpg)
![Page 26: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/26.jpg)
OLC assembly steps
Calculate overlaps
» Can use BLAST-like method, but finding common k-mers is more efficient
Assemble layout graph, try to simplify graph and remove nodes (reads); find Hamiltonian path
Generate consensus from the alignments between reads (overlaps)
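The overlap and layout steps can be illustrated with a toy sketch. It computes exact suffix-prefix overlaps and then greedily merges the pair of reads with the longest overlap. Greedy merging is a swapped-in heuristic, not a Hamiltonian path search, and the consensus step is skipped because the toy reads are assumed to be error-free. All names and reads here are illustrative.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the longest overlap until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return reads  # remaining sequences are the contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)

# Toy error-free reads sampled from the sequence ACGTTGCAAGGC.
contigs = greedy_assemble(["ACGTTGC", "TTGCAAG", "CAAGGC"])
print(contigs)
```

The all-against-all comparison is quadratic in the number of reads, which is why the slide notes that k-mer based overlap detection is more efficient in practice.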
![Page 27: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/27.jpg)
Some OLC-based assemblers
Celera Assembler with the Best Overlap Graph (CABOG)
» Designed for Sanger sequences, but works with 454 and PacBio reads (with or without error correction)
Newbler, a.k.a. GS de novo Assembler
» Designed for 454 sequences, but works with Sanger reads
![Page 28: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/28.jpg)
De Bruijn graphs - concept
![Page 29: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/29.jpg)
Converting reads to a De Bruijn graph
Reads are 7 nt long
Graph with k=3
Deduced sequence (main branch)
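The conversion from reads to a De Bruijn graph can be sketched on a toy example with 7 nt reads and k=3, matching the settings named on this slide. Here nodes are k-mers and an edge joins two k-mers that are adjacent in a read (overlapping by k-1 bases). The reads, function names and the simple path walk are assumptions for illustration, not the figure's data and not Velvet's implementation.

```python
def build_dbg(reads, k):
    """Nodes are k-mers; an edge links k-mers that are adjacent in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph

def walk_unambiguous(graph):
    """Spell the sequence from a node with no incoming edge while the path does not branch."""
    targets = {t for successors in graph.values() for t in successors}
    starts = sorted(node for node in graph if node not in targets)
    if not starts:
        return ""
    node = starts[0]
    sequence, seen = node, {node}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a cycle
            break
        sequence += nxt[-1]
        seen.add(nxt)
        node = nxt
    return sequence

# Toy error-free 7 nt reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
graph = build_dbg(reads, k=3)
print(walk_unambiguous(graph))
```

With sequencing errors or repeats the graph gains side branches and cycles, and the walk above would stop at the first branching node.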
![Page 30: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/30.jpg)
DBG implementation in the Velvet assembler
![Page 31: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/31.jpg)
![Page 32: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/32.jpg)
Examples of DBG-based assemblers
EULER (P. Pevzner), the first assembler to use DBG
Velvet (D. Zerbino), a popular choice for small genomes
SOAPdenovo (BGI), widely used by BGI, best for relatively unstructured assemblies
ALLPATHS-LG, probably the most reliable assembler for large genomes (but with strict input requirements)
![Page 33: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/33.jpg)
![Page 34: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/34.jpg)
Repeats often split genome into contigs
Contig derived from unique sequences
Reads from multiple repeats collapse into artefactual contig
![Page 35: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/35.jpg)
Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
[Figure labels: Reads; Consensus (15-30 Kbp); Contig; 2-pair; Scaffold; gap mean and std. dev. are known]
![Page 36: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/36.jpg)
Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped scaffolds; Contig; Gap (mean & std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “reads”]
![Page 37: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/37.jpg)
Assembly gaps
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap
Sequencing gaps
Physical gaps
![Page 38: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/38.jpg)
Handling repeats
1. Repeat detection
» pre-assembly: find fragments that belong to repeats
• statistically (most existing assemblers)
• repeat database (RepeatMasker)
» during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
» post-assembly: find repetitive regions and potential mis-assemblies
• REPuter, RepeatMasker
• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
» find DNA fragments belonging to the repeat
» determine correct tiling across the repeat
» obtain long reads spanning repeats
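The "unhappy" mate-pair check mentioned under post-assembly detection can be sketched as a simple classifier. It assumes that both mates map to the same contig, that the library is expected in forward/reverse orientation (leftmost mate on '+', rightmost on '-'), and that pairs more than three standard deviations from the mean insert size count as unhappy. The class, field names and thresholds are illustrative assumptions, and other library types would need a different expected orientation.

```python
from dataclasses import dataclass

@dataclass
class MatePair:
    pos1: int      # mapped position of the first mate on the contig
    strand1: str   # '+' or '-'
    pos2: int      # mapped position of the second mate
    strand2: str

def classify_pair(pair, mean_insert, std_insert, n_std=3.0, expected=("+", "-")):
    """Label a mapped pair as happy, mis-oriented, too close or too far."""
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    if (left[1], right[1]) != expected:
        return "mis-oriented"
    span = right[0] - left[0]
    if span < mean_insert - n_std * std_insert:
        return "too close"
    if span > mean_insert + n_std * std_insert:
        return "too far"
    return "happy"

# Illustrative 3 kb library with a 200 bp standard deviation.
pairs = [
    MatePair(100, "+", 3100, "-"),
    MatePair(100, "+", 600, "-"),
    MatePair(100, "+", 5000, "-"),
    MatePair(100, "+", 3100, "+"),
]
labels = [classify_pair(p, mean_insert=3000, std_insert=200) for p in pairs]
print(labels)
```

Clusters of unhappy pairs at the same location are more informative than isolated ones, since single pairs can simply be chimeric or mis-mapped.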
![Page 39: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/39.jpg)
How good is my assembly?
How much total sequence is in the assembly relative to estimated genome size?
How many pieces, and what is their size distribution?
Are the contigs assembled correctly?
Are the scaffolds connected in the right order / orientation?
How were the repeats handled?
Are all the genes I expected in the assembly?
![Page 40: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/40.jpg)
N50: the most common measure of assembly quality
N50 = length of the shortest contig in the smallest set of the largest contigs that together make up at least 50% of the total assembly length
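As a worked example, N50 can be computed by sorting contigs from longest to shortest and accumulating lengths until half of the total assembly length is reached. The function name and the contig lengths below are illustrative.

```python
def n50(lengths):
    """Length of the contig at which the running total first reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig")

# Illustrative contig lengths (kb): total 290, so the half-way point is 145.
# 80 alone is not enough; 80 + 70 = 150 passes it, so N50 = 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on contig lengths, it says nothing about whether the contigs are assembled correctly.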
![Page 41: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/41.jpg)
![Page 42: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/42.jpg)
Order and orientation of contigs – more errors in one assembly than in another
![Page 43: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/43.jpg)
REAPR overview
![Page 44: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/44.jpg)
REAPR Summary
REAPR is a toolkit that assesses the quality of a genome assembly independently of the assembler, and without needing a “gold” reference assembly
REAPR is not a variant calling tool; it examines the consistency of a genome assembly with the same data that were used to assemble it
REAPR output can be visualized in many ways, and helps genome finishing projects
Every genome assembly project should use REAPR or a similar toolkit to perform quality checks on the assemblies being produced
![Page 45: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/45.jpg)
BUSCO and CEGMA: conserved gene sets
CEGMA: from Ian Korf’s group, UC Davis; mapping Core Eukaryotic Genes
BUSCO: from Evgeny Zdobnov’s group, University of Geneva
Coverage is indicative of quality and completeness of assembly
![Page 46: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/46.jpg)
Even the best genomes are not perfect
![Page 47: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/47.jpg)
There is no such thing as a “perfect” assembler (results from GAGE competition)
![Page 48: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/48.jpg)
The computational demands and effectiveness of assemblers are very different
![Page 49: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/49.jpg)
Assessing assembly strategies
Assemblathon (UC Davis and UC Santa Cruz)
» Provide challenging datasets to assemble in open competition (synthetic for edition 1, real for edition 2)
» Assess competitor assemblies by many different metrics
» Publish extensive reports
GAGE (U. of Maryland and Johns Hopkins)
» Select datasets associated with known high-quality genomes
» Run a set of open source assemblers with parameter sweeps on these datasets
» Compare the results, publish in scholarly journals with complete documentation of parameters
![Page 50: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/50.jpg)
Some advice on running assemblies
Perform parameter sweeps
» Use many different values of key parameters, especially k-mer size for DBG assemblers, and evaluate the output (some assemblers can do this automatically)
Try different subsets of the data
» Sometimes libraries are of poor quality and degrade the quality of the assembly
» Artefacts in the data (e.g. PCR duplicates, homopolymer runs, …) can also badly affect output quality
Try more than one assembler
» There is no such thing as “the best” assembler
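A parameter sweep over k-mer size can be organised as a small loop that runs the assembler once per value and records a summary metric. In the sketch below, `fake_assemble` is a hypothetical stand-in that returns made-up contig lengths. In practice it would be replaced by a call to a real assembler, and N50 would be only one of several metrics worth comparing.

```python
def n50(lengths):
    """N50 of a list of contig lengths."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig")

def sweep(assemble, k_values):
    """Run assemble(k) for each k and return the k with the highest N50, plus all results."""
    results = {k: n50(assemble(k)) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results

def fake_assemble(k):
    """Hypothetical stand-in for a real assembler run; the lengths are invented."""
    made_up_contigs = {21: [500, 300, 200], 31: [900, 100], 41: [400, 400, 200]}
    return made_up_contigs[k]

best_k, results = sweep(fake_assemble, [21, 31, 41])
print(best_k, results)
```

Keeping the full results table, rather than only the winner, makes it easier to check whether the chosen value sits on a stable plateau or an isolated spike.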
![Page 51: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/51.jpg)
Genome annotation
A genome sequence is useless without annotation
Three steps in genome annotation:
» Find features not associated with protein-coding genes (e.g. tRNA, rRNA, snRNA, SINE/LINE, miRNA precursors)
» Build models for protein-coding genes, including exons, coding regions, regulatory regions
» Associate biologically relevant information with the genome features and genes
![Page 52: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/52.jpg)
Methods for genome annotation
Ab initio, i.e. based on sequence alone
» INFERNAL/Rfam (RNA genes), miRBase (miRNAs), RepeatMasker (repeat families), many gene prediction algorithms (e.g. AUGUSTUS, Glimmer, GeneMark, …)
Evidence-based
» Require transcriptome data for the target organism (the more the better)
» Align cDNA sequences to assembled genome and generate gene models: TopHat/Cufflinks, Scripture
![Page 53: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/53.jpg)
Methods for biological annotation
BLAST of gene models against protein databases
» Sequence similarity to known proteins
InterProScan of predicted proteins against databases of protein domains (Pfam, Prosite, HAMAP, PANTHER, …)
Mapping against Gene Ontology terms (BLAST2GO)
![Page 54: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/54.jpg)
MAKER, integration framework for genome annotation
MAKER runs many software tools on the assembled genome and collates the outputs
See http://gmod.org/wiki/MAKER
![Page 55: GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology](https://reader031.vdocuments.mx/reader031/viewer/2022020111/56649e945503460f94b98f1f/html5/thumbnails/55.jpg)
Acknowledgements
For this slide deck I “borrowed” figures and slides from many publications, Web pages and presentations by
» M. Schatz, S. Salzberg, K. Bradnam, K. Krampis, D. Zerbino, J. J. Cook, M. Pop, G. Sutton, T. Seemann
Thank you!