de novo assembly course - cbs.dtu.dk · the dog assembly tracked closely to the theoretical...

27626 - Next Generation Sequencing Analysis

de novo assembly

Preprocessing and SNP calling

Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group

Josef K Vogt

Slides by: Simon Rasmussen

2017


Generalized NGS analysis

Raw reads

Pre-processing

Assembly: Alignment / de novo

Application specific: Variant calling, count matrix, ...

Compare samples / methods

Answer?

Question

Data size


What is de novo assembly?

Merge small DNA fragments together so they form a previously unknown sequence

Merge millions of reads together so they form previously unknown sequences

de novo assembly

• Assemble reads into longer fragments

• Find overlap between reads

• Many approaches

reads > contigs > scaffolds

Let's try to assemble some reads!

• Rules:

• a minimum of 7-bp overlap

• overlap must not include any N bases

• same orientation so that the sequence can be read left to right

• there may be 1-bp differences

• simplified - no double stranded DNA

Which are valid?

..NNNNGGACTATGATTCG
            |||||||
            TGATTCGAGGCTAANN..

..NNNNNNNNCGATTCTGATCCGA
          |||||||
     GTCCTCGATTCTNNNNNNNN..

..NNNNCGGACTATGATT
            ||||||
            ATGATTCGAGGCTAANN..

..NNNNNNNNCGCTACTGATCCGA
          || | |||
     GTCCTCGATTCTGNNNNNNN..
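The rules above can be turned into a small check. This is only a sketch under stated assumptions: "1-bp differences" is read here as "at most one mismatch in the overlap", the overlap is taken as suffix of the left read against prefix of the right read, and the function name `overlap_length` and its parameters are made up for illustration.

```python
def overlap_length(left, right, min_len=7, max_mismatches=1):
    """Return the longest valid suffix/prefix overlap between two reads, or 0.

    Assumptions (not stated in the slides): "1-bp differences" means
    "at most one mismatch", and both reads are in the same orientation.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_len - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        if "N" in suffix or "N" in prefix:
            continue  # rule: the overlap must not include any N bases
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return length
    return 0


# Known bases of the first and third candidate pairs from the slide.
print(overlap_length("GGACTATGATTCG", "TGATTCGAGGCTAA"))
print(overlap_length("CGGACTATGATT", "ATGATTCGAGGCTAA"))
```

A return value of 0 means that no overlap satisfying the rules was found for that pair.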


Which approaches?

• Greedy (“Simple” approach)

• Overlap-Layout-Consensus (OLC)

• de Bruijn graphs


Simple approach - Greedy

• Pseudo code:

1. Pairwise alignment of all reads

2. Identify fragments that have largest overlap

3. Merge these

4. Repeat until all overlaps are used

• Can only resolve repeats smaller than read length

• High computational cost with increasing no. reads
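The pseudo code above can be sketched in a few lines. This is a toy illustration, not how any particular assembler is implemented: it uses exact suffix/prefix matching instead of real pairwise alignment, the three reads are invented, and `min_len=3` is an arbitrary choice for such short reads.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest exact suffix(a)/prefix(b) overlap, or 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    fragments = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(fragments) > 1:
        # Steps 1-2: all-vs-all comparison, keep the best-overlapping pair.
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(fragments, 2):
            length = overlap(a, b, min_len)
            if length > best_len:
                best_len, best_a, best_b = length, a, b
        if best_len == 0:
            break  # step 4: no usable overlaps left
        # Step 3: merge the pair into one longer fragment.
        fragments.remove(best_a)
        fragments.remove(best_b)
        fragments.append(best_a + best_b[best_len:])
    return fragments


reads = ["ATGCGTAC", "GTACCTGA", "CTGAAGTC"]  # hypothetical reads
print(greedy_assemble(reads))
```

The all-vs-all loop is repeated after every merge, which is where the high computational cost for large numbers of reads comes from.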


Reads > Contigs > Scaffolds

• Overlap Layout Consensus and de Bruijn use a similar general approach.

1. Try to correct sequence errors in reads with high coverage

2. Assemble reads to contiguous sequence fragments “contigs”

3. Identify repeat contigs

4. Combine and order contigs to “scaffolds”, with gaps representing regions of uncertainty


Overlap-Layout-Consensus

• Create overlap graph by all-vs-all alignment (Overlap)

• Build graph where each node is a read, edges are overlaps between reads (Layout)

• Example

[...] separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.

Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.

To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.

In theory, the size of the de Bruijn graph depends only on the size of the genome, including polymorphic alleles, and should be independent of the number of reads. However, because sequencing errors create their own graph nodes, increasing the number of reads inevitably increases the size of the de Bruijn graph. In the de novo assembly of human from short reads, SOAPdenovo reduced the number of 25-mers from 14.6 billion to 5.0 billion by correcting errors before constructing the de Bruijn graph (Li et al. 2010). Its error correction method first counts the number of occurrences of all k-mers in the reads and replaces any k-mers occurring less than three times with the highest frequency alternative k-mer.

Choice of assembler and sequencing strategy

Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<50 bp). For longer reads (>100 bp), overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.

As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: The figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that [...]

Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k - 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.

Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

Schatz et al., Genome Res, 2010


Overlap-Layout-Consensus

• Create consensus sequence

• We need to use graph theory to solve the graph

• Walk the Hamiltonian path

• I.e. visit each node exactly once

Imagine trying to solve this for a graph of hundreds of thousands of nodes (= reads) - this is an NP-complete problem
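To get a feel for why this does not scale, here is a brute-force sketch that tries every ordering of the reads and keeps those where each read overlaps the next. The reads, the exact-match overlap test and the `min_len=3` threshold are assumptions made for this toy example; real OLC assemblers use heuristics rather than enumeration.

```python
from itertools import permutations


def has_overlap(a, b, min_len=3):
    """True if a suffix of a (at least min_len long) equals a prefix of b."""
    return any(a.endswith(b[:n]) for n in range(min_len, min(len(a), len(b)) + 1))


def hamiltonian_paths(reads, min_len=3):
    """Brute force: test all n! orderings of the reads."""
    paths = []
    for order in permutations(reads):
        if all(has_overlap(a, b, min_len) for a, b in zip(order, order[1:])):
            paths.append(order)
    return paths


reads = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]  # hypothetical reads
print(hamiltonian_paths(reads))
```

Three reads give 6 orderings to test, ten reads already give 3,628,800, so enumeration is out of the question for real data sets.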


Overlap-Layout-Consensus

• Relatively high assembly error rate

• Very hard for high coverage and large genomes (all-vs-all)

• With short read lengths, hard to resolve repeats

• Good for small genomes and large read lengths (454, Ion Torrent, PacBio)


de Bruijn graph

• Directed graph of overlapping items (here DNA sequences)

• Instead of comparing reads, decompose reads into k-mers

• Graph is created by mapping the k-mers to the graph

• Each k-mer only exists once in the graph

• Problem is reduced to walking an Eulerian path (visiting each edge exactly once) - this is a solvable problem
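A minimal sketch of the idea, assuming one common formulation in which every distinct k-mer is an edge between its prefix and suffix (k-1)-mers; other formulations use the k-mers themselves as nodes. The reads and the choice k = 4 are invented, and the walk uses Hierholzer's algorithm, which is a standard way to find an Eulerian path and not necessarily what any given assembler does.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes one edge: prefix (k-1)-mer -> suffix (k-1)-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in out_deg if out_deg[n] - in_deg[n] == 1), next(iter(graph)))
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


reads = ["ATGCGT", "GCGTAC", "GTACCT"]  # hypothetical error-free reads
path = eulerian_path(de_bruijn_edges(reads, k=4))
# Spell the sequence: first node plus the last base of every following node.
print(path[0] + "".join(node[-1] for node in path[1:]))
```

No read-against-read comparison takes place: the reads are only cut into k-mers, and the walk runs in time proportional to the number of edges.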


Drawbacks ...

• Lots of RAM required (1-1000 GB!)

• Optimal k cannot be identified a priori, must be experimentally tested for each dataset

• small k: very complex graph, large k: limited overlap in low coverage areas

• Iterative approach to find best assembly





How is the graph constructed?

• Same 10 reads: extract k-mers from the reads and map them onto the graph, k = 3

• Nodes (3-mers): GAC, ACC, CCT, CTA, TAC, ACA, CAA, AAG, AGT, GTT, TTA, TAG, GTC, TCC, CCG

No alignment is used!

Different assemblers use different modifications of the de Bruijn graph


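Complicated graphs

Large genomes with many repeats/errors create very large graphs

[Figure: de Bruijn graph from Schatz et al. with nodes CTA, TAC, ACA, CAA, AAG, AGT, GTT, TTA, TAG, GTC, TCC, CCG; a second panel shows the k-mers GAC, ACC, CCT next to GTC, TCC, CCT, labelled "G to T"]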

AAGACTCCGACTGGGACTTT

Create the de Bruijn graph of this genome using k=3
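One way to check a hand-drawn answer to this exercise is to list the nodes and edges programmatically. The sketch below follows the node-centric picture used in these slides (k-mers are nodes, successive k-mers are linked); the function name `kmer_graph` is made up for illustration.

```python
from collections import Counter, defaultdict


def kmer_graph(sequence, k):
    """Nodes are k-mers; edges join successive k-mers (overlap of k-1 bases)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    coverage = Counter(kmers)  # how often each node is visited
    edges = defaultdict(set)
    for current, following in zip(kmers, kmers[1:]):
        edges[current].add(following)
    return coverage, edges


genome = "AAGACTCCGACTGGGACTTT"
coverage, edges = kmer_graph(genome, k=3)
for node in coverage:
    # node, number of visits, successor nodes
    print(node, coverage[node], sorted(edges[node]))
```

Re-running with a larger `k` shows how the fork caused by the repeated GACT disappears once k is longer than the repeat.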



After building: Simplify

• Clip tips (sequencing errors at read ends)

• Remove low-coverage links

• Pinch bubbles (sequencing errors in the middle of reads, SNPs)
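The coverage idea behind these clean-up steps can be illustrated with a simple k-mer count filter. This is a deliberately simplified sketch: real assemblers clip tips and pinch bubbles on the graph structure itself, whereas this only drops k-mers seen fewer than `min_count` times. The reads and the threshold of 2 are hypothetical.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Count k-mers over all reads and keep those seen at least min_count times."""
    counts = Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Three error-free copies of a read plus one copy with an error in the last base.
reads = ["ATGCGTAC", "ATGCGTAC", "ATGCGTAC", "ATGCGTAT"]
print(solid_kmers(reads, k=4))
```

The k-mer created by the error at the read end occurs only once and is filtered out, which corresponds to a low-coverage tip in the graph.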


Create contigs and scaffolds

• Cut graph at repeat boundaries to create contigs

• Use paired end information to resolve repeats and combine to scaffolds

• Fill potential gaps using PE reads

The assembly is done

[Figure: contigs C1, C2, C3, C4 around a repeat, combined into scaffolds S1 and S2]


Iterate parameters

• Re-run with different k-sizes, find optimum

• Compare assembly statistics such as assembly length, N50, no. contigs

• Assembly refinement

• Break contigs not supported by PE/MP reads

• Analyze assembly using REAPR or QUAST


Application Note: Sequencing

Illumina, Inc. Pub. No. 770-2011-009, current as of 8 August 2011

[...] analyzed using CASAVA 1.8a5, and de novo assembly was completed using Velvet. For the de novo assembly comparison between MiSeq and Ion Torrent, the open access assemblers MIRA [1] and Ray [2] were used on MiSeq data down-sampled to 50× and compared to the entire data set from Ion Torrent reads [3]. These open source assembly tools are reported to work well with both Illumina and Ion Torrent data [3], and produced results comparable to Velvet for the MiSeq data.

Results and Data Analysis

Data generated from the MiSeq and HiSeq systems showed similar cluster density and numbers of clusters passing filter. De novo assembly metrics from the HiSeq and MiSeq reads are very similar (Table 1). Comparison of HiSeq and MiSeq data with the reference sequence illustrates equivalent coverage over a range of GC content (Figure 1). Data from the MiSeq assembly overlaying the E. coli reference sequence are shown in a Circos plot (Figure 2), demonstrating excellent coverage over the entire genome.

De novo assembly data from the 2 × 150 bp MiSeq run was compared with Ion Torrent data [3]. To make an equal comparison, MiSeq data was down-sampled to 50× coverage, comprising 231 Mb, or approximately 1/7th of the data. Both the max contig length and N50 values were vastly superior in the down-sampled MiSeq data compared to the entire Ion Torrent data set (Figures 3A and B).

Conclusions

Using the same library preparation from bacterial DNA, sequencing on MiSeq was shown to be very comparable to HiSeq; both platforms yield high-quality data with > 85% bases above Q30 with even GC coverage. De novo assembly with these data also produce similar results, with excellent coverage of the reference sequence. Sequencing results generated on the MiSeq System are highly predictive of those delivered by the high-throughput HiSeq 2000 sequencing platform, making MiSeq ideal for piloting larger studies or performing independent experiments requiring speed and accuracy. For de novo assembly, the importance of high-quality, paired-end MiSeq reads is readily apparent compared to Ion Torrent. The high quality assembly produced from MiSeq paired-end reads show that better data give a more accurate picture of the genome.

Figure 2: Circos Plot of MiSeq E. coli Assembly

Circos plot showing E. coli MG1655 assembly from MiSeq data. Genomic position is indicated on the white outermost ring. Genes are indicated in red (numbers are in thousands). Coverage from the de novo assembly is represented in orange, aligned to the complete reference using BLAST and filtered for 99% identity. Percent GC content is indicated in green. Grey lines show repetitive sequences and their relationships across the genome.

Figure 3: De novo Assembly Contigs and N50 Length

[Bar charts, panels A and B. Legend: IT 314 MIRA 3.2.1, IT 314 Ray 1.3.0, MiSeq MIRA 3.0.0, MiSeq Ray 1.3.0]

A. Max contig length in bp for MiSeq assemblies (purple) using MIRA v3.0.0 and Ray v1.3.0 (green) compared to Ion Torrent assemblies (blue) using MIRA v3.2 and Ray v1.3.0 (grey). B. N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly.

References

1. http://www.chevreux.org/projects_mira.html
2. http://sourceforge.net/projects/denovoassembler/files
3. http://pathogenomics.bham.ac.uk/blog/2011/05/first-look-at-ion-torrent-data-de-novo-assembly

Successful de novo assembly

• Success is a factor of:

• Genome size, genomic repeats(!), ploidy

• High coverage, long read lengths, PE/MP libraries

Repeats in E. coli

Improving de novo assemblies

• Paired end & Mate pair for long range continuity

• Hybrid approaches (combine Illumina with PacBio/Oxford Nanopore)

• Synthetic long reads: Illumina Synthetic Reads (Moleculo) or 10X Genomics

• Hi-C contact maps


Two bacterial genomes de Bruijn graphs

[...] sequence length from a reference assembly. The read lengths need only be over the k-mer length to generate a reasonable assembly (in theory, k must be over 15 bp, though in practice 19 is the lowest sensible k-mer, and larger k-mers are always better, although at the expense of having to generate more coverage to support these large k-mer sizes).

The first assembler to exploit this technology was Roche’s 454 assembler, Newbler, which adapted the scheme specifically to handle the main source of error in 454 sequencing, namely ambiguity in the length of homopolymer runs. In late 2007 and early 2008, several second-generation de Bruijn graph assemblers were released for very short reads, compatible with the Solexa technology, including SHARCGS [27], VCAKE [28], VELVET [29], EULER-SR [30], EDENA [31], ABySS [32] and ALLPATHS [33]. Some of these methods, such as VELVET, EULER-SR and ABySS, explicitly use de Bruijn graphs, whereas other methods implicitly explore a de Bruijn graph, for example constrained by read-pair behavior, as in ALLPATHS. The methods differ in how they treat errors and to what extent they use read-pair information. Read pairs are defined as two short DNA sequence reads generated from different ends of a longer DNA molecule, for example 35-bp reads generated from both ends of a 500 bp fragment. One does not know the identity of the sequence between the read pairs, but one usually has an estimate of the length of the intervening sequence. As it is only marginally more expensive to generate short reads in read-pair format than as single reads, extremely high coverage of read pairs is routinely available. The more advanced de Bruijn graph assemblers [29, 30, 32, 33] can use read pairs to provide long assemblies. A particular challenge has been the two-base-encoding ‘color space’ of ABI SOLiD technology. In this two-base encoding, a single error produces a systematic translation error on all subsequent decoding of the bases for the rest of the read. In the context of an alignment, such an encoding scheme can be integrated into the alignment routine, and there is an argument that the double base encoding provides better discrimination between errors and observed differences. In de novo assembly, however, there is no reference. The solution has been to perform the assembly directly in color space and then ‘key’ the resulting color space assembly to one of the four feasible base-pair assemblies using either a small amount of traditional sequence or the presence of a known base at the start of each SOLiD read.

Whichever sequencing technology and assembly method are used, the ability to provide long assemblies critically requires that at least a proportion of the read pairs are longer than the longest common near-identical repeat in the genome. This varies considerably between genomes. Bacterial genomes often have only a handful of near-identical repeats longer than 200 bp (Fig. 3), whereas complex genomes, such as the human, usually have their repeat length determined by whether there has been an active LINE or SINE transposable element (usually around 4 kb in length for the former and between 500 bp and 1 kb for the latter). As the ability to produce longer read pairs (also referred to as ‘mate pairs’ to distinguish them from the shorter read pairs) has only recently been optimized for next-generation technologies, assemblies of complex genomes have been rare.

The other main barrier for large, complex genome assemblies is the memory overhead for these methods. Although the de Bruijn data structure is compressed, all the methods use some sort of adjunct data structures in addition to the core de Bruijn graph to map the reads to the graph. These adjunct structures are critical for leveraging additional information required for accurate assemblies, such as read pair information.

[Figure 3 panel labels: (a) 1. Sequencing (for example, Solexa or 454); 2. Hashing; 3. Simplification of linear stretches; 4. Error (tip and bubble) removal. Marked features: linear stretches, tips, bubble. (b) Full graph]

Figure 3 | Constructing and visualizing a de Bruijn graph of a DNA sequence. (a) An example de Bruijn graph assembly for a short genomic sequence without polymorphism. Sequence at top represents the genome, which is then sampled using shotgun sequencing in base space with 7-bp reads (step 1). Some of the reads have errors (red). In step 2, the k-mers in the reads (4-mers in this example) are collected into nodes and the coverage at each node is recorded. There are continuous linear stretches within the graph, and the sequencing errors create distinctive, low-coverage features throughout the graph. In step 3, the graph is simplified to combine nodes that are associated with the continuous linear stretches into single, larger nodes of various k-mer sizes. In step 4, error correction removes the tips and bubbles that result from sequencing errors and creates a final graph structure that accurately and completely describes the original genome sequence. (b) A full de Bruijn graph from a bacterial genome that shows the general lack of repetitive structure within the entire genome.

Figure 2.9: Graphical representation of the de Bruijn graph of the Streptococcus suis genome

In this representation, node sequences are represented as curves, which connect at their tips.

2.7.2 Local reference based visualisation

It quickly became clear that troubleshooting some of the algorithms described in the following chapters would require some adequate visualisation techniques. Given that most tests were done on species with a known reference, it was possible to use this sequence as a guide to the graph’s complexity.

The first solution consisted in following the path of the reference sequence through the graph, recording the properties of the nodes being traversed. Figure 2.10 represents the length and multiplicity of successive nodes on the reference path.

In this diagram, long contigs are interrupted by two types of breaks. Sometimes, two long contigs are separated by a very short, isolated fea- [...]

Flicek & Birney, Nat. Methods 2009; Zerbino, 2009

Few repeats / “more” repeats

N50: Assembly quality

• Calculate the total length of the assembly

• Order contigs by size

• Sum contigs starting with the largest

• When half the total is reached, N50 is the length of the contig that was added last

N50: What is the smallest piece in the largest half of the assembly?

N50 example

5 scaffolds, calculate N50:

200kb

180kb

150kb

125kb

35kb

Sum: 200 + 180 + 150 + 125 + 35 = 690kb

Half: 690 / 2 = 345kb

Start adding:

200kb < 345kb

200kb + 180kb = 380kb

380kb > 345kb => N50: 180kb
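The procedure can be written as a short function. This is a sketch of the definition given above (sort by size, add from the largest, stop once half the total is reached); the function name `n50` is chosen for illustration, and tools such as QUAST report this statistic directly.

```python
def n50(lengths):
    """Length of the contig at which the running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    raise ValueError("no contig lengths given")


scaffolds_kb = [200, 180, 150, 125, 35]  # the five scaffolds from the example
print(n50(scaffolds_kb))
```

The same function works for contigs or scaffolds, as long as all lengths are given in the same unit.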


Some assemblers

• OLC: Newbler (the only one I really use)

• de Bruijn: ALLPATHS-LG, SPAdes, Velvet, SOAPdenovo, MEGAHIT

• other: MIRA, SGA
