
De-novo Assembly

Day 4

The Challenge

• Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome

• Issues:

  – Coverage

  – Errors in reads

  – Reads vary from very short (35bp) to quite long (800bp), and are double-stranded

  – Non-uniqueness of solution

  – Running time and memory

In an ideal world…

• Reads have no error

• Reads are long enough and each appears once

• Each read given in the same orientation (all 5’ to 3’, for example)

• Contiguous reads have enough overlap so that they can easily be assembled

Shotgun DNA Sequencing

[Figure: the DNA target sample is sheared and size-selected; end reads (mate pairs) are taken from fragments of 550bp and 10,000bp]

• Not all sequencing technologies produce mate-pairs

• Different error models

• Different read lengths

Terminology

• Reads are what you start with (35bp-800bp)

• Fragmented assemblies produce contigs

• Contigs can be put together into scaffolds

Reference-based vs de novo assembly

• Comparative assembly means you have a “reference” genome

  – Same OR related species

  – Can get weird in bacteria and viruses due to recombination

• De novo assembly means you do everything from scratch

Comparative Assembly

Much easier than de novo!

Basic idea:

• Take the reads and map them onto the reference (allowing for small mismatches)

• Collect all overlapping reads, produce a multiple sequence alignment, and produce a consensus sequence
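The two steps above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how production mappers work: reads are placed by brute-force Hamming distance (substitutions only, no indels, forward strand only), a read with several equally good placements is discarded, and the consensus is a per-position majority vote rather than a true multiple sequence alignment. The function names and the `max_mismatches` parameter are made up for this illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return the unique best offset of read on reference, or None.

    None means the read has no placement within max_mismatches,
    or more than one equally good placement (ambiguous).
    """
    best, best_pos, ties = None, None, 0
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d > max_mismatches:
            continue
        if best is None or d < best:
            best, best_pos, ties = d, pos, 1
        elif d == best:
            ties += 1
    return best_pos if ties == 1 else None


def consensus(reads, reference, max_mismatches=1):
    """Pile mapped reads onto the reference and take a majority vote per position."""
    piles = [Counter() for _ in reference]
    for read in reads:
        pos = map_read(read, reference, max_mismatches)
        if pos is None:
            continue  # unmapped or ambiguous read
        for i, base in enumerate(read):
            piles[pos + i][base] += 1
    # Positions with no coverage are reported as N
    return "".join(p.most_common(1)[0][0] if p else "N" for p in piles)


reference = "ACGTTGCAAT"
# Reads sampled from a related genome that differs at one position (T -> A)
reads = ["ACGTAG", "CGTAGC", "GTAGCA", "TAGCAA", "AGCAAT"]
print(consensus(reads, reference))
```

In this example every read maps with one mismatch, and the consensus recovers the sample sequence rather than the reference.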

Comparative Assembly

• Fast

• Short reads can map to several places (especially if they have errors)

• Needs close reference genome

• Repeats are problematic

• Can be highly accurate even when reads have errors

De Novo assembly

• Much easier to do with long reads

• Need very good coverage

• Generally produces fragmented assemblies

• Necessary when you don’t have a closely related (and correctly assembled) reference genome

Conceptual steps in de novo assembly

1. Find reads that overlap by a specified number of bases (the k-mer size)

2. Merge overlapping, “good” reads into longer contigs

3. Link contigs to form scaffolds using paired-end information
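Steps 1 and 2 can be sketched as a greedy overlap merge in Python. This is only an illustration under simplifying assumptions: reads are error-free, all in the same orientation, and overlaps must match exactly. The `min_len` parameter is a hypothetical name for the minimum overlap, and step 3 (scaffolding with paired-end information) is not covered.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        # All-pairs overlap computation: quadratic in the number of sequences
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ACATAGG", "TAGGATT", "GATTCAC"]
print(greedy_assemble(reads, min_len=3))
```

With a larger `min_len` than the reads actually share, the merge stops early and several contigs are returned, which is one way a fragmented assembly arises.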

Diagrams from Serafim Batzoglou, Stanford


De Novo Assembly paradigms

• Overlap-layout-consensus methods

• k-mer graph (especially useful for assembly from short reads)

  – aka de Bruijn graph

A de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding an Eulerian path through the graph

Diagram from Michael Schatz, Cold Spring Harbor

de Bruijn graph

• Vertices are k-mers that appear in some read, and edges are defined by an overlap of k-1 nucleotides

• Small values of k produce small graphs

• Does not require all-pairs overlap calculation!

• But: loss of information about reads can lead to “chimeric” contigs, and incorrect assemblies

• Also produces fragmented assemblies (even shorter contigs)

de Bruijn Graph use

• Create the de Bruijn graph for the following string, using k=5

  – ACATAGGATTCAC

• Find the Eulerian path

• Is the Eulerian path unique?

• Reconstruct the sequence from this path
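One way to check a hand-worked answer is the Python sketch below. It follows the convention used on these slides (vertices are k-mers, with an edge between two k-mers that sit next to each other in a read and so overlap by k-1 nucleotides) and finds the path with Hierholzer's algorithm. The function names are made up for this illustration, and the code ignores reverse complements and read errors.

```python
def kmers_of(seq, k):
    """All k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Vertices are k-mers; an edge joins k-mers that are adjacent in some read."""
    graph = {}
    for read in reads:
        kms = kmers_of(read, k)
        for km in kms:
            graph.setdefault(km, [])
        for left, right in zip(kms, kms[1:]):
            if right not in graph[left]:
                graph[left].append(right)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; raises ValueError if no Eulerian path exists."""
    n_edges = sum(len(ws) for ws in graph.values())
    in_deg = {v: 0 for v in graph}
    for ws in graph.values():
        for w in ws:
            in_deg[w] += 1
    # Start at a vertex with one more outgoing than incoming edge, if any
    starts = [v for v in graph if len(graph[v]) - in_deg[v] == 1]
    start = starts[0] if starts else next(iter(graph))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Valid only if every edge was used and each step follows a real edge
    uses_all = len(path) == n_edges + 1
    connected = all(b in graph[a] for a, b in zip(path, path[1:]))
    if not (uses_all and connected):
        raise ValueError("graph has no Eulerian path")
    return path


def spell(path):
    """Reconstruct the sequence: first k-mer plus the last base of each later k-mer."""
    return path[0] + "".join(km[-1] for km in path[1:])


graph = build_graph(["ACATAGGATTCAC"], k=5)
path = eulerian_path(graph)
print(path)
print(spell(path))
```

For this string all nine 5-mers are distinct and every vertex has at most one outgoing edge, so the graph is a single chain and the Eulerian path is unique. A string with a repeat longer than k-1 would give a branching vertex, and uniqueness would no longer be guaranteed.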

But…

• Because of

  – Errors in reads

  – Repeats

  – Insufficient coverage

  de Bruijn graphs generally don’t have Eulerian paths/circuits

• This means the first step doesn’t completely assemble the genome

Velvet and SOAPdenovo are two leading assemblers

• Velvet is from EMBL-EBI

  – http://www.ebi.ac.uk/~zerbino/velvet

• SOAPdenovo is from BGI

  – http://soap.genomics.org.cn/soapdenovo.html

• k-mer size is an adjustable parameter

  – Typically it is adjusted to maximize the N50 length of scaffolds or contigs

  – N50 length is a central measure of the length distribution, weighted by length

NOTE: QC step slightly different

1. You still do standard checks with something like FASTQC

2. Read trimming is different

  – Algorithms/tools can deal with different read lengths

  – Finding overlaps gives longer contigs

  – So we never want to sacrifice good reads

3. Solution:

  1. Remove sequencing adapters

  2. Trim individual reads as needed

• http://www.usadellab.org/cms/index.php?page=trimmomatic
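To make the two-part solution concrete, here is a minimal Python sketch of what adapter removal and per-read trimming do to a single read. It is not Trimmomatic's algorithm: it assumes an exact adapter match, Phred+33 quality encoding, and a simple rule that drops low-quality bases from the 3' end only. The function name, the adapter string, and the `min_q` threshold are all illustrative assumptions.

```python
def trim_read(seq, qual, adapter, min_q=20, offset=33):
    """Cut a read at the adapter, then trim low-quality bases from the 3' end.

    seq and qual are the sequence and quality strings of one FASTQ record.
    """
    # 1. Remove the sequencing adapter (exact match only, in this sketch)
    pos = seq.find(adapter)
    if pos != -1:
        seq, qual = seq[:pos], qual[:pos]
    # 2. Trim this individual read as needed: drop trailing low-quality bases
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33
seq = "ACGTACGT" + "AGATCGG"
qual = "IIIIII##" + "IIIIIII"
print(trim_read(seq, qual, adapter="AGATCGG"))
```

Each read is shortened only as far as its own adapter and quality require, so reads end up with different lengths and good bases are not thrown away.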

Three useful measures for optimization of k-mer length

• Mean length: the usual average of the lengths

• Median length: the length for which half the sequences are shorter & half are longer

• N50 length: the length that splits the total bases in half, after the lengths are ordered

• Example values for a distribution of contig lengths:

  – Mean length: 627

  – Median length: 200

  – N50 length: 1,718

• We’ll look at using N50 in the practical
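The three measures can be computed with a short Python sketch. It takes N50 to be the length L such that sequences of length at least L together hold at least half of all bases, matching the definition above. The function name `n50` and the toy lengths are made up for illustration.

```python
from statistics import mean, median


def n50(lengths):
    """Length at which the running total of bases, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy data, 32 bases in total
print("mean:", mean(contig_lengths))
print("median:", median(contig_lengths))
print("N50:", n50(contig_lengths))
```

Here the two longest contigs already hold half of the bases, so N50 (8) sits well above the mean (4) and the median (3), the same pattern as in the example values above.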

Practical prep

• Download and install

  – TRIMMOMATIC: http://www.usadellab.org/cms/index.php?page=trimmomatic

  – VELVET: http://www.ebi.ac.uk/~zerbino/velvet/

    • Use make 'MAXKMERLENGTH=70' when compiling

• Grab practical dataset from course page


