de novo genome assembly - imb winter school - 7 july 2015

De novo genome assembly A/Prof Torsten Seemann IMB Winter School – Brisbane, Australia – Tue 7 July 2015

Upload: torsten-seemann

Post on 07-Aug-2015



Introduction

Ideal world

I would not need to give this talk!

AGTCTAGGATTCGCTACAGATTCAGGCTCTGAAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATAT

Human DNA iSequencer™

46 complete haplotype chromosome sequences

Real world

• Can’t sequence full-length native DNA
– no instrument yet

• But we can sequence short fragments
– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

Instinctively like a jigsaw puzzle
– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)

An example

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them tomorrow!

Shakespearomics

• Reads
– ds, Romans, count
– ns, countrymen, le
– Friends, Rom
– send me your ears;
– crymen, lend me

Whoops! I dropped them.

Shakespearomics

• Overlaps
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

I am good with words.

Shakespearomics

• Majority consensus
Friends, Romans, countrymen, lend me your ears;

We have a consensus!
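The majority-consensus step can be sketched in a few lines of Python. This is only an illustration: the offsets in `layout` were worked out by hand from the overlaps (they are an assumption, not something an assembler is given), and `majority_consensus` is a hypothetical helper name.

```python
from collections import Counter

def majority_consensus(placed_reads):
    """Vote column by column over reads already laid out as (offset, read) pairs."""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, ch in enumerate(read):
            columns[offset + i][ch] += 1
    # Take the most frequent character in every column
    return "".join(col.most_common(1)[0][0] for col in columns)

# Hand-derived layout of the five reads (offsets are assumed for illustration)
layout = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),     # read error: "c" instead of "t"
    (29, "send me your ears;"),  # read error: "s" instead of "l"
]
print(majority_consensus(layout))
```

The two erroneous characters are each outvoted two to one by the other reads covering the same position, which is why depth of coverage matters.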

Algorithms

Approaches

• greedy assembly

• seed and extend

• de Bruijn graphs

• string graphs

… all essentially doing the same thing, but taking different short cuts.
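As one concrete instance of the approaches listed, here is a minimal sketch of building a de Bruijn graph from reads. The function name `de_bruijn`, the toy reads and the choice of k = 4 are all assumptions for illustration; real assemblers also handle reverse complements, k-mer counts and errors.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy reads, not real data
graph = de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Note that no read-against-read comparison happens here: overlaps are implied by shared (k-1)-mers, which is the short cut this approach takes.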

Clean graph

Overlap - Layout - Consensus

Find read overlaps

• If we have N reads of length L
– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!

• What counts as “overlapping”?
– minimum overlap length, e.g. 20 bp
– minimum % identity across overlap, e.g. 95%
– choice depends on L and expected error rate
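The two thresholds above can be sketched as a suffix-prefix test. This is a simplification under stated assumptions: `best_overlap` is a hypothetical helper, and it compares characters position by position (mismatches only, no gaps), whereas a real overlapper would use a proper alignment that tolerates insertions and deletions.

```python
def best_overlap(a, b, min_len=20, min_identity=0.95):
    """Longest suffix of `a` matching a prefix of `b` (ungapped); 0 if none qualifies."""
    min_len = max(min_len, 1)  # guard against a zero-length comparison
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        if matches / length >= min_identity:
            return length
    return 0

# Two of the Shakespearomics reads; the second carries one read error
a, b = "ns, countrymen, le", "crymen, lend me"
print(best_overlap(a, b, min_len=5, min_identity=0.8))   # tolerant threshold
print(best_overlap(a, b, min_len=5, min_identity=0.95))  # strict threshold
```

With the tolerant threshold the 10-character overlap is accepted at 90% identity; with the strict one the same pair is rejected, which shows why the identity cut-off has to reflect the expected error rate.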

Build an overlap graph
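A minimal sketch of an overlap graph, assuming error-free, distinct reads and exact suffix-prefix matches. The names `exact_overlap` and `overlap_graph` and the toy reads are illustrative only; the all-pairs loop is the naive O(N²) comparison that real tools avoid with indexing tricks.

```python
from itertools import permutations

def exact_overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph: edges[a][b] = overlap length where a's suffix matches b's prefix."""
    edges = {read: {} for read in reads}
    for a, b in permutations(reads, 2):  # every ordered pair of reads
        length = exact_overlap(a, b, min_len)
        if length:
            edges[a][b] = length
    return edges

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy reads, not real data
for read, targets in overlap_graph(reads).items():
    print(read, "->", targets)
```

Each read is a node and each qualifying overlap is a directed edge, so a clean stretch of genome shows up as a simple chain.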

What ruins the graph?

• Read errors
– introduce false edges and nodes

• Heterozygosity
– causes lots of detours from homozygous areas

• Repeats
– cause nodes to be shared, locality confusion

Graph simplification

• Squash bubbles
– collapse small errors (or minor heterozygosity)

• Remove spurs
– short “dead end” hairs on the graph

• Join unambiguously connected nodes
– reliable stretches of unique DNA
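The last simplification step, joining unambiguously connected nodes, can be sketched as follows. This assumes the graph is given as a plain dict of node to successor set; `unambiguous_paths` is a hypothetical name, and isolated cycles are deliberately not handled in this sketch.

```python
def unambiguous_paths(graph):
    """Collapse chains where each link is the only exit of one node and the
    only entry of the next. `graph` maps node -> set of successor nodes."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    succ = {n: set(graph.get(n, ())) for n in nodes}
    preds = {n: set() for n in nodes}
    for node in nodes:
        for s in succ[node]:
            preds[s].add(node)

    def joins_previous(n):
        # n has exactly one predecessor, and that predecessor has no other exit
        return len(preds[n]) == 1 and len(succ[next(iter(preds[n]))]) == 1

    paths = []
    for start in sorted(nodes):
        if joins_previous(start):
            continue  # absorbed into a path that begins earlier
        path = [start]
        while len(succ[path[-1]]) == 1:
            nxt = next(iter(succ[path[-1]]))
            if len(preds[nxt]) != 1 or nxt in path:
                break  # stop at a merge point or a loop back
            path.append(nxt)
        paths.append(path)
    return paths

toy = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": {"F"}, "F": set()}
print(unambiguous_paths(toy))
```

In the toy graph the chain A, B, C collapses into one node, and the fork after C is where it has to stop.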

Graph traversal

• For each unconnected subgraph
– at least one per replicon in original sample

• Find a path which visits each node once
– finding a Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points

• Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig

Contigs

Contiguous, unambiguous stretches of assembled DNA sequence

• Contig ends correspond to
– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)

Repeats

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

• Very common
– Transposons (self-replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)

Assembling repeats

Reads longer than the repeat

Heterozygosity

Long reads untangle graphs

The law of repeats

• It is impossible to resolve repeats of length S unless you have reads longer than S.
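A small experiment makes this law tangible. The sketch below builds two different toy genomes that share a three-copy repeat and compares the full set of error-free reads each would yield. The repeat, the flanking sequences and the helper `read_spectrum` are all made up for illustration.

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Multiset of every substring of length `read_len` (idealised error-free reads)."""
    return Counter(genome[i:i + read_len] for i in range(len(genome) - read_len + 1))

repeat = "GATTACA"                       # hypothetical repeat, S = 7
a, b, c, d = "AAC", "CCG", "TTG", "TCC"  # hypothetical unique segments
genome_1 = a + repeat + b + repeat + c + repeat + d
genome_2 = a + repeat + c + repeat + b + repeat + d  # b and c swapped

for read_len in (len(repeat) + 1, len(repeat) + 2):
    same = read_spectrum(genome_1, read_len) == read_spectrum(genome_2, read_len)
    print(read_len, "indistinguishable" if same else "distinguishable")
```

With reads of length S + 1 both genomes produce exactly the same reads, so no assembler could tell them apart. Only when a read spans the whole repeat plus at least one unique base on each side (S + 2 here) do the two read sets differ.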


Scaffolding

Beyond contigs

Contig sizes are limited by:

• the length of repeats in your genome
– can’t change this!

• the length (or “span”) of the reads
– wait for new technology
– use “tricks” with existing technology


Paired reads

• DNA fragment (200-800 bp)
==============================

• Single end
-------->=====================

• Paired end (up to 800 bp span)
----->==================<-----

• Mate pair (up to 40 kbp span)
---->========/+/=========<----

Contigs to scaffolds

[Figure: contigs joined into a scaffold by paired-end and mate-pair reads, with gaps between the contigs]
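How a read pair turns into a gap size can be sketched with simple arithmetic: whatever part of the known fragment span is not accounted for on either contig must lie in the gap. The function `estimate_gap`, its coordinate convention and the numbers are assumptions for illustration; real scaffolders model the spread of fragment sizes rather than using a single value.

```python
from statistics import median

def estimate_gap(links, contig_a_len, insert_size):
    """Estimate the gap between contig A and the following contig B.

    links: one (pos_on_a, end_on_b) tuple per read pair bridging A -> B
      pos_on_a: 0-based start of the forward read on A
      end_on_b: 0-based exclusive end of the reverse read on B
    insert_size: assumed outer span of the sequenced fragment
    """
    estimates = [
        insert_size - (contig_a_len - pos_on_a) - end_on_b
        for pos_on_a, end_on_b in links
    ]
    return median(estimates)  # median is robust to a few bad pairs

# Three hypothetical pairs from a 500 bp library bridging a 1000 bp contig
links = [(800, 150), (900, 250), (850, 190)]
print(estimate_gap(links, contig_a_len=1000, insert_size=500))
```

A negative estimate would suggest the two contigs actually overlap rather than being separated by a gap.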

Assessment

Assessing assemblies

• We desire
– Total length similar to genome size
– Fewer, larger contigs
– No mistakes (mis-assemblies)

• Metrics
– No generally useful objective measure
– Longest, total bp, genes recovered, …

The “N50”

The length of the contig at which the running total, summed from the longest contig down, first reaches 50% of all assembled bases

• Imagine we got 7 contigs with lengths:
– 1, 1, 3, 5, 8, 12, 20

• Total
– 1+1+3+5+8+12+20 = 50

• N50 is the contig at the “halfway sum” = 25
– 20+12 = 32 (≥ 25) so N50 is 12
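The calculation can be written as a short function, assuming the common convention of summing contigs from the longest down until half the total is reached. The name `n50` is just a label for this sketch.

```python
def n50(lengths):
    """Length of the contig at which the running total, longest first,
    first reaches half of all assembled bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

print(n50([1, 1, 3, 5, 8, 12, 20]))  # 12
```

Because only the sorted lengths matter, the statistic says nothing about whether the contigs are correct.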

N50 concerns

• Optimizing for N50
– encourages mis-assemblies!

• An aggressive assembler may over-join:
– 1, 1, 3, 5, 8, 12, 20 (previous)
– 1, 1, 3, 5, 20, 20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)

• N50 is the “halfway sum” (still 25)
– 20+20 = 40 (≥ 25) so N50 is 20 (was 12)

Validation

• Self consistency
– Align reads back to contigs
– Check for errors or discordant pairs

• Second opinion
– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome-wide “optical map”

Software

Genome assembly software

Considerations

• Size of genome
– virus, bacteria, eukaryote, meta-genome

• Hardware
– phone, laptop, desktop, server, cloud
– RAM is more limiting than CPU

• Operating system
– Linux, Mac, Windows

• Software budget
– commercial, free, open-source

Recommendations

• SPAdes
– Unix command-line (Mac, Linux)

• VAGUE (Velvet)
– Unix GUI (Mac, Linux)

• CLC Genomics Workbench
– Java GUI (Windows, Mac, Linux)
– Commercial product

Online tutorial

• The GVL
– Genomics Virtual Laboratory
– http://genome.edu.au

• Protocols
– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VLSCI)
– https://genome.edu.au/wiki/Protocols

Contact

Web tseemann.github.io

Twitter @torstenseemann

Slides slideshare.net/torstenseemann

Blog TheGenomeFactory.blogspot.com