towards your own genome. designing your sequencing run sequencing strategy genome size and genome

Towards your own genome

Designing your Sequencing Runhttps://genohub.com/next-generation-sequencing-guide/

Sequencing strategy

Genome size and genome complexity?!related organism, PFGE, flow cytometry

Noncoding DNA in genomes

Repetitive DNA in the human genome

Sequencing strategy

Template and Library prep: Fragment (SE),Paired-end (PE)or Mate pair (MP)BAC clones, fosmids....

Sequencing Platform

MethodSingle-molecule real-

time sequencing (Pacific Bio)

Ion semiconductor (Ion Torrent sequencing)

Pyrosequencing (454)

Sequencing by synthesis (Illumina)

Sequencing by ligation (SOLiD sequencing)

Chain termination (Sanger

sequencing)Read length 2900 bp average[38] 200 bp 700 bp 50 to 250 bp 50+35 or 50+50 bp 400 to 900 bp

Accuracy 87% - 99% 98% 99.9% 98% 99.9% 99.9%

Reads per run 35-75 thousand up to 5 million 1 million up to 3 billion 1.2 to 1.4 billion N/A

Time per run 30 minutes to 2 hours 2 hours 24 hours1 to 10 days,

depending upon sequencer

1 to 2 weeks 20 minutes to 3 hours

Cost per 1 mil. bases $2 $1 $10 $0.05 to $0.15 $0.13 $2400

AdvantagesLongest read length. Fast. Detects 4mC,

5mC, 6mA.[41]

Less expensive equipment. Fast.

Long read size. Fast.

Potential for high sequence yield, depending upon sequencer model

and desired application.

Low cost per base.Long individual

reads. Useful for many

applications.

DisadvantagesLow yield at high

accuracy. Equipment can be very expensive

Homopolymer errors.Runs are

expensive. Homopolymer

errors.

Short reads. Slower than other methods.

More expensive and impractical

for larger sequencing

projects.

Genome sequencing: Comparison of NGS methods

Instrument Application: de novo assemblies

BACs, plastids, & microbial genomes Transcriptome Plant & animal genome454 – GS Jr. B – good but expensive C – need multiple runs, expensive D – cost prohibitive

454 – FLX+ A – good, need to multiplex to be economicalB – good but expensive, libraries usually normalized, not best for short RNAs

C – OK as part of a mixed platform strategy, prohibitive to use alone

MiSeq – v2 A – good, need to multiplex for best economicsA/B –expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly

B – expensive relative to HiSeq, but additional read length can be valuable

HiSeq 2000/2500, standard run

B/C – more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq

A – good, assembly more challenging than 454 but much more data available for analyses

A – primary data type in many current projects; requires mate-pair libraries

HiSeq 2500, rapid run (projected)

B – more data than needed unless highly indexed; assembly more challenging than 454

A – good, assembly more challenging than 454 but much more data available for analyses

A – will probably be more expensive than HiSeq2000, but increased read length may be worth it

Ion Torrent – 314 B/C – OK, lowest experimental cost but reads are shorter & more expensive than Illumina

C – OK, but reads are shorter & more expensive than Illumina

D – cost prohibitive, reads shorter than alternatives

Ion Torrent – 318 B/A – good, less data than MiSeqB/A – good, less data than MiSeq, reads similar to 454 titanium but less expensive

C – high cost relative to Proton or Illumina, more economical than 454 for mixed platform strategy

Ion Torrent Proton IB – more data than needed unless indexed; assembly more challenging than 454 or Illumina

B/A – assembly currently more challenging than Illumina or 454

B – expensive relative to HiSeq or Proton II/III

Ion Torrent Proton II (projected)

B/C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina

B/A – assembly currently more challenging than Illumina or 454 A/B – should be similar to HiSeq

Ion Torrent Proton III (forecast) C – more data than needed unless highly indexed B/A – need assembly pipelines A – cost per MB could make it the

best

SOLiD – 5500 C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina

C/D – short reads make assembly challenging or impossible

C/D – short reads make assembly challenging or impossible

PacBio – RSB – good for hybrid assemblies; not economical for solo assemblies – requires high coverage due to high error rates

B/D – good for hybrid assemblies; too expensive for solo use; short RNA is challenging

B/D – good for hybrid assemblies & scaffolding (mixed platform strategy); cost prohibitive for solo use

Platform – instrument Application: resequencing

Targeted loci Transcript counting Genome resequencing

454 – GS Jr. B/C – good but expensive, need to limit loci D – cost prohibitive D – cost prohibitive for large

genomes

454 – FLX+ B – good but expensive, should limit loci D – cost prohibitive D – cost prohibitive for large genomes

MiSeq A/B – good, fewer and higher cost reads than HiSeq

B – more expensive than HiSeq or SOLiD or ProtonII+ B/C – expensive for large genomes

HiSeq 2000/2500 – standard run

A – primary data type in many current projects; best for many loci

A – primary data type in many current projects

A – primary data type in many current projects

HiSeq 2500 – rapid run (projected) A – faster path to leading data type

A/B – likely to be slightly more expensive than with standard flow cell

A – faster path to leading data type

Ion Torrent – 314 C – OK but expensive, need to limit loci D – cost prohibitive D – cost prohibitive

Ion Torrent – 318 B – good, slightly less data per run than MiSeq

B/C – more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile

C – expensive for large genomes

Ion Torrent Proton I A/B – similar to MiSeq, but different error profile will inhibit switching

B – more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina)

B – expensive relative to HiSeq or Proton II+

Ion Torrent Proton II (projected)

A/B – similar to HiSeq, but different error profile will inhibit switching

A/B – new informatics pipelines needed

A – supposed to set new pricing standard, could become leading shorter-read platform

Ion Torrent Proton III (forecast)

A/B – costs projected to be better than HiSeq; error profile different than Illumina

A/B – new informatics pipelines needed

A – supposed to set new pricing standard, could become leading shorter-read platform

SOLiD – 5500xl B – harder to assemble than Illumina A/B – used much less than HiSeq A/B – used much less than HiSeq

PacBio – RS C/D – expensive but can sequence difficult regions D – cost prohibitive C/D – cost prohibitive except for

strutural variants

Bacterial genomes

Noncoding DNA in genomes

Bacterial genomes

Complex Bacterial Genomes

Fosmid and plasmid library; Sanger

Simplified Bacterial Genomes

MDA for 16h on one lysed cell3kb Sanger libraries plus 45415 gaps (chimeric clones) Sanger finishingPolishing by Illumina reads37 regions Sanger polishing

454 (average read length 225bp)Illumina (33bp)

Bacterial genomes

Eukaryotic Genomes

Eukaryotic Genomes: Fish genomes

Template: A female fish was chosen because of its XX sex chromosome constitution

Roche 454 Titanium (3 and 20kb libraries)Illumina PE insert size 200bp and 75 bp readsphysical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10× genome coverage with a total of 43,192 clones available)

Bird genomes

Mammalian genomes

HiSeq2000

DNA isolated from blood

Extremelly large genomes

loblolly pine (Pinus taeda)

The largest genome assembled to date

DNA template:a single megagametophyte, the haploid tissue of a single pine seed – quantitylong-fragment mate pair libraries from the parental diploid DNA

Novel fosmid DiTag libraries

N50 scaffold size of 66.9 kbp

Raw Data Trimming and Filtering

Quality score

Raw Data Trimming and Filtering

Assembly

N50N75

ContigsScaffolds

Assembly: K-mer

A common sequence shared by pairs of reads

Assembly: K-mer

Assembly

Assembly – algorithms

Repeats!

OLC Overlap/Layout/Consensus

Overlap: Overlap discovery all-against-all, seed & extend heuristic algorithm;K-mers as alignment seeds-sensitivity

Layout: Construction and manipulation of an overlap graph leads to an approximate read layout

Consensus: Multiple sequence alignment (MSA) determines the precise layout and then the consensus sequence.

Loading base calls-computer memory

Assembly vs Repetitive DNA

Assembly vs Repetitive DNA and Coverage

Why is coverage important?resolutionrepeat discovery, copy number estimationbinning of metagenomic data

Why is GC important?affecting coverageHGT discoverybinning of metagenomic data

Assembly vs GC content

both GC-rich fragments and AT-rich fragments are underrepresented in the Illumina sequencing results

Assembly vs GC content

Less even coverage with Illumina

Velvet and Velvet OptimizerNewblerCeleraMaSuRCA

Assembling algorithms and Scaffolders

http://en.wikipedia.org/wiki/Sequence_assembly

http://en.wikipedia.org/wiki/Sequence_assembly

Assembling algorithms and Scaffolders

Annotation

Ready for Annotation?

Checking gene coverage:UCOs - Ultra Conserved Orthologs (Kozik et al., 2007)

CEGMA - Core Eukaryotic Genes Mapping Approach (Parra et al., 2007)

SICO - genes Single Copy genes Proteobacteria (Lerat et al., 2003)

Median gene length roughly proportional to genome size

Percent gaps:library insert size vs. 50 “N”s

Sanger

454Illumina

Ready for Annotation?UCOs

Annotation of Prokaryotic Genomes

Automated pipelines and annotation softwares:RASTBASysSOPPROKKAIMG ER

Gene prediction:GLIMMERProdigal Prokaryotic Dynamic Programming Genefinding Algorithm

Annotation of Prokaryotic Genomes

Repeated errors

Inconsistent gene names

Additional data and postgenomic experiments

Annotation of Eukaryotic Genomes

Standard draft assembly

High quality draft assembly

Two phases1. computation phase

repeat masking (homopolymers, transposable elements)evidence alignment (proteins, ESTs, RNA-seq data aligned)ab initio gene prediction vs Evidence driven gene prediction

2. annotation phasefinding a consensus


Gene prediction and gene annotation are not synonyms!Predictors do not report untranslated regions (UTRs) or splice variants

towards your own genome. designing your sequencing run sequencing strategy genome size and genome

Documents