towards your own genome. designing your sequencing run sequencing strategy genome size and genome
TRANSCRIPT
Towards your own genome
Designing your Sequencing Runhttps://genohub.com/next-generation-sequencing-guide/
Sequencing strategy
Genome size and genome complexity?!related organism, PFGE, flow cytometry
Noncoding DNA in genomes
Repetitive DNA in the human genome
Sequencing strategy
Template and Library prep: Fragment (SE),Paired-end (PE)or Mate pair (MP)BAC clones, fosmids....
Sequencing Platform
MethodSingle-molecule real-
time sequencing (Pacific Bio)
Ion semiconductor (Ion Torrent sequencing)
Pyrosequencing (454)
Sequencing by synthesis (Illumina)
Sequencing by ligation (SOLiD sequencing)
Chain termination (Sanger
sequencing)Read length 2900 bp average[38] 200 bp 700 bp 50 to 250 bp 50+35 or 50+50 bp 400 to 900 bp
Accuracy 87% - 99% 98% 99.9% 98% 99.9% 99.9%
Reads per run 35-75 thousand up to 5 million 1 million up to 3 billion 1.2 to 1.4 billion N/A
Time per run 30 minutes to 2 hours 2 hours 24 hours1 to 10 days,
depending upon sequencer
1 to 2 weeks 20 minutes to 3 hours
Cost per 1 mil. bases $2 $1 $10 $0.05 to $0.15 $0.13 $2400
AdvantagesLongest read length. Fast. Detects 4mC,
5mC, 6mA.[41]
Less expensive equipment. Fast.
Long read size. Fast.
Potential for high sequence yield, depending upon sequencer model
and desired application.
Low cost per base.Long individual
reads. Useful for many
applications.
DisadvantagesLow yield at high
accuracy. Equipment can be very expensive
Homopolymer errors.Runs are
expensive. Homopolymer
errors.
Short reads. Slower than other methods.
More expensive and impractical
for larger sequencing
projects.
Genome sequencing: Comparison of NGS methods
Instrument Application: de novo assemblies
BACs, plastids, & microbial genomes Transcriptome Plant & animal genome454 – GS Jr. B – good but expensive C – need multiple runs, expensive D – cost prohibitive
454 – FLX+ A – good, need to multiplex to be economicalB – good but expensive, libraries usually normalized, not best for short RNAs
C – OK as part of a mixed platform strategy, prohibitive to use alone
MiSeq – v2 A – good, need to multiplex for best economicsA/B –expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly
B – expensive relative to HiSeq, but additional read length can be valuable
HiSeq 2000/2500, standard run
B/C – more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq
A – good, assembly more challenging than 454 but much more data available for analyses
A – primary data type in many current projects; requires mate-pair libraries
HiSeq 2500, rapid run (projected)
B – more data than needed unless highly indexed; assembly more challenging than 454
A – good, assembly more challenging than 454 but much more data available for analyses
A – will probably be more expensive than HiSeq2000, but increased read length may be worth it
Ion Torrent – 314 B/C – OK, lowest experimental cost but reads are shorter & more expensive than Illumina
C – OK, but reads are shorter & more expensive than Illumina
D – cost prohibitive, reads shorter than alternatives
Ion Torrent – 318 B/A – good, less data than MiSeqB/A – good, less data than MiSeq, reads similar to 454 titanium but less expensive
C – high cost relative to Proton or Illumina, more economical than 454 for mixed platform strategy
Ion Torrent Proton IB – more data than needed unless indexed; assembly more challenging than 454 or Illumina
B/A – assembly currently more challenging than Illumina or 454
B – expensive relative to HiSeq or Proton II/III
Ion Torrent Proton II (projected)
B/C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
B/A – assembly currently more challenging than Illumina or 454 A/B – should be similar to HiSeq
Ion Torrent Proton III (forecast) C – more data than needed unless highly indexed B/A – need assembly pipelines A – cost per MB could make it the
best
SOLiD – 5500 C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
C/D – short reads make assembly challenging or impossible
C/D – short reads make assembly challenging or impossible
PacBio – RSB – good for hybrid assemblies; not economical for solo assemblies – requires high coverage due to high error rates
B/D – good for hybrid assemblies; too expensive for solo use; short RNA is challenging
B/D – good for hybrid assemblies & scaffolding (mixed platform strategy); cost prohibitive for solo use
Platform – instrument Application: resequencing
Targeted loci Transcript counting Genome resequencing
454 – GS Jr. B/C – good but expensive, need to limit loci D – cost prohibitive D – cost prohibitive for large
genomes
454 – FLX+ B – good but expensive, should limit loci D – cost prohibitive D – cost prohibitive for large genomes
MiSeq A/B – good, fewer and higher cost reads than HiSeq
B – more expensive than HiSeq or SOLiD or ProtonII+ B/C – expensive for large genomes
HiSeq 2000/2500 – standard run
A – primary data type in many current projects; best for many loci
A – primary data type in many current projects
A – primary data type in many current projects
HiSeq 2500 – rapid run (projected) A – faster path to leading data type
A/B – likely to be slightly more expensive than with standard flow cell
A – faster path to leading data type
Ion Torrent – 314 C – OK but expensive, need to limit loci D – cost prohibitive D – cost prohibitive
Ion Torrent – 318 B – good, slightly less data per run than MiSeq
B/C – more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile
C – expensive for large genomes
Ion Torrent Proton I A/B – similar to MiSeq, but different error profile will inhibit switching
B – more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina)
B – expensive relative to HiSeq or Proton II+
Ion Torrent Proton II (projected)
A/B – similar to HiSeq, but different error profile will inhibit switching
A/B – new informatics pipelines needed
A – supposed to set new pricing standard, could become leading shorter-read platform
Ion Torrent Proton III (forecast)
A/B – costs projected to be better than HiSeq; error profile different than Illumina
A/B – new informatics pipelines needed
A – supposed to set new pricing standard, could become leading shorter-read platform
SOLiD – 5500xl B – harder to assemble than Illumina A/B – used much less than HiSeq A/B – used much less than HiSeq
PacBio – RS C/D – expensive but can sequence difficult regions D – cost prohibitive C/D – cost prohibitive except for
strutural variants
Bacterial genomes
Noncoding DNA in genomes
Bacterial genomes
Bacterial genomes
Bacterial genomes
Complex Bacterial Genomes
Fosmid and plasmid library; Sanger
Simplified Bacterial Genomes
MDA for 16h on one lysed cell3kb Sanger libraries plus 45415 gaps (chimeric clones) Sanger finishingPolishing by Illumina reads37 regions Sanger polishing
454 (average read length 225bp)Illumina (33bp)
Bacterial genomes
Eukaryotic Genomes
Eukaryotic Genomes: Fish genomes
Template: A female fish was chosen because of its XX sex chromosome constitution
Roche 454 Titanium (3 and 20kb libraries)Illumina PE insert size 200bp and 75 bp readsphysical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10× genome coverage with a total of 43,192 clones available)
Bird genomes
Mammalian genomes
HiSeq2000
DNA isolated from blood
Extremelly large genomes
loblolly pine (Pinus taeda)
The largest genome assembled to date
DNA template:a single megagametophyte, the haploid tissue of a single pine seed – quantitylong-fragment mate pair libraries from the parental diploid DNA
Novel fosmid DiTag libraries
N50 scaffold size of 66.9 kbp
Raw Data Trimming and Filtering
Quality score
Raw Data Trimming and Filtering
Raw Data Trimming and Filtering
Assembly
N50N75
ContigsScaffolds
Assembly: K-mer
A common sequence shared by pairs of reads
Assembly: K-mer
Assembly
Assembly – algorithms
Repeats!
OLC Overlap/Layout/Consensus
Overlap: Overlap discovery all-against-all, seed & extend heuristic algorithm;K-mers as alignment seeds-sensitivity
Layout: Construction and manipulation of an overlap graph leads to an approximate read layout
Consensus: Multiple sequence alignment (MSA) determines the precise layout and then the consensus sequence.
Loading base calls-computer memory
Assembly vs Repetitive DNA
Assembly vs Repetitive DNA
Assembly vs Repetitive DNA and Coverage
Why is coverage important?resolutionrepeat discovery, copy number estimationbinning of metagenomic data
Why is GC important?affecting coverageHGT discoverybinning of metagenomic data
Assembly vs GC content
both GC-rich fragments and AT-rich fragments are underrepresented in the Illumina sequencing results
Assembly vs GC content
Less even coverage with Illumina
Velvet and Velvet OptimizerNewblerCeleraMaSuRCA
Assembling algorithms and Scaffolders
http://en.wikipedia.org/wiki/Sequence_assembly
Assembling algorithms and Scaffolders
Annotation
Ready for Annotation?
Checking gene coverage:UCOs - Ultra Conserved Orthologs (Kozik et al., 2007)
CEGMA - Core Eukaryotic Genes Mapping Approach (Parra et al., 2007)
SICO - genes Single Copy genes Proteobacteria (Lerat et al., 2003)
Median gene length roughly proportional to genome size
Percent gaps:library insert size vs. 50 “N”s
Sanger
454Illumina
Ready for Annotation?UCOs
Annotation of Prokaryotic Genomes
Automated pipelines and annotation softwares:RASTBASysSOPPROKKAIMG ER
Gene prediction:GLIMMERProdigal Prokaryotic Dynamic Programming Genefinding Algorithm
Annotation of Prokaryotic Genomes
Repeated errors
Inconsistent gene names
Additional data and postgenomic experiments
Annotation of Eukaryotic Genomes
Standard draft assembly
High quality draft assembly
Two phases1. computation phase
repeat masking (homopolymers, transposable elements)evidence alignment (proteins, ESTs, RNA-seq data aligned)ab initio gene prediction vs Evidence driven gene prediction
2. annotation phasefinding a consensus
Annotation of Eukaryotic Genomes
Gene prediction and gene annotation are not synonyms!Predictors do not report untranslated regions (UTRs) or splice variants
Annotation of Eukaryotic Genomes