massively parallel sequencing for biodiversity
DESCRIPTION
Author: Annie Archambault, Research professional at the Quebec Center for Biodiversity Science (qcbs.ca)
Objective: Provide an overview of NGS technology for researchers in biodiversity science
Description: Slides presenting the major massively parallel sequencing platforms (next-generation sequencing, NGS), as well as strategies for reducing genome complexity and for multiplexing different samples into one sequencing run. Laboratory steps and estimated costs are summarized for a few case studies.
References:
Bartram et al. 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10
Malausa et al. 2011. High-throughput microsatellite isolation through 454 GS-FLX Titanium pyrosequencing of enriched DNA libraries. Molecular Ecology Resources 11: 638-644. http://doi.wiley.com/10.1111/j.1755-0998.2011.02992.x
Cosart et al. 2011. Exome-wide DNA capture and next generation sequencing in domestic and wild species. BMC Genomics 12: 347. http://www.biomedcentral.com/1471-2164/12/347
Timmermans et al. 2010. Why barcode? High-throughput multiplex sequencing of mitochondrial genomes for molecular systematics. Nucleic Acids Research 38: e197. http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gkq807
Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19
Koopman et al. 2011. The Microbial Phyllogeography of the Carnivorous Plant Sarracenia alata. Microbial Ecology 61(4): 750–758. http://www.springerlink.com/index/10.1007/s00248-011-9832-9
TRANSCRIPT
Massively parallel sequencing for biodiversity science
Annie Archambault, Centre de la Science de la Biodiversité du Québec
qcbs.ca
April 2013
Synonyms: Next generation sequencing (NGS), Massively parallel sequencing, High throughput sequencing, 2nd or 3rd generation sequencing…
Parallelize the sequencing process: produce thousands of short sequencing reads at once
REPLACES CLONING AND CLONE SCREENING
REPLACES INDIVIDUAL SEQUENCING REACTIONS
Outline – Uses for biodiversity studies
Very brief review of the 4 main platforms
Examples of experimental strategies (complexity reduction and multiplexing)
Laboratory steps and costs for 4 case studies
Disclaimer: I still have limited experience with these instruments; my understanding comes from intensive reading
Useful reading
Review of the chemistry and the workflow: Myllykangas S, Buenrostro J, Ji HP: Overview of Sequencing Technology Platforms. In Bioinformatics for High Throughput Sequencing. Springer New York; 2012: 11–25. http://www.springerlink.com/content/n6u33m1335750g57/
Review of technologies and applications in biodiversity: Purdy KJ, Hurd PJ, Moya-Laraño J, Trimmer M, Oakley BB, Woodward G: Systems Biology for Ecology: From Molecules to Ecosystems. In Advances in Ecological Research. 2010: 87–149. http://linkinghub.elsevier.com/retrieve/pii/B9780123850058000034
Instruments comparison
GS-FLX+ (454)
  Amplification, detection: Pyrosequencing – emulsion PCR
  Detection step: Fluorescence, during synthesis
  At GQ Innovation Center: Yes
  Unit: 1 plate (divided in ¼)
HiSeq (Illumina)
  Amplification, detection: Bridge PCR, wash after every base
  Detection step: Fluorescence, during synthesis
  At GQ Innovation Center: Yes
  Unit: Flow cell of 8 lanes
Ion PGM™ Sequencer (Life Technologies), 314 or 316 chip
  Amplification, detection: Emulsion PCR, pyrosequencing-like
  Detection step: H+ ions, during synthesis
  At GQ Innovation Center: Yes
  Unit: Chip
PacBio RS
  Amplification, detection: No prior amplification. Single-molecule real-time sequencing (SMRT)
  Detection step: Fluorescence, during synthesis
  At GQ Innovation Center: No
  Unit: Cell
Visuals
454 GS FLX, HiSeq, Ion PGM, PacBio RS:
http://454.com/products/technology.asp
http://bcove.me/7eidiq1e?width=490&height=274
http://www.youtube.com/watch?v=77r5p8IBwJk
http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related
http://www.youtube.com/watch?v=yVf2295JqUg&feature=plcp&context=C4897380VDvjVQa1PpcFPcv91xP1YGJ3-1VyENe915toprCBsg2Jc%3D
Instrument comparison
GS-FLX+ (454)
  Nb reads per unit: 1 million per plate
  Read length: 350 – 500 bp
  Run time: 20 h
  Cost: library prep 160 $; per plate 8 200 $
  Cost $ per Mb *: 7 $
  Preferred uses: Amplicon sequencing; initial characterization; non-model species
  Type of errors: Indels
HiSeq (Illumina)
  Nb reads per unit: ~200 million per lane
  Read length: 50 bp, 100 bp or 150 bp
  Run time: 8 days
  Cost: library prep 160 $; per lane 715 $ to 2 100 $ (depending on read length)
  Cost $ per Mb *: 0.1 $
  Preferred uses: Re-sequencing; frequency-based applications. NOT amplicons
  Type of errors: Substitutions
Ion PGM (314, 316 or 318 chip)
  Nb reads per unit: 314: 100 000; 316: 1 million; 318: 10 million
  Read length: 35 – 400 bp
  Run time: 1.5 – 7 h
  Cost: ?
  Cost $ per Mb *: 50 $
  Preferred uses: Individual laboratories, small scale
  Type of errors: Indels
PacBio RS
  Nb reads per unit: 50 000 reads per cell
  Read length: 6000 bp
  Run time: 2 h
  Cost: ~750 $ USD per sample
  Cost $ per Mb *: 11 – 200 $
  Preferred uses: Non-model species, long fragments, methylated fragments
  Type of errors: CG deletions, high error rates
* Cost estimate: Glenn TC. 2011. Field guide to next-generation DNA sequencers. Molecular Ecology Resources 11: 759–769. http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2011.03024.x/abstract
Quantity instead of length or quality
Template: long templates (gDNA) or short amplicon templates
Library preparation for long templates: gDNA fragmentation + adaptors
Library preparation for short amplicons: amplification + adaptors
Each read is short (75 – 200 bp) and bears errors: each position needs to be confirmed by many reads covering the same template region
[Figure: reads are assembled or mapped by similarity into a deduced template sequence. Regions labelled "8X coverage" are kept; regions labelled "Excluded from further analyses" are dropped.]
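The "8X coverage" idea can be turned into a quick planning calculation. The sketch below uses the usual average-depth relation (depth = number of reads × read length / template length); the function names and the 16 000 bp template in the usage line are illustrative assumptions, not values from the slides.

```python
import math


def expected_coverage(n_reads: int, read_length_bp: int, template_length_bp: int) -> float:
    """Average depth of coverage: total sequenced bases / template length."""
    if template_length_bp <= 0:
        raise ValueError("template_length_bp must be positive")
    return n_reads * read_length_bp / template_length_bp


def reads_needed(target_coverage: float, read_length_bp: int, template_length_bp: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * template_length_bp / read_length_bp)


# Hypothetical example: 8X coverage of a 16 000 bp template with 100 bp reads.
print(reads_needed(8, 100, 16_000))  # 1280
```

This is an average only: real coverage is uneven, which is why some regions end up excluded from further analyses.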
Useful in biodiversity?
How to make use of 200 million reads for your biological question? Be strategic!
Reduce the complexity of genetic material analyzed
Combine different samples into a single run (Multiplexing)
Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of each fragment, during library preparation
The code is read at sequencing
Reads are sorted by sequence deconvolution according to their “code”
Roche 454: 30 (up to 130) multiplex identifiers (MID), 10 bp
Illumina: 12 “index sequences”, 6 bp
Depth of coverage: GS-FLX ¼ plate = 250 000 reads / 25 barcodes = 10 000 reads per sample. Enough for you?
[Figure: Samples 1, 2 and 3 are pooled in one tube and sequenced in a single run; the reads are then sorted back into Samples 1, 2 and 3 according to their “coded” sequence.]
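The sorting step can be sketched in a few lines. This is a minimal illustration assuming exact barcode matches at the start of each read and barcodes of equal length; the 6 bp barcodes and sample names in the usage lines are hypothetical, and real pipelines usually also tolerate sequencing errors in the barcode.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Sort reads into samples by exact match of their leading barcode.

    Returns (bins, unassigned). The barcode is trimmed from assigned reads.
    Assumes all barcodes have the same length.
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    k = lengths.pop()
    bins = defaultdict(list)
    unassigned = []
    for read in reads:
        sample = barcode_to_sample.get(read[:k])
        if sample is None:
            unassigned.append(read)
        else:
            bins[sample].append(read[k:])
    return dict(bins), unassigned


def reads_per_sample(total_reads: int, n_barcodes: int) -> int:
    """Expected reads per sample if the pool is perfectly balanced."""
    return total_reads // n_barcodes


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ATCACG": "sample1", "CGATGT": "sample2"}
bins, unassigned = demultiplex(["ATCACGAAAA", "CGATGTCCCC", "GGGGGGTTTT"], barcodes)
print(bins)        # {'sample1': ['AAAA'], 'sample2': ['CCCC']}
print(unassigned)  # ['GGGGGGTTTT']
print(reads_per_sample(250_000, 25))  # 10000, the ¼ plate example above
```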
Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of the UNKNOWN fragment, during library preparation
Example of a Roche 10 bp MID barcode for amplicon sequencing
Template (both strands):
5’-CTCGTAGACTGCGTACCAATTC.............TTACTCAGGACTCAT-3’
3’-CATCTGACGCATGGTTAAG.............AATGAGTCCTGAGTAGCAG-5’
Target-specific parts of the primers:
TargetSpecific: GACTGCGTACCAATTC-3’
TargetSpecific: 3’-CAATGAGTCCTGAGTAG
Primer LibL_A with MID3 with TargetSpecific (Lib-L-PrimerA + key + MID3):
5’-CCATCTCATCCCTGCGTGTCTCCGACTCAGAGACGCACTC
Primer LibL_B with TargetSpecific, no MID (key + Lib-L-PrimerB):
GACTCTGACGGTTCCGTGTGTCCCCTATCC-5’
Strategies: complexity reduction
A few organisms (1 to a few hundred): survey a few thousand loci per sample
  Enrich in gene-rich regions for gDNA sequencing
  Random genomic survey
  Transcriptome sequencing
Very many organisms (e.g. environmental studies): survey one or two loci per individual
  Amplicon sequencing with universal primers (PCR)
By hybridization
A few organisms:
  Enrich in simple-sequence repeats: hybridization to target repeats (e.g. microsatellite loci)
  Enrich in gene-rich regions for genomic DNA sequencing: hybridization to a reference set of genes (e.g. target exons)
Be creative!
Hybridization: Enrich in specific fragments (e.g. exons)
[Figure: after DNA fragmentation, custom-made baits on beads capture the target fragments; bound fragments are retained, unbound fragments are discarded, and the enriched pool is sequenced.]
! Evaluate costs carefully
From 2008 to 2013? Instruments give higher throughput; each sequencing run is cheaper
It may be cheaper not to target specific regions
By methylation-sensitive RE
Be creative! A few organisms:
  Enrich in gene-rich regions for genomic DNA sequencing
  Elimination of methylation-rich regions (plant repetitive elements)
[Figure: nuclear DNA fragmentation; insert in E. coli, which digests methylated DNA; sequence the enriched pool.]
By amplification
One or a few organisms: randomly sample the whole genome
Amplification: “AFLP-like”, but reading the sequence instead of the length polymorphism
ddRAD: double digest restriction-site-associated DNA sequencing, to find SNPs
Steps: DNA fragmentation with two enzymes (Enz.A and Enz.B), adaptor ligation, amplification with adaptor primers
Powerful when coupled with multiplexing
[Figure: fragments from Sample 1 and Sample 2, cut by Enz.A and Enz.B, each receive an adaptor carrying a multiplex « index ».]
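To get a feel for how many fragments a double digest yields, the digestion and size selection can be simulated on a known sequence. The sketch below is an assumption-laden illustration: the slides do not name the enzymes, so EcoRI (G^AATTC) and MspI (C^CGG) are used as stand-ins in the usage lines, only the top strand is scanned, and methylation and incomplete digestion are ignored.

```python
import re


def cut_positions(sequence, site, offset):
    """Top-strand cut positions for one enzyme (overlapping sites included)."""
    pattern = re.compile(f"(?={re.escape(site)})")
    return [match.start() + offset for match in pattern.finditer(sequence)]


def double_digest(sequence, site_a, offset_a, site_b, offset_b, min_len, max_len):
    """Return fragments flanked by one Enz.A cut and one Enz.B cut, within a size window."""
    cuts = sorted(
        [(pos, "A") for pos in cut_positions(sequence, site_a, offset_a)]
        + [(pos, "B") for pos in cut_positions(sequence, site_b, offset_b)]
    )
    fragments = []
    # Adjacent cuts delimit the fragments of a complete digest.
    for (start, enz_left), (end, enz_right) in zip(cuts, cuts[1:]):
        if enz_left != enz_right and min_len <= end - start <= max_len:
            fragments.append(sequence[start:end])
    return fragments


# Toy sequence, for illustration only.
toy = "TTGAATTC" + "A" * 10 + "CCGG" + "T" * 5 + "CCGG" + "A" * 3 + "GAATTCTT"
print(double_digest(toy, "GAATTC", 1, "CCGG", 1, min_len=10, max_len=20))
# ['AATTCAAAAAAAAAAC']
```

Only fragments with a different enzyme at each end and a length inside the window survive, which is what makes the sampled loci reproducible across individuals.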
Genome complexity reduction: RNA
A few samples: transcriptome sequencing
  Total RNA: RNAseq
  Reduce to mRNA only (polyA)
  Reduce to microRNA only
! Driven by external conditions and by tissue type
Needs a high number of reads: Illumina preferred
[Figure: transcription (DNA –> RNA), translation (RNA –> protein)]
Reminder: mRNA sequences include non-coding regions (UTR)
[Figure: gene with 5’ UTR, exon, intron, exon, 3’ UTR; mature mRNA with 5’ UTR, CDS, 3’ UTR and poly-A tail (AAAAAAA)]
Genome complexity reduction: Amplicon
Very many organisms: amplicon sequencing with universal primers for ONE locus
Limitation: primers may not amplify equally well in ALL target organisms
Environmental samples targeting ITS, 16S, CO1 (the barcode loci)
[Figure: three target organisms; primers anneal in two of them and DO NOT anneal in the third.]
Case studies in biodiversity
Bartram et al 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10
Castoe et al. 2011 Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953
Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135
Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19
Bacterial communities
Bartram et al. 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10
Objective: develop a protocol for community genetic diversity (tested on two samples; costs given for 20 samples)
Material: Soils from arctic tundra. Total DNA extracted with FastDNA (MP Biomedicals). Also includes a control bacterial mix in liquid media.
20 X 5 $ = 100 $
Bacterial communities – Molecular steps
Primer for hypervariable region 3 (V3) of the microbial 16S rRNA gene, 81 bp, PAGE-purified:
caagcagaagacggcatacgagatCGTGATgtgactggagttcagacgtgtgctcttccgatctATTACCGCGGCTGCTGG
Primer parts, in order: flow-cell-binding, index, Illumina primer, target gene
Amplify with high-fidelity polymerase (Phusion)
Extract the desired length, 200-250 bp (columns)
Multiplexing: yes, including technical replicates
Quality control for libraries (e.g. Agilent Bioanalyzer)
Sequencing: paired-end 2 x 125 bp, Illumina GAIIx (would now be HiSeq)
Costs:
25 X 67 $ = 1 675 $
1 X 90 $ = 90 $
25 X 1.5 $ ≈ 40 $
25 X 50 $ = 1 250 $
1 X = 2 090 $
Total = 5 250 $
Bacterial communities – Bioinformatics
Base calling and error estimation: Illumina Analysis Pipeline
Quality filtering, read sorting according to index sequence, contig assembly (custom made, PANDAseq)
Discard pairs with 1 or more mismatches between the two overlapping fragments of the paired-end reads, or with 1 or more ambiguous bases
Assignment to taxonomic affiliations: naïve Bayesian classification (Ribosomal Database Project, RDP classifier), cutoff 0.5
Good’s coverage for each library to estimate sequence coverage (C = 1 – n1/N)
CD-HIT to cluster arctic tundra datasets at 97% sequence identity
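Good’s coverage is simple to compute once sequences are clustered. In the formula C = 1 – n1/N, n1 is the number of clusters observed exactly once (singletons) and N is the total number of sequences. The sketch below assumes the input is a list of cluster sizes; the function name and the example values are illustrative.

```python
def goods_coverage(cluster_sizes):
    """Good's coverage C = 1 - n1/N.

    cluster_sizes: number of sequences in each cluster (e.g. each 97% OTU).
    n1 = clusters containing a single sequence, N = total sequences.
    """
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("no sequences to evaluate")
    singletons = sum(1 for size in cluster_sizes if size == 1)
    return 1 - singletons / total


# Hypothetical library: 4 clusters, 10 sequences, 2 singletons -> C = 0.8
print(round(goods_coverage([5, 3, 1, 1]), 3))  # 0.8
```

A value close to 1 suggests that further sequencing of the same library would mostly return clusters that were already seen.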
Sequence analyses
[Figure: workflow. Raw reads; sorting by index seq. and assembly of paired-end reads (custom program); clustering by modified single linkage (CD-HIT); classification / diversity estimate (RDP / QIIME).]
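The discard rule for paired-end assembly (no mismatch in the overlap, no ambiguous base) can be illustrated with a deliberately simplified merger. This is not the PANDAseq algorithm, which scores candidate overlaps probabilistically; the function names, the `min_overlap` parameter and the toy template are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (upper-case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair into one contig, or return None to discard it.

    A pair is discarded if either read contains an ambiguous base (N) or if
    no overlap of at least min_overlap bases matches perfectly.
    """
    if "N" in forward or "N" in reverse:
        return None
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first.
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


# Toy 36 bp template read from both ends with 24 bp reads (12 bp overlap).
template = "ACGTTGCATGCCGATAGCTAGGCTTACGGATCCATG"
fwd = template[:24]
rev = reverse_complement(template[12:])
print(merge_pair(fwd, rev, min_overlap=8) == template)  # True
```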
Bacterial communities – Results
Total of 12 million raw reads
About 50% of the reads are discarded:
  Raw reads: 7.6 million and 4.4 million for the two technical replicates
  Post-assembly: 4.1 million and 2.4 million for the two technical replicates
Average post-assembly contig: 150 ± 11 bases (without primers); overlap 66 ± 11 bases
Pre-clustering at 97% sequence identity
Estimated error rate (from the control library): 1 error per 5 contigs (1%/base). Higher than Sanger sequencing.
A contaminant was found in the growth media of a control
Duplicate arctic tundra libraries displayed a high degree of similarity:
  Comparison of phyla between the two libraries (AT1 to AT2): r = 0.999
  The majority of sequence clusters (99.57%) were detected in both replicates
Isolation of novel microsatellite loci
Castoe et al. 2011. Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PLoS ONE. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953
Objective: Discover novel microsatellite loci from diverse organisms
Material: Total genomic DNA from many different organisms (results are reported for only 3 species; costs here are calculated for 8)
Also read: Jennings et al. 2011. Multiplexed Microsatellite Recovery Using Massively Parallel Sequencing. Molecular Ecology Resources. http://doi.wiley.com/10.1111/j.1755-0998.2011.03033.x
Isolation of novel microsatellite loci – Molecular steps
Genome complexity reduction: none, direct sequencing
Material: genomic DNA (5 µg), one individual per species
Library preparation for Illumina sequencing (likely on ~8 to 10 species)
Multiplexing: yes, at the sequencing facility, during library preparation
Sequencing platform: Illumina GAIIx, 120 bp paired-end. Would now be HiSeq 2000
Primers then need to be ordered for each locus: ~700 $ for 50 loci … up to 5 500 $ for 8 species
Costs:
8 X 5 $ = 40 $
8 X 160 $ = 1 280 $
1 X 2 090 $ = 2 090 $
Total: 3 400 $
Total (with primers): 9 000 $
Isolation of novel microsatellite loci – Bioinformatics
Simple: no assembly, no comparison to a reference genome
In a Perl script:
  Identify reads that contain perfect SSRs: 2-mer to 6-mer motifs, repeated at least 6 times
  Sort by SSR type (de-multiplex)
  Design primers (with Primer3)
  Discard the primer pairs that also occur in other reads
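The first step, finding perfect SSRs, fits in one regular expression. The original work used a Perl script; the Python sketch below is an independent illustration of the same criterion (2-mer to 6-mer motif, at least 6 tandem copies) and is not the authors' code. Motifs that are themselves repeats of a shorter unit (for example a homopolymer run read as "AA") are skipped, which is an added assumption.

```python
import re

# A 2-6 bp motif followed by at least 5 more copies of itself (6+ in total).
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{5,}")


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(read):
    """Return (motif, number_of_repeats, start_position) for each perfect SSR in a read."""
    hits = []
    for match in SSR_PATTERN.finditer(read):
        motif = match.group(1)
        if is_primitive(motif):
            hits.append((motif, len(match.group(0)) // len(motif), match.start()))
    return hits


# Toy read: (AC)7, (CAG)6 and a homopolymer run that is ignored.
toy_read = "GGT" + "AC" * 7 + "TTT" + "CAG" * 6 + "T" + "A" * 14
print(find_perfect_ssrs(toy_read))  # [('AC', 7, 3), ('CAG', 6, 20)]
```

The motif reported depends on where the repeat starts in the read, so "CAG" and "AGC" repeats would need to be normalized before sorting loci by SSR type.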
Isolation of novel microsatellite loci – Results
Number of raw reads: not reported
Used 5 million paired-end reads per sample (a 1X coverage)
Mean sequence length: not reported
Between 150 000 and 540 000 potential loci (containing microsatellites)
Primers designed for 72 000 to 174 000 loci, depending on species
With extra stringency (only 3- to 6-mers, >7 repeats): 200 to 2 000 loci
Primers not tested for amplifiability
Conclusions: Large variation in the number and proportion of motifs (3-mer, 4-mer…) among the different organisms.
ddRAD-seq
Peterson et al. 2012 Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135
Objective: Aim at 10 000 SNPs, random and genome-wide, at 10X coverage
Material: Total DNA extracted from 54 P. leucopus from one population (Qiagen kits)
54 X 5 $ = 270 $
ddRAD-seq – Molecular steps
Complexity reduction: yes, complex. Digestion, annealing, size selection, many purification steps
Multiplexing: yes, 54 samples. From simulations that include genome size and nucleotide frequencies, they estimate that 400 000 reads per individual are needed for the 300 ± 30 bp fragments
Platform: two lanes of GAII (now HiSeq 2000), paired-end
Many oligos, combining an « index » on the 5’ and on the 3’ side
Digestions and PCR amplifications
Many purifications and precise size selection (Pippin Prep)
Costs:
2 X 2 010 $ = 4 020 $
110 X ~30 $ = 3 300 $
Enzymes + purif. = 1 250 $
Big total = ~9 000 $
[Figure: adaptor and primer design. gDNA insert flanked by adaptor P1 and adaptor P2.]
PCRprimer1 (46 bp): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACG
Adaptor P1, Oligo1.1: ACACTCTTTCCCTACACGACGCTCTTCCGATCTAATTA-3’
Adaptor P1, Oligo1.2 (one of 48): TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGATTAATTTAA-5’-P
gDNA insert: AATTCNNN…NNN (top strand), GNNN…NNNGGC (bottom strand)
Adaptor P2, Oligo2.1: P-5’-CGAGATCGGAAGAGCGAGAACAA
Adaptor P2, Oligo2.2: TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG
PCRprimer2 (1 of 12): CGTGTGCAGACTTGAGGTCAGTGTAGTGCTAGAGCATACGGCAGAAGACGAAC
ddRAD-seq – Sequence analyses
Initial sequence processing:
  De-multiplex: accept 1 bp mismatch in the 4 bp barcode
  Assign each read to a single individual
  Collapse identical reads into one sequence, retaining its frequency
No reference genome; not the “Stacks” package:
  Compute pairwise distances between all reads (BLAT)
  MCL to group similar reads (ortholog inference)
  Count unique sequences in a cluster (= locus); count how many are beyond the ploidy level (= % of error-containing reads)
  Align orthologs (MUSCLE)
  Write alignments as reference-ordered SAM/BAM files
  GATK UnifiedGenotyper, to genotype
Error rate: ranged from 0.18 to 0.22% per nucleotide (1/10 reads)
Technical replicates? No
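The initial processing steps (mismatch-tolerant de-multiplexing, then collapsing identical reads) can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: barcodes sit at the start of the read and all have the same length, a read is kept only if exactly one barcode lies within the allowed distance, and the barcodes and individual names in the usage lines are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_individual(read, barcodes, max_mismatch=1):
    """Return the individual for a read, or None if zero or several barcodes match."""
    candidates = [
        individual
        for barcode, individual in barcodes.items()
        if len(read) >= len(barcode)
        and hamming(read[: len(barcode)], barcode) <= max_mismatch
    ]
    return candidates[0] if len(candidates) == 1 else None


def collapse_reads(reads, barcodes, max_mismatch=1):
    """Per individual, count identical reads (barcode trimmed), keeping frequencies."""
    barcode_length = len(next(iter(barcodes)))  # assumes equal-length barcodes
    counts = {}
    for read in reads:
        individual = assign_individual(read, barcodes, max_mismatch)
        if individual is not None:
            counts.setdefault(individual, Counter())[read[barcode_length:]] += 1
    return counts


# Hypothetical 4 bp barcodes that differ at all four positions.
barcodes = {"ACGT": "ind1", "TGCA": "ind2"}
reads = ["ACGTTTTT", "ACCTTTTT", "TGCAGGGG", "CCCCAAAA"]
result = collapse_reads(reads, barcodes)
print(result["ind1"]["TTTT"], result["ind2"]["GGGG"])  # 2 1
```

Tolerating one mismatch is only safe when the barcodes differ from each other at three or more positions; otherwise a single error can make a read match two individuals, and this sketch then drops it.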
ddRAD-seq – Results
The 54 wild Peromyscus from the same population:
  Total reads: not reported (~2 X 21 million)
  Assigned to an individual: not reported
  5.4% of reads discarded
SNPs discovered:
  Variable regions (loci): 6 200 found
  Polymorphic sites for >70% of individuals: 16 000 sites found
In an analysis of samples from different populations:
  SNPs in multi-SNP loci: >80%
  These multi-SNP loci are usually excluded in other analyses
Phylogenies with polyploids
Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19
Objective: Phylogenies in polyploid grasses; recent, rapid radiation (in a time- and cost-effective experimental design)
Phylogenies with polyploids – Molecular steps
Material: Total DNA extracted from 60 individuals of 11 different polyploid Poa species
Complexity reduction: amplify 3 cp genes (rpl32-trnL, rpoB-trnC and trnH-psbA) and two nuclear genes (DMC1 and CDO504) from each of the 60 samples
Target amplicons are < 500 bp
Pool the 5 different PCR products for one individual
Costs:
60 X 5 $ = 300 $
Primers: 10 X 5 $ = 50 $
Enzymes: 1 X 150 $ = 150 $
Phylogenies with polyploids – Multiplexing and sequencing
Multiplexing: yes, addition of ds-adaptors with MID barcodes by ligation. Design 64 different barcodes (with 3 technical replicates)
Ligation of barcode-adapters to the pools of amplicons, purification, pooling, quality control
Sequencing platform: ¼ plate of a Roche 454 with Titanium 2010 chemistry
Costs:
64 X 2 X 9 $ = 1 160 $
64 X 2 X 6 $ = 800 $
1 X 100 $ = 100 $
64 X 1.5 $ = 96 $
1 X 50 $ = 50 $
1 X 2 140 $ = 2 140 $
Total = ~4 900 $
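[Figure: adaptor design. The A-tailed amplicon (A - … - A) is ligated between TitaniumAdapterA (25 bp) + MID barcode and TitaniumAdapterB.]
TitaniumAdapterA + MID barcode, top strand: TCGTATCGCCTCCCTCGCGCCATCAG + ACGAGTGCGT
TitaniumAdapterA + MID barcode, bottom strand: TGCATAGCGGAGGGAGCGCGGTAGTA TGCTCACGCA
TitaniumAdapterB, top strand: CTGAGCGGGCTGGCAAGGCGCATAG
TitaniumAdapterB, bottom strand: GACTCGCCCGACCGTTCCGCGTATC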
Phylogenies with polyploids – Bioinformatics
Galaxy platform (free)
Sort the gene regions by regular expression (REGEX) matching the gene-specific primers
Discard: low-quality reads, short sequences, reads matching no MID barcode
Calculate the error rate by counting SNPs in chloroplast regions
Detect and discard PCR recombinants: alleles that occurred at <5% for a species, OR alleles whose two ends do not match the same common allele
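Sorting reads by gene-specific primer with regular expressions can be sketched as below. The study ran this step in Galaxy; this standalone Python version is only an illustration, and the gene labels and primer patterns are hypothetical placeholders (the real primer sequences are not given in the slides). Character classes such as [AG] stand for degenerate primer positions.

```python
import re


def sort_by_primer(reads, primer_patterns):
    """Assign each read to the first gene whose primer pattern matches its start.

    primer_patterns: dict mapping a gene label to a regular expression.
    Reads matching no pattern are collected under 'unmatched'.
    """
    compiled = {gene: re.compile(pattern) for gene, pattern in primer_patterns.items()}
    sorted_reads = {gene: [] for gene in compiled}
    sorted_reads["unmatched"] = []
    for read in reads:
        for gene, pattern in compiled.items():
            if pattern.match(read):  # match() anchors at the start of the read
                sorted_reads[gene].append(read)
                break
        else:
            sorted_reads["unmatched"].append(read)
    return sorted_reads


# Hypothetical primer patterns, for illustration only.
patterns = {"geneA": "TGCA[AG]TCC", "geneB": "GGAT[CT]CAA"}
reads = ["TGCAATCCAAAA", "TGCAGTCCTTTT", "GGATTCAAGGGG", "CCCCCCCC"]
print(sort_by_primer(reads, patterns))
# {'geneA': ['TGCAATCCAAAA', 'TGCAGTCCTTTT'], 'geneB': ['GGATTCAAGGGG'], 'unmatched': ['CCCCCCCC']}
```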
Phylogenies with polyploids – Results
121 000 raw reads. Length: 40 to 775 bp (mean 278 bp)
111 200 (92%) match a gene-specific primer
70 601 reads (58%) remained after barcode sorting and quality control
Useful sequence for 281 out of 320 (88%) targets = 12% missing
Sequence error rate: 0.13%
PCR recombination: 2.9% of CDO504 reads and 14% of DMC1 reads
Technical replicates: P. costiniana: identical alleles (at the < 2 bp level), but one extra allele (= PCR error)
Two distinct copies (and more) of each nuclear gene deduced: DMC1 copies differ by 19 bases (4.0%) and CDO504 copies by 35 bases (8.5%), plus seven-bp and 4-bp indels
One extra gene copy discovered for CDO504, showing a 57-bp deletion in an intron
Phylogenies with polyploids
[Figure: number of sequence reads obtained for each marker/individual combination. A: after quality control and barcode deconvolution. B: useful sequence reads remaining after alignment and editing.]
[Figure: percentage of useful reads gained for each nuclear gene copy and allele, including recombinant reads.]
Phylogenies with polyploids – Results: phylogenetic analyses
Timing of polyploidization: took place before the Australian and the American species diverged
Extensive haplotype sharing between taxa currently considered different species
Nuclear gene networks showed incongruence both with each other and with the chloroplast gene networks
Tasmania-mainland differentiation detected
On the local scale, strong spatial genetic structure detected using two of the chloroplast markers. This suggests a smaller neighborhood for seed dispersal than for pollen dispersal.
To remember
Diversity of protocols and experimental designs (Be creative!)
Budget: 3 000 $ to 9 000 $; the main costs can be primers and library preparation
Standards are rapidly increasing: technical replicates required
Challenge: data analysis
  No standard analytical protocol (custom, in-house developed)
  No standard calculation of error rate
  Initial steps are computer-intensive (30 million short reads…)
Results:
  Half of the reads are discarded
  Many target loci will be missing
  Unequal proportion of technical replicates in the final dataset
  Prone to PCR recombination and chimeric assemblies
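Comparison
Arctic soil
  Platform: GA II, one lane
  Nb samples: 24
  Total cost: 5 250 $
  % reads retained: 53 %
  Nb clean unique reads: 4.1 million vs 2.4 million for tech. rep.
  Tech. reps: Yes
  Error rate: 1%
  Missed target: NA
Microsat
  Platform: GAIIx, one lane
  Nb samples: ? (? 8)
  Total cost: 3 400 $ (without primers)
  % reads retained: ?
  Nb clean unique reads: ?
  Tech. reps: No
  Error rate: ?
  Missed target: NA
SNP (ddRAD)
  Platform: GA II, two lanes
  Nb samples: 54
  Total cost: 9 000 $
  % reads retained: 95%
  Nb clean unique reads: 7 000 loci w SNPs
  Tech. reps: No
  Error rate: 0.18 – 0.22% per base
  Missed target: ?
Polyploids
  Platform: GS FLX, ¼ plate
  Nb samples: 61
  Total cost: 4 900 $
  % reads retained: 58 %
  Nb clean unique reads: 70 601
  Tech. reps: Yes
  Error rate: 0.13%
  Missed target: 12 %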
Thank you!