massively parallel sequencing for biodiversity

Massively parallel sequencing for biodiversity science

Annie Archambault, Centre de la Science de la Biodiversité du Québec

qcbs.ca

April 2013

Synonyms: Next generation sequencing (NGS), Massively parallel sequencing, High throughput sequencing, 2nd or 3rd generation sequencing…

Parallelize the sequencing process: produce thousands of short sequencing reads at once

REPLACES CLONING AND CLONE SCREENING

REPLACES INDIVIDUAL SEQUENCING REACTIONS

Outline – Uses for biodiversity studies

Very brief review of the 4 main platforms

Examples of experimental strategies (complexity reduction and multiplexing)

Laboratory steps and costs for 4 case studies

Disclaimer: I still have limited experience with these instruments; my understanding comes from intensive reading

Useful reading

Review of the chemistry and the workflow: Myllykangas S, Buenrostro J, Ji HP: Overview of Sequencing Technology Platforms. In Bioinformatics for High Throughput Sequencing. Springer New York; 2012: 11–25. http://www.springerlink.com/content/n6u33m1335750g57/

Review of technologies and applications in biodiversity: Purdy KJ, Hurd PJ, Moya-Laraño J, Trimmer M, Oakley BB, Woodward G: Systems Biology for Ecology: From Molecules to Ecosystems. In Advances in Ecological Research. 2010: 87–149. http://linkinghub.elsevier.com/retrieve/pii/B9780123850058000034

Instruments comparison

GS-FLX+ (454)
- Amplification, detection: emulsion PCR, pyrosequencing
- Detection: fluorescence
- Step: during synthesis
- At GQ Innovation Center: yes
- Unit: 1 plate (divided in ¼)

HiSeq (Illumina)
- Amplification, detection: bridge PCR, wash after every base
- Detection: fluorescence
- Step: during synthesis
- At GQ Innovation Center: yes
- Unit: flow cell of 8 lanes

Ion PGM™ Sequencer (Life Technologies), 314, 316 chip
- Amplification, detection: emulsion PCR, pyrosequencing-like
- Detection: H+ ions
- Step: during synthesis
- At GQ Innovation Center: yes
- Unit: chip

PacBio RS
- Amplification, detection: no prior amplification; single-molecule real-time sequencing (SMRT)
- Detection: fluorescence
- Step: during synthesis
- At GQ Innovation Center: no
- Unit: cell

Visuals

454 GS FLX, HiSeq, Ion PGM, PacBio RS

http://454.com/products/technology.asp

http://bcove.me/7eidiq1e?width=490&height=274

http://www.youtube.com/watch?v=77r5p8IBwJk

http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related

http://www.youtube.com/watch?v=yVf2295JqUg&feature=plcp&context=C4897380VDvjVQa1PpcFPcv91xP1YGJ3-1VyENe915toprCBsg2Jc%3D


Instrument comparison

GS-FLX+ (454)
- Nb reads per unit: 1 million per plate
- Read length: 350 – 500 bp
- Run time: 20 h
- Cost per unit: library prep 160 $; per plate 8 200 $
- Cost $ per Mb*: 7 $
- Preferred uses: amplicon sequencing; initial characterization; non-model species
- Type of errors: indels

HiSeq (Illumina)
- Nb reads per unit: ~200 million per lane
- Read length: 50 bp, 100 bp or 150 bp
- Run time: 8 days
- Cost per unit: library prep 160 $; per lane 715 $ to 2 100 $ (depending on read length)
- Cost $ per Mb*: 0.1 $
- Preferred uses: re-sequencing; frequency-based applications; NOT amplicons
- Type of errors: substitutions

Ion PGM (314, 316 or 318 chip)
- Nb reads per unit: 314: 100 000; 316: 1 million; 318: 10 million
- Read length: 35 – 400 bp
- Run time: 1.5 – 7 h
- Cost per unit: ?
- Cost $ per Mb*: 50 $
- Preferred uses: individual laboratories, small scale
- Type of errors: indels

PacBio RS
- Nb reads per unit: 50 000 reads per cell
- Read length: 6000 bp
- Run time: 2 h
- Cost per unit: ~750 $ USD per sample
- Cost $ per Mb*: 11 – 200 $
- Preferred uses: non-model species, long fragments, methylated fragments
- Type of errors: CG deletions, high error rates

*Cost estimate: Glenn TC. 2011. Field guide to next‐generation DNA sequencers. Molecular Ecology Resources 11:759–769. http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2011.03024.x/abstract

Quantity instead of length or quality

Template: long templates (gDNA) or short amplicon templates

Library preparation for long templates: gDNA fragmentation + adaptors

Library preparation for amplicons: amplification + adaptors

Each read is short (75 – 200 bp) and bears errors: each base needs to be confirmed by many reads covering the same template region

Assembly/mapping by similarity

[Figure: reads are aligned to give the deduced template sequence; regions labelled "8X coverage" are kept, the other regions are labelled "Excluded from further analyses"]
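The "many reads per region" idea can be sketched in a few lines. This is only an illustration, assuming reads are already mapped and described by a hypothetical `(start, length)` pair; the 8X threshold is taken from the figure, everything else (function names, toy data) is made up for the example.

```python
def depth_per_position(reads, template_length):
    """Count how many mapped reads cover each position of the template.

    `reads` is a list of (start, length) tuples with 0-based starts
    (a hypothetical, simplified representation of mapped reads).
    """
    depth = [0] * template_length
    for start, length in reads:
        for pos in range(start, min(start + length, template_length)):
            depth[pos] += 1
    return depth


def confident_positions(depth, min_depth=8):
    """Flag positions whose depth reaches the chosen threshold (8X here)."""
    return [d >= min_depth for d in depth]


# Toy data: 8 reads cover positions 0-9, only 3 reads cover positions 5-14.
reads = [(0, 10)] * 8 + [(5, 10)] * 3
depth = depth_per_position(reads, template_length=20)
mask = confident_positions(depth)

kept = [pos for pos, ok in enumerate(mask) if ok]
print("positions kept:", kept)
```

Positions below the threshold would be the ones "excluded from further analyses" in the figure.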

Useful in biodiversity?

How to make use of 200 million reads for your biological question? Be strategic!

Reduce the complexity of genetic material analyzed

Combine different samples into a single run (Multiplexing)

Strategies: Multiplexing

- Incorporate specific KNOWN oligos (code or index) at the beginning of each fragment, during library preparation
- Read at sequencing
- Sorted by sequence deconvolution according to their “code”

Roche 454: 30 (up to 130) Multiplex identifiers (MID), 10 bp

Illumina: 12 “Index sequences”, 6 bp

Depth of coverage: GS-FLX ¼ plate = 250 000 reads / 25 barcodes = 10 000 reads per sample. Enough for you?

[Figure: Samples 1, 2 and 3 are pooled in one tube and sequenced in a single run; the reads are then sorted back into Sample 1, Sample 2 and Sample 3 according to their “coded” sequences]
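A minimal sketch of the two ideas on this slide: the depth-per-sample arithmetic, and sorting pooled reads by their leading "code". The 6 bp barcodes and the reads below are hypothetical, chosen only for illustration; real MID or index sequences come from the manufacturer.

```python
from collections import defaultdict


def reads_per_sample(total_reads, n_barcodes):
    """Expected depth per sample if the pool is perfectly balanced."""
    return total_reads // n_barcodes


def demultiplex(reads, barcodes):
    """Sort reads into samples according to the barcode at the read start.

    `barcodes` maps barcode sequence -> sample name (all the same length).
    The barcode is trimmed off; reads with an unknown barcode go to
    the "unassigned" bin.
    """
    barcode_len = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        code, insert = read[:barcode_len], read[barcode_len:]
        bins[barcodes.get(code, "unassigned")].append(insert)
    return dict(bins)


# Slide example: a quarter GS-FLX plate shared by 25 barcoded samples.
depth = reads_per_sample(250_000, 25)

# Hypothetical barcodes and pooled reads.
barcodes = {"ACGTAC": "sample1", "TGCATG": "sample2"}
pooled = ["ACGTACGGGTTT", "TGCATGCCCAAA", "ACGTACAAAA", "NNNNNNACGT"]
sorted_reads = demultiplex(pooled, barcodes)
print(depth, sorted_reads)
```

In practice the pool is never perfectly balanced, so the per-sample depth is an average, not a guarantee.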

Strategies: Multiplexing

Incorporate specific KNOWN oligos (code or index) at beginning of the UNKNOWN fragment. During library preparation

Example of Roche 10 bp MID barcode for amplicon sequencing

Template (double-stranded):
5’-CTCGTAGACTGCGTACCAATTC.............TTACTCAGGACTCAT-3’
3’-CATCTGACGCATGGTTAAG.............AATGAGTCCTGAGTAGCAG-5’

Primer LibL_A with MID3 and TargetSpecific (Lib-L-PrimerA, key, MID3, then target-specific part):
5’-CCATCTCATCCCTGCGTGTCTCCGACTCAGAGACGCACTC + TargetSpecific GACTGCGTACCAATTC-3’

Primer LibL_B with TargetSpecific, no MID (drawn 3’ to 5’: target-specific part, key, then Lib-L-PrimerB):
3’-CAATGAGTCCTGAGTAG TargetSpecific + GACTCTGACGGTTCCGTGTGTCCCCTATCC-5’

Strategies: complexity reduction

A few organisms (1 to a few hundred): survey a few thousand loci per sample
- Enrich in gene-rich regions for gDNA sequencing
- Random genomic survey
- Transcriptome sequencing

Very many organisms (e.g. environmental studies): survey one or two loci per individual
- Amplicon sequencing with universal primers (PCR)

By hybridization

A few organisms:
- Enrich in simple-sequence repeats: hybridization to target repeats (e.g. microsatellite loci)
- Enrich in gene-rich regions for genomic DNA sequencing: hybridization to a reference set of genes (e.g. target exons)

Be creative!

Hybridization: Enrich in specific fragments (e.g. exon)

DNA fragmentation

Sequence the enriched pool

Beads

Bait (custom made)

Unbound (discarded)

Bound (retained)

! Evaluate costs carefully

From 2008 to 2013:
- Instruments give higher throughput
- Each sequencing run is cheaper
- It may be cheaper not to target specific regions

By methylation-sensitive RE

Be creative!

A few organisms:
- Enrich in gene-rich regions for genomic DNA sequencing
- Elimination of methylation-rich regions (repetitive elements in plants)

[Figure: nuclear DNA fragmentation; insert in E. coli, which digests methylated DNA; sequence the enriched pool]

By amplification

One or a few organisms: randomly sample the whole genome

- Amplification: “AFLP-like”, but sequence instead of scoring length polymorphism
- ddRAD: double digest restriction-site-associated DNA sequencing, to find SNPs

[Figure: DNA fragmentation with two enzymes (Enz.A, Enz.B), adaptor ligation, then amplification with adaptor primers]

By amplification

One or a few organisms: randomly sample the whole genome

- ddRAD: double digest restriction-site-associated DNA sequencing
- Powerful when coupled with multiplexing

[Figure: fragments from Sample 1 and Sample 2, cut by Enz.A and Enz.B, each receive an adaptor carrying a multiplex « index »]
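The double-digest idea can be tried in silico. The sketch below keeps only fragments flanked by one cut from each enzyme, optionally within a size window. It assumes, purely for illustration, EcoRI (G^AATTC) as Enz.A and MspI (C^CGG) as Enz.B; the slides do not name the enzymes, and the toy sequence is invented.

```python
import re


def cut_positions(seq, site, offset):
    """Positions where an enzyme cuts: start of each site plus cut offset."""
    # A lookahead is used so that overlapping sites are all found.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enz_a=("GAATTC", 1), enz_b=("CCGG", 1),
                  size_range=(0, float("inf"))):
    """Return fragments with an Enz.A cut on one end and an Enz.B cut on the other.

    Enzymes are (recognition site, cut offset) pairs; the defaults are an
    assumption for this example, not taken from the slides.
    """
    cuts = sorted(
        [(p, "A") for p in cut_positions(seq, *enz_a)]
        + [(p, "B") for p in cut_positions(seq, *enz_b)]
    )
    fragments = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        if e1 != e2 and size_range[0] <= p2 - p1 <= size_range[1]:
            fragments.append(seq[p1:p2])
    return fragments


# Toy genome with two EcoRI sites and two MspI sites.
genome = ("TTTGAATTC" + "A" * 10 + "CCGG" + "T" * 5
          + "CCGG" + "A" * 8 + "GAATTC" + "TT")
all_fragments = double_digest(genome)
selected = double_digest(genome, size_range=(14, 20))
print([len(f) for f in all_fragments], selected)
```

The fragment between the two MspI sites is dropped because both ends come from the same enzyme; size selection then narrows the set further, which is how the number of loci sampled per genome is tuned.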

Genome complexity reduction: RNA

A few samples: Transcriptome sequencing

- Total RNA: RNAseq
- Reduce to mRNA only (polyA)
- Reduce to microRNA only

! Driven by external conditions and by tissue type

Needs a high number of reads: Illumina preferred

Transcription (DNA –> RNA)

Translation (RNA –> protein)

Genome complexity reduction: RNA

A few organisms: Reminder: mRNA sequences include non-coding regions (UTR)

[Figure: gene with 5’ UTR, exons, intron and 3’ UTR; the mature mRNA carries the 5’ UTR, the CDS, the 3’ UTR and a poly-A tail (AAAAAAA)]

Genome complexity reduction: Amplicon

Very many organisms: Amplicon sequencing with universal primers for ONE locus

Limitation: primers may not amplify equally well in ALL target organisms

Environmental samples targeting ITS, 16S, CO1 (the barcode loci)

[Figure: three templates; primers anneal to the first two, primers DO NOT anneal to the third]

Case studies in biodiversity

Bartram et al 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Castoe et al. 2011 Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19

Bacterial communities

Bartram et al: Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Objective: develop a protocol for community genetic diversity (tested on two samples, costs calculated for 20 samples)

Material:
- Soils from arctic tundra. Total DNA extracted with FastDNA (MP Biomedicals)
- Also includes a control bacterial mix in liquid media
- 20 X 5 $ = 100 $

Bacterial communities

Molecular steps

- Primer for hypervariable region 3 (V3) of the microbial 16S rRNA gene, 81 bp, PAGE-purified:
  caagcagaagacggcatacgagatCGTGATgtgactggagttcagacgtgtgctcttccgatctATTACCGCGGCTGCTGG
  Segments, in order: flow-cell-binding, index, Illumina primer, target gene
- Amplify with high-fidelity polymerase (Phusion)
- Extract desired length, 200-250 bp (columns)
- Multiplexing: yes, including technical replicates
- Quality control for libraries (e.g. Agilent Bioanalyzer)
- Sequencing: paired-end 2 x 125 bp, Illumina GAIIx (would now be HiSeq)

Costs listed on the slide:
25 X 67 $ = 1 675 $
1 X 90 $ = 90 $
25 X 1.5 $ = 40 $
25 X 50 $ = 1 250 $
1 X = 2 090 $
Total = 5 250 $
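The 81 bp fusion primer alternates lower and upper case to mark its four parts, so it can be split on case changes. In this small sketch the pairing of labels with segments simply follows the order of the labels on the slide, which is an assumption.

```python
from itertools import groupby

# Fusion primer copied from the slide; case marks the segment boundaries.
PRIMER = ("caagcagaagacggcatacgagat" "CGTGAT"
          "gtgactggagttcagacgtgtgctcttccgatct" "ATTACCGCGGCTGCTGG")

# Label order as printed on the slide (assumed to match segment order).
LABELS = ["flow-cell-binding", "index", "Illumina primer", "target gene"]


def split_by_case(seq):
    """Split a sequence into runs of consecutive same-case letters."""
    return ["".join(run) for _, run in groupby(seq, key=str.islower)]


segments = dict(zip(LABELS, split_by_case(PRIMER)))
for label, segment in segments.items():
    print(f"{label:18s} {len(segment):3d} bp  {segment}")
```

Swapping the 6 bp index while keeping the other three segments fixed is what gives each library its own "code".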


Bioinformatics:
- Base calling and error estimation: Illumina Analysis Pipeline
- Quality filtering, sorting of reads according to index sequence, contig assembly (custom made, PANDAseq)
- Discard read pairs with 1 or more mismatches between the two overlapping fragments of the paired-end, or with 1 or more ambiguous bases
- Assignment to taxonomic affiliations: naïve Bayesian classification (Ribosomal Database Project, RDP classifier), cutoff 0.5
- Good’s coverage for each library to estimate sequence coverage (C = 1 – n1/N)
- CD-HIT to cluster arctic tundra datasets at 97% sequence identity
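Good's coverage, C = 1 – n1/N, is easy to compute once each sequence has been assigned to a cluster: n1 is the number of clusters seen exactly once and N is the total number of sequences. A minimal sketch with made-up cluster labels:

```python
from collections import Counter


def goods_coverage(cluster_assignments):
    """Good's coverage C = 1 - n1/N.

    n1: number of clusters represented by a single sequence (singletons)
    N:  total number of sequences in the library
    """
    counts = Counter(cluster_assignments)
    n1 = sum(1 for c in counts.values() if c == 1)
    total = sum(counts.values())
    return 1 - n1 / total


# Toy library: 10 sequences in 4 clusters, two of which are singletons.
library = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
coverage = goods_coverage(library)
print(f"Good's coverage: {coverage:.2f}")
```

A value close to 1 suggests that few additional clusters would be found by sequencing deeper.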

Sequence analyses

[Figure: analysis pipeline. Raw reads are sorted by index seq.; paired-end reads are assembled (custom program); contigs are clustered by modified single linkage (CD-HIT); classification / diversity estimate (RDP / QIIME)]


Results

- Total of 12 million raw reads
- About 50% of the reads are discarded:
  - Raw reads: 7.6 million and 4.4 million for the two technical replicates
  - Post-assembly: 4.1 million and 2.4 million for the two technical replicates
- Average post-assembly contig: 150 ± 11 bases (without primers). Overlap: 66 ± 11 bases
- Pre-clustering at 97% sequence identity
- Estimated error rate (from control library): 1 error per 5 contigs (1%/base). Higher than Sanger sequencing
- Found a contaminant in the growth media of a control
- Duplicate arctic tundra libraries displayed a high degree of similarity:
  - Comparison of phyla between the two libraries (AT1 to AT2): r = 0.999
  - The majority of sequence clusters (99.57%) were detected in both replicates

Isolation of novel microsatellite loci

Castoe et al. 2011 Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Objective: Discover novel microsatellite loci from diverse organisms

Material: Total genomic DNA from many different organisms (the paper reports results for only 3 species; I calculate costs for 8)

Also read: Jennings et al. 2011. “Multiplexed Microsatellite Recovery Using Massively Parallel Sequencing.” Molecular Ecology Resources. http://doi.wiley.com/10.1111/j.1755-0998.2011.03033.x.


Genome complexity reduction: none, direct sequencing

Material: genomic DNA (5 µg), one individual per species

Library preparation for Illumina sequencing (likely on ~8 to 10 species)

Multiplexing: yes, at the sequencing facility, during library preparation

Sequencing platform: Illumina GAIIx, 120 bp paired-end (would now be HiSeq 2000)

Primers need to be ordered for each locus afterwards: ~700 $ for 50 loci … up to 5 500 $ for 8 species

Costs listed on the slide:
8 X 5 $ = 40 $
8 X 160 $ = 1 280 $
1 X 2 090 $ = 2 090 $
Total: 3 400 $
Total (with primers): 9 000 $


Bioinformatics: simple, no assembly, no comparison to a reference genome. In a Perl script:
- Identify reads that contain perfect SSRs: 2-mer to 6-mer, repeated at least 6 times
- Sort by SSR type (de-multiplex)
- Design primers (with Primer3)
- Discard the primer pairs that also occur in other reads
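The first step (finding perfect SSRs) can be approximated with a regular expression. This is a Python sketch of the rule stated above (motif of 2 to 6 bases, at least 6 tandem copies), not the authors' Perl script; skipping single-base runs is an extra assumption made here.

```python
import re

# A motif of 2-6 bases followed by at least 5 more copies of itself.
# The lazy quantifier makes the shortest motif win (AC rather than ACAC).
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{5,}")


def find_perfect_ssrs(read):
    """Return (motif, number of repeats, start position) for each perfect SSR."""
    hits = []
    for match in SSR_PATTERN.finditer(read):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of a single base is not a 2-6-mer repeat
        n_repeats = len(match.group(0)) // len(motif)
        hits.append((motif, n_repeats, match.start()))
    return hits


# Toy read with an (AC)7 and an (ATG)6 microsatellite.
read = "GGT" + "AC" * 7 + "TTGCA" + "ATG" * 6 + "C"
ssrs = find_perfect_ssrs(read)
print(ssrs)
```

Reads with a hit, and enough flanking sequence on both sides, would then go to primer design.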


Results:
- Number of raw reads: not reported
- Used 5 million paired-end reads per sample (a 1X coverage)
- Mean sequence length: not reported
- Between 150 000 and 540 000 potential loci (containing microsatellites)
- Primers designed for 72 000 to 174 000 loci, depending on species
- With extra stringency (only 3- to 6-mer, >7 repeats): 200 to 2000 loci
- Primers not tested for amplifiability

Conclusions: Large variation in number and proportion of motifs (3-mer, 4-mer…) in the different organisms.

ddRAD-seq

Peterson et al. 2012 Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Objective: Aim for 10 000 SNPs, randomly distributed genome-wide, at 10X coverage

Material: Total DNA extracted from 54 P. leucopus , one population (Qiagen kits)

54 X 5 $ = 270 $

ddRAD-seq

Complexity reduction: yes, complex. Digestion, annealing, size selection, many purification steps

Multiplexing: yes, 54 samples. From simulations that include genome size and nucleotide frequencies, they estimate needing 400 000 reads per individual for the 300 ± 30 bp fragments

Platform: two lanes of GAII (now HiSeq 2000), paired-end

2 X 2 010 $ = 4 020 $

ddRAD-seq

- Many oligos, combining an « index » on the 5’ and on the 3’ ends: 110 X ~30 $ = 3 300 $
- Digestions and PCR amplifications; many purifications and precise size selection (Pippin Prep): enzymes + purif. = 1 250 $

Big total = ~9 000 $

[Figure: adaptor and primer layout around the gDNA insert. Sequences as extracted from the slide; the upper-strand oligos read left to right, the lower-strand oligos are drawn in the opposite orientation]
PCRprimer1 (46bp): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACG
adaptorP1, Oligo1.1 (one of 48): ACACTCTTTCCCTACACGACGCTCTTCCGATCTAATTA-3’
adaptorP1, Oligo1.2: TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGATTAATTTAA-5’-P
gDNA, upper strand: AATTCNNN…NNN
gDNA, lower strand: GNNN…NNNGGC
adaptorP2, Oligo2.1: P-5’-CGAGATCGGAAGAGCGAGAACAA
adaptorP2, Oligo2.2: TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG
PCRprimer2 (1 of 12): CGTGTGCAGACTTGAGGTCAGTGTAGTGCTAGAGCATACGGCAGAAGACGAAC

ddRAD-seq – Sequence analyses

Initial sequence processing:
- De-multiplex: accept 1 bp mismatch in the 4 bp barcode
- Assign each read to a single individual
- Collapse identical reads into one sequence, retaining their frequency

No reference genome; not the “Stacks” package:
- Compute pairwise distances between all reads (BLAT)
- MCL to group similar reads (ortholog inference)
- Count unique sequences in a cluster (= locus); count how many are beyond the ploidy level (= % of error-containing reads)
- Align orthologs (MUSCLE)
- Write alignments as reference-ordered SAM/BAM files
- GATK UnifiedGenotyper to genotype

Error rate: ranged from 0.18 to 0.22% per nucleotide, i.e. about 1 read in 10

Technical replicates? No
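The initial processing steps (tolerant de-multiplexing, then collapsing identical reads while keeping their frequency) can be sketched as follows. The 4 bp barcodes and the reads are hypothetical, and discarding reads whose prefix is within one mismatch of more than one barcode is a design choice of this sketch, not necessarily that of the paper.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatch=1):
    """Return the single barcode within `max_mismatch` of the read prefix.

    Returns None when no barcode, or more than one, is close enough.
    """
    prefix = read[:len(barcodes[0])]
    close = [b for b in barcodes if hamming(prefix, b) <= max_mismatch]
    return close[0] if len(close) == 1 else None


def demultiplex_and_collapse(reads, barcodes):
    """Per barcode, count how often each identical insert sequence occurs."""
    per_individual = {}
    for read in reads:
        barcode = assign_barcode(read, barcodes)
        if barcode is None:
            continue  # unassignable reads are dropped
        insert = read[len(barcode):]
        per_individual.setdefault(barcode, Counter())[insert] += 1
    return per_individual


# Hypothetical barcodes (pairwise distance >= 3) and pooled reads.
barcodes = ["AAAA", "CCGG", "GTGT"]
reads = ["AAAATTGCA", "AAATTTGCA", "CCGGTTGCA", "ACGTTTGCA"]
collapsed = demultiplex_and_collapse(reads, barcodes)
print(collapsed)
```

Tolerating one mismatch is only safe if the barcodes differ from each other at several positions, as they do in this toy set.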

ddRAD-seq – results

The 54 wild Peromyscus from the same population:
- Total reads: not reported (~2 X 21 million)
- Assigned to an individual: not reported
- Discarded 5.4% of reads

SNPs discovered:
- Variable regions (loci): 6 200 found
- Polymorphic sites for >70% of individuals: 16 000 sites found

In an analysis of samples from different populations:
- SNPs in multi-SNP loci: >80%
- These multi-SNP loci are usually excluded in other analyses

Phylogenies with polyploids

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19

Objective: Phylogenies in polyploid grasses; recent, rapid radiation (in a time- and cost-effective experimental design)

Phylogenies with polyploids

Material: Total DNA extracted from 60 individuals of 11 different polyploid Poa species: 60 X 5 $ = 300 $

Complexity reduction:
- Amplify 3 cp genes (rpl32-trnL, rpoB-trnC and trnH-psbA) and two nuclear genes (DMC1 and CDO504) from each of the 60 samples
- Target amplicons are < 500 bp
- Pool the 5 different PCR products for one individual

Costs listed on the slide:
Primers: 10 X 5 $ = 50 $
Enzymes: 1 X 150 $ = 150 $

Phylogenies with polyploids

Multiplexing: yes, addition of ds-adaptors with MID barcodes by ligation
- Design 64 different barcodes (with 3 technical replicates)
- Ligation of barcode-adapters to the pools of amplicons, purification, pooling, quality control

Sequencing platform: ¼ plate of a Roche 454 with Titanium 2010 chemistry

Costs listed on the slide:
64 X 2 X 9 $ = 1 160 $
64 X 2 X 6 $ = 800 $
1 X 100 $ = 100 $
64 X 1.5 $ = 96 $
1 X 50 $ = 50 $
1 X 2 140 $ = 2 140 $
Total = ~4 900 $

[Figure: double-stranded barcoded adaptors, sequences as extracted from the slide]
TitaniumAdapterA 25bp + MIDbarcode:
TCGTATCGCCTCCCTCGCGCCATCAG + ACGAGTGCGT +
TGCATAGCGGAGGGAGCGCGGTAGTA TGCTCACGCA
A - TitaniumAdapterB:
A + CTGAGCGGGCTGGCAAGGCGCATAG
GACTCGCCCGACCGTTCCGCGTATC

Phylogenies with polyploids

Bioinformatics: Galaxy platform (free)
- Sort the gene regions by regular expression (REGEX) matching the gene-specific primers
- Discard: low-quality reads, short sequences, reads matching no MID barcode
- Calculate the error rate from SNPs at chloroplast regions
- Detect and discard PCR recombinants: alleles that occurred at <5% for a species, OR alleles whose two ends do not match the same common allele
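Sorting reads by gene region with regular expressions could look like the sketch below. The primer sequences are invented for the example (they are NOT the primers used in the paper); a character class such as `[CT]` stands for a degenerate base.

```python
import re

# Hypothetical gene-specific primers, anchored at the start of the read.
PRIMERS = {
    "trnH-psbA": re.compile(r"^ACTGCCTTGATC[CT]ACTTGG"),
    "DMC1": re.compile(r"^TGC[AG]GCATTGATCCTAA"),
}


def sort_by_gene(reads):
    """Bin reads by the first gene-specific primer pattern they match."""
    bins = {gene: [] for gene in PRIMERS}
    bins["no_match"] = []
    for read in reads:
        for gene, pattern in PRIMERS.items():
            if pattern.search(read):
                bins[gene].append(read)
                break
        else:
            bins["no_match"].append(read)
    return bins


reads = [
    "ACTGCCTTGATCTACTTGGAAA",  # matches the trnH-psbA pattern
    "TGCGGCATTGATCCTAACCC",    # matches the DMC1 pattern
    "GGGGGG",                  # matches neither
]
by_gene = sort_by_gene(reads)
print({gene: len(group) for gene, group in by_gene.items()})
```

In a real run this step would come after barcode sorting, so that each bin holds the reads of one gene for one individual.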

Phylogenies with polyploids

Results:
- 121 000 raw reads. Length: 40 to 775 bp (mean 278 bp)
- 111 200 (92%) match a gene-specific primer
- 70 601 reads (58%) remained after barcode sorting and quality control
- Useful sequence for 281 out of 320 (88%) targets = 12% missing
- Sequence error rate: 0.13%
- PCR recombination: 2.9% of CDO504 reads and 14% of DMC1 reads
- Technical replicates: P. costiniana: identical alleles (at the < 2 bp level), but one extra allele (= PCR error)
- Two distinct copies (and more) of each nuclear gene deduced: DMC1 has 19 (4.0%) base differences and CDO504 has 35 (8.5%), as well as 7-bp and 4-bp indels
- One extra gene copy discovered for CDO504, showing a 57-bp deletion in an intron

Phylogenies with polyploids

[Figure: Number of sequence reads obtained for each marker/individual combination. A - After quality control and barcode deconvolution. B - Useful sequence reads remaining after alignment and editing.]

[Figure: Percentage of useful reads gained for each nuclear gene copy and allele, including recombinant reads.]

Phylogenies with polyploids

Results: Phylogenetic analyses
- Timing of polyploidization: took place before the Australian and the American species diverged
- Extensive haplotype sharing between taxa currently treated as different species
- Nuclear gene networks showed incongruence both with each other and with the chloroplast gene networks
- Tasmania-mainland differentiation detected
- On the local scale, strong spatial genetic structure detected using two of the chloroplast markers. This suggests a smaller neighborhood for seed dispersal than for pollen dispersal

To remember

- Diversity of protocols and experimental designs (Be creative!)
- Budget: 3 000 $ to 9 000 $; the main costs can be primers and library preparation
- Standards are rapidly increasing: technical replicates required
- Challenge: data analysis
  - No standard analytical protocol (custom, in-house developed)
  - No standard calculation of error rate
  - Initial steps are computer intensive (30 million short reads…)
- Results:
  - Half of the reads are discarded
  - Many target loci will be missing
  - Unequal proportion of technical replicates in the final dataset
  - Prone to PCR recombination and chimeric assemblies
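The budget range can be checked by adding up the line items quoted on the case-study slides. The figures below are copied from those slides; grouping them under these four labels is the only assumption. The sums land within a few dollars of the rounded totals shown on the slides.

```python
# Line items (in $) as listed on the four case-study slides.
CASE_STUDIES = {
    "Arctic soil (16S)": [100, 1675, 90, 40, 1250, 2090],
    "Microsatellites (without primers)": [40, 1280, 2090],
    "SNP (ddRAD)": [270, 4020, 3300, 1250],
    "Polyploids": [300, 50, 150, 1160, 800, 100, 96, 50, 2140],
}

totals = {name: sum(items) for name, items in CASE_STUDIES.items()}

for name, total in totals.items():
    print(f"{name}: {total} $")
```

Adding the ~5 500 $ of primers to the microsatellite case brings it to roughly 9 000 $, the upper end of the range.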

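Comparison

Arctic soil
- Platform: GA II, one lane
- Nb samples: 24
- Total cost: 5 250 $
- % reads retained: 53 %
- Nb clean unique reads: 4.1 million vs 2.4 million for tech rep
- Tech. reps: yes
- Error rate: 1%
- Missed target: NA

Microsat
- Platform: GAIIx, one lane
- Nb samples: ? (? 8)
- Total cost: 3 400 $ (without primers)
- % reads retained: ?
- Nb clean unique reads: ?
- Tech. reps: no
- Error rate: ?
- Missed target: NA

SNP (ddRAD)
- Platform: GA II, two lanes
- Nb samples: 54
- Total cost: 9 000 $
- % reads retained: 95%
- Nb clean unique reads: 7 000 loci w SNPs
- Tech. reps: no
- Error rate: 0.18 – 0.22% per base
- Missed target: ?

Polyploids
- Platform: GS FLX, ¼ plate
- Nb samples: 61
- Total cost: 4 900 $
- % reads retained: 58 %
- Nb clean unique reads: 70 601
- Tech. reps: yes
- Error rate: 0.13%
- Missed target: 12 %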

Thank you!
