massively parallel sequencing for biodiversity

Massively parallel sequencing for biodiversity science Annie Archambault, Centre de la Science de la Biodiversité du Québec qcbs.ca April 2013

Upload: qcbsannie

Post on 23-Oct-2014


DESCRIPTION

Author: Annie Archambault, Research professional at the Quebec Center for Biodiversity Science (qcbs.ca)

Objective: Provide an overview of NGS technology for researchers in biodiversity science

Description: Slides presenting the major massively parallel sequencing platforms (next-generation sequencing, NGS), as well as strategies for reducing genome complexity and for multiplexing different samples into one sequencing run. Laboratory steps and estimated costs are summarized for a few case studies.

Bartram et al. 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Malausa et al. 2011. High-throughput microsatellite isolation through 454 GS-FLX Titanium pyrosequencing of enriched DNA libraries. Molecular Ecology Resources 11: 638-644. http://doi.wiley.com/10.1111/j.1755-0998.2011.02992.x

Cosart et al. 2011. Exome-wide DNA capture and next generation sequencing in domestic and wild species. BMC Genomics 12: 347. http://www.biomedcentral.com/1471-2164/12/347

Timmermans et al. 2010. Why barcode? High-throughput multiplex sequencing of mitochondrial genomes for molecular systematics. Nucleic Acids Research 38: e197–e197. http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gkq807

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19

Koopman et al. 2011. The Microbial Phyllogeography of the Carnivorous Plant Sarracenia alata. Microbial Ecology 61(4): 750–758. http://www.springerlink.com/index/10.1007/s00248-011-9832-9

TRANSCRIPT

Page 1: Massively parallel sequencing for Biodiversity

Massively parallel sequencing for biodiversity science

Annie Archambault, Centre de la Science de la Biodiversité du Québec

qcbs.ca

April 2013

Page 2: Massively parallel sequencing for Biodiversity

Synonyms: Next generation sequencing (NGS), Massively parallel sequencing, High throughput sequencing, 2nd or 3rd generation sequencing…

Parallelize the sequencing process: produce thousands of short sequencing reads at once

REPLACES CLONING AND CLONE SCREENING

REPLACES INDIVIDUAL SEQUENCING REACTIONS

Page 3: Massively parallel sequencing for Biodiversity

Outline – Uses for biodiversity studies

Very brief review of the 4 main platforms

Examples of experimental strategies (complexity reduction and multiplexing)

Laboratory steps and costs for 4 case studies

Disclaimer: I still have limited experience with these instruments; my understanding comes from intensive reading

Page 4: Massively parallel sequencing for Biodiversity

Useful reading

Review of the chemistry and the workflow: Myllykangas S, Buenrostro J, Ji HP: Overview of Sequencing Technology Platforms. In Bioinformatics for High Throughput Sequencing. Springer New York; 2012: 11–25. http://www.springerlink.com/content/n6u33m1335750g57/

Review of technologies and applications in biodiversity: Purdy KJ, Hurd PJ, Moya-Laraño J, Trimmer M, Oakley BB, Woodward G: Systems Biology for Ecology: From Molecules to Ecosystems. In Advances in Ecological Research. 2010: 87–149. http://linkinghub.elsevier.com/retrieve/pii/B9780123850058000034

Page 5: Massively parallel sequencing for Biodiversity

Instruments comparison

GS-FLX+ (454)
Amplification, detection: Pyrosequencing, emulsion PCR
Detection step: Fluorescence, during synthesis
At GQ Innovation Center: Yes
Unit: 1 plate (divided in ¼)

HiSeq (Illumina)
Amplification, detection: Bridge PCR, wash after every base
Detection step: Fluorescence, during synthesis
At GQ Innovation Center: Yes
Unit: Flow cell of 8 lanes

Ion PGM™ Sequencer (LifeTechnologies), 314 or 316 chip
Amplification, detection: Emulsion PCR, pyrosequencing-like
Detection step: H+ ions, during synthesis
At GQ Innovation Center: Yes
Unit: Chip

PacBio RS
Amplification, detection: No prior amplification, single-molecule real-time sequencing (SMRT)
Detection step: Fluorescence, during synthesis
At GQ Innovation Center: No
Unit: Cell

Page 6: Massively parallel sequencing for Biodiversity

Visuals

454 GS FLX

HiSeq Ion PGM PacBio RS http://454.com/products/technology.asp http://bcove.me/7eidiq1e?width=490&height=274

http://www.youtube.com/watch?v=77r5p8IBwJk

http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related

http://www.youtube.com/watch?v=yVf2295JqUg&feature=plcp&context=C4897380VDvjVQa1PpcFPcv91xP1YGJ3-1VyENe915toprCBsg2Jc%3D

Page 7: Massively parallel sequencing for Biodiversity

Visuals (images only): 454 GS FLX, HiSeq, Ion PGM, PacBio RS

Page 8: Massively parallel sequencing for Biodiversity

Instrument comparison

GS-FLX+ (454)
Nb reads per unit: 1 million per plate
Read length: 350 – 500 bp
Run time: 20 h
Cost: library prep 160 $; per plate 8 200 $
$ per Mb *: 7 $
Preferred uses: amplicon sequencing; initial characterization; non-model species
Type of errors: indels

HiSeq (Illumina)
Nb reads per unit: ~200 million per lane
Read length: 50 bp, 100 bp or 150 bp
Run time: 8 days
Cost: library prep 160 $; per lane 715 $ to 2 100 $ (depending on read length)
$ per Mb *: 0.1 $
Preferred uses: re-sequencing; frequency-based applications. NOT amplicons
Type of errors: substitutions

Ion PGM (314, 316 or 318 chip)
Nb reads per unit: 314: 100 000; 316: 1 million; 318: 10 million
Read length: 35 – 400 bp
Run time: 1.5 – 7 h
Cost: ?
$ per Mb *: 50 $
Preferred uses: individual laboratories, small scale
Type of errors: indels

PacBio RS
Nb reads per unit: 50 000 reads per cell
Read length: 6000 bp
Run time: 2 h
Cost: ~750 $ USD per sample
$ per Mb *: 11 – 200 $
Preferred uses: non-model species, long fragments, methylated fragments
Type of errors: CG deletions, high error rates

*Cost estimate: Glenn TC. 2011. Field guide to next‐generation DNA sequencers. Molecular Ecology Resources 2011, 11:759–769. http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2011.03024.x/abstract

Page 9: Massively parallel sequencing for Biodiversity

Quantity instead of length or quality

Template: long templates (gDNA) or short amplicon templates

Each read is short (75 – 200 bp) and bears errors, so each base needs to be confirmed by many reads covering the same template region

Library preparation for long templates: gDNA fragmentation + adaptors

Library preparation for amplicons: amplification + adaptors

Page 10: Massively parallel sequencing for Biodiversity

Quantity instead of length or quality

Reads are assembled or mapped by similarity to give the deduced template sequence

Figure: regions of the deduced template sequence are labelled either "8X coverage" or "Excluded from further analyses", for both the gDNA library (fragmentation + adaptors) and the amplicon library (amplification + adaptors)
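The idea that an error-prone base is confirmed by many reads covering the same position can be sketched as a majority vote over reads that are already aligned. This is a minimal illustration, not the method of any particular assembler; the function name, the '-' gap convention and the use of 8 as a minimum depth (borrowed from the "8X coverage" figure label) are assumptions.

```python
from collections import Counter

def consensus(aligned_reads, min_depth=8):
    """Majority-vote consensus over reads aligned to the same template.

    aligned_reads: equal-length strings; '-' marks positions a read does not cover.
    Positions covered by fewer than min_depth reads are reported as 'N',
    i.e. they would be excluded from further analyses.
    """
    length = len(aligned_reads[0])
    called = []
    for i in range(length):
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if len(column) < min_depth:
            called.append("N")
        else:
            called.append(Counter(column).most_common(1)[0][0])
    return "".join(called)

# Eight reads cover the template; one of them carries a substitution error.
reads = ["ACGT"] * 7 + ["ACTT"]
print(consensus(reads))       # ACGT
print(consensus(reads[:3]))   # NNNN (depth 3 is below the threshold)
```

With eight reads the single erroneous base is outvoted; with only three reads every position falls below the depth threshold and is masked.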

Page 11: Massively parallel sequencing for Biodiversity

Useful in biodiversity?

How to make use of 200 million reads for your biological question? Be strategic!

Reduce the complexity of genetic material analyzed

Combine different samples into a single run (Multiplexing)

Page 12: Massively parallel sequencing for Biodiversity

Strategies: Multiplexing

Incorporate specific KNOWN oligos (code or index) at the beginning of each fragment, during library preparation

The code is read during sequencing

Reads are sorted by sequence deconvolution according to their “code”

Roche 454: 30 (up to 130) multiplex identifiers (MID), 10 bp

Illumina: 12 “index sequences”, 6 bp

Depth of coverage: GS-FLX ¼ plate = 250 000 reads / 25 barcodes = 10 000 reads per sample. Enough for you?

Figure: Sample 1, Sample 2 and Sample 3 are pooled in one tube, sequenced in a single run, then sorted back into Sample 1, Sample 2 and Sample 3 according to the “coded” sequence
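Sorting pooled reads back to their samples, and the depth-of-coverage arithmetic from this slide, can be sketched as follows. The three 6 bp index sequences and the toy reads are made up for illustration; real kits ship their own index sets, and real demultiplexers usually tolerate sequencing errors in the index.

```python
def demultiplex(reads, barcodes):
    """Sort reads by their leading barcode (exact match); the barcode is trimmed off."""
    bins = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        for code, sample in barcodes.items():
            if read.startswith(code):
                bins[sample].append(read[len(code):])
                break
        else:
            unassigned.append(read)  # no known barcode at the start of the read
    return bins, unassigned

def reads_per_sample(total_reads, n_barcodes):
    """Expected depth per sample if the pool is perfectly balanced."""
    return total_reads // n_barcodes

# Hypothetical 6 bp indexes and toy reads
barcodes = {"ACGTAC": "sample1", "TGCATG": "sample2", "GATCGA": "sample3"}
reads = ["ACGTACTTTTGGGG", "TGCATGCCCCAAAA", "GATCGAGGGGTTTT", "NNNNNNACGT"]
bins, unassigned = demultiplex(reads, barcodes)
print(bins)
print(unassigned)

# GS-FLX quarter plate shared by 25 barcoded samples
print(reads_per_sample(250_000, 25))  # 10000
```

In practice the pool is never perfectly balanced, so the 10 000 reads per sample is an average rather than a guarantee.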

Page 13: Massively parallel sequencing for Biodiversity

Strategies: Multiplexing

Incorporate specific KNOWN oligos (code or index) at beginning of the UNKNOWN fragment. During library preparation

Example of Roche 10 bp MID barcode for Amplicon sequencing

5'-CTCGTAGACTGCGTACCAATTC.............TTACTCAGGACTCAT-3’

3’ - CATCTGACGCATGGTTAAG.............AATGAGTCCTGAGTAGCAG-5’

Forward primer, target-specific part: GACTGCGTACCAATTC-3’

Reverse primer, target-specific part: 3’-CAATGAGTCCTGAGTAG

Primer LibL_A (Lib-L-PrimerA + key + MID3 + TargetSpecific): 5’-CCATCTCATCCCTGCGTGTCTCCGACTCAGAGACGCACTC + TargetSpecific

Primer LibL_B (Lib-L-PrimerB + key + TargetSpecific, no MID), written 3’ to 5’: TargetSpecific + GACTCTGACGGTTCCGTGTGTCCCCTATCC-5’
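As a sketch of how such a fusion primer is put together, the snippet below concatenates adaptor, key, MID and target-specific parts. The way the 40-base string from the slide is split into a 26-base adaptor, a 4-base key and the 10-base MID3 is my reading of the labels, so treat the boundaries as an assumption; the function name is illustrative only.

```python
# Split of the slide's primer A string; the boundaries are assumed from the labels.
LIB_L_PRIMER_A = "CCATCTCATCCCTGCGTGTCTCCGAC"
KEY = "TCAG"
MID3 = "AGACGCACTC"                 # 10 bp multiplex identifier
TARGET_FWD = "GACTGCGTACCAATTC"     # forward target-specific part from the slide

def fusion_primer(adaptor, key, mid, target):
    """Return the full primer, 5' to 3': adaptor + key + MID + target-specific part."""
    return adaptor + key + mid + target

primer_a = fusion_primer(LIB_L_PRIMER_A, KEY, MID3, TARGET_FWD)
print(primer_a, len(primer_a))

# Primer B is drawn 3' to 5' on the slide; reversing the string gives 5' to 3'.
primer_b_drawn = "GACTCTGACGGTTCCGTGTGTCCCCTATCC"
primer_b_5to3 = primer_b_drawn[::-1]
print(primer_b_5to3)
```

Swapping MID3 for another MID while keeping the other parts fixed gives one forward primer per sample, which is what makes the amplicons sortable after sequencing.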

Page 14: Massively parallel sequencing for Biodiversity

Strategies: complexity reduction

A few organisms (1 to a few hundred): survey a few thousand loci per sample

Enrich in gene-rich regions for gDNA sequencing

Random genomic survey

Transcriptome sequencing

Very many organisms (e.g. environmental studies): survey one or two loci per individual

Amplicon sequencing with universal primers (PCR)

Page 15: Massively parallel sequencing for Biodiversity

By hybridization

A few organisms:

Enrich in simple-sequence repeats: hybridization to target repeats (e.g. microsatellite loci)

Enrich in gene-rich regions for genomic DNA sequencing: hybridization to a reference set of genes (e.g. target exons)

Be creative!

Hybridization: enrich in specific fragments (e.g. exons)

Workflow: DNA fragmentation, hybridization to custom-made baits on beads, unbound fragments discarded, bound fragments retained, sequence the enriched pool

Page 16: Massively parallel sequencing for Biodiversity

! Evaluate costs carefully

From 2008 to 2013? Instruments give higher throughput; each sequencing run is cheaper

It may be cheaper not to target specific regions

Page 17: Massively parallel sequencing for Biodiversity

By methylation-sensitive RE

Be creative!

A few organisms: enrich in gene-rich regions for genomic DNA sequencing

Elimination of methylation-rich regions (plant repetitive elements)

Workflow: nuclear DNA fragmentation, insertion in E. coli (which digests methylated DNA), sequence the enriched pool

Page 18: Massively parallel sequencing for Biodiversity

By amplification

One or a few organisms: Randomly sample the whole genome

Amplification: “AFLP-like”, but sequence instead of scoring length polymorphism

ddRAD: double digest restriction-site-associated DNA sequencing, to find SNPs

Workflow: DNA fragmentation with two enzymes (Enz.A, Enz.B), adaptor ligation, amplification with adaptor primers

Page 19: Massively parallel sequencing for Biodiversity

By amplification

One or a few organisms: randomly sample the whole genome

ddRAD: double digest restriction-site-associated DNA sequencing

Powerful when coupled with multiplexing

Figure labels: Sample 1, Sample 2, Enz.A, Enz.B, adaptor, multiplex « index »
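The double-digest idea can be explored in silico before going to the bench: cut a sequence with two enzymes and keep only fragments that have one cut site of each enzyme and fall in the chosen size window. The sketch below is a simplification under stated assumptions: the recognition sites GAATTC and CCGG are only example choices for Enz.A and Enz.B, the cut is placed at the start of each recognition site rather than at the true offset, and the default 270 to 330 bp window is illustrative.

```python
import re

def double_digest(seq, site_a, site_b, min_len=270, max_len=330):
    """Return fragments flanked by one site_a cut and one site_b cut.

    Simplification: each cut is placed at the start of the recognition site;
    real enzymes cut at a fixed offset inside the site.
    """
    cuts = [(m.start(), "A") for m in re.finditer(site_a, seq)]
    cuts += [(m.start(), "B") for m in re.finditer(site_b, seq)]
    cuts.sort()
    fragments = []
    for (pos1, enz1), (pos2, enz2) in zip(cuts, cuts[1:]):
        # keep A-B or B-A fragments only, within the size-selection window
        if enz1 != enz2 and min_len <= pos2 - pos1 <= max_len:
            fragments.append(seq[pos1:pos2])
    return fragments

# Toy sequence: one A site, then two B sites
toy = "GAATTC" + "T" * 294 + "CCGG" + "T" * 50 + "CCGG" + "T" * 10
frags = double_digest(toy, "GAATTC", "CCGG")
print(len(frags), [len(f) for f in frags])  # 1 [300]
```

Run on a draft genome or a related species, the number of fragments returned gives a rough idea of how many loci a given enzyme pair and size window would sample.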

Page 20: Massively parallel sequencing for Biodiversity

Genome complexity reduction: RNA

A few samples: Transcriptome sequencing

Total RNA: RNAseq

Reduce to mRNA only (polyA)

Reduce to microRNA only

! Driven by external conditions and by tissue type

Needs a high number of reads: Illumina preferred

Transcription (DNA –> RNA)

Translation (RNA –> protein)

Page 21: Massively parallel sequencing for Biodiversity

Genome complexity reduction: RNA

A few organisms: Reminder: mRNA sequences include non-coding regions (UTR)

Figure: gene (5’ UTR, exons, intron, 3’ UTR) and mature mRNA (5’ UTR, CDS, 3’ UTR, AAAAAAA tail)

Page 22: Massively parallel sequencing for Biodiversity

Genome complexity reduction: Amplicon

Very many organisms: Amplicon sequencing with universal primers for ONE locus

Limitation: primers may not amplify equally well in ALL target organisms

Environmental samples targeting ITS, 16S, CO1 (the barcode loci)

Primers anneal Primers anneal Primers DO NOT anneal

Page 23: Massively parallel sequencing for Biodiversity

Case studies in biodiversity

Bartram et al 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Castoe et al. 2011 Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19

Page 24: Massively parallel sequencing for Biodiversity

Bacterial communities

Bartram et al: Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol. http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Objective: develop a protocol for community genetic diversity (tested on two samples; costs estimated for 20 samples)

Material: Soils from arctic tundra. Total DNA extracted with FastDNA (MPBiomedicals). Also includes a control bacterial mix in liquid media.

Cost: 20 X 5 $ = 100 $

Page 25: Massively parallel sequencing for Biodiversity

Bacterial communities

Molecular steps

Primer for hypervariable region 3 (V3) of the microbial 16S rRNA gene, 81 bp, PAGE-purified:

caagcagaagacggcatacgagatCGTGATgtgactggagttcagacgtgtgctcttccgatctATTACCGCGGCTGCTGG

Primer layout (5’ to 3’): flow-cell-binding, index, Illumina primer, target gene

Amplify with high-fidelity polymerase (Phusion)

Extract desired length, 200-250 bp (columns)

Multiplexing: Yes, including technical replicates

Quality control for libraries (e.g. Agilent Bioanalyzer)

Sequencing: paired-end 2 x 125 bp, Illumina GAIIx (would now be HiSeq)

Costs:

25 X 67 $ = 1 675 $

1 X 90 $ = 90 $

25 X 1.5 $ = ~40 $

25 X 50 $ = 1 250 $

1 X 2 090 $ = 2 090 $

Total = 5 250 $
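The 81 bp primer shown on this slide can be taken apart along its upper/lower-case boundaries, which line up with the four labels on the slide. The snippet below is a small check of that reading; assigning each label to a segment in left-to-right order is an assumption, and the function name and the alternative index are made up.

```python
# Segments follow the case changes in the primer printed on the slide.
PARTS = {
    "flow_cell_binding": "caagcagaagacggcatacgagat",
    "index": "CGTGAT",
    "illumina_primer": "gtgactggagttcagacgtgtgctcttccgatct",
    "target_gene": "ATTACCGCGGCTGCTGG",
}

def build_primer(index):
    """Assemble the primer with a given 6 bp index; other segments stay fixed."""
    return (PARTS["flow_cell_binding"] + index
            + PARTS["illumina_primer"] + PARTS["target_gene"])

primer = build_primer(PARTS["index"])
for name, segment in PARTS.items():
    print(f"{name}: {len(segment)} bp")
print("total:", len(primer), "bp")  # total: 81 bp

# A second sample would get the same primer with another index (hypothetical one here).
print(build_primer("ACATCG"))
```

Only the 6 bp index changes between samples, which is why one primer has to be ordered per multiplexed library.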

Page 26: Massively parallel sequencing for Biodiversity

Bacterial communities

Bioinformatics:

Base calling and error estimation: Illumina Analysis Pipeline

Quality filtering, sorting of reads according to index sequence, contig assembly (custom made, PANDAseq)

Discard read pairs with 1 or more mismatches between the two overlapping fragments of the paired-end reads, or with 1 or more ambiguous bases

Assignment to taxonomic affiliations: naïve Bayesian classification (Ribosomal Database Project, RDP classifier), cutoff 0.5

Good’s coverage for each library to estimate sequence coverage (C = 1 – n1/N)

CD-HIT to cluster arctic tundra datasets at 97% sequence identity
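Good's coverage, C = 1 – n1/N, is simple to compute once every sequence has been assigned to a cluster: n1 is the number of clusters seen exactly once and N the total number of sequences. A minimal sketch, with made-up cluster labels:

```python
from collections import Counter

def goods_coverage(cluster_labels):
    """Good's coverage C = 1 - n1/N.

    cluster_labels: one cluster (OTU) label per sequence.
    n1 = number of clusters containing a single sequence, N = number of sequences.
    """
    counts = Counter(cluster_labels)
    n1 = sum(1 for size in counts.values() if size == 1)
    return 1 - n1 / len(cluster_labels)

# Toy library: 10 sequences in 4 clusters, 2 of which are singletons
labels = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(f"{goods_coverage(labels):.2f}")  # 0.80
```

A value close to 1 means that few clusters are singletons, so additional sequencing of the same library is unlikely to reveal many new clusters.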

Sequence analyses workflow: raw reads, sorting by index seq., paired-end reads assembled (custom program), clustering by modified single linkage (CD-HIT), classification / diversity estimate (RDP / QIIME)
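The discard rules used when assembling each read pair (no mismatch in the overlap, no ambiguous base) can be illustrated with a deliberately strict merge function. This is not PANDAseq, which scores overlaps probabilistically; it is a simplified stand-in that accepts only a perfect overlap of at least `min_overlap` bases, and all names and the toy template are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair into one contig, or return None if the pair is discarded.

    Discarded: pairs with any ambiguous base (N), and pairs for which no
    perfectly matching overlap of at least min_overlap bases is found.
    """
    rev = reverse_complement(reverse)
    if "N" in forward or "N" in rev:
        return None
    # try the longest possible overlap first
    for k in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-k:] == rev[:k]:
            return forward + rev[k:]
    return None

template = "ATGGCGTACCTGAAGTCCGATTGCAAGGCTCTAGACGTTA"  # toy 40-base amplicon
fwd = template[:25]
rev = reverse_complement(template[15:])                 # reads overlap by 10 bases
print(merge_pair(fwd, rev) == template)                 # True
```

Being this strict is what drives the large fraction of discarded pairs: a single sequencing error anywhere in the overlap removes the whole pair.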

Page 27: Massively parallel sequencing for Biodiversity

Bacterial communities

Results

Total of 12 million raw reads

About 50% of the reads discarded. Raw reads: 7.6 million and 4.4 million for the two technical replicates. Post-assembly: 4.1 million and 2.4 million.

Average post-assembly contig: 150 ± 11 bases (without primers). Overlap: 66 ± 11 bases

Pre-clustering at 97% sequence identity

Estimated error rate (from control library): 1 error per 5 contigs (1%/base). Higher than Sanger sequencing.

Found a contaminant in the growth media of a control

Duplicate arctic tundra libraries displayed a high degree of similarity: phyla compared between the two libraries (AT1 to AT2; r=0.999); the majority of sequence clusters (99.57%) detected in both replicates

Page 28: Massively parallel sequencing for Biodiversity

Isolation of novel microsatellite loci

Castoe et al. 2011 Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Objective: Discover novel microsatellite loci from diverse organisms

Material: Total genomic DNA from many different organisms (results reported for only 3 species; costs calculated here for 8)

Also read: Jennings et al. 2011. “Multiplexed Microsatellite Recovery Using Massively Parallel Sequencing.” Molecular Ecology Resources. http://doi.wiley.com/10.1111/j.1755-0998.2011.03033.x.

Page 29: Massively parallel sequencing for Biodiversity

Isolation of novel microsatellite loci

Genome complexity reduction: none, direct sequencing

Material: genomic DNA (5 µg), one individual per species

Library preparation for Illumina sequencing (likely on ~8 to 10 species)

Multiplexing: yes, at the sequencing facility, during library preparation

Sequencing platform: Illumina GAIIx, 120 bp paired-end. Would now be HiSeq 2000

Primers then need to be ordered for each locus: ~700 $ for 50 loci, up to 5 500 $ for 8 species

Costs:

8 X 5 $ = 40 $

8 X 160 $ = 1 280 $

1 X 2 090 $ = 2 090 $

Total: 3 400 $

Total (with primers): 9 000 $

Page 30: Massively parallel sequencing for Biodiversity

Isolation of novel microsatellite loci

Bioinformatics: simple, no assembly, no comparison to a reference genome

In a Perl script:

Identify reads that contain perfect SSRs: 2-mer to 6-mer, repeated at least 6 times

Sort by SSR type (de-multiplex)

Design primers (with Primer3)

Discard the primer pairs that also occur in other reads
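The first step, finding reads that contain a perfect SSR (2-mer to 6-mer motif repeated at least 6 times), fits in one regular expression. The original work used a Perl script; the Python sketch below only illustrates the same rule and is not that script. Skipping single-nucleotide runs is an added assumption.

```python
import re

# A 2-6 base motif followed by at least 5 more copies of itself (6 copies in total).
# The lazy quantifier makes the shortest motif win, so ACACAC... is reported as AC.
SSR = re.compile(r"([ACGT]{2,6}?)\1{5,}")

def find_ssrs(read):
    """Return (start, motif, number_of_repeats) for each perfect SSR in a read."""
    hits = []
    for match in SSR.finditer(read):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer run such as AAAAAAAAAAAA, not a 2-6-mer SSR
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits

print(find_ssrs("GGG" + "AC" * 8 + "TTT"))   # [(3, 'AC', 8)]
print(find_ssrs("GGG" + "AC" * 5 + "TTT"))   # []
```

Reads with a hit can then be grouped by motif length before primer design, and flanking sequence on both sides of the match is what a tool such as Primer3 would need.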

Page 31: Massively parallel sequencing for Biodiversity

Isolation of novel microsatellite loci

Results:

Number of raw reads: not reported

Used 5 million paired-end reads per sample (a 1X coverage)

Mean sequence length: not reported

Between 150 000 and 540 000 potential loci (containing microsatellites)

Primers designed for 72 000 to 174 000 loci, depending on species

With extra stringency (only 3- to 6-mer, >7 repeats): 200 to 2 000 loci

Primers not tested for amplifiability

Conclusions: Large variation in number and proportion of motifs (3-mer, 4-mer…) in the different organisms.

Page 32: Massively parallel sequencing for Biodiversity

ddRAD-seq

Peterson et al. 2012 Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PlosOne http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Objective: Aim at 10 000 SNPs, random genome-wide,10X coverage

Material: Total DNA extracted from 54 P. leucopus , one population (Qiagen kits)

54 X 5 $ = 270 $

Page 33: Massively parallel sequencing for Biodiversity

ddRAD-seq

Complexity reduction: yes, complex. Digestion, annealing, size selection, many purification steps

Multiplexing: yes, 54 samples. From simulations that include genome size and nucleotide frequencies, they estimate needing 400 000 reads per individual for fragments of 300 ± 30 bp

Platform: two lanes of GAII (now HiSeq 2000), paired-end

2 X 2 010 $ = 4 020 $

Page 34: Massively parallel sequencing for Biodiversity

ddRAD-seq

Many oligos, combining an « index » on the 5’ and on the 3’ side: 110 X ~30 $ = 3 300 $

Digestions and PCR amplifications

Many purifications and precise size selection (Pippin Prep)

Enzymes + purif. = 1 250 $

Big total = ~9 000 $

Oligo layout (adaptor P1, gDNA, adaptor P2):

PCR primer 1 (46 bp): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACG

Adaptor P1, Oligo 1.1: ACACTCTTTCCCTACACGACGCTCTTCCGATCTAATTA-3’

Adaptor P1, Oligo 1.2 (one of 48): TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGATTAATTTAA-5’-P

gDNA insert: AATTCNNN…NNN (top strand), GNNN…NNNGGC (bottom strand)

Adaptor P2, Oligo 2.1: P-5’-CGAGATCGGAAGAGCGAGAACAA

Adaptor P2, Oligo 2.2: TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGTG

PCR primer 2 (1 of 12): CGTGTGCAGACTTGAGGTCAGTGTAGTGCTAGAGCATACGGCAGAAGACGAAC

Page 35: Massively parallel sequencing for Biodiversity

ddRAD-seq – Sequence analyses

Initial sequence processing:

De-multiplex, accepting 1 bp mismatch in the 4 bp barcode

Assign each read to a single individual

Collapse identical reads to one sequence, retaining frequency

No reference genome; not the “Stacks” package:

Compute pairwise distance between all reads (BLAT)

MCL to group similar reads (ortholog inference)

Count unique sequences in a cluster (= locus); count how many are beyond the ploidy level (= % error-containing reads)

Align orthologs (MUSCLE)

Write alignment as reference-ordered SAM/BAM files

GATK UnifiedGenotyper, to genotype

Error rate ranged from 0.18 to 0.22% per nucleotide (1/10 reads)

Technical replicates? No
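The initial processing steps (de-multiplexing with one tolerated mismatch in a 4 bp barcode, then collapsing identical reads while keeping their frequency) can be sketched as below. The barcodes and reads are invented, and rejecting reads whose barcode is within one mismatch of several individuals is an assumption about how ties would be handled.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(read, barcodes, barcode_len=4, max_mismatch=1):
    """Return (individual, insert), or None if zero or several barcodes match."""
    tag = read[:barcode_len]
    matches = [ind for code, ind in barcodes.items()
               if hamming(tag, code) <= max_mismatch]
    if len(matches) != 1:
        return None
    return matches[0], read[barcode_len:]

def collapse(reads, barcodes):
    """Per individual, collapse identical inserts and retain their frequency."""
    per_individual = {}
    for read in reads:
        hit = assign(read, barcodes)
        if hit is not None:
            individual, insert = hit
            per_individual.setdefault(individual, Counter())[insert] += 1
    return per_individual

# Hypothetical 4 bp barcodes and toy reads
barcodes = {"AAAA": "ind1", "CCGG": "ind2"}
reads = ["AAAATTTT", "AAATTTTT", "CCGGACGT", "GGTTACGT"]
print(collapse(reads, barcodes))
```

The second read is rescued despite one barcode error, while the last read is more than one mismatch away from every barcode and is dropped.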

Page 36: Massively parallel sequencing for Biodiversity

ddRAD-seq – results

The 54 wild Peromyscus from the same population:

Total reads: not reported (~2 X 21 million)

Assigned to an individual: not reported

Discarded 5.4% of reads

SNPs discovered:

Variable regions (loci): 6 200 found

Polymorphic sites for >70% of individuals: 16 000 sites found

In an analysis of samples from different populations:

SNPs in multi-SNP loci: >80%

These multi-SNP loci are usually excluded in other analyses

Page 37: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19

Objective: Phylogenies in polyploid grasses; recent, rapid radiation (in a time- and cost-effective experimental design)

Page 38: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Material: Total DNA extracted from 60 individuals of 11 different polyploid Poa species

Complexity reduction: amplify 3 cp genes (rpl32-trnL, rpoB-trnC and trnH-psbA) and two nuclear genes (DMC1 and CDO504) from each of the 60 samples

Target amplicons are < 500 bp

Pool the 5 different PCR products for one individual

Costs:

60 X 5 $ = 300 $

Primers: 10 X 5 $ = 50 $

Enzymes: 1 X 150 $ = 150 $

Page 39: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Multiplexing: yes, addition of ds-adaptors with MID barcodes by ligation. Design 64 different barcodes (with 3 technical replicates)

Ligation of barcode-adaptors to the pools of amplicons, purification, pooling, quality control

Sequencing platform: ¼ plate of a Roche 454 with Titanium 2010 chemistry

Costs:

64 X 2 X 9 $ = ~1 160 $

64 X 2 X 6 $ = ~800 $

1 X 100 $ = 100 $

64 X 1.5 $ = 96 $

1 X 50 $ = 50 $

1 X 2 140 $ = 2 140 $

Total = ~4 900 $

Adaptor layout: TitaniumAdapterA (25 bp) + MID barcode, A-tailed amplicon, TitaniumAdapterB

TitaniumAdapterA + MID barcode: TCGTATCGCCTCCCTCGCGCCATCAG + ACGAGTGCGT

Complementary strand as printed: TGCATAGCGGAGGGAGCGCGGTAGTA TGCTCACGCA

TitaniumAdapterB: CTGAGCGGGCTGGCAAGGCGCATAG

Complementary strand as printed: GACTCGCCCGACCGTTCCGCGTATC

Page 40: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Bioinformatics: Galaxy platform (free)

Sort the gene regions by regular expression (REGEX) matching the gene-specific primers

Discard: low-quality reads, short sequences, reads matching no MID barcode

Calculate the error rate from apparent SNPs in chloroplast regions

Detect and discard PCR recombinants: alleles that occurred at <5% for a species, OR alleles whose two ends do not match the same common allele
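Two of these steps translate directly into a few lines of code: sorting reads into gene regions with regular expressions anchored on the gene-specific primers, and dropping alleles that fall below the 5% frequency threshold. The study ran its analysis in Galaxy; the sketch below is a stand-alone illustration, and the primer patterns are hypothetical placeholders because the real primer sequences are not given on the slides. Only the frequency criterion is implemented, not the check on allele ends.

```python
import re
from collections import Counter

# Hypothetical primer patterns; the real gene-specific primers are not listed here.
PRIMERS = {
    "trnH-psbA": re.compile(r"^ACTGCC[AG]TGA"),
    "DMC1": re.compile(r"^TGGAT[CT]GCAA"),
}

def sort_by_gene(reads):
    """Put each read in the bin of the first primer pattern it starts with."""
    bins = {gene: [] for gene in PRIMERS}
    for read in reads:
        for gene, pattern in PRIMERS.items():
            if pattern.match(read):
                bins[gene].append(read)
                break
    return bins

def drop_rare_alleles(alleles, min_fraction=0.05):
    """Keep alleles whose read frequency is at least min_fraction."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n for allele, n in counts.items() if n / total >= min_fraction}

print(sort_by_gene(["ACTGCCATGATTT", "TGGATTGCAAGG", "GGGG"]))
print(drop_rare_alleles(["x"] * 60 + ["y"] * 38 + ["z"] * 2))  # {'x': 60, 'y': 38}
```

Degenerate positions in a primer are written as character classes such as `[AG]`, which is the main reason a regular expression is more convenient here than a plain string comparison.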

Page 41: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Results:

121 000 raw reads. Length: 40 to 775 bp (mean 278 bp)

111 200 (92%) match a gene-specific primer

70 601 reads (58%) remained after barcode sorting and quality control

Useful sequence for 281 out of 320 (88%) targets = 12% missing

Sequence error rate: 0.13%

PCR recombination: 2.9% of CDO504 reads and 14% of DMC1 reads

Technical replicates: P. costiniana: identical alleles (at the < 2 bp level), but one extra allele (= PCR error)

Two distinct copies (and more) of each nuclear gene deduced: DMC1 copies differ at 19 bases (4.0%) and CDO504 copies at 35 bases (8.5%), plus 7-bp and 4-bp indels

One extra gene copy discovered for CDO504, showing a 57-bp deletion in an intron

Page 42: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Figure: Number of sequence reads obtained for each marker/individual combination. A - After quality control and barcode deconvolution. B - Useful sequence reads remaining after alignment and editing.

Figure: Percentage of useful reads gained for each nuclear gene copy and allele, including recombinant reads.

Page 43: Massively parallel sequencing for Biodiversity

Phylogenies with polyploids

Results: Phylogenetic analyses

Timing of polyploidization: took place before the Australian and the American species diverged.

Extensive haplotype sharing between taxa currently treated as different species

Nuclear gene networks showed incongruence both with each other and with the chloroplast gene networks

Tasmania-mainland differentiation detected

On the local scale, strong spatial genetic structure detected using two of the chloroplast markers. Suggests a smaller neighborhood for seed dispersal than for pollen dispersal.

Page 44: Massively parallel sequencing for Biodiversity

To remember

Diversity of protocols and experimental designs (Be creative!)

Budget: 3 000 $ to 9 000 $; the main costs can be primers and library preparation

Standards are rapidly increasing: technical replicates required

Challenge: data analysis

No standard analytical protocol (custom, in-house developed)

No standard calculation of error rate

Initial steps are computer intensive (30 million short reads…)

Results:

Half of the reads are discarded

Many target loci will be missing

Unequal proportion of technical replicates in the final dataset

Prone to PCR recombination and chimeric assemblies

Page 45: Massively parallel sequencing for Biodiversity

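Comparison

Arctic soil
Platform: GA II, one lane
Nb samples: 24
Total cost: 5 250 $
% reads retained: 53 %
Nb clean unique reads: 4.1 million vs 2.4 million for tech rep
Tech. reps: Yes
Error rate: 1%
Missed target: NA

Microsat
Platform: GAIIx, one lane
Nb samples: ? (? 8)
Total cost: 3 400 $ (without primers)
% reads retained: ?
Nb clean unique reads: ?
Tech. reps: No
Error rate: ?
Missed target: NA

SNP (ddRAD)
Platform: GA II, two lanes
Nb samples: 54
Total cost: 9 000 $
% reads retained: 95%
Nb clean unique reads: 7 000 loci w SNPs
Tech. reps: No
Error rate: 0.18 – 0.22% per base
Missed target: ?

Polyploids
Platform: GS FLX, ¼ plate
Nb samples: 61
Total cost: 4 900 $
% reads retained: 58 %
Nb clean unique reads: 70 601
Tech. reps: Yes
Error rate: 0.13%
Missed target: 12 %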
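Dividing each total by the number of samples makes the studies easier to compare. The short calculation below uses only the totals and sample counts listed in the comparison; the microsatellite study is left out because its sample number is marked as uncertain. The function and variable names are illustrative.

```python
# (total cost in $, number of samples), taken from the comparison above
studies = {
    "Arctic soil": (5250, 24),
    "SNP (ddRAD)": (9000, 54),
    "Polyploids": (4900, 61),
}

def cost_per_sample(total_cost, n_samples):
    return total_cost / n_samples

for name, (total, n) in studies.items():
    print(f"{name}: {cost_per_sample(total, n):.0f} $ per sample")
```

These figures cover consumables and sequencing only, as on the slides, and say nothing about the analysis time each design requires.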

Page 46: Massively parallel sequencing for Biodiversity

Thank you!