MiSeq vs. Ion Torrent Data Quality

COMPANY CONFIDENTIAL – INTERNAL USE ONLY © 2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners. MiSeq vs. PGM Data Quality Omayma Al-Awar Regional Marketing–AMR Feb 17, 2012

DESCRIPTION

Comparison of Two Instruments.

TRANSCRIPT

  • COMPANY CONFIDENTIAL – INTERNAL USE ONLY © 2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

    MiSeq vs. PGM Data Quality

    Omayma Al-Awar, Regional Marketing–AMR, Feb 17, 2012

  • 2 COMPANY CONFIDENTIAL INTERNAL USE ONLY

  • 3 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! We have higher quality scores and therefore lower error rates.

    ! We sequence homopolymers accurately, while they have an inherent problem in sequencing those regions. Their higher indel error rates increase the rate of false positive SNP calls and the cost of downstream validation.

    ! We measure the accuracy of every base, while they rely on stacks of reads and a reference. What if a reference is not present? What if you are looking for rare variants?

    ! Proton sensing is novel and interesting, but technically inferior for DNA sequencing. Our sequencing approach is more sensitive, requires less library amplification, and produces higher fidelity data.

    MiSeq Data Quality is Superior to that of Ion Torrent

  • 5 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Illumina data quality guarantees are at Q30, or a probability of 1 error in 1,000 bases. Ion Torrent data quality is at Q17, or about 2 errors every 100 bases (at best Q20, with 1 error every 100 bases).

    Will that work for your application?

    Message #1: Quality Scores

  • 6 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    What does a Q score mean?

    ! A single base call is assigned a quality score. It is a logarithmic measure of the probability (P) of an incorrect call: Q = -10 log10(P).

    For example, a base with a Q40 score has a probability of 1 in 10,000 of an incorrect base call.

    Definitions-Quality Scores

    Phred Quality Score    Probability of Incorrect Base Call    Base Call Accuracy
    10                     1 in 10                               90%
    20                     1 in 100                              99%
    30                     1 in 1,000                            99.9%
    40                     1 in 10,000                           99.99%

    Our Quality Guarantee: >75% bases >Q30 at 2 x 150

    Ion Torrent Quality Scores: Q17-Q20
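The relationship in the table above can be checked with a few lines of Python. This is a minimal sketch of the standard Phred conversion, Q = -10 log10(P); the function names are illustrative and are not taken from any Illumina or Ion Torrent software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print one row per score, including the Q17 and Q20 values quoted above.
for q in (10, 17, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:.5f}, accuracy = {100 * (1 - p):.3f}%")
```

Under this definition Q17 works out to roughly 2 errors per 100 bases and Q30 to 1 error per 1,000 bases.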

  • 7 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Why is base quality score important?

    Definitions-Quality Scores

    [Figure: stack of 13 reads at one position; 11 reads show T and 2 reads show A]

    11/13 T, 2/13 A

    The consensus is T. Do the As represent a SNP? Do the As represent a sequencing error?

    Base quality score determines whether this is a SNP or an error!

    If As are at Q30, we flag them as SNPs.

    If As are at Q17, we discard them as errors.

    What if you sequenced this on the Ion Torrent with an average Q score of 17?

    You will call more False Positive SNPs. You will spend more time and money on downstream validation
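The decision described on this slide can be sketched as a quality filter applied before counting alleles. The pileup below is hypothetical (11 T calls, 2 A calls), and the Q20 cutoff is an assumption chosen only for illustration; real variant callers use more elaborate models.

```python
from collections import Counter


def allele_counts(pileup, min_q=0):
    """Count bases at one position, keeping only calls with quality >= min_q.

    pileup is a list of (base, phred_quality) tuples.
    """
    return Counter(base for base, q in pileup if q >= min_q)


def make_pileup(a_quality):
    """Hypothetical stack: 11 T calls at Q30 plus 2 A calls at a_quality."""
    return [("T", 30)] * 11 + [("A", a_quality)] * 2


MIN_Q = 20  # assumed cutoff, not a vendor recommendation

high = allele_counts(make_pileup(30), min_q=MIN_Q)  # As survive the filter
low = allele_counts(make_pileup(17), min_q=MIN_Q)   # As are discarded

print("As at Q30:", dict(high))
print("As at Q17:", dict(low))
```

With the As at Q30 both alleles remain and the position is a SNP candidate; with the As at Q17 only T remains.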

  • 8 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Example: Matched tumor/normal samples were sequenced. Results at a certain base position were:

    Normal sample: 15 As, 15 Ts

    Tumor sample: 12 As, 15 Ts, 6 Gs, and 3 Cs

    How do you call it?

    ! Why is base quality score critical for cancer sequencing?

    ! Few solid tumors that are sequenced are pure

    ! Normal tissue contamination can be as high as 95%

    Definitions-Quality Scores

    The normal sample is a heterozygote A/T.

    Without high BASE QUALITY scores, the tumor sample is impossible to call accurately.

    The CONSENSUS ACCURACY score that Ion Torrent (IT) generates does not distinguish between bona fide SNPs and errors in this example.

    Q scores of 30 or higher are critical for accurate calling of SNPs in complex samples.
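One way to reason about the tumor example is to ask how likely it is that a minor allele arises from sequencing errors alone. The sketch below uses a deliberately simple model that is an assumption, not something stated on the slide: every base call errs independently with the probability implied by its Q score, and errors are split evenly among the three other bases. Real errors (for example in homopolymers) are not independent, so the numbers are only indicative.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_allele_from_errors(depth, observed, q):
    """Chance that at least `observed` of `depth` reads show one specific
    wrong base purely through independent base-calling errors at Phred q."""
    p_err = 10 ** (-q / 10)
    p_specific = p_err / 3  # assumed: errors spread evenly over 3 other bases
    return binom_tail(depth, observed, p_specific)


DEPTH = 12 + 15 + 6 + 3  # tumor sample reads at this position

for label, count in (("G", 6), ("C", 3)):
    for q in (17, 30):
        p = prob_allele_from_errors(DEPTH, count, q)
        print(f"{count} x {label} at Q{q}: P(errors only) = {p:.3g}")
```

Under these assumptions the error-only explanation is far less likely at Q30 than at Q17, which is the point the slide makes about low-frequency alleles in mixed samples.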

  • 9 COMPANY CONFIDENTIAL INTERNAL USE ONLY

                               MiSeq (2Gb spec)         PGM (316-100 bp)
    Raw read length            2 x 151 (overlapping)    100 (nominal)
    Processed read length      220*                     120
    No. of reads (millions)    17.4                     1.47
    Yield (Gb) (Quality)       2.633 (84.2% >Q30)       0.197 (Avg. Q20)
    Genome coverage            877x                     66x
    Assembler                  Velvet 1.1.06 (k=95)     Newbler 2.6
    Genome Assembly (bp)       3,053,394                2,980,036
    N50                        45,964                   22,750
    Raw SNP Count              9,709                    5,636

    Data-Quality Scores

    Comparison of Enterococcus faecium de novo Sequencing Data

    Work performed in the laboratory of Dr. Tim Stinear University of Melbourne
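Two of the table's metrics are simple to compute, and the sketch below shows how. Fold coverage is yield divided by genome size; the 3.0 Mb genome size is an assumption (the slide does not state the value used), so the result only approximates the 877x and 66x in the table. N50 is the contig length at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the assembly; the contig lengths used here are made-up toy values.

```python
def fold_coverage(yield_bases, genome_size_bases):
    """Average number of times each genome position is sequenced."""
    return yield_bases / genome_size_bases


def n50(contig_lengths):
    """Length of the contig at which the running total of lengths
    (longest first) reaches at least half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


GENOME_SIZE = 3.0e6  # assumed genome size in bases, not given on the slide

miseq_cov = fold_coverage(2.633e9, GENOME_SIZE)
pgm_cov = fold_coverage(0.197e9, GENOME_SIZE)
print(f"MiSeq coverage: about {miseq_cov:.1f}x")
print(f"PGM coverage: about {pgm_cov:.1f}x")

toy_contigs = [10, 8, 6, 4, 2]  # hypothetical contig lengths
print("Toy N50:", n50(toy_contigs))
```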

  • 10 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Dan Koboldt, MassGenomics Blog. June 22, 2011. Analyzed public IT data: DH10B sequenced on 316 chip.

    Average base quality is 23.

    A lot of base quality values of 8 (he wondered if this is equivalent to Illumina Q2 scores, indicating virtually no confidence in the base).

    From the blogosphere: Quality Scores

  • 11 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Illumina chemistry sequences homopolymers accurately, while Ion Torrent technology has an inherent problem in reading through homopolymers, as do pyrosequencing and other technologies that flow one nucleotide type at a time. This means their indel error rate is high.

    ! Ion Torrent omits indels from their quality analysis!

    ! Why should a high indel error rate matter? This means you will have a much higher number of false positive SNPs. You need to do more downstream validation, which means the experiment will cost you more!

    Message #2 Ion Torrent has high indel error rates

  • 12 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    Definitions - Ion Torrent has high indel error rates

    [Figure: reversible terminator nucleotide with a fluor attached through a cleavage site and a 3' block. Cycle steps: Incorporation; Detection; Deblock and fluor removal (leaving a free 3' end); Next cycle]

    Illumina Proven Reversible Terminator Chemistry

    All 4 labeled nucleotides in 1 reaction

    Higher accuracy

    No problems with homopolymer repeats

  • 13 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    MiSeq has no issues with Homopolymers

    [Figure: BGI HiSeq data from recent E. coli outbreak]

    [Figure: BGI IT data from recent E. coli outbreak]

    Data - Ion Torrent has high indel error rates

  • 14 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Keith Robison, Omics! Omics! Blog. Sept 14, 2011.

    omitting indels from the quality analysis hits close to home, because I've been guilty of this too. Partly this was it was easy to avoid them at first, and partly it stems from the fact that indels don't really fit into the phred score paradigm I've been using (that's a whole 'nother stalled blog post). I've tried to be up front about that, but it is certainly an issue. In some applications the homopolymer reads can be seen as just a tax on your data. For example, if I know I'm only looking for activating substitutions in an oncogene, those must be in frame and I can discard the reads with indels in the vicinity of my codon(s) of interest. But, in most cases they really are an issue.

    From the blogosphere Ion Torrent has high indel error rates

    Why do they get away with it? Yet he acknowledges.

    Ion Torrent does not factor indels into its data quality analysis

  • 15 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! Dan Koboldt, MassGenomics Blog. June 22, 2011. Analyzed public IT data: DH10B sequenced on 316 chip.

    Because it is a laboratory reference strain, it is expected to be genetically homogeneous, and its sequence should be identical to the reference. Any apparent SNPs or indels are due to sequencing errors.

    VarScan detected an astonishing 1,122,276 insertion/deletion events, reflecting an indel error rate of 0.726%, or about eight-fold higher than the substitution error rate.

    VarScan removed 87.6% of indels as clear artifacts in homopolymer runs of 4 or more bases.

    What about SNPs? Found 142,920 SNPs (remember, the sequence should be identical to the reference!)

    Conclusion: Homopolymers are obviously an issue for this platform. 28% of the erroneous SNPs were in homopolymer regions.

    From the blogosphere Ion Torrent has high indel error rates Impact on SNP calls
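The "homopolymer runs of 4 or more bases" criterion quoted above can be illustrated with a small helper that locates such runs in a sequence and checks whether a variant position falls inside one. This is a sketch for illustration only; it is not VarScan's implementation, and the example sequence is made up.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. Positions are 0-based."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def in_homopolymer(position, runs):
    """True if a 0-based position lies inside any of the given runs."""
    return any(start <= position < start + length for start, _, length in runs)


seq = "ACGTTTTTGACCCCA"  # hypothetical sequence
runs = homopolymer_runs(seq)
print(runs)
print(in_homopolymer(5, runs), in_homopolymer(8, runs))
```

A filter built this way would flag indel or SNP calls at positions inside a run as likely homopolymer artifacts.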

  • 16 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    ! We measure the accuracy of every base in every read. Ion Torrent uses consensus call accuracy and base call recalibration to measure their accuracy. This means they rely on a stack of reads and a reference genome to recalibrate their quality score.

    What if you don't have a reference genome?

    What if your application requires that you perform de novo sequencing?

    What if you're interested in metagenomics?

    Why should any microbiology lab ever consider an Ion Torrent?

    Message #3: Single base call accuracy

  • 17 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    Measuring Accuracy

    ! Illumina uses single base call accuracy to assess data quality.

    ! Ion Torrent uses consensus accuracy (look at accuracy of a whole stack of reads in a given genomic position).

    Definitions Single base call accuracy

    [Figure: a single read (single base call accuracy) shown next to a stack of reads aligned to a reference genome (consensus call accuracy)]
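The difference between the two accuracy measures can be sketched in code. Consensus accuracy needs a stack of reads and a reference base to compare against, while single base call accuracy is available for each read on its own through its quality score. The read stack, qualities, and function names below are hypothetical.

```python
from collections import Counter


def consensus_call(bases):
    """Most frequent base in a stack of reads at one position."""
    return Counter(bases).most_common(1)[0][0]


def consensus_is_correct(bases, reference_base):
    """Consensus accuracy check: requires a reference to compare against."""
    return consensus_call(bases) == reference_base


def per_base_error_probs(quals):
    """Single base call accuracy: one error probability per read,
    derived from its Phred score with no stack and no reference."""
    return [10 ** (-q / 10) for q in quals]


stack = ["T", "A", "T", "T", "T", "T", "T"]  # hypothetical reads at one site
quals = [30, 17, 30, 30, 30, 30, 30]         # hypothetical Phred scores

print("Consensus:", consensus_call(stack))
print("Matches reference T:", consensus_is_correct(stack, "T"))
print("Per-base error probabilities:", per_base_error_probs(quals))
```

In this toy stack the consensus matches the reference even though one read carries an A, so the consensus measure alone says nothing about how reliable that individual call was.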

  • 18 COMPANY CONFIDENTIAL INTERNAL USE ONLY

                               MiSeq (2Gb spec)         PGM (316-100 bp)
    Raw read length            2 x 151 (overlapping)    100 (nominal)
    Processed read length      220*                     120
    No. of reads (millions)    17.4                     1.47
    Yield (Gb) (Quality)       2.633 (84.2% >Q30)       0.197 (Avg. Q20)
    Genome coverage            877x                     66x
    Assembler                  Velvet 1.1.06 (k=95)     Newbler 2.6
    Genome Assembly (bp)       3,053,394                2,980,036
    N50                        45,964                   22,750
    Raw SNP Count              9,709                    5,636

    Data: Single base call accuracy, de novo assembly

    Comparison of Enterococcus faecium de novo Sequencing Data

    Work performed in the laboratory of Dr. Tim Stinear University of Melbourne

  • 19 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    Message # 4: Our sequencing is technically superior

    It's a Photon vs. Proton Argument.

    Proton sensing is novel and interesting, but technically inferior for DNA sequencing.

    Optical DNA sequencing:

    ! 500x more sensitive with better SNR

    ! Requires less amplification

    ! Produces higher fidelity data

    ! Safer choice when the stakes are high

  • 20 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    Supplemental Figures

    Figure S1 Process overview

    a, Overview of ion sequencing work flow. b, Prepare genomic library, DNA is fragmented, sized, and forward and reverse adapters ligated. c, Amplify Template on bead, adapter-ligated libraries are clonally amplified onto beads. A magnetic bead-based enrichment process selects template-carrying beads. d, Sequence on ion chip, sequencing primers and DNA polymerase are bound to the template-carrying beads, beads are pipetted into the chip's loading port. The chip is installed in the sequencing instrument; all four nucleotides cyclically flowed in an automated 2-hour run. Signal processing, software converts the raw data into measurements of incorporation in each well for each successive nucleotide flow. After bases are called, each read is passed through a filter to exclude low-accuracy reads and per-base quality values are predicted.

    WWW.NATURE.COM/NATURE | 1

    ! Ion sensing requires a massive amount of PCR to overcome the native background signal in solution

    ! Native PCR introduces 1 error every 9000 bases

    ! Sequence detection is analog, requiring high speed computing to call bases

    Message # 4: Our sequencing is technically superior

  • 21 COMPANY CONFIDENTIAL INTERNAL USE ONLY

    QUESTIONS?