Comparison of RNA sequencing with 19,319 lab validated RT-qPCR assays
Jan Hellemans, PhD
London, UK
October 20-21, 2014
Acknowledgements
• Biogazelle
Biogazelle team & collaborators
• Ghent University
• Steve Lefever
• SEQC consortium
• Christopher Mason
• David Kreil
• Leming Shi
• Bio-Rad
• qPCR: reference technology for nucleic acid quantification
• sensitivity and specificity
• wide dynamic range
• speed
• relatively low cost
• conceptual and practical simplicity
• easy to perform ≠ easy to do it right
• many steps involved
• all need to be right
Introduction
Assays & MIQE
• design
• amplicon length
• primer positions (exonic or intron-spanning)
• transcript coverage
• in silico verification
• specificity prediction (retropseudogenes and other homologues)
• secondary structure analysis
• empirical (wet lab) validation
• specificity assessment (gel, melt, amplicon sequencing)
• Cq of NTC (for SYBR assays)
• amplification efficiency determination (slope, E, SE(E), r²)
The perfect assay
• specific for the gene of interest (no off-target amplification)
• detection of all transcript variants
• detection not affected by polymorphisms (no allelic bias or drop out)
• amplification efficiency ~100%
• no gDNA co-amplification
• no primer dimer formation
properties
The perfect assay
• For some genes, there is no perfect assay
• no unique sequence (homology with other genes – pseudogenes)
• no common sequence among all transcripts
• regions are excluded because of repeats, secondary structures, SNPs, homology, ...
• Make the best possible compromise and report potential issues
• Design → in silico quality control → lab validation
... or the best possible
Assay design using primerXL
• database of genomic information (transcripts, SNPs, ...)
• tools for target region selection (maximize transcript coverage)
• primer3 design engine
• analysis of secondary structures and SNPs in primer annealing regions
• specificity prediction (BiSearch)
• relaxation cascade (from perfect to best possible)
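The relaxation cascade can be pictured as trying progressively looser constraint sets until a design is found. A minimal sketch only: the constraint names, thresholds and the `design` callback are hypothetical and are not taken from primerXL.

```python
from typing import Callable, Optional

# Hypothetical constraint levels, ordered from "perfect" to "best possible".
# Keys and values are illustrative only, not primerXL settings.
CASCADE = [
    {"name": "perfect", "min_transcript_coverage": 1.0, "allow_snp_in_primer": False},
    {"name": "relaxed coverage", "min_transcript_coverage": 0.8, "allow_snp_in_primer": False},
    {"name": "best possible", "min_transcript_coverage": 0.5, "allow_snp_in_primer": True},
]

def design_with_relaxation(design: Callable[[dict], Optional[str]]):
    """Try each constraint level in order. Return (level name, assay) for the
    first level at which `design` yields an assay, or None if all levels fail."""
    for level in CASCADE:
        assay = design(level)
        if assay is not None:
            return level["name"], assay
    return None

# Stand-in design function: succeeds only once SNPs in primers are tolerated.
def toy_design(level: dict) -> Optional[str]:
    return "toy_assay" if level["allow_snp_in_primer"] else None

print(design_with_relaxation(toy_design))
```

Returning the level at which a design succeeded keeps track of which compromise was made, so that potential issues can be reported with the assay.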
BiSearch specificity prediction
• BiSearch loose
• only the gene of interest (FFAR2)
• BiSearch strict
reads | seq | gene_list | official_symbol | location
2843 | CATGGCAGTCACCATCTTCTGCTACTGGCGTTTTGTGTGGATCATGCTCTCCCAGCCCCTTGTGGGGGCCCAGAGGCGGCGCCGAGCCGTGGGGCTGGCTGTGGTGACGCTGCTCAATTTCCTGGTGTGCTTCGGACCTTACAGATCGGAA | ENSG00000126262 | FFAR2 | 19:35940617-35942667
1897 | GTAAGGTCCGAAGCACACCAGGAAATTGAGCAGCGTCACCACAGCCAGCCCCACGGCTCGGCGCCGCCTCTGGGCCCCCACAAGGGGCTGGGAGAGCATGATCCACACAAAACGCCAGTAGCAGAAGATGGTGACTGCCATGAGATCGGAA | ENSG00000126262 | FFAR2 | 19:35940617-35942667
1535 | GTAAGGTCCGAAGCACACCGAGAGCTGGGAGCAGGAGCTACACAGTCTGCTGGCCTCACTGCACACCCTGCTGGGGGCCCTGTACGAGGGAGCAGAGACTGCTCCTGTGCAGAATGAAGGCCCTGGGGTGGAGATGCTGCTGTCCTCAGAA | ENSG00000141456 | AC091153.1 | 17:4574680-4607632
1097 | CATGGCAGTCACCATCTTCTGAGGACAGCAGCATCTCCACCCCAGGGCCTTCATTCTGCACAGGAGCAGTCTCTGCTCCCTCGTACAGGGCCCCCAGCAGGGTGTGCAGTGAGGCCAGCAGACTGTGTAGCTCCTGCTCCCAGCTCTCGG | ENSG00000141456 | AC091153.1 | 17:4574680-4607632
1091 | CATGGCAGTCACCATCTTCTGAGGACAGCAGCATCTCCACCCCAGGGCCTTCATTCTGCACAGGAGCAGTCTCTGCTCCCTCGTACAGGGCCCCCAGCAGGGTGTGCAGTGAGGCCAGCAGACTGTGTAGCTCCTGCTCCCAGCTCTCGGT | ENSG00000141456 | AC091153.1 | 17:4574680-4607632
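Read counts like these translate directly into an on-target percentage. A small sketch, assuming FFAR2 is the intended target and using only the five rows listed on the slide (any further rows are not shown, so the result is illustrative):

```python
# Read counts copied from the five rows shown above: (reads, official_symbol).
rows = [
    (2843, "FFAR2"),
    (1897, "FFAR2"),
    (1535, "AC091153.1"),
    (1097, "AC091153.1"),
    (1091, "AC091153.1"),
]

def on_target_fraction(rows, target):
    """Fraction of reads whose annotated gene equals the intended target."""
    total = sum(reads for reads, _ in rows)
    on_target = sum(reads for reads, gene in rows if gene == target)
    return on_target / total

frac = on_target_fraction(rows, "FFAR2")
print(f"{frac:.1%} of the listed reads map to FFAR2")
```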
Wet lab validation
• PCR composition
• total volume: 5 µl
• instrument: CFX384 (with automation)
• mastermix: SsoAdvanced SYBR
• primer conc: 250 nM each
• PCR program
• default cycling protocol for SsoAdvanced SYBR (Ta=60°C)
• Samples
• cDNA: 25 ng (total RNA equivalents – Agilent Universal human reference RNA = MAQC A)
• gDNA: 2.5 ng (Roche)
• NTC: water + carrier (5 ng/μl yeast transfer RNA)
• synthetic template (pooled 60-mers in concentration range: 20 000 000 to 20 copies)
setup
Wet lab validation
• lab validation of 103 053 assays (human, mouse and rat coding genes)
• 1 456 142 reactions
• 3 822 PCR plates (384-well)
• equivalent to 15 288 PCR plates (96-well)
some numbers
305 m
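The plate equivalence follows from simple arithmetic (one 384-well plate has as many wells as four 96-well plates). A quick check using only the numbers on the slide:

```python
reactions = 1_456_142
plates_384 = 3_822

# One 384-well plate holds as many wells as four 96-well plates.
plates_96_equivalent = plates_384 * (384 // 96)
mean_reactions_per_plate = reactions / plates_384

print(plates_96_equivalent)             # 15288
print(round(mean_reactions_per_plate))  # 381
```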
Amplification efficiency
• initial publication: Vermeulen et al., Nucleic Acids Research, 2009
• Biogazelle approach (easy & cost effective)
• 60-mer
• no modifications, standard desalted
• 7 points dilution series: 20 000 000 > 20 molecules
• equivalent to full length double stranded template
• limitation: the behavior of the first cycles when amplifying from cDNA is not evaluated
synthetic templates
30 nt 3’ 30 nt 5’
measure | ds template | ss oligo
r² < 0.99 | 1 | 1
median E | 2.00 | 2.01
average E | 2.00 | 2.01
count E outside [1.90-2.10] | 1 | 3
paired t-test p-value: 0.14
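Efficiency is commonly derived from a standard curve: Cq is regressed on log10 of the input amount, and E = 10^(-1/slope), so E = 2.00 corresponds to 100% efficiency. A minimal sketch, assuming the 7 points are 10-fold steps (which is what 20 000 000 down to 20 molecules implies) and using synthetic Cq values for an ideal assay rather than measured data:

```python
import math

def standard_curve(copies, cq):
    """Least-squares fit of Cq against log10(copies).
    Returns slope, amplification base E = 10**(-1/slope), and r squared."""
    x = [math.log10(c) for c in copies]
    n = len(x)
    mx, my = sum(x) / n, sum(cq) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in cq)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, cq))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, 10 ** (-1 / slope), r2

# 7-point, 10-fold series from 20 000 000 down to 20 molecules.
copies = [20_000_000 // 10 ** i for i in range(7)]

# Illustrative Cq values for an ideal assay (exact doubling per cycle);
# the intercept of 38 cycles is an arbitrary assumption.
cq = [38 - math.log2(c) for c in copies]

slope, E, r2 = standard_curve(copies, cq)
print(f"slope={slope:.3f}  E={E:.2f}  efficiency={(E - 1):.0%}  r2={r2:.3f}")
```

An assay with E outside [1.90-2.10], or with r² below 0.99, would be flagged under the criteria listed in the table above.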
Amplification efficiency distribution (n = 50 133)
[Figure: amplification efficiency distribution; labels on the plot: "89%" and "redesign" (twice)]
Specificity
• amplicon sizing (+ melt analysis for SYBR assays)
• limited sensitivity for detecting low level non-specific coamplification
• failure to observe non-specific amplification of sequences with similar size and/or Tm e.g. expressed pseudogenes or homologous genes
• Next level of specificity assessment
• in silico specificity predictions by BiSearch
• massively parallel sequencing of pooled PCR products
• average coverage > 1000-fold → lab specificity > 99.9%
• 50 – 200 times more sensitive than size analysis and Sanger sequencing
NGS for increased sensitivity
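One plausible reading of the link between coverage and specificity: with N reads for an amplicon, a single off-target read is the smallest contamination that can be seen, so specificity can be resolved up to 1 - 1/N. A sketch under that assumption (the function name is illustrative):

```python
def max_resolvable_specificity(coverage):
    """With `coverage` reads for one amplicon, a single off-target read is the
    smallest non-zero contamination that can be observed, i.e. 1/coverage."""
    return 1 - 1 / coverage

for coverage in (100, 1000, 10000):
    print(coverage, f"{max_resolvable_specificity(coverage):.2%}")
```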
Specificity: most assays are 100% on-target
[Figure: % on-target per assay, axis 0% to 100%]
2/3 of non-specific assays may go unnoticed without NGS
[Figure: bar chart over bins 0 < x < 0.1 through 0.9 < x < 1, axis 0% to 60%]
Specificity
• perfect: 60 293 (86%)
• acceptable (<10% non-specific): 5 866 (8%)
• predicted non-specificity (no specific design found): 1 204 (2%)
• failing specificity QC criteria: 2 467 (4%)
the power of in silico verification
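The percentages can be re-derived from the counts. A short check using only the numbers on the slide:

```python
counts = {
    "perfect": 60_293,
    "acceptable (<10% non-specific)": 5_866,
    "predicted non-specificity (no specific design found)": 1_204,
    "failing specificity QC criteria": 2_467,
}

total = sum(counts.values())
# Share of each category, rounded to whole percent as on the slide.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} of {total} = {shares[name]}%")
```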
MIQE compliant PrimePCR assay validation data sheet
Dynamic range
> 10 000 000 fold
[Figure: gene count (axis 0 to 5000) versus copies per cell, for human, mouse and rat]
SEQC
• multisite, cross-platform analysis of RNAseq
• FDA-sponsored and FDA-guided (MAQC-III)
• Nature Biotechnology, Sept 2014: Focus on RNA sequencing quality control (SEQC), with 2 Biogazelle co-authors
• MAQC samples: reference RNA with built-in controls (known truths)
• > 100 billion reads
• compared against qPCR (PrimePCR)
RNAseq vs PrimePCR: differential expression
• 454: 0.83 (13,190 genes)
• ILMN: 0.89 (16,264 genes)
• PGM: 0.86 (14,981 genes)
• PRO: 0.89 (16,242 genes)
qPCR (PrimePCR) vs RNAseq (Illumina)
r² = 75% for genes detected by both platforms
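An r² restricted to genes detected by both platforms could be computed along these lines. This is a sketch only: the gene names and values are hypothetical toy data, and comparing Cq against log2 read counts is an assumption about the comparison, not the documented SEQC procedure.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical toy data: gene -> (qPCR Cq or None if undetected, RNAseq read count).
genes = {
    "geneA": (20.1, 5200),
    "geneB": (24.8, 310),
    "geneC": (None, 12),   # not detected by qPCR
    "geneD": (29.5, 9),
    "geneE": (33.0, 0),    # not detected by RNAseq
    "geneF": (22.3, 1500),
}

# Keep only genes detected by both platforms, then compare Cq with log2(reads).
shared = [(cq, math.log2(n)) for cq, n in genes.values() if cq is not None and n > 0]
r2 = r_squared([cq for cq, _ in shared], [lg for _, lg in shared])
print(f"{len(shared)} shared genes, r2 = {r2:.2f}")
```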
Saturation analysis
preparation | sample | libraries | reads | GENCODE12 mapping | PrimePCR mapping
ribo-depleted | MAQC A | 22 | 5 304 M | 1 955 M (37%) | 1 692 M (32%)
ribo-depleted | MAQC B | 17 | 3 370 M | 1 447 M (43%) | 1 193 M (35%)
poly-A–enriched | MAQC A | 4 | 427 M | 291 M (68%) | 278 M (65%)
poly-A–enriched | MAQC B | 4 | 446 M | 323 M (72%) | 297 M (67%)
ABRF-NGS dataset
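The mapping percentages are the mapped reads divided by the total reads per row. A short check using the numbers from the table (in millions of reads):

```python
# (preparation, sample, total reads, GENCODE12-mapped, PrimePCR-mapped), in millions
rows = [
    ("ribo-depleted", "MAQC A", 5304, 1955, 1692),
    ("ribo-depleted", "MAQC B", 3370, 1447, 1193),
    ("poly-A-enriched", "MAQC A", 427, 291, 278),
    ("poly-A-enriched", "MAQC B", 446, 323, 297),
]

def pct(part, whole):
    """Percentage rounded to a whole number, as shown on the slide."""
    return round(100 * part / whole)

mapping = [(prep, sample, pct(g, total), pct(p, total))
           for prep, sample, total, g, p in rows]

for prep, sample, gencode_pct, primepcr_pct in mapping:
    print(f"{prep} {sample}: GENCODE12 {gencode_pct}%, PrimePCR {primepcr_pct}%")
```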
Saturation analysis ribo-depletion RNAseq - % of GENCODE12
[Figure: % (0% to 100%) versus axis values from 125 000 to 4 096 000 000 in doubling steps; series: MAQC A - detection, MAQC B - detection, MAQC A - quantification, MAQC B - quantification]
Saturation analysis poly-A RNAseq - % of GENCODE12
[Figure: same axes; series: MAQC A - detection, MAQC B - detection, MAQC A - quantification, MAQC B - quantification]
Saturation analysis ribo-depletion RNAseq for MAQC A - GENCODE12 vs PrimePCR
[Figure: same axes; series: GENCODE12 detection, primePCR detection, GENCODE12 quantification, primePCR quantification]
Saturation analysis MAQC A - % of PrimePCR
[Figure: same axes; series: ribo-depletion RNAseq - detection, poly-A RNAseq - detection, qPCR - detection, ribo-depletion RNAseq - quantification, poly-A RNAseq - quantification, qPCR - quantification]
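The idea behind a saturation curve can be sketched as follows: for each sequencing depth, estimate the fraction of genes expected to collect at least a threshold number of reads. The sketch assumes Poisson-distributed read counts; the gene abundances and the thresholds (1 read for detection, 10 for quantification) are hypothetical and not the criteria used in the study.

```python
import math

def poisson_sf(k_min, lam):
    """P(X >= k_min) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(k_min))
    return 1 - cdf

def expected_fraction(gene_fractions, depth, min_reads):
    """Expected share of genes receiving at least `min_reads` reads at `depth`."""
    probs = [poisson_sf(min_reads, depth * f) for f in gene_fractions]
    return sum(probs) / len(probs)

# Depths in doubling steps: 125 000 ... 4 096 000 000 reads.
depths = [125_000 * 2 ** i for i in range(16)]

# Hypothetical genes, each taking 1e-3 ... 1e-9 of all reads.
gene_fractions = [10.0 ** -e for e in range(3, 10)]

detection = [expected_fraction(gene_fractions, d, 1) for d in depths]
quantification = [expected_fraction(gene_fractions, d, 10) for d in depths]

for d, det, quant in list(zip(depths, detection, quantification))[::5]:
    print(f"{d:>13,d} reads: detected {det:.0%}, quantifiable {quant:.0%}")
```

Under these assumptions the quantification curve always lies at or below the detection curve, and low-abundance genes only appear at the deepest settings.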
Confirmation rate of novel junctions
junction prediction | junctions | confirmed | confirmation rate
multiple algorithms (Cstar + Magic + Subread) | 136 | 136 | 100%
single algorithm | 24 | 20 | 83%
• novel exon
• one of the primers in the novel exon
• novel junction
• one of the primers overlaps the novel junction, with ≥ 5 bases on either side of the junction
• size analysis to confirm expected size for novel transcripts
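The primer placement rule and the confirmation rates can be expressed in a few lines. A sketch, assuming 0-based transcript coordinates with half-open primer intervals (a convention chosen here for illustration):

```python
def spans_junction(primer_start, primer_end, junction_pos, min_flank=5):
    """True if a primer covering transcript positions [primer_start, primer_end)
    has at least `min_flank` bases on each side of a junction that lies between
    positions junction_pos - 1 and junction_pos."""
    upstream = junction_pos - primer_start
    downstream = primer_end - junction_pos
    return upstream >= min_flank and downstream >= min_flank

print(spans_junction(90, 110, 100))  # 10 bases on each side
print(spans_junction(97, 117, 100))  # only 3 bases upstream
print(spans_junction(70, 90, 100))   # primer does not reach the junction

# Confirmation rates from the table above: (label, junctions, confirmed).
results = [("multiple algorithms", 136, 136), ("single algorithm", 24, 20)]
rates = {label: confirmed / junctions for label, junctions, confirmed in results}
for label, rate in rates.items():
    print(f"{label}: {rate:.0%}")
```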
Conclusions - I
• Assay design and in silico verification
• Transcript coverage
• SNPs and secondary structures
• Specificity prediction
• Empirical assay validation
• Efficiency in 90-110% range
• Stringent specificity analysis by massively parallel amplicon sequencing
• validated assays for human, mouse & rat coding genes (PrimePCR)
Conclusions - II
• qPCR based transcriptome profiling
• Samples from MAQC/SEQC study
• PCR data as benchmark for evaluation of RNAseq
• qPCR benefits: high sensitivity and large dynamic range
• good correlation with RNAseq results
• for individual genes, RNAseq at ≤ 100 M reads gives lower sensitivity than qPCR
• the majority of novel junctions identified by RNAseq can be confirmed by qPCR