Part 1 of RNA-seq for DE analysis: Defining the goal
First part of the training session 'RNA-seq for Differential expression' analysis. We explain how we can detect differential expression based on RNA-seq data. Interested in following this session? Please contact http://www.jakonix.be/contact.html
![Page 1: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/1.jpg)
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
RNA-seq for DE analysis training
Defining the goal of RNA-seq analysis for differential expression
Joachim Jacob, 22 and 24 April 2014
![Page 2: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/2.jpg)
2 of 74
With great power comes great responsibility
You can't do it all.
RNA-seq is powerful, so we have to aim for a specific goal.
Our goal is to detect differential expression at the gene level.
![Page 3: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/3.jpg)
3 of 74
With great power comes great responsibility
RNA-seq enables one to
1) get an idea of which genes are active
2) quantify expression of each transcript
3) quantify alternative splicing
… (use your imagination)
Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
![Page 4: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/4.jpg)
4 of 74
Differential expression: useful?
What are we looking for? Explanations of observed phenotypes
yeast
GDA
Yeast mutant
GDA + vit C
why?
![Page 5: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/5.jpg)
5 of 74
The central dogma
<What?>
yeast
GDA
Yeast mutant
GDA + vit C
?
causes the phenotypic differences
![Page 6: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/6.jpg)
6 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
Difference in protein activity causes the phenotypic differences
![Page 7: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/7.jpg)
7 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
Presence/concentration of proteins in a cell causes the phenotypic differences
![Page 8: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/8.jpg)
8 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Different regulation of protein production causes the phenotypic differences
![Page 9: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/9.jpg)
9 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Level of templates for protein production causes the phenotypic differences
![Page 10: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/10.jpg)
10 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Level of mRNA copies causes the phenotypic differences
![Page 11: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/11.jpg)
11 of 74
Does the reason for measuring DE make sense?
Difference in protein activity
Level of mRNA copies
Level of templates for protein production
Level of protein production
Presence/concentration of proteins in a cell
Phenotype
![Page 12: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/12.jpg)
12 of 74
Problem reduction
We can measure mRNA levels (much easier than protein levels).
So we measure mRNA.
The level of mRNA is a proxy for the level of protein activity causing the aberrant phenotype.
![Page 13: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/13.jpg)
13 of 74
How to measure mRNA levels
1. Q-PCR (real-time): a lot of work to measure few genes, in a relatively wide array of tissues. Very accurate.
2. Microarray: an easier way to measure many predefined genes in a relatively wide array of tissues. Robust.
3. RNA-seq
![Page 14: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/14.jpg)
14 of 74
Sequence-based measuring: high level view
● Get your sample
● Lyse the cells and extract RNA
● Convert the RNA to cDNA
● The cDNA pool gets sequenced
The result is sequence information from scratch. No prior information is needed.
Yeast sample
Comprehensive comparative analysis of strand-specific RNA sequencing methods http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html
Comparative analysis of RNA sequencing methods for degraded or low-input samples http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
![Page 15: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/15.jpg)
15 of 74
RNA-seq is not a new idea
● ESTs: expressed sequence tags, ideal for discovery of new genes.
● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA
http://www.montana.edu/observatory/people/mcdermottlab.html
![Page 16: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/16.jpg)
16 of 74
RNA-seq is not a new idea
● ESTs: expressed sequence tags, ideal for discovery of new genes.
● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA
http://www.sagenet.org/findings/index.html
![Page 17: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/17.jpg)
17 of 74
RNA-seq is not a new idea
● ESTs: expressed sequence tags
● SAGE: serial analysis of gene expression
Low throughput: long sequence information, but for only ~thousands of genes.
![Page 18: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/18.jpg)
18 of 74
Concept of measuring with RNA-seq
Extract mRNA and turn it into cDNA
Fragment, ligate adaptor, amplify, size selection.
Put a fraction of the pool on a high throughput sequencer to read fragments.
One template of protein production, mRNA
Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478
GeneA GeneB GeneC
cell
nucleus
DNA
![Page 19: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/19.jpg)
19 of 74
Every step means some loss
Yeast sample
![Page 20: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/20.jpg)
20 of 74
RNA-seq numbers might explain phenotype
Phenotype
Proteins
mRNA levels
cDNA pool
RNA-seq read numbers
Represent the cDNA pool we've created
Represent the RNA pool we've extracted
Are a proxy for protein activity
Define the phenotype
![Page 21: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/21.jpg)
21 of 74
So many steps where our assumption can fail
Phenotype
Proteins
mRNA levels
cDNA pool
RNA-seq read numbers
Protein activity is regulated: phosphorylation, ubiquitination,...
mRNA templates have different rates of protein production: availability of tRNAs, rate of mRNA degradation, alternative splicing events,...
Loss during RNA extraction; about 90% of the RNA in a cell is rRNA; ligation of adapters and conversion to cDNA are not 100% efficient.
Failure to map reads to the correct gene, lane-specific biases when reading cDNA fragments,...
![Page 22: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/22.jpg)
22 of 74
Consequence: focus on comparison
Phenotype A
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Phenotype B
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Possibly due to differences in expression
![Page 23: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/23.jpg)
23 of 74
Consequence: focus on comparison
Phenotype A
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Phenotype B
Proteins
mRNA pool
cDNA pool
RNA-seq reads
DESIGN OF EXPERIMENT
![Page 24: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/24.jpg)
24 of 74
Comparing read numbers per gene
GeneA GeneB GeneC
sample
RNA-seq
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
OUR QUESTION
![Page 25: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/25.jpg)
25 of 74
Interpreting the counts for our goal
Our focus: which genes are differentially expressed between different conditions?
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
![Page 26: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/26.jpg)
26 of 74
Experimental design
Our focus: which genes are differentially expressed between different conditions?
“How can we detect genes for which the counts of reads change between conditions more systematically than as expected by chance”
We must design an experiment in which we can test this deviance from chance.
Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010, 11:220 http://genomebiology.com/2010/11/12/220
![Page 27: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/27.jpg)
27 of 74
How many reads to sequence?
In other words: how deep to sequence? What is the required 'depth of sequencing'?
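[Figure: two samples, each sequenced with RNA-seq into read counts for GeneA, GeneB and GeneC.]
The final test will look at ratios:

| | GeneA | GeneB | GeneC |
|---|---|---|---|
| Sample 1 (reads) | 6 | 5 | 3 |
| Sample 2 (reads) | 5 | 6 | 4 |
| Ratio | 1.2 | 0.83 | 0.75 |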
![Page 28: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/28.jpg)
28 of 74
How many reads to sequence?
The lowest and the highest gene count typically differ by a factor of 10^5 (5 orders of magnitude). This is called the dynamic range.
Linear scale is useless. The logarithmic scale is better.
Wait! Something's not correct here!
![Page 29: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/29.jpg)
29 of 74
Zero remains zero!
We are working with counts: non-negative integers, so any observed non-zero count is at least 1. A gene with zero counts is either not yet sampled (sequencing not deep enough) or not expressed in that condition.
It is not a full logarithmic scale. It starts at zero.
0
![Page 30: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/30.jpg)
30 of 74
How do the numbers change?
Assuming equal sequencing depth in the samples, and these counts: do all these genes differ in expression?

| Gene | Sample 1 | Sample 2 | Ratio |
|---|---|---|---|
| GeneA | 5 | 10 | 2 |
| GeneB | 15 | 30 | 2 |
| GeneC | 40 | 80 | 2 |
| GeneD | 100 | 200 | 2 |
| GeneE | 1000 | 2000 | 2 |
| GeneZ | 1 | 2 | 2 |
![Page 31: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/31.jpg)
31 of 74
How do the numbers change?
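| Gene | Sample 1 | Sample 2 | Ratio |
|---|---|---|---|
| GeneA | 11 | 10 | 0.91 |
| GeneB | 11 | 30 | 2.73 |
| GeneC | 60 | 80 | 1.33 |
| GeneD | 79 | 200 | 2.53 |
| GeneE | 1150 | 2000 | 1.74 |
| GeneZ | 5 | 1 | 0.20 |

A ratio of 2? Is there a trend in how these numbers change?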
Sequencing the result of the same steps again is called a technical replicate.
![Page 32: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/32.jpg)
32 of 74
Technical replicates
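| Gene | Sample (run 1) | Sample (run 2) | Sample (run 3) | Sample (run 4) |
|---|---|---|---|---|
| GeneA | 11 | 5 | 4 | 4 |
| GeneB | 11 | 16 | 14 | 8 |
| GeneC | 60 | 45 | 32 | 38 |
| GeneD | 79 | 102 | 95 | 110 |
| GeneE | 1150 | 1023 | 987 | 1005 |
| GeneZ | 3 | 0 | 0 | 1 |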
We take the same cDNA pool and sequence it several times: technical replicates.
![Page 33: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/33.jpg)
33 of 74
The Poisson distribution
The counts of technical replicates follow a Poisson distribution (Marioni et al. 2008). The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
[Figure, from Wikipedia: three Poisson distributions. These can be 3 different genes, each with its own Poisson distribution. Lambda is the mean number of reads of the gene's distribution. X-axis: number of reads. Y-axis: probability of observing that number of reads.]
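The figure description above can be reproduced numerically. The following is a minimal sketch, not part of the original slides: it evaluates the Poisson probabilities for three hypothetical genes whose mean counts (lambda values of 1, 4 and 10) are chosen for illustration only.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k reads when the mean count is lam."""
    # Work in log space so that large k or lam do not overflow.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Three hypothetical genes; the lambda values are assumptions for illustration.
for lam in (1, 4, 10):
    probs = [poisson_pmf(k, lam) for k in range(21)]
    most_likely = max(range(21), key=lambda k: probs[k])
    print(f"lambda={lam:>2}: P(0 reads)={probs[0]:.3f}, most likely count={most_likely}")
```

Note how a gene with a small lambda has a substantial probability of yielding zero reads even though it is expressed.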
![Page 34: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/34.jpg)
34 of 74
The Poisson distribution
So when we have 4 technical replicates sequenced to a large depth (say 10 M reads), we can get, by chance, these numbers for 3 different genes.
GeneA 0, 0, 1, 3
GeneB 2, 3, 4, 7
GeneC 8, 9, 11, 14
![Page 35: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/35.jpg)
35 of 74
Working the intuition
How many blue balls?
How many red balls?
Draw 10
Draw 10 more
Draw 10 more
Estimate how large the fraction is in the set.
![Page 36: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/36.jpg)
36 of 74
The intuition with the balls
| Color | 10 draws | 20 draws | 30 draws | 40 draws |
|---|---|---|---|---|
| Blue | | | | |
| Red | | | | |
| No color | | | | |
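The ball-drawing exercise can also be simulated. This is a small sketch under stated assumptions: the pool composition (30% blue, 2% red) is invented for illustration, and balls are drawn with replacement, which approximates sampling from a very large pool.

```python
import random

def running_estimates(pool, colour, draws_per_round=10, rounds=4, seed=0):
    """Estimate the fraction of `colour` after each round of draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    drawn = []
    estimates = []
    for _ in range(rounds):
        # Draw with replacement: a simplification of sampling a huge pool.
        drawn.extend(rng.choice(pool) for _ in range(draws_per_round))
        estimates.append(drawn.count(colour) / len(drawn))
    return estimates

# Hypothetical pool: blue is common, red is rare, the rest has no colour.
pool = ["blue"] * 300 + ["red"] * 20 + ["none"] * 680

for colour in ("blue", "red"):
    true_fraction = pool.count(colour) / len(pool)
    estimates = running_estimates(pool, colour)
    print(colour, f"true={true_fraction:.2f}", [f"{e:.2f}" for e in estimates])
```

Running it with different seeds shows that the estimate for the common colour settles sooner than the estimate for the rare one.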
![Page 37: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/37.jpg)
37 of 74
Conclusion of the experiment
The bigger the fraction in the pool, the sooner (i.e. with less sequencing depth) we are certain about the estimate of that fraction.
For lower counts, the variance is relatively bigger than the variance for higher counts.
Under Poisson: estimate = count; variance = count
CV (coefficient of variation) = sqrt(count)/count = 1/sqrt(count)
Genes with lower expression need much deeper sequencing than genes with higher expression levels.
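As a quick illustration of the CV formula under the Poisson assumption, the sketch below prints the relative uncertainty for a few example counts (the counts themselves are arbitrary).

```python
import math

def poisson_cv(count: float) -> float:
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    if count <= 0:
        raise ValueError("count must be positive")
    return math.sqrt(count) / count

# Arbitrary example counts, from barely detected to highly expressed.
for count in (1, 4, 25, 100, 10_000):
    print(f"count={count:>6}: CV={poisson_cv(count):.1%}")
```

To halve the relative uncertainty of a gene, its count (and hence the sequencing depth) has to increase fourfold.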
![Page 38: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/38.jpg)
38 of 74
Comparing counts
“Here we show the overlap of Poisson distributions of single measurements at different read counts. Because relative Poisson uncertainty is high at low read counts, a count of 1 versus 2 has very little power to discriminate a true 2X fold change, though at higher counts a 2X fold change becomes significant.
In an actual experiment, the width of the distribution would be greater due to additional biological and technical uncertainty, but the uncertainty to the mean expression would narrow with each additional replicate.”
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
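The overlap described in the quoted passage can be approximated directly. The sketch below is not the computation used by Scotty; it simply sums, over all possible counts, the smaller of the two Poisson probabilities (the overlap coefficient) for a 2X fold change at a low and at a high count.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k reads for mean lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_overlap(lam_a: float, lam_b: float) -> float:
    """Shared probability mass of two Poisson distributions (1.0 = identical)."""
    top = max(lam_a, lam_b)
    upper = int(top + 20 * math.sqrt(top) + 20)  # far enough into the tail
    return sum(min(poisson_pmf(k, lam_a), poisson_pmf(k, lam_b))
               for k in range(upper + 1))

print(f"1 vs 2 reads:     overlap = {poisson_overlap(1, 2):.3f}")
print(f"100 vs 200 reads: overlap = {poisson_overlap(100, 200):.2e}")
```

A large overlap means a single measurement cannot tell the two expression levels apart; a negligible overlap means it can.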
![Page 39: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/39.jpg)
39 of 74
Comparing technical replicates
Risso et al. “GC-Content Normalization for RNA-Seq Data”, BMC Bioinformatics 2011, 12:480
http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)
[Figure: mean-variance plot of technical replicates; both axes show the log2 of the counts. Annotations: the correlation between mean and variance according to Poisson, and a lowess fit through the data.]
![Page 40: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/40.jpg)
40 of 74
But Poisson does not seem to fit
When we extend the samples to real biological samples, this mean-variance relationship does not hold...
Plotted using the EDASeq package in R.
![Page 41: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/41.jpg)
41 of 74
But Poisson does not seem to fit
When we extend the samples to real biological samples, this mean-variance relationship does not hold!
Plotted using the EDASeq package in R.
Reasonable fit
Something is going on!
![Page 42: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/42.jpg)
42 of 74
An extra source of variation
Relative to the Poisson distribution, the counts are 'overdispersed': between biological replicates, the variance at higher counts is bigger than Poisson predicts.
Plotted using the EDASeq package in R.
Something is going on!
![Page 43: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/43.jpg)
43 of 74
An extra source of variation
For Poisson: CV = std dev / mean, so

$$\mathrm{CV}^2 = \frac{1}{\mu}$$

If an additional distribution is involved (also dependent on π, the fraction of the gene in the cDNA pool), we have a mixture of distributions:

$$\mathrm{CV}^2 = \underbrace{\frac{1}{\mu}}_{\text{low counts!}} + \underbrace{\phi}_{\text{dispersion}}$$

Generalization of Poisson with this extra parameter: the Negative Binomial Model fits better!
![Page 44: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/44.jpg)
44 of 74
The negative binomial model
The NB model fits observed expression data of RNA-seq better. It is a generalization of Poisson, and 2 parameters need to be estimated (μ and φ)
Counts (gene g in sample j) have:

$$\text{Mean} = \mu_{gj}$$

$$\text{Variance} = \mu_{gj} + \phi_g \, \mu_{gj}^2$$

$$\text{Biological CV}^2 = \phi_g \;\Rightarrow\; \text{Biological CV} = \sqrt{\phi_g}$$

Methods differ in how they estimate this dispersion per gene. It can only be measured with true biological replicates.
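The mean-variance relationship of the NB model can be written out as a small function. This is a sketch for intuition only; the dispersion value of 0.04 (a biological CV of 20%) is a hypothetical example, not an estimate from data.

```python
def nb_variance(mu: float, phi: float) -> float:
    """Negative binomial variance: Poisson part (mu) plus dispersion part (phi * mu^2)."""
    return mu + phi * mu ** 2

def total_cv2(mu: float, phi: float) -> float:
    """Squared coefficient of variation: variance / mean^2 = 1/mu + phi."""
    return nb_variance(mu, phi) / mu ** 2

phi = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
for mu in (5, 50, 500, 5000):
    print(f"mean={mu:>5}: Poisson variance={mu:>5}, "
          f"NB variance={nb_variance(mu, phi):>10.1f}, CV^2={total_cv2(mu, phi):.4f}")
```

With phi set to 0 the function falls back to the Poisson variance, which is why the NB model is called a generalization of Poisson.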
![Page 45: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/45.jpg)
45 of 74
Variation summary, intuitively
Total CV² = Technical CV² + Biological CV²
For low counts, the Poisson (technical) variation or the measurement error is dominant.
For higher counts, the Poisson variation gets smaller, and another source of variation becomes dominant, the dispersion or the biological variation. Biological variation does not get smaller with higher counts.
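As a worked example of where the dominant source of variation switches, set the technical and the biological term equal. Here μ\* denotes the crossover count (a symbol introduced for this example), and the dispersion value is hypothetical:

$$\underbrace{\frac{1}{\mu^*}}_{\text{technical CV}^2} = \underbrace{\phi}_{\text{biological CV}^2} \quad\Longrightarrow\quad \mu^* = \frac{1}{\phi}$$

$$\phi = 0.04 \ (\text{biological CV} = 20\%) \quad\Longrightarrow\quad \mu^* = \frac{1}{0.04} = 25 \text{ reads}$$

Under this assumption, genes with fewer than about 25 reads are dominated by Poisson noise, and genes with more reads are dominated by biological variation.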
![Page 46: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/46.jpg)
46 of 74
Beyond the NB model
It appears from analysis of many biological replicates (#=69) that not every gene can be modeled as NB: the Poisson-Tweedie model provides a further generalisation and a better fit for many genes (with an additional shape parameter).
Left figure: raw data shows that about 26% of the genes fit a NB model. Depending on the estimated shape parameter, other distributions fit better.
Esnaola et al. BMC Bioinformatics 2013, 14:254 http://www.biomedcentral.com/1471-2105/14/254
![Page 47: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/47.jpg)
47 of 74
Consequence for our design
● For low counts: the uncertainty is big due to Poisson variation.
● For high counts: the uncertainty is big due to biological variation. Highly expressed genes differ more in their natural variation (regulated by cellular processes) than lowly expressed genes.
● If we focus on the ratios between the conditions: is it reasonable to set a restriction on fold change? Highly expressed genes can have a smaller fold change and still be significant. Lowly expressed genes can exceed a fold change of 2 without being significant.
![Page 48: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/48.jpg)
48 of 74
Consequence on fold change
The fold-change cut-off routinely applied in microarray analysis is not useful in RNA-seq.
Blue and red: known DE genes
Volcano plot
These often-applied cut-offs can prevent the detection of DE genes.
![Page 49: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/49.jpg)
49 of 74
Just remember...
We need to estimate the model behind the count.
Never work without biological replicates.
Never work with 2 biological replicates.
Try avoiding working with 3 biological replicates.
Go for at least 4 biological replicates.
![Page 50: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/50.jpg)
50 of 74
Break?
![Page 51: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/51.jpg)
51 of 74
Overview
[Figure: six samples (Sample 1 to Sample 6), each sequenced with RNA-seq into read counts for GeneA, GeneB and GeneC. The samples are divided over Condition X and Condition Y.]
![Page 52: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/52.jpg)
52 of 74
Factors influencing read count
Obviously, the number of reads is dependent on:
1. chance
   → Define the count model (NB) from replicates
2. the expression level of the gene
   → Compare the ratios with a test
3. the total number of reads generated
4. the length of the transcript
![Page 53: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/53.jpg)
53 of 74
Library size influences read counts
The number of reads depends on the total number of reads generated. If one library is sequenced to 20M reads and another one to 40M, most genes will roughly double their counts.
[Figure: the same sample with counts for GeneA, GeneB and GeneC, once after RNA-seq and once after more RNA-seq.]
![Page 54: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/54.jpg)
54 of 74
Normalization for library size
Naive approach: divide by the total library size. This is no longer applied!
Why not? Composition matters!
2 things to remember:
- zero-sum system, or “every gene we measure takes up a part (at least one read) of the total library”
- 5 orders of magnitude
![Page 55: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/55.jpg)
55 of 74
Normalization for library size
Every gene takes up at least one read. But in every sample, a lot of reads are spent on a few extremely highly expressed genes. The reason is unknown, and these genes often differ between samples. This biases average-based (naïve) normalization attempts.
[Figure: comparing 2 samples. X-axis: average count (log2). Y-axis: count difference (log2 ratio).]
![Page 56: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/56.jpg)
56 of 74
Normalization for library size
Schematically: when normalized on library size (the squares represent the number of reads).
[Figure: all counts for library A next to all counts for library B, each split into 'few genes with enormous counts' and 'rest of the genes'.]
![Page 57: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/57.jpg)
57 of 74
Normalization for library size
A better normalization is shown in the figure. DESeq2 and edgeR apply such an approach (see later).
[Figure: the 'rest of the genes' of both libraries, each scaled to 100%.]
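To make the composition problem concrete, here is a minimal sketch that contrasts scaling by total library size with a median-of-ratios scaling, the idea behind the size factors described for DESeq (edgeR uses a different robust method, TMM). The toy counts and function names are invented for illustration; this is not the code of either package.

```python
import math
from statistics import median

def total_count_factors(samples):
    """Naive scaling: each library's total relative to the mean total."""
    totals = [sum(s) for s in samples]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def median_of_ratios_factors(samples):
    """Robust scaling: median ratio of each gene to its geometric mean across samples."""
    n_genes = len(samples[0])
    log_geo_means = []
    for g in range(n_genes):
        values = [s[g] for s in samples]
        # Genes with a zero in any sample are skipped (their log is undefined).
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = []
    for s in samples:
        log_ratios = [math.log(s[g]) - lgm
                      for g, lgm in enumerate(log_geo_means) if lgm is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy libraries: identical, except one gene is far more abundant in library B.
lib_a = [10, 20, 30, 40, 1000]
lib_b = [10, 20, 30, 40, 5000]

print("total-count factors:     ", total_count_factors([lib_a, lib_b]))
print("median-of-ratios factors:", median_of_ratios_factors([lib_a, lib_b]))
```

In this toy case the total-count factors would make the four unchanged genes look down-regulated in library B, while the median-based factors leave them equal.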
![Page 58: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/58.jpg)
58 of 74
Gene length influences the count
“Longer transcripts generate more reads”
True! But the transcript length does not differ between samples. Since we are concerned with relative differences between samples, this needs no normalization (this story changes in case of absolute quantification).
[Figure: Gene A and Gene B, shown for Sample A and for Sample B.]
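Why the length drops out can be written as a short derivation. The symbols are introduced here for illustration: K is the read count of gene g, s the scaling of a sample (sequencing depth), q the expression level and L the transcript length, assumed identical in both samples:

$$\frac{E[K_{g,A}]}{E[K_{g,B}]} \approx \frac{s_A \, q_{g,A} \, L_g}{s_B \, q_{g,B} \, L_g} = \frac{s_A}{s_B} \cdot \frac{q_{g,A}}{q_{g,B}}$$

The length cancels, so only the library scaling remains to be normalized. When comparing different genes within one sample, the lengths differ and do not cancel.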
![Page 59: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/59.jpg)
59 of 74
The many flavours of sample variation
Some properties of libraries/samples can affect the counts and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled) and library composition.
Different libraries/samples differ sometimes in how gene properties relate to gene counts. This is called within-lane variation.
![Page 60: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/60.jpg)
60 of 74
GC-content of genes can influence counts
GC-content differs between genes. But it does not change between samples, so there should be no problem for relative expression comparison.
We can visualize the relationship between counts and GC very easily (see right). There is some trend, and it is equal for all samples.
EDAseq (R)
![Page 61: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/61.jpg)
61 of 74
GC-content of genes can influence counts
Sometimes, samples show different relationships between GC-content of the genes and the counts.
This within-lane (or intra-sample) variation needs to be corrected for, so that the differentially expressed genes in a sample are not simply the GC-rich ones.
Length can also have this effect.
![Page 62: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/62.jpg)
62 of 74
Putting our experiment together
We want to detect differentially expressed genes between 2 or more conditions.
For this, we need to apply the conditions in a controlled environment (randomisation,...).
For good testing, we need to have some biological replicates per condition.
For cost effectiveness, we determine how deep we will sequence from each sample.
We analyse the reads, get raw counts and do the test!
![Page 63: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/63.jpg)
63 of 74
On the sequencer
HiSeq2000: 24 single-index barcodes available. 1 lane gives 150-180 M reads. One lane of 50 bp SE costs approx. €1,500.
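These lane figures translate into a rough depth and cost per sample. The sketch below assumes that the reads of a lane are divided evenly over the multiplexed samples, which is an idealization; the function names are made up for this example.

```python
def reads_per_sample(lane_reads: int, n_samples: int) -> float:
    """Average reads per sample when n_samples share one lane evenly (assumption)."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return lane_reads / n_samples

def cost_per_sample(lane_cost: float, n_samples: int) -> float:
    """Sequencing cost per sample for one shared lane."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return lane_cost / n_samples

LANE_COST_EUR = 1500  # approximate price of one 50 bp SE lane, from the slide

for n_samples in (6, 12, 24):
    low = reads_per_sample(150_000_000, n_samples)
    high = reads_per_sample(180_000_000, n_samples)
    print(f"{n_samples:>2} samples/lane: {low / 1e6:.2f}-{high / 1e6:.2f} M reads each, "
          f"about EUR {cost_per_sample(LANE_COST_EUR, n_samples):.0f} per sample")
```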
![Page 64: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/64.jpg)
64 of 74
Bioinformatics analysis of the output
Quality control (QC) of raw reads
Preprocessing: filtering of reads and read parts, to help our goal of differential detection.
QC of preprocessing
Mapping to a reference genome (alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
![Page 65: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/65.jpg)
65 of 74
Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads and read parts, to help our goal of differential detection.
QC of preprocessing
Mapping to a reference genome (alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
![Page 66: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/66.jpg)
66 of 74
Overview
Anders et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. 2013 http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
![Page 67: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/67.jpg)
67 of 74
The numbers get reduced with every step
20M
25M
15M
~16%
~5%
~10%
~30% decrease from sequenced reads to counted reads.
![Page 68: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/68.jpg)
68 of 74
Deeper, or more replicates?
Variance will be lower with more reads: but sequencing another biological replicate is preferred over sequencing deeper, or technical reps.
Busby et al. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Doi: 10.1093/bioinformatics/btt015
![Page 69: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/69.jpg)
69 of 74
There is a tool to help you set up
![Page 70: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/70.jpg)
70 of 74
Scotty: power analysis
'How many samples and how deep in order to minimize false negatives?'
Power: the probability to reject the null hypothesis if the alternative is true. A null hypothesis is always a scenario in which there is no difference, hence no differential expression.
Check the BITS wiki:
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
![Page 71: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/71.jpg)
71 of 74
Help with design
http://wiki.bits.vib.be/index.php/RNAseq_toolbox http://rnaseq.uoregon.edu/exp_design.html
![Page 72: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/72.jpg)
72 of 74
How many samples to sequence?
→ Scotty exercise
![Page 73: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/73.jpg)
73 of 74
Keywords
A read count of a gene is dependent on:
1. chance
2. expression level
3. transcript length
4. depth of sequencing
5. GC-content
Poisson distribution
Negative binomial distribution
Condition
Sample
Normalization
Write in your own words what the terms mean
![Page 74: Part 1 of RNA-seq for DE analysis: Defining the goal](https://reader033.vdocuments.mx/reader033/viewer/2022052906/5589cf41d8b42a4a578b45b6/html5/thumbnails/74.jpg)
74 of 74
Reads
All my references available at: https://www.zotero.org/groups/dernaseq/items