Introduction to Statistics II
Statistics for Microarray Data
Background
• Few observations made by a black box
• What is the distribution behind the black box?
• E.g., with what probability will it output a number bigger than 5?
Parameters of the black box: μ, σ²
Approach
• Easy to determine with many observations
• With few observations:
  – Assume a canonical distribution based on prior knowledge
  – Determine the parameters of this distribution (e.g., mean, variance) using the observations
Estimating the mean
$\bar{X} = (X_1 + X_2 + \dots + X_n)/n$
• $\bar{X}$ has an approximate Normal distribution with mean μ and variance σ²/n
• So E($\bar{X}$) = μ
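The claim about the sample mean can be checked with a small simulation. This is only a sketch: the black box is assumed to be an exponential distribution with μ = 1 and σ² = 1 (chosen because it is clearly not Normal), and the function names, sample size, and trial count are illustrative.

```python
import random
import statistics

def sample_mean(xs):
    """X-bar = (X1 + X2 + ... + Xn) / n."""
    return sum(xs) / len(xs)

def simulate_sample_means(n, trials, seed=0):
    """Repeat the experiment `trials` times: draw n values, record their mean.

    Assumption: the black box is exponential with mean 1 and variance 1.
    """
    rng = random.Random(seed)
    return [sample_mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(trials)]

means = simulate_sample_means(n=10, trials=20000)
print(statistics.fmean(means))      # should be close to mu = 1
print(statistics.pvariance(means))  # should be close to sigma^2 / n = 0.1
```

The spread of the sample mean shrinks as n grows, which is why the mean is easy to pin down with many observations and uncertain with few.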
Estimating the variance σ²
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
• $s^2$ has ?? distribution with mean σ² and variance σ⁴ (2/(n−1))
• So E($s^2$) = σ²
Chi-Square if the original distribution was Normal
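A similar sketch for the variance estimate, assuming the black box is Normal (the case in which the stated variance of the estimate holds). The divisor n − 1 is what makes the estimate average out to σ²; the values of n, σ, and the trial count are arbitrary illustrative choices.

```python
import random
import statistics

def sample_variance(xs):
    """s^2 = sum((Xi - X-bar)^2) / (n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_sample_variances(n, sigma, trials, seed=0):
    """Draw `trials` Normal(0, sigma) samples of size n and record each s^2."""
    rng = random.Random(seed)
    return [sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
            for _ in range(trials)]

n, sigma = 5, 2.0
variances = simulate_sample_variances(n, sigma, trials=20000)
print(statistics.fmean(variances))      # should be close to sigma^2 = 4
print(statistics.pvariance(variances))  # should be close to 2 * sigma^4 / (n - 1) = 8
```

With only 5 observations the estimate of σ² is itself very noisy (its variance is twice σ²), which is the problem the later slides return to.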
Microarray Data
• Many genes, 25000
• 2 conditions (or more), many replicates within each condition
• Which genes are differentially expressed between the two conditions?
More Specifically
• For a particular gene
  – Each condition is a black box
  – Say 3 observations from each black box
• Do both black boxes have the same distribution?
  – Assume same canonical distribution
  – Do both have the same parameters?
Which Canonical Distribution
• Use data with many replicates
• 418.0294, 295.8019, 272.1220, 315.2978, 294.2242, 379.8320, 392.1817, 450.4758, 335.8242, 265.2478, 196.6982, 289.6532, 274.4035, 246.6807, 254.8710, 165.9416, 281.9463, 246.6434, 259.0019, 242.1968
• Distribution??
What is a QQ Plot
• Sample n values independently from the canonical distribution of interest
• Expected value of the smallest:
  – where the area under the curve to the left is 1/(n+1)
• Expected value of the kth smallest:
  – where the area under the curve to the left is k/(n+1)
• Sort the data (n observations at hand) and scatter plot against the above expected values
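The QQ plot recipe above can be sketched as follows for a Normal canonical distribution, using the 20 replicate values listed earlier. The plotting positions k/(n+1) and the correlation summary are assumptions made for this sketch; the helper names are illustrative, and drawing the scatter plot itself is left to any plotting tool.

```python
import math
from statistics import NormalDist

# The 20 replicate values listed on the earlier slide.
REPLICATES = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242, 379.8320, 392.1817,
    450.4758, 335.8242, 265.2478, 196.6982, 289.6532, 274.4035, 246.6807,
    254.8710, 165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

def normal_qq_points(data):
    """Pair the kth smallest observation with the standard Normal value
    that has area k/(n+1) to its left."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    expected = [std_normal.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    return list(zip(expected, sorted(data)))

def qq_correlation(points):
    """Pearson correlation of the QQ points; near 1 means near a straight line."""
    xs, ys = zip(*points)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

raw_r = qq_correlation(normal_qq_points(REPLICATES))
log_r = qq_correlation(normal_qq_points([math.log2(x) for x in REPLICATES]))
print(f"raw scale: {raw_r:.3f}   log2 scale: {log_r:.3f}")
```

Points that fall close to a straight line suggest the chosen canonical distribution is a reasonable fit; the correlation is only a rough numeric stand-in for looking at the plot.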
Distribution of log raw intensities across genes on a single array
The QQ plot of log scale intensities (i.e., actual vs simulated from normal)
QQ Plot against a Normal Distribution
• 10 + 10 replicates in two groups
• Single group QQ plot
• Combined 2 groups QQ plot
• Combined log-scale QQ plot
Shapiro-Wilk Test
Which Canonical Distribution
• Assume log normal distribution
Benford’s Law
• Frequency distribution of first significant digit
Pr(d ≤ x < d+1) = log10(d+1) − log10(d) for d = 1, ..., 9, when log10(x) is uniformly distributed in [0, 1]
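As a quick worked check of the formula, the first-digit probabilities can be tabulated directly. The function name is illustrative.

```python
import math

def benford_probability(d):
    """Probability that the first significant digit equals d (1..9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(d + 1) - math.log10(d)

probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, round(p, 4))
print(sum(probs.values()))  # the nine probabilities add up to 1
```

The digit 1 leads about 30% of the time and the probabilities fall steadily through 9, which is the pattern expected when the data are spread evenly on the log scale.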
Differential Expression
Group 1: μ1, σ1²    Group 2: μ2, σ2²
• Is μ1 = μ2?
• Is σ1 = σ2? Is variance a function of mean?
SD vs Mean across 3 replicates plotted for all genes
SD increases linearly with Mean
SD vs Mean across 3 replicates computed for all genes after log-transformation
SD is flat now, except for very low values
Another reason to work on the log scale
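One way to see why the log transform flattens the SD is to assume the noise is multiplicative, i.e., each replicate is the gene's level times a random factor. The sketch below uses a fixed, hypothetical set of three factors instead of real data, so the effect is exact rather than approximate.

```python
import math
import statistics

# Hypothetical relative deviations of 3 replicates around a gene's level
# (an assumption for illustration: the noise is multiplicative).
FACTORS = (0.8, 1.0, 1.25)

def replicate_sds(level):
    """Return (SD on the raw scale, SD on the log2 scale) for one gene."""
    raw = [level * f for f in FACTORS]
    logged = [math.log2(x) for x in raw]
    return statistics.stdev(raw), statistics.stdev(logged)

for level in (100.0, 1000.0, 10000.0):
    raw_sd, log_sd = replicate_sds(level)
    print(f"level={level:>8}  raw SD={raw_sd:10.3f}  log2 SD={log_sd:.4f}")
```

On the raw scale the SD grows in proportion to the level; on the log scale the level becomes an additive shift, so the SD is the same for every gene.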
Differential Expression
Group 1: μ1, σ1²    Group 2: μ2, σ2²
• Is μ1 = μ2?
• Is σ1 = σ2? Sort-of YES
The T-Statistic
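• $\bar{X}_1$: Group 1 estimated mean
• $\bar{X}_2$: Group 2 estimated mean
• $s^2$: estimated common variance of the two groups
  – the pooled variance is a commonly used estimate
$t = \frac{\bar{X}_1 - \bar{X}_2}{s\sqrt{1/n_1 + 1/n_2}}$, where $n_1$ and $n_2$ are the numbers of replicates in the two groups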
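A minimal sketch of the two-sample T-statistic, assuming the standard pooled-variance form: the difference of the group means divided by s·sqrt(1/n1 + 1/n2), where s² pools the two group variances. The function name and the example numbers are made up for illustration.

```python
import math
import statistics

def pooled_t_statistic(group1, group2):
    """Two-sample T-statistic with a common (pooled) variance estimate."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.fmean(group1), statistics.fmean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical log-scale intensities for one gene, 3 replicates per condition.
t = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(t, 4))  # -1.5492
```

Here the means are 2 and 4, the group variances 1 and 4, so the pooled variance is 2.5 and t = −2 / sqrt(2.5 · 2/3).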
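The T-Statistic
• Suppose μ1 = μ2 [Null hypothesis]
• Both groups follow the same hidden distribution
• E($\bar{X}_1$) = E($\bar{X}_2$) = μ1 = μ2, where $\bar{X}_1$ and $\bar{X}_2$ are the estimated group means
• E($\bar{X}_1 - \bar{X}_2$) = 0; the T-statistic is distributed around 0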
The T-Statistic
• If we knew the distribution of the T-statistic, we could then easily calculate the probability of getting the particular value that we got
• If this probability is small, then the assumption μ1 = μ2 was not valid
• That is, if this probability is small, then the gene is differentially expressed
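The reasoning above can be sketched by simulating the null hypothesis instead of looking up a distribution table: draw both groups from the same Normal black box many times and count how often the T-statistic is at least as extreme as the observed one. The Normal assumption, the pooled-variance form of the statistic, the trial count, and all names are choices made for this sketch.

```python
import math
import random

def pooled_t(g1, g2):
    """Two-sample T-statistic with pooled variance (assumed form)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def simulated_p_value(t_observed, n1, n2, trials=20000, seed=0):
    """Fraction of null simulations whose |t| is at least |t_observed|."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null, both groups come from the same distribution.
        g1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        if abs(pooled_t(g1, g2)) >= abs(t_observed):
            extreme += 1
    return extreme / trials

t_obs = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
p = simulated_p_value(t_obs, n1=3, n2=3)
print(round(t_obs, 3), p)  # p comes out near 0.2 for this example
```

A probability of roughly 0.2 is not small, so for this made-up gene the data would not contradict μ1 = μ2.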
The T-Statistic
Flattened Normal or T-Distribution
The distribution is a function of the number of observations used to compute the variance estimate $s^2$
For df=
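To illustrate the "flattened Normal" shape, the sketch below simulates the T-statistic under the null for a small and a larger number of replicates and measures how often it exceeds 1.96 in absolute value (a standard Normal does so about 5% of the time). The pooled-variance form, the Normal black box, the group sizes, and the cutoff are assumptions of the sketch.

```python
import math
import random

def pooled_t(g1, g2):
    """Two-sample T-statistic with pooled variance (assumed form)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def tail_fraction(n_per_group, cutoff=1.96, trials=10000, seed=0):
    """Fraction of null T-statistics with |t| above the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(g1, g2)) > cutoff:
            hits += 1
    return hits / trials

few = tail_fraction(3)    # 3 replicates per group: heavy tails
many = tail_fraction(30)  # 30 replicates per group: close to Normal
print(few, many)
```

With 3 replicates per group the tail fraction is more than double the Normal 5%, so large T values are much less surprising when few observations go into the variance estimate.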
A Problem
• Variance estimation from just a few replicates could be erroneous
• Can we get a better estimate?
  – Go across genes. Moderation.
=
SD vs Mean across 3 replicates computed for all genes after log-transformation
The curve fit here may be a better estimate
Not much difference here
Lots of false positives can be avoided here
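One simple way to express the moderation idea is a weighted average of a gene's own variance and a value borrowed from all genes. This is only a sketch and not necessarily the method behind the plot: the slides suggest a curve fit of SD against mean, whereas here the borrowed value is a single number, and the weights `df_gene` and `df_prior` and the example variances are hypothetical.

```python
import statistics

def moderated_variance(gene_var, df_gene, prior_var, df_prior):
    """Shrink a gene's own variance toward a value borrowed across genes.

    The weights are degrees of freedom: the larger df_prior is relative to
    df_gene, the more the across-gene value dominates.
    """
    return (df_prior * prior_var + df_gene * gene_var) / (df_prior + df_gene)

# Hypothetical per-gene variances from 3 + 3 replicates (df_gene = 4).
gene_vars = [0.01, 0.20, 0.25, 0.30, 0.49]
prior_var = statistics.fmean(gene_vars)  # across-gene value, here 0.25

for v in gene_vars:
    print(v, round(moderated_variance(v, 4, prior_var, 4), 3))
```

The gene with variance 0.01 is pulled up to 0.13. A variance that is tiny by chance would otherwise inflate the T-statistic, which is how moderation can avoid false positives, while genes whose variance is already near the across-gene value change very little.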
Thank You