Introduction to Statistics II

Statistics for Microarray Data

Background

• Few observations made by a black box

• What is the distribution behind the black box?

• E.g., with what probability will it output a number bigger than 5?

Unknown parameters: $\mu$, $\sigma^2$

Approach

• Easy to determine with many observations

• With few observations:

• Assume a canonical distribution based on prior knowledge

• Determine parameters of this distribution using the observations, e.g., mean, variance
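As a minimal sketch of this approach, assume the canonical distribution is Normal and take five made-up observations. The values and the names `observations` and `fitted` are illustrative, not from the slides. The estimated parameters then answer the earlier question about outputs bigger than 5:

```python
from statistics import NormalDist

# Hypothetical observations from the black box (made up for illustration).
observations = [4.1, 5.3, 3.8, 4.9, 4.4]

# Assume a Normal distribution and estimate its mean and standard deviation.
fitted = NormalDist.from_samples(observations)

# Probability that the black box outputs a number bigger than 5,
# under the fitted Normal distribution.
p_bigger_than_5 = 1.0 - fitted.cdf(5.0)

print(f"estimated mean = {fitted.mean:.3f}")
print(f"estimated SD   = {fitted.stdev:.3f}")
print(f"P(X > 5)       = {p_bigger_than_5:.3f}")
```

With so few observations the estimated parameters are themselves uncertain, so the resulting probability is only as good as the Normal assumption and the estimates.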

Estimating the mean

$\bar{x} = (x_1 + x_2 + \dots + x_n)/n$

• $\bar{x}$ has an approximately Normal distribution with mean $\mu$ and variance $\sigma^2/n$

• So $E(\bar{x}) = \mu$
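The behaviour of the sample mean can be checked with a small simulation. This is only a sketch: the black box is assumed to be Normal, and `MU`, `SIGMA` and `N` are hypothetical values chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

MU, SIGMA, N = 10.0, 2.0, 5   # hypothetical black-box parameters and sample size

def sample_mean(n):
    """Draw n observations from the black box and return their mean."""
    return statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))

# Repeat the experiment many times to see how the sample mean itself varies.
means = [sample_mean(N) for _ in range(20000)]

mean_of_means = statistics.fmean(means)
var_of_means = statistics.variance(means)
print("mean of sample means:    ", round(mean_of_means, 3))   # close to MU
print("variance of sample means:", round(var_of_means, 3))    # close to SIGMA**2 / N
```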

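Estimating the variance $\sigma^2$

$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$

• $s^2$ has a scaled Chi-Square distribution if the original distribution was Normal, with mean $\sigma^2$ and variance $\sigma^4 \cdot 2/(n-1)$

• So $E(s^2) = \sigma^2$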
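A simulation sketch of the variance estimate follows. It assumes a Normal black box and the usual estimator that divides by n - 1; `SIGMA` and `N` are hypothetical. Under these assumptions the estimates should average to the true variance, and their spread should be close to 2σ⁴/(n - 1).

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

SIGMA, N = 2.0, 5   # hypothetical true SD and number of observations

def sample_variance(xs):
    """Variance estimate that divides by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Estimate the variance from many independent small samples.
estimates = [
    sample_variance([random.gauss(0.0, SIGMA) for _ in range(N)])
    for _ in range(40000)
]

mean_s2 = statistics.fmean(estimates)
var_s2 = statistics.variance(estimates)
print("average estimate:     ", round(mean_s2, 3))  # close to SIGMA**2
print("variance of estimates:", round(var_s2, 3))   # close to 2 * SIGMA**4 / (N - 1)
```

The spread of the estimates is large compared with their mean, which is the practical problem with estimating a variance from only a few observations.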

Microarray Data

• Many genes, 25,000

• 2 conditions (or more), many replicates within each condition

• Which genes are differentially expressed between the two conditions?

More Specifically

• For a particular gene
   – Each condition is a black box
   – Say 3 observations from each black box

• Do both black boxes have the same distribution?
   – Assume the same canonical distribution
   – Do both have the same parameters?

Which Canonical Distribution

• Use data with many replicates

• 418.0294, 295.8019, 272.1220, 315.2978, 294.2242, 379.8320, 392.1817, 450.4758, 335.8242, 265.2478, 196.6982, 289.6532, 274.4035, 246.6807, 254.8710, 165.9416, 281.9463, 246.6434, 259.0019, 242.1968

• Distribution??

What is a QQ Plot?

• Sample n values independently from the canonical distribution of interest

• Expected value of the smallest:
   – where the area under the curve to the left is $1/(n+1)$

• Expected value of the kth smallest:
   – where the area under the curve to the left is $k/(n+1)$

• Sort the data (n observations at hand) and scatter plot against the above expected values
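The steps above can be sketched in code using the 20 replicate values listed earlier. Two assumptions are made here: the left-tail area for the kth smallest value is taken as k/(n+1), and the canonical distribution is a Normal fitted to the data. The helper name `qq_points` is illustrative.

```python
from statistics import NormalDist

# The 20 replicate values listed on the "Which Canonical Distribution" slide.
values = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
    379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
    196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
    165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

def qq_points(data, dist):
    """Pair each sorted observation with the quantile of `dist` whose
    left-tail area is k/(n+1), for k = 1..n."""
    n = len(data)
    expected = [dist.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    return list(zip(expected, sorted(data)))

# Compare the data against a Normal distribution fitted to it.
normal = NormalDist.from_samples(values)
points = qq_points(values, normal)

for expected, observed in points[:3]:
    print(f"expected {expected:8.2f}   observed {observed:8.2f}")
```

Scatter-plotting `points` gives the QQ plot; if the data follow the chosen distribution, the points lie close to a straight line.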

Distribution of log raw intensities across genes on a single array

The QQ plot of log scale intensities (i.e., actual vs simulated from normal)

QQ Plot against a Normal Distribution

• 10 + 10 replicates in two groups

• Single group QQ plot

• Combined 2 groups QQ plot

• Combined log-scale QQ plot

• Shapiro-Wilk Test

Which Canonical Distribution

• Assume log normal distribution
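One rough way to compare the raw and log scales numerically is the correlation between the sorted data and the corresponding Normal quantiles: the straighter the QQ plot, the closer this is to 1. The sketch below is a simple QQ correlation, not the Shapiro-Wilk test named earlier. The k/(n+1) plotting positions and the base-2 logarithm are assumptions.

```python
import math
from statistics import NormalDist, fmean

# The 20 replicate values listed on the "Which Canonical Distribution" slide.
values = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
    379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
    196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
    165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

def qq_correlation(data):
    """Correlation between the sorted data and standard Normal quantiles
    at areas k/(n+1); values near 1 mean a nearly straight QQ plot."""
    n = len(data)
    xs = [NormalDist().inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    ys = sorted(data)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_raw = qq_correlation(values)
r_log = qq_correlation([math.log2(v) for v in values])
print(f"QQ correlation, raw scale: {r_raw:.4f}")
print(f"QQ correlation, log scale: {r_log:.4f}")
```

With only 20 values the two correlations may not differ much, so this is a sanity check rather than a formal test.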

Benford’s Law

• Frequency distribution of first significant digit

$\Pr(d \le x < d+1) = \log_{10}(1+d) - \log_{10}(d)$ for $d = 1, \dots, 9$, where $x$ is scaled to lie in $[1, 10)$ and $\log_{10}(x)$ is uniformly distributed in $[0,1]$
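The formula can be checked with a short simulation: draw u uniformly in [0, 1), set x = 10^u so that log10(x) is uniform, and count first digits. The sample size and seed below are arbitrary choices for the sketch.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Benford probabilities for first significant digits 1..9.
benford = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

# Simulate x = 10**u with u uniform in [0, 1), so that log10(x) is uniform.
N = 100_000
counts = {d: 0 for d in range(1, 10)}
for _ in range(N):
    x = 10 ** random.random()
    counts[int(x)] += 1   # x lies in [1, 10), so int(x) is its first digit

for d in range(1, 10):
    print(d, round(benford[d], 4), round(counts[d] / N, 4))
```

The printed columns are the digit, the Benford probability, and the simulated frequency.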

Differential Expression

Group 1: $\mu_1, \sigma_1^2$    Group 2: $\mu_2, \sigma_2^2$

Is $\mu_1 = \mu_2$? Is $\sigma_1 = \sigma_2$? Is variance a function of mean?

SD vs Mean across 3 replicates plotted for all genes

SD increases linearly with Mean

SD vs Mean across 3 replicates computed for all genes after log-transformation

SD is flat now, except for very low values

Another reason to work on the log scale
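A simulation sketch shows why the log scale removes the dependence of SD on the mean. It assumes purely multiplicative noise, which is a modelling assumption and not something the slides state. The noise level `LOG_SD` and the two expression levels are hypothetical.

```python
import math
import random
from statistics import fmean, stdev

random.seed(3)  # fixed seed so the sketch is reproducible

LOG_SD = 0.2            # hypothetical replicate noise on the natural-log scale
REPLICATES = 3
GENES_PER_LEVEL = 2000

def average_sds(true_level):
    """Simulate genes at one expression level; return the average replicate
    SD on the raw scale and on the log scale."""
    raw_sds, log_sds = [], []
    for _ in range(GENES_PER_LEVEL):
        reps = [true_level * math.exp(random.gauss(0.0, LOG_SD))
                for _ in range(REPLICATES)]
        raw_sds.append(stdev(reps))
        log_sds.append(stdev([math.log(r) for r in reps]))
    return fmean(raw_sds), fmean(log_sds)

low_raw, low_log = average_sds(100.0)
high_raw, high_log = average_sds(10000.0)

print(f"raw scale: SD at low level = {low_raw:.1f}, at high level = {high_raw:.1f}")
print(f"log scale: SD at low level = {low_log:.3f}, at high level = {high_log:.3f}")
```

This simple model does not reproduce the different behaviour at very low values noted above, which would need an additional additive noise term.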

Differential Expression

Group 1: $\mu_1, \sigma_1^2$    Group 2: $\mu_2, \sigma_2^2$

Is $\mu_1 = \mu_2$? Is $\sigma_1 = \sigma_2$? Sort-of YES

The T-Statistic

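• $\bar{x}_1$: Group 1 estimated mean

• $\bar{x}_2$: Group 2 estimated mean

• $s^2$: estimated common variance of the two groups
   – $s^2 = ((n_1 - 1) s_1^2 + (n_2 - 1) s_2^2)/(n_1 + n_2 - 2)$ is a commonly used estimate

$t = (\bar{x}_1 - \bar{x}_2) / (s \sqrt{1/n_1 + 1/n_2})$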
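A minimal sketch of the computation for one gene follows. It assumes the standard two-sample t-statistic with a pooled variance estimate, which may differ in detail from the formula on the original slide. The replicate values and the name `pooled_t` are hypothetical.

```python
import math
from statistics import fmean, variance

def pooled_t(group1, group2):
    """Two-sample t-statistic using a pooled (common) variance estimate.
    Returns the statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: per-group variances weighted by their degrees of freedom.
    s2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (fmean(group1) - fmean(group2)) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical log-scale intensities for one gene, 3 replicates per condition.
group1 = [8.1, 8.4, 8.2]
group2 = [9.0, 9.3, 9.1]

t, df = pooled_t(group1, group2)
print(f"t = {t:.3f} with {df} degrees of freedom")
```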

The T-Statistic $t$

• Suppose $\mu_1 = \mu_2$ [Null hypothesis]

• Both groups follow the same hidden distribution

• $E(\bar{x}_1) = E(\bar{x}_2) = \mu_1 = \mu_2$

• $E(\bar{x}_1 - \bar{x}_2) = 0$; $t$ is distributed around 0

The T-Statistic $t$

• If we knew the distribution of $t$, we could then easily calculate the probability of getting a value of $t$ at least as extreme as the one we got

• If this probability is small, then the assumption $\mu_1 = \mu_2$ is unlikely to be valid

• If this probability is small, then the gene is likely differentially expressed

The T-Statistic

• $t$ follows a flattened Normal, or T-Distribution

• The distribution is a function of the number of observations used to compute $t$

For df=
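To turn a t-statistic into a probability, the area in the tails of the T-Distribution is needed. The sketch below computes a two-sided probability by numerically integrating the t density with the trapezoid rule, so that it needs only the standard library. The function names, the step count, and the example inputs (df = 4, as for two groups of 3 replicates) are illustrative assumptions.

```python
import math

def t_pdf(x, df):
    """Density of the T-Distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20000):
    """Probability of a t value at least as extreme as |t| in either tail,
    computed as 1 minus the central area between -|t| and |t|."""
    upper = abs(t)
    h = upper / steps
    # Trapezoid rule for the area between 0 and |t|.
    area = 0.5 * (t_pdf(0.0, df) + t_pdf(upper, df))
    for i in range(1, steps):
        area += t_pdf(i * h, df)
    area *= h
    return 1.0 - 2.0 * area

print(f"p for t = 2.776, df = 4: {two_sided_p(2.776, 4):.4f}")
print(f"p for t = 7.216, df = 4: {two_sided_p(7.216, 4):.4f}")
```

In practice a statistics library would supply this tail area directly; the explicit integration is shown only to make the calculation visible.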

A Problem

• $s^2$ estimation from just a few replicates could be erroneous

• Can we get a better estimate?
   – Go across genes. Moderation.

=

SD vs Mean across 3 replicates computed for all genes after log-transformation

The curve fit here may be a better estimate

Not much difference here

Lots of false positives can be avoided here
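One simple way to sketch moderation is to pull each gene's variance toward a value estimated across genes. The slides suggest a curve fit of SD against mean; the sketch below swaps in a cruder stand-in, the plain average variance across genes, combined as a weighted average. The gene names, data, `DF_PRIOR` weight and function name are all hypothetical.

```python
from statistics import fmean, variance

def moderated_variance(gene_var, prior_var, df_gene, df_prior):
    """Weighted average of a gene's own variance and a prior variance
    borrowed from other genes; df_prior (hypothetical) sets the weight."""
    return (df_prior * prior_var + df_gene * gene_var) / (df_prior + df_gene)

# Hypothetical log-scale replicates (3 per gene) for one condition.
genes = {
    "geneA": [8.10, 8.40, 8.20],
    "geneB": [7.50, 7.90, 7.60],
    "geneC": [9.00, 9.01, 9.00],   # tiny variance purely by chance
    "geneD": [6.80, 7.20, 6.90],
}

gene_vars = {g: variance(x) for g, x in genes.items()}
prior_var = fmean(gene_vars.values())   # crude across-gene estimate (assumption)
DF_PRIOR = 4                            # hypothetical weight given to the prior

moderated = {
    g: moderated_variance(v, prior_var, len(genes[g]) - 1, DF_PRIOR)
    for g, v in gene_vars.items()
}

for g in genes:
    print(f"{g}: own variance = {gene_vars[g]:.5f}, moderated = {moderated[g]:.5f}")
```

A gene like `geneC`, whose replicates happen to agree very closely, would otherwise get an inflated t-statistic; the moderated variance is larger, which is how false positives of this kind can be avoided.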

Thank You
