Introduction to Statistics II

Statistics for Microarray Data

Upload: strand-life-sciences-pvt-ltd

Post on 21-May-2015


TRANSCRIPT

Page 1: Introduction to statistics ii

Statistics for Microarray Data

Page 2: Introduction to statistics ii

Background

• Few observations made by a black box

• What is the distribution behind the black box?

• E.g., with what probability will it output a number bigger than 5?

[Figure: black box with hidden parameters $\mu$, $\sigma^2$]

Page 3: Introduction to statistics ii

Approach

• Easy to determine with many observations

• With few observations:

• Assume a canonical distribution based on prior knowledge

• Determine parameters of this distribution using the observations, e.g., mean, variance
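As a minimal sketch of this approach, assuming the canonical distribution is Normal (an assumption made only for illustration) and using hypothetical observations that are not from the slides, the parameters can be estimated from the sample and then used to answer the earlier question about outputs bigger than 5:

```python
from statistics import NormalDist, mean, stdev

def prob_greater_than(observations, threshold):
    """Fit a Normal distribution to the observations and return P(X > threshold)."""
    mu_hat = mean(observations)      # estimated mean
    sigma_hat = stdev(observations)  # estimated SD (n - 1 denominator)
    fitted = NormalDist(mu_hat, sigma_hat)
    return 1.0 - fitted.cdf(threshold)

# Hypothetical observations from the black box (illustrative only).
observations = [4.1, 5.3, 3.8, 4.9, 5.6]
print(prob_greater_than(observations, 5.0))
```

With so few observations the answer depends heavily on the assumed distribution, which is why the choice of canonical distribution matters.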

Page 4: Introduction to statistics ii

Estimating the mean

$\bar{x} = (x_1 + x_2 + \dots + x_n)/n$

• $\bar{x}$ has an approximate Normal distribution with mean $\mu$ and variance $\sigma^2/n$

• So $E(\bar{x}) = \mu$
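The behaviour of the sample mean can be checked with a small simulation. This is a sketch under stated assumptions: the black box is taken to be Normal, and the values mu = 10, sigma = 2, n = 4 are hypothetical. If the claim holds, the simulated means should centre near mu with variance near sigma^2 / n.

```python
import random

def simulate_sample_means(mu, sigma, n, trials, seed=0):
    """Draw `trials` samples of size n from Normal(mu, sigma); return their means."""
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(trials)]

def mean_and_variance(values):
    """Return the average and the (population) variance of a list of numbers."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical parameters: sigma^2 / n = 4 / 4 = 1.
means = simulate_sample_means(mu=10.0, sigma=2.0, n=4, trials=20000)
center, spread = mean_and_variance(means)
print(center, spread)  # expected to land near 10 and near 1
```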

Page 5: Introduction to statistics ii

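Estimating the variance $\sigma^2$

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean

• $s^2$ has a (scaled) Chi-Square distribution if the original distribution was Normal, with mean $\sigma^2$ and variance $2\sigma^4/(n-1)$

• So $E(s^2) = \sigma^2$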
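A similar simulation illustrates the variance estimate. It is only a sketch: it assumes Normal data, the n - 1 denominator, and hypothetical values sigma = 2, n = 5. Under those assumptions the estimates should average near sigma^2 = 4 and have variance near 2 sigma^4 / (n - 1) = 8.

```python
import random

def sample_variance(values):
    """Unbiased sample variance (n - 1 denominator)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def simulate_sample_variances(sigma, n, trials, seed=0):
    """Estimate the variance from `trials` independent Normal samples of size n."""
    rng = random.Random(seed)
    return [
        sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(trials)
    ]

estimates = simulate_sample_variances(sigma=2.0, n=5, trials=20000)
avg = sum(estimates) / len(estimates)
var_of_estimates = sample_variance(estimates)
print(avg, var_of_estimates)  # expected to land near 4 and near 8
```

The spread of the estimates is large relative to their mean, which shows how noisy a variance estimate from a handful of observations can be.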

Page 6: Introduction to statistics ii

Microarray Data

• Many genes, e.g., 25,000

• 2 conditions (or more), many replicates within each condition

• Which genes are differentially expressed between the two conditions?

Page 7: Introduction to statistics ii

More Specifically

• For a particular gene

  – Each condition is a black box

  – Say 3 observations from each black box

• Do both black boxes have the same distribution?

  – Assume same canonical distribution

  – Do both have the same parameters?

Page 8: Introduction to statistics ii

Which Canonical Distribution

• Use data with many replicates

• 418.0294, 295.8019, 272.1220, 315.2978, 294.2242, 379.8320, 392.1817, 450.4758, 335.8242, 265.2478, 196.6982, 289.6532, 274.4035, 246.6807, 254.8710, 165.9416, 281.9463, 246.6434, 259.0019, 242.1968

• Distribution??

Page 9: Introduction to statistics ii

What is a QQ Plot

• Sample n values independently from the canonical distribution of interest

• Expected value of the smallest: (approximately) the point where the area under the curve to the left is $1/(n+1)$

• Expected value of the kth smallest: (approximately) the point where the area under the curve to the left is $k/(n+1)$

• Sort the data (n observations at hand) and scatter plot against the above expected values
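The steps above can be sketched as follows for a Normal reference distribution, using the 20 replicate values listed earlier. Two choices here are assumptions rather than something the slides confirm: the plotting position k/(n+1), and fitting the Normal by the sample mean and SD.

```python
from statistics import NormalDist, mean, stdev

# The 20 replicate values listed on the "Which Canonical Distribution" slide.
replicates = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242, 379.8320, 392.1817,
    450.4758, 335.8242, 265.2478, 196.6982, 289.6532, 274.4035, 246.6807,
    254.8710, 165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

def normal_qq_points(data):
    """Return (expected, observed) pairs for a QQ plot against a fitted Normal."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Expected k-th smallest value: area k/(n+1) lies to its left.
    expected = [fitted.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    return list(zip(expected, sorted(data)))

for expected_value, observed_value in normal_qq_points(replicates):
    print(f"{expected_value:10.2f} {observed_value:10.2f}")
```

If the data follow the reference distribution, the pairs fall close to the line y = x when scatter plotted.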

Page 10: Introduction to statistics ii

Distribution of log raw intensities across genes on a single array

Page 11: Introduction to statistics ii

The QQ plot of log scale intensities (i.e., actual vs simulated from normal)

Page 12: Introduction to statistics ii

QQ Plot against a Normal Distribution

• 10 + 10 replicates in two groups

• Single group QQ plot

• Combined 2 groups QQ plot

• Combined log-scale QQ plot

Shapiro-Wilk Test

Page 13: Introduction to statistics ii

Which Canonical Distribution

• Assume log normal distribution

Page 14: Introduction to statistics ii

Benford’s Law

• Frequency distribution of first significant digit

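$\Pr(d \le x < d+1) = \log_{10}(d+1) - \log_{10}(d)$ for $d = 1, \dots, 9$, where $x$ is the value scaled to lie in $[1, 10)$ and $\log_{10}(x)$ is uniformly distributed in $[0, 1]$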
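As a small illustration, the first-digit probabilities implied by this formula can be tabulated directly. The helper names below are hypothetical, and `first_digit` is only one convenient way to extract the leading digit of a positive number.

```python
import math

def benford_probability(d):
    """Probability that the first significant digit equals d (d = 1..9)."""
    return math.log10(d + 1) - math.log10(d)

def first_digit(x):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):e}"[0])

probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, round(p, 4))
```

Comparing these probabilities with the observed first-digit frequencies of intensities is one informal check of whether the log of the data is spread roughly uniformly over its fractional part.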

Page 15: Introduction to statistics ii

Differential Expression

Group 1: $\mu_1, \sigma_1^2$    Group 2: $\mu_2, \sigma_2^2$

Is $\mu_1 = \mu_2$?

Is $\sigma_1 = \sigma_2$? Is variance a function of mean?

Page 16: Introduction to statistics ii

SD vs Mean across 3 replicates plotted for all genes

SD increases linearly with Mean

Page 17: Introduction to statistics ii

SD vs Mean across 3 replicates computed for all genes after log-transformation

SD is flat now, except for very low values

Another reason to work on the log scale
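One way to see why the log scale flattens the SD is a simulation with multiplicative noise. This is a sketch under an assumed noise model (each measurement is the true level times a log-normal factor), not something the slides state; the levels and noise size are hypothetical, and many replicates are used only to make the pattern stable.

```python
import math
import random
from statistics import stdev

def sd_raw_and_log(level, noise_sd, replicates, rng):
    """Simulate replicates at a given expression level with multiplicative noise.

    Returns (SD on the raw scale, SD on the natural-log scale).
    """
    values = [level * math.exp(rng.gauss(0.0, noise_sd)) for _ in range(replicates)]
    return stdev(values), stdev([math.log(v) for v in values])

rng = random.Random(1)
for level in (100.0, 1000.0, 10000.0):
    raw_sd, log_sd = sd_raw_and_log(level, noise_sd=0.1, replicates=5000, rng=rng)
    print(level, round(raw_sd, 2), round(log_sd, 4))
```

Under this model the raw-scale SD grows in proportion to the level, while the log-scale SD stays near the noise size regardless of level.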

Page 18: Introduction to statistics ii

Differential Expression

Group 1: $\mu_1, \sigma_1^2$    Group 2: $\mu_2, \sigma_2^2$

Is $\mu_1 = \mu_2$?

Is $\sigma_1 = \sigma_2$? Sort-of YES

Page 19: Introduction to statistics ii

The T-Statistic

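• $\bar{x}_1$: Group 1 estimated mean

• $\bar{x}_2$: Group 2 estimated mean

• $s^2$: estimated common variance of the two groups

  – $s^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$ is a commonly used estimate, where $s_1^2, s_2^2$ are the group variances and $n_1, n_2$ the group sizes

$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/n_1 + 1/n_2}}$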
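A minimal sketch of the two-group t-statistic follows. It assumes the pooled (equal-variance) form with the common variance estimated from both groups; the slides do not confirm the exact formula, and the log-scale intensities below are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """Two-sample t-statistic using a pooled estimate of the common variance."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two group variances.
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error

# Hypothetical log-scale intensities for one gene, 3 replicates per condition.
condition_a = [8.1, 8.4, 7.9]
condition_b = [9.2, 9.0, 9.5]
print(pooled_t_statistic(condition_a, condition_b))
```

A value far from 0 suggests the group means differ by more than the replicate noise would explain.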

Page 20: Introduction to statistics ii

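The T-Statistic $t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/n_1 + 1/n_2}}$

• Suppose $\mu_1 = \mu_2$ [Null hypothesis]

• Both groups follow the same hidden distribution

• $E(\bar{x}_1) = E(\bar{x}_2) = \mu_1 = \mu_2$

• $E(\bar{x}_1 - \bar{x}_2) = 0$; $t$ is distributed around 0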

Page 21: Introduction to statistics ii

The T-Statistic $t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/n_1 + 1/n_2}}$

• If we knew the distribution of $t$, we could then easily calculate the probability of getting a value at least as extreme as the particular value of $t$ that we got

• If this probability is small, then the assumption $\mu_1 = \mu_2$ is unlikely to be valid

• If this probability is small, then the gene is called differentially expressed

Page 22: Introduction to statistics ii

The T-Statistic

$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/n_1 + 1/n_2}}$ follows a flattened Normal, i.e., the T-Distribution

The distribution is a function of the number of observations used to compute $s$

For the pooled two-group statistic, df = $n_1 + n_2 - 2$, where $n_1, n_2$ are the group sizes
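To make the probability calculation concrete, the sketch below computes a two-sided p-value from the T-Distribution density by numerical integration (Simpson's rule), using only the standard library. The function names are hypothetical, and the example assumes df = 4, which corresponds to 3 + 3 replicates under the pooled-variance form.

```python
import math

def t_pdf(t, df):
    """Density of the T-Distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # P(-|t| < T < |t|), by symmetry
    return 1.0 - central_area

# Hypothetical t value with df = 4.
print(two_sided_p_value(2.776, df=4))
```

With few degrees of freedom the tails are heavy, so a given t value yields a larger p-value than it would under the Normal distribution.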

Page 23: Introduction to statistics ii

A Problem

• Estimating $s$ from just a few replicates could be erroneous

• Can we get a better estimate?

  – Go across genes. Moderation.

$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/n_1 + 1/n_2}}$
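One simple form of moderation can be sketched as follows. This is an illustrative assumption, not the method of the slides: each gene's variance is shrunk toward the average variance across genes, with a hypothetical weight `prior_df` controlling how strongly. A curve fit of SD against mean, rather than a single cross-gene average, would be a refinement of the same idea.

```python
from statistics import mean, variance

def moderated_variances(replicates_per_gene, prior_df=4.0):
    """Shrink each gene's variance toward the cross-gene average variance.

    Assumes every gene has the same number of replicates. `prior_df` is a
    hypothetical weight given to the cross-gene estimate.
    """
    per_gene_var = [variance(values) for values in replicates_per_gene]
    df = len(replicates_per_gene[0]) - 1  # degrees of freedom per gene
    prior_var = mean(per_gene_var)        # estimate borrowed from all genes
    return [
        (prior_df * prior_var + df * v) / (prior_df + df) for v in per_gene_var
    ]

# Hypothetical log-scale replicates for three genes (3 replicates each).
genes = [[1.0, 2.0, 3.0], [1.0, 1.1, 0.9], [5.0, 7.0, 9.0]]
print(moderated_variances(genes))
```

Genes whose few replicates happen to agree closely no longer get a near-zero variance, so their t-statistics are not inflated.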

Page 24: Introduction to statistics ii

SD vs Mean across 3 replicates computed for all genes after log-transformation

The curve fit here may be a better estimate

Not much difference here

Lots of false positives can be avoided here

Page 25: Introduction to statistics ii

Thank You