
Upload: clare-buttler

Post on 14-Dec-2015


Page 1: Multiple testing and false discovery rate in feature selection  Workflow of feature selection using high- throughput data  General considerations of

Multiple testing and false discovery rate in feature selection

Workflow of feature selection using high-throughput data

General considerations of statistical testing using high-throughput data

Some important FDR methods:
Benjamini-Hochberg FDR
Storey-Tibshirani’s q-value
Efron et al.’s local fdr
Ploner et al.’s multidimensional local fdr


Gene/protein/metabolite expression data

After all the pre-processing, we have a feature-by-sample matrix of expression indices. It is like a molecular “fingerprint” of each sample. The most common use is to find biomarkers of a disease.


Workflow of feature selection

Raw data
  ↓ (preprocessing, normalization, filtering …)
Feature-level expression in each sample
  ↓ (statistical testing/model fitting)
Test statistic for every feature
  ↓ (compare to null distribution)
P-value for every feature
  ↓
Significance of every feature (FWER, FDR, …)
  ↓ (combined with other information: fold change, biological pathway …)
Selected features
  ↓
Biological interpretation, …


Workflow of feature selection

Another route:

Raw data
  ↓ (preprocessing, normalization, filtering …)
Feature-level expression in each sample
  ↓ (statistical testing/model fitting, combined with feature group information: biological pathway …)
Group-level significance
  ↓
Selected features from significant groups


Gene/protein/metabolite expression data

The simplest strategy:

Assume each gene is independent of the others. Perform a test between treatment groups for every gene. Select those that are significant.

When we do 50,000 t-tests at an alpha level of 0.05, we expect about 50,000 × 0.05 = 2,500 false positives (when nearly all genes are unchanged)!

What if we use Bonferroni’s correction? The per-test threshold becomes 0.05/50,000 = 1e-6. Unrealistic!
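The arithmetic on this slide can be checked in a few lines. This is only a sketch using the slide's own numbers (50,000 tests, alpha = 0.05); the variable names are illustrative.

```python
# Illustrative numbers taken from the slide: 50,000 tests at alpha = 0.05.
m = 50_000
alpha = 0.05

# Expected number of false positives if every null hypothesis were true.
expected_false_positives = m * alpha

# Per-test threshold under the Bonferroni correction.
bonferroni_threshold = alpha / m

print(expected_false_positives)
print(bonferroni_threshold)
```

The first number is the count of genes that would be called significant by chance alone; the second is the p-value a single gene must beat after correction.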


General considerations

Family-wise error rate (FWER)

When we have multiple tests, let V be the number of true nulls called significant (false positives)

FWER = P(V ≥ 1) = 1-P(V=0)

“Family”: a group of hypotheses that are similar in purpose, and need to be jointly accurate.

Bonferroni correction is one version of FWER control. It is the simplest and most conservative approach.


General considerations

Bonferroni correction: control every test at the level α/m.

For each test, P(Ti significant | H0) ≤ α/m

Then, by the union bound over the m tests, P(some Ti are significant | H0) ≤ m × α/m = α

i.e. FWER = P(V ≥ 1) ≤ α

The correction has little power to detect differential expression when m is big.
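To see how tight the bound is, the FWER can be computed exactly in the special case where all m nulls are true and the tests are independent. That independence is an extra assumption made only for this sketch (the union bound itself does not need it), and the helper name fwer_independent is made up for the illustration.

```python
def fwer_independent(per_test_level: float, m: int) -> float:
    """FWER = 1 - P(V = 0) when all m nulls are true and tests are independent."""
    return 1.0 - (1.0 - per_test_level) ** m

alpha = 0.05
for m in (10, 1_000, 50_000):
    uncorrected = fwer_independent(alpha, m)      # every test at level alpha
    bonferroni = fwer_independent(alpha / m, m)   # every test at level alpha/m
    print(m, round(uncorrected, 4), round(bonferroni, 4))
```

Without correction the FWER approaches 1 as m grows, while the Bonferroni-corrected FWER stays just below α.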


References

Non-technical reviews:
Gusnanto A, Calza S, Pawitan Y. Curr Opin Lipidol 2007; 18:187-193.
Pounds SB. Brief Bioinf 2005; 7(1):25-36.
Saeys Y, Inza I, Larranaga P. Bioinformatics 2007; 23(19):2507-2517.

Original papers:
Benjamini Y, Hochberg Y. JRSS B 1995; 57(1):289-300.
Storey JD, Tibshirani R. Proc Natl Acad Sci U S A 2003; 100:9440-9445.
Efron B. Ann Stat 2007; 35(4):1351-137.
Ploner A, Calza S, Gusnanto A, Pawitan Y. Bioinf 2006; 22(5):556-565.

(A number of figures were taken from these papers.)


General considerations

                           Significant   Non-significant   Total
No change                  V             U                 Q
Differentially expressed   S             T                 M-Q
Total                      R             M-R               M

Simultaneously test M hypotheses.
Q is the number of true nulls: genes that didn’t change (unobserved).
R is the number rejected: genes called significant (observed).
U, V, T, S are unobservable random variables.
V: number of type-I errors; T: number of type-II errors.


General considerations

With V, U, S, T, Q and M as in the table above:

In traditional testing, we consider just one test, from a frequentist’s point of view, and we control the false positive rate E(V/Q).

Sensitivity: E[S/(M-Q)]
Specificity: E[U/Q]


General considerations

There is always a trade-off between sensitivity and specificity.

                           Significant            Non-significant        Total
No change                  False positive         True negative          Total true negative
Differentially expressed   True positive          False negative         Total true positive
Total                      Total positive calls   Total negative calls   Total

Receiver operating characteristic (ROC) curve.

Example from Jiang et al. BMC Bioinformatics 7:417.


General considerations

[Figure: general ROC curve. Source: http://upload.wikimedia.org/wikipedia/en/b/b4/Roc-general.png]


General considerations

False discovery rate (FDR) = E(V/R)

Among all tests called significant, what percentage are false calls?

                           Significant   Non-significant   Total
No change                  V             U                 Q
Differentially expressed   S             T                 M-Q
Total                      R             M-R               M


General considerations

                           Significant   Non-significant   Total
No change                  5             49795             49800
Differentially expressed   95            105               200
Total                      100           49900             50000

The outcome above makes more sense than the one below, which leans too heavily towards sensitivity:

                           Significant   Non-significant   Total
No change                  320           49480             49800
Differentially expressed   180           20                200
Total                      500           49500             50000


General considerations

                           Significant   Non-significant   Total
No change                  5             49795             49800
Differentially expressed   95            105               200
Total                      100           49900             50000

The outcome above makes more sense than the one below, which leans too heavily towards specificity:

                           Significant   Non-significant   Total
No change                  1             49799             49800
Differentially expressed   14            186               200
Total                      15            49985             50000
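The three example tables can be compared by computing the realized false discovery proportion, sensitivity and specificity from their counts. A minimal sketch follows; the function name summarize and the table labels are made up for this illustration, and the values are single realized proportions rather than the expectations in the definitions.

```python
def summarize(v: int, u: int, s: int, t: int) -> dict:
    """v: false positives, u: true negatives, s: true positives, t: false negatives."""
    return {
        "FDR": v / (v + s),            # V / R
        "sensitivity": s / (s + t),    # S / (M - Q)
        "specificity": u / (v + u),    # U / Q
    }

# Counts (V, U, S, T) copied from the three tables on the slides.
tables = {
    "balanced": (5, 49795, 95, 105),
    "leans towards sensitivity": (320, 49480, 180, 20),
    "leans towards specificity": (1, 49799, 14, 186),
}
for name, counts in tables.items():
    print(name, {k: round(x, 4) for k, x in summarize(*counts).items()})
```

Specificity is above 0.99 in all three tables because the true nulls dominate, so it barely separates them, whereas the false discovery proportion and the sensitivity differ sharply.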


Was the BH definition the first? No. Defined in 1955….

True discovery rate
True positive rate
False positive rate

http://en.wikipedia.org/wiki/Precision_and_recall


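FDR – BH procedure

Testing m hypotheses: H1, H2, …, Hm

The p-values are: P1, P2, …, Pm

Order the p-values such that: P(1) ≤ P(2) ≤ … ≤ P(m)

Let q* be the level of FDR we want to control.

Find the largest i such that P(i) ≤ (i/m) q*

Make the corresponding p-value the cutoff value: every feature with a p-value at or below P(i) is called significant, and the FDR is controlled at q*.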
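The steps above can be sketched in plain Python. The function name benjamini_hochberg and the example p-values are made up for illustration; the rule implemented is the standard step-up rule, rejecting up to the largest i with P(i) ≤ (i/m) q*.

```python
def benjamini_hochberg(p_values, q_star):
    """Return (cutoff, selected_indices) for the BH step-up procedure.

    cutoff is None when no hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # indices by ascending p
    largest_i = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank / m * q_star:
            largest_i = rank  # keep the largest rank that satisfies the bound
    if largest_i == 0:
        return None, []
    cutoff = p_values[order[largest_i - 1]]
    return cutoff, sorted(order[:largest_i])

# Made-up p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
cutoff, selected = benjamini_hochberg(p, q_star=0.05)
print(cutoff, selected)
```

Because the procedure looks for the largest qualifying rank, a p-value that misses its own bound is still selected when a larger p-value further down the list meets its bound.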


FDR – BH procedure

The method assumes weak dependence between test statistics.

In computation, it can be simplified by taking mP(i)/i and comparing it to q*.

Intuitively:

mP(i) is roughly the number of false positives expected if the cutoff is P(i), treating all m hypotheses as null with uniformly distributed p-values.

If the cutoff were P(i), then we would select the first i features.

So mP(i)/i is the expected fraction of false positives among the selected features, i.e. the FDR.


Higher power compared to FWER controlling methods:

FDR – BH procedure


ST q-value

                           Significant   Non-significant   Total
No change                  V             U                 Q
Differentially expressed   S             T                 M-Q
Total                      R             M-R               M

FDR = E[V/(V+S)] = E[V/R]

Let t be the threshold on the p-value. With all p-values observed, V and R become functions of t:

V(t) = #{null pi ≤ t}
R(t) = #{pi ≤ t}

FDR(t) = E[V(t)/R(t)] ≈ E[V(t)]/E[R(t)]

For R(t), we can simply plug in #{pi ≤ t}.
For V(t), true null p-values should be uniformly distributed.


ST q-value

Because true null p-values are uniformly distributed, E[V(t)] = Qt.

However, Q is unknown. Let π0 = Q/M.

Now, try to find π0.

Without specifying the distribution of the alternative p-values, but assuming most of them are small, we can use the region of the p-value histogram that is relatively flat (p-values above a cutoff λ) to estimate π0.

[Figure: density histogram of the p-values, with the cutoff λ marked]


ST q-value

This procedure involves tuning the parameter λ.

With most alternative p-values at the smaller end, the estimated π0 should stabilize when λ is above a certain value.
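The estimator itself is not preserved in this transcript. A common form, given in Storey and Tibshirani (2003), is π0(λ) = #{pi > λ} / (M(1 − λ)), and the sketch below assumes that form. The toy p-values are constructed deterministically (800 evenly spread "null" values and 200 small "alternative" values), so the true π0 is 0.8 by design; none of these numbers come from the slides.

```python
def pi0_estimate(p_values, lam):
    """Fraction of p-values above lam, rescaled by the width of (lam, 1]."""
    m = len(p_values)
    return sum(p > lam for p in p_values) / (m * (1.0 - lam))

# Toy data: nulls spread evenly over (0, 1), alternatives packed below 0.01.
nulls = [(k + 0.5) / 800 for k in range(800)]
alternatives = [(k + 0.5) / 200 * 0.01 for k in range(200)]
p_values = nulls + alternatives

for lam in (0.0, 0.005, 0.25, 0.5, 0.75):
    print(lam, round(pi0_estimate(p_values, lam), 4))
```

For small λ the alternative p-values inflate the estimate; once λ exceeds the range where the alternatives live, the estimate settles at the proportion of nulls built into the toy data.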


ST q-value

“The more mathematical definition of the q value is the minimum FDR that can be attained when calling that feature significant”

Given a list of ordered p-values, this guarantees that the corresponding q-values are non-decreasing in the same order as the p-values.

The q-value procedure is robust against weak dependence between features, which “can loosely be described as any form of dependence whose effect becomes negligible as the number of features increases to infinity.”
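One way to turn the "minimum FDR" definition into a computation is to take the plug-in estimate π0·M·P(j)/j at every ordered p-value and carry a running minimum from the largest p-value downwards. The sketch below assumes π0 has already been estimated and passed in (π0 = 1 is the conservative default); the function name q_values and the example numbers are illustrative.

```python
def q_values(p_values, pi0=1.0):
    """q-value of each feature: the smallest estimated FDR over all
    cutoffs at or above that feature's p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # ascending p-values
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        j = order[rank - 1]
        estimated_fdr = pi0 * m * p_values[j] / rank
        running_min = min(running_min, estimated_fdr)
        q[j] = running_min
    return q

# Made-up p-values, for illustration only.
q = q_values([0.02, 0.021, 0.9])
print(q)
```

The running minimum is what makes the q-values non-decreasing in the order of the p-values. With π0 = 1, calling every feature with q ≤ q* significant selects the same features as the BH step-up rule at level q*.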


Efron’s Local fdr

The previous versions of FDR make statements about features falling in the tails of the distribution of the test statistic. However, they don’t make statements about an individual feature, i.e. how likely is this feature to be a false positive given its specific p-value?

Efron’s local FDR uses a mixture model and the empirical Bayes approach. An empirical null distribution is put in the place of the theoretical null.

With z being the test statistic, the local FDR is: fdr(z) = P(null | z)


Efron’s Local fdr

The test statistics come from a mixture of two distributions: f(z) = p0 f0(z) + (1 − p0) f1(z)

The exact form of f1() is not specified. It is required to be longer-tailed than f0().

We need the empirical null, but we only have a histogram from the mixture. So the null is given a strong parametric form.

And we need the proportion p0, the Bayes a priori probability that a case is null.


Efron’s Local fdr

One way to estimate the null and p0, used in the R package locfdr, is “central matching”: use a quadratic form to approximate log f(z) near z = 0.
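The idea behind a quadratic approximation can be illustrated with a deliberately simplified sketch. It assumes that near z = 0 the mixture is dominated by a normal null, so that log f(z) ≈ b0 + b1 z + b2 z², and reads the null mean, null standard deviation and p0 off the three coefficients. This is not the locfdr algorithm: the real package works from a smoothed histogram, whereas this toy fits a parabola through three exact values of a hypothetical mixture density. All names and numbers are invented for the example.

```python
import math

def normal_pdf(z, mean=0.0, sd=1.0):
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(z, p0=0.95):
    """Hypothetical mixture: N(0, 1) null plus a N(4, 1) non-null component."""
    return p0 * normal_pdf(z) + (1.0 - p0) * normal_pdf(z, mean=4.0)

def quadratic_null_fit(log_f, half_width=1.0):
    """Fit log f(z) = b0 + b1 z + b2 z^2 through z = -h, 0, h and convert
    the coefficients into a normal null N(delta0, sigma0^2) scaled by p0."""
    h = half_width
    y_minus, y_zero, y_plus = log_f(-h), log_f(0.0), log_f(h)
    b0 = y_zero
    b1 = (y_plus - y_minus) / (2.0 * h)
    b2 = (y_plus + y_minus - 2.0 * y_zero) / (2.0 * h * h)
    sigma0_sq = -1.0 / (2.0 * b2)          # curvature gives the null variance
    delta0 = b1 * sigma0_sq                # slope at zero gives the null mean
    p0 = math.exp(b0 + delta0 ** 2 / (2.0 * sigma0_sq)) * math.sqrt(2.0 * math.pi * sigma0_sq)
    return delta0, math.sqrt(sigma0_sq), p0

delta0, sigma0, p0_hat = quadratic_null_fit(lambda z: math.log(mixture_density(z)))
print(round(delta0, 3), round(sigma0, 3), round(p0_hat, 3))
```

The recovered values land close to the mean 0, standard deviation 1 and p0 = 0.95 that were built into the toy mixture, because the non-null component contributes almost nothing near z = 0.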


Efron’s Local fdr

6033 test statistics


Efron’s Local fdr

Now we have the null distribution and the proportion. Define the null subdensity around z: f0+(z) = p0 f0(z)

The Bayes posterior probability that a case is null given z: fdr(z) = f0+(z) / f(z) = p0 f0(z) / f(z)

Compare to other forms of Fdr that focus on tail area: Fdr(z) = p0 F0(z) / F(z), where F(z) = p0 F0(z) + (1 − p0) F1(z)

(F0 and F1 are the c.d.f.s of f0 and f1)

Fdr(z) is the average of fdr(Z) for Z < z.
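The difference between the local fdr and the tail-area Fdr can be seen on a fully specified toy mixture. In the sketch below the null is N(0, 1), the non-null component is N(−3, 1) and p0 = 0.9; all three are assumptions chosen for illustration (in practice f0 and p0 are estimated), and the function names are made up.

```python
import math

def normal_pdf(z, mean=0.0):
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)

def normal_cdf(z, mean=0.0):
    return 0.5 * (1.0 + math.erf((z - mean) / math.sqrt(2.0)))

P0 = 0.9          # assumed prior proportion of null cases
ALT_MEAN = -3.0   # assumed non-null centre, placed in the left tail

def local_fdr(z):
    null_sub = P0 * normal_pdf(z)                           # p0 f0(z)
    mix = null_sub + (1.0 - P0) * normal_pdf(z, ALT_MEAN)   # f(z)
    return null_sub / mix

def tail_fdr(z):
    null_sub = P0 * normal_cdf(z)                           # p0 F0(z)
    mix = null_sub + (1.0 - P0) * normal_cdf(z, ALT_MEAN)   # F(z)
    return null_sub / mix

for z in (-4.0, -3.0, -2.0):
    print(z, round(local_fdr(z), 4), round(tail_fdr(z), 4))
```

In this toy setup the local fdr increases with z, so the tail-area value, being an average over all cases further out in the tail, is never larger than the local value at the cutoff itself. A feature sitting exactly at the cutoff is therefore less convincing than the tail-area Fdr alone would suggest.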


Efron’s Local fdr

A real data example. Notice that most non-null cases (bars plotted negatively) are not reported. Controlling the FDR costs a big loss of sensitivity, which is very common.


Multidimensional Local fdr

A natural extension of the local FDR.

Use more than one test statistic to capture different characteristics of the features. Now we have a multidimensional mixture model.

Comment: Remember the “curse of dimensionality”? Since we don’t have many realizations of the non-null distribution, we can’t go beyond just a few, say 2, dimensions.


Multidimensional Local fdr

Using the t-statistic in one dimension and the log standard error in the other.

Genes with small s.e. tend to have higher FDR. This approach discounts genes with too small an s.e., similar to the fold change idea but in a theoretically sound way.

Simulated:


Multidimensional Local fdr

The null distribution is generated by permutation:

Permute the treatment labels of the samples, and re-compute the test statistics.

Repeat 100 times to obtain the null distribution f0(z).

The mixture density f(z) is obtained from the observed Z.

Like local FDR, smoothing is involved. Here two densities in 2D need to be obtained by smoothing. In 2D, the points are not as dense as in 1D, so the choice of smoothing parameters becomes more consequential.
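The permutation step can be sketched as follows. This is a simplified illustration rather than the authors' code: it uses a Welch-type two-sample t-statistic (an assumption, since the slides do not say which variant was used), pairs it with the log standard error as the second dimension, and pools the statistics over all permutations. The smoothing step is not shown. The data are random toy values, and every name here is made up.

```python
import math
import random
import statistics

def t_statistic(x, y):
    """Welch-type two-sample t-statistic and its log standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se, math.log(se)

def permutation_null(data, labels, n_perm=100, seed=0):
    """data: list of features, each a list of values per sample.
    labels: 0/1 treatment label per sample.
    Returns the pooled (t, log se) pairs over all permutations."""
    rng = random.Random(seed)
    null_stats = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute the treatment labels of the samples
        for row in data:
            x = [v for v, g in zip(row, shuffled) if g == 0]
            y = [v for v, g in zip(row, shuffled) if g == 1]
            null_stats.append(t_statistic(x, y))
    return null_stats

# Toy data: 5 features measured on 8 samples (4 per group), pure noise.
toy_rng = random.Random(1)
data = [[toy_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

null_stats = permutation_null(data, labels, n_perm=100)
print(len(null_stats))
```

The same shuffled labels are applied to every feature within one permutation, which keeps the correlation structure between features intact in the null sample.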


Multidimensional Local fdr

To address the problem, the authors did smoothing on the ratio (details skipped):

p is the number of permutations.

Afterwards, the local fdr is estimated by:


Multidimensional Local fdr

Real data:


Multidimensional Local fdr

Using other statistics: