
Upload: aron-clarke

Post on 23-Dec-2015


Differential Expression

Multiple Hypothesis Testing

STAT115

Spring 2012

Tongji 2009

Outline

• Differential gene expression
  – Parametric test: t and Welch-t test
  – Non-parametric test: permutation t and Mann-Whitney
• Multiple hypothesis testing
  – Family-wise error rate and FDR
  – Affy detection (present/absent calls)


Normalized & Summarized Data

5 Normal and 9 Myeloma (MM) Samples

Rows: genes (probe sets); columns: samples

            Normal   Normal   Normal   Normal   Normal   MM       MM       MM       MM       MM       MM       MM       MM       MM
probe set   m412a    m414a    m416a    m426a    m430a    m282     m331a    m332a    m333a    m334a    m353a    m408a    m423a    m424a
39089_at    89.31    143.37   111.61   134.78   121.57   104.02   101.11   105.16   121.21   176.72   117.16   137.19   109.5    109.06
35862_at    95.05    107.04   71.06    100.63   117.58   103.96   95.2     114.35   95.03    90.32    93.13    88.61    90.87    112.95
41777_at    22.76    20.05    21.37    25.55    30.8     20.75    21.95    28.82    30.85    28.81    22.65    18.91    22.58    21.65
38250_at    53.55    62.89    29.36    62.74    36.14    60.07    37.46    42.85    27.86    41.48    116.4    46.39    38.9     29.11
656_at      177.69   177.65   167.15   166.04   155.07   180.4    136.47   200.4    201.8    138.38   165.92   176.25   162.85   156.17
332_at      128.5    98.29    130.58   111.49   103.56   115.47   121.01   134.5    118.85   88.71    105.08   93.28    113.18   140.13
39185_at    107.86   114.02   104.08   108.89   112.75   113.61   120.9    120.1    113.82   102.72   109.81   104.86   104.4    95.53
514_at      69.21    51.43    92.43    69.21    55.46    58.43    73.9     74.58    88.07    57.01    79.11    53.63    53.43    69.62
35010_at    65.34    42       48.14    52.85    59.07    49.62    62.59    68.39    55.57    47.92    46.97    49.73    44.7     55.73
34793_s_at  9.95     9.12     10.45    14.65    21.91    13.2     14.02    17.15    9.05     10.66    8.24     13.43    17.17    15.97
33277_at    153.21   120.52   136.7    113.79   110.23   140.96   153.44   149.59   119.14   98.57    156.85   101.86   117.28   104.72
34788_at    167.66   172.86   142.6    199.39   195.34   156.66   173.96   159.16   207.34   154.18   158.59   151.91   171.65   246.11
2053_at     91.76    111.82   99.57    95.58    87.17    123.15   82.24    93.92    97.76    114.66   80.33    107.65   89.78    85.41
33465_at    63.37    45.24    54.72    56.74    58.16    59.55    63.43    71.55    55.76    46.63    49.78    40.49    44.5     69.33
41097_at    145.34   148.08   171.78   151.96   128.26   138.98   148.45   160.25   169.47   133.5    166.24   135.37   159.2    129.96
32394_s_at  449.9    1190.09  429.93   1034.13  196.52   214.51   220.81   331.66   652.66   488.37   699.41   1903.88  843.79   575.16
1969_s_at   30.03    34.58    59.76    32.84    46.98    51.34    40.4     41.75    31.8     36.74    62.42    40.4     36.37    26.06
39225_at    43.19    82.15    97.56    78.3     57.23    65.29    75.14    54.5     58.35    62.47    124.64   56.42    90.55    57.28
36919_r_at  36.45    26.84    37.94    35.79    38.86    33.99    28.94    32.57    39.61    32.08    31.37    36.58    44.33    36.99
33574_at    16.14    12.58    10.93    14.65    29.64    19.38    14.65    15.29    16.14    19.72    11.23    12.6     18.2     24.04
36271_at    41.71    25.8     39.79    49.71    52.64    33.5     48.33    41.15    48.74    45.12    36.5     38.58    55.99    29.73
490_g_at    83.48    103.93   121.57   80.05    73.81    115.47   106.57   96.19    101.49   78.5     86.13    71.87    83.73    93.64
1654_at     78.63    82.7     93.15    73.96    73.82    104.4    100.39   91.78    82.26    63.21    76.23    56.97    76.2     73.04
41207_at    100.27   80.62    84.98    75.44    74.26    95.56    96.83    100.36   85.12    71.34    81.04    75.81    70.77    70.81
40080_at    172.83   106.63   122.03   118.12   131.15   153.53   150.19   161.04   123      101.64   142.03   110.02   113.58   117.18
38699_at    69.1     67.16    62.73    67.46    74.03    61.16    75.27    75.7     63.2     68.12    57.25    65.42    70.71    75.81
698_f_at    21.36    43.88    30.5     65.43    35.73    44.05    32.34    35.17    33.89    62.61    34.72    42.49    32.13    37.51
36036_at    105.59   71.45    88.72    79.84    75.78    95.13    115.07   100.81   84.13    69.87    76.51    71.58    72.16    73.85
40720_at    104.84   175.9    186.87   65.58    64       204.55   89.48    110.87   99       59.84    138.3    59.43    197.43   118.32
32194_at    34.01    165.32   153.91   59.4     43.4     98.5     59.53    43.28    47.98    63.09    217.29   127.38   79.38    82.04
31499_s_at  42.66    36.26    47.61    43.35    48.55    40.87    52.57    53.86    41.41    40.08    44.22    35.6     43.32    41.48
41685_at    25.07    14.68    22.41    22.98    19.79    22.21    21.85    25.12    20.27    18.44    20.37    12.85    22.02    25.91
31788_at    115.87   151.38   103.33   144.45   138.01   125.9    132.74   121.06   113.56   114.21   149.88   199.76   121.17   96.03
1719_at     15.65    18.26    16.74    21.49    15.16    11.49    17.52    21.35    19.36    20.6     15.13    14.3     18.77    18.49
973_at      169.15   142.44   164.57   129      151.38   189.15   171.12   169.57   139.02   140.37   145.62   145.17   130.23   132.35


Identify Differentially Expressed Genes

• Understand the difference between two conditions / samples
  – Disease pathways
• Find disease markers for diagnosis
  – Diagnosis chips
• Interested in genes with:
  – Statistical significance: observed differential expression is unlikely to be due to chance
  – Biological significance: observed differential expression is large enough to be biologically relevant


Classical study of cancer subtypes

Golub et al. (1999)

Identification of Diagnostic Genes


Identify Differentially Expressed Genes

• Fold change
• Parametric test (assumes expression values follow a normal distribution)
  – t test and Welch-t test
• Non-parametric test (no assumption about the expression distribution)
  – Permutation t-test and Mann-Whitney U (Wilcoxon rank sum) test
• Non-parametric tests are good only if you have plenty of samples to choose from
  – Experiments with 3 treatments and 3 controls are better off with the regular t or Welch-t statistic


Fold Change

• Naïve method

• Avg(X) / Avg(Y)

• May not be a good measure of differential expression, especially for less abundant transcripts

• Note on scale:
  – Natural scale: MAS4, MAS5, dChip
  – Log scale: RMA, need to exponentiate (2^x, since RMA values are log2) before calculating fold change
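A minimal Python sketch of the fold-change calculation on both scales. The function names are illustrative, and the log-scale helper assumes base-2 logs. The numbers are the values of probe set 39089_at from the myeloma table.

```python
from statistics import mean

def fold_change(x, y):
    """Naive fold change Avg(X) / Avg(Y) for natural-scale values."""
    return mean(x) / mean(y)

def fold_change_from_log2(x_log2, y_log2):
    """Back-transform log2 values (assumed base 2) before taking the ratio."""
    return fold_change([2 ** v for v in x_log2], [2 ** v for v in y_log2])

# Probe set 39089_at: 5 normal and 9 MM samples, natural scale.
normal = [89.31, 143.37, 111.61, 134.78, 121.57]
myeloma = [104.02, 101.11, 105.16, 121.21, 176.72,
           117.16, 137.19, 109.5, 109.06]

fc = fold_change(myeloma, normal)
print(f"MM / Normal fold change: {fc:.3f}")
```

A fold change close to 1 suggests little difference in average expression, but it says nothing about the variability within each group, which is what the tests on the following slides address.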


Two Sample t-test

• Statistical significance in the two sample problem
  Group 1: X1, X2, … Xn1
  Group 2: Y1, Y2, … Yn2
• If Xi ~ Normal(μ1, σ²), Yi ~ Normal(μ2, σ²)
• Null hypothesis of μ1 = μ2

$$
s_1^2 = \frac{\sum_{i=1}^{n_1} (X_i - \bar{X})^2}{n_1 - 1}, \qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
$$

$$
t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_p^2 / n_1 + s_p^2 / n_2}}, \qquad
df = n_1 + n_2 - 2
$$
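The pooled-variance statistic can be sketched in a few lines of Python. The function name `pooled_t` and the toy data are illustrative assumptions; turning t and df into a p-value still needs a t table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 / n1 + sp2 / n2)
    return t, n1 + n2 - 2

# Toy data (hypothetical), chosen so the arithmetic is easy to follow.
t, df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f"t = {t:.4f}, df = {df}")
```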


Two Sample t-test

• Statistical significance in the two sample problem
  Group 1: X1, X2, … Xn1
  Group 2: Y1, Y2, … Yn2
• If Xi ~ Normal(μ1, σ1²), Yi ~ Normal(μ2, σ2²)
• Null hypothesis of μ1 = μ2
• Use Welch-t statistic

$$
t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}
$$

• Check t table for p-val
• A gene with small p-val (very big or small t)
  – Reject null
  – Significant difference between normal and MM
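A short Python sketch of the Welch-t statistic. The slide does not give the degrees of freedom; the sketch uses the standard Welch-Satterthwaite approximation, which is an addition here, and the function name and toy data are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Unequal-variance t statistic with approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2   # s1^2/n1 and s2^2/n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (not on the slide).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Toy data (hypothetical): the second group is twice as spread out.
t, df = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"t = {t:.4f}, df = {df:.2f}")
```

The degrees of freedom are generally not an integer, so a table lookup usually rounds down to stay conservative.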


Permutation Test

• Non-parametric method for p-val calculation
  – Do not assume normal expression distribution
  – Do not assume the two groups have equal variance
• Randomly permute sample labels, calculate t to form the empirical null t distribution
  – For MM-study, (14 choose 5) = 2002 different t values from permutation
• If the observed t is extremely high/low compared with the null distribution, the differential expression is statistically significant


Permutation Technique

Original labels (compute T0):
  Condition 0: Patient 1, Patient 2, Patient 3
  Condition 1: Patient 4, Patient 5, Patient 6

Permuted labels (compute T1, T2, T3, one per permutation):
  Condition 0: Patient 4, Patient 2, Patient 3    Condition 1: Patient 1, Patient 5, Patient 6
  Condition 0: Patient 1, Patient 2, Patient 5    Condition 1: Patient 4, Patient 3, Patient 6
  Condition 0: Patient 1, Patient 6, Patient 3    Condition 1: Patient 4, Patient 5, Patient 2

Compare T0 to T* set
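The permutation procedure can be sketched as below. This version enumerates every relabeling instead of sampling random ones, uses the Welch-t as the statistic, and reports a two-sided p-value. Those choices, the function names and the toy data are assumptions for illustration.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, variance

def welch_t(x, y):
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_test(x, y):
    """Exact two-sided permutation p-value over all relabelings.

    Assumes no relabeling produces two constant groups (zero variance).
    """
    pooled = list(x) + list(y)
    t_obs = welch_t(x, y)
    null_ts = []
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [v for i, v in enumerate(pooled) if i not in chosen]
        null_ts.append(welch_t(g1, g2))
    # Count relabelings at least as extreme as the observed labeling
    # (the small tolerance guards against floating-point ties).
    extreme = sum(1 for t in null_ts if abs(t) >= abs(t_obs) - 1e-12)
    return t_obs, extreme / len(null_ts), len(null_ts)

# Toy data (hypothetical): 3 vs 3 samples, perfectly separated.
t_obs, p_value, n_perm = permutation_test([10.0, 11.0, 12.0], [1.0, 2.0, 3.0])
print(f"t = {t_obs:.2f}, p = {p_value:.3f} from {n_perm} relabelings")
print(comb(14, 5))  # relabelings for 5 normal vs 9 MM samples
```

With 3 vs 3 samples there are only 20 relabelings, so even perfectly separated groups cannot reach a two-sided p-value below 0.1, which is why small experiments are better served by the parametric tests.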


Wilcoxon Rank Sum Test

• Rank all data in row, count sum of ranks T_T or T_C
• Significance calculated from permutation as well
• E.g. 10 normal and 10 cancer
  – Min(T) = 55
  – Max(T) = 155
  – Significance(T=150)
• Check U table (transformation of T) for stat significance
• Intuition similar to permutation t-test
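A small Python sketch of the rank sum idea for the 10 vs 10 example. `rank_sum` is an illustrative helper that uses average ranks for ties, and the significance of T = 150 is taken as the one-sided tail probability over all ways of assigning 10 of the 20 ranks to one group; both choices are assumptions.

```python
from itertools import combinations

def rank_sum(group, other):
    """Sum of the ranks of `group` in the pooled data (average ranks for ties)."""
    pooled = sorted(list(group) + list(other))

    def avg_rank(v):
        first = pooled.index(v) + 1      # 1-based rank of the first occurrence
        return first + (pooled.count(v) - 1) / 2

    return sum(avg_rank(v) for v in group)

n = 10
ranks = range(1, 2 * n + 1)
t_min = sum(range(1, n + 1))             # 55: group holds the 10 lowest ranks
t_max = sum(range(n + 1, 2 * n + 1))     # 155: group holds the 10 highest ranks

# Exact null distribution: every way to give 10 of the 20 ranks to one group.
sums = [sum(c) for c in combinations(ranks, n)]
p_upper = sum(1 for s in sums if s >= 150) / len(sums)
print(t_min, t_max, f"P(T >= 150) = {p_upper:.2e}")
```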


Multiple Hypotheses Testing

• We test differential expression for every gene at a p-value cutoff, e.g. 0.01

• If there are ~15K genes on the array, potentially 0.01 x 15K = 150 genes are wrongly called even when no gene is truly differentially expressed

• H0: no diff expr; H1: diff expr

– Reject H0: call something to be differentially expressed

• Should control family-wise error rate or false discovery rate

• Use Affy’s present/absent calls


Family-Wise Error Rate

• P(false rejection of at least one hypothesis) < α
  P(no false rejection) > 1 - α

• Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, we need to control the false rejection rate for each individual test at α/m

• If α is 0.05, for 15K gene prediction, p-value cutoff is 0.05/15K = 3.33 E-6

• Too conservative for differentially expressed gene selection
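The arithmetic on the last two slides is easy to reproduce. A minimal Python sketch follows; the function names and the three example p-values are illustrative.

```python
def bonferroni_cutoff(alpha, m):
    """Per-test p-value cutoff that keeps the family-wise error rate below alpha."""
    return alpha / m

def bonferroni_reject(p_values, alpha=0.05):
    """True for each hypothesis rejected after Bonferroni correction."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p < cutoff for p in p_values]

m = 15_000                               # genes on the array
expected_false_calls = 0.01 * m          # uncorrected cutoff of 0.01
cutoff = bonferroni_cutoff(0.05, m)
print(f"expected false calls at p < 0.01: {expected_false_calls:.0f}")
print(f"Bonferroni cutoff: {cutoff:.2e}")

# Hypothetical p-values for three genes: cutoff is 0.05 / 3.
print(bonferroni_reject([0.001, 0.02, 0.5]))
```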


False Discovery Rate

                              # not rejected    # rejected    Total
                              (not called)      (called)
# H0 (two groups similar)     U                 V             m0
# H1 (two groups different)   T                 S             m1
Total                         m - R             R             m

V: type I errors, false positives
T: type II errors, false negatives
FDR = V / R, i.e. FP / all called


False Discovery Rate

• Less conservative than family-wise error rate

• Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α*
  – Draw all m genes, ranked by p-val
  – Draw line y = x α* / m, x = 1…m
  – Call all the genes below the line (every gene up to the last one whose p-val falls below it)
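The ranking-and-line procedure can be sketched in Python as follows. The function name, the default level and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices (into p_values) of the genes called at FDR <= alpha."""
    m = len(p_values)
    # Rank genes by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # The line y = x * alpha / m evaluated at x = rank.
        if p_values[i] <= rank * alpha / m:
            k = rank          # remember the largest rank still under the line
    return sorted(order[:k])

# Hypothetical p-values. The smallest one is above the line at rank 1,
# but the second is below it at rank 2, so both genes are called.
print(benjamini_hochberg([0.02, 0.024, 0.5, 0.9], alpha=0.05))
```

Because the cutoff is set by the largest rank under the line, a gene can be called even if its own p-value sits above the line at its rank.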


FDR Threshold

[Figure: p-values of genes ranked by p-val, with the x α* / m line]


SAM for FDR Control

• Significance Analysis of Microarrays (SAM), Tusher et al. PNAS 2001
  – With small number of samples, there could be small σ and very big t by chance
  – SAM: modified t*, increase σ based on σ of other genes on the array (i.e. lowest 5 percentile of σ)
  – Proceeds with regular FDR
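A rough Python sketch of the modified statistic, t* = (mean(X) - mean(Y)) / (s + s0), where s is the gene's own standard error and s0 is a small constant borrowed from the other genes. Taking s0 as a low percentile of s follows the slide's wording; SAM's published procedure for choosing the constant is more involved. The function names and the three genes are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_se(x, y):
    """Pooled standard error of the difference in means."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return sqrt(sp2 * (1 / n1 + 1 / n2))

def moderated_t(x, y, s0):
    """t-like statistic with the denominator inflated by s0."""
    return (mean(x) - mean(y)) / (pooled_se(x, y) + s0)

def fudge_from_percentile(ses, fraction=0.05):
    """Pick s0 as a low percentile of the per-gene standard errors (assumption)."""
    ranked = sorted(ses)
    return ranked[int(fraction * (len(ranked) - 1))]

# Hypothetical genes: (group 1 values, group 2 values).
genes = {
    "gene_a": ([10.0, 10.1, 9.9], [10.4, 10.5, 10.3]),    # tiny variance
    "gene_b": ([50.0, 60.0, 55.0], [90.0, 110.0, 100.0]),
    "gene_c": ([20.0, 25.0, 22.0], [21.0, 27.0, 23.0]),
}
ses = [pooled_se(x, y) for x, y in genes.values()]
s0 = fudge_from_percentile(ses)
for name, (x, y) in genes.items():
    print(name, round(moderated_t(x, y, 0.0), 2), round(moderated_t(x, y, s0), 2))
```

The gene with the smallest variance loses the most from the added constant, which is the intended effect: a large t driven only by a tiny σ is pulled toward zero.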


Q-value

• Storey & Tibshirani, PNAS, 2003
• Empirically derived q-value
• Every p-value has its corresponding q-value (FDR)
• FDR's academic vs practical values


Affymetrix Detection

• MAS 5.0 makes an absent/marginal/present call for each probeset
• Define R = (PM-MM)/(PM+MM)
  – R near 1 means PM >> MM, abundant transcript
  – R near or below 0 means PM <= MM
• R should make cutoff τ to be considered present

[Figure: PM and MM probe intensities for a Present (P) call and an Absent (A) call]


Affymetrix Detection

• τ (default 0.015) empirically set by Affy
• Detection p-value from Wilcoxon signed rank test
  – Rank probes by (PM-MM) / (PM+MM) - τ
  – T+: 25, T-: -20, n = 9
  – Check T+ against Wilcoxon table (n) for p-value

PM     MM     R-τ          Rank(|R-τ|)   Signed rank
510    503    -0.00809     3             -3
513    509    -0.011086    4             -4
514    517    -0.01791     5             -5
535    511    0.0079446    2             2
566    527    0.0206816    6             6
582    538    0.0242857    8             8
584    592    -0.021803    7             -7
588    516    0.0502174    9             9
594    579    -0.002212    1             -1
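The table can be reproduced with a short Python sketch. The PM/MM pairs and the 0.015 threshold come from the slide. Enumerating all sign patterns for an exact one-sided p-value is an illustrative stand-in for the table lookup, and the variable names are assumptions.

```python
from itertools import product

TAU = 0.015  # default threshold from the slide

# (PM, MM) intensity pairs for the 9 probe pairs in the table.
pairs = [(510, 503), (513, 509), (514, 517), (535, 511), (566, 527),
         (582, 538), (584, 592), (588, 516), (594, 579)]

# Discrimination score minus the threshold for each probe pair.
scores = [(pm - mm) / (pm + mm) - TAU for pm, mm in pairs]

# Rank by absolute value (this example has no ties), then split by sign.
order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
rank = {i: r for r, i in enumerate(order, start=1)}
t_plus = sum(rank[i] for i, s in enumerate(scores) if s > 0)
t_minus = sum(rank[i] for i, s in enumerate(scores) if s < 0)  # magnitude only

# Exact one-sided p-value: under the null, each rank is equally likely
# to carry a plus or a minus sign, so enumerate all 2^n sign patterns.
n = len(scores)
count = sum(
    1 for signs in product((0, 1), repeat=n)
    if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= t_plus
)
p_value = count / 2 ** n
print(f"T+ = {t_plus}, T- = {t_minus}, p = {p_value:.3f}")
```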


Affymetrix Detection

• α1 and α2 are user-defined values but have optimized defaults in MAS5
• Since the expression index for low-abundance transcripts is unreliable, it is better to find differentially expressed genes only from present call genes
• Increasing τ can reduce FDR, but true present calls could be lost

[Figure: p-value of a probe set on a scale split by α1 (default 0.04) and α2 (default 0.06): Present below α1, Marginal between α1 and α2, Absent above α2]
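The call rule amounts to two comparisons. A minimal Python sketch follows, with the defaults from the slide; the function name is illustrative and the handling of p-values exactly equal to α1 or α2 is an assumption.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to Present (P), Marginal (M) or Absent (A)."""
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:      # alpha1 <= p < alpha2 (boundary handling assumed)
        return "M"
    return "A"

for p in (0.01, 0.05, 0.5):   # hypothetical detection p-values
    print(p, detection_call(p))
```

Restricting the differential expression tests to probe sets called "P" also reduces the number of hypotheses, which loosens the multiple testing correction.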


Outline

• Differential gene expression
  – Parametric test: t and Welch-t test
  – Non-parametric test: permutation t and Mann-Whitney
• Multiple hypothesis testing
  – Family-wise error rate and FDR
  – Find diff expr genes only on Affy present calls


Acknowledgment

• Kevin Coombes & Keith Baggerly
• Mark Craven
• Georg Gerber
• Gabriel Eichler
• Ying Xie
• Terry Speed & Group
• Larry Hunter
• Wing Wong & Cheng Li
• Mark Reimers
• Jenia Semyonov