test of significance for small samples javier cabrera

Test of significance for small samples

Javier Cabrera, Director, Biostatistics Institute, Rutgers University

Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development

Outline

• Microarray experiments and differential expression
• Small sample size issues
• Conditional t approach
• Comparison with other methods
• Extensions

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004.

Software: DNAMR and DNAMRweb

http://www.rci.rutgers.edu/~cabrera/DNAMR

The central dogma of molecular biology

A gene is expressed via the process:

DNA → (transcription) → mRNA → (translation) → protein

DNA is also copied through replication.

Genes: A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.

AKAP6: CATCATGCAGCAGGTCAAACAAGGCATCTCCTAGTATTGCATCCTACA……

Microarray experiment

Microarray: print or synthesize cDNA or oligonucleotide on a glass slide (5k-50k genes arrayed in a rectangular grid; one spot per gene).

Sample: prepare the biological sample, then reverse transcribe and label.

Microarray + Sample: hybridize, wash and scan; quantify spot intensities; obtain gene expression data.

Differential gene expression

An organism’s genome is the complete set of genes in each of its cells. Every cell of a given organism has a copy of the exact same genome, but:

• different cells express different genes
• different genes are expressed under different conditions
• differential gene expression leads to altered cell states

Gene      C1      C2      C3      T1      T2      T3
G1      4.67    4.44    4.42    4.73    4.85    4.69
G2      3.13    2.54    1.96    0.97    2.38    3.36
G3      6.22    6.77    5.32    6.40    6.94    6.87
G4     10.74   10.81   10.69   10.75   10.68   10.68
G5      3.76    4.16    5.27    3.05    3.20    2.85
G6      6.95    6.78    6.33    6.81    6.95    7.01
G7      4.98    4.61    4.56    4.57    4.90    4.44
G8      2.72    3.30    3.24    3.22    3.42    3.22
G9      5.29    4.79    5.13    3.31    4.67    5.27
G10     5.12    4.85    3.79    4.13    3.12    4.79
G11     4.67    3.50    4.77    4.09    3.86    2.88
G12     6.22    6.42    5.02    6.38    6.54    6.80
G13     2.88    3.76    2.78    2.98    4.81    4.15
...

Differential Expression for small samples

1. Preprocess the data.
2. Perform a t-test for each gene.
3. Select the most significant subset.

The pooled variances T-test

The t test statistic for testing for a mean effect is:

$$T_g = \frac{\bar{X}_{g2} - \bar{X}_{g1}}{s_g \, (1/n_1 + 1/n_2)^{1/2}}$$

where $s_g$, the pooled standard error, is the positive square root of:

$$s_g^2 = \frac{(n_1 - 1)\, s_{g1}^2 + (n_2 - 1)\, s_{g2}^2}{n_1 + n_2 - 2}$$

If there is no mean effect,

$$T_g \sim t_{n_1 + n_2 - 2}$$

(Student / Fisher)
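As a minimal sketch of the pooled variances t statistic, the following computes it row by row for a genes-by-replicates layout. The function name `pooled_t` and the array layout are assumptions made for illustration; the two rows are G1 and G5 from the example data table.

```python
import numpy as np

def pooled_t(x1, x2):
    """Row-wise pooled-variance t statistics.

    x1, x2: arrays of shape (genes, replicates) for groups 1 and 2.
    Returns (t, s) where s is the pooled standard error s_g.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.shape[1], x2.shape[1]
    # Pooled variance: weighted average of the two group variances.
    s2 = ((n1 - 1) * x1.var(axis=1, ddof=1)
          + (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(s2)
    t = (x2.mean(axis=1) - x1.mean(axis=1)) / (s * np.sqrt(1 / n1 + 1 / n2))
    return t, s

# Rows G1 and G5 of the example data (columns C1-C3 and T1-T3).
control = np.array([[4.67, 4.44, 4.42], [3.76, 4.16, 5.27]])
treated = np.array([[4.73, 4.85, 4.69], [3.05, 3.20, 2.85]])
t_stat, s_pooled = pooled_t(control, treated)
print(t_stat, s_pooled)
```

Under the no-effect hypothesis each statistic would be compared with a t distribution on n1 + n2 - 2 = 4 degrees of freedom.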

[Figure: plot of t vs sp and distribution of sp, with a random-data panel. Counts shown on the plot: 300 and 21983.]

Differentially expressed genes have smaller sp.

Is this effect Statistical or Biological?

500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$).

100 genes are differentially expressed, with mean difference +1 or -1.

$\sigma^2 = 1$ (constant):

Method    False Discoveries    True Discoveries
T-test           44                   22
z-test           43                   29

$\sigma^2$ from Chi-square(df=3):

Method    False Discoveries    True Discoveries
T-test           43                   28
z-test           53                   13

The effect of small sample size

Often the sample size per group is small. This leads to:

• unreliable variances (inferences)
• dependence between the test statistics (tg) and the standard error estimates (sg)

Possible remedies:

• borrow strength across genes (LPE/EB)
• regularize the test statistics (SAM)
• work with tg|sg (Conditional t)

Analysis results

Top 10 genes (sorted by t-test p-value)

Gene      Fold    Dir    p           p(Bonf)
G6546     2.36    D      0.000004    0.0964
G19945    3.25    U      0.000005    0.1102
G21586    1.64    U      0.000008    0.1765
G18970    2.52    U      0.000019    0.4220
G7432     3.70    D      0.000033    0.7248
G19057    1.85    U      0.000046    1.0000
G17361    4.34    D      0.000067    1.0000
G8525     5.57    D      0.000067    1.0000
G425     18.11    D      0.000078    1.0000
G8524     4.74    D      0.000109    1.0000

SAM: Determining c

For each candidate value, compute v1(·) = mad{Tg}, v2(·), ..., v7(·).

[Figure: $\hat{T}_g(c)$ plotted against the pooled SD.]

SAM: Gene selection

[Figure: the ordered statistics $\hat{T}_{(g)}(c)$ plotted against their expected values under permutations.]

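Conditional t: Basic Model

Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.

Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$

Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$

Error model: $\varepsilon_{gij} \sim F$ (location = 0, scale = 1)

Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$, with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$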

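Possible approaches

Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.

Nonparametric:

1. Estimate the distributions by their empirical distribution functions (edf):
   - Estimate $F_{\mu,\sigma}$: edf, $\hat{F}_{\mu,\sigma}$, of $\{(\bar{X}_{g1}, s_g^2)\}$
   - Estimate $F$: edf, $\hat{F}$, of $\{(X_{gij} - \bar{X}_{gj}) / s_g\}$
   For small samples $\hat{F}_\sigma$ is not a good estimator of $F_\sigma$. Use method of moments = target estimation.
2. Proceed via resampling and estimate the distribution of t | sp (Conditional t).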

Procedure

(1) Draw a gene, g, at random from {1, …, G}. Call it g*. Then $(\bar{X}_{g^*1}, s_{g^*}) \sim \hat{F}_{\mu,\sigma}$.
(2) Take a random sample (with replacement) of size $n_1 + n_2$ from $\hat{F}$: $r^*_{ij} \sim \hat{F}$.
(3) Combine these to form pseudo-data: $X^*_{ij} = \bar{X}_{g^*1} + s_{g^*} r^*_{ij}$.
(4) Calculate the pooled standard error s* and t test statistic t* for the pseudo-data $\{X^*_{ij}\}$.

Procedure (cont.)

(5) Repeat steps (1)-(4) a large number (10,000) of times.
(6) Given $\alpha$, estimate the “critical envelope”, $t_\alpha(s_g)$, as the ($\alpha$/2) and (1-$\alpha$/2) quantile curves in the tg vs sg relationship.
(7) Genes that fall outside the critical envelope defined by $t_\alpha(s_g)$ are deemed significant at level $\alpha$. (Overall unconditional Type I error rate = $\alpha$)
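Steps (1)-(7) of the procedure can be sketched as follows. This is an illustration under stated assumptions, not the DNAMR implementation: the function names are hypothetical, the data are simulated under no effect, and the quantile curves in step (6) are approximated by quantiles within bins of s* and then linearly interpolated, which is only one of several ways such curves could be estimated.

```python
import numpy as np

def pooled_stats(a, b):
    """Row-wise pooled standard error s and t statistic for two groups."""
    n1, n2 = a.shape[1], b.shape[1]
    s2 = ((n1 - 1) * a.var(axis=1, ddof=1)
          + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(s2)
    t = (b.mean(axis=1) - a.mean(axis=1)) / (s * np.sqrt(1 / n1 + 1 / n2))
    return s, t

def conditional_t(x1, x2, alpha=0.05, n_boot=10_000, n_bins=20, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2 = x1.shape[1], x2.shape[1]
    s_g, t_g = pooled_stats(x1, x2)
    m1 = x1.mean(axis=1)
    # Standardized residuals pooled over all genes play the role of F-hat.
    resid = np.concatenate(
        [x1 - m1[:, None], x2 - x2.mean(axis=1)[:, None]], axis=1
    ) / s_g[:, None]
    resid = resid.ravel()
    # Steps (1)-(5): draw genes and residuals, build pseudo-data, get (s*, t*).
    g_star = rng.integers(0, x1.shape[0], size=n_boot)
    r_star = rng.choice(resid, size=(n_boot, n1 + n2), replace=True)
    pseudo = m1[g_star][:, None] + s_g[g_star][:, None] * r_star
    s_star, t_star = pooled_stats(pseudo[:, :n1], pseudo[:, n1:])
    # Step (6), ASSUMPTION: quantile curves estimated within bins of s*.
    edges = np.quantile(s_star, np.linspace(0, 1, n_bins + 1))
    bin_id = np.clip(np.searchsorted(edges, s_star, side="right") - 1,
                     0, n_bins - 1)
    bins = range(n_bins)
    centers = np.array([np.median(s_star[bin_id == b]) for b in bins])
    lower = np.array([np.quantile(t_star[bin_id == b], alpha / 2) for b in bins])
    upper = np.array([np.quantile(t_star[bin_id == b], 1 - alpha / 2) for b in bins])
    # Step (7): flag genes whose (s_g, t_g) falls outside the envelope.
    lo_g = np.interp(s_g, centers, lower)
    hi_g = np.interp(s_g, centers, upper)
    significant = (t_g < lo_g) | (t_g > hi_g)
    return significant, centers, lower, upper

# Simulated null data: 500 genes, 4 controls and 4 treatments.
rng = np.random.default_rng(1)
x1 = rng.normal(size=(500, 4))
x2 = rng.normal(size=(500, 4))
sig, centers, lower, upper = conditional_t(x1, x2)
print("fraction flagged:", sig.mean())
```

With no true effects in the simulated data, the flagged fraction should be of roughly the same order as the nominal level.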

Roadblock

$\hat{F}_\sigma(t)$ is not a good estimator of $F_\sigma(t)$.

Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$, and let the variance obtained from the $\{X_{ij}\}$ be $s^2$.

Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.

For example, if we assume that $F_\sigma = \chi^2_3$, n = 4 and $\varepsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.

Fix by target estimation: method of moments. Shrink towards the center.
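The inequality between the two variances can be seen from the law of total variance. The sketch below assumes normal errors, so that given $\sigma^2$ the estimate satisfies $\nu s^2 / \sigma^2 \sim \chi^2_\nu$; the symbol $\nu$ for the residual degrees of freedom is introduced here for illustration and depends on the design.

```latex
\begin{aligned}
\mathrm{Var}(s^2)
  &= \mathrm{Var}\!\left(E[s^2 \mid \sigma^2]\right)
   + E\!\left[\mathrm{Var}(s^2 \mid \sigma^2)\right] \\
  &= \mathrm{Var}(\sigma^2) + \frac{2}{\nu}\, E[\sigma^4]
  \;>\; \mathrm{Var}(\sigma^2).
\end{aligned}
```

The excess term shrinks as $\nu$ grows, which is why the distortion matters most for small samples.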

[Figure: E7 Data; two rows of panels, each labeled Case 1, Case 2, Case 3.]

Example: Checking for the distribution of $\sigma_g$

Compare the distribution of sg with simulations under:

1. $\sigma^2 \sim \chi^2_{0.5}$
2. $\sigma^2 \sim \chi^2_{2}$
3. $\sigma^2 \sim \chi^2_{6}$

[Figure: Mice Data; panels 1. Df=0.5, 2. Df=2, 3. Df=6.]

Another Example

[Figure: Tox Data; two rows of panels, each labeled Case 1, Case 2, Case 3.]

Compare the distribution of sg with simulations under:

1. $\sigma^2 \sim \chi^2_{0.5}$
2. $\sigma^2 \sim \chi^2_{3}$
3. $\sigma^2 \sim \chi^2_{6}$

[Figure: mean diff vs Sp; panels Df=0.5, Df=3, Df=6.]

Fixing the variance distribution

The idea is to estimate the function $h: [0,1] \to [0,1]$ defined by $h(F_\sigma(x)) = F_s(x)$, where $F_s$ denotes the distribution of the observed variances $s^2$. Since h is strictly monotonic, it can be inverted in order to obtain an estimate of $F_\sigma(x)$.

Procedure:

(1) Assume that $\hat{F}_\sigma(x)$ is the true distribution of $\sigma^2$ and draw a random sample, $s^{*2}$, from $\hat{F}_\sigma$.
(2) Take a random sample (with replacement) of size $N = n_1 + n_2$ from $\hat{F}$: $r^*_{ij} \sim \hat{F}$ for i = 1, …, $n_j$, j = 1, 2.
(3) Combine these to form pseudo-data: $X^*_{ij} = s^* r^*_{ij}$.

Fixing the variance distribution (contd)

(4) Calculate the pooled standard error $s^{**}$ for the pseudo-data $\{X^*_{ij}\}$.
(5) Repeat steps (1)-(4) a large number (say 100,000) of times and record, for each iteration, the pair of values $\{(s^{*2}, s^{**2})\}$.
(6) Let $\hat{F}_*(x)$ be the empirical distribution of the $s^{**2}$'s. Then the estimator of h is obtained by mapping the empirical distribution $\hat{F}_\sigma$ into $\hat{F}_*$. More precisely, with $y = \hat{F}_\sigma(x)$,

$$\hat{h}(y) = \hat{F}_*(\hat{F}_\sigma^{-1}(y))$$

and

$$\hat{h}^{-1}(y) = \hat{F}_\sigma(\hat{F}_*^{-1}(y)).$$

Hence the bias-corrected estimator of $F_\sigma$ is:

$$\tilde{F}_\sigma(x) = \hat{F}_\sigma(\hat{F}_*^{-1}(\hat{F}_\sigma(x)))$$
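The correction steps can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: the names `pooled_var` and `corrected_cdf` are hypothetical, the data are simulated with variances drawn from a chi-square with 3 degrees of freedom, fewer resamples are used than the 100,000 suggested, and the pooled residuals are rescaled to unit variance (an extra assumption made here so that the pseudo-data variances are centered on the drawn variances).

```python
import numpy as np

def pooled_var(a, b):
    """Row-wise pooled variance of two groups."""
    n1, n2 = a.shape[1], b.shape[1]
    return ((n1 - 1) * a.var(axis=1, ddof=1)
            + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)

def corrected_cdf(s2, resid, n1, n2, n_boot=20_000, seed=0):
    """Return (f_hat, f_tilde): naive and bias-corrected cdf estimates."""
    rng = np.random.default_rng(seed)
    s2_sorted = np.sort(s2)
    # Steps (1)-(5): treat the edf of s_g^2 as the truth and resample.
    s2_star = rng.choice(s2_sorted, size=n_boot, replace=True)
    r_star = rng.choice(resid, size=(n_boot, n1 + n2), replace=True)
    pseudo = np.sqrt(s2_star)[:, None] * r_star
    s2_2star = pooled_var(pseudo[:, :n1], pseudo[:, n1:])

    def f_hat(x):
        # Empirical cdf of the observed variances.
        return np.searchsorted(s2_sorted, x, side="right") / s2_sorted.size

    def f_tilde(x):
        # Step (6): F_hat( F_star^{-1}( F_hat(x) ) ).
        return f_hat(np.quantile(s2_2star, f_hat(x)))

    return f_hat, f_tilde

# Simulated example: 2000 genes, 4 + 4 arrays, sigma^2 ~ chi-square(3).
rng = np.random.default_rng(1)
G, n1, n2 = 2000, 4, 4
sigma = np.sqrt(rng.chisquare(3, size=G))[:, None]
x1 = rng.normal(0.0, sigma, size=(G, n1))
x2 = rng.normal(0.0, sigma, size=(G, n2))
s2 = pooled_var(x1, x2)
resid = np.concatenate(
    [x1 - x1.mean(axis=1)[:, None], x2 - x2.mean(axis=1)[:, None]], axis=1
) / np.sqrt(s2)[:, None]
resid = resid.ravel()
resid = resid / resid.std()  # ASSUMPTION: rescale residuals to unit variance

f_hat, f_tilde = corrected_cdf(s2, resid, n1, n2)
x_lo, x_hi = np.quantile(s2, [0.1, 0.9])
print(f_hat(x_lo), f_tilde(x_lo), f_hat(x_hi), f_tilde(x_hi))
```

In the lower tail one would expect the corrected cdf to sit below the naive one, which is the "shrink towards the center" effect on the variance distribution.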

Proceed as before …

[Figure: plot of t vs sp. Counts shown on the plot: 191 and 22092.]

Differentially expressed genes may have large sp

500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$).

100 genes are differentially expressed, with mean difference +1 or -1.

$\sigma^2 = 1$ (constant):

Method    False Discoveries    True Discoveries
T-test           44                   22
z-test           43                   29
C-t              45                   30

$\sigma^2$ from Chi-square(df=3):

Method    False Discoveries    True Discoveries
T-test           43                   28
z-test           53                   13
C-t              42                   38

Using 8 iid samples from the Khan data, we modify 50 genes to make them differentially expressed at a high level.

[Figure: panels T-test and SAM.]

Generating p-values

To generate p-values, recall that the Ct procedure generates curves, $c_\alpha(s)$. Start with a set of curves, $c_{\alpha_1}(s_g), \ldots, c_{\alpha_k}(s_g)$, for a set of prespecified values, $\alpha_1, \ldots, \alpha_k$.

Now consider the relationship between $v_i = \log(-\log(\alpha_i))$ and $u_i = \log(c_{\alpha_i}(s_g))$.

To assign an approximate p-value to the gth gene, if $|t_g| \le c_{\alpha_k}(s_g)$, interpolate the relationship between the $\{u_i\}$ and the $\{v_i\}$.
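The interpolation step can be sketched as follows for a single gene. The function name and the cutoff values are hypothetical, chosen only to illustrate the transformation; linear interpolation via `np.interp` is an assumption, and values outside the range of the curves are simply clamped to the nearest prespecified level.

```python
import numpy as np

def interpolate_p_value(t_abs, alphas, cutoffs):
    """Approximate p-value for |t_g| from envelope cutoffs at the gene's s_g.

    alphas:  prespecified levels alpha_1, ..., alpha_k (each in (0, 1)).
    cutoffs: c_{alpha_i}(s_g), the envelope heights at this gene's s_g.
    """
    alphas = np.asarray(alphas, dtype=float)
    cutoffs = np.asarray(cutoffs, dtype=float)
    order = np.argsort(cutoffs)             # np.interp needs increasing x
    u = np.log(cutoffs[order])              # u_i = log(c_{alpha_i}(s_g))
    v = np.log(-np.log(alphas[order]))      # v_i = log(-log(alpha_i))
    v_t = np.interp(np.log(t_abs), u, v)    # clamps outside [u_1, u_k]
    return float(np.exp(-np.exp(v_t)))      # invert v = log(-log(p))

# Hypothetical cutoffs for one gene; not taken from any data set.
alphas = [0.2, 0.1, 0.05, 0.01, 0.001]
cutoffs = [1.5, 2.0, 2.5, 3.8, 6.0]
p = interpolate_p_value(3.0, alphas, cutoffs)
print(p)
```

A statistic that falls exactly on one of the curves recovers the corresponding level, and values between two curves receive a p-value between the two levels.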

Extensions

F test:
- Condition on the sqrt(MSE)

Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE)

Gene Ontology:
- Test for the significance of groups.
- Use Hypergeometric Statistic, mean t, mean p-value, or other.
- Condition on log of the number of genes per group

[Figure: Conditional F; axis labeled Sqrt(MSE).]

[Figure: GO Ontology: Conditioning on log(n); axes Abs(T) and Log(n).]

The Details:

Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004.

Email: cabrera@stat.rutgers.edu, damaratu@prdus.jnj.com

Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR

Target Estimation:

Cabrera, Fernholz (1999)

- Bias Reduction.

- MSE reduction.

Recent Applications:

- Ellipse Estimation (Multivariate Target).

- Logistic Regression:

• Cabrera, Fernholz, Devas (2003)

• Patel (2003) Target Conditional MLE (TCMLE)

Implementation in StatXact (CYTEL) and LogXact Proc’s in SAS (by CYTEL).

Target Estimation

Suppose we have an estimator $\hat{\theta} = T(x_1, \ldots, x_n)$ of a parameter $\theta$, and let $h(\theta) = E_\theta(T)$.

Target estimator: solve $h(\theta) = \hat{\theta}$; then $\tilde{\theta} = h^{-1}(\hat{\theta})$.

Algorithms: - Stochastic approximation.

- Simulation and iteration.

- Exact algorithm for TCMLE
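A minimal sketch of the "simulation and iteration" idea, under assumptions chosen for illustration: the estimator T is the divide-by-n variance of a normal sample, h is approximated by Monte Carlo with a fixed set of random draws, and h(θ) = observed estimate is solved by bisection. The function names and the sample values are hypothetical.

```python
import numpy as np

def make_h(n, n_sim=5000, seed=0):
    """Monte Carlo approximation of h(theta) = E_theta(T) for samples of size n."""
    # Common random numbers: the same draws are reused for every theta,
    # so the approximated h is a smooth, increasing function of theta.
    z = np.random.default_rng(seed).standard_normal((n_sim, n))

    def h(theta):
        # T = variance with divisor n, applied to N(0, theta) samples.
        return float((np.sqrt(theta) * z).var(axis=1, ddof=0).mean())

    return h

def target_estimate(theta_hat, h, lo, hi, tol=1e-10, max_iter=200):
    """Solve h(theta) = theta_hat by bisection, assuming h increases on [lo, hi]."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) < theta_hat:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical sample of size 5.
x = np.array([4.67, 4.44, 4.42, 4.73, 4.85])
theta_hat = x.var(ddof=0)                      # biased estimator T
theta_tilde = target_estimate(theta_hat, make_h(x.size), 0.0, 100 * theta_hat)
print(theta_hat, theta_tilde)
```

For this estimator h(θ) = θ(n-1)/n is known in closed form, so the target estimate should land close to the usual unbiased variance; the simulation route is only needed when h has no closed form.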
