Center for Research Methods and Analysis Forum April 30, 2004 Robust and exploratory statistics: Applications using microarray data Kellie J. Archer, Ph.D. Department of Biostatistics Center for the Study of Biological Complexity Virginia Commonwealth University [email protected]


Page 1: Robust and exploratory statistics: Applications using

Center for Research Methods and Analysis Forum

April 30, 2004

Robust and exploratory statistics: Applications using microarray data

Kellie J. Archer, Ph.D.

Department of Biostatistics

Center for the Study of Biological Complexity

Virginia Commonwealth University

[email protected]

Page 2: Robust and exploratory statistics: Applications using

Outline

• Background regarding robust methods of estimation

• Microarray data

• Robust statistical methods

– Spread-versus-level plot for identifying a variance stabilizing transformation

– Median polish as an alternative to two-way ANOVA

Page 3: Robust and exploratory statistics: Applications using

Classical statistical methods

• Classical theory emphasizes large sample properties.

• Classical statistical techniques are designed to be the best possible when stringent assumptions apply.

• Classical techniques can behave badly when the practical situation departs from the ideal.

Page 4: Robust and exploratory statistics: Applications using

Robust statistical methods

• Data sets are usually small and/or the data do not adhere to underlying assumptions.

• Robust and resistant methods are those which are “best” compromises for a broad range of situations.

Page 5: Robust and exploratory statistics: Applications using

Robust statistical methods

• Resistant – estimator is insensitive to outlying values

– estimate changes only slightly when a small part of the data is replaced by new numbers

• Robust – estimator is insensitive to underlying assumptions (e.g., shape of distribution)

Page 6: Robust and exploratory statistics: Applications using

Robust statistical methods

• Mean is a classical measure of central tendency.

– Unfortunately, it is not robust.

– It is affected by a single wild value.

• The median is also a measure of central tendency.

– Unlike the mean, the median is not affected by a single wild value and therefore is robust.

• The breakdown bound is the fraction of the observations in a sample that can be turned into bad values (e.g., essentially ∞) without causing the estimate to change greatly.

Page 7: Robust and exploratory statistics: Applications using

Robust statistical methods

• Example:

– 19 observations were generated from a N(0,1) distribution

– 1 observation was generated from a N(0,4) distribution

0.847 -0.740 0.095 -0.864 -0.976 -1.178 -0.714 0.280 0.790 0.153 0.990 -1.009 0.175 0.036 0.553 -1.683 0.848 -1.173 -0.111 7.660

– Mean = 0.20

– Median = 0.07
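The effect of the single wild value can be checked directly. The following is a minimal sketch using Python's standard `statistics` module on the 20 observations listed in the example; dropping the value 7.660 for comparison is an illustration added here, not a step from the slides.

```python
from statistics import mean, median

# The 20 observations from the example (the last one is the wild value).
observations = [
    0.847, -0.740, 0.095, -0.864, -0.976, -1.178, -0.714, 0.280, 0.790, 0.153,
    0.990, -1.009, 0.175, 0.036, 0.553, -1.683, 0.848, -1.173, -0.111, 7.660,
]

sample_mean = mean(observations)
sample_median = median(observations)

# Illustration only: remove the wild value to see how far each estimate moves.
without_wild = [x for x in observations if x != 7.660]
mean_without_wild = mean(without_wild)
median_without_wild = median(without_wild)

print(f"mean:   {sample_mean:.5f} (all 20) vs {mean_without_wild:.5f} (wild value removed)")
print(f"median: {sample_median:.5f} (all 20) vs {median_without_wild:.5f} (wild value removed)")
```

The mean shifts by roughly 0.4 when the wild value is removed, while the median moves by about 0.03.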

Page 8: Robust and exploratory statistics: Applications using

Microarrays

A snapshot that captures the activity pattern of thousands of genes at once.

[Images: Affymetrix GeneChip; custom spotted arrays]

Page 9: Robust and exploratory statistics: Applications using

Needs

• Although microarray experiments may include thousands of outcomes, due to expense we are usually dealing with small sample sizes.

• Due to obtaining measurements for thousands of outcomes, we cannot realistically check assumptions and underlying distributions for all genes.

• Estimating intensity of a probe and probe set requires methods insensitive to outlying values

• Need alternative estimators of location rather than using the mean.

Page 10: Robust and exploratory statistics: Applications using

Transforming data to meet model assumptions

• Induce symmetry

• Achieve constant variance

• Achieve linearity

• Achieve normality

Page 11: Robust and exploratory statistics: Applications using

Log Transformation

• For custom spotted arrays, the quantity used for analysis is most often the

$$\log_2\left(\frac{\text{Sample signal}}{\text{Reference signal}}\right)$$

• This ratio may or may not include background subtraction

$$\log_2\left(\frac{\text{Sample signal} - \text{Sample background}}{\text{Reference signal} - \text{Reference background}}\right)$$

Page 12: Robust and exploratory statistics: Applications using

Log Transformation

Gene   Reference Signal   Sample Signal   Ratio x_i   Log2(Ratio)
A      75                 150             2.0          1.0
B      100                50              0.5         -1.0

Computing the average fold change relative to the reference

$$2^{\frac{1}{n}\sum_i \log_2(x_i)} = 2^{\frac{1.0 + (-1.0)}{2}} = 2^0 = 1$$

$$\text{Geometric mean} = \sqrt{2.0 \times 0.5} = 1.0$$

$$\text{Arithmetic average} = \frac{2.0 + 0.5}{2} = 1.25$$
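The same arithmetic can be reproduced in a few lines. This sketch assumes only the two ratios from the example (2.0 for gene A, 0.5 for gene B); the variable names are illustrative.

```python
import math

ratios = [2.0, 0.5]  # fold changes for genes A and B in the example

# Averaging on the raw scale treats a 2-fold increase and a 2-fold decrease asymmetrically.
arithmetic_mean = sum(ratios) / len(ratios)

# Averaging on the log2 scale and back-transforming gives the geometric mean.
mean_log2 = sum(math.log2(r) for r in ratios) / len(ratios)
geometric_mean = 2 ** mean_log2

print("arithmetic mean of ratios:", arithmetic_mean)
print("back-transformed mean of log2 ratios:", geometric_mean)
```

On the log2 scale the two-fold increase and the two-fold decrease cancel, which is why the log ratio is the easier quantity to interpret.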

Page 13: Robust and exploratory statistics: Applications using

Data

• The Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, was hybridized to 16 HG-U133A GeneChips® (Dumur et al., 2003) to study GeneChip® reproducibility.

• Five identical aliquots of 40 µg of the Reference RNA were used as template for cDNA synthesis, cRNA in vitro transcription, and labelling reactions. The final fragmented (frg) cRNA aliquots were pooled together.

• Hybridization was performed using the same lot number of the HG-U133A arrays, which contains 22,283 probe sets.

Page 14: Robust and exploratory statistics: Applications using

Mean versus variance plot of 16 Replicate GeneChips

[Figure: Mean versus Variance Plot]

Page 15: Robust and exploratory statistics: Applications using

Log Transformation

• Applied to custom spotted arrays to facilitate interpretation and induce symmetry.

• Its usefulness for custom spotted arrays has led to widespread application of the log transformation to Affymetrix GeneChip data.

Page 16: Robust and exploratory statistics: Applications using

Mean versus variance plot after applying the log2 transformation

[Figure: Mean versus Variance Plot after applying log2 transformation]

Page 17: Robust and exploratory statistics: Applications using

Box-Cox Transformations

A commonly used family of variance stabilizing transformations is the Box-Cox transformation, defined as

$$\Phi_p(x) = \begin{cases} \dfrac{x^p - 1}{p} & \text{for } p \neq 0, \\ \ln x & \text{for } p = 0, \end{cases}$$

where x is the observed data, p is a number to be determined, and Φ is the transformed data. Box and Cox (1964) selected p as the value that maximized the likelihood.
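As a minimal sketch, the definition translates into a short function. The name `box_cox` and the restriction to positive x are choices made for this illustration; selecting p by maximum likelihood, as Box and Cox did, is not shown.

```python
import math


def box_cox(x, p):
    """Box-Cox transform of a single positive value x with power p."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation is defined here for x > 0")
    if p == 0:
        return math.log(x)      # limiting case p = 0
    return (x ** p - 1) / p     # general case p != 0


# p = 0.5 is the square-root member of the family: (sqrt(x) - 1) / 0.5 = 2*sqrt(x) - 2
print(box_cox(4.0, 0.5))
print(box_cox(4.0, 0))
```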

Page 18: Robust and exploratory statistics: Applications using

Spread-versus-Level Plot

• Emerson (1987) demonstrated use of the spread-versus-level plot as a means to empirically estimate p.

– Plot log10 of median (x-axis) by log10 of fourth-spread (y-axis).

– The estimated slope b of the spread-versus-level plot identifies the approximate value of the exponent for the Box-Cox transformation, where p = 1 – b.

Page 19: Robust and exploratory statistics: Applications using

Sample median

• If n is odd, then the sample median is the middle observation $Y_{(k)}$, where k = (n + 1)/2;

• if n is even, the sample median is considered to be any value between the two middle observations $Y_{(k)}$ and $Y_{(k+1)}$, where k = n/2.

Page 20: Robust and exploratory statistics: Applications using

Spread-versus-Level Plot

• Median: after ranking the observations, the median is at depth (n+1)/2.

• Upper ($F_U$) and lower ($F_L$) fourths: the fourths are those values with a depth of ([depth of median] + 1)/2, where [·] denotes the integer part and depth is counted from the nearer end of the ranked data.

• Fourth-spread: robust measure of spread, defined by $d_F = F_U - F_L$.
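The depth rules above can be sketched as follows. The helper names are hypothetical, and the convention that a half-integer depth averages the two adjacent ranked values is the usual letter-value convention, assumed here rather than stated on the slide.

```python
import math


def value_at_depth(ranked, depth):
    """Value at a (possibly half-integer) depth, counted from the start of `ranked`."""
    lo = math.floor(depth)
    hi = math.ceil(depth)
    # For an integer depth lo == hi; for depth k + 0.5 average the two neighbours.
    return (ranked[lo - 1] + ranked[hi - 1]) / 2


def median_and_fourths(x):
    """Return (median, lower fourth, upper fourth, fourth-spread)."""
    ranked = sorted(x)
    n = len(ranked)
    median_depth = (n + 1) / 2
    fourth_depth = (math.floor(median_depth) + 1) / 2
    med = value_at_depth(ranked, median_depth)
    lower = value_at_depth(ranked, fourth_depth)        # depth counted from the bottom
    upper = value_at_depth(ranked[::-1], fourth_depth)  # depth counted from the top
    return med, lower, upper, upper - lower


print(median_and_fourths([1, 2, 3, 4, 5, 6, 7, 8, 9]))
print(median_and_fourths([1, 2, 3, 4, 5, 6, 7, 8]))
```

For a spread-versus-level plot, a function like this would be applied to each group of replicate measurements (for example, each probe set across arrays) to obtain one (median, fourth-spread) pair per group.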

Page 21: Robust and exploratory statistics: Applications using

Spread-versus-Level Plot

• X is a real random variable with population median $M_X$, lower fourth $F_L$, upper fourth $F_U$, and fourth spread $d_F$.

• Problem is to identify a transformation $y = \Phi(x)$ for which the variance is constant.

• A Taylor series expansion of Φ(x) about $M_X$ is

$$\Phi(x) = \Phi(M_X) + \Phi'(M_X)\,(x - M_X) + \Phi''(M_X)\,\frac{(x - M_X)^2}{2!} + \cdots$$

Page 22: Robust and exploratory statistics: Applications using

Spread-versus-Level Plot

Let $\lambda(M_X)$ represent the distance from $M_X$ to $F_U$ as a proportion of the distance $d_F$; therefore, the median and upper and lower fourths for the data transformed by Φ are

$$M_Y = \Phi(M_X),$$

$$(F_U)_Y = \Phi(F_U) = \Phi\left(M_X + \lambda(M_X)\, d_F\right),$$

$$(F_L)_Y = \Phi(F_L) = \Phi\left(M_X - \{1 - \lambda(M_X)\}\, d_F\right).$$

Substituting into the Taylor series expansion,

$$(d_F)_Y = d_F \cdot \Phi'(M_X) + \frac{\left(2\lambda(M_X) - 1\right) d_F^2}{2!}\, \Phi''(M_X) + \cdots$$

Page 23: Robust and exploratory statistics: Applications using

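Spread-versus-Level Plot

• Since the quadratic term of $(d_F)_Y$ is negligible compared to the leading term,

$$(d_F)_Y \approx d_F \cdot \Phi'(M_X)$$

• Now supposing that the relationship between spread $d_F$ and level x is a power function of the form $d_F(x) = k \cdot x^b$ for some constants k and b, the transformed spread is constant when $\Phi'(x)$ is proportional to $1/d_F(x)$, so that

$$\Phi(x) = \int \frac{1}{k} \cdot \frac{1}{x^b}\, dx = \begin{cases} c_1 x^{1-b} + c_2 & \text{for } b \neq 1, \\ c_3 \ln x + c_4 & \text{for } b = 1. \end{cases}$$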

Page 24: Robust and exploratory statistics: Applications using

Spread-versus-Level Plot

• Recall $d_F(x) = k \cdot M_X^b$, so that

$$\log(\hat{d}_F) = \log(k) + b \cdot \log(M_X)$$

• Therefore, the slope of the spread-versus-level plot estimates b. Hence the suggested power transformation would be

$$x^p = x^{1-b}$$
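The slope estimate can be sketched as a straight-line fit on the log10 scale. Ordinary least squares is an assumption here (the slides do not say how the line was fit), and the medians and fourth-spreads below are synthetic values constructed to follow $d_F = k \cdot M^b$ exactly, not data from the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def suggested_power(medians, fourth_spreads):
    """Regress log10(fourth-spread) on log10(median); return (p, slope b, intercept)."""
    log_level = [math.log10(m) for m in medians]
    log_spread = [math.log10(d) for d in fourth_spreads]
    intercept, slope = fit_line(log_level, log_spread)
    return 1 - slope, slope, intercept


# Synthetic example: spreads generated with k = 1.5 and b = 0.5.
k_true, b_true = 1.5, 0.5
medians = [10.0, 100.0, 1000.0, 10000.0]
fourth_spreads = [k_true * m ** b_true for m in medians]

p_hat, b_hat, log_k_hat = suggested_power(medians, fourth_spreads)
print("slope b:", b_hat, "suggested power p = 1 - b:", p_hat)
```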

Page 25: Robust and exploratory statistics: Applications using

Graphical Plot for identifying power transformation

Spread-versus-level Plot

$\hat{\beta}_0 = 0.0519$; $\hat{\beta}_1 = 0.5692$

Page 26: Robust and exploratory statistics: Applications using

Results

• The estimated power for the transformation turned out to be $\hat{p} = 1 - 0.57 = 0.43$.

• Typically, for simplicity the parameter estimate is rounded to the nearest multiple of 0.5, which in this case would lead to $\hat{p} = 0.5$.

• Here, $\Phi_{0.5}(x) = (x^{0.5} - 1)/0.5 = 2\sqrt{x} - 2$

Page 27: Robust and exploratory statistics: Applications using

Graphical Plot for identifying power transformation

[Figure: Mean versus Variance Plot after the $2\sqrt{x} - 2$ transformation; $\hat{\beta}_0 = 0.0519$; $\hat{\beta}_1 = 0.5692$]

Page 28: Robust and exploratory statistics: Applications using

Other Variance Stabilizing Transformations

• Assuming the gene expression data may be modelled as

$$y = \alpha + \mu e^{\eta} + \varepsilon$$

where y represents the measured intensity, α represents the average background, µ represents the true gene expression level, and η and ε are normally distributed error terms with zero mean and differing non-zero variances, the generalized logarithm transformation

$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right),$$

where $c = \sigma_\varepsilon^2 / S_\eta$, stabilizes the variance (Durbin et al., 2002) for large samples.
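A minimal sketch of the generalized logarithm follows. The function name `glog` is illustrative, and the background α and constant c are taken as given inputs; how they are estimated from data is outside this sketch.

```python
import math


def glog(y, alpha, c):
    """Generalized logarithm g(y) = ln(y - alpha + sqrt((y - alpha)^2 + c)), for c > 0."""
    shifted = y - alpha
    return math.log(shifted + math.sqrt(shifted ** 2 + c))


# Hypothetical values for alpha and c, chosen only to show the behaviour.
alpha, c = 20.0, 400.0
for intensity in (10.0, 20.0, 100.0, 10000.0):
    print(intensity, glog(intensity, alpha, c))
```

Unlike the ordinary logarithm, the function remains defined when the background-corrected intensity y − α is zero or negative, and for intensities far above background it behaves like ln(2(y − α)).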

Page 29: Robust and exploratory statistics: Applications using

Rank versus fourth-spread plots

Page 30: Robust and exploratory statistics: Applications using

Conclusions re. utility of spread-versus-level plot

• One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression.

• Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities.

• The log2 transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip® experiments (Hubbell et al., 2002; Irizarry et al., 2003), unfortunately does not achieve the desired constant variance.

Page 31: Robust and exploratory statistics: Applications using

Conclusions re. utility of spread-versus-level plot

• To overcome this drawback of the log2 transformation, methods such as the Local Pooled Error test have been suggested (Jain et al., 2003) to more accurately estimate the variance for lowly expressed log2 signal intensities.

• As an alternative, the spread-versus-level plot may be used to identify more appropriate monotonic transformations that stabilize the variance.

Page 32: Robust and exploratory statistics: Applications using

Median polish:

An alternative to two-way ANOVA

Page 33: Robust and exploratory statistics: Applications using

Analysis of a two-way table

       1       2       …    n
1      x_11    x_12    …    x_1n
2      x_21    x_22    …    x_2n
⋮
g      x_g1    x_g2    …    x_gn

Page 34: Robust and exploratory statistics: Applications using

Analysis of a two-way table

• Additive model: the relationship between the response variable and two factors is easiest to summarize and interpret if their joint contribution to the response is the sum of a separate contribution from each factor.

• Fitting an additive model to a two-way table – want to estimate

– (1) an overall main effect,

– (2) a set of row effects,

– (3) a set of column effects,

– (4) a table of residuals.

Page 35: Robust and exploratory statistics: Applications using

Analysis of a two-way table

       1       2       …    n
1      x_11    x_12    …    x_1n
2      x_21    x_22    …    x_2n
⋮
g      x_g1    x_g2    …    x_gn

Model:

$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$

where µ is the overall mean, $\alpha_i$ is the row effect, $\beta_j$ is the column effect, and $\varepsilon_{ij}$ is the error

Page 36: Robust and exploratory statistics: Applications using

Classical method: ANOVA

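$$\hat{\mu} = \bar{x}_{..} = \frac{1}{IJ} \sum_i \sum_j x_{ij}$$

$$\hat{\alpha}_i = \bar{x}_{i.} - \bar{x}_{..}$$

$$\hat{\beta}_j = \bar{x}_{.j} - \bar{x}_{..}$$

$$\hat{\varepsilon}_{ij} = x_{ij} - \left(\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j\right)$$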

Page 37: Robust and exploratory statistics: Applications using

Median Polish

• Model:

$$y_{ij} = m + a_i + b_j + e_{ij}$$

• The fit and the residuals at the end of n iterations are

$$y_{ij} = m^{(n)} + a_i^{(n)} + b_j^{(n)} + e_{ij}^{(n)}$$

Page 38: Robust and exploratory statistics: Applications using

Median Polish

• Row, column, and overall effects are initialized at 0.

• That is,

$$m^{(0)} = 0, \qquad a_i^{(0)} = 0, \qquad b_j^{(0)} = 0$$

Page 39: Robust and exploratory statistics: Applications using

Median Polish

• Thereafter, the row medians are determined and subtracted from all observations within the row.

• From the new data table (the original data with the row medians subtracted out) find the column medians.

• Polish by subtracting the column median from all observations within the column.

Page 40: Robust and exploratory statistics: Applications using

Median Polish

• After a row and column polish, effects are estimated by

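$$\Delta m_a^{(n)} = \mathrm{median}\left(a_i^{(n-1)} + \Delta a_i^{(n)}\right)$$

$$\Delta m_b^{(n)} = \mathrm{median}\left(b_j^{(n-1)}\right)$$

$$m^{(n)} = m^{(n-1)} + \Delta m_a^{(n)} + \Delta m_b^{(n)}$$

$$a_i^{(n)} = a_i^{(n-1)} + \Delta a_i^{(n)} - \Delta m_a^{(n)}$$

$$b_j^{(n)} = b_j^{(n-1)} + \Delta b_j^{(n)} - \Delta m_b^{(n)}$$

where $\Delta a_i^{(n)}$ and $\Delta b_j^{(n)}$ are the row and column medians subtracted in the n-th polish.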

Page 41: Robust and exploratory statistics: Applications using

Median Polish

• Repeat the row/column polish procedure 2 times.

• The quantities $a_i^{(2)}$, $b_j^{(2)}$, and $m^{(2)}$ are the median polish estimates of the row effects, column effects, and main effect, respectively.
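The procedure can be sketched in plain Python. This is an illustrative implementation that follows the update equations on the preceding slides, with rows swept first and two iterations by default; the function name and the small demonstration table are hypothetical.

```python
from statistics import median


def median_polish(table, n_iter=2):
    """Fit y_ij = m + a_i + b_j + e_ij by repeated row and column median sweeps."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    m = 0.0                 # overall effect, initialized at 0
    a = [0.0] * n_rows      # row effects, initialized at 0
    b = [0.0] * n_cols      # column effects, initialized at 0

    for _ in range(n_iter):
        # Row polish: subtract each row's median from that row.
        delta_a = [median(row) for row in resid]
        resid = [[v - delta_a[i] for v in row] for i, row in enumerate(resid)]
        delta_m_b = median(b)  # median of the column effects from the previous iteration

        # Column polish: subtract each column's median from that column.
        delta_b = [median(row[j] for row in resid) for j in range(n_cols)]
        resid = [[v - delta_b[j] for j, v in enumerate(row)] for row in resid]
        delta_m_a = median(a[i] + delta_a[i] for i in range(n_rows))

        # Update the effects; fit plus residual still reproduces the table exactly.
        a = [a[i] + delta_a[i] - delta_m_a for i in range(n_rows)]
        b = [b[j] + delta_b[j] - delta_m_b for j in range(n_cols)]
        m += delta_m_a + delta_m_b

    return m, a, b, resid


# Hypothetical, perfectly additive table used only for demonstration.
overall, row_eff, col_eff, residuals = median_polish([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(overall, row_eff, col_eff, residuals)
```

Because every sweep moves a median out of the residuals and into an effect, the identity table = overall + row effect + column effect + residual holds after each iteration.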

Page 42: Robust and exploratory statistics: Applications using
Page 43: Robust and exploratory statistics: Applications using

Expression Quantification

• Rather than an entire gene being placed on the array, an Affymetrix GeneChip is an oligonucleotide array consisting of several perfect match (PM) probes and their corresponding mismatch (MM) probes that interrogate a single gene.

– The PM probe is the exact complementary sequence of the target genetic sequence, composed of 25 base pairs.

– The MM probe has the same sequence, except that the base at the middle (13th) position has been replaced by its complement.

– There are roughly 11-20 PM/MM probe pairs that interrogate each gene, called a probe set.

Page 44: Robust and exploratory statistics: Applications using

Expression Quantification

PM and MM intensities are combined to form an expression measure for the probe set (gene)

11 – 20 Probe Pairs interrogate each gene

PM: GCGCCGGCTGCAGGAGCAGGAGGAG

MM: GCGCCGGCTGCACGAGCAGGAGGAG
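The relationship between the two probes can be illustrated with a short sketch. It assumes that the mismatch probe is obtained by replacing the 13th base with its complement (A↔T, C↔G), which matches the PM/MM pair shown on this slide; the function name is hypothetical.

```python
# Base-pairing complements.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def make_mismatch(pm_sequence, position=13):
    """Return the MM probe: the PM probe with the base at `position` (1-based) complemented."""
    index = position - 1
    swapped = COMPLEMENT[pm_sequence[index]]
    return pm_sequence[:index] + swapped + pm_sequence[index + 1:]


pm_probe = "GCGCCGGCTGCAGGAGCAGGAGGAG"   # PM sequence from the slide
mm_probe = make_mismatch(pm_probe)
print(pm_probe)
print(mm_probe)
```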

Page 45: Robust and exploratory statistics: Applications using

Expression Quantification

• Initially, Affymetrix signal was calculated as

$$\mathrm{AvDiff} = \frac{1}{|A|} \sum_{j \in A} \left(PM_j - MM_j\right)$$

where j indexes the probe pairs for each probe set A. This is known as the “Average Difference” method.

• Problems:

– Large variability in PM-MM

– MM probes may be measuring signal for another gene/EST

– PM-MM calculations are sometimes negative
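A minimal sketch of the Average Difference calculation for one probe set follows. The intensities are made-up numbers chosen to include a probe pair where MM exceeds PM; they are not measurements from the study.

```python
def average_difference(pm, mm):
    """Mean of PM_j - MM_j over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for a probe set with four probe pairs.
pm_intensities = [1200.0, 950.0, 400.0, 1500.0]
mm_intensities = [300.0, 1000.0, 150.0, 500.0]

differences = [p - m for p, m in zip(pm_intensities, mm_intensities)]
av_diff = average_difference(pm_intensities, mm_intensities)
print("PM - MM per probe pair:", differences)
print("AvDiff:", av_diff)
```

The second probe pair has a negative difference, and the single large differences dominate the average, which illustrates two of the problems listed above.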

Page 46: Robust and exploratory statistics: Applications using

205586_x_at

Page 47: Robust and exploratory statistics: Applications using

Median Polish

• Example (handout)

Page 48: Robust and exploratory statistics: Applications using

Comparing ANOVA and Median Polish

       1    2    3
1      9    0    0
2      0    0    0
3      0    0    0

Page 49: Robust and exploratory statistics: Applications using

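Classical method: ANOVA

$$\hat{\mu} = \bar{x}_{..} = \frac{1}{3 \times 3}\left(9 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0\right) = 1$$

$$\hat{\alpha}_1 = \frac{1}{3}(9 + 0 + 0) - 1 = 2$$

$$\hat{\alpha}_2 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\alpha}_3 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\beta}_1 = \frac{1}{3}(9 + 0 + 0) - 1 = 2$$

$$\hat{\beta}_2 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\beta}_3 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\varepsilon}_{ij} = x_{ij} - \left(\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j\right)$$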
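These estimates can be reproduced with a few lines of plain Python. The sketch below applies the classical estimators (grand mean, row mean minus grand mean, column mean minus grand mean) to the 3 × 3 table with a single value of 9; the function name is illustrative.

```python
def two_way_anova_fit(table):
    """Classical additive fit: grand mean, row effects, column effects, residuals."""
    n_rows, n_cols = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_effects = [sum(row) / n_cols - grand_mean for row in table]
    col_effects = [
        sum(table[i][j] for i in range(n_rows)) / n_rows - grand_mean
        for j in range(n_cols)
    ]
    # Residual = observed value minus fitted value.
    residuals = [
        [table[i][j] - (grand_mean + row_effects[i] + col_effects[j]) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return grand_mean, row_effects, col_effects, residuals


example = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]
mu_hat, alpha_hat, beta_hat, eps_hat = two_way_anova_fit(example)
print("mu:", mu_hat)
print("alpha:", alpha_hat)
print("beta:", beta_hat)
for row in eps_hat:
    print(row)
```

Every estimate and every residual is affected by the single unusual cell, which is the behaviour the comparison with median polish is meant to highlight.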

Page 50: Robust and exploratory statistics: Applications using

ANOVA: Parameter Estimates and Residuals

                    1     2     3     $\hat{\alpha}_i$
1                   4    -2    -2     2
2                  -2     1     1    -1
3                  -2     1     1    -1
$\hat{\beta}_j$     2    -1    -1     $\hat{\mu} = 1$

Page 51: Robust and exploratory statistics: Applications using

Median Polish: Parameter Estimates and Residuals

                1    2    3    $\hat{a}_i$
1               9    0    0    0
2               0    0    0    0
3               0    0    0    0
$\hat{b}_j$     0    0    0    $\hat{m} = 0$

Page 52: Robust and exploratory statistics: Applications using

References

• Box, G.E.P., Cox, D.R. (1964) An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26: 211-252.

• Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. (2002) A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics, 18: S105-S110.

• Emerson, J.D. (1987) Mathematical aspects of transformation. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 247-282.

• Emerson, J.D. and Hoaglin, D.C. (1987) Analysis of two-way tables by medians. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 129-165.

• Hoaglin, D.C. (1987) Letter values: A set of selected order statistics. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 33-57.

• Hubbell, E., Liu, W.-M., Mei, R. (2002) Robust estimators for expression analysis, Bioinformatics, 18: 1585-1592.

• Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P. (2003) Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Research, 31: e15.

• Jain, N., Thatte, J., Braciale, T., Ley, K., O'Connell, M., Lee, J.K. (2003) Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays, Bioinformatics, 19: 1945-1951.

• Kohane, I.S., Kho, A.T., Butte, A.J. (2003) Microarrays for an integrative genomics. MIT Press, Cambridge, MA.

• Quackenbush, J. (2002) Microarray data normalization and transformation, Nature Genetics, 32: 496-501.