robust and exploratory statistics: applications using

Center for Research Methods and Analysis Forum, April 30, 2004

Robust and exploratory statistics: Applications using microarray data

Kellie J. Archer, Ph.D.

Department of Biostatistics

Center for the Study of Biological Complexity

Virginia Commonwealth University

[email protected]

Outline

• Background regarding robust methods of estimation

• Microarray data

• Robust statistical methods

– Spread-versus-level plot for identifying a variance stabilizing transformation

– Median polish as an alternative to two-way ANOVA

Classical statistical methods

• Classical theory emphasizes large sample properties.

• Classical statistical techniques are designed to be the best possible when stringent assumptions apply.

• Classical techniques can behave badly when the practical situation departs from the ideal.

Robust statistical methods

• Data sets are usually small and/or the data do not adhere to underlying assumptions.

• Robust and resistant methods are those which are “best” compromises for a broad range of situations.


• Resistant – estimator is insensitive to outlying values

– estimate changes only slightly when a small part of the data is replaced by new numbers

• Robust – estimator is insensitive to underlying assumptions (e.g., shape of distribution)


• Mean is a classical measure of central tendency.

– Unfortunately, it is not robust.

– It is affected by a single wild value.

• The median is also a measure of central tendency.

– Unlike the mean, the median is not affected by a single wild value and therefore is robust.

• The breakdown bound is the fraction of the observations in a sample that can be turned into bad values (e.g., essentially ∞) without causing the estimate to change greatly.


• Example:

– 19 observations were generated from a N(0,1) distribution

– 1 observation was generated from a N(0,4) distribution

0.847 -0.740 0.095 -0.864 -0.976 -1.178 -0.714 0.280 0.790 0.153 0.990 -1.009 0.175 0.036 0.553 -1.683 0.848 -1.173 -0.111 7.660

– Mean = 0.20

– Median = 0.07
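The effect of the single wild value can be checked directly. The sketch below recomputes the mean and median of the 20 listed observations, then repeats the calculation with the value 7.660 dropped; the variable names are illustrative only.

```python
from statistics import mean, median

# The 20 observations listed above; the last one is the wild value.
obs = [0.847, -0.740, 0.095, -0.864, -0.976, -1.178, -0.714, 0.280,
       0.790, 0.153, 0.990, -1.009, 0.175, 0.036, 0.553, -1.683,
       0.848, -1.173, -0.111, 7.660]

clean = obs[:-1]  # the same sample without the wild value 7.660

mean_all, mean_clean = mean(obs), mean(clean)
median_all, median_clean = median(obs), median(clean)

print("mean   with / without wild value:", mean_all, mean_clean)
print("median with / without wild value:", median_all, median_clean)
```

Dropping the wild value moves the mean from about 0.20 to about -0.19, while the median only moves from 0.0655 to 0.036.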

Microarrays

A snapshot that captures the activity pattern of thousands of genes at once.

Affymetrix GeneChip

Custom spotted arrays

Needs

• Although microarray experiments may include thousands of outcomes, due to expense we are usually dealing with small sample sizes.

• Due to obtaining measurements for thousands of outcomes, we cannot realistically check assumptions and underlying distributions for all genes.

• Estimating intensity of a probe and probe set requires methods insensitive to outlying values

• Need alternative estimators of location rather than using the mean.

Transforming data to meet model assumptions

• Induce symmetry

• Achieve constant variance

• Achieve linearity

• Achieve normality

Log Transformation

• For custom spotted arrays, the quantity used for analysis is most often the

$$\log_2\left(\frac{\text{Sample signal}}{\text{Reference signal}}\right)$$

• This ratio may or may not include background subtraction

$$\log_2\left(\frac{\text{Sample signal} - \text{Sample background}}{\text{Reference signal} - \text{Reference background}}\right)$$

Log Transformation

Gene   Reference Signal   Sample Signal   Ratio x_i   Log2(Ratio)
A      75                 150             2.0          1.0
B      100                50              0.5         -1.0

Computing the average fold change relative to the reference

$$2^{\frac{1}{n}\sum_{i=1}^{n}\log_2(x_i)} = 2^{\frac{1.0 + (-1.0)}{2}} = 2^{0} = 1$$

$$\text{Geometric mean} = \sqrt{2.0 \times 0.5} = 1.0$$

$$\text{Arithmetic average} = \frac{2.0 + 0.5}{2} = 1.25$$
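The arithmetic for the two ratios 2.0 and 0.5 can be reproduced in a few lines. This is only a sketch of the calculation on this slide: averaging on the log2 scale and back-transforming gives the geometric mean, which treats a doubling and a halving symmetrically.

```python
import math

ratios = [2.0, 0.5]  # the two ratios x_i from the example (genes A and B)

# Average on the log2 scale, then transform back to the ratio scale.
log2_ratios = [math.log2(x) for x in ratios]
mean_log2 = sum(log2_ratios) / len(log2_ratios)
geometric_mean = 2.0 ** mean_log2

# Averaging the raw ratios instead overstates the typical fold change.
arithmetic_mean = sum(ratios) / len(ratios)

print("geometric mean:", geometric_mean)
print("arithmetic mean:", arithmetic_mean)
```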

Data

• The Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, was hybridized to 16 HG-U133A GeneChips® (Dumur et al., 2003) to study GeneChip® reproducibility.

• Five identical aliquots of 40 µg of the Reference RNA were used as template for cDNA synthesis, cRNA in vitro transcription, and labelling reactions. The final fragmented (frg) cRNA aliquots were pooled together.

• Hybridization was performed using the same lot number of the HG-U133A arrays, which contain 22,283 probe sets.

Mean versus variance plot of 16 Replicate GeneChips

[Figure: Mean versus Variance Plot]

Log Transformation

• Applied to custom spotted arrays to facilitate interpretation and induce symmetry.

• Its usefulness in application to custom spotted arrays has led to widespread use of the log transformation for Affymetrix GeneChip data.

Mean versus variance plot after applying the log2 transformation

[Figure: Mean versus Variance Plot after applying log2 transformation]

Box-Cox Transformations

A commonly used family of variance stabilizing transformations is the Box-Cox transformation, defined as

$$\Phi_p(x) = \begin{cases} \dfrac{x^{p} - 1}{p} & \text{for } p \neq 0,\\[1ex] \ln x & \text{for } p = 0, \end{cases}$$

where x is the observed data, p is a number to be determined, and Φ is the transformed data. Box and Cox (1964) selected p as the value that maximized the likelihood.
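As a minimal sketch, the Box-Cox family can be written as a single function of a positive value x and an exponent p. The function name `box_cox` is an illustrative choice, and the estimation of p by maximum likelihood is not shown.

```python
import math

def box_cox(x, p):
    """Box-Cox transform of a positive value x with exponent p."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if p == 0:
        return math.log(x)       # limiting case p = 0
    return (x ** p - 1.0) / p    # general case p != 0

# p = 0.5 gives 2*sqrt(x) - 2; p = 1 only shifts the data by one.
print(box_cox(4.0, 0.5))
print(box_cox(3.0, 1))
```

For p close to 0 the general branch approaches ln x, which is why the logarithm is used as the p = 0 member of the family.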

Spread-versus-Level Plot

• Emerson (1987) demonstrated use of the spread-versus-level plot as a means to empirically estimate p.

– Plot log10 of median (x-axis) by log10 of fourth-spread (y-axis).

– The estimated slope b of the spread-versus-level plot identifies the approximate value of the exponent for the Box-Cox transformation, where p = 1 – b.

Sample median

• If n is odd, then the sample median is the middle observation $Y_{(k)}$, where k = (n + 1)/2;

• if n is even, the sample median is considered to be any value between the two middle observations $Y_{(k)}$ and $Y_{(k+1)}$, where k = n/2.

Spread-versus-Level Plot

• Median: after ranking the observations, the median is at depth (n+1)/2.

• Upper ($F_U$) and Lower Fourths ($F_L$): the fourths are those values with a depth of ([depth of median] + 1)/2, where [·] denotes the integer part.

• Fourth-spread: robust measure of spread, defined by dF = FU – FL.
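The depth rules can be sketched in code. The example assumes that a half-integer depth means the average of the two neighbouring order statistics, and that the bracket in the depth of the fourths is the integer part; the function names are illustrative.

```python
import math

def value_at_depth(sorted_x, depth):
    """Value at a depth counted from the low end (1-based).

    A half-integer depth averages the two neighbouring order statistics.
    """
    lo = math.floor(depth)
    hi = math.ceil(depth)
    return (sorted_x[lo - 1] + sorted_x[hi - 1]) / 2.0

def fourths(x):
    """Return (lower fourth, upper fourth, fourth-spread) of a sample."""
    xs = sorted(x)
    n = len(xs)
    depth_median = (n + 1) / 2.0
    depth_fourth = (math.floor(depth_median) + 1) / 2.0
    lower = value_at_depth(xs, depth_fourth)
    # The same depth counted from the high end is position n + 1 - depth.
    upper = value_at_depth(xs, n + 1 - depth_fourth)
    return lower, upper, upper - lower

print(fourths([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # depth of fourths is 3
print(fourths([1, 2, 3, 4, 5, 6, 7, 8]))     # depth of fourths is 2.5
```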

Spread-versus-Level Plot

• X is a real random variable with population median $M_X$, lower fourth $F_L$, upper fourth $F_U$, and fourth spread $d_F$.

• Problem is to identify a transformation $y = \Phi(x)$ for which the variance is constant.

• A Taylor series expansion of Φ(x) about $M_X$ is

$$\Phi(x) = \Phi(M_X) + \Phi'(M_X)\,(x - M_X) + \Phi''(M_X)\,\frac{(x - M_X)^2}{2!} + \cdots.$$


Let $\lambda(M_X)$ represent the distance from $M_X$ to $F_U$ as a proportion of the distance $d_F$; therefore, the median and upper and lower fourths for the data transformed by Φ are

$$M_Y = \Phi(M_X),$$

$$(F_U)_Y = \Phi(F_U) = \Phi\left(M_X + \lambda(M_X)\,d_F\right),$$

$$(F_L)_Y = \Phi(F_L) = \Phi\left(M_X - \{1 - \lambda(M_X)\}\,d_F\right).$$

Substituting into the Taylor Series Expansion,

$$(d_F)_Y = d_F \cdot \Phi'(M_X) + \frac{\{2\lambda(M_X) - 1\}\,d_F^{2}}{2!}\,\Phi''(M_X) + \cdots$$

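Spread-versus-Level Plot

• Since the quadratic term of $(d_F)_Y$ is negligible compared to the leading term,

$$(d_F)_Y \approx d_F \cdot \Phi'(M_X).$$

• Now supposing that the relationship between spread $d_F$ and level x is a power transformation of the form

$$d_F(x) = k \cdot x^{b}$$

for some constants k and b, the transformed spread is constant when $\Phi'(x)$ is proportional to $1/d_F(x)$, which gives

$$\Phi(x) = \int \frac{1}{k\,x^{b}}\,dx = \begin{cases} c_1 x^{1-b} + c_2 & \text{for } b \neq 1,\\ c_3 \ln x + c_4 & \text{for } b = 1. \end{cases}$$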


• Recall $d_F(x) = k \cdot M_X^{b}$, so that

$$\log(\hat{d}_F) = \log(k) + b \cdot \log(M_X).$$

• Therefore, the slope of the spread-versus-level plot estimates b. Hence the suggested power transformation would be

$$x^{p} = x^{1-b}.$$
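A rough sketch of the slope estimate follows. The data are simulated stand-ins (not the GeneChip data), generated so that the spread grows like the square root of the level. The quartiles from `statistics.quantiles` are used as an approximation to the fourths, and the function name is hypothetical.

```python
import math
import random
from statistics import median, quantiles

def spread_level_fit(rows):
    """Least-squares fit of log10(spread) on log10(median), one point per row.

    Returns (slope b, intercept). Quartiles stand in for the fourths here.
    """
    xs, ys = [], []
    for row in rows:
        q1, _, q3 = quantiles(row, n=4)
        level, spread = median(row), q3 - q1
        if level > 0 and spread > 0:          # logs need positive values
            xs.append(math.log10(level))
            ys.append(math.log10(spread))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Simulated stand-in: 200 "probe sets", 16 replicates, sd = sqrt(level).
rng = random.Random(0)
levels = [10 ** (1 + 3 * i / 199) for i in range(200)]
rows = [[rng.gauss(mu, math.sqrt(mu)) for _ in range(16)] for mu in levels]

b, intercept = spread_level_fit(rows)
p = 1 - b  # suggested exponent of the power transformation
print("slope b:", b, "suggested p:", p)
```

With spread proportional to the square root of the level, the slope should come out near 0.5, pointing to a square-root transformation.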

Graphical Plot for identifying power transformation

[Figure: Spread-versus-level Plot, with $\hat{\beta}_0 = 0.0519$; $\hat{\beta}_1 = 0.5692$]

Results

• The power transformation turned out to be $\hat{p}$ = 1 – 0.57 = 0.43.

• Typically, for simplicity the parameter estimate is rounded to the nearest half-integer, which in this case would lead to $\hat{p}$ = 0.5.

• Here, $\Phi_{0.5}(x) = (x^{0.5} - 1)/0.5 = 2\sqrt{x} - 2$.

Graphical Plot for identifying power transformation

[Figure: Mean versus Variance Plot after $2\sqrt{x} - 2$ transformation]

Other Variance Stabilizing Transformations

• Assuming the gene expression data may be modelled as

$$y = \alpha + \mu e^{\eta} + \varepsilon,$$

where y represents the measured intensity, α represents the average background, µ represents the true gene expression level, and η and ε are normally distributed error terms with zero mean and differing non-zero variances, the generalized logarithm transformation

$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right),$$

where $c = \sigma_\varepsilon^2 / S_\eta$, stabilizes the variance (Durbin et al., 2002) for large samples.
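The generalized logarithm itself is a one-line function once α and c are available. In the sketch below both are simply passed in as arguments, with made-up values in the usage lines; how they are estimated from data is not shown.

```python
import math

def glog(y, alpha, c):
    """Generalized logarithm g(y) = ln(y - alpha + sqrt((y - alpha)^2 + c))."""
    if c <= 0:
        raise ValueError("c must be positive")
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))

# Hypothetical parameter values, for illustration only.
alpha, c = 20.0, 400.0

print(glog(20.0, alpha, c))     # at background level: ln(sqrt(c))
print(glog(5.0, alpha, c))      # defined even when y falls below alpha
print(glog(50000.0, alpha, c))  # for large y, close to ln(2 * (y - alpha))
```

Unlike the ordinary logarithm, the function stays defined for intensities at or below the background, and it behaves like a shifted logarithm for large intensities.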

Rank versus fourth-spread plots

Conclusions re. utility of spread-versus-level plot

• One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression.

• Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities.

• The log2 transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip® experiments (Hubbell et al., 2002; Irizarry et al., 2003), unfortunately does not achieve the desired results.

Conclusions re. utility of spread-versus-level plot

• To overcome this drawback of the log2 transformation, methods such as the Local Pooled Error test have been suggested (Jain et al., 2003) to more accurately estimate the variance for lowly expressed log2 signal intensities.

• As an alternative, the spread-versus-level plot may be used to identify more appropriate monotonic transformations that stabilize the variance.

Median polish:

An alternative to two-way ANOVA

Analysis of a two-way table

       1       2       …    n
1      x_11    x_12    …    x_1n
2      x_21    x_22    …    x_2n
⋮
g      x_g1    x_g2    …    x_gn

Analysis of a two-way table

• Additive model: the relationship between the response variable and two factors is easiest to summarize and interpret if their joint contribution to the response is the sum of a separate contribution from each factor.

• Fitting an additive model to a two-way table – want to estimate

– (1) an overall main effect,

– (2) a set of row effects,

– (3) a set of column effects,

– (4) a table of residuals.

Analysis of a two-way table


Model: $x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$,

where µ is the overall mean, $\alpha_i$ is the row effect, $\beta_j$ is the column effect, and $\varepsilon_{ij}$ is the error.

Classical method: ANOVA

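$$\hat{\mu} = \bar{x}_{..} = \frac{1}{IJ}\sum_{i}\sum_{j} x_{ij}$$

$$\hat{\alpha}_i = \bar{x}_{i.} - \bar{x}_{..}$$

$$\hat{\beta}_j = \bar{x}_{.j} - \bar{x}_{..}$$

$$\hat{\varepsilon}_{ij} = x_{ij} - \left(\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j\right)$$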

Median Polish

• Model: $y_{ij} = m + a_i + b_j + e_{ij}$

• The fit and the residuals at the end of n iterations are $y_{ij} = m^{(n)} + a_i^{(n)} + b_j^{(n)} + e_{ij}^{(n)}$

Median Polish

• Row, column, and overall effects are initialized at 0.

• That is,

$$m^{(0)} = 0, \qquad a_i^{(0)} = 0, \qquad b_j^{(0)} = 0.$$

Median Polish

• Thereafter, the row medians are determined and subtracted from all observations within the row.

• From the new data table (the original data with the row medians subtracted out) find the column medians.

• Polish by subtracting the column median from all observations within the column.

Median Polish

• After a row and column polish, effects are estimated by

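$$\Delta m_a^{(n)} = \operatorname{median}\left(a_i^{(n-1)} + \Delta a_i^{(n)}\right)$$

$$\Delta m_b^{(n)} = \operatorname{median}\left(b_j^{(n-1)}\right)$$

$$m^{(n)} = m^{(n-1)} + \Delta m_a^{(n)} + \Delta m_b^{(n)}$$

$$a_i^{(n)} = a_i^{(n-1)} + \Delta a_i^{(n)} - \Delta m_a^{(n)}$$

$$b_j^{(n)} = b_j^{(n-1)} + \Delta b_j^{(n)} - \Delta m_b^{(n)}$$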

Median Polish

• Repeat the row/column polish procedure 2 times.

• The quantities $a_i^{(2)}$, $b_j^{(2)}$, and $m^{(2)}$ are the median polish estimates of the row effects, column effects, and main effect, respectively.
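The polish steps can be sketched as a short function. This is one reading of the procedure, under the assumption that the row medians are swept out first, then the column medians, with the median of the current row or column effects moved into the overall effect at each step. The function name and the two small tables are illustrative.

```python
from statistics import median

def median_polish(table, n_iter=2):
    """Return (overall, row effects, column effects, residuals)."""
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    m = 0.0
    a = [0.0] * n_rows   # row effects, initialized at 0
    b = [0.0] * n_cols   # column effects, initialized at 0
    for _ in range(n_iter):
        # Row polish: subtract each row median from its row.
        for i in range(n_rows):
            d = median(resid[i])
            resid[i] = [v - d for v in resid[i]]
            a[i] += d
        # Move the median of the column effects into the overall effect.
        d = median(b)
        b = [v - d for v in b]
        m += d
        # Column polish: subtract each column median from its column.
        for j in range(n_cols):
            d = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= d
            b[j] += d
        # Move the median of the row effects into the overall effect.
        d = median(a)
        a = [v - d for v in a]
        m += d
    return m, a, b, resid

# A perfectly additive table: the fit absorbs everything.
print(median_polish([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
# A table of zeros with one wild cell: the wild value stays in the residuals.
print(median_polish([[9, 0, 0], [0, 0, 0], [0, 0, 0]]))
```

In the second call all effects remain at zero and the 9 appears only as a residual, which is the resistance property that motivates the method.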

Expression Quantification

• Rather than an entire gene being placed on the array, an Affymetrix GeneChip is an oligonucleotide array consisting of several perfect match (PM) probes and their corresponding mismatch (MM) probes that interrogate a single gene.

– PM is the exact complementary sequence of the target genetic sequence, composed of 25 base pairs

– MM probe has the same sequence, with the exception that the base at the middle (13th) position has been changed to its complement

– There are roughly 11-20 PM/MM probe pairs that interrogate each gene, called a probe set

Expression Quantification

PM and MM intensities are combined to form an expression measure for the probe set (gene)

11 – 20 Probe Pairs interrogate each gene

PM  GCGCCGGCTGCAGGAGCAGGAGGAG

MM  GCGCCGGCTGCACGAGCAGGAGGAG
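The relationship between the two sequences shown here can be sketched as follows, assuming the mismatch probe is obtained by replacing the 13th base of the 25-base perfect match sequence with its complement (G with C, A with T). The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Return the MM sequence for a 25-base PM sequence."""
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe sequence")
    middle = 12  # zero-based index of the 13th base
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "GCGCCGGCTGCAGGAGCAGGAGGAG"  # PM sequence from the slide
mm = mismatch_probe(pm)
print(pm)
print(mm)
```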

Expression Quantification

• Initially, Affymetrix signal was calculated as

$$\text{AvDiff} = \frac{1}{|A|}\sum_{j \in A}\left(PM_j - MM_j\right),$$

where j indexes the probe pairs for each probe set A and |A| is the number of probe pairs in A. This is known as the “Average Difference” method.

• Problems:

– Large variability in PM-MM

– MM probes may be measuring signal for another gene/EST

– PM-MM calculations are sometimes negative
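A minimal sketch of the Average Difference calculation is given below. The intensities are invented for illustration and include one pair where MM exceeds PM, which shows how individual differences, and in bad cases the average itself, can become negative.

```python
def average_difference(pm, mm):
    """Mean of PM_j - MM_j over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for a probe set with four probe pairs.
pm = [120.0, 300.0, 80.0, 150.0]
mm = [100.0, 90.0, 130.0, 60.0]

avdiff = average_difference(pm, mm)
print("AvDiff:", avdiff)

# When MM dominates, the expression measure is negative.
print("AvDiff:", average_difference([10.0, 25.0], [30.0, 45.0]))
```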

205586_x_at

Median Polish

• Example (handout)

Comparing ANOVA and Median Polish

       1    2    3
1      9    0    0
2      0    0    0
3      0    0    0

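Classical method: ANOVA

$$\hat{\mu} = \bar{x}_{..} = \frac{1}{3 \times 3}\,(9 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0) = 1$$

$$\hat{\alpha}_1 = \frac{1}{3}(9 + 0 + 0) - 1 = 2 \qquad \hat{\beta}_1 = \frac{1}{3}(9 + 0 + 0) - 1 = 2$$

$$\hat{\alpha}_2 = \frac{1}{3}(0 + 0 + 0) - 1 = -1 \qquad \hat{\beta}_2 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\alpha}_3 = \frac{1}{3}(0 + 0 + 0) - 1 = -1 \qquad \hat{\beta}_3 = \frac{1}{3}(0 + 0 + 0) - 1 = -1$$

$$\hat{\varepsilon}_{ij} = x_{ij} - \left(\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j\right)$$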

ANOVA: Parameter Estimates and Residuals

         1     2     3     α̂_i
1        4    -2    -2      2
2       -2     1     1     -1
3       -2     1     1     -1
β̂_j     2    -1    -1     µ̂ = 1
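The ANOVA estimates and residuals for the 3×3 example (a single 9 in the first cell, zeros elsewhere) can be recomputed from the formulas for the overall mean, row effects, column effects and residuals. This is a sketch with an illustrative function name.

```python
def anova_decompose(table):
    """Mean-based additive fit: (mu, row effects, column effects, residuals)."""
    n_rows, n_cols = len(table), len(table[0])
    mu = sum(sum(row) for row in table) / (n_rows * n_cols)
    alpha = [sum(row) / n_cols - mu for row in table]
    beta = [sum(table[i][j] for i in range(n_rows)) / n_rows - mu
            for j in range(n_cols)]
    resid = [[table[i][j] - (mu + alpha[i] + beta[j]) for j in range(n_cols)]
             for i in range(n_rows)]
    return mu, alpha, beta, resid

table = [[9, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]

mu, alpha, beta, resid = anova_decompose(table)
print("mu:", mu)
print("row effects:", alpha)
print("column effects:", beta)
for row in resid:
    print(row)
```

The single wild value leaks into every estimate: all effects and all nine residuals are non-zero, even though eight of the nine cells are identical.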

Median Polish: Parameter Estimates and Residuals

         1    2    3    â_i
1        9    0    0    0
2        0    0    0    0
3        0    0    0    0
b̂_j     0    0    0    m̂ = 0

References

• Box, G.E.P., Cox, D.R. (1964) An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26: 211-252.

• Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. (2002) A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics, 18: S105-S110.

• Emerson, J.D. (1987) Mathematical aspects of transformation. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 247-282.

• Emerson, J.D. and Hoaglin, D.C. (1987) Analysis of two-way tables by medians. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 129-165.

• Hoaglin, D.C. (1987) Letter values: A set of selected order statistics. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 33-57.

• Hubbell, E., Liu, W.-M., Mei, R. (2002) Robust estimators for expression analysis, Bioinformatics, 18: 1585-1592.

• Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P. (2003) Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Research, 31: e15.

• Jain, N., Thatte, J., Braciale, T., Ley, K., O'Connell, M., Lee, J.K. (2003) Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays, Bioinformatics, 19: 1945-1951.

• Kohane, I.S., Kho, A.T., Butte, A.J. (2003) Microarrays for an integrative genomics. MIT Press, Cambridge, MA.

• Quackenbush, J. (2002) Microarray data normalization and transformation, Nature Genetics, 32: 496-501.
