Center for Research Methods and Analysis Forum
April 30, 2004
Robust and exploratory statistics: Applications using microarray data
Kellie J. Archer, Ph.D.
Department of Biostatistics
Center for the Study of Biological Complexity
Virginia Commonwealth University
Outline
• Background regarding robust methods of estimation
• Microarray data
• Robust statistical methods
– Spread-versus-level plot for identifying a variance stabilizing transformation
– Median polish as an alternative to two-way ANOVA
Classical statistical methods
• Classical theory emphasizes large sample properties.
• Classical statistical techniques are designed to be the best possible when stringent assumptions apply.
• Classical techniques can behave badly when the practical situation departs from the ideal.
Robust statistical methods
• Data sets are usually small and/or the data do not adhere to underlying assumptions.
• Robust and resistant methods are those which are “best” compromises for a broad range of situations.
Robust statistical methods
• Resistant – estimator is insensitive to outlying values
– estimate changes only slightly when a small part of the data is replaced by new numbers
• Robust – estimator is insensitive to underlying assumptions (e.g., shape of distribution)
Robust statistical methods
• The mean is a classical measure of central tendency.
– Unfortunately, it is not robust.
– It is affected by a single wild value.
• The median is also a measure of central tendency.
– Unlike the mean, the median is not affected by a single wild value and therefore is robust.
• The breakdown bound is the fraction of the observations in a sample that can be turned into bad values (e.g., essentially ∞) without causing the estimate to change greatly.
Robust statistical methods
• Example:
– 19 observations were generated from a N(0,1) distribution
– 1 observation was generated from a N(0,4) distribution
0.847 -0.740 0.095 -0.864 -0.976 -1.178 -0.714 0.280 0.790 0.153 0.990 -1.009 0.175 0.036 0.553 -1.683 0.848 -1.173 -0.111 7.660
– Mean = 0.20
– Median = 0.07
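The effect of the single wild value can be checked directly. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and dropping the last observation is an added comparison that is not part of the slide.

```python
from statistics import mean, median

# The 20 observations listed in the example; the last one is the wild value.
sample = [0.847, -0.740, 0.095, -0.864, -0.976, -1.178, -0.714, 0.280,
          0.790, 0.153, 0.990, -1.009, 0.175, 0.036, 0.553, -1.683,
          0.848, -1.173, -0.111, 7.660]

# Same data without the wild value, for comparison only.
without_wild_value = sample[:-1]

print(f"mean   (all 20 values):       {mean(sample):.4f}")
print(f"median (all 20 values):       {median(sample):.4f}")
print(f"mean   (wild value removed):  {mean(without_wild_value):.4f}")
print(f"median (wild value removed):  {median(without_wild_value):.4f}")
```

Removing one observation moves the mean by roughly 0.4 while the median moves by about 0.03, which is the resistance property described above.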
Microarrays
A snapshot that captures the activity pattern of thousands of genes at once.
Affymetrix GeneChip
Custom spotted arrays
Needs
• Although microarray experiments may include thousands of outcomes, due to expense we are usually dealing with small sample sizes.
• Due to obtaining measurements for thousands of outcomes, we cannot realistically check assumptions and underlying distributions for all genes.
• Estimating intensity of a probe and probe set requires methods insensitive to outlying values
• Need alternative estimators of location rather than using the mean.
Transforming data to meet model assumptions
• Induce symmetry
• Achieve constant variance
• Achieve linearity
• Achieve normality
Log Transformation
• For custom spotted arrays, the quantity used for analysis is most often the
$$\log_2\left(\frac{\text{Sample signal}}{\text{Reference signal}}\right)$$
• This ratio may or may not include background subtraction
$$\log_2\left(\frac{\text{Sample signal} - \text{Sample background}}{\text{Reference signal} - \text{Reference background}}\right)$$
Log Transformation
Gene   Sample Signal   Reference Signal   Ratio x_i   Log2(Ratio)
A      150             75                 2.0          1.0
B      50              100                0.5         -1.0
Computing the average fold change relative to the reference
$$2^{\frac{1}{n}\sum_i \log_2(x_i)} = 2^{\frac{1.0 + (-1.0)}{2}} = 2^{0} = 1$$
$$\text{Geometric mean} = \sqrt{2.0 \times 0.5} = 1.0$$
$$\text{Arithmetic average} = \frac{2.0 + 0.5}{2} = 1.25$$
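The arithmetic above can be reproduced in a few lines. This is a small sketch using the two ratios from the gene A and gene B example; the variable names are illustrative.

```python
import math

# Fold changes (sample / reference) for genes A and B from the example.
ratios = [2.0, 0.5]

# Averaging on the raw scale treats a doubling and a halving asymmetrically.
arithmetic_average = sum(ratios) / len(ratios)

# Averaging on the log2 scale and back-transforming gives the geometric mean.
mean_log2_ratio = sum(math.log2(r) for r in ratios) / len(ratios)
geometric_mean = 2 ** mean_log2_ratio

print("arithmetic average:", arithmetic_average)
print("mean log2 ratio:   ", mean_log2_ratio)
print("geometric mean:    ", geometric_mean)
```

On the log2 scale a two-fold increase and a two-fold decrease cancel, so the back-transformed average reports no net change.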
Data
• The Universal Human Reference RNA (Stratagene, La Jolla, CA), isolated from 10 human cell lines, was hybridized to 16 HG-U133A GeneChips® (Dumur et al., 2003) to study GeneChip® reproducibility.
• Five identical aliquots of 40 µg of the Reference RNA were used as template for cDNA synthesis, cRNA in vitro transcription, and labelling reactions. The final fragmented (frg) cRNA aliquots were pooled together.
• Hybridization was performed using the same lot number of the HG-U133A arrays, which contains 22,283 probe sets.
[Figure: Mean versus variance plot of 16 replicate GeneChips]
Log Transformation
• Applied to custom spotted arrays to facilitate interpretation and induce symmetry.
• Its usefulness for custom spotted arrays has led to widespread application of the log transformation to Affymetrix GeneChip data.
[Figure: Mean versus variance plot after applying the log2 transformation]
Box-Cox Transformations
A commonly used family of variance stabilizing transformations is the Box-Cox transformation, defined as
$$\Phi_p(x) = \begin{cases} \dfrac{x^p - 1}{p} & \text{for } p \neq 0, \\ \ln x & \text{for } p = 0, \end{cases}$$
where x is the observed data, p is a number to be determined, and Φ is the transformed data. Box and Cox (1964) selected p as the value that maximized the likelihood.
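The piecewise definition translates directly into code. The function below is a minimal sketch for a single positive value; the name `box_cox` and the error handling are illustrative choices, and no likelihood-based selection of p is attempted here.

```python
import math

def box_cox(x, p):
    """Box-Cox transform of one positive value x with power p."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if p == 0:
        # Limiting case of (x**p - 1) / p as p approaches 0.
        return math.log(x)
    return (x ** p - 1) / p

for p in (1, 0.5, 0):
    print(f"p = {p}: Phi_p(4.0) = {box_cox(4.0, p)}")
```

The log case is the limit of the power case, so values of p very close to 0 give results very close to ln x.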
Spread-versus-Level Plot
• Emerson (1987) demonstrated use of the spread-versus-level plot as a means to empirically estimate p.
– Plot log10 of the median (x-axis) against log10 of the fourth-spread (y-axis).
– The estimated slope b of the spread-versus-level plot identifies the approximate value of the exponent for the Box-Cox transformation, where p = 1 – b.
Sample median
• If n is odd, then the sample median is the middle observation $Y_{(k)}$, where $k = (n + 1)/2$;
• if n is even, the sample median is considered to be any value between the two middle observations $Y_{(k)}$ and $Y_{(k+1)}$, where $k = n/2$.
Spread-versus-Level Plot
• Median: after ranking the observations, the median is at depth (n+1)/2.
• Upper (F_U) and lower (F_L) fourths: the fourths are those values with a depth of ([depth of median] + 1)/2, where [·] denotes the integer part.
• Fourth-spread: robust measure of spread, defined by d_F = F_U – F_L.
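The depth rules above can be sketched as follows. This assumes that a half-integer depth means averaging the two neighbouring order statistics, and that [depth of median] is the integer part of the median depth; the function names are illustrative.

```python
import math

def value_at_depth(ordered, depth):
    """Value at a (possibly half-integer) depth counted from the start of `ordered`."""
    below = ordered[math.floor(depth) - 1]
    above = ordered[math.ceil(depth) - 1]
    return (below + above) / 2

def fourths(values):
    """Return (lower fourth, upper fourth, fourth-spread) of a batch."""
    ordered = sorted(values)
    n = len(ordered)
    median_depth = (n + 1) / 2
    fourth_depth = (math.floor(median_depth) + 1) / 2
    lower_fourth = value_at_depth(ordered, fourth_depth)
    # Counting the same depth from the top gives the upper fourth.
    upper_fourth = value_at_depth(ordered[::-1], fourth_depth)
    return lower_fourth, upper_fourth, upper_fourth - lower_fourth

print(fourths(range(1, 10)))   # n = 9: median depth 5, fourth depth 3
print(fourths(range(1, 11)))   # n = 10: median depth 5.5, fourth depth 3
```

Because the fourths depend only on order statistics near the quarter points, a single extreme value at either end of the batch leaves the fourth-spread unchanged.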
Spread-versus-Level Plot
• X is a real random variable with population median $M_X$, lower fourth $F_L$, upper fourth $F_U$, and fourth-spread $d_F$.
• The problem is to identify a transformation $y = \Phi(x)$ for which the variance is constant.
• A Taylor series expansion of $\Phi(x)$ about $M_X$ is
$$\Phi(x) = \Phi(M_X) + \Phi'(M_X)\,(x - M_X) + \Phi''(M_X)\,\frac{(x - M_X)^2}{2!} + \cdots$$
Spread-versus-Level Plot
Let $\lambda(M_X)$ represent the distance from $M_X$ to $F_U$ as a proportion of the distance $d_F$; then the median and the upper and lower fourths of the data transformed by $\Phi$ are
$$M_Y = \Phi(M_X),$$
$$(F_U)_Y = \Phi(F_U) = \Phi\big(M_X + \lambda(M_X)\, d_F\big),$$
$$(F_L)_Y = \Phi(F_L) = \Phi\big(M_X - \{1 - \lambda(M_X)\}\, d_F\big).$$
Substituting into the Taylor series expansion,
$$(d_F)_Y = d_F \cdot \Phi'(M_X) + \frac{\{2\lambda(M_X) - 1\}\, d_F^{\,2}}{2!}\, \Phi''(M_X) + \cdots$$
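Spread-versus-Level Plot
• Since the quadratic term of $(d_F)_Y$ is negligible compared to the leading term,
$$(d_F)_Y \approx d_F \cdot \Phi'(M_X).$$
• Now suppose that the relationship between spread $d_F$ and level $x$ is a power relationship of the form
$$d_F(x) = k \cdot x^{b}$$
for some constants $k$ and $b$. For the transformed spread to be constant, $\Phi'(x)$ must be proportional to $1/d_F(x)$, so
$$\Phi(x) = \int \frac{1}{k\,x^{b}}\,dx = \begin{cases} c_1 x^{1-b} + c_2 & \text{for } b \neq 1, \\ c_3 \ln x + c_4 & \text{for } b = 1. \end{cases}$$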
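As a quick check of this result, take the case $b \neq 1$ and, purely for simplicity, set the constants to $c_1 = 1$ and $c_2 = 0$ so that $\Phi(x) = x^{1-b}$. Under the assumed power relationship $d_F(x) = k\,x^{b}$, the leading-term approximation gives
$$(d_F)_Y \approx d_F(x)\,\Phi'(x) = k\,x^{b}\cdot(1-b)\,x^{-b} = k\,(1-b),$$
which no longer depends on the level $x$, so the spread of the transformed data is approximately constant.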
Spread-versus-Level Plot
• Recall $d_F(x) = k \cdot M_X^{\,b}$, so that
$$\log(\hat{d}_F) = \log(k) + b \cdot \log(M_X).$$
• Therefore, the slope of the spread-versus-level plot estimates $b$. Hence the suggested power transformation would be
$$x^{p} = x^{1-b}.$$
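The slope estimate can be sketched with an ordinary least-squares fit of log10 fourth-spread on log10 median, one point per batch (for the GeneChip data, a batch would be the 16 replicate intensities of one probe set). The batches below are hypothetical values constructed so that spread grows like level to the power 0.5; the function names and the choice of ordinary least squares are assumptions of this sketch, not necessarily the fitting method used in the talk.

```python
import math
from statistics import median

def fourth_spread(values):
    """Upper fourth minus lower fourth, using the depth rule for fourths."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (math.floor((n + 1) / 2) + 1) / 2
    lo, hi = math.floor(depth) - 1, math.ceil(depth) - 1
    lower_fourth = (ordered[lo] + ordered[hi]) / 2
    upper_fourth = (ordered[n - 1 - lo] + ordered[n - 1 - hi]) / 2
    return upper_fourth - lower_fourth

def spread_versus_level_fit(batches):
    """Return (intercept, slope b, suggested power p = 1 - b)."""
    xs = [math.log10(median(batch)) for batch in batches]
    ys = [math.log10(fourth_spread(batch)) for batch in batches]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept, slope, 1.0 - slope

# Hypothetical batches: spread proportional to the square root of the level.
offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]
levels = [100.0, 400.0, 1600.0, 6400.0]
batches = [[m + 0.1 * math.sqrt(m) * z for z in offsets] for m in levels]

intercept, b_hat, p_hat = spread_versus_level_fit(batches)
print(f"intercept = {intercept:.4f}, slope b = {b_hat:.4f}, power p = {p_hat:.4f}")
```

With real data the points scatter around the line, and the fitted slope is usually rounded before choosing the transformation.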
[Figure: Spread-versus-level plot for identifying the power transformation; fitted line with $\hat{\beta}_0 = 0.0519$, $\hat{\beta}_1 = 0.5692$]
Results
• The estimated power turned out to be $\hat{p} = 1 - 0.57 = 0.43$.
• Typically, for simplicity the parameter estimate is rounded to the nearest multiple of one half, which in this case leads to $\hat{p} = 0.5$.
• Here, $\Phi_{0.5}(x) = (x^{0.5} - 1)/0.5 = 2\sqrt{x} - 2$.
[Figure: Mean versus variance plot after the $2\sqrt{x} - 2$ transformation]
Other Variance Stabilizing Transformations
• Assuming the gene expression data may be modelled as
$$y = \alpha + \mu e^{\eta} + \varepsilon,$$
where y represents the measured intensity, α represents the average background, µ represents the true gene expression level, and η and ε are normally distributed error terms with zero mean and differing non-zero variances, the generalized logarithm transformation
$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right),$$
where $c = \sigma_\varepsilon^2 / S_\eta$, stabilizes the variance (Durbin et al., 2002) for large samples.
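The transformation itself is simple to compute once α and c are known. The sketch below takes both as given inputs; the values of `alpha` and `c` used in the demonstration are hypothetical, and estimating them from data (as Durbin et al. describe) is outside the scope of this example.

```python
import math

def glog(y, alpha, c):
    """Generalized logarithm g(y) = ln(y - alpha + sqrt((y - alpha)^2 + c))."""
    if c <= 0:
        raise ValueError("c must be positive")
    shifted = y - alpha
    return math.log(shifted + math.sqrt(shifted * shifted + c))

# Hypothetical background and constant, for illustration only.
alpha, c = 20.0, 400.0

for y in (10.0, 20.0, 200.0, 20000.0):
    print(f"y = {y:>8}: glog = {glog(y, alpha, c):.4f}, "
          f"ln(2(y - alpha)) = "
          f"{math.log(2 * (y - alpha)) if y > alpha else float('nan'):.4f}")
```

For intensities far above background the generalized logarithm behaves like an ordinary logarithm, while near or below background it stays defined, unlike the plain log of a background-corrected value.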
Rank versus fourth-spread plots
Conclusions re. utility of spread-versus-level plot
• One common assumption of many traditional statistical methods is that the variance is constant over the levels of gene expression.
• Since the absolute intensities are positive, this assumption is generally not valid because the variance increases with increasing intensities.
• The log2 transformation, which is routinely applied to absolute intensities that result from Affymetrix GeneChip® experiments (Hubbell et al., 2002; Irizarry et al., 2003), unfortunately does not achieve the desired constant variance.
Conclusions re. utility of spread-versus-level plot
• To overcome this drawback of the log2 transformation, methods such as the Local Pooled Error test have been suggested (Jain et al., 2003) to more accurately estimate the variance for lowly expressed log2 signal intensities.
• As an alternative, the spread-versus-level plot may be used to identify a more appropriate monotonic transformation that stabilizes the variance.
Median polish:
An alternative to two-way ANOVA
Analysis of a two-way table
       1      2      ...    n
1      x11    x12    ...    x1n
2      x21    x22    ...    x2n
...
g      xg1    xg2    ...    xgn
Analysis of a two-way table
• Additive model: the relationship between the response variable and two factors is easiest to summarize and interpret if their joint contribution to the response is the sum of a separate contribution from each factor.
• Fitting an additive model to a two-way table – want to estimate
– (1) an overall main effect,
– (2) a set of row effects,
– (3) a set of column effects,
– (4) a table of residuals.
Analysis of a two-way table
Model:
$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij},$$
where µ is the overall mean, $\alpha_i$ is the row effect, $\beta_j$ is the column effect, and $\varepsilon_{ij}$ is the error.
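Classical method: ANOVA
$$\hat{\mu} = \bar{x}_{..} = \frac{1}{IJ}\sum_{i}\sum_{j} x_{ij}$$
$$\hat{\alpha}_i = \bar{x}_{i.} - \bar{x}_{..}$$
$$\hat{\beta}_j = \bar{x}_{.j} - \bar{x}_{..}$$
$$\hat{\varepsilon}_{ij} = x_{ij} - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j)$$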
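These estimates can be computed directly from the means. The sketch below assumes a complete table with one observation per cell; the function name is illustrative, and the 3×3 table with a single wild value is the one used in the comparison later in the talk.

```python
def two_way_anova_effects(table):
    """Return (overall mean, row effects, column effects, residuals)."""
    n_rows, n_cols = len(table), len(table[0])
    overall = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_effects = [sum(row) / n_cols - overall for row in table]
    col_effects = [sum(table[i][j] for i in range(n_rows)) / n_rows - overall
                   for j in range(n_cols)]
    # Residual = observation minus fitted value (overall + row + column).
    residuals = [[table[i][j] - (overall + row_effects[i] + col_effects[j])
                  for j in range(n_cols)]
                 for i in range(n_rows)]
    return overall, row_effects, col_effects, residuals

table = [[9, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]

overall, row_effects, col_effects, residuals = two_way_anova_effects(table)
print("overall mean:  ", overall)
print("row effects:   ", row_effects)
print("column effects:", col_effects)
for row in residuals:
    print("residuals:     ", row)
```

Because every estimate is a mean, the single value 9 leaks into the overall mean, the first row and column effects, and every residual in the table.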
Median Polish
• Model:
$$y_{ij} = m + a_i + b_j + e_{ij}$$
• The fit and the residuals at the end of n iterations are
$$y_{ij} = m^{(n)} + a_i^{(n)} + b_j^{(n)} + e_{ij}^{(n)}$$
Median Polish
• Row, column, and overall effects are initialized at 0.
• That is,
$$m^{(0)} = 0, \qquad a_i^{(0)} = 0, \qquad b_j^{(0)} = 0$$
Median Polish
• Thereafter, the row medians are determined and subtracted from all observations within the row.
• From the new data table (the original data with the row medians subtracted out) find the column medians.
• Polish by subtracting the column median from all observations within the column.
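Median Polish
• After a row and column polish, effects are estimated by
$$\Delta m_a^{(n)} = \mathrm{median}_i\big(a_i^{(n-1)} + \Delta a_i^{(n)}\big)$$
$$\Delta m_b^{(n)} = \mathrm{median}_j\big(b_j^{(n-1)}\big)$$
$$m^{(n)} = m^{(n-1)} + \Delta m_a^{(n)} + \Delta m_b^{(n)}$$
$$a_i^{(n)} = a_i^{(n-1)} + \Delta a_i^{(n)} - \Delta m_a^{(n)}$$
$$b_j^{(n)} = b_j^{(n-1)} + \Delta b_j^{(n)} - \Delta m_b^{(n)}$$
where $\Delta a_i^{(n)}$ and $\Delta b_j^{(n)}$ are the row and column medians subtracted at iteration n.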
Median Polish
• Repeat the row/column polish procedure 2 times.
• The quantities $a_i^{(2)}$, $b_j^{(2)}$, and $m^{(2)}$ are the median polish estimates of the row effects, column effects, and main effect, respectively.
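The procedure can be sketched as below. This is a minimal version that, after each row sweep and each column sweep, re-centres the accumulated effects by their median and moves that median into the overall effect; this bookkeeping is an assumption of the sketch and may differ in detail from the update formulas shown on the slides. The function name and the default of two iterations are illustrative.

```python
from statistics import median

def median_polish(table, iterations=2):
    """Return (overall effect, row effects, column effects, residuals)."""
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols

    for _ in range(iterations):
        # Row polish: subtract each row median from its row.
        for i in range(n_rows):
            delta = median(residuals[i])
            residuals[i] = [value - delta for value in residuals[i]]
            row_effects[i] += delta
        shift = median(row_effects)
        overall += shift
        row_effects = [a - shift for a in row_effects]

        # Column polish: subtract each column median from its column.
        for j in range(n_cols):
            delta = median(residuals[i][j] for i in range(n_rows))
            for i in range(n_rows):
                residuals[i][j] -= delta
            col_effects[j] += delta
        shift = median(col_effects)
        overall += shift
        col_effects = [b - shift for b in col_effects]

    return overall, row_effects, col_effects, residuals

# The 3x3 table with one wild value from the ANOVA comparison in this talk.
wild = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]
print(median_polish(wild))

# A hypothetical exactly additive table: 10 + row effect + column effect.
additive = [[10 + a + b for b in (-2, 0, 2)] for a in (-1, 0, 1)]
print(median_polish(additive))
```

On the table with the wild value, every row and column median is 0, so all effects stay at 0 and the 9 remains isolated in a single residual.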
Expression Quantification
• Rather than placing an entire gene on the array, an Affymetrix GeneChip is an oligonucleotide array consisting of several perfect match (PM) probes and their corresponding mismatch (MM) probes that interrogate a single gene.
– The PM probe is the exact complementary sequence of the target genetic sequence, composed of 25 bases
– The MM probe has the same sequence, except that the middle (13th) base has been replaced by its complement
– There are roughly 11-20 PM/MM probe pairs that interrogate each gene, called a probe set
Expression Quantification
PM and MM intensities are combined to form an expression measure for the probe set (gene)
11 – 20 Probe Pairs interrogate each gene
PM: GCGCCGGCTGCAGGAGCAGGAGGAG
MM: GCGCCGGCTGCACGAGCAGGAGGAG
Expression Quantification
• Initially, Affymetrix signal was calculated as
$$\mathrm{AvDiff} = \frac{1}{|A|}\sum_{j \in A} (PM_j - MM_j),$$
where j indexes the probe pairs for each probe set A. This is known as the “Average Difference” method.
• Problems:
– Large variability in PM-MM
– MM probes may be measuring signal for another gene/EST
– PM-MM calculations are sometimes negative
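A small numerical sketch of the Average Difference calculation follows. The PM and MM intensities are hypothetical values chosen for illustration, not data from the study, and the function name is likewise illustrative.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and the same length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for a probe set with four probe pairs.
pm = [1200.0, 950.0, 400.0, 3100.0]
mm = [300.0, 1000.0, 380.0, 500.0]
print("AvDiff:", average_difference(pm, mm))

# Hypothetical low-expression probe set where MM exceeds PM in every pair.
print("AvDiff:", average_difference([100.0, 120.0], [150.0, 130.0]))
```

The second call returns a negative value, which illustrates one of the problems listed above: a negative signal has no direct interpretation as an expression level and cannot be log-transformed.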
[Figure: probe set 205586_x_at]
Median Polish
• Example (handout)
Comparing ANOVA and Median Polish
       1    2    3
1      9    0    0
2      0    0    0
3      0    0    0
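Classical method: ANOVA
$$\hat{\mu} = \bar{x}_{..} = \frac{1}{3 \times 3}(9 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0) = 1$$
$$\hat{\alpha}_1 = \tfrac{1}{3}(9 + 0 + 0) - 1 = 2$$
$$\hat{\alpha}_2 = \tfrac{1}{3}(0 + 0 + 0) - 1 = -1$$
$$\hat{\alpha}_3 = \tfrac{1}{3}(0 + 0 + 0) - 1 = -1$$
$$\hat{\beta}_1 = \tfrac{1}{3}(9 + 0 + 0) - 1 = 2$$
$$\hat{\beta}_2 = \tfrac{1}{3}(0 + 0 + 0) - 1 = -1$$
$$\hat{\beta}_3 = \tfrac{1}{3}(0 + 0 + 0) - 1 = -1$$
$$\hat{\varepsilon}_{ij} = x_{ij} - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j)$$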
ANOVA: Parameter Estimates and Residuals
          1     2     3     α̂_i
1         4    -2    -2     2
2        -2     1     1    -1
3        -2     1     1    -1
β̂_j       2    -1    -1     μ̂ = 1
Median Polish: Parameter Estimates and Residuals
          1     2     3     â_i
1         9     0     0     0
2         0     0     0     0
3         0     0     0     0
b̂_j       0     0     0     m̂ = 0
References
• Box, G.E.P., Cox, D.R. (1964) An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26: 211-252.
• Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. (2002) A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics, 18: S105-S110.
• Emerson, J.D. (1987) Mathematical aspects of transformation. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 247-282.
• Emerson, J.D. and Hoaglin, D.C. (1987) Analysis of two-way tables by medians. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 129-165.
• Hoaglin, D.C. (1987) Letter values: A set of selected order statistics. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds), Understanding robust and exploratory data analysis, John Wiley & Sons, New York, pp. 33-57.
• Hubbell, E., Liu, W.-M., Mei, R. (2002) Robust estimators for expression analysis, Bioinformatics, 18: 1585-1592.
• Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P. (2003) Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Research, 31: e15.
• Jain, N., Thatte, J., Braciale, T., Ley, K., O'Connell, M., Lee, J.K. (2003) Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays, Bioinformatics, 19: 1945-1951.
• Kohane, I.S., Kho, A.T., Butte, A.J. (2003) Microarrays for an integrative genomics. MIT Press, Cambridge, MA.
• Quackenbush, J. (2002) Microarray data normalization and transformation, Nature Genetics, 32: 496-501.