Wolfgang Huber, European Bioinformatics Institute
Source: marray.economia.unimi.it/2005/material/L4.pdf
Quality control and normalization
Acknowledgements
Anja von Heydebreck (Darmstadt)
Robert Gentleman (Seattle)
Günther Sawitzki (Heidelberg)
Martin Vingron (Berlin)
Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)
Rafael Irizarry (Baltimore)
Judith Boer (Leiden)
Anke Schroth (Heidelberg)
Friederike Wilmer (Hilden)
Which genes are differentially transcribed?
[Figure: two panels, same-same and tumor-normal; axis: log-ratio]
Statistics 101:
←bias accuracy→
←precision variance→
Basic dogma of data analysis:
One can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the best trade-off.
(It can also be possible to increase both through a better choice of method / model.)
ratios and fold changes
Fold changes are useful to describe continuous changes in expression.
[Figure: conditions A, B, C with intensities 1000, 1500, 3000; fold changes relative to A: x1.5 and x3]
But what if the gene is “off” (below detection limit) in one condition?
[Figure: conditions A, B, C with intensities 0, 200, 3000; fold changes: "?" and "?"]
ratios and fold changes
The idea of the log-ratio (base 2)
0: no change
+1: up by factor of 2^1 = 2
+2: up by factor of 2^2 = 4
-1: down by factor of 2^-1 = 1/2
-2: down by factor of 2^-2 = 1/4
A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.
What about a change from 0 to 500?
- conceptually
- noise, measurement precision
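As a minimal sketch of the log-ratio idea, the snippet below computes base-2 log-ratios for the example values on the slides. The function name `log2_ratio` is made up for illustration. It also shows why a change from 0 to 500 has no log-ratio at all.

```python
import math

def log2_ratio(a: float, b: float) -> float:
    """Base-2 log-ratio of b relative to a (hypothetical helper).

    Undefined when either intensity is zero or negative, which is the
    situation of a gene that is "off" in one condition.
    """
    if a <= 0 or b <= 0:
        raise ValueError("log-ratio is undefined for non-positive intensities")
    return math.log2(b / a)

# Equal fold changes give equal log-ratios, whatever the absolute level.
print(log2_ratio(1000, 2000))   # one unit up
print(log2_ratio(5000, 10000))  # also one unit up
print(log2_ratio(3000, 1500))   # one unit down

try:
    log2_ratio(0, 500)
except ValueError as err:
    print("0 -> 500:", err)
```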
A complex measurement process lies between mRNA concentrations and intensities
o other array manufacturing-related issues
o hybridization efficiency and specificity
o DNA-support binding
o reverse transcription efficiency
o ‘background’ correction
o spotting efficiency
o amplification efficiency
o signal quantification
o PCR yield, contamination
o RNA degradation
o image segmentation
o clone identification and mapping
o tissue contamination
The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.
Questions
♦ How to compare microarray intensities with each other?
♦ How to address measurement uncertainty (“variance”)?
♦ How to calibrate (“normalize”) for biases between samples?
Sources of variation
amount of RNA in the biopsy
efficiencies of
- RNA extraction
- reverse transcription
- labeling
- fluorescent detection
probe purity and length distribution
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal
Systematic
o similar effect on many measurements
o corrections can be estimated from data
Calibration
Stochastic
o too random to be explicitly accounted for
o remain as “noise”
Error model
Error models
describe the possible outcomes of a set of measurements
Outcomes depend on:
- true value of the measured quantity (abundances of specific molecules in biological sample)
- measurement apparatus (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)
Error models
Purpose:
1. Data compression: summary statistic instead of full empirical distribution
2. Quality control
3. Statistical inference: appropriate parametric methods have better power than non-parametric (this has practical, financial, and ethical aspects)
The two-component model
measured intensity = offset + gain × true abundance
$$y_{ik} = a_{ik} + b_{ik}\, x_k$$
$$a_{ik} = a_i + \varepsilon_{ik}$$
a_i: per-sample offset
$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$$b_{ik} = b_i\, b_k \exp(\eta_{ik})$$
b_i: per-sample normalization factor
b_k: sequence-wise probe efficiency
$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
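To make the two noise components concrete, here is a small simulation sketch of measured intensity = offset + additive noise + gain × abundance × exp(multiplicative noise), for a single array and a probe efficiency of 1. All parameter values (`a`, `b`, `s1`, `s2`) and the function name are illustrative assumptions, not values from the talk.

```python
import math
import random
import statistics

def simulate_intensity(x, a=100.0, b=1.0, s1=20.0, s2=0.1, rng=random):
    """Draw one intensity y = a + eps + b * x * exp(eta).

    Assumed, illustrative parameters: offset a, gain b, additive noise
    eps ~ N(0, (b*s1)^2), multiplicative noise eta ~ N(0, s2^2).
    """
    eps = rng.gauss(0.0, b * s1)
    eta = rng.gauss(0.0, s2)
    return a + eps + b * x * math.exp(eta)

rng = random.Random(0)
for x in (0.0, 1000.0, 10000.0):
    ys = [simulate_intensity(x, rng=rng) for _ in range(5000)]
    # At x = 0 only the additive part contributes; at large x the
    # standard deviation grows roughly in proportion to x.
    print(f"x = {x:8.0f}   sd(y) = {statistics.stdev(ys):8.1f}")
```

The printed standard deviations should be roughly constant near zero abundance and roughly proportional to abundance at the high end, which is the behaviour the slide labels “additive” and “multiplicative” noise.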
The two-component model
[Figure: two panels, raw scale and log scale]
B. Durbin, D. Rocke, JCB 2001
“additive” noise
“multiplicative” noise
Parameterization
two practically equivalent forms ($\eta \ll 1$):
$$y = a + \varepsilon + b \cdot x \cdot (1 + \eta)$$
$$y = a + \varepsilon + b \cdot x \cdot e^{\eta}$$
a, systematic background: same for all probes (per array x color), or per array x color x print-tip group
ε, random background: iid in whole experiment, or iid per array
b, systematic gain factor: per array x color, or per array x color x print-tip group
η, random gain fluctuations: iid in whole experiment, or iid per array
variance stabilizing transformations
$X_u$ a family of random variables with $\mathrm{E}\,X_u = u$, $\mathrm{Var}\,X_u = v(u)$. Define
$$f(x) = \int^{x} \frac{du}{\sqrt{v(u)}}$$
⇒ $\mathrm{Var}\, f(X_u) \approx$ independent of u
derivation: linear approximation
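The linear approximation can be written out in three lines. This is the standard first-order (delta method) argument, given here as a sketch; it assumes f is smooth and that $X_u$ stays close to its mean u.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \operatorname{Var} X_u = f'(u)^2\, v(u) \\
f'(u) = \frac{1}{\sqrt{v(u)}} &\;\Rightarrow\; \operatorname{Var} f(X_u) \approx 1
\end{align*}
```

Choosing $f'(u) = 1/\sqrt{v(u)}$ is exactly what the integral definition of f does, so the variance on the transformed scale no longer depends on u to first order.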
variance stabilizing transformations
[Figure: f(x) against x; horizontal axis: raw scale (0 to 60000), vertical axis: transformed scale (8.0 to 11.0)]
variance stabilizing transformations
$$f(x) = \int^{x} \frac{du}{\sqrt{v(u)}}$$
1.) constant variance (‘additive’): $v(u) = s^2 \Rightarrow f \propto u$
2.) constant CV (‘multiplicative’): $v(u) \propto u^2 \Rightarrow f \propto \log u$
3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$
4.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s}$
the “glog” transformation
[Figure: f(x) = log(x) (dashed) and h_s(x) = asinh(x/s) (solid), plotted against intensity from -200 to 1000]
$$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$$
$$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$$
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
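A quick numerical check of the two properties of the arsinh ("glog") transformation, as a sketch: it approaches log(x) + log(2) for large x, and unlike the logarithm it is defined at zero and for negative intensities. The helper name `glog` and the scale value s = 100 are assumptions for illustration.

```python
import math

def glog(x: float, s: float = 1.0) -> float:
    """h_s(x) = asinh(x / s); s is an assumed scale parameter."""
    return math.asinh(x / s)

# The gap asinh(x) - log(x) - log(2) shrinks towards 0 as x grows.
for x in (1.0, 10.0, 100.0, 1000.0):
    gap = math.asinh(x) - math.log(x) - math.log(2.0)
    print(f"x = {x:7.1f}   asinh(x) - log(x) - log(2) = {gap:.2e}")

# Defined where the logarithm is not.
print(glog(0.0, s=100.0), glog(-200.0, s=100.0))
```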
[Figure: three panels: raw scale (difference), log (log-ratio), glog (generalized log-ratio); variance annotated with a constant part and a proportional part]
the transformed model
$$\operatorname{arsinh}\frac{Y_{ki} - a_{si}}{b_{si}} = \mu_k + \varepsilon_{ki}, \qquad \varepsilon_{ki} \sim N(0, c^2)$$
i: arrays
k: probes
s: probe strata (e.g. print-tip, region)
“usual” log-ratio:
$$\log\frac{x_1}{x_2}$$
'glog' (generalized log-ratio):
$$\log\frac{x_1 + \sqrt{x_1^2 + c_1^2}}{x_2 + \sqrt{x_2^2 + c_2^2}}$$
c1, c2 are experiment specific parameters (~level of background noise)
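The following sketch compares the usual and the generalized log-ratio numerically. The values chosen for c1 and c2 (both 100) are arbitrary illustrations; in practice they would be estimated from the experiment.

```python
import math

def log_ratio(x1: float, x2: float) -> float:
    """Usual log-ratio; requires x1, x2 > 0."""
    return math.log(x1 / x2)

def glog_ratio(x1: float, x2: float, c1: float, c2: float) -> float:
    """Generalized log-ratio with background-level parameters c1, c2 > 0."""
    num = x1 + math.sqrt(x1 ** 2 + c1 ** 2)
    den = x2 + math.sqrt(x2 ** 2 + c2 ** 2)
    return math.log(num / den)

c = 100.0  # assumed noise level, for illustration only

# High intensities: the two measures nearly coincide.
print(log_ratio(20000, 10000), glog_ratio(20000, 10000, c, c))

# Near the noise level: the generalized log-ratio is pulled towards 0
# and stays finite even when one intensity is 0.
print(log_ratio(200, 50), glog_ratio(200, 50, c, c))
print(glog_ratio(500, 0, c, c))
```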
Variance Bias Trade-Off
[Figure: estimated log-fold-change against signal intensity, for log and glog]
Variance-bias trade-off and shrinkage estimators
Shrinkage estimators: pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.
Particularly useful if you have few replicates.
Generalized log-ratio: a shrinkage estimator for fold change
There are many possible choices; we chose “variance-stabilization”:
+ interpretable even in cases where genes are off in some conditions
+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale
evaluation: effects of different data transformations
[Figure: difference red-green against rank(average)]
Normality: QQ-plot
“Single color normalization”
n red-green arrays (R1, G1, R2, G2,… Rn, Gn)
within/between slides:
o for (i=1:n): calculate Mi = log(Ri/Gi), Ai = ½ log(Ri*Gi); normalize Mi vs Ai
o normalize M1…Mn
all at once:
o normalize the matrix of (R, G)
o then calculate log-ratios or any other contrast you like
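The M and A values of the within-slide approach can be sketched as below for one array. Two assumptions are made here: base-2 logarithms are used (the slide only says "log"), and the normalization step is reduced to plain median-centering of M, a simple stand-in for an intensity-dependent normalization of M against A. The function names are hypothetical.

```python
import math
import statistics

def ma_values(red, green):
    """Return (M, A) with M = log2(R/G) and A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Stand-in normalization: shift M so that its median is 0.

    This ignores A entirely; an intensity-dependent method would
    instead subtract a smooth trend of M as a function of A.
    """
    med = statistics.median(m)
    return [v - med for v in m]

red = [200.0, 400.0, 800.0]    # made-up intensities
green = [100.0, 100.0, 100.0]
m, a = ma_values(red, green)
print(m, a)
print(median_center(m))
```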
What about non-linear effects?
o Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)
Two reasons for non-linearity:
o At the high intensity end: saturation/quenching. This can and should be avoided experimentally - loss of data!
o At the low intensity end: background offsets. Instead of y=k·x we have y=k·x+x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be corrected by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.
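A small numerical sketch of the low-intensity argument: with y = k·x + x0, doubling the abundance x changes log2(y) by much less than 1 at the low end and by almost 1 at the high end, which is what looks curved in a log-log plot. Subtracting the offset restores a response of exactly 1 everywhere. The values k = 2 and x0 = 200 are arbitrary assumptions.

```python
import math

def doubling_response(x, k=2.0, x0=0.0, subtract=0.0):
    """log2 change of (y - subtract) when abundance doubles from x to 2x,
    where y = k * x + x0."""
    y1 = k * x + x0 - subtract
    y2 = k * 2 * x + x0 - subtract
    return math.log2(y2 / y1)

for x in (10.0, 100.0, 1000.0, 100000.0):
    print(
        f"x = {x:9.0f}",
        f"no offset: {doubling_response(x):.3f}",
        f"offset 200: {doubling_response(x, x0=200.0):.3f}",
        f"offset removed: {doubling_response(x, x0=200.0, subtract=200.0):.3f}",
    )
```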
Non-linear or affine linear?
Definitions
[Figure: three panels: linear, affine linear, genuinely non-linear]
How to compare and assess different ‘normalization’ methods?
Normalization :=
1. correction for systematic experimental biases
2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…
3. provision of a measure of measurement uncertainty
Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.
Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio?
- just use anyway
- ignore
- shrink
How to compare and assess different ‘normalization’ methods?
Aesthetic criteria: Logarithm is more beautiful than arsinh
Practical criteria: It takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.
Silly criteria: The best method is the one that makes all my scatterplots look like straight, slim cigars
Physical criteria: Normalization calculations should be based on a physical/chemical model
Economical/political criteria: Life would be so much easier if everybody were just using the same method, who cares which one
How to compare and assess different ‘normalization’ methods?
Comparison against a ground truth
But you have millions of numbers – need to choose the metric that measures deviation from truth.
FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?
qualitative/quantitative: how well do you estimate abundance, fold-change?
Spike-In and Dilution series
… great, but how representative are they of other data?
Implicitly, from resampling / cross-validating with the actual experiment of interest
… but isn’t that too much like Münchhausen’s bootstrap?
evaluation: a benchmark for Affymetrix genechip expression measures
o Data:
Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark:
15 quality measures regarding
- reproducibility
- sensitivity
- specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu
Quality assessment and control:
an overview of some diagnostic plots and common artifacts
Scatterplot, colored by PCR-plate
Two RZPD Unigene II filters (cDNA nylon membranes)
PCR plates
PCR plates
[Figure]
PCR plates: boxplots
[Figure]
array batches
[Figure]
print-tip effects
[Figure: empirical distribution function F̂(q) of the log-ratio q = log(fg.green/fg.red), array 41 (a42-u07639vene.txt), one curve per spotting pin (1:1 to 4:4)]
spotting pin quality decline
[Figure: two panels: after delivery of 3x10^5 spots; after delivery of 5x10^5 spots]
H. Sueltmann DKFZ/MGA
spatial effects
[Figure: spotted cDNA arrays, Stanford-type; panels R, Rb, R-Rb; color scale by rank]
[Figure: another array, print-tip; color scale ~ log(G) and color scale ~ rank(G)]
Batches: array to array differences $d_{ij} = \mathrm{mad}_k(h_{ik} - h_{jk})$
arrays i=1…63; roughly sorted by time
[Figure: matrix of d_ij for all pairs of arrays, both axes 1:nrhyb; color scale from 0.6 to 1.8]
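The array-to-array distance d_ij can be sketched in a few lines. Here `h[i][k]` stands for the transformed intensity of probe k on array i, and `mad` is the plain median absolute deviation without any scaling constant; both the data layout and the unscaled definition are assumptions made for this example.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def array_distances(h):
    """d[i][j] = mad over probes k of (h[i][k] - h[j][k])."""
    n = len(h)
    return [
        [mad(a - b for a, b in zip(h[i], h[j])) for j in range(n)]
        for i in range(n)
    ]

# Three made-up arrays with four probes each; the third one deviates.
h = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 5.0, 3.0],
]
for row in array_distances(h):
    print(row)
```

Arrays from the same batch would be expected to show small mutual distances, so blocks in the resulting matrix point to batch effects.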
Density representation of the scatterplot (76,000 clones, RZPD Unigene-II filters)
See: package hexbin; also, smoothscatter in package prada
Oligonucleotide chips
Affymetrix files
Main software from Affymetrix: MAS - MicroArray Suite.
DAT file: Image file, ~10^8 pixels, ~200 MB.
CEL file: probe intensities, ~10^6 numbers
CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).
1LQ file: Probe sequences and intended targets in the transcriptome
Image analysis
DAT image files → CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of probe cell centers.
Signal:
– Remove outer 36 pixels → 8x8 pixels.
– The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.
Compute also quality values.
Data and notation
PM_ijg, MM_ijg = Intensities for perfect match and mismatch probe j for gene g in chip i
i = 1,…, n: one to hundreds of chips
j = 1,…, J: usually 11 or 16 probe pairs
g = 1,…, G: 6…30,000 probe sets.
Tasks:
o calibrate (normalize) the measurements from different chips (samples)
o summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure.
o compare between chips (samples) for detecting differential expression.
expression measures: MAS 4.0
Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:
$$\mathrm{AvDiff} = \frac{1}{\# J} \sum_{j \in J} (PM_j - MM_j)$$
o sort d_j = PM_j - MM_j
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
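A rough sketch of an AvDiff-style trimmed mean, following the three steps listed on the slide. The slide leaves details open, so this version makes explicit assumptions: the average and standard deviation are computed after dropping the highest and lowest difference, and J is then selected from all pairs. It is not the vendor's implementation.

```python
import statistics

def avdiff(pm, mm, n_sd=3.0):
    """AvDiff-like trimmed mean of PM - MM differences (illustrative only)."""
    d = sorted(p - m for p, m in zip(pm, mm))
    if len(d) < 4:
        raise ValueError("need at least 4 probe pairs")
    trimmed = d[1:-1]                   # exclude highest and lowest value
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)
    # J: pairs whose difference lies within n_sd standard deviations
    kept = [v for v in d if abs(v - center) <= n_sd * spread]
    return statistics.mean(kept)

pm = [110, 120, 130, 140, 150, 1000]   # made-up intensities, one outlier
mm = [10, 10, 10, 10, 10, 10]
print(avdiff(pm, mm))
```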
![Page 59: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/59.jpg)
Expression measures MAS 5.0
Instead of MM, use "repaired" version CT:
CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM
"Signal" = Tukey.Biweight(log(PM − CT)) (… ≈ median)
Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x| < c, 0 otherwise
![Page 60: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/60.jpg)
Expression measures: Li & Wong
dChip fits a model for each gene:

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$

where
– θi: expression index for the gene in chip i
– φj: probe sensitivity
The maximum likelihood estimate of θi, the model-based expression index (MBEI), is used as the expression measure of the gene in chip i.
Need at least 10 or 20 chips.
Current version works with PMs only.
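To make the multiplicative model concrete, here is a minimal Python sketch that fits θi and φj for one gene by alternating least squares, which coincides with maximum likelihood under the Gaussian error assumption. It is not the dChip implementation: the function name `fit_li_wong`, the fixed iteration count, and the identifiability constraint Σ φj² = J are assumptions of this example, and dChip's outlier handling is omitted. The input must not be all zeros.

```python
def fit_li_wong(y, n_iter=50):
    """y[i][j] = PM - MM for chip i and probe j of one gene.

    Returns (theta, phi) with sum(phi_j^2) == number of probes.
    """
    n_chips, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    for _ in range(n_iter):
        # Given phi, the least-squares theta_i is a regression through the origin.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss_phi
                 for i in range(n_chips)]
        # Given theta, update phi_j the same way.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_chips)) / ss_theta
               for j in range(n_probes)]
        # Rescale so that sum(phi_j^2) equals the number of probes.
        scale = (n_probes / sum(p * p for p in phi)) ** 0.5
        phi = [p * scale for p in phi]
    # Final theta consistent with the rescaled phi.
    ss_phi = sum(p * p for p in phi)
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss_phi
             for i in range(n_chips)]
    return theta, phi
```

On noise-free data of the form `y[i][j] = theta_i * phi_j`, the fitted products `theta[i] * phi[j]` reproduce the input exactly up to rounding.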
![Page 61: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/61.jpg)
Expression measures RMA: Irizarry et al. (2002)
o Estimate one global background value b = mode(MM). No probe-specific background!
o Assume: PM = strue + b. Estimate s ≥ 0 from PM and b as a conditional expectation E[strue | PM, b].
o Use log2(s).
o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
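Quantile normalization forces every chip to share the same empirical intensity distribution. The Python sketch below shows the basic idea under simplifying assumptions: all chips have the same number of probes, ties are broken by position, and the function name `quantile_normalize` is chosen for this example rather than taken from any package.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one list per chip.

    Each value is replaced by the mean, across chips, of the values
    that share its rank.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: average of the r-th smallest value over all chips.
    reference = [sum(s[r] for s in sorted_chips) / len(chips)
                 for r in range(n_probes)]
    result = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda k: chip[k])
        normalized = [0.0] * n_probes
        for rank, k in enumerate(order):
            normalized[k] = reference[rank]
        result.append(normalized)
    return result
```

After normalization, sorting any chip yields the same reference vector, while the ranking of probes within each chip is preserved.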
![Page 62: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/62.jpg)
Expression measures RMA: Irizarry et al. (2002)
AvDiff-like:

$$\mathrm{RMA} = \frac{1}{|A|} \sum_{j \in A} \log_2(PM_j - BG_j)$$

with A a set of “suitable” pairs.
Li-Wong-like: additive model

$$\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$$

Estimate RMA = ai for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
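Median polish for the additive model can be sketched as follows. This is an illustrative Python version, assuming rows are chips and columns are probes of one probe set, with the input already background-corrected and log2-transformed; the function name, the iteration cap, and the tolerance are assumptions of this example.

```python
import statistics


def median_polish(table, max_iter=10, tol=1e-6):
    """table[i][j] = log2(PM_ij - BG) for chip i and probe j.

    Returns (chip_expression, probe_effects, residuals), where
    chip_expression[i] estimates a_i (overall level plus row effect).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        row_med = [statistics.median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Remove column medians and accumulate them in the column effects.
        col_med = [statistics.median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    chip_expression = [overall + r for r in row_eff]
    return chip_expression, col_eff, resid
```

Because medians rather than means are swept out, a single aberrant probe on one chip ends up in the residuals instead of distorting the chip-level estimate.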
![Page 63: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/63.jpg)
Affymetrix: IPM = IMM + Ispecific ?
[Figure: log(PM/MM). From: R. Irizarry et al., Biostatistics 2002]
![Page 64: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/64.jpg)
Sequence-dependent preprocessing
$$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$$

[Figure: wi plotted against position i]
Position- and sequence-specific effects wi(s): Naef et al., Phys Rev E 68 (2003)
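The sum over the 25 probe positions can be written out directly. In the Python sketch below, `weights[i][base]` stands for wi(s) at position i; the function names and any weight values passed in are hypothetical placeholders, not fitted values from Naef et al.

```python
def sequence_affinity(sequence, weights):
    """Sum of position- and base-specific effects w_i(s_i) along a probe.

    weights is a list with one dict per position, mapping base -> effect.
    """
    if len(sequence) != len(weights):
        raise ValueError("need one weight table per probe position")
    return sum(weights[i][base] for i, base in enumerate(sequence))


def expected_log_intensity(log_x, sequence, weights):
    """log Y without the noise term: log x plus the sequence affinity."""
    return log_x + sequence_affinity(sequence, weights)
```

In practice the weights would be estimated from probe-level data; here they are simply supplied by the caller.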
![Page 65: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/65.jpg)
Software for pre-processing Affymetrix data
• Bioconductor R package affy.
• Background estimation.
• Probe-level normalization.
• Expression measures.
• Two main functions: ReadAffy, expresso.
• See also: gcrma, tilingArray, vsn.
![Page 66: Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)](https://reader030.vdocuments.mx/reader030/viewer/2022040511/5e5abdda0bc6f20a8a0e4368/html5/thumbnails/66.jpg)
References
Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).
Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.
Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.
Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).
Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).