carlo colantuoni carlo@illuminatobiotech
DESCRIPTION
Summer Inst. Of Epidemiology and Biostatistics, 2010: Gene Expression Data Analysis 1:30pm – 5:00pm in Room W2015. Carlo Colantuoni [email protected]. http://www.illuminatobiotech.com/GEA2010/GEA2010.htm. Class Outline. Basic Biology & Gene Expression Analysis Technology
![Page 1: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/1.jpg)
Summer Inst. Of Epidemiology and Biostatistics, 2010:
Gene Expression Data Analysis
1:30pm – 5:00pm in Room W2015
Carlo [email protected]
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm
![Page 2: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/2.jpg)
Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm
![Page 3: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/3.jpg)
Class Outline - Detailed
• Basic Biology & Gene Expression Analysis Technology
  – The Biology of Our Genome & Transcriptome
  – Genome and Transcriptome Structure & Databases
  – Gene Expression & Microarray Technology
• Data Preprocessing, Normalization, & QC
  – Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  – Background correction (PM-MM, RMA, GCRMA)
  – Global Mean Normalization
  – Loess Normalization
  – Quantile Normalization (RMA & GCRMA)
  – Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  – Quality Control: PCA and MDS for dimension reduction
  – SVA: Surrogate Variable Analysis
• Measures of Differential Expression
  – Basic Statistical Concepts
  – T-tests and Associated Problems
  – Significance analysis in microarrays (SAM) [ & Empirical Bayes]
  – Complex ANOVA’s (limma package in R)
• Multiple Comparison Problem
  – Bonferroni
  – False Discovery Rate Analysis (FDR)
• Differential Expression of Functional Gene Groups
  – Functional Annotation of the Genome
  – Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
  – Gene Set Enrichment Analysis (GSEA)
  – Parametric Analysis of Gene Set Enrichment (PAGE)
  – geneSetTest
  – Notes on Experimental Design
• Clustering and Classification
  – Hierarchical clustering
  – K-means
  – Classification
    • LDA (PAM), kNN, Random Forests
    • Cross-Validation
• Additional Topics
  • eQTL (expression + SNPs)
  • Next-Gen Sequencing data: RNAseq, ChIPseq
  • Epigenetics?
  – The R Statistical Language: http://www.r-project.org/
  – Bioconductor: http://www.bioconductor.org/docs/install/
  – Affymetrix data processing example
![Page 4: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/4.jpg)
DAY #2:
•Intensity Comparison & Ratio vs. Intensity Plots
•Log transformation
•Background correction (Affymetrix, 2-color, other)
•Normalization: global and local mean centering
•Normalization: quantile normalization
•Batches, plates, pins, hybs, washes, and other artifacts
•QC: PCA and MDS for dimension reduction
•SVA: Surrogate Variable Analysis
![Page 5: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/5.jpg)
Microarray Data Quantification
[Plot: log intensity vs. log intensity]
![Page 6: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/6.jpg)
Microarray Data Quantification
[Plot: log ratio vs. log intensity]
![Page 7: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/7.jpg)
Logarithmic Transformation:
if $\log_z(x) = y$ then $z^y = x$
Logarithm math refresher:
log(x) + log(y) = log( x * y )
log(x) - log(y) = log( x / y )
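A quick numeric check of the identities above, as a minimal sketch. The two intensity values are made up for illustration. It also shows why ratios are usually examined on the log scale: a 2-fold increase and a 2-fold decrease become symmetric around zero.

```python
import math

x, y = 800.0, 200.0  # two made-up intensities

# log(x) + log(y) = log(x * y)
sum_of_logs = math.log2(x) + math.log2(y)
log_of_product = math.log2(x * y)

# log(x) - log(y) = log(x / y)
diff_of_logs = math.log2(x) - math.log2(y)
log_of_ratio = math.log2(x / y)

# Inverse relationship: if log_z(x) = y then z**y = x (here z = 2)
recovered = 2 ** math.log2(x)

# On the log2 scale, 2-fold up and 2-fold down sit at +1 and -1
two_fold_up = math.log2(2.0)
two_fold_down = math.log2(0.5)

print(sum_of_logs, log_of_product)
print(diff_of_logs, log_of_ratio)
print(recovered, two_fold_up, two_fold_down)
```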
![Page 8: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/8.jpg)
Intensity vs. Intensity: LINEAR
Intensity Distribution: LINEAR
![Page 9: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/9.jpg)
Intensity vs. Intensity: LOG
Intensity Distribution:LOG
![Page 10: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/10.jpg)
Intensity vs. Intensity: LINEAR
![Page 11: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/11.jpg)
Intensity vs. Intensity: LOG
![Page 12: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/12.jpg)
Int vs. Int:LINEAR
Int vs. Int:LOG
Ratio vs. Int: LOG
Microarray Data Quantification
![Page 13: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/13.jpg)
Background Subtraction
![Page 14: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/14.jpg)
Before Hybridization
Array 1 Array 2
Sample 1 Sample 2
![Page 15: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/15.jpg)
After Hybridization
Array 1 Array 2
![Page 16: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/16.jpg)
More Realistic - Before
Array 1 Array 2
Sample 1 Sample 2
![Page 17: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/17.jpg)
Array 1 Array 2
More Realistic - After
![Page 18: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/18.jpg)
Panels: poly C, No label
![Page 19: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/19.jpg)
Intensity distributions for the no-label and Yeast DNA
![Page 20: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/20.jpg)
The presence of background noise is clear from the fact that the minimum PM intensity is not 0 and that the geometric mean of the probesets with no spike-in is around 200 units.
Why Adjust for Background?
Hs RNA on Hs chip (w/ spike-ins)
PM intensities
![Page 21: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/21.jpg)
Why Adjust for Background?
Local slope decreases as nominal concentration decreases!
At low expression: (E1 + B) ≈ B, so (E1 + B) / (E2 + B) ≈ 1
At high expression: (E1 + B) ≈ E1, so (E1 + B) / (E2 + B) ≈ E1 / E2
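The effect of an additive background on observed ratios can be sketched numerically. In this illustration the background level `B = 200` and the expression values are hypothetical; the true ratio E1 / E2 is held at 2 (log2 ratio of 1) while the expression level changes.

```python
import math

def observed_log_ratio(e1, e2, background):
    """log2 ratio of two observed intensities that share an additive background."""
    return math.log2((e1 + background) / (e2 + background))

B = 200.0  # hypothetical additive background

# E1 is always twice E2, so the true log2 ratio is 1 at every level
for e2 in (10.0, 100.0, 1000.0, 10000.0):
    e1 = 2.0 * e2
    print(f"E2={e2:>8.0f}  observed log2 ratio={observed_log_ratio(e1, e2, B):.3f}")
```

At low expression the observed log ratio is pulled toward 0 (ratio near 1); at high expression it approaches the true value.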
![Page 22: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/22.jpg)
By using the log-scale transformation before analyzing microarray data, investigators have, implicitly or explicitly, assumed a multiplicative measurement error model (Dudoit et al., 2002; Newton et al., 2001; Kerr et al., 200; Wolfinger et al., 2001). The fact, seen in Figure 2, that observed intensity increases linearly with concentration in the original scale but not in the log-scale suggests that background noise is additive with non-zero mean. Durbin et al. (2002), Huber et al. (2002), Cui, Kerr, and Churchill (2003), and Irizarry et al. (2003a) have proposed additive-background-multiplicative-measurement-error models for intensities read from microarray scanners.
PM intensities
![Page 23: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/23.jpg)
Affymetrix GeneChip Design
5’ 3’
Reference sequence: …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)
Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)
NSB = non-specific binding; SB = specific binding.
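The relationship between the two 25-mers on this slide can be sketched in a few lines: the MM probe is the PM probe with its middle (13th) base replaced by the complementary base. The function name `mismatch_probe` is made up for this illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM probe: the PM probe with its middle base complemented."""
    middle = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "GTACTACCCAGTCTTCCGGAGGCTA"  # PM probe from the slide
mm = mismatch_probe(pm)

# Positions (1-based) at which PM and MM differ
differences = [i + 1 for i, (p, m) in enumerate(zip(pm, mm)) if p != m]
print(mm, differences)
```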
![Page 24: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/24.jpg)
Motivation: PM - MM
The hope is that:
PM = B + S
MM = B
PM – MM = S
But this is not correct!
![Page 25: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/25.jpg)
Why not subtract MM?
MM is too much:
S = signal; B = background
At low S (A), S − B ≈ 0, so:
(S1 − B1) / (S2 − B2) (M) is highly unstable.
![Page 26: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/26.jpg)
Why not subtract MM?
![Page 27: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/27.jpg)
Why not subtract MM?
![Page 28: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/28.jpg)
Background: Solutions
![Page 29: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/29.jpg)
Simulation
• We create some feature level data for two replicate arrays
• Then compute Y=log(PM-kMM) for each array
• We make an MA plot using the Ys for each array
• We make an observed concentration versus known concentration plot
• We do this for various values of k. The following “movie” shows k moving from 0 to 1.
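A rough sketch of this kind of simulation, under assumptions that are not from the slides: a lognormal background around 100 units for both PM and MM, multiplicative noise on the signal, and 500 features at each known log2 level. It reports how the spread of M at low concentrations and the number of undefined values change with k.

```python
import math
import random
import statistics

def make_array(levels, rng, bg=100.0):
    """Feature-level data for one array: a (PM, MM) pair per known log2 level."""
    pairs = []
    for level in levels:
        signal = 2.0 ** level * math.exp(rng.gauss(0.0, 0.1))
        pm = signal + bg * math.exp(rng.gauss(0.0, 0.2))  # signal plus background
        mm = bg * math.exp(rng.gauss(0.0, 0.2))           # background only
        pairs.append((pm, mm))
    return pairs

def log_adjusted(pairs, k):
    """Y = log2(PM - k*MM); None where the difference is not positive."""
    out = []
    for pm, mm in pairs:
        diff = pm - k * mm
        out.append(math.log2(diff) if diff > 0 else None)
    return out

rng = random.Random(0)
levels = [lvl for lvl in range(0, 13) for _ in range(500)]
array1 = make_array(levels, rng)  # two replicate arrays
array2 = make_array(levels, rng)

def summarize(k, low_level=2):
    """SD of M = Y1 - Y2 at low known levels, and count of undefined features."""
    y1, y2 = log_adjusted(array1, k), log_adjusted(array2, k)
    m_low = [a - b for lvl, a, b in zip(levels, y1, y2)
             if lvl <= low_level and a is not None and b is not None]
    undefined = sum(1 for a, b in zip(y1, y2) if a is None or b is None)
    return statistics.pstdev(m_low), undefined

for k in (0.0, 0.25, 0.5, 0.75, 1.0):
    sd_low, undefined = summarize(k)
    print(f"k={k:.2f}  SD of M at low levels={sd_low:.2f}  undefined={undefined}")
```

With k=0 every value is defined and replicate log ratios are tight; as k approaches 1, log ratios at low concentrations become much noisier and some features have no defined log value at all.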
![Page 30: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/30.jpg)
k=0
[Plots: observed level (log2) vs. known level (log2); log2(ratio) vs. log2(intensity)]
![Page 31: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/31.jpg)
k=1/4
[Plots: observed level (log2) vs. known level (log2); log2(ratio) vs. log2(intensity)]
![Page 32: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/32.jpg)
k=1/2
[Plots: observed level (log2) vs. known level (log2); log2(ratio) vs. log2(intensity)]
![Page 33: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/33.jpg)
k=3/4
[Plots: observed level (log2) vs. known level (log2); log2(ratio) vs. log2(intensity)]
![Page 34: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/34.jpg)
k=1
[Plots: observed level (log2) vs. known level (log2); log2(ratio) vs. log2(intensity)]
![Page 35: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/35.jpg)
Real Data
MAS 5.0 RMA
![Page 36: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/36.jpg)
RMA: The Basic Idea
PM=B+S
Observed: PM
Of interest: S
Pose a statistical model and use it to predict S from the observed PM
![Page 37: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/37.jpg)
The Basic Idea
PM=B+S
• A mathematically convenient, useful model
– $B \sim \mathrm{Normal}(\mu, \sigma)$, $S \sim \mathrm{Exponential}(\lambda)$
– No MM
– Borrowing strength across probes
$\hat{S} = E[S \mid PM]$
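For a normal background and an exponential signal, E[S | PM] has a closed form: the posterior of S is a normal centered at a = PM − mean − rate·sd², truncated to S > 0. The sketch below assumes an untruncated normal background and takes the three parameters as given; the names `mu`, `sigma`, and `rate` and the numeric values are hypothetical, and in practice the parameters would be estimated from the data. It is an illustration of the model on this slide, not the code of any particular package. A brute-force integral is included as a cross-check.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(pm, mu, sigma, rate):
    """E[S | PM] for PM = B + S, B ~ Normal(mu, sd=sigma), S ~ Exponential(rate)."""
    a = pm - mu - rate * sigma ** 2
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

def expected_signal_numeric(pm, mu, sigma, rate, steps=20000):
    """Same quantity by trapezoid integration over s > 0, as a cross-check."""
    upper = max(pm - mu, 0.0) + 12.0 * sigma
    h = upper / steps
    num = den = 0.0
    for i in range(steps + 1):
        s = i * h
        w = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-rate * s) * normal_pdf((pm - s - mu) / sigma)
        num += w * s * density
        den += w * density
    return num / den

mu, sigma, rate = 100.0, 20.0, 0.01  # hypothetical parameter values
for pm in (60.0, 150.0, 400.0):
    print(pm, expected_signal(pm, mu, sigma, rate),
          expected_signal_numeric(pm, mu, sigma, rate))
```

The estimate is always positive, even when PM is below the mean background, and for bright probes it is close to PM minus the mean background.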
![Page 38: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/38.jpg)
![Page 39: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/39.jpg)
Notice improved precision but worse accuracy
![Page 40: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/40.jpg)
Problem
• Global background correction ignores probe-specific NSB
• MM have problems
• Another possibility: Use probe sequence
![Page 41: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/41.jpg)
Probe-specific Background
![Page 42: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/42.jpg)
G-C content effect in PM’s
Boxplots of log intensities from the array hybridized to Yeast DNA for strata of probes defined by their G-C content. Probes with 6 or less G-C are grouped together. Probes with 20 or more are grouped together as well. Smooth density plots are shown for the strata with G-C contents of 6,10,14, and 18.
Any given probe will have some propensity to non-specific binding. As described in Section 2.3 and demonstrated in Figure 3, this tends to be directly related to its G-C content. We propose a statistical model that describes the relationship between the PM, MM, and probes of the same G-C content.
![Page 43: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/43.jpg)
General Model (GCRMA)
$$PM_{gij} = O_i^{PM} + \underbrace{\exp\!\left(h_i(\alpha_j^{PM}) + b_{gj}^{PM} + \varepsilon_{gij}^{PM}\right)}_{\text{NSB}} + \underbrace{\exp\!\left(f_i(\alpha_j) + \theta_{gi} + \delta_{gij}\right)}_{\text{SB}}$$
$$MM_{gij} = O_i^{MM} + \exp\!\left(h_i(\alpha_j^{MM}) + b_{gj}^{MM} + \varepsilon_{gij}^{MM}\right)$$
We can calculate:
$$E[\theta_{gi} \mid PM_{gij}, MM_{gij}]$$
Due to the associated variance with the measured MM intensities we argue that one data point is not enough to obtain a useful adjustment. In this paper we propose using probe sequence information to select other probes that can serve the same purpose as the MM pair. We do this by defining subsets of the existing MM probes with similar hybridization properties. We therefore propose to use subsets of probes with the same G-C content as a population of MM probes that can be considered pseudo-MM for all PM with the same G-C content.
![Page 44: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/44.jpg)
The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal log2 fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots. The smooth lines are 3SDs away with SD depending on log expression.
![Page 45: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/45.jpg)
![Page 46: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/46.jpg)
Naef & Magnasco (2003),PHYSICAL REVIEW E 68, 011906, 2003
Another sequence effect in PM’s and MM’s
![Page 47: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/47.jpg)
Another sequence effect in PM’s and MM’s
We show in Fig. 2 joint probability distributions of PMs and MMs, obtained from all probe pairs in a large set of experiments. Actually, two separate probability distributions are superimposed: in red, the distribution for all probe pairs whose 13th letter is a purine, and in cyan those whose 13th letter is a pyrimidine. The plot clearly shows two distinct branches in two colors, corresponding to the basic distinction between the shapes of the bases: purines are large, double ringed nucleotides while pyrimidines have smaller single rings. This underscores that by replacing the middle letter of the PM with its complementary base, the situation on the MM probe is that the middle letter always faces itself, leading to two quite distinct outcomes according to the size of the nucleotide. If the letter is a purine, there is no room within an undistorted backbone for two large bases, so this mismatch distorts the geometry of the double helix, incurring a large steric and stacking cost. But if the letter is a pyrimidine, there is room to spare, and the bases just dangle. The only energy lost is that of the hydrogen bonds.
Naef & Magnasco (2003),PHYSICAL REVIEW E 68, 011906, 2003
![Page 48: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/48.jpg)
C and T are pyrimidines (small).
A and G are purines (large).
![Page 49: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/49.jpg)
Another sequence effect in PM’s
Naef & Magnasco (2003), PHYSICAL REVIEW E 68, 011906, 2003
The asymmetry of (A,T) and (G,C) affinities in Fig. 3 can be explained because only A-U and G-C bonds carry labels (pyrimidines U and C on the mRNA are labeled). Carrying labels inside the probe region is unfavorable because labels interfere with binding. (Remember also that G-C pairs have 3 and A-T pairs have 2 hydrogen bonds!).
C G T A
CG’s have 3 H bonds
U and C on the mRNA are labeled (A and G in probe), and this label can interfere with binding.
Both these effects are greater when at the center of the hybrid.
![Page 50: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/50.jpg)
Why not subtract MM?
![Page 51: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/51.jpg)
Two color platforms (Agilent, cDNA)
• Common to have just one feature per gene
• 60 vs. 25 NT?
• Optical noise still a concern
• After spots are identified, a measure of local background is obtained from area around spot
(this is also applicable to some spotted one-channel data)
![Page 52: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/52.jpg)
Local background
---- GenePix
---- QuantArray
---- ScanAnalyze
![Page 53: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/53.jpg)
Two color feature level data
• Red and Green foreground and background obtained from each feature
• We have $Rf_{gij}$, $Gf_{gij}$, $Rb_{gij}$, $Gb_{gij}$ (g is gene, i is array and j is replicate)
• A default summary statistic is the log-ratio:
$\log_2\left[(Rf - Rb) / (Gf - Gb)\right]$
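A minimal sketch of this summary statistic for a single feature. The foreground and background values are made up. Returning `None` when a background-subtracted channel is not positive is one possible convention, chosen here only because the log is undefined in that case.

```python
import math

def log_ratio(rf, rb, gf, gb):
    """log2[(Rf - Rb) / (Gf - Gb)], or None if a corrected channel is not positive."""
    red, green = rf - rb, gf - gb
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Made-up (Rf, Rb, Gf, Gb) values for two features
bright = log_ratio(1200.0, 200.0, 700.0, 200.0)  # (1000 / 500) -> log2 ratio of 1
dim = log_ratio(260.0, 250.0, 240.0, 250.0)      # green channel falls below background
print(bright, dim)
```

Dim features are where background subtraction causes trouble: the corrected values are close to zero or negative, so the log ratio is either undefined or very unstable.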
![Page 54: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/54.jpg)
Panels: Background subtraction, No background subtraction
![Page 55: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/55.jpg)
Diagnostics: images of Rb, Gb, scatterplot of log2 (Rf/Gf) vs. log2(Rb/Gb)
![Page 56: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/56.jpg)
Correlation may be spatially dependent
![Page 57: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/57.jpg)
Two color platforms
• Again, we can assess the tradeoff of accuracy and precision via simulation
• Simulation uses a self versus self (SVS) hybridization experiment -- no differential expression should occur.
• Mean squared error (MSE) = bias^2 + variance.
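The decomposition can be checked on a toy self-versus-self simulation, where the true log ratio of every feature is 0. Everything about the simulated data (background levels, noise sizes, a noisy local background estimate) is an assumption made for illustration, so which estimator has the lower MSE here says nothing about real arrays.

```python
import math
import random
import statistics

def mse_parts(estimates, truth=0.0):
    """Return (MSE, bias, variance) of a list of estimates of `truth`."""
    errors = [e - truth for e in estimates]
    bias = statistics.mean(errors)
    variance = statistics.pvariance(estimates)
    mse = statistics.mean([err ** 2 for err in errors])
    return mse, bias, variance

rng = random.Random(1)
with_sub, without_sub = [], []
for _ in range(5000):
    signal = math.exp(rng.gauss(6.0, 1.0))            # same true signal in both channels
    rb_true = 100.0 * math.exp(rng.gauss(0.0, 0.3))   # additive red background
    gb_true = 80.0 * math.exp(rng.gauss(0.0, 0.3))    # additive green background
    rf = signal * math.exp(rng.gauss(0.0, 0.1)) + rb_true
    gf = signal * math.exp(rng.gauss(0.0, 0.1)) + gb_true
    rb = rb_true * math.exp(rng.gauss(0.0, 0.3))      # noisy local background estimates
    gb = gb_true * math.exp(rng.gauss(0.0, 0.3))
    without_sub.append(math.log2(rf / gf))
    if rf - rb > 0 and gf - gb > 0:                   # log undefined otherwise
        with_sub.append(math.log2((rf - rb) / (gf - gb)))

for name, estimates in (("no subtraction", without_sub), ("subtraction", with_sub)):
    mse, bias, variance = mse_parts(estimates)
    print(f"{name:>15}: MSE={mse:.4f}  bias^2={bias ** 2:.4f}  variance={variance:.4f}")
```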
![Page 58: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/58.jpg)
Lower MSE with NBS if correlation < 0.2
![Page 59: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/59.jpg)
Background Subtraction: Conclusions
• A procedure that subtracts local background as a function of the correlation of fg and bg ratios may be a nice compromise between background subtraction and no background subtraction.
• For references, see the background subtraction paper by C. Kooperberg, J Computational Biol, 2002.
• The limma package in R has many useful functions for background subtraction.
• Following the decision to background subtract, we need to consider a normalization algorithm.
![Page 60: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/60.jpg)
Normalization
![Page 61: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/61.jpg)
Normalization
• Normalization is needed to ensure that differences in intensities are indeed due to differential expression, and not some printing, hybridization, or scanning artifact.
• Normalization is necessary before any analysis that involves within-slide or between-slide comparisons of intensities, e.g., clustering, testing.
• Somewhat different approaches are used in two-color and one-color technologies
![Page 62: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/62.jpg)
Varying distributions of intensities from each microarray.
![Page 63: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/63.jpg)
Distributions of intensities after global mean normalization.
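Global mean normalization can be sketched as a single shift per array. The sketch assumes log intensities stored as one list per array and uses made-up values; shifting every array to the grand mean is one common choice of target.

```python
import statistics

def global_mean_normalize(arrays):
    """Shift each array of log intensities so that all arrays share the same mean."""
    array_means = [statistics.mean(a) for a in arrays]
    target = statistics.mean(array_means)
    return [[v - m + target for v in a] for a, m in zip(arrays, array_means)]

# Made-up log2 intensities for three arrays (one list per array)
log_intensities = [[6.0, 8.0, 10.0], [7.0, 9.5, 12.0], [5.0, 7.0, 8.0]]
normalized = global_mean_normalize(log_intensities)
print([statistics.mean(a) for a in normalized])
```

Only the location of each distribution changes; differences within an array, and therefore the spread and shape of its distribution, are left untouched.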
![Page 64: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/64.jpg)
What does this normalization mean in Int vs. Int, or Ratio vs. Int space?
![Page 65: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/65.jpg)
Distributions of intensities after global mean normalization – global mean normalization is not enough …
Possible solutions:
Local Mean Normalization
Quantile Normalization
![Page 66: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/66.jpg)
Local Mean Normalization (loess):
Adjusts for intensity-dependent bias in ratios.
Requires Comparison!
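The idea of local mean normalization can be sketched with a running mean standing in for the loess fit; this is a simpler smoother, swapped in only to keep the example self-contained. Each log ratio M has the average M of its neighbours in log intensity A subtracted. The simulated data, the linear intensity-dependent bias, and the window size are all made up.

```python
import random

def local_mean_normalize(a_values, m_values, window=101):
    """Subtract from each M the mean M of its `window` nearest neighbours in A-rank."""
    n = len(a_values)
    order = sorted(range(n), key=lambda i: a_values[i])
    half = window // 2
    normalized = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(n, rank + half + 1)
        neighbours = [m_values[j] for j in order[lo:hi]]
        normalized[i] = m_values[i] - sum(neighbours) / len(neighbours)
    return normalized

rng = random.Random(2)
A = [rng.uniform(6.0, 14.0) for _ in range(2000)]     # log intensity
M = [0.25 * (a - 10.0) + rng.gauss(0.0, 0.2) for a in A]  # log ratio with a trend in A
M_norm = local_mean_normalize(A, M)

def mean_where(values, keep):
    picked = [v for v, k in zip(values, keep) if k]
    return sum(picked) / len(picked)

low = [a < 10.0 for a in A]
high = [a >= 10.0 for a in A]
print(mean_where(M, low), mean_where(M, high))
print(mean_where(M_norm, low), mean_where(M_norm, high))
```

Because the correction is estimated from the log ratios themselves, it needs a comparison (two channels or two arrays) and assumes that most features are not differentially expressed at any given intensity.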
![Page 67: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/67.jpg)
![Page 68: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/68.jpg)
![Page 69: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/69.jpg)
Loess
![Page 70: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/70.jpg)
Loess
![Page 71: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/71.jpg)
Loess
![Page 72: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/72.jpg)
Loess
![Page 73: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/73.jpg)
Loess
![Page 74: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/74.jpg)
Loess
![Page 75: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/75.jpg)
Quantile Normalization
![Page 76: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/76.jpg)
Quantile normalization
• Quantile normalization is commonly used because it's fast and conceptually simple
• Basic idea:
  – order values in each array
  – take the average across arrays at each rank
  – substitute each probe intensity with the average for its rank
  – put back in original order
![Page 77: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/77.jpg)
Example of quantile normalization
(rows are probes, columns are arrays)

Original     Ordered      Averaged     Re-ordered
2  4  4      2  3  4      3  3  3      3  5  3
5  4 14      3  4  8      5  5  5      8  5  8
4  6  8      3  4  8      5  5  5      6  8  5
3  5  8      4  5  9      6  6  6      5  6  5
3  3  9      5  6 14      8  8  8      5  3  6
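The four steps can be sketched on the example above, with each array stored as one list. The function name is made up, and ties are simply left in sorted order, which is enough for this example. Note that the largest rank averages to (5 + 6 + 14) / 3 ≈ 8.33, which the slide shows rounded to 8.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of arrays (one list of probe values per array)."""
    n = len(columns[0])
    # 1. Order the values in each array (keep the sorting index)
    orders = [sorted(range(n), key=lambda i: col[i]) for col in columns]
    # 2. Average across arrays at each rank
    rank_means = [
        sum(col[order[r]] for col, order in zip(columns, orders)) / len(columns)
        for r in range(n)
    ]
    # 3. and 4. Substitute each value with its rank average, in the original order
    normalized = []
    for order in orders:
        out = [0.0] * n
        for r, i in enumerate(order):
            out[i] = rank_means[r]
        normalized.append(out)
    return normalized

# The three arrays (columns) from the example
arrays = [[2, 5, 4, 3, 3], [4, 4, 6, 5, 3], [4, 14, 8, 8, 9]]
normalized = quantile_normalize(arrays)
for probe_row in zip(*normalized):
    print([round(v, 2) for v in probe_row])
```

After normalization every array contains exactly the same set of values; only their assignment to probes differs.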
![Page 78: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/78.jpg)
Before Quantile Normalization
![Page 79: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/79.jpg)
After Quantile Normalization
A worry is that it overcorrects
![Page 80: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/80.jpg)
QC
![Page 81: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/81.jpg)
![Page 82: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/82.jpg)
![Page 83: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/83.jpg)
![Page 84: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/84.jpg)
![Page 85: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/85.jpg)
Print-tip Effect
![Page 86: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/86.jpg)
Print-tip Loess
![Page 87: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/87.jpg)
Plate effect
![Page 88: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/88.jpg)
Bad Plate Effect
![Page 89: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/89.jpg)
Bad Plate Effect
![Page 90: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/90.jpg)
Print Order Effect
![Page 91: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/91.jpg)
Microarray Pseudo Images: Intensity
![Page 92: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/92.jpg)
Microarray Pseudo Images: Ratios
![Page 93: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/93.jpg)
Images of probe level data
This is the raw data
![Page 94: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/94.jpg)
Images of probe level data
Residuals (or weights) from probe level model fits show problem clearly
![Page 95: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/95.jpg)
Hybridization Artifacts
![Page 96: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/96.jpg)
![Page 97: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/97.jpg)
![Page 98: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/98.jpg)
PCA, MDS, and Clustering:
Dimension Reduction to Detect Experimental Artifacts and Biological Effects
![Page 99: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/99.jpg)
Principal Components Analysis (PCA)
and
Multi-Dimensional Scaling (MDS)
![Page 100: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/100.jpg)
PCA
![Page 101: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/101.jpg)
MDS
![Page 102: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/102.jpg)
![Page 103: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/103.jpg)
![Page 104: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/104.jpg)
![Page 105: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/105.jpg)
![Page 106: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/106.jpg)
Uncorrected Intensities: MDS Colored by Batch
![Page 107: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/107.jpg)
Removing The Batch Effect
Much Like Red:Green Analysis
![Page 108: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/108.jpg)
Uncorrected Intensities: MDS Colored by Batch
![Page 109: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/109.jpg)
Batch Subtracted Measures: MDS Colored by Batch
![Page 110: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/110.jpg)
MDS of All Array Experiments: Subject Replicates
![Page 111: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/111.jpg)
![Page 112: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/112.jpg)
AGE
?
![Page 113: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/113.jpg)
![Page 114: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/114.jpg)
![Page 115: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/115.jpg)
![Page 116: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/116.jpg)
![Page 117: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/117.jpg)
![Page 118: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/118.jpg)
![Page 119: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/119.jpg)
![Page 120: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/120.jpg)
AGE
RNA Quality
![Page 121: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/121.jpg)
AGE
Batch
![Page 122: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/122.jpg)
Surrogate Variable Analysis:
Removing Unwanted/Unknown
Effects
![Page 123: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/123.jpg)
Surrogate Variable Analysis:
Requires the definition of the effect you are interested in (e.g. Exp. vs. Con., age, etc.).
Removes “unexplained” variance in gene expression data.
PCA-based (requires a data matrix with no missing values).
Quite a “strong” data clean-up method.
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007 Sep;3(9):1724-35.
![Page 124: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/124.jpg)
Surrogate Variable Analysis:
Take residuals from the model defining the effect of interest.
Run PCA (SVD) on the residual matrix.
Use the top ? PCs and determine the genes associated with these PCs. Each PC will be used to generate an SV (surrogate variable).
Generate SVs using data from these genes in the original data matrix (not the residual matrix).
Incorporate SVs in all subsequent analyses, e.g. as covariates in regression analysis.
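The steps above can be sketched in a few lines of NumPy. This is a simplified illustration and not the published algorithm of Leek and Storey, which chooses the number of surrogate variables and the associated genes by statistical testing. Here the number of PCs (`n_sv`) and the number of genes per PC (`n_top_genes`) are simply passed in by the user, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_residuals(expr, design):
    """Residuals of each gene (row) after a least-squares fit to the design."""
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return expr - (design @ coef).T

def sketch_surrogate_variables(expr, design, n_sv, n_top_genes=50):
    resid = fit_residuals(expr, design)                    # residuals from the model
    u, s, vt = np.linalg.svd(resid, full_matrices=False)   # PCA (SVD) on residuals
    svs = []
    for k in range(n_sv):
        # Genes with the largest absolute loading on residual PC k.
        top = np.argsort(np.abs(u[:, k]))[::-1][:n_top_genes]
        # Return to the ORIGINAL data for those genes (not the residuals).
        sub = expr[top] - expr[top].mean(axis=1, keepdims=True)
        _, _, vt_sub = np.linalg.svd(sub, full_matrices=False)
        # Keep the component of the subset that best matches residual PC k.
        best = np.argmax(np.abs(vt_sub @ vt[k]))
        sv = vt_sub[best]
        svs.append(sv if sv @ vt[k] >= 0 else -sv)
    return np.column_stack(svs)

# Simulated (hypothetical) data: a group effect plus a hidden batch-like factor.
rng = np.random.default_rng(1)
n_genes, n_samples = 300, 20
group = np.repeat([0.0, 1.0], n_samples // 2)    # effect of interest
hidden = np.tile([0.0, 1.0], n_samples // 2)     # unknown to the analyst
expr = rng.normal(size=(n_genes, n_samples))
expr[:30] += 1.5 * group
expr[100:200] += rng.normal(scale=2.0, size=(100, 1)) * hidden

design = np.column_stack([np.ones(n_samples), group])
svs = sketch_surrogate_variables(expr, design, n_sv=1)

# Use the SVs as covariates in the regression for the effect of interest.
design_sv = np.column_stack([design, svs])
coef, *_ = np.linalg.lstsq(design_sv, expr.T, rcond=None)
group_effect = coef[1]   # per-gene group effect, adjusted for the SV
```

In this simulation the estimated surrogate variable should track the hidden factor closely, which is why adding it as a covariate can reduce unexplained variance in the group comparison.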
![Page 125: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/125.jpg)
BEFORE CORRECTION
![Page 126: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/126.jpg)
BEFORE CORRECTION
![Page 127: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/127.jpg)
BEFORE CORRECTION
![Page 128: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/128.jpg)
BEFORE CORRECTION
![Page 129: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/129.jpg)
AFTER CORRECTION
![Page 130: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/130.jpg)
AFTER CORRECTION
![Page 131: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/131.jpg)
AFTER CORRECTION
![Page 132: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/132.jpg)
AFTER CORRECTION
![Page 133: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/133.jpg)
![Page 134: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/134.jpg)
![Page 135: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/135.jpg)
Biological Effects:
Tissue Types and Growth Factor Treatments
Make sure your normalization and QC methods 1] preserve what you are looking for, 2] remove what you don’t want, and 3] don’t introduce artifacts.
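One hedged way to check the first two points is to measure how much of each gene's variance is explained by a biological factor and by a technical factor, before and after correction. The sketch below uses the average per-gene R² from group means. The function name `mean_r2`, the simulated tissue and batch labels, and the simple batch-mean correction are assumptions made for this example only.

```python
import numpy as np

def mean_r2(expr, labels):
    """Average over genes of the variance fraction explained by group means."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    centered = expr - expr.mean(axis=1, keepdims=True)
    total = (centered ** 2).sum(axis=1)
    fitted = np.zeros_like(centered)
    for g in np.unique(labels):
        cols = labels == g
        fitted[:, cols] = centered[:, cols].mean(axis=1, keepdims=True)
    explained = (fitted ** 2).sum(axis=1)
    return float(np.mean(explained / total))

# Simulated (hypothetical) balanced design: 2 tissues x 2 batches, 16 arrays.
rng = np.random.default_rng(2)
tissue = np.repeat([0, 1], 8)
batch = np.tile([0, 1], 8)
expr = rng.normal(size=(100, 16))
expr += rng.normal(scale=1.5, size=(100, 1)) * tissue   # biology to preserve
expr += rng.normal(scale=1.5, size=(100, 1)) * batch    # artifact to remove

# A simple correction for illustration: subtract per-gene batch means.
corrected = expr.copy()
for b in (0, 1):
    corrected[:, batch == b] -= expr[:, batch == b].mean(axis=1, keepdims=True)

before = {"tissue": mean_r2(expr, tissue), "batch": mean_r2(expr, batch)}
after = {"tissue": mean_r2(corrected, tissue), "batch": mean_r2(corrected, batch)}
```

Here the batch R² should drop to essentially zero while the tissue R² should not decrease. Spotting newly introduced artifacts (the third point) usually still requires looking at plots such as MDS or PCA after correction.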
![Page 136: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/136.jpg)
Illumina 24K
![Page 137: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/137.jpg)
![Page 138: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/138.jpg)
![Page 139: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/139.jpg)
![Page 140: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/140.jpg)
![Page 141: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/141.jpg)
![Page 142: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/142.jpg)
![Page 143: Carlo Colantuoni carlo@illuminatobiotech](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56815414550346895dc2104f/html5/thumbnails/143.jpg)