The second-simplest cDNA microarray data analysis problem
![Page 1: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/1.jpg)
The second-simplest cDNA microarray data analysis problem
Terry Speed, UC Berkeley
Bioinformatic Strategies For Application of Genomic Tools to Environmental Health
Research, March 5, 2001
NIEHS National Center for Toxicogenomics NCSU Bioinformatics Research Center
![Page 2: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/2.jpg)
Workflow diagram:
• Biological question: differentially expressed genes, sample class prediction, etc.
• Experimental design
• Microarray experiment, producing 16-bit TIFF files
• Image analysis, producing (Rfg, Rbg), (Gfg, Gbg)
• Normalization, producing R, G
• Estimation, testing, clustering, discrimination
• Biological verification and interpretation
![Page 3: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/3.jpg)
Some motherhood statements
Important aspects of a statistical analysis include:
• Tentatively separating systematic from random sources of variation
• Removing the former and quantifying the latter, when the system is in control
• Identifying and dealing with the most relevant source of variation in subsequent analyses
Only if this is done can we hope to make more or less valid probability statements.
![Page 4: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/4.jpg)
The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide
• This is a common enough hope
• Efforts are frequently successful
• It is not hard to do by eye
• The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future, and here’s why
![Page 5: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/5.jpg)
An M vs. A plot
M = log2(R / G)
A = log2(R*G) / 2
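The two definitions above translate directly into code. The following is a minimal sketch; the function name `ma_values` and the example intensities are hypothetical and are not taken from the talk.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take logs")
    m = math.log2(red / green)        # M: log ratio
    a = math.log2(red * green) / 2    # A: average log intensity
    return m, a

# Hypothetical (R, G) pairs, for illustration only.
spots = [(1024.0, 256.0), (512.0, 512.0), (100.0, 400.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"R={red:g} G={green:g} M={m:.3f} A={a:.3f}")
```

Plotting M against A rather than log R against log G rotates the scatter by 45 degrees, which makes intensity-dependent trends in the log ratio easier to see.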
![Page 6: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/6.jpg)
Background matters
Panels: from Spot; from GenePix
![Page 7: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/7.jpg)
From the NCI60 data set (Stanford web site)
Panels: no background correction; with background correction
![Page 8: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/8.jpg)
An experiment having within-slide replicates
![Page 9: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/9.jpg)
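Background makes a difference

| Background method | Segmentation method | Exp1 | Exp2 |
|---|---|---|---|
| No background | S.nbg | 6 | 6 |
| No background | Gp.nbg | 7 | 6 |
| No background | SA.nbg | 6 | 6 |
| No background | QA.fix.nbg | 7 | 6 |
| No background | QA.hist.nbg | 7 | 6 |
| No background | QA.adp.nbg | 14 | 14 |
| Local surrounding | S.valley | 17 | 21 |
| Local surrounding | GP | 11 | 11 |
| Local surrounding | SA | 12 | 14 |
| Local surrounding | QA.fix | 18 | 23 |
| Local surrounding | QA.hist | 9 | 8 |
| Local surrounding | QA.adp | 27 | 26 |
| Others | S.morph | 9 | 9 |
| Others | S.const | 14 | 14 |

Medians of the SD of log2(R/G) for 8 replicated spots, multiplied by 100 and rounded to the nearest integer.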
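The summary described in the caption can be sketched as follows. This is an illustration only: the function name, the input layout (a dict from gene to a list of replicate (R, G) pairs) and the numbers are hypothetical, and the use of the sample SD (divisor n - 1) is an assumption, since the slide does not say which SD was used.

```python
import math
import statistics

def replicate_spread_score(red_green_by_gene):
    """Median over genes of SD(log2(R/G)) across replicate spots, x100, rounded."""
    sds = []
    for replicates in red_green_by_gene.values():
        log_ratios = [math.log2(r / g) for r, g in replicates]
        sds.append(statistics.stdev(log_ratios))  # sample SD (assumption)
    return round(100 * statistics.median(sds))

# Hypothetical replicate intensities for three genes.
example = {
    "g1": [(100.0, 100.0)] * 4,
    "g2": [(200.0, 100.0), (100.0, 200.0)],
    "g3": [(100.0, 100.0), (400.0, 100.0)],
}
print(replicate_spread_score(example))
```

A smaller score means the replicated spots agree more closely, which is how the table compares background and segmentation methods.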
![Page 10: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/10.jpg)
Normalisation - lowess
• Global lowess (Matt Callow’s data, LBNL)
• Assumption: changes roughly symmetric at all intensities.
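Global lowess normalisation fits a smooth curve to M as a function of A and subtracts it from M. The sketch below uses a simplified local linear smoother with tricube weights and no robustness iterations, so it is a stand-in for lowess rather than the procedure used in the talk. The function names and the span `frac` are hypothetical; the slides do not give a span.

```python
def local_linear_fit(x, y, frac=0.4):
    """Simplified lowess: tricube-weighted local line at each x, no robustness steps."""
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))   # neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12              # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)            # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted

def normalise_global(m, a, frac=0.4):
    """Subtract the fitted intensity-dependent trend from each M value."""
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical data: a pure intensity-dependent dye bias, no real changes.
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.5 + 0.1 * ai for ai in a_vals]
normalised = normalise_global(m_vals, a_vals)
```

Because the curve is fitted to all spots, the method relies on the stated assumption that up and down changes are roughly symmetric at every intensity.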
![Page 11: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/11.jpg)
From the NCI60 data set (Stanford web site)
![Page 12: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/12.jpg)
Ngai lab, UCB
![Page 13: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/13.jpg)
Tiago’s data from the Goodman lab, UCB
![Page 14: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/14.jpg)
From the Ernest Gallo Clinic & Research Center
![Page 15: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/15.jpg)
From the Peter MacCallum Cancer Research Institute, Australia
![Page 16: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/16.jpg)
Normalisation - print tip
Assumption: For every print-tip group, changes roughly symmetric at all intensities.
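Print-tip normalisation applies the within-slide fit separately to the spots from each print-tip group. The sketch below shows only that grouping logic. Its default `median_trend` is a deliberately crude stand-in (a constant at the group median) for the lowess curve, and every name here is hypothetical.

```python
import statistics
from collections import defaultdict

def median_trend(a_values, m_values):
    """Stand-in for a lowess curve: a constant fit at the group median of M."""
    centre = statistics.median(m_values)
    return [centre] * len(m_values)

def normalise_by_print_tip(m, a, tip, fit=median_trend):
    """Fit and subtract a trend separately within each print-tip group."""
    groups = defaultdict(list)
    for idx, t in enumerate(tip):
        groups[t].append(idx)
    out = [0.0] * len(m)
    for idx_list in groups.values():
        trend = fit([a[i] for i in idx_list], [m[i] for i in idx_list])
        for i, ti in zip(idx_list, trend):
            out[i] = m[i] - ti
    return out

# Hypothetical spots from two print-tip groups with different offsets.
m_vals = [1.0, 1.2, 0.8, -0.5, -0.3, -0.7]
a_vals = [8.0, 9.0, 10.0, 8.0, 9.0, 10.0]
tips = [1, 1, 1, 2, 2, 2]
centred = normalise_by_print_tip(m_vals, a_vals, tips)
```

Passing a smoother of M on A as `fit` would give an intensity-dependent correction per group; the median version only removes a per-group offset.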
![Page 17: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/17.jpg)
M vs A after print-tip normalisation
![Page 18: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/18.jpg)
Normalization (ctd) Another data set
• After within slide global lowess normalization.
• Likely to be a spatial effect.
Figure: log-ratios (vertical axis) by print-tip group (horizontal axis).
![Page 19: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/19.jpg)
Taking scale into account
Assumption: All print-tip groups have the same spread in M.
True log ratio is $\mu_{ij}$, where $i$ represents different print-tip groups and $j$ represents different spots.
Observed is $M_{ij}$, where
$$M_{ij} = a_i \mu_{ij}$$
Robust estimate of $a_i$ is
$$\hat{a}_i = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}, \qquad \mathrm{MAD}_i = \operatorname{median}_j \{\, |y_{ij} - \operatorname{median}_j(y_{ij})| \,\}$$
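The scale estimate above, each group's MAD divided by the geometric mean of all the MADs, can be sketched as follows. The function names and example values are hypothetical, and dividing each log ratio by its group's factor is one reading of the model in which the observed value is the scale factor times the true log ratio.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_factors(groups):
    """MAD of each group divided by the geometric mean of all group MADs."""
    mads = [mad(g) for g in groups]
    if any(m <= 0 for m in mads):
        raise ValueError("every group needs a positive MAD")
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [m / geo_mean for m in mads]

def rescale(groups):
    """Divide each log ratio by its group's scale factor."""
    factors = scale_factors(groups)
    return [[v / a for v in g] for g, a in zip(groups, factors)]

# Hypothetical log ratios for two groups with different spread.
groups = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
print(scale_factors(groups))
print(rescale(groups))
```

Using the geometric mean as the denominator makes the factors multiply to one, so the overall scale of the log ratios is left unchanged.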
![Page 20: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/20.jpg)
Normalization (ctd) That same data set
• After print-tip location and scale normalization.
• Incorporate quality measures.
Figure: log-ratios (vertical axis) by print-tip group (horizontal axis).
![Page 21: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/21.jpg)
Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single slide method
![Page 22: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/22.jpg)
Matt Callow’s Srb1 dataset (#8). Newton’s, Sapir & Churchill’s and Chen’s single slide method
![Page 23: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/23.jpg)
The approach of Roberts et al. (Rosetta)
Figure: Genomic DNA vs. Genomic DNA; both axes run from 10 to 100000 on a log scale. Data from Bing Ren.
$$X = \frac{a_1 - a_2}{\sqrt{(\sigma_1^2 + \sigma_2^2) + f^2 (a_1^2 + a_2^2)}}$$
$$P = 2\,(1 - \mathrm{Erf}(|X|))$$
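A minimal sketch of the X statistic and P value shown on this slide. Two readings are assumptions here, not statements from the slide: the denominator is taken to be a square root of the summed variance terms, and "Erf" is read as the standard normal distribution function, because that reading gives P = 1 at X = 0 and keeps P between 0 and 1. The function names and example numbers are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function, built from math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rosetta_x(a1, a2, sigma1, sigma2, f):
    """X = (a1 - a2) / sqrt(sigma1^2 + sigma2^2 + f^2 (a1^2 + a2^2))."""
    variance = sigma1 ** 2 + sigma2 ** 2 + f ** 2 * (a1 ** 2 + a2 ** 2)
    return (a1 - a2) / math.sqrt(variance)

def rosetta_p(x):
    """Two-sided P value, reading the slide's Erf as the normal CDF (assumption)."""
    return 2.0 * (1.0 - normal_cdf(abs(x)))

# Hypothetical intensities and error terms for one spot.
x = rosetta_x(a1=300.0, a2=100.0, sigma1=30.0, sigma2=40.0, f=0.0)
print(x, rosetta_p(x))
```

The `f` term adds an error component that grows with intensity, so bright spots need a larger absolute difference to reach the same X.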
![Page 24: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/24.jpg)
The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides
There are a number of different aspects:
• First, between-slide normalization; then
• What should we look at: averages, SDs, t-statistics, other summaries?
• How should we look at them?
• Can we make valid probability statements?
A report on work in progress
![Page 25: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/25.jpg)
Normalization (ctd) Yet another data set
• Between slides this time (10 here)
• Only small differences in spread apparent
• We often see much greater differences
Figure: log-ratios (vertical axis) by slide (horizontal axis).
![Page 26: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/26.jpg)
The “NCI 60” experiments (no bg)
![Page 27: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/27.jpg)
Taking scale into account
Assumption: All slides have the same spread in M.
True log ratio is $\mu_{ij}$, where $i$ represents different slides and $j$ represents different spots.
Observed is $M_{ij}$, where
$$M_{ij} = a_i \mu_{ij}$$
Robust estimate of $a_i$ is
$$\hat{a}_i = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}, \qquad \mathrm{MAD}_i = \operatorname{median}_j \{\, |y_{ij} - \operatorname{median}_j(y_{ij})| \,\}$$
![Page 28: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/28.jpg)
Which genes are (relatively) up/down regulated?
Two samples, e.g. KO vs. WT or mutant vs. WT.
Design diagram: T and C, n slides.
For each gene form the t statistic:
$$t = \frac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\frac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$$
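The one-sample t statistic above, computed for a single gene from its n log ratios, can be sketched as follows. The function name and the example values are hypothetical.

```python
import math
import statistics

def one_sample_t(ms):
    """t = mean(M) / sqrt(SD(M)^2 / n) for one gene's n log ratios."""
    n = len(ms)
    if n < 2:
        raise ValueError("need at least two replicate slides")
    sd = statistics.stdev(ms)
    if sd == 0:
        raise ValueError("SD is zero; t is undefined")
    return statistics.mean(ms) / math.sqrt(sd ** 2 / n)

# Hypothetical log ratios for one gene on three slides.
print(one_sample_t([1.0, 2.0, 3.0]))
```

With only a handful of slides the SD in the denominator is poorly estimated, so a gene with a small average M can still get a large t when its SD happens to be tiny.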
![Page 29: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/29.jpg)
Which genes are (relatively) up/down regulated?
Two samples with a reference (e.g. pooled control)
Design diagram: T and C*, n slides; C and C*, n slides.
• For each gene form the t statistic:
$$t = \frac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\frac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$$
![Page 30: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/30.jpg)
One factor: more than 2 samples
Samples: Liver tissue from mice treated by cholesterol modifying drugs.
Question 1: Find genes that respond differently between the treatment and the control.
Question 2: Find genes that respond similarly across two or more treatments relative to control.
Design diagram: T1, T2, T3, T4 and C, with each connection marked x 2.
![Page 31: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/31.jpg)
One factor: more than 2 samples
Samples: tissues from different regions of the mouse olfactory bulb.
Question 1: differences between different regions.
Question 2: identify genes with pre-specified patterns across regions.
Design diagram: six regions, T1 to T6.
![Page 32: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/32.jpg)
Two or more factors
6 different experiments at each time point.
Dyeswaps.
4 time points (30 minutes, 1 hour, 4 hours, 24 hours)
2 x 2 x 4 factorial experiment.
Design diagram: ctl, OSM, EGF, OSM & EGF, at each of the 4 times.
![Page 33: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/33.jpg)
Which genes have changed? When permutation testing possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use M=log2(R/G).
2. For each gene form the t statistic:
$$t = \frac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\frac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$$
3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values “off the line”.
5. Permutation testing.
6. Adjust for multiple testing.
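Steps 2, 5 and 6 of the list above can be sketched as follows. The slide does not say which multiple-testing adjustment is meant; this sketch uses a single-step maxT adjustment as one possible choice, and that choice is an assumption. All names, the number of permutations and the data are hypothetical, and equal group sizes are assumed, as in the 8 + 8 design.

```python
import math
import random
import statistics

def two_sample_t(ko, ctl):
    """(mean ko - mean ctl) / sqrt((var ko + var ctl) / n), equal group sizes."""
    n = len(ko)
    if len(ctl) != n:
        raise ValueError("this sketch assumes equal group sizes")
    diff = statistics.mean(ko) - statistics.mean(ctl)
    se = math.sqrt((statistics.variance(ko) + statistics.variance(ctl)) / n)
    if se == 0.0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_pvalues(data, n_ko, n_perm=200, seed=0):
    """Raw and single-step maxT adjusted permutation p-values per gene."""
    rng = random.Random(seed)
    genes = list(data)
    n_total = len(data[genes[0]])
    obs = {g: abs(two_sample_t(data[g][:n_ko], data[g][n_ko:])) for g in genes}
    raw_hits = dict.fromkeys(genes, 0)
    adj_hits = dict.fromkeys(genes, 0)
    for _ in range(n_perm):
        order = list(range(n_total))
        rng.shuffle(order)                 # same slide relabelling for all genes
        perm = {}
        for g in genes:
            vals = [data[g][i] for i in order]
            perm[g] = abs(two_sample_t(vals[:n_ko], vals[n_ko:]))
        max_t = max(perm.values())
        for g in genes:
            raw_hits[g] += perm[g] >= obs[g]
            adj_hits[g] += max_t >= obs[g]
    raw = {g: (raw_hits[g] + 1) / (n_perm + 1) for g in genes}
    adj = {g: (adj_hits[g] + 1) / (n_perm + 1) for g in genes}
    return raw, adj

# Hypothetical M values: first 4 entries are ko slides, last 4 are ctl slides.
data = {
    "A": [2.1, 1.9, 2.3, 2.0, 0.1, -0.2, 0.0, 0.2],
    "B": [0.3, -0.1, 0.2, 0.0, 0.1, 0.2, -0.2, 0.1],
    "C": [-0.2, 0.1, 0.0, 0.3, 0.2, -0.1, 0.1, -0.3],
}
raw_p, adj_p = permutation_pvalues(data, n_ko=4)
```

Permuting slide labels jointly for all genes keeps the correlation between genes intact, which is what allows the maximum over genes to be used for the adjustment.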
![Page 34: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/34.jpg)
Histogram & qq plot
ApoA1
![Page 35: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/35.jpg)
Apo A1: Adjusted and Unadjusted p-values for the 50 genes with the largest absolute t-statistics.
![Page 36: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/36.jpg)
Which genes have changed? Permutation testing not possible
Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes.
We hope in due course to calibrate B and use that as our main tool.
We begin with the motivation, using data from a study in which each slide was replicated four times.
![Page 37: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/37.jpg)
Results from 4 replicates
![Page 38: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/38.jpg)
B=LOR compared
![Page 39: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/39.jpg)
Results from the Apo AI ko experiment
Plot legend: M; t; t and M
![Page 40: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/40.jpg)
Results from the Apo AI ko experiment
Plot legend: M; t; t and M
![Page 41: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/41.jpg)
Empirical Bayes log posterior odds ratio
$$B = \text{const} + \log\left(\frac{\frac{2a}{n} + s^2 + M_\bullet^2}{\frac{2a}{n} + s^2 + \frac{M_\bullet^2}{1 + nc}}\right)$$
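The gene-specific part of B can be evaluated directly from a gene's average log ratio, its variance, and the number of replicates. The sketch below drops the constant and assumes a natural logarithm. The function name is hypothetical, and the values used for the hyperparameters `a` and `c` are illustrative only; the slide does not give them.

```python
import math

def log_odds_term(m_bar, s2, n, a, c):
    """log of (2a/n + s^2 + Mbar^2) / (2a/n + s^2 + Mbar^2 / (1 + n c))."""
    shared = 2 * a / n + s2
    numerator = shared + m_bar ** 2
    denominator = shared + m_bar ** 2 / (1 + n * c)
    return math.log(numerator / denominator)

# Hypothetical genes: (average M, variance of M) from n = 4 replicates.
n, a, c = 4, 0.05, 1.0          # a and c are illustrative hyperparameter values
for m_bar, s2 in [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1), (0.05, 1e-6)]:
    print(m_bar, s2, round(log_odds_term(m_bar, s2, n, a, c), 3))
```

The `2a/n` term keeps the denominator away from zero, so a gene with a tiny variance and a tiny average, like the last one in the loop, does not get the inflated score that a plain t statistic would give it.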
![Page 42: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/42.jpg)
Results from SR-BI transgenic experiment
Plot legend: M; B; t; M and B; t and B; t, M and B
![Page 43: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/43.jpg)
Results from SR-BI transgenic experiment
Plot legend: M; B; t; M and B; t and B; t, M and B
![Page 44: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/44.jpg)
Extensions include dealing with
• Replicates within and between slides
• Several effects: use a linear model
• ANOVA: are the effects equal?
• Time series: selecting genes for trends
![Page 45: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/45.jpg)
Rosetta once more: In vivo Binding Sites of Gal4p in Galactose
Figure: un-enriched DNA (WCE-DNA, Cy3) on the vertical axis against antibody-enriched DNA (IP-DNA, Cy5) on the horizontal axis; both axes run from 10 to 1000000 on a log scale. Panel label: Galactose. Labelled genes: PCL10, GAL80, GAL1/10, GAL2, GAL3, GAL7, GCY1, MTH1. Annotation: P < 0.001.
![Page 46: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/46.jpg)
Summary (for the second simplest problem)
• Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
• Averages can be driven by outliers.
• t-statistics can be driven by tiny variances.
• B = LOR will, we hope
– use information from all the genes
– combine the best of M• and t
– avoid the problems of M• and t
![Page 47: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/47.jpg)
Acknowledgments
UCB/WEHI
Yee Hwa Yang
Sandrine Dudoit
Ingrid Lönnstedt
Natalie Thorne
David Freedman
CSIRO Image Analysis Group
Michael Buckley
Ryan Lagerstrom
Ngai lab, UCB
Goodman lab, UCB
Peter Mac CI, Melb.
Ernest Gallo CRC
Brown-Botstein lab
Matt Callow (LBNL)
Bing Ren (WI)
![Page 48: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/48.jpg)
Some web sites:
Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
Statistical software R (“GNU’s S”): http://lib.stat.cmu.edu/R/CRAN/
Packages within R environment:
-- Spot: http://www.cmis.csiro.au/iap/spot.htm
-- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
![Page 49: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/49.jpg)
Factorial Design
Design diagram: samples labelled A1, P01, P04 and A4 (labels as extracted), joined by hybridisations numbered 1 to 5; the two directions of the diagram are labelled Zone Effect and Age Effect.
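Factorial design
Design diagram: samples with expected values $\mu$, $\mu + a$, $\mu + z$ and $\mu + z + a + za$, joined by hybridisations numbered 1 to 5.
Different ways of estimating parameters, e.g. the Z effect (numbers refer to the hybridisations):
$$1 = (\mu + z) - (\mu) = z$$
$$2 - 5 = ((\mu + a) - (\mu)) - ((\mu + a) - (\mu + z)) = (a) - (a - z) = z$$
$$4 + 3 - 5 = \dots = z$$
How do we combine the information?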
![Page 51: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/51.jpg)
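Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$.
Use least squares estimate for $z$, $a$, $za$:
$$E\begin{pmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 0 & -1 \\ 1 & 1 & 1 \\ -1 & 1 & 0 \end{pmatrix} \begin{pmatrix} z \\ a \\ za \end{pmatrix}$$
$$\hat{\beta} = (X'X)^{-1} X'M$$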
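The least squares estimate for this design can be computed by forming the normal equations and solving the resulting 3 x 3 system. The sketch below uses the design matrix from the slide. The solver, the function name and the example log ratios are hypothetical; the example values are noise-free and generated from z = 1, a = 0.5, za = -0.25 so that the answer is known.

```python
def least_squares(X, m):
    """Solve (X'X) beta = X'm by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xtm = [sum(row[i] * mi for row, mi in zip(X, m)) for i in range(p)]
    aug = [xtx[i] + [xtm[i]] for i in range(p)]       # augmented matrix
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; effects are not all estimable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][p] for i in range(p)]

# Design matrix from the slide: one row per hybridisation, columns z, a, za.
X = [[1, 0, 0], [0, 1, 0], [-1, 0, -1], [1, 1, 1], [-1, 1, 0]]

# Hypothetical log ratios m1..m5 for one gene.
m = [1.0, 0.5, -0.75, 1.25, -0.5]
z_hat, a_hat, za_hat = least_squares(X, m)
print(z_hat, a_hat, za_hat)
```

With real, noisy log ratios the three estimates combine all five hybridisations at once, rather than relying on any single route through the design.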
![Page 52: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/52.jpg)
Looking at effect of Z: log(zone 4 / zone1)
gene A
gene B
![Page 53: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/53.jpg)
Estimate
Figure: Z effect estimate (vertical axis) against Log2(SE) (horizontal axis).
$$t = \hat{\beta} / SE$$
![Page 54: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/54.jpg)
Panel titles: Zone; Age; Zone . Age
![Page 55: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/55.jpg)
Figure: top 50 genes from each effect (Age, Zone, Zone . Age interaction). Counts shown in the figure, as extracted: 48, 229, 19, 0, 0, 19.
![Page 56: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/56.jpg)
Plot legend: T; B; t, M and B; t and B
![Page 57: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/57.jpg)
![Page 58: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/58.jpg)
Plot legend: M; t; t and M
![Page 59: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/59.jpg)
Plot legend: M; B; t; M and B; t and B; t, M and B
![Page 60: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/60.jpg)
Some statistical questions
Image analysis: addressing, segmenting, quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to results.
![Page 61: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/61.jpg)
Some statistical questions, ctd
Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other special experiments
... & much more
![Page 62: The second-simplest cDNA microarray data analysis problem](https://reader036.vdocuments.mx/reader036/viewer/2022062309/5681591e550346895dc64737/html5/thumbnails/62.jpg)
The “NCI 60” experiments (bg)