![Page 1: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/1.jpg)
Model-based analysis of oligonucleotide arrays, dChip software
Statistics and Genomics – Lecture 4
Department of Biostatistics
Harvard School of Public Health
January 23-25, 2002
Cheng Li
(Joint work with Wing Wong)
![Page 2: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/2.jpg)
Source: Affymetrix website
![Page 3: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/3.jpg)
Custom software: raw image
![Page 4: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/4.jpg)
Custom software: getting representative value of a probe cell
![Page 5: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/5.jpg)
Normalization is needed to minimize non-biological variation between arrays
![Page 6: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/6.jpg)
Normalization methods
• Current software uses linear normalization
• Nonlinear curve fitting based on the scatter plot is still inadequate because of
1) effects of differentially expressed genes, which may be “normalized” away
2) the regression phenomenon and asymmetry
![Page 7: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/7.jpg)
Regression phenomenon and asymmetry
![Page 8: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/8.jpg)
Invariant set normalization method
• A set of points (xi, yi) is said to be order-preserving if yi < yj whenever xi < xj
• The maximal order-preserving subset can be obtained by dynamic programming
• If a gene is really differentially expressed, its cells tend not to be included in a large order-preserving subset
• Our method is based on an approximately order-preserving subset, called the “invariant set”
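The dynamic-programming step mentioned above can be sketched as a longest-increasing-subsequence search. This is only an illustration of the exact order-preserving subset, not the approximate “invariant set” procedure used in dChip. The function name and the assumption that the x values are distinct are mine.

```python
def max_order_preserving_subset(points):
    """Return a largest subset of (x, y) points in which y increases strictly with x.

    Assumption for this sketch: x values are distinct.
    """
    pts = sorted(points)          # sort by x
    n = len(pts)
    if n == 0:
        return []
    best = [1] * n                # best[k]: size of the largest chain ending at pts[k]
    prev = [-1] * n               # back-pointers used to recover the chain
    for k in range(n):
        for m in range(k):
            if (pts[m][0] < pts[k][0] and pts[m][1] < pts[k][1]
                    and best[m] + 1 > best[k]):
                best[k] = best[m] + 1
                prev[k] = m
    k = max(range(n), key=best.__getitem__)
    chain = []
    while k != -1:                # walk the back-pointers from the chain's end
        chain.append(pts[k])
        k = prev[k]
    return chain[::-1]


# Toy example: the point (2, 5) breaks the ordering and is left out.
subset = max_order_preserving_subset(
    [(1, 1), (2, 5), (3, 2), (4, 3), (5, 10), (6, 4)])
```

The double loop costs O(n²) time, which is enough to show the idea; a production version would need something faster for the number of probe cells on an array.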
![Page 9: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/9.jpg)
Fig. 2.9 Normalization of a pair of replicated arrays
![Page 10: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/10.jpg)
Figure 2.10. Two different samples. The smoothing spline in (A) is affected by several points at the lower-right corner, which might belong to differentially expressed genes. The “invariant set”, by contrast, does not include these points when determining the normalization curve, leading to a different normalization relationship at the high end.
![Page 11: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/11.jpg)
A pair of split-sample replicate arrays
[Figure: four scatter plots. Panels 1 and 2 plot adeno-268t2d-A189-4 (vertical axis) against adeno-268t1(dup b4)-A93-3 (horizontal axis), with and without normalization. Panels 3 and 4 plot adeno-114T1-A9-1 (vertical axis) against adeno-268t2d-A189-4 and against adeno-268t1(dup b4)-A93-3 (horizontal axes).]
![Page 12: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/12.jpg)
Source: Affymetrix website
![Page 13: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/13.jpg)
PM/MM differences eliminate background and cross-hybridization signals
Data for one probe set, one array
![Page 14: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/14.jpg)
Lockhart et al. (1996) Nature Biotechnology, Vol 14: 1675-1680
Validation experiments suggest Average Differences are linear in mRNA concentration over a certain dynamic range
![Page 15: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/15.jpg)
Data for one gene in many arrays
![Page 16: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/16.jpg)
Box plot showing array and probe effects
![Page 17: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/17.jpg)
Modeling probe effects
1) Probe sequences have different hybridization efficiencies
2) Cross-hybridization, SNPs, alternative splicing
3) Probe position effect, 3’ bias
Probe effects can dominate the biological variation of interest
Previous method: use multiple probes and average to reduce “noise”
Our methods: statistical models for probe effects, “meta-analysis”, learning algorithms, estimation of expression level conditional on knowledge of probe effects
![Page 18: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/18.jpg)
Principal component analysis (42 points in 20-space)
suggests the data matrix has approx. rank 1
[Figure: scree plot of the variances of components 1 to 10 (vertical axis: variances, 0 to 6×10^6). Cumulative proportions of variance: 0.91, 0.942, 0.952, 0.961, 0.968, 0.973, 0.978, 0.982, 0.985, 0.988.]
![Page 19: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/19.jpg)
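Model for one gene in multiple arrays

$$X^{0}_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$$
$$X^{1}_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon_{ij}$$

• $X^{0}_{ij}$ and $X^{1}_{ij}$: observed MM and PM values for chip i, probe j
• $\nu_j$: baseline/cross-hybridization level for probe pair j
• $\theta_i$: expression (level) index in chip i
• $\alpha_j, \phi_j$: scaling factors characterizing probe pair j
• constraint: $\nu_j \ge 0,\ \theta_i \ge 0,\ \alpha_j \ge 0,\ \phi_j \ge 0$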
![Page 20: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/20.jpg)
[Figure: six panels, one per array (1 to 6). Horizontal axis: probe pair (1 to 20); vertical axis: intensity (0 to 15000).]
Figure 1.1. Black curves are the PM and MM data of gene A in the first 6 arrays. Light curves are the fitted values to model (1). Probe pairs are labeled 1 to 20 on the horizontal axis.
![Page 21: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/21.jpg)
Using PM/MM Differences
• PM/MM differences eliminate most background and cross-hybridization signals
• Affymetrix’s GeneChip software uses average differences as the basis for determining fold changes, and their validation showed that average differences are linear in mRNA concentration over a certain dynamic range
![Page 22: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/22.jpg)
Model for PM/MM differences (1.2)
A multiplicative model for one gene:

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$

• $\theta_i$: expression level index for array i, comparable to the PM/MM difference level, assumed to be proportional to the true expression level
• $\phi_j$: probe sensitivity index for probe pair j, subject to the constraint $\sum_j \phi_j^2 = J$ for identifiability
• $\epsilon_{ij}$: random normal errors
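One way to fit a multiplicative model of this form is alternating least squares: hold the probe sensitivities fixed and regress to get the expression indexes, then swap roles, rescaling so that the sum of squared probe sensitivities equals J. The sketch below is a minimal illustration under that assumption. The function name, the fixed iteration count and the starting values are mine, and it omits the outlier handling that the slides describe later.

```python
import math


def fit_multiplicative(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] with sum(phi^2) = J by alternating least squares.

    y is a list of rows (one row per array, one column per probe pair).
    Assumes y is not identically zero.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes                    # starting value; satisfies sum(phi^2) = J
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # phi fixed: no-intercept regression of each array's row on phi
        s = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / s
                 for i in range(n_arrays)]
        # theta fixed: no-intercept regression of each probe's column on theta
        t = sum(th * th for th in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / t
               for j in range(n_probes)]
        # rescale so that sum(phi^2) = J (identifiability constraint)
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    # final theta, consistent with the rescaled phi (sum(phi^2) = J)
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi


# Noise-free toy data: 3 arrays, 4 probe pairs.
phi_true = [r * math.sqrt(4 / 30) for r in (1.0, 2.0, 3.0, 4.0)]
y = [[t * p for p in phi_true] for t in (100.0, 200.0, 400.0)]
theta_hat, phi_hat = fit_multiplicative(y)
```

On exact rank-one data such as the toy example, the iteration recovers the generating values after the first pass; with real data, convergence should be checked rather than assumed.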
![Page 23: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/23.jpg)
Figure 1.2. Black curves are the PM-MM difference data of gene A in the first 6 arrays. Light curves are the fitted values to model (2).
[Figure: six panels, one per array (1 to 6). Horizontal axis: probe pair (1 to 20); vertical axis: PM-MM difference (0 to 10000).]
![Page 24: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/24.jpg)
[Figure: 42 panels, one per array (1 to 42). Horizontal axis: probe pair (1 to 20); vertical axis: PM/MM difference.]
Multiplicative model fits the data well, explained variance: 95%
![Page 25: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/25.jpg)
Residuals of the fitting
![Page 26: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/26.jpg)
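Model fitting amounts to fixing the $\phi$’s and regressing to estimate $\theta$. For one array, with the $\phi_j$ treated as known:

$$y_j = \theta \phi_j + \epsilon_j, \quad j = 1, \ldots, J, \quad \epsilon_j \sim N(0, \sigma^2)$$

$$\tilde{\theta} = \frac{\sum_j y_j \phi_j}{\sum_j \phi_j^2} = \frac{\sum_j y_j \phi_j}{J}, \quad \text{with } E(\tilde{\theta}) = \theta \text{ and } \mathrm{Var}(\tilde{\theta}) = \sigma^2 / J$$

$$\text{Std Error}(\tilde{\theta}) = \sqrt{\hat{\sigma}^2 / J}, \quad \text{with } \hat{\sigma}^2 = \sum_j (\text{fitted} - \text{observed})^2 / (J - 1)$$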
![Page 27: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/27.jpg)
[Figure: six panels for arrays 3 to 8 (horizontal axis: probe pair, 1 to 20), and a plot against theta (1000 to 5000) in which array 4 is marked.]
Fig. 1.5 Array outlier: large standard error of $\theta_4$
![Page 28: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/28.jpg)
Fig. 1.6 Probe outlier: large standard error of $\phi_{17}$
[Figure: (A) six panels for arrays 1 to 6 (horizontal axis: probe pair, 1 to 20); (B) a plot against phi (0.0 to 2.5) in which probe 17 is marked.]
Also see gene 6898
![Page 29: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/29.jpg)
Fig. 1.4 Array outlier image shows that the model automatically handles image contamination
![Page 30: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/30.jpg)
Compare Model-based expression with Average Difference
• Array set 5 has 29 pairs of arrays replicated at the split-mRNA level
• The differences between the replicated arrays provide an opportunity to assess different expression calculation methods
![Page 31: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/31.jpg)
Figure 2.5. Log (base 10) expression indexes of a pair of replicate arrays (array 1 and 2 of array set 5) for MBEI method (A) and AD method (B). The center line is y=x, and the flanking lines indicate a difference of a factor of two.
![Page 32: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/32.jpg)
[Figure: two boxplot panels, (A) and (B), each with six groups labeled 2693, 677, 485, 466, 438 and 1626; vertical axis from 0.0 to 0.6.]
Figure 2.6. Boxplots of average absolute log (base 10) ratios between replicate arrays stratified by presence proportion for (A) MBEI method, (B) AD method.
![Page 33: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/33.jpg)
Source: Affymetrix website
![Page 34: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/34.jpg)
Finding Confidence Interval of Fold Change
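Suppose $\hat{\theta}_1 \sim N(\theta_1, \delta_1^2)$ and $\hat{\theta}_2 \sim N(\theta_2, \delta_2^2)$, with $\delta_1, \delta_2$ known. Let $r = \theta_1 / \theta_2$; then

$$Q = \frac{(\hat{\theta}_1 - r \hat{\theta}_2)^2}{\delta_1^2 + r^2 \delta_2^2} \sim \chi^2_1$$

Use this pivotal quantity to construct a level-$\alpha$ test, and invert the test to get the CI for r.
This method can be extended to the unequal variance case.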
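Inverting the test amounts to solving a quadratic inequality in r: the pivotal quantity stays below the chi-square critical value c exactly when a·r² − 2·b·r + k ≤ 0, where a, b and k are built from the two estimates and their standard errors as in the code. The sketch below works through that algebra. The function name and the restriction to non-negative r are mine. The default level of 0.90 is also an assumption: the slides do not state the level, but 0.90 reproduces the Gene 1 and Gene 7 rows of Table 2.1.

```python
import math
from statistics import NormalDist


def fold_change_ci(theta1, se1, theta2, se2, level=0.90):
    """Confidence interval for r = theta1/theta2 by inverting the chi-square(1) test.

    Assumes positive estimates and restricts r to be non-negative.
    """
    # a chi-square(1) quantile is the square of a standard normal quantile
    c = NormalDist().inv_cdf((1 + level) / 2) ** 2
    # Q <= c  is equivalent to  a*r^2 - 2*b*r + k <= 0
    a = theta2 ** 2 - c * se2 ** 2
    b = theta1 * theta2
    k = theta1 ** 2 - c * se1 ** 2
    disc = b * b - a * k
    if a > 0:
        # denominator clearly above zero: finite interval between the two roots
        root = math.sqrt(max(disc, 0.0))
        return max(0.0, (b - root) / a), (b + root) / a
    if a == 0:
        # inequality is linear in r: only a lower bound exists
        return max(0.0, k / (2 * b)), math.inf
    if disc < 0:
        # inequality holds for every r: the data carry no information about r
        return 0.0, math.inf
    # a < 0: the accepted region is unbounded above
    return max(0.0, (b - math.sqrt(disc)) / a), math.inf


# Gene 1 of Table 2.1: expression 859.635 (SE 41.7808) vs 347.57 (SE 36.0887)
lower, upper = fold_change_ci(859.635, 41.7808, 347.57, 36.0887)
```

The unbounded cases arise when the denominator estimate is not clearly separated from zero, which is why some rows of the table report an upper bound of Infinity.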
![Page 35: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/35.jpg)
| Gene | Expression 1 | Std Error 1 | Expression 2 | Std Error 2 | Fold Change | Lower CB | Upper CB |
|---|---|---|---|---|---|---|---|
| Gene 1 | 859.635 | 41.7808 | 347.57 | 36.0887 | 2.47327 | 2.06844 | 3.02672 |
| Gene 2 | 405.72 | 31.2305 | 164.014 | 44.2505 | 2.47369 | 1.66938 | 4.49127 |
| Gene 3 | 283.931 | 28.5281 | 114.705 | 18.4661 | 2.47531 | 1.83926 | 3.48466 |
| Gene 4 | 45.9821 | 64.2419 | 18.5727 | 84.5308 | 2.47579 | 0 | Infinity |
| Gene 5 | 225.178 | 57.489 | 90.9045 | 36.1766 | 2.47709 | 1.18104 | 7.48749 |
| Gene 6 | 247.002 | 50.6518 | 99.6642 | 19.5384 | 2.47834 | 1.51079 | 4.0211 |
| Gene 7 | 49.9739 | 21.5345 | 20.1514 | 23.5651 | 2.47992 | 0.487603 | Infinity |
| Gene 8 | 276.491 | 18.6883 | 111.373 | 36.1004 | 2.48256 | 1.59069 | 5.34635 |
| Gene 9 | 436.071 | 32.9779 | 175.384 | 21.0669 | 2.48638 | 1.98665 | 3.18811 |
| Gene 10 | 75.6914 | 17.7215 | 30.4395 | 17.9707 | 2.48662 | 1.07209 | 86.1656 |
| Gene 11 | 80.673 | 25.3085 | 32.4314 | 16.9626 | 2.4875 | 0.960787 | 18.1833 |
| Gene 12 | 181.528 | 42.4837 | 72.8751 | 28.1787 | 2.49094 | 1.24668 | 7.11945 |
| Gene 13 | 1122.28 | 99.2835 | 449.889 | 63.2821 | 2.49456 | 1.92075 | 3.35055 |
| Gene 14 | 168.234 | 40.629 | 67.4387 | 30.2982 | 2.49462 | 1.17639 | 9.81547 |
Table 2.1 Using expression levels and associated standard errors to determine confidence intervals of fold changes
![Page 36: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/36.jpg)
Resampling hierarchical clustering using standard errors of model-based expression
![Page 37: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/37.jpg)
Incorporate biological knowledge and database when analyzing microarray data
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
![Page 38: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/38.jpg)
Found 13 structural protein genes out of a 49-cluster (all: 198/2622, PValue: 1.00e+000)
Functionally significant clusters
![Page 39: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/39.jpg)
Statistical analysis of high-density oligonucleotide arrays: a multiplicative noise model
R. Sasik and J. Corbeil (UCSF)
Slides prepared by Xuemin Fang
Problems with LWR model:
• LWR model: $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
• The expression index can still be negative.
• Genes with negative index can still be classified as present.
![Page 40: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/40.jpg)
Statistical model:
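• Based on the same assumption as the LW model, that PM intensity is directly proportional to the concentration $c_i$ of the transcript, $PM_{ij} \sim a_j c_i$. Write the relation in the form $\log_2 PM_{ij} \sim \log_2 a_j + \log_2 c_i$.
• Our model is $\log_2 PM_{ij} = \log_2 a_j + \log_2 c_i + \epsilon_{ij}$, where $\epsilon_{ij} \sim N(0, \sigma^2)$.
• Least squares estimation of the parameters.
• Constraint: [the constraint and one further formula on this slide are not recoverable from the transcript]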
![Page 41: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/41.jpg)
Algorithm -- When analyzing a batch of $n_s$ samples:
1. Normalize all samples to the first one on the list by requiring the sum of all PM intensities to be the same as that of the first sample.
2. Select the background probes using Naef’s method (MM is used in this step).
3. Subtract the median of the background probe intensities from every PM probe in the array.
4. Eliminate probes that become negative.
5. Fit the model and eliminate the probes that contribute most to the sum of squares.
6. Normalize again and repeat steps 1-5 until the distribution of residuals is Gaussian.
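The normalization and background-subtraction steps of this procedure can be sketched as follows. This is a partial illustration only: the selection of background probes (Naef’s method) and the model fit are not implemented, and `background_idx` is a hypothetical input standing in for the result of that selection. Function names are mine.

```python
from statistics import median


def normalize_to_first(samples):
    """Scale each sample so that its total PM intensity equals that of the first sample."""
    target = sum(samples[0])
    return [[v * target / sum(s) for v in s] for s in samples]


def subtract_background(pm, background_idx):
    """Subtract the median background-probe intensity and drop probes that become negative.

    background_idx: indexes of the probes chosen as background (assumed to be given).
    Returns a dict mapping the surviving probe indexes to corrected intensities.
    """
    bg = median(pm[k] for k in background_idx)
    return {k: v - bg for k, v in enumerate(pm) if v - bg >= 0}


# Two toy samples with three PM probes each
normalized = normalize_to_first([[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]])
# One toy array in which probes 0 and 3 were selected as background
corrected = subtract_background([10.0, 20.0, 30.0, 5.0], [0, 3])
```

Returning a dictionary keyed by probe index keeps track of which probes were eliminated, which matters because the later steps repeat the normalization on the surviving probes.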
![Page 42: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/42.jpg)
Bias, variance and fit for three measures of expression: AvDiff, Li & Wong's, AvLog(PM - bg)
Rafael Irizarry, Terry Speed (Johns Hopkins)
Slides prepared by Xuemin Fang
![Page 43: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/43.jpg)
A background plus signal model:
$$PM_{ijn} = bg_{ijn} + s_{ijn}, \qquad \log_2(bg_{ijn}) = \beta_i + z_{ijn}, \qquad \log_2(s_{ijn}) = \theta_{in} + \phi_{jn} + \epsilon_{ijn}$$

• Here $bg_{ijn}$ represents background signal caused by optical noise and non-specific binding.
• The mean background level is represented with $\beta_i$ and the random component with $z_{ijn}$.
• The transcript signal $s_{ijn}$ contains a probe affinity effect $\phi_{jn}$, the log expression measure $\theta_{in}$, and an error term $\epsilon_{ijn}$.
• Both error terms $z_{ijn}$ and $\epsilon_{ijn}$ are independent standard normal.
![Page 44: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/44.jpg)
Expression index:
• A naïve estimate of $\theta_{in}$ is given by

$$\hat{\theta}_{in} = J_n^{-1} \sum_{j=1}^{J_n} \log_2(PM_{ijn} - \widehat{bg}_i)$$

with $\log_2(\widehat{bg}_i)$ the mode of the log2(MM) distribution.
• An estimate of this distribution is obtained using a kernel density estimate.
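A minimal sketch of this expression index: estimate the array's background as two raised to the mode of a Gaussian kernel density estimate of log2(MM), then average log2(PM − background) over the probes of one gene. The function names, the bandwidth, the grid search for the mode and the rule of skipping probes at or below background are all my assumptions; the slides do not specify them.

```python
import math


def kde_mode(values, bandwidth, grid_size=512):
    """Grid point at which a Gaussian kernel density estimate of `values` is largest."""
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * k / (grid_size - 1) for k in range(grid_size)]

    def density(x):
        # unnormalized density: the constant factor does not change the mode
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return max(grid, key=density)


def avlog_pm_bg(pm, mm_array, bandwidth=0.1):
    """Average of log2(PM - bg) over one gene's probes on one array.

    pm: PM intensities of the gene's probes; mm_array: all MM intensities on the array.
    Assumption: probes with PM at or below the background estimate are skipped,
    and at least one probe remains.
    """
    bg = 2.0 ** kde_mode([math.log2(m) for m in mm_array], bandwidth)
    kept = [p - bg for p in pm if p > bg]
    return sum(math.log2(v) for v in kept) / len(kept)


# Toy array: most MM values are 64, so the background estimate is 64
index = avlog_pm_bg([128.0, 192.0, 320.0], [64.0] * 5 + [128.0])
```

In practice the bandwidth would be chosen by a data-driven rule rather than fixed, and the grid search could be replaced by any standard density estimator.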
![Page 45: Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649f1b5503460f94c2ffd2/html5/thumbnails/45.jpg)
Acknowledgement
Data source:
Stan Nelson (UCLA)
Sven de Vos (UCLA)
Dan Tang (DFCI)
Andy Bhattacharjee (DFCI)
Richardson Andresa (DFCI)
Allen Fienberg (Rockefeller)