![Page 1: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/1.jpg)
1
A combining approach to statistical methods for p >> n problems
Shinto Eguchi
Workshop on Statistical Genetics, Nov 9, 2004 at ISM
![Page 2: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/2.jpg)
2
Microarray data
cDNA microarray
![Page 3: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/3.jpg)
3
Prediction from gene expressions
Feature vector $\mathbf{x} = (x_1, \ldots, x_p)$: dimension = number of genes $p$; components = quantities of gene expression
Class label $y \in \{-1, +1\}$: disease, adverse effect
Classification machine $f: \mathbf{x} \mapsto y$
based on training dataset $\{(\mathbf{x}_i, y_i): i = 1, \ldots, n\}$
![Page 4: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/4.jpg)
4
Leukemic diseases, Golub et al.
http://www.broad.mit.edu/cgi-bin/cancer/publications/
![Page 5: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/5.jpg)
5
Web microarray data
| Dataset | p | n | y = +1 | y = -1 |
|---|---|---|---|---|
| ALLAML | 7129 | 72 | 37 | 35 |
| Colon | 2000 | 62 | 40 | 22 |
| Estrogen | 7129 | 49 | 25 | 24 |

p >> n
http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
![Page 6: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/6.jpg)
6
Genomic data
SNPs Proteome Microarray
Data
dimension p 1,000~100,000
function 5,000~20,000
data size n 100 ~ 1000 5 ~ 20 20 ~ 100
mRNA Protein Genome
![Page 7: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/7.jpg)
7
Problem: p >> n
A fundamental issue in bioinformatics
p is the dimension of the biomarker
(SNPs, proteome, microarray, …)
n is the number of individuals
(informed consent, institutional protocol, …bioethics)
![Page 8: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/8.jpg)
8
Current paradigm
Biomarker space $B \subset \mathbb{R}^p$
SNPs Haplotype block (Fujisawa)
Microarray Model-based clustering
Proteome Peak data reduction (Miyata)
GroupBoost (Takenouchi)
pnp )dim(but, B
Network gene model
Haplotype & adverse effects (Matsuura)
![Page 9: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/9.jpg)
9
An approach by combining
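Let $B$ be a biomarker space
Let $I_1, \ldots, I_K$ be $K$ experimental facilities
$D_k = \{\mathbf{z}_i : i = 1, \ldots, n_k\}$
$p \gg n_k \quad (k = 1, \ldots, K)$
Rapid expansion of genomic data
Pooled sample size $\sum_{k=1}^{K} n_k$ versus $p$ for larger $K$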
![Page 10: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/10.jpg)
10
Bridge Study?
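CAMDA (Critical Assessment of Microarray Data Analysis)
DDBJ (DNA Data Bank Japan, NIG)
[Diagram: datasets $D_1, D_2, \ldots, D_K$; separate results $f(D_1), f(D_2), \ldots, f(D_K)$; conditional results $f(\cdot \mid \cdot)$ for each of $D_1, D_2, \ldots, D_K$; result]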
![Page 11: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/11.jpg)
11
CAMDA 2003
4 datasets for Lung Cancer
| Source | Publication | Platform |
|---|---|---|
| Harvard | PNAS, 2001 | Affymetrix |
| Michigan | Nature Med, 2002 | Affymetrix |
| Stanford | PNAS, 2001 | cDNA |
| Ontario | Cancer Res, 2001 | cDNA |
http://www.camda.duke.edu/camda03/datasets/
![Page 12: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/12.jpg)
12
Some problems
1. Heterogeneity in feature space: cDNA, Affymetrix
2. Heterogeneous class-labeling: differences in covariates, medical diagnosis
3. Heterogeneous generalization powers: uncertainty for microarray experiments
4. Publication bias: a vast number of unpublished studies
![Page 13: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/13.jpg)
13
Machine learning
Learnability: boosting weak learners?
AdaBoost: Freund & Schapire (1997)
Weak classifiers: $\{f_1(\mathbf{x}), \ldots, f_p(\mathbf{x})\}$
Stagewise: $\alpha_1 f_{(1)}(\mathbf{x}) + \cdots + \alpha_t f_{(t)}(\mathbf{x})$
A strong classifier: $f(\mathbf{x})$
![Page 14: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/14.jpg)
14
AdaBoost
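1. Initial settings: $w_1(i) = \frac{1}{n} \ (i = 1, \ldots, n)$, $F_0(\mathbf{x}) = 0$
2. For $t = 1, \ldots, T$:
(a) $\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$, where $\epsilon_t(f) = \sum_{i=1}^{n} I(y_i \neq f(\mathbf{x}_i))\, w_t(i)$
(b) $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
(c) $w_{t+1}(i) = w_t(i) \exp(-\alpha_t f_{(t)}(\mathbf{x}_i)\, y_i)$
3. $\mathrm{sign}(F_T(\mathbf{x}))$, where $F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$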
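The three AdaBoost steps on this slide (initial weights, stagewise update, final sign rule) can be sketched in Python as below. This is a minimal illustration, not the speaker's implementation: the weak learners are assumed to be one-gene threshold rules, the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are hypothetical, and the weights are renormalized before computing the weighted error, which is an added assumption.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (weighted error, gene j, threshold b, sign s) of the best threshold rule."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for b in np.unique(X[:, j]):
            base = np.where(X[:, j] > b, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * base != y))
                if err < best[0]:
                    best = (err, j, b, s)
    return best

def stump_predict(X, j, b, s):
    return s * np.where(X[:, j] > b, 1, -1)

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: w_1(i) = 1/n
    model = []
    for _ in range(T):                            # step 2
        err, j, b, s = fit_stump(X, y, w / w.sum())           # (a), weights renormalized (assumption)
        err = np.clip(err, 1e-10, 1 - 1e-10)                  # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)                 # (b)
        w = w * np.exp(-alpha * y * stump_predict(X, j, b, s))  # (c)
        model.append((alpha, j, b, s))
    return model

def predict(model, X):
    F = sum(a * stump_predict(X, j, b, s) for a, j, b, s in model)  # F_T(x)
    return np.where(F >= 0, 1, -1)                # step 3: sign(F_T(x)), ties sent to +1
```

Calling `adaboost(X, y, T)` on an `n x p` expression matrix with labels in {-1, +1} returns the list of `(alpha, gene, threshold, sign)` tuples that `predict` combines.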
![Page 15: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/15.jpg)
15
One-gene classifier
Error number 5 5 5 6 466 5 565
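Let $x_{1j}, \ldots, x_{nj}$ be expressions of the $j$-th gene
One-gene classifier: $f_j(\mathbf{x}) = +1$ if $x_j > b_j$; $f_j(\mathbf{x}) = -1$ if $x_j \leq b_j$
$b_j = \arg\min_b \sum_i I\{y_i \neq \mathrm{sgn}(x_{ij} - b)\}$
[Figure: expression values $x_j$ with the threshold $b_j$]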
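The threshold search for a one-gene classifier can be illustrated with a short Python sketch. The candidate thresholds (midpoints between sorted expression values, plus one value below the minimum) and the toy data are assumptions made for the example; the slide does not say how candidates are enumerated.

```python
import numpy as np

def fit_one_gene(xj, y):
    """Scan candidate thresholds b for one gene; return (b_j, number of errors)."""
    xs = np.unique(xj)  # sorted distinct expression values
    # Assumed candidates: one value below the minimum, then midpoints.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    errors = [int(np.sum(y != np.where(xj > b, 1, -1))) for b in candidates]
    k = int(np.argmin(errors))  # first threshold attaining the minimum
    return candidates[k], errors[k]

# Toy expressions of one gene for six individuals (hypothetical values).
xj = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
y = np.array([-1, -1, 1, -1, 1, 1])
b_j, n_err = fit_one_gene(xj, y)
```

On this toy gene no threshold separates the classes perfectly, so the search returns the first threshold with the smallest error count.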
![Page 16: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/16.jpg)
16
The second training
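Update the weight: $\alpha_1 = 0.5 \log \frac{\text{nb. of correct ans.}}{\text{nb. of false ans.}} = 0.5 \log \frac{16}{4} = \log 2$
Weight up to 2 (misclassified examples)
Weight down to 0.5 (correctly classified examples)
$b_j = \arg\min_b \sum_i w(i)\, I\{y_i \neq \mathrm{sgn}(x_{ij} - b)\}$
[Figure: weighted error number at each candidate threshold $b_j$ along the $x_j$ axis; extracted values: 45.5 7 9 87.56 798.5 and 4.5]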
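The arithmetic of this weight update can be checked directly. The counts 16 and 4 are the ones shown on the slide; the variable names are illustrative only.

```python
import math

n_correct, n_false = 16, 4                      # counts from the first training round
alpha_1 = 0.5 * math.log(n_correct / n_false)   # 0.5 * log(16/4) = log 2
weight_up = math.exp(alpha_1)                   # factor applied to misclassified examples
weight_down = math.exp(-alpha_1)                # factor applied to correctly classified examples

# After the update the two groups carry the same total weight,
# so the first classifier has weighted error 1/2 under the new weights.
total_false = n_false * weight_up
total_correct = n_correct * weight_down
```

Both totals equal 8, which is why the second training round has to pick a different threshold.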
![Page 17: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/17.jpg)
17
Learning algorithm
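[Diagram: starting from dataset $D$, the weights $\{w_1(1), \ldots, w_1(n)\}$, $\{w_2(1), \ldots, w_2(n)\}$, ..., $\{w_T(1), \ldots, w_T(n)\}$ give the classifiers $f_{(1)}(\mathbf{x}), f_{(2)}(\mathbf{x}), \ldots, f_{(T)}(\mathbf{x})$ with coefficients $\alpha_1, \alpha_2, \ldots, \alpha_T$, combined as $\sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$]
Final machine: $\mathrm{sign}(F_T(\mathbf{x}))$, where $F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$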
![Page 18: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/18.jpg)
18
Exponential loss
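$L(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}$
$\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_{(t)})$
Update $\{w_t(i)\} \to \{w_{t+1}(i)\}$: $\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f) \ \to\ \epsilon_{t+1}(f_{(t)}) = \frac{1}{2}$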
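Two facts on this slide can be verified numerically: the closed-form coefficient minimizes the exponential loss along the chosen classifier, and after the weight update that classifier has weighted error 1/2. The sketch below uses simulated values; `F_prev` (standing in for $F_{t-1}(\mathbf{x}_i)$) and the 70% accurate weak classifier `f_t` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)
F_prev = rng.normal(size=n)                    # hypothetical F_{t-1}(x_i)
f_t = np.where(rng.random(n) < 0.7, y, -y)     # weak classifier, right about 70% of the time

def exp_loss(F):
    return np.mean(np.exp(-y * F))

w = np.exp(-y * F_prev)                        # weights implied by the exponential loss
eps = np.sum(w * (f_t != y)) / np.sum(w)       # normalized weighted error
alpha = 0.5 * np.log((1 - eps) / eps)          # closed-form coefficient

# Grid search around alpha for the minimizer of L(F_{t-1} + a * f_t).
grid = np.linspace(alpha - 1, alpha + 1, 2001)
losses = np.array([exp_loss(F_prev + a * f_t) for a in grid])
alpha_grid = grid[np.argmin(losses)]

# Weighted error of the same classifier under the updated weights.
w_next = w * np.exp(-alpha * y * f_t)
eps_next = np.sum(w_next * (f_t != y)) / np.sum(w_next)
```

The grid minimizer agrees with the closed form up to the grid spacing, and `eps_next` equals 1/2 up to rounding.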
![Page 19: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/19.jpg)
19
Different datasets
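$D = \bigcup_{k=1}^{K} D_k$, where $D_k = \{(\mathbf{x}_i^{(k)}, y_i^{(k)}): i = 1, \ldots, n_k\}$
$\mathbf{x}_i^{(k)} \in \mathbb{R}^p$: expression vector of the same genes
$y_i^{(k)}$: label of the same clinical item
Normalization: $\mathbf{x}_i^{(k)} \in \mathbb{R}^p \to [0, 1]^p$
$x_{ij}^{(k)} \to \frac{x_{ij}^{(k)} - \min_i x_{ij}^{(k)}}{\max_i x_{ij}^{(k)} - \min_i x_{ij}^{(k)}} \quad (j = 1, \ldots, p)$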
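The per-dataset normalization to $[0, 1]^p$ can be sketched as a min-max rescaling of each gene within each dataset. The function name `minmax_normalize`, the guard for constant genes, and the simulated datasets are assumptions added for the example.

```python
import numpy as np

def minmax_normalize(Xk):
    """Scale each gene (column) of one dataset D_k to the interval [0, 1]."""
    lo = Xk.min(axis=0)
    hi = Xk.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard (assumption): constant genes map to 0
    return (Xk - lo) / span

# Two hypothetical datasets measured on very different scales.
rng = np.random.default_rng(1)
datasets = [rng.normal(loc=m, scale=s, size=(5, 3)) for m, s in [(0, 1), (10, 5)]]
normalized = [minmax_normalize(Xk) for Xk in datasets]
```

Normalizing within each $D_k$ separately puts the expression values from different facilities on a common scale before they are combined.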
![Page 20: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/20.jpg)
20
Weighted Errors
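The k-th weighted error: $\epsilon_t^{(k)}(f) = \sum_{i=1}^{n_k} \frac{w_t^{(k)}(i)}{\sum_{i'=1}^{n_k} w_t^{(k)}(i')}\, I(y_i^{(k)} \neq f(\mathbf{x}_i^{(k)}))$
The combined weighted error: $\epsilon_t(f) = \sum_{k=1}^{K} \pi_t^{(k)}\, \epsilon_t^{(k)}(f)$
where $\pi_t^{(k)} = \frac{\sum_{i=1}^{n_k} w_t^{(k)}(i)}{\sum_{h=1}^{K} \sum_{i=1}^{n_h} w_t^{(h)}(i)}$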
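A small Python sketch of the two weighted errors on this slide: each dataset's error is weighted by that dataset's share of the total weight, so the combined error coincides with the weighted error on the pooled data. Function names and the toy numbers are illustrative assumptions.

```python
import numpy as np

def kth_weighted_error(w, y, pred):
    """Weighted error on one dataset, with weights normalized within the dataset."""
    return np.sum(w * (y != pred)) / np.sum(w)

def combined_weighted_error(ws, ys, preds):
    """Mix the per-dataset errors by each dataset's share of the total weight."""
    total = sum(np.sum(w) for w in ws)
    return sum((np.sum(w) / total) * kth_weighted_error(w, y, p)
               for w, y, p in zip(ws, ys, preds))

# Two toy datasets (hypothetical weights, labels and predictions).
ws = [np.array([1.0, 1.0, 2.0]), np.array([0.5, 0.5])]
ys = [np.array([1, -1, 1]), np.array([-1, 1])]
preds = [np.array([1, 1, 1]), np.array([-1, -1])]

eps_k = [kth_weighted_error(w, y, p) for w, y, p in zip(ws, ys, preds)]
eps = combined_weighted_error(ws, ys, preds)

# Pooled check: total misclassified weight over total weight.
pooled = (sum(np.sum(w * (y != p)) for w, y, p in zip(ws, ys, preds))
          / sum(np.sum(w) for w in ws))
```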
![Page 21: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/21.jpg)
21
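BridgeBoost
$\{w_t^{(k)}(i): 1 \leq i \leq n_k\}_{k=1}^{K}$
(a) $f_t^{(k)} = \arg\min_f \epsilon_t^{(k)}(f)$
(b) $\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$
(c) $f_t(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(\mathbf{x})$
(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i) \exp(-f_t(\mathbf{x}_i^{(k)})\, y_i^{(k)})$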
![Page 22: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/22.jpg)
22
Learning
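Stage $t$: $f_t = \frac{1}{K} \left( \alpha_t^{(1)} f_t^{(1)} + \cdots + \alpha_t^{(K)} f_t^{(K)} \right)$
[Diagram: at stage $t$, the datasets $D_1, D_2, \ldots, D_K$ with weights $\{w_t^{(1)}(i)\}, \{w_t^{(2)}(i)\}, \ldots, \{w_t^{(K)}(i)\}$ give the classifiers $f_t^{(1)}(\mathbf{x}), f_t^{(2)}(\mathbf{x}), \ldots, f_t^{(K)}(\mathbf{x})$ with coefficients $\alpha_t^{(1)}, \alpha_t^{(2)}, \ldots, \alpha_t^{(K)}$; these are combined on $D$ into $f_t(\mathbf{x})$]
Stage $t+1$: $\{w_{t+1}^{(1)}(i)\}, \{w_{t+1}^{(2)}(i)\}, \ldots, \{w_{t+1}^{(K)}(i)\}$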
![Page 23: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/23.jpg)
23
Mean exponential loss
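Exponential loss: $L_k(F) = \frac{1}{n_k} \sum_{i=1}^{n_k} \exp\{-y_i^{(k)} F(\mathbf{x}_i^{(k)})\}$
Mean exponential loss: $L(F) = \sum_{k=1}^{K} L_k(F)$
$\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_t^{(k)})$
$L(F_{t-1} + f_t) \leq \frac{1}{K} \sum_{k=1}^{K} L(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}) \leq L(F_{t-1})$
Note: convexity of Expo-Loss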
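The role of convexity can be illustrated numerically: for the exponential loss, the loss of an averaged step is never larger than the average of the losses of the individual steps (Jensen's inequality). In the sketch below the loss is computed on a single simulated sample, and `steps` are random stand-ins for the per-dataset terms $\alpha_t^{(k)} f_t^{(k)}(\mathbf{x}_i)$; both are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 3
y = rng.choice([-1, 1], size=n)
F_prev = rng.normal(size=n)                      # hypothetical F_{t-1}(x_i)
steps = [rng.normal(size=n) for _ in range(K)]   # stand-ins for alpha_t^(k) f_t^(k)(x_i)

def exp_loss(F):
    return np.mean(np.exp(-y * F))

# Loss of the averaged step versus the average loss of the K separate steps.
lhs = exp_loss(F_prev + sum(steps) / K)
rhs = np.mean([exp_loss(F_prev + g) for g in steps])
```

The inequality `lhs <= rhs` holds for any choice of steps because the exponential function is convex; the same argument applies to a sum of such losses over several datasets.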
![Page 24: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/24.jpg)
24
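Meta-learning
$\sum_{h=1}^{K} L_h(F_{t-1} + f_t^{(k)}) = L_k(F_{t-1} + f_t^{(k)}) + \sum_{h \neq k} L_h(F_{t-1} + f_t^{(k)})$
Separate learning: $L_k(F_{t-1} + f_t^{(k)})$
Meta-learning: $\sum_{h \neq k} L_h(F_{t-1} + f_t^{(k)})$
$L_h(F_{t-1} + f_t^{(k)})$ is cross-validatory for $h \neq k$: $f_t^{(k)}$ is on $D_k$; $L_h$ is on $D_h$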
![Page 25: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/25.jpg)
25
Simulation
3 datasets $\{D_1, D_2, D_3\}$: $p = 100$, $n_1 = 50$, $n_2 = 50$, $n_3 = 50$
$D_1, D_2$ (data 1, data 2): Test error 0 (ideal)
$D_3$ (data 3): Test error 0.5 (ideal)
[Figure: Collapsed dataset; panels: Training error, Test error]
![Page 26: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/26.jpg)
26
Comparison
Separate AdaBoost BridgeBoost
[Figure panels for each method: Training error, Test error]
![Page 27: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/27.jpg)
27
Min = 15%, Min = 4%, Min = 43%, Min = 3%, Min = 4%
Collapsed AdaBoost Separate AdaBoost BridgeBoost
Test errors
![Page 28: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/28.jpg)
28
Conclusion
[Diagram: datasets $D_1, D_2, \ldots, D_K$; Separate learning: $f(D_1), f(D_2), \ldots, f(D_K)$; Meta-learning: conditional results $f(\cdot \mid \cdot)$ for each of $D_1, D_2, \ldots, D_K$; result]
![Page 29: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/29.jpg)
29
Unsolved problems
1. Which dataset should be joined or deleted in BridgeBoost?
2. Prediction of the class label for a given new x?
3. On the information on the unmatched genes in combining datasets
4. Heterogeneity is OK, but publication bias?
![Page 30: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/30.jpg)
30
Publication bias?
Passive smokers vs lung cancer
Mean and s.d. of 37 studies: $(\hat{\theta}_k, s_k),\ k = 1, \ldots, 37$
Funnel plot: heterogeneity, publication bias
(Copas & Shi, 2001)
![Page 31: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM](https://reader035.vdocuments.mx/reader035/viewer/2022062519/5697bfb81a28abf838c9f558/html5/thumbnails/31.jpg)
31
References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo. 884, July (2003).
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J.Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).