1
A combining approach to statistical methods for p >> n problems
Shinto Eguchi
Workshop on Statistical Genetics, Nov 9, 2004 at ISM
2
Microarray data
cDNA microarray
3
Prediction from gene expressions
Feature vector $x = (x_1, \dots, x_p)$: dimension = number of genes p; components = quantities of gene expression
Class label $y \in \{-1, +1\}$: disease, adverse effect
Classification machine $f: x \mapsto y$, based on training dataset $\{(x_i, y_i): 1 \le i \le n\}$
4
Leukemic diseases, Golub et al.
http://www.broad.mit.edu/cgi-bin/cancer/publications/
5
Web microarray data
Dataset     p      n    y = +1   y = -1
ALLAML      7129   72   37       35
Colon       2000   62   40       22
Estrogen    7129   49   25       24
p >> n
http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/
6
Genomic data
SNPs Proteome Microarray
Data
dimension p 1,000~100,000
function 5,000~20,000
data size n 100 ~ 1000 5 ~ 20 20 ~ 100
mRNA, Protein, Genome
7
Problem: p >> n
A fundamental issue in bioinformatics
p is the dimension of the biomarker
(SNPs, proteome, microarray, …)
n is the number of individuals
(informed consent, institutional protocol, … bioethics)
8
Current paradigm
Biomarker space $B \subset \mathbb{R}^p$
SNPs Haplotype block (Fujisawa)
Microarray Model-based clustering
Proteome Peak data reduction (Miyata)
GroupBoost (Takenouchi)
$\dim(B) \le p$, but $p \gg n$
Network gene model
Haplotype & adverse effects (Matsuura)
9
An approach by combining
Rapid expansion of genomic data
Let $B$ be a biomarker space
Let $I_1, \dots, I_K$ be $K$ experimental facilities
$D_k = \{z_i^{(k)}: i = 1, \dots, n_k\}$
$p \gg n_k \ (k = 1, \dots, K)$
$p$ vs. $\sum_{k=1}^{K} n_k$ for larger $K$
10
Bridge Study?
CAMDA (Critical Assessment of Microarray Data Analysis)
DDBJ (DNA Data Bank of Japan, NIG)
$D_1 \to f(D_1) \to f(D_1 \mid D_{-1})$
$D_2 \to f(D_2) \to f(D_2 \mid D_{-2})$
…
$D_K \to f(D_K) \to f(D_K \mid D_{-K})$
→ result
11
CAMDA 2003
4 datasets for Lung Cancer
Harvard     PNAS, 2001          Affymetrix
Michigan    Nature Med, 2002    Affymetrix
Stanford    PNAS, 2001          cDNA
Ontario     Cancer Res, 2001    cDNA
http://www.camda.duke.edu/camda03/datasets/
12
Some problems
1. Heterogeneity in feature space
cDNA, Affymetrix
Differences in covariates, medical diagnosis
Uncertainty for microarray experiments
2. Heterogeneous class-labeling
3. Heterogeneous generalization powers
4. Publication bias
A vast number of unpublished studies
13
Machine learning
Learnability: boosting weak learners?
AdaBoost : Freund & Schapire (1997)
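Weak classifiers: $\{f_1(x), \dots, f_p(x)\}$
Stagewise: $\alpha_1 f_{(1)}(x),\ \dots,\ \alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$
A strong classifier: $f(x) = \alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$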
14
AdaBoost
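1. Initial settings: $w_1(i) = \frac{1}{n} \ (i = 1, \dots, n)$, $F_0(x) = 0$
2. For $t = 1, \dots, T$
(a) $\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$, where $\epsilon_t(f) = \sum_{i=1}^{n} I(y_i \ne f(x_i))\, w_t(i)$
(b) $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$
(c) $w_{t+1}(i) = w_t(i) \exp(-\alpha_t f_{(t)}(x_i)\, y_i)$
3. $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$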
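The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the names `weighted_error`, `adaboost`, `predict` and `make_stump` are hypothetical, the weighted error is divided by the total weight (an assumption, matching the normalised form on the later Weighted Errors slide), and the clipping of the error before taking the logarithm is an added numerical guard.

```python
import math

def weighted_error(f, xs, ys, w):
    """Weighted error of f, normalised by the total weight (an assumption)."""
    return sum(wi for x, y, wi in zip(xs, ys, w) if f(x) != y) / sum(w)

def adaboost(xs, ys, weak_classifiers, T):
    """Steps 1 and 2(a)-(c); returns the list of pairs (alpha_t, f_(t))."""
    n = len(xs)
    w = [1.0 / n] * n                                   # step 1: w_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        # (a) weak classifier with the smallest weighted error
        f_t = min(weak_classifiers, key=lambda f: weighted_error(f, xs, ys, w))
        eps = weighted_error(f_t, xs, ys, w)
        eps = min(max(eps, 1e-10), 1 - 1e-10)           # added guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)         # (b)
        ensemble.append((alpha, f_t))
        # (c) raise the weight of misclassified examples, lower the others
        w = [wi * math.exp(-alpha * f_t(x) * y) for x, y, wi in zip(xs, ys, w)]
    return ensemble

def predict(ensemble, x):
    """Step 3: sign(F_T(x)) with F_T(x) = sum_t alpha_t f_(t)(x)."""
    F = sum(alpha * f(x) for alpha, f in ensemble)
    return 1 if F > 0 else -1

def make_stump(b, s):
    """Hypothetical weak classifier on a scalar x: s if x > b, else -s."""
    return lambda x: s if x > b else -s

stumps = [make_stump(b + 0.5, s) for b in range(7) for s in (1, -1)]
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]          # no single stump separates these labels
model = adaboost(xs, ys, stumps, T=40)
print([predict(model, x) for x in xs])
```

The toy labels form three intervals, so every individual stump makes at least two mistakes, while the weighted vote of several stumps can reproduce all six labels.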
15
One-gene classifier
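Let $x_{1j}, \dots, x_{nj}$ be expressions of the $j$-th gene
One-gene classifier: $f_j(x) = +1$ if $x_j > b_j$, $-1$ if $x_j \le b_j$
$b_j = \arg\min_b \sum_i I(y_i \ne \mathrm{sgn}(x_{ij} - b))$
[Figure: expressions on the $x_j$ axis with candidate thresholds $b_j$; error numbers 5 5 5 6 466 5 565]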
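A threshold search of this kind might look as follows in Python. This is a sketch under stated assumptions: `fit_one_gene` is a hypothetical name, candidate thresholds are taken as midpoints between adjacent distinct expression values (plus one below the minimum and one above the maximum), and the optional weights `w` are included so that the same function also covers the weighted criterion used in later boosting rounds.

```python
def fit_one_gene(xj, y, w=None):
    """Return (b_j, error): the threshold minimising the (weighted) number of
    examples with y_i != sgn(x_ij - b), taking sgn(0) as -1."""
    n = len(xj)
    w = [1.0] * n if w is None else w
    vals = sorted(set(xj))
    # candidate thresholds (an assumption): one below, midpoints, one above
    cands = ([vals[0] - 1.0]
             + [(a + c) / 2 for a, c in zip(vals, vals[1:])]
             + [vals[-1] + 1.0])

    def err(b):
        return sum(wi for x, yi, wi in zip(xj, y, w)
                   if (1 if x > b else -1) != yi)

    best = min(cands, key=err)
    return best, err(best)

# hypothetical expressions of one gene for six individuals
xj = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
y = [-1, -1, 1, -1, 1, 1]
b, e = fit_one_gene(xj, y)
b_w, e_w = fit_one_gene(xj, y, w=[1, 1, 1, 3, 1, 1])
print(b, e, b_w, e_w)
```

With unit weights the search counts errors; passing weights turns the same search into the weighted version.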
16
The second training
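Update the weight: $\alpha_1 = 0.5 \log \frac{\text{nb. of correct ans.}}{\text{nb. of false ans.}} = 0.5 \log \frac{16}{4} = \log 2$
Weight up to 2 (false answers), weight down to 0.5 (correct answers)
$b_j = \arg\min_b \sum_i w(i)\, I(y_i \ne \mathrm{sgn}(x_{ij} - b))$
[Figure: weighted error numbers on the $x_j$ axis for candidate thresholds $b_j$: 45.5 7 9 87.56 798.5]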
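The arithmetic on this slide can be checked directly. The counts 16 and 4 are the ones shown on the slide; the variable names are illustrative only.

```python
import math

n_correct, n_false = 16, 4                       # counts from the slide
alpha_1 = 0.5 * math.log(n_correct / n_false)    # 0.5 * log 4 = log 2

up = math.exp(alpha_1)      # factor applied to misclassified examples
down = math.exp(-alpha_1)   # factor applied to correctly classified examples

# weighted error of the first classifier under the updated weights
new_error = (n_false * up) / (n_false * up + n_correct * down)
print(alpha_1, up, down, new_error)
```

After the update the four false answers carry the same total weight as the sixteen correct ones, so the first classifier is no better than chance in the second round and a different threshold or gene has to be chosen.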
17
Learning algorithm
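$D$
$\{w_1(1), \dots, w_1(n)\} \to f_{(1)}(x),\ \alpha_1$
$\{w_2(1), \dots, w_2(n)\} \to f_{(2)}(x),\ \alpha_2$
…
$\{w_T(1), \dots, w_T(n)\} \to f_{(T)}(x),\ \alpha_T$
$\sum_{t=1}^{T} \alpha_t f_{(t)}(x)$
Final machine: $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$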
18
Exponential loss
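$L(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(x_i)\}$
$\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_{(t)})$
$\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$, $\quad \epsilon_{t+1}(f_{(t)}) = \frac{1}{2}$
Update: $\{w_t(i)\} \to \{w_{t+1}(i)\}$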
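Both identities on this slide can be checked numerically. The sketch below uses made-up values for $F_{t-1}(x_i)$ and $f_{(t)}(x_i)$, takes the weights as $w_t(i) = \exp\{-y_i F_{t-1}(x_i)\}$, and normalises the weighted error by the total weight; all three choices are assumptions made for the illustration.

```python
import math

ys     = [1, 1, -1, -1, 1]
F_prev = [0.3, -0.2, -0.5, 0.4, 0.0]   # hypothetical values of F_{t-1}(x_i)
f_vals = [1, 1, -1, 1, -1]             # hypothetical values of f_(t)(x_i)

def loss_at(a):
    """Exponential loss of F_{t-1} + a * f_(t)."""
    return sum(math.exp(-y * (F + a * f))
               for y, F, f in zip(ys, F_prev, f_vals)) / len(ys)

# weights induced by F_{t-1} and the normalised weighted error of f_(t)
w = [math.exp(-y * F) for y, F in zip(ys, F_prev)]
eps = sum(wi for wi, f, y in zip(w, f_vals, ys) if f != y) / sum(w)
alpha = 0.5 * math.log((1 - eps) / eps)

# after the update, f_(t) has weighted error 1/2
w_new = [wi * math.exp(-alpha * f * y) for wi, f, y in zip(w, f_vals, ys)]
eps_new = sum(wi for wi, f, y in zip(w_new, f_vals, ys) if f != y) / sum(w_new)

print(alpha, loss_at(alpha), eps_new)
```

Because the loss is convex in the coefficient, the closed-form `alpha` is the global minimiser, and nearby values of the coefficient give a larger loss.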
19
Different datasets
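$D = \bigcup_{k=1}^{K} D_k$, where $D_k = \{(x_i^{(k)}, y_i^{(k)}): 1 \le i \le n_k\}$
$x_i^{(k)} \in \mathbb{R}^p$: expression vector of the same genes
$y_i^{(k)}$: label of the same clinical item
Normalization: $\mathbb{R}^p \ni x_i^{(k)} \mapsto \tilde{x}_i^{(k)} \in [0,1]^p$, with
$\tilde{x}_{ij}^{(k)} = \frac{x_{ij}^{(k)} - \min_{i'} x_{i'j}^{(k)}}{\max_{i'} x_{i'j}^{(k)} - \min_{i'} x_{i'j}^{(k)}} \quad (j = 1, \dots, p)$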
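The per-dataset normalization can be written as a short function. This is a sketch: `normalize` is a hypothetical name, each dataset is assumed to be a list of expression vectors, and mapping a gene that is constant within a dataset to 0 is an added convention, since the min-max formula is undefined in that case.

```python
def normalize(dataset):
    """Rescale each gene (column) of one dataset to [0, 1] by min-max scaling."""
    p = len(dataset[0])
    lo = [min(x[j] for x in dataset) for j in range(p)]
    hi = [max(x[j] for x in dataset) for j in range(p)]
    return [[(x[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(p)]
            for x in dataset]

# hypothetical dataset from one facility: 3 individuals, 2 genes
D_k = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
print(normalize(D_k))
```

Applying the function separately to each $D_k$ puts measurements from different platforms on a common scale before the datasets are combined.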
20
Weighted Errors
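The k-th weighted error:
$\epsilon_t^{(k)}(f) = \sum_{i=1}^{n_k} I(y_i^{(k)} \ne f(x_i^{(k)}))\, \frac{w_t^{(k)}(i)}{\sum_{i'=1}^{n_k} w_t^{(k)}(i')}$
The combined weighted error:
$\epsilon_t(f) = \sum_{k=1}^{K} \pi_t^{(k)} \epsilon_t^{(k)}(f)$
where $\pi_t^{(k)} = \frac{\sum_{i=1}^{n_k} w_t^{(k)}(i)}{\sum_{h=1}^{K} \sum_{i=1}^{n_h} w_t^{(h)}(i)}$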
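The two error definitions can be sketched as follows. The function names and the toy data are illustrative assumptions; each dataset is taken to be a pair `(xs, ys)` with its own list of weights.

```python
def kth_error(f, xs, ys, w):
    """k-th weighted error: misclassified weight divided by total weight in D_k."""
    return sum(wi for x, y, wi in zip(xs, ys, w) if f(x) != y) / sum(w)

def combined_error(f, datasets, weights):
    """Average of the k-th errors, each weighted by D_k's share of the total weight."""
    total = sum(sum(w) for w in weights)
    return sum((sum(w) / total) * kth_error(f, xs, ys, w)
               for (xs, ys), w in zip(datasets, weights))

# hypothetical data from two facilities
datasets = [([0.2, 0.8], [-1, 1]),
            ([0.4, 0.6, 0.9], [1, -1, 1])]
weights = [[0.5, 0.5], [1.0, 1.0, 2.0]]
f = lambda x: 1 if x > 0.5 else -1

print(kth_error(f, *datasets[0], weights[0]),
      kth_error(f, *datasets[1], weights[1]),
      combined_error(f, datasets, weights))
```

With these mixing proportions the combined error equals the weighted error computed on the pooled data: here the misclassified weight is 2 out of a total of 5.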
21
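BridgeBoost
Weights: $\{w_t^{(k)}(i): 1 \le i \le n_k\}_{k=1}^{K}$
(a) $f_t^{(k)} = \arg\min_f \epsilon_t^{(k)}(f)$
(b) $\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$
(c) $f_t(x) = \frac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(x)$
(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i) \exp(-\alpha_t^{(k)} f_t^{(k)}(x_i^{(k)})\, y_i^{(k)})$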
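One stage of steps (a) to (d) might be sketched as below. This is an illustration under explicit assumptions, not the speaker's code: `bridge_stage` and `werr` are hypothetical names; the coefficient in (b) is assumed to be computed from the combined weighted error over all K datasets; the update in (d) is assumed to use each dataset's own coefficient and weak classifier; and the clipping before the logarithm is an added guard.

```python
import math

def werr(f, xs, ys, w):
    """Normalised weighted error of f on one dataset."""
    return sum(wi for x, y, wi in zip(xs, ys, w) if f(x) != y) / sum(w)

def bridge_stage(datasets, weights, weak_classifiers):
    """One stage t; returns (f_t, new_weights)."""
    K = len(datasets)
    total = sum(sum(w) for w in weights)

    def combined(f):   # combined weighted error over all K datasets
        return sum(sum(w) / total * werr(f, xs, ys, w)
                   for (xs, ys), w in zip(datasets, weights))

    parts = []
    for (xs, ys), w in zip(datasets, weights):
        f_k = min(weak_classifiers, key=lambda f: werr(f, xs, ys, w))  # (a)
        eps = min(max(combined(f_k), 1e-10), 1 - 1e-10)   # added guard
        parts.append((0.5 * math.log((1 - eps) / eps), f_k))  # (b), ASSUMED: combined error

    def f_t(x):                                            # (c)
        return sum(a * f(x) for a, f in parts) / K

    new_weights = [[wi * math.exp(-a * f(x) * y)           # (d), ASSUMED form
                    for x, y, wi in zip(xs, ys, w)]
                   for ((xs, ys), w), (a, f) in zip(zip(datasets, weights), parts)]
    return f_t, new_weights

# hypothetical toy data: two facilities, one gene
stumps = [(lambda x, b=b: 1 if x > b else -1) for b in (0.5, 1.5, 2.5, 3.5, 4.5)]
datasets = [([1, 2, 3, 4], [-1, -1, 1, 1]),
            ([1, 2, 3, 4], [-1, 1, 1, 1])]
weights = [[0.25] * 4, [0.25] * 4]
f_t, new_w = bridge_stage(datasets, weights, stumps)
print([f_t(x) for x in (1, 2, 3, 4)])
```

In this toy case each facility's best stump is perfect on its own data but makes one mistake on the other facility's data, so both coefficients are computed from a combined error of 1/8 rather than from an error of zero.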
22
Learning
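Stage t: $f_t(x) = \frac{1}{K} \left( \alpha_t^{(1)} f_t^{(1)}(x) + \cdots + \alpha_t^{(K)} f_t^{(K)}(x) \right)$
$D_1$: $\{w_t^{(1)}(i)\} \to f_t^{(1)}(x),\ \alpha_t^{(1)} \to \{w_{t+1}^{(1)}(i)\}$
$D_2$: $\{w_t^{(2)}(i)\} \to f_t^{(2)}(x),\ \alpha_t^{(2)} \to \{w_{t+1}^{(2)}(i)\}$
…
$D_K$: $\{w_t^{(K)}(i)\} \to f_t^{(K)}(x),\ \alpha_t^{(K)} \to \{w_{t+1}^{(K)}(i)\}$
$D$: $f_t(x)$
Stage t+1: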
23
Mean exponential loss
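Exponential loss: $L_k(F) = \frac{1}{n_k} \sum_{i=1}^{n_k} \exp\{-y_i^{(k)} F(x_i^{(k)})\}$
Mean exponential loss: $L(F) = \sum_{k=1}^{K} L_k(F)$
$\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_t^{(k)})$
$L(F_{t-1} + f_t) \le \frac{1}{K} \sum_{k=1}^{K} L(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}) \le L(F_{t-1})$
Note: convexity of Expo-Loss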
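The role of convexity can be illustrated numerically: the loss of an averaged update is never larger than the average of the losses of the individual updates. The sketch below uses hypothetical data, coefficients and classifiers, and takes $L$ as the sum of the per-dataset exponential losses; the comparison is unchanged if $L$ is divided by $K$.

```python
import math

def L_k(F, xs, ys):
    """Exponential loss of F on one dataset."""
    return sum(math.exp(-y * F(x)) for x, y in zip(xs, ys)) / len(xs)

def L(F, datasets):
    """Sum of the per-dataset exponential losses."""
    return sum(L_k(F, xs, ys) for xs, ys in datasets)

datasets = [([1, 2, 3, 4], [-1, -1, 1, 1]),
            ([1, 2, 3, 4], [-1, 1, 1, 1])]
F_prev = lambda x: 0.0
# hypothetical (alpha, weak classifier) pairs, one per dataset
parts = [(0.8, lambda x: 1 if x > 2.5 else -1),
         (0.6, lambda x: 1 if x > 1.5 else -1)]
K = len(parts)

f_t = lambda x: sum(a * f(x) for a, f in parts) / K
lhs = L(lambda x: F_prev(x) + f_t(x), datasets)
rhs = sum(L(lambda x, a=a, f=f: F_prev(x) + a * f(x), datasets)
          for a, f in parts) / K
print(lhs, rhs, L(F_prev, datasets))
```

The first comparison holds for any coefficients by Jensen's inequality; the second, against the loss of the previous stage, relies on each coefficient reducing the loss, which is guaranteed when it is chosen as the minimiser.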
24
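Meta-learning
$f_t^{(k)}$ is on $D_k$; $L_h$ is on $D_h$
$L_h(F_{t-1} + \alpha f_t^{(k)})$ is cross-validatory for $h \ne k$
$L(F_{t-1} + \alpha f_t^{(k)}) = L_k(F_{t-1} + \alpha f_t^{(k)}) + \sum_{h \ne k} L_h(F_{t-1} + \alpha f_t^{(k)})$
First term: separate learning; second term: meta-learning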
25
Simulation
3 datasets $\{D_1, D_2, D_3\}$: $p = 100,\ n_1 = 50,\ n_2 = 50,\ n_3 = 50$
$D_1, D_2$ (data 1, data 2): test error 0 (ideal)
$D_3$ (data 3): test error 0.5 (ideal)
[Figure: training error and test error for the collapsed dataset]
26
Comparison
[Figure: training error and test error for Separate AdaBoost and for BridgeBoost]
27
Test errors
[Figure: test errors for Collapsed AdaBoost, Separate AdaBoost and BridgeBoost; panel annotations: Min = 15%, Min = 4%, Min = 43%, Min = 3%, Min = 4%]
28
Conclusion
$D_1 \to f(D_1) \to f(D_1 \mid D_{-1})$
$D_2 \to f(D_2) \to f(D_2 \mid D_{-2})$
…
$D_K \to f(D_K) \to f(D_K \mid D_{-K})$
→ result
Separate learning: $f(D_k)$
Meta-learning: $f(D_k \mid D_{-k})$
29
Unsolved problems
1. Which dataset should be joined or deleted in BridgeBoost?
2. Prediction of the class label for a given new x?
3. Information on the unmatched genes when combining datasets
4. Heterogeneity is OK, but publication bias?
30
Publication bias?
Passive smokers vs lung cancer
Mean and s.d. of 37 studies: $\{(\hat{\theta}_k, s_k): k = 1, \dots, 37\}$
[Figure: funnel plot; heterogeneity, publication bias]
(Copas & Shi, 2001)
31
References
[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo. 884, July (2003).
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J.Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).