
A combining approach to statistical methods for p >> n problems

Shinto Eguchi

Workshop on Statistical Genetics, Nov 9, 2004 at ISM

Microarray data

cDNA microarray

Prediction from gene expressions

Feature vector $x = (x_1, \dots, x_p)$

dimension = number of genes $p$; components = quantities of gene expression

Class label $y \in \{-1, +1\}$ (disease, adverse effect)

Classification machine

based on training dataset $\{ (x_i, y_i) : i = 1, \dots, n \}$

Leukemic diseases, Golub et al. http://www.broad.mit.edu/cgi-bin/cancer/publications/

Web microarray data

Dataset     p      n    y = +1   y = -1
ALLAML      7129   72   37       35
Colon       2000   62   40       22
Estrogen    7129   49   25       24

p >> n

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/

Genomic data

SNPs Proteome Microarray

dimension p 1,000～100,000

function 5,000~20,000

data size n 100 ～ 1000 5 ～ 20 20 ～ 100

mRNA Protein Genome

Problem: p >> n

Fundamental issue in bioinformatics

p is the dimension of biomarker

(SNPs, proteome, microarray, …)

n is the number of individuals

(informed consent, institutional protocol, …bioethics)

Current paradigm

Biomarker space $B \subset \mathbb{R}^p$

SNPs Haplotype block (Fujisawa)

Microarray Model-based clustering

Proteome Peak data reduction (Miyata)

GroupBoost (Takenouchi)

$p \gg n$, but $\dim(B) \ll p$

Network gene model

Haplotype & adverse effects (Matsuura)

An approach by combining

Let B be a biomarker space

Rapid expansion of genomic data

$D_k = \{ z_i : i = 1, \dots, n_k \}$

larger

Let $I_1, \dots, I_K$ be K experimental facilities

$p \gg n_k \quad (k = 1, \dots, K)$

Bridge Study?

CAMDA (Critical Assessment of Microarray Data Analysis)

DDBJ (DNA Data Bank Japan, NIG)

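[Diagram: the separate learners $f(D_1), f(D_2), \dots, f(D_K)$ lead to $f(D_1 \mid D_{-1}), f(D_2 \mid D_{-2}), \dots, f(D_K \mid D_{-K})$, which are combined into a single result]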

CAMDA 2003

4 datasets for Lung Cancer

Institution   Publication    Platform
Harvard       PNAS, 2001     Affymetrix
Michigan      Nature Med,    Affymetrix
Stanford      PNAS, 2001     cDNA
Ontario       Cancer Res

http://www.camda.duke.edu/camda03/datasets/

Some problems

1. Heterogeneity in feature space: cDNA, Affymetrix

2. Heterogeneous class-labeling: differences in covariates, medical diagnosis

3. Heterogeneous generalization powers: uncertainty in microarray experiments

4. Publication bias: a vast number of unpublished studies

Machine learning

Learnability: boosting weak learners?

AdaBoost : Freund & Schapire (1997)

weak classifiers $\{ f_1(x), \dots, f_p(x) \}$

A strong classifier, built stagewise:

$\alpha_1 f_{(1)}(x) \;\to\; \alpha_1 f_{(1)}(x) + \cdots + \alpha_t f_{(t)}(x)$

AdaBoost

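1. Initial settings: $w_1(i) = \frac{1}{n} \ (i = 1, \dots, n)$, $F_0(x) = 0$

2. For $t = 1, \dots, T$:

(a) $\varepsilon_t(f_{(t)}) = \min_f \varepsilon_t(f)$, where $\varepsilon_t(f) = \sum_{i=1}^{n} I(y_i \ne f(x_i)) \, w_t(i) \big/ \sum_{i=1}^{n} w_t(i)$

(b) $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t(f_{(t)})}{\varepsilon_t(f_{(t)})}$

(c) $w_{t+1}(i) = w_t(i) \exp(-\alpha_t f_{(t)}(x_i) y_i)$

3. $\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$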
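The AdaBoost steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the weak learner is a one-gene threshold rule with an added polarity `s` (so that a rule and its negation are both available), and the helper names `stump`, `best_stump`, `adaboost`, `predict`, the clipping constant and the toy data are all hypothetical.

```python
import math

def stump(x, j, b, s):
    """Weak classifier on gene j: s if x[j] > b, otherwise -s (s is +1 or -1)."""
    return s if x[j] > b else -s

def best_stump(X, y, w):
    """Step (a): the stump with the smallest normalised weighted error."""
    total = sum(w)
    best_rule, best_err = None, None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Thresholds: one below the smallest value, then midpoints of neighbours.
        cands = [values[0] - 1.0] + [(a + c) / 2 for a, c in zip(values, values[1:])]
        for b in cands:
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump(xi, j, b, s) != yi) / total
                if best_err is None or err < best_err:
                    best_rule, best_err = (j, b, s), err
    return best_rule, best_err

def adaboost(X, y, T):
    n = len(X)
    w = [1.0 / n] * n                                # step 1: uniform weights
    model = []                                       # list of (alpha_t, rule_t)
    for _ in range(T):                               # step 2
        rule, err = best_stump(X, y, w)              # (a)
        err = min(max(err, 1e-10), 1 - 1e-10)        # assumption: avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # (b)
        model.append((alpha, rule))
        w = [wi * math.exp(-alpha * yi * stump(xi, *rule))   # (c)
             for xi, yi, wi in zip(X, y, w)]
    return model

def predict(model, x):
    """Step 3: sign of F_T(x) = sum_t alpha_t f_(t)(x)."""
    F = sum(alpha * stump(x, *rule) for alpha, rule in model)
    return 1 if F >= 0 else -1

# Toy data (hypothetical): one gene, positives on both sides of the negatives.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, T=3)
train_errors = sum(predict(model, xi) != yi for xi, yi in zip(X, y))
```

No single threshold rule separates this toy data, but the weighted vote of three stages classifies all six samples correctly, which is the sense in which weak classifiers are boosted into a strong one.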

One-gene classifier

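Let $x_{1j}, \dots, x_{nj}$ be expressions of the j-th gene

one-gene classifier: $f_j(x) = +1$ if $x_j > b_j$, and $-1$ otherwise

$b_j = \arg\min_b \sum_{i=1}^{n} I(y_i \ne \mathrm{sgn}(x_{ij} - b))$

[Figure: error number at each candidate threshold b: 5 5 5 6 4 6 6 5 5 6 5]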
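The threshold search for a one-gene classifier can be illustrated as follows. This is a sketch only: the function name `fit_one_gene`, the choice of candidate thresholds (midpoints between sorted expression values) and the toy expression values are assumptions made here.

```python
def fit_one_gene(xs, ys):
    """Return (b, errors) for the rule f(x) = +1 if x > b else -1,
    where b minimises the number of misclassified samples."""
    values = sorted(set(xs))
    # Candidate thresholds: below all values, between neighbours, above all values.
    candidates = ([values[0] - 1.0]
                  + [(a + c) / 2 for a, c in zip(values, values[1:])]
                  + [values[-1] + 1.0])
    best_b, best_err = None, None
    for b in candidates:
        err = sum(1 for x, y in zip(xs, ys) if (1 if x > b else -1) != y)
        if best_err is None or err < best_err:   # keep the first minimiser
            best_b, best_err = b, err
    return best_b, best_err

# Hypothetical expressions of one gene for six individuals, with class labels.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
ys = [-1, -1, 1, -1, 1, 1]
b, err = fit_one_gene(xs, ys)   # smallest error count here is 1
```

Scanning the candidates and recording the error count at each one is what produces a row of error numbers like the one shown on the slide.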

The second training

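Error number 45.5 7 9

Update the weight:

$\alpha_1 = 0.5 \log \frac{\text{nb. of correct ans.}}{\text{nb. of false ans.}} = 0.5 \log \frac{16}{4}$

Weight up to 2 (misclassified samples)

Weight down to 0.5 (correctly classified samples)

$b_j = \arg\min_b \sum_{i=1}^{n} w(i) \, I(y_i \ne \mathrm{sgn}(x_{ij} - b))$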
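The weights 2 and 0.5 follow from the coefficient formula by direct arithmetic. The sketch below assumes 16 correct and 4 false answers in the first training; the 4 is inferred here from the stated weights and is not legible in the slide text.

```python
import math

n_correct, n_false = 16, 4                       # assumed counts (see above)
alpha = 0.5 * math.log(n_correct / n_false)      # 0.5 * log 4 = log 2

w_up = math.exp(alpha)     # multiplier for misclassified samples
w_down = math.exp(-alpha)  # multiplier for correctly classified samples

# After the update, both groups carry the same total weight, so the first
# classifier has weighted error 1/2 under the new weights.
mass_false = n_false * w_up
mass_correct = n_correct * w_down
```

Because the reweighted first classifier is no better than chance, the second training is forced to pick a different gene or threshold.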

Learning algorithm

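$w_1(1), \dots, w_1(n)$ → $f_{(1)}(x)$, $\alpha_1$

$w_2(1), \dots, w_2(n)$ → $f_{(2)}(x)$, $\alpha_2$

...

$w_T(1), \dots, w_T(n)$ → $f_{(T)}(x)$, $\alpha_T$

Final machine

$\mathrm{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f_{(t)}(x)$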

Exponential loss

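Exponential loss: $L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{ -y_i F(x_i) \}$

$\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t(f_{(t)})}{\varepsilon_t(f_{(t)})} = \arg\min_{\alpha} L_{\exp}(F_{t-1} + \alpha f_{(t)})$

$\min_{\alpha} L_{\exp}(F_{t-1} + \alpha f_{(t)}) = 2 \, L_{\exp}(F_{t-1}) \sqrt{\varepsilon_t(f_{(t)}) \{ 1 - \varepsilon_t(f_{(t)}) \}}$

Update: $\{ w_t(i) \} \to \{ w_{t+1}(i) \}$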
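The link between the exponential loss and the AdaBoost coefficient can be checked with a short derivation. It assumes a classifier $f$ taking values in $\{-1, +1\}$ and weights $w_t(i) \propto \exp\{-y_i F_{t-1}(x_i)\}$, with $\varepsilon_t(f)$ the weighted error normalised by the total weight; these conventions are assumptions made here to keep the steps explicit.

```latex
\begin{align*}
L_{\exp}(F_{t-1} + \alpha f)
  &= \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F_{t-1}(x_i)} \, e^{-\alpha y_i f(x_i)} \\
  &= L_{\exp}(F_{t-1}) \left\{ e^{\alpha} \, \varepsilon_t(f)
     + e^{-\alpha} \, \bigl(1 - \varepsilon_t(f)\bigr) \right\}, \\[4pt]
\frac{\partial}{\partial \alpha} L_{\exp}(F_{t-1} + \alpha f) = 0
  &\iff e^{\alpha} \, \varepsilon_t(f) = e^{-\alpha} \bigl(1 - \varepsilon_t(f)\bigr)
   \iff \alpha = \frac{1}{2} \log \frac{1 - \varepsilon_t(f)}{\varepsilon_t(f)}, \\[4pt]
\min_{\alpha} L_{\exp}(F_{t-1} + \alpha f)
  &= 2 \, L_{\exp}(F_{t-1}) \sqrt{\varepsilon_t(f) \bigl(1 - \varepsilon_t(f)\bigr)}.
\end{align*}
```

The last line is smallest when $\varepsilon_t(f)$ is farthest from $1/2$, which is why step (a) selects the weak classifier with the minimum weighted error.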

Different datasets

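$D_k = \{ (x_i^{(k)}, y_i^{(k)}) : i = 1, \dots, n_k \}$, where

$x_i^{(k)}$: expression vector of the same genes

$y_i^{(k)}$: label of the same clinical item

Normalization: $x_i^{(k)} \in \mathbb{R}^p \to [0, 1]^p$

$x_{ij}^{(k)} \leftarrow \frac{x_{ij}^{(k)} - \min_i x_{ij}^{(k)}}{\max_i x_{ij}^{(k)} - \min_i x_{ij}^{(k)}} \quad (j = 1, \dots, p)$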
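The normalization to $[0, 1]^p$ can be sketched as a per-gene min-max rescaling within each dataset. The assumption that the minimum and maximum are taken over the samples of one dataset, the function name `min_max_normalize`, the handling of constant genes and the toy values are all choices made for this illustration.

```python
def min_max_normalize(X):
    """Rescale each column (gene) of X to [0, 1] using the column's own
    minimum and maximum within this dataset."""
    p = len(X[0])
    lo = [min(row[j] for row in X) for j in range(p)]
    hi = [max(row[j] for row in X) for j in range(p)]
    return [
        # Assumption: a gene that is constant in the dataset is mapped to 0.
        [(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
         for j in range(p)]
        for row in X
    ]

# Hypothetical dataset: 3 individuals, 2 genes measured on different scales.
D1 = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
Z1 = min_max_normalize(D1)
```

Rescaling each dataset separately puts measurements from different platforms on a common range before they are combined.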

Weighted Errors

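The k-th weighted error

$\varepsilon_t^{(k)}(f) = \sum_{i=1}^{n_k} I(y_i^{(k)} \ne f(x_i^{(k)})) \, w_t^{(k)}(i) \big/ \sum_{i=1}^{n_k} w_t^{(k)}(i)$

The combined weighted error

$\varepsilon_t(f) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} I(y_i^{(k)} \ne f(x_i^{(k)})) \, w_t^{(k)}(i) \big/ \sum_{k=1}^{K} \sum_{i=1}^{n_k} w_t^{(k)}(i)$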

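BridgeBoost

Weights $\{ w_t^{(k)}(i) : 1 \le i \le n_k \}$

(a) $f_t^{(k)} = \arg\min_f \varepsilon_t^{(k)}(f)$

(b) $\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \varepsilon_t(f_t^{(k)})}{\varepsilon_t(f_t^{(k)})}$

(c) $f_t(x) = \frac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(x)$

(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i) \exp(-y_i^{(k)} f_t(x_i^{(k)}))$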
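One possible reading of steps (a) to (d) is sketched below. Every specific here is an assumption rather than the speaker's implementation: each dataset fits its own one-gene threshold rule (with a polarity `s`), the coefficient in (b) uses the weighted error pooled over all datasets, the stage function in (c) averages the K weighted rules, and the initial weights are `1 / n_k`. The helper names and toy data are hypothetical.

```python
import math

def stump(x, j, b, s):
    """One-gene rule: s if x[j] > b, otherwise -s (s is +1 or -1)."""
    return s if x[j] > b else -s

def weighted_error(rule, X, y, w):
    """Unnormalised weighted error: total weight of misclassified samples."""
    return sum(wi for xi, yi, wi in zip(X, y, w) if stump(xi, *rule) != yi)

def best_stump(X, y, w):
    """Rule minimising the weighted error on one dataset."""
    best_rule, best_err = None, None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        cands = [values[0] - 1.0] + [(a + c) / 2 for a, c in zip(values, values[1:])]
        for b in cands:
            for s in (1, -1):
                err = weighted_error((j, b, s), X, y, w)
                if best_err is None or err < best_err:
                    best_rule, best_err = (j, b, s), err
    return best_rule

def bridgeboost(datasets, T):
    """datasets: list of (X, y) pairs sharing the same genes and label coding."""
    K = len(datasets)
    weights = [[1.0 / len(X)] * len(X) for X, _ in datasets]
    model = []                                   # one list of (coefficient, rule) per stage
    for _ in range(T):
        total = sum(sum(w) for w in weights)
        stage = []
        for k, (X, y) in enumerate(datasets):
            rule = best_stump(X, y, weights[k])                      # (a)
            eps = sum(weighted_error(rule, Xh, yh, wh)               # pooled error
                      for (Xh, yh), wh in zip(datasets, weights)) / total
            eps = min(max(eps, 1e-10), 1 - 1e-10)                    # avoid log(0)
            alpha = 0.5 * math.log((1 - eps) / eps)                  # (b)
            stage.append((alpha / K, rule))                          # (c)
        model.append(stage)
        for k, (X, y) in enumerate(datasets):                        # (d)
            weights[k] = [wi * math.exp(-yi * sum(c * stump(xi, *r) for c, r in stage))
                          for xi, yi, wi in zip(X, y, weights[k])]
    return model

def predict(model, x):
    F = sum(c * stump(x, *r) for stage in model for c, r in stage)
    return 1 if F >= 0 else -1

# Hypothetical toy data: two facilities measuring the same single gene.
D1 = ([[0.1], [0.2], [0.8], [0.9]], [-1, -1, 1, 1])
D2 = ([[0.15], [0.35], [0.65], [0.95]], [-1, -1, 1, 1])
model = bridgeboost([D1, D2], T=2)
```

The point of the design is that each rule is fitted on one dataset but weighted by how well it does on all of them, so a rule that only fits its own dataset receives a small coefficient.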

Learning

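Stage t: $f_t = \frac{1}{K} \{ \alpha_t^{(1)} f_t^{(1)} + \cdots + \alpha_t^{(K)} f_t^{(K)} \}$

[Diagram: each dataset $D_k$ with weights $\{ w_t^{(k)}(i) \}$ yields $f_t^{(k)}(x)$ for $k = 1, \dots, K$; these are combined into $f_t(x)$, which updates the weights to $\{ w_{t+1}^{(k)}(i) \}$ for Stage t+1]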

Mean exponential loss

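Exponential loss: $L_k(F) = \frac{1}{n_k} \sum_{i=1}^{n_k} \exp\{ -y_i^{(k)} F(x_i^{(k)}) \}$

Mean exponential loss: $\bar{L}(F) = \frac{1}{K} \sum_{k=1}^{K} L_k(F)$

$\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \varepsilon_t(f_t^{(k)})}{\varepsilon_t(f_t^{(k)})} = \arg\min_{\alpha} \bar{L}(F_{t-1} + \alpha f_t^{(k)})$

$\bar{L}(F_{t-1} + f_t) \le \frac{1}{K} \sum_{k=1}^{K} \bar{L}(F_{t-1} + \alpha_t^{(k)} f_t^{(k)})$

Note: convexity of Expo-Loss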
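The role of convexity can be made explicit with Jensen's inequality. The derivation assumes that the stage function is the average $f_t = \frac{1}{K} \sum_k \alpha_t^{(k)} f_t^{(k)}$ and that $\bar{L}$ is the mean over datasets of the exponential losses; both are readings of the slide adopted here, not confirmed details.

```latex
\begin{align*}
\bar{L}\Bigl(F_{t-1} + \frac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}\Bigr)
  &= \bar{L}\Bigl(\frac{1}{K} \sum_{k=1}^{K}
     \bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr)\Bigr) \\
  &\le \frac{1}{K} \sum_{k=1}^{K}
     \bar{L}\bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr).
\end{align*}
```

The inequality holds because $F \mapsto \exp\{-y F(x)\}$ is convex and $\bar{L}$ is an average of such terms, so the combined update is at least as good as the average of the K separate updates.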

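Meta-learning

$\bar{L}(F_{t-1} + f_t^{(k)}) = \frac{1}{K} \{ L_k(F_{t-1} + f_t^{(k)}) + \sum_{h \ne k} L_h(F_{t-1} + f_t^{(k)}) \}$

The first term is separate learning; the second term is meta-learning.

$L_h(F_{t-1} + f_t^{(k)})$ with $h \ne k$ is cross-validatory: $L_h$ is on $D_h$; $f_t^{(k)}$ is on $D_k$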
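The split into a separate-learning term and a cross-validatory term can be illustrated numerically. In this sketch the mean loss is taken as the plain average of per-dataset exponential losses, which is an assumption; the names `exp_loss` and `split_mean_loss`, the score function `F` and the toy data are hypothetical.

```python
import math

def exp_loss(F, X, y):
    """Exponential loss of the score function F on one dataset."""
    return sum(math.exp(-yi * F(xi)) for xi, yi in zip(X, y)) / len(X)

def split_mean_loss(F, datasets, k):
    """Return (own-dataset term, cross-dataset term, mean loss) for index k."""
    losses = [exp_loss(F, X, y) for X, y in datasets]
    own = losses[k]                                            # separate learning
    cross = sum(l for h, l in enumerate(losses) if h != k)     # cross-validatory
    return own, cross, (own + cross) / len(datasets)

# Two hypothetical datasets on the same gene; the rule below fits D1 only.
D1 = ([[0.1], [0.9]], [-1, 1])
D2 = ([[0.4], [0.6]], [1, -1])
F = lambda x: 1.0 if x[0] > 0.5 else -1.0     # score function fitted on D1
own, cross, mean = split_mean_loss(F, [D1, D2], k=0)
```

Here the rule has a small loss on its own dataset and a large loss on the other one, so the cross-dataset term acts as an out-of-sample check on what was learned from `D1`.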

Simulation

Collapsed dataset

Training error, Test error

3 datasets $\{ D_1, D_2, D_3 \}$

$D_1, D_2$

Test error 0 (ideal)

Test error 0.5 (ideal)

$p = 100$, $n_1 = 50$, $n_2 = 50$, $n_3 = 50$

data 1, data 2

Comparison

[Figure: training error and test error curves for Separate AdaBoost and BridgeBoost; annotated minima, in source order: 15%, 4%, 43%, 3%, 4%]

[Figure: test errors for Collapsed AdaBoost, Separate AdaBoost and BridgeBoost]

Conclusion

[Diagram: Separate learning gives $f(D_1), f(D_2), \dots, f(D_K)$; Meta-learning gives $f(D_1 \mid D_{-1}), f(D_2 \mid D_{-2}), \dots, f(D_K \mid D_{-K})$; these are combined into a single result]

Unsolved problems

1. Which dataset should be joined or deleted in BridgeBoost?

2. Prediction of the class label for a given new x?

3. On the information on the unmatched genes in combining datasets

4. Heterogeneity is OK, but publication bias?

Mean and s.d. of 37 studies

$\{ (\hat{\theta}_k, s_k) : k = 1, \dots, 37 \}$

Passive smokers vs lung cancer

Funnel plot

heterogeneity, publication bias

Publication bias?

(Copas & Shi, 2001)

References

[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).

[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).

[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).

[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.

[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo. 884, July (2003).

[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).

[7] J. Copas and J.Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).
