
Page 1: 1 A combining approach to statistical methods for p >> n problems Shinto Eguchi Workshop on Statistical Genetics, Nov 9, 2004 at ISM


A combining approach to statistical methods for p >> n problems

Shinto Eguchi

Workshop on Statistical Genetics, Nov 9, 2004 at ISM

Page 2:

Microarray data

cDNA microarray

Page 3:

Prediction from gene expressions

Feature vector $\mathbf{x} = (x_1, \ldots, x_p)$: dimension = number of genes $p$; components = quantities of gene expression

Class label $y \in \{-1, +1\}$: disease, adverse effect

Classification machine $f: \mathbf{x} \mapsto y$, based on training dataset $\{(\mathbf{x}_i, y_i) : 1 \le i \le n\}$

Page 4:

Leukemic diseases, Golub et al.

http://www.broad.mit.edu/cgi-bin/cancer/publications/

Page 5:

Web microarray data

Dataset    p      n    y = +1   y = -1
ALLAML     7129   72   37       35
Colon      2000   62   40       22
Estrogen   7129   49   25       24

p >> n

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/

Page 6:

Genomic data

Data: SNPs, Proteome, Microarray (Genome, Protein, mRNA)

dimension p: 1,000 ~ 100,000

function: 5,000 ~ 20,000

data size n: 100 ~ 1000, 5 ~ 20, 20 ~ 100

Page 7:

Problem: p >> n

Fundamental issue in bioinformatics

p is the dimension of the biomarker

(SNPs, proteome, microarray, …)

n is the number of individuals

(informed consent, institutional protocol, … bioethics)

Page 8:

Current paradigm

Biomarker space $B \subset \mathbb{R}^p$

SNPs Haplotype block (Fujisawa)

Microarray Model-based clustering

Proteome Peak data reduction (Miyata)

GroupBoost (Takenouchi)

pnp )dim(but, B

Network gene model

Haplotype & adverse effects (Matsuura)

Page 9:

An approach by combining

Let B be a biomarker space

Rapid expansion of genomic data

$D_k = \{\mathbf{z}_i : i = 1, \ldots, n_k\}$

$\sum_{k=1}^{K} n_k \gg p$ for larger $K$

Let $I_1, \ldots, I_K$ be $K$ experimental facilities

$p \gg n_k$ $(k = 1, \ldots, K)$

Page 10:

Bridge Study?

CAMDA (Critical Assessment of Microarray Data Analysis)

DDBJ (DNA Data Bank of Japan, NIG)

[Diagram: datasets $D_1, D_2, \ldots, D_K$; separate learners $f(D_1), f(D_2), \ldots, f(D_K)$; conditional learners $f(\cdot \mid \cdot)$, one per dataset; result]

Page 11:

CAMDA 2003

4 datasets for Lung Cancer

Site       Publication        Platform
Harvard    PNAS, 2001         Affymetrix
Michigan   Nature Med, 2002   Affymetrix
Stanford   PNAS, 2001         cDNA
Ontario    Cancer Res, 2001   cDNA

http://www.camda.duke.edu/camda03/datasets/

Page 12:

Some problems

1. Heterogeneity in feature space

cDNA, Affymetrix

2. Heterogeneous class-labeling

Differences in covariates, medical diagnosis

3. Heterogeneous generalization powers

Uncertainty for microarray experiments

4. Publication bias

A vast number of unpublished studies

Page 13:

Machine learning

Learnability: boosting weak learners?

AdaBoost: Freund & Schapire (1997)

weak classifiers $\{f_1(\mathbf{x}), \ldots, f_p(\mathbf{x})\}$

stagewise: $\alpha_1 f_{(1)}(\mathbf{x})$, ..., $\alpha_1 f_{(1)}(\mathbf{x}) + \cdots + \alpha_t f_{(t)}(\mathbf{x})$

A strong classifier $f(\mathbf{x})$

Page 14:

AdaBoost

1. Initial settings: $w_1(i) = \frac{1}{n}$ $(i = 1, \ldots, n)$, $F_0(\mathbf{x}) = 0$

2. For $t = 1, \ldots, T$

(a) $\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$, where $\epsilon_t(f) = \sum_{i=1}^{n} I(y_i \ne f(\mathbf{x}_i))\, w_t(i)$

(b) $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})}$

(c) $w_{t+1}(i) = w_t(i) \exp(-\alpha_t f_{(t)}(\mathbf{x}_i)\, y_i)$

3. $\mathrm{sign}(F_T(\mathbf{x}))$, where $F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$
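The AdaBoost steps can be sketched in code. This is a minimal illustration, not the speaker's implementation. It assumes NumPy, uses one-gene threshold classifiers as the weak learners, allows a sign flip `s` on each stump, and divides the weighted error by the total weight before computing $\alpha_t$. The function names `fit_stump`, `stump_predict`, `adaboost` and `predict` are hypothetical.

```python
import numpy as np

def stump_predict(X, j, b, s):
    """One-gene classifier: s * (+1 if x_j > b else -1)."""
    return s * np.where(X[:, j] > b, 1, -1)

def fit_stump(X, y, w):
    """Step (a): search genes j and thresholds b for the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for b in np.unique(X[:, j]):
            err = w[stump_predict(X, j, b, 1) != y].sum()
            # Assumption: a flipped stump (s = -1) is also allowed.
            for s, e in ((1, err), (-1, w.sum() - err)):
                if e < best[0]:
                    best = (e, j, b, s)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: w_1(i) = 1/n
    model = []
    for _ in range(T):                            # step 2
        err, j, b, s = fit_stump(X, y, w)         # (a)
        # Assumption: normalise by the total weight and clip away from 0 and 1.
        eps = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)     # (b)
        w = w * np.exp(-alpha * y * stump_predict(X, j, b, s))  # (c)
        model.append((alpha, j, b, s))
    return model

def predict(model, X):
    """Step 3: sign of F_T(x) = sum_t alpha_t f_(t)(x); ties go to +1."""
    F = sum(a * stump_predict(X, j, b, s) for a, j, b, s in model)
    return np.where(F >= 0, 1, -1)
```

Here `X` is an n-by-p array of expressions and `y` an array of labels in {-1, +1}. The exhaustive threshold search costs on the order of n² p per round, which is acceptable for a sketch but slow for p in the thousands.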

Page 15:

One-gene classifier

Let $x_{1j}, \ldots, x_{nj}$ be expressions of the $j$-th gene

one-gene classifier: $f_j(\mathbf{x}) = +1$ if $x_j > b_j$, $-1$ if $x_j \le b_j$

$b_j = \arg\min_b \sum_i I(y_i \ne \mathrm{sgn}(x_{ij} - b))$

[Figure: error number at each candidate threshold along the $x_j$ axis, with the threshold $b_j$ marked. Error number: 5 5 5 6 4 6 6 5 5 6 5]
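The threshold search for a single gene can be written in a few lines. The sketch below assumes NumPy and takes the midpoints between neighbouring expression values (plus one point below the minimum) as candidate thresholds. That candidate grid and the name `one_gene_threshold` are illustrative choices, not taken from the slides.

```python
import numpy as np

def one_gene_threshold(xj, y):
    """Return (b_j, error number) minimising sum_i I(y_i != sgn(x_ij - b))."""
    xs = np.unique(xj)  # sorted distinct expression values
    # Assumption: candidates are one point below the minimum plus all midpoints.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    errors = [int(np.sum(np.where(xj > b, 1, -1) != y)) for b in candidates]
    k = int(np.argmin(errors))  # first minimiser if there are ties
    return float(candidates[k]), errors[k]

# Toy expressions of one gene for six individuals.
xj = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
y = np.array([-1, -1, 1, -1, 1, 1])
b_j, n_err = one_gene_threshold(xj, y)
```

On this toy input the best threshold leaves one misclassified individual, because the labels are not separable by a single cut.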

Page 16:

The second training

Update the weight: $\alpha_1 = 0.5 \log \frac{\text{nb. of correct ans.}}{\text{nb. of false ans.}} = 0.5 \log \frac{16}{4} = \log 2$

Weight up to 2

Weight down to 0.5

$b_j = \arg\min_b \sum_i w(i)\, I(y_i \ne \mathrm{sgn}(x_{ij} - b))$

[Figure: weighted error number at each candidate threshold along the $x_j$ axis, with the new threshold $b_j$ marked. Error number: 4 5.5 7 9 8 7.5 6 7 9 8.5 4.5]
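The arithmetic on this slide can be checked directly. The snippet below uses only the counts given (16 correct and 4 false answers) and the standard AdaBoost factors $e^{\alpha}$ for misclassified and $e^{-\alpha}$ for correctly classified examples. The variable names are illustrative.

```python
import math

n_correct, n_false = 16, 4                   # counts from the slide
alpha = 0.5 * math.log(n_correct / n_false)  # 0.5 * log 4 = log 2

up = math.exp(alpha)     # factor applied to misclassified examples (about 2)
down = math.exp(-alpha)  # factor applied to correctly classified examples (about 0.5)

# Total weight on each side after the update (starting from weight 1 each).
weighted_false = n_false * up
weighted_correct = n_correct * down
```

After the update both sides carry the same total weight, so the first classifier has weighted error 1/2 on the reweighted data and the second training is pushed towards a different threshold.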

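Page 17:

Learning algorithm

[Diagram: starting from the data $D$, the weights $w_1(1), \ldots, w_1(n)$ give $f_{(1)}(\mathbf{x})$ and $\alpha_1$; the weights $w_2(1), \ldots, w_2(n)$ give $f_{(2)}(\mathbf{x})$ and $\alpha_2$; and so on up to $w_T(1), \ldots, w_T(n)$, $f_{(T)}(\mathbf{x})$ and $\alpha_T$; the output is $\sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$]

Final machine: $\mathrm{sign}(F_T(\mathbf{x}))$, where $F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t f_{(t)}(\mathbf{x})$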

Page 18:

Exponential loss

Exponential loss: $L(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}$

$\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_{(t)})}{\epsilon_t(f_{(t)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_{(t)})$

$\epsilon_t(f_{(t)}) = \min_f \epsilon_t(f)$, $\epsilon_{t+1}(f_{(t)}) = \frac{1}{2}$

Update: $\{w_t(i)\} \to \{w_{t+1}(i)\}$
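The closed form for $\alpha_t$ and the value 1/2 for the updated error both follow from a short calculation. The sketch below assumes that the weights satisfy $w_t(i) \propto \exp\{-y_i F_{t-1}(\mathbf{x}_i)\}$ and that $\epsilon_t(f)$ is the weighted error divided by the total weight. The normalisation is an assumption made here for the derivation.

```latex
\begin{align*}
L(F_{t-1} + \alpha f)
  &= \frac{1}{n}\sum_{i=1}^{n} e^{-y_i F_{t-1}(\mathbf{x}_i)}\, e^{-\alpha y_i f(\mathbf{x}_i)}
   \;\propto\; e^{\alpha}\,\epsilon_t(f) + e^{-\alpha}\,\{1 - \epsilon_t(f)\}, \\
0 &= \frac{\partial}{\partial\alpha}\Bigl[e^{\alpha}\epsilon_t(f) + e^{-\alpha}\{1-\epsilon_t(f)\}\Bigr]
  \;\Longrightarrow\;
  \alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t(f)}{\epsilon_t(f)}, \\
\epsilon_{t+1}(f)
  &= \frac{e^{\alpha_t}\epsilon_t(f)}{e^{\alpha_t}\epsilon_t(f) + e^{-\alpha_t}\{1-\epsilon_t(f)\}}
   = \frac{1}{2}.
\end{align*}
```

The last line uses $e^{\alpha_t}\epsilon_t(f) = e^{-\alpha_t}\{1-\epsilon_t(f)\} = \sqrt{\epsilon_t(f)\{1-\epsilon_t(f)\}}$, so the classifier just selected is no better than chance under the new weights.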

Page 19:

Different datasets

$D = \bigcup_{k=1}^{K} D_k$, where $D_k = \{(\mathbf{x}_i^{(k)}, y_i^{(k)}) : 1 \le i \le n_k\}$

$\mathbf{x}_i^{(k)} \in \mathbb{R}^p$: expression vector of the same genes

$y_i^{(k)}$: label of the same clinical item

Normalization: $\mathbf{x}_i^{(k)} \in \mathbb{R}^p \to \mathbf{x}_i^{(k)} \in [0,1]^p$

$x_{ij}^{(k)} \mapsto \frac{x_{ij}^{(k)} - \min_i x_{ij}^{(k)}}{\max_i x_{ij}^{(k)} - \min_i x_{ij}^{(k)}}$ $(j = 1, \ldots, p)$
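The per-dataset min-max normalization can be sketched as follows, assuming NumPy and one array per dataset with individuals in rows and genes in columns. Mapping a constant gene to 0 is an added convention to avoid division by zero, and `minmax_normalize` is a hypothetical name.

```python
import numpy as np

def minmax_normalize(X):
    """Scale each gene (column) of one dataset to [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # assumption: a constant gene is mapped to 0
    return (X - lo) / span

# Two toy datasets measured on very different scales (same two genes).
datasets = [
    np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]),
    np.array([[100.0, 5.0], [300.0, 5.0]]),
]
normalized = [minmax_normalize(Xk) for Xk in datasets]
```

Normalizing each dataset separately, rather than the pooled data, removes scale differences between platforms or facilities before the datasets are combined.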

Page 20:

Weighted Errors

The k-th weighted error: $\epsilon_t^{(k)}(f) = \sum_{i=1}^{n_k} I(y_i^{(k)} \ne f(\mathbf{x}_i^{(k)}))\, \frac{w_t^{(k)}(i)}{\sum_{i'=1}^{n_k} w_t^{(k)}(i')}$

The combined weighted error: $\epsilon_t(f) = \sum_{k=1}^{K} \pi_t^{(k)} \epsilon_t^{(k)}(f)$

where $\pi_t^{(k)} = \frac{\sum_{i=1}^{n_k} w_t^{(k)}(i)}{\sum_{h=1}^{K} \sum_{i=1}^{n_h} w_t^{(h)}(i)}$
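A small numerical example may help fix the two definitions. The sketch assumes NumPy, a classifier `f` that maps an array of feature vectors to labels in {-1, +1}, and one weight vector per dataset. The function names are hypothetical.

```python
import numpy as np

def weighted_error(f, X, y, w):
    """k-th weighted error: misclassified weight divided by total weight."""
    return w[f(X) != y].sum() / w.sum()

def combined_weighted_error(f, data, weights):
    """Mix the per-dataset errors by each dataset's share of the total weight."""
    total = sum(w.sum() for w in weights)
    return sum((w.sum() / total) * weighted_error(f, X, y, w)
               for (X, y), w in zip(data, weights))

# Toy example: one feature, two datasets, a fixed classifier sgn(x).
f = lambda X: np.where(X[:, 0] > 0, 1, -1)
data = [
    (np.array([[1.0], [-1.0], [2.0]]), np.array([1, 1, 1])),
    (np.array([[-1.0], [1.0]]), np.array([1, 1])),
]
weights = [np.array([1.0, 1.0, 2.0]), np.array([3.0, 1.0])]
eps_1 = weighted_error(f, *data[0], weights[0])
eps_2 = weighted_error(f, *data[1], weights[1])
eps = combined_weighted_error(f, data, weights)
```

The combined error equals the misclassified weight pooled over all datasets divided by the pooled total weight, which is why the mixing coefficients are weight shares rather than sample-size shares.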

Page 21:

BridgeBoost

Weights: $\{w_t^{(k)}(i) : 1 \le i \le n_k\}_{k=1}^{K}$

(a) $f_t^{(k)} = \arg\min_f \epsilon_t^{(k)}(f)$

(b) $\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})}$

(c) $f_{(t)}(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} \alpha_t^{(k)} f_t^{(k)}(\mathbf{x})$

(d) $w_{t+1}^{(k)}(i) = w_t^{(k)}(i) \exp(-f_{(t)}(\mathbf{x}_i^{(k)})\, y_i^{(k)})$
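One possible reading of the BridgeBoost stage can be sketched in code. The sketch rests on several assumptions that are not confirmed by the slides: each dataset fits its own one-gene stump on its own weights, the coefficient of that stump comes from the error pooled over all datasets, the stage classifier is the average of the K weighted stumps, and every dataset's weights are updated with that average. NumPy and all function names are illustrative choices.

```python
import numpy as np

def stump(X, j, b, s):
    """One-gene classifier: s * (+1 if x_j > b else -1)."""
    return s * np.where(X[:, j] > b, 1, -1)

def fit_stump(X, y, w):
    """Best (j, b, s) for one dataset under its own weights."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for b in np.unique(X[:, j]):
            err = w[stump(X, j, b, 1) != y].sum()
            for s, e in ((1, err), (-1, w.sum() - err)):
                if e < best[0]:
                    best = (e, j, b, s)
    return best[1:]

def bridgeboost(data, T=10):
    """data: list of (X_k, y_k) pairs measured on the same p genes."""
    K = len(data)
    # Assumption: initial weights are 1/n_k within each dataset.
    weights = [np.full(len(y), 1.0 / len(y)) for _, y in data]
    model = []  # one list of K (alpha, j, b, s) tuples per stage
    for _ in range(T):
        total = sum(w.sum() for w in weights)
        stage = []
        for k, (Xk, yk) in enumerate(data):
            j, b, s = fit_stump(Xk, yk, weights[k])          # fit on dataset k
            # Error of this stump pooled over all K datasets.
            miss = sum(w[stump(X, j, b, s) != y].sum()
                       for (X, y), w in zip(data, weights))
            eps = np.clip(miss / total, 1e-10, 1 - 1e-10)
            stage.append((0.5 * np.log((1 - eps) / eps), j, b, s))
        model.append(stage)
        for k, (Xk, yk) in enumerate(data):
            # Average of the K weighted stumps, then the weight update.
            ft = sum(a * stump(Xk, j, b, s) for a, j, b, s in stage) / K
            weights[k] = weights[k] * np.exp(-yk * ft)
    return model

def predict(model, X):
    """Sign of the summed stage classifiers; ties go to +1."""
    K = len(model[0])
    F = sum(a * stump(X, j, b, s) for stage in model for a, j, b, s in stage) / K
    return np.where(F >= 0, 1, -1)
```

Because a stump fitted on one dataset is scored on all of them, a classifier that only fits the quirks of its own dataset receives a small coefficient. This is the sense in which the stage bridges the datasets in this sketch.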

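Page 22:

Learning

Stage $t$: $f_{(t)} = \frac{1}{K}\bigl(\alpha_t^{(1)} f_t^{(1)} + \cdots + \alpha_t^{(K)} f_t^{(K)}\bigr)$

[Diagram: at stage $t$, the datasets $D_1, D_2, \ldots, D_K$ with weights $\{w_t^{(1)}(i)\}, \{w_t^{(2)}(i)\}, \ldots, \{w_t^{(K)}(i)\}$ give $f_t^{(1)}(\mathbf{x}), f_t^{(2)}(\mathbf{x}), \ldots, f_t^{(K)}(\mathbf{x})$ and $\alpha_t^{(1)}, \alpha_t^{(2)}, \ldots, \alpha_t^{(K)}$; these are combined into $f_{(t)}(\mathbf{x})$, which yields the weights $\{w_{t+1}^{(1)}(i)\}, \{w_{t+1}^{(2)}(i)\}, \ldots, \{w_{t+1}^{(K)}(i)\}$ for stage $t+1$]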

Page 23:

Mean exponential loss

Exponential loss: $L_k(F) = \frac{1}{n_k} \sum_{i=1}^{n_k} \exp\{-y_i^{(k)} F(\mathbf{x}_i^{(k)})\}$

Mean exponential loss: $L(F) = \sum_{k=1}^{K} L_k(F)$

$\alpha_t^{(k)} = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t^{(k)})}{\epsilon_t(f_t^{(k)})} = \arg\min_{\alpha} L(F_{t-1} + \alpha f_t^{(k)})$

$L(F_{t-1} + f_{(t)}) \le \frac{1}{K} \sum_{k=1}^{K} L(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}) \le L(F_{t-1})$

Note: convexity of Expo-Loss
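The role of convexity can be spelled out in two steps. The sketch assumes that the stage classifier is the average $f_{(t)} = \frac{1}{K}\sum_{k} \alpha_t^{(k)} f_t^{(k)}$ and that each $\alpha_t^{(k)}$ minimises $L(F_{t-1} + \alpha f_t^{(k)})$ over $\alpha$. Both are readings of the slides rather than confirmed statements.

```latex
\begin{align*}
F_{t-1} + f_{(t)}
  &= \frac{1}{K}\sum_{k=1}^{K}\bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr), \\
L(F_{t-1} + f_{(t)})
  &\le \frac{1}{K}\sum_{k=1}^{K} L\bigl(F_{t-1} + \alpha_t^{(k)} f_t^{(k)}\bigr)
  && \text{(convexity of } L\text{)} \\
  &\le \frac{1}{K}\sum_{k=1}^{K} L(F_{t-1}) = L(F_{t-1})
  && \text{(}\alpha = 0 \text{ is admissible in each } \arg\min_{\alpha}\text{)}.
\end{align*}
```

Under these assumptions the mean exponential loss cannot increase from one stage to the next, even though each weak classifier is fitted on a single dataset.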

Page 24:

Meta-learning

$f_t^{(k)}$ is on $D_k$; $L_h$ is on $D_h$

$h \ne k$: $L_h(F_{t-1} + \alpha f_t^{(k)})$ is cross-validatory

$\sum_{h=1}^{K} L_h(F_{t-1} + \alpha f_t^{(k)}) = L_k(F_{t-1} + \alpha f_t^{(k)}) + \sum_{h \ne k} L_h(F_{t-1} + \alpha f_t^{(k)})$

Separate learning: the first term on the right; Meta-learning: the second term

Page 25:

Simulation

$p = 100$, $n_1 = 50$, $n_2 = 50$, $n_3 = 50$

3 datasets $\{D_1, D_2, D_3\}$

$D_1, D_2$ (data 1, data 2): Test error 0 (ideal)

$D_3$ (data 3): Test error 0.5 (ideal)

[Figure: Collapsed dataset; panels: Training error, Test error]

Page 26:

Comparison

[Figure: Training error and Test error panels for Separate AdaBoost and for BridgeBoost]

Page 27:

Test errors

[Figure: test error panels for Collapsed AdaBoost, Separate AdaBoost and BridgeBoost; panel minima, in order: Min = 15%, Min = 4%, Min = 43%, Min = 3%, Min = 4%]

Page 28:

Conclusion

[Diagram: datasets $D_1, D_2, \ldots, D_K$; Separate learning: $f(D_1), f(D_2), \ldots, f(D_K)$; Meta-learning: conditional learners $f(\cdot \mid \cdot)$, one per dataset; result]

Page 29:

Unsolved problems

1. Which dataset should be joined or deleted in BridgeBoost?

2. Prediction of the class label for a given new x?

3. On the information in the unmatched genes when combining datasets

4. Heterogeneity is OK, but publication bias?

Page 30:

Publication bias?

Passive smokers vs lung cancer

Mean and s.d. of 37 studies: $(\hat{\theta}_k, s_k)$, $k = 1, \ldots, 37$

Funnel plot: heterogeneity, publication bias

(Copas & Shi, 2001)

Page 31:

References

[1] S. Eguchi and J. Copas. A class of logistic-type discriminant functions. Biometrika 89, 1-22 (2002).
[2] N. Murata, T. Takenouchi, T. Kanamori and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation 16, 1437-1481 (2004).
[3] T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16, 767-787 (2004).
[4] T. Takenouchi, M. Ushijima and S. Eguchi. GroupAdaBoost for selecting important genes. In preparation.
[5] J. Copas and S. Eguchi. Local model uncertainty and incomplete data bias. ISM Research Memo. 884, July (2003).
[6] J. Copas and S. Eguchi. Local sensitivity approximation for selectivity bias. J. Royal Statistical Society B 63, 871-895 (2001).
[7] J. Copas and J.Q. Shi. Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal 7232, 417-418 (2000).