a sparse solution approach to gene selection for cancer ...microarray data i relative amount of...

27
A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/yklee May 13, 2005

Upload: others

Post on 08-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

A Sparse Solution Approach toGene Selection for Cancer Diagnosis

Using Microarray Data

Yoonkyung LeeDepartment of Statistics

The Ohio State Universityhttp://www.stat.ohio-state.edu/∼yklee

May 13, 2005

Page 2: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Microarray Data

I Relative amount of mRNAs of tens of thousands of genes.(e.g. Affymetrix Human U133a GeneChips: 22K genes)

I Gene expression profiles as fingerprints at the molecularlevel.

I Potential for cancer diagnosis or prognosis.I Examples

I Golub et al. (1999) Science: acute leukemia classification.I Perou et al. (2000) Nature: breast cancer subclass

identification.I Van’t Veer et al. (2002) Nature: breast cancer prognosis.

Page 3: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Objectives and related issues

I Accurate diagnosis or prognosis.I Interpretability.I Subset selection or dimension reduction.I Systematic approach to gene (variable) selection.I Assessment of the variability in analysis results.

Page 4: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Small Round Blue Cell Tumors of Childhood

I Khan et al. (2001) in Nature MedicineI Tumor types: neuroblastoma (NB), rhabdomyosarcoma

(RMS), non-Hodgkin lymphoma (NHL) and the Ewingfamily of tumors (EWS).

I Number of genes : 2308I Class distribution of data set

Data set EWS BL(NHL) NB RMS totalTraining set 23 8 12 20 63Test set 6 3 6 5 20Total 29 11 18 25 83

Page 5: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Classification problem

I A training data set {(x i , yi), i = 1, . . . , n}, whereyi ∈ {1, . . . , k}.

I Learn a functional relationship f between x = (x1, . . . , xp)and y from the training data, which can be generalized tonovel cases.

Page 6: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Measures of marginal association for gene selection

Pick genes with the largest marginal association.I Two-class: two-sample t-test statistic, correlation, and etc.I Multi-class: Dudoit et al. (2000)

For gene `, the ratio of between classes sum of squares towithin class sum of squares is defined as

BSS(`)

WSS(`)=

∑ni=1∑k

j=1 I(yi = j)(x (j)·` − x·`)2

∑ni=1∑k

j=1 I(yi = j)(xi` − x (j)·` )2

.

Page 7: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

0 0.5 1 1.5 2 2.5 3 3.50

0.5

1

1.5

2

2.5

3

3.5

4

4.5

gene order (log10

scale)

ratio

of b

etw

een

to w

ithin

observedpermuted (95%)

Figure: Observed ratios of between-class SS to within-class SS andthe 95 percentiles of the corresponding ratios for expression levelswith randomly permuted class labels.

Page 8: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

d=1 d=5, TE = 5−10%

d=100 (20~500), TE = 0%

EWS

NHL

NB

RMS

EWS NHL NB RMS

d=all, TE = 0−15%

Figure: Pairwise distance matrices for the training data as thenumbers of genes included change, and test error rates of MSVMwith Gaussian kernel.

Page 9: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Support Vector Machines

Vapnik (1995), http://www.kernel-machines.orgI yi ∈ {−1, 1}.I Find f (x) = b + h(x) with h ∈ HK minimizing

1n

n∑

i=1

(1 − yi f (x i))+ + λ‖h‖2HK

.

Then f (x) = b +∑n

i=1 ciK (x i , x), where K : a bivariatepositive definite function called a reproducing kernel.

I Classification rule: φ(x) = sign[f (x)].

Page 10: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Hinge loss

−2 −1 0 1 2

0

1

2

t=yf

[−t]*

(1−t)+

Figure: (1 − yf (x))+ is an upper bound of the misclassification lossfunction I(y 6= φ(x)) = [−yf (x)]∗ ≤ (1 − yf (x))+ where [t ]∗ = I(t ≥ 0)and (t)+ = max{t , 0}.

Page 11: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

−1 −0.5 0 0.5 1−1

−0.5

0

0.5

1

wx+b=−1

wx+b=0

wx+b=1

Margin=2/||w||

Figure: Linear Support Vector Machine example in separable case.Red and blue circles indicate positive and negative examples,respectively. The solid line is the separating hyperplane with themaximum margin in this example.

Page 12: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Characteristics of Support Vector Machines

I Competitive classification accuracy.I Flexibility - implicit embedding through kernel.I Handle high dimensional data.I A black box unless the embedding is explicit.

Page 13: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Variable (feature) Selection

I The best subset selection.I Nonnegative garrote [Breiman, Technometrics (1995)]I Least Absolute Shrinkage and Selection Operator

[Tibshirani, JRSS (1996)]I COmponent Selection and Smoothing Operator

[Lin & Zhang, Technical Report (2003)]I Structural modelling with sparse kernels

[Gunn & Kandola, Machine Learning (2002)]

Page 14: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Functional ANOVA decomposition

Wahba (1990)

I Function: f (x) = b +∑p

α=1 fα(xα) +∑

α<β fαβ(xα, xβ) + · · ·

I Functional space: f ∈ H = ⊗pα=1({1} ⊕ Hα),

H = {1} ⊕∑p

α=1 Hα ⊕∑

α<β(Hα ⊗ Hβ) ⊕ · · ·

I Reproducing kernel (r.k.):K (x , x ′) = 1 +

∑pα=1 Kα(x , x ′) +

α<β Kαβ(x , x ′) + · · ·

I Modification of r.k. by rescaling parameters θ ≥ 0Kθ(x , x ′) = 1+

∑pα=1 θαKα(x , x ′)+

α<β θαβKαβ(x , x ′)+· · ·

Page 15: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

l1 penalty on θ

I Truncating H to F = {1} ⊕dν=1 Fν , find f (x) ∈ F minimizing

1n

n∑

i=1

L(yi , f (x i)) + λ∑

ν

θ−1ν ‖Pν f‖2.

Then f (x) = b +∑n

i=1 ci∑d

ν=1 θνKν(x i , x).

I For sparsity, minimize

1n

n∑

i=1

L(yi , f (x i)) + λ∑

ν

θ−1ν ‖Pν f‖2 + λθ

ν

θν

subject to θν ≥ 0, ∀ν.

Page 16: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Structured MSVM with ANOVA decomposition

Lee, Lin & Wahba, JASA (2004)I Find f = (f 1, . . . , f k ) = (b1 + h1(x), . . . , bk + hk (x)) with

the sum-to-zero constraint minimizing

1n

n∑

i=1

L(y i) · (f (x i) − y i)+ +λ

2

k∑

j=1

(

d∑

ν=1

θ−1ν ‖Pνhj‖2

)

+λθ

d∑

ν=1

θν subject to θν ≥ 0, for ν = 1, . . . , d .

I By the representer theorem,f j(x) = bj +

∑ni=1 c j

i

∑dν=1 θνKν(x i , x).

Page 17: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Updating Algorithm

Denoting the objective function by Φ(θ, b, C),I Initialize θ

(0) = (1, . . . , 1)t and(b(0), C(0)) = argmin Φ(θ(0), b, C).

I At the m-th step (m = 1, 2, . . .)I θ-step:

Find θ(m) minimizing Φ(θ, b(m−1)

, C(m−1)) with (b, C) fixed.I c-step:

Find (b(m), C(m)) minimizing Φ(θ(m), b, C) with θ fixed.

I One-step update can be used in practice.

Page 18: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

A toy example: visualization of the θ-step

x1

x 2

0.0 0.5 1.0

0.0

0.5

1.0

0.6

0.7

0.8

0.9

0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

θ1

θ 2

Figure: Left: the scatter plot of two covariates with class labelsdistinguished by color. Right: the feasible region of (θ1, θ2) in grayand a level plot of the θ-step objective function.

Page 19: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

The trajectory of θ as a function of λθ

log2(λθ)

θ

−15 −10 −5 0

0.0

0.5

1.0

121 x 2

Figure: The trajectories of θ1; ◦, θ2; +, and θ12; × with the two-wayinteraction spline kernel with λθ tuned by GCKL. The overlap of +and × indicates that the trajectories of θ2 and θ12 are almostindistinguishable.

Page 20: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

A synthetic miniature data set

I It consists of 100 genes from Khan et al. (63 training and20 test cases)

I Use the F-ratio for each gene based on the training casesonly.

I The top 20 genes as variables truly associated with theclass.

I The bottom 80 genes with the class label randomlyjumbled as irrelevant variables.

I 100 replicates by bootstrapping samples from thisminiature data set keeping the class proportions the sameas the original data.

Page 21: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

The proportion of gene inclusion (%)

0 20 40 60 80 100

020

4060

8010

0

Variable id

Prop

ortio

n

Figure: The proportion of inclusion (%) of each gene in the finalclassifiers over 100 runs. The dotted line delimits informativevariables from noninformative ones. 10-fold CV was used for tuning.

Page 22: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Summary of the synthetic miniature data analysiswith Structured MSVM

I 13 informative genes were selected more than 95% of theruns.

I 74 noninformative genes were picked up less than 5% ofthe runs.

I Gene selection resulted in decrease in the test error rateover the 20 test cases on average of 0.0095 (from 0.0555to 0.0460) with standard error 0.00413.

Page 23: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

The original data with 2308 genes

gene rank

Pro

port

ion

of in

clus

ion

0 500 1000 1500 2000

0.0

0.5

1.0

Figure: The proportion of selection of each gene in one-step updatedSMSVMs for 100 bootstrap samples. Genes are presented in theorder of marginal rank in the original sample.

Page 24: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

0.0 0.2 0.4 0.6 0.8 1.0

050

010

0015

0020

00

Proportion of inclusion

Cum

ulat

ive

num

ber

of g

enes

Figure: The number of genes selected less often than or asfrequently as a given proportion in 100 runs.

Page 25: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Summary of the full data analysis

I The empirical distribution of the number of genes includedin one-step updates contained the middle 50% of valuesbetween 212 and 228 with median 221.

I 67 genes were consistently selected for more than 95% ofthe time.

I About 2000 genes were selected less than 20% of thetime.

I Gene selection led to reduction in test error rates by0.0230 on average (from 0.0455 to 0.0225) with standarderror of 0.00484.

I It also reduced the variance of test error rates.

Page 26: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

Concluding remarks

I Integrate feature selection with learning classification rules.I Enhance interpretation without compromising prediction

accuracy.I Characterize the solution path of SMSVM for effective

computation and tuning.I Efron et al. LAR (2004) and Hastie et al. SVM solution path

(2004).I c-step: the entire spectrum of solutions from the simplest

majority rule to the complete overfit to data.I θ-step: the entire spectrum of solutions from the constant

model to the full model with all the variables.

Page 27: A Sparse Solution Approach to Gene Selection for Cancer ...Microarray Data I Relative amount of mRNAs of tens of thousands of genes. (e.g. Affymetrix Human U133a GeneChips: 22K genes)

The following papers are available fromwww.stat.ohio-state.edu/∼yklee.

I Structured Multicategory Support Vector Machine withANOVA decomposition, Lee, Y., Kim, Y., Lee, S., and Koo,J.-Y., Technical Report No. 743, The Ohio State University,2004.

I Characterizing the Solution Path of Multicategory SupportVector Machines, Lee, Y. and Cui, Z., Technical Report No.754, The Ohio State University, 2005.