
Penalized Logistic Regression and Classification of Microarray Data

Milan, May 2003

Anestis Antoniadis

Laboratoire IMAG-LMC

University Joseph Fourier

Grenoble, France


Outline

Logistic regression
    Classical
    Multicollinearity
    Overfitting
Penalized Logistic Regression
    Ridge
    Other penalization methods
Classification
An example of application

Introduction

Logistic regression provides a good method for classification by modeling the probability of membership of a class with transforms of linear combinations of explanatory variables.

Classical logistic regression does not work for microarrays because there are far more variables than observations.

Particular problems are multicollinearity and overfitting.

A solution: use penalized logistic regression.

Two-class classification

Situation: A number of biological samples have been collected, preprocessed and hybridized to microarrays. Each sample can be in one of two classes. Suppose that we have n1 microarrays taken from one of the classes and n2 microarrays taken from the other (e.g., 1 = ALL, 2 = AML); n1 + n2 = n.

Question: Find a statistical procedure that uses the expression profiles measured on the arrays to compute the probability that a sample belongs to one of the two classes, and use it to classify any new array that comes along.

Solution: Logistic regression?

Logistic regression

Let Yj indicate the status of array j, j = 1, . . . , n, and let xij, i = 1, . . . , m, be the normalized and scaled expressions of the m genes on that array.

Imagine that one specific gene has been selected as a good candidate for discrimination between the two classes (say y = 0 and y = 1).

If X is the expression measured for this gene, let p(x) be the probability that an array with measured expression X = x represents a class of type Y = 1.

A simple regression model would be

p(x) = α + βx

with α and β estimated with a standard linear regression procedure.

Not a good idea!

p(x) may be estimated with negative values
p(x) may be larger than 1

Solution

Transform p(x) to η(x):

η(x) = log[p(x) / (1 − p(x))] = α + βx.

The curve that computes p(x) from η(x),

p(x) = 1 / (1 + exp(−η(x))),

is called the logistic curve. This is a special case of the generalized linear model. Fast and stable algorithms to estimate the parameters exist (the glm function in R).
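As a minimal sketch (the names expr and y are hypothetical, not from the slides), such a single-gene logistic model can be fitted in R as:

    # Fit a single-gene logistic regression; y holds 0/1 class labels
    # and expr the expression values of the selected gene.
    fit <- glm(y ~ expr, family = binomial)   # binomial family = logit link
    p.hat <- predict(fit, type = "response")  # fitted probabilities p(x)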

Example from the ALL/AML data

[Figure: Simple logistic regression fit. Predicted probabilities (0 to 1) plotted against the expression value (roughly 2.5 to 4.0) of one gene. The estimated curve gives the probability of y = 1 (AML) for a given value of x; error rate = 13.16%.]

It is straightforward to extend the model with more variables (gene expressions), introducing explanatory variables x1, x2, . . . , xp:

η(x1, x2, . . . , xp) = log[p(x) / (1 − p(x))] = α + ∑_{i=1}^p β_i x_i

The maximum likelihood algorithm for GLMs can be extended to this case very easily.

So why not use all expressions on an array to build a logistic regression model for class prediction?
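In R the extension is just a longer model formula; a minimal sketch (the data frame genes with columns x1, x2, x3 is hypothetical):

    # Several genes enter the linear predictor together.
    fit <- glm(y ~ x1 + x2 + x3, data = genes, family = binomial)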

Rule of thumb

A rule of thumb, for a moderate number of explanatory variables (fewer than 15), is that the number of observations n should be at least five times the number of explanatory variables. Here n ≪ m! So there are many more unknowns than equations, and infinitely many solutions exist.

Another problem is a perfect fit to the data (no bias but high variance). This is something to distrust, because the large variability in the estimates produces a prediction formula for discrimination with almost no power!

Moreover, the algorithm may be highly unstable due to multicollinearities in the x's.

Is logistic regression doomed to failure?

Penalized logistic regression

Consider first a classical multiple linear regression model

Y = µ + ε = Xβ + ε,

where Y and µ are n × 1 and X is an n × m matrix. The n components of the mean µ of Y are modeled as linear combinations of the m columns of X. The regression coefficients are estimated by minimizing

S = (1/n) ∑_{i=1}^n (y_i − ∑_{j=1}^m x_{ij} β_j)²,

leading to

β̂ = (X′X)⁻¹X′Y.

A large m may lead to serious overfitting.

A remedy

The key idea in penalization methods is that overfitting is avoided by imposing a penalty on large fluctuations of the estimated parameters, and thus of the fitted curve. Denote

S_n(β) = (1/n) ∑_{i=1}^n (y_i − ∑_{j=1}^m x_{ij} β_j)² + λJ(β),

where J(β) is a penalty that discourages high values of the elements of β.

When J(β) = ‖β‖₂² = ∑_{j=1}^m β_j², the method is called quadratic regularization or ridge regression, and leads, for λ > 0, to the unique solution

β̂_λ = (X′X + λI_m)⁻¹X′Y.
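The ridge solution is a one-line computation; a minimal sketch in R (X and Y are assumed to exist, centered and scaled as needed):

    # Ridge estimate for a given penalty; unique even when m > n.
    lambda <- 1
    beta.ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)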

Penalizing logistic regression

η_i = log[p_i / (1 − p_i)] = α + ∑_{j=1}^m x_{ij} β_j

In the GLM jargon:

η is the linear predictor.

The logit function log(x/(1 − x)) is the canonical link function for the binomial family.

ℓ(y, β) = ∑_{i=1}^n y_i log p_i + ∑_{i=1}^n (1 − y_i) log(1 − p_i) is the log-likelihood, and −ℓ + (λ/2) J(β) is the penalized negative log-likelihood.

Denote by u the vector of ones in R^n. Differentiating the penalized negative log-likelihood with respect to α and the β_j's and setting the derivatives to zero gives

u′(y − p) = 0
X′(y − p) = λβ

The equations are nonlinear because of the nonlinear relation between p and (α, β). A first-order Taylor expansion around the current approximations α̃, β̃ (with fitted probabilities p̃) gives

p_i ≈ p̃_i + (∂p_i/∂α)(α − α̃) + ∑_{j=1}^m (∂p_i/∂β_j)(β_j − β̃_j)

Now

∂p_i/∂α = p_i(1 − p_i),   ∂p_i/∂β_j = p_i(1 − p_i) x_{ij}.

Using this and setting W = diag(p_i(1 − p_i)), we obtain the linearized system

u′Wu α + u′WX β = u′(y − p̃ + Wη̃),
X′Wu α + (X′WX + λI) β = X′(y − p̃ + Wη̃),

to be solved iteratively, the right-hand sides being evaluated at the current approximation. Suitable starting values are α = log(ȳ/(1 − ȳ)), where ȳ is the mean of the y_i's, and β = 0.
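Putting the pieces together, here is a hedged sketch of the resulting penalized IRLS iteration in R (function and variable names are mine, not from the slides; no safeguard against weights near zero):

    # Ridge-penalized logistic regression by iteratively reweighted
    # least squares; X is n x m, y is a 0/1 vector, intercept unpenalized.
    ridge.logistic <- function(X, y, lambda, maxit = 50, tol = 1e-8) {
      m <- ncol(X)
      Z <- cbind(1, X)                                  # Z = [u | X]
      R <- diag(c(0, rep(1, m)))                        # r11 = 0: alpha not penalized
      theta <- c(log(mean(y) / (1 - mean(y))), rep(0, m))
      for (it in 1:maxit) {
        eta <- drop(Z %*% theta)
        p <- 1 / (1 + exp(-eta))
        w <- p * (1 - p)
        z.work <- eta + (y - p) / w                     # working response
        theta.new <- solve(t(Z) %*% (w * Z) + lambda * R,
                           t(Z) %*% (w * z.work))
        if (max(abs(theta.new - theta)) < tol) { theta <- theta.new; break }
        theta <- theta.new
      }
      list(alpha = theta[1], beta = theta[-1])
    }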

Choosing λ

The choice of λ is crucial! We need a procedure that estimates the "optimal" value of the smoothing or ridge parameter λ from the data.

Cross-validation: set apart some of the data, fit a model to the rest and see how well it predicts. Several schemes can be conceived. One is to set apart, say, one third of the data. More demanding is "leave-one-out" cross-validation: each of the n observations is set apart in turn and the model is fitted to the n − 1 remaining ones. This is rather expensive, as the amount of work is proportional to n(n − 1).

To measure the performance in cross-validation one may use either the fraction of misclassifications or the strength of the log-likelihood prediction ∑_i (y_i log p_{−i} + (1 − y_i) log(1 − p_{−i})), where p_{−i} is the probability predicted for observation i by the model fitted without it.
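A sketch of the leave-one-out log-likelihood criterion, reusing the ridge.logistic() sketch above:

    # Predictive log-likelihood under leave-one-out cross-validation.
    cv.loglik <- function(X, y, lambda) {
      ll <- 0
      for (i in seq_len(nrow(X))) {
        fit <- ridge.logistic(X[-i, , drop = FALSE], y[-i], lambda)
        p.i <- 1 / (1 + exp(-(fit$alpha + sum(X[i, ] * fit$beta))))
        ll <- ll + y[i] * log(p.i) + (1 - y[i]) * log(1 - p.i)
      }
      ll
    }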

AIC

Akaike's Information Criterion (AIC): choose the λ that properly balances the complexity of the model and the fidelity to the data. AIC is defined as

AIC = Dev(y|p) + 2 effdim,

where Dev(·) is the deviance (equal to −2ℓ) and effdim is the effective dimension of the model. Hastie and Tibshirani estimate this by

effdim = trace(Z(Z′WZ + λR)⁻¹Z′W),

where Z = [u|X] and R is the (m + 1) × (m + 1) identity matrix with r₁₁ = 0.
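At the converged fit this is a few lines; a sketch using the quantities from the IRLS sketch above (Z, w, R, lambda and p recomputed at the converged α, β), with the trace rewritten by cyclic permutation:

    # Effective dimension and AIC at the fitted model.
    effdim <- sum(diag(solve(t(Z) %*% (w * Z) + lambda * R,
                             t(Z) %*% (w * Z))))   # = trace(Z(Z'WZ+λR)⁻¹Z'W)
    dev <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))
    aic <- dev + 2 * effdim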

Other choices of J

The behavior of the resulting estimate depends not only on λ but also on the form of the penalty function J(β). Another form that one could consider is

J(β) = ∑_k γ_k ψ(β_k), with γ_k > 0.

Several penalty functions have been used in the literature.

Penalties

Typical examples are:

The L1 penalty ψ(β) = |β| results in LASSO (first proposed by Donoho and Johnstone (1994) in the wavelet setting and extended by Tibshirani (1996) to general least squares settings).

More generally, the Lq penalty (0 ≤ q ≤ 1) ψ(β) = |β|^q leads to bridge regression (see Frank and Friedman (1993)).

Such penalties have the feature that many of the components of β are shrunk all the way to 0. In effect, these coefficients are deleted. Therefore, such procedures perform a model selection.
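The shrink-to-zero effect is easiest to see in the simplest case: for an orthonormal design the L1 penalty acts by soft-thresholding the unpenalized coefficients. A small sketch:

    # Soft-thresholding: coefficients below the threshold become exactly 0.
    soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
    soft(c(-2, -0.3, 0.1, 1.5), t = 0.5)   # -1.5  0.0  0.0  1.0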

Conditions on ψ

Usually, the penalty function ψ is chosen to be symmetric and increasing on [0, +∞). Furthermore, ψ can be convex or non-convex, smooth or non-smooth.

In the wavelet setting, Antoniadis and Fan (2001) provide some insights into how to choose a penalty function. A good penalty function should result in

unbiasedness,
sparsity,
stability.

Examples

Penalty function                                 Convexity   Smoothness at 0   Authors
ψ(β) = |β|                                       yes         ψ′(0+) = 1        Rudin 1992
ψ(β) = |β|^α, α ∈ (0, 1)                         no          ψ′(0+) = ∞        Saquib 1998
ψ(β) = α|β|/(1 + α|β|)                           no          ψ′(0+) = α        Geman 1992, 1995
ψ(0) = 0, ψ(β) = 1 for all β ≠ 0                 no          discontinuous     Leclerc 1989
ψ(β) = |β|^α, α > 1                              yes         yes               Bouman 1993
ψ(β) = αβ²/(1 + αβ²)                             no          yes               McClure 1987
ψ(β) = min{αβ², 1}                               no          yes               Geman 1984
ψ(β) = √(α + β²)                                 yes         yes               Vogel 1987
ψ(β) = log(cosh(αβ))                             yes         yes               Green 1990
ψ(β) = β²/2 if |β| ≤ α; α|β| − α²/2 if |β| > α   yes         yes               Huber 1990

Examples of penalty functions

An example

As in Eilers et al. (2002):

The Golub data set, preprocessed as in Dudoit et al. (2002).

Prefilter the expressions in X with a floor of 100 and a ceiling of 1600.

Eliminate columns of X with max/min ≤ 5 and max − min ≤ 500.

Take logarithms to base 10.

We are left with m = 3571 genes and 38 arrays, 25 of them ALL and 11 AML.

To find λ, AIC was used, with log λ considered on a grid from 0 to 5. The optimal value was λ = 400, with a corresponding effdim of 4.5.
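A sketch of this prefiltering in R, following the slide's criteria literally (X is the raw expression matrix with arrays in rows and genes in columns; the names are mine):

    # Floor/ceiling, filter on ratio and range, then log-transform.
    X <- pmax(pmin(X, 1600), 100)
    mx <- apply(X, 2, max); mn <- apply(X, 2, min)
    keep <- !(mx / mn <= 5 & mx - mn <= 500)   # drop genes failing both criteria
    X <- log10(X[, keep])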

The choice of λ

[Figure: AIC and effdim as a function of log λ. Left panel: Akaike's Information Criterion (roughly 40 to 65) against log λ from 0 to 5. Right panel: effective dimension (roughly 0 to 16) against log λ.]

One can get an impression of the estimated coefficients by doing a sequence plot of them.

To give the coefficients equal weight, one may also display them multiplied by the standard deviation of the corresponding column of X, which produces a scale-invariant plot.

Estimated coefficients

[Figure: Estimated regression coefficients and their histograms. Top: sequence plots of the regression coefficients and of the weighted regression coefficients against gene number (1 to about 3600), with values on the order of ±0.02. Bottom: histograms of the two sets of coefficients.]

Without any knowledge of the expressions, but knowing that a fraction ȳ of the arrays is from AML cases, one would classify an array as AML with probability ȳ.

Thus a very simple decision rule would be to compute p and see whether it is lower or higher than ȳ.

The real test is the classification of new data that were not used for the estimation of the model. The test set consists of 20 cases of ALL and 14 of AML. This rule would put three arrays in the wrong class.
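As a sketch (names hypothetical), the rule compares the predicted probabilities with the training fraction of AML:

    # Classify a new array as AML when its predicted probability
    # exceeds the AML fraction observed in the training labels.
    threshold <- mean(y.train)
    class.new <- ifelse(p.new > threshold, "AML", "ALL")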

Classification

[Figure: Classification probabilities as a function of λ, with a vertical line at p = ȳ. Top: predicted probabilities of AML for the training set and the test set. Bottom: probabilities for the training and test sets as functions of log λ, and the predictive log-likelihood on the test set as a function of log λ.]

Same example with prefiltering

This time we have used the procedures of Dudoit et al. (2002).

We pre-filter and select the 280 most discriminating genes (on the learning set).

To perform the computations we have used the lrm.fit function in the Design package.
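A hedged sketch of such a fit with the Design package (nowadays distributed as rms); X280 and y are hypothetical names for the 280-gene matrix and the class labels, and λ = 200 is the penalty used in the figures below:

    # Penalized (ridge-type) logistic regression via lrm.
    library(Design)                      # library(rms) in current R
    fit <- lrm(y ~ X280, penalty = 200)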

Estimated coefficients (280)

[Figure: Estimated regression coefficients for the 280 selected genes and their histograms.]

Classification (LS)

[Figure: Calibration plot on the learning set at λ = 200: actual probability against predicted probability, with logistic calibration, nonparametric, and grouped-observation curves. R² = 1.000, Brier = 0.077, Intercept = 12.758, Slope = 22.090.]

Classification (TS)

[Figure: Calibration plot on the test set at λ = 200: actual probability against predicted probability, with logistic calibration, nonparametric, and grouped-observation curves. R² = 0.905, Brier = 0.134, Intercept = 4.505, Slope = 7.102.]

Some references

A. Antoniadis and J. Fan, Journal of the American Statistical Association, 2001.
P. H. C. Eilers, J. M. Boer, G.-J. van Ommen and H. C. van Houwelingen (2002), preprint.
T. R. Golub, D. K. Slonim, P. Tamayo, et al., Science 286, pp. 531–537, 1999.
A. E. Hoerl, R. W. Kennard, and R. W. Hoerl, Applied Statistics 34, pp. 114–120, 1985.
B. D. Marx and P. H. C. Eilers, Technometrics 41, pp. 1–13, 1999.
P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall, London, 1989.
I. E. Frank and J. H. Friedman, Technometrics 35, pp. 109–148, 1993.
J. C. van Houwelingen and S. le Cessie, Statistics in Medicine 9, pp. 1303–1325, 1990.
K. P. Burnham and D. R. Anderson, Model Selection and Inference, Springer, New York, 1998.
T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, 1990.
R. Tibshirani, Journal of the Royal Statistical Society, Series B 58, pp. 267–288, 1996.