
Page 1:

A Bayesian Approach to Joint Feature Selection and Classifier Design

Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, Fellow, IEEE, and Mario A. T. Figueiredo, Senior Member, IEEE

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 26, NO. 9, SEPTEMBER 2004

Page 2:

Outline

• Introduction to MAP
• Introduction to Expectation-Maximization
• Introduction to Generalized linear models
• Introduction
• Sparsity-promoting priors
• MAP parameter estimation via EM
• Experimental Results
• Conclusion

Page 3:

Introduction to MAP

• Maximum-likelihood estimation
  – $\theta$ is treated as a fixed but unknown parameter vector

• In MAP estimation
  – $\theta$ is treated as a random vector described by a pdf $p(\theta)$
  – Assume $p(\theta)$ is known
  – Given a set of training samples $D = \{x_1, x_2, \ldots, x_n\}$

Page 4:

Introduction to MAP

• Find the maximum of the posterior $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$:

  $\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta)$

  – If $p(\theta)$ is uniform or flat enough, the MAP estimate coincides with the maximum-likelihood estimate $\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\; p(D \mid \theta)$
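The slide's point can be made concrete with a small numerical illustration (my own, not from the paper): the ML and MAP estimates of a Gaussian mean under a zero-mean Gaussian prior, where the MAP estimate approaches the ML estimate as the prior is made flatter.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                              # known noise variance
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)   # training samples D

theta_ml = x.mean()                                       # ML estimate: sample average

def theta_map(prior_var):
    """MAP estimate of the mean under a N(0, prior_var) prior (conjugate case)."""
    n = len(x)
    return (x.sum() / sigma2) / (n / sigma2 + 1.0 / prior_var)

for v in [0.01, 1.0, 100.0, 1e6]:                         # flatter and flatter prior
    print(f"prior var {v:>9}: MAP = {theta_map(v):.4f}  (ML = {theta_ml:.4f})")
```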

Pages 5-8:

Introduction to Expectation-Maximization
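For reference, the generic EM iteration for MAP estimation with missing data $z$ (standard material, stated here rather than recovered from these slides) alternates two steps:

```latex
% E-step: expected complete-data log-posterior, averaging over the missing data z
\[
Q\big(\theta \,\big|\, \hat{\theta}^{(t)}\big)
  = \mathbb{E}_{z}\Big[ \log p(D, z \mid \theta) \;\Big|\; D, \hat{\theta}^{(t)} \Big] + \log p(\theta)
\]

% M-step: maximize the expected complete-data log-posterior over the parameters
\[
\hat{\theta}^{(t+1)} = \arg\max_{\theta}\; Q\big(\theta \,\big|\, \hat{\theta}^{(t)}\big)
\]
```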

Page 9:

Introduction to generalized linear models

• Supervised learning can be formalized as the problem of inferring a function $y = f(x)$ from a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

• When $y$ is continuous (e.g., $y \in \mathbb{R}$), we are in the context of regression, whereas in classification problems $y$ is of categorical nature (e.g., binary, $y \in \{-1, 1\}$)

• The function $f$ is assumed to have a fixed structure and to depend on a set of parameters $\beta$

• We write $y = f(x, \beta)$, and the goal becomes to estimate $\beta$ from the training data

Page 10:

Introduction to generalized linear models

• Regression functions that are linear with respect to the parameters:

  $f(x, \beta) = \beta^{T} h(x)$

  where $h(x)$ is a vector of $k$ fixed functions of the input, often called features.
  – Linear regression: $h(x) = [1, x_1, \ldots, x_d]^{T}$, with $k = d + 1$
  – Nonlinear regression via a set of $k$ fixed basis functions $\{\phi_j(\cdot)\}$: $h(x) = [\phi_1(x), \ldots, \phi_k(x)]^{T}$
  – Kernel regression: $h(x) = [1, K(x, x^{(1)}), \ldots, K(x, x^{(n)})]^{T}$, where $K(\cdot, \cdot)$ is a kernel function and the $x^{(i)}$ are the training points
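A minimal sketch of these choices of $h(x)$ and of the resulting design matrix $H$ (my own illustration, not code from the paper):

```python
import numpy as np

def h_linear(x):
    """Linear regression features: a constant term plus the raw inputs."""
    return np.concatenate(([1.0], x))

def h_basis(x, centers, width=1.0):
    """Fixed nonlinear basis functions, here Gaussian bumps at chosen centers."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))

def h_kernel(x, X_train, kernel):
    """Kernel 'features': a constant term plus K(x, x_i) for every training point."""
    return np.concatenate(([1.0], [kernel(x, xi) for xi in X_train]))

def design_matrix(X, h):
    """Stack h(x)^T for every row x of X, giving the n-by-k design matrix H."""
    return np.vstack([h(x) for x in X])

# Example: an RBF kernel and the resulting kernel design matrix
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.random.default_rng(1).normal(size=(5, 3))       # 5 points, 3 input dimensions
H = design_matrix(X, lambda x: h_kernel(x, X, rbf))    # shape (5, 6): bias + 5 kernels
print(H.shape)
```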

Page 11:

Introduction to generalized linear models

– Assume that the output variables in the training set were contaminated by additive white Gaussian noise:

  $y^{(i)} = \beta^{T} h(x^{(i)}) + w^{(i)}, \quad i = 1, \ldots, n$

  where $\{w^{(i)}\}$ is a set of independent zero-mean Gaussian samples with variance $\sigma^{2}$.

– With $y = [y^{(1)}, \ldots, y^{(n)}]^{T}$, the likelihood function is

  $p(y \mid \beta) = N(y \mid H\beta, \sigma^{2} I)$

  where $H$ is the so-called design matrix; its element in row $i$ and column $j$, denoted $H_{ij}$, is given by $h_j(x^{(i)})$.

Page 12:

Introduction to generalized linear models

– With a zero-mean Gaussian prior for $\beta$, with covariance $A$:

  $p(\beta) = N(\beta \mid 0, A)$

– The posterior is still Gaussian, so its mode coincides with its mean and the MAP estimate has a closed form:

  $\hat{\beta} = \arg\max_{\beta}\; p(D \mid \beta)\, p(\beta) = (H^{T} H + \sigma^{2} A^{-1})^{-1} H^{T} y$
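A small numerical sketch of this closed-form MAP estimate (my own; it assumes an isotropic prior covariance $A = I$ for simplicity), which in this special case reduces to ridge regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 4
H = rng.normal(size=(n, k))                       # design matrix, rows h(x^(i))^T
beta_true = np.array([1.0, -2.0, 0.0, 0.5])
sigma2 = 0.25
y = H @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

A = np.eye(k)                                     # prior covariance of beta (assumed isotropic here)
beta_map = np.linalg.solve(H.T @ H + sigma2 * np.linalg.inv(A), H.T @ y)

beta_ml = np.linalg.lstsq(H, y, rcond=None)[0]    # ML / least-squares estimate for comparison
print("MAP:", np.round(beta_map, 3))
print("ML :", np.round(beta_ml, 3))
```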

Page 13:

Introduction to generalized linear models

– In logistic regression, the link function is the logistic sigmoid applied to the linear predictor $z = \beta^{T} h(x)$:

  $P(y = 1 \mid x) = \dfrac{1}{1 + \exp(-z)}$

– Probit link or probit model: the link is the standard Gaussian cumulative distribution function,

  $\Phi(z) = \int_{-\infty}^{z} N(x \mid 0, 1)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right) dx$
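A quick comparison of the two link functions (not from the paper), using scipy's standard normal CDF; the curves nearly coincide once the probit argument is rescaled by roughly 1.7:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-4, 4, 9)

logistic = 1.0 / (1.0 + np.exp(-z))      # logistic link
probit = norm.cdf(z)                     # probit link: standard Gaussian CDF
probit_scaled = norm.cdf(z / 1.7)        # rescaled probit, close to the logistic curve

for zi, a, b, c in zip(z, logistic, probit, probit_scaled):
    print(f"z={zi:5.1f}  logistic={a:.3f}  probit={b:.3f}  probit(z/1.7)={c:.3f}")
```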

Page 14:

Introduction to generalized linear models

• Hidden variable

  $z(x, \beta) = \beta^{T} h(x) + w$

  where $w$ is zero-mean, unit-variance Gaussian noise

• If the classification rule is to assign $y = 1$ whenever $z(x, \beta) \geq 0$ and $y = 0$ otherwise, we recover the probit model (the derivation is given below)

• Given training data $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, introduce the hidden variables $z = [z^{(1)}, \ldots, z^{(n)}]^{T}$, where $z^{(i)} = \beta^{T} h(x^{(i)}) + w^{(i)}$
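The step this rule relies on is a one-line calculation (standard, not quoted from the paper): because $w$ is a standard Gaussian, thresholding $z$ at zero reproduces the probit link of the previous slide.

```latex
\[
P(y = 1 \mid x)
  = P\big(z(x, \beta) \ge 0\big)
  = P\big(w \ge -\beta^{T} h(x)\big)
  = \Phi\big(\beta^{T} h(x)\big),
\qquad \text{since } w \sim N(0, 1) \text{ and } 1 - \Phi(-a) = \Phi(a).
\]
```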

Page 15:

Introduction to generalized linear models

• If we had $z$, we would have a simple linear likelihood with unit noise variance:

  $p(z \mid \beta) = N(z \mid H\beta, I)$

• This suggests using the EM algorithm to estimate $\beta$, treating $z$ as missing data

Page 16:

Introduction

• Given the training set

  $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \quad x^{(i)} \in \mathbb{R}^{p}, \; y^{(i)} \in \{0, 1\},$

  there are two standard tasks:
  – Classifier design: to learn a function that most accurately predicts the class of a new example
  – Feature selection: to identify a subset of the features that is most informative about the class distinction

Page 17:

Introduction

• In this paper: joint classifier and feature optimization (JCFO)
  – A nonnegative scaling factor $\theta_k$ is associated with each feature $x_k$
  – These scaling factors are then estimated from the data, under an a priori preference for values that are either significantly large or exactly zero

Page 18:

Introduction

• In this paper, the focus is on probabilistic kernel classifiers of the form

  $P(y = 1 \mid x) = \Phi\!\left(\beta_0 + \sum_{i=1}^{n} \beta_i\, K_{\theta}(x, x^{(i)})\right)$

  where $\Phi(z) = \int_{-\infty}^{z} N(x \mid 0, 1)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right) dx$ is the Gaussian cumulative distribution function, also known as the probit link

Page 19:

Introduction

• $K_{\theta}(x, x^{(i)})$ is a (possibly nonlinear) measure of similarity between the test example $x$ and the training example $x^{(i)}$. Two of the most popular kernels, here with the feature scaling factors $\theta_k$ built in:

  – $r$th-order polynomial:

    $K_{\theta}(x, x^{(i)}) = \left(1 + \sum_{k=1}^{p} \theta_k\, x_k\, x_k^{(i)}\right)^{r}$

  – Gaussian radial basis function (RBF):

    $K_{\theta}(x, x^{(i)}) = \exp\!\left(-\sum_{k=1}^{p} \theta_k\, \big(x_k - x_k^{(i)}\big)^{2}\right)$
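A compact sketch of these two scaled kernels and of the classifier form from the previous slide (my own illustration; the parameter values are arbitrary, not estimates produced by JCFO):

```python
import numpy as np
from scipy.stats import norm

def poly_kernel(x, xi, theta, r=2):
    """r-th order polynomial kernel with per-feature scaling factors theta."""
    return (1.0 + np.sum(theta * x * xi)) ** r

def rbf_kernel(x, xi, theta):
    """Gaussian RBF kernel with per-feature scaling factors theta."""
    return np.exp(-np.sum(theta * (x - xi) ** 2))

def prob_class1(x, X_train, beta, theta, kernel=rbf_kernel):
    """P(y=1|x) = Phi(beta_0 + sum_i beta_i * K_theta(x, x_i)), probit link."""
    k_vals = np.array([kernel(x, xi, theta) for xi in X_train])
    return norm.cdf(beta[0] + beta[1:] @ k_vals)

# Toy example: 4 training points in 3 dimensions, arbitrary beta and theta.
# A zero in theta switches the corresponding feature off entirely;
# a zero in beta[i+1] switches off the kernel centred on training point i.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(4, 3))
beta = np.array([0.1, 1.2, 0.0, -0.7, 0.4])     # beta_0 plus one weight per training point
theta = np.array([2.0, 0.0, 0.5])               # feature 2 is effectively removed
print(prob_class1(rng.normal(size=3), X_train, beta, theta))
```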

Page 20:

Introduction

• JCFO seeks sparsity in its use of both basis functions (sparsity in $\beta = [\beta_0, \ldots, \beta_n]^{T}$) and features (sparsity in $\theta = [\theta_1, \ldots, \theta_p]^{T}$)

• For kernel classification, sparsity in the use of basis functions is known to impact the capacity of the classifier, which controls its generalization performance

• Sparsity in feature utilization is another important factor for increased robustness

Page 21:

Sparsity-promoting priors

• To encourage sparsity in the estimates of the parameter vectors $\beta$ and $\theta$, we adopt a Laplacian prior for each

• For small $\beta_i$, the difference between $p(\beta_i = 0)$ and $p(\beta_i)$ is much larger for a Laplacian than for a Gaussian

• As a result, using a Laplacian prior in a learning procedure that seeks to maximize the posterior density

  $p(\beta_i \mid D) \propto p(D \mid \beta_i)\, p(\beta_i)$

  strongly favors values of $\beta_i$ that are exactly 0 over values that are merely close to 0
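To make the sparsity argument concrete (standard material, included for reference): a Laplacian prior of scale $\lambda$ on each coefficient turns MAP estimation into likelihood maximization with an L1 penalty, and it is the non-differentiable corner of $|\beta_i|$ at zero that drives coefficients exactly to zero.

```latex
\[
p(\beta_i \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |\beta_i|}
\quad\Longrightarrow\quad
\log p(\beta \mid D) = \log p(D \mid \beta) - \lambda \sum_i |\beta_i| + \text{const.}
\]
```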

Page 22:

Sparsity-promoting priors

• To avoid the nondifferentiability at the origin, we use an alternative hierarchical formulation that is equivalent to the Laplacian prior:
  – Let each $\beta_i$ have a zero-mean Gaussian prior with its own variance $\tau_i$:

    $p(\beta_i \mid \tau_i) = N(\beta_i \mid 0, \tau_i)$

  – Let all the variances $\tau_i$ be independently distributed according to a common exponential distribution (the hyperprior):

    $p(\tau_i \mid \gamma_1) = \frac{\gamma_1}{2} \exp\!\left(-\frac{\gamma_1 \tau_i}{2}\right), \quad \text{for } \tau_i \geq 0$

Page 23:

Sparsity-promoting priors

– The effective prior can be obtained by marginalizing with respect to each $\tau_i$:

  $p(\beta_i \mid \gamma_1) = \int_{0}^{\infty} p(\beta_i \mid \tau_i)\, p(\tau_i \mid \gamma_1)\, d\tau_i = \frac{\sqrt{\gamma_1}}{2} \exp\!\left(-\sqrt{\gamma_1}\, |\beta_i|\right)$

  which is exactly the Laplacian prior.

• For each scaling coefficient $\theta_k$, we adopt a similar sparsity-promoting prior, but we must ensure that any estimate of $\theta_k$ is nonnegative
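A small numerical check (my own, not from the paper) that integrating the Gaussian prior against the exponential hyperprior really does produce the Laplacian density above:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

gamma1 = 4.0

def marginal(beta):
    """p(beta | gamma1) = integral over tau of N(beta | 0, tau) * Exp(tau; gamma1/2)."""
    integrand = lambda tau: norm.pdf(beta, loc=0.0, scale=np.sqrt(tau)) \
                            * (gamma1 / 2.0) * np.exp(-gamma1 * tau / 2.0)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

def laplacian(beta):
    """Closed-form result: (sqrt(gamma1)/2) * exp(-sqrt(gamma1) * |beta|)."""
    return 0.5 * np.sqrt(gamma1) * np.exp(-np.sqrt(gamma1) * np.abs(beta))

for b in [0.0, 0.3, 1.0, 2.5]:
    print(f"beta={b:4.1f}  numeric={marginal(b):.6f}  closed-form={laplacian(b):.6f}")
```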

Page 24:

Sparsity-promoting priors

– A hierarchical model for $\theta_k$ similar to the one described above, but with the Gaussian prior replaced by a truncated Gaussian prior that explicitly forbids negative values:

  $p(\theta_k \mid \rho_k) \propto N(\theta_k \mid 0, \rho_k)$ for $\theta_k \geq 0$, and $p(\theta_k \mid \rho_k) = 0$ for $\theta_k < 0$

– An exponential hyperprior on the variances:

  $p(\rho_k \mid \gamma_2) = \frac{\gamma_2}{2} \exp\!\left(-\frac{\gamma_2 \rho_k}{2}\right), \quad \text{for } \rho_k \geq 0$

Page 25:

Sparsity-promoting priors

– The effective prior for $\theta_k$ can be obtained by marginalizing with respect to $\rho_k$; the result is an exponential density, $p(\theta_k \mid \gamma_2) = \sqrt{\gamma_2}\, \exp\!\left(-\sqrt{\gamma_2}\, \theta_k\right)$ for $\theta_k \geq 0$, which again promotes estimates that are exactly zero.

Page 26:

MAP parameter estimation via EM

• Given the priors described above, our goal is to find the maximum a posteriori (MAP) estimate:

  $(\hat{\beta}, \hat{\theta}) = \arg\max_{(\beta,\, \theta)}\; p(D \mid \beta, \theta)\, p(\beta \mid \gamma_1)\, p(\theta \mid \gamma_2)$

Page 27:

MAP parameter estimation via EM

• We use an EM algorithm that finds MAP estimates by exploiting the hierarchical prior models and the latent-variable interpretation of the probit model

• Define the $(n+1)$-dimensional vector function

  $h(x) = [1,\; K_{\theta}(x, x^{(1)}),\; \ldots,\; K_{\theta}(x, x^{(n)})]^{T}$

  and the random function

  $z(x, \beta, \theta) = \beta^{T} h(x) + w, \quad \text{where } w \sim N(0, 1)$

Page 28:

MAP parameter estimation via EM

• If a classifier were to assign the label $y = 1$ to an example $x$ whenever $z(x, \beta, \theta) \geq 0$ and $y = 0$ whenever $z(x, \beta, \theta) < 0$,

• we would recover the probit model:

  $P(y = 1 \mid x) = \Phi\big(\beta^{T} h(x)\big)$

• Consider the vector of missing variables

  $z = [z^{(1)}, \ldots, z^{(n)}]^{T}, \quad \text{where } z^{(i)} = z(x^{(i)}, \beta, \theta),$

  together with the prior variances $\tau = [\tau_0, \ldots, \tau_n]^{T}$ (for the $\beta_i$) and $\rho = [\rho_1, \ldots, \rho_p]^{T}$ (for the $\theta_k$)

Page 29:

MAP parameter estimation via EM

• The EM algorithm produces a sequence of estimates $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$ by alternating between the E-step and the M-step

• E-step
  – Compute the expected value of the complete log-posterior, conditioned on the data $D$ and the current estimates of the parameters, $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$

Page 30:

MAP parameter estimation via EM
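One standard ingredient of such an E-step (the truncated-Gaussian mean of the probit latent variable, stated here as background rather than quoted from the paper) is the expected value of each $z^{(i)}$ given its label and the current linear predictor:

```python
import numpy as np
from scipy.stats import norm

def expected_z(mu, y):
    """E[z | y] for the probit latent variable z ~ N(mu, 1),
    where mu = beta^T h(x) and the observed label truncates z to z >= 0 (y = 1)
    or z < 0 (y = 0). Standard truncated-normal means:
        y = 1:  mu + pdf(mu) / cdf(mu)
        y = 0:  mu - pdf(mu) / (1 - cdf(mu))
    """
    mu = np.asarray(mu, dtype=float)
    y = np.asarray(y)
    correction = norm.pdf(mu) / np.where(y == 1, norm.cdf(mu), 1.0 - norm.cdf(mu))
    return mu + np.where(y == 1, correction, -correction)

# Example: two training points with current linear predictors mu and labels y
print(expected_z(mu=[0.5, -1.2], y=[1, 0]))
```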

Page 31:

MAP parameter estimation via EM

• M-step

Page 32:

Experimental Results

• Each data set is initially normalized to have zero mean and unit variance

• The regularization parameters $\gamma_1$ and $\gamma_2$ (controlling the degrees of sparsity enforced on $\beta$ and $\theta$, respectively) are selected by cross-validation on the training data
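One way such a selection could be organized (a sketch under the assumption of a k-fold grid search; `train_jcfo` and `error_rate` are hypothetical stand-ins for the actual training and evaluation routines, which are not part of this transcript):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def select_gammas(X, y, train_jcfo, error_rate, grid=(0.01, 0.1, 1.0, 10.0), n_folds=5):
    """Pick (gamma1, gamma2) by k-fold cross-validation on the training data only.

    train_jcfo(X, y, gamma1, gamma2) -> model   (hypothetical training routine)
    error_rate(model, X, y) -> float            (hypothetical evaluation routine)
    """
    best = (None, np.inf)
    for g1, g2 in product(grid, grid):
        errs = []
        for tr, va in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
            model = train_jcfo(X[tr], y[tr], gamma1=g1, gamma2=g2)
            errs.append(error_rate(model, X[va], y[va]))
        if np.mean(errs) < best[1]:
            best = ((g1, g2), np.mean(errs))
    return best[0]
```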

Page 33:

Experimental Results

• Effect of irrelevant predictor variables
  – Generate synthetic data from one of two normal distributions with unit variance (a toy version of such data, with extra irrelevant predictors appended, is sketched below)
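A possible generator for data of this kind (my own construction; the class separation and the number of irrelevant dimensions are arbitrary choices, not the paper's exact settings):

```python
import numpy as np

def make_two_gaussians(n=200, n_irrelevant=10, separation=2.0, seed=0):
    """Binary data where only the first coordinate carries class information.

    Relevant coordinate: N(-separation/2, 1) for class 0, N(+separation/2, 1) for class 1.
    The remaining n_irrelevant coordinates are N(0, 1) noise for both classes.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    relevant = rng.normal(loc=(y - 0.5) * separation, scale=1.0)
    irrelevant = rng.normal(size=(n, n_irrelevant))
    X = np.column_stack([relevant, irrelevant])
    return X, y

X, y = make_two_gaussians()
print(X.shape, y.mean())   # (200, 11), roughly balanced classes
```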

Page 34:

Experimental Results

• Results with high-dimensional gene expression data sets
  – The strategy for learning a classifier is likely to be less relevant here than the choice of feature selection method
  – Two commonly analyzed data sets:
    • Expression levels of 7129 genes from 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML)
    • Expression levels of 2000 genes from 40 tumor and 22 normal colon tissues

Page 35:

Experimental Results

– Full leave-one-out cross-validation procedure:
  • Train on $n - 1$ samples, test the obtained classifier on the remaining sample, and repeat this procedure for every sample
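A generic leave-one-out loop of this kind is straightforward to write (sketch only; an off-the-shelf logistic regression stands in for JCFO here):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

def loo_error(X, y, make_model=lambda: LogisticRegression(max_iter=1000)):
    """Train on n-1 samples, test on the held-out sample, repeat for every sample."""
    mistakes = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        mistakes += int(model.predict(X[test_idx])[0] != y[test_idx][0])
    return mistakes / len(y)

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=40)
X = rng.normal(size=(40, 5)) + y[:, None]      # weakly separated toy classes
print(f"LOO error rate: {loo_error(X, y):.2f}")
```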

Page 36:

Experimental Results

• Results with low-dimensional benchmark data sets

Page 37:

Experimental Results

– Our algorithm did not perform any feature selection on the Ripley data set, selected five out of eight variables on the Pima data set, and selected three out of five variables on the Crabs data set.

– JCFO is also very sparse in its utilization of kernel functions:
  • It chooses an average of 4 out of 100, 5 out of 200, 5 out of 80, and 6 out of 300 kernels for the Ripley, Pima, Crabs, and WBC data sets, respectively.

Page 38:

Conclusion

• JCFO uses sparsity-promoting priors that encourage many of the $\beta_i$ and $\theta_k$ to be exactly zero.

• Experimental results indicate that, for high-dimensional data with many irrelevant features, the classification accuracy of JCFO is likely to be statistically superior to that of other methods.