
Page 1:

A Bayesian Approach to Joint Feature Selection and Classifier Design

Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, Fellow, IEEE, and Mario A. T. Figueiredo, Senior Member, IEEE

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 26, NO. 9, SEPTEMBER 2004

Page 2:

Outline

• Introduction to MAP
• Introduction to Expectation-Maximization
• Introduction to Generalized linear models
• Introduction
• Sparsity-promoting priors
• MAP parameter estimation via EM
• Experimental Results
• Conclusion

Page 3:

Introduction to MAP

• Maximum-likelihood estimation
  – $\theta$ is treated as a fixed but unknown parameter vector

• In MAP estimation
  – $\theta$ is treated as a random vector described by a pdf $p(\theta)$
  – Assume $p(\theta)$ is known
  – Given a set of training samples $D = \{x_1, x_2, \ldots, x_n\}$

Page 4:

Introduction to MAP

• Find the maximum of the posterior $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$:

  $\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta)$

  – If $p(\theta)$ is uniform or flat enough, the MAP estimate coincides with the maximum-likelihood estimate $\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\; p(D \mid \theta)$
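The slide's point can be made concrete with a small numerical illustration (my own, not from the paper): the ML and MAP estimates of a Gaussian mean under a zero-mean Gaussian prior, where the MAP estimate approaches the ML estimate as the prior is made flatter.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                                              # known noise variance
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)   # training samples D

theta_ml = x.mean()                                       # ML estimate: sample average

def theta_map(prior_var):
    """MAP estimate of the mean under a N(0, prior_var) prior (conjugate case)."""
    n = len(x)
    return (x.sum() / sigma2) / (n / sigma2 + 1.0 / prior_var)

for v in [0.01, 1.0, 100.0, 1e6]:                         # flatter and flatter prior
    print(f"prior var {v:>9}: MAP = {theta_map(v):.4f}  (ML = {theta_ml:.4f})")
```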

Pages 5-8:

Introduction to Expectation-Maximization
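For reference, the generic EM iteration for MAP estimation with missing data $z$ (standard material, stated here rather than recovered from these slides) alternates two steps:

```latex
% E-step: expected complete-data log-posterior, averaging over the missing data z
\[
Q\big(\theta \,\big|\, \hat{\theta}^{(t)}\big)
  = \mathbb{E}_{z}\Big[ \log p(D, z \mid \theta) \;\Big|\; D, \hat{\theta}^{(t)} \Big] + \log p(\theta)
\]

% M-step: maximize the expected complete-data log-posterior over the parameters
\[
\hat{\theta}^{(t+1)} = \arg\max_{\theta}\; Q\big(\theta \,\big|\, \hat{\theta}^{(t)}\big)
\]
```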

Page 9:

Introduction to generalized linear models

• Supervised learning can be formalized as the problem of inferring a function $y = f(x)$ from a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

• When $y$ is continuous (e.g., $y \in \mathbb{R}$), we are in the context of regression, whereas in classification problems $y$ is of categorical nature (e.g., binary, $y \in \{-1, 1\}$)

• The function $f$ is assumed to have a fixed structure and to depend on a set of parameters $\beta$

• We write $y = f(x, \beta)$, and the goal becomes to estimate $\beta$ from the training data

Page 10:

Introduction to generalized linear models

• Regression functions that are linear with respect to the parameters:

  $f(x, \beta) = \beta^{T} h(x)$

  where $h(x)$ is a vector of $k$ fixed functions of the input, often called features.
  – Linear regression: $h(x) = [1, x_1, \ldots, x_d]^{T}$, with $k = d + 1$
  – Nonlinear regression via a set of $k$ fixed basis functions $\{\phi_j(\cdot)\}$: $h(x) = [\phi_1(x), \ldots, \phi_k(x)]^{T}$
  – Kernel regression: $h(x) = [1, K(x, x^{(1)}), \ldots, K(x, x^{(n)})]^{T}$, where $K(\cdot, \cdot)$ is a kernel function and the $x^{(i)}$ are the training points
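A minimal sketch of these choices of $h(x)$ and of the resulting design matrix $H$ (my own illustration, not code from the paper):

```python
import numpy as np

def h_linear(x):
    """Linear regression features: a constant term plus the raw inputs."""
    return np.concatenate(([1.0], x))

def h_basis(x, centers, width=1.0):
    """Fixed nonlinear basis functions, here Gaussian bumps at chosen centers."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))

def h_kernel(x, X_train, kernel):
    """Kernel 'features': a constant term plus K(x, x_i) for every training point."""
    return np.concatenate(([1.0], [kernel(x, xi) for xi in X_train]))

def design_matrix(X, h):
    """Stack h(x)^T for every row x of X, giving the n-by-k design matrix H."""
    return np.vstack([h(x) for x in X])

# Example: an RBF kernel and the resulting kernel design matrix
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.random.default_rng(1).normal(size=(5, 3))       # 5 points, 3 input dimensions
H = design_matrix(X, lambda x: h_kernel(x, X, rbf))    # shape (5, 6): bias + 5 kernels
print(H.shape)
```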

Page 11:

Introduction to generalized linear models

– Assume that the output variables in the training set were contaminated by additive white Gaussian noise:

  $y^{(i)} = \beta^{T} h(x^{(i)}) + w^{(i)}, \quad i = 1, \ldots, n$

  where $\{w^{(i)}\}$ is a set of independent zero-mean Gaussian samples with variance $\sigma^{2}$.

– With $y = [y^{(1)}, \ldots, y^{(n)}]^{T}$, the likelihood function is

  $p(y \mid \beta) = N(y \mid H\beta, \sigma^{2} I)$

  where $H$ is the so-called design matrix; its element in row $i$ and column $j$, denoted $H_{ij}$, is given by $h_j(x^{(i)})$.

Page 12:

Introduction to generalized linear models

– With a zero-mean Gaussian prior for $\beta$, with covariance $A$:

  $p(\beta) = N(\beta \mid 0, A)$

– The posterior is still Gaussian, so its mode coincides with its mean and the MAP estimate has a closed form:

  $\hat{\beta} = \arg\max_{\beta}\; p(D \mid \beta)\, p(\beta) = (H^{T} H + \sigma^{2} A^{-1})^{-1} H^{T} y$
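A small numerical sketch of this closed-form MAP estimate (my own; it assumes an isotropic prior covariance $A = I$ for simplicity), which in this special case reduces to ridge regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 4
H = rng.normal(size=(n, k))                       # design matrix, rows h(x^(i))^T
beta_true = np.array([1.0, -2.0, 0.0, 0.5])
sigma2 = 0.25
y = H @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

A = np.eye(k)                                     # prior covariance of beta (assumed isotropic here)
beta_map = np.linalg.solve(H.T @ H + sigma2 * np.linalg.inv(A), H.T @ y)

beta_ml = np.linalg.lstsq(H, y, rcond=None)[0]    # ML / least-squares estimate for comparison
print("MAP:", np.round(beta_map, 3))
print("ML :", np.round(beta_ml, 3))
```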

Page 13:

Introduction to generalized linear models

– In logistic regression, the link function is the logistic sigmoid applied to the linear predictor $z = \beta^{T} h(x)$:

  $P(y = 1 \mid x) = \dfrac{1}{1 + \exp(-z)}$

– Probit link or probit model: the link is the standard Gaussian cumulative distribution function,

  $\Phi(z) = \int_{-\infty}^{z} N(x \mid 0, 1)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right) dx$
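A quick comparison of the two link functions (not from the paper), using scipy's standard normal CDF; the curves nearly coincide once the probit argument is rescaled by roughly 1.7:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-4, 4, 9)

logistic = 1.0 / (1.0 + np.exp(-z))      # logistic link
probit = norm.cdf(z)                     # probit link: standard Gaussian CDF
probit_scaled = norm.cdf(z / 1.7)        # rescaled probit, close to the logistic curve

for zi, a, b, c in zip(z, logistic, probit, probit_scaled):
    print(f"z={zi:5.1f}  logistic={a:.3f}  probit={b:.3f}  probit(z/1.7)={c:.3f}")
```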

Page 14:

Introduction to generalized linear models

• Hidden variable

  $z(x, \beta) = \beta^{T} h(x) + w$

  where $w$ is zero-mean, unit-variance Gaussian noise

• If the classification rule is to assign $y = 1$ whenever $z(x, \beta) \geq 0$ and $y = 0$ otherwise, we recover the probit model (the derivation is given below)

• Given training data $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, introduce the hidden variables $z = [z^{(1)}, \ldots, z^{(n)}]^{T}$, where $z^{(i)} = \beta^{T} h(x^{(i)}) + w^{(i)}$
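The step this rule relies on is a one-line calculation (standard, not quoted from the paper): because $w$ is a standard Gaussian, thresholding $z$ at zero reproduces the probit link of the previous slide.

```latex
\[
P(y = 1 \mid x)
  = P\big(z(x, \beta) \ge 0\big)
  = P\big(w \ge -\beta^{T} h(x)\big)
  = \Phi\big(\beta^{T} h(x)\big),
\qquad \text{since } w \sim N(0, 1) \text{ and } 1 - \Phi(-a) = \Phi(a).
\]
```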

Page 15:

Introduction to generalized linear models

• If we had $z$, we would have a simple linear likelihood with unit noise variance:

  $p(z \mid \beta) = N(z \mid H\beta, I)$

• This suggests using the EM algorithm to estimate $\beta$, treating $z$ as missing data

Page 16:

Introduction

• Given the training set

  $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \quad x^{(i)} \in \mathbb{R}^{p}, \; y^{(i)} \in \{0, 1\},$

  there are two standard tasks:
  – Classifier design: to learn a function that most accurately predicts the class of a new example
  – Feature selection: to identify a subset of the features that is most informative about the class distinction

Page 17:

Introduction

• In this paper: joint classifier and feature optimization (JCFO)
  – A nonnegative scaling factor $\theta_k$ is associated with each feature $x_k$
  – These scaling factors are then estimated from the data, under an a priori preference for values that are either significantly large or exactly zero

Page 18:

Introduction

• In this paper, the focus is on probabilistic kernel classifiers of the form

  $P(y = 1 \mid x) = \Phi\!\left(\beta_0 + \sum_{i=1}^{n} \beta_i\, K_{\theta}(x, x^{(i)})\right)$

  where $\Phi(z) = \int_{-\infty}^{z} N(x \mid 0, 1)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right) dx$ is the Gaussian cumulative distribution function, also known as the probit link

Page 19:

Introduction

• $K_{\theta}(x, x^{(i)})$ is a (possibly nonlinear) measure of similarity between the test example $x$ and the training example $x^{(i)}$. Two of the most popular kernels, here with the feature scaling factors $\theta_k$ built in:

  – $r$th-order polynomial:

    $K_{\theta}(x, x^{(i)}) = \left(1 + \sum_{k=1}^{p} \theta_k\, x_k\, x_k^{(i)}\right)^{r}$

  – Gaussian radial basis function (RBF):

    $K_{\theta}(x, x^{(i)}) = \exp\!\left(-\sum_{k=1}^{p} \theta_k\, \big(x_k - x_k^{(i)}\big)^{2}\right)$
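A compact sketch of these two scaled kernels and of the classifier form from the previous slide (my own illustration; the parameter values are arbitrary, not estimates produced by JCFO):

```python
import numpy as np
from scipy.stats import norm

def poly_kernel(x, xi, theta, r=2):
    """r-th order polynomial kernel with per-feature scaling factors theta."""
    return (1.0 + np.sum(theta * x * xi)) ** r

def rbf_kernel(x, xi, theta):
    """Gaussian RBF kernel with per-feature scaling factors theta."""
    return np.exp(-np.sum(theta * (x - xi) ** 2))

def prob_class1(x, X_train, beta, theta, kernel=rbf_kernel):
    """P(y=1|x) = Phi(beta_0 + sum_i beta_i * K_theta(x, x_i)), probit link."""
    k_vals = np.array([kernel(x, xi, theta) for xi in X_train])
    return norm.cdf(beta[0] + beta[1:] @ k_vals)

# Toy example: 4 training points in 3 dimensions, arbitrary beta and theta.
# A zero in theta switches the corresponding feature off entirely;
# a zero in beta[i+1] switches off the kernel centred on training point i.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(4, 3))
beta = np.array([0.1, 1.2, 0.0, -0.7, 0.4])     # beta_0 plus one weight per training point
theta = np.array([2.0, 0.0, 0.5])               # feature 2 is effectively removed
print(prob_class1(rng.normal(size=3), X_train, beta, theta))
```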

Page 20:

Introduction

• JCFO seeks sparsity in its use of both basis functions (sparsity in $\beta = [\beta_0, \ldots, \beta_n]^{T}$) and features (sparsity in $\theta = [\theta_1, \ldots, \theta_p]^{T}$)

• For kernel classification, sparsity in the use of basis functions is known to impact the capacity of the classifier, which controls its generalization performance

• Sparsity in feature utilization is another important factor for increased robustness

Page 21:

Sparsity-promoting priors

• To encourage sparsity in the estimates of the parameter vectors $\beta$ and $\theta$, we adopt a Laplacian prior for each

• For small $\beta_i$, the difference between $p(\beta_i = 0)$ and $p(\beta_i)$ is much larger for a Laplacian than for a Gaussian

• As a result, using a Laplacian prior in a learning procedure that seeks to maximize the posterior density

  $p(\beta_i \mid D) \propto p(D \mid \beta_i)\, p(\beta_i)$

  strongly favors values of $\beta_i$ that are exactly 0 over values that are merely close to 0
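To make the sparsity argument concrete (standard material, included for reference): a Laplacian prior of scale $\lambda$ on each coefficient turns MAP estimation into likelihood maximization with an L1 penalty, and it is the non-differentiable corner of $|\beta_i|$ at zero that drives coefficients exactly to zero.

```latex
\[
p(\beta_i \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |\beta_i|}
\quad\Longrightarrow\quad
\log p(\beta \mid D) = \log p(D \mid \beta) - \lambda \sum_i |\beta_i| + \text{const.}
\]
```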

Page 22:

Sparsity-promoting priors

• To avoid the nondifferentiability at the origin, we use an alternative hierarchical formulation that is equivalent to the Laplacian prior:
  – Let each $\beta_i$ have a zero-mean Gaussian prior with its own variance $\tau_i$:

    $p(\beta_i \mid \tau_i) = N(\beta_i \mid 0, \tau_i)$

  – Let all the variances $\tau_i$ be independently distributed according to a common exponential distribution (the hyperprior):

    $p(\tau_i \mid \gamma_1) = \frac{\gamma_1}{2} \exp\!\left(-\frac{\gamma_1 \tau_i}{2}\right), \quad \text{for } \tau_i \geq 0$

Page 23:

Sparsity-promoting priors

– The effective prior can be obtained by marginalizing with respect to each $\tau_i$:

  $p(\beta_i \mid \gamma_1) = \int_{0}^{\infty} p(\beta_i \mid \tau_i)\, p(\tau_i \mid \gamma_1)\, d\tau_i = \frac{\sqrt{\gamma_1}}{2} \exp\!\left(-\sqrt{\gamma_1}\, |\beta_i|\right)$

  which is exactly the Laplacian prior.

• For each scaling coefficient $\theta_k$, we adopt a similar sparsity-promoting prior, but we must ensure that any estimate of $\theta_k$ is nonnegative
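A small numerical check (my own, not from the paper) that integrating the Gaussian prior against the exponential hyperprior really does produce the Laplacian density above:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

gamma1 = 4.0

def marginal(beta):
    """p(beta | gamma1) = integral over tau of N(beta | 0, tau) * Exp(tau; gamma1/2)."""
    integrand = lambda tau: norm.pdf(beta, loc=0.0, scale=np.sqrt(tau)) \
                            * (gamma1 / 2.0) * np.exp(-gamma1 * tau / 2.0)
    val, _ = quad(integrand, 0.0, np.inf)
    return val

def laplacian(beta):
    """Closed-form result: (sqrt(gamma1)/2) * exp(-sqrt(gamma1) * |beta|)."""
    return 0.5 * np.sqrt(gamma1) * np.exp(-np.sqrt(gamma1) * np.abs(beta))

for b in [0.0, 0.3, 1.0, 2.5]:
    print(f"beta={b:4.1f}  numeric={marginal(b):.6f}  closed-form={laplacian(b):.6f}")
```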

Page 24:

Sparsity-promoting priors

– A hierarchical model for $\theta_k$ similar to the one described above, but with the Gaussian prior replaced by a truncated Gaussian prior that explicitly forbids negative values:

  $p(\theta_k \mid \rho_k) \propto N(\theta_k \mid 0, \rho_k)$ for $\theta_k \geq 0$, and $p(\theta_k \mid \rho_k) = 0$ for $\theta_k < 0$

– An exponential hyperprior on the variances:

  $p(\rho_k \mid \gamma_2) = \frac{\gamma_2}{2} \exp\!\left(-\frac{\gamma_2 \rho_k}{2}\right), \quad \text{for } \rho_k \geq 0$

Page 25:

Sparsity-promoting priors

– The effective prior for $\theta_k$ can be obtained by marginalizing with respect to $\rho_k$; the result is an exponential density, $p(\theta_k \mid \gamma_2) = \sqrt{\gamma_2}\, \exp\!\left(-\sqrt{\gamma_2}\, \theta_k\right)$ for $\theta_k \geq 0$, which again promotes estimates that are exactly zero.

Page 26:

MAP parameter estimation via EM

• Given the priors described above, our goal is to find the maximum a posteriori (MAP) estimate:

  $(\hat{\beta}, \hat{\theta}) = \arg\max_{(\beta,\, \theta)}\; p(D \mid \beta, \theta)\, p(\beta \mid \gamma_1)\, p(\theta \mid \gamma_2)$

Page 27:

MAP parameter estimation via EM

• We use an EM algorithm that finds MAP estimates by exploiting the hierarchical prior models and the latent-variable interpretation of the probit model

• Define the $(n+1)$-dimensional vector function

  $h(x) = [1,\; K_{\theta}(x, x^{(1)}),\; \ldots,\; K_{\theta}(x, x^{(n)})]^{T}$

  and the random function

  $z(x, \beta, \theta) = \beta^{T} h(x) + w, \quad \text{where } w \sim N(0, 1)$

Page 28:

MAP parameter estimation via EM

• If a classifier were to assign the label $y = 1$ to an example $x$ whenever $z(x, \beta, \theta) \geq 0$ and $y = 0$ whenever $z(x, \beta, \theta) < 0$,

• we would recover the probit model:

  $P(y = 1 \mid x) = \Phi\big(\beta^{T} h(x)\big)$

• Consider the vector of missing variables

  $z = [z^{(1)}, \ldots, z^{(n)}]^{T}, \quad \text{where } z^{(i)} = z(x^{(i)}, \beta, \theta),$

  together with the prior variances $\tau = [\tau_0, \ldots, \tau_n]^{T}$ (for the $\beta_i$) and $\rho = [\rho_1, \ldots, \rho_p]^{T}$ (for the $\theta_k$)

Page 29:

MAP parameter estimation via EM

• The EM algorithm produces a sequence of estimates $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$ by alternating between the E-step and the M-step

• E-step
  – Compute the expected value of the complete log-posterior, conditioned on the data $D$ and the current estimates of the parameters, $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$

Page 30:

MAP parameter estimation via EM
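One standard ingredient of such an E-step (the truncated-Gaussian mean of the probit latent variable, stated here as background rather than quoted from the paper) is the expected value of each $z^{(i)}$ given its label and the current linear predictor:

```python
import numpy as np
from scipy.stats import norm

def expected_z(mu, y):
    """E[z | y] for the probit latent variable z ~ N(mu, 1),
    where mu = beta^T h(x) and the observed label truncates z to z >= 0 (y = 1)
    or z < 0 (y = 0). Standard truncated-normal means:
        y = 1:  mu + pdf(mu) / cdf(mu)
        y = 0:  mu - pdf(mu) / (1 - cdf(mu))
    """
    mu = np.asarray(mu, dtype=float)
    y = np.asarray(y)
    correction = norm.pdf(mu) / np.where(y == 1, norm.cdf(mu), 1.0 - norm.cdf(mu))
    return mu + np.where(y == 1, correction, -correction)

# Example: two training points with current linear predictors mu and labels y
print(expected_z(mu=[0.5, -1.2], y=[1, 0]))
```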

Page 31:

MAP parameter estimation via EM

• M-step

Page 32:

Experimental Results

• Each data set is initially normalized to have zero mean and unit variance

• The regularization parameters $\gamma_1$ and $\gamma_2$ (controlling the degrees of sparsity enforced on $\beta$ and $\theta$, respectively) are selected by cross-validation on the training data
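One way such a selection could be organized (a sketch under the assumption of a k-fold grid search; `train_jcfo` and `error_rate` are hypothetical stand-ins for the actual training and evaluation routines, which are not part of this transcript):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def select_gammas(X, y, train_jcfo, error_rate, grid=(0.01, 0.1, 1.0, 10.0), n_folds=5):
    """Pick (gamma1, gamma2) by k-fold cross-validation on the training data only.

    train_jcfo(X, y, gamma1, gamma2) -> model   (hypothetical training routine)
    error_rate(model, X, y) -> float            (hypothetical evaluation routine)
    """
    best = (None, np.inf)
    for g1, g2 in product(grid, grid):
        errs = []
        for tr, va in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
            model = train_jcfo(X[tr], y[tr], gamma1=g1, gamma2=g2)
            errs.append(error_rate(model, X[va], y[va]))
        if np.mean(errs) < best[1]:
            best = ((g1, g2), np.mean(errs))
    return best[0]
```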

Page 33:

Experimental Results

• Effect of irrelevant predictor variables
  – Generate synthetic data from one of two normal distributions with unit variance (a toy version of such data, with extra irrelevant predictors appended, is sketched below)
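A possible generator for data of this kind (my own construction; the class separation and the number of irrelevant dimensions are arbitrary choices, not the paper's exact settings):

```python
import numpy as np

def make_two_gaussians(n=200, n_irrelevant=10, separation=2.0, seed=0):
    """Binary data where only the first coordinate carries class information.

    Relevant coordinate: N(-separation/2, 1) for class 0, N(+separation/2, 1) for class 1.
    The remaining n_irrelevant coordinates are N(0, 1) noise for both classes.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    relevant = rng.normal(loc=(y - 0.5) * separation, scale=1.0)
    irrelevant = rng.normal(size=(n, n_irrelevant))
    X = np.column_stack([relevant, irrelevant])
    return X, y

X, y = make_two_gaussians()
print(X.shape, y.mean())   # (200, 11), roughly balanced classes
```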

Page 34:

Experimental Results

• Results with high-dimensional gene expression data sets
  – The strategy for learning a classifier is likely to be less relevant here than the choice of feature selection method
  – Two commonly analyzed data sets:
    • Expression levels of 7129 genes from 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML)
    • Expression levels of 2000 genes from 40 tumor and 22 normal colon tissues

Page 35:

Experimental Results

– Full leave-one-out cross-validation procedure:
  • Train on $n - 1$ samples, test the obtained classifier on the remaining sample, and repeat this procedure for every sample
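A generic leave-one-out loop of this kind is straightforward to write (sketch only; an off-the-shelf logistic regression stands in for JCFO here):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

def loo_error(X, y, make_model=lambda: LogisticRegression(max_iter=1000)):
    """Train on n-1 samples, test on the held-out sample, repeat for every sample."""
    mistakes = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        mistakes += int(model.predict(X[test_idx])[0] != y[test_idx][0])
    return mistakes / len(y)

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=40)
X = rng.normal(size=(40, 5)) + y[:, None]      # weakly separated toy classes
print(f"LOO error rate: {loo_error(X, y):.2f}")
```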

Page 36:

Experimental Results

• Results with low-dimensional benchmark data sets

Page 37:

Experimental Results

– Our algorithm did not perform any feature selection on the Ripley data set, selected five out of eight variables on the Pima data set, and selected three out of five variables on the Crabs data set.

– JCFO is also very sparse in its utilization of kernel functions:
  • It chooses an average of 4 out of 100, 5 out of 200, 5 out of 80, and 6 out of 300 kernels for the Ripley, Pima, Crabs, and WBC data sets, respectively.

Page 38:

Conclusion

• JCFO uses sparsity-promoting priors that encourage many of the $\beta_i$ and $\theta_k$ to be exactly zero.

• Experimental results indicate that, for high-dimensional data with many irrelevant features, the classification accuracy of JCFO is likely to be statistically superior to that of other methods.