TRANSCRIPT
A New Paradigm for Feature Selection
With some surprising results
Amnon Shashua
School of Computer Science & Eng., The Hebrew University
Joint work with Lior Wolf
Wolf & Shashua, ICCV’03
Problem Definition

[Figure: the sample matrix M, with columns M_1, ..., M_i, ..., M_q (data points) and rows m_1^T, m_2^T, ..., m_n^T (feature vectors).]

Given a sample of feature measurements M = [M_1, ..., M_q] ∈ R^{n×q}, where each column M_i is a data point, each row m_i^T is a feature vector, and the rows are normalized: ||m_i||^2 = 1.

Find a subset of features {m_{i_1}, ..., m_{i_l}}, i_j ∈ s, which are most “relevant” with respect to an inference (learning) task.
Comments:
• Data points can represent images (pixels), feature attributes, wavelet coefficients, ...
• The task is to select a subset of coordinates of the data points such that the accuracy, confidence, and training sample size of a learning algorithm would be optimal - ideally.
• Need to define a “relevance” score.
• Need to overcome the exponential nature of subset selection.
• If a “soft” selection approach is used, need to make sure the solution is “sparse”.
Examples:
• Text Classification: typically 10^4 - 10^7 features representing word-frequency counts - only a small fraction is expected to be relevant. Typical examples include automatic sorting of URLs into a web directory and detection of spam email.
• Visual Recognition: similarity between image patches, e.g. similarity(p_i, p_j) = e^{ -||h_i - h_j||_1 }, where h_i, h_j are the patches' color histograms.
• Genomics:
[Figure: gene-expression matrix - rows: gene expressions, columns: tissue samples.]
Goal: recognizing the relevant genes which separate between normal and tumor cells, between different sub classes of cancer, and so on.
Why Select Features?
• Most learning algorithms do not scale well with the growth of irrelevant features.
  ex1: the number of training examples required by some supervised learning methods grows exponentially.
  ex2: for classifiers which can optimize their “capacity” (e.g., large-margin hyper-planes), the sample-complexity bound

      m ≥ (1/ε) ( log(1/δ) + d log(1/ε) )

  applies, and the effective VC dimension d grows fast with irrelevant coordinates - faster than the capacity increase (see the small numeric sketch after this list).
• Computational efficiency considerations when the number of coordinates is very high.
• Run-time of the (already trained) inference engine on new test examples.
• The structure of the data gets obscured by large amounts of irrelevant coordinates.
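A quick numeric illustration of the sample-complexity bound above (a minimal sketch; the values of ε, δ and d are arbitrary and only meant to show how the required m scales with the effective VC dimension d):

```python
import math

def sample_bound(eps, delta, d):
    """Evaluate m >= (1/eps) * (log(1/delta) + d * log(1/eps))."""
    return (1.0 / eps) * (math.log(1.0 / delta) + d * math.log(1.0 / eps))

# The required training-set size grows linearly with d, so irrelevant
# coordinates that inflate the effective VC dimension inflate m as well.
for d in (10, 100, 1000):
    print(d, round(sample_bound(eps=0.05, delta=0.01, d=d)))
```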
Existing Approaches
• Filter methods: pre-processing of the data, independent of the inference engine. Examples: use of mutual information measures, correlation coefficients, clustering, ...
• Embedded, Wrapper: select features useful for building a good predictor/inference engine. Example: run SVM training on every candidate subset of features. A computationally expensive approach in general.
Feature Subset Relevance - Key Idea
[Figure: selecting a subset of rows of M. Each data point M_i ∈ R^n is reduced to M̂_i ∈ R^l by keeping only the selected l ≤ n rows (features), giving a reduced sample matrix M̂ = [M̂_1, ..., M̂_q].]
Working Assumption: the relevant subset of rows induces columns that are coherently clustered.

Note: we are assuming an unsupervised setting (data points are not labeled). The framework can easily apply to supervised settings as well.
• How to measure cluster coherency? We wish to avoid explicitly clustering for each candidate subset of rows, and we want a measure which is amenable to continuous functional analysis.
key idea: use spectral information from the affinity matrix
The affinity matrix of the reduced data points M̂_1, ..., M̂_q is M̂^T M̂, where M̂ = [M̂_1, ..., M̂_q].
• How to represent M̂^T M̂? For a subset of features s = {i_1, ..., i_l}, define

    A_α = M̂^T M̂ = Σ_{i=1}^n α_i m_i m_i^T,   where α_i = 1 if i ∈ s and α_i = 0 otherwise.
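A minimal NumPy sketch of this identity on hypothetical random data (rows of M play the role of the feature vectors m_i^T; the subset s is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 8, 5                                      # n features, q data points
M = rng.standard_normal((n, q))
M /= np.linalg.norm(M, axis=1, keepdims=True)    # normalize each row m_i

s = [1, 3, 6]                                    # selected feature subset
alpha = np.zeros(n)
alpha[s] = 1.0                                   # indicator weights

# A_alpha = sum_i alpha_i * m_i m_i^T  (a q x q affinity matrix)
A_alpha = sum(alpha[i] * np.outer(M[i], M[i]) for i in range(n))

# Equivalently, the affinity matrix of the reduced data: M_hat^T M_hat
M_hat = M[s, :]
assert np.allclose(A_alpha, M_hat.T @ M_hat)
```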
Definition of Relevancy: The Standard Spectrum

General Idea:
Select a subset of rows from the sample matrix M such that the resulting affinity matrix will have high values associated with the first k eigenvalues.

For a subset of features s = {i_1, ..., i_l}, with α_i = 1 if i ∈ s and α_i = 0 otherwise, the resulting affinity matrix is

    A_α = Σ_{i=1}^n α_i m_i m_i^T

and the relevance of the subset is

    rel(m_{i_1}, ..., m_{i_l}) = trace(Q_α^T A_α^T A_α Q_α) = Σ_{j=1}^k λ_j^2

where Q_α consists of the first k eigenvectors of A_α and λ_1, ..., λ_k are its leading eigenvalues.
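A minimal sketch of computing this relevance score for a given 0/1 weight vector (same conventions as the sketch above; k is the assumed number of clusters):

```python
import numpy as np

def relevance(M, alpha, k):
    """rel = trace(Q^T A_alpha^T A_alpha Q) = sum of squares of the k largest eigenvalues of A_alpha."""
    A = (M * alpha[:, None]).T @ M      # A_alpha = sum_i alpha_i m_i m_i^T
    w, V = np.linalg.eigh(A)            # eigenvalues in ascending order
    Q = V[:, -k:]                       # first k eigenvectors (largest eigenvalues)
    rel_trace = np.trace(Q.T @ A.T @ A @ Q)
    assert np.isclose(rel_trace, np.sum(w[-k:] ** 2))   # equals sum_j lambda_j^2
    return rel_trace
```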
(unsupervised) Optimization Problem
Let A_α = Σ_{i=1}^n α_i m_i m_i^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)

subject to  Q^T Q = I,   α ∈ {0,1}^n
Optimization is too difficult to be considered in practice (integer and continuous variables programming).
(unsupervised) Optimization Problem: Soft Version I

Let A_α = Σ_{i=1}^n α_i m_i m_i^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q) + h(α)

subject to  Q^T Q = I,   α_i ≥ 0,   Σ_i α_i = 1

The non-linear function h(α) penalizes for a uniform α.
The result is a non-linear programming problem - could be quite difficult to solve.
(unsupervised) Optimization Problem
Let A_α = Σ_{i=1}^n α_i m_i m_i^T for some unknown real scalars α = (α_1, ..., α_n)^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)

subject to  Q^T Q = I,   α^T α = 1

Note: this optimization definition ignores the requirements:
1. α_i ≥ 0
2. The weight vector α should be sparse.

Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property.
The Q-α Algorithm

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)   subject to  Q^T Q = I,   α^T α = 1

If α were known, then A_α is known and Q is simply the first k eigenvectors of A_α.

If Q were known, then the problem becomes:

    max_{α_1, ..., α_n}  α^T G α   subject to  α^T α = 1,   where   G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j)
α is the largest eigenvector of G
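A minimal sketch of this step (same conventions: M is n×q with rows m_i^T, and Q is an assumed current q×k matrix with orthonormal columns):

```python
import numpy as np

def alpha_step(M, Q):
    """Build G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j) and return its largest eigenvector."""
    inner = M @ M.T              # (m_i^T m_j)
    proj = M @ Q @ Q.T @ M.T     # (m_i^T Q Q^T m_j)
    G = inner * proj             # element-wise product
    w, V = np.linalg.eigh(G)
    return V[:, -1]              # eigenvector of the largest eigenvalue
```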
The Q-α Algorithm: Power-Embedded

1. Let G^(r) be defined by  G^(r)_ij = (m_i^T m_j)(m_i^T Q^(r-1) Q^(r-1)T m_j)
2. Let α^(r) be the largest eigenvector of G^(r)
3. Let A^(r) = Σ_{i=1}^n α_i^(r) m_i m_i^T
4. Let Z^(r) = A^(r) Q^(r-1)
5. Z^(r) → Q^(r) R^(r)   (“QR” factorization step)
6. Increment r

Steps 4-5 are an orthogonal-iteration step.

Convergence proof: take k = 1 for example. Steps 4-5 become  q^(r) = A q / ||A q||  (with q = q^(r-1), A = A^(r)).
Need to show  q^(r)T A^2 q^(r) ≥ q^T A^2 q,  which reduces to

    q^T A^4 q ≥ (q^T A^2 q)^2

for all symmetric matrices A and unit vectors q - follows from convexity.
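A minimal NumPy sketch of the whole power-embedded loop (a sketch only: M is n×q with unit-norm rows m_i^T, k is the assumed number of clusters, and the random initialization and fixed iteration count are my choices, not specified on the slide):

```python
import numpy as np

def q_alpha(M, k, n_iter=50, seed=0):
    """Power-embedded Q-alpha iteration: returns the feature weight vector alpha."""
    n, q = M.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((q, k)))   # random orthonormal start
    for _ in range(n_iter):
        # Steps 1-2: build G and take its leading eigenvector as alpha
        G = (M @ M.T) * (M @ Q @ Q.T @ M.T)
        w, V = np.linalg.eigh(G)
        alpha = V[:, -1]
        # Step 3: weighted affinity A = sum_i alpha_i m_i m_i^T
        A = (M * alpha[:, None]).T @ M
        # Steps 4-5: orthogonal iteration (power step + QR re-orthonormalization)
        Q, _ = np.linalg.qr(A @ Q)
    return alpha

# usage on hypothetical data: alpha = q_alpha(M, k=2); large |alpha_i| marks relevant features
```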
Positivity and Sparsity of α: Hand-Waving Argument

    argmax_{Q, α_1, ..., α_n} trace(Q^T A_α^T A_α Q)  =  argmin_{Q, α_1, ..., α_n} ||A_α - Q Q^T A_α||_F^2

(up to adding redundant terms that do not depend on Q); the right-hand side is minimized if rank(A_α) = k.

A_α = Σ_{i=1}^n α_i m_i m_i^T  =  a sum of rank-1 matrices.

If we would like rank(A_α) to be small, we shouldn't add too many rank-1 terms; therefore α should be sparse.

Note: this argument does not say anything with regard to why α should be positive.
Positivity and Sparsity of α

The key for the emergence of a sparse and positive α has to do with the way the entries of G are defined:

    G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j) = (m_i^T m_j) Σ_{l=1}^k (m_i^T q_l)(q_l^T m_j)

Consider k = 1 for example; then each entry is of the form

    f = (a^T b)(a^T c)(b^T c),   with ||a|| = ||b|| = ||c|| = 1.

Clearly -1 < f ≤ 1: f = -1 only if (a^T b) = -1, (a^T c) = -1, (b^T c) = -1, or (a^T b) = -1, (a^T c) = 1, (b^T c) = 1 - which cannot happen.

Expected values of the entries of G are biased towards a positive number.
Positivity and Sparsity of α
1. What is the minimal value of f = (a^T b)(a^T c)(b^T c) when a, b, c vary over the n-dimensional unit hypersphere?
   Answer:  -1/8 ≤ f ≤ 1

2. Given a uniform sampling of a, b, c over the n-dimensional unit hypersphere, what are the mean μ and variance σ² of f?
   Answer:  μ = 1/6,  σ² = 1/18

3. Given that G_ij ~ N(μ, σ²), what is the probability that the first (largest) eigenvector of G is strictly non-negative (same sign)?
Proposition 3:

Model the entries of G as  G_ij ~ N(μ, σ²) for i ≠ j  and  G_ii ~ N(1/2, σ²),  with an infinitesimal σ².

Let Gx = λx with x the largest eigenvector. Then

    p(x ≥ 0) = ( Φ_[0, σ/μ](√n) )^n  → 1  as n → ∞,

where Φ_[μ, σ²](x) is the cumulative distribution function of N(μ, σ²).

[Figure: the bound ( Φ_[0, σ/μ](√n) )^n plotted against n, together with the empirical probability.]
Proposition 4: (with O. Zeitouni and M. Ben-Or)
p(x ≥ 0) → 1 as n → ∞ for any value of σ, when μ > 0.
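A minimal Monte Carlo sketch for probing this kind of claim empirically (my own illustration, not the experiment behind the slide's figure): draw a symmetric random matrix with positive-mean Gaussian entries and test whether its leading eigenvector has entries of a single sign.

```python
import numpy as np

def same_sign_prob(n, mu, sigma, trials=1000, seed=0):
    """Estimate the probability that the leading eigenvector of a symmetrized
    matrix of N(mu, sigma^2) entries has all entries of the same sign."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        B = rng.normal(mu, sigma, size=(n, n))
        G = (B + B.T) / 2                  # symmetrize
        w, V = np.linalg.eigh(G)
        x = V[:, -1]                       # leading eigenvector
        hits += bool(np.all(x >= 0) or np.all(x <= 0))
    return hits / trials

# For mu > 0 the estimate should approach 1 as n grows.
for n in (5, 20, 80):
    print(n, same_sign_prob(n, mu=0.2, sigma=1.0))
```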
Sparsity Gap - Definition

Let p, 0 < p < 1, be the fraction of relevant features and q = 1 - p.

Let
    G = [ A    B ]
        [ B^T  C ]
where the entries of the np×np block A are ~ N(μ_a, σ²) and the entries of B and C are ~ N(μ_b, σ²), with μ_b = 1/6.

Let x = (x_1, x_2)^T be the largest eigenvector of G, where x_1 holds the first np entries and x_2 holds the remaining nq entries.

The sparsity gap corresponding to G is the ratio ρ = x̄_1 / x̄_2, where x̄_1 is the mean of x_1 and x̄_2 is the mean of x_2.
Sparsity Gap - Proposition 4:

Let
    G = [ np μ_a   nq μ_b ]      (2×2)
        [ np μ_b   nq μ_b ]

and let x = (x_1, x_2)^T be the largest eigenvector of G.

The sparsity gap corresponding to G is:

    ρ = ( x_1 + N(0, np σ²) ) / ( x_2 + N(0, nq σ²) )

Example: μ_a = 0.85, μ_b = 1/6, n = 100:

    ρ = ( 61p - 10 + sqrt(3321p² - 820p + 100) ) / (20p)
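A minimal sketch of this 2×2 reduction for the example above (deterministic part only, i.e. the N(0, ·) noise terms are dropped; the closed-form expression is the one quoted on the slide):

```python
import numpy as np

def sparsity_gap(p, mu_a=0.85, mu_b=1/6, n=100):
    """Ratio x_1/x_2 of the leading eigenvector of [[np*mu_a, nq*mu_b], [np*mu_b, nq*mu_b]]."""
    q = 1 - p
    G = np.array([[n * p * mu_a, n * q * mu_b],
                  [n * p * mu_b, n * q * mu_b]])
    w, V = np.linalg.eig(G)
    x = V[:, np.argmax(w.real)].real
    return x[0] / x[1]

def sparsity_gap_closed_form(p):
    return (61 * p - 10 + np.sqrt(3321 * p**2 - 820 * p + 100)) / (20 * p)

# the two computations agree
for p in (0.1, 0.3, 0.5):
    print(round(sparsity_gap(p), 4), round(sparsity_gap_closed_form(p), 4))
```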
The feature selection for SVM benchmark
• Two synthetic data sets were proposed in a paper by Weston, Mukherjee, Chapelle, Pontil, Poggio and Vapnik from NIPS 2001.
• The data sets were designed to have a few relevant features which are separable by SVM, combined with many irrelevant features.
• The data sets were designed for the labeled case.
The linear dataset
• The linear data set is almost linearly separable once the correct features are recovered.
• There were 6 relevant features and 196 irrelevant ones.
• With probability 0.7 the data is almost separable by the first 3 relevant features and not separable by the remaining 3 relevant features; with probability 0.3 the second group of relevant features is the separable one. The remaining 196 features were drawn from N(0,20).
Results – linear data set
The unsupervised algorithm became effective only from 80 data points and up, and is not shown here.
Results – non-linear data set
There are two species of frogs in this figure:
American toad
Green frog (Rana clamitans)
Automatic separation
• We use small patches as basic features.
• In order to compare patches we use the L1 norm on their color histograms:

    similarity(p_i, p_j) = e^{ -||h_i - h_j||_1 }

  where h_i, h_j are the color histograms of patches p_i, p_j.
The matrix A: many features over ~40 images
The similarity between an image A and a patch B is the maximum over all similarities between the patches p in image A and the patch B:

    similarity(A, B) = max_{p ∈ A} similarity(p, B)
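A minimal sketch of these two similarity computations (assumed representation: each patch is reduced to a color histogram, and an image is simply a list of its patches' histograms; the histogram extraction itself is not shown):

```python
import numpy as np

def patch_similarity(h1, h2):
    """similarity(p1, p2) = exp(-||h1 - h2||_1) on color histograms."""
    return np.exp(-np.abs(h1 - h2).sum())

def image_patch_similarity(image_hists, h_patch):
    """similarity(image, patch) = max over the image's patches."""
    return max(patch_similarity(h, h_patch) for h in image_hists)
```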
American toad
Green frog (Rana clamitans)
Selected features
Using these features the clustering was correct on 80% of the samples – compared to 55% correct clustering using conventional spectral clustering
sea-elephant, elephant
Another example
Using these features the clustering was correct on 90% of the samples – compared to 65% correct clustering using conventional spectral clustering
Genomics
[Figure: gene-expression matrix - rows: gene expressions, columns: tissue samples.]
The microarray technology provides many measurements of gene expressions for different sample tissues.
Goal: recognizing the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells).

• Classification of Tissue Samples (type of Cancer, Normal vs. Tumor)
• Find Novel Subclasses (unsupervised)
• Find Genes responsible for classification (new insights for drug design)
Few samples (~50) and large dimension (~10,000)
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• The model consists of 6 parameters:
Parameter | Description               | Leukemia
a         | # class A samples         | 25
b         | # class B samples         | 47
m         | # features                | 600
e, (1-e)  | % irrelevant / relevant   | 72%, 28%
(3d)      | size of interval of means | d = 555
s         | std coefficient           | .75
• A relevant feature is sampled from N(μ_A, s·μ_A) or N(μ_B, s·μ_B), where the class means μ_A, μ_B are sampled uniformly from [-1.5d, 1.5d].
• An irrelevant feature is sampled from N(0, s).
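A minimal sketch of this generator (my reading of the parameters: s·|μ| is used as the standard deviation of a relevant feature, a and b are the class sizes, and e is the fraction of irrelevant features, as in the table above):

```python
import numpy as np

def make_synthetic(a=25, b=47, m=600, e=0.72, d=555, s=0.75, seed=0):
    """Ben-Dor/Friedman/Yakhini-style synthetic data: rows = features, columns = samples."""
    rng = np.random.default_rng(seed)
    n_irrel = int(round(e * m))
    n_rel = m - n_irrel
    X = np.empty((m, a + b))
    # relevant features: class means drawn uniformly from [-1.5d, 1.5d]
    for i in range(n_rel):
        mu_A, mu_B = rng.uniform(-1.5 * d, 1.5 * d, size=2)
        X[i, :a] = rng.normal(mu_A, s * abs(mu_A), size=a)
        X[i, a:] = rng.normal(mu_B, s * abs(mu_B), size=b)
    # irrelevant features: N(0, s)
    X[n_rel:] = rng.normal(0.0, s, size=(n_irrel, a + b))
    labels = np.array([0] * a + [1] * b)
    return X, labels
```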
Param. | Description           | Leuk. | MSA      | Q-α               | Remarks
a      | # class A samples     | 25    |          |                   |
b      | # class B samples     | 47    |          |                   |
m      | # features            | 432   | <250     | <5                | MSA uses redundancy
e      | % irrelevant features | 168   | >95%     | >99.5%            | Easy data set
d      | Size of interval      | 555   | [1,1000] | At least [1,1000] | data is normalized
s      | Spread                | .75   | <2       | <1000             | MSA needs good separation
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• MSA – the max surprise algorithm of Ben-Dor, Friedman, and Yakhini.
• Results of simulations done by varying one parameter out of m, d, e, s.
Follow Up Work
Feature selection with “side” information:
Given the “main” data M and the “side” data W, find weights α = (α_1, ..., α_n)^T such that

    Σ_{i=1}^n α_i m_i m_i^T   has coherent k clusters

and

    Σ_{i=1}^n α_i w_i w_i^T   has low cluster coherence (a single cluster).
Shashua & Wolf, ECCV’04
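A minimal sketch of one plausible way to score a weight vector in this setting (the actual objective and optimization of Shashua & Wolf, ECCV'04 are not given on the slide; the difference-of-relevance score below is only an illustrative assumption):

```python
import numpy as np

def spectral_relevance(X, alpha, k):
    """Sum of squares of the top-k eigenvalues of A_alpha = sum_i alpha_i x_i x_i^T."""
    A = (X * alpha[:, None]).T @ X
    w = np.linalg.eigvalsh(A)
    return np.sum(w[-k:] ** 2)

def side_info_score(M, W, alpha, k):
    # reward coherent k-cluster structure in the main data,
    # penalize cluster coherence induced in the side data (assumed trade-off)
    return spectral_relevance(M, alpha, k) - spectral_relevance(W, alpha, k)
```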
Follow Up Work

“Kernelizing” Q-α:

    m_i → φ(m_i)    (a high-dimensional mapping),    φ(m_i)^T φ(m_j) = k(m_i, m_j)

    A_α = Σ_{i=1}^n α_i φ(m_i) φ(m_i)^T
Rather than having inner-products we have outer-products.
Shashua & Wolf, ECCV’04
END