TRANSCRIPT
A New Paradigm for Feature Selection
With some surprising results
Amnon Shashua
School of Computer Science & Eng., The Hebrew University
Joint work with Lior Wolf
Wolf & Shashua, ICCV’03
Problem Definition

[Figure: the sample matrix M, with columns M_1, ..., M_i, ..., M_q (data points) and rows m_1^T, m_2^T, ..., m_n^T (feature vectors).]

Given a sample of feature measurements M = [M_1, ..., M_q] ∈ R^{n×q}, where each column M_i is a data point, each row m_i^T is a feature vector, and the rows are normalized: ||m_i||^2 = 1.

Find a subset of features {m_{i_1}, ..., m_{i_l}}, i_j ∈ s, which are most “relevant” with respect to an inference (learning) task.
Comments:
• Data points can represent images (pixels), feature attributes, wavelet coefficients, ...
• The task is to select a subset of coordinates of the data points such that the accuracy, confidence, and training sample size of a learning algorithm would be optimal - ideally.
• Need to define a “relevance” score.
• Need to overcome the exponential nature of subset selection.
• If a “soft” selection approach is used, need to make sure the solution is “sparse”.
Examples:
• Text Classification: typically 10^4 - 10^7 features representing word-frequency counts - only a small fraction is expected to be relevant. Typical examples include automatic sorting of URLs into a web directory and detection of spam email.
• Visual Recognition: similarity between image patches, e.g. similarity(p_i, p_j) = e^{ -||h_i - h_j||_1 }, where h_i, h_j are the patches' color histograms.
• Genomics:
[Figure: gene-expression matrix - rows: gene expressions, columns: tissue samples.]
Goal: recognizing the relevant genes which separate between normal and tumor cells, between different sub classes of cancer, and so on.
Why Select Features?
• Most learning algorithms do not scale well with the growth of irrelevant features.
  ex1: the number of training examples required by some supervised learning methods grows exponentially.
  ex2: for classifiers which can optimize their “capacity” (e.g., large-margin hyper-planes), the sample-complexity bound

      m ≥ (1/ε) ( log(1/δ) + d log(1/ε) )

  applies, and the effective VC dimension d grows fast with irrelevant coordinates - faster than the capacity increase (see the small numeric sketch after this list).
• Computational efficiency considerations when the number of coordinates is very high.
• Run-time of the (already trained) inference engine on new test examples.
• The structure of the data gets obscured by large amounts of irrelevant coordinates.
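A quick numeric illustration of the sample-complexity bound above (a minimal sketch; the values of ε, δ and d are arbitrary and only meant to show how the required m scales with the effective VC dimension d):

```python
import math

def sample_bound(eps, delta, d):
    """Evaluate m >= (1/eps) * (log(1/delta) + d * log(1/eps))."""
    return (1.0 / eps) * (math.log(1.0 / delta) + d * math.log(1.0 / eps))

# The required training-set size grows linearly with d, so irrelevant
# coordinates that inflate the effective VC dimension inflate m as well.
for d in (10, 100, 1000):
    print(d, round(sample_bound(eps=0.05, delta=0.01, d=d)))
```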
Existing Approaches
• Filter methods: pre-processing of the data, independent of the inference engine. Examples: use of mutual information measures, correlation coefficients, clustering, ...
• Embedded, Wrapper: select features useful for building a good predictor/inference engine. Example: run SVM training on every candidate subset of features. A computationally expensive approach in general.
Feature Subset Relevance - Key Idea
[Figure: selecting a subset of rows of M. Each data point M_i ∈ R^n is reduced to M̂_i ∈ R^l by keeping only the selected l ≤ n rows (features), giving a reduced sample matrix M̂ = [M̂_1, ..., M̂_q].]
Working Assumption: the relevant subset of rows induces columns that are coherently clustered.

Note: we are assuming an unsupervised setting (data points are not labeled). The framework can easily apply to supervised settings as well.
• How to measure cluster coherency? We wish to avoid explicitly clustering for each candidate subset of rows, and we want a measure which is amenable to continuous functional analysis.
key idea: use spectral information from the affinity matrix
The affinity matrix of the reduced data points M̂_1, ..., M̂_q is M̂^T M̂, where M̂ = [M̂_1, ..., M̂_q].
• How to represent M̂^T M̂? For a subset of features s = {i_1, ..., i_l}, define

    A_α = M̂^T M̂ = Σ_{i=1}^n α_i m_i m_i^T,   where α_i = 1 if i ∈ s and α_i = 0 otherwise.
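A minimal NumPy sketch of this identity on hypothetical random data (rows of M play the role of the feature vectors m_i^T; the subset s is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 8, 5                                      # n features, q data points
M = rng.standard_normal((n, q))
M /= np.linalg.norm(M, axis=1, keepdims=True)    # normalize each row m_i

s = [1, 3, 6]                                    # selected feature subset
alpha = np.zeros(n)
alpha[s] = 1.0                                   # indicator weights

# A_alpha = sum_i alpha_i * m_i m_i^T  (a q x q affinity matrix)
A_alpha = sum(alpha[i] * np.outer(M[i], M[i]) for i in range(n))

# Equivalently, the affinity matrix of the reduced data: M_hat^T M_hat
M_hat = M[s, :]
assert np.allclose(A_alpha, M_hat.T @ M_hat)
```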
Definition of Relevancy: The Standard Spectrum

General Idea:
Select a subset of rows from the sample matrix M such that the resulting affinity matrix will have high values associated with the first k eigenvalues.

For a subset of features s = {i_1, ..., i_l}, with α_i = 1 if i ∈ s and α_i = 0 otherwise, the resulting affinity matrix is

    A_α = Σ_{i=1}^n α_i m_i m_i^T

and the relevance of the subset is

    rel(m_{i_1}, ..., m_{i_l}) = trace(Q_α^T A_α^T A_α Q_α) = Σ_{j=1}^k λ_j^2

where Q_α consists of the first k eigenvectors of A_α and λ_1, ..., λ_k are its leading eigenvalues.
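A minimal sketch of computing this relevance score for a given 0/1 weight vector (same conventions as the sketch above; k is the assumed number of clusters):

```python
import numpy as np

def relevance(M, alpha, k):
    """rel = trace(Q^T A_alpha^T A_alpha Q) = sum of squares of the k largest eigenvalues of A_alpha."""
    A = (M * alpha[:, None]).T @ M      # A_alpha = sum_i alpha_i m_i m_i^T
    w, V = np.linalg.eigh(A)            # eigenvalues in ascending order
    Q = V[:, -k:]                       # first k eigenvectors (largest eigenvalues)
    rel_trace = np.trace(Q.T @ A.T @ A @ Q)
    assert np.isclose(rel_trace, np.sum(w[-k:] ** 2))   # equals sum_j lambda_j^2
    return rel_trace
```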
(unsupervised) Optimization Problem
Let A_α = Σ_{i=1}^n α_i m_i m_i^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)

subject to  Q^T Q = I,   α ∈ {0,1}^n
Optimization is too difficult to be considered in practice (integer and continuous variables programming).
(unsupervised) Optimization Problem: Soft Version I

Let A_α = Σ_{i=1}^n α_i m_i m_i^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q) + h(α)

subject to  Q^T Q = I,   α_i ≥ 0,   Σ_i α_i = 1

The non-linear function h(α) penalizes for a uniform α.
The result is a non-linear programming problem - could be quite difficult to solve.
(unsupervised) Optimization Problem
Let A_α = Σ_{i=1}^n α_i m_i m_i^T for some unknown real scalars α = (α_1, ..., α_n)^T.

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)

subject to  Q^T Q = I,   α^T α = 1

Note: this optimization definition ignores the requirements:
1. α_i ≥ 0
2. The weight vector α should be sparse.

Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property.
The Q-α Algorithm

    max_{Q, α_1, ..., α_n}  trace(Q^T A_α^T A_α Q)   subject to  Q^T Q = I,   α^T α = 1

If α were known, then A_α is known and Q is simply the first k eigenvectors of A_α.

If Q were known, then the problem becomes:

    max_{α_1, ..., α_n}  α^T G α   subject to  α^T α = 1,   where   G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j)
α is the largest eigenvector of G
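A minimal sketch of this step (same conventions: M is n×q with rows m_i^T, and Q is an assumed current q×k matrix with orthonormal columns):

```python
import numpy as np

def alpha_step(M, Q):
    """Build G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j) and return its largest eigenvector."""
    inner = M @ M.T              # (m_i^T m_j)
    proj = M @ Q @ Q.T @ M.T     # (m_i^T Q Q^T m_j)
    G = inner * proj             # element-wise product
    w, V = np.linalg.eigh(G)
    return V[:, -1]              # eigenvector of the largest eigenvalue
```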
The Q-α Algorithm: Power-Embedded

1. Let G^(r) be defined by  G^(r)_ij = (m_i^T m_j)(m_i^T Q^(r-1) Q^(r-1)T m_j)
2. Let α^(r) be the largest eigenvector of G^(r)
3. Let A^(r) = Σ_{i=1}^n α_i^(r) m_i m_i^T
4. Let Z^(r) = A^(r) Q^(r-1)
5. Z^(r) → Q^(r) R^(r)   (“QR” factorization step)
6. Increment r

Steps 4-5 are an orthogonal-iteration step.

Convergence proof: take k = 1 for example. Steps 4-5 become  q^(r) = A q / ||A q||  (with q = q^(r-1), A = A^(r)).
Need to show  q^(r)T A^2 q^(r) ≥ q^T A^2 q,  which reduces to

    q^T A^4 q ≥ (q^T A^2 q)^2

for all symmetric matrices A and unit vectors q - follows from convexity.
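A minimal NumPy sketch of the whole power-embedded loop (a sketch only: M is n×q with unit-norm rows m_i^T, k is the assumed number of clusters, and the random initialization and fixed iteration count are my choices, not specified on the slide):

```python
import numpy as np

def q_alpha(M, k, n_iter=50, seed=0):
    """Power-embedded Q-alpha iteration: returns the feature weight vector alpha."""
    n, q = M.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((q, k)))   # random orthonormal start
    for _ in range(n_iter):
        # Steps 1-2: build G and take its leading eigenvector as alpha
        G = (M @ M.T) * (M @ Q @ Q.T @ M.T)
        w, V = np.linalg.eigh(G)
        alpha = V[:, -1]
        # Step 3: weighted affinity A = sum_i alpha_i m_i m_i^T
        A = (M * alpha[:, None]).T @ M
        # Steps 4-5: orthogonal iteration (power step + QR re-orthonormalization)
        Q, _ = np.linalg.qr(A @ Q)
    return alpha

# usage on hypothetical data: alpha = q_alpha(M, k=2); large |alpha_i| marks relevant features
```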
Positivity and Sparsity of α: Hand-Waving Argument

    argmax_{Q, α_1, ..., α_n} trace(Q^T A_α^T A_α Q)  =  argmin_{Q, α_1, ..., α_n} ||A_α - Q Q^T A_α||_F^2

(up to adding redundant terms that do not depend on Q); the right-hand side is minimized if rank(A_α) = k.

A_α = Σ_{i=1}^n α_i m_i m_i^T  =  a sum of rank-1 matrices.

If we would like rank(A_α) to be small, we shouldn't add too many rank-1 terms; therefore α should be sparse.

Note: this argument does not say anything with regard to why α should be positive.
Positivity and Sparsity of α

The key for the emergence of a sparse and positive α has to do with the way the entries of G are defined:

    G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j) = (m_i^T m_j) Σ_{l=1}^k (m_i^T q_l)(q_l^T m_j)

Consider k = 1 for example; then each entry is of the form

    f = (a^T b)(a^T c)(b^T c),   with ||a|| = ||b|| = ||c|| = 1.

Clearly -1 < f ≤ 1: f = -1 only if (a^T b) = -1, (a^T c) = -1, (b^T c) = -1, or (a^T b) = -1, (a^T c) = 1, (b^T c) = 1 - which cannot happen.

Expected values of the entries of G are biased towards a positive number.
Positivity and Sparsity of α
1. What is the minimal value of f = (a^T b)(a^T c)(b^T c) when a, b, c vary over the n-dimensional unit hypersphere?
   Answer:  -1/8 ≤ f ≤ 1

2. Given a uniform sampling of a, b, c over the n-dimensional unit hypersphere, what are the mean μ and variance σ² of f?
   Answer:  μ = 1/6,  σ² = 1/18

3. Given that G_ij ~ N(μ, σ²), what is the probability that the first (largest) eigenvector of G is strictly non-negative (same sign)?
Proposition 3:

Model the entries of G as  G_ij ~ N(μ, σ²) for i ≠ j  and  G_ii ~ N(1/2, σ²),  with an infinitesimal σ².

Let Gx = λx with x the largest eigenvector. Then

    p(x ≥ 0) = ( Φ_[0, σ/μ](√n) )^n  → 1  as n → ∞,

where Φ_[μ, σ²](x) is the cumulative distribution function of N(μ, σ²).

[Figure: the bound ( Φ_[0, σ/μ](√n) )^n plotted against n, together with the empirical probability.]
Proposition 4: (with O. Zeitouni and M. Ben-Or)
p(x ≥ 0) → 1 as n → ∞ for any value of σ, when μ > 0.
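A minimal Monte Carlo sketch for probing this kind of claim empirically (my own illustration, not the experiment behind the slide's figure): draw a symmetric random matrix with positive-mean Gaussian entries and test whether its leading eigenvector has entries of a single sign.

```python
import numpy as np

def same_sign_prob(n, mu, sigma, trials=1000, seed=0):
    """Estimate the probability that the leading eigenvector of a symmetrized
    matrix of N(mu, sigma^2) entries has all entries of the same sign."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        B = rng.normal(mu, sigma, size=(n, n))
        G = (B + B.T) / 2                  # symmetrize
        w, V = np.linalg.eigh(G)
        x = V[:, -1]                       # leading eigenvector
        hits += bool(np.all(x >= 0) or np.all(x <= 0))
    return hits / trials

# For mu > 0 the estimate should approach 1 as n grows.
for n in (5, 20, 80):
    print(n, same_sign_prob(n, mu=0.2, sigma=1.0))
```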
Sparsity Gap - Definition

Let p, 0 < p < 1, be the fraction of relevant features and q = 1 - p.

Let
    G = [ A    B ]
        [ B^T  C ]
where the entries of the np×np block A are ~ N(μ_a, σ²) and the entries of B and C are ~ N(μ_b, σ²), with μ_b = 1/6.

Let x = (x_1, x_2)^T be the largest eigenvector of G, where x_1 holds the first np entries and x_2 holds the remaining nq entries.

The sparsity gap corresponding to G is the ratio ρ = x̄_1 / x̄_2, where x̄_1 is the mean of x_1 and x̄_2 is the mean of x_2.
Sparsity Gap - Proposition 4:

Let
    G = [ np μ_a   nq μ_b ]      (2×2)
        [ np μ_b   nq μ_b ]

and let x = (x_1, x_2)^T be the largest eigenvector of G.

The sparsity gap corresponding to G is:

    ρ = ( x_1 + N(0, np σ²) ) / ( x_2 + N(0, nq σ²) )

Example: μ_a = 0.85, μ_b = 1/6, n = 100:

    ρ = ( 61p - 10 + sqrt(3321p² - 820p + 100) ) / (20p)
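A minimal sketch of this 2×2 reduction for the example above (deterministic part only, i.e. the N(0, ·) noise terms are dropped; the closed-form expression is the one quoted on the slide):

```python
import numpy as np

def sparsity_gap(p, mu_a=0.85, mu_b=1/6, n=100):
    """Ratio x_1/x_2 of the leading eigenvector of [[np*mu_a, nq*mu_b], [np*mu_b, nq*mu_b]]."""
    q = 1 - p
    G = np.array([[n * p * mu_a, n * q * mu_b],
                  [n * p * mu_b, n * q * mu_b]])
    w, V = np.linalg.eig(G)
    x = V[:, np.argmax(w.real)].real
    return x[0] / x[1]

def sparsity_gap_closed_form(p):
    return (61 * p - 10 + np.sqrt(3321 * p**2 - 820 * p + 100)) / (20 * p)

# the two computations agree
for p in (0.1, 0.3, 0.5):
    print(round(sparsity_gap(p), 4), round(sparsity_gap_closed_form(p), 4))
```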
The feature selection for SVM benchmark
• Two synthetic data sets were proposed in a paper by Weston, Mukherjee, Chapelle, Pontil, Poggio and Vapnik from NIPS 2001.
• The data sets were designed to have a few relevant features which are separable by SVM, combined with many irrelevant features.
• The data sets were designed for the labeled case.
The linear dataset
• The linear data set is almost linearly separable once the correct features are recovered.
• There were 6 relevant features and 196 irrelevant ones.
• With probability 0.7 the data is almost separable by the first 3 relevant features and not separable by the remaining 3 relevant features; with probability 0.3 the second group of relevant features is the separable one. The remaining 196 features were drawn from N(0,20).
Results – linear data set
The unsupervised algorithm became effective only from 80 data points and up, and is not shown here.
Results – non-linear data set
There are two species of frogs in this figure:
American toad
Green frog (Rana clamitans)
Automatic separation
• We use small patches as basic features.
• In order to compare patches we use the L1 norm on their color histograms:

    similarity(p_i, p_j) = e^{ -||h_i - h_j||_1 }

  where h_i, h_j are the color histograms of patches p_i, p_j.
The matrix A: many features over ~40 images
The similarity between an image A and a patch B is the maximum over all similarities between the patches p in image A and the patch B:

    similarity(A, B) = max_{p ∈ A} similarity(p, B)
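A minimal sketch of these two similarity computations (assumed representation: each patch is reduced to a color histogram, and an image is simply a list of its patches' histograms; the histogram extraction itself is not shown):

```python
import numpy as np

def patch_similarity(h1, h2):
    """similarity(p1, p2) = exp(-||h1 - h2||_1) on color histograms."""
    return np.exp(-np.abs(h1 - h2).sum())

def image_patch_similarity(image_hists, h_patch):
    """similarity(image, patch) = max over the image's patches."""
    return max(patch_similarity(h, h_patch) for h in image_hists)
```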
American toad
Green frog (Rana clamitans)
Selected features
Using these features the clustering was correct on 80% of the samples – compared to 55% correct clustering using conventional spectral clustering
sea-elephant, elephant
Another example
Using these features the clustering was correct on 90% of the samples – compared to 65% correct clustering using conventional spectral clustering
Genomics
[Figure: gene-expression matrix - rows: gene expressions, columns: tissue samples.]
The microarray technology provides many measurements of gene expressions for different sample tissues.
Goal: recognizing the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells).

• Classification of Tissue Samples (type of Cancer, Normal vs. Tumor)
• Find Novel Subclasses (unsupervised)
• Find Genes responsible for classification (new insights for drug design)
Few samples (~50) and large dimension (~10,000)
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• The model consists of 6 parameters:
Parameter | Description               | Leukemia
a         | # class A samples         | 25
b         | # class B samples         | 47
m         | # features                | 600
e, (1-e)  | % irrelevant / relevant   | 72%, 28%
(3d)      | size of interval of means | d = 555
s         | std coefficient           | .75
• A relevant feature is sampled from N(μ_A, s·μ_A) or N(μ_B, s·μ_B), where the class means μ_A, μ_B are sampled uniformly from [-1.5d, 1.5d].
• An irrelevant feature is sampled from N(0, s).
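A minimal sketch of this generator (my reading of the parameters: s·|μ| is used as the standard deviation of a relevant feature, a and b are the class sizes, and e is the fraction of irrelevant features, as in the table above):

```python
import numpy as np

def make_synthetic(a=25, b=47, m=600, e=0.72, d=555, s=0.75, seed=0):
    """Ben-Dor/Friedman/Yakhini-style synthetic data: rows = features, columns = samples."""
    rng = np.random.default_rng(seed)
    n_irrel = int(round(e * m))
    n_rel = m - n_irrel
    X = np.empty((m, a + b))
    # relevant features: class means drawn uniformly from [-1.5d, 1.5d]
    for i in range(n_rel):
        mu_A, mu_B = rng.uniform(-1.5 * d, 1.5 * d, size=2)
        X[i, :a] = rng.normal(mu_A, s * abs(mu_A), size=a)
        X[i, a:] = rng.normal(mu_B, s * abs(mu_B), size=b)
    # irrelevant features: N(0, s)
    X[n_rel:] = rng.normal(0.0, s, size=(n_irrel, a + b))
    labels = np.array([0] * a + [1] * b)
    return X, labels
```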
Param. | Description           | Leuk. | MSA      | Q-α               | Remarks
a      | # class A samples     | 25    |          |                   |
b      | # class B samples     | 47    |          |                   |
m      | # features            | 432   | <250     | <5                | MSA uses redundancy
e      | % irrelevant features | 168   | >95%     | >99.5%            | Easy data set
d      | Size of interval      | 555   | [1,1000] | At least [1,1000] | data is normalized
s      | Spread                | .75   | <2       | <1000             | MSA needs good separation
The synthetic dataset of Ben-Dor, Friedman, and Yakhini
• MSA – the max surprise algorithm of Ben-Dor, Friedman, and Yakhini.
• Results of simulations done by varying one parameter out of m, d, e, s.
Follow Up Work
Feature selection with “side” information:
Given the “main” data M and the “side” data W, find weights α = (α_1, ..., α_n)^T such that

    Σ_{i=1}^n α_i m_i m_i^T   has coherent k clusters

and

    Σ_{i=1}^n α_i w_i w_i^T   has low cluster coherence (a single cluster).
Shashua & Wolf, ECCV’04
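A minimal sketch of one plausible way to score a weight vector in this setting (the actual objective and optimization of Shashua & Wolf, ECCV'04 are not given on the slide; the difference-of-relevance score below is only an illustrative assumption):

```python
import numpy as np

def spectral_relevance(X, alpha, k):
    """Sum of squares of the top-k eigenvalues of A_alpha = sum_i alpha_i x_i x_i^T."""
    A = (X * alpha[:, None]).T @ X
    w = np.linalg.eigvalsh(A)
    return np.sum(w[-k:] ** 2)

def side_info_score(M, W, alpha, k):
    # reward coherent k-cluster structure in the main data,
    # penalize cluster coherence induced in the side data (assumed trade-off)
    return spectral_relevance(M, alpha, k) - spectral_relevance(W, alpha, k)
```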
Follow Up Work

“Kernelizing” Q-α:

    m_i → φ(m_i)    (a high-dimensional mapping),    φ(m_i)^T φ(m_j) = k(m_i, m_j)

    A_α = Σ_{i=1}^n α_i φ(m_i) φ(m_i)^T
Rather than having inner-products we have outer-products.
Shashua & Wolf, ECCV’04
END