Feature Ranking Using Linear SVM
Yin-Wen Chang
Department of Computer Science
National Taiwan University
Joint work with Chih-Jen Lin
June 4, 2008
Yin-Wen Chang and Chih-Jen Lin (NTU) 1 / 21
Outline
- Introduction
- Support Vector Classification
- Feature Ranking Strategies
- Experimental Results
- Discussion and Conclusions
The Problem
The testing and training sets have different distributions
Three versions of each task: one unmanipulated (0) and two manipulated (1, 2) testing sets
Goal:
Make good predictions on the testing sets
Causality discovery could help
Feature ranking via linear SVM is used
Support Vector Classification
Training data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i = ±1
Find a separating hyperplane with the maximal margin between two classes
min_{w,b}  (1/2) wᵀw + C ∑_{i=1}^{l} ξ(w, b; x_i, y_i)
L2-loss function: ξ(w, b; x_i, y_i) = max(1 − y_i(wᵀx_i + b), 0)²
The decision function:
f(x) = sgn(wᵀx + b)
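As a concrete illustration, the L2-loss (squared hinge) objective above is differentiable, so even plain gradient descent can minimize it. This is a minimal sketch on a made-up toy problem; the experiments in this talk use LIBLINEAR's solver, and the learning rate and iteration count here are ad-hoc choices:

```python
import numpy as np

def train_l2svm(X, y, C=1.0, lr=0.01, iters=2000):
    """Minimize (1/2) w.w + C * sum_i max(1 - y_i(w.x_i + b), 0)^2
    by gradient descent (the L2 loss is differentiable)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        margin = 1.0 - y * (X @ w + b)   # slack before clipping at 0
        active = margin > 0              # only violated margins contribute
        grad_w = w - 2.0 * C * ((y * margin * active) @ X)
        grad_b = -2.0 * C * np.sum(y * margin * active)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)            # f(x) = sgn(w.x + b)

# toy separable data
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_l2svm(X, y)
```

On this separable toy set the learned hyperplane classifies all points correctly; real solvers (coordinate descent or Newton methods, as in LIBLINEAR) converge far faster.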
Feature Ranking Strategies (1/2)
Assign a weight to each feature and rank the features accordingly
F-score
A simple statistical measure
Measures the discrimination between a feature and the label
Independent of the classifier
Does not consider mutual information between features
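For reference, the F-score of a feature compares the separation of the two class means against the within-class variances. A small numpy sketch on made-up toy data (balanced ±1 labels assumed):

```python
import numpy as np

def f_score(X, y):
    """F-score per feature: between-class mean separation divided by
    the sum of within-class variances (larger = more discriminative)."""
    pos, neg = X[y == 1], X[y == -1]
    mean, mp, mn = X.mean(axis=0), pos.mean(axis=0), neg.mean(axis=0)
    numer = (mp - mean) ** 2 + (mn - mean) ** 2
    denom = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numer / denom

# toy data: feature 0 separates the classes, feature 1 is pure noise
X = np.array([[5.0, 0.1], [6.0, -0.2], [5.5, 0.0],
              [0.0, 0.15], [1.0, -0.1], [0.5, 0.05]])
y = np.array([1, 1, 1, -1, -1, -1])
scores = f_score(X, y)
```

As the slide notes, the score is computed per feature in isolation, so correlated or jointly informative features are not accounted for.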
Linear SVM Weight
[Guyon et al., 2002]
Weight w in the decision function: f(x) = sgn(wᵀx + b)
Rank the features according to |w_j|, the absolute value of the jth weight
Larger |wj | indicates more relevance
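The ranking step itself is a one-liner once w is available. The weight vector below is made up for illustration; in practice it would come from a trained linear SVM:

```python
import numpy as np

# hypothetical weight vector from a trained linear SVM
w = np.array([0.1, -2.3, 0.7, 1.5])

# feature indices sorted by decreasing |w_j|: larger weight = more relevant
ranking = np.argsort(-np.abs(w))
```

Here feature 1 (|w| = 2.3) ranks first and feature 0 (|w| = 0.1) last.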
Feature Ranking Strategies (2/2)
Change of AUC/ACC with/without Removing Each Feature
Measures how the performance changes when a feature is removed
Performance measures are:
CV AUC and CV ACC (accuracy rate)
Applicable to any classifier and performance measure
Expensive training and prediction
Infeasible when the number of features is large
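A sketch of the D-AUC idea: retrain without each feature and record the AUC drop. To stay self-contained, a trivial class-mean-difference scorer stands in for the SVM here, and plain training-set AUC stands in for cross validation; both substitutions are for illustration only:

```python
import numpy as np

def auc(scores, y):
    # Wilcoxon-Mann-Whitney estimate of the AUC
    pos, neg = scores[y == 1], scores[y == -1]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def centroid_scores(X, y):
    # stand-in "classifier": project onto the difference of class means
    w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    return X @ w

def delta_auc(X, y):
    """D-AUC weight of each feature: AUC drop when that feature is removed."""
    base = auc(centroid_scores(X, y), y)
    drops = []
    for j in range(X.shape[1]):
        Xj = np.delete(X, j, axis=1)          # retrain without feature j
        drops.append(base - auc(centroid_scores(Xj, y), y))
    return np.array(drops)

# feature 0 carries the label, feature 1 is small noise
X = np.array([[1.0, 0.01], [1.2, -0.02], [0.9, 0.0],
              [-1.0, 0.02], [-1.1, -0.01], [-0.8, 0.01]])
y = np.array([1, 1, 1, -1, -1, -1])
drops = delta_auc(X, y)
```

The loop makes the cost visible: one full training run per feature, which is exactly why this strategy is infeasible for thousands of features.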
Challenge Datasets
Dataset  Feature type  # Features  # Training  # Testing
REGED    numerical          999         500      20,000
SIDO     binary           4,932      12,678      10,000
CINA     mixed              132      16,033      10,000
MARTI    numerical        1,024         500      20,000
LUCAS    binary              11       2,000      10,000
LUCAP    binary             143       2,000      10,000
Preprocessing (1/2)
Scale each feature of REGED and CINA to [0, 1]
Apply the same scaling parameters to both the training and testing sets
SIDO, LUCAS, and LUCAP: binary data with values in {0, 1}
Each instance is normalized to unit length
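The two preprocessing steps might look like the sketch below (array names are hypothetical); the key point is that the test set is transformed with parameters computed on the training set only:

```python
import numpy as np

def minmax_fit(X_train):
    # scaling parameters come from the training set only
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_apply(X, lo, hi):
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (X - lo) / span

def unit_length(X):
    # instance-wise normalization for the binary datasets
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)

# toy example
X_train = np.array([[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]])
X_test = np.array([[1.0, 25.0]])
lo, hi = minmax_fit(X_train)
train_scaled = minmax_apply(X_train, lo, hi)
test_scaled = minmax_apply(X_test, lo, hi)   # reuses training lo/hi
```

Note that test values may fall slightly outside [0, 1] under this scheme; that is expected when the distributions differ.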
Preprocessing (2/2)
The training set of MARTI is perturbed by noise
Use Gaussian filtering to eliminate the low-frequency noise
Rearrange the 1024 features into a 32×32 array
Neighboring positions are similarly affected
Spatial Gaussian filter:

g(x₀) = (1 / G(x₀)) ∑_x exp(−(1/2)(‖x − x₀‖/σ)²) f(x),  where G(x₀) = ∑_x exp(−(1/2)(‖x − x₀‖/σ)²)

- f′ = f − g is the feature value we use
Training and testing sets are scaled to [-1,1] separately
- A remaining bias value is removed after the Gaussian filtering
- This yields better performance
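A direct, unoptimized implementation of the filter above on a square grid of features; the grid size and σ below are placeholders rather than the values used in the actual submissions:

```python
import numpy as np

def gaussian_smooth(f, sigma=2.0):
    """g(x0) = (1/G(x0)) * sum_x exp(-0.5 (||x - x0|| / sigma)^2) f(x),
    computed densely over all grid positions."""
    n = f.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-0.5 * d2 / sigma ** 2)
    g = (W @ f.ravel()) / W.sum(axis=1)   # G(x0) normalizes each row
    return g.reshape(n, n)

# toy 8x8 feature grid: random noise plus a constant low-frequency bias
f = np.random.default_rng(0).normal(size=(8, 8)) + 5.0
f_prime = f - gaussian_smooth(f)          # f' = f - g, the values we would use
```

Smoothing a constant field returns the same constant, so subtracting g strips the slowly varying component while keeping the local structure.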
Classifier and Model Selection
L2-loss linear SVM
LIBLINEAR [Hsieh et al., 2008]
http://www.csie.ntu.edu.tw/~cjlin/liblinear
Grid search to determine the penalty parameter C for linear SVM
min_{w,b}  (1/2) wᵀw + C ∑_{i=1}^{l} ξ(w, b; x_i, y_i)
Prediction value: decision value from linear SVM
f(x) = sgn(wᵀx + b)
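The grid search over C can be made concrete as below. The actual submissions rely on LIBLINEAR; this sketch re-implements a tiny gradient-descent L2-loss SVM only so the cross-validation loop is runnable end to end, and the toy data, grid, and fold count are all illustrative choices:

```python
import numpy as np

def train_l2svm(X, y, C, lr=0.002, iters=2000):
    # tiny gradient-descent solver for (1/2) w.w + C sum max(1 - y f(x), 0)^2
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        m = 1.0 - y * (X @ w + b)
        a = m > 0
        w -= lr * (w - 2.0 * C * ((y * m * a) @ X))
        b -= lr * (-2.0 * C * np.sum(y * m * a))
    return w, b

def cv_accuracy(X, y, C, folds=3):
    # k-fold cross-validation accuracy for a given penalty parameter C
    idx = np.arange(len(y))
    accs = []
    for k in range(folds):
        test = idx % folds == k
        w, b = train_l2svm(X[~test], y[~test], C)
        accs.append(np.mean(np.sign(X[test] @ w + b) == y[test]))
    return float(np.mean(accs))

# toy two-cluster data (centers made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(scale=0.2, size=(60, 2))
X[:30] += 1.0   # positive class around (1, 1)
X[30:] -= 1.0   # negative class around (-1, -1)
y = np.array([1.0] * 30 + [-1.0] * 30)

grid = [2.0 ** p for p in (-5, -3, -1, 1)]   # exponentially spaced C values
best_C = max(grid, key=lambda C: cv_accuracy(X, y, C))
```

Exponentially spaced grids are standard because the objective is far more sensitive to the order of magnitude of C than to its exact value.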
Experiments
We do not use information of
- the manipulated features in REGED and MARTI
- the 25 calibrant features in MARTI
We do not learn from testing sets
Nested subsets of feature sizes {1, 2, 4, 8, ..., max #}
Three methods to compare different strategies
Quartile information
Cross-validation AUC
Testing AUC (Tscore) on toy examples LUCAS and LUCAP
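Generating the nested subsets from a feature ranking is straightforward; the ranking below is a hypothetical list of six feature indices, most relevant first:

```python
def nested_subsets(ranking):
    """Nested feature subsets of sizes 1, 2, 4, ..., doubling up to all
    features; each subset keeps the top-ranked features first."""
    n = len(ranking)
    sizes, k = [], 1
    while k < n:
        sizes.append(k)
        k *= 2
    sizes.append(n)
    return [ranking[:s] for s in sizes]

# hypothetical ranking of six feature indices, most relevant first
subsets = nested_subsets([4, 1, 0, 3, 2, 5])
```

Because each subset is a prefix of the next, a single ranking yields the whole family of candidate feature sets at no extra cost.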
Comparisons of methods (1/3)
Five-fold cross validation AUC of the best feature size
Dataset   REGED          SIDO             CINA           MARTI
F-score   0.9998 (16)    0.9461 (2,048)   0.9694 (128)   0.9210 (512)
W         1.0000 (32)    0.9552 (512)     0.9710 (64)    0.9632 (128)
D-AUC     1.0000 (16)    –                0.9699 (128)   0.9640 (256)
D-ACC     0.9998 (64)    –                0.9694 (132)   0.8993 (32)
SVM       0.9963 (999)   0.9447 (4,932)   0.9694 (132)   0.8854 (1,024)
F-score: F-score
W: Linear SVM weight
D-AUC: Change of AUC with/without each feature
D-ACC: Change of ACC with/without each feature
SVM: Linear SVM without feature ranking
D-AUC and D-ACC are infeasible for SIDO due to the large number of features
Comparisons of methods (2/3)
Testing AUC on toy examples
Dataset   LUCAS 0   LUCAS 1   LUCAS 2   LUCAP 0   LUCAP 1   LUCAP 2
F-score   0.9208    0.8989    0.7446    0.9702    0.8327    0.7453
W         0.9208    0.8989    0.7654    0.9702    0.9130    0.9159
D-AUC     0.9208    0.8989    0.7654    0.9696    0.8648    0.8655
D-ACC     0.9208    0.8989    0.7446    0.9696    0.7755    0.6011
W performs best
On LUCAP 1 and LUCAP 2: W > D-AUC > F-score > D-ACC
Comparisons of methods (3/3)
The top ranked features given by W and D-AUC are similar
CINA (132)
W : [121, 18, 79, 25, 9, 27, 41, 71]
D-AUC: [121, 18, 79, 25, 100, 27, 41, 9]
REGED (999)
W : [593, 251, 321, 409, 453, 930, 425, 344]
D-AUC: [593, 930, 251, 453, 409, 344, 321, 425]
Final Submission (1/2)
Feature ranking using linear SVM weights
Dataset   Fnum          Fscore   Tscore   Top Ts   Max Ts   Rank
REGED 0   16/999        0.8526   0.9998   1.0000   1.0000
REGED 1   16/999        0.8566   0.9556   0.9980   0.9980
REGED 2   8/999         0.9970   0.8392   0.8600   0.9543
mean                             0.9316                     1
SIDO 0    1,024/4,932   0.6516   0.9432   0.9443   0.9467
SIDO 1    4,096/4,932   0.5685   0.7523   0.7532   0.7893
SIDO 2    2,048/4,932   0.5685   0.6235   0.6684   0.7674
mean                             0.7730                     2
CINA 0    64/132        0.6000   0.9715   0.9788   0.9788
CINA 1    64/132        0.7053   0.8446   0.8977   0.8977
CINA 2    4/132         0.7053   0.8157   0.8157   0.8910
mean                             0.8773                     1
MARTI 0   256/1,024     0.8073   0.9914   0.9996   0.9996
MARTI 1   256/1,024     0.7279   0.9209   0.9470   0.9542
MARTI 2   2/1,024       0.9897   0.7606   0.7975   0.8273
mean                             0.8910                     3
Final Submission (2/2)
The best feature size from cross validation is quite consistent with Fnum
Fnum is small on REGED, CINA 2, and MARTI 2
On CINA 2, our method might benefit from good feature ranking:
Fnum is only four, while the Tscore is the best among all submissions
Discussions
Our linear SVM classifier is good
Excellent performance on version 0
The CV results and the true testing AUC are not very consistent
D-ACC is the worst in cross validation but not in the final results
We benefit from the nested-subset submission
Our ranking method is good enough for this to yield better performance
How about combining our feature ranking with other SVM kernels or other classifiers?