
Page 1

Feature Ranking Using Linear SVM

Yin-Wen Chang
Department of Computer Science

National Taiwan University

Joint work with Chih-Jen Lin
June 4, 2008


Page 2

Outline

Introduction
Support Vector Classification
Feature Ranking Strategies
Experimental Results
Discussion and Conclusions


Page 3

The Problem

The testing and training sets have different distributions

Three versions of each task: one unmanipulated (0) and two manipulated (1, 2) testing sets

Goal:

Make good predictions on the testing sets

Causality discovery could help

Feature ranking via linear SVM is used


Page 4

Outline

Introduction
Support Vector Classification
Feature Ranking Strategies
Experimental Results
Discussion and Conclusions


Page 5

Support Vector Classification

Training data $(x_i, y_i)$, $i = 1, \dots, l$, with $x_i \in \mathbb{R}^n$ and $y_i = \pm 1$

Find a separating hyperplane with the maximal margin between two classes:

$$\min_{w,b} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w, b;\, x_i, y_i)$$

L2-loss function:

$$\xi(w, b;\, x_i, y_i) = \max\bigl(1 - y_i(w^T x_i + b),\, 0\bigr)^2$$

The decision function:

$$f(x) = \mathrm{sgn}(w^T x + b)$$
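To make the formulation concrete, here is a minimal sketch of training such a classifier with scikit-learn's LinearSVC (a wrapper around LIBLINEAR); `loss='squared_hinge'` corresponds to the L2-loss above, and the data and the value of C are placeholders, not the challenge setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder training data: rows are instances x_i, labels y_i = +/-1.
X = np.random.rand(100, 20)
y = np.sign(np.random.rand(100) - 0.5)

# loss='squared_hinge' is the L2-loss xi(w, b; x_i, y_i) above;
# C trades the total loss against the margin term (1/2) w^T w.
clf = LinearSVC(loss='squared_hinge', C=1.0)
clf.fit(X, y)

# Decision function f(x) = sgn(w^T x + b).
pred = np.sign(clf.decision_function(X))
```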


Page 6

Outline

Introduction
Support Vector Classification
Feature Ranking Strategies
Experimental Results
Discussion and Conclusions


Page 7

Feature Ranking Strategies (1/2)

Assign a weight to each feature and rank the features accordingly

F-score

A simple statistical measure

Measures the discrimination between a feature and the label

Independent of the classifier

Does not consider mutual information between features
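The slides do not spell out the F-score formula; the sketch below assumes the standard definition of Chen and Lin (2006), which divides each feature's between-class separation by its within-class variance.

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score: between-class separation over
    within-class variance, computed independently per feature."""
    pos, neg = X[y == 1], X[y == -1]
    mean_all = X.mean(axis=0)
    mean_pos, mean_neg = pos.mean(axis=0), neg.mean(axis=0)
    numer = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    denom = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numer / (denom + 1e-12)  # guard against constant features
```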

Linear SVM Weight

[Guyon et al., 2002]

Weight $w$ in the decision function: $f(x) = \mathrm{sgn}(w^T x + b)$

Rank the features according to $|w_j|$, the weight of the $j$th feature

Larger $|w_j|$ indicates more relevance
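A minimal sketch of this ranking, again assuming a LIBLINEAR-style linear SVM trained through scikit-learn; the helper name and the value of C are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_by_weight(X, y, C=1.0):
    """Return feature indices sorted by |w_j|, most relevant first."""
    clf = LinearSVC(loss='squared_hinge', C=C).fit(X, y)
    w = clf.coef_.ravel()          # w from f(x) = sgn(w^T x + b)
    return np.argsort(-np.abs(w))  # descending |w_j|
```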


Page 8

Feature Ranking Strategies (2/2)

Change of AUC/ACC With/Without Each Feature

How performance is influenced when a feature is removed

Performance measures are:

CV AUC and CV ACC (accuracy rate)

Applicable to any classifier and performance measure

Expensive training and prediction

Infeasible when the number of features is large
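A sketch of the D-AUC variant, assuming five-fold CV with a linear SVM; the helper name is hypothetical. The loop retrains once per feature, which is exactly why the method breaks down for large feature counts:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def rank_by_auc_change(X, y, C=1.0, folds=5):
    """Rank features by the drop in CV AUC when each one is removed."""
    def cv_auc(data):
        clf = LinearSVC(loss='squared_hinge', C=C)
        return cross_val_score(clf, data, y, cv=folds,
                               scoring='roc_auc').mean()

    base = cv_auc(X)
    drop = np.array([base - cv_auc(np.delete(X, j, axis=1))
                     for j in range(X.shape[1])])   # one retraining per feature
    return np.argsort(-drop)  # larger AUC drop = more important feature
```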


Page 9

Outline

Introduction
Support Vector Classification
Feature Ranking Strategies
Experimental Results
Discussion and Conclusions


Page 10

Challenge Datasets

Dataset  Feature type  # Features  # Training  # Testing
REGED    numerical          999        500       20,000
SIDO     binary           4,932     12,678       10,000
CINA     mixed              132     16,033       10,000
MARTI    numerical        1,024        500       20,000
LUCAS    binary              11      2,000       10,000
LUCAP    binary             143      2,000       10,000


Page 11

Preprocessing (1/2)

Scale each feature of REGED and CINA to [0, 1]

Use the same scaling parameters for both the training and testing sets

SIDO, LUCAS, and LUCAP: binary data with values in {0, 1}, instance-wise normalized to unit length
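A minimal sketch of both preprocessing steps; the arrays and their shapes are placeholders rather than the actual challenge data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

X_train = np.random.rand(500, 999)     # placeholder, REGED-like shape
X_test = np.random.rand(20000, 999)

# Scale each feature to [0, 1]; the scaling parameters are fit on the
# training set and the same transformation is applied to the testing set.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Binary data (SIDO, LUCAS, LUCAP): normalize each row to unit length.
X_binary = np.random.randint(0, 2, size=(2000, 143)).astype(float)
X_binary_unit = normalize(X_binary, norm='l2')
```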


Page 12

Preprocessing (2/2)

The training set of MARTI is perturbed by noise

Use Gaussian filtering to eliminate the low-frequency noise

Rearrange the 1024 features into a 32×32 array

Neighboring positions are similarly affected

Spatial Gaussian filter:

$$g(x_0) = \frac{1}{G(x_0)} \sum_{x} e^{-\frac{1}{2}\left(\frac{\|x - x_0\|}{\sigma}\right)^2} f(x), \qquad \text{where } G(x_0) = \sum_{x} e^{-\frac{1}{2}\left(\frac{\|x - x_0\|}{\sigma}\right)^2}$$

- $f' = f - g$ is the feature value we use

Training and testing sets are scaled to [-1, 1] separately

- A bias value is removed after the Gaussian filtering
- This gives better performance
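A sketch of the filtering step on a single instance, implementing the formula above directly; the function name and the value of sigma are placeholders, since the slides do not state them:

```python
import numpy as np

def spatial_gaussian_filter(f, sigma=2.0):
    """Apply the spatial Gaussian filter g above to a 32x32 feature array."""
    n = f.shape[0]
    # Grid positions x of all n*n features.
    idx = np.indices((n, n)).reshape(2, -1).T.astype(float)
    d2 = ((idx[:, None, :] - idx[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * d2 / sigma**2)     # e^{-(1/2)(||x - x0|| / sigma)^2}
    g = (K @ f.ravel()) / K.sum(axis=1)  # divide by the normalizer G(x0)
    return g.reshape(n, n)

features = np.random.rand(32, 32)        # placeholder instance
f_prime = features - spatial_gaussian_filter(features)  # f' = f - g
```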


Page 13

Classifier and Model Selection

L2-loss linear SVM

LIBLINEAR [Hsieh et al., 2008]

http://www.csie.ntu.edu.tw/~cjlin/liblinear

Grid search to determine the penalty parameter C for linear SVM

$$\min_{w,b} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w, b;\, x_i, y_i)$$

Prediction value: decision value from linear SVM

$$f(x) = \mathrm{sgn}(w^T x + b)$$
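A minimal sketch of this model selection step; the power-of-two grid for C is an assumption, as the slides do not give the actual grid, and the data is a placeholder:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X = np.random.rand(200, 50)                       # placeholder data
y = np.where(np.random.rand(200) > 0.5, 1, -1)

# Cross-validated grid search over the penalty parameter C.
grid = {'C': [2.0 ** k for k in range(-5, 6)]}    # assumed grid
search = GridSearchCV(LinearSVC(loss='squared_hinge'), grid,
                      scoring='roc_auc', cv=5).fit(X, y)

# Prediction values: decision values w^T x + b from the best model.
scores = search.best_estimator_.decision_function(X)
```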


Page 14

Experiments

We do not use information about

- the manipulated features in REGED and MARTI

- the 25 calibrant features in MARTI

We do not learn from testing sets

Nested subsets of feature sizes {1, 2, 4, 8, ..., max #}; a sketch of generating them follows below
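A minimal sketch of building such nested subsets from any of the feature rankings above; `ranked` and the function name are hypothetical, with `ranked` a list of feature indices, most relevant first:

```python
def nested_subsets(ranked):
    """Subsets of the top 1, 2, 4, 8, ... features, up to all of them."""
    sizes = []
    k = 1
    while k < len(ranked):
        sizes.append(k)
        k *= 2
    sizes.append(len(ranked))        # always include the full feature set
    return [ranked[:s] for s in sizes]

# e.g. nested_subsets(list(range(10))) ->
# [[0], [0, 1], [0, 1, 2, 3], [0, ..., 7], [0, ..., 9]]
```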

Three ways to compare the different strategies:

Quartile information

Cross-validation AUC

Testing AUC (Tscore) on toy examples LUCAS and LUCAP


Page 15

Comparisons of methods (1/3)

Five-fold cross-validation AUC at the best feature size (shown in parentheses)

Dataset  REGED         SIDO             CINA          MARTI
F-score  0.9998 (16)   0.9461 (2,048)   0.9694 (128)  0.9210 (512)
W        1.0000 (32)   0.9552 (512)     0.9710 (64)   0.9632 (128)
D-AUC    1.0000 (16)   –                0.9699 (128)  0.9640 (256)
D-ACC    0.9998 (64)   –                0.9694 (132)  0.8993 (32)
SVM      0.9963 (999)  0.94471 (4,932)  0.9694 (132)  0.8854 (1,024)

F-score: F-score

W: Linear SVM weight

D-AUC: Change of AUC with/without each feature

D-ACC: Change of ACC with/without each feature

SVM: Linear SVM without feature ranking

D-AUC and D-ACC are infeasible for SIDO due to the large number of features


Page 16

Comparisons of methods (2/3)

Testing AUC on toy examples

Dataset  LUCAS 0  LUCAS 1  LUCAS 2  LUCAP 0  LUCAP 1  LUCAP 2
F-score  0.9208   0.8989   0.7446   0.9702   0.8327   0.7453
W        0.9208   0.8989   0.7654   0.9702   0.9130   0.9159
D-AUC    0.9208   0.8989   0.7654   0.9696   0.8648   0.8655
D-ACC    0.9208   0.8989   0.7446   0.9696   0.7755   0.6011

W performs best

On LUCAP 1 and LUCAP 2: W > D-AUC > F-score > D-ACC


Page 17

Comparisons of methods (3/3)

The top ranked features given by W and D-AUC are similar

CINA (132 features)

W:     [121, 18, 79, 25, 9, 27, 41, 71]
D-AUC: [121, 18, 79, 25, 100, 27, 41, 9]

REGED (999 features)

W:     [593, 251, 321, 409, 453, 930, 425, 344]
D-AUC: [593, 930, 251, 453, 409, 344, 321, 425]


Page 18

Final Submission (1/2)

Feature ranking using linear SVM weights

Dataset  Fnum         Fscore  Tscore  Top Ts  Max Ts  Rank
REGED 0  16/999       0.8526  0.9998  1.0000  1.0000
REGED 1  16/999       0.8566  0.9556  0.9980  0.9980
REGED 2  8/999        0.9970  0.8392  0.8600  0.9543
mean                          0.9316                  1
SIDO 0   1,024/4,932  0.6516  0.9432  0.9443  0.9467
SIDO 1   4,096/4,932  0.5685  0.7523  0.7532  0.7893
SIDO 2   2,048/4,932  0.5685  0.6235  0.6684  0.7674
mean                          0.7730                  2
CINA 0   64/132       0.6000  0.9715  0.9788  0.9788
CINA 1   64/132       0.7053  0.8446  0.8977  0.8977
CINA 2   4/132        0.7053  0.8157  0.8157  0.8910
mean                          0.8773                  1
MARTI 0  256/1,024    0.8073  0.9914  0.9996  0.9996
MARTI 1  256/1,024    0.7279  0.9209  0.9470  0.9542
MARTI 2  2/1,024      0.9897  0.7606  0.7975  0.8273
mean                          0.8910                  3


Page 19

Final Submission (2/2)

The best feature size from CV is quite consistent with Fnum

Fnum is small on REGED, CINA 2, and MARTI 2

On CINA 2, our method might benefit from good feature ranking

Fnum is only four, yet the Tscore is the best among all submissions


Page 20

Outline

Introduction
Support Vector Classification
Feature Ranking Strategies
Experimental Results
Discussion and Conclusions


Page 21

Discussions

Our linear SVM classifier is good

Excellent performance on version 0

CV results and the true testing AUC are not very consistent

D-ACC is the worst in cross-validation but not in the final results

We benefit from the nested-subset submission

Our ranking method is good enough to achieve better performance

How about combining our feature ranking with other SVM kernels or other classifiers?
