Feature Ranking Using Linear SVM
Yin-Wen Chang
Department of Computer Science
National Taiwan University
Joint work with Chih-Jen Lin
June 4, 2008
Yin-Wen Chang and Chih-Jen Lin (NTU) 1 / 21
Outline
- Introduction
- Support Vector Classification
- Feature Ranking Strategies
- Experimental Results
- Discussion and Conclusions
The Problem
The testing and training sets have different distributions
Three versions of each task: one unmanipulated (0) and two manipulated (1, 2) testing sets
Goal:
Make good predictions on the testing sets
Causality discovery could help
Feature ranking via linear SVM is used
Support Vector Classification
Training data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i = ±1
Find a separating hyperplane with the maximal margin between two classes
min_{w,b}  (1/2) wᵀw + C ∑_{i=1}^{l} ξ(w, b; x_i, y_i)
L2-loss function: ξ(w, b; x_i, y_i) = max(1 − y_i(wᵀx_i + b), 0)²
The decision function:
f(x) = sgn(wᵀx + b)
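As a concrete illustration, the L2-loss (squared hinge) objective above is differentiable, so even plain gradient descent can minimize it. This is a minimal sketch on a made-up toy problem; the experiments in this talk use LIBLINEAR's solver, and the learning rate and iteration count here are ad-hoc choices:

```python
import numpy as np

def train_l2svm(X, y, C=1.0, lr=0.01, iters=2000):
    """Minimize (1/2) w.w + C * sum_i max(1 - y_i(w.x_i + b), 0)^2
    by gradient descent (the L2 loss is differentiable)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        margin = 1.0 - y * (X @ w + b)   # slack before clipping at 0
        active = margin > 0              # only violated margins contribute
        grad_w = w - 2.0 * C * ((y * margin * active) @ X)
        grad_b = -2.0 * C * np.sum(y * margin * active)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)            # f(x) = sgn(w.x + b)

# toy separable data
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_l2svm(X, y)
```

On this separable toy set the learned hyperplane classifies all points correctly; real solvers (coordinate descent or Newton methods, as in LIBLINEAR) converge far faster.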
Feature Ranking Strategies (1/2)
Assign a weight to each feature and rank the features accordingly
F-score
A simple statistical measure
Measures the discrimination between a feature and the label
Independent of the classifier
Does not consider mutual information between features
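For reference, the F-score of a feature compares the separation of the two class means against the within-class variances. A small numpy sketch on made-up toy data (balanced ±1 labels assumed):

```python
import numpy as np

def f_score(X, y):
    """F-score per feature: between-class mean separation divided by
    the sum of within-class variances (larger = more discriminative)."""
    pos, neg = X[y == 1], X[y == -1]
    mean, mp, mn = X.mean(axis=0), pos.mean(axis=0), neg.mean(axis=0)
    numer = (mp - mean) ** 2 + (mn - mean) ** 2
    denom = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numer / denom

# toy data: feature 0 separates the classes, feature 1 is pure noise
X = np.array([[5.0, 0.1], [6.0, -0.2], [5.5, 0.0],
              [0.0, 0.15], [1.0, -0.1], [0.5, 0.05]])
y = np.array([1, 1, 1, -1, -1, -1])
scores = f_score(X, y)
```

As the slide notes, the score is computed per feature in isolation, so correlated or jointly informative features are not accounted for.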
Linear SVM Weight
[Guyon et al., 2002]
Weight w in the decision function: f(x) = sgn(wᵀx + b)
Rank the features according to |w_j|, the absolute value of the jth weight
Larger |wj | indicates more relevance
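The ranking step itself is a one-liner once w is available. The weight vector below is made up for illustration; in practice it would come from a trained linear SVM:

```python
import numpy as np

# hypothetical weight vector from a trained linear SVM
w = np.array([0.1, -2.3, 0.7, 1.5])

# feature indices sorted by decreasing |w_j|: larger weight = more relevant
ranking = np.argsort(-np.abs(w))
```

Here feature 1 (|w| = 2.3) ranks first and feature 0 (|w| = 0.1) last.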
Feature Ranking Strategies (2/2)
Change of AUC/ACC with/without Removing Each Feature
Measures how the performance changes when a feature is removed
Performance measures are:
CV AUC and CV ACC (accuracy rate)
Applicable to any classifier and performance measure
Expensive training and prediction
Infeasible when the number of features is large
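A sketch of the D-AUC idea: retrain without each feature and record the AUC drop. To stay self-contained, a trivial class-mean-difference scorer stands in for the SVM here, and plain training-set AUC stands in for cross validation; both substitutions are for illustration only:

```python
import numpy as np

def auc(scores, y):
    # Wilcoxon-Mann-Whitney estimate of the AUC
    pos, neg = scores[y == 1], scores[y == -1]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def centroid_scores(X, y):
    # stand-in "classifier": project onto the difference of class means
    w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    return X @ w

def delta_auc(X, y):
    """D-AUC weight of each feature: AUC drop when that feature is removed."""
    base = auc(centroid_scores(X, y), y)
    drops = []
    for j in range(X.shape[1]):
        Xj = np.delete(X, j, axis=1)          # retrain without feature j
        drops.append(base - auc(centroid_scores(Xj, y), y))
    return np.array(drops)

# feature 0 carries the label, feature 1 is small noise
X = np.array([[1.0, 0.01], [1.2, -0.02], [0.9, 0.0],
              [-1.0, 0.02], [-1.1, -0.01], [-0.8, 0.01]])
y = np.array([1, 1, 1, -1, -1, -1])
drops = delta_auc(X, y)
```

The loop makes the cost visible: one full training run per feature, which is exactly why this strategy is infeasible for thousands of features.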
Challenge Datasets
Dataset  Feature type  # Features  # Training  # Testing
REGED    numerical          999         500      20,000
SIDO     binary           4,932      12,678      10,000
CINA     mixed              132      16,033      10,000
MARTI    numerical        1,024         500      20,000
LUCAS    binary              11       2,000      10,000
LUCAP    binary             143       2,000      10,000
Preprocessing (1/2)
Scale each feature of REGED and CINA to [0, 1]
Apply the same scaling parameters to both the training and testing sets
SIDO, LUCAS, and LUCAP: binary data with values in {0, 1}
Each instance is normalized to unit length
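The two preprocessing steps might look like the sketch below (array names are hypothetical); the key point is that the test set is transformed with parameters computed on the training set only:

```python
import numpy as np

def minmax_fit(X_train):
    # scaling parameters come from the training set only
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_apply(X, lo, hi):
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (X - lo) / span

def unit_length(X):
    # instance-wise normalization for the binary datasets
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)

# toy example
X_train = np.array([[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]])
X_test = np.array([[1.0, 25.0]])
lo, hi = minmax_fit(X_train)
train_scaled = minmax_apply(X_train, lo, hi)
test_scaled = minmax_apply(X_test, lo, hi)   # reuses training lo/hi
```

Note that test values may fall slightly outside [0, 1] under this scheme; that is expected when the distributions differ.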
Preprocessing (2/2)
The training set of MARTI is perturbed by noise
Use Gaussian filtering to eliminate the low-frequency noise
Rearrange the 1024 features into a 32×32 array
Neighboring positions are similarly affected
Spatial Gaussian filter:

g(x₀) = (1 / G(x₀)) ∑_x exp(−(1/2)(‖x − x₀‖/σ)²) f(x),  where G(x₀) = ∑_x exp(−(1/2)(‖x − x₀‖/σ)²)

- f′ = f − g is the feature value we use
Training and testing sets are scaled to [-1,1] separately
- A remaining bias value is removed after the Gaussian filtering
- This yields better performance
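A direct, unoptimized implementation of the filter above on a square grid of features; the grid size and σ below are placeholders rather than the values used in the actual submissions:

```python
import numpy as np

def gaussian_smooth(f, sigma=2.0):
    """g(x0) = (1/G(x0)) * sum_x exp(-0.5 (||x - x0|| / sigma)^2) f(x),
    computed densely over all grid positions."""
    n = f.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-0.5 * d2 / sigma ** 2)
    g = (W @ f.ravel()) / W.sum(axis=1)   # G(x0) normalizes each row
    return g.reshape(n, n)

# toy 8x8 feature grid: random noise plus a constant low-frequency bias
f = np.random.default_rng(0).normal(size=(8, 8)) + 5.0
f_prime = f - gaussian_smooth(f)          # f' = f - g, the values we would use
```

Smoothing a constant field returns the same constant, so subtracting g strips the slowly varying component while keeping the local structure.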
Classifier and Model Selection
L2-loss linear SVM
LIBLINEAR [Hsieh et al., 2008]
http://www.csie.ntu.edu.tw/~cjlin/liblinear
Grid search to determine the penalty parameter C for linear SVM
min_{w,b}  (1/2) wᵀw + C ∑_{i=1}^{l} ξ(w, b; x_i, y_i)
Prediction value: decision value from linear SVM
f(x) = sgn(wᵀx + b)
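The grid search over C can be made concrete as below. The actual submissions rely on LIBLINEAR; this sketch re-implements a tiny gradient-descent L2-loss SVM only so the cross-validation loop is runnable end to end, and the toy data, grid, and fold count are all illustrative choices:

```python
import numpy as np

def train_l2svm(X, y, C, lr=0.002, iters=2000):
    # tiny gradient-descent solver for (1/2) w.w + C sum max(1 - y f(x), 0)^2
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        m = 1.0 - y * (X @ w + b)
        a = m > 0
        w -= lr * (w - 2.0 * C * ((y * m * a) @ X))
        b -= lr * (-2.0 * C * np.sum(y * m * a))
    return w, b

def cv_accuracy(X, y, C, folds=3):
    # k-fold cross-validation accuracy for a given penalty parameter C
    idx = np.arange(len(y))
    accs = []
    for k in range(folds):
        test = idx % folds == k
        w, b = train_l2svm(X[~test], y[~test], C)
        accs.append(np.mean(np.sign(X[test] @ w + b) == y[test]))
    return float(np.mean(accs))

# toy two-cluster data (centers made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(scale=0.2, size=(60, 2))
X[:30] += 1.0   # positive class around (1, 1)
X[30:] -= 1.0   # negative class around (-1, -1)
y = np.array([1.0] * 30 + [-1.0] * 30)

grid = [2.0 ** p for p in (-5, -3, -1, 1)]   # exponentially spaced C values
best_C = max(grid, key=lambda C: cv_accuracy(X, y, C))
```

Exponentially spaced grids are standard because the objective is far more sensitive to the order of magnitude of C than to its exact value.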
Experiments
We do not use information of
- the manipulated features in REGED and MARTI
- the 25 calibrant features in MARTI
We do not learn from testing sets
Nested subsets of feature sizes {1, 2, 4, 8, ..., max #}
Three methods to compare different strategies
Quartile information
Cross-validation AUC
Testing AUC (Tscore) on toy examples LUCAS and LUCAP
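Generating the nested subsets from a feature ranking is straightforward; the ranking below is a hypothetical list of six feature indices, most relevant first:

```python
def nested_subsets(ranking):
    """Nested feature subsets of sizes 1, 2, 4, ..., doubling up to all
    features; each subset keeps the top-ranked features first."""
    n = len(ranking)
    sizes, k = [], 1
    while k < n:
        sizes.append(k)
        k *= 2
    sizes.append(n)
    return [ranking[:s] for s in sizes]

# hypothetical ranking of six feature indices, most relevant first
subsets = nested_subsets([4, 1, 0, 3, 2, 5])
```

Because each subset is a prefix of the next, a single ranking yields the whole family of candidate feature sets at no extra cost.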
Comparisons of methods (1/3)
Five-fold cross validation AUC of the best feature size
Dataset   REGED          SIDO             CINA           MARTI
F-score   0.9998 (16)    0.9461 (2,048)   0.9694 (128)   0.9210 (512)
W         1.0000 (32)    0.9552 (512)     0.9710 (64)    0.9632 (128)
D-AUC     1.0000 (16)    –                0.9699 (128)   0.9640 (256)
D-ACC     0.9998 (64)    –                0.9694 (132)   0.8993 (32)
SVM       0.9963 (999)   0.9447 (4,932)   0.9694 (132)   0.8854 (1,024)
F-score: F-score
W: Linear SVM weight
D-AUC: Change of AUC with/without each feature
D-ACC: Change of ACC with/without each feature
SVM: Linear SVM without feature ranking
D-AUC and D-ACC are infeasible for SIDO due to the large number of features
Comparisons of methods (2/3)
Testing AUC on toy examples
Dataset   LUCAS 0   LUCAS 1   LUCAS 2   LUCAP 0   LUCAP 1   LUCAP 2
F-score   0.9208    0.8989    0.7446    0.9702    0.8327    0.7453
W         0.9208    0.8989    0.7654    0.9702    0.9130    0.9159
D-AUC     0.9208    0.8989    0.7654    0.9696    0.8648    0.8655
D-ACC     0.9208    0.8989    0.7446    0.9696    0.7755    0.6011
W performs best
On LUCAP 1 and LUCAP 2: W > D-AUC > F-score > D-ACC
Comparisons of methods (3/3)
The top ranked features given by W and D-AUC are similar
CINA (132)
W : [121, 18, 79, 25, 9, 27, 41, 71]
D-AUC: [121, 18, 79, 25, 100, 27, 41, 9]
REGED (999)
W : [593, 251, 321, 409, 453, 930, 425, 344]
D-AUC: [593, 930, 251, 453, 409, 344, 321, 425]
Final Submission (1/2)
Feature ranking using linear SVM weights
Dataset   Fnum          Fscore   Tscore   Top Ts   Max Ts   Rank
REGED 0   16/999        0.8526   0.9998   1.0000   1.0000
REGED 1   16/999        0.8566   0.9556   0.9980   0.9980
REGED 2   8/999         0.9970   0.8392   0.8600   0.9543
mean                             0.9316                     1
SIDO 0    1,024/4,932   0.6516   0.9432   0.9443   0.9467
SIDO 1    4,096/4,932   0.5685   0.7523   0.7532   0.7893
SIDO 2    2,048/4,932   0.5685   0.6235   0.6684   0.7674
mean                             0.7730                     2
CINA 0    64/132        0.6000   0.9715   0.9788   0.9788
CINA 1    64/132        0.7053   0.8446   0.8977   0.8977
CINA 2    4/132         0.7053   0.8157   0.8157   0.8910
mean                             0.8773                     1
MARTI 0   256/1,024     0.8073   0.9914   0.9996   0.9996
MARTI 1   256/1,024     0.7279   0.9209   0.9470   0.9542
MARTI 2   2/1,024       0.9897   0.7606   0.7975   0.8273
mean                             0.8910                     3
Final Submission (2/2)
The best feature size from cross validation is quite consistent with Fnum
Fnum is small on REGED, CINA 2, and MARTI 2
On CINA 2, our method might benefit from good feature ranking:
Fnum is only four, while the Tscore is the best among all submissions
Discussions
Our linear SVM classifier is good
Excellent performance on version 0
The CV results and the true testing AUC are not very consistent
D-ACC is the worst in cross validation but not in the final results
We benefit from the nested-subset submission
Our ranking method is good enough for this to yield better performance
How about combining our feature ranking with other SVM kernels or other classifiers?