Prediction model building and feature selection with SVM in breast cancer diagnosis Cheng-Lung Huang, Hung-Chang Liao, Mu-Chen Chen Expert Systems with Applications 2008

Prediction model building and feature selection with SVM in breast

cancer diagnosis

Cheng-Lung Huang, Hung-Chang Liao, Mu-Chen Chen

Expert Systems with Applications 2008

Introduction

Breast cancer is a serious problem for the young women of Taiwan.

About 64.1% of women with breast cancer are diagnosed before the age of 50, and 29.3% before the age of 40.

However, the causes are still unknown.

Introduction

A study (Ziegler et al., 1993) showed that fibroadenoma shares some risk factors with breast cancer.

HSV-1 (herpes simplex virus type 1)
EBV (Epstein-Barr virus)
CMV (cytomegalovirus)
HPV (human papillomavirus)
HHV-8 (human herpesvirus-8)

Introduction

DNA viruses are closely related to human cancers and are regarded as high-risk factors.

To investigate the relationship between DNA viruses and breast tumors, this paper uses support vector machines (SVM) to find the pertinent bioinformatics.

Two Important Challenges

When using SVM, two problems are confronted:

How to choose the optimal input feature subset for SVM.

How to set the best kernel parameters.

These two problems are crucial because the feature subset choice influences the appropriate kernel parameters and vice versa.

Feature Selection

Feature selection is an important issue in building classification systems.

It is advantageous to limit the number of input features in a classifier in order to obtain a model that predicts well and is less computationally intensive.

This study uses F-score calculation to select input features.

F-Score
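
The formula on this slide did not survive transcription. As a minimal sketch, assuming the F-score definition commonly used for SVM feature selection (squared distance of each class mean from the overall mean, divided by the sum of the within-class sample variances), one feature could be scored as below. The function name and the toy values are hypothetical and are not taken from the study's data.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of a single feature.

    pos_values / neg_values: the feature's values in the positive and
    negative class (each needs at least two samples).
    Assumed definition: between-class separation of the means divided by
    the sum of the within-class sample variances (n - 1 denominator).
    """
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)
    # A feature that is constant inside both classes has within == 0;
    # this sketch lets that case raise ZeroDivisionError.
    return between / within


# Toy values only: a well-separated feature scores higher than an overlapping one.
separated = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
overlapping = f_score([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
print(separated, overlapping)
```

A larger F-score suggests the feature discriminates better between the two classes, although it scores each feature on its own and ignores interactions between features.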

F-Score Algorithm
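
The algorithm figure for this slide is missing from the transcript. One plausible reading, offered only as an assumption, is that features are ranked by F-score and nested subsets of growing size are then evaluated with the classifier. The sketch below builds such subsets; the feature names and scores are placeholders, not the study's values.

```python
def nested_subsets(scores):
    """Rank features by score (highest first) and return subsets of size 1..n.

    scores: dict mapping feature name -> F-score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[:size] for size in range(1, len(ranked) + 1)]


# Placeholder names and scores, for illustration only.
toy_scores = {"feature_a": 0.5, "feature_b": 0.1, "feature_c": 0.9}
for subset in nested_subsets(toy_scores):
    print(subset)
```

Each subset would then be passed to the SVM training step, and the subset with the best validation accuracy kept.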

Parameters Optimization

To design an SVM, one must choose a kernel function, set the kernel parameters, and determine the soft-margin constant C.

Grid search is one way to find the best C and gamma when using the RBF kernel function.

This study uses grid search to find the best SVM model parameters.

Grid-Search Algorithm
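
The diagram for this slide is not in the transcript. As a rough sketch of the idea, grid search tries every (C, gamma) pair on a grid and keeps the pair with the best validation score. The exponent ranges below are an assumption (a commonly suggested power-of-two grid), and `score_fn` is a hypothetical stand-in for "train an RBF SVM with these parameters and return its cross-validated accuracy".

```python
import math
from itertools import product


def grid_search(score_fn, c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 4, 2)):
    """Evaluate score_fn(C, gamma) on a power-of-two grid; return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = score_fn(c, gamma)
        if score > best_score:  # ties keep the first pair found
            best_params, best_score = (c, gamma), score
    return best_params, best_score


def toy_score(c, gamma):
    """Hypothetical score surface that peaks at C = 2**3, gamma = 2**-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_params, best_score = grid_search(toy_score)
print(best_params, best_score)
```

In practice the scoring function is the expensive part, since every grid point requires training and validating a model.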

Data collection

The source of 80 data points (tissue samples):

52 specimens of non-familial invasive ductal breast cancer.

28 mammary fibroadenomas.

(From Chung-Shan Medical University Hospital)

Data partition

The data set is randomly partitioned into training and independent testing sets via stratified 5-fold cross-validation.
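
A stratified split can be sketched as follows. This is an illustrative implementation, not the authors' code: the class counts (52 and 28) come from the data collection slide, while the label strings, the seed, and the round-robin dealing are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = ["cancer"] * 52 + ["fibroadenoma"] * 28
folds = stratified_folds(labels)

# Each fold serves once as the testing set; the other four form the training set.
for test_indices in folds:
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    print(len(train_indices), len(test_indices))
```

With 52 and 28 samples, each testing fold holds 10 or 11 cancer samples and 5 or 6 fibroadenoma samples, so the class ratio stays close to that of the full data set.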

SVM-based parameter optimization and feature selection

The relative feature importance with F-score

The relative importance of DNA virus based on the F-score

The five feature subsets based on the F-score

Overall training and testing accuracy for each feature subset

Type I and type II errors

Type I errors (the "false positive"): the error of rejecting the null hypothesis given that it is actually true

Type II errors (the "false negative"): the error of failing to reject the null hypothesis given that the alternative hypothesis is actually true
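
For a two-class classifier, these error types can be counted directly from predictions. The sketch below assumes one class is designated "positive"; which class plays that role is a modelling choice that the transcript does not state, and the toy labels are made up for illustration.

```python
def error_rates(actual, predicted, positive):
    """Return (type_i_rate, type_ii_rate) for a two-class problem.

    Type I  (false positive): an actual negative predicted as positive.
    Type II (false negative): an actual positive predicted as negative.
    """
    pairs = list(zip(actual, predicted))
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    negatives = sum(a != positive for a in actual)
    positives = len(actual) - negatives
    return false_pos / negatives, false_neg / positives


# Toy labels only: 1 is treated as the positive class.
toy_actual = [1, 1, 1, 0, 0]
toy_predicted = [1, 0, 1, 1, 0]
type_i, type_ii = error_rates(toy_actual, toy_predicted, positive=1)
print(type_i, type_ii)
```

In a diagnostic setting the two error types usually carry different costs, which is why they are reported separately rather than folded into a single accuracy figure.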

Detailed testing accuracy for feature subsets of size 2 and 3

Linear discriminant analysis (LDA)

Originally developed in 1936 by R.A. Fisher, discriminant analysis is a classic method of classification.

Discriminant analysis can be used only for classification.

Linear discriminant analysis finds a linear transformation ("discriminant function") of the two predictors, X and Y, that yields a new set of transformed values that provides a more accurate discrimination than either predictor alone:

Transformed Target = C1*X + C2*Y
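
One way to obtain C1 and C2 is Fisher's rule, which multiplies the difference of the class means by the inverse of the pooled within-class scatter matrix. The sketch below applies it to two predictors; the function name and the toy points are hypothetical, and it is not claimed to be the procedure used in the study.

```python
from statistics import mean


def fisher_coefficients(class_a, class_b):
    """Return (C1, C2) for two classes of (x, y) points using Fisher's rule."""

    def centre(points):
        return mean(p[0] for p in points), mean(p[1] for p in points)

    def scatter(points, c):
        sxx = sum((x - c[0]) ** 2 for x, y in points)
        sxy = sum((x - c[0]) * (y - c[1]) for x, y in points)
        syy = sum((y - c[1]) ** 2 for x, y in points)
        return sxx, sxy, syy

    centre_a, centre_b = centre(class_a), centre(class_b)
    scatter_a, scatter_b = scatter(class_a, centre_a), scatter(class_b, centre_b)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (scatter_a[i] + scatter_b[i] for i in range(3))
    det = sxx * syy - sxy * sxy  # zero if the scatter matrix is singular
    dx, dy = centre_a[0] - centre_b[0], centre_a[1] - centre_b[1]
    # Inverse of the 2x2 scatter matrix applied to the mean difference.
    c1 = (syy * dx - sxy * dy) / det
    c2 = (sxx * dy - sxy * dx) / det
    return c1, c2


# Toy points only: two clusters separated along the X axis.
toy_a = [(3.0, 0.0), (5.0, 0.0), (4.0, 1.0), (4.0, -1.0)]
toy_b = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
c1, c2 = fisher_coefficients(toy_a, toy_b)
print(c1, c2)
```

For these toy clusters all of the separation lies along X, so the weight falls entirely on C1.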

The P-level of each attribute for LDA

Selection criteria: P-level value < 0.05

Training and testing accuracy for LDA

Comparison summary between SVM and LDA

Conclusion

To find the correlation of DNA viruses with breast tumors and to achieve high classification accuracy:

F-score is used to find the important features.

Grid search is used to find the optimal SVM parameters.

The results revealed that the SVM-based model has good performance in diagnosing breast cancer on our data set.

Conclusion

The present study’s results also show that the attribute subsets {HSV-1, HHV-8} and {HSV-1, HHV-8, CMV} achieve the same high accuracy, an average overall hit rate of 86%.

This study suggests that considering HSV-1 and HHV-8 together is feasible; considering only HHV-8 or only HSV-1 is less accurate.

Future Work

A practical obstacle of SVM-based (as well as neural network) classification models is their black-box nature.

A possible solution is the use of SVM rule extraction techniques, or of a hybrid SVM model combined with other, more interpretable models.

Thank You