Microarrays: A Comparison of Classification and Feature
Selection Algorithms for Interpretation
Lynn H. Lee, Hiram Shaish, Eric A. Smith, Min C. Zhang
Responsibilities

• Lynn Lee studied and described the classification methods, and performed all the experiments that use KNN as the classification method.
• Hiram Shaish studied and described the background of microarrays, and compiled and analyzed the experimental results.
• Eric Smith programmed, tested, and described the data parser.
• Min Zhang studied and described the feature selection methods, and performed all the experiments that use SVM as the classification method.
• Each team member contributed to the writing and editing process.
The Parser

• Written in Perl
• 100 lines of code, plus 90 lines of comments and blank lines
• 2 phases:
  – Parse the SOFT headers to generate some of the ARFF headers
  – Parse the SOFT matrix, generating the rest of the ARFF headers and the ARFF matrix
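The original parser is in Perl and is not reproduced in the slides; the two-phase idea can be sketched in Python under a simplified, assumed SOFT layout (header lines prefixed with `!`, `^`, or `#`, then a tab-separated table delimited by `!dataset_table_begin`/`!dataset_table_end`). The field positions and the omission of a class attribute are simplifications, not the authors' implementation:

```python
# Minimal sketch of the two-phase SOFT -> ARFF conversion described above.
# Assumes a simplified SOFT layout; the real parser (in Perl) also emits a
# class attribute and more ARFF headers derived from the SOFT headers.

def soft_to_arff(soft_lines, relation="microarray"):
    header, matrix = [], []
    in_matrix = False
    for line in soft_lines:
        line = line.rstrip("\n")
        if line.startswith("!dataset_table_begin"):
            in_matrix = True
        elif line.startswith("!dataset_table_end"):
            in_matrix = False
        elif in_matrix:
            matrix.append(line.split("\t"))
        elif line.startswith(("!", "^", "#")):
            header.append(line)  # phase 1: collect SOFT header lines

    # Phase 2: first table row names the samples; remaining rows are genes.
    samples = matrix[0][2:]                 # skip ID_REF / IDENTIFIER columns
    genes = [row[0] for row in matrix[1:]]
    arff = ["@RELATION " + relation]
    arff += ["@ATTRIBUTE %s NUMERIC" % g for g in genes]
    arff.append("@DATA")
    for i, _ in enumerate(samples):         # one ARFF data row per sample
        values = [row[2 + i] for row in matrix[1:]]
        arff.append(",".join(values))
    return arff
```

Note the transposition: SOFT stores genes as rows and samples as columns, while ARFF wants one row per sample with one attribute per gene.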
The Data

• 75 samples
• 22,215 genes
• 3 classes: smokers, non-smokers, and those who quit smoking
• Easy phenotype to verify
• Caveats?
Feature Selection

• Info Gain
• Chi Square
• 1, 2, 5, 10, 20, 50, 100, 200, 300, 400, and 500 features selected
• Results: almost identical features selected by both algorithms
• Reflects the 'partitionability' of the data set
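The first of the two filter methods above, information gain, can be sketched as ranking each gene by how much knowing its (discretized) expression value reduces class entropy; chi-square ranking works analogously with a different score. This is an illustrative sketch, not the authors' code, and it assumes expression values are already binned:

```python
# Rank genes by information gain: H(class) - H(class | gene).
# Expression values are assumed pre-discretized (e.g. 'hi'/'lo' bins).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Entropy of the class labels minus their entropy conditioned on one gene."""
    base = entropy(labels)
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return base - cond

def top_k_features(X, y, k):
    """X: one list of discretized values per gene; return indices of the k best genes."""
    ranked = sorted(range(len(X)), key=lambda i: info_gain(X[i], y), reverse=True)
    return ranked[:k]
```

Because both info gain and chi-square score each gene independently against the class labels, strongly class-separating genes dominate both rankings, which is consistent with the near-identical feature sets reported above.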
Classification

• ECOC
• KNN
• Paired the 2 classification algorithms with the 2 feature selection algorithms
• Results:
  – KNN 'out-classifies' ECOC with fewer features (70% accuracy with a single feature)
  – Highest accuracy depends on the feature selection algorithm
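The pairing of a feature selector with KNN can be sketched as follows: classify each sample by the majority label among its k nearest neighbors, measured only over the selected genes. This is a generic KNN sketch (names and distance choice are illustrative), not the authors' implementation, and ECOC is not shown:

```python
# Minimal KNN over the selected genes: each row of train_X holds one sample's
# expression values for the top-ranked features only.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training samples nearest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```

With a single strongly class-separating gene, a classifier this simple can already do well, which fits the slide's observation that KNN reached 70% accuracy with only one selected feature.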
Classification

• Accuracy does not increase beyond a maximum potential, regardless of the number of features selected
• Suggests an inherent characteristic of the data