International Journal of Pure and Applied Mathematics, Volume 119 No. 18 2018, 2147-2166, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/, Special Issue


A Study on Classification Algorithms for Predicting Colon Cancer using Gene Tissue Parameters

Aditya Tekur1, Prerna Jain2

Department of Information Technology,

SRM Institute of Science and Technology, Chennai, India.

1 email: [email protected]

2 email: [email protected]

ABSTRACT

Cancer is a class of diseases characterized by uncontrolled cell growth. Computer-aided diagnosis now helps the medical field detect the onset of cancer at an earlier stage. This paper presents a comparative study of several classification models for predicting colon cancer, with the aim of identifying whether a person with the given parameters is likely to have colon cancer. Existing research shows that classifying gene expression data for colon cancer is an arduous task. This study uses gene expression data from 62 samples of colon epithelial tissue, of which 40 are tumorous and 22 are normal. The Waikato Environment for Knowledge Analysis (WEKA 3.8) tool was used to classify the dataset with 10-fold cross-validation. Predominant genes, which are highly correlated with colon cancer, were obtained using feature selection methods with filter and wrapper approaches in order to improve classification accuracy. The results indicate that Naive Bayes is the best predictor, reaching the highest accuracy of 93.6%, followed by Logistic, Decision Table and Hoeffding Tree with 90.3% each; the Bagging classifier had the lowest accuracy, 67.7%, among the algorithms used in this paper.

Keywords: cancer prediction; machine learning; classification; computer aided

diagnosis.

I. INTRODUCTION

Colon cancer is a form of cancer that affects the large intestine. In many cases, it starts with the development of small, non-cancerous clumps of cells called adenomatous polyps. Over time, some of these polyps become cancerous. The polyps can vary in size and may produce symptoms through which detection can take place. Inherited gene mutations increase the risk of this cancer, as they can be passed down through a family, but these inherited genes do not always result in colon cancer.

Histopathological examination of a tissue specimen is a common method used to locate and classify colon cancer. In an alternative method of colon cancer detection, pathologists examine the various parameters that may cause changes in the cell structure. The tissue distribution and changes in the cell structure help in determining the abnormal region in the specimen, if any [1].

This method of examining specimens is very tedious for the histopathologist, besides being expensive and subjective. It often leads to variability between observers [3].

Research is ongoing in the field of computer-aided diagnosis, and the growing use of machine learning algorithms has helped medical diagnosis considerably. Computers can assist doctors in diagnosing certain types of disease, which can help save a patient's life. Early prediction of diseases such as cancer helps the medical field provide better diagnosis and take preventive measures.

In this research, we apply various classification algorithms to a data set of gene expression levels in tissue samples to predict the occurrence of colon cancer in a human being. The training and testing data are formed using 10-fold cross-validation.

The remainder of the paper is organized as follows. Section II describes related work in this field. Section III deals with the methodology. The results are analyzed in Section IV. Section V concludes the paper.

II. RELATED WORK

Cancer is one of the deadliest diseases to have affected humanity; roughly one-sixth of deaths worldwide are due to cancer. Generally, mutations in gene structure change the composition of a gene, which eventually causes the cancerous growth of cells. If we could identify the gene whose change eventually led to the cell turning cancerous, we could provide better treatment to cancer patients. Hence, utilizing gene expression profiles is a vital step towards integrating the complex genomic information that is unique and, in many ways, specific to an individual patient.

To deliver a reliable prediction, a suitable approach is required that gives high classification accuracy, which in turn depends on an efficient validation strategy. The main procedure of a gene expression data classification task includes a feature selection stage and a pattern classification stage [22]. Feature selection selects a list of genes that may be informative for tumour prediction. At the prediction stage, the pattern classifier decides which class a gene pattern belongs to.

Oligonucleotide arrays give a snapshot of the condition of a cell by measuring the expression levels of many genes at the same time. It is important to develop strategies for extracting useful information from the resulting data sets. Alon et al. [16] applied an efficient two-way clustering method to a set of gene expression data from 22 normal and 40 tumour colon tissues. This uncovered broad coherent patterns that suggest a high degree of organization underlying gene expression in the tissues.

On the basis of the genes selected above, Zhang [23] summarized gene sets using a recursive partitioning tree. Liu Jin Quan [24] used a floating search algorithm on colon cancer gene expression data. Yu and Liu [17] used a fast correlation-based filter algorithm that uses the degree of correlation to eliminate redundancy and retain significant genes.

Yang Jhang [18] proposed SVM-RBF-RFE, a wrapper selection technique that computes the weight of each feature. This method was able to identify most of the significant genes related to colon cancer. Xing et al. [19] put forward a hybrid of the filter and wrapper methods.

Shutao et al. [20] proposed a gene selection method for cancer classification that combines a genetic algorithm (GA) with an SVM. The Wilcoxon rank sum test was used to filter out redundant genes. A final subset of highly discriminative genes was obtained by analysing how often each gene appeared in the different subsets. Shen et al. [21] put forward a combination of particle swarm optimisation (PSO) and SVM: informative genes were extracted by applying PSO, and SVM was used as the classifier. In this process, a t-test was applied to filter the data.

To improve on the method of Shutao et al., the GA/SVM implementation was modified by combining it with cross-validation [22]. Zhang Ya [25] used the K-means technique to extract 22 informative genes. To select the master gene, SVM was used for classification, which reached a maximum accuracy of 86.4%.

III. METHODOLOGY

A. DATA SOURCE:

The colon cancer dataset by Alon [16], which is frequently used as a benchmark, was chosen for the comparative study of the various algorithms. The dataset consists of 62 samples of colon epithelial cells, of which 40 are tumorous and 22 are normal. These tissue samples were collected from patients affected by colon cancer. The “tumour” biopsies were extracted from the tumorous part of the colon, whereas the “normal” biopsies were extracted from a healthy part.

High-density oligonucleotide arrays were used to measure the gene expression levels in the 62 samples. Of the 6000 genes measured, 2000 were selected based on the confidence in the measured expression levels. The raw data includes two more files, one with the tissue data and the other with the gene names.

The dataset is available at:

http://genomicspubs.princeton.edu/oncology/affydata/index.html

B. PREDICTION MODELS USED:

| Prediction Model | Explanation |
|---|---|
| Bayes Net | Probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph. |
| Naïve Bayes | A classifier based on Bayes' theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. |
| Logistic Algorithm | Logistic regression; the model is represented by an equation, much like linear regression. |
| SGD | Stochastic gradient descent for learning various linear models. |
| Simple Logistic | A classifier for building linear logistic regression models. |
| SMO | Sequential minimal optimization, an algorithm for training a support vector machine (SVM). The trained model assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. |
| Voted Perceptron | An algorithm for linear classification that combines Rosenblatt's perceptron algorithm with the leave-one-out method. |
| IBk | The k-nearest-neighbour algorithm, an instance-based method used for classification and regression. |
| K Star | An instance-based classifier: the class of a test instance is based on the classes of the training instances similar to it, as determined by a similarity function. |
| LWL | Locally weighted learning; a non-parametric method in which each prediction is made by local functions that use only a subset of the data. |
| AdaBoost M1 | A general ensemble method that creates a strong classifier from a number of weak classifiers. |
| Attribute Selected Classifier | The dimensionality of the training and test data is reduced by attribute selection before the data is passed on to a classifier. |
| Bagging | Bootstrap aggregating: base classifiers are trained on bootstrap samples and their predictions are combined. Bootstrapping is the process of drawing samples from the original sample and using them to estimate various statistics or model accuracy. |
| Classification via Regression | Classification using regression methods; a single regression model is constructed for each class value. |
| Random Committee | Builds an ensemble of randomizable base classifiers. |
| Randomizable Filtered Classifier | A simple variant of the filtered classifier that instantiates the model with a randomizable filter and a base classifier. |
| Decision Table | Builds and uses a simple decision table as its classifier. The minimum number of instances is 1. |
| JRip | Implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER). |
| Decision Stump | A machine learning model consisting of a one-level decision tree. |
| Hoeffding Tree | An incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating the examples does not change over time. |
| J48 | Generates a pruned or unpruned C4.5 decision tree [11]. A depth-first approach is used to grow the tree. |
| LMT | Builds 'logistic model trees', which are classification trees with logistic regression functions at the leaves. |
| Random Forest | An easy-to-use, flexible machine learning algorithm that produces good results most of the time, even without hyper-parameter tuning. |
| Random Tree | A supervised classifier that builds a tree considering a randomly chosen subset of attributes at each node. |
| REP Tree | Uses regression tree logic and creates multiple trees in different iterations. |

C. IMPLEMENTATION:

Fig 1a: Flow Chart Diagram


This section describes the prediction process, which consists of five main phases: pre-processing (data cleansing), pre-selection, feature selection, classification and validation.

a. Pre-processing phase

This phase is also referred to as the data cleansing step, and it is among the most important steps in any data mining application. Exploratory data analysis was performed to understand the dataset and prepare it for mining. The original data, the I2000 matrix (M×N), consists of gene expression data for 62 samples (N) over 2000 genes (M). This data can be obtained in ARFF format from http://csse.szu.edu.cn/staff/zhuzx/Datasets.html. Of the 2000 genes, 92 were found to be redundant and were filtered out, leaving 1908 genes. Since the goal of this project is to develop efficient models for the prediction of colon cancer, a binary dependent variable representing the class of the tissue, “tumour” or “normal”, is created. Each sample is thus labelled according to whether it came from a tumour or a normal biopsy.

b. Pre-selection phase

This phase uses the information gain attribute evaluator, which evaluates the worth of each attribute by measuring its information gain with respect to the class [22]. The search method then ranks the attributes by their individual evaluations, sorted from highest to lowest information gain. During this selection process, the gene features considered most discriminatory are extracted. For this paper, the top 130 genes are selected for the classification process. In general, this phase aims at reducing the dimensionality of the dataset.
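The ranking idea of the pre-selection phase can be illustrated with a minimal Python sketch. It computes the information gain of each gene with respect to the class label and keeps the top-ranked genes. The function names (`entropy`, `info_gain`, `binarize`, `rank_genes`), the median-threshold discretization and the toy data are assumptions made for illustration only; they are not taken from the paper and do not necessarily match how the WEKA evaluator treats numeric attributes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of a discrete attribute: H(class) - H(class | attribute)."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def binarize(column):
    """Assumed discretization: 1 if the value reaches the (upper) median, else 0."""
    ordered = sorted(column)
    median = ordered[len(ordered) // 2]
    return [1 if x >= median else 0 for x in column]

def rank_genes(columns, labels, top_k):
    """Indices of the top_k genes, ordered by decreasing information gain."""
    gains = [(info_gain(binarize(col), labels), idx) for idx, col in enumerate(columns)]
    gains.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by gene index
    return [idx for _, idx in gains[:top_k]]

# Toy data: 3 hypothetical genes measured on 8 samples (4 tumour, 4 normal).
labels = ["tumour"] * 4 + ["normal"] * 4
columns = [
    [5.0, 6.0, 1.0, 2.0, 7.0, 8.0, 3.0, 4.0],  # gene 0: unrelated to the class
    [9.0, 8.0, 7.0, 6.0, 1.0, 2.0, 3.0, 4.0],  # gene 1: separates the classes
    [9.0, 8.0, 7.0, 1.0, 6.0, 2.0, 3.0, 4.0],  # gene 2: partly informative
]
print(rank_genes(columns, labels, top_k=2))
```

In the setting described above, `top_k` would be 130 and `columns` would hold the 1908 genes left after pre-processing.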

c. Feature selection phase

This phase allows learning algorithms to operate faster and more effectively. In order to achieve high accuracy, we search for consistent patterns to select the predominant genes out of the large number of initial gene features. The objective is to find an optimal relevant subset of attributes (genes). In addition to improving accuracy, this makes it easier to obtain a representation of the target class.

The classifier subset evaluator (feature selection with a wrapper) is used together with best first search (BFS) to achieve this. We choose the classifier for which we require the informative genes and perform the BFS.

If there are n attributes initially, 2^n possible subsets can be formed. The only way to guarantee finding the best subset would be to try them all, which is impractical for large n, so a heuristic search is used.

The best first search starts with an empty subset and generates all single-attribute expansions. The highest-evaluated subset is chosen and expanded in the same manner. If expanding a subset leads to no improvement, the search backtracks to the next best unexpanded subset and continues from there. In this way the search space is explored, and the best subset found is returned when the search terminates [26].
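The search procedure described above can be sketched in Python as follows. This is a simplified illustration, not WEKA's implementation: the function `best_first_search`, the stopping parameter `max_stale` (stop after a number of consecutive non-improving expansions) and the scoring function `toy_score` are hypothetical. In a real wrapper, `evaluate` would return the cross-validated accuracy of the chosen classifier on the given subset of genes.

```python
import heapq

def best_first_search(n_attributes, evaluate, max_stale=5):
    """Forward best-first search over attribute subsets.

    evaluate:  callable mapping a frozenset of attribute indices to a score
               (higher is better).
    max_stale: assumed stopping rule: stop after this many consecutive
               expansions that do not improve the best score.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    # heapq is a min-heap, so scores are negated; the sorted tuple breaks ties.
    open_list = [(-best_score, (), start)]
    visited = {start}
    stale = 0
    while open_list and stale < max_stale:
        _, _, subset = heapq.heappop(open_list)    # best unexpanded subset
        improved = False
        for attr in range(n_attributes):           # single-attribute expansions
            if attr in subset:
                continue
            child = subset | {attr}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_list, (-score, tuple(sorted(child)), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score

# Hypothetical score standing in for classifier accuracy: attributes 1 and 3
# are useful, and every other attribute costs a small penalty.
def toy_score(subset):
    return 0.5 + 0.2 * len(subset & {1, 3}) - 0.01 * len(subset - {1, 3})

subset, score = best_first_search(5, toy_score)
print(sorted(subset), round(score, 2))
```

Because unexpanded subsets stay in the priority queue, the search can return to an earlier candidate when the current branch stops improving, which is the backtracking behaviour described in the text.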

The above procedure selects 5-8 genes, varying from classifier to classifier. The reduced data can be saved in ARFF format and used for the subsequent classification process.

d. Classification phase

Classification, an instance of supervised learning, involves identifying the category to which an observation belongs. Our objective is to classify the samples of the colon cancer dataset as tumorous or normal using various classifiers such as support vector machine (SVM), Random Forest and BayesNet. The reduced data obtained in the feature selection phase, containing the informative genes corresponding to a particular algorithm, is loaded into the WEKA tool and passed through the classifier. In the training phase, the learning algorithm finds patterns in the input that map the data attributes to the target class, and outputs an ML model that captures these patterns. This model is tested using the cross-validation technique explained in the following section.

e. Validation phase

Researchers tend to use k-fold cross-validation to reduce the bias associated with random sampling of the training and hold-out data when comparing the predictive accuracy of two or more methods. The predictive models are evaluated by splitting the original samples into training and testing sets. The samples are divided into k subsets (folds), each containing approximately the same number of samples. The classification model is trained and tested k times: each time it is trained on k-1 folds and tested on the remaining fold. In this way, each fold acts as the validation set exactly once. Since empirical studies suggest that 10 is a good value for k, we use 10-fold cross-validation.
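The fold construction described in this phase can be sketched in a few lines of Python. The helper names `k_fold_indices` and `cross_validate`, the fixed random seed and the `train_and_test` callback are assumptions for illustration; the sketch also uses plain random folds and does not attempt to keep the tumour/normal ratio equal across folds.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices 0..n_samples-1 into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    return [indices[i::k] for i in range(k)]   # fold i takes every k-th index

def cross_validate(n_samples, train_and_test, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results.

    train_and_test: callable (train_idx, test_idx) -> number of correctly
                    classified test samples; a placeholder for a real classifier.
    """
    folds = k_fold_indices(n_samples, k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        correct += train_and_test(train_idx, test_idx)
    return correct / n_samples

folds = k_fold_indices(62, k=10)
print([len(f) for f in folds])   # fold sizes for the 62 samples
```

With 62 samples and k = 10, two folds hold 7 samples and the other eight hold 6, so every sample is used for testing exactly once and for training nine times.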


IV. RESULT ANALYSIS

After running the implementation procedure for the selected classification algorithms, the results of each algorithm are compared with those of the other algorithms in the same class of classifier, and the algorithm with the highest accuracy in each class is selected. The selected algorithms are then compared with one another to identify the best algorithm for the detection of colon cancer.

Before the results, a brief explanation of the confusion matrix, accuracy and sensitivity is given.

Confusion Matrix: A confusion matrix tabulates the actual classes against the classes predicted by a classifier. Its size depends on the number of class labels. With two labels, the confusion matrix in our case is a 2x2 matrix.

| | Classified as tumorous | Classified as normal |
|---|---|---|
| Tumorous | TP | FN |
| Normal | FP | TN |

TP: True Positive, FN: False Negative, FP: False Positive, TN: True Negative

TP and TN denote the numbers of instances correctly classified as tumorous and normal respectively. FP and FN denote the numbers of instances wrongly classified as tumorous and normal respectively.

Accuracy: The accuracy can be calculated with the help of the formula given below.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

Sensitivity: The sensitivity can be calculated as follows.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
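As a worked example, the short Python sketch below applies the two formulas to the confusion matrix reported for NaiveBayes in Table 1a (TP = 39, FN = 1, FP = 3, TN = 19). The function names `metrics` and `weighted_averages` are hypothetical. The second function computes class-weighted precision and recall; the values 0.936 and 0.935 listed for NaiveBayes in Table 1a appear to agree with these weighted averages rather than with the two formulas applied to the tumour class alone. That reading is an inference from the numbers, not something stated in the text.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy and sensitivity from a 2x2 confusion matrix (tumour = positive)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity

def weighted_averages(tp, fn, fp, tn):
    """Class-weighted precision and recall, each class weighted by its size."""
    n_pos, n_neg = tp + fn, fp + tn
    total = n_pos + n_neg
    # Precision per class: correct predictions / all predictions of that class.
    precision = (n_pos * (tp / (tp + fp)) + n_neg * (tn / (tn + fn))) / total
    # Recall per class: correct predictions / all actual members of that class.
    recall = (n_pos * (tp / n_pos) + n_neg * (tn / n_neg)) / total
    return precision, recall

# Counts from the NaiveBayes row of Table 1a.
tp, fn, fp, tn = 39, 1, 3, 19
accuracy, sensitivity = metrics(tp, fn, fp, tn)
w_precision, w_recall = weighted_averages(tp, fn, fp, tn)
print(round(accuracy, 3), round(sensitivity, 3))   # formulas from the text
print(round(w_precision, 3), round(w_recall, 3))   # class-weighted averages
```

Note that the class-weighted recall is algebraically equal to the accuracy, which may explain why the Sensitivity column matches the percentage of correctly classified instances throughout the result tables.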

The classifiers selected from the WEKA tool are grouped by category below. In the confusion matrices, T denotes tumour and N denotes non-tumour.


BAYES

Table 1a: Table of Results of Bayes Classifier

| Name of Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| BayesNet | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| NaiveBayes | 39 1 / 3 19 | 93.5484 % | 0.936 | 0.935 |

From the results above, the Naive Bayes algorithm has higher accuracy and sensitivity than BayesNet. Naive Bayes classifies about 93.5% of the instances correctly, compared with about 90.3% for BayesNet. Hence it can be concluded that Naive Bayes is the better algorithm in terms of accuracy.

Fig 1b: Bar Graph for Bayes Classifier

FUNCTIONS

Table 2a: Results for Function Classifier

| Name of Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| Logistic | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| SGD | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| Simple Logistic | 36 4 / 5 17 | 85.4839 % | 0.854 | 0.855 |
| SMO | 37 3 / 5 17 | 87.0968 % | 0.870 | 0.871 |
| Voted Perceptron | 33 7 / 5 17 | 80.6452 % | 0.812 | 0.806 |

In this class of classifiers, the Logistic algorithm has the highest accuracy, 0.903, together with SGD, which has the same value. Voted Perceptron has the lowest accuracy and sensitivity of the 5 algorithms. Logistic and SGD also classify approximately 90% of the instances correctly, the highest among the algorithms in this class. As the accuracies of Logistic and SGD are the same, either could be selected; Logistic is selected.

Fig 2b: Bar Graph for Functions Classifier


LAZY

Table 3a: Table of Results for Lazy Classifier

| Name of Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| IBk | 35 5 / 7 15 | 80.6452 % | 0.804 | 0.806 |
| K Star | 36 4 / 7 15 | 82.2581 % | 0.820 | 0.823 |
| LWL | 37 3 / 8 14 | 82.2581 % | 0.823 | 0.823 |

The K Star and LWL algorithms classify the same percentage of instances correctly, approximately 82%, and both have the same sensitivity. LWL is chosen over K Star because it has the higher accuracy.

Fig 3b: Bar Graph for Lazy Classifiers


META

Table 4a: Table of Results for Meta Classifiers

| Name of the Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| AdaBoost M1 | 39 1 / 6 16 | 88.7097 % | 0.893 | 0.887 |
| Attribute Selected Classifier | 39 1 / 6 16 | 88.7097 % | 0.893 | 0.887 |
| Bagging | 30 10 / 10 12 | 67.7419 % | 0.677 | 0.677 |
| Classification Via Regression | 36 4 / 6 16 | 83.871 % | 0.837 | 0.839 |
| Random Committee | 29 11 / 8 14 | 69.3548 % | 0.704 | 0.694 |
| Randomizable Filtered Classifier | 29 11 / 8 14 | 69.3548 % | 0.704 | 0.694 |

Among the above algorithms, AdaBoostM1 and Attribute Selected Classifier have the highest percentage of correctly classified instances. These two algorithms also have the highest accuracy and sensitivity of the algorithms in this class. Bagging has the lowest values of all the algorithms. AdaBoostM1 is selected.


Fig 4b: Bar Graph for Meta Classifiers

RULES

Table 5a: Table of Results for Rules Classifier

| Name of the Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| Decision Table | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| JRip | 36 4 / 6 16 | 83.871 % | 0.837 | 0.839 |

From the results table, the Decision Table algorithm has higher accuracy and sensitivity than the JRip algorithm. Decision Table is selected.


Fig 5b: Bar Graph for Rules Classifier

TREES

Table 6a: Table of Results for Trees Classifier

| Name of the Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| Decision Stump | 39 1 / 8 14 | 85.4839 % | 0.867 | 0.855 |
| Hoeffding Tree | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| J48 | 39 1 / 6 16 | 88.7097 % | 0.893 | 0.887 |
| LMT | 36 4 / 5 17 | 85.4839 % | 0.854 | 0.855 |
| Random Forest | 29 11 / 8 14 | 69.3548 % | 0.704 | 0.694 |
| Random Tree | 29 11 / 8 14 | 69.3548 % | 0.704 | 0.694 |
| REP Tree | 38 2 / 12 10 | 77.4194 % | 0.786 | 0.774 |

From the above table of results, the Hoeffding Tree algorithm has the highest accuracy and sensitivity of all the algorithms in this class. The J48 algorithm comes second with an accuracy of 0.893. RandomForest and RandomTree have the lowest accuracy and sensitivity. The Hoeffding Tree algorithm is selected.

Fig 6b: Bar Chart for Trees

BEST OF ALL CLASSIFIERS:

Table 7: Table of Results of the Best Classifiers

| Name of the Algorithm | Confusion Matrix (columns T, N; rows separated by /) | Correctly Classified Instances | Accuracy | Sensitivity |
|---|---|---|---|---|
| NaiveBayes | 39 1 / 3 19 | 93.5484 % | 0.936 | 0.935 |
| Logistic | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| LWL | 37 3 / 8 14 | 82.2581 % | 0.823 | 0.823 |
| AdaBoost M1 | 39 1 / 6 16 | 88.7097 % | 0.893 | 0.887 |
| Decision Table | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |
| Hoeffding Tree | 38 2 / 4 18 | 90.3226 % | 0.903 | 0.903 |

Among the selected algorithms, it can be concluded that Naive Bayes is the best classification algorithm, as it has the highest percentage of correctly classified instances and the highest accuracy and sensitivity.

Fig 7: Bar chart for Best Classifiers


V. CONCLUSION AND FUTURE WORK:

This paper studied different existing classification algorithms for predicting the chance of colon cancer. The prediction is based on training different classifiers on gene parameters to classify samples as tumorous or non-tumorous. The work carried out here shows that classification algorithms such as Naïve Bayes, logistic regression and decision trees provide better accuracy than the other algorithms studied. In future, this work can be extended to neural networks and deep neural networks with the aim of achieving better accuracy.

REFERENCES

[1]. Madeeha Naiyar, Yousra Asim, Aqsa Shahid “Automated colon cancer detection using structural and morphological features” 2015.

[2]. Francesco Archetti, Mauro Castelli, Ilaria Giordani, Leonardo Vanneschi “Classification of colon tumor tissues using genetic programming” 2010.

[3]. G.D. Thomas, M.F. Dixon, N.C. Smeeton “Observer Variation in the Histological Grading of Rectal Carcinoma”, Journal of Clinical Pathology, Vol 36, no 4, pp. 385-391, 1983.

[4]. C. Demir and B. Yener “Automated cancer Diagnosis based on Histopathological Images: A systematic survey” 2009.

[5]. Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, Ian H. Witten “Using Model Trees for Classification” 1998.

[6]. http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/DecisionTable.html

[7]. http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/RandomCommittee.html

[8]. https://www.eecs.yorku.ca/tdb/_doc.php/userg/sw/weka/doc/weka/classifiers/rules/JRip.html

[9]. Iba, Wayne; and Langley, Pat (1992); “Induction of One-Level Decision Trees”, in ML92: Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland, 1–3 July 1992, San Francisco, CA: Morgan Kaufmann, pp. 233–240.

[10]. http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/HoeffdingTree.html

[11]. Rausheen Bal, Sangeeta Sharma “Review on Meta Classification Algorithms using WEKA” 2016.

[12]. http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/LMT.html

[13]. Sumner, Marc, Eibe Frank, and Mark Hall (2005) “Speeding up logistic model tree induction” PKDD. Springer. pp. 675–683.

[14]. Sushil Kumar Rameshpant Kalmegh “Comparative Analysis of WEKA Data Mining Algorithm Random Forest, Random Tree and LAD Tree for Classification of Indigenous News Data” 2015.

International Journal of Pure and Applied Mathematics Special Issue

2164

Page 19: A Study on Classification Algorithms for Predicting Colon ... · classification of the gene expression data set for colon cancer has been realized as an arduous task. This study makes

[15]. Sushil Kumar Kalmegh “Analysis of WEKA data mining Algorithm REP

Tree, Simple Cart and Random Tree for Classification of Indian News”

2015

[16]. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and

Levine, A. J. (1999). “Broad Patterns of Gene Expression Revealed by

Clustering Analysis of Tumor and Normal Colon Tissues Probed by

Oligonucleotide Arrays”.

[17]. Yu, L. and Liu, H. (2004). “Efficient Feature Selection via Analysis of

Relevance and Redundancy. Journal of Machine Learning Research”.

[18]. Quanzhong Liu, Chihau Chen, Yang Zhang, Zhengguo Hu. “Feature

selection for support vector machines with RBF kernel” 2011.

[19]. Xing, E. P., Jordan, M. I. and Karp, R. M. (2001). “Feature Selection for

High-dimensional Genomic Microarray Data”.

[20]. Shutao Li, Xixian Wu, Xiaoyan Hu. ”Gene selection using genetic

algorithm and support vectors machines” 2008.

[21]. Shen, Q., Min, W., Kong, S. W. and Xian, B. Y. (2007). “A Combination of

Modified Particle Swarm Optimization Algorithm and Support Vector

Machine for Gene Selection and Tumor Classification”.

[22]. Zuraini Ali Shah, Puteh Saad, Razib M. Othman. “Feature Selection for

Classification of Gene Expression Data”.

[23]. Zhang H, Yu C Y, et al. “Recursive Partioning for Tumor Classification

with Gene Expression Microarray Data”.

[24]. Liu Jinjin, Lin Yinxin et al. “Informative Genes Selection for Colon

Tumor Based on Gene Expression Profiles”. Journal of

KunmingUniversity of Science and Technology(Science and Technology),

2006.

[25]. Zhang Ya, Rao Nini et al. “A Feature Selection Method for Colon Tumor

Based on Gene Expression Profiles”. Space Medicine and Medical

Engineering, 2008.

[26]. Mark A. Hall, Lloyd A. Smith. “Practical Feature Subset Selection for

Machine Learning”.

[27]. https://www.cs.waikato.ac.nz/~ml/weka/
