Experimental and Comparative Analysis of Machine Learning Classifiers
© 2013, IJARCSSE All Rights Reserved Page | 955
Volume 3, Issue 10, October 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com
Experimental and Comparative Analysis of Machine Learning Classifiers
Abstract: Classification methods have been rapidly adopted in a variety of different fields, including medicine, banking and finance, social science, and political and economic science, to classify available data that may contain many different attributes and would be difficult to classify manually. As people generate more data every day, there is a need for classifiers that can classify newly generated data accurately and efficiently. This paper mainly focuses on the supervised learning technique called Random Forests, classifying data by changing the values of different hyperparameters of the Random Forests classifier to obtain accurate classification results. This paper also presents an experimental comparison of the Random Forests classifier with state-of-the-art supervised learning techniques such as NB (Naïve Bayes), C4.5 and ID3 (Iterative Dichotomiser 3) with respect to their accuracy in terms of correctly classified instances, incorrectly classified instances and, very importantly, ROC Area, which helps in understanding a classification model and its results. This can also help other researchers decide on the selection of a classification model based on their data and number of attributes.
Keywords: Data Mining, Machine Learning, Classifiers, Random Forests.
I. INTRODUCTION
In Data Mining there are mainly two techniques available for data analysis, known as Data Classification and Data Prediction [2]. Classification techniques are mainly used to predict discrete class labels for new observations on the basis of a training data set provided to the classifier algorithm, while prediction techniques generally work with continuous-valued functions. Classification techniques have been used in many different fields such as Computer Vision [3], Text Classification [4], Fraud Detection [2], Sentiment Analysis [5] and many others. This paper focuses on the use of supervised classification techniques, which work with two things: a training set, the collection of data that is already classified, and a testing set, the collection of data whose class labels are to be determined based on the training set. This paper will focus on four different classification algorithms: 1. Naïve Bayes, 2. Decision Tree Learning ID3 (Iterative Dichotomiser 3), 3. Decision Tree Learning C4.5 (an extension of ID3), 4. Random Forests.
This paper is organized into six sections. Section I discusses the introduction and the usage of Machine Learning classification techniques, Section II discusses the approaches of the four classifier techniques, Section III is the literature survey, Section IV demonstrates the experiments and results, followed by the experimental evaluations in Section V, and the conclusion in Section VI.
II. UNDERSTANDING THE SUPERVISED MACHINE LEARNING APPROACH
This section deals with a basic understanding of the four algorithms mentioned above, with their advantages and disadvantages. The supervised machine learning approach provides very good results in terms of accuracy if more data is available for training the classifier algorithm before testing new input data. It is often said that the more data there is for training, the more accurate the results will be [2]. Supervised machine learning approaches have their own advantages and disadvantages [7], which are described below.
Advantages:
Often provide more accurate results than human-driven data analysis.
Can analyze very large amounts of data, which is certainly impossible for any human.
Disadvantages:
Need a large amount of training data for accurate results.
Impossible to get results that are perfectly accurate.
All the classifiers mentioned above are described here with an introduction and their working, followed by the strengths, weaknesses and research issues of each classifier.
Mr. Hitesh H. Parmar, P.G. Student, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India
Prof. Glory H. Shah, Asst. Professor, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India
Hitesh et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October 2013, pp. 955-963
Naïve Bayes Classifier
Strength:
Relatively fast.
Easy to run and easy to understand.
Automatic variable selection.
Good at missing value handling.
Weakness:
Sharp decision boundaries.
Model tends to evolve around the strongest effect.
Doesn't support pruning.
Research Issue:
The Naïve Bayes algorithm is good at dealing with features that are completely independent, and sometimes performs surprisingly well on features that are dependent as well, so there is a need for deep study of the data characteristics that really affect, or can affect, the performance of the Naïve Bayes algorithm (Rish et al. 2001).
Decision Tree Classifier: ID3
Strength:
Relatively fast.
Easy to run and easy to understand.
Automatic variable selection.
Good at missing value handling.
Weakness:
Sharp decision boundaries.
Model tends to evolve around the strongest effect.
Doesn't support pruning.
Research Issue:
The larger a Decision Tree grows, the poorer the accuracy results it will return; researchers can work on algorithms which produce decision trees that are small in both size and depth while still providing good accuracy results (Kothari et al. 2001).
A. Naïve Bayes Classifier
The Naïve Bayes classifier is a very widely used classifier and is also known as a state-of-the-art technique for many different applications, which makes this classifier useful and accurate in providing results (Zhang et al. 2004). It is also known as a probabilistic classifier because it uses the concept of Bayes' theorem, named after its founder Thomas Bayes, to classify data under strong independence assumptions. The working of the Bayesian classifier depends on the presence or absence of a particular feature of the class, and does not depend on the presence or absence of the other features (Amiri et al. 2013). The Naïve Bayes algorithm can be used for binary as well as multi-label classification. Results provided by Bayesian classifiers are very comparable to approaches like Decision Trees [10]. The strengths and weaknesses of the Naïve Bayes classifier are shown above.
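To make the independence assumption concrete, the posterior computation can be sketched in plain Python for categorical features. This is a toy illustration, not the Weka implementation evaluated later; the function names and the Laplace-smoothing choice are ours.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and per-feature value counts from categorical rows."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    cond = defaultdict(int)            # cond[(i, v, c)]: feature i == v within class c
    values = defaultdict(set)          # distinct values seen per feature index
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, v, c)] += 1
            values[i].add(v)
    return priors, cond, class_counts, values

def predict(row, priors, cond, class_counts, values, alpha=1.0):
    """Pick the class maximizing prior * product of Laplace-smoothed likelihoods."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(row):
            score *= (cond[(i, v, c)] + alpha) / (class_counts[c] + alpha * len(values[i]))
        if score > best_score:
            best, best_score = c, score
    return best
```

Note how each feature contributes an independent factor to the score: that product is exactly the "naïve" independence assumption discussed above.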
B. Decision Tree Classifier: ID3
The Decision Tree classifier ID3 (Iterative Dichotomiser 3) was developed by (Quinlan et al. 1986). The classifier uses a tree structure to classify the given data into a number of different classes based on the training data. The structure is mainly divided into two parts, nodes and branches, and there are two things in the tree that play a very important role in classifying the data: the root node, from which every instance starts and travels to a leaf node based on its feature values, and the leaf node, which contains the actual class label to be determined. Every single node in a decision tree represents a feature that helps in the classification of an instance, and each branch represents a value of a node [10]. The ID3 algorithm is good at dealing with categorical attributes [2]. When dealing with multiple attributes in the decision tree, the split point is computed using a measure from information theory called Information Gain (Hunt et al., 1966), which serves as the attribute selection method for the ID3 algorithm. ID3 does not guarantee an optimal solution to a problem; it is greedy in nature and settles for a local optimum.
The advantages, disadvantages and research issues of ID3 are mentioned above in terms of strength, weakness and research issues.
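The Information Gain criterion that ID3 uses for attribute selection can be sketched as follows; this is a minimal illustration and the helper names are ours.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Reduction in entropy obtained by splitting on one categorical feature.
    ID3 greedily picks the feature with the highest value of this measure."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```

A feature that separates the classes perfectly yields a gain equal to the full entropy of the labels, while an irrelevant feature yields a gain of zero.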
Random Forests
Strength:
Good at dealing with outliers due to randomness.
Provides good accuracy.
Weakness:
Difficult to select the hyperparameters.
Time consuming.
Research Issue:
Multiple parameters are required to be set while dealing with Random Forests; to get good accuracy results those parameters must be set manually, so researchers can focus on optimizing the algorithm to automate the tuning of those parameters.
Decision Tree Classifier: C4.5
Strength:
Handles training data with missing values.
Prunes trees after their creation.
Weakness:
Not good when dealing with continuous data values [12].
Trees created from numeric data sets can be complex.
Research Issue:
One of the problems with C4.5 is its high memory usage while generating rule sets for the given data [35].
C. Decision Tree Classifier: C4.5
The Decision Tree classifier C4.5 was developed by (Quinlan et al. 1993) as an improved version of the previously developed classifier algorithm ID3, since there were a few problems with the ID3 algorithm: (Tom et al. 1997) shows that for an ID3-generated tree, as the number of nodes grows, the accuracy increases on the training data but decreases on unseen test cases; this is called overfitting the data. The C4.5 classifier algorithm overcomes this problem, and uses the Gain Ratio as its attribute selection method. The C4.5 algorithm also provides pruning of the generated tree, which was not possible in ID3; in the pruning operation all irrelevant nodes are eliminated, which results in a reduction of tree size. The strengths, weaknesses and research issues of the C4.5 algorithm are mentioned above.
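C4.5's Gain Ratio normalizes Information Gain by the entropy of the split itself, which penalizes attributes with many distinct values (a known bias of plain Information Gain). A self-contained sketch, with helper names ours:

```python
import math
from collections import Counter

def _entropy(counts):
    """Entropy, in bits, of an iterable of category counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def gain_ratio(rows, labels, feature_index):
    """Information Gain divided by Split Information: C4.5's selection measure."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / n * _entropy(Counter(part).values())
                    for part in partitions.values())
    gain = _entropy(Counter(labels).values()) - remainder
    split_info = _entropy([len(part) for part in partitions.values()])
    return 0.0 if split_info == 0 else gain / split_info
```

For a binary feature splitting four examples perfectly, gain and split information are both 1 bit, so the ratio is 1.0; a feature with a unique value per example also achieves full gain but has split information of 2 bits, so its ratio drops to 0.5.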
D. Random Forests
Random Forest classifiers are ensemble classifiers, a combination of many Decision Tree classifiers, developed by (Breiman et al. 2001). They work on the concept of generating multiple random trees from bootstrap samples of the training dataset, and there are other things that must be set for a random forest before execution of the algorithm, such as the number of trees in the forest, the depth of each tree, the number of samples for bagging [13], and the number of features for splitting a node. The main advantage of using Random Forests is their randomness: the model does not depend too heavily on any particular data, so it is good at dealing with outliers. Random Forests are good at dealing with high-dimensional data, and their performance and accuracy results are very promising and comparable with some state-of-the-art techniques [13].
One of the best things that Random Forests add is randomness to the data during classification, and due to this randomness each tree is highly uncorrelated with the other random trees. If the trees were correlated they would not give proper accuracy results, but because the trees are highly uncorrelated, when they are combined using the bagging approach the results are much improved compared to highly correlated trees that generate very similar results all the time. This shows that the Random Forests classifier can provide good accuracy and can efficiently handle outliers.
The strengths, weaknesses and research issues of Random Forests are mentioned in the figure above. Random Forests were introduced in 2001, and many researchers have since used them in different application fields including medical research and image processing, and many competitions have been won by making use of Random Forests [36].
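The bootstrap-and-vote mechanics described above can be sketched generically; the tree learner itself is abstracted away as any callable mapping a row to a class label, and all names here are ours.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw n examples with replacement: each tree in the forest
    is trained on one such resample of the training set."""
    n = len(rows)
    picks = [rng.randrange(n) for _ in range(n)]
    return [rows[i] for i in picks], [labels[i] for i in picks]

def forest_predict(row, trees):
    """Aggregate by majority vote (bagging): each fitted tree is a
    callable that maps a row to a predicted class label."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]
```

Because each tree sees a different resample (and, in a real forest, a random subset of features at each split), the trees stay largely uncorrelated, which is exactly why the majority vote improves on any single tree.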
III. LITERATURE SURVEY
There have been many experiments performed on these data mining classification techniques, and those experiments usually change from domain to domain; as (Wolpert et al. 1997) published, there is no algorithm which works best for all domains, a result also known as the "no free lunch" theorem [22]. This literature survey focuses on some of the previous work carried out by researchers in measuring the performance of and comparing different data
mining classification algorithms, and providing valid information and remarks on their performance, output and working behavior in various cases and with various parameters.
(Sharma et al. 2011) performed an experimental evaluation of various classification algorithms in Weka [38], namely C4.5, ID3 and CART [23], to determine the classification accuracy of each algorithm on emails, classifying whether those emails are spam or not. All the classification algorithms were then compared based on correctly classified instances. In their conclusion they mentioned that the classification accuracy of the C4.5 algorithm outperforms all the other classifiers, and the accuracy of the CART algorithm is also very promising and comparable to C4.5, whereas the other algorithms reported lower accuracy compared to C4.5 and CART [14].
(Yadav et al. 2012) performed an experiment to find the best classifier to work on student data and predict student performance. The experiment was conducted on a sample of 50 students pursuing an MCA at VBS Purvanchal University, Jaunpur, Uttar Pradesh, India. The data includes students' previous semester marks, class test grades, seminar performance, attendance and assignment work, and with the help of these data they tried to predict the end-semester results of these 50 students, using the same three classifiers, ID3, CART and C4.5. In the conclusion of the experiment they found that the classification accuracy of the C4.5 algorithm is much better compared to the other two classifiers [15].
(Lakshmi et al. 2013) also worked in the field of educational data mining and used the same three classifiers as (Yadav et al. 2012); their intention was very similar, the only difference being the type of data used. (Lakshmi et al. 2013) used the classifiers to predict the performance of students, collecting a sample of 120 undergraduate students with data including the students' qualification details, location details, financial support, family support and relations, etc. They tested all these samples in Weka with the three classifiers mentioned before, and from their experimental evaluations they achieved the highest accuracy with the CART algorithm, which was also comparable to the accuracy they obtained with C4.5 [16].
(Khoonsari et al. 2012) performed many different experiments on two classification approaches, ID3 and C4.5, to check the efficiency and robustness of the classifier techniques. Their experiments were carried out on nine different data sets from the UCI Repository, which contains many gold-standard data sets for machine learning; most researchers in data mining and machine learning tend to use these data sets to test the performance of their experiments against previously published research.
The data sets they used were of categorical type only, the number of instances varies from 40 to 67557, the number of different attributes varies from 4 to 4, and none of the data sets contains any missing values. At the end of their experiments they concluded that the robustness and accuracy of the classification technique C4.5 outperform the accuracy results returned by the ID3 classifier [17].
(Patel et al. 2012) demonstrated the use of Machine Learning classification techniques for a Network Intrusion Detection System. They carried out their experiments with several state-of-the-art classification techniques, including Naïve Bayes and C4.5. The data set they used was DARPA KDD99, which is known as the gold-standard data set for researchers working in Intrusion Detection and evaluation. From their experiments and evaluations, C4.5 performed better than the Naïve Bayes classifier, and they also concluded that instead of using only a single classifier for evaluation, results can be improved by combining more than one classifier so that they compensate for one another's disadvantages [18].
(Khan et al. 2008) used machine learning classification techniques such as Decision Trees for mining Oral Medicine data. They used Decision Tree classifiers to mine certain large Electronic Medical Records which contain a lot of information, and this information can be useful for teaching students of Oral Medicine. They worked with a data set containing examination records of more than 20000 patients with more than 180 different attributes, and the data set also suffers from missing values. They concluded that C4.5 performs better than the ID3 classifier on a data set that contains missing values: C4.5 avoids overfitting the data values and can better handle data with missing values compared to ID3, which is not able to perform well when there are missing values in the data [19].
(Kotsiantis et al. 2007) performed and presented a very detailed and explanatory survey of many different classification techniques, including both supervised and unsupervised approaches, among them Decision Trees and Naïve Bayes. He focuses on many of the issues of the classifiers, including algorithm selection, issues regarding supervised learning and accuracy, and implementations, and also provides information on the advantages and disadvantages of one classifier approach over another. He concluded the paper with the suggestion that researchers should not select a classifier based on whether it is better than another, but on the basis of the characteristics under which that classifier performs really well. These characteristics include the types of attributes, since some classifiers are good with numbers whereas others are good with both numeric and categorical attributes; the number of instances also plays a very important role, as some classifiers such as Naïve Bayes provide very good results when given a small data set, whereas others such as SVM provide very good accuracy on high-dimensional data. More than one method can also be combined, but this requires study of both methods so that the limitations of one method can be handled by the other. Integrating multiple methods may, however, increase the storage required and the overall computation time.
As there are many classification techniques available in machine learning, there are a number of parameters with the help of which these techniques can be compared; a few of those
parameters, which play a very important role in making the decision to select a classification technique, are described here. These parameters include the classification scheme, the specification of the data to be worked with, computation time, the ability of a classifier to deal with noise or outliers, classification accuracy, and the number of model parameters to set to get an efficient classification output. All these points are briefly described and compared below with respect to the four classification approaches. This will also help researchers in data mining and machine learning to get to know the classifiers from the perspective of different parameters.
Classifier Scheme: The classifier scheme generally refers to the way the classifiers classify data. This paper focuses on only two types of classification scheme: hierarchical, which includes ID3, C4.5 and Random Forests, and probabilistic, which includes Naïve Bayes.
Data Specifications: Some algorithms can handle high-dimensional data (more columns but very few rows) very well, e.g. Random Forests [13]. Classifiers like Random Forests, an ensemble of decision trees, are good at dealing with high-dimensional data and can perform very well with that type of data; they can also handle both categories of data, categorical as well as numerical, though a large sample size is also a key point for obtaining higher prediction accuracy. Other algorithms such as Naïve Bayes can certainly perform well when dealing with small data sets. Logic-based classifiers like Decision Trees, e.g. ID3 and C4.5, tend to perform better when used with data whose features are categorical or discrete [20].
Computational Time: It is really important that a classifier returns very promising accuracy predictions within a finite amount of time; the faster the classifier with good predictions, the better it is. In that respect Naïve Bayes is a much better approach with its short training time. The computation times of ID3 and C4.5 are also very comparable with Naïve Bayes [20], but the computation time of Random Forests is greater than that of the other three classifiers. There are a few reasons for this: Random Forests is an ensemble decision tree classifier, so it first has to produce a number of random trees, and after evaluating all those different random trees a bagging operation is performed, which provides accuracy results that are always quite comparable to state-of-the-art classification techniques [13].
Outliers: While dealing with large amounts of data it is not certain that all the data will be accurate; there may be some data usually considered noise or outliers, which must be handled by the classifiers accurately, otherwise the predictions of the classifier will be biased towards those outliers. Decision tree classifiers like ID3 and C4.5 are robust with respect to outliers in training data [26], the Naïve Bayes classifier provides very high tolerance to outliers [25], and Random Forests uses the concept of bagging, which makes it much less sensitive to noise or outliers compared to other classifier algorithms.
Accuracy: The accuracy of an algorithm depends upon what kind of data the algorithm can handle, what kind of data is given as input, and also the size of the data. ID3 often faces the problem of overfitting when dealing with large amounts of data, which affects its accuracy predictions [26]. The accuracy of Random Forests is certainly good, as there is no requirement for pruning trees and overfitting is not a problem for it, unlike for a classifier algorithm like ID3, and Random Forests can also handle categorical as well as continuous and binary attributes very well [26]. Naïve Bayes works on the assumption of independence of child nodes, which is certainly not always correct, and for that reason it is sometimes less accurate compared to other supervised algorithms [20]. Domingos & Pazzani (1997) provided great results by comparing the Naïve Bayes classifier with state-of-the-art classification techniques, including the Decision Tree induction classifier ID3, and found that Naïve Bayes sometimes performs as well as the other learning techniques [27].
Parameters: Classification accuracy can also be improved by setting the different parameters available to a classifier. The Naïve Bayes classifier works fine as-is, since it has no parameter settings. While dealing with decision trees there is a need to take care of parameters: ID3 works fine on the basis of the training data set alone, whereas in C4.5 there is a need to set the CF (confidence factor), which defaults to 25%, and MS (minimum number of split-off cases), which defaults to 2. These parameter values were suggested as defaults for the classifier by Quinlan, the inventor of C4.5 [30]. If those parameters are set to values other than the defaults, the classifier may perform surprisingly well, leading to better accuracy predictions [29], but it is difficult to find the exact parameters for a particular algorithm. So, compared to the Naïve Bayes and ID3 algorithms, in C4.5 it is necessary to deal with further parameters beyond just running the algorithm, and the same is the case with Random Forests, which has many hyperparameters compared to C4.5 that must be set, such as the number of trees to create in the forest, the depth of each tree, etc. [13].
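Finding such parameter values is essentially a search problem, and the simplest automation is an exhaustive grid search over candidate values. A generic sketch follows; the scoring function is deliberately left abstract and all names here are ours.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination of candidate parameter values and return
    the combination that receives the highest score from `evaluate`."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)   # e.g. held-out accuracy of a forest built with params
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `evaluate` would train a Random Forest with the given number of trees and features and return its validation accuracy; the grid itself would hold the candidate values for each hyperparameter.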
There has not been any survey which carries out a comparison of the supervised classifiers ID3, C4.5, Naïve Bayes and Random Forests. This paper focuses only on supervised classification techniques; the next section shows a comparison of all four above-mentioned classifiers in terms of their accuracy by correctly classified instances, incorrectly classified instances, and ROC Area, which are really important in deciding the performance of any classifier model [37].
IV. EXPERIMENTS AND RESULTS
To check the performance of the classifier algorithms, experiments were performed on five different data sets from the UCI repository, of both numeric and nominal types. The data set information is shown below.

Name of the Data Set | Number of Instances | Number of Attributes
Arrhythmia [A]       | 452                 | 280
Musk [B]             | 6598                | 169
Splice [C]           | 3190                | 62
Spect_test [D]       | 187                 | 23
Nursery [E]          | 12960               | 8
The machine learning tool Weka was used with the above-mentioned data sets in order to analyze three things: 1) correctly classified instances, 2) incorrectly classified instances and 3) ROC Area, for all four algorithms, Naïve Bayes, ID3, C4.5 and Random Forests. The data sets range from low to high dimension; the first two data sets [A] and [B] contain numeric values as well as missing values, but the ID3 algorithm cannot deal with missing values, so those missing values were replaced by applying the ReplaceMissingValues filter in Weka, and the numeric values were converted to nominal using the NumericToNominal filter available in Weka.
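For nominal attributes, the effect of Weka's ReplaceMissingValues filter (substituting each missing entry with that column's most frequent observed value) can be sketched as follows; the function name is ours.

```python
from collections import Counter

def replace_missing(rows, missing="?"):
    """Replace each missing mark with its column's most frequent observed
    value -- what Weka's ReplaceMissingValues filter does for nominal data."""
    columns = list(zip(*rows))
    modes = []
    for column in columns:
        observed = [v for v in column if v != missing]
        modes.append(Counter(observed).most_common(1)[0][0] if observed else missing)
    return [tuple(mode if v == missing else v for v, mode in zip(row, modes))
            for row in rows]
```

After this preprocessing every row is fully specified, which is what allows ID3 to be run on data sets [A] and [B] at all.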
The experiments and results are shown in three different charts: the first shows the results for correctly classified instances, the second shows the results for incorrectly classified instances, and the last shows the ROC Area results, which collectively best characterize a classification algorithm [37]. All experiments were performed by splitting the data sets into 80% for training and 20% for testing, which is generally considered a good split for supervised classifiers. All experiments were performed with Weka version 3.6.10 on the 32-bit Microsoft Windows 7 platform with 3GB of RAM.
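The 80/20 holdout used in these experiments amounts to the following; this is a generic sketch rather than Weka's exact shuffling procedure, and the names are ours.

```python
import random

def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly, then hold out the last test_fraction
    of the data for testing (80/20 with the default fraction)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])
```

Keeping the shuffle seeded makes the split reproducible, so repeated runs of a classifier are compared on the same train/test partition.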
The chart in Fig. 1 represents the result analysis of correctly classified instances. The vertical axis shows the accuracy of correctly classified instances in percentage, and the horizontal axis represents the data sets used for the experiment.
[Fig. 1: Correctly Classified Instances of Data Sets]
Fig. 2 represents the percentage results of incorrectly classified instances for all five data sets. From this chart we can say that the size of the data set also plays a very important role in the classification approach. The vertical axis shows the percentage of incorrectly classified instances, and the horizontal axis represents the data sets used for the experiment.
[Fig. 2: Incorrectly Classified Instances of Data Sets]
Fig. 3 represents the ROC (Receiver Operating Characteristic) Area results for all five data sets. A ROC curve is a plot of the True Positive Rate against the False Positive Rate, and the ROC Area provides great information on how accurate a classifier is. The vertical axis represents the ROC value, where a value of 1 means perfect classification, and the horizontal axis represents the data sets used for the experiment.
[Fig. 3: ROC Area Results for Data Sets]
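The ROC Area (AUC) also has an equivalent rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A direct pairwise sketch (function name ours):

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison: the fraction of positive/negative pairs
    where the positive instance is scored higher (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1.0 corresponds to perfect separation of the classes, 0.5 to chance-level ranking, and values below 0.5 to a classifier that systematically ranks negatives above positives.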
By performing the classification task on the five above-mentioned data sets we obtained some really interesting information with respect to performance, which is presented in the Experiment Evaluations section.
V. EXPERIMENT EVALUATIONS
After applying the classification algorithms on the data sets, the evaluation is as follows. The ID3 classifier may sometimes leave data unclassified: as seen in Fig. 1, for data sets [A] and [C] the accuracy results were very low because most of the instances were unclassified, which resulted in poor accuracy. On the other hand, for data set [B] the accuracy achieved was 100%, and for data set [E] the accuracy results were very near 100% for all three tree-based classifiers.
The performance of Naïve Bayes was also much better compared to ID3 for correctly classified instances on all five data sets; similarly, the results of Random Forests and C4.5 are very comparable, but overall Random Forests outperforms the other three classifiers for correctly classified instances on four out of five data sets. Random Forests is a classification algorithm which deals with different parameters, such as the number of random trees and the number of features to be selected, and to get better accuracy those parameters must be set correctly. All of this performance analysis was carried out using Weka, in which the Random Forests parameter for the number of random trees defaults to 10 and features are selected randomly at runtime. Those default parameter values of Random Forests in Weka do not guarantee that the classification results will always be better; sometimes changing those parameter values manually may lead to a drastic change in accuracy, and indeed using the default values provided by Weka did not help in getting good accuracy results on the first execution.
While performing these experiments, Random Forests was therefore tuned with parameter values set manually, instead of the Weka defaults, for the number of trees in the forest and the number of features, as these parameters play a very important role in its classification technique. The values that yielded good accuracy were: 10 trees and 9 random features for data set [A]; 10 trees and 130 random features for data set [B]; 50 trees and 10 random features for data set [C]; 10 trees and 10 random features for data set [D]; and 100 trees and 9 random features for data set [E]. With these parameter values Random Forests produced very promising and accurate results, including 100% accuracy on data set [B]. Setting these parameters is really a search problem that requires considerable effort. On incorrectly classified instances ID3 did not perform well either: for data sets [A] and [C] it produced roughly 10% correctly classified and 1% incorrectly classified instances, with the rest left unclassified. On this measure Random Forests again outperformed all three classifiers in four out of five data sets. On ROC Area, ID3 did not perform well compared to the other three classifiers; Naïve Bayes surprisingly performed well, with results very comparable to Random Forests, and Random Forests performed better than C4.5 on all five data sets. Overall, from all of these data and results, Random Forests performed better than the other three algorithms.
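The manual tuning described above was done through the Weka interface, but conceptually it is an exhaustive search over (trees, features) combinations. A hedged sketch of that search loop, where `evaluate` is a hypothetical stand-in for training the classifier with given parameters and reading back its accuracy:

```python
import itertools

def grid_search(evaluate, tree_counts, feature_counts):
    """Try every (trees, features) pair and return the best one.
    `evaluate(trees, features)` is assumed to train the classifier
    with those parameters and return its accuracy."""
    best = max(itertools.product(tree_counts, feature_counts),
               key=lambda tf: evaluate(*tf))
    return best, evaluate(*best)

# Made-up accuracy table purely to exercise the loop (not measured results):
fake_accuracy = {(10, 9): 0.91, (10, 130): 0.99, (50, 10): 0.95}
best, score = grid_search(lambda t, f: fake_accuracy.get((t, f), 0.0),
                          [10, 50], [9, 10, 130])
print(best, score)  # (10, 130) 0.99
```

The cost grows multiplicatively with each parameter added, which is exactly why the paper describes parameter setting as a search problem requiring great effort.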
VI. CONCLUSION AND FUTURE WORK
Classification techniques are used in many different application areas, and no single classifier performs best all the time across varied data. This paper presented an experimental comparison of four classifiers, ID3, Naïve Bayes, C4.5 and Random Forests, on five standard data sets from the UCI repository. The result analysis clearly shows that Random Forests outperforms the other three classifiers in terms of correctly classified instances, incorrectly classified instances and ROC Area. Future work will focus on optimizing the hyperparameters of Random Forests: at present they must be set manually to improve accuracy, which is time-consuming and does not always lead to a better solution, so we plan to implement an optimization technique on top of Random Forests for automatic tuning of those hyperparameters, which may lead to better accuracy results.
Hitesh et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October - 2013, pp. 955-963