Experimental and Comparative Analysis of Machine Learning Classifiers


  • 8/14/2019 Experimental and Comparative Analysis of Machine Learning Classifiers


2013, IJARCSSE All Rights Reserved. Page | 955

Volume 3, Issue 10, October 2013. ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering

Research Paper

Available online at: www.ijarcsse.com

Experimental and Comparative Analysis of Machine Learning Classifiers

Abstract: Classification methods are rapidly being adopted across a variety of fields, including medicine, banking and finance, social science, and political and economic science, to classify available data that may contain many different attributes and would be difficult to classify manually. As people generate more data every day, there is a need for classifiers that can classify newly generated data accurately and efficiently. This paper mainly focuses on the supervised learning technique called Random Forests for the classification of data, varying the values of the different hyperparameters of the Random Forests classifier to obtain accurate classification results. It also presents an experimental comparison of the Random Forests classifier with state-of-the-art supervised learning techniques such as NB (Naïve Bayes), C4.5, and ID3 (Iterative Dichotomiser 3) with respect to correctly classified instances, incorrectly classified instances, and the very important ROC Area, which helps in understanding a classification model and its results. This comparison can also help other researchers decide which classification model to select based on their data and number of attributes.

Keywords: Data mining, Machine Learning, Classifiers, Random Forests.

I. INTRODUCTION

In Data Mining, two main techniques are available for data analysis: Data Classification and Data Prediction [2]. Classification techniques are mainly used to predict discrete class labels for new observations on the basis of a training data set provided to the classifier algorithm, while prediction techniques generally work with continuous-valued functions. Classification techniques have been used in many different fields, such as Computer Vision [3], Text Classification [4], Fraud Detection [2], Sentiment Analysis [5], and many others. This paper focuses on supervised classification techniques, which work with two things: a training set, the collection of data that is already classified, and a testing set, the collection of data whose class labels are to be determined based on the training data set. This paper focuses on four different classification algorithms: 1. Naïve Bayes, 2. Decision Tree Learning ID3 (Iterative Dichotomiser 3), 3. Decision Tree Learning C4.5 (an extension of ID3), and 4. Random Forests.

This paper is organized into six sections. Section one discusses the introduction and the usage of machine learning classification techniques; section two discusses the approaches of the four classifier techniques; section three is a literature survey; section four demonstrates experiments and results, followed by the experimental evaluations in section five; and the conclusion is in section six.

II. UNDERSTANDING THE SUPERVISED MACHINE LEARNING APPROACH

This section deals with a basic understanding of the four algorithms mentioned above, with their advantages and disadvantages. A supervised machine learning approach provides really great results in terms of accuracy if more data is available for training the classifier algorithm before testing new input data. It is often said that the more data available for training, the more accurate the results will be [2]. Supervised machine learning approaches have their own advantages and disadvantages [7], which are described below.

Advantages:

Often provide more accurate results than human-driven data analysis.
Can analyze very large amounts of data, which is certainly impossible for any human.

Disadvantages:

Need a large amount of training data for accurate results.
Impossible to get results that are perfectly accurate.

All the classifiers mentioned above are described here with an introduction and their working, ending with the strengths, weaknesses, and research issues of each classifier.

Mr. Hitesh H. Parmar, P.G. Student, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India

Prof. Glory H. Shah, Asst. Professor, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India


Hitesh et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October 2013, pp. 955-963

Naïve Bayes Classifier

Strength:

Relatively fast.
Easy to run and easy to understand.
Automatic variable selection.
Good at handling missing values.

Weakness:

Sharp decision boundaries.
Model tends to evolve around the strongest effect.
Doesn't support pruning.

Research Issue:

The Naïve Bayes algorithm is good at dealing with features that are completely independent, and sometimes works surprisingly well on features that are dependent as well, so there is a need for deep study of the data characteristics that really affect, or can affect, the performance of the Naïve Bayes algorithm (Rish et al. 2001).

Decision Tree Classifier: ID3

Strength:

Relatively fast.
Easy to run and easy to understand.
Automatic variable selection.
Good at handling missing values.

Weakness:

Sharp decision boundaries.
Model tends to evolve around the strongest effect.
Doesn't support pruning.

Research Issue:

The larger a decision tree grows, the poorer the accuracy results it returns, so researchers can work on algorithms that produce decision trees that are small in both size and depth while still providing good accuracy results (Kothari et al. 2001).

A. Naïve Bayesian Classifier

The Naïve Bayes classifier is very widely used and is known as a state-of-the-art technique for many different applications, which makes it useful and accurate in providing results (Zhang et al. 2004). It is also known as a probabilistic classifier because it uses Bayes' theorem, named after its founder Thomas Bayes, to classify data under strong independence assumptions. The Bayesian classifier considers the presence or absence of a particular feature of a class independently of the presence or absence of the other features (Amiri et al. 2013). The Naïve Bayes algorithm can be used for binary as well as multi-label classification. The results provided by Bayesian classifiers are very comparable to approaches like Decision Trees [10]. The strengths and weaknesses of the Naïve Bayes classifier are shown above.
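To make the independence assumption concrete, the following is a minimal sketch of a categorical Naïve Bayes classifier with Laplace smoothing. This is an illustration of the general technique, not the Weka implementation used in the paper, and the toy weather-style attribute values are hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing `alpha`."""
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}
    counts = defaultdict(lambda: defaultdict(Counter))  # counts[feature][class][value]
    vocab = defaultdict(set)                            # distinct values seen per feature
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[i][c][v] += 1
            vocab[i].add(v)
    return priors, counts, vocab, alpha

def predict_nb(model, row):
    """Return the class with the highest posterior, computed in log space
    and treating each feature as independent given the class."""
    priors, counts, vocab, alpha = model
    scores = {}
    for c, p in priors.items():
        lp = math.log(p)
        for i, v in enumerate(row):
            col = counts[i][c]
            lp += math.log((col[v] + alpha) / (sum(col.values()) + alpha * len(vocab[i])))
        scores[c] = lp
    return max(scores, key=scores.get)
```

With smoothing, a feature value never seen for a class still receives a small nonzero probability instead of zeroing out the whole product.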

B. Decision Tree Classifier ID3

The Decision Tree classifier ID3 (Iterative Dichotomiser 3) was developed by (Quinlan et al. 1986). The classifier uses a tree structure to classify the given data into a number of different classes based on the training data. The structure is mainly divided into two parts, nodes and branches, and two elements play a very important role in classifying the data: the root node, from which every instance starts and travels toward a leaf node based on its feature values, and the leaf nodes, which contain the actual class labels to be determined. Every single node in a decision tree represents a feature that helps in classifying an instance, and each branch represents a value of a node [10]. The ID3 algorithm is good at dealing with categorical attributes [2]. When dealing with multiple attributes, the split point for the decision tree is computed using a measure from information theory called Information Gain (Hunt et al., 1966), which serves as the attribute selection method for the ID3 algorithm. ID3 does not guarantee an optimal solution to a problem; it is greedy in nature and looks for a local optimum.

The advantages, disadvantages, and research issues of ID3 are listed above in terms of strength, weakness, and research issues.
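The Information Gain criterion described above can be sketched in a few lines of Python. This is a generic illustration of the entropy-based attribute selection measure, not code from the paper's experiments.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from splitting `labels` on one categorical attribute,
    where `values` holds that attribute's value for each instance."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

ID3 greedily picks, at each node, the attribute with the highest gain; an attribute that splits the classes perfectly gains the full entropy, while an uninformative one gains zero.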


Random Forests

Strength:

Good at dealing with outliers due to randomness.
Provides good accuracy.

Weakness:

Difficulty in selecting the hyperparameters.
Time consuming.

Research Issue:

Multiple parameters are required to be set when dealing with Random Forests, and to get good accuracy results those parameters must be set manually, so researchers can focus on optimizing the algorithm to automate the tuning of those parameters.

Decision Tree Classifier: C4.5

Strength:

Handles training data with missing values.
Prunes trees after they are created.

Weakness:

Not good when dealing with continuous data values [12].
Trees created from numeric data sets can be complex.

Research Issue:

One of the problems with C4.5 is its high memory usage while generating rule sets for the given data [35].

C. Decision Tree Classifier C4.5

The Decision Tree classifier C4.5 was developed by (Quinlan et al. 1993) as an improved version of the previously developed ID3 algorithm, which had a few problems: (Tom et al. 1997) shows that for an ID3-generated tree, as the number of nodes grows, the accuracy increases on the training data but decreases on unseen test cases; this is called overfitting the data. The C4.5 classifier algorithm overcomes this problem. It uses the Gain Ratio as its attribute selection method. The C4.5 algorithm also provides pruning of the generated tree, which was not possible in ID3; in the pruning operation all irrelevant nodes are eliminated, which results in a reduction of the tree size. The strengths, weaknesses, and research issues of the C4.5 algorithm are listed above.
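The Gain Ratio mentioned above normalizes information gain by the "split information" of the attribute, penalizing attributes with many distinct values (which plain information gain favors). A minimal illustrative sketch, not the actual C4.5 source:

```python
from collections import Counter
import math

def _entropy(xs):
    """Shannon entropy (in bits) of any list of discrete symbols."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def gain_ratio(values, labels):
    """C4.5-style criterion: information gain divided by split information."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * _entropy(subset)
    gain = _entropy(labels) - remainder
    split_info = _entropy(values)  # same entropy formula, applied to the split itself
    return gain / split_info if split_info > 0 else 0.0
```

An attribute with a unique value per instance has maximal gain but also maximal split information, so its ratio drops, which is exactly the bias correction C4.5 introduces over ID3.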

D. Random Forest

Random Forest classifiers are ensemble classifiers that combine many Decision Tree classifiers. Developed by (Breiman et al. 2001), they work by generating multiple random trees from bootstrap samples of the training data set. Several other things must be set before the algorithm executes, such as the number of trees in the forest, the depth of each tree, the number of samples for bagging [13], and the number of features considered when splitting a node. The main advantage of using Random Forests is their randomness: because they do not depend too heavily on any one part of the data, they are good at dealing with outliers. Random Forests are good at dealing with high-dimensional data, and their performance and accuracy results are very promising and comparable to some state-of-the-art techniques [13].

One of the best things Random Forests add is randomness to the data while classifying, and due to this randomness each tree is highly uncorrelated with the other random trees. If the trees were correlated, the ensemble would not give proper accuracy results; but because the trees are highly uncorrelated, combining them using the bagging approach improves the results much more than combining highly correlated trees that generate very similar results all the time. This shows that a Random Forests classifier can provide good accuracy and can efficiently handle outliers.

The strengths, weaknesses, and research issues of Random Forests are listed above. Random Forests were introduced in 2001, and many researchers have started using them in a number of application fields, including medical research and image processing, and many competitions have been won using Random Forests [36].
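The bootstrap-and-vote idea described above can be sketched as follows. For brevity this uses one-level "stumps" in place of full random trees, so it illustrates bagging with per-tree random feature selection rather than Breiman's exact algorithm; the data and function names are hypothetical.

```python
import random
from collections import Counter

def stump_fit(rows, labels, feat_idx):
    """A one-level 'tree': map each value of one feature to the majority
    class observed with that value; fall back to the overall majority."""
    table = {}
    for row, c in zip(rows, labels):
        table.setdefault(row[feat_idx], Counter())[c] += 1
    default = Counter(labels).most_common(1)[0][0]
    return feat_idx, {v: cnt.most_common(1)[0][0] for v, cnt in table.items()}, default

def stump_predict(stump, row):
    i, table, default = stump
    return table.get(row[i], default)

def forest_fit(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        feat = rng.randrange(d)                      # random feature subset (size 1 here)
        forest.append(stump_fit([rows[i] for i in idx],
                                [labels[i] for i in idx], feat))
    return forest

def forest_predict(forest, row):
    """Bagging: majority vote over all trees in the forest."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]
```

Because each stump sees a different resample and a different feature, individual errors are largely uncorrelated, and the majority vote averages them out, which is the decorrelation argument made in the paragraph above.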

III. LITERATURE SURVEY

Many experiments have been performed on these data mining classification techniques, and those experiments usually change from domain to domain: (Wolpert et al. 1997) showed that there is no algorithm that works best for all domains, a result also known as the "no free lunch" theorems [22]. This literature survey focuses on some of the previous work carried out by researchers in measuring the performance of, and comparing, different data mining classification algorithms, providing valid information and remarks on the performance, output, and working behavior of the algorithms in various cases and with various parameters.

(Sharma et al. 2011) performed an experimental evaluation in Weka [38] of various classification algorithms, namely C4.5, ID3, and CART [23], to determine their classification accuracy in classifying whether emails are spam or not. The algorithms were then compared based on correctly classified instances. In their conclusion they mention that the classification accuracy of the C4.5 algorithm outperforms the other classifiers, and the accuracy of the CART algorithm is also very promising and comparable to C4.5, whereas the remaining algorithms reported less accuracy than C4.5 and CART [14].

(Yadav et al. 2012) performed an experiment to find the best classifier to work on student data and predict student performance. The experiment was conducted on a sample of 50 students pursuing an MCA at VBS Purvanchal University, Jaunpur, Uttar Pradesh, India. The data included students' previous semester marks, class test grades, seminar performance, attendance, and assignment work, and with the help of these data they tried to predict the end-semester results of these 50 students, using the same three classifiers: ID3, CART, and C4.5. In the conclusion of the experiment they found that the classification accuracy of C4.5 is much better than that of the other two classifiers [15].

(Lakshmi et al. 2013) also worked in the field of educational data mining, using the same three classifiers as (Yadav et al. 2012). Their intention was very similar; the only difference was the type of data they used. (Lakshmi et al. 2013) used the classifiers to predict student performance on a sample of 120 undergraduate students, with data including the students' qualification details, location details, financial support, and family support and relations. They tested all these samples in Weka with the three classifiers mentioned before, and from their experimental evaluations they achieved the highest accuracy with the CART algorithm, which was also comparable to the accuracy they got with C4.5 [16].

(Khoonsari et al. 2012) performed many different experiments on two classification approaches, ID3 and C4.5, to check the efficiency and robustness of these classifier techniques. Their experiments were carried out on nine different data sets from the UCI Repository, which contains many gold-standard data sets for machine learning; most researchers in data mining and machine learning tend to use these data sets to test the performance of their experiments against previously published research.

The data sets they used were of categorical type only, the number of instances varied from 40 to 67557, the number of different attributes varied from 4 to 4, and none of the data sets contained any missing values. At the end of their experiments they concluded that the robustness and accuracy of the C4.5 classification technique outperform the accuracy results returned by the ID3 classifier [17].

(Patel et al. 2012) demonstrated the use of machine learning classification techniques for a Network Intrusion Detection System. They carried out their experiments with some state-of-the-art classification techniques, including Naïve Bayes and C4.5. The data set they used was DARPA KDD99, which is known as the gold-standard data set for researchers working in intrusion detection and evaluation. From their experiments and evaluations, C4.5 performed better than the Naïve Bayes classifier, and they also concluded that instead of using only a single classifier for evaluation, results can be improved by combining more than one classifier so that they compensate for each other's disadvantages [18].

(Khan et al. 2008) used machine learning classification techniques such as Decision Trees for mining in Oral Medicine. They used Decision Tree classifiers to mine certain large Electronic Medical Records, which contain a lot of information that can be useful for teaching students of Oral Medicine. They worked with a data set containing the examination records of more than 20000 patients, with more than 180 different attributes, and the data set also suffered from missing values. They concluded that C4.5 performs better than the ID3 classifier when a data set contains missing values: C4.5 avoids overfitting the data values and can better handle data with missing values than ID3, which does not perform well when there are missing values in the data [19].

(Kotsiantis et al. 2007) performed and presented a very explanatory and detailed survey of many different classification techniques, including both supervised and unsupervised approaches, among them Decision Trees and Naïve Bayes. He focused on many issues concerning classifiers, including algorithm selection, issues regarding supervised learning and its accuracy, and implementations, and also provided information on the advantages and disadvantages of one classifier approach over another. He concluded the paper with the suggestion that researchers should not select a classifier based on whether it is better than another one, but on the basis of the characteristics under which that classifier performs really well. These characteristics include the type of attributes (some classifiers are good with numbers, whereas others are good at both numeric and categorical attributes) and the number of instances, which also plays a very important role, as some classifiers provide very good results on small data sets, like Naïve Bayes, whereas others provide very good accuracy on high-dimensional data, like SVM. More than one method can also be combined, but this requires a certain amount of study of both methods so that the limitations of one method can be handled by the other. Integrating multiple methods may, however, increase the storage requirements as well as the overall computation time.

As there are many classification techniques available in machine learning, a number of parameters are available with whose help these techniques can be compared. A few of those parameters, which play a very important role in the decision to select a classification technique, are described here: the classification scheme, the specification of the data to be worked on, computation time, the ability of a classifier to deal with noise or outliers, classification accuracy, and the number of different model parameters to set to get an efficient classification output. All these points are briefly described and compared below with respect to the four classification approaches. This will also help researchers in data mining and machine learning to get to know the classifiers from the perspective of different parameters.

Classifier Scheme: The classifier scheme generally refers to the way classifiers classify data. This paper focuses on only two types of classification scheme: Hierarchical, which includes ID3, C4.5, and Random Forests, and Probabilistic, which includes Naïve Bayes.

Data Specifications: Some algorithms can handle high-dimensional data very well, i.e., many columns but relatively few rows, e.g. Random Forests [13]. Classifiers like Random Forests, an ensemble of decision trees, are good at dealing with high-dimensional data and can perform very well on such data; they can also handle both categories of data, Categorical as well as Numerical, although a large sample size is also a key point for getting higher prediction accuracy. Other algorithms, like Naïve Bayes, can perform well when dealing with small data sets. Logic-based classifiers like Decision Trees, e.g. ID3 and C4.5, tend to perform better when used with data whose features are categorical or discrete [20].

Computational Time: It is really important that a classifier returns very promising accuracy predictions within a finite amount of time; the faster the classifier with good predictions, the better. In that respect Naïve Bayes is a much better approach, with its short computation time for training. The computation times of ID3 and C4.5 are also very comparable to Naïve Bayes [20], but with Random Forests the computation time is higher than for the other three classifiers. There are a few reasons for this: Random Forests is an ensemble decision tree classifier, so it first has to produce a number of random trees, and only after parsing all those different random trees is the bagging operation performed, which provides accuracy results that are always quite comparable to state-of-the-art classification techniques [13].

Outliers: When dealing with large amounts of data it is not certain that all the data will be accurate; some data are usually considered noise or outliers, and these must be handled accurately by the classifiers, otherwise the classifier's predictions will be biased toward those outliers. Decision tree classifiers like ID3 and C4.5 are robust with respect to outliers in training data [26], the Naïve Bayes classifier provides very high tolerance to outliers [25], and Random Forests use the concept of bagging, which makes them much less sensitive to noise or outliers than other classifier algorithms.

Accuracy: The accuracy of an algorithm depends on what kind of data the algorithm can handle, what kind of data is given as input, and also the size of the data. ID3 often faces the problem of overfitting when dealing with large amounts of data, which affects its accuracy predictions [26]. The accuracy of Random Forests, on the other hand, is certainly good, as there is no requirement for pruning trees and overfitting is not a problem for them, which is the main problem for classifier algorithms like ID3; Random Forests can also handle Categorical as well as Continuous and Binary attributes very well [26]. Naïve Bayes works on the assumption of independence of the child nodes, which is certainly not always correct, and for that reason it is sometimes less accurate than other supervised algorithms [20]. Domingos & Pazzani (1997) provided great results by comparing the Naïve Bayes classifier with some state-of-the-art classification techniques, including the Decision Tree Induction classifier ID3, and found that Naïve Bayes sometimes performs really well compared to the other learning techniques [27].

Parameters: Classification accuracy can also be improved by setting the different parameters available to a classifier. The Naïve Bayes classifier just works as-is, since there is nothing like parameter settings, but when dealing with decision trees there is a need to take care of parameters. ID3 works directly from the training data set, whereas C4.5 requires setting the parameters CF (confidence factor), which defaults to 25%, and MS (minimum number of split-off cases), which defaults to 2. These parameter values are the defaults suggested for the classifier by Quinlan, the inventor of C4.5 [30]. If those parameters are set to values other than the defaults, the classifier may perform surprisingly well, leading to better accuracy predictions [29], but it is difficult to determine the exact parameters for a particular algorithm. So compared to Naïve Bayes and ID3, C4.5 requires dealing with other parameters besides just running the algorithm, and the same is the case with Random Forests, which involve even more hyperparameters than C4.5, such as the number of trees to create in the forest and the depth of each tree [13].
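The manual tuning burden described above is what an automated search addresses. Below is a minimal exhaustive grid-search sketch; the parameter names (`n_trees`, `max_depth`) and the scoring function are hypothetical placeholders, not Weka's or any library's API.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in `grid` (a dict of parameter-name -> list of
    candidate values) and return the combination with the best score.
    `evaluate` is any function mapping a parameter dict to, e.g., a
    validation-set accuracy."""
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For a Random Forests classifier, `evaluate` would train a forest with the given number of trees and depth and return its held-out accuracy; exhaustive search is feasible here because the grids discussed are small.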

No survey so far has carried out a comparison of the supervised classifiers ID3, C4.5, Naïve Bayes, and Random Forests. This paper focuses only on supervised classification techniques; the next section shows a comparison of all four classifiers mentioned above in terms of their accuracy on correctly classified instances, incorrectly classified instances, and the ROC Area, which are really important in deciding the performance of any classifier model [37].


IV. EXPERIMENTS AND RESULTS

To check the performance of the classifier algorithms, experiments were performed on five different data sets from the UCI repository, of both numeric and nominal types. The data set information is shown below.

Name of the Data Set    Number of Instances    Number of Attributes
Arrhythmia [A]          452                    280
Musk [B]                6598                   169
Splice [C]              3190                   62
Spect_test [D]          187                    23
Nursery [E]             12960                  8

The machine learning tool Weka was used with the above data sets in order to analyze three things, 1) correctly classified instances, 2) incorrectly classified instances, and 3) ROC Area, for all four algorithms: Naïve Bayes, ID3, C4.5, and Random Forests. The data sets range from low to high dimension; the first two, [A] and [B], contain numeric values as well as missing values. Since the ID3 algorithm cannot deal with missing values, those missing values were replaced by applying the ReplaceMissingValues filter in Weka, and the numeric values were converted to nominal using the NumericToNominal filter available in Weka.

The experiments and results are shown in three different charts: the first shows results for correctly classified instances, the second shows results for incorrectly classified instances, and the last shows the ROC Area results, which collectively best characterize a classification algorithm [37]. All experiments were performed by splitting the data sets into 80% for training and 20% for testing, which is generally considered a good split for supervised classifiers. All experiments were performed with Weka version 3.6.10 on a 32-bit Microsoft Windows 7 platform with 3 GB of RAM.
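The 80/20 holdout split used in these experiments can be sketched as a simple shuffled partition (the paper used Weka's built-in percentage split; this standalone version is only for illustration):

```python
import random

def train_test_split(rows, labels, test_frac=0.2, seed=42):
    """Shuffle the indices once, then hold out the last test_frac of the
    data for testing and keep the rest for training."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])
```

Fixing the seed makes the split reproducible, which matters when comparing several classifiers on the same partition, as is done here.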

Fig. 1 presents the result analysis of correctly classified instances. The vertical axis shows the accuracy of correctly classified instances in percent, and the horizontal axis represents the data sets used for the experiment.

[Fig. 1: Correctly Classified Instances of Data Sets]

Fig. 2 presents the percentage of incorrectly classified instances for all five data sets. From this chart we can say that the size of the data set also plays a very important role in the classification approach. The vertical axis shows the percentage of incorrectly classified instances, and the horizontal axis represents the data sets used for the experiment.

[Fig. 2: Incorrectly Classified Instances of Data Sets]


Fig. 3 presents the ROC (Receiver Operating Characteristic) Area results for all five data sets. A ROC curve is a plot of the True Positive Rate against the False Positive Rate, and the ROC Area provides great information on how accurate a classifier is. The vertical axis represents the ROC value, where a value of 1 means perfect classification, and the horizontal axis represents the data sets used for the experiment.

[Fig. 3: ROC Area Results for Data Sets]
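The ROC Area can also be computed without plotting the curve: it equals the probability that the classifier scores a randomly chosen positive instance above a randomly chosen negative one (ties counting one half). A minimal sketch of that equivalent formulation:

```python
def roc_auc(scores, labels):
    """ROC Area via the pairwise-comparison formulation: the fraction of
    (positive, negative) pairs where the positive gets the higher score,
    with ties counted as half. `labels` are 1 (positive) or 0 (negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1 means perfect ranking of positives above negatives, 0.5 means the classifier ranks no better than chance, matching the interpretation of the vertical axis in Fig. 3.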

By performing the classification task on the five data sets mentioned above, we obtained some really interesting information about the performance results, which is presented in the Experiment Evaluations section.

V. EXPERIMENT EVALUATIONS

After applying the classification algorithms to the data sets, the evaluation is as follows. Sometimes the ID3 classifier may leave data unclassified: as seen in Fig. 1, for data sets [A] and [C] the accuracy results were very low because most of the instances were unclassified, resulting in poor accuracy. On the other hand, for data set [B] the accuracy achieved was 100%, and for data set [E] the accuracy results were very near 100% for all three tree-based classifiers.

The performance of Naïve Bayes was also much better than that of ID3 for correctly classified instances on all five data sets; in the same way, the results of Random Forests and C4.5 are very comparable, but overall Random Forests outperforms the other three classifiers for correctly classified instances on four out of the five data sets. Random Forests is a classification algorithm that deals with different parameters, such as the number of random trees and the number of features to select, and to get better accuracy results those parameters must be set accurately. All of this performance analysis was carried out using Weka, in which the Random Forests parameter for the number of random trees was set to 10, with features selected randomly at runtime. These default parameter values of Random Forests in Weka do not guarantee that the classification results will always be better; sometimes changing those parameter values manually can lead to a drastic change in accuracy, and indeed using the default values provided by Weka did not help in getting good accuracy results in the first execution.

    While performing these experiments with Random Forest was tuned with some different parameter values set

    manually, instead of those default values provided in Weka for number of trees in forest and number of features for

    Random Forests, as these parameter plays very important role in its classification technique. These different parametervalues of trees and features that helped to get good accuracy results are 10 trees, 9 random features for data set [A]. 10

    trees, 130 random features for data set [B]. 50 trees, 10 random features for data set [C]. 10 trees, 10 random features for

    data set [D] and 100 trees, 9 random features for data set [E] and because of these parameter values Random Forests

    provided very promising and accurate results, sometimes 100% as for data set B. Setting those parameters is really a

    search problem which requires great efforts to get good accuracy results. For incorrectly classified instances ID3 did not

    performed well, because for two data sets [A], [C] it resulted into 10% and 1% of correctly classified instances andincorrectly classified instances respectively, and rest of the instances were unclassified. For this results again Random

    Forests outperformed all three classifier in four out of five data sets. For ROC Area ID3 did not perform well compared

    to other three classifiers, Nave Bayes surprisingly performed well and results were very comparable to Random Forests.

    And compared to C4.5, Random Forests performed better in all five data sets. So overall form all these data and result

    analysis Random Forests performed well compared to other three algorithms.

    VI. CONCLUSION AND FUTURE WORKClassification techniques are being used in many different application areas, and there is no single classifier which can

    perform best all the time for variety of data. This paper focuses on experimental comparison of four different classifier

    which are ID3, Nave Bayes, C4.5 and Random Forests on five standard data sets form UCI and result analysis clearly

    shows that the accuracy results returned by Random Forests outperforms all the three classifiers in terms of correctly and

    incorrectly classified instances and ROC Area. Future work focuses on optimizing hyperparameters of Random Forests, as

    to work with Random Forests there is a need to set those hyperparameters manually to improve the accuracy results.Setting those parameters manually will be time consuming and may not lead to better solution always, so we will look

    forward to implement optimization technique on top of Random Forests for automatic tuning of those hyperparameters

    which may lead to better accuracy results.

  • 8/14/2019 Experimental and Comparative Analysis of Machine Learning Classifiers

    8/9

    H itesh et al., I nternational Journal of Advanced Research i n Computer Science and Software Engineeri ng 3(10),October - 2013, pp. 955-963

    2013, I JARCSSE All Righ ts Reserved Page | 962

    REFERENCES[1] B. Liu, Statistical Approaches to Concept-Level Sentiment Analysis, IEEE, Vol. 28, Issue 3, 2013, pp. 6-9[2] J. Han, M. Kamber, Data mining: concepts and techniques, Edition 2, 2006, pp. 258[3] R. Szeliski, Computer Vision: Algorithms and Applications. Springer-Verlag. 2002[4] B Baharudin, A Review of Machine Learning Algorithms for Text-Documents Classification, JOURNAL OF

    ADVANCES IN INFORMATION TECHNOLOGY, VOL. 1, NO. 1, pp. 4-20, FEBRUARY 2010

    [5] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? : sentiment classification using machine learningtechniques,Proceedings of the ACL-02 conference on Empirical methods in natural language processing, vol. 10,2002, pp. 79-86.

    [6] Ridgeway G, Madigan D, Richardson T (1998) Interpretable boosted naive Bayes classification. In: Agrawal R,StolorzP, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery

    and data mining.. AAAI Press, Menlo Park pp 101104.

    [7] Machine Learning Algorithm for Classification - http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf

    [8] Zhang, H., The optimality of naive bayes, Proceedings of the 17th International FLAIRS Conference 2004.[9] M. Amiri, Using Nave Bayes Classifier to Accelerate Constructing Fuzzy Intrusion Detection Systems,

    International Journal of Soft Computing and Engineering (IJSCE) ISSN: 2231-2307, Volume-2, Issue-6, January

    2013

    [10] A. S. Galathiya, Improved Decision Tree Induction Algorithm with Feature Selection, Cross Validation, ModelComplexity and Reduced Error Pruning, ) International Journal of Computer Science and Information

    Technologies, Vol. 3 (2) , pp. 3427-3431[11] Tom M. Mitchell, lecture slides for textbook Machine Learning, McGraw Hill, 1997[12] J. R. Quinlan, Improved Use of Continuous Attributes in C4.5, Journal of Artificial Intelligence Research 4

    (1996) 77-90[13] L. Breiman, Random forests. Machine Learning, vol. 45. Issue 1, 2001, pp. 5 -32.[14] Aman Kumar Sharma, Suruchi Sahni A Comparative Study of Classification Algorithms for Spam Email Data

    Analysis, International Journal on Computer Science and Engineering, May 2011 ,Vol. 3 No. 5 ,pp 1890-1895.

    [15] Surjeet Kumar Yadav, Brijesh Bharadwaj, Saurabh Pal, A Data Mining Application: A Comparative study forpredicting students Performance, International Journal of Innovative Technology and Creative Engineering,

    Vol.1 No.12 (2012) 13-19

    [16] TM Lakshmi, An Analysis on Performance of Decision Tree Algorithms using Students Qualitative Data, I.J.Modern Education and Computer Science, 2013, 5, 18-27

    [17] Payam Emami Khoonsari and AhmadReza Motie, A Comparison of Efficiency and obustness of ID3 and C4.5Algorithms Using Dynamic Test and Training Data Sets International Journal of Machine Learning andComputing, Vol. 2, No. 5, October 2012, pp. 540-543

    [18] A. Ganatra, R. Patel, A. Thakkar, A Survey and Comparative Analysis of Data Mining Techniques for NetworkIntrusion Detection Systems, International Journal of Soft Computing and Engineering (IJSCE) ISSN: 2231-2307,

    Volume-2, Issue-1, March 2012, pp. 265-271

    [19] KHAN, F. S., ANWER, R. M., TORGERSSON, O. & FALKMAN, G. Data mining in oral medicine usingdecision trees. World Academy of Science, Engineering and Technology, 37, 225-230. 2008

    [20] S. B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31 (2007)249-268 249

    [21] archive.ics.uci.edu/ml/datasets.html: UCI Machine Learning Repository: Data Sets.[22] Wolpert, D.H., Macready, W.G. (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on

    Evolutionary Computation 1, 67.

    [23] L. Breiman,Classification and Regression Trees.CRC Press, New York, 1999.[24] A. Ganatra, H. Bhavsar, A Comparative Study of Training Algorithms for Supervised Machine Learning,International Journal of Soft Computing and Engineering (IJSCE) ISSN: 2231-2307, Volume-2, Issue-4,

    September 2012

    [25] A. S. Galathiya, A. P. Ganatra, C. K. Bhensdadia, Improved Decision Tree Induction Algorithm with FeatureSelection, Cross Validation, Model Complexity and Reduced Error Pruning, International Journal of Computer

    Science and Information Technologies, Vol. 3 (2) , 2012,3427-3431

    [26] Ned Horning, Introduction to decision trees and random forests, American Museum of Natural History's Center forBiodiversity and Conservation,

    http://www.whrc.org/education/indonesia/pdf/DecisionTrees_RandomForest_v2.pdf

    [27] Domingos, P. & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss.Machine Learning 29: 103-130.

    [28] JR Beck, A Backward Adjusting Strategy and Optimization of the C4.5 Parameters to Improve C4.5sPerformance, Proceedings of the Twenty-First International FLAIRS Conference (2008), Association for the

    Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.[29] Banfield, R.E.; Hall, L.O.; Bowyer, K.W.; Kegelmeyer, W.P., "A Comparison of Decision Tree Ensemble

    Creation Techniques,"Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol.29, no.1,

    pp.173,180, Jan. 2007.

    http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdfhttp://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdfhttp://scholar.google.co.in/citations?view_op=view_citation&hl=en&user=mXSv_1UAAAAJ&citation_for_view=mXSv_1UAAAAJ:M3NEmzRMIkIChttp://www.whrc.org/education/indonesia/pdf/DecisionTrees_RandomForest_v2.pdfhttp://www.whrc.org/education/indonesia/pdf/DecisionTrees_RandomForest_v2.pdfhttp://scholar.google.co.in/citations?view_op=view_citation&hl=en&user=mXSv_1UAAAAJ&citation_for_view=mXSv_1UAAAAJ:M3NEmzRMIkIChttp://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdfhttp://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf
  • 8/14/2019 Experimental and Comparative Analysis of Machine Learning Classifiers

    9/9

    H itesh et al., I nternational Journal of Advanced Research i n Computer Science and Software Engineeri ng 3(10),October - 2013, pp. 955-963

    2013, I JARCSSE All Righ ts Reserved Page | 963

    [30] Jason R. Beck,Maria Garcia,Mingyu Zhong,Michael Georgiopoulos, andGeorgios C. Anagnostopoulos, ABackward Adjusting Strategy and Optimization of the C4.5 Parameters to Improve C4.5's Performance. FLAIRS

    Conference, page 35-40. AAAI Press, (2008)

    [31] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.[32] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.[33] Irina Rish. An empirical study of the naive bayes classifier. In IJCAI-01, workshop on Empirical Methods in AI.[34] R. Kothari and M. Dong, "Decision trees for classification: A review and some new results," in Lecture Notes in

    Pattern Recognition, S. K. Pal and A. Pal, Eds., Singapore, 2000, World Scientific Publishing Company.[35] Experiments on C4.5http://rulequest.com/see5-comparison.html[36] https://www.kaggle.com/wiki/RandomForests[37] M. Abernethy,Data mining with WEKA, Part 2: Classification and clustering, IBM developer Works,2010[38] http://www.cs.waikato.ac.nz/ml/weka/

    http://www.bibsonomy.org/author/Beckhttp://www.bibsonomy.org/author/Garciahttp://www.bibsonomy.org/author/Zhonghttp://www.bibsonomy.org/author/Georgiopouloshttp://www.bibsonomy.org/author/Anagnostopouloshttp://www.bibsonomy.org/bibtex/3b211570445d1f07f7dc42040231c7fdhttp://www.bibsonomy.org/bibtex/3b211570445d1f07f7dc42040231c7fdhttp://rulequest.com/see5-comparison.htmlhttps://www.kaggle.com/wiki/RandomForestshttps://www.kaggle.com/wiki/RandomForestshttp://rulequest.com/see5-comparison.htmlhttp://www.bibsonomy.org/bibtex/3b211570445d1f07f7dc42040231c7fdhttp://www.bibsonomy.org/bibtex/3b211570445d1f07f7dc42040231c7fdhttp://www.bibsonomy.org/bibtex/3b211570445d1f07f7dc42040231c7fdhttp://www.bibsonomy.org/author/Anagnostopouloshttp://www.bibsonomy.org/author/Georgiopouloshttp://www.bibsonomy.org/author/Zhonghttp://www.bibsonomy.org/author/Garciahttp://www.bibsonomy.org/author/Beck