
IJCSIT International Journal of Computer Science and Information Technology, Vol. 4, No. 2, December 2011, pp. 85-90

Comparison of Classification Algorithms using WEKA on Various Datasets

Bharat Deshmukh1, Ajay S. Patil2 & B.V. Pawar2

1Sinhgad Institute of Management and Computer Application (SIMCA), Pune-411041 (M.S.), India. E-mail: [email protected]

2Department of Computer Science, North Maharashtra University, Jalgaon-425001 (M.S.), India. E-mail: [email protected], [email protected]

ABSTRACT: Data mining is a step in the knowledge discovery process consisting of algorithms used to find patterns or models in data. Data mining can also be defined as an analytic process designed to explore large amounts of data in search of consistent patterns and systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. In classification techniques a model is built on training data and applied to test data. WEKA is an open source data mining tool which includes implementations of data mining algorithms. Using WEKA we have compared the ADTree, Bayes Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, RBFNetwork, PART and SMO algorithms. To compare these algorithms we have used five datasets.

Keywords: Algorithms, Data Mining, Classification

1. INTRODUCTION

Data mining [1] is a rapidly growing interdisciplinary field which merges database management, statistics, machine learning and related areas, aiming at extracting useful knowledge from large collections of data. The data mining process consists of three basic stages: exploration, model building or pattern definition, and validation/verification. Ideally, if the nature of the available data allows, the process is repeated iteratively until a "robust" model is identified. However, in business practice the options to validate the model at the analysis stage are typically limited, and thus the initial results often have the status of heuristics that may influence the decision process. Data mining can be performed with a large number of algorithms and techniques, including classification, clustering, regression, association rules, artificial intelligence, neural networks, genetic algorithms, the nearest neighbour method, etc. WEKA includes implementations of various classification algorithms such as decision trees, Naïve Bayes, ZeroR, etc. In this paper we have studied and compared the ADTree, Bayes Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBF Network and SMO algorithms using four datasets available from the UCI dataset repository and the bank dataset from Depaul University.

An ADTree (alternating decision tree) [6] is an alternative semantic representation of a decision tree which has prediction nodes at the leaves and at the root as well. The Bayes network algorithm [7] is based on Bayes' theorem; a Bayes network is a directed acyclic graphical model used to represent the conditional dependencies between a set of random variables. There are two main limitations of Bayes networks: the first is the computational difficulty of exploring a previously unknown network, and the second is the quality of the prior beliefs used in calculating the network. Decision tables [2] are used to lay out in tabular form all possible situations which a business decision may encounter and to specify which action to take in each of these situations. The decision table is one of the simplest machine learning algorithms: a tabular format whose entries map inputs to outputs. J48 [2] builds a decision tree, a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on the attribute values of the available data. Logistic [8] is a linear classifier for supervised learning with properties such as feature selection and robustness to noise. The Naïve Bayes classifier [9] works on a simple but comparatively


intuitive concept; in some cases Naïve Bayes even outperforms considerably more complex algorithms. The Naïve Bayes classifier is based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data and analyses them individually, as though they were equally important and independent of each other, and it considers each of these attributes separately when classifying a new instance. NBTree [10] is a hybrid approach which combines the capabilities of decision trees and the Naïve Bayes classifier: the decision-tree nodes contain splits as in regular decision trees, but the leaves contain Naïve Bayes classifiers. PART [11] builds partial decision trees and is an extension of C4.5 for generating rule sets: PART builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left. An RBF (radial basis function) network [12] is a variant of the neural network organized in two layers. In order to use radial basis functions we need to specify the hidden-unit activation function, the number of processing units, a criterion for modelling the given task, and a training algorithm for finding the parameters of the network. SMO (sequential minimal optimization) [13] is a training algorithm for Support Vector Machines (SVM) that addresses the problem of handling large datasets in SVM training.
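For reference, the decision rule that the Naïve Bayes classifier implements can be written compactly (a standard formulation added here for clarity, not taken from the paper): given attribute values $x_1, \ldots, x_n$, the predicted class is

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),$$

which follows from Bayes' rule once the attributes are assumed independent given the class.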

2. RELATED WORK

In [2], Xindong Wu, Vipin Kumar et al. give a descriptive study of the top ten data mining

algorithms, which include C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART. Andrew Secker, Matthew N. Davies et al. [3] compared different classification algorithms for the hierarchical prediction of protein function based on the predictive accuracy of the classifiers. In [4], the authors performed experiments on weed and crop images and datasets to test classification algorithms. In [5], Ryan Potter performed a comparison of classification algorithms on a breast cancer dataset to support the diagnosis of patients.

3. EXPERIMENTAL DETAILS

We have used the datasets available from Depaul University and the UCI Machine Learning Repository. From the WEKA GUI we have tested the ADTree (ADT), Bayes Network (BayesNet), Decision Table (DT), J48, Logistic, Naïve Bayes (NB), NBTree (NBT), PART, RBFNetwork (RBFN) and SMO algorithms on five datasets and observed the following results. We have used the bank, car, breast cancer, credit-g and diabetes datasets. Table 1 gives a brief description of each dataset. We have studied and compared these algorithms on parameters such as Correctly Classified Instances (CCI), Incorrectly Classified Instances (ICI), Kappa Statistic (KS), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

Table 1
Description of Datasets available from Depaul University and UCI repository

Data Set | Number of Attributes | Attributes | Number of Data Items | Number of Classes
Bank | 11 | age, sex, region, income, married, children, car, save-account, current-account, mortgage, pep | 600 | 2
Car | 6 | buying capacity, maintenance, number of doors, person seating capacity, luggage boot space, safety, class | 1728 | 4
Breast Cancer | 10 | class, age, menopause, tumor size, inv-nodes, nodes-cap, deg-malig, breast, breast-quad, irradiat | 286 | 2
Credit-g | 20 | checking account, month, credit history, purpose, amount, saving account, present employment, rate, sex, residence, property, age, other installments, housing, dependent, telephone, foreign worker | 1000 | 2
Diabetes | 9 | number of times pregnant, plasma glucose concentration, blood pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function, age | 768 | 2


The kappa statistic measures inter-rater agreement for categorical items, i.e., it is an index that compares the observed agreement against that which might be expected by chance. Kappa can be thought of as chance-corrected proportional agreement; possible values range from +1 (perfect agreement) through 0 (no agreement above that expected by chance) to -1 (complete disagreement). Mean absolute error measures how close predictions are to the eventual outcomes; it is the average of the absolute errors of the predictions. Root mean squared error measures the variance of the predictions; it is a frequently used measure of the differences between the values predicted by a model or an estimator and the values actually observed.
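For reference, the standard definitions of these measures (not spelled out in the paper) are, for $n$ instances with observed values $y_i$, predictions $\hat{y}_i$, observed agreement $p_o$ and chance agreement $p_e$:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$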

3.1. Bank Dataset

In the bank dataset there are 11 attributes (age, sex, region, income, married, children, car, save-account, current-account, mortgage and pep) and 600 data items, classified into two classes: the classification indicates whether or not a person will opt for a Personal Equity Plan (PEP).

Table 2
Results from WEKA for Bank Dataset

Algorithm CCI(%) ICI(%) KS MAE RMSE

ADT 84.67 15.33 0.6853 0.3350 0.3728

BayesNet 70.00 30.00 0.3862 0.3968 0.4487

DT 80.83 19.17 0.6123 0.2988 0.3750

J48 91.00 9.00 0.8178 0.1559 0.2903

Logistic 73.00 27.00 0.4518 0.3607 0.4303

Naive Bayes 69.00 31.00 0.3724 0.3773 0.4397

NBT 88.67 11.33 0.7710 0.1766 0.3194

PART 85.17 14.83 0.7003 0.1803 0.3573

RBFN 73.33 26.67 0.4585 0.3590 0.4317

SMO 70.80 29.20 0.4062 0.2917 0.5401

For the bank dataset the J48 decision tree performs best, followed by NBTree and PART. J48 provides the highest percentage of correctly classified instances for the bank dataset. The kappa statistic for J48 (0.8178) is the closest to 1, which indicates that J48 comes nearest to perfect agreement in classifying the data items. J48 also has the lowest mean absolute error and root mean squared error, since it provides the most accurate predictions with the least variance.

Figure 1: Graph of KS, MAE and RMSE for Bank Dataset
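Results of this kind can also be produced programmatically rather than through the WEKA GUI. The following is a minimal sketch using the WEKA Java API, assuming WEKA 3.6-era class names (ADTree, NBTree and RBFNetwork were moved to optional packages in later releases), an ARFF file named bank.arff with the class as the last attribute, and 10-fold cross-validation; the paper does not state its exact evaluation protocol, so the figures obtained may differ.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.RBFNetwork;
import weka.classifiers.functions.SMO;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.ADTree;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.NBTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute as the class.
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // The ten classifiers compared in the paper, with default settings.
        Classifier[] classifiers = {
            new ADTree(), new BayesNet(), new DecisionTable(), new J48(),
            new Logistic(), new NaiveBayes(), new NBTree(), new PART(),
            new RBFNetwork(), new SMO()
        };

        for (Classifier c : classifiers) {
            // Evaluate with 10-fold cross-validation (assumed protocol).
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-15s CCI=%6.2f%% ICI=%6.2f%% KS=%.4f MAE=%.4f RMSE=%.4f%n",
                c.getClass().getSimpleName(),
                eval.pctCorrect(), eval.pctIncorrect(), eval.kappa(),
                eval.meanAbsoluteError(), eval.rootMeanSquaredError());
        }
    }
}
```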

3.2. Car Dataset

There are 6 attributes (buying capacity, maintenance, number of doors, person seating capacity, luggage boot space, safety and class) and 1728 data items in the car dataset. The data items are classified into four classes, unacc, acc, good and very good, describing the level of acceptance of a car by people based on the six attributes.

Table 3
Results from WEKA for Car Dataset

Algorithm CCI (%) ICI (%) KS MAE RMSE

ADT - - - - -

BayesNet 85.71 14.29 0.6713 0.1114 0.2254

DT 91.03 08.97 0.7987 0.2748 0.3220

J48 92.36 07.64 0.8343 0.0421 0.1718

Logistic 93.11 06.89 0.8504 0.0428 0.1520

Naive Bayes 85.53 14.47 0.6665 0.1137 0.2262

NBT 94.21 05.79 0.8752 0.0676 0.1571

PART 95.78 04.22 0.9091 0.0241 0.1276

RBFN 94.21 05.79 0.8752 0.0676 0.1571

SMO 93.75 06.25 0.8649 0.2559 0.3202

For the car dataset PART performs best, followed by RBFNetwork and NBTree. ADTree is disabled for the car dataset in WEKA, as it only provides predictions for datasets with two classes. The kappa statistic for PART (0.9091) is the closest to perfect agreement. PART also has the highest percentage of correctly classified instances and the lowest mean absolute error and root mean squared error.
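If ADTree predictions were nevertheless wanted for the four-class car data, one option (not used in the paper) is WEKA's one-vs-rest meta-classifier. A fragment assuming the same WEKA 3.6-era API as the earlier sketch:

```java
import weka.classifiers.meta.MultiClassClassifier;
import weka.classifiers.trees.ADTree;

// ADTree only handles two-class problems, so for the four-class car data
// it can be wrapped in a one-vs-rest meta-classifier.
MultiClassClassifier mcc = new MultiClassClassifier();
mcc.setClassifier(new ADTree());
// mcc can then replace `new ADTree()` in the evaluation loop shown earlier.
```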

Figure 2: Graph of KS, MAE and RMSE for Car Dataset


3.3. Breast Cancer Dataset

The breast cancer dataset has 10 attributes (class, age, menopause, tumor size, inv-nodes, nodes-cap, deg-malig, breast, breast-quad and irradiat) and 286 data items. Data items are classified into no-recurrence and recurrence events based on the attributes.

Table 4
Results from WEKA for Breast Cancer Dataset

Algorithm CCI (%) ICI (%) KS MAE RMSE

ADT 73.78 26.22 0.3290 0.3919 0.4333

BayesNet 72.03 27.97 0.2919 0.3297 0.4566

DT 73.43 26.57 0.2462 0.3748 0.4407

J48 75.52 24.48 0.2826 0.3676 0.4324

Logistic 68.88 31.12 0.1979 0.3700 0.4631

Naive Bayes 71.68 28.32 0.2857 0.3272 0.4534

NBT 70.98 29.02 0.2465 0.3265 0.4753

PART 71.33 28.67 0.1995 0.3650 0.4762

RBFN 70.98 29.02 0.2177 0.3574 0.4443

SMO 69.58 30.42 0.1983 0.3042 0.5515

For the breast cancer dataset the J48 decision tree performs best, followed by ADTree and Decision Table. J48 provides the highest percentage of correctly classified instances for the breast cancer dataset. The kappa statistic for J48 (0.2826) is, however, much closer to 0 and slightly below that of ADTree (0.3290), indicating that the agreement is not much better than would be expected by chance; still, J48 correctly classifies a higher percentage of instances than ADTree and the other algorithms. The mean absolute error and root mean squared error of J48 are close to those of ADTree and Logistic.

Figure 3: Graph of KS, MAE and RMSE for Breast Cancer Dataset

3.4. Credit-g Dataset

In the German credit dataset [16] there are 20 attributes (checking account, month, credit history, purpose, amount, saving account, present employment, rate, sex, residence, property, age, other installments, housing, dependent, telephone and foreign worker) and 1000 data instances. Based on the 20 attributes, the credit rating of a person is classified into two classes, good and bad.

Table 5
Results from WEKA for Credit-g Dataset

Algorithm CCI (%) ICI (%) KS MAE RMSE

ADT 72.40 27.60 0.2988 0.3895 0.4315

BayesNet 75.50 24.50 0.3893 0.3101 0.4187

DT 71.00 29.00 0.2033 0.3677 0.4321

J48 70.50 29.50 0.2467 0.3467 0.4796

Logistic 75.20 24.80 0.3750 0.3098 0.4087

Naive Bayes 75.40 24.60 0.3813 0.2936 0.4201

NBT 75.50 24.50 0.3918 0.3102 0.4221

PART 70.20 29.80 0.2767 0.3245 0.4974

RBFN 74.00 26.00 0.3340 0.3388 0.4204

SMO 75.10 24.90 0.3654 0.2490 0.4990

For the credit-g dataset Bayes network and NBTree perform best, followed by Naïve Bayes and RBFN. Bayes network and NBTree have the same percentage of correctly classified instances, and both algorithms have almost the same values for the kappa statistic, mean absolute error and root mean squared error. NBTree is a hybrid approach based on Naïve Bayes and decision trees, whereas the Bayes network generalizes Naïve Bayes by allowing conditional dependencies between attributes.
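A plausible reason for the near-identical scores is that WEKA's default BayesNet configuration stays very close to a Naïve Bayes structure. The following fragment spells out those defaults (class and setter names assumed from the WEKA 3.6 API; this is illustrative, not something the paper specifies):

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.K2;

// WEKA's default BayesNet uses K2 search restricted to one parent per node,
// starting from a Naive Bayes structure, so the learned network can add no
// parents beyond the class node and remains a Naive Bayes-shaped network.
BayesNet bn = new BayesNet();
K2 k2 = new K2();
k2.setInitAsNaiveBayes(true);   // start from the Naive Bayes structure (default)
k2.setMaxNrOfParents(1);        // each attribute's only parent is the class (default)
bn.setSearchAlgorithm(k2);
```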

Figure 4: Graph of KS, MAE and RMSE for Credit-g Dataset


3.5. Diabetes Dataset

The diabetes dataset has 9 attributes (number of times pregnant, plasma glucose concentration, blood pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function and age) and 768 data instances. The instances are classified into two classes according to whether or not the woman tested positive for diabetes.

Table 6
Results from WEKA for Diabetes Dataset

Algorithm CCI (%) ICI (%) KS MAE RMSE

ADT 72.92 27.08 0.3736 0.3613 0.4195

BayesNet 74.35 25.65 0.4290 0.2987 0.4208

DT 71.22 28.78 0.3492 0.3448 0.4277

J48 73.83 26.17 0.4164 0.3158 0.4463

Logistic 77.21 22.79 0.4734 0.3094 0.3954

Naive Bayes 76.30 23.70 0.4664 0.2841 0.4168

NBT 74.35 25.65 0.4260 0.3099 0.4280

PART 75.26 24.74 0.4390 0.3101 0.4149

RBFN 75.39 24.61 0.4303 0.3448 0.4191

SMO 77.34 22.66 0.4682 0.2266 0.4760

For the diabetes dataset SMO performs best, followed by Logistic and Naïve Bayes. SMO has the highest percentage of correctly classified instances, and its kappa statistic is second only to that of Logistic. Its mean absolute error is the lowest, as SMO gives the closest predictions for the diabetes dataset, but its root mean squared error is not correspondingly low, indicating a variance in the predictions similar to that of the other algorithms (a pattern consistent with SMO emitting hard 0/1 predictions rather than probability estimates by default, so that the MAE equals the error rate, 0.2266, and the RMSE equals its square root, 0.4760).

Figure 5: Graph of KS, MAE and RMSE for Diabetes Dataset

4. CONCLUSION

In this paper we have studied and compared the Bayes network, Naïve Bayes, SMO, RBFNetwork, Logistic, J48 decision tree, ADTree, NBTree, PART and Decision Table algorithms on five datasets in WEKA. The overall observation is that no algorithm performs best on every dataset. For the bank and breast cancer datasets, J48 has more correctly classified instances than the remaining algorithms. For the credit-g and diabetes datasets, Naïve Bayes is among the first five algorithms. This shows that there is no single classification algorithm that can provide the best predictive model for all datasets. The accuracy of a predictive model is also affected by the selection of attributes. From this we can conclude that different classification algorithms are designed to perform better on certain types of datasets.

REFERENCES

[1] Daniel T. Larose, "Data Mining Methods and Models", John Wiley & Sons, Inc., Hoboken, New Jersey, (2006).

[2] Xindong Wu, Vipin Kumar et al., "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, 14(1), 1-37, (2008).

[3] Andrew Secker, Matthew N. Davies et al., "An Experimental Comparison of Classification Algorithms for the Hierarchical Prediction of Protein Function", Expert Update (the BCS-SGAI Magazine), 9(3), 17-22, (2007).

[4] Martin Weis, Till Rumpf, Roland Gerhards, Lutz Plümer, "Comparison of Different Classification Algorithms for Weed Detection from Images Based on Shape Parameters", ATB Publication, Volume 69, ISSN 0947-7314, 53-64, (2007).

[5] Ryan Potter, "Comparison of Classification Algorithms Applied to Breast Cancer Diagnosis and Prognosis", Wiley Expert Systems, 24(1), 17-31, (2007).

[6] Yoav Freund and Llew Mason, "The Alternating Decision Tree Learning Algorithm", International Conference on Machine Learning, 124-133, (1999).

[7] Daryle Niedermayer, "An Introduction to Bayesian Networks and their Contemporary Applications", Springer Studies in Computational Intelligence, 56, 117-130, (2008).

[8] Jianing Shi, Wotao Yin et al., "Fast Hybrid Algorithm for Large-Scale ℓ1-Regularized Logistic Regression", Journal of Machine Learning Research, 11, 713-741, (2010).

[9] Kim Larsen, "Generalized Naive Bayes Classifiers", SIGKDD Explorations, 7(1), 76-81, (2005).

[10] Manuel J. Fonseca, Joaquim A. Jorge, "NB-Tree: An Indexing Structure for Content-Based Retrieval in Large Databases", Proceedings of the Eighth International Conference on Database Systems for Advanced Applications, 267-276, (2003).

[11] Eibe Frank, Ian H. Witten, "Generating Accurate Rule Sets without Global Optimization", Proceedings of the Fifteenth International Conference on Machine Learning, 144-151, (1998).


[12] Adrian G. Bors, I. Pitas, "Introduction to RBF Network", Online Symposium for Electronics Engineers, 1(1), 1-7, (2001).

[13] Jingmin Wang, Kanzhang Wu, "Study of the SMO Algorithm Applied in Power System Load Forecasting", Springer LNCS, 1022-1026, (2006).