
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 6, Issue 2, February (2015), pp. 19-28 © IAEME

PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS

Firas Mohammed Ali¹, Dr. Prof. El-Bahlul Emhemed Fgee², Dr. Prof. Zakaria Suliman Zubi³

¹ B.Sc. IT Student, IT Department, Libyan Academy, Tripoli, Libya
² Supervisor, Computer Department, Libyan Academy, Tripoli, Libya
³ External Guide, Sirt University, Sirt, Libya

ABSTRACT

Classification is the most commonly applied data mining method; it is used to develop models that can classify large amounts of data and predict outcomes. Identifying the best classification algorithm among all those available is a challenging task. This paper presents a comparative study of the performance of the most widely used classification algorithms. The performance of these algorithms is analyzed on three different datasets from the University of California, Irvine (UCI) repository. Each technique is evaluated with respect to accuracy and execution time. The WEKA machine learning tool is used to apply the selected classification methods to the datasets and to identify which perform best.

Keywords: Classification Algorithms, WEKA, LMT, Random Tree, Naive Bayes

I. INTRODUCTION

Nowadays a huge amount of data is being collected and stored in databases everywhere across the globe, and the volume keeps increasing year after year. It is not hard to find databases with terabytes of data in enterprises and research facilities; that is over 10^12 bytes of data. There is invaluable information and knowledge "hidden" in such databases, and without automatic methods for extracting this information it is practically impossible to mine it [1]. Throughout the years, many algorithms have been created to extract these so-called nuggets of knowledge from large sets of data. There are several methodologies for approaching this problem; this paper focuses on classification. Classification is a data mining (machine learning) technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will


be sunny, rainy or cloudy. Popular classification techniques include decision trees and neural networks. Classification involves using a training set of observations with known categories to learn which category each new observation should be placed in. Individual observations are analyzed in terms of explanatory variables, which may be categorical, ordinal, integer-valued, or real-valued. Figure 1 shows the classification process.

II. PROBLEM DESCRIPTION

Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes together with the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set it has not seen before, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction, and the prediction accuracy defines how "good" the algorithm is. For example, in a medical database the training set would have relevant patient information recorded previously, where the prediction attribute is whether or not the patient had a heart problem [2].
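As a concrete illustration of this train-then-predict workflow (a minimal sketch of my own, not code from the paper; the ARFF file names are hypothetical), the WEKA Java API can be used as follows:

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainAndPredict {
        public static void main(String[] args) throws Exception {
            // Load the training set (hypothetical ARFF file with the
            // goal/prediction attribute in the last position).
            Instances train = new DataSource("train.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);

            // Learn relationships between the attributes and the outcome.
            J48 model = new J48();
            model.buildClassifier(train);

            // Load the prediction set: same attributes, class value unknown.
            Instances unseen = new DataSource("predict.arff").getDataSet();
            unseen.setClassIndex(unseen.numAttributes() - 1);

            for (int i = 0; i < unseen.numInstances(); i++) {
                Instance inst = unseen.instance(i);
                double label = model.classifyInstance(inst);
                System.out.println(i + " -> " + unseen.classAttribute().value((int) label));
            }
        }
    }

Note that the class index must be set explicitly, since WEKA does not assume which attribute is the prediction attribute.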

III. THE SELECTED CLASSIFICATION ALGORITHMS USED IN WEKA

These are the WEKA algorithms I chose to analyze, since they are implemented in the WEKA suite and ready to use directly. The decision to use the following algorithms was based on the efficiencies reported in the literature on data classification. I tried to pick at least one classifier from each of the major classifier groups, and ended up with the following; a short description of each is given below:

a) Naive Bayes

This is a very simple classifier that performs decently and should be easy to implement regardless of the language used. Its drawback is that it is not among the top performers when it comes to classifying instances correctly. This is not a big drawback, however, since it is quick both at constructing a classification model and at classifying data [3].

b) SMO

A sequential minimal optimization algorithm for training a support vector classifier. It has built-in support for handling multiple classes using pairwise classification. Note that, unlike lazy classifiers, this algorithm does all its calculations when building the model [3].

c) KStar (K*)

K* belongs to the family of instance-based learners: it classifies a test instance based on the training instances most similar to it, using an entropy-based distance function. Aha, Kibler & Albert describe three instance-based learners of increasing sophistication. IB1 is an implementation of a nearest neighbor algorithm with a specific distance function. IB3 is a further extension that improves tolerance to noisy data: instances with a sufficiently bad classification history are forgotten, and only instances with a good classification history are used for classification [4].

d) AdaBoostM1

A class for boosting a nominal class classifier using the AdaBoost M1 method. Only nominal class problems can be tackled. It often dramatically improves performance, but sometimes overfits [4].

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 6, Issue 2, February (2015), pp. 19-28 © IAEME

21

e) JRip

A decent classifier that performed okay, even though I had higher expectations of this rule learner given the reports in which it had been used. Its drawback is that it requires an extremely long time to construct a classification model for big data sets when a high WTK value is used, to the point where it becomes useless. For example, it required almost 54 hours to construct a classification model for data set B in chapter 4.2 of [3], while the Naive Bayes classifier managed to do the same in under 3 minutes [3].

f) OneR

A class for building and using a 1R classifier; in other words, it uses the minimum-error attribute for prediction, discretizing numeric attributes [4].

g) PART

A class for generating a PART decision list. PART uses the separate-and-conquer strategy: it builds a rule, removes the instances the rule covers, and continues creating rules recursively for the remaining instances. Whereas C4.5 and RIPPER perform global optimization to produce accurate rule sets, PART does not, and this added simplicity is its main advantage [4].

h) J48

An open source implementation of the C4.5 algorithm that builds a decision tree using information entropy. When building the tree, C4.5 selects at each node the attribute that most successfully splits the set of samples, as measured by the reduction in entropy (information gain) that the split produces [3].
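For reference (this formula is standard C4.5 background, not taken from the paper), the information gain that J48 maximizes at each node can be written as:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v), \qquad \mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c$$

where S is the set of samples at the node, S_v is the subset of S with value v of attribute A, and p_c is the proportion of samples in S belonging to class c.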

i) LMT

Classifier for building 'logistic model trees', which are classification trees with logistic

regression functions at the leaves. The algorithm can deal with binary and multi-class target

variables, numeric and nominal attributes and missing values [4].

j) Random Tree

A class for constructing a tree that considers K randomly chosen attributes at each node. It performs no pruning [4].
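To make the list concrete, the following sketch (my own, not from the paper) shows how each of the ten classifiers above is instantiated through the WEKA API with default options:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.KStar;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.rules.JRip;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.PART;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.LMT;
    import weka.classifiers.trees.RandomTree;

    public class ClassifierList {
        public static void main(String[] args) {
            // One instance of each classifier compared in this paper,
            // all with WEKA's default options.
            Classifier[] classifiers = {
                new NaiveBayes(),   // a) simple probabilistic classifier
                new SMO(),          // b) support vector machine (SMO training)
                new KStar(),        // c) instance-based, entropy distance
                new AdaBoostM1(),   // d) boosting meta-classifier
                new JRip(),         // e) RIPPER rule learner
                new OneR(),         // f) one-rule classifier
                new PART(),         // g) PART decision list
                new J48(),          // h) C4.5 decision tree
                new LMT(),          // i) logistic model tree
                new RandomTree()    // j) random tree, no pruning
            };
            for (Classifier c : classifiers) {
                System.out.println(c.getClass().getSimpleName());
            }
        }
    }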

IV. DEVELOPMENT

This section discusses how a method for analyzing data can be constructed and implemented, and how the different algorithms perform when classifying data. In theory, using a big data set to construct the classifier model will increase performance when classifying new data, since it is easier to construct a more general model and hence find a suitable match for our dataset. The optimal size of the data set used to construct the classifier model depends on a number of things, such as the size of the classification problem, the classifier algorithm used and the quality of the data set. The goal was to see how well the different algorithms performed, not just by comparing the number of correct classifications, but also by looking at the time required to construct the classification model, depending on the size of the input data and the number of features used, as well as the time required to classify a data set using the generated classification model. It would be entirely possible to implement these algorithms from scratch, since there is a lot of documentation describing them. The three data sets used in this study are taken from the UCI repository [5, 6].
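As an illustration of how such measurements can be made (a sketch under my own assumptions, not the paper's code; the ARFF file name is hypothetical), the following Java snippet times both model construction and classification with the WEKA API:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TimingExperiment {
        public static void main(String[] args) throws Exception {
            // Hypothetical dataset file; class attribute assumed last.
            Instances data = new DataSource("dataset.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            Classifier clf = new NaiveBayes();

            // Time the construction of the classification model.
            long t0 = System.nanoTime();
            clf.buildClassifier(data);
            long buildMs = (System.nanoTime() - t0) / 1_000_000;

            // Time the classification of the data set with the model.
            t0 = System.nanoTime();
            for (int i = 0; i < data.numInstances(); i++) {
                clf.classifyInstance(data.instance(i));
            }
            long classifyMs = (System.nanoTime() - t0) / 1_000_000;

            System.out.println("Build: " + buildMs + " ms, classify: " + classifyMs + " ms");
        }
    }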


V. CLASSIFICATION USING WEKA- IMPLEMENTATION STEPS

Step 1. Open WEKA Application

Start > All Programs > WEKA 3.7.11 > WEKA 3.7

Step 2. Load a Dataset file

Explorer > Open file… > Local Disk (C:) > Program Files > Weka-3-7 > data > "select dataset file"

Figure 1: Load a Dataset file

Step 3. Building “Classifiers”

Classify > Choose > "select the classifier name"

Figure 2: Building “Classifiers”


Step 4. Load the Test Option

Click on the 'Choose' button in the 'Classifier' box just below the tabs and select the C4.5 classifier: WEKA -> Classifiers -> Trees -> J48.

Figure 3: Load the Test Option
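The same steps can also be performed programmatically. The following sketch (my own illustration; the path assumes the default WEKA installation layout shown in Step 2, and vote.arff is one of the ARFF files bundled with WEKA) loads a dataset and evaluates the J48 (C4.5) classifier:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaSteps {
        public static void main(String[] args) throws Exception {
            // Steps 1-2: load a dataset file (path assumes the default
            // WEKA installation layout shown above).
            Instances data = new DataSource(
                "C:/Program Files/Weka-3-7/data/vote.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Steps 3-4: choose the classifier (trees > J48, i.e. C4.5)
            // and evaluate it with 10-fold cross-validation.
            J48 j48 = new J48();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(j48, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }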

VI. DATA SET INFORMATION

Three data sets are used in this study for predicting the performance of the selected classification algorithms.

Table 1: German credit dataset information

Dataset      Instances   Attributes   Data Type
Credit-g     1000        21           String

Table 2: Ionosphere dataset information

Dataset      Instances   Attributes   Data Type
Ionosphere   351         35           Numeric

Table 3: Vote dataset information

Dataset      Instances   Attributes   Data Type
Vote         435         17           Nominal

VII. RESULTS AND DISCUSSIONS

In this paper, several experiments are conducted to evaluate the performance of the selected algorithms on the given datasets. For evaluation purposes, three test modes are used: the training set mode, the cross-validation mode and the percentage split mode. At the end, the recorded measures are averaged. It is common to use 66% of the objects of the original database as a training set and the rest of the objects as a test set. There are a few more variables to consider before making the final decision, but based on the performance seen above, Random Tree is the proposed solution for researchers tackling the problem of classifying structured data in their data sets. Random Tree is proposed instead of the other two candidates, AdaBoostM1 and LMT, because it reached the goal of 100% classification three times, whereas LMT's classification percentages were 75.90% and 77.06%. Tables 4, 5 and 6 give example predictive accuracies, with the best results highlighted in red and blue, for the percentage split, cross-validation and training set test modes on the three selected UCI data sets: German credit, ionosphere and vote [7].
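The three test modes correspond directly to WEKA API calls. The sketch below (my own, not from the paper; the ARFF path is hypothetical) evaluates Random Tree under all three modes:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TestModes {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("credit-g.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Training set mode: evaluate on the same data used for training.
            RandomTree clf = new RandomTree();
            clf.buildClassifier(data);
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(clf, data);
            System.out.println("Training set:     " + trainEval.pctCorrect() + " %");

            // 10-fold cross-validation mode.
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new RandomTree(), data, 10, new Random(1));
            System.out.println("Cross-validation: " + cvEval.pctCorrect() + " %");

            // Percentage split mode: 66% for training, the rest for testing.
            Instances shuffled = new Instances(data);
            shuffled.randomize(new Random(1));
            int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
            Instances train = new Instances(shuffled, 0, trainSize);
            Instances test = new Instances(shuffled, trainSize,
                    shuffled.numInstances() - trainSize);
            RandomTree splitClf = new RandomTree();
            splitClf.buildClassifier(train);
            Evaluation splitEval = new Evaluation(train);
            splitEval.evaluateModel(splitClf, test);
            System.out.println("Percentage split: " + splitEval.pctCorrect() + " %");
        }
    }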

Table 4: Comparison of classifiers using German Credit Data set in Percentage split mode


Table 5: Comparison of classifiers using ionosphere Data set in Cross-validation mode


Table 6: Comparison of classifiers using vote Data set in Training set mode

Table 7: Predictive performance of credit-g dataset

Test Mode            Highest Accuracy
Training set         Random Tree
Cross-folds (10)     LMT
Percentage split     LMT

Table 8: Predictive performance of ionosphere dataset

Test Mode            Highest Accuracy
Training set         Random Tree
Cross-folds (10)     LMT
Percentage split     AdaBoostM1


Table 9: Predictive performance of vote dataset

Test Mode            Highest Accuracy
Training set         Random Tree
Cross-folds (10)     J48
Percentage split     AdaBoostM1

Figure 4: Tree analysis of Highest Performance Algorithms

VIII. CONCLUSION

Classification is one of the data mining tasks applied in many areas, especially in medical applications. A key issue in using this technique is selecting the appropriate algorithm for each data type, as there is no single algorithm that is best for all classification domains. The results of this paper provide a way to select the proper algorithm for a particular domain with respect to the test modes. On this basis, in my opinion Random Tree and LMT are the predictive performance classifiers that come out on top in this analysis. Future work will focus on combining the best classification techniques to improve performance further.

IX. ACKNOWLEDGEMENTS

I would like to thank my supervisor and external guide for their valuable suggestions and tips in writing this paper.

REFERENCES

1. http://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
2. Fabricio Voznika and Leonardo Viana, Data Mining Classifications.
3. Lilla Gula and Robin Norberg, Information Data Management for the Future of Communication, 2013.
4. http://weka.sourceforge.net/
5. Ghazi Johnny, Interactive KDD System for Fast Mining Association Rules, Lecturer/Staff Developing Center, date of acceptance 8/6/2009.
6. Dr. Philip Gordon, Data Mining: Predicting Tipping Points, 2013.
7. Deepali Kharche, K. Rajeswari and Deepa Abin, SASTRA University, Comparison of Different Datasets Using Various Classification Techniques with WEKA, Vol. 3, Issue 4, April 2014.
8. Shravan Vishwanathan and Thirunavukkarasu K, "Performance Analysis of Learning and Classification Algorithms", International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 4, 2014, pp. 138-149, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
9. Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, "Research on Classification Algorithms and Its Impact on Web Mining", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495-504, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
10. Nitin Mohan Sharma and Kunwar Pal, "Implementation of Decision Tree Algorithm After Clustering Through Weka", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 358-363, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

AUTHORS DETAILS

Firas Mohammed Ali received his B.Sc. in Computer Science in 2010 from Sirte University. He is currently pursuing a Master's in Information Technology at the Libyan Academy. His research areas are Data Mining and Artificial Intelligence.

Dr. Prof. El-Bahlul Emhemed Fgee received his Ph.D. in Internetworking from the Department of Engineering Mathematics and Internetworking, Dalhousie University, Halifax, NS, in 2006. Dr. Fgee supervises students in Network Design and Management. He worked as the Dean of Gharyan High Institute of Vocational Studies from 2008 to 2012 and has published many research papers and technical reports in international journals and conference proceedings.

Dr. Prof. Zakaria Suliman Zubi received his Ph.D. in Computer Science in 2002 from Debrecen University in Hungary and has been an Associate Professor since 2010. Dr. Zubi has served his university in various administrative positions, including Head of the Computer Science Department (2003-2005). He has published, as author and co-author, many research papers and technical reports in local and international journals and conference proceedings.