

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 123-139, IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (calculated by GISI), www.jifactor.com

    EFFICIENT SPAM CLASSIFICATION BY APPROPRIATE FEATURE

    SELECTION

Prajakta Ozarkar, Dr. Manasi Patwardhan
Vishwakarma Institute of Technology, Pune.

    ABSTRACT

Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Currently a lot of research work is performed on automatic detection of spam emails using classification techniques such as SVM, NB, MLP, KNN, ID3, J48, RandomTree, etc. For a spam dataset it is possible to have a large number of training instances. Based on this fact, we have made use of the Random Forest and Partial Decision Trees algorithms to classify spam vs. non-spam emails. These algorithms outperformed the previously implemented algorithms in terms of accuracy and time complexity. As a preprocessing step we have used feature selection methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation. This allowed us to select a subset of relevant, non-redundant and most contributing features, with the added benefit of improved accuracy and reduced time complexity.

    INTRODUCTION

In this paper we have studied previous approaches used for classifying spam and non-spam emails by using distinct classification algorithms. We have also studied the distinct features extracted for classifier training and the feature selection algorithms applied to get rid of irrelevant features and to select the most contributing features. After studying the current feature selection and classification approaches, we have applied two new classification techniques, viz. Random Forests and Partial Decision Trees, along with distinct feature selection algorithms.

R. Parimala et al. [1] present a new FS (Feature Selection) technique guided by the FSelector package. They have used nine feature selection techniques, namely Correlation-based feature selection, Chi-square, Entropy, Information Gain, Gain Ratio, Mutual Information, Symmetrical Uncertainty, OneR and Relief, and five classification algorithms, such



as Linear Discriminant Analysis, Random Forest, Rpart, Naïve Bayes and Support Vector Machine, on the spambase dataset. In their evaluation, the results show that the filter methods CFS, Chi-squared, GR, ReliefF, SU, IG and OneR enable the classifiers to achieve the highest increase in classification accuracy. They conclude that performing the implemented FS can improve the accuracy of Support Vector Machine classifiers.

In the paper by R. Kishore Kumar et al. [2], the spam dataset is analyzed using the Tanagra data mining tool. Initially, feature construction and feature selection are done to extract the relevant features by using Fisher filtering, ReliefF, Runs filtering and StepDisc. Then classification algorithms such as C4.5, C-PLS, C-RT, CS-CRT, CS-MC4, CS-SVC, ID3, K-NN, LDA, LogReg TRIRLS, Multilayer Perceptron, Multilogical Logistic Regression, Naïve Bayes Continuous, PLS-DA, PLS-LDA, Rnd Tree and SVM are applied over the spambase dataset, and cross validation is done for each of these classifiers. They conclude that the Fisher filtering and Runs filtering feature selection algorithms perform better for many classifiers. The Rnd Tree classification algorithm with the relevant features extracted by Fisher filtering produces more than 99% accuracy in spam detection.

W. A. Awad et al. [3] review the machine learning methods Bayesian classification, k-NN, ANNs, SVMs, Artificial Immune System and Rough Sets on the SpamAssassin spam corpus. They conclude that the Naïve Bayes method has the highest precision among the six algorithms, while k-nearest neighbor has the worst precision percentage. Also, the Rough Sets method has a very competitive percentage.

The work by V. Christina et al. [4] employs supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron and Naïve Bayes classifier. Five feature sets of an e-mail: all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the machine learning algorithms. The training dataset, a spam and legitimate message corpus, is generated from the mails that they received from their institute mail server over a period of six months. They conclude that the Multilayer Perceptron classifier outperforms the other classifiers and that its false positive rate is also very low compared to the other algorithms.

Rafiqul Islam et al. [5] have presented an effective and efficient email classification technique based on a data filtering method. In their testing they have introduced an innovative filtering technique using an instance selection method (ISM) to remove pointless data instances from the training model and then classify the test data. In their model, tokenization and domain-specific feature selection methods are used for feature extraction. Behavioral features are also included for improving performance, especially for reducing false positive (FP) problems. The behavioral features include the frequency of sending/receiving emails, email attachment, type of attachment, size of attachment and length of the email. In their experiment, they have tested five base classifiers, Naive Bayes, SVM, IB1, Decision Table and Random Forest, on 6 different datasets. They have also tested adaptive boosting (AdaBoostM1) as a meta-classifier on top of the base classifiers. They have achieved overall classification accuracy above 97%.

A comparative analysis is performed by Ms. D. Karthika Renuka et al. [6] for the classification techniques MLP, J48 and Naïve Bayes, for classifying spam messages from e-mail using the WEKA tool. The dataset gathered from the UCI repository had 2788 legitimate and 1813 spam emails received during a period of several months. Using this dataset as a training dataset, models are built for the classification algorithms. The study reveals that the same classifier performed dissimilarly when run on the same dataset but using different software tools. From all perspectives MLP is the top performer in all cases and thus can be deemed consistent.

The following table summarizes all the previous classification approaches listed above and provides a comparison in terms of the % accuracy they achieved with the application of a specific feature selection algorithm.


    Table 1. Comparison of previous approaches of spam detection

Reference   Classifier (% features)   Feature Selection   Acc (%)

R. Parimala, et al. [1]

    SVM (100%) - 93

    SVM (16%) CFS 91.44

    SVM (70%) Chi 93.00

    SVM (70%) IG 93.00

    SVM (70%) GR 93.39

    SVM (70%) SU 93.33

    SVM (70%) oneR 92.65

    SVM (70%) Relief 93.15

SVM (32%) LDA 91.90

    SVM (12%) Rpart 90.51

    SVM (16%) SVM 89.95

    SVM (21%) RF 91.23

    SVM (7%) NB 80.00

R. Kishore Kumar, et al. [2]

    C4.5 Fisher 99.9637

C-PLS Fisher 99.8976
C-RT Fisher 99.9465

    CS-CRT Fisher 99.9465

    CS-MC4 Fisher 99.9415

    CS-SVC Fisher 99.9685

    ID3 Fisher 99.9137

    KNN Fisher 99.9391

    LDA Fisher 99.8861

    LogReg TRIRLS Fisher 99.8552

    MLP Fisher 99.9459

    Multilogical Logistic Reg Fisher 99.9311

    NBC Fisher 99.8865

    PLS-DA Fisher 99.8752

PLS-LDA Fisher 99.8757

    Rnd Tree Fisher 99.9911

SVM Fisher 99.9070
C4.5 Relief 99.9487

    C-PLS Relief 99.8537

    C-RT Relief 99.9261

    CS-CRT Relief 99.9261

    CS-MC4 Relief 99.9324

    CS-SVC Relief 99.8794

    ID3 Relief 99.895

    KNN Relief 99.9176

    LDA Relief 99.8481

    LogReg TRIRLS Relief 99.8179

    MLP Relief 99.9185

    Multilogical Logistic Reg Relief 99.8883

    NBC Relief 99.8587

    PLS-DA Relief 99.8474

PLS-LDA Relief 99.8476
Rnd Tree Relief 99.9676

    SVM Relief 99.8639

    C4.5 Runs 99.9633

    C-PLS Runs 99.9102

    C-RT Runs 99.9404

    CS-CRT Runs 99.9404

    CS-MC4 Runs 99.9615

    CS-SVC Runs 99.9233

    ID3 Runs 99.9137


Reference   Classifier (% features)   Feature Selection   Acc (%)

R. Kishore Kumar, et al. [2]

    KNN Runs 99.9404

    LDA Runs 99.8887

MLP Runs 99.9607
LogReg TRIRLS Runs 99.8611

    Multilogical Logistic Reg Runs 99.9313

    NBC Runs 99.8874

    PLS-DA Runs 99.8879

PLS-LDA Runs 99.8879

    Rnd Tree Runs 99.9883

    SVM Runs 99.9076

    C4.5 StepDisc 99.9633

    C-PLS StepDisc 99.9081

    C-RT StepDisc 99.9341

    CS-CRT StepDisc 99.9341

    CS-MC4 StepDisc 99.9604

    CS-SVC StepDisc 99.9218

    ID3 StepDisc 99.9105

KNN StepDisc 99.935
LDA StepDisc 99.8881

    LogReg TRIRLS StepDisc 99.8587

    MLP StepDisc 99.9481

    Multilogical Logistic Reg StepDisc 99.9294

    NBC StepDisc 99.8829

    PLS-DA StepDisc 99.8826

PLS-LDA StepDisc 99.8829

    Rnd Tree StepDisc 99.99

    SVM StepDisc 99.905

W. A. Awad, et al. [3] NBC - 99.46

    SVM - 96.9

    KNN - 96.2

    ANN - 96.83

    AIS - 96.23

Rough Sets - 97.42

V. Christina, et al. [4] NBC - 98.6

    J48 - 96.6

    MLP - 99.3

Rafiqul Islam, et al. [5] NB - 92.3

    SMO - 96.4

    IB1 - 95.8

    DT - 95.9

    RF - 96.1

Ms. D. Karthika Renuka, et al. [6]

    MLP - 93

    J48 - 92

    NBC - 89

    PROPOSED WORK

After a detailed review of the existing techniques used for spam detection, in this section we illustrate the methodology and techniques we used for spam mail detection.

Figure 1 shows the process we have used for spam mail identification and how it is used in conjunction with a machine learning scheme. Feature ranking techniques such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation are applied to a copy of the training data. After feature selection, the subset with the highest merit is used to reduce the dimensionality of both the original training data and the testing data. Both reduced datasets may then be passed to a machine learning scheme for


training and testing. Results are obtained by using the Random Forest and PART classification techniques.

    Figure 1. Stages of Spam Email Classification

In the following subsections we discuss the basic concepts related to our work, including a brief background on feature ranking techniques, classification techniques and results.

    Dataset

The dataset used for our experiment is spambase [13]. The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of the attributes indicate the frequency of occurrence of spam-related terms. The first 48 attributes (1-48) give tf-idf (term frequency-inverse document frequency) values for spam-related words, whereas the next 6 attributes (49-54) give tf-idf values for spam-related characters. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters: capital_run_length_average, capital_run_length_longest and capital_run_length_total. Thus, our dataset has in total 57 attributes serving as input features for spam detection, and the last attribute represents the class (spam/non-spam).

We have also used one public dataset, Enron [20]. Its preprocessed subdirectory contains the messages in preprocessed format, with each message in a separate text file. The body of an email contains the actual information. This information needs to be extracted by means of preprocessing before running a filter. The purpose of preprocessing is to

transform the messages in the mail into a uniform format that can be understood by the learning algorithm. The following steps are involved in preprocessing:

1. Feature extraction (tokenization): extracting features from the e-mail into a vector space.
2. Stemming: removing the commoner morphological and inflexional endings from words in English.
3. Stop word removal: removal of non-informative words.
4. Noise removal: removal of obscure text or symbols from features.


5. Representation: tf-idf is a statistical measure used to calculate how significant a word is to a document in a corpus. Word frequency is captured by the term frequency (tf): the number of times the word appears in the message indicates the significance of the word to the document. The term frequency is then multiplied by the inverse document frequency (idf), which discounts words that occur in many messages.
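For illustration, the preprocessing steps above can be sketched in Python. The stop-word list and the suffix-stripping rule below are simplified stand-ins (not the actual stop list or stemmer used in our implementation), and the two sample mails are invented:

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of"}  # tiny illustrative list

def preprocess(message):
    """Tokenize, crudely stem, and clean one email body."""
    tokens = re.findall(r"[a-z]+", message.lower())           # 1. tokenization + 4. noise removal
    tokens = [t for t in tokens if t not in STOP_WORDS]       # 3. stop-word removal
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 2. naive suffix stripping (stand-in for a real stemmer)
    return [t for t in tokens if t]

def tfidf(messages):
    """5. Represent each message as a dict of tf-idf weights."""
    docs = [preprocess(m) for m in messages]
    n = len(docs)
    df = {}                                # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            tf = doc.count(term)           # raw term frequency in this message
            idf = math.log(n / df[term])   # rarer terms get a higher weight
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors

mails = ["Win free money now!!!", "Meeting notes attached, free to discuss"]
vecs = tfidf(mails)
# "free" appears in both mails, so its idf (and hence its tf-idf weight) is 0
```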

    Feature Ranking and Subset Selection

From the above defined feature vector of 57 features, we use feature ranking and selection algorithms to select a subset of features. We rank the given set of features using the following distinct approaches.

1. Chi-square

A chi-squared (χ²) hypothesis test may be performed on a contingency table in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, whether the levels of the row variable are differentially distributed over the levels of the column variable. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called chi-square [8]:

χ² = Σ (O − E)² / E

where O is the observed cell frequency and E is the expected cell frequency.
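As an illustrative sketch (the counts below are invented, not taken from spambase), the statistic can be computed from a contingency table of feature occurrence against class:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table.

    table[i][j] = observed count for row i (e.g. feature present/absent)
    and column j (e.g. spam/non-spam).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Illustrative counts: how often a token such as "free" occurs in spam vs. ham
#                 spam  ham
observed = [[60, 10],   # token present
            [40, 90]]   # token absent
score = chi_square(observed)
# A larger statistic means the feature and the class are more strongly associated.
```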

    2. Information gain

Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. The entropy of Y is given by [9]

H(Y) = − Σ_{y∈Y} p(y) log₂ p(y)

If the observed values of Y in the training data are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The following equation gives the entropy of Y after observing X:

H(Y|X) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log₂ p(y|x)

The amount by which the entropy of Y decreases reflects additional information about Y provided by X and is called the information gain or, alternatively, mutual information [9]. Information gain is given by


IG(Y; X) = H(Y) − H(Y|X)
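The two equations above can be sketched directly in Python; the toy feature/class data is invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum p(y) log2 p(y) over the observed class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y; X) = H(Y) - H(Y|X) for paired feature/class observations."""
    n = len(ys)
    by_x = {}
    for x, y in zip(xs, ys):            # partition the class labels by feature value
        by_x.setdefault(x, []).append(y)
    cond = sum(len(part) / n * entropy(part) for part in by_x.values())
    return entropy(ys) - cond

# Toy data: a binary feature that perfectly predicts the class gains H(Y) bits
x = [1, 1, 0, 0]
y = ["spam", "spam", "ham", "ham"]
print(information_gain(x, y))  # 1.0 bit: knowing x removes all uncertainty about y
```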

    3. Gain ratio

The various selection criteria have been compared empirically in a series of experiments. When all attributes are binary, the gain ratio criterion has been found to give considerably smaller decision trees. When the task includes attributes with large numbers of values, the subset criterion gives smaller decision trees that also have better predictive performance, but can require much more computation. However, when these many-valued attributes are augmented by redundant attributes which contain the same information at a lower level of detail, the gain ratio criterion gives decision trees with the greatest predictive accuracy. All in all, this suggests that the gain ratio criterion does pick a good attribute for the root of the tree [12]. The gain ratio normalizes the information gain by the entropy of the attribute itself:

GainRatio(Y, X) = IG(Y; X) / H(X)

4. Symmetrical uncertainty

Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. Symmetry is a desirable property for a measure of feature-feature intercorrelation to have. Unfortunately, information gain is biased in favour of features with more values. Symmetrical uncertainty compensates for information gain's bias toward attributes with more values and normalizes its value to the range [0, 1] [9]:

SU(Y, X) = 2.0 × [ IG(Y; X) / (H(Y) + H(X)) ]

5. Relief

Relief [10] is a feature weighting algorithm that is sensitive to feature interactions. Relief attempts to approximate the following difference of probabilities for the weight of a feature X [9]:

W_X = P(different value of X | nearest instance of a different class) − P(different value of X | nearest instance of the same class)

By removing the context sensitivity provided by the nearest-instance condition, attributes are treated as independent of one another:

Relief_X = P(different value of X | different class) − P(different value of X | same class)

which can be reformulated in terms of the prior probabilities of the attribute values and the classes [9],


where C is the class variable.
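A minimal sketch of the nearest-instance form of Relief is given below, assuming numeric features scaled to [0, 1] and at least two instances per class; the four-instance dataset is invented, with feature 0 separating the classes and feature 1 acting as noise:

```python
import random

def relief(data, labels, n_samples=None, seed=0):
    """Relief feature weighting (numeric features, Manhattan distance).

    W[f] is increased when f differs on the nearest miss (different class)
    and decreased when it differs on the nearest hit (same class).
    """
    rng = random.Random(seed)
    m = n_samples or len(data)
    n_feat = len(data[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    for _ in range(m):
        i = rng.randrange(len(data))
        # nearest instance of the same class (hit) and of a different class (miss)
        hit = min((j for j in range(len(data)) if j != i and labels[j] == labels[i]),
                  key=lambda j: dist(data[i], data[j]))
        miss = min((j for j in range(len(data)) if labels[j] != labels[i]),
                   key=lambda j: dist(data[i], data[j]))
        for f in range(n_feat):
            w[f] += (abs(data[i][f] - data[miss][f]) -
                     abs(data[i][f] - data[hit][f])) / m
    return w

# Feature 0 separates the classes, feature 1 is noise
data = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.8], [0.9, 0.2]]
labels = ["ham", "ham", "spam", "spam"]
weights = relief(data, labels)
# weights[0] comes out clearly larger than weights[1]
```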

    6. OneR

Like other empirical learning methods, 1R [11] takes as input a set of examples, each with several attributes and a class. The aim is to infer a rule that predicts the class given the values of the attributes. The 1R algorithm chooses the most informative single attribute and bases the rule on this attribute alone. The basic idea is:

For each attribute a, form a rule as follows:
    For each value v from the domain of a:
        Select the set of instances where a has value v.
        Let c be the most frequent class in that set.
        Add the following clause to the rule for a:
            if a has value v then the class is c
    Calculate the classification accuracy of this rule.
Use the rule with the highest classification accuracy.

The algorithm assumes that the attributes are discrete. If not, they must be discretized.
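The 1R procedure above translates almost line for line into Python; the toy dataset (an attribute for "contains 'free'?" and a message-length attribute) is purely illustrative:

```python
from collections import Counter, defaultdict

def one_r(instances, classes):
    """1R: pick the single attribute whose value -> majority-class rule
    is most accurate on the training data. Attributes must be discrete."""
    best = None
    n_attrs = len(instances[0])
    for a in range(n_attrs):
        rule = {}
        by_value = defaultdict(list)
        for row, c in zip(instances, classes):
            by_value[row[a]].append(c)     # group classes by attribute value
        correct = 0
        for v, cs in by_value.items():
            majority, count = Counter(cs).most_common(1)[0]
            rule[v] = majority             # "if a has value v then class is majority"
            correct += count
        accuracy = correct / len(classes)
        if best is None or accuracy > best[1]:
            best = (a, accuracy, rule)
    return best  # (attribute index, training accuracy, value -> class map)

# Toy data: attribute 0 ("contains 'free'?") predicts spam better than attribute 1
X = [("yes", "short"), ("yes", "long"), ("no", "long"), ("no", "short")]
y = ["spam", "spam", "ham", "ham"]
attr, acc, rule = one_r(X, y)
print(attr, acc, rule)  # 0 1.0 {'yes': 'spam', 'no': 'ham'}
```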

    7. Correlation

Feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and such a feature selection procedure can be beneficial to common machine learning algorithms [9]. Features are relevant if their values vary systematically with category membership. In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. The acceptance of a feature depends on the extent to which it predicts classes in areas of the instance space not already predicted by other features. The correlation-based feature selection (CFS) subset evaluation function is [9]:

Merit_S = (k · r̄_cf) / √(k + k(k − 1) · r̄_ff)

where Merit_S is the heuristic merit of a feature subset S containing k features, r̄_cf is the mean feature-class correlation, and r̄_ff is the average feature-feature inter-correlation.
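The merit function is a one-liner; the example values below are invented to show how it rewards subsets whose features are individually predictive but mutually non-redundant:

```python
from math import sqrt

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit_S = k * r_cf / sqrt(k + k*(k-1) * r_ff) for a subset of k features."""
    return (k * mean_feature_class_corr /
            sqrt(k + k * (k - 1) * mean_feature_feature_corr))

# Two hypothetical subsets of 10 features, equally class-correlated on average;
# the less internally redundant subset scores a higher merit.
redundant = cfs_merit(10, 0.5, 0.9)
diverse = cfs_merit(10, 0.5, 0.1)
print(diverse > redundant)  # True
```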


Feature ranking further helps us to:

1. Remove irrelevant features, which might mislead the classifier, decreasing its interpretability and reducing generalization by increasing overfitting.
2. Remove redundant features, which provide no additional information over the other features and unnecessarily decrease the efficiency of the classifier.
3. Select high-rank features, which may not affect precision and recall much, but reduces time complexity drastically. Selecting such high-rank features reduces the dimensionality of the feature space of the domain. It speeds up the classifier, thereby improving performance and increasing the comprehensibility of the classification result.

We have considered 87%, 77% and 70% of the features; the clearest performance improvement is obtained when 70% of the features are considered.

    Classification Methods

Based on the assumption that the given dataset has a sufficient number of training instances, we have chosen the following two classification algorithms. The algorithms work well based on the fact that the dataset is of good quality.

    1. Random Forest

Random Forests [14] are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Each tree is grown as follows:

1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.

2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.
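The two steps above can be sketched with depth-one trees ("stumps") standing in for fully grown trees; everything here (the data, 25 trees, m = 2, the constant second feature) is illustrative and is not the configuration used in our experiments:

```python
import random
from collections import Counter

def grow_stump(rows, labels, m, rng):
    """One depth-1 'tree': bootstrap the data, pick m random features,
    and choose the best single threshold split among them."""
    n = len(rows)
    idx = [rng.randrange(n) for _ in range(n)]          # step 1: bootstrap sample
    sample = [rows[i] for i in idx]
    sample_y = [labels[i] for i in idx]
    feats = rng.sample(range(len(rows[0])), m)          # step 2: m random features
    best = None
    for f in feats:
        for t in sorted({r[f] for r in sample}):
            left = [y for r, y in zip(sample, sample_y) if r[f] < t]
            right = [y for r, y in zip(sample, sample_y) if r[f] >= t]
            if not left or not right:
                continue
            correct = (Counter(left).most_common(1)[0][1] +
                       Counter(right).most_common(1)[0][1])
            if best is None or correct > best[0]:
                best = (correct, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate bootstrap sample: fall back to the majority class
        majority = Counter(sample_y).most_common(1)[0][0]
        return lambda row: majority
    _, f, t, left_class, right_class = best
    return lambda row: left_class if row[f] < t else right_class

def random_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    trees = [grow_stump(rows, labels, m, rng) for _ in range(n_trees)]
    def predict(row):                                    # majority vote over trees
        return Counter(tree(row) for tree in trees).most_common(1)[0][0]
    return predict

# Toy data: feature 0 decides the class; feature 1 is constant and uninformative
X = [[0.1, 3.0], [0.2, 3.0], [0.3, 3.0], [0.7, 3.0], [0.8, 3.0], [0.9, 3.0]]
y = ["ham", "ham", "ham", "spam", "spam", "spam"]
predict = random_forest(X, y, m=2)
print(predict([0.05, 3.0]), predict([0.95, 3.0]))  # ham spam
```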


Total Random Forest trees: 10
Number of random features: 4
Out-of-bag error: 0.1092391304347826

All the trees in the forest (one RandomTree, shown in part):

RandomTree
==========
word_freq_hpl < 0.07
| char_freq_$ < 0.03
| | word_freq_you < 0.12
| | | word_freq_hp < 0.02
| | | | char_freq_! < 0.01
| | | | | word_freq_3d < 9.87
| | | | | | word_freq_000 < 0.08
| | | | | | | char_freq_( < 0.04
| | | | | | | | word_freq_meeting < 0.85
| | | | | | | | | word_freq_remove < 2.27
| | | | | | | | | | word_freq_free < 6.47
| | | | | | | | | | | word_freq_will < 0.17
| | | | | | | | | | | | word_freq_pm < 0.42
| | | | | | | | | | | | | word_freq_all < 0.21
| | | | | | | | | | | | | | word_freq_mail < 2.96
| | | | | | | | | | | | | | | word_freq_re < 5.4
| | | | | | | | | | | | | | | | word_freq_technology < 1.43
| | | | | | | | | | | | | | | | | capital_run_length_total < 18.5
| | | | | | | | | | | | | | | | | | word_freq_re < 0.68
| | | | | | | | | | | | | | | | | | | word_freq_make < 1.39
| | | | | | | | | | | | | | | | | | | | capital_run_length_total < 10.5 : 0 (218/0)
| | | | | | | | | | | | | | | | | | | | capital_run_length_total >= 10.5
...


    Results

    Spambase Results

The spambase dataset was taken from the UCI machine learning repository [13]. It contains 4601 instances and 58 attributes: 57 continuous attributes and 1 nominal class label. The email spam classification has been implemented in Eclipse, a Java development environment. Feature ranking and feature selection are done using methods such as Chi-square, Information gain, Gain ratio, Relief, OneR and Correlation as a preprocessing step, so as to select the feature subset for building the learning model.

The classification algorithms are from the decision tree family, viz. Random Forest and Partial Decision Trees (PART). Random Forest is an effective tool in prediction. Because of the law of large numbers, random forests do not overfit. Random inputs and random features produce good results in classification, less so in regression. For larger datasets, it seems that significantly lower error rates are possible [14]. The feature space can be reduced by an order of magnitude while achieving similar classification results; for example, it takes about 2,000 features to achieve accuracies similar to those obtained with 149 PART features [15].

As part of our implementation, we have divided the dataset into two parts: 80% of the dataset is used for training and 20% for testing. After the preprocessing step, the top 87%, 77% and 70% of the features are considered while building the training model and testing, because this yields a significant performance improvement. Prediction accuracy, correctly classified instances, incorrectly classified instances, the confusion matrix and time complexity are used as performance measures of the system.
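The 80/20 split and the accuracy/confusion-matrix measures can be sketched as follows; the five actual/predicted labels are invented for illustration:

```python
import random

def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out the final test_fraction for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def confusion_matrix(actual, predicted, positive="spam"):
    """Return (TP, FP, FN, TN) counts for a binary classification task."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a == p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive == p)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive != p)
    return tp, fp, fn, tn

# Illustrative labels for five test messages
actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn, accuracy)  # 2 1 1 1 0.6
```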

More than 99% prediction accuracy is achieved by Random Forest with all seven feature selection methods in consideration, whereas 97% prediction accuracy is achieved by PART with almost all of the seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 2.

    Table 2. Results of 100% feature selection

Classifier   Training Acc (%)   Testing Acc (%)   Time (ms)

Random Forest   99.918   94.354   1540

PART   96.416   92.291   4938

Both training and testing results on the spambase dataset after feature ranking and subset selection are shown in Table 3 and Table 4.


Table 3. Training Results

FS (%)   FS   RF Acc (%)   Time (ms)   PART Acc (%)   Time (ms)

87% Chi 99.891 1349 98.234 3797
Infogain 99.837 1330 98.505 3080

    Gainratio 99.918 1386 98.315 3611

    Relief 99.891 1397 96.63 3467

    SU 99.918 1367 98.505 3124

    OneR 99.918 1470 96.902 4727

    Corr 99.728 1153 95.027 847

    77% Chi 99.918 1373 97.283 2701

    Infogain 99.891 1498 97.147 3131

    Gainratio 99.918 1604 97.006 4007

    Relief 99.864 1367 97.799 3829

    SU 99.891 1294 97.147 2867

OneR 99.891 1406 94.973 3469
Corr 99.728 1145 95.027 835

    70% Chi 99.891 1282 97.092 2437

    Infogain 99.918 1314 97.092 2409

    Gainratio 99.864 1383 96.821 2642

    Relief 99.81 1428 96.658 2855

    SU 99.918 1276 97.092 2394

    OneR 99.918 1442 95.245 2528

    Corr 99.728 1152 95.027 845

Table 4. Testing Results

FS (%)   FS   RF Acc (%)   PART Acc (%)

    87% Chi 94.788 92.291

    Infogain 94.137 93.16

    Gainratio 93.594 94.137

    Relief 95.114 93.185

    SU 93.16 93.16

    OneR 92.834 89.902

    Corr 93.051 92.942

    77% Chi 93.485 92.508

    Infogain 94.028 93.051

    Gainratio 94.245 92.291

    Relief 93.485 92.617

    SU 94.028 93.051

OneR 93.16 91.531
Corr 93.051 92.942


    70% Chi 94.245 93.268

    Infogain 94.68 93.268

    Gainratio 94.028 94.463

    Relief 93.811 91.965

    SU 94.137 93.268

    OneR 93.16 89.794

    Corr 93.051 92.942


From the results above, it can be observed that for Random Forest, after using 70% of the feature set extracted using the Infogain, Symmetrical Uncertainty and OneR feature selection algorithms, the training accuracy remained the same (99.918%) whereas the computation time reduced by about 17% (from 1540 ms to 1276 ms). This shows that the remaining 30% of the features were not contributing to the classification.

It can also be observed that for PART, after using 70% of the feature set extracted using the Chi-square, Infogain and Symmetrical Uncertainty feature selection algorithms, the training accuracy increased by 1.521% and the computation time reduced by about 51% (from 4938 ms to 2409 ms). This shows that the remaining 30% of the features were not only redundant but also misleading the classification.

    Enron Results

More than 96% prediction accuracy is achieved by Random Forest with all seven feature selection methods in consideration, whereas more than 95% prediction accuracy is achieved by PART with almost all of the seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 5.

    Table 5. Results of 100% feature selection

Classifier   Training Acc (%)   Testing Acc (%)   Time (ms)

Random Forest   96.181   93.623   9466

PART   95.093   91.787   18558

Both training and testing results after feature ranking and subset selection are shown in Table 6 and Table 7.

    Table 6. Training Results

FS (%)   FS   RF Acc (%)   Time (ms)   PART Acc (%)   Time (ms)

    87% Chi 96.012 4210 94.634 5961

    Infogain 96.012 4106 94.634 5839

Gainratio 96.012 4584 94.634 5791
Relief 96.012 4070 94.634 5806

    SU 96.012 4170 94.634 5854

    OneR 96.012 4085 94.634 5856

    Corr 96.012 4147 94.634 5821


    Table 7. Testing Results

FS (%)   FS   RF Acc (%)   PART Acc (%)

    87% Chi 93.43 90.725

    Infogain 93.43 90.725

    Gainratio 93.43 90.725

    Relief 93.43 90.725

SU 93.43 90.725
OneR 93.43 90.725

    Corr 93.43 90.725

From the results above, it can be observed that for Random Forest, after using 87% of the extracted feature set, the training accuracy is 96.012% whereas the computation time reduced by 51.574% (from 9466 ms to 4584 ms). This shows that the remaining 13% of the features were not contributing to the classification.

It can also be observed that for PART, after using 87% and 77% of the extracted feature set, the training accuracy increased. There is a significant improvement with 87% feature selection, by about 1%, and the computation time reduced by 67.879% (from 18558 ms to 5961 ms). This shows that the remaining 13% of the features were not only redundant but also misleading the classification.

    Gmail Dataset Test Results

Further, we have tested our Enron-trained model on a dataset created from the emails we received in our Gmail accounts during the last 3 months. The results are shown in Table 8. In this experiment, the test dataset is completely non-overlapping with the training set, allowing us to truly evaluate the performance of our system.

Table 8. Personal email dataset testing results

    Classifier      Testing Accuracy (%)
    Random Forest   96
    PART            97.33

    CONCLUSION

In this paper we have studied previous approaches to spam email detection using machine learning methodologies. We have compared and evaluated these approaches based on factors such as the dataset used; the features extracted, ranked and selected; the feature selection algorithms used; and the results obtained in terms of accuracy (precision, recall and error rate) and performance (time required).


The datasets available for spam detection are large in number, and for such larger datasets Random Forest and PART tend to produce better results with lower error rates and higher precision. So, we used these two classifiers for spam email detection. For the spambase dataset, we acquired the best accuracy of 99.918% with Random Forest, which is 9% better than previous spambase approaches, and 96.416% with PART. For the Enron dataset, we acquired the best accuracy of 96.181% with Random Forest and 95.093% with PART. The Enron dataset is used by [21] in an unsupervised spam learning and detection scheme. Above all, for the dataset created from our personal emails, an accuracy of 96% is achieved with Random Forest and 97.33% with PART. The feature selection algorithms used also contributed to better accuracy with lower time complexity due to dimensionality reduction. For Random Forest on the spambase dataset, after using 70% of the extracted feature set, the training accuracy remained the same (99.918%) whereas the computation time was reduced by about 17% (from 1540 ms to 1276 ms); for PART, the training accuracy increased by 1.521% and the computation time was reduced by about 51% (from 4938 ms to 2409 ms).

    REFERENCES

1. A Study of Spam E-mail Classification using Feature Selection Package, R. Parimala, Dr. R. Nallaswamy, National Institute of Technology, Global Journal of Computer Science and Technology, Volume 11, Issue 7, Version 1.0, May 2011.
2. Comparative Study on Email Spam Classifier using Data Mining Techniques, R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar, Member, IAENG, Proceedings of the International MultiConference of Engineers and Computer Scientists 2012, Vol. I, IMECS 2012, March 14-16, Hong Kong.
3. Machine Learning Methods for Spam E-mail Classification, W. A. Awad and S. M. ELseuofi, International Journal of Computer Applications (0975-8887), Volume 16, No. 1, February 2011.
4. Email Spam Filtering using Supervised Machine Learning Techniques, V. Christina, S. Karpagavalli, G. Suganya, International Journal on Computer Science and Engineering (IJCSE), Vol. 02, No. 09, 2010, pp. 3126-3129.
5. Email Classification Using Data Reduction Method, Rafiqul Islam and Yang Xiang, Member IEEE, School of Information Technology, Deakin University, Burwood 3125, Victoria, Australia.
6. Spam Classification based on Supervised Learning using Machine Learning Techniques, Ms. D. Karthika Renuka, Dr. T. Hamsapriya, Mr. M. Raja Chakkaravarthi, Ms. P. Lakshmi Surya, 978-1-61284-764-1/11, 2011 IEEE.
7. An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization, Chih-Chin Lai, Ming-Chi Tsai, Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (HIS'04), 0-7695-2291-2/04, IEEE.
8. Introductory Statistics: Concepts, Models, and Applications, David W. Stockburger.
9. Feature Subset Selection: A Correlation Based Filter Approach, Hall, M. A., Smith, L. A., 1997, International Conference on Neural Information Processing and Intelligent Information Systems, Springer, pp. 855-858.
10. A Practical Approach to Feature Selection, K. Kira and L. A. Rendell, Proceedings of the Ninth International Conference on Machine Learning, 1992.
11. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Holte, R. C. (1993), Machine Learning, Vol. 11, pp. 63-91.
12. Induction of Decision Trees, J. R. Quinlan, Machine Learning 1, pp. 81-106, 1986.
13. UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html, Hettich, S., Blake, C. L., and Merz, C. J., 1998.
14. Random Forests, Leo Breiman, Statistics Department, University of California, Berkeley, CA 94720, January 2001.
15. Exploiting Partial Decision Trees for Feature Subset Selection in e-Mail Categorization, Helmut Berger, Dieter Merkl, Michael Dittenbach, SAC'06, April 23-27, 2006, Dijon, France, ACM 1-59593-108-2/06/0004.
16. C4.5: Programs for Machine Learning, J. R. Quinlan, Morgan Kaufmann Publishers Inc., 1993.
17. Fast Effective Rule Induction, W. W. Cohen, In Proc. of the Int'l Conf. on Machine Learning, pp. 115-123, Morgan Kaufmann, 1995.
18. Toward Optimal Feature Selection using Ranking Methods and Classification Algorithms, Jasmina Novaković, Perica Strbac, Dušan Bulatović, March 2011.
19. SpamAssassin, http://spamassassin.apache.org.
20. The Enron spam dataset, http://www.aueb.gr/users/ion/data/enron-spam/
21. A Case for Unsupervised-Learning-based Spam Filtering, Feng Qian, Abhinav Pathak, Y. Charlie Hu, Z. Morley Mao, Yinglian Xie.
22. Jyoti Pruthi and Dr. Ela Kumar, Data Set Selection in Anti-Spamming Algorithm - Large or Small, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 206-212, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
23. C. R. Cyril Anthoni and Dr. A. Christy, Integration of Feature Sets with Machine Learning Techniques for Spam Filtering, International Journal of Computer Engineering & Technology (IJCET), Volume 2, Issue 1, 2011, pp. 47-52, ISSN Print: 0976-6367, ISSN Online: 0976-6375.