Spam Email Classification 3

Uploaded by: austin-kinion

Post on 05-Oct-2015


DESCRIPTION

Testing of Spam Classifier

TRANSCRIPT

  • Spam Email Classification 3 ---- Austin Kinion

    # I certify that I have acknowledged any code that I used from any other person
    # in the class, from Piazza, or from any Web site, book, or other source.
    # Any other work is my own.

    print(load(url("http://eeyore.ucdavis.edu/stat141/Data/trainVariables.rda")))
    Data = trainVariables

    Pull out some of the nonsense variables:

    Data$numLinesInBody = NULL
    Data$bodyCharacterCount = NULL
    Data$subjectQuestCount = NULL
    Data$percentSubjectBlanks = NULL
    Data$messageIdHasNoHostname = NULL
    Data$percentForwards = NULL
    Data$isDear = NULL
    Data$averageWordLength = NULL
    Data$percentHTMLTags = NULL
    Data$numAttachments = NULL

    Create Data2 as the new data with the nonsense variables and isSpam removed, then scale it.

    Data2 = as.matrix(Data[, 1:19])
    Dat_scaled = scale(Data2)

    # Compute different types of distance matrices from discussion section
    man_dist = as.matrix(dist(Dat_scaled, method = "manhattan"))
    euc_dist = as.matrix(dist(Dat_scaled))
    mink_dist = as.matrix(dist(Dat_scaled, method = "minkowski"))
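One thing worth noting (my own observation, not in the original): `dist()`'s Minkowski method defaults to p = 2, so mink_dist here comes out identical to euc_dist unless a `p` argument is supplied. A quick toy check:

```r
# Toy check: Minkowski distance with the default p = 2 equals Euclidean distance
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)   # two points: (0,0) and (3,4)
d_euc  <- dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
d_mink <- dist(m, method = "minkowski")  # defaults to p = 2
stopifnot(all.equal(as.numeric(d_euc), 5),
          all.equal(as.numeric(d_euc), as.numeric(d_mink)))
# Setting p = 1 instead reproduces the Manhattan distance: |3| + |4| = 7
stopifnot(all.equal(as.numeric(dist(m, method = "minkowski", p = 1)), 7))
```

So to get a genuinely different distance out of mink_dist, a p other than 1 or 2 would need to be passed.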

    Voting method: most of this function was obtained from Nick Ulle's office hours.

    vote = function(email, dist.matrix = euc_dist, Train = Dat_scaled, k){
      # names of the k nearest neighbors (the email itself, at distance 0, is among them)
      neighbors = names(sort(dist.matrix[email, ])[1:k])
      # majority vote over the neighbors' isSpam labels
      prediction = Train[rownames(Train) %in% neighbors, 'isSpam']
      pred_mean = mean(prediction)
      if (pred_mean >= 0.5){
        return(TRUE)
      } else {
        return(FALSE)
      }
    }

    # (example call; email and k are supplied in the loops below)
    vote_result = vote(email, dist.matrix = euc_dist, Train = Dat_scaled, k)
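As a sanity check on this majority-vote logic, here is a minimal sketch on a made-up 4-email distance matrix with hypothetical labels (toy_dist, toy_train, and all of their values are invented purely for illustration):

```r
# Hypothetical toy data: 4 emails with a labeled "training" matrix
toy_dist <- matrix(c(0, 1, 2, 9,
                     1, 0, 2, 9,
                     2, 2, 0, 3,
                     9, 9, 3, 0), nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("e", 1:4), paste0("e", 1:4)))
toy_train <- cbind(isSpam = c(e1 = TRUE, e2 = TRUE, e3 = FALSE, e4 = FALSE))

# Same voting scheme as in the report: label = majority of the k nearest neighbors
toy_vote <- function(email, dist.matrix, Train, k){
  neighbors <- names(sort(dist.matrix[email, ])[1:k])
  mean(Train[rownames(Train) %in% neighbors, "isSpam"]) >= 0.5
}

# e1's 3 nearest neighbors are e1, e2 (spam) and e3 (ham): majority spam
stopifnot(isTRUE(toy_vote("e1", toy_dist, toy_train, k = 3)))
# e4's 3 nearest neighbors are e4, e3 (both ham) and e1 (spam): majority ham
stopifnot(isFALSE(toy_vote("e4", toy_dist, toy_train, k = 3)))
```

Note that with a distance matrix built over the training rows, each email's nearest neighbor is itself, so the vote always includes the email's own label.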

    How accurate the voting is (help from Charles was given for this):

    acc = function(email, pred_mean = vote_result, Train = Dat_scaled, k){
      accur = mean(Train[email, 'isSpam'] == pred_mean)
      return(cbind(accur, k))
    }

    kNN and CV for getting the best k:

    First shuffle:

    set.seed(5779616)
    rand.rowx = sample(1:nrow(Dat_scaled))

    Then split into 5 groups for 5-fold CV:

    n = 5

    group = function(rand.rowx, n) split(rand.rowx, factor(sort(rank(rand.rowx) %% n)))
    group(rand.rowx, n)
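To see what this split actually does, here is a quick toy run (my own check, using a shuffle of 1:10 in place of the real shuffled rows): the folds come out equal-sized and cover every index exactly once.

```r
# Toy demonstration of the 5-fold split on a shuffled index vector 1:10
set.seed(1)
idx <- sample(1:10)
folds <- split(idx, factor(sort(rank(idx) %% 5)))
stopifnot(length(folds) == 5)             # five folds
stopifnot(all(lengths(folds) == 2))       # each of equal size
stopifnot(setequal(unlist(folds), 1:10))  # covering every row exactly once
```

Because rank() of a permutation of 1:N is again a permutation of 1:N, sort(rank(idx) %% 5) is just an evenly balanced factor, so the folds are as equal as N allows.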

    These next two lines are from Nick Ulle:

    store = matrix(NA, ncol = 2, nrow = 20)
    result = matrix(NA, ncol = 5, nrow = 20)

    Write a loop where each group 1:5 is used as the test set. Help from Charles Arnold was given to write this function.

    for (i in 1:5){

      test_dat = unlist(group(rand.rowx, n)[i])
      train_dat = euc_dist[, test_dat]

      # 20 as the number of k's to plot was suggested by Nick.
      # Loop over k up to 20 to see which is best, using vote and acc
      for (k in 1:20){
        x = sapply(test_dat, vote, dist.matrix = train_dat, Train = Dat_scaled, k = k)
        res = acc(test_dat, pred_mean = x, Train = Dat_scaled, k = k)
        # Use the store and result matrices above to fill in:
        store[k, ] = 1 - res
      }

      colnames(store) = c('k', 'Accuracy')
      result[, i] = store[, 2]
    }

    To find the average error rate for each k, I used a Stack Overflow function and modified it:

    err_rate = c()
    for (i in 1:20){
      err_rate[i] = mean(result[i, ])
    }

    # Help from a girl in Charles' OH was given for this next part:
    # Show the result and make a data.frame
    show_result = cbind(result, err_rate)
    colnames(show_result) = c("1", "2", "3", "4", "5", "Error Rate")
    res_dat.fram = data.frame(cbind(1:20, err_rate))

    As learned in Nick's section, order the data frame to show best to worst:

    order_df = res_dat.fram[order(res_dat.fram$err_rate), ]

    Plot the k's vs. error rate:

    plot(1:20, err_rate, sub = "k vs. Error Rate", ylab = "Error Rate", xlab = "k", type = "o")

    Finally, create the confusion matrix. I used lecture notes and OH notes and modified them for the confusion matrix:

    prediction = sapply(rand.rowx, vote, dist.matrix = euc_dist, Train = Dat_scaled, k = best.k)

  • # Find the truth
    # (note: this masks base R's t() transpose function)
    t = function(email, Train = Dat_scaled, k){
      true = Train[email, 'isSpam']
      return(true)
    }

    true = sapply(rand.rowx, t, k = best.k, Train = Dat_scaled)

    table(true, prediction)

           prediction
    true    FALSE  TRUE
    FALSE    4697   173
    TRUE      212  1459

    So the type 1 error rate for kNN is 173/(173+4697) ≈ 3.6%, the type 2 error rate is 212/(212+1459) ≈ 12.7%, and the total misclassification rate is (173+212)/6541 ≈ 5.9%. I used several methods to explore the misclassified observations, but found the most interesting to be subjectSpamWords, which I will talk about in this report. It is no surprise to me that this is a good classifier, as I can usually tell the difference between spam and non-spam just by reading the subject.
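The rates above come straight off the confusion matrix; as a small sketch of that arithmetic (using only the table values printed above):

```r
# kNN confusion-matrix counts from the table above
cm <- matrix(c(4697,  173,
                212, 1459), nrow = 2, byrow = TRUE,
             dimnames = list(true = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
type1    <- cm["FALSE", "TRUE"] / sum(cm["FALSE", ])  # ham flagged as spam
type2    <- cm["TRUE", "FALSE"] / sum(cm["TRUE", ])   # spam that slipped through
misclass <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / sum(cm)
round(100 * c(type1 = type1, type2 = type2, overall = misclass), 1)
# type1 = 3.6, type2 = 12.7, overall = 5.9
```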

    library(lattice)

    densityplot(~ subjectSpamWords, Data, groups = isSpam, col = c("green", "blue"),
                main = "Ham (green) and Spam (blue) for subjectSpamWords")

    On the plot below, we can see a significant difference between the Ham and Spam densities for the variable subjectSpamWords. Ham clearly has a higher density when there are no spam words in the subject, and Spam clearly has a higher density when there are spam words in the subject.

  • For the classification tree, I used code from class and modified it:

    library(rpart)

    ct = rpart(factor(isSpam) ~ ., Data)  # Data (not Dat_scaled) still contains isSpam

    #Makes a much nicer tree, found on r-project.org

    library(rpart.plot)

    prp(ct)

    To get the confusion matrix from rpart:

  • prediction = predict(ct, Data, type = "class")

    idx = 1:nrow(Data)
    true = sapply(idx, t, k = best.k, Train = Data)

    table(true, prediction)

           prediction
    true    FALSE  TRUE
    FALSE    4628   235
    TRUE      361  1321

    So the type 1 error rate for the rpart is 235/(235+4628) ≈ 4.8%, the type 2 error rate is 361/(361+1321) ≈ 21.5%, and the total misclassification rate is (235+361)/6541 ≈ 9.1%.

    So the kNN with cross-validation method for predicting spam is clearly better than the rpart() method, with a kNN misclassification rate of 5.9% versus an rpart() misclassification rate of 9.1%.

    I used several methods to explore the misclassified observations, but found the most interesting to be isWrote, which I will talk about in this report.

    library(lattice)

    densityplot(~ isWrote, Data, groups = isSpam, col = c("green", "blue"),
                main = "Ham (green) and Spam (blue) for isWrote")

  • As we can see here, there is a significant difference between emails with and without isWrote: the density when isWrote is TRUE is much greater for Spam than for Ham, and the opposite holds when isWrote is FALSE.

    Compare the test and training data sets to see if they have similar characteristics.

    I am going to use the two models I fit with the training data to predict the values for the test set.

    Examine the confusion matrix:

    # First, combine training and test data with rbind:
    # Using the original trainVariables instead of Data2 so that the two match:
    emails = rbind(testVariables, trainVariables)

    # Get the distance matrix and scale it, with isSpam removed again:
    euc_dist = as.matrix(dist(scale(emails[, 1:29])))

    # Get the prediction
    idx = 1:nrow(testVariables)

    # Get the train matrix
    train = euc_dist[6542:8541, 1:6541]
    prediction = sapply(idx, vote, dist.matrix = train, Train = trainVariables, k = best.k)

    # Get the confusion matrix
    true = sapply(idx, t, Train = testVariables, k = best.k)
    table(true, prediction)

           prediction
    true    FALSE  TRUE
    FALSE    1447    67
    TRUE       87   398

    # For rpart on the test data:
    library(rpart)
    library(rpart.plot)
    ct = rpart(factor(isSpam) ~ ., testVariables)
    prp(ct)

  • #Get the confusion matrix from rpart:

    prediction = predict(ct, testVariables, type = "class")

    idx = 1:nrow(testVariables)

    true = sapply(idx, t, k = best.k, Train = testVariables)

    table(true, prediction)

           prediction
    true    FALSE  TRUE
    FALSE    1420    90
    TRUE      155   335
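For completeness, the overall test-set rates can be pulled from the two tables above in the same way (a small sketch; note the row totals as printed differ slightly, 1999 for the kNN table vs. 2000 for the rpart table):

```r
# Overall test-set misclassification rates from the two tables above
knn_test   <- (67 + 87)  / (1447 + 67 + 87 + 398)   # kNN: about 7.7%
rpart_test <- (90 + 155) / (1420 + 90 + 155 + 335)  # rpart: 12.25%
stopifnot(knn_test < rpart_test)  # kNN again misclassifies less on the test set
```

So the kNN model also holds up better on held-out data, consistent with the training-set comparison above.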