Machine Learning Basics with Applications to Email Spam Detection
UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI
General background information about the process of machine learning
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Motivation of this project
⦿ Spam email is a nuisance for every personal email account
● 60% of emails in January 2004 were spam
● Fraud & phishing
⦿ Spam vs. ham email
Our Goal
Spam Email example
Ham Email example
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra whitespace
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20.
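The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the `MAX_TERM_LENGTH` constant are assumptions made for the example, and a real run would use a much fuller stop-word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "you", "your"}
MAX_TERM_LENGTH = 20  # terms longer than this are deleted

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                      # capital letters -> lowercase
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)   # remove punctuation
    terms = text.split()                     # split() also drops extra whitespace
    # Remove stop-words and delete terms with length greater than 20.
    return [t for t in terms if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH]


print(preprocess("WIN $1000 NOW!!!  Click the link to claim your prize"))
```

Replacing punctuation with a space rather than deleting it keeps words such as "free,money" from being glued together; that is a design choice of this sketch.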
Pre-processing of data
⦿ Original email
Pre-processing of data
⦿ After pre-processing
Pre-processing of data
⦿ Extract terms
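One common way to extract terms is to collect every distinct term into a term list and then describe each email by how often each term occurs in it. The slide does not say how the project did this, so the sketch below is an assumption; `build_term_list` and `term_counts` are hypothetical helper names, and the two sample emails are made up.

```python
from collections import Counter


def build_term_list(tokenized_emails):
    """Return the sorted list of distinct terms across all emails."""
    return sorted({term for email in tokenized_emails for term in email})


def term_counts(email_terms, term_list):
    """Return one occurrence count per entry of term_list for a single email."""
    counts = Counter(email_terms)
    return [counts[term] for term in term_list]


# Made-up, already pre-processed emails.
emails = [["win", "prize", "win"], ["meeting", "notes"]]
term_list = build_term_list(emails)
vectors = [term_counts(email, term_list) for email in emails]
print(term_list)
print(vectors)
```

Each email becomes a fixed-length numeric vector, which is the form that classifiers such as KNN expect.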
Pre-processing of data
⦿ Reduce terms
● Keep word length < 20
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis
What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification
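A minimal sketch of the idea: find the k training samples closest to the new email and take a majority vote over their labels. Euclidean distance, the function names, and the toy two-term vectors are assumptions for this example; the slides do not state which distance measure or value of k the project used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    # Sort training samples by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up term-count vectors: [count of "win", count of "meeting"].
train_vectors = [[2, 0], [3, 0], [0, 2], [0, 3]]
train_labels = ["spam", "spam", "ham", "ham"]
print(knn_predict(train_vectors, train_labels, [2, 1], k=3))
```

An odd k avoids tied votes when there are only two classes.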
What is K Nearest Neighbor
Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64% - very low
⦿ KNN classifier does not fit our project
⦿ Term-list is still too large
⦿ Try a different classification method and see whether its evaluation results are better than the KNN results
⦿ Continue to reduce the size of the term list by removing terms that are not meaningful
Steps for improvement
⦿ Removed sparse terms
⦿ Reduced length threshold
⦿ Created hashtable
⦿ Used alternative classifier
● Naive Bayes classifier
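One plausible reading of removing sparsity is dropping terms that appear in only a small fraction of the emails. The sketch below follows that reading; it is an assumption, as are the name `remove_sparse_terms`, the `min_fraction` parameter, and the sample data.

```python
def remove_sparse_terms(tokenized_emails, min_fraction=0.01):
    """Keep only terms that occur in at least min_fraction of the emails."""
    n = len(tokenized_emails)
    doc_freq = {}
    for email in tokenized_emails:
        for term in set(email):  # count each term once per email
            doc_freq[term] = doc_freq.get(term, 0) + 1
    kept = {term for term, df in doc_freq.items() if df / n >= min_fraction}
    return [[term for term in email if term in kept] for email in tokenized_emails]


# Made-up example: keep terms found in at least half of the emails.
emails = [["win", "prize"], ["win", "money"], ["win", "meeting"], ["meeting", "notes"]]
print(remove_sparse_terms(emails, min_fraction=0.5))
```

Raising `min_fraction` shrinks the term list further, at the risk of discarding rare but informative terms.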
Hashtable
⦿ Calculate a hash key for each term in the term list
⦿ When a collision occurs, use separate chaining
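A minimal sketch of a hash table with separate chaining: each bucket holds a list (the chain) of term/count pairs, so terms whose hash keys collide share a bucket. The class name `ChainedHashTable`, the polynomial hash function, and the bucket count are illustrative assumptions, not the project's implementation.

```python
class ChainedHashTable:
    """Maps terms to counts; colliding terms are chained in the same bucket."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Simple polynomial string hash (an illustrative choice).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, term, count=1):
        """Increase the stored count for term, inserting it if absent."""
        bucket = self.buckets[self._index(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += count
                return
        bucket.append([term, count])  # new term: append to this bucket's chain

    def get(self, term):
        """Return the stored count for term, or 0 if it was never added."""
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return 0


# A single bucket forces every term to collide, which exercises the chain.
table = ChainedHashTable(num_buckets=1)
for term in ["win", "prize", "win"]:
    table.add(term)
print(table.get("win"), table.get("prize"), table.get("missing"))
```

With enough buckets the chains stay short, so looking up a term takes roughly constant time instead of a scan over the whole term list.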
Naive Bayes classifier
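A Naive Bayes classifier scores each class by combining the class prior with per-term likelihoods, treating terms as independent given the class, and picks the class with the highest score. The sketch below is a multinomial variant with add-one (Laplace) smoothing; the variant, the function names, and the training data are assumptions, since the slide does not give the project's formulation.

```python
import math
from collections import Counter


def train_naive_bayes(tokenized_emails, labels):
    """Estimate log priors and per-class term counts from labeled emails."""
    vocab = {term for email in tokenized_emails for term in email}
    priors, counts, totals = {}, {}, {}
    for c in set(labels):
        docs = [e for e, y in zip(tokenized_emails, labels) if y == c]
        priors[c] = math.log(len(docs) / len(labels))
        counts[c] = Counter(term for e in docs for term in e)
        totals[c] = sum(counts[c].values())
    return priors, counts, totals, vocab


def predict_naive_bayes(model, email_terms):
    """Return the class with the highest log posterior score."""
    priors, counts, totals, vocab = model
    scores = {}
    for c in priors:
        score = priors[c]
        for term in email_terms:
            if term in vocab:  # terms never seen in training are ignored
                # Add-one smoothing keeps unseen class/term pairs from zeroing the score.
                score += math.log((counts[c][term] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Made-up training data.
emails = [["win", "prize", "now"], ["win", "money"], ["meeting", "notes"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(emails, labels)
print(predict_naive_bayes(model, ["win", "prize"]))
```

Working with log probabilities avoids numeric underflow when an email contains many terms.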
Secondary Results
⦿ Accuracy increased from 62% to 82.36%
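A percentage like this is typically the fraction of test emails whose predicted label matches the true label. The slides do not describe the evaluation procedure, so the sketch below is an assumption, and the labels in it are made up rather than taken from the project's results.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: three of the four predictions are correct.
predicted = ["spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "spam", "spam"]
print(f"{accuracy(predicted, actual):.2%}")
```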
Suggestions for further improvement
⦿ Revise pre-processing
⦿ Apply additional classifiers
Thank you
⦿ Questions?