Lexical Feature Based Phishing URL Detection Using Online Learning
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2011/3/17 Data Mining and Machine Learning Lab. 1
Paper Information
  Authors:
    Aaron Blum (University of Alabama, Birmingham)
    Brad Wardman (University of Alabama, Birmingham)
    Thamar Solorio (University of Alabama, Birmingham)
  Source:
    3rd ACM Artificial Intelligence Security Workshop, 2010
Outline
  Introduction
  Related Work
  Approach
  Data
  Evaluation
  Conclusion
Introduction
  Phishing
    A cybercrime carried out through spammed emails and fraudulent websites
    Entices victims to provide sensitive information
    The information is used to steal identities or gain access to money
  Characteristics
    Highly dynamic environment
    Models need to be updated frequently
  New ideas
    Combine online learning with a content-inspection based approach
    Model trained largely on lexical features only (without host-based features)
    Results show that URL-inspection based detection performs as well as content-inspection based detection
Related Work
  Content based Phishing URL Detection
    Use the similarity between the content files to detect phishing websites
  Purely URL based Malicious URL Detection
    Use host information and URL lexical features with online learning algorithms
  PhishNet
    Extend the usability of blacklists
  Domain Blacklisting
    Expand the blacklist using DNS zone file data and WHOIS information
Approach
  Feature Extraction
    Delimiters: "/", "?", ".", "=" and "_"
    Bigram combination
    Lexical feature groups
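The feature extraction step can be sketched roughly as follows. This is a minimal illustration, assuming that tokens are obtained by splitting on the five delimiters above and that bigrams are pairs of adjacent tokens. The names `tokenize` and `lexical_features`, the `tok:` / `bi:` prefixes, and the sample URL are hypothetical, and the lexical feature groups from the slide are not reproduced here.

```python
import re

# Delimiters listed on the slide: "/", "?", ".", "=" and "_".
DELIMITERS = re.compile(r"[/?.=_]")


def tokenize(url):
    """Split a URL on the delimiters and drop empty pieces."""
    return [tok for tok in DELIMITERS.split(url) if tok]


def lexical_features(url):
    """Return a set of binary features: single tokens plus adjacent-token bigrams."""
    tokens = tokenize(url)
    unigrams = {"tok:" + t for t in tokens}
    # Assumption: a bigram is a pair of neighbouring tokens in URL order.
    bigrams = {"bi:" + a + "|" + b for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams


# Hypothetical URL, used only to show the shape of the output.
features = lexical_features("http://example.com/login.php?user_id=1")
```

Each URL becomes a sparse set of string features, which suits an online learner that keeps one weight per feature.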
  Learning algorithm
    Confidence-Weighted (CW) algorithm
    Updates the model with different weights according to the features' occurrence
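To make the idea of occurrence-dependent updates concrete, here is a minimal sketch of an online learner that keeps a mean weight and a variance per feature. Note that this is diagonal AROW, a close relative of CW with a simpler closed-form update, swapped in for illustration; it is not the exact CW update from the paper. The class name `DiagonalAROW`, the parameter `r`, and the sample features are assumptions.

```python
class DiagonalAROW:
    """Online linear classifier with one mean and one variance per feature.

    Stand-in for CW: features seen often get a small variance and therefore
    smaller updates, while rarely seen features are updated more aggressively.
    """

    def __init__(self, r=1.0):
        self.r = r        # regularisation constant (assumed value)
        self.mean = {}    # feature -> weight, implicitly 0.0
        self.var = {}     # feature -> variance, implicitly 1.0

    def score(self, features):
        return sum(self.mean.get(f, 0.0) for f in features)

    def predict(self, features):
        return 1 if self.score(features) >= 0 else -1

    def update(self, features, label):
        """label is +1 (phishing) or -1 (benign); features are binary."""
        margin = label * self.score(features)
        if margin >= 1.0:
            return  # already correct with enough margin: no change
        confidence = sum(self.var.get(f, 1.0) for f in features)
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - margin) * beta
        for f in features:
            v = self.var.get(f, 1.0)
            self.mean[f] = self.mean.get(f, 0.0) + alpha * label * v
            self.var[f] = v - beta * v * v  # seen features become more certain


# Hypothetical feature sets, for illustration only.
model = DiagonalAROW()
phish = {"tok:login", "tok:paypal"}
benign = {"tok:docs", "tok:python"}
for _ in range(5):
    model.update(phish, 1)
    model.update(benign, -1)
```

Because each example is processed once and then discarded, the model can be refreshed continuously as new labelled URLs arrive.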
Approach (cont.)
  MD5 Matching
    Use files' MD5 checksums to check file similarity
    Easy to evade (by varying the content)
    Examples
  Deep MD5 Matching
    Download all the associated content files
    Compare the similarity between two websites' content files by Kulczynski 2 coefficient
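The comparison in Deep MD5 Matching can be sketched as below, assuming each website is represented by the set of MD5 digests of its content files. The Kulczynski 2 coefficient of two sets A and B is the mean of |A ∩ B| / |A| and |A ∩ B| / |B|. The helper names `md5_set` and `kulczynski2` and the byte strings are hypothetical, and downloading the files is left out.

```python
import hashlib


def md5_set(files):
    """Return the set of MD5 hex digests for an iterable of file contents (bytes)."""
    return {hashlib.md5(content).hexdigest() for content in files}


def kulczynski2(set_a, set_b):
    """Kulczynski 2 coefficient: mean of the two overlap fractions."""
    if not set_a or not set_b:
        return 0.0  # assumption: an empty site matches nothing
    shared = len(set_a & set_b)
    return 0.5 * (shared / len(set_a) + shared / len(set_b))


# Two hypothetical sites that share a logo and a stylesheet but differ in the main page.
site_a = md5_set([b"<html>login</html>", b"logo-bytes", b"style-bytes"])
site_b = md5_set([b"<html>login v2</html>", b"logo-bytes", b"style-bytes"])
similarity = kulczynski2(site_a, site_b)
```

Changing the main page alone would defeat a single-file MD5 match, while the set-based score still shows that two of the three files are shared.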
Data
  Data Source
    UAB Phishing Data Mine
      Two and a half years of collection time
      Benign URLs may look "phishy" (e.g.)
      9,506 unique domains
      25,203 URLs (6,114 malicious)
    Cyveillance
      18,990 unique domains
      34,234 URLs (all malicious)
    All feeds are fully de-duplicated
  Datasets
    UAB Feeds
    Cyveillance full
    Cyveillance abridged
    Mixed
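As a rough illustration of the de-duplication and the counts above, the sketch below drops exact duplicate URLs and counts distinct hostnames. It assumes that the hostname is an acceptable proxy for "domain", which may differ from how the feeds were actually counted. The function name `summarize` and the sample feed are hypothetical; the last line only restates the UAB figures from the slide as a ratio.

```python
from urllib.parse import urlsplit


def summarize(urls):
    """Return (number of unique URLs, number of unique hostnames)."""
    unique_urls = list(dict.fromkeys(urls))  # drop exact duplicates, keep order
    hosts = {urlsplit(u).hostname for u in unique_urls}
    hosts.discard(None)  # URLs without a parseable host
    return len(unique_urls), len(hosts)


# Hypothetical feed with one repeated entry.
feed = [
    "http://example.com/a",
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/login",
]
url_count, host_count = summarize(feed)

# Share of malicious URLs in the UAB data, from the slide: 6,114 of 25,203.
uab_malicious_share = 6114 / 25203
```

By this ratio, roughly a quarter of the UAB URLs are malicious, whereas the Cyveillance feed contains malicious URLs only.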
Data (cont.)
Percentage of total URLs vs. Individual Domains
Evaluation
  Experiment setting
    Training and testing were conducted on daily batches
    Training was initially conducted on UAB data
    The model is updated by a daily URL blacklist/whitelist feed
    False positive and false negative error rates were computed for every prediction
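The daily protocol can be sketched as a predict-then-train loop: each day's batch is scored with the current model before it is used for updating. This is a minimal sketch under stated assumptions: labels are +1 (phishing) and -1 (benign), rates are aggregated per day, and the stand-in model simply remembers the last label seen for an input. The names `error_rates` and `run_daily` are hypothetical.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate) for one batch."""
    pairs = list(zip(labels, predictions))
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    negatives = sum(1 for y in labels if y == -1)
    positives = sum(1 for y in labels if y == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def run_daily(batches, predict, update):
    """For each day: predict the whole batch, score it, then train on it."""
    daily = []
    for batch in batches:
        predictions = [predict(x) for x, _ in batch]
        labels = [y for _, y in batch]
        daily.append(error_rates(labels, predictions))
        for x, y in batch:
            update(x, y)
    return daily


# Stand-in model (not a real classifier): remembers the last label per input.
memory = {}
batches = [
    [("a", 1), ("b", -1)],
    [("a", 1), ("b", -1), ("c", 1)],
]
rates = run_daily(batches, lambda x: memory.get(x, -1), memory.__setitem__)
```

Scoring before updating means every reported error is made on data the model has not yet trained on.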
Evaluation (cont.)
  [Figures not preserved in this transcript]
Conclusion
  Lexical-feature-based learning with the CW algorithm provides robust performance
  Quality, diverse training data can achieve an accuracy higher than 97%
  For the proposed system
    Training data can be collected from any blacklist
    Easy to implement, with robust performance
Thanks for your attention
Q&A?
Lexical Feature Group
URLs including the recipient’s email
Data in UAB Phishing Data Mine