Countering Spam Using Classification Techniques
Steve Webb (webb@cc.gatech.edu)
Data Mining Guest Lecture, February 21, 2008


Page 1

Countering Spam Using Classification Techniques

Steve Webb (webb@cc.gatech.edu)
Data Mining Guest Lecture
February 21, 2008

Page 2

Overview

Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions

Page 3

Introduction

The Internet has spawned numerous information-rich environments

Email Systems
World Wide Web
Social Networking Communities

Openness facilitates information sharing, but it also makes these environments vulnerable…

Page 4

Denial of Information (DoI) Attacks

Deliberate insertion of low quality information (or noise) into information-rich environments

The information analog of Denial of Service (DoS) attacks

Two goals:
  Promotion of ideals by means of deception
  Denial of access to high quality information

Spam is currently the most prominent example of a DoI attack

Page 5

Overview

Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions

Page 6

Countering Email Spam

Close to 200 billion (yes, billion) emails are sent each day

Spam accounts for around 90% of that email traffic

~2 million spam messages every second

Page 7

Old Email Spam Examples

Page 8

Problem Description

Email spam detection can be modeled as a binary text classification problem

Two classes: spam and legitimate (non-spam)

Example of supervised learning: build a model (classifier) from training data to approximate the target function

Construct a function φ: M → {spam, legitimate} that matches the target function Φ: M → {spam, legitimate} as closely as possible

Page 9

Problem Description (cont.)

How do we represent a message?

How do we generate features?

How do we process features?

How do we evaluate performance?

Page 10

How do we represent a message?

Classification algorithms require a consistent format

Salton’s vector space model (“bag of words”) is the most popular representation

Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
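The bag-of-words mapping can be sketched in a few lines of Python (the toy vocabulary and message below are invented for illustration):

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    """Map a message to a feature vector <f1, ..., fn> of token counts,
    one dimension per term in a fixed vocabulary."""
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocabulary]  # Counter gives 0 for absent terms

vocab = ["free", "viagra", "meeting", "report"]   # toy vocabulary
vector = bag_of_words("FREE free viagra now", vocab)
# vector is [2, 1, 0, 0]
```

Real filters use vocabularies of thousands of terms drawn from the training corpus; the fixed ordering is what gives every message a consistent vector format.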

Page 11

How do we generate features?

Sources of information:
  SMTP connections
    Network properties
  Email headers
    Social networks
  Email body
    Textual parts
    URLs
    Attachments

Page 12

How do we process features?

Feature Tokenization
  Alphanumeric tokens
  N-grams
  Phrases

Feature Scrubbing
  Stemming
  Stop word removal

Feature Selection
  Simple feature removal
  Information-theoretic algorithms
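The tokenization and scrubbing steps can be sketched as follows (the stop-word list is a tiny illustrative sample, and the suffix-stripping rule is a crude stand-in for a real stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of"}  # tiny illustrative list

def tokenize(text):
    # Alphanumeric tokens, one of the tokenization schemes on the slide
    return re.findall(r"[a-z0-9]+", text.lower())

def scrub(tokens):
    # Stop-word removal, then a crude plural-stripping stand-in for stemming
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]

print(scrub(tokenize("The cheapest meds and loans")))
# ['cheapest', 'med', 'loan']
```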

Page 13

How do we evaluate performance?

From the confusion-matrix counts (a = legitimate classified as legitimate, b = legitimate classified as spam, c = spam classified as legitimate, d = spam classified as spam):

P = d / (b + d)
R = d / (c + d)
FP = b / (a + b)
FN = c / (c + d)

Traditional IR metrics
  Precision vs. Recall

False positives vs. False negatives

Imbalanced error costs

ROC curves
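The four metrics follow directly from the confusion-matrix counts; a minimal sketch (the example counts are invented for illustration):

```python
def metrics(a, b, c, d):
    """a: legitimate->legitimate, b: legitimate->spam (false positives),
    c: spam->legitimate (false negatives), d: spam->spam."""
    precision = d / (b + d)   # P: fraction of flagged messages that are spam
    recall = d / (c + d)      # R: fraction of spam that is caught
    fp_rate = b / (a + b)     # FP: legitimate mail wrongly flagged
    fn_rate = c / (c + d)     # FN: spam that slips through
    return precision, recall, fp_rate, fn_rate

p, r, fp, fn = metrics(a=90, b=10, c=20, d=80)
# r = 0.8, fp = 0.1, fn = 0.2
```

The imbalanced error costs on the slide mean fp is usually weighted far more heavily than fn: deleting a legitimate message is worse than letting one spam through.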

Page 14

Classification History

Sahami et al. (1998)
  Used a Naïve Bayes classifier
  Were the first to apply text classification research to the spam problem

Pantel and Lin (1998)
  Also used a Naïve Bayes classifier
  Found that Naïve Bayes outperforms RIPPER
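A minimal multinomial Naïve Bayes filter in the spirit of these early systems (the training data, tokenization, and smoothing choices below are illustrative, not taken from the papers):

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (token_list, label) pairs, label in {"spam", "legit"}."""
    counts = {"spam": Counter(), "legit": Counter()}
    docs = Counter()
    for tokens, label in messages:
        counts[label].update(tokens)
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["legit"])
    return counts, docs, vocab

def classify(tokens, counts, docs, vocab):
    total_docs = sum(docs.values())
    best, best_score = None, -math.inf
    for label in ("spam", "legit"):
        # log P(label) + sum of log P(token | label), Laplace-smoothed
        score = math.log(docs[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((counts[label][t] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

data = [(["free", "viagra"], "spam"), (["cheap", "free", "pills"], "spam"),
        (["meeting", "report"], "legit"), (["project", "report", "draft"], "legit")]
model = train(data)
print(classify(["free", "pills"], *model))  # classifies as "spam" on this toy data
```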

Page 15

Classification History (cont.)

Drucker et al. (1999)
  Evaluated Support Vector Machines as a solution to spam
  Found that SVM is more effective than RIPPER and Rocchio

Hidalgo and Lopez (2000)
  Found that decision trees (C4.5) outperform Naïve Bayes and k-NN

Page 16

Classification History (cont.)

Up to this point, private corpora were used exclusively in email spam research

Androutsopoulos et al. (2000a)
  Created the first publicly available email spam corpus (Ling-spam)
  Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier

Page 17

Classification History (cont.)

Androutsopoulos et al. (2000b)
  Created another publicly available email spam corpus (PU1)
  Confirmed previous research showing that Naïve Bayes outperforms a keyword-based filter

Carreras and Marquez (2001)
  Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes

Page 18

Classification History (cont.)

Androutsopoulos et al. (2004)
  Created 3 more publicly available corpora (PU2, PU3, and PUA)
  Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB

Zhang et al. (2004)
  Used Ling-spam, PU1, and the SpamAssassin corpora
  Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB

Page 19

Classification History (cont.)

CEAS (2004–present)

  Focuses solely on email and anti-spam research
  Generates a significant amount of academic and industry anti-spam research

Klimt and Yang (2004)
  Published the Enron Corpus – the first large-scale corpus of legitimate email messages

TREC Spam Track (2005–present)
  Produces new corpora every year
  Provides a standardized platform to evaluate classification algorithms

Page 20

Ongoing Research

Concept Drift

New Classification Approaches

Adversarial Classification

Image Spam

Page 21

Concept Drift

Spam content is extremely dynamic

Topic drift (e.g., specific scams)
Technique drift (e.g., obfuscations)

How do we keep up with the Joneses?

Batch vs. Online Learning

[Figure: monthly percentage of spam messages triggering obfuscation-related rules (OBFUSCATING_COMMENT, INTERRUPTUS, HTML_FONT_LOW_CONTRAST, HTML_TINY_FONT), 01/03–01/06]
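One simple online-learning response to concept drift is to update the model incrementally while letting old evidence fade; this sketch is illustrative (the decay scheme and class are not from the lecture):

```python
from collections import Counter

class OnlineCounts:
    """Streaming per-class token counts with exponential decay, so recent
    messages outweigh stale ones -- a toy response to concept drift."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.spam = Counter()
        self.legit = Counter()

    def update(self, tokens, label):
        target = self.spam if label == "spam" else self.legit
        for c in (self.spam, self.legit):   # forget old evidence a little
            for t in c:
                c[t] *= self.decay
        for t in tokens:
            target[t] += 1.0

    def spamminess(self, token):
        s, l = self.spam[token], self.legit[token]
        return (s + 1.0) / (s + l + 2.0)    # smoothed spam ratio in [0, 1]

model = OnlineCounts()
model.update(["free", "pills"], "spam")
model.update(["meeting", "notes"], "legit")
```

Unlike batch retraining, each message is folded in as it arrives, so the filter tracks topic and technique drift without periodic rebuilds.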

Page 22

New Classification Approaches

Filter Fusion

Compression-based Filtering

Network behavioral clustering

Page 23

Adversarial Classification

Classifiers assume a clear distinction between spam and legitimate features

Camouflaged messages
  Mask spam content with legitimate content
  Disrupt decision boundaries for classifiers
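A toy illustration of how camouflage dilutes a classifier's evidence (the scoring rule and word lists below are invented for illustration, not a real filter):

```python
def spam_score(tokens, spam_words, legit_words):
    """Toy linear score: +1 for each known spammy token,
    -1 for each known legitimate token."""
    return sum((t in spam_words) - (t in legit_words) for t in tokens)

SPAM = {"free", "viagra", "winner"}
LEGIT = {"meeting", "report", "schedule", "project"}

spam_msg = ["free", "viagra", "winner"]
camouflaged = spam_msg + ["meeting", "report", "schedule", "project"]
# spam_score(spam_msg, ...) = 3, but the camouflaged copy scores -1,
# slipping past a decision threshold of 0
```

Padding the same spam payload with enough legitimate-looking words pushes the score across the boundary, which is exactly the attack measured on the next slides.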

Page 24

[Figure: weighted accuracy (λ = 9) vs. number of retained features (640 down to 10) for Naive Bayes, SVM, and LogitBoost, shown in three panels]

Camouflage Attacks

Baseline performance
  Accuracies consistently higher than 98%

Classifiers under attack
  Accuracies degrade to between 50% and 70%

Retrained classifiers
  Accuracies climb back to between 91% and 99%

Page 25

Camouflage Attacks (cont.)

Retraining postpones the problem, but it doesn’t solve it

We can identify features that are less susceptible to attack, but that's simply another stalling technique.

[Figure: fraction of false negatives by round number (A denotes an attack round) for Naive Bayes, SVM, and LogitBoost]

Page 26

Image Spam

What happens when an email does not contain textual features?

OCR is easily defeated

Classification using image properties

Page 27

Overview

Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions

Page 28

Countering Web Spam

What is web spam?
  Traditional definition
  Our definition

Between 13.8% and 22.1% of all web pages

Page 29

Ad Farms

Only contain advertising links (usually ad listings)

Elaborate entry pages used to deceive visitors

Page 30

Ad Farms (cont.)

Clicking on an entry page link leads to an ad listing

Ad syndicators provide the content

Web spammers create the HTML structures

Page 31

Parked Domains

Domain parking services
  Provide place holders for newly registered domains
  Allow ad listings to be used as place holders to monetize a domain

Inevitably, web spammers abused these services

Page 32

Parked Domains (cont.)

Functionally equivalent to Ad Farms
  Both rely on ad syndicators for content
  Both provide little to no value to their visitors

Unique Characteristics
  Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
  Typically for sale by owner (“Offer To Buy This Domain”)

Page 33

Parked Domains (cont.)

Page 34

Advertisements

Pages advertising specific products or services

Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

Page 35

Problem Description

Web spam detection can also be modeled as a binary text classification problem

Salton’s vector space model is quite common

Feature processing and performance evaluation are also quite similar

But what about feature generation…

Page 36

How do we generate features?

Sources of information:
  HTTP connections
    Hosting IP addresses
    Session headers
  HTML content
    Textual properties
    Structural properties
  URL linkage structure
    PageRank scores
    Neighbor properties
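The PageRank scores mentioned above come from power iteration over the link graph; a minimal sketch (the toy graph and parameters are illustrative):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outlinks]}.
    A simplified version of the link-based scores used as web spam features."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:          # p passes rank to each page it links to
                    new[q] += share
            else:                       # dangling page: spread rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(graph)
```

Link spammers try to inflate these scores with artificial link structures, which is why ratios of a page's rank to its neighbors' are also used as features.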

Page 37

Classification History

Davison (2000)
  Was the first to investigate link-based web spam
  Built decision trees to successfully identify “nepotistic links”

Becchetti et al. (2005)
  Revisited the use of decision trees to identify link-based web spam
  Used link-based features such as PageRank and TrustRank scores

Page 38

Classification History (cont.)

Drost and Scheffer (2005)
  Used Support Vector Machines to classify web spam pages
  Relied on content-based features as well as link-based features

Ntoulas et al. (2006)
  Built decision trees to classify web spam
  Used content-based features (e.g., fraction of visible content, compressibility, etc.)
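A compressibility feature can be approximated with a compression ratio; keyword-stuffed spam pages are highly repetitive and compress unusually well. This zlib-based sketch is an illustrative stand-in, not the exact measure from the paper:

```python
import zlib

def compressibility(text):
    """Ratio of original size to zlib-compressed size; higher means
    more repetitive content (a web spam signal)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

stuffed = "cheap meds " * 200                     # keyword-stuffed toy page
normal = "This page discusses a variety of unrelated topics in ordinary prose."
# the stuffed page compresses far better than the ordinary one
```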

Page 39

Classification History (cont.)

Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets

Webb et al. (2006)
  Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
  http://www.webbspamcorpus.org

Castillo et al. (2006)
  Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)

Page 40

Classification History (cont.)

Castillo et al. (2007)

  Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
  Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]

Webb et al. (2008)
  Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
  Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
  Found that these classifiers are comparable to (and in many cases, better than) existing approaches

Page 41

Ongoing Research

Redirection

Phishing

Social Spam

Page 42

Redirection

144,801 unique redirect chains (1.54 average HTTP redirects)

43.9% of web spam pages use some form of HTML or JavaScript redirection

Breakdown of observed redirection techniques:
  49%  302 HTTP redirect
  14%  frame redirect
  11%  301 HTTP redirect
   8%  iframe redirect
   7%  meta refresh and location.replace()
   5%  meta refresh
   3%  meta refresh and location
   2%  location*
   1%  Other
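Detecting the HTML and JavaScript variants above can be sketched with simple pattern matching (the patterns are illustrative; real crawlers render pages rather than grep them, since spammers obfuscate scripts):

```python
import re

REDIRECT_PATTERNS = [
    re.compile(r'http-equiv\s*=\s*["\']?refresh', re.I),  # meta refresh
    re.compile(r'location\.replace\s*\(', re.I),          # JS location.replace()
    re.compile(r'(?:window|document)\.location\s*=', re.I),  # JS location assignment
]

def has_client_redirect(html):
    """Flag pages using the HTML/JavaScript redirection techniques tallied above."""
    return any(p.search(html) for p in REDIRECT_PATTERNS)

page = '<meta http-equiv="refresh" content="0;url=http://example.com/">'
```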

Page 43

Phishing

Interesting form of deception that affects email and web users

Another form of adversarial classification

Page 44

Social Spam

Comment spam

Bulletin spam

Message spam

Page 45

Conclusions

Email and web spam are currently two of the largest information security problems

Classification techniques offer an effective way to filter this low quality information

Spammers are extremely dynamic, which opens up several important areas of future research…

Page 46

Questions