a comparison of feature-based and feature-free case-based reasoning for spam filtering

A Comparison of Feature-Based and Feature-Free Case-Based Reasoning for Spam Filtering. Derek Bridge, University College Cork; work done with Sarah Jane Delany, Dublin Institute of Technology


Post on 25-Jan-2016


Page 1: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

A Comparison of Feature-Based and Feature-Free Case-Based Reasoning

for Spam Filtering

Derek Bridge, University College Cork

work done with

Sarah Jane Delany, Dublin Institute of Technology

Page 2: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Introduction
• Case-Based Spam Filtering
  – Feature-Based
  – Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 3: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Introduction

• From the Spamhaus project (www.spamhaus.org)

  – “An electronic message is ‘spam’ IF:
    1) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; AND
    2) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.”

• “[It’s] about consent, not content”

• We focus on email spam

Page 4: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Spam Filtering

• Spam filtering is classification:
  – is an incoming email ham or spam?
• Spam filters
  – procedural
    • whitelists, blacklists, challenge-response systems,…
  – collaborative
    • sharing signatures
  – content-based
    • rules, decision trees, probabilities, case bases,…
  – hybrid.

Page 5: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Challenges of Spam Filtering

• Spam is subjective and personal;
• It is heterogeneous;
• There is a high cost to false positives (where ham is classified as spam); and
• It is constantly changing (‘concept drift’).

Page 6: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Introduction
• Case-Based Spam Filtering
  – Feature-Based
  – Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 7: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Case-Based Reasoning

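[Figure: the CBR cycle. A New problem is matched against the Previous Cases (RETRIEVE) to give a Retrieved Case; this is reused to give an Adapted Case (REUSE), which is revised into a Tested/Repaired Case (REVISE) and stored as a Learned Case (RETAIN). General knowledge supports every step.]

[Aamodt & Plaza 1994]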

Page 8: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Case-Based Reasoning

[Figure: the CBR cycle with an added MAINTAIN step; surviving labels: General knowledge, Previous Case, MAINTAIN]

Page 9: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Is Case-Based Reasoning (CBR) the answer?

• Spam is subjective and personal;
  – Users can have individual case bases created from their own emails
• It is heterogeneous;
  – It is known that CBR handles disjunctive concepts well
• There is a high cost to false positives (where ham is classified as spam); and
  – We can bias CBR away from false positives
• It is constantly changing (‘concept drift’).
  – Case bases can be updated incrementally

Page 10: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Introduction
• Case-Based Spam Filtering
  – Feature-Based
  – Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 11: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Email Classification Using Examples (ECUE)

• ECUE uses Case-Based Reasoning (CBR) to classify emails
• A case base contains a user’s email (both ham and spam)
• ECUE classifies an incoming email using the k-nearest neighbour algorithm:
  – It retrieves from the case base the k nearest neighbours (the k that are closest or most similar)
  – The cases it retrieves then vote to decide the class of the new email
  – To bias away from false positives, ECUE uses unanimous voting.
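
The retrieval and voting steps on this slide can be sketched as follows. This is a minimal illustration, not ECUE's actual code: the function name `classify_unanimous`, the `(representation, label)` tuple layout and the pluggable `distance` callable are all assumptions made for the example.

```python
def classify_unanimous(query, case_base, distance, k=3):
    """Classify `query` as "spam" only if all k nearest cases are spam.

    case_base: list of (representation, label) pairs, label in {"ham", "spam"}.
    distance:  any callable giving the dissimilarity of two representations
               (hypothetical; the slides define two concrete choices later).
    """
    # Rank the stored cases by distance to the query and keep the k closest.
    neighbours = sorted(case_base, key=lambda case: distance(query, case[0]))[:k]
    # Unanimous voting: a single ham neighbour is enough to predict ham,
    # which biases the filter away from false positives.
    if neighbours and all(label == "spam" for _, label in neighbours):
        return "spam"
    return "ham"
```

Requiring all k neighbours to agree trades some missed spam for fewer lost legitimate emails, which matches the cost asymmetry the slides describe.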

Page 12: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Based ECUE

• Email
• Features extracted ($f_{ij}$)
  – words, characters, structural features
• Binary representation: $f_{i1} = 1$ or $f_{i1} = 0$
• Each email becomes a case $e_i = \langle f_{i1}, f_{i2}, \ldots, f_{iN}, \text{class label} \rangle$

[Diagram: Emails → Feature Extraction → Case base]

Page 13: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Based ECUE

• Information Gain used to select the 700 most predictive features

[Diagram: Emails → Feature Extraction → Case base → Feature Selection → Case base]
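
Information Gain for a binary feature is the class entropy minus the class entropy remaining once the feature's value is known. The sketch below ranks feature columns that way; the function names and the list-of-lists data layout are assumptions for illustration, and the slide's setting of 700 features would be passed as `n_features`.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG of one binary feature: H(class) - H(class | feature)."""
    total = len(labels)
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def select_features(cases, labels, n_features):
    """Return the indices of the n_features columns with the highest IG."""
    n_columns = len(cases[0])
    gains = [information_gain([case[j] for case in cases], labels)
             for j in range(n_columns)]
    return sorted(range(n_columns), key=lambda j: gains[j], reverse=True)[:n_features]
```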

Page 14: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Based ECUE

[Diagram: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Selection → Case base]

• Competence-Based Editing used to edit case base

Page 15: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Runtime System

Feature-Based ECUE

[Diagram: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Selection → Case base; at runtime a New Case is passed to Classification against the edited case base, giving e.g. “spam!”]

Page 16: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Based ECUE

• The distance between cases is a count of the number of features that they do not share

• The Naïve Bayes classifier is thought to be among the best for spam filtering

• Feature-Based ECUE has accuracy comparable to, and sometimes slightly better than, Naïve Bayes
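
With binary feature vectors, the distance described on this slide (the number of features two cases do not share) is the Hamming distance. A minimal sketch, with `feature_distance` as a hypothetical name:

```python
def feature_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("cases must use the same feature set")
    return sum(1 for x, y in zip(a, b) if x != y)
```

For example, `feature_distance([1, 0, 1, 1], [1, 1, 0, 1])` differs in the second and third features.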

Page 17: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Introduction
• Case-Based Spam Filtering
  – Feature-Based
  – Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 18: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Free ECUE

• Alternative to Feature-Based ECUE
• Inspired by theory of Kolmogorov Complexity
  – K(x) = size of smallest Turing machine that can output x to its tape
  – K(x|y) = size of smallest Turing machine that can output x when given y
• Basis for distance measure
  – if K(x|y) < K(x|z) then y is more similar to x than z

[Li et al. 2003]

Page 19: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Free ECUE

• Approximate K(x) by C(x)
  – C(x) = size of x after compression
• Text compression exploits intra-document redundancy
  – e.g. “Case based reasoning” → “Case b•d reasoning”, where • points back to the earlier “ase”
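
As a rough illustration of C(x), the compressed size can be measured with Python's standard `gzip` module. The helper name `compressed_size` and the UTF-8 encoding are assumptions for the example; the slides only say that GZip is the compressor.

```python
import gzip


def compressed_size(text: str) -> int:
    """Approximate K(x) by C(x): the length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


# Highly redundant text compresses to far fewer bytes than its raw length.
repetitive = "case based reasoning " * 50
print(len(repetitive.encode("utf-8")), compressed_size(repetitive))
```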

Page 20: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Using Compression

• Consider length of two documents allowing for inter-document redundancy
  – len(gzip(docX + docY)) = C(xy)

Page 21: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Using Compression

• Consider length of two documents not allowing for inter-document redundancy
  – len(gzip(docX)) + len(gzip(docY)) = C(x) + C(y)

Page 22: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Compression-Based Dissimilarity (CDM)

$$\mathrm{CDM}(x, y) = \frac{C(xy)}{C(x) + C(y)}$$

• Max value ≤ 1 (furthest)
• Min value > 0.5 (nearest)
• However
  – CDM(x,x) ≠ 0;
  – CDM(x,y) ≠ CDM(y,x);
  – CDM(x,y) + CDM(y,z) ≥ CDM(x,z)

[Keogh et al 2004]
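
The CDM ratio combines the two quantities from the preceding slides: the compressed size of the concatenation over the sum of the individual compressed sizes. The sketch below uses Python's `gzip` module; the function names and the sample emails are invented for illustration.

```python
import gzip


def c(data: bytes) -> int:
    """C(.): size in bytes after gzip compression."""
    return len(gzip.compress(data))


def cdm(x: str, y: str) -> float:
    """Compression-Based Dissimilarity: C(xy) / (C(x) + C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    return c(bx + by) / (c(bx) + c(by))


# Invented sample texts: two near-identical spam messages and one ham message.
spam_a = "Cheap replica watches! Order now and save 80% on luxury brands."
spam_b = "Cheap replica watches! Order now and save 70% on luxury brands."
ham = "Hi Sarah, the minutes of Tuesday's seminar are attached for review."

print(cdm(spam_a, spam_b), cdm(spam_a, ham))
```

Because the second spam message can be encoded almost entirely as references back to the first, the first ratio comes out lower than the second.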

Page 23: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Runtime System

Feature-Based ECUE

[Diagram: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Base Edit → Case base; at runtime a New Email is passed to Classification against the edited case base, giving e.g. “spam!”]

Page 24: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Runtime System

Feature-Free ECUE

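[Diagram: Emails → Case base → Case Base Edit → Case base; at runtime a New Email is passed to Classification against the edited case base, giving e.g. “spam!”. No feature extraction or feature selection stages are needed.]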

Page 25: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Experiments I

• Created 4 datasets of 1000 emails from two years of email from two people
  – each dataset has 500 consecutive ham, 500 consecutive spam
• 10-fold cross-validation
• Settings:
  – k = 3
  – Feature-based: 700 features
  – Feature-free: GZip as text compressor
• Measures:
  – FPRate = #false positives / #ham
  – FNRate = #false negatives / #spam
  – Err = (FPRate + FNRate) / 2
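
The three measures can be computed directly from lists of true and predicted labels. A minimal sketch (the function name and the "ham"/"spam" string labels are assumptions):

```python
def error_rates(actual, predicted):
    """Return (FPRate, FNRate, Err) as defined on the slide.

    A false positive is ham classified as spam; a false negative is spam
    classified as ham. Err is the average of the two rates.
    """
    ham_total = sum(1 for a in actual if a == "ham")
    spam_total = sum(1 for a in actual if a == "spam")
    false_pos = sum(1 for a, p in zip(actual, predicted)
                    if a == "ham" and p == "spam")
    false_neg = sum(1 for a, p in zip(actual, predicted)
                    if a == "spam" and p == "ham")
    fp_rate = false_pos / ham_total
    fn_rate = false_neg / spam_total
    return fp_rate, fn_rate, (fp_rate + fn_rate) / 2
```

Averaging the two rates, rather than dividing total errors by total emails, keeps Err from being dominated by whichever class is larger.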

Page 26: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

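[Bar chart: % error per dataset]

• Dataset 1: Feature-Based 5.7%; Feature-Free (GZip) 2.4%
• Dataset 2: Feature-Based 4.0%; Feature-Free (GZip) 0.2%
• Dataset 3: Feature-Based 9.8%; Feature-Free (GZip) 2.2%
• Dataset 4: Feature-Based 13.2%; Feature-Free (GZip) 1.5%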

Page 27: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % False Positives

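[Bar chart: % false positives per dataset]

• Dataset 1: Feature-Based 9.2%; Feature-Free (GZip) 1.4%
• Dataset 2: Feature-Based 1.4%; Feature-Free (GZip) 0.0%
• Dataset 3: Feature-Based 1.0%; Feature-Free (GZip) 0.8%
• Dataset 4: Feature-Based 0.6%; Feature-Free (GZip) 1.2%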

Page 28: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Case-Based Spam Filtering
  – Feature-Based & Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 29: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Case Base Maintenance

• Case base editing algorithms
  – remove redundant cases, and
  – remove noisy cases.
• Their goal is to
  – reduce retrieval time but
  – maintain or even improve accuracy.

Page 30: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Competence Model

• For each case c, compute
  – coverage set of c
    • cases that have c as one of their k-NN and which have same class as c
  – liability set of c
    • cases that have c as one of their k-NN and which have different class from c

[Diagram: x is in the coverage set of c; y is in the liability set of c]
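
Following the definitions on this slide literally, both sets can be built in one leave-one-out pass over the case base. The sketch below is an illustration under that reading; the function name, the index-based sets and the `distance` callable are assumptions.

```python
def competence_sets(cases, labels, distance, k=3):
    """Return (coverage, liability): dicts mapping case index -> set of indices.

    x joins coverage[c] when c is among x's k nearest neighbours and shares
    x's class; x joins liability[c] when c is among them with another class.
    """
    n = len(cases)
    coverage = {i: set() for i in range(n)}
    liability = {i: set() for i in range(n)}
    for x in range(n):
        # Leave-one-out: a case is never its own neighbour.
        others = [i for i in range(n) if i != x]
        neighbours = sorted(others,
                            key=lambda i: distance(cases[x], cases[i]))[:k]
        for c in neighbours:
            if labels[c] == labels[x]:
                coverage[c].add(x)
            else:
                liability[c].add(x)
    return coverage, liability
```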

Page 31: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Competence-Based Editing

• Blame-Based Noise Reduction
  – For each case c with non-empty liability set (taken in descending order of size of liability set),
    • if the cases in c’s coverage set can still be correctly classified without c, then c can be deleted.
  – This emphasises removal of cases that cause misclassifications.
• Conservative Redundancy Reduction
  – For each remaining case c (taken in ascending order of size of coverage set)
    • retain c but delete the cases in c’s coverage set
  – This retains cases close to class boundaries
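
The two editing passes can be sketched from the slide's wording alone. Several choices below are assumptions rather than the authors' implementation: the competence sets are passed in precomputed and are not rebuilt between passes, "correctly classified" is checked with a plain majority-vote k-NN, and a case that has already been retained is never deleted afterwards.

```python
from collections import Counter


def knn_label(query, pool, cases, labels, distance, k):
    """Majority-vote k-NN label for case `query` using only cases in `pool`."""
    neighbours = sorted((i for i in pool if i != query),
                        key=lambda i: distance(cases[query], cases[i]))[:k]
    return Counter(labels[i] for i in neighbours).most_common(1)[0][0]


def bbnr(cases, labels, coverage, liability, distance, k=3):
    """Blame-Based Noise Reduction: return the indices of the cases kept."""
    kept = set(range(len(cases)))
    # Visit blamed cases from largest to smallest liability set.
    blamed = sorted((c for c in range(len(cases)) if liability[c]),
                    key=lambda c: len(liability[c]), reverse=True)
    for c in blamed:
        trial = kept - {c}
        # Delete c only if everything it covers is still classified correctly.
        if all(knn_label(x, trial, cases, labels, distance, k) == labels[x]
               for x in coverage[c] if x in kept):
            kept = trial
    return kept


def crr(kept, coverage):
    """Conservative Redundancy Reduction over the cases that survived BBNR."""
    # Visit cases from smallest to largest coverage set.
    order = sorted(sorted(kept), key=lambda c: len(coverage[c] & kept))
    retained, deleted = set(), set()
    for c in order:
        if c in deleted:
            continue
        retained.add(c)
        deleted |= (coverage[c] & kept) - retained
    return retained
```

Processing small coverage sets first means cases that few others depend on, typically those near a class boundary, are retained before the interior cases they cover are dropped.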

Page 32: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

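[Bar chart: % error on full and edited case bases]

• Dataset 1: Feature-Based (full) 5.7%; Feature-Based (edited) 3.8%; Feature-Free (full) 2.4%; Feature-Free (edited) 2.2%
• Dataset 3: Feature-Based (full) 9.8%; Feature-Based (edited) 7.0%; Feature-Free (full) 2.2%; Feature-Free (edited) 2.6%

• Feature-based edited size = 75% and 65%
• Feature-free edited size = 59% and 57%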

Page 33: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % False Positives

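[Bar chart: % false positives on full and edited case bases]

• Dataset 1: Feature-Based (full) 9.2%; Feature-Based (edited) 3.4%; Feature-Free (full) 1.4%; Feature-Free (edited) 1.0%
• Dataset 3: Feature-Based (full) 1.0%; Feature-Based (edited) 2.2%; Feature-Free (full) 0.8%; Feature-Free (edited) 0.4%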

Page 34: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Case-Based Spam Filtering
  – Feature-Based & Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 35: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Concept Drift

• The target concept is not static
  – it changes according to season
  – it changes according to world events
  – people’s interests and tolerances change
  – there is an arms race:
    • ever more devious spamouflage!
• We need to investigate behaviour over time

Page 36: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Experiments III

• Took ~10000 emails from two years of email from two people in date-order
• Created a case base for each person from earliest 500 consecutive ham & earliest 500 consecutive spam
• Remaining ~9000 emails presented chronologically as test cases
• Same settings and measures as before
  – k = 3
  – Feature-based: 700 features
  – Feature-free: GZip as text compressor

Page 37: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Retention policies

• CBR (and other lazy learners) can easily incorporate the most recent examples
  – retain-all: store all new emails in the case base
  – retain-misclassifieds: store a new email if our prediction is wrong
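
The retain-misclassifieds policy amounts to a simple loop over the chronological stream of test emails. In this sketch the `classify` callable and the `(email, label)` tuple layout are assumptions; it also assumes the true label becomes available after each prediction, for example from user feedback.

```python
def run_retain_misclassifieds(case_base, stream, classify):
    """Process emails in date order, storing only those we get wrong.

    case_base: mutable list of (email, label) pairs, updated in place.
    stream:    iterable of (email, true_label) pairs in chronological order.
    classify:  callable (email, case_base) -> predicted label.
    Returns the number of misclassified emails.
    """
    errors = 0
    for email, true_label in stream:
        if classify(email, case_base) != true_label:
            errors += 1
            # A lazy learner "retrains" simply by storing the corrected case.
            case_base.append((email, true_label))
    return errors
```

Switching to retain-all would mean appending every email regardless of the prediction.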

Page 38: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

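[Bar chart: % error over time, with and without retention]

• Dataset A: Feature-Free (GZip) 15.9%; Feature-Free (GZip) with retain-misclassifieds 2.3%
• Dataset B: Feature-Free (GZip) 12.6%; Feature-Free (GZip) with retain-misclassifieds 3.2%

• When we retain misclassified cases, case bases increase in size by ~30%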

Page 39: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % False Positives

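[Bar chart: % false positives over time, with and without retention]

• Dataset A: Feature-Free (GZip) 0.7%; Feature-Free (GZip) with retain-misclassifieds 1.5%
• Dataset B: Feature-Free (GZip) 4.0%; Feature-Free (GZip) with retain-misclassifieds 3.5%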

Page 40: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Retention

• Bigger case base reduces efficiency
• Obsolete cases may reduce accuracy
• Obsolete features may reduce accuracy
• Need a deletion policy

Page 41: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Incremental Solutions

• Consider add-1-delete-1
  – Case base size remains constant
  – retention policy
    • retain-all
    • retain-misclassified
  – forgetting policy
    • forget-oldest
    • forget-least-accurate

[Callouts on slide: instance selection; instance weighting]

Page 42: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Incremental Solutions

• Consider add-1-delete-1
  – Case base size remains constant
  – retention policy
    • retain-all
    • retain-misclassified
  – forgetting policy
    • forget-oldest
    • forget-least-accurate

Accuracy = #successes / #retrievals
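
One way to realise add-1-delete-1 with the two forgetting policies is sketched below. The `StoredCase` class and function names are hypothetical. Two further assumptions are labelled in the code: a retrieval counts as a success when the retrieved case's label matches the true label of the email being classified, and a case that has never been retrieved is treated as fully accurate.

```python
from dataclasses import dataclass


@dataclass
class StoredCase:
    email: object
    label: str
    retrievals: int = 0
    successes: int = 0

    @property
    def accuracy(self) -> float:
        # Accuracy = #successes / #retrievals (slide definition).
        # Assumption: never-retrieved cases are given the benefit of the doubt.
        return self.successes / self.retrievals if self.retrievals else 1.0


def record_retrieval(case: StoredCase, true_label: str) -> None:
    """Update a retrieved case's counters once the true label is known."""
    case.retrievals += 1
    if case.label == true_label:  # assumption: this is what "success" means
        case.successes += 1


def add_one_delete_one(case_base, new_case, policy="forget-oldest"):
    """Add new_case and drop one stored case so the size stays constant.

    case_base is a list kept in arrival order (oldest first).
    """
    if policy == "forget-oldest":
        victim = 0
    elif policy == "forget-least-accurate":
        victim = min(range(len(case_base)),
                     key=lambda i: case_base[i].accuracy)
    else:
        raise ValueError(f"unknown forgetting policy: {policy}")
    del case_base[victim]
    case_base.append(new_case)
```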

Page 43: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

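[Bar chart: % error for incremental add-1-delete-1 policies]

• Feature-Free: Dataset A 15.9%; Dataset B 12.6%
• Feature-Free: retain-misclassifieds, forget-oldest: Dataset A 2.3%; Dataset B 3.2%
• Feature-Free: retain-all, forget-oldest: Dataset A 1.7%; Dataset B 2.8%
• Feature-Free: retain-misclassifieds, forget-least-accurate: Dataset A 1.8%; Dataset B 4.0%
• Feature-Free: retain-all, forget-least-accurate: Dataset A 1.9%; Dataset B 3.0%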

Page 44: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % False Positives

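[Bar chart: % false positives for incremental add-1-delete-1 policies]

• Feature-Free: Dataset A 0.7%; Dataset B 4.0%
• Feature-Free: retain-misclassifieds, forget-oldest: Dataset A 1.3%; Dataset B 3.5%
• Feature-Free: retain-all, forget-oldest: Dataset A 1.7%; Dataset B 4.2%
• Feature-Free: retain-misclassifieds, forget-least-accurate: Dataset A 1.8%; Dataset B 6.4%
• Feature-Free: retain-all, forget-least-accurate: Dataset A 2.4%; Dataset B 5.0%

Negative effect on FPs?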

Page 45: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Periodic Solutions

• Periodic
  – Feature-based:
    • retain-misclassified;
    • monthly, feature re-extraction, feature re-selection, case base rebuild and case base edit
  – Feature-free:
    • retain-misclassified;
    • monthly, case base edit

Page 46: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Based ECUE

[Diagram: Emails → Feature Extraction → Case base → Feature Selection → Case base → Case Base Edit → Case base]

Page 47: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

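[Bar chart: % error for periodic solutions]

• Dataset A: Feature-Based 15.4%; Feature-Based with retain-misclassifieds, monthly reselect & edit 4.5%; Feature-Free 15.9%; Feature-Free with retain-misclassifieds, monthly edit 2.3%
• Dataset B: Feature-Based 19.2%; Feature-Based with retain-misclassifieds, monthly reselect & edit 6.1%; Feature-Free 12.6%; Feature-Free with retain-misclassifieds, monthly edit 2.6%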

Page 48: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % False Positives

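[Bar chart: % false positives for periodic solutions]

• Dataset A: Feature-Based 20.0%; Feature-Based with retain-misclassifieds, monthly reselect & edit 2.0%; Feature-Free 0.7%; Feature-Free with retain-misclassifieds, monthly edit 0.9%
• Dataset B: Feature-Based 14.7%; Feature-Based with retain-misclassifieds, monthly reselect & edit 2.4%; Feature-Free 4.0%; Feature-Free with retain-misclassifieds, monthly edit 2.5%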

Page 49: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Overview

• Case-Based Spam Filtering
  – Feature-Based & Feature-Free
  – Experiments I
• Case Base Maintenance
  – Competence-Based Editing
  – Experiments II
• Concept Drift
  – Incremental & periodic solutions
  – Experiments III
• Conclusions

Page 50: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Free ECUE: Advantages

• Accuracy
  – lower error rate than traditional feature-based methods
  – often lower false positive rate
• Costs
  – it uses the raw text
  – no need to extract, select or weight features
  – no need to update features as spam changes
• Concept drift
  – simple retention/forgetting policies can be effective

Page 51: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Feature-Free ECUE: Disadvantages

• No justification factors to explain results or drive adaptation

• Higher computation time
  – Time to classify an email (with a case base of 1000 cases): feature-free = 2 secs; feature-based = 0.01 sec
• The distance measure is not a metric

Page 52: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Future Work

• Investigating algorithms to speed up retrieval time

• Application of measure to text other than emails

Page 53: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Thank you for your attention!

Page 54: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Spare slides

Page 55: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Normalized Compression Distance (NCD)

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$

• Max value = 1 + ε (furthest)
• Min value = 0 (nearest)
• However
  – NCD(x,x) ≠ 0;
  – NCD(x,y) ≠ NCD(y,x);
  – NCD(x,y) + NCD(y,z) ≥ NCD(x,z)

[Li et al 2003]
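
NCD can be sketched in the same way as CDM, again using Python's `gzip` module as the compressor. The function names and sample texts are invented for the example.

```python
import gzip


def c(data: bytes) -> int:
    """C(.): size in bytes after gzip compression."""
    return len(gzip.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = c(bx), c(by)
    return (c(bx + by) - min(cx, cy)) / max(cx, cy)


# Invented sample texts: two near-identical spam messages and one ham message.
spam_a = "Cheap replica watches! Order now and save 80% on luxury brands."
spam_b = "Cheap replica watches! Order now and save 70% on luxury brands."
ham = "Hi Sarah, the minutes of Tuesday's seminar are attached for review."

print(ncd(spam_a, spam_b), ncd(spam_a, ham))
```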

Page 56: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Comparing Compression Algorithms

• The better the compression the better the measure?

• Compared GZip with Prediction by Partial Matching (PPM)
  – GZip = Lempel-Ziv variant
  – PPM = adaptive statistical compressor

Page 57: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results - % Error

[Bar chart: % error for GZip, PPM(2), PPM(4) and PPM(8) on Datasets 1 to 4. Bar labels in extraction order: 2.4%, 2.3%, 2.1%, 2.0%; 0.1%, 0.2%, 0.2%, 0.2%; 2.4%, 1.9%, 2.2%, 2.5%; 1.4%, 1.1%, 1.6%, 1.7%]

Page 58: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

Results

• Little difference in classification error
  – Compressor choice does not greatly matter
• PPM is generally considered better at compression but on our datasets...
  – average of 59% compression for GZip
  – average of 57% compression for PPM
• PPM computationally expensive
  – 180 times slower than GZip

Page 59: A Comparison of  Feature-Based and Feature-Free Case-Based Reasoning  for Spam Filtering

GZip Speed Up

• GZip uses a 32 KByte sliding window

• Truncate each email to 16KB
• Achieves speed ups of between 9.5% and 25%

[Diagram: docX followed by docY, together spanning the 32KB window]
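
A plausible reading of the diagram is that two emails truncated to 16KB each fit together inside GZip's 32KB sliding window, so the whole of docX is still visible while docY is being compressed. The sketch below illustrates that truncation step; the constant and function names are assumptions, and the speed-up figures come from the slide, not from this code.

```python
import gzip

WINDOW_BYTES = 32 * 1024    # GZip sliding window size quoted on the slide
TRUNCATE_BYTES = 16 * 1024  # per-email limit quoted on the slide


def truncate(data: bytes, limit: int = TRUNCATE_BYTES) -> bytes:
    """Keep only the first `limit` bytes of an email."""
    return data[:limit]


def joint_compressed_size(x: bytes, y: bytes) -> int:
    """C(xy) computed on truncated emails.

    After truncation the concatenation is at most 32KB, so it fits in a
    single sliding window and there is less data to compress.
    """
    tx, ty = truncate(x), truncate(y)
    assert len(tx) + len(ty) <= WINDOW_BYTES
    return len(gzip.compress(tx + ty))
```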