Improving Spam Detection Based on Structural Similarity

By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida, Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida

Presented at Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005

Presented by Jared Bott

Outline

Overview

Concepts

Detecting Spam

Experimental Results

Analysis of Paper

Overview

New algorithm to detect spam messages
  Uses email information that is harder to change
  Works in conjunction with another spam classifier, e.g. SpamAssassin

Fewer false positives than the compared methods

Spam Detection Problem

Spam detection algorithms use some part of an email to determine whether the message is spam
  Spammers change messages so that they do not meet the detection criteria for spam

It is very easy to change spam messages, usernames, domains, subjects, etc.

Key Idea

The lists that spammers and legitimate users send messages to and from can be used as identifiers of classes of email traffic.
  The lists of addresses spammers send to are unlikely to be similar to those of legitimate users.

These lists change infrequently

Using Lists

A user is not necessarily a single email address; it can also be a domain, etc.

Represent each email user as a vector in a multi-dimensional conceptual space created with all possible contacts
  Each sender and each recipient has their own vector

Model the relationship between senders and recipients

Constructing Vectors

If at least one email has been sent from sender si to recipient rn, the nth dimension of si’s vector is 1; otherwise it is 0.

If recipient ri has received at least one email from sender sn, the nth dimension of ri’s vector is 1; otherwise it is 0.

Example Vectors

User 1: S[0,1,1] R[0,1,0]

User 2: S[1,0,1] R[1,0,0]

User 3: S[0,0,0] R[1,1,0]
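The construction rule can be sketched in a few lines of Python. This is only an illustration: it reads S as a user's sender vector and R as the same user's recipient vector, takes the three example users in order, and assumes a single shared index for all contacts. The function name `build_vectors`, the user labels, and the message log are hypothetical; the log was chosen so that it reproduces the example vectors.

```python
from typing import Dict, List, Tuple


def build_vectors(
    log: List[Tuple[str, str]], users: List[str]
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Build binary sender (S) and recipient (R) vectors from (sender, recipient) pairs."""
    index = {user: i for i, user in enumerate(users)}
    send = {user: [0] * len(users) for user in users}
    recv = {user: [0] * len(users) for user in users}
    for sender, recipient in log:
        # One or many emails between the same pair both yield a 1.
        send[sender][index[recipient]] = 1
        recv[recipient][index[sender]] = 1
    return send, recv


# Hypothetical log: user1 writes to user2 and user3, user2 writes to user1 and user3.
users = ["user1", "user2", "user3"]
log = [
    ("user1", "user2"),
    ("user1", "user3"),
    ("user2", "user1"),
    ("user2", "user3"),
    ("user1", "user3"),  # a repeated pair does not change the vectors
]
send, recv = build_vectors(log, users)
for user in users:
    print(user, "S", send[user], "R", recv[user])
```

Because the vectors are binary, they record who talks to whom, not how often.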

Similarity Between Senders

Similarity between senders si and sk is the cosine of the angle between their vectors, cos(si, sk)
  0 means no shared contacts
  1 means identical contact lists

In legitimate email, a 1 means that the senders operate in the same social group.

Among spammers, a 1 means that the senders use the same list or are the same person.
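A minimal sketch of the cosine measure on binary contact vectors follows. The function name `cosine` and the sample vectors are illustrative, and returning 0 for an all-zero vector (a user with no contacts) is an assumption of this sketch, not something the slides specify.

```python
import math
from typing import List


def cosine(a: List[int], b: List[int]) -> float:
    """Cosine of the angle between two contact vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a user with no contacts is similar to no one.
        return 0.0
    return dot / (norm_a * norm_b)


same_list = cosine([1, 1, 0], [1, 1, 0])  # identical contact lists, close to 1
disjoint = cosine([1, 0, 0], [0, 1, 0])   # no shared contact, exactly 0
partial = cosine([0, 1, 1], [1, 0, 1])    # one shared contact out of two each, close to 0.5
print(same_list, disjoint, partial)
```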

Grouping Users Into Clusters

Group users with similar vectors
  Users with similar vectors are likely to have related roles, i.e. spammer or legitimate user

Each cluster is represented by a vector
  This vector is the sum of all its component users’ vectors

Similarity Between a User and a Cluster

Similarity is derived from the user-to-user similarity equation
  If sender si is a member of cluster sck, then the similarity is cos(sck – si, si).
  If sender si is not a member of cluster sck, then the similarity is cos(sck, si).

Similarity between a user and a cluster will change over time
  Remove the user’s vector from the cluster’s vector when computing similarity and reclassifying a user
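The two cases can be sketched as follows, taking the cluster vector to be the sum of its members' vectors. The helper names (`cosine`, `user_cluster_similarity`) and the sample vectors are hypothetical, and the zero-vector convention is an assumption of the sketch.

```python
import math
from typing import List


def cosine(a: List[int], b: List[int]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def user_cluster_similarity(
    user_vec: List[int], cluster_vec: List[int], is_member: bool
) -> float:
    """cos(sck - si, si) for members of the cluster, cos(sck, si) otherwise."""
    if is_member:
        # Remove the user's own contribution so the user is not compared with itself.
        cluster_vec = [c - u for c, u in zip(cluster_vec, user_vec)]
    return cosine(cluster_vec, user_vec)


# A cluster of two senders; its vector is the element-wise sum of the members.
members = [[0, 1, 1], [1, 0, 1]]
cluster = [sum(column) for column in zip(*members)]

sim_member = user_cluster_similarity(members[0], cluster, is_member=True)
sim_outsider = user_cluster_similarity([1, 1, 0], cluster, is_member=False)
print(cluster, sim_member, sim_outsider)
```

Subtracting the member's vector first keeps a user from inflating its own similarity to the cluster it already belongs to, which matters when users are reclassified over time.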

Detecting Spam

Two probabilities to compute
  Ps(m) – Probability of an email m being sent by a spammer
  Pr(m) – Probability of an email m being addressed to users that receive spam

Detecting Spam

When an email arrives, classify it using some other method

Find the cluster (sc) the email’s sender belongs to
  If many users in the cluster send messages that are classified as spam by the auxiliary method, the probability of all the users in that cluster sending spam is high

Update sc’s spam probability
  Ps(m) ← sc’s spam probability
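The slides do not say how a cluster's spam probability is maintained. One simple possibility, assumed here purely for illustration, is the running fraction of the cluster's messages that the auxiliary classifier labeled as spam. `ClusterStats`, `sender_probability`, and the counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Message counts for one sender cluster (hypothetical bookkeeping)."""
    spam: int = 0
    total: int = 0

    def update(self, aux_says_spam: bool) -> None:
        """Record one message and the auxiliary classifier's verdict on it."""
        self.total += 1
        if aux_says_spam:
            self.spam += 1

    @property
    def spam_probability(self) -> float:
        # Assumption: the probability is the fraction of messages labeled spam.
        return self.spam / self.total if self.total else 0.0


def sender_probability(sc: ClusterStats, aux_says_spam: bool) -> float:
    """Update the sender's cluster with the new message, then read Ps(m) from it."""
    sc.update(aux_says_spam)
    return sc.spam_probability


# A cluster where 8 of 9 earlier messages were labeled spam receives one more spam.
sc = ClusterStats(spam=8, total=9)
ps = sender_probability(sc, aux_says_spam=True)
print(ps)
```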

Detecting Spam

For all recipients of the email, find the cluster (rc) each one belongs to

Update the spam probability for each cluster

Pr(m) ← Pr(m) + spam probability of each rc

Pr(m) ← Pr(m)/number of recipients
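The averaging step above can be sketched as below. The sketch only reads the per-cluster spam probabilities and leaves out the update step; the cluster identifiers, the probability values, and the choice to return 0 for a message with no recipients are assumptions.

```python
from typing import Dict, List


def recipient_probability(
    recipient_clusters: List[str], cluster_spam_prob: Dict[str, float]
) -> float:
    """Average the spam probability of each recipient's cluster to obtain Pr(m)."""
    if not recipient_clusters:
        return 0.0  # assumption: no recipients, no evidence
    pr = 0.0
    for rc in recipient_clusters:
        pr += cluster_spam_prob[rc]      # Pr(m) <- Pr(m) + spam probability of rc
    return pr / len(recipient_clusters)  # Pr(m) <- Pr(m) / number of recipients


# Hypothetical message with three recipients, two of them in the same cluster.
cluster_spam_prob = {"rc1": 0.9, "rc2": 0.3}
pr = recipient_probability(["rc1", "rc2", "rc1"], cluster_spam_prob)
print(pr)
```

A cluster counts once per recipient, so a message addressed mostly to one cluster is dominated by that cluster's probability.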

Detecting Spam

Compute a spam rank for the email based upon Pr(m) and Ps(m)

If the spam rank is above some threshold (ω), label it as spam

If the spam rank is below 1 – ω, label it as legitimate

Otherwise, label the email with the auxiliary method’s classification
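The decision rule can be sketched as follows. The slides do not state how Ps(m) and Pr(m) are combined into a spam rank, so the plain average in `spam_rank` is only a placeholder assumption, as is the default ω of 0.85. The rule needs ω above 0.5 so that the two thresholds do not overlap.

```python
def spam_rank(ps: float, pr: float) -> float:
    """Placeholder combination of Ps(m) and Pr(m); the real formula is not given here."""
    return (ps + pr) / 2


def classify(ps: float, pr: float, aux_label: str, omega: float = 0.85) -> str:
    """Label a message as spam or legitimate, deferring to the auxiliary label in between."""
    if not 0.5 < omega <= 1.0:
        raise ValueError("omega must lie in (0.5, 1.0]")
    rank = spam_rank(ps, pr)
    if rank > omega:
        return "spam"
    if rank < 1 - omega:
        return "legitimate"
    return aux_label  # structural evidence is inconclusive


overridden_to_spam = classify(0.90, 0.95, aux_label="legitimate")
overridden_to_legit = classify(0.05, 0.10, aux_label="spam")
deferred = classify(0.50, 0.60, aux_label="spam")
print(overridden_to_spam, overridden_to_legit, deferred)
```

The structural evidence overrides the auxiliary classifier only when the rank is extreme; in the middle band the auxiliary label stands.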

Experimental Results

Tested on a log of eight days of email from a large Brazilian university

Tested on a 2.8 GHz Pentium 4 with 512 MB RAM
  Able to classify 20 messages per second
  Faster than the average message arrival peak rate

Results

Measure                     Non-Spam    Spam       Aggregate
# of emails                 191,417     173,584    365,001
Size of emails              11.3 GB     1.2 GB     12.5 GB
# of distinct senders       12,338      19,567     27,734
# of distinct recipients    22,762      27,926     38,875

Results

Manually checked false positives to see if they were spam or not
  Auxiliary algorithm had more false positives

Algorithm                  % of Misclassifications
Original Classification    60.33%
Their approach             39.67%

Strengths

Fewer false positives than SpamAssassin

Low-cost

Works with message information that changes infrequently

Weaknesses

Needs an additional message classifier, e.g. SpamAssassin

Requires manual tuning of the algorithm

Improvements

Time correlation of similar addresses

Collaborative filtering based upon user feedback