
[IEEE 2009 International Conference on Advances in Social Network Analysis and Mining (ASONAM) - Athens, Greece (2009.07.20-2009.07.22)]

History-Based Email Prioritization

Ronald Nussbaum, Abdol-Hossein Esfahanian, and Pang-Ning Tan
Computer Science and Engineering
Michigan State University, East Lansing, MI 48824, U.S.A.
[ronald,esfahanian,ptan]@cse.msu.edu

Abstract

The rise of email as a communication medium raises several issues. A majority of email messages sent are spam. Also, the amount of legitimate email received by many users is overwhelming. In this paper, we propose two new methods of performing email prioritization. Both techniques rank users' inboxes using models created from email history. With these rankings, lower-priority email messages may be deferred so that the use of email remains a net productivity gain.

1. Introduction and Problem Statement

Rather than trying to analyze the text in the subject and body, the goal of this paper is to use the header information to effectively prioritize incoming email messages. The first step is to take a corpus of email and track the history of each pair of people. From this one can calculate the response rates and times between pairs of people, and use them as a basis for prediction. Further detail regarding methodology and experimental evaluation may be found in [7].

1.1. Background

There are several approaches to spam detection. Content-based filtering may be done, perhaps using a rule-based system or Bayesian methods [1]. Other methods use blacklists, whitelists, or some other type of list, often maintained by a third party [8]. With such techniques, the sender's email address or IP address must be absent from a blacklist (or present on a whitelist), or the email is discarded as spam.
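The list-based filtering logic described above can be sketched as follows. This is an illustrative minimal sketch, not an implementation from the paper; the function name and the policy of treating an empty whitelist as "allow all" are assumptions.

```python
def is_spam(sender, blacklist=frozenset(), whitelist=frozenset()):
    """List-based filtering sketch: discard mail from blacklisted
    senders; if a whitelist is in use, discard mail from any sender
    not on it (assumed policy when the whitelist is non-empty)."""
    if sender in blacklist:
        return True
    if whitelist and sender not in whitelist:
        return True
    return False
```

Note that this binary keep/discard decision is exactly what makes such techniques inapplicable to prioritization, which needs a numeric score per message.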

With email prioritization, the emphasis is not on designating messages as junk and non-junk, but on ordering them according to importance. Each message must be assigned a numeric value, so not all spam detection techniques are applicable. Several researchers have investigated the use of social networks in spam detection and email prioritization [2, 3]. Other algorithms use user input to aid prediction.

1.2. Enron Database

One large, publicly available email corpus is the Enron email dataset, made public in the aftermath of the Enron financial scandal and subsequent bankruptcy. It contains a good balance of internal email and email that either originates from outside the enron.com domain or is sent outside of it. The dataset contains a reasonable but not overwhelming amount of spam. Klimt and Yang describe their preparation of the Enron email corpus in two papers [5, 6]. According to their work and other sources, the corpus contained 200,399 email messages belonging to 158 users, with a median of 757 incoming and outgoing messages per user after cleanup [4, 5].

In our prediction models, a node represents an individual user, while a link represents the relationship between a pair of individuals. A user or a person is defined as a single email address. Before any modeling can be done, the Enron dataset needs to be preprocessed. First, unique email messages are identified, and duplicate messages removed. Next, the email is grouped into threads, and other minor cleanup is done. All email addresses in the To, Cc, and Bcc fields are considered recipients of that email.
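The de-duplication step above can be sketched as follows. This is a minimal sketch under assumptions: the paper does not specify how duplicates are detected, so hashing the full message text is an illustrative choice, and the function name is hypothetical.

```python
import hashlib

def dedupe(emails):
    """Keep the first occurrence of each distinct message, where two
    messages are considered duplicates when their full text is
    identical (assumed criterion; hashed to keep the seen-set small)."""
    seen = set()
    unique = []
    for msg in emails:
        key = hashlib.sha256(msg.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```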

2. Main Results

Once duplicate email messages are removed and the data is otherwise preprocessed, digraphs can be constructed representing users and the relationships between them. In order to evaluate the models, the email corpus is separated into a training set and a test set. Each person seen in the Enron dataset is a node in the digraph, and each pair of nodes has an arc going each way that represents the relationship between those two users. In the local model, these links are created entirely from email messages sent or received by a single person. In the global model, the entire corpus is used. Both models are then evaluated based on the data in the test set.
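The digraph construction described above can be sketched as follows, assuming a simple record layout (`sender` and `recipients` fields are hypothetical names; the paper does not specify a data format). Each email address is a node, and each ordered pair (sender, recipient) gets an arc accumulating that pair's history.

```python
from collections import defaultdict

def build_digraph(emails):
    """emails: iterable of dicts with 'sender' and 'recipients' keys.
    Returns the node set and a dict mapping (sender, recipient) arcs
    to message counts. To, Cc, and Bcc addresses are all treated as
    recipients, as in the preprocessing described above."""
    arcs = defaultdict(int)
    nodes = set()
    for msg in emails:
        sender = msg["sender"]
        nodes.add(sender)
        for rcpt in msg["recipients"]:
            nodes.add(rcpt)
            arcs[(sender, rcpt)] += 1
    return nodes, dict(arcs)

# Tiny made-up corpus for illustration.
corpus = [
    {"sender": "a@enron.com", "recipients": ["b@enron.com", "c@enron.com"]},
    {"sender": "b@enron.com", "recipients": ["a@enron.com"]},
]
nodes, arcs = build_digraph(corpus)
```

In the local model, `emails` would contain only one user's sent and received messages; in the global model, the whole corpus.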

2009 Advances in Social Network Analysis and Mining

978-0-7695-3689-7/09 $25.00 © 2009 IEEE

DOI 10.1109/ASONAM.2009.44



2.1. Model Generation

After counting the number of replies between each pair of users, the average response time of the replies is computed and smoothed. The smoothed average response times are the values actually used for the prioritization of email.

In order to evaluate the models, the first 70% of the email messages, according to timestamp, are used as training data, and the remaining 30% of the messages are used as test data. The test data is divided into incoming and outgoing messages for each user. The outgoing messages are further divided into clusters based on their timestamps. Each of these outgoing clusters represents a point when the user checked their inbox. The group of incoming messages between the current outgoing cluster being examined and the previous one is considered to be the inbox at that point, and predictions are made for each group.
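The chronological split and the clustering of outgoing messages can be sketched as below. The one-hour clustering gap is an assumption for illustration; the paper does not state the threshold it used, and the function names are hypothetical.

```python
def split_by_time(emails, train_frac=0.7):
    """Sort messages by timestamp and take the first 70% as training
    data, the rest as test data, as described above."""
    ordered = sorted(emails, key=lambda m: m["timestamp"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def cluster_outgoing(timestamps, gap=3600):
    """Group outgoing-message timestamps (seconds) into clusters, each
    standing for one occasion on which the user checked their inbox.
    A new cluster starts when the gap to the previous message exceeds
    `gap` seconds (assumed threshold)."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters
```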

For a cluster of outgoing email messages and its corresponding inbox, messages in the inbox which did not receive a response are ignored. Predictions are then made for the email messages in the inbox. For the local prioritization model, each email message received by the user is assigned the smoothed average response time of the author of that email. For the global prioritization model, the smoothed average response times of the sender with respect to all other users are averaged and used instead. The messages in the inbox are then ranked according to increasing values, with ties broken by email timestamps. Once this is done, the predicted ranking of the messages for which a reply is sent out is compared to the actual order of the outgoing messages in that cluster.
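The ranking step for the local model can be sketched as below. The field layout, function name, and the policy of placing senders without history last are assumptions for illustration.

```python
def rank_inbox(inbox, smoothed_art):
    """inbox: list of (sender, timestamp) pairs.
    smoothed_art: dict mapping sender -> smoothed average response
    time from the training data.
    Ranks messages by increasing response time, ties broken by
    timestamp; unknown senders sort last (assumed policy)."""
    return sorted(inbox,
                  key=lambda m: (smoothed_art.get(m[0], float("inf")), m[1]))

# Illustrative usage: "ann" is usually answered faster, so her
# message is ranked ahead of "bob"'s.
ranked = rank_inbox([("bob", 2), ("ann", 1)], {"bob": 5.0, "ann": 2.0})
```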

2.2. Prediction Results

The accuracy of the predictions for an outgoing cluster of email messages is calculated by comparing the order of each pair of outgoing messages to their order in the predicted rankings. If the order is the same, the pair is considered to have been correctly classified. If the pair is out of order in the predicted rankings, the pair is considered to have been incorrectly classified. The results of the predictions for each outgoing cluster of email for a user are then summed together.
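The pairwise evaluation above can be sketched as follows; the function name and message-id representation are illustrative, not from the paper.

```python
from itertools import combinations

def pairwise_accuracy(actual_order, predicted_rank):
    """actual_order: message ids in the order replies were actually sent.
    predicted_rank: dict mapping message id -> predicted rank (lower =
    higher priority). A pair is correct when the predicted ranking
    agrees with the actual reply order."""
    correct = incorrect = 0
    for a, b in combinations(actual_order, 2):
        if predicted_rank[a] < predicted_rank[b]:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect
```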

Table 1. Local and Global Prediction Results

         Correct         Incorrect
Local    1526 (51.9%)    1413 (48.1%)
Global   1536 (52.3%)    1403 (47.7%)

A total of 2,939 pairs were classified with each model. Both prioritization models performed only slightly better than chance. Reweighting average response times in the global model did not change the results significantly.

3. Conclusions

Several issues interfered with our attempt to create an email prioritization system. Thread detection was the most problematic, and this paper made no significant improvements over the methods Klimt and Yang used [6]. Previous research evaded this dilemma by doing classification of email messages instead.

The difficulty involved in determining information such as when users checked their inboxes was another major hurdle. Unfortunately, there is no large dataset publicly available at present that contains such augmenting information along with the email corpus itself. While this paper pursued entirely automatic email prioritization, it appears that manual user feedback may be required to solve some of these issues.

Despite the size of the Enron email corpus, many of the email messages do not belong to threads, and fewer still are responses to messages. The result of this is that the amount of training and testing data available to build models is actually quite small. It is unreasonable to require years of email data before accurate predictions can be made. Furthermore, the results indicate that a fast response time may not be a good indicator of email importance, although this may be due more to the other issues encountered.

References

[1] N. Belkin and W. Croft. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12):29-38, 1992.

[2] P. Boykin and V. Roychowdhury. Personal Email Networks: An Effective Anti-Spam Tool. arXiv preprint cond-mat/0402143, 2004.

[3] P. Chirita, J. Diederich, and W. Nejdl. MailRank: using ranking for spam detection. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 373-380. ACM, New York, NY, USA, 2005.

[4] S. Jabbari, B. Allison, D. Guthrie, and L. Guthrie. Towards the Orwellian nightmare: separation of business and personal emails. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 407-411. Association for Computational Linguistics, Morristown, NJ, USA, 2006.

[5] B. Klimt and Y. Yang. Introducing the Enron corpus. In First Conference on Email and Anti-Spam (CEAS), 2004.

[6] B. Klimt and Y. Yang. The Enron Corpus: A New Dataset for Email Classification Research. Lecture Notes in Computer Science, pages 217-226, 2004.

[7] R. Nussbaum, A. H. Esfahanian, and P. Tan. Graph-based Email Prioritization. Technical Report MSU-CSE-08-33, Computer Science and Engineering, Michigan State University, 2008.

[8] M. Perone. An overview of spam blocking techniques. Technical report, Barracuda Networks, 2004.
