History-Based Email Prioritization
Ronald Nussbaum, Abdol-Hossein Esfahanian, and Pang-Ning Tan
Computer Science and Engineering
Michigan State University, East Lansing, MI 48824, U.S.A.
[ronald,esfahanian,ptan]@cse.msu.edu
Abstract
The rise of email as a communication medium raises several issues. A majority of email messages sent are spam. Also, the amount of legitimate email received by many users is overwhelming. In this paper, we propose two new methods of performing email prioritization. Both techniques rank users' inboxes using models created from email history. With them, lower-priority email messages may be set aside so that the use of email remains a net productivity gain.
1. Introduction and Problem Statement
Rather than analyzing the text of the subject and body,
this paper aims to use the header information to effectively
prioritize incoming email messages. The
first step is to take a corpus of email, and track the history
of each pair of people. From this one can calculate the re-
sponse rates and times between pairs of people, and use that
as a basis of prediction. Further detail regarding methodol-
ogy and experimental evaluation may be found in [7].
1.1. Background
There are several approaches to spam detection.
Content-based filtering may be done, perhaps using a rule-
based system or Bayesian methods [1]. Other methods
use blacklists, whitelists, or some other type of list, often
maintained by a third party [8]. With such techniques, the
sender’s email address or IP address needs to be absent from
(or belong to) the list, or the email is discarded as spam.
With email prioritization, the emphasis is not on desig-
nating messages as junk and non-junk, but ordering them
according to importance. Each message must be assigned
a numeric value, so not all spam detection techniques are
applicable. Several researchers have investigated the use of
social networks in spam detection and email prioritization
[2, 3]. Other algorithms use user input to aid prediction.
1.2. Enron Database
One large, publicly available email corpus is the Enron
email dataset, made public in the aftermath of the Enron
financial scandal and subsequent bankruptcy. It contains
a good balance of internal email versus email originating
from outside the enron.com domain or being sent outside of
the enron.com domain. The dataset contains a reasonable
but not overwhelming amount of spam. Klimt and Yang
describe their preparation of the Enron email corpus in two
papers [5, 6]. According to their work and other sources, the
corpus contained 200,399 email messages belonging to 158
users, with a median of 757 incoming and outgoing
messages per user after cleanup [4, 5].
In our prediction models, a node represents an individ-
ual user, while a link represents the relationship between a
pair of individuals. A user or a person is defined as a single
email address. Before any modeling can be done, the Enron
dataset needs to be preprocessed. First, unique email mes-
sages are identified, and duplicate messages removed. Next,
the email is grouped into threads, and other minor cleanup
is done. All email addresses in the To, Cc, and Bcc fields
are considered recipients of that email.
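The recipient-collection and deduplication steps above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; in particular, keying duplicates on the Message-ID header (with a fallback tuple) is an assumption, since the paper does not specify its deduplication criterion.

```python
import email
from email.utils import getaddresses

def recipients(msg):
    """All addresses in the To, Cc, and Bcc fields count as recipients."""
    pairs = getaddresses(msg.get_all("To", []) +
                         msg.get_all("Cc", []) +
                         msg.get_all("Bcc", []))
    return {addr.lower() for _, addr in pairs if addr}

def dedupe(messages):
    """Keep one copy of each unique message (dedup key is an assumption)."""
    seen, unique = set(), []
    for m in messages:
        key = m.get("Message-ID") or (m.get("From"), m.get("Date"),
                                      m.get("Subject"))
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique
```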
2. Main Results
Once duplicate email messages are removed and the data
is otherwise preprocessed, digraphs can be constructed rep-
resenting users and the relationships between them. In order
to evaluate the models, the email corpus is separated into a
training set and a test set. Each person seen in the Enron
dataset is a node in the digraph, and each pair of nodes has
an arc going each way that represents the relationship be-
tween those two users. In the local model, these links are
created entirely from email messages sent or received by a
single person. In the global model, the entire corpus is used.
Both models are then evaluated based on the data in the test
set.
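The digraph described above can be sketched as an arc dictionary, where each ordered (sender, recipient) pair carries the per-relationship statistics. The field names are illustrative assumptions; only the node/arc structure is taken from the paper.

```python
from collections import defaultdict

def build_digraph(messages):
    """Each user (email address) is a node; each ordered (sender,
    recipient) pair gets an arc whose attributes will later hold
    reply counts and response times."""
    arcs = defaultdict(lambda: {"sent": 0, "replies": 0, "resp_times": []})
    for m in messages:
        for r in m["recipients"]:
            arcs[(m["sender"], r)]["sent"] += 1
    return arcs
```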
2009 Advances in Social Network Analysis and Mining
978-0-7695-3689-7/09 $25.00 © 2009 IEEE
DOI 10.1109/ASONAM.2009.44
2.1. Model Generation
After counting the number of replies between each pair
of users, the average response time of those replies is com-
puted and smoothed. The smoothed average response times
are the values actually used to prioritize email.
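One plausible smoothing scheme shrinks each pair's mean response time toward the corpus-wide mean, weighted by the number of observed replies. This particular formula and the weight k are assumptions; the paper does not state which smoothing it uses.

```python
def smoothed_avg(resp_times, global_mean, k=5.0):
    """Shrink a pair's mean response time toward the corpus-wide mean.
    With no observations the estimate falls back to the global mean;
    as replies accumulate, the pair's own average dominates."""
    n = len(resp_times)
    if n == 0:
        return global_mean
    return (sum(resp_times) + k * global_mean) / (n + k)
```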
In order to evaluate the models, the first 70% of the email
messages, according to timestamp, are used as training data,
and the remaining 30% of the messages are used as test data.
The test data is divided into incoming and outgoing mes-
sages for each user. The outgoing messages are further di-
vided into clusters based on their timestamps. Each of these
outgoing clusters represents a point when the user checked
their inbox. The group of incoming messages between the
current outgoing cluster being examined and the previous
one is considered to be the inbox at that point, and predic-
tions are made for each group.
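Splitting a user's outgoing messages into inbox-check sessions can be sketched as gap-based clustering on timestamps. The one-hour gap threshold is an assumption for illustration; the paper does not report the threshold it used.

```python
def cluster_outgoing(timestamps, gap=3600):
    """Group outgoing-message timestamps (seconds) into clusters:
    a new cluster starts whenever the gap between consecutive
    timestamps exceeds `gap`."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters
```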
For a cluster of outgoing email messages and corre-
sponding inbox, messages in the inbox which did not re-
ceive a response are ignored. Predictions are then made for
the email messages in the inbox. For the local prioritization
model, each email message received by the user is assigned
the smoothed average response time of the author of that
email. For the global prioritization model, the smoothed
average response times between the sender and all other users
are averaged and used instead. The messages in the inbox are
then ranked according to increasing values, with ties broken
by email timestamps. Once this is done, the predicted rank-
ing of the messages for which a reply is sent out is compared
to the actual order of the outgoing messages in that cluster.
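The ranking step can be sketched as a sort on (predicted response time, timestamp). Falling back to a corpus-wide average for senders with no history is an assumption added so the sketch handles unseen senders.

```python
def rank_inbox(inbox, smoothed, global_avg):
    """Rank inbox messages by the sender's smoothed average response
    time (lower values rank first), breaking ties by timestamp."""
    def key(msg):
        return (smoothed.get(msg["sender"], global_avg), msg["timestamp"])
    return sorted(inbox, key=key)
```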
2.2. Prediction Results
The accuracy of the predictions for an outgoing cluster
of email messages is calculated by comparing the order of
each pair of outgoing messages to their order in the pre-
dicted rankings. If the order is the same, the pair is con-
sidered to have been correctly classified. If the pair is out
of order in the predicted rankings, the pair is considered to
have been incorrectly classified. The results of the predic-
tions for each outgoing cluster of email for a user are then
summed together.
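The pairwise evaluation above amounts to counting, over all pairs of answered messages, how often the predicted ranking agrees with the actual reply order. A minimal sketch:

```python
from itertools import combinations

def pairwise_accuracy(predicted_rank, actual_order):
    """Count pairs of answered messages whose relative order in the
    predicted ranking matches (correct) or contradicts (incorrect)
    the actual order of the outgoing replies."""
    pos = {m: i for i, m in enumerate(predicted_rank)}
    correct = incorrect = 0
    for a, b in combinations(actual_order, 2):
        if pos[a] < pos[b]:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect
```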
Table 1. Local and Global Prediction Results

             Correct          Incorrect
  Local      1526 (51.9%)     1413 (48.1%)
  Global     1536 (52.3%)     1403 (47.7%)
A total of 2939 pairs were classified with each model. Both
prioritization models performed only slightly better than
chance. Reweighting average response times in the global
model did not change the results significantly.
3. Conclusions
Several issues interfered with our attempt to create an
email prioritization system. Thread detection was the most
problematic, and this paper made no significant improve-
ments over the methods Klimt and Yang used [6]. Previ-
ous research evaded this dilemma by doing classification of
email messages instead.
The difficulty involved in determining information such
as when users checked their inboxes was another major
hurdle. Unfortunately, no large public dataset currently
exists that contains such augmenting information along with
the email corpus itself. While this paper pursued entirely
automatic email prioritization, it appears that manual user
feedback may be required to solve some of these issues.
Despite the size of the Enron email corpus, many of the
email messages do not belong to threads, and fewer still are
responses to messages. The result of this is that the amount
of training and testing data available to build models is actu-
ally quite small. It is unreasonable to require years of email
data before accurate predictions can be made. Furthermore,
the results indicate that a fast response time may not be a
good indicator of email importance, although this may be
due more to the other issues encountered.
References
[1] N. Belkin and W. Croft. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12):29–38, 1992.
[2] P. Boykin and V. Roychowdhury. Personal Email Networks: An Effective Anti-Spam Tool. arXiv preprint cond-mat/0402143, 2004.
[3] P. Chirita, J. Diederich, and W. Nejdl. MailRank: using ranking for spam detection. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 373–380. ACM, New York, NY, USA, 2005.
[4] S. Jabbari, B. Allison, D. Guthrie, and L. Guthrie. Towards the Orwellian nightmare: separation of business and personal emails. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 407–411. Association for Computational Linguistics, Morristown, NJ, USA, 2006.
[5] B. Klimt and Y. Yang. Introducing the Enron corpus. In First Conference on Email and Anti-Spam (CEAS), 2004.
[6] B. Klimt and Y. Yang. The Enron Corpus: A New Dataset for Email Classification Research. Lecture Notes in Computer Science, pages 217–226, 2004.
[7] R. Nussbaum, A. H. Esfahanian, and P. Tan. Graph-based Email Prioritization. Technical Report MSU-CSE-08-33, Computer Science and Engineering, Michigan State University, 2008.
[8] M. Perone. An overview of spam blocking techniques. Technical report, Barracuda Networks, 2004.