

An Experimental Framework for Email Categorization and Management

Kenrick Mock

[email protected]

Project Overview
• Motivation: Email overload
• Potential solution: Automatic categorization and management techniques
• Problem: The potential solution is very experimental. Email use and user interaction are difficult to model, requiring a prototype that users can try on actual email
• The purpose of this work is to present a Microsoft Outlook 2000™ add-in that:
  – Can be used as a first step toward more experimental research into automatic email management techniques
  – Helps manage the inbox via classification and relevancy-based search

What’s the Problem with Email?
• Too much
• 6/26/2001 USA Today
  – “Workers polled this year by market researcher Gartner spent an average of 49 minutes a day on e-mail, 30% to 35% more time than they did a year ago. Ferris Research estimates management-level workers will spend four hours a day on e-mail by 2002.”

Solutions?
• Educate users
  – Don’t send so much mail, don’t subscribe to lists
• Use technology in some way
  – Current efforts are toward some type of classification system that learns

[Diagram: a folder “Conferences” holds emails regarding conferences. Training: the system learns what email belongs to “Conferences”. A new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”.]

This Project
• An architecture for exploring automatic email management techniques
• Built on Outlook 2000
  – Primary code in Visual Basic
    • Produces DLL add-in for Outlook
  – Visual C++ DLL component
    • Hashes strings to longs (logical operators not available in VB)
    • Referenced from VB
  – Not tested with Outlook 2002!

Architectural Overview
[Diagram components]
• Outlook: Outlook Object Model, Events
• VB Add-In DLL
• Outlook / Class Interface Glue
• C++ Helper DLL (Hash Strings)
• Folder Class: AddMsg(), GetMessages via Dictionary, CompareMsg()
• Message Class: AddTerms(), Display(), Get Vals, CompareMsg()

Add-In Interface : Messages
• Message Class
  – Mail folders scanned on startup, class instance created for each mail item (except Trash, Sent Items).
  – Message text is tokenized and stoplisted using
    • Sender
    • Recipients
    • Subject
    • Text Body (possible to use more fields if desired)
  – Text tokens are hashed to 32-bit longs to save space and greatly speed up token comparison
    • Hash function by Bob Jenkins
    • 2 collisions on 87111 dictionary words
    • 10x faster to compare longs vs. strings via strcmp on Pentium II
  – CompareMsg function computes similarity between two email messages
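The tokenize, stoplist, and hash steps on this slide can be sketched roughly as follows. This is an illustration in Python, not the add-in's VB/C++ code. The slide credits Bob Jenkins but does not say which of his hash functions is used, so the one-at-a-time variant here is an assumption, and the stoplist, `tokenize`, and `message_terms` are hypothetical names.

```python
import re

# Hypothetical stoplist; the slides do not list the words actually filtered.
STOPLIST = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "re"}


def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, kept within 32 bits.

    Assumption: the slides do not name the specific Jenkins function
    used by the C++ helper DLL.
    """
    h = 0
    for byte in key:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def tokenize(text: str) -> list:
    """Lowercase, split on non-alphanumerics, and drop stoplisted words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPLIST]


def message_terms(sender: str, recipients: str, subject: str, body: str) -> dict:
    """Map each hashed token to its frequency across the four fields."""
    counts = {}
    for field in (sender, recipients, subject, body):
        for token in tokenize(field):
            key = one_at_a_time(token.encode("utf-8"))
            counts[key] = counts.get(key, 0) + 1
    return counts


terms = message_terms("a@b.org", "c@d.org", "SIGIR deadline",
                      "The SIGIR deadline is near")
```

For scale: an ideal 32-bit hash applied to 87111 distinct words would be expected to produce roughly 87111² / 2³³ ≈ 0.88 colliding pairs, so the 2 collisions reported on the slide are of the same order.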

Add-In Interface : Folders
• Folder Class
  – User-created mail folders are scanned on startup and a folder instance created for each mail folder (except Trash, Sent Items).
  – Messages that the user has placed in each folder are added to the folder’s classifier for training
  – CompareMsg function computes similarity between a new message and the classifier for the folder
    • i.e. can use to classify a new message into folders

Classifier Implementation
• CompareMsg
  – It is the goal of this project to experiment with different classifiers and algorithms as the implementation of CompareMsg to find out what works and what doesn’t
  – A simple classification scheme is implemented for now
    • Nearest Neighbor, common terms & frequencies
  – Other schemes that have been examined in the past:
    • TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM
• What should the classifier do when new email arrives?
  – Some options
    • Move new email directly to classified folder
    • Annotate email with a category tag
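A minimal sketch of a nearest-neighbor scheme over common terms and frequencies, returning a category tag rather than moving the message. The slides do not give the scoring formula, so summing the smaller of the two frequencies for each shared term is an assumption, as are the names `compare_msg` and `classify`. Plain strings stand in for the hashed tokens to keep the example readable.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def compare_msg(a: Counter, b: Counter) -> int:
    """Score two messages by shared terms, each weighted by its smaller frequency."""
    return sum(min(count, b[term]) for term, count in a.items() if term in b)


def classify(new_msg: Counter,
             folders: Dict[str, List[Counter]]) -> Tuple[Optional[str], int]:
    """Return the folder holding the single most similar message, with its score.

    Returns (None, 0) when no stored message shares a term with new_msg.
    """
    best_folder, best_score = None, 0
    for name, messages in folders.items():
        for msg in messages:
            score = compare_msg(new_msg, msg)
            if score > best_score:
                best_folder, best_score = name, score
    return best_folder, best_score


folders = {
    "Conferences": [
        Counter("sigir call for papers deadline".split()),
        Counter("conference registration deadline".split()),
    ],
    "Trash": [Counter("call now free psychic reading".split())],
}

tag, score = classify(Counter("sigir deadline extended".split()), folders)
```

Because only the best single match matters, no negative examples are required, and adding a training message is just an append to the folder's list.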

Classifier Usage Challenges
• In previous work, we built a proprietary rule induction and tf-idf classifier into Outlook and GroupWise that classified messages into categories. It was tested on managers and developers.
• Problems we encountered were usage-driven:
  1. The need for constant re-training to keep up with dynamically changing categories.
  2. Classification errors are puzzling and instill distrust in users.
  3. Insufficient data may be available as training examples.
  4. It is difficult for a user to examine or manually edit a classifier.

Challenge 1: Categories Change
• Common for categories to change over time; “Topic Drift” as in Newsgroups
  – Project ends or changes direction
  – Conversation slowly changes topics
  – General discussion might turn more technical
• Problems for learning algorithms
  – Classifiers need to be re-trained; how well can they handle it? How fast is it?
    • Our users were willing to wait seconds, not minutes
    • Most classifiers are not incremental; require re-training using all positive/negative examples, not just new ones
    • Often too slow for many algorithms (e.g. rule induction)
  – Vector-based classifiers
    • Fast to re-train but may have problems with threshold calculations or new vocabulary not in the vector

Challenge 2: Classifiers Make Errors, Destroy User Trust
• Users tolerate few errors
• Want immediate corrections so the same error won’t happen again
  – Vector classifier may require several examples before centroid shifts enough to include similar message
  – Rule classifiers need explicit retrain
• Classification errors are inevitable
  – Classifier may over-generalize or be too specific
  – Errors could “break” users’ hard work setting up a folder
  – In some cases it’s more work to fix errors than the savings the tool is intended to provide!
• Trust is easy to lose, users abandon the system
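The centroid point above can be made concrete with a small worked example. The sketch assumes a running-mean centroid, cosine similarity, and a fixed acceptance threshold of 0.5; none of these details come from the slides, and `CentroidFolder` and `corrections_needed` are hypothetical names.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class CentroidFolder:
    """Folder model that keeps the mean term vector of its messages."""

    def __init__(self):
        self.centroid = {}
        self.n = 0

    def add(self, terms: dict) -> None:
        # Incremental mean: new = old + (x - old) / n, applied per term.
        self.n += 1
        for t in set(self.centroid) | set(terms):
            old = self.centroid.get(t, 0.0)
            self.centroid[t] = old + (terms.get(t, 0) - old) / self.n


def corrections_needed(folder, msg, threshold, limit=100):
    """Count how many times msg must be added before it passes the threshold."""
    for count in range(limit + 1):
        if cosine(msg, folder.centroid) >= threshold:
            return count
        folder.add(msg)
    return None


folder = CentroidFolder()
for _ in range(4):
    folder.add({"budget": 1, "report": 1})

n = corrections_needed(folder, {"travel": 1, "itinerary": 1}, threshold=0.5)
```

With four existing messages on one topic, a message on a new topic reaches the assumed threshold only after three corrections (similarity about 0.24, then 0.45, then 0.60), which is the delay the slide describes.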

Challenge 3: Insufficient Data Available
• Many classifiers require a large amount of training data, e.g. statistical-based classifiers
  – May not have enough email available
  – Users expect system to work well given only 6-12 training examples
  – Effort to find more examples typically too high
  – One solution: Bootstrap using data in existing folders
    • What about negative examples? Can be problematic for some classification algorithms

Challenge 4: Model Editing and Understanding
• Some users want to manually fix or edit the classifier
  – These are naïve users, not programmers!
• Easy to understand, modify
  – Rule-based classifiers
• More difficult
  – Vector classifiers, may have many keywords
• Very difficult
  – Neural Network
  – SVM

Current Implementation
• Publicly available source, binaries for open development purposes
• Simple nearest-neighbor classifier for Folders
  – Fast, easy to train and classify
  – May help classify user-created folders that really encompass multiple sub-folders (e.g. “work” where there are many work projects) better than classification techniques that rely on global data
    • Individual term frequencies of sub-folder topics will be low
    • But message-to-message comparison may be high
  – Don’t need negative examples
• Tag messages with category rather than move into a folder
  – Hopefully not too critical when misclassifications occur

Current Implementation : User Interface

Upon startup of Outlook: Scan Outlook folders, create classifiers and messages

View inbox grouped by category

Current Interface : New Email

New email is automatically classified into the best-matching folder (but not moved, only grouped)

Current Interface : Related Email
• Interface also supports finding other email similar to the current one
  – Iterate through all email message class objects invoking the comparison function
    • Simple term-frequency comparison of both emails for now
• Linear time, but not too bad
  – 300 of the author’s messages scanned per second on 400 MHz PII
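The linear scan described above might look like the following sketch. The scoring (size of the multiset intersection of the two term-frequency tables), the function names, and the message ids are assumptions for illustration; the slides only say that a simple term-frequency comparison is used.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple


def term_overlap(a: Counter, b: Counter) -> int:
    """Shared term count; Counter '&' keeps the minimum frequency per term."""
    return sum((a & b).values())


def related(selected_id: str, mailbox: Dict[str, Counter],
            k: int = 5) -> List[Tuple[str, int]]:
    """Scan every other message once and return the k best matches."""
    query = mailbox[selected_id]
    scored = ((term_overlap(query, terms), mid)
              for mid, terms in mailbox.items() if mid != selected_id)
    # Keep only messages that share at least one term with the selection.
    return [(mid, s) for s, mid in heapq.nlargest(k, scored) if s > 0]


# Hypothetical mailbox keyed by message id.
mailbox = {
    "m1": Counter("sigir submission deadline reminder".split()),
    "m2": Counter("sigir deadline extended".split()),
    "m3": Counter("lunch on friday".split()),
    "m4": Counter("sigir camera ready submission deadline".split()),
}

hits = related("m1", mailbox, k=2)
```

Each query touches every message exactly once, so cost grows linearly with mailbox size, consistent with the slide's remark.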

Current Interface: Related Email

Select a message, click on button

List of similar messages displayed, click to open

Comments on Personal Use
• No formal user studies performed yet
• But, I’ve been using it… some anecdotes:
  – Nearest Neighbor classifier OK, could be better
  – Would be useful to index trash or sent-items
    • If not indexed, there is no folder to classify into when junk mail arrives so it gets put somewhere else
    • Temporary solution: Make a “Trash” folder with examples
    • But indexing trash could be a lot of messages…
  – Grouping of incoming email useful?
    • Not really needed for frequent email reading
    • Useful when returning from a trip and need to triage the mail
  – Relevant email
    • Useful for finding uncoupled email threads
    • Sent-Items would be useful to index here

Lots of Work To Do
• Experiment with other classifiers
  – Need to see relation with users on training issues, speed, etc. not just classification accuracy
• Latch onto more events
  – Better mail detection, drag & drop events
• Clean up code implementation
  – Support persistence, speed issues on startup scan
  – Implementation issues
  – Compatibility with Outlook 2002, VB .NET
• Other forms of visualization / categorization
  – E.g., color, thread information, graphical techniques
• Extend to other forms of Outlook data
  – Calendaring, Notes, Files

Try It Out
• Source Code & Binaries available online
  – http://www.math.uaa.alaska.edu/~afkjm/emailaddin/
  – Only tested with Windows 2000 & Outlook 2000
  – Feel free to use or modify code as you see fit
  – Warning: Developer docs and code cleanup still need to be done!
    • But I’ll be glad to answer any questions!