
Thesis topic definition paper

Czech Technical University in Prague
Faculty of Electrical Engineering

Department of Computer Science and Engineering

BACHELOR'S THESIS: Email clustering algorithm

Author: Libor Grafnetr
Supervisor: Ing. Pavel Kordík, Ph.D.
Year: 2009

Acknowledgments

I would like to thank my supervisor Ing. Pavel Kordík, Ph.D. for his guidance throughout the work on this thesis.


Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.

I have no objection to the use of this work in compliance with §60 of Act No. 121/2000 Coll. (Zákon č. 121/2000 Sb., the Copyright Act), including the rights connected with copyright and any subsequent amendments to the act.

In Prague on June 9, 2009 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Abstract

This thesis describes the analysis, design and implementation of a clustering algorithm that operates on email messages. Its aim is to recognize messages that deal with a similar topic and aggregate them into one cluster. The algorithm combines general document clustering methods with analysis of data specific to email messages. Given the nature of email, it supports online clustering.

The functionality will be integrated into the existing email client software named "eM Client". Implementation of suitable data structures with a database backend, mail client interoperability code and a user interface is also part of this work.

Abstrakt

This thesis describes the analysis, design and implementation of a clustering algorithm that will operate on email messages. Its goal is to recognize messages that concern the same topic and group them into one cluster. The algorithm uses general document clustering techniques combined with an analysis of information specific to email messages. Given the nature of email, the algorithm supports incremental clustering.

The clustering functionality will be integrated into the existing email client "eM Client". The work also includes the implementation of suitable data structures connected to a database, of code for interoperation with the mail client, and of a user interface.


Contents

1 Introduction
    1.1 Motivation

2 Problem description
    2.1 Project breakdown
    2.2 Existing approaches
    2.3 Thesis structure

3 Analysis
    3.1 Project components
        3.1.1 Clustering framework
        3.1.2 Distance function and text analysis
        3.1.3 Data structures
        3.1.4 Interoperability code
        3.1.5 User interface
    3.2 Clustering
        3.2.1 Algorithms
            3.2.1.1 K-means
            3.2.1.2 Hierarchical clustering
            3.2.1.3 Fuzzy C-means
            3.2.1.4 Conclusion
        3.2.2 Data instances representation
        3.2.3 Measuring distance
        3.2.4 Email messages
        3.2.5 Document clustering
            3.2.5.1 Tokenization
            3.2.5.2 Term ranking and TF-IDF
            3.2.5.3 Cosine measure
        3.2.6 eM Client

4 Design
    4.1 Hierarchical agglomerative clustering
    4.2 Email messages
    4.3 Data instance features
    4.4 Text analysis
    4.5 Distance function
    4.6 Database backend
    4.7 User interface
    4.8 Integration layer

5 Implementation
    5.1 Clustering
    5.2 Mail clustering
    5.3 Storage
    5.4 User interface
    5.5 Integration layer

6 Testing
    6.1 Clustering quality
        6.1.1 Subjective evaluation
        6.1.2 Objective evaluation
    6.2 Data structures optimization
    6.3 Performance

7 Conclusion
    7.1 Future work

Bibliography

A SQL Schema

B Clustering Illustrations

C CD directory structure


Chapter 1

Introduction

1.1 Motivation

Email has become one of the most important means of communication in both personal and corporate environments. Its unprecedented features have allowed people to communicate and cooperate in their activities like no other technology before. Since its notable spread, the volume of email messages has risen by several orders of magnitude, and email now represents a core component of every corporation's workflow. However, the way software presents email messages to people has almost not changed. While a simple list of emails may be sufficient for a low volume of mostly independent mails, today's communication frequently consists of interconnected messages, bound by a similar topic, possibly from several different individuals.

The need to change how people work with their email has been known for many years; the problem is entering general consciousness, and the public media focus their attention on it more and more often. There are two basic (but interconnected) parts of the issue: the first is the immense volume of emails that many users are subject to - this problem has been coined "Email Overload"; the second is most users' inability to use email efficiently, e.g. sending unnecessary messages or overusing the reply-all functionality. There have even been attempts to quantify the costs of inefficient email usage, with estimates as high as $650 billion [1].

Automatically organizing users' messages would be a significant remedy to the problem. It would allow users to distribute their time more efficiently among relevant tasks, speed up information look-up and generally create a more pleasant email client environment, in which many people spend a significant part of their working time.


Chapter 2

Problem description

The aim of this work is to create an email client extension that automatically clusters the user's email messages and presents the result to the user. The clustering process must be incremental, as new messages continually arrive in the user's inbox, and the framework's data necessary to perform the clustering must persist between application executions. The resulting data, consisting of clusters and the emails they contain, must be presented to the user in a comprehensible and non-intrusive way from within the email client.

2.1 Project breakdown

To achieve the above, we have divided the project into the following tasks that need to be accomplished:

• Devise a variation on an existing clustering algorithm that will suit the specific needs of the clustering task.

• Analyze the nature of email messages and find relevant data mining features that will constitute the basis for measuring similarity between email instances.

• Design a similarity measurement function that yields good results when evaluating the topic similarity of emails.

• Implement the algorithm, the knowledge extraction framework and suitable data structures that will allow the clustering working data to be persisted.

• Integrate the implemented module into "eM Client" and create a user interface that will present the clustering results and allow the user to execute the necessary operations of the clustering framework.

• Test the project with a test dataset, and point out areas where optimization had to be added and where more improvement could be made in the future.


2.2 Existing approaches

Despite the urgency of the problem, a successful solution that would gain widespread acceptance has not been created yet. There are several partial solutions in use, but they either require continual feedback from the user, are of rather limited helpfulness, or were not finished and made available for public use. Existing partial solutions include the following:

• Threaded view is a method that displays a conversation tree based on information about each email's relation to previous emails. This method works only for mails that were replied to or forwarded; a new email from a user concerning the same topic is not connected. This method is one of the oldest attempts to improve the experience of working with email, but it is very limited, especially at the present scale of the overload problem.

• Conversation view is an improved variation on the threaded view approach, introduced in Google's Gmail service. It improves the user experience by flattening the threaded view tree and by displaying each conversation as a single item in the basic folder view. The approach provides a clearer view of the user's emails, but still doesn't constitute a powerful tool that radically improves work with email.

• Rules - most modern desktop email clients have means to define rules consisting of a condition and an action to be performed if the specified condition is met. The condition is usually used to match a specific substring in the email subject or body, and the action then performs a move or copy operation if the condition is met. While rules are relatively popular among desktop client users, their filtering capabilities are limited, because the user has to define every single rule. The common use therefore is separating several basic types of email that enter the user's inbox, rather than separating every new topic that has to be dealt with.

• Automatic foldering is a more sophisticated approach based on filters matching the message with existing mail folders. Real-world implementations allow the user to define folders that automatically process incoming mails and filter messages similar to those already present in the folder. The user needs to add several example messages when the folder is created to provide some initial data to the folder filter, and is also allowed to add or remove messages from the folder later on, further improving its filtering performance. A popular implementation of this method is present in the browser/email client software "Opera". Although a promising approach, known implementations still require the user to define a folder for each topic he would like to separate. Algorithms used in these filters include, for example, Bayesian filtering, Support Vector Machines or k-NN. Attempts to integrate new folder recognition were also made, but haven't had any major influence. An overview of existing works in automatic foldering and their effectiveness can be found in [2]. Our approach might be viewed as a restatement of automatic foldering extended with new folder recognition, but seen from the clustering branch of data mining.

The most notable work taking an approach similar to this thesis is Gabor Cselle's master's thesis Organizing Email [3], which has notably influenced the author's research. Cselle performed extensive research on email organization and on the basis of that work


created a clustering framework and an extension to the email client "Thunderbird" that performed email clustering. However, the extension itself was never published.

2.3 Thesis structure

This document is structured into five main chapters reflecting the progress of the author's work:

• chapter 3 will describe the project components, the necessary clustering theory used in our implementation, and also the software "eM Client" that will integrate the clustering functionality.

• chapter 4 provides an in-depth description of the design of the clustering algorithm, the text analysis algorithm, the similarity function, the utilized email message features, the realization of the database backend and the design of the user interface.

• chapter 5 explains the architecture of the project, describes the classes that were implemented, the way clustering is integrated with the email client and the background of the user interface.

• chapter 6 describes how the final product was tested, what steps were taken to improve performance, and enumerates the remaining issues.


Chapter 3

Analysis

This chapter will describe the logical components employed in the clustering process, basic data mining concepts and the target environment that the project will be integrated with.

3.1 Project components

The whole mail clustering process has many steps that can be divided into the following blocks. The first component is a clustering framework containing the clustering algorithm, clusters, instances, a data source for instances and other necessary infrastructure. Tightly connected to it are a distance function utilized by the clustering algorithm and a text analysis component that evaluates the textual similarity of mail content. In addition, there is a database persistence framework that needs to be provided to the clustering framework in the most transparent way possible. Finally, a component integrating the clustering with the email client and a user interface displaying the results from the clustering framework need to be connected to the above components.

3.1.1 Clustering framework

The core part of the project is the clustering algorithm itself. On input, this algorithm receives data instances from a data source. These items are first pre-processed by the distance function for information extraction. Afterwards, instances are processed by the distance function to determine their similarity to existing instances. Which instance pairs are processed by the distance function depends on the particular algorithm. Based on the calculated distance values, the clustering process is executed; during this process clusters are formed or modified. Once the algorithm meets its stop criteria, the new mail instances have been clustered and the updated cluster information may be loaded by other components.

3.1.2 Distance function and text analysis

The distance function provides the clustering algorithm with information about the similarity of individual instances to each other. Every new instance is passed to the distance function by


the clustering algorithm for processing. The distance function further passes the new instance to the text analysis subcomponent. When the clustering algorithm calls the distance function, the similarity of selected mail item features is evaluated and a mail body text similarity is calculated as well. These several scores are then merged into a distance value that is returned to the clustering algorithm. The text analysis subcomponent processes the mail body text of new instances and updates its internal data structures that store the relevance of individual words. When the distance function is asked to calculate similarity, text analysis performs a similarity comparison of the two mail instances' texts and calculates a text similarity score using known text clustering formulas.

3.1.3 Data structures

Specialized data structures are necessary to store all data used in the whole clustering process. Apart from simple lists used to store existing clusters and data instances, dictionaries are used for pre-calculated text analysis data, and two-dimensional dictionaries are used to store distance values between individual instances and clusters. Although lists and one-dimensional dictionaries exist in the target environment for in-memory operation, they had to be implemented as layers over the relational database used to persist the data. The two-dimensional dictionary had to be implemented for both in-memory and database operation. As these data structures are heavily utilized during the clustering process, they need to be highly optimized, as does the underlying database schema.
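As an illustration, a minimal in-memory sketch of such a two-dimensional dictionary in Python follows. The names are hypothetical, and the actual implementation additionally persists its values to the database:

class TwoKeyDictionary:
    """An in-memory dictionary addressed by a pair of keys,
    e.g. two data instances whose distance is stored."""

    def __init__(self):
        self._rows = {}

    def set(self, key_a, key_b, value):
        self._rows.setdefault(key_a, {})[key_b] = value

    def get(self, key_a, key_b, default=None):
        return self._rows.get(key_a, {}).get(key_b, default)

    def remove_key(self, key):
        # drop the key's row and all its column entries, e.g. when
        # an instance or a cluster is removed from the process
        self._rows.pop(key, None)
        for row in self._rows.values():
            row.pop(key, None)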

3.1.4 Interoperability code

To integrate the previous components with the email client, an interoperability layer needs to exist. This layer will govern the initialization of the clustering framework and its database backend and provide a custom data source transforming email messages from the client into the data instances that the clustering process works with. The data source also needs to monitor additions and removals of email messages and forward these events to the clustering framework. There are several other functional roles, described in the implementation chapter, that the layer needs to perform.

3.1.5 User interface

The email client must have a user interface that provides access to the results of the clustering process. The interface must have a non-intrusive character; that is, it must let the user use the client as usual, but easily allow him to take advantage of the clustering output. The interface will be developed using existing concepts present in the client. In its most minimal form, the interface will provide a view of the existing clusters, display information about the selected cluster, provide a method to quickly view only messages from the selected cluster and allow the user to see which cluster a mail belongs to. The user interface will interact with the clustering framework primarily through the interoperability code.


3.2 Clustering

Clustering is a branch of the data mining discipline which aims at grouping similar data instances together into sets called clusters. To be able to devise a suitable clustering process, we need to review the relevant existing knowledge that will be necessary. Our view of the techniques must also reflect the specifics of the data and of the environment we are about to embed the technology in. Clustering of email messages can be understood as a special case of document clustering with a lot of additional metadata taking part in the clustering process. There are also other specifics, such as the online nature of the clustering, the fact that the target result is a large number of relatively small clusters with the need to continually detect new ones, and the need for unsupervised operation of the algorithm. In the following sections we will describe algorithms, data representation, distance functions and knowledge extraction principles that relate to our problem. For each link in the clustering chain, we will also analyze what features and capabilities it needs to have in our implementation and why.

3.2.1 Algorithms

There are already many algorithms for clustering, and new ones are continually being devised. This is mainly because each field has specific features that customized algorithms can take advantage of. For our clustering task we have researched several popular and relatively versatile clustering algorithms, as they are known to perform reasonably well on document clustering tasks and some of them have advantageous features that can address some of our needs. We will describe those algorithms, including their advantages and disadvantages, and state whether we have chosen each algorithm for our project.

3.2.1.1 K-means

K-means is one of the most essential clustering algorithms. It dates back to 1956 and has been gaining popularity ever since. It isn't complicated to implement and performs the clustering relatively fast - its computational complexity is linear. To use K-means, it is necessary to specify the target number of clusters - let k be the desired number of clusters. The algorithm creates k data points representing cluster centers, called centroids. An iterative partitioning loop described below is then executed to carry out the clustering.

The formal definition of K-means [4] is then to partition the data instances into k clusters so as to minimize a sum-of-squared-error objective function that is defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} |c_i - p_j|^2

where E is the squared-error value, k the desired number of clusters, n the number of data instances in the i-th cluster, c_i the centroid of the i-th cluster and p_j the j-th data instance. The subtraction of the instances represents the distance between the two points.

The algorithm below is a common approach to accomplish the described function minimization. The minimum reached is, however, only a local one, which is why the initial centroid distribution is so important.


The data points selected as cluster centers may be generated randomly or, better, some form of heuristic may be used to select centroids that are likely to yield a good clustering result. A popular method is to gradually pick centroids from the clustered data instances so that each centroid's distance from the nearest previously picked centroid is maximal.

The clustering in K-means is performed iteratively by re-calculating centroids and re-associating data instances with the nearest centroid until stability is reached and no data instances change their parent cluster. The K-means algorithm in pseudo-code is written below:

select_k_initial_centroids();
do {
    // if some data instance changes its parent cluster,
    // the clustering_change variable is set to true
    associate_data_instances_with_nearest_centroid();

    // the new mean of the data instances of each cluster is
    // calculated and set as the new centroid of the cluster
    calculate_new_centroid_positions();
} while (clustering_change);

Figure 3.1: An example of the K-means algorithm on random two-dimensional data
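For illustration only, the loop above can be rendered as a self-contained Python sketch for plain numeric vectors. It uses random initialization rather than the seeding heuristic described above, and its stop criterion compares centroid positions instead of tracking instance reassignments:

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iterations=100):
    # randomly pick k distinct points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(max_iterations):
        # associate every data instance with its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(centroids[i], p))
            clusters[nearest].append(p)
        # recalculate each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centroids stable: no instance changed its cluster
        centroids = new_centroids
    return clusters, centroids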

Although the standard definition of K-means doesn't deal with incremental clustering, it is possible to modify the implementation to support it, and some research on this topic and on the issues arising from the time-dependent nature of the task has already been performed [5].

However, K-means has two major disadvantages. The first is the need to know the number of clusters in advance, which would require additional estimation algorithms or substantial algorithm modifications. The second issue arises from the concept of the centroid, which assumes that


a mean of data instances can be calculated. For more complicated data instances that are not just simple vectors (see subsection 3.2.2), this condition cannot be met. It can be overcome by avoiding the centroids completely and determining an instance's distance from a cluster by its average distance from the cluster's members. Although a viable approach, the computational complexity would rise enormously.

Although K-means is a straightforward, versatile and fast algorithm, we have decided not to use it in its standard form as the clustering algorithm for our project.

3.2.1.2 Hierarchical clustering

The hierarchical clustering algorithm is, together with K-means, the most popular clustering technique. This method operates by building a hierarchy of clusters based on the similarity of the data instances they contain. The process of building the hierarchy may be terminated when a specified condition is met, resulting in clusters of the desired similarity and size.

The metric used to measure similarity between instances can be chosen independently of the algorithm, but a second measure is used that defines how the distance between two clusters is calculated from the distances between the instances in these clusters. There are three basic measures used:

• Minimum distance - the distance between two clusters is equal to the smallest distance between an instance from cluster A and an instance from cluster B.

• Maximum distance - the distance between two clusters is equal to the largest distance between an instance from cluster A and an instance from cluster B.

• Average distance - the distance between two clusters is equal to the average distance of instances from cluster A to instances from cluster B.

An algorithm that employs the minimum distance measure and also uses a minimal inter-cluster distance as the threshold to stop merging clusters is called a single-linkage algorithm.

Once the measure of distance between clusters is defined, there are two approaches to the hierarchy's construction:

• Agglomerative - the algorithm starts with every data instance in a separate cluster and repeatedly merges the clusters that are nearest to each other.

• Divisive - at the beginning, all instances are in one cluster, and logic using the instance distance function repeatedly splits the existing clusters.

The agglomerative method is the more frequent choice when using a hierarchical algorithm, and it also suits our needs better - it will allow an easier implementation of the incremental clustering and provides better means to define the criteria that stop the clustering.


Figure 3.2: A dendrogram of clusters created by hierarchical clustering on the well-known Iris dataset

Pseudo-code showing how agglomerative hierarchical single-linkage clustering operates follows:

create_cluster_for_each_data_instance();
do {
    // calculates all inter-cluster distances and stores the
    // smallest one in the variable minimal_distance
    calculate_intercluster_distances();
    if (minimal_distance < merge_threshold)
        merge_nearest_clusters();
} while (minimal_distance < merge_threshold);
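A straightforward, unoptimized Python rendering of the same loop follows. It assumes a pair-wise distance function and, unlike the thesis implementation, recomputes distances in every pass instead of caching them in distance matrices:

def single_linkage(instances, distance, merge_threshold):
    # start with one cluster per data instance
    clusters = [[instance] for instance in instances]
    while len(clusters) > 1:
        # find the two nearest clusters under the minimum-distance measure
        best_pair, minimal_distance = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if minimal_distance is None or d < minimal_distance:
                    best_pair, minimal_distance = (i, j), d
        if minimal_distance >= merge_threshold:
            break  # the nearest clusters are too far apart: stop merging
        i, j = best_pair
        clusters[i].extend(clusters[j])  # merge the nearest clusters
        del clusters[j]
    return clusters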

Hierarchical clustering doesn't impose any requirements on the data instances or the distance function. This makes it more suitable for clustering complex data, as in our case.

However, the algorithm does have two issues that must be noted. Once a cluster merge (or split) is performed, it cannot be undone. That is, if an instance I is merged with a cluster A and afterwards a cluster B containing instances very similar to instance I enters the clustering process, there is no way for instance I to be moved to cluster B. This might cause clustering performance degradation in incremental clustering. Another issue is the algorithm's computational complexity, which is quadratic. This is given by the need to calculate distances between all pairs of data instances. The complexity may cause problems if the algorithm is to be employed in our project.

3.2.1.3 Fuzzy C-means

Fuzzy c-means is a variation of the K-means algorithm with the addition of uncertainty. The previous algorithms understood an instance's belonging to a cluster as a boolean statement, and a single


instance could belong only to one cluster. Contrary to these models, fuzzy c-means allows one data instance to belong to several clusters. The extent to which an instance belongs to a cluster is denoted by a scalar value, the "degree of membership" [6]. The fuzzy nature of membership would be a very useful feature for our application, allowing the algorithm's uncertainty to be propagated to the user interface.

The algorithm operates similarly to K-means, but in each iteration, instead of re-associating data instances with the nearest clusters, a matrix containing the degree of membership for each cluster-instance combination is recalculated. The formal definition of the algorithm is still aimed at optimizing an objective function, but the function takes the degree of membership into account, so it is now defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} \cdot |c_i - p_j|^2

where u_{ij} is the degree of membership of the j-th data instance to the i-th cluster; the other variables have the same meaning as in the objective function of K-means.

The formula for an instance's degree of membership to cluster A calculates the value based on the proximity of cluster A in comparison to the other clusters. The formula also contains an additional parameter called the fuzzifier, which influences the gradient with which the membership decreases. If u_{ij} is the degree of membership of the j-th data instance to the i-th cluster, then

u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{|c_i - p_j|}{|c_l - p_j|} \right)^{2/(m-1)}}

where k is the number of clusters, c_l iterates over all centroids, c_i is the centroid of the i-th cluster, p_j is the j-th data instance and m is the fuzzifier.

The variable degree of membership also has to be reflected in the centroid calculation. An instance with a lesser membership to a cluster should have a smaller impact on the position of the cluster's centroid. This can be achieved by incorporating the u_{ij} value into the arithmetic average fraction, into both the numerator and the denominator.
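In the standard fuzzy c-means formulation this weighted mean takes the following form, with each membership raised to the power of the fuzzifier m:

c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \cdot p_j}{\sum_{j=1}^{n} u_{ij}^{m}}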

Fuzzy c-means seems suitable for the incorporation of simple new cluster detection: if an instance's maximum degree of membership to any cluster is under a specified threshold, a new cluster containing this instance may be created. Similarly to K-means, implementing incremental support in fuzzy c-means seems viable.

While the author perceives this algorithm as promising and would like to focus on its potential in the future, the research necessary to implement it correctly for our application would need to be extensive and beyond the time frame available for this work; therefore, it won't be used in our project.

3.2.1.4 Conclusion

We have described the three most promising candidates for the algorithm used in our project. All of them could be modified to suit our needs, but to ensure that the results yielded by the clustering framework will be of decent quality from the beginning, we have decided to use hierarchical agglomerative clustering in its single-linkage variation.


3.2.2 Data instances representation

The way data instances are represented in the clustering process is highly important and depends on many factors. Many steps of the process have specific requirements on the form of the input data; therefore data conversion and preprocessing is a step, often a very complex and lengthy one, that most data mining efforts start with. This is usually because the work takes place in existing data mining environments that have a predefined input format and only a limited ability to alter how the data is processed.

Given the fact that our clustering framework is being built from scratch, we will design its architecture in a very modular manner to allow as much independence from the data form as possible. Only the parts that need to understand the meaning of the data, such as the distance calculating component, will have to be customized; the rest of the framework will work only through these custom implementations. Therefore, there will be no need to convert the data processed by the framework, in our project and possibly in any other use, should there be any.

This is a correct approach from an engineering standpoint, but also a mandatory requirement dictated by the context of the project. The data processed by the clustering is already present in the email client software, and it is not feasible to perform any large-scale conversion.

While we have described the emphasis that we put on keeping the data the way it is, certain transformations will have to take place to provide the information necessary for text analysis (see subsection 3.2.5). This transformation and data extraction will not, however, alter the existing data, but will put the extracted knowledge into a distinct data store. The target data store will be designed to facilitate the performance and semantic needs of the text analysis component.

The data our project will work with are email messages. Emails have a rather complicated structure, but it can be condensed into two main parts that interest us:

• A decent set of metadata attributes. Many of these attributes will become instance features, meaning they will be analyzed during the clustering process, but they don't need to be transformed or stored in an additional location.

• Content - each email has one or more content parts. One of the parts will be selected and taken as a text document. This document part will be subject to processing using document clustering techniques that require further data transformation and storage.

Document clustering is usually modeled by representing documents as items in a vector space. Each vector in the vector space represents one document. Terms (individual words) extracted from all documents are used as the vectors' components, each term being one component. The value of each component in a vector is a score determining the relevance of the associated term in the document the vector represents (see subsubsection 3.2.5.2).

Having each document represented by a vector with one component for each existing term from the document set is highly inefficient. The vectors would be mostly empty, as each document contains only a small subset of all terms in the set. Therefore, the actual structure of the data store containing document terms will be different. Each document will have a


list of term and value pairs for the terms contained in the document. Although this will slightly complicate the data structures and the database backend for persisting the information, it will save large amounts of space and will have a positive effect on performance.
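For illustration, a sketch of this sparse representation in Python, with hypothetical term weights; a term that is absent from a document's list implicitly has the weight 0:

# sparse document vectors: only the terms a document actually
# contains are stored, together with their relevance scores
documents = {
    "mail-1": {"meeting": 0.41, "budget": 0.22, "monday": 0.08},
    "mail-2": {"budget": 0.35, "report": 0.27},
}

# looking up a term that is not stored yields the implicit weight 0
weight = documents["mail-2"].get("meeting", 0.0)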

As has been explained, there is no need to store the data instances in a specific form for the purposes of the clustering. The framework should process instances independently of their form through customized processing components, and the text analysis component will only store the necessary calculated data, but won't alter the data instances or duplicate the data they contain.

3.2.3 Measuring distance

The algorithm performs clustering based on instance distances. The quality of the clustering therefore depends tightly on the quality of the distance calculation. Most clustering algorithms operate with a pair-wise distance function that takes two instances as input and outputs a distance value. The distance function may range from a simple formula to an algorithm combining several other distance functions, which may perform analytical computations.

The methods of distance calculation depend on the types of data features that are to be reflected in the calculation. When instances are represented by vectors that contain only numerical values, simple metrics such as the Euclidean distance, the Manhattan distance or even a linear combination of the values may be used.

One feature type our project will make use of is a set of textual strings. For this type, an intersection of the two sets may be computed and the intersecting set's size then incorporated into a composite distance function.

Another feature type may be a time stamp. Two time stamp features may be subtracted and the result normalized into a predefined interval, thus obtaining a valid scalar value that may be added into the composite distance.

Many data mining processes employ feature weighting to further improve the output quality. When feature weighting is used, the value of each feature is modified by a weight quotient. Weight quotients may be determined by many methods, such as training linear Support Vector Machines, the gradient descent technique, Odds Ratio or Information Gain training [7][8].

Document clustering also measures the distance of documents, but here the concept of weighting individual values is realized in the text analysis process by term relevance scoring.

To join the many distance values calculated by the various methods, a grouping distance function must be formulated. Such a function can be viewed simply as a next level of distance function processing the numerical feature values, except that this time the values are already an extraction of the features' content. Therefore, the same formulas may be used. To reflect the essence of the grouping distance function - combining partial distance values into one - a weighted linear combination is a good choice.

3.2.4 Email messages

An Internet Message (termed email) is a very versatile information container from which a lot of clustering-relevant data can be extracted. An email message consists of an envelope


and contents [9]. The envelope contains fields such as the sender, the recipients, the message's subject, a time stamp defining when the message was sent, and fields containing references to other email messages. These fields convey valuable information that has a substantial influence on recognizing the relation between emails.

The contents of the message may either be plain text or consist of content parts - MIME parts [10]. Most of today's email communication content is in the format of MIME parts. These parts not only allow several sequential data blocks, such as the mail text and a binary attachment, but can form nested structures allowing alternative forms of one content (e.g. a rich text format and a plain text format of the message), encapsulating altered content (e.g. encryption using the S/MIME standard), and have many other advantages.

This flexibility slightly complicates the necessary processing logic in a mail client, because it has to select the best part to display to the user. It is obvious that the displayed part will also be the one most suitable for the document clustering analysis.

When clustering email messages, both the envelope fields and the contents of the message must be taken into account. Therefore, several metrics for the envelope fields must be combined with the metric of the document clustering component to get the final distance function, using combination as described in the previous section.

3.2.5 Document clustering

Document clustering aims at grouping similar documents based on an analysis of their text. It is a field of text mining, which derives many concepts from information retrieval and statistics. Many approaches to determining similarity between texts exist, but the process usually has two parts: document processing and similarity calculation. The document processing takes place at the beginning of the text clustering process and can be divided into several steps:

• Decompose the text into single tokens - in most cases words. Apply preprocessing to the words, such as stemming, case conversion or stop word exclusion. The preprocessed words are now regarded as terms.

• Analyze the terms in the context of the document they were extracted from and calculate the values necessary to determine the terms' relevance scoring later. The relevance scoring isn't usually calculated during document processing, as its formula parameters change when other documents are clustered. Therefore, it is preferable to store intermediate values that relate only to this document.

• Update the global values relating to each term processed in the document's analysis.

When the clustering itself takes place, pairwise document similarities are calculated. Ideally, the documents' texts aren't processed again during this calculation, as all the necessary information - the terms the documents contain, associated with the values necessary for the terms' relevance calculation - is already known from the initial document analysis.

The document similarity calculation iterates through the documents' terms, determines a relevance scoring for each term and uses the selected measure to calculate the similarity value from all the term scorings from both documents.

In the following sections, we will describe the relevant parts of this process in more detail.


3.2.5.1 Tokenization

Tokenization is the process of splitting a string of characters into tokens based on a predefined lexical grammar. In the case of document clustering, a simple grammar defining one token as consisting of letters and separated by a non-letter character may be sufficient.

3.2.5.2 Term ranking and TF-IDF

It has been described that each term needs to have a relevance scoring that is used during the similarity calculation. The following reasons require such a scoring to exist, and the formula used to calculate the score has to address them:

• Documents vary in length. Even if one document has more occurrences of a term than another document, it doesn't mean that it is more related to the term. The first document may just be several times longer, and it may even deal with a completely unrelated topic. Therefore, the term's number of occurrences must be put in relation to the document's length.

• Individual terms vary highly in their importance. Some words are very frequent and may be found in a large portion of a document set, while others have a very specific meaning and are present only in a few documents. The fact that two documents share a specific, infrequent term thus has a much higher weight than if they shared a term that can be found in most of the other documents. Thus a notion of the global relevance of a term must be maintained and used appropriately in the similarity calculation.

Both of these issues are addressed in a popular term ranking measure named "TF-IDF" (Term Frequency - Inverse Document Frequency). It is a product of the term's frequency within a document and an inverse frequency of the term's presence in all documents. The TF-IDF value for term i in document j is defined by the following formula:

tfidf_{ij} = \frac{t_{ij}}{\sum_m t_{mj}} \cdot \log \frac{|D|}{m_i}

where t_{ij} is the number of occurrences of term i in document j, \sum_m t_{mj} is the sum of all term occurrences in document j, |D| is the number of documents in the set and m_i is the number of documents where term i appears.

As can be seen from the formula, the score increases when the term is frequent within the document, but decreases with the number of documents it is present in. This measure has been proven to provide satisfactory output, and we have decided to use it in our clustering framework.
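The formula translates directly into Python. In this sketch, term_counts maps each term of document j to its number of occurrences t_ij, document_frequency gives m_i, and total_documents is |D| (all names are illustrative):

import math

def tf_idf(term, term_counts, document_frequency, total_documents):
    # term frequency: occurrences of the term relative to the
    # total number of term occurrences in the document
    tf = term_counts[term] / sum(term_counts.values())
    # inverse document frequency: terms in fewer documents score higher
    idf = math.log(total_documents / document_frequency[term])
    return tf * idf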

3.2.5.3 Cosine measure

A custom metric is necessary to calculate the similarity of two documents based on the term relevance scores. The metric must neutralize the number of terms the documents contain, and it should be normalized into a defined interval. The cosine measure is a frequent measure of choice used


with the TF-IDF measure. For the similarity of documents j and k obtained by summing term scores, it has the following form:

cosine\_measure_{jk} = \frac{\sum_n t_{nj} \cdot t_{nk}}{\|j\| \cdot \|k\|}

where \sum_n iterates over the union of all terms from j and k, t_{nj} is the number of occurrences of term n in document j and \|j\| is the magnitude of document j. The magnitude can be calculated as if j were a Euclidean vector: \|j\| = \sqrt{\sum_n t_{nj}^2}. When term n is not present in document j, the number of occurrences t_{nj} shall be 0.
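A Python sketch of the measure, where the documents j and k are given as sparse dictionaries mapping terms to their scores, so that absent terms contribute 0:

import math

def cosine_measure(j, k):
    # numerator: iterate over the union of terms from both documents
    numerator = sum(j.get(term, 0.0) * k.get(term, 0.0)
                    for term in set(j) | set(k))
    # magnitudes, calculated as if the documents were Euclidean vectors
    magnitude_j = math.sqrt(sum(v * v for v in j.values()))
    magnitude_k = math.sqrt(sum(v * v for v in k.values()))
    if magnitude_j == 0.0 or magnitude_k == 0.0:
        return 0.0  # a document with no terms has no similarity
    return numerator / (magnitude_j * magnitude_k)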

3.2.6 eM Client

The clustering framework will be integrated within the software "eM Client" [11]. It is a modern email client whose development the author participates in. It aims at being a full-featured collaboration product easily integrated with third-party server messaging and groupware solutions. Aside from email, the client supports calendaring, task lists and contact management. All of these items may be synchronized with a server through a variety of open protocols such as IMAP, POP3, SMTP, CalDAV or CardDAV.

The product's history is rather short, as development began three years ago. The development team now has five members, although most of the software has been written by a team of three that expanded several months ago.

The client is built on the Microsoft .NET Framework, which allows fairly dynamic growth of the application thanks to the powerful features of the runtime environment and to the large base class library that provides solid building blocks for the software's expansion in almost any direction.


Figure 3.3: The main window of eM Client

eM Client doesn't support any extensibility through plugins yet. Therefore, any new functionality needs to be integrated directly into the project's source code. However, thanks to the well-designed architecture of the software, new component integration code needs to be added only in a few locations. The user interface also allows new controls to be easily incorporated into the existing layout and connected with other parts of the interface. The integration will not require any existing functionality to be modified.

For the implementation discussed in this work, we have decided that the clustering framework will operate only on items from the Inbox folder of the mail client.


Chapter 4

Design

Based on the general principles and facts described in the previous chapter, we will outline the concrete form of the algorithms and methods used in the clustering framework. The ideas were formed partially before the implementation, but changed during coding based on arising issues, and even after the framework had been finished.

4.1 Hierarchical agglomerative clustering

From the several candidates for the clustering algorithm, we have chosen hierarchical clustering. Its principle can be implemented fairly easily, it will allow incremental clustering, and the way it operates will enable us to cache many of the intermediate values it produces.

We have decided on the agglomerative variant of the algorithm. Each instance therefore starts in its own cluster, and the individual clustering iterations merge the nearest clusters.

With regard to the purpose of the clustering, we must view the clustering process as continual and never-ending. It won't be performed again and again, but rather started once and then slowly proceeding as new data instances are added to the dataset.

When the process is started for the first time, initial processing must be performed. All instances present in the dataset must be analyzed, the necessary values calculated, new clusters created and the distances between the clusters calculated. Afterwards, once all instances have been processed, clustering on the dataset must be performed. We call this process the initial clustering.

The algorithm has several data structures that contain the information necessary to perform the clustering effectively. Some of these structures could be abandoned, but the performance of the algorithm would then fall below a usable threshold. The basic structures are:

• Instance distance matrix contains a matrix of distances between all instances being clustered. This matrix is updated every time a new instance is added, by inserting the values returned by the distance function. All subsequent operations using a distance value retrieve the calculated value from this matrix. The matrix uses a two-dimensional dictionary.


• Cluster distance matrix contains the distances of all existing clusters between each other. Modifications are performed during the clustering process when clusters are merged and when a cluster for a newly added instance is created. This matrix also uses a two-dimensional dictionary.

• Clusters list maintains all clusters that exist in the current state of the clustering process.

• Instances is the list of all instances in the process. Although most of the time this list contains the same items as the algorithm's data source, it allows for better clustering framework separation and resolves issues with newly added instances.

When an instance is processed by the algorithm, either during the initial clustering or when being added to the dataset, three basic operations are performed:

• The instance is passed to the distance function for processing. The distance function performs the necessary analysis, such as term extraction and ranking (see subsection 3.2.5).

• The instance is added to the algorithm's data structures, and the distance function is invoked for pairs of existing instances and the new instance to add values to the instance distance matrix.

• A new cluster is created and added to the cluster list.

For similarity between clusters, the minimum distance measure has been chosen. The minimum distance measure for two clusters is defined as the minimum of all distances between instances from the two clusters. This approach is used when filling the cluster distance matrix. When clusters are merged, the new cluster's distance to a cluster A is the minimum of the distances of the old clusters to cluster A, and as only two clusters are ever merged at once, this simplifies to choosing the smaller of two numbers (see the sketch after the following list).

The clustering itself is realized within a loop that iterates until no two clusters are closer than a specified threshold. The value of this threshold is very important to the quality of the clustering. A single iteration of the loop consists of the following steps:

• Find the nearest two clusters. If their distance is above the threshold, end the loop.

• Create a new cluster containing instances from the two clusters.

• Calculate the distances to the other clusters. Go to the first step.
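The single-linkage update of the cluster distance matrix after a merge can be sketched as follows. The structures are hypothetical: cluster_distance stands in for the two-dimensional dictionary and is assumed to hold a value for both key orders:

def update_distances_after_merge(a, b, merged, clusters, cluster_distance):
    # after clusters a and b are merged, the single-linkage distance
    # from the merged cluster to any other cluster c is simply the
    # smaller of the two previously cached distances
    for c in clusters:
        if c is a or c is b:
            continue
        d = min(cluster_distance[(a, c)], cluster_distance[(b, c)])
        cluster_distance[(merged, c)] = d
        cluster_distance[(c, merged)] = d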

Incremental clustering is supported by performing the same instance processing as during the initial clustering and then executing the clustering loop. If the instance lies close enough to an existing cluster, merging will take place; otherwise the instance's cluster will continue to exist.

The client allows email messages to be permanently removed. This event is announced to the clustering algorithm, and clean-up operations are performed. The instance is removed from the instances list and from the cluster that contains it. The cluster's distance from the other clusters must then be updated. This operation is computationally far more costly than a cluster merge, as the modified cluster's instances must be compared with the instances from all other clusters.


4.2 Email messages

To be able to work with emails and extract their content, the framework must be able to parse them. Although the basic functionality to do so is relatively easy to implement, a full-featured, convenient library that provides all the necessary functions is a long-term effort.

Thankfully, when integrated, the framework can make use of the library used in eM Client to work with email messages.

For testing purposes this library might be used independently of eM Client.

4.3 Data instance features

The features used as a basis for the distance calculation are of the same, if not higher, importance as the clustering algorithm itself. A large number of features can be created from email message properties. During the analysis of what is important when deciding whether items discuss the same topic, several obvious choices came up. We have also drawn from other research, mainly Cselle's work [3]. Each of the properties also has a specific method of comparison and of calculation of the numeric distance value. We devised the calculation of each feature to express similarity on the interval from 0 to 1. The following features participate in the distance calculation:

• Most email clients add headers for unique message identification, and, for items that are replies, a header containing the identification string of the message being replied to. The unique identification header is named Message-Id, and the header in reply messages is In-Reply-To. The value of these headers has no specific meaning and is usually randomly generated. Its only use is as an operand in string comparison to determine whether messages are related. It is probable that related messages will discuss the same topic. The numeric feature value from these properties is 1 if one mail is a reply to the other (or vice versa) and 0 if there is no relationship between the messages.

• Based on the same headers as above, we check whether two mails are replies to the same email. This would be a typical case when multiple users discuss a topic. When the mails share the In-Reply-To value, this feature is 1, otherwise it is 0.

• The sender also bears an informational value. It is stored in the From header of an email. When the senders of the compared mails are identical, this feature is 1.

• The recipient sets (present in the header field To) of the email messages are compared to determine how many recipients the emails share. The size of the intersecting set relative to the size of the union of both recipient sets is calculated.

• Each email also has an origination date in the Date header. Emails relating to the same topic will often be near each other on a time line. We have decided to normalize the time distance to an interval of two months. The feature value calculation computes the time difference in hours divided by the total number of hours in two months and performs normalization and inversion, so that a value of 1 represents items not differing in time and a value of 0 items that are more than two months apart.

24 CHAPTER 4. DESIGN

• Subject similarity is a very important feature. The subjects of both messages are tokenized, and the number of tokens present in both subjects is divided by the total number of tokens in both texts.

• The final, most important feature is the text similarity. Our distance function retrieves the text similarity from the text analysis component, which is described in the following section. The value is normalized to the interval 0 to 1.

The listed features are combined in a distance function as described in section 4.5.
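For illustration, a few of the feature values above might be computed as in the following Python sketch; the mail objects and their fields (message_id, in_reply_to, to, date, subject) are hypothetical stand-ins for the client's message model:

def reply_relation(mail_a, mail_b):
    # 1 if one mail is a reply to the other, 0 otherwise
    related = (mail_a.in_reply_to == mail_b.message_id
               or mail_b.in_reply_to == mail_a.message_id)
    return 1.0 if related else 0.0

def recipient_overlap(mail_a, mail_b):
    # size of the recipient intersection relative to the union
    a, b = set(mail_a.to), set(mail_b.to)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def date_proximity(mail_a, mail_b, window_hours=2 * 30 * 24):
    # time difference in hours, normalized and inverted over a
    # two-month window: 1 = identical time, 0 = two months or more apart
    delta_hours = abs((mail_a.date - mail_b.date).total_seconds()) / 3600.0
    return max(0.0, 1.0 - delta_hours / window_hours)

def subject_similarity(mail_a, mail_b):
    # tokens shared by both subjects relative to all subject tokens
    a = mail_a.subject.lower().split()
    b = mail_b.subject.lower().split()
    if not a and not b:
        return 0.0
    return len(set(a) & set(b)) / (len(a) + len(b))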

4.4 Text analysis

Performing a quality text analysis and a subsequent similarity calculation should bring the highest benefit to the informational value of the distance function. The operation of the text analysis component can be divided into tokenization, term ranking and similarity calculation, as described in subsection 3.2.5.

Tokenization in our framework simply splits words separated by any non-letter character. Regular expressions are used to achieve this, and although not necessary, this will allow the tokenizer to be improved in the future. The regular expression used is "(\S+)\s".

Before describing the other two steps, it is necessary to list the data structures that these steps use:

• Terms is a dictionary of all terms in all documents; each term has an associated value defining how many documents the term is present in. This structure is updated when a new instance is processed by the distance function. The occurrence count value is used when calculating the IDF coefficient.

• CachedIDFs is a dictionary used to cache the Inverse Document Frequency values of terms, to avoid calculating them every time a document similarity is computed.

• Documents is also a dictionary; this one contains information about the Term Frequency of the terms contained in a data instance - a document. Each document has an associated dictionary - TermTFs - of the terms within that document. This nested dictionary stores a Term Frequency value for each term.

Term ranking iterates through all the terms returned from the tokenization and updates the necessary items in the structures. The Terms dictionary is checked for the presence of each term - if it is present, the occurrence count is incremented, otherwise the term is added with a value of 1. The cached IDF value for the term must be updated in CachedIDFs to reflect that the occurrence value has changed. The last update concerns Documents, where the new document entry must be added and every processed term inserted into the inner dictionary.
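A condensed Python sketch of these updates, under the assumption that terms, cached_idfs and documents mirror the three structures above and that total_documents already counts the new document:

import math
from collections import Counter

def process_document(doc_id, tokens, terms, cached_idfs, documents,
                     total_documents):
    counts = Counter(tokens)
    # store the Term Frequency values for the new document
    documents[doc_id] = {t: n / len(tokens) for t, n in counts.items()}
    for term in counts:
        # increment the term's document occurrence count
        terms[term] = terms.get(term, 0) + 1
        # refresh the cached IDF value, since the occurrence count changed
        cached_idfs[term] = math.log(total_documents / terms[term])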

As can be seen from the data structures used, the final TF-IDF score isn't stored anywhere. This is because a new document addition would require recalculating the score for every contained term in every document that contains the same term, which would be highly


inefficient.

When a text similarity has to be calculated, both documents' entries are retrieved from the Documents dictionary. The term lists from both documents are enumerated in sequence. For each term, the TF-IDF is calculated from the values in CachedIDFs and TermTFs, and the variables holding the values for the cosine measure numerator and denominator are updated with the TF-IDF value.

4.5 Distance function

The distance function combines all the intermediate similarity values to form one resulting distance value. A linear combination of the intermediate values is used, and its coefficients are the feature weights. The value then needs to be inverted to become a distance. As each feature value is normalized to a known interval, we may calculate the maximal achievable value and subtract the sum of the current values. The final step is to normalize the resulting distance, which is achieved by dividing by the maximal achievable value. The complete formula has the following form:

\[
\mathrm{distance} = \frac{\sum_{m} m_{\mathrm{max}} - \sum_{m} m_{\mathrm{val}}}{\sum_{m} m_{\mathrm{max}}},
\]

where $m$ enumerates over all features, $m_{\mathrm{max}}$ is the maximal value of feature $m$ and $m_{\mathrm{val}}$ is the value of feature $m$ for the current document.
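In code, the combination amounts to a few lines (a sketch; it assumes each feature's weight is already folded into its value and maximum, and the array inputs are illustrative):

// Combines per-feature similarity values into one normalized distance.
public static double CombineFeatures(double[] featureValues, double[] featureMaxima)
{
    double sumMax = 0, sumVal = 0;
    for (int i = 0; i < featureValues.Length; i++)
    {
        sumMax += featureMaxima[i];   // maximal achievable contribution
        sumVal += featureValues[i];   // actual contribution of feature i
    }
    // Invert similarity into distance and normalize to [0, 1].
    return (sumMax - sumVal) / sumMax;
}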

4.6 Database backend

The data used by the components from this chapter needs to be stored in a persistent location. There are high demands on the performance of the persistent store, so a suitable database engine needs to be used. eM Client uses the relational database engine SQLite, which is very lightweight and provides good performance. For compatibility reasons and based on previous positive experience, we have chosen to use SQLite in the clustering framework as well.

eM Client also employs a typical two-layer model for objects, with storage objects directly representing the database data and application objects operating above the storage objects with higher-level logic. The storage objects are further managed by repositories. We have closely mimicked this architecture when designing the persistence classes for this project.

There are several types of application objects, together with collections for enumerating these objects. The objects represent clusters, data instances, document information (containing the list of a document's terms) and terms. The collections are a list, a dictionary and a two-key dictionary.

The detailed form of the model is described in section 5.3 and the database schema is attached in Appendix A.

4.7 User interface

The interface we will add to eM Client needs to fit in well with the other parts of the software. The client contains a panel on the right side of the main window, where the contact list, communication history and attachment history are placed. The side bar is a non-intrusive, yet handy location, and our functionality is similar to the other components on the panel - it makes the user's work with email easier. Each component in the side bar is represented by a box control that needs to be implemented.

The visual layout of the control should be separated into two areas - a list containing the existing clusters and an area containing details about each cluster. The clustering interface should actively react when the user selects an email and look up the cluster to which the email belongs. This way the user will always be presented with the clustering context of the email. When a cluster is selected, either from the list or by being looked up when an email is selected, it should be very easy to filter just the emails from that cluster. The interface should hide if the currently selected folder isn't being clustered.

4.8 Integration layer

To interconnect the clustering framework with eM Client, an integration layer needs to be created. For such relatively standalone functionality, singleton managing classes are used in eM Client. These classes are initialized at startup and autonomously perform the necessary operations, based on events from application classes, collections or other sources.

The class will need to carry out the following tasks:

• Manage a background thread in which all clustering operations are invoked. All events must be delegated to this thread.

• Act as a data source for the clustering algorithm. This consists of enumerating items from the client's folder, creating matching data instance objects, monitoring additions to and removals from the clustered folder, and announcing these events in transformed form to the clustering framework.

• Execute initialization tasks when clustering is launched for the first time, and perform actions based on the user's choices.

Chapter 5

Implementation

Based on the design from the previous chapter, we have implemented the clustering framework and the integration code, and integrated the framework into the client.

The platform used for implementation is the Microsoft .NET Framework. To maintain compatibility with eM Client and its requirements, we use version 2.0 of the .NET Framework. The development environment was Microsoft Visual Studio 2008.

When designing the object model of the project, we made heavy use of interfaces to ensure that individual components do not depend on a specific implementation and that any part of the process may easily be swapped for a different implementation of the interface.

The project's architecture can be divided into several blocks, which we describe in the following sections. These sections are concerned only with the general architecture consisting of interfaces and with the classes that are used by the framework when integrated with eM Client. There are also many classes that were used only when working with the framework as a standalone application. These classes either implement the described interfaces in a different manner (e.g. to access email files from a disk folder) or provide other helper functionality. Readers interested in these classes can find them on the attached CD.

5.1 Clustering

This part contains all infrastructure related to the clustering itself. The elementary item interfaces for clusters and data instances are ICluster and IInstance. The clustering algorithm is represented by IClusterer, which depends on a source of instances defined by IDataSource. The algorithm also needs a metric that calculates the instance similarity; this metric is IDistance.

The ICluster interface contains the property Identifier for textual cluster identification and the property Instances, which is a dictionary of instances paired with a membership degree. In the current implementation, where hierarchical clustering is used, the degree of membership is always 1. IInstance has only one string property, Identifier, that provides a string for textual identification; no other, more specific members exist on the generic instance. The clustering algorithm doesn't need to access any instance properties, since it lets IDistance do the evaluation, and it is expected that IDistance will be implemented to work with a concrete IInstance implementation.
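Reconstructed from this description, the core interfaces might look roughly like this (member signatures are our assumptions, not the project's exact declarations):

using System.Collections.Generic;

public interface IInstance
{
    // The only member: a string for textual identification.
    string Identifier { get; }
}

public interface ICluster
{
    string Identifier { get; }
    // Instances paired with a membership degree; always 1 in the
    // current hierarchical implementation.
    IDictionary<IInstance, double> Instances { get; }
}

public interface IDistance
{
    // Implementations are expected to work with a concrete IInstance type.
    double Distance(IInstance a, IInstance b);
}

public interface IDataSource
{
    IEnumerable<IInstance> Instances { get; }
}

public interface IClusterer
{
    void Cluster(IDataSource source, IDistance distance);
    IList<ICluster> Clusters { get; }
}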


Since we support online clustering throughout the framework, but do not necessarily require the data source to be of an online nature, the members specific to an online data source have been separated into the IDataSource descendant IOnlineDataSource.

Our implementation of IClusterer is the HierarchicalClusterer class. In section 4.1 we have described several data structures that this algorithm uses. To separate the implementation of the data structures - we used in-memory structures during testing, while database-backed structures are implemented in the final version - an interface IHierarchicalClustererContext is used by the hierarchical clusterer class to access its data.

5.2 Mail clustering

To support clustering of emails, several of the above interfaces had to be implemented to perform specific operations. Several other classes and interfaces had to be added in order to structure the code correctly.

The interface IMailInstance has been created as a descendant of IInstance to provide access to the email message that the instance represents. The email is exposed through the member Mail.

IDistance was implemented in MailDistance to perform distance calculations as described earlier. However, since text clustering is a more complex process, MailDistance delegates all operations related to document clustering to MailTextAnalyzer.

The class MailTextAnalyzer performs processing of new mail documents and the text similarity calculation. This requires data structures filled by the analysis and used afterwards for the calculations. The same approach as with hierarchical clustering has been used, and an interface named IMailTextAnalyzerContext, hiding the storage implementation from the analyzer, was created.


Figure 5.1: Architecture of the core clustering framework and associated classes for mail clustering.


5.3 Storage

Storage classes provide persistence to other parts of the framework. The storage classes operate over a relational database, but completely shield the rest of the project from having to manage the database.

The persistence is exposed through the implementations MailHierarchicalClustererContext and MailTextAnalyzerContext of the context interfaces IHierarchicalClustererContext and IMailTextAnalyzerContext. These implementations then access the application objects described below.

Database storage is separated into two levels:

• The data level directly cooperates with the relational database; it executes SQL queries for inserting, updating, removing or listing items. Each type of item, such as a cluster or a mail data instance, has its own interface inherited from IRepositoryItem. Data objects implementing this interface are then used for in-memory storage of the item's properties. Each type of item is managed by a central repository class inherited from the base class DbRepository. Note that the class DbRepository has been taken from eM Client's storage layer. The repository takes care of all database interaction and manages a cache of storage objects. A specialized abstract class, DbDoubleKeyRepository, was implemented to serve as a base for repositories that are encapsulated by a two-key dictionary collection on the application level.

• The application level provides an abstraction layer over the database. Item application classes encapsulate the data storage items and trigger modification operations in the repository. Collection classes implement standard platform operations for enumeration, addition and removal of items. The matrices that are used in IHierarchicalClustererContext are implemented as descendants of the class DoubleKeyDictionary.

5.4 User interface

The user interface in the client has been implemented according to section 4.7. The class ControlSidebarBoxClusters is a descendant of a control that supports embedding in the sidebar.

The inherited class contains a datagrid control that displays the list of clusters and a panel where information about the current cluster is displayed. These two controls change their visibility when the folder selection changes and are visible only if the current folder is being clustered. Information in the detail panel is refreshed either when the user selects a cluster from the list or when a mail is selected in the main area of eM Client.

The user interface also allows filtering messages from a cluster. When the user double-clicks an item in the cluster list or clicks the "Filter cluster items" button in the detail panel, filtering is enabled in the messages area of the application and only the messages from the current cluster are visible.

Integration with other user interface components of the client is implemented in several methods of the application's main form - formMain.


Other interaction with the user, such as confirming actions of the clustering framework, is implemented through message boxes in the class that manages clustering, which is described in the following section.

Figure 5.2: Fragment of eM Client's main form showing the look of ControlSidebarBoxClusters.


5.5 Integration layer

To interconnect the clustering framework with the client, a class controlling the clustering and serving as a bridge for data in both directions needs to exist. This class is named ClusteringManager and it follows the Singleton design pattern. Upon eM Client startup, an initialization method of this class is executed. The initialization sets up all components of the clustering framework, including the database-backed contexts for the hierarchical clusterer and the text analysis.

A background thread is also created during the initialization. All actions such as initial clustering, mail item addition or removal are delegated to this thread and passed to the clusterer class. Therefore lengthy clustering operations do not block the execution of the threads that originated the events. This thread also has background priority to avoid slowing down the program's operation.
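A stripped-down sketch of such a delegation queue, using only .NET 2.0 primitives (the actual ClusteringManager is considerably richer; class and member names are ours):

using System.Collections.Generic;
using System.Threading;

class ClusteringWorkQueue
{
    private readonly Queue<ThreadStart> queue = new Queue<ThreadStart>();
    private readonly object sync = new object();

    public ClusteringWorkQueue()
    {
        Thread worker = new Thread(Run);
        worker.IsBackground = true;               // do not keep the process alive
        worker.Priority = ThreadPriority.Lowest;  // avoid slowing the application down
        worker.Start();
    }

    // Called from event handlers on arbitrary threads.
    public void Enqueue(ThreadStart action)
    {
        lock (sync)
        {
            queue.Enqueue(action);
            Monitor.Pulse(sync);
        }
    }

    private void Run()
    {
        while (true)
        {
            ThreadStart action;
            lock (sync)
            {
                while (queue.Count == 0)
                    Monitor.Wait(sync);
                action = queue.Dequeue();
            }
            action();  // e.g. initial clustering, item addition or removal
        }
    }
}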

ClusteringManager implements IOnlineDataSource to provide instances to the clusterer. Mail items in the clustered folder are enumerated and a corresponding object based on IMailInstance is created for each item. The folder's events are monitored and trigger the analogous events of the online data source interface.

To allow retrieving the mails matching data instances, another interface, IMailLoader, is used by the data instance objects. This interface is used to look up the mail item based on its identifier stored in the instance object, whenever the instance object needs to access its email and does not already have a reference to the email item.
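In outline, the lazy lookup could work as follows (type and member names are stand-ins where the text does not name them):

// Hypothetical stand-in for eM Client's mail item type.
class MailItem { /* subject, body, headers, ... */ }

interface IMailLoader
{
    MailItem LoadMail(string mailIdentifier);
}

class MailInstance
{
    private readonly string mailIdentifier;  // stored identifier of the mail
    private readonly IMailLoader loader;
    private MailItem mail;                   // cached reference; null until needed

    public MailInstance(string mailIdentifier, IMailLoader loader)
    {
        this.mailIdentifier = mailIdentifier;
        this.loader = loader;
    }

    public MailItem Mail
    {
        get
        {
            // Look the email up only on first access.
            if (mail == null)
                mail = loader.LoadMail(mailIdentifier);
            return mail;
        }
    }
}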

The last role of ClusteringManager is to expose the list of clusters from the framework. This is achieved by a property that returns the cluster list present in the hierarchical clusterer context interface.

Chapter 6

Testing

During and after the project's implementation, we have tested the functionality on a dataset representing a portion of the author's inbox. The performance of the clustering has been analyzed with regard to memory and time. Quality has been analyzed in two ways: by subjective evaluation and using the cophenetic correlation coefficient.

6.1 Clustering quality

6.1.1 Subjective evaluation

First, a subjective evaluation of the clusters' quality was carried out - judging whether they correspond to real topics of the processed emails and whether the aggregation eases orientation in the sample dataset. The result of this review was positive - emails were grouped into convenient clusters that distinguished individual topics. To illustrate this, we have extracted a dendrogram of the sample dataset:


Figure 6.1: Dendrogram of the sample dataset. The red line denotes the maximal distance of clusters to be merged; colored boxes encapsulate the multi-item clusters presented to the user.

As can be seen, 8 clusters containing more than one email were created, and each represents a compact topic corresponding to how the author thinks of the contained emails. To let the reader judge for himself, the emails' subjects matched to the leaves of the dendrogram can be found in Appendix B. Emails that were not merged with other items are also represented as clusters in the framework, but will not be shown to the user in the future.


6.1.2 Objective evaluation

To perform an objective evaluation of the clustering, we had to select one of the known quality measures. Since the testing dataset wasn't manually processed to assign a target cluster to each data instance, we have to restrict our choice to internal quality measures. One such measure, aimed at hierarchical clustering, is the cophenetic correlation coefficient.

This coefficient tells us how well the clustering represents the instances' relations based on their distance. Two sets of values are compared: the instance distance matrix and the distances of individual clusters in the dendrogram. The calculation of the coefficient was done in MathWorks MATLAB, which contains the necessary functions. The coefficient value may lie in an interval from 0 to 1, where 1 indicates an absolute correlation of the clusters' distances with the instances' distances.

We have performed several experiments on the same dataset as in the previous section. There were two elementary properties whose impact on the clustering we have tested: the distance function formula (section 4.5) and the cluster distance metric of the hierarchical clustering (subsubsection 3.2.1.2).

The function that calculates the final distance between instances may have many forms; our current implementation uses a weighted linear combination, but there are two other very popular choices - Euclidean distance and the cosine measure. Since the cosine measure requires the component values of instance vectors, which is not compatible with the way our clustering works, only the Euclidean distance remains. We have modified the distance function and ran the clustering for both the standard and the alternative formula. The resulting distance matrices were extracted and used in the testing described below.
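For reference, one plausible form of the Euclidean alternative - an assumption on our part, since the text does not write the formula out - keeps the per-feature terms of section 4.5 and combines them quadratically:

\[
\mathrm{distance} = \sqrt{\sum_{m} \left( m_{\mathrm{max}} - m_{\mathrm{val}} \right)^{2}},
\]

with $m$, $m_{\mathrm{max}}$ and $m_{\mathrm{val}}$ as in section 4.5.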

The second property we modified is the cluster distance metric, which has a major impact on the shape of the clusters. The available possibilities are single linkage (which is used in our implementation), average linkage and complete linkage.

Using the extracted instance distance matrices for each distance function and the different cluster metrics, we have performed a series of calculations of the cophenetic correlation coefficient in MATLAB and compiled the following comparison table:

            Euclidean   Linear comb.
Single      0.965       0.960
Average     0.974       0.963
Complete    0.958       0.934

The data show that the Euclidean instance distance measurement performs slightly better than the linear combination, which had been implemented first. Therefore we have modified the framework to use the Euclidean distance by default.

The cluster metrics indicate that average and single linkage perform better for our clustering task, but single linkage has been kept in use for now.

The differences between the values are quite subtle, and therefore further evaluation on several larger datasets will be needed to obtain more confident results.


6.2 Data structures optimization

The data structures used in the context interfaces (see section 5.3) are linked with a database, which means that every operation such as modification, enumeration or lookup performs an SQL query. While this is not a problem for lists such as Clusters or Instances, where modifications and enumerations are performed at a sustainable rate, it becomes an issue for the two-dimensional dictionaries used to store the instance distance and cluster distance matrices, which are heavily utilized during the clustering process (their utilization corresponds to the algorithm's complexity - n^2). We have attempted to improve the situation, at least for the read operation, by implementing a cache. An inner, in-memory two-dimensional dictionary has been added, and once a value for a key pair is modified or retrieved, it is stored in the cached dictionary. All subsequent accesses are served from the cache.
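A simplified read-through/write-through wrapper illustrating the idea (interface and class names here are ours, not the project's):

using System.Collections.Generic;

interface IDoubleKeyStore
{
    double Load(string key1, string key2);   // one SQL SELECT
    void Save(string key1, string key2, double value);
}

class CachedDistanceMatrix
{
    private readonly IDoubleKeyStore store;  // database-backed two-key dictionary
    private readonly Dictionary<string, Dictionary<string, double>> cache =
        new Dictionary<string, Dictionary<string, double>>();

    public CachedDistanceMatrix(IDoubleKeyStore store)
    {
        this.store = store;
    }

    public double Get(string key1, string key2)
    {
        Dictionary<string, double> row;
        double value;
        if (cache.TryGetValue(key1, out row) && row.TryGetValue(key2, out value))
            return value;                    // served from memory, no SQL query

        value = store.Load(key1, key2);      // hits the database on first access
        Put(key1, key2, value);
        return value;
    }

    public void Set(string key1, string key2, double value)
    {
        store.Save(key1, key2, value);       // write through to the database
        Put(key1, key2, value);              // keep the cache consistent
    }

    private void Put(string key1, string key2, double value)
    {
        Dictionary<string, double> row;
        if (!cache.TryGetValue(key1, out row))
        {
            row = new Dictionary<string, double>();
            cache[key1] = row;
        }
        row[key2] = value;
    }
}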

Another substantial optimization to be implemented in the future is taking advantage of the symmetric nature of the distance function. This would allow the size of the distance matrices to be reduced by more than a half.
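One straightforward way to realize this (our suggestion, not the thesis design): order the two keys canonically before every read and write, so that the pairs (a, b) and (b, a) map to a single stored cell:

// Ensures key1 <= key2 so each unordered pair occupies one matrix entry.
static void CanonicalizeKeys(ref string key1, ref string key2)
{
    if (string.CompareOrdinal(key1, key2) > 0)
    {
        string tmp = key1;
        key1 = key2;
        key2 = tmp;
    }
}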

6.3 Performance

The hierarchical clusterer's greatest disadvantage is its high resource consumption. We have described an optimization to reduce disk access, which dramatically improves the speed of the clustering. Measurements have shown an improvement of 30

To get a notion of the memory consumption of the whole clustering framework, we have used memory allocation monitoring on two versions of eM Client - with and without the clustering implemented. This is not a byte-precise approach, due to the nature of the .NET Framework, but it is sufficient for our needs. We have executed initial clustering in the first version on a dataset with 70 items. The resulting difference was close to 6 megabytes of allocated memory.

The overall speed and memory performance wasn't found to be up to the author's expectations and will be subject to further improvements. Luckily, there is large room for enhancement, and thus getting the resource consumption to product-grade standards will be achievable.

Chapter 7

Conclusion

The author has decided to realize an email clustering framework as an approach to evolving the three-decades-old model of working with email. The motivation comes from the necessity to cope with the contemporary problem of email overload. This approach is very different from well-known takes on the issue. It does not require the user's interaction, but is able to continually provide an alternative organization of the inbox that will likely correspond to how the user thinks about the clustered emails. It is also non-intrusive, and therefore has a high chance of successful adoption by users.

In this work we went through the relevant concepts, algorithms and formulas that power the clustering framework. After choosing suitable techniques, we have described their design and the individual features they exhibit. Afterwards, the implementation of the whole framework has been described, along with the reasoning for the decisions that were made. The implementation was realized with the aim of allowing integration into an existing email client and with an emphasis on the possibility of future expansion and improvement of the functionality. Furthermore, a convenient user interface that suits the target software's environment has been devised.

Finally, testing and performance evaluation yielded relevant information, which led to several optimizations and will be used as a guide in future work on improving performance.

We can conclude that we have succeeded in realizing the goals laid out at the beginning of our work.

7.1 Future work

Although a fair amount of effort has gone into the project so far, much more is needed to get it to the state where the author would like to see it - a mature component performing advanced, high-quality clustering, with many features further enhancing this new element of working with email. Integration with other clients is also a long-term goal. The main tasks for the future evolution of this project are:

• Optimize operations with persistent data structures: batch writing using transaction scopes, intelligent caching making use of weak references, and saving space for symmetric matrices.


• Improve the clustering speed, optimize the algorithm as much as possible and add support for parallelization.

• Perform large-scale experiments with model parameters using automated procedures based on optimizing quality measures, both internal and external [12].

• Investigate possibilities of automatic model parameter adjustment, parameter profiles or user-based parameter adjustments.

• Allow user-based cluster modification - merging clusters, separating clusters, moving items between clusters.

• Investigate extending the user's perception of a cluster as a category, and a possible interconnection of rules with clusters.

• Work on the Fuzzy C-Means algorithm: compare it with the current algorithm, evaluate the advantages of fuzzy membership from both the user and the programmatic standpoint, add incremental support, explore the effect of external modifications of the clustering on the model, and more.

Bibliography

[1] New York Times. Is Information Overload a $650 Billion Drag on the Economy? http://bits.blogs.nytimes.com/2007/12/20/is-information-overload-a-650-billion-drag-on-the-economy/.

[2] Ron Bekkerman, Andrew McCallum, and Gary Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora.

[3] Gabor Cselle. Organizing email. Master’s thesis, ETH Zurich, 2006.

[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2nd edition, 2006.

[5] Eduardo J. Spinosa, Andre Ponce de Leon F. de Carvalho, and Joao Gama. An online learning technique for coping with novelty detection and concept drift in data streams.

[6] Matteo Matteucci. Clustering - Fuzzy C-Means Clustering. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html.

[7] D. S. Yeung and X. Z. Wang. Improving performance of similarity-based clustering by feature weight learning.

[8] Dunja Mladenic, Janez Brank, Marko Grobelnik, and Natasa Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models.

[9] P. Resnick. RFC 2822: Internet message format.

[10] N. Freed and N. Borenstein. RFC 2045: Multipurpose Internet Mail Extensions (MIME) part one.

[11] eM Client.http://www.emclient.com.

[12] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques.


Appendix A

SQL Schema

ATTACH DATABASE 'clusters.dat' AS clusters;
PRAGMA clusters.auto_vacuum = full;
PRAGMA clusters.page_size = 4096;
CREATE TABLE clusters."Clusters" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "latestMailOriginationDate" TIMESTAMP,
    "color" INTEGER);
CREATE TABLE clusters."ClusterInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_parentId
        REFERENCES "Clusters"("id") ON DELETE CASCADE,
    "instanceId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_instanceId
        REFERENCES "Instances"("id") ON DELETE CASCADE,
    "membershipDegree" FLOAT);
CREATE INDEX clusters."idx_Parent" ON "ClusterInstances" ("parentId");
DETACH DATABASE clusters;

ATTACH DATABASE 'instances.dat' AS instances;
PRAGMA instances.auto_vacuum = full;
PRAGMA instances.page_size = 4096;
CREATE TABLE instances."MailInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "mailIdentifier" TEXT);
DETACH DATABASE instances;

ATTACH DATABASE 'instance_distances.dat' AS instance_distances;
PRAGMA instance_distances.auto_vacuum = full;
PRAGMA instance_distances.page_size = 4096;
CREATE TABLE instance_distances."InstanceDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX instance_distances."idx_Keys" ON "InstanceDistanceMatrix" ("key1", "key2");
DETACH DATABASE instance_distances;

ATTACH DATABASE 'cluster_distances.dat' AS cluster_distances;
PRAGMA cluster_distances.auto_vacuum = full;
PRAGMA cluster_distances.page_size = 4096;
CREATE TABLE cluster_distances."ClusterDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX cluster_distances."idx_Keys" ON "ClusterDistanceMatrix" ("key1", "key2");
DETACH DATABASE cluster_distances;

ATTACH DATABASE 'terms.dat' AS terms;
PRAGMA terms.auto_vacuum = full;
PRAGMA terms.page_size = 4096;
CREATE TABLE terms."Terms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "documentOccurrences" INTEGER);
CREATE INDEX terms."idx_Keys" ON "Terms" ("key1", "key2");
DETACH DATABASE terms;

ATTACH DATABASE 'document_infos.dat' AS document_infos;
PRAGMA document_infos.auto_vacuum = full;
PRAGMA document_infos.page_size = 4096;
CREATE TABLE document_infos."DocumentInfo" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "termsCount" INTEGER);
CREATE TABLE document_infos."DocumentTerms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL CONSTRAINT fk_ClusterInstances_parentId
        REFERENCES "Clusters"("id") ON DELETE CASCADE,
    "term" TEXT,
    "tf" FLOAT);
CREATE INDEX document_infos."idx_Parent" ON "DocumentTerms" ("parentId");
DETACH DATABASE document_infos;

ATTACH DATABASE 'idfs.dat' AS idfs;
PRAGMA idfs.auto_vacuum = full;
PRAGMA idfs.page_size = 4096;
CREATE TABLE idfs."CachedIDFs" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX idfs."idx_Keys" ON "CachedIDFs" ("key1", "key2");
DETACH DATABASE idfs;


Appendix B

Clustering Illustrations

Figure B.1: Dendrogram of the sample dataset with the subject of each email.

Appendix C

CD directory structure

The attached CD has the following directory structure:

\bin        - Directory containing an executable copy of eM Client with
              clustering functionality (the executable is "eM Client.exe")
\clustering - Contains the source code of the clustering framework
              (the project file is Clustering.csproj)
\text       - Contains this text (the PDF is grafnl.pdf, the LaTeX file grafnl1.tex)
